What is “Humanity’s Last Exam”?
In January 2025, a benchmark test unlike any before was introduced: Humanity’s Last Exam (HLE) – a comprehensive test consisting of 2,500 questions, developed by around 1,000 experts from over 100 academic disciplines.
Its ambition: to provide a holistic assessment of the intellectual capabilities of AI systems – going far beyond mere language processing.
HLE combines:
- Questions from the humanities, natural sciences, mathematics, engineering, arts, and more
- Tasks requiring logical reasoning, transfer of knowledge, and abstraction
- Multimodal content (text + images) to simulate real-world human testing environments
In short: It’s not just a multiple-choice test – it’s an intellectual stress test for machines.
Where do we stand today?
With the launch of Grok 4, the new AI model from xAI, Elon Musk's AI company, HLE was used publicly as a benchmark for the first time. The results are impressive – but they also highlight the limitations.
| System | Without tools | With tools (web/coding) |
|---|---|---|
| Grok 4 | 25.4 % | 38.6 % |
| Grok 4 Heavy | – | 44.4 % |
| Gemini-Pro (Google) | – | 26.9 % |
| OpenAI o3 | – | 24.9 % |
Although Grok 4 Heavy is currently in the lead, we are still far from achieving a passing score – for comparison: in a real-world exam, 50% would likely be considered the minimum threshold for basic competence.
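The comparison above can be made concrete with a short sketch. Note that the 50 % "passing" threshold is this article's own assumption, not an official HLE cutoff, and the scores are simply the with-tools results quoted in the table:

```python
# Scores (with tools) as quoted in the article; the 50% threshold is
# the article's hypothetical passing mark, not an official HLE cutoff.
SCORES_WITH_TOOLS = {
    "Grok 4": 38.6,
    "Grok 4 Heavy": 44.4,
    "Gemini-Pro": 26.9,
    "OpenAI o3": 24.9,
}
PASSING_THRESHOLD = 50.0

def gap_to_passing(scores, threshold=PASSING_THRESHOLD):
    """Return how many percentage points each system falls short of passing."""
    return {name: round(threshold - score, 1) for name, score in scores.items()}

for name, gap in gap_to_passing(SCORES_WITH_TOOLS).items():
    print(f"{name}: {gap} points below the threshold")
```

Even Grok 4 Heavy, the current leader, remains 5.6 percentage points short of that hypothetical passing mark.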
It remains unclear whether Grok 4’s results will be officially listed on the HLE leaderboard – xAI has not yet published them, or they are still under review.
What Does This Mean for Humanity?
HLE is not just a test for AI – it’s a mirror for humanity.
1. The Definition of Intelligence Is Shifting
What does it mean to “understand” something when a model like Grok 4 can solve individual tasks but often fails to grasp the broader context?
Can machines ever truly “understand” – or are they merely imitating?
2. Multimodality as the Key
Humans process images, language, emotion, and logic simultaneously.
To truly keep up, AI must master this “mix of data types” – and HLE puts exactly that to the test.
3. Education & Benchmarking
The more we teach machines, the more we must ask ourselves:
What makes us unique – and how do we stay relevant?
Perhaps the answer doesn’t lie in factual knowledge, but in judgment, ethics, creativity, and empathy.
What Comes Next?
The name Humanity’s Last Exam is intentionally provocative.
Maybe HLE isn’t our last exam – but rather our last lead, before AIs surpass us in traditional measures of intelligence.
But the real test is still ahead:
- How do we shape a world where machines may become (partially) smarter than us?
- How do we use these systems as partners, not as competitors?
- How do we protect human values in an automated future?
Conclusion: Humanity’s Last Exam
Humanity’s Last Exam is a milestone – not just for AI, but for all of us.
It shows us what machines can – and still can’t – do.
And most importantly: it reminds us of what we must never lose – our ability to take responsibility for what we create.
References
- New Grok 4 Takes on ‘Humanity’s Last Exam’ as the AI Race Heats Up: https://www.scientificamerican.com/article/elon-musks-new-grok-4-takes-on-humanitys-last-exam-as-the-ai-race-heats-up/
- Humanity’s Last Exam: https://agi.safe.ai/