What is “Humanity’s Last Exam”?
In January 2025, a benchmark test unlike any before was introduced: Humanity’s Last Exam (HLE) – a comprehensive test consisting of 2,500 questions, developed by around 1,000 experts from over 100 academic disciplines.
Its ambition: to provide a holistic assessment of the intellectual capabilities of AI systems – going far beyond mere language processing.
HLE combines:
- Questions from the humanities, natural sciences, mathematics, engineering, arts, and more
- Tasks requiring logical reasoning, transfer of knowledge, and abstraction
- Multimodal content (text + images) to simulate real-world human testing environments
In short: It’s not just a multiple-choice test – it’s an intellectual stress test for machines.
Where do we stand today?
With the launch of Grok 4, the new AI model from xAI, Elon Musk's AI company, HLE was used publicly as a benchmark for the first time. The results are impressive – but they also highlight the limitations.
| System | Without tools | With tools (web/coding) |
|---|---|---|
| Grok 4 | 25.4 % | 38.6 % |
| Grok 4 Heavy | – | 44.4 % |
| Gemini-Pro (Google) | – | 26.9 % |
| OpenAI o3 | – | 24.9 % |
Although Grok 4 Heavy is currently in the lead, we are still far from achieving a passing score – for comparison: in a real-world exam, 50% would likely be considered the minimum threshold for basic competence.
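The comparison above can be made concrete with a short sketch. Note that the 50 % "passing" threshold is this article's own assumption, not an official HLE cutoff, and the scores are simply the with-tools results quoted in the table:

```python
# Scores (with tools) as quoted in the article; the 50% threshold is
# the article's hypothetical passing mark, not an official HLE cutoff.
SCORES_WITH_TOOLS = {
    "Grok 4": 38.6,
    "Grok 4 Heavy": 44.4,
    "Gemini-Pro": 26.9,
    "OpenAI o3": 24.9,
}
PASSING_THRESHOLD = 50.0

def gap_to_passing(scores, threshold=PASSING_THRESHOLD):
    """Return how many percentage points each system falls short of passing."""
    return {name: round(threshold - score, 1) for name, score in scores.items()}

for name, gap in gap_to_passing(SCORES_WITH_TOOLS).items():
    print(f"{name}: {gap} points below the threshold")
```

Even Grok 4 Heavy, the current leader, remains 5.6 percentage points short of that hypothetical passing mark.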
It remains unclear whether Grok 4’s results will be officially listed on the HLE leaderboard – xAI has not yet published them, or they are still under review.
What Does This Mean for Humanity?
HLE is not just a test for AI – it’s a mirror for humanity.
1. The Definition of Intelligence Is Shifting
What does it mean to “understand” something when a model like Grok 4 can solve individual tasks but often fails to grasp the broader context?
Can machines ever truly “understand” – or are they merely imitating?
2. Multimodality as the Key
Humans process images, language, emotion, and logic simultaneously.
To truly keep up, AI must master this “mix of data types” – and HLE puts exactly that to the test.
3. Education & Benchmarking
The more we teach machines, the more we must ask ourselves:
What makes us unique – and how do we stay relevant?
Perhaps the answer doesn’t lie in factual knowledge, but in judgment, ethics, creativity, and empathy.
What Comes Next?
The name Humanity’s Last Exam is intentionally provocative.
Maybe HLE isn’t our last exam – but rather our last lead, before AIs surpass us in traditional measures of intelligence.
But the real test is still ahead:
- How do we shape a world where machines may become (partially) smarter than us?
- How do we use these systems as partners, not as competitors?
- How do we protect human values in an automated future?
Conclusion: Humanity’s Last Exam
Humanity’s Last Exam is a milestone – not just for AI, but for all of us.
It shows us what machines can – and still can’t – do.
And most importantly: it reminds us of what we must never lose – our ability to take responsibility for what we create.
References
- New Grok 4 Takes on ‘Humanity’s Last Exam’ as the AI Race Heats Up: https://www.scientificamerican.com/article/elon-musks-new-grok-4-takes-on-humanitys-last-exam-as-the-ai-race-heats-up/
- Humanity’s Last Exam: https://agi.safe.ai/