AGI Benchmark Bombshell: Top AI Models Score Below 0.4%, Contradicting Industry Claims
A new benchmark designed to measure artificial general intelligence (AGI) has delivered a stark reality check, revealing that even the most advanced AI models perform at a tiny fraction of human capability. The ARC-AGI-3 evaluation, released the same week Nvidia CEO Jensen Huang declared AGI achieved, shows Google's Gemini scoring just 0.37% and OpenAI's GPT-5.4 achieving a mere 0.26%. By comparison, human performance on the same test is pegged at 100%, highlighting a vast and measurable gap.
The ARC-AGI-3 benchmark is specifically crafted to test reasoning on novel problems, a core challenge for achieving true AGI. The near-zero scores for leading models directly contradict the timeline and capability assertions made by prominent industry leaders. This exposes a significant tension between corporate narratives, crafted for investors and the public, and the empirical results of independent evaluation frameworks.
The findings place intense scrutiny on the definition of AGI itself and on the validity of claims about its imminent arrival. For the AI lab sector, this benchmark acts as a critical pressure point, forcing a more rigorous and transparent conversation about progress. It signals that, despite rapid advances on narrow tasks, the fundamental leap to human-like reasoning and generalization remains a distant, unclaimed frontier, raising serious questions about resource allocation and public expectations.