ARC-AGI-3: Why Frontier AI Just Scored 0.26%
01 What Happened
Last week, François Chollet dropped ARC-AGI-3. The headline number is brutal: humans score 100%. The best frontier models — GPT-5.4, Gemini 3.1 Pro — score 0.26%.
Not 26%. Zero point two six.
That gap is not a fluke, and the benchmark is not just "slightly too hard." It's measuring something current AI fundamentally cannot do.
02 The ARC-AGI Lineage
Chollet has been running this game for a while:
- ARC-AGI-1 (2019): visual pattern reasoning on a grid. Seemed trivial, yet GPT-4 scored under 5% at release. Eventually pushed the field toward reasoning models (o1, R1).
- ARC-AGI-2 (2025): harder abstract reasoning. Same story. Drove the coding agent wave.
- ARC-AGI-3 (2026): fully interactive. Different beast entirely.
The pattern is consistent: each benchmark has flagged the next bottleneck well before the field hit it. ARC-AGI-3 is pointing at what's next.
03 What ARC-AGI-3 Actually Tests
Previous ARC tasks were static — show the model a grid, ask it to complete the pattern. ARC-AGI-3 is turn-based and interactive.
The benchmark consists of hundreds of hand-crafted game environments built by human game designers. An AI dropped into one gets:
- No instructions
- No stated rules
- No goal description
The AI has to explore the environment, figure out what the rules are, infer what "winning" looks like, and carry that knowledge forward across increasingly difficult levels within the same environment.
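To make that setup concrete, here is a minimal sketch of what "no instructions, no rules, no goal" looks like from the agent's side. Everything in it is hypothetical: `HiddenRuleEnv`, its four action ids, and `blind_agent` are toy stand-ins invented for illustration, not the ARC-AGI-3 interface.

```python
import random


class HiddenRuleEnv:
    """Toy stand-in for one interactive environment.

    Hypothetical: this is not the ARC-AGI-3 interface. The agent gets an
    observation and a set of action ids, and nothing else.
    """

    ACTIONS = (0, 1, 2, 3)

    def __init__(self, size=5):
        self.size = size
        self.pos = 0

    def step(self, action):
        # Hidden rule: action 2 moves right, action 0 moves left,
        # actions 1 and 3 do nothing. The agent is never told this.
        if action == 2:
            self.pos = min(self.pos + 1, self.size - 1)
        elif action == 0:
            self.pos = max(self.pos - 1, 0)
        # Hidden goal: reach the rightmost cell.
        won = self.pos == self.size - 1
        return self.pos, won


def blind_agent(env, max_steps=1000, seed=0):
    """Pick actions at random; return the number of actions used, or None."""
    rng = random.Random(seed)
    for used in range(1, max_steps + 1):
        _, won = env.step(rng.choice(env.ACTIONS))
        if won:
            return used
    return None


actions_used = blind_agent(HiddenRuleEnv())
print("random agent actions:", actions_used, "(spotting the rule takes 4)")
```

Someone who notices that action 2 changes the observation can finish this level in four actions. The random baseline usually needs far more, which is the kind of gap an efficiency-based score is designed to expose.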
Scoring uses RHAE, short for Relative Human Action Efficiency. The question is not just "did you win," but "how efficiently did you win compared to a human."
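The exact RHAE formula isn't spelled out here, so the following is only one plausible reading of "efficiency relative to a human," not the official scoring. It assumes the per-level score is the human action count divided by the AI action count, capped at 1.0, with unsolved levels scoring 0 and levels averaged equally. The function names and the example numbers are made up for illustration.

```python
def action_efficiency(human_actions, ai_actions):
    """Assumed per-level score: human actions / AI actions, capped at 1.0.

    ai_actions is None when the AI never solved the level (scores 0.0).
    This is an illustrative guess, not the official RHAE definition.
    """
    if human_actions <= 0:
        raise ValueError("human_actions must be positive")
    if ai_actions is None:
        return 0.0
    if ai_actions <= 0:
        raise ValueError("ai_actions must be positive")
    return min(1.0, human_actions / ai_actions)


def mean_efficiency(levels):
    """Average the per-level scores over (human_actions, ai_actions) pairs."""
    scores = [action_efficiency(h, a) for h, a in levels]
    return sum(scores) / len(scores)


# Made-up numbers: one level solved with 20x the human action count,
# two levels never solved.
levels = [(20, 400), (35, None), (50, None)]
print(f"{mean_efficiency(levels):.2%}")
```

Under these assumptions, one level solved at 5% efficiency plus two unsolved levels averages to roughly 1.7%. That shows how a model can occasionally win and still land near zero.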
Humans walk in and solve these cold. AI walks in and flails.
04 Why Current AI Fails
This is the interesting part.
Current LLMs are trained to follow instructions in context. Even "reasoning" models like o1 are doing extended in-context search — they're extremely good at applying known patterns to novel combinations of things they've seen before.
ARC-AGI-3 breaks the assumption that there is something familiar to apply. There is no prior context to lean on. The environment is genuinely novel. The rules were invented by a human game designer who had no interest in making them look like anything in the training data.
To succeed, an AI would need something closer to:
- Active exploration — probing the environment to gather evidence
- Model building — constructing an internal representation of how this world works
- Goal inference — figuring out what to optimize for without being told
- Transfer — applying what was learned in level 1 to level 10
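The four ingredients above can be sketched in a few lines on a toy problem. This is a deliberately tiny illustration, not a proposed solution. `LineWorld` and `play_level` are hypothetical names, the "model" is just a table of observed action effects, and the goal-inference step is a crude heuristic that assumes progress means moving toward states not yet visited.

```python
class LineWorld:
    """Hypothetical toy level. Hidden rule: action 1 moves right,
    action 0 moves left, action 2 does nothing. Hidden goal: last cell."""

    def __init__(self, size):
        self.size = size
        self.pos = 0

    def actions(self):
        return [0, 1, 2]

    def observe(self):
        return self.pos

    def step(self, action):
        if action == 1:
            self.pos = min(self.pos + 1, self.size - 1)
        elif action == 0:
            self.pos = max(self.pos - 1, 0)
        return self.pos, self.pos == self.size - 1  # (observation, won)


def play_level(env, effects=None, max_steps=200):
    """Return (actions_used or None, learned_effects)."""
    # Model building: action id -> observed change in the observation.
    effects = dict(effects or {})
    obs, used = env.observe(), 0

    # Active exploration: probe only the actions we know nothing about.
    for action in env.actions():
        if action not in effects:
            new_obs, won = env.step(action)
            used += 1
            effects[action] = new_obs - obs
            obs = new_obs
            if won:
                return used, effects

    # Goal inference (crude): assume the goal lies among unvisited states,
    # so repeat the action that moves furthest away from the start.
    best = max(effects, key=lambda a: effects[a])
    while used < max_steps:
        obs, won = env.step(best)
        used += 1
        if won:
            return used, effects
    return None, effects


# Level 1: learn from scratch. Level 2: transfer the learned model.
used_1, model = play_level(LineWorld(size=4))
used_2, _ = play_level(LineWorld(size=6), effects=model)
used_2_cold, _ = play_level(LineWorld(size=6))
print(used_1, used_2, used_2_cold)
```

On level 1 the agent spends three actions probing and needs five in total, against an optimum of three. On the larger level 2 it reuses the model and solves it in the optimal five actions, while starting cold takes seven. The real environments are far richer, but the structure is the same: explore, model, guess the goal, carry it forward.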
That's not what transformers do. That's closer to how a human child learns to play a new game.
05 Implications
Chollet has been saying for years that scaling will hit a wall for tasks that require genuine out-of-distribution generalization. ARC-AGI-3 is the clearest empirical evidence of that wall so far.
The interesting question is what comes next. A few directions people are exploring:
- Program synthesis — instead of predicting tokens, generate a program that describes the rules
- World models — learn a compact simulation of the environment dynamics
- Active inference — treat exploration as a core loop, not a side effect
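As a rough illustration of the program-synthesis direction, the sketch below enumerates candidate "rule programs" in a tiny made-up DSL and keeps the ones consistent with the transitions observed so far. The DSL (`PRIMITIVES`), the `synthesize` function, and the example transitions are all hypothetical. Real systems search vastly larger program spaces.

```python
from itertools import product

# Hypothetical mini-DSL: a program assigns one primitive effect to each action.
# Each primitive maps (position, line_length) to the next position.
PRIMITIVES = {
    "stay": lambda x, n: x,
    "left": lambda x, n: max(x - 1, 0),
    "right": lambda x, n: min(x + 1, n - 1),
}


def synthesize(transitions, actions, size):
    """Return every action -> primitive assignment that explains the data.

    transitions is a list of (state, action, next_state) observations.
    """
    consistent = []
    for names in product(PRIMITIVES, repeat=len(actions)):
        program = dict(zip(actions, names))
        if all(
            PRIMITIVES[program[a]](s, size) == s_next
            for s, a, s_next in transitions
        ):
            consistent.append(program)
    return consistent


actions = [0, 1, 2]
# Three probes from a 4-cell line: action 0 at the left wall is ambiguous.
observed = [(0, 0, 0), (0, 1, 1), (1, 2, 1)]
candidates = synthesize(observed, actions, size=4)
print(len(candidates), "programs fit")

# One more targeted probe (action 0 away from the wall) resolves it.
observed.append((1, 0, 0))
candidates = synthesize(observed, actions, size=4)
print(len(candidates), "program fits:", candidates[0])
```

After the first three probes, two programs remain because action 0 at the wall looks the same whether it means "stay" or "left". A single extra observation narrows that to one. This is also where the exploration loop comes in: the useful next action is the one that tells competing hypotheses apart.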
None of these are solved. But ARC-AGI-3 makes it concrete what "solved" would look like.
There's also $700,000 in prize money from ARC Prize 2026 for the first open-source solution that reaches human-level performance on ARC-AGI-3. If you want to take a look: arcprize.org/competitions/2026.
06 Personal Take
I've been thinking about this in the context of ML Systems work. Most of what we optimize — attention kernels, KV cache compression, quantization — assumes the model architecture is basically right and we're just making it faster or cheaper.
ARC-AGI-3 suggests the architecture might be the wrong abstraction for this class of problems. That's a different kind of research problem, and probably a longer-horizon one.
Worth keeping an eye on.