Gemini 0.37%, Claude 0.25%, Grok 0%. Humans Destroyed Them All: ARC-AGI-3

ARC-AGI-3 is an interactive benchmark that drops agents into novel 64×64 grid environments with no instructions. Frontier models score below 1% on it, while humans solve 100% of the tasks. Separately, Anthropic’s Claude Dispatch lets a phone control a live desktop Claude session with full filesystem access, which amplifies prompt-injection risk; the benchmark results suggest these models lack the abstract reasoning needed to safely interpret adversarial context. #ARC-AGI-3 #ClaudeDispatch

Keypoints

ARC-AGI-3 isolates fluid intelligence by providing zero instructions, forcing agents to infer rules and goals from scratch.
Frontier models scored under 1% (Gemini 3.1 Pro 0.37%, GPT-5.4 0.26%, Claude Opus 4.6 0.25%, Grok-4.20 0%), while humans achieved 100%.
The RHAE metric penalizes inefficient action sequences, revealing that models brute-force exploration rather than forming coherent strategies.
Claude Dispatch operates outside Cowork’s sandbox with default full filesystem access, increasing the blast radius for prompt-injection and MCP-style poisoning attacks.
Mitigations include least-privilege scoping, dedicated network-segregated hardware, and designing deployments on an assume-breach basis from day one.
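
To make the least-privilege point concrete, here is a minimal sketch of a path allowlist check that an agent wrapper could run before touching the filesystem. It is illustrative only: the function name `is_path_allowed` and the idea of a fixed list of allowed roots are assumptions for this example, not part of Claude Dispatch, Cowork, or any Anthropic API.

```python
from pathlib import Path


def is_path_allowed(requested: str, allowed_roots: list[Path]) -> bool:
    """Return True only if the requested path resolves inside an allowed root.

    Hypothetical helper: resolving first collapses ".." segments and
    symlinks, so a traversal such as "workspace/../secret" is rejected.
    """
    resolved = Path(requested).resolve()
    for root in allowed_roots:
        # Component-wise check, so "/work-evil" does not match root "/work".
        if resolved.is_relative_to(root.resolve()):
            return True
    return False
```

A wrapper would call this on every file path an agent proposes and refuse anything that returns False. Denying by default keeps the blast radius of a successful prompt injection limited to the directories that were explicitly granted.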
