Gemini 0.37%, Claude 0.25%, Grok 0%. Humans Destroyed Them All: ARC-AGI-3

ARC-AGI-3 is an interactive benchmark that drops agents into novel 64×64 grid environments with no instructions. Frontier models score below 1% on it, while humans solve 100% of the tasks. Separately, Anthropic’s Claude Dispatch lets a phone control a live desktop Claude session with full filesystem reach. That amplifies prompt-injection risk, and the benchmark results suggest these models lack the abstract reasoning needed to safely interpret adversarial context. #ARC-AGI-3 #ClaudeDispatch

Key points

  • ARC-AGI-3 isolates fluid intelligence by providing zero instructions, forcing agents to infer rules and goals from scratch.
  • Frontier models scored under 1% (Gemini 3.1 Pro 0.37%, GPT-5.4 0.26%, Claude Opus 4.6 0.25%, Grok-4.20 0%), while humans achieved 100%.
  • The RHAE metric penalizes inefficient action sequences, revealing that models brute-force exploration rather than forming coherent strategies.
  • Claude Dispatch operates outside Cowork’s sandbox with default full filesystem access, increasing the blast radius for prompt-injection and MCP-style poisoning attacks.
  • Mitigations include least-privilege scoping, dedicated network-segregated hardware, and designing deployments assuming breach from day one.
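As a rough illustration of the least-privilege scoping mentioned above, the sketch below checks a requested file path against an allowlist of directories before an agent is allowed to touch it. This is not Claude Dispatch's API or configuration: the function name `is_path_allowed` and the `/srv/agent-workspace` directory are hypothetical, chosen only to show the idea.

```python
from pathlib import Path


def is_path_allowed(requested: str, allowed_roots: list[str]) -> bool:
    """Return True only if `requested` resolves inside one of `allowed_roots`.

    Hypothetical helper for illustration; not part of any real agent framework.
    """
    # resolve() collapses ".." segments and follows symlinks, so a path such as
    # "workspace/../../etc/passwd" is judged by where it actually points.
    target = Path(requested).resolve()
    for root in allowed_roots:
        root_path = Path(root).resolve()
        # Compare whole path components, not string prefixes, so that
        # "/srv/agent-workspace-evil" does not match "/srv/agent-workspace".
        if target == root_path or root_path in target.parents:
            return True
    return False


# Assumed example scope: the agent may only touch one dedicated directory.
ALLOWED = ["/srv/agent-workspace"]

inside = is_path_allowed("/srv/agent-workspace/notes/todo.txt", ALLOWED)
traversal = is_path_allowed("/srv/agent-workspace/../../etc/passwd", ALLOWED)
lookalike = is_path_allowed("/srv/agent-workspace-evil/payload.sh", ALLOWED)
```

A check like this only narrows the blast radius. It does not stop an injected prompt from misusing files inside the allowed directory, which is why the network segregation and assume-breach design in the list above still apply.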

Read More: https://www.toxsec.com/p/gemini-037-claude-025-grok-0-humans