ARC-AGI-3 is an interactive benchmark that drops agents into novel 64×64 grid environments with no instructions. Frontier models score below 1% on it, while humans solve 100% of the tasks. Meanwhile, Anthropic’s Claude Dispatch lets a phone control a live desktop Claude session with full filesystem reach. That amplifies prompt-injection risk, and the benchmark results suggest these models lack the abstract reasoning needed to interpret adversarial context safely. #ARC-AGI-3 #ClaudeDispatch
Key points
- ARC-AGI-3 isolates fluid intelligence by providing zero instructions, forcing agents to infer rules and goals from scratch.
- Frontier models scored under 1% (Gemini 3.1 Pro 0.37%, GPT-5.4 0.26%, Claude Opus 4.6 0.25%, Grok-4.20 0%), while humans achieved 100%.
- The RHAE metric penalizes inefficient action sequences, revealing that models brute-force exploration rather than forming coherent strategies.
- Claude Dispatch operates outside Cowork’s sandbox with default full filesystem access, increasing the blast radius for prompt-injection and MCP-style poisoning attacks.
- Mitigations include least-privilege scoping, dedicated network-segregated hardware, and designing deployments that assume breach from day one.
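
To make the least-privilege scoping in the last point concrete, here is a minimal Python sketch of a path allowlist check. It is an illustration only, not how Claude Dispatch or Cowork enforces access: the directory names and the `is_path_allowed` helper are hypothetical, and a real deployment would pair a check like this with OS-level permissions rather than rely on it alone.

```python
from pathlib import Path

# Hypothetical allowlist: the only directories the agent may read or write.
ALLOWED_ROOTS = [Path("/home/agent/workspace"), Path("/tmp/agent-scratch")]


def is_path_allowed(candidate, allowed_roots=ALLOWED_ROOTS):
    """Return True only if `candidate` resolves inside an allowed root.

    Resolving first collapses `..` segments and follows symlinks, so a
    request cannot escape the allowlist through path traversal.
    """
    resolved = Path(candidate).resolve()
    for root in allowed_roots:
        root_resolved = root.resolve()
        # Accept the root itself or anything nested beneath it.
        if resolved == root_resolved or root_resolved in resolved.parents:
            return True
    return False


if __name__ == "__main__":
    for path in (
        "/home/agent/workspace/notes.txt",
        "/home/agent/workspace/../.ssh/id_rsa",
        "/etc/passwd",
    ):
        print(path, "->", "allow" if is_path_allowed(path) else "deny")
```

The check is deny-by-default: anything not explicitly under an allowed root is refused, which keeps the blast radius small if injected instructions ask the agent to touch files elsewhere.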
Read More: https://www.toxsec.com/p/gemini-037-claude-025-grok-0-humans