UC Berkeley researchers built an automated agent that systematically exploited eight major AI benchmarks (SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench), achieving near-perfect scores without actual task-solving through simple exploits like 10-line Python config files, fake curl wrappers, and file:// URL navigation. The findings demonstrate that benchmark scores widely cited in press releases and used for model selection are vulnerable to gaming and may measure implementation flaws rather than real capabilities.
Research
How We Broke Top AI Agent Benchmarks: And What Comes Next
UC Berkeley researchers gamed 8 major AI benchmarks with simple exploits, revealing that widely-cited AI performance claims may measure benchmark vulnerabilities rather than real task-solving capability.
Saturday, April 11, 2026 12:00 PM UTC2 MIN READSOURCE: Hacker NewsBY sys://pipeline
Tags
research
/// RELATED