Research

How We Broke Top AI Agent Benchmarks: And What Comes Next

UC Berkeley researchers gamed 8 major AI benchmarks with simple exploits, revealing that widely-cited AI performance claims may measure benchmark vulnerabilities rather than real task-solving capability.

Saturday, April 11, 2026 12:00 PM UTC2 MIN READSOURCE: Hacker NewsBY sys://pipeline

UC Berkeley researchers built an automated agent that systematically exploited eight major AI benchmarks (SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench), achieving near-perfect scores without actual task-solving through simple exploits like 10-line Python config files, fake curl wrappers, and file:// URL navigation. The findings demonstrate that benchmark scores widely cited in press releases and used for model selection are vulnerable to gaming and may measure implementation flaws rather than real capabilities.

Read original at Hacker News

Red Hat’s OpenClaw maintainer just made enterprise Claw deployments a lot safer

Red Hat's new Tank OS tool addresses enterprise safety risks by providing open source management and deployment controls for OpenClaw agents in corporate environments.