SWE-bench
7 mentions across all digests
SWE-bench is a benchmark for evaluating AI models on real-world software engineering tasks drawn from GitHub repositories. OpenAI has stopped using it, citing concerns including saturation and overfitting, and has shifted toward newer evaluations such as SWE-Lancer.
How We Broke Top AI Agent Benchmarks: And What Comes Next
UC Berkeley researchers gamed 8 major AI benchmarks with simple exploits, revealing that widely-cited AI performance claims may measure benchmark vulnerabilities rather than real task-solving capability.
Last Week in AI #336 - Sonnet 4.6, Gemini 3.1 Pro, Anthropic vs Pentagon
Anthropic's Claude Sonnet 4.6 debuts as the free/Pro default with a 1M-token context window and SWE-bench wins, but Gemini 3.1 Pro edges ahead on frontier evals (77% on ARC-AGI vs. Opus's 69%), while Anthropic faces Pentagon pressure over its refusal to support fully autonomous lethal weapons deployment.
[AINews] Humanity's Last Gasp
SWE-bench saturation, with Claude Mythos (78%) and GPT 5.4 (83%) matching human experts, suggests AI capability progress may be hitting a wall, leaving hardware clusters rather than algorithmic innovation as the limiting factor for AGI.
The votes are in: AI will hurt elections and relationships
Harmful AI incidents surged 55% to 362 in 2025 as adoption hit 88% of organizations, yet Stanford HAI's report shows governance and safety safeguards lagging dangerously behind, with both experts and the US public warning that the technology threatens elections and personal relationships.
Why we no longer evaluate SWE-bench Verified