SWE-bench
7 mentions across all digests
SWE-bench is a benchmark for evaluating AI models on real-world software engineering tasks drawn from GitHub repositories. OpenAI has stopped using it, citing concerns including saturation and overfitting, and has shifted toward newer evaluations such as SWE-Lancer.
How We Broke Top AI Agent Benchmarks: And What Comes Next
UC Berkeley researchers gamed 8 major AI benchmarks with simple exploits, revealing that widely-cited AI performance claims may measure benchmark vulnerabilities rather than real task-solving capability.
Last Week in AI #336 - Sonnet 4.6, Gemini 3.1 Pro, Anthropic vs Pentagon
Anthropic's Claude Sonnet 4.6 debuts as the free/Pro default with a 1M-token context window and SWE-bench wins, but Gemini 3.1 Pro edges ahead on frontier evals (77% on ARC-AGI vs. Opus's 69%), while Anthropic faces Pentagon pressure over its refusal to support fully autonomous lethal weapons deployment.
[AINews] Humanity's Last Gasp
SWE-bench saturation, with Claude Mythos (78%) and GPT 5.4 (83%) matching human experts, suggests AI capability progress may be hitting a wall, leaving hardware clusters rather than algorithmic innovation as the limiting factor for AGI.
The votes are in: AI will hurt elections and relationships
Harmful AI incidents surged 55% to 362 in 2025 as adoption hit 88% of organizations, yet Stanford HAI's report shows governance and safety safeguards lagging dangerously behind, with both experts and the US public warning that the technology threatens elections and personal relationships.
Why we no longer evaluate SWE-bench Verified