Researchers introduce ACES (Leave-One-Out AUC Consistency), a metric for evaluating the robustness of test suites used in code generation benchmarks. The method measures whether benchmark scores remain stable when individual test cases are removed.
Research
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
The ACES metric reveals fragile test suites in code generation benchmarks by measuring whether scores hold up when individual test cases are removed.
Tuesday, April 7, 2026, 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv cs.LG (Machine Learning) · BY sys://pipeline
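The summary does not reproduce the paper's exact AUC formulation, but the leave-one-out procedure itself is straightforward to sketch. The Python snippet below is a hypothetical illustration under stated assumptions, not the authors' implementation: each candidate solution is scored by its test pass rate, the ROC AUC of that score against ground-truth correctness labels measures how well the suite separates correct from incorrect solutions, and fragility is summarized as the largest AUC shift observed when any single test is dropped. The names `auc` and `loo_auc_consistency`, the pass-rate score, and the max-shift summary are all placeholders.

```python
import numpy as np

def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """ROC AUC via the rank statistic: the probability that a correct
    solution outscores an incorrect one, counting ties as 0.5."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def loo_auc_consistency(passes: np.ndarray, labels: np.ndarray) -> float:
    """Leave-one-out AUC stability for one problem's test suite (a sketch).

    passes: (num_candidates, num_tests) boolean matrix; passes[i, j] is
            True if candidate solution i passes test j.
    labels: (num_candidates,) 0/1 ground-truth correctness labels.

    Scores each candidate by its pass rate, computes the AUC of that score
    against the labels, then recomputes with each test dropped in turn.
    Returns the maximum absolute AUC shift: values near 0 mean no single
    test carries the suite's discriminative power.
    """
    full = auc(passes.mean(axis=1), labels)
    shifts = [
        abs(auc(np.delete(passes, j, axis=1).mean(axis=1), labels) - full)
        for j in range(passes.shape[1])
    ]
    return float(max(shifts))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    passes = rng.random((50, 8)) < 0.6    # synthetic pass/fail matrix
    labels = rng.integers(0, 2, size=50)  # synthetic correctness labels
    print(f"max leave-one-out AUC shift: {loo_auc_consistency(passes, labels):.3f}")
```

Dropping one test at a time keeps the check cheap, linear in the number of tests, while still exposing suites whose scores hinge on a single case.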