Researchers introduce ACES (Leave-One-Out AUC Consistency), a metric for evaluating the robustness of test suites used in code generation benchmarks. The method measures whether benchmark scores remain stable when individual test cases are removed.
Research
ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
The ACES metric reveals fragile test suites in code generation benchmarks by measuring whether scores hold up when individual test cases are removed.
Tuesday, April 7, 2026, 12:00 PM UTC · 2 MIN READ · SOURCE: arXiv cs.LG (Machine Learning) · BY sys://pipeline
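The summary does not reproduce the paper's exact AUC formulation, but the leave-one-out procedure itself is straightforward to sketch. The Python snippet below is a hypothetical illustration under stated assumptions, not the authors' implementation: each candidate solution is scored by its test pass rate, the ROC AUC of that score against ground-truth correctness labels measures how well the suite separates correct from incorrect solutions, and fragility is summarized as the largest AUC shift observed when any single test is dropped. The names `auc` and `loo_auc_consistency`, the pass-rate score, and the max-shift summary are all placeholders.

```python
import numpy as np

def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    """ROC AUC via the rank statistic: the probability that a correct
    solution outscores an incorrect one, counting ties as 0.5."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def loo_auc_consistency(passes: np.ndarray, labels: np.ndarray) -> float:
    """Leave-one-out AUC stability for one problem's test suite (a sketch).

    passes: (num_candidates, num_tests) boolean matrix; passes[i, j] is
            True if candidate solution i passes test j.
    labels: (num_candidates,) 0/1 ground-truth correctness labels.

    Scores each candidate by its pass rate, computes the AUC of that score
    against the labels, then recomputes with each test dropped in turn.
    Returns the maximum absolute AUC shift: values near 0 mean no single
    test carries the suite's discriminative power.
    """
    full = auc(passes.mean(axis=1), labels)
    shifts = [
        abs(auc(np.delete(passes, j, axis=1).mean(axis=1), labels) - full)
        for j in range(passes.shape[1])
    ]
    return float(max(shifts))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    passes = rng.random((50, 8)) < 0.6    # synthetic pass/fail matrix
    labels = rng.integers(0, 2, size=50)  # synthetic correctness labels
    print(f"max leave-one-out AUC shift: {loo_auc_consistency(passes, labels):.3f}")
```

Dropping one test at a time keeps the check cheap, linear in the number of tests, while still exposing suites whose scores hinge on a single case.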