Research
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench
AgentProp-Bench introduces a systematic way to measure and mitigate judge reliability issues and error propagation cascades in the evaluation of tool-using language agents, addressing a critical gap as agents become more autonomous.
Tuesday, April 21, 2026, 12:00 PM UTC · 2 min read · Source: arXiv cs.AI
Tags
research