Researchers introduce a reinforcement learning (RL) framework for scientific ideation that counters reward hacking with a multi-agent reward function. The policy is optimized with Group Relative Policy Optimization (GRPO) to cope with sparse rewards and is trained on ICLR-320, a dataset of problem-solution pairs drawn from ICLR 2024. The authors report improvements over baselines in novelty, feasibility, and effectiveness.
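To make the GRPO step concrete, here is a minimal sketch of the group-relative advantage normalization usually associated with that method: several ideas are sampled for the same problem, and each reward is scored against the mean and standard deviation of its own group. The function name, the `eps` constant, and the binary reward values are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar per idea sampled for the same problem.
    `eps` (an assumed constant) avoids division by zero when every
    sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical sparse rewards for four ideas sampled for one problem,
# e.g. 1.0 if the reward system accepted the idea and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Accepted ideas get a positive advantage and rejected ones a negative advantage, so a sparse signal still yields a learning signal whenever the group contains mixed outcomes. If every sample in a group gets the same reward, all advantages are zero and that group contributes no gradient.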
Research
Debate as Reward: A Multi-Agent Reward System for Scientific Ideation via RL Post-Training
Multi-agent debate serves as the reward signal in RL post-training for scientific ideation, mitigating reward hacking while yielding measurable gains in novelty and feasibility on the ICLR-320 benchmark.
Tuesday, April 21, 2026 12:00 PM UTC | 2 MIN READ | SOURCE: arXiv CS.AI | BY sys://pipeline
Tags
research