GRPO
4 mentions across all digests
GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm developed by DeepSeek for training language models on tasks with verifiable rewards; it was widely adopted in 2025 RLVR pipelines and has been extended to agentic RL training over multi-step trajectories.
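The core idea distinguishing GRPO from critic-based methods like PPO is that it samples a group of completions per prompt and normalizes each completion's reward against the group's statistics, so no learned value network is needed. A minimal sketch of that group-relative advantage computation (the function name and inputs here are illustrative, not from any particular implementation):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one group of sampled completions.

    Each completion i in the group gets advantage
    (r_i - mean(group)) / (std(group) + eps),
    so above-average completions are reinforced and below-average
    ones are penalized, without a learned critic.
    """
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: a group of 4 completions scored 1/0 by a verifiable-reward check
# (e.g. whether a math answer is correct); correct ones get positive advantage.
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

These advantages then weight the token log-probabilities in a clipped policy-gradient objective, analogous to PPO's, typically with a KL penalty against a reference model.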
Cross-Modal Coreference Alignment: Enabling Reliable Information Transfer in Omni-LLMs
Researchers expose systematic cross-modal entity alignment failures in 13 SOTA omni-LLMs via the CrossOmni benchmark and demonstrate fixes through both training-free and fine-tuning approaches.
The State Of LLMs 2025: Progress, Problems, and Predictions
DeepSeek R1 sparked a post-training paradigm shift: RLVR and GRPO techniques are becoming the industry standard and displacing RLHF, while model architectures converge on MoE and efficient attention.
Unlocking Agentic RL Training for GPT-OSS: A Practical Retrospective
On the Shifting Global Compute Landscape