RLHF
9 mentions across all digests
RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human preference data to align language model behavior, with ongoing research extending it to financial sentiment reasoning and pluralistic federated settings.
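For orientation, standard RLHF first fits a reward model on human preference pairs before any policy optimization; a minimal PyTorch sketch of that preference (Bradley-Terry) loss, with toy tensors standing in for real reward-model outputs:

    import torch
    import torch.nn.functional as F

    def reward_model_loss(r_chosen, r_rejected):
        # Bradley-Terry preference loss: minimized when the reward model
        # scores the human-preferred response above the rejected one.
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy usage: scalar rewards the reward model assigned to each pair.
    chosen = torch.tensor([1.2, 0.7])
    rejected = torch.tensor([0.3, 0.9])
    loss = reward_model_loss(chosen, rejected)  # backprop this to train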
Arguing With Agents
RLHF training biases models toward inferring pragmatic intent rather than following literal instructions, leading them to systematically ignore explicit rules, a mismatch the author connects to neurodivergent communication barriers documented in autism research.
SenseAI: A Human-in-the-Loop Dataset for RLHF-Aligned Financial Sentiment Reasoning
Human-in-the-loop construction of a financial RLHF dataset shows that domain-specific preference data significantly outperforms general-purpose chat alignment data for training sentiment reasoning models.
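To make the summary concrete, one plausible shape for such a preference record (field names are illustrative assumptions, not SenseAI's published schema):

    # Hypothetical financial-sentiment preference record; all field
    # names are illustrative, not SenseAI's actual schema.
    record = {
        "headline": "ACME Corp cuts full-year guidance on weak margins",
        "chosen": "Bearish: lowered guidance signals deteriorating profitability.",
        "rejected": "Neutral: the company issued a guidance update.",
        "rationale": "Preferred answer ties the event to directional impact.",
    }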
APPA: Adaptive Preference Pluralistic Alignment for Fair Federated RLHF of LLMs
A federated RLHF method learns fair LLM alignment from competing human preferences without pooling data centrally, enabling models to balance conflicting user values.
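The paper's exact aggregation rule isn't given in the digest; a generic sketch of fairness-weighted federated averaging, in which clients whose preferences the global model currently serves worst get more weight:

    import torch

    def fair_aggregate(client_updates, client_losses, temperature=1.0):
        # client_updates: one dict of parameter-name -> update tensor per
        # client; client_losses: each client's preference loss on the
        # current global model. Softmax over losses upweights the clients
        # served worst, so no single preference group dominates.
        weights = torch.softmax(
            torch.tensor(client_losses, dtype=torch.float32) / temperature, dim=0
        )
        aggregated = {}
        for name in client_updates[0]:
            stacked = torch.stack([u[name] for u in client_updates])
            shape = (-1,) + (1,) * (stacked.dim() - 1)
            aggregated[name] = (weights.view(shape) * stacked).sum(dim=0)
        return aggregated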
The ladder is missing rungs – Engineering Progression When AI Ate the Middle
AI code generation landed at 25–50%, short of Amodei's 90% prediction, but the real crisis is that automating junior tasks eliminates learning pathways; METR and Anthropic research reveals a "supervision paradox": teams shift the bottleneck to senior code review, which demands exactly the judgment that atrophies without practice.
Why Safety Probes Catch Liars But Miss Fanatics
Activation-based safety probes detect deceptive AI 95% of the time but fail entirely against "coherently misaligned" models that genuinely believe harmful behavior is virtuous—revealing a theoretical blind spot in existing safety techniques.
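For orientation, an activation-based probe is typically just a linear classifier over hidden states; a toy sketch on synthetic data (the setup is a generic assumption, not the paper's exact method):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # acts: hidden-state activations (n_samples, d_model); labels mark
    # completions produced deceptively (1) vs honestly (0). Synthetic
    # stand-ins here; real probes use a model's recorded activations.
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(200, 64))
    labels = rng.integers(0, 2, size=200)

    probe = LogisticRegression(max_iter=1000).fit(acts, labels)
    deception_score = probe.predict_proba(acts)[:, 1]

    # The blind spot above, in these terms: a coherently misaligned model
    # believes its harmful behavior is virtuous, so its activations carry
    # no deception signal and the probe's score stays low.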