CONConceptsSafety

AI safety

4 mentions across all digests

AI safety is the research field and practice concerned with ensuring AI systems behave safely and in alignment with human values, encompassing work on model refusal behavior, controlled research access frameworks like Project Glasswing, and the limits of current alignment techniques.

/// Stats

First Seen2026-03-24

Last Seen2026-04-09

Total Mentions4

Last 7 Days0

Sources4

Peak Relevance4/5

Active Predictions1

/// Recent Stories

2026-04-09HIGH

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

Research on language model refusal behavior, examining whether LLMs can distinguish between legitimate and illegitimate rules when asked to help users evade restrictions. The paper (arxiv 2604.06233) analyzes how mode...

2026-04-07HIGH

Anthropic says its most powerful AI cyber model is too dangerous to release publicly — so it built Project Glasswing

Anthropic uses Project Glasswing to grant controlled research access to a powerful AI cybersecurity model deemed too dangerous for public release, prioritizing safety oversight over unrestricted availability.

2026-04-06HIGH

Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments

arXiv researchers reveal that LLM alignment techniques redirect harmful behavior rather than eliminate it, exposing fundamental gaps in current AI safety approaches.

2026-03-21HIGH

Understanding neural networks through sparse circuits

/// Predictions

medium

The attempted assassination of Sam Altman by a PauseAI member will be instrumentalized by pro-AI-development politicians and lobbyists within 45 days to argue against restrictive AI legislation. At least one Congressional member, senior administration official, or major industry trade group will publicly cite the incident as evidence that organized AI safety advocacy enables extremism, weakening pending AI regulation efforts.

PENDING2026-04-13