AI safety
4 mentions across all digests
AI safety is the research field and practice concerned with ensuring AI systems behave safely and in alignment with human values, encompassing work on model refusal behavior, controlled research access frameworks like Project Glasswing, and the limits of current alignment techniques.
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
Research on language model refusal behavior, examining whether LLMs can distinguish between legitimate and illegitimate rules when asked to help users evade restrictions. The paper (arxiv 2604.06233) analyzes how mode...
Anthropic says its most powerful AI cyber model is too dangerous to release publicly — so it built Project Glasswing
Anthropic uses Project Glasswing to grant controlled research access to a powerful AI cybersecurity model deemed too dangerous for public release, prioritizing safety oversight over unrestricted availability.
Redirected, Not Removed: Task-Dependent Stereotyping Reveals the Limits of LLM Alignments
arXiv researchers reveal that LLM alignment techniques redirect harmful behavior rather than eliminate it, exposing fundamental gaps in current AI safety approaches.
Understanding neural networks through sparse circuits