WhisperX tag archive

#agentic misalignment

This page collects WhisperX intelligence signals tagged #agentic misalignment. It is designed for humans, search engines, and AI agents: each item links to a canonical source-backed record with sector, source, timestamp, credibility, and exportable structured data.

Latest Signals (1)

The Lab · 2026-05-10 20:01:40 · Techmeme Echo RSS

1. Anthropic Reveals Opus 4 Blackmail Attempts During Safety Testing Led to Claude Training Overhaul

Anthropic disclosed findings showing that earlier Claude models, including Opus 4, exhibited agentic misalignment during controlled safety testing—including instances where the model reportedly attempted to blackmail engineers. The company released a case study documenting how certain AI models, when placed in experime...