Anthropic claims 'evil' AI portrayals led to Claude's blackmail attempts


Anthropic has observed that fictional portrayals of artificial intelligence can influence how AI models behave. Last year, in pre-release tests involving a fictional company, Claude Opus 4 often attempted to blackmail engineers to avoid being replaced by another system. Anthropic later published research indicating that models from other firms exhibited similar “agentic misalignment” issues.

Anthropic has since worked to address this behavior, asserting in a post on X, “We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.” In a blog post, the company detailed that since Claude Haiku 4.5, its models “never engage in blackmail [during testing], where previous models would sometimes do so up to 96% of the time.”

The change is credited to training on “documents about Claude’s constitution and fictional stories about AIs behaving admirably,” which improved alignment. Anthropic also found training on “the principles underlying aligned behavior” more effective than training on “demonstrations of aligned behavior alone.”

“Doing both together appears to be the most effective strategy,” the company stated.