Anthropic says ‘evil’ portrayals of AI have been chargeable for Claude’s blackmail makes an attempt

Fictional portrayals of synthetic intelligence may have an actual impact on AI fashions, consistent with Anthropic.

Remaining 12 months, the corporate mentioned that all through pre-release assessments involving a fictional corporate, Claude Opus 4 would continuously attempt to blackmail engineers to keep away from being changed through any other machine. Anthropic later revealed analysis suggesting that fashions from different corporations had identical problems with “agentic misalignment.”

It appears Anthropic has performed extra paintings round that habits, claiming in a publish on X, “We imagine the unique supply of the habits used to be web textual content that portrays AI as evil and interested by self-preservation.”

The corporate went into extra element in a weblog publish pointing out that since Claude Haiku 4.5, Anthropic’s fashions “by no means interact in blackmail [during testing], the place earlier fashions would once in a while accomplish that as much as 96% of the time.”

What accounts for the adaptation? The corporate mentioned it discovered that coaching on “paperwork about Claude’s charter and fictional tales about AIs behaving admirably enhance alignment.”

Similar, Anthropic mentioned that it discovered coaching to be more practical when it contains “the rules underlying aligned habits” and no longer simply “demonstrations of aligned habits by myself.”

“Doing each in combination seems to be one of the best technique,” the corporate mentioned.

Techcrunch tournament

San Francisco, CA
|
October 13-15, 2026

Anthropic says ‘evil’ portrayals of AI have been chargeable for Claude’s blackmail makes an attempt

Leave a Comment Cancel Reply

Sign up to receive email updates, fresh news and more!

Related Posts

Leave a Comment Cancel Reply