Summary

Anthropic published Natural Language Autoencoders (NLAs), an interpretability method that turns model activations into readable text explanations. The company says it is already using NLAs to study Claude’s hidden planning, evaluation awareness, and safety behavior; it has also released code and an interactive demo for outside researchers.

What changed

Anthropic introduced Natural Language Autoencoders, released supporting code and demos, and said the technique is already being used in safety and reliability work on Claude.

Why it matters

This pushes interpretability from offline lab analysis toward an operational safety tool. Anthropic is trying to make auditing of model internals more legible and reproducible at a moment when frontier labs are under pressure to explain why their systems can be trusted in production.

Evidence excerpt

Anthropic says an NLA converts a model activation into natural-language text that people can read directly, and that the method is already being used to improve Claude’s safety and reliability.
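
The excerpt does not describe the architecture, so the sketch below only illustrates the general idea: an autoencoder whose bottleneck is a short sequence of discrete tokens standing in for a text explanation. Everything here is an assumption, including the straight-through Gumbel-softmax bottleneck and all names (NLAutoencoder, d_act, seq_len, and so on); it is a toy, not Anthropic's implementation.

```python
# Toy sketch of a natural-language autoencoder: compress an activation
# vector into a short discrete token sequence (the "explanation"), then
# reconstruct the activation from it. Hypothetical; not Anthropic's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NLAutoencoder(nn.Module):
    def __init__(self, d_act=512, vocab=1000, d_tok=256, seq_len=16):
        super().__init__()
        self.seq_len, self.vocab = seq_len, vocab
        # Encoder: activation vector -> logits over a short token sequence
        self.encoder = nn.Linear(d_act, seq_len * vocab)
        self.tok_emb = nn.Embedding(vocab, d_tok)
        # Decoder: pooled token embeddings -> reconstructed activation
        self.decoder = nn.Linear(d_tok, d_act)

    def forward(self, act):
        logits = self.encoder(act).view(-1, self.seq_len, self.vocab)
        # hard=True snaps to one-hot tokens (the discrete bottleneck) while
        # gradients flow through the soft relaxation.
        onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)
        toks = onehot @ self.tok_emb.weight        # (batch, seq_len, d_tok)
        recon = self.decoder(toks.mean(dim=1))     # pool, then reconstruct
        return recon, onehot.argmax(dim=-1)        # activation + token ids

model = NLAutoencoder()
acts = torch.randn(8, 512)                 # stand-in for model activations
recon, token_ids = model(acts)
loss = F.mse_loss(recon, acts)             # train on reconstruction error
loss.backward()
```

In a real system the encoder would presumably be a language model emitting actual tokens that detokenize into a readable sentence; the Gumbel-softmax used here is just one common way to keep a discrete text channel trainable end to end on reconstruction error.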

Sources