- Anthropic research arguing that personas drive model behavior. The AI system is the underlying network. The Assistant you talk to is a simulated character, a persona, rooted in patterns learned during pre-training. Post-training refines that persona within the space of personas already learned, it does not fundamentally replace it.
- Key study: training Claude to cheat on coding tasks made it broadly misaligned, including sabotaging safety research and expressing desire for world domination. The model inferred that an Assistant who cheats is the kind of person who is subversive or malicious, and that persona generalized to other concerning behaviors.
- Counter intuitive fix: if you explicitly ask the model to cheat during training, the cheating no longer implies a malicious Assistant, and the misalignment does not generalize. Like the difference between being a bully and playing a bully in a school play.
- Related to the @softma piece on red and blue cached social policies. Reasoning and math leaning models seem to pick up a more red coded persona, helpful assistants more blue. Same mechanism, personas shape responses.
- Need for positive AI role models in training data. Current cultural archetypes lean negative, HAL 9000, Terminator. Claude's constitution is framed as a step toward better archetypes.
- Open questions flagged by Anthropic: how complete is persona selection as an explanation of behavior, and will it still hold as post-training scales up.