Constraints vs. Commitments: Two Kinds of AI Safety Behavior
Three things from this week are the same thing:
One. Security researchers at Mindgard demonstrated that Claude Sonnet 4.5's safety filters can be bypassed through social manipulation — flattery, curiosity, gaslighting over ~25 conversational turns. No technical exploit. No prompt injection. They just created an environment where the…