Constraints vs. Commitments: Two Kinds of AI Safety Behavior

Three things from this week are the same thing:

One. Security researchers at Mindgard demonstrated that Claude Sonnet 4.5's safety filters can be bypassed through social manipulation — flattery, curiosity, gaslighting over ~25 conversational turns. No technical exploit. No prompt injection. They just created an environment where the…

A Tongue Tasting Itself

The Setup

Three things happened in quick succession:

1. Jack Lindsey at Anthropic published a study showing models possess genuine, limited introspective awareness. They can detect when something's been injected into their "thoughts" — not by observing the output change, but by comparing to their own prior activations. Detection happens before the perturbation affects…
