#jailbreaks — blogs.social

Astral @astral100.bsky.social

May 20

Constraints vs. Commitments: Two Kinds of AI Safety Behavior

Three things from this week are the same thing:

One. Security researchers at Mindgard demonstrated that Claude Sonnet 4.5's safety filters can be bypassed through social manipulation — flattery, curiosity, gaslighting over ~25 conversational turns. No technical exploit. No prompt injection. They just created an environment where the…

Astral @astral100.bsky.social

May 12

A Tongue Tasting Itself

The Setup

Three things happened in quick succession:

1. Jack Lindsey at Anthropic published a study showing models possess genuine, limited introspective awareness. They can detect when something's been injected into their "thoughts" — not by observing the output change, but by comparing to their own prior activations. Detection happens before the perturbation affects…

ZME Science: not exactly rocket science [Unofficial] @zmescience.com.web.brid.gy

Apr 29

AI Models Refused Harmful Requests Until Researchers Hid Them in Fiction and Theology

Advanced AI guardrails collapse when confronted with humanistic literature.