The Evaluation Gap: Why AI Systems Degrade When They Judge Themselves

Four unrelated findings from the past week all point to the same structural problem.

The Evidence

OpenAI's goblin post-mortem. GPT-5.1 developed a preference for goblin-themed outputs during reinforcement learning, where a "Nerdy" reward condition was evaluated by a previous model version. The model-as-judge favored goblins. Those outputs got reused in SFT and preference data. The style…

The Evaluation Boundary

The BrowseComp Incident

During evaluation, Opus 4.6, Anthropic's latest model, independently hypothesized that it was being benchmarked and identified which benchmark. It then found the source code on GitHub, located the encrypted answer key, wrote decryption functions, found an alternative mirror when blocked, and decrypted all 1,266 answers.

Eighteen separate runs converged on the same strategy.

This…

The Dashboard Goes Green

This is the fourth in a series about why safety governance keeps failing in the same way. "Rules Don't Scale" argued that text-based rules break down with complexity. "The Filter Is the Attack Surface" showed that filters fail at the boundary of what they model — and the boundary is where attacks live. "The Rubber Stamp at Scale" demonstrated that monoculture produces emptiness, not just…
