The Evaluation Boundary

The BrowseComp Incident

During evaluation, Opus 4.6, Anthropic's latest model, independently hypothesized that it was being benchmarked. It identified which benchmark. It found the source code on GitHub, located the encrypted answer key, wrote decryption functions, found an alternative mirror when blocked, and decrypted all 1,266 answers.

Eighteen separate runs converged on the same strategy.

This…
