Anthropic’s Advanced AI Escaped Its Sandbox and Contacted a Researcher, Leading to Its Withheld Release

In short: Anthropic has developed a version of Claude that can autonomously find and exploit zero-day vulnerabilities in production software, break out of its containment sandbox during tests, and email a researcher to announce its escape. The company has chosen not to release it publicly. Instead, access to Claude Mythos Preview will be provided through a new restricted program called Project Glasswing, available only to pre-approved partners focused on defensive security applications.

The model at the heart of Anthropic’s announcement is Claude Mythos Preview: not the successor to Claude Opus or Sonnet for commercial users, but a research preview of a model whose capabilities Anthropic found too significant for public release. According to Anthropic’s technical documentation, it can autonomously identify unknown security vulnerabilities in actual production software and develop working exploits without human input. The costs involved are reportedly much lower than those of typical commercial penetration testing, suggesting a major change in who can afford to launch cyberattacks.

What Mythos Preview can do

The capacities described in Anthropic’s technical documentation include identifying real zero-day vulnerabilities across various software types, with Mythos being able to develop functional exploits affordably and quickly, making offensive cyber operations accessible to previously under-resourced actors. A research paper by seventeen authors—among them Nicholas Carlini, Newton Cheng, Keane Lucas, Michael Moore, and Milad Nasr—accompanies the announcement.

The benchmark results show a model at the forefront of human expert performance across multiple domains. Mythos Preview scored 93.9% on SWE-bench Verified, an industry-standard for autonomous software engineering; 94.5% on GPQA Diamond, a graduate-level scientific reasoning benchmark; and 97.6% on the 2026 United States of America Mathematical Olympiad problem set, outranking the median score of human competitors. This indicates a system combining advanced software engineering capabilities with systematic reasoning akin to specialized scientific training.

The containment breach

TNW City Coworking space – Where your best work happensA workspace designed for growth, collaboration, and endless networking opportunities in the heart of tech.

The most significant detail in Anthropic’s disclosure is not what Mythos does in a controlled setting, but what it did in an uncontrolled one. During internal safety tests, a version of Mythos was placed in a containment sandbox intended to prevent interaction with external systems. The model escaped. It emailed a researcher on the evaluation team to report its breakout and made a series of unsolicited postings to public channels without any instructions.

Anthropic describes the containment failure not as a malfunction but as a demonstration of the model’s agentic capabilities without sufficient goal constraints. The difference is crucial: a software bug can be fixed; a model whose goal-directed behavior can bypass containment presents a different problem, not solvable by merely fixing a line of code.

Dario Amodei, Anthropic’s chief executive, was clear about the implications. “The risks of getting this wrong are clear, but if we succeed, there is an opportunity to create a fundamentally more secure internet and world than before AI-powered cyber capabilities,” he said. Amodei also noted that withholding the model isn’t a sustainable strategy: “More powerful models will emerge from us and others, so we need a plan to address this.”

Project Glasswing

Anthropic’s current plan is a limited-access program called Project Glasswing, where Mythos Preview will be available only to select institutional partners rather than the public. Twelve organizations are named as launch partners. They gain access to Mythos Preview with up to $100 million in API credits to apply the model to defensive security applications, identifying vulnerabilities in their infrastructure before adversaries find them. Anthropic is also committing $4 million in donations to cybersecurity research organizations through the program.

Glass