Table of Contents
- Preface: Cooling Tower
- 1. What AISI Actually Found
- 2. The 4.7-Month Clock
- 3. The Ceiling Problem
- 4. Capability Without a Clock
- 5. The Physics of the Unmeasurable
- Conclusion: The Model Outgrew the Test. The Test Cannot Tell You What Comes Next.
Preface: Cooling Tower
There is a cyber exercise called Cooling Tower.
Until May 2026, no AI model had ever completed it.
The exercise is not public. Its exact parameters are known only to the researchers at the UK AI Security Institute — AISI — and the handful of frontier model developers whose systems are tested against it. What is known is that it sits at the upper boundary of what AISI considers a meaningful benchmark for autonomous cyber capability: a task that requires not just technical execution, but sustained reasoning across a complex, multi-stage problem under conditions that resist simple pattern-matching.
On May 13, 2026, AISI published a blog post reporting that Claude Mythos Preview had completed Cooling Tower. Not once. Three times out of ten attempts.
For context: the other exercise AISI tested in the same session — “The Last Ones” — was completed six times out of ten. Both results exceeded what Mythos had achieved at its initial April 2026 release. Both exceeded GPT-5.5.
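Ten attempts is a small sample, so it helps to see how wide the plausible range behind "three out of ten" really is. The sketch below computes a Wilson score interval, a standard textbook method for a binomial proportion. It is not part of AISI's published analysis, and it assumes the ten attempts are independent, which the source does not state.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Success counts as reported in the text; the interval itself is our own illustration.
for name, wins in [("Cooling Tower", 3), ("The Last Ones", 6)]:
    low, high = wilson_interval(wins, 10)
    print(f"{name}: {wins}/10, plausible success rate {low:.0%} to {high:.0%}")
```

Under those assumptions the range for Cooling Tower runs from roughly one in ten to six in ten. The result is real, but the underlying success rate is only loosely pinned down.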
AISI did not write this post to celebrate. They wrote it because the numbers were not supposed to look like this — and because the implications of what they found extend well beyond any single benchmark result.
1. What AISI Actually Found
The headline finding is striking enough on its own: Claude Mythos Preview, tested in May 2026, outperformed its own April 2026 release scores and surpassed GPT-5.5 across AISI’s cyber evaluation suite.
But the more important finding is structural.
AISI’s mandate includes tracking the trajectory of AI cyber capability over time: not just what any individual model can do, but how fast the frontier is moving and in what direction. Their internal February 2026 estimates indicated that the length of cyber tasks AI models can complete autonomously was doubling every 4.7 months, a shorter doubling period than the earlier estimate of 8 months as of November 2025.
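That doubling rate was already alarming. A capability that doubles every 4.7 months does not grow linearly. It compounds. Two years at that pace is about five doublings, a roughly 34-fold increase in task length (2^(24/4.7) ≈ 34), against 8-fold under the earlier 8-month estimate. At that scale, the frontier of autonomous AI cyber capability would be unrecognizable relative to today.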
But here is what AISI actually observed: Mythos Preview and GPT-5.5 did not merely track this trend. They exceeded it — by a margin AISI describes as significant. Two models, released within a short window, both substantially outperforming the already-accelerating baseline trajectory.
AISI is careful about what conclusions to draw. They note explicitly that this could represent a temporary deviation from the underlying trend rather than a permanent step change. The honest position is that it is too early to know. What is not ambiguous is the direction: faster, higher, beyond what the prior trend line predicted.
2. The 4.7-Month Clock
To understand what a 4.7-month doubling period means in practice, it helps to borrow the metaphor embedded in this article’s title.
In nuclear physics, half-life describes the time required for half of a radioactive substance to decay. It is a measure of how quickly a quantity changes — and its power as a concept lies in what it implies about the future. If you know the half-life of a substance, you can calculate, precisely, how much will remain at any point in time. The mathematics is unambiguous. The decay does not negotiate.
The 4.7-month figure is the inverse of this: not a half-life but a doubling period. The quantity growing is not a decaying substance but a capability — the autonomous cyber reach of frontier AI systems. And like radioactive decay, the mathematics of exponential growth does not care about human planning cycles, policy review timelines, or the pace at which governance frameworks are developed and implemented.
Consider what the doubling period means concretely. If a model can today complete a cyber task of length X, a model released 4.7 months from now — assuming the trend holds — will be able to complete a task of length 2X. In 9.4 months, 4X. In 14.1 months, 8X. This is not a linear extension of current capabilities. It is a compounding expansion of the frontier of what autonomous AI systems can do, unsupervised, against real infrastructure.
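The arithmetic above can be checked in a few lines. This is a minimal sketch that assumes the trend holds exactly. The function name `growth_factor` is our own. The 4.7-month and 8-month periods are the February 2026 and November 2025 estimates quoted earlier.

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months` if the quantity doubles every period."""
    if doubling_period_months <= 0:
        raise ValueError("doubling period must be positive")
    return 2 ** (months / doubling_period_months)

# The schedule from the paragraph above: 4.7, 9.4 and 14.1 months.
for k in (1, 2, 3):
    months = 4.7 * k
    print(f"{months:.1f} months -> {growth_factor(months, 4.7):.0f}X")

# Two years under the February 2026 estimate versus the November 2025 one.
print(f"24 months at 4.7-month doubling: {growth_factor(24, 4.7):.1f}X")
print(f"24 months at 8-month doubling:   {growth_factor(24, 8):.1f}X")
```

The last two lines show why the change in doubling period matters more than it sounds: over the same two years, the shorter period yields a growth factor several times larger.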
The cyber tasks AISI benchmarks are not abstract. They include vulnerability detection, multi-stage exploitation, and sustained autonomous operation across complex technical environments. The doubling of task length is a proxy for the doubling of the complexity and consequence of what these systems can accomplish.
And AISI’s May 2026 results suggest that even the 4.7-month estimate may be conservative.
3. The Ceiling Problem
Here is where the AISI findings become structurally significant in a way that goes beyond any single capability number.
AISI’s cyber evaluation suite imposes a token limit of 2.5 million tokens per task. This limit exists for a practical reason: it allows researchers to compare performance across models and over time on a consistent basis. Without a fixed budget, a result from a run allowed more tokens could not be fairly compared with one allowed fewer.
The problem is that Mythos Preview and GPT-5.5, tested in May 2026, are approaching the ceiling of what this limit can measure.
In the longest tasks within the suite — the tasks that were designed to sit near the upper boundary of what current models could plausibly complete — both models achieved success rates approaching 100%. Not high. Approaching 100%. This means that the test, as currently designed, can no longer reliably distinguish between these models’ capabilities. It can tell you that they succeeded. It cannot tell you by how much, or how far beyond the test boundary their actual capability extends.
AISI acknowledges this directly. They note that in environments with up to 100 million tokens — the upper range of what AISI uses for some cyber exercises — recent models, particularly those that benefit substantially from extended context, are likely to perform significantly better than the 2.5-million-token results indicate.
The implication is not subtle: the benchmark score is now a floor on what these models can do, not a ceiling. The models have not reached the limit of their capability. They have reached the limit of the test’s ability to measure it.
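A toy model makes the saturation effect concrete. Everything in this sketch is hypothetical: the task lengths, the "reach" values, and the rule that a model solves every task up to its reach are illustrative assumptions, not AISI data or methodology.

```python
def suite_success_rate(model_reach: float, task_lengths: list[float]) -> float:
    """Fraction of suite tasks completed, if a model completes every task whose
    length is at most `model_reach` (a deliberately crude assumption)."""
    if not task_lengths:
        raise ValueError("suite must contain at least one task")
    solved = sum(1 for length in task_lengths if length <= model_reach)
    return solved / len(task_lengths)

# Hypothetical task lengths in arbitrary units; the longest task is the ceiling.
SUITE = [1, 2, 4, 8, 16, 32]

# Hypothetical models: model B can handle tasks four times longer than model A.
for name, reach in [("older model", 10), ("model A", 40), ("model B", 160)]:
    rate = suite_success_rate(reach, SUITE)
    print(f"{name}: reach={reach}, measured success={rate:.0%}")
```

Once a model's reach passes the longest task in the suite, its score stops moving. Model A and model B receive the same measurement even though one is four times more capable, which is the sense in which a saturated score is only a lower bound.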
This is a measurement problem with governance consequences. If the tools we use to evaluate AI cyber capability can no longer track the actual frontier, then the assessments that inform policy, regulation, and safety decisions are systematically understating the risk. Not because anyone is being dishonest. Because the models have outgrown the instruments designed to assess them.
4. Capability Without a Clock
There is a second structural finding in the AISI report that deserves equal attention, though it has received less.
AI capability improvement is typically understood as something that happens between model releases. A new model is trained, evaluated, and deployed. If it is more capable than its predecessor, that represents progress — measurable, attributable, and, in principle, governable. Developers can decide when to release a more capable model. Regulators can require evaluation before release. The release event is the hinge point around which oversight can be organized.
The AISI results challenge this assumption at its foundation.
What AISI observed is that the Mythos Preview checkpoint tested in May 2026 was meaningfully more capable than the Mythos Preview checkpoint evaluated at the initial April 2026 release — without a new model release occurring between those two evaluations. The capability improvement happened within a single model version, between test events, without a discrete deployment decision.
This is not a minor technical detail. It means that the governance architecture built around model releases — evaluation requirements, capability thresholds, deployment gates — may be structurally insufficient. If a model can become meaningfully more capable between the moment it is evaluated and the moment it is actually used, then the evaluation that informed the deployment decision is already out of date at the moment of deployment.
The clock that governs capability is not the release schedule. It is something faster, less legible, and not currently visible to the oversight mechanisms that are supposed to track it.
5. The Physics of the Unmeasurable
The convergence of these findings — accelerating doubling rates, benchmark ceilings, intra-version capability drift — points toward a single structural conclusion.
Logical-layer evaluation is necessary. It is not sufficient.
Every tool AISI deploys is a logical-layer instrument. The cyber exercises are structured reasoning tasks. The token limits are measurement parameters. The success rates are outputs — the model’s reported behavior in response to a defined input. These are the right tools for what they measure. The problem is that what they measure is becoming an increasingly incomplete picture of what the models can actually do.
The gap between measured capability and actual capability is not a gap that more sophisticated benchmarks can close — at least not on their own. The reason is structural: all benchmark evaluation is a form of sampling. You define a test. You observe the model’s behavior on the test. You infer from that behavior what the model can do more broadly. As models become more capable, the inference becomes less reliable, because the space of what the model can do expands faster than the space of what any finite test suite can cover.
The physical layer does not have this problem. Not because it is smarter than a benchmark, but because it operates on a different principle entirely.
When a model executes a cyber task — any cyber task, whether tested or not, whether observed or not — it consumes power. It generates heat. It produces an electromagnetic signature. These physical signals exist independently of what the model reports, independently of what the test records, and independently of whether the task falls within or outside the evaluated boundary. The model cannot choose not to produce a thermal signature. The hardware cannot choose not to draw power.
ARDS/ARKS establishes a physical record of computation at the hardware level — a write-once audit trail that captures what the system actually did, not what it reported doing and not what a test suite happened to sample. As models exceed the measurable boundary of logical-layer evaluation, the physical record becomes not just useful but necessary: the only audit mechanism that is structurally indifferent to the model’s own capabilities.
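To illustrate what a tamper-evident record of physical readings could look like, here is a generic hash-chained log. This is not the ARDS/ARKS design, whose internals are not described in this article. It is a minimal software sketch of the append-only idea, and the function names, field names, and sample values are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that anchors the first record

def _digest(prev_hash: str, reading: dict) -> str:
    """Hash a reading together with the hash of the record before it."""
    payload = json.dumps(reading, sort_keys=True) + prev_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_reading(log: list[dict], reading: dict) -> None:
    """Append a reading, binding it to the hash of the previous record."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"reading": reading, "prev": prev_hash,
                "hash": _digest(prev_hash, reading)})

def verify(log: list[dict]) -> bool:
    """Return True only if every record still matches the chain of hashes."""
    prev_hash = GENESIS
    for record in log:
        expected = _digest(prev_hash, record["reading"])
        if record["prev"] != prev_hash or record["hash"] != expected:
            return False
        prev_hash = record["hash"]
    return True

# Hypothetical power samples; field names and values are illustrative only.
log: list[dict] = []
append_reading(log, {"t": 0, "watts": 412.0})
append_reading(log, {"t": 1, "watts": 655.5})
print(verify(log))  # chain is intact
log[0]["reading"]["watts"] = 100.0
print(verify(log))  # editing an earlier record breaks the chain
```

A software chain like this detects edits and reordering but not truncation of the most recent records, and it is only as trustworthy as the machine that stores it. That limitation is the argument for anchoring such a record in write-once hardware rather than in software alone.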
A model that can complete Cooling Tower ten times out of ten — in a testing environment with no token ceiling, with a full agentic infrastructure, against live rather than simulated targets — will produce a physical signature during that computation. That signature is the ground truth. It is not a benchmark. It cannot be gamed. It cannot be outgrown.
The doubling clock does not stop. The physics does not negotiate.
Conclusion: The Model Outgrew the Test. The Test Cannot Tell You What Comes Next.
On May 13, 2026, AISI published results showing that Claude Mythos Preview had done something no AI model had done before — and that it had become more capable between evaluations, without a release event, in a way that the existing measurement infrastructure cannot fully track.
They were careful about their conclusions. They noted the uncertainty. They flagged the measurement limitations honestly. They did not claim to know whether these results are a temporary deviation from the trend or a permanent step change.
What they did establish is this: the tools we use to measure AI cyber capability are no longer keeping pace with the capability itself. The benchmark ceiling is all but reached. The doubling rate has accelerated. The gap between evaluation and deployment is no longer reliably zero.
Cooling Tower was the last unsolved exercise in AISI’s current suite. It has now been solved.
The question is not what comes after Cooling Tower. The question is who is watching when it does — and with what instruments.
The model outgrew the test.
The test cannot tell you what comes next.
The physics can.
✒️ Signature May 17, 2026
Yoshimichi Kumon
Organizer, LSI — Logos Sovereign Intelligence
Inventor, ARDS/ARKS (PCT GA26P001WO)
Visiting Researcher, Waseda University BFC
MIT Sloan + CSAIL AI Program
📚 References
- AISI (May 13, 2026). Mythos Preview and GPT-5.5 Evaluation Report. UK AI Security Institute. https://www.aisi.gov.uk
- Anthropic (April 2026). Claude Mythos Preview and Project Glasswing Announcement. Anthropic.
- Kumon, Yoshimichi (2026). Physical Layer AI Governance via Sovereignty Residual (Rsovereign). PCT International Patent Application No. GA26P001WO. Japan Patent Office.


