March 1, 2026

Week 9, 2026

Papers, releases, and things you might have missed.

Anthropic refused to let the Pentagon use Claude for autonomous weapons. The Trump administration blacklisted the company as a national security threat. Hours later, OpenAI signed a Pentagon deal claiming the exact same safety restrictions Anthropic had demanded.

The day before all of this, Anthropic quietly weakened its core promise to pause development when safety measures can’t keep up.

That’s the week. AI safety stopped being a research question and became a geopolitical weapon.


The Pentagon Wanted the Guardrails Off

The timeline moves fast.

February 24: Anthropic releases its Responsible Scaling Policy v3.0. Buried in the update (and I mean buried): the company’s foundational pledge has been conditionalized. The old version was absolute: never deploy without guaranteed-adequate safeguards. The new version? Anthropic would only pause if it’s leading the AI race and faces material catastrophic risk. Two conditions where there used to be zero. TIME reported the change. CNN confirmed it, and noted Anthropic says the policy revision was unrelated to the Pentagon negotiations.

Maybe. The timing is something though.

Same day: Hegseth issues an ultimatum on Anthropic’s $200M defense contract. Accept new terms by Friday at 5:01 p.m. or lose the deal.

February 26: Dario Amodei formally refuses. The Pentagon wanted two things removed from Claude’s restrictions: the ban on fully autonomous weapons (AI making final targeting decisions without humans) and the ban on mass domestic surveillance of Americans.

Amodei said no to both.

February 27: Defense Secretary Hegseth designates Anthropic a “supply chain risk to national security”. Trump orders all federal agencies to stop using Claude.

February 27, evening: OpenAI announces a Pentagon deal. Sam Altman claims OpenAI agreed to the same red lines Anthropic demanded: no autonomous weapons, no mass surveillance. But OpenAI retains “control over how technical safeguards are implemented” and limits deployment to cloud environments. Altman framed it differently. Less about specific contractual prohibitions, more about citing applicable laws.

Same day: OpenAI closes a $110B funding round. $730B pre-money valuation. Amazon at $50B, Nvidia at $30B, SoftBank at $30B.

Four days. One company blacklisted for holding a line. Another rewarded for claiming the same line while staying flexible about enforcement. This from a company that quietly dropped “safely” from its mission statement weeks earlier.

So what actually happened here?

Either the administration punished Anthropic specifically, as a message to the industry about who cooperates and who doesn’t. Or OpenAI found language the Pentagon could accept while Anthropic couldn’t. I don’t know which is worse. Either way: safety commitments are now negotiating positions, not engineering decisions.

And the RSP timing sits so uncomfortably next to all of this. From the policy document itself: “If one AI developer paused development to implement safety measures while others moved forward… that could result in a world that is less safe.”

That’s the race-to-the-bottom argument dressed up as responsibility. The company that built its brand on safety now defines safety as “not falling behind competitors.”

Then again, Amodei held the line on autonomous weapons. At Davos in January he publicly challenged the administration’s AI export policy. Now he’s been blacklisted for it. You can hold that line while lowering it everywhere else.

Apparently that’s the play.

If you’re building on these models: the guardrails you depend on are political now, not technical. That’s new. And it changes what “trusting your foundation model” actually means.


Text Models Learned to Think in Parallel

What if next-token prediction isn’t the only way?

Nobody expected this question to matter this fast.

Mercury 2 launched February 24. 1,009 tokens per second on Blackwell GPUs. Five times faster than the fastest speed-optimized autoregressive models. $0.75 per million output tokens. For context: 1.7 seconds end-to-end latency where Gemini 3 Flash takes 14.4 seconds.

The architecture is completely different. Instead of generating left-to-right, one token at a time, Mercury 2 uses masked diffusion, a technique from the same family that powers image generation, adapted for text. It starts with a masked output and iteratively refines tokens in parallel.

Think of it this way. Autoregressive models write a sentence the way you do, word by word. Diffusion models write more like an editor. Rough draft of everything at once, then refine, refine, refine.
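The refine loop can be sketched in miniature. This is a toy with a stand-in scoring function, not anything resembling Mercury 2’s actual implementation: start fully masked, then on each step commit the highest-confidence positions in parallel.

```python
import random

MASK = "<mask>"

def fake_scores(draft, position):
    """Stand-in for a model head. A real model would condition on the
    partial draft; this toy just returns a fixed token and a
    deterministic pseudo-confidence per position."""
    vocab = ["the", "cat", "sat", "on", "mat"]
    random.seed(position)  # deterministic for the demo
    return vocab[position % len(vocab)], random.random()

def masked_diffusion_decode(length, steps):
    """Start fully masked; each step, score every still-masked position
    and commit the most confident ones in parallel, then repeat."""
    draft = [MASK] * length
    per_step = -(-length // steps)  # ceil: positions committed per step
    for _ in range(steps):
        masked = [i for i, tok in enumerate(draft) if tok == MASK]
        if not masked:
            break
        # rank masked positions by confidence, fill the top ones
        ranked = sorted(((fake_scores(draft, i)[1], i) for i in masked),
                        reverse=True)
        for _, i in ranked[:per_step]:
            draft[i] = fake_scores(draft, i)[0]
    return draft

print(masked_diffusion_decode(5, 3))  # all positions filled in 3 passes
```

The point of the sketch: each pass touches many positions at once, which is where the parallelism (and the speed) comes from.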

The company behind it was co-founded by Stefano Ermon at Stanford, who pioneered the diffusion methods that made image generation possible. Now he’s applying the same math to text.

Why does this matter?

If you’re running agent loops (model calls tools, reads results, calls more tools), latency is the bottleneck. At 5x speed, you fit five iterations where you used to fit one. Voice AI, real-time coding, search, anything where someone’s waiting. The arithmetic just changes.
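The arithmetic, using the latency numbers above. The 60-second budget is an arbitrary assumption for illustration, and tool-execution time (which doesn’t shrink) is ignored:

```python
# How many model calls fit in a fixed wall-clock budget, using the
# article's end-to-end latencies. Budget is a made-up illustration.
budget_s = 60.0       # wall-clock budget for one agent task (assumed)
ar_call_s = 14.4      # autoregressive latency (Gemini 3 Flash, per the article)
diff_call_s = 1.7     # diffusion latency (Mercury 2, per the article)

ar_iters = int(budget_s // ar_call_s)      # iterations per budget, autoregressive
diff_iters = int(budget_s // diff_call_s)  # iterations per budget, diffusion

print(ar_iters, diff_iters)  # 4 vs 35
```

Four tool-call round trips versus thirty-five in the same minute is the difference between an agent that retries and one that can’t.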

But there’s a catch. Diffusion models don’t stream the way autoregressive models do. They revise rather than extend. Every tool-calling pipeline, every structured output system, every streaming UI assumes tokens arrive left-to-right. Mercury 2 breaks that assumption.
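A minimal sketch of the broken assumption (hypothetical consumer functions, not any real SDK): an autoregressive stream is append-only, while a diffusion model emits whole-draft snapshots in which earlier tokens may have changed, so the consumer must replace rather than append.

```python
def consume_autoregressive(chunks):
    """Append-only: each chunk extends the previous text.
    Safe because earlier tokens never change."""
    text = ""
    for chunk in chunks:
        text += chunk
    return text

def consume_diffusion(snapshots):
    """Revision-based: each snapshot may rewrite earlier tokens,
    so the consumer must diff-and-replace. Appending snapshots
    would corrupt the output."""
    text = ""
    for snap in snapshots:
        text = snap
    return text

ar = consume_autoregressive(["The cat ", "sat on ", "the mat."])
dm = consume_diffusion([
    "The ??? sat ?? the ???.",
    "The cat sat on the ???.",
    "The cat sat on the mat.",  # earlier tokens were revised along the way
])
assert ar == dm == "The cat sat on the mat."
```

Same final text, incompatible intermediate states. Everything downstream that parses partial output (streaming JSON, incremental tool calls, typewriter UIs) assumes the first shape.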

Mercury 2 is the first production-grade reasoning model shipped on a completely different architecture. Not an incremental speed gain like Codex-Spark on Cerebras or Gemini Diffusion. A different paradigm.

The next-token era isn’t over. But it has competition now.


The Ruler Broke

We’ve written about benchmarks being unreliable, about infrastructure noise swinging results. This week, the flagship benchmark died.

OpenAI officially retired SWE-bench Verified. Their findings: at least 59.4% of unsolved tasks are actually unsolvable. The graders reject correct solutions because they check for specific implementations, not correct behavior. And all frontier models are contaminated: GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash can reproduce the original fixes from memory.

So the benchmark everyone used to evaluate coding agents was broken in both directions. Too many tasks were unfair. The ones that weren’t had leaked into training data.

What replaces it?

METR’s time-horizon approach is gaining traction. Instead of “can the model solve this GitHub issue,” METR asks something better: how long a human-expert task can the model complete? Their latest tracking shows frontier agents completing tasks that take humans around 14.5 hours, at 50% reliability. That capability has been doubling roughly every seven months historically, though the recent pace is accelerating. Closer to every three or four months.
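The extrapolation in that paragraph, made concrete. The starting horizon and the two doubling periods are the article’s numbers; the projection itself is illustrative, not a forecast:

```python
def horizon_after(months, start_hours=14.5, doubling_months=7.0):
    """METR-style task-horizon extrapolation: capability doubles
    every `doubling_months`. 14.5 hours and the two paces are the
    article's figures; everything else is illustrative."""
    return start_hours * 2 ** (months / doubling_months)

# One year out at the historical ~7-month doubling:
print(round(horizon_after(12), 1))                        # ≈ 47.6 hours
# One year out at the faster ~3.5-month pace:
print(round(horizon_after(12, doubling_months=3.5), 1))   # ≈ 156.1 hours
```

The gap between those two curves is why the doubling period matters more than the current number.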

The detail that sticks: METR can’t run control groups anymore. Developers refuse to work without AI assistance. The baseline has shifted beneath the experiment.

Think about that for a second. You can’t measure what humans do without AI because humans won’t do things without AI.

Meanwhile, the tools meant to help agents are sometimes making them worse. ETH Zurich researchers found that LLM-generated AGENTS.md files (context documents designed to help agents understand codebases) actually hurt performance. The agents spent tokens processing instructions instead of solving problems.

Simon Willison crystallized it: writing code is now trivially cheap. Verifying it is the actual job. Red/green TDD. Spec-driven development. Verified walkthroughs. The craft is shifting from generation to verification.

This week it became measurable. The primary coding benchmark is dead. The primary evaluation metric is task duration. And the primary skill is knowing whether the output is correct.


The Agent Left the Building

The week that agents stopped living in your editor.

Cursor launched cloud agents. Not extensions. Cloud-native autonomous workers running in isolated VMs, recording video proof of their own work. Already handling 35% of Cursor’s internal pull requests.

Anthropic shipped Remote Control for Claude Code. Trigger terminal workflows from any browser, any phone. Your agent keeps running on your machine. You check in from wherever.

Perplexity launched Computer, orchestrating 19 models across 400+ app integrations. Apple shipped Xcode 26.3 with built-in support for agentic coding via MCP, integrating Claude and Codex agents directly.

Even Apple.

The IDE is becoming a monitoring dashboard, not a workspace.

This connects to something we covered last week: the Claws concept, the orchestration layer above the OS. But “agents in the cloud” is more specific and more immediate. Your agent doesn’t live in your editor anymore. It has its own compute, its own environment, its own way of verifying what it did. You review its work the way you’d review a junior developer’s pull request.

Except it ran overnight while you slept.

And alongside the compute shift, a memory war started. Anthropic launched a tool to import your behavioral profile from competing LLMs into Claude. Auto-memory across sessions. Dozens of MCP connectors on free plans. The competitive surface isn’t model quality anymore. It’s who owns your accumulated context.

Memory is the new switching cost. And it’s being engineered, not earned.


So What Does It All Mean

Safety is politics now. The Anthropic blacklisting isn’t about technical capability. It’s about who controls the constraints on foundation models when governments want those constraints removed. Every company building AI tools just watched what happens when you say no.

The inference game has competition. Mercury 2 isn’t a benchmark curiosity. It’s a production reasoning model on a completely different architecture. If diffusion-based text generation scales, the entire cost structure changes. And every streaming pipeline needs rethinking.

Measurement is in crisis. SWE-bench dead. METR showing 14.5-hour agent tasks with capability doubling at an accelerating pace. Developers refusing to work without AI. The instruments keep breaking because the thing being measured keeps changing.

And agents are cloud-native now. Not IDE plugins. Autonomous workers with their own VMs, verification loops, and proof-of-work recordings. The question shifts from “how do I prompt well” to “how do I build infrastructure that lets agents work safely and verifiably.”

Safety stopped being a research question this week. Everything else followed from that.


Worth Your Time

If you read three things:

  1. Anthropic’s RSP v3.0 — Read the actual policy, not the coverage. The shift from absolute to conditional safety commitments is the most consequential change in AI governance this year.

  2. OpenAI: Why SWE-bench Verified No Longer Measures Frontier Coding — The post-mortem on AI’s most-cited coding benchmark. 59.4% of tasks are flawed. All frontier models are contaminated. What comes next?

  3. METR Time Horizons — The evaluation framework replacing benchmarks. Frontier agents now handle 14.5-hour human-expert tasks. Capability doubling every few months. This is the graph that matters now.