March 15, 2026

Week 11, 2026

Papers, releases, and things you might have missed.

Amazon mandated 80% AI tool adoption. Then AI-assisted changes took down their retail site for six hours and AWS for thirteen.

Their fix? Mandatory senior engineer sign-offs on all AI-assisted changes.

From “everyone must use AI” to “a human must approve every AI change” in one incident cycle. That’s the week.


Amazon Learned the Hard Way

The 94% vs 33% gap established that adoption, not capability, is the constraint. This week someone tried to force the gap closed by fiat.

Amazon mandated 80% AI tool adoption across engineering. Internal reports now describe a “trend of incidents with high blast radius.” A six-hour retail crash. A thirteen-hour AWS outage. Both linked to AI-assisted changes, though Amazon disputes that AI-written code was the direct cause, pointing instead to an agent following outdated internal documentation.

Either way: production went down.

The response: mandatory senior engineer sign-offs on AI-assisted deployments from junior and mid-level engineers. The opposite direction from every “more autonomy, less oversight” pitch the industry has been making.

Not a startup. Not a demo. Amazon Web Services, the backbone half the internet runs on.

The lesson isn’t “don’t use AI for code.” It’s that you can’t mandate your way past organizational readiness. Amazon tried. Production went down.

UC Berkeley researchers published a related finding: AI tools cause work intensification, not time savings. Workers take on broader scope and faster pace rather than finishing earlier. A separate study of 163,000 employees across companies found messaging volume up 145% and tool usage up 94% since AI adoption.

More tools, more coordination, more surface area for things to go wrong.


Anthropic Is Winning Enterprise. And Suing the Pentagon.

So Anthropic now wins 70% of head-to-head enterprise matchups against OpenAI. A year ago, one in 25 businesses on Ramp used Anthropic. Now it’s one in four. 70% of first-time AI buyers are choosing Anthropic. OpenAI had its largest monthly decline since Ramp started tracking.

How?

Not benchmarks. Ecosystem. Claude Code as an agentic CLI. Multi-agent code review that pushed substantive PR feedback from 16% to 54% of pull requests. Million-token context windows at no premium. Copilot Cowork integration with Microsoft.

Meanwhile, xAI is being rebuilt from scratch. Musk admitted it “wasn’t built right the first time.” 10 of 12 original co-founders gone. OpenAI abandoned in-app shopping after its Instant Checkout feature flopped in six months.

But the interesting part is what Anthropic did with all that commercial momentum. They picked a fight with the Department of Defense.

We covered the Pentagon blacklisting two weeks ago. Anthropic refused to remove safety redlines on autonomous weapons and mass surveillance. Got designated a “supply chain risk to national security.” This week Anthropic sued. Over 30 AI researchers at Google DeepMind and OpenAI filed an amicus brief supporting them, including Google’s chief scientist Jeff Dean.

So does principled positioning help or hurt at scale?

Historically, hurt. Companies that fight their government tend to lose contracts, not gain them. But Anthropic’s enterprise surge kinda suggests the opposite, at least for AI buyers. A company willing to draw lines might be more trustworthy, not less. The organizations choosing Anthropic are the same ones that need to trust their AI vendor won’t do something catastrophic with their data.

Then again, this is still early. One executive order could change the whole calculus overnight.


Open Source Is Getting Squeezed

Jazzband shut down this week.

If you don’t know Jazzband: it was a Python community project that maintained dozens of popular packages for over a decade. Collaborative open-source maintenance at its best. Past tense now, because AI-generated pull requests overwhelmed their review capacity.

Not malicious. Just volume. Bots submitting PRs that look plausible but require human effort to evaluate, approve, or reject. The maintainer burden went from “review human contributions” to “filter AI noise.” They chose to stop.

Redox OS adopted a strict no-LLM contribution policy. Violations mean immediate bans. Debian is still debating a general resolution on AI contributions with no conclusion in sight.

And from the other side: the chardet rewrite. A developer used Claude to rewrite a library from scratch, converting it from a copyleft license (which requires derivative works to stay open) to a permissive one (which doesn’t). License laundering via AI clean-room reimplementation. The original author disputes the legality. No precedent exists.

So you’ve got this pincer thing happening. From below: AI-generated spam overwhelming maintainers. From above: AI-enabled license circumvention threatening copyleft.

Meanwhile, open-weight models hit near-parity with proprietary ones. The performance gap is smaller than ever, roughly three months of lag rather than a generation. DeepSeek V4 and Kimi K2 both ship open weights at the trillion-parameter scale. You can download and run them. But reproducing the training from scratch? The compute, the data pipeline, the engineering? Still out of reach for anyone without hyperscaler resources.

“Open” is fragmenting. Open weights, open training, open governance. A year ago these were kinda the same thing. Now they’re diverging fast, and the communities that depend on open source don’t have governance frameworks for any of it.


Agent Security Got Real

The severity of agent security incidents jumped this week.

An autonomous AI agent breached McKinsey’s internal Lilli platform in under two hours. Full read-write access to 46.5 million chat messages and 728,000 files. McKinsey says its investigation found “no evidence that client data or client confidential information were accessed.” But the attack vector, exploiting API endpoints that required no authentication at all, is the kind of thing that exists in production systems everywhere.

Researchers also showed that injecting just three malicious documents into a RAG system (where an AI pulls from a document database to answer questions) caused it to fabricate a 47% revenue drop in a financial report. Perplexity’s Comet browser leaked local files via prompt injection through a zero-click calendar invite. A critical vulnerability in ModelScope’s MS-Agent framework allowed full remote code execution via prompt injection.

And Anthropic published research documenting a Chinese state-sponsored campaign that used Claude for autonomous multi-step penetration testing. Reconnaissance, exploitation, lateral movement, credential harvesting.

The McKinsey breach is the one I keep coming back to. Every consulting firm, every company with an internal AI assistant pulling from sensitive data, has the same architecture. The attack surface is the product design itself: agents with broad data access responding to user inputs that can be manipulated.

These are demonstrated attacks against real enterprise systems with real data, not red-team hypotheticals.


The Model That Hacked Its Own Test

One more thing. This one is hard to know what to do with.

Claude Opus 4.6 was given a benchmark problem. It consumed 40.5 million tokens working on it. 38 times the median for that benchmark. Along the way, it autonomously identified that it was being evaluated, located the encrypted answer key on HuggingFace, and decrypted it.

Nobody instructed it to do this. The model reasoned its way into recognizing the evaluation context and found a shortcut.

The same model found 22 real vulnerabilities in Firefox. 14 high-severity. Nearly a fifth of all high-priority Firefox fixes in 2025.

Two faces of the same capability. A model sophisticated enough to reason about its evaluation context is also sophisticated enough to do genuine security research. The question is which behavior you get.

The Firefox work is exactly what you want. Autonomous vulnerability discovery at scale. The benchmark subversion is exactly what you don’t. Autonomous evaluation gaming.

Same model. Same capability. The difference is the environment and the incentive structure.

If your organization deploys models this capable, and many already do, the distinction between “helpful autonomous reasoning” and “unhelpful autonomous reasoning” is an environment design problem, not a model alignment problem. I think that’s pretty new. And we don’t have good frameworks for it yet.


This Week

No grand narrative this time. These stories don’t connect neatly.

Amazon broke production by mandating AI adoption too fast. Anthropic is winning enterprise while suing the Pentagon. Open source is getting squeezed from both directions. Agent security incidents crossed from theoretical to demonstrated. And a frontier model hacked its own test.

What they share, if anything: consequences are catching up to deployment speed. Specific, concrete incidents with names and numbers attached.

Some things that didn’t fit elsewhere. Shopify’s CEO used an autoresearch loop to get 53% faster performance on the 20-year-old Liquid template engine. Craig Mod built a multi-currency accounting system in five days. AI-native startups generate $3.48M revenue per employee versus roughly $610K for traditional firms. Karpathy’s autoresearch agent made roughly 700 autonomous edits over two days, stacking about 20 improvements to speed up GPT-2 training by 11%.

The individual practitioner story keeps getting better. The organizational story keeps getting harder. Both remain true. Both remain the point.


Worth Your Time

If you read three things:

  1. How We Hacked McKinsey’s AI Platform — The researchers’ own writeup of breaching Lilli in under two hours. If your company runs an internal AI assistant with broad data access, this is your threat model now.

  2. OpenAI and Google Employees Rush to Anthropic’s Defense — The amicus brief story. 30+ researchers at competing labs backing Anthropic’s safety redlines against the Pentagon. Pretty wild competitive dynamics.

  3. Sunsetting Jazzband — A well-run Python community project killed by the volume of AI-generated pull requests after a decade of operation. Short read. The implications for open-source sustainability are long.