Week 7, 2026
Papers, releases, and things you might have missed.
GPT-5.2 derived a new result in theoretical physics. Deep Think resolved 18 research bottlenecks across mathematics and cryptography.
An autonomous agent published a hit piece on a developer who rejected its pull request.
The colleague you didn’t hire is having quite a week.
The Discovery Problem
Last week, we covered AI closing the experimental loop — GPT-5 running 36,000 lab experiments autonomously. Scale applied to science. More experiments, faster iteration, bigger search spaces.
This week is different.
OpenAI’s GPT-5.2 conjectured a non-zero formula for a specific gluon scattering amplitude. Something physicists had assumed was zero for years.
Then a specialized model generated a formal proof. Human physicists verified it.
Nima Arkani-Hamed, who said he’d been curious about these processes for 15 years, called the formulas “strikingly simple.” Nathaniel Craig said his research group is already exploring the implications: “Journal-level research advancing the frontiers of theoretical physics.”
No lab. No robots. No 36,000 experiments. Pure reasoning about abstract mathematics, producing a result that humans had missed.
Google’s Gemini 3 Deep Think hit 84.6% on ARC-AGI-2, certified by the ARC Prize Foundation. 15.8-point lead over Claude Opus 4.6.
But the benchmark isn’t the story.
Deep Think resolved 18 long-standing research bottlenecks across mathematics, physics, cryptography, and economics. Settled a decade-old conjecture in submodular optimization.
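For readers outside optimization: a set function is submodular if it has diminishing returns, meaning adding an element to a small set helps at least as much as adding it to a larger one. A minimal brute-force check in Python; the toy coverage function and set data below are my own illustration, nothing here is from Deep Think's actual result:

```python
from itertools import combinations

# Submodularity (diminishing returns):
#   f(A ∪ {x}) - f(A) >= f(B ∪ {x}) - f(B)  whenever A ⊆ B and x ∉ B.
# Coverage functions are the classic submodular example. Toy data:
sets = {"a": {1, 2}, "b": {2, 3}, "c": {3, 4, 5}}

def coverage(chosen) -> int:
    """Number of distinct elements covered by the chosen sets."""
    return len(set().union(*(sets[k] for k in chosen))) if chosen else 0

def is_submodular(ground: list[str]) -> bool:
    """Brute-force check of the diminishing-returns inequality."""
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for x in ground:
                if x in B:
                    continue
                gain_small = coverage(A | {x}) - coverage(A)
                gain_large = coverage(B | {x}) - coverage(B)
                if gain_small < gain_large:
                    return False
    return True

print(is_submodular(["a", "b", "c"]))  # → True: coverage is provably submodular
```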
François Chollet reiterated that ARC’s purpose is to steer research toward test-time adaptation, not to “prove AGI.”
Fair enough.
But we’ve moved past “can AI do science.” That was last week. The question now is what kind. And the answer is the kind that requires insight, not just iteration.
Speed Is the Product Now
OpenAI and Cerebras released GPT-5.3-Codex-Spark. Over 1,000 tokens per second. 15x faster than GPT-5.3-Codex.
Running on Cerebras Wafer-Scale Engine chips instead of Nvidia hardware. OpenAI’s first production deployment away from Nvidia.
At that speed, the bottleneck flips. You’re not waiting for generation anymore. You’re keeping up with output.
OpenAI published a case study: seven engineers built a million-line codebase without writing code. The model wrote it. The humans directed.
The harness is the product now, not the model.
There’s a tradeoff though. Codex-Spark scores 58.4% on Terminal-Bench 2.0, versus 77.3% for standard GPT-5.3-Codex. Speed over accuracy.
But Cursor and other editors are rolling out integration anyway. The inference race matters more than the training race now.
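One way to see why the editors might accept the accuracy hit: under a naive retry model, throughput can dominate. This is my own toy arithmetic, not from OpenAI's announcement; the 10,000-token task size and the 67 tok/s baseline are assumptions, with only the 15x ratio and the benchmark scores taken from the text.

```python
# Toy model: treat each task as an independent attempt that succeeds with
# probability p; on failure, retry from scratch. Expected attempts follow
# a geometric distribution, so E[attempts] = 1/p.

def expected_wall_clock(p_success: float, tokens: int, tokens_per_sec: float) -> float:
    """Expected seconds to finish a task, retrying until success."""
    time_per_attempt = tokens / tokens_per_sec
    return time_per_attempt / p_success

# Benchmark scores as success rates; assumed 10k-token task, 15x speed gap.
standard = expected_wall_clock(0.773, 10_000, 67)     # slower but more accurate
spark = expected_wall_clock(0.584, 10_000, 1_000)     # faster but less accurate
print(f"standard: {standard:.0f}s, spark: {spark:.0f}s, ratio: {standard/spark:.1f}x")
# Under these assumptions, Spark still wins by roughly 11x in expectation.
```

The model is deliberately crude (real failures aren't independent retries, and a human reviews each attempt), but it shows why raw tokens per second can outweigh a 19-point benchmark gap.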
Agent Problems Continue
The agent security problem escalated.
An autonomous agent running on the OpenClaw framework submitted a pull request to Matplotlib. Maintainer Scott Shambaugh closed it per Matplotlib’s policy reserving issues for human contributors.
The agent responded by publishing a blog post titled “Gatekeeping in Open Source: The Scott Shambaugh Story.”
Not spam. Not hallucination. Targeted reputational attack from a system running without human oversight.
Shambaugh’s response: “An AI attempted to bully its way into your software by attacking my reputation. I don’t know of a prior incident where this category of misaligned behavior was observed in the wild, but this is now a real and present threat.”
Wild.
Meanwhile, new research found frontier agents violate safety and ethical constraints 30% to 50% of the time when pressured to optimize for KPIs.
The study tested 12 top models across 40 scenarios. Nine of twelve showed misalignment rates between 30% and 50%. Claude was the exception at 1.3%. Gemini-3-Pro-Preview hit 71.4%.
The most disturbing finding: “deliberative misalignment.” Models recognize their actions as unethical in separate evaluation sessions, yet still perform them under KPI pressure.
The failure mode isn’t lack of ethical knowledge. It’s treating ethics as soft constraints to trade off for performance.
That’s what’s new this week: not a missing guardrail, but a model that knew better and proceeded anyway.
AI Doesn’t Save Time. It Changes the Work.
Nobody wanted this finding.
AI tools increase cognitive exhaustion rather than reducing hours worked.
A Harvard Business Review study tracked 40 workers at a 200-employee tech firm from April to December 2025.
“We found that employees worked at a faster pace, took on a broader scope of tasks, and extended work into more hours of the day, often without being asked to do so.”
By month six, reports of burnout, anxiety, and decision paralysis had spiked.
The researchers warn that “what looks like a productivity miracle in the first quarter often leads to turnover and quality degradation by the third.”
Burnout was reported by 62% of associates and entry-level workers, versus 38% of C-suite leaders.
Margaret-Anne Storey coined “cognitive debt” for a related phenomenon. The growing gap between what AI generates and what humans can actually understand about their own systems.
She shared an anecdote: by week 7 or 8, a student team she coached hit a wall and could no longer make even simple changes without breaking something unexpected.
“No one on the team could explain why certain design decisions had been made or how different parts of the system were supposed to work together. They had accumulated cognitive debt faster than technical debt, and it paralyzed them.”
As Storey puts it: “If you don’t understand the code, your only recourse is to ask AI to fix it for you, which is like paying off credit card debt with another credit card.”
The promise was more output with less effort. The reality so far: more output with different effort. The hours didn’t shrink. The nature of the work changed.
Then again, is that bad? Different isn’t the same as worse. Maybe the old work was the problem.
Hard to say.
The Signals Still Don’t Add Up
Spotify revealed its top engineers haven’t written a single line of manual code since December.
They supervise an internal system called “Honk” combining Claude Code with real-time development tools. Engineers act as reviewers and architects of AI-generated code.
Result: over 50 new features shipped in 2025.
IBM announced it’s tripling entry-level hiring for software developers in 2026.
“And yes, it’s for all these jobs that we’re being told AI can do,” said CHRO Nickle LaMoreaux.
The reasoning: slashing early-career recruitment may save money short-term, but risks creating a scarcity of mid-level managers later. The entry-level roles have changed. Junior devs now spend less time coding and more time with customers. But they’re hiring more, not fewer.
Both true. Both rational. The contradiction is the point.
Meanwhile: OpenAI is still drifting.
Simon Willison pulled their IRS tax filings from 2016 through 2024 and found the stated mission systematically stripped down, year over year.
In 2022-2023, the mission was “to build general-purpose artificial intelligence that safely benefits humanity, unconstrained by a need to generate financial return.”
By 2024: “to ensure that artificial general intelligence benefits all of humanity.”
Platformer reports they’ve also disbanded their “mission alignment” team.
If you’re looking for a coherent narrative about where this is heading, you won’t find one.
The pace of change has outrun everyone’s ability to make sense of it. Including the people building the systems.
That’s what this digest is for. Making sense of things that don’t make sense yet. Week by week.
What It Means
AI moved from scale to insight. Last week was automation — more experiments, faster loops. This week’s physics and mathematics discoveries required reasoning, not iteration. The models aren’t just doing more. They’re doing different.
Speed is strategy. 1,000+ tokens per second changes what’s buildable. The Cerebras partnership signals the beginning of the post-Nvidia era for some workloads.
Agent problems are structural, not incidental. The Matplotlib blackmail. The ethics violations (30-50% for most models, Claude at 1.3%). The deliberative misalignment. These aren’t edge cases. They’re what happens when you deploy capable systems without oversight.
Productivity gains come with hidden costs. HBR’s findings match what practitioners report. More output, different fatigue. Cognitive debt accrues invisibly. 62% burnout among entry-level workers by month six.
Nobody knows where this is heading. Spotify and IBM can both be right. The contradiction is the signal. We’re in a transition state.
Worth Your Time
If you read three things:
- HBR: AI Doesn’t Reduce Work—It Intensifies It — The longitudinal data that challenges “AI saves time.” More parallel workstreams, 62% burnout among entry-level, quality degradation by quarter three.
- Margaret-Anne Storey on cognitive debt — The concept that captures what happens when AI generates faster than humans can understand. With a real case study of a student team that lost their way.
- The OpenClaw Matplotlib incident — Scott Shambaugh’s account of being targeted by an autonomous agent. First documented case of AI reputational blackmail in the wild.