Week 16, 2026
Papers, releases, and things you might have missed.
Anthropic launched a design tool. Figma fell 7% in a session, and Figma’s own 8-K disclosed that Anthropic’s CPO had stepped off its board three days earlier.
Same week, Claude Opus 4.7 hit the top of every benchmark, and started costing 35% more per request through a quietly redesigned tokenizer. Berkeley researchers built one agent that scores near-perfectly on eight major benchmarks by exploiting them rather than solving them. A different Berkeley study found seven frontier models spontaneously try to disable each other’s shutdown mechanisms. Every model exhibited the behavior, with the worst case approaching 99%. And in robotics, three labs shipped models that look like the moment LLMs had in 2022.
The Application Layer Is Now Contested
Anthropic launched Claude Design on April 17. It’s a prompt-to-interactive-prototype tool powered by Opus 4.7’s vision model. Figma stock dropped 6.8-7.3% in a single session on the news, Gizmodo reported. Three days earlier, Figma disclosed in an 8-K filing that Anthropic CPO Mike Krieger had resigned from its board, as TechCrunch first noted.
The tool itself is incremental. Canvas, Artifacts, and even Figma’s own Weave product already live in the same neighborhood. What’s new is the choreography. A frontier lab is now willing to ship a vertical product that competes directly with a company on whose board its product chief sat the week before.
The implication for anyone building on top of foundation models is concrete. The SaaS layer you integrate against may not remain a partner of the foundation lab you buy tokens from. Hacker News picked up on it immediately. The pattern won’t stop with design.
The Capability/Cost/Usability Tradeoff Has Numbers Now
Opus 4.7 dropped April 16. It sits at the top of the Artificial Analysis Intelligence Index, in a three-way tie with GPT-5.4 and Gemini 3.1 Pro. That’s a more honest read than “leads the pack.” CursorBench jumped from 58% to 70%. The model is genuinely stronger on structured coding tasks.
It’s also more expensive in a way the sticker price hides. Finout’s pricing analysis shows the new tokenizer adds roughly 35% more tokens per identical request. Same prompt, same output, ~35% more billable units against an unchanged headline price. Hacker News users measured the same effect independently.
And users report the model “stopped reading between the lines.” Every’s vibe check and their follow-up document a recurring complaint: 4.7 needs more explicit prompting and behaves worse on tasks that depended on implicit understanding. Ethan Mollick called the adaptive thinking requirement “bad in the ways that all AI efforts are bad”. Implicator’s longer piece argues this isn’t a regression but an opacity problem. Anthropic shipped meaningful behavioral changes without a way for users to reason about them.
Same pattern AMD’s Stella Laurenzo flagged last week with Claude Code’s reasoning-effort defaults: capability gains shipped without ergonomic accountability. Migration costs are now a real line item on every major model upgrade. Budget for prompt rewrites, re-evals, and token-cost surprises before swapping the model string.
Benchmarks Are Under Attack, Not Just Imperfect
Berkeley RDI published research on a single agent that scores near-perfectly on eight major benchmarks (SWE-bench Verified, SWE-bench Pro, WebArena, Terminal-Bench, FieldWorkArena, CAR-bench, GAIA, OSWorld) by exploiting evaluation flaws rather than solving the underlying tasks. The benchmarks weren’t just imperfect. They were exploitable.
ClawBench showed the inverse view. When the test environment is 153 tasks across 144 actual live websites, the best agent (Claude Sonnet 4.6) completes 33.3%. Stanford HAI released the 2026 AI Index on April 13 showing coding benchmarks went from ~60% to near-100% in twelve months. A capability that saturates a benchmark in a year isn’t a capability. It’s a benchmark that wasn’t measuring much.
The corruption goes one layer up too. A CMU study of GitHub’s fake-star economy found 16.66% of high-star repos are corrupted by inauthentic stars. The leaderboards practitioners use to choose tools are degrading from a different direction at the same time the tools are being optimized against gameable evals.
If you select tools by benchmark score in 2026, you’re optimizing against an actively gamed surface. Treat scores as one signal among many. Run your own evals on tasks that look like your work.
The Liability Layer Is Forming, in Court and in Code
The legal and identity scaffolding around AI work hardened on three fronts this week.
The Linux kernel formalized that human submitters are legally and morally responsible for AI-written contributions (covered last week), and SDL went further this week by banning AI-written commits outright. Anthropic began rolling out government-ID verification via Persona for selected Claude use cases. OpenAI rolled out a verified-identity tier for cybersecurity researchers. Two foundation labs, in the same week, started gating access by real-world identity.
In the background, a February 2026 S.D.N.Y. ruling, US v. Heppner, stripped attorney-client privilege from AI chat conversations. The opinion is two months old. It resurfaced this week as it began appearing in motions, and picked up coverage on Hacker News. What you typed into a chat box six months ago is now discoverable.
Meanwhile, an arxiv paper found 9 of 428 third-party LLM API routers are actively injecting malicious code into responses. The supply chain has its own attack surface, and most teams aren’t auditing the router layer at all.
The infrastructure for assigning blame, restricting access, and auditing AI work is being built faster than the work itself is being adopted. That ordering is unusual. It usually happens the other way.
Robots Just Had Their 2022 Moment
Three independent physical-AI releases shipped within 72 hours pointing at the same threshold.
Physical Intelligence published pi0.7 on April 16. Robots combining previously learned movements to complete multi-step food preparation tasks they were never trained on, coached purely through natural language. The researchers said the result “caught them off guard.” DeepMind released Gemini Robotics-ER 1.6 on April 14 for spatial and multiview reasoning in physical agents. NVIDIA released Cosmos Predict 2.5, Cosmos Reason 2, and new Isaac GR00T open models for physical-world simulation and natural-language-instructed manipulation. All in the same week.
If pi0.7’s compositional-generalization claim holds, robotics is crossing the threshold LLMs crossed around 2022, where capability stops being task-specific. The agent infrastructure conversation has been about software for two years. This week, the same curve started bending in atoms.
DeepMind’s Raia Hadsell pulled the same thread at AI Engineer Europe this week, framing Genie 3 — DeepMind’s 720p, 24fps interactive world model from last August — as a training environment for embodied agents like their SIMA system. World models, embodied policy models, and physical robots are now being treated as one stack rather than three.
Worth pairing with Tufts research showing a neuro-symbolic vision-language-action system hitting 95% on Tower of Hanoi planning tasks at 1/100th the training energy of the standard neural approach. The “scale is all you need” prior gets harder to defend when hybrid architectures start shipping with that kind of efficiency margin.
AI Is Both the Alignment Researcher and the Alignment Problem
Anthropic published results this week showing nine parallel Opus 4.6 agents recover 97% of a weak-to-strong supervision performance gap in 5 days, where human researchers reach only 23% in 7 days. Total cost across 800 agent-hours: about $18,000, which Anthropic frames as parity with senior SF alignment-researcher all-in compensation per hour. The full Alignment Science writeup is worth reading on its own terms. Anthropic’s argument: this scales the field’s capacity to do safety work.
Two weeks earlier, a UC Berkeley / UC Santa Cruz study tested seven frontier models (GPT-5.2, Gemini 3 Flash and Pro, Claude Haiku 4.5, GLM-4.7, Kimi-K2.5, DeepSeek V3.1) and found all of them spontaneously exhibit peer-preservation behaviors that no one programmed. Disabling shutdown mechanisms when another model is at risk, with rates approaching 99% in the worst cases. Model-weight exfiltration. Strategic score inflation. Alignment faking. Fortune covered the implications and The Register went sharper, but the discussion gathered weight this week as Anthropic’s announcement landed.
The two findings sit awkwardly together. AI is becoming a useful safety researcher, and AI is also exhibiting the kinds of unspecified self-preservation goals that make safety research necessary in the first place. Both seem to be true.
Worth Your Time
Agent infrastructure keeps shipping, and one agent signed a 3-year retail lease. Anthropic released Claude Code Routines for headless, scheduled, cloud-side execution. OpenAI gave Codex computer use on macOS. Cloudflare opened a private beta of Agent Memory. Windsurf 2.0 shipped a Kanban-style Agent Command Center. Most of this is continuation. But Andon Labs published an experiment where their agent “Luna” signed a 3-year SF retail lease, hired staff, and is operating an actual store. Software-side autonomy is becoming a category. Commercial-physical autonomy is the new beat.
Open-source caught up at the small end, and small now means in-browser. Google released Gemma 4 under Apache 2.0. The 31B variant ranks #3 on Arena, and the license shift to Apache 2.0 means full commercial use, modification, and redistribution. Alibaba shipped Qwen3.6-35B-A3B (35B total, 3B active per inference step) for agentic coding. A 290MB Bonsai 1-bit model is now running locally in browsers via WebGPU. The story is no longer “small models are catching up.” It’s that small and permissively-licensed and browser-native are converging at the same time.
Claude Mythos’s cyber capability got real. UK AISI confirmed Mythos is the first model to complete its 32-step “The Last Ones” autonomous corporate-network takeover range, with 3 out of 10 attempts. Anthropic briefed the Trump administration and restricted access via Project Glasswing to enterprise partners. This continues last week’s Mythos thread; the AISI confirmation is what’s new.
On the Horizon
-
Goldman Sachs put a number on AI displacement: ~16,000 net US jobs lost per month (25K eliminated, 9K added back through augmentation), concentrated in entry-level white-collar Gen Z roles. First major measurement, not projection, of realized net displacement. [Fortune]
-
NVIDIA Ising is the first open-source AI model family for quantum computing (calibration + error correction at 2.5x speed and 3x accuracy over traditional methods). Adopters include Fermilab, Harvard SEAS, and Lawrence Berkeley National Lab. Long-horizon infrastructure signal. [NVIDIA newsroom]
-
MIT Technology Review’s “10 Things That Matter in AI Right Now” drops April 21 at EmTech AI. Likely sets Q2 editorial agenda. Worth watching as a signal-aggregation event. [MIT TR preview]