January 12, 2026

Week 2, 2026

Papers, releases, and things you might have missed.

Autonomous coding crossed from demo to daily reality. The reliability gap became the central problem everyone’s trying to solve. And there’s this weird thing happening with training - models are getting brilliant in some domains while staying unreliable in others.


The Coding Thing Is Getting Real

Simon Willison called it: November 2025 was an inflection point. GPT-5.2 and Opus 4.5 crossed some invisible capability line. Suddenly a whole bunch of much harder problems just… opened up.

Jaana Dogan, Principal Engineer at Google, said: “I gave Claude Code a description of the problem, it generated what we built last year in an hour.” That’s not hyperbole. That’s a serious person making a serious assessment.

Karpathy posted: “I’ve never felt this much behind as a programmer.” And he’s not exactly a junior dev.

Rohan Anil from DeepMind wrote: “I personally feel like a horse in AI research and coding… I can only best them on my good days.” Twenty years of experience. Only on good days.

What do you even do with that?

Boris Cherny, the guy who made Claude Code, shared his setup - 13 steps, pretty detailed - and it spread everywhere. Which is interesting. People want to know how power users configure these things. Maybe configuration matters more than raw capability now. (I’ve written up some patterns I use myself.)

Then there’s “vibe coding” - building stuff without really understanding the code because the AI handles it. It’s polarising.

But here’s the thing. Software feels free now. Ideas that would’ve taken weeks take hours. Projects that weren’t worth starting become afternoon experiments. Yes, there’s technical debt you can’t debug. Yes, fragility is real.

But more people can build things.

And the speed keeps improving. Cursor’s Composer runs 4x faster than comparable models. Gemini Diffusion hits 1,479 tokens per second. Inception is pushing in the same direction - generating entire blocks at once instead of token by token.

We used to have loading times. Waiting for apps to launch. Now we have inference time. Waiting for models to think.

But that’s temporary. The path to near-instant generation is visible.

Imagine Gemini 3-level intelligence responding like autocomplete. Software just… there.

Wild.


But They Still Can’t Finish Things

Here’s the stat that keeps coming up: Remote Labor Index tested AI agents on real freelance work. The best one (Manus) succeeded on 2.5% of projects. Humans outperform AI roughly 10:1 on the same tasks.

METR’s research found something similar from a different angle: nearly 100% success on tasks taking humans under 4 minutes. Under 10% on tasks over 4 hours.

The capability is real. The reliability at scale isn’t.

So everyone’s scrambling around memory and context.

Foundation Capital published this piece calling context infrastructure a trillion-dollar opportunity. Their thesis: the value isn’t in the model anymore. It’s in what they call “decision traces” - the reasoning, exceptions, precedent behind decisions. All that stuff that currently lives in Slack threads and human memory. Whoever captures that layer builds the next Salesforce.

Chroma Research published on “context rot” - how model performance degrades as input tokens increase, even within context window limits. Bigger windows don’t solve the attention problem.

Aaron Levie from Box predicts context becomes the limiting factor: “Almost any time something doesn’t work with an AI agent… you will be able to point to a lack of the right information.”

If he’s right, humans stay in the loop as context providers.

Then again, that’s likely temporary too.

Sholto Douglas from Anthropic predicted: “Continual learning gets solved in a satisfying way in 2026.” Dario Amodei said something similar: “We have evidence to suggest that continual learning is not as difficult as it seems.”

If they’re right, the “context is the human edge” argument collapses. Models that learn continuously don’t need humans gathering context for them.

So which is it?

I don’t know. Maybe both are true. For now.


Why Models Are Weird Now

There’s this new-ish training thing: verification-based training.

Base models still learn by predicting the next token. What’s changed is the training layered on top, and Karpathy laid out the key distinction in his year in review.

RLHF optimises for what sounds good to humans. RLVR (Reinforcement Learning from Verifiable Rewards) optimises for what’s actually correct.

The difference matters. RLVR uses environments like math and code puzzles where correctness is deterministically verifiable. No ambiguous human feedback. Just: did you get it right or not?
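
To make that concrete, here’s a minimal sketch of what a verifiable reward can look like - a toy exact-answer check and a toy run-the-tests check. The names here (verifiable_reward, code_reward, the solve() convention) are mine for illustration, not from any particular RLVR pipeline.

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """1.0 if the model's final answer matches the known answer, else 0.0."""
    def norm(s: str) -> str:
        return s.strip().lower().replace(",", "")
    return 1.0 if norm(model_answer) == norm(ground_truth) else 0.0


def code_reward(candidate_src: str, tests: list[tuple[tuple, object]]) -> float:
    """Reward a generated snippet by executing it and running it against unit tests."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)             # build the candidate function
        fn = namespace["solve"]                    # convention: model must define solve()
        passed = sum(fn(*args) == expected for args, expected in tests)
        return passed / len(tests)                 # fraction of tests passed
    except Exception:
        return 0.0                                 # any crash or missing solve() scores zero


# Toy usage: reward a candidate implementation of addition.
src = "def solve(a, b):\n    return a + b"
print(verifiable_reward("42", " 42 "))                # 1.0
print(code_reward(src, [((1, 2), 3), ((5, 5), 10)]))  # 1.0
```

No reward model, no human rater - the environment itself says pass or fail. That’s the whole appeal.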

And something weird happens. Models trained this way spontaneously develop what looks like reasoning. Breaking problems into steps. Trying again when stuck.

But here’s the catch. This is why models now spike in capability around math and code while performing like confused grade schoolers elsewhere. RLVR creates what Karpathy calls “jagged performance” - genius in verifiable domains, unreliable everywhere else.

Ilya Sutskever made a related point in his November interview: models score high on professional benchmarks while generating minimal economic value. His diagnosis: we’re overfitting to the wrong thing. RL training takes inspiration from evals, so models learn to ace benchmarks without developing the generalisation real work requires.

As more domains become verifiable, more domains become trainable this way. The question is which domains that includes.

Maybe everything eventually. Maybe not.


Chinese Labs Keep Doing Interesting Stuff

While Western labs scale up, Chinese labs find ways around.

DeepSeek’s mHC paper (Manifold-Constrained Hyper-Connections) tackles a training stability problem. When you try to make residual connections richer - more paths for information to flow - training becomes unstable at scale. Signals explode or collapse. mHC constrains how these connections mix so the math stays well-behaved. Result: train larger models more reliably without needing more GPUs.
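
For intuition, here’s a toy version of the “constrain the mixing so it can’t blow up” idea. This is my own illustration, not the paper’s actual construction - mHC’s maths is more involved - but it shows how forcing mixing weights to be convex combinations keeps stream norms bounded layer after layer.

```python
# Toy sketch only: k parallel residual streams mixed by a learned matrix.
# Row-wise softmax makes each output a convex combination of the inputs,
# so no mixed stream's norm can exceed the largest input stream's norm.
import torch

def constrained_mix(streams: torch.Tensor, mix_logits: torch.Tensor) -> torch.Tensor:
    """streams: (k, d) parallel residual streams; mix_logits: (k, k) learned mixing weights."""
    mix = torch.softmax(mix_logits, dim=-1)   # each row sums to 1, all entries >= 0
    return mix @ streams                      # mixed signal can't out-grow the inputs

k, d = 4, 8
streams = torch.randn(k, d)
mixed = constrained_mix(streams, torch.randn(k, k))
print(mixed.norm(dim=-1), streams.norm(dim=-1).max())  # per-stream norms stay bounded
```

Unconstrained mixing matrices can amplify signals a little at every layer, which compounds into explosions at depth; constraining them to a well-behaved set removes that failure mode without adding compute.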

In October they published DeepSeek-OCR - optical compression that packs written context into image form, hitting 97% accuracy at 10x token reduction. Now mHC tackles training stability.

Different problems, same approach: clever engineering to compensate for compute constraints.

“Attention Is Not What You Need” (arXiv) directly challenges transformer attention mechanisms. Early, but part of a pattern. The architecture is being questioned, not just scaled.


Robots Are Happening

Odyssey-2 launched as a real-time interactive world model. You type, video responds. Not minutes later. In the moment. Autoregressive, causal - each frame is generated from the prior frames and your actions. Video generation as a loop you can steer.
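
Conceptually, the steerable loop looks something like this. Everything here (the stub model, its predict() interface) is hypothetical, just to show the shape of an autoregressive, action-conditioned rollout - not Odyssey’s actual API.

```python
from typing import Callable, List

class StubWorldModel:
    """Placeholder standing in for a real frame-prediction model."""
    def predict(self, frames: List[str], action: str) -> str:
        # A real model would render pixels; here we just label the frame.
        return f"frame_{len(frames)}<{action}>"

def rollout(model, first_frame: str, get_action: Callable[[], str], n_frames: int = 5) -> List[str]:
    frames = [first_frame]
    for _ in range(n_frames):
        action = get_action()                         # user steering input, e.g. text
        frames.append(model.predict(frames, action))  # causal: only past frames + action
    return frames

print(rollout(StubWorldModel(), "frame_0", lambda: "walk forward"))
```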

NVIDIA released Cosmos 2.5 - open world models for physical AI. Transfer 2.5 and Predict 2.5 generate synthetic training data for robots. Reason 2 handles physical reasoning. Downloaded over 2 million times. NVIDIA positioning itself as the Android of robotics.

Fei-Fei Li brought World Labs’ Marble to the CES keynote. Generates persistent, downloadable 3D environments from text or images. Exportable to Unreal and Unity. She calls it “the first step toward a truly spatially intelligent world model.”

So here’s the pattern: world models are becoming how you do physical AI.

LLMs predict the next token. World models predict the next frame. And in doing so, they learn physics. Motion. Causality. Meta’s V-JEPA 2 (June 2025) showed robots could learn from video. Now everyone’s building on that insight.

Boston Dynamics publicly demonstrated Atlas for the first time. 56 degrees of freedom, human-scale hands, 110-pound lift capacity. More importantly: they announced a partnership with Google DeepMind to put Gemini in Atlas. Hyundai’s deploying tens of thousands. All 2026 production is committed.

Also this week:

  • Apple introduced SHARP - photorealistic 3D from single photographs
  • Microsoft open-sourced TRELLIS (2-4B) for image-to-3D

The gap from “chatbot” to “thing that understands physical environments” is narrowing.


So What Does It Mean

I keep coming back to a few things.

Reliability is the problem now. Not capabilities. Not scale. Can you actually use these things for real work? That’s what everyone’s asking.

Memory might not stay the human edge. Context graphs, persistent entities, decision traces - everyone’s working on it. And Anthropic insiders are confident continual learning is coming.

The productivity split is real. Some practitioners are seeing transformative gains. Others aren’t. Same models, different workflows. The gap is architecture, not access.

Training is expanding. RLVR, architectural innovation from Chinese labs, papers challenging attention - the field is finding new ways to make models actually work.

World models are how you do physical AI now. Odyssey-2, NVIDIA Cosmos 2.5, World Labs Marble - CES made it clear. LLMs predict tokens. World models predict frames, and learn physics in the process. Boston Dynamics + DeepMind isn’t research. It’s a deployment commitment.


Read These

If you read three things:

  1. Karpathy’s 2025 LLM Year in Review - The RLVR framing and “jagged performance” insight
  2. Foundation Capital on context graphs - The thesis on where value accrues
  3. Boston Dynamics + DeepMind announcement - What production humanoid deployment looks like