DeepSeek V4 Is Not a Bigger V3: It Is a Foundation, Not a Finale
When DeepSeek released V4 on April 24, 2026, the reactions split neatly into two camps. The first camp looked at the spec sheet — 1.6 trillion total parameters, 1M-token context, MIT license — and concluded that V4 is just a heavier-weight V3. The second camp looked at the benchmarks, saw that V4-Pro-Max trails Claude Opus 4.6 on HLE and Gemini 3.1 Pro on GPQA Diamond, and concluded that V4 is a marginal upgrade that didn't quite stick the landing.
Both readings miss what is actually happening. V4 is not a bigger V3, and it is not a slight refinement of V3. V4 is a foundation-laying release. The architecture has been rebuilt from the inside out, but the benchmarks have not yet caught up to the architecture. That gap — between novel design and headline metrics — is the whole story. DeepSeek is not trying to win this round. They are trying to set up the round after this one.
Why "bigger V3" is the wrong frame
It is true that V4-Pro is roughly 2.4x larger than V3 by total parameter count (1.6T vs 671B) and that active parameters per token grew from 37B to 49B. If you stop reading there, "bigger V3" looks defensible. But every interesting number in V4 moves in the opposite direction of "more weight, more compute."
At a 1M-token context, V4-Pro uses about 27% of the per-token inference FLOPs of V3.2 and only about 10% of the KV cache. A model that is 2.4x larger is using a fraction of the compute and memory of its predecessor at long context. That is not a scale-up. That is a redesign that happens to be larger in nominal parameter count because the parameters are being used differently.
The context window is the other tell. V3 shipped at 128K. V4 ships at 1M, an 8x jump. You do not get an 8x context jump by adding weights. You get it by changing how attention works at every layer of the network. Which is exactly what DeepSeek did.
The novel architecture nobody is talking about loudly enough
V4 ships with at least four genuinely novel architectural choices. None of them are cosmetic, and none of them existed in V3.
Hybrid CSA + HCA attention. Instead of a single attention mechanism applied uniformly, V4 interleaves two different attention types throughout the Transformer stack. Compressed Sparse Attention (CSA) compresses the KV cache 4x along the sequence dimension, then uses a "lightning indexer" to select the top 1,024 most relevant compressed entries per query, plus a 128-token sliding window for local context. It is selective and detailed. Heavily Compressed Attention (HCA) does the opposite — it compresses 128x and then applies dense attention over the compressed representation, giving every layer a cheap, blurry, global view of the entire context. Alternating these layers means the model is constantly switching between focused retrieval and wide-angle awareness. This is not a tweak. It is a structural rethink of how a Transformer reads long inputs.
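To make the interleaving concrete, here is a minimal sketch of the two layer types in PyTorch. Only the published ratios come from the release (4x and 128x compression, top-1,024 selection, a 128-token local window); the mean-pooling compressor, the dot-product indexer, the absence of causal masking, and all shapes below are illustrative assumptions, not DeepSeek's implementation.

```python
# Illustrative sketch only: the ratios are from the release; everything else
# (pooling as the compressor, a dot-product "indexer", shapes, no causal
# masking) is an assumption made for readability.
import torch
import torch.nn.functional as F

def compress(k, v, ratio):
    """Mean-pool keys/values along the sequence dimension by `ratio`
    (a stand-in for whatever learned compressor the real model uses)."""
    B, T, D = k.shape
    ratio = min(ratio, T)
    T_c = T // ratio
    k_c = k[:, :T_c * ratio].reshape(B, T_c, ratio, D).mean(dim=2)
    v_c = v[:, :T_c * ratio].reshape(B, T_c, ratio, D).mean(dim=2)
    return k_c, v_c

def csa_layer(q, k, v, top_k=1024, window=128, ratio=4):
    """Compressed Sparse Attention: top-k retrieval over a 4x-compressed
    cache, plus a local window over recent uncompressed tokens."""
    scale = q.shape[-1] ** 0.5
    k_c, v_c = compress(k, v, ratio)
    # "Lightning indexer" stand-in: score compressed entries per query, keep top-k.
    scores = torch.einsum("bqd,bkd->bqk", q, k_c)
    idx = scores.topk(min(top_k, scores.shape[-1]), dim=-1).indices
    gather = lambda t: torch.gather(
        t.unsqueeze(1).expand(-1, q.shape[1], -1, -1), 2,
        idx.unsqueeze(-1).expand(-1, -1, -1, t.shape[-1]))
    k_sel, v_sel = gather(k_c), gather(v_c)
    attn = F.softmax(torch.einsum("bqd,bqkd->bqk", q, k_sel) / scale, dim=-1)
    out = torch.einsum("bqk,bqkd->bqd", attn, v_sel)
    # Local detail: attend to the last `window` raw tokens (simplified, not causal).
    k_loc, v_loc = k[:, -window:], v[:, -window:]
    attn_loc = F.softmax(torch.einsum("bqd,bkd->bqk", q, k_loc) / scale, dim=-1)
    return out + torch.einsum("bqk,bkd->bqd", attn_loc, v_loc)

def hca_layer(q, k, v, ratio=128):
    """Heavily Compressed Attention: dense attention over a 128x-compressed,
    cheap, blurry view of the whole context."""
    k_c, v_c = compress(k, v, ratio)
    attn = F.softmax(torch.einsum("bqd,bkd->bqk", q, k_c) / q.shape[-1] ** 0.5, dim=-1)
    return torch.einsum("bqk,bkd->bqd", attn, v_c)

# Toy usage; in the real stack the two layer types alternate through the network.
q = k = v = torch.randn(1, 1024, 32)
x_selective = csa_layer(q, k, v)   # focused retrieval
x_global = hca_layer(q, k, v)      # wide-angle awareness
```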
Manifold-Constrained Hyper-Connections (mHC). Standard residual connections start to misbehave at trillion-parameter scale: signals either explode or collapse as they propagate through hundreds of layers. mHC replaces the residual stream with mixing matrices constrained to the Birkhoff Polytope using the Sinkhorn-Knopp algorithm. In plain language: every step of the residual stream is forced to be a doubly-stochastic mixing of incoming signals, which mathematically preserves magnitude. This is the kind of fix you put in place when you are planning to go deeper and bigger later, not when you are shipping a refresh.
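The core of the constraint is simple to illustrate. The sketch below uses a few Sinkhorn-Knopp iterations to push a learned mixing matrix toward the Birkhoff polytope (the set of doubly-stochastic matrices); how DeepSeek wires the resulting matrix into the residual stream, and how many parallel streams it mixes, is not public, so the hyper-connection framing and the stream count here are assumptions.

```python
# Core mechanism only: Sinkhorn-Knopp normalization toward a doubly-stochastic
# mixing matrix. The residual-stream wiring and the stream count are assumptions.
import torch

def sinkhorn_knopp(logits: torch.Tensor, n_iters: int = 10) -> torch.Tensor:
    """Alternately normalize rows and columns so both sum (approximately) to 1."""
    M = logits.exp()                          # strictly positive entries
    for _ in range(n_iters):
        M = M / M.sum(dim=-1, keepdim=True)   # rows sum to 1
        M = M / M.sum(dim=-2, keepdim=True)   # columns sum to 1
    return M

n_streams, d_model = 4, 8
mixing_logits = torch.randn(n_streams, n_streams, requires_grad=True)  # learned
streams = torch.randn(n_streams, d_model)     # parallel residual streams

M = sinkhorn_knopp(mixing_logits)
mixed = M @ streams   # each output stream is (approximately) a convex combination
                      # of the inputs, so signal magnitude is neither amplified
                      # nor collapsed as layers stack
print(M.sum(dim=0), M.sum(dim=1))  # both close to all-ones after a few iterations
```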
Muon optimizer. V4 abandons AdamW for most of the network and switches to Muon, keeping AdamW only for embeddings, the prediction head, and RMSNorm weights. DeepSeek reports faster convergence and more stable training at the trillion-parameter scale. Switching optimizers mid-product-line is rare and risky. You only do it if you have decided your next several models are going to push past the limits where AdamW behaves well.
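For context on what that switch means mechanically: Muon's defining move is to orthogonalize each matrix-shaped update with a few Newton-Schulz iterations, while scalar- and vector-shaped parameters stay on a conventional optimizer. The sketch below follows the coefficients of the public Muon reference implementation and shows one plausible way to split parameter groups along the lines DeepSeek describes; the module-name checks are placeholders, not DeepSeek's actual code.

```python
# Sketch only: the Newton-Schulz coefficients follow the public Muon reference
# implementation; the parameter-group split and name checks are assumptions.
import torch
import torch.nn as nn

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix (the heart of Muon)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)
    transposed = g.shape[0] > g.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def split_param_groups(model: nn.Module):
    """Embeddings, the prediction head, and norm weights keep AdamW; everything
    matrix-shaped goes to Muon. The substring checks are illustrative only."""
    muon_params, adamw_params = [], []
    for name, p in model.named_parameters():
        if p.ndim < 2 or "embed" in name or "lm_head" in name or "norm" in name:
            adamw_params.append(p)
        else:
            muon_params.append(p)
    return muon_params, adamw_params
```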
FP4 quantization-aware training. Instead of training in higher precision and quantizing afterward (the usual approach, which costs quality), V4's MoE expert weights and the indexer's QK path were trained in FP4-aware mode from the start. The model's parameters are, in effect, native to a 4-bit world. This is not how you build a one-off release. It is how you build a model whose successor needs to fit on smaller hardware budgets while staying high-quality.
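Mechanically, "FP4-aware" means the forward pass sees weights snapped to a 4-bit-representable grid while gradients update full-precision master weights through a straight-through estimator. A minimal sketch, assuming an E2M1-style value grid and per-tensor scaling (DeepSeek has not disclosed its exact FP4 format or scaling granularity):

```python
# Minimal fake-quantization sketch of quantization-aware training. The E2M1-style
# grid and per-tensor scaling are assumptions, not DeepSeek's disclosed recipe.
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

class FakeQuantFP4(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max() / FP4_GRID.max()   # per-tensor scale (simplified)
        mags = (w.abs() / scale).unsqueeze(-1)
        nearest = FP4_GRID[(mags - FP4_GRID).abs().argmin(dim=-1)]
        return torch.sign(w) * nearest * scale   # forward sees FP4-representable values

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                          # straight-through estimator

w = torch.randn(4, 4, requires_grad=True)        # full-precision master weights
loss = (FakeQuantFP4.apply(w) ** 2).sum()
loss.backward()                                  # gradients flow back to `w`
```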
Add it up: hybrid attention, manifold-constrained residuals, a new optimizer, and quantization baked into pretraining. Any one of these in isolation would be a notable paper. Shipping all four in the same model is a statement.
The benchmarks that "should" have been there, and were not
Here is the apparent disappointment. V4-Pro-Max — the extended-reasoning configuration — does not sweep the leaderboard:
- SWE-bench Verified: 80.6%, trailing Claude Opus 4.6 by 0.2 points.
- HLE: 37.7% vs Claude's 40.0%.
- MMLU-Pro: 87.5% vs Gemini 3.1 Pro's 91.0%.
- GPQA Diamond: 90.1% vs Gemini 3.1 Pro's 94.3%.
- HMMT 2026 math: 95.2% vs Claude's 96.2%.
It does lead on LiveCodeBench (93.5%, the highest of any model) and on Codeforces (3206 vs GPT-5.4 xHigh's 3168). But on the broad reasoning and knowledge benchmarks that capture public attention, V4-Pro-Max is a strong second-tier model, not a frontier-toppling release.
If V4 were intended to be the headline-grabbing model of 2026, this is underwhelming. But that interpretation only makes sense if you assume DeepSeek's primary goal was a benchmark win. The architectural evidence suggests their primary goal was something else.
The thesis: V4 is the foundation, not the finale
When you look at what changed and what did not, a clear pattern emerges. V4 is a release where the capacity has been built but the capability has not yet been fully unlocked.
The architecture is built for things that V4-Pro-Max does not yet do well. A 1M-token context, hybrid attention specifically tuned for long-context retrieval, residual connections engineered for trillion-parameter stability, and an optimizer chosen for very large models — these are the prerequisites for a much more powerful next-generation model. They are not the prerequisites for winning HLE today.
DeepSeek effectively spent this release cycle on plumbing. They replaced the attention mechanism, the residual stream, the optimizer, and the precision regime, and they trained on 33T tokens (up from 14.8T). Each of those is a foundation-level change. Each one introduces risk that has to be absorbed before any reasoning or post-training work can compound on top of it. The fact that V4-Pro-Max is already within 0.2 points of Claude on SWE-bench, leading on LiveCodeBench, and matching the frontier on Codeforces suggests the new substrate is healthy. The benchmarks they currently trail on — HLE, GPQA, HMMT — are the ones that respond most to extended post-training, RL on reasoning, and tool-use scaffolding. Those are the easier deltas to close on top of a working foundation.
In other words: the hard work is done. The architecture is in place, the training stack is stable at trillion scale, and the cost structure is dramatic — V4-Pro-Max delivers near-Claude SWE-bench performance at roughly 1/21 the output token price ($3.48/M vs $75/M), and V4-Flash hits the same neighborhood at $0.28/M. The next model in this lineage does not need any new structural breakthroughs to leap ahead. It needs more post-training, more RL, and more reasoning data on top of the substrate V4 just shipped.
Consider what each of V4's architectural choices is actually optimized for, and the picture sharpens. CSA + HCA is not a clever way to score better on MMLU — multiple-choice knowledge benchmarks live happily inside a 4K-token prompt. CSA + HCA is a way to make 1M-token reasoning tractable, which is the regime where future agentic systems, long-horizon coding sessions, and persistent context will live. mHC is not a way to extract more accuracy from a 100B model — it is insurance against signal collapse in models that go deeper than V4. Muon is not chosen because AdamW is broken at 1.6T — it is chosen because AdamW becomes fragile at the scales DeepSeek is presumably already prototyping internally. FP4 quantization-aware training is not how you ship the cheapest version of today's model — it is how you make sure tomorrow's much larger model can still be served at a sane cost per token. Every one of these decisions is an investment in headroom, not a play for this quarter's leaderboard.
There is also a telling silence in what V4 did not introduce. There is no new reasoning-mode token, no new test-time compute architecture, no new tool-use or agentic scaffolding baked into the model itself. The release is conspicuously thin on the post-training innovations that drive HLE and GPQA scores. That is not because DeepSeek forgot. It is because post-training is what changes between V4 and whatever comes next — and you do not want to bake your new RL recipe into the same release where you also swapped out the optimizer, the attention mechanism, the residual stream, and the precision regime. You change one variable at a time when the variables are this fundamental. V4 is the architectural variable. The reasoning variable comes next.
The economics reinforce this read. DeepSeek did not just ship a model — they shipped a model that costs roughly 1/21 as much as Claude Opus per output token while landing within 0.2 points of it on SWE-bench. That cost ratio is not a side-effect; it is a deliberate property of CSA + HCA and FP4 QAT working together. What it means in practice is that the next model in this lineage starts from a baseline that is already economically viable at frontier-adjacent quality. DeepSeek can now afford to spend the next training cycle on expensive things — long RL rollouts, agentic trajectories, reasoning chains hundreds of thousands of tokens long — because the underlying inference cost has been driven down far enough that exploration is cheap. You cannot run massive RL on a model that costs $75 per million output tokens to serve. You can run it on a model that costs $3.48.
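A back-of-the-envelope calculation makes the gap concrete. The 10-billion-token rollout budget below is purely hypothetical; only the per-million-token prices come from the release.

```python
# Hypothetical rollout budget; only the per-million output prices are from the article.
price_per_million = {"V4-Pro-Max": 3.48, "Claude Opus 4.6": 75.00}
rollout_tokens = 10_000_000_000   # hypothetical RL exploration budget

for model, price in price_per_million.items():
    print(f"{model}: ${rollout_tokens / 1_000_000 * price:,.0f}")
# V4-Pro-Max: $34,800   vs   Claude Opus 4.6: $750,000
```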
There is a historical pattern worth invoking here. DeepSeek did the same thing once before. V2 introduced Multi-Head Latent Attention (MLA) and the DeepSeekMoE design — architectural work that, on its own benchmarks, looked solid but not dominant. Then V3 reused that exact substrate and became the model that put DeepSeek on the global map. R1 came shortly after and reused it again. The pattern is: ship the architecture quietly, let it stabilize, then layer post-training and reasoning on top in the next release. V4 is doing the V2 move, not the V3 move. If the pattern holds, the model that turns these architectural ideas into a frontier-toppling release is not V4 — it is whatever DeepSeek labels next.
That is what a mid-step looks like. A finished engine, sitting on a chassis built for a much bigger car.
What to actually watch for next
If this read is correct, here is what to expect from DeepSeek's next release:
A reasoning-focused model built on top of V4's substrate, with substantially better HLE, GPQA, and HMMT numbers, but without major architectural changes underneath. The CSA+HCA attention, mHC residuals, Muon, and FP4 QAT will quietly stay. The thinking traces, the RL recipes, and the tool-use scaffolding will be where the next jump comes from. That is the model people will call a "frontier" release — but the work that made it possible was already shipped, in plain view, in V4.
The mistake is reading V4 as the destination. It is not. It is the launchpad.