Product Launch · Xiaohu Explainer

One Extra Step Makes It Faster: DSpark Boosts DeepSeek-V4 Per-User Generation Speed by Up to 85%

Another 60–85% faster on top of the existing MTP-1 speculative decoding, by overlapping the draft and verification pipelines (figures self-reported by DeepSeek)

At a Glance

DeepSeek releases DSpark, a speculative decoding acceleration framework built specifically for DeepSeek-V4
Compared with the MTP-1 baseline already in production, per-user generation speed improves 60–85% (DeepSeek's self-reported figures)
Core mechanism: a lightweight draft module guesses several tokens ahead, the main model verifies them in a batch, and every correct guess up to the first miss is accepted
Key improvement: pipeline the draft-generation and main-model-verification steps to run in parallel, eliminating the serial wait
A pure inference-side system optimization — model weights unchanged, drop-in for existing DeepSeek-V4 deployments

⚑ Where we stand: This is DeepSeek's own acceleration framework, and the 60–85% speedup is an official self-reported figure for the DeepSeek-V4 production environment; there is no independent third-party reproduction yet. Below is how it works and where the speed comes from.

1. What It Is

What DeepSeek Released, and How Much Faster It Is

DeepSeek recently released DSpark, a speculative decoding acceleration framework for DeepSeek-V4 that lifts per-user generation speed another 60–85% over the existing MTP-1 baseline.

DSpark is a pure inference-side acceleration framework. It leaves model weights untouched and only changes how the model "spits out tokens," making a single user's replies arrive 60 to 85 percent faster.

The real story is which baseline it beats. It's not measured against "no acceleration at all," but against the MTP-1 speculative decoding already running in DeepSeek-V4's production environment. In other words, it squeezes another 60–85% out of a setup that was already accelerated. This is a second-round speedup at the engineering-systems level, which DeepSeek says is already serving V4 in production.

2. The Bottleneck

Why a Large Model Spits Out One Token at a Time

A large model generates text one token after another. For every token it emits, it has to run the entire model end to end (one forward pass); only once it has that token can it compute the next.

The tokens follow a strict order: each one depends on the one before it, so you can't skip ahead. No amount of compute can rush your single sentence — piling on GPUs lets the model serve more people at once, but for a single user it still has to go step by step. A hundred-token sentence means a hundred end-to-end computations, all waiting in line.
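The serial dependency is easy to see in a toy loop. The sketch below is only an illustration: `toy_model` is a made-up stand-in for a real forward pass, and the point is simply that each call needs the full prefix, so the passes cannot overlap.

```python
from typing import Callable, List, Tuple


def decode_serial(next_token: Callable[[List[int]], int],
                  prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Generate n_new tokens one at a time; return (new_tokens, forward_passes)."""
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(n_new):
        # The next token depends on every token before it,
        # so this step cannot start until the previous one has finished.
        tokens.append(next_token(tokens))
        forward_passes += 1
    return tokens[len(prompt):], forward_passes


def toy_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for one end-to-end forward pass of a large model.
    return (sum(prefix) * 31 + len(prefix)) % 1000


generated, passes = decode_serial(toy_model, prompt=[1, 2, 3], n_new=100)
print(len(generated), passes)
```

A hundred new tokens cost a hundred passes, one after another, no matter how much hardware sits behind the model.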

Token 1 (Forward ①) › Token 2 (Forward ②) › Token 3 (Forward ③) › Token 4 (Forward ④) › Token 5 (Forward ⑤) › …

But there's an exploitable gap hidden here. Having the model verify several already-written tokens takes almost the same time as having it generate one new token, because a decoding step is typically limited by reading the model's weights from memory rather than by arithmetic. Generation is stuck on step-by-step dependency, but verification can check a whole batch in parallel in one shot.

Generating 1 new token

Must wait for the previous token; a full forward pass buys back just one token. Expensive, and strictly serial.

Verifying K written tokens

Feed the whole batch in at once and check it in parallel; the time taken is roughly that of one forward pass. A whole string checked in one go.

Speculative decoding slips in through exactly this gap.

3. Core Mechanism

Guess a Batch First, Then Confirm It All at Once

Speculative decoding's idea is counterintuitive: instead of making the large model dutifully write one token at a time, you first bring in a blazing-fast "drafter" to guess several upcoming tokens in one shot, then let the large model verify the whole batch at once.

The draft head first guesses K candidate tokens (G1 through G5 in the illustration, so K = 5), and the main model verifies the whole batch in one forward pass. The cost of verifying a batch is about the same as generating a single token.

Core Intuition

The cost of verifying K tokens ≈ generating 1 token. So as long as the drafter guesses accurately enough, each verification the main model performs confirms several tokens at once — folding multiple steps into one. Adding the "guess" step actually shortens the total time.

Who the Drafter Is, and What Counts as Guessing Well

This "drafter" isn't a separate small model — it's an add-on module bolted on during DeepSeek's training, called the MTP (Multi-Token Prediction) head. It's lightweight, can predict the probabilities of several upcoming tokens at once, and runs far faster than the main model, making it a natural fit for "quick drafting."

The share of the drafter's guessed tokens that the main model accepts is called the acceptance rate. The higher the acceptance rate, the more tokens each verification round nets, and the bigger the speedup. It depends on how closely the drafter and the main model "think alike," that is, how close their predicted token distributions are.
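To get a feel for how acceptance rate translates into tokens per round, here is a rough sketch under a simplifying assumption that is not from DeepSeek: every guess is accepted independently with the same probability `alpha`, and each round also yields one token from the main model (the fix at the first miss, or one extra token if all K guesses hit). Under that assumption the expected yield is the geometric sum 1 + alpha + … + alpha^K.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens confirmed per verification round.

    Assumes (for illustration only) that each of the k guesses is accepted
    independently with probability alpha, and that the main model always
    contributes one token of its own at the end of the accepted run.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every guess hits, plus the main model's own token
    # Closed form of 1 + alpha + alpha**2 + ... + alpha**k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


for alpha in (0.5, 0.8, 0.95):
    print(alpha, round(expected_tokens_per_round(alpha, k=5), 2))
```

Real acceptance is not independent from token to token, so treat this as a back-of-the-envelope model rather than a prediction of DSpark's behavior.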

An Analogy

Grading is faster than writing exam questions. Setting K questions in a batch and grading them together is far more efficient than writing one question, grading it, then writing the next. Speculative decoding lets the fast drafter "set the questions" in bulk and lets the large model "grade" them all at once.

4. What DSpark Changed

MTP-1 Is Already in Use — So Where Does DSpark Get Faster

MTP-1 is the scheme already running in DeepSeek-V4's production environment, and it guesses just one step per round: guess one, wait for the main model to verify, then guess the next. DSpark makes two cuts on top of it.

Improvement One

Guess more steps

The draft head probes several more tokens ahead at once, so a single verification round confirms more tokens.

Improvement Two

Guess and verify together

Turn "guess" and "verify" from queuing into a pipeline — while one batch is being verified, the next is already being guessed.

Main Source of Speedup

The 60–85% gain comes mainly from the second cut. In MTP-1, between "guess then wait to verify" and "verify then guess again" there's a window of idle waiting, with the two streams taking turns sitting idle. DSpark fills that window: the draft stream and the verify stream run overlapped on the timeline, neither waiting on the other.

Top, MTP-1: draft and verify take turns — while one stream works, the other waits idle (dashed boxes). Bottom, DSpark: the draft stream keeps probing ahead while the verify stream keeps checking; the two overlap on the timeline with no idle gaps. For the same workload, DSpark finishes noticeably earlier.

An Analogy

A car assembly line. While one car gets its wheels mounted, the next car's chassis is already being painted — you don't wait for one car to be fully finished before starting the next. DSpark makes the GPU work the same way: while one batch is being verified, the next is already being guessed, and the machine never sits idle.
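The effect of overlapping can be sketched with a two-stage pipeline timing model. Everything here is an assumption made for illustration: the stage durations are invented, each round is given a fixed draft time and a fixed verify time, and the model ignores drafts that get thrown away after a rejection. It is not DSpark's actual scheduler and is not meant to reproduce the 60–85% figure.

```python
def serial_time(rounds: int, draft_ms: float, verify_ms: float) -> float:
    """Draft and verify take turns: each round pays for both stages in full."""
    return rounds * (draft_ms + verify_ms)


def overlapped_time(rounds: int, draft_ms: float, verify_ms: float) -> float:
    """Idealized two-stage pipeline: the next draft runs during the current verify."""
    if rounds == 0:
        return 0.0
    # The first draft and the last verify cannot be hidden behind anything;
    # in between, each round costs only the slower of the two stages.
    return draft_ms + verify_ms + (rounds - 1) * max(draft_ms, verify_ms)


# Hypothetical numbers: 10 rounds, 2 ms to draft, 8 ms to verify.
print(serial_time(10, 2.0, 8.0))      # 100.0
print(overlapped_time(10, 2.0, 8.0))  # 82.0
```

In this model the gain grows as the draft stage takes up a larger share of each round, which is why guessing more steps and overlapping the two streams go together.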

5. The Full Loop

One Inference Round, From Guess to Locked In

Putting it all together, one full DSpark loop runs like this.

Draft head guesses K tokens (fast) → Main model verifies in one pass (one forward) → Match guesses against the main model's tokens (matched as far as it goes) → Keep hits / roll back on error (truncate and restart)

↻ Back to step one for the next round (draft and verify pipelines overlap)

Verification checks the guesses in order, from first to last: every consecutively correct token is accepted; the moment the first wrong guess appears, it's truncated there. At the error position, the main model conveniently supplies the token it considers correct (that one is a freebie too); everything the drafter guessed after the wrong token is discarded, and the next round restarts from this position.
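The loop described above can be sketched end to end with toy models. This is a minimal greedy version under stated assumptions, not DSpark's implementation: `main_model` and `draft_model` are hypothetical stand-ins, the drafter is deliberately wrong at fixed positions, and verification calls the main model once per position for readability where a real system would score all positions in one batched forward pass. The sketch also assumes the main model contributes one extra token when every guess hits.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token


def main_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large model's greedy next-token choice.
    return (prefix[-1] * 7 + 3) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical drafter: agrees with the main model except when the
    # prefix length is a multiple of 3, where it is deliberately wrong.
    guess = main_model(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


def plain_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens[len(prompt):]


def speculative_decode(main: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns (new_tokens, verification_rounds)."""
    tokens = list(prompt)
    rounds = 0
    while len(tokens) - len(prompt) < n_new:
        # Step 1: the drafter guesses k tokens, each building on its own guesses.
        guesses: List[int] = []
        for _ in range(k):
            guesses.append(draft(tokens + guesses))
        # Step 2: verify in order. Counted as one round because a real system
        # would obtain all k + 1 target tokens from a single forward pass.
        rounds += 1
        for i in range(k + 1):
            target = main(tokens + guesses[:i])
            if i < k and guesses[i] == target:
                continue  # guess i matches, keep checking
            # First miss (or all k guesses matched): keep the accepted prefix
            # plus the main model's own token, and drop the rest.
            tokens.extend(guesses[:i])
            tokens.append(target)
            break
    return tokens[len(prompt):len(prompt) + n_new], rounds


prompt = [1]
plain_out = plain_decode(main_model, prompt, 20)
spec_out, rounds = speculative_decode(main_model, draft_model, prompt, 20, k=4)
print(spec_out == plain_out, rounds)
```

The output matches plain one-token-at-a-time decoding exactly, while the main model is consulted in far fewer rounds than there are tokens. That is the property that lets this kind of speedup leave the model's answers unchanged.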

Token 1: ✓ correct · Token 2: ✓ correct · Token 3: ✓ correct · Token 4: ✓ correct · Token 5: model fixes · Token 6: discarded · Token 7: discarded

This round: 4 correctly guessed tokens + 1 token the main model fixed on the spot, netting 5 tokens for the cost of a single verification. The more accurately the drafter guesses, the more tokens are accepted and the fewer are discarded, and the faster it all runs.
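The round above can be replayed in a few lines. The words used as tokens are invented for illustration, and `verify_round` is a hypothetical helper name; `targets` stands for what the main model itself would choose at each position, as read off from the single verification pass.

```python
from typing import List, Sequence, Tuple


def verify_round(guesses: Sequence[str],
                 targets: Sequence[str]) -> Tuple[List[str], int]:
    """Return (tokens kept this round, number of guesses thrown away).

    targets[i] is the main model's own choice for position i, so it has one
    more entry than guesses (the extra one is used if every guess hits).
    """
    if len(targets) != len(guesses) + 1:
        raise ValueError("targets must have exactly one more entry than guesses")
    accepted = 0
    while accepted < len(guesses) and guesses[accepted] == targets[accepted]:
        accepted += 1
    # Keep the matching run, then the main model's token at the first miss.
    kept = list(guesses[:accepted]) + [targets[accepted]]
    thrown_away = len(guesses) - accepted  # the wrong guess plus everything after it
    return kept, thrown_away


# Seven guesses; the fifth is wrong. Targets after the miss are never read.
guesses = ["The", "cat", "sat", "on", "a", "red", "mat"]
targets = ["The", "cat", "sat", "on", "the", "_", "_", "_"]
kept, thrown_away = verify_round(guesses, targets)
print(kept, thrown_away)
```

Five tokens are kept (four hits plus the main model's fix), and three guesses are dropped: the wrong fifth guess, which the fix replaces, and the two that followed it.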

6. The Numbers

Just How Much Faster

Back to that number. On the already-accelerated MTP-1 baseline, DSpark lifts per-user generation speed another 60 to 85%.

MTP-1 baseline: 100%. The single-step speculative decoding scheme already used in DeepSeek-V4's production environment, serving as the comparison baseline.

DSpark (low end): 160%. About 60% faster than MTP-1, roughly 1.6× the baseline speed.

DSpark (high end): 185%. About 85% faster than MTP-1, roughly 1.85× the baseline speed.

60–85%: DSpark's per-user generation speedup range over the MTP-1 baseline

MTP-1: Comparison baseline, the single-step speculative decoding already used in V4 production

DeepSeek-V4: DSpark's target model

These figures are self-reported by DeepSeek for the DeepSeek-V4 production environment. Note that both ends of the comparison are already "accelerated" states — the range itself is an increment stacked on top of MTP-1.
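One thing worth spelling out: "85% faster" does not mean the wait shrinks by 85%. The short calculation below converts a speed gain into the reduction in time per token, reading "X% faster" as tokens per second multiplied by 1 + X/100 (consistent with the 1.6× and 1.85× figures above). The 50 tokens per second baseline and the 1,000-token reply are invented purely for the example.

```python
def latency_cut_percent(pct_faster: float) -> float:
    """Percent reduction in time per token implied by a given speed gain."""
    ratio = 1.0 + pct_faster / 100.0  # e.g. 60% faster -> 1.6x the speed
    return (1.0 - 1.0 / ratio) * 100.0


def reply_seconds(n_tokens: int, baseline_tok_per_s: float, pct_faster: float) -> float:
    """Time to stream n_tokens at the baseline speed boosted by pct_faster."""
    return n_tokens / (baseline_tok_per_s * (1.0 + pct_faster / 100.0))


for pct in (60.0, 85.0):
    print(pct, round(latency_cut_percent(pct), 1))  # 37.5 and 45.9

# Hypothetical: a 1,000-token reply at a 50 tokens/s baseline.
print(reply_seconds(1000, 50.0, 0.0))             # 20.0
print(reply_seconds(1000, 50.0, 60.0))            # 12.5
print(round(reply_seconds(1000, 50.0, 85.0), 2))  # 10.81
```

So the reported range corresponds to waiting roughly 37.5% to 46% less time for the same reply.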

7. Practical Value

Who This Actually Helps

This is an inference-side system optimization, and its practical value splits two ways: users and service providers.

For users. What you see is streaming output, with the reply popping out token by token. As generation speeds up, the wait you feel as text appears word by word eases — a difference you can feel directly.

For service providers. DSpark doesn't change model weights, so existing DeepSeek-V4 deployments can adopt it directly, with low migration cost. The same GPUs can handle higher concurrency, or the original service level can be met with less hardware.

DeepSeek Releases DSpark, a Speculative Decoding Framework That Accelerates DeepSeek-V4 Per-User Generation 60–85% Over MTP-1. Original headline · MarkTechPost / DeepSeek

Source: MarkTechPost / DeepSeek. This piece is an interpretation of vendor-released content; the 60–85% speedup is DeepSeek's self-reported figure for the DeepSeek-V4 production environment. The timing diagrams illustrate the mechanism and are not real benchmark proportions.