One Extra Step Makes It Faster: DSpark Boosts DeepSeek V4 Per-User Generation Speed by Up to 85%
- DeepSeek releases DSpark, a speculative decoding acceleration framework built specifically for DeepSeek-V4
- Compared with the MTP-1 baseline already in production, per-user generation speed improves 60–85% (DeepSeek's self-reported figures)
- Core mechanism: a lightweight draft module guesses several tokens ahead, the main model verifies them in a batch, and every correct guess is accepted
- Key improvement: pipeline the draft-generation and main-model-verification steps to run in parallel, eliminating the serial wait
- A pure inference-side system optimization — model weights unchanged, drop-in for existing DeepSeek-V4 deployments
What DeepSeek Released, and How Much Faster It Is
DeepSeek recently released DSpark, a speculative decoding acceleration framework for DeepSeek-V4 that lifts per-user generation speed another 60–85% over the existing MTP-1 baseline.
The real story is which baseline it beats. It's not measured against "no acceleration at all," but against the MTP-1 speculative decoding already running in DeepSeek-V4's production environment. In other words, it squeezes another 60–85% out of a setup that was already accelerated. This is a second-round speedup at the engineering-systems level, which DeepSeek says is already serving V4 in production.
Why a Large Model Spits Out One Token at a Time
A large model generates text one token after another. For every token it emits, it has to run the entire model end to end (one forward pass); only once it has that token can it compute the next.
The tokens follow a strict order: each one depends on the one before it, so you can't skip ahead. No amount of compute can rush your single sentence — piling on GPUs lets the model serve more people at once, but for a single user it still has to go step by step. A hundred-token sentence means a hundred end-to-end computations, all waiting in line.
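The serial dependency can be sketched as a short loop. This is a minimal illustration, not DeepSeek's code: `toy_forward` is a hypothetical stand-in for a real model's forward pass, and the arithmetic inside it is arbitrary.

```python
from typing import Callable, List, Tuple


def generate(forward: Callable[[List[int]], int],
             prompt: List[int], n_new: int) -> Tuple[List[int], int]:
    """Plain autoregressive decoding: one forward pass per new token."""
    tokens = list(prompt)
    passes = 0
    for _ in range(n_new):
        # The next token cannot be computed until every earlier token exists.
        next_token = forward(tokens)
        passes += 1
        tokens.append(next_token)
    return tokens, passes


def toy_forward(tokens: List[int]) -> int:
    # Hypothetical stand-in for a model: the next token depends on the last one.
    return (tokens[-1] * 31 + 7) % 1000


out, passes = generate(toy_forward, [1], 100)
print(passes)  # one forward pass for each of the 100 new tokens
```

However fast each pass is, the loop body cannot be parallelized across iterations, because each call needs the token produced by the previous one.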
But there's an exploitable gap hidden here. Having the model verify several already-written tokens takes almost the same time as having it generate one new token, because the extra tokens ride along in the same forward pass. Generation is stuck on step-by-step dependency, but verification can check a whole batch in parallel in one shot.
Generating: must wait for the previous token; a full forward pass buys back just one token. Expensive, and strictly serial.
Verifying: feed the whole batch in at once and check in parallel; cost ≈ one forward pass. A whole string checked in one go.
Speculative decoding slips in through exactly this gap.
Guess a Batch First, Then Confirm It All at Once
Speculative decoding's idea is counterintuitive: instead of making the large model dutifully write one token at a time, you first bring in a blazing-fast "drafter" to guess several upcoming tokens in one shot, then let the large model verify the whole batch at once.
The cost of verifying K tokens ≈ generating 1 token. So as long as the drafter guesses accurately enough, each verification the main model performs confirms several tokens at once — folding multiple steps into one. Adding the "guess" step actually shortens the total time.
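A back-of-the-envelope cost model makes the trade-off concrete. It is a sketch under stated assumptions: one verification costs about one main-model pass (as the text says), and `draft_cost`, the cost of one draft step as a fraction of a main-model pass, is a hypothetical parameter with made-up values.

```python
def speculative_speedup(tokens_per_round: float, k: int, draft_cost: float) -> float:
    """Speedup over plain one-token-per-pass decoding.

    tokens_per_round: average tokens confirmed per verification round
    k:                number of tokens drafted per round
    draft_cost:       cost of one draft step, relative to one main-model pass
                      (assumed value, not a DeepSeek figure)
    """
    # One round = k cheap draft steps + one verification (about one main pass).
    round_cost = k * draft_cost + 1.0
    return tokens_per_round / round_cost


# Illustrative numbers only.
gain = speculative_speedup(tokens_per_round=3.0, k=4, draft_cost=0.05)
loss = speculative_speedup(tokens_per_round=1.0, k=4, draft_cost=0.05)
print(f"good drafter: {gain:.2f}x, useless drafter: {loss:.2f}x")
```

The second call shows the other side of "accurately enough": if every guess is rejected, each round still yields one token from the main model, but the drafting time is pure overhead and decoding gets slower.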
Who the Drafter Is, and What Counts as Guessing Well
This "drafter" isn't a separate small model — it's an add-on module bolted on during DeepSeek's training, called the MTP (Multi-Token Prediction) head. It's lightweight, can predict the probabilities of several upcoming tokens at once, and runs far faster than the main model, making it a natural fit for "quick drafting."
The share of the drafter's guessed tokens that pass the main model's verification is called the acceptance rate. It depends on how close the drafter's and the main model's distributions are, that is, how closely the two "think alike." The higher the acceptance rate, the more useful tokens each verification round nets, and the bigger the speedup.
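To see how acceptance rate translates into tokens per round, here is a small sketch under a simplifying assumption that is not from the source: each drafted token is accepted independently with the same probability `alpha`. It also assumes the main model always contributes one token per round (its correction at the first miss, or one extra token when every guess passes). Real acceptance is position- and context-dependent, so treat the numbers as illustrative.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens confirmed per verification round.

    alpha: assumed per-token acceptance probability (independent across positions)
    k:     number of tokens drafted per round
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    # The first i drafted tokens all pass with probability alpha**i;
    # the i = 0 term is the one token the main model always supplies.
    return sum(alpha ** i for i in range(k + 1))


one_step = expected_tokens_per_round(alpha=0.8, k=1)
four_step = expected_tokens_per_round(alpha=0.8, k=4)
print(f"k=1: {one_step:.2f} tokens/round, k=4: {four_step:.2f} tokens/round")
```

Under this model the return from drafting further ahead shrinks with each extra position, since a later guess only counts if every earlier one was accepted.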
Grading is faster than writing exam questions. Setting K questions in a batch and grading them together is far more efficient than writing one question, grading it, then writing the next. Speculative decoding lets the fast drafter "set the questions" in bulk and lets the large model "grade" them all at once.
MTP-1 Is Already in Use — So Where Does DSpark Get Faster
MTP-1 is the scheme already running in DeepSeek-V4's production environment, and it guesses just one step per round: guess one, wait for the main model to verify, then guess the next. DSpark makes two cuts on top of it.
First cut, guess further ahead: the draft head probes several more tokens ahead at once, so a single verification round confirms more tokens.
Second cut, overlap the work: turn "guess" and "verify" from queuing into a pipeline, so that while one batch is being verified, the next is already being guessed.
The 60–85% gain comes mainly from the second cut. In MTP-1, between "guess then wait to verify" and "verify then guess again" there's a window of idle waiting, with the two streams taking turns sitting idle. DSpark fills that window: the draft stream and the verify stream run overlapped on the timeline, neither waiting on the other.
A car assembly line. While one car gets its wheels mounted, the next car's chassis is already being painted — you don't wait for one car to be fully finished before starting the next. DSpark makes the GPU work the same way: while one batch is being verified, the next is already being guessed, and the machine never sits idle.
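The effect of overlapping the two streams can be estimated with an idealized two-stage pipeline model. This is only a timing sketch: the durations are hypothetical, and it ignores how a pipelined drafter copes when a verification rejects part of the batch it was building on, a detail the source does not describe.

```python
def serial_time(rounds: int, draft: float, verify: float) -> float:
    """Guess, wait for verification, guess again: the stages take turns."""
    return rounds * (draft + verify)


def pipelined_time(rounds: int, draft: float, verify: float) -> float:
    """Idealized overlap: after the first round fills the pipeline,
    each further round costs only as much as the slower stage."""
    return draft + verify + (rounds - 1) * max(draft, verify)


# Hypothetical durations in arbitrary time units.
t_serial = serial_time(rounds=100, draft=0.3, verify=1.0)
t_pipe = pipelined_time(rounds=100, draft=0.3, verify=1.0)
print(f"serial: {t_serial:.1f}, pipelined: {t_pipe:.1f}")
```

In this model the drafting time all but disappears from the total once the pipeline is full, which is why drafting further ahead becomes cheaper when the two stages overlap.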
One Inference Round, From Guess to Locked In
Putting it all together, one full DSpark loop runs like this.
Verification compares from front to back: every consecutively correct token is accepted, and the moment the first wrong guess appears, the draft is truncated there. At the error position, the main model conveniently supplies the token it considers correct (that one is a freebie too); everything the drafter guessed after the wrong token is discarded, and the next round restarts from this position.
In a round where the drafter gets its first 4 guesses right, the result is 4 correctly guessed tokens + 1 token the main model fixed on the spot, netting 5 tokens for the cost of a single verification. The more accurately the drafter guesses, the more tokens are accepted and the fewer discarded, and the faster things run overall.
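The accept-and-truncate rule can be sketched in a few lines. This is a minimal version assuming exact-match (greedy) verification; the source does not say whether DSpark accepts by exact match or by a probabilistic test, and `verify_round` with its token IDs is hypothetical.

```python
from typing import List, Tuple


def verify_round(draft: List[int], target: List[int]) -> Tuple[List[int], int]:
    """Keep the longest correct prefix of the draft plus one main-model token.

    draft:  the K tokens guessed by the drafter
    target: the K + 1 tokens the main model picks at those positions and the
            one after, all obtained from a single batched forward pass
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must hold one more token than draft")
    accepted = 0
    for guess, truth in zip(draft, target):
        if guess != truth:
            break  # first wrong guess: everything after it is discarded
        accepted += 1
    # target[accepted] is the main model's own token at the first mismatch
    # (or the extra token after a fully accepted draft).
    return draft[:accepted] + [target[accepted]], accepted


# Six guesses, the fifth is wrong: 4 accepted + 1 corrected = 5 tokens kept.
kept, accepted = verify_round(
    draft=[11, 12, 13, 14, 99, 98],
    target=[11, 12, 13, 14, 15, 16, 17],
)
print(kept, accepted)
```

Only `target[accepted]` is used beyond the accepted prefix; the main model's later outputs were conditioned on the wrong guess, so they are thrown away along with the rest of the draft.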
Just How Much Faster
Back to that number. On the already-accelerated MTP-1 baseline, DSpark lifts per-user generation speed another 60 to 85%.
These figures are self-reported by DeepSeek for the DeepSeek-V4 production environment. Note that both ends of the comparison are already "accelerated" states — the range itself is an increment stacked on top of MTP-1.
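A speed gain is not the same as an equal cut in waiting time, so it can help to convert one into the other. The snippet below is plain arithmetic on the reported 60–85% range and assumes the gain applies uniformly to per-token generation time.

```python
def time_reduction(speed_gain: float) -> float:
    """Fractional drop in time per token for a given fractional speed gain."""
    # Speed multiplies by (1 + gain), so time per token divides by it.
    return 1.0 - 1.0 / (1.0 + speed_gain)


low = time_reduction(0.60)
high = time_reduction(0.85)
print(f"+60% speed -> {low:.1%} less time per token")
print(f"+85% speed -> {high:.1%} less time per token")
```

So a 60–85% speedup corresponds to roughly 37.5% to 46% less time per token relative to the MTP-1 baseline.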
Who This Actually Helps
This is an inference-side system optimization, and its practical value splits two ways: users and service providers.
For users. What you see is streaming output, with the reply popping out token by token. As generation speeds up, the wait while text appears word by word gets shorter, a difference you notice directly.
For service providers. DSpark doesn't change model weights, so existing DeepSeek-V4 deployments can adopt it directly, with low migration cost. The same batch of GPUs can either handle higher concurrency or hit the original service level with less hardware.
DeepSeek Releases DSpark, a Speculative Decoding Framework That Accelerates DeepSeek-V4 Per-User Generation 60–85% Over MTP-1. Original headline · MarkTechPost / DeepSeek