Wan Streamer: Real-Time AI That Listens, Watches, and Speaks at Once
- Wan Streamer v0.1 is a natively streaming, end-to-end interactive foundation model that models language, audio, and video as both inputs and outputs inside a single Transformer, coordinated by block-causal attention, generating in a streaming fashion as data arrives.
- The model itself computes a chunk of response in about 200ms; add 350ms of round-trip network latency and total interaction latency is about 550ms. At 25fps, the shortest streaming unit is just 160ms.
- The team calls it the only model that outputs synchronized audio and video with a single end-to-end Transformer while keeping total latency under one second. Existing systems either output voice only (GPT-4o Realtime, Doubao, Gemini Live) or stitch together a chain of modules — ASR + LLM + TTS + animation.
- The current v0.1 runs at only 192p and is positioned as a proof of concept for the end-to-end design. The character demos released are unedited, pre-recorded model outputs — not a live online experience.
- The latency numbers in the comparison charts mix different measurement criteria: the top group is the end-to-end interaction loop, while the bottom group counts only the rendering stage (excluding external LLM/ASR/TTS). The team warns readers to interpret them with the annotated caveats in mind.
One Model Pulls Off Real-Time Audio-Video Conversation
The Tongyi Wanxiang (Wan) team recently released Wan Streamer v0.1, a real-time audio-video interaction model. There's no shortage of AI that can converse in real time now, but almost none can watch your face, listen to you, speak back, and bring its own moving face — all at once. Wan Streamer squeezes all of that into a single model.
Why it's worth a look: today's real-time conversational systems fall into two camps. One responds fast but outputs only sound, with no visible face (GPT-4o Realtime, Doubao, Gemini Live); the other has a face but stitches it together from a chain of external modules — ASR, a language model, TTS, animation. The team calls Wan Streamer the only model that emits synchronized audio and video from a single end-to-end Transformer while keeping total latency under one second.
Four Character Demos + a Live Screen Recording
The four demos below all come from the same model — only the character, voice, and setting change. See the results first, then the how.
This next one is different: a screen recording of a real networked conversation, where you can see the full real-time response pipeline.
Why the Old Approach Is Slow: A Relay of Handoffs, Waiting at Every Step
The old approach is slow because it's a pipeline stitched from separate models: speech is first turned into text (ASR), the text is fed to a language model to think up an answer (LLM), the answer is synthesized back into speech (TTS), and finally a face is driven to move (animation rendering). Every stage has to wait for the previous one to deliver, the waits pile up segment by segment, and errors in recognition and lip-sync accumulate all along the way.
Perceive · reason · plan · generate, all together
End-to-end is like one person hearing you out and replying directly; the cascade is like a game of telephone, where every relay is a beat slower and may garble the message. The cascade's middle layer first turns speech and video into text and then uses that text to drive everything downstream, so text is the hidden bridge between the modules, and the more bridges there are, the slower and more error-prone the system gets. In a traditional cascade, text is the intermediate representation between several independently trained modules; Wan Streamer drops that intermediate bridge and couples the modalities directly.
The team makes a judgment call here: real-time audio-video interaction isn't simply "multimodal understanding" plus "multimodal generation". It is fundamentally full-duplex, so streamability is a modeling constraint, not just a post-launch engineering optimization. A system built on offline encoders, bidirectional decoders, and turn-based dialogue can't engineer its way to genuine low-latency full duplex. That's exactly the core the next section unpacks.
Its Approach: One Model Covers Everything From Listening to Speaking
Wan Streamer's core is a single sentence: interleave the input tokens and output tokens of vision, audio, and text into one sequence and hand it to a single Transformer; coordinate it with block-causal attention so it computes and emits as data arrives.
A single end-to-end Transformer removes the external modules — VAD, ASR, the language model, TTS, animation, video generation — and jointly optimizes perception, reasoning, response planning, speech and visual generation, response timing, and turn-taking inside one persistent state. Low latency, full duplex, and synchronized audio-video all trace their root back to this.
The model treats the interaction as one continuous causal stream: your observations and its responses jointly update the current context. In each streaming unit it encodes whatever user observations are available so far, then predicts the next chunk of response based on the full causal history of both sides. The language response is a string of discrete tokens trained with next-token prediction; the audio and video responses live in a continuous latent space and are jointly generated with conditional flow matching, conditioned on the same clean context, so that speech, motion, appearance, and scene evolution are denoised together as one coupled whole rather than generated separately and then stitched.
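For readers unfamiliar with conditional flow matching, one common form of the objective is written below. This is a sketch under assumptions the source does not state: a linear interpolation path between Gaussian noise $x_0$ and the clean audio-video latent $x_1$ of one streaming unit, with $c$ standing for the shared clean context and $v_\theta$ for the model's velocity prediction. The symbols and the choice of path are illustrative, and the team's actual parameterization may differ.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1]

\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1,\,c}\,\bigl\| v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\|^2
```

Under this formulation, generation integrates $\mathrm{d}x/\mathrm{d}t = v_\theta(x, t, c)$ from $t = 0$ to $t = 1$. Because $x_1$ holds the audio and video latents of a unit together, a single velocity field denoises both, which is one way to read "denoised together as one coupled whole".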
To support this stream, the whole stack is causal by design from the start: a strictly causal audio-video VAE for streaming latent encoding, a causal audio-video encoder, a causal audio-video decoder, and a temporally causal Transformer coordinated by block-causal attention. After denoising, the estimated clean latents are appended directly to the history as context for later units; the causal decoder then renders them into the final audio and video. The external modules this design removes are VAD, ASR, the language model, TTS, animation, and video generation.
How It Listens and Speaks at Once, and Can Be Interrupted Anytime
Human interaction with the world is inherently streaming and full-duplex: we don't listen to the end, then think separately, then finally answer — we watch, listen, and speak all at once, pausing and interrupting at any moment, with perception and expression overlapping on the timescale of audio and video. A real-time interaction model has to be built the same way.
A causal encoder + a causal decoder + low-latency multimodal token scheduling shrink the streaming unit at 25fps down to 160ms: input speech and video immediately affect the output, and the generated audio and visual states are coupled before decoding rather than patched up after the fact; every emitted unit is written back to the interaction history. So it can listen and speak at once — while you talk it's still listening, and it can adjust when interrupted.
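The loop described above (observe one unit, respond from the full two-sided history, write the response back) can be sketched in a few lines. This is a toy illustration, not the model: `Unit`, `toy_model`, and `run_stream` are hypothetical names, the payload strings stand in for encoded latents, and the "yield while the user speaks" rule is an invented placeholder for learned turn-taking.

```python
from dataclasses import dataclass

UNIT_MS = 160  # streaming unit duration stated in the article


@dataclass(frozen=True)
class Unit:
    index: int    # which 160ms unit this is
    source: str   # "user" observation or "model" response
    payload: str  # stand-in for encoded audio-video latents


def toy_model(history: list[Unit]) -> str:
    """Hypothetical stand-in for the Transformer: yield while the user speaks."""
    latest_user = next(u for u in reversed(history) if u.source == "user")
    return "listen" if latest_user.payload == "speech" else "speak"


def run_stream(user_inputs: list[str]) -> list[Unit]:
    history: list[Unit] = []
    for i, observation in enumerate(user_inputs):
        # 1. Append the user's observation for this unit to the shared history.
        history.append(Unit(i, "user", observation))
        # 2. Predict this unit's response from the full two-sided history.
        response = toy_model(history)
        # 3. Write the emitted unit back so later units can condition on it.
        history.append(Unit(i, "model", response))
    return history


history = run_stream(["silence", "silence", "speech", "speech", "silence"])
responses = [u.payload for u in history if u.source == "model"]
for u in history:
    if u.source == "model":
        print(f"t={u.index * UNIT_MS}ms -> {u.payload}")
```

The point of the sketch is the ordering: the user's input is read in every unit, including units where the model is speaking, so an interruption at unit 2 changes the response in that same unit instead of waiting for the model's turn to end.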
Full duplex is like a normal phone call: you can cut in while the other person is still talking. The walkie-talkie style — "let go before you can listen" — is half duplex.
This mechanism works thanks to block-causal attention. It treats a small block (say a 160ms audio-video segment) as one processing unit: tokens inside a block can see each other (bidirectionally), but a block can only see past blocks, not future ones. This keeps the within-block context while still computing as data arrives, without waiting for the whole segment to finish.
[Figure: four consecutive 160ms blocks]
Tokens within a block see each other; between blocks the view only looks left, into the past: block 3 can start computing the moment it arrives, because it depends only on blocks 1 and 2 — no need to wait for the future block 4. That's streaming generation.
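The visibility rule described above can be written as a boolean mask. The following is a minimal sketch in plain Python; the function name `block_causal_mask` and the tiny sizes (6 tokens, 2 tokens per block) are illustrative assumptions, not values from Wan Streamer.

```python
def block_causal_mask(num_tokens: int, block_size: int) -> list[list[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k.

    A query sees every token in its own block (bidirectional inside the block)
    and every token in earlier blocks, but nothing in later blocks.
    """
    mask = []
    for q in range(num_tokens):
        q_block = q // block_size
        row = [(k // block_size) <= q_block for k in range(num_tokens)]
        mask.append(row)
    return mask


mask = block_causal_mask(num_tokens=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Compared with an ordinary token-level causal mask, the only difference is the staircase step: it advances one block at a time rather than one token at a time, which is what lets tokens inside a block attend to each other.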
It's like thinking in "phrases" as you speak: you weigh the words inside one phrase together, but you can't foresee the next phrase that hasn't left your mouth yet. Two matching causal pieces, the causal encoder and the causal decoder, let both perception and generation move forward as data arrives. They see only the past, never the future: an ordinary encoder needs a complete segment before it can encode, while the causal version encodes as it receives, like a simultaneous interpreter translating while listening, without waiting for the speech to end.
Deployment Details: How Thinker–Performer Squeezes Latency Down to 200ms
Wan Streamer is a single end-to-end model at training time; for real-time deployment, the same model is split into a thinker–performer pipeline across two GPUs to overlap computation as much as possible. Once the system finishes prefill, the thinker broadcasts the initial KV-cache to the performer; the two share the same full-history state, and the unified model's behavior is fully preserved.
The thinker handles the causal audio-video encoder, one short computation for language prediction and state updates, KV-cache construction, and decoding the previous unit's latents into audio-video for immediate output. The performer handles only latent generation, running the flow-matching solver for the next audio-video unit on the shared full-history KV context. Because the performer never runs the decoder and the thinker never runs the costly solver, decoding and generation don't block each other.
As long as the performer's time plus the communication time fits within one 160ms unit, real-time throughput holds. The signal-to-signal path (encode → state update → latent generation → decode) accounts for the roughly 200ms of model-side latency, kept within budget via CUDA graph capture, compilation, and optimized operators.
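The throughput condition and the latency figures quoted in this article reduce to simple arithmetic, sketched below. The four constants come from the article; the function name `sustains_real_time` and the performer and communication timings passed to it are hypothetical values chosen only to show both outcomes.

```python
FPS = 25                 # frame rate stated in the article
UNIT_MS = 160            # streaming unit duration stated in the article
MODEL_LATENCY_MS = 200   # approximate model-side latency stated in the article
NETWORK_RTT_MS = 350     # round-trip network latency stated in the article

# One 160ms unit at 25fps covers 4 video frames.
frames_per_unit = UNIT_MS * FPS // 1000

# Total interaction latency is model-side latency plus the network round trip.
total_latency_ms = MODEL_LATENCY_MS + NETWORK_RTT_MS


def sustains_real_time(performer_ms: float, comm_ms: float,
                       unit_ms: float = UNIT_MS) -> bool:
    """Throughput holds when generating one unit takes no longer than playing it."""
    return performer_ms + comm_ms <= unit_ms


print(frames_per_unit)               # 4
print(total_latency_ms)              # 550
print(sustains_real_time(120, 20))   # True  (hypothetical timings)
print(sustains_real_time(150, 20))   # False (hypothetical timings)
```

Note that the two budgets are separate: the 160ms bound governs sustained throughput (how fast units must be produced), while the roughly 200ms figure is the one-off delay along the signal-to-signal path.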
Versus Other Systems: Where It's Faster and What It Can Do
The two groups of latency numbers below measure different things and must be read separately. The top group is the complete end-to-end interaction loop (perceiving the user and producing a response), and within it only Wan Streamer also outputs video; the bottom group is digital-human / audio-video renderers, counted only up to the rendering stage, excluding the external language model, ASR, and TTS they depend on — so the latency users actually feel is higher than the chart shows.
Coverage across capability dimensions is below, and Wan Streamer is the only row checked all the way across:
| System | Perceive video | Output video | Full duplex | End-to-end | Sub-second response |
|---|---|---|---|---|---|
| Wan Streamer | ✓ | ✓ | ✓ | ✓ | ✓ |
| Doubao Voice | ✓ | ✗ | ✓ | ✗ | ~ |
| GPT-4o Realtime | ✓ | ✗ | ~ | ✗ | ✓ |
| StreamAvatar | ~ | ✓ | ~ | ✗ | ✗ |
| LPM 1.0 | ~ | ✓ | ✓ | ✗ | ~ |
✓ = yes; ~ = partial or not publicly disclosed; ✗ = no. Full duplex means the system keeps perceiving while generating, understanding and responding at the same time.
"Streamability is a modeling constraint, not merely a deployment optimization: a system built on offline encoders, bidirectional decoders, or turn-based dialogue can hardly engineer its way back to genuine low-latency full-duplex capability." (Wan Streamer · The Full-Duplex Challenge)