Research Explainer · Xiaohu Explainer

Wan Streamer: Real-Time AI That Listens, Watches, and Speaks at Once

Model-side response ~200ms, total latency ~550ms; v0.1 is only 192p, and the demos are pre-recorded, not a live experience.

At a Glance

Wan Streamer v0.1 is a natively streaming, end-to-end interactive foundation model. A single Transformer models language, audio, and video as both inputs and outputs, coordinated by block-causal attention, and generates its response as data arrives.
The model itself computes a chunk of response in about 200ms; add 350ms of round-trip network latency and total interaction latency is about 550ms. At 25fps, the shortest streaming unit is just 160ms.
The team calls it the only model that outputs synchronized audio and video with a single end-to-end Transformer while keeping total latency under one second. Existing systems either output voice only (GPT-4o Realtime, Doubao, Gemini Live) or stitch together a chain of modules — ASR + LLM + TTS + animation.
The current v0.1 runs at only 192p and is positioned as a proof of concept for the end-to-end design. The character demos released are unedited, pre-recorded model outputs — not a live online experience.
The latency numbers in the comparison charts mix different measurement criteria: the top group is the end-to-end interaction loop, while the bottom group counts only the rendering stage (excluding external LLM/ASR/TTS). The team cautions that they should be read with the annotated caveats in mind.

This is Wan Streamer's official release page (a vendor announcement). The latency figures, capability comparisons, and positioning language like "the only" all come from the team, and some comparison numbers mix different measurement boundaries — flagged inline where relevant.

1. What This Is

One Model Pulls Off Real-Time Audio-Video Conversation

The Tongyi Wanxiang (Wan) team recently released Wan Streamer v0.1, a real-time audio-video interaction model. There's no shortage of AI that can converse in real time now, but almost none can watch your face, listen to you, speak back, and bring its own moving face — all at once. Wan Streamer squeezes all of that into a single model.

Inside a single Transformer it handles language, audio, and video as both input and output at once, achieving sub-second full-duplex audio-video conversation: the model itself computes a chunk of response in about 200 milliseconds, and total latency after the network round-trip is about 550 milliseconds.

Why it's worth a look: today's real-time conversational systems fall into two camps. One responds fast but outputs only sound, with no visible face (GPT-4o Realtime, Doubao, Gemini Live); the other has a face but stitches it together from a chain of external modules — ASR, a language model, TTS, animation. The team calls Wan Streamer the only model that emits synchronized audio and video from a single end-to-end Transformer while keeping total latency under one second.

~200 ms · Model-side response latency (team-reported)
~550 ms · Total interaction latency = 200ms model-side + 350ms network
160 ms · Shortest streaming unit at 25fps
192p · v0.1 resolution, a proof of concept for the end-to-end design

Breaking down the 550ms total latency

Model-side 200ms

Round-trip network 350ms

The model itself accounts for only 200ms; the other 350ms is the network round-trip. In other words, the model's raw reaction speed is faster than the headline total latency suggests.
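
The breakdown above is simple arithmetic, and a few lines make it checkable. The sketch below uses only the team-reported figures (200ms, 350ms, 160ms, 25fps). The function names are illustrative, and the frame count per unit is derived here on the assumption that a unit is simply 160ms of 25fps video, which the release page does not spell out.

```python
MODEL_SIDE_MS = 200   # team-reported model-side response latency
NETWORK_RTT_MS = 350  # team-reported round-trip network latency
UNIT_MS = 160         # shortest streaming unit quoted for 25fps
FPS = 25              # frame rate quoted alongside the streaming unit


def total_latency_ms(model_ms, network_ms):
    """Total interaction latency as the plain sum used in the article."""
    return model_ms + network_ms


def frames_per_unit(unit_ms, fps):
    """Video frames covered by one streaming unit (assumes unit = unit_ms of video)."""
    frame_ms = 1000 / fps  # 40 ms per frame at 25fps
    return unit_ms / frame_ms


if __name__ == "__main__":
    total = total_latency_ms(MODEL_SIDE_MS, NETWORK_RTT_MS)
    print(f"total latency: {total} ms")
    print(f"network share of total: {NETWORK_RTT_MS / total:.0%}")
    print(f"frames per unit: {frames_per_unit(UNIT_MS, FPS):g}")
```

Under that assumption one streaming unit spans four 40ms frames, and the network round-trip makes up roughly 64% of the 550ms total.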

2. See It in Action

Four Character Demos + a Live Screen Recording

The four demos below all come from the same model — only the character, voice, and setting change. See the results first, then the how.

Before you watch, to be clear: these are all unedited, pre-recorded model outputs, not a live online experience. The current v0.1 runs at only 192p, used to test whether the end-to-end design holds up. The team says scaling to higher resolution later should be relatively easy — but that's a plan, not something v0.1 has already done.

Chinese · Warm indoor video call. Chatting about shaving, working from home, wanting to catch a new action movie with great effects. Clear, natural male voice. Source: Wan Streamer official demo

Chinese · Bright white interior. Chatting about ships (CP), showbiz gossip, Stephen Chow's Kung Fu Hustle, ending with an imitation of the iconic grin. Light, cheerful female voice. Source: Wan Streamer official demo

English · Close-up in a car. A woman says she's exhausted and thanks the other person for their patient company. Tired, sincere female voice. Source: Wan Streamer official demo

English · Close-up in a light-toned interior. Chatting about mindless phone-scrolling, automatic habits, turning off notifications. Natural female voice. Source: Wan Streamer official demo

This next one is different: a screen recording of a real networked conversation, where you can see the full real-time response pipeline.

Real networked conversation recording: the local user is on the left, the AI Agent's real-time response on the right, with a synchronized scrolling text stream below. Video compressed for the web. Source: Wan Streamer official

3. Where the Old Approach Gets Stuck

Why the Old Approach Is Slow: A Relay of Handoffs, Waiting at Every Step

The old approach is slow because it's a pipeline stitched from separate models: speech is first turned into text (ASR), the text is fed to a language model to think up an answer (LLM), the answer is synthesized back into speech (TTS), and finally a face is driven to move (animation rendering). Every stage has to wait for the previous one to deliver, the waits pile up segment by segment, and errors in recognition and lip-sync accumulate all along the way.

Old Approach · Cascaded Pipeline

Audio-video input

↓ wait

ASR recognition

↓ wait

LLM thinks up answer

↓ wait

TTS synthesizes speech

↓ wait

Animation / rendering

↓

Output

Every arrow is one wait plus one round of error accumulation. The modules bridge to each other through text. Most systems output voice only, or barely stitch a face together, and don't report end-to-end latency.

Wan Streamer · End-to-End Single Model

Audio-video input

↓

A single Transformer
Perceive · reason · plan · generate, all together

↓

Synchronized audio-video output

No seams, and the waits collapse; turn-taking, interruption, and long-range consistency are all learned together as one coherent behavior.

An analogy

End-to-end is like one person hearing you out and replying directly; the cascade is like a game of telephone, where every relay is a beat slower and may garble the message. The cascade's middle layer first turns speech/video into text and then uses text to drive everything downstream, so text is the hidden bridge between the modules, and the more bridges there are, the slower and more error-prone it gets.

Hidden bridge: in a traditional cascade, text is the intermediate representation between several independently trained modules. Wan Streamer drops that intermediate bridge and couples the modalities directly.

The release page makes a judgment call here: real-time audio-video interaction isn't simply "multimodal understanding" plus "multimodal generation". It is fundamentally full-duplex, so streamability is a modeling constraint, not just a post-launch engineering optimization. A system built on offline encoders, bidirectional decoders, and turn-based dialogue can't engineer its way to genuine low-latency full duplex. That is the core idea the next section unpacks.

4. The Core Innovation

Its Approach: One Model Covers Everything From Listening to Speaking

Wan Streamer's core idea fits in a single sentence: interleave the input tokens and output tokens of vision, audio, and text into one sequence and hand it to a single Transformer, coordinated by block-causal attention so that it computes and emits as data arrives.

Hero · Shared Root

A single end-to-end Transformer removes the external modules — VAD, ASR, the language model, TTS, animation, video generation — and jointly optimizes perception, reasoning, response planning, speech and visual generation, response timing, and turn-taking inside one persistent state. Low latency, full duplex, and synchronized audio-video all trace their root back to this.

Vision tokens · Audio tokens · Text tokens

The model treats the interaction as one continuous causal stream: your observations and its responses jointly update the current context. In each streaming unit it encodes whatever user observations are available so far, then predicts the next chunk of response based on the full causal history of both sides. The language response is a string of discrete tokens trained with next-token prediction; the audio and video responses live in a continuous latent space and are jointly generated with conditional flow matching, conditioned on the same clean context, so that speech, motion, appearance, and scene evolution are denoised together as one coupled whole rather than generated separately and then stitched.

Wan Streamer's overall framework: language, audio, and video are modeled as both input and output inside a single Transformer, coordinated by block-causal attention, generating in a streaming fashion as data arrives. Source: Wan Streamer official
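
To make the flow-matching step a little more concrete, here is a toy sketch of sampling by Euler integration along a straight noise-to-data path. It is not the team's solver: `oracle_velocity` is a hypothetical stand-in that already knows the clean target, where the real model would predict the velocity from the causal context, and the latent values and dimensions are made up. What it does illustrate is the coupling idea, with audio and video latents concatenated into one vector and denoised in the same steps.

```python
import random


def oracle_velocity(x, t, target):
    """Stand-in for the trained network: velocity of the straight path toward target."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(target, steps=8, seed=0):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (clean latent) with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division above is safe
        v = oracle_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    audio_latent = [0.5, -1.0]            # hypothetical clean audio latent
    video_latent = [2.0, 0.0, 1.5]        # hypothetical clean video latent
    joint = audio_latent + video_latent   # one coupled vector, denoised together
    sample = euler_sample(joint)
    print(max(abs(s - g) for s, g in zip(sample, joint)))  # distance from the target
```

Because the oracle path is a straight line, the Euler steps land on the target up to floating-point rounding. A learned velocity field would only approximate this, which is where the step count and solver choice start to matter.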

To support this stream, the whole stack is causal by design from the start: a strictly causal audio-video VAE for streaming latent encoding, a causal audio-video encoder, a causal audio-video decoder, and a temporally causal Transformer coordinated by block-causal attention. After denoising, the estimated clean latents are appended directly to the history as context for later units; the causal decoder then renders them into the final audio and video. The external modules this design erases are:

External VAD · ASR recognition · External language model · TTS synthesis · Animation module · Video generation module

5. How It Listens and Speaks at Once

How It Listens and Speaks at Once, and Can Be Interrupted Anytime

Human interaction with the world is inherently streaming and full-duplex: we don't listen to the end, then think separately, then finally answer — we watch, listen, and speak all at once, pausing and interrupting at any moment, with perception and expression overlapping on the timescale of audio and video. A real-time interaction model has to be built the same way.

Hero · Rebuilt Full Duplex

A causal encoder + a causal decoder + low-latency multimodal token scheduling shrink the streaming unit at 25fps down to 160ms: input speech and video immediately affect the output, and the generated audio and visual states are coupled before decoding rather than patched up after the fact; every emitted unit is written back to the interaction history. So it can listen and speak at once — while you talk it's still listening, and it can adjust when interrupted.

Half-duplex (turn-based) · listen only after speaking ends

Single channel

Listen

Speak

Listen

Full-duplex · listening and speaking overlap

Perceive

Continuously perceiving the user

Respond

Generating speech + video + motion in sync

The overlap zone is the key: while the user is talking the agent is still listening, so it can be interrupted and adjust on the fly; every generated unit is written back to the interaction history and becomes context for the next step.

Full duplex · an analogy

Full duplex is like a normal phone call: you can cut in while the other person is still talking. The walkie-talkie style — "let go before you can listen" — is half duplex.
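
The overlap can be sketched as a loop that perceives in every streaming unit, even while the agent is speaking. Everything here is a toy under stated assumptions: the hand-written rule "stop talking when the user speaks" stands in for behavior the real model learns, and the names `full_duplex_run`, `user_units`, and `planned_reply` are hypothetical.

```python
def full_duplex_run(user_units, planned_reply):
    """Walk the stream unit by unit; input is read in every unit, even mid-reply."""
    history = []                   # shared interaction history of both sides
    pending = list(planned_reply)  # chunks the agent still intends to say
    emitted = []
    for obs in user_units:
        history.append(("user", obs))   # perceive this unit's input first
        if obs == "speech" and pending:
            pending.clear()             # toy rule: the user cut in, so yield the floor
            out = "silence"
        elif pending:
            out = pending.pop(0)        # otherwise keep speaking the next chunk
        else:
            out = "silence"
        history.append(("agent", out))  # written back as context for later units
        emitted.append(out)
    return emitted, history


if __name__ == "__main__":
    units = ["quiet", "quiet", "speech", "quiet"]
    emitted, history = full_duplex_run(units, ["Hel", "lo ", "the", "re"])
    print(emitted)
    print(len(history))
```

A half-duplex loop would skip the perceive step while `pending` is non-empty, so the interruption in the third unit would go unnoticed until the reply had finished.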

This mechanism works thanks to block-causal attention. It treats a small block (say a 160ms audio-video segment) as one processing unit: tokens inside a block can see each other (bidirectionally), but a block can only see past blocks, not future ones. This keeps the within-block context while still computing as data arrives, without waiting for the whole segment to finish.

Block 1 (160ms) ← Block 2 (160ms) ← Block 3 (160ms) ← Block 4 (160ms)
↔ intra-block two-way · ← each block sees past blocks only

Tokens within a block see each other; between blocks the view only looks left, into the past: block 3 can start computing the moment it arrives, because it depends only on blocks 1 and 2 — no need to wait for the future block 4. That's streaming generation.
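
The visibility rule just described can be written as an attention mask. This is a minimal sketch in plain Python, assuming equal-sized blocks and a flat token index; the helper names and the tiny block size are illustrative, not taken from the release.

```python
def block_causal_mask(num_blocks, block_size):
    """mask[q][k] is True when query token q may attend to key token k."""
    n = num_blocks * block_size
    mask = []
    for q in range(n):
        q_block = q // block_size
        # A token sees every token in its own block and in all earlier blocks.
        mask.append([(k // block_size) <= q_block for k in range(n)])
    return mask


def render(mask):
    """Draw the mask with 'x' for visible and '.' for hidden positions."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


if __name__ == "__main__":
    # Four blocks of two tokens each; real blocks would hold many more tokens.
    print(render(block_causal_mask(num_blocks=4, block_size=2)))
```

The rendered mask is a staircase of square blocks rather than the strict lower triangle of ordinary causal attention: within a block a token can look ahead to its neighbors, but never into a later block.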

Block-causal · an analogy

It's like thinking in "phrases" as you speak: you weigh the words inside one phrase together, but you can't foresee the next phrase that hasn't left your mouth yet. Two matching causal pieces, the causal encoder and the causal decoder, let both perception and generation move forward as data arrives.

Causal encoder/decoder: they see only the past, never the future. An ordinary encoder needs a complete segment before it can encode; the causal version can encode as it receives, like a simultaneous interpreter translating while listening, without waiting for the speech to end.

Deployment details: how thinker–performer squeezes latency down to 200ms

Wan Streamer is a single end-to-end model at training time; for real-time deployment, the same model is split into a thinker–performer pipeline across two GPUs to overlap computation as much as possible. Once the system finishes prefill, the thinker broadcasts the initial KV-cache to the performer; the two share the same full-history state, and the unified model's behavior is fully preserved.

The thinker handles the causal audio-video encoder, one short computation for language prediction and state updates, KV-cache construction, and decoding the previous unit's latents into audio-video for immediate output. The performer handles only latent generation, running the flow-matching solver for the next audio-video unit on the shared full-history KV context. Because the performer never runs the decoder and the thinker never runs the costly solver, decoding and generation don't block each other.

As long as the performer's time plus the communication time fits within one 160ms unit, real-time throughput holds. And the signal-to-signal path of "encode → state update → latent generation → decode" is the roughly 200ms model-side latency, kept within budget via CUDA graph capture, compilation, and optimized operators.

Thinker–performer streaming inference overlap: in unit k, the thinker encodes the current observation, updates the KV-cache, and decodes the previous unit's latents for immediate output; the performer only runs the flow-matching solver to generate the next latents, returned in the following unit. Perception, decoding, communication, and denoising overlap across adjacent units. Source: Wan Streamer official
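
The overlap and the throughput condition can be laid out in a few lines. This is a sketch under assumptions: latents are labelled by the unit in which the performer generates them (a labelling chosen here, not the paper's notation), and the timing values passed to `keeps_real_time` are made-up inputs, not measurements.

```python
UNIT_MS = 160  # streaming unit length quoted in the article


def keeps_real_time(performer_ms, comm_ms, unit_ms=UNIT_MS):
    """Throughput condition from the text: solver plus communication fits in one unit."""
    return performer_ms + comm_ms <= unit_ms


def schedule(num_units):
    """Work split per unit; latents generated in unit k are decoded in unit k + 1."""
    rows = []
    for k in range(num_units):
        rows.append({
            "unit": k,
            "thinker_encodes_obs": k,                             # current observation
            "thinker_decodes_latents": k - 1 if k > 0 else None,  # previous unit's latents
            "performer_generates_latents": k,                     # returned in the next unit
        })
    return rows


if __name__ == "__main__":
    for row in schedule(3):
        print(row)
    # Hypothetical timings, for illustration only.
    print(keeps_real_time(performer_ms=120, comm_ms=20))
    print(keeps_real_time(performer_ms=150, comm_ms=20))
```

The one-unit offset between generation and decoding is what lets the two GPUs work in parallel. The condition itself only bounds throughput; the roughly 200ms model-side latency is a separate budget along the encode-to-decode path.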

6. By the Numbers

Versus Other Systems: Where It's Faster and What It Can Do

The two groups of latency numbers below measure different things and must be read separately. The top group is the complete end-to-end interaction loop (perceiving the user and producing a response), and within it only Wan Streamer also outputs video; the bottom group is digital-human / audio-video renderers, counted only up to the rendering stage, excluding the external language model, ASR, and TTS they depend on — so the latency users actually feel is higher than the chart shows.

End-to-end interaction loop · perceive→respond (includes network, shorter is better)

Wan Streamer · voice + video · 0.55s (model-side 0.2s)
GPT-4o Realtime · voice only · ~0.8s (model-side 0.23s)
Doubao Voice · voice only · ~1.0s (model-side 0.7s)
Gemini Live · voice only · 1.2–3.6s

Rendering stage only · excludes external LLM/ASR/TTS (real latency is higher)

LPM 1.0 · render only · ~0.35s
OmniForcing · render only · ~0.7s
Hallo-Live · render only · 0.94s
StreamAvatar · render only · ~1.2s

The two scales are independent; you can't compare magnitudes directly across the groups. The bottom group's numbers don't include the external "brain" — add that back and the real latency is clearly higher. The values are taken from the closest reported criteria in each system's public figures, mixing different measurement boundaries; see the paper for exact definitions.

The table below shows coverage across capability dimensions; Wan Streamer is the only row checked all the way across:

System	Perceive video	Output video	Full duplex	End-to-end	Sub-second response
Wan Streamer	✓	✓	✓	✓	✓
Doubao Voice	✓	✗	✓	✗	~
GPT-4o Realtime	✓	✗	~	✗	✓
StreamAvatar	~	✓	~	✗	✗
LPM 1.0	~	✓	✓	✗	~

✓ = yes; ~ = partially supported or not publicly disclosed; ✗ = no. Full duplex means the system keeps perceiving while generating: understanding and responding at the same time.

Streamability is a modeling constraint, not merely a deployment optimization: a system built on offline encoders, bidirectional decoders, or turn-based dialogue can hardly engineer its way back to genuine low-latency full-duplex capability.
Wan Streamer · The Full-Duplex Challenge

Source: Wan Streamer v0.1 official release page (wan-streamer.com), paper arXiv:2606.25041, June 2026. This piece is an explainer of vendor content; the latency and capability comparison numbers are labeled per the team's annotated criteria, with mixed measurement boundaries and unverified predictions flagged inline. The demo videos and framework diagram are cited directly from the official site.
