Product Launch · XiaoHu Explains

Meituan Releases LongCat-2.0: a 1.6-Trillion-Parameter Model Trained End-to-End on Domestic Chips, No NVIDIA GPUs

Training used over 50,000 domestic AI chips and 35 trillion tokens; most benchmark scores come from Meituan's own eval framework, and the weights aren't truly open for download yet.

30-second overview

On June 30, 2026, Meituan's LongCat team released and open-sourced LongCat-2.0, a large MoE model with 1.6 trillion total parameters and roughly 48 billion activated per token.
Both training and large-scale deployment run entirely on a cluster of more than 50,000 domestic AI ASIC chips, covering over 35 trillion tokens, with no NVIDIA GPUs used.
The architecture builds on LongCat-Flash, adding LongCat Sparse Attention (LSA) and a 135-billion-parameter N-gram Embedding to speed up long context and cut inference memory overhead.
Meituan's own benchmarks show it beating Gemini 3.1 Pro and GPT-5.5 on SWE-bench Pro and edging out Gemini 3.1 Pro on SWE-bench Multilingual (no GPT-5.5 score is listed there), while trailing Claude Opus 4.7 and 4.8 on both; on foundational skills like IFEval and GPQA-diamond it lags the top models.
As of publication the weights weren't actually up on HuggingFace, the vast majority of scores were measured in-house on Meituan's own eval framework, and independent third-party reproduction is still pending.

⚑ This is vendor content: the technical details come from Meituan's official LongCat blog, the vast majority of scores were measured in-house on Meituan's own eval framework, and only those marked with * are externally published values. Meituan didn't name the specific domestic chip vendor, the model weights weren't truly downloadable at the time of reporting, and independent third-party reproduction is still pending. Where data appears below, we label the source directly rather than repeating this caveat each time.

1. What happened

What Meituan actually did this time

On June 30, 2026, Meituan's LongCat team released and open-sourced LongCat-2.0, an ultra-large MoE (Mixture-of-Experts) language model with 1.6 trillion total parameters and roughly 48 billion activated per token.

The most remarkable part isn't the parameter count — it's where it runs: the entire training and large-scale deployment is built on a "superpod" cluster of more than 50,000 domestic AI ASIC chips, covering over 35 trillion tokens, without a single NVIDIA GPU.

🎯 Why it's worth a look: Since 2022 the U.S. has imposed AI-chip export controls on China. This is the first competitive trillion-parameter model publicly claimed to be trained entirely on domestic hardware. Meituan reports that, after systems optimization, training throughput is more than 35% higher than a naive implementation, and that the whole pre-training run had no rollbacks and no unrecoverable loss spikes, which it presents as direct evidence that frontier-scale training can be done on alternative hardware.

1.6T: Total parameters
50k+: Domestic AI chips
35T+: Training tokens

Trained fully on domestic compute clusters · NVIDIA GPU × 0

In their own words: "LongCat-2.0 proves that we now have the ability to train large-scale models on domestic compute clusters." The LongCat team was only founded in 2023, and its first model shipped just late last year.

2. Capability demo

What it can actually do now

Forget parameters and benchmarks for a second. A "codebase migration" demo from Meituan gives a more intuitive feel for its real capability: porting an entire plugin onto a new SDK — and making it actually run.

1. Reads the whole context at once: takes in the entire codebase and the migration docs together, not just snippets.
2. Maps out the existing architecture: works out how the plugin is currently organized and how the parts call each other.
3. Rewrites onto the new SDK: rewrites the whole plugin against the new interface while keeping every existing feature.
4. Catches bugs along the way: finds and fixes latent problems in the original code during the migration.
5. Compiles on the first build: not a pile of code a human has to keep tweaking, but right the first time.

Meituan also showed demo scenarios for code engineering, Agent-and-research, and content generation. Work like this demands two things of a model: it has to fit ultra-long inputs and stay consistent across a long chain — exactly the two directions its architecture focused on, which the next few sections break down.

3. Benchmark comparison

Where do the benchmarks actually land

Meituan compared LongCat-2.0 against several top closed models in a unified eval framework. Seeing where its strengths and weaknesses lie is more useful than any single number.

LongCat-2.0 benchmark comparison against Gemini 3.1 Pro / GPT-5.5 / the Claude Opus series. Source: official LongCat blog / The Decoder

Benchmark	LongCat-2.0	Gemini 3.1 Pro	GPT-5.5	Opus 4.6	Opus 4.7	Opus 4.8
Code Agent
Terminal-Bench 2.1	70.8	70.7*	73.8*	-	71.7*	78.9*
SWE-bench Pro	59.5	54.2*	58.6*	57.3*	64.3*	69.2*
SWE-bench Multilingual	77.3	76.9*	-	77.8*	80.5*	84.8*
General Agent
FORTE †	73.2	70.3	77.8	73.2	77.6	77.2
BrowseComp	79.9	85.9*	84.4*	84.0*	79.3*	84.3*
RWSearch	78.8	76.3	85.3	81.3	79.3	77.3
Foundational skills
IFEval	90.0	96.1	95.0	92.2	88.7	86.0
Writing Bench	83.8	83.7	84.7	-	85.3	85.2
IMO-AnswerBench	81.8	90.0	79.5	75.3*	81.8	75.3
GPQA-diamond	88.9	94.3*	93.6*	91.3*	94.2*	92.4

Source: values marked * are externally reported; the rest are Meituan's in-house tests in a unified harness; scores normalized to 0–100. † FORTE is a general Agent benchmark.

Code / Agent track: who LongCat-2.0 leads and trails (SWE-bench Pro)

Opus 4.8: 69.2
Opus 4.7: 64.3
LongCat-2.0: 59.5
GPT-5.5: 58.6
Gemini 3.1 Pro: 54.2

The read is straightforward: its strength is code and Agent work. On SWE-bench Pro (59.5) it beats Gemini 3.1 Pro (54.2), GPT-5.5 (58.6), and Opus 4.6 (57.3) but falls below Claude Opus 4.7 and 4.8; on SWE-bench Multilingual (77.3) it edges out Gemini 3.1 Pro (76.9), sits just under Opus 4.6 (77.8), and trails Opus 4.7 and 4.8, with no GPT-5.5 score listed. There's a clear gap in foundational skills: on IFEval (90.0) and GPQA-diamond (88.9) it's outpaced by both Gemini 3.1 Pro and GPT-5.5, and on IMO-AnswerBench (81.8) it trails Gemini 3.1 Pro (90.0) while finishing ahead of GPT-5.5 (79.5). It hasn't overtaken across the board; it has caught up to, and locally surpassed, the Western leaders on the "code + Agent" track it deliberately honed, while still trailing the best scores on pure knowledge and math reasoning. It's also deeply adapted to the mainstream agent frameworks Claude Code, OpenClaw, and Hermes.

4. The problem

Why long-text processing hits a wall

Agent applications increasingly need to read in ultra-long inputs in one go: an entire codebase, a whole document. But processing long text comes with an unavoidable cost problem.

The standard approach (attention) compares every word against every other word pairwise, so the number of comparisons grows quadratically with text length: doubling the input roughly quadruples the work. The idea behind sparse attention: don't compare everything; first use an "indexer" to pick out the most relevant handful of words to focus on.

An analogy · sparse-attention indexing

It's like finding an answer in a thick book: you check the table of contents to pick the relevant chapters instead of reading word-by-word from page one to the end. The indexer is that "table of contents."

DeepSeek's sparse attention (DSA) tackles this with fine-grained sparsity, but Meituan's own tests found that the "Lightning Indexer" inside DSA is still a bottleneck: its output is non-contiguous (unfriendly to hardware) and its scoring cost is still quadratic. In other words, the table of contents isn't picked fast enough, and flipping through it is itself laborious. That's exactly what the next section's core innovation goes after.
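To put rough numbers on that scaling, here is a minimal back-of-the-envelope sketch. It only counts how many query-key pairs get attended; the context lengths and the `TOP_K` budget of 2,048 are illustrative assumptions, not settings from Meituan's blog.

```python
def dense_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by causal dense attention (token i sees tokens 0..i)."""
    return n_tokens * (n_tokens + 1) // 2


def sparse_pairs(n_tokens: int, top_k: int) -> int:
    """Pairs attended when each token keeps at most top_k earlier tokens."""
    if n_tokens <= top_k:
        return dense_pairs(n_tokens)
    # The first top_k tokens still see everything before them; the rest see top_k each.
    return dense_pairs(top_k) + (n_tokens - top_k) * top_k


TOP_K = 2_048  # hypothetical selection budget, not a LongCat setting
for n in (8_192, 131_072, 1_048_576):
    ratio = dense_pairs(n) / sparse_pairs(n, TOP_K)
    print(f"{n:>9} tokens: dense/sparse pair ratio = {ratio:.1f}x")
```

Note what the sketch leaves out: an indexer that scores every earlier token to choose its top-k still does `dense_pairs(n)` scoring operations, only cheaper ones. That is the residual quadratic cost the paragraph above describes.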

5. Core innovation · part one

How LongCat speeds up long-context processing

LongCat Sparse Attention (LSA) makes three orthogonal efficiency changes to that stuck indexer. Orthogonal means the three don't interfere with each other and can each be switched on and off independently.

The core idea isn't to swap in a different indexer, but to cut the cost of "flipping through the table of contents" from three different angles: make memory access orderly, let one indexing pass serve multiple layers, and score coarse-then-fine. Stacked together, the three changes amortize the indexing cost, which is what lets long context run fast.

SI · Streaming-aware Indexing

It combines "hardware-aligned contiguous access" with "dynamic random selection," reorganizing scattered memory access into predictable sequential reads, which enables coalesced access to HBM memory and raises effective bandwidth. For the same batch of tokens, the access pattern goes from scattered all over the place to reading straight down one line.

Figure: memory access pattern before (scattered random access) and after (sequential reads).
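One way to see why contiguous access matters is to count how many aligned memory bursts a set of selected tokens touches. The sketch below is only an illustration of that counting argument: `bursts_touched` and the burst size of 16 tokens are hypothetical, and it does not reproduce how SI itself chooses tokens.

```python
def bursts_touched(token_ids, burst_tokens: int) -> int:
    """Number of distinct aligned bursts needed to read the given token positions."""
    return len({t // burst_tokens for t in token_ids})


BURST = 16  # hypothetical: tokens covered by one aligned memory burst

scattered = range(0, 4096, 64)      # 64 selected tokens, one every 64 positions
contiguous = range(1024, 1024 + 64)  # 64 selected tokens in one aligned run

print(bursts_touched(scattered, BURST))   # 64 bursts, one per token
print(bursts_touched(contiguous, BURST))  # 4 bursts for the same token count
```

Both selections fetch 64 tokens, but the scattered one pays for 16 times as many bursts, most of whose bytes are thrown away.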

CLI · Cross-Layer Indexing

It exploits an empirical rule: attention salience is stable across adjacent layers (neighboring layers want to pick roughly the same words). So one indexing computation serves several consecutive layers at inference time instead of being recomputed per layer, amortizing the indexing cost. This is achieved through cross-layer distillation during training.
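The amortization can be sketched in a few lines. This is a toy under stated assumptions: `run_layers`, `share_span`, and the fake indexer are hypothetical names, and the cross-layer distillation that makes sharing safe is not modeled.

```python
from typing import Callable, List


def run_layers(num_layers: int, share_span: int,
               indexer: Callable[[int], List[int]]) -> List[List[int]]:
    """Return the token selection used by each layer, re-running the
    indexer only on every share_span-th layer and reusing it in between."""
    selections = []
    current: List[int] = []
    for layer in range(num_layers):
        if layer % share_span == 0:
            current = indexer(layer)  # the expensive step
        selections.append(current)
    return selections


calls = []

def toy_indexer(layer: int) -> List[int]:
    """Stand-in indexer that records when it is invoked."""
    calls.append(layer)
    return [0, 3, 7]  # fake selected token positions

per_layer = run_layers(num_layers=8, share_span=4, indexer=toy_indexer)
print(f"{len(per_layer)} layers served by {len(calls)} indexer calls")
```

With a span of 4, eight layers need two indexing passes instead of eight.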

HI · Hierarchical Indexing

Coarse-to-fine two-stage scoring: first do a coarse recall with block-level approximate scoring to circle the roughly relevant candidate region, then do fine-grained token selection within that much smaller candidate set. The candidate space the indexer actually has to process each time is shrunk. In LongCat-2.0, HI is used training-free and turned on only for selected ultra-long-context tasks.
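A minimal sketch of the coarse-to-fine idea, in plain Python. The block summary used here (the mean of the block's key vectors) and all function and parameter names are assumptions for illustration; the article does not say how LongCat computes its block-level approximate scores.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def hierarchical_select(query: Vector, keys: List[Vector], block_size: int,
                        top_blocks: int, top_tokens: int) -> List[int]:
    """Two-stage selection: coarse block recall, then fine token scoring."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    # Stage 1: score one summary vector per block (assumed: mean of its keys).
    summaries = [[sum(col) / len(blk) for col in zip(*blk)] for blk in blocks]
    order = sorted(range(len(blocks)),
                   key=lambda b: dot(query, summaries[b]), reverse=True)
    kept = order[:top_blocks]
    # Stage 2: score individual tokens, but only inside the kept blocks.
    candidates = [b * block_size + j for b in kept for j in range(len(blocks[b]))]
    candidates.sort(key=lambda t: dot(query, keys[t]), reverse=True)
    return sorted(candidates[:top_tokens])


keys = [[0.1, 0.0], [0.2, 0.0], [0.0, 1.0], [0.0, 1.0],
        [0.9, 0.0], [1.0, 0.0], [0.3, 0.0], [0.1, 0.0]]
print(hierarchical_select([1.0, 0.0], keys, block_size=2,
                          top_blocks=2, top_tokens=2))  # [4, 5]
```

Here the fine stage scores 4 tokens instead of 8; at long context the candidate set shrinks by the ratio of kept blocks to total blocks.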

This mechanism also extends to a 3-step MTP (multi-token prediction) module, used to speed up speculative decoding (guess several words at once — save time when the guesses are right). Below is Meituan's overview diagram of the LSA design.

Overview of the LongCat Sparse Attention (LSA) design: three orthogonal changes — streaming-aware indexing / cross-layer indexing / hierarchical indexing. Source: official LongCat blog
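The "guess several words at once" step mentioned above can be sketched as follows. The three-token draft mirrors the 3-step MTP module, but the greedy accept-or-replace rule and the toy `target_next` and `flaky_draft` functions are simplifying assumptions, not Meituan's decoding procedure.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, num_draft: int = 3) -> List[int]:
    """One round: the draft proposes num_draft tokens, the target keeps the
    longest agreeing run and supplies one token of its own."""
    proposal, ctx = [], list(prefix)
    for _ in range(num_draft):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)  # real systems verify all positions in one pass
        if tok != expected:
            accepted.append(expected)  # first wrong guess: use the target's token
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # every guess was right: bonus token
    return accepted


def target_next(ctx: List[int]) -> int:
    """Toy 'large model': next token depends only on context length."""
    return len(ctx) % 5

def flaky_draft(ctx: List[int]) -> int:
    """Toy draft head that guesses wrong at one position."""
    return 9 if len(ctx) == 3 else len(ctx) % 5

print(speculative_step([0], flaky_draft, target_next))  # [1, 2, 3]
print(speculative_step([0], target_next, target_next))  # [1, 2, 3, 4]
```

Each round emits between one and four tokens for a single verification pass of the large model, which is where the time saving comes from when guesses are right.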

6. Core innovation · part two

Under 10% more parameters buys roughly 100× the vocabulary space

The second innovation is called N-gram Embedding. Its idea in one line: instead of piling the extra parameters into more experts, move them to dedicated memory for "common word combinations."

An analogy · N-gram Embedding

The usual approach has the model memorize individual characters one by one. N-gram Embedding instead memorizes common consecutive combinations as a whole "card," so the model recognizes a common combination at a glance instead of assembling it from scratch each time. It's like learning English by memorizing not just the 26 letters but also common whole words as cards, recognizing them on sight.

LongCat-2.0 inherits this design from LongCat-Flash-Lite, sets the n-gram size to 5, packs in 135 billion N-gram Embedding parameters, and uses N-gram token combinations to expand the embedding space roughly 100×, capturing richer local context. The key is that two scaling principles decide where those parameters should go.

Path A · keep piling on experts

Even without counting N-gram, the model's sparsity is already around 97% — past the sweet spot. Piling the same amount of parameters into MoE experts yields next to nothing.

Gain ≈ maxed out

Path B · put it into N-gram Embedding

Moving the same amount of parameters to memorize common word combinations pays off far more than ordinary experts; at inference it also shifts memory I/O away from the experts.

Vocab ×100

But more isn't always better. Experiments found that once N-gram Embedding exceeds 50% of the total parameter budget, its advantage over piling on experts weakens. So LongCat-2.0 keeps it strictly under 10%, leaving plenty of safety margin. The direct benefit: shifting parameters from experts to N-gram Embedding at inference lowers memory I/O for large-batch decoding and speeds up generation.
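The budget figures quoted in this section can be checked with simple arithmetic. The three inputs are the article's numbers; note that the sparsity computed here is for the whole model, whereas the article's "around 97%" is stated as excluding N-gram parameters, so the match is approximate.

```python
TOTAL_PARAMS = 1.6e12   # total parameters
ACTIVE_PARAMS = 48e9    # activated per token (approximate, per the article)
NGRAM_PARAMS = 135e9    # N-gram Embedding parameters

ngram_share = NGRAM_PARAMS / TOTAL_PARAMS
active_share = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"N-gram share of total:  {ngram_share:.1%}")       # 8.4%
print(f"Activated per token:    {active_share:.1%}")      # 3.0%
print(f"Whole-model sparsity:   {1 - active_share:.1%}")  # 97.0%
```

At about 8.4% of the budget, the N-gram Embedding sits under the 10% cap and far below the 50% point where the article says its advantage fades.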

Overview of the N-gram Embedding architecture: scaling parameters along a sparse dimension orthogonal to MoE. Source: official LongCat blog
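The "memorize common combinations as a card" idea from this section can be sketched as a second lookup table indexed by the recent token window. Everything specific here is an assumption for illustration: the polynomial hash, the summing of the two vectors, the table sizes, and the function names. The article does not describe how LongCat maps n-grams to table rows.

```python
from typing import List

Vector = List[float]


def ngram_bucket(tokens: List[int], n: int, table_size: int) -> int:
    """Hash the last n token ids into a row index (assumed polynomial hash)."""
    h = 0
    for t in tokens[-n:]:
        h = (h * 1_000_003 + t + 1) % table_size
    return h


def embed(token_ids: List[int], base_table: List[Vector],
          ngram_table: List[Vector], n: int = 5) -> List[Vector]:
    """Per position: base token embedding plus the embedding of the n-gram ending there."""
    out = []
    for i, tok in enumerate(token_ids):
        window = token_ids[max(0, i - n + 1): i + 1]
        extra = ngram_table[ngram_bucket(window, n, len(ngram_table))]
        out.append([b + e for b, e in zip(base_table[tok], extra)])
    return out


VOCAB = 4
base_table = [[float(i), 0.0] for i in range(VOCAB)]         # toy base embeddings
ngram_table = [[0.0, float(r)] for r in range(100 * VOCAB)]  # toy table, 100x the rows

vectors = embed([1, 2, 3, 1, 2, 3], base_table, ngram_table)
print(vectors[2], vectors[5])  # same token id 3, different n-gram context
```

The toy n-gram table has 100 times as many rows as the base vocabulary, loosely mirroring the "roughly 100×" figure. Only one row per position is read, so the table adds parameters without adding much compute.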

7. Engineering systems

Running all this stably on domestic chips is the real battle

Beyond the algorithmic innovations, a huge amount of low-level engineering adaptation is what makes this run on domestic chips with less memory — and without going off the rails. Meituan also admits: compared with the mature NVIDIA GPU ecosystem, the supporting software community isn't as mature yet.

The primary constraint is memory. Their accelerator's per-card memory is clearly smaller than the H800's (80GB), so at scale memory is the first bottleneck. The response takes two routes: make parallelism finer-grained, and make the communication domain bigger.

6D parallelism: a dedicated parallel dimension for N-gram Embedding

TP: Tensor parallelism
CP: Context parallelism
EP: Expert parallelism
DP: Data parallelism
PP: Pipeline parallelism
EMBP (new): dedicated parallelism to accelerate N-gram Embedding

Superpod: stretching the high-bandwidth communication domain to hundreds of devices

Inside a superpod is full-mesh high bandwidth; superpods connect over a RoCE network, expanding the high-bandwidth communication domain to hundreds of devices to feed the bandwidth-hungry TP/CP/EP parallelism.

At the same scale and environment, the superpod alone adds roughly 30% more pre-training throughput. Together with memory optimizations (ZeRO-1, selective recomputation, OOM-aware offloading, routing padding tokens to a "null expert") and a large-scale-deployed Muon optimizer, the overall systems optimization delivers more than a 35% training-throughput gain over a naive implementation.

Reliability: getting the same result every time — and catching hardware errors

In plain terms · deterministic operators / bit-flip detection

Deterministic operators mean the same input produces exactly the same result every time, with no tiny differences from varying hardware scheduling order — which makes problems reproducible. Bit-flip detection automatically spots computation errors like a hardware bit being flipped by accident (0 becoming 1) and catches them in time.
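A small sketch shows how much damage one flipped bit can do, and why determinism helps catch it. The detection rule below, comparing two redundant results bit for bit, is an assumption chosen for illustration; the article does not describe Meituan's actual detection mechanism.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit (0 = least significant) of value's float32 encoding."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out


def results_differ(a: float, b: float) -> bool:
    """Bit-exact float32 comparison of two runs of the same computation."""
    return struct.pack("<f", a) != struct.pack("<f", b)


print(flip_bit(1.0, 23))  # 0.5: one exponent bit halves the value
print(flip_bit(1.0, 30))  # inf: the top exponent bit turns 1.0 into infinity
print(results_differ(1.0, flip_bit(1.0, 0)))  # True: even the lowest bit is caught
```

The bit-exact comparison only works because deterministic operators guarantee that a healthy rerun matches exactly; without that guarantee, ordinary scheduling noise would be indistinguishable from a hardware fault.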

The invisible hard work behind production-grade reliability

Enforced determinism: both the communication and computation paths are forced to be deterministic, with an in-house set of deterministic operators covering the Embedding, FA, LSA, and MoE layers to guarantee reproducibility.
Numerical reliability: all reduction operators switch to "binary-tree segmented accumulation" to reduce floating-point error buildup; the accelerator's arithmetic precision is validated against a high-precision baseline under real LLM workloads; bit-flip detection is added inside some compute-intensive operators to catch hardware bit flips.
Fault recovery: end-to-end monitoring drives fault identification, traffic switchover, and automatic recovery without manual intervention; isolating a faulty link has no perceptible impact on training, and a repaired link must pass a stress test before rejoining.
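The numerical point in the list above can be illustrated with pairwise (binary-tree) summation. This sketch covers only the tree-shaped, fixed-order part of the idea; how Meituan segments the accumulation on its accelerators is not described in the article, and `pairwise_sum` is a hypothetical name.

```python
import math
from typing import Sequence


def naive_sum(values: Sequence[float]) -> float:
    """Left-to-right accumulation: rounding error can grow with length."""
    total = 0.0
    for v in values:
        total += v
    return total


def pairwise_sum(values: Sequence[float]) -> float:
    """Binary-tree accumulation with a fixed, reproducible reduction order."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


# Floating-point addition is not associative, so reduction order matters.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))  # False

data = [0.1] * 100_000
reference = math.fsum(data)  # correctly rounded sum, used as the baseline
print(f"naive error:    {abs(naive_sum(data) - reference):.3e}")
print(f"pairwise error: {abs(pairwise_sum(data) - reference):.3e}")
```

Fixing the tree shape gives the same result on every run, and the shallow tree keeps each partial sum close in magnitude to its partner, which limits error buildup.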

Meituan emphasizes: the entire pre-training run had no rollbacks and no unrecoverable loss spikes. They see this as direct evidence that frontier-scale training can be done on an alternative hardware platform.

8. Deployment & post-training

From trained to actually usable is one more hurdle

1.6 trillion parameters, serving at 1M context — training it isn't enough; it has to be deployable as an actually usable product, and it has to hold several capabilities at once.

Native 1M long-context training

To strengthen long-range tasks, training brings in LSA and trains on hundreds of billions of tokens of 1M-context data. The scaling scheme uses all-gather-based CP parallelism, with CP scalable beyond 512, achieving native 1M-length training; data is reshuffled at the get-batch stage and sharded with a balanced CP strategy to keep the load balanced.
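Why balanced sharding matters for context parallelism: with causal attention, later positions attend to more tokens, so contiguous shards leave the last rank with most of the work. The sketch below uses "zigzag" sharding, a common balancing scheme in context-parallel training; it is an assumption here, since the article does not say which balanced strategy LongCat uses.

```python
from typing import List


def contiguous_shards(seq_len: int, cp_size: int) -> List[List[int]]:
    """Naive split: rank r gets the r-th contiguous slice."""
    chunk = seq_len // cp_size
    return [list(range(r * chunk, (r + 1) * chunk)) for r in range(cp_size)]


def zigzag_shards(seq_len: int, cp_size: int) -> List[List[int]]:
    """Split into 2*cp_size chunks; rank r gets chunk r and its mirror chunk."""
    n_chunks = 2 * cp_size
    if seq_len % n_chunks:
        raise ValueError("seq_len must be divisible by 2 * cp_size")
    chunk = seq_len // n_chunks
    shards = []
    for r in range(cp_size):
        head = range(r * chunk, (r + 1) * chunk)
        tail = range((n_chunks - 1 - r) * chunk, (n_chunks - r) * chunk)
        shards.append(list(head) + list(tail))
    return shards


def causal_work(positions: List[int]) -> int:
    """Attention work for a shard: position p attends to p + 1 keys."""
    return sum(p + 1 for p in positions)


print([causal_work(s) for s in contiguous_shards(16, 4)])  # [10, 26, 42, 58]
print([causal_work(s) for s in zigzag_shards(16, 4)])      # [34, 34, 34, 34]
```

Pairing an early chunk with a late one gives every rank the same amount of attention work, so no rank idles waiting for the slowest.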

Inference serving: optimize reading the question and emitting the answer separately

In plain terms · PD-disaggregated deployment (Prefill-Decode split)

The two stages — "understanding your question" (prefill) and "emitting the answer character by character" (decode) — are split onto different machines and optimized separately, because the two steps consume different kinds of hardware resources.

Prefill nodes · optimize TTFT (time to first token)

Multi-node chunked pipeline parallelism (CPP) shrinks the EP domain, paired with attention sequence parallelism (SP), so the "read the question" step produces the first token faster.

Decode nodes · optimize TPOT (time per output token)

KVP shards the KV-cache across devices, paired with a large EP degree (EP128) to cut per-card weight memory and expert I/O, keeping "emitting the answer" steady.
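The two metrics combine into end-to-end latency in a simple way, which is why they are tuned separately. The timings below are made-up illustrative values, not measurements of LongCat-2.0.

```python
def request_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Seconds until the last token: the first token arrives after TTFT,
    and each later token arrives one TPOT after the previous one."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    return ttft_s + (output_tokens - 1) * tpot_s


# Hypothetical request: long prompt (slow prefill), 501 generated tokens.
total = request_latency(ttft_s=2.0, tpot_s=0.02, output_tokens=501)
print(f"{total:.1f} s")  # 12.0 s
```

In this example decode accounts for 10 of the 12 seconds, so TPOT dominates long answers, while TTFT dominates how responsive the system feels on long prompts.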

Post-training: learn from three sets of "teachers," then fuse into one model

Agent expert

Autonomously executes tasks in complex real-world scenarios: precise tool calls, reliable argument parsing across multi-turn API interactions, and self-correction that suppresses infinite loops and repeated calls.

Reasoning expert

Extends logical reasoning depth, adapts compute to problem difficulty, and is stronger on math, STEM problem-solving, and multi-hop reasoning.

Interaction expert

Focuses on human alignment: fine-grained instruction following, suppressing factual hallucinations, and building bounded safety mechanisms without sacrificing helpfulness.

The strongest abilities of all three experts are fused into the final model via the MOPD architecture, yielding strong agent execution, deep reasoning, and high-quality interaction all at once.

Overview of the MOPD multi-expert post-training architecture. Source: official LongCat blog

What else was squeezed out on the inference side

Model layer: attention uses the absorb compute mode; the indexer and MLA prolog are pipelined on concurrent streams to hide indexing overhead; KV-cache parallelism (KVP) shards the KV-cache across devices; ScMoE runs the dense branch and the MoE branch fully in parallel.
Accelerator layer: Super Kernel further squeezes launch overhead within the kernel; Weight Prefetch uses the larger L2 cache to prefetch weights, hiding I/O latency inside the previous operator's computation.
Load balancing: expert-parallel load balancing (EPLB) moves statistics collection and placement computation off the forward critical path to run asynchronously.
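The placement computation in the load-balancing item can be sketched with a greedy heuristic: put the busiest expert on the currently least-loaded device. This is a generic longest-processing-time heuristic shown as an assumption; the article does not specify LongCat's placement algorithm, and the asynchronous, off-critical-path execution is not modeled.

```python
import heapq
from typing import Dict, List


def place_experts(expert_load: Dict[int, int], num_devices: int) -> Dict[int, List[int]]:
    """Greedy placement: heaviest expert first, onto the least-loaded device."""
    heap = [(0, dev) for dev in range(num_devices)]  # (total load, device id)
    heapq.heapify(heap)
    placement: Dict[int, List[int]] = {dev: [] for dev in range(num_devices)}
    by_load = sorted(expert_load.items(), key=lambda kv: kv[1], reverse=True)
    for expert, load in by_load:
        total, dev = heapq.heappop(heap)
        placement[dev].append(expert)
        heapq.heappush(heap, (total + load, dev))
    return placement


# Hypothetical token counts routed to each expert in a recent window.
loads = {0: 9, 1: 7, 2: 6, 3: 5, 4: 4, 5: 1}
print(place_experts(loads, num_devices=2))  # {0: [0, 3, 5], 1: [1, 2, 4]}
```

The resulting device loads are 15 and 17 tokens, close to the ideal 16 each. Because the statistics come from past traffic, the computation can run in the background and take effect on a later step.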

9. Takeaways

Remember these numbers

The scale numbers from the whole piece, gathered in one place. These are the anchors for understanding what makes LongCat-2.0 special.

1.6T: Total parameters
48B: Activated params per token
50k+: Domestic AI ASIC chips
35T+: Training data tokens
35%+: Training-throughput gain (vs. naive)
30%: Extra throughput from superpods
135B: N-gram Embedding parameters
100×: Effective vocab-space expansion
1M tokens: Native long-context length

One last thing to keep straight: Meituan says it "introduced and open-sourced" LongCat-2.0, and the blog links to GitHub (github.com/meituan-longcat/LongCat-2.0), HuggingFace, an online demo (longcat.chat), and API docs. But as of publication, the weights weren't actually downloadable, and whether third parties can independently reproduce the scores is still pending. Most of the scores above are Meituan's in-house numbers from its own eval framework, so leave room when comparing across models.

"LongCat-2.0 proves that we now have the ability to train large-scale models on domestic compute clusters." (Official LongCat technical blog)

Sources: the official LongCat technical blog (longcat.chat/blog/longcat-2.0/) and reporting by The Decoder. Technical details follow the official blog; the geopolitical and industry significance and the external fact-checking perspective come from The Decoder. Except for the externally published values marked *, all scores are Meituan's in-house tests within a unified eval framework. This piece is an interpretation of publicly available information and is not an evaluation conclusion.