Meituan Releases LongCat-2.0: a 1.6-Trillion-Parameter Model Trained End-to-End on Domestic Chips, No NVIDIA GPUs
- On June 30, 2026, Meituan's LongCat team released and open-sourced LongCat-2.0, a large MoE model with 1.6 trillion total parameters and roughly 48 billion activated per token.
- Training covered more than 35 trillion tokens, and both training and large-scale deployment run entirely on a cluster of more than 50,000 domestic AI ASIC chips, with no NVIDIA GPUs used.
- The architecture builds on LongCat-Flash, adding LongCat Sparse Attention (LSA) and a 135-billion-parameter N-gram Embedding to speed up long context and cut inference memory overhead.
- Meituan's own benchmarks show it beats Gemini 3.1 Pro on the code/Agent tasks SWE-bench Pro and SWE-bench Multilingual and beats GPT-5.5 on SWE-bench Pro (no GPT-5.5 score is listed for Multilingual), but trails Claude Opus 4.7 and 4.8; on foundational skills like IFEval and GPQA-diamond it lags Gemini 3.1 Pro and GPT-5.5.
- As of publication the weights weren't actually up on HuggingFace, the vast majority of scores were measured in-house on Meituan's own eval framework, and independent third-party reproduction is still pending.
What Meituan actually did this time
On June 30, 2026, Meituan's LongCat team released and open-sourced LongCat-2.0, an ultra-large MoE (Mixture-of-Experts) language model with 1.6 trillion total parameters and roughly 48 billion activated per token.
In their own words: "LongCat-2.0 proves that we now have the ability to train large-scale models on domestic compute clusters." The LongCat team was only founded in 2023, and its first model shipped just late last year.
What it can actually do now
Forget parameters and benchmarks for a second. A "codebase migration" demo from Meituan gives a more intuitive feel for its real capability: porting an entire plugin onto a new SDK — and making it actually run.
Meituan also showed demo scenarios for code engineering, Agent-and-research, and content generation. Work like this demands two things of a model: it has to fit ultra-long inputs and stay consistent across a long chain — exactly the two directions its architecture focused on, which the next few sections break down.
Where do the benchmarks actually land
Meituan compared LongCat-2.0 against several top closed models in a unified eval framework. Seeing where its strengths and weaknesses lie is more useful than any single number.
| Benchmark | LongCat-2.0 | Gemini 3.1 Pro | GPT-5.5 | Opus 4.6 | Opus 4.7 | Opus 4.8 |
|---|---|---|---|---|---|---|
| Code Agent | ||||||
| Terminal-Bench 2.1 | 70.8 | 70.7* | 73.8* | - | 71.7* | 78.9* |
| SWE-bench Pro | 59.5 | 54.2* | 58.6* | 57.3* | 64.3* | 69.2* |
| SWE-bench Multilingual | 77.3 | 76.9* | - | 77.8* | 80.5* | 84.8* |
| General Agent | ||||||
| FORTE † | 73.2 | 70.3 | 77.8 | 73.2 | 77.6 | 77.2 |
| BrowseComp | 79.9 | 85.9* | 84.4* | 84.0* | 79.3* | 84.3* |
| RWSearch | 78.8 | 76.3 | 85.3 | 81.3 | 79.3 | 77.3 |
| Foundational skills | ||||||
| IFEval | 90.0 | 96.1 | 95.0 | 92.2 | 88.7 | 86.0 |
| Writing Bench | 83.8 | 83.7 | 84.7 | - | 85.3 | 85.2 |
| IMO-AnswerBench | 81.8 | 90.0 | 79.5 | 75.3* | 81.8 | 75.3 |
| GPQA-diamond | 88.9 | 94.3* | 93.6* | 91.3* | 94.2* | 92.4 |
The read is straightforward: its strength is code and Agent work. On SWE-bench Pro (59.5) it beats Gemini 3.1 Pro (54.2) and GPT-5.5 (58.6), and on SWE-bench Multilingual (77.3) it edges out Gemini 3.1 Pro (76.9; no GPT-5.5 score is listed), but it falls below Claude Opus 4.7 and 4.8 on both. There's a clear gap in foundational skills: on IFEval (90.0) and GPQA-diamond (88.9) it's outpaced by both Gemini 3.1 Pro and GPT-5.5, and on IMO-AnswerBench (81.8) it trails Gemini 3.1 Pro (90.0) while staying ahead of GPT-5.5 (79.5). It hasn't overtaken across the board; it has caught up to, and locally surpassed, the Western leaders on the "code + Agent" track it deliberately honed, while still trailing the best scores on pure knowledge and math reasoning. It's also deeply adapted to the mainstream agent frameworks Claude Code, OpenClaw, and Hermes.
Why long-text processing hits a wall
Agent applications increasingly need to read in ultra-long inputs in one go: an entire codebase, a whole document. But processing long text comes with an unavoidable cost problem.
The standard approach, attention, compares every token (roughly, every word) against every other token pairwise, so the number of comparisons grows quadratically with text length: doubling the input quadruples the work. The idea behind sparse attention: don't compare everything. First use an "indexer" to pick out the most relevant handful of tokens to focus on.
It's like finding an answer in a thick book: you check the table of contents to pick the relevant chapters instead of reading word-by-word from page one to the end. The indexer is that "table of contents."
DeepSeek's sparse attention (DSA) tackles this with fine-grained sparsity, but Meituan's own tests found that the "Lightning Indexer" inside DSA is still a bottleneck: its output is non-contiguous (unfriendly to hardware) and its scoring cost is still quadratic. In other words, the table of contents isn't picked fast enough, and flipping through it is itself laborious. That's exactly what the next section's core innovation goes after.
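To make the indexer idea concrete, here is a minimal pure-Python sketch of top-k sparse attention for a single query. It is an illustration under assumptions, not DSA's or LongCat's actual kernel: the function names, the way `index_scores` is supplied, and the value of `k` are all hypothetical.

```python
import math

def dense_score_count(n_tokens: int) -> int:
    """Query-key scores needed by causal dense attention over n_tokens."""
    return n_tokens * (n_tokens + 1) // 2

def sparse_score_count(n_tokens: int, k: int) -> int:
    """Attention scores when each query keeps at most k keys.

    The indexer's own scoring work is NOT counted here; the article notes
    that this part can itself remain quadratic.
    """
    return sum(min(k, i + 1) for i in range(n_tokens))

def select_top_k(index_scores: list[float], k: int) -> list[int]:
    """Positions of the k highest indexer scores, returned in position order."""
    ranked = sorted(range(len(index_scores)),
                    key=lambda i: index_scores[i], reverse=True)
    return sorted(ranked[:k])

def sparse_attention(query, keys, values, index_scores, k):
    """Softmax attention restricted to the k keys the indexer picked."""
    keep = select_top_k(index_scores, k)
    scale = math.sqrt(len(query))
    logits = [sum(q * x for q, x in zip(query, keys[i])) / scale for i in keep]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, keep)) / total
           for d in range(dim)]
    return keep, out
```

Note that `select_top_k` returns arbitrary, usually non-contiguous positions, which is the hardware-unfriendly output pattern described above.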
How LongCat speeds up long-context processing
LongCat Sparse Attention (LSA) makes three orthogonal efficiency changes to that stuck indexer. Orthogonal means the three don't interfere with each other and can each be switched on and off independently.
The core idea isn't to swap in a different indexer, but to cut the cost of "flipping through the table of contents" from three different angles: make memory access orderly, let one indexing pass serve multiple layers, and score coarse-then-fine. Stacked together, the three changes amortize the indexing cost, which is what lets long context run fast.
The first change makes memory access orderly. It combines "hardware-aligned contiguous access" with "dynamic random selection," reorganizing scattered memory access into predictable sequential reads, which enables coalesced access to HBM and raises effective bandwidth. For the same batch of tokens, the access pattern goes from scattered all over the place to reading straight down one line.
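A toy model shows why contiguous selection matters. Assume, purely for illustration, that memory is fetched in fixed-size lines holding `tokens_per_line` tokens each; real HBM coalescing rules are more involved, and the line size here is a made-up parameter.

```python
def memory_lines_touched(token_ids: list[int], tokens_per_line: int) -> int:
    """Count distinct fixed-size memory lines needed to read the given tokens.

    Toy cost model: one fetch per line, regardless of how many of the
    line's tokens are actually used.
    """
    return len({t // tokens_per_line for t in token_ids})

# 64 selected tokens, one per line: every read drags in a full line.
scattered = list(range(0, 64 * 64, 64))
# 64 selected tokens sitting side by side: a single line serves them all.
contiguous = list(range(0, 64))
```

With 64 tokens per line, the scattered selection touches 64 lines and the contiguous one touches 1, even though both read the same number of tokens.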
The second change lets one indexing pass serve multiple layers. It exploits an empirical rule: attention salience is stable across adjacent layers (neighboring layers want to pick roughly the same tokens). So one indexing computation serves several consecutive layers at inference time instead of being recomputed per layer, amortizing the indexing cost. This is achieved through cross-layer distillation during training.
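The amortization can be sketched as a simple cache over layers. The `share_span` parameter (how many consecutive layers reuse one index) is hypothetical; the article does not say how many layers share a pass.

```python
def run_layers(num_layers: int, share_span: int, compute_index):
    """Reuse one indexing result for every `share_span` consecutive layers.

    compute_index(layer) stands in for the expensive indexer call.
    Returns (number of indexer calls, the index each layer actually used).
    """
    calls = 0
    cached = None
    used = []
    for layer in range(num_layers):
        if layer % share_span == 0:   # first layer of a group: recompute
            cached = compute_index(layer)
            calls += 1
        used.append(cached)           # other layers in the group reuse it
    return calls, used
```

In this sketch, 8 layers with `share_span=4` trigger 2 indexer calls instead of 8.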
The third change is coarse-to-fine two-stage scoring: first do a coarse recall with block-level approximate scoring to circle the roughly relevant candidate region, then do fine-grained token selection within that much smaller candidate set. This shrinks the candidate space the indexer has to process each time. In LongCat-2.0, this stage (abbreviated HI) is applied without extra training and turned on only for selected ultra-long-context tasks.
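A minimal sketch of the two-stage idea follows. It assumes each block is summarized by the mean of its keys, which is one simple choice of "block-level approximate scoring" and not necessarily the one LongCat uses; all names and parameters are illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coarse_to_fine(query, keys, block_size, top_blocks, top_tokens):
    """Pick tokens in two stages and report how many scores were computed."""
    blocks = [keys[s:s + block_size] for s in range(0, len(keys), block_size)]
    # Stage 1 (coarse): one approximate score per block, using the mean key.
    summaries = [[sum(col) / len(blk) for col in zip(*blk)] for blk in blocks]
    coarse = [dot(query, s) for s in summaries]
    chosen = sorted(sorted(range(len(blocks)), key=lambda b: coarse[b],
                           reverse=True)[:top_blocks])
    # Stage 2 (fine): exact per-token scores, but only inside chosen blocks.
    candidates = [b * block_size + j
                  for b in chosen for j in range(len(blocks[b]))]
    fine = {i: dot(query, keys[i]) for i in candidates}
    selected = sorted(sorted(candidates, key=lambda i: fine[i],
                             reverse=True)[:top_tokens])
    return selected, len(coarse) + len(fine)

# 16 keys in 4 blocks; only the second block points along the query.
example_keys = ([[0.0, 1.0]] * 4
                + [[1.0, 0.0], [4.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
                + [[0.0, 1.0]] * 8)
```

On `example_keys` with one block kept, the sketch computes 8 scores (4 coarse plus 4 fine) where scoring every token would take 16. The saving grows with sequence length, at the risk of missing a relevant token hidden in a block whose summary scores poorly.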
This mechanism also extends to a 3-step MTP (multi-token prediction) module, used to speed up speculative decoding (guess several words at once — save time when the guesses are right). Below is Meituan's overview diagram of the LSA design.
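The "guess several, keep the right ones" logic of speculative decoding can be sketched as follows. This assumes greedy verification (a guess is accepted only if it equals the main model's own choice); the article does not describe LongCat's acceptance rule, and the function name is hypothetical.

```python
def accept_draft(draft: list[int], target: list[int]) -> list[int]:
    """Return the tokens emitted by one speculative step.

    draft:  tokens guessed ahead (3 of them for a 3-step MTP module).
    target: what the main model would emit at each of the len(draft) + 1
            positions, computed in a single verification pass.
    """
    emitted = []
    for guess, truth in zip(draft, target):
        if guess != truth:
            emitted.append(truth)   # first wrong guess: take the correction
            return emitted          # and discard the remaining guesses
        emitted.append(guess)
    emitted.append(target[len(draft)])  # all guesses right: one bonus token
    return emitted
```

Every step emits at least one token, so a bad guess costs little, while a fully correct 3-token draft yields 4 tokens from one verification pass.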
Under 10% more parameters buys roughly 100× the vocabulary space
The second innovation is called N-gram Embedding. Its idea in one line: instead of piling the extra parameters into more experts, move them to dedicated memory for "common word combinations."
The usual approach has the model memorize individual characters one by one. N-gram Embedding instead memorizes common consecutive combinations as a whole "card," so the model recognizes a common combination at a glance instead of assembling it from scratch each time. It's like learning English by memorizing not just the 26 letters but also common whole words as cards, recognizing them on sight.
LongCat-2.0 inherits this design from LongCat-Flash-Lite, sets the n-gram size to 5, packs in 135 billion N-gram Embedding parameters, and uses N-gram token combinations to expand the embedding space roughly 100×, capturing richer local context. The key is that two scaling principles decide where those parameters should go.
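One common way to realize such a table is to hash each n-gram of token IDs into a fixed number of rows and add that row to the ordinary token embedding. The sketch below assumes this hashing scheme, the polynomial hash, and the additive combination; none of these details are confirmed by the article, and all names are hypothetical.

```python
def ngram_bucket(token_ids: list[int], position: int, n: int,
                 num_buckets: int, base: int = 1_000_003) -> int:
    """Hash the (up to) n tokens ending at `position` into a table row."""
    start = max(0, position - n + 1)
    h = 0
    for t in token_ids[start:position + 1]:
        h = (h * base + t + 1) % num_buckets  # simple polynomial hash
    return h

def embed(token_ids, token_table, ngram_table, n=5):
    """Per-position embedding = token row + hashed n-gram row."""
    out = []
    for i, t in enumerate(token_ids):
        row = ngram_table[ngram_bucket(token_ids, i, n, len(ngram_table))]
        out.append([x + y for x, y in zip(token_table[t], row)])
    return out
```

Because the bucket depends only on the last `n` tokens, the same 5-token combination always lands on the same row wherever it appears, which is what lets the table act as memory for common combinations. Distinct n-grams can collide in one bucket; a real design has to manage that trade-off.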
Even without counting N-gram, the model's sparsity is already around 97%, past the sweet spot. Piling the same amount of parameters into MoE experts yields next to nothing (gain ≈ maxed out).
Moving the same amount of parameters to memorize common word combinations pays off far more than ordinary experts (vocab ×100); at inference it also shifts memory I/O away from the experts.
But more isn't always better. Experiments found that once N-gram Embedding exceeds 50% of the total parameter budget, its advantage over piling on experts weakens. So LongCat-2.0 keeps it strictly under 10% (135 billion of 1.6 trillion is about 8.4%), leaving plenty of safety margin. The direct benefit: shifting parameters from experts to N-gram Embedding at inference lowers memory I/O for large-batch decoding and speeds up generation.
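The ratios quoted above can be checked from the headline figures. This is back-of-the-envelope arithmetic on the article's rounded numbers, so the results are approximate; the sparsity here is computed on the full 1.6 trillion, which differs only slightly from the figure that excludes N-gram.

```python
TOTAL_PARAMS = 1.6e12    # total parameters
ACTIVE_PARAMS = 48e9     # roughly activated per token
NGRAM_PARAMS = 135e9     # N-gram Embedding parameters

def sparsity(total: float, active: float) -> float:
    """Fraction of parameters NOT used for a given token."""
    return 1.0 - active / total

def share(part: float, total: float) -> float:
    """Fraction of the total parameter budget taken by one component."""
    return part / total

model_sparsity = sparsity(TOTAL_PARAMS, ACTIVE_PARAMS)  # about 0.97
ngram_share = share(NGRAM_PARAMS, TOTAL_PARAMS)         # about 0.084
```

Both values line up with the text: roughly 97% sparsity, and an N-gram share comfortably under the 10% ceiling and far from the 50% point where the advantage weakens.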
Running all this stably on domestic chips is the real battle
Beyond the algorithmic innovations, a huge amount of low-level engineering adaptation is what makes this run on domestic chips with less memory — and without going off the rails. Meituan also admits: compared with the mature NVIDIA GPU ecosystem, the supporting software community isn't as mature yet.
The primary constraint is memory. Their accelerator's per-card memory is clearly smaller than the H800's (80GB), so at scale memory is the first bottleneck. The response takes two routes: make parallelism finer-grained, and make the communication domain bigger.
6D parallelism: a dedicated parallel dimension for N-gram Embedding
Superpod: stretching the high-bandwidth communication domain to hundreds of devices
At the same scale and environment, the superpod alone adds roughly 30% more pre-training throughput. Together with memory optimizations (ZeRO-1, selective recomputation, OOM-aware offloading, routing padding tokens to a "null expert") and a large-scale-deployed Muon optimizer, the overall systems optimization delivers more than a 35% training-throughput gain over a naive implementation.
Reliability: getting the same result every time — and catching hardware errors
Deterministic operators mean the same input produces exactly the same result every time, with no tiny differences from varying hardware scheduling order — which makes problems reproducible. Bit-flip detection automatically spots computation errors like a hardware bit being flipped by accident (0 becoming 1) and catches them in time.
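One generic way to catch a flipped bit is a checksum on the operator's output. The sketch below uses an algorithm-based fault tolerance style check on a matrix-vector product: the sum of the outputs must equal the column sums dotted with the input. This is a textbook technique chosen for illustration; the article does not say which detection scheme Meituan uses.

```python
import struct

def flip_bit(value: float, bit: int) -> float:
    """Simulate a hardware fault by flipping one bit of a 64-bit float."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out

def checked_matvec(matrix, x, fault=None, rel_tol=1e-9):
    """Compute y = matrix @ x and verify it against an independent checksum.

    fault: optional (output index, bit) pair used to inject a bit flip.
    Returns (y, ok), where ok is False if the checksum does not match.
    """
    y = [sum(a * v for a, v in zip(row, x)) for row in matrix]
    if fault is not None:
        index, bit = fault
        y[index] = flip_bit(y[index], bit)
    # Checksum: sum(y) should equal (column sums of matrix) . x
    column_sums = [sum(col) for col in zip(*matrix)]
    expected = sum(c * v for c, v in zip(column_sums, x))
    ok = abs(sum(y) - expected) <= rel_tol * max(1.0, abs(expected))
    return y, ok
```

Flipping the lowest exponent bit of 3.0 turns it into 6.0, which this check catches. A flip in a low mantissa bit may fall below the tolerance, so a scheme like this trades sensitivity against false alarms from ordinary rounding.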
The invisible hard work behind production-grade reliability
- Enforced determinism: both the communication and computation paths are forced to be deterministic, with an in-house set of deterministic operators covering the Embedding, FA, LSA, and MoE layers to guarantee reproducibility.
- Numerical reliability: all reduction operators switch to "binary-tree segmented accumulation" to reduce floating-point error buildup; the accelerator's arithmetic precision is validated against a high-precision baseline under real LLM workloads; bit-flip detection is added inside some compute-intensive operators to catch hardware bit flips.
- Fault recovery: end-to-end monitoring drives fault identification, traffic switchover, and automatic recovery without manual intervention; isolating a faulty link has no perceptible impact on training, and a repaired link must pass a stress test before rejoining.
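The "binary-tree accumulation" idea in the list above can be illustrated with plain pairwise summation: values are added in a fixed tree order, so the result is reproducible and rounding error grows more slowly than in a left-to-right sum. This is a generic sketch, not Meituan's segmented operator.

```python
import math

def tree_sum(values: list[float]) -> float:
    """Sum values pairwise, level by level, in a fixed binary-tree order."""
    if not values:
        return 0.0
    level = list(values)
    while len(level) > 1:
        merged = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # odd element carries up unchanged
            merged.append(level[-1])
        level = merged
    return level[0]

def naive_sum(values: list[float]) -> float:
    """Left-to-right accumulation, the usual source of error buildup."""
    total = 0.0
    for v in values:
        total += v
    return total

samples = [0.1] * (1 << 16)
exact = math.fsum(samples)            # correctly rounded reference
tree_error = abs(tree_sum(samples) - exact)
naive_error = abs(naive_sum(samples) - exact)
```

Because the addition order depends only on the input length and never on scheduling, two runs over the same data give bit-identical results, which is the property the determinism bullet relies on.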
Meituan emphasizes: the entire pre-training run had no rollbacks and no unrecoverable loss spikes. They see this as direct evidence that frontier-scale training can be done on an alternative hardware platform.
From trained to actually usable is one more hurdle
1.6 trillion parameters, serving at 1M context — training it isn't enough; it has to be deployable as an actually usable product, and it has to hold several capabilities at once.
Native 1M long-context training
To strengthen long-range tasks, training brings in LSA and trains on hundreds of billions of tokens of 1M-context data. The scaling scheme uses all-gather-based context parallelism (CP), with the CP degree scalable beyond 512, achieving native 1M-length training; data is reshuffled at the get-batch stage and sharded with a balanced CP strategy to keep the load even across devices.
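Why sharding needs balancing: under causal attention, a token at position p attends to p + 1 keys, so a device holding the tail of the sequence does far more work than one holding the head. A widely used fix is "zigzag" sharding, sketched below, where each rank gets one early chunk and its mirror-image late chunk. The article does not specify which balanced strategy Meituan uses, so treat this scheme and the function names as assumptions.

```python
def zigzag_shards(seq_len: int, cp_size: int) -> list[list[int]]:
    """Split positions into 2 * cp_size chunks; rank r gets chunk r and
    chunk (2 * cp_size - 1 - r), pairing cheap early tokens with costly late ones."""
    chunks = 2 * cp_size
    if seq_len % chunks != 0:
        raise ValueError("seq_len must be divisible by 2 * cp_size")
    size = seq_len // chunks

    def span(c: int) -> list[int]:
        return list(range(c * size, (c + 1) * size))

    return [span(r) + span(chunks - 1 - r) for r in range(cp_size)]

def contiguous_shards(seq_len: int, cp_size: int) -> list[list[int]]:
    """Baseline: each rank takes one contiguous slice of the sequence."""
    size = seq_len // cp_size
    return [list(range(r * size, (r + 1) * size)) for r in range(cp_size)]

def causal_work(positions: list[int]) -> int:
    """Attention cost proxy: position p attends to p + 1 keys."""
    return sum(p + 1 for p in positions)
```

For 16 tokens on 4 ranks, contiguous slices give per-rank work of 10, 26, 42, and 58, while the zigzag layout gives every rank exactly 34.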
Inference serving: optimize reading the question and emitting the answer separately
The two stages — "understanding your question" (prefill) and "emitting the answer character by character" (decode) — are split onto different machines and optimized separately, because the two steps consume different kinds of hardware resources.
Post-training: learn from three sets of "teachers," then fuse into one model
- Agent execution: autonomously executes tasks in complex real-world scenarios, with precise tool calls, reliable argument parsing across multi-turn API interactions, and self-correction that suppresses infinite loops and repeated calls.
- Deep reasoning: extends logical reasoning depth, adapts compute to problem difficulty, and is stronger on math, STEM problem-solving, and multi-hop reasoning.
- High-quality interaction: focuses on human alignment, with fine-grained instruction following, suppression of factual hallucinations, and bounded safety mechanisms that don't sacrifice helpfulness.
MOPD then fuses the three into a single model that delivers strong agent execution, deep reasoning, and high-quality interaction all at once.
What else was squeezed out on the inference side
- Model layer: attention uses the absorb compute mode; the indexer and MLA prolog are pipelined on concurrent streams to hide indexing overhead; KV-cache parallelism (KVP) shards the KV-cache across devices; ScMoE runs the dense branch and the MoE branch fully in parallel.
- Accelerator layer: Super Kernel further squeezes launch overhead within the kernel; Weight Prefetch uses the larger L2 cache to prefetch weights, hiding I/O latency inside the previous operator's computation.
- Load balancing: expert-parallel load balancing (EPLB) moves statistics collection and placement computation off the forward critical path to run asynchronously.
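The placement computation behind expert load balancing can be sketched with a greedy heuristic: sort experts by observed load and always hand the next one to the least-loaded device. This is a generic longest-processing-time heuristic used for illustration; the article does not describe EPLB's actual algorithm, and real systems may also replicate hot experts. The loads below are made-up numbers.

```python
import heapq

def place_experts(loads: dict[int, float], num_devices: int):
    """Greedy placement: heaviest expert first, onto the lightest device.

    loads maps expert id -> observed token load (collected off the
    critical path in the real system). Returns (placement, per-device totals).
    """
    heap = [(0.0, d) for d in range(num_devices)]  # (current load, device)
    heapq.heapify(heap)
    placement = {d: [] for d in range(num_devices)}
    for expert, load in sorted(loads.items(), key=lambda kv: kv[1],
                               reverse=True):
        total, device = heapq.heappop(heap)        # least-loaded device
        placement[device].append(expert)
        heapq.heappush(heap, (total + load, device))
    totals = {d: sum(loads[e] for e in experts)
              for d, experts in placement.items()}
    return placement, totals
```

Since this only needs load statistics and not the activations themselves, it is the kind of work that can run asynchronously while the forward pass continues with the previous placement.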
Remember these numbers
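The scale numbers from the whole piece, gathered in one place. These are the anchors for understanding what makes LongCat-2.0 special.
- 1.6 trillion total parameters, roughly 48 billion activated per token
- More than 50,000 domestic AI ASIC chips, no NVIDIA GPUs
- Over 35 trillion training tokens
- 135 billion N-gram Embedding parameters (n-gram size 5), under 10% of the total, for roughly 100× the embedding space
- Native 1M-token context
- Roughly 30% more pre-training throughput from the superpod alone, and more than 35% overall versus a naive implementation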
One last thing to keep straight: Meituan says it "introduced and open-sourced" LongCat-2.0, and the blog links to GitHub (github.com/meituan-longcat/LongCat-2.0), HuggingFace, an online demo (longcat.chat), and API docs. But as of publication, the weights weren't actually downloadable, and whether third parties can independently reproduce the scores is still pending. Most of the scores above are Meituan's in-house numbers from its own eval framework, so leave room when comparing across models.
"LongCat-2.0 proves that we now have the ability to train large-scale models on domestic compute clusters." (Official LongCat technical blog)