Bridgewater trained a model built just for financial-information triage — 84.7% accuracy, and the method is public
- Thinking Machines Lab teamed up with AIA Labs, part of Bridgewater, and used its own fine-tuning platform Tinker to train a custom model built specifically for financial-information triage.
- Top LLMs (the Gemini, Claude, and GPT families) ran 6 financial-triage tasks with simple prompts and averaged only about 50% accuracy; after repeated prompt tuning they topped out at 78.2%, still short of the 80% trust threshold investors require.
- The training data started out non-expert-labeled and full of errors. The team designed a scheme: send only the disagreement samples — where the model's judgment clashes with the label — to experts for review, keep the rest, and hold labeling costs down.
- Starting from the open-source Qwen3-235B as the base, standard GRPO reinforcement-learning fine-tuning pushed accuracy to 73.48%; stacking on interleaved-batch training, the CISPO asymmetric-clipping loss, and on-policy distillation with a dynamically promoted teacher brought it to 84.66%.
- The final model hits 84.7% accuracy — 29.8% fewer errors than the best frontier model tested (78.2%) — at just 1/13.8 the per-task inference cost of the matching frontier model.
News a portfolio manager judges in a second, AI can only guess at
Thinking Machines Lab and AIA Labs — part of Bridgewater — co-published a write-up laying out a method and results for training a financial-information triage model on their own fine-tuning platform, Tinker.
What they wanted to automate isn't writing research reports, but the "information sorting" a portfolio manager repeats countless times a day: picking out the part actually worth reading from news, research notes, company filings, and emails. The reading itself isn't hard — what's hard is layer upon layer of close judgment calls, and they eat up huge amounts of time. The team wanted to see whether this work could be handed to a model.
The six "little jobs" a portfolio manager does every day
For a portfolio manager these six things are instinct — judged in a second — but the moment you ask them to spell out "why I judged it that way," they stall, which is exactly why it's hard to teach an AI. The team broke each one out into a benchmark.
The first three are classification tasks (scored on accuracy + F1); the last three are localization tasks (scored on exact-match accuracy). The original says there are many more similar tasks internally, all with the same pattern: on this kind of work, frontier models generally lose to a model you train yourself.
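As a rough illustration of the two scoring styles mentioned above, the sketch below computes plain accuracy (which doubles as exact-match accuracy when predictions and references are compared for strict equality) and positive-class F1. The label strings and the function names are assumptions made for this example; the original does not publish its scoring code.

```python
def accuracy(preds, labels):
    """Fraction of predictions that equal the reference exactly."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def positive_f1(preds, labels, positive="relevant"):
    """F1 score computed for the positive class only."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, not taken from the benchmark.
labels = ["relevant", "not_relevant", "relevant", "not_relevant"]
preds = ["relevant", "relevant", "not_relevant", "not_relevant"]

acc = accuracy(preds, labels)
f1 = positive_f1(preds, labels)
```

Accuracy rewards every correct call equally, while positive-class F1 focuses on how well the "worth reading" items are caught, which is why the classification tasks report both.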
Both touch politics and finance — so why is one relevant and one not
Take a real example from "financial-article relevance." Two headlines, one about Greenland (A) and one about China tariffs (B), both straddle geopolitics and finance, but to a macro investor only one is worth reading and the other should be swiped straight away.
B is relevant, A isn't. In context, the Greenland story reads more like political posturing that the market won't take seriously, while the China tariffs drove the S&P 500's biggest single-day drop in weeks, a hard market signal. Because both touch geopolitics and finance, keywords alone can't tell them apart, and this is exactly where models fail. The call tests investment context, not surface-level word matching.
Tweak the prompt every which way, AI still stalls at 78.2%
The team first took the easy road: lean on prompt engineering to save the day. Experts rewrote the instructions to match the real task descriptions, and even redefined the tasks — for instance, changing article classification from two buckets ("relevant / not relevant") to three ("relevant and interesting, relevant but dull, not relevant"), because a minor IPO story counts as financially relevant yet lacks the big-picture significance a macro investor wants.
Prompts lifted accuracy from coin-flip level into the 70s, with the best model reaching 78.2%, but that's the ceiling: even automated prompt optimization couldn't squeeze out more. The original also notes that pricier isn't necessarily more accurate: GPT 5.4 costs 43% more than 5.2 yet gains only a sliver of accuracy, which suggests newer models make little progress on this kind of task, especially per dollar spent.
Per-model accuracy / F1 after tuning (original data)
| Frontier model (best prompt) | Accuracy | Positive-class F1 |
|---|---|---|
| Model family 1 | ~47.2% | 77.2% |
| Model 2 | 50.1% | 74.3% |
| Model 3 | 47.2% | 75.8% |
| Model 4 (best) | 48.5% | 78.2% |
| Model 5 | 45.6% | 78.0% |
Note: F1 is averaged over the 3 classification tasks, accuracy over all 6 tasks; in the original's terms the best frontier model tops out at 78.2% accuracy, which is the benchmark the custom model is measured against.
The labels themselves were wrong: how to send only the truly hard samples to experts
With prompts maxed out, the team turned to fine-tuning. But the first hurdle wasn't training — it was the data: these judgments are only valuable once they've passed an investment expert's eye. What they initially bought from a vendor was non-expert labeling; trained on it, the model stayed poor, and only by reading through the model's reasoning did they find that the labels in the dataset were often simply wrong.
Having experts relabel everything is too expensive, so the team let the model itself flag the "suspicious" samples and sent only those to experts. The logic is direct: if the model's judgment clashes with a label even on its own training set, then either the question is genuinely hard or the original label is wrong, and both are worth an expert's glance.
This way, expert effort is spent only on the genuinely contested samples while the rest are kept as-is — cleaning the data and controlling cost at once. The final evaluation runs on a fully independent held-out set that took no part in the cleaning, so no one grades their own homework.
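The routing idea can be sketched in a few lines. This is a minimal illustration under assumptions: the `Sample` fields, the label strings, and `split_for_review` are hypothetical names, and the original does not describe how the model's judgment is produced or compared beyond "disagreement with the label."

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    vendor_label: str  # non-expert label bought from the vendor
    model_label: str   # the model's own judgment on the same sample


def split_for_review(samples):
    """Keep samples where model and label agree; route the rest to experts."""
    keep, review = [], []
    for s in samples:
        if s.model_label != s.vendor_label:
            review.append(s)  # hard question or wrong label: worth an expert's look
        else:
            keep.append(s)    # agreement: keep the existing label as-is
    return keep, review


# Hypothetical toy data.
samples = [
    Sample("tariff story", "relevant", "relevant"),
    Sample("minor IPO", "relevant", "not_relevant"),
    Sample("celebrity news", "not_relevant", "not_relevant"),
]
keep, review = split_for_review(samples)
```

Only the `review` bucket consumes expert time, so labeling cost scales with the disagreement rate rather than with the size of the dataset.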
A three-move combo that lifts accuracy from 73% to 84.7%
With the data clean, on to training. The team picked Qwen3-235B, the open-source model academia has studied most thoroughly, as the base, and ran all training on Tinker without touching GPU infrastructure. Step one laid the foundation with standard GRPO reinforcement-learning fine-tuning, which raised accuracy from the base's 44.8% to 73.48%, still short of the 80% threshold. What actually pushes it over the line are the three improvements stacked on top.
GRPO is a reinforcement-learning method that needs no separately trained "judge model": the model produces a batch of candidate answers to the same question, they compare against each other for who's closest to the reference answer, and the better ones get reinforced going forward. It's like a group tackling the same problem — no teacher grading, just comparing among themselves who answered more correctly.
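The "compare among themselves" step is usually expressed as a group-relative advantage: each candidate's reward is measured against the mean of its own group and scaled by the group's spread. The sketch below shows only that normalization, assuming one scalar reward per candidate; the reward values and the `eps` constant are illustrative, and the policy-gradient update itself is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each candidate relative to the other answers to the same question."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every candidate earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four answers to one question, reward 1.0 if correct.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat the group average get a positive advantage and are reinforced; below-average answers get a negative one. When all candidates score the same, every advantage is zero and the question contributes no learning signal.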
The original reports three data points along the way: base 44.8% → GRPO 73.48% → full recipe 84.66%. The roughly 11-point leap from GRPO to the full recipe comes from the combination of three improvements: interleaved-batch training, the CISPO asymmetric-clipping loss, and on-policy distillation with a dynamically promoted teacher.
The per-item percentages come from ablation experiments: remove one item from the full recipe and see how much accuracy drops. The drops don't simply add up to the total gain; each item is indispensable, and pulling any one out clearly drags the score back down.
Full ablation data (drop one item, see the fall)
| Training setup | Avg accuracy | Positive-class F1 |
|---|---|---|
| Qwen base | 44.8% | 55.24% |
| Qwen + GRPO | 73.48% | 88.95% |
| Qwen + full recipe | 84.66% | 92.99% |
| − Interleaved-batch training | 72.18% | 89.01% |
| − CISPO asymmetric clipping | 74.56% | 90.64% |
| − On-policy distillation | 72.39% | 87.93% |
| − Dynamic teacher promotion (fixed base as teacher) | 81.55% | 89.41% |
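To make the ablation table easier to read, the short calculation below turns each row into a drop in percentage points relative to the full recipe. The accuracy figures are copied from the table; the variable names are just for this example.

```python
FULL_RECIPE = 84.66  # average accuracy of the full recipe, in percent
GRPO_ONLY = 73.48    # average accuracy of Qwen + GRPO, in percent

# Accuracy when one item is removed from the full recipe (from the table).
ablations = {
    "interleaved-batch training": 72.18,
    "CISPO asymmetric clipping": 74.56,
    "on-policy distillation": 72.39,
    "dynamic teacher promotion": 81.55,
}

# Drop in percentage points when each item is removed.
drops = {name: round(FULL_RECIPE - acc, 2) for name, acc in ablations.items()}

total_gain = round(FULL_RECIPE - GRPO_ONLY, 2)
sum_of_drops = round(sum(drops.values()), 2)

for name, drop in drops.items():
    print(f"without {name}: -{drop:.2f} points")
```

The individual drops sum to far more than the total gain over plain GRPO, which is the arithmetic behind the remark that the numbers don't simply add up: the pieces depend on one another rather than contributing independently.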
What CISPO and "on-policy distillation" actually do
Of the three moves above, CISPO and on-policy distillation are the two scariest-sounding names. No reinforcement-learning background needed: what they solve is really the same plain problem — don't let the model over-learn or learn crooked, and give it a reliable teacher to learn under.
CISPO asymmetric clipping
CISPO governs how much the model can change per step. It gives different tolerances to "learning toward the good" and "learning toward the bad": like correcting a student, it lets them stride toward the right answer but tightens the reins the moment they head the wrong way. That way the model neither over-learns nor becomes too timid and hesitant.
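A minimal sketch of the asymmetric clipping idea follows. It assumes the common description of CISPO, in which the importance-sampling weight (the ratio between the new and old policy probabilities for a token) is clipped to a band with separate lower and upper bounds and then used as a fixed coefficient on the update. The bound values `eps_low` and `eps_high` and the function names are assumptions; the original does not state the settings it used, and gradient handling is left out entirely.

```python
import math


def clipped_is_weight(logp_new, logp_old, eps_low=0.2, eps_high=0.3):
    """Clip the importance ratio to [1 - eps_low, 1 + eps_high].

    The two bounds differ, so the ratio is allowed to move further in one
    direction than in the other. The values here are illustrative only.
    """
    ratio = math.exp(logp_new - logp_old)
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


def token_coefficient(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.3):
    """Per-token weight on the update: clipped ratio times advantage."""
    return clipped_is_weight(logp_new, logp_old, eps_low, eps_high) * advantage


unchanged = clipped_is_weight(0.0, 0.0)          # ratio 1.0 stays inside the band
capped_up = clipped_is_weight(math.log(2.0), 0.0)    # ratio 2.0 hits the upper bound
capped_down = clipped_is_weight(math.log(0.5), 0.0)  # ratio 0.5 hits the lower bound
```

Because the clipped weight only scales the update instead of zeroing it out, tokens whose ratio leaves the band still contribute a bounded signal, which is one way to avoid both runaway updates and excessive caution.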
On-policy distillation + dynamic teacher promotion
The student model practices on its own while referencing a "teacher model's" answer distribution; stray too far from the teacher and it gets docked and pulled back (the original uses a penalty term — the bigger the student-teacher gap, the more reward is subtracted). The key is that the teacher isn't fixed: validation-set accuracy is checked every 20 steps, and whenever the student sets a new high, the student itself is promoted to the new teacher — never teaching backward with a weaker model.
It's like an apprentice reviewed on a schedule: the moment the apprentice surpasses the current master, the master's seat is handed to that apprentice, and a "stronger version of yourself" teaches the next stage. This step (dynamic teacher promotion) adds about 3.1 percentage points of accuracy (84.66% versus 81.55%) over using a fixed base as the teacher.
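The two mechanisms described here, the penalty term and the promotion rule, can be sketched as follows. Only the 20-step check interval comes from the original; the penalty weight `beta`, the way the student-teacher gap is measured, the starting accuracy, and all function names are assumptions for illustration.

```python
def reward_with_teacher_penalty(task_reward, student_teacher_gap, beta=0.1):
    """Subtract a penalty that grows with the student's distance from the teacher."""
    return task_reward - beta * student_teacher_gap


def promotion_steps(val_accuracy_by_step, initial_best=0.0, check_every=20):
    """Return the steps at which the student would be promoted to teacher.

    The student replaces the teacher only when a scheduled validation check
    shows a new best accuracy, so the teacher never gets weaker.
    """
    best = initial_best
    promoted_at = []
    for step in sorted(val_accuracy_by_step):
        if step % check_every != 0:
            continue  # not a scheduled checkpoint
        acc = val_accuracy_by_step[step]
        if acc > best:
            best = acc
            promoted_at.append(step)
    return promoted_at


# Hypothetical validation curve: step 40 regresses, step 70 is off-schedule.
curve = {20: 0.70, 40: 0.68, 60: 0.75, 70: 0.90}
promotions = promotion_steps(curve)
penalized = reward_with_teacher_penalty(1.0, 2.0)
```

In this toy run the student is promoted at steps 20 and 60; the dip at step 40 leaves the existing teacher in place, which is the "never teaching backward" property.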
Errors down by nearly a third, inference cost cut to 1/14
With every improvement stacked, the custom model lifts average accuracy from the best frontier model's 78.2% to 84.7% — a level the team considers good enough for everyday use. The bigger saving is money: the model is far smaller, and per-task inference costs just 1/13.8 of the matching frontier model.
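The "nearly a third fewer errors" figure follows from comparing error rates rather than accuracies. The calculation below reproduces it from the numbers quoted in this article; the function name is just for this example.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Share of the baseline's errors that the new model eliminates."""
    baseline_err = 1.0 - baseline_acc
    new_err = 1.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Best frontier model: 78.2% accuracy; custom model: 84.7% accuracy.
reduction = relative_error_reduction(0.782, 0.847)

# Inference cost is quoted as 1/13.8 of the matching frontier model.
cost_fraction = 1 / 13.8

print(f"errors reduced by {reduction:.1%}, cost at {cost_fraction:.1%} of frontier")
```

A 6.5-point accuracy gain looks modest, but the error rate falls from 21.8% to 15.3%, which is the 29.8% reduction cited in the summary, at a little over 7% of the per-task inference cost.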
The original says this conclusion holds far beyond the 6 tasks shown here — many similar internal tasks follow the same pattern. The methodology isn't tied to finance either: a recipe of "route hard samples out via model disagreement, have experts clean the data, then do reinforcement-learning fine-tuning" can, the original argues, generalize to other institutions' own concrete judgment tasks. They call it "differentiated intelligence": custom models tuned to specific organizational needs that beat general-purpose frontier models on their home turf.
"Our results show the possibility of a future of differentiated intelligence, where custom models tuned to specific organizational needs outperform frontier models."
Source: Thinking Machines Lab (with Bridgewater AIA Labs), "Learning to Replicate Expert Judgment in Financial Tasks," June 2026