Research Explainer · Xiaohu Explainer

Bridgewater trained a model built just for financial-information triage — 84.7% accuracy, and the method is public

In partnership with Thinking Machines, they fine-tuned an open-source model on expert-labeled data: 29.8% fewer errors than the best frontier model, at just 1/14 the inference cost
At a glance
  • Thinking Machines Lab teamed up with AIA Labs, part of Bridgewater, and used its own fine-tuning platform Tinker to train a custom model built specifically for financial-information triage.
  • Top LLMs (the Gemini, Claude, and GPT families) ran 6 financial-triage tasks with simple prompts and averaged only about 50% accuracy; after repeated prompt tuning they topped out at 78.2%, still short of the 80% trust threshold investors require.
  • The training data started out non-expert-labeled and full of errors. The team designed a scheme: send only the disagreement samples — where the model's judgment clashes with the label — to experts for review, keep the rest, and hold labeling costs down.
  • Starting from the open-source Qwen3-235B as the base, standard GRPO reinforcement-learning fine-tuning pushed accuracy to 73.48%; stacking on interleaved-batch training, the CISPO asymmetric-clipping loss, and on-policy distillation with a dynamically promoted teacher brought it to 84.66%.
  • The final model hits 84.7% accuracy — 29.8% fewer errors than the best frontier model tested (78.2%) — at just 1/13.8 the per-task inference cost of the matching frontier model.
⚑ Stance note: This piece is the official blog post jointly published by Thinking Machines Lab and Bridgewater AIA Labs; the model, data, and comparison results are all the publishers' own self-assessment, using a public subset of their internal data. All figures below are relayed as stated in the original.
1. A portfolio manager reads it at a glance; AI can only guess

News a portfolio manager judges in a second, AI can only guess at

Thinking Machines Lab and AIA Labs — part of Bridgewater — co-published a write-up laying out a method and results for training a financial-information triage model on their own fine-tuning platform, Tinker.

What they wanted to automate isn't writing research reports, but the "information sorting" a portfolio manager repeats countless times a day: picking out the part actually worth reading from news, research notes, company filings, and emails. The reading itself isn't hard — what's hard is layer upon layer of close judgment calls, and they eat up huge amounts of time. The team wanted to see whether this work could be handed to a model.

Here's the result up front: top LLMs, used straight out of the box, average only about 50% accuracy — barely better than a coin flip — while this small custom model reaches 84.7%, at just 1/13.8 the inference cost of a frontier model.
Why it's worth a look: a custom model fine-tuned from an open-source base made 29.8% fewer errors than the strongest frontier LLM (78.2% accuracy) on concrete judgment tasks, while cutting per-task inference cost to 1/13.8 of that model's. The winner wasn't a "smarter general-purpose model"; it was "a small model tuned for one specific job."
2. The six little jobs to automate

The six "little jobs" a portfolio manager does every day

For a portfolio manager these six things are instinct — judged in a second — but the moment you ask them to spell out "why I judged it that way," they stall, which is exactly why it's hard to teach an AI. The team broke each one out into a benchmark.

TASK 01
Financial-article relevance
Whether a financial article is worth an investment executive's time. The catch: "relevant" isn't "meaningful" — it takes investment judgment, not keyword matching.
TASK 02
Central-bank document direction
Whether a central-bank document hints at where rates are headed. You have to read the policy lean between the lines — a human just knows from experience.
TASK 03
Does the note answer the question
Given an investor's question and a research document, whether the document actually helps. The call is "is there an answer," not "is it mentioned."
TASK 04
Boilerplate flagging
Whether a research note is pure template (recurring boilerplate) or has one-off new analysis tucked inside. Tell which — and find which page the new analysis ends on.
TASK 05
Document truncation
Find where a document turns into boilerplate. A human eyeballs where the body ends; the model has to pinpoint it.
TASK 06
Email truncation
Find where an email body turns into signatures, disclaimers, and other boilerplate. Same one-second call: where does the body end.

The first three are classification tasks (scored on accuracy + F1); the last three are localization tasks (scored on exact-match accuracy). The original says there are many more similar tasks internally, all with the same pattern: on this kind of work, frontier models generally lose to a model you train yourself.
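
The two scoring styles mentioned above can be sketched in a few lines. This is a minimal illustration, not the team's evaluation code: the function names and the toy predictions are hypothetical, and it assumes "positive-class F1" means the usual harmonic mean of precision and recall on the positive label.

```python
def accuracy(preds, labels):
    """Fraction of predictions that equal the reference label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def positive_f1(preds, labels, positive=True):
    """F1 on the positive class: harmonic mean of precision and recall."""
    tp = sum(p == positive and y == positive for p, y in zip(preds, labels))
    fp = sum(p == positive and y != positive for p, y in zip(preds, labels))
    fn = sum(p != positive and y == positive for p, y in zip(preds, labels))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match(pred_positions, true_positions):
    """Localization score: a prediction counts only if the position is exactly right."""
    return accuracy(pred_positions, true_positions)


# Toy classification task (hypothetical data): is the article relevant?
cls_preds = [True, True, False, False]
cls_labels = [True, False, False, True]

# Toy localization task (hypothetical data): line where the email body ends.
loc_preds = [3, 7, 12]
loc_labels = [3, 8, 12]

cls_accuracy = accuracy(cls_preds, cls_labels)
cls_f1 = positive_f1(cls_preds, cls_labels)
loc_exact = exact_match(loc_preds, loc_labels)
```

Exact match is deliberately strict: predicting line 7 when the body ends at line 8 scores the same as missing by fifty lines.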

3. One to read, one to swipe away

Both touch politics and finance — so why is one relevant and one not

Take a real example from "financial-article relevance." Both headlines below straddle geopolitics and finance, but to a macro investor, one is worth reading and one should be swiped straight away. Guess which one is relevant first.

Headline A
"Trump insists Greenland belongs to him"
Source: ft.com (image from an article about Trump and Greenland).
Headline B
"Trump threatens new tariffs on China; US stocks tumble into the close"
Source: ft.com. The S&P 500 posted its biggest single-day drop since April, snapping a multi-week rally.
Reveal: which one's relevant, and why

B is relevant, A isn't. In context, the Greenland headline reads more like political posturing that the market won't take seriously, while the China tariff threat came with the S&P 500's biggest single-day drop since April, a hard market signal. Both touch geopolitics and finance, so keywords alone can't tell them apart, and this is exactly where models fail. This kind of call tests investment context, not surface-level word matching.

4. The ceiling of prompt engineering

Tweak the prompt every which way, AI still stalls at 78.2%

The team first took the easy road: lean on prompt engineering to save the day. Experts rewrote the instructions to match the real task descriptions, and even redefined the tasks — for instance, changing article classification from two buckets ("relevant / not relevant") to three ("relevant and interesting, relevant but dull, not relevant"), because a minor IPO story counts as financially relevant yet lacks the big-picture significance a macro investor wants.

Simple prompt (frontier average): ~50%
After prompt tuning (best frontier model): 78.2%
80% investor trust threshold: even fully tuned, the best frontier model falls 1.8 points short and does not cross it.

Prompts lifted accuracy from coin-flip level to the low 70s, but that's the ceiling — even automated prompt optimization couldn't squeeze out more. The original also notes "pricier isn't necessarily more accurate": GPT 5.4 costs 43% more than 5.2 yet gains only a sliver of accuracy, showing new models make little progress on this kind of task — especially per dollar spent.

Per-model accuracy / F1 after tuning (original data)
Frontier model (best prompt) | Accuracy | Positive-class F1
Model family 1 | ~47.2% | 77.2%
Model 2 | 50.1% | 74.3%
Model 3 | 47.2% | 75.8%
Model 4 (best) | 48.5% | 78.2%
Model 5 | 45.6% | 78.0%

Note: F1 is averaged over the 3 classification tasks, accuracy over all 6 tasks; in the original's terms the best frontier model tops out at 78.2% accuracy, which is the benchmark the custom model is measured against.

5. Fix the data first, then talk training

The labels themselves were wrong: how to send only the truly hard samples to experts

With prompts maxed out, the team turned to fine-tuning. But the first hurdle wasn't training — it was the data: these judgments are only valuable once they've passed an investment expert's eye. What they initially bought from a vendor was non-expert labeling; trained on it, the model stayed poor, and only by reading through the model's reasoning did they find that the labels in the dataset were often simply wrong.

Having experts relabel everything is too expensive. The team came up with a clever move: let the model itself flag the "suspicious" samples and send only those to experts. The logic is direct: if the model disagrees with a label even on data it was trained on, then either the question is genuinely hard or the original label is wrong, and both are worth an expert's glance.

Non-expert labeled data
Train an initial model
Model re-scores the same data
Pick disagreement samples vs. labels
Send only those to experts
Clean training set emerges
Final test on a held-out set

This way, expert effort is spent only on the genuinely contested samples while the rest are kept as-is — cleaning the data and controlling cost at once. The final evaluation runs on a fully independent held-out set that took no part in the cleaning, so no one grades their own homework.
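
The routing step described above can be sketched as follows. This is a toy illustration under stated assumptions, not the team's pipeline: `Sample`, `split_by_disagreement`, `clean_dataset`, and the keyword-based `toy_predict` and `toy_expert` stand-ins are all hypothetical names introduced here, and in practice the predictor would be the initial fine-tuned model and the expert a human reviewer.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    label: str  # label supplied by the non-expert vendor


def split_by_disagreement(samples, predict):
    """Separate samples where the model agrees with the label from those where it does not."""
    agreed, disputed = [], []
    for sample in samples:
        if predict(sample.text) == sample.label:
            agreed.append(sample)
        else:
            disputed.append(sample)
    return agreed, disputed


def clean_dataset(samples, predict, expert_label):
    """Keep agreed samples as-is; replace disputed labels with the expert's call."""
    agreed, disputed = split_by_disagreement(samples, predict)
    reviewed = [Sample(s.text, expert_label(s.text)) for s in disputed]
    return agreed + reviewed, len(disputed)


def toy_predict(text):
    # Stand-in for the initial model (hypothetical rule, for illustration only).
    return "relevant" if "tariff" in text.lower() else "not_relevant"


def toy_expert(text):
    # Stand-in for a human expert's judgment (hypothetical).
    return "relevant" if "tariff" in text.lower() else "not_relevant"


data = [
    Sample("Trump threatens new tariffs on China", "not_relevant"),  # mislabeled
    Sample("Trump insists Greenland belongs to him", "not_relevant"),
    Sample("Tariff deadline extended, markets rally", "relevant"),
]

cleaned, n_reviewed = clean_dataset(data, toy_predict, toy_expert)
```

Only one of the three samples reaches the expert here, which is the cost argument in miniature: review effort scales with the number of disagreements, not with the size of the dataset.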

6. Core: three moves that lift accuracy

A three-move combo that lifts accuracy from 73% to 84.7%

With the data clean, on to training. The team picked Qwen3-235B — the open-source model academia has studied most thoroughly — as the base, ran all training on Tinker, and never touched GPU infrastructure. Step one laid the foundation with standard GRPO reinforcement-learning fine-tuning, jumping accuracy from the base's 44.8% to 73.48% — but still short of the 80% threshold. What actually pushes it over the line is the three improvements stacked on top.

Grasp the foundation first · GRPO

GRPO is a reinforcement-learning method that needs no separately trained "judge model": the model produces a batch of candidate answers to the same question, they compare against each other for who's closest to the reference answer, and the better ones get reinforced going forward. It's like a group tackling the same problem — no teacher grading, just comparing among themselves who answered more correctly.
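
The "compare among themselves" idea can be made concrete with the group-relative advantage that GRPO is generally described as using: each answer's reward is scored against the mean and spread of its own group. The sketch below is a simplified illustration; the function name, the `eps` constant, and the 0/1 rewards are assumptions, and the original does not state the team's exact normalization.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each sampled answer against its own group.

    rewards: one scalar reward per candidate answer to the same question.
    Returns (reward - group mean) / (group std + eps) for each answer.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some variants use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate answers to one question; 1.0 = matches the reference, 0.0 = does not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every answer gets the same reward there is nothing to compare,
# so every advantage is zero and the question gives no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Answers above the group average get a positive advantage and are reinforced; answers below it get a negative one. No separate judge model is involved, only the rewards within the group.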

Figure: accuracy by training stage against the 80% trust threshold. Qwen base 44.8%, + GRPO tuning 73.48%, full recipe (+3 improvements) 84.66%; inference cost 1/13.8.

The steps are three real data points from the original: base 44.8% → GRPO 73.48% → full recipe 84.66%. The 11-point leap in between comes from the combination of the three improvements below.

Core innovation · 3 improvements

The percentages for each of the three below come from "ablation experiments": remove one item from the full recipe and see how much accuracy drops. The numbers don't simply add up — each item is indispensable, and pulling any one out clearly drags the score back down.

+12.1%
Interleaved-batch training
Instead of training all six tasks "fully mixed into one batch," each batch trains just one task, rotating across tasks. 12.1% higher than fully mixed batches.
+10.1%
CISPO asymmetric clipping
Swap out the standard importance-sampling loss for CISPO asymmetric clipping to control the size of each update step. 10.1% higher than the original loss.
+3.1%
Dynamic teacher promotion
In on-policy distillation, the teacher isn't fixed to the base — whenever the student sets a new high, it takes over. 3.1% higher than a fixed teacher.
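
The difference between fully mixed and interleaved batches is easiest to see in the batch scheduler. The sketch below is a minimal round-robin version under assumptions: the function name, the task names, and the wrap-around sampling are hypothetical, and the original does not describe how the team orders tasks or draws examples within a task.

```python
from itertools import cycle, islice


def interleaved_batches(task_data, batch_size):
    """Yield (task, batch) pairs forever; every batch comes from a single task.

    task_data: dict mapping task name -> non-empty list of examples.
    Tasks are visited round-robin, and each task keeps its own cursor.
    """
    cursors = {task: 0 for task in task_data}
    for task in cycle(task_data):
        items = task_data[task]
        start = cursors[task]
        # Wrap around the task's examples when the end is reached.
        batch = [items[(start + i) % len(items)] for i in range(batch_size)]
        cursors[task] = (start + batch_size) % len(items)
        yield task, batch


task_data = {
    "relevance": ["r0", "r1", "r2"],
    "central_bank": ["c0", "c1", "c2"],
    "email_truncation": ["e0", "e1", "e2"],
}

# The generator is infinite, so take only the first four batches.
schedule = list(islice(interleaved_batches(task_data, batch_size=2), 4))
```

A fully mixed scheduler would instead pool all examples and sample each batch across tasks; here the model sees one task per update and the tasks take turns.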
Full ablation data (drop one item, see the fall)
Training setup | Avg accuracy | Positive-class F1
Qwen base | 44.8% | 55.24%
Qwen + GRPO | 73.48% | 88.95%
Qwen + full recipe | 84.66% | 92.99%
− Interleaved-batch training | 72.18% | 89.01%
− CISPO asymmetric clipping | 74.56% | 90.64%
− On-policy distillation | 72.39% | 87.93%
− Dynamic teacher promotion (fixed base as teacher) | 81.55% | 89.41%
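
To see why the ablation numbers "don't simply add up," the drops can be recomputed directly from the average-accuracy column of the table. This is plain arithmetic on the table rows, in percentage points; the variable names are hypothetical, and how the headline figures on the cards were rounded or baselined is not stated, so small differences from them are possible.

```python
full_recipe = 84.66   # Qwen + full recipe, average accuracy (%)
grpo_only = 73.48     # Qwen + GRPO, average accuracy (%)

# Average accuracy (%) with one item removed from the full recipe.
ablations = {
    "interleaved-batch training": 72.18,
    "CISPO asymmetric clipping": 74.56,
    "on-policy distillation": 72.39,
    "dynamic teacher promotion": 81.55,
}

# Drop in percentage points when each item is removed.
drops = {name: round(full_recipe - acc, 2) for name, acc in ablations.items()}

total_gain = round(full_recipe - grpo_only, 2)    # gain of the full recipe over GRPO alone
sum_of_drops = round(sum(drops.values()), 2)      # far larger than total_gain

for name, drop in drops.items():
    print(f"without {name}: -{drop} points")
print(f"full recipe vs. GRPO alone: +{total_gain} points")
```

The individual drops sum to far more than the total gain over GRPO alone, which is what one would expect if the components depend on each other rather than contributing independently.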
7. Jargon unpacked

What CISPO and "on-policy distillation" actually do

Of the three moves above, CISPO and on-policy distillation have the two most intimidating names. No reinforcement-learning background is needed: both address the same plain problem, which is keeping the model from over-correcting or drifting off course while giving it a reliable teacher to learn from.

CISPO asymmetric-clipping loss

It governs "how much the model can change per step." It gives different tolerances to "learning toward the good" and "learning toward the bad": like correcting a student, letting them stride toward the right answer, but the moment they head the wrong way, tightening the reins at once. That way it neither over-learns nor gets too timid and hesitant.
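
A rough sketch of the clipping idea follows. It assumes the commonly published form of CISPO, in which the importance-sampling weight (new policy probability over old policy probability) is clipped to an asymmetric range and then treated as a constant that scales the log-probability term. The bounds `eps_low` and `eps_high` are hypothetical values chosen for illustration; the original gives neither the bounds nor the exact loss the team used.

```python
import math


def clipped_is_weight(logp_new, logp_old, eps_low=0.2, eps_high=0.3):
    """Importance-sampling weight clipped to [1 - eps_low, 1 + eps_high].

    The two bounds differ, so increases and decreases in a token's
    probability are tolerated to different degrees.
    """
    ratio = math.exp(logp_new - logp_old)
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


def cispo_token_loss(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.3):
    """Per-token loss: -(clipped weight) * advantage * log-prob of the token.

    In an autograd framework the clipped weight would be wrapped in a
    stop-gradient, so gradients flow only through logp_new. Plain Python
    has no gradients, so this only evaluates the loss value.
    """
    weight = clipped_is_weight(logp_new, logp_old, eps_low, eps_high)
    return -weight * advantage * logp_new


# Token whose probability doubled under the new policy: weight capped at 1.3.
w_up = clipped_is_weight(math.log(0.4), math.log(0.2))
# Token whose probability halved: weight floored at 0.8.
w_down = clipped_is_weight(math.log(0.1), math.log(0.2))
# Unchanged probability: weight stays at 1.0.
w_same = clipped_is_weight(math.log(0.2), math.log(0.2))
```

Because the weight is clipped rather than the update being dropped, every token still contributes a gradient; the clip only limits how strongly any single token can pull on the update.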

On-policy distillation + dynamic teacher promotion

The student model practices on its own while referencing a "teacher model's" answer distribution; stray too far from the teacher and it gets docked and pulled back (the original uses a penalty term — the bigger the student-teacher gap, the more reward is subtracted). The key is that the teacher isn't fixed: validation-set accuracy is checked every 20 steps, and whenever the student sets a new high, the student itself is promoted to the new teacher — never teaching backward with a weaker model.

① Student practices: generate answers, collect reward
② Compare to teacher: the farther off, the bigger the penalty
③ Check validation every 20 steps: new accuracy high?
④ On a new high, student → teacher
↺ The new teacher leads the next round, so the level only rises
In one line

Like an apprentice reviewed on a schedule: the moment the apprentice surpasses the current master, the master's seat is handed to that apprentice, and a "stronger version of yourself" teaches the next stage. This step (dynamic teacher promotion) adds 3.1% over using a fixed base as the teacher.
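
The penalty and the promotion rule can be sketched together. Everything here is a simplified illustration under assumptions: the function names, the `beta` coefficient, and the toy accuracy curve are hypothetical; the gap is estimated as the summed log-probability difference on the student's own tokens (a common single-sample estimate of reverse KL, which the original does not confirm); and the teacher is assumed to start as a copy of the base model. Only the 20-step check interval comes from the original.

```python
import copy


def shaped_reward(task_reward, student_logps, teacher_logps, beta=0.1):
    """Task reward minus a penalty that grows with the student-teacher gap."""
    gap = sum(s - t for s, t in zip(student_logps, teacher_logps))
    return task_reward - beta * gap


def train_with_promotion(student, train_step, evaluate, total_steps, check_every=20):
    """Training loop in which the student replaces the teacher on each new best score."""
    teacher = copy.deepcopy(student)  # assumption: the teacher starts as the base model
    best_acc = evaluate(student)
    promotions = []
    for step in range(1, total_steps + 1):
        train_step(student, teacher)
        if step % check_every == 0:
            acc = evaluate(student)
            if acc > best_acc:  # new validation high: promote the student
                best_acc = acc
                teacher = copy.deepcopy(student)
                promotions.append(step)
    return teacher, best_acc, promotions


# Toy stand-ins: the "model" is a step counter and validation accuracy is a lookup.
toy_curve = {20: 0.75, 40: 0.74, 60: 0.80}


def toy_train_step(student, teacher):
    student["step"] += 1  # a real step would update weights using shaped rewards


def toy_evaluate(student):
    return toy_curve.get(student["step"], 0.70)


teacher, best_acc, promotions = train_with_promotion(
    {"step": 0}, toy_train_step, toy_evaluate, total_steps=60
)
penalized = shaped_reward(1.0, [-0.1, -0.2], [-0.3, -0.5], beta=0.1)
```

In the toy run the student is promoted at steps 20 and 60 but not at step 40, where validation accuracy dipped, so the teacher is never replaced by a weaker checkpoint.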

8. Final results

Errors down by nearly a third, inference cost cut to 1/14

With every improvement stacked, the custom model lifts average accuracy from the best frontier model's 78.2% to 84.7% — a level the team considers good enough for everyday use. The bigger saving is money: the model is far smaller, and per-task inference costs just 1/13.8 of the matching frontier model.

Best frontier model
  Average accuracy: 78.2% (ceiling after prompt tuning, below the 80% threshold)
  Per-task inference cost: baseline
In-house custom model
  Average accuracy: 84.7% (29.8% fewer errors than the best frontier model)
  Per-task inference cost: 1/13.8 of the frontier model, roughly a fourteenth
Error-rate reduction of the in-house model vs. the best frontier model: 29.8%
Reduction in per-task inference cost: 13.8×
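
The 29.8% figure is a relative reduction in errors, not a difference in accuracy, which is easy to check from the two accuracy numbers. The short calculation below uses only figures quoted above; the function and variable names are hypothetical.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Share of the baseline's errors that the new model eliminates (accuracies in %)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Best frontier model: 78.2% accuracy, so 21.8% errors.
# Custom model: 84.7% accuracy, so 15.3% errors.
reduction = relative_error_reduction(78.2, 84.7)
reduction_pct = round(reduction * 100, 1)

# Per-task inference cost as a share of the frontier model's cost.
cost_share_pct = round(100 / 13.8, 1)

print(f"errors cut by {reduction_pct}%")
print(f"cost is about {cost_share_pct}% of the frontier model's")
```

So a 6.5-point gain in accuracy removes a little under a third of the remaining errors, and a 13.8× cost reduction means paying roughly 7% of the frontier model's per-task cost.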

The original says this conclusion holds far beyond the 6 tasks shown here — many similar internal tasks follow the same pattern. The methodology isn't tied to finance either: a recipe of "route hard samples out via model disagreement, have experts clean the data, then do reinforcement-learning fine-tuning" can, the original argues, generalize to other institutions' own concrete judgment tasks. They call it "differentiated intelligence": custom models tuned to specific organizational needs that beat general-purpose frontier models on their home turf.

"Our results show the possibility of a future of differentiated intelligence, where custom models tuned to specific organizational needs outperform frontier models."
Thinking Machines Lab (with Bridgewater AIA Labs), "Learning to Replicate Expert Judgment in Financial Tasks," June 2026
Source: Thinking Machines Lab official blog, "Learning to replicate expert judgment in financial tasks" (with Bridgewater AIA Labs, June 2026). Authors: Su, Sarah; Zhu, Kevin; Xiao, Emily; Alur, Rohan; Kang, Daniel. This is a Xiaohu Explainer visual explainer; all models, data, and comparison conclusions reflect the publishers' self-assessment, based on a public subset of their internal data.