Product Launch · Xiaohu Explainer

Anthropic Ships the Advisor Tool: When a Cheap Model Gets Stuck Mid-Task, It Can Call a Smarter Model on the Spot

The advisor model returns just a few hundred words of guidance instead of doing the whole job — same quality, lower total cost. Currently a beta feature limited to the Claude API and AWS.
At a glance
  • Anthropic has launched a beta Advisor tool for the Claude API: a faster, cheaper "executor model" can call a smarter "advisor model" mid-generation, get strategic guidance, and then finish the task itself.
  • The whole exchange happens inside a single API request: the advisor uses no tools and does no context management, its reasoning is discarded, and only the final guidance text is passed back to the executor.
  • The advisor must be as capable as the executor or stronger — for example, with Claude Sonnet 5 as executor, only stronger models like Claude Opus 4.7 or Claude Opus 4.8 qualify as advisors, with pairings strictly defined by an official compatibility table.
  • Each advisor call typically outputs just 400 to 700 text tokens (1400 to 1800 including thinking), far less than having the advisor regenerate the whole task from scratch — that's where the savings come from.
  • For executors that don't reach for the advisor on their own (especially Claude Haiku 4.5), Anthropic found a "nudge" trick: inserting a "you haven't asked the advisor yet" line on the second turn lifts task pass rates by about 7 points — but timing the nudge too early or too late directly changes the result.
Note on stance: this piece is based on Anthropic's official developer docs and is a vendor's own product documentation. The pass-rate lifts, call ratios, and token ranges cited are all Anthropic's internal behavioral-eval figures; the vendor itself notes "results vary by task — evaluate on your own workloads." What follows covers the mechanism and the official framing, not an endorsement of results.
1What it is

Split the doer from the advisor — pick a different model for each

Anthropic recently launched a beta tool for the Claude API called the Advisor tool. It lets a faster, cheaper "executor model" call a smarter "advisor model" mid-task to get strategic guidance, then keep writing once it has the advice.

A nimble but less-seasoned model does the actual work of producing the answer, and at key decision points it knocks on the door of the stronger model next door, asks a few questions, and heads back to its desk to keep writing — instead of handing the whole task to the strong model to redo from scratch.

Why it's worth a look: a single advisor call typically outputs just 400 to 700 text tokens (about 1400 to 1800 with thinking), far below what regenerating the whole task would take. Anthropic's claim is that with this kind of "interject mid-stream" collaboration, the executor's output can approach the quality of the advisor solving the task alone — while most tokens are still billed at the executor's lower rate. Enabling it requires the beta header advisor-tool-2026-03-01.
Executor executor Fires an "empty call" mid-task Advisor advisor "Use a channel; close inputs first…"
On the left, the executor churns out tokens and does the work; on the right, the advisor stays silent until it's called, then lights up and pops out a single suggestion — after which the executor carries on with that line in hand
2The problem

Before, it was either-or: cheap or smart

In multi-step agent tasks — writing code, operating a computer, multi-step research — most turns are mechanical, and only a few key moments truly need top-tier intelligence. But calling a model used to mean one of two paths, neither satisfying.

Executor alone (small model all the way)
Cost
Low
Quality
Drops at key moments
Advisor alone (big model all the way)
Cost
High — every mechanical step at top rate
Quality
High
Executor + advisor together (this tool)
Cost
Bulk at executor rate, only a few hundred tokens at advisor rate
Quality
Vendor says it approaches advisor-solo

The collaborative mode lands in the middle of the quality-cost tradeoff: the vast majority of tokens come from the cheap executor at a low rate, and only the handful of key moments that truly demand thought spend a few hundred tokens to get advice from a top-tier model.

3Core mechanism · Key

Halfway through the work, place a call to the advisor

The advisor tool hangs in the tools array like any other tool, and when to call it is the executor's own decision. A complete call runs in four steps.

Hero · How the collaboration happens

The whole round trip happens inside a single /v1/messages request — you incur no extra network round trips on your end. The one exception is when the advisor pauses before finishing, in which case you need to send that conversation back verbatim to resume it.

STEP 1Executor fires an "empty call," signaling timing, with empty input
STEP 2The server runs one inference on the advisor, which sees the full conversation
STEP 3Advice returns to the executor as an advisor_tool_result
STEP 4Executor keeps generating, advice in hand

In step one, the executor emits a server_tool_use block named advisor, and its input is always empty: the executor only decides "now is the time to ask," while the server automatically fills in the context handed to the advisor. In step two, the server runs the advisor model separately on its own side; the advisor uses a system prompt provided by Anthropic and sees the executor's full conversation — including your system prompt, tool definitions, all prior turns and tool results, plus everything the executor has written so far this turn.

The advisor is a "give-advice-only, hands-off" role: it uses no tools of its own and does no context management; its reasoning is discarded before returning, and only the guidance text makes it back to the executor.

Analogy · The empty call

The reason the executor's call is "empty" is like knocking on a senior colleague's door without having to recap the whole project first — the company's shared doc system has already laid your current progress in front of them, so you just ask.

If the advisor hasn't finished: resume from a pause (pause_turn)

Sometimes the request ends early while the advisor call is still pending, and the response carries a stop_reason: "pause_turn" — only the server_tool_use block that started the call, with no matching result yet. When this happens, just append that assistant message back onto messages verbatim (keeping the server_tool_use block) and send another request with the same advisor tool and beta header. No new user message, no tool_result block needed.

Analogy · Resume from a pause

Like getting a "line busy, please try again" on a phone call — you don't hang up, you just redial the same number to pick up where you left off. If the resumed turn pauses again, repeat the same move.

Open for details: the two block types in the response, and the advisor's "encrypted advice"

A successful call first shows a server_tool_use block (starting the call) in the assistant content, immediately followed by an advisor_tool_result block (the advisor's reply). The latter's content is a union type: advisors like Claude Opus 4.8 return a plaintext advisor_result (the text field is readable), while the two advisors Claude Fable 5 and Claude Mythos 5 return an encrypted advisor_redacted_result — you get an unreadable encrypted_content that the server decrypts and renders into the executor's prompt on the next turn. In both cases you must pass the content back verbatim on subsequent turns. On a failed call, the result block carries an error_code (such as overloaded, prompt_too_long, max_uses_exceeded); the executor sees the error and continues without advice, and the request itself does not fail.

4Pairing rules

Who can advise whom isn't arbitrary

The top-level model field is the executor; the model field inside the tool definition is the advisor, and the two must form a valid pair. There's one hard line: the advisor must be Claude Sonnet 4.6 or stronger, and at least as capable as the executor. Get it wrong and the API returns a 400, calling out that the combination isn't supported.

Executor modelEligible advisor models
Claude Haiku 4.5Fable 5Mythos 5Opus 4.8Opus 4.7Opus 4.6Sonnet 4.6
Claude Sonnet 4.6Fable 5Mythos 5Opus 4.8Opus 4.7Opus 4.6Sonnet 4.6
Claude Sonnet 5Fable 5Mythos 5Opus 4.8Opus 4.7
Claude Opus 4.6Fable 5Mythos 5Opus 4.8Opus 4.7Opus 4.6
Claude Opus 4.7Fable 5Mythos 5Opus 4.8Opus 4.7
Claude Opus 4.8Fable 5Mythos 5Opus 4.8Opus 4.7
Claude Fable 5Fable 5
Claude Mythos 5Mythos 5

Dashed box = comparable capability, can advise each other (e.g. Opus 4.7 and Opus 4.8). The highlighted Sonnet 5 row is a clear example: it can only pair with Opus 4.7 or Opus 4.8 — not even Opus 4.6 makes the cut.

The logic behind the hard rule "advisor capability must be ≥ executor" is plain: if the advisor is weaker than the executor, it can't offer genuinely valuable advice and the whole call is wasted. So the vendor locks pairings down with a compatibility table, and weak-advisor-with-strong-executor combos simply can't be submitted.

5The nudge · Key

Some models won't ask for advice on their own, so there's a "nudge"

Some executor models won't reach for the advisor on the first turn — especially lighter models like Claude Haiku 4.5. The vendor's fix: if it didn't call the advisor on turn one, insert a short reminder (a nudge) as a separate user message before turn two. But whether that nudge lands early or late significantly changes the outcome.

Hero · Timing sensitivity

The nudge itself works well — the trick is timing. Nudge too early, before the executor has grasped the task, and this low-information call crowds out a later, more valuable one. Nudge too late and you miss a better window. Anthropic laid out this timing sensitivity with real experimental data.

+7 pp
Task pass-rate lift for Claude Haiku 4.5 after a nudge on the right turn (Anthropic internal behavioral eval)
74%–98%
Share that calls the advisor on turn two right after a nudge: Claude Sonnet ~74%, Claude Haiku 4.5 ~98%
−3~4 pp
Drop on tasks whose baseline first call belongs after turn 7, when interrupted by a turn-2 nudge
86% → no cost
On a browsing task with an 86% baseline call rate, adding a nudge only raises participation with no hit to task performance
Turn the nudge is inserted (relative to baseline first call) Task performance Too early −3~4 pp Near baseline · best Too late Missed window
The default NUDGE_TURN is 2, placing the nudge where the model "has grasped the task but hasn't locked in its approach." For tasks whose baseline first call comes late (after turn 7), a hard push on turn 2 drags them out of the sweet spot and to the left

The vendor's guidance is specific: first measure which turn your executor typically makes its first advisor call on without any nudge (call it turn N), then set NUDGE_TURN higher than N. If your workload mixes simple and complex tasks, you can raise NUDGE_TURN to 3, letting simple two-turn tasks finish first instead of being forced into an unnecessary consultation by the nudge.

Nudge on the right turn
After the nudge, Haiku calls the advisor on turn two 98% of the time, lifting pass rate by about 7 points. On a browsing task already at an 86% call rate, adding a nudge only raises participation with no drop in performance.
Nudge inserted too early
On tasks whose baseline first call belongs after turn 7, a turn-2 nudge interrupts; this low-information consultation crowds out a better later call, and performance actually drops 3 to 4 points.
Wrong model
On a Sonnet executor, this plaintext nudge showed no measurable effect in the vendor's tests; on an Opus executor it slightly lowered pass rate, so the nudge doesn't apply to Opus.
Open for details: how to insert the nudge, and how it differs from "forced calls"

The nudge should be its own separate user message following the tool result, not tucked into the same message as a sibling block. Two consecutive user messages are valid; on Haiku and Sonnet the vendor found both styles equally effective, and making it its own message just keeps the nudge distinct from tool output. Also, if your system prompt already has restraint-oriented wording like "only ask the advisor when genuinely uncertain," skip the nudge — the two instructions would clash. To force a consultation on a given request, set tool_choice to point at advisor, but a forced call can't be combined with extended thinking or the API returns a 400.

6Billing

The extra you spend on this meal — how the bill adds up

An advisor call is a separate sub-inference billed at the advisor model's own rate, and it isn't rolled into the executor's usage totals. To break down what each segment cost, look at the usage.iterations array.

iteration 1
89
advisor
1612 (with thinking) · at advisor rate
iteration 3
442 · at executor rate
type: "message" iterations bill at the executor rate; type: "advisor_message" iterations bill at the advisor rate. Top-level usage counts only executor tokens — output_tokens is the sum of all executor iterations, and input/cache_read reflect only the first executor iteration.

Each advisor call typically outputs 400 to 700 text tokens, or about 1400 to 1800 tokens counting the thinking used before it's discarded. That's the key to the savings: the advisor doesn't generate your big final output — that part is done by the executor at a lower rate.

400–700
Typical text output per advisor call (tokens)
1400–1800
Total tokens per advisor call, thinking included

A few more details apply only to the executor and don't automatically extend to the advisor: top-level max_tokens constrains only the executor output, not the advisor's sub-inference (to cap the advisor separately, set a max_tokens inside the tool definition); advisor tokens don't count against the executor's task budget either; and Priority Tier commitments are tracked separately — unless your organization also has a commitment on the advisor model, advisor calls don't run on priority.

7Save a bit more

Save even more: you can cache on the advisor side too

There are two independent caching layers. On the executor side, the advisor_tool_result block is cacheable like any other content block, and this layer needs no special attention; the one that takes some thought is the advisor's own cache.

Each time the advisor sees the conversation, it's the previous version plus one newly appended segment, so the prefix is stable. Turn on caching in the tool definition (like {"type":"ephemeral","ttl":"5m"}) and each call writes a cache; the next call reads to that point and only pays for the new segment. You'll see cache_read_input_tokens turn non-zero in the second and later advisor_message iterations.

Analogy · Advisor-side caching

It's like having to re-read the earlier meeting minutes every time you consult the advisor. Turning on caching is like filing those minutes away, so from then on you only read out the new part. But building that file costs something too — if you meet too few times, filing costs more than it saves and isn't worth it.

≤ 2 calls
If the advisor is called two times or fewer in this conversation, cache write cost exceeds the read savings — don't turn it on
≈ 3 calls
Around three calls it breaks even, and beyond that it grows increasingly worthwhile — good for long agent loops

The vendor also flags two consistency pitfalls: first, the caching switch must be set once and kept for the whole run — flipping it back and forth invalidates the cache; second, a misconfigured clear_thinking (keep value not "all") shifts the reference record the advisor sees each turn, likewise breaking advisor-side caching — this only worsens cost, it doesn't affect advice quality.

Open for details: why the default clear_thinking value trips you up

When extended thinking is on but clear_thinking isn't explicitly configured, the API defaults to keep: {type: "thinking_turns", value: 1}, which triggers exactly the cache shift described above (this is the default for early Opus / Sonnet models and all Haiku models; Opus 4.5+ and Sonnet 4.6+ default to keeping all turns). To keep advisor-side caching stable, set keep explicitly to "all".

8Should you use it

What kind of work fits this, and what isn't worth the trouble

In its "When to use it" section, the vendor draws the boundary clearly: this collaboration only pays off on mixed workloads where "most turns can be cut cheap and a few must be strong."

Fits · adding an advisor makes sense

Long-chain agent tasks: coding agents, computer use, multi-step research pipelines where most turns are mechanical and only a few key moments truly need top-tier intelligence. Teams already using Sonnet for complex tasks can add an Opus advisor and stand to gain quality while keeping total cost near — or even below — using Sonnet directly; teams already on Haiku 4.5 get a cheaper path to smarter output than jumping straight to a bigger executor.

Doesn't fit · don't bother

Single-turn Q&A with nothing to plan; pure "model selector" passthrough scenarios where the user has already made their own cost-quality tradeoff; and workloads where every single step genuinely needs the advisor model's full capability. In these cases the collaborative mode brings no benefit.

The vendor also says it upfront: results vary by task, so evaluate on your own workloads. The tool is currently in beta on the Claude API and on the Claude Platform on AWS; it isn't yet supported on Amazon Bedrock, Google Cloud, or Microsoft Foundry, and it meets zero-data-retention (ZDR) requirements.

You get close to advisor-solo quality while the bulk of token generation happens at executor-model rates. Claude Developer Docs · Advisor tool
Source: the Claude developer docs Advisor tool page (platform.claude.com). This piece is a visual explainer by the Xiaohu Explainer site based on the official docs and is the vendor's own product documentation; pass rates, call ratios, and token ranges are all Anthropic's internal eval figures, and the vendor notes that results vary by task. The feature is in beta — parameters and the compatibility table follow the latest official docs.