Anthropic Ships the Advisor Tool: When a Cheap Model Gets Stuck Mid-Task, It Can Call a Smarter Model on the Spot
- Anthropic has launched a beta Advisor tool for the Claude API: a faster, cheaper "executor model" can call a smarter "advisor model" mid-generation, get strategic guidance, and then finish the task itself.
- The whole exchange happens inside a single API request: the advisor uses no tools and does no context management, its reasoning is discarded, and only the final guidance text is passed back to the executor.
- The advisor must be as capable as the executor or stronger — for example, with Claude Sonnet 5 as executor, only stronger models like Claude Opus 4.7 or Claude Opus 4.8 qualify as advisors, with pairings strictly defined by an official compatibility table.
- Each advisor call typically outputs just 400 to 700 text tokens (1400 to 1800 including thinking), far less than having the advisor regenerate the whole task from scratch — that's where the savings come from.
- For executors that don't reach for the advisor on their own (especially Claude Haiku 4.5), Anthropic found a "nudge" trick: inserting a "you haven't asked the advisor yet" line on the second turn lifts task pass rates by about 7 points — but timing the nudge too early or too late directly changes the result.
Split the doer from the advisor — pick a different model for each
Anthropic recently launched a beta tool for the Claude API called the Advisor tool. It lets a faster, cheaper "executor model" call a smarter "advisor model" mid-task to get strategic guidance, then keep writing once it has the advice.
A nimble but less-seasoned model does the actual work of producing the answer, and at key decision points it knocks on the door of the stronger model next door, asks a few questions, and heads back to its desk to keep writing — instead of handing the whole task to the strong model to redo from scratch.
Before, it was either-or: cheap or smart
In multi-step agent tasks — writing code, operating a computer, multi-step research — most turns are mechanical, and only a few key moments truly need top-tier intelligence. But calling a model used to mean one of two paths, neither satisfying.
The collaborative mode lands in the middle of the quality-cost tradeoff: the vast majority of tokens come from the cheap executor at a low rate, and only the handful of key moments that truly demand thought spend a few hundred tokens to get advice from a top-tier model.
Halfway through the work, place a call to the advisor
The advisor tool hangs in the tools array like any other tool, and when to call it is the executor's own decision. A complete call runs in four steps.
The whole round trip happens inside a single /v1/messages request — you incur no extra network round trips on your end. The one exception is when the advisor pauses before finishing, in which case you need to send that conversation back verbatim to resume it.
In step one, the executor emits a server_tool_use block named advisor, and its input is always empty: the executor only decides "now is the time to ask," while the server automatically fills in the context handed to the advisor. In step two, the server runs the advisor model separately on its own side; the advisor uses a system prompt provided by Anthropic and sees the executor's full conversation — including your system prompt, tool definitions, all prior turns and tool results, plus everything the executor has written so far this turn.
The advisor is a "give-advice-only, hands-off" role: it uses no tools of its own and does no context management; its reasoning is discarded before returning, and only the guidance text makes it back to the executor.
The reason the executor's call is "empty" is like knocking on a senior colleague's door without having to recap the whole project first — the company's shared doc system has already laid your current progress in front of them, so you just ask.
If the advisor hasn't finished: resume from a pause (pause_turn)
Sometimes the request ends early while the advisor call is still pending, and the response carries a stop_reason: "pause_turn" — only the server_tool_use block that started the call, with no matching result yet. When this happens, just append that assistant message back onto messages verbatim (keeping the server_tool_use block) and send another request with the same advisor tool and beta header. No new user message, no tool_result block needed.
Like getting a "line busy, please try again" on a phone call — you don't hang up, you just redial the same number to pick up where you left off. If the resumed turn pauses again, repeat the same move.
Open for details: the two block types in the response, and the advisor's "encrypted advice"
A successful call first shows a server_tool_use block (starting the call) in the assistant content, immediately followed by an advisor_tool_result block (the advisor's reply). The latter's content is a union type: advisors like Claude Opus 4.8 return a plaintext advisor_result (the text field is readable), while the two advisors Claude Fable 5 and Claude Mythos 5 return an encrypted advisor_redacted_result — you get an unreadable encrypted_content that the server decrypts and renders into the executor's prompt on the next turn. In both cases you must pass the content back verbatim on subsequent turns. On a failed call, the result block carries an error_code (such as overloaded, prompt_too_long, max_uses_exceeded); the executor sees the error and continues without advice, and the request itself does not fail.
Who can advise whom isn't arbitrary
The top-level model field is the executor; the model field inside the tool definition is the advisor, and the two must form a valid pair. There's one hard line: the advisor must be Claude Sonnet 4.6 or stronger, and at least as capable as the executor. Get it wrong and the API returns a 400, calling out that the combination isn't supported.
| Executor model | Eligible advisor models |
|---|---|
| Claude Haiku 4.5 | Fable 5Mythos 5Opus 4.8Opus 4.7Opus 4.6Sonnet 4.6 |
| Claude Sonnet 4.6 | Fable 5Mythos 5Opus 4.8Opus 4.7Opus 4.6Sonnet 4.6 |
| Claude Sonnet 5 | Fable 5Mythos 5Opus 4.8Opus 4.7 |
| Claude Opus 4.6 | Fable 5Mythos 5Opus 4.8Opus 4.7Opus 4.6 |
| Claude Opus 4.7 | Fable 5Mythos 5Opus 4.8Opus 4.7 |
| Claude Opus 4.8 | Fable 5Mythos 5Opus 4.8Opus 4.7 |
| Claude Fable 5 | Fable 5 |
| Claude Mythos 5 | Mythos 5 |
Dashed box = comparable capability, can advise each other (e.g. Opus 4.7 and Opus 4.8). The highlighted Sonnet 5 row is a clear example: it can only pair with Opus 4.7 or Opus 4.8 — not even Opus 4.6 makes the cut.
The logic behind the hard rule "advisor capability must be ≥ executor" is plain: if the advisor is weaker than the executor, it can't offer genuinely valuable advice and the whole call is wasted. So the vendor locks pairings down with a compatibility table, and weak-advisor-with-strong-executor combos simply can't be submitted.
Some models won't ask for advice on their own, so there's a "nudge"
Some executor models won't reach for the advisor on the first turn — especially lighter models like Claude Haiku 4.5. The vendor's fix: if it didn't call the advisor on turn one, insert a short reminder (a nudge) as a separate user message before turn two. But whether that nudge lands early or late significantly changes the outcome.
The nudge itself works well — the trick is timing. Nudge too early, before the executor has grasped the task, and this low-information call crowds out a later, more valuable one. Nudge too late and you miss a better window. Anthropic laid out this timing sensitivity with real experimental data.
The vendor's guidance is specific: first measure which turn your executor typically makes its first advisor call on without any nudge (call it turn N), then set NUDGE_TURN higher than N. If your workload mixes simple and complex tasks, you can raise NUDGE_TURN to 3, letting simple two-turn tasks finish first instead of being forced into an unnecessary consultation by the nudge.
Open for details: how to insert the nudge, and how it differs from "forced calls"
The nudge should be its own separate user message following the tool result, not tucked into the same message as a sibling block. Two consecutive user messages are valid; on Haiku and Sonnet the vendor found both styles equally effective, and making it its own message just keeps the nudge distinct from tool output. Also, if your system prompt already has restraint-oriented wording like "only ask the advisor when genuinely uncertain," skip the nudge — the two instructions would clash. To force a consultation on a given request, set tool_choice to point at advisor, but a forced call can't be combined with extended thinking or the API returns a 400.
The extra you spend on this meal — how the bill adds up
An advisor call is a separate sub-inference billed at the advisor model's own rate, and it isn't rolled into the executor's usage totals. To break down what each segment cost, look at the usage.iterations array.
Each advisor call typically outputs 400 to 700 text tokens, or about 1400 to 1800 tokens counting the thinking used before it's discarded. That's the key to the savings: the advisor doesn't generate your big final output — that part is done by the executor at a lower rate.
A few more details apply only to the executor and don't automatically extend to the advisor: top-level max_tokens constrains only the executor output, not the advisor's sub-inference (to cap the advisor separately, set a max_tokens inside the tool definition); advisor tokens don't count against the executor's task budget either; and Priority Tier commitments are tracked separately — unless your organization also has a commitment on the advisor model, advisor calls don't run on priority.
Save even more: you can cache on the advisor side too
There are two independent caching layers. On the executor side, the advisor_tool_result block is cacheable like any other content block, and this layer needs no special attention; the one that takes some thought is the advisor's own cache.
Each time the advisor sees the conversation, it's the previous version plus one newly appended segment, so the prefix is stable. Turn on caching in the tool definition (like {"type":"ephemeral","ttl":"5m"}) and each call writes a cache; the next call reads to that point and only pays for the new segment. You'll see cache_read_input_tokens turn non-zero in the second and later advisor_message iterations.
It's like having to re-read the earlier meeting minutes every time you consult the advisor. Turning on caching is like filing those minutes away, so from then on you only read out the new part. But building that file costs something too — if you meet too few times, filing costs more than it saves and isn't worth it.
The vendor also flags two consistency pitfalls: first, the caching switch must be set once and kept for the whole run — flipping it back and forth invalidates the cache; second, a misconfigured clear_thinking (keep value not "all") shifts the reference record the advisor sees each turn, likewise breaking advisor-side caching — this only worsens cost, it doesn't affect advice quality.
Open for details: why the default clear_thinking value trips you up
When extended thinking is on but clear_thinking isn't explicitly configured, the API defaults to keep: {type: "thinking_turns", value: 1}, which triggers exactly the cache shift described above (this is the default for early Opus / Sonnet models and all Haiku models; Opus 4.5+ and Sonnet 4.6+ default to keeping all turns). To keep advisor-side caching stable, set keep explicitly to "all".
What kind of work fits this, and what isn't worth the trouble
In its "When to use it" section, the vendor draws the boundary clearly: this collaboration only pays off on mixed workloads where "most turns can be cut cheap and a few must be strong."
Long-chain agent tasks: coding agents, computer use, multi-step research pipelines where most turns are mechanical and only a few key moments truly need top-tier intelligence. Teams already using Sonnet for complex tasks can add an Opus advisor and stand to gain quality while keeping total cost near — or even below — using Sonnet directly; teams already on Haiku 4.5 get a cheaper path to smarter output than jumping straight to a bigger executor.
Single-turn Q&A with nothing to plan; pure "model selector" passthrough scenarios where the user has already made their own cost-quality tradeoff; and workloads where every single step genuinely needs the advisor model's full capability. In these cases the collaborative mode brings no benefit.
The vendor also says it upfront: results vary by task, so evaluate on your own workloads. The tool is currently in beta on the Claude API and on the Claude Platform on AWS; it isn't yet supported on Amazon Bedrock, Google Cloud, or Microsoft Foundry, and it meets zero-data-retention (ZDR) requirements.
You get close to advisor-solo quality while the bulk of token generation happens at executor-model rates. Claude Developer Docs · Advisor tool