Deep Dive · Xiaohu Explainer

What Exactly Is Loop Engineering, the Idea Blowing Up Right Now

A deep dive into the "loop engineering" methodology named by Addy Osmani, with findings from Anthropic engineers: stop prompting AI one line at a time and design a loop system that runs itself

At a glance

Loop Engineering was hit upon independently by Addy Osmani, Boris Cherny, and Peter Steinberger in the same week of June 2026, and Osmani gave it its name: stop prompting AI by hand and design a system that prompts AI automatically. You shift from "operating the AI" to "designing the system that drives it"
A loop has five steps: discover the task, hand it off, verify independently, persist state, schedule automatically. Drop any one and you get a named failure mode, in the same order: the blind loop, the tangled loop, the nodding loop, the amnesiac loop, the manual loop
The most critical and most-often-skipped step is verification: let AI grade its own output and it praises itself — you need a separate agent as the nitpicker, defaulting to "the code is broken," actually clicking through the page and screenshotting rather than just reading the code
Stripe's Minions pipeline merges 1,300+ machine-written PRs a week — reliability comes from deterministic constraints (a linter that must run, that the agent can't bypass), not a bigger model
A loop quietly accrues four costs: verification debt, comprehension rot, cognitive surrender, and a runaway token bill — they reinforce each other and blow up together, and the gatekeeper is always human judgment

1 · Origin · Naming

In One Week, Three People Hit the Same Thing

In one week of June 2026, Google Chrome engineer Addy Osmani, Anthropic's Claude Code lead Boris Cherny, and OpenClaw author Peter Steinberger — none of them comparing notes — hit the same thing: they'd stopped prompting AI by hand and were designing "systems that prompt AI automatically."

Loop Engineering is about a shift in identity: you go from "the person sitting at the keyboard directing AI one line at a time" to "the person who designs a system that assigns work to AI on its own." The weight of that shift rests on one idea: replacing yourself at the keyboard.


Why it's worth your time: Steinberger's post — "design loops, don't prompt agents" — passed 8 million views; Cherny put it as "what I write now is loops, my job is to write loops"; Osmani named it Loop Engineering and wrote it up on June 7. Three lines pointing at one move: what you design goes from a single behavior of an agent to the whole system that drives the agent. And Stripe is already running this pattern, merging over 1,300 machine-written PRs a week.

8,000,000+

Views on Steinberger's "design loops, don't prompt agents" post

June 7

The date Osmani named it Loop Engineering and published it; mirrored to Substack the next day

Why It Surfaced This Exact Week

Three people who never coordinated reached for the same word in the same week — not coincidence, but the tools around them quietly crossing a threshold. Three conditions ripened at once: coding agents got reliable enough to finish a non-trivial task unattended; scheduling primitives just landed in mainstream tools; the cost of a single run dropped low enough that running it over and over wasn't wasteful. With all the parts on the table, "put them together" became obvious to everyone at the same moment.

The name lagged the practice by months: before anyone called it Loop Engineering, people were already writing loops — just as, before "generator/evaluator separation" had a name, teams were already pairing a code-writing agent with a code-reviewing one. Worth remembering as a rule: the next new term won't come from a model release, it'll come from the moment some capability gets cheap enough to turn a once-unthinkable combination into an everyday move.

2 · Where It Sits

It Sits on Top of These Four Layers

These "___ engineering" terms don't replace one another. They stack, each managing something one size bigger than the one below: from a single sentence, to a context window, to a single run, to a loop that turns on its own. For each layer, note what it governs and how large the blast radius is when it fails.

Loop · Loop Engineering · the top layer

What it governs: scheduling on top of the harness so it runs over and over by itself. Core question: how do you make it cycle unattended? Three verbs more than the layer below: run on a schedule (wake up on time, no button to press), split off sub-agents (one drafts the change, another nitpicks), and feed its own output into the next round (yesterday's findings written to a file, read back this morning to keep going). Blast radius: an error gets written into the state file, read back the next day as established fact, built on top of — it can propagate for many rounds before anyone notices.

Harness · Gear for One Run · arm one run

What it governs: kitting out a single agent run — which tools it can use, which actions are allowed, how to recover from errors, what state counts as done. Core question: what does this one run bring with it? It arms one run, but doesn't let that run repeat itself. Blast radius: the agent edits a file once based on a misread, but the run ends, the diff is visible, and it ships only after a human review.

Context · Context Engineering · the window now

What it governs: what goes in the window right now — what to retrieve, how to summarize, what stale info to clear. Core question: what do you show the model so it can solve it? A window stuffed with noise wastes even the best prompt. Blast radius: one confident wrong answer; spot it and clear the context.

Prompt · Prompt Engineering · the words you write

What it governs: the sentence you write for the model — wording, examples, role, tone. Core question: what should you tell the model? Its boundary is a single conversation. The catch: it assumes someone is sitting there each time to hand the prompt over. Blast radius: you see it right there in the conversation; rewrite the prompt.

Take the same bug — the agent misreads a function's return value — across the four layers: the higher up, the later it's caught and the more it costs. At the loop layer, that misread gets written into the state file, read back the next day as fact, built up layer by layer; by the time anyone looks, the wrong assumption has become a load-bearing wall.

This is the one intuition to keep from loop engineering: the cost of an error equals the number of rounds it survives before someone catches it — and a loop is, by construction, a machine for maximizing that number. Everything that follows — the evaluator, human checkpoints, budget caps — exists for one purpose only: to shrink the distance between "made a mistake" and "caught it."
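A toy model makes that intuition concrete. It assumes, hypothetically, that an error enters in a known round and that an independent check runs every `check_every` rounds; the function name and parameters are illustrative and do not come from Osmani's write-up.

```python
def rounds_survived(introduced_at: int, check_every: int) -> int:
    """Rounds a wrong assumption lives before the next independent check.

    Checks run at the end of rounds check_every, 2 * check_every, and so on.
    An error introduced during round `introduced_at` is caught by the first
    check at or after that round.
    """
    if introduced_at < 1 or check_every < 1:
        raise ValueError("rounds and intervals are 1-based positive integers")
    # First checked round >= introduced_at (ceiling division).
    next_check = -(-introduced_at // check_every) * check_every
    # Count the round of introduction itself.
    return next_check - introduced_at + 1


# The same mistake, made in round 3, under three checking regimes.
for interval in (1, 7, 30):
    print(interval, rounds_survived(introduced_at=3, check_every=interval))
```

With a check every round the mistake lives for 1 round; with a weekly check it lives for 5; with a monthly check, 28. The mistake is identical in all three cases; only the distance to the next "no" changes.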

3 · Five Steps

Five Steps to a Lap — Drop Any One and It Breaks in a Set Way

Don't misread "loop" as spinning in place. Every round does something concrete: find work worth doing, hand it to an agent, verify it's right, save the state, then decide the next step. Drop any of the five and the loop either won't turn or just spins on the spot — and it breaks in a way that has a name.

Five steps in a clockwise ring; scheduling feeds this round's unfinished work into tomorrow's · verification is the only step in the whole ring that can say "no"

Below, using the "morning triage loop" Osmani built for himself, we walk through what each step does and which kind of loop you get if you skip it.

Discover

The triage skill reads yesterday's failed CI, still-open issues, and recently merged commits, and figures out for itself what this round should tackle. The point is to let the agent find work, not hand it a list. This step sets the ceiling on the whole loop's quality: if the work it surfaces has no value, the other four steps can be flawless and still be wasted.

Skip it → the blind loop: you're still handing out work every morning by hand — "fix these three bugs" — you automated the "doing" but not the "finding," and "finding" is often the most expensive part

Hand off

Pass the work from the scheduling system into the hands of the agent that does it. For each finding worth acting on, open a separate git worktree; multiple agents edit code in their own directories without colliding on files. The cleaner each chunk is cut, the easier the later verification and merge.

Skip it → the tangled loop: several agents run in parallel but all edit the same directory, the changes collide, and the merge becomes a knot nobody can untangle — invisible with one agent, exposed only on the morning five run at once

Verify

⊘ The only step in the ring that can say "no"

The easiest step to cut corners on, and the one you can least afford to. The first sub-agent drafts the fix, then a second sub-agent reviews it — different instructions, sometimes even a different model. The agent that wrote the code grades its own homework too leniently; the dedicated nitpicker catches what the first one talked itself into letting through. A loop with no real check is just an agent nodding at itself.

Skip it → the nodding loop: the most common failure — every round self-approves, piling up at machine speed a stack of errors that look fine; hundreds of rounds and never once a "no," which is statistically impossible for any real workload and is exactly the proof that no checking is happening
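The "never once a no" signal can be checked mechanically. The sketch below assumes the loop logs one verdict string per round (`"PASS"` or `"REJECT"`); the 50-round threshold is an arbitrary illustrative default, not a figure from the source.

```python
from collections import Counter


def looks_like_nodding(verdicts: list[str], min_rounds: int = 50) -> bool:
    """Flag a verdict history that is long enough to judge and has zero rejections."""
    counts = Counter(v.upper() for v in verdicts)
    total = sum(counts.values())
    # Counter returns 0 for a missing key, so an all-PASS history has counts["REJECT"] == 0.
    return total >= min_rounds and counts["REJECT"] == 0


history = ["PASS"] * 200
if looks_like_nodding(history):
    print("No rejections in", len(history), "rounds: check that verification is actually running")
```

A flag here does not prove the work is bad; it only says the evaluator has never disagreed, which is the thing worth investigating.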

Persist

Land the results somewhere that outlives this conversation: open a PR via a connector, update the ticket, send what it can't solve to an inbox, and keep a state file tracking progress. A loop's memory can't live only in the context window — what's written into markdown or a board doesn't get forgotten.

Skip it → the amnesiac loop: good work gets found and done, then forgotten, because the result lived only in a context window that got cleared; the next day it rediscovers the same work, even redoes it in a way that conflicts with the first result — every morning starting from the same spot, with nothing accumulating

Schedule

This is what turns "one round" into "a loop." Triage runs automatically every morning, the state file carries unfinished findings into the next day, and the next day it picks them up on its own. In Osmani's words: it's automation that makes a loop a real loop, not just a run you happened to do once.

Skip it → the manual loop: the other four steps are fine, just no automation — that's not a loop, it's a script you run by hand and then forget to run again; dazzling in the demo the day you build it, quietly stopped the moment your attention moves on, its last run being the day it was demoed

These five aren't an à la carte list — they're an interdependent whole. Disciplined teams install all five; rushed teams install only the first two (discover + hand off), because those two produce visible output, while the last three produce "safety," which is the easiest thing to drop.
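As a rough sketch of how the five steps depend on each other, one round can be written as a single function. Everything here is hypothetical scaffolding: `discover`, `draft`, and `verify` stand in for agent calls, state is kept in memory rather than on disk, and scheduling is simply whatever calls `run_round` again tomorrow with the same state.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Finding:
    title: str
    status: str = "new"  # new -> pr_open | inbox


@dataclass
class LoopState:
    findings: list[Finding] = field(default_factory=list)


def run_round(
    state: LoopState,
    discover: Callable[[], list[str]],
    draft: Callable[[str], str],
    verify: Callable[[str, str], bool],
) -> LoopState:
    # Discover: add work nobody handed us, skipping what the state already tracks.
    known = {f.title for f in state.findings}
    for title in discover():
        if title not in known:
            state.findings.append(Finding(title))

    for finding in state.findings:
        if finding.status != "new":
            continue
        patch = draft(finding.title)           # hand off to the drafting agent
        passed = verify(finding.title, patch)  # a separate check that can say "no"
        # Persist: record the outcome so the next round does not redo it.
        finding.status = "pr_open" if passed else "inbox"
    return state


# Schedule: the same state object is passed into the next round.
state = LoopState()
state = run_round(state, lambda: ["flaky auth test"], lambda t: f"patch for {t}", lambda t, p: False)
state = run_round(state, lambda: ["flaky auth test"], lambda t: f"patch for {t}", lambda t, p: True)
print([(f.title, f.status) for f in state.findings])
```

The second round does nothing to the rejected finding because its status was persisted. Remove the `known` check or the status update and the sketch reproduces the amnesiac loop.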

4 · Six Parts

Building a Loop Means Assembling Six Parts

The five steps describe "what happens in a round"; the six parts describe "what you need in hand to make it turn." Five of the parts map one to one onto the steps: discovery via skills, handoff via worktrees, verification via sub-agents, persistence via memory, scheduling via automations. The sixth, connectors, serves both persistence and discovery. Each part below is tagged with the step it serves.

① Automations Schedule

Hung off a schedule or a trigger so the loop moves on its own. Without it, all you have is a run, not a loop. Two kinds: local (the machine has to be on) and cloud (it runs even when the machine is off). What gets triggered should be a named skill, not a big blob of instructions glued into a cron job.

② Worktrees Hand off

A built-in git mechanism: multiple non-interfering working directories in one repo. Its value rises with parallelism: two agents writing the same file at once is as painful as two engineers committing to the same line. It turns parallel work from "runs but messy" into "runs and clean."

③ Skills / SKILL.md Discover

Freeze project knowledge into a file so the agent doesn't re-derive context every round. Osmani calls the cost it saves intent debt: the accumulated cost of re-explaining "what this project is, what the rules are, where the traps are" every time you open a new session. A skill can be reused and maintained; a wall of prompts can't.

④ Connectors Persist / Discover

Connectors built on MCP (Model Context Protocol, the standard interface for AI to call external tools), wiring the loop to the outside: issue tracker, database, staging API, Slack. A loop that can see only the local filesystem is a very small loop. A connector written for one tool can often be moved to another as-is.

⑤ Sub-agents Verify

Split the writer and the judge into two agents. When one agent is both player and referee, the referee favors itself. The counterintuitive part: tuning an independent referee to be picky is far easier than tuning the author to be self-critical, so a loop would rather keep an extra agent around than let one agent review itself.

⑥ Memory / state file Persist

Persistent state that lives outside a single conversation — a markdown file or a board. The moment the context window clears, the agent remembers nothing; to make today pick up from yesterday, memory has to land on disk. The agent forgets, the repo doesn't. Memory isn't context: context is what this round sees and is gone on refresh; memory persists across rounds and across days.

What a worktree is

Like several cooks doing prep at once, each handed their own cutting board so they chop without bumping hands; share one board and they collide. A worktree is handing each parallel agent its own cutting board.
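In git terms, handing out a cutting board is one `git worktree add` per task. The sketch below builds that command from Python; the sibling `<repo>-worktrees/` directory layout and the function names are assumptions made for illustration, not a convention from the source.

```python
import subprocess
from pathlib import Path


def worktree_command(repo: Path, branch: str) -> list[str]:
    """Build the `git worktree add` command for one task branch."""
    repo = repo.resolve()  # absolute paths, so `git -C` does not reinterpret the target
    # Assumed layout: worktrees live next to the repo, one directory per branch.
    target = repo.parent / f"{repo.name}-worktrees" / branch.replace("/", "-")
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(target)]


def create_worktree(repo: Path, branch: str) -> Path:
    """Create the worktree and return its directory."""
    cmd = worktree_command(repo, branch)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if git refuses
    return Path(cmd[-1])


# One board per cook: each finding gets its own branch and directory.
for branch in ("fix/auth-test", "fix/null-deref"):
    print(" ".join(worktree_command(Path("app"), branch)))
```

Only `create_worktree` touches the repository; the loop above just prints the commands, so it is safe to run anywhere.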

With all six parts in place, the loop has a skeleton: automation makes it move, worktrees keep it from fighting itself, skills keep it from repeating labor, connectors let it see outside, sub-agents let it self-correct, memory lets it remember. But a skeleton is just the start — with the same six parts, two people can build completely opposite things, which is the subject of the final section.

5 · Core · Verification

The Hardest Step: Why You Can't Let AI Grade Itself

The hardest part of a loop was never getting the agent running — it's putting something inside it that can say "no," and the agent that wrote the code is precisely the one least likely to say it.

Core innovation · generator / evaluator separation

① Why it inevitably praises itself. Ask an agent to grade what it just produced and it tends to praise it confidently, even when a human sees at a glance the quality is mediocre — Anthropic engineer Prithvi Rajasekaran observed this while building long-running applications. It's not an intelligence problem: the context in which it wrote the code is already packed with reasons for "why it was done this way," so when the agent looks at its own output it sees not the result but the chain of self-persuasion that led it here. Inside a loop the flaw is amplified: every "is this good enough?" is decided by the agent that just finished writing, nodding at itself round after round, drifting further from real quality the longer it runs.

② The fix isn't changing the author, it's swapping the judge. Tuning the generator to be more self-critical is a dead end — you can't ask an author to step outside their own perspective. What you can do is bring in a separate agent that reads the code from scratch, carries a completely different set of instructions, and bears no self-persuasion. Tuning an independent skeptic is far easier than turning the author self-critical. The inspiration is GANs (generative adversarial networks): one network creates, one network nitpicks — ported to code, one writes and one reviews.

③ Reading the code isn't enough — it has to act. If the evaluator only reads the code, it judges "does this look right," not "does this run right." On frontend tasks, Rajasekaran wired the evaluator to Playwright MCP so it actually opens the page, clicks buttons, screenshots, and inspects the DOM like QA. The basis for judgment shifts from "this JSX looks fine" to "I clicked the button, the page navigated, here's the screenshot." Swap the underlying model too while you're at it — the same model with new instructions often carries the same blind spots. A calibration the community often uses: have the evaluator assume by default that the code is broken until proven otherwise — the default stance should be suspicion, not trust.

④ maker-checker: who strikes the final blow. Claude Code turns this structure into a primitive with /goal: give it a condition and let it run until that condition is met. The key is that at the end of each round, a fresh, small, fast model judges whether the condition is met — not the model that's been doing the work. This is the maker-checker principle banks have used for decades.

generator / evaluator separation

At a bank, the person who enters a large transfer and the person who reviews it must be two different people — a rule the industry has followed for decades, called maker-checker. Ported to a loop: the agent that writes the code and the model that rules "pass or fail" must be two, and the latter starts from scratch, assuming by default that you got it wrong.

Generator · writes code

Its context is full of reasons for "why it was written this way" — it sees the chain that persuaded itself

↓ draft

Evaluator · separate agent

Different model · assumes code is broken · uses Playwright MCP to actually click the page, screenshot, inspect the DOM — judges behavior, not intent

↓

REJECT + reasons, one by one ↩
back to generator to rewrite

PASS ✓
on to the next gate

↓

Fresh small model judges the stop condition

Not the model doing the work · this strike is maker-checker

Spelled out as config, the evaluator looks like this — every line enforces the four points above: switch the role, assume broken, verify by acting, give reasons one by one.

.claude/agents/reviewer.md — adversarial review agent

ROLE: Adversarial code reviewer.
ASSUME: this code is BROKEN until proven otherwise.
DO NOT praise. Find what fails.

CHECK, in order:
  1. Does it run? (execute, don't read)
  2. Tests: run them, paste real output.
  3. Edge cases the author skipped.
  4. Does behavior match the ticket?

USE Playwright MCP: open the page, click,
    screenshot, inspect the DOM. Judge behavior, not intent.

VERDICT: PASS only if every check holds.
         Otherwise REJECT + list each reason.

Stop condition — judged by a fresh small model

/goal all tests in test/auth pass and the lint step is clean
# note: /goal runs until the condition is met; /loop re-runs on a fixed interval

A loop's floor is set by its evaluator. The generator's caliber decides what the loop can produce; the evaluator's caliber decides what it won't produce. Structurally separating generation from judgment, tuning the evaluator into a skeptic, having it verify by acting, and handing the final blow to a fresh model — these four together are the whole of giving a loop the power to say "no."
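The control flow behind those four points fits in a few lines. This is a minimal sketch under stated assumptions: `generate` and `evaluate` are placeholders for two separately configured agents, and `max_rounds` is an illustrative cap; none of these names come from Claude Code or from Rajasekaran's work.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool
    reasons: list[str]


def maker_checker(
    task: str,
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str, str], Verdict],
    max_rounds: int = 3,
) -> tuple[Optional[str], list[Verdict]]:
    """Alternate a generator and a separate evaluator until PASS or the cap.

    Returns (accepted draft or None, verdict history).
    """
    feedback: list[str] = []
    history: list[Verdict] = []
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        # The evaluator sees only the task and the draft, never the generator's reasoning.
        verdict = evaluate(task, draft)
        history.append(verdict)
        if verdict.passed:
            return draft, history
        feedback = verdict.reasons  # REJECT: the reasons go back to the generator
    return None, history  # cap reached: hand the case to a human


# Stub agents: the first draft is rejected, the revised one passes.
def gen(task: str, feedback: list[str]) -> str:
    return "v2" if feedback else "v1"


def ev(task: str, draft: str) -> Verdict:
    return Verdict(draft == "v2", [] if draft == "v2" else ["button does nothing"])


draft, history = maker_checker("fix login", gen, ev)
print(draft, [v.passed for v in history])
```

The design choice worth noticing is the evaluator's signature: it receives the task and the draft, and nothing else. Passing it the generator's transcript would hand it the same chain of self-persuasion it is meant to be free of.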

6 · Case Studies

One Person's Morning, and Stripe's 1,300 PRs a Week

Two real loops, worlds apart in scale, the same skeleton: a trigger to start, a set of constraints locking it on the rails, and a human checkpoint sitting at the end. "Runs while you sleep" was never about how strong the model is — it's about how stable this skeleton is.

Case A · one person, one machine

Osmani's morning triage

An automation starts itself each morning
The triage skill reads yesterday's failed CI, open issues, recent commits
It calls a skill, not a blob of instructions pasted into the schedule
Each finding gets a worktree: one sub-agent drafts, another reviews against the skill and tests
Connectors open PRs and update tickets automatically
What it can't solve goes to an inbox for a human; the state file lets tomorrow continue from yesterday

Case B · enterprise scale

Stripe's Minions

Merges 1,300+ PRs a week, not one line hand-written
The trigger is light: @ the bot in Slack, or add an emoji reaction
The real work happens before the model wakes: a deterministic orchestrator assembles the context first
The sandbox is a Devbox on EC2 — "cattle, not pets," swapped on demand, thousands of agents running at once without colliding
Those 1,300 PRs are still reviewed by humans: the human didn't leave, just moved from the coding desk to the review desk

The most counterintuitive thing about Stripe: Minions isn't built on a stronger model. It's a fork of the open-source tool Goose. Its core claim is that reliability comes from the quality of constraints, not the size of the model. Architecturally, deterministic gates and creative LLM steps interlock, as the flow below shows; anything solvable by deterministic logic is never handed to a probabilistic model, and where that line is drawn decides whether the loop is reliable. (This case was disclosed by Stripe engineer Steve Kaliski on the How I AI podcast.)

Legend: deterministic step (non-LLM, rules hard-coded) · LLM step (creative)

Human trigger

@bot in Slack, or add an emoji reaction

↓

Deterministic orchestrator

Scans links, pulls Jira, locates relevant code with Sourcegraph + MCP — doesn't let the LLM find context itself

↓

LLM

Materials ready, writes code

↓

Hard-coded gate

The linter must run; the agent can't get around it

↓

LLM

Fixes lint

↓

Hard-coded step

git commit

↓

Human review

1,300 PRs a week, still human-reviewed
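The interlocking of hard-coded gates and LLM steps can be sketched as ordinary control flow. This is not Stripe's code: `write_code` and `fix_lint` are hypothetical stand-ins for LLM calls, and the linter is whatever command you pass in. The point it illustrates is that the gate is a subprocess exit code the agent has no way to argue with.

```python
import subprocess
import sys
from typing import Callable


def lint_gate(command: list[str]) -> tuple[bool, str]:
    """Deterministic gate: run the linter; exit code 0 means clean."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def gated_pipeline(
    write_code: Callable[[], None],
    fix_lint: Callable[[str], None],
    lint_command: list[str],
    max_fix_attempts: int = 2,
) -> bool:
    """Return True only if the linter passed; the caller commits on True."""
    write_code()  # LLM step
    for attempt in range(max_fix_attempts + 1):
        clean, report = lint_gate(lint_command)  # hard-coded gate, always runs
        if clean:
            return True
        if attempt < max_fix_attempts:
            fix_lint(report)  # LLM step, fed the linter's real output
    return False  # still failing: stop and escalate instead of committing


# Stand-in "linter" that always exits 0, so the gate passes on the first try.
ok = gated_pipeline(
    write_code=lambda: None,
    fix_lint=lambda report: None,
    lint_command=[sys.executable, "-c", "raise SystemExit(0)"],
)
print(ok)
```

Because the commit step sits behind the return value, there is no prompt the model can write that skips the linter.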

What "Runs While You Sleep" Actually Relies On

Local /loop and desktop scheduled tasks both need the machine on; turn it off and they stop. To run while the machine is off, use the scheduled triggers of Cloud Routines or GitHub Actions. Three approaches, each covering a different situation:

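| Scheduling method      | Runs where    | Machine must be on? | Session must stay open? | Min interval | Local file access? |
|------------------------|---------------|---------------------|-------------------------|--------------|--------------------|
| Local /loop            | Local machine | Yes                 | Yes                     | 1 min        | Yes                |
| Desktop scheduled task | Local machine | Yes                 | No                      | 1 min        | Yes                |
| Cloud Routines / CI    | Cloud         | No                  | No                      | 1 hr         | No                 |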

The point isn't picking which is better, but that these are different capabilities: local scheduling = "run a few more rounds while I'm here," cloud scheduling = "run while I'm away." Mistaking local re-runs for the whole of "unattended" is the most common misunderstanding — close the laptop lid and that loop you thought was autonomous quietly stops. Mature loops often use both: local for the close, hands-on checks, cloud for the overnight full sweep.

One aside: the commands are Claude Code's, the capability isn't tied to it

The commands in this piece are written the Claude Code way, but the capability isn't unique to it. Codex offers the same six capabilities under different names: scheduling, running until a condition is met, parallel isolation, sub-agents, external connectivity, and explicitly calling a skill. A connector written for one tool can often be moved to another as-is. Whatever toolchain you switch to, the question to ask is always "are all six present," not "which vendor is this command from."

7 · Hands On

Step by Step: Build Your First Loop

Stripe's pipeline is the finish line, not the start. Your first loop should be small enough to barely look like a system: a little thing that goes off on schedule to check something. The five steps below can be copied as-is, each with its command and config; the checklist and workflow that follow them cover the remaining essentials.

v2.1.72

The Claude Code version that introduced the /loop command

v2.1.139

The version that introduced /goal (run until the condition is met)

① Start a /loop

Available from v2.1.72, it re-runs the same task on an interval. It is scoped to the session and runs on your local machine: the recurring task expires after 7 days and stops whenever the machine is off. Three ways to use it:

/loop — three ways to use it

/loop 5m check the deploy   # fixed: runs every 5 minutes
/loop check the deploy      # the agent sets its own pace
/loop                       # reads the task in .claude/loop.md

② Use a skill for auto-discovery: triage first

Re-running one line isn't a loop. Give it a prompt to look at three things each morning and list what's worth acting on. "Scheduled + auto-discovery" is the entry point to a real loop. The discovery logic belongs in a skill, not in the schedule — instructions buried in a cron job rot when no one updates them, while a SKILL.md can be maintained and reused.

.claude/skills/morning-triage/SKILL.md

NAME: morning-triage
WHEN: invoked each morning by automation.
READ:
  - CI runs that failed since yesterday
  - issues opened in the last 24h
  - commits merged since the last run
JUDGE: for each item, is it worth acting on?
       Skip noise. Keep only actionable findings.
OUTPUT: write findings + status to
        ./state/triage.md (one row per finding).

③ Add a state file

Don't leave the results in the chat window. Write each finding and how far it's been handled into a markdown file (or a Linear board). The agent forgets, the repo doesn't.

./state/triage.md — this loop's memory

| finding         | source    | status   |
|-----------------|-----------|----------|
| auth test flaky | CI #4821  | fixing   |
| null deref      | issue 92  | PR open  |
| stale dep       | commit a3 | inbox    |
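Because the state file is a plain markdown table, the next round can read it back with a few lines of code. The parser below is a minimal sketch that assumes exactly the three-column layout shown above; `parse_triage_table` is an illustrative name, not part of any tool.

```python
def parse_triage_table(markdown: str) -> list[dict[str, str]]:
    """Turn a pipe-delimited markdown table into one dict per data row."""
    rows: list[dict[str, str]] = []
    header: list[str] = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything outside the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if not header:
            header = cells  # first table line names the columns
            continue
        if all(set(cell) <= set("-: ") for cell in cells):
            continue  # the |-----| separator row
        rows.append(dict(zip(header, cells)))
    return rows


TABLE = """\
| finding         | source    | status   |
|-----------------|-----------|----------|
| auth test flaky | CI #4821  | fixing   |
| null deref      | issue 92  | PR open  |
| stale dep       | commit a3 | inbox    |
"""

# What still needs the loop's attention tomorrow: everything without an open PR.
pending = [row["finding"] for row in parse_triage_table(TABLE) if row["status"] != "PR open"]
print(pending)
```

Keeping the memory in a format a human can also read is part of the point: the same file serves as the loop's state and as your audit trail.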

④ Add a /goal evaluator

The most critical step, and the easiest to skip. /goal (from v2.1.139) differs from /loop: /loop re-runs on a fixed interval, while /goal runs until a condition is met, with the stop condition judged by a fresh small model rather than the model doing the work. Pair it with the reviewer.md from the verification section (assume the code is broken, verify by acting, give reasons one by one) so it nitpicks every round.

/goal — give it a stop condition that can be judged objectively

/goal all tests in test/auth pass and the lint step is clean
# run until test/auth all pass AND lint is clean, judged by a fresh small model

⑤ Add --worktree for parallel isolation

Use --worktree (or -w) to give each background agent its own worktree — one worktree per task — so multiple agents editing code at once don't collide on files.

One separate worktree per finding

claude --worktree fix/auth-test  "draft the fix"
claude --worktree fix/null-deref "draft the fix"

The six-essential checklist for a first loop

Tick them off and see whether you've got them all. The first two decide whether it runs; the last four decide whether problems go unattended once it does. The most common beginner mistake is installing only the first two, leaving a loop that no one watches, no one can stop, and that's still nodding at itself.

Discovery source · decides if it runs

What does it read on schedule? CI / issues / commits / inbox

State file · decides if it runs

Which disk file holds the cross-round memory?

Independent verification · problems go unattended

Is there a separate agent that can say "no"?

Worktree isolation · problems go unattended

Does each parallel agent have its own directory?

Token cap · problems go unattended

Set a per-run budget and a daily cap? Who stops it?

Human review checkpoint · problems go unattended

Which step pauses for you to look, instead of running fully automatic end to end?

A minimal loop with all six essentials

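Turning the checklist into code: the snippet below is short enough to read in one breath, yet contains the working parts of a real loop, just shrunk down. The six numbered comments follow the loop from schedule to human review. The token cap is not shown and has to be set separately in your tool's budget settings.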

.github/workflows/triage.yml — complete minimal loop

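name: morning-triage

# 1. Schedule: a real trigger
on:
  schedule:
    - cron: '0 6 * * *'   # 06:00 UTC daily, runs in the cloud

jobs:
  triage:
    runs-on: ubuntu-latest
    timeout-minutes: 60   # wall-clock guard only, not a token budget
    steps:
      - uses: actions/checkout@v4
      # Assumes the claude CLI is installed and authenticated on the runner.
      # CLI flags are kept as given in the source; check them against your installed version.

      # 2. Discovery: call a skill, not a blob of instructions
      - run: claude --skill morning-triage

      # 3. Persistence: state to disk
      #    the skill writes ./state/triage.md and commits it back to the repo

      # 4. Handoff + verification: one worktree per finding, run until it clears the bar
      #    `parse` stands in for your own script that prints one finding slug per line
      - run: |
          for finding in $(parse ./state/triage.md); do
            claude --worktree "fix/$finding" \
              --goal "tests pass and lint is clean" \
              "draft a fix for $finding"
          done

      # 5. Verification: after each round a fresh model judges stop,
      #    plus a reviewer agent dedicated to nitpicking

      # 6. Human review: the door left open
      #    PRs are opened but not auto-merged; anything uncertain goes to ./inbox/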

Install all six and it's a real loop no matter how small; drop any one and you are back in one of the named failure modes from the Five Steps section, or running up one of the costs in the next section. A first loop is best kept small, but the "say no" check and the human review checkpoint must be fully installed.

8 · The Costs

A Loop Quietly Runs Up Four Tabs

A loop that runs itself is also a loop that errs by itself. The livelier it runs, the more silently it errs. Four costs quietly accumulate while it runs, and not one of them sets off an alarm on the spot.

Unverified output corrodes comprehension, lagging comprehension invites surrender, surrender lets the loop run longer and spend more, producing yet more unverified output — back to verification debt

A concrete scenario shows how the four trigger in sequence: overnight the loop opens 20 PRs, all tests green, a big win on the surface. But 3 of them hide small bugs the tests don't cover.

With no independent evaluator, those 3 sick ones get merged too → this is verification debt

↓

You merged 20 without reading them, and your mental map of the code is now 20 changes behind → this is comprehension rot

↓

The loop runs so smoothly that by the next morning you simply stop looking → this is cognitive surrender

↓

It split off sub-agents and retried all night, and the bill is triple the estimate → this is token blowup

Those 3 buried errors now sit in a codebase you no longer fully understand, "guarded" by a person who's stopped looking, until one of them surfaces in production. The four costs aren't a list of independent risks — they're four faces of the same failure, feeding each other, all coming due together.

Each has its own way to guard against it:

Verification debt

Wire in an independent evaluator, not the same agent doing the work, to nitpick in the gap between "it runs" and "it's correct"

Comprehension rot

Sample-read a few each day, forcing yourself to explain "what changed and why"; if you can't, your map has fallen behind

Cognitive surrender

Keep at least one checkpoint a human must press — the loop can execute, but not decide; you must at least retain the ability to say "this is wrong"

Token blowup

Before the first unattended run, set a per-run budget + daily cap + max retries, so an idle-spinning bug can't burn a whole night's quota
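The three limits named for token blowup can be enforced by a small guard that every agent call passes through. This is a sketch under stated assumptions: the class names and the idea of charging a token count per call are illustrative, and a real setup would take the counts from your provider's usage reporting.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when a run would cross one of the configured limits."""


@dataclass
class TokenBudget:
    per_run: int       # ceiling for a single run
    daily: int         # ceiling across all runs today
    max_retries: int   # retries allowed inside one run
    spent_today: int = 0

    def start_run(self) -> "RunMeter":
        return RunMeter(self)


class RunMeter:
    def __init__(self, budget: TokenBudget) -> None:
        self.budget = budget
        self.spent = 0
        self.retries = 0

    def charge(self, tokens: int) -> None:
        """Record a call's tokens, refusing if either cap would be crossed."""
        if self.spent + tokens > self.budget.per_run:
            raise BudgetExceeded("per-run budget exhausted")
        if self.budget.spent_today + tokens > self.budget.daily:
            raise BudgetExceeded("daily cap reached")
        self.spent += tokens
        self.budget.spent_today += tokens

    def retry(self) -> None:
        """Count a retry, refusing once the retry limit is used up."""
        if self.retries >= self.budget.max_retries:
            raise BudgetExceeded("max retries reached")
        self.retries += 1


budget = TokenBudget(per_run=100_000, daily=500_000, max_retries=3)
run = budget.start_run()
run.charge(40_000)
print(run.spent, budget.spent_today)
```

The guard checks before it spends, so a spinning retry bug stops at the limit instead of being noticed on the invoice.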

The four share one trait: while the loop runs, they're all silent. The most fascinating thing about loop engineering is letting one person do a team's work; the most dangerous thing is in the same place — a team argues with itself, while one person plus a pile of loops easily becomes an echo chamber where no one argues. The gatekeeper is always human judgment.

9 · Closing

You're Still the Engineer, Not Just the Person Who Presses Go

The same loop, built by two people, can end up in opposite places — and the difference isn't in the loop itself.

One person uses a loop to move faster on things they already understand cold: they read the code, they're sure of the direction, and the loop amplifies the judgment they already have. Another uses the same loop to never have to understand again; six months later the first has grown stronger, the second has become the gatekeeper of a machine they can't read.

A loop isn't the kind of tool whose quality is fixed by the tool — it's powerful enough to amplify exactly what you bring to it: bring understanding and it amplifies understanding; bring laziness and it amplifies laziness. It's a faithful multiplication sign, and what it multiplies is the person who builds it.

A loop makes generation nearly free: code, plans, PRs, fixes — almost for nothing. What stays scarce is judgment: knowing which plan is right, which line to stop at, which output runs fine but is wrong at the root. A loop can generate a hundred options but can't really choose — or rather, it chooses by "looks reasonable," not "actually correct," and the gap between those two is exactly why the engineer exists. So loop engineering didn't devalue judgment; it stripped away every task that needed none, leaving judgment as the only thing left. The engineer's value moves from "types fast, has the API memorized, willing to grind out boilerplate" to "knows which path is right, which line to stop at, which output runs fine but is wrong at the root."

"build the loop, but build it like someone who intends to stay the engineer, not just the person who presses go."
Addy Osmani, "Loop Engineering," June 2026

Source: this piece is an interpretation based on the conference-style write-up of HuaShu's "Orange Book" open guide, "Loop Engineering: Stop Asking Me What It Is" (v260615, June 2026). The framework and direct quotes are Addy Osmani's; the generator/evaluator finding is Anthropic's Prithvi Rajasekaran's; the Stripe Minions enterprise case (1,300+ PRs a week) was disclosed by Stripe engineer Steve Kaliski on the How I AI podcast. Command version numbers (/loop from v2.1.72, /goal from v2.1.139) and scheduling parameters follow each tool's official docs and may change across versions.