What Exactly Is Loop Engineering, the Idea Blowing Up Right Now
- Loop Engineering was independently hit upon and named by Addy Osmani, Boris Cherny, and Peter Steinberger in the same week of June 2026: stop prompting AI by hand, design a system that prompts AI automatically — you shift from "operating the AI" to "designing the system that drives it"
- A loop has five steps: discover the task, hand it off, verify independently, persist state, schedule automatically — drop any one and you get a named failure mode (the nodding loop / the amnesiac loop / the manual loop / the blind loop / the tangled loop)
- The most critical and most-often-skipped step is verification: let AI grade its own output and it praises itself — you need a separate agent as the nitpicker, defaulting to "the code is broken," actually clicking through the page and screenshotting rather than just reading the code
- Stripe's Minions pipeline merges 1,300+ machine-written PRs a week — reliability comes from deterministic constraints (a linter that must run, that the agent can't bypass), not a bigger model
- A loop quietly accrues four costs: verification debt, comprehension rot, cognitive surrender, and a runaway token bill — they reinforce each other and blow up together, and the gatekeeper is always human judgment
In One Week, Three People Hit the Same Thing
In one week of June 2026, Google Chrome engineer Addy Osmani, Anthropic's Claude Code lead Boris Cherny, and OpenClaw author Peter Steinberger — none of them comparing notes — hit the same thing: they'd stopped prompting AI by hand and were designing "systems that prompt AI automatically."
Why It Surfaced This Exact Week
Three people who never coordinated reached for the same word in the same week — not coincidence, but the tools around them quietly crossing a threshold. Three conditions ripened at once: coding agents got reliable enough to finish a non-trivial task unattended; scheduling primitives just landed in mainstream tools; the cost of a single run dropped low enough that running it over and over wasn't wasteful. With all the parts on the table, "put them together" became obvious to everyone at the same moment.
The name lagged the practice by months: before anyone called it Loop Engineering, people were already writing loops — just as, before "generator/evaluator separation" had a name, teams were already pairing a code-writing agent with a code-reviewing one. Worth remembering as a rule: the next new term won't come from a model release, it'll come from the moment some capability gets cheap enough to turn a once-unthinkable combination into an everyday move.
It Sits on Top of These Four Layers
These "___ engineering" terms don't replace one another; they stack, each managing something one size bigger than the one below: from a single sentence, to a context window, to a single run, to a loop that turns on its own. Each layer governs a different scope, and the higher the layer, the bigger the blast when it fails. From the top down:
- Loop · Loop Engineering: the top layer
- Harness · Gear for One Run: arm one run
- Context · Context Engineering: the window now
- Prompt · Prompt Engineering: the words you write
Take the same bug — the agent misreads a function's return value — across the four layers: the higher up, the later it's caught and the more it costs. At the loop layer, that misread gets written into the state file, read back the next day as fact, built up layer by layer; by the time anyone looks, the wrong assumption has become a load-bearing wall.
This is the one intuition to keep from loop engineering: the cost of an error equals the number of rounds it survives before someone catches it — and a loop is, by construction, a machine for maximizing that number. Everything that follows — the evaluator, human checkpoints, budget caps — exists for one purpose only: to shrink the distance between "made a mistake" and "caught it."
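To make that intuition concrete, here is a toy model in Python rather than a measurement. It assumes a check runs at the end of every `check_every`-th round and always catches the mistake; the function name and the model itself are assumptions made for illustration.

```python
def rounds_survived(introduced_at: int, check_every: int) -> int:
    """Rounds a mistake lives before the next check catches it.

    Toy model (an assumption, not from the article): a check runs at the end
    of every `check_every`-th round and always catches the mistake.
    Rounds are numbered from 1.
    """
    if introduced_at < 1 or check_every < 1:
        raise ValueError("round number and check interval must be >= 1")
    # First checked round at or after the round where the mistake appeared
    # (ceiling division, then back to a round number).
    caught_at = -(-introduced_at // check_every) * check_every
    return caught_at - introduced_at + 1
```

Under this model, a mistake made in round 3 survives 1 round when every round is checked, and 8 rounds when only every tenth round is checked. The evaluator and the human checkpoint are both ways of lowering `check_every`.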
Five Steps to a Lap — Drop Any One and It Breaks in a Set Way
Don't misread "loop" as spinning in place. Every round does something concrete: find work worth doing, hand it to an agent, verify it's right, save the state, then decide the next step. Drop any of the five and the loop either won't turn or just spins on the spot — and it breaks in a way that has a name.
Osmani's "morning triage loop," which he built for himself and which is covered in a later section, is a compact example that exercises all five steps.
These five aren't an à la carte list — they're an interdependent whole. Disciplined teams install all five; rushed teams install only the first two (discover + hand off), because those two produce visible output, while the last three produce "safety," which is the easiest thing to drop.
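One lap and its failure modes can be sketched in a few lines of Python. This is a minimal sketch, not anyone's actual implementation: the five callables are hypothetical stand-ins for real agents and tools, and the pairing of each failure name with a missing step is inferred from the names, since the text lists the five failure modes without spelling out the mapping.

```python
from typing import Callable, Iterable

# Which failure mode follows from dropping which step.
# Assumption: this pairing is inferred from the names, not stated in the text.
FAILURE_IF_MISSING = {
    "discover": "the blind loop",
    "hand_off": "the tangled loop",
    "verify": "the nodding loop",
    "persist": "the amnesiac loop",
    "schedule": "the manual loop",
}


def diagnose(installed: Iterable[str]) -> list[str]:
    """Name the failure modes implied by the steps that are missing."""
    have = set(installed)
    return [mode for step, mode in FAILURE_IF_MISSING.items() if step not in have]


def run_round(
    discover: Callable[[], list[str]],
    hand_off: Callable[[str], str],
    verify: Callable[[str, str], bool],
    persist: Callable[[list[tuple[str, str, bool]]], None],
    schedule_next: Callable[[], str],
) -> str:
    """One lap: discover, hand off, verify, persist, then schedule the next lap."""
    results = []
    for task in discover():
        draft = hand_off(task)
        results.append((task, draft, verify(task, draft)))
    persist(results)          # state goes to disk, not to the chat window
    return schedule_next()    # the loop decides when it turns again
```

A team that installs only the first two steps gets `diagnose(["discover", "hand_off"])`, which returns the nodding, amnesiac, and manual loops all at once.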
Building a Loop Means Assembling Six Parts
The five steps describe "what happens in a round"; the six parts describe "what you need in hand to make it turn." They map almost one to one: discovery via skills, handoff via worktrees, verification via sub-agents, persistence via memory, scheduling via automations. Connectors, the sixth part, serve both discovery and persistence by letting the loop see and act outside the repo. Each part and the step it serves:
① Automations: Schedule
② Worktrees: Hand off
③ Skills / SKILL.md: Discover
④ Connectors: Persist / Discover
⑤ Sub-agents: Verify
⑥ Memory / state file: Persist
Take worktrees: picture several cooks doing prep at once, each handed their own cutting board so they chop without bumping hands; share one board and they collide. A worktree hands each parallel agent its own cutting board.
With all six parts in place, the loop has a skeleton: automation makes it move, worktrees keep it from fighting itself, skills keep it from repeating labor, connectors let it see outside, sub-agents let it self-correct, memory lets it remember. But a skeleton is just the start — with the same six parts, two people can build completely opposite things, which is the subject of the final section.
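The cutting-board idea maps onto plain `git worktree`. Below is a minimal sketch that only builds the command for one task and does not run it. The `../worktrees` directory and the `fix/` branch prefix are assumptions chosen for illustration (the prefix mirrors the `fix/auth-test` naming used in the walkthrough later in this piece).

```python
import re


def worktree_command(task: str, root: str = "../worktrees") -> list[str]:
    """Build the `git worktree add` argv that gives one task its own checkout.

    Hypothetical helper: the directory layout and branch naming are
    illustrative choices, not a convention required by git.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", task.lower()).strip("-")
    if not slug:
        raise ValueError("task name has no usable characters")
    branch = f"fix/{slug}"
    path = f"{root}/{slug}"
    # `git worktree add -b <new-branch> <path>` creates the branch and checks
    # it out in its own directory, separate from the main working tree.
    return ["git", "worktree", "add", "-b", branch, path]
```

Passing the result to `subprocess.run(cmd, check=True)` from inside a repository would create the checkout. Two tasks get two directories, so two agents never write to the same files on disk.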
The Hardest Step: Why You Can't Let AI Grade Itself
The hardest part of a loop was never getting the agent running — it's putting something inside it that can say "no," and the agent that wrote the code is precisely the one least likely to say it.
① Why it inevitably praises itself. Ask an agent to grade what it just produced and it tends to praise it confidently, even when a human sees at a glance the quality is mediocre — Anthropic engineer Prithvi Rajasekaran observed this while building long-running applications. It's not an intelligence problem: the context in which it wrote the code is already packed with reasons for "why it was done this way," so when the agent looks at its own output it sees not the result but the chain of self-persuasion that led it here. Inside a loop the flaw is amplified: every "is this good enough?" is decided by the agent that just finished writing, nodding at itself round after round, drifting further from real quality the longer it runs.
② The fix isn't changing the author, it's swapping the judge. Tuning the generator to be more self-critical is a dead end — you can't ask an author to step outside their own perspective. What you can do is bring in a separate agent that reads the code from scratch, carries a completely different set of instructions, and bears no self-persuasion. Tuning an independent skeptic is far easier than turning the author self-critical. The inspiration is GANs (generative adversarial networks): one network creates, one network nitpicks — ported to code, one writes and one reviews.
③ Reading the code isn't enough — it has to act. If the evaluator only reads the code, it judges "does this look right," not "does this run right." On frontend tasks, Rajasekaran wired the evaluator to Playwright MCP so it actually opens the page, clicks buttons, screenshots, and inspects the DOM like QA. The basis for judgment shifts from "this JSX looks fine" to "I clicked the button, the page navigated, here's the screenshot." Swap the underlying model too while you're at it — the same model with new instructions often carries the same blind spots. A calibration the community often uses: have the evaluator assume by default that the code is broken until proven otherwise — the default stance should be suspicion, not trust.
④ maker-checker: who strikes the final blow. Claude Code turns this structure into a primitive with /goal: give it a condition and let it run until that condition is met. The key is that at the end of each round, a fresh, small, fast model judges whether the condition is met — not the model that's been doing the work. This is the maker-checker principle banks have used for decades.
At a bank, the person who enters a large transfer and the person who reviews it must be two different people — a rule the industry has followed for decades, called maker-checker. Ported to a loop: the agent that writes the code and the model that rules "pass or fail" must be two, and the latter starts from scratch, assuming by default that you got it wrong.
[Diagram: the evaluator's verdict routes the work. REJECT: back to the generator to rewrite. PASS: on to the next gate.]
Spelled out as config, the evaluator looks like this — every line enforces the four points above: switch the role, assume broken, verify by acting, give reasons one by one.

    ROLE: Adversarial code reviewer.
    ASSUME: this code is BROKEN until proven otherwise.
    DO NOT praise. Find what fails.
    CHECK, in order:
      1. Does it run? (execute, don't read)
      2. Tests: run them, paste real output.
      3. Edge cases the author skipped.
      4. Does behavior match the ticket?
    USE Playwright MCP: open the page, click, screenshot, inspect the DOM.
    Judge behavior, not intent.
    VERDICT: PASS only if every check holds. Otherwise REJECT + list each reason.

The stop condition goes to /goal, which runs until the condition is met (unlike /loop, which re-runs on a fixed interval):

    /goal all tests in test/auth pass and the lint step is clean

A loop's floor is set by its evaluator. The generator's caliber decides what the loop can produce; the evaluator's caliber decides what it won't produce. Structurally separating generation from judgment, tuning the evaluator into a skeptic, having it verify by acting, and handing the final blow to a fresh model — these four together are the whole of giving a loop the power to say "no."
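The structure of that separation can be sketched in Python. This is a sketch under stated assumptions, not how Claude Code or any vendor implements it: `generate` and `evaluate` are hypothetical callables standing in for two separately instructed agents, and `Verdict` is an illustrative type.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Verdict:
    passed: bool = False                      # default stance: broken until proven otherwise
    reasons: list[str] = field(default_factory=list)


def maker_checker(
    generate: Callable[[str, list[str]], str],
    evaluate: Callable[[str, str], Verdict],
    ticket: str,
    max_rounds: int = 5,
) -> tuple[Optional[str], list[Verdict]]:
    """Alternate maker and checker until the checker passes or rounds run out."""
    history: list[Verdict] = []
    feedback: list[str] = []
    for _ in range(max_rounds):
        draft = generate(ticket, feedback)
        verdict = evaluate(ticket, draft)     # a separate judge, never the generator itself
        history.append(verdict)
        if verdict.passed:
            return draft, history             # PASS: on to the next gate
        feedback = verdict.reasons            # REJECT: back to the generator, with reasons
    return None, history                      # never passed: hand it to a human
```

The design choice worth noticing is that the generator has no say in when the loop stops: only a passing `Verdict` from the evaluator ends it early, and running out of rounds returns `None` rather than the last draft.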
One Person's Morning, and Stripe's 1,300 PRs a Week
Two real loops, worlds apart in scale, the same skeleton: a trigger to start, a set of constraints locking it on the rails, and a human checkpoint sitting at the end. "Runs while you sleep" was never about how strong the model is — it's about how stable this skeleton is.
Osmani's morning triage
- An automation starts itself each morning
- The triage skill reads yesterday's failed CI, open issues, recent commits
- It calls a skill, not a blob of instructions pasted into the schedule
- Each finding gets a worktree: one sub-agent drafts, another reviews against the skill and tests
- Connectors open PRs and update tickets automatically
- What it can't solve goes to an inbox for a human; the state file lets tomorrow continue from yesterday
Stripe's Minions
- Merges 1,300+ PRs a week, not one line hand-written
- The trigger is light: @ the bot in Slack, or add an emoji reaction
- The real work happens before the model wakes: a deterministic orchestrator assembles the context first
- The sandbox is a Devbox on EC2 — "cattle, not pets," swapped on demand, thousands of agents running at once without colliding
- Those 1,300 PRs are still reviewed by humans: the human didn't leave, just moved from the coding desk to the review desk
The most counterintuitive thing about Stripe: Minions isn't built on a stronger model; it's a fork of the open-source tool Goose. Its core claim is that reliability comes from the quality of constraints, not the size of the model. Architecturally, deterministic gates and creative LLM steps interlock; anything solvable by deterministic logic is never handed to a probabilistic model, and where that line is drawn decides whether the loop is reliable. (This case was disclosed by Stripe engineer Steve Kaliski on the How I AI podcast.)
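What "deterministic gate" means in code can be shown with a small sketch. The gate names and checks below are hypothetical stand-ins chosen so the example runs on its own; they are not Stripe's actual pipeline, where the gates would be real linters and test runners.

```python
from typing import Callable

# A gate is ordinary code: same input, same answer, no model involved.
Gate = Callable[[str], bool]


def run_gates(patch: str, gates: list[tuple[str, Gate]]) -> tuple[bool, list[str]]:
    """Run every gate on the patch; return (all passed, names of failed gates)."""
    failed = [name for name, gate in gates if not gate(patch)]
    return (not failed, failed)


# Illustrative gates only (assumptions for this sketch).
EXAMPLE_GATES: list[tuple[str, Gate]] = [
    ("no_tabs", lambda patch: "\t" not in patch),
    ("not_empty", lambda patch: bool(patch.strip())),
    ("under_200_lines", lambda patch: patch.count("\n") < 200),
]
```

The orchestrator calls `run_gates` itself, after the agent has finished, so the agent cannot decide to skip a gate or talk its way past one.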
What "Runs While You Sleep" Actually Relies On
Local /loop and desktop scheduled tasks both need the machine on; turn it off and they stop. To run while the machine is off, the right answer is the scheduled triggers of Cloud Routines or GitHub Actions. Three approaches, each covering one stretch:
| Scheduling method | Runs where | Needs machine on? | Needs session open? | Minimum interval | Local file access? |
|---|---|---|---|---|---|
| Local /loop | Local machine | Yes | Yes | 1 min | Yes |
| Desktop scheduled task | Local machine | Yes | No | 1 min | Yes |
| Cloud Routines / CI | Cloud | No | No | 1 hr | No |
The point isn't picking which is better, but that these are different capabilities: local scheduling = "run a few more rounds while I'm here," cloud scheduling = "run while I'm away." Mistaking local re-runs for the whole of "unattended" is the most common misunderstanding — close the laptop lid and that loop you thought was autonomous quietly stops. Mature loops often use both: local for the close, hands-on checks, cloud for the overnight full sweep.
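The table above can be read as a small decision rule. The sketch below encodes the three rows exactly as given and filters them by what a task needs; the class and function names are hypothetical, and the values are copied from the table rather than checked against any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scheduler:
    name: str
    needs_machine_on: bool
    needs_session_open: bool
    min_interval_minutes: int
    local_files: bool


# Rows copied from the comparison table.
SCHEDULERS = [
    Scheduler("Local /loop", True, True, 1, True),
    Scheduler("Desktop scheduled task", True, False, 1, True),
    Scheduler("Cloud Routines / CI", False, False, 60, False),
]


def candidates(machine_may_be_off: bool, interval_minutes: int, needs_local_files: bool) -> list[str]:
    """Return the scheduling methods that can satisfy the given requirements."""
    return [
        s.name
        for s in SCHEDULERS
        if not (machine_may_be_off and s.needs_machine_on)
        and interval_minutes >= s.min_interval_minutes
        and (s.local_files or not needs_local_files)
    ]
```

An overnight sweep with the laptop closed leaves only the cloud option, while a five-minute check against local files leaves only the two local ones. A task that needs both a closed laptop and local files gets an empty list, which is the signal to split it in two.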
One aside: the commands are Claude Code's, the capability isn't tied to it
The commands in this piece are written the Claude Code way, but the capability isn't unique to it. Codex offers the same six organs under different names: scheduling, running until a condition is met, parallel isolation, sub-agents, external connectivity, and explicitly calling a skill — and a connector written for one tool can often be moved to another as-is. Whatever toolchain you switch to, the question to ask is always "are all six present," not "which vendor is this command from."
Step by Step: Build Your First Loop
Stripe's pipeline is the finish line, not the start. Your first loop should be small enough to barely look like a system: a little thing that goes off on schedule to check something. The five steps below can be copied as-is, each with its command and config, adding up to a minimal complete loop with all six essentials.
① Start a /loop
Available from Claude Code v2.1.72, /loop re-runs the same task on an interval. It is scoped to the session and runs on your local machine: the recurring task expires after 7 days and stops whenever the machine is off. Three ways to use it:

    # fixed: runs every 5 minutes
    /loop 5m check the deploy

    # the agent sets its own pace
    /loop check the deploy

    # reads the task in .claude/loop.md
    /loop

② Use a skill for auto-discovery: triage first
Re-running one line isn't a loop. Give it a prompt to look at three things each morning and list what's worth acting on. "Scheduled + auto-discovery" is the entry point to a real loop. The discovery logic belongs in a skill, not in the schedule — instructions buried in a cron job rot when no one updates them, while a SKILL.md can be maintained and reused.

    NAME: morning-triage
    WHEN: invoked each morning by automation.
    READ:
      - CI runs that failed since yesterday
      - issues opened in the last 24h
      - commits merged since the last run
    JUDGE: for each item, is it worth acting on?
           Skip noise. Keep only actionable findings.
    OUTPUT: write findings + status to ./state/triage.md
            (one row per finding).

③ Add a state file
Don't leave the results in the chat window. Write each finding and how far it's been handled into a markdown file (or a Linear board). The agent forgets, the repo doesn't.
| finding | source | status |
|---|---|---|
| auth test flaky | CI #4821 | fixing |
| null deref | issue 92 | PR open |
| stale dep | commit a3 | inbox |
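Because the state file is just a pipe table, a loop can read and rewrite it with a few lines of Python. This is a minimal sketch assuming the three-column layout shown above (`finding`, `source`, `status`) and cells that contain no `|` characters; the function names are hypothetical.

```python
def parse_state(markdown: str) -> list[dict[str, str]]:
    """Parse a pipe table into one dict per finding, keyed by column header."""
    rows = [line.strip() for line in markdown.splitlines() if line.strip().startswith("|")]
    if len(rows) < 2:
        return []
    header = [cell.strip() for cell in rows[0].strip("|").split("|")]
    findings = []
    for line in rows[2:]:  # rows[1] is the |---| separator row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        findings.append(dict(zip(header, cells)))
    return findings


def render_state(findings: list[dict[str, str]], columns=("finding", "source", "status")) -> str:
    """Write the findings back out in the same table layout."""
    lines = [
        "| " + " | ".join(columns) + " |",
        "|" + "|".join("---" for _ in columns) + "|",
    ]
    for finding in findings:
        lines.append("| " + " | ".join(finding[col] for col in columns) + " |")
    return "\n".join(lines) + "\n"
```

Tomorrow's run calls `parse_state` on the committed file, skips rows whose status is already `PR open`, updates the rest, and writes the result back with `render_state`.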
④ Add a /goal evaluator
The most critical step, and the easiest to skip. /goal (from v2.1.139) differs from /loop: /loop re-runs on a fixed interval, while /goal runs until a condition is met, with the stop condition judged by a fresh small model rather than the one doing the work. Pair it with the reviewer config from the evaluator section above, saved as reviewer.md (assume the code is broken, verify by acting, give reasons one by one), so it nitpicks every round.

    # run until test/auth all pass AND lint is clean, judged by a fresh small model
    /goal all tests in test/auth pass and the lint step is clean

⑤ Add --worktree for parallel isolation
Use --worktree (or -w) to give each background agent its own worktree — one worktree per task — so multiple agents editing code at once don't collide on files.

    claude --worktree fix/auth-test "draft the fix"
    claude --worktree fix/null-deref "draft the fix"

The six-essential checklist for a first loop
The six are a real schedule, skill-based discovery, persisted state, isolated handoff, independent verification, and a human review checkpoint. The first two decide whether it runs; the last four decide whether problems go unattended once it does. The most common beginner mistake is installing only the first two, leaving a loop that no one watches, no one can stop, and that's still nodding at itself.
A minimal loop with all six essentials
Turning the checklist into code: the snippet below is short enough to read in one breath, yet contains every organ a real loop needs, just shrunk down. The six comments map one to one to the six essentials.

    # 1. Schedule: a real trigger
    on:
      schedule:
        - cron: '0 6 * * *'   # 06:00 daily, runs in the cloud

    # 2. Discovery: call a skill, not a blob of instructions
    run: claude --skill morning-triage

    # 3. Persistence: state to disk
    #    the skill writes ./state/triage.md and commits it back to the repo

    # 4. Handoff + verification: one worktree per finding, run until it clears the bar
    for finding in $(parse ./state/triage.md); do
      claude --worktree "fix/$finding" \
        --goal "tests pass and lint is clean" \
        "draft a fix for $finding"
    done

    # 5. Verification: after each round a fresh model judges stop,
    #    plus a reviewer agent dedicated to nitpicking

    # 6. Human review: the door left open
    #    PRs are opened but not auto-merged; anything uncertain goes to ./inbox/

Install all six and it's a real loop no matter how small; drop any one and it's one of the named failure modes from the five-step section in disguise. A first loop is best kept small, but the "say no" check and the human review checkpoint must be fully installed.
A Loop Quietly Runs Up Four Tabs
A loop that runs itself is also a loop that errs by itself. The livelier it runs, the more silently it errs. Four costs quietly accumulate while it runs, and not one of them sets off an alarm on the spot.
A concrete scenario shows how the four trigger in sequence: overnight the loop opens 20 PRs, all tests green, a big win on the surface. But 3 of them hide small bugs the tests don't cover.
Those 3 buried errors now sit in a codebase you no longer fully understand, "guarded" by a person who's stopped looking, until one of them surfaces in production. The four costs aren't a list of independent risks — they're four faces of the same failure, feeding each other, all coming due together.
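Each has a guard already introduced above: an independent evaluator against verification debt, human checkpoints against comprehension rot and cognitive surrender, and budget caps against a runaway token bill.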
The four share one trait: while the loop runs, they're all silent. The most fascinating thing about loop engineering is letting one person do a team's work; the most dangerous thing is in the same place — a team argues with itself, while one person plus a pile of loops easily becomes an echo chamber where no one argues. The gatekeeper is always human judgment.
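Of the four tabs, the token bill is the one that a few lines of code can cap outright. Below is a minimal sketch of a hard budget cap; `TokenBudget` and `BudgetExceeded` are hypothetical names, and it assumes your tooling reports a token count for each call that you can pass in.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spending past the cap."""


class TokenBudget:
    """Hard cap on the tokens one loop run is allowed to spend."""

    def __init__(self, limit: int) -> None:
        if limit <= 0:
            raise ValueError("limit must be positive")
        self.limit = limit
        self.spent = 0

    def charge(self, tokens: int) -> int:
        """Record spending and return what is left; refuse to go over the cap."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.spent + tokens > self.limit:
            # Nothing is recorded: the caller should stop the loop and
            # leave the remaining work in the inbox for a human.
            raise BudgetExceeded(f"{self.spent + tokens} > {self.limit}")
        self.spent += tokens
        return self.limit - self.spent
```

The cap makes one silent cost loud: the loop stops with an exception instead of running up the bill overnight. The other three costs have no such mechanical guard, which is why they fall to the human reviewer.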
You're Still the Engineer, Not Just the Person Who Presses Go
The same loop, built by two people, can end up in opposite places — and the difference isn't in the loop itself.
One person uses a loop to move faster on things they already understand cold: they read the code, they're sure of the direction, and the loop amplifies the judgment they already have. Another uses the same loop to never have to understand again; six months later the first has grown stronger, the second has become the gatekeeper of a machine they can't read.
A loop isn't the kind of tool whose quality is fixed by the tool — it's powerful enough to amplify exactly what you bring to it: bring understanding and it amplifies understanding; bring laziness and it amplifies laziness. It's a faithful multiplication sign, and what it multiplies is the person who builds it.
A loop makes generation nearly free: code, plans, PRs, fixes — almost for nothing. What stays scarce is judgment: knowing which plan is right, which line to stop at, which output runs fine but is wrong at the root. A loop can generate a hundred options but can't really choose — or rather, it chooses by "looks reasonable," not "actually correct," and the gap between those two is exactly why the engineer exists. So loop engineering didn't devalue judgment; it stripped away every task that needed none, leaving judgment as the only thing left. The engineer's value moves from "types fast, has the API memorized, willing to grind out boilerplate" to "knows which path is right, which line to stop at, which output runs fine but is wrong at the root."
"build the loop, but build it like someone who intends to stay the engineer, not just the person who presses go."
Addy Osmani, "Loop Engineering," June 2026