Stop tuning prompts. Loop them instead.
A single LLM call either nails the task or it doesn't, and you find out in production. Looping — generate, check, feed the failure back, repeat — turns that coin flip into something that converges. Here's the pattern, the numbers, and how I run the same loop inside Claude Code.
A single prompt either nails the task or it doesn't, and you usually find out in production. The fix that stuck for me was to stop betting on one call. Instead, loop — generate, check, feed the failure back, repeat — and that coin flip turns into something that converges. Here's the pattern, the stop conditions that keep it from running forever, and the numbers from wiring it onto a JSON-extraction job that was failing 1 in 6 calls.
Prompt tuning is whack-a-mole
The old reflex, when an output is wrong, is to fix the prompt. You add an example, the failure rate drops, a new input breaks it again. Prompt tuning is whack-a-mole because you're trying to make one forward pass perfect across inputs you haven't seen yet. The cheaper move is to accept that the first pass is sometimes wrong and add a loop that catches it.
Start with a check, not a prompt
The loop only works if you can verify the output without a human in the loop. That check is the whole design: JSON.parse plus a schema validation, a compiler, a test run, a regex, a 200 from the API you're generating a call for. No check, no loop — you're back to eyeballing outputs and hoping.
Feed the failure back as context
On a bad output, don't just retry — retry with the error. Same conversation, plus what went wrong:
async function loop(input: string, maxTries = 4) {
const messages = [{ role: "user", content: buildPrompt(input) }];
for (let attempt = 1; attempt <= maxTries; attempt++) {
const out = await model(messages);
const result = validate(out); // your check
if (result.ok) return result.value;
messages.push({ role: "assistant", content: out });
messages.push({
role: "user",
content: `That failed: ${result.error}. Fix it and return only the corrected output.`,
});
}
throw new Error("loop exhausted");
}
The error string is the trick. "Expected dueDate to be ISO 8601, got next Tuesday" tells the model exactly what to change — far more signal than a blind retry at higher temperature. A vague "that was wrong" gets you a vague second guess.
Cap it three ways
Loops that can't fail are how you get a surprise $400 bill, so bound every one of them three ways: a max attempt count, a token or cost budget summed across attempts, and a wall-clock timeout. Whichever trips first wins. Four attempts is my default — past that the model is usually stuck, and a fifth call just repeats the third. Break early, too: if the validator returns the same error two rounds running, bail instead of burning the rest. That one check has saved me more tokens than the attempt cap ever did.
The same loop, in Claude Code
You don't always need to write the harness. Most days I run this exact loop inside Claude Code, where the validator is just the test suite and the feedback is whatever the run printed:
/loop run the tests; if any fail, read the error, fix the
code, and run them again. stop when they're green or after 4 tries.
Claude writes a fix, runs the check, reads the failure, and feeds it back to itself — generate, check, repeat — until the suite passes or it hits the cap. Same three guards: "after 4 tries" is the attempt count, so a stubborn failure doesn't loop forever and drain the budget.
The numbers
On that extraction job, one-shot calls returned schema-valid JSON 83% of the time — 1 in 6 failed and silently corrupted downstream rows. With a four-attempt loop, 99.2% came back valid. 94% of those passed on attempt 1 or 2, so the loop cost was paid almost entirely by the hard inputs, and mean tokens per call rose 11%, not 4×, because most never looped. The remaining 0.8% now throw a clean, logged error instead of poisoning the database.
When to reach for it
Loop when you can check the output cheaply and the failure is recoverable with more context: extraction, code generation, format conformance. Don't loop when there's no automatic verifier, or when a wrong answer is unrecoverable — a loop just launders a guess into a confident one.
Two things matter more than the model you pick. A loop is only as good as the error message you feed back, so spend your effort there, not on the opening prompt. And always bound it — attempts, cost, and time. A loop without a ceiling is an outage waiting for the right input.