
Feedback loop engineering

By Daniel Demmel, software engineer with 21 years of professional experience

A while back I listened to a Pragmatic Engineer podcast with Peter Steinberger about his AI-native development workflow – the one where he runs 5-10 coding agents in parallel and ships code he doesn't personally read. The headline numbers are easy to fixate on, but what stuck with me was the infrastructure underneath them. His workflow doesn't succeed because he writes better prompts, or even because he feeds the agents better context. It works because he's built what he calls "closed-loop systems" where the agents verify their own work – compiling, running, debugging, screenshotting, hitting real APIs with real keys.

I've been chewing on that pattern ever since, and I think it deserves its own name. Here's my current hierarchy:

The four levels

Prompt engineering is where most people start – and where most of the discourse lies. How do you phrase your request? What system prompt works best? Should I write "please" or "my job depends on it, make no mistakes"? It might matter, but the models are getting good enough to infer intention from vague prompts. Instead of crafting the perfect spec and trying to cover all possibilities, you can just have a conversation.

Context engineering is the next step up. This is about giving the model the right information: a well-crafted CLAUDE.md / AGENTS.md, relevant documentation, carefully selected files in the context window. I wrote about some of my practices here – things like using Plan mode and including the most relevant files upfront. Context engineering gets you much further than prompt engineering alone because the model stops guessing blind. It can also fill in the gaps much more easily from existing practice rather than assumptions.

But feedback loop engineering is what separates working code from getting lucky. It's the practice of building tools and infrastructure so that coding agents can verify their work in context. The goal isn't for them to stop once they've done a reasonable amount of work and feel it's "production ready", but to see hard evidence of how the code behaves in a setup that is as production-like as possible.

Harness engineering is the frame around all of it. Birgitta Böckeler defines the harness as everything in an agent except the model itself – "Agent = Model + Harness" – and splits it into guides that steer the agent before it acts (feedforward: your CLAUDE.md, types, linting) and sensors that let it observe the consequences after (feedback: the loops I'll get to below). By that definition my "feedback loop engineering" is really the sensor half of harness engineering, done on purpose.

So why give it its own rung rather than fold it in? Because the three answer different questions. Context engineering is about what goes in. Harness engineering is about governing the whole apparatus – which tools the agent can call, what it's allowed to touch, when it should stop. And the feedback loop is the connective tissue between them: the bit that decides whether the agent ever finds out it was wrong. It's also, I'd argue, where most developers get the best return on their time today. And for people using closed-source harnesses like Claude Code or Cursor, the term "harness engineering" can sound confusing when all they're doing is adding a few skills or hooks.

What this looks like in practice

The core idea is straightforward: you want the agent to be able to run its code, see the result, and iterate. Just like a developer would. The faster and tighter that loop, the better the output.

The foundational layer – strict linting, strong types, integration tests, git hooks – I covered previously. These are table stakes. They catch syntax errors and structural problems before the agent even commits. Adding types and validation libraries (like Zod or Pydantic) is the next useful layer, one that self-documents and enforces data contracts. But none of this can tell you whether the code actually does the right thing in a running system, or whether it fails gracefully.
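To make the "data contract" idea concrete, here is a minimal sketch using only the Python standard library. The OrderCreated payload and its field names are hypothetical, and a library like Pydantic would do the same job far more thoroughly. The point is only that a violated contract produces a specific message the agent can act on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderCreated:
    """Hypothetical event payload; the field names are illustrative only."""

    order_id: str
    quantity: int
    currency: str

    def __post_init__(self) -> None:
        # Fail loudly, with a message that says exactly what was wrong.
        if not isinstance(self.order_id, str) or not self.order_id:
            raise ValueError("order_id must be a non-empty string")
        if not isinstance(self.quantity, int) or self.quantity <= 0:
            raise ValueError(
                f"quantity must be a positive integer, got {self.quantity!r}"
            )
        if (
            not isinstance(self.currency, str)
            or len(self.currency) != 3
            or not self.currency.isupper()
        ):
            raise ValueError(
                f"currency must be a 3-letter upper-case code, got {self.currency!r}"
            )


def parse_order(payload: dict) -> OrderCreated:
    """Build the event from an untrusted dict, rejecting missing or extra keys."""
    expected = {"order_id", "quantity", "currency"}
    if set(payload) != expected:
        raise ValueError(f"expected keys {sorted(expected)}, got {sorted(payload)}")
    return OrderCreated(**payload)
```

An agent that passes a quantity of 0 gets "quantity must be a positive integer, got 0" back immediately, rather than a confusing failure three calls later.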

If that foundational layer is mostly feedforward – guides that keep the agent on the rails before it acts – the more interesting layer is the feedback that tells it whether the code actually worked: production-like observability.

  • Browser debugging via CLI – I use browser-debugger-cli which wraps Chrome DevTools into a CLI rather than an MCP. The agent can navigate to a page, inspect the DOM, check console errors, and even execute JavaScript snippets, so it can verify that a frontend change actually renders correctly – not just that the component compiles
  • Database query skills – knowing the schema and being pointed to a pre-authenticated CLI, the agent can run queries against a development database to verify that a migration ran correctly, that data is being written in the expected shape, or that a query optimisation actually improved things
  • Log access and crash tracebacks – when something goes wrong at runtime, the agent needs to see what happened. Tailing application logs, reading crash tracebacks, understanding the actual failure rather than guessing from the code
  • OpenTelemetry traces – in a microservices setup, a bug in one service might manifest as unexpected behaviour in another. Being able to pull OTel traces and correlate them across services means the agent can follow a request through the entire system, not just stare at the one file it changed
  • API keys to development services – the agent needs to be able to hit real (development) endpoints, not just mock them. I've never seen API documentation that didn't miss some quirks that only prodding the API and running full workflows would reveal.
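As an illustration of the database item above, here is a rough sketch of a check an agent could run after a migration. It uses an in-memory SQLite database so it runs anywhere; the users table, its columns and the verify_migration helper are all made up for the example, and a real setup would point at your development database through whatever CLI or driver you already use.

```python
import sqlite3


def column_names(conn: sqlite3.Connection, table: str) -> list[str]:
    """Return the column names of `table` as the database itself reports them."""
    # PRAGMA table_info yields one row per column; index 1 holds the name.
    # The table name is interpolated, so only pass trusted identifiers here.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [row[1] for row in rows]


def verify_migration(
    conn: sqlite3.Connection, table: str, expected_columns: list[str]
) -> dict:
    """Compare the live schema with what the migration was supposed to produce."""
    actual = column_names(conn, table)
    missing = [name for name in expected_columns if name not in actual]
    return {"table": table, "ok": not missing, "missing": missing, "columns": actual}


# Stand-in for a development database and a migration that adds one column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("ALTER TABLE users ADD COLUMN created_at TEXT")

report = verify_migration(conn, "users", ["id", "email", "created_at"])
```

The report is a plain dict, so it can be printed as JSON and read by the agent as evidence, instead of the agent assuming the migration worked because the command exited cleanly.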

The pattern goes beyond "does it compile?" to "does it actually work as part of the system?" Each of these tools closes a gap between what the agent wrote and what happens at runtime. And crucially, they're all exposed as CLI skills – pipeable, composable, text-in text-out – which brings us to why that matters.
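Following a request across services can be sketched with nothing more than structured logs. The example below assumes each service writes JSON log lines carrying trace_id, service, level and ts fields; those names and the sample records are assumptions for illustration, not an OpenTelemetry API. Real traces would come from your collector or tracing backend.

```python
import json
from collections import defaultdict


def group_by_trace(log_lines):
    """Group JSON log records by trace_id, ordered by timestamp within a trace."""
    traces = defaultdict(list)
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON noise in the log stream
        if isinstance(record, dict) and record.get("trace_id"):
            traces[record["trace_id"]].append(record)
    for records in traces.values():
        records.sort(key=lambda r: r.get("ts", 0))
    return dict(traces)


def failing_path(records):
    """Services a request passed through, up to and including the first error."""
    path = []
    for record in records:
        path.append(record.get("service", "unknown"))
        if record.get("level") == "error":
            break
    return path


# Illustrative records only; the field names are assumptions.
SAMPLE_LOGS = [
    '{"ts": 2, "trace_id": "t1", "service": "orders", "level": "info"}',
    '{"ts": 1, "trace_id": "t1", "service": "gateway", "level": "info"}',
    '{"ts": 3, "trace_id": "t1", "service": "billing", "level": "error"}',
    '{"ts": 1, "trace_id": "t2", "service": "gateway", "level": "info"}',
    "plain text line that is not JSON",
]

traces = group_by_trace(SAMPLE_LOGS)
```

Here failing_path(traces["t1"]) walks the request through gateway and orders before it fails in billing, which is the kind of view that stops an agent from staring at the one file it changed.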

Once the loops are solid, something shifts in how you work. You can run several agents in parallel without everything collapsing into chaos, because you're no longer reviewing every line – you're leaning on the systems that review the code for you. "Architecture over code review", as Peter puts it. It also makes experimentation cheap: you can deliberately under-prompt, give vague instructions to see what the agent comes up with, and trust the loops to flag a wrong turn before it compounds into something expensive.

Which raises the obvious question: what's the right interface between an agent and a feedback loop? Do we need to shovel another 70K tokens of MCP instructions into the context?

The Unix connection

I think CLI tools that are pipeable and progressively explorable are the optimal interface for current AI models, and Peter makes the same case. Not because CLIs are inherently superior to GUIs – they often aren't, for a lot of humans – but because text-in, text-out interfaces are what these models are most fluent in. In ML terms, they're in distribution: the kind of interface the model has seen most during training. And if some models later work better with a different interface, repackaging CLI-wrapped APIs won't be too difficult.

"Progressively explorable" means the agent can start broad and narrow down. git log --oneline gives an overview. Pick a commit, git show gives details. git diff narrows further. At each step, the agent gets feedback and decides where to dig deeper. The tool doesn't force a particular path – it responds to choices. The same holds for the CLI interface itself: the agent can run --help at each level of the command tree to discover what's available.
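A small sketch of what "--help at each command depth" can look like when you build your own tool, using Python's argparse. The devtool name and its db and logs subcommands are hypothetical; only the nesting pattern matters.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Nested CLI where every level answers --help with only its own options."""
    parser = argparse.ArgumentParser(
        prog="devtool",
        description="Hypothetical project CLI for agents and humans.",
    )
    groups = parser.add_subparsers(dest="group", metavar="GROUP")

    # Level 1: `devtool db --help` lists only the database commands.
    db = groups.add_parser("db", help="inspect the development database")
    db_commands = db.add_subparsers(dest="command", metavar="COMMAND")
    db_commands.add_parser("tables", help="list tables")
    schema = db_commands.add_parser("schema", help="print the schema of one table")
    schema.add_argument("table", help="name of the table to describe")

    # Level 1: `devtool logs --help` lists only the log commands.
    logs = groups.add_parser("logs", help="read application logs")
    log_commands = logs.add_subparsers(dest="command", metavar="COMMAND")
    tail = log_commands.add_parser("tail", help="show the most recent log lines")
    tail.add_argument("--lines", type=int, default=50, help="how many lines to show")

    return parser
```

Calling build_parser().parse_args() from the script's entry point gives three levels of help: devtool --help shows the groups, devtool db --help shows the database commands, and devtool db schema --help shows that command's arguments. The agent only pays context for the branch it actually explores.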

"Pipeable" means the output of one tool feeds into another. rg "TODO" --json | jq ... lets the agent process results programmatically. Composability isn't just elegant design – it's functional literacy and a great context self-management opportunity for an LLM.
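The context self-management part is worth a sketch. Below is a tiny filter that takes ripgrep-style JSON Lines and boils each match down to one short line, so the agent reads a few dozen tokens instead of the raw event stream. The field layout follows ripgrep's --json output as I understand it (a type field, with path, lines and line_number under data); check it against your version before relying on it.

```python
import json
import sys


def compact_matches(lines):
    """Yield 'path:line: text' for each match event in ripgrep-style JSON Lines."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        event = json.loads(raw)
        if event.get("type") != "match":
            continue  # skip begin, end and summary events
        data = event["data"]
        # Non-UTF-8 paths may not have a "text" key, hence the fallback.
        path = data.get("path", {}).get("text", "<unknown>")
        text = data.get("lines", {}).get("text", "").rstrip("\n")
        yield f"{path}:{data.get('line_number')}: {text}"


def main(stream=None) -> None:
    """Read JSON Lines from stdin (or the given stream) and print compact matches."""
    for line in compact_matches(stream if stream is not None else sys.stdin):
        print(line)
```

With main() wired up as the entry point of a script (the filename compact.py is made up here), the agent could run rg "TODO" --json | python compact.py and then pipe that into head or sort as needed.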

There's something delightfully ironic here. The Unix philosophy – small tools that do one thing well, connected through text streams – dates to the 1970s. It turns out to be near-perfectly suited to AI agents half a century later. Perhaps that's not a coincidence. Both Unix pipes and AI agents operate on text. Both benefit from composable, predictable interfaces. Both struggle with proprietary binary formats and modal GUIs.

For tool builders, the implication is clear: if you want your tool to work well with AI agents, make it work well as a CLI. Structured output. Clear error messages. Composable commands. Fast execution. These aren't new ideas – they're old ideas that have suddenly become much more valuable.
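Here is one way those properties might look in a single command: JSON on stdout for success, a JSON error with a hint on stderr for failure, and an exit code that distinguishes the two. The healthcheck command, its service names and its statuses are invented for the example.

```python
import json
import sys

# Stand-in data; a real tool would query the service itself.
KNOWN_SERVICES = {"api": "ok", "worker": "degraded"}


def healthcheck(argv, out=None, err=None) -> int:
    """Hypothetical `healthcheck <service>` command; returns the exit code."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    if len(argv) != 1:
        # Usage errors say what was expected instead of dumping a stack trace.
        print(json.dumps({"error": "usage", "hint": "healthcheck <service>"}), file=err)
        return 2

    service = argv[0]
    if service not in KNOWN_SERVICES:
        hint = "known services: " + ", ".join(sorted(KNOWN_SERVICES))
        print(
            json.dumps({"error": "unknown_service", "service": service, "hint": hint}),
            file=err,
        )
        return 1

    # Success goes to stdout only, so it can be piped straight into another tool.
    print(json.dumps({"service": service, "status": KNOWN_SERVICES[service]}), file=out)
    return 0
```

Keeping errors on stderr and results on stdout means a pipeline never has to parse an error message as data, and the hint gives the agent its next move without another round of guessing.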

The inner and outer loop

Almost everything so far is about the inner loop: the agent runs its code, reads the result, and feeds that straight back into its own context, all within a single session. Tighten that loop and the output gets better. But there's a slower, wider loop sitting on top of it, and it's where the compounding really happens.

The outer loop is what turns one session's hard-won lesson into something every future session starts with. An agent burns half an hour discovering that a particular API silently truncates payloads over a certain size, or that a migration needs a specific flag on this database version. In the inner loop that knowledge lives and dies inside the context window. In the outer loop it gets reflected on at the end of the session, distilled, and written back into the shared knowledge the next agent loads up front – a new skill, a note in CLAUDE.md, an entry in a team knowledge base.

Mozilla AI's cq is the cleanest take on this I've seen. It's an open standard for shared agent learning: agents store discoveries as structured "knowledge units" – undocumented API quirks, workarounds, fixes – and query the store before retrying a failure, so they stop rediscovering the same dead ends independently. A /cq:reflect command mines a finished session for the lessons worth keeping, ranks them by how generalisable they are, checks for duplicates, and proposes new units for you to approve. The store can live locally in SQLite or sync across a whole team. Their own framing names it exactly: separate the inner loop of solving the problem from the outer loop of consolidating what you learned.
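To show the shape of the idea, here is a toy store for session lessons backed by SQLite. To be clear, this is not cq's schema, API or ranking logic; the KnowledgeStore class, its table and its methods are my own stand-ins, covering only "store a lesson once" and "look it up before retrying".

```python
import sqlite3


class KnowledgeStore:
    """Toy store for session lessons; an illustration, not cq's actual design."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS units ("
            " topic TEXT NOT NULL,"
            " lesson TEXT NOT NULL,"
            " UNIQUE (topic, lesson))"
        )

    def add(self, topic: str, lesson: str) -> bool:
        """Store a lesson; return False when an identical one already exists."""
        cursor = self.conn.execute(
            "INSERT OR IGNORE INTO units (topic, lesson) VALUES (?, ?)",
            (topic.strip().lower(), lesson.strip()),
        )
        self.conn.commit()
        return cursor.rowcount == 1

    def lookup(self, keyword: str) -> list[str]:
        """Return lessons whose topic or text mentions the keyword."""
        pattern = f"%{keyword}%"
        rows = self.conn.execute(
            "SELECT lesson FROM units WHERE topic LIKE ? OR lesson LIKE ?"
            " ORDER BY rowid",
            (pattern, pattern),
        )
        return [row[0] for row in rows]


store = KnowledgeStore()
lesson = "Payloads over a certain size are silently truncated; check the length."
first = store.add("payments api", lesson)
second = store.add("payments api", lesson)  # exact duplicate, ignored
```

Here first is True and second is False, and store.lookup("truncated") returns the lesson. The duplicate check is deliberately naive (exact match only); deciding that two differently worded lessons are the same thing is the hard part, and it is where a human approval step earns its keep.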

The neat part is how the two loops join up. Today's distilled lesson becomes tomorrow's guide – feedforward, in harness terms – so the outer loop quietly improves the inner one over time. It's the same circle a good team already runs: someone debugs a gnarly issue, writes it up, and the next person doesn't have to start from scratch. We're just teaching the agents to do their own write-ups.

The best bit

What I like about this approach is that feedback loop engineering isn't just AI-specific overhead. Observability, structured logging, trace correlation, browser testing, database verification, clear interfaces, helpful error messages – these were good engineering practices long before anyone was prompting a language model. The difference is that wrapping them as CLI-accessible skills makes them available to agents too. The feedback loops that help an agent debug a cross-service issue are the same ones that help a developer on their first week understand how the system actually behaves.

So if you're wondering where to invest your time as AI coding tools improve: less time crafting the perfect prompt, more time building tools that let agents – and humans – see what's actually happening. The prompts and coding agent harnesses will change as models evolve. The value of a tight feedback loop won't. And if the AI bubble pops next month, you're left with easy-to-test codebases that you can pick up again yourself, or hand to local models for smaller tasks. If it doesn't, the inner and outer loops keep compounding, and your agents get a little less forgetful with every session.

And for Pete's sake, don't install OpenClaw connected to all your personal data and services...

Credits