Self-Improving AI Coding Agents in 5 Minutes
Slash your AI coding agent's error rate and runtime with a self-improvement feedback loop—capture logs, critique, patch, repeat.
When AI Coding Agents Stumble
Have you watched your coding agent struggle with routine tasks? Two weeks ago I watched my remote agent (Claude Code on GitHub Actions) spend eight minutes starting up a webapp and screenshotting it via MCP, tying and untying itself in knots in the process. I wasn't surprised, though.
This was a new project and the AI agent was encountering a lot of confusion while working within it. It's important to point out that this isn't solely the AI agent's fault - it's the fault of the entire AI software development system, which includes the codebase, the tech stack, the CI/CD pipeline, the documentation, the AI agent doing the work, its rules, and the developer.
We can refer to this condition of chaotic agent outputs as Instability, borrowing a concept from Control Theory: given a bounded input, the output isn't guaranteed to stay bounded (the error can grow without limit). I'll share much more on that in future posts, but for now I'll share how I easily got the agent to stabilize itself, fully mitigating the confusion.
The Self-Improvement Mandate and Feedback Loop
Fortunately, the AI Coding System can work on itself the same way it works on its target! And it's really easy to make it happen.
Here are the steps:
1. Generate a run – run the agent on a task. I recommend starting with a basic benchmark; see the benchmark section below for more details.
2. Capture logs – save the chat plus shell/tool calls from each run.
3. Ask the agent to identify confusion – tell it: “Review the logs and identify any agent confusion or failures, evidenced by:
   - Requiring multiple attempts to complete a task
   - Trying multiple methods to complete a task
   - Errors and failures
   - Timeouts and slow response times”
4. Implement the fixes – have the model write a patch or open a PR for review (and make sure the next loop runs against the changes).
You can do this manually or programmatically - I used Claude Code locally with the GitHub CLI to loop through the steps above (kicking off runs on GitHub Actions and sleeping until they completed) until no confusion was present and the agent's execution times had stabilized.
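If you want a starting point for that outer loop, here is a rough sketch of the remote variant. The workflow file name (claude-benchmark.yml), the log file names, and the prompt wording are all assumptions to adapt; it requires the GitHub CLI (gh) and a workflow that supports workflow_dispatch.

```bash
#!/usr/bin/env bash
# Sketch of the programmatic loop: trigger the remote agent, capture its logs,
# then have a local agent critique the run and patch the repo before the next pass.
# Workflow name, file names, and prompts are assumptions; adjust for your repo.
set -euo pipefail

for i in 1 2 3 4 5; do
  gh workflow run claude-benchmark.yml        # kick off the remote agent run
  sleep 30                                    # give GitHub a moment to register the run
  run_id=$(gh run list --workflow=claude-benchmark.yml --limit 1 \
           --json databaseId --jq '.[0].databaseId')
  gh run watch "$run_id" || true              # sleep until the run completes
  gh run view "$run_id" --log > "agent_run_${i}.log"   # capture the agent's logs

  # Ask the local agent to review the log, root-cause any confusion, and fix it.
  claude --print "Review agent_run_${i}.log for signs of agent confusion \
(multiple attempts, multiple methods, errors, timeouts). Identify root causes \
and apply fixes to the repo."
done
```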
Here’s a prompt to pass to your local agent that will kick off the loop. For browser automation, I use Playwright MCP. If you use a different one, be sure to mention it in the prompt, and if you’re not working with web apps, choose a different task.

> Run the following shell command verbatim:
>
> claude --print --verbose --output-format json "Run the app and take a screenshot using Playwright MCP."
>
> After it's finished, review its agent output logs and look for any signs of agent confusion, such as:
>
> - Requiring multiple attempts to complete a task
> - Trying multiple methods to complete a task
> - Errors and failures
> - Timeouts and slow response times
>
> Identify root causes and apply fixes to them. Repeat this cycle, running the shell command, reviewing logs, root-causing confusion, and fixing issues, until there is no more confusion present in the logs.
This approach works for small tasks that finish in under two minutes, before Claude Code’s shell tool times out. For more complex tasks, you will need to run the loop differently to avoid timeouts - telling Claude Code to run a sub-agent won’t preserve the agent’s logs. Let me know if you need help!
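One way to run it differently is to drive the loop from a plain shell script, so that only the individual claude invocations are subject to tool limits. A minimal sketch, assuming the webapp-screenshot benchmark and file names of my own choosing:

```bash
#!/usr/bin/env bash
# The outer loop lives in plain bash rather than inside Claude Code's shell tool.
# Prompts and file names are assumptions; adapt them to your project.
set -euo pipefail

for i in 1 2 3 4 5; do
  # 1. Run the benchmark task and keep the full JSON transcript as the log.
  claude --print --verbose --output-format json \
    "Run the app and take a screenshot using Playwright MCP." > "bench_${i}.json"

  # 2. Separate invocation: review the log, root-cause confusion, apply fixes.
  claude --print "Review bench_${i}.json for agent confusion (multiple attempts, \
multiple methods, errors and failures, timeouts). Root-cause each issue and fix it."
done
```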
This chart shows the runtime of my remote agent coming down as it self-improves. This is 100% real and complete data (processed and plotted by Claude Code). Note that the execution times sometimes go up; that's often natural variance in the agent's performance, which we aim to minimize rather than eliminate (see the narrow variance at the end of the chart).
If you're running your agent locally, you should see much better final performance - Claude Code on GitHub Actions runs quite slowly (as do other remote agents like OpenAI's Codex and Google's Jules).
Examples of Defects and Fixes
Here are a few of the behaviors that contributed to the long execution times at the beginning of this project (or similar ones), and the fixes that the agent applied:
Missing Run/Log Command Specification: Caused the agent to start the app using inconsistent methods, leading to variance.
Fix: Make sure your agent has access to working commands (e.g. npm, make) for starting your service(s), stopping them, and checking their logs/status. Ensure they are documented in your agent's rules file. Keep them simple and few in number. Don't use Docker if you can help it.
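As a concrete illustration (the script names below are hypothetical), the command set you document might be as small as this:

```bash
# Hypothetical commands to document in the agent's rules file (e.g. CLAUDE.md).
# Keep the list this short; the agent should never have to guess how to run the app.
npm run dev:bg    # start the app in the background
npm run stop      # stop the app
npm run logs      # tail the most recent app logs
npm run status    # healthcheck; exits non-zero if the app isn't responding
```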
Foregrounded Long-Running Processes: Caused the agent to wait until commands timed out before proceeding.
Fix: Make sure your agent is running any long-running processes in the background. Consider using the `ENABLE_BACKGROUND_TASKS` and `FORCE_AUTO_BACKGROUND_TASKS` flags for Claude Code.
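For example, a hedged sketch of how those are commonly set (as environment variables; verify the exact mechanism against your Claude Code version's documentation):

```bash
# Assumption: the flags named above are set as environment variables before
# launching Claude Code. Check your Claude Code version's docs to confirm.
export ENABLE_BACKGROUND_TASKS=1
export FORCE_AUTO_BACKGROUND_TASKS=1
claude --print "Start the dev server, then confirm the healthcheck passes."
```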
MCP Tool-Name Drift: Caused the agent to issue failing tool calls and eventually hack together a scripting workaround to get the job done. This results from the model hallucinating complex strings (e.g. `mcp__playwright__browser_navigate`).
Fix: Explicitly specify key MCP tool names in the agent's rules file.
Over-Analysis of Simple Tasks: Caused the agent to fully explore the codebase to perform trivial tasks, due to overly verbose guidance in the rules file.
Fix: In the agent's rules file, don't require the agent to ultrathink or comprehensively analyze for trivial tasks.
Incomplete Environment Setup: Caused the agent to perform environment setup tasks under ambiguity, an interaction between the CI/CD pipeline, the local development environment, and the agent's understanding of the project.
Fix: Ensure the remote agent has access to all necessary environment variables and tools.
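A cheap guard is a preflight check in the remote runner that fails fast when something is missing, rather than letting the agent discover the gap mid-task. A sketch with made-up variable names:

```bash
# Preflight sketch for the remote runner (variable names are hypothetical).
# Failing here is cheaper than letting the agent improvise around a missing secret.
set -euo pipefail
for var in DATABASE_URL ANTHROPIC_API_KEY PLAYWRIGHT_BASE_URL; do
  if [ -z "${!var:-}" ]; then
    echo "Missing required environment variable: $var" >&2
    exit 1
  fi
done
echo "Environment looks complete."
```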
Benchmarks for an AI Coding System
Below are some of the benchmark tasks that I use to track my systems’ stability during initial development and over time. Execution times will vary based on your codebase.
Hello-world (remote agents only)
Exercises: remote-runner bootstrap, environment variables, basic CLI plumbing.
Pass target: agent prints “Hello, world!” in < 2 s.
Start, stop, restart, log
Command: Start the application, check the server logs/healthcheck to verify that it’s running. Stop it, and verify that it’s stopped. Start it again, and verify that it’s running.
Exercises: Run/stop/log-command wiring.
Webapp + screenshot
Command: Run the app locally and take a screenshot using <your choice of browser tools> MCP
Exercises: run-command wiring, MCP browser tools, initial page load.
Test-suite run
Command: (Start the app and) Run all of the tests.
Exercises: dependency caching, unit & end-to-end harness, reporting pipeline.
Quick refactor
Command: (Example) Rename <UserCard> to <ProfileCard>
Exercises: code-rewrite agility, import-graph updates, linter integration.
Tiny feature + tests
Command: (Example) Create a “Report a bug” button on the dashboard menu bar that opens a page that’s blank except for saying “Thank you for your feedback”
Exercises: full dev loop (code → tests → CI) with minimal scope.
I recommend running the full list as soon as your AI coding system can execute code locally, then scheduling it in CI to catch instability regressions.
Due to the non-deterministic nature of these systems, once you’ve driven the confusion from your system you’ll need to run each benchmark 5+ times to derive a stable average.
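Here is a sketch of a simple harness for that. The prompts are taken from the examples above; N, the file names, and plain wall-clock timing are my assumptions:

```bash
#!/usr/bin/env bash
# Run each benchmark prompt N times and report the average wall-clock runtime.
# Prompts are examples from the list above; N and file names are assumptions.
set -euo pipefail
N=5

benchmarks=(
  "Start the application, verify it's running, stop it, verify it's stopped, then start it again."
  "Run the app locally and take a screenshot using Playwright MCP."
  "Run all of the tests."
)

for prompt in "${benchmarks[@]}"; do
  total=0
  for i in $(seq 1 "$N"); do
    start=$(date +%s)
    claude --print --output-format json "$prompt" > "bench_run_${i}.json"
    end=$(date +%s)
    total=$(( total + end - start ))
  done
  echo "avg $(( total / N ))s over $N runs: $prompt"
done
```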
Take-aways
Choose a benchmark now and kick it off. Feed the log right back to it, using the instructions above.
Within 5 minutes you'll have a more stable AI Coding System.
I'd love to hear your before-and-after numbers, and what issues your agent discovered! Reply or DM, and hit subscribe so you don't miss the next tips or deep dives. I'm back to writing and will also be sharing some exciting tools I've been working on.