Self-Improving Agents: A Practical How-To Guide

What Is Autoresearch, Really?

You've built agents. They can search the web, write code, and maybe even deploy to production. But do your agents get better at their job the more they work? That's the gap autoresearch is designed to fill.

At the AI Engineer World's Fair this week, a new company called Introspection — founded by former xAI engineers Roland Gavrilescu and Julian Bright — introduced the concept in a way that actually makes sense for working developers. The core idea: stop thinking about agents as tools and start thinking about them as loops.

Autoresearch is the practice of building an "outer loop" where your agents monitor, evaluate, and gradually improve their own performance without a human having to tweak every knob. Think of it like CI/CD for agent behavior — except instead of running tests, the agent is running experiments on itself.

From Agent Harnesses to Agent Loops

The industry has gone through three distinct phases, and knowing where you are helps you figure out what to build next.

Models phase: Everyone focused on picking the right LLM — GPT-4o, Claude, Gemini — and prompt engineering. The model was the product.
Harnesses phase: We moved to tool-calling frameworks. LangChain, Vercel AI SDK, and the open-source Pi framework gave agents structured access to external tools. The harness was the product.
Loops phase (now): The agent itself becomes a self-improving system. The feedback loop — signals, evals, human input — is the product.

This shift matters because the difference between a demo agent and a production agent isn't the model; it's the feedback infrastructure around it. If you're still manually adjusting prompts and tweaking parameters when your agent makes mistakes, you're stuck in the models phase.

Building Your First Agent Recipe

Introspection's key contribution is something they call an agent recipe. Here's how to think about it in practical terms.

An agent recipe captures everything your system needs to improve over time:

Evals — How do you measure success? Define clear pass/fail criteria for each agent task.
Judges — A separate LLM call (or human review) that scores the agent's output. The judge becomes the quality gate.
Signal processing — What happens when the judge says "fail"? The failure is categorized, stored, and fed back into the loop.
Model router — Which model handles which subtask? A recipe tracks the cost-accuracy tradeoff and can swap models dynamically.
Human-in-the-loop hooks — Points where a human reviews borderline cases, building a training dataset over time.
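
One way to make the recipe concrete is a plain data structure. The sketch below is an illustration under stated assumptions, not Introspection's actual format: the names AgentRecipe, Eval, routes, and review_band are hypothetical, the model identifiers are placeholders, and the judge is a toy function standing in for an LLM call or human review.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Eval:
    """A named pass/fail criterion for one agent task."""
    name: str
    passes: Callable[[str], bool]


@dataclass
class AgentRecipe:
    """Hypothetical container for the five recipe components."""
    task: str
    evals: list[Eval]
    # Judge: scores an output from 0.0 to 1.0 (stand-in for an LLM call).
    judge: Callable[[str], float]
    # Model router: maps a subtask name to a placeholder model identifier.
    routes: dict[str, str]
    # Human-in-the-loop hook: scores inside this band go to a reviewer.
    review_band: tuple[float, float] = (0.4, 0.6)
    # Signal processing: failed outputs are stored here for later analysis.
    failures: list[dict] = field(default_factory=list)

    def evaluate(self, output: str) -> dict:
        failed = [e.name for e in self.evals if not e.passes(output)]
        score = self.judge(output)
        low, high = self.review_band
        result = {
            "failed_evals": failed,
            "score": score,
            "needs_human": low <= score <= high,
        }
        if failed:
            self.failures.append({"output": output, **result})
        return result


recipe = AgentRecipe(
    task="weekly status report",
    evals=[
        Eval("mentions_blockers", lambda text: "blockers" in text.lower()),
        Eval("non_empty", lambda text: bool(text.strip())),
    ],
    judge=lambda text: min(len(text) / 200, 1.0),  # toy length-based score
    routes={"draft": "small-model", "final_review": "frontier-model"},
)

result = recipe.evaluate("Shipped the login page. No blockers this week.")
print(result)
```

Keeping the recipe as data rather than code scattered across the agent is what would let it be versioned and moved between providers; how a real recipe format looks is not specified in the talk.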

Start small. Pick one agent task, say, writing a weekly status report. Wrap it with a judge that checks for completeness, accuracy, and tone. When the judge flags a problem, log the failure pattern. After 50 or so iterations, you should have a clear picture of where your agent struggles, and you can update the recipe accordingly.
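
The status report example can be sketched roughly as follows. Everything here is an assumption made for illustration: write_status_report is a stub standing in for the real agent call, the judge uses simple string rules where a production judge would likely be an LLM call, and the required section names are invented.

```python
from collections import Counter

REQUIRED_SECTIONS = ("done", "next", "blockers")  # assumed report structure


def write_status_report(notes: str) -> str:
    """Stub agent; a real version would call a model with the notes."""
    return f"Done: {notes}\nNext: continue work"


def judge(report: str, notes: str) -> list[str]:
    """Return failure categories; an empty list means the report passed."""
    failures = []
    lowered = report.lower()
    # Completeness: every required section header must be present.
    if any(f"{section}:" not in lowered for section in REQUIRED_SECTIONS):
        failures.append("completeness")
    # Accuracy (crude proxy): the source notes must appear in the report.
    if notes not in report:
        failures.append("accuracy")
    # Tone (crude proxy): no shouting.
    if "!!!" in report or report.isupper():
        failures.append("tone")
    return failures


failure_counts = Counter()


def run_with_judge(notes: str) -> tuple[str, list[str]]:
    """Run the agent, judge the output, and tally any failure categories."""
    report = write_status_report(notes)
    failures = judge(report, notes)
    failure_counts.update(failures)
    return report, failures


for i in range(50):
    run_with_judge(f"closed ticket {i}")

print(failure_counts.most_common())  # [('completeness', 50)]
```

In this toy run the stub agent never writes a blockers section, so the tally points straight at completeness as the thing to fix in the prompt or recipe.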

The Three Patterns in Practice

Gavrilescu outlined three production patterns during his session. Here's how to apply each one today:

The loop is the product. Stop shipping agents. Ship loops. Your users should experience an agent that gets noticeably better after every few uses. Design your architecture around eval-inference-eval cycles, not request-response pairs.
Recipes over tools. Agent tools are atomic — a calculator tool, a search tool, a code execution tool. Recipes are composite. They bundle tools, evals, and judges into a portable format that can move across providers (OpenAI → Anthropic → local models) without rewrites.
Optimize for better AND cheaper. Track cost-per-task alongside accuracy. A recipe should document not just what works, but what's most economical. Over time, distill frontier-model capabilities into smaller, cheaper models that handle routine cases while reserving expensive inference for edge cases.
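
The "better AND cheaper" pattern can be approximated with a small routing function. This is a minimal sketch under assumptions: the model names, costs, and accuracy figures are made up, and pick_model is a hypothetical helper, not part of any vendor SDK. In practice the accuracy column would come from your own judge results.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    cost_per_task: float  # assumed USD per task, illustrative numbers only
    accuracy: float       # fraction of tasks that passed the judge


def pick_model(stats: list[ModelStats], min_accuracy: float) -> ModelStats:
    """Pick the cheapest model that meets the accuracy bar.

    If no model meets the bar, fall back to the most accurate one.
    """
    eligible = [m for m in stats if m.accuracy >= min_accuracy]
    if not eligible:
        return max(stats, key=lambda m: m.accuracy)
    return min(eligible, key=lambda m: m.cost_per_task)


stats = [
    ModelStats("small-local", cost_per_task=0.002, accuracy=0.81),
    ModelStats("mid-tier", cost_per_task=0.010, accuracy=0.92),
    ModelStats("frontier", cost_per_task=0.060, accuracy=0.97),
]

routine = pick_model(stats, min_accuracy=0.90)    # routine cases
edge_case = pick_model(stats, min_accuracy=0.95)  # harder cases

print(routine.name, edge_case.name)  # mid-tier frontier
```

Recording both numbers per model in the recipe makes the tradeoff explicit, so a cheaper model can take over a subtask as soon as its measured accuracy clears the bar.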

Getting Started Today

You don't need Introspection's platform to start using these ideas. Here's a concrete workflow you can implement this afternoon:

Pick one agent workflow you already run in production.
Add a logging layer that captures every agent decision and its outcome.
Write a simple judge prompt that grades the output on 3 criteria (e.g., correctness, formatting, relevance).
Store failures in a structured format — JSON works fine — with the input, output, and judge score.
Every 100 iterations, analyze the failure patterns. Are they about prompt phrasing? Tool selection? Model hallucination?
Update your system prompt or recipe based on what you learned. Rinse and repeat.
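
The storage and analysis steps in this workflow can be sketched with nothing but the standard library. The file layout, field names, and failure categories below are assumptions, and the loop fakes judge results with a deterministic rule so the example runs without a model.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

ANALYZE_EVERY = 100  # review failure patterns every 100 iterations


def log_failure(path: Path, task_input: str, output: str,
                scores: dict, category: str) -> None:
    """Append one failure as a JSON line: input, output, judge scores."""
    record = {
        "input": task_input,
        "output": output,
        "scores": scores,
        "category": category,
    }
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def summarize(path: Path) -> Counter:
    """Count logged failures by category."""
    counts = Counter()
    with path.open(encoding="utf-8") as f:
        for line in f:
            counts[json.loads(line)["category"]] += 1
    return counts


log_path = Path(tempfile.mkdtemp()) / "failures.jsonl"
summary = Counter()

for i in range(1, 101):
    # Fake judge: a real one would grade correctness, formatting, relevance.
    if i % 10 == 0:
        category = "tool_selection"
    elif i % 4 == 0:
        category = "prompt_phrasing"
    else:
        category = None  # the run passed, nothing to log

    if category is not None:
        scores = {"correctness": 0, "formatting": 1, "relevance": 1}
        log_failure(log_path, f"task {i}", f"output {i}", scores, category)

    if i % ANALYZE_EVERY == 0:
        summary = summarize(log_path)

print(summary.most_common())
```

One JSON object per line keeps the log appendable and easy to load into whatever analysis tool you already use; the most common category in the summary is a reasonable first candidate for a prompt or recipe update.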

The companies you admire — Cursor, Cognition (Devin), and increasingly every serious AI-native startup — are already running these loops internally. They're not using a secret model. They're using better feedback infrastructure.

Autoresearch isn't magic. It's software engineering discipline applied to agent behavior. And the best time to start building your first feedback loop was yesterday. The second best time is right now.
