Self-Evolving Agents: Three Frameworks That Let Your AI Improve Itself

You ship an AI agent. It works great on the demo. Then real users show up, edge cases pile up, and suddenly you're spending your weekends manually tweaking prompts and patching logic. Sound familiar?

Self-evolving agents are a new class of AI systems designed to fix this exact problem. Instead of relying on you to diagnose every failure and hand-craft every improvement, these agents have built-in loops that let them evaluate their own output, figure out what went wrong, and update themselves automatically.

In this post, we'll look at three open-source frameworks that each take a different approach to this idea: the OpenAI Self-Evolving Agents Cookbook, Karpathy's autoresearch, and EvoMap's Evolver. We'll break down how each one works, what makes them different, and how you might use these ideas in your own day-to-day development.

What Are Self-Evolving Agents?

Think of a traditional AI agent as a script you deploy and then babysit. When it produces bad output, a human needs to notice the problem, figure out the root cause, update the prompt or code, and redeploy. This works fine at small scale. It doesn't scale when you have hundreds of edge cases pouring in every week.

A self-evolving agent closes this loop. It runs, evaluates its own results against some scoring criteria, identifies where it fell short, and then modifies itself to do better next time. The "self" part is the key difference: the system doesn't just flag problems for you – it tries to fix them on its own.

If you've ever worked with CI/CD pipelines, the mental model is similar: run, test, get feedback, fix, and repeat. Except here, the "fix" step is also automated.

How Do They Work? The Core Loop

Despite their differences, all three frameworks share the same fundamental pattern. You can think of it as a four-step cycle that repeats until performance is good enough.

First, the agent does its job – it generates a summary, writes some code, or runs an experiment. Second, the output gets scored. This could be an LLM grading the output, a set of unit-test-like checks, or a simple metric like "did the validation loss go down?" Third, the feedback from that scoring step gets fed back into the system. This is where the magic happens: a separate process (sometimes another LLM, sometimes the same agent) looks at what went wrong and proposes a change. That change might be a rewritten prompt, a modified config file, or even edited source code. Fourth, the improved version replaces the old one, and the cycle starts over.

The key insight is that none of this requires retraining a model from scratch. These systems work on top of existing LLMs. They improve the instructions, the code, or the configuration – not the model weights themselves. That's what makes them practical for everyday use.
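The loop is compact enough to sketch in a few lines of Python. The function names here (`run_agent`, `score_output`, `propose_fix`) are placeholders for whatever your stack provides, not any framework's real API:

```python
def evolve(prompt, run_agent, score_output, propose_fix,
           threshold=0.8, max_rounds=5):
    """Generic self-evolution loop: run, score, patch, repeat."""
    for _ in range(max_rounds):
        output = run_agent(prompt)               # 1. do the job
        score, feedback = score_output(output)   # 2. grade the result
        if score >= threshold:
            break                                # good enough: stop
        prompt = propose_fix(prompt, feedback)   # 3. propose a change
        # 4. the improved prompt replaces the old one; loop again
    return prompt, output, score
```

All three frameworks are, at heart, elaborations of this skeleton: they differ in what `prompt` is (text, code, or structured assets) and in how `score_output` and `propose_fix` are implemented.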

The Three Frameworks

Let's look at each framework individually and then compare them.

1. OpenAI Self-Evolving Agents Cookbook: Prompt Optimization on Autopilot

This cookbook from OpenAI is the most production-oriented of the three. It addresses a scenario every developer has experienced: you've built an LLM-powered agent, it works reasonably well, but it keeps failing on certain types of inputs and you're stuck in a never-ending cycle of prompt tweaking.

The approach: set up graders (automated scoring functions) that evaluate your agent's output, then use a "meta-prompt agent" to rewrite your system prompt based on what the graders say went wrong. The graders can be simple Python functions (like "does the output contain the right keywords?"), similarity checks ("is the output close enough to the source material?"), or even another LLM acting as a judge ("rate this summary on a scale of 0 to 1").
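A grader is just a function that maps an output to a score between 0 and 1. A minimal sketch of the first two kinds might look like this (illustrative functions, not the cookbook's actual code):

```python
def keyword_grader(output: str, required: list[str]) -> float:
    """Fraction of required keywords that appear in the output."""
    if not required:
        return 1.0
    hits = sum(1 for kw in required if kw.lower() in output.lower())
    return hits / len(required)


def length_grader(output: str, max_words: int = 150) -> float:
    """1.0 when within budget, scaled down as the output grows."""
    words = len(output.split())
    return 1.0 if words <= max_words else max_words / words


def combined_score(output: str, required: list[str]) -> float:
    """Take the weakest dimension: every grader has to do well."""
    return min(keyword_grader(output, required), length_grader(output))
```

An LLM-as-judge grader fits the same signature: prompt a model for a 0-to-1 rating and parse the number out of its reply.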

Here's the loop in plain terms: your agent processes an input and produces output. The graders score that output. If the scores are below threshold, the system collects the feedback ("chemical names are missing from the summary," "the output is too long"), hands it to a meta-prompt agent, and that agent rewrites the system prompt. Then you run again with the new prompt and check if the scores improved. It keeps going until the scores pass or a maximum retry count is hit.

What evolves: the system prompt (the instructions given to the LLM). It also supports comparing different model versions (like GPT-5 vs GPT-5-mini) to find the best model-prompt combination. The cookbook includes a VersionedPrompt class that tracks every change, so you can always roll back if a new version performs worse.

Best for: teams already using OpenAI's API who want to automate prompt improvement for production agents. If you're building a customer-facing agent and need to systematically improve it over time with minimal manual effort, this is your starting point.

2. Karpathy's autoresearch: Let the Agent Rewrite Its Own Code

Andrej Karpathy's autoresearch takes the concept in a completely different direction. Instead of improving prompts, the agent improves actual source code – specifically, code that trains a small language model.

The setup is elegantly simple. There are essentially three files: prepare.py handles data preparation and stays untouched. train.py contains the full model definition and training loop – this is the file the agent is allowed to edit. And program.md is a Markdown file that serves as the agent's instruction manual, written by you, the human.

You point a coding agent (like Claude or Codex) at the repo and tell it to follow program.md. The agent reads the instructions, modifies train.py – maybe changing the model architecture, tweaking the learning rate, adjusting the batch size – and then kicks off a training run. Each run is limited to exactly 5 minutes of wall-clock time, which makes every experiment directly comparable regardless of what the agent changed. After training, the agent checks if the validation score improved. If it did, the change sticks. If it didn't, the agent reverts and tries something else.
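The accept/reject rule at the heart of this loop is easy to sketch. This is an illustration of the idea, not autoresearch's actual code; `history` stands in for the validation metric of every accepted run:

```python
def keep_or_revert(history: list[float], new_metric: float) -> str:
    """autoresearch-style rule: lower validation bits/byte wins."""
    best = min(history) if history else float("inf")
    if new_metric < best:
        history.append(new_metric)  # the change sticks
        return "keep"
    return "revert"                 # throw the change away, try again
```

In the repo, reverting amounts to discarding the edit to train.py, and the fixed 5-minute time box is what keeps every run's metric comparable.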

The brilliant part is the philosophy: you're not writing training code anymore – you're writing instructions for an agent that writes training code. It's meta-programming in the truest sense. You can expect around 12 experiments per hour, or roughly 100 overnight. You go to sleep, and you wake up to a log of everything the agent tried and (hopefully) a better model.

What evolves: the actual Python source code (train.py). Everything is on the table – model architecture, optimizer settings, hyperparameters, batch size, even the attention pattern. The metric is simple and clear: validation bits per byte. Lower is better.

Best for: developers and researchers who want to automate experimentation. While the current repo focuses on LLM training, the pattern generalizes to any scenario where you have a single file to optimize and a clear metric to chase – think hyperparameter tuning, algorithm optimization, or configuration search.

3. EvoMap Evolver: Governed Evolution with an Audit Trail

EvoMap's Evolver takes yet another approach. If the OpenAI cookbook is about improving prompts and autoresearch is about improving code, Evolver is about improving the agent's behavior through a formal, protocol-driven process – think of it like version control, but for agent evolution.

The central idea is the Genome Evolution Protocol (GEP). Instead of letting the agent freely rewrite prompts or code, every change is constrained by a set of structured assets: Genes are reusable improvement patterns (like "add input validation before edits"), Capsules bundle related Genes together for bigger changes, and Events log every evolution that happened, creating a complete audit trail.

When you run Evolver, it scans your agent's log and history files for errors and patterns. It then selects the most relevant Gene or Capsule from its library and generates a GEP-compliant evolution prompt – a structured set of instructions that tells the agent exactly how to improve, within strict guardrails. It doesn't directly edit your code. Instead, it produces a protocol-bound prompt and artifacts that guide the evolution safely.

Evolver also has several operational modes: you can run it in review mode for human-in-the-loop oversight, in a continuous loop for autonomous operation, or with strategy presets like "innovate" (maximize new features), "harden" (focus on stability), or "repair-only" (emergency fix mode). It even supports a worker pool where multiple Evolver nodes can participate in a shared evolution network.

What evolves: agent behavior through structured, auditable evolution assets (Genes and Capsules). Every change is logged as an Event. The system also evolves its own "personality state" over time – essentially its operating tendencies and preferences.

Best for: teams that need compliance and auditability. If you're in a regulated industry, or you simply need to explain to stakeholders exactly what changed and why, Evolver's protocol-driven approach gives you a paper trail that the other two frameworks don't.

How They Differ: A Side-by-Side Comparison

These three frameworks solve the same fundamental problem – how to make agents improve without constant human intervention – but they differ in almost every design decision.

The biggest difference is what they evolve. The OpenAI cookbook evolves prompts – the text instructions you feed to an LLM. Autoresearch evolves source code – the actual Python that defines model architecture and training logic. Evolver evolves structured behavior assets – named patterns (Genes) and bundles (Capsules) that get applied through a formal protocol.

How they score is also different. The OpenAI cookbook uses a rich set of graders – multiple scoring functions running in parallel, each checking a different quality dimension. Autoresearch uses a single, hard metric: validation bits per byte. Did the number go down? Keep the change. Did it go up? Revert. Evolver takes a signal-based approach, scanning logs for error patterns and using those signals to select which Gene to apply.

The human's role varies too. With the OpenAI cookbook, you define the graders and thresholds, then the system runs autonomously (though it also supports a human-in-the-loop variant using the OpenAI Evals platform). With autoresearch, your job is to write and iterate on program.md – the Markdown file that instructs the agent. With Evolver, you can choose your involvement level: fully autonomous loop, review mode where you approve each change, or strategy presets that steer the agent's priorities.

Finally, the safety models are quite different. The OpenAI cookbook relies on versioned prompts with rollback capability. Autoresearch uses git and the simple "keep or revert" approach – if validation got worse, throw it away. Evolver has the most elaborate safety model: a command whitelist that only allows specific safe commands, rejection of shell injection patterns, scoped execution, timeout limits, and a full audit trail of every evolution event.

Using These Ideas in Day-to-Day Development

You don't need to adopt any of these frameworks wholesale to benefit from the ideas behind them. Here are practical ways to bring self-evolving patterns into your daily work.

Start by adding graders to your existing agents. Even before you build any automation, simply writing down how you'd score your agent's output is incredibly valuable. Can you write a Python function that checks if the output meets your requirements? Can you define what "good enough" looks like as a number between 0 and 1? Once you have that, you've built the foundation for any self-evolving loop.

Version your prompts like you version your code. The OpenAI cookbook's VersionedPrompt pattern is dead simple to implement yourself: store each prompt revision with a timestamp and the eval scores that prompted the change. When something breaks after a prompt update, you can roll back instantly instead of trying to remember what the prompt used to say.
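A minimal sketch of that pattern – an illustration, not the cookbook's actual VersionedPrompt class – fits in a dozen lines:

```python
import time


class VersionedPrompt:
    """Prompt history with rollback; each entry keeps its eval score."""

    def __init__(self, initial: str):
        self.versions = [{"text": initial, "score": None, "ts": time.time()}]

    @property
    def current(self) -> str:
        return self.versions[-1]["text"]

    def update(self, text: str, score: float) -> None:
        self.versions.append({"text": text, "score": score, "ts": time.time()})

    def rollback(self) -> str:
        """Drop the latest revision (always keep the original)."""
        if len(self.versions) > 1:
            self.versions.pop()
        return self.current
```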

Try the autoresearch pattern for any "single file, single metric" problem. Have a config file that controls your application's behavior? A set of hyperparameters for a model? A build configuration you've been meaning to optimize? Point a coding agent at it with clear instructions and a measurable goal, and let it run experiments while you do other work.

Borrow Evolver's audit trail thinking for production agents. Even if you don't use Evolver itself, the idea of logging every change as a structured event – what changed, why, what the scores were before and after – will save you countless debugging hours. When your agent starts behaving differently in production, you'll want to know exactly which evolution caused it.
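The cheapest version of this is an append-only JSONL file; the field names below are one reasonable choice, not a standard schema:

```python
import json
import time


def log_evolution_event(path: str, change: str, reason: str,
                        score_before: float, score_after: float) -> None:
    """Append one structured evolution event to a JSONL audit log."""
    event = {
        "ts": time.time(),
        "change": change,
        "reason": reason,
        "score_before": score_before,
        "score_after": score_after,
    }
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
```

Because each line is standalone JSON, you can grep the log or load it into a dataframe the moment an agent starts misbehaving in production.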

Use strategy presets to match the evolution to the moment. Evolver's preset concept – innovate, harden, repair-only – is a useful mental framework even outside the tool. When you're early in development, let your agent explore freely. When you're close to production, switch to hardening mode. When something is on fire, constrain the agent to repairs only. This maps neatly onto how most development teams already think about release stages.

The Bottom Line

Self-evolving agents represent a shift in how we think about AI systems. Instead of deploying a static agent and manually maintaining it forever, you deploy a system that maintains itself – within boundaries you define.

The OpenAI cookbook gives you a production-ready pattern for prompt self-improvement with a rich evaluation framework. Karpathy's autoresearch shows the power of letting an agent modify actual code, with a beautifully simple "try it, measure it, keep it or revert" loop. EvoMap's Evolver brings governance and structure to the evolution process, making it safe enough for regulated environments.

The common thread? Define what good looks like, let the agent try to get there, and keep a clear record of what changed. That's the recipe – the rest is implementation details.

Bonus: A Minimal Self-Evolving Agent in LangGraph

To make this concrete, here's a short skeleton (roughly 70 lines) of a self-evolving agent using LangGraph. It doesn't do anything fancy – it just shows the core loop: run an agent, grade its output, improve the prompt if the grade is too low, and repeat.

from typing import TypedDict
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")


class State(TypedDict):
    prompt: str
    input_text: str
    output: str
    score: float
    attempts: int


def run_agent(state: State) -> dict:
    """Step 1: Run the LLM with the current prompt."""
    result = llm.invoke([
        {"role": "system", "content": state["prompt"]},
        {"role": "user", "content": state["input_text"]},
    ])
    return {"output": result.content, "attempts": state["attempts"] + 1}


def grade_output(state: State) -> dict:
    """Step 2: Score the output (your grader goes here)."""
    grade = llm.invoke(
        f"Rate this output 0-10 for quality. Reply with a number only.\n"
        f"Output: {state['output']}\nScore:"
    )
    # Guard against non-numeric replies instead of crashing on float().
    try:
        score = float(grade.content.strip().split()[0])
    except (ValueError, IndexError):
        score = 0.0
    return {"score": score}


def should_continue(state: State) -> str:
    """Step 3: Keep or improve? This is the routing logic."""
    if state["score"] >= 7 or state["attempts"] >= 5:
        return "done"
    return "improve"


def improve_prompt(state: State) -> dict:
    """Step 4: Rewrite the prompt based on feedback."""
    better = llm.invoke(
        f"This system prompt scored {state['score']}/10. "
        f"Rewrite it to score higher. Reply with only the new prompt:\n"
        f"{state['prompt']}"
    )
    return {"prompt": better.content}


# Wire up the graph: run -> grade -> (done | improve -> run)
graph = StateGraph(State)
graph.add_node("run_agent", run_agent)
graph.add_node("grade_output", grade_output)
graph.add_node("improve_prompt", improve_prompt)
graph.set_entry_point("run_agent")
graph.add_edge("run_agent", "grade_output")
graph.add_conditional_edges(
    "grade_output", should_continue,
    {"done": END, "improve": "improve_prompt"}
)
graph.add_edge("improve_prompt", "run_agent")

app = graph.compile()
result = app.invoke({
    "prompt": "Summarize concisely.",
    "input_text": "Your document here...",
    "output": "",
    "score": 0,
    "attempts": 0,
})
print(f"Final output (after {result['attempts']} attempts, score: {result['score']}):")
print(result["output"])

References

OpenAI Self-Evolving Agents Cookbook: https://developers.openai.com/cookbook/examples/partners/self_evolving_agents/autonomous_agent_retraining

Karpathy's autoresearch: https://github.com/karpathy/autoresearch

EvoMap Evolver: https://github.com/EvoMap/evolver