What is Karpathy's Autoresearch and how does it work?

In early March 2026, Andrej Karpathy — former head of AI at Tesla, co-founder of OpenAI, and one of the most influential voices in deep learning — open-sourced a project called autoresearch. Within days, the repository had accumulated over 61,000 stars on GitHub and ignited one of the most fascinating conversations in the AI community about what happens when you let an AI agent do your ML research while you sleep.

The premise is deceptively simple: give an AI coding agent a small but real LLM training setup, a single GPU, and a set of instructions written in a Markdown file. Then walk away. The agent modifies the training code, runs a 5-minute experiment, checks whether the result improved, keeps or discards the change, and repeats — indefinitely. You wake up in the morning to a log of experiments and, hopefully, a better model.

But beneath that simplicity lies a profound shift in how we think about research, agentic orchestration, and the role of humans in the loop. Let's unpack all of it.

What Is Autoresearch?

Autoresearch is an open-source project that turns an AI coding agent (like Claude or Codex) into an autonomous ML researcher. The entire repository is deliberately kept small — just three files that matter:

prepare.py handles the one-time data preparation: downloading training data, training a BPE tokenizer, and providing runtime utilities like the dataloader and evaluation function. This file is read-only. The agent never touches it, and the evaluation metric (val_bpb, or validation bits per byte) is locked in place so that no experiment can game the benchmark.
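The repository's exact evaluation code isn't reproduced here, but bits per byte is conventionally derived from the mean cross-entropy loss: convert nats to bits, then normalize by the raw byte count rather than the token count, so the score stays comparable even if an experiment swaps the tokenizer. A minimal sketch, with a hypothetical `bits_per_byte` helper:

```python
import math

def bits_per_byte(mean_loss_nats: float, num_tokens: int, num_bytes: int) -> float:
    """Convert a mean per-token cross-entropy loss (in nats) to bits per byte.

    Illustrative only: total nats over the validation set, converted to bits
    (divide by ln 2), then normalized by the raw byte count so the metric is
    independent of the tokenizer's compression rate.
    """
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes

# e.g. a loss of 1.0 nat/token, with 1 token per 4 bytes on average:
# bits_per_byte(1.0, 250, 1000) ≈ 0.3607
```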

train.py is where all the action happens. It contains the full GPT model definition, the optimizer (Muon + AdamW), and the training loop. Everything in this file is fair game for the agent: architecture, hyperparameters, optimizer choice, batch size, model depth — anything goes.

program.md is the most interesting file. It's the instruction manual for the AI agent, written in plain Markdown. Karpathy describes this as the "code" that programs the agent. You don't touch the Python files like you normally would as a researcher. Instead, you are "programming the program.md" — writing the Markdown that tells the agent how to think, what to prioritize, and how to run its experimental loop.

This is what Karpathy has been calling "Software 3.0": where the artifact you ship isn't Python or C++, but a natural language document that orchestrates an AI agent's behavior.

How Does It Work?

The core loop is beautifully straightforward. Once setup is complete, the agent enters an infinite experiment cycle:

First, it reads program.md for its instructions and context. Then it examines the current state of train.py and the git history to understand what has been tried. It formulates a hypothesis — maybe increasing the learning rate, switching to a different activation function, or trying a more radical architectural change. It edits train.py directly, commits the change to a dedicated git branch, and runs the training script for exactly 5 minutes of wall-clock time.

When training completes, the script outputs a summary including the key metric: val_bpb (validation bits per byte). Lower is better. The agent reads the result, logs it to a results.tsv file, and makes a decision. If the new val_bpb is lower than the previous best, the change is kept and the branch advances. If it's equal or worse, the agent does a git reset back to the last known good state and tries something else.

Then it loops. Forever. The instructions in program.md explicitly say: "NEVER STOP. Once the experiment loop has begun, do NOT pause to ask the human if you should continue."
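The keep/discard logic described above distills to a few lines. Here is a sketch under stated assumptions: each value drawn from `results` stands in for one 5-minute run of train.py, and the `keep`/`discard` callbacks stand in for the git commit that advances the branch versus the `git reset` that reverts it (none of this is the actual agent code, which lives in the agent's reasoning, not in Python):

```python
from typing import Callable, Iterator

def keep_or_discard(best: float, new: float) -> bool:
    """Autoresearch's decision rule: keep only strict improvements.
    Equal-or-worse val_bpb results are reverted (lower is better)."""
    return new < best

def experiment_loop(results: Iterator[float], baseline: float,
                    keep: Callable[[], None], discard: Callable[[], None]) -> float:
    """A distilled sketch of the loop program.md describes. The real loop
    never stops; this demo ends when the scripted results run out."""
    best = baseline
    for val_bpb in results:
        if keep_or_discard(best, val_bpb):
            keep()            # stand-in for: commit stays, branch advances
            best = val_bpb
        else:
            discard()         # stand-in for: git reset to last known good state
    return best

# Simulated night: only 0.98, 0.95, and 0.90 improve on the 1.00 baseline.
kept, dropped = [], []
best = experiment_loop(iter([0.98, 1.02, 0.95, 0.95, 0.90]), 1.00,
                       lambda: kept.append(1), lambda: dropped.append(1))
# best == 0.90, 3 experiments kept, 2 discarded
```

Note the strict inequality: an experiment that merely matches the previous best is discarded, which is exactly how the simplicity criterion tilts the loop toward smaller diffs.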

With each experiment taking roughly 5 minutes, you can expect about 12 experiments per hour or approximately 100 experiments during a typical night's sleep. The agent also applies a simplicity criterion that Karpathy baked into the instructions: a tiny improvement that adds ugly complexity isn't worth keeping, but deleting code while maintaining performance is always a win.

One crucial design choice is the fixed time budget. Every experiment trains for exactly 5 minutes, regardless of what the agent changes. This makes experiments directly comparable — whether the agent tries a tiny model with a huge batch size or a large model with fewer steps, the wall-clock cost is identical. It also means the system naturally discovers the optimal model for your specific hardware platform within that time constraint.
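A wall-clock budget like this is simple to enforce inside a training loop. A minimal sketch (the `step_fn` callback and the 300-second default are illustrative, not the actual train.py interface):

```python
import time

def train_for_budget(step_fn, budget_seconds: float = 300.0) -> int:
    """Run training steps until a fixed wall-clock budget expires.

    Sketch of the fixed-time design: the step count is whatever fits in the
    budget, so a small fast model and a large slow one cost the same either
    way. `step_fn` is a hypothetical stand-in for one optimizer step.
    """
    deadline = time.monotonic() + budget_seconds
    steps = 0
    while time.monotonic() < deadline:
        step_fn()
        steps += 1
    return steps
```

Using a monotonic clock rather than comparing step counts is the point: two experiments with wildly different per-step costs still consume an identical slice of the GPU.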

The Agentic Orchestration: Why It's Not What You Think

Here's where autoresearch gets philosophically interesting. When most people hear "agentic orchestration," they think of frameworks like LangGraph, CrewAI, or AutoGen — systems with explicit state machines, directed graphs, tool registries, and multi-agent coordination layers. Autoresearch has none of that.

The entire orchestration is a Markdown file. There are no state graphs, no tool schemas, no routing logic, no supervisor agents. The "orchestration framework" is the LLM's own ability to read instructions, reason about them, and execute a plan using its coding environment. The agent reads program.md, understands that it should loop forever, run experiments, track results, and make keep/discard decisions — and then it just does it.

This is what makes it radical. The "agentic loop" is not implemented in Python. It's implemented in English. The agent's context window is the state machine. The Markdown document is the workflow definition. Git is the version control and rollback mechanism. The file system is the persistence layer.

Karpathy is making a bet that as LLMs become more capable, the right abstraction for orchestrating their behavior isn't a DAG framework — it's a well-written document. The agent doesn't need a graph library to decide what to do next. It needs clear instructions, a well-scoped task, a measurable objective, and the tools to execute (in this case, a code editor and a terminal).

Is It Doing the Same Thing as OpenClaw?

OpenClaw (formerly known as Clawdbot/Moltbot) is an open-source AI agent that runs on your local machine and connects through messaging apps like Telegram to automate real-world tasks — research, email triage, file management, web browsing, deployment, and more. People have been using it to build everything from email command centers to research-on-demand systems and even deploying code from their phones.

At first glance, the comparison seems natural: both involve AI agents running tasks autonomously. But the similarity is largely surface-level.

OpenClaw is a general-purpose agent platform. It's an orchestration layer that gives an AI model access to your computer, your apps, and your APIs. It can be extended with "skills" — plugins that let it interact with external services. It's designed to be a broad productivity tool: a second brain, an assistant, a remote worker you can text instructions to.

Autoresearch, by contrast, is narrowly and deliberately scoped. It does exactly one thing: run ML training experiments in a loop on a single GPU, optimizing a single metric. It doesn't browse the web. It doesn't manage emails. It doesn't have plugins or skill registries. Its entire universe is three Python files and a Markdown document.

The philosophical difference matters. OpenClaw asks: "How do we give an AI agent broad access to be generally useful?" Autoresearch asks: "What happens when we give an AI agent a tightly constrained problem with a clear objective function and let it iterate with total autonomy?" Karpathy's approach trades generality for depth. By constraining the agent to a single file, a single metric, and a single time budget, he creates an environment where the agent can genuinely learn to be a better researcher through iteration — without the failure modes that come with broad, underspecified tasks.

Could You Rebuild It with LangGraph?

Technically, yes. LangGraph is a Python-based framework from LangChain designed for building stateful, multi-step agent workflows using graph architectures. It supports cyclic graphs, conditional branching, durable execution, human-in-the-loop checkpoints, and streaming. You could absolutely model the autoresearch loop as a LangGraph state machine: a "plan experiment" node, an "edit code" node, a "run training" node, an "evaluate results" node, and a "keep or discard" conditional edge that loops back to the planning phase.
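To make the comparison concrete, here is that topology sketched as an explicit node-and-edge state machine in plain Python, mirroring the graph a LangGraph port would define (every node body is a hypothetical stand-in for a real agent action; a framework build would add persistence, checkpointing, and streaming on top):

```python
# Each node is a function from state to state; the conditional edge after
# "evaluate" either loops back to planning or, in this finite demo, ends.

def plan(state):
    state["hypothesis"] = f"trial-{state['round']}"  # stand-in for LLM planning
    return state

def edit(state):
    state["edited"] = True  # stand-in for editing train.py and committing
    return state

def run_training(state):
    # stand-in for the 5-minute train.py run; results are scripted for the demo
    state["val_bpb"] = state["scripted"].pop(0)
    return state

def evaluate(state):
    if state["val_bpb"] < state["best"]:
        state["best"] = state["val_bpb"]  # keep; otherwise the edit is reverted
    state["round"] += 1
    return state

def route(state):
    """Conditional edge: loop back to planning unless the budget is exhausted."""
    return "plan" if state["scripted"] else "END"

NODES = {"plan": plan, "edit": edit, "run": run_training, "evaluate": evaluate}
ORDER = ["plan", "edit", "run", "evaluate"]

def run_graph(state):
    while True:
        for name in ORDER:
            state = NODES[name](state)
        if route(state) == "END":
            return state

final = run_graph({"round": 0, "best": 1.0, "scripted": [0.97, 1.05, 0.93]})
# final["best"] == 0.93 after three trips around the cycle
```

Notice how much ceremony the explicit graph adds relative to a paragraph of English in program.md saying "loop forever, keep improvements, revert regressions"; that overhead is precisely Karpathy's argument.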

But Karpathy would probably argue you're missing the point. The entire value proposition of autoresearch is that you don't need a framework. The agent itself is the framework. The moment you introduce LangGraph, you're adding a layer of Python orchestration code that sits between the human's intent (expressed in program.md) and the agent's execution. You're replacing a natural language interface with a programmatic one.

That said, there are legitimate reasons you might want to rebuild it with LangGraph or a similar tool. If you wanted to add multi-agent collaboration (one agent proposes experiments, another reviews them), persistent state across sessions that survives agent context window resets, parallel experiment execution across multiple GPUs, or structured observability and monitoring dashboards, then a framework like LangGraph gives you those capabilities out of the box.

The tradeoff is clear: LangGraph gives you control, debuggability, and composability at the cost of simplicity. Autoresearch gives you radical simplicity at the cost of extensibility. For a research project exploring the limits of single-agent autonomy, Karpathy's approach is more interesting. For a production system running at scale, you'd probably want something more structured.

Why Does It Work?

Autoresearch works for several interlocking reasons that reveal deep truths about effective agent design.

The objective function is unambiguous. val_bpb is a single number, and lower is better. The agent never has to guess whether an experiment was successful. There's no subjective evaluation, no committee review, no ambiguity. This is the single most important design decision in the project — it turns research from an open-ended creative endeavor into a well-defined optimization problem.

The feedback loop is fast and cheap. Five minutes per experiment means the agent gets signal quickly. In a typical overnight run of 100 experiments, the agent accumulates more empirical data about what works and what doesn't than many human researchers would generate in a week of manual experimentation. Speed of iteration compounds.

The scope is tightly constrained. The agent can only edit one file. It can't install new packages. It can't modify the evaluation harness. These constraints aren't limitations — they're features. They prevent the agent from going off the rails, gaming the metric, or getting lost in an exponentially expanding action space. Constraint breeds creativity.

Git provides perfect memory and undo. Every experiment is a commit. Every failure can be rolled back cleanly. The agent doesn't need an external memory system or a vector database — git is its memory. This is elegant because it reuses infrastructure that already exists in every software project.

The instructions are well-written. This might sound trivial, but it's not. The program.md file is a masterclass in prompt engineering for autonomous agents. It covers setup procedures, experiment protocols, logging formats, decision criteria, crash handling, and the critical instruction to never stop and ask the human for permission. Every ambiguity a researcher might face has been anticipated and addressed in the document.

What Autoresearch Is Not

It's important to be clear about what autoresearch isn't, because the hype can obscure the reality.

It is not AGI doing science. The agent isn't reading papers, formulating novel hypotheses from first principles, or having genuine insight. It's applying its training-time knowledge of ML best practices to systematically try variations within a constrained search space. This is more akin to intelligent hyperparameter tuning and architecture search than it is to scientific discovery.

It is not a general-purpose research tool. It only works on the specific problem of training small language models on a single GPU. You can't point it at a biology paper and ask it to design experiments. The pattern is transferable — and people are already adapting it for A/B testing, quantitative trading, and product optimization — but the specific repository only does one thing.

It is not distributed or scalable out of the box. The design is deliberately single-GPU, single-agent, single-file. There's no multi-node training, no multi-agent coordination, no experiment scheduling system. Karpathy even notes in the README that your results won't be comparable to other people running on different hardware, because the time-budget design means every platform finds its own optimal configuration.

It is not a replacement for human researchers. The agent is good at grinding through the parameter space, trying combinations a human might not bother with, and maintaining perfect discipline about logging and reverting failed experiments. But it doesn't have taste, intuition, or the ability to step back and ask whether the entire problem is framed correctly. The human's job shifts from "run experiments" to "write better program.md files" and "interpret the results."

When Will It Not Work?

Understanding the failure modes of the autoresearch pattern is just as important as understanding why it works.

When the objective function is ambiguous or multi-dimensional, the pattern breaks down. If success isn't a single number but a combination of factors — performance, latency, memory, code readability, user experience — the agent has no clean way to make keep/discard decisions. Real-world ML research often involves tradeoffs that can't be collapsed into one metric.

When the search space is too large or too disconnected, the agent will plateau. It can make local improvements all night, but it can't make the kind of conceptual leap that says "we should abandon transformers entirely and try a state-space model." Its innovations are incremental edits to existing code, not paradigm shifts. After dozens of experiments, the improvements tend to converge toward diminishing returns.

When experiments are expensive or slow, the economics break down. The autoresearch pattern relies on fast, cheap iterations. If each experiment takes hours instead of minutes — as is the case for training any production-scale model — then running 100 experiments overnight isn't feasible. The pattern works precisely because Karpathy scoped it to a small model on a single GPU.

When the task requires cross-file or cross-system changes, the single-file constraint becomes a bottleneck. Real research often involves modifying data pipelines, evaluation protocols, and infrastructure simultaneously. The autoresearch pattern assumes all the important degrees of freedom live in one file.

When context windows run out, the agent loses track. After many experiments, the accumulated history of changes, results, and reasoning can exceed the agent's context window. At that point, the agent effectively "forgets" early experiments and may repeat failed ideas. This is a fundamental limitation of the current approach that frameworks with persistent memory could address.

The Bigger Picture

Karpathy opened the project's README with a piece of speculative fiction about a future where "research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies" and the code has become "a self-modifying binary that has grown beyond human comprehension." Then he added: "This repo is the story of how it all began."

It's tongue-in-cheek, but the underlying question is real. Autoresearch demonstrates that we're already in an era where an AI agent, given clear instructions and a measurable objective, can conduct meaningful experiments autonomously. The pattern — constrained scope, fast iterations, clear metrics, autonomous loops — is transferable far beyond ML training. People in the community are already experimenting with autoresearch-style loops for landing page optimization, trading strategy backtesting, and product feature experimentation.

The most profound takeaway might be the simplest one: the bottleneck in AI research is shifting. It used to be "can the agent do this task at all?" Now it's "can you write a good enough program.md?" The skill of the future isn't coding Python. It's writing the document that tells the agent how to code Python. And if Karpathy's vision is even partially correct, that document — not the framework, not the infrastructure, not the model weights — is where the real intellectual work happens.

"But Isn't It Just an Agent That Fine-Tunes LLMs?"

This is the most common reductive take, and it's worth addressing head-on. Yes, at the narrowest description level, autoresearch is an agentic system that trains small language models in a loop. If you squint hard enough, you could call it automated fine-tuning. But that framing misses almost everything that makes the project interesting.

First, it's not fine-tuning at all — it's training from scratch. Fine-tuning takes a pre-trained model and adjusts its weights on a narrower dataset. Autoresearch starts from a randomly initialized GPT model and trains it on raw text data. More importantly, the agent isn't just tweaking hyperparameters on a fixed architecture the way a fine-tuning sweep would. It's rewriting the model architecture itself, swapping optimizers, changing attention patterns, adjusting depth and width, modifying the training loop — things that go well beyond what any hyperparameter search tool like Optuna or Ray Tune would touch.

Second, and more fundamentally, the model training is the benchmark, not the product. Karpathy didn't build autoresearch because he desperately needed a slightly better 50M-parameter language model. He built it to demonstrate a pattern: that an AI agent, given a clear metric and a constrained action space, can conduct autonomous iterative research. The val_bpb score is the evaluation function for the agent's research ability, not the point of the project.

Think of it this way: if someone built a chess-playing AI, you wouldn't say "it's just a system that moves pieces on a board." The board is the environment. The game is the benchmark. The interesting thing is the intelligence of the decision-making. Similarly, the interesting thing about autoresearch isn't that models get trained — it's that an AI agent is making research decisions about what to try next, learning from failures, and iterating autonomously with no human in the loop.

The real innovation is the pattern itself: a tight loop of hypothesis, experiment, evaluation, and keep-or-discard — orchestrated entirely through natural language instructions in a Markdown file. That pattern is what people are now porting to completely different domains. The LLM training is just the first proving ground.