A New Harness in Town: Meta-Harness

Meta-Harness: End-to-End Optimization of Model Harnesses
Meta-Harness automatically optimizes model harnesses — the code determining what to store, retrieve, and present to an LLM — surpassing hand-designed systems on text classification, math reasoning, and agentic coding.
The performance of large language model (LLM) systems depends not only on model weights, but also on their harness: the code that determines what information to store, retrieve, and present to the model. Yet harnesses are still designed largely by hand, and existing text optimizers are poorly matched to this setting because they compress feedback too aggressively. We introduce Meta-Harness, an outer-loop system that searches over harness code for LLM applications. It uses an agentic proposer that accesses the source code, scores, and execution traces of all prior candidates through a filesystem. On online text classification, Meta-Harness improves over a state-of-the-art context management system by 7.7 points while using 4x fewer context tokens. On retrieval-augmented math reasoning, a single discovered harness improves accuracy on 200 IMO-level problems by 4.7 points on average across five held-out models. On agentic coding, discovered harnesses surpass the best hand-engineered baselines on TerminalBench-2. Together, these results show that richer access to prior experience can enable automated harness engineering.

Everyone obsesses over model weights. Which model is smarter, which one scored higher on the benchmark, which provider just dropped a new frontier release. But there's a dirty secret hiding in plain sight: the code that wraps the model often matters more than the model itself.

A new paper from Stanford and MIT makes this case forcefully. Meta-Harness, by Yoonho Lee, Chelsea Finn, and collaborators, introduces a system that automatically optimizes model harnesses — and the results suggest we've been leaving enormous performance on the table by hand-crafting them.

What Is a Model Harness, Exactly?

Before diving into Meta-Harness, it's worth being precise about what a harness actually is. A model harness is the code that sits between raw inputs and the language model. It decides what information to store, what to retrieve, how to construct the prompt, and how to post-process the output. Think of it as the entire software environment that surrounds a model call.
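To make that concrete, here is a minimal sketch of what a harness looks like as code. All names here (`Harness`, `llm_call`) are hypothetical illustrations, not the paper's implementation; the point is only that store/retrieve/prompt/post-process are ordinary code.

```python
# Minimal sketch of a model harness: the code between raw inputs and the model.
# Everything here is a hypothetical illustration, not the paper's implementation.

def llm_call(prompt: str) -> str:
    """Placeholder for an actual LLM API call."""
    return "stubbed response"

class Harness:
    def __init__(self):
        self.memory = []  # what to store across calls

    def store(self, example):
        self.memory.append(example)

    def retrieve(self, query, k=3):
        # Trivial retrieval: the k most recent items. A real harness might
        # use embeddings, BM25, or learned example selection instead.
        return self.memory[-k:]

    def run(self, query):
        # Construct the prompt from retrieved context, call the model,
        # then post-process the raw output.
        context = self.retrieve(query)
        prompt = "\n".join(str(c) for c in context) + "\n\nQuestion: " + query
        raw = llm_call(prompt)
        return raw.strip()
```

Every design decision in this sketch, which items to store, how many to retrieve, how to order the prompt, is a degree of freedom that Meta-Harness searches over.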

This includes things like retrieval-augmented generation pipelines, context management strategies, few-shot example selection, tool-calling scaffolds, and the agentic loops that govern multi-step reasoning. When you hear about a coding agent that does well on SWE-bench, a huge chunk of that performance comes not from the model itself but from the harness — the code that decides what files to read, how to structure the prompt, when to retry, and how to verify results.

The paper cites a striking finding: changing only the harness around a fixed LLM can produce a 6x performance gap on the same benchmark. That's not a marginal difference. That's the difference between a system that barely works and one that's state-of-the-art.

The Problem with Existing Text Optimizers

So if harnesses matter this much, why not optimize them automatically? People have tried. There's a growing literature on text optimization — systems like OPRO, TextGrad, AlphaEvolve, and GEPA that use LLMs to iteratively improve prompts and code.

The Meta-Harness paper argues these existing approaches share a fundamental limitation: they compress feedback too aggressively. OPRO only shows the optimizer a window of past solution-score pairs. TextGrad only gives feedback on the current artifact. AlphaEvolve maintains a program database with evaluation scores but still operates in a constrained window. All of these methods throw away most of the diagnostic information that would help you understand why a particular approach failed.

This matters because harness optimization is a causal reasoning problem. You don't just need to know that harness A scored 72% and harness B scored 68%. You need to know that harness B failed on examples where the context window was too long, or that it retrieved irrelevant documents for a specific category of questions. The raw execution traces contain this information. Existing optimizers discard it.

How Meta-Harness Works

Meta-Harness takes a different approach: give the proposer access to everything. The system maintains a filesystem that stores the complete source code, evaluation scores, and raw execution traces of every harness candidate it has ever tried. The proposer — an agentic coding model (Claude Opus 4.6 via Claude Code in the paper's implementation) — can browse this filesystem freely using standard tools like grep and cat.

The search loop is elegant in its simplicity. Start with an initial population of valid harnesses. Evaluate each one on the task. Store all results — code, scores, traces — in the filesystem. Then ask the proposer to inspect the accumulated evidence and propose new harness candidates. Evaluate the new candidates, add them to the filesystem, and repeat.
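The loop above can be sketched in a few lines. This is a toy rendering under stated assumptions: `evaluate` and `propose` are stand-ins for the real task runner and the agentic proposer, and all names are hypothetical.

```python
import json
import pathlib
import random

# Hypothetical sketch of the Meta-Harness outer loop: evaluate candidates,
# persist code/score/trace to a filesystem, then ask a proposer for new ones.

def evaluate(harness_code: str) -> tuple[float, str]:
    """Stand-in for running the harness on the task; returns (score, trace)."""
    score = random.random()
    return score, f"trace for candidate scoring {score:.3f}"

def propose(history_dir: pathlib.Path, n: int) -> list[str]:
    """Stand-in for the agentic proposer, which in the real system browses
    the history directory with tools like grep and cat."""
    prior = len(list(history_dir.iterdir()))
    return [f"# candidate informed by {prior} prior runs" for _ in range(n)]

def search(initial: list[str], iterations: int, history_dir: pathlib.Path):
    population = list(initial)
    best = (float("-inf"), None)
    for it in range(iterations):
        for i, code in enumerate(population):
            score, trace = evaluate(code)
            record = {"code": code, "score": score, "trace": trace}
            (history_dir / f"iter{it}_cand{i}.json").write_text(json.dumps(record))
            if score > best[0]:
                best = (score, code)
        # The proposer sees the full accumulated history, not a summary.
        population = propose(history_dir, n=2)
    return best
```

The structural point the sketch captures is that nothing is ever discarded: every candidate's code, score, and trace lands on disk, and the proposer's input grows monotonically richer.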

Each harness is a single-file Python program. The proposer doesn't just tweak a prompt string — it can modify retrieval strategies, change context management logic, add pre-processing steps, or restructure the entire approach. This is optimization in code space, which is strictly more expressive than optimization in prompt space.

The key design choice is that the proposer sees the full history via the filesystem, not a compressed summary. In practice, the proposer reads around 82 files per iteration, selectively grepping through execution traces to diagnose failures and form hypotheses about what to try next. The paper's ablation study confirms this is the crucial ingredient: when you remove access to raw traces and only provide summaries, performance drops significantly.
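The difference between raw traces and summaries can be illustrated with a toy diagnosis pass. A summary tells you "harness B scored 68%"; the traces let you count exactly which candidates hit which failure mode. The function and file layout below are hypothetical, a programmatic analogue of the proposer grepping through trace files.

```python
import json
import pathlib

# Toy illustration of trace-level diagnosis. Instead of a compressed summary,
# scan raw traces for a failure pattern and attribute it per candidate.
# File names and record fields here are hypothetical.

def diagnose(history_dir: pathlib.Path, pattern: str) -> dict[str, int]:
    """Count, per stored candidate, how many trace lines match a failure
    pattern -- the kind of counterfactual evidence summaries throw away."""
    counts = {}
    for path in sorted(history_dir.glob("*.json")):
        record = json.loads(path.read_text())
        hits = sum(pattern in line for line in record["trace"].splitlines())
        counts[path.stem] = hits
    return counts
```

Calling something like `diagnose(history_dir, "context window exceeded")` would reveal that one harness hit that failure repeatedly while another never did, exactly the causal signal the paper argues compressed feedback destroys.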

The Results Are Hard to Ignore

Meta-Harness is evaluated on three diverse task domains, and the results are consistently strong.

On online text classification, where an LLM receives labeled examples one at a time and must classify new ones using its accumulated context, Meta-Harness improves over ACE (Agentic Context Engineering, the prior state-of-the-art context management system) by 7.7 accuracy points while using 4x fewer context tokens. That's a Pareto improvement — better accuracy and lower cost simultaneously. Compared to other text optimizers, Meta-Harness reaches the final accuracy of OpenEvolve and TTT-Discover within the first 4 evaluations, and its final accuracy surpasses theirs by more than 10 points.

On retrieval-augmented math reasoning, Meta-Harness discovers a single retrieval harness that improves accuracy on 200 IMO-level problems by 4.7 points on average across five entirely different held-out models. This transfer result is particularly notable — the harness was optimized using one model but generalizes to models from different providers and different size classes.

On agentic coding with TerminalBench-2, the discovered harness surpasses Terminus-KIRA (the best hand-engineered baseline) and ranks #1 among all Claude Haiku 4.5 agents on this competitive benchmark. The project page includes an interactive demo where you can step through the proposer's reasoning across iterations — watching it perform counterfactual diagnosis across execution traces, identify failure patterns, and propose targeted fixes.

Why This Matters: The Bitter Lesson, Applied

The paper's discussion section draws a connection to Rich Sutton's Bitter Lesson: once a search space becomes accessible, automated search tends to outperform hand-engineering. Harness engineering has historically been a manual craft — researchers and engineers iterating on prompts, retrieval strategies, and scaffolding code through trial and error. Meta-Harness suggests that this craft is now automatable, at least in significant part.

There are several practical implications worth highlighting. First, discovered harnesses transfer. The text classification harness generalizes to entirely new datasets unseen during search. The math retrieval harness transfers across five different models. This means the investment in running a Meta-Harness search can pay off across deployments.

Second, the discovered harnesses are inspectable. Because they're single-file Python programs, you can read, understand, and modify what the system found. This isn't a black-box optimization — it's more like having an extremely persistent engineer who tries hundreds of approaches and gives you the best one with full source code.

Third, the approach is model-agnostic on the evaluation side. Meta-Harness optimizes the harness, not the model. This means it can improve any LLM's performance by finding better ways to present information to it, select examples, manage context, or structure tool use.

The work also raises an interesting meta-question: if the code around the model matters this much, and we can now optimize it automatically, where should the research community focus its attention? The traditional emphasis on model weights, training data, and RLHF may be only half the story. The harness is the other half, and it just got its own optimization algorithm.

The Bigger Picture: Meta-Harness and the Agentic Harness Engineering Paradigm

To fully appreciate what Meta-Harness represents, it helps to zoom out and look at where it sits within the rapidly evolving discipline of harness engineering.

The term "harness engineering" was coined by Mitchell Hashimoto in early 2026 to describe a simple but powerful idea: every time an agent makes a mistake, you engineer the environment so it cannot make that specific mistake again. You improve AGENTS.md files. You add linters, guardrails, and verification scripts. You design memory systems, sandboxes, and tool-access controls. The agent does the work inside the harness, but a human designs the harness itself.

This framing quickly crystallized into a three-layer model that the community has broadly adopted. Prompt engineering is what you ask the model. Context engineering is what you send the model so it can answer confidently. Harness engineering is the full infrastructure around the model: how it operates, what tools it can use, what permissions it has, how failures are handled, and what happens before anything gets called "done." Each layer encompasses the one before it.

The key insight from production systems like Claude Code, OpenClaw, OpenCode, and Codex is that the harness is where most of the engineering value lives. LangChain demonstrated this concretely: changing only the harness, not the model, moved a coding agent from outside the top 30 to the top 5 on TerminalBench 2.0. The filesystem became the universal memory primitive. ReAct loops became the standard planning pattern. Multi-surface architectures let the same agent serve terminal, web, and messaging interfaces. The entire discipline of agentic harness engineering, as Paul Iusztin and others have described it, is about building these systems that transform the LLM into a new kind of operating system.

But here is the critical distinction: all of this is fundamentally a human-driven discipline. Humans observe agent failures, humans diagnose root causes, humans design the fixes, and humans iterate on the harness. The agent works inside the harness. The human engineers the harness.

Meta-Harness flips this entirely. It is not a new method within the agentic harness engineering paradigm. It is the automation of the paradigm itself. An agentic coding model inspects execution traces, diagnoses failures causally, and proposes new harness code. Where a human engineer might iterate on a few harness designs per week, Meta-Harness evaluates hundreds of candidates per search, reading 82 files per iteration, performing counterfactual diagnosis across execution traces, and generating complete single-file Python harnesses that restructure retrieval strategies, context management logic, and tool-calling scaffolds.

The self-referential name is deliberate. Meta-Harness is itself a harness, since it determines what information the proposer sees and when. But its job is not to wrap a model for a downstream task. Its job is to search over the space of harnesses that wrap models for downstream tasks. It is a harness for optimizing harnesses, operating one level of abstraction above the existing paradigm.

This represents a meaningful shift in how we think about the relationship between humans and harnesses. The existing paradigm says: you are the harness engineer, the model is the worker. Meta-Harness says: the model can be both the worker and the harness engineer, if you give it the right search infrastructure. The human role shifts further up the stack, from designing harnesses to designing the search process that discovers them.

Whether this makes human harness engineering obsolete is another question. The honest answer is: probably not, at least not yet. Meta-Harness still requires human decisions about the search space, the evaluation metrics, and the initial harness population. And the discovered harnesses, being inspectable Python programs, are designed to be read and modified by humans. The more likely near-term future is a collaborative loop: Meta-Harness discovers harness designs that no human would have tried, and human engineers curate, combine, and deploy them into production systems where reliability constraints go beyond what automated search can verify alone.

But make no mistake about the direction. We went from prompt engineering to context engineering to harness engineering in under two years. Meta-Harness suggests the next step is already here: automated harness engineering, where the craft of building agent infrastructure becomes itself a search problem that agents can solve.