.MD vs .HTML file choices for LLM, Agentic tasks

When you hand content to a large language model, whether you are stuffing a document into a prompt, building a retrieval pipeline, or wiring up an agent that reads and writes files, the format you choose is not a cosmetic detail. Markdown and HTML carry the same words, but they package those words very differently. That packaging changes how reliably a model understands the content, how many tokens you burn, and how easily an agent can act on what it reads. This post lays out when each format helps, when it quietly hurts your results, and a simple framework for choosing between them.

The short version

Reach for Markdown by default. It is compact, readable, and maps cleanly onto the structure models were trained to expect. Reach for HTML when structure, precision, or fidelity to a rendered page actually matters: tables with merged cells, forms an agent must interact with, or content where the exact DOM is the thing you care about. Most of the time the question is not which format is universally better, but which one preserves the signal you need at the lowest cost.

What actually happens inside the model

To choose well, it helps to understand what a transformer does with markup. Everything you pass in is first broken into tokens, and structural characters are not free. In HTML, a single emphasized phrase is wrapped in an opening tag and a closing tag, and each tag usually splits into several tokens of its own, so for a short phrase the markup can outnumber the content tokens it wraps. In Markdown, that same emphasis is at most a pair of asterisks on each side. Fewer structural tokens means more of the budget, and more of the model's attention, lands on the words that carry meaning rather than on scaffolding.
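To make the emphasis example concrete, here is a minimal sketch that counts markup pieces for the same phrase in both formats. The `rough_pieces` helper is a hypothetical stand-in for a real tokenizer: it simply treats each word and each punctuation mark as one piece. Real subword tokenizers merge characters differently, so read the counts as a direction, not a measurement.

```python
import re

# Crude tokenizer proxy (an assumption for illustration only):
# every run of word characters and every punctuation mark is one piece.
PIECE = re.compile(r"\w+|[^\w\s]")


def rough_pieces(text: str) -> list[str]:
    """Split text into word and punctuation pieces."""
    return PIECE.findall(text)


html = "<strong>important</strong>"
markdown = "**important**"

html_pieces = rough_pieces(html)
md_pieces = rough_pieces(markdown)

# Show how many pieces wrap the single content word in each format.
print(len(html_pieces), html_pieces)
print(len(md_pieces), md_pieces)
```

Under this proxy the HTML version comes to 8 pieces and the Markdown version to 5, with one content word in each. To get real numbers, swap in the tokenizer of the model you actually use.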

Attention is the second piece. Each token attends to the other tokens in the context (in decoder-only models, to the tokens that precede it), so each additional token of markup is one more thing competing for the model's attention. Deeply nested HTML spreads the relationship between a heading and its paragraph across many intervening tag tokens, which can dilute the signal that ties them together. Markdown keeps those cues short and adjacent: a hash mark sits right next to the heading text, and a model trained on mountains of Markdown immediately reads it as a section boundary. The structural signal is strong precisely because it is concise and familiar.

This is also why training data matters so much. Models have seen enormous volumes of Markdown in documentation, README files, and forum posts, so the mapping from its syntax to meaning is well worn. They have also seen plenty of HTML, but much of it is noisy, machine-generated layout, which makes clean semantic structure harder to rely on. The takeaway is not that one format is parsed and the other is not. It is that concise, familiar structure leaves more room and more attention for the content, while verbose or noisy structure competes with it.

Why Markdown is the sensible default

The mechanics above translate into three practical wins. First, Markdown is token-efficient. A heading is a single hash mark rather than a pair of tags, a list item is a dash rather than wrapped elements, and emphasis is a pair of asterisks rather than a pair of em tags. On a long document that difference can cut your token count noticeably, sometimes by a third or more depending on how markup-heavy the HTML is, which means lower cost and more room in the context window for content that matters.

Second, models are deeply fluent in it. Because so much training data is written in Markdown, models hold a strong prior for how its structure maps to meaning. A line beginning with a hash reliably reads as a section boundary, and nested bullets reliably read as hierarchy. The format does double duty: it is both human-readable and machine-legible with no parsing layer in between.

Third, it degrades gracefully. If a model misreads a Markdown table or a stray asterisk, the underlying text is still plain and recoverable. There is no malformed tag waiting to swallow a paragraph or a broken attribute to confuse the parse. For prompts, system instructions, summaries, notes, and the vast majority of document content you feed a model, Markdown is the right call.

Where HTML earns its keep

HTML is worth its extra verbosity when the structure itself is the information. Complex tables are the clearest case. When cells span multiple rows or columns, when there are nested headers, or when the relationship between a value and its label depends on layout, Markdown's flat pipe-and-dash tables simply cannot express it. HTML's rowspan and colspan attributes preserve those relationships precisely, and a model reading the markup can recover the true shape of the data.
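As a rough illustration of what "recovering the true shape" means, the sketch below expands rowspan and colspan into a full grid using only Python's standard `html.parser`. The class name, helper name, and sample table are invented for this example, and the code assumes a single, well-formed, non-nested table.

```python
from html.parser import HTMLParser


class SpanTableParser(HTMLParser):
    """Collects each cell as [text, rowspan, colspan], grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._cell[0] = self._cell[0].strip()
            self.rows[-1].append(self._cell)
            self._cell = None


def expand_spans(html: str) -> list[list[str]]:
    """Return a rectangular grid in which spanned cells are repeated."""
    parser = SpanTableParser()
    parser.feed(html)
    filled = {}  # (row, col) -> cell text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in filled:  # slot already taken by a span from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    filled[(r + dr, c + dc)] = text
            c += colspan
    if not filled:
        return []
    n_rows = max(r for r, _ in filled) + 1
    n_cols = max(c for _, c in filled) + 1
    return [[filled.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


table_html = """
<table>
  <tr><th rowspan="2">Region</th><th colspan="2">Revenue</th></tr>
  <tr><th>Q1</th><th>Q2</th></tr>
  <tr><td>North</td><td>10</td><td>12</td></tr>
</table>
"""

grid = expand_spans(table_html)
for line in grid:
    print(line)
```

The merged "Revenue" header ends up above both quarter columns and "Region" fills both header rows, which is exactly the relationship a flat pipe-and-dash table has no way to state.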

Agentic browsing and automation are the other major case. When an agent needs to click, fill, or extract from a live page, the HTML is exactly what it must reason about: element types, attributes, ids, form fields, and the relationships between them. Flattening that page to Markdown throws away the interaction surface the agent depends on. The same logic applies whenever fidelity to a specific rendered artifact matters, such as email templates, invoices, or anything where exact attributes, links, or embedded metadata are load-bearing.
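To show what that interaction surface looks like in practice, here is a small sketch that pulls form controls and their key attributes out of a page with the standard-library `html.parser`. The sample page, the `FormFieldParser` name, and the choice of which attributes to keep are all assumptions made for illustration; a real agent would likely work from a browser's DOM or accessibility tree instead.

```python
from html.parser import HTMLParser


class FormFieldParser(HTMLParser):
    """Records the controls an agent could click or fill."""

    FIELD_TAGS = {"input", "select", "textarea", "button"}

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELD_TAGS:
            a = dict(attrs)
            self.fields.append(
                {
                    "tag": tag,
                    "id": a.get("id"),
                    "name": a.get("name"),
                    "type": a.get("type"),
                }
            )


# Hypothetical login page used only for this example.
page = """
<form id="login" action="/session" method="post">
  <input id="email" name="email" type="email">
  <input id="pw" name="password" type="password">
  <button id="submit" type="submit">Sign in</button>
</form>
"""

parser = FormFieldParser()
parser.feed(page)
fields = parser.fields

for field in fields:
    print(field)
```

A Markdown rendering of the same page would keep little more than the words "Sign in", while the ids, names, and types collected here are what an agent needs to target each control.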

When each choice quietly hurts you

Markdown hurts when you force structured or interactive content through it. Convert a rich web page to Markdown and you lose form fields, button targets, and the attribute-level detail an agent needs to act. Flatten a complex table and merged cells collapse into ambiguity, so the model confidently misreads which value belongs to which label. Markdown also has no single universally adopted standard. CommonMark, GitHub Flavored Markdown, and other dialects differ on tables, footnotes, and raw HTML, so content that renders cleanly in one place can parse oddly in another.

HTML hurts mostly through cost and noise. Real-world HTML is full of wrapper divs, inline styles, tracking attributes, script and style blocks, and deeply nested layout scaffolding that carries no semantic meaning. Feeding raw HTML can easily multiply your token count, often several times over, while burying the actual content under markup the model must wade through. That extra noise can degrade comprehension and dilute attention, and malformed or truncated tags can derail the parse in ways plain text never would. HTML is the right tool only once you have cleaned it down to the structure you actually need.
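Cleaning does not have to be elaborate. The sketch below, built on the standard-library `html.parser`, drops script and style blocks, unwraps div and span, and keeps only a short allow-list of attributes. The allow-list, the class name, and the sample snippet are assumptions chosen for this example, and the code makes no attempt to handle void or self-closing elements specially; adjust it to whatever structure your task depends on.

```python
from html import escape
from html.parser import HTMLParser


class HtmlCleaner(HTMLParser):
    """Re-emits HTML with noise removed (a simplified sketch)."""

    DROP_CONTENT = {"script", "style"}  # remove the tag and everything inside
    UNWRAP = {"div", "span"}            # remove the tag, keep what it wraps
    KEEP_ATTRS = {"href", "id", "name", "type", "rowspan", "colspan"}

    def __init__(self):
        super().__init__()
        self.out = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP_CONTENT:
            self._skip_depth += 1
            return
        if self._skip_depth or tag in self.UNWRAP:
            return
        kept = ""
        for key, value in attrs:
            if key in self.KEEP_ATTRS:
                kept += ' {}="{}"'.format(key, escape(value or ""))
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in self.DROP_CONTENT:
            self._skip_depth = max(0, self._skip_depth - 1)
            return
        if self._skip_depth or tag in self.UNWRAP:
            return
        self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        # Keep visible text, skip whitespace-only runs and script/style bodies.
        if not self._skip_depth and data.strip():
            self.out.append(escape(data, quote=False))


# Hypothetical noisy snippet used only for this example.
raw = (
    '<div class="wrap" style="margin:0"><script>track()</script>'
    '<h1 class="t1" data-track="x9">Invoice</h1>'
    '<p style="color:red">Total due: <a href="/pay" onclick="log()">pay now</a></p></div>'
)

cleaner = HtmlCleaner()
cleaner.feed(raw)
cleaner.close()
cleaned = "".join(cleaner.out)

print(cleaned)
print(len(raw), "->", len(cleaned), "characters")
```

The heading, the paragraph, and the link target survive, while the wrapper, the inline styles, the tracking attribute, and the script are gone. For production use, a maintained HTML sanitizing library is a safer choice than a hand-rolled parser like this one.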

A framework for choosing

Work through four questions in order, and the answer usually falls out on its own.

1. Does the model need to act on structure, or just understand prose? If the task is reading, summarizing, reasoning, or generating text, use Markdown. If the task requires interacting with or precisely extracting from structured elements such as forms, complex tables, or live pages, lean toward HTML.

2. Is layout part of the meaning? If relationships between values depend on rows, columns, spans, or nesting that flat text cannot capture, HTML preserves what Markdown would destroy. If the content is essentially linear prose with simple headings and lists, Markdown captures everything that matters.

3. What is the token budget? Markdown is far cheaper per unit of content. On long inputs or high-volume pipelines, that cost difference is decisive unless HTML is buying you structure you genuinely need. Never pay for HTML's verbosity when Markdown carries the same signal.

4. Can you clean it? If you must use HTML, strip scripts, styles, tracking attributes, and layout-only wrappers first. Clean, semantic HTML is a reasonable input, but raw page source rarely is. If you cannot clean it and the structure is not essential, convert to Markdown instead.
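The four questions can be folded into a few lines of code. What follows is only one possible encoding, with hypothetical field names and return labels; it treats the token-budget question as the reason Markdown wins whenever structure is not needed, and does not model it as a separate switch.

```python
from dataclasses import dataclass


@dataclass
class Task:
    acts_on_structure: bool       # question 1: must the model act on elements?
    layout_carries_meaning: bool  # question 2: do spans or nesting encode meaning?
    can_clean_html: bool          # question 4: can noise be stripped first?


def choose_format(task: Task) -> str:
    """Return 'markdown', 'html-cleaned', or 'html-raw' (labels are illustrative)."""
    needs_structure = task.acts_on_structure or task.layout_carries_meaning
    if not needs_structure:
        # Question 3: with no structural payoff, the cheaper format wins.
        return "markdown"
    if task.can_clean_html:
        return "html-cleaned"
    # Structure is essential but cleaning is not possible: accept the cost.
    return "html-raw"


summarize_report = Task(acts_on_structure=False, layout_carries_meaning=False, can_clean_html=True)
fill_checkout_form = Task(acts_on_structure=True, layout_carries_meaning=False, can_clean_html=True)
read_merged_table = Task(acts_on_structure=False, layout_carries_meaning=True, can_clean_html=False)

print(choose_format(summarize_report))
print(choose_format(fill_checkout_form))
print(choose_format(read_merged_table))
```

In a real pipeline the booleans would come from inspecting the task and the source, and the last branch is a signal to budget for extra tokens rather than a recommendation.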

The bottom line

Default to Markdown for prose, prompts, documentation, and retrieval content, where its low token cost and natural fit with how models read are pure upside. Escalate to HTML, ideally cleaned rather than raw, only when structure is the signal: interactive pages an agent must operate, complex tables, and artifacts whose exact rendering matters. The goal is never the fancier format. It is the smallest representation that preserves the information the model actually needs.