AI's Memento Problem: Why LLMs Can't Form New Memories
In Christopher Nolan's Following, a young writer is drawn into the orbit of a charismatic thief who can read a stranger's life from the contents of their flat. In Memento, his next film, Nolan inverts the idea: Leonard Shelby is just as brilliant and just as observant, but a head injury has left him with anterograde amnesia. He cannot form new memories. So he tattoos clues onto his skin, snaps Polaroids, and scribbles notes in his own handwriting, the only kind he lets himself trust. The intellect is intact. The memory is not. Every few minutes, his world resets, and he has to reconstruct who he is from external props.
This is, almost exactly, the condition of today's Large Language Models. In their April 2026 essay "Why We Need Continual Learning," Andreessen Horowitz partners Malika Aubakirova and Matt Bornstein make the analogy explicit: "Large language models live in a similar perpetual present. They emerge from training with vast knowledge frozen into their parameters but they cannot form new memories — cannot update their parameters in response to new experience." To compensate, we surround them with scaffolding the authors describe as "chat history as short-term sticky notes, retrieval systems as external notebooks, system prompts as guiding tattoos." The model itself never internalizes the new information.
To understand why this is a real problem, and why Continual Learning has become one of the most interesting research frontiers in AI, it helps to start with the architecture itself.
How a Transformer Actually Works
Transformers, the architecture behind every modern LLM, are, in the words of the a16z authors, "at their core, conditional next-token predictors over a sequence." Strip away the marketing language and an LLM does one thing: it takes a sequence of tokens (roughly, word fragments) and predicts the most likely next token, one step at a time.
All of the model's "knowledge" lives in its parameters, also called weights: hundreds of billions of numbers fixed during a massive training run. At that stage, the model compresses huge swaths of the internet into those weights. The compression is lossy, and that is exactly what makes it useful — it forces the model to find structure, to generalize, to build representations that transfer across contexts.
Once training ends, those weights are frozen. From that point on, the model never learns anything again in the strict sense of updating its internal representations. Every output it produces, no matter how clever, is generated from a fixed brain conditioned on whatever you happen to type into the prompt.
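To make that concrete, here is a minimal sketch of the prediction loop, using GPT-2 via the Hugging Face transformers library purely as an illustrative stand-in for any causal LLM:

```python
# A minimal sketch of next-token prediction with frozen weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference mode: the weights are fixed and never updated

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():  # no gradients, hence no learning
    for _ in range(10):
        logits = model(input_ids).logits           # one score per vocabulary token
        next_id = logits[:, -1, :].argmax(dim=-1)  # greedy: pick the most likely token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```

Nothing in this loop ever writes back to the model. However many tokens it generates, the weights end the session exactly as they began it.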
What a Context Window Is
If the weights are the model's long-term, frozen memory, the context window is its working memory. It is the chunk of text the model can "see" at any given moment: the system prompt, the chat history, any retrieved documents, tool outputs, and the user's latest message. Modern LLMs measure this in tokens — 8K, 128K, 1M, sometimes more.
Everything outside that window may as well not exist. The model has no recall of yesterday's conversation, no awareness of the email you sent last week, no memory of the mistake it made an hour ago, unless someone, or some system, drops that information back into the context as text. This is why prompt engineering, retrieval-augmented generation (RAG), and agent harnesses exist. They are all elaborate ways of choosing what to feed into the window.
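A toy sketch of what such a harness does when it assembles the window (the token counting here is a crude whitespace approximation; real systems use the model's own tokenizer and a real budget):

```python
# A toy sketch of how a harness assembles the context window.
def build_context(system_prompt, chat_history, retrieved_docs, user_message,
                  budget_tokens=4096):
    def n_tokens(text):
        return len(text.split())  # crude stand-in for a real tokenizer

    # Fixed parts go in first; history is evicted oldest-first if it won't fit.
    parts = [system_prompt] + retrieved_docs
    remaining = budget_tokens - sum(n_tokens(p) for p in parts) - n_tokens(user_message)

    kept_history = []
    for turn in reversed(chat_history):  # keep the most recent turns
        if n_tokens(turn) > remaining:
            break                        # everything older simply vanishes
        kept_history.insert(0, turn)
        remaining -= n_tokens(turn)

    return "\n\n".join(parts + kept_history + [user_message])
```

Anything that does not survive the budget is not summarized or remembered. It is simply gone.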
It also explains why this approach has been so successful so far. As the a16z piece notes, "the intelligence lives in the static parameters, and the apparent capabilities change radically depending on what you feed into the window." Better prompting, longer contexts, and smarter retrieval have unlocked enormous capability gains without retraining a single weight.
Why Memory Is an Issue for Modern LLMs
The cracks show up the moment you ask an LLM to do something that lasts longer than a single chat.
Agents are the clearest stress test. A coding agent or research agent runs in a loop: think, act, observe, repeat. Each step depends on the context produced by the previous one. The a16z authors observe that long-running agents "often fail after 20-100 steps because they lose the thread: their context fills up, coherence degrades, and they stop converging." Like Leonard pinning Polaroids to the wall, the agent is constantly trying to reconstruct a coherent picture from a pile of external notes — and eventually the pile gets too big to hold.
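In schematic form, the loop looks something like this; `llm` and `run_tool` are hypothetical placeholders for a model call and a tool executor, and the point is only that the context grows with every step:

```python
# A schematic think-act-observe loop. Every step appends to the context,
# so the window fills up as the task runs.
def agent_loop(llm, run_tool, task, max_steps=100, window_limit=128_000):
    context = [f"Task: {task}"]
    for _ in range(max_steps):
        thought = llm("\n".join(context))  # think, conditioned on all prior notes
        if thought.startswith("DONE"):
            return thought                 # converged
        observation = run_tool(thought)    # act
        context += [thought, observation]  # observe: the pile of notes grows
        # crude budget check: roughly 4 characters per token, evict oldest first
        while sum(len(c) for c in context) // 4 > window_limit:
            context.pop(1)                 # keep the task line, drop the oldest note
    return "gave up: lost the thread"
```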
Bigger context windows help, but only up to a point. The a16z piece pushes back hard on the idea that scaling context alone is the answer: "a bigger filing cabinet is still a filing cabinet." A system with infinite storage and perfect retrieval has never been forced to compress anything, and compression, the authors argue, is what learning actually is.
There is a deeper limitation too. The context window can only hold what can be written down. The essay points out that "in-context learning is limited to what can be expressed in language, whereas weights can encode concepts that someone's prompt cannot relay in text." The visual texture that separates a tumor from a benign artifact on a medical scan, the micro-cadence of a particular speaker's voice, the tacit feel of a codebase you have worked in for years — none of these decompose neatly into words. No prompt, however long, can transfer them. That kind of knowledge can only live in the weights.
This also explains a more subtle product problem. Features like "the bot remembers you" — ChatGPT memory and its peers — often produce user discomfort rather than delight. Aubakirova and Bornstein diagnose this cleanly: "Users don't actually want recall per se. They want competence." Verbatim playback of past conversations is not the same as a model that has genuinely internalized how you think.
Enter Continual Learning
Continual Learning is the field of research aimed at letting a model keep learning after it has been deployed, ideally by updating its weights in response to new experience rather than relying entirely on external scaffolding. In Memento terms, it is the attempt to give Leonard back the ability to form memories, rather than handing him more Polaroids.
The term is not new. The a16z authors trace it back to McCloskey and Cohen's 1989 work on catastrophic interference in neural networks. What has changed is the stakes: with frontier LLMs as capable as they now are, the gap "between what models know and what they could know" has become impossible to ignore. The authors call Continual Learning "some of the most important work happening in AI right now."
The essay frames the design space around one question: "where does compaction happen?" That gives three broad clusters of approaches. Context-based methods, the most mature, leave the weights alone and invest in smarter retrieval, longer context windows, and agent harnesses. Module-based methods attach lightweight knowledge modules — compressed KV caches, adapter layers, external memory stores — that specialize a general-purpose model without retraining it. Weight-update methods go further, attempting genuine parametric learning through techniques such as sparse memory layers, reinforcement learning from deployment feedback, and test-time training that compresses fresh context directly into the model's parameters during inference.
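As one concrete instance of the module-based cluster, here is a minimal LoRA-style adapter (Hu et al., 2021) in PyTorch. The essay does not prescribe this exact mechanism, but it illustrates the shape of the approach: a frozen general-purpose layer plus a small trainable correction.

```python
# A minimal LoRA-style adapter: the frozen base weight is specialized by a
# small trainable low-rank update, so new behavior is learned without
# touching the original parameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the general-purpose model stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # frozen path + tiny trainable correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768))  # only A and B ever receive gradients
```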
What Continual Learning Is Not
It is worth being precise about what Continual Learning is not, because the term is increasingly used loosely.
It is not a longer context window. Architectures like state space models extend how much an LLM can hold in working memory, but the weights still do not change. The a16z piece is direct on this point: "while you're not updating the model weights, you've introduced an external memory layer."
It is not RAG. Retrieval is enormously useful for serving fresh information, but the model learns nothing from the lookup. As the essay puts it, "retrieval is not learning. A system that can look up any fact has not been forced to find structure."
It is not chat memory. Storing snippets of past conversations and re-injecting them into the prompt is recall, not learning. A continually learning model would internalize patterns from those interactions, not replay them.
It is not standard fine-tuning. Periodic offline fine-tuning is a one-shot weight update on a curated dataset, performed by the model provider. Continual Learning aspires to something much harder: ongoing, online updates from live experience, without erasing the skills the model already has.
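The contrast is easiest to see schematically. The sketch below uses a toy linear model and synthetic data, and is purely illustrative; the online variant is exactly the part nobody yet knows how to do safely.

```python
# Purely illustrative contrast on a toy model.
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Standard fine-tuning: one offline pass over a curated dataset, then freeze.
curated = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(100)]
for inputs, targets in curated:
    optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()
    optimizer.step()
model.eval()  # deployed frozen from here on; this is today's status quo

# Continual learning (aspirational): a small update after every live
# interaction, with no good answer yet for catastrophic forgetting.
def online_update(interaction_input, feedback_target):
    optimizer.zero_grad()
    loss_fn(model(interaction_input), feedback_target).backward()
    optimizer.step()  # the weights change in production
```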
Why It Matters
The core argument in the a16z piece is that compression is the essence of learning, and we have switched it off at exactly the wrong moment. During training, the model is forced to squeeze the internet into a finite set of weights, and that pressure is what produces generalization. At deployment, we stop the compression and substitute external memory. The mechanism that made the model powerful is the one we refuse to let it use after release.
The essay quotes Ilya Sutskever making a related point about human intelligence: "A human being is not an AGI. Yes, there is definitely a foundation of skills, but a human being lacks a huge amount of knowledge. Instead, we rely on continual learning. If I produce a super intelligent 15-year-old, they don't know very much at all… The deployment itself will involve some kind of a learning, trial-and-error period. It's a process, not dropping the finished thing." A static, deploy-once model, by contrast, has no equivalent of that on-the-job learning.
The practical consequences are everywhere: agents that lose coherence after a few dozen steps, assistants that never get better at your specific workflow no matter how many times you correct them, and entire categories of tacit, non-verbal knowledge that simply cannot be passed in through a prompt. The a16z authors also flag the harder open question: whether genuine novel discovery — Andrew Wiles' proof of Fermat's Last Theorem, Grigori Perelman's proof of the Poincaré conjecture — requires something more than recombining existing context, and whether parametric learning is part of the answer.
The Hard Problems
Continual Learning is not a solved engineering discipline, and the a16z authors are careful to enumerate why. Catastrophic forgetting is the oldest problem: a model sensitive enough to learn from new data tends to overwrite the representations that already worked. Temporal disentanglement is a related issue, where invariant rules and mutable facts get encoded into the same weights, so updating one corrupts the other. Logical integration fails because edits are typically local to token sequences, not semantic concepts. And unlearning is unsolved: as the authors put it, "there is no differentiable operation for subtraction," so once something false or toxic is baked into the weights there is no surgical way to take it out.
There is also a governance dimension. The current separation between training and deployment is, in the authors' words, "a safety, auditability, and governance boundary." A continuously updating model is a moving target that cannot be versioned, regression-tested, or certified once, and continuous updates open the door to a slow, persistent form of data poisoning that lives in the weights.
Active research directions discussed in the essay span Elastic Weight Consolidation (Kirkpatrick et al., 2017), Test-Time Training (Sun et al., 2020 and subsequent TTT layer work), Model-Agnostic Meta-Learning (Finn et al., 2017), Nested Learning architectures (Behrouz et al., 2025), self-distillation methods such as SDFT (Shenfeld et al., 2026), and self-improvement loops including STaR (Zelikman et al., 2022) and DeepMind's AlphaEvolve (2025). Silver and Sutton's "Era of Experience" (2025) provides a broader framing of agents learning from continuous experience streams.
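Of these, EWC is the easiest to sketch. Its penalty anchors each weight to its old value in proportion to an estimated importance (the diagonal of the Fisher information), so learning a new task bends the unimportant weights and leaves the important ones alone. A minimal PyTorch version, assuming `fisher` and `old_params` were computed on the previous task:

```python
# Minimal sketch of the Elastic Weight Consolidation penalty
# (Kirkpatrick et al., 2017). `fisher` holds a per-parameter importance
# estimate (squared gradients of the log-likelihood, averaged over old-task
# data); `old_params` are the weights after learning that task.
import torch

def ewc_loss(model, task_loss, old_params, fisher, lam=1000.0):
    penalty = 0.0
    for name, p in model.named_parameters():
        # important weights are anchored hard; unimportant ones stay free
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return task_loss + (lam / 2.0) * penalty
```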
From Memento to Memory
Leonard's tragedy in Memento, the a16z piece argues, is not that he cannot function. He is resourceful and brilliant in any given scene. His tragedy is that he can never compound. Every insight he reaches has to be re-derived from scratch the next time the world resets.
Today's LLMs operate under the same constraint. We have built remarkable retrieval and scaffolding around them — longer context windows, smarter harnesses, coordinated multi-agent swarms — and they work. But, as the authors put it, "retrieval is not learning." The path forward, in their view, is unlikely to be a single breakthrough. It is more likely a layered system: in-context learning as the first line of adaptation, modules for personalization and domain specialization, and weight-level updates for the hardest problems where knowledge is too tacit or too novel to fit in a prompt.
It may even require us to redefine what a "model" is: not a fixed set of weights, but, in the words of the essay, "an evolving system that includes its memories, its update algorithms, and its capacity to abstract from its own experience." The breakthrough, if it comes, will be letting the model do after deployment what made it powerful during training: "compress, abstract, and learn."
Otherwise, we will be stuck in our own Memento.
Source
Malika Aubakirova and Matt Bornstein, "Why We Need Continual Learning," Andreessen Horowitz, April 22, 2026. Available at https://a16z.com/why-we-need-continual-learning/