The Cost of Forgetting

Technology · AI · Thoughts

Most AI systems today have the memory span of a goldfish with a budget. They burn through millions of tokens reasoning about a problem, arrive at something useful, and then forget all of it the moment the session ends. The next time you bring the same problem back, the system starts from scratch – same tokens, same reasoning, same cost – as if the previous conversation never happened.

This is treated as normal. It shouldn’t be.

If you have built anything with AI agents, you know the pattern. The context window fills up. The system starts to lose the thread. You summarise, truncate, compress – anything to squeeze a few more minutes of coherence out of a model that has no concept of yesterday. The workarounds are clever. Embeddings, vector databases, hybrid search. They help. But they are still workarounds for a fundamental absence: AI systems that do real work over time have no durable memory of their own.

I had come to accept this as an inherent limitation until I heard Allen Stewart describe something structurally different.


Allen Stewart is a Partner Director of Software Engineering Innovation at Microsoft. His team built a cognitive engine called CLIO for autonomous scientific research – drug repurposing, chemical discovery, biological analysis. These are not tasks that finish in a single prompt. They run for days or weeks, burning hundreds of millions of tokens across parallel lines of investigation that branch, converge, fail, and restart.

What makes CLIO different is a design choice that sounds simple but changes everything: every token the system expends is captured as memory. Completed research threads, incomplete threads, abandoned paths, intermediate reasoning – all of it goes into a persistent store that the cognitive engine consults at every significant decision point.

Allen ran a job for 14 days across almost 200 parallel thought channels. When it finished, he packaged up all the resulting memories – including 33 incomplete channels with only partial results – and transferred them to a fresh system. He started the same research problem again.

The second run began 150 million tokens ahead.

Those tokens had been spent entirely on planning – decomposing the problem, deciding which threads to open, determining what tools and data sources to use. On the second run, the plan was already there in memory. The system moved straight into active research. The incomplete threads were identified as relevant too, and the engine picked them up where they had left off.

“I literally proved there’s no such thing as a wasted token in science research.”

That reframes the economics of AI compute. In the standard paradigm, every token is a cost. In Allen’s paradigm, every token is an investment in reusable research capital. The exhaust from AI processing is not waste – it is institutional knowledge waiting to be applied.


The memory architecture underneath this is worth understanding, because it represents design choices that most current AI systems do not make.

The store distinguishes between procedural knowledge (how to perform tasks) and episodic knowledge (what happened during specific research runs). Each memory receives a confidence score between 1 and 5 relative to the current problem. Memories scoring 1 or 2 remain in the store but are not used to ground the current research. Memories scoring 3 or above are what Allen calls “groundable” – relevant enough to inform current reasoning.

Nothing is discarded. A memory that scores 2 against a drug repurposing problem might score 4 against an immunology problem six months later. Relevance is not a fixed property. It depends on what the system is trying to do at the moment of retrieval.
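A minimal sketch of what point-of-use scoring might look like, in Python. The keyword-overlap scorer, the threshold constant, and the data shapes are stand-ins of my own invention – the episode does not describe CLIO's scorer at this level of detail:

```python
from dataclasses import dataclass

@dataclass
class Memory:
    content: str
    kind: str  # "procedural" (how to do tasks) or "episodic" (what happened)

GROUNDABLE_THRESHOLD = 3  # scores of 3-5 are "groundable" for the current problem

def score(memory: Memory, problem: str) -> int:
    """Relevance of this memory to the *current* problem, on a 1-5 scale.
    A real system would use a model; keyword overlap is a toy stand-in."""
    overlap = len(set(memory.content.lower().split())
                  & set(problem.lower().split()))
    return min(5, 1 + overlap)

def groundable(store: list[Memory], problem: str) -> list[Memory]:
    """Nothing is discarded: low scorers stay in the store and are
    simply re-scored the next time a problem arrives."""
    return [m for m in store if score(m, problem) >= GROUNDABLE_THRESHOLD]
```

The key property is that `score` takes the problem as an argument: relevance is computed at retrieval time, never baked in at storage time.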

Scott Hanselman, the show’s host, offered an analogy that captures this: a library where you do not know in advance which books are useful. You keep everything on the shelves. When a specific question comes in, you pull the relevant volumes and leave the rest. The system curates at the point of use, not at the point of storage.

This inverts how most enterprise knowledge management works. The standard assumption is that if you structure the input well enough – tag it, organise it, curate it on entry – retrieval takes care of itself. Allen’s architecture does the opposite: store everything with minimal structure, invest the intelligence in retrieval.

Anyone who has watched a knowledge management project slowly collapse under the weight of its own taxonomy will recognise why that inversion matters.


The grounding problem is where the stakes of this work become concrete.

Allen’s team works with SMILES notation – Simplified Molecular Input Line Entry System, which is essentially ASCII for chemical structures. Ethanol is CCO. Change a single character and you have a different compound. Allen quoted an old chemistry school rhyme to make the point: “What Tommy thought was H2O was H2SO4.” In a system generating candidate drugs, a single incorrect character in a molecular representation could mean the difference between a therapeutic compound and something harmful.
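The fragility is easy to demonstrate. A toy sketch in Python – the SMILES strings below are real, but the single-substitution check is only an illustration; actually validating generated structures needs a cheminformatics toolkit such as RDKit:

```python
# Real SMILES strings; one character separates chemically distinct compounds.
COMPOUNDS = {
    "CCO": "ethanol",
    "CCN": "ethylamine",
    "CC=O": "acetaldehyde",
}

def one_substitution_apart(a: str, b: str) -> bool:
    """True if b differs from a by exactly one character substitution."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1
```

`one_substitution_apart("CCO", "CCN")` holds, yet ethanol and ethylamine are entirely different molecules – which is why a generated string that merely looks plausible is not good enough.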

Large language models, left to their own devices, will produce plausible-looking SMILES strings that are chemically meaningless. They understand the format. They do not understand the chemistry. This is the hallucination problem in its most consequential form – not a chatbot getting a date wrong, but a system generating molecular structures that could end up being synthesised in a lab.

The response in CLIO’s architecture is layered. LLMs handle orchestration, agent coordination, and summarisation – not the science itself. Scientific reasoning runs through specifically trained models built on scientific process. Knowledge grounding comes from Graph RAG, which builds knowledge graphs with a dynamic ontology instead of relying on the static retrieval of conventional RAG. Traditional RAG is a filing cabinet – documents go in, snippets come out, relationships between them invisible. Graph RAG constructs and extends a network of relationships, communities, and connections as data grows.
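The filing-cabinet versus graph distinction can be sketched in a few lines. Everything below – the entities, the edges, the two-hop walk – is an invented toy, not GraphRAG's actual implementation:

```python
from collections import defaultdict

# Flat RAG: retrieve snippets that match the query directly.
snippets = {
    "aspirin": "aspirin inhibits COX enzymes",
    "COX": "COX enzymes produce prostaglandins",
    "prostaglandins": "prostaglandins mediate inflammation",
}

# Graph RAG adds relationships, so entities connected to a match
# are surfaced even when they never co-occur with the query terms.
graph = defaultdict(set)
def add_edge(a: str, b: str) -> None:
    graph[a].add(b)
    graph[b].add(a)

add_edge("aspirin", "COX")
add_edge("COX", "prostaglandins")

def flat_retrieve(query: str) -> set[str]:
    return {k for k in snippets if k in query}

def graph_retrieve(query: str, hops: int = 2) -> set[str]:
    frontier = flat_retrieve(query)
    found = set(frontier)
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph[e]} - found
        found |= frontier
    return found
```

A query about aspirin and inflammation matches only "aspirin" in the flat store; the graph walk also pulls in COX and prostaglandins, the chain that actually answers the question.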

The memory layer provides a third line of defence. When the cognitive engine hits uncertainty – a path not converging, a tool unavailable, a result that contradicts expectations – it queries the memory store rather than generating a speculative answer. Allen’s team observed the system repeatedly catching its own potential errors and regrounding in prior data. Not because they built a hallucination filter, but because the architecture gave the system somewhere more reliable to look when it was unsure.
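The control flow is the interesting part: consult memory before speculating. A hypothetical sketch – `generate` stands in for the model, the confidence floor is an invented number, and nothing here mirrors CLIO's internals:

```python
CONFIDENCE_FLOOR = 0.7  # invented threshold for illustration

def answer(question: str, memory: dict, generate) -> str:
    """generate(question) -> (text, confidence). memory maps questions
    to previously verified answers. All names are placeholders."""
    draft, confidence = generate(question)
    if confidence >= CONFIDENCE_FLOOR:
        return draft
    # Uncertain: reground in prior verified work instead of speculating.
    if question in memory:
        return memory[question]
    # No prior work either: surface the uncertainty, don't hide it.
    return f"UNVERIFIED: {draft}"
```

The point is where the memory lookup sits: at the moment of uncertainty, before a speculative answer is emitted, rather than in a filter applied afterwards.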

This is structurally different from guardrails – output filters, confidence thresholds, human review queues. Those are reactive. They catch errors after generation. A memory-grounded architecture is preventive. It reduces the conditions under which hallucination occurs by giving the system access to its own prior verified work at the exact moment it would otherwise be most likely to fabricate.


The human-in-the-loop design reflects a tension that runs through every serious enterprise AI deployment.

Allen described his first meetings with the scientists his team was building for. They made him repeat a phrase back before they would engage: “AI is a tool. Scientists do science.”

The system honours that. It surfaces relevant memories to scientists for manual review. The automatic evaluation runs in parallel. But the scientist sees what the system proposes and retains the ability to override, select, or discard.

The instinct in most enterprise AI programmes is to optimise the expert out of the process. The efficiency argument is compelling right up to the point where the system encounters the edge case a domain expert would have caught immediately – and it does not.

In pharma, that edge case is a molecular structure that is chemically valid but biologically dangerous. In legal work, an interpretation that is linguistically reasonable but jurisdictionally wrong. In financial modelling, an assumption that holds in normal conditions and fails catastrophically under stress.

The question is not how to automate the expert away. It is how to make the expert’s time more valuable – handle the labour-intensive parts that are not judgment-intensive, and leave the evaluation where it belongs: with the person who understands the domain.

I have watched this across education, financial services, and public sector transformation. The deployments that survived past pilot were, without exception, the ones that made the domain expert faster – not the ones that tried to make the domain expert unnecessary.


There is one more dimension that I think will prove significant over time: what Allen calls lab in the loop.

The cognitive engine generates chemical candidates and produces a lab protocol. That protocol is submitted to robotic systems in a physical laboratory. The robots carry out the synthesis. The results feed back into the memory store, and the cycle begins again with the system now incorporating the physical outcomes of its previous reasoning.

This closes a loop that has been open in computational science for decades. The memory architecture is what makes it practical. Without persistent, scored, retrievable memory, every experimental cycle starts from a blank slate. With it, each cycle compounds. The failed syntheses are as valuable as the successful ones, because they narrow the search space for the next iteration.
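The compounding property can be sketched as a loop in which outcomes – failures included – accumulate in memory and shrink the untried space each cycle. `run_synthesis` below is a stand-in for the robotic lab, and the batching is invented for illustration:

```python
def lab_loop(candidates: list, run_synthesis, cycles: int = 3,
             batch_size: int = 2) -> dict:
    """Each cycle submits a batch of untried candidates to the (stand-in)
    lab. Every outcome, success or failure, is written to memory, so the
    untried space only ever shrinks and no cycle repeats prior work."""
    memory = {}
    for _ in range(cycles):
        untried = [c for c in candidates if c not in memory]
        for c in untried[:batch_size]:
            memory[c] = run_synthesis(c)
    return memory
```

Without the `memory` dict, each cycle would start from the full candidate list – the blank slate the paragraph above describes.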


What stays with me is not the scale – the hundreds of millions of tokens, the 200 thought channels, the 14-day autonomous runs. It is the principle underneath: that every piece of reasoning an AI system produces has potential future value, and that the systems we build should be designed to capture, evaluate, and reuse that reasoning rather than discarding it.

Most of what we currently call AI memory is retrieval – searching through stored text for relevant passages. What Allen’s team has built is closer to institutional knowledge – a system that accumulates experience, evaluates its relevance in context, and applies it to new problems in ways that measurably improve outcomes.

The gap between those two things is where the next generation of AI infrastructure will be built. Not in larger models or faster inference, but in the persistent, structured, contextually scored memory systems that give those models something worth remembering.


The Hanselminutes episode – “A Cognition Engine for Science with Allen Stewart” – is available at hanselminutes.com/1040. The CLIO paper is at arxiv.org/abs/2508.02789. Microsoft Discovery is at azure.microsoft.com. GraphRAG is open source at microsoft.github.io/graphrag.
