How do you design memory systems for long-running AI agents?

If an AI agent runs for a long time (like a chatbot or assistant), how do you design a system so it can remember things properly?

For long-running AI agents, the simplest rule is:

Do not make the prompt the memory system.

Store memory outside the model, then only give the model the parts it needs for the current step.

A practical setup is:

  1. Store the full history, files, tool results, user preferences, decisions, and task state in a database.

  2. Before each model call, retrieve only the relevant facts and current working context.

  3. Let the model act on that small active context.

  4. After the model responds or uses a tool, write the important changes back into memory.

  5. Keep logs of what happened so the agent can recover, audit, or continue later.
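
As a rough sketch of those five steps, assuming SQLite as the database and a `call_model` function that you supply yourself (the table names, the `remember` field, and every function name here are illustrative assumptions, not part of any particular framework):

```python
import json
import sqlite3
import time


def open_store(path=":memory:"):
    """Step 1: the database, not the prompt, holds memory and logs."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS memory ("
        "id INTEGER PRIMARY KEY, topic TEXT, fact TEXT, written_at REAL)"
    )
    db.execute(
        "CREATE TABLE IF NOT EXISTS log ("
        "id INTEGER PRIMARY KEY, event TEXT, written_at REAL)"
    )
    return db


def retrieve(db, topic, limit=5):
    """Step 2: fetch only the newest facts for the current topic."""
    rows = db.execute(
        "SELECT fact FROM memory WHERE topic = ? ORDER BY id DESC LIMIT ?",
        (topic, limit),
    ).fetchall()
    return [row[0] for row in rows]


def run_step(db, topic, user_input, call_model):
    """Run one turn: retrieve, call the model, write back, log."""
    facts = retrieve(db, topic)
    prompt = "\n".join(["Known facts:", *facts, "Request: " + user_input])
    result = call_model(prompt)  # step 3: the model sees only this small context
    now = time.time()
    for fact in result.get("remember", []):  # step 4: write changes back
        db.execute(
            "INSERT INTO memory (topic, fact, written_at) VALUES (?, ?, ?)",
            (topic, fact, now),
        )
    event = {"topic": topic, "input": user_input, "reply": result["reply"]}
    db.execute(  # step 5: keep a log so the run can be audited or resumed
        "INSERT INTO log (event, written_at) VALUES (?, ?)",
        (json.dumps(event), now),
    )
    db.commit()
    return result["reply"]


def fake_model(prompt):
    """Stand-in for a real model call; returns a reply plus proposed memories."""
    return {"reply": "noted", "remember": ["user prefers short answers"]}


if __name__ == "__main__":
    db = open_store()
    run_step(db, "demo", "hello", fake_model)
    print(retrieve(db, "demo"))
```

Note that this sketch writes every proposed memory straight to the store. A real system would put a filter between the model's proposals and the database.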

Not every output should become memory. Some things should be verified first, some should be marked uncertain, and some should be ignored.

The model should be treated as the reasoning engine. The runtime should be the memory and state owner.

The concept sounds interesting. I wonder what the implementation would look like.

  1. Which part of the agent decides what to keep and what to throw away (forget or ignore)? Programmed application logic, or an LLM acting as the brain?
  2. How is the persistence part implemented? "A database" as a container is still pretty abstract. What type of database, and do you have specific product recommendations? How is the serialization done?
  3. Based on what criteria would the agent decide what persistent data records are important for the current session and context?

These are real implementation aspects I would be interested in. Can we discuss them here? Do you have references to open source examples or relevant articles?

Cheers,

Michael

Hi Michael, happy to discuss it here.

The main design rule I use is that the model is not the memory system. The runtime is.

For question 1, I would split the decision into layers.

The application should always make the final decision about what is allowed to become durable memory. The LLM can help classify or propose memory candidates, but I would not let it directly write permanent state without rules around it.

A practical flow is:

  1. Capture events from the run.

  2. Extract candidate memories or state changes.

  3. Classify them by type.

  4. Apply policy rules.

  5. Store only what is useful, verified, or needed later.

  6. Keep uncertain items marked as uncertain rather than treating them as facts.
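
A minimal sketch of steps 3 to 6, assuming the candidates have already been extracted (by an LLM or by plain code). The `Candidate` fields, the kind names, and the three decision labels are made up for illustration. The point is only that the final decision is ordinary application code:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    kind: str            # e.g. "preference", "decision", "observation", "chit_chat"
    text: str
    source: str          # "user", "tool", or "model"
    verified: bool = False


# Assumed policy: only these kinds are ever allowed to become durable.
DURABLE_KINDS = {"preference", "decision", "observation"}


def apply_policy(candidate):
    """Return 'store', 'store_uncertain', or 'ignore' for one candidate."""
    if candidate.kind not in DURABLE_KINDS:
        return "ignore"
    # Direct user statements and verified items are stored as facts.
    if candidate.source == "user" or candidate.verified:
        return "store"
    # Everything else is kept, but flagged rather than treated as fact.
    return "store_uncertain"


def triage(candidates):
    """Pair each non-ignored candidate with the policy decision."""
    decided = [(c, apply_policy(c)) for c in candidates]
    return [(c, d) for c, d in decided if d != "ignore"]
```

The LLM can fill in `kind` and `text`, but it never calls the storage layer itself.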

For question 2, persistence does not need to start fancy.

A relational database is enough for many systems. SQLite is fine for local prototypes. Postgres is a good default once the system becomes serious. You can add vector search later for retrieval, but I would not make vector storage the whole memory system.

I usually think of storage as several categories:

  1. Event log.

  2. Current task state.

  3. Durable project state.

  4. User or operator preferences.

  5. Artifacts and files.

  6. Searchable summaries.

  7. Embeddings for retrieval when useful.

Serialization can be simple JSON at first, but the important thing is to use typed records. Each record should say what it is, where it came from, when it was written, what confidence it has, and whether it is still active.
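
To make "typed records serialized as JSON" concrete, here is a small sketch using a dataclass and SQLite. The field names mirror the list in the previous paragraph, but the exact schema (a `records` table with a JSON `body` column) is just one assumed layout:

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def utc_now():
    return datetime.now(timezone.utc).isoformat()


@dataclass
class MemoryRecord:
    kind: str                 # what it is: "preference", "task_state", ...
    source: str               # where it came from: "user", "tool:search", ...
    payload: dict             # the actual content
    confidence: float = 1.0   # how much the runtime trusts it
    active: bool = True       # whether it is still in force
    written_at: str = field(default_factory=utc_now)


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS records ("
        "id INTEGER PRIMARY KEY, kind TEXT, active INTEGER, body TEXT)"
    )
    return db


def save(db, record):
    """Serialize the record to JSON; keep kind/active as columns for filtering."""
    cur = db.execute(
        "INSERT INTO records (kind, active, body) VALUES (?, ?, ?)",
        (record.kind, int(record.active), json.dumps(asdict(record))),
    )
    db.commit()
    return cur.lastrowid


def load(db, record_id):
    """Rebuild the typed record from its JSON body, or return None."""
    row = db.execute(
        "SELECT body FROM records WHERE id = ?", (record_id,)
    ).fetchone()
    return MemoryRecord(**json.loads(row[0])) if row else None
```

Keeping `kind` and `active` as real columns lets the retrieval step filter in SQL without parsing every JSON body.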

For question 3, the agent should not retrieve everything. It should retrieve based on the current objective.

Useful criteria include:

  1. Is this needed for the current task?

  2. Was it created by this project, user, or run?

  3. Is it recent enough to matter?

  4. Is it still marked active?

  5. Is it verified or only a guess?

  6. Does it conflict with newer information?

  7. Is it instruction, preference, state, history, or evidence?
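
Several of these criteria can be applied with plain filtering and sorting before any model call. The sketch below assumes each record is a dict with `project`, `key`, `text`, `active`, `verified`, and `written_at` fields (all hypothetical names). It covers scope, recency, active status, verification, and conflicts. Judging relevance to the current task would need keyword or embedding search on top, which is not shown:

```python
from datetime import datetime, timedelta, timezone


def select_context(records, project, now, max_age_days=30, limit=5):
    """Pick a small, ranked working set from a larger pool of records."""
    cutoff = now - timedelta(days=max_age_days)
    newest_by_key = {}
    for rec in records:
        # Criteria 2 and 4: right scope, still active.
        if rec["project"] != project or not rec["active"]:
            continue
        # Criterion 3: recent enough to matter.
        if rec["written_at"] < cutoff:
            continue
        # Criterion 6: when two records share a key, the newer one wins.
        prev = newest_by_key.get(rec["key"])
        if prev is None or rec["written_at"] > prev["written_at"]:
            newest_by_key[rec["key"]] = rec
    # Criterion 5: verified records first, then most recent first.
    ranked = sorted(
        newest_by_key.values(),
        key=lambda r: (r["verified"], r["written_at"]),
        reverse=True,
    )
    return ranked[:limit]
```

The 30-day window and the limit of five are arbitrary defaults for the sketch. In practice they would depend on the task and the model's context budget.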

The pattern that worked best for me is:

Persist broadly, retrieve narrowly.

Store enough that the system can recover, audit, and continue later. But before each model call, build a small active context from only the pieces needed for the next step.

That is where long-running agents become much more manageable. You stop treating the prompt as the memory container, and start treating the prompt as a temporary working view over external state.

Hi Pimpcat

Thank you very much indeed for this detailed answer and breakdown. I understand all of your points and recommendations from an engineering and architectural perspective and agree with them.

I usually understand these kinds of rules better when I can see them applied in a sample application or tutorial.

Would you have recommendations there?

Thank you very much indeed in advance.

Michael

@Pimpcat-AU Great advice - thanks.

Hi Michael,

I’ve been working on an open source project called Context Compiler that takes a similar approach to what Pimpcat described — the runtime owns authoritative state, and the model only receives a controlled working view each turn.

It maintains a small explicit state (premise + policies) outside the model. The host injects this state into the prompt before each call, and updates are handled deterministically from user directives rather than inferred by the model.
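
As a generic illustration of that pattern (this is not Context Compiler's actual API or directive syntax, neither of which is shown in this thread), a deterministic state layer might look like the sketch below. The `premise:`, `policy:`, and `drop policy:` prefixes are invented for the example:

```python
from dataclasses import dataclass, field


@dataclass
class State:
    premise: str = ""
    policies: list = field(default_factory=list)


def apply_directive(state, message):
    """Change state only on an explicit directive; return True if it changed."""
    text = message.strip()
    lower = text.lower()
    if lower.startswith("premise:"):
        state.premise = text[len("premise:"):].strip()
        return True
    if lower.startswith("drop policy:"):
        rule = text[len("drop policy:"):].strip()
        if rule in state.policies:
            state.policies.remove(rule)
            return True
        return False
    if lower.startswith("policy:"):
        rule = text[len("policy:"):].strip()
        if rule and rule not in state.policies:
            state.policies.append(rule)
            return True
        return False
    # Ordinary messages never modify state; nothing is inferred.
    return False


def render(state, user_message):
    """Build the per-turn prompt: authoritative state first, then the message."""
    lines = ["Premise: " + (state.premise or "(none)")]
    lines += ["Policy: " + rule for rule in state.policies]
    lines.append("User: " + user_message)
    return "\n".join(lines)
```

Because updates come from string matching rather than model output, the same sequence of directives always produces the same state, and a correction replaces the old value instead of piling up next to it.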

The motivation was exactly the issues you mentioned: constraint drift, corrections not sticking, and conversations accumulating contradictions instead of resolving them.

It’s not a full memory system (no retrieval layer), but a deterministic state layer that can sit alongside external storage or retrieval.

There are runnable demos comparing baseline vs compiler-mediated behavior across several models and providers. Across 7 models tested, baseline passes 26/42 scored scenarios, while the compiler-mediated path passes 42/42.

The demos are probably the most concrete way to see the pattern in practice.

In my Gradio Space I have two stores: the standard dataset (basic data used during interactions) and a Parquet file that holds reflections on that basic data and on user chats. The second store keeps improving during idle time.

Amazing source of knowledge. I read it yesterday and I still haven’t stopped searching and reading so I can build my model’s memory on my VPS.

Thanks for sharing.

The rudimentary system I started using was a multi-artifact system.

Layer 1
When I put in a prompt, it gets saved as a physical artifact. When the system responds, that is also saved as a physical artifact, like a log file.
The prompt is timestamped and labeled for the conversation, so that the two are cross-referenceable.

Layer 2
I use artifacts like project roadmaps and ‘Notes’ files.
The nested project roadmaps allow for outlining projects in a way that can be expanded. The notes are what the AI uses to track its progress and thought process. Both are saved with timestamps and unique identifiers tying them back to the conversation they are associated with.

There are other ways this system can be expanded.

But the reasoning behind this system is simple.

  1. Text files are microscopic. The only real danger is file and name organization.
  2. A comprehensive record is kept. The system can normally reference the notes file as shorthand, but if it needs more context it can go all the way back to the actual conversation.
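
A bare-bones sketch of the two layers, using only files on disk. The folder layout, file naming scheme, and function names are my own guesses for illustration, not a description of the exact setup above:

```python
from datetime import datetime, timezone
from pathlib import Path


def save_turn(root, conversation_id, turn, prompt, response):
    """Layer 1: store prompt and response as separate timestamped files."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    folder = Path(root) / conversation_id
    folder.mkdir(parents=True, exist_ok=True)
    stem = f"{turn:04d}_{stamp}"  # same stem, so the pair is cross-referenceable
    prompt_path = folder / f"{stem}_prompt.txt"
    response_path = folder / f"{stem}_response.txt"
    prompt_path.write_text(prompt, encoding="utf-8")
    response_path.write_text(response, encoding="utf-8")
    return prompt_path, response_path


def add_note(root, conversation_id, turn, note):
    """Layer 2: append a short note that points back to its turn."""
    notes = Path(root) / conversation_id / "notes.txt"
    notes.parent.mkdir(parents=True, exist_ok=True)
    with notes.open("a", encoding="utf-8") as handle:
        handle.write(f"[{conversation_id} turn {turn:04d}] {note}\n")
    return notes


def find_turn(root, conversation_id, turn):
    """Follow a note's reference back to the full prompt and response files."""
    folder = Path(root) / conversation_id
    return sorted(folder.glob(f"{turn:04d}_*"))
```

The agent reads `notes.txt` by default and only calls something like `find_turn` when the shorthand is not enough.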

But I get the impression there are a lot of different ways this type of memory issue can be addressed.

“Multi-artifact system”: that’s a fancy name for a folder with leftover files in it. You know the bots use names like that for people like me so we can have a laugh when we read it. Your comment is one of those. I had a good laugh. Thanks mate.

I’m not sure I understand what you are getting at.

You imply leftover files, as if the files are left there and forgotten.
This is a system where multiple referenceable files are stored and compared against each other.
And the idea that “multi-artifact system” is an AI term is just insulting to the entire human race.

This feels like an “AI slop” argument, which is incongruent with the intent and scope of what we are actually discussing.

A long-running conversation and a highly complex discussion with AI share a common issue: difficulty maintaining reference points if the entire conversation lives only in the context window.

However, physical files mitigate this a great deal. The point of this system is that it no longer relies solely on the context window. It has physically stored copies of both prompts and responses that it can reference.

So I’m glad you got a laugh, but your shorthand implication that these terms are just something AI uses to ‘sound cool’, and that humans have never used terms like ‘multi-artifact system’, is beyond annoying.

@Pimpcat-AU’s framing is the right one: the runtime owns memory, the model only gets a working view each turn, and the application — not the LLM — decides what becomes durable. I’d add one requirement that tends to get deferred until it’s painful. The moment you persist “full history, user preferences, decisions” outside the model, that store becomes a privacy and retention surface, not just a performance one.

Two things worth designing in from the start:

  1. Erasure — when a user asks to be forgotten, or policy requires it, can you delete that one subject’s memories and prove it? That gets hard once their data is spread across embeddings, caches, snapshots and logs; it’s much easier if each memory is keyed to a subject and individually destroyable (e.g. per-record keys you can drop) rather than living in an append-only blob.

  2. Audit — the same authority layer that decides what becomes durable should also record who or what wrote or erased each memory, and when.

For reference, this is the exact problem SAIHM (an open-source, Apache-2.0, AI-agent-agnostic memory layer) is built around — per-record erase-with-audit as a first-class operation rather than a bolt-on — so it may be a useful comparison point.
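
To show the shape of the two requirements, here is a minimal sketch in plain SQLite. It is not SAIHM's API. The table layout and function names are assumptions, and it only covers the primary store, not embeddings, caches, or backups. Per-record encryption keys would need a crypto library and are left out:

```python
import sqlite3
import time


def open_store(path=":memory:"):
    db = sqlite3.connect(path)
    db.executescript(
        """
        CREATE TABLE IF NOT EXISTS memories (
            id INTEGER PRIMARY KEY,
            subject_id TEXT NOT NULL,   -- every memory is keyed to a subject
            body TEXT NOT NULL
        );
        CREATE INDEX IF NOT EXISTS idx_memories_subject
            ON memories (subject_id);
        CREATE TABLE IF NOT EXISTS audit (
            id INTEGER PRIMARY KEY,
            action TEXT NOT NULL,       -- 'write' or 'erase'
            subject_id TEXT NOT NULL,
            actor TEXT NOT NULL,        -- who or what performed the action
            record_count INTEGER NOT NULL,
            at REAL NOT NULL
        );
        """
    )
    return db


def _audit(db, action, subject_id, actor, count):
    db.execute(
        "INSERT INTO audit (action, subject_id, actor, record_count, at) "
        "VALUES (?, ?, ?, ?, ?)",
        (action, subject_id, actor, count, time.time()),
    )


def remember(db, subject_id, body, actor):
    """Write one memory and its audit entry in the same transaction."""
    with db:
        db.execute(
            "INSERT INTO memories (subject_id, body) VALUES (?, ?)",
            (subject_id, body),
        )
        _audit(db, "write", subject_id, actor, 1)


def erase_subject(db, subject_id, actor):
    """Delete all of one subject's memories and record that it happened."""
    with db:
        cur = db.execute(
            "DELETE FROM memories WHERE subject_id = ?", (subject_id,)
        )
        removed = cur.rowcount
        _audit(db, "erase", subject_id, actor, removed)
    return removed
```

One open design question in a sketch like this is that the audit table itself keeps the subject identifier. Whether that is acceptable, or whether it should be hashed or pseudonymized, depends on your retention policy.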

@Suhebmultani, what’s the use case you’re building for? Retention looks very different for a personal assistant versus something holding other people’s data.