Production AI agents need four kinds of memory: working (current context), episodic (specific past events), semantic (extracted facts and preferences), and procedural (the agent's own instructions, learned). The dominant production pattern is tiered: a small always-in-context core + a vector-store-backed retrieval layer + an explicit forgetting policy.
The leading 2026 frameworks: LangMem (best on LangGraph), Letta (best for OS-style explicit memory management), Mem0 (best managed service for fast integration), Zep (best when temporal awareness and knowledge graphs matter).
For years the running joke about LLM agents was that they were goldfish — brilliant for one turn, blank by the next. The 2024-2025 wave of memory papers and frameworks — MemGPT, LangMem, Letta, Mem0, Zep — turned memory into a first-class component of agent design rather than an afterthought. By mid-2026, "what's your memory architecture?" is a standard question in agentic-product hiring loops at companies like Anthropic, Cursor, Lindy, and Sierra.
This guide is the practical engineering reference: the four memory types you should be designing around, the OS-style tiering pattern that's emerged as the dominant architecture, how to pick a framework, and the production gotchas that show up six months into running a memory-enabled agent.
Why even 1M-token windows don't solve memory
The most common pushback when you propose adding a memory layer is "just use a long context window." Three reasons that doesn't work:
- Cost. 1M tokens per turn is uneconomic at any production scale. Even with prompt caching, you pay for input tokens on every turn, and most of those tokens are irrelevant to the task at hand.
- Attention degradation. Models still suffer from "lost in the middle" effects. Stuffing 800k tokens of history into context doesn't guarantee the model retrieves the right fact; passing a needle-in-a-haystack benchmark is not the same as reliably using that fact during generation.
- Time scale. Agents that run for weeks or months easily accumulate more than 1M tokens of relevant history. You're going to need retrieval no matter how big the window gets.
Long contexts complement memory systems — they make the "working memory" tier bigger and cheaper to manage — but they don't replace them.
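To make the cost argument concrete, here is a back-of-the-envelope calculation. Every number in it is a hypothetical placeholder (the price per million input tokens, the 8k-token tiered context, and the daily traffic are assumptions for illustration, not vendor figures), so treat it as a template to fill in with your own values.

```python
# All numbers below are hypothetical placeholders, not real vendor prices.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, assumed


def input_cost_per_turn(context_tokens: int,
                        price_per_million: float = PRICE_PER_MILLION_INPUT_TOKENS) -> float:
    """Input-token cost of a single turn, in USD."""
    return context_tokens / 1_000_000 * price_per_million


full_window = input_cost_per_turn(1_000_000)  # stuff the whole history into context
tiered = input_cost_per_turn(8_000)           # small core + a few retrieved snippets
turns_per_day = 50_000                        # assumed traffic

print(f"full window: ${full_window * turns_per_day:,.0f}/day")
print(f"tiered:      ${tiered * turns_per_day:,.0f}/day")
```

Under these assumed inputs the gap is two orders of magnitude, which is why retrieval stays attractive even as windows grow.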
The four types of memory
The taxonomy that's stabilized in the field, borrowed loosely from cognitive science:
| Type | What it holds | Example | Lifetime |
|---|---|---|---|
| Working | The current turn's context | System prompt + recent messages | Seconds to minutes |
| Episodic | Specific past events with timestamps | "Last Tuesday we discussed the API redesign" | Days to forever |
| Semantic | Extracted facts and preferences | "User is a senior eng at a 50-person startup, prefers Python" | Forever, updated |
| Procedural | How the agent works | System prompt + learned heuristics | Updated rarely |
Working memory
Everything the model can see in this turn. The context window. Working memory is the easy one — it's just whatever you stuff into the prompt. The interesting design questions are what you put in it and how you decide.
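One simple answer to "what goes in and how you decide" is a token budget: keep the system prompt, then admit the newest messages that still fit. The sketch below assumes a crude word-count tokenizer (`rough_tokens`) and a hypothetical `Message` type; a real system would use the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    text: str


def rough_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def build_context(system_prompt: str, history: list[Message], budget: int) -> list[Message]:
    """Keep the system prompt, then add the newest messages that still fit the budget."""
    remaining = budget - rough_tokens(system_prompt)
    kept: list[Message] = []
    for msg in reversed(history):   # walk newest to oldest
        cost = rough_tokens(msg.text)
        if cost > remaining:
            break                   # stop at the first message that does not fit
        kept.append(msg)
        remaining -= cost
    kept.reverse()                  # restore chronological order
    return [Message("system", system_prompt)] + kept


history = [
    Message("user", "one two three"),
    Message("assistant", "four five"),
    Message("user", "six"),
]
context = build_context("be brief", history, budget=5)
```

With a budget of 5, the oldest message is dropped and the two newest survive alongside the system prompt.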
Episodic memory
Specific, timestamped events. "User reported a bug at 3pm Tuesday." "Last week we agreed on the React migration plan." Episodic memory is what makes an agent feel like it actually remembers your conversations rather than starting fresh. It is typically backed by a vector store (semantic search over event embeddings) or a knowledge graph.
Semantic memory
Extracted, summarized facts that have been distilled from many episodes. "The user is a parent" is semantic; "the user mentioned their daughter's birthday last week" is episodic. Semantic memory is denser, more compressed, and easier to retrieve relevantly than raw episodic history.
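The episodic/semantic split can be sketched as two record types, where a semantic fact keeps pointers back to the episodes that support it. The `distill` function here is a toy keyword rule standing in for an LLM extraction prompt, and all names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class Episode:
    """A specific, timestamped event."""
    when: datetime
    text: str


@dataclass
class SemanticFact:
    """A distilled statement plus the episodes that support it."""
    statement: str
    sources: list[Episode] = field(default_factory=list)


def distill(statement: str, keyword: str, episodes: list[Episode]) -> Optional[SemanticFact]:
    """Toy extraction rule: keep the fact if any episode mentions the keyword.
    A real system would use an LLM extraction prompt here instead."""
    supporting = [e for e in episodes if keyword.lower() in e.text.lower()]
    return SemanticFact(statement, supporting) if supporting else None


episodes = [
    Episode(datetime(2026, 3, 2), "User mentioned their daughter's birthday"),
    Episode(datetime(2026, 3, 9), "User asked about the API redesign"),
]
fact = distill("The user is a parent", "daughter", episodes)
```

Keeping the source episodes on the fact is what later makes provenance checks and deletion requests tractable.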
Procedural memory
How the agent itself works — its operating instructions, heuristics, and patterns. In 2026 this is increasingly self-edited: LangMem supports agents updating their own system instructions based on what worked and didn't. This is where agents start to feel like they're learning, not just remembering.
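A minimal sketch of self-edited procedural memory, assuming the agent appends learned heuristics to a base prompt and every edit is logged so it can be undone. This does not use LangMem's API; the class and method names are hypothetical.

```python
from typing import Optional


class ProceduralMemory:
    """Base system prompt plus append-only learned heuristics (illustrative only)."""

    def __init__(self, base_prompt: str) -> None:
        self.base_prompt = base_prompt
        self.heuristics: list[str] = []
        self.log: list[tuple[str, str]] = []  # (action, heuristic) audit trail

    def learn(self, heuristic: str) -> None:
        """Append a heuristic and record the edit."""
        self.heuristics.append(heuristic)
        self.log.append(("add", heuristic))

    def rollback(self) -> Optional[str]:
        """Undo the most recent learned heuristic, if any."""
        if not self.heuristics:
            return None
        removed = self.heuristics.pop()
        self.log.append(("rollback", removed))
        return removed

    def render(self) -> str:
        """Produce the system prompt the model actually sees."""
        return "\n".join([self.base_prompt] + [f"- {h}" for h in self.heuristics])


memory = ProceduralMemory("You are a helpful assistant.")
memory.learn("Ask before deleting files.")
memory.learn("Always answer in French.")  # suppose this one turns out to be wrong
undone = memory.rollback()
prompt = memory.render()
```

The log records rollbacks as well as additions, so a reviewer can see what the agent tried and what was reverted.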
For background on the broader AI engineering stack, our guide on how to become an AI engineer in 2026 covers the surrounding skill map.
The dominant architecture: OS-style tiered memory
The pattern that won the 2024-2026 design space, popularized by the MemGPT paper and productized by Letta: treat the LLM like an operating system.
- Core memory — always in-context, like RAM. Holds the user persona, the agent persona, and the current task. Small, hot, fast.
- Recall memory — conversation history. Scrollable, searchable, swappable in and out of context as needed.
- Archival memory — external vector store, like disk. Massive, cheap, retrieved on demand via embedding search.
What makes this elegant: the agent decides what moves between tiers. It explicitly calls memory tools (for illustration, calls along the lines of memory.insert(), memory.search(), and memory.swap()) rather than having context injected by the framework. The agent is an active participant in its own memory management. The original MemGPT paper reported 93.4% accuracy on the Deep Memory Retrieval benchmark, which was the result that showed the tiered pattern works.
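The three tiers and the "agent calls tools" idea can be sketched in a few dozen lines. Everything here is an assumption for illustration: the tool names are hypothetical rather than Letta's API, and `embed` is a bag-of-words stand-in for a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (bag-of-words counts); a real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TieredMemory:
    """Hypothetical tool surface; method names are illustrative, not a real library's API."""

    def __init__(self) -> None:
        self.core: dict[str, str] = {}  # always rendered into the prompt
        self.recall: list[str] = []     # conversation history, searchable
        self.archival: list[str] = []   # large external store

    def core_replace(self, block: str, value: str) -> str:
        self.core[block] = value
        return "ok"

    def recall_search(self, query: str) -> list[str]:
        return [m for m in self.recall if query.lower() in m.lower()]

    def archival_insert(self, text: str) -> str:
        self.archival.append(text)
        return "ok"

    def archival_search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        scored = sorted(((cosine(q, embed(d)), d) for d in self.archival),
                        key=lambda pair: pair[0], reverse=True)
        return [doc for score, doc in scored[:k] if score > 0]

    def call_tool(self, name: str, **kwargs):
        """Dispatch a tool call emitted by the model; only these names are allowed."""
        tools = {
            "core_replace": self.core_replace,
            "recall_search": self.recall_search,
            "archival_insert": self.archival_insert,
            "archival_search": self.archival_search,
        }
        return tools[name](**kwargs)


mem = TieredMemory()
mem.call_tool("core_replace", block="user", value="Senior engineer, prefers Python")
mem.call_tool("archival_insert", text="Agreed on the React migration plan")
mem.call_tool("archival_insert", text="User reported a login bug")
results = mem.call_tool("archival_search", query="migration plan", k=1)
```

The point of `call_tool` is that movement between tiers happens only when the model asks for it, which is the property that distinguishes this pattern from framework-injected context.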
Framework comparison (2026)
| Tool | Best for | Memory model | Hosting |
|---|---|---|---|
| LangMem | LangGraph teams | Episodic + semantic + procedural; background extraction | Self-host |
| Letta | Long-running agents needing OS-style control | Core + recall + archival; explicit tier tools | Self-host / cloud |
| Mem0 | Fastest path to managed memory | Semantic facts with auto-extraction | Managed service |
| Zep | Temporal queries + knowledge graphs | Knowledge graph with time-aware facts | Cloud / self-host |
LangMem
LangChain's memory SDK, built specifically for the LangGraph ecosystem. It runs a background process that extracts and stores memories from agent runs without you having to hand-instrument the agent, and it natively supports episodic, semantic, and procedural memory. Best choice if your stack is already LangGraph and you want memory to "just work."
Letta (formerly MemGPT)
The reference implementation of the OS-style memory pattern. Model-agnostic, open-source, designed for agents that need fine-grained control over what's in context vs external. Best for long-running personal-assistant agents, agentic teammates, and anywhere you want the agent itself to be a participant in memory management. Letta is the bet on the "agent as autonomous memory manager" thesis.
Mem0
The fastest way to add memory to an existing app. Managed service, clean REST/SDK interface, auto-extracts semantic memories from raw conversations. Less control than Letta or LangMem, but you ship in an afternoon. Good default for product teams adding memory to a chat app or assistant where extreme control isn't required.
Zep
The differentiator: temporal awareness. Zep builds a knowledge graph that knows when each fact was learned, when it was updated, and how it relates to other facts. If your agent needs to reason about how a user's preferences have changed over time — "you used to prefer Python, but in March you switched to Rust" — this is the framework to pick.
How to pick between them
The decision tree we recommend after watching teams ship and regret:
- Already on LangGraph? → LangMem. Don't shop further.
- Need agents that explicitly manage their own memory tiers? → Letta. This is the pattern for agents-as-teammates.
- Want to add memory in one afternoon and you don't have hard performance or compliance constraints? → Mem0. Managed services are the right default for most product teams.
- Need temporal reasoning or knowledge-graph-style facts? → Zep. The "user's preferences over time" case.
- Hard on-prem requirement or weird latency budget? → Roll your own. Pinecone/Qdrant/Postgres+pgvector for the storage, write the extraction prompts yourself. Only do this if managed options are off the table.
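The decision tree above can be written down as a small function. The `Requirements` fields are hypothetical names for the questions in the list, and checking them in the order shown (with Mem0 as the fallback when nothing else applies) is one reasonable reading of the list rather than a rule stated by it.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    uses_langgraph: bool = False
    agent_manages_tiers: bool = False
    needs_temporal_reasoning: bool = False
    hard_constraints: bool = False  # on-prem, compliance, or unusual latency budget


def pick_framework(req: Requirements) -> str:
    """Walk the questions in order and return the first match."""
    if req.uses_langgraph:
        return "LangMem"
    if req.agent_manages_tiers:
        return "Letta"
    if req.needs_temporal_reasoning:
        return "Zep"
    if req.hard_constraints:
        return "roll your own"
    return "Mem0"  # managed default when no constraint forces another choice


choice = pick_framework(Requirements(needs_temporal_reasoning=True))
```

Real projects often hit two conditions at once; in that case the ordering is a judgment call and worth revisiting for your product.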
Production gotchas (the things nobody tells you)
Memory architectures look clean in slides. They get messy in production. The patterns we've seen kill memory-backed agents six months in:
- Memory hygiene is the silent killer. Agents that aggressively write to memory accumulate bloated, contradictory, outdated facts. Retrieval quality decays. You need explicit forgetting policies (TTLs, decay scores), deduplication, and conflict resolution (which fact wins when two contradict?). Without these, your memory system degrades into noise within weeks.
- "Last updated" matters more than "match score." A fact from 18 months ago that semantically matches the query can hurt more than help. Weight retrieval by recency, not just similarity.
- Privacy is a different problem from retrieval. If you're storing user data, decide upfront which memories survive a "forget me" request and which are aggregate. The architecture you pick affects whether GDPR-style deletion is a one-line query or a multi-week migration.
- Embedding model drift. If you switch embedding models six months in, you need a plan to re-embed and re-index the whole archival store. Mixing vectors from two models in one index silently destroys search quality.
- Latency in the tool call loop. Every memory call in the agent loop adds a round-trip. Multi-tool agents that hit memory 3-5 times per turn add real latency. Cache aggressively.
- Eval is harder than for stateless models. You can't replay a memory-having agent against a benchmark by just resetting the prompt — the memory state is part of the system. Build your eval harness around session-level scenarios, not one-shot turns.
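The recency point in the list above can be sketched as a rerank step applied to vector-store hits. This assumes similarity scores in [0, 1] and an exponential decay with a 30-day half-life; both the `Hit` type and the half-life value are illustrative choices, not a recommendation from any particular framework.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    text: str
    similarity: float  # from the vector store, assumed in [0, 1]
    age_days: float    # time since the memory was last updated


def recency_weighted(hit: Hit, half_life_days: float = 30.0) -> float:
    """Similarity multiplied by an exponential decay with the given half-life."""
    decay = 0.5 ** (hit.age_days / half_life_days)
    return hit.similarity * decay


def rerank(hits: list[Hit], half_life_days: float = 30.0) -> list[Hit]:
    """Order hits by recency-weighted score, best first."""
    return sorted(hits, key=lambda h: recency_weighted(h, half_life_days), reverse=True)


hits = [
    Hit("User prefers Python", similarity=0.95, age_days=540),
    Hit("User switched to Rust", similarity=0.80, age_days=10),
]
ranked = rerank(hits)
```

Here the 18-month-old fact loses to the fresher one despite its higher raw similarity. A multiplicative decay is only one option; an additive blend of similarity and recency is another, and which works better depends on how quickly facts go stale in your domain.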
For broader thinking on production LLM systems, our production AI agents guide and LLM evaluation guide cover the surrounding territory.
A reference architecture you can copy
Here's a memory architecture that works in production for a chat-based assistant with persistent user history. Adjust to taste:
- Core memory: 2-4 small editable blocks — user persona, agent persona, current task. Always in context. Total <1k tokens.
- Working buffer: Last 20-40 messages, summarized when capacity hits a threshold. Summary moves to episodic.
- Episodic store: Vector-stored events with timestamps and metadata (channel, conversation, topic). Retrieved by semantic similarity + recency weighting.
- Semantic store: Extracted facts about the user, with provenance (which episodes generated them) and confidence scores. Updated by background process every N turns.
- Procedural memory: System prompt with append-only learned heuristics. Edited by the agent itself through a controlled tool, with a rollback log.
- Forgetting policy: Episodic memories expire after 90 days unless promoted to semantic. Semantic memories decay in confidence if not reinforced. Contradictions trigger a resolution step before write.
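The forgetting policy in the last bullet can be sketched as three small functions. The 90-day TTL comes from the list above; the 180-day confidence half-life and the "higher decayed confidence wins" resolution rule are assumptions chosen for illustration, as are all type and function names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

EPISODE_TTL = timedelta(days=90)


@dataclass
class Episode:
    text: str
    created: datetime
    promoted: bool = False  # True once distilled into a semantic fact


@dataclass
class Fact:
    key: str                # e.g. "preferred_language"
    value: str
    confidence: float
    last_reinforced: datetime


def expire_episodes(episodes: list[Episode], now: datetime) -> list[Episode]:
    """Drop unpromoted episodes older than the TTL."""
    return [e for e in episodes if e.promoted or now - e.created <= EPISODE_TTL]


def decayed_confidence(fact: Fact, now: datetime, half_life_days: float = 180.0) -> float:
    """Confidence halves every half_life_days without reinforcement (assumed rate)."""
    age_days = (now - fact.last_reinforced).total_seconds() / 86400
    return fact.confidence * 0.5 ** (age_days / half_life_days)


def write_fact(store: dict[str, Fact], new: Fact, now: datetime) -> str:
    """Resolve contradictions before writing: higher decayed confidence wins."""
    old = store.get(new.key)
    if old is None:
        store[new.key] = new
        return "inserted"
    if old.value == new.value:
        old.confidence = max(old.confidence, new.confidence)
        old.last_reinforced = now
        return "reinforced"
    if decayed_confidence(new, now) >= decayed_confidence(old, now):
        store[new.key] = new
        return "replaced"
    return "kept_old"


now = datetime(2026, 6, 1)
store: dict[str, Fact] = {}
first = write_fact(store, Fact("preferred_language", "Python", 0.9, now - timedelta(days=360)), now)
second = write_fact(store, Fact("preferred_language", "Rust", 0.6, now), now)
```

In this run the year-old Python preference has decayed below the fresh Rust observation, so the second write replaces it. Production systems may prefer to keep both with validity intervals instead of overwriting.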
In our experience, this setup costs roughly a quarter as much per turn as stuffing all history into context for a typical multi-turn assistant, and retrieval quality holds up over months of use rather than weeks.
What this means for hiring
"Memory engineer" isn't a standalone title yet, but memory expertise is increasingly explicit in AI/ML engineer job descriptions in 2026 — especially at frontier labs and agentic-product companies. If you're an engineer thinking about an AI specialization, memory architecture sits right next to RAG and evaluation as the three skills that immediately pay back in production. See current ML & AI engineering jobs and our AI tools directory for adjacent context.
The other adjacent path: AI/ML engineering roles increasingly want product judgment on top of model competence. Memory architecture is a place where engineering taste shows up immediately — the same problem can be solved with five very different stacks, and the right call depends on the product. Companies hiring for this skill specifically include Anthropic, OpenAI, Cursor, Lindy, and most of the agentic-application Y Combinator cohort over the last 18 months.