AI Agent Memory: Patterns for Giving Agents Persistence Across Sessions

Most agent demos are built for stateless interactions. The agent receives a query, processes it, returns a result, and the context window is cleared. Refresh the page, start a new session, and it’s as if nothing happened.

This works fine for one-shot tasks — translate this document, summarize this report, extract these fields. It breaks down for anything involving a relationship over time: a customer support agent that should remember the user’s product tier and previous issues, a coding assistant that knows your project’s conventions and architectural decisions, a research assistant that builds a model of the domain as it reads more documents.

Memory is what turns an agent from a fancy autocomplete into something that compounds in value with use. The patterns for implementing it are well understood, even if they’re rarely documented together.

The Four Types of Agent Memory

It helps to think about agent memory the same way cognitive scientists think about human memory, because the categories map cleanly to implementation strategies.

Working memory is what the agent is currently thinking about. In practice, this is the context window. It holds the current conversation, any retrieved documents, tool call results, and system instructions. It’s fast, immediately accessible, and temporary.

Episodic memory is the record of past experiences. For an agent, this means past conversations, past actions taken, past observations. It’s indexed by recency and often by relevance to the current task.

Semantic memory is accumulated knowledge — facts, concepts, relationships. An agent’s semantic memory might hold: this user’s preferred programming language is TypeScript, their project uses PostgreSQL not MySQL, they prefer short answers to long explanations.

Procedural memory is knowledge of how to do things. Prompts, tool descriptions, and retrieved examples of past successful approaches all feed this.

In agent systems, you typically implement working memory implicitly (it’s just the context), and episodic and semantic memory explicitly via external storage. Procedural memory is often handled through RAG over documentation or example retrieval.

Working Memory: What Goes in the Context

The context window is your working memory. For most current models it’s between 128K and 1M tokens, which sounds large until you’re feeding it conversation history, retrieved documents, tool schemas, and system instructions simultaneously.

The decisions to make:

Full history vs. sliding window. Including all prior turns in the conversation is simple. For long conversations, it wastes tokens on turns that are no longer relevant. A sliding window (last N turns) is cheap and often good enough.
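A sliding window can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name sliding_window and the choice to always keep system messages are assumptions, and it expects the same role/content message dicts used elsewhere in this article.

```python
def sliding_window(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep every system message plus the last `max_turns` non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    if max_turns <= 0:
        # Guard: dialogue[-0:] would return the whole list, not an empty one.
        return system
    return system + dialogue[-max_turns:]
```

Keeping system messages outside the window means instructions never fall out of context, however long the conversation runs.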

Summarization. When history exceeds a threshold, summarize older turns and replace them with the summary. The agent gets compressed context instead of truncated context.

async def build_context_with_summarization(
    messages: list[dict],
    token_limit: int = 8000,
    preserve_turns: int = 6,
) -> list[dict]:
    """Summarize old messages to keep context within token budget."""
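    # Rough estimate (about 4 characters per token); swap in a real tokenizer for accuracy.
    estimated_tokens = sum(len(m["content"]) for m in messages) // 4
    if estimated_tokens <= token_limit:
        return messages  # Still within budget, nothing to compress

    recent = messages[-preserve_turns:]
    older = messages[:-preserve_turns]

    if not older:
        return recent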

    # Summarize the older portion
    summary_prompt = [
        {
            "role": "user",
            "content": f"Summarize this conversation history concisely:\n\n"
                       + "\n".join(f"{m['role']}: {m['content']}" for m in older),
        }
    ]
    summary = await call_model(summary_prompt)

    return [
        {"role": "system", "content": f"Earlier conversation summary: {summary}"},
        *recent,
    ]

What to include besides messages. System instructions, retrieved facts about the user, the current date and any time-sensitive context. Put what you know about the user at the top of the system prompt so it frames everything below.
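One way to assemble those pieces, as a sketch under assumptions: build_system_prompt and its parameters are hypothetical names, and the section order (user facts first, then the date, then instructions) is one reading of the advice above rather than a fixed rule.

```python
from datetime import date
from typing import Optional


def build_system_prompt(
    instructions: str,
    user_facts: dict,
    today: Optional[date] = None,
) -> str:
    """Put user facts first, then the current date, then the base instructions."""
    today = today or date.today()
    sections = []
    if user_facts:
        facts = "\n".join(f"- {key}: {value}" for key, value in user_facts.items())
        sections.append(f"What is known about this user:\n{facts}")
    sections.append(f"Current date: {today.isoformat()}")
    sections.append(instructions)
    return "\n\n".join(sections)
```

Passing the date in explicitly keeps the function deterministic in tests.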

Episodic Memory: Searching Past Conversations

For agents that have ongoing relationships with users, episodic memory means being able to retrieve relevant past conversations and include them in the current context.

The standard implementation: embed conversation summaries and store them in a vector database. When a new conversation starts, retrieve the most semantically relevant past summaries.

from openai import OpenAI
import psycopg2
from pgvector.psycopg2 import register_vector

client = OpenAI()

def embed(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding


def save_conversation_summary(
    user_id: str,
    summary: str,
    conn: psycopg2.extensions.connection,
) -> None:
    embedding = embed(summary)
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO conversation_memories (user_id, summary, embedding, created_at)
            VALUES (%s, %s, %s, NOW())
            """,
            (user_id, summary, embedding),
        )
    conn.commit()


def retrieve_relevant_memories(
    user_id: str,
    current_query: str,
    conn: psycopg2.extensions.connection,
    limit: int = 3,
) -> list[str]:
    query_embedding = embed(current_query)
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT summary
            FROM conversation_memories
            WHERE user_id = %s
            ORDER BY embedding <-> %s::vector  -- cast needed: psycopg2 sends a Python list as an array
            LIMIT %s
            """,
            (user_id, query_embedding, limit),
        )
        rows = cur.fetchall()
    return [row[0] for row in rows]

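The <-> operator is pgvector’s Euclidean (L2) distance; cosine distance is <=>. OpenAI embeddings are normalized to unit length, so both operators rank results identically here. The query returns the summaries most similar to the current conversation starter, ranked by semantic relevance rather than recency.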
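The functions above read and write a conversation_memories table that is never defined in this article. A plausible schema is sketched below. The column names come from the INSERT and SELECT statements; the column types, the id column, the index, and the ensure_schema helper are assumptions. The 1536 dimension matches the default output size of text-embedding-3-small.

```python
# Assumed schema for the queries above. Only the column names are taken from the
# article; types, the id column, and the index are illustrative guesses.
SCHEMA_STATEMENTS = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """
    CREATE TABLE IF NOT EXISTS conversation_memories (
        id BIGSERIAL PRIMARY KEY,
        user_id TEXT NOT NULL,
        summary TEXT NOT NULL,
        embedding vector(1536) NOT NULL,  -- default size of text-embedding-3-small
        created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
    )
    """,
    "CREATE INDEX IF NOT EXISTS conversation_memories_user_idx "
    "ON conversation_memories (user_id)",
]


def ensure_schema(conn) -> None:
    """Run the DDL on any DB-API connection whose cursors are context managers (psycopg2's are)."""
    with conn.cursor() as cur:
        for statement in SCHEMA_STATEMENTS:
            cur.execute(statement)
    conn.commit()
```

The user_id index matters because every retrieval filters by user before ranking by distance.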

When building the context for a new session:

memories = retrieve_relevant_memories(user_id, user_message, conn)

memory_block = "\n".join(f"- {m}" for m in memories)
system_prompt = f"""You are a helpful assistant with memory of past conversations.

Relevant context from past conversations:
{memory_block}

Use this context to provide continuity. Don't reference it explicitly unless asked."""

The agent doesn’t need to know the implementation — just that the system prompt includes relevant history.

Semantic Memory: Extracting and Storing Facts

Episodic memory retrieves past conversations. Semantic memory extracts structured facts from those conversations and stores them in a form that’s easy to retrieve precisely.

The key difference: episodic memory stores “on March 5th, the user said they were migrating to Kubernetes” as a summary blob. Semantic memory extracts the fact user.infrastructure = "Kubernetes (migration in progress)" and stores it as a structured record.

Extraction can happen at the end of each conversation, or continuously as facts are established:

import json


async def extract_facts(conversation: list[dict]) -> dict:
    """Extract structured facts from a conversation to update user profile."""
    extraction_prompt = """From this conversation, extract any new facts about the user's:
- Technical preferences (languages, frameworks, tools)
- Project details (what they're building, constraints)
- Work context (role, team size, company type)
- Communication preferences (preferred answer length, depth)

Return a JSON object with only the facts that were newly established.
If nothing new was established, return {}."""

    response = await call_model([
        {"role": "system", "content": extraction_prompt},
        {"role": "user", "content": str(conversation)},
    ], response_format={"type": "json_object"})

    return json.loads(response)


async def update_user_profile(user_id: str, new_facts: dict, db) -> None:
    """Merge new facts into the user's persistent profile."""
    current = await db.get_user_profile(user_id) or {}
    merged = {**current, **new_facts}
    await db.set_user_profile(user_id, merged)

The stored profile becomes part of the system prompt:

profile = await db.get_user_profile(user_id)
if profile:
    profile_text = "\n".join(f"- {k}: {v}" for k, v in profile.items())
    system_prompt += f"\n\nWhat I know about this user:\n{profile_text}"

The agent now “remembers” preferences across sessions without any explicit retrieval step — the facts are small enough to include directly in the system prompt.
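A plain dictionary merge overwrites old values without recording when a fact changed. If the profile will later need decay or invalidation, one option is to stamp each fact as it is written. The sketch below is an assumption, not part of the code above: merge_facts and the value/updated_at record shape are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def merge_facts(current: dict, new_facts: dict, now: Optional[datetime] = None) -> dict:
    """Merge new facts into a profile, stamping each added or changed fact with a timestamp.

    Assumed profile shape: {key: {"value": ..., "updated_at": ISO-8601 string}}.
    """
    now = now or datetime.now(timezone.utc)
    merged = dict(current)  # shallow copy; entries are replaced, never mutated
    for key, value in new_facts.items():
        previous = merged.get(key)
        if previous is None or previous["value"] != value:
            merged[key] = {"value": value, "updated_at": now.isoformat()}
    return merged
```

Facts that are restated unchanged keep their original timestamp, so the age of a fact reflects when it was last actually updated.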

mem0: A Library That Handles the Pattern

If building the extraction and storage layer from scratch sounds like a lot of plumbing, mem0 is a Python/TypeScript library that implements the episodic and semantic memory patterns behind a simple API.

from mem0 import Memory

m = Memory()

# Add a memory (e.g., at the end of a session)
m.add("I prefer TypeScript over JavaScript for all new projects", user_id="alice")

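# Retrieve relevant memories at the start of a new session
relevant = m.search("What should I use for the new API?", user_id="alice")
# Older releases return a list: [{"memory": "I prefer TypeScript over JavaScript...", "score": 0.92}]
# Newer releases wrap it as {"results": [...]}. Check the version you have installed.
hits = relevant["results"] if isinstance(relevant, dict) else relevant

# The retrieved memories go into your system prompt
context = "\n".join(r["memory"] for r in hits)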

mem0 handles embedding, storage (supports Qdrant, Chroma, Pinecone, Postgres), and retrieval. It also does the extraction step — it processes raw text and extracts the memorable facts automatically.

The tradeoff: it’s an external dependency and abstracts away the storage so you have less control over the schema. For teams that want to ship quickly and don’t have unusual requirements, it’s faster than building the stack yourself.

The Compaction Problem

Long-running agents accumulate memory. At some point, the retrieved memories become noisy — they include outdated information, contradictory facts, and low-relevance details that clutter the context.

Three strategies for managing memory over time:

Decay by recency. Weight memories by how recently they were formed. If the memory that a user prefers React dates from 2024 and the same user is building Vue projects in 2026, the newer information should dominate.

-- Include a recency factor in retrieval scoring.
-- <=> is pgvector's cosine distance, so 1 - distance is cosine similarity.
SELECT summary, (1 - (embedding <=> $1)) * (1 / (1 + EXTRACT(EPOCH FROM NOW() - created_at) / 86400 / 30)) AS score
FROM memories WHERE user_id = $2
ORDER BY score DESC LIMIT 5;

Consolidation. Periodically compress overlapping or redundant memories into a single summary. A batch job that runs weekly or monthly, not on every request.
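The first step of such a batch job is finding which memories overlap. A minimal sketch, assuming each memory is a (text, embedding) pair already loaded into Python: group_redundant and its 0.9 threshold are hypothetical, and the greedy grouping is a simple stand-in for proper clustering.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_redundant(memories: list, threshold: float = 0.9) -> list:
    """Greedy grouping: each memory joins the first group whose seed embedding
    is at least `threshold` similar, otherwise it starts a new group."""
    groups = []  # list of (seed_embedding, member_texts)
    for text, embedding in memories:
        for seed, members in groups:
            if cosine_similarity(seed, embedding) >= threshold:
                members.append(text)
                break
        else:
            groups.append((embedding, [text]))
    return [members for _, members in groups]
```

Each group with more than one member is a candidate for a summarization call that replaces the members with a single consolidated memory.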

Explicit invalidation. When a fact is superseded — the user changed jobs, the project stack changed — mark the old memory as invalid rather than waiting for it to age out. This requires the agent to recognize when new information contradicts old memory and trigger an update.
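A simple way to model invalidation is an append-only log where superseded facts are marked rather than deleted. The sketch below is in-memory and illustrative: Fact, FactLog, and the exact-match contradiction check are assumptions. A real system would need the model, or a comparison step, to decide when two differently worded facts conflict.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Fact:
    key: str
    value: str
    created_at: datetime
    invalidated_at: Optional[datetime] = None


@dataclass
class FactLog:
    facts: list = field(default_factory=list)

    def record(self, key: str, value: str, now: Optional[datetime] = None) -> None:
        """Store a fact, marking any active fact with the same key but a different value as invalid."""
        now = now or datetime.now(timezone.utc)
        for fact in self.facts:
            if fact.key == key and fact.invalidated_at is None:
                if fact.value == value:
                    return  # unchanged, nothing to do
                fact.invalidated_at = now
        self.facts.append(Fact(key, value, now))

    def active(self) -> dict:
        """Return only the facts that have not been invalidated."""
        return {f.key: f.value for f in self.facts if f.invalidated_at is None}
```

Keeping invalidated facts around preserves history ("the user used to work at Acme") while keeping them out of the prompt by default.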

Pitfalls

Stale facts. An agent that “knows” a user’s job title from 18 months ago will state it confidently if you don’t have invalidation or decay. Build either a TTL on sensitive facts or a way for the agent to notice and update contradictions.

Over-retrieval. Retrieving 10 memories and stuffing them all in the context adds noise. For most applications, 2-3 well-chosen episodic memories plus a compact semantic profile works better than more.
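One way to enforce that limit is to filter retrieved candidates by a minimum score before taking the top few. This is a sketch: select_memories is a hypothetical helper, and the 0.75 cutoff is an arbitrary starting point that needs tuning against your own embeddings.

```python
def select_memories(
    scored: list,
    max_items: int = 3,
    min_score: float = 0.75,
) -> list:
    """From (text, score) pairs, keep at most `max_items` texts scoring at least `min_score`."""
    kept = [(text, score) for text, score in scored if score >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [text for text, _ in kept[:max_items]]
```

Returning fewer than max_items, or nothing at all, is the intended behavior when no memory is a strong match.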

Trust and privacy. Memories persist. If a user says something they later want forgotten, the agent remembers it until you provide a deletion mechanism. GDPR and similar regulations treat user data stored in AI memory systems the same as other personal data. Build delete-by-user-id into the storage layer from the start.
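A deletion path can be as small as one function that clears every table holding user memory. The sketch below uses sqlite3 from the standard library as a stand-in so it runs anywhere; with the Postgres setup shown earlier, the same DELETE would use %s placeholders. The forget_user helper and the user_profiles table name are assumptions.

```python
import sqlite3

# Every table that stores per-user memory. user_profiles is a hypothetical name.
MEMORY_TABLES = ("conversation_memories", "user_profiles")


def forget_user(conn: sqlite3.Connection, user_id: str) -> int:
    """Delete all stored memory for one user and return the number of rows removed."""
    deleted = 0
    with conn:  # commits on success, rolls back if any DELETE fails
        for table in MEMORY_TABLES:  # fixed names above, never user input
            cur = conn.execute(f"DELETE FROM {table} WHERE user_id = ?", (user_id,))
            deleted += cur.rowcount
    return deleted
```

Running all deletes in one transaction avoids a half-forgotten user if one table fails. Vector indexes, caches, and backups need their own deletion path.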

Hallucinated memories. If you ask the model to generate memories to store without tight structured extraction, it will sometimes fabricate plausible-sounding facts. Use structured output with schema validation for memory extraction, and have the agent confirm important facts (“I have you down as preferring Python for scripting — is that still accurate?”) periodically.
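Schema validation can be done with the standard library alone. In this sketch the allowed field names mirror the four categories in the extraction prompt shown earlier, but the exact keys, the string-only rule, and the validate_extracted_facts helper are assumptions. Validation narrows what can be stored; it cannot tell whether a well-formed fact is true, which is why the confirmation step still matters.

```python
import json

# Assumed field names, loosely following the categories in the extraction prompt.
ALLOWED_FIELDS = {
    "technical_preferences": str,
    "project_details": str,
    "work_context": str,
    "communication_preferences": str,
}


def validate_extracted_facts(raw: str) -> dict:
    """Parse model output and keep only known fields with non-empty values of the expected type."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {
        key: value.strip()
        for key, value in data.items()
        if key in ALLOWED_FIELDS
        and isinstance(value, ALLOWED_FIELDS[key])
        and value.strip()
    }
```

Unknown keys are dropped rather than raising an error, so one stray field does not discard an otherwise useful extraction.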

What to Build First

If you’re adding memory to an agent for the first time:

Start with semantic memory: a user profile stored as structured key-value pairs that goes in the system prompt. This delivers most of the value with minimal complexity. The agent knows the user’s preferences and context without any retrieval step.

Add episodic memory when users start having repeat sessions and your token budget allows it. Use a vector store to retrieve 2-3 relevant past conversations and include them as additional context.

Add compaction logic when you have evidence that memory quality is degrading over time, not before. Premature optimization of memory management adds complexity without measurable benefit.

The memory layer is rarely what makes or breaks an agent product. What makes it work is the orchestration around memory: deciding when to read, when to write, what to store, and what to let expire. Get the basics working and adjust from there.
