Agent Memory Types: Short-Term, Long-Term, and Working Memory for AI

Your AI coding assistant helps a developer refactor a payment service over three days. Day 1: they discuss the architecture, agree on a microservices split, and choose gRPC for inter-service communication. Day 2: the developer asks “continue where we left off.” The assistant has no idea what they discussed yesterday. Day 3: the developer says “use the same pattern we agreed on.” The assistant asks “what pattern?” - destroying the illusion of a capable partner.

The context window is not memory in the sense users expect. It is working memory: temporary, limited, and gone when the session ends. Real memory persists across sessions, surfaces relevant information when needed, and quietly forgets what is no longer useful. Building agents that remember requires explicit memory architecture, not just longer context windows.

The three types of agent memory

graph TD
  subgraph memory["Agent Memory Architecture"]
      WM["Working Memory<br/>(current context window)<br/>Active for this conversation<br/>Limited by tokens"]
      STM["Short-Term Memory<br/>(session state)<br/>Persists within a task<br/>Recent observations, plan state"]
      LTM["Long-Term Memory<br/>(persistent store)<br/>Persists across sessions<br/>User preferences, past decisions, facts learned"]
  end

  style WM fill:#FAEEDA,stroke:#854F0B,color:#633806
  style STM fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style LTM fill:#E1F5EE,stroke:#0F6E56,color:#085041

Working memory (context window)

What the model can see right now. It includes the system prompt, current conversation, retrieved context, and tool results. It is limited by the context window size and resets between API calls unless you explicitly maintain it.

Characteristics: Fast access, limited capacity, volatile (lost when context resets).

Short-term memory (session/task state)

Information that persists within a single task or session but does not need to survive long-term: the agent’s current plan, intermediate results, which tools it has already tried, and what it learned in the last few steps.

Characteristics: Persists within a session, relevant to the current task, may be discarded when the task completes.

Long-term memory (persistent store)

Information that should be available across sessions and tasks: user preferences, past decisions, learned facts about the user’s codebase, and summaries of previous conversations.

Characteristics: Persists indefinitely, accessible via retrieval, grows over time, needs curation.

Implementing each memory type

Working memory management

The context window IS your working memory. Manage it deliberately:

def count_tokens(text):
    """Rough estimate (~4 characters per token); swap in your model's tokenizer."""
    return max(1, len(text) // 4)


class WorkingMemory:
    def __init__(self, max_tokens=8000):
        self.system_prompt = ""      # Fixed for the session
        self.retrieved_context = []  # Per-query, most relevant first
        self.recent_messages = []    # Sliding window, oldest first
        self.max_tokens = max_tokens

    def compile(self):
        """Build [system prompt, retrieved context, messages] within the token budget"""
        # System prompt is always included
        budget = self.max_tokens - count_tokens(self.system_prompt)

        # Recent messages get first claim on the budget: walk newest to oldest,
        # then restore chronological order
        messages = []
        for msg in reversed(self.recent_messages):
            cost = count_tokens(msg)
            if cost > budget:
                break
            messages.append(msg)
            budget -= cost
        messages.reverse()

        # Retrieved context fills the remaining space, keeping relevance order
        context = []
        for ctx in self.retrieved_context:
            cost = count_tokens(ctx)
            if cost > budget:
                break
            context.append(ctx)
            budget -= cost

        return [self.system_prompt, *context, *messages]

Short-term memory (scratchpad)

A structured store for the agent’s current task state:

class ShortTermMemory:
    def __init__(self):
        self.plan = []          # Current step-by-step plan
        self.completed_steps = []  # What's been done
        self.observations = {}  # Key findings
        self.failed_attempts = []  # What didn't work
    
    def summarize_for_context(self):
        """Compress into tokens for working memory injection"""
        return f"""
Current task state:
- Plan: {self.plan}
- Completed: {len(self.completed_steps)} steps
- Key findings: {self.observations}
- Failed approaches: {self.failed_attempts[-3:]}
"""

Long-term memory (persistent retrieval)

A vector database + metadata store that the agent can query:

class LongTermMemory:
    def __init__(self, vector_db, user_id):
        self.vector_db = vector_db
        self.user_id = user_id
    
    def remember(self, content, metadata):
        """Store a memory"""
        embedding = embed(content)
        self.vector_db.upsert(
            id=generate_id(),
            vector=embedding,
            metadata={
                **metadata,  # Caller-supplied fields go first...
                "user_id": self.user_id,  # ...so they cannot override these
                "content": content,
                "timestamp": now(),
                "type": metadata.get("type", "observation"),
            }
        )
    
    def recall(self, query, top_k=5):
        """Retrieve relevant memories"""
        query_embedding = embed(query)
        results = self.vector_db.search(
            vector=query_embedding,
            filter={"user_id": self.user_id},
            top_k=top_k
        )
        return results
    
    def forget(self, memory_id):
        """Explicitly remove a memory (corrections, outdated info)"""
        self.vector_db.delete(memory_id)

Memory patterns for production agents

Pattern 1: Conversation summarization

After each conversation, extract and store key information:

summarize_prompt = """
Extract the following from this conversation:
1. Decisions made (with reasoning)
2. User preferences expressed
3. Facts about their system/project
4. Action items or next steps
5. Important context for future conversations

Conversation: {conversation}
"""

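# Fill in the {conversation} placeholder before sending the prompt
summary = llm.generate(summarize_prompt.format(conversation=conversation))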
long_term_memory.remember(summary, {"type": "conversation_summary", "date": today()})

Pattern 2: Episodic memory (specific events)

Store specific interactions that might be relevant later:

# After resolving a complex issue
memory = {
    "type": "episode",
    "event": "Helped user debug a race condition in their payment processing",
    "resolution": "Added distributed lock using Redis SETNX with TTL",
    "user_feedback": "positive",
    "tags": ["debugging", "concurrency", "redis", "payments"]
}
long_term_memory.remember(memory["event"] + " " + memory["resolution"], memory)

Pattern 3: Semantic memory (facts and knowledge)

Store learned facts about the user’s domain:

facts = [
    "User's project uses PostgreSQL 15 on AWS RDS",
    "Their API is built with FastAPI",
    "They prefer functional style over OOP",
    "Production traffic peaks at 2000 RPS during business hours",
]

for fact in facts:
    long_term_memory.remember(fact, {"type": "fact", "confidence": "stated_by_user"})

Pattern 4: Procedural memory (learned workflows)

Store successful approaches that worked:

import json

# After a successful task completion
procedure = {
    "task_type": "database_migration",
    "approach": "1. Create migration script 2. Test on staging 3. Schedule maintenance window 4. Execute with rollback plan",
    "worked_for": "Adding new columns to high-traffic tables",
    "caveats": "User's team requires PR review before migration execution"
}
long_term_memory.remember(json.dumps(procedure), {"type": "procedure"})

graph LR
  subgraph patterns["Memory Patterns"]
      P1["Conversation Summaries<br/>Key decisions & preferences"]
      P2["Episodic Memory<br/>Specific events & resolutions"]
      P3["Semantic Memory<br/>Facts about user's system"]
      P4["Procedural Memory<br/>Workflows that worked"]
  end

  style P1 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
  style P2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
  style P3 fill:#FAEEDA,stroke:#854F0B,color:#633806
  style P4 fill:#F1EFE8,stroke:#888780,color:#444441

Where memory breaks

Stale memories

Information stored 6 months ago may be wrong now. The user switched from PostgreSQL to MongoDB, but the agent still suggests Postgres-specific solutions because that is what is in long-term memory.

Fix: Attach timestamps and confidence scores. When retrieving memories, surface age prominently. Periodically validate facts with the user. Allow explicit corrections that update or invalidate old memories.
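
One way to make corrections explicit is to keep the old record but mark it as superseded. This is a minimal sketch under assumptions: the `MemoryRecord` fields, the 0.0-1.0 confidence scale, and the `correct` helper are all illustrative names, not part of any library.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MemoryRecord:
    content: str
    created_at: datetime
    confidence: float = 1.0              # assumed scale: 0.0 (invalid) to 1.0
    superseded_by: Optional[str] = None  # content of the correcting memory, if any


def correct(old: MemoryRecord, new_content: str, now: datetime) -> MemoryRecord:
    """Invalidate an old memory and return the record that replaces it."""
    old.superseded_by = new_content
    old.confidence = 0.0
    return MemoryRecord(content=new_content, created_at=now)


def is_active(record: MemoryRecord) -> bool:
    """Only records that have not been corrected should be retrieved."""
    return record.superseded_by is None


def age_in_days(record: MemoryRecord, now: datetime) -> int:
    """Age to surface next to the memory when it is injected."""
    return (now - record.created_at).days


now = datetime(2025, 6, 1)
old = MemoryRecord("Project database is PostgreSQL", created_at=now - timedelta(days=180))
new = correct(old, "Project database is MongoDB", now)
```

Keeping the superseded record, rather than deleting it, leaves an audit trail of what the agent used to believe.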

Memory pollution

Incorrect observations get stored and then retrieved in future conversations, causing the agent to repeat wrong information confidently.

Fix: Only store memories the user confirmed or that came from verified tool outputs. Add a feedback mechanism where the user can flag incorrect memories.
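
A write gate plus a user flag could look roughly like this. The source labels (`user_confirmed`, `verified_tool_output`, `model_inference`) and the helper names are assumptions for illustration.

```python
TRUSTED_SOURCES = {"user_confirmed", "verified_tool_output"}  # assumed labels


def should_store(candidate: dict) -> bool:
    """Accept only memories whose source is on the trusted list."""
    return candidate.get("source") in TRUSTED_SOURCES


def flag_incorrect(store: dict, memory_id: str) -> bool:
    """Mark a stored memory as flagged by the user; returns False if the id is unknown."""
    if memory_id not in store:
        return False
    store[memory_id]["flagged"] = True
    return True


def retrievable(store: dict) -> list:
    """Memories that have not been flagged as incorrect."""
    return [m for m in store.values() if not m.get("flagged")]


candidates = [
    {"id": "m1", "content": "Tests run with pytest", "source": "user_confirmed"},
    {"id": "m2", "content": "The service probably uses Kafka", "source": "model_inference"},
    {"id": "m3", "content": "CI exited with code 0", "source": "verified_tool_output"},
]

# The unverified inference (m2) never reaches the store.
store = {c["id"]: c for c in candidates if should_store(c)}

# A user later reports that m3 is wrong; it stays stored but is no longer retrieved.
flag_incorrect(store, "m3")
```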

Retrieval noise

When memory grows large, retrieval returns marginally relevant memories that confuse more than they help. The user asks about “deployment” and gets memories about deploying a completely different project from a year ago.

Fix: Scope memories by project/context. Use metadata filtering aggressively (current project, recent timeframe). Only inject retrieved memories when similarity exceeds a high threshold.
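
The three filters can be combined in a single post-retrieval step. In this sketch the hit format (`project_id`, `created_at`, `score`) and the default thresholds are assumptions; sensible values depend on your embedding model and data.

```python
from datetime import datetime, timedelta


def select_memories(hits, project_id, now, min_score=0.8, max_age_days=365, limit=5):
    """Keep hits from the current project that are recent and similar enough."""
    cutoff = now - timedelta(days=max_age_days)
    kept = [
        h for h in hits
        if h["project_id"] == project_id
        and h["created_at"] >= cutoff
        and h["score"] >= min_score
    ]
    kept.sort(key=lambda h: h["score"], reverse=True)
    return kept[:limit]


now = datetime(2025, 6, 1)
hits = [
    {"content": "Deploys go through a staging gate", "project_id": "payments",
     "score": 0.91, "created_at": now - timedelta(days=20)},
    {"content": "Deployed by copying files to a VM", "project_id": "old-blog",
     "score": 0.88, "created_at": now - timedelta(days=400)},
    {"content": "Team standup is at 10am", "project_id": "payments",
     "score": 0.42, "created_at": now - timedelta(days=5)},
]

selected = select_memories(hits, project_id="payments", now=now)
```

Here the year-old memory from another project and the weakly related standup note are both dropped, so only one memory is injected.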

Privacy and security

Stored memories may contain sensitive information. If the memory system is shared across users or accessible to other systems, this is a data leak.

Fix: Strict per-user isolation. Encrypt stored memories. Allow users to view and delete their memories. Never share memories across user boundaries.
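
Isolation is easier to enforce when every store operation requires a user ID. The class below is a toy sketch with hypothetical names; it leaves out encryption at rest, which a real deployment would add.

```python
class UserMemoryStore:
    """Toy store in which every operation is scoped to one user_id."""

    def __init__(self):
        self._by_user = {}  # user_id -> {memory_id: content}

    def add(self, user_id, memory_id, content):
        self._by_user.setdefault(user_id, {})[memory_id] = content

    def view(self, user_id):
        """Return a copy of everything stored for this user, and only this user."""
        return dict(self._by_user.get(user_id, {}))

    def delete(self, user_id, memory_id):
        """Delete one of the user's own memories; returns False if it is not theirs."""
        bucket = self._by_user.get(user_id, {})
        if memory_id not in bucket:
            return False
        del bucket[memory_id]
        return True


store = UserMemoryStore()
store.add("alice", "m1", "Staging database host is db.internal")
store.add("bob", "m2", "Prefers tabs over spaces")
```

Because there is no method that reads or deletes without a `user_id`, one user cannot reach another user's memories through this interface.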

Real-world memory implementations

ChatGPT Memory - stores user preferences and facts across conversations, retrievable by relevance
GitHub Copilot - remembers project context via codebase indexing (implicit memory through file embeddings)
Mem0 - open-source memory layer for AI agents with automatic extraction and retrieval
Zep - memory server for LLM apps with session management and long-term memory
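LangChain/LangGraph - LangChain provides the legacy ConversationBufferMemory, ConversationSummaryMemory, and VectorStoreRetrieverMemory classes; LangGraph handles persistence through checkpointers (per-thread state) and a store for cross-session memory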

How to apply in practice

Start with conversation summarization. The simplest, highest-value memory pattern. Summarize each conversation into key facts and decisions. Retrieve relevant summaries at the start of new conversations.

Separate “what the user told me” from “what I observed.” User-stated facts (“we use Kubernetes”) are high confidence. Agent-observed facts (“the test failed with error X”) are lower confidence and may be context-dependent.
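
Provenance can be recorded when the memory is created. In this sketch the `Source` enum, the starting confidence values, and `make_fact` are assumptions chosen for illustration.

```python
from enum import Enum


class Source(Enum):
    USER_STATED = "user_stated"
    AGENT_OBSERVED = "agent_observed"


# Assumed starting confidences; tune for your application.
DEFAULT_CONFIDENCE = {Source.USER_STATED: 0.9, Source.AGENT_OBSERVED: 0.6}


def make_fact(text, source, context=None):
    """Build a memory record that keeps provenance alongside the content."""
    return {
        "text": text,
        "source": source.value,
        "confidence": DEFAULT_CONFIDENCE[source],
        "context": context,  # e.g. which run or file an observation came from
    }


stated = make_fact("We use Kubernetes", Source.USER_STATED)
observed = make_fact(
    "The test failed with error X",
    Source.AGENT_OBSERVED,
    context="CI run on a feature branch",
)
```

Storing the context with an observation makes it possible to tell later whether it still applies.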

Let users see and control their memories. Provide a “memory” view where users can see what the agent remembers, correct inaccuracies, and delete sensitive information. This builds trust and catches memory pollution early.

Inject memories sparingly. Do not dump all retrieved memories into every context. Select the 3-5 most relevant to the current query. More memories = more noise = worse reasoning.

Use memory as a context signal, not a hard constraint. Prefix injected memories with “Previously, you mentioned…” rather than treating them as absolute facts. This lets the model weigh memories appropriately and ask for confirmation when memories seem outdated.
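
A small rendering helper can apply this framing consistently. The exact wording, the `render_memory` name, and the 90-day staleness cutoff are assumptions.

```python
from datetime import date


def render_memory(content: str, stored_on: date, today: date, stale_after_days: int = 90) -> str:
    """Format a memory as a hedged reminder, with its age visible to the model."""
    age = (today - stored_on).days
    line = f"Previously, you mentioned ({age} days ago): {content}"
    if age > stale_after_days:
        line += " (may be outdated; confirm before relying on it)"
    return line


today = date(2025, 6, 30)
recent = render_memory("gRPC for inter-service communication", date(2025, 6, 20), today)
stale = render_memory("PostgreSQL 15 on AWS RDS", date(2025, 1, 1), today)
```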

FAQ

Q: How is agent memory different from RAG?

RAG retrieves from a static knowledge base (documents, articles). Agent memory retrieves from a dynamic, personalized store that grows through interactions. The technical mechanism is similar (embed, store, retrieve), but the content is different: RAG serves shared knowledge, memory serves personal/contextual information. In practice, many systems combine both: RAG for general knowledge + memory for user-specific context.

Q: How long should memories persist? Should they ever be automatically deleted?

Memories should have different lifetimes based on type. Facts (“user’s company uses AWS”) persist until corrected. Episodic memories (“helped debug X yesterday”) are most useful for 1-4 weeks, then should be summarized and compressed. Procedural memories (“this workflow works for their team”) persist but with decreasing relevance weight over time. Implement memory decay: reduce retrieval weight for older memories, and periodically consolidate (summarize old memories into higher-level patterns).
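
Decay can be sketched as an exponential half-life applied to the similarity score. The half-lives below are assumptions chosen to match the rough lifetimes described above, not recommended values.

```python
# Half-life in days per memory type; None means "no decay" (assumed values).
HALF_LIFE_DAYS = {"fact": None, "episode": 14.0, "procedure": 180.0}


def decayed_score(similarity: float, age_days: float, memory_type: str) -> float:
    """Scale a similarity score down by age, using a per-type half-life.

    Types without a configured half-life (including unknown types) are not decayed.
    """
    half_life = HALF_LIFE_DAYS.get(memory_type)
    if half_life is None:
        return similarity
    return similarity * 0.5 ** (age_days / half_life)


episode = decayed_score(0.8, age_days=14, memory_type="episode")     # one half-life old
fact = decayed_score(0.8, age_days=365, memory_type="fact")          # facts do not decay
procedure = decayed_score(0.8, age_days=90, memory_type="procedure")
```

With these settings a two-week-old episode keeps half its weight, while a year-old fact ranks as if it were new until someone corrects it.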

Q: Can I use the same vector database for RAG and agent memory?

Yes, with namespace separation. Use separate collections or partitions: one for your knowledge base (RAG), one per user for their memories. This simplifies infrastructure while maintaining isolation. The embedding model can be the same, but you might want different retrieval strategies (RAG uses broader search, memory uses stricter user-scoped search with higher similarity thresholds).
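
As a sketch, namespace separation can be a naming rule plus retrieval settings per kind of collection. The collection names, `top_k` values, and thresholds here are assumptions; the collection or partition mechanism itself depends on your database.

```python
from typing import Optional

# Assumed settings: broader search for the knowledge base, stricter for memories.
RETRIEVAL_SETTINGS = {
    "knowledge_base": {"top_k": 10, "min_score": 0.5},
    "memory": {"top_k": 5, "min_score": 0.8},
}


def collection_name(kind: str, user_id: Optional[str] = None) -> str:
    """Map a kind of data to its collection; memories always need a user."""
    if kind == "knowledge_base":
        return "kb"
    if kind == "memory":
        if not user_id:
            raise ValueError("memory collections require a user_id")
        return f"memory_{user_id}"
    raise ValueError(f"unknown kind: {kind}")


kb = collection_name("knowledge_base")
alice_memories = collection_name("memory", user_id="alice")
```

Raising when `user_id` is missing keeps a memory query from falling back to a shared collection by accident.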

Interview questions

Q: Design the memory architecture for a personal AI coding assistant that helps a developer across multiple projects over months. What should it remember, how should it store it, and how should it retrieve?

Memory types:
(1) Project facts - languages, frameworks, architecture decisions, conventions per project. Stored with project_id metadata.
(2) User preferences - coding style, preferred libraries, review standards. Stored globally.
(3) Interaction history - summarized past conversations with key decisions. Stored chronologically with project context.
(4) Successful patterns - approaches that worked for this user’s codebase. Stored as procedural memories.

Storage: vector database (Qdrant/Pinecone) with metadata filtering.

Retrieval strategy: at session start, retrieve project-relevant memories (filter by current project). During conversation, retrieve memories relevant to the current query. Inject as “context from previous sessions” with timestamps.

Scope control: memories from Project A should not influence work on Project B unless explicitly relevant (shared libraries, user preferences).
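
The scope rule can be sketched as a filter applied at session start. This assumes globally stored memories carry `project_id=None`; the record fields and function name are illustrative.

```python
def memories_for_session(memories, current_project):
    """Memories for the current project plus global ones (project_id is None)."""
    return [m for m in memories if m["project_id"] in (current_project, None)]


memories = [
    {"type": "project_fact", "project_id": "project-a",
     "content": "Services communicate over gRPC"},
    {"type": "project_fact", "project_id": "project-b",
     "content": "Single monolithic web app"},
    {"type": "user_preference", "project_id": None,
     "content": "Prefers functional style over OOP"},
]

session = memories_for_session(memories, "project-a")
```

User preferences follow the developer into every project, while Project B's facts stay out of a Project A session.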

Q: Your AI agent has accumulated 10,000 memories over 6 months. Users complain it sometimes references outdated information. How do you implement memory lifecycle management?

Multi-strategy approach:
(1) Timestamp-based decay - reduce retrieval score for older memories (multiply by decay factor based on age).
(2) Confidence degradation - facts not reconfirmed in 3 months get flagged as “possibly outdated” when retrieved.
(3) Contradiction detection - when new information contradicts stored memory, update or invalidate the old memory automatically.
(4) Periodic consolidation - monthly job that summarizes old episodic memories into compact fact statements and removes the detailed episodes.
(5) User-triggered cleanup - “forget everything about project X” or “my stack has changed” commands that invalidate relevant memories.
(6) Access-based relevance - memories never retrieved in 60 days are candidates for archival.

Monitoring: track memory age distribution, retrieval hit rate, and user corrections (signal of stale memories being surfaced).
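
Two of these strategies, confidence degradation and access-based archival, fit in one periodic pass. The sketch assumes each memory tracks `last_confirmed` and `last_retrieved` timestamps; those field names and the function are hypothetical.

```python
from datetime import datetime, timedelta


def lifecycle_actions(memories, now,
                      reconfirm_after=timedelta(days=90),
                      archive_after=timedelta(days=60)):
    """Return (ids to flag as possibly outdated, ids that are archival candidates)."""
    to_flag, to_archive = [], []
    for m in memories:
        # Facts that have not been reconfirmed recently get a warning label.
        if m["type"] == "fact" and now - m["last_confirmed"] > reconfirm_after:
            to_flag.append(m["id"])
        # Anything not retrieved for a long time is a candidate for archival.
        if now - m["last_retrieved"] > archive_after:
            to_archive.append(m["id"])
    return to_flag, to_archive


now = datetime(2025, 6, 1)
memories = [
    {"id": "f1", "type": "fact",
     "last_confirmed": now - timedelta(days=120), "last_retrieved": now - timedelta(days=3)},
    {"id": "e1", "type": "episode",
     "last_confirmed": now - timedelta(days=10), "last_retrieved": now - timedelta(days=75)},
    {"id": "f2", "type": "fact",
     "last_confirmed": now - timedelta(days=10), "last_retrieved": now - timedelta(days=1)},
]

to_flag, to_archive = lifecycle_actions(memories, now)
```

The function only reports candidates; whether to label, archive, or delete them is left to the caller.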