18.06.2026

Incident Memory for AI SRE Agents

AI incident agents fail quickly when every investigation starts from a blank prompt. The alert may be new, but the useful context often is not: similar incidents, noisy services, rollback notes, owner preferences, recent deploys, and failed fixes from last week. Elastic's recent work on agentic memory with Elasticsearch is a useful signal for SRE teams building more durable AI operations workflows.

What Is Incident Memory?

Incident memory is the operational context an AI agent can carry across investigations. It is not the full chat transcript. It is the curated set of facts, summaries, decisions, evidence links, and team-specific patterns that help the next triage run start smarter.

The Elastic article separates short-term memory from long-term memory. Short-term memory is the active conversation or task state. Long-term memory lives outside the prompt, often in a search or vector store, and is retrieved only when relevant. That distinction matters in incident response because stuffing every prior outage into the prompt creates cost, latency, and confusion.

Key Features SREs Need

  • Hybrid retrieval: combine keyword search, semantic search, and metadata filters so the agent finds both exact service names and similar failure patterns.
  • Memory types: separate episodic memories (prior incidents) from semantic memories (runbook facts) and procedural memories (rules for how to act).
  • Tenant and role isolation: use access controls so agents only retrieve memories for the right team, service, environment, or customer.
  • Summarization and pruning: compress stale detail into stable lessons instead of replaying every raw chat message.
  • Audit trails: keep enough source links and timestamps for responders to verify why a memory influenced the answer.
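As a rough illustration of how several of these features could meet in a single record, here is a minimal sketch of a memory entry that carries a memory type, an isolation scope, and audit fields. Every name in it (MemoryRecord, MemoryType, visible_to, the example URL) is hypothetical and chosen for illustration; it is not taken from the Elastic article or any particular product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class MemoryType(Enum):
    EPISODIC = "episodic"      # a specific past incident
    SEMANTIC = "semantic"      # a stable fact, e.g. from a runbook
    PROCEDURAL = "procedural"  # a rule for how to act


@dataclass(frozen=True)
class MemoryRecord:
    """Hypothetical shape of one long-term memory entry."""
    memory_id: str
    memory_type: MemoryType
    team: str                      # isolation scope
    service: str
    environment: str
    summary: str                   # curated lesson, not a raw transcript
    source_links: tuple[str, ...]  # audit trail back to source systems
    created_at: datetime


def visible_to(record: MemoryRecord, allowed_teams: set[str]) -> bool:
    """Isolation check: a caller only sees memories owned by its teams."""
    return record.team in allowed_teams


record = MemoryRecord(
    memory_id="mem-001",
    memory_type=MemoryType.EPISODIC,
    team="payments",
    service="checkout-api",
    environment="production",
    summary="5xx spike traced to payment gateway timeout; rollback resolved it.",
    source_links=("https://incidents.example/inc-101",),
    created_at=datetime(2026, 6, 1, tzinfo=timezone.utc),
)
```

In a real store the isolation check would normally be enforced by the index's own access controls rather than by application code; the function here only makes the rule explicit.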

A Minimal Architecture

A practical design starts with three stores:

active incident state -> current alert, tool results, hypotheses
incident memory index -> past incidents, summaries, decisions, evidence
knowledge index -> runbooks, service ownership, architecture docs

During triage, the agent should retrieve memory with structured filters first. Service, environment, time range, severity, owner team, and incident type are often more useful than pure vector similarity. Semantic retrieval then helps catch related failure modes with different wording.
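The filter-first ordering can be sketched in a few lines. This is a toy, in-memory version under stated assumptions: the records are plain dictionaries, the field names are made up, and token overlap stands in for real vector similarity, which a search or vector store would normally provide.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens; a stand-in for embedding similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(memories, query_text, filters, top_k=3):
    # Step 1: hard metadata filters (service, environment, ...) narrow the pool.
    candidates = [
        m for m in memories
        if all(m.get(key) == value for key, value in filters.items())
    ]
    # Step 2: rank the survivors by similarity to the alert text.
    ranked = sorted(
        candidates,
        key=lambda m: token_overlap(query_text, m["summary"]),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical memory entries for illustration only.
MEMORIES = [
    {"id": "inc-101", "service": "checkout-api", "environment": "production",
     "summary": "checkout api 5xx errors after payment gateway timeout"},
    {"id": "inc-102", "service": "checkout-api", "environment": "staging",
     "summary": "checkout api 5xx errors after bad deploy"},
    {"id": "inc-103", "service": "search-api", "environment": "production",
     "summary": "search latency spike from cache eviction"},
    {"id": "inc-104", "service": "checkout-api", "environment": "production",
     "summary": "disk full on checkout worker node"},
]

results = retrieve(
    MEMORIES,
    "checkout api 5xx errors payment timeout",
    {"service": "checkout-api", "environment": "production"},
)
```

The staging incident and the search incident never reach the ranking step, however similar their wording, because the filters remove them first.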

Usage In An On-Call Workflow

Start with read-only recall. When a new alert arrives, ask the agent to find similar incidents and return a short evidence packet:

Find prior incidents involving checkout API 5xx errors, payment timeout spikes,
or failed deploy rollbacks in production. Return matching incidents, confidence,
source links, and any actions that were attempted.

The answer should not be "restart service X" by default. A better first output is a ranked list of related incidents, what was true then, what differs now, and what evidence is missing.
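One possible shape for that output is sketched below. It assumes each incident is described by a set of observed signals, which is a simplification, and all field names, signal labels, and the example URL are hypothetical. The confidence value is just signal overlap, not a calibrated probability.

```python
def build_evidence_packet(current, prior_incidents):
    """Rank prior incidents and spell out overlap, differences, and gaps."""
    now = set(current["signals"])
    packet = []
    for inc in prior_incidents:
        then = set(inc["signals"])
        union = now | then
        packet.append({
            "incident_id": inc["id"],
            # Share of signals in common; a crude proxy for relatedness.
            "confidence": round(len(now & then) / len(union), 2) if union else 0.0,
            "shared_signals": sorted(now & then),
            "differs_now": sorted(now ^ then),
            "actions_attempted": inc.get("actions_attempted", []),
            # Fields a responder would need before trusting this match.
            "missing_evidence": [
                name for name in ("source_link", "actions_attempted")
                if not inc.get(name)
            ],
        })
    packet.sort(key=lambda entry: entry["confidence"], reverse=True)
    return packet


current_alert = {"signals": ["5xx", "payment-timeout", "recent-deploy"]}
prior = [
    {"id": "inc-090", "signals": ["5xx", "disk-full"],
     "actions_attempted": [], "source_link": ""},
    {"id": "inc-101", "signals": ["5xx", "payment-timeout"],
     "actions_attempted": ["rolled back deploy"],
     "source_link": "https://incidents.example/inc-101"},
]

packet = build_evidence_packet(current_alert, prior)
```

Nothing in the packet recommends an action. It gives the responder the comparison and leaves the decision with them.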

A Local Tool To Test

If you want to experiment with the pattern outside production, agentmemory is a current open source project that exposes persistent memory to coding agents and MCP clients. It is aimed at developer workflows, but it is useful for testing memory lifecycle, retrieval, and inspection habits before you wire similar ideas into incident tooling.

npm install -g @agentmemory/agentmemory
agentmemory

For SRE use, keep this as a lab exercise. Production incident memory needs stronger controls around retention, ownership, tenant isolation, and source-of-truth links than a personal coding-agent memory server usually provides.

Operational Tips

Do not store secrets, raw customer data, or private chat noise as memory. Index references and summaries where possible. Keep source systems as the record of truth.

Treat memory writes as a controlled workflow. A post-incident summary, reviewed action item, or responder-approved lesson is safer than letting the agent remember every intermediate guess. That reduces context poisoning and prevents one hallucinated hypothesis from becoming future "truth."
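A write gate along these lines might look like the following sketch. The kinds, field names, and rejection messages are assumptions made for illustration; a real workflow would likely tie approval to the incident review process rather than a single field.

```python
from dataclasses import dataclass
from typing import Optional

# Only reviewed artifacts may become long-term memory (hypothetical kinds).
ALLOWED_KINDS = {"post_incident_summary", "action_item", "lesson"}


@dataclass
class MemoryCandidate:
    text: str
    kind: str                          # e.g. "lesson" or "hypothesis"
    approved_by: Optional[str] = None  # responder who signed off
    source_link: Optional[str] = None  # where the claim can be verified


def accept_write(candidate: MemoryCandidate) -> tuple[bool, str]:
    """Return (accepted, reason). Intermediate guesses are rejected."""
    if candidate.kind not in ALLOWED_KINDS:
        return False, "kind not allowed"
    if not candidate.approved_by:
        return False, "needs responder approval"
    if not candidate.source_link:
        return False, "needs source link"
    return True, "ok"


guess = MemoryCandidate(text="Maybe the cache is cold?", kind="hypothesis")
lesson = MemoryCandidate(
    text="Gateway timeouts precede checkout 5xx spikes; check gateway first.",
    kind="lesson",
    approved_by="oncall-alice",
    source_link="https://incidents.example/inc-101",
)
```

Here accept_write(guess) is rejected because hypotheses are not an allowed kind, while accept_write(lesson) passes all three checks.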

Measure retrieval quality with real incidents. Pick 20 past alerts, hide the known cause, and check whether memory recall returns useful context without leaking unrelated teams or customers. Recall without isolation is not production-ready.
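A small harness for that check could be sketched as follows. It assumes a retrieval function that takes an alert and the requesting team and returns records with an id and an owning team; that interface, the metric names, and the canned data are all hypothetical.

```python
def evaluate(cases, retrieve_fn, k=5):
    """Score recall usefulness and isolation over a set of replayed alerts."""
    if not cases:
        return {"hit_rate": 0.0, "leak_rate": 0.0}
    hits = leaks = 0
    for case in cases:
        results = retrieve_fn(case["alert"], case["team"])[:k]
        # Hit: at least one known-relevant incident appears in the top k.
        if any(r["id"] in case["relevant_ids"] for r in results):
            hits += 1
        # Leak: any result belongs to a team other than the requester's.
        if any(r["team"] != case["team"] for r in results):
            leaks += 1
    return {"hit_rate": hits / len(cases), "leak_rate": leaks / len(cases)}


def fake_retrieve(alert, team):
    """Canned stand-in for a real retriever; ignores the team on purpose."""
    canned = {
        "checkout 5xx": [{"id": "inc-101", "team": "payments"}],
        "search latency": [{"id": "inc-777", "team": "payments"}],
    }
    return canned.get(alert, [])


CASES = [
    {"alert": "checkout 5xx", "team": "payments", "relevant_ids": {"inc-101"}},
    {"alert": "search latency", "team": "search", "relevant_ids": {"inc-103"}},
]

report = evaluate(CASES, fake_retrieve)
```

Tracking the leak rate next to the hit rate keeps the isolation requirement visible: a retriever that scores well on hits but shows any leaks still fails the test.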

Conclusion

Persistent memory is one of the missing pieces for useful AI SRE agents. The goal is not nostalgia for old incidents. The goal is faster evidence gathering, fewer repeated mistakes, and cleaner handoffs between humans and automation.

For platform teams, Elasticsearch-style memory is interesting because it uses infrastructure patterns they already understand: indexes, filters, roles, retention, and auditability. That is the right direction for incident automation.

Akmatori helps SRE teams automate alert triage, collect operational context, and coordinate reliable incident response. Pair Akmatori with infrastructure from Gcore when you need resilient global delivery, edge networking, and cloud capacity for production systems.
