30.03.2026

MCP Gateways for Production AI Ops


Model Context Protocol, or MCP, is quickly becoming the default way to connect AI agents to tools, APIs, files, and internal systems. That is good news for builders because it standardizes integration work. It is also a wake-up call for platform teams because every new MCP server is another path into production data and operational workflows.

A local MCP setup is easy. A production MCP setup is not.

Once you move beyond a single developer laptop, the questions change fast:

  • Which agents are allowed to call which tools?
  • How do you prevent prompt-driven access to sensitive systems?
  • Where do you enforce timeouts, quotas, and approval rules?
  • How do you audit tool use during incidents or postmortems?
  • How do you keep one noisy team or runaway agent from overwhelming shared infrastructure?

This is where an MCP gateway becomes useful.

Why MCP is getting traction

MCP adoption has accelerated because agent platforms need a common way to expose external capabilities. Instead of building one-off integrations for every model runtime, teams can wrap a system once as an MCP server and make it available to multiple clients.

For DevOps and SRE teams, that opens practical workflows:

  • Querying observability systems
  • Reading runbooks and internal docs
  • Inspecting CI/CD status
  • Looking up cloud inventory
  • Triggering safe automation steps

The upside is speed. The downside is surface area. Every tool connection is a new trust boundary.

The problem with direct MCP connections

If agents connect directly to a growing pile of MCP servers, governance becomes fragmented. Each server may implement authentication differently. Some may have weak logging. Others may expose more than they should because they were built for convenience instead of production safety.

That creates familiar platform risks:

1. Inconsistent auth

One server trusts environment variables. Another trusts a local token. A third has no real identity model at all. That makes it hard to map tool activity back to a user, workload, or service account.

2. Policy drift

You may want one team to query dashboards but not restart workloads. You may want an agent to read secrets metadata but never secret values. Enforcing these rules separately on every MCP server gets messy quickly.

3. Missing audit trails

When an agent suggests a bad change or touches a sensitive tool during an incident, you need a clean record of what happened. Direct peer-to-peer connections often leave you with partial logs spread across multiple systems.

4. Weak isolation

Without central controls, a prompt injection or misconfigured agent can chain together tools in ways you did not expect. Even read-only tools can leak topology, credentials metadata, or internal URLs that help a later attack.

What an MCP gateway should do

An MCP gateway sits between agent runtimes and backend MCP servers. Think of it as the policy and traffic layer for agent tool access.

A good production gateway should provide the following:

Central authentication

Agents should authenticate to the gateway, not directly to every tool backend. The gateway can then map requests to team identities, service accounts, or workload principals.
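As a minimal sketch of that mapping, the gateway can resolve each agent's bearer token to a principal before any policy runs. The token values and `Principal` shape here are illustrative; in production the registry would be backed by an identity provider, not an in-memory dict.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Principal:
    team: str
    service_account: str

# Illustrative only: a real gateway would delegate this lookup to an IdP.
TOKEN_REGISTRY = {
    "tok-oncall-7f3a": Principal(team="sre-oncall", service_account="agent-oncall"),
    "tok-ci-91bd": Principal(team="ci", service_account="agent-ci"),
}

def authenticate(bearer_token: str) -> Principal:
    """Resolve a token to a principal, failing closed on unknown tokens."""
    principal = TOKEN_REGISTRY.get(bearer_token)
    if principal is None:
        raise PermissionError("unknown agent token")
    return principal
```

Failing closed matters here: an unrecognized token should never fall through to an anonymous or default identity.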

Authorization and policy enforcement

Policies should be evaluated in one place. For example:

  • Allow get_logs for on-call agents
  • Deny kubectl_exec in production by default
  • Require approval for actions that mutate infrastructure
  • Limit certain tools to business hours or approved environments
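The rules above can be sketched as a single decision function. Tool names (`get_logs`, `kubectl_exec`) and the request shape are hypothetical; the point is that every call flows through one evaluation path that returns allow, deny, or escalate.

```python
from datetime import time

def evaluate(tool: str, env: str, role: str, now: time, mutates: bool) -> str:
    """Return 'allow', 'deny', or 'needs_approval' for one tool call."""
    if tool == "get_logs" and role == "on-call":
        return "allow"
    if tool == "kubectl_exec" and env == "production":
        return "deny"  # denied by default in production
    if mutates:
        return "needs_approval"  # infrastructure mutations are approval-gated
    if not time(9) <= now <= time(17):
        return "deny"  # remaining tools restricted to business hours
    return "allow"
```

Centralizing this means a rule change lands in one place instead of being re-implemented per MCP server.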

Audit logging

Every tool call should be logged with enough context to support incident review:

  • Which agent made the request
  • Which human or workflow initiated it
  • Which tool and arguments were used
  • Whether the action was allowed, denied, or escalated
  • How long it ran and what it returned
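One structured record per tool call, carrying the fields above, is enough to reconstruct an incident timeline. This sketch assumes a flat JSON schema with illustrative field names, not a fixed standard.

```python
import json
import time

def audit_record(agent: str, initiator: str, tool: str, args: dict,
                 decision: str, duration_ms: int, result_summary: str) -> str:
    """Serialize one tool call as a JSON audit line."""
    return json.dumps({
        "ts": time.time(),
        "agent": agent,              # which agent made the request
        "initiator": initiator,      # which human or workflow initiated it
        "tool": tool,
        "args": args,
        "decision": decision,        # allowed / denied / escalated
        "duration_ms": duration_ms,
        "result_summary": result_summary,
    }, sort_keys=True)
```

Emitting these as newline-delimited JSON keeps them greppable during an incident and easy to ship to whatever log pipeline you already run.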

Rate limiting and quotas

AI systems can be bursty. A bad prompt or loop can hammer shared backends. The gateway should apply concurrency limits, per-tenant quotas, and sane timeouts.
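A per-tenant token bucket is one simple way to absorb that burstiness. This is a minimal sketch; a real gateway would pair it with concurrency caps and per-call timeouts.

```python
import time

class TokenBucket:
    """Allow up to `burst` calls at once, refilling at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Keeping one bucket per tenant or agent means a runaway loop exhausts its own quota rather than starving everyone sharing the backend.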

Secret handling

Backend credentials should stay behind the gateway. Agents should receive scoped access, not raw long-lived secrets. This reduces blast radius and makes rotation easier.

Response filtering

Some tool responses should be redacted, summarized, or transformed before they reach the agent. This is especially important for secrets-adjacent systems, customer data, and noisy operational outputs.
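A simple form of this is pattern-based redaction applied before the response leaves the gateway. The patterns below are illustrative examples only; real filters would be tuned per tool and per data class.

```python
import re

# Illustrative patterns: an AWS-style access key ID and key=value credentials.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),
    re.compile(r"(?i)(password|token)=\S+"),
]

def redact(text: str) -> str:
    """Replace secret-looking substrings before the agent sees the response."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Redaction at the gateway also helps with the chaining risk mentioned earlier: a leaked credential in one tool's output never becomes input to the next tool call.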

MCP gateways and SRE design patterns

If you already run API gateways, service meshes, or policy engines, the architecture will feel familiar. The same production patterns apply here.

Treat MCP like ingress for tools

Do not let every runtime talk to every MCP server over ad hoc local configs. Publish approved tools through a gateway layer with standard auth, telemetry, and policy.

Separate read paths from write paths

Read-only tool access can support investigation and triage. Write access should be narrower, more observable, and often approval-gated. This split helps teams adopt agents without giving them the keys to the kingdom on day one.

Use short-lived credentials

Where possible, have the gateway mint or broker short-lived backend credentials per request or per session. Avoid distributing static tokens to every agent host.
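A brokered credential can be as simple as a random token bound to a principal, a scope, and an expiry. The shape and TTL below are illustrative assumptions, not a specific provider's API.

```python
import secrets
import time

def mint_credential(principal: str, scope: str, ttl_seconds: int = 300) -> dict:
    """Issue a short-lived, scoped credential for one session."""
    return {
        "token": secrets.token_urlsafe(24),
        "principal": principal,
        "scope": scope,                          # e.g. "read:logs"
        "expires_at": time.time() + ttl_seconds,
    }

def is_valid(cred: dict) -> bool:
    return time.time() < cred["expires_at"]
```

Because the gateway mints these, revocation and rotation happen in one place, and a credential captured from an agent host is worth minutes rather than months.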

Instrument the gateway itself

If the gateway becomes the control point, it also becomes critical infrastructure. Export metrics for latency, error rate, denied actions, backend saturation, and unusual request patterns.
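At minimum that means counting decisions and tracking latency inside the gateway itself. This in-process sketch is a stand-in for whatever metrics exporter you already use, such as a Prometheus client.

```python
from collections import defaultdict

class GatewayMetrics:
    """Minimal in-process counters for gateway health signals."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.latencies_ms = []

    def record(self, decision: str, latency_ms: float) -> None:
        self.counters[f"decision_{decision}"] += 1
        self.latencies_ms.append(latency_ms)

    def denied_ratio(self) -> float:
        """Share of calls denied; a sudden spike often signals an agent loop
        or a probing prompt injection."""
        total = sum(self.counters.values())
        return self.counters["decision_denied"] / total if total else 0.0
```

A rising denied ratio is a particularly useful alert: it surfaces misbehaving agents before they find a path that is allowed.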

A practical rollout plan

You do not need a giant platform project to get value from this model.

  1. Inventory your current MCP servers and tool wrappers.
  2. Classify them as read-only, low-risk write, or high-risk write.
  3. Put a gateway in front of the highest-risk tools first.
  4. Add policy checks and structured audit logs.
  5. Expand access gradually as teams learn which workflows are safe and useful.
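Steps 2 and 3 can be captured as a simple risk classification that determines rollout order, fronting the highest-risk tools first. The tool names here are hypothetical examples.

```python
# Illustrative inventory grouped by the three risk classes from step 2.
RISK_CLASSES = {
    "read_only": ["get_logs", "search_runbooks", "ci_status"],
    "low_risk_write": ["create_ticket", "annotate_dashboard"],
    "high_risk_write": ["kubectl_exec", "rotate_secret", "restart_service"],
}

def rollout_order() -> list[str]:
    """Order tools so the gateway fronts the riskiest ones first (step 3)."""
    order = []
    for risk in ("high_risk_write", "low_risk_write", "read_only"):
        order.extend(RISK_CLASSES[risk])
    return order
```

Even a crude classification like this forces the conversation about which tools mutate state, which is most of the value at the start.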

In practice, most teams should start with observability, ticketing, documentation, and CI status before exposing direct production mutation paths.

Why this matters now

Agent frameworks are moving fast, and MCP is becoming the connective tissue between models and real systems. That makes this a platform problem, not just a developer convenience feature.

Teams that treat MCP as just another local plugin layer will end up with scattered auth, weak visibility, and avoidable risk. Teams that treat it like production infrastructure can turn it into something much more useful: a governed automation surface for AI operations.

MCP servers make tools reachable. MCP gateways make them manageable.

If you are building AI workflows for real operations, that distinction matters.

Looking to automate infrastructure operations? Akmatori helps SRE teams reduce toil with AI agents built for real production workflows. For reliable global infrastructure, check out Gcore.

Automate incident response and prevent on-call burnout with AI-driven agents!