AI Agent Testing for Distributed Systems

Distributed systems fail in places ordinary integration tests rarely reach: partitions, crash recovery, stale leases, replayed requests, non-deterministic ordering, and rolling upgrades. That is why distributed-system-testing is worth tracking for platform and SRE teams using coding agents.
The project packages two plain SKILL.md workflows that agents such as Codex, Claude Code, Cursor, Gemini, and Copilot CLI can read and execute. One skill designs a test plan. The other executes it and produces findings. The important part is not that an AI writes more tests. It is that the workflow forces the agent to connect every scenario to a product claim, an explicit fault, a checker, and evidence that the fault actually landed.
What Is Distributed-System-Testing?
Distributed-system-testing is a small open source toolkit for agent-assisted testing of distributed and stateful systems. It asks the agent to inspect the codebase, extract the claims the system makes, generate failure hypotheses, and write a structured Markdown plan.
For consistency-critical scenarios, the plan must name an abstract model such as a register, queue, log, lock, lease, or ledger. It also defines an operation-history schema, a checker such as linearizability or serializability, nemesis landing evidence, and ambiguous outcome handling for timeouts, retries, and unknown commits.
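To make the model and checker vocabulary concrete, here is a minimal sketch of an operation history for a single register, plus a brute-force linearizability check. The `Op` fields and the `is_linearizable` function are illustrative assumptions, not the schema or checker the project ships. The sketch covers only operations with known outcomes and is practical only for very small histories.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Op:
    """One completed client operation on a register (hypothetical schema)."""
    client: str
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    invoke: float          # when the client sent the request
    complete: float        # when the client saw the response


def is_linearizable(history: list[Op], initial: Optional[int] = None) -> bool:
    """Search for a total order that respects real time and in which
    every read returns the most recent preceding write."""
    for order in permutations(history):
        # Real-time rule: an op that completed before another was invoked
        # must not be placed after it.
        respects_time = all(
            not (later.complete < earlier.invoke)
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        if not respects_time:
            continue
        current = initial
        consistent = True
        for op in order:
            if op.kind == "write":
                current = op.value
            elif op.value != current:
                consistent = False
                break
        if consistent:
            return True
    return False


fresh = [Op("c1", "write", 1, 0.0, 1.0), Op("c2", "read", 1, 2.0, 3.0)]
stale = [
    Op("c1", "write", 1, 0.0, 1.0),
    Op("c2", "write", 2, 2.0, 3.0),
    Op("c3", "read", 1, 4.0, 5.0),   # returns a value that was already overwritten
]
print(is_linearizable(fresh))  # True
print(is_linearizable(stale))  # False
```

The factorial search is only there to keep the definition visible. A real run would hand the recorded history to a purpose-built checker.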
That structure matters because chaos without an oracle can produce false confidence. A partition script running cleanly is not the same as proving that acknowledged writes survived the partition.
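A small sketch of what such an oracle could look like for a durability claim: compare what clients were told with what the system holds after the partition heals. The outcome labels (`"ok"`, `"fail"`, `"unknown"`) and the function name are assumptions made for illustration. Writes that timed out are treated as ambiguous, so they may or may not appear in the final state without counting as a violation.

```python
def check_acknowledged_writes(
    outcomes: dict[str, str], final_state: set[str]
) -> dict[str, list[str]]:
    """outcomes maps a write id to "ok" (acknowledged), "fail" (definitely
    rejected), or "unknown" (timeout or connection loss)."""
    # Acknowledged writes that are missing after recovery: the durability violation.
    lost = sorted(w for w, o in outcomes.items() if o == "ok" and w not in final_state)
    # Writes the system rejected but that are present anyway.
    phantom = sorted(w for w, o in outcomes.items() if o == "fail" and w in final_state)
    # Values nobody attempted to write.
    unexpected = sorted(w for w in final_state if w not in outcomes)
    return {"lost": lost, "phantom": phantom, "unexpected": unexpected}


outcomes = {"w1": "ok", "w2": "ok", "w3": "unknown", "w4": "fail"}
after_heal = {"w1", "w3"}
result = check_acknowledged_writes(outcomes, after_heal)
print(result)  # {'lost': ['w2'], 'phantom': [], 'unexpected': []}
```

Here the partition script could exit cleanly and the run would still fail, because `w2` was acknowledged and did not survive.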
Key Features
- Claim-driven planning: scenarios start from product promises, not from convenient test setup.
- Coverage adequacy: the plan must explain why the chosen scenarios are enough to support a release decision.
- Model and checker discipline: safety, durability, ordering, isolation, and idempotency claims get explicit models and named checkers.
- Fault landing evidence: every pass cites proof that the nemesis actually fired.
- Blame classification: failures are tagged as SUT, harness, checker, or environment so follow-up work reaches the right owner.
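The fault-landing and blame rules above can be sketched as a small verdict function. The record fields, the verdict strings, and the `"inconclusive"` label are hypothetical and chosen for illustration. The only behavior taken from the list is that a pass needs nemesis evidence and a failure carries one of the four blame tags.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Blame(Enum):
    SUT = "sut"
    HARNESS = "harness"
    CHECKER = "checker"
    ENVIRONMENT = "environment"


@dataclass
class ScenarioResult:
    claim: str                 # product promise under test
    fault: str                 # fault the nemesis was asked to inject
    checker: str               # named checker applied to the history
    checker_passed: bool
    landing_evidence: list[str] = field(default_factory=list)  # e.g. log excerpts
    blame: Optional[Blame] = None


def verdict(result: ScenarioResult) -> str:
    if result.checker_passed:
        # A green checker without proof the fault fired proves nothing.
        return "pass" if result.landing_evidence else "inconclusive"
    if result.blame is None:
        return "fail (unclassified)"
    return f"fail ({result.blame.value})"


no_proof = ScenarioResult(
    claim="acknowledged writes survive a partition",
    fault="partition leader from followers",
    checker="linearizability",
    checker_passed=True,
)
print(verdict(no_proof))  # inconclusive
```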
Installation
The README points to an idempotent install flow, but the manual steps are simple. For Claude Code:
git clone https://github.com/shenli/distributed-system-testing.git \
  ~/.local/share/distributed-testing-skills
mkdir -p ~/.claude/skills
ln -snf ~/.local/share/distributed-testing-skills/skills/designing-distributed-system-tests \
  ~/.claude/skills/designing-distributed-system-tests
ln -snf ~/.local/share/distributed-testing-skills/skills/executing-distributed-system-tests \
  ~/.claude/skills/executing-distributed-system-tests
For Codex and other agents, use the same repository as a local instruction source and add the project pointer described in INSTALL.md.
Usage
Start with a scoped ask:
Design a project-wide distributed systems test plan for this codebase.
For a release branch or risky change, narrow the scope:
Design a change-scoped test plan for this PR, focused on consistency, retries, and crash recovery.
Once the plan exists, the execute skill runs scenarios, captures logs and metrics, writes per-scenario verdicts, and produces a final report. A reviewer can then decide whether the evidence is strong enough to ship.
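As a rough illustration of the last step, per-scenario verdicts could be rolled up into a short summary for the reviewer. The verdict strings, scenario names, and report layout below are made up for this sketch and do not reflect the actual report format the execute skill writes.

```python
from collections import Counter


def render_report(verdicts: dict[str, str]) -> str:
    """Summarize scenario verdicts and list everything that did not pass."""
    counts = Counter(verdicts.values())
    lines = ["Final report", ""]
    for name in sorted(counts):
        lines.append(f"{name}: {counts[name]}")
    needs_review = sorted(s for s, v in verdicts.items() if v != "pass")
    lines.append("")
    lines.append("Needs review: " + (", ".join(needs_review) if needs_review else "none"))
    return "\n".join(lines)


report = render_report({
    "partition-during-write": "pass",
    "crash-before-fsync": "fail",
    "lease-expiry-skew": "inconclusive",
})
print(report)
```

The summary deliberately stops short of a ship or no-ship call. That decision stays with the reviewer.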
Operational Tips
Use this on systems where correctness depends on state transitions: queues, schedulers, controllers, databases, workflow engines, billing ledgers, and coordination services. Keep the first plan small. One good claim, one realistic fault, and one strong checker are better than a broad chaos run with weak assertions.
Also treat the generated plan as review material, not gospel. The agent can help find coverage gaps, but humans still need to approve the claims, residual uncertainty, and release confidence.
Conclusion
Distributed-system-testing is a useful pattern for SRE teams because it turns AI agents into structured reliability assistants instead of generic test generators. It brings claims, models, checkers, and fault evidence into the same artifact, which is exactly what production release reviews need.
