Self-Healing Systems

TL;DR

AI can detect failures and trigger remediation: restart pods, scale up, failover, run scripts. The tech exists.
Research: automated remediation is rated medium impact. It works as constrained, safe automation, and it has to be reproducible and auditable. Opaque automation kills trust.
Start with low-risk, high-repeat actions. Expand only after evidence that it works.

Self-healing sounds great until the bot restarts the wrong service during a database migration. The goal is fewer pages, not more chaos. Tools like Datadog Bits offer AI SRE agents for investigation; remediation still needs human-defined tiers. If you can't explain or replay what the system did, don't let it run unsupervised.
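To make "explain or replay" concrete, here is a minimal sketch of an append-only audit trail. It assumes each automated action can be described by a trigger, an action name, a target, and its exact inputs; the class, field, and function names are hypothetical, not taken from any particular tool.

```python
import json
import time
from dataclasses import dataclass, asdict, field


@dataclass
class RemediationRecord:
    """One audit entry per automated action (field names are illustrative)."""
    trigger: str   # what the detector observed
    action: str    # what the automation did
    target: str    # which resource it acted on
    inputs: dict   # exact parameters, kept so the action can be replayed
    outcome: str = "pending"
    timestamp: float = field(default_factory=time.time)


def append_record(log: list, record: RemediationRecord) -> str:
    """Serialize the record as one JSON line and append it to the log."""
    line = json.dumps(asdict(record), sort_keys=True)
    log.append(line)
    return line


def replay_plan(log: list) -> list:
    """Rebuild the ordered (action, target, inputs) steps from the log."""
    return [(r["action"], r["target"], r["inputs"]) for r in map(json.loads, log)]
```

In practice the list would be a file or log stream that the bot can append to but not rewrite. The point is that every action leaves enough detail to answer "what did it do, to what, and why?" after the fact.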

What's Safe to Automate

Restart failed pods/containers. Low risk, high volume. Kubernetes liveness probes already restart unhealthy containers (readiness probes only take a pod out of rotation); AI can extend this to "restart if metric X degrades."
Scale up on load. Autoscaling is mature. AI can tune parameters or add custom triggers.
Circuit breaker / failover. If the primary is down, fail over to the standby. Well-defined, reversible. Good candidate.
Cache invalidation. Clear caches when data changes. Usually safe if scoped correctly.
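A "restart if metric X degrades" rule is only low risk if it is bounded. The sketch below is one way to do that, under two assumptions that are not from any specific product: the metric must breach a threshold for several consecutive samples before a restart, and after a fixed number of restarts the rule stops acting and pages a human. The class name and parameters are hypothetical.

```python
from collections import deque


class RestartRule:
    """Restart only on sustained degradation, and only a limited number of times."""

    def __init__(self, threshold: float, window: int = 3, max_restarts: int = 2):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # most recent metric samples
        self.max_restarts = max_restarts
        self.restarts = 0

    def observe(self, value: float) -> str:
        """Record one sample and return 'ok', 'restart', or 'page'."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        sustained = window_full and all(v > self.threshold for v in self.samples)
        if not sustained:
            return "ok"
        if self.restarts >= self.max_restarts:
            # Restarts are not fixing it; hand off to a human.
            return "page"
        self.restarts += 1
        self.samples.clear()  # start a fresh window after acting
        return "restart"
```

A single spike does nothing, a sustained breach triggers a restart, and a breach that survives the allowed restarts escalates instead of looping.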

What Needs Human Oversight

Database operations. Restarts, failovers, schema changes. One wrong move, data loss.
Network changes. Routing, firewall, DNS. High blast radius.
Multi-service rollbacks. "Something is wrong" might mean roll back one service—or ten. AI can suggest; humans should confirm.
First-time failures. If we've never seen this pattern, don't let AI guess. Page.

Building a Self-Healing Strategy

Tier 1: Auto-execute. Clear pattern, low risk, reversible. No human in loop.

Tier 2: Auto-propose, human approve. AI suggests action; on-call confirms. Use for medium risk or unfamiliar patterns.

Tier 3: Human-only. High risk, data-affecting, or novel failures. AI can assist with diagnosis; human executes.
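The three tiers can be written down as an explicit routing function, which also makes them easy to document and review. This is a sketch under stated assumptions: the inputs (a risk label, reversibility, whether data is touched, and how often the pattern has been seen) and the `familiar_after` cutoff are illustrative choices, not a prescribed scheme.

```python
from enum import Enum


class Tier(Enum):
    AUTO_EXECUTE = 1  # no human in the loop
    PROPOSE = 2       # AI suggests, on-call approves
    HUMAN_ONLY = 3    # AI may assist with diagnosis, human executes


def classify(risk: str, reversible: bool, touches_data: bool,
             times_seen: int, familiar_after: int = 3) -> Tier:
    """Map an action to a tier; the strictest matching rule wins."""
    # Tier 3: high risk, data at stake, or a failure never seen before.
    if risk == "high" or touches_data or times_seen == 0:
        return Tier.HUMAN_ONLY
    # Tier 2: medium risk, not reversible, or still an unfamiliar pattern.
    if risk == "medium" or not reversible or times_seen < familiar_after:
        return Tier.PROPOSE
    # Tier 1: clear pattern, low risk, reversible.
    return Tier.AUTO_EXECUTE
```

For example, `classify("low", True, False, times_seen=10)` lands in Tier 1, while the same action seen only once stays in Tier 2 until there is more evidence.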

Document your tiers. Review after each incident. "Could we have auto-remediated?" If yes, consider promoting that pattern. If no, keep it in Tier 2 or 3.

The baseline without tiered self-healing: manual incident response, a page to on-call for every failure, and no audit trail for bot actions.


Quick Check

When should SREs keep self-healing in human-only (Tier 3)?

Do This Next

List your top 5 repeat incidents from the last quarter. For each, ask: could a script have fixed it safely? If yes, draft a Tier 1 or 2 runbook.
Implement one self-heal for a well-understood case (e.g., pod restart on OOM). Run it with a kill switch for 2 weeks. Measure: did it help? Any near-misses?
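A trial like this needs two things from day one: a kill switch and counters to review at the end. Here is a minimal sketch, assuming the restart itself is supplied as a callable and the kill switch is an environment variable. The variable name `SELF_HEAL_DISABLED` and the class are hypothetical; `OOMKilled` is the termination reason Kubernetes reports for out-of-memory kills.

```python
import os

KILL_SWITCH_ENV = "SELF_HEAL_DISABLED"  # hypothetical variable name


class SelfHeal:
    """Restart pods on OOM kills unless the kill switch is set, and keep counts."""

    def __init__(self, action, env=os.environ):
        self.action = action  # callable that performs the restart
        self.env = env        # mapping consulted for the kill switch
        self.stats = {"executed": 0, "skipped": 0, "failed": 0}

    def handle(self, pod: str, reason: str) -> bool:
        """Return True only if the restart action ran without raising."""
        if reason != "OOMKilled" or self.env.get(KILL_SWITCH_ENV) == "1":
            self.stats["skipped"] += 1
            return False
        try:
            self.action(pod)
        except Exception:
            self.stats["failed"] += 1
            return False
        self.stats["executed"] += 1
        return True
```

After two weeks, `stats` gives a rough answer to "did it help?", and anything counted as failed is a near-miss worth reviewing before promoting the pattern.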