Self-Healing Systems
SRE
Auto-remediate the obvious. Escalate the ambiguous. You draw the line.
DevOps
Self-healing is powerful. It can also make things worse. Start narrow.
TL;DR
- AI can detect failures and trigger remediation: restart pods, scale up, failover, run scripts. The tech exists.
- Research rates automated remediation as medium impact: it pays off when the automation is constrained and safe, and it requires reproducibility and auditability. Opaque automation kills trust.
- Start with low-risk, high-repeat actions. Expand only after evidence that it works.
Self-healing sounds great until the bot restarts the wrong service during a database migration. The goal is fewer pages, not more chaos. Research rates automated remediation as medium impact: it pays off when the automation is constrained and safe, and it requires reproducibility and auditability. Tools like Datadog's Bits AI offer AI SRE agents for investigation, but remediation still needs human-defined tiers. Opaque automation kills trust. If you can't explain or replay what the system did, don't let it run unsupervised.
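One way to make a remediation explainable and replayable is to log every action as a structured record before it runs. The sketch below is only an illustration: the class names and fields (`trigger`, `action`, `target`, `inputs`) are hypothetical, and the in-memory list stands in for whatever durable, append-only store you would actually use.

```python
import json
import time
from dataclasses import dataclass, asdict, field


@dataclass
class RemediationRecord:
    """One automated action, with enough context to explain it later."""
    trigger: str            # what the detector saw, e.g. "pod_crashloop"
    action: str             # what the bot did, e.g. "restart_pod"
    target: str             # which resource it acted on
    inputs: dict            # the evidence the decision was based on
    outcome: str = "pending"
    timestamp: float = field(default_factory=time.time)


class AuditLog:
    """Append-only log kept in memory (assumption: swap for durable storage)."""

    def __init__(self):
        self._lines = []

    def append(self, record):
        # Serialize immediately so later mutation cannot rewrite history.
        self._lines.append(json.dumps(asdict(record), sort_keys=True))

    def replay(self):
        # Return the recorded actions in the order they happened.
        return [json.loads(line) for line in self._lines]


log = AuditLog()
log.append(RemediationRecord(
    trigger="pod_crashloop",
    action="restart_pod",
    target="checkout-7f9",
    inputs={"restarts_last_10m": 5},
))
for entry in log.replay():
    print(entry["timestamp"], entry["action"], entry["target"])
```

If an action cannot be written down in a record like this, that is a hint it is not ready to run unsupervised.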
What's Safe to Automate
- Restart failed pods/containers. Low risk, high volume. Kubernetes liveness probes already restart containers that fail their check (readiness probes only take a pod out of rotation); AI can extend this to "restart if metric X degrades."
- Scale up on load. Autoscaling is mature. AI can tune parameters or add custom triggers.
- Circuit breaker / failover. If the primary is down, fail over to the standby. Well-defined, reversible. Good candidate.
- Cache invalidation. Clear caches when data changes. Usually safe if scoped correctly.
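A "restart if metric X degrades" rule is safer when it waits for sustained degradation and caps how often it can fire. The following is a minimal sketch under those assumptions; the class name, thresholds, and the one-hour budget are hypothetical, and the actual restart call is left out.

```python
from collections import deque


class DegradationRestartRule:
    """Recommend a restart only after sustained degradation, with a rate limit."""

    def __init__(self, threshold, consecutive, max_restarts, period_s=3600.0):
        self.threshold = threshold        # metric value considered degraded
        self.consecutive = consecutive    # samples in a row required to act
        self.max_restarts = max_restarts  # restart budget per period
        self.period_s = period_s
        self._recent = deque(maxlen=consecutive)
        self._restarts = deque()          # timestamps of past restarts

    def observe(self, value, now):
        """Record one sample; return True if a restart should fire now."""
        self._recent.append(value)
        # Drop restart timestamps that fell out of the rate-limit period.
        while self._restarts and now - self._restarts[0] >= self.period_s:
            self._restarts.popleft()
        sustained = (len(self._recent) == self.consecutive
                     and all(v > self.threshold for v in self._recent))
        if not sustained:
            return False
        if len(self._restarts) >= self.max_restarts:
            return False  # budget exhausted: escalate to a human instead
        self._restarts.append(now)
        self._recent.clear()  # require fresh evidence before the next restart
        return True


rule = DegradationRestartRule(threshold=0.5, consecutive=3, max_restarts=1)
for t, error_rate in [(0, 0.9), (10, 0.9), (20, 0.9)]:
    if rule.observe(error_rate, now=t):
        print(f"t={t}: restart recommended")
```

The rate limit matters as much as the trigger: a restart that does not fix the problem should stop repeating and page someone.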
What Needs Human Oversight
- Database operations. Restarts, failovers, schema changes. One wrong move, data loss.
- Network changes. Routing, firewall, DNS. High blast radius.
- Multi-service rollbacks. "Something is wrong" might mean rolling back one service, or ten. AI can suggest; humans should confirm.
- First-time failures. If we've never seen this pattern, don't let AI guess. Page.
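The "never seen this pattern" rule can be approximated by fingerprinting failures and counting prior sightings. This is a rough sketch, not a production matcher: the normalization regex, the `PatternHistory` class, and the `min_sightings` default are all assumptions made for illustration.

```python
import hashlib
import re


def fingerprint(service, error_message):
    """Collapse volatile details (numbers, 0x ids) so similar failures match."""
    normalized = re.sub(r"0x[0-9a-f]+|\d+", "#", error_message.lower())
    digest = hashlib.sha256(f"{service}|{normalized}".encode()).hexdigest()
    return digest[:16]


class PatternHistory:
    """Counts how often each failure fingerprint has been seen."""

    def __init__(self):
        self._seen = {}

    def record(self, fp):
        self._seen[fp] = self._seen.get(fp, 0) + 1

    def decide(self, fp, min_sightings=3):
        """Return 'page' for novel or rarely seen patterns, else 'eligible'."""
        if self._seen.get(fp, 0) >= min_sightings:
            return "eligible"
        return "page"


history = PatternHistory()
fp = fingerprint("checkout", "OOMKilled pid 4312")
print(history.decide(fp))   # first sighting, so a human gets paged
history.record(fp)
```

"Eligible" here only means the pattern is no longer novel; whether it may be auto-remediated still depends on its risk.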
Building a Self-Healing Strategy
Tier 1: Auto-execute. Clear pattern, low risk, reversible. No human in loop.
Tier 2: Auto-propose, human approve. AI suggests action; on-call confirms. Use for medium risk or unfamiliar patterns.
Tier 3: Human-only. High risk, data, or novel failures. AI can assist with diagnosis; human executes.
Document your tiers. Review them after each incident and ask: "Could we have auto-remediated?" If yes, consider promoting that pattern toward Tier 1. If no, keep it in Tier 2 or 3.
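Writing the tiers down as code is one way to document them and keep them reviewable. The sketch below encodes the three tiers as a small policy function; the `Incident` fields, the risk labels, and the `familiar_after` threshold are hypothetical choices, and your own line between tiers may sit elsewhere.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTO_EXECUTE = 1   # no human in the loop
    PROPOSE = 2        # AI suggests, on-call approves
    HUMAN_ONLY = 3     # AI may assist diagnosis; human executes


@dataclass(frozen=True)
class Incident:
    action: str
    risk: str            # "low", "medium", or "high"
    reversible: bool
    touches_data: bool
    times_seen: int      # prior occurrences of this failure pattern


def assign_tier(incident, familiar_after=3):
    # Tier 3: high risk, data at stake, or a failure never seen before.
    if incident.risk == "high" or incident.touches_data or incident.times_seen == 0:
        return Tier.HUMAN_ONLY
    # Tier 1: clear (familiar) pattern, low risk, reversible.
    if (incident.risk == "low" and incident.reversible
            and incident.times_seen >= familiar_after):
        return Tier.AUTO_EXECUTE
    # Tier 2: everything in between (medium risk or unfamiliar patterns).
    return Tier.PROPOSE


pod_restart = Incident("restart_pod", "low", True, False, times_seen=12)
schema_change = Incident("alter_table", "high", False, True, times_seen=4)
print(assign_tier(pod_restart).name, assign_tier(schema_change).name)
```

Promoting a pattern after an incident review then becomes a visible change to this policy rather than a silent change in bot behavior.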
Without AI: manual incident response, on-call paged for every failure, and no audit trail for bot actions.
Quick Check
When should SREs keep self-healing in human-only (Tier 3)?
Do This Next
- List your top 5 repeat incidents from the last quarter. For each, ask: could a script have fixed it safely? If yes, draft a Tier 1 or 2 runbook.
- Implement one self-heal for a well-understood case (e.g., pod restart on OOM). Run it with a kill switch for 2 weeks. Measure: did it help? Any near-misses?
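A kill switch for the trial run can be as simple as a file whose presence disables the self-heal. The sketch below assumes that convention; the file name, `KillSwitch`, and `run_remediation` are hypothetical, and the counters are there so you can answer "did it help?" after two weeks.

```python
import os


class KillSwitch:
    """File-based kill switch: remediation is disabled while the file exists."""

    def __init__(self, path):
        self.path = path

    def engaged(self):
        return os.path.exists(self.path)


def run_remediation(action, switch, stats):
    """Run `action` unless the kill switch is engaged; count both outcomes."""
    if switch.engaged():
        stats["skipped"] += 1
        return False
    action()
    stats["executed"] += 1
    return True


switch = KillSwitch("/tmp/DISABLE_SELF_HEAL")  # hypothetical path
stats = {"executed": 0, "skipped": 0}
run_remediation(lambda: print("restarting pod after OOM"), switch, stats)
print(stats)
```

Anyone on call can engage the switch by creating the file, with no deploy needed, which is the property that matters during an incident.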