Skip to main content

Self-Healing Systems

5 min read
SreDevops

Sre

Auto-remediate the obvious. Escalate the ambiguous. You draw the line.

Devops

Self-healing is powerful. It can also make things worse. Start narrow.

Self-Healing Systems

TL;DR

  • AI can detect failures and trigger remediation: restart pods, scale up, failover, run scripts. The tech exists.
  • The risk: AI fixes the wrong thing, hides root cause, or creates cascading failures. You need guardrails.
  • Start with low-risk, high-repeat actions. Expand only after evidence that it works.

Self-healing sounds great until the bot restarts the wrong service during a database migration. The goal is fewer pages, not more chaos.

What's Safe to Automate

  • Restart failed pods/containers. Low risk, high volume. Standard K8s liveness/readiness already do this; AI can extend to "restart if metric X degrades."
  • Scale up on load. Autoscaling is mature. AI can tune parameters or add custom triggers.
  • Circuit breaker / failover. If primary is down, fail to standby. Well-defined, reversible. Good candidate.
  • Cache invalidation. Clear caches when data changes. Usually safe if scoped correctly.

What Needs Human Oversight

  • Database operations. Restarts, failovers, schema changes. One wrong move, data loss.
  • Network changes. Routing, firewall, DNS. High blast radius.
  • Multi-service rollbacks. "Something is wrong" might mean roll back one service—or ten. AI can suggest; humans should confirm.
  • First-time failures. If we've never seen this pattern, don't let AI guess. Page.

Building a Self-Healing Strategy

Tier 1: Auto-execute. Clear pattern, low risk, reversible. No human in loop.

Tier 2: Auto-propose, human approve. AI suggests action; on-call confirms. Use for medium risk or unfamiliar patterns.

Tier 3: Human-only. High risk, data, or novel failures. AI can assist with diagnosis; human executes.

Document your tiers. Review after each incident. "Could we have auto-remediated?" If yes, consider promoting that pattern. If no, keep it in Tier 2 or 3.

Manual process. Repetitive tasks. Limited scale.

Click "With AI" to see the difference →

Quick Check

What remains human when AI automates more of this role?

Do This Next

  1. List your top 5 repeat incidents from the last quarter. For each, ask: could a script have fixed it safely? If yes, draft a Tier 1 or 2 runbook.
  2. Implement one self-heal for a well-understood case (e.g., pod restart on OOM). Run it with a kill switch for 2 weeks. Measure: did it help? Any near-misses?