Stage 6 of 8
Stage 06 · Heal · 1.5 min
Auto-Heal & Diagnose
Detect broken pods, diagnose and fix them, then analyze incidents
💬 TALKING POINT · #33
When a Jenkins job fails at 3am, all it can do is page your on-call engineer. Bob detects the failure, reads the logs, diagnoses the root cause, rolls back the deployment, and leaves you a report in the morning. That's the difference between running scripts and having an intelligent agent.
6a · Auto-Heal
Monitor the bob-demo namespace for unhealthy pods. Check for: CrashLoopBackOff, OOMKilled, failed readiness probes, and ImagePullBackOff. For any unhealthy pod:
1. Pull the last 50 lines of logs
2. Check recent events (oc describe pod)
3. Check resource usage vs limits
4. Diagnose the root cause
5. If it's CrashLoopBackOff, roll back to the previous image
6. If it's OOMKilled, recommend new memory limits
7. Report all findings
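For presenters who want to show what this prompt amounts to mechanically, the checks and actions above can be sketched roughly as follows. This is a minimal illustration, not how Bob works internally. It assumes the pod object is the parsed JSON from `oc get pods -n bob-demo -o json`. The function names `diagnose_pod` and `plan_actions` and the `deployment` parameter are hypothetical. It treats "running but not ready" as a stand-in for a failed readiness probe, which is only a heuristic. The sketch returns the `oc` commands as strings instead of running them, and `oc adm top pods` needs cluster metrics to be available.

```python
from typing import Dict, List

# Waiting reasons the prompt asks about (reported under state.waiting.reason).
WAITING_REASONS = {"CrashLoopBackOff", "ImagePullBackOff"}


def diagnose_pod(pod: Dict) -> List[str]:
    """Return the unhealthy conditions found in one parsed pod object."""
    findings: List[str] = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        last = cs.get("lastState", {})
        reason = state.get("waiting", {}).get("reason")
        if reason in WAITING_REASONS and reason not in findings:
            findings.append(reason)
        # OOMKilled shows up as the reason of a current or previous termination.
        for s in (state, last):
            if s.get("terminated", {}).get("reason") == "OOMKilled":
                if "OOMKilled" not in findings:
                    findings.append("OOMKilled")
        # Heuristic: a running container that is not ready may be failing its probe.
        if "running" in state and not cs.get("ready", False):
            if "NotReady" not in findings:
                findings.append("NotReady")
    return findings


def plan_actions(pod_name: str, namespace: str,
                 findings: List[str], deployment: str) -> List[str]:
    """Map findings to the oc commands from steps 1-3 and 5 (as strings only)."""
    if not findings:
        return []
    commands = [
        f"oc logs {pod_name} -n {namespace} --tail=50",      # step 1
        f"oc describe pod {pod_name} -n {namespace}",        # step 2
        f"oc adm top pods {pod_name} -n {namespace}",        # step 3 (usage only)
    ]
    if "CrashLoopBackOff" in findings:
        # Step 5: revert the deployment to its previous revision.
        commands.append(f"oc rollout undo deployment/{deployment} -n {namespace}")
    return commands
```

Steps 4, 6, and 7 (root-cause diagnosis, memory-limit recommendations, and the report) are judgment calls that the sketch deliberately leaves out; they are the part the agent is meant to handle.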
6b · Incident Analysis (backup)
Analyze the last 7 days of deployment history for payment-service. Show: deployment time, image version, rollout duration, any failed rollouts and their causes. Identify patterns: are failures happening during specific hours? After specific types of changes? Recommend improvements to the deployment strategy based on the data.
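The "specific hours" question in this prompt comes down to a simple grouping, which might look like the sketch below. It assumes each rollout has already been collected into a dict with a `started` ISO-8601 timestamp and a `status` of `"succeeded"` or `"failed"`. Those field names and the function `failure_rate_by_hour` are hypothetical, not part of any OpenShift output format.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, List


def failure_rate_by_hour(rollouts: List[Dict[str, str]]) -> Dict[int, float]:
    """Return the fraction of failed rollouts for each hour of day seen."""
    total: Counter = Counter()
    failed: Counter = Counter()
    for rollout in rollouts:
        # Timestamps are assumed to carry no "Z" suffix, e.g. 2024-05-01T03:10:00.
        hour = datetime.fromisoformat(rollout["started"]).hour
        total[hour] += 1
        if rollout["status"] == "failed":
            failed[hour] += 1
    return {hour: failed[hour] / total[hour] for hour in sorted(total)}
```

An hour with a noticeably higher failure rate than the rest is a candidate pattern to raise in the recommendations; with only 7 days of history the counts will be small, so treat any such result as a hint to investigate, not a conclusion.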