Learning incident response with problem sets
It’s hard to teach good incident response. A good understanding of how the system runs in production is essential, but how do you build that understanding? What’s worked for me is looking at the system’s behaviour during various incidents - dashboards, logs and metrics - and making sure I understand the pathological activity I’m seeing. Below I’m going to sketch some made-up incidents and ask what could be going wrong. I think this kind of thing could be a useful exercise, particularly when adapted for a specific team or system.
Suppose you own a standard productionized Rails app. It runs in a collection of VMs (or containers) that sit behind a load balancer node running Nginx or Apache. In each VM, the app is served by Unicorn: a master process forks a fixed pool of worker processes, all accepting requests on a shared socket. This should be a pretty familiar architecture to anyone who’s worked on a Rails app from the early 2010s.
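For concreteness, the Unicorn side of that setup might be configured with something like the sketch below. The numbers are illustrative placeholders, not recommendations:

```ruby
# config/unicorn.rb -- illustrative sketch, numbers are placeholders
worker_processes 4   # fixed pool of workers per VM
listen 8080          # all workers accept on this shared socket
timeout 30           # the master kills workers stuck on a single request longer than this
preload_app true     # load the app once in the master, then fork workers
```

Each worker handles one request at a time, so a VM’s concurrency is capped at worker_processes. That constraint is worth keeping in mind throughout.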
Periodic error spikes
You’re paged to investigate periodic bursts of errors during which your service becomes completely unusable for a minute at a time. Your graph of 5xx responses shows regular spikes:
You also happen to notice an unusual-looking pattern in your graph of memory usage:
What theory might you form from those graphs?
What action would you take to confirm that theory?
And what would a sensible remediation be, if confirmed?
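Whatever theory you form here, it’s cheap to gather corroborating evidence. One generic approach is to sample each Unicorn process’s resident memory over time and line it up against the error spikes. A rough sketch - the sampling interval and the process-name match are assumptions about your setup:

```ruby
# sample_rss.rb -- hypothetical helper for gathering evidence.
# Prints a timestamped resident-set-size sample for every Unicorn process
# (master and workers) so you can line memory behaviour up with the error spikes.
INTERVAL = 10 # seconds between samples

loop do
  ts = Time.now.strftime("%H:%M:%S")
  # BSD-style ps: pid, RSS in kilobytes, elapsed time, full command, no headers.
  `ps axo pid=,rss=,etime=,command=`.each_line do |line|
    next unless line.include?("unicorn") # assumes "unicorn" appears in the proctitle
    pid, rss_kb, etime = line.split
    puts "#{ts} pid=#{pid} rss_mb=#{rss_kb.to_i / 1024} uptime=#{etime}"
  end
  sleep INTERVAL
end
```

The `etime` column tells you how long each process has been alive, which is often useful context alongside the memory numbers.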
Latency down, CPU up
You’re paged by a CPU alert. Your service is suddenly using much more processing power than usual:
Weirdly, request latency is down - way down:
What theory might you form from those graphs?
What action would you take to confirm that theory?
And what would a sensible remediation be, if confirmed?
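Here, too, there’s a generic evidence-gathering step that doesn’t depend on which theory you favour: break the access log down by status class and path to see what the traffic actually consists of. A sketch, assuming an Nginx-style combined log format where the quoted request line is followed by the status code - field positions will differ if your format does:

```ruby
# breakdown_access_log.rb -- hypothetical sketch; the regex assumes the common
# combined log format: ... "GET /path HTTP/1.1" 200 1234 ...
status_classes = Hash.new(0)
paths = Hash.new(0)

ARGF.each_line do |line|
  # Pull out the request path and the status code that follows the quoted request line.
  next unless line =~ /"(?:GET|POST|PUT|PATCH|DELETE|HEAD) (\S+)[^"]*" (\d{3})/
  path, status = Regexp.last_match(1), Regexp.last_match(2)
  status_classes["#{status[0]}xx"] += 1
  paths[path.split("?").first] += 1
end

puts "By status class:"
status_classes.sort.each { |klass, count| puts "  #{klass}: #{count}" }

puts "Top paths:"
paths.sort_by { |_, count| -count }.first(10).each { |path, count| puts "  #{count}\t#{path}" }
```

Run it over a recent slice of the log - e.g. `tail -n 200000 access.log | ruby breakdown_access_log.rb` - and compare the shape of the traffic before and during the alert.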
Latency up, CPU down
You’re paged when your app becomes completely unresponsive. Requests are timing out. However, you’re not seeing any exceptions in your logs, and neither memory nor CPU is under pressure. In fact, CPU usage looks much lower than usual:
Latency is up at around the same time:
What theory might you form from those graphs?
What action would you take to confirm that theory?
And what would a sensible remediation be, if confirmed?
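A generic tool that helps with any “the workers seem stuck” line of investigation is to record which request each worker is currently handling, so that when the app wedges you can see what every worker was last doing. A hypothetical Rack middleware sketch - the file locations and names are made up:

```ruby
# in_flight_logger.rb -- hypothetical sketch of in-flight request tracking.
# Each worker writes a marker file for the request it's handling and removes it
# when done; during a hang, `cat /tmp/in_flight/*` shows where every worker is stuck.
require "fileutils"

class InFlightLogger
  def initialize(app, dir: "/tmp/in_flight")
    @app = app
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  def call(env)
    marker = File.join(@dir, "worker-#{Process.pid}")
    File.write(marker, "#{Time.now} #{env['REQUEST_METHOD']} #{env['PATH_INFO']}\n")
    @app.call(env)
  ensure
    File.delete(marker) if marker && File.exist?(marker)
  end
end

# config.ru (sketch):
#   use InFlightLogger
#   run Rails.application
```

It adds a little filesystem I/O to every request, so it’s the kind of thing to keep behind a flag or reserve for a debugging session, but it answers the “what is everyone waiting on?” question very quickly.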
These exercises have all been fairly generic, and as such can’t dig into any really juicy problems that require more context. I think this kind of thing could be much more useful for a specific system, with examples drawn from actual incidents.
September 3, 2021