I have been on the on-call rota in some form for most of my working life. There is a particular kind of knowledge you only acquire by being woken up at 03:14 by your own infrastructure, and I have noticed that this knowledge is almost never written down.
What follows is not a manifesto. It is a list. Each item is something I genuinely learned by being woken up by it, and it has saved me — or someone working with me — at least once since.
On preparing
- Sleep with the phone face-down, vibrate on, charging. Face-up is a small light against your eyelids all night.
- Two charging cables, one bedside, one in the bag you grab. You will not remember to pack one.
- Read the runbook on a quiet Tuesday, not at 03:14. This is the single highest-leverage thing you can do for the pager.
- The runbook lives outside the system. If the production database is the thing on fire, the runbook cannot be in it.
- Have one paper notebook for incidents. Not for the system of record. For you. You will remember what you scribbled.
On the first ten minutes
- The first thing to do is acknowledge. Not fix. Acknowledge. The alert is now your problem; the rest of the system can stop paging.
- Confirm the alert is real before you do anything else. False positives at 3am are common. Sleep is expensive.
- Open three windows: the alert, the dashboard, the logs. In that order. Resist the urge to log into production first; you will know more in thirty seconds and act better.
- Type the customer impact out loud, in plain English, before touching anything. "Right now, X customers cannot Y." If you cannot, the alert is the wrong shape.
- Start a timeline. Two lines is enough.
T+0 alert. T+2 confirmed real. T+5 paged colleague.Future-you will thank present-you.
On working with other humans at 03:00
- One bridge, one commander. Even if the commander is also the only engineer.
- Read back instructions like a pilot. "Restarting service X on host Y, confirming." You will catch the wrong host this way at least once.
- Stop the colleagues who aren't helping. Two engineers in deep concentration are useful. Six engineers in a bridge are a tax on the two who are working.
- Eat something. I know.
On the fix
- The safest action you have is to roll back. Forward fixes look heroic in the post-mortem and feel terrifying at the time.
- If you don't know what changed, don't change anything. First: find out what changed. Then: undo it. Then: investigate.
- Touching production at 03:14 should feel like surgery, not typing. Slow down. Read the command twice. Tab-complete.
- Take a screenshot before and after. Of the dashboard, the log, the config diff. You will need it tomorrow.
The pager is a teaching device, not a punishment. Read what it tells you and the next incident is shorter.
On the morning after
- The post-mortem is a working document, not a tribunal. If anyone shows up wanting blame, they are in the wrong meeting.
- The action items are smaller than they look at 03:14. Most of them are "improve a check" or "write a runbook for next time." Don't dress them up.
- Tell the team something specific you learned. Not the lesson — the specific. "I learned that our health check passes if the database is unreachable, because it only pings the loadbalancer." That sentence will save someone six months from now.
- Sleep, then write. In that order. Otherwise the write-up will read like the inside of your head at 03:14.
A short list of things that aren't on the list
I have deliberately left out: tools, vendors, frameworks. None of that matters until you have the discipline of doing the small things in the right order under pressure. Once you have that, the tools are interchangeable.
The pager is a teaching device, not a punishment. Read what it tells you, write it down, and the next incident is shorter by twenty minutes. That is most of what experience is, in this line of work.