Learning from incidents: blameless postmortems for one

A few years ago I deployed a dbt change to production at five p.m. on a Friday. The change was tiny, a single column rename, tested on staging that morning. What I had not noticed was that one downstream Looker dashboard, used by the head of finance for the Monday review, was hardcoded to the old column name. The cron rebuild ran at six. The dashboard broke at six-fifteen. I found out at seven a.m. Monday from a Slack message that started with the word “explanation”.

I spent the morning fixing it. By eleven the dashboard was back. By noon I had apologised to three people and could feel that low, jangly post-incident energy where you know everything is technically fine but your nervous system has not caught up. I went for a walk. When I came back, instead of moving on to the next ticket, I opened a Notion page and wrote a postmortem. For myself. Forty minutes of writing, no audience, just me trying to work out what had actually happened.

That postmortem changed how I deploy on Fridays. More importantly, it changed how I learn from bad days in general. Big incidents at proper companies get five-person reviews, dedicated facilitators, action items tracked in Jira. Personal screw-ups, the kind you live through alone or with one other engineer, almost never get reviewed at all. They become a vague feeling of “I should have known better” and then they get repeated.

You do not need a five-person review to learn from a bad day. You need forty minutes and a willingness to be honest with yourself.

The format that works for one person

The structure I use is shamelessly borrowed from the standard incident review template, simplified to the parts that work without an audience. Five sections, in this order. The medium does not matter: a markdown file, a Notion page and a paper notebook all work.

Timeline. What happened, in chronological order, with rough times. “16:50, opened PR. 17:02, merged after CI passed. 17:08, deploy completed. Monday 07:00, Slack message about broken dashboard.” Be specific. Vague timelines hide the moment where the wheels came off.

What happened. A one-paragraph narrative. The dashboard broke because the column it referenced was renamed. The renaming was part of a change I deployed Friday at five. Nobody saw the breakage until Monday because nobody looked at that dashboard over the weekend. State the facts. No interpretation yet.

Why it happened. This is the section where most personal postmortems go wrong. The temptation is to write “because I was tired” or “because I should have checked”. Both are useless. The useful version asks: what system, what process, what tooling, made this mistake possible or likely? In my case: I had no automated check that connected dbt column renames to downstream Looker fields. I had no policy about deploying on Friday afternoons. I had no convention for marking dashboards as “high-stakes”. The mistake was the symptom; the gaps were the cause.

What I would do differently. Specific behaviours, not vague resolutions. “Add a pre-deploy check for downstream Looker references” is good. “Be more careful” is not. If you cannot describe the new behaviour clearly enough to teach it to someone else, you have not actually learned anything.

Action items. Two or three concrete things, with dates. Anything more is wishful thinking. For me, that day, the action items were: write a small script that lists Looker fields referencing a dbt column, before next Friday; stop deploying schema changes on Fridays after three p.m., starting now; tag the finance dashboards as “executive” so I know to be extra careful when they are downstream.
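For illustration, here is a minimal sketch of what the first action item could look like. It is not the script I actually wrote. It assumes the Looker project is checked out locally as `.lkml` files and that fields reference warehouse columns through the usual `${TABLE}.column_name` substitution. The function name and directory layout are hypothetical, and it will miss references made through raw SQL, derived tables or hardcoded dashboard filters.

```python
import re
from pathlib import Path


def find_column_references(lookml_dir: str, column: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, line text) for every LookML line that
    references `column` through the ${TABLE}.column substitution."""
    # Match the whole identifier, so "amount" does not also match "amount_usd".
    pattern = re.compile(r"\$\{TABLE\}\." + re.escape(column) + r"\b")
    hits = []
    # Assumption: the Looker project lives on disk as *.lkml files.
    for path in sorted(Path(lookml_dir).rglob("*.lkml")):
        text = path.read_text(encoding="utf-8")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling it with the old column name before merging a rename gives a list of places to check. An empty list only means no `${TABLE}` reference was found, not that the rename is safe.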

That is the entire format. Twenty to forty minutes of writing, depending on how complicated the day was. Five sections, no fluff, written the same day or the next morning.
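If starting from a blank page is the obstacle, the five sections can be stamped out in advance. The sketch below is one way to do that, assuming a plain-text file is the medium. The function name and the bracketed prompts are my own shorthand for the descriptions above, not part of any standard template.

```python
from datetime import date

# Section names and prompts follow the five-part format described above.
SECTIONS = [
    ("Timeline", "What happened, in chronological order, with rough times."),
    ("What happened", "One paragraph of facts. No interpretation yet."),
    ("Why it happened", "What system, process or tooling made this mistake possible or likely?"),
    ("What I would do differently", "Specific behaviours, not vague resolutions."),
    ("Action items", "Two or three concrete things, with dates."),
]


def postmortem_template(title: str, day: date) -> str:
    """Return an empty five-section postmortem as plain text."""
    lines = [f"Postmortem: {title} ({day.isoformat()})", ""]
    for name, prompt in SECTIONS:
        # Heading, a one-line prompt in brackets, then a blank line to write under.
        lines += [name, f"[{prompt}]", ""]
    return "\n".join(lines)
```

Printing `postmortem_template("Friday column rename", date.today())` into a new file is enough. The prompts are there to be deleted as each section gets filled in.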

Blameless, applied to yourself

The hardest part of a personal postmortem is not the writing. It is keeping the tone right.

The Google SRE book helped popularise the term “blameless postmortem” in the context of teams, and the principle is simple: assume people behaved reasonably given what they knew at the time, and look for the systemic causes instead. If your incident review is full of “you should have done X”, it is not blameless, and the next person who messes up will hide it from you instead of letting you learn from it.

The same principle, turned inward, is harder than it sounds. The default mode of a personal postmortem is self-flagellation. “I should have known. I am stupid. This is exactly the kind of thing a good engineer would have caught.” None of that produces learning. It produces a small spike of guilt and then nothing, because guilt is not actionable.

The reframe I have come to use: I am the system. The mistake came out of a system that had a gap. My job in the postmortem is to find the gap, not to punish the person who walked through it. “I was tired” is not a useful conclusion. “There was no automated check between dbt and Looker, and I deployed during the worst possible window for catching breakage” is useful, because it points at things I can change.

A small mental trick that helps: write the postmortem as if it were about a colleague you respect. Imagine someone you think is good at their job did this exact thing. What would you say to them? Almost certainly not “you were tired”. Probably something like “okay, the conditions for this mistake were obviously there, what do we change?”. Apply the same generosity to yourself. The aim is curiosity, not contrition.

What “I should have known better” gets wrong

There is a specific failure mode in personal postmortems that I want to name, because I keep falling into it and I keep watching other people fall into it.

It goes like this. You make a mistake. Looking back, the mistake seems obvious. You feel stupid. You write “I should have known better” in big letters across your brain and call it a lesson learned. Then six months later you make a structurally identical mistake, in a different domain, and you are equally surprised, and the cycle repeats.

The reason “I should have known better” does not work is that it is not a lesson. It is a feeling about a lesson. It does not change any of your inputs, any of your defaults, any of your tooling. It is the postmortem equivalent of saying “I will be more careful next year” and then not changing a single habit.

The thing that does work is much less dramatic. It is some version of: “the system that produced this mistake had X gap, and the change that closes the gap is Y, and I will do Y before this Friday”. The Friday-deploy postmortem did not result in me feeling worse about myself; it resulted in a one-line cron change and a Slack reminder I set for Friday afternoons. Boring, low-drama, effective. Six months later, the same class of mistake had not happened again, because the conditions for it no longer existed.

If your postmortem ends with you feeling bad and not with you changing a default, you have written a confession, not a postmortem.
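To make “changing a default” concrete, here is a minimal sketch of the no-schema-changes-after-three-on-Friday rule as a check a deploy script could call. The function name is hypothetical, it assumes the timestamp passed in is already in your local time zone, and it is an illustration, not the exact change I made.

```python
from datetime import datetime

FRIDAY = 4        # datetime.weekday(): Monday is 0, so Friday is 4
CUTOFF_HOUR = 15  # three p.m., the cut-off from the action items


def schema_deploy_allowed(now: datetime) -> bool:
    """Return False on Fridays from 15:00 onwards, True at any other time.

    Assumption: `now` is already expressed in local time.
    """
    return not (now.weekday() == FRIDAY and now.hour >= CUTOFF_HOUR)
```

A deploy wrapper can call `schema_deploy_allowed(datetime.now())` and refuse to continue when it returns False. The point is that the rule no longer depends on remembering it at five p.m. on a Friday.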

How to make it a habit, not a wallow

The risk of personal postmortems, especially when you are the kind of engineer who takes their work seriously, is that they expand. Twenty minutes becomes an hour. An hour becomes a Saturday morning of brooding. The postmortem stops being a learning tool and turns into a way to relive the worst part of the week.

A few rules that have kept this in check for me.

Time-box it. Forty minutes maximum, ideally on the same day or the next morning. If you cannot finish in forty minutes, the document is too long. Cut sections, not depth.

Write it once. Do not edit it the next day. Do not polish it. The point is the act of writing, not the artefact. Read it once a quarter, when you are doing your own retro or thinking about growth, but do not pick at it.

One incident per postmortem. Resist the urge to write “the postmortem of my whole month”. You will end up with a sad essay and zero action items. One incident, one document.

Park it. When the document is done, close the tab. The action items go in your task manager (Jira ticket, Notion page, GitHub issue, whatever). The lessons go in your head. The document itself can sit in a folder you rarely revisit, doing the small job of being available the next time you find yourself in a similar situation and want to remember exactly how it played out.

A small thing that has helped me: after I write the postmortem, I write a single sentence at the top. Something like “the deploy at five p.m. Friday broke a finance dashboard; the system had no dbt-to-Looker reference check; I added one and stopped late-Friday deploys”. One sentence. That is the version of the postmortem I will actually remember in six months. The rest is scaffolding for getting to that sentence honestly.

You do not get to choose whether to make mistakes. You only get to choose whether to learn from them, and the difference between learning and not learning is whether you write down what happened while it is still fresh. Forty minutes, five sections, blameless tone, two or three action items, then move on. The discipline is in the writing, not the artefact, and the payoff is years of not making the same mistake twice. Most engineers never do this. The ones who do tend to look, after a few years, like people who got lucky with their growth. They did not. They just took their bad days seriously enough to learn from them.