Johnny Cash once said, “You build on failure. You use it as a stepping stone. Close the door on the past. You don’t try to forget the mistakes, but you don’t dwell on it.” He probably didn’t know that this quote would perfectly describe organizations that work according to the “fail fast, fail often” mantra.
Incidents are the new normal, a part of the development process. We are seeing growing adoption of canary releases, feature flags, and monitoring to mitigate incidents. The damage is smaller, but these are still cuts. How can we keep our business from being killed by a thousand mitigated problems?
There are many techniques at our disposal, but this presentation focuses on not repeating mistakes. I will describe how to benefit from past incidents and how to encourage engineers to embrace failures and take part in analyzing them. I will share best practices for gathering and analyzing software metrics and people-related data, and give tips on how to formulate actionable items that prevent repeat incidents.
Key takeaways:
– what the goal of a blameless post-mortem is,
– techniques that help build the context in which an incident happened,
– techniques that help identify the root cause and propose preventive and corrective actions,
– how getting better at identifying root causes can make you a better engineer by focusing your attention on the parts of your system most likely to be broken.