Different metrics can be used to measure and thus help improve how [[Software engineering MOC|Software engineering]] teams respond to issues.

## Overview of different metrics

![[OpMetricsTimeline.png]]

*Although this diagram uses the term "outage", "incident" would be the more general term, since even a single API route that starts returning errors could be considered one.*

**Mean time to recovery** tells you how quickly you can get your systems back up and running.[^fn1]

[^fn1]: [MTBF, MTTR, MTTA, and MTTF](https://www.atlassian.com/incident-management/kpis/common-metrics) by [[Atlassian]]

Layer in **mean time to respond** and you get a sense for how much of the recovery time belongs to the team and how much is your alert system. Further layer in **mean time to repair** and you start to see how much time the team is spending on repairs vs. diagnostics. Add **mean time to resolve** to the mix and you start to understand the full scope of fixing and resolving issues beyond the actual downtime they cause. Fold in **mean time between failures** and the picture gets even bigger, showing you how successful your team is at preventing or reducing future issues. And then add **mean time to failure** to understand the full lifecycle of a product or system.

## MTTD: Mean Time to Discovery

MTTD measures how quickly you find out that your system has an error. Key to this is alerting.

## MTTR: Mean Time to Recovery

**MTTR (mean time to recovery or mean time to restore) is the average time it takes to recover from a product or system failure.** This includes the full time of the outage, from the time the system or product fails to the time that it becomes fully operational again.

**Lower == Better**

*Ambiguity warning*: When we talk about MTTR, it’s easy to assume it’s a single metric with a single meaning. But the truth is it potentially represents **four different measurements**.
The R can stand for repair, recovery, respond, or resolve, and while the four metrics do overlap, **they each have their own meaning and nuance.**

## MTBF: Mean Time Between Failures

**MTBF (mean time between failures) is the average time between repairable failures of a technology product.** The metric is used to track both the availability and reliability of a product.

**Higher == Better**

## MTTA: Mean Time to Acknowledge

**MTTA (mean time to acknowledge) is the average time it takes from when an alert is triggered to when work begins on the issue.** This metric is useful for tracking your team’s responsiveness and your alert system’s effectiveness.

---
tags: [[Serverless MOC|Serverless]]
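
As a rough illustration of the definitions in this note, the sketch below computes MTTD, MTTA, MTTR (read as mean time to recovery) and MTBF from a list of incident timestamps. The `Incident` record, its field names, and the sample data are hypothetical. MTBF is taken here as the mean operational time between one recovery and the next failure, which is one common convention and an assumption rather than something this note specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; the field names are illustrative only."""
    failed_at: datetime        # the system starts failing
    alerted_at: datetime       # an alert fires, i.e. the failure is discovered
    acknowledged_at: datetime  # work begins on the issue
    recovered_at: datetime     # the system is fully operational again


def _mean_minutes(deltas: list[timedelta]) -> float:
    # statistics.mean raises StatisticsError when given no data points.
    return mean(d.total_seconds() for d in deltas) / 60


def mttd(incidents: list[Incident]) -> float:
    """Mean time from failure to alert, in minutes."""
    return _mean_minutes([i.alerted_at - i.failed_at for i in incidents])


def mtta(incidents: list[Incident]) -> float:
    """Mean time from alert to work beginning, in minutes."""
    return _mean_minutes([i.acknowledged_at - i.alerted_at for i in incidents])


def mttr(incidents: list[Incident]) -> float:
    """Mean time from failure to full recovery, in minutes."""
    return _mean_minutes([i.recovered_at - i.failed_at for i in incidents])


def mtbf(incidents: list[Incident]) -> float:
    """Mean uptime between a recovery and the next failure, in minutes.

    Assumed convention; needs at least two incidents.
    """
    ordered = sorted(incidents, key=lambda i: i.failed_at)
    gaps = [nxt.failed_at - prev.recovered_at
            for prev, nxt in zip(ordered, ordered[1:])]
    return _mean_minutes(gaps)


# Made-up sample data: two incidents, two days apart.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 5),
             datetime(2024, 1, 1, 10, 15), datetime(2024, 1, 1, 11, 0)),
    Incident(datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 3),
             datetime(2024, 1, 3, 10, 8), datetime(2024, 1, 3, 10, 30)),
]

print(mttd(incidents))  # 4.0
print(mtta(incidents))  # 7.5
print(mttr(incidents))  # 45.0
print(mtbf(incidents))  # 2820.0 (47 hours)
```

Changing which pair of timestamps is subtracted gives the other readings of MTTR, which is why it helps to state explicitly which "R" a dashboard or report is using.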