Several metrics can be used to measure, and thus help improve, how [[Software engineering MOC|Software engineering]] teams respond to issues.
## Overview of different metrics
![[OpMetricsTimeline.png]]
*Although this diagram uses the term "outage", "incident" is the more general term: even a single API route that starts returning errors could count as one.*
**Mean time to recovery** tells you how quickly you can get your systems back up and running. [^fn1]
[^fn1]: [MTBF, MTTR, MTTA, and MTTF](https://www.atlassian.com/incident-management/kpis/common-metrics) by [[Atlassian]]
Layer in **mean time to respond** and you get a sense for how much of the recovery time belongs to the team and how much to your alert system.
Further layer in **mean time to repair** and you start to see how much time the team is spending on repairs vs. diagnostics.
Add **mean time to resolve** to the mix and you start to understand the full scope of fixing and resolving issues beyond the actual downtime they cause.
Fold in **mean time between failures** and the picture gets even bigger, showing you how successful your team is at preventing or reducing future issues.
And then add **mean time to failure** to understand the full lifecycle of a product or system.
## MTTD: Mean time to discovery
MTTD (mean time to discovery, also called mean time to detect) measures how quickly you find out that your system has an error. Good alerting is the key to keeping it low.
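
As a rough illustration, MTTD can be computed as the average delay between a failure starting and it being detected. The sketch below assumes each incident is recorded as a `(started, detected)` pair; the record layout, the sample timestamps, and the function name `mean_time_to_discovery` are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (failure started, failure detected).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 4)),
    (datetime(2024, 1, 9, 14, 30), datetime(2024, 1, 9, 14, 42)),
    (datetime(2024, 1, 20, 23, 15), datetime(2024, 1, 20, 23, 17)),
]


def mean_time_to_discovery(records):
    """Average delay between a failure starting and it being detected."""
    if not records:
        raise ValueError("need at least one incident")
    delays = [detected - started for started, detected in records]
    # Start the sum from an empty timedelta so the durations add up correctly.
    return sum(delays, timedelta()) / len(delays)


mttd = mean_time_to_discovery(incidents)
print(mttd)  # 0:06:00
```

In practice the "started" timestamp is often only known after the fact (for example from logs), so the figure is usually reconstructed during the incident review.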
## MTTR: Mean time to recovery
**MTTR (mean time to recovery or mean time to restore) is the average time it takes to recover from a product or system failure.** This includes the full time of the outage—from the time the system or product fails to the time that it becomes fully operational again.
**Lower == Better**
*Ambiguity warning*: When we talk about MTTR, it’s easy to assume it’s a single metric with a single meaning. But the truth is it potentially represents **four different measurements**. The R can stand for repair, recovery, respond, or resolve, and while the four metrics do overlap, **they each have their own meaning and nuance.**
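
To make the ambiguity concrete, here is a minimal sketch that computes two of the four readings from the same incidents. It assumes "recovery" is measured from the moment of failure and "respond" from the moment the team is first alerted; the `Incident` type, its field names, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Incident(NamedTuple):
    failed_at: datetime    # system stops working
    alerted_at: datetime   # team is first alerted
    restored_at: datetime  # system is fully operational again


# Hypothetical sample data.
incidents = [
    Incident(datetime(2024, 2, 1, 10, 0), datetime(2024, 2, 1, 10, 5),
             datetime(2024, 2, 1, 10, 45)),
    Incident(datetime(2024, 2, 8, 16, 0), datetime(2024, 2, 8, 16, 15),
             datetime(2024, 2, 8, 17, 15)),
]


def mean_duration(durations):
    """Arithmetic mean of a non-empty list of timedeltas."""
    return sum(durations, timedelta()) / len(durations)


# Recovery: the full outage, from failure to fully operational.
mttr_recovery = mean_duration([i.restored_at - i.failed_at for i in incidents])
# Respond (assumed definition): the same end point, but excluding alerting lag.
mttr_respond = mean_duration([i.restored_at - i.alerted_at for i in incidents])

print(mttr_recovery)  # 1:00:00
print(mttr_respond)   # 0:50:00
```

The same incidents give two different "MTTR" values, which is why it helps to state which R is meant whenever the number is reported.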
## MTBF: Mean time between failures
**MTBF (mean time between failures) is the average time between repairable failures of a technology product.** The metric is used to track both the availability and reliability of a product.
**Higher == Better**
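
One common way to calculate MTBF is to divide the total operational time in a period by the number of failures in that period. The sketch below assumes a 30-day window and a list of per-failure downtime in hours; the function name and the numbers are hypothetical.

```python
def mean_time_between_failures(period_hours, downtime_hours):
    """Operational hours in the period divided by the number of failures.

    period_hours: length of the observation window, in hours.
    downtime_hours: one entry per failure, giving the hours the system was down.
    """
    failures = len(downtime_hours)
    if failures == 0:
        raise ValueError("MTBF is undefined when there were no failures")
    operational_hours = period_hours - sum(downtime_hours)
    return operational_hours / failures


# Hypothetical example: a 30-day window (720 hours) with three failures.
mtbf = mean_time_between_failures(720.0, [3.0, 2.0, 1.0])
print(mtbf)  # 238.0
```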
## MTTA: Mean time to acknowledge
**MTTA (mean time to acknowledge) is the average time it takes from when an alert is triggered to when work begins on the issue.** This metric is useful for tracking your team’s responsiveness and your alert system’s effectiveness.
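
A minimal sketch of the calculation, assuming each alert is stored as a `(triggered, acknowledged)` pair and that the acknowledgement timestamp marks the point where work begins. The record layout, sample data, and function name are hypothetical.

```python
from datetime import datetime
from statistics import mean

# Hypothetical alert records: (alert triggered, alert acknowledged).
alerts = [
    (datetime(2024, 3, 2, 8, 0), datetime(2024, 3, 2, 8, 3)),
    (datetime(2024, 3, 11, 19, 20), datetime(2024, 3, 11, 19, 30)),
    (datetime(2024, 3, 25, 2, 45), datetime(2024, 3, 25, 2, 50)),
]


def mean_time_to_acknowledge_minutes(records):
    """Average minutes between an alert firing and someone acknowledging it."""
    return mean(
        (acknowledged - triggered).total_seconds() / 60
        for triggered, acknowledged in records
    )


mtta = mean_time_to_acknowledge_minutes(alerts)
print(mtta)  # 6.0
```

A mean can hide a few very slow acknowledgements (for example alerts outside working hours), so it may be worth looking at the individual delays as well.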
---
tags: [[Serverless MOC|Serverless]]