Alerting within an AWS serverless application

## Key considerations

Things which help lower the [[Operational metrics for a distributed application#MTTR Mean Time To Recovery|Mean Time To Recovery]] metric:

- What events/incidents do you need to be alerted about?
- Time-to-log file line: how easy is it to get from being notified of an error to seeing a detailed error message with stack trace
- What notification channels are available and ease of configuration: email, Slack, pager, etc

Other considerations:

- **Privacy**: is potentially sensitive log data being shipped to a third party service?
- **Billing**: what is the billing model and how much can you expect to pay?
- **Ease of setup**: how easy is it to integrate into your application codebase? Does it require changes to application code or just configuration updates?

## What events to alert on [^fn1]

[^fn1]: [What alerts should you have for serverless applications](https://lumigo.io/blog/what-alerts-should-you-have-for-serverless-applications/) by [[@Yan Cui]] and [[Lumigo]]

### [[AWS Lambda]]

#### Region-wide

- `ConcurrentExecutions`: Set the alert threshold to ~80% of your current regional concurrency limit (which starts at 1000 for most regions)

#### Per-function

**Error rate alert**: Use CloudWatch metric math to calculate the error rate of a function, i.e., `100 * Errors / MAX([Errors, Invocations])`. Align the alert threshold with your Service Level Agreements (SLAs). For example, if your SLA states that 99% of requests should succeed then set the error rate alert to 1%.

**Throttles alert**: Unless you’re using `Reserved Concurrency`, you probably shouldn’t expect the function’s invocations to be throttled. So you should have an alert against the `Throttles` metric.

**DeadLetterErrors alert**: For async functions with a dead letter queue (DLQ), you should set up an alert against the `DeadLetterErrors` metric. This tells you when the Lambda service is not able to forward failed events to the configured DLQ.

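
As a rough illustration of the error rate alert described above, here is a minimal Python sketch that builds the parameters for a CloudWatch metric math alarm. The helper names (`error_rate_percent`, `error_rate_alarm_params`), the alarm name, the 5-minute period and the `notBreaching` missing-data setting are all assumptions made for this example, not something taken from the source article. The returned dict is shaped for boto3's `put_metric_alarm`, which is not called here, so the block runs without AWS credentials.

```python
from typing import Any


def error_rate_percent(errors: float, invocations: float) -> float:
    """Evaluate 100 * Errors / MAX([Errors, Invocations]) for a single period."""
    denominator = max(errors, invocations)
    # Assumption for this sketch: no traffic at all counts as a 0% error rate.
    return 0.0 if denominator == 0 else 100.0 * errors / denominator


def error_rate_alarm_params(
    function_name: str,
    sla_success_percent: float,
    period_seconds: int = 300,  # hypothetical default, tune to your needs
) -> dict[str, Any]:
    """Build kwargs for CloudWatch put_metric_alarm using metric math."""

    def lambda_metric(metric_id: str, metric_name: str) -> dict[str, Any]:
        # Raw per-function Lambda metric, summed over the period.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/Lambda",
                    "MetricName": metric_name,
                    "Dimensions": [
                        {"Name": "FunctionName", "Value": function_name}
                    ],
                },
                "Period": period_seconds,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{function_name}-error-rate",  # hypothetical naming scheme
        "Metrics": [
            lambda_metric("errors", "Errors"),
            lambda_metric("invocations", "Invocations"),
            {
                # Only the expression result is evaluated by the alarm.
                "Id": "error_rate",
                "Expression": "100 * errors / MAX([errors, invocations])",
                "Label": "Error rate (%)",
                "ReturnData": True,
            },
        ],
        "ComparisonOperator": "GreaterThanThreshold",
        # A 99% success SLA gives a 1% error rate threshold.
        "Threshold": 100.0 - sla_success_percent,
        "EvaluationPeriods": 1,
        "TreatMissingData": "notBreaching",  # assumption: idle functions stay OK
    }


# Possible usage (requires boto3 and AWS credentials, so left commented out):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **error_rate_alarm_params("my-function", sla_success_percent=99.0)
# )
```

Keeping the parameter construction separate from the AWS call makes it easy to unit test the threshold logic, and the same dict could be reused for each function you want to cover.
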

**DestinationDeliveryFailures alert**: Similar to above, for functions with Lambda Destinations, you should set up an alert against the `DestinationDeliveryFailures` metric. This tells you when the Lambda service is not able to forward events to the configured destination.

**IteratorAge alert**: For functions triggered by [[AWS Kinesis Data Streams]] or [[DynamoDB streams]], the `IteratorAge` metric tells you the age of the messages they receive. When this metric starts to creep up, it’s an indicator that the function is not keeping pace with the rate of new messages and is falling behind. The worst-case scenario is that you will experience data loss, since data in the streams is only kept for 24 hours by default. This is why you should set up an alert against the `IteratorAge` metric so that you can detect and rectify the situation before it gets worse.

### [[AWS API Gateway|API Gateway]]

- **p90/p95/p99 Latency alert**
- **4xx rate/5xx rate alert**

### [[AWS SQS|SQS]]

When working with SQS, you should set up alerts against the `ApproximateAgeOfOldestMessage` metric for an SQS queue. It tells you the age of the oldest message in the queue. When this metric trends upwards, it means your SQS function is not able to keep pace with the rate of new messages.

### [[AWS Step Functions|Step Functions]]

There are a number of metrics that you should alert on:

- `ExecutionThrottled`
- `ExecutionsAborted`
- `ExecutionsFailed`
- `ExecutionsTimedOut`

They represent the various ways state machine executions can fail. And since Step Functions are often used to model business-critical workflows, I would usually set the alert threshold to 1.

### [[DynamoDB]]

#TODO

### [[AWS AppSync|AppSync]]

#TODO

## How to set up these alerts within [[AWS CloudWatch|CloudWatch]]

A few options here:

- Configure the alerts using [[Serverless Framework]] plugin [`serverless-plugin-aws-alerts`](https://github.com/ACloudGuru/serverless-plugin-aws-alerts).
  This provides some canned alarm definitions for Lambda functions and allows you to opt individual functions in or out of alarms. It also allows you to define custom definitions, which could be a good fit for [[AWS AppSync|AppSync]].
- Use this [SAR CloudWatch Macro](https://github.com/lumigo-io/SAR-cloudwatch-alarms-macro) for auto-generating CloudWatch alarms based on your stack's configuration
- Use a third-party tool that auto-detects your current setup (Lumigo's docs say it does this [^fn1])

## Third-party alerting services to consider

- [Sentry](https://sentry.io)
  - Can instrument Lambdas by [adding its SDK to your Lambda code](https://docs.sentry.io/platforms/node/guides/aws-lambda/)
  - Alternatively can [install their CloudFormation stack](https://docs.sentry.io/product/integrations/aws-lambda/) to enable monitoring. Once the CF stack is installed it reports all the Lambda functions in the AWS account back to Sentry's servers. Via the Sentry UI you can then enable monitoring on individual functions. Functions selected for monitoring will have a Lambda layer applied to them, which contains the collection agent.
  - Don't think it supports AWS services other than Lambda. So no AppSync.
- [Lumigo](https://lumigo.io)
- [Thundra](https://www.thundra.io)

## See also

- [[Operational metrics for a distributed application]]

---
tags: #Operations