Notes on the challenges with testing [[AWS Step Functions]] and some strategies for addressing them.

## Why is Step Functions more difficult to test than many other AWS serverless services?

- A state machine has multiple states, and testing a specific state or state transition can require a lot of setup if you always have to start from the initial state
- ~~There is no tool right now that allows for testing of logic contained in the state machine definition itself~~ (As of Jan 2022, [the ability to mock tasks within the Step Functions Local tool](https://aws.amazon.com/blogs/compute/mocking-service-integrations-with-aws-step-functions-local/) seems to address a lot of these concerns)

## Failure modes

When considering how to test a service, start by thinking about all the things that can go wrong with it.

### Deploy-time issues with State Machine definition

- The State Machine definition contains syntactic errors that can be statically detected (e.g. referencing undefined states):
  - Solution: Deploy-time checks (e.g. `serverless-step-functions` plugin with `validate: true` enabled, or [[AWS CDK]] constructs to check this)

### Runtime issues with State Machine definition

- Takes incorrect path (e.g. misconfigured `Choice` state)
- Error handling does not behave correctly, e.g. retries or other compensating actions are not correctly executed
  - TODO how to force the state machine down an error path from E2E path?
- Data passing and mapping between state inputs and outputs (via JSON paths) are misconfigured, resulting in missing or incorrect data going into the next state

### Runtime issue with executing tasks

- Lambda-based task doesn't behave correctly or fails (e.g. due to missing IAM permissions).
  - Solution: Verify by invoking the Lambda function directly, outside of Step Functions
- Wait states may cause slow tests
  - Solution: ensure any wait time is parameterisable at the start of the execution so a short duration can be injected
- Executing state machine hits a runtime [quota limit](https://docs.aws.amazon.com/step-functions/latest/dg/limits.html), e.g. the 256kB max input/output payload size for a task can be particularly problematic for Map states that iterate and collect outputs of N items.
- WaitForTaskToken async callback not invoked or failed (due to incorrect IAM permissions)

## Pain-point quotes

- "I've been caught out so many times with inputs and outputs depths"
- "Biggest risks are around the path strings - I'd replace States Language (and YAML/VTL) with programming languages that already have great IDE support, testing features and tooling."

## Testing strategies for Step Functions

### Unit

Unit tests would be a good method for testing the failure modes listed in the "Runtime issues with State Machine definition" section. Given the multi-step nature of Step Functions, the ability to unit test individual state transitions defined in the State Machine language would be a great benefit.
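
To make the idea of unit testing a definition more concrete, here is a minimal sketch of building a mock configuration for the Step Functions Local mocking feature linked earlier in these notes. The overall shape (`StateMachines` / `TestCases` / `MockedResponses`, with `Return` and `Throw` entries keyed by attempt number) follows my understanding of the format in that AWS blog post, so treat it as an assumption and check it against the post. The state machine name `OrderFlow`, the state name `ProcessOrder`, the mock names and the `build_mock_config` helper are all hypothetical.

```python
import json


def build_mock_config(state_machine_name, test_cases, mocked_responses):
    """Assemble a mock configuration as a plain dict.

    test_cases:       {test case name: {state name: mocked response name}}
    mocked_responses: {mocked response name: {attempt key: Return/Throw entry}}
    """
    # Fail fast if a test case points at a mocked response that is not defined.
    for case_name, state_map in test_cases.items():
        for state_name, response_name in state_map.items():
            if response_name not in mocked_responses:
                raise ValueError(
                    f"{case_name}.{state_name} references unknown mock {response_name!r}"
                )
    return {
        "StateMachines": {state_machine_name: {"TestCases": test_cases}},
        "MockedResponses": mocked_responses,
    }


# All names below are hypothetical and only illustrate the shape of the file.
mock_config = build_mock_config(
    "OrderFlow",
    test_cases={
        "HappyPath": {"ProcessOrder": "MockedProcessOrderSuccess"},
        "RetryThenSucceed": {"ProcessOrder": "MockedProcessOrderFlaky"},
    },
    mocked_responses={
        "MockedProcessOrderSuccess": {
            "0": {"Return": {"StatusCode": 200, "Payload": {"status": "ok"}}}
        },
        "MockedProcessOrderFlaky": {
            # First attempt throws, so the state's Retry/Catch config is exercised.
            "0": {"Throw": {"Error": "Lambda.ServiceException", "Cause": "simulated failure"}},
            # Second attempt succeeds.
            "1": {"Return": {"StatusCode": 200, "Payload": {"status": "ok"}}},
        },
    },
)

if __name__ == "__main__":
    # Write this JSON to the mock config file that Step Functions Local reads.
    print(json.dumps(mock_config, indent=2))
```

A `Throw` entry like the one in `RetryThenSucceed` is one possible way to push an execution down an error path without a real failing task. Per the blog post, a test case is selected by appending `#<test case name>` to the state machine ARN when starting the execution against the local endpoint. The post has the exact wiring.
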
Ideally, we would be able to perform the following actions within a unit testing tool:

- Select a start state and associated input, start the execution and verify the output
- Allow task states to be stubbed with a [[Test double]]
- Verify that `Choice` steps choose the correct branch
- Verify that input and output size limits aren't hit
- Spy on specific states within a specific execution, to check how many times they were invoked and with what inputs
- Run locally in a single process

~~It doesn't look like the ["Step Functions Local"](https://docs.aws.amazon.com/step-functions/latest/dg/sfn-local.html) AWS tool would help with any of these issues in a unit test.~~

In January 2022, AWS released the ability to [Mock service integrations with AWS Step Functions Local](https://aws.amazon.com/blogs/compute/mocking-service-integrations-with-aws-step-functions-local/). This opens up many of the above unit testing capabilities.

NB: there is a relatively new official AWS tool, ["Step Functions Data Flow Simulator"](https://aws.amazon.com/blogs/compute/modeling-workflow-input-output-path-processing-with-data-flow-simulator/), that looks to help with some of these issues at design-time in the AWS Console but doesn't allow for any automated testing as far as I can tell.

### E2E

E2E test cases are a good method for verifying the failure modes listed in the "Runtime issue with executing tasks" section. They verify that all the tasks execute correctly when executed in a cloud environment (e.g. that they're talking to the correct services and have IAM permissions to do so). Also, given the lack of unit testability of state machine definitions, E2E test cases may be required to verify those definition-level issues too.

E2E test cases typically use the AWS Step Functions SDK to start an execution with a specific input. They use polling to check for progress of the state machine execution and verify its output at each stage.

---

## Questions from Testing Workshop attendees

1. Any ideas on how to test each step (state) in isolation?
2. Stubbing (or "short circuiting") invoked lambdas
3. Testing steps that invoke lambdas asynchronously (via `Resource: arn:aws:states:::lambda:invoke.waitForTaskToken`)
4. Map steps
5. Testing timeouts/heartbeats and other time-sensitive and long-running steps
6. Testing transitions between steps

---

## References

- ["What testing strategies are you using for your Step Functions State Machine today?" (Twitter convo)](https://twitter.com/heitor_lessa/status/1385626326745374726?s=20) by [[@Heitor Lessa]]
- [Serverless Office Hours: Step Functions Local - mocking service integrations (YouTube video)](https://www.youtube.com/watch?v=4pTfYon6zJ8)
- [Testing AWS Step Functions flows (with StepFunctions Local mocking)](https://dev.to/aws-builders/testing-aws-step-functions-flows-2kpn) by [[@Wojciech Matuszewski]]
  - Uses CDK but same approach could be adapted for other deployment frameworks

---

tags: [[Serverless testing MOC]]