Safe deployment strategies for AppSync Lambda datasources
In this post we will walk through the options for running safe deployments of AppSync datasources backed by Lambda functions. This is particularly useful if you are running AppSync in production with substantial real user traffic.
Deployment types
There are several ways to deploy the function code to an environment:
- All at once: all traffic is redirected to the latest version of the function as soon as the stack is deployed
- Canary: a small portion of the traffic is diverted to the new version of the function for a period of time and, if no issues are detected, all traffic is redirected to the new version
- Linear: traffic is incrementally shifted from an old version of the function to a new version over a period of time
Blue/Green can also be added to the mix; however, it is more applicable when deploying a new side-by-side standby environment (a new CFN stack) and then shifting traffic to it, rather than deploying a new version of a single component within an existing stack.
Within the scope of a single function, Blue/Green can be implemented by adding a PostTraffic hook to an “all-at-once” deployment type.
Each of the methods above has its place. For example, if the function is deployed to a development environment with no real user traffic, the “all-at-once” option is the best choice. For production deployments with real user traffic, however, it’s important to ensure that failed deployments can be rolled back quickly and automatically, with minimal impact on user experience.
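To make the difference between the three types concrete, here is a minimal Python sketch that turns a deployment preference name into a list of (minute, percent of traffic on the new version) steps. The function name traffic_schedule is hypothetical, and the sketch assumes that the first increment is applied at minute 0; it only models the naming pattern of the preferences and does not call any AWS API.

```python
import re
from typing import List, Tuple


def traffic_schedule(preference: str) -> List[Tuple[int, int]]:
    """Return (minute, percent_on_new_version) steps for a preference name.

    Assumption: the first increment is applied at minute 0.
    """
    if preference == "AllAtOnce":
        return [(0, 100)]

    canary = re.fullmatch(r"Canary(\d+)Percent(\d+)Minutes", preference)
    if canary:
        percent, minutes = int(canary.group(1)), int(canary.group(2))
        # A small share first, then everything after the waiting period.
        return [(0, percent), (minutes, 100)]

    linear = re.fullmatch(r"Linear(\d+)PercentEvery(\d+)Minutes?", preference)
    if linear:
        step, interval = int(linear.group(1)), int(linear.group(2))
        if step <= 0:
            raise ValueError("step percentage must be positive")
        schedule = []
        percent, minute = step, 0
        # Equal increments at a fixed interval until all traffic is shifted.
        while percent < 100:
            schedule.append((minute, percent))
            percent += step
            minute += interval
        schedule.append((minute, 100))
        return schedule

    raise ValueError(f"Unrecognised deployment preference: {preference}")


for name in ("AllAtOnce", "Canary10Percent5Minutes", "Linear10PercentEvery1Minute"):
    print(name, traffic_schedule(name))
```

Under these assumptions, a canary has exactly two steps, while a linear shift has one step per increment.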
Configuring Safe Deployments
Gradual deployments can be easily integrated into an existing stack if you are already using AWS SAM in your CI/CD for deployments.
Let’s walk through the deployment of an existing project and then simulate a function code change, to see how gradual deployment will work in dev vs production environments.
The code for the walk-through is located in the GitHub repo.
First we need to package the main stack. This uploads all nested stack templates and code artifacts to an S3 bucket and replaces local filesystem references with the corresponding S3 locations:
aws cloudformation package --template-file template.yaml --s3-bucket appsync-gradual-deployment-bucket --output-template-file template-updated.yaml
Now, using the updated template, we can deploy the main stack to the environment:
aws cloudformation deploy --template-file template-updated.yaml --stack-name appsync-gradual-deployment-stack --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND
Executing the getVersion query from the AppSync console, we get the response from the initial version of the deployment.
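If you prefer to run the same query outside the console, the following Python sketch builds the HTTP request using only the standard library. It is an illustration under assumptions: the API is assumed to use API key authorization, getVersion is assumed to return a scalar (so no selection set is needed), and the endpoint URL and key shown in the comment are placeholders rather than values from the walk-through project.

```python
import json
import urllib.request


def build_get_version_request(api_url: str, api_key: str) -> urllib.request.Request:
    """Build a GraphQL POST request for the getVersion query."""
    body = json.dumps({"query": "query { getVersion }"}).encode("utf-8")
    return urllib.request.Request(
        api_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "x-api-key": api_key,  # assumes API_KEY authorization on the API
        },
        method="POST",
    )


def get_version(api_url: str, api_key: str) -> dict:
    """Send the query and return the decoded JSON response."""
    request = build_get_version_request(api_url, api_key)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


# Hypothetical usage with placeholder values:
# print(get_version("https://<your-api-id>.appsync-api.<region>.amazonaws.com/graphql", "<api-key>"))
```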
“All At Once” Traffic Shift
Let’s update the version to 2.0 and deploy without changing the Environment parameter, which is set to dev by default in the master stack template.yaml file.
100% of the traffic is redirected to the new version of the function as soon as the nested stack finishes deploying.
Linear Traffic Shift
Now let’s bump the version again, but this time we will override Environment to prod to simulate a production deployment.
Our serverless function in the SAM template is configured with the Linear10PercentEvery1Minute deployment preference for the production environment, so we should see different behavior, where traffic shifts in increments of 10% every minute over 10 minutes:
aws cloudformation deploy --template-file template-updated.yaml --stack-name appsync-gradual-deployment-stack --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND --parameter-overrides "Environment=prod"
When DeploymentPreference is defined on a function in a SAM template, CloudFormation creates additional resources: a CodeDeploy Application, a CodeDeploy Deployment Group and the required IAM role.
Now, heading to CodeDeploy Deployments, you will see that the production deployment is in progress, and querying the API will return responses from both versions of the function until the linear traffic shift is fully completed.
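One way to observe the mix of versions during the shift is to call the API repeatedly and count which version answered. The sketch below is a minimal illustration: tally_versions and the fetch_version callable are hypothetical names, and fetch_version stands for whatever code you use to run the getVersion query and extract the version string.

```python
from collections import Counter
from typing import Callable


def tally_versions(fetch_version: Callable[[], str], attempts: int) -> Counter:
    """Call fetch_version repeatedly and count responses per version.

    Any exception raised by fetch_version is counted under the "error" key.
    """
    counts: Counter = Counter()
    for _ in range(attempts):
        try:
            counts[fetch_version()] += 1
        except Exception:
            counts["error"] += 1
    return counts
```

Run a few minutes into a linear deployment, the counts should lean towards the old version at first and towards the new version as the shift progresses.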
Deployment Failure
Let’s simulate a scenario in which a new deployment has issues.
If the CloudWatch alarm defined in the template goes into the “ALARM” state, CodeDeploy should redirect all traffic to the old version and report a failure to CloudFormation, which then rolls back the changes.
After packaging and deploying, go to the AppSync console and query the API until you get a failed response at least 5 times. This should trigger the CloudWatch alarm and initiate an automated deployment roll-back.
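For illustration, a faulty version of the function could look like the Python sketch below. This is not the code from the walk-through project: the handler, the FAILURE_RATE and VERSION constants and the rng parameter are all hypothetical, and the sketch assumes that the alarm in the template watches the function's error metric.

```python
import random

FAILURE_RATE = 0.5   # hypothetical: fraction of invocations that should fail
VERSION = "faulty"   # hypothetical version string returned on success


def handler(event, context, rng=random.random):
    """Hypothetical getVersion handler that fails for a share of invocations."""
    # An unhandled exception makes the Lambda invocation fail, so it is
    # reported as a function error (assumed to be what the alarm watches).
    if rng() < FAILURE_RATE:
        raise RuntimeError("Simulated failure in the new version")
    return VERSION
```

The rng parameter exists only so the behavior can be made deterministic in a local test; the Lambda runtime calls the handler with just event and context.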
Gotchas
It’s worth mentioning that your API needs traffic if you are planning to use gradual deployments to catch issues and trigger the roll-back.
On a system with a low volume of user traffic, you may opt for longer deployment windows to increase the chance of catching errors during a deployment, and also make the CloudWatch alarm more sensitive by lowering its threshold so that it goes off earlier.
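A rough estimate shows why the window length matters. The Python sketch below computes how many requests the new version can be expected to serve before a linear shift reaches 100%. It assumes uniform traffic, a first increment at minute 0 and each step held for the full interval; the function name is hypothetical.

```python
def expected_new_version_requests(requests_per_minute: float,
                                  step_percent: int,
                                  interval_minutes: int) -> float:
    """Expected requests served by the new version before the shift completes."""
    if not 0 < step_percent <= 100:
        raise ValueError("step_percent must be between 1 and 100")
    # Traffic shares held during the shift: step, 2 * step, ... below 100%.
    held_percentages = range(step_percent, 100, step_percent)
    return requests_per_minute * interval_minutes * sum(held_percentages) / 100


# 2 requests per minute, 10% steps every 1 minute vs. every 10 minutes
print(expected_new_version_requests(2, 10, 1))   # 9.0
print(expected_new_version_requests(2, 10, 10))  # 90.0
```

At 2 requests per minute, a 1-minute interval exposes the new version to about 9 requests in total, which leaves little room to reach an alarm threshold of 5 errors unless most requests fail. A 10-minute interval raises that to about 90.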
Another option may be to use CloudWatch Synthetics to ensure steady traffic to your AppSync API (be aware that CloudWatch Synthetics canaries are not cheap, ~$50 USD per month for a canary running at 1-minute frequency) or a PostTraffic hook Lambda (the integration test is limited to the Lambda maximum runtime of 15 minutes, so ensure that your deployment preference lets it hit the new version of the back-end frequently enough to identify the issue within 15 minutes).
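A hook Lambda has to report its result back to CodeDeploy, otherwise the deployment waits for it. Below is a minimal Python sketch of such a hook. The run_checks callable is a hypothetical placeholder for your integration test, and the codedeploy parameter exists only so the handler can be exercised with a fake client; by default a boto3 CodeDeploy client is created.

```python
def handler(event, context, codedeploy=None, run_checks=lambda: True):
    """Run checks and report the outcome of the lifecycle hook to CodeDeploy."""
    if codedeploy is None:
        import boto3  # imported lazily so the sketch can be tested without AWS
        codedeploy = boto3.client("codedeploy")

    try:
        # run_checks is a hypothetical integration test returning True or False.
        status = "Succeeded" if run_checks() else "Failed"
    except Exception:
        status = "Failed"

    # CodeDeploy passes both identifiers in the hook invocation event.
    codedeploy.put_lifecycle_event_hook_execution_status(
        deploymentId=event["DeploymentId"],
        lifecycleEventHookExecutionId=event["LifecycleEventHookExecutionId"],
        status=status,
    )
    return status
```

Reporting "Failed" is what causes CodeDeploy to stop the deployment and roll back.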
AppSync with Caching Enabled
On an AppSync API with caching enabled, it’s important to understand that, for a gradual deployment to catch issues, AppSync needs to hit the new version of the backend every time instead of returning a cached response.
To achieve this, you can use a PreTraffic hook to change the AppSync cache TTL to 1 second before the deployment and a PostTraffic hook to change the TTL back to the desired value after the deployment is completed.
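The TTL change itself can be sketched in Python as below. The helper name set_cache_ttl is hypothetical; appsync is expected to be a boto3 AppSync client (or a stand-in with the same get_api_cache and update_api_cache methods), and the API id would come from your own configuration.

```python
def set_cache_ttl(appsync, api_id: str, ttl_seconds: int) -> int:
    """Set the API cache TTL and return the previous TTL."""
    cache = appsync.get_api_cache(apiId=api_id)["apiCache"]
    # update_api_cache also requires the caching behavior and instance type,
    # so the existing values are passed back unchanged.
    appsync.update_api_cache(
        apiId=api_id,
        ttl=ttl_seconds,
        apiCachingBehavior=cache["apiCachingBehavior"],
        type=cache["type"],
    )
    return cache["ttl"]
```

A PreTraffic hook might call set_cache_ttl(boto3.client("appsync"), api_id, 1), and the PostTraffic hook would call it again with the desired TTL. Since the two hooks are separate invocations, the desired value is best taken from configuration rather than from the returned previous TTL. Both hooks would still need to report their status to CodeDeploy.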
Conclusion
It doesn’t take much effort to add gradual deployments for AppSync Lambda datasources to your existing CI/CD process, and it can save you the trouble of hunting down failed deployments and rolling back changes manually.
Thumbs up if you liked the post and stay tuned for more serverless articles!