Monday, April 3, 2017

Rebuild AMI using SSM Automation

In the previous post I used Packer to remove block device mappings from an Ubuntu AMI to ensure EC2 auto-recovery works. While Packer is a fantastic tool with many features, it is synchronous in nature and requires an active SSH connection to the temporary instance in order to bake a new AMI.

Can we get similar results with AWS tools alone? Indeed, we can!

In December 2016, at re:Invent, AWS announced a slew of interesting new features for Amazon EC2 Systems Manager (a.k.a. SSM). The initial documentation and tutorials were sparse, so the capabilities weren't immediately clear. One of these new features, Automation, offers functionality similar to Packer's, so I decided to learn more about it by applying it to my simple use case.

Start by configuring IAM Roles for Automation as described in the docs. I know this might seem like an extra step compared to the Packer example, where I just used my personal credentials. However, if you are serious about security and live by POLP (the principle of least privilege), you should create a dedicated IAM Role or User for Packer as well.
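For reference, the core of that setup is a role whose trust policy allows the SSM service to assume it. The trust relationship document looks roughly like this (follow the AWS docs for the exact role name and permissions policy to attach):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "ssm.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
```

AWS also ships a managed policy (AmazonSSMAutomationRole) that can be attached to the role to grant the permissions Automation needs.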

Next, we will need an automation document: a JSON document defining metadata, parameters, outputs and steps to execute (rather similar to CloudFormation). The AWS docs already provide a great example walkthrough for a very similar use case, which can be used as a starting point. More detailed information can be found under Systems Manager Automation Actions and Automation System Variables.

My automation document for this use case tries to replicate the Packer template from the previous post as closely as possible, but I could not re-create the 'source_ami_filter' property, which automatically found the latest Ubuntu AMI for me. I suspect AWS would advise me to write a Lambda instead (sigh)...
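A minimal sketch of such a document, modeled on the AWS walkthrough (the step names, instance type and image name below are illustrative assumptions, not necessarily my exact document):

```json
{
  "schemaVersion": "0.3",
  "description": "Rebuild Ubuntu AMI without ephemeral block device mappings",
  "assumeRole": "{{ AutomationAssumeRole }}",
  "parameters": {
    "SourceAmiId": { "type": "String", "description": "Source Ubuntu AMI ID" },
    "SubnetId": { "type": "String", "description": "Subnet for the temporary instance" },
    "AutomationAssumeRole": { "type": "String", "description": "IAM role ARN for Automation to assume" }
  },
  "mainSteps": [
    {
      "name": "launchInstance",
      "action": "aws:runInstances",
      "maxAttempts": 1,
      "onFailure": "Abort",
      "inputs": {
        "ImageId": "{{ SourceAmiId }}",
        "InstanceType": "t2.micro",
        "MinInstanceCount": 1,
        "MaxInstanceCount": 1,
        "SubnetId": "{{ SubnetId }}"
      }
    },
    {
      "name": "createImage",
      "action": "aws:createImage",
      "inputs": {
        "InstanceId": "{{ launchInstance.InstanceIds }}",
        "ImageName": "ami-sans-bdm-{{ global:DATE_TIME }}",
        "NoReboot": true,
        "ImageDescription": "AMI without ephemeral block device mappings"
      }
    },
    {
      "name": "terminateInstance",
      "action": "aws:terminateInstances",
      "onFailure": "Continue",
      "inputs": {
        "InstanceIds": [ "{{ launchInstance.InstanceIds }}" ]
      }
    }
  ],
  "outputs": [ "createImage.ImageId" ]
}
```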

Now we need to load our Automation document into SSM. These steps can be done via the AWS Console, but I decided to do it the hard(?) way, with the CLI.

Create automation document in EC2 Systems Manager (SSM):

aws ssm \
create-document \
--name "TVLK-ami-sans-bdm" \
--content "file://TVLK-ami-sans-bdm.json" \
--document-type "Automation"

When successful, you should see JSON output with the document description, type, version and other metadata.

If you need to make a change, update your document by creating a new version with:

aws ssm \
update-document \
--name "TVLK-ami-sans-bdm" \
--content "file://TVLK-ami-sans-bdm.json" \
--document-version "\$LATEST"

Note: while I applaud AWS for incorporating versioning of documents and Lambdas directly into AWS, I would rather have better integration with a popular VCS like Git.

List all documents you have created so far with:

aws ssm \
list-documents \
--document-filter-list "key=Owner,value=self"

To view specific document:

aws ssm \
get-document \
--name "TVLK-ami-sans-bdm"

In case you want to get your JSON back out of the document without metadata and escape characters, you can use this little jq trick:

aws ssm \
get-document \
--name "TVLK-ami-sans-bdm" \
--query 'Content' \
| jq '.|fromjson'

Now we can finally execute our Automation with the following command (just don't forget to substitute real values for AutomationAssumeRole, SubnetId and SourceAmiId; the ones below are placeholders):

aws ssm \
start-automation-execution \
--document-name "TVLK-ami-sans-bdm" \
--parameters "SourceAmiId=ami-12345678,SubnetId=subnet-12345678,AutomationAssumeRole=arn:aws:iam::123456789012:role/AutomationServiceRole"

If all is well, you should get back an AutomationExecutionId as part of the output. Note this value; you will need it to poll the status of your automation run from the CLI (again, substitute the real value for --automation-execution-id):

aws ssm \
get-automation-execution \
--automation-execution-id "12345678-1234-1234-1234-1234567890ab"
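If you'd rather not re-run that command by hand, a small shell loop can poll until the execution reaches a terminal state. The execution ID below is a placeholder, and the list of terminal statuses is my reading of the Automation execution states:

```shell
# is_terminal: succeeds (exit 0) only for terminal Automation statuses
is_terminal() {
  case "$1" in
    Success|Failed|TimedOut|Cancelled) return 0 ;;
    *) return 1 ;;
  esac
}

# Set EXECUTION_ID to your real AutomationExecutionId before running.
EXECUTION_ID="${EXECUTION_ID:-}"
if [ -n "$EXECUTION_ID" ]; then
  # Poll every 30 seconds until the execution finishes
  until is_terminal "$(aws ssm get-automation-execution \
      --automation-execution-id "$EXECUTION_ID" \
      --query 'AutomationExecution.AutomationExecutionStatus' \
      --output text)"; do
    sleep 30
  done
fi
```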

Alternatively, just use the EC2 Console to monitor the execution.

In conclusion, here are my thoughts on SSM Automation so far:

  • No need to connect to a temporary instance.
  • It's asynchronous, and thus a better choice in situations where you don't want to wait for the job to finish.
  • Could be triggered from Lambda or used as a target for CloudWatch Events.
  • Can use values from SSM Parameter Store.
  • Might be preferred by purists who want to rely exclusively on AWS tools and services.
  • It's SLOW. Automation appears to poll for the status change of each step at roughly two-minute intervals, which results in much slower execution compared to Packer (in this case: ~11 min vs ~2 min). Packer seems to poll for events more often and thus moves along faster.
  • It's still new and BUGGY. At the time of this writing, I found and reported a bug that caused interpolation of the global variable global:REGION inside the 'BuildRegion' tag to fail. If you run this example before the bug is fixed, you should see the error "Parameter: global:REGION is not defined in the Automation Document" in the output of the final step.