Step Auto Scaling Group (ASG) Deployer a.k.a. Asgard

Deploy your 12-factor applications to AWS easily and securely with the Step Auto-Scaling Group (ASG) Deployer (Asgard).

Asgard's goals/requirements/features are:

Ephemeral Blue/Green: create new instances, wait for them to become healthy, delete old instances.
Declarative: describe what a successful release looks like, not how to deploy it.
Scalable: can scale both vertically (larger instances) and horizontally (more instances).
Secure: resources are verified to ensure that they cannot be used accidentally or maliciously.
Gracefully Fail: handle failures to recover and roll back with no/minimal impact to users.
Configuration Parity: minimize divergence between production, staging and development environments by keeping releases as similar as possible.
Cattle not Pets: treat compute instances as disposable and ephemeral.
No Deployer Configuration: no configuration and minimal setup needed to get Asgard up and running.
Multi Account: one deployer for all AWS accounts.

Getting Started

Asgard is made of an AWS Lambda Function (with a role) and an AWS Step Function. You can bootstrap these into AWS with:

git pull # pull down new code
./scripts/bootstrap

Testing Asgard with deploy-test

Asgard includes a test project, deploy-test, that has one service, web, which is an nginx server mounted behind an Elastic Load Balancer (ELB) and a Load Balancer target group. The service instances have a security group and an instance profile.

To create the AWS resources for deploy-test:

./scripts/geo apply resources/deploy-test-resources.rb

Note: you will also have to tag the latest Ubuntu AMI so that the release can reference it (see Resources).

A deploy-test release file deployer-test-release.json looks like:

{
  "project_name": "coinbase/deploy-test",
  "config_name": "development",
  "subnets": ["test_private_subnet_a", "test_private_subnet_b"],
  "ami": "ubuntu",
  "user_data": "{{USER_DATA_FILE}}",
  "services": {
    "web": {
      "instance_type": "t2.nano",
      "security_groups": ["ec2::coinbase/deploy-test::development"],
      "elbs": ["coinbase-deploy-test-web-elb"],
      "profile": "coinbase-deploy-test",
      "target_groups": ["coinbase-deploy-test-web-tg"]
    }
  }
}

The user data for the release is stored in the file deployer-test-release.json.userdata:

#cloud-config
repo_update: true
repo_upgrade: all

packages:
 - docker.io

runcmd:
 - docker run -d --restart always --name test_server -p 8000:80 nginx

To build a release for deploy-test and send it to Asgard we use the step-asg-deployer executable:

step-asg-deployer deploy deploy-test-release.json

Asgard then:

validates the sent release and any referenced resources.
creates a new auto-scaling group for web which is configured to start an nginx server.
waits for the EC2 instances in the web ASG to become healthy behind the ELB and target group. Healthy means that the health checks for both ELB and target group pass.
terminates the old ASG and its instances once the new instances are healthy.

Asgard Release

An Asgard release is a request to deploy a Project-Configuration where:

A Project is a code-base typically named with org/name.
A Configuration is the environment the project is being deployed into, e.g. development, production.

Each release can define 1-to-many Services; each service is a logical group of servers, e.g. web or worker, that maps to a single auto-scaling group (ASG).

When Asgard is sent a release, it moves it through a state machine:

Validate: validate the release is correct.
Lock: grab a lock on the project-configuration.
ValidateResources: validate resources w.r.t. the project, configuration and service using them.
Deploy: create an ASG and other resources for each service.
CheckHealthy: check whether the new instances are healthy w.r.t. their ASGs, ELBs and target groups. If instances are seen to be terminating, immediately halt the release.
CleanUpSuccess: if the release was a success, then delete the old ASGs.
CleanUpFailure: if the release failed, delete the new ASGs.
ReleaseLockFailure: try to release the lock and fail.

At each of these states it is possible to fail and then move towards a failure state. The typical failures are:

BadReleaseError: The release sent was invalid because either its structure was incorrect, its values were invalid, or its resources were invalid.
LockExistsError: Could not grab the lock because either another deploy for the project-configuration is currently going out, or a previous deploy left a lock in place.
DeployError: Unable to create a new ASG or resource.
HaltError: Halt was detected or instances were found terminating.
TimeoutError: The deploy took too long and failed.

The end states are:

Success: the release went as planned.
FailureClean: release was unsuccessful, but cleanup was successful, so AWS was left in a good state.
FailureDirty: release was unsuccessful, but cleanup failed, so AWS was left in a bad state. This should never happen; if it does, raise an alert and file a bug.
It is possible to not end in one of these states if the state machine is incorrect. This is very bad; if it happens, raise an alert and file a bug.
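
The three end states above can be summarised in a few lines of Python. This is only an illustrative sketch: the names EndState, end_state and should_alert are hypothetical and not part of Asgard, and the source does not describe what happens if cleanup fails after a successful release, so that case is ignored here.

```python
from enum import Enum


class EndState(Enum):
    SUCCESS = "Success"
    FAILURE_CLEAN = "FailureClean"
    FAILURE_DIRTY = "FailureDirty"


def end_state(release_succeeded: bool, cleanup_succeeded: bool) -> EndState:
    """Classify a finished release (hypothetical helper, not Asgard's API)."""
    if release_succeeded:
        # Assumption: cleanup problems after a successful release are not modelled.
        return EndState.SUCCESS
    if cleanup_succeeded:
        # The release failed, but the new ASGs were removed again.
        return EndState.FAILURE_CLEAN
    # The release failed and cleanup failed too: AWS is left in a bad state.
    return EndState.FAILURE_DIRTY


def should_alert(state: EndState) -> bool:
    """Only FailureDirty is described as needing an alert and a bug report."""
    return state is EndState.FAILURE_DIRTY
```

An execution that finishes in none of these states would also need an alert, but that has to be detected outside a helper like this, for example by inspecting the step function's execution history.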

Resources

A release uses resources that must exist and be configured correctly to be used for the project-configuration-service being deployed.

A release must have:

an AMI defined with the ami key, which can be either a Name tag or an AMI ID, e.g. ami-1234567
Subnets defined with the subnets key, which is a list of either Name tags or Subnet IDs, e.g. subnet-1234567

Both the above resources MUST have a tag DeployWith that equals step-asg-deployer.

Services can have:

Security Groups defined with the security_groups key, which is a list of security group Name tags
Elastic Load Balancers defined with the elbs key, which is a list of ELB names
Application Load Balancer Target Groups defined with the target_groups key, which is a list of target group Name tags

All the above resources MUST be tagged with the ProjectName, ConfigName and ServiceName of the release to ensure that resources are assigned correctly.

Services can also have an Instance Profile defined by the profile key, which is an instance profile Name tag. The role's path MUST be equal to /<project_name>/<config_name>/<service_name>/.
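
The tag and path rules in this section amount to a few simple checks. The sketch below works on plain dictionaries of tags rather than real AWS API responses, and the function names are hypothetical; it is meant to show the rules, not how Asgard implements them.

```python
DEPLOY_WITH = "step-asg-deployer"


def release_resource_ok(tags: dict) -> bool:
    """AMIs and subnets must opt in with the tag DeployWith=step-asg-deployer."""
    return tags.get("DeployWith") == DEPLOY_WITH


def service_resource_ok(tags: dict, project: str, config: str, service: str) -> bool:
    """Security groups, ELBs and target groups must carry all three release tags."""
    return (
        tags.get("ProjectName") == project
        and tags.get("ConfigName") == config
        and tags.get("ServiceName") == service
    )


def expected_role_path(project: str, config: str, service: str) -> str:
    """Path that the role behind a service's instance profile must have."""
    return f"/{project}/{config}/{service}/"
```

For the deploy-test release, expected_role_path("coinbase/deploy-test", "development", "web") gives /coinbase/deploy-test/development/web/.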

Scale

Asgard makes it easy to scale both vertically and horizontally. To scale deploy-test we add to the release:

{ ...
  "services": {
    "web": { ...
      "instance_type": "c4.xlarge",
      "ebs_volume_size": 20,
      "ebs_volume_type": "gp2",
      "ebs_device_name": "/dev/sda1",
      "autoscaling": {
        "min_size": 3,
        "max_size": 5,
        "spread": 0.2,
        "max_terms": 1,
        "policies": [
          {
            "type": "cpu_scale_up",
            "threshold" : 25,
            "scaling_adjustment": 2
          },
          {
            "type": "cpu_scale_down",
            "threshold" : 15,
            "scaling_adjustment": -1
          }
        ]
      }
    }
  }
}

instance_type is the EC2 instance type for the service
ebs_volume_size, ebs_volume_type, ebs_device_name define the attached EBS volume; ebs_volume_size is in GB.

The autoscaling key defines the horizontal scaling of a service:

all calculations are bounded by min_size and max_size.
the desired_capacity is equal to the min_size or the capacity of the previously launched service
the actual number of instances launched is the desired_capacity * (1 + spread)
to be deemed healthy, the service must have at least desired_capacity * (1 - spread) healthy instances
if the number of terminating instances is greater than or equal to max_terms (default 0), the release immediately halts.
the policies defined above increase the desired_capacity by 2 instances if CPU goes above 25% and reduce it by 1 instance if CPU drops below 15%.

Both spread and max_terms are useful when launching many instances, because as scale increases the number of cloud errors increases.
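
The capacity arithmetic above can be worked through with a small sketch. The function names are hypothetical, and two details are assumptions because the text does not specify them: fractional results are rounded up, and every result (including the healthy threshold) is clamped to min_size and max_size. Decimal is used only to avoid floating-point surprises.

```python
import math
from decimal import Decimal


def clamp(value: int, min_size: int, max_size: int) -> int:
    """Bound a value by min_size and max_size."""
    return max(min_size, min(max_size, value))


def capacities(min_size: int, max_size: int, spread: float, previous_capacity=None):
    """Return (desired_capacity, instances_to_launch, instances_needed_healthy)."""
    s = Decimal(str(spread))
    # desired_capacity is min_size, or the capacity of the previous service if known.
    start = min_size if previous_capacity is None else previous_capacity
    desired = clamp(start, min_size, max_size)
    # Assumption: round up, then clamp to the configured bounds.
    launch = clamp(math.ceil(Decimal(desired) * (1 + s)), min_size, max_size)
    healthy = clamp(math.ceil(Decimal(desired) * (1 - s)), min_size, max_size)
    return desired, launch, healthy
```

With the values from the example release (min_size 3, max_size 5, spread 0.2) and no previous service, capacities(3, 5, 0.2) returns (3, 4, 3) under these assumptions: 4 instances are launched and 3 must become healthy.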

User Data

Do not put sensitive data into user data. User data is not treated by Asgard as secure information, it is difficult to secure with IAM, and it is very limited in size. We recommend using Vault, AWS Parameter Store, or KMS-encrypted S3 authenticated by a service's instance profile.

The user_data in the release is the plain text instance metadata sent to initialize each instance. Asgard will replace some strings with information about the release, project, config and service, e.g.:

...
write_files:
  - path: /
    content: |
      {{RELEASE_ID}}
      {{PROJECT_NAME}}
      {{CONFIG_NAME}}
      {{SERVICE_NAME}}

Asgard will replace {{PROJECT_NAME}} with the name of the project and {{SERVICE_NAME}} with the name of the service. This can be useful for getting service specific configuration and logging.

If user_data is equal to {{USER_DATA_FILE}} and the release is deployed with step-asg-deployer, the value is replaced with the contents of <release_file>.userdata, e.g. deployer-test-release.json.userdata.
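
The substitution described above behaves like a plain string replacement. The sketch below illustrates that idea only; render_user_data and user_data_path are hypothetical helpers, not functions from step-asg-deployer.

```python
def user_data_path(release_file: str) -> str:
    """File that {{USER_DATA_FILE}} refers to for a given release file."""
    return release_file + ".userdata"


def render_user_data(template: str, release_id: str, project: str,
                     config: str, service: str) -> str:
    """Replace the release placeholders in a user data template."""
    replacements = {
        "{{RELEASE_ID}}": release_id,
        "{{PROJECT_NAME}}": project,
        "{{CONFIG_NAME}}": config,
        "{{SERVICE_NAME}}": service,
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template
```

Because each service gets its own rendered copy, the same template can produce different user data for web and worker instances.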

Timeout

A release can have a timeout, which is how long in seconds the release will wait for its services to become healthy. By default the timeout is 10 minutes. The maximum value is around a year (31556926 seconds), since that is how long a step function can run.
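
As a small illustration of these limits, the sketch below resolves the effective timeout for a release held as a dictionary. The key name timeout and the decision to reject out-of-range values are assumptions made for this example; the text above only states the default and the practical maximum.

```python
DEFAULT_TIMEOUT_SECONDS = 10 * 60   # default of 10 minutes
MAX_TIMEOUT_SECONDS = 31556926      # roughly one year, the step function limit


def resolve_timeout(release: dict) -> int:
    """Return the timeout in seconds for a release (hypothetical helper)."""
    # Assumption: the release stores its timeout under the key "timeout".
    timeout = release.get("timeout", DEFAULT_TIMEOUT_SECONDS)
    if not 0 < timeout <= MAX_TIMEOUT_SECONDS:
        raise ValueError(f"timeout must be between 1 and {MAX_TIMEOUT_SECONDS} seconds")
    return timeout
```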

Lifecycle

AWS provides Auto Scaling Group Lifecycle Hooks to detect and react to auto-scaling events. You can add the lifecycle hooks to the ASGs with:

{ ...
  "lifecycle": {
    "termhook" : {
      "transition": "autoscaling:EC2_INSTANCE_TERMINATING",
      "role": "asg_lifecycle_hooks",
      "sns": "asg_lifecycle_hooks",
      "heartbeat_timeout": 300
    }
  }
}

These can be used to gracefully shut down instances, which is necessary if a service has long-running jobs, e.g. a worker service.

Halt

Asgard supports manually stopping a release while it is being deployed. Just execute:

step-asg-deployer halt deploy-test-release.json

This will:

Find the currently running deploy for the project configuration
Write a halt file to S3
Wait for Asgard to detect the halt file and fail the deploy

Halt does not guarantee that the release will not be deployed. If it is executed too late, the release may still result in success.

DO NOT use Stop execution on the Asgard step function, as it will not clean up resources and will leave AWS in a bad state.

Security

Deployers are critical pieces of infrastructure as they may be used to compromise software they deploy. As such, we take security very seriously around the step-asg-deployer and try to answer the following questions:

Authentication: Who can deploy?
Authorization: What can be deployed?
Replay and Man-in-the-middle (MITM): Can some unauthorized person edit or reuse a release to change what is deployed?
Audit: Who has done what, and when?

Authentication

The central authentication mechanisms are the AWS IAM permissions for step functions and S3.

By limiting the autoscaling:CreateAutoScalingGroup permission, the Asgard function becomes the only way to deploy ASGs. Limiting who can call states:StartExecution for Asgard then limits who can deploy.

Ensuring that Asgard's Lambda can only access a single S3 bucket further limits who can deploy:

{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject*", "s3:PutObject*",
    "s3:List*", "s3:DeleteObject*"
  ],
  "Resource": [
    "arn:aws:s3:::#{s3_bucket_name}/*",
    "arn:aws:s3:::#{s3_bucket_name}"
  ]
},
{
  "Effect": "Deny",
  "Action": ["s3:*"],
  "NotResource": [
    "arn:aws:s3:::#{s3_bucket_name}/*",
    "arn:aws:s3:::#{s3_bucket_name}"
  ]
},

Who can execute the step function, and who can upload to S3 are the two permissions that guard who can deploy.

Authorization

All resources that can be used in an Asgard deploy must opt in using tags or paths. Additionally, service resources require specific tags or paths denoting which project/config/service can use them.

Assets uploaded to S3 are stored under the path /<ProjectName>/<ConfigName>, so restricting s3:PutObject to a path limits which project-configurations a user can deploy or halt.

Replay and MITM

For each release, the client generates a release_id and a created_at date, and uploads the release to S3.

The step-asg-deployer will reject any request where the created_at date is not recent, or the release sent to the step function and S3 don't match. This means that if a user can invoke the step function, but not upload to S3 (or vice-versa) it is not possible to deploy old or malicious code.
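
The two checks described above can be sketched as follows. This is not Asgard's implementation: the function name, the 5-minute window used for "recent", the ISO 8601 timestamp format, and the comparison of whole release dictionaries are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Assumption: the source only says "recent"; 5 minutes is an illustrative window.
MAX_RELEASE_AGE = timedelta(minutes=5)


def release_is_acceptable(sent: dict, stored: dict, now: datetime) -> bool:
    """Reject stale releases and releases that differ from the copy in S3."""
    # Assumption: created_at is an ISO 8601 string with a UTC offset.
    created_at = datetime.fromisoformat(sent["created_at"])
    if abs(now - created_at) > MAX_RELEASE_AGE:
        return False  # too old (or too far in the future): possible replay
    # The release sent to the step function must match the one uploaded to S3.
    return sent == stored
```

A caller who can start the step function but cannot write to S3 fails the second check, and one who can only write to S3 never reaches the check at all, which is the property the section relies on.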

Audit

Working out what happened and when is very useful for debugging and security response. Step functions make it easy to see the history of all executions in the AWS console and via the API. S3 can log all access to CloudTrail, so collecting from these two sources will show all information about a deploy.

Continuing Deployment

There is always more to do:

Allow Lifecycle Hooks to send to CloudWatch.
Subnet, AMI, life cycle and userdata overrides per service.
Check EC2 instance limits and capacity before deploying.
Slowly scale instances up rather than all at once, e.g. deploy 1 instance, check it is healthy, then deploy the rest.
Add ELB and Target Group error rates when checking healthy.
Custom auto-scaling policy types.