Resilience

Looking back over the first three posts on Terraform, we started with a high-level overview of what Terraform is, showing how easily the tool can deploy repeatable infrastructure as code. This was followed by a detailed two-post walkthrough of deploying a LAMP stack, a core part of a legacy three-tier application. This post is about creating resilient Terraform code.

Those posts should have given you a firm grounding in Terraform’s power and a good handle on its common syntax model. Today’s post has an operational bias. Your manager loved the speed with which you deployed the LAMP stack, but has noticed that there is no resilience in the deployment. They have therefore asked that the LAMP stack be placed behind a load balancer and deployed to multiple availability zones for additional resiliency.

However, before we move on to building that resilience into the environment, let’s have a look at some DevOps principles and how they can be applied to your code to make it stable, reliable and reusable.

We will start by looking at separation of environments (Dev, Test, Stage and Production) and then move on to an investigation into the concept of State to create resilient Terraform code.

In addition, if you cast your mind back to the original posts, you will remember that we had our AWS access and secret keys embedded in the deployment script. This is most definitely NOT recommended, and we need to do something about it to increase the security of the deployment.

What is State in regard to Terraform

State is how Terraform understands what has been deployed once a Terraform template has been run and the resultant infrastructure has been created.

Terraform stores this information in a JSON-formatted file called terraform.tfstate. This file is created once a Terraform template has been run; below is an excerpt of the resultant file.

{
  "version": 4,
  "terraform_version": "0.12.9",
  "serial": 2354,
  "lineage": "#?#??###-#??#-####-##??-??####?#?###",
  "outputs": {
    "db_server_address": {
      "value": "mysqldb.?#??????#???.us-east-1.rds.amazonaws.com",
      "type": "string"
    },
    "web_server_address": {
      "value": "ec2-##-###-###-##.compute-1.amazonaws.com",
      "type": "string"
    }
  },

The full file is quite large even for our fairly simple deployment. The current deployment of a single node webserver and RDS MySQL environment generated a state file well in excess of 900 lines.

It is important that you do not edit this file by hand, as it is how Terraform understands the current state of your environment. Every time a Terraform plan or apply is run against the same template, or a modified template against the same infrastructure, Terraform compares your deployed environment with the statefile to make sure everything is in sync, and the resulting plan reflects the changes to be made to the environment.

In a single-person environment such as the one we are currently using, there are no concerns about multiple people deploying infrastructure into the environment. In production, however, a single person writing and deploying the code is unlikely to be the norm; there will be a team of developers and deployers. It is therefore recommended that the statefile is kept in a shared store, which also allows the file to be locked so that the environment cannot get out of sync. Your statefile is the single source of truth for your Terraform-deployed environment.

As we are currently deploying into AWS, let’s use the recommended method for AWS and utilize an S3 bucket. To do this we need to create a new folder, which we will name state, and then create a new main.tf file in it with the following:

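# The resource name and bucket name below are placeholders; substitute
# your own (S3 bucket names must be globally unique).
resource "aws_s3_bucket" "terraform_state" {
  bucket = "<your bucket name here>"

  # Enable versioning so we can see the full revision history of our
  # state files
  versioning {
    enabled = true
  }

  # Enable server-side encryption by default
  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}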

Once this bucket is configured as the Terraform backend, we reduce the chance of manual error: each time a member of the team runs a plan or apply, they automatically load the same centralized statefile. When a Terraform plan, apply or destroy command is run, Terraform locks the state, which prevents any other attempt to run the template until the previous session has completed. Note that with an S3 backend this locking has traditionally relied on an accompanying DynamoDB table rather than on the bucket alone. Also, because the bucket has versioning enabled, you can roll back to earlier revisions of your statefile, which is very useful when troubleshooting.
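The bucket on its own does not tell Terraform to keep its state there. A minimal sketch of that wiring, assuming a DynamoDB table is used for locking, might look like the following. The table name terraform-locks, the state key, the region and the bucket placeholder are illustrative assumptions, not values from the original deployment.

```hcl
# --- In the state folder, alongside the S3 bucket ---
# Lock table for the S3 backend. The backend expects the hash key to be
# named "LockID". The table name is a hypothetical example.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

# --- In each configuration that should share the state ---
# Backend blocks cannot use variables, so the values are written literally.
terraform {
  backend "s3" {
    bucket         = "<your bucket name here>" # placeholder
    key            = "terraform.tfstate"       # assumed key
    region         = "us-east-1"               # assumed region
    dynamodb_table = "terraform-locks"         # must match the table above
    encrypt        = true
  }
}
```

The bucket and table have to exist before the backend block is added, so the state folder is applied first and terraform init is then re-run in the configurations that use it. Check the S3 backend documentation for the locking options available in your Terraform version.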

Terraform State file Isolation via File Layout

Next we need to look at isolating the statefile. Currently all our template files are stored in a single folder, which isn’t best practice: a single mistake could destroy the entire environment. It is common practice to have a number of environments for your deployments as you move from development to production. A basic setup would consist of Development, Test/QA, Staging and Production. Your code would be deployed from separate folders, with each environment protected by different access keys and secrets and given its own statefile. At the same time we should start to separate out the services. For example, networks, access, IAM and storage will rarely change in a production environment, so separating them from objects that may change daily, or even more frequently, reduces the impact of changes gone wrong.

Below is an outline of the file layout of the new production-ready environment for our enhanced LAMP stack deployment. Note that Development and QA/Test are not shown, but they would replicate Stage and Production:

stage
   └ vpc
   └ services
      └ frontend-app
      └ backend-app
         └ main.tf
         └ outputs.tf
         └ variables.tf
   └ data-storage
      └ mysql
      └ redis
prod
   └ vpc
   └ services
      └ frontend-app
      └ backend-app 
         └ main.tf 
         └ outputs.tf 
         └ variables.tf
   └ data-storage
      └ mysql
      └ redis
mgmt
   └ vpc
   └ services
      └ bastion-host
      └ jenkins
global
   └ iam
   └ s3

Let’s have a quick discussion about the file structure above.

  • stage: An environment for pre-production workloads (i.e., testing).
  • prod: An environment for production workloads (i.e., user-facing apps).
  • mgmt: An environment for DevOps tooling (e.g., bastion host, Jenkins, monitoring).
  • global: A place to put resources that are used across all environments (e.g., S3, IAM).

Within each environment, there are separate folders for each component. The components differ for every project, but the typical ones are:

  • vpc: The network topology for this environment.
  • services: The apps or microservices to run in this environment, such as a Ruby on Rails frontend or a Java backend. Each app could even live in its own folder to isolate it from all the other apps.
  • data-storage: The data stores to run in this environment, such as MySQL or Redis. Each data store could even live in its own folder to isolate it from all other data stores.

Within each component, there are the actual Terraform configuration files, which are organized according to the following naming conventions:

  • variables.tf: Input variables.
  • outputs.tf: Output variables.
  • main.tf: The actual resources that are to be deployed, modified or destroyed.

This separation of tasks allows greater granularity on deployments.
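One way to tie this layout to separate statefiles is to give each component folder its own backend key that mirrors its path. The following is a sketch under assumptions: the bucket placeholder and region are illustrative, and the file is imagined to live in stage/services/backend-app.

```hcl
# stage/services/backend-app/main.tf (hypothetical location)
terraform {
  backend "s3" {
    bucket = "<your bucket name here>" # placeholder

    # The key mirrors the folder path, so this component gets a statefile
    # of its own. The prod copy would use
    # "prod/services/backend-app/terraform.tfstate" instead.
    key    = "stage/services/backend-app/terraform.tfstate"
    region = "us-east-1" # assumed region
  }
}
```

With this arrangement, a mistake made while working in one folder can only affect the state of that one component.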

Protecting AWS Secrets and Access Keys when using Terraform

The single most important thing when hardening your code is to protect your secrets and access keys. The last thing you want or need is for these to leak to the public. Your AWS bill will be huge within hours and your keys will get passed around all the places you’re not allowed to browse at work…

Access and Secret Key Management

There are many tools for protecting these. The tool we are going to use is Vault. This tool is also from HashiCorp and comes with a free and open source version. If you need greater functionality than the free version, HashiCorp provides an enterprise version. Vault allows you to centralize the holding of secrets and access keys together with other types of credentials. These are then used to log in to your AWS environment and create a new set of one-time-only access and secret keys to deploy your environments or log in to an environment. This is obviously a vast improvement in identity management and security, as keys can be used only once, or for a very limited amount of time.

This post is not going to delve into how to deploy or configure Vault; more on this in a later post. For now, refer to the HashiCorp documentation for information on Vault.

Note: Remember to set the environment variables VAULT_ADDR and VAULT_TOKEN in the deployment environment.
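As a brief illustration of why those environment variables matter: when the Vault provider block sets neither address nor token, the provider falls back to VAULT_ADDR and VAULT_TOKEN. The sketch below shows that alternative; it is not the configuration used in the rest of this post.

```hcl
# Sketch: with no arguments set, the Vault provider reads its address from
# VAULT_ADDR and its token from VAULT_TOKEN in the environment.
provider "vault" {}
```

The benefit of this form is that the token never has to be written into a .tf file.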

Download the relevant version for your operating system and install it. In our case it will run locally on my machine; in a production environment it would run as a service on a server so that all secrets could be shared and their usage monitored.

Once the installation has been confirmed and configured, the variable file first needs to be updated with the Vault root token, the address and access port of the Vault server, and the AWS access and secret keys that Vault will use.

Two new variables will need to be added to the variables.tf file:

# The address must be a full URL including the scheme (http:// or https://).
variable "vault_addr" { default = "<your servername here>:8200" }
variable "vault_token" { default = "<Vault Token Here>" }
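Because a default value ends up in the file, and therefore in version control, a more cautious variant is to declare the variables without defaults. The following is only a sketch; the descriptions are illustrative wording.

```hcl
# Sketch: no defaults, so nothing secret is stored in variables.tf.
# Terraform prompts for the values, or reads them from the environment
# variables TF_VAR_vault_addr and TF_VAR_vault_token.
variable "vault_addr" {
  type        = string
  description = "URL of the Vault server, including scheme and port"
}

variable "vault_token" {
  type        = string
  description = "Token used to authenticate against Vault"
}
```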

Once configured, run the following code to enable the creation of short-lived, one-time-use user and access credentials.

provider "vault" {
  address = "${var.vault_addr}"
  token   = "${var.vault_token}"
}

resource "vault_aws_secret_backend" "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region     = "us-east-1"

  default_lease_ttl_seconds = "120"
  max_lease_ttl_seconds     = "240"
}

resource "vault_aws_secret_backend_role" "ec2-admin" {
  backend         = "${vault_aws_secret_backend.aws.path}"
  name            = "ec2-admin-role"
  credential_type = "iam_user"

  policy_document = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:*", "ec2:*"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

What is the script doing?

The provider block tells Terraform to use the Vault provider, which is downloaded when terraform init is run, and configures the address and the access token needed to use the environment.

The resource sections define two entities to be created:

resource "vault_aws_secret_backend" "aws" {
  access_key = "${var.access_key}"
  secret_key = "${var.secret_key}"
  region     = "us-east-1"

  default_lease_ttl_seconds = "120"
  max_lease_ttl_seconds     = "240"
}

The first resource stanza gives Vault access to the AWS environment using your pre-defined access and secret keys, and sets the region that will be used. Finally, it sets the time-to-live for the ephemeral credentials: a default of 120 seconds and a maximum of 240 seconds.

resource "vault_aws_secret_backend_role" "ec2-admin" {
  backend         = "${vault_aws_secret_backend.aws.path}"
  name            = "ec2-admin-role"
  credential_type = "iam_user"

  policy_document = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:*", "ec2:*"
      ],
      "Resource": "*"
    }
  ]
}
EOF
}

The second resource stanza creates the role ec2-admin-role and assigns the necessary user policy, here full access to IAM and EC2 actions.

Running this code in a dedicated admin environment, you can see the user being created. This also shows that the split of the script into variables, resources and outputs has been successful.

Looking at the Vault terminal session, you can verify that the code has run through Vault.

Vault Log showing Credential creation

The next step is to modify our current deployment code to point to the vault to enable real time short term credentials for authentication.

Running the LAMP Deployment again

We now need to rework the code to take advantage of the increased security that the Vault server supplies.

The first thing you should notice about the new script is that there are no variables defined in it; these have been moved out to a separate Terraform file called variables.tf. We have also added two new variables:

# The address must be a full URL including the scheme (http:// or https://).
variable "vault_addr" { default = "<your servername here>:8200" }
variable "vault_token" { default = "<Vault Token Here>" }

After removing the variables from the original script we add the following stanzas to the code.

provider "vault" {
  address = "${var.vault_addr}"
  token   = "${var.vault_token}"
}

data "vault_aws_access_credentials" "creds" {
  backend = "aws"
  role    = "ec2-admin-role"
}

These enable the vault provider and set the credentials to be utilized. Next the AWS provider code needs to be modified to read:

provider "aws" {
  access_key = "${data.vault_aws_access_credentials.creds.access_key}"
  secret_key = "${data.vault_aws_access_credentials.creds.secret_key}"
  region     = "${var.region}"
}

For completeness we also split out the outputs into a file called outputs.tf, tidying up the code some more.

output "backend" {
  value = "${vault_aws_secret_backend.aws.path}"
}

output "role" {
  value = "${vault_aws_secret_backend_role.ec2-admin.name}"
}

When we save the code and run it, Terraform now looks to the Vault server for the credentials needed to access AWS. Once those credentials have been received, the script runs exactly as before, but with the added security of short-lived, one-time-use credentials.
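To confirm that the credentials really are short-lived, the lease details that Vault returns with them can be surfaced as outputs. This is a sketch only: the output names are hypothetical, and it assumes the vault_aws_access_credentials data source named creds from the stanza above, written in Terraform 0.12 expression syntax.

```hcl
# Sketch: expose lease metadata (not the keys themselves) for the
# credentials that Vault issued for this run.
output "aws_creds_lease_duration" {
  # Lease length in seconds, as reported by Vault
  value = data.vault_aws_access_credentials.creds.lease_duration
}

output "aws_creds_lease_start_time" {
  # Timestamp at which the lease was issued
  value = data.vault_aws_access_credentials.creds.lease_start_time
}
```

With the TTL settings used earlier, the reported lease duration should be no more than a few minutes.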

Summary

 

This has been quite a long post, but we have investigated several major points in securing our environment and looked at some tenets of proper coding.

In our next post we move on to what our manager asked us to do: introduce resilience, and start to break out the code into smaller chunks so that code that changes less often lives in different files. I firmly believe that our mythical manager will be happy that we have increased the security of the environment and created a proper DevOps pipeline to support the continuing development of the environment.