In one of the previous articles in this series on Cloud Computing in AWS, we looked at ECS on Fargate. This article covers a very similar setup that offers much more flexibility, but also comes with higher operational effort.
In our series, we emphasize examples that are suitable for a production environment and thus include aspects such as network security, access permissions, scalability, and operational effort. All code examples are nevertheless reduced to a minimum for clarity and, for example, deliberately omit tags. All infrastructure is configured as infrastructure-as-code with Terraform for easier reproducibility and can serve as a basis for your own setups.
Please note that provisioning AWS infrastructure resources may incur costs. Our Terraform-based tutorials help you remove all resources with a single `terraform destroy` command to avoid unintended costs.
At Gyden, we help startups and SMEs build their Cloud Infrastructure and guide them in gaining technical expertise. The starting point is often a complex setup such as a container-based application in a highly available environment, where AWS ECS plays an important role as one of the most flexible Cloud Computing technologies on the market.
You can find the entire tutorial on GitHub, including all the source code to create a fully functional setup based on ECS on EC2 with Terraform. The only requirement is a top-level domain hosted on AWS (like `example.com`). More information, including the command to clone the repository, can be found in the README of the project.
ECS EC2 – Container orchestration without Kubernetes
Just like Fargate, ECS EC2 is a runtime environment for services that run in Docker containers and offer maximum flexibility when it comes to creating applications of any size and type. A large part of the setup is accordingly also identical to the variant with Fargate, but here we have to take care of the EC2 instances (also called Container Instances in the ECS context) ourselves. The target architecture looks like this:
Architecture overview for ECS on EC2
In our scenario, we create the basis for a multi-tier architecture, but focus on one layer that serves as the presentation tier. The service is accessible via the Internet and is designed for High Availability (HA). This is achieved by deploying it in multiple Availability Zones (AZ), while dynamic scaling is ensured by an Autoscaling TargetTracking Policy.
A comparison with ECS Fargate
In principle, we can of course achieve a production-ready result with both EC2 and Fargate. The fundamental difference is that Fargate is the serverless variant of ECS: the entire management of the underlying infrastructure is taken over by AWS, and Amazon naturally charges for this convenience.
The following points should be considered when comparing EC2 versus Fargate:
- Above a certain utilization level, Fargate becomes more expensive than EC2. Determining the exact break-even point is difficult because of the many variables, but as a rule of thumb, Fargate tends to be cheaper as long as memory utilization in the cluster stays below roughly 50%, while EC2 becomes the cheaper option at higher utilization. AWS gives a good insight into the pricing structure in this blog post.
- When using Fargate, the team does not have to worry about managing clusters, instances, containers, task definitions or other computing infrastructure like Autoscaling Groups at all. Depending on the team structure and existing knowledge, this can result in considerable cost savings, but the amount varies greatly from company to company and is difficult to quantify.
- ECS on EC2 offers much more flexibility when it comes to selecting specific instance types or installing and running daemon tasks, agents and the like on the EC2 Instances. Fargate does not offer this flexibility. For example, if you want to run the Datadog Agent as one daemon task per instance, this is not possible with Fargate.
- The mounting of EBS Volumes is not possible with Fargate. Likewise, the use of GPUs in Fargate clusters is not supported.
So, in summary, we can say that ECS on EC2 is the tool of choice especially if you have the necessary know-how and staff to manage the infrastructure, need a high degree of (or complete) flexibility over the infrastructure, and/or need to run special software such as standalone daemon tasks. We should not forget that we can of course also mix within AWS and run some services on Fargate and others on EC2 Instances. By the way, Spot Instances are supported by both ECS variants.
ECS on EC2 and CI/CD
When selecting technologies and strategies for a cloud-based infrastructure setup, the first question is always whether the technology supports a modern development approach with continuous integration and continuous deployment (CI/CD) as well as infrastructure-as-code. Fortunately, the latter is a given with all major Cloud Providers such as AWS, Google Cloud or Microsoft Azure thanks to Terraform, Pulumi or AWS’s own CloudFormation offering. For CI/CD, the following criteria must be met:
- All infrastructure must be able to be deployed and provisioned automatically.
- It must be possible to replace running services with a new version on a rolling basis without interrupting ongoing operations.
- The infrastructure and all changes must be versioned and available as code.
- We want to support different environments like `development` or `production` using one code base.
- All important components and resources such as instances must be monitorable with the help of metrics.
In this article, we focus primarily on AWS-specific resources. We will look at the setup for automating the deployments, the project structure, and the underlying development principles in detail in a separate article, since the topic of ECS on EC2 is already complex enough on its own.
Overview of the main components
The most important components in our setup are the following:
- Route 53 for all routing and domain settings.
- Elastic Container Registry (ECR) for storing the container images that contain our service
- Certificate Manager for creating SSL certificates for the Application Load Balancer (ALB) and CloudFront distribution
- Application Load Balancer (ALB) and Target Groups/Listeners for routing requests to the different container instances
- Autoscaling Group (ASG) and Autoscaling Policies for elastic scaling of EC2 instances
- Launch Template with the definition of the EC2 instances, which are the basis for ECS
- Bastion Host to gain SSH access to the EC2 Instances for debugging purposes
- CloudFront Distribution for fast worldwide access to our service via AWS Edge Locations, as well as the option to use AWS WAF against SQL injections and similar attacks and AWS Shield against DDoS attacks
- Elastic Container Service (ECS) Cluster, ECS Service and ECS Tasks for running our service
- Security Groups for detailed configuration of network-level firewall rules for EC2 and ALB
- VPC for all subnets and Internet Gateway for inbound and outbound network access to the public subnets
- Private subnets including NAT gateway for the ECS service
- Public subnets for the ALB
As you will see, we consistently use a unique name with the environment (e.g. `dev`) for naming resources. This allows us to provision resources from multiple environments (such as `staging` or `pre-production`) in the same AWS account in a real-life scenario.
Remember that you can find the full code base with a working local setup in our GitHub repository.
Let’s start with a few basic components, namely a Hosted Zone, the certificates and an Elastic Container Registry (ECR).
Route 53, ECR and Certificate Manager
For our example, we assume that our service will run under a subdomain, for example `https://service.example.com`. This requires us to create a new Hosted Zone in Route 53 for the subdomain and add its nameserver (NS) records to the Hosted Zone of the top-level domain `example.com`.
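A minimal sketch of what this can look like (zone names are illustrative, and the TTL is a common default rather than a requirement):

```hcl
# Hosted zone for the service subdomain.
resource "aws_route53_zone" "service" {
  name = "service.example.com"
}

# Delegate the subdomain by adding its nameservers to the parent zone.
data "aws_route53_zone" "parent" {
  name = "example.com"
}

resource "aws_route53_record" "service_ns" {
  zone_id = data.aws_route53_zone.parent.zone_id
  name    = "service.example.com"
  type    = "NS"
  ttl     = 172800
  records = aws_route53_zone.service.name_servers
}
```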
Next are two certificates: one for the ALB and one for CloudFront. Remember: while the ALB certificate must be created in the region of the ALB itself, CloudFront certificates must always be created in the `us-east-1` region. So if your setup runs in the North Virginia region anyway, you can work with a single certificate; otherwise you need two.
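A sketch of the two certificates; we use a wildcard so that per-environment subdomains like `dev.service.example.com` are covered. The provider alias is our own convention, and the DNS validation resources are omitted for brevity:

```hcl
# Certificate for the ALB, issued in the workload region.
resource "aws_acm_certificate" "alb_certificate" {
  domain_name       = "*.service.example.com"
  validation_method = "DNS"
}

# CloudFront only accepts certificates from us-east-1, hence the provider alias.
provider "aws" {
  alias  = "us_east_1"
  region = "us-east-1"
}

resource "aws_acm_certificate" "cloudfront_certificate" {
  provider          = aws.us_east_1
  domain_name       = "*.service.example.com"
  validation_method = "DNS"
}
```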
Just like in the setup with Fargate, we need a container registry (a kind of repository for Docker images) where we can store the latest images of our Presentation Tier service. Besides AWS’ own service ECR, services like Docker Hub or GitHub Container Registry are suitable alternatives. We also enable image scanning for known security issues right away. For the sake of completeness (and to save costs), a lifecycle policy should also be created to delete old or untagged images; this step is not shown in our example. `force_delete` should be disabled in a production environment, as it deletes the ECR repository including all stored images when `terraform destroy` or a change that forces a replacement is executed.
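A minimal repository definition could look like this (the repository name and the `var.environment` variable are assumptions of this sketch):

```hcl
resource "aws_ecr_repository" "service" {
  name = "${var.environment}-service"

  # Scan every pushed image for known vulnerabilities.
  image_scanning_configuration {
    scan_on_push = true
  }

  # Convenient for tearing down the tutorial; disable in production so that
  # `terraform destroy` cannot delete a repository that still contains images.
  force_delete = true
}
```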
The network configuration
The basis for any Internet-based application in AWS is a virtual private cloud (VPC) in combination with subnets and resources such as Internet Gateways, NAT Gateways, and possibly other components such as site-to-site VPNs, Transit Gateways, and others.
In the next step, we will take care of these network components. We start with the VPC and an Internet Gateway:
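A minimal sketch (the CIDR block is illustrative):

```hcl
resource "aws_vpc" "main" {
  cidr_block           = "10.0.0.0/16"
  enable_dns_support   = true
  enable_dns_hostnames = true
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id
}
```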
For High Availability we need to deploy our Private Subnets and Public Subnets into multiple (at least two) Availability Zones (AZ). For this we can use the following `data` source from Terraform:
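The data source simply lists all AZs that are currently available in the configured region:

```hcl
data "aws_availability_zones" "available" {
  state = "available"
}
```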
Public Subnets
We now create a Public Subnet, Route Table, and a Route for outbound access to the Internet in each available AZ in our chosen AWS Region. The Public Subnets will be used for the Application Load Balancer (ALB) and the Bastion Host, not for the EC2 Instances that are part of the ECS Cluster.
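A sketch that creates one Public Subnet per available AZ plus a shared Route Table (the `cidrsubnet` offsets are illustrative):

```hcl
resource "aws_subnet" "public" {
  count                   = length(data.aws_availability_zones.available.names)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
}

# Default route to the Internet Gateway for traffic to and from the Internet.
resource "aws_route" "public_internet" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.main.id
}

resource "aws_route_table_association" "public" {
  count          = length(aws_subnet.public)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}
```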
Private Subnets
We perform almost the same steps for the Private Subnets, but here we have to add a NAT Gateway, otherwise there is no access to the Internet. Accordingly, a route to the Internet (CIDR block `0.0.0.0/0`) must be added to the route tables. All our Container Instances will reside inside the Private Subnets without direct access from the Internet.
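A sketch along the same lines; for brevity we use a single NAT Gateway here, while one NAT Gateway per AZ would be even more resilient:

```hcl
resource "aws_subnet" "private" {
  count             = length(data.aws_availability_zones.available.names)
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 100)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
}

# Outbound Internet access for the Container Instances goes through the NAT Gateway.
resource "aws_route" "private_internet" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = aws_nat_gateway.main.id
}

resource "aws_route_table_association" "private" {
  count          = length(aws_subnet.private)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}
```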
The compute layer on EC2
In contrast to ECS on Fargate, with ECS on EC2 we are responsible for the operation and setup of the EC2 Instances ourselves. This gives us more flexibility on the one hand, but leads to higher operational effort on the other. To keep this effort as small as possible, we use an AWS Launch Template for creating the EC2 Instances in our ECS Cluster.
We also need an EC2 Key Pair to gain access to the EC2 Instances; you can use an existing compatible SSH key or create a new one. We will use this SSH key later to connect to one of our private EC2 Instances via the Bastion Host. In the next step, we create the key pair resource:
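A sketch assuming an existing public key on disk (key name and path are placeholders):

```hcl
resource "aws_key_pair" "ecs" {
  key_name   = "${var.environment}-ecs-key"
  public_key = file("~/.ssh/id_ed25519.pub") # path to your existing public key
}
```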
Next up is the Launch Template. Here lies a small stumbling block: when selecting an Amazon Machine Image (AMI), make sure to use an ECS-optimized image with the ECS Agent preinstalled. In our example we use an AMI from the group `amzn2-ami-ecs-hvm-*-x86_64-ebs`. You can also install the ECS Agent manually, but without it an EC2 Instance cannot be used in an ECS Cluster.

In order for the EC2 Instances to join the ECS Cluster after startup, we need to configure the cluster name. To do this, the name is written to the ECS config file `/etc/ecs/ecs.config`. We do this as part of the user data script `user_data.sh` that is executed when an EC2 Instance is started. Remember that user data scripts are not executed when an instance reboots.
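A sketch of the AMI lookup and the Launch Template. The instance type is illustrative, and the Security Group, instance profile and ECS Cluster it references are defined in the following sections:

```hcl
# ECS-optimized Amazon Linux 2 AMI with the ECS Agent preinstalled.
data "aws_ami" "ecs_optimized" {
  most_recent = true
  owners      = ["amazon"]

  filter {
    name   = "name"
    values = ["amzn2-ami-ecs-hvm-*-x86_64-ebs"]
  }
}

resource "aws_launch_template" "ecs" {
  name_prefix   = "${var.environment}-ecs-"
  image_id      = data.aws_ami.ecs_optimized.id
  instance_type = "t3.small"
  key_name      = aws_key_pair.ecs.key_name

  vpc_security_group_ids = [aws_security_group.ecs_instances.id]

  iam_instance_profile {
    arn = aws_iam_instance_profile.ec2_instance.arn
  }

  # user_data.sh writes the cluster name to /etc/ecs/ecs.config, e.g.:
  #   #!/bin/bash
  #   echo "ECS_CLUSTER=${ecs_cluster_name}" >> /etc/ecs/ecs.config
  user_data = base64encode(templatefile("${path.module}/user_data.sh", {
    ecs_cluster_name = aws_ecs_cluster.main.name
  }))
}
```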
In the last step, we have to assign an IAM Role with the necessary permissions to the Launch Template and thus to the Container Instances:
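A sketch of the role and the instance profile referenced in the Launch Template; the role name matches the one discussed in the IAM section below, while the profile name is our own choice:

```hcl
resource "aws_iam_role" "ec2_instance" {
  name = "Nexgeneerz_EC2_InstanceRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ec2.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

# AWS-managed policy that lets the ECS Agent register the instance in the cluster.
resource "aws_iam_role_policy_attachment" "ec2_instance" {
  role       = aws_iam_role.ec2_instance.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ec2_instance" {
  name = "${var.environment}-ecs-instance-profile"
  role = aws_iam_role.ec2_instance.name
}
```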
Next, we create the ECS Cluster, the ECS Service, and the ECS Task Definition. We start with the ECS Cluster:
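In its simplest form (Container Insights is optional, but useful for metrics):

```hcl
resource "aws_ecs_cluster" "main" {
  name = "${var.environment}-ecs-cluster"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}
```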
This is followed by the ECS Service with the corresponding IAM Role. Among other things, this is where we configure how the ECS Tasks are distributed across the individual Container Instances, which is particularly important with regard to High Availability. In our case, the ECS Tasks are spread evenly across the various Availability Zones, and we additionally use the `binpack` strategy to make the most efficient use of the capacity of the Container Instances.
In a deployment setup designed for high throughput with CI/CD and frequent commits (several times per hour), it is important to ignore changes to `desired_count` in Terraform. Our autoscaling setup decides on its own how many resources and ECS Tasks need to run concurrently to satisfy the Target Tracking Policy. If `desired_count` were deployed again and again, we would bypass this capacity planning with every deployment and effectively reset it.
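A sketch of the service that combines the `spread` and `binpack` placement strategies and ignores `desired_count` as described above (names, counts and the referenced Target Group are assumptions of this sketch):

```hcl
resource "aws_ecs_service" "service" {
  name            = "${var.environment}-service"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.service.arn
  desired_count   = 2
  iam_role        = aws_iam_role.ecs_service.arn

  # Spread tasks across AZs first, then binpack them onto as few instances as possible.
  ordered_placement_strategy {
    type  = "spread"
    field = "attribute:ecs.availability-zone"
  }

  ordered_placement_strategy {
    type  = "binpack"
    field = "memory"
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.service.arn
    container_name   = "${var.environment}-service"
    container_port   = 80
  }

  # Autoscaling owns the task count at runtime; do not reset it on every deployment.
  lifecycle {
    ignore_changes = [desired_count]
  }

  depends_on = [aws_iam_role_policy_attachment.ecs_service]
}
```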
Now we can create the IAM Role for the ECS Service. This role will be assumed by ECS on our behalf to manage the ECS Cluster.
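A sketch of this role; we attach the classic `AmazonEC2ContainerServiceRole` managed policy here, which covers the load-balancer registration permissions the service needs (the repository may use a different, more restrictive policy):

```hcl
resource "aws_iam_role" "ecs_service" {
  name = "Nexgeneerz_ECS_ServiceRole"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "ecs.amazonaws.com" }
      Action    = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_role_policy_attachment" "ecs_service" {
  role       = aws_iam_role.ecs_service.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceRole"
}
```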
And finally, the ECS Task Definition, which configures the settings our Docker container uses to run our service. The line that sets the container image is particularly important here.
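Below is a condensed sketch of the Task Definition (CPU, memory and names are illustrative, and `var.region` is assumed to be declared elsewhere); the image line in question is marked with a comment:

```hcl
resource "aws_ecs_task_definition" "service" {
  family             = "${var.environment}-service"
  network_mode       = "bridge"
  execution_role_arn = aws_iam_role.ecs_task_execution.arn
  task_role_arn      = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([{
    name      = "${var.environment}-service"
    # The important line: the image tag follows the Git commit hash.
    image     = "${aws_ecr_repository.service.repository_url}:${var.hash}"
    cpu       = 256
    memory    = 512
    essential = true

    portMappings = [{
      containerPort = 80
      hostPort      = 0 # dynamic host port, matched by the ephemeral port range in the SG
      protocol      = "tcp"
    }]

    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.service.name
        "awslogs-region"        = var.region
        "awslogs-stream-prefix" = "service"
      }
    }
  }])
}
```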
The variable `${var.hash}` comes from the underlying CI/CD system (e.g. Jenkins, CircleCI) and contains the Git commit hash, which is different for each commit. This value is passed to Terraform as a variable with each deployment (e.g. via the environment variable `TF_VAR_hash`). This ensures that a new revision of the ECS Task Definition is created and deployed with every deployment, which is essential in a trunk-based development process where each change is directly integrated and deployed according to Continuous Integration and Continuous Deployment (CI/CD) principles.
Now we have to make sure our log group actually exists. This is the log group that will also be used for application logging inside our service (for example, a Python app writing its logs with a standard logging statement).
The definition of the log group contains a log retention period to delete older logs after some time.
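A sketch with an illustrative 30-day retention:

```hcl
resource "aws_cloudwatch_log_group" "service" {
  name              = "/ecs/${var.environment}-service"
  retention_in_days = 30
}
```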
For the ECS Task we create two roles, `Nexgeneerz_ECS_TaskExecutionRole` and `Nexgeneerz_ECS_TaskIAMRole`, which we will discuss in detail later.
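A minimal sketch of the two roles; the detailed discussion of what they are for follows in the IAM section:

```hcl
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_task_execution" {
  name               = "Nexgeneerz_ECS_TaskExecutionRole"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

resource "aws_iam_role_policy_attachment" "ecs_task_execution" {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# Assumed by the running task itself; intentionally carries no permissions in our example.
resource "aws_iam_role" "ecs_task" {
  name               = "Nexgeneerz_ECS_TaskIAMRole"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}
```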
With this, we have finished the compute part and can turn to configuring autoscaling on ECS.
Autoscaling on ECS
First, let’s take a look at our setup. On the one hand, we have a set of EC2 Instances that belong to an ECS Cluster and are used by one or more ECS Services (in our example, by a single service). These EC2 Instances are provisioned by an Autoscaling Group and registered in the ECS Cluster. On the other hand, the ECS Cluster manages the containers (ECS Tasks) that run our service, and their number has to be scaled as well.
This already introduces some of the complexity that AWS takes away from us with ECS on Fargate: with ECS on EC2, we have to figure out for ourselves which strategy we want to use to achieve flexible scale-out and scale-in of services. There are a variety of options here, but fortunately this task has become much easier with ECS Capacity Providers and ECS Service Autoscaling.
We configure autoscaling at two different levels using Target Tracking Policies:
- Capacity Provider with Target Tracking Policy for a target capacity
- Service Autoscaling on ECS Service Level with Target Tracking Policy for CPU and memory usage
Capacity Providers
An AWS Capacity Provider acts as a link between ECS Cluster and Autoscaling Group and is linked to both resources. In principle, each ECS Cluster can use multiple Capacity Providers and thus different Autoscaling Groups. This allows different infrastructure such as on-demand instances and spot instances to be used simultaneously in the cluster.
Capacity Providers calculate the required infrastructure for ECS Task containers and Container Instances (aka EC2 Instances) based on variables such as virtual CPU or memory. They take care of scaling both components out and in on demand by means of a Target Tracking Policy with a target value for CPU and/or memory usage. For example, a target tracking value of 50% for CPU usage means that the Capacity Provider always tries to balance the number of EC2 Instances so that this value is neither exceeded (scale-out) nor significantly undercut (scale-in). It is this demand-based, elastic scaling of infrastructure that makes cloud providers such as AWS or Google Cloud so attractive, because as a user you only pay for the infrastructure that you actually use. In an on-premises data center, unused compute capacity would sit idle and still incur costs.
A very detailed insight into Capacity Providers can be found in this article by Annie Holladay. A general overview of the complex topic Autoscaling on AWS is provided by Nathan Peck, while Durai Selvan discusses scaling on AWS ECS in particular.
Our Capacity Provider configuration is defined as follows:
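The following sketch uses the names introduced so far; the target capacity and step sizes are illustrative values:

```hcl
resource "aws_ecs_capacity_provider" "main" {
  name = "${var.environment}-capacity-provider"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.ecs.arn
    managed_termination_protection = "ENABLED"

    managed_scaling {
      status                    = "ENABLED"
      target_capacity           = 90
      maximum_scaling_step_size = 2
      minimum_scaling_step_size = 1
    }
  }
}

resource "aws_ecs_cluster_capacity_providers" "main" {
  cluster_name       = aws_ecs_cluster.main.name
  capacity_providers = [aws_ecs_capacity_provider.main.name]

  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.main.name
    weight            = 1
  }
}
```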
Here, `maximum_scaling_step_size` and `minimum_scaling_step_size` define by how many EC2 Instances the Capacity Provider may increase or decrease the number of Container Instances at once during a scale-out or scale-in. `managed_termination_protection` prevents EC2 Instances on which tasks are still running from being terminated.
Service Autoscaling on ECS
Service Autoscaling handles elastic scaling of containers (ECS Tasks) and also works in our setup using Target Tracking for CPU and memory usage.
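A sketch of the scaling target and the two Target Tracking Policies (the target values and the maximum of 10 tasks are illustrative):

```hcl
resource "aws_appautoscaling_target" "ecs_service" {
  service_namespace  = "ecs"
  resource_id        = "service/${aws_ecs_cluster.main.name}/${aws_ecs_service.service.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "${var.environment}-service-cpu"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 65

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}

resource "aws_appautoscaling_policy" "memory" {
  name               = "${var.environment}-service-memory"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.ecs_service.service_namespace
  resource_id        = aws_appautoscaling_target.ecs_service.resource_id
  scalable_dimension = aws_appautoscaling_target.ecs_service.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 75

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageMemoryUtilization"
    }
  }
}
```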
In the `aws_appautoscaling_target` resource, we define the minimum and maximum number of tasks that may run simultaneously, just as in the Capacity Provider. This helps us keep costs under control despite scalability. `min_capacity` is an important value and is set to at least 2 in our setup to ensure High Availability. Since we configured `aws_ecs_service` with the `spread` Placement Strategy, each of the two tasks runs in a different Availability Zone. If one AZ fails, a second task is still available and the service continues without interruption.

We use `ECSServiceAverageCPUUtilization` and `ECSServiceAverageMemoryUtilization` as the metrics whose data decides whether a scale-out or scale-in is triggered.
Autoscaling Group
The last missing part for linking ECS Cluster, Capacity Providers and Launch Template is the Autoscaling Group (ASG). Again, just like for the ECS tasks, we define the minimum and maximum number of EC2 Instances that can be created to avoid uncontrolled scaling. The Autoscaling Group uses our already configured Launch Template for launching new instances.
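A sketch of the Autoscaling Group (sizes and warmup values are illustrative):

```hcl
resource "aws_autoscaling_group" "ecs" {
  name                  = "${var.environment}-ecs-asg"
  min_size              = 2
  max_size              = 10
  vpc_zone_identifier   = aws_subnet.private[*].id
  protect_from_scale_in = true

  launch_template {
    id      = aws_launch_template.ecs.id
    version = "$Latest"
  }

  instance_refresh {
    strategy = "Rolling"

    preferences {
      instance_warmup        = 120
      min_healthy_percentage = 50
    }
  }

  enabled_metrics = [
    "GroupMinSize",
    "GroupMaxSize",
    "GroupDesiredCapacity",
    "GroupInServiceInstances",
    "GroupTotalInstances",
  ]
}
```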
The `instance_refresh` block has an important job: it lets us configure the warmup time after which newly launched EC2 Instances are considered ready, which helps avoid overly long startup phases. `enabled_metrics` defines which metrics the ASG should publish; these are then available in CloudWatch. `protect_from_scale_in` must be set to `true` because we have enabled `managed_termination_protection` in the Capacity Provider.
An important detail is that the EC2 Instances within the Autoscaling Group are all created in the Private Subnets. This allows us to significantly increase the security level, because the firewall settings in the Security Groups only allow controlled traffic in the order CloudFront Distribution -> ALB -> Autoscaling Group.
Application Load Balancer
The Application Load Balancer (ALB) is part of Amazon’s Elastic Load Balancing (ELB) family and takes over the task of load balancing, i.e. distributing incoming requests across the available Container Instances and ECS Tasks. In contrast to the Container Instances, the ALB runs in the Public Subnets. It is a highly available component and, thanks to our multi-AZ setup, redundant.
The ALB receives incoming traffic via ALB Listeners. We define an HTTPS Listener, which means we only accept traffic on port `443` (HTTPS). We will configure the automatic redirect from HTTP to HTTPS later in the CloudFront Distribution.
HTTPS Listener and Target Group
The default action of the HTTPS Listener on the Application Load Balancer blocks all requests with status code `403` (access denied). Further below we will add a Listener Rule that allows access only with a valid Custom Origin Header. As the certificate required for HTTPS, we use the `alb_certificate` created at the beginning.
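A sketch of the ALB and its HTTPS Listener; the SSL policy is an illustrative choice, and the Security Group is defined in the Security Groups section:

```hcl
resource "aws_lb" "service" {
  name               = "${var.environment}-service-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.alb.id]
  subnets            = aws_subnet.public[*].id
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.service.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate.alb_certificate.arn

  # Without the correct Custom Origin Header, every request is rejected.
  default_action {
    type = "fixed-response"

    fixed_response {
      content_type = "text/plain"
      message_body = "Access denied"
      status_code  = "403"
    }
  }
}
```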
The ALB Listener Rule forwards incoming requests that are sent with the correct Custom Origin Header to our Target Group.
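A sketch of the rule; the header name and the `var.custom_origin_header` secret are assumptions of this sketch and must match the CloudFront configuration further below:

```hcl
resource "aws_lb_listener_rule" "custom_origin_header" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 100

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.service.arn
  }

  # Only CloudFront knows this header value, so direct requests to the ALB stay blocked.
  condition {
    http_header {
      http_header_name = "X-Custom-Origin-Header"
      values           = [var.custom_origin_header]
    }
  }
}
```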
The Target Group is the output side of the ALB, while the ALB Listener is the input side: traffic enters through the Listener and is forwarded to the matching Target Group, which then routes it to the appropriate backend resource (in our case, the Container Instances). From here, traffic is forwarded over HTTP on port `80`. Since we are not implementing an end-to-end encrypted solution and are now inside the Private Subnets, this is acceptable.
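A sketch of the Target Group including the health check discussed next (path, thresholds and intervals are illustrative):

```hcl
resource "aws_lb_target_group" "service" {
  name        = "${var.environment}-service-tg"
  port        = 80
  protocol    = "HTTP"
  target_type = "instance"
  vpc_id      = aws_vpc.main.id

  health_check {
    matcher             = "200"
    path                = "/health"
    interval            = 30
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }

  # Enable only if the application relies on server-side sessions (see below).
  # stickiness {
  #   type            = "lb_cookie"
  #   cookie_duration = 86400
  #   enabled         = true
  # }
}
```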
The `health_check` block defines which URL the ALB uses to decide whether a target (our ECS Tasks running on the Container Instances) is healthy. `matcher` and `path` configure the accepted HTTP status codes and the URL path used for the health check. `healthy_threshold` and `unhealthy_threshold` define after how many successful or failed health checks a target is considered healthy or unhealthy. The health checks are performed at the interval (in seconds) defined in `interval`.
Depending on the application logic, a `stickiness` block can be defined to enable sticky sessions, so that visitors are always directed to the same target and their session remains usable. However, this approach can cause problems, especially with frequent deployments and therefore frequently changing targets, and should be reconsidered in the application architecture to enable efficient development methods such as trunk-based development.
The Target Group is linked to our ECS Service exactly there, in the `aws_ecs_service` resource: in its `load_balancer` block, the ARN of the Target Group is referenced and the ECS Tasks are registered as targets.
Security Groups
Next, we take care of the necessary Security Groups (SG) for the ALB and the EC2 Instances. The SG for the EC2 Instances allows all outgoing traffic (often necessary for external API calls or package managers like `pip` or `npm`). Incoming traffic is allowed on port `22` (SSH) exclusively from the Bastion Host, which we will create later. We also allow the so-called ephemeral ports `1024`–`65535` from the Security Group of the ALB.
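A sketch of the Security Group for the Container Instances (the Bastion Host Security Group it references is defined later):

```hcl
resource "aws_security_group" "ecs_instances" {
  name   = "${var.environment}-ecs-instances"
  vpc_id = aws_vpc.main.id

  # All outbound traffic (package managers, external APIs, ECR, CloudWatch, ...).
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # SSH only from the Bastion Host.
  ingress {
    from_port       = 22
    to_port         = 22
    protocol        = "tcp"
    security_groups = [aws_security_group.bastion.id]
  }

  # Ephemeral host ports used by the ALB to reach the dynamically mapped ECS Tasks.
  ingress {
    from_port       = 1024
    to_port         = 65535
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
}
```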
The ALB also allows all outbound traffic. For inbound traffic, we again limit access to known AWS CIDR block ranges, which we can query in the form of managed prefix lists via the `aws_ec2_managed_prefix_list` data source. This is an additional layer of security on top of the Custom Origin Header.
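A sketch that uses the AWS-managed, origin-facing CloudFront prefix list as the allowed source:

```hcl
# AWS-managed prefix list with the origin-facing CloudFront IP ranges.
data "aws_ec2_managed_prefix_list" "cloudfront" {
  name = "com.amazonaws.global.cloudfront.origin-facing"
}

resource "aws_security_group" "alb" {
  name   = "${var.environment}-alb"
  vpc_id = aws_vpc.main.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # HTTPS only from CloudFront, not from arbitrary Internet addresses.
  ingress {
    from_port       = 443
    to_port         = 443
    protocol        = "tcp"
    prefix_list_ids = [data.aws_ec2_managed_prefix_list.cloudfront.id]
  }
}
```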
CloudFront Distribution
As a final component, we use a CloudFront Distribution to provide fast, low-latency access to our service worldwide via AWS Edge Locations (a multi-region setup would improve this even further), but also to be able to use Web Application Firewall (WAF) rules and to benefit from AWS Shield, which is enabled by default. WAF is used for detecting and preventing malicious attacks such as SQL injections, while AWS Shield provides a first level of protection against DDoS attacks. AWS Shield can be upgraded to the paid AWS Shield Advanced as needed.
CloudFront is linked to our ALB via `target_origin_id`. Using the `redirect-to-https` setting, we ensure that all incoming traffic is always redirected to HTTPS: if a user opens `http://service.example.com`, they are automatically redirected to `https://service.example.com`.
In `custom_header` we configure our Custom Origin Header, which we already prepared in the ALB Listener. As `acm_certificate_arn` we now use the certificate that was created specifically in the `us-east-1` region.
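A condensed sketch of the distribution. Note that CloudFront validates the origin certificate against the origin domain name, so this sketch points the origin at an extra alias record for the ALB (defined in the DNS section below) instead of the raw ELB hostname; the cache behavior, header name and `var.custom_origin_header` are assumptions of this sketch:

```hcl
resource "aws_cloudfront_distribution" "service" {
  enabled = true
  aliases = ["${var.environment}.service.example.com"]

  origin {
    origin_id = "${var.environment}-service-alb"
    # Resolves to the ALB and is covered by the ALB certificate.
    domain_name = "origin-${var.environment}.service.example.com"

    custom_origin_config {
      http_port              = 80
      https_port             = 443
      origin_protocol_policy = "https-only"
      origin_ssl_protocols   = ["TLSv1.2"]
    }

    # Must match the header checked by the ALB Listener Rule.
    custom_header {
      name  = "X-Custom-Origin-Header"
      value = var.custom_origin_header
    }
  }

  default_cache_behavior {
    target_origin_id       = "${var.environment}-service-alb"
    viewer_protocol_policy = "redirect-to-https"
    allowed_methods        = ["GET", "HEAD", "OPTIONS", "PUT", "POST", "PATCH", "DELETE"]
    cached_methods         = ["GET", "HEAD"]

    # Pass everything through to the dynamic service (effectively no caching).
    min_ttl     = 0
    default_ttl = 0
    max_ttl     = 0

    forwarded_values {
      query_string = true
      headers      = ["*"]

      cookies {
        forward = "all"
      }
    }
  }

  restrictions {
    geo_restriction {
      restriction_type = "none"
    }
  }

  viewer_certificate {
    acm_certificate_arn      = aws_acm_certificate.cloudfront_certificate.arn
    ssl_support_method       = "sni-only"
    minimum_protocol_version = "TLSv1.2_2021"
  }
}
```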
Bastion Host
One of the advantages of ECS on EC2 over ECS Fargate is that we have full control over the underlying compute layer (EC2 Instances). Among other things, this can also be helpful for debugging Docker containers that do not start correctly. For this, we need access to the EC2 Instances, which is possible via SSH. However, since the Container Instances are running in a Private Subnet and do not have a public IP, we need to find another way to establish a secure connection.
Using a Bastion Host (sometimes also called a Jump Host), a separate small EC2 Instance that runs in one of our Public Subnets and only allows access via SSH on port `22`, users with the appropriate private key can connect to the Container Instances via SSH agent forwarding. Alternatively, AWS EC2 Instance Connect can be used to establish a terminal session on the Container Instance from the AWS Console without explicitly using the SSH key.
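A sketch of the Bastion Host and its Security Group; for simplicity we reuse the AMI and key pair from above and deliberately leave SSH open to the world here (see the notes at the end of the article about restricting this):

```hcl
resource "aws_security_group" "bastion" {
  name   = "${var.environment}-bastion"
  vpc_id = aws_vpc.main.id

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Ideally restrict this to your own CIDR range instead of 0.0.0.0/0.
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_instance" "bastion" {
  ami                         = data.aws_ami.ecs_optimized.id
  instance_type               = "t3.micro"
  subnet_id                   = aws_subnet.public[0].id
  key_name                    = aws_key_pair.ecs.key_name
  vpc_security_group_ids      = [aws_security_group.bastion.id]
  associate_public_ip_address = true
}
```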
DNS settings with Route 53
To put our whole setup together, we still need the final DNS settings, which we configure with Route 53. In our example we assume that our service runs under a subdomain like `service.example.com` and should be accessible in the `dev` environment under `dev.service.example.com`. We therefore need a new Route 53 Hosted Zone for this subdomain, the appropriate nameserver (NS) records in the parent zone, and an A record (alias) pointing to our CloudFront Distribution as the entry point for the service.
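A sketch of the records; the additional `origin-...` alias gives CloudFront an origin hostname that is covered by the ALB certificate (an assumption of this sketch, see the CloudFront section above):

```hcl
# Entry point for users: environment subdomain -> CloudFront.
resource "aws_route53_record" "service" {
  zone_id = aws_route53_zone.service.zone_id
  name    = "${var.environment}.service.example.com"
  type    = "A"

  alias {
    name                   = aws_cloudfront_distribution.service.domain_name
    zone_id                = aws_cloudfront_distribution.service.hosted_zone_id
    evaluate_target_health = false
  }
}

# Origin hostname used by CloudFront to reach the ALB over HTTPS.
resource "aws_route53_record" "service_origin" {
  zone_id = aws_route53_zone.service.zone_id
  name    = "origin-${var.environment}.service.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.service.dns_name
    zone_id                = aws_lb.service.zone_id
    evaluate_target_health = false
  }
}
```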
IAM Roles in AWS ECS
AWS ECS uses a total of four different IAM Roles, all of which are used within the context of EC2 Instances, ECS Service, and ECS Tasks. These roles can be a bit confusing if it is not clear which role does what.
We use the following IAM Roles:
- The `Nexgeneerz_EC2_InstanceRole` within the Launch Template: this role is assumed by every single EC2 Instance that is launched. Since we are working with AWS Elastic Container Service (ECS), we use AWS's own managed policy `AmazonEC2ContainerServiceforEC2Role` here. For example, this policy grants the permission `ecs:RegisterContainerInstance`, which is used by the ECS Agent running on the EC2 Instance to register itself as a newly started Container Instance in the ECS Cluster. An overview of all granted permissions can be found here. In summary: the `Nexgeneerz_EC2_InstanceRole` is assumed by the EC2 Instance principal to perform tasks like (de)registering Container Instances.
- The `Nexgeneerz_ECS_ServiceRole`, which is used by the ECS Service: here, a role with a policy similar to `AmazonECSServiceRolePolicy` is used to allow ECS to manage the cluster. In the AWS Console, this role would be created for us automatically, but with Infrastructure-as-Code we handle this ourselves and use a separate role to avoid conflicts when using multiple ECS Clusters and ECS Services.
- In the `aws_ecs_task_definition` resource, two roles are expected. The `execution_role_arn` is the `Nexgeneerz_ECS_TaskExecutionRole` role, which carries the `AmazonECSTaskExecutionRolePolicy` policy. The Task Execution Role grants the ECS Agent the necessary rights to, for example, write log streams. Depending on the requirements, it may be necessary to define an additional inline policy.
- The role `Nexgeneerz_ECS_TaskIAMRole` is assumed by the running ECS Task itself. It may be needed if the task requires access to additional AWS services (which is not the case in our example, so this role exists only as an empty container and does not get any permissions).
Connect to the EC2 Instances
One of the advantages of ECS EC2 over ECS Fargate is unrestricted access to the EC2 Instances, which we control via the Launch Template, including Security Groups and SSH keys. For debugging purposes it can sometimes be helpful to connect to these EC2 Instances via SSH. For this we have already set up the appropriate Security Group rules and created the Bastion Host, since the EC2 Instances run in a Private Subnet without direct access from the Internet.
To connect to one of the instances, we first look up its private IP in the AWS Console (e.g. `192.173.98.20`). We also need the public IP address of the Bastion Host, let's say it is `173.198.20.89`. Since we use SSH agent forwarding, we first connect to the Bastion Host with the default EC2 user `ec2-user` and the agent-forwarding flag, for example `ssh -A ec2-user@173.198.20.89`. Once you are logged in to the Bastion Host, you can connect from there to the private EC2 Instance, for example with `ssh ec2-user@192.173.98.20`.
After we have successfully logged in, we can perform further actions, such as debugging why a particular Docker container is not starting.
What is missing?
Although we have already covered many components necessary for a production-grade infrastructure in our setup, there are still some configurations that we have omitted from this tutorial for clarity and ease of understanding. These include the following components:
- Configuration of AWS Web Application Firewalls (WAF) Rules for CloudFront.
- Possibly configuring ACL at the network level.
- Restricting the IP range for the Bastion Host to specific CIDR blocks, although AWS EC2 Instance Connect in the AWS Console is the preferred option here because it offers better access control.
That said, this article already provides an in-depth look at configuring, provisioning and scaling ECS on EC2 for organizations that want to run their applications in Docker containers in the AWS Cloud.
Conclusion
ECS on EC2 is an extremely powerful tool for creating complex infrastructure without significant limitations. Building and managing the infrastructure requires a solid understanding of techniques such as autoscaling, as well as aspects such as load balancing, security groups and networking. In terms of flexibility, ECS on EC2 is almost on a par with Kubernetes and can be extended in almost any direction.
However, one should not underestimate the operational effort that goes hand in hand with this flexibility. Smaller teams in particular may reach their limits here if they first have to painstakingly acquire this knowledge themselves and neglect the operational part. With our Cloud Computing for Startups offering, we support such companies in particular in taking advantage of flexible and highly available infrastructure, reliably building up the necessary knowledge and then being able to drive forward further development in line with business goals on their own responsibility. Infrastructure in the cloud will become more and more standard in the future and software engineering teams are well advised to acquire the necessary skills in time to ensure cost-efficient product development with fast time-to-market.
Whew, that was a big one! 😅
Hopefully you were able to learn something in our article about Amazon ECS EC2 with Terraform and apply it to your own use case.
Our goal at Gyden is to share knowledge with others in the tech community and learn together. Because that’s the only way we can keep reducing the complexity that comes with such technologies.
If you liked this post, share it with your friends and colleagues. Together we can help more people learn and grow about cloud computing and Amazon ECS. You can also follow us on LinkedIn to stay up to date about new articles like this. 🚀