
DOCKER & KUBERNETES ON AWS
2020
Elastic Container Service (ECS) & Elastic Kubernetes
Service (EKS)

Charles Clifford

The Solus Group, LLC


Copyright © 2020 Charles B. Clifford

All rights reserved

The characters and events portrayed in this book are fictitious. Any similarity to real persons, living or
dead, is coincidental and not intended by the author.

No part of this book may be reproduced, or stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, or otherwise, without express written
permission of the publisher.

ISBN-13: 9781234567890
ISBN-10: 1477123456

Cover design by: Art Painter


Library of Congress Control Number: 2018675309
Printed in the United States of America
TABLE OF CONTENTS
Preface
Amazon Elastic Container Service (ECS)
Introduction
Purpose and Scope
Everything is Disposable
Not Official Docker
Containerized Applications
Amazon ECS Use Cases
ECS Amazon Compute Service Level Agreement
Amazon ECS Security
Amazon ECS Platform
ECS Endpoint
Amazon ECS and Service Quotas
ECS Pre-requisites
ECS Platform-as-a-Service
Docker Background
Open Container Initiative (OCI)
Docker Images
Dockerfile
Docker Image Registry
Docker Containers
ECS Control Plane
Docker Engine
ECS API
Docker RBAC
Amazon ECS Container Agent
ECS Console
Levels of the ECS Stack
ECS and CloudFormation
CloudFormation ECS Templates
1 Use Case => 1 ECS Solution
ECS’s VPC Configuration
ECS and Interface VPC Endpoints
ECS Cluster Definition
Cluster Capacity Provider
Capacity Provider Strategy
Auto Scaling Service-Linked IAM Role
ECS Cluster Lifecycle
CloudWatch Containers Insights
Linux Container Namespaces on ECS
Docker Container Network Modes
awsvpc Network Mode
Configuring the Container Network Mode
Linux EC2 Instance Trunking
ECS Service Definition
ECS Service Lifecycle
IAM Service-Linked Roles for ECS Service
ECS Service Load Balancing
Amazon ECS Service Discovery
Route 53 and AWS Cloud Map with ECS Service Discovery
ECS Service Scheduling Strategy
Service Placement Strategy and Constraints
EC2 Launch Type
EC2 Compute Instance IAM role
ECS-Optimized Amazon AMI
ECS-Optimized AMI and AWS Systems Manager
Bottlerocket
GPUs on Amazon ECS
AWS Fargate Launch Type
Fargate Spot
Amazon Savings Plans for ECS
ECS and AWS CodePipeline
ECS Service Deployment
ECS Service and AWS CodeDeploy
CodeDeploy IAM Service-Linked Role
CodeDeploy Pre-requisites
CodeDeploy App Spec File
Blue-green Deployments
Rollbacks
ECS Task Definition
ECS Tasks and AppMesh
RunTask API
StartTask API
StopTask API
Scheduled Tasks
CloudWatch Events IAM Role
Running Tasks on Fargate Instances
Task Lifecycle
Docker Data Storage Types
Docker Bind Mounts
Docker Volumes
Docker tmpfs Mounts
Configuring Bind Mounts on ECS
Configuring Volumes on ECS
Configuring Amazon Elastic File System (EFS) Volumes
Configuring tmpfs Mounts on ECS
Task Retirement
ECS Task Container Definition
Standard Container Parameters
Container Memory Parameters
Container Port Mappings Parameters
Advanced Container Parameters
Container Health Check Parameters
Container Environment Parameters
Container Network Settings Parameters
Container Storage Parameters
Container Logging Parameters
Container Security Parameters
Container Resource Limits Parameters
Other Container Parameters
Container Linux Parameters
Container Dependency Parameters
Container Timeouts Parameters
Container System Controls Parameters
Container Placement Constraints
ECS Docker Container at Runtime
Running Deep Learning Containers on ECS
ECS Windows Misses
ECS Monitoring
ECS API and Amazon CloudTrail
ECS and Amazon CloudWatch
CloudWatch Logs
How to Publish Logs to CloudWatch
Monitoring and Metrics
CloudWatch Custom Metric
CloudWatch Events
CloudWatch Alarms
2 Levels of Monitoring
CloudWatch Logs Agent
View CloudWatch Graph and Statistics
ECS on AWS Outpost
ECS DevOps Toolset
Container Agent Configuration
Amazon Elastic Kubernetes Service (EKS)
Introduction
Purpose and Scope
Amazon EKS Use Cases
EKS Amazon Compute Service Level Agreement
Kubernetes Architecture
EKS and IAM
EKS Clusters
EKS Cluster VPC
VPC CNI Plugin
CNI Metrics Helper
EKS and Calico
EKS IAM Service Role
The eksctl & kubectl Command Line Utilities
Creating the EKS Cluster
Master Nodes (Control Plane)
EKS Cluster Authentication
EKS Cluster Endpoint Access
EKS and CloudTrail
Control Plane Logging
Worker Nodes (Data Plane)
Worker Node IAM Role
Node Groups
Managed Node Groups
EKS IAM Users and Roles
Declarative Manifest Files
Pods and Containers
Kubernetes Controllers
Helm Package Manager
EKS and AWS Fargate Profiles
Running Pods on Fargate
Running Deep Learning Containers on EKS
Kubernetes Storage Classes
Auto Scaling
Kubernetes Metrics Server
Prometheus Metrics
Load Balancing
ALB Ingress Controller
EKS on AWS Outposts
AWS DevOps Toolset
AWS Billing and Cost Management Service
AWS Cost Explorer
AWS Cost and Usage Report
AWS Cost Management Matched Supply and Demand
AWS Cost Management Expenditure Awareness Services
Optimizing Over Time
Total Cost of Ownership (TCO)
AWS TCO Calculator
AWS Simple Monthly Calculator
AWS Budgets Dashboard
AWS Trusted Advisor
GitHub
AWS Management Console
AWS Command Line Interface (CLI) Utility
AWS CloudFormation
AWS Software Development Kit (SDK)
AWS Support Tiers
Support Severity
Basic Support
Developer Support
Business Support
Enterprise Support
Technical Account Manager
Infrastructure Event Management
The Concierge Service
HTTP & the OSI Model
The committee for placing things on top of other things
OSI Model
The system as the network, and the network as the system
HTTP’s two message types
HTTP is Stateless
An application called a web service
The origins of HTTP
Hypermedia information
The origins of HTTPS
Distributed CRUD
The Port Number
Idempotency
How to tell if a given HTTP request is idempotent
9 HTTP request methods
HTTP Connections
What is TCP?
What’s IP?
Types of HTTP connections
Of Monoliths and Microservices
Monolithic Applications
The Institution
Cloud animus
Easy Lifts
BBoMs in the core of the BPM
Chasing Suitable Metrics
Field Notes from the real world
Containers of BBoMs
How to lift a BBoM
The Real World as Final Frontier
The Best Option
VM Hard-stop
Migration to the cloud
Divide and Conquer
Underlying Assumptions
12-Factors Methodology
1. Codebase
2. Explicit and Isolated Dependencies
3. Configuration
4. Backing Services
5. Build, Release, Run stages
6. Stateless Processes
7. Port Binding
8. Concurrency
9. Disposability
10. Dev/Prod Parity
11. Logs
12. Admin Processes
12 Factors Eye-Opener
Greenfield DevOps
Reincarnating a monolith
Microservices
The End of the Beginning
Software-as-a-Service (SaaS)
Conclusion
About the Author
Leave a Review
PREFACE
In 2020, when it comes to Docker containers on Amazon Web Services
(AWS) you have to make two decisions, each between mutually exclusive options:

1. Run the container on a serverless AWS Fargate compute
instance, or, for server-level control, run the container on an
Elastic Compute Cloud (EC2) virtual machine.
2. Orchestrate the containers using Elastic Container Service
(ECS) or using Elastic Kubernetes Service (EKS).

The goal of this manuscript is to provide DevOps professionals with an
in-depth explanation of Docker containers and Kubernetes on AWS, and to
explain the various AWS resources and services that are commonly used
in both ECS and EKS solutions. At present, the huge volume of critical facts
that you must understand about Docker, Kubernetes, ECS, and EKS is
scattered across dozens of manuscripts, white papers, and blog posts. To
realize its goal, this manuscript organizes and consolidates those critical facts.
Amazon ECS is a fully-managed service, but only its control plane is fully
managed. The ECS consumer is responsible for defining and configuring
nearly all parts of the ECS solution. Amazon EKS is also a fully-managed
service, but compared to ECS there are significantly fewer parts for the
consumer to define and configure.
A cluster of distributed Docker containers is not a simple problem
space. Neither ECS nor EKS is easy to design, configure, or manage. With
ECS and EKS there are many interdependent parts that have to be configured
optimally and which have to operate efficiently. To be sure, both ECS and
EKS are tightly coupled to the AWS platform, its resources, and services.
Inevitably, the typical ECS or EKS solution is integrated with numerous
supporting and complementary AWS services.
The ECS or EKS solution and its Docker containers are not static
objects or permanent constructions; they are disposable. Each ECS or
EKS part has to be configured and deployed within a dynamic workflow that
is continually changing and doing so in an indeterminate manner.
Manipulating ECS and EKS is much like releasing an arrow from a bow and
then aiming that in-flight arrow at a target that is constantly moving.
To make those technical decisions in an informed manner, each of us
needs a solid understanding of the AWS platform and infrastructure resources.
Because ECS and EKS are extensively integrated with AWS, using either
successfully requires the ability to declare and define in code a cloud
environment that uses:

GitHub (or an alternate source code control system)
Virtual Private Cloud (VPC), subnets, security groups,
gateways, etc.
EC2 virtual machines and namespaces
Persistent and temporary data storage (for use by an EC2
instance or by a container volume)
Identity and Access Management (IAM) roles and policies
Auto-scaling
Load balancing
ECS or EKS Control Plane
Docker Runtime
Docker Containers
Docker Repository
Event trails

With the AWS infrastructure resources designed, you can begin to
declare and define in code the various components specific to ECS (cluster,
Service, Task, Container, Repository) or EKS (cluster, Pod, Container,
Repository).
Typically, the complete definition of your ECS or EKS solution might
include:

AWS CodePipeline;
AWS CodeDeploy;
AWS Fargate - a serverless compute engine for containers that
works with both ECS and EKS;
AWS Route 53 and Cloud Map;
AWS AppMesh - provides application-level networking to
make it easy for your (services running inside) containers to
discover and communicate with each other across different
compute infrastructure, e.g. ECS and Fargate;
Bottlerocket – a Linux-based open-source operating system
(built by AWS) for running containers on virtual machines or
bare metal hosts;
Amazon Outpost – use EKS or ECS to manage clusters of
containers in a private cloud integrated with the AWS public
cloud;

The manuscript closes with background reference materials:

HTTP and the OSI Model
Of Monoliths and Microservices.
AMAZON ELASTIC
CONTAINER SERVICE
(ECS)

Introduction
Amazon Elastic Container Service (ECS) is a Region-based service used
to orchestrate Docker containers that are distributed across a cluster. The use
cases that gain from the benefits delivered by ECS are rarely if ever simple or
easy to DevOps. If they were simple then their solution would not need to be
orchestrated.
In an ECS solution there are many interdependent AWS resources and
services that have to be configured optimally and which have to inter-operate
securely and efficiently to support your use case in a viable and cost effective
manner. To provide a containerized application with security isolation, high
availability, and the quick resolution of failure events, the ECS consumer is
responsible for configuring:

Virtual Private Clouds (VPCs), subnets, gateways, routing, and
security groups that span multiple Availability Zones (AZs);
Identity Access Management (IAM) roles and policies for
compute instances and containers;
EC2 compute instances (both Linux and Windows based) and
their namespaces used by containers;
Clusters and Auto Scaling;
Long running Services, their deployments, and Load Balancing;
Standalone Tasks, their placement and scheduling;
Docker Containers, their network mode, data storage, and kernel
resources;
Dynamic Service Discovery of compute instances and
containers;
Monitoring and Logging.

ECS provides a central management service – the ECS console – that
enables you to perceive and manipulate the state of all clusters and of their
containers, as well as all of the AWS resources and AWS services that they
use. The ECS control plane is fully-managed and extends across all clusters
that are the property of a given AWS account. Though the ECS control plane
simplifies running a cluster of distributed containers, it is magical thinking to
suggest that it is easy to DevOps an ECS cluster of distributed Docker
containers.
Given Docker and AWS's roots in Linux, it is no wonder that Linux-
based virtual machines (VMs) and containers are able to fully exploit ECS
features. Though ECS supports Windows, a significant number of ECS
features are not available for Windows-based VMs or containers.
Recently, AWS has taken steps to simplify the creation and management
of virtual machines by extending ECS to use serverless Fargate instances as
container hosts. The use of Fargate does, however, preclude the use of a
number of ECS features available with Linux-based VMs and their containers.
Unfortunately, Fargate does not support Windows-based containers.
Lastly, the parts of ECS are tightly coupled with each other. A
configuration choice made for one part can render features in other parts null and
void. To gain from all benefits provided by ECS, the consumer's best bet is
to use Amazon ECS-optimized AMIs on EC2 instances. These ECS-
optimized AMIs allow the consumer to exploit the features they need without
being forced to sacrifice other desirable features of ECS and the AWS
platform.

Purpose and Scope


The information the consumer needs to understand to DevOps ECS is
scattered over dozens of AWS User, Developer, and API guidebooks, and
dozens of ECS evangelist blog postings. The purpose of this chapter is to
gather together those fragmented facts, to organize them, and disclose details
about how to configure ECS and integrate ECS with other beneficial AWS
resources and services.
However, code examples of how to DevOps the ECS infrastructure and its
integration with the AWS platform or services using the suite of AWS DevOps
tools are not provided. Countless code examples of using AWS DevOps tools
with ECS (as well as on-line tutorials) are well documented in numerous
AWS guidebooks and web pages.
It is recommended that the reader collect CloudFormation templates that
are used to provision VPCs, ECS clusters, services, tasks, Docker containers,
compute instances, auto scaling, load balancing, service discovery, logging,
and monitoring.
The scope of this exploration of ECS is limited to the AWS platform stack
and ECS best practices as of March 2020. Given that there is a substantial amount
of material to cover for the current version of ECS alone, this manuscript
does not cover legacy ECS issues. Best practices for handling legacy ECS
issues are covered in the AWS User, Developer, and API guidebooks.

Everything is Disposable
Everything in the cloud is ephemeral and temporary. The VPCs, security
groups, IAM roles, clusters, services, tasks, Docker containers, and their
compute instances are not static objects or permanent artifacts.
Each ECS artifact has to be configured and deployed within a dynamic
workflow that is continually changing. In the real world, the DevOps of
containerized applications feels similar to releasing an arrow from a bow
while blindfolded, and then aiming that in-flight arrow at a target that is
constantly moving.
As is typical of any service in the cloud, the Docker container’s AWS eco-
system has many moving parts functioning across multiple levels: VPC,
cluster, service, task, container, compute instance. In each level, ECS artifacts
can be deeply integrated with a variety of AWS resources and services. To
handle the variety and complexity of ECS requires you to define the
infrastructure in JSON and YAML documents, which function like
programmable code to create and destroy virtual infrastructure. The practice of
infrastructure as code – and using the Amazon CloudFormation service - is a
critically important discipline demanded of every ECS consumer.
In addition, to be able to continually release re-versioned services running
Docker containers, the ECS consumer will need to adopt the AWS
CodePipeline and AWS CodeDeploy services. In turn, these AWS services require
the ECS consumer to adopt a source code control system (such as GitHub)
and a Docker container repository (such as Amazon Elastic Container
Registry (ECR)).
A solid understanding of AWS resources and services is required if you
are to successfully DevOps ECS.

Not Official Docker


ECS does not receive official Docker upgrades. The best of the best
practices for ECS is to continually upgrade all of its parts to their latest AWS-
approved release level. Operating in production at the latest release level can
be achieved by implementing a Continuous Integration and Continuous
Delivery (CI/CD) workflow that tests your application and ECS
infrastructure from end to end before moving into the production
environment.

Containerized Applications
A container is a runtime environment, such as Docker, in which a process
executes in isolation from other processes. The typical containerized
application:

Is composed of multiple services that are HTTP/HTTPS based
and have a REST API;
All network traffic between services is encrypted in transit;
Each service is stateless;
Each service can be developed independently of other services,
as well as deployed independently;
All data at rest is encrypted;
Each service maximizes cohesion – each service supports one
service contract or one data contract – it executes one function
and does so in an optimal manner. Naturally, this results in code
images that are much smaller than processes that implement
multiple contracts;
Each service minimizes coupling – no functional dependencies
exist between the services, and each can start, stop, fail, and
recover independently of any other service. Multiple services do
not manipulate the same resource, e.g., data volume, kernel
namespace, network interface, etc.;
Each service responds to failure events quickly (a.k.a., self-
heals) without interfering with or depending upon another
service;
Compute instances and containers are launched in both private
and public subnets, behind firewalls, using digital identities that
have the least permissions required;
Can auto scale compute instances – horizontally, and perhaps
vertically;
Can load balance network traffic across compute instances;
Can dynamically discover at runtime the network interfaces of
compute instances as well as those of containers;
Can log and monitor services events in real-time;
Can use blue-green deployments, as well as rolling updates and
roll-backs, when provisioning its containers.

Amazon ECS Use Cases


At the time of this writing, ECS can be deployed in 69 Availability Zones
across 22 Regions. Use cases that gain the benefits of Amazon ECS are not
simple problem spaces; they are complex to extraordinarily complex:

Containerized Application Migration to AWS – the application
that is not a monolith or big ball of mud but is containerized
prior to initiating migration to the AWS cloud;
Batch ETL Processing – a workflow of containerized ETL
processes, executed proximate to their data sources and sinks;
Hybrid Containerized Applications – containerized applications
running in AWS as well as in the consumer’s private data
center;
Machine Learning – training and inference processes, executed
proximate to their data sources and sinks;
Microservices – distributed containerized applications
composed of a set of services that maximize cohesion and
minimize coupling. Typically HTTP/HTTPS based;
Software-as-a-Service (SaaS) – deploy and manage
infrastructure used by distributed containerized applications.
ECS can support single tenancy as well as multi-tenancy SaaS
scenarios.
Microservices, containerized applications, and SaaS, have to be able to
respond quickly to failure events as well as to constantly changing
workloads. Therefore those ECS use cases require auto scaling, load
balancing, and service discovery. Amazon ECS defines auto scaling at the
Cluster level and defines load balancing and service discovery at the ECS
Service level.

ECS Amazon Compute Service Level Agreement


Under the terms of the AWS Customer Agreement (a.k.a., the AWS
Agreement) each AWS account has a governing policy called the Amazon
Compute Service Level Agreement (SLA). The SLA is itself subject to
the terms of the AWS Agreement, and capitalized terms take their meaning from that agreement.
The Amazon Compute SLA includes these AWS products and services:

Amazon Elastic Compute Cloud (EC2)
Amazon Elastic Block Store (EBS)
Amazon Elastic Container Service (ECS)
Amazon Fargate for Amazon ECS (Fargate)
Service Commitment

AWS undertakes commercially reasonable efforts to make each of the
included products and services available with a Monthly Uptime Percentage
(defined below) of at least 99.99%, in each case during any monthly billing
cycle (the “Service Commitment”). In the event any of the included products
and services do not meet the Service Commitment, the consumer will be
eligible to receive a Service Credit as described by AWS.
AWS reserves the right to change the terms of the SLA in accordance with
the AWS Agreement. For more details on:

Service Commitment;
Included Products and Services;
Service Commitments and Service Credits;
Credit Request and Payment Procedures, and
Amazon EC2 SLA Exclusions.

Please visit the AWS website.

Amazon ECS Security


In addition to ECS being tightly integrated with IAM roles and policies, as
well as VPC security groups and routing, the AWS platform reinforces ECS
by providing two hundred and ten (210) security, compliance, and
governance services and features.
ECS provides strong security isolation between Docker containers. ECS
enables you to set granular access permissions for every container. In addition,
AWS ensures that your ECS clusters are running with the latest security
updates.

Amazon ECS Platform


On the AWS platform, the container’s compute instance launch type
determines the ECS cluster’s infrastructure. In an ECS cluster, two launch
types are supported:

EC2 instance, and/or
AWS Fargate.

If desired, a given ECS cluster can be composed of both launch types at
the same time. Whether you choose to use EC2 instances or serverless AWS
Fargate, the containers will be running on top of the operating system of a
virtual machine (not on top of the operating system of a bare metal server).
For both launch types it is likely that the ECS solution will include:

AWS CloudFormation;
VPC;
IAM;
AWS CodePipeline;
AWS CodeDeploy;
Auto Scaling;
Elastic Load Balancing;
CloudTrail;
CloudWatch;
AWS Secrets Manager
AWS System Manager;
AWS AppMesh;
Amazon S3;
Bastion host;
Amazon Elastic Container Registry (ECR).
A deep understanding of the AWS platform, as well as its resources, and
services, is a pre-requisite to gaining the benefits of ECS.
To contribute to your success with ECS, details about the AWS platform,
resources, and services:

AWS Networking;
AWS Compute (EC2 instances);
AWS Storage;
AWS Identity;
Amazon Route 53;
AWS Cloud Config;
AWS Security Hub;
AWS DevOps Toolset;
AWS Service Quotas;
AWS Support Tiers;
AWS Cloud Map;

have to be understood to be able to successfully integrate ECS with the
AWS platform, its resources, and services. There are other valuable AWS
resources and services whose features provide direct benefits to ECS, e.g.,
CodePipeline, CodeDeploy, CloudWatch, etc. These other AWS resources
and services will be disclosed and described in this chapter when and where
they interoperate with ECS.

ECS Endpoint
To connect programmatically to an AWS service, you use an endpoint. As
such, ECS has a service endpoint. The service quotas (a.k.a. limits) for an
AWS service are enforced at its service endpoint. The ECS endpoint
represents the ECS backbone.
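As a minimal illustration (the Region, and therefore the endpoint, shown here is
arbitrary), an AWS CLI request such as the following is resolved against the
regional ECS service endpoint:

    # List clusters in a chosen Region; the request is sent to that
    # Region's ECS endpoint (here, ecs.us-east-1.amazonaws.com).
    aws ecs list-clusters --region us-east-1

    # The endpoint can also be supplied explicitly if required:
    aws ecs list-clusters --region us-east-1 \
        --endpoint-url https://ecs.us-east-1.amazonaws.com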

Amazon ECS and Service Quotas


For each AWS account, the consumption of an AWS service is constrained by
default limits established by AWS – these limits are now called by a more
emotionally neutral term: quotas. The AWS account's quota for each AWS
service is per Region. The AWS consumer can request to increase their quota
of an AWS service; however, AWS only permits quota changes for a
subset of the 100+ AWS services.
Within the AWS Management Console, the Service Quotas console is
used to perceive all service quotas for all AWS services, for all Regions, set
for the consumer's AWS account. The Service Quotas console is also used to
request a quota increase for an AWS service in a given Region.
The typical Elastic Container Service consumer will need to keep an eye
on these service quotas:

Amazon VPC
IAM
Amazon ECS
Amazon ECR
Billing and Cost Management
AWS CloudFormation
Amazon CloudWatch
Amazon CloudWatch Events
Amazon CloudWatch Logs
AWS CloudTrail
EventBridge
AWS CodeDeploy
AWS CodePipeline
Amazon EC2
Amazon EC2 Auto Scaling
Elastic Load Balancing
AWS Certificate Manager
AWS Cloud Map
AWS Config
Route 53
AWS App Mesh
AWS Elastic Beanstalk
Amazon EBS
Amazon EFS
Amazon Elastic Inference
AWS Outposts
Amazon SNS
Amazon S3

Unfortunately, at this point in time, the service quota for the Fargate
launch type (used as a compute instance in your ECS solution) is not visible in the
Service Quotas console. For current service quotas for AWS services please
review the AWS General Reference Guide that is available on-line.
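For reference, and assuming the AWS CLI is configured, quotas can also be
inspected, and increase requests submitted, from the command line; the quota
code and desired value below are purely illustrative placeholders:

    # List the applied quotas for Amazon ECS in the current Region
    aws service-quotas list-service-quotas --service-code ecs

    # List the AWS default quotas for Amazon ECS
    aws service-quotas list-aws-default-service-quotas --service-code ecs

    # Request a quota increase (quota code and value are placeholders)
    aws service-quotas request-service-quota-increase \
        --service-code ecs --quota-code L-XXXXXXXX --desired-value 2000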

ECS Pre-requisites
Before starting to DevOps ECS, a number of technical pre-requisites must
be satisfied:

To access AWS resources and services, an AWS account and
password are needed;
An AWS Virtual Private Cloud (AWS VPC) in which the ECS
runtime and AWS resources are launched is needed. Depending
on the use case, private subnets and or public subnets, as well as
a bastion host and a gateway, may need to be present in the
AWS VPC;
To access the ECS, an IAM user is needed. The IAM user is
granted access to the AWS Management Console and is granted
administrator access privilege to ECS. Best practice is to attach
a tag to the IAM user;
A Security group that acts as a firewall for associated compute
instances, controlling both inbound and outbound traffic at the
compute instance level is needed. Best practice is to add an SSH
rule to the security group so you can log into the compute
instance and examine the tasks with Docker commands;
When the EC2 launch type is used, a (public-key cryptography)
key pair is needed to provide SSH access and log in to the
EC2 instance;
The AWS CLI (installed on the DevOps box), which enables you
to build scripts that automate common management tasks in
Amazon ECS. Best practice is to install the latest AWS CLI
version for use with Amazon ECS (see the sketch after this list).
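The following sketch shows how two of these pre-requisites might be satisfied
with the AWS CLI; the key pair name, security group ID, and CIDR range are
placeholders, and the correct values depend on your own VPC:

    # Create a key pair for SSH access to EC2 container instances
    aws ec2 create-key-pair --key-name ecs-demo-key \
        --query 'KeyMaterial' --output text > ecs-demo-key.pem
    chmod 400 ecs-demo-key.pem

    # Add an inbound SSH rule to the security group used by the instances
    aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp --port 22 --cidr 203.0.113.0/24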

ECS Platform-as-a-Service
The Amazon ECS Platform-as-a-Service (PaaS) is composed of several
configurable artifacts:

VPC
Cluster
Cluster Capacity Provider
Service
Instance Launch Type
Service Load Balancing
Service Deployment
Task
Container
Container Networking Mode
Container Storage
Container Repository

Each of these artifacts is complex and has numerous parameters that have
to be carefully configured by the ECS consumer. These artifacts are not all
that are present in the ECS platform that the consumer must configure, but
they are the core artifacts of ECS.
The VPC is vital to, is the foundation of, the ECS solution. And, it is the
internal and external network traffic required by the Docker containers that
drives the design and provisioning of the VPC and its subnets, their routes
and security groups, as well as gateways.
The ECS cluster supports horizontal scaling of compute instances. The
ECS Service supports both load balancing (of HTTP/HTTPS and TCP traffic)
as well as service discovery.
In addition to the VPC, the other vital ECS artifact is the Task. The Task
has a task definition, as does an ECS Service. The task definition is a group
of Task level configuration parameters plus a list of 1-10 Docker container
definitions. A Docker container definition is part of the task definition.
Consequently, a Docker container cannot be run in ECS in the absence of its
task definition.
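A minimal sketch of a task definition is shown below; the family name, account
ID, IAM role, and image URI are illustrative placeholders, and a real task
definition will carry many more of the parameters described later in this
chapter. Given a file named taskdef.json with contents such as:

    {
      "family": "web-demo",
      "networkMode": "awsvpc",
      "requiresCompatibilities": ["FARGATE"],
      "cpu": "256",
      "memory": "512",
      "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
      "containerDefinitions": [
        {
          "name": "web",
          "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-demo:1.0",
          "essential": true,
          "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }]
        }
      ]
    }

the task definition is registered with the AWS CLI command
'aws ecs register-task-definition --cli-input-json file://taskdef.json'.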
An ECS Service launches a Task. A Task can also be launched without
need of an ECS Service. When a Task is launched, its list of containers is run
on a compute instance provisioned in the cluster. With Amazon ECS, you
can control the EC2 instance chosen for the Task according to rules and
constraints that you define at the Task level.
Shared internal or external network traffic requirements can result in more
than 1 container being listed in a task definition. Likewise, the shared need
for other types of namespaces – Process ID; User ID; Mount; IPC; etc. – can
result in more than 1 container being listed in a task definition. Unless a
reason exists for more than 1 container per task definition, adopt the ratio of 1
container to 1 task. Doing so enables both vertical and horizontal scaling of the
container image. As importantly, that 1:1 ratio minimizes the scope of a
container's attack surface.
Each of the ECS artifacts will be described in detail in this manuscript.
Unquestionably, the ECS architecture is rooted in Docker; therefore, to
understand how to configure ECS, you need a good understanding of Docker.
As this chapter explores each of those ECS artifacts, relevant information
about Docker will be shared.

Docker Background
Docker is a technology that allows you to build, run, test, and deploy
distributed applications that are based on containers. In 2013, Docker Inc (a
startup tech company from San Francisco) originated the Docker container
management service. The original Docker (written in Golang) grew out of a
Platform-as-a-Service (PaaS) called dotCloud. The Docker container
management service is an open-source project that adheres to the
Apache License 2.0.
The component that runs and orchestrates containers is the Docker
Engine. Like the hypervisor that runs a virtual machine, the Docker Engine is
a runtime environment that runs the container.
Docker can be installed on various Linux distributions, on Windows and
Windows Server 2016, and on Mac OS X.

Open Container Initiative (OCI)


The OCI is a governance council, i.e., a committee for putting things on
top of other things, that is responsible for standardizing container
infrastructure. At present, the OCI has published specifications of two
standards:

1. The image-spec;
2. The runtime-spec;

As of Docker version 1.11, the Docker Engine conforms to the OCI
runtime-spec.

Docker Images
It is important to understand how Docker builds and stores images as well
as how containers use those images. Creating the Docker image begins after
the application binary is built. The image includes everything the application
needs to run:

Code binary;
OS components the application needs to run (system tools;
system libraries);
Application library Dependencies;
File system object.

Docker images are built in several layers. Layers are stacked on top of
each other. Each layer of the Docker image is read-only (except for the top-
most writeable-layer a.k.a. the container layer).
In Docker, the simplest image is a stack of 3 layers:

1. Base layer – contains the OS and filesystem components the
application needs to run. Here is where consistency across stages
is feasible but only if the exact same OS image (patches, drivers,
etc.) is used in all stages.
2. Middle layer – contains the application library dependencies.
3. Top layer – contains the code that implements the application
functionality.

Over time, as changes are made to the application binary and the Docker
image is rebuilt, new layers are added to the top of the layer stack. Each new
layer is only the set of differences from the layer below it.
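The layer stack of a local image can be inspected with the Docker CLI; the
image name below is hypothetical:

    # Each row of the output is one read-only layer, newest at the top
    docker history myapp:1.0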
It is the application image's use of kernel resources in isolation from other
processes, and it is how the image may share a namespace with other
processes, that dictate container coupling and the bundling of containers in a
Task. The requirement for an image to use a namespace exclusively, or to
share a namespace, is determined by the use case the application image is
designed to support. Both the image's need to consume or share internal network
namespaces and its need to handle external network traffic have a profound
impact on the design of the VPC used by the ECS cluster and the compute
instances on which the image(s) executes.

Dockerfile
The Docker daemon API's 'build' command assembles a Docker image by
reading the instructions contained in a formatted text file, called a Dockerfile.
The Dockerfile instructions identify the image the container is based on,
explain how to set up the container's private filesystem, and describe how to
run that image in a container.
Once you have the code binary and the Dockerfile you are able to build
the read-only Docker image. When the image is built, an image manifest is
generated. The image manifest contains information about the image, such as
its layers, size, and digest, as well as the operating system and compute
instance architecture the image was built for.
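A minimal, illustrative Dockerfile is sketched below; the base image is a real
public image, but the application file app.py and the port are hypothetical, and
a production Dockerfile is usually more elaborate:

    # Base layer: the OS and filesystem components the application needs
    FROM python:3.8-slim

    # Middle layer: application library dependencies
    RUN pip install --no-cache-dir flask

    # Top layer: the code that implements the application functionality
    COPY app.py /srv/app.py

    # How to run the image in a container
    EXPOSE 8080
    CMD ["python", "/srv/app.py"]

With the Dockerfile and the code binary in the current directory, the command
'docker build -t myapp:1.0 .' assembles the image and generates its manifest.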

Docker Image Registry


A Docker Image Registry is a storage and content delivery system which
holds Docker images (and image manifests) and which enables you to
distribute those images. AWS has the Elastic Container Registry (ECR)
service that is used to hold and manage Docker images. The ECR is outside
the scope of this manuscript.
Docker images are pushed to and pulled from the registry. The registry
supports TLS and basic authentication. The registry can be part of the ECS
infrastructure or external to it. Best practice is to integrate the Docker Image
Registry into your ECS CI/CD workflow.
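A hedged sketch of that integration, using the AWS CLI and the Docker CLI,
might look like the following; the account ID, Region, repository, and image
names are placeholders:

    # Create a repository in Amazon ECR (a one-time step)
    aws ecr create-repository --repository-name myapp

    # Authenticate the Docker CLI to the ECR registry
    aws ecr get-login-password --region us-east-1 | \
        docker login --username AWS --password-stdin \
        123456789012.dkr.ecr.us-east-1.amazonaws.com

    # Tag the local image with the registry/repository name, then push it
    docker tag myapp:1.0 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:1.0
    docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:1.0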

Docker Containers
A container is an instance of an image that can be run on a compute
instance. Once you have a Docker image (held in a registry), the Docker
daemon API’s ‘run’ command starts a container based on the image.
A container is a running process that uses its partition of the compute
instance’s operating system’s kernel resources. The running container:

is isolated in that it has a process tree separate from the host;
has its own network namespaces, which it might share with other
containers;
has its own file system, which it might share with other
containers;
might use interprocess communication with other containers;
and
has its own pluggable network driver.
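For illustration, a container can be started from a local image with a single
'docker run' invocation; the image and container names here are hypothetical:

    # Start a container in the background, mapping host port 80 to
    # container port 8080
    docker run -d --name web -p 80:8080 myapp:1.0

    # List running containers and stream this container's stdout/stderr
    docker ps
    docker logs -f web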

Within a Docker container there is a read-only Docker image whose top-
most layer is the writeable-layer. It is the image that provides the container
with its own, private, filesystem and other types of kernel resources.
Any changes made by the container to the host file system are stored in
the writeable-layer. Multiple Docker containers can share access to the same
read-only Docker image, but each container will have its own private
writeable-layer.
Unlike a virtual machine (VM), a Docker container does not require an
operating system – at least not a complete operating system. The container
runs in user space on top of the existing operating system that manages the
compute instance's kernel resources. At the same time, multiple containers
can run within multiple user spaces on a single
compute instance. Conceptually, a container running in a user space is similar
to an application running on MVS (i.e., applications have address space and
virtualized resources controlled by MVS).
A container is a means by which an application can be packaged for
distribution. According to folk-legend, the name ‘container’ was chosen due
to its resemblance to the generic shipping container used to move cargo
around the globe, complete with its manifest that you can inspect.
By default, the Docker container runs in the foreground. The Docker
daemon starts the application residing in the container and attaches the
console to the application’s standard input, output, and error.
In the AWS cloud, unless the consumer uses Amazon ECS in combination
with AWS Outposts, their Docker containers will not be executed by a runtime
environment that is running on bare metal; instead, the Docker runtime runs
on a virtual machine that is an EC2 instance or on a Fargate instance.

ECS Control Plane


At present, Docker version 18.09.x software is installed on ECS launch
instances. In addition, to exploit the features and gain the benefits of AWS,
Amazon ECS extends the Docker service architecture.
Every ECS launch instance (of both types) is plumbed in part or whole
with Docker and Amazon ECS software:

Docker daemon (dockerd), a.k.a., the Docker Engine;
shim (docker-containerd-shim);
runc (docker-runc) - an OCI runtime-spec compliant utility used
to create and run containers;
Amazon ECS Container Agent (a.k.a., containerd (docker-
containerd)) - serves up Docker images to runc. The ECS Container
Agent uses runc to create containers. In addition, runc can be
used as a standalone CLI tool to create and run containers;
ecs-init Service - controls the starting and stopping of the
Amazon ECS container agent.

Together these processes make up the ECS control plane. The ECS
control plane can be manipulated through the ECS Console, the ECS API, the
AWS API, as well as the AWS SDKs.

Docker Engine
The Docker daemon, a.k.a., Docker Engine, implements:

Docker API;
Image management;
Role-based Access Control (RBAC);
Security features;
Core networking, and
Data Volumes.

The Docker daemon communicates with the ECS Container Agent via a
CRUD-style API over gRPC remote procedure calls -
gRPC is outside the scope of this manuscript.

ECS API
In Amazon ECS, the Docker daemon API is re-titled the ECS API. The
Docker daemon has a RESTful API that is manipulated by a command line
interface (CLI) and is accessed by an HTTP client, such as:

Docker Client;
AWS CLI (accessed through an AWS SDK);
Amazon ECS Console (available within the AWS Management
Console);
ecs-cli command line utility (accessed through an AWS SDK);

The ‘docker’ command can manipulate the Docker API to do such things
as:
Build a Docker image from a Dockerfile and application binary;
Push/pull an image or a container to a registry/repository;
Manage images over their lifecycle;
Manage Docker containers over their lifecycle;
Display container resource usage metrics;
Import a tar ball to instantiate a container’s filesystem;
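For example (the image, registry, and tar ball names are placeholders), the
corresponding Docker CLI invocations include:

    docker build -t myapp:1.1 .                 # build an image from a Dockerfile
    docker push registry.example.com/myapp:1.1  # push an image to a registry
    docker images                               # list and manage local images
    docker ps -a                                # list containers across their lifecycle
    docker stats                                # display container resource usage metrics
    docker import app-fs.tar myapp:base         # import a tar ball as a container filesystem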

Docker RBAC
To enforce authentication while making changes to a cluster or
containerized application, the Docker daemon uses RBAC. On AWS, these
Docker roles have been supplanted by numerous IAM roles and service links.
These IAM roles, service links, and policies, will be identified and described
in this manuscript.

Amazon ECS Container Agent


The Amazon ECS Container Agent (i.e., the ‘containerd’ component of
the Docker) handles container execution as well as container lifecycle
operations. The ECS Container Agent runs in its own Docker container.
The ECS Container Agent cannot instantiate Docker containers: it forks a
new instance of runc (the OCI-compliant runtime) which creates each container. The
runc interfaces with the operating system kernel to gather together the
constructs – namespaces, cgroups, etc. – needed to instantiate a container.
The container is started as a child-process of runc, and as soon as the
container is started the parent runc exits.
The ECS Container Agent makes calls to the ECS API on your behalf.
EC2 compute instances that run the Container Agent require an IAM role and
policy for ECS to know that the Container Agent belongs to your account.
The Container Agent uses a persistent web socket connection to the ECS
backend. When the ECS Task's status changes, or when an ECS
Service scales up or down, the Container Agent tracks these changes as the
last known status (lastStatus) of the Task and the desired status
(desiredStatus) of the Task.
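Those two status fields are visible through the ECS API; for example (the
cluster name and task ID are placeholders):

    # Show the last known and desired status reported for a task
    aws ecs describe-tasks \
        --cluster demo-cluster \
        --tasks 1234abcd-56ef-78ab-90cd-1234567890ab \
        --query 'tasks[].{last:lastStatus,desired:desiredStatus}'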

ECS Console
The ECS Console (nested within the AWS Management Console)
enables the consumer to create and manage their ECS Clusters, Services,
Tasks, Containers, and compute instances in a single place – that centralized
console is the ECS control plane. The ECS Console is a web page that
enables you to manipulate the Docker daemon API, the docker run command,
as well as the ECS API, through a simple user interface that provides access
to a variety of wizards.
The ECS Console wizards enable the consumer to create ECS Clusters,
Services, Tasks and Containers, to associate them with IAM roles and
policies, to provision them within VPC subnets, as well as to auto scale and
load balance compute instances. And after the creation of the clusters,
services, tasks, and containers, the ECS Console enables the consumer to
manage each artifact across their entire lifecycle.
The ECS Console’s wizards walk the consumer through the steps
involved, let you set configuration parameters, pick IAM roles, use auto
scaling and load balancers, without requiring the consumer to know the
corresponding AWS API calls, Docker daemon API calls, docker run
command options or ECS API calls. The ECS Console makes those API calls
and executes the required commands on the account’s behalf. However, a
useful control plane does not a Software-as-a-Service make. The ECS
Console does not fully-manage ECS – human involvement is required. It
helps to view the ECS Console wizards as helpful tutorial guides. However,
at places along the way the wizards provide defaults (IAM roles, VPC, etc.)
that are not suitable for production environments.
Inevitably, once familiar with the steps taken and resources required to
create and deploy VPCs, ECS Clusters, Services, Tasks, Containers, and
compute instances, you need to pick up other tools in the AWS DevOps
toolset that provide the power to fine tune the particulars demanded by a
specific application image.
ECS is a PaaS, and as a PaaS, the consumer will write the
infrastructure code that materializes and terminates the VPCs, the ECS
Clusters, Services, Tasks and Containers, and compute instances, across and
throughout their lifecycles. Moreover, that infrastructure code will be
managed as a valuable software resource via Continuous Integration.

Levels of the ECS Stack


The essential ECS ingredient is the task definition. In order to run any
Docker container on ECS, the Docker container must be part of a task
definition. The task definition has a number of Task level parameters as well
as a list of 1-10 container definitions. In order for an ECS Service to deploy a
container, the ECS Service must have its task definition. In order for an ECS
Task to run a container, the ECS Task must have its task definition. ECS
enforces a limit of 1 task definition per ECS Service, per ECS Task.
How the Task level parameters are configured depends on how its
containers are defined, and how a container is defined depends on how its
task is defined. Those interdependencies are prominent when the data
volumes and bind mounts that are used by the Task are defined – some
parameters are in the task definition while their related-parameters are in the
container definitions.
The levels present in the ECS stack depend on the use case that is the
basis of the ECS solution. When these use cases are supported:

Containerized Application Migration to AWS;


Hybrid Containerized Applications;
Microservices;
Software-as-a-Service;

they require an ECS Service. Therefore, the Amazon ECS platform has 6
levels, each of which the consumer must configure:
6 - Image;
5 – Container and Launch type;
4 – Task and Launch type;
3 – Service and Launch type;
2 - Cluster
1 - VPC
However, when these use cases are supported:

Batch ETL Processing;


Machine Learning;

they require an ECS Task, and the ECS platform has just 5 levels that the
consumer must configure:
5 - Image;
4 – Container and Launch type;
3 – Task and Launch type;
2 - Cluster
1 - VPC
The Image level has already been explained. The Container level has been
explained, but only at a high level, and needs further explanation. The
Task, Service, Cluster, and VPC levels remain to be explained.
The task of configuring each ECS level is not an exercise independent of
the other levels – each level has dependencies that it shares with the other
levels. The configurations of a Service can dominate its task definition. The
configuration of a task definition can dominate its list of container
definitions. Conversely, parameters in a container definition can over-ride
parameters set in its task definition. In addition, parameters in a task
definition can override parameters set in the Service definition.
It is important to appreciate which Container parameters override
corresponding Task parameter settings, as well as which Task parameters
override corresponding Service parameter settings. With good fortune, a
single Service can, by letting its task definition set parameters, be a Service
template that can be used by different task definitions. Likewise, with good
fortune, a single Task can, by letting its container definitions set parameters,
be a Task template that is used by different lists of container definitions.
Last but not least, to provide a secure ECS environment, IAM roles of
various types, and their policies, are present in each level of ECS. It is important
to appreciate how IAM secures and governs each VPC, ECS Cluster, Service,
Task, Container, and compute instance, as well as their access to other AWS
resources and services.

ECS and CloudFormation


AWS CloudFormation is a free service used to launch, configure, and
connect AWS resources. CloudFormation treats infrastructure as code and
does so by using JSON- and/or YAML-formatted templates. AWS does not
charge a fee for using CloudFormation; you are charged only for the infrastructure
resources and services created using CloudFormation.
CloudFormation enables you to version control your infrastructure –
VPCs, Clusters, Services, Tasks, Containers, EC2 instances, as well as
Fargate instances. Best practice is to use a version control system to manage
the CloudFormation templates. CloudFormation is also a great disaster
recovery option.

CloudFormation ECS Templates


The AWS CloudFormation template is used to define the entire solution
stack, as well as runtime parameters. There are CloudFormation templates
that support VPCs, Clusters, Services, Tasks, EC2 and Fargate instances. You
can reuse templates to set-up resources consistently and repeatedly. Since
these templates are text files, it is a simple matter to use a version control
system with the templates. The version control system will report any
changes that were made, who made them, and when. In addition, the version
control system can enable you to reverse changes to templates to a previous
version.
An AWS CloudFormation template can be created using a code editor.
But templates can also be easily created using the drag-n-drop CloudFormation
Designer tool, available in the AWS Management Console. The Designer
will automatically generate the JSON or YAML template document.
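As a minimal sketch (the stack, cluster, and file names are illustrative), an
ECS cluster can be declared in a YAML template such as the following:

    AWSTemplateFormatVersion: '2010-09-09'
    Description: Minimal ECS cluster (illustrative only)
    Resources:
      DemoCluster:
        Type: AWS::ECS::Cluster
        Properties:
          ClusterName: demo-cluster

Saved as ecs-cluster.yaml, the template can be provisioned with
'aws cloudformation deploy --stack-name demo-ecs-cluster --template-file ecs-cluster.yaml'.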

1 Use Case => 1 ECS Solution


At all times, the particulars of the stakeholders’ use case determines the
configuration of AWS resources and services. In the real world the
stakeholders’ use case is seldom well-known or well-defined (particularly on
Agile projects) – it is more akin to a mini Tower of Babel. The use case is not
actionable by DevOps until it has clear statements defining how the
solution’s output will be verified and validated.
For each use case the typical stakeholders are continually changing their
requirements (particularly on Agile projects) as they attempt to comprehend
their own needs. Today, mis-information - not clarity - rules the typical IT
project. That chronic communication deficit permeates all levels of the
organization and creates and perpetuates chaos throughout the lifecycle of the
project/product. In the face of unrelenting chaos the best approach is divide-
n-conquer combined with keeping-it-simple.
To minimize the ripple effects of political chaos on the ECS solution, best
practices is to isolate an ECS Service or an ECS Task to a dedicated VPC +
cluster. Given the perpetual state of mis-information, when multiple ECS
Services or Tasks are using the same VPC + cluster, the probability of
collateral damage is exceedingly high. The secure isolation of an ECS
solution evaporates as multiple use cases share the same VPC and ECS
cluster. With multiplicity, an unsecure solution with a highly
vulnerable attack surface is all but guaranteed.
Though the AWS VPC is not a general purpose utility, an ECS Cluster
may be put to general use. As always, all artifacts in the cloud are temporary,
not permanent, constructions. In the cloud the dedicated VPC + cluster
only exists when and while an ECS Service or ECS Task needs to run.
In ECS, a Docker container is not a standalone application. A task
definition is required in order to run a Docker container on ECS. The ECS
Service has 1 task definition, as does the ECS Task. The task definition has a
list of 1-10 container definitions. When the task is run all of its containers run
on the same compute instance.
When more than 1 container is listed in the task definition this is due to
tight coupling that exists among those containers – they may share the same
namespace (e.g., mount point; Interprocess communication; control group;
etc.) - they do share the same network stack. When the containers of the task
communicate with each other, their network traffic stays on the localhost
network stack. When the awsvpc network mode is used the containers can
communicate over the Task’s virtual Elastic Network Interface (ENI) that is
shared by the containers. The need to minimize network traffic latency can be
good cause to put more than 1 container definition inside a task definition.
In distributed systems tight coupling is to be avoided if possible. The
general rule for distributed systems states that coupling is to be minimized
and cohesion is to be maximized. Cohesion is maximized when the process
does 1 thing and only 1 thing in an optimal manner – another way of
describing this is to state that the service has only 1 service contract, or data
contract. Unless there is good cause present in the use case minimizing
coupling and maximizing cohesion always applies. As does ensuring that
each Service and each Task is able to fail quickly.
For good cause or not, having more than 1 container per task definition
incurs potential losses. When a task has more than 1 container:

The containers share the same namespaces – they have the same
vulnerable underbelly.
The containers share the same or similar attack surface.
You have lost the ability to horizontally scale those containers
independently of each other.
You have lost the ability to vertically scale those containers
independently of each other.

And, when more than 1 Task is run on the same compute instance:

You have sacrificed the ability to horizontally scale those Tasks
independently of each other.
You have sacrificed the ability to vertically scale those Tasks
independently of each other.

Other than the SaaS use case, it is extremely rare for the exact same
infrastructure requirements to be shared by two or more of these other types
of ECS use cases:

1. Containerized Application Migration to AWS;
2. Batch ETL Processing;
3. Hybrid Applications;
4. Machine Learning;
5. Microservices;

The use cases well suited to ECS typically have multiple loosely coupled
Services (e.g., 1, 3, and 5) or have multiple sequential Tasks (e.g., 2 and 4).
In most scenarios, each Service or Task launches only 1 Docker container,
where each container automates a single discrete activity present in a
business process model (BPM). As such, the typical ECS use case will
include multiple Services where each Service launches 1 container. For
example, for a product that I and my team designed and delivered there were
64 distinctly different services, only 2 of which were tightly coupled and
therefore part of the same task definition. In well-designed and well-made
distributed systems the vast majority of Services/Tasks have 1 and only 1
container. It is good to be able to place more than 1 container in a
Service/Task when they are tightly coupled but doing so is seldom the right
solution because tight coupling is rarely a good design pattern for use with
distributed systems. There are scenarios where tight coupling is the right
choice but they are definitely the rare exception and are not the general rule.
How a group of related and distributed Services/Tasks interoperate – how
they share data - with each other most often includes stable queues. Though
stable queues are a common part of solutions based on service oriented
architecture, stable queues are not addressed by ECS or by Elastic Kubernetes
Service (EKS). The consumer has to manage provisioning message queues,
data services, caches, and other related integration services, outside of ECS.
ECS is a wonderful technology, but it is not all things to all components of a
distributed system. The adoption of the CloudFormation service by the ECS
consumer is a must. In AWS, only that service is currently able to manage the
ancillary resources and services that the typical ECS solution demands.
Though an ECS cluster supports variations of launch types and a dynamic
variety of different Services and Tasks, best practice is to dedicate a cluster to
a single use case for which its performance is optimized and its cost
minimized. The default practices of 1 VPC + Cluster to 1 logical group of
Services, of 1 VPC + Cluster to 1 logical group of Tasks, are well supported
by the ECS console which provides access from a single point to all clusters
that are the property of the AWS account.
The programmatic control of VPCs, AWS resources and services, ECS
Services, ECS Tasks, and ECS Containers and Images is the key to DevOps
of ECS. When a Service needs to run in its cluster, its compute instances and
their Docker containers, each AWS resource and service, is programmatically
spun-up and shutdown, scaled in and scaled out, and this happens on demand,
dynamically. When a Service or a Task needs to run all of its parts are to be
programmatically assembled.
If by good fortune a given VPC + cluster configuration can be proven to
support multiple use cases then the opportunity to associate multiple Services
with a single cluster exists. In the real world that type of good fortune is
extremely rare. Best practice is to divide-n-conquer until there is proof that
more than 1 use case can securely share the same VPC + cluster.
This divide-n-conquer approach does not impede the arising of garbage
application code, does not ensure the success of unit tests, or integration tests,
or workload testing. The divide-n-conquer approach does however isolate
those human error events to a particular chunk of infrastructure running
during a discrete window of time – the problem space is isolated. Process
isolation is the digital equivalent of social distancing – it makes sense to keep
a distance from others who may cause you harm.
It is important to acknowledge that each type of ECS use case has a
distinct attack surface. It is highly improbable (outside of a SaaS use case)
that different collections of containerized applications share the same roles
and digital identities, use the same access permission policies, allow the same
network traffic workflows, use the same infrastructure of AWS resources and
services.
Though ECS use cases of all types may share security vulnerabilities, best
practice is to isolate each ECS use case to its dedicated VPC and cluster and
by doing so ensure that each use case is not compromised or complicated by
any other ECS use case’s attack surface. Neither the VPC nor the ECS
Cluster is a general purpose utility. Taken together the VPC and cluster are a
use case specific construction – both are specific to a given ECS Service or
ECS Task.
Another general rule of distributed systems is that processes must be able
to fail fast. Failures happen due to hardware, failures happen due to bugs in
the application, failures happen due to administration errors. Failing fast
starts the recovery process sooner than later. When multiple containers exist
in the same task the process of recovering from failure events is extremely
complicated and recovery time expands significantly.
Lastly, when the EC2 launch type is used there are numerous
configuration matters that must be addressed on a container by container
basis. When different containers, Tasks, and Services, run in the same cluster
it becomes nearly impossible to maintain homogeneity among the EC2
compute instances. Homogeneity of EC2 instances is key to successful auto
scaling and load balancing of an ECS solution. A lack of homogeneity among
EC2 instances is a sure sign of contamination by multiplicity of Services or
Tasks in a cluster.

ECS’s VPC Configuration


The AWS Virtual Private Cloud (VPC) is the lowest level of the ECS
platform – it is the foundation of the Amazon ECS solution. A VPC is a
fundamental tool in AWS that the account uses to control the networking
capabilities of resources (e.g., compute instances; Docker containers; etc.)
that the account launches.
A Docker container can, by default, communicate with other containers of
its task by using local network traffic (via the localhost network namespace
of the compute instance), and, when properly configured, can handle external
network traffic as required by the use case. In ECS, external network traffic is
all network traffic that does not originate from the container(s) of the Task.
Configuring external networking for a Task (and in turn for its container(s))
is done by modifying the settings of the VPC in which you launch the ECS
Task.
Given all other VPC design factors are assumed equal, how the VPC is
configured depends in great part on the external network traffic used by the
container(s) of the Task. Network traffic is external whenever a container is
communicating with another process/endpoint that is not another container
listed in its task definition. That endpoint can be another ECS Task, can be an
AWS service or might be some endpoint over the Internet.
The internal and external networking requirements of the containerized application determine, at a minimum, these aspects of the VPC:

1. Cost effectiveness
2. Region
3. Availability Zone(s)
4. VPC’s CIDR Block(s)
5. Amazon DNS Server and Route 53
6. VPC DHCP Options
7. Choice of Subnets, both private and public
8. VPC Router Configuration
9. VPC Security Groups
10. VPC Network ACL
11. Public IPs assignment
12. Non-routable IP addresses
13. AWS Internet Gateway (IGW)
14. AWS Network Address Translation (NAT) Instances
15. AWS NAT Gateways
16. AWS Egress-only Internet Gateway
17. VPC Peering
18. Virtual Private Networks (VPNs)
19. AWS VPC Endpoints
20. Bastion Host

Once these aspects of the VPC are established, collectively, they significantly constrain the ability of the VPC to support any other ECS use cases in a secure, performant, and efficient manner. No worries. The VPC
that underlies the ECS solution is not a general purpose utility, the VPC is
custom-built according to the requirements of the containerized application
that it supports.
Best practice is to declare the VPC definition in an AWS CloudFormation
template, and to manage that document like all files that participate in your
CI/CD workflow.
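A minimal sketch of such a template, assuming illustrative resource names and CIDR ranges (only two of the recommended three subnets are shown, and the route tables, gateways, and security groups that a real solution needs are omitted for brevity):

# Hypothetical CloudFormation fragment for a VPC that underlies an ECS solution.
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  EcsVpc:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16         # example range; size it for Tasks, ENIs, and load balancers
      EnableDnsSupport: true         # required for Amazon-provided internal DNS hostnames
      EnableDnsHostnames: true
  PrivateSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref EcsVpc
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']   # first AZ in the Region
  PrivateSubnetB:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref EcsVpc
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs '']   # second AZ in the Region
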
With every VPC that underlies an ECS solution:
At a minimum, to be used by ECS, the VPC has at least 2
subnets, where each subnet is located in a different AZ. Best
practice configures at least 3 subnets located in 3 different AZs
for a VPC that supports an ECS solution.
A VPC subnet is a logical landscape that your ECS resources
can be placed into. Each subnet is located in an AZ and has its own route table, which defines the rules about the network traffic that is allowed in the subnet. A VPC has two types of subnets: private
and public.
A private subnet does not have direct internet access. The Tasks
running inside the private subnet do not have public IP
addresses, only private IP addresses. Instead of an Internet
Gateway, a Network Address Translation (NAT) gateway is
attached to the subnet. The Tasks running in a private subnet can
communicate to other endpoints on the internet via the NAT
Gateway. The requesting Tasks would appear to have the IP
address of the NAT gateway to the endpoint receiving of the
communication somewhere in the Internet. If Tasks want to
communicate directly with each other, they can use each other’s
private IP address. The network traffic sent directly from one
Task to another Task stays inside the subnet without going out
to the Internet Gateway and back in. Tasks running in a private
subnet are protected and cannot receive any inbound external
traffic. There is no way for an endpoint on the internet to reach
the Task directly because the endpoint does not have an IP
address of the Task or a direct route to reach the Task. Tasks
running in a private subnet must have access to a NAT gateway; otherwise they will not be able to send container image requests to Amazon ECR or communicate with Amazon CloudWatch to store container metrics.
A public subnet is a subnet that has an associated Internet
Gateway. The Task can send network traffic to endpoints on the
Internet because the VPC’s route table is configured to route
traffic out via the Internet Gateway. Endpoints on the Internet
can send network traffic to the Task via the Internet Gateway’s
public IP address. A Task that uses the Fargate launch type can be assigned both a private IP address and a public IP address. You have to assign a public IP address to the ENI of a Task that uses the EC2 launch type. A Task that uses private, confidential information ought not to be placed in a public subnet nor have its ENI given a public IP address.
If you are running a container in a private subnet, you might need a way to manage the volume of external network traffic that must reach the container. External network traffic can reach a container in a private subnet by way of an Application Load Balancer or a Network Load Balancer that is
placed in the public subnet. The load balancer is configured to
forward traffic to the Task’s container(s) in the private subnet.
When each Task starts, the private IP address of its ENI is added
to the load balancer’s configuration. When the Task is being
shut down, external network traffic is safely drained from the
Task before removal from the load balancer.
The network isolation of a compute instance (and the containers
that it hosts) is controlled by security groups and VPC settings.
Each EC2 instance in the VPC has a network namespace called
the ‘local loopback interface.’ The local loopback interface is
assigned the IPv4 address of 127.0.0.1 and is given the
hostname of ‘localhost.’
All Docker containers listed in the task definition are run on the
same compute instance.
Docker containers that are listed in the same task definition, and are therefore running on the same compute instance, can communicate with each other by using the compute instance's
localhost network stack. The localhost network stack is often
used for tightly coupled containers. By making a networking
request to the localhost network stack, the container bypasses
the (virtualized) network interface hardware and, instead, the
operating system routes network calls from one container to the
other directly (via the network namespace that they share). This
gives the containers a fast and efficient way to exchange
information. Most containerized applications require more than a
local network namespace: they require the ability to handle
external network traffic.
Both the EC2 and Fargate launch type compute instances have an Elastic Network Interface (ENI) that AWS automatically
associates with the compute instance.
A Task that has containers that use the awsvpc network mode is
automatically associated with its own ENI. The Task’s ENI is
shared by all containers of the Task. Containers of a task can
communicate with each other over their task’s ENI that they
share.
Routing tables in the VPC enable seamless communication
between the primary private IP address of the compute instance's ENI as well as those of a Task and its container(s).
The compute instance’s ENI is, by default, assigned a primary
private IPv4 address.
When the container uses the awsvpc network mode it shares the
primary private IP address of the Task’s ENI.
When both the enableDnsHostnames and enableDnsSupport
options are enabled on the VPC, ECS populates the hostname of a compute instance using an Amazon-provided (internal) DNS
hostname.
Each network interface of each compute instance is assigned to
one or more VPC security groups (thereby locating the compute
instance behind one or more firewalls).
The Task’s ENI (and in turn its container(s)) is assigned to one
or more VPC security groups.
If you specify in the task definition a port mapping for use by
the Task’s ENI, then the Task’s containers can communicate
with each other on that port. For containers, this is a form of
local network traffic.
Neither launch type is assigned, by default, a public IPv4
address.
A Task’s ENI is not, by default, assigned a public IP address.
Ensure that the VPC’s CIDR range has IP addresses sufficient to
support the Tasks deployed to the cluster, as well as IP address
for load balancers to use.
When a VPC that is being used by ECS is updated (for example, a security group is changed) and you want the running Services, Tasks, and Containers that use the VPC to pick up the changes, the running Services and Tasks must be stopped and new ones started.

When the ECS console is used to create the ECS cluster and an existing
VPC is not chosen, the wizard will create the VPC into which the cluster will
be provisioned. However, that default VPC is not usable as an integration
testing or a production VPC.
ECS containers are run dynamically. Depending on the ECS use case, the
VPC that underlies the ECS solution may be required to support:

Automatic scaling out/in of compute instances;
Load balancing of HTTP/HTTPS and TCP requests across those compute instances and containers;
Task placement constraints;
Service Discovery for containers and compute instances, as well as
Route 53, App Mesh, and trunking.

Correctly configuring a VPC is not a simple matter, and when the VPC is configured for ECS containers the activity becomes significantly more complex and challenging.

ECS and Interface VPC Endpoints


The typical ECS solution requires compute instances launched in a VPC's private and public subnets to:

1. Communicate with the ECS control plane;


2. Download Docker images from Amazon Elastic Container Registry (ECR);
3. Download content from Amazon S3, and
4. Send container log data to CloudWatch Logs (via the awslogs log driver).

Amazon ECS supports these requirements without recourse to an Internet Gateway, a NAT device, or a virtual private gateway. Network traffic
between the compute instance and ECR, the ECS control plane, etc., does not
leave the Amazon network. ECS provides this capability through an interface
VPC endpoint that is powered by PrivateLink. PrivateLink is an AWS
technology that enables compute instances to privately access the ECS
control plane, S3, ECR, etc., by using private IP addresses. Tied as it is to the
VPC, PrivateLink does not support cross-Region network traffic. However,
interface VPC endpoints only support Amazon-provided DNS through Route
53.
Communication between an EC2 compute instance and the ECS control plane is required so that containers receive orchestration instructions. An
interface VPC endpoint is required for each of these three (3) ECS services
(each of which has its own Elastic Network Interface (ENI) device):
com.amazonaws.region.ecs-agent
com.amazonaws.region.ecs-telemetry
com.amazonaws.region.ecs
Amazon ECR, S3, and CloudWatch likewise require their own interface VPC endpoints. For S3, the Gateway VPC endpoint for Amazon S3 is used.
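A minimal sketch of one of these endpoints, in the same CloudFormation style and reusing the hypothetical VPC and subnet resources sketched earlier (the ecs, ecs-telemetry, ECR, CloudWatch Logs, and S3 gateway endpoints follow the same pattern):

  EcsAgentEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: !Sub com.amazonaws.${AWS::Region}.ecs-agent
      VpcId: !Ref EcsVpc                                      # assumed VPC resource
      SubnetIds: [!Ref PrivateSubnetA, !Ref PrivateSubnetB]   # one ENI per subnet
      SecurityGroupIds: [!Ref EndpointSecurityGroup]          # assumed security group resource
      PrivateDnsEnabled: true      # resolve the service name to the endpoint's private IPs
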

ECS Cluster Definition


The Cluster is the second level of the ECS platform. The common view of
a cluster is as a group of compute instances running in a VPC. However, in
ECS, the virtual machine that hosts the Docker containers is not specified in
the Cluster definition.
An ECS Cluster is a small group of Cluster-level configuration parameters plus a logical grouping of ECS Services and ECS Tasks – and is not a group
of compute instances. When the Service definition and the Task definition are
fully described later in this manuscript, this view of a Cluster as a group of
Services and Tasks – that is not a group of compute instances - will make
good sense.
It is a fact that unrelated Services and Tasks can run on the same Cluster
during the same window in time. While doing so can be OK in certain
scenarios, the segregation of logical groups of Services and Tasks to their own VPC and cluster is a best practice.
It is the ECS Service and the ECS Task that define the type of compute
instance that their list of Docker container(s) will run on. Consequently, when
a Service is deployed to a Cluster that is an event that provisions compute
instances in the Cluster. Likewise, when a Task is started in a Cluster that is
an event that provisions compute instances in the Cluster.
An ECS Service as well as an ECS Task can use either of the two launch
types as compute instances. However, a given Service or a given Task can
use only one launch type. All containers listed in the task definition will run
on a compute instance of that launch type. Because a Service or a Task can
use either of the two launch types, this can lead to runtime scenarios in which
both launch types are running at the same time in the same Cluster. Best
practice is to minimize the heterogeneity of compute instances in a cluster,
instead maximize the homogeneity of compute instances (either all are
Fargate instances or all are EC2 instances of the exact same type and
configuration).
In that a VPC is its foundation, the ECS Cluster is therefore Region-based.
The maximum number of ECS Clusters per Region, per account, is 2,000.
Best practice is to isolate each ECS Cluster on its own dedicated VPC.
Significant investment of time and effort is required to create the VPC(s) that
can provide viable and robust support to the cluster(s). For example, the VPC
subnets and security groups, the Auto Scaling and Load Balancing services,
that the Service (and the Task) depends upon, have to be defined before the
compute instances and the Docker containers are distributed across the
cluster.
Amazon ECS supports Services that use Auto Scaling to horizontally
scale the group of EC2 compute instances that it launched into the cluster.
Because a Cluster can support multiple Services, an ECS cluster can support
multiple ASGs – within limits. In Amazon ECS, a service that is used to
horizontally scale the Service or Task is called a ‘capacity provider’ (and the
corresponding Auto Scaling Policy is called an ‘attachment’).
When the EC2 launch type is used the consumer is responsible for
ensuring that the Auto Scaling works smoothly, gracefully. For each ECS
Service you have to determine exactly how each EC2 instance in the ASG
will be configured, launched, shut-down/terminated per cluster. An ASG is a
complex construct, as is the Auto Scaling Policy. When the Fargate launch
type is used horizontal scaling is handled automatically by the Fargate service
– the consumer does not have to configure or manage horizontal scaling.
In ECS, there are a few Cluster-level configuration parameters:

clusterARN – the cluster’s Amazon Resource Name (ARN).


clusterName – the cluster’s name.
settings – this enables/disables CloudWatch Containers Insights for the cluster. The default setting disables the use of CloudWatch Containers Insights by the cluster.
defaultCapacityProviderStrategy – Determines how the Tasks
are spread out across the capacity providers. Each ECS cluster
has an optional default capacity provider strategy. The capacity
provider strategy or launch type configured in the ECS Service
over-rides this cluster default.

Cluster parameters can be set using the ECS Console, the ECS CLI, as
well as the AWS CLI. Like all things configurable in AWS, the cluster
parameters and their assigned values are captured in a JSON or a YAML
formatted document. Best practice is to declare the cluster definition in an
AWS CloudFormation template, and to manage that document like all files that participate in your CI/CD workflow.
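A minimal sketch of a cluster declared this way, assuming an illustrative name and a capacity provider defined elsewhere in the same template:

  DemoCluster:
    Type: AWS::ECS::Cluster
    Properties:
      ClusterName: demo-cluster
      ClusterSettings:
        - Name: containerInsights
          Value: enabled                  # the default value is disabled
      CapacityProviders:
        - !Ref DemoCapacityProvider       # assumed AWS::ECS::CapacityProvider resource
      DefaultCapacityProviderStrategy:
        - CapacityProvider: !Ref DemoCapacityProvider
          Base: 1
          Weight: 1
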
At runtime, additional ECS cluster properties can be perceived in the ECS
console:

status – the status of the cluster (ACTIVE, PROVISIONING, DEPROVISIONING, FAILED, and INACTIVE).
activeServicesCount – the count of Services running on the
cluster in the ACTIVE state.
capacityProviders – the list of 1 to 6 capacity providers used by
the ECS Services and Tasks running within the cluster. The
capacity provider determines the infrastructure used by the
containers of a Task. Capacity providers work with both the
EC2 and Fargate compute instances. The capacity provider
determines how compute instance capacity is used by the Task.
With EC2 instances, the capacity provider consists of a name, an
Auto Scaling group (ASG), and the settings for managed scaling
and managed termination protection. With Fargate instances, the
FARGATE and FARGATE_SPOT capacity providers are
provided automatically.
capacityProviderStrategy - the capacity provider strategy
associated with a capacity provider. The capacity provider
strategy determines how the Tasks are spread out across the
capacity providers. A capacity provider strategy gives you
control over how a Task uses one or more capacity providers.
When a Task is run, or when a Service is created, a capacity
provider strategy is specified. A capacity provider strategy
consists of one or more capacity providers with an optional base
and weight specified for each provider. With EC2 instances, the
capacity provider strategy designates how many hosts, at a
minimum, to run on a specific Auto Scaling Group (ASG)
supported by the ECS Cluster. The strategy also designates the
relative percentage of the total number of launched instances
that should use that ASG.
attachments – this is the Auto Scaling Policy that is created for
ASG used by the cluster.
attachmentsStatus – the status of the Auto Scaling Policy used
by the cluster.
runningTasksCount – the count of Tasks running on the cluster
in the RUNNING state.
pendingTaskCount – the count of Tasks that are in the
PENDING state.
registeredContainerInstancesCount – the count of compute
instances registered into the cluster, in both ACTIVE and
DRAINING states.
tags – the tags that categorize and organize the AWS resources
present in the cluster.
statistics – counts of running Tasks, pending Tasks, active
Services, and draining Services, per launch type.

Clusters can be created, listed, updated, destroyed, as well as described (in JSON format). Refer to the Amazon ECS Developer Guide for JSON
examples.
A cluster is destroyed by deleting it from ECS. Before a cluster can be
deleted all Services and Tasks runnable on the cluster must be deleted, and all
registered task definitions and compute instances must be deregistered from the cluster.
Lastly, if the consumer’s use case includes multi-tenancy, best practice is
to ensure that the cluster is used exclusively by a single tenant. If multiple
tenants must share the same cluster, best practice is to ensure that the tenants
share the same availability requirements as well as the placement strategy
used by their ECS Services and Tasks, and Tasks are launched using an AZ
spread strategy.
Cluster Capacity Provider
Microservices and containerized applications have to be able to respond
quickly to failure events as well as to constantly changing workloads.
Consequently compute instances and containers require auto scaling, load
balancing, and service discovery.
In Amazon ECS, capacity providers furnish auto scaling. Amazon ECS
defines capacity providers (and capacity provider strategies) as a Cluster-
level property and defines load balancing and service discovery as Service-
level properties. Capacity providers are a recent feature of ECS and are
intended to supersede task placements and placement constraints.
A capacity provider improves the availability and scalability of the ECS
Service or ECS Task running on the Cluster. A capacity provider is used in
association with a cluster to determine the infrastructure that the ECS
Service’s or ECS Task’s Docker containers runs on.
The PutClusterCapacityProviders API is used to associate a capacity
provider with a cluster. An ECS Cluster has 1 to 6 capacity providers and an
optional default capacity provider strategy, which it shares with all ECS
Services and Task running on the cluster. A given cluster’s capacity provider
can be used by one or more ECS Services and ECS Tasks. The running ECS Service or ECS Task can use 1 or more capacity providers present in the cluster – but they can use only 1 capacity provider strategy.
When an ECS Task is run or when an ECS Service is created, you can
specify a capacity provider strategy or you can use the cluster's default
capacity provider strategy (if that parameter has been set in the cluster
definition). In addition, the ECS launch type used by the ECS Service or ECS
Task determines the type of capacity provider (and strategy) used by the
Service or Task:

AWS Auto Scaling service and Auto Scaling Group (ASG) is used with EC2 instances;
FARGATE is used with Fargate instances;
FARGATE_SPOT is used with Fargate Spot instances;

The launch type defined in the ECS Service or ECS Task determines the
type of compute instance and therefore the compute instance capacity
available to the Service or Task. In AWS, a capacity provider is a way to
manage the EC2 and Fargate compute instances that host your Docker
containers, and allows you to define rules for how the containers (listed in the
task definition) use that compute instance capacity. The capacity provider
manages the horizontal scaling of that compute instance capacity. During a
blue-green deployment the capacity provider is not functional.
When the Fargate or Fargate Spot launch type is used, the FARGATE or
FARGATE_SPOT capacity providers are provided automatically to ECS.
When the ECS Service or ECS Task uses the EC2 launch type, you create a
capacity provider that is associated with an EC2 Auto Scaling Group (ASG).
To be able to scale out, the ASG must have a MaxSize greater than zero. A
service-linked IAM role is required so ECS can use the Auto Scaling service
on behalf of an ECS Service or Task.
The capacity provider manages scaling of the ASG through ECS Cluster
Auto Scaling. Amazon ECS adds an AmazonECSManaged tag to the ASG
when it associates it with the capacity provider. Do not remove the
AmazonECSManaged tag from the ASG. If this tag is removed, Amazon
ECS is not able to manage the ASG when scaling your cluster. When using
capacity providers with ASGs, the autoscaling:CreateOrUpdateTags
permission is needed on the IAM user creating the capacity provider.
When you create an ASG capacity provider, you decide whether or not to
enable:

Managed scaling - when managed scaling is enabled, Amazon ECS manages the scale-in and scale-out actions of the ASG.
ECS automatically creates an AWS Auto Scaling ‘scaling plan’
with a target tracking scaling policy based on the target capacity
value you specify. Amazon ECS then associates this scaling
plan with the ASG. For each capacity provider with managed
scaling enabled, an Amazon ECS managed CloudWatch metric
with the prefix AWS/ECS/ManagedScaling is created along
with two CloudWatch alarms. The CloudWatch metrics and
alarms are used to monitor the compute instance capacity in the
ASGs and will trigger the ASG to scale in and scale out as
needed. Managed scaling is only supported in Regions that
AWS Auto Scaling is available in. When managed scaling is
disabled on the capacity provider, the consumer must manage
the ASGs.
Managed termination protection - when managed termination protection is enabled, Amazon ECS prevents Amazon EC2 instances that host containers and that are in an ASG from being terminated during a scale-in action. Managed termination protection can only be enabled if the ASG also has 'instance protection from scale in' enabled. When using managed termination protection, managed scaling must also be enabled on the capacity provider; otherwise managed termination protection will not work. (A sketch of an ASG-backed capacity provider follows this list.)
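Putting these options together, a minimal sketch of an ASG-backed capacity provider, assuming an Auto Scaling group named DemoAsg is defined elsewhere in the same template:

  DemoCapacityProvider:
    Type: AWS::ECS::CapacityProvider
    Properties:
      AutoScalingGroupProvider:
        AutoScalingGroupArn: !Ref DemoAsg      # name or ARN of the Auto Scaling group
        ManagedScaling:
          Status: ENABLED
          TargetCapacity: 90                   # keep the ASG at roughly 90% utilization
        ManagedTerminationProtection: ENABLED  # requires 'instance protection from scale in' on the ASG
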

Capacity Provider Strategy


With EC2 instances, the capacity provider strategy designates how many Tasks, at a minimum, to run on the capacity provider associated with a specific ASG. The strategy also designates the relative percentage of the total number of launched Tasks that should use that ASG. A capacity provider strategy has a base and weight
specified for a capacity provider. The base value sets how many Tasks, at a
minimum, to run on the specified capacity provider. The base value is not
applicable to Services.
The weight value sets the relative percentage of the total count of
launched tasks that use the specified capacity provider. A capacity provider
strategy can be used with one or more capacity providers. However, each
capacity provider strategy can be applied to only one launch type. The
capacity provider strategy or launch type configured by the Service or the
Task over-rides this Cluster-level setting.
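As a sketch, a strategy fragment (usable in a Service definition or when running a Task) that splits Tasks across two hypothetical ASG-backed capacity providers might read:

      CapacityProviderStrategy:
        - CapacityProvider: demo-ondemand-provider   # illustrative provider names
          Base: 2                                    # at least 2 Tasks always run here
          Weight: 1
        - CapacityProvider: demo-spot-provider
          Weight: 3                                  # remaining Tasks split 1:3 toward this provider

With this strategy, a desired count of 10 Tasks places the 2 base Tasks plus roughly 2 more on the first provider, and roughly 6 on the second.
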

Auto Scaling Service-Linked IAM Role


A service-linked role is a unique type of IAM role that is ‘linked’ directly
to an AWS service. Service-linked roles are predefined by the AWS service
and include all the permissions that the AWS service requires to call other
AWS services. A service-linked IAM role is required for (a cluster in) ECS to
use the Auto Scaling service. Likewise, for each AWS service that your ECS solution uses, a service-specific service-linked IAM role is required.

ECS Cluster Lifecycle


All resources in the cloud are temporary. An ECS cluster transitions
through 5 states during its lifecycle:

ACTIVE – the cluster is ready to accept a task and can register compute instances with the cluster.
PROVISIONING – the cluster is creating a capacity provider
and resources used by a Service/Task.
DEPROVISIONING– the cluster is deleting a capacity provider
and resources used by a Service/Task.
FAILED - the cluster has failed to create a capacity provider and/or the resources used by a Service.
INACTIVE – the cluster has been deleted.

CloudWatch Containers Insights


Turning on and off CloudWatch Containers Insights is an AWS Account
setting, as well as an ECS Cluster configuration parameter. If set at the AWS Account level, then the setting is applied to all ECS clusters created by that account. This setting applies to all containers listed in every Task of every
Service that is deployed into the cluster.
CloudWatch Containers Insights collects, aggregates, and summarizes
metrics and logs from the running containers – during a specified window of
time. Network metrics are not available for containers that use the bridge
network mode. CloudWatch Containers Insights is provided at an additional
cost to the consumer.

Linux Container Namespaces on ECS


Container network connectivity plays a significant role in all use cases
supported by ECS. Before diving into the configuration parameters of the
ECS Service, Task, and Container, a review of Docker container network
connectivity is helpful. And, to understand Docker container network
connectivity a review of the Linux operating system internals is beneficial.
Docker’s origin is Linux not Windows, and that means that namespaces play
a major role in ECS.
The Docker runtime environment can work with the operating system on a
bare metal host, and the runtime environment can work with the operating
system of a virtual machine (as is the case with both ECS launch types). In
both contexts, multiple Docker containers can run at the same time in the
same Docker runtime environment. In addition, multiple instances of the same Docker container can run at the same time in the same Docker runtime
environment. Whether the container runs in a private subnet or in a public
subnet, how that container uses network resources dynamically (as well as
securely) is not a simple matter.
Docker’s runtime environment was invented on Linux. In Linux, how
processes are isolated and how the kernel’s network stack is partitioned are
the cornerstones of Docker runtime and Docker container network
connectivity. In Linux, a process is isolated to its own system environment
through the use of namespaces. A namespace is used to partition kernel resources such that one process (or a set of processes) uses one set of kernel resources while another process (or another set of processes) uses a
different set of kernel resources.
There are a variety of namespace types implemented in Linux:

Process ID – ‘A process is visible to other processes in its PID namespace, and to the processes in each direct ancestor PID
namespace going back to the root PID namespace. In this
context, "visible" means that one process can be the target of
operations by another process using system calls that specify a
process ID.’ (pid_namespaces(7) - Linux manual);
Network – ‘A network namespace is logically another copy of
the network stack, with its own routes, firewall rules, and
network devices. By default a process inherits its network
namespace from its parent. Initially all the processes share the
same default network namespace from the init process’.
(ip_netns(8) – Linux manual;
Mount – ‘Mount namespaces provide isolation of the list of
[filesystem] mount points seen by the processes in each
namespace instance. Thus, the processes in each of the mount
namespace instances will see distinct single-directory
hierarchies.’ (mount_namespaces(7) - Linux manual);
User ID – ‘User namespaces isolate security-related identifiers
and attributes, in particular, user IDs and group IDs, the root
directory, keys, and capabilities. A process's user and group
IDs can be different inside and outside a user namespace. In
particular, a process can have a normal unprivileged user ID
outside a user namespace while at the same time having a user
ID of 0 inside the namespace; in other words, the process has
full privileges for operations inside the user namespace but is
unprivileged for operations outside the namespace.’
(user_namespaces(7) - Linux manual);
Interprocess communication (IPC) – ‘IPC namespaces isolate
certain IPC resources, namely, System V IPC objects and
POSIX message queues. The common characteristic of these
IPC mechanisms is that IPC objects are identified by
mechanisms other than filesystem pathnames. Each IPC
namespace has its own set of System V IPC identifiers and its
own POSIX message queue filesystem. Objects created in an
IPC namespace are visible to all other processes that are
members of that namespace but are not visible to processes in
other IPC namespaces. ’ (ipc_namespaces(7) - Linux manual);
Control group (cgroup) – ‘Control groups, usually referred to as
cgroups, are a Linux kernel feature which allow processes to be
organized into hierarchical groups whose usage of various types
of resources can then be limited and monitored. The Cgroup
namespaces virtualize the view of a process's cgroups.’
(cgroup_namespaces(7) - Linux manual);
Unix Time Sharing (UTS) – ‘UTS namespaces provide isolation
of two system identifiers: the hostname and the NIS domain
name. Changes made to these identifiers are visible to all other
processes in the same UTS namespace but are not visible to
processes in other UTS namespaces. ’ (uts_namespaces(7) -
Linux manual); allows a single machine to appear to have
different host names and different domain names to different
processes.
Time – ‘allows processes to see different system times in a way
similar to the UTS namespace.’ (Wikipedia);

Isolating a process ensures that no process can interfere with any other
processes running on the same compute instance. However, there are use
cases where a process needs to communicate and interoperate with other
processes running on the same compute instance (e.g., PaaS; microservices)
as well as with other endpoints hosted on other compute instances located in
a private subnet or in a public subnet in the VPC or located somewhere in the
Internet.
To support this local network traffic between processes (a.k.a., Docker
containers) a network namespace - a copy of the kernel’s network stack - is
provisioned, with its own VPC routes, firewall rules (security groups), and
network devices (Elastic Network Interface (ENI)).
Docker containers have access to these Linux namespaces:

Process ID – can be set in a task definition using parameter pidMode;
Network – can be set in a Service definition, in a task definition,
and in a container definition;
IPC – can be set in a task definition using parameter ipcMode;
Mount – can be set in a task definition and its container
definitions;

And ECS Docker containers can use this additional namespace:

User ID – can be set in a container definition using parameter user (a task definition sketch showing these parameters follows this list);
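A minimal task definition sketch showing where these namespace-related parameters sit (resource names, the image URI, and the values chosen are illustrative):

  DemoTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: demo-task
      NetworkMode: awsvpc     # network namespace exposed through the Task's ENI
      PidMode: task           # containers in the Task share one PID namespace
      IpcMode: task           # containers in the Task share one IPC namespace
      ContainerDefinitions:
        - Name: demo-app
          Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/demo-app:latest   # illustrative image
          Memory: 512
          User: "1000"        # run as an unprivileged user ID inside the container
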

Docker Container Network Modes


Without getting too deep into either the ECS Service or ECS Task
artifacts, it is important to point out that a task definition is required to run a
Docker container in ECS. An ECS Service has a task definition as does an
ECS Task. A task definition has a number of Task level configuration
parameters as well as a list of 1-10 container definitions. When the task runs
all of its containers run on the same compute instance.
The container network mode is a Service level as well as a Task level setting. The network mode chosen at either the Service or Task level applies to all containers listed in their task definition.
By default, a Docker container is created with the ability to handle local
network traffic. And, when configured to do so, the container can also handle
external network traffic. For external network connectivity, the Docker
container was designed to use a pluggable network driver.
Docker provides 2 network drivers and supports 3 external network
connectivity settings:

none – the container has no external network connectivity and container port mappings cannot be specified in the task
definition. Compute instances do require external network
access to communicate with the Amazon ECS Service Endpoint
(i.e., the ECS backbone);
bridge – the default; Docker's built-in virtual network stack, which runs inside each container and supports container port mapping. Less performant than either the host or awsvpc drivers.
With the bridge network mode, containers on a compute
instance are connected to each other using the docker0 bridge.
Containers use this bridge to communicate with endpoints
outside of the host EC2 instance, using the primary Elastic
Network Interface (ENI) of the EC2 instance on which they are
running. Containers share and rely on the networking properties
of the EC2 instance’s primary ENI, including the firewall rules
(security group subscription) and IP addressing. In ECS, you
cannot address these containers with the IP address allocated by
Docker, nor can you enforce finely grained network ACLs and
firewall rules onto the docker0 bridge. Instead, containers are
addressable in your VPC by the combination of the IP address
of the primary ENI of the EC2 instance, and the host port to
which the container is mapped (either via static or dynamic port
mapping). Also, because a single ENI is shared by multiple
Docker containers, it is difficult to create and manage network
policies for each container.
host – bypasses the bridge virtual network and container port mappings and uses the EC2 instance's Elastic Network Interface (ENI) directly. When the host network driver is used, multiple instances of the same image cannot be run on the same compute instance when port mappings are used. The host mode offers higher
networking performance for containers because they use the
Amazon EC2 network stack instead of the virtualized network
stack provided by the docker0 bridge. With host mode, exposed
container ports are mapped directly to the corresponding host
port, so you cannot take advantage of dynamic host port
mappings.

awsvpc Network Mode


Initially, ECS containers only had use of the Docker network modes.
Fortunately, today Amazon ECS provides its Docker containers with a 3rd
pluggable network driver: awsvpc . Currently, only the Amazon ECS-
optimized AMI, other Amazon Linux variants with the ecs-init package, or
AWS Fargate infrastructure support the awsvpc network mode.
The awsvpc network mode gives the container the same external
networking properties as an EC2 instance with full networking features,
segmentation, and security controls in the VPC. However, containers that use
the awsvpc network mode are associated with an Elastic Network Interface
(ENI), not with an EC2 instance. In awsvpc mode an ENI is assigned at the
Task level, and its list of containers share that ENI. By default, containers are
able to communicate with each other over their shared ENI. By convention,
an ENI is a virtual network interface that is attached to an EC2 instance in the
VPC – but, with ECS the ENI is attached to the Task which its containers
share and is not to be mistaken for the ENI associated with the EC2 instance
that hosts the Task’s container(s).
Each container (listed in the task definition) is associated with the Task’s
ENI and is assigned the ENI’s internal DNS hostname and primary private
IPv4 address. This primary IP address is unique and routable within the VPC
used by the cluster. ECS populates the hostname of a container using an
Amazon-provided (internal) DNS hostname when both the
enableDnsHostnames and enableDnsSupport options are enabled on your
VPC. If these options are not enabled, the DNS hostname of the task will be a
random hostname.
The container sends and receives network traffic on the Task ENI in the
same way that Amazon EC2 instances do with their primary network
interfaces. All containers listed in a task are addressable by the private IP
address of the Task’s ENI or by their internal DNS hostname, and (when
placed on the same host) the Task’s containers can communicate with each
other over the localhost interface. When a task definition has 1 and only 1
container that container is identifiable by the private IP address and host
name of the task’s ENI.
You must specify a network mode when the ECS Service is created or
when the ECS Task is run (on a cluster that uses a VPC that includes 2 or
more subnets in different AZs). On demand, Amazon ECS creates the ENI
and attaches it to the Task and associates 1 or more security groups with that
ENI. Associating security group rules with the Task allows you to restrict the
ports and IP addresses from which its container(s) accepts network traffic.
Enforcing such security group rules greatly reduces the surface area of attack
for your compute instances and for Docker containers. AWS does not
recommend that you manually create and attach the ENI to a Task. When the
Task stops or if the Service is scaled down, the ENI is detached from the
Task and deleted. The ENIs that ECS creates and attaches to a Task cannot be
detached manually or modified by an AWS account. To release the ENI, stop
the Task.
A container that uses awsvpc network mode is attachable as an ‘IP’ target
to an Application Load Balancer (ALB) or Network Load Balancer (NLB).
An ECS Service with containers that use the awsvpc network mode can only
support ALB or NLB.
With the awsvpc network mode, you can run multiple copies of the
container on the same instance using the same container port without needing
to do any port mapping, without needing to worry about port conflicts or
perform port translations, simplifying the application architecture. With the
awsvpc network mode, exposed container ports are mapped directly to the
attached ENI port, so you cannot take advantage of dynamic host port
mappings.
The awsvpc network mode offers higher networking performance for
containers because they use the Amazon EC2 network stack instead of
contending for bandwidth on a shared virtualized network stack provided by
the bridge driver. Container networking with the awsvpc driver provides greater security and monitoring for the containers than does host or bridge.
Because each container shares its Task’s ENI, the awsvpc driver network
policy is enforced using ACLs and firewall rules (security groups), and the
IPv4 address range is enforced via VPC subnet CIDRs. And, given an ENI
per Task, you can use VPC Flow Logs to monitor the traffic to and from the
Task container(s).
For ECS Tasks and Services that have containers that use the awsvpc
network mode ECS needs a service-linked IAM role to provide ECS with the
permissions to make calls to other AWS services on behalf of containers.
This service-linked IAM role is created automatically when the cluster is created, or when an ECS Service is updated, using the ECS Console.
The awsvpc network mode does not provide ENIs with public IP
addresses for containers that use the EC2 launch type. To access the internet,
a container that uses the EC2 launch type must be launched in a private
subnet that is configured to use a NAT gateway. Inbound network access to
the container must be from within the VPC using the container ENI’s private
IP address or routed through a load balancer located within the VPC.
Containers launched within public subnets do not have outbound network
access.
At this time, on ECS, the Linux-based Docker container network
connectivity capabilities are not fully available with Windows-based Docker
containers running on ECS. Windows containers use a network mode called NAT. When you register a task definition with Windows containers, you
must specify a network mode. Container network connectivity of Linux-
based EC2 compute instances and Fargate compute instances are nearly the
same when the Docker container they host is configured to use the awsvpc
network mode (awsvpc is one of four network modes that a Docker container
can use with ECS).

Configuring the Container Network Mode


In ECS, the network mode used by a Docker container can be configured
in the Service, in the Task, as well as in the Container, definitions:

Service definition – the networkConfiguration parameter is used to set the network mode that is used by all containers listed in the Service's task definition.
Task definition – the networkMode parameter is used to set the
network mode that is used by all containers listed in the Task’s
task definition.
Container definition – the container definition's disableNetworking parameter can be used to disable networking for the individual container.

The configured network mode is influenced by the launch type, and container type, identified in the Service and Task definitions. When the
container uses the Fargate launch type, the awsvpc network mode is required
(and the ENI is fully managed by Fargate). When the container uses the EC2
launch type, the allowable network mode depends on the underlying EC2
instance's operating system. If Linux, then none, bridge, host and awsvpc
network modes can be used. If Windows, only the NAT mode is allowed.
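As a sketch, when the task definition declares the awsvpc network mode, the Service supplies the subnets, security groups, and public IP assignment that the Task ENIs use (names are illustrative and carried over from the earlier VPC sketch):

      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets:
            - !Ref PrivateSubnetA        # private subnets in different AZs
            - !Ref PrivateSubnetB
          SecurityGroups:
            - !Ref TaskSecurityGroup     # assumed security group for the Task ENI
          AssignPublicIp: DISABLED       # keep the Task ENI private
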

Linux EC2 Instance Trunking


Minimizing the number of compute instances used by an ECS cluster
keeps the cluster small. The smaller the ECS cluster the greater its savings
potential. The more containers that can be run on a compute instance the
lower the cost of the compute instance per container. The key to increasing
savings potential is network interface density. As network interface density
increases on the compute instance so does the savings potential. The ECS
cluster’s savings potential is capped by the network interface density that it
achieves.
When Fargate is the launch type used, each Task gets its own compute instance with its own ENI – the cluster with Fargate instances has a 1-to-1 relationship between Task and ENI. That constancy of network interface density caps the savings potential.
When the EC2 is the launch type used, the EC2 instance is limited by the
count of network interfaces that the chosen EC2 type supports. For each EC2
instance, its primary private IP address counts as one network interface, as
does each ENI attached to the EC2 compute instance. The count of ENIs that
an EC2 type can support is a form of network interface density.
When each Task gets its own ENI, the count of Tasks that can be placed
on a type of EC2 is limited by the count of ENIs that the EC2 type supports.
That limit in Tasks that the ECS cluster supports caps the savings potential.
To go past the ECS cluster’s ceiling in network interface density (and
therefore Task density) AWS has made available EC2 types that can exceed
their default network interface density when either the AWS account or the
EC2 instance opt-in to the awsvpcTrunking account setting. That AWS
account must also have the AWSServiceRoleForECS service-linked IAM
role for ECS. The account setting only affects the creation of new EC2
instances after opting into awsvpcTrunking.
Amazon ECS allows you to raise network interface density limits (beyond
the EC2 type’s default limit) with a technique called trunking. To take
advantage of trunking, the containers must use the awsvpc network mode.
Only Linux variants of the Amazon ECS-optimized AMI, or other Amazon
Linux variants can support trunking.
Trunking is a technique for multiplexing data over a shared
communication link. Trunking offers you the option of more efficient
network interface packing per instance, potentially resulting in smaller
clusters and associated savings. AWS purports that 5 to 17 times as many Tasks can be run on an EC2 type with enhanced network interface density. If the Task containers create high CPU or memory workloads, then they might not benefit from trunking.
Amazon ECS assigns the primary private IP to the EC2 instance and
creates and attaches a "trunk" network interface (a special type of ENI) to the
EC2 instance. The ‘trunk’ has its own private IP address and counts as 1
network interface on the EC2 instance. The “trunk” belongs to the same
subnet in which the instance’s primary network interface originates. This
reduces the available IP addresses and potentially the ability to horizontally
scale out other EC2 instances sharing that same subnet.
The trunk network interface is fully managed by Amazon ECS. The trunk ENI is deleted when the EC2 instance is terminated or when the EC2 instance is deregistered from the ECS cluster.

ECS Service Definition


Once the cluster’s VPC is provisioned and the cluster is configured, you
can provision the ECS Services and Tasks (and their Docker containers) that
support your use case(s). The ECS Service has Service-level configuration
parameters and has 1 and only 1 task definition that in turn has a list of 1-10
container definitions. The ECS Service is an execution environment shared by 1 or more containers. The ECS Service determines which launch type is
used by its containers, and therefore which launch type instance the cluster
contains.
The ECS Service has a task definition that it uses to launch containers. A
task definition is required to run Docker containers on ECS. When the task
runs all of its containers run on the same compute instance.
Having more than 1 container per task definition incurs potential losses.
When a Task has more than 1 container:

The containers share the same namespaces as well as the same attack surface.
You have lost the ability to horizontally scale those containers
independently of each other.
You have lost the ability to vertically scale those containers
independently of each other.

And, when more than 1 Task is run on the same compute instance:

You have lost the ability to horizontally scale those Tasks independently of each other.
You have lost the ability to vertically scale those Tasks
independently of each other.
Unless over-ridden by parameters in the task definition or the container
definition, the parameters configured in the service definition apply to all
compute instances and to all Docker containers that the Service creates and
launches in the cluster.
The maximum number of running Services per cluster is 1,000. The
maximum number of compute instances (a.k.a., container instances) per cluster is 2,000. An individual compute instance can host the Docker
containers of multiple Tasks and Services, at the same time. The maximum
number of Tasks using the EC2 launch type per Service per cluster is 1,000.
And lastly, the maximum number of Tasks using the Fargate launch type, per
Region, is 100.
Provisioning and running an ECS Service (its containers and compute
instances) on a cluster is called a deployment. Not all use cases need Services
that are deployed. Only those use cases involving long running containers
ought to define a Service and specify how the Service is deployed in the
cluster. The ECS use cases suited to ECS Services are:

Application Migration to AWS;
Hybrid Applications;
Microservices;
Software-as-a-Service.

Multiple Service Tasks (with their containers) can run on the same
compute instance at the same time. The ECS Service also has the ability to define rules and constraints that govern the placement of a Task on a compute instance provisioned in the cluster. ECS has recently introduced
capacity providers and capacity provider strategies to supersede placement
rules and constraints.
The ECS Service definition has these parameters:

clusterARN – the ARN of the cluster that hosts the ECS Service.
The same named ECS Service can run simultaneously on 1-N
different clusters.
serviceARN – the Amazon Resource Name (ARN) that identifies
the ECS Service.
serviceName – the name of the ECS Service (that must be
unique per cluster).
taskDefinition – the task definition (a.k.a., the task) used by the
ECS Service to launch the Docker container(s) listed in the task
definition. The task definition contains the list of 1-10 container
definitions, as well as other Task level configuration parameters,
some of which override certain Service parameter settings.
When the task runs all of its containers run on the same compute
instance.
launchType – the compute instance launch type on which the
ECS Service’s containers will be running. The default is EC2.
All containers launched by the ECS Service run on the same
launch type and run on the same compute instance. The ECS
Task also has a parameter that specifies the launch type used by its containers, therefore over-riding the Service's launchType parameter.
platformVersion – specifies the Fargate version that provides the
Fargate launch type instances used by the task’s container.
desiredCount – the desired count of Tasks (instantiations of the task definition) that are kept running in the cluster for the ECS Service.
capacityProviderStrategy - the capacity provider strategy
associated with 1 or more capacity providers. A capacity
provider strategy gives you control over how a Task uses one or
more capacity providers. When a Task is run, or when a Service
is created, a capacity provider strategy is specified. A capacity
provider strategy consists of one or more capacity providers
with an optional base and weight specified for each provider.
With EC2 instances, the capacity provider strategy designates
how many hosts, at a minimum, to run on a specific Auto
Scaling Group (ASG) supported by the ECS Cluster. The
strategy also designates the relative percentage of the total
number of launched instances that should use that ASG.
loadBalancers – the list of Elastic Load Balancing (ELB) objects that support the containers launched by the ECS Service. Can be used with both launch types.
roleARN – The ARN of the IAM Role that allows the Amazon
ECS Container Agent to register compute instances with an
Elastic Load Balancing (ELB).
serviceRegistries – the list of Service Discovery Registries
assigned to the Service.
networkConfiguration – the network mode that all containers
listed in the task definition will use. The container’s network
mode can be set to none, bridge, host, or awsvpc. Alternatively,
the task definition can specify the network mode used by all of
its containers.
placementStrategy – the placement strategy that determines how
containers are placed on an EC2 launch type. Not used by the
Fargate launch type.
placementConstraints – a list of constraints applied when a Task
(and its container(s)) is placed on an EC2 launch type. Not used
by the Fargate launch type because each Task gets its own
compute instance. You can specify up to 10 constraints per Task
(including constraints in the task definition and those specified
at runtime).
schedulingStrategy – there are two ECS Service scheduling
strategies: REPLICA or DAEMON.
healthCheckGracePeriodSeconds – the count of seconds that the
ECS Service Scheduler ignores unhealthy Elastic Load
Balancing (ELB) target health checks after an ECS Task has
first started.
tags – the metadata you append to resources (used by the ECS Service) to help you categorize and organize resources. The maximum number of tags per resource is 50.
propagateTags – specifies whether tags from the ECS Service
or the task definitions are propagated to the Tasks.
enableECSManagedTags – enables ECS managed tags for the
tasks in the service.

Though the cluster can configure its default capacity provider strategy, the ECS Service has a capacity provider strategy parameter that over-rides the cluster's default. In addition, the ECS Service supports load balancing - the
balanced distribution of HTTP (or TCP) requests across containers, compute
instances, and compute instance ports – as well as service discovery.
An ECS Service can be created, listed, described, deleted, as well as
updated (i.e., re-deployed). An ECS Service can be deleted if it has no
running Tasks and the desired Task count is zero. An ECS Service is updated
when any combination of these changes are made to the ECS Service’s
definition:

task definition;
Fargate platform version;
deployment configuration;
desired count of Tasks;

Like all things configurable in AWS, the ECS Service parameters and their assigned values are captured in a JSON- or YAML-formatted document. Best practice is to declare the ECS Service definition in an AWS CloudFormation template, and to manage that document like all files that participate in your CI/CD workflow.
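A minimal Service sketch in that style, reusing the hypothetical cluster, task definition, subnet, and security group resources from the earlier sketches:

  DemoService:
    Type: AWS::ECS::Service
    Properties:
      ServiceName: demo-service
      Cluster: !Ref DemoCluster
      TaskDefinition: !Ref DemoTaskDefinition
      DesiredCount: 3                     # desired number of running Tasks
      LaunchType: FARGATE                 # or specify a CapacityProviderStrategy instead
      PlatformVersion: LATEST
      SchedulingStrategy: REPLICA
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets: [!Ref PrivateSubnetA, !Ref PrivateSubnetB]
          SecurityGroups: [!Ref TaskSecurityGroup]
          AssignPublicIp: DISABLED
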
At runtime, additional ECS Service properties can be perceived in the
ECS Console:

deployments – the current state of deployments of the ECS Service on the cluster.
createdAt – the UNIX timestamp for when the ECS Service was
created.
createdBy – the principal that created the ECS Service.
status – the status (ACTIVE, DRAINING, INACTIVE) of the
ECS Service.
runningCount – the count of containers in the cluster that are in
the RUNNING state.
pendingCount – the count of containers in the cluster that are in
the PENDING state.
events – the event stream of the Service.

ECS Service Lifecycle


All resources in the cloud are temporary. The ECS Service transitions through 3 states during its lifecycle:

ACTIVE – the Service's Task is running on a cluster.
DRAINING – the Service is deleted, but its Task is still running and requires cleanup; the Service moves from ACTIVE to DRAINING.
INACTIVE – after all containers in the Task are either
STOPPING or STOPPED, the Service moves from DRAINING
to INACTIVE.

IAM Service-Linked Roles for ECS Service


The ECS Service requires service-linked IAM roles when its container(s)
uses the awsvpc network mode, or if the ECS Service is configured to:

use ECS Service Discovery;
use an External deployment controller;
use the Auto Scaling service;
use the Elastic Load Balancing (ELB) service, or
use Elastic Inference accelerators.

ECS Service Load Balancing


Amazon ECS defines load balancing and service discovery at the ECS Service level. The Elastic Load Balancing Service (ELB) is used by an ECS
Service to distribute network traffic evenly across its Docker containers. Both
ECS launch types are compatible with Amazon Elastic Load Balancing
(ELB).
ELB has 3 different types of load balancing and the ECS Service is compatible with all 3:

1. Application Load Balancing (ALB) - layer 7 of the OSI Model; used to route HTTP/HTTPS traffic.
2. Network Load Balancing (NLB) - layer 4 of the OSI Model;
used to route TCP traffic.
3. Classic Load Balancing (CLB) - layer 4 of the OSI Model; used
to route TCP traffic.

Though ECS supports all three types, AWS recommends Application Load Balancing. The typical containerized application runs in layer 7 of the
OSI Model, therefore Application Load Balancing (ALB) has the features
best suited to the ECS Service:

1. Supported by both Fargate and EC2 launch types.
2. Containers can use dynamic host port mapping (so that multiple
Docker containers from the same ECS Service are allowed per
compute instance).
3. Each ECS Service can serve traffic from multiple Application
Load Balancers and expose multiple load balanced ports by
specifying multiple target groups.
4. Support path-based routing and priority rules (so that multiple
ECS Services can use the same listener port on a single ALB).

ECS Services with Docker containers configured to use the awsvpc
network mode only support ALB and NLB. And, to use CodeDeploy with
Amazon ECS, the ECS Service must use either an ALB or an NLB. An ELB
is a complex construct, as is the Auto Scaling Policy.
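As a minimal sketch (the target group ARN, container name, and port are hypothetical placeholders), the load balancer configuration appears in the loadBalancers block of a CreateService request:

  {
    "serviceName": "web-service",
    "taskDefinition": "web-task:3",
    "desiredCount": 2,
    "loadBalancers": [
      {
        "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web-tg/abc123",
        "containerName": "web",
        "containerPort": 80
      }
    ]
  }

The containerName and containerPort must match a container and port mapping declared in the Service’s task definition.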

Amazon ECS Service Discovery


Microservices and containerized applications have to be able to respond
quickly to failure events as well as to constantly changing workloads.
Therefore they require auto scaling, load balancing, and service discovery.
Amazon ECS configures load balancing and service discovery at the ECS
Service level.
In Amazon ECS, service discovery involves the discovery of compute
instance IP addresses (both EC2 as well as Fargate instances) and the IPs and
port numbers used by the Docker containers.
Docker containers are loosely coupled and are allowed to specify their
own dependencies – clearly, these features of a distributed architecture
further complicate an already dynamic environment. Consequently, Docker
containers are forced to find the ever-changing endpoints that they need to
use. Given instance and container failure events, as well as the auto scaling of
compute instances, solving service discovery in such a dynamic environment
is not a simple matter. In Hadoop, the Zookeeper service and its agents are
used to provide service discovery within the Hadoop cluster. Fortunately,
ECS has an integrated service discovery that makes it easy for your
containerized services to discover and connect with each other.
In Amazon ECS, the consumer can configure service discovery for an
ECS Service that uses a load balancer. In that scenario, service discovery
traffic is always routed to the Docker containers and not to the load balancer.
For a given ECS Service, the consumer can enable service discovery by
using the ECS console, the AWS CLI, or by using the ECS API. Service
discovery can be enabled for an ECS Service only when the ECS Service is
created. Service discovery is automatically enabled for an ECS Service that is
created using the ECS console wizard.
A service discovery Service Registry has the following properties:

ContainerName – the container name value, already specified in
the task definition, to be used by the service discovery service.
If the ECS Service (or its task definition) specifies that the
Docker container use the bridge or host network mode, you
must specify a containerName and containerPort combination
from the task definition. If the ECS Service (or its task
definition) specifies that the Docker container use the awsvpc
network mode and a type SRV DNS record is used, you must
specify either a containerName and containerPort combination
or a port value, but not both.
ContainerPort - the port value, specified in the task definition,
to be used by the service discovery service. If the ECS Service
(or its task definition) specifies that the Docker container use the
bridge or host network mode, you must specify a containerName
and containerPort combination in the task definition. If the ECS
Service (or its task definition) specifies that the Docker
container use the awsvpc network mode and a type SRV DNS
record is used, you must specify either a containerName and
containerPort combination or a port value, but not both.
Port – the port value used if your service discovery service
specified an SRV record. This field may be used if both the
awsvpc network mode and SRV records are used.
RegistryArn – the Amazon Resource Name (ARN) of the
service registry.
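As an illustrative sketch (the registry ARN and names are hypothetical), the serviceRegistries block of a CreateService request for a container that uses the bridge or host network mode might look like this:

  {
    "serviceRegistries": [
      {
        "registryArn": "arn:aws:servicediscovery:us-east-1:111122223333:service/srv-0123456789abcdef",
        "containerName": "web",
        "containerPort": 8080
      }
    ]
  }

For a container that uses the awsvpc network mode with A records, only the registryArn would be supplied.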

Route 53 and AWS Cloud Map with ECS Service Discovery


ECS Service Discovery is built on top of Route 53 APIs and manages all
of the underlying API calls for the Docker containers. Service discovery
works by communicating with the Amazon Route 53 Service Registry and
Auto Naming APIs on behalf of the container. ECS creates and manages a
registry of service names using the Route 53 Auto Naming API. Names are
automatically mapped to a set of DNS records so you can refer to Docker
containers by an alias, and have this alias automatically resolve to the Docker
container’s endpoint at runtime.
The currently supported service registry is AWS Cloud Map. AWS Cloud
Map is a cloud resource discovery service. With Cloud Map, you can define
custom names for your ECS resources, and it maintains the updated location
of these dynamically changing resources. Cloud Map constantly monitors the
health of every IP-based compute instance and Docker container and
dynamically updates the location of each as it is added to or removed from the
cluster. This ensures that the Docker containers only discover the most up-to-
date location of their resources, increasing the availability of the application.
When the new ECS Service is configured to use Amazon ECS Service
Discovery the Docker containers listed in the Service’s task definition can
potentially use service discovery. However, for a Docker container to use
service discovery other parameters of the ECS Service have to be set
appropriately. An ECS Service that is able to use service discovery can use
either the Fargate launch type or the EC2 launch type. Therefore, when:

the AWS Fargate launch type instance is chosen, the
platformVersion Service parameter has to be set to use platform
version v1.1.0 or later.
the EC2 launch type is chosen, the networkConfiguration
Service parameter has to be set to awsvpc, bridge, or host. This
ECS Service parameter ensures that each Docker container
hosted by an EC2 instance is configured to use a compatible
networking mode. If the ECS Service specifies the use of the
awsvpc network mode, you can create any combination of A or
SRV records for each container. If you use SRV records, a port
is required. If the ECS Service specifies the use of the bridge or
host network mode, an SRV record is the only supported DNS
record type. Create an SRV record for each container. The SRV
record must specify a container name and container port
combination found in the ECS Service’s task definition.

To provide service discovery to the Docker containers listed in the
Service’s task definition, ECS uses Amazon Route 53 Service Registry and
Auto Naming APIs and AWS Cloud Map API actions to manage HTTP and
DNS namespaces on behalf of the ECS Service’s containers. DNS records for
a service discovery service can be queried within the VPC that the cluster
uses. When doing a DNS query on the container name, A records return a set
of IP addresses that correspond to the container. SRV records return a set of
IP addresses and ports per container. Customers using service discovery are
charged for Route 53 resources (i.e., each namespace that you create, and for
the lookup queries your containers make) and AWS Cloud Map discovery
API operations.
Amazon ECS Service Discovery depends upon these components:

Service discovery namespace – a logical group of service
discovery services that share the same domain name. A
namespace contains a specific set of compute instances and
Docker containers. A namespace specifies the domain name that
you want to route traffic to. The namespace is a logical
boundary within which Docker containers are able to discover
each other. Namespaces are roughly equivalent to hosted zones
in Route 53. (These service discovery namespaces are distinct
from the Linux network namespaces via which the network
stacks of containers are configured.)
Service discovery service – the service discovery service exists
within the service discovery namespace and has a service name
and DNS configuration for the namespace.

The service discovery service provides the following components:

Service registry – used to look up a Docker container via DNS
or the AWS Cloud Map API actions and returns 1-N available
endpoints that can be used to connect to the Docker container.
Service discovery instance – consists of the attributes of each
Docker container in the service directory.
Instance attributes – each Docker container has these custom
attributes that are configured for use by service discovery:
AWS_INSTANCE_IPV4 – For an A record, the IPv4
address that Route 53 returns in response to DNS
queries and AWS Cloud Map returns when
discovering instance details.
AWS_INSTANCE_PORT – The port value
associated with the service discovery service.
AVAILABILITY_ZONE – The Availability Zone
(AZ) into which the Docker container was launched.
For containers using the EC2 launch type, this is the
AZ in which the compute instance exists. For
containers using the Fargate launch type, this is the
AZ in which the Elastic Network Interface (ENI)
exists.
REGION – The Region in which the container exists.
ECS_SERVICE_NAME – The name of the Amazon
ECS service to which the container belongs.
ECS_CLUSTER_NAME – The name of the Amazon
ECS cluster to which the container belongs.
EC2_INSTANCE_ID – The ID of the compute
instance the Docker container was placed on. This
custom attribute is not added if the container uses the
Fargate launch type.
ECS_TASK_DEFINITION_FAMILY – The task
definition family that the Docker container is listed
within.
ECS_TASK_SET_EXTERNAL_ID – If the ECS
Service’s Docker container is created for an external
deployment and is associated with a service discovery
registry, then the ECS_TASK_SET_EXTERNAL_ID
attribute will contain the external ID of the task set.

Amazon ECS performs periodic container-level health checks. If an
endpoint (a.k.a., container port) does not pass the health check, it is removed
from DNS routing and marked as unhealthy. Only healthy containers are
returned by a service lookup. Container level health checks are provided at no
cost to the consumer. Conducting endpoint health checks is specified by
setting the HealthCheckCustomConfig parameter when creating the ECS
Service.
You can use Amazon ECS Service Discovery in all AWS regions where
Amazon ECS and Amazon Route 53 Auto Naming are available. These
include US East (N. Virginia), US East (Ohio), US West (Oregon), and EU
(Ireland) regions.
You can use Amazon ECS Service Discovery in all AWS regions where
Amazon ECS and AWS Cloud Map are available. AWS Cloud Map is
currently available in the following regions: US East (Virginia), US East
(Ohio), US West (N. California), US West (Oregon), Canada (Central),
Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Asia
Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), Asia
Pacific (Seoul), and Asia Pacific (Mumbai) Regions.

ECS Service Scheduling Strategy


There are two ECS Service scheduling strategies:

REPLICA – scheduling strategy that places and maintains the
desired number of containers across the cluster. By default,
containers are spread across Availability Zones in the Region.
The ECS Service’s placement strategy and constraints can be
used to customize container placement decisions.
DAEMON – scheduling strategy that deploys one container on
each active compute instance that meets all of the ECS Service’s
container placement constraints. The scheduling strategy
evaluates running containers and will stop those containers that
do not meet the placement constraints.

Service Placement Strategy and Constraints


To determine the EC2 instance on which the Task’s containers will be
launched, as well as when the Task (and its container(s)) is terminated to
support scaling-in the cluster, ECS uses a:

1. Task placement strategy, and
2. Task placement constraints

Task placement is not supported using the Fargate launch type. By default,
Fargate distributes the containers (of a REPLICA scheduled Service) across
multiple AZs in the Region (that contains the VPC subnets used by the
cluster).
When a Task uses the EC2 launch type, ECS has to determine (based on
the information contained in the task definition and its list of constraint
definitions) the EC2 instance on which to place each container of the task.
A task placement constraint is a rule that is considered during container
placement on the EC2 instance in the cluster.
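As a sketch of these Task level parameters (the attribute expression is a hypothetical example), a placement strategy and a placement constraint might be declared in a RunTask or CreateService request as follows:

  {
    "placementStrategy": [
      { "type": "spread", "field": "attribute:ecs.availability-zone" },
      { "type": "binpack", "field": "memory" }
    ],
    "placementConstraints": [
      { "type": "memberOf", "expression": "attribute:ecs.instance-type =~ t3.*" }
    ]
  }

Here containers are spread across Availability Zones, bin-packed by memory within each zone, and restricted to EC2 instances whose instance type matches the expression.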

EC2 Launch Type


The EC2 launch type is a compute engine for containers, available in both
ECS and Amazon Elastic Kubernetes Service (EKS). As the label indicates,
the EC2 launch type is an EC2 instance – the virtual machine (VM) of the
AWS platform. The EC2 launch type provides the greatest control over the
compute instances used by Tasks. A deep understanding of the AWS
platform, as well as its resources, and services, is a pre-requisite to exercising
that control wisely.
Though a VM allows you to ignore the bare metal machine, correctly
configuring an EC2 instance is not a simple task. There are many important
and inter-related design choices that have to be made:

1. EC2 Operating Systems


2. EC2 Instance’s Root Device
3. EC2’s Instance Store
4. Amazon Machine Images (AMIs)
5. AMI Maximum Transmission Unit (MTU)
6. AMI Root Volume
7. AMI Launch Permission Categories
8. S3-Backed (aka Instance Store-Backed) EC2 Instance Root
Device
9. EBS-Backed EC2 Instance Root Device
10. Golden Images
11. Moving AMIs Across Regions
12. Linux AMIs
13. Sources of AMIs
14. VM Import/Export
15. EC2 Instance Type Categories
16. EC2 Instance Intel Processor
17. The EC2’s Block Storage Features
18. Provisioned IOPS
19. EC2’s Block Storage Comparison
20. Ways to Address an EC2 instance
21. Public Domain Name System (DNS) Name Address
22. EC2 Private IP Address
23. How Internet Hosts Access EC2 Instances
24. EC2 Public IP Address
25. EC2 Elastic IP Address
26. Source/Destination Check Attribute
27. EC2 Scheduled Events
28. The EC2 Instance Key-Pair
29. Logging into the EC2 Instance
30. EC2 Instance Security Groups
31. Security Groups and Ping
32. Changing a Security Group
33. EC2 Instance Life-Cycle
34. EC2 Instance Bootstrapping at Launch
35. Modifying a Launched EC2 Instance
36. Changing the Instance Type or Size
37. Diagnose EC2 Instance Termination Events
38. Amazon EC2 Pricing Options
39. On-Demand Instance Pricing Option
40. Reserved Instance Pricing Option
41. Sub-Categories of Reserved Instance
42. Reserved Instance Marketplace
43. Spot Instance Pricing Option
44. AWS Tenancy Options
45. Shared Tenancy
46. Dedicated Host Tenancy
47. Dedicated Instances Tenancy
48. Dedicated Instance Auto Recovery
49. The Placement Group (Cluster Networking)
50. Placement Group with Enhanced Networking
51. Steps for Using an AWS EC2 Instance
52. Amazon Resource Name (ARN)
53. EC2 Instance Tag
54. EC2 Instance Metadata
55. Instance Identity Document
56. Managing & Monitoring EC2 Instances
57. Terminating an EC2 Instance
58. EC2 Penetration Testing
59. AWS Elastic IP Addresses (EIPs)
60. AWS Elastic Network Interfaces (ENIs)
61. ENI Attributes
62. How EC2 Instances in a VPC Communicate With Each Other
63. Elastic Block Storage (Amazon EBS)
64. EBS Use Cases
65. Magnetic Volumes
66. Magnetic Types
67. General-Purpose SSD
68. Provisioned IOPS SSD
69. EBS-Optimized EC2 Instance
70. EBS Volume Encryption
71. Durable EBS Snapshot
72. EBS Snapshot and RAID 0
73. EBS Snapshot Encryption
74. Moving EBS Volume Across AZs
75. EBS Volume Check
76. EBS Snapshot Best Practices

ECS supports EC2 instance AMIs that are based on a variety of operating
systems, such as Linux, CoreOS, Ubuntu, and Windows. Amazon highly
recommends using ECS-optimized AMIs. Regardless of the AMI chosen, to
reduce costs significantly AWS recommends the use of Spot Instances for
container processes that can be interrupted. You incur additional charges
when your container uses other AWS services or transfers data.
EC2 instances do not require external network access to communicate
with the ECS endpoint. EC2 instances do not require any inbound ports to be
open. However, to examine containers with Docker commands an SSH rule
must be added so that you can log into the EC2 instance. Best practice is to
refrain from allowing SSH access from all IP addresses (0.0.0.0/0).
Best practices when using the EC2 launch type, as needed:

Use the Auto Scaling service to manage EC2 compute instances
within the ECS Cluster.
Configure the VPC (used by the ECS cluster) with at least three
subnets located in different AZs and keep the compute instance
counts balanced across the AZs.
Configure the ECS Cluster with homogenous EC2 instances.

EC2 Compute Instance IAM role


Before you can launch an EC2 instance and register it with a Cluster, you
must create an IAM role for the compute instance to use when it is
launched. The Docker containers that are running on the EC2 compute
instances have access to all of the permissions that are supplied to the
compute instance IAM role through instance metadata. AWS advises
limiting the permissions of the compute instance IAM role by attaching the
AmazonEC2ContainerServiceforEC2Role policy to it.
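As a minimal sketch, the trust policy that allows EC2 instances to assume the compute instance IAM role (commonly named ecsInstanceRole) looks like the following; the AmazonEC2ContainerServiceforEC2Role managed policy is then attached to that role, and the role is passed to the instance through an instance profile:

  {
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": { "Service": "ec2.amazonaws.com" },
        "Action": "sts:AssumeRole"
      }
    ]
  }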

ECS-Optimized Amazon AMI


An ECS optimized Amazon AMI is pre-configured and tested by AWS
engineers. Metadata about the compute instance’s AMI can be obtained
programmatically. If the choice is the EC2 launch type – and not serverless
Fargate – the ECS optimized Amazon AMI is the easiest way to get started
with ECS and get your containers running. The ECS optimized Amazon
AMIs are also your best bet for minimizing risk.
The ECS-Optimized Amazon AMI comes prepackaged with the latest
versions of these ECS components:

Docker Client;
Docker daemon (dockerd), a.k.a., the Docker Engine;
shim (docker-containerd-shim);
runc (docker-runc);
Amazon ECS Container Agent (a.k.a., containerd (docker-containerd));
ecs-init Service;

Unless your Docker container requirements prohibit its use, AWS
recommends using an ECS optimized Amazon AMI.
As of March 2020, AWS offers these ECS-optimized AMIs:

Amazon ECS-optimized Amazon Linux 2 AMI –
Recommended for launching your Amazon ECS compute
instances in most cases.
Amazon ECS-optimized Amazon Linux 2 (arm64) AMI –
Recommended for launching your Amazon ECS compute
instances when using the Amazon EC2 A1 instance type, which
is powered by Arm-based AWS Graviton Processors.
Amazon ECS GPU-optimized AMI – Recommended for
launching your Amazon ECS compute instances when working
with GPU workloads.
Amazon ECS-optimized Amazon Linux AMI – This AMI is
based off of Amazon Linux 1. We recommend that you migrate
your workloads to the Amazon ECS-optimized Amazon Linux 2
AMI. Support for the Amazon ECS-optimized Amazon Linux
AMI ends no later than December 31, 2020.
Amazon ECS-optimized Windows 2019 Full AMI;
Amazon ECS-optimized Windows 2019 Core AMI;
Amazon ECS-optimized Windows 1909 Core AMI;
Amazon ECS-optimized Windows 2016 Full AMI;

By default, the Amazon Linux 2-based ECS-optimized AMIs ship with a
single 30-GiB root volume, whose size can be modified at launch time to
increase the available storage on your compute instance. This storage is used
for the operating system and for Docker images and metadata. Lastly, the
default filesystem for the Amazon ECS-optimized Amazon Linux 2 AMI is
ext4, and Docker uses the overlay2 storage driver.

ECS-Optimized AMI and AWS Systems Manager


The AWS Systems Manager is a fee-based service that gives you visibility
and control of your infrastructure on AWS. The Systems Manager has a
unified UI that allows the consumer to view operational data about various
AWS services and that enables the consumer to automate operational tasks
across your AWS resources and services at scale.
Systems Manager has a Run Command capability that allows the
consumer to securely and remotely manage the configuration of EC2
instances launched in ECS. The Run Command is a simple way of
performing administration tasks without logging on locally to the EC2
instance:

Install or uninstall packages;


Clean up Docker images;
View log files;
Perform file operations, etc.

To benefit from System Manager, three things must be done:

1. Launch the compute instance with the compute instance IAM
role (ecsInstanceRole);
2. Attach to that IAM role the AmazonSSMManagedInstanceCore
policy, which provides the minimum permissions needed to use
the Systems Manager APIs, and
3. Install the AWS Systems Manager Agent (SSM Agent) on the
ECS-optimized AMI compute instance (or on-premise bare
metal nodes, or VM). The SSM Agent makes it possible to
remotely update, manage, and configure the compute instance.

Bottlerocket
Where the compute instance that hosts the containers uses a general purpose
operating system (OS), updating the OS – package by package – is difficult
to automate in Amazon ECS and EKS. Amazon ECS and EKS now support Bottlerocket -
an open source Linux-based operating system built by AWS for running
Docker containers on both EC2 instances as well as bare metal hosts.
Bottlerocket is available as an AMI for EC2 instances. Bottlerocket includes
only the essential software needed to run Docker containers, and this
improves instance usage and reduces the compute instance’s attack surface
(compared to a general purpose OS).
Updates to the Bottlerocket OS are applied in a single step (not on a
package-by-package basis), and can be rolled back in a single step if need
be. This single-step approach increases container uptime and greatly simplifies the
management of OS upgrades to instances that have orchestrated Docker
containers running on them.

GPUs on Amazon ECS


ECS supports machine learning workloads that take advantage of GPUs
by provisioning p2, p3, g3 and g4 EC2 instance types that provide access to
NVIDIA GPUs. These GPU-optimized AMIs come ready with pre-configured
NVIDIA kernel drivers and a Docker GPU container runtime.
An ECS cluster can contain a mix of GPU and non-GPU compute
instances. The count of GPUs needed by the Docker containers is designated
in the task definition. If GPU requirements are not specified in the task
definition, the task will use the default Docker runtime. The number of GPUs
is considered during placement of the container. ECS will place the container
on GPU enabled compute instances and for optimal performance pins the
physical GPUs to the appropriate container. The number of GPUs reserved
for all containers in a task definition should not exceed the number of
available GPUs on the instance the container is launched on.
The count of GPUs needed can also be designated in the container
definition. When this number of GPUs is set in the container definition ECS
sets the container runtime to be the NVIDIA container runtime. The NVIDIA
container runtime requires certain environment variables to be set in the
container in order to work. ECS sets the NVIDIA_VISIBLE_DEVICES
environment variable value to be a list of the GPU device IDs that ECS
assigns to the container.
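As a sketch (the container name and image URI are hypothetical), the GPU count is requested in the container definition through the resourceRequirements parameter:

  {
    "containerDefinitions": [
      {
        "name": "gpu-worker",
        "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/gpu-worker:latest",
        "memory": 4096,
        "resourceRequirements": [
          { "type": "GPU", "value": "1" }
        ]
      }
    ]
  }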

AWS Fargate Launch Type

The AWS Fargate launch type is a compute engine for containers,
available in both ECS and Amazon Elastic Kubernetes Service (EKS). The
serverless Fargate launch type is the easiest way to get started with ECS and
to get your containers running quickly. With AWS Fargate, you no longer
have to choose EC2 types, decide when to horizontally scale the cluster,
when to optimize cluster packing, or when to provision, configure, or scale
clusters of virtual machines to run Docker containers.
AWS Fargate pricing is calculated based on the vCPU and memory
resources used from the time you start to download your container image
until the ECS Task or EKS Pod terminates, rounded up to the nearest second.
The consumer incurs additional charges when their containers use other AWS
services or transfer data. At this time, Fargate for ECS is not available in all
Regions.
When a consumer uses the AWS Fargate launch type they have the least
control over the compute instances used by containers. The AWS Fargate
launch type removes the need for the consumer to configure, patch, manage,
secure, and auto scale, EC2 instances. Fargate allows the consumer to
sidestep infrastructure duties and focus instead on the containerized
application.
When you run ECS Services and Tasks with the Fargate launch type (and
Fargate capacity provider), you package your application in containers,
specify the CPU and memory requirements, define networking and IAM
policies, and launch the application. Each Task you run on Fargate has its
own isolation boundary and does not share the underlying operating system
kernel, CPU resources, memory resources, or Elastic Network Interface
(ENI) with another Task.
Unlike an EC2 launch type instance (which the consumer pays for whether
or not a container uses it), the consumer only pays for the vCPU and memory
resources requested by the Task (or, in EKS, by the Pod), metered as
described above from the time the container image download starts until the
ECS Task (or EKS Pod) terminates, rounded up to the nearest second.
These applications are automatically installed on the AWS Fargate ECS
launch type:

Docker daemon (dockerd), a.k.a., the Docker Engine;
shim (docker-containerd-shim);
runc (docker-runc);
Amazon ECS Container Agent (a.k.a., containerd (docker-containerd));
ecs-init Service;

AWS does not support SSH access to the Fargate instance.

Fargate Spot
Like EC2 Spot instances, you can use Fargate Spot instances to run
containers that are interruption tolerant. The Fargate Spot rate is
discounted compared to the Fargate price. The consumer incurs additional
charges when their containers use other AWS services or transfer data.
Fargate Spot runs Docker containers on spare compute capacity. When
AWS needs the capacity back the container is interrupted with a 2-minute
warning. When containers using the Fargate and Fargate Spot instances are
stopped a task state change event is sent by ECS to Amazon EventBridge.
The stopped reason describes the cause.

Amazon Savings Plans for ECS


Savings Plans are a pricing model that offers significant savings on AWS
usage such as EC2 instance usage as well as Fargate usage. The consumer
commits to a consistent amount of usage, in USD per hour, for a term of 1 or
3 years, and receives a lower price for that usage. The consumer chooses
between All Upfront, Partial Upfront, or No Upfront payment options. Savings
Plans provide a discount of up to 66 percent.

ECS and AWS CodePipeline


Continuous Delivery and Continuous Integration (CD/CI) are best practices.
On AWS, the CodePipeline service is used to model, visualize, and automate
the building, testing, and deployment steps required to continuously release
your containerized application. You can use the CodePipeline console, the
AWS CLI, the AWS SDKS, or any combination of these, to create and
manage your pipelines. A pipeline – the building, testing, and deployment
steps – can be as simple or as complex as the consumer’s business process
model (BPM) dictates.
In continuous delivery, every change to the container is released in an
automated manner based on the release model that you define. Continuous
integration is a software development best practice where the team uses a
version control system (e.g., GitHub) and frequently integrates their
individual work into a shared location, such as a version branch. Each change
to the software triggers an automated build of the software and testing of the
build image to determine if any integration errors are present.
CodePipeline automatically detects changes to your Docker images and
task definitions. Those changes are built and any tests are run on the built
code. After the tests complete successfully, the built container is
deployed to staging servers for integration testing and load testing. After
those tests complete successfully and the built container is manually
approved, CodePipeline deploys the approved containers into production.
You can create a pipeline in AWS CodePipeline that deploys Docker
containers by using a blue/green deployment with ECS. In a blue/green
deployment you launch the new version of the container alongside the old
version of the container and test the new version before rerouting incoming
requests (to the port on the new container). You can also monitor the
container deployment and rollback containers if any problems arise.
The pipeline can be configured to automatically detect changes to Docker
images in a repository (such as Amazon ECR) and perform a CodeDeploy
blue-green deployment. How AWS CodeDeploy works with ECS is
described below. The pipeline can also be configured to automatically detect
changes to a task definition JSON file (pushed to a code control system, and
whose ARN is specified in a CodeDeploy AppSpec file) and perform a
CodeDeploy blue-green deployment.

ECS Service Deployment


Placing and running an ECS Service on a cluster is called a deployment.
Deployment is configured using these ECS Service definition parameters:

deploymentController – used by the ECS Service to control a
deployment.
deploymentConfiguration – controls how many containers are
run during deployment as well as the ordering of stopping and
starting of containers.
taskSets – the Service’s task definition. The ‘task set’ term does
not represent a set of ECS Tasks.
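As an illustrative sketch (the values are hypothetical), these deployment parameters might appear in a CreateService request as:

  {
    "deploymentController": { "type": "CODE_DEPLOY" },
    "deploymentConfiguration": {
      "maximumPercent": 200,
      "minimumHealthyPercent": 100
    }
  }

The maximumPercent and minimumHealthyPercent values bound how many containers may run, relative to the desired count, while a deployment is in progress.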

Amazon ECS supports two types of ECS Service deployment controllers:

1. AWS CodeDeploy
2. EXTERNAL

An EXTERNAL deployment is based on the use of non-AWS resources.
As such, coverage of that deployment type is outside the scope of this
manuscript.

ECS Service and AWS CodeDeploy


AWS CodeDeploy is a deployment service that automates application
deployments to EC2 instances (and to on-premise instances), automates ECS
Service deployments, as well as serverless Lambda function deployments.
With CodeDeploy you can automatically or manually stop and rollback
deployments if there are errors. CodeDeploy can deploy application content
that is stored in Amazon S3, GitHub repositories, or Bitbucket repositories.
The consumer can launch and track the status of their deployments through
the CodeDeploy console or the AWS CLI.
CodeDeploy deploys the revised Docker containers of an ECS Service’s
task definition as a ‘task set.’ The term ‘task set’ originates with
CodeDeploy, and unfortunately clashes directly with ECS’s use of the word
task. When CodeDeploy is used with ECS a task set is actually the Service’s
revised task definition and is not a set of ECS Tasks. Remember, an ECS
Service (like an ECS Task) has 1 and only 1 task definition, and the task
definition has a list of 1-10 container definitions. Therefore, it is incorrect to
imagine that CodeDeploy is deploying a set of ECS Tasks. It is important to
be clear that CodeDeploy is deploying the Docker container(s) listed in the
ECS Service’s task definition and is not deploying a set of ECS Tasks. From
the viewpoint of CodeDeploy, provisioning the compute instance and the
Docker container(s) that run on it is a unit of work that CodeDeploy calls a
task.
CodeDeploy is used to deploy the original ECS Service as well as for
deploying revisions to the ECS Service. CodeDeploy performs a blue/green
deployment when installing a new version of a Service – CodeDeploy refers
to the Service’s revised task definition as the new ‘replacement task set.’
CodeDeploy automatically reroutes traffic from the old ‘primary task set’
(a.k.a., the Task that has the PRIMARY status) to the ‘replacement task set.’
The ‘primary task set’ is terminated by CodeDeploy after a successful
deployment of the ‘replacement task set.’ The consumer can manage the way
in which network traffic is shifted to the ‘replacement task set’ during
deployment by choosing either a canary or linear (network traffic shift)
through the deployment configuration.

CodeDeploy IAM Service-Linked Role


An IAM service-linked role can be created in the IAM console, as well as
by using the AWS CLI or the IAM APIs. This IAM service-linked role gives
the AWS CodeDeploy service the access that it must be granted to deploy an
ECS Service. To provide CodeDeploy with full access to ECS to support
ECS Services, attach the AWSCodeDeployRoleForECS policy to the IAM
service-linked role. This policy provides permissions to:

Read, update, and delete ECS task sets;


Update ELB target groups, listeners, and rules;
Access revision files in S3 buckets;
Retrieve information about CloudWatch alarms;
Publish information to Amazon SNS topics;
Invoke AWS Lambda functions;

To provide the CodeDeploy IAM service-linked role with limited access
to ECS, attach the AWSCodeDeployRoleForECSLimited policy to it. This
policy provides permissions to:

Read, update, and delete ECS task sets;


Retrieve information about CloudWatch alarms;
Publish information to Amazon SNS topics;

CodeDeploy Pre-requisites
Given an ECS Cluster and an ECS Service registered with that cluster and
which has its deployment controller set to CodeDeploy, there are a number of
other pre-requisites that the ECS Service must satisfy before CodeDeploy can
be used:

Elastic Load Balancing;


One or two Listeners;
Two ECS target groups;
ECS task definition;
A Docker container name;
A port for the replacement task set;

The ECS Service must use either an Application Load Balancer or a
Network Load Balancer. Best practice is to use an Application Load Balancer
(wherever possible) since it supports dynamic port mapping and path-based
(wherever possible) since it supports dynamic port mapping and path-based
routing and priority rules.
A listener is used by the load balancer to direct network traffic to the two
ECS target groups. CodeDeploy requires one production listener. Best
practice is to specify a second test listener that directs network traffic to the
‘replacement task set’ while the validation tests are run. If the ECS console is
used to create the ECS service then the two listeners are automatically created
for the consumer.
A target group is used by CodeDeploy to route network traffic to a
‘registered target.’ An ECS Service deployment requires two target groups:
one for the ‘primary task set’ (the task set with the PRIMARY status) and
one for the ‘replacement task set.’ During deployment, CodeDeploy
instantiates the ‘replacement task set’ and reroutes network traffic away from
the containers in the ‘primary task set’ to the containers in the ‘replacement
task set.’

CodeDeploy App Spec File


In order for an application (containerized or not) to use CodeDeploy, first
a ‘CodeDeploy Application’ has to be created. The information defining that
‘CodeDeploy Application’ is held in a CodeDeploy AppSpec file. The
AppSpec file can be in either YAML or JSON format and can be created
using the ECS console as well as by using the AWS CLI or AWS SDKs.
Second, a deployment group has to be specified in the App Spec file. A
CodeDeploy deployment group has:

a group name,
the names of the ECS Cluster and ECS Service (whose
deployment controller is set to CodeDeploy),
the load balancer,
the production and the test listeners,
the two target groups,
deployment settings (e.g., when to reroute network traffic to the
replacement task set; when to terminate the primary task set),
triggers, alarms, and rollback behavior.

The CodeDeploy console, the AWS CLI, and the CodeDeploy API can be
used to view the deployment group associated with the application.
For an ECS Service to be deployed using CodeDeploy, specific
information about the ECS Service has to be entered into the AppSpec file:

The ECS Service needs a task definition to run Docker
containers. The ARN of that task definition must be specified in
the application’s CodeDeploy AppSpec file.
The names of the containers in the task set must be specified in
the application’s CodeDeploy AppSpec file. The container name
must be present in list of containers defined in the ECS
Service’s task definition.
During deployment, the load balancer directs network traffic to
a port of the containers in the replacement task set. That
container port must be specified in the application’s
CodeDeploy AppSpec file.
Lambda functions run at specific stages (a.k.a. lifecycle hooks)
in the deployment to verify and validate the deployment process.
Though optional with CodeDeploy, these verification and
validation functions are a mandatory quality assurance best
practice. If the deployment is failing the functions provide
important insights into the causal factors that lead to the failure
event.

Best practice is to store the AppSpec file in a secured Amazon S3 bucket.
To deploy an ECS Service, the first step is to add a new tag to the Service’s
task definition.
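A hedged sketch of an ECS AppSpec file in JSON format follows; the task definition ARN, container name, port, and Lambda hook name are hypothetical:

  {
    "version": 0.0,
    "Resources": [
      {
        "TargetService": {
          "Type": "AWS::ECS::Service",
          "Properties": {
            "TaskDefinition": "arn:aws:ecs:us-east-1:111122223333:task-definition/web-task:4",
            "LoadBalancerInfo": {
              "ContainerName": "web",
              "ContainerPort": 80
            }
          }
        }
      }
    ],
    "Hooks": [
      { "AfterAllowTestTraffic": "ValidateAfterTestTrafficFunction" }
    ]
  }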

Blue-green Deployments
CodeDeploy uses blue-green deployments with ECS Services.
Fundamental to blue-green deployments, compute instance network traffic is
rerouted behind an ELB load balancer by using listeners. The consumer has
to specify how network traffic is shifted from the old primary task set to the
new replacement task set:

Canary – the traffic is shifted in two increments; you specify the
percentage shifted to the new replacement task set in the first
increment, and the remaining traffic is shifted in the second
increment.
Linear – the traffic is shifted in equal increments with an equal
number of minutes between each increment.

During the blue-green deployment the load balancer allows the test and
production listeners to route network traffic to the new compute instances and
containers in the replacement task set, according to the rules the consumer
specifies. During this time period, the load balancer allows the test and
production listeners to block network traffic to the old compute instances and
containers in the primary task set.
After the new compute instances in the replacement task set are registered
with the load balancer, the old compute instances in the primary task set are
deregistered and terminated (if desired).
Lastly, during the blue-green deployment the cluster’s capacity provider is
deactivated. Therefore, the infrastructure of neither the primary task set nor the
replacement task set can scale horizontally during the deployment stage.

Rollbacks
You can manually roll back a Service deployment. However, CodeDeploy
is able to automatically roll back a deployment. During a rollback
CodeDeploy reroutes network traffic from the new replacement task set to the
old primary task set. Rollback behaviors are set in the App Spec file. The
consumer can choose 1 of 3 behaviors:

1. Roll back when a deployment fails - CodeDeploy returns the
cluster and Service to the last known good deployment.
2. Roll back when alarm thresholds are met – when alarms are
added to the deployment group and one or more of the alarms
are activated during the deployment, CodeDeploy returns the
cluster and Service to the last known good deployment.
3. Disable rollbacks – rollbacks are not performed for this
deployment.

ECS Task Definition


A Task is a logical group of Task level parameters plus a task definition
that lists 1-10 Docker containers. A task definition is required to run Docker
containers on ECS. When the task runs all of its containers run on the same
compute instance. The ECS Task is an execution environment shared by 1 or
more containers. In addition, multiple Tasks (with their containers) can run
on the same compute instance at the same time.
An ECS Service requires a task definition in order to run Docker
containers on a cluster. However, an ECS Task can be standalone: the Task
does not need an ECS Service in order to run Docker containers on a cluster.
Without benefit of an ECS Service artifact, an ECS Task can be started (as
well as stopped) on a cluster. Unlike an ECS Service, an ECS Task is not
deployed to the cluster. Running a Task on a cluster is much simpler than
deploying a Service to a cluster. An ECS Task is the preferred choice when
the Docker containers are not long running processes:

Batch ETL Processing;


Machine Learning;

ECS task definitions are split into separate parts:

1. the task family - the family name is assigned to the task
definition when you register it with the Cluster. This family
name, specified with a revision number, is assigned to multiple
versions of the same task definition. The first time that the task
definition is registered with the cluster the family name is given
a revision of 1, and any task definitions registered after that are
given a sequential revision number.
2. container definitions - a list of 1-10 Docker container
definitions.
3. the IAM task role - by setting the taskRoleArn task definition
parameter (when the task definition is registered with the
cluster) you provide a task role for an IAM role that grants the
Docker containers (listed in the task definition) the permission
to call the AWS APIs that are specified in its associated policies.
When the EC2 launch type is chosen those permissions are
granted to the Amazon ECS Compute instance IAM role.
4. the network mode - a Docker container has a network mode
(none, bridge, host, awsvpc). The task definition can specify the
network mode used by the 1-10 containers that it defines.
Alternatively, the ECS Service can specify the network mode.
The task definition networkMode parameter overrides the
Service networkConfiguration parameter.
5. volumes - the list of Docker data volumes (of type Docker
volume or Bind mount) that are provisioned to all containers
listed in the task definition. When defining data volumes for use
by containers a mixture of task definition parameters and
container definition parameters must be set. When a task
definition is registered with the cluster, you can specify a list of
volumes to be passed to the Docker daemon on a compute
instance, which then becomes available for access by other
containers on the same compute instance. A Docker volume is a
Docker-managed data volume that is created under
/var/lib/docker/volumes on the compute instance. The plugin
Docker volume drivers are used to integrate the volumes with
external storage systems, such as Amazon EBS. The built-in
local volume driver or a third-party volume driver can be used.
Docker volumes are only supported when using the EC2 launch
type. Windows containers only support the use of the local
driver. A Bind mount is a file or directory on the compute
instance is mounted into a container. Bind mount host volumes
are supported when using either the EC2 or Fargate launch
types. To use bind mount host volumes, specify a value for the
host and optional sourcePath parameters in your task definition.
6. task placement constraints - when a task definition is registered
with the cluster, task placement constraints can be specified by
using the placementConstraints task definition parameter. A task
placement constraint customizes how Amazon ECS places the
Task’s container(s) on the compute instance. A list of constraints
can be applied when the Task’s container is placed on an EC2
launch type. Alternatively, the placement constraints can be
specified in the ECS Service definition. The task definition’s
task placement constraint parameters override the Service
placementConstraints parameter setting. For containers that use
the EC2 launch type, you can use constraints to place containers
based on Availability Zone, instance launch type, or custom
attributes. The task placement constraint can also be specified
using parameters in the container definitions section of the task
definition. Task placement constraints are not used by the
Fargate launch type which by default distributes the Task’s
container across AZs in the Region (that contains the VPC used
by the cluster).
7. launch types – the task definition requiresCompatibilities
parameter determines the launch type used by the 1-10
containers listed in the task definition. This Task level launch
type parameter overrides the Service launchType parameter.
Tasks that use the Fargate launch type do not require the (3)
interface VPC endpoints for the ECS.

The task family and container definitions are mandatory parts of a task
definition, while the other parts are optional. It is important to understand that
at all times a Task is a logical group of 1-10 Docker containers plus some
Task level parameters.
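As a minimal sketch of these parts (the family, image, and role names are hypothetical), a task definition registered with the RegisterTaskDefinition API might look like this:

  {
    "family": "web-task",
    "taskRoleArn": "arn:aws:iam::111122223333:role/webTaskRole",
    "networkMode": "awsvpc",
    "requiresCompatibilities": ["EC2"],
    "containerDefinitions": [
      {
        "name": "web",
        "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/web:latest",
        "cpu": 256,
        "memory": 512,
        "essential": true,
        "portMappings": [ { "containerPort": 80 } ]
      }
    ]
  }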
When a task definition lists more than 1 container that is usually due to
the need of those containers to:

share a namespace (e.g., process ID; network stack; mount
point; interprocess communication; etc.);
share data storage;
communicate with each other over the compute instance’s local
network;
communicate with each other over the Task’s ENI;

Though there are use cases that demand more than 1 container per task,
having more than 1 container per task incurs potential losses. When a task
has more than 1 container:

The containers may share the same namespaces, data storage,
and local network as well as the same attack surface.
You have lost the ability to horizontally scale those containers
independently of each other.
You have lost the ability to vertically scale those containers
independently of each other.

And, when more than 1 Task is run on the same compute instance:

You have lost the ability to horizontally scale those Tasks
independently of each other.
You have lost the ability to vertically scale those Tasks
independently of each other.

When you register a task definition, there are Task level parameters that
allow you to specify the task size in terms of total cpu and memory used by
the containers in the task. If using the EC2 launch type, these fields are
optional. If using the Fargate launch type, these fields are required and there
are specific values for both cpu and memory that are supported. This is
separate from the cpu and memory values at the container definition level.
Task level CPU and memory parameters are ignored for Windows containers,
but container-level resources are supported by Windows containers. Those
Task level parameters are:

cpu – the hard limit of CPU units available for the containers of
the task. It can be expressed as an integer using CPU units, or as
a string using vCPUs. When the task definition is registered, a
vCPU value is converted to an integer indicating the CPU units.
If using the EC2 launch type, this field is optional. If your
cluster does not have any registered compute instances with the
requested CPU units available, the task will fail. Supported
values are between 128 CPU units (0.125 vCPUs) and 10240
CPU units (10 vCPUs). If using the Fargate launch type, this
field is required and you must use one of the following values,
which determines your range of supported values for the
memory parameter: 256 (.25 vCPU), 512 (.5 vCPU), 1024 (1
vCPU), 2048 (2 vCPU), or 4096 (4 vCPU).
memory – the hard limit of memory (in MiB) available for the
containers of the task. It can be expressed as an integer using
MiB, or as a string using GB. When the task definition is
registered, a GB value is converted to an integer indicating the
MiB. If using the EC2 launch type, this field is optional and any
value can be used. If a Task level memory value is specified
then the container-level memory value is optional. If your
cluster does not have any registered compute instances with the
requested memory available, the task will fail. If using the
Fargate launch type, this field is required.
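For example (a sketch using one of the supported Fargate size combinations), the Task level size of a Fargate task definition could be declared as:

  {
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "512",
    "memory": "1024"
  }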

An ECS task definition has these additional parameters:

clusterARN - the Amazon Resource Name (ARN) assigned to
the Cluster.
capacityProviderName
attachments
taskArn - the Amazon Resource Name (ARN) assigned to the
Task.
taskDefinitionArn - the Amazon Resource Name (ARN)
assigned to the task definition.
revision – the revision number of a task definition family name;
compatibilities – the launch type that the task container(s) can
use. Alternatively, the launch type can be specified in the ECS
Service definition. This task definition parameter overrides the
Service parameter.
containerInstances – the compute instance IDs or full ARN
entries for the compute instances on which you would like to
place your containers. You can specify up to 10 compute
instances.
platformVersion
executionRoleArn - this IAM role is referred to as a task
execution IAM role. The Amazon ECS container agent, and the
Fargate agent for your Fargate tasks, make calls to the Amazon
ECS API on your behalf. Both agents requires an IAM role for
ECS to know that the agent belongs to a given AWS account.
The IAM task execution role also allows the containers in the
task to publish container logs to CloudWatch. The task
execution IAM role is required for a container that uses the
private registry authentication feature - it allows the ECS
Container Agent to pull the container image from the private
registry. The task execution IAM role is required when
referencing sensitive data using Secrets Manager secrets or
AWS Systems Manager Parameter Store parameters.
requiresAttributes – an attribute is a key-value pair associated
with the task’s container(s); this is a list of required attributes of
each container.
inferenceAccelerators – Amazon Deep Learning Containers
used for training and serving models in TensorFlow, Apache
MXNet (MXNet), and PyTorch.
ipcMode - allows you to configure your containers (in the task)
to share an inter-process communication (IPC) namespace with
the other containers in the task, or with the host. The IPC
namespace allows containers (in the task) to communicate
directly through shared-memory with other containers that are
listed in the task definition or that are running on the same
compute instance. If the host IPC mode is used, there is a
heightened risk of undesired IPC namespace exposure. The
valid values are host, task, or none. If host is specified, then all
containers within the task that specified the host IPC mode on
the same compute instance share the same IPC resources with
the host Amazon EC2 instance. If task is specified, all
containers within the specified task share the same IPC
resources. If none is specified, then IPC resources within the
containers of a task are private and not shared with other
containers in a task or on the compute instance. If no value is
specified, then the IPC resource namespace sharing depends on
the Docker daemon setting on the compute instance. If
namespaced kernel parameters using systemControls are set in
the container definitions, the following applies to the IPC
resource namespace:

• For tasks that use the host IPC mode, IPC namespace related
systemControls are not supported.
• For tasks that use the task IPC mode, IPC namespace related
systemControls will apply to all containers within a task.
The ipcMode parameter is not supported for Windows containers or
tasks using the Fargate launch type.

pidMode - allows you to configure the containers (in the task) to
share a process ID (PID) namespace with other containers in the
task, or with the compute instance. Sharing the PID namespace
enables for example monitoring applications deployed as
containers to access information about other containers that are
listed in the task definition or that are running on the same
compute instance. If the host PID mode is used, be aware that
there is a heightened risk of undesired process namespace
exposure. The valid values are host or task. If host is specified,
then all containers within the task that specified the host PID
mode on the same compute instance share the same process
namespace with the host Amazon EC2 instance. If task is
specified, all containers within the specified task share the same
process namespace. If no value is specified, the default is a
private namespace. The pidMode parameter is not supported for
Windows containers or tasks using the Fargate launch type.
overrides
attributes
group – the name of the task group to associate with the task.
The default value is the family name of the task definition;
enableECSManagedTags
propagateTags
tags
At runtime, additional properties of an ECS Task can be observed in the
ECS Console:

availabilityZone
version
connectivity
connectivityAt
createdAt
pullStartedAt
pullStoppedAt
startedAt
startedBy
stopCode
stoppedAt
stoppedReason
stoppingAt
executionStoppedAt
desiredStatus
healthStatus
lastStatus
status – the current status of the ECS Task (and its containers):
PROVISIONING, PENDING, ACTIVATING, RUNNING,
DEACTIVATING, STOPPING, DEPROVISIONING,
STOPPED.

Like all things configurable in AWS, the ECS Task parameters and their
assigned values are captured in a JSON-formatted or YAML document.
Best practice is to declare the ECS Task definition in an AWS
CloudFormation template, and to manage that document like all the other
files that participate in your CD/CI workflow.

ECS Tasks and AppMesh


With AWS App Mesh you get application-level networking and it is the
only service mesh (based on the Envoy proxy) that allows communication
across multiple types of compute infrastructure such as EC2, ECS, Fargate,
and Kubernetes on AWS. App Mesh offers end-to-end observability of
network traffic, security, and traffic management for your distributed Docker
containers, freeing you to focus on building your containerized applications.
In that context, increased network interface density (ENI trunking) might save money.
App Mesh requires containers to use the awsvpc network mode, which
provides routing control and visibility over a suite of running Tasks. Once
updated to use App Mesh, the task’s containers communicate with each other
through the Envoy proxy rather than directly with each other. App Mesh
separates the logic needed for monitoring and controlling communications
into a proxy that manages all network traffic for each container. The proxy
standardizes how Docker containers communicate with each other, and how
they export monitoring data. When each container starts, its proxy connects
to App Mesh and receives configuration data about the locations of other
containers in the mesh.
The task definition’s proxyConfiguration parameter enables you to make
AWS App Mesh available to its list of Docker containers. App Mesh only
supports Linux-based containers that are registered with DNS, AWS Cloud
Map, or both. Therefore, a task definition with proxyConfiguration enabled
has to be a part of a registered ECS Service that has enabled the ECS Service
Discovery service which has registered the compute instances and containers.
The proxyConfiguration parameter is not useable by a task definition that
is part of an ECS Task and which is not part of an ECS Service. The
proxyConfiguration parameter cannot be used with Windows containers. For
ECS Services using the Fargate launch type, this feature requires that the
Service (and its Task) uses platform version 1.3.0 or later. For ECS Services
using the EC2 launch type, the compute instance requires at least version
1.26.0 of the Amazon ECS Container Agent and at least version 1.26.0-1 of
the ecs-init Service to enable an App Mesh proxy configuration.
Refer to the App Mesh developer and user guides for instructions on how
to integrate the App Mesh service with ECS.

RunTask API
The ECS API supports the RunTask command which allows ECS to place
containers for you, or you can customize how ECS places containers by using
placement constraints and placement strategies. The RunTask command
accepts the following data in JSON format:

cluster - the short name or full Amazon Resource Name (ARN)
of the cluster on which to run your task. If you do not specify a
cluster, the default cluster is assumed.
capacityProviderStrategy
launchType
platformVersion
taskDefinition
networkConfiguration
count - the number of containers the task can place on a
compute instance in the cluster. You can specify up to 10
containers (not tasks) per call.
placementStrategy
placementConstraints
group - the name of the task group to associate with the Task.
The default value is the family name of the task definition.
overrides – a list of container overrides in JSON format that
specify the name of a container in the specified task definition
and the overrides it should receive.
enableECSManagedTags
propagateTags
referenceId - the reference ID to use for the task.
startedBy - contains the deployment ID of the service that starts
it.
tags
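As a sketch (the cluster, task definition, and container names are hypothetical), a RunTask request might look like this:

  {
    "cluster": "production-cluster",
    "launchType": "EC2",
    "taskDefinition": "web-task:3",
    "count": 2,
    "group": "web",
    "overrides": {
      "containerOverrides": [
        {
          "name": "web",
          "environment": [ { "name": "LOG_LEVEL", "value": "debug" } ]
        }
      ]
    },
    "enableECSManagedTags": true,
    "tags": [ { "key": "team", "value": "platform" } ]
  }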

StartTask API
The ECS API supports the StartTask command which the consumer uses
to start a Task according to the specified task definition and where it’s 1-10
containers are hosted by the specified launch type. The StartTask command
accepts the following data in JSON format.

cluster
taskDefinition
networkConfiguration
containerInstances – the compute instance IDs or full ARN
entries for the compute instances on which you would like to
place your containers. You can specify up to 10 compute
instances.
group
overrides
enableECSManagedTags
propagateTags
referenceId
startedBy
tags

StopTask API
The ECS API supports the StopTask command which the consumer uses
to stop a Task according to the specified task ID. The StopTask command
accepts the following data in JSON format:

cluster – the short name or full Amazon Resource Name (ARN)
of the cluster that hosts the task to stop. If you do not specify a
cluster, the default cluster is assumed.
reason – an optional message specified when a task is stopped.
task - task ID or full Amazon Resource Name (ARN) of the task
to stop.
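As a minimal sketch, a StopTask request body might look like the following;
the cluster name and task ID are illustrative assumptions.
{
    "cluster": "default-cluster",
    "task": "0cc43cdb-3bee-4407-9c26-c0e6ea5bee84",
    "reason": "Stopping the task ahead of a new deployment"
}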

Scheduled Tasks
Amazon ECS supports the ability to schedule Tasks on either a cron-like
schedule or in a response to CloudWatch Events. If you have a Task that
needs to run at set intervals you can use the ECS console to create a
CloudWatch Events rule that will run the Task at the specified times.

CloudWatch Events IAM Role


Before you can schedule Tasks with CloudWatch Events rules and
targets, the CloudWatch Events service needs permissions to run Amazon
ECS tasks on your behalf. These permissions are provided by the
CloudWatch Events IAM role (ecsEventsRole).

Running Tasks on Fargate Instances


ECS Tasks that use Fargate instances require that their containers use the
awsvpc network mode. In addition, this requires that both cpu and memory
be set at the Task level (not at the container level). When the Fargate launch
type is chosen for use by an ECS Task, a number of task definition and
container definition configuration parameters become invalid.
These task definition parameters are invalid:

dockerVolumeConfiguration
ipcMode
pidMode
placementConstraints

And, these container definition parameters become invalid:

disableNetworking
dnsSearchDomains
dnsServers
dockerSecurityOptions
extraHosts
gpu
links
privileged
systemControls
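By way of illustration, a minimal task definition skeleton targeted at the
Fargate launch type declares awsvpc networking and Task level cpu and memory;
the family name, image, and the 256/512 cpu/memory combination shown here are
illustrative assumptions.
{
    "family": "fargate-example",
    "requiresCompatibilities": [ "FARGATE" ],
    "networkMode": "awsvpc",
    "cpu": "256",
    "memory": "512",
    "containerDefinitions": [
        {
            "name": "web",
            "image": "nginx:latest",
            "essential": true
        }
    ]
}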

Task Lifecycle
All resources in the cloud are temporary. The task can transition through 8
states during its lifecycle:

PROVISIONING - ECS has to perform additional steps before
the Task is launched.
PENDING - ECS is waiting on the Container Agent to take
further action.
ACTIVATING - ECS has to perform additional steps after the
Task is launched but before the Task can transition to the
RUNNING state.
RUNNING – The Task is running as defined.
DEACTIVATING - ECS has to perform additional steps before
the Task is stopped.
STOPPING - ECS is waiting on the Container Agent to take
further action.
DEPROVISIONING - ECS has to perform additional steps after
the Task has stopped but before the Task transitions to the
STOPPED state
STOPPED – The Task is stopped as needed.

Docker Data Storage Types


The typical application running in a Docker container consumes as well as
generates data. To handle the data that it consumes and generates, a Docker
container can use different types of storage:

Bind Mount (called a host volume in ECS);


Docker volumes;
tmpfs Mount.

Docker container storage is supported by pluggable storage drivers. Each
container can use 1 and only 1 storage driver. All containers listed in the task
definition use the same type of storage driver.
In ECS, the data storage provisioned to a Docker container are configured
by using parameters found in the task definition. Before diving into how ECS
defines container data storage, let us examine data storage according to
Docker.

Docker Bind Mounts


Docker Bind Mounts are dependent on the directory structure of the
container compute instance’s file system. The bind mount is a file or
directory on the compute instance that is mounted inside the container.
The bind mount is tied to the lifecycle of the Docker container that uses it.
Data written to the bind mount is propagated back to the host (virtual) machine.
A few notes about Docker bind mounts:

Provides persistent data storage for use by containers.


Can be used to share data storage at different locations on
different containers running on the same compute instance.
By default the container can write to the bind mount, but read-
only access can be set;
Increasing the volume of data written to the bind mount does
increase the amount of storage consumed on the host compute
instance;
Can persist data after the container is stopped, but not after the
container is deleted or after you manually delete the file.
Its data contents can be exported as a tar file;
Can be used for non-persistent data, where the same bind mount
is mounted in multiple copies of the same container (where all
copies are running on the same compute instance);
Can be used to define an empty, nonpersistent data volume and
mount it on multiple containers on the same compute instance
A bind-mount of a non-empty directory on the container
obscures the directory’s existing content.

Before a Docker image can run in a container it has to be copied to the
compute instance that the container runs on. Docker images are stored in the
container’s bind mount. The ECS Container Agent serves up Docker images
(located in the container’s bind mount) to runc.

Docker Volumes
Docker volumes are managed by Docker and are a directory within the
Docker storage directory on the compute instance’s file system. The Docker
volume exists outside, and beyond, the lifecycle of the Docker container.
A few notes about Docker volumes:

A Docker volume is created under /var/lib/docker/volumes on
the compute instance, and is managed by Docker’s built-in local
driver;
Provides persistent data for use by the container. Can persist
volume data after the container is stopped, even after the
container is deleted;
The local Docker volume cannot be shared among containers –
each container uses its own local volume. The local volume is
scoped to the ECS task: the local volume is automatically
provisioned when the task starts and destroyed when the task is
deleted;
Can be a data volume that is managed by a third-party volume
driver;
The same external storage system volume can be safely shared
among multiple containers;

Increasing the volume of data written to the Docker volume
does not increase the physical size of the container;
Can be used to share volume data at different locations on
different containers running on the same compute instance;
The volume used by the container(s) of the Task can be a third-
party volume driver;
If you start a container with a volume that does not exist Docker
will create the volume;
Its data contents can be exported as a tar file;

Can be managed using the Docker API as well as the Docker
Client’s CLI;
Can be backed-up and migrate to other containers using the
Docker Client’s CLI;
When the volume is new, its content can be pre-populated by the
content within a container’s directory;
Can define an empty, nonpersistent data volume and mount it on
multiple containers.

Docker tmpfs Mounts


The Docker tmpfs mount is located in the memory space allocated to the
container. The Docker tmpfs mount is temporary and therefore suited to
containers that generate data that does not need to be persisted beyond the life
of the container.
A few notes about Docker tmpfs mounts:

Writes/reads faster than the Docker bind mount or the Docker
volume;
Cannot be shared among containers;
Cannot persist after the container is stopped;

Configuring Bind Mounts on ECS


In ECS, the bind mounts are a Task level object. Before a bind mount can
be used by a container it must be described in the task definition (and in its
container definitions section).
A few notes about bind mounts on ECS:

The Docker bind mount has been given the alias ‘host volume.’
Both EC2 and AWS Fargate launch types support Docker bind
mounts. However, bind mounts on AWS Fargate tasks are non-
persistent.
ECS does not sync bind mount data across containers;
By default, the ECS Container Agent deletes the data 3 hours
after the container exits;
With ECS, if the bind mount’s directory path on the compute
instance is not defined, the Docker daemon will create the
directory, but the data is not persisted after the container is
stopped.

To provide bind mounts to containers in ECS, you must specify the host
parameter (and the optional sourcePath parameter) in the volumes section of
the task definition. The host and sourcePath parameters are not supported for
Fargate Tasks. Before the containers can use bind mount (a.k.a., host
volumes), you must specify the volume configuration within the task
definition as well as the mount point configuration within the containers
definitions. The volumes parameter is a Task level property and the
mountPoints parameter is a Container level property.
Immediately below is a segment of a task definition – sourced from the
public AWS web site – that shows the syntax used to define a bind mount.
{
"family": "",
...
"containerDefinitions" : [
{
"mountPoints" : [
{
"containerPath" : "/path/to/mount_volume",
"sourceVolume" : "string"
}
],
"name" : "string"
}
],
...
"volumes" : [
{
"host" : {
"sourcePath" : "string"
},
"name" : "string"
}
]
}
With the free sample as our template, here are the descriptions of the bind
mount configuration parameters found within the volumes section of the task
definition:

host (mandatory) – the parameter used when specifying a bind
mount. Determines whether the bind mount data volume persists
on the host compute instance and where it is stored. If this
parameter is empty, then the Docker daemon assigns a host path,
but the data is not guaranteed to persist after the containers
using it have stopped running.
sourcePath (optional) – declares the path on the host compute
instance that is provided/exported to the container. When the
path is declared, the data persists in that location on the host
compute instance until it is manually deleted. If the declared path
does not exist on the host compute instance, then the Docker
daemon creates it.
name – the name of the data volume. This name is referenced in
the sourceVolume parameter of the container definition
mountPoints.

And, here are the descriptions of the bind mount configuration parameters
found within the container definitions section of the task definition:

mountPoints (mandatory) - the mount points used for data
volumes by the container.
sourceVolume (mandatory) – the name of the data volume to
mount.
containerPath (mandatory) – the path on the container to mount
the data volume at.
readOnly - If this value is true, the container has read-only
access to the data volume. If this value is false, then the
container can write to the data volume. The default value is
false.
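Putting these parameters together, a hypothetical task definition segment that
bind-mounts the compute instance's /var/log/app directory into a container at
/mnt/app-logs could look like the following; the volume name, container name,
and paths are illustrative assumptions.
"containerDefinitions" : [
    {
        "name" : "log-reader",
        "mountPoints" : [
            {
                "sourceVolume" : "app-logs",
                "containerPath" : "/mnt/app-logs",
                "readOnly" : true
            }
        ]
    }
],
"volumes" : [
    {
        "name" : "app-logs",
        "host" : {
            "sourcePath" : "/var/log/app"
        }
    }
]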

Configuring Volumes on ECS


In ECS, the Docker volumes are a Task level object. Before a Docker
volume can be used by a container it must be described in the task definition
(and in its container definitions section).
A few notes about Docker volumes on ECS:

Are only supported by the EC2 launch type;


Can be used with both Linux and Windows containers;
Linux containers support both local volumes as well as external
storage system volumes.
Windows containers only support local volumes.
The lifecycle of the Docker volume can be tied to either a
specific ECS Task or to the lifecycle of a specific EC2 instance.
Can be mounted in multiple containers within the same ECS
task;
Can be either local to the container or an external storage
volume driver (e.g. EBS, S3);
ECS does not sync volume data across containers;
Can be stored on remote compute instances or AWS storage
(where data is encrypted at rest);
Can be shared among containers using the volumes,
mountPoints and volumesFrom parameters in the task definition
Sharing Docker volumes among containers is way to build fault-
tolerant applications. Multiple containers running on the
compute instance can mount the same internal or external
storage volume. Multiple containers running on multiple
compute instances can mount the same external storage volume.
The same external storage volume can be read/write by certain
containers and read-only by other containers.

To provide Docker volumes to containers in ECS, you must specify the
dockerVolumeConfiguration parameter in your task definition. Before the
containers can use a Docker volume, you must specify the volume
configuration within the task definition as well as the mount point
configuration within the containers definitions section. The volumes
parameter is a Task level property and the mountPoints parameter is a
Container level property.
Immediately below is a segment of a task definition – sourced from the
public AWS web site – that shows the syntax used to define a Docker
volume.
{
"containerDefinitions": [
{
"mountPoints": [
{
"sourceVolume": "string",
"containerPath": "/path/to/mount_volume",
"readOnly": boolean
}
]
}
],
"volumes": [
{
"name": "string",
"dockerVolumeConfiguration": {
"scope": "string",
"autoprovision": boolean,
"driver": "string",
"driverOpts": {
"key": "value"
},
"labels": {
"key": "value"
}
}
}
]
}
With the free sample as our template, here are the descriptions of the
Docker volume configuration parameters found within the volumes section of
the task definition:

dockerVolumeConfiguration (mandatory) - used when defining
a Docker volume.
name .
labels – custom metadata that you have added to the Docker
volume.

Here are the descriptions of the parameters found within the
dockerVolumeConfiguration definition:

scope – the scope for the Docker volume, which determines its
lifecycle. Docker volumes that are scoped to a task are
automatically provisioned when the task starts and destroyed when
the task is cleaned up. Docker volumes that are scoped as shared
persist after the task stops.
autoprovision - if this value is true, the Docker volume is created
if it does not already exist. This field is only used if the scope is
shared. If the scope is task then this parameter must either be
omitted or set to false.
driver – the driver to use with the Docker volume.
driverOpts – the options to pass through to the Docker volume
driver.

And, here are the descriptions of the Docker volume configuration
parameters found within the container definitions section of the task
definition:

mountPoints (mandatory).
sourceVolume (mandatory).
containerPath (mandatory).
readOnly .
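Putting these parameters together, a hypothetical task definition segment that
declares a shared, auto-provisioned local Docker volume and mounts it in a
container could look like the following; the volume name, container name, and
mount path are illustrative assumptions.
"containerDefinitions": [
    {
        "name": "data-writer",
        "mountPoints": [
            {
                "sourceVolume": "shared-data",
                "containerPath": "/data",
                "readOnly": false
            }
        ]
    }
],
"volumes": [
    {
        "name": "shared-data",
        "dockerVolumeConfiguration": {
            "scope": "shared",
            "autoprovision": true,
            "driver": "local"
        }
    }
]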

Configuring Amazon Elastic File System (EFS) Volumes


The Amazon Elastic File System (EFS) provides scalable data storage for
use with EC2 instances in ECS. The EC2 instance AMI has to be configured
to mount the EFS file system before the Docker daemon starts the Task. Once
attached to the EC2 instance, the EFS grows and shrinks automatically as the
containerized application adds and remove files. With EFS, the Task’s
containers have the data storage they need when they need it.
EFS volumes can be used by EC2 instances and by Fargate instances.
Fargate automatically creates a supervisor container that is responsible for the
EFS volume.
EFS can be used to export file system data to multiple EC2 instances, and
therefore make the file system data available to Task containers running on
the EC2 instance. Given a suite of EC2 instances configured to use EFS,
multiple Tasks (and their containers) have access to the same persistent file
storage from each EC2 instance that they are placed on.
EFS access points are application-specific entry points into the EFS file
system that make it easier to manage application access to shared file system
data. EFS access points can enforce a user identity, including the user's
POSIX groups, for all Task container file system requests that are made
through the access point. A task IAM role is used to enforce how the
container uses an EFS access point. Access points can also enforce a different
root directory for the file system so that Task containers can only access data
in the specified directory or its subdirectories.
To provide EFS to the Task’s containers, you must specify the
efsVolumeConfiguration parameter in the volumes section of the task
definition. Before the containers can use EFS, you must specify the volume
configuration within the task definition as well as the mount point
configuration within the containers definitions section. The volumes
parameter is a Task level property and the mountPoints parameter is a
Container level property.
Immediately below is a segment of a task definition – sourced from the
public AWS web site – that shows the syntax used to define an EFS volume.
{
"containerDefinitions": [
{
"name": "container-using-efs",
"image": "amazonlinux:2",
"entryPoint": [
"sh",
"-c"
],
"command": [
"ls -la /mount/efs"
],
"mountPoints": [
{
"sourceVolume": "myEfsVolume",
"containerPath": "/mount/efs",
"readOnly": true
}
]
}
],
"volumes": [
{
"name": "myEfsVolume",
"efsVolumeConfiguration": {
"fileSystemId": "fs-1234",
"rootDirectory": "/path/to/my/data",
"tranitEncryption": "ENABLED",
"transitEncryptionPort: {
"accessPointId": "fsap-1234",
"iam": "ENABLED"
}
}
}
]
}
With the free sample as our template, here are the descriptions of the EFS
volume configuration parameters found within the volumes section of the
task definition:
efsVolumeConfiguration (mandatory) – used when defining an EFS
volume.
name .
Here are the descriptions of the parameters found within the
efsVolumeConfiguration definition:

fileSystemId – the EFS file system ID to use.


rootDirectory – the directory within the EFS file system to
mount as the root directory inside the EC2 instance. If this
parameter is omitted, the root of the EFS volume will be used.
transitEncryption – whether or not to enable encryption for EFS
data in transit between the EC2 instance and the EFS server.
Transit encryption must be enabled if EFS IAM authorization is
used. If this parameter is omitted, the default value of
DISABLED is used.
transitEncryptionPort - the port to use when sending encrypted
data between the EC2 instance and the EFS server. If you do not
specify a transit encryption port, it will use the port selection
strategy that the EFS mount helper uses.
authorizationConfig – the authorization configuration details for
the EFS file system.
accessPointId - the access point ID to use. If an access point is
specified, the root directory value will be relative to the
directory set for the access point. If specified, transit encryption
must be enabled in the EFSVolumeConfiguration.
iam – whether or not to use the ECS task IAM role defined in a
task definition when mounting the EFS file system. If enabled,
transit encryption must be enabled in the
EFSVolumeConfiguration. If this parameter is omitted, the
default value of DISABLED is used.

And, here are the descriptions of the EFS volume configuration
parameters found within the container definitions section of the task
definition:

name (mandatory).
image (mandatory).
command (mandatory) - "ls -la /mount/efs". The command that
is passed to the container.
entryPoint - The entry point that is passed to the container
mountPoints (mandatory).
sourceVolume (mandatory).
containerPath (mandatory).
readOnly .

Configuring tmpfs Mounts on ECS


In ECS, the tmpfs mounts are a Task level object. Before a tmpfs mount
can be used by a container it must be described in the task definition (and in
its container definitions section).
A few notes about tmpfs mounts on ECS:

Supported only by the EC2 launch type;

To provide tmpfs to the Task’s containers, you must define the tmpfs
parameter in the container definitions section of the task definition:

tmpfs (mandatory) – the container path, mount options, and size
(in MiB) of the tmpfs mount.
containerPath (mandatory) – the absolute file path where the
tmpfs volume is to be mounted.
mountOptions (mandatory) – the list of tmpfs volume mount
options.
size (mandatory) – the size (in MiB) of the tmpfs volume.
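As a minimal sketch, the tmpfs parameter is nested under the container
definition's linuxParameters; the container name, mount path, options, and size
below are illustrative assumptions.
"containerDefinitions": [
    {
        "name": "scratch-worker",
        "image": "amazonlinux:2",
        "linuxParameters": {
            "tmpfs": [
                {
                    "containerPath": "/scratch",
                    "mountOptions": [ "defaults", "noexec" ],
                    "size": 128
                }
            ]
        }
    }
]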

Task Retirement
The phrase ‘task retirement’ refers to stopping or terminating a Service, a
Task, and their containers. ECS retires a task when:

The task has reached its scheduled retirement date;


AWS detects an unrecoverable hardware failure of the compute
instance that is hosting the task;
The task uses a Fargate instance that has a security
vulnerability.

When ECS retires a task, the AWS account owner is sent an email notifying
them of the pending task retirement. If the task is part of a Service, the task is
stopped and the service scheduler automatically restarts the task. If a
standalone task is retired then you must launch the replacement task.
ECS Task Container Definition
The Task level contains a list of 1-10 container definitions. Each
container definition is held within the Task’s containerDefinitions
parameter. The container definition is the Container level of the ECS
platform. The container definition contains configuration information that
ECS passes to the Docker Remote API (and that can also be used for option
values passed to docker run).
When the Task runs all of its Docker containers run on the same compute
instance. In addition, multiple tasks (with their containers) can run on the
same compute instance at the same time.
It is with a container definition that you specify how each container is
configured to support its Docker image. A container definition is specific to
the Docker image present in the container, the namespaces that the image
requires, and any dependency the container has with another container in the
task. And, it is the dependencies between containers that cause them to be
listed together in the same task definition. Unless containers are inter-
dependent, there is little reason to list more than one container in a task
definition.
In ECS, parts of a container can be configured at the Service level and at
the Task level, and a container is, of course, fully defined at the Container
level. The parameters of the Container level over-ride both corresponding
Task parameters and Service parameters. It is important to appreciate which
Container parameters can be used to over-ride container configuration set at
the Task or at the Service levels.
AWS created three categories of parameters of an ECS container
definition:

1. Standard;
2. Advanced, and
3. Other.

A significant majority of container parameters are tightly coupled to other
VPC, Cluster, Service, Task, and container configuration parameters. In this
manuscript, to document the multiplicity of dependencies affecting each
container parameter, their descriptions have been copied in whole or in large
part from the Amazon ECS API Reference guide. The copied text is bound
by single quotes and has the string ‘(Amazon ECS API Reference)’ appended
to them.
As with all things configurable in AWS, the ECS Container parameters
and their assigned values are captured in a JSON- or YAML-formatted
document. Best practice is to declare the ECS Container definition in an
AWS CloudFormation template, and to manage that document like all other
files that participate in your CI/CD workflow.

Standard Container Parameters


The standard container definition parameters are either required by or
used in most container definitions. There are simple and complex standard
container parameters.
A container has these simple standard parameters:

name - the name of a container.

image - the image used to start a container. This string is passed
directly to the Docker daemon. Images in the Docker Hub
registry are available by default. Other repositories can be
specified. When a new task starts, the Amazon ECS Container
Agent pulls the latest version of the specified image and tag for
the container to use. Instead of setting the image container
parameter, the command container parameter can be set.

And, a container has these complex standard parameters:

Memory, and
Port Mapping.

Container Memory Parameters


There are two container memory parameters:

memory – ‘The amount (in MiB) of memory to present to the
container. If the container attempts to exceed the memory
specified here, the container is killed. The total amount of
memory reserved for all containers within a task must be lower
than the task memory value if one is specified. If using the
Fargate launch type, this parameter is optional. If using the EC2
launch type, you must specify either a Task level memory value
or a container-level memory value. If you specify both a
container-level memory and memoryReservation value, memory
value must be greater than memoryReservation value. If you
specify memoryReservation, then that value is subtracted from
the available memory resources for the compute instance on
which the container is placed. Otherwise, the value of memory
is used. The Docker daemon reserves a minimum of 4 MiB of
memory for a container, so you should not specify fewer than 4
MiB of memory for your containers.’ (Amazon ECS API
Reference)
memoryReservation – ‘The soft limit (in MiB) of memory to
reserve for the container. When system memory is under heavy
contention, Docker attempts to keep the container memory to
this soft limit. However, the container can consume more
memory when it needs to, up to either the hard limit specified
with the memory parameter (if applicable), or all of the
available memory on the compute instance, whichever comes
first. If a Task level memory value is not specified, you must
specify a non-zero integer for one or both of memory or
memoryReservation in a container definition. If you specify
both, memory must be greater than memoryReservation. If you
specify memoryReservation, then that value is subtracted from
the available memory resources for the compute instance on
which the container is placed. Otherwise, the value of memory
is used. The Docker daemon reserves a minimum of 4 MiB of
memory for a container, so do not specify fewer than 4 MiB of
memory for the containers. ’ (Amazon ECS API Reference)
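For example, a hypothetical container that normally runs within 256 MiB but is
allowed to burst up to a 512 MiB hard limit could be declared as follows; the
name and image are illustrative assumptions.
"containerDefinitions": [
    {
        "name": "api",
        "image": "httpd:2.4",
        "memory": 512,
        "memoryReservation": 256
    }
]
Because memory must be greater than memoryReservation, the hard limit of 512
MiB and the soft limit of 256 MiB satisfy the rule described above.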

Container Port Mappings Parameters


There are four container port mappings parameters:

portMappings – ‘The list of port mappings for the container.


Port mappings allow containers to access ports on the host
compute instance to send or receive traffic. For task definitions
that use the awsvpc network mode, you should only specify the
containerPort. The hostPort can be left blank or it must be the
same value as the containerPort. Port mappings on Windows use
the NetNAT gateway address rather than localhost. There is no
loopback for port mappings on Windows, so you cannot access
a container's mapped port from the host itself. If the network
mode of a task definition is set to none, then you cannot specify
port mappings. If the network mode of a task definition is set to
host, then host ports must either be undefined or they must
match the container port in the port mapping. After a task
reaches the RUNNING status, manual and automatic host and
container port assignments are visible in the Network Bindings
section of a container description for a selected task in the
Amazon ECS console. The assignments are also visible in the
networkBindings section DescribeTasks responses. ’ (Amazon
ECS API Reference)
containerPort – ‘Required: yes, when portMappings are used.
The port number on the container that is bound to the user-
specified or automatically assigned host port. If using containers
in a task with the Fargate launch type, exposed ports should be
specified using containerPort. If using containers in a task with
the EC2 launch type and you specify a container port and not a
host port, your container automatically receives a host port in
the ephemeral port range. For more information, see hostPort.
Port mappings that are automatically assigned in this way do not
count toward the 100 reserved ports limit of a compute instance.
You cannot expose the same container port for multiple
protocols. An error will be returned if this is attempted. ’
(Amazon ECS API Reference)
hostPort – ‘The port number on the compute instance to reserve
for your container. If using containers in a task with the Fargate
launch type, the hostPort can either be left blank or be the same
value as containerPort. If using containers in a task with the EC2
launch type, you can specify a non-reserved host port for your
container port mapping (this is referred to as static host port
mapping), or you can omit the hostPort (or set it to 0) while
specifying a containerPort and your container automatically
receives a port (this is referred to as dynamic host port mapping)
in the ephemeral port range for your compute instance operating
system and Docker version. The default ephemeral port range
Docker version 1.6.0 and later is listed on the instance under
/proc/sys/net/ipv4/ip_local_port_range. If this kernel parameter
is unavailable, the default ephemeral port range from 49153–
65535 is used. Do not attempt to specify a host port in the
ephemeral port range, as these are reserved for automatic
assignment. In general, ports below 32768 are outside of the
ephemeral port range. The default reserved ports are 22 for SSH,
the Docker ports 2375 and 2376, and the Amazon ECS
container agent ports 51678-51680. Any host port that was
previously user-specified for a running task is also reserved
while the task is running (after a task stops, the host port is
released). The current reserved ports are displayed in the
remainingResources of describe-container-instances output, and
a compute instance may have up to 100 reserved ports at a time,
including the default reserved ports. Automatically assigned
ports do not count toward the 100 reserved ports limit. ’
(Amazon ECS API Reference)
protocol – ‘The protocol used for the port mapping. Valid
values are tcp and udp. The default is tcp. ’ (Amazon ECS API
Reference)
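As a minimal sketch, a static port mapping for a container listening on port
8080 and exposed on host port 80 (valid for the bridge network mode on the EC2
launch type) could be declared as follows; for the awsvpc network mode the
hostPort would be omitted or set equal to containerPort.
"portMappings": [
    {
        "containerPort": 8080,
        "hostPort": 80,
        "protocol": "tcp"
    }
]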

Advanced Container Parameters


There are 1 simple and numerous complex advanced container
parameters. This is the simple advanced parameter:

dockerLabels – a map of key/value pairs of labels that are added
to the container.

These are the complex advanced parameters:

Health Check
Environment
Network Settings
Storage and Logging
Security
Resource Limits
Container Health Check Parameters
There are six container health check parameters:

healthCheck – ‘The container health check command and
associated configuration parameters for the container. This
parameter maps to HealthCheck in the Create a container
section of the Docker Remote API and the HEALTHCHECK
parameter of docker run. The Amazon ECS container agent only
monitors and reports on the health checks specified in the task
definition. Amazon ECS does not monitor Docker health checks
that are embedded in a container image and not specified in the
container definition. Health check parameters that are specified
in a container definition override any Docker health checks that
exist in the container image. ’ (Amazon ECS API Reference)

‘The following describes the possible healthStatus values for a
container:

HEALTHY—The container health check has passed
successfully.
UNHEALTHY—The container health check has failed.
UNKNOWN—The container health check is being
evaluated or there is no container health check defined. ’
(Amazon ECS API Reference)

‘The following describes the possible healthStatus values for a task.


The container health check status of nonessential containers do not have
an effect on the health status of a task.

HEALTHY—All essential containers within the task have
passed their health checks.
UNHEALTHY—One or more essential containers have
failed their health check.
UNKNOWN—The essential containers within the task are
still having their health checks evaluated or there are no
container health checks defined. ’ (Amazon ECS API
Reference)
‘If a task is run manually, and not as part of a service, the task will
continue its lifecycle regardless of its health status. For tasks that are
part of a service, if the task reports as unhealthy then the task will be
stopped and the service scheduler will replace it. ’ (Amazon ECS API
Reference)
‘The following are notes about container health check support:

1. Container health checks require version 1.17.0 or greater of
the Amazon ECS container agent.
2. Container health checks are supported for Fargate tasks if
you are using platform version 1.1.0 or later.
3. Container health checks are not supported for tasks that are
part of a Service that is configured to use a Classic Load
Balancer.’ (Amazon ECS API Reference)

command – ‘A string array representing the command that the
container runs to determine if it is healthy. The string array can
start with CMD to execute the command arguments directly, or
CMD-SHELL to run the command with the container's default
shell. If neither is specified, CMD is used by default. An exit
code of 0 indicates success, and a non-zero exit code indicates
failure. ’ (Amazon ECS API Reference)
interval – ‘The time period in seconds between each health
check execution. You may specify between 5 and 300 seconds.
The default value is 30 seconds. ’ (Amazon ECS API
Reference)
timeout – ‘The time period in seconds to wait for a health check
to succeed before it is considered a failure. You may specify
between 2 and 60 seconds. The default value is 5 seconds. ’
(Amazon ECS API Reference)
retries – ‘The number of times to retry a failed health check
before the container is considered unhealthy. You may specify
between 1 and 10 retries. The default value is three retries. ’
(Amazon ECS API Reference)
startPeriod – ‘The optional grace period within which to
provide containers time to bootstrap before failed health checks
count towards the maximum number of retries. You may specify
between 0 and 300 seconds. The startPeriod is disabled by
default. ’ (Amazon ECS API Reference)
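As a minimal sketch, a health check that polls a hypothetical /health endpoint
could be declared as follows; it assumes the container image includes the curl
utility, and the interval, timeout, retry, and grace-period values are
illustrative.
"healthCheck": {
    "command": [ "CMD-SHELL", "curl -f http://localhost:8080/health || exit 1" ],
    "interval": 30,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 60
}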

Container Environment Parameters


There are ten container environment parameters:

cpu – ‘The number of cpu units the Amazon ECS container
agent will reserve for the container. This field is optional for
tasks using the Fargate launch type, and the only requirement is
that the total amount of CPU reserved for all containers within a
task be lower than the Task level cpu value. You can determine
the number of CPU units that are available per Amazon EC2
instance type by multiplying the number of vCPUs listed for that
instance type on the Amazon EC2 Instances detail page by
1,024. Linux containers share unallocated CPU units with other
containers on the compute instance with the same ratio as their
allocated amount. For example, if you run a single-container
task on a single-core instance type with 512 CPU units specified
for that container, and that is the only task running on the
compute instance, that container could use the full 1,024 CPU
unit share at any given time. However, if you launched another
copy of the same task on that compute instance, each task would
be guaranteed a minimum of 512 CPU units when needed, and
each container could float to higher CPU usage if the other
container was not using it, but if both tasks were 100% active all
of the time, they would be limited to 512 CPU units. On Linux
compute instances, the Docker daemon on the compute instance
uses the CPU value to calculate the relative CPU share ratios for
running containers. The minimum valid CPU share value that
the Linux kernel allows is 2. However, the CPU parameter is not
required, and you can use CPU values below 2 in your container
definitions. For CPU values below 2 (including null), the
behavior varies based on your Amazon ECS container agent
version. Agent versions >= 1.2.0: Null, zero, and CPU values of
1 are passed to Docker as two CPU shares. On Windows
compute instances, the CPU limit is enforced as an absolute
limit, or a quota. Windows containers only have access to the
specified amount of CPU that is described in the task definition.
’ (Amazon ECS API Reference)
gpu – ‘The number of physical GPUs the Amazon ECS
container agent will reserve for the container. The number of
GPUs reserved for all containers in a task should not exceed the
number of available GPUs on the compute instance the task is
launched on. For more information, see Working with GPUs on
Amazon ECS. This parameter is not supported for Windows
containers or tasks using the Fargate launch type. ’ (Amazon
ECS API Reference)
essential – ‘If the essential parameter of a container is marked as
true, and that container fails or stops for any reason, all other
containers that are part of the task are stopped. If the essential
parameter of a container is marked as false, then its failure does
not affect the rest of the containers in a task. If this parameter is
omitted, a container is assumed to be essential. All tasks must
have at least one essential container. If you have an application
that is composed of multiple containers, you should group
containers that are used for a common purpose into components
and separate the different components into multiple task
definitions. ’ (Amazon ECS API Reference)
entryPoint – ‘The entry point that is passed to the container. ’
(Amazon ECS API Reference)
command – ‘The command that is passed to the container. If
there are multiple arguments, each argument should be a
separated string in the array. ’ (Amazon ECS API Reference)
workingDirectory – ‘The working directory in which to run
commands inside the container. ’ (Amazon ECS API Reference)
environment – ‘The environment variables to pass to a
container. AWS does not recommend using plaintext
environment variables for sensitive information, such as
credential data. ’ (Amazon ECS API Reference)
name – ‘Required when the environment parameter is used. The
name of the environment variable. ’ (Amazon ECS API
Reference)
value – ‘Required when the environment parameter is used. The
value of the environment variable. ’ (Amazon ECS API
Reference)
secrets – ‘An object representing the secret to expose to your
container. ’ (Amazon ECS API Reference)
name – ‘The value to set as the environment variable on the
container. ’ (Amazon ECS API Reference)
valueFrom – ‘The secret to expose to the container. The
supported values are either the full ARN of the AWS Secrets
Manager secret or the full ARN of the parameter in the AWS
Systems Manager Parameter Store. If the Systems Manager
Parameter Store parameter exists in the same Region as the task
you are launching then you can use either the full ARN or name
of the secret. If the parameter exists in a different Region then
the full ARN must be specified. ’ (Amazon ECS API Reference)
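As a minimal sketch, a container definition fragment that passes a plaintext
environment variable and a secret sourced from the Systems Manager Parameter
Store could look like the following; the variable names, account ID, and
parameter ARN are illustrative assumptions.
"environment": [
    { "name": "APP_ENV", "value": "production" }
],
"secrets": [
    {
        "name": "DB_PASSWORD",
        "valueFrom": "arn:aws:ssm:us-east-1:111122223333:parameter/prod/db-password"
    }
]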

Container Network Settings Parameters


There are eight container network settings parameters:

disableNetworking – ‘true|false. When this parameter is true,
networking is disabled within the container. This parameter is
not supported for Windows containers or tasks using the awsvpc
network mode. ’ (Amazon ECS API Reference)
links - ‘The link parameter allows containers to communicate
with each other without the need for port mappings. Only
supported if the network mode of a task definition is set to
bridge. The name:internalName construct is analogous to
name:alias in Docker links. Up to 255 letters (uppercase and
lowercase), numbers, hyphens, and underscores are allowed.
This parameter is not supported for Windows containers or tasks
using the awsvpc network mode. Containers that are collocated
on the same compute instance may be able to communicate with
each other without requiring links or host port mappings. The
network isolation of a compute instance is controlled by security
groups and VPC settings. ’ (Amazon ECS API Reference)
hostname – ‘The hostname to use for your container. The
hostname parameter is not supported if you are using the awsvpc
network mode. ’ (Amazon ECS API Reference)
dnsServers – ‘A list of DNS servers that are presented to the
container. This parameter is not supported for Windows
containers or tasks using the awsvpc network mode. ’ (Amazon
ECS API Reference)
dnsSearchDomains – ‘A list of DNS search domains that are
presented to the container. This parameter is not supported for
Windows containers or tasks using the awsvpc network mode. ’
(Amazon ECS API Reference)
extraHosts – ‘A list of hostnames and IP address mappings to
append to the /etc/hosts file on the container. This parameter is
not supported for Windows containers or tasks that use the
awsvpc network mode. ’ (Amazon ECS API Reference)
hostname - Required when extraHosts are used. The hostname
to use in the /etc/hosts entry. ’ (Amazon ECS API Reference)
ipAddress – ‘Required when extraHosts are used. The IP address
to use in the /etc/hosts entry. ’ (Amazon ECS API Reference)

Container Storage Parameters


There are eight container storage parameters:

readonlyRootFilesystem – ‘true|false. When this parameter is
true, the container is given read-only access to its root file
system. This parameter is not supported for Windows
containers. ’ (Amazon ECS API Reference)
mountPoints – ‘The mount points for data volumes in your
container. Windows containers can mount whole directories on
the same drive as $env:ProgramData. Windows containers
cannot mount directories on a different drive, and mount point
cannot be across drives. ’ (Amazon ECS API Reference)
sourceVolume – ‘Required when mountPoints are used. The
name of the volume to mount. ’ (Amazon ECS API Reference)
containerPath – ‘Required when mountPoints are used. The
path on the container to mount the volume at. ’ (Amazon ECS
API Reference)
readOnly – ‘If this value is true, the container has read-only
access to the volume. If this value is false, then the container can
write to the volume. The default value is false. ’ (Amazon ECS
API Reference)
volumesFrom – ‘Data volumes to mount from another container.
’ (Amazon ECS API Reference)
sourceContainer – ‘Required when volumesFrom is used. The
name of the container to mount volumes from. ’ (Amazon ECS
API Reference)
readOnly – ‘If this value is true, the container has read-only
access to the volume. If this value is false, then the container can
write to the volume. The default value is false. ’ (Amazon ECS
API Reference)
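As a minimal sketch, one container can mount a data volume declared in the task
definition's volumes section and a second container can reuse those mounts
through volumesFrom; the container names, volume name, and path are illustrative
assumptions.
"containerDefinitions": [
    {
        "name": "producer",
        "mountPoints": [
            {
                "sourceVolume": "shared-data",
                "containerPath": "/data"
            }
        ]
    },
    {
        "name": "consumer",
        "volumesFrom": [
            {
                "sourceContainer": "producer",
                "readOnly": true
            }
        ]
    }
]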

Container Logging Parameters


There are six container logging parameters:

logConfiguration – ‘The log configuration specification for the
container. By default, containers use the same logging driver
that the Docker daemon uses; however the container may use a
different logging driver than the Docker daemon by specifying a
log driver with this parameter in the container definition. To use
a different logging driver for a container, the log system must be
configured properly on the compute instance (or on a different
log server for remote logging options). The following should be
noted when specifying a log configuration for your containers:

1. Amazon ECS currently supports a subset of the logging
drivers available to the Docker daemon (shown in the valid
values below). Additional log drivers may be available in
future releases of the Amazon ECS container agent.
2. For tasks using the EC2 launch type, the Amazon ECS
container agent running on a compute instance must register
the logging drivers available on that instance with the
ECS_AVAILABLE_LOGGING_DRIVERS environment
variable before containers placed on that instance can use
these log configuration options. For more information, see
Amazon ECS Container Agent Configuration.
3. For tasks using the Fargate launch type, because you do not
have access to the underlying infrastructure your tasks are
hosted on, any additional software needed will have to be
installed outside of the task. For example, the Fluentd output
aggregators or a remote host running Logstash to send Gelf
logs to. ’ (Amazon ECS API Reference)

logDriver – ‘Required when logConfiguration is used. The log
driver to use for the container. The valid values listed earlier are
log drivers that the Amazon ECS container agent can
communicate with by default. For tasks using the Fargate launch
type, the supported log drivers are awslogs, splunk, and
awsfirelens. For tasks using the EC2 launch type, the supported
log drivers are awslogs, fluentd, gelf, json-file, journald,
logentries,syslog, splunk, and awsfirelens. ’ (Amazon ECS API
Reference)
options – ‘The configuration options to send to the log driver. ’
(Amazon ECS API Reference)
secretOptions – ‘An object representing the secret to pass to the
log configuration. ’ (Amazon ECS API Reference)
name – ‘The value to set as the environment variable on the
container. ’ (Amazon ECS API Reference)
valueFrom - ‘The secret to expose to the log configuration of
the container. ’ (Amazon ECS API Reference)
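As a minimal sketch, a container that ships its logs to CloudWatch Logs with
the awslogs driver could be configured as follows; the log group name, Region,
and stream prefix are illustrative assumptions.
"logConfiguration": {
    "logDriver": "awslogs",
    "options": {
        "awslogs-group": "/ecs/web-app",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "web"
    }
}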

Container Security Parameters


There are three container security parameters:

privileged – ‘true|false. When this parameter is true, the
container is given elevated privileges on the host compute
instance (similar to the root user). This parameter is not
supported for Windows containers or tasks using the Fargate
launch type. ’ (Amazon ECS API Reference)
user – ‘The user name to use inside the container. You can use
the following formats: user | user:group | uid | uid:gid | user:gid |
uid:group. If specifying a UID or GID, you must specify it as a
positive integer. This parameter is not supported for Windows
containers. ’ (Amazon ECS API Reference)
dockerSecurityOptions – ‘A list of strings to provide custom
labels for SELinux and AppArmor multi-level security systems.
This field is not valid for containers in tasks using the Fargate
launch type. With Windows containers, this parameter can be
used to reference a credential spec file when configuring a
container for Active Directory authentication. The Amazon ECS
container agent running on a compute instance must register
with the ECS_SELINUX_CAPABLE=true or
ECS_APPARMOR_CAPABLE=true environment variables
before containers placed on that instance can use these security
options. ’ (Amazon ECS API Reference)

Container Resource Limits Parameters


There are four container resource limits parameters:

ulimits – ‘A list of ulimits to set in the container. Fargate tasks
use the default resource limit values with the exception of the
nofile resource limit parameter which Fargate overrides. The
nofile resource limit sets a restriction on the number of open
files that a container can use. The default nofile soft limit is
1024 and hard limit is 4096 for Fargate tasks. These limits can
be adjusted in a task definition if your tasks needs to handle a
larger number of files. This parameter is not supported for
Windows containers. ’ (Amazon ECS API Reference)
name – ‘The type of the ulimit. Valid values: "core" | "cpu" |
"data" | "fsize" | "locks" | "memlock" | "msgqueue" | "nice" |
"nofile" | "nproc" | "rss" | "rtprio" | "rttime" | "sigpending" |
"stack". Required when ulimits are used. ’ (Amazon ECS API
Reference)
hardLimit – ‘The hard limit for the ulimit type. Required when
ulimits are used. ’ (Amazon ECS API Reference)
softLimit – ‘The soft limit for the ulimit type. Required when
ulimits are used. ’ (Amazon ECS API Reference)
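As a minimal sketch, raising the nofile limit for a container that handles
many open files could be declared as follows; the limit values are illustrative
assumptions.
"ulimits": [
    {
        "name": "nofile",
        "softLimit": 8192,
        "hardLimit": 16384
    }
]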

Other Container Parameters


The other container parameters are either simple or complex. These other
container parameters are simple:

interactive – if set to true, you can deploy containerized
applications that require stdin or a tty to be allocated.
pseudoTerminal - if set to true, a TTY is allocated.

These other container parameters are complex:

Linux Parameters
Container Dependency
Container Timeouts
System Controls

Container Linux Parameters


There are 15 container Linux parameters:

linuxParameters – ‘Linux-specific options that are applied to the
container, such as KernelCapabilities. This parameter is not
supported for Windows containers. ’ (Amazon ECS API
Reference)
capabilities – ‘The Linux capabilities for the container that are
added to or dropped from the default configuration provided by
Docker.’ (Amazon ECS API Reference)
add – ‘The Linux capabilities for the container to add to the
default configuration provided by Docker. If you are using tasks
that use the Fargate launch type and platform version 1.4.0, the
only supported add parameter value is SYS_PTRACE. Valid
values: "ALL" | "AUDIT_CONTROL" | "AUDIT_READ" |
"AUDIT_WRITE" | "BLOCK_SUSPEND" | "CHOWN" |
"DAC_OVERRIDE" | "DAC_READ_SEARCH" | "FOWNER"
| "FSETID" | "IPC_LOCK" | "IPC_OWNER" | "KILL" |
"LEASE" | "LINUX_IMMUTABLE" | "MAC_ADMIN" |
"MAC_OVERRIDE" | "MKNOD" | "NET_ADMIN" |
"NET_BIND_SERVICE" | "NET_BROADCAST" |
"NET_RAW" | "SETFCAP" | "SETGID" | "SETPCAP" |
"SETUID" | "SYS_ADMIN" | "SYS_BOOT" |
"SYS_CHROOT" | "SYS_MODULE" | "SYS_NICE" |
"SYS_PACCT" | "SYS_PTRACE" | "SYS_RAWIO" |
"SYS_RESOURCE" | "SYS_TIME" | "SYS_TTY_CONFIG" |
"SYSLOG" | "WAKE_ALARM". ’ (Amazon ECS API
Reference)
drop – ‘The Linux capabilities for the container to remove from
the default configuration provided by Docker. Valid values:
"ALL" | "AUDIT_CONTROL" | "AUDIT_WRITE" |
"BLOCK_SUSPEND" | "CHOWN" | "DAC_OVERRIDE" |
"DAC_READ_SEARCH" | "FOWNER" | "FSETID" |
"IPC_LOCK" | "IPC_OWNER" | "KILL" | "LEASE" |
"LINUX_IMMUTABLE" | "MAC_ADMIN" |
"MAC_OVERRIDE" | "MKNOD" | "NET_ADMIN" |
"NET_BIND_SERVICE" | "NET_BROADCAST" |
"NET_RAW" | "SETFCAP" | "SETGID" | "SETPCAP" |
"SETUID" | "SYS_ADMIN" | "SYS_BOOT" |
"SYS_CHROOT" | "SYS_MODULE" | "SYS_NICE" |
"SYS_PACCT" | "SYS_PTRACE" | "SYS_RAWIO" |
"SYS_RESOURCE" | "SYS_TIME" | "SYS_TTY_CONFIG" |
"SYSLOG" | "WAKE_ALARM". ’ (Amazon ECS API
Reference)
devices – ‘Any host devices to expose to the container. This
parameter is not supported by the Fargate launch type. ’
(Amazon ECS API Reference)
hostPath – ‘The path for the device on the host compute
instance. ’ (Amazon ECS API Reference)
containerPath – ‘The path inside the container at which to
expose the host device. ’ (Amazon ECS API Reference)
permissions – ‘The explicit permissions to provide to the
container for the device. By default, the container has
permissions for read, write, and mknod on the device. Valid
Values: read | write | mknod. ’ (Amazon ECS API Reference)
initProcessEnabled – ‘Run an init process inside the container
that forwards signals and reaps processes.’ (Amazon ECS API
Reference)
maxSwap – ‘The total amount of swap memory (in MiB) a
container can use. If a maxSwap value of 0 is specified, the
container will not use swap. Accepted values are 0 or any
positive integer. If the maxSwap parameter is omitted, the
container will use the swap configuration for the compute
instance it is running on. A maxSwap value must be set for the
swappiness parameter to be used. This parameter is not
supported by the Fargate launch type. ’ (Amazon ECS API
Reference)
sharedMemorySize – ‘The value for the size (in MiB) of the
/dev/shm volume. This parameter is not supported by the
Fargate launch type. ’ (Amazon ECS API Reference)
swappiness – ‘This allows you to tune a container's memory
swappiness behavior. A swappiness value of 0 will cause
swapping to not happen unless absolutely necessary. A
swappiness value of 100 will cause pages to be swapped very
aggressively. Accepted values are whole numbers between 0
and 100. If the swappiness parameter is not specified, a default
value of 60 is used. If a value is not specified for maxSwap then
this parameter is ignored. This parameter is not supported by the
Fargate launch type’ (Amazon ECS API Reference)
tmpfs – ‘The container path, mount options, and size (in MiB) of
the tmpfs mount. This parameter is not supported by the Fargate
launch type. ’ (Amazon ECS API Reference)
containerPath – ‘The absolute file path where the tmpfs volume
is to be mounted. ’ (Amazon ECS API Reference)
mountOptions – ‘The list of tmpfs volume mount options. Valid
Values: "defaults" | "ro" | "rw" | "suid" | "nosuid" | "dev" |
"nodev" | "exec" | "noexec" | "sync" | "async" | "dirsync" |
"remount" | "mand" | "nomand" | "atime" | "noatime" | "diratime"
| "nodiratime" | "bind" | "rbind" | "unbindable" | "runbindable" |
"private" | "rprivate" | "shared" | "rshared" | "slave" | "rslave" |
"relatime" | "norelatime" | "strictatime" | "nostrictatime" |
"mode" | "uid" | "gid" | "nr_inodes" | "nr_blocks" | "mpol". ’
(Amazon ECS API Reference)
size – ‘The size (in MiB) of the tmpfs volume. ’ (Amazon ECS
API Reference)

Container Dependency Parameters


There are three container dependency parameters:

dependsOn – ‘The dependencies defined for container startup
and shutdown. A container can contain multiple dependencies.
When a dependency is defined for container startup, for
container shutdown it is reversed.’(Amazon ECS API
Reference)
containerName – ‘The container name that must meet the
specified condition. ’ (Amazon ECS API Reference)
condition – ‘The dependency condition of the container. The
following are the available conditions and their behavior:
START – This condition emulates the behavior of
links and volumes today. It validates that a dependent
container is started before permitting other containers
to start.
COMPLETE – This condition validates that a
dependent container runs to completion (exits) before
permitting other containers to start. This can be useful
for nonessential containers that run a script and then
exit.
SUCCESS – This condition is the same as
COMPLETE, but it also requires that the container
exits with a zero status.
HEALTHY – This condition validates that the
dependent container passes its Docker healthcheck
before permitting other containers to start. This
requires that the dependent container has health
checks configured. This condition is confirmed only at
task startup. ’ (Amazon ECS API Reference)
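As a minimal sketch, an application container can be prevented from starting
until a companion container passes its health check; the container names and
the health check command are illustrative assumptions.
"containerDefinitions": [
    {
        "name": "app",
        "dependsOn": [
            {
                "containerName": "cache",
                "condition": "HEALTHY"
            }
        ]
    },
    {
        "name": "cache",
        "healthCheck": {
            "command": [ "CMD-SHELL", "redis-cli ping || exit 1" ],
            "interval": 30,
            "timeout": 5,
            "retries": 3
        }
    }
]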

Container Timeouts Parameters


There are two container timeouts parameters:

startTimeout – ‘Time duration (in seconds) to wait before giving
up on resolving dependencies for a container. For example, you
specify two containers in a task definition with containerA
having a dependency on containerB reaching a COMPLETE,
SUCCESS, or HEALTHY status. If a startTimeout value is
specified for containerB and it does not reach the desired status
within that time then containerA will give up and not start. This
results in the task transitioning to a STOPPED state. If this
parameter is not specified, the default value of 3 minutes is
used. For tasks using the EC2 launch type, if the startTimeout
parameter is not specified, the value set for the Amazon ECS
container agent configuration variable
ECS_CONTAINER_START_TIMEOUT is used by default. If
neither the startTimeout parameter or the
ECS_CONTAINER_START_TIMEOUT agent configuration
variable are set, then the default values of 3 minutes for Linux
containers and 8 minutes on Windows containers are used.’
(Amazon ECS API Reference)
stopTimeout – ‘Time duration (in seconds) to wait before the
container is forcefully killed if it does not exit normally on its
own. The max stop timeout value is 120 seconds and if the
parameter is not specified, the default value of 30 seconds is
used. For tasks using the EC2 launch type, if the stopTimeout
parameter is not specified, the value set for the Amazon ECS
container agent configuration variable
ECS_CONTAINER_STOP_TIMEOUT is used by default. If
neither the stopTimeout parameter or the
ECS_CONTAINER_STOP_TIMEOUT agent configuration
variable are set, then the default values of 30 seconds for Linux
containers and 30 seconds on Windows containers are used.
Compute instances require at least version 1.26.0 of the
container agent to enable a container stop timeout value.
However, we recommend using the latest container agent
version.’ (Amazon ECS API Reference)
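As a minimal sketch, the startTimeout is set on the container being depended
upon and the stopTimeout on any container whose shutdown should be bounded; the
container names and values are illustrative assumptions.
"containerDefinitions": [
    {
        "name": "config-loader",
        "essential": false,
        "startTimeout": 120,
        "stopTimeout": 60
    },
    {
        "name": "app",
        "dependsOn": [
            {
                "containerName": "config-loader",
                "condition": "SUCCESS"
            }
        ]
    }
]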

Container System Controls Parameters


There are three container system controls parameters:

systemControls – ‘A list of namespaced kernel parameters to set in
the container. It is not recommended that you specify network-
related systemControls parameters for multiple containers in a
single task that also uses either the awsvpc or host network mode
for the following reasons:
For tasks that use the awsvpc network mode, if you set
systemControls for any container it will apply to all
containers in the task. If you set different
systemControls for multiple containers in a single task,
the container that is started last will determine which
systemControls take effect.
For tasks that use the host network mode, the network
namespace systemControls are not supported.

If you are setting an IPC resource namespace to use for the containers
in the task, the following will apply to your system controls. For tasks
that use the host IPC mode, IPC namespace systemControls are not
supported. For tasks that use the task IPC mode, IPC namespace
systemControls values will apply to all containers within a task. This
parameter is not supported for Windows containers or tasks using the
Fargate launch type. ’ (Amazon ECS API Reference)

namespace – ‘The namespaced kernel parameter to set a value for.


Valid IPC namespace values: "kernel.msgmax" | "kernel.msgmnb"
| "kernel.msgmni" | "kernel.sem" | "kernel.shmall" |
"kernel.shmmax" | "kernel.shmmni" | "kernel.shm_rmid_forced",
as well as Sysctls beginning with "fs.mqueue.*" Valid network
namespace values: Sysctls beginning with "net.*" . ’ (Amazon
ECS API Reference)
value - The value for the namespaced kernel parameter specified
in namespace. ’ (Amazon ECS API Reference)

Container Placement Constraints


When you register a task definition, you can provide task placement
constraints that customize how Amazon ECS places the task and its
containers.
If you are using the Fargate launch type, task placement constraints are
not supported. By default Fargate tasks are spread across Availability Zones.
For tasks that use the EC2 launch type, you can use constraints to place
tasks based on Availability Zone, instance type, or custom attributes.
The following placement constraint parameters are allowed in a task definition:

Expression - A cluster query language expression to apply to the constraint.
Type - The type of constraint. Use memberOf to restrict the
selection to a group of valid candidates.
Launch Type - When you register a task definition, you specify
the launch type to use for your task.
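As a sketch, a task definition registered for the EC2 launch type might restrict placement with a memberOf constraint and a cluster query language expression such as the following (the instance-type pattern is only an example):

    "placementConstraints": [
      {
        "type": "memberOf",
        "expression": "attribute:ecs.instance-type =~ t3.*"
      }
    ]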

ECS Docker Container at Runtime


At runtime, these additional ECS container properties can be observed in the ECS console:

taskArn – the ARN of the task to which the container belongs.
runtimeId – the ID of the container.
containerArn - the Amazon Resource Name (ARN) of the container.
name – the name of the container.
image – the name of the Docker image used by the container.
imageDigest – the container’s image manifest digest.
cpu - the number of CPU units allocated for the container.
gpuIds – the IDs of the GPUs assigned to the container.
memory – the hard limit (in MiB) of memory assigned for the
container.
memoryReservation – the soft limit (in MiB) of memory
assigned for the container.
networkBindings – the network bindings associated with the
container.
networkInterfaces – the network interfaces associated with the
container.
reason – additional information about a stopped or a running
container.
exitCode – the exit code returned by the container.
healthStatus – the health status (HEALTHY | UNHEALTHY |
UNKNOWN) of the container.
lastStatus – the last known status of the container.
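These runtime properties can also be retrieved programmatically. For example, assuming a cluster named my-cluster and a known task ID (both placeholders), the DescribeTasks API returns the container details listed above:

    aws ecs describe-tasks \
        --cluster my-cluster \
        --tasks 0123456789abcdef0123456789abcdef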

Running Deep Learning Containers on ECS


AWS Deep Learning Containers are a set of Docker images for training and serving models in frameworks such as TensorFlow and Apache MXNet on Amazon ECS and Amazon EKS. Deep
Learning Containers provide optimized environments with TensorFlow,
Nvidia CUDA (for GPU instances), and Intel MKL (for CPU instances)
libraries and are available in Amazon ECR. Fargate does not support AWS
Deep Learning Containers.
ECS must be set up to use AWS Deep Learning Containers:

Attach the Amazon CloudWatch Logs IAM policy to the IAM role associated with the EC2 compute instance, which allows ECS to send logs to CloudWatch.
Create a security group or update existing security group so that
ports are open for the inference server, for:
MXNet inference, ports 80 and 8081 open to TCP
traffic.
TensorFlow inference, ports 8501 and 8500 open to TCP
traffic.

ECS Windows Misses


Docker's architecture is built on features of the Linux operating system, and Amazon ECS was likewise designed around those Linux features. Consequently, Windows-based containers cannot exploit the Linux-based features of Amazon ECS. If you are running Windows-based containers, the Amazon ECS-optimized Windows AMIs are the best bet for success with ECS.
Here is the list of the gaps between ECS features and Windows-based
containers:

Are not compatible with Fargate instances;


Cannot use the awsvpc, bridge, or host network modes;
Can only support the use of the local driver, therefore only bind
mounts and local volumes are supported.
Task level CPU and memory parameters are ignored for
Windows containers, but container-level CPU and memory
resources are supported by Windows containers;
The CPU limit is enforced as an absolute limit, or a quota.
Windows containers only have access to the specified amount of
CPU that is described in the task definition;
The ipcMode parameter is not supported;
The pidMode parameter is not supported;
The proxyConfiguration parameter cannot be used;
Port mappings on Windows use the NetNAT gateway address
rather than localhost;
There is no loopback for port mappings on Windows, so you
cannot access a container's mapped port from the host itself;
The gpu container parameter is not supported;
The disableNetworking container parameter is not supported;
The links container parameter is not supported;
The dnsServers container parameter is not supported;
The dnsSearchDomains container parameter is not supported;
The extraHosts container parameter is not supported;
The readonlyRootFilesystem container parameter is not
supported;
The privileged container parameter is not supported;
The user container parameter is not supported;
The ulimits container parameter is not supported;
The linuxParameters (devices; initProcessEnabled; maxSwap;
sharedMemorySize; swappiness; tmpfs) container parameter is
not supported;
The systemControls container parameter is not supported;

ECS Monitoring
AWS provides the following automated monitoring tools to observe your ECS solution and report when something goes wrong:

AWS CloudTrail log monitoring;


Amazon CloudWatch Logs;
Amazon CloudWatch Alarms;
Amazon CloudWatch Events;

ECS API and Amazon CloudTrail


CloudTrail is the de facto AWS service for logging and monitoring all ECS API calls (AWS Management Console activities, CLI commands and arguments, and SDK calls with their input parameters) made by a given AWS account: which actions were performed, when they were performed, and from where.
Best practice – ensure that CloudTrail logging is enabled on each AWS
Account.
For a given AWS Account, CloudTrail enables:

Governance,
Compliance,
Operational Auditing, and
Risk Auditing

CloudTrail can be used to support external compliance audits by providing evidence of how data is transmitted, processed, and stored in AWS. CloudTrail helps track changes to AWS resources, supports compliance with rules, and aids in troubleshooting operational problems.
CloudTrail typically delivers log files within 15 minutes of an API call and publishes new log files about every 5 minutes. To validate CloudTrail log files,
AWS uses SHA-256 for hashing and SHA-256 with RSA for digital signing
of the file. The CloudTrail console displays the event history of a Region,
over the past 90 days.
The CloudTrail logs can be saved to S3, can be sent to CloudWatch and to
CloudWatch Events. AWS Account trail log files delivered to S3 are
encrypted by S3, using SSE-S3 keys. You can choose to use SSE-KMS
encryption if needed. You can configure an SNS notification when a
CloudTrail log file is written into an S3 bucket.
For a given Region, you can send account trails to a central account, from which analysis across accounts is supported. The central account controls access to the logs as well as their distribution. You can create up to 5 trails per Region, and you can create a trail that applies to all Regions, which records the log files for each Region.
Best practice – configure an SNS notification when a CloudTrail log file
is written into an S3 bucket.
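For example, the recent ECS API activity recorded by CloudTrail can be listed from the AWS CLI by filtering the event history on the ECS event source (a minimal sketch):

    aws cloudtrail lookup-events \
        --lookup-attributes AttributeKey=EventSource,AttributeValue=ecs.amazonaws.com \
        --max-results 25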

ECS and Amazon CloudWatch


Amazon CloudWatch is a service that monitors your EC2 instances; it collects and processes raw data into readable, near real-time metrics. CloudWatch is a distributed system that integrates with AWS IAM and collects and manages metrics as well as your existing system, application, and custom log files. CloudWatch gives you visibility into resource use and application performance, lets you monitor operational health, create rules and alarms, and troubleshoot operational problems. CloudWatch can examine logs for specific phrases, values, or patterns.
CloudWatch provides more than 100 types of metrics, and you can create your own metrics. With CloudWatch's free tier, metrics are updated every 5 minutes; if a shorter interval is needed, you will have to pay extra.
EC2 instances publish their CPU utilization, data transfer, and disk usage metrics to CloudWatch. EC2 instances do not publish memory utilization.
AWS has no visibility into:

events that are visible to the operating system of the EC2 instance;
your system/application-specific events.

CloudWatch Logs
CloudWatch Logs is used to monitor and troubleshoot systems and applications by using your existing system, application, and custom log files. CloudWatch Logs is a managed service that collects and retains your logs, and it can aggregate and centralize logs across multiple sources. Log data can be exported to S3 or Glacier for long-term archival. By default, CloudWatch Logs retains log data indefinitely, but you can configure a retention period for each log group.
Best practice – consider a use case where, due to an application bug, the EC2 instance must be rebooted. Ensure the bug event is captured in the application log. CloudWatch Logs can be used to monitor that application log for the error keywords, raise an alarm, and then restart the EC2 instance.

How to Publish Logs to CloudWatch


Logs can be published to CloudWatch using these means:

CloudWatch Logs Agent;


AWS CLI;
CloudWatch Logs SDK;
CloudWatch API – allows programs/scripts to PUT metrics (a
name-value pair) into CloudWatch, which can then create events
and trigger alarms.
You can also export log data to S3 for archiving or analytics, or stream the logs to Amazon Elasticsearch Service, Splunk, or other third-party tools.
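As a minimal AWS CLI sketch (the log group, stream, timestamp, and message are placeholders), a log event can be published with the CloudWatch Logs commands:

    aws logs create-log-group --log-group-name my-app-logs
    aws logs create-log-stream --log-group-name my-app-logs --log-stream-name instance-1
    aws logs put-log-events \
        --log-group-name my-app-logs \
        --log-stream-name instance-1 \
        --log-events timestamp=1583902800000,message="application started"

Note that subsequent put-log-events calls to the same stream require the sequence token returned by the previous call.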

Monitoring and Metrics


CloudWatch monitoring and metrics are available for:
1 Auto Scaling;
2 Elastic Load Balancing;
3 Amazon SNS;
4 Amazon EC2;
5 Amazon ElastiCache;
6 Amazon Elastic Block Store (EBS);
7 Amazon DynamoDB;
8 Amazon Elastic MapReduce (EMR);
9 Amazon EC2 Container Service (ECS);
10 Amazon Elasticsearch Service;
11 Amazon Kinesis Streams;
12 Amazon Kinesis Firehose;
13 AWS Lambda;
14 Amazon Machine Learning;
15 AWS OpsWorks;
16 Amazon Redshift;
17 Amazon Relational Database Service (RDS);
18 Amazon Route 53;
19 Amazon Simple Queue Service (SQS);
20 Amazon S3;
21 Amazon Simple Workflow Service (SWF);
22 Amazon Storage Gateway;
23 Amazon WorkSpaces;
24 Amazon CloudFront;
25 Amazon CloudSearch;

CloudWatch Custom Metric


CloudWatch can measure CPU usage, disk I/O, and network operations. But CloudWatch cannot measure everything. There is one particularly important thing that needs to be measured but which CloudWatch does not measure out of the box: EC2 instance memory usage.
To measure EC2 instance memory usage you need to add a custom metric in CloudWatch. The AWS-provided monitoring scripts for these custom metrics are only available for Linux AMIs. A Perl script is used to monitor and capture the custom metrics (memory usage, swap usage, disk use); the scripts run on the EC2 instance itself and push the metric values to CloudWatch. Standard CloudWatch usage charges apply to the custom metrics published by these scripts.
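Whatever mechanism collects the value, the metric itself is published with a PutMetricData call. A minimal AWS CLI sketch (the namespace, value, and instance ID are placeholders):

    aws cloudwatch put-metric-data \
        --namespace "Custom/System" \
        --metric-name MemoryUtilization \
        --unit Percent \
        --value 62.5 \
        --dimensions InstanceId=i-0123456789abcdef0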

CloudWatch Events
CloudWatch Events is aware of operational changes in an AWS cloud environment as they occur and delivers a near real-time stream of system events that describe changes to AWS resources in the environment. CloudWatch Events uses simple rules that match events in the stream and routes matched events to one or more target functions or streams, which take corrective action as needed. CloudWatch Events rules can also be scheduled, triggering automated actions at set times using cron or rate expressions (here cron refers to the scheduling expression syntax, not the operating system tool).
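For example, a scheduled rule that fires every night at 02:00 UTC can be created from the AWS CLI as follows (the rule name is a placeholder; a target must still be attached with put-targets):

    aws events put-rule \
        --name nightly-ecs-task \
        --schedule-expression "cron(0 2 * * ? *)"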

CloudWatch Alarms
A CloudWatch Alarm monitors a single metric over a period of time that you specify and performs one or more actions based on the value of the metric relative to a given threshold over a number of time periods. For example, an alarm action can reboot an EC2 instance, or scale in and scale out the instances in an ASG.
An Alarm has these states:

OK;
ALARM;
INSUFFICIENT_DATA.

You can set up rules so that an action is taken whenever a change is detected. For example, a CloudWatch alarm action can be used to automatically stop, terminate, reboot, or recover an EC2 instance. Stop and terminate can be used to optimize cost savings when an instance does not need to run; reboot and recover actions are used when an instance is impaired.
When the alarm changes state, the action – for example an SNS notification – is sent immediately to a target you choose:

A Lambda function;
An Auto Scaling Policy;
An SQS queue;
An SNS topic;
A Kinesis Stream;
A built-in target.

Each AWS account is limited to 5,000 alarms per Region.

2 Levels of Monitoring
Basic Monitoring

No charge. Sends data points to CloudWatch every 5 minutes for a limited number of preselected metrics.
Basic monitoring only monitors and captures hypervisor-level metrics. It does not monitor operating system metrics, such as memory use or disk space use.

Detailed Monitoring

For a charge, sends data points to CloudWatch every 1 minute for each metric.
For an additional charge, supports aggregation of metrics across
AZs in a given Region.
You obtain metrics through the CloudWatch API by performing
an HTTP GET request.

CloudWatch Logs Agent


Use CloudWatch Logs Agent to stream your existing system, application
and custom, log files from EC2 instances, AWS CloudTrail and other AWS
services.
The CloudWatch Logs Agent is available on Linux and Windows EC2
instance types.
Use the Amazon CloudWatch Logs Agent Installer to install the Agent on
the EC2 instance and to configure that Agent.
View CloudWatch Graphs and Statistics
Use CloudWatch Dashboards to view different types of graphs and statistics for the metrics you collect. For example, in a graph you can view CPU use and see its trend over time.
During operational events, a Dashboard can act as a playbook that provides guidance about how to respond to specific incidents.

ECS on AWS Outposts


AWS Outposts is a fully managed service that extends AWS infrastructure, AWS services, APIs, and tools into the consumer's private data center. Outposts offers a mix of EC2 compute instance and EBS capacity designed to meet a variety of application needs. There is a use case suited to Amazon ECS that benefits from the AWS Outposts service:

Hybrid Applications – containerized applications run in AWS as well as in the consumer's private data center.

ECS on Outposts provides highly scalable, high-performance Docker container orchestration for workloads that require the low latencies available with on-premises systems.

ECS DevOps Toolset


Within the Amazon Management Console, AWS has provided the ECS Console, which can be used to set up and manage all operations manually with Amazon ECS. In addition to the AWS DevOps toolset and the ECS Console, AWS provides:

ecs-cli – the ECS command line interface alternative to the ECS console. Provides high-level commands for creating, updating, and monitoring Clusters, Services, Tasks, and Containers from a local development environment.
Docker Client - a command line tool used to build and manage Docker images and containers. When commands are entered into the Docker command line interface, the Docker client converts the commands into the appropriate API payload and POSTs them to the correct REST API endpoint implemented in the daemon.

It is best practice to use the Docker Client as a non-root user (i.e., a login
account of the operating system running on the EC2 launch type). Add the
non-root user to the ‘docker’ group (in the operating system).
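On an Amazon Linux based EC2 launch type, for example, the default login account can be added to the docker group as follows (log out and back in for the group change to take effect):

    sudo usermod -aG docker ec2-user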

Container Agent Configuration


In closing, the Amazon ECS Container Agent supports a number of
configuration options, most of which can be set through environment
variables. If the container agent's instance was launched on a Linux-based
Amazon ECS-optimized AMI, these environment variables can be set in the
/etc/ecs/ecs.config file – a restart of the container agent is required. You can
also write these configuration variables to your compute instances with EC2
user data at launch time.
The following environment variables are available, and all of them are
optional. Only after you dive deeply into the ECS platform will these
environment variables make sense:

ECS_CLUSTER – the cluster that this agent should check into. If this value is undefined, then the default cluster is assumed. If the default cluster does not exist, the ECS container agent attempts to create it. If a non-default cluster is specified and it does not exist, then registration fails.
ECS_RESERVED_PORTS - an array of ports that should be
marked as unavailable for scheduling on this compute instance
(a.k.a., compute instance).
ECS_RESERVED_PORTS_UDP - an array of UDP ports that
should be marked as unavailable for scheduling on this compute
instance.
ECS_ENGINE_AUTH_TYPE – required for private registry
authentication. This is the type of authentication data in
ECS_ENGINE_AUTH_DATA.
ECS_ENGINE_AUTH_DATA – required for private registry
authentication.
AWS_DEFAULT_REGION – the region to be used in API
requests as well as to infer the correct backend host.
AWS_ACCESS_KEY_ID – the access key used by the agent for
all calls.
AWS_SECRET_ACCESS_KEY – the key used by the agent for
all calls.
AWS_SESSION_TOKEN – the session token used for temporary
credentials.
DOCKER_HOST – used to create a connection to the Docker
daemon; behaves similarly to the environment variable as used by
the Docker client.
ECS_LOGFILE – determines the location where agent logs should
be written.
ECS_LOGLEVEL – the level to log at on stdout.
ECS_CHECKPOINT – set true or false, whether to save the
checkpoint state to the location specified with ECS_DATADIR.
ECS_DATADIR – the name of the persistent data directory on the
compute instance that is running the ECS container agent. The
directory is used to save information about the cluster and the
agent state.
ECS_UPDATES_ENABLED – set true or false, whether to exit
for ECS agent updates when they are requested.
ECS_DISABLE_METRICS - set true or false, whether to disable
CloudWatch metrics for ECS. If this value is set to true,
CloudWatch metrics are not collected.
ECS_POLL_METRICS - set true or false, whether to poll or
stream when gathering CloudWatch metrics for tasks.
ECS_POLLING_METRICS_WAIT_DURATION – time to wait
to poll for new CloudWatch metrics for a task. Only used when
ECS_POLL_METRICS is true.
ECS_RESERVED_MEMORY – the amount of memory, in MiB, to remove from the pool that is allocated to your tasks. This effectively reserves that memory for critical system processes, including the Docker daemon and the ECS container agent.
ECS_AVAILABLE_LOGGING_DRIVERS – the logging drivers
available on the compute instance. The Amazon ECS container
agent running on a compute instance must register the logging
drivers available on that instance with the
ECS_AVAILABLE_LOGGING_DRIVERS environment variable
before containers placed on that instance can use log configuration
options for those drivers in tasks.
ECS_DISABLE_PRIVILEGED – set true or false, whether
launching privileged containers is disabled on the compute
instance. If this value is set to true, privileged containers are not
permitted.
ECS_SELINUX_CAPABLE – set true or false, whether SELinux
is available on the compute instance.
ECS_APPARMOR_CAPABLE – set true or false, whether
AppArmor is available on the compute instance.
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION - time to
wait from when a task is stopped until the Docker container is
removed. As this removes the Docker container data, be aware that
if this value is set too low, you may not be able to inspect your
stopped containers or view the logs before they are removed. The
minimum duration is 1m; any value shorter than 1 minute is
ignored.
ECS_CONTAINER_STOP_TIMEOUT – time to wait from when
a task is stopped before its containers are forcefully stopped if they
do not exit normally on their own.
ECS_CONTAINER_START_TIMEOUT – time to wait before
giving up on starting a container.
HTTP_PROXY – the hostname (or IP address) and port number of
an HTTP proxy to use for the ECS agent to connect to the internet.
NO_PROXY – the HTTP traffic that should not be forwarded to
the specified HTTP_PROXY.
ECS_ENABLE_TASK_IAM_ROLE – set true or false, whether
IAM roles for tasks should be enabled on the compute instance for
task containers with the bridge or default network modes. For
more information, see IAM Roles for Tasks.
ECS_ENABLE_TASK_IAM_ROLE_NETWORK_HOST – set
true or false, whether IAM roles for tasks should be enabled on the
compute instance for task containers with the host network mode.
ECS_DISABLE_IMAGE_CLEANUP – set true or false, whether to disable automated image cleanup for the Amazon ECS agent.
ECS_IMAGE_CLEANUP_INTERVAL – the time interval
between automated image cleanup cycles. If set to less than 10
minutes, the value is ignored.
ECS_IMAGE_MINIMUM_CLEANUP_AGE – the time interval
between when an image is pulled and when it can be considered
for automated image cleanup.
NON_ECS_IMAGE_MINIMUM_CLEANUP_AGE – the time
interval between when a non-Amazon ECS image is created and
when it can be considered for automated image cleanup.
ECS_NUM_IMAGES_DELETE_PER_CYCLE – the maximum
number of images to delete in a single automated image cleanup
cycle. If set to less than 1, the value is ignored.
ECS_IMAGE_PULL_BEHAVIOR – the behavior used to
customize the pull image process for your compute instances. The
following describes the optional behaviors:
If default is specified, the image is pulled remotely. If
the image pull fails, then the container uses the cached
image on the instance.
If always is specified, the image is always pulled
remotely. If the image pull fails, then the task fails. This
option ensures that the latest version of the image is
always pulled. Any cached images are ignored and are
subject to the automated image cleanup process.
If once is specified, the image is pulled remotely only if
it has not been pulled by a previous task on the same
compute instance or if the cached image was removed
by the automated image cleanup process. Otherwise, the
cached image on the instance is used. This ensures that
no unnecessary image pulls are attempted.
If prefer-cached is specified, the image is pulled
remotely if there is no cached image. Otherwise, the
cached image on the instance is used. Automated image
cleanup is disabled for the container to ensure that the
cached image is not removed.
ECS_IMAGE_PULL_INACTIVITY_TIMEOUT – the time to
wait after docker pulls complete waiting for extraction of a
container. Useful for tuning large Windows containers.
ECS_INSTANCE_ATTRIBUTES – a list of custom attributes, in
JSON format, to apply to your compute instances.
ECS_ENABLE_TASK_ENI – set true or false, whether to enable
task networking for tasks to be launched with their own network
interface.
ECS_CNI_PLUGINS_PATH - the path where the CNI binary file
is located.
ECS_AWSVPC_BLOCK_IMDS – set true or false, whether to
block access to Instance Metadata for tasks started with awsvpc
network mode.
ECS_AWSVPC_ADDITIONAL_LOCAL_ROUTES – in awsvpc
network mode, traffic to these prefixes is routed via the host
bridge instead of the task elastic network interface.
ECS_ENABLE_CONTAINER_METADATA – when true, the
agent creates a file describing the container's metadata. The file
can be located and consumed by using the container environment
variable $ECS_CONTAINER_METADATA_FILE.
ECS_HOST_DATA_DIR – the source directory on the host from
which ECS_DATADIR is mounted. ECS uses this to determine
the source mount path for container metadata files when the
Amazon ECS agent is running as a container. ECS does not use
this value in Windows because the ECS agent does not run as a
container.
ECS_ENABLE_TASK_CPU_MEM_LIMIT - set true or false,
whether to enable Task level CPU and memory limits.
ECS_CGROUP_PATH - the root cgroup path that is expected by
the ECS agent. This is the path that is accessible from the agent
mount.
ECS_ENABLE_CPU_UNBOUNDED_WINDOWS_WORKAROUND
- when true, ECS allows CPU-unbounded (CPU=0) tasks to run
along with CPU-bounded tasks in Windows.
ECS_TASK_METADATA_RPS_LIMIT - comma-separated
integer values for steady state and burst throttle limits for the task
metadata endpoint.
ECS_SHARED_VOLUME_MATCH_FULL_CONFIG – when
dockerVolumeConfiguration is specified in a task definition and
the autoprovision flag is used, the ECS container agent compares
the details of the Docker volume with the details of existing
Docker volumes.
When ECS_SHARED_VOLUME_MATCH_FULL_CONFIG is true,
the container agent compares the full configuration of the volume
(name, driverOpts, and labels) to verify that the volumes are
identical. When it is false, the container agent uses Docker's
default behavior, which verifies the volume name only. If a
volume is shared across compute instances, this should be set to
false.
ECS_CONTAINER_INSTANCE_PROPAGATE_TAGS_FROM
– if ec2_instance is specified, existing tags defined on the compute
instance are registered to ECS. The tags are discoverable using the
ListTagsForResource operation. The IAM role associated with the
compute instance should have the ec2:DescribeTags action
allowed.
ECS_CONTAINER_INSTANCE_TAGS – metadata applied to
compute instances to help you categorize and organize your
resources. Each tag consists of a custom-defined key and an
optional value. Tag keys can have a maximum character length of
128 characters. Tag values can have a maximum length of 256
characters. If compute instance tags are propagated using the
ECS_CONTAINER_INSTANCE_PROPAGATE_TAGS_FROM
parameter, those tags are overwritten by the tags specified using
ECS_CONTAINER_INSTANCE_TAGS.
ECS_ENABLE_UNTRACKED_IMAGE_CLEANUP – set true or
false, whether to allow the Amazon ECS agent to delete containers
and images that are not part of ECS tasks.
ECS_EXCLUDE_UNTRACKED_IMAGE - comma separated list
of images (imageName:tag) that should not be deleted by the
Amazon ECS agent if
ECS_ENABLE_UNTRACKED_IMAGE_CLEANUP is true.
ECS_DISABLE_DOCKER_HEALTH_CHECK - set true or false,
whether to disable the Docker container health check for the ECS
agent.
ECS_NVIDIA_RUNTIME – the runtime to be used to pass
NVIDIA GPU devices to containers. This parameter should not be
specified as an environment variable in a task definition if the
GPU resource requirements are already specified.
ECS_ENABLE_SPOT_INSTANCE_DRAINING – set true or
false, whether to enable Spot Instance draining for the compute
instance. When true, if the compute instance receives a Spot
interruption notice, then the agent sets the instance status to
DRAINING, which gracefully shuts down and replaces all tasks
running on the instance that are part of a service. It is
recommended that this be set to true when using Spot Instances.
ECS_LOG_ROLLOVER_TYPE – determines whether the
container agent log file will be rotated hourly or based on size. By
default, the agent log file is rotated each hour.
ECS_LOG_OUTPUT_FORMAT – determines the log output
format. When the JSON format is used, each line in the log will be
a structured JSON map.
ECS_LOG_MAX_FILE_SIZE_MB – when the
ECS_LOG_ROLLOVER_TYPE variable is set to size this
variable determines the maximum size (in MB) of the log file
before it is rotated. If the rollover type is set to hourly, then this
variable is ignored.
ECS_LOG_MAX_ROLL_COUNT – determines the number of
rotated log files to keep. Older log files are deleted once this limit
is reached.
If you are manually starting the ECS container agent, these environment variables can be used with the docker run command that starts the agent. For sensitive information, such as authentication credentials for private repositories, best practice is to store the container agent environment variables in a file and pass them all at one time with the --env-file path_to_env_file option of the docker run command. AWS recommends that the ecs.config file be stored in a private Amazon S3 bucket and that read-only access be granted to the container instance's IAM role.
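To tie this together, a minimal ecs.config might look like the sketch below, and a common pattern for applying a private copy of it is a short EC2 user data script that pulls the file from S3 at launch (the cluster name and bucket are placeholders):

    # /etc/ecs/ecs.config
    ECS_CLUSTER=my-cluster
    ECS_ENABLE_SPOT_INSTANCE_DRAINING=true
    ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
    ECS_RESERVED_MEMORY=256

    # EC2 user data
    #!/bin/bash
    yum install -y aws-cli
    aws s3 cp s3://my-config-bucket/ecs.config /etc/ecs/ecs.config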
AMAZON ELASTIC KUBERNETES SERVICE (EKS)

Introduction
Kubernetes is an open-source technology that manages (a.k.a. orchestrates) and provisions containerized applications that can scale, fail fast, and self-heal, as well as be updated in place while running. It is common
practice to use Docker to develop containerized applications and use
Kubernetes to orchestrate those containers in cloud, on-premise, as well as
hybrid, environments.
Amazon Elastic Kubernetes Service (Amazon EKS) is a Region-based
fully managed Kubernetes service. EKS stands up and maintains the
Kubernetes control plane which is accessible via the Amazon Management
Console as well as the eksctl and kubectl command line utilities.
The EKS control plane takes care of deploying containers and keeping
them running. EKS provides a scalable and highly-available control plane
that runs across multiple Availability Zones (AZs) to eliminate a single point
of failure. The EKS consumer is responsible for defining the underlying VPC
and security groups, the components of Kubernetes’ data plane, the Docker
containers, and the IAM roles and policies that secure the overall EKS
solution.
EKS is seamlessly integrated with CloudWatch, Auto Scaling Groups,
IAM, VPC, as well as AWS App Mesh. These integrations provide
monitoring, autoscaling, security, networking, load balancing, and service
discovery.
EKS runs upstream Kubernetes and is certified Kubernetes conformant so
you get all benefits of open source tooling from the community. At present,
EKS supports Kubernetes versions 1.12, 1.13, 1.14, and 1.15.

Purpose and Scope


The information the consumer needs to understand to DevOps EKS is
scattered over dozens of AWS User, Developer, and API guidebooks, and
dozens of EKS evangelist blog postings. The purpose of this chapter is to
gather together those facts, to organize them, and disclose details about how
to configure EKS and integrate EKS with other beneficial AWS resources
and services.
However, code examples of how to DevOps the EKS infrastructure and its integration with the AWS platform and services using the suite of AWS DevOps tools are not provided. Countless code examples of using AWS DevOps tools
with EKS (as well as on-line tutorials) are well documented in numerous
AWS guidebooks and web pages. A complete list of the AWS User,
Developer, and API guidebooks is provided in the ‘Reference Materials’
appendix of this manuscript.
It is recommended that the reader collect CloudFormation templates that
are used to provision VPCs, EKS clusters, Pods, compute instances, etc.
The scope of this exploration of EKS is limited to the AWS platform stack
and EKS best practices as of Kubernetes 1.15. In that there is a substantial
amount of material to cover just for the current EKS, this manuscript does not
cover legacy EKS issues. This exploration of EKS is limited to the relevant
facts and best practices as of March 2020. Best practices for handing legacy
EKS issues are covered in the AWS Developer and User guides.
Though EKS is a fully managed Kubernetes service, the 'fully managed' benefit only applies to the Kubernetes control plane. In this context, fully managed does not eliminate the significant level of effort demanded by the data plane and by securing network traffic. You must still:

Create and configure the VPCs that the Kubernetes clusters use, and
Configure the EC2 instances that host Kubernetes Pods.

The difficulties and challenges inherent to VPCs and EC2 instances do not magically disappear with EKS.

Amazon EKS Use Cases


At the time of this writing, EKS is available in 19 Regions. Use cases that
gain the benefits of Amazon EKS are not simple problem spaces; they range from complex to extraordinarily complex:

Containerized Application Migration to AWS – the application that is not a monolith or big ball of mud but is containerized prior to initiating migration to the AWS cloud;
Batch ETL Processing – a workflow of containerized ETL
processes, executed proximate to their data sources and sinks;
Hybrid Containerized Applications – containerized applications
running in AWS as well as in the consumer’s private data
center;
Machine Learning – training and inference processes, executed
proximate to their data sources and sinks;
Microservices – distributed containerized applications
composed of a set of services that maximize cohesion and
minimize coupling. Typically HTTP/HTTPS based;
Software-as-a-Service (SaaS) – deploy and manage
infrastructure used by distributed containerized applications.
EKS can support single-tenancy as well as multi-tenancy SaaS scenarios.

EKS Amazon Compute Service Level Agreement


Under the terms of the AWS Customer Agreement (a.k.a., the AWS
Agreement) each AWS account has a governing policy called the Amazon
Compute Service Level Agreement (SLA). The SLA is subject to the terms of the AWS Agreement, and capitalized terms take the meanings given there.
The Amazon Compute SLA includes these AWS products and services:

Amazon Elastic Compute Cloud (EC2)


Amazon Elastic Block Store (EBS)
Amazon Elastic Kubernetes Service (EKS)
Amazon Fargate for Amazon ECS (Fargate)
Service Commitment

AWS undertakes commercially reasonable efforts to make each of the included products and services available with a Monthly Uptime Percentage (as defined in the SLA) of at least 99.99%, in each case during any monthly billing cycle (the "Service Commitment"). In the event any of the included products and services do not meet the Service Commitment, the consumer will be eligible to receive a Service Credit as described by AWS.
AWS reserves the right to change the terms of the SLA in accordance with
the AWS Agreement. For more details on:

Service Commitment;
Included Products and Services;
Service Commitments and Service Credits;
Credit Request and Payment Procedures, and
Amazon EC2 SLA Exclusions.

Please visit the AWS website.

Kubernetes Architecture
Google created Kubernetes, drawing on its experience running containers at scale internally. Google open-sourced Kubernetes in 2014 and later donated it to the Cloud Native Computing Foundation (CNCF); it is licensed under the Apache 2.0 license.
Docker and Kubernetes are complementary technologies. A Kubernetes
node uses Docker as its container runtime environment. Kubernetes uses
Docker to start and stop containers. Kubernetes functions at a higher-level –
it decides which nodes to run containers on, decides when to scale the nodes
in and out, up and down, updates the containers and keeps them running.
On the surface, Kubernetes has two components:

1. A cluster on which the containerized application runs – a data plane, and
2. An orchestrator of the containerized application – a control plane that deploys and manages the containers.

Kubernetes uses a role-based access control (RBAC) system to authorize requests to the components of the control plane and data plane.

EKS and IAM


In addition to the Kubernetes RBAC system, EKS is substantially
integrated with the AWS IAM service. EKS is able to use:
IAM Identity-based Policies – you specify allowed or denied
actions (i.e., invoking an EKS operation) and resources (i.e., the
ARN of the EKS component to which the action is applied) as
well as the conditions under which the action is allowed or
denied, per cluster, per Region.
Authorization based on Tags – tags can be attached to EKS
resources. You can control access to EKS resources based on
tags.
Temporary Security Credentials – EKS supports temporary
security credentials. You can use temporary credentials to sign
in with federation, assume an IAM role, or assume an IAM
cross-account role.
Service-linked Roles – there are service-linked roles linked
directly to EKS and the permissions that they need are defined.
Permissions include the trust policy and the permissions policy.
Only EKS can assume its service-linked roles. EKS uses the
AWSServiceRoleForAmazonEKSNodegroup service-linked
role to manage these resources: Auto-scaling groups, security
groups, launch templates, and IAM instance profiles. You must
configure the permissions to allow an IAM user, group, or role,
to create, edit, or delete, a service-linked role. Service-linked
roles are available in all Regions in which EKS is available.
Service IAM Role – EKS makes calls to other AWS services to
manage the resources used by EKS.

EKS does not support IAM resource-based policies.
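As a sketch of an identity-based policy, the following statement allows a principal to describe a single cluster in one Region (the account ID and cluster name are placeholders):

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": "eks:DescribeCluster",
          "Resource": "arn:aws:eks:us-west-2:111122223333:cluster/dev-cluster"
        }
      ]
    }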

EKS Clusters
Unlike an Amazon ECS cluster, a Kubernetes cluster consists of a control plane and worker nodes. The worker nodes host the containers – they can be VMs
in a public or private cloud, or on-premise bare metal servers. The control
plane exposes an API, schedules containers on nodes, implements auto
scaling, handles container updates, and records the cluster’s state in persistent
storage.
With EKS, you can choose to use Kubernetes versions 1.12 through 1.15
for a given cluster. Each cluster has 1 or more masters (a.k.a., heads or head
nodes) and a group of worker nodes. Taken together, the masters are the
control plane. The worker nodes host the running containers and are
subordinate to the control plane. The maximum number of EKS clusters per
Region, per AWS account, is 100. The maximum number of control plane
security groups per cluster is 4.
AWS charges $0.10 per hour for each Amazon EKS cluster that you
create. The Kubernetes cluster can be run on AWS using either EC2 or AWS
Fargate compute instances, as well as on-premise using AWS Outposts. You
are charged a fee for the AWS resources (e.g., compute instances, storage,
etc.) and AWS services that are part of the cluster. However, you only pay for
the resources and services that you use, as you use them - there are no
minimum fees and no upfront commitments with EKS.
Every Kubernetes cluster also has an internal DNS service (a.k.a., kube-dns). However, with Amazon EKS the kube-dns server is replaced by CoreDNS. A Linux worker node is used to run the EKS core system Pod called coredns.

EKS Cluster VPC


Before you create an EKS cluster you must create a VPC and security
group(s) that meet EKS requirements. EKS has these requirements of a VPC:

The VPC must have subnets in at least two AZs.


AWS recommends a VPC with public and private subnets.
Internet facing load balancers requires a public subnet.
All public and private subnets used by the cluster must have a
tag (so that load balancers can be deployed to them).
At launch time, worker nodes require outbound Internet access to
the EKS API Server for cluster inspection and node registration.
To pull images, worker nodes require access to container
registry endpoints.
For worker nodes to be able to register with the control plane the
VPC must have DNS hostname and DNS resolution support.
A worker node deployed to a private subnet must have a default route to a NAT Gateway. To provide the worker nodes with Internet access, the NAT Gateway itself must be assigned a public IP address.
If the worker node is deployed to a public subnet, the subnet must be configured to automatically assign a public IP address to it.
Ensure that the VPC’s CIDR range has IP addresses sufficient to
support the Pods deployed to the cluster, as well as IP address
for load balancers to use.
The EKS control plane creates up to 4 cross-account ENIs in the
VPC.
Docker runs in the 172.17.0.0/16 CIDR range in EKS clusters.
Ensure that the VPC’s CIDR range does not overlap that range.
If CIDR blocks are used to restrict access to the cluster’s public
endpoint, AWS recommends that private endpoint access is
enabled so that worker nodes can communicate with the cluster
Control Plane.

EKS has these requirements of security groups:

Create a security group that is bound to the worker nodes.


Create a security group that is bound to the control plane’s
cross-account ENIs.
The security groups must allow communication between the
control plane and the worker nodes.
Use a dedicated security group for each control plane.
A security group used in a private subnet denies all inbound
network traffic and allows all outbound traffic.

A VPC and its security group(s) can be used by multiple EKS clusters.
For better network isolation, AWS recommends that each EKS cluster use a
separate VPC.
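The subnet tags referred to above typically take the following form (the cluster name is a placeholder); check the current EKS documentation for the exact keys required by your Kubernetes version:

    kubernetes.io/cluster/dev-cluster = shared    (all subnets used by the cluster)
    kubernetes.io/role/elb = 1                    (public subnets, for Internet-facing load balancers)
    kubernetes.io/role/internal-elb = 1           (private subnets, for internal load balancers)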

VPC CNI Plugin


To support native VPC networking, EKS has the Amazon VPC CNI
Plugin. This CNI plugin allows Kubernetes Pods to have the same IP address
inside the Pod as they do on the VPC network. The CNI plugin is responsible
for allocating the VPC IP addresses to the worker nodes and configuring the
Pod networking on each worker node. The CNI plugin has two primary
components:

1. L-IPAM Daemon – responsible for attaching ENIs to compute instances, assigning secondary IP addresses to ENIs, and maintaining a 'warm pool' of IP addresses on each worker node for assignment to Pods when they are scheduled. ENI and secondary IP address limitations per EC2 instance type are enforced. As the IP addresses in the warm pool are depleted, another ENI is automatically attached and allocated a secondary IP address. This process continues until the EC2 worker node cannot support additional ENIs.
2. CNI Plugin – responsible for wiring the host network and adding the correct interface to the Pod namespace.

There are eighteen environment variables that are used to configure the
CNI plugin. Please refer to the ‘Amazon EKS User Guide’ for details about
these CNI plugin environment variables and their complex configurations.
Pod-to-Pod network traffic within the VPC flows directly between private IP addresses and requires no Source Network Address Translation (SNAT). By default, when network traffic addresses an endpoint outside the VPC, the CNI plugin translates the private IP address of each Pod to the primary private IP address assigned to the primary ENI (eth0) of the EC2 worker node that the Pod is running on.
When SNAT is enabled on the VPC, Pods can communicate bi-directionally with Internet endpoints. The EC2 worker node must be launched in a public subnet and have a public IP address associated with the primary private IP address of its primary ENI. The network traffic is translated to and from that public (or Elastic) IP address and routed to and from the Internet by an Internet gateway.
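The CNI plugin runs on every worker node as the aws-node DaemonSet in the kube-system namespace, so a quick way to confirm it is running and to check its image version is:

    kubectl get daemonset aws-node -n kube-system
    kubectl describe daemonset aws-node -n kube-system | grep Image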

CNI Metrics Helper


For a given EKS cluster, the CNI Metrics Helper allows you to know how
many IP addresses have been assigned and how many are available. The CNI
Metrics Helper keeps track of time-series metrics, helps you troubleshoot and
diagnose IP address assignment and reclamation issues, and can provide you
with insights into IP capacity planning. The metrics data is sent to
CloudWatch and can be viewed in the CloudWatch console. To make calls to
the AWS APIs, the IAM CNIMetricsHelperPolicy policy must be attached to the worker node's IAM role.

EKS and Calico


Calico is a network policy engine for Kubernetes that is useful in multi-tenant environments. Its network policies are similar to VPC security groups, in that you create network ingress and egress rules. Instead of being assigned to a worker node, Calico network policies are assigned to Pods by using Pod selectors and labels.
With Calico you can implement network segmentation and tenant isolation. Each tenant is isolated from the others, and development, integration-testing, and production environments can be isolated as well.
Calico network policy enforcement is not available on Windows-based worker nodes, nor with Fargate instances.
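As a sketch, once Calico is installed a default-deny ingress policy for a tenant namespace looks like the standard Kubernetes NetworkPolicy below (the namespace name is a placeholder):

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-ingress
      namespace: tenant-a
    spec:
      podSelector: {}
      policyTypes:
      - Ingress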

EKS IAM Service Role


EKS makes calls to other AWS services to manage the resources used by
EKS. In addition to the VPC, before you create an EKS cluster you must also
create an IAM role that EKS can assume when it creates AWS resources,
e.g., load balancer. The same IAM role can be used by multiple clusters, but
that is not advised.
You can create this IAM service role by using the AWS Management
Console or an AWS CloudFormation template text file. The EKS’s IAM
service role must have these policies:

AmazonEKSServicePolicy, and
AmazonEKSClusterPolicy.

The eksctl & kubectl Command Line Utilities


The eksctl command line utility is simple and easy to use to create and manage Kubernetes clusters on Amazon EKS. A new cluster created by using eksctl has both a control plane and a data plane (populated with a worker node group). Kubernetes uses the kubectl command line utility to communicate with the cluster control plane's API Server. But before you can install and use eksctl or kubectl, there are pre-requisites.
To use kubectl you must install the AWS CLI utility (i.e., the aws
executable) and create a client security token for cluster API Server
communication. The AWS CLI ‘aws eks get-token’ command is used to
create that security token. To run, AWS CLI requires Python 2.7, or Python
3.4 or a later version. To be installed, AWS CLI depends on the pip utility.
The EKS consumer must install the AWS CLI binary on their Linux, MacOS,
or Windows, device.
Both the eksctl and AWS CLI utilities require that AWS credentials be configured in their environment (i.e., on the device on which they are installed). The AWS CLI 'aws configure' command is used to set up the
AWS CLI for general use of the AWS cloud. When that command is
executed you are prompted to input the following information: an access key,
secret access key, AWS Region, and output format (e.g. JSON). This
collection of information is stored in a profile named default. That profile is
used unless another is specified.
Once AWS CLI is installed and the AWS credentials are configured the
eksctl utility can be installed on the Linux, MacOS, or Windows, device.
Lastly, after installing eksctl, the kubectl utility can be installed on the
device. AWS recommends an EKS-vended version of kubectl be installed.
You can use the kubectl utility to test the cluster you created using the eksctl
utility.
The kubectl can be configured to use the AWS IAM Authenticator for
Kubernetes for authentication by modifying the kubectl config file. The aws-
iam-authenticator binary must be installed on the device on which the kubectl
client binary is installed.
In addition, a kubeconfig file must be created for the cluster (by using the
AWS CLI update-kubeconfig command). By default, the kubeconfig file is
created in the .kube/config path in the home directory. You can specify an
IAM role ARN for use when a kubectl command is issued. Otherwise, the
IAM role in the default AWS CLI or SDK credential chain is used (you can
also specify a specific named AWS credential profile). The kubectl client’s
KUBECONFIG environment variable must be set to the kubeconfig path.
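Putting the pieces together, a typical sequence after the utilities are installed is to generate the kubeconfig entry for the cluster and then verify connectivity (the cluster name and Region are placeholders):

    aws eks update-kubeconfig --name dev-cluster --region us-west-2
    kubectl get svc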

Creating the EKS Cluster


An EKS cluster can be created by using the Amazon Management
Console or by using the EKS command line tool (eksctl). The IAM user or
role that creates the cluster is automatically added to the Kubernetes RBAC
authorization table as the administrator. When you execute kubectl (the Kubernetes command line tool) commands on the cluster, you must ensure that this same IAM user/role is in the AWS SDK credential chain, so that the AWS CLI, eksctl (the EKS command line tool), and the AWS IAM Authenticator for Kubernetes can find those credentials.
In addition to assigning a name to the cluster, and designating an IAM role and VPC for use by the cluster, when creating the cluster you can also specify:

The version of Kubernetes to use with the cluster. The latest available version is the default setting.
Enable or disable private access to the cluster’s API Server
endpoint. If private access is enabled then API Server requests
that originate with the cluster’s VPC will use the private VPC
endpoint.
Enable or disable public access to the cluster’s API Server
endpoint. If public access is disabled then the API Server can
only receive requests from within the cluster’s VPC.
Choose to enable or disable each type of control plane logging,
i.e., API Server component logs; Audit logs of users,
administrators, and system components that have affected the
cluster; Authenticator log; Controller Manager logs; and
Scheduler logs.
Add a tag to the cluster. A tag is a label (a key-value pair) that you assign to an AWS resource. Tags let you quickly identify a resource and know its purpose, owner, and environment. You can search and filter AWS resources based on their assigned tags.
A tag can be edited and can be removed from an AWS resource.
Tags have no semantic meaning to EKS. You can control which
IAM users and roles have permissions to create, edit, or remove,
tags. There is a maximum of 50 tags per AWS resource. For
each resource, the tag must be unique. You use the EKS console
to create and manage tags associated with clusters or with
managed node groups.
Enable or disable Secrets Encryption using AWS Key
Management Service (KMS). A Kubernetes Secret is a small
object that is used to store and manage sensitive information,
such as passwords, OAuth tokens, and ssh keys. Such sensitive
information might otherwise be put in a Pod specification or in
an image. If enabled, the Kubernetes secrets are encrypted using
the Customer Master Key (CMK) that you select. The CMK
must be symmetric, created in the same Region as the cluster.
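For example, a single eksctl command can create a cluster with an initial node group; the names, instance type, and sizes below are placeholders:

    eksctl create cluster \
        --name dev-cluster \
        --version 1.15 \
        --region us-west-2 \
        --nodegroup-name ng-1 \
        --node-type t3.medium \
        --nodes 3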
Master Nodes (Control Plane)
The Kubernetes master node hosts a suite of services that are the control
plane. The Amazon EKS control plane is fully managed. When an EKS
cluster is created, the control plane is automatically created by EKS, but the
data plane is not automatically created.
For each cluster, EKS delivers a highly-available control plane that runs
across three AZs. In addition, EKS automatically detects and replaces
unhealthy master nodes, provides automated upgrades for them and applies
patches to them.
A master node hosts these five services:

API Server (kube-apiserver) – all Kubernetes components, as well as external users and systems, communicate via the API Server. The API Server has a RESTful API to which you POST
YAML configuration files (a.k.a., manifests). Kubernetes uses
the kubectl command line utility to communicate with the API
Server. By default, the API Server endpoint is public to the
Internet – you can disable Internet access to the API Server
endpoint. Clients of the API Server must support Transport
Layer Security (TLS) 1.0 or later. These same clients must also
support cipher suites with perfect forward secrecy (PFS), e.g.,
ephemeral Diffie-Hellman (DHE) or Elliptic Curve Ephemeral
Diffie-Hellman (ECDHE). Requests to the API Server must be
signed by an access key ID and a secret access key that are
associated with the IAM principal. Alternatively, the AWS
Security Token Service (STS) can be used to generate
temporary security credentials used to sign the requests. All
calls to the API Server are subject to authentication and
authorization checks by a combination of IAM role and the
Kubernetes RBAC service. The structure of the API Server
endpoint is not similar to an AWS PrivateLink endpoint
therefore the API Server endpoint is not visible within the VPC
console. EKS automatically scales the API Server across three
AZs in the Region.
Cluster Store – stores the entire configuration of the cluster plus
its state. It is the single source of truth for the cluster. Once the
posted manifest is validated, it is persisted to the Cluster Store, from where it is deployed to the cluster. The Cluster Store is based
on the etcd distributed database which favors write consistency
over availability. EKS automatically scales the Cluster Store
across three AZs in the Region.
Controller Manager (kube-controller-manager) – it spawns all
of the background control loops that monitor the cluster and
respond to events to ensure that the cluster’s current state
matches its desired state.
Scheduler (kube-scheduler) – monitors the API Server for new
Pods and assigns them to a suitable worker node that is capable
of hosting the Pod. The Scheduler is not responsible for running
the container, just picking the compute instance that hosts the
container.
AWS Cloud Controller Manager – AWS created controllers that
handle the integration of Kubernetes with the AWS platform, its
resources, and services, e.g., compute instances, storage, auto
scaling, load balancing, etc.. All cloud vendors have created a
cloud controller manager and have tightly coupled it with the
Kubernetes control plane, as well as to the cloud vendor’s
resources and services. Consequently, each cloud-based
Kubernetes solution is to a great degree specific to the cloud
vendor.

EKS Cluster Authentication


All interactions with the EKS cluster’s API Server are managed through
the Kubernetes RBAC system, in coordination with the AWS IAM service.
IAM authentication can be achieved using the AWS IAM Authenticator for
Kubernetes, as well as by using the AWS CLI through its 'aws eks get-token' command.
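For example, the AWS CLI can mint a short-lived token for the cluster's API Server (the cluster name is a placeholder); kubectl normally invokes this for you via the kubeconfig exec plugin:

    aws eks get-token --cluster-name dev-cluster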

EKS Cluster Endpoint Access


When the cluster is created, by default, the control plane’s API Server’s
endpoint (a.k.a., the cluster endpoint) is public to the Internet. By default,
access to the cluster endpoint is secured by a combination of IAM role and
the Kubernetes RBAC. By default, worker nodes can communicate with the
public Internet.
With EKS, you can – by using the AWS Management Console or the
AWS CLI - configure both public as well as private access to the API
Server’s endpoint:

Public Access Enabled & Private Access Disabled – this is the default configuration for EKS clusters. The requests to the API
Server that originate within the VPC, whether from worker
nodes or other components of the control plane, leave the VPC
but remain within the AWS network. The API Server endpoint
is public to the Internet. You can limit the IP addresses – limit
the public CIDR blocks - that can reach the API Server over the
Internet. If public access to the Internet is limited by the use of
CIDR blocks, AWS recommends that private access to the API
Server endpoint be enabled, otherwise you have to ensure that
the CIDR blocks include the IP addresses of the worker nodes
and Fargate Pods so they can access the public endpoint. The
maximum number of public endpoint access CIDR ranges per
cluster is 40.
Public Access Enabled & Private Access Enabled – the requests
to the API Server that originate within the VPC, whether from
worker nodes or other components of the control plane, do not
leave the VPC. The API Server endpoint is public to the
Internet. You can limit the IP addresses – limit the public CIDR
blocks - that can reach the API Server over the Internet.
Public Access Disabled & Private Access Enabled – all network
traffic to the API Server must originate from within the VPC or
an AWS connected network, e.g., AWS Transit Gateway, VPC
Peering, AWS Managed VPN, Direct Connect, etc.. There is no
access to the API Server from the public Internet. Any kubectl
commands must originate within the VPC or an AWS connected
network. This can be accomplished in three different ways: 1)
use a compute instance in the connected network; 2) use an EC2
bastion host in the cluster’s VPC, or 3) use the AWS Cloud9
IDE. The EKS security group must contain rules that allow
ingress traffic on port 443 from the connected network, or the
bastion host. If Cloud9 is used you must also use AWS
credentials that are mapped to the cluster’s Kubernetes RBAC.
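An existing cluster's endpoint access can be changed with update-cluster-config. A minimal sketch that disables public access and enables private access (the cluster name is a placeholder):

    aws eks update-cluster-config \
        --name dev-cluster \
        --resources-vpc-config endpointPublicAccess=false,endpointPrivateAccess=true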
EKS and CloudTrail
EKS is integrated with the AWS CloudTrail service. CloudTrail records actions taken in EKS by IAM users, IAM roles, or AWS services. CloudTrail captures API calls for EKS as events from the AWS Management
Console and container calls to EKS API operations. Information captured in a
trail can be used to determine the request made to EKS, who made the
request, the IP address from which the request was made, the date and time of
the request, request parameters, etc..
CloudTrail is enabled on each AWS account when it is created. When you
create a trail, CloudTrail events are continually delivered to an Amazon S3
bucket that you specify. By default, a trail logs events from all Regions. The
CloudTrail log is not an ordered stack trace – the events do not appear in any
specific order. Even if you do not create a trail, the recent EKS events can be
viewed in the CloudTrail console’s Event history.

Control Plane Logging


By default, control plane logging is not enabled and logs are not sent to
CloudWatch Logs. You are charged standard CloudWatch Logs data
ingestion and storage costs for any control plane log sent to CloudWatch.
There are a number of log types that you can enable, each of which corresponds to a component of the control plane:

1. API Server;
2. Scheduler;
3. Controller Manager;
4. Audit – a record of the individual users, Kubernetes
administrators, or control plane components whose actions
impact the state of the cluster;
5. Authenticator – a record of the control plane component that
EKS uses for Kubernetes RBAC authentication using IAM
credentials;

The logs of each EKS log type that you enable can be viewed in the
CloudWatch console.
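Control plane logging is switched on per log type with update-cluster-config. A minimal sketch that enables three of the five log types (the cluster name is a placeholder):

    aws eks update-cluster-config \
        --name dev-cluster \
        --logging '{"clusterLogging":[{"types":["api","audit","authenticator"],"enabled":true}]}'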

Worker Nodes (Data Plane)


The cluster's worker nodes are the Kubernetes data plane. EKS does not automatically create the data plane when the cluster is created. The creation of the data plane is the sole responsibility of the consumer.
The data plane has three primary functions:

1. Monitor the API Server for new Pod assignments;
2. Execute the Pods assigned to it;
3. Maintain a reporting channel back to the API Server.
In the VPC's private subnet(s), Amazon EKS places and manages the Elastic Network Interfaces (ENIs) used for control plane to worker node communication. EKS applies the cluster's security groups to the ENIs that it creates for the control plane.
A worker node is a Linux-based EC2 instance, a Windows-based EC2 instance, a GPU-enhanced Linux-based EC2 instance, or a Fargate instance.
Each worker node has three main components:

kubelet – a Kubernetes agent that runs as a daemon on the compute instance. The kubelet monitors the API Server for new Pod assignments, executes the Pods, and reports back to the API Server. The kubelet is also responsible for registering the compute instance with the cluster. Once registered, the compute instance's CPU, memory, and storage are added to the cluster's pool of compute resources.
Docker runtime (containerd) – pulls the Docker image, and starts and stops the Docker containers wrapped by a Pod. In EKS, Docker is the container runtime.
kube-proxy – responsible for local cluster networking on the node. It maintains the network rules that route and load balance Service traffic to the Pods running on the cluster.
By default, because the API Server endpoint is public to the Internet, so are the worker nodes.

Worker Node IAM Role
The worker node's kubelet agent makes calls to the API Server. An IAM role and its associated policies grant the kubelet agent the permissions it needs to make those calls. When the worker node is created using eksctl, that IAM role is named 'InstanceRoleARN'; when created with a CloudFormation template or in the AWS Management Console, it is named 'NodeInstanceRole'.
The worker node's IAM role can be created by using the AWS Management Console or by using AWS CloudFormation. Before a worker node can be launched on and registered with the cluster, that IAM role must be created for use by the worker node. And these IAM policies must be attached to that IAM role (a CloudFormation sketch follows the list):
AmazonEKSWorkerNodePolicy;
AmazonEKS_CNI_Policy;
AmazonEC2ContainerRegistryReadOnly;
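A minimal CloudFormation sketch of such a role is shown below; the logical resource names are illustrative, and a production template would typically add tags and any extra permissions the workloads need:

Resources:
  NodeInstanceRole:                      # illustrative logical name
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: ec2.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
  NodeInstanceProfile:                   # attached to the worker node EC2 instances
    Type: AWS::IAM::InstanceProfile
    Properties:
      Roles:
        - !Ref NodeInstanceRole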
Node Groups
A new EKS cluster is created without a worker node group. In EKS, 1 or more worker nodes are deployed into a node group. All compute instances in the node group must:

1. Be the same compute instance type;
2. Run the same Amazon Machine Image (AMI);
3. Use the same IAM role;
A cluster may have 1 or more node groups. AWS provides an Amazon EKS-optimized AMI that is based on Amazon Linux 2 and is pre-configured with the kubelet, kube-proxy, the Docker container runtime, and the AWS IAM Authenticator.
You can choose between standard and GPU variants of the Amazon EKS-optimized AMI. The EKS-optimized AMI comes with a boot script that allows the compute instance to automatically discover and connect to the cluster's control plane.
When the eksctl utility is used to create the EKS cluster you can specify a (a cluster configuration sketch follows this list):
Fargate-only cluster – the worker nodes are Fargate instances. The eksctl utility creates the IAM Pod Execution role, a Fargate profile for the default and kube-system namespaces, and updates the coredns system Pod so that it can run on Fargate;
Linux-only cluster – the worker nodes are Linux-based EC2 instances;
Linux and Windows cluster – if the intent is to run containers on Windows-based worker nodes, there must be at least 1 Linux-based EC2 instance in the cluster to run core system Pods such as coredns and the VPC resource controller. Host networking mode is not supported on Windows-based nodes. To enhance availability, AWS recommends at least 2 Linux instances per cluster. Kubernetes Group Managed Service Accounts (GMSA) for Windows Pods and containers are not supported by EKS;
Linux GPU workers only cluster – the worker nodes use the GPU variant of the EKS-optimized Linux AMI. The NVIDIA device plugin for Kubernetes must be applied as a DaemonSet on the cluster. The NVIDIA device plugin DaemonSet automatically exposes the number of GPUs on each worker node in the cluster, tracks the health of the GPU worker nodes, and runs GPU-enabled containers on the cluster.
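As a rough sketch, an eksctl cluster configuration file for a Linux-only cluster might look like the following; the names, Region, instance type, and key name are placeholders:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster            # placeholder cluster name
  region: us-east-1             # placeholder Region
nodeGroups:
  - name: linux-workers         # placeholder node group name
    instanceType: m5.large
    desiredCapacity: 3
    ssh:
      allow: true
      publicKeyName: my-eks-key # placeholder EC2 key pair name

The file is then submitted with eksctl create cluster -f cluster.yaml.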
Managed Node Groups
A 'managed node group' automates the provisioning and lifecycle management of EC2 instance worker nodes. EKS automatically associates a public IP address with each worker node instance and automatically configures each node to use the cluster's security group. All managed nodes are provisioned as part of an Auto Scaling group (ASG). The managed node group spans the subnets referenced by the ASG; those subnets must have outbound Internet access. In addition, a managed node group can run across multiple AZs. The maximum number of worker nodes per managed node group is 100.
When you create the managed node group using the AWS Management Console, the AWS CLI, or the AWS API, EKS automatically creates the IAM service-linked role that the managed node group needs.
A node launched as part of a managed node group is automatically tagged for auto-discovery by the Kubernetes Cluster Autoscaler. You can use the managed node group to apply Kubernetes labels to the worker nodes and update the nodes at any time as well. AWS does not charge additional costs to create and use managed node groups, and there are no minimum fees or upfront costs. The maximum number of managed node groups per cluster is 10. EKS managed node groups are automatically configured to use the cluster security group.
AWS recommends that the ssh-public-key option be enabled when the managed node group is created. Enabling SSH allows you to connect to the Linux-based EC2 instances and gather information used to diagnose problems. SSH cannot be enabled after the managed node group is created.
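A managed node group can be declared in the same kind of eksctl configuration file; this sketch assumes an existing EC2 key pair and uses placeholder names:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster              # placeholder cluster name
  region: us-east-1               # placeholder Region
managedNodeGroups:
  - name: managed-workers         # placeholder node group name
    instanceType: m5.large
    minSize: 2
    maxSize: 6
    desiredCapacity: 3
    labels:
      role: general               # example Kubernetes label applied to the nodes
    ssh:
      publicKeyName: my-eks-key   # placeholder; SSH must be enabled at creation time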
EKS IAM Users and Roles
In Kubernetes, to grant additional IAM users and roles permission to interact with an EKS cluster, you edit an AWS authenticator configuration map named 'aws-auth-cm.yaml' and referred to as the 'aws-auth ConfigMap'. Stored in the cluster's kube-system namespace, the original purpose of the aws-auth ConfigMap was to allow worker nodes to join the cluster. Now, the aws-auth ConfigMap also provides IAM users and roles with RBAC access based on their ARNs. After specifying the IAM user and role ARNs in the ConfigMap file, you apply its contents to the cluster by using this command: kubectl apply -f aws-auth-cm.yaml.
The kubectl utility is also used to add individual IAM users and roles to the EKS cluster's ConfigMap. When doing so, you specify the IAM user/role ARN, the username that Kubernetes will map to that ARN, and a list of Kubernetes groups to which to map the username.
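The ConfigMap below is a sketch of what aws-auth-cm.yaml typically looks like; the account ID, role name, and user name are placeholders:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/NodeInstanceRole   # placeholder ARN
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
  mapUsers: |
    - userarn: arn:aws:iam::111122223333:user/dev-user           # placeholder ARN
      username: dev-user
      groups:
        - system:masters                                         # full cluster admin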
Declarative Manifest Files
To use Kubernetes, the consumer must:
1. Design and build the service, create its Docker image, and store
the image in a repository;
2. Package each service image in a Docker container;
3. Wrap each Docker container in its own Pod – a Pod is a wrapper
that is required to run a Docker container on a Kubernetes
cluster;
4. Deploy Pods to the cluster by using a declarative manifest file
(in YAML or JSON format) and a controller such as
Deployments, DaemonSets, StatefulSets, or CronJobs. These
manifest files are to be held within a source code control system
and serve their role in CD/CI.

The manifest file defines the 'desired state' of the containerized application – it defines what you want: how the container is to be configured, provisioned, and run on the cluster. Each manifest file becomes part of the control plane, and together they are in effect the cluster's desired state. The manifest is similar to the ECS Service and Task definitions and serves a similar purpose. However, EKS, unlike ECS, does provide deployment functionality – with ECS you need to use CodeDeploy to provision ECS Services on a cluster. How a container is deployed using EKS is radically different from ECS.
The manifest file is sent as an HTTP/HTTPS POST to the control plane's API Server. To communicate with the EKS API Server you can use the kubectl command-line utility or the Helm package manager. Once the POST request is authenticated and authorized, the API Server vets the manifest, identifies which 'controller' the manifest will use for deploying the container, and then persists the manifest in the Cluster Store. When those actions complete successfully, the Scheduler takes over and assigns the Pod to a worker node, where the Docker image is pulled from its repository, the network is provisioned, and the container(s) are started. The Controller Manager spawns the background loops that monitor the state of the cluster and its running containers. If the current state of the cluster deviates from the 'desired state,' Kubernetes will respond and carry out tasks intended to resolve the problem.
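A minimal Deployment manifest, with placeholder names and a placeholder image, might look like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # placeholder name
spec:
  replicas: 3                    # desired state: three Pod replicas
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: 111122223333.dkr.ecr.us-east-1.amazonaws.com/web:1.0   # placeholder ECR image
          ports:
            - containerPort: 80

Submitting it with kubectl apply -f deployment.yaml sends the manifest to the API Server.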
Pods and Containers
Whereas an ECS container's runtime environment is specified by Service, Task, and Container definitions, Kubernetes uses the Pod artifact to define the container's runtime environment. Just as a Docker container cannot run in ECS without a task definition, a Docker container needs a Pod so that it can be run by Kubernetes.
Just as an ECS task definition has a list of 1-10 containers, 1 or more containers can exist within a Pod. The simplest Pod has just 1 container. But, as with ECS task definitions, more than 1 container exists in a Pod only when there is tight coupling among those containers.
The Pod is a runtime environment that is isolated on the host compute
instance’s operating system, has access to the unencrypted localhost network
stack, may have persistent storage volumes, and may access other kernel
namespaces as well. Like the containers of an ECS Service or ECS Task, all
containers in the Kubernetes Pod share the same runtime environment and
run on the same compute instance. In addition, all containers in the Pod share
the Pod’s IPv4 address and hostname. When containers in the Pod need to
communicate with each other they use ports on the localhost network
interface.
Pods with more than 1 container have the same short-comings as described for ECS task definitions that have more than 1 container. As always, tight coupling is to be avoided unless it is absolutely necessary to support the use case requirements.
Kubernetes allows you to define the minimum amount of vCPU and memory resources that are allocated to each container in the Pod. Kubernetes will not schedule the Pod unless the requested resources are available on the compute instance that hosts the Pod's container(s).
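A sketch of a two-container Pod that declares its CPU and memory requests (the names and images are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: web-pod                 # placeholder name
spec:
  containers:
    - name: app
      image: nginx:1.17         # placeholder image
      resources:
        requests:               # minimum resources the scheduler must find on one node
          cpu: 250m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi
    - name: log-sidecar         # tightly coupled helper sharing the Pod's network namespace
      image: busybox:1.31       # placeholder image
      command: ["sh", "-c", "tail -f /dev/null"]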
Kubernetes Controllers
Unlike ECS Services, which require CodeDeploy, or ECS Tasks, which are run/started by command, Kubernetes uses internal controllers to deploy Pods on a cluster. By itself, a Pod is not capable of scaling, self-healing, or handling rolling updates or rollbacks. For these capabilities, the Pod depends on a Kubernetes controller (which is identified in the Pod's manifest file).
Kubernetes has 4 controllers (a CronJob sketch follows the list):
1. Deployment – provides self-healing: if a Pod fails it is replaced (but is assigned a new name, ID, and IP address). If the load on the Pod increases, more Pod instances are deployed. If the load on the Pod decreases, and there are too many Pod instances, then they are terminated. Provides zero-downtime rolling-updates. Provides rollbacks.
2. DaemonSet – used when you need a particular Pod to run on
every worker node in the cluster (or in a subset of the cluster).
As worker nodes are added to a cluster the DaemonSets
controller ensures that the Pod is running on the new node.
3. StatefulSet - used for stateful containers. Pods are started and
deleted in defined order. When a Pod fails it is replaced and
assigned the same name, ID and IP address.
4. Job and CronJob – short-lived batch-like jobs. A job can run
more than 1 Pod. A job does not have a desired state. If the job
fails it is restarted. CronJobs are jobs that are run on a scheduled
basis.
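As an example of the fourth controller, a CronJob manifest for a Kubernetes 1.14-era cluster might look like the following (names and command are placeholders):

apiVersion: batch/v1beta1        # CronJob API group on 1.14-era clusters
kind: CronJob
metadata:
  name: nightly-report           # placeholder name
spec:
  schedule: "0 2 * * *"          # run every day at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: busybox:1.31                          # placeholder image
              command: ["sh", "-c", "echo generating report"]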
Helm Package Manager
Helm (a graduated CNCF project) is a command-line package manager that simplifies how containerized applications are installed in and managed by Kubernetes. Helm uses a package artifact called a chart. A chart defines all of the Kubernetes resources required by the application. The Helm command-line tool is called helm, and Tiller is the Helm server installed directly in the Kubernetes cluster. Tiller converts the chart into manifest files and interacts with the API Server to install, upgrade, query, and remove Kubernetes resources.
Helm is not part of EKS; therefore the consumer must install Helm. Before installing Helm on a device, the kubectl utility must be installed and configured.
EKS and AWS Fargate Profiles
AWS Fargate and Fargate Spot provide server-less compute instances - a
virtual machine that (unlike an EC2 instance) you cannot configure or scale.
A Fargate compute instance is a pre-configured and highly scalable one-size-
fits-all VM – if the shoe fits wear it.
To run Pods on Fargate compute instances, open-source Kubernetes
integrates with Fargate through various controllers created by AWS. These
AWS controllers are part of the Kubernetes control plane and are fully
managed by EKS.
As with EC2 instances, all containers in the Pod run on the same Fargate
compute instance. For each Pod, Fargate provides its own isolation boundary
and does not share kernel, CPU, memory, or its Elastic Network Interface
(ENI), with any other Pod. Each Pod running on a Fargate instance is
configured to use the cluster’s security group.
When the cluster creates a Pod on a Fargate instance, the Pod needs IAM permissions to make calls to the EKS API and other AWS APIs. The EKS Pod Execution role provides the IAM permissions that each Pod needs.
To schedule a Pod to run on a Fargate instance you have to create a
Fargate profile, which becomes part of the control plane, becomes part of the
cluster’s desired state. A Fargate profile is immutable. The maximum number
of Fargate profiles per cluster is 10.
A Fargate profile can be created using the AWS Management Console or eksctl. The Fargate profile identifies the Pods that will run on Fargate instances, as well as the following (a configuration sketch follows the list):
VPC subnets used by the Fargate instance;
IAM Pod Execution role - used by the Pod. For purposes of authorization, EKS automatically adds the IAM Pod Execution role to the cluster's Kubernetes RBAC. That action allows the kubelet agent on the Fargate instance to register that Fargate instance with the EKS control plane. The AmazonEKSFargatePodExecutionRolePolicy must be attached to the IAM Pod Execution role. The role also grants Pods access to the Amazon Elastic Container Registry (ECR);
Selectors - each profile can have 1 to 5 selectors. A selector contains a namespace and optional labels; Pods that match a selector (its namespace and all of its labels) are scheduled on Fargate. Each selector specifies a:
Namespace;
Labels – 1 or more key-value pairs. The maximum number of labels per selector is 100.
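A Fargate profile can be sketched in the eksctl configuration file as shown below; the cluster name, namespaces, and label are placeholders:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo-cluster            # placeholder cluster name
  region: us-east-1             # placeholder Region
fargateProfiles:
  - name: fp-default            # placeholder profile name
    selectors:
      - namespace: default
      - namespace: orders       # placeholder namespace
        labels:
          run-on: fargate       # only Pods carrying this label are matched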
While coredns is configured to work with EC2 instances on EKS clusters, coredns has to be updated to work with Fargate instances. To allow Ingress objects to be used by Pods on Fargate instances, the ALB Ingress Controller on Amazon EKS must be deployed. And to optimize CPU and memory use by those Pods, deploy the Vertical Pod Autoscaler.
Considerations that may preclude the use of Fargate:

Stateful containers are not a good fit with Fargate instances;
Fargate cannot guarantee pod-level security isolation;
DNS resolution and DNS hostnames must be enabled on the
VPC;
Pods running on Fargate are compatible with Application Load
Balancer (ALB), but are not compatible with NLB or CLB;
Using DaemonSets to deploy Pods is not supported by Fargate;
Only private subnets are supported for Pods run on Fargate;
Privileged Docker containers have all the root capabilities of the
host machine and therefore are not supported on Fargate;
Pods running on Fargate cannot specify HostPort or
HostNetwork in the Pod’s manifest file;
A GPU variant is not available with Fargate;
Running Pods on Fargate
In order to be scheduled, Pods on Fargate require additional configuration:

CPU and memory – the vCPU and memory reservations within the Pod specification are used to determine the CPU and memory resources provisioned to the Pod by Fargate. Fargate adds 256MB to each Pod's memory reservation for use by the Kubernetes worker node components (kubelet, containerd, kube-proxy). If the Pod specification does not include vCPU and memory then Fargate provisions the smallest combination of these resources.
Storage – Fargate automatically configures 10GB of storage for the Docker container image's writable layer. This storage is ephemeral and is deleted when the Pod stops.
The maximum number of concurrent Fargate Pods, per Region, per AWS account, is 100. And, the maximum number of Fargate Pod launches per second, per Region, per AWS account is 1, with temporary bursts up to 10.
Running Deep Learning Containers on EKS
AWS Deep Learning Containers are a set of Docker images for training and serving models in TensorFlow on Amazon EKS and Amazon ECS. Deep Learning Containers provide optimized environments with TensorFlow, Nvidia CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries, and are available in Amazon ECR. Fargate does not support AWS Deep Learning Containers.
When using deep learning containers – either AWS containers or custom-made containers – the typical EKS solution has multiple clusters (all of which are in the same Region). It is common to provision a training cluster along with an inference cluster. A given deep learning cluster can support multiple EC2 instance types. AWS recommends using EC2 instances with the Deep Learning Base AMI (Ubuntu).
Each deep learning cluster must be specifically configured to support AWS Deep Learning Containers. The eksctl utility is used to set up the deep learning cluster.
To configure the EKS cluster, create an IAM user that has these policies:
AmazonEKSAdminPolicy
AmazonCloudFormationPolicy
AmazonEC2FullAccess
IAMFullAccess
AmazonEC2ContainerRegistryReadOnly
AmazonEKS_CNI_Policy
AmazonS3FullAccess

Next, create a gateway node in the same Region as the EKS cluster. Then log into the gateway node and install the AWS CLI, eksctl, kubectl, and aws-iam-authenticator. Run the aws configure command for the IAM user. When prompted, paste in the IAM user's access key and secret access key. Afterwards, install ksonnet (a configuration management tool for Kubernetes manifests).
For each deep learning cluster using GPU instances, the cluster's ~/.kube/eksctl/clusters/kubeconfig file must be modified accordingly:
region – the Region where the cluster is located;
nodes - the number of EC2 compute instances in the cluster;
node-type – specify the GPU instance type chosen to be a worker node;
ssh-public-key - the name of the key that you want to use to log in to the cluster's worker nodes. Either use a security key you already use or create a new one, but be sure to swap out the ssh-public-key with a key that was allocated for the chosen Region. The ssh-public-key must have access to launch instances in that Region.
Use the eksctl utility to submit the modified kubeconfig file to the EKS control plane. It will take EKS several minutes to create the deep learning cluster. If the node-type is an Nvidia CUDA GPU instance you must install the NVIDIA device plugin for Kubernetes in the EKS control plane. The NVIDIA device plugin is a DaemonSet that can automatically:

Expose the number of GPUs on each worker node in the cluster;
Keep track of the health of the GPUs;
Run GPU-enabled containers on the cluster.

For clusters using CPU instances, the cluster is configured in the same way as when GPU instances are used – just modify the node-type to specify a CPU instance type.
Kubernetes Storage Classes
For Kubernetes version 1.14 and later EKS clusters, there are three Container Storage Interface (CSI) drivers available for managing container storage on worker nodes (a sketch follows the list):

1. EBS CSI driver – manages the lifecycle of Amazon Elastic
Block Storage (EBS) volumes for persistent volume storage.
You must attach the Amazon_EBS_CSI_Driver IAM policy to
each worker node IAM role. If a stateful application is running
across multiple AZs and uses both the EBS CSI driver and the
Kubernetes Cluster Autoscaler, you should configure multiple
worker node groups, each scoped to a single AZ. Otherwise,
create a single worker node group that spans multiple AZs. At
present, this driver is in Beta release;
2. EFS CSI driver – manages the lifecycle of Amazon Elastic File
System (EFS) volumes for persistent volume storage. You must
create a security group that allows inbound NFS traffic for the
EFS mount points and add a rule to that security group which
allows inbound NFS traffic from the VPC CIDR range. By
default, a new EFS file system is owned by root:root, and only
the root user (UID 0) has read-write-execute permissions. If the
container is not running as root, permissions on the EFS file
system must be modified to grant access to other users. An EFS
mount point can be accessed by multiple Pods. EFS does not
enforce any file system capacity limits. At present, this driver is
in Beta release;
3. FSx for Lustre CSI driver - manages the lifecycle of Amazon
FSx for Lustre file system volumes for persistent volume
storage. You must create an AWS IAM OIDC provider and
associate it with the cluster. An IAM OIDC identity provider is
an entity in IAM that describes an external identity provider
(IdP) service that supports the OpenID Connect (OIDC)
standard. You must create an IAM service account and policy
that allows this CSI driver to make calls to the AWS APIs and
attach it to the cluster. At present, this driver is in Beta release;
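A sketch of a StorageClass and PersistentVolumeClaim for the EBS CSI driver; the names and size are placeholders:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ebs-sc                      # placeholder name
provisioner: ebs.csi.aws.com        # the EBS CSI driver
volumeBindingMode: WaitForFirstConsumer   # create the volume in the AZ of the scheduled Pod
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ebs-claim                   # placeholder name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: ebs-sc
  resources:
    requests:
      storage: 20Gi                 # placeholder size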
Auto Scaling
EKS supports three types of Kubernetes autoscaling (an HPA sketch follows the list):
1. Cluster Autoscaler (CA) – automatically adjusts the number of worker nodes in the cluster when Pods fail to launch due to a lack of compute resources or when worker nodes are underutilized and their Pods can be rescheduled onto other worker nodes in the cluster. The CA modifies the worker node groups so that they scale out and scale in. If a stateful application is running across multiple AZs and uses both the EBS CSI driver and the Kubernetes Cluster Autoscaler, you should configure multiple worker node groups, each scoped to a single AZ. Otherwise, create a single worker node group that spans multiple AZs. CA requires an IAM role that can make calls to the AWS APIs. If the eksctl utility was used to make the cluster then the IAM role has the needed permissions. CA requires specific tags on the worker node group Auto Scaling groups so that they can be automatically discovered by EKS. If the eksctl utility was used to make the cluster then these tags were created.
2. Horizontal Pod Autoscaler (HPA) – automatically scales the
number of Pods in a Deployment, replication controller, or
replica set, based on a target CPU utilization percentage. The
Kubernetes API Server exposes metrics that are used for
monitoring and analysis. The HPA is a standard API Server
resource which requires the Kubernetes Metrics Server be
installed on the EKS cluster.
3. Vertical Pod Autoscaler (VPA) – automatically adjusts the CPU
and memory reservations for the Pods to help ‘right-size’ the
containers. The VPA can help improve use of the cluster
resources and free up CPU and memory for other Pods. The
Kubernetes API Server exposes metrics that are used for
monitoring and analysis. The VPA requires either the
Kubernetes Metrics Server or the Prometheus Metrics Server be
installed on the EKS cluster.
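A Horizontal Pod Autoscaler sketch that targets the Deployment shown earlier; the name and thresholds are placeholders:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                  # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                    # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 60   # scale out when average CPU exceeds 60%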
Kubernetes Metrics Server
The Kubernetes API Server exposes metrics that are used for monitoring
and analysis. These metrics are exposed internally through the metrics
endpoint /metrics HTTP API. The metrics endpoint is exposed on the EKS
control plane. The Kubernetes Metrics Server aggregates resource usage data
in the cluster. By default, the Metrics Server is not installed in EKS. The
consumer is responsible for installing the Metrics Server on the EKS cluster.
The Metrics Server provides the metrics that are required by the HPA. The
kubectl utility is used to install the Metrics Server.
Prometheus Metrics
Prometheus is a monitoring and time-series database that scrapes the exposed API Server metrics endpoint and aggregates the data. Prometheus allows you to filter, graph, and query metrics time-series data. The Helm package manager is used to install Prometheus on the EKS control plane.
Load Balancing
For Pods running on EC2-based worker nodes, EKS supports the Network
Load Balancer and the Classic Load Balancer through the Kubernetes
LoadBalancer service. By default, Kubernetes LoadBalancer creates external-
facing load balancers. The external-facing load balancers require a public
subnet that has a route directly to the Internet using an Internet gateway. The
public subnets must be tagged (kubernetes.io/role/elb : 1) so that Kubernetes
knows to use only those public subnets. When CloudFormation is used to
create the EKS’s VPC the tags are automatically attached to the public
subnets. For internal-facing load balancers, the EKS VPC must be configured to use at least one private subnet. The private subnets must also be tagged (kubernetes.io/role/internal-elb : 1) so that Kubernetes knows to use them.
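A sketch of a Kubernetes Service that asks for a Network Load Balancer; the names and label are placeholders:

apiVersion: v1
kind: Service
metadata:
  name: web-nlb                  # placeholder name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb   # request an NLB instead of a CLB
spec:
  type: LoadBalancer
  selector:
    app: web                     # placeholder Pod label
  ports:
    - port: 80
      targetPort: 80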
ALB Ingress Controller
For Pods running on EC2-based or Fargate-based worker nodes, EKS supports the Application Load Balancer (ALB) Ingress Controller Pod. All VPC subnets must be tagged so that the ALB Ingress Controller knows to use them. An IAM OIDC provider must be created and associated with the cluster. A Kubernetes service account named alb-ingress-controller must be created in the kube-system namespace, along with a cluster role and a cluster role binding for the ALB Ingress Controller to use. An IAM role for the ALB Ingress Controller Pod must be created and attached to the service account. The IAM ALBIngressControllerIAMPolicy policy must be attached to the ALB Ingress Controller Pod's IAM role so that it can make calls to the AWS APIs.
The ALB Ingress Controller configures the ALB to route HTTP/HTTPS traffic to different Pods in the cluster, and supports these traffic modes (an Ingress sketch follows the list):

Instance – registers worker nodes within the cluster as targets for the ALB. Network traffic reaching the ALB is routed to the NodePort for the service and then proxied to the Pods. This is the default traffic mode.
IP – registers Pods as targets for the ALB. Network traffic reaching the ALB is routed directly to the Pods for the service.
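An Ingress sketch for the ALB Ingress Controller; the annotations shown are the commonly used ones for that controller, and the service name is a placeholder:

apiVersion: extensions/v1beta1          # networking.k8s.io/v1beta1 is also accepted on newer clusters
kind: Ingress
metadata:
  name: web-ingress                     # placeholder name
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip   # 'ip' mode is required for Pods on Fargate
spec:
  rules:
    - http:
        paths:
          - path: /*
            backend:
              serviceName: web          # placeholder Service name
              servicePort: 80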
EKS on AWS Outposts
AWS Outposts is a fully-managed service that extends AWS infrastructure, AWS services, APIs, and tools into the consumer's private data center. Outposts offers a mix of EC2 compute instance and EKS capacity designed to meet a variety of application needs. There is one use case suited to Amazon EKS that benefits from the AWS Outposts service:

Hybrid Applications – containerized applications run in AWS as well as in the consumer's private data center.

EKS on Outposts provides highly scalable, high-performance Docker container orchestration with the low latencies available with on-premises systems.
AWS DEVOPS TOOLSET
AWS DevOps tools are not just for coding applications, processes, or infrastructure; there are tools for measuring and managing cost as well:

AWS Billing and Cost Management Service;
AWS Cost Explorer;
AWS Cost and Usage Report;
Total Cost of Ownership (TCO);
AWS TCO Calculator;
AWS Simple Monthly Calculator;
AWS Budgets Dashboard;
AWS Trusted Advisor;
AWS Management Console;
AWS CloudFormation;
AWS Command Line Interface (CLI) Utility;
AWS Software Development Kit (SDK);
AWS Billing and Cost Management Service

The AWS Billing and Cost Management service:

Enables you to forecast future resource/service usage and fees to help you plan and budget costs.
Monitors your usage of AWS resources and services and provides the most up-to-date cost and usage information.
Is used to pay your AWS bill.
The AWS Billing and Cost Management service is configurable. You can select a custom time period within which to view your cost and usage data, at a monthly or daily level of granularity. You can also group and filter cost and usage information to help you analyze the trends of cost and usage (over time) in a variety of useful ways and use that information to pinpoint and optimize costs.
To help you visualize the usage and cost data, the service provides these graphs:
Spend Summary – shows last month's costs, the estimated costs for the current month to date, and a forecast of the costs for the current month.
Month-to-Date Spend – shows the usage of the top AWS services and their proportion of the costs.
Lastly, the AWS Bills page lists the costs you incurred over the past month for each AWS service, with a breakdown by Region and AWS account.
AWS Cost Explorer
The AWS Cost Explorer is a free tool used to view charts of your costs for up to the last 13 months and to forecast how much you are likely to spend over the next 3 months. You can get started immediately analyzing your costs by using the set of default reports (listed below), and you can create your own custom cost reports as well. You can save these reports and measure your actual progress toward your goal. And, like the Billing and Cost Management service, the Cost Explorer lets you select a custom time period within which to view your cost and usage data, at a monthly or daily level of granularity. You can also group and filter cost and usage information to help you analyze the trends of cost and usage (over time) in a variety of useful ways and use that information to pinpoint and optimize costs.
Included in the default reports are:

Monthly Costs by AWS Service
EC2 Monthly Cost and Usage
Monthly Costs by Linked Account
Monthly Running Costs
Reserved Instance (RI) Reports
AWS Cost and Usage Report
To support cost and usage optimization at the lowest level of granularity, AWS provides the cumulative Cost and Usage Report. This report contains line items for each AWS product, usage type, and operation, as well as their unique combinations, that your individual AWS account consumed over time, aggregated by the hour or by the day.
Each update is cumulative, and each version of this report includes all line items and information contained in previous versions. The intra-month reports are estimates of usage and cost and are subject to change as usage changes; AWS finalizes the figures in the end-of-month version of the report. Applying refunds, credits, or support fees can cause a finalized end-of-month report to be updated to reflect your 'final' end-of-month charges.
You specify the S3 bucket that AWS delivers the report files to, and AWS updates the report up to 3 times a day. The Cost and Usage Report in an S3 bucket can be queried by Amazon Athena, downloaded from the S3 console, and uploaded into Amazon Redshift as well as Amazon QuickSight.
Calls to the AWS Billing and Cost Management API can create, retrieve, and delete a Cost and Usage Report. The Cost and Usage Report is available on the Reports page of the AWS Billing and Cost Management console 24 hours after you create it.
However, if the consolidated billing feature in AWS Organizations is used then this report is only available to the master account (from which the other member accounts can obtain their reports).
AWS Cost Management Matched Supply and Demand
For matched supply and demand, AWS provides the Auto Scaling Service, which is covered elsewhere in this manuscript.
AWS Cost Management Expenditure Awareness Services

AWS provides these expenditure awareness services:

Amazon CloudWatch
Amazon SNS
Optimizing Over Time

AWS publishes documents on this topic at its training website and also provides the AWS Trusted Advisor service.
Total Cost of Ownership (TCO)

At the financial center of the decision to use cloud resources and services is the total cost of owning and operating a system/application in the cloud, versus on-premises, versus an on-premises-cloud hybrid.
Several CAPEX items inherent to on-premises systems and hybrids (e.g., maintenance staff, facilities, software licenses, hardware, etc.) are not present in the cloud. And unlike in the cloud, where scaling up or down is simple and you pay only for what you use, scaling down on-premises does not reduce CAPEX and scaling up on-premises involves long delays.
Despite the CAPEX differences, and given the technical differences, the fairest way to compare a cloud solution with an on-premises solution (or a hybrid) is to use the Total Cost of Ownership (TCO) tool. The TCO captures the total direct as well as indirect costs of owning and operating a system/application in AWS.
AWS TCO Calculator

The AWS TCO Calculator is a tool you use to help reduce the TCO – the money invested in capital expenditures for AWS resources and services. The AWS TCO Calculator allows you to estimate the cost savings of using AWS and gives you the option to modify assumptions to meet your business needs. The AWS TCO Calculator also provides a set of reports that can be used in presentations to the C-levels. The AWS TCO Calculator can be reached at this URL: https://awstcocalculator.com/
AWS Simple Monthly Calculator

The AWS Simple Monthly Calculator is a tool that you can use to estimate the cost of running a solution in the AWS Cloud, based on estimated usage. This calculator will help you determine how adding a new AWS resource or service will impact your overall bill. The AWS Simple Monthly Calculator can be reached at this URL: https://calculator.s3.amazonaws.com/index.html
AWS Budgets Dashboard

The AWS Budgets dashboard provides you with the ability to set a budget for AWS resource and service usage or costs. Budgets can be tracked for a start and end date of your choosing, at a monthly, quarterly, or yearly time period. When your actual usage and cost exceed your set budget amount (or AWS forecasts that they will be exceeded) you will receive an alert. And, when your actual usage and costs drop below the budgeted amount you also receive an alert to that effect. Budget alerts are sent via email or an SNS topic, and are available for EC2 instances, RDS, Redshift, and ElastiCache.
Budgets can be created and monitored from the AWS Budgets dashboard or via the Budgets API.
AWS Trusted Advisor
At no charge, every AWS account has access to the AWS Trusted Advisor. Accessed from the AWS Management Console, the AWS Trusted Advisor helps AWS customers improve security and performance. Its prominent focus is on:
Service Limits
Security Groups
Specific Ports Unrestricted
IAM use
MFA on the AWS Root Account
Find under-utilized resources

The AWS Trusted Advisor provides customers with easy access to a variety of important performance and security recommendations. As reported by AWS, the most popular recommendations involve:
Cost optimization
Security
Fault tolerance
Performance improvement; and
Service checks

The AWS Trusted Advisor is also a source of best practices that cover:
Service limits;
Security group rules that allow unrestricted access to specific ports;
IAM use;
MFA on the root account;
S3 bucket permissions;
EBS public snapshots, and
RDS public snapshots.

For AWS clients who have purchased the Business or Enterprise Support plans, additional checks and guidance are available.
GitHub
The code that defines and manages the runtime environment(s) used by container/Kubernetes services must be managed and stored in a version control system.
AWS Management Console
Accessing, managing, and administering AWS resources and services is supported by a simple and easy-to-use web-based user interface (UI) called the AWS Management Console. The Management Console can be used to manage all aspects of your AWS account, from monitoring your fees to managing identity credentials and their permissions. In addition, all functions available in the Management Console are also available in the AWS API and CLI.
The Management Console is also a great way to discover the many AWS resources and services that are available to you. The simple, easy-to-use UI of each resource and service makes them available for your use in just a few clicks and after filling in a few text boxes.
Lastly, the Management Console gets you up and running with both ECS and EKS in minutes.
AWS Command Line Interface (CLI) Utility

The AWS Command Line Interface (CLI) is a utility used to manage AWS services. Once configured for use by an IAM user, the AWS CLI can control multiple AWS services from the command line and can automate those AWS services through scripts.
AWS CloudFormation
AWS CloudFormation is a free service used to launch, configure, and connect AWS resources. CloudFormation treats infrastructure as code and does so by using JSON- or YAML-formatted templates. AWS does not charge a fee for using CloudFormation; you are charged only for the infrastructure resources and services created using CloudFormation.
CloudFormation enables you to version control your infrastructure – VPCs, clusters, services, tasks, containers, EC2 instances, and Fargate instances. Best practice is to use a version control system to manage the CloudFormation templates. CloudFormation is also a great disaster recovery option.
The AWS CloudFormation template is used to define the entire solution stack, as well as runtime parameters. There are CloudFormation templates that support VPCs, clusters, services, tasks, and EC2 and Fargate instances. You can reuse templates to set up resources consistently and repeatedly. Since these templates are text files, it is a simple matter to use a version control system with the templates. The version control system will report any changes that were made, who made them, and when. In addition, the version control system can enable you to revert a template to a previous version.
An AWS CloudFormation template can be created using a code editor, but templates can also be easily created using the drag-and-drop CloudFormation Designer tool, available in the AWS Management Console. The Designer will automatically generate the JSON or YAML template document.
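As a rough sketch of infrastructure as code, the template below declares an EKS cluster control plane; the names, ARN, and subnet IDs are placeholders:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  DemoCluster:
    Type: AWS::EKS::Cluster
    Properties:
      Name: demo-cluster                                        # placeholder cluster name
      RoleArn: arn:aws:iam::111122223333:role/eksClusterRole    # placeholder cluster IAM role
      ResourcesVpcConfig:
        SubnetIds:
          - subnet-0aaa1111bbb22222c                            # placeholder subnet IDs
          - subnet-0ddd3333eee44444f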
AWS Software Development Kit (SDK)
The AWS SDK is used to programmatically manipulate AWS services and resources. SDKs are available for these programming languages:
Java
JavaScript
Python
.NET
Go
Ruby
Node.js
C++
PHP
AWS SUPPORT TIERS
AWS offers 4 different tiers of support, with differing services:
1. Basic;
2. Developer (fees charged);
3. Business (fees charged); and
4. Enterprise (fees charged).
All AWS Support plans include an unlimited number of support cases,
with no long-term contracts. With the Business and Enterprise tiers, you earn
volume discounts on your AWS Support costs, as your AWS charges
increase.

Support Severity
AWS support uses the following severity chart to set response times.
Critical - the business is at risk. Critical functions of an application are
unavailable.
Urgent – the business is significantly impacted. Important functions of an
application are unavailable.
High - important functions of an application are impaired or degraded.
Normal - non-critical functions of an application are behaving
abnormally, or there is a time-sensitive development question.
Low – there is a general development question, or desire to request a
service feature.

Basic Support
Customer Service & Communities
24x7 access to customer service, documentation, whitepapers
and support forums.
AWS Trusted Advisor
Access to the 7 core Trusted Advisor checks and guidance to
provision your resources following best practices to increase
performance and improve security.
AWS Personal Health Dashboard
A personalized view of the health of AWS services and alerts
when your resources are impacted.
Alerts and remediation guidance when AWS experiences events
that may impact you.
You can set alerts across multiple channels, e.g., email, mobile
notifications.

Developer Support
Enhanced Technical Support
Business hours email access to Support Engineers
Unlimited cases / 1 primary contact
Case Severity / Response Times
General guidance: < 24 business hours
System impaired: < 12 business hours
Architectural Guidance
General

Business Support
Enhanced Technical Support
24x7 phone, email, and chat access to Support Engineers
Unlimited cases / unlimited contacts (IAM supported)
Case Severity / Response Times
General guidance: < 24 hours
System impaired: < 12 hours
Production system impaired: < 4 hours
Production system down: < 1 hour
Architectural Guidance
Contextual to your use-cases
Programmatic Case Management
AWS Support API
Third-Party Software Support
Interoperability & configuration guidance and troubleshooting
Proactive Programs
Access to Infrastructure Event Management for additional fee.
Enterprise Support
Enhanced Technical Support
24x7 phone, email and chat access to Support Engineers
Unlimited cases / unlimited contacts (IAM supported)
Case Severity / Response Times
General guidance: < 24 hours
System impaired: < 12 hours
Production system impaired: < 4 hours
Production system down: < 1 hour
Business-critical system down: < 15 minutes
Architectural Guidance
Consultative review and guidance based on your applications
Programmatic Case Management
AWS Support API
Third-Party Software Support
Interoperability & configuration guidance and troubleshooting
Proactive Programs
Infrastructure Event Management
Well-Architected Reviews
Operations Reviews
Technical Account Manager (TAM) coordinates access to
programs and other AWS experts as needed.
Technical Account Management
Designated Technical Account Manager (TAM) to proactively
monitor your environment and assist with optimization.
Training
Access to online self-paced labs
Account Assistance
Concierge Support Team

Technical Account Manager

The Technical Account Manager:
1. Acts as your 'voice in AWS' and serves as your advocate.
2. Uses business and performance reviews to provide proactive guidance and insight into ways to optimize AWS.
3. Orchestrates and provides access to technical expertise inside AWS.
4. Provides access to resources and best practice recommendations.

Infrastructure Event Management

Infrastructure Event Management includes:

Through pre-event planning and preparation, provides a common understanding of event objectives and use cases.
Resource recommendations and deployment guidance based on anticipated capacity needs.
Dedicated attention of the AWS Support team during an event.
Provides the ability to immediately scale down resources to normal operating levels post-event.
The Concierge Service

The Concierge Service:
1. Provides a primary point of contact to help manage AWS resources.
2. Personalized handling of billing inquiries, tax questions, service limits, and bulk reserved instance purchases.
3. Direct access to an agent to help optimize costs and to identify underused resources.
HTTP & THE OSI MODEL
To understand containers a working knowledge of HTTP and the OSI
Model is needed.

The committee for placing things on top of other things
From the outside looking in, all systems appear to be built by a group of
people going about placing things on top of other things. When that technique
is skillfully applied you have an onion-like structure. When the committee
does not skillfully apply that technique you have a BBoM.
The capable practitioner of SaaS learns to acquire an overview of a problem space and then drills down vertically into its parts. That is how SaaS will be approached here – from the outside moving inward.
The SaaS architect typically looks at the network as an industry standard
off-the-shelf generic commodity service, which is sustained by standard
commodity resources, consumed at commodity prices. The microservices
present in the core domain of the SaaS are based on the Open Systems
Interconnection (OSI) Model in which layers are placed on top of other layers
by committee.

OSI Model

The 7 Layer OSI Model (diagram)
When all is well, the Physical and the Data Link layers of the OSI Model,
layers #1 & #2, are of least concern to the SaaS methodology: they represent
the lowest level of the Infrastructure as a Service (IaaS) upon which the SaaS
operates.
When looking upwards from the bottom of the OSI model, the layers of
the OSI Model that are most important to the SaaS are the Network, the
Transport, and the Application layers. As the above diagram shows, HTTP is
the protocol at the Application layer, TCP the protocol at the Transport layer,
and IP at the Network layer.

The system as the network, and the network as the system

Given any particular network, there is a network protocol whose function is to, at all points in the network topology, and at all times, optimize the consumption of resources in a secure manner, over as many connections as feasible, handling as many sessions and messages as needed. In a nutshell, that's the grand scheme – at least, that is, if a physical part of the network is not decaying.
As a SaaS engineer investigating a protocol, your view of it is as if from a window made of 4 panes:
1. Resources;
2. Messages;
3. Connections, and
4. Security (listed last but is certainly not the least important
focus).

In the context of inter-operating with (hypertext information) resources via REST interfaces, that protocol is HTTP. In a secure context, that protocol becomes HTTPS.
HTTP's two message types

Information, in the form of a message, is exchanged during an HTTP transaction. HTTP is a request-response protocol, supported by two different message types:

1. HTTP Request – a carefully structured and formatted message that the web service can interpret successfully. There are just 9 request methods.
2. HTTP Response - a carefully structured and formatted message that the web browser can interpret/execute successfully.
Any type of application – not just a web browser - that can create an
HTTP Request and can send that message over a TCP/IP network can access
resources available at URL endpoints, e.g., telnet www.fredo.net 80
HTTP is Stateless
Each HTTP transaction, each request and response, is independent of all other HTTP transactions, past and future. The HTTP protocol does not require a server application to retain information about an HTTP request. Every HTTP request contains all of the information that a server application needs to create and return a response message. Because every message contains all required information, HTTP messages can be inspected by, transformed by, and cached by proxy servers, web cache servers, etc.
However, without the ability to cache HTTP messages in memory, the latency inherent in the world-wide web would make the Internet unusable. In addition, if needed by the BPM, most web clients and server applications are highly stateful!
An application called a web service
On the host computer that replies to HTTP requests, there is an application
called a web service. A web service ‘listens on a port’ for HTTP requests. It
is a common feature among cloud web services that they only handle JSON
encoded HTTP requests.
A wide variety of web service stacks exists, none of which will be
explored here. All web service stacks have one thing in common: they
support HTTP verbs and invoke CRUD transactions involving hypermedia
information resources.
The typical web server reply is implemented in Hypertext Markup Language (HTML), which the web browser uses to display the hypermedia information resource, as well as to support interactions with that resource. As such, a returned resource can be a combination of passive information content, HTML, software programs (e.g., JavaScript), CSS style sheets, and other use-related parts, which are combined to create the user experience of the resource.
The origins of HTTP
The Hypertext Transfer Protocol (HTTP) is the foundation of data
communication for the World Wide Web. Development of HTTP was
initiated by Tim Berners-Lee at CERN in 1989. Development of HTTP
standards was coordinated by the Internet Engineering Task Force (IETF) and
the World Wide Web Consortium (W3C) and released in 1991.
HTTP is an Application Layer protocol for distributed, collaborative,
hypermedia information systems. By making simple requests, hypermedia
information resources can be easily manipulated – created, read, updated,
deleted.

Hypermedia information
Hypermedia information is not a passive entity, and seldom exists as an
isolate – seldom exists apart from other hypermedia entities. Hypermedia
information systems arise, persist, and perish, in the form of a property graph
database. Hypermedia contains information as well as behavior. It is the
behavior which makes it active information, that makes it hyper.
The tools which provide behavior are:

The Hypertext Markup Language (HTML), and
A programming language - typically that's JavaScript
Principle – a hypermedia information resource is not a relation: it is a property graph.
Best practice – Do not store a hypermedia information resource within a
relational database service.
Best practice – Do not use a relational database service to persist or to
CRUD elements contained within a hypermedia information resource.

The origins of HTTPS
Most information resources are not for sharing. Most information
resources must be secured when they are stored, as well as when they are
distributed.
3 years after HTTP was released and by then in wide use, Netscape
Communications created HTTPS in 1994 for its Netscape Navigator web
browser. Originally, HTTPS used the Secure Socket Layer (SSL) protocol.
SSL has evolved into the Transport Layer Security (TLS). HTTPS was
formally specified by RFC 2818 in May 2000. Typically, the SaaS supports
SSL/TLS certificates.

Distributed CRUD
When information and services are distributed over a network, their
associated CRUD transactions are also distributed over the network. A
CRUD transaction over a network is uniquely identified by specifying both:

The IP address of the host computer, and
The specific port # used by the application service that will handle that specific type of request – that handles that type of CRUD transaction.
A port number is a crude but effective form of network security, in that the request for the resource will not be successful unless both parts of the identity (the IP address and the port #) of the destination of the request are valid.
The Port Number

In the TCP/IP protocol, a port is an I/O mechanism that an application uses to 'listen for requests' which it handles. In the typical SaaS use case, the requests are composed of HTTP verbs and information encoded in JSON.
An application service running on a host computer (or on a Virtual Machine, as well as within a container/Pod) is uniquely identified by its port – by a number.
HTTP uses port 80
SMTP uses port 25
FTP uses port 21

To specify the specific port on a host computer that handles HTTP requests is straightforward. For example,
http://fredo.net:80/ismg6340/stuff.json/
If the hypermedia information resource is available at the default HTTP
port, then the port number does not need to be included in the URL.

Idempotency
To determine if a request method is or is not idempotent, only the state of
the request handler service is considered. The idempotency of a method
request is not determined by the agent initiating the request.
The request handler service controls and dictates if a given method request is or is not executed as an idempotent request. The HTTP request handler is, relative to the requestor, a black box – the requestor has no control over how their request is handled.
How to tell if a given HTTP request is idempotent

A request method is idempotent if an identical invocation made 1 single time, or countless times, causes the same side-effect-free effect on the information state of the service that handles the request. Repeating an identical request leaves the request handler service in the same state that it was in after the first invocation was handled.
Between invocations, the information content of results returned for an
identical request may change over time. Between invocations, the status code
returned for an identical request may change over time. For example, the first
call of a DELETE will likely return a 200, while successive ones will likely
return a 404.

9 HTTP request methods
1. The GET method requests a representation of the specified
resource. Requests using GET should only retrieve data.
2. The HEAD method asks for a response, but without the usual
response body, just the resource header.
3. The CONNECT method establishes a tunnel to the server
identified by the target resource.
4. The OPTIONS method is used to describe the communication
options for the target resource.
5. The TRACE method performs a message loop-back test along
the path to the target resource.

GET, HEAD, OPTIONS and TRACE methods are safe in that their only
purpose is to retrieve information. Safe requests are defined to be
idempotent but are not guaranteed to be so.
Best practice – do not assume that your 1 single GET or HEAD request is
always only executed 1 and only 1 time by the request handler service.
6. The POST method is used to submit an information payload to the specified resource, intending to cause a change in information state – an effect - on the server, free of any unintended side-effects.
7. The PUT method replaces all current digital representations of the specified target resource with the request payload.
8. The DELETE method deletes the specified resource.
9. The PATCH method is used to make partial changes to a resource.

The PUT and DELETE methods are defined to be idempotent, but there is no guarantee that they are implemented to be idempotent.
Best practice – do not assume that your 1 single POST, PUT, DELETE, or PATCH request is always executed 1 and only 1 time by the request handler service.

HTTP Connections
In an HTTP transaction a message in the form of a request is sent by a
client application to a server application which returns a message in the form
of a response. Remember, there are other types of HTTP client applications
other than a web browser.
The official format specifications of these 2 types of messages and of their
information contents are part of the HTTP protocol. These messages are
transmitted across a network wherein connections are opened and closed. The opening and closing of network connections takes place within the OSI layers that are the foundation on which your cloud solution executes.
The client application uses the hostname of the URL, from which the IP
address and port # of the server application are derived. Once the client
application opens a connection, it can write messages to the server.

What is TCP?
The TCP layer handles the messages, ensuring that they are relayed
between the correct client and server applications, without message loss or
message duplication occurring. TCP is a Transport layer protocol, and is a
connection-oriented protocol, and therefore requires a logical connection
between 2 devices before transmitting data.
In a TCP-based network, the client application and the server application
that are exchanging messages with each other can reside on any
computer/node in the network. The TCP layer, present on each node in the
network, handles the messages, ensuring that they are relayed between the
correct client and server applications, without message loss or message
duplication occurring.
TCP controls the rate at which messages flow and detects errors. If a
message is lost, TCP will automatically resend that message.

What’s IP?
The IP is a Network layer protocol and is a connectionless protocol. IP
does not require a logical connection between 2 devices before transmitting
data. An IP device just sends the data, just puts the data on the Ethernet.
Every host interface in the TCP/IP network has an IP address. The IP address (of which there are now 2 types: IPv4 and IPv6) is:

a logical address that is unique within the TCP/IP network;
independent of the virtual machine/machine it is associated with (as well as its interfaces);
Present on each node in the network, the defining job of the IP layer is
routing traffic. IP fragments a TCP message and encapsulates it into
datagrams, which it then places on the Ethernet layer beneath it. Also, IP
takes datagrams off the Ethernet and reassembles them into a TCP message.
Lastly, IP exchanges information about the status/health of host interfaces in
the TCP/IP network.

Types of HTTP connections

There are three types of HTTP connections:

1. Parallel – the client application opens multiple concurrent connections to the server.
2. Persistent – the client application opens a connection and keeps
that connection open and uses that same connection to
continually send and receive messages to and from the server.
Persisting the connection and using it to support multiple
request/response messages is a client application performance
optimization technique which avoids the overhead of re-opening
a socket (a sketch of connection re-use follows this list).
3. Pipelined – the client application sends multiple request
messages, without waiting between requests for the
corresponding response. Pipelined connections are a client
application performance optimization technique that reduces
request latency.
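A minimal sketch of a persistent connection in Python, using the requests library: a Session keeps the underlying TCP connection open and re-uses it across requests, avoiding the overhead of re-opening a socket for every message. The URL and parameters are assumptions for illustration.

import requests

# A Session re-uses the same underlying connection (HTTP keep-alive)
# for consecutive requests to the same host.
with requests.Session() as session:
    for page in range(1, 4):
        response = session.get(
            "https://api.example.com/items",   # hypothetical endpoint
            params={"page": page},
            timeout=5,
        )
        response.raise_for_status()
        print(response.status_code)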
OF MONOLITHS AND
MICROSERVICES
Back in the mid-1970s structured programming was advocating two
timeless software engineering principles:

1. Maximize cohesion, and


2. Minimize coupling.

Over the past 50 years, unless the software engineers were working on an
operating system, a database service, or some other advanced technology,
invariably the application code consistently violated the principles of
maximizing cohesion and minimizing coupling. Over the past 50 years, the
drive to get the application out of development, through testing and out into
production has been and remains the holy grail. Due to pernicious neglect,
the principles of maximizing cohesion and minimizing coupling are not
engraved on that holy grail.
It has been well known since the mid-1970s that whenever those two
principles are violated the application becomes a big ball of mud (BBoM).
The polite term for a BBoM is a monolithic application. And, just because it
is a web application that does not mean that it is not a BBoM. Gaining hands-
on knowledge of how to maximize cohesion or minimize coupling of the
application has had little to no street value, until recently that is.
Recently, the critical role that those two software engineering principles
play in reducing CapEx and OpEx has been recognized by a wider percentage
of the C-Level chairs in the audience. Respecting the 2 timeless principles is
a matter of survival (for institutions as well as software engineers).

Monolithic Applications
In every IT institution, a monolithic application is a significant barrier to
reducing CapEx and or OpEx.
Maximizing cohesion and minimizing coupling share much in common
with film-making. It requires a number of takes before you are able to select
the best result, before you are able to see how cohesion is being maximized
and coupling is being minimized. Shareholders rarely want to pay for
multiple takes on an application component that already passes its validation
and verification tests. A monolithic application shares much in common with
a home movie – they lack entertainment value and are painful to watch,
worse still to appear in.
As if it can’t help itself, over time, the typical business appends to the
sorry monolithic application a variety of disparate processes that have
different compute needs. Those practices continue until a straw is added
which breaks the back of the monolithic beast. In effect, the typical business
misses all of the opportunities it previously had to minimize coupling the
monolithic application with other applications.

The Institution
“It must be remembered that there is nothing more difficult to plan, more
doubtful of success, nor more dangerous to manage than a new system. For
the initiator has the enmity of all who would profit by the preservation of the
old institution and merely lukewarm defenders in those who gain by the new
ones. ”
― Niccolò Machiavelli
It seems that most technologists fail to perceive the humans present in the
environment. As such, rarely will you discover a cloud migration strategy
that starts with a wise consideration of our shared humanity. The above quote
explains why nearly all IT institutions are averse to migrating from their
traditional IT environments to the cloud (private or public, or hybrid), and
why most migrations fail to achieve their CapEx or OpEx goals. Your
migration strategy amounts to nothing when the tribe in question has a vested
interest in the preservation of their traditional IT institution and is therefore
full of animus in relation to the public cloud which rends their hallowed IT
institution's infrastructure into a commodity.
To the animus infected, their traditional IT institution is to them a totem,
and the ‘cloudsters’ are a real existential threat. Therefore, the cloud
migration strategy is not fundamentally a matter of logic or reasoning, and
certainly not a matter of due diligence: it is a matter of when, where, and how
fear in the ranks of the IT institution is handled by the C-levels. As history
continually teaches us, extremely few C-levels have the leadership skills
needed to take a crowd of fearful people (who almost certainly distrust the C-
levels) into the unknown and beyond into the greener pastures.
Consequently, the best times to migrate a service/system to the cloud are
during the conception or the birth of that service/system, well before the child
gets too accustomed to their local neighborhood, or too familiar with its
ingrained customs, taboos and bigotries: before they join the tribe. As such,
our best hope for success in the cloud is a green-field solution.

Cloud animus
Animus might be irrational, illogical, but there are reasons for its arising
and abiding. To be fair to the fearful, our traditional IT systems and their
applications were never designed to run on temporary infrastructure (which is
ubiquitous to the public cloud). Traditional IT systems/application were
designed and constructed to run on “permanent” infrastructure, and usually
directly on the hardware. Consequently, the overwhelming majority of
traditional IT systems/applications are not suitable candidates for migration
to the cloud. They are invariably big balls of mud (BBoM).
You’d be crazy to undertake a venture which you know from the start is
doomed to fail - that course of action is something all rational persons are
averse to taking. Yet, to speak out in opposition to a C-level’s initiative is
employment suicide. The crowd’s course of action then is covert aggression:
remain silent, do not point out the material facts that doom the initiative, but
be patient and watch the migration and the sponsoring C-level fail.
Afterwards, gladly welcome their replacement with open arms.
There are various means to assuage the crowd’s fears, and they all require
a gentle guiding hand along with practical encouragement. Nevertheless, you
can lead a horse to water but you cannot make them drink (unless they're
thirsty).

Easy Lifts
When looking at a traditional IT institution, the easy task is to identify the
generic off-the-shelf systems/services that are present. The generics, e.g.,
email, data archival, etc., are the best candidates for migration. These are
vendor products which the traditional IT institution merely had to install and
then operated per the manual. If and only if the vendor supports the cloud
platform, then there is a good assurance of success achieved quickly and at a
reasonable price point. If no such vendor assurance exists then the generic is
a no-go-to-the-cloud.
By completing the easy lifts, the traditional IT institution is re-shaped into
a hybrid cloud, where some parts are in the cloud and most parts remain in
the company’s back-office(s).

BBoMs in the core of the BPM


With the generics identified, that leaves the BBOMs as the lion’s share of
your traditional IT institutions. These BBOMs are the technologies that are
core to your business process models (BPM), the ones you designed and
developed, that you invested the most in. These are the BBOMs that are
critical to the success of your BPM.
The public cloud presents the opportunity to significantly reduce capital
expenditures (CapEx) and (by reducing IT head-count) eliminate operating
expenditures (OpEx). What’s not to like? Unless, that is, you are a biped
attached to the totem. In which case, they are everything to be feared.
We can accept that eliminating and profoundly reducing OpEx is a given
in the cloud. But the CapEx opportunity is refutable. For a C-level to survive
the migration, the CapEx case has to be irrefutable from the viewpoint of the
board. To provide irrefutable CapEx reductions relative to BBOMs the
best case scenario must exist.

Chasing Suitable Metrics


To obtain a non-reputable cloud CapEx estimates it is necessary to take
measurements of the BBoM's infrastructure consumption, on a BBoM by
BBoM basis: preferably on a BBoM component by BBoM component basis.
That is your best case scenario. And as life teaches each of us, the best case is
usually the rarest case as well.
At a minimum, you need to collect key performance indicators (KPIs) such
as the following (a sampling sketch follows the list):

CPU usage;
Memory usage;
Network usage;
Storage usage (actual and projected);
Response times;
Data throughput volumes;
Recovery times.
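Where no monitoring agent exists, even a crude sampler is better than nothing. Below is a minimal sketch, assuming the third-party psutil library is available on the host running the BBoM; it samples CPU, memory, disk, and network counters at a gross (whole-host) level, which is exactly the limitation described next.

import time
import psutil

# Sample whole-host metrics once a minute; a BBoM rarely exposes
# anything finer-grained than this.
while True:
    cpu_percent = psutil.cpu_percent(interval=1)
    memory = psutil.virtual_memory()
    disk = psutil.disk_usage("/")
    net = psutil.net_io_counters()

    print(
        f"cpu={cpu_percent}% "
        f"mem={memory.percent}% "
        f"disk={disk.percent}% "
        f"net_sent={net.bytes_sent} net_recv={net.bytes_recv}"
    )
    time.sleep(60)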
Given a BBoM, these metrics will only be measured at a gross level. In
that BBOMs rarely have features which obtain these metrics on a per process
basis, let alone per component basis, the gross level metrics will not help
identify the individual cannibal components that consume the greater part of
the infrastructure.
Therefore, identifying resource cannibals, and determining how to refactor
them to make them more efficient, is literally impossible. Keep in mind that
the actual cost of a badly performant BBoM component can be easily
obfuscated in the traditional IT environment but will always make you pay
through the nose in the cloud. Migrating to the cloud is not the time to cross
your fingers and then hope for the best. You have to keep an eye on the
cannibals, else they will eat you alive.
In addition to suitable metrics, it is necessary to analyze each BBoM and
identify anti-cloud problems that must be remedied during the migration.
Successfully concluding that analysis is easier said than done when it comes
to BBOMs.

Field Notes from the real world


Invariably, the processes executing in a BBoM are tightly coupled and
they minimize cohesion (i.e., the component has more than 1 purpose and
usually performs each function poorly (because Peter is robbing resources
from Paul)). Therefore, analyzing the BBoM means reviewing convoluted
software riddled with ambiguity and absent of clear meaning or purpose. And
given the widespread proliferation of the Agile methodology in this century,
it is highly unlikely that you can discover any documents that explain what,
when, where, how the BBoM does what it does. Don’t even hope for an
explanation for why a BBoM does what it does. The documented BBoM use
cases, if they do exist, are most often badly written jibber jabber, and are
devoid of KPIs.
Most commonly, since a BBoM is only intended to run in-house, was
never intended to be distributed, the configuration of the BBoM is hardwired
into the code (into its binary). Consequently, it is extremely difficult to
separate the BBoM configuration from the BBoM component's code. To be a
candidate for the cloud, the configuration of a service/system and its code
must be managed as separate constructs. At a minimum, for each BBoM,
separating the two requires a complete re-build of the BBoM code, which
follows on a significant refactoring of all BBoM code such that each
component is fully able to inter-operate with an independent configuration
file.

Containers of BBoMs
The use of VMs is a minimum goal of the cloud migration preliminaries.
Given the temporary nature of cloud infrastructure, it is imperative that the
BBoM be capable of being containerized and orchestrated. That is, each
component of the BBoM must be significantly refactored so that each is
capable of being containerized.
Since the CapEx opportunity of the cloud can only be realized by
exploiting the temporary nature of cloud infrastructure, each BBoM
component must be able to be spun up and spun down at will and do so in
sync with a transient cloud infrastructure. It is sufficient to assert that a
BBoM component cannot carry a tune, let alone be orchestrated. However, in
the cloud, orchestration is not a luxury; it is a necessity.

How to lift a BBoM


A BBoM is like a bird’s nest, and when you pull on one twig a dozen
other twigs move as well. Given the convoluted nature of a BBoM, it is
onerous to identify each component, let alone its security contract(s) or the
service contract(s) and the data contract(s) that are nested and hidden within
them.
Usually, the security of an in-house BBoM that is not connected to the
Internet is adequate to resist most authentication and authorization attacks.
However, that is no warranty of the BBoM’s suitability for the cloud. Of
necessity, when considering migrating a BBoM into the cloud, the analysis of
the BBoM’s attack surface must be undertaken and completed. You might be
surprised to learn that BBoM processes often run with root privileges and
make free use of system calls as well.
Refactoring a BBoM to be secure for the cloud is not simple, and rarely is
it straight-forward. But it certainly must be done, sufficient to ensure the
desired safeguards: authentication and authorization, and protection against
code tampering, identity spoofing, denial of service, elevation of privileges, etc.

The Real World as Final Frontier


The software quality of a typical BBoM is notoriously bad and clearly
unsuited to the cloud. For example, in the cloud each request of a service
must be able to fail fast. Yet, in the traditional IT institution, a request
implemented in a BBoM is rarely if ever built to fail fast. In stark contrast,
most BBoM requests are implemented with complete confidence of their
imminent success each and every time – you know, the old hail Mary pass
and a prayer.
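In the cloud, failing fast is mostly a matter of bounding every external call. Below is a minimal sketch in Python using the requests library; the URL, timeout, and retry budget are assumptions for illustration, not a prescription.

import requests

MAX_ATTEMPTS = 2          # a small, explicit retry budget
TIMEOUT_SECONDS = 2       # fail fast instead of hanging indefinitely

def fetch_inventory():
    last_error = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            response = requests.get(
                "https://inventory.example.com/items",   # hypothetical service
                timeout=TIMEOUT_SECONDS,
            )
            response.raise_for_status()
            return response.json()
        except requests.RequestException as error:
            last_error = error
    # Surface the failure immediately; do not pray for eventual success.
    raise RuntimeError("inventory service unavailable") from last_error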
Consequently, it is always necessary that each BBoM be refactored and
rebuilt to live in the cloud. To think otherwise is foolish and naïve. It is
necessary to determine the cost of refactoring and rebuilding each BBoM.
However, refactoring and rebuilding each BBoM is an expensive luxury that
few can (or are willing to) afford. It is typical to see an Agile cloud migration
project reach 100%-1,000% cost over-runs. When, however, a method suited
for manufacturing software is used such cost over-runs are rare.

The Best Option


Most often the best option is to operate the BBoM as-is wherever,
whenever, and however it currently is. Leave the BBoM alone. Do not touch
the BBoM, don't even exhale near it. Instead, determine which use cases the
BBoM supports and be absolutely certain to identify the KPIs established by
the stakeholders for each use case. Be disciplined and aspire to master the
politics of refraining from considering use cases not currently supported by
the BBoM. Avoid scope creep. It’s all about the BBoM and only the BBoM.
It's only business Fredo, nothing personal.
Next, use that list of use cases (and their KPIs) to design the cloud
compliant version of the BBoM. However, ultimately, the BBoM does not
survive the operation. The cloud design will be constructed as a suite of
orchestrated microservices that collectively support the same use cases
supported by the BBoM. Relative to the design, the cloud is a ‘green field’
uncontaminated by the 'features' of the existing BBoM.
The orchestrated cloud compliant version is readily migrated from
development, into testing, from testing into production - from where it will
run in parallel with the as-is BBoM. Once all KPIs have been verified and
validated, the process of turning off the BBoM begins.
To keep things simple, going forward, let’s call the verified and validated
cloud compliant version of the BBoM the ‘solution.’

VM Hard-stop
If a BBoM is not currently operating on virtual machines whenever
feasible, that situation is a hard stop. Virtual machines (VMs) are ‘standard
technology’ so when a BBoM is not yet running on VMs that is an extremely
negative mark and a tell of the IT institution's culture.
In the cloud your services/systems will run on a VM. Therefore, before a
cloud migration can be considered, each BBoM must first be migrated to
VMs. Failing the VM migration there is no hope of that IT institution
achieving a successful cloud migration any time soon: the institution's culture
is too far behind the curve.

Migration to the cloud


Each migration to the cloud – of any type – is a journey over treacherous
territory. It helps to have a guide along for the journey. And in all cases you
must keep your eyes (and ears) open at all times. The key to a successful
journey into the cloud is the IT institution’s ability to orchestrate
microservices. Being able to orchestrate microservices is akin to being able to
swim in the ocean, as such that skill is mandatory if survival on the high seas
is important to the tribe.
In conclusion, it is fair to say that migration to the cloud is more a matter
of mastering the orchestration of microservices before considering
opportunities to reduce CapEx and OpEx. To jump overboard into the ocean
while lacking the ability to swim is not a strategy you can survive.
For those with animus for the cloud, your best option is to master
microservice orchestration.

Divide and Conquer


Though the inherent complexity of a BBoM is obfuscated and white-
washed, the complexity inherent to SaaS cannot be denied, cannot be
ignored. To be successful, the complexity of a SaaS is conquered – with eyes
wide open - by dividing the problem space into isolated parts that can be
managed independently. To successfully divide and conquer, a methodology
filled with powerful techniques is necessary.

Underlying Assumptions
It is one thing to divide and conquer, it is all together something else to
manage that conquest. Managing a SaaS is not like managing a BBoM. To
manage a SaaS you need to be able to spin-up and shut-down individual
microservices on demand. To manage a SaaS you need to be able to recover a
microservice after a failure event. To manage a SaaS you need to be able to
support rolling updates, as well as rolling rollbacks. Of course, these
management tasks have to be accomplished within a security context that
enforces the required guidelines.
Therefore, when considering a SaaS, orchestration of its microservices is a
huge part of the challenge. To keep things simple, in this article Kubernetes
will do the orchestration.

12-Factors Methodology
The demands of designing, developing and operating a SaaS extend well
beyond the 2 timeless principles, but all of those demands are logical
extensions of those 2 principles. There is a methodology advocated for SaaS
called the 12-Factor methodology. However, the count of 12 is a low-ball
estimate:

1. Codebase;
2. Explicit and Isolated Dependencies;
3. Configuration;
4. Backing services;
5. Build, release, run stages;
6. Stateless Processes;
7. Port binding;
8. Concurrency;
9. Disposability;
10. Dev/prod parity;
11. Logs, and
12. Admin processes.

At the street level, where each of these factors is made real, each of the
12 factors expands into concrete engineering practices, as described below.

1. Codebase
A version control system (e.g., Git) is a tool used in all good software
engineering environments. The version control system is to software what
clean air is to our body. No one has a good life when there is polluted air in
the room. No one has good software when that software is un-managed by a
robust version control system. The version controlled software is held in a
repository. With the aid of the version control tool, specific versions of the
codebase can be repeatedly deployed from the repository to a variety of
environments.
The migration of the codebase occurs through stages - from development
to test, from test onto production – and needs to be repeatable as well as
automated (i.e., does not require human intervention to provision). The
slogan for this automation process is Continuous Delivery and Continuous
Integration (CD/CI).
This is not to assert that the exact same codebase is migrating from stage
to stage. For good reasons, the build of the development binary image is not
identical to the build used in subsequent stages. For example:

in development the codebase may well have performance
measuring libraries built into the binary. Those metrics libraries
are not necessary in test or in production environments;
logging libraries used to support unit testing (and metrics
collection) in development are not to be present in subsequent
stages, and
the mocks used in dev as well as the traces that can be turned on
in dev must not be available in subsequent stages.

Unnecessary libraries and logs are unwanted! What is necessary (and
therefore wanted) is the smallest possible and identical binary image that can
be run in both the test and the production environments. Consequently, the
binary image in development is not the exact same binary image present in
latter stages. However, the binary image in the stages following after
development must be identical to the greatest extent possible.
Another common failing of a codebase we must defend against –
particularly when a database is one of the services in the SaaS – is using a
declarative language in a context where an imperative language is the valid
choice. Time and again tree and network data structures are manipulated in
an algorithm implemented using a declarative language in a context where an
imperative language is the only valid choice. Choose the language best suited
to the data structure processed by the algorithm. To be explicit – only use
SQL over relations.

2. Explicit and Isolated Dependencies


The dependencies across the application binary image have to be declared
explicitly as well as isolated. For example, consider Maven. All libraries used
to build the application binary image are declared in the Maven POM file.
However, Maven is not just a build tool it is also a software project
management tool, as it supports:

Dependency management;
Remote repositories;
IDE tool portability, and
Easy searching and filtering of project artifacts.

Along with the version control system, the explicit dependencies, and
remote repositories, the application binary image is built at each stage.
An important benefit of explicit declaration of dependencies is that they
make it easier to on-board others into the application team. It is much easier
for new persons to acquire knowledge of how the application is put together
when those dependencies are recorded in explicit declarative statements.

3. Configuration
The codebase is not just the binary image of the application. The codebase
also includes configuration information, of different types. For example, the
configuration of an application is different across different environments. For
example, the hardware resources used in production are not identical to the
hardware resources used in testing or in development environments. The
environment configuration information has to be managed independently of
the application functions that consume that configuration information – the
configuration information has to be external to the application’s binary
image.
The codebase is not just the binary image of the application and
application configuration information. The infrastructure that is the
environment of each stage is also expressed in declarative code statements,
that are stored in files external to the application’s binary image. These
declarative code statements make it possible to fully automate the dynamic
provisioning of infrastructure present in each stage.
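A minimal sketch of externalized configuration in Python: the values are read from environment variables at start-up instead of being hardwired into the binary image, so each stage can inject its own settings. The variable names and defaults are assumptions for illustration.

import os

# Configuration lives outside the binary image; each stage
# (dev, test, production) injects its own values.
DATABASE_URL = os.environ["DATABASE_URL"]                 # required; fails fast if missing
LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")           # optional, with a default
MAX_CONNECTIONS = int(os.environ.get("MAX_CONNECTIONS", "10"))

def describe_configuration():
    return {
        "database_url": DATABASE_URL,
        "log_level": LOG_LEVEL,
        "max_connections": MAX_CONNECTIONS,
    }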

4. Backing Services
Each HTTP-based microservice present in the SaaS is a resource that has
a Uniform Interface (i.e., REST) and which is assigned a Uniform Resource
Locator (URL) logical name. It is helpful to think of the microservices of a
SaaS as being a model of (a portion of) the BPM's domain.
The business domain model supported by the SaaS consists of the
core domain, the generic domain, and the supporting domain. The reader who is
familiar with Domain-Driven Design (DDD) will find it easier to understand
the ‘backing services’ factor as the supporting domain.
The core domain consists of the most important parts of the SaaS: these
are the microservices that enable your business process model (BPM) to
succeed. However, these parts cannot be bought off the shelf: their design
and construction are custom made by/for your company and account for
significant investments in technology and engineering. In return, the core
delivers the highest value to your BPM.
The technologies in the core domain are only worth investing in so long as
they provide the required ROI. The core is a product upon which your BPM
is heavily dependent. The core is a product that you manufacture and which
requires a manufacturing methodology. Therefore, the core is not to be
handled as if it were a project. Agile is a methodology limited to projects.
Perceiving the core as if it were a litany of projects will certainly heighten the
risk of the core turning into a big ball of mud (BBoM).
Consider this: if you were to fly on an airplane, do you prefer that the
airplane's flight system was manufactured as a product or was handled as if it
were a litany of Agile projects? If that airplane's flight system were a result of
a litany of Agile projects - and not a manufacturing methodology - it is
doubtful that the airplane could be operated safely. Remember, in Agile no
individual on the team is personally responsible for any deliverable - the buck
stops on no one's desk; the buck falls between the cracks of an abstraction
called the team!
The generic domain is not core to the BPM, but the BPM cannot function
successfully without it. The parts present in the generic domain are the
services/resources that can be bought off the shelf: email; CRM; data
archival; etc. The supporting domain contains the backing services that
support the core domain. For example, within the supporting domain are the
persistent data storage services, the messaging service, etc.
Unlike services in the SaaS core, the services present in the generic and
supporting domains may or may not support a RESTful interface and might
not be assigned a URL. Despite the resulting communication difficulties
between the core domain and the generic and supporting domains, it is
helpful to manage services in all 3 domains as resources.
5. Build, Release, Run stages
Using the appropriate tools, the build event brings together (from 1 or
more repositories) all portions of the codebase intended for release into the
target stage/environment, and outputs a binary image of the microservice.
Each build that outputs an image that will be released into production needs a
unique ID which demarcates each incremental change release. A given released
and ID’d image is immutable and cannot be changed. Any change to the
codebase to be released requires a new unique and incremented ID value.
Using the appropriate tools, the release event takes that output, along with
configuration information about the target environment, and deploys these
into the target environment. Using the appropriate tools, the run stage involves
the launch of the service in the environment as well as its execution in that
environment, as well as any rollbacks that may be needed.

6. Stateless Processes
As explained in article #3 of this series, the HTTP protocol (on which
microservices are based) is a stateless protocol. Therefore, the requests and
responses supported by the SaaS REST interface are stateless by default and
share nothing as well. This stateless share nothing paradigm facilitates the
scaling in and scaling out of instances of the SaaS microservices.
All information processed by the SaaS that the BPM needs to persist must
be managed by a stateful data service (located in the supporting domain).
Information that the BPM needs to persist (or provide access to) over small
windows of time can be managed by a dynamic cache service. Using the
filesystem of the OS that the microservice runs on is a viable choice in some
scenarios, but it’s best to refrain from abusing the filesystem.
For BPMs that need stateful sessions connected to their SaaS, there is the
‘sticky session.’ A sticky session caches temporary metadata about the user
session (in the memory space of the microservice), which is used to route
multiple requests and responses across the same session. Because sticky
sessions violate the share-nothing paradigm, they impede the scaling in and
out of SaaS microservices.
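One way to avoid sticky sessions is to keep session metadata out of the microservice's memory altogether and in a shared backing service. Below is a minimal sketch, assuming the third-party redis client library (redis-py) and a Redis service reachable in the supporting domain; the host name, key layout, and TTL are assumptions for illustration.

import json
import redis

# Session data lives in a shared backing service, not in the
# memory space of any single microservice instance.
store = redis.Redis(host="session-store.internal", port=6379)  # hypothetical host

SESSION_TTL_SECONDS = 900

def save_session(session_id, data):
    store.setex(f"session:{session_id}", SESSION_TTL_SECONDS, json.dumps(data))

def load_session(session_id):
    raw = store.get(f"session:{session_id}")
    return json.loads(raw) if raw else None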

7. Port Binding
As explained in article #3 of this series, in networks that use the TCP/IP
protocol, a port is an I/O mechanism that an application uses to ‘listen to
requests’ which it handles. In the typical SaaS Use Case, the requests are
composed of HTTP verbs and information is encoded in JSON.
In a TCP/IP network, each application service running on a host computer
(or on a Virtual Machine, as well as within a container/Pod) is uniquely
identified by its port – by a number.

HTTP uses port 80
SMTP uses port 25
FTP uses port 21

To specify the specific port on a host computer that handles HTTP
requests is straightforward. For example,
http://fredo.net:80/ismg6340/stuff.json/ or http://localhost:5000/
The port number to which the microservice REST interface is bound is an
example of configuration information. As such, the port number is a critically
important piece of information that needs to be captured in a declarative
statement.
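A minimal sketch of port binding in Python using only the standard library: the port number comes from configuration (an environment variable here), and the service listens on it for HTTP requests. The variable name, default value, and handler are assumptions for illustration.

import os
from http.server import BaseHTTPRequestHandler, HTTPServer

PORT = int(os.environ.get("PORT", "5000"))   # the port number is configuration

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A tiny JSON response, just to show the service answering on its port.
        body = b'{"status": "ok"}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    server = HTTPServer(("0.0.0.0", PORT), HealthHandler)
    print(f"listening on port {PORT}")
    server.serve_forever()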

8. Concurrency
If all goes well, each microservice will have cause to support more than 1
request/response at a moment in time. This stateless share nothing paradigm
facilitates the scaling in and scaling out of instances of the SaaS
microservices which is a very straight-forward way of supporting
concurrency.
Depending on the stack you choose to use to implement the SaaS, other
concurrency options may also exist. For example, Node.js, though single
threaded, uses event loops and callbacks to support concurrency. In a Java
web service, multi-threading may be needed to support the use case KPIs.
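Depending on the runtime, concurrency can be as simple as handing stateless, share-nothing work to a pool of workers. A minimal sketch in Python using the standard library; the handler function and the request identifiers are assumptions for illustration.

from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    # Stateless, share-nothing work: nothing here depends on
    # any other in-flight request.
    return f"handled request {request_id}"

# A small pool of workers services many requests concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_request, range(100)))

print(results[:3])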

9. Disposability
‘All things must pass.’
- - George Harrison
Each microservice on the SaaS needs to be able to be started or stopped at
any chosen point in time. By supporting those behaviors the SaaS
microservice can be scaled in as needed to reduce CapEx, as well as scaled
out to acquire capital. Time is critical when shutting down and starting up an
instance of a microservice, the quicker the better. And to dynamically scale in
as well as out, while handling on-going sessions, a Load Balancer that fronts
the SaaS microservices is a must have.
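Disposability in practice means reacting quickly to the stop signal sent by the orchestrator. A minimal sketch in Python using only the standard library: the process traps SIGTERM, stops accepting new work, finishes what is in flight, and exits promptly. The clean-up steps are assumptions for illustration.

import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # The orchestrator (or the load balancer draining this instance)
    # has asked us to stop: finish in-flight work, then exit quickly.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # ... accept and handle one unit of work ...
    time.sleep(1)

# Release connections, flush buffers, then exit promptly.
sys.exit(0)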
Under normal conditions, the features of the SaaS change over time. Like
all other things, a SaaS is continually changing, is not permanent, is
temporary. Change is expected. Breakage is expected. Therefore a method to
handle the continually changing madness is useful. The 12-Factor
methodology is well suited to handling a continually changing SaaS. In this
article, and in each of the subsequent articles in this series, we will take up a
group of factors from among the 12-Factors methodology.

10. Dev/Prod Parity


To support CD/CI, disparities between environments need to be kept to a
minimum. Disparities have the direct effect of slowing down the migration of
the SaaS across stages. Minimizing deviations between released application
binaries is likewise needed.

11. Logs
Logs give us visibility into the events handled by the microservice, and
how they are handled, in different levels of detail as may be needed. Like stdout
and stderr, a log captures a continual stream of events, and so has to support
the ability to be continually enlarged.
The types of logs used by microservices may need to change across
environments.
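A minimal sketch of treating the log as an event stream in Python, using the standard library: the microservice writes structured events to stdout and leaves collection, routing, and retention to the environment. The logger name and field names are assumptions for illustration.

import json
import logging
import sys

# Log to stdout as a continuous stream of events; the environment
# (Docker, ECS, Kubernetes) decides where the stream ends up.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(message)s"))

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

def log_event(event, **fields):
    logger.info(json.dumps({"event": event, **fields}))

log_event("order_received", order_id=42, quantity=3)
log_event("order_shipped", order_id=42)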

12. Admin Processes


Recurring administration processes are ubiquitous to all systems. Manual
processes that administer a system are error prone and difficult to repeat.
Therefore, it is best to automate and schedule these recurring admin processes,
and refrain from undertaking those processes manually. Scripts are
commonly used to automate these tasks.

12 Factors Eye-Opener
Each of the 12 factors can be expanded according to the needs of the
BPM. How these guidelines are adopted varies according to the
environments the SaaS uses. For example, on the AWS public cloud, factor
#12-Admin Processes is supported by a variety of tools, such as
CloudFormation.
Though there may be variations in how each of the 12 factors is
adopted by a SaaS, all 12 factors need to be adopted to promote the success
of the SaaS. As stated in article #1 of this series, there are other critical
factors that must be handled, but which are not identified in the 12-Factors
methodology. For example, it fails to mention identity management.

Greenfield DevOps
The greatest probability of success exists in a greenfield in which a cloud-
native application can be constructed, tested and deployed, un-chained from
the constraints of a monolithic application.

Reincarnating a monolith
Unfortunately, rarely if ever is it cost effective to physically containerize the
monolithic application or to run it as-is on a private/public cloud. Given a
monolithic application, the most cost effective path to running it in a
private/public cloud is to first containerize it - logically, not physically.
In practice, the monolithic application code is not changed physically.
Instead the monolith is transformed into a greenfield, a logical greenfield.
The computations supported by the monolith define the computations that the
containerized application is to support.
The security, service and data contracts of the existing computations are
designed and developed from day one to be containerized (and to run in the
private/public cloud). Of course, doing so ensures that all existing security,
service and data contracts will have to be rebuilt from the ground up. Unlike
the first time around when the monolith was built the details of every contract
that the solution must support are known before you design the solution as a
cloud-native application.
To respond to the shareholders intent to reduce CapEx and OpEx,
application architects are advocating microservices, which run in containers
that are orchestrated.

Microservices
The discipline of structured programming began in the 1970s, at least a
decade before object-oriented languages or techniques existed outside of
academia. At the core of the structured programming discipline there are 2
timeless software engineering principles: minimize coupling and maximize
cohesion. Until recently, however, those two principles rarely (if ever)
influenced the design and construction of software (object-oriented or not).
As a consequence, there exists a plethora of big balls of mud (BBoM) present
in almost all institutions. Violating either or both of those 2 engineering
principles has always been a willfully ignorant and intentional choice.
Usually, a software application/service is constructed of multiple
components. Tight coupling – hard-wiring component dependencies at the
code level – is to software what metastasized cancer is to the body. Tight
coupling interjects pernicious ripple effects throughout the application code.
Tight coupling makes it impossible to test a component in isolation. Tight
coupling makes code needlessly and senselessly complex. Loose coupling
minimizes the dependencies between components, thereby eliminating the
toxic consequences of tight coupling.
When a software component serves 1 and only 1 purpose, the performance
of that component can be optimized. Highly performant components are the
best way to minimize infrastructure (memory, CPU, network, etc.)
consumption. Ensuring that a component serves 1 and only 1 purpose is
arguably the most effective way to keep things simple. Keeping things simple (by
dividing and conquering) is the best way to drive down the cost of
engineering and operating software. When a software component serves
multiple purposes it has minimal cohesion, is highly resistant to optimization,
and needlessly and senselessly raises the cost of software engineering over its
entire life-cycle.
Though microservices are promoted as if they are a new application
architecture, they are old school, they embody the structured programming
principles of maximizing cohesion and minimizing coupling established in
the mid-1970s. Ever since, those principles have been used by software
engineers who valued their relevancy to distributed systems. Since the late
1980s this author has used them to DevOps globally distributed systems
(without having the significant benefit of a cluster of containers).
If you are familiar with client-server architecture, the server implements
all security, service and data contracts and all of these parts are present in a
single chunk of source code. The entire chunk of source code has to be built,
and then tested, and then deployed, all as a single unit. In a microservice
architecture each server contract becomes an independent service
(implemented as a small chunk of isolated source code) that has a RESTful
interface and which uses HTTPS to inter-operate with the other services.
Individual microservices are then containerized and run in a secure cluster
present in a private/public cloud. Alternatively, each microservice can run in
a dedicated virtual machine.
To help people refrain from continuing to ignore these 2 timeless software
engineering principles, they have been re-branded. Today we have a catch-all
term for asserting the primacy of minimizing coupling and maximizing
cohesion: microservices.
A microservice serves 1 and only 1 purpose and does so while ensuring
that its couplings to other services, to the operating system, to the virtual
machines, to the host hardware, are held to a practical minimum. Coupling of
services is minimized by adoption of interfaces that abstract away the lower
level communication details.
Microservices are not new, they’ve been around since the early 90s when
client-server architectures began to be common-place in business. Back then
the service interfaces were proprietary. Today, open standards define the
interfaces used by microservices. Today, HTTP-based microservices adopt a
stateless Representational State Transfer (REST) interface. That RESTful
interface enforces an identity based security contract, behind which there is a
service contract or a data contract.

The End of the Beginning


The wonderful thing about orchestrated microservices is that they can
operate in the public cloud, in the private cloud, as well as in hybrid clouds.
This places you in the best position from which to choose which cloud type
to adopt (and which cloud vendor to use). To make a few more points, for the
time being let’s assume that there are no data governance issues or
government regulations constraining which type of cloud the solution can be
placed within.
The end of the beginning of the process of turning off the BBoM is the
decision (for each solution) of the cloud type it will be deployed to: public,
private, or hybrid. Each cloud type – public, private, hybrid - has its
benefits as well as short-comings that must be taken into consideration. In
addition, the particular benefits and short-comings of each public cloud
vendor must be taken into consideration as well.
From an operations perspective, the public cloud places the lowest level of
demands for human resources. The public cloud also places the greatest level
of constraints on access to the underlying hardware that is the infrastructure.
From an operations perspective, the private cloud places the greatest level
of demands for human resources. The private cloud also places the lowest
level of constraints on access to the underlying hardware that is the
infrastructure.
From an operations perspective, the hybrid cloud is a unique (customized)
mix of the strengths and weaknesses of both the public cloud and the private
cloud.

Software-as-a-Service (SaaS)
Instead of a BBoM, an HTTP-based solution is composed of one or more
loosely coupled microservices. Taken together, this collection is called
Software-as-a-Service (SaaS). A SaaS may or may not be connected to the
Internet. A SaaS may or may not be running in a public, a private, or a hybrid
cloud. Regardless of where the SaaS is deployed, in addition to the 2 classic
software engineering principles there are other critically important techniques
that must be enforced when designing, developing and operating any SaaS.

Conclusion
Monolith applications, a.k.a., big balls of mud (BBoM), are imagined to
be a permanent resource. In the institutions in which they are found the
BBoM is a totem to that culture. Consequently, any change to the BBoM
such as lifting it into a public cloud or containerizing it is imagined by the
institution’s culture to be threatening. Naturally, that response increases the
level of animus within the institution’s culture to the cloud as well as to
containers.
To be clear, given the physical structure of the BBoM’s source and system
code base it is rarely if ever cost effective to stuff a BBoM into a container or
to lift a BBoM into a private or public cloud. All BBoMs are defined and
managed to run on a physical server located in a private (or tenant) data
center. All BBoMs (even the web applications) were never implemented to
run on virtual machines distributed across network subnets exposed to the
Internet.
ABOUT THE AUTHOR
In the spring of ’85, out of the blue, while finishing an undergraduate
degree in Geology (and planning that fall to enter graduate school to study
Mineral Economics), the writer’s first computer was dropped on his work
desk along with the assignment to create a program that estimated the
ownership and operating expenses of surface and subsurface mining
equipment. Back then, in the dark ages, no one had a computer on their desk,
or extremely few people owned their own computer. Before that moment, the
writer had never touched a computer, or had seen a computer. At that time,
given 80% of others in Geology were out of work, the machine placed on my
desk was better than their alternatives.
That micro-computer was a brand-new proto-DOS HP-85, with a tiny key-
board, a 4” green monitor, and it ran applications you wrote in a language
called BASIC. I literally did not know what a CPU was, what RAM was (ram
was a barnyard animal, right?), what an operating system was, what a
programming language was, etc.. Thirty days later the bug-free program ran
in under 5 minutes, completing the work a Geologist accomplished in one
week using a hand-held HP calculator.
Rather than be run over by the gathering herd of machines, that fall the
writer switched disciplines and entered graduate school to study information
systems. In the fall of 1985, the writer’s first personal computer was a
10MHZ CPU, 16KB of RAM, a 14” black-n-white TV as a monitor, with a
dial-up modem cartridge, used to connect to a VAX 11-780 running BSD
Unix 4.3 and INGRES, on which ran an application written in C code using
double pointer indirection. In the following spring, my next computer had 12
MHZ with 32 KB and man was it fast, really, really fast!
Since earning a Master of Science (MS) in 1987, and working in pure and
applied research, the author has been developing and operating globally
distributed systems since 1990, first using varieties of UNIX, and then
Windows NT, and then Linux, and now their offspring. The author has
worked with a mixture of on-premise platforms and, since 2007, a mixture of cloud
platforms: first with proto-Azure while an employee at the Microsoft Center
of Excellence, then in 2012 using Amazon Web Services (using Hadoop and
Redshift), followed in 2015 using Google Cloud Platform (GCP).
As everyone involved with AWS has encountered, the use of Docker and
Kubernetes on AWS is growing rapidly. This manuscript is constructed from
the notes recorded while out in the field. This manuscript was first published
in Amazon Kindle at the end of May 2020.
The author can be reached at charles@thesolusgroupllc.com
For those interested in the AWS certifications held by this author, here is
where they stand.
AWS Certified Cloud Practitioner Validation Number#
PN49EN3CDE411H30
AWS Certified Cloud Practitioner Issue Date: February 12, 2019
AWS Certified Cloud Practitioner Expiration Date: February 12, 2021

Please note that an AWS Academy Accredited CF Educator is not
assigned a public validation number. The AWS Academy needs to be
contacted directly.
AWS Academy Accredited Cloud Foundation Educator Certification Issue
Date: February 12, 2019
AWS Academy Accredited Cloud Foundation Educator Certification
Expiration Date: February 12, 2021

On May 7, 2019 the author earned accreditation to teach the AWS
Academy CA class.
AWS Certified Solution Architect Associate Validation Number#
WVWVCWPCJJ1EQRKH
AWS Certified Solution Architect Associate Issue Date: April 8, 2019
AWS Certified Solution Architect Associate Expiration Date: April 8,
2021

AWS Academy Accredited Cloud Architecture Educator Certification
Issue Date: May 7, 2019
AWS Academy Accredited Cloud Architecture Educator Certification
Expiration Date: May 7, 2021
Marketing materials describing the consulting services provided by my
company (The Solus Group, LLC) are available at
www.thesolusgroupllc.com
LEAVE A REVIEW
If you have enjoyed this manuscript, please leave a review on Amazon.
If you have not enjoyed this manuscript, please leave a measured and
courteous review on Amazon.
