Docker & Kubernetes on AWS 2020
Elastic Container Service (ECS) & Elastic Kubernetes
Service (EKS)
Charles Clifford
The characters and events portrayed in this book are fictitious. Any similarity to real persons, living or
dead, is coincidental and not intended by the author.
No part of this book may be reproduced, or stored in a retrieval system, or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, or otherwise, without express written
permission of the publisher.
ISBN-13: 9781234567890
ISBN-10: 1477123456
AWS CodePipeline;
AWS CodeDeploy;
AWS Fargate - a serverless compute engine for containers that
works with both ECS and EKS;
AWS Route 53 and Cloud Map;
AWS AppMesh - provides application-level networking to
make it easy for the services running inside your containers to
discover and communicate with each other across different
compute infrastructure, e.g., ECS and Fargate;
Bottlerocket – a Linux-based open-source operating system
(built by AWS) for running containers on virtual machines or
bare metal hosts;
AWS Outposts – use EKS or ECS to manage clusters of
containers in a private cloud integrated with the AWS public
cloud;
Introduction
Amazon Elastic Container Service (ECS) is a Region-based service used
to orchestrate Docker containers that are distributed across a cluster. The use
cases that gain from the benefits delivered by ECS are rarely if ever simple or
easy to DevOps. If they were simple then their solution would not need to be
orchestrated.
In an ECS solution there are many interdependent AWS resources and
services that have to be configured optimally and which have to inter-operate
securely and efficiently to support your use case in a viable and cost effective
manner. To provide a containerized application with security isolation, high
availability, and the quick resolution of failure events, the ECS consumer is
responsible for configuring:
Everything is Disposable
Everything in the cloud is ephemeral and temporary. The VPCs, security
groups, IAM roles, clusters, services, tasks, Docker containers, and their
compute instances are not static objects or permanent artifacts.
Each ECS artifact has to be configured and deployed within a dynamic
workflow that is continually changing. In the real world, the DevOps of
containerized applications feels similar to releasing an arrow from a bow
while blindfolded, and then aiming that in-flight arrow at a target that is
constantly moving.
As is typical of any service in the cloud, the Docker container’s AWS eco-
system has many moving parts functioning across multiple levels: VPC,
cluster, service, task, container, compute instance. In each level, ECS artifacts
can be deeply integrated with a variety of AWS resources and services. To
handle the variety and complexity of ECS requires you to define the
infrastructure in JSON and YAML documents that function like
programmable code to create and destroy virtual infrastructure. The practice of
infrastructure as code – and using the Amazon CloudFormation service - is a
critically important discipline demanded of every ECS consumer.
In addition, to be able to continually release re-versioned services running
Docker containers, the ECS consumer will need to adopt the AWS
CodePipeline and AWS CodeDeploy services. In turn, these AWS services require
the ECS consumer to adopt a source code control system (such as GitHub)
and a Docker container registry (such as Amazon Elastic Container
Registry (ECR)).
A solid understanding of AWS resources and services is required if you
are to successfully DevOps ECS.
Containerized Applications
A container is a runtime environment, such as Docker, in which a process
executes in isolation from other processes. The typical containerized
application:
Service Commitment;
Included Products and Services;
Service Commitments and Service Credits;
Credit Request and Payment Procedures, and
Amazon EC2 SLA Exclusions.
AWS CloudFormation;
VPC;
IAM;
AWS CodePipeline;
AWS CodeDeploy;
Auto Scaling;
Elastic Load Balancing;
CloudTrail;
CloudWatch;
AWS Secrets Manager
AWS System Manager;
AWS AppMesh;
Amazon S3;
Bastion host;
Amazon Elastic Container Registry (ECR).
A deep understanding of the AWS platform, as well as its resources, and
services, is a pre-requisite to gaining the benefits of ECS.
To contribute to your success with ECS, this manuscript provides details about
the AWS platform, resources, and services, including:
AWS Networking;
AWS Compute (EC2 instances);
AWS Storage;
AWS Identity;
Amazon Route 53;
AWS Config;
AWS Security Hub;
AWS DevOps Toolset;
AWS Service Quotas;
AWS Support Tiers;
AWS Cloud Map;
ECS Endpoint
To connect programmatically to an AWS service, you use an endpoint. As
such, ECS has a service endpoint. The service quotas (a.k.a. limits) for an
AWS service are enforced at its service endpoint. The ECS endpoint
represents the ECS backbone.
Amazon VPC
IAM
Amazon ECS
Amazon ECR
Billing and Cost Management
AWS CloudFormation
Amazon CloudWatch
Amazon CloudWatch Events
Amazon CloudWatch Logs
AWS CloudTrail
EventBridge
AWS CodeDeploy
AWS CodePipeline
Amazon EC2
Amazon EC2 Auto Scaling
Elastic Load Balancing
AWS Certificate Manager
AWS Cloud Map
AWS Config
Route 53
AWS App Mesh
AWS Elastic Beanstalk
Amazon EBS
Amazon EFS
Amazon Elastic Inference
AWS Outposts
Amazon SNS
Amazon S3
Unfortunately, at this point in time, the service quota for the Fargate
launch type (used as a compute instance in your ECS solution) is not visible in
the Service Quotas console. For the current service quotas for AWS services, please
review the AWS General Reference Guide that is available on-line.
ECS Pre-requisites
Before starting to DevOps ECS, a number of technical pre-requisites must
be satisfied:
ECS Platform-as-a-Service
The Amazon ECS Platform-as-a-Service (PaaS) is composed of several
configurable artifacts:
VPC
Cluster
Cluster Capacity Provider
Service
Instance Launch Type
Service Load Balancing
Service Deployment
Task
Container
Container Networking Mode
Container Storage
Container Repository
Each of these artifacts is complex and has numerous parameters that have
to be carefully configured by the ECS consumer. These artifacts are not all
that are present in the ECS platform that the consumer must configure, but
they are the core artifacts of ECS.
The VPC is vital to, and the foundation of, the ECS solution. And, it is the
internal and external network traffic required by the Docker containers that
drives the design and provisioning of the VPC and its subnets, their routes
and security groups, as well as gateways.
The ECS cluster supports horizontal scaling of compute instances. The
ECS Service supports both load balancing (of HTTP/HTTPS and TCP traffic)
as well as service discovery.
In addition to the VPC, the other vital ECS artifact is the Task. The Task
has a task definition, as does an ECS Service. The task definition is a group
of Task level configuration parameters plus a list of 1-10 Docker container
definitions. A Docker container definition is part of the task definition.
Consequently, a Docker container cannot be run in ECS in the absence of its
task definition.
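As a concrete, minimal sketch (the family name, image, and port below are hypothetical placeholders, not taken from a real workload), a task definition with a single container definition looks like this:
{
  "family": "web-app",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "nginx:1.19",
      "memory": 512,
      "essential": true,
      "portMappings": [
        {
          "containerPort": 80,
          "hostPort": 80
        }
      ]
    }
  ]
}
Registering a document like this (for example, with the aws ecs register-task-definition command) is what makes the container runnable in ECS.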
An ECS Service launches a Task. A Task can also be launched without
need of an ECS Service. When a Task is launched, its list of containers is
run on a compute instance provisioned in the cluster. With Amazon ECS, you
can control the EC2 instance chosen for the Task according to rules and
constraints that you define at the Task level.
Shared internal or external network traffic requirements can result in more
than 1 container being listed in a task definition. Likewise, the shared need
for other types of namespaces – Process ID; User ID; Mount; IPC; etc. – can
result in more than 1 container being listed in a task definition. Unless a
reason exists for more than 1 container per task definition, adopt the ratio of 1
container to 1 task. Doing so enables both vertical scaling of the container
image as well as horizontal scaling of the container image. As importantly,
that 1:1 ratio minimizes the scope of a container’s attack surface.
Each of the ECS artifacts will be described in detail in this manuscript.
Unquestionably, the ECS architecture is rooted in Docker; therefore, to
understand how to configure ECS, you need a good understanding of Docker.
As this chapter explores each of those ECS artifacts, relevant information
about Docker will be shared.
Docker Background
Docker is a technology that allows you to build, run, test, and deploy
distributed applications that are based on containers. In 2013, Docker Inc (a
startup tech company from San Francisco) originated the Docker container
management service. The original Docker (written in Golang) was a
Platform-as-a-Service (PaaS) called dotCloud. In 2013 the Docker container
management service was released as an open-source project that adheres to the
Apache License 2.0.
The component that runs and orchestrates containers is the Docker
Engine. Like the hypervisor that runs a virtual machine, the Docker Engine is
a runtime environment that runs the container.
Docker can be installed on various Linux distros, on Windows and
Windows Server 2016, and on macOS.
Docker aligns with the Open Container Initiative (OCI) specifications:
1. The image-spec;
2. The runtime-spec;
Docker Images
It is important to understand how Docker builds and stores images as well
as how containers use those images. Creating the Docker image begins after
the application binary is built. The image includes everything the application
needs to run:
Code binary;
OS components the application needs to run (system tools;
system libraries);
Application library dependencies;
File system objects.
Docker images are built in several layers. Layers are stacked on top of
each other. Each layer of the Docker image is read-only (except for the top-
most writeable-layer a.k.a. the container layer).
In Docker, the simplest image is a stack of just three layers.
Over time, as changes are made to the application binary and the Docker
image is rebuilt, a new layer is added to the top of the layer stack. The new
layer is only the set of differences from the layer below it.
It is the application image’s use of kernel resources in isolation of other
processes, and it is how the image may share a namespace with other
processes, that dictate container coupling and the bundling of containers in a
Task. The requirements of an image to exclusively use of a namespace or to
share a namespace are determined by the use case the application image is
designed to support. Both the image’s need to consume or share internal network
namespaces and its need to handle external network traffic have a profound
impact on the design of the VPC used by the ECS cluster and the compute
instances on which the image(s) executes.
Dockerfile
Docker daemon API’s ‘build’ command assembles a Docker image by
reading the instructions contained in a formatted text file, called a Dockerfile.
The Dockerfile instructions explain how to set up the container’s private
filesystem, identify the image the container is based on, as well as how to
run that image in a container.
Once you have the code binary and the Dockerfile you are able to build
the read-only Docker image. When the image is built, an image manifest is
generated. The image manifest contains information about the image, such as
its layers, size, and digest, as well as the operating system and compute
instance architecture the image was built for.
Docker Containers
A container is an instance of an image that can be run on a compute
instance. Once you have a Docker image (held in a registry), the Docker
daemon API’s ‘run’ command starts a container based on the image.
A container is a running process that uses its partition of the compute
instance’s operating system’s kernel resources. The running container:
Together these processes make up the ECS control plane. The ECS
control plane can be manipulated through the ECS Console, the ECS API, the
AWS API, as well as the AWS SDKs.
Docker Engine
The Docker daemon, a.k.a., Docker Engine, implements:
Docker API;
Image management;
Role-based Access Control (RBAC);
Security features;
Core networking, and
Data Volumes.
ECS API
In Amazon ECS, the Docker daemon API is re-titled the ECS API. The
Docker daemon has a RESTful API that is manipulated by a command line
interface (CLI) and is accessed by an HTTP client, such as:
Docker Client;
AWS CLI (accessed through an AWS SDK);
Amazon ECS Console (available within the AWS Management
Console);
ecs-cli command line utility (accessed through an AWS SDK);
The ‘docker’ command can manipulate the Docker API to do such things
as:
Build a Docker image from a Dockerfile and application binary;
Push/pull an image or a container to a registry/repository;
Manage images over their lifecycle;
Manage Docker containers over their lifecycle;
Display container resource usage metrics;
Import a tar ball to instantiate a container’s filesystem;
Docker RBAC
To enforce authentication while making changes to a cluster or
containerized application, the Docker daemon uses RBAC. On AWS, these
Docker roles have been supplanted by numerous IAM roles and service links.
These IAM roles, service links, and policies, will be identified and described
in this manuscript.
ECS Console
The ECS Console (nested within the AWS Management Console)
enables the consumer to create and manage their ECS Clusters, Services,
Tasks, Containers, and compute instances in a single place – that centralized
console is the ECS control plane. The ECS Console is a web page that
enables you to manipulate the Docker daemon API, the docker run command,
as well as the ECS API, through a simple user interface that provides access
to a variety of wizards.
The ECS Console wizards enable the consumer to create ECS Clusters,
Services, Tasks and Containers, to associate them with IAM roles and
policies, to provision them within VPC subnets, as well as to auto scale and
load balance compute instances. And after the creation of the clusters,
services, tasks, and containers, the ECS Console enables the consumer to
manage each artifact across their entire lifecycle.
The ECS Console’s wizards walk the consumer through the steps
involved, let the consumer set configuration parameters, pick IAM roles, use auto
scaling and load balancers, without requiring the consumer to know the
corresponding AWS API calls, Docker daemon API calls, docker run
command options or ECS API calls. The ECS Console makes those API calls
and executes the required commands on the account’s behalf. However, a
useful control plane does not a Software-as-a-Service make. The ECS
Console does not fully manage ECS – human involvement is required. It
helps to view the ECS Console wizards as helpful tutorial guides. However,
at places along the way the wizards provide defaults (IAM roles, VPC, etc.)
that are not suitable for production environments.
Inevitably, once familiar with the steps taken and resources required to
create and deploy VPCs, ECS Clusters, Services, Tasks, Containers, and
compute instances, you need to pick up other tools in the AWS DevOps
toolset that provide the power to fine tune the particulars demanded by a
specific application image.
ECS is a PaaS, and as such, the consumer will write the
infrastructure code that materializes and terminates the VPCs, the ECS
Clusters, Services, Tasks and Containers, and compute instances, across and
throughout their lifecycles. Moreover, that infrastructure code will be
managed as valuable software resources via Continuous Integration.
they require an ECS Service. Therefore, the Amazon ECS platform has 6
levels, each of which the consumer must configure:
6 - Image;
5 – Container and Launch type;
4 – Task and Launch type;
3 – Service and Launch type;
2 - Cluster
1 - VPC
However, when these use cases are supported:
they require an ECS Task, and the ECS platform has just 5 levels that the
consumer must configure:
5 - Image;
4 – Container and Launch type;
3 – Task and Launch type;
2 - Cluster
1 - VPC
The Image level has already been explained. The Container level has been
explained, but only at a high level, and needs further explanation. The
Task, Service, Cluster and VPC levels remain to be explained.
The task of configuring each ECS level is not an exercise independent of
the other levels – each level has dependencies that it shares with the other
levels. The configurations of a Service can dominate its task definition. The
configuration of a task definition can dominate its list of container
definitions. Conversely, parameters in a container definition can over-ride
parameters set in its task definition. In addition, parameters in a task
definition can override parameters set in the Service definition.
It is important to appreciate which Container parameters override the
corresponding Task parameter settings, as well as which Task parameters
override the corresponding Service parameter settings. With good fortune, a
single Service can, by letting its task definition set parameters, be a Service
template that can be used by different task definitions. Likewise, with good
fortune, a single Task can, by letting its container definitions set parameters,
be a Task template that is used by different lists of container definitions.
Last but not least, to provide a secure ECS environment, IAM roles of
various types, and their policies, are present in each level of ECS. It is important
to appreciate how IAM secures and governs each VPC, ECS Cluster, Service,
Task, Container, and compute instance, as well as their access to other AWS
resources and services.
The containers share the same namespaces – they have the same
vulnerable underbelly.
The containers share the same or similar attack surface.
You have lost the ability to horizontally scale those containers
independently of each other.
You have lost the ability to vertically scale those containers
independently of each other.
And, when more than 1 Task is run on the same compute instance:
Other than the SaaS use case, it is extremely rare for the exact same
infrastructure requirements to be shared by two or more of these other types
of ECS use cases:
The use cases well suited to ECS typically have multiple loosely coupled
Services (e.g., 1, 3, and 5) or have multiple sequential Tasks (e.g., 2 and 4).
In most scenarios, each Service or Task launches only 1 Docker container,
where each container automates a single discrete activity present in a
business process model (BPM). As such, the typical ECS use case will
include multiple Services where each Service launches 1 container. For
example, for a product that I and my team designed and delivered there were
64 distinctly different services, only 2 of which were tightly coupled and
therefore part of the same task definition. In well-designed and well-made
distributed systems the vast majority of Services/Tasks have 1 and only 1
container. It is good to be able to place more than 1 container in a
Service/Task when they are tightly coupled but doing so is seldom the right
solution because tight coupling is rarely a good design pattern for use with
distributed systems. There are scenarios where tight coupling is the right
choice but they are definitely the rare exception and are not the general rule.
How a group of related and distributed Services/Tasks interoperates – how
they share data with each other – most often includes stable queues. Though
stable queues are a common part of solutions based on service oriented
architecture, stable queues are not addressed by ECS or by Elastic Kubernetes
Service (EKS). The consumer has to manage provisioning message queues,
data services, caches, and other related integration services, outside of ECS.
ECS is a wonderful technology, but it is not all things to all components of a
distributed system. The adoption of the CloudFormation service by the ECS
consumer is a must. In AWS, only that service is currently able to manage the
ancillary resources and services that the typical ECS solution demands.
Though an ECS cluster supports variations of launch types and a dynamic
variety of different Services and Tasks, best practice is to dedicate a cluster to
a single use case for which its performance is optimized and its cost
minimized. The default practices of 1 VPC + Cluster to 1 logical group of
Services, of 1 VPC + Cluster to 1 logical group of Tasks, are well supported
by the ECS console which provides access from a single point to all clusters
that are the property of the AWS account.
The programmatic control of VPCs, AWS resources and services, ECS
Services, ECS Tasks, and ECS Containers and Images is the key to DevOps
of ECS. When a Service needs to run in its cluster, its compute instances,
their Docker containers, and each supporting AWS resource and service are
programmatically spun up and shut down, scaled in and scaled out, on demand
and dynamically. When a Service or a Task needs to run, all of its parts are
programmatically assembled.
If by good fortune a given VPC + cluster configuration can be proven to
support multiple use cases then the opportunity to associate multiple Services
with a single cluster exists. In the real world that type of good fortune is
extremely rare. Best practice is to divide-n-conquer until there is proof that
more than 1 use case can securely share the same VPC + cluster.
This divide-n-conquer approach does not impede the arising of garbage
application code, does not ensure the success of unit tests, or integration tests,
or workload testing. The divide-n-conquer approach does however isolate
those human error events to a particular chunk of infrastructure running
during a discrete window of time – the problem space is isolated. Process
isolation is the digital equivalent of social distancing – it makes sense to be at
a distance from others who may cause you harm.
It is important to acknowledge that each type of ECS use case has a
distinct attack surface. It is highly improbable (outside of a SaaS use case)
that different collections of containerized applications share the same roles
and digital identities, use the same access permission policies, allow the same
network traffic workflows, use the same infrastructure of AWS resources and
services.
Though ECS use cases of all types may share security vulnerabilities, best
practice is to isolate each ECS use case to its dedicated VPC and cluster and
by doing so ensure that each use case is not compromised or complicated by
any other ECS use case’s attack surface. Neither the VPC nor the ECS
Cluster is a general purpose utility. Taken together the VPC and cluster are a
use case specific construction – both are specific to a given ECS Service or
ECS Task.
Another general rule of distributed systems is that processes must be able
to fail fast. Failures happen due to hardware, failures happen due to bugs in
the application, failures happen due to administration errors. Failing fast
starts the recovery process sooner rather than later. When multiple containers exist
in the same task the process of recovering from failure events is extremely
complicated and recovery time expands significantly.
Lastly, when the EC2 launch type is used there are numerous
configuration matters that must be addressed on a container by container
basis. When different containers, Tasks, and Services, run in the same cluster
it becomes nearly impossible to maintain homogeneity among the EC2
compute instances. Homogeneity of EC2 instances is key to successful auto
scaling and load balancing of an ECS solution. A lack of homogeneity among
EC2 instances is a sure sign of contamination by a multiplicity of Services or
Tasks in a cluster.
1. Cost effectiveness
2. Region
3. Availability Zone(s)
4. VPC’s CIDR Block(s)
5. Amazon DNS Server and Route 53
6. VPC DHCP Options
7. Choice of Subnets, both private and public
8. VPC Router Configuration
9. VPC Security Groups
10. VPC Network ACL
11. Public IPs assignment
12. Non-routable IP addresses
13. AWS Internet Gateway (IGW)
14. AWS Network Address Translation (NAT) Instances
15. AWS NAT Gateways
16. AWS Egress-only Internet Gateway
17. VPC Peering
18. Virtual Private Networks (VPNs)
19. AWS VPC Endpoints
20. Bastion Host
When the ECS console is used to create the ECS cluster and an existing
VPC is not chosen, the wizard will create the VPC into which the cluster will
be provisioned. However, that default VPC is not usable as an integration
testing or a production VPC.
ECS containers are run dynamically. Depending on the ECS use case, the
VPC that underlies the ECS solution may be required to support:
Cluster parameters can be set using the ECS Console, the ECS CLI, as
well as the AWS CLI. Like all things configurable in AWS, the cluster
parameters and their assigned values are captured in a JSON or a YAML
formatted document. Best practice is to declare the cluster definition in an
AWS CloudFormation template, and to manage that document like all files
that participate in your CD/CI workflow.
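For illustration, a minimal CloudFormation template that declares nothing more than an ECS cluster is shown below; the cluster name is a placeholder, and a production template would declare many more resources alongside it:
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Minimal example template that creates an ECS cluster.",
  "Resources": {
    "DemoCluster": {
      "Type": "AWS::ECS::Cluster",
      "Properties": {
        "ClusterName": "demo-cluster"
      }
    }
  }
}
Creating and deleting that stack creates and destroys the cluster, which is exactly the disposable, programmable behavior described above.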
At runtime, additional ECS cluster properties can be perceived in the ECS
console:
The launch type defined in the ECS Service or ECS Task determines the
type of compute instance and therefore the compute instance capacity
available to the Service or Task. In AWS, a capacity provider is a way to
manage the EC2 and Fargate compute instances that host your Docker
containers, and allows you to define rules for how the containers (listed in the
task definition) use that compute instance capacity. The capacity provider
manages the horizontal scaling of that compute instance capacity. During a
blue-green deployment the capacity provider is not functional.
When the Fargate or Fargate Spot launch type is used, the FARGATE or
FARGATE_SPOT capacity providers are provided automatically to ECS.
When the ECS Service or ECS Task uses the EC2 launch type, you create a
capacity provider that is associated with an EC2 Auto Scaling Group (ASG).
To be able to scale out, the ASG must have a MaxSize greater than zero. A
service-linked IAM role is required so ECS can use the Auto Scaling service
on behalf of an ECS Service or Task.
The capacity provider manages scaling of the ASG through ECS Cluster
Auto Scaling. Amazon ECS adds an AmazonECSManaged tag to the ASG
when it associates it with the capacity provider. Do not remove the
AmazonECSManaged tag from the ASG. If this tag is removed, Amazon
ECS is not able to manage the ASG when scaling your cluster. When using
capacity providers with ASGs, the autoscaling:CreateOrUpdateTags
permission is needed on the IAM user creating the capacity provider.
When you create an ASG capacity provider, you decide whether or not to
enable:
Managed scaling – ECS manages the scale-in and scale-out actions of the
ASG on your behalf;
Managed termination protection – EC2 instances in the ASG that host
running Tasks are protected from termination during a scale-in action.
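A sketch of the JSON input that could be passed to the ECS CreateCapacityProvider API (for example, via the AWS CLI create-capacity-provider command) is shown below; the capacity provider name and ASG ARN are placeholders:
{
  "name": "demo-asg-capacity-provider",
  "autoScalingGroupProvider": {
    "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:example:autoScalingGroupName/demo-asg",
    "managedScaling": {
      "status": "ENABLED",
      "targetCapacity": 100,
      "minimumScalingStepSize": 1,
      "maximumScalingStepSize": 100
    },
    "managedTerminationProtection": "ENABLED"
  }
}
Note that enabling managed termination protection also requires instance scale-in protection to be enabled on the ASG itself.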
Isolating a process ensures that no process can interfere with any other
processes running on the same compute instance. However, there are use
cases where a process needs to communicate and interoperate with other
processes running on the same compute instance (e.g., PaaS; microservices)
as well as with other endpoints hosted on other compute instances located in
a private subnet or in a public subnet in the VPC or located somewhere in the
Internet.
To support this local network traffic between processes (a.k.a., Docker
containers) a network namespace - a copy of the kernel’s network stack - is
provisioned, with its own VPC routes, firewall rules (security groups), and
network devices (Elastic Network Interface (ENI)).
Docker containers have access to these Linux namespaces:
And, when more than 1 Task is run on the same compute instance:
Multiple Service Tasks (with their containers) can run on the same
compute instance at the same time. The ECS Service also has the ability to
define rules and constraints that govern the placement of a Task on a
compute instance provisioned in the cluster. ECS has recently introduced
capacity providers and capacity provider strategies to supersede placement
rules and constraints.
The ECS Service definition has these parameters:
clusterARN – the ARN of the cluster that hosts the ECS Service.
The same named ECS Service can run simultaneously on 1-N
different clusters.
serviceARN – the Amazon Resource Name (ARN) that identifies
the ECS Service.
serviceName – the name of the ECS Service (that must be
unique per cluster).
taskDefinition – the task definition (a.k.a., the task) used by the
ECS Service to launch the Docker container(s) listed in the task
definition. The task definition contains the list of 1-10 container
definitions, as well as other Task level configuration parameters,
some of which override certain Service parameter settings.
When the task runs all of its containers run on the same compute
instance.
launchType – the compute instance launch type on which the
ECS Service’s containers will be running. The default is EC2.
All containers launched by the ECS Service run on the same
launch type and run on the same compute instance. The ECS
Task also has a parameter that specifies the launch type used by
its containers, thereby overriding the Service’s launchType
parameter.
platformVersion – specifies the Fargate version that provides the
Fargate launch type instances used by the task’s container.
desiredCount – the desired count of Tasks (instantiations of the
task definition) that are kept running in the cluster for the ECS Service.
capacityProviderStrategy - the capacity provider strategy
associated with 1 or more capacity providers. A capacity
provider strategy gives you control over how a Task uses one or
more capacity providers. When a Task is run, or when a Service
is created, a capacity provider strategy is specified. A capacity
provider strategy consists of one or more capacity providers
with an optional base and weight specified for each provider.
With EC2 instances, the capacity provider strategy designates
how many hosts, at a minimum, to run on a specific Auto
Scaling Group (ASG) supported by the ECS Cluster. The
strategy also designates the relative percentage of the total
number of launched instances that should use that ASG.
loadBalancers – the list of Elastic Load Balancing (ELB)
objects that support the containers launched by the ECS Service.
Can be used with both launch types.
roleARN – The ARN of the IAM Role that allows the Amazon
ECS Container Agent to register compute instances with an
Elastic Load Balancing (ELB).
serviceRegistries – the list of Service Discovery Registries
assigned to the Service.
networkConfiguration – the network mode that all containers
listed in the task definition will use. The container’s network
mode can be set to none, bridge, host, or awsvpc. Alternatively,
the task definition can specify the network mode used by all of
its containers.
placementStrategy – the placement strategy that determines how
containers are placed on an EC2 launch type. Not used by the
Fargate launch type.
placementConstraints – a list of constraints applied when a Task
(and its container(s)) is placed on an EC2 launch type. Not used
by the Fargate launch type because each Task gets its own
compute instance. You can specify up to 10 constraints per Task
(including constraints in the task definition and those specified
at runtime).
schedulingStrategy – there are two ECS Service scheduling
strategies: REPLICA or DAEMON.
healthCheckGracePeriodSeconds – the count of seconds that the
ECS Service Scheduler ignores unhealthy Elastic Load
Balancing (ELB) target health checks after an ECS Task has
first started.
tags – the metadata you append to resources (used by the ECS
Service) to help you categorize and organize resources. The
maximum number of tags per resource is 50.
propagateTags – specifies whether tags from the ECS Service
or the task definitions are propagated to the Tasks.
enableECSManagedTags – enables ECS managed tags for the
tasks in the service.
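Pulling a subset of those parameters together, a minimal, hypothetical service definition (the kind of JSON passed to the CreateService API) might look like the following; the cluster, task definition, subnet, and security group identifiers are placeholders, and the FARGATE and FARGATE_SPOT capacity providers are assumed to already be associated with the cluster:
{
  "cluster": "demo-cluster",
  "serviceName": "web-service",
  "taskDefinition": "web-api:3",
  "desiredCount": 2,
  "schedulingStrategy": "REPLICA",
  "capacityProviderStrategy": [
    {
      "capacityProvider": "FARGATE",
      "base": 1,
      "weight": 1
    },
    {
      "capacityProvider": "FARGATE_SPOT",
      "weight": 3
    }
  ],
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": ["subnet-0example"],
      "securityGroups": ["sg-0example"],
      "assignPublicIp": "DISABLED"
    }
  }
}
Because a capacityProviderStrategy is specified, the launchType parameter is omitted; the two are mutually exclusive.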
Though the cluster can configure its default capacity provider strategy, the
ECS Service has a capacity provider strategy parameter that overrides the
cluster’s default. In addition, the ECS Service supports load balancing - the
balanced distribution of HTTP (or TCP) requests across containers, compute
instances, and compute instance ports – as well as service discovery.
An ECS Service can be created, listed, described, deleted, as well as
updated (i.e., re-deployed). An ECS Service can be deleted if it has no
running Tasks and the desired Task count is zero. An ECS Service is updated
when any combination of these changes are made to the ECS Service’s
definition:
task definition;
Fargate platform version;
deployment configuration;
desired count of compute instances;
Like all things configurable in AWS, the ECS Service parameters and
their assigned values are captured in a JSON-formatted or a YAML
document. Best practice is to declare the ECS Service definition in an AWS
CloudFormation template, and to manage that document like all files
that participate in your CD/CI workflow.
At runtime, additional ECS Service properties can be perceived in the
ECS Console:
Task placement is not supported using the Fargate launch type. By default,
Fargate distributes the containers (of a REPLICA scheduled Service) across
multiple AZs in the Region (that contains the VPC subnets used by the
cluster).
When a Task uses the EC2 launch type, ECS has to determine (based on
the information contained in the task definition and its list of constraint
definitions) on which EC2 instance to place each container of the task.
A task placement constraint is a rule that is considered during container
placement on the EC2 instance in the cluster.
ECS supports EC2 instance AMIs that are based on a variety of operating
systems, such as Linux, CoreOS, Ubuntu, and Windows. Amazon highly
recommends using ECS-optimized AMIs. Regardless of the AMI chosen, to
reduce costs significantly AWS recommends the use of Spot Instances for
container processes that can be interrupted. You incur additional charges
when your container uses other AWS services or transfers data.
EC2 instances require external network access to communicate with the
ECS endpoint, but they do not require any inbound ports to be
open. However, to examine containers with Docker commands an SSH rule
must be added so that you can log into the EC2 instance. Best practice is to
refrain from allowing SSH access from all IP addresses (0.0.0.0/0).
Best practices when using the EC2 launch type, as needed:
Docker Client;
Docker daemon (dockerd), a.k.a., the Docker Engine;
shim (docker-containerd-shim);
runc (docker-runc);
containerd (docker-containerd);
Amazon ECS Container Agent;
ecs-init Service;
Bottlerocket
Where the container (and image) uses a general purpose operating system
(OS), updating the OS – package by package – is difficult to automate in
Amazon ECS and EKS. Amazon ECS and EKS now support Bottlerocket -
an open source Linux-based operating system built by AWS for running
Docker containers on both EC2 instances as well as bare metal hosts.
Bottlerocket is available as an AMI for EC2 instances. Bottlerocket includes
only the essential software needed to run Docker containers, and this
improves instance usage and reduces the compute instance’s attack surface
(compared to a general purpose OS).
Updates to the Bottlerocket OS are applied in a single step (not on a
package by package basis) (and can be rolled back in a single step if need
be). This single step increases container uptime and greatly simplifies the
management of OS upgrades to instances that have orchestrated Docker
containers running on them.
Fargate Spot
Like EC2 Spot instances, you can use Fargate Spot instances to run
containers that are interruption tolerant. The Fargate Spot rate is
discounted compared to the Fargate price. The consumer incurs additional
charges when their containers use other AWS services or transfer data.
Fargate Spot runs Docker containers on spare compute capacity. When
AWS needs the capacity back the container is interrupted with a 2-minute
warning. When containers using the Fargate and Fargate Spot instances are
stopped a task state change event is sent by ECS to Amazon EventBridge.
The stopped reason describes the cause.
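As an illustration, an EventBridge event pattern along the following lines (the cluster ARN is a placeholder) matches those task state change events; a rule built from this pattern can route each STOPPED event, whose stoppedReason field carries the cause such as a Spot interruption, to a target such as an SNS topic:
{
  "source": ["aws.ecs"],
  "detail-type": ["ECS Task State Change"],
  "detail": {
    "clusterArn": ["arn:aws:ecs:us-east-1:123456789012:cluster/demo-cluster"],
    "lastStatus": ["STOPPED"]
  }
}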
1. AWS CodeDeploy
2. EXTERNAL
CodeDeploy Pre-requisites
Given an ECS Cluster and an ECS Service registered with that cluster and
which has its deployment controller set to CodeDeploy, there are a number of
other pre-requisites that the ECS Service must satisfy before CodeDeploy can
be used:
a group name,
the names of the ECS Cluster and ECS Service (whose
deployment controller is set to CodeDeploy),
the load balancer,
the production and the test listeners,
the two target groups,
deployment settings (e.g., when to reroute network traffic to the
replacement task set; when to terminate the primary task set),
triggers, alarms, and rollback behavior.
The CodeDeploy console, the AWS CLI, and the CodeDeploy API can be
used to view the deployment group associated with the application.
For an ECS Service to be deployed using CodeDeploy, specific
information about the ECS Service has to be entered into the AppSpec file:
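The exact fields are covered in the CodeDeploy documentation, but the overall shape of an ECS AppSpec file (shown here in its JSON form, with a placeholder task definition ARN, container name, and port) is roughly the following:
{
  "version": 0.0,
  "Resources": [
    {
      "TargetService": {
        "Type": "AWS::ECS::Service",
        "Properties": {
          "TaskDefinition": "arn:aws:ecs:us-east-1:123456789012:task-definition/web-api:3",
          "LoadBalancerInfo": {
            "ContainerName": "web",
            "ContainerPort": 80
          }
        }
      }
    }
  ]
}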
Blue-green Deployments
CodeDeploy uses blue-green deployments with ECS Services.
Fundamental to blue-green deployments, compute instance network traffic is
rerouted behind an ELB load balancer by using listeners. The consumer has
to specify how network traffic is shifted from the old primary task set to the
new replacement task set:
During the blue-green deployment the load balancer allows the test and
production listeners to route network traffic to the new compute instances and
containers in the replacement task set, according to the rules the consumer
specifies. During this time period, the load balancer allows the test and
production listeners to block network traffic to the old compute instances and
containers in the primary task set.
After the new compute instances in the replacement task set are registered
with the load balancer, the old compute instances in the primary task set are
deregistered and terminated (if desired).
Lastly, during the blue-green deployment the cluster’s capacity provider is
deactivated. Therefore, the infrastructure of neither the primary task set nor the
replacement task set can scale horizontally during the deployment stage.
Rollbacks
You can manually roll back a Service deployment. However, CodeDeploy
is able to automatically roll back a deployment. During a rollback
CodeDeploy reroutes network traffic from the new replacement task set to the
old primary task set. Rollback behaviors are set in the App Spec file. The
consumer can choose 1 of 3 behaviors:
The task family and container definitions are mandatory parts of a task
definition, while the other parts are optional. It is important to understand that
at all times a Task is a logical group of 1-10 Docker containers plus some
Task level parameters.
When a task definition lists more than 1 container that is usually due to
the need of those containers to:
Though there are use cases that demand more than 1 container per task,
having more than 1 container per task incurs potential losses. When a task
has more than 1 container:
And, when more than 1 Task is run on the same compute instance:
When you register a task definition, there are Task level parameters that
allow you to specify the task size in terms of total cpu and memory used by
the containers in the task. If using the EC2 launch type, these fields are
optional. If using the Fargate launch type, these fields are required and there
are specific values for both cpu and memory that are supported. This is
separate from the cpu and memory values at the container definition level.
Task level CPU and memory parameters are ignored for Windows containers,
but container-level resources are supported by Windows containers. Those
Task level parameters are:
cpu – the hard limit of CPU units available for the containers of
the task. It can be expressed as an integer using CPU units, or as
a string using vCPUs. When the task definition is registered, a
vCPU value is converted to an integer indicating the CPU units.
If using the EC2 launch type, this field is optional. If your
cluster does not have any registered compute instances with the
requested CPU units available, the task will fail. Supported
values are between 128 CPU units (0.125 vCPUs) and 10240
CPU units (10 vCPUs). If using the Fargate launch type, this
field is required and you must use one of the following values
(256, 512, 1024, 2048, or 4096 CPU units), which determines
your range of supported values for the memory parameter.
memory – the hard limit of memory (in MiB) available for the
containers of the task. It can be expressed as an integer using
MiB, or as a string using GB. When the task definition is
registered, a GB value is converted to an integer indicating the
MiB. If using the EC2 launch type, this field is optional and any
value can be used. If a Task level memory value is specified
then the container-level memory value is optional. If your
cluster does not have any registered compute instances with the
requested memory available, the task will fail. If using the
Fargate launch type, this field is required.
• For tasks that use the host IPC mode, IPC namespace related
systemControls are not supported.
• For tasks that use the task IPC mode, IPC namespace related
systemControls will apply to all containers within a task.
The ipcMode parameter is not supported for Windows containers or
tasks using the Fargate launch type.
availabilityZone
version
connectivity
connectivityAt
createdAt
pullStartedAt
pullStoppedAt
startedAt
startedBy
stopCode
stoppedAt
stoppedReason
stoppingAt
executionStoppedAt
desiredStatus
healthStatus
lastStatus
status – the current status of the ECS Task (and its containers):
PROVISIONING, PENDING, ACTIVATING, RUNNING,
DEACTIVATING, STOPPING, DEPROVISIONING,
STOPPED.
Like all things configurable in AWS, the ECS Task parameters and their
assigned values are captured in a JSON-formatted or a YAML document.
Best practice is to declare the ECS Task definition in an AWS
CloudFormation template, and to manage that document like all files
that participate in your CD/CI workflow.
RunTask API
The ECS API supports the RunTask command which allows ECS to place
containers for you, or you can customize how ECS places containers by using
placement constraints and placement strategies. The RunTask command
accepts the following data in JSON format:
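The full list of parameters is in the ECS API Reference; a minimal, hypothetical run-task input for a Fargate launch might look like the following (the cluster name, task definition, subnet, and security group identifiers are placeholders):
{
  "cluster": "demo-cluster",
  "taskDefinition": "batch-job:1",
  "count": 1,
  "launchType": "FARGATE",
  "networkConfiguration": {
    "awsvpcConfiguration": {
      "subnets": ["subnet-0example"],
      "securityGroups": ["sg-0example"],
      "assignPublicIp": "DISABLED"
    }
  }
}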
StartTask API
The ECS API supports the StartTask command which the consumer uses
to start a Task according to the specified task definition, with its 1-10
containers hosted on the specified compute instance(s). The StartTask command
accepts the following data in JSON format.
cluster
taskDefinition
networkConfiguration
containerInstances – the compute instance IDs or full ARN
entries for the compute instances on which you would like to
place your containers. You can specify up to 10 compute
instances.
group
over-rides
enableECSManagedTags
propagateTags
referenceId
startedBy
tags
StopTask API
The ECS API supports the StopTask command which the consumer uses
to stop a Task according to the specified task ID. The StopTask command
accepts the following data in JSON format:
Scheduled Tasks
Amazon ECS supports the ability to schedule Tasks on either a cron-like
schedule or in response to CloudWatch Events. If you have a Task that
needs to run at set intervals you can use the ECS console to create a
CloudWatch Events rule that will run the Task at the specified times.
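Under the hood this is a CloudWatch Events/EventBridge rule with an ECS target. As an illustration (the resource ARNs are placeholders and an events IAM role is assumed to already exist), a CloudFormation resource fragment for a Task that runs every 15 minutes could look like this:
{
  "Resources": {
    "ScheduledTaskRule": {
      "Type": "AWS::Events::Rule",
      "Properties": {
        "ScheduleExpression": "rate(15 minutes)",
        "State": "ENABLED",
        "Targets": [
          {
            "Id": "scheduled-batch-job",
            "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/demo-cluster",
            "RoleArn": "arn:aws:iam::123456789012:role/demoEcsEventsRole",
            "EcsParameters": {
              "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/batch-job:1",
              "TaskCount": 1
            }
          }
        ]
      }
    }
  }
}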
dockerVolumeConfiguration
ipcMode
pidMode
placementConstraints
disableNetworking
dnsSearchDomains
dnsServers
dockerSecurityOptions
extraHosts
gpu
links
privileged
systemControls
Task Lifecycle
All resources in the cloud are temporary. The task can transition through 8
states during its lifecycle:
Docker Volumes
Docker volumes are managed by Docker and are a directory within the
Docker storage directory on the compute instance’s file system. The Docker
volume exists outside, and beyond, the lifecycle of the Docker container.
A few notes about Docker volumes:
The Docker bind mount has been given the alias ‘host volume.’
Both EC2 and AWS Fargate launch types support Docker bind
mounts. However, bind mounts on AWS Fargate tasks are non-
persistent.
ECS does not sync bind mount data across containers;
By default, 3 hours after the container exits, the ECS Container
Agent deletes the data;
With ECS, if the bind mount’s directory path on the compute
instance is not defined, the Docker daemon will create the
directory, but the data is not persisted after the container is
stopped.
To provide bind mounts to containers in ECS, you must specify the host
parameter (and the optional sourcePath parameter) in the volumes section of
the task definition. The host and sourcePath parameters are not supported for
Fargate Tasks. Before the containers can use bind mount (a.k.a., host
volumes), you must specify the volume configuration within the task
definition as well as the mount point configuration within the containers
definitions. The volumes parameter is a Task level property and the
mountPoints parameter is a Container level property.
Immediately below is a segment of a task definition – sourced from the
public AWS web site – that shows the syntax used to define a bind mount.
{
"family": "",
...
"containerDefinitions" : [
{
"mountPoints" : [
{
"containerPath" : "/path/to/mount_volume",
"sourceVolume" : "string"
}
],
"name" : "string"
}
],
...
"volumes" : [
{
"host" : {
"sourcePath" : "string"
},
"name" : "string"
}
]
}
With the free sample as our template, here are the descriptions of the bind
mount configuration parameters found within the volumes section of the task
definition:
And, here are the descriptions of the bind mount configuration parameters
found within the container definitions section of the task definition:
scope – the scope for the Docker volume, which determines its
lifecycle. Docker volumes that are scoped to a task are
automatically provisioned when the task starts and destroyed when
the task is cleaned up. Docker volumes that are scoped as shared
persist after the task stops.
autoprovision - if this value is true, the Docker volume is created if
it does not already exist. This field is only used if the scope is
shared. If the scope is task then this parameter must either be
omitted or set to false.
driver – the driver to use with the Docker volume.
driverOpts – the options to pass through to the Docker volume
driver.
mountPoints (mandatory).
sourceVolume (mandatory).
containerPath (mandatory).
readOnly .
name (mandatory).
image (mandatory).
command (mandatory) - "ls -la /mount/efs". The command that
is passed to the container.
entryPoint - The entry point that is passed to the container
mountPoints (mandatory).
sourceVolume (mandatory).
containerPath (mandatory).
readOnly .
To provide tmpfs to the Task’s containers, you must define the tmpfs
parameter in the container definitions section of the task definition:
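The tmpfs mount is declared under the container definition’s linuxParameters. A minimal sketch follows (the container name, image, and mount path are placeholders; tmpfs is not supported on the Fargate launch type):
"containerDefinitions": [
  {
    "name": "scratch-worker",
    "image": "amazonlinux:2",
    "memory": 256,
    "essential": true,
    "linuxParameters": {
      "tmpfs": [
        {
          "containerPath": "/tmp/scratch",
          "size": 128,
          "mountOptions": ["rw"]
        }
      ]
    }
  }
]
The size value is expressed in MiB.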
Task Retirement
The phrase ‘task retirement’ refers to stopping or terminating a Service, a
Task, and their containers. ECS retires a task when:
When ECS retires that task, the AWS account is sent an email notifying
them of the pending task retirement. If the task is part of a Service, the task is
stopped and the service scheduler automatically restarts the task. If a
standalone task is retired then you must launch the replacement task.
ECS Task Container Definition
The Task level contains a list of 1-10 container definitions. Each
container definition is held within the Task’s containerDefinitions
parameter. The container definition is the Container level of the ECS
platform. The container definition contains configuration information that
ECS passes to the Docker Remote API (and that can also be used for option
values passed to docker run).
When the Task runs all of its Docker containers run on the same compute
instance. In addition, multiple tasks (with their containers) can run on the
same compute instance at the same time.
It is with a container definition that you specify how each container is
configured to support its Docker image. A container definition is specific to
the Docker image present in the container, the namespaces that the image
requires, and any dependency the container has with another container in the
task. And, it is the dependencies between containers that cause them to be
listed together in the same task definition. Unless containers are inter-
dependent, there is little reason to list more than one container in a task
definition.
In ECS, parts of a container can be configured at the Service level and at
the Task level, and are fully defined, of course, at the Container level. The
parameters of the Container level override both the corresponding Task
parameters and Service parameters. It is important to appreciate the Container
parameters that are used to override container configuration set at the Task
or Service level.
AWS created three categories of parameters of an ECS container
definition:
1. Standard;
2. Advanced, and
3. Other.
Memory, and
Port Mapping.
Health Check
Environment
Network Settings
Storage and Logging
Security
Resource Limits
Container Health Check Parameters
There are six container health check parameters:
Linux Parameters
Container Dependency
Container Timeouts
System Controls
‘If you are setting an IPC resource namespace to use for the containers
in the task, the following will apply to your system controls. For tasks
that use the host IPC mode, IPC namespace systemControls are not
supported. For tasks that use the task IPC mode, IPC namespace
systemControls values will apply to all containers within a task. This
parameter is not supported for Windows containers or tasks using the
Fargate launch type.’ (Amazon ECS API Reference)
Attach to the IAM role associated with the EC2 compute instance
the Amazon CloudWatch Logs IAM policy which allows ECS to
send logs to CloudWatch.
Create a security group or update existing security group so that
ports are open for the inference server, for:
MXNet inference, ports 80 and 8081 open to TCP
traffic.
TensorFlow inference, ports 8501 and 8500 open to TCP
traffic.
ECS Monitoring
AWS provides the following automated monitoring tool to observe your
ECS solution and report when something goes wrong:
Governance,
Compliance,
Operational Auditing, and
Risk Auditing
CloudWatch Logs
CloudWatch Logs are used to monitor and troubleshoot
systems/applications by using your existing system, application, and custom
log files. CloudWatch Logs is a managed service that collects and keeps your
logs. CloudWatch can aggregate and centralize logs, across multiple sources.
CloudWatch can use S3 or Glacier to store your logs. By default, metrics data
is retained in S3 for a 2-week period.
Best practice – in this use case, due to an application bug, the EC2
instance must be rebooted. Ensure the bug event is captured in the application
log. CloudWatch Logs can be used to monitor that application log for the
error keywords, create an alarm, and then restart the EC2 instance.
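A sketch of the kind of metric filter that supports this pattern is shown below (the log group name, filter name, metric name, and namespace are placeholders); it could be created with the CloudWatch Logs PutMetricFilter API, and a CloudWatch Alarm with an EC2 reboot action would then watch the resulting metric:
{
  "logGroupName": "/demo/application-log",
  "filterName": "application-error-filter",
  "filterPattern": "ERROR",
  "metricTransformations": [
    {
      "metricName": "ApplicationErrorCount",
      "metricNamespace": "DemoApp",
      "metricValue": "1"
    }
  ]
}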
CloudWatch Events
A CloudWatch Event is aware of operational changes in an AWS cloud
environment as they are occurring and generates a stream of information. A
CloudWatch Event delivers a near real-time stream of system events that
describe changes in AWS resources in an environment. CloudWatch Events
use simple rules that match events present in the stream, and routes those
matched events to one or more target functions or streams, which take a
corrective action in response as needed. CloudWatch Events can also be used to
schedule automated actions that are triggered at certain times using cron-style
or rate expressions.
CloudWatch Alarms
A CloudWatch Alarm monitors a single metric over a period of time that
you specify and performs 1:N actions based on the value of the metric
relative to a given threshold over a number of time periods. For example, an
alarm can trigger the reboot of an EC2 instance, or the scale-in and the scale-
out of instances in an ASG.
An Alarm has these states:
OK;
ALARM;
INSUFFICIENT_DATA.
You can set up rules and an action can be taken whenever a change is
detected. For example, a CloudWatch alarm action can be used to
automatically stop, terminate, reboot, or recover an EC2 instance. Stop and
terminate can be used to optimize cost savings when an instance does not
need to be run. Reboot and recover actions are used when an instance is
impaired.
When the alarm changes state, the action is an SNS notification that is
instantly sent to a target you choose:
A Lambda function;
An Auto Scaling Policy;
An SQS queue;
An SNS topic;
A Kinesis Stream;
A built-in target.
2 Levels of Monitoring
Basic Monitoring
Detailed Monitoring
It is best practice to use the Docker Client as a non-root user (i.e., a login
account of the operating system running on the EC2 launch type). Add the
non-root user to the ‘docker’ group (in the operating system).
Introduction
Kubernetes is an open-source technology that manages (a.k.a.
orchestrates) and provisions containerized applications that can scale, fail and
self-heal quickly, as well as be revised in place while running. It is common
practice to use Docker to develop containerized applications and use
Kubernetes to orchestrate those containers in cloud, on-premise, as well as
hybrid, environments.
Amazon Elastic Kubernetes Service (Amazon EKS) is a Region-based
fully managed Kubernetes service. EKS stands-up and maintains the
Kubernetes control plane which is accessible via the Amazon Management
Console as well as the eksctl and kubectl command line utilities.
The EKS control plane takes care of deploying containers and keeping
them running. EKS provides a scalable and highly-available control plane
that runs across multiple Availability Zones (AZs) to eliminate a single point
of failure. The EKS consumer is responsible for defining the underlying VPC
and security groups, the components of Kubernetes’ data plane, the Docker
containers, and the IAM roles and policies that secure the overall EKS
solution.
EKS is seamlessly integrated with CloudWatch, Auto Scaling Groups,
IAM, VPC, as well as AWS App Mesh. These integrations provide
monitoring, autoscaling, security, networking, load balancing, and service
discovery.
EKS runs upstream Kubernetes and is certified Kubernetes conformant so
you get all benefits of open source tooling from the community. At present,
EKS supports Kubernetes versions 1.12, 1.13, 1.14, and 1.15.
Create and configure the VPCs that the Kubernetes clusters use,
or
Configure the EC2 instances that host Kubernetes Pods.
The difficulties and challenges inherent to VPC and EC2 instances do not
magically disappear with EKS.
Service Commitment;
Included Products and Services;
Service Commitments and Service Credits;
Credit Request and Payment Procedures, and
Amazon EC2 SLA Exclusions.
Kubernetes Architecture
Google created Kubernetes for use in their back office. In 2014, Google
released Kubernetes as an open-source project, later donating it to the Cloud
Native Computing Foundation (CNCF); it is licensed under the Apache 2.0 license.
Docker and Kubernetes are complementary technologies. A Kubernetes
node uses Docker as its container runtime environment. Kubernetes uses
Docker to start and stop containers. Kubernetes functions at a higher-level –
it decides which nodes to run containers on, decides when to scale the nodes
in and out, up and down, updates the containers and keeps them running.
On the surface, Kubernetes has two components:
EKS Clusters
Unlike Amazon ECS clusters, the Kubernetes cluster consists of a control plane
and worker nodes. The worker nodes host the containers – they can be VMs
in a public or private cloud, or on-premise bare metal servers. The control
plane exposes an API, schedules containers on nodes, implements auto
scaling, handles container updates, and records the cluster’s state in persistent
storage.
With EKS, you can choose to use Kubernetes versions 1.12 through 1.15
for a given cluster. Each cluster has 1 or more masters (a.k.a., heads or head
nodes) and a group of worker nodes. Taken together, the masters are the
control plane. The worker nodes host the running containers and are
subordinate to the control plane. The maximum number of EKS clusters per
Region, per AWS account, is 100. The maximum number of control plane
security groups per cluster is 4.
AWS charges $0.10 per hour for each Amazon EKS cluster that you
create. The Kubernetes cluster can be run on AWS using either EC2 or AWS
Fargate compute instances, as well as on-premise using AWS Outposts. You
are charged a fee for the AWS resources (e.g., compute instances, storage,
etc.) and AWS services that are part of the cluster. However, you only pay for
the resources and services that you use, as you use them - there are no
minimum fees and no upfront commitments with EKS.
Every Kubernetes cluster also has an internal DNS server (a.k.a., kube-dns). With Amazon EKS, however, the kube-dns server is replaced by CoreDNS, the CNCF DNS server that EKS installs as the cluster DNS. A Linux worker node is used to run an EKS core system Pod called coredns.
A VPC and its security group(s) can be used by multiple EKS clusters.
For better network isolation, AWS recommends that each EKS cluster use a
separate VPC.
There are eighteen environment variables that are used to configure the
CNI plugin. Please refer to the ‘Amazon EKS User Guide’ for details about
these CNI plugin environment variables and their complex configurations.
Pod-to-Pod network traffic within the VPC flows directly between private IP addresses and requires no Source Network Address Translation (SNAT). When network traffic addresses an endpoint outside the VPC, by default the CNI plugin translates the private IP address of each Pod to the primary private IP address assigned to the primary ENI (eth0) of the EC2 worker node that the Pod is running on.
With this SNAT behavior enabled on the VPC, Pods can communicate bi-directionally with Internet endpoints. The EC2 worker node must be launched in a public subnet and have a public IP address assigned to the primary private IP address of its primary ENI. The network traffic is translated to and from that public IP address and routed to and from the Internet by an Internet gateway.
The EKS cluster IAM role must have these two AWS-managed policies attached:
AmazonEKSServicePolicy, and
AmazonEKSClusterPolicy.
EKS can publish five types of control plane logs to CloudWatch:
1. API Server;
2. Scheduler;
3. Controller Manager;
4. Audit – a record of the individual users, Kubernetes administrators, or control plane components whose actions impact the state of the cluster;
5. Authenticator – a record produced by the control plane component that EKS uses for Kubernetes RBAC authentication using IAM credentials.
The logs of each EKS log type that you enable can be viewed in the
CloudWatch console.
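As a sketch, and assuming the eksctl configuration schema, control plane logging might be enabled in the cluster configuration file; the cluster name and Region remain hypothetical:

    # enable all five control plane log types (illustrative only)
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: demo-cluster
      region: us-east-1
    cloudWatch:
      clusterLogging:
        # any subset of the five log types listed above
        enableTypes: ["api", "audit", "authenticator", "controllerManager", "scheduler"]

Each enabled log type then appears under the cluster's log group in the CloudWatch console.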
In the VPC's private subnet(s), Amazon EKS places and manages the Elastic Network Interfaces (ENIs) used for control plane to worker node communication. EKS applies the security groups to the ENIs it creates for the control plane.
A worker node is a Linux-based EC2 instance, a Fargate instance, a Windows-based EC2 instance, or a GPU-enhanced Linux-based EC2 instance. Each worker node has three main components: the kubelet, a container runtime (Docker/containerd), and kube-proxy.
By default the API server endpoint is public to the Internet, and so are the worker nodes.
The worker node IAM role must have these AWS-managed policies attached:
AmazonEKSWorkerNodePolicy;
AmazonEKS_CNI_Policy;
AmazonEC2ContainerRegistryReadOnly.
Node Groups
A new EKS cluster is created without a worker node group. In EKS, 1 or more worker nodes are deployed into a node group. All compute instances in the node group must:
be the same instance type,
be running the same Amazon Machine Image (AMI), and
use the same worker node IAM role.
To run a containerized service on a Kubernetes cluster, the general workflow is to:
1. Design and build the service, create its Docker image, and store the image in a repository;
2. Package each service image in a Docker container;
3. Wrap each Docker container in its own Pod – a Pod is a wrapper that is required to run a Docker container on a Kubernetes cluster;
4. Deploy Pods to the cluster by using a declarative manifest file (in YAML or JSON format) and a controller such as Deployments, DaemonSets, StatefulSets, or CronJobs. These manifest files are to be held within a source code control system and serve their role in CI/CD. A sketch of such a manifest follows this list.
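As a minimal sketch of step 4, here is a hypothetical Deployment manifest; the service name and image URI are illustrative placeholders:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: orders-api                    # hypothetical service name
    spec:
      replicas: 3                         # desired number of Pod copies
      selector:
        matchLabels:
          app: orders-api
      template:                           # the Pod wrapping the Docker container
        metadata:
          labels:
            app: orders-api
        spec:
          containers:
            - name: orders-api
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:1.0.0   # hypothetical ECR image
              ports:
                - containerPort: 8080     # port the service listens on

Applying this file with kubectl apply -f asks the Deployment controller to create, and then maintain, the three Pod replicas.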
Kubernetes Controllers
Unlike ECS – which relies on CodeDeploy, or on ECS Tasks that are run/started by command – Kubernetes uses internal controllers to deploy Pods on a cluster. By itself, a Pod is not capable of scaling, self-healing, or handling rolling updates or rollbacks. For these capabilities, the Pod depends on a Kubernetes controller (which is identified in the Pod's manifest file). Kubernetes has 4 such controllers: Deployments, DaemonSets, StatefulSets, and CronJobs.
CPU and memory – the vCPU and memory reservations in the Pod specification are used to determine the CPU and memory resources provisioned to the Pod by Fargate (see the Pod sketch after this list). Fargate adds 256MB to each Pod's memory reservation for use by the Kubernetes worker node components (kubelet, containerd, kube-proxy). If the Pod specification does not include vCPU and memory values, then Fargate provisions the smallest combination of these resources.
Storage – Fargate automatically provides 10GB of ephemeral storage for the Docker container image's writable layer. This storage is deleted when the Pod stops.
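A minimal sketch of a Pod specification whose vCPU and memory requests Fargate would use to size the Pod; the Pod name, image, and values are hypothetical:

    apiVersion: v1
    kind: Pod
    metadata:
      name: reports-worker                # hypothetical Pod name
    spec:
      containers:
        - name: reports-worker
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/reports:2.1   # hypothetical image
          resources:
            requests:
              cpu: "500m"                 # half a vCPU
              memory: "1Gi"               # Fargate adds roughly 256MB on top for node components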
The maximum number of concurrent Fargate Pods, per Region, per AWS
account, is 100. And, the maximum number of Fargate Pod launches per
second, per Region, per AWS account is 1, with temporary burst up to 10.
Next, create a gateway node in the same Region as the EKS cluster. Then log into the gateway node and install the AWS CLI, eksctl, kubectl, and aws-iam-authenticator. Run the aws configure command for the IAM user; when prompted, paste in the IAM user's access key ID and secret access key. Afterwards, install ksonnet (a configuration management tool for Kubernetes manifests).
For each deep learning cluster using GPU instances, the cluster's ~/.kube/eksctl/clusters/kubeconfig file must be modified accordingly:
Use the eksctl utility to submit the modified configuration to the EKS control plane. It will take EKS several minutes to create the deep learning cluster. If the node type is an NVIDIA CUDA GPU instance, you must install the NVIDIA device plugin for Kubernetes into the cluster. The NVIDIA device plugin is a DaemonSet that can automatically: expose the number of GPUs on each worker node, keep track of the health of those GPUs, and enable GPU-based containers to be run in the cluster.
For clusters using CPU instances, the cluster must be configured in the same way as when GPU instances are used – just modify the node type to specify a CPU instance type.
Auto Scaling
EKS supports three types of Kubernetes autoscaling: the Cluster Autoscaler (which scales the worker nodes in and out), the Horizontal Pod Autoscaler (which scales the number of Pod replicas in and out), and the Vertical Pod Autoscaler (which adjusts the CPU and memory reservations of the Pods). A sketch of a Horizontal Pod Autoscaler manifest follows.
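As an illustrative sketch, a Horizontal Pod Autoscaler that targets the hypothetical orders-api Deployment shown earlier; the names and thresholds are placeholders, and the Kubernetes Metrics Server must be installed on the cluster for the HPA to receive CPU metrics:

    apiVersion: autoscaling/v1
    kind: HorizontalPodAutoscaler
    metadata:
      name: orders-api-hpa                # hypothetical name
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: orders-api                  # the Deployment to scale in and out
      minReplicas: 2
      maxReplicas: 10
      targetCPUUtilizationPercentage: 70  # scale out when average CPU exceeds 70%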
Prometheus Metrics
Prometheus is a monitoring and time-series database that scrapes the metrics endpoint exposed by the API Server and aggregates the data. Prometheus allows you to filter, graph, and query the metrics time-series data. The Helm package manager is used to install Prometheus on the EKS cluster.
Load Balancing
For Pods running on EC2-based worker nodes, EKS supports the Network Load Balancer and the Classic Load Balancer through the Kubernetes LoadBalancer service type. By default, Kubernetes LoadBalancer creates external-facing load balancers. An external-facing load balancer requires a public subnet that has a route directly to the Internet through an Internet gateway. The public subnets must be tagged (kubernetes.io/role/elb : 1) so that Kubernetes knows to use only those public subnets. When CloudFormation is used to create the EKS VPC, the tags are automatically attached to the public subnets. For internal-facing load balancers, the EKS VPC must be configured to use at least one private subnet. The private subnets must also be tagged (kubernetes.io/role/internal-elb : 1) so that Kubernetes knows to use them.
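For illustration, a hypothetical Kubernetes LoadBalancer Service; the annotation shown requests a Network Load Balancer instead of the default Classic Load Balancer, and the names and ports are placeholders:

    apiVersion: v1
    kind: Service
    metadata:
      name: orders-api-lb                 # hypothetical name
      annotations:
        service.beta.kubernetes.io/aws-load-balancer-type: "nlb"   # request an NLB
    spec:
      type: LoadBalancer
      selector:
        app: orders-api                   # route to Pods labeled app: orders-api
      ports:
        - port: 80                        # port exposed by the load balancer
          targetPort: 8080                # containerPort inside the Pod

Kubernetes provisions the load balancer in the tagged public subnets described above; an internal-facing variant would rely on the tagged private subnets instead.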
The AWS Billing and Cost Management service is configurable. You can select a custom time period within which to view your cost and usage data, at a monthly or daily level of granularity. You can also group and filter cost and usage information to help you analyze the trends of cost and usage (over time) in a variety of useful ways, and use that information to pinpoint and optimize costs.
To help you visualize the usage and cost data, the service provides these graphs:
Amazon CloudWatch
Amazon SNS
Service Limits
Security Groups
Specific Ports Unrestricted
IAM use
MFA on the AWS Root Account
Find under-utilized resources
Cost optimization
Security
Fault tolerance
Performance improvement; and
Service checks
The AWS Trusted Advisor is also a source of best practices that cover:
Service limits;
Security group rules that allow unrestricted access to specific ports;
IAM use;
MFA on the root account;
S3 bucket permissions;
EBS public snapshots, and
RDS public snapshots.
For AWS clients who have purchased the Business or Enterprise Support
plans there are additional checks and guidance available.
GitHub
The code that defines and manages the runtime environment(s) used by
container/Kubernetes services must be managed and stored in a version
control system.
AWS CloudFormation
AWS CloudFormation is a free service used to launch, configure, and connect AWS resources. CloudFormation treats infrastructure as code and does so by using JSON- and/or YAML-formatted templates. AWS does not charge a fee for using CloudFormation itself; you are charged only for the infrastructure resources and services created through CloudFormation.
CloudFormation enables you to version control your infrastructure – VPCs, Clusters, Services, Tasks, Containers, EC2 instances, as well as Fargate instances. Best practice is to use a version control system to manage the CloudFormation templates. CloudFormation is also a great disaster recovery option.
The AWS CloudFormation template is used to define the entire solution
stack, as well as runtime parameters. There are CloudFormation templates
that support VPCs, Clusters, Services, Tasks, EC2 and Fargate instances. You
can reuse templates to set-up resources consistently and repeatedly. Since
these templates are text files, it is a simple matter to use a version control
system with the templates. The version control system will report any
changes that were made, who made them, and when. In addition, the version
control system can enable you to reverse changes to templates to a previous
version.
An AWS CloudFormation template can be created using a code editor. But templates can also be easily created using the drag-n-drop CloudFormation Designer tool, available in the AWS Management Console. The Designer will automatically generate the JSON or YAML template document.
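For illustration, a minimal CloudFormation template in YAML; the parameter, resource, and repository names are hypothetical, and the template creates only a single ECR repository:

    AWSTemplateFormatVersion: "2010-09-09"
    Description: Illustrative template that creates one ECR repository
    Parameters:
      RepoName:
        Type: String
        Default: orders-api               # hypothetical repository name
    Resources:
      OrdersApiRepository:
        Type: AWS::ECR::Repository
        Properties:
          RepositoryName: !Ref RepoName   # runtime parameter
    Outputs:
      RepositoryArn:
        Value: !GetAtt OrdersApiRepository.Arn

Held in version control, a template like this can be deployed repeatedly, and identically, into any account or Region.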
Java
JavaScript
Python
.NET
Go
Ruby
Node.js
C++
PHP
AWS SUPPORT TIERS
AWS offers 4 different tiers of support, with differing services:
1 Basic;
2 Developer (fees charged);
3 Business (fees charged) and
4 Enterprise (fees charged).
All AWS Support plans include an unlimited number of support cases,
with no long-term contracts. With the Business and Enterprise tiers, you earn
volume discounts on your AWS Support costs, as your AWS charges
increase.
Support Severity
AWS support uses the following severity chart to set response times.
Critical - the business is at risk. Critical functions of an application are
unavailable.
Urgent – the business is significantly impacted. Important functions of an
application are unavailable.
High - important functions of an application are impaired or degraded.
Normal - non-critical functions of an application are behaving
abnormally, or there is a time-sensitive development question.
Low – there is a general development question, or desire to request a
service feature.
Basic Support
Customer Service & Communities
24x7 access to customer service, documentation, whitepapers
and support forums.
AWS Trusted Advisor
Access to the 7 core Trusted Advisor checks and guidance to
provision your resources following best practices to increase
performance and improve security.
AWS Personal Health Dashboard
A personalized view of the health of AWS services and alerts
when your resources are impacted.
Alerts and remediation guidance when AWS experiences events
that may impact you.
You can set alerts across multiple channels, e.g., email, mobile
notifications.
Developer Support
Enhanced Technical Support
Business hours email access to Support Engineers
Unlimited cases / 1 primary contact
Case Severity / Response Times
General guidance: < 24 business hours
System impaired: < 12 business hours
Architectural Guidance
General
Business Support
Enhanced Technical Support
24x7 phone, email, and chat access to Support Engineers
Unlimited cases / unlimited contacts (IAM supported)
Case Severity / Response Times
General guidance: < 24 hours
System impaired: < 12 hours
Production system impaired: < 4 hours
Production system down: < 1 hour
Architectural Guidance
Contextual to your use-cases
Programmatic Case Management
AWS Support API
Third-Party Software Support
Interoperability & configuration guidance and troubleshooting
Proactive Programs
Access to Infrastructure Event Management for additional fee.
Enterprise Support
Enhanced Technical Support
24x7 phone, email and chat access to Support Engineers
Unlimited cases / unlimited contacts (IAM supported)
Case Severity / Response Times
General guidance: < 24 hours
System impaired: < 12 hours
Production system impaired: < 4 hours
Production system down: < 1 hour
Business-critical system down: < 15 minutes
Architectural Guidance
Consultative review and guidance based on your applications
Programmatic Case Management
AWS Support API
Third-Party Software Support
Interoperability & configuration guidance and troubleshooting
Proactive Programs
Infrastructure Event Management
Well-Architected Reviews
Operations Reviews
Technical Account Manager (TAM) coordinates access to
programs and other AWS experts as needed.
Technical Account Management
Designated Technical Account Manager (TAM) to proactively
monitor your environment and assist with optimization.
Training
Access to online self-paced labs
Account Assistance
Concierge Support Team
OSI Model
1. Resources;
2. Messages;
3. Connections, and
4. Security (listed last but is certainly not the least important
focus).
Any type of application – not just a web browser - that can create an
HTTP Request and can send that message over a TCP/IP network can access
resources available at URL endpoints, e.g., telnet www.fredo.net 80
HTTP is Stateless
Each HTTP transaction, each request and response, is independent of all other HTTP transactions, past and future. The HTTP protocol does not require a server application to retain information about an HTTP request. Every HTTP request contains all of the information that a server application needs to create and return a response message. Because every message contains all required information, HTTP messages can be inspected by, transformed by, and cached by proxy servers, web cache servers, etc. Indeed, without the ability to cache HTTP messages in memory, the latency inherent in the world-wide web would make the Internet unusable. In addition, when needed by the BPM, most web clients and server applications are highly stateful!
Hypermedia information
Hypermedia information is not a passive entity, and seldom exists as an
isolate – seldom exists apart from other hypermedia entities. Hypermedia
information systems arise, persist, and perish, in the form of a property graph
database. Hypermedia contains information as well as behavior. It is the
behavior which makes it active information, that makes it hyper.
The tools which provide behavior are:
Distributed CRUD
When information and services are distributed over a network, their associated CRUD transactions are also distributed over the network. A CRUD transaction over a network is uniquely identified by specifying both: the URL of the target resource, and the method (verb) applied to that resource.
Idempotency
To determine whether a request method is or is not idempotent, only the state of the request handler service is considered. The idempotency of a method request is not determined by the agent initiating the request.
The request handler service controls and dictates whether a given method request is or is not executed as an idempotent request. The HTTP request handler is, relative to the requestor, a black box – the requestor has no control over how their request is handled.
GET, HEAD, OPTIONS and TRACE methods are safe in that their only
purpose is to retrieve information. Safe requests are defined to be
idempotent but are not guaranteed to be so.
Best practice – do not assume that your 1 single GET or HEAD request is
always only executed 1 and only 1 time by the request handler service.
6. The POST method is used to submit an information payload to
the specified resource, intending to cause a change in
information state – an effect - on the server, free of any
unintended side-effects
7. The PUT method replaces all current representations of the specified target resource with the request payload.
8. The DELETE method deletes the specified resource.
9. The PATCH method is used to make partial changes to a
resource.
The PUT and DELETE methods are defined to be idempotent, but there is no guarantee that they are. The POST and PATCH methods are not defined to be idempotent.
Best practice – do not assume that your 1 single POST, PUT, DELETE, or
PATCH, request is always only executed 1 and only 1 time by the request
handler service.
HTTP Connections
In an HTTP transaction a message in the form of a request is sent by a
client application to a server application which returns a message in the form
of a response. Remember, there are other types of HTTP client applications
other than a web browser.
The official format specifications of these 2 types of messages, and of their information contents, are part of the HTTP protocol. These messages are transmitted across a network wherein connections are opened and closed. The opening and closing of network connections takes place within the OSI layers that are the foundation on which your cloud solution executes.
The client application uses the hostname of the URL, from which the IP
address and port # of the server application are derived. Once the client
application opens a connection, it can write messages to the server.
What is TCP?
The TCP layer handles the messages, ensuring that they are relayed
between the correct client and server applications, without message loss or
message duplication occurring. TCP is a Transport layer protocol, and is a
connection-oriented protocol, and therefore requires a logical connection
between 2 devices before transmitting data.
In a TCP-based network, the client application and the server application
that are exchanging messages with each other can reside on any
computer/node in the network. The TCP layer, present on each node in the
network, handles the messages, ensuring that they are relayed between the
correct client and server applications, without message loss or message
duplication occurring.
TCP controls the rate at which messages flow and detects errors. If a
message is lost, TCP will automatically resend that message.
What’s IP?
IP is a Network layer protocol and is a connectionless protocol. IP does not require a logical connection between 2 devices before transmitting data. An IP device just sends the data, just puts the data on the Ethernet. Every host interface in the TCP/IP network has an IP address. The IP address (of which there are now 2 types: IPv4 and IPv6) is:
Present on each node in the network, the defining job of the IP layer is
routing traffic. IP fragments a TCP message and encapsulates it into
datagrams, which it then places on the Ethernet layer beneath it. Also, IP
takes datagrams off the Ethernet and reassembles them into a TCP message.
Lastly, IP exchanges information about the status/health of host interfaces in
the TCP/IP network.
Over the past 50 years, unless the software engineers were working on an
operating system, a database service, or some other advanced technology,
invariably the application code consistently violated the principles of
maximizing cohesion and minimizing coupling. Over the past 50 years, the
drive to get the application out of development, through testing and out into
production has been and remains the holy grail. Due to pernicious neglect,
the principles of maximizing cohesion and minimizing coupling are not
engraved on that holy grail.
It has been well known since the mid-1970s that whenever those two
principles are violated the application becomes a big ball of mud (BBoM).
The polite term for a BBoM is a monolithic application. And, just because it is a web application does not mean that it is not a BBoM. Gaining hands-on knowledge of how to maximize cohesion or minimize coupling of an application has had little to no street value – until recently, that is.
Recently, the critical role that those two software engineering principles
play in reducing CapEx and OpEx has been recognized by a wider percentage
of the C-Level chairs in the audience. Respecting the 2 timeless principles is
a matter of survival (for institutions as well as software engineers).
Monolithic Applications
In every IT institution, a monolithic application is a significant barrier to
reducing CapEx and or OpEx.
Maximizing cohesion and minimizing coupling share much in common with film-making. It requires a number of takes before you are able to select the best result, before you are able to see how cohesion is being maximized and coupling is being minimized. Shareholders rarely want to pay for multiple takes on an application component that already passes its validation and verification tests. A monolithic application shares much in common with a home movie – they lack entertainment value and are painful to watch, worse still to appear in.
As if it can't help itself, over time the typical business appends to the sorry monolithic application a variety of disparate processes that have different compute needs. Those practices continue until a straw is added which breaks the back of the monolithic beast. In effect, the typical business misses every opportunity it previously had to minimize the coupling of the monolithic application with other applications.
The Institution
“It must be remembered that there is nothing more difficult to plan, more
doubtful of success, nor more dangerous to manage than a new system. For
the initiator has the enmity of all who would profit by the preservation of the
old institution and merely lukewarm defenders in those who gain by the new
ones. ”
― Niccolò Machiavelli
It seems that most technologists fail to perceive the humans present in the environment. As such, rarely will you discover a cloud migration strategy that starts with a wise consideration of our shared humanity. The above quote explains why nearly all IT institutions are averse to migrating from their traditional IT environments to the cloud (private, public, or hybrid), and why most migrations fail to achieve their CapEx or OpEx goals. Your migration strategy amounts to nothing when the tribe in question has a vested interest in the preservation of their traditional IT institution and is therefore full of animus toward the public cloud, which renders their hallowed IT institution's infrastructure a commodity.
To the animus infected, their traditional IT institution is to them a totem,
and the ‘cloudsters’ are a real existential threat. Therefore, the cloud
migration strategy is not fundamentally a matter of logic or reasoning, and
certainly not a matter of due diligence: it is a matter of when, where, and how
fear in the ranks of the IT institution is handled by the C-levels. As history
continually teaches us, extremely few C-levels have the leadership skills
needed to take a crowd of fearful people (who almost certainly distrust the C-
levels) into the unknown and beyond into the greener pastures.
Consequently, the best times to migrate a service/system to the cloud are
during the conception or the birth of that service/system, well before the child
gets too accustomed to their local neighborhood, or too familiar with its
ingrained customs, taboos and bigotries: before they join the tribe. As such,
our best hope for success in the cloud is a green-field solution.
Cloud animus
Animus might be irrational and illogical, but there are reasons for its arising and abiding. To be fair to the fearful, our traditional IT systems and their applications were never designed to run on temporary infrastructure (which is ubiquitous in the public cloud). Traditional IT systems/applications were designed and constructed to run on "permanent" infrastructure, and usually directly on the hardware. Consequently, the overwhelming majority of traditional IT systems/applications are not suitable candidates for migration to the cloud. They are invariably big balls of mud (BBoM).
You'd be crazy to undertake a venture which you know from the start is doomed to fail – that course of action is something all rational persons are averse to taking. Yet, to speak out in opposition to a C-level's initiative is employment suicide. The crowd's course of action, then, is covert aggression: remain silent, do not point out the material facts that doom the initiative, but be patient and watch the migration and the sponsoring C-level fail. Afterwards, gladly welcome their replacement with open arms.
There are various means to assuage the crowd's fears, and they all require a gentle guiding hand along with practical encouragement. Nevertheless, you can lead a horse to water but you cannot make it drink (unless it's thirsty).
Easy Lifts
When looking at a traditional IT institution, the easy task is to identify the
generic off-the-shelf systems/services that are present. The generics, e.g.,
email, data archival, etc., are the best candidates for migration. These are
vendor products which the traditional IT institution merely had to install and
then operate per the manual. If and only if the vendor supports the cloud platform, there is a good assurance of success achieved quickly and at a reasonable price point. If no such vendor assurance exists, then the generic is a no-go-to-the-cloud.
By completing the easy lifts, the traditional IT institution is re-shaped into
a hybrid cloud, where some parts are in the cloud and most parts remain in
the company’s back-office(s).
CPU usage;
Memory usage;
Network usage;
Storage usage (actual and projected);
Response times;
Data throughput volumes;
Recovery times.
Given a BBoM, these metrics will only be measured at a gross level. Because BBoMs rarely have features which obtain these metrics on a per-process basis, let alone a per-component basis, the gross-level metrics will not help identify the individual cannibal components that consume the greater part of the infrastructure.
Therefore, identifying resource cannibals, and determining how to refactor them to make them more efficient, is practically impossible. Keep in mind that the actual cost of a badly performing BBoM component can be easily obfuscated in the traditional IT environment, but will always make you pay through the nose in the cloud. Migrating to the cloud is not the time to cross your fingers and hope for the best. You have to keep an eye on the cannibals, or else they will eat you alive.
In addition to suitable metrics, it is necessary to analyze each BBoM and
identify anti-cloud problems that must be remedied during the migration.
Successfully concluding that analysis is easier said than done when it comes
to BBoMs.
Containers of BBoMs
The use of VMs is a minimum goal of the cloud migration preliminaries. Given the temporary nature of cloud infrastructure, it is imperative that the BBoM be capable of being containerized and orchestrated. That is, each component of the BBoM must be significantly refactored so that each is capable of being containerized.
Since the CapEx opportunity of the cloud can only be realized by exploiting the temporary nature of cloud infrastructure, each BBoM component must be able to be spun up and spun down at will, and do so in sync with a transient cloud infrastructure. It is sufficient to assert that a BBoM component cannot carry a tune, let alone be orchestrated. However, in the cloud, orchestration is not a luxury; orchestration is a necessity.
VM Hard-stop
If a BBoM is not already operating on virtual machines wherever feasible, that situation is a hard stop. Virtual machines (VMs) are ‘standard technology,’ so when a BBoM is not yet running on VMs, that is an extremely negative mark and a tell of the IT institution's culture.
In the cloud, your services/systems will run on a VM. Therefore, before a cloud migration can be considered, each BBoM must first be migrated to VMs. Failing the VM migration, there is no hope of that IT institution achieving a successful cloud migration any time soon: the institution's culture is too far behind the curve.
Underlying Assumptions
It is one thing to divide and conquer; it is altogether something else to manage that conquest. Managing a SaaS is not like managing a BBoM. To manage a SaaS you need to be able to spin up and shut down individual microservices on demand. To manage a SaaS you need to be able to recover a microservice after a failure event. To manage a SaaS you need to be able to support rolling updates, as well as rolling rollbacks. Of course, these management tasks have to be accomplished within a security context that enforces the required guidelines.
Therefore, when considering a SaaS, orchestration of its microservices is a huge part of the challenge. To keep things simple, in this article Kubernetes will do the orchestration.
12-Factors Methodology
The demands of designing, developing and operating a SaaS extend well
beyond the 2 timeless principles, but all of those demands are logical
extensions of those 2 principles. There is a methodology advocated for SaaS
called the 12-Factor methodology. However, the count of 12 is a low-ball
estimate:
1. Codebase;
2. Explicit and Isolated Dependencies;
3. Configuration;
4. Backing services;
5. Build, release, run stages;
6. Stateless Processes;
7. Port binding;
8. Concurrency;
9. Disposability;
10. Dev/prod parity;
11. Logs, and
12. Admin processes.
At the street level, where each of these factors is made real, each of the 12 factors expands into far more work than its one-line label suggests.
1. Codebase
A version control system (e.g., Git) is a tool used in all good software
engineering environments. The version control system is to software what
clean air is to our body. No one has a good life when there is polluted air in
the room. No one has good software when that software is un-managed by a
robust version control system. The version controlled software is held in a
repository. With the aid of the version control tool, specific versions of the
codebase can be repeatedly deployed from the repository to a variety of
environments.
The migration of the codebase occurs through stages – from development to test, from test on to production – and needs to be repeatable as well as automated (i.e., it does not require human intervention to provision). The slogan for this automation process is Continuous Integration and Continuous Delivery (CI/CD).
This is not to assert that the exact same codebase is migrating from stage
to stage. For good reasons, the build of the development binary image is not
identical to the build used in subsequent stages. For example:
Dependency management;
Remote repositories;
IDE tool portability, and
Easy searching and filtering of project artifacts.
Along with the version control system, the explicit dependencies, and
remote repositories, the application binary image is built at each stage.
An important benefit of explicit declaration of dependencies is that they
make it easier to on-board others into the application team. It is much easier
for new persons to acquire knowledge of how the application is put together
when those dependencies are recorded in explicit declarative statements.
3. Configuration
The codebase is not just the binary image of the application. The codebase also includes configuration information, of different types. For example, the configuration of an application differs across environments: the hardware resources used in production are not identical to the hardware resources used in the testing or development environments. The environment configuration information has to be managed independently of the application functions that consume that configuration information – the configuration information has to be external to the application's binary image.
The codebase is not just the binary image of the application and
application configuration information. The infrastructure that is the
environment of each stage is also expressed in declarative code statements,
that are stored in files external to the application’s binary image. These
declarative code statements make it possible to fully automate the dynamic
provisioning of infrastructure present in each stage.
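A minimal sketch of externalized configuration using a Kubernetes ConfigMap; the names, keys, and values are hypothetical:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: orders-api-config             # hypothetical name; one per environment
    data:
      LOG_LEVEL: "info"                   # differs between development, test, and production
      ORDERS_DB_HOST: "orders-db.internal.example.com"   # hypothetical backing-service endpoint

The Pod template then imports these values as environment variables (for example with envFrom and configMapRef), so the application's binary image never changes between stages; only the ConfigMap does.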
4. Backing Services
Each HTTP-based microservice present in the SaaS is a resource that has a Uniform Interface (i.e., REST) and which is assigned a Uniform Resource Locator (URL) logical name. It is helpful to think of the microservices of a SaaS as being a model of (a portion of) the BPM's domain.
The business domain model supported by the SaaS consists of the core domain, the generic domain, and the supporting domain. The reader who is familiar with Domain-Driven Design (DDD) will find it easier to understand the ‘backing services’ factor as the supporting domain.
The core domain consists of the most important parts of the SaaS: these are the microservices that enable your business process model (BPM) to succeed. However, these parts cannot be bought off the shelf: their design and construction are custom made by/for your company and account for significant investments in technology and engineering. In return, the core delivers the highest value to your BPM.
The technologies in the core domain are only worth investing in so long as
they provide the required ROI. The core is a product upon which your BPM
is heavily dependent. The core is a product that you manufacture and which
requires a manufacturing methodology. Therefore, the core is not to be
handled as if it were a project. Agile is a methodology limited to projects.
Perceiving the core as if it were a litany of projects will certainly heighten the
risk of the core turning into a big ball of mud (BBoM).
Consider this: if you were to fly on an airplane, would you prefer that the airplane's flight system was manufactured as a product, or handled as if it were a litany of Agile projects? If that airplane's flight system were the result of a litany of Agile projects, and not a manufacturing methodology, it is doubtful that the airplane could be operated safely. Remember, in Agile no individual on the team is personally responsible for any deliverable; the buck stops on no one's desk, the buck falls between the cracks of an abstraction called the team!
The generic domain is not core to the BPM, but the BPM cannot function successfully without it. The parts present in the generic domain are the services/resources that can be bought off the shelf: email, CRM, data archival, etc. The supporting domain contains the backing services that support the core domain. For example, within the supporting domain are the persistent data storage services, the messaging service, etc.
Unlike services in the SaaS core, the services present in the generic and
supporting domains may or may not support a RESTful interface and might
not be assigned a URL. Despite the resulting communication difficulties
between the core domain and the generic and supporting domains, it is
helpful to manage services in all 3 domains as resources.
5. Build, Release, Run stages
Using the appropriate tools, the build event brings together (from 1 or more repositories) all portions of the codebase intended for release into the target stage/environment, and outputs a binary image of the microservice. Each build that outputs an image that will be released into production needs a unique ID which de-marks each incremental change release. A given released and ID'd image is immutable; it cannot be changed. Any change to the codebase to be released requires a new, unique, and incremented ID value.
Using the appropriate tools, the release event takes that output, along with configuration information about the target environment, and deploys these into the target environment. Using the appropriate tools, the run stage involves the launch of the service in the environment, its execution in that environment, as well as any rollbacks that may be needed.
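As an illustrative sketch of the build stage, and assuming AWS CodeBuild, a buildspec might tag each image with a unique, incrementing ID before pushing it to ECR; the account, Region, repository name, and tagging scheme are hypothetical:

    # buildspec.yml (illustrative only)
    version: 0.2
    phases:
      pre_build:
        commands:
          - IMAGE_TAG=$(date +%Y%m%d%H%M%S)          # hypothetical unique release ID
          - $(aws ecr get-login --no-include-email --region us-east-1)
      build:
        commands:
          - docker build -t orders-api:$IMAGE_TAG .
          - docker tag orders-api:$IMAGE_TAG 123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:$IMAGE_TAG
      post_build:
        commands:
          - docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/orders-api:$IMAGE_TAG

Once pushed, the tagged image is immutable; the release stage deploys that exact tag, and a rollback simply re-deploys an earlier tag.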
6. Stateless Processes
As explained in article #3 of this series, the HTTP protocol (on which
microservices are based) is a stateless protocol. Therefore, the requests and
responses supported by the SaaS REST interface are stateless by default and
share nothing as well. This stateless share nothing paradigm facilitates the
scaling in and scaling out of instances of the SaaS microservices.
All information processed by the SaaS that the BPM needs to persist must
be managed by a stateful data service (located in the supporting domain).
Information that the BPM needs to persist (or provide access to) over small
windows of time can be managed by a dynamic cache service. Using the filesystem of the OS that the microservice runs on is a viable choice in some scenarios, but it's best to refrain from abusing the filesystem.
For BPMs that need stateful sessions connected to their SaaS, there is the ‘sticky session.’ A sticky session caches temporary metadata about the user session (in the memory space of the microservice), which makes it possible to route multiple requests and responses across the same session. Because sticky sessions violate the share-nothing paradigm, they impede the scaling in and out of SaaS microservices.
7. Port Binding
As explained in article #3 of this series, in networks that use the TCP/IP
protocol, a port is an I/O mechanism that an application uses to ‘listen to
requests’ which it handles. In the typical SaaS Use Case, the requests are
composed of HTTP verbs and information is encoded in JSON.
In a TCP/IP network, each application service running on a host computer
(or on a Virtual Machine, as well as within a container/Pod) is uniquely
identified by its port – by a number.
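A minimal sketch of port binding for a hypothetical microservice: the container exposes a named port, and a Service binds a stable, published port to it (all names, images, and port numbers are placeholders):

    apiVersion: v1
    kind: Pod
    metadata:
      name: billing-svc                   # hypothetical microservice
      labels:
        app: billing-svc
    spec:
      containers:
        - name: billing-svc
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/billing-svc:1.4   # hypothetical image
          ports:
            - name: http                  # named port the process listens on
              containerPort: 9090
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: billing-svc
    spec:
      selector:
        app: billing-svc
      ports:
        - port: 80                        # stable, published port
          targetPort: http                # bound to the named container port above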
8. Concurrency
If all goes well, each microservice will have cause to support more than 1
request/response at a moment in time. This stateless share nothing paradigm
facilitates the scaling in and scaling out of instances of the SaaS
microservices which is a very straight-forward way of supporting
concurrency.
Depending on the stack you choose to use to implement the SaaS, other
concurrency options may also exist. For example, Node.js, though single
threaded, uses event loops and callbacks to support concurrency. In a Java
web service, multi-threading may be needed to support the use case KPIs.
9. Disposability
‘All things must pass.’
- - George Harrison
Each microservice in the SaaS needs to be able to be started or stopped at any chosen point in time. By supporting those behaviors, the SaaS microservice can be scaled in as needed to reduce CapEx, as well as scaled out to acquire capital. Time is critical when shutting down and starting up an instance of a microservice; the quicker the better. And to dynamically scale in as well as out, while handling on-going sessions, a load balancer that fronts the SaaS microservices is a must-have.
Under normal conditions, the features of the SaaS change over time. Like
all other things, a SaaS is continually changing, is not permanent, is
temporary. Change is expected. Breakage is expected. Therefore a method to
handle the continually changing madness is useful. The 12-Factor
methodology is well suited to handling continually changing SaaS. In this
article, and in each of the subsequent articles in this series, we will take up a
group of factors from among the 12-Factors methodology.
11. Logs
Logs give us visibility into the events handled by the microservice, and how they are handled, in different levels of detail as may be needed. Like stdout and stderr, a log captures a continual stream of events, and so has to support the ability to be continually enlarged.
The types of logs used by microservices may need to change across environments.
12 Factors Eye-Opener
Each of the 12 factors can be expanded according to the needs of the BPM. How these guidelines are adopted does vary according to the environments the SaaS uses. For example, on the AWS public cloud, factor #12 (Admin Processes) is supported by a variety of tools, such as CloudFormation.
Though there may be variations in how each of the 12 factors is adopted by a SaaS, all 12 factors need to be adopted to promote the success of the SaaS. As stated in article #1 of this series, there are other critical factors that must be handled, but which are not identified in the 12-Factor methodology. For example, it fails to mention identity management.
Greenfield DevOps
The greatest probability of success exists in a greenfield in which a cloud-native application can be constructed, tested, and deployed, unchained from the constraints of a monolithic application.
Reincarnating a monolith
Unfortunately, rarely if ever is it cost effective to containerize a monolithic application or to run it on a private/public cloud. Given a monolithic application, the most cost-effective path to running it in a private/public cloud is to first containerize it – logically, not physically.
In practice, the monolithic application code is not changed physically. Instead the monolith is transformed into a greenfield, a logical greenfield. The computations supported by the monolith define the computations that the containerized application is to support.
The security, service, and data contracts of the existing computations are designed and developed from day one to be containerized (and to run in the private/public cloud). Of course, doing so ensures that all existing security, service, and data contracts will have to be rebuilt from the ground up. Unlike the first time around, when the monolith was built, the details of every contract that the solution must support are known before you design the solution as a cloud-native application.
To respond to the shareholders intent to reduce CapEx and OpEx,
application architects are advocating microservices, which run in containers
that are orchestrated.
Microservices
The discipline of structured programming began in the 1970s, at least a
decade before object-oriented languages or techniques existed outside of
academia. At the core of the structured programming discipline there are 2
timeless software engineering principles: minimize coupling and maximize
cohesion. Until recently, however, those two principles rarely (if ever)
influenced the design and construction of software (object-oriented or not).
As a consequence, there exists a plethora of big balls of mud (BBoM) present
in almost all institutions. Violating either or both of those 2 engineering
principles has always been a willfully ignorant and intentional choice.
Usually, a software application/service is constructed of multiple
components. Tight coupling – hard-wiring component dependencies at the
code level – is to software what metastasized cancer is to the body. Tight
coupling interjects pernicious ripple effects throughout the application code.
Tight coupling makes it impossible to test a component in isolation. Tight
coupling makes code needlessly and senselessly complex. Loose coupling minimizes the dependencies between components, thereby eliminating the toxic consequences of tight coupling.
When a software component serves 1 and only 1 purpose, the performance of that component can be optimized. Highly performant components are the best way to minimize infrastructure (memory, CPU, network, etc.) consumption. Ensuring that a component serves 1 and only 1 purpose is almost the most effective way to keep things simple. Keeping things simple (by dividing and conquering) is the best way to drive down the cost of engineering and operating software. When a software component serves multiple purposes it has minimal cohesion, is highly resistant to optimization, and needlessly and senselessly raises the cost of software engineering over its entire life-cycle.
Though microservices are promoted as if they were a new application architecture, they are old school: they embody the structured programming principles of maximizing cohesion and minimizing coupling established in the mid-1970s. Ever since, those principles have been used by software engineers who valued their relevancy to distributed systems. Since the late 1980s this author has used them to DevOps globally distributed systems (without having the significant benefit of a cluster of containers).
If you are familiar with client-server architecture, the server implements
all security, service and data contracts and all of these parts are present in a
single chunk of source code. The entire chunk of source code has to be built,
and then tested, and then deployed, all as a single unit. In a microservice
architecture each server contract becomes an independent service
(implemented as a small chunk of isolated source code) that has a RESTful
interface and which uses HTTPS to inter-operate with the other services.
Individual microservices are then containerized and run in a secure cluster
present in a private/public cloud. Alternatively, each microservice can run in
a dedicated virtual machine.
To help people refrain from continuing to ignore these 2 timeless software
engineering principles, they have been re-branded. Today we have a catch-all
term for asserting the primacy of minimizing coupling and maximizing
cohesion: microservices.
A microservice serves 1 and only 1 purpose, and does so while ensuring that its couplings to other services, to the operating system, to the virtual machines, and to the host hardware are held to a practical minimum. Coupling of services is minimized by the adoption of interfaces that abstract away the lower-level communication details.
Microservices are not new, they’ve been around since the early 90s when
client-server architectures began to be common-place in business. Back then
the service interfaces were proprietary. Today, open standards define the
interfaces used by microservices. Today, HTTP-based microservices adopt a
stateless Representational State Transfer (REST) interface. That RESTful
interface enforces an identity based security contract, behind which there is a
service contract or a data contract.
Software-as-a-Service (SaaS)
Instead of a BBoM, an HTTP-based solution is composed of one or more
loosely coupled microservices. Taken together, this collection is called
Software-as-a-Service (SaaS). A SaaS may or may not be connected to the
Internet. A SaaS may or may not be running in a public, a private, or a hybrid
cloud. Regardless of where the SaaS is deployed, in addition to the 2 classic
software engineering principles there are other critically important techniques
that must be enforced when designing, developing and operating any SaaS.
Conclusion
Monolithic applications, a.k.a. big balls of mud (BBoM), are imagined to be a permanent resource. In the institutions in which they are found, the BBoM is a totem of that culture. Consequently, any change to the BBoM, such as lifting it into a public cloud or containerizing it, is imagined by the institution's culture to be threatening. Naturally, that response increases the level of animus within the institution's culture toward the cloud as well as toward containers.
To be clear, given the physical structure of the BBoM's source and system code base, it is rarely if ever cost effective to stuff a BBoM into a container or to lift a BBoM into a private or public cloud. All BBoMs are defined and managed to run on a physical server located in a private (or tenant) data center. No BBoM (not even the web applications) was ever implemented to run on virtual machines distributed across network subnets exposed to the Internet.
ABOUT THE AUTHOR
In the spring of ’85, out of the blue, while finishing an undergraduate
degree in Geology (and planning that fall to enter graduate school to study
Mineral Economics), the writer’s first computer was dropped on his work
desk along with the assignment to create a program that estimated the
ownership and operating expenses of surface and subsurface mining
equipment. Back then, in the dark ages, no one had a computer on their desk,
or extremely few people owned their own computer. Before that moment, the writer had never touched a computer, or even seen one. At that time, given that 80% of others in Geology were out of work, the machine placed on his desk was better than their alternatives.
That micro-computer was a brand-new proto-DOS HP-85, with a tiny keyboard, a 4” green monitor, and it ran applications you wrote in a language called BASIC. I literally did not know what a CPU was, what RAM was (ram was a barnyard animal, right?), what an operating system was, what a programming language was, etc. Thirty days later the bug-free program ran in under 5 minutes, completing the work a Geologist accomplished in one week using a hand-held HP calculator.
Rather than be run over by the gathering herd of machines, that fall the writer switched disciplines and entered graduate school to study information systems. In the fall of 1985, the writer's first personal computer had a 10MHz CPU, 16KB of RAM, a 14” black-n-white TV as a monitor, and a dial-up modem cartridge, used to connect to a VAX 11-780 running BSD Unix 4.3 and INGRES, on which ran an application written in C code using double pointer indirection. In the following spring, my next computer had a 12MHz CPU with 32KB, and man was it fast, really, really fast!
Since earning a Master of Science (MS) in 1987, and working in pure and applied research, the author has been developing and operating globally distributed systems since 1990, first using varieties of UNIX, then Windows NT, then Linux, and now their offspring. The author has worked with a mixture of on-premise platforms and, since 2007, a mixture of cloud platforms: first with proto-Azure while an employee at the Microsoft Center of Excellence, then in 2012 using Amazon Web Services (using Hadoop and Redshift), followed in 2015 by Google Cloud Platform (GCP).
As everyone involved with AWS has encountered, the use of Docker and
Kubernetes on AWS is growing rapidly. This manuscript is constructed from
the notes recorded while out in the field. This manuscript was first published
in Amazon Kindle at the end of May 2020.
The author can be reached at charles@thesolusgroupllc.com
For those interested in the AWS certifications held by this author, here is
where they stand.
AWS Certified Cloud Practitioner Validation Number#
PN49EN3CDE411H30
AWS Certified Cloud Practitioner Issue Date: February 12, 2019
AWS Certified Cloud Practitioner Expiration Date: February 12, 2021