Ccaws Unit 5
In this article, I will use simple terms and a simple web application stack to discuss high
availability, how it is calculated and how it can be practically achieved. I will also briefly
touch on scalability and how a similar approach can achieve higher availability and greater
scalability of our simple web stack.
Start Simple
When we start a new system, it makes sense to keep things simple. At the early stage of any
product, you don’t really know whether it is even going to fly. Speed of iteration and the
ability to respond quickly to feedback are the most important attributes of your system.
AWS provides you with a variety of high-level services that manage entire parts of the system
for you and if I were to start today, I would maximize the use of these resources, and thus
maximize the time spent doing that special thing that my product is good at. For the purposes
of this article, though, let us assume that I am a bit of an old-school guy who prefers a
plain vanilla LAMP stack. In that case, the simplest way for me to start would be by keeping
everything on a single box (or a single EC2 instance if I were to do it in the cloud). The main
issues with this approach are, of course, scale and availability. You can only support as many
customers as your single box can handle, and should this single box fail, you won’t support any at all.
And while these may not be your primary concerns in the early stages, as your system
matures and your customer base grows, availability and scale will become more important.
As the system grows, we often split it into tiers (also known as layers), which sets it on the
path of greater scale and availability. Now every tier can be placed on its own box of the
appropriate size and cost. We can choose bigger, better boxes with multiple levels of
hardware redundancy for our database servers and cheaper, commodity-grade hardware for
our web and application servers.
Let’s say that our business requirements call for 99.5% uptime. In other words, the system is
allowed to be down no more than about 44 hours in any given 12-month period. The total availability of a
system of sequentially connected components is the product of individual availabilities. Let’s
for example assume that individual servers that host our web and application tiers have 90%
availability and the server hosting our database has 99%. For simplicity’s sake, let’s also
assume that these availability numbers include hardware, OS, software, and connectivity.
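As a quick sanity check, here is a minimal sketch (plain Python, using only the example figures above) of the series-availability math and of the downtime allowed by a 99.5% uptime target:

# Availability of components connected in series is the product of
# their individual availabilities (example figures from the text above).
web = 0.90   # single 90% web server
app = 0.90   # single 90% application server
db = 0.99    # single 99% database server

system = web * app * db
print(f"System availability: {system:.1%}")   # ~80.2%

# Downtime allowed by a 99.5% uptime target over 12 months:
hours_per_year = 365 * 24
print(f"Allowed downtime: {(1 - 0.995) * hours_per_year:.0f} hours")  # ~44 hours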
Introduce Redundancy
At first glance, we are not doing too well: The availability of the whole system is way
below our desired uptime and even below the availability of any single component of the
system. However, by splitting the system into tiers and by pushing the state to the database
tier, we have set the stage for solving (to some degree) the problems of availability and scale.
The simplest way to increase availability is by adding redundancy, which is much easier to do
with the stateless parts of the system. As we add web servers to our web tier, the probability
of its total failure decreases, pushing the availability of this tier and of the entire system up:
Adding even one extra server to the web tier brings the overall availability up by about 8
percentage points. Unfortunately, adding another web server does not do as much for our
overall availability, increasing it by only about 0.8 points.
The cost of this tier grows linearly with every server, but availability grows logarithmically.
Consequently, we will soon reach the point of diminishing returns where the value of
additional availability will be less than the cost of adding another server.
Thus, we have noticed the first pattern: Adding redundancy to a single component or tier
leads to a logarithmic increase in availability, eventually reaching the point of diminishing
returns, at least from the availability standpoint.
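To see the diminishing returns concretely, here is a small sketch (assuming independent server failures and the per-server figures used above) of how tier and system availability grow as identical servers are added in parallel:

# Availability of a tier with n identical servers in parallel,
# assuming independent failures: 1 - (1 - a)^n
def tier_availability(a: float, n: int) -> float:
    return 1 - (1 - a) ** n

per_server = 0.90   # each web/app server
db = 0.99           # single database server

for n in range(1, 5):
    web = tier_availability(per_server, n)
    # keep the application tier at a single 90% server for this illustration
    system = web * per_server * db
    print(f"{n} web server(s): tier {web:.3%}, system {system:.2%}")

# 1 web server(s): tier 90.000%, system 80.19%
# 2 web server(s): tier 99.000%, system 88.21%   (up ~8 points)
# 3 web server(s): tier 99.900%, system 89.01%   (up ~0.8 points)
# 4 web server(s): tier 99.990%, system 89.09%   (diminishing returns)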
Now, this is an oversimplification of course, and there could be other reasons for adding more
servers to your fleet. Overscaling in order to reduce the impact of a server failure (AKA “blast
radius”) could be one; this is an example of so-called N+1 deployment. Another reason
could be scale (which we will talk about later).
As we have exhausted our ability to make a significant impact on the availability of our system
by adding redundancy to the web tier, we need to look for new opportunities elsewhere. The
next most logical place is another stateless part of the system, the application tier:
Again, we gained around 8 percentage points after adding a second box, and the incremental
availability gains diminished quickly after that, eventually also reaching the point of
diminishing returns.
At this point, we have noticed a second pattern: The availability of your entire stack cannot
exceed that of its least-available component or tier.
It looks inevitable that we have to add redundancy to our data tier as well.
And sure enough, doing so solves the problem, at least on paper. In reality, adding
redundancy to a stateful component brings with it additional challenges, but that is a topic for
another blog post.
Let’s now take this one step further and take a look at another part of the stack that has been
assumed all along but never brought to light: the data center that hosts it. Should it go down
due to power outage, Internet connectivity issues, fire or natural disaster, your entire stack
will become inaccessible, rendering all the investments of time and money that we made in
our multi-node, multi-tiered redundant stack useless.
We can choose to host our stack in a Tier IV data center, which often will cost us a bundle
and still may not offer sufficient protection against natural disasters. Or, we can apply the
same principle of redundancy to the data center itself and spread our stack across multiple
data centers. Wait, what? Multiple data centers? What about latency? What about
maintenance overhead?
All these and other questions may cross your mind when you read the last suggestion. And
under normal circumstances, you would be correct. In order for this approach to be successful
and cost effective, we need data centers that are:
Built for high reliability so that nothing short of a natural disaster brings one down
Located close to one another and connected via low-latency high-bandwidth links to ensure
low latency between the facilities
Located far enough from each other that a single natural disaster cannot bring all of them
down at the same time.
This definition sounds familiar to some of us, doesn’t it? It sounds like an AWS Availability
Zone (or AZ for short)! By moving our stack to AWS, we can spread it across multiple data
centers just as easily as within a single one. And it will cost us just as much:
If we add more bells and whistles, such as hosting our static assets in Amazon Simple Storage
Service (S3), serving our content through a CDN (Amazon CloudFront), and adding the ability to
scale both stateless tiers automatically (AWS Auto Scaling), we’ll arrive at the canonical
highly available web architecture.
And to top things off, let’s also briefly talk about scalability. By splitting our stack into tiers,
we made it possible to increase the scale of each tier independently. When scaling a
component of a system, two approaches can be used: Getting a bigger box (also known as
scaling up or vertical scaling) or getting more boxes (also known as scaling out or horizontal
scaling).
Vertical scaling may be easier from an operational standpoint, but it has two major limitations:
First, bigger boxes tend to get more expensive faster than the additional scale they provide.
Second, there is a limit to how big a box you can get. Horizontal scaling offers more room
to grow and better cost efficiency, but it introduces an additional level of complexity, requiring
a higher level of operational maturity and a high degree of automation.
Event-driven architectures
In an event-driven system, events are generated by event producers, ingested and filtered by
an event router (or broker), and then fanned out to the appropriate event consumers (or
sinks). The events are forwarded to subscribers defined by one or more matching triggers.
These three components—event producers, event router, event consumers—are decoupled
and can be independently deployed, updated, and scaled.
The event router links the different services and is the medium through which messages are
sent and received. It executes a response to the original event generated by an event producer
and sends this response downstream to the appropriate consumers. Events are handled
asynchronously, and their outcomes are determined when a service reacts to an event or is
impacted by it, resulting in a very simplified event flow.
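To make the producer/router/consumer relationship concrete, here is a minimal, illustrative sketch (plain Python, in-memory, with hypothetical event names; a real system would use a managed broker) of a router fanning events out to subscribed consumers:

from collections import defaultdict
from typing import Callable, Dict, List

# Toy in-memory event router: producers publish events, the router matches
# them against subscriptions (triggers) and fans them out to every consumer.
class EventRouter:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, consumer: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(consumer)

    def publish(self, event_type: str, payload: dict) -> None:
        # Fan out: each matching consumer processes the event independently.
        for consumer in self._subscribers[event_type]:
            consumer(payload)

router = EventRouter()
router.subscribe("order.created", lambda e: print("billing service saw", e))
router.subscribe("order.created", lambda e: print("shipping service saw", e))

# The producer emits an event without knowing who will consume it.
router.publish("order.created", {"order_id": 42, "total": 19.99})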
When to use event-driven architectures
To monitor and receive alerts for any anomalies or changes to storage buckets, database
tables, virtual machines, or other resources.
To fan out a single event to multiple consumers. The event router will push the event to all
the appropriate consumers, without you having to write customized code. Each service can
then process the event in parallel, yet differently.
To provide interoperability between different technology stacks while maintaining the
independence of each stack.
To coordinate systems and teams operating in and deploying across different regions and
accounts. You can easily reorganize ownership of microservices. There are fewer cross-team
dependencies and you can react more quickly to changes that would otherwise be impeded by
barriers to data access.
Event producers are logically separated from event consumers. This decoupling between the
production and consumption of events means that services are interoperable but can be
scaled, updated, and deployed independently of each other.
Loose coupling reduces dependencies and allows you to implement services in different
languages and frameworks. You can add or remove event producers and receivers without
having to change the logic in any one service. You do not need to write custom code to poll,
filter, and route events.
In an event-driven system, events are generated asynchronously, and can be issued as they
happen without waiting for a response. Loose coupling between components means that if one
service fails, the others are unaffected. If necessary, you can log events so that the receiving
service can resume from the point of failure, or replay past events.
Event-driven systems allow for easy push-based messaging and clients can receive updates
without needing to continuously poll remote services for state changes. These pushed
messages can be used for on-the-fly data processing and transformation, and real-time
analysis. Moreover, with less polling, there is a reduction in network I/O, and decreased
costs.
Simplified auditing and event sourcing
The centralized location of the event router simplifies auditing, and allows you to control
who can interact with a router, and which users and resources have access to your data. You
can also encrypt your data both in transit and at rest.
Additionally, you can make use of event sourcing, an architectural pattern that records all
changes made to an application's state, in the same sequence that they were originally
applied. Event sourcing provides a log of immutable events which can be kept for auditing
purposes, to recreate historic states, or as a canonical narrative to explain a business-driven
decision.
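As an illustration of the event-sourcing idea (a generic sketch, not tied to any particular AWS service), here is an append-only event log that can be replayed to rebuild current or historic state:

# Minimal event-sourcing sketch: state changes are recorded as an immutable,
# ordered log of events; current state is derived by replaying the log.
events = []  # append-only log

def record(event_type: str, data: dict) -> None:
    events.append({"type": event_type, "data": data})

def replay(log: list) -> dict:
    balance = 0
    for event in log:
        if event["type"] == "deposited":
            balance += event["data"]["amount"]
        elif event["type"] == "withdrawn":
            balance -= event["data"]["amount"]
    return {"balance": balance}

record("deposited", {"amount": 100})
record("withdrawn", {"amount": 30})
record("deposited", {"amount": 5})

print(replay(events))       # current state:  {'balance': 75}
print(replay(events[:2]))   # historic state: {'balance': 70}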
Well-Architected Best Practices
Security
The Security pillar includes the ability to protect data, systems, and assets to take advantage
of cloud technologies to improve your security.
The security pillar provides an overview of design principles, best practices, and questions.
You can find prescriptive guidance on implementation in the Security Pillar whitepaper.
Design Principles
Implement a strong identity foundation: Implement the principle of least privilege and
enforce separation of duties with appropriate authorization for each interaction with your
AWS resources. Centralize identity management, and aim to eliminate reliance on long-
term static credentials.
Enable traceability: Monitor, alert, and audit actions and changes to your environment
in real time. Integrate log and metric collection with systems to automatically investigate
and take action.
Apply security at all layers: Apply a defense in depth approach with multiple security
controls. Apply to all layers (for example, edge of network, VPC, load balancing, every
instance and compute service, operating system, application, and code).
Protect data in transit and at rest: Classify your data into sensitivity levels and use
mechanisms, such as encryption, tokenization, and access control where appropriate.
Keep people away from data: Use mechanisms and tools to reduce or eliminate the
need for direct access or manual processing of data. This reduces the risk of mishandling
or modification and human error when handling sensitive data.
Prepare for security events: Prepare for an incident by having incident management
and investigation policy and processes that align to your organizational requirements.
Run incident response simulations and use tools with automation to increase your speed
for detection, investigation, and recovery.
Definition
There are seven best practice areas for security in the cloud:
Security
Identity and Access Management
Detection
Infrastructure Protection
Data Protection
Incident Response
Application Security
Before you architect any workload, you need to put in place practices that influence security.
You will want to control who can do what. In addition, you want to be able to identify
security incidents, protect your systems and services, and maintain the confidentiality and
integrity of data through data protection. You should have a well-defined and practiced
process for responding to security incidents. These tools and techniques are important
because they support objectives such as preventing financial loss or complying with
regulatory obligations.
The AWS Shared Responsibility Model enables organizations that adopt the cloud to achieve
their security and compliance goals. Because AWS physically secures the infrastructure that
supports our cloud services, as an AWS customer you can focus on using services to
accomplish your goals. The AWS Cloud also provides greater access to security data and an
automated approach to responding to security events.
Best Practices
Security
To operate your workload securely, you must apply overarching best practices to every area
of security. Take requirements and processes that you have defined in operational
excellence at an organizational and workload level, and apply them to all areas.
Staying up to date with AWS and industry recommendations and threat intelligence helps you
evolve your threat model and control objectives. Automating security processes, testing, and
validation allow you to scale your security operations.
In AWS, segregating different workloads by account, based on their function and compliance
or data sensitivity requirements, is a recommended approach.
In AWS, privilege management is primarily supported by the AWS Identity and Access
Management (IAM) service, which allows you to control user and programmatic access to
AWS services and resources. You should apply granular policies, which assign permissions
to a user, group, role, or resource. You also have the ability to require strong password
practices, such as complexity level, avoiding re-use, and enforcing multi-factor
authentication (MFA). You can use federation with your existing directory service.
For workloads that require systems to have access to AWS, IAM enables secure access
through roles, instance profiles, identity federation, and temporary credentials.
Credentials must not be shared between any user or system. User access should be granted
using a least-privilege approach with best practices including password requirements and
MFA enforced. Programmatic access including API calls to AWS services should be
performed using temporary and limited-privilege credentials such as those issued by
the AWS Security Token Service.
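As one hedged illustration of this temporary-credential approach, here is a small sketch using boto3; the role ARN and session name below are placeholders, not values taken from this text:

import boto3

# Sketch: obtain short-lived, limited-privilege credentials from AWS STS by
# assuming a role, instead of embedding long-term access keys in the workload.
sts = boto3.client("sts")

response = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/ReadOnlyReporting",  # placeholder ARN
    RoleSessionName="reporting-job",  # placeholder session name
    DurationSeconds=900,              # credentials expire automatically
)

creds = response["Credentials"]
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
# Every call made with this client uses the temporary, scoped credentials.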
Reliability
The Reliability pillar encompasses the ability of a workload to perform its intended function
correctly and consistently when it’s expected to. This includes the ability to operate and test
the workload through its total lifecycle. This paper provides in-depth, best practice guidance
for implementing reliable workloads on AWS.
The reliability pillar provides an overview of design principles, best practices, and questions.
You can find prescriptive guidance on implementation in the Reliability Pillar whitepaper.
Design Principles
Performance Efficiency
The Performance Efficiency pillar includes the ability to use computing resources efficiently
to meet system requirements, and to maintain that efficiency as demand changes and
technologies evolve.
The performance efficiency pillar provides an overview of design principles, best practices,
and questions. You can find prescriptive guidance on implementation in the Performance
Efficiency Pillar whitepaper.
Design Principles
There are five design principles for performance efficiency in the cloud:
Definition
There are four best practice areas for performance efficiency in the cloud:
Selection
Review
Monitoring
Tradeoffs
Reviewing your choices on a regular basis ensures that you are taking advantage of the
continually evolving AWS Cloud. Monitoring ensures that you are aware of any deviance
from expected performance. Make trade-offs in your architecture to improve performance,
such as using compression or caching, or relaxing consistency requirements.
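As a small, generic illustration of the caching trade-off mentioned above (a sketch that trades memory and potential staleness for lower latency; the function and timings are hypothetical):

from functools import lru_cache
import time

# Sketch: cache results of an expensive lookup. The trade-off is extra memory
# and potentially stale data in exchange for much lower latency on repeat calls.
@lru_cache(maxsize=1024)
def product_details(product_id: int) -> tuple:
    time.sleep(0.2)  # stand-in for a slow database or API call
    return (product_id, f"product-{product_id}")

start = time.perf_counter()
product_details(7)   # cache miss: pays the full cost of the lookup
print(f"first call:  {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
product_details(7)   # cache hit: served from memory
print(f"second call: {time.perf_counter() - start:.6f}s")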
Design Patterns
This guide describes how to implement commonly used modernization design
patterns by using AWS services. An increasing number of modern applications are designed
by using microservices architectures to achieve scalability, improve release velocity, reduce
the scope of impact for changes, and reduce regression. This leads to improved developer
productivity and increased agility, better innovation, and an increased focus on business
needs. Microservices architectures also support the use of the best technology for the service
and the database, and promote polyglot code and polyglot persistence.
Traditionally, monolithic applications run in a single process, use one data store, and run on
servers that scale vertically. In comparison, modern microservice applications are fine-
grained, have independent fault domains, run as services across the network, and can use
more than one data store depending on the use case. The services scale horizontally, and a
single transaction might span multiple databases. Development teams must focus on network
communication, polyglot persistence, horizontal scaling, eventual consistency, and
transaction handling across the data stores when developing applications by using
microservices architectures. Therefore, modernization patterns are critical for solving
commonly occurring problems in modern application development, and they help accelerate
software delivery.
This guide provides a technical reference for cloud architects, technical leads, application and
business owners, and developers who want to choose the right cloud architecture for design
patterns based on well-architected best practices. Each pattern discussed in this guide
addresses one or more known scenarios in microservices architectures. The guide discusses
the issues and considerations associated with each pattern, provides a high-level architectural
implementation, and describes the AWS implementation for the pattern. Open-source GitHub
samples and workshop links are provided where available.