AWS Disaster Recovery
October 2014
Contents
Introduction
Recovery Time Objective and Recovery Point Objective
Traditional DR Investment Practices
AWS Services and Features Essential for Disaster Recovery
Example Disaster Recovery Scenarios with AWS
  Backup and Restore
  Pilot Light for Quick Recovery into AWS
  Warm Standby Solution in AWS
  Multi-Site Solution Deployed on AWS and On-Site
  AWS Production to an AWS DR Solution Using Multiple AWS Regions
Replication of Data
Failing Back from a Disaster
Improving Your DR Plan
Software Licensing and DR
Conclusion
Further Reading
Document Revisions
Abstract
In the event of a disaster, you can quickly launch resources in Amazon Web Services (AWS) to ensure business
continuity. This whitepaper highlights AWS services and features that you can leverage for your disaster recovery (DR)
processes to significantly minimize the impact on your data, your system, and your overall business operations. The
whitepaper also includes scenarios that show you, step-by-step, how to improve your DR plan and leverage the full
potential of the AWS cloud for disaster recovery.
Introduction
Disaster recovery (DR) is about preparing for and recovering from a disaster. Any event that has a negative impact on a
company's business continuity or finances could be termed a disaster. This includes hardware or software failure, a
network outage, a power outage, physical damage to a building (such as fire or flooding), human error, or some other
significant event.
To minimize the impact of a disaster, companies invest time and resources to plan and prepare, to train employees, and
to document and update processes. The amount of investment for DR planning for a particular system can vary
dramatically depending on the cost of a potential outage. Companies that have traditional physical environments
typically must duplicate their infrastructure to ensure the availability of spare capacity in the event of a disaster. The
infrastructure needs to be procured, installed, and maintained so that it is ready to support the anticipated capacity
requirements. During normal operations, the infrastructure typically is under-utilized or over-provisioned.
With Amazon Web Services (AWS), your company can scale up its infrastructure on an as-needed, pay-as-you-go basis.
You get access to the same highly secure, reliable, and fast infrastructure that Amazon uses to run its own global
network of websites. AWS also gives you the flexibility to quickly change and optimize resources during a DR event,
which can result in significant cost savings.
This whitepaper outlines best practices to improve your DR processes, from minimal investments to full-scale availability
and fault tolerance, and shows you how you can use AWS services to reduce cost and ensure business continuity during
a DR event.
1. From http://en.wikipedia.org/wiki/Recovery_time_objective
2. From http://en.wikipedia.org/wiki/Recovery_point_objective
your public IP addresses to instances in your account in a particular region. For DR, you can also pre-allocate some IP
addresses for the most critical systems so that their IP addresses are already known before disaster strikes. This can
simplify the execution of the DR plan.
Elastic Load Balancing automatically distributes incoming application traffic across multiple Amazon EC2 instances. It
enables you to achieve even greater fault tolerance in your applications by seamlessly providing the load-balancing
capacity that is needed in response to incoming application traffic. Just as you can pre-allocate Elastic IP addresses, you
can pre-allocate your load balancer so that its DNS name is already known, which can simplify the execution of your DR
plan.
Amazon Virtual Private Cloud (Amazon VPC) lets you provision a private, isolated section of the AWS cloud where you
can launch AWS resources in a virtual network that you define. You have complete control over your virtual networking
environment, including selection of your own IP address range, creation of subnets, and configuration of route tables
and network gateways. This enables you to create a VPN connection between your corporate data center and your VPC,
and leverage the AWS cloud as an extension of your corporate data center. In the context of DR, you can use Amazon
VPC to extend your existing network topology to the cloud; this can be especially appropriate when recovering
enterprise applications that are typically on the internal network.
AWS Direct Connect makes it easy to set up a dedicated network connection from your premises to AWS. In many
cases, this can reduce your network costs, increase bandwidth throughput, and provide a more consistent network
experience than Internet-based connections.
Databases
For your database needs, consider using these AWS services:
Amazon Relational Database Service (Amazon RDS) makes it easy to set up, operate, and scale a relational database in
the cloud. You can use Amazon RDS either in the preparation phase for DR to hold your critical data in a database that is
already running, or in the recovery phase to run your production database. When you want to look at multiple regions,
Amazon RDS gives you the ability to snapshot data from one region to another, and also to have a read replica running in
another region.
Amazon DynamoDB is a fast, fully managed NoSQL database service that makes it simple and cost-effective to store and
retrieve any amount of data and serve any level of request traffic. It has reliable throughput and single-digit millisecond
latency. You can also use it in the preparation phase to copy data to DynamoDB in another region or to Amazon S3.
During the recovery phase of DR, you can scale up seamlessly in a matter of minutes with a single click or API call.
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that makes it simple and cost-effective
to efficiently analyze all your data using your existing business intelligence tools. You can use Amazon Redshift in the
preparation phase to snapshot your data warehouse to be durably stored in Amazon S3 within the same region or
copied to another region. During the recovery phase of DR, you can quickly restore your data warehouse into the same
region or within another AWS region.
You can also install and run your choice of database software on Amazon EC2, and you can choose from a variety of
leading database systems.
For more information about database options on AWS, see Running Databases on AWS.
Deployment orchestration
Deployment automation and post-startup software installation/configuration processes and tools can be used in
Amazon EC2. We highly recommend investments in this area. This can be very helpful in the recovery phase, enabling
you to create the required set of resources in an automated way.
AWS CloudFormation gives developers and systems administrators an easy way to create a collection of related AWS
resources and provision them in an orderly and predictable fashion. You can create templates for your environments
and deploy associated collections of resources (called a stack) as needed.
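
As a rough illustration of what such a template can look like, the following Python sketch builds a minimal AWS CloudFormation template as a dictionary and serializes it to JSON. The logical names (AmiId, DrAppServer), the description text, and the default instance type are hypothetical placeholders chosen for this example; a real DR template would normally describe many more resources.

```python
import json


def build_dr_template(instance_type="t2.micro"):
    """Return a minimal CloudFormation template as a Python dict.

    The parameter name "AmiId" and the resource name "DrAppServer"
    are hypothetical; pick names that match your own environment.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical DR stack with one application server",
        "Parameters": {
            # The AMI to launch is passed in when the stack is created,
            # so the same template can be reused as AMIs are refreshed.
            "AmiId": {
                "Type": "String",
                "Description": "ID of the pre-built AMI to launch",
            }
        },
        "Resources": {
            "DrAppServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "AmiId"},
                    "InstanceType": instance_type,
                },
            }
        },
    }


# Serialize the template so it can be stored alongside the DR runbook.
template_json = json.dumps(build_dr_template(), indent=2, sort_keys=True)
print(template_json)
```

Keeping the template in version control next to the DR plan means the stack definition is already reviewed and available when the recovery phase starts.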
AWS Elastic Beanstalk is an easy-to-use service for deploying and scaling web applications and services developed with
Java, .NET, PHP, Node.js, Python, Ruby, and Docker. You can deploy your application code, and AWS Elastic Beanstalk
will provision the operating environment for your applications.
AWS OpsWorks is an application management service that makes it easy to deploy and operate applications of all types
and sizes. You can define your environment as a series of layers, and configure each layer as a tier of your application.
AWS OpsWorks has automatic host replacement, so a failed instance is replaced automatically.
You can use AWS OpsWorks in the preparation phase to template your environment, and you can combine it with AWS
CloudFormation in the recovery phase. You can quickly provision a new stack from the stored configuration that
supports the defined RTO.
Security and compliance
There are many security-related features across the AWS services. We recommend that you review the Security Best
Practices whitepaper. AWS also provides further risk and compliance information in the AWS Security Center. A full
discussion of security is out of scope for this paper.
AWS enables you to cost-effectively operate each of these DR strategies. It's important to note that these are just
examples of possible approaches, and variations and combinations of these are possible. If your application is already
running on AWS, then multiple regions can be employed and the same DR strategies will still apply.
The following figure shows data backup options to Amazon S3, from either on-site infrastructure or from AWS.
Figure 2: Data Backup Options to Amazon S3 from On-Site Infrastructure or from AWS.
Of course, the backup of your data is only half of the story. If disaster strikes, you'll need to recover your data quickly
and reliably. You should ensure that your systems are configured to retain and secure your data, and you should test
your data recovery processes.
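
One simple way to test a recovery process is to restore a backup and confirm that the restored file is byte-for-byte identical to the source. The Python sketch below does this with SHA-256 checksums. It is only an illustration: the file names are made up, and a local file copy stands in for the real backup-and-restore round trip.

```python
import hashlib
import os
import shutil
import tempfile


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source_path, restored_path):
    """True if the restored file has the same content as the source."""
    return sha256_of(source_path) == sha256_of(restored_path)


# Demonstration with temporary files. shutil.copyfile is a stand-in
# (an assumption for this sketch) for a real backup and restore.
workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "orders.db")
restored = os.path.join(workdir, "orders.restored.db")
with open(source, "wb") as handle:
    handle.write(b"example business data\n" * 1000)
shutil.copyfile(source, restored)

restore_ok = restore_matches_source(source, restored)
print("restore verified:", restore_ok)
shutil.rmtree(workdir)
```

Running a check like this on a schedule, rather than only during an incident, helps surface corrupted or incomplete backups early.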
The following diagram shows how you can quickly restore a system from Amazon S3 backups to Amazon EC2.
Use Elastic IP addresses, which can be pre-allocated and identified in the preparation phase for DR, and
associate them with your instances. Note that for MAC address-based software licensing, you can use elastic
network interfaces (ENIs), which have a MAC address that can also be pre-allocated to provision licenses against.
You can associate these with your instances, just as you would with Elastic IP addresses.
Use Elastic Load Balancing (ELB) to distribute traffic to multiple instances. You would then update your DNS
records to point at your Amazon EC2 instance or point to your load balancer using a CNAME. We recommend
this option for traditional web-based applications.
For less critical systems, you can ensure that you have any installation packages and configuration information available
in AWS, for example, in the form of an Amazon EBS snapshot. This will speed up the application server setup, because
you can quickly create multiple volumes in multiple Availability Zones to attach to Amazon EC2 instances. You can then
install and configure accordingly, for example, by using the backup-and-restore method.
The pilot light method gives you a quicker recovery time than the backup-and-restore method because the core pieces
of the system are already running and are continually kept up to date. AWS enables you to automate the provisioning
and configuration of the infrastructure resources, which can be a significant benefit to save time and help protect
against human errors. However, you will still need to perform some installation and configuration tasks to recover the
applications fully.
Preparation phase
The following figure shows the preparation phase, in which you need to have your regularly changing data replicated to
the pilot light, the small core around which the full environment will be started in the recovery phase. Your less
frequently updated data, such as operating systems and applications, can be periodically updated and stored as AMIs.
After recovery, you should ensure that redundancy is restored as quickly as possible. A failure of your DR environment
shortly after your production environment fails is unlikely, but you should be aware of this risk. Continue to take regular
backups of your system, and consider additional redundancy at the data layer.
The following figure shows the recovery phase of the pilot light scenario.
Figure 9: The Recovery Phase of the Multi-Site Scenario Involving On-Site and AWS Infrastructure.
You don't need to negotiate contracts with another provider in another region
You can use the same underlying AWS technologies across regions
For more information, see the Migrating AWS Resources to a New Region whitepaper.
Replication of Data
When you replicate data to a remote location, you should consider these factors:
Distance between the sites: Larger distances typically are subject to more latency or jitter.
Available bandwidth: The breadth and variability of the interconnections.
Data rate required by your application: The data rate should be lower than the available bandwidth.
Replication technology: The replication technology should be parallel (so that it can use the network effectively).
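
The data-rate factor lends itself to a quick back-of-the-envelope check. The Python sketch below converts a daily volume of changed data into an average rate and compares it with the usable capacity of a link. The 70 percent link efficiency and the sample figures are assumptions for illustration only, and sizes use decimal gigabytes.

```python
SECONDS_PER_DAY = 86_400


def required_mbps(changed_gb_per_day):
    """Average rate in megabits per second needed to ship a daily change volume."""
    bits_per_day = changed_gb_per_day * 1e9 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e6


def can_keep_up(changed_gb_per_day, link_mbps, efficiency=0.7):
    """True if the average change rate fits within the usable link capacity.

    `efficiency` is an assumed fraction of the nominal bandwidth that is
    actually available to replication traffic.
    """
    return required_mbps(changed_gb_per_day) < link_mbps * efficiency


def hours_to_transfer(size_gb, link_mbps, efficiency=0.7):
    """Hours needed to move an initial data set of `size_gb` over the link."""
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return size_gb * 1e9 * 8 / usable_bits_per_second / 3600


# Example figures (hypothetical): 500 GB of changes per day, a 100 Mbps
# link, and a 2,000 GB initial copy.
print(round(required_mbps(500), 1))           # about 46.3 Mbps on average
print(can_keep_up(500, 100))                  # True: 46.3 < 70 usable Mbps
print(round(hours_to_transfer(2000, 100), 1)) # about 63.5 hours
```

An average rate hides peaks, so a link that passes this check can still fall behind during bursts of writes; measuring the peak change rate gives a safer bound.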
There are two main approaches for replicating data: synchronous and asynchronous.
Synchronous replication
Data is atomically updated in multiple locations. This puts a dependency on network performance and availability. In
AWS, Availability Zones within a region are well connected, but physically separated. For example, when deployed in
Multi-AZ mode, Amazon RDS uses synchronous replication to duplicate data in a second Availability Zone. This ensures
that data is not lost if the primary Availability Zone becomes unavailable.
Asynchronous replication
Data is not atomically updated in multiple locations. It is transferred as network performance and availability allow, and
the application continues to write data that might not be fully replicated yet.
Many database systems support asynchronous data replication. The database replica can be located remotely, and the
replica does not have to be completely synchronized with the primary database server. This is acceptable in many
scenarios, for example, as a backup source or reporting/read-only use cases. In addition to database systems, you can
also apply asynchronous replication to network file systems and data volumes.
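
The practical difference between the two approaches is how much recently written data can be missing from the replica if the primary site is lost. The following toy Python model illustrates that difference. It is a deliberately simplified, single-process sketch with hypothetical class names, not a description of how any particular database or AWS service implements replication.

```python
from collections import deque


class Replica:
    """Holds the records that have arrived at the remote location."""

    def __init__(self):
        self.records = []


class Primary:
    """Accepts writes and forwards them to a replica."""

    def __init__(self, replica, synchronous):
        self.records = []
        self.replica = replica
        self.synchronous = synchronous
        self.pending = deque()  # written locally, not yet shipped

    def write(self, record):
        self.records.append(record)
        if self.synchronous:
            # Synchronous: the write completes only once the replica has it.
            self.replica.records.append(record)
        else:
            # Asynchronous: acknowledge now, ship when the network allows.
            self.pending.append(record)

    def ship(self, max_records):
        """Transfer up to max_records queued writes to the replica."""
        shipped = 0
        while self.pending and shipped < max_records:
            self.replica.records.append(self.pending.popleft())
            shipped += 1
        return shipped


def writes_lost_on_failover(primary):
    """Writes on the primary that the replica has not received."""
    return len(primary.records) - len(primary.replica.records)


sync_primary = Primary(Replica(), synchronous=True)
async_primary = Primary(Replica(), synchronous=False)
for i in range(10):
    sync_primary.write(i)
    async_primary.write(i)
async_primary.ship(max_records=6)  # the network delivered only 6 so far

print(writes_lost_on_failover(sync_primary))   # 0
print(writes_lost_on_failover(async_primary))  # 4
```

In this model the unshipped queue is the exposure that the recovery point objective has to tolerate, which is why asynchronous replication suits backup and reporting replicas better than data that cannot be lost.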
We recommend that you understand the replication technology used in your software solution. A detailed analysis of
replication technology is beyond the scope of this paper.
AWS regions are completely independent of each other, but there are no differences in the way you access them and
use them. This enables you to create DR processes that span continental distances, without the challenges or costs that
this would normally incur. You can back up data and systems to two or more AWS regions, allowing service restoration
even in the face of extremely large-scale disasters. You can use AWS regions to serve your users around the globe with
relatively low complexity to your operational processes.
System access
You can also create roles for your Amazon EC2 resources, so that only users who are assigned to specified roles can
perform defined actions on your DR environment, such as accessing an Amazon S3 bucket or re-pointing an Elastic IP
address.
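
As an illustration, the permissions attached to such a role could be expressed as an IAM policy document like the one generated below. This is a sketch under stated assumptions: the bucket name and statement IDs are hypothetical, and the set of allowed actions should be narrowed or extended to match what your DR runbook actually requires.

```python
import json


def build_dr_operator_policy(bucket_name):
    """Return an IAM policy document (as a dict) for a DR operator role.

    `bucket_name` and the Sid values are hypothetical placeholders.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read access to the bucket that holds the DR backups.
                "Sid": "ReadDrBackups",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            },
            {
                # Permission to re-point an Elastic IP address during failover.
                "Sid": "RepointElasticIp",
                "Effect": "Allow",
                "Action": ["ec2:AssociateAddress", "ec2:DisassociateAddress"],
                "Resource": "*",
            },
        ],
    }


policy_json = json.dumps(build_dr_operator_policy("example-dr-backups"), indent=2)
print(policy_json)
```

Limiting the role to the few actions needed for recovery keeps the DR environment usable in an emergency without granting broad access during normal operations.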
Automation
You can automate the deployment of applications onto AWS-based servers and your on-premises servers by using
configuration management or orchestration software. This allows you to handle application and configuration change
management across both environments with ease. There are several popular orchestration software options available.
For a list of solution providers, see the AWS Partner Directory.
AWS CloudFormation works in conjunction with several tools to provision infrastructure services in an automated way.
Higher levels of abstraction are also available with AWS OpsWorks or AWS Elastic Beanstalk. The overall goal is to
automate your instances as much as possible. For more information, see the Architecting for the Cloud: Best Practices
whitepaper.
You can use Auto Scaling to ensure that your pool of instances is appropriately sized to meet the demand based on the
metrics that you specify in Amazon CloudWatch. This means that in a DR situation, as your user base starts to use the
environment more, the solution can scale up dynamically to meet this increased demand. After the event is over and
usage potentially decreases, the solution can scale back down to a minimum level of servers.
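
The scale-up and scale-down behavior described above can be pictured with a small threshold-based model. The Python sketch below is a conceptual illustration only: the function name, thresholds, step size, and capacity limits are all assumptions, and it does not represent how Auto Scaling policies are configured or evaluated by the service.

```python
def desired_capacity(current, avg_cpu_percent, minimum=2, maximum=20,
                     scale_out_above=70.0, scale_in_below=30.0, step=2):
    """Return the next instance count for a simple threshold policy.

    All thresholds and limits are hypothetical defaults for this sketch.
    """
    if avg_cpu_percent > scale_out_above:
        target = current + step      # demand is high: add capacity
    elif avg_cpu_percent < scale_in_below:
        target = current - step      # demand is low: remove capacity
    else:
        target = current             # within the comfortable band
    # Never go below the minimum or above the maximum pool size.
    return max(minimum, min(maximum, target))


# Simulated average CPU readings during and after a DR event.
capacity = 2
history = []
for cpu in [85, 90, 75, 50, 20, 10, 5]:
    capacity = desired_capacity(capacity, cpu)
    history.append(capacity)
print(history)
```

In this run the pool grows while load is high and shrinks back to the minimum once usage drops, which mirrors the pay-only-for-what-you-use benefit in a DR situation.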
Conclusion
Many options and variations for DR exist. This paper highlights some of the common scenarios, ranging from simple
backup and restore to fault-tolerant, multi-site solutions. AWS gives you fine-grained control and many building blocks
to build the appropriate DR solution, given your DR objectives (RTO and RPO) and budget. The AWS services are
available on-demand, and you pay only for what you use. This is a key advantage for DR, where significant infrastructure
is needed quickly, but only in the event of a disaster.
This whitepaper has shown how AWS provides flexible, cost-effective infrastructure solutions, enabling you to have a
more effective DR plan.
Further Reading
Document Revisions
We've made the following changes to this whitepaper since its original publication in January 2012:
Added information about new services: Amazon Glacier, Amazon Redshift, AWS OpsWorks, AWS Elastic
Beanstalk, and Amazon DynamoDB
Added information about various features of AWS services for DR scenarios using multiple AWS regions