
Gartner IT Security Summit 2005
6-8 June 2005
Marriott Wardman Park Hotel
Washington, District of Columbia

Disaster Recovery and Data Replication Architectures
Donna Scott

These materials can be reproduced only with Gartner's written approval. Such approvals must be requested via e-mail: quote.requests@gartner.com.
© 2005 Gartner, Inc. and/or its affiliates. All rights reserved. Reproduction of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The reader assumes sole responsibility for the selection of these materials to achieve its intended results. The opinions expressed herein are subject to change without notice.
Strategic Planning Assumptions: Through 2007, fewer than 20 percent of enterprises will operate at Stage 3, the highest level of disaster recovery management process maturity (0.8 probability). By year-end 2007, the percentage of large enterprises with well-defined disaster recovery processes and regularly tested plans will rise from approximately 60 percent to 80 percent (0.8 probability).
Disaster Recovery Management Is Maturing
Stage 0: No data recovery plan, or a shelfware plan
Stage 1: Data recovery as an IT project; platform-based; plan occasionally tested; ad hoc project status reporting
Stage 2: Data recovery as a process, a component of business continuity management; DR linked to business process requirements; defined organization; plan regularly tested; formalized reporting
Stage 3: Business integration; partner integration; process integration; continuous improvement culture; frequent, diverse testing; formalized reporting to BCM, executives and board
Disaster recovery management (DRM) has evolved over 20 years, from its roots in platform-based IT recovery
(such as mainframe) to integration in business continuity plans. Enterprises tend to evolve through these four DRM stages, moving to the next stage when the benefits of doing so outweigh the risks of inaction.
In the first phase, disaster recovery (DR) plans are nonexistent or they exist only as shelfware. They are not
tested or maintained and would not enable or direct recovery actions.
Enterprises typically move next to define a DR plan on a project basis. Typically, there is a realization inside IT that some disaster risk mitigation must be implemented, or there is outside business pressure to protect a specific business process (such as the call center). The project is focused on a plan with occasional testing; however, it's typically not integrated into other IT/business processes and is not maintained.
In the next phase, enterprises focus on building a DRM organization and processes, ensuring a life cycle
approach to maintaining a plan, and regularly test the plan (once or twice per year). Business process owners
actively determine IT recoverability requirements, as well as participate in tests.
In the final phase, the focus is on process integration: with change management, to ensure that DR plans are kept up to date, and with incident/problem management, to leverage IT support processes. DRM is also considered early in the stages of a new project. Emphasis is also on end-to-end planning, including partner integration and continuous improvement/best practices.
For critical business applications used in critical business processes, 24x7 application availability and protection against site disasters are mandatory. This presentation focuses on the technology, data replication architectures and best practices needed to achieve near-real-time recovery time and point objectives. Client issues include:
Client Issues
1. How will enterprises justify investments in technologies, people and business processes
needed to deliver continuous application availability and protection from site disasters?
2. What technologies will be critical for data replication architectures and what are their trade-
offs?
3. What architectures and best practices will enable enterprises to achieve 24x7 availability and
disaster protection as required by the business?
Investing to Reduce Unplanned Downtime

40% Application Failure
Investment strategy (people and process): application architecture/design, management instrumentation, change management, problem management, configuration management, performance/capacity management

40% Operations Errors
Investment strategy (people and process): hiring and training, IT process maturity, reduced complexity, automation, change and problem management

20% Environmental Factors, HW, OS, Power, Disasters
Investment strategy: redundancy, service contracts, availability monitoring, BCM/DRM and testing
Based on extensive feedback from clients, we estimate that, on average, 40 percent of unplanned mission-critical
application downtime is caused by application failures (including bugs, performance issues or changes to
applications that cause problems); 40 percent by operations errors (including incorrectly or not performing an
operations task); and about 20 percent by hardware, operating systems, environmental factors (for example,
heating, cooling and power failures), first-day security vulnerabilities, and natural or manmade disasters. To
address the 80 percent of unplanned downtime caused by people failures (vs. technology failures or disasters),
enterprises should invest in improving their change and problem management processes (to reduce the downtime
caused by application failures); automation tools, such as job scheduling and event management (to reduce the
downtime caused by operator errors); and improving availability through enterprise architecture (including
reducing complexity) and management instrumentation. The balance should be addressed by eliminating single
points of failure through redundancy, implementing BC/DR plans and reducing time-to-repair through
technology support/maintenance agreements.
Action Item: Don't let redundancy give you a false sense of security, since 80 percent of downtime is caused by people and process issues.
Client Issue: How will enterprises justify investments in technologies, people and business
processes needed to deliver continuous application availability and protection from site
disasters?
Disaster Recovery and Data Replication Architectures
Page 4
Donna Scott
C4, SEC11, 6/05, AE
2005 Gartner, Inc. and/or its affiliates. All rights reserved. Reproduction of this publication in any formwithout prior written permission is
forbidden. The information contained herein has been obtained fromsources believed to be reliable. Gartner disclaims all warranties as to
the accuracy, completeness or adequacy of such information. Gartner shall have no liability for errors, omissions or inadequacies in the
information contained herein or for interpretations thereof. The reader assumes sole responsibility for the selection of these materials to
achieve its intended results. The opinions expressed herein are subject to change without notice.
Justification Vehicle: The Business Impact Assessment

Know your downtime costs per hour, day, two days ...
Revenue: direct loss, compensatory payments, lost future revenue, billing losses, investment losses
Productivity: number of employees affected x hours out x burdened hourly rate
Damaged reputation: customers, suppliers, financial markets, banks, business partners ...
Financial performance: revenue recognition, cash flow, lost discounts (A/P), payment guarantees, credit rating, stock price
Other expenses: temporary employees, equipment rental, overtime costs, extra shipping costs, travel expenses, legal obligations ...
These costs, assessed over time, drive the RTO and RPO targets.
Client Issue: How will enterprises justify investments in technologies, people and business
processes needed to deliver continuous application availability and protection from site
disasters?
Enterprises need to understand the consequences of downtime to justify investments for operational availability
and business continuity (BC)/DR. A first step in developing a BC/DR plan is performing a business impact
analysis (BIA), where critical business processes are identified and prioritized and costs of downtime are
evaluated over time. The BIA is performed by a project team consisting of business unit, security and IT
personnel. Key goals of the BIA are to: 1) agree on the cost of business downtime over varying time periods, 2)
identify business process availability and recovery time objectives, and 3) identify business process recovery
point objectives. The BIA results feed into the recovery strategy and process. Enterprises that have never incorporated a BIA into their application life cycle processes typically initiate a BIA project and use the findings to ensure that current recovery strategies meet business process requirements. With real-time enterprise (RTE) applications, it is critical that BC/DR is built into the life cycle for new applications and business process enhancement projects, so that availability and recovery requirements are built into the architecture and design.
Action Item: Integrate business continuity management (BCM) and DRM into the enterprise project life cycle to ensure that recovery needs are identified in projects' initial phases and when business processes and systems change.
Strategic Planning Assumption: By year-end 2007, 65 percent of large enterprises will integrate
disaster recovery requirements into the new project life cycle, up from fewer than 25 percent in
2004 (0.8 probability).
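The downtime-cost arithmetic behind a BIA is straightforward to sketch. The following Python fragment is an illustrative model only: the cost categories (revenue, productivity, other expenses) come from the slide, while the function name and every figure are hypothetical.

```python
# Illustrative BIA downtime-cost model. Categories mirror the slide;
# all numbers below are hypothetical placeholders.

def downtime_cost(hours: float,
                  revenue_per_hour: float,
                  employees_affected: int,
                  burdened_hourly_rate: float,
                  other_expenses_per_hour: float = 0.0) -> float:
    """Estimate the direct cost of an outage lasting `hours`."""
    revenue_loss = revenue_per_hour * hours
    # Productivity, per the slide: employees affected x hours out
    # x burdened hourly rate.
    productivity_loss = employees_affected * hours * burdened_hourly_rate
    return revenue_loss + productivity_loss + other_expenses_per_hour * hours

# Tabulate costs over varying time periods, as a BIA would.
for h in (1, 8, 24, 48):
    cost = downtime_cost(h, revenue_per_hour=50_000, employees_affected=400,
                         burdened_hourly_rate=60, other_expenses_per_hour=2_000)
    print(f"{h:>3} hours down: ${cost:,.0f}")
```

In a real BIA the curve is rarely linear; reputation and financial-performance effects grow disproportionately as outages lengthen, which is why the slide stresses knowing costs per hour, day and two days.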
Criticality Ratings/Classification Systems
Class 1 (RTE): Customer-/partner-facing functions critical to revenue production, where loss is significant.
DR service levels and strategy: RTO = 0-4 hrs.; RPO = 0-4 hrs.; dedicated recovery environment; architecture may include automated failover.

Class 2: Less-critical revenue-producing functions; supply chain.
DR service levels and strategy: RTO = 8-24 hrs.; RPO = four hours; dedicated or shared recovery environment.

Class 3: Enterprise back-office functions.
DR service levels and strategy: RTO = three days; RPO = one day; shared recovery environment; may include quick-ship programs.

Class 4: Departmental functions.
DR service levels and strategy: RTO = five+ days; RPO = one day; quick-ship contracts typical; sourcing at time of disaster where RTOs are lengthy.
Client Issue: How will enterprises justify investments in technologies, people and business
processes needed to deliver continuous application availability and protection from site
disasters?
Business needs for application service availability/DR should be defined during the business requirements phase.
Ignoring this early often results in a solution that doesn't meet needs and ultimately requires significant re-
architecture to improve service. We recommend a classification scheme of supported service levels and
associated costs. These drive tasks and spending in development and application architecture, systems
architecture and operations. Business managers then develop a business case for a particular service
classification. From a DR perspective, this case is developed in the BIA and recovery strategy phases. Service-
level definitions should include scheduled uptime, percentage availability in scheduled uptime, and recovery
time and point objectives. In this example, Class 1 application services follow an RTE strategy; they are the services whose unavailability would cause the enterprise irreparable harm. Not all applications in a critical business process would be grouped in Class 1; rather, only those deemed most critical or with the greatest downtime effect. The DR architecture for Class 1, and even Class 2, would span two physical sites to meet availability/recovery needs.
Action Item: Develop a service-level classification system with associated development, infrastructure and
operations architecture requirements. A repeatable process is a process that works.
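To make the classification concrete, here is a minimal sketch that encodes the four classes above as data and picks the least stringent class that still satisfies a business process's stated RTO/RPO. The encoding (worst-case hours per class) and the selection rule are illustrative assumptions, not a Gartner-defined algorithm.

```python
# Hypothetical encoding of the criticality classes above.
# RTO/RPO are worst-case commitments, expressed in hours.
from dataclasses import dataclass

@dataclass
class ServiceClass:
    name: str
    rto_hours: float   # longest tolerated recovery time
    rpo_hours: float   # most data loss tolerated
    strategy: str

CLASSES = [
    ServiceClass("Class 1 (RTE)", 4, 4, "dedicated environment, automated failover"),
    ServiceClass("Class 2", 24, 4, "dedicated or shared recovery environment"),
    ServiceClass("Class 3", 72, 24, "shared recovery environment, quick-ship"),
    ServiceClass("Class 4", 120, 24, "quick-ship contracts, source at disaster"),
]

def classify(required_rto_hours: float, required_rpo_hours: float) -> ServiceClass:
    """Return the cheapest class whose commitments meet the requirement."""
    for cls in reversed(CLASSES):          # try the cheapest class first
        if (cls.rto_hours <= required_rto_hours
                and cls.rpo_hours <= required_rpo_hours):
            return cls
    return CLASSES[0]                      # fall back to the most stringent

print(classify(24, 4).name)   # a 24-hour RTO, 4-hour RPO need lands in Class 2
```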
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
Tactical Guideline: There is no one right disaster recovery data center strategy. Many
companies implement all four methods, depending on their application and recovery
requirements.
Disaster Recovery Strategies

Common strategies:
- Production/Load Sharing
- Production/Outsourcing DR
- Production/Development and Test DR
- Production/Standby DR

Client questions:
- How many data centers should I have? One or many?
- Where should they be located? Is close better than far?
- Should I reduce the cost of DR by using idle assets for other purposes?

Trade-offs: costs, risks, complexity
Gartner frequently gets questions from clients about data center strategies, such as: How many data centers? How are they used? What is the strategy for DR? Although there are no universally right answers to these questions (the right answer for your organization depends on your business and IT strategy), there are common themes across
large enterprises. Although data center consolidation has often been used to reduce cost through economies of
scale, consolidation across the oceans is fairly rare. This is due to network latency causing unacceptable
response time levels for worldwide applications. However, those that do operate a single application instance
worldwide achieve greater visibility across business units (such as the supply chain) and reduced overall costs of
operation and application integration.
As for the number of data centers: since the Sept. 11 tragedy, there has been a slight increase in the overall number of data centers, as many organizations seek protection from disasters. From a DR perspective, the trend is toward sub-24-hour recovery time objectives (RTOs), and often sub-four-hour ones, resulting in dedicated recovery environments, either internal or at outsourcers. Often, to reduce total cost of ownership, the recovery environment is shared with development/test or production load sharing, or through shared contracts with DR service providers. Furthermore, capacity-on-demand programs are popular for internally recovered mainframe environments.
Data Recovery Architectures: Redundant Everything

[Diagram: a geographic load balancer directs users to a production site and a secondary site. Each site has its own site load balancer and Web server, application server and database server clusters. The sites are kept synchronized through transaction replication, DB replication and remote copy; local protection includes disk PIT images, tape backup and LAN/PC tape backup.]
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
For application services with short RTO/recovery point objective (RPO) requirements, multi-site architectures
are used. Often, a new IT service or application is initially deployed with a single-site architecture and migrates
to multiple sites as its criticality grows. Multiple sites complicate architecture design (for example, load
balancing, database partitioning, database replication and site synchronization must be designed into the
architecture). For non-transaction processing applications, multiple sites often run concurrently, connecting users
to the closest or least-used site. To reduce complexity, most transaction processing (TP) applications replicate
data to an alternative site, but the alternative databases are idle unless a disaster occurs. A switch to the
alternative site can typically be accomplished in 15 minutes to 60 minutes. Some enterprises prefer to partition
databases and split the TP load between sites and consolidate data later for decision support and reporting. This
reduces the impact of a site outage, affecting only a portion of the user base. Others prefer more complex
architectures with bi-directional replication to maintain a single database image. All application services require
end-to-end data backup and offsite storage as a component of the DR strategy. Often, the DR architecture will
implement point-in-time replicas to enable synchronized backup and recovery. Application services with greater
than 24-hour RTO typically recover via tape in the alternative site.
Tactical Guideline: Zero transaction loss requires transaction mirroring and all the costs
associated with it.
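As a sketch of the switch logic implied above (idle alternative databases, activation on site failure), the loop below probes a primary site and declares failover after consecutive failed health checks. The URL, thresholds and print statements are placeholders; a real controller would activate the standby databases and update the geographic load balancer or DNS.

```python
# Minimal active/standby site-switch monitor (all endpoints hypothetical).
import time
import urllib.request

PRIMARY_HEALTH_URL = "http://primary.example.com/health"  # placeholder
FAILURES_BEFORE_SWITCH = 3    # consecutive failed probes before failover
PROBE_INTERVAL_SECONDS = 2    # kept short for demonstration

def primary_is_healthy() -> bool:
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:               # DNS failure, connection refused, timeout ...
        return False

def monitor() -> None:
    failures = 0
    while True:
        failures = 0 if primary_is_healthy() else failures + 1
        if failures >= FAILURES_BEFORE_SWITCH:
            # Here: activate idle replica DBs and repoint the geographic
            # load balancer. The notes above put the switch at 15-60 min.
            print("Primary site declared down; switching to secondary")
            break
        time.sleep(PROBE_INTERVAL_SECONDS)

monitor()
```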
Point-in-Time Copies: The Data Corruption Solution

Controller-based:
- EMC TimeFinder/Snap
- IBM FlashCopy
- HDS ShadowImage
- StorageTek SnapShot

Software-based:
- Oracle 10g Flashback
- BMC SQL BackTrack
- Veritas Storage Foundation
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
PIT copy solutions are a prerequisite to building Real Time Infrastructures (RTIs). Their penetration rate in
enterprise sites is at least three to five times greater than the penetration rate of remote copy solutions. There are
two key reasons for this disparity. First, PIT copies protect against a frequent source of downtime: data
corruption. Second, they shrink planned downtime for backups from hours to seconds or minutes and they
simplify other operational issues like check-pointing production workloads and application testing.
Software-based PIT copy technologies limit storage vendor lock-in and have the potential to leverage their closeness to the protected applications into greater efficiency and tighter application integration. With these advantages, however, comes the downside of potentially more complex software architectures (with many tools potentially implemented) and the need for additional testing. Storage- or controller-based solutions give up some intimacy with the applications to deliver a more platform- and application-agnostic solution, but at the cost of greater storage vendor lock-in.
In most situations, choosing between software- and controller-based solutions will be driven by prior investments, internal skills, application scale and complexity, and price.
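Most PIT copies, controller- or software-based, rest on copy-on-write: a snapshot is created instantly by sharing all blocks with the live volume, and a block's old contents are preserved only when it is first overwritten. The toy model below illustrates that semantic; it is not how any of the listed vendor products is actually implemented.

```python
# Toy copy-on-write snapshot illustrating PIT copy semantics.
class Volume:
    def __init__(self, nblocks: int):
        self.blocks = [b""] * nblocks
        self.snapshots = []   # each: {block index: preserved old contents}

    def snapshot(self) -> int:
        """Create a PIT copy in O(1): nothing is copied up front."""
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, idx: int, data: bytes) -> None:
        # Copy-on-write: save the old block for every snapshot that has
        # not yet preserved its own copy of this block.
        for snap in self.snapshots:
            snap.setdefault(idx, self.blocks[idx])
        self.blocks[idx] = data

    def read_snapshot(self, snap_id: int, idx: int) -> bytes:
        # Unchanged blocks are still shared with the live volume.
        return self.snapshots[snap_id].get(idx, self.blocks[idx])

vol = Volume(8)
vol.write(0, b"ledger-v1")
pit = vol.snapshot()            # instant point-in-time copy
vol.write(0, b"corrupted!")     # data corruption after the snapshot
assert vol.read_snapshot(pit, 0) == b"ledger-v1"   # pre-corruption data intact
```

This sharing is also why PIT copies shrink backup windows: the backup reads the frozen snapshot while production keeps writing.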
Strategic Planning Assumption: By 2008, more than 75 percent of all enterprises using SAN
storage will have deployed a PIT solution to meet service level objectives (0.7 probability).
Data Replication Alternatives

Application/Transaction Level
Pros:
- Architected for no-downtime, transparent data access
- Full or partial recovery scenarios
- Loosely coupled application components designed for integrity
- Supports heterogeneous disk
- DBAs understand and have confidence in the solution
Cons:
- Must be designed upfront; significant re-architecture when not designed upfront
- Requires applications/DB groups to be responsible for recovery
- Does not offer prepackaged solutions that provide integrity and consistency across applications

Product examples: IBM WebSphere MQ or other message-oriented middleware; Teradata Dual Active Warehouse or other built-in application-level replication; also fault-tolerant middleware or virtualization
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
The best method of achieving continuous 24x7 availability is building recovery right into the application or enterprise architecture. This way, enterprises architect transparent access to data even when some components are down; users are never, or very rarely, affected by downtime, even in a site failure. Typically, the architecture consists of asynchronous message-queuing middleware, but it may be implemented with fault-tolerant infrastructure middleware that replicates the transaction to redundant applications in another location. Application and database architects and database administrators (DBAs) have confidence in this type of solution because it is based on transactions, not on bits, blocks or files that lack application/transaction context. However, this type of architecture may take significant effort in the application design stages, and most enterprises do not task their application development organizations with recovery responsibilities. Furthermore, this method does not provide a prepackaged solution for conflict resolution or for consistency and integrity across applications during recovery. Rather, application developers and architects must assess methods to roll back or forward to a consistent point in time, and they may code consistency transactions into applications to enable this to happen. Because of these drawbacks, most enterprises use the infrastructure to enable recovery for the majority of their needs and reserve this method for the most critical subset of applications.
Decision Framework: Use application/transaction-level replication where 24x7 continuous
application availability (no downtime) is required and for new application projects.
Strategic Planning Assumption: Through 2007, application and transaction-level replication
will be used by less than 10 percent of large enterprises (0.8 probability).
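A minimal sketch of the queuing pattern described above: the application publishes each business transaction to a durable queue per site, and an applier at each site consumes and applies it independently. Python's standard `queue` and `threading` modules stand in here for message-oriented middleware such as WebSphere MQ; all names and the toy account schema are illustrative.

```python
# Application-level transaction replication via per-site message queues.
import queue
import threading

SITES = {"primary": queue.Queue(), "secondary": queue.Queue()}

def publish(txn: dict) -> None:
    """Fan each transaction out to every site's queue (asynchronous)."""
    for q in SITES.values():
        q.put(txn)

def applier(site: str, db: dict) -> None:
    """Consume transactions and apply them to this site's database."""
    while True:
        txn = SITES[site].get()
        if txn is None:                     # shutdown sentinel
            break
        db[txn["account"]] = db.get(txn["account"], 0) + txn["amount"]

primary_db, secondary_db = {}, {}
workers = [threading.Thread(target=applier, args=("primary", primary_db)),
           threading.Thread(target=applier, args=("secondary", secondary_db))]
for w in workers:
    w.start()

publish({"account": "A-100", "amount": 250})
publish({"account": "A-100", "amount": -75})
for q in SITES.values():
    q.put(None)
for w in workers:
    w.join()
assert primary_db == secondary_db == {"A-100": 175}   # both sites converge
```

The hard parts the notes call out, cross-application consistency and conflict resolution, live outside this happy path and are exactly what the application team must design.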
Additional Data Replication Alternatives

DBMS Log-Based Replication
Pros:
- Often included with DBMS
- Some enable read/write use of second copy and conflict resolution
- Hardware-independent; supports heterogeneous disk
- No special application design required
- Allows flexibility in recovery to a specific point in time other than last backup/point of failure
- DBAs understand and have confidence in the solution
- Generally low bandwidth requirement; lower network costs
- Can be used for short RPO with longer RTO, reducing software license costs
Cons:
- DBMS-specific solution
- More operational complexity than storage-controller-based replication
- Solution automation/integration varies
- Does not replicate configuration data in files
- Requires active duplication of server resources
- Requires DB logs and/or journaling; could impact production performance
- No assurance of cross-application data integrity
- Complex failback due to synchronization issues
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
Database log-based replication is a popular method for ensuring recoverability. Logs are read, with changes typically shipped asynchronously (some solutions offer synchronous replication, but it's rarely used), and can be applied continuously, on a delay or to a backup upon disaster (the choice depends on the RTO and RPO). As with transaction-level replication, DBAs and application architects understand and have confidence in the solution. Furthermore, many solutions allow read/write access to the second copy, so it is possible to create failover transparency in the solution (if replication is closely synchronized). However, care must be taken to avoid update conflicts. To minimize conflicts, most enterprises apply transactions at a primary location and only switch to the secondary when the application cannot access the primary database. A major downside of this solution is that replication is needed for every database; thus, labor/management costs increase. Furthermore, configuration data stored in file systems (rather than the database) is not replicated, and synchronization must be designed separately, typically through change control procedures. Moreover, cross-application recovery integrity must be built into the solution, for example, by writing synchronization transactions and rolling back or forward to achieve consistency. Despite the drawbacks, thousands of enterprises use database log-based replication to achieve short RTOs or RPOs.
Decision Framework: Consider replication at the database management system level to
provide short RPO and RTO for mission-critical applications, keeping in mind that data
integrity across applications must be designed into the applications and transactions.
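Schematically, log-based replication is a three-step pipeline: read committed changes from the database log, ship them (usually asynchronously), and replay them as SQL at the target. The sketch below shows only that shape, using SQLite as a stand-in target; the log-entry format and the optional apply delay are assumptions, not any vendor's design.

```python
# Schematic DBMS log shipping: replay shipped changes as SQL at the target.
import sqlite3
import time

def ship_and_apply(log_entries, target, apply_delay_seconds: float = 0.0):
    """Apply shipped log entries, in commit order, to the target DBMS.

    A nonzero delay mimics the 'apply on a delay' option noted above,
    leaving a window to catch corruption before it is replayed.
    """
    for entry in log_entries:
        if apply_delay_seconds:
            time.sleep(apply_delay_seconds)
        target.execute(entry["sql"], entry["params"])
    target.commit()

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")

# Changes captured from the primary's log, already converted to SQL.
shipped_log = [
    {"sql": "INSERT INTO accounts VALUES (?, ?)", "params": ("A-100", 250)},
    {"sql": "UPDATE accounts SET balance = ? WHERE id = ?", "params": (175, "A-100")},
]
ship_and_apply(shipped_log, target)
print(target.execute("SELECT * FROM accounts").fetchall())   # [('A-100', 175)]
```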
DBMS Log-Based/Journaling/Shadowing

Product | Strength | Weakness
Oracle Data Guard | Automation, function included | Failover resynchronization
DB2 UDB HADR for v.8.2 | Log apply automation; included in ESE | Failover automation; cannot read target
SQL Server Log Shipping | Function included | Automation, failover resynchronization
Quest SharePlex for Oracle | Bidirectional | Cost
GoldenGate Data Synchronization | Bidirectional, multi-DBMS support | Gaining ground outside NonStop
Lakeview, Vision, DataMirror | AS/400 | Gaining ground outside AS/400
ENET RRDF | z/OS | z/OS only
HP NonStop RDF | NonStop | NonStop only
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
Oracle Data Guard is a popular method for DR, as it is included with the Oracle license and, in 10g, has integrated automation for failover, failback and failback resynchronization, which enables changes made at the secondary site to be integrated back into the primary site database management system (DBMS), resulting in no lost transactions. Data Guard offers two methods of replication: shipping of archive logs (after commitment), which could mean an RPO of 15 to 30 or more minutes, or shipping of redo log changes, which can be implemented in synchronous mode for zero data loss or in the more commonly implemented asynchronous mode. DB2 log shipping is included and offers asynchronous replication but does not have built-in automation.
In DB2 UDB v.8.2, IBM added HADR (included with Enterprise Server Edition only), which automates
shipping, receiving and applying logs (not failover). HADR does not enable users to read the target DBMS; for
this, you must implement the more-complex DB2 replication. SQL Server also lacks failover and failback
automation. Quest SharePlex, a popular tool for Oracle replication, provides close synchronization (a few
seconds to a minute) and bidirectional support. GoldenGate offers similar technology for multiple DBMS
platforms. HP NonStop has strong replication functionality for its DBMS. Suppliers of AS/400 replication
technology are Lakeview Technology, Vision Solutions and DataMirror. On the mainframe, ENET RRDF is
often deployed in environments with short RPO and longer RTO, so that the changed data is maintained in an
alternative site but not applied until disaster (or in tests).
Decision Framework: Most relational DBMS products include log-based replication, but the
degree of synchronization (speed) and automation varies considerably. Some third-party tools
offer multi-DBMS replication support, as well as integrated automation and synchronization.
Data Replication Alternatives Pros/Cons

Storage Controller Unit-Based
Pros:
- Infrastructure-based solution requires less effort from application groups
- Platform and data type independence (including mainframe)
- Single solution for all applications/DBMSs
- Operational simplicity
- Most solutions assure, but do not guarantee, data integrity across servers and logical disk volumes
- Minimal host resource consumption
Cons:
- Data copies not available for read/write access
- Short recovery time, but user work is interrupted
- Storage hardware and software dependent
- Less storage vendor pricing leverage
- Failover is typically not packaged and requires scripting
- High connectivity costs
- Monitoring/control must be built
- Lack of customer and vendor understanding of procedures and technology to assure data integrity; taken for granted that it works
- Homogeneous SAN required
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
Storage controller unit-based solutions are popular for enterprises seeking to build recoverability into the
infrastructure and use the same solution for all applications and databases. Software on the disk array captures
block-level writes and transmits them to a disk array in another location. Because many servers (including
mainframes) can be attached to a single disk array, there are fewer replication sessions to manage, thus greatly
reducing complexity. Solutions generally ensure write-order integrity within each array, and some also provide a method for data integrity of the copy across arrays. These solutions started out synchronous and therefore have been used mostly over close proximity (under 50 miles). Synchronous solutions are extremely popular in financial services industries, where RPOs are set to no loss of work. However, asynchronous solutions are slowly gaining ground. The major drawback of storage controller-based solutions is that recovery cannot be transparent to the applications, because control of the target copy is maintained by the primary site; this is only an issue for applications requiring 24x7 availability. Many enterprises use storage controller-based solutions and
use transaction-level or DBMS-level replication for those few applications requiring more stringent availability.
Another drawback is lock-in to storage hardware and software.
Decision Framework: Consider storage controller unit-based replication to achieve short
RPO/RTO, where enterprises desire to move to a single solution to address many
applications/data sources in the enterprise.
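The property a storage-controller replica must preserve is write order: at any instant the secondary should equal some past state of the primary, even if the link reorders packets. The toy model below enforces that with per-write sequence numbers; real products coordinate this across arrays with consistency groups, which this sketch deliberately omits.

```python
# Toy asynchronous block replication preserving source write order.
class ReplicaArray:
    def __init__(self):
        self.blocks = {}        # block index -> contents
        self.expected_seq = 0   # next write we may apply
        self.pending = {}       # out-of-order arrivals, held back

    def receive(self, seq: int, idx: int, data: bytes) -> None:
        """Apply writes strictly in source order, buffering any gaps."""
        self.pending[seq] = (idx, data)
        while self.expected_seq in self.pending:
            blk, contents = self.pending.pop(self.expected_seq)
            self.blocks[blk] = contents
            self.expected_seq += 1

primary_writes = [(0, 7, b"old"), (1, 7, b"new"), (2, 3, b"x")]
replica = ReplicaArray()
# The network delivers writes out of order; the replica still applies
# them in write order, so it always matches a past primary state.
for seq, idx, data in (primary_writes[1], primary_writes[2], primary_writes[0]):
    replica.receive(seq, idx, data)
assert replica.blocks[7] == b"new" and replica.blocks[3] == b"x"
```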
Storage Controller Unit-Based Synchronous

Product | Strength | Weakness
EMC SRDF | Strong market leader; cross-platform MF and distributed; m:n consistency groups | Price, but getting more competitive
IBM ESS Metro Mirror (formerly PPRC) | Price competitive; consistency groups across mainframe and distributed | Late entry to market in early 2004
Hitachi TrueCopy | Consistency groups; 4:1 MF consistency; small but loyal base; Sun reseller | Consistency groups 1:1 for distributed operating system environments
HP Continuous Access XP | Small but loyal base; MC/SG integration | Nearly exclusive to HP-UX
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
EMC SRDF was first to market with an array-based replication solution in the mid-1990s and, as a result, is the clear market leader. In addition, EMC offers consistency groups across multiple arrays, so that when problems occur, all replication is halted to provide greater assurance that the secondary site has data integrity. Furthermore, it is the only solution that supports distributed servers and mainframes on the same array.
In the late 1990s, when SRDF had little competition, users often complained about pricing; however, with additional market entrants, SRDF is being priced more competitively. IBM's Enterprise Storage Server (ESS) Metro Mirror (formerly Peer-to-Peer Remote Copy, or PPRC) is price-competitive. It supports consistency groups across mainframe and distributed environments, and now supports Geographically Dispersed Parallel Sysplex (GDPS). Hitachi's TrueCopy supports consistency groups via time-stamping for mainframe and distributed operating system platforms. However, consistency groups are more functional on mainframe platforms, where they can span four arrays vs. one in the distributed environment. HP licenses Hitachi's TrueCopy and adds value to it for its solution. It can support multiple server operating system platforms, but most customers use it almost exclusively for HP-UX systems.
Decision Framework: Consider synchronous, storage controller unit replication where the two
facilities are less than 50 to 60 miles apart, so that network latency does not affect the
performance of applications.
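The 50-to-60-mile guidance falls out of propagation delay: a synchronous write is not acknowledged until the remote array confirms it, so every write pays at least one round trip. A back-of-the-envelope calculation (the fiber speed is a standard approximation; the route lengths and overhead figure are assumptions):

```python
# Back-of-the-envelope added write latency for synchronous replication.
# Light in fiber covers roughly 200 km per millisecond (about 2/3 c).
FIBER_KM_PER_MS = 200.0

def sync_write_penalty_ms(route_km: float, overhead_ms: float = 0.3) -> float:
    """Minimum added latency per write: one round trip plus equipment overhead."""
    return 2 * route_km / FIBER_KM_PER_MS + overhead_ms

for miles in (50, 100, 500):
    km = miles * 1.609
    print(f"{miles:>4} miles: +{sync_write_penalty_ms(km):.2f} ms per write")
# ~50 miles adds about a millisecond per write; at hundreds of miles the
# per-write stall throttles a busy OLTP log writer, hence asynchronous modes.
```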
Storage Controller Unit-Based Asynchronous

Product | Strength | Weakness
Hitachi TrueCopy Async | Market leader; Sun reseller | Mostly mainframe installed base
Hitachi Universal Replicator | Journal-based; pull technology; Sun reseller | 1:1 consistency groups; new to market
EMC SRDF/A | m:n consistency groups (new) | Fairly new to market; few production installs
EMC/Hitachi/IBM XRC (controller- and host-based replication) | Supported by multiple storage vendors | Mainframe only
NetApp SnapMirror | Proven market leader | Historically a midrange player
IBM ESS Global Mirror | HACMP integration; 8:1 consistency groups MF and distributed | New to market; does not support GDPS (planned YE04)
HP Continuous Access XP Extension | Integrated with MC/ServiceGuard | Nearly exclusive to HP-UX
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
Compared with synchronous storage controller unit replication, asynchronous storage controller unit replication is relatively new, making its debut with Hitachi TrueCopy in the late 1990s. Although Hitachi is the market leader in asynchronous storage controller unit-based replication, its installed base and market share pale in comparison with synchronous replication. However, for many enterprises with recovery sites more than 50 to 60 miles apart, asynchronous replication alternatives have reduced the complexity of the recovery environment, because they could migrate to asynchronous replication from synchronous multihop architectures. A synchronous multihop architecture is one where, due to the greater distance between facilities, a local copy is taken, then split and replicated in asynchronous mode to the secondary site. In this architecture, four to six copies of the data are required vs. the two copies required otherwise. EMC's SRDF has many multihop installations, and many clients are testing the new SRDF/A to assess whether they can migrate from their multihop architectures to a single hop with SRDF/A. In April 2004, IBM announced its first asynchronous solution for the ESS, branding it Global Mirror rather than PPRC. Hitachi released its new Universal Replicator in September 2004, enabling more replication flexibility and the promise of future heterogeneity.
Decision Framework: Asynchronous, storage controller unit replication should be considered
where the two facilities are beyond synchronous replication distances (more than 50 to 60
miles apart).
Data Replication Alternatives Pros/Cons

File-Based
Pros:
- Storage hardware independent
- One solution for all applications/data on a server
- Failover/failback may be integrated and automated
- Read access for second copy, supporting horizontal scaling
- Low cost for software
Cons:
- File system dependent
- More operational complexity than storage-controller-based replication
- Application synchronization must be designed in the application

Volume Manager-Based
Pros:
- Storage hardware independent
- One solution for all applications/data on a server
- Failover/failback may be integrated and automated
Cons:
- Volume manager dependent
- Data copies not available for read/write access
- Application synchronization must be designed in the application
- More operational complexity than storage-controller-based replication
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
File-based replication is a single-server solution that captures file writes and transmits them to an alternative
location. The major benefits are: 1) it does not require storage area network (SAN) storage, and 2) the files can
be used for read-access at the alternative location. File-based replication is most popular in the Windows
environment where SAN storage is not as prevalent, especially for critical applications, such as Exchange. A
drawback to this type of solution is that it is server-based; therefore, management complexity rises as compared
with storage controller unit replication.
Volume manager-based replication is similar to storage controller unit-based replication in that it replicates at
the block level and the target copy cannot be accessed for read/write. It requires a replication session for each
server and, therefore, has high management complexity. However, no SAN storage is required and it supports all
types of disk storage solutions. Both of these solutions are used for one-off applications/servers where
recoverability is critical. Furthermore, both solutions tend to offer integrated and automated failover/failback
functionality.
Decision Framework: Consider file-level replication to provide short RPO and RTO for Windows-based applications. Consider volume manager-based replication for applications requiring short RTO/RPO where heterogeneous disk is implemented.
File and Volume/Host-Based Replication

Product | Strength | Weakness
NSI DoubleTake | Market leader; integrated automation | Windows only
Legato RepliStor | EMC; integrated automation | Lack of focus
XOsoft WANSync | No planned downtime required; integrated automation | New to market
Veritas Volume Replicator | Market leader; integrated with commonly used volume manager; multiplatform; VCS integration | Price; requires VxVM
IBM Geo Remote Mirror | Integrated with AIX volume manager | AIX only
Softek Replicator | Multiplatform | Low market penetration
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
In file-based replication, NSI DoubleTake was the market leader in 2003, with an estimated $19.4 million in new license revenue. NSI primarily sells through indirect channels (such as Dell and HP) to midmarket and enterprise clients. Many use DoubleTake for Exchange and file/print. In the mid-to-late 1990s, Legato had significant file-based replication market share with its RepliStor product (then called Octopus), but it narrowed its focus (and thus its market share) and has been broadening it again since EMC's acquisition of Legato. RepliStor provides EMC with a solution for enterprises that do not have, or do not want, heterogeneous disk. A newcomer on the market, XOsoft differentiates itself on scheduled uptime: no planned downtime is necessary to implement replication, so one common use is disk migrations. Veritas is the leader in volume manager-based replication; its replicator has the same look and feel as its popular volume manager product, VxVM. It is also integrated with VCS, where the DR option provides long-distance replication with failover. Veritas improves manageability of multiple, heterogeneous replication sessions and geographic clusters with CommandCentral Availability, previously called Global Cluster Manager. Softek also offers a multiplatform volume manager-based solution, but it has low market penetration; formerly called TDMF Open, it has been rebranded Replicator. IBM offers a volume manager-based solution for AIX called Geo Remote Mirror.
Decision Framework: Consider file-based replication for critical Windows-based applications
and volume-based replication for critical applications where a heterogeneous disk is
deployed.
Other Recovery Technologies

- Emerging network-based replication: Topio Data Protection Suite, Kashya KBX4000, FalconStor IPStor Mirroring, DataCore SAN Symphony Remote Mirroring, StoreAge multiMirror, IBM SAN Volume Controller
- Point-in-time copies or snapshots to quickly recover from data corruption: EMC TimeFinder/Snap, IBM FlashCopy, HDS ShadowImage, Oracle 10g Flashback, BMC SQL BackTrack, Imceda SQL Lightspeed, StorageTek SnapShot, Veritas Storage Foundation
- Wide-area clusters for automated recovery: HP Continental Cluster, IBM Geographically Dispersed Parallel Sysplex, Veritas Cluster Server Global Cluster Option
- Stretching local clusters across a campus to increase return on investment: HP MC/ServiceGuard, IBM HACMP, Microsoft Clustering, Oracle RAC, SunCluster, Veritas Cluster Server
- Capacity on demand/emergency backup for in-house recovery; becoming mainstream on S/390 and zSeries mainframes
- Speed server recovery with server provisioning and configuration management tools
Client Issue: What technologies will be critical for data replication architectures and what are
their trade-offs?
There are many other recovery technologies that may be used in disaster recovery architectures. A relatively new
set of network-based replication products (sometimes called virtualization controllers) moves the software from
the storage array controller into a separate array controller sitting in the storage fabric. This group of suppliers
hopes it can change the game and be successful at chipping away at storage controller unit-based replication
market share. They offer similar benefits, in addition to heterogeneous disk support. Clustering, whether local, campus or wide-area, offers automation for failover and failback, speeding recovery time and reducing manual errors. Stretch clustering, where a local cluster is stretched across buildings or campuses using the same architecture as local clustering, is becoming more popular as a way to use already purchased redundancy to achieve some degree of disaster recovery (with a single point of failure for the data and networks). Servers configured with capacity on demand enable preloaded but idle CPUs and memory to be turned on at the recovery site for disaster recovery testing and in the event of disasters; this reduces the overall cost of dedicated hardware for disaster recovery. Finally, many enterprises are implementing standard server images (or scripted installation routines) and using these templates (on disk) to restore servers and applications. This is significantly faster than restoring a server from tape, can restore many servers in parallel, and significantly reduces manual effort.
Strategic Imperative: Managing the diversity of the infrastructure will reduce complexity and improve recoverability and the ability to automate the process.
Best Practices in Disaster Recovery

- Consider DR requirements in the new project design phase and annually thereafter
- Testing, testing, testing:
  - End-to-end test where possible; partial where not
  - Tabletop tests can be advantageous to assess capabilities to address scenarios, as well as procedures
  - Fast follow-up/response to test findings
- Incident/problem/crisis process:
  - Where an IT incident could result in invocation of the DR plan, leverage the problem management process, which should already be in place
  - Damage assessment: must weigh the costs of failing over to the alternate location vs. the time to recover in the primary location
- Use automation to reduce errors in failover/failback
- Use the same automation for planned downtime (which results in frequent testing)
The most important parts of disaster recovery management are: 1) considering DR requirements during the new project design phase, to match an appropriate solution to business requirements rather than retrofitting one at a higher cost, and 2) testing: it is only through testing that an enterprise can be confident about its plan, and testing improves the plan by refining procedures and process. As much as possible, tests should be end-to-end in nature and include business process owners as well as external partners (for example, those that integrate with enterprise systems). When an end-to-end test is not possible, partial tests should be done, with tabletop walkthroughs to talk through the other components of the tests. Through frequent testing, participants become comfortable with solving many kinds of problems; in a way, they become more agile, so that whatever the disaster, people are likely to react in a positive way to recover the enterprise without falling into a chaotic state (which would threaten recoverability). Moreover, for IT disasters, enterprises should leverage their incident and problem management processes and pull in the DR team during the assessment process.
Another best practice is using automation as much as possible, not only to avoid human error during times of crisis, but also to enable other employees who may be implementing the plan to proceed with recovery, even if members of the primary recovery team are unavailable. By using the automation during planned downtime periods, testing becomes part of standard production operations.
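One way to realize the last practice is to give planned switchover and disaster failover a single code path that differs only in whether the primary is drained first; every planned migration then exercises the DR procedure. A hypothetical runbook skeleton (all step functions are placeholders):

```python
# Hypothetical runbook: planned switchover and disaster failover share
# one code path, so routine maintenance doubles as a DR test.

def quiesce_primary() -> None:
    print("draining connections, flushing final log records")

def stop_replication() -> None:
    print("stopping the replication stream")

def activate_secondary() -> None:
    print("activating standby databases at the secondary site")

def repoint_users() -> None:
    print("updating load balancer/DNS entries to the secondary site")

def switch_site(planned: bool) -> None:
    if planned:
        quiesce_primary()   # only possible while the primary still exists
    stop_replication()
    activate_secondary()
    repoint_users()

switch_site(planned=True)    # maintenance window: also a live DR rehearsal
switch_site(planned=False)   # disaster: the same steps, minus the drain
```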
Client Issue: What architectures and best practices will enable enterprises to achieve 24x7
availability and disaster protection as required by the business?
Case Study: DBMS Log-Based Replication Provides RTO Under One Hour

[Diagram: The primary production site runs the production DBMS, a standby DBMS (to mitigate the risk of data corruption), a month-end DBMS for reporting, test DBMSs and a local failover server (HP-UX, MC/Serviceguard). Quest SharePlex captures DBMS changes from the Oracle redo logs. At the secondary production site, SQL is applied continuously to the disaster recovery DBMS and DR test DBMSs. In the event of disaster, replication is reversed (failover/failback).]

- Async replication: RPO = 0 to 15 seconds
- RTO less than one hour
- Disaster process tested once per quarter
- Architecture minimizes planned downtime for migrations/upgrades
Client Issue: What architectures and best practices will enable enterprises to achieve 24x7
availability and disaster protection as required by the business?
A financial services company processes transaction data with a packaged application based on Oracle RDBMS.
Database access comes from internal (such as loan officers) and external customers (such as automated teller
machines), with some 300,000 transactions per day. To ensure data availability/recovery, the company deployed
Quest SharePlex to replicate its 500GB Oracle DBMS: 1) locally to mitigate data corruption risks in the
production database and provide a reporting database and 2) remotely (500 miles) as part of its DR plan.
SharePlex captures the changes to the DBMS (from the Oracle redo logs) and transmits them to local and remote
hosts. Changes are then applied continuously (by converting them to SQL and applying them to the target
DBMSs). SharePlex keeps primary and target DBMSs synchronized, and the company maintains a maximum of
15 seconds RPO. In a site disaster, the target is activated as the primary: any unposted changes are posted, and the active database is updated in the application middleware. The failover process, once initiated, takes approximately one hour. Once the remote site is processing transactions, the replication process is reversed back to the primary data center. Although the remote site is missing some transactions (under 15 seconds' worth, per the RPO), they are not lost; when failback occurs, 100 percent of the transactions will have been accounted for, with zero data loss. The company uses the same architecture to minimize downtime for migrations (Oracle 8i to 9i; Tru64 to HP-UX).
Case Study: A large, regional financial services company uses DBMS-based replication to
build a DR architecture with under 15-second RPO, no data loss upon failback and one-hour
RTO.
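The zero-loss failback in this case study depends on reversing the replication stream: the under-15-seconds of changes that had not been shipped at disaster time survive in the old primary's logs and are reconciled when it rejoins. A deliberately simplified accounting sketch of that claim (no SharePlex internals are modeled, and real reconciliation must respect transaction ordering):

```python
# Simplified accounting of failover plus reversed-replication failback.
primary_log = ["t1", "t2", "t3", "t4"]    # committed at the primary
shipped = primary_log[:3]                 # replica ran ~15 seconds behind

# Disaster: the secondary activates with what it has, then takes new work.
secondary_log = list(shipped) + ["t5", "t6"]

# Failback: replication is reversed; the unshipped tail ("t4") is read
# from the old primary's logs and merged with work done at the secondary.
unshipped_tail = [t for t in primary_log if t not in shipped]
restored_primary = secondary_log + unshipped_tail

assert set(restored_primary) == {"t1", "t2", "t3", "t4", "t5", "t6"}
print("all transactions accounted for after failback")
```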
Recommendations
Mature your disaster recovery management processes so they are integrated with business and IT processes and meet changing business requirements.
Infuse a continuous improvement culture.
Plan for disaster recovery and availability requirements during the design phase of new
projects and annually re-assess for production systems.
Test, test and test more.
Use automation to reduce complexity and errors associated with failover/failback.
Select the replication methods that match business requirements for RPO and RTO. If a
single infrastructure-based solution is desirable, consider storage controller-based
replication. If 24x7 continuous availability is required, consider application, transaction or
database-level replication.
These materials can be reproduced only with Gartner's written approval. Such approvals must be requested via e-mail: quote.requests@gartner.com.
