Abstract
This white paper describes EMC® VPLEX™ features and functionality
relevant to Oracle Real Application Cluster (RAC) and Database. The best
practices for configuring an Oracle Extended RAC environment to optimally
leverage EMC VPLEX are also presented.
September 2011
Copyright © 2011 EMC Corporation. All Rights Reserved.
For the most up-to-date listing of EMC product names, see EMC
Corporation Trademarks on EMC.com.
Audience
This white paper is intended for Oracle database administrators, storage
administrators, and IT architects responsible for architecting, creating, managing, and
using IT environments that focus on high availability with Oracle databases, VPLEX
technologies, and Symmetrix VMAX storage. The white paper assumes readers are
somewhat familiar with Oracle RAC and Oracle database technology, as well as with
EMC VPLEX and the Symmetrix storage array.
Introduction
Oracle Extended RAC provides a way to scale out performance, utilize storage and
server resources at multiple sites, and increase resiliency to failure scenarios or
maintenance operations without application downtime, allowing organizations to
eliminate database downtime and continue business processing uninterrupted, even
in the case of complete site failures.
It should be noted that while Oracle Extended RAC provides high availability for a
single database, it is still a good practice to deploy a Disaster Recovery (DR) solution.
EMC offers VPLEX in three configurations to address customer needs for high
availability and data mobility, as seen in Figure 2:
• VPLEX Local
• VPLEX Metro
• VPLEX Geo
VPLEX Local
VPLEX Local provides seamless, non-disruptive data mobility and the ability to manage
multiple heterogeneous arrays from a single interface within a data center.
VPLEX Metro
VPLEX Metro with AccessAnywhere enables active-active, block-level access to data
between two sites within synchronous distances of up to 5 ms round-trip time (RTT).
Following are two examples of using VPLEX Metro and Oracle for data mobility and
high availability.
Application and Data Mobility: By itself, the hypervisor has the ability to move VMs
between physical servers without application downtime. When combined with server
virtualization, VPLEX distributed volumes allow users to transparently move those VMs
and their applications between data centers without downtime.
VPLEX Geo
VPLEX Geo with AccessAnywhere enables active-active, block-level access to data
between two sites within asynchronous distances. VPLEX Geo enables more cost-
effective use of resources and power. VPLEX Geo provides the same distributed
device flexibility as Metro but extends the distance up to and within 50 ms RTT. As
with any asynchronous transport medium, bandwidth is important to consider for
optimal behavior, as is the application access method to distributed devices over
long distances. An example of a VPLEX Geo use case is when application 1 addresses
VPLEX distributed volumes in consistency group 1 and performs reads/writes only at
site 1, while application 2 addresses VPLEX distributed volumes in consistency group 2
and performs reads/writes only at site 2. VPLEX Geo maintains write-order fidelity for
each application at the remote site, although the remote image is slightly behind in
time. In this example both applications can fail over and back independently, over
long distances.
The VPLEX family uses a unique clustering architecture to help customers break the
boundaries of the data center and allow servers at multiple data centers to have
read/write access to shared block storage devices. VPLEX Local includes a single
cluster while VPLEX Metro and Geo each include two. A VPLEX cluster consists of one,
two, or four engines as seen in Table 1. Each VPLEX engine provides SAN/WAN
connectivity, cache and processing power with two redundant directors.
Table 1. VPLEX hardware components
Feature Description
VPLEX Local uses write-through caching: writes pass directly to the storage behind
the VPLEX volumes and are acknowledged there before being acknowledged back to
the host. With EMC storage such as Symmetrix and VNX™, where writes only need to
register with the storage persistent cache, application write response time is optimal.
VPLEX Metro also uses write-through caching but acknowledges writes to the
application only once they have been registered with both the local and remote
storage. On the other hand, writes to VPLEX Geo are cached at the VPLEX layer and
sent to the remote cluster in deltas that preserve write-order fidelity. In all three
VPLEX deployments (Local, Metro, and Geo) reads can benefit from the VPLEX cache,
and in VPLEX Metro and Geo, read hits are served from the local VPLEX cluster cache.
1. The details in this section are based on VPLEX release 5.0.1 and may be different in other releases. The VPLEX product guide provides exact version details.
Consistency groups
Starting with GeoSynchrony 5.0 for VPLEX Metro and GeoSynchrony 5.0.1 for VPLEX
Geo, consistency groups are used to organize virtual volumes. Consistency groups
aggregate volumes together to provide a common set of properties to the entire
group. In addition, you can move a consistency group from one cluster to another if
required. Consistency groups are particularly important for databases and
applications. All database LUNs (for example, Oracle data, control and log files)
require preserving write-order fidelity to maintain data integrity, and therefore should
always be placed together in a single consistency group. Often multiple databases
have transaction dependencies, such as when database links are used to connect
databases, or when the application issues transactions to multiple databases and
expects them to be consistent with each other. In these cases the consistency group
should include all LUNs that require preservation of I/O dependency (write-order fidelity).
Detach rules are predefined rules that determine consistency group I/O processing
semantics whenever connectivity with the remote cluster is lost (for example, network
partitioning or remote cluster failure). In these situations, until communication is
restored, most workloads require specific sets of virtual volumes to resume I/O on
one cluster and suspend I/O on the other.
In a VPLEX Metro configuration, detach rules can designate a static preferred cluster
with one of the following settings2: winner:cluster-1, winner:cluster-2, or No Automatic
Winner (where the last one specifies no preferred cluster). When the system is
deployed without VPLEX Witness (discussed in the next section), I/O to consistency
group devices proceeds on the preferred cluster and is suspended on the non-preferred
cluster.
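As a hedged illustration of the corresponding CLI syntax, a static winner can be designated for a consistency group with a command along these lines; the group name oracle_rac_cg and the 5-second detach delay are illustrative assumptions, not values taken from this paper:
VPlexcli:/> consistency-group set-detach-rule winner --cluster cluster-1 --delay 5s --consistency-groups oracle_rac_cg
The no-automatic-winner form of the same command corresponds to the No Automatic Winner setting described above.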
VPLEX Witness
VPLEX Witness, introduced with GeoSynchrony 5.0, is an optional external server that
is installed as a virtual machine. VPLEX Witness connects to both VPLEX clusters over
the management IP network. By reconciling its own observations with the information
reported periodically by the clusters, VPLEX Witness enables the clusters to
distinguish between inter-cluster network partition failures and cluster failures, and to
automatically resume I/O at the appropriate site in these situations. With
GeoSynchrony 5.0, VPLEX Witness affects only synchronous consistency groups in a
VPLEX Metro configuration, and only when the detach rules designate either cluster-1 or
cluster-2 as preferred for the consistency group (that is, VPLEX Witness does not
affect consistency groups where the No Automatic Winner rule is in effect).
Without VPLEX Witness, if the two VPLEX clusters lose contact, the consistency group
detach rules in effect define which cluster continues operation and which suspends
I/O, as explained earlier. Using detach rules alone to control which site is the winner
may add unnecessary complexity in the case of a site failure, since it may be
necessary to intervene manually to resume I/O at the surviving site. VPLEX Witness
handles such an event dynamically and automatically. It provides the following
features:
• Automatic load balancing between data centers
• Active/active use of both data centers
• Fully automatic failure handling
2. Based on management GUI options. The CLI uses slightly different terms to specify the same rules.
Cache vault
To avoid metadata loss or data loss (when write-back caching is used) under
emergency conditions, VPLEX uses a mechanism called cache vaulting to safeguard
cache information to persistent local storage.
Symmetrix VMAX
Symmetrix VMAX is built on the strategy of simple, intelligent, modular storage, and
incorporates a new Virtual Matrix™ interconnect that connects and shares resources
across all nodes, allowing the storage array to seamlessly grow from an entry-level
configuration into the world’s largest storage system. It provides the highest levels of
performance and availability featuring new hardware capabilities as seen in Figure 6.
• 2–16 director boards
• Up to 2.1 PB usable capacity
• Up to 128 FC FE ports
• Up to 64 FICON FE ports
• Up to 64 GigE/iSCSI FE ports
• Up to 1 TB global memory (512 GB usable)
• 48–2,400 disk drives
• Enterprise Flash Drives 200/400 GB
• FC drives 146/300/450/600 GB 15k rpm (or 400 GB 10k rpm)
• SATA II drives 2 TB 7.2k rpm
Figure 6. The Symmetrix VMAX platform
Symmetrix VMAX provides the ultimate scale-out platform. It includes the ability to
incrementally scale front-end and back-end performance by adding processing
modules (nodes) and storage bays. Each processing module provides additional
front-end, memory, and back-end connectivity.
Symmetrix VMAX also increases the maximum hyper size to 240 GB (64 GB on
Symmetrix DMX™). This allows ease of storage planning and device allocation,
especially when using Virtual Provisioning™ where the thin storage pool is already
striped and large hypers can be easily used.
The EMC TimeFinder® family of local replication technology allows for creating
multiple, nondisruptive, read/writeable storage-based replicas of database and
application data. It satisfies a broad range of customers' data replication needs with
speed, scalability, efficient storage utilization, and minimal to no impact on the
production application.
Virtual Provisioning
Symmetrix thin devices are logical devices that you can use in many of the same ways
that Symmetrix devices have traditionally been used. Unlike traditional Symmetrix
devices, thin devices do not need to have physical storage preallocated at the time
the device is created and presented to a host (although in some cases customers
interested only in the thin pool wide striping and ease of management choose to fully
preallocate the thin devices). You cannot use a thin device until it has been bound to
a thin pool. Multiple thin devices may be bound to any given thin pool. The thin pool
is comprised of devices called data devices that provide the actual physical storage
to support the thin device allocations. Table 2 describes the basic Virtual Provisioning
definitions.
Table 2. Definitions of Virtual Provisioning devices
Thin device: A host-accessible device that has no storage directly associated with it.
Data devices: Internal devices that, when placed in a thin pool, provide the storage capacity used by thin devices.
Thin pool: A collection of data devices that provide storage capacity for thin devices.
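To illustrate the binding step described above, the following Solutions Enabler sketch binds a range of thin devices to a thin pool and then displays the pool; the Symmetrix ID, device range, and pool name are illustrative assumptions drawn loosely from the configuration described later in this paper:
symconfigure -sid 191 -cmd "bind dev 1A5:1B4 to pool Data_Pool;" commit -nop
symcfg -sid 191 show -pool Data_Pool -thin -detail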
3. Starting with Oracle 11gR2, the number of CRS voting disks is determined automatically by the ASM redundancy level: External redundancy implies 1 voting disk, Normal redundancy implies 3 voting disks, and High redundancy implies 5 voting disks.
No other timeout value changes (such as the CSS disktimeout) are necessary.
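For reference, the voting disk configuration and the current CSS disktimeout value can be confirmed from any cluster node with the Clusterware crsctl utility, as in this brief sketch:
crsctl query css votedisk
crsctl get css disktimeout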
Additional notes
On x86-based server platforms, ensure that partitions are aligned. VPLEX requires
alignment at a 4 KB offset; however, if Symmetrix is used, align at a 64 KB (128-block)
offset, which is naturally aligned at a 4 KB boundary as well.
On Windows, diskpar or diskpart can be used. On Linux, fdisk or parted can be used.
An example of aligning a partition to a 64 KB offset using fdisk is shown later in the
section: Create partitions on PowerPath devices.
Oracle Extended RAC and VPLEX Metro protection from unplanned downtime
The combination of Oracle Extended RAC and VPLEX Metro provides improved
availability and resiliency to many failure conditions, and therefore increases the
availability of mission-critical databases and applications.
Table 3 summarizes the failure scenarios and the best practices that allow the
database to continue operating in each of them. Note that the list does not cover
failover to a standby system (such as Oracle Data Guard, RecoverPoint, SRDF, and so
on). The EMC VPLEX with GeoSynchrony 5.0 Product Guide provides more information
regarding VPLEX connectivity best practices.
Failure: Host hardware failure or crash
• Oracle database single server (non-RAC): Downtime implied until host and application can resume operations.
• Oracle RAC (not extended): Oracle RAC provides database resiliency for a single server failure by performing automatic instance recovery and having other cluster nodes ready for user connections. Configure Oracle Transparent Application Failover to allow sessions to automatically fail over to a surviving cluster node.
• Oracle Extended RAC with VPLEX Metro: Same as Oracle RAC.
Failure: Lab/building/site failure
• Oracle database single server (non-RAC): Downtime implied until host and application can resume operations.
• Oracle RAC (not extended): Downtime implied until host and application can resume operations.
• Oracle Extended RAC with VPLEX Metro: By installing the VPLEX clusters and VPLEX Witness in independent failure domains (such as another building or site), the configuration becomes resilient to lab, building, or site failures. The VPLEX cluster in the failure domain not affected by the disaster will continue to serve I/Os to the application.
Failure: Loss of connectivity to storage array
• Oracle database single server (non-RAC): Downtime implied until storage array connectivity can be resumed.
• Oracle RAC (not extended): Downtime implied until storage array connectivity can be resumed.
• Oracle Extended RAC with VPLEX Metro: The VPLEX Metro synchronous consistency group continues to serve I/Os at both sites, even if one of the storage arrays is not available. Oracle Clusterware would not know about the storage unavailability as the VPLEX cluster continues to service all I/Os.
Failure: VPLEX cluster unavailable
• Oracle database single server (non-RAC): Downtime implied until VPLEX cluster connectivity can be resumed.
• Oracle RAC (not extended): Downtime implied until VPLEX cluster connectivity can be resumed.
• Oracle Extended RAC with VPLEX Metro: VPLEX Witness will allow I/O to resume at the surviving VPLEX cluster. RAC cluster nodes connected to that VPLEX cluster will resume operations without downtime. Use Oracle Transparent Application Failover (TAF) to allow automatic user connection failover to the surviving cluster nodes.
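Because Table 3 recommends Oracle Transparent Application Failover in several scenarios, a minimal sketch of a client-side tnsnames.ora entry with TAF enabled is shown below; the host name and the retry and delay values are illustrative assumptions rather than values from the tested configuration:
# Illustrative TAF entry; the SCAN host name, RETRIES, and DELAY are assumptions
ERPFINDB =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = rac-scan.example.com)(PORT = 1521))
    (CONNECT_DATA =
      (SERVICE_NAME = ERPFINDB)
      (FAILOVER_MODE = (TYPE = SELECT)(METHOD = BASIC)(RETRIES = 30)(DELAY = 5))
    )
  )
With TYPE=SELECT, in-flight queries are resumed on a surviving instance after a failover, while METHOD=BASIC establishes the backup connection only at failover time.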
Software and release:
• Server OS: Oracle Enterprise Linux Release 5 Update 4, x86_64
• EMC PowerPath: Version 5.5 for Linux x86_64
• Oracle: Oracle Clusterware 11g R2 (11.2.0.2) and Oracle Database 11g R2 (11.2.0.2) for Linux x86-64
Database configuration:
• Database name: ERPFINDB; size: 1 TB; number of LUNs: 38
• +DATA ASM disk group: 25 x 60 GB thin LUNs (1A5:1B4)
• +TEMP ASM disk group: 6 x 50 GB thin LUNs (17D:182)
• Thin pool Data_Pool: 56 x 230 GB RAID5 (3+1) data devices (C5:FC)
5. Create the Symmetrix masking view to group the storage, port, and initiator
groups:
symaccess -sid 191 create view -name VPLEX1_View -storgrp VPLEX1_Storage_Group -portgrp VPLEX1_Port_Group -initgrp VPLEX1_Initiator_Group
6. Repeat steps 1–5 for storage provisioning the second VPLEX system (VPLEX2) to
the second VMAX array (sid 219).
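For context, steps 1–4 (not shown above) create the initiator, port, and storage groups that the masking view references. The following is a hedged sketch of what those symaccess commands typically look like; the WWN, director port, and device range are hypothetical placeholders, not values from the tested configuration:
# Hedged sketch of the group-creation steps referenced above
symaccess -sid 191 create -name VPLEX1_Initiator_Group -type initiator -wwn <VPLEX_backend_port_WWN>
symaccess -sid 191 create -name VPLEX1_Port_Group -type port -dirport <director:port>
symaccess -sid 191 create -name VPLEX1_Storage_Group -type storage devs <device_range>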
Figure 8 lists the main tasks that are required for VPLEX Metro setup.
Note: You must set up both VPLEX Metro clusters as described. You cannot set each
cluster up individually and then join them later.
An example:
VPlexcli:/> ll **/hardware/ports/
/engines/engine-1-1/directors/director-1-1-A/hardware/ports:
Name Address Role Port Status
------- ------------------ --------- -----------
A2-FC00 0x500014426011ee20 wan-com up
A2-FC01 0x500014426011ee21 wan-com up
A2-FC02 0x500014426011ee22 wan-com up
A2-FC03 0x500014426011ee23 wan-com up
/engines/engine-1-1/directors/director-1-1-B/hardware/ports:
Name Address Role Port Status
------- ------------------ --------- -----------
/engines/engine-2-1/directors/director-2-1-A/hardware/ports:
Name Address Role Port Status
------- ------------------ --------- -----------
A2-FC00 0x5000144260168220 wan-com up
A2-FC01 0x5000144260168221 wan-com up
A2-FC02 0x5000144260168222 wan-com up
A2-FC03 0x5000144260168223 wan-com up
/engines/engine-2-1/directors/director-2-1-B/hardware/ports:
Name Address Role Port Status
------- ------------------ --------- -----------
B2-FC00 0x5000144270168220 wan-com up
B2-FC01 0x5000144270168221 wan-com up
B2-FC02 0x5000144270168222 wan-com up
B2-FC03 0x5000144270168223 wan-com up
An example: Create a consistency group on both VPLEX systems and add to the
consistency group all virtual volumes assigned to the Oracle Extended RAC ASM
devices (including the Grid Infrastructure and Oracle RAC database ASM devices)
that require write-order consistency.
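A hedged VPlexcli sketch of this step is shown below; the consistency group name and virtual volume names are illustrative assumptions, and the actual volume list should match the distributed devices created for the ASM disk groups:
VPlexcli:/> consistency-group create --name oracle_rac_cg --cluster cluster-1
VPlexcli:/> consistency-group add-virtual-volumes --consistency-group oracle_rac_cg --virtual-volumes dd_data_1_vol,dd_log_1_vol,dd_grid_1_vol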
Figure 10 shows the GUI interface for logical layout and provisioning storage from
EMC VPLEX.
Match PowerPath pseudo device names across Oracle RAC Server nodes
To match the PowerPath pseudo device names across the Oracle RAC server nodes,
EMC recommends using the PowerPath utility emcpadm. The utility allows the mapping
to be exported from one host and imported on another. It also allows renaming
pseudo devices one at a time if necessary.
<source host> emcpadm export_mapping -f <mapping_file_name>
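On each of the other RAC nodes, the mapping file is then imported with the matching emcpadm option; this sketch simply mirrors the export command shown above:
<target host> emcpadm import_mapping -f <mapping_file_name>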
Figure 11. Partition alignment to Symmetrix track size boundary (64 KB): the MBR and the 63 reserved blocks are followed by the default start for partition 1, which is moved to a new 64 KB-aligned start.
In this example, one partition is created for a PowerPath device that is going to be
used as an Oracle ASM device.
[root@licoc091 ~]# fdisk /dev/emcpowerd
Command (m for help): n
Command action
e extended
p primary partition (1-4)
p
Partition number (1-4): 1
First cylinder (1-52218, default 1): [ENTER]
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-52218, default 52218):
[ENTER]
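The session above creates the partition with the default starting block. To realize the 64 KB (128-block) offset shown in Figure 11, the partition's data start can be moved in fdisk expert mode before writing the table; the following continuation is a sketch of that sequence (the exact prompts vary by fdisk version):
Command (m for help): x
Expert command (m for help): b
Partition number (1-4): 1
New beginning of data (default 63): 128
Expert command (m for help): r
Command (m for help): w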
After the partitions are created, ensure that the other nodes recognize them. It may be
necessary to run the fdisk command on each of the other nodes and write ("w") the
partition table. Alternatively, a rescan of the SCSI bus or a reboot of the other nodes
will refresh the information as well.
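As a minimal sketch of such a refresh without a reboot (assuming the parted package, which provides partprobe, is installed and that the PowerPath pseudo device supports a partition table re-read), run on each of the other nodes:
partprobe /dev/emcpowerd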
Figure: Transaction rate versus the number of RAC nodes with an active workload (1 to 4 nodes) at a distance of 0 km (baseline).
Figure: Transaction rate versus the number of RAC nodes with an active workload (1 to 4 nodes) at a distance of 100 km (1 ms RTT).
Conclusion
EMC VPLEX running the EMC GeoSynchrony operating system is an enterprise-class,
SAN-based federation technology that aggregates and manages pools of Fibre
Channel attached storage arrays that can be either collocated in a single data center
or spread across multiple data centers that are geographically separated by metro
distances. Furthermore, with a unique scale-up and scale-out architecture, EMC
VPLEX's advanced data caching and distributed cache coherency provide workload
resiliency, automatic sharing, balancing, and failover of storage domains, and
enable both local and remote data access with predictable service levels. Oracle
Extended RAC dispersed across two data centers within metro distance, backed by the
capabilities of EMC VPLEX, provides improved performance, scalability, and flexibility.
In addition, the capability of EMC VPLEX to provide nondisruptive, heterogeneous
data movement and volume management functionality within synchronous distances
enables customers to offer nimble and cost-effective cloud services spanning
multiple physical locations.
References
The following documents include more information on VPLEX and can be found on
EMC.com and Powerlink:
• Implementation and Planning Best Practices for EMC VPLEX Technical Notes
• EMC VPLEX Metro Witness – Technology and High Availability TechBook
• Conditions for stretched hosts cluster support on EMC VPLEX Metro
• EMC VPLEX with GeoSynchrony 5.0 Product Guide