IBM Redbook AIX Oracle
Dino Quintero
Andrew Braid
Frederic Dubois
Alexander Hartmann
Octavian Lascu
Francois Martin
Wayne Martin
Stephan Navarro
Norbert Pistoor
Hubert Savio
Ralf Schmidt-Dannert
Redbooks
IBM Redbooks
October 2021
SG24-8485-00
Note: Before using this information and the product it supports, read the information in “Notices” on
page ix.
© Copyright International Business Machines Corporation August 2021. All rights reserved.
Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule
Contract with IBM Corp.
Contents
Notices
Trademarks
Preface
Authors
Now you can become a published author, too!
Comments welcome
Stay connected to IBM Redbooks
Notices
This information was developed for products and services offered in the US. This material might be available
from IBM in other languages. However, you may be required to own a copy of the product or product version in
that language in order to access it.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult
your local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user’s responsibility to
evaluate and verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, MD-NC119, Armonk, NY 10504-1785, US
This information could include technical inaccuracies or typographical errors. Changes are periodically made
to the information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the
materials for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you provide in any way it believes appropriate without
incurring any obligation to you.
The performance data and client examples cited are presented for illustrative purposes only. Actual
performance results may vary depending on specific configurations and operating conditions.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
Statements regarding IBM’s future direction or intent are subject to change or withdrawal without notice, and
represent goals and objectives only.
This information contains examples of data and reports used in daily business operations. To illustrate them
as completely as possible, the examples include the names of individuals, companies, brands, and products.
All of these names are fictitious and any similarity to actual people or business enterprises is entirely
coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in
any form without payment to IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore,
cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are
provided “AS IS”, without warranty of any kind. IBM shall not be liable for any damages arising out of your use
of the sample programs.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines
Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright
and trademark information” at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks or registered trademarks of International Business Machines Corporation,
and might also be trademarks or registered trademarks in other countries.
AIX®, IBM®, IBM Cloud®, IBM Cloud Pak®, IBM Garage™, IBM Spectrum®, IBM Z®, InfoSphere®, Interconnect®, POWER®, POWER8®, POWER9™, PowerVM®, Redbooks®, Redbooks (logo)®
Itanium, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
The registered trademark Linux® is used pursuant to a sublicense from the Linux Foundation, the exclusive
licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Microsoft, Windows, and the Windows logo are trademarks of Microsoft Corporation in the United States,
other countries, or both.
Ansible, OpenShift, and Red Hat are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
UNIX is a registered trademark of The Open Group in the United States and other countries.
VMware and the VMware logo are registered trademarks or trademarks of VMware, Inc. or its subsidiaries in the United States and/or other jurisdictions.
Other company, product, or service names may be trademarks or service marks of others.
Preface
This IBM® Redbooks® publication educates and prepares readers to understand how Oracle takes advantage of the architectural capabilities of IBM Power Systems.
This book delivers a technical snapshot of Oracle on Power Systems, using generally available and supported software and hardware resources, to help you understand:
- Why Oracle on Power Systems?
- The efficiencies and benefits of the Power Systems architecture.
- The strengths of the Power Systems architecture (how the technology differentiates itself and aligns with Oracle Database requirements, including processor architecture helpers and the IBM AIX® on IBM POWER® ecosystem).
- Scenarios that align with the story, and that showcase and document, step by step, the relevant, selected strengths and ecosystem features.
The goal of this publication is to combine Oracle and Power Systems features to deliver practical, nuts-and-bolts content, with a modern publication view of the solution, for deploying Oracle (RAC and single instance) on Power Systems by using theoretical knowledge, hands-on exercises, and sample scenarios that document the findings.
This publication addresses topics for developers, IT specialists, systems architects, brand specialists, sales teams, and anyone looking for a guide about how to implement the best options for Oracle on Power Systems. Moreover, this book provides documentation to transfer the how-to skills to the technical teams, and solution guidance to the sales teams. This publication complements the documentation available in IBM Documentation, and aligns with the educational materials provided by the IBM Garage™ for Systems Technical Training.
Authors
This book was produced by a team of specialists from around the world working at IBM
Redbooks, Poughkeepsie Center.
Dino Quintero is a Power Systems Technical Specialist with Garage for Systems. He has 25
years of experience with IBM Power Systems technologies and solutions. Dino shares his
technical computing passion and expertise by leading teams developing technical content in
the areas of enterprise continuous availability, enterprise systems management,
high-performance computing (HPC), cloud computing, artificial intelligence including machine
and deep learning, and cognitive solutions. He also is a Certified Open Group Distinguished
IT Specialist. Dino is formerly from the province of Chiriqui in Panama. Dino holds a Master of
Computing Information Systems degree and a Bachelor of Science degree in Computer
Science from Marist College.
Andrew Braid is a technical specialist working in the IBM Oracle Center in Montpellier,
France. He worked for Oracle Worldwide Support as part of the E-business Suite Core
technologies team and as a team leader for Oracle Premium Support in the United Kingdom
specializing in platform migrations and upgrades before working as a production Database
Administrator for several large Oracle E-Business Suite clients across Europe. He joined IBM
in 2011 to provide support for benchmarks and customers running Oracle Databases on IBM
Power Systems.
Frederic Dubois is a Global Competitive Sales Specialist at the IBM Garage for Systems at
IBM Global Markets in France. He delivers client value by way of his technical, presentation,
and writing skills, while supporting brand specific business strategies.
Alexander Hartmann is a Senior IT Specialist working for IBM Systems Lab Services in
Germany and is a member of the IBM Migration Factory. He holds a master's degree in
business informatics from the University of Mannheim, Germany. In his more than 25 years
of experience with relational databases he has been working the last 16 years intensively on
all aspects of Oracle Databases, with a focus on migration, performance- and
license-optimization. Besides the Oracle Database specialties his areas of expertise include
AIX, Linux, scripting and automation.
Octavian Lascu is an IBM Redbooks Project Leader and a Senior IT Consultant for IBM
Romania with over 25 years of experience. He specializes in designing, implementing, and
supporting complex IT infrastructure environments (systems, storage, and networking),
including high availability and disaster recovery solutions and high-performance computing
deployments. He has developed materials for and taught over 50 workshops for technical
audiences around the world. He is the author of several IBM publications.
Francois Martin is a Global Competitive Sales Specialist who is responsible for developing brand- and product-specific solutions that address clients' business needs (both industry and business) and deliver client value, while supporting brand-specific business strategies. Francois has experience and skills in Power Systems sales competition. He understands customer situations, sales, and technical sales to tackle competitive proposals. Francois has knowledge of competitors' sales strategies, especially competition against Oracle, and launches worldwide sales plays, enablement sessions, workshops, and WebEx sessions for sellers and Business Partners. His skills and experience come from previous assignments in IBM Education, cloud consolidation, teaching, technical sales enablement, performance benchmarks, TCO, AIX, Power Systems, virtualization, and architecture design and technology.
Wayne Martin is the IBM Systems Solution Manager responsible for the technology
relationship between IBM and Oracle Corporation for all IBM server brands. He is responsible
for developing the mutual understanding between IBM and Oracle on technology innovations
that generate benefits for mutual customers. Wayne has held various technical and
management roles at IBM that focused on driving enhancements of ISV software that use
IBM mainframe, workstation, and scalable parallel products.
Stephan Navarro is an Oracle for IBM Power Systems Architect at the IBM Garage for
Systems at IBM Global Markets in France.
Norbert Pistoor was a Senior Consultant at Systems Lab Services in Germany and a
Member of the Migration Factory until his retirement in June 2021. He has more than 20 years
of experience with Oracle Databases and more than 10 years with cross-platform database
migrations. He contributed several enhancements to standard Oracle migration methods and
incorporated them into the IBM Migration Factory Framework. He holds a PhD in physics from
University of Mainz, Germany.
Hubert Savio has been working in the IT field for about 23 years as a DBA, consultant, IT specialist, and IT architect on international and Danish/Nordic projects, which helped him build the skills that his customers require in the Linux, UNIX, and SAP/Oracle areas, both on-premises and in cloud solutions. Hubert has worked with Oracle since late 1996, as far back as releases 6 and 7.3.4. He has a Masters degree in IT from the "Centre d'études Supérieures Industrielles" of Strasbourg, France. Hubert specializes in Oracle Real Application Clusters under AIX.
Ralf Schmidt-Dannert designs, tests, and troubleshoots solutions for high performance, high availability, and near zero data loss disaster recovery. Ralf has helped customers in the financial,
telecommunications, utility, retail and manufacturing industries choose appropriate database
and infrastructure technologies to meet their business requirements for databases up to a
hundred terabytes in size. Most recently, he has been evaluating and implementing
technologies to provide Database as a Service to customers running their databases on IBM
Power Systems servers (on-premises) or in IBM Power Systems Virtual Server infrastructure
(off-premises). This work includes Oracle Database on AIX and open source databases on
Linux on Power.
Thanks to the following people for their contributions to this project:
Wade Wallace
IBM Redbooks, Poughkeepsie Center
Majidkhan Remtoula
IBM France
Reinaldo Katahira
IBM Brazil
Find out more about the residency program, browse the residency index, and apply online at:
ibm.com/redbooks/residencies.html
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about this book or
other IBM Redbooks publications in one of the following ways:
- Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
- Send your comments in an email to:
redbooks@us.ibm.com
- Mail your comments to:
IBM Corporation, IBM Redbooks
Dept. HYTD Mail Station P099
1.1.1 Reliability
From a server hardware perspective, reliability is a collection of technologies (such as chipkill
memory error detection/correction, dynamic configuration and so on) that enhance system
reliability by identifying specific hardware errors and isolating the failing components.
Built-in system failure recovery methods enable cluster nodes to recover, without falling over
to a backup node, when problems have been detected by a component within a node in the
cluster. Built-in system failure recovery is transparent and achieved without the loss or
corruption of data. It is also much faster than system or application failover recovery (failing over to a backup server and recovering). Because the workload does not shift from this node to another node, no other node's performance or operation is affected. Built-in system
recovery covers applications (monitoring and restart), disks, disk adapters, LAN adapters,
power supplies (battery backups) and fans.
From a software perspective, reliability is the capability of a program to perform its intended
functions under specified conditions for a defined period of time. Software reliability is
achieved mainly in two ways: infrequent failures (built-in software reliability), and extensive
recovery capabilities (self healing - availability).
IBM's fundamental focus on software quality is the primary driver of improvements in reducing
the rate of software failures. As for recovery features, IBM-developed operating systems have
historically mandated recovery processing in both the mainline program and in separate
recovery routines as part of basic program design.
As IBM Power Systems become larger, more and more customers expect mainframe levels of
reliability. For some customers, this expectation derives from their prior experience with
mainframe systems which were “downsized” to UNIX servers. For others, this is simply a
consequence of having systems that support more users.
The cost associated with an outage grows every year, therefore avoiding outages becomes
increasingly important. This leads to new design requirements for all AIX-related software.
For all operating system or application errors, recovery must be attempted. When an error
occurs, it is not valid to simply give up and terminate processing. Instead, the operating
system or application must at least try to keep the component affected by the error up and
running. If that is not possible, the operating system or application makes every effort to
capture the error data and automate system restart as quickly as possible.
The amount of effort put into the recovery is, of course, proportional to the impact of a failure
and the reasonableness of “trying again”. If actual recovery is not feasible, then the impact of
the error is reduced to the minimum appropriate level.
Today, many customers require that recovery processing be subject to a time limit and have
concluded that rapid termination with quick restart or takeover by another application or
system is preferable to delayed success. However, takeover strategies rely on redundancy
that becomes more and more expensive as systems get larger, and in most cases the main
reason for quick termination is to begin a lengthy takeover process as soon as possible. Thus,
the focus is now shifting back towards core reliability, and that means quality and recovery
features.
1.1.2 Availability
Today’s systems have hot plug capabilities for many subcomponents, from processors to
input/output cards to memory. Also, clustering techniques, reconfigurable input and output
data paths, mirrored disks, and hot swappable hardware help to achieve a significant level of
system availability.
From a software perspective, availability is the capability of a program to perform its function
whenever it is needed. Availability is a basic customer requirement. Customers require a
stable degree of certainty, and also require that schedules and user needs are met.
Availability gauges the percentage of time a system or program can be used by the customer
for productive use. Availability is determined by the number of interruptions and the duration
of the interruptions, and depends on characteristics and capabilities which include:
- The ability to change program or operating system parameters without rebuilding the kernel and restarting the system.
- The ability to configure new devices without restarting the system.
- The ability to install new software or update existing software without restarting the system.
- The ability to monitor system resources and programs, and to clean up or recover resources when failures occur.
- The ability to maintain data integrity in spite of errors.
The AIX operating system includes many availability characteristics and capabilities from
which your overall environment will benefit.
1.1.3 Serviceability
Focus on serviceability is shifting from providing customer support remotely through
conventional methods, such as phone and email, to automated system problem reporting and
correction, without user (or system administrator) intervention.
Hot swapping capabilities of some hardware components enhance the serviceability aspect. A service processor with advanced diagnostic and administrative tools further enhances system serviceability. A System p server's service processor can call home with a service report, providing detailed information for IBM service to act upon. This automation not only
eases the burden placed on system administrators and IT support staff, but also enables
rapid and precise collection of problem data.
On the software side, serviceability is the ability to diagnose and correct or recover from an
error when it occurs. The most significant serviceability capabilities and enablers in AIX are
referred to as the software service aids. The primary software service aids are error logging,
system dump, and tracing.
With the advent of next-generation UNIX servers from IBM, many hardware reliability, availability, and serviceability capabilities, such as memory error detection, LPARs, hardware sensors, and so on, have been implemented. These features are supported with the relevant software in AIX. These abilities continue to establish AIX as the best UNIX operating system.
IBM has made AIX robust with respect to continuous availability characteristics, and this
robustness makes IBM UNIX servers the best in the market. IBM AIX continuous availability
strategy has the following characteristics:
- Reduce the frequency and severity of AIX system outages, planned and unplanned.
- Improve serviceability by enhancing AIX failure data capture tools.
- Provide enhancements to debug and problem analysis tools.
- Ensure that all necessary information involving unplanned outages is provided to correct the problem with minimal customer effort.
- Use of mainframe hardware features for operating system continuous availability brought to System p hardware.
- Provide key error detection capabilities through hardware-assist.
- Exploit other System p hardware aspects to continue the transition to “stay-up” designs.
- Use of “stay-up” designs for continuous availability.
- Maintain operating system availability in the face of errors while minimizing application impacts.
- Use of sophisticated and granular operating system error detection and recovery capabilities.
- Maintain a strong tie between serviceability and availability.
- Provide problem diagnosis from data captured at first failure without the need for further disruption.
- Provide service aids that are nondisruptive to the customer environment.
- Provide end-to-end and integrated continuous availability capabilities across the server environment and beyond the base operating system.
- Provide operating system enablement and application and storage exploitation of the continuous availability environment.
2.1 Introduction
This section describes IBM Power Systems architectural strengths.
Additional components not designed or manufactured by IBM are chosen and specified by
IBM to meet system requirements. These are procured for use by IBM using a rigorous
procurement process intended to deliver reliability and design quality expectations.
The systems incorporate software layers (firmware) for error detection, fault isolation and
support, and virtualization in a multi-partitioned environment. These include IBM designed
and developed service firmware. IBM PowerVM® hypervisor is also IBM designed and
supported.
In addition, IBM offers two operating systems developed by IBM: AIX and IBM i. Both
operating systems come from a code base with a rich history of design for reliable operation.
These components are designed with application availability in mind, including the software
layers, which are also capable of taking advantage of hardware features such as storage keys
that enhance software reliability.
Within the IBM POWER9™ processor and memory sub-system, this necessarily means systematically investing in error detection. This includes the obvious, such as checking data in memory and caches and validating that data transferred across busses is correct. It goes well beyond this to include techniques for checking state machine transitions, residue checking for certain operations, and protocol checking to make sure not only that the bits transmitted are correct, but also that the data went when and where it was expected, and so on.
When it is detected that a fault has occurred, the primary intent of the RAS design is to
prevent reliance on this data. Most of the rest of the RAS characteristics discussed in this
section describe ways in which disruption due to bad data can be eliminated or minimized.
However, there are cases where avoiding reliance on bad data or calculations means
terminating the operation.
It must be pointed out that error detection seems like a well-understood and expected goal. However, it is not always the goal of every possible sub-system design. Hypothetically, for instance, graphics processing units (GPUs) whose primary purpose is rendering graphics in non-critical applications have options for turning off certain error checking (such as ECC in memory) to allow for better performance. The expectation in such a case is that there are applications where a single dropped pixel on a window is of no real importance, and a solid fault is only an issue if it is noticed.
In general, I/O adapters can also have less hardware error detection capability where they
can rely on a software protocol to detect and recover from faults.
Another kind of failure is what is broadly classified as a soft error. Soft errors are faults that
occur in a system and are either occasional events inherent in the design or temporary faults
that are due to an external cause.
Data cells in caches and memory, for example, can have a bit-value temporarily upset by an
external event, such as a cosmic ray generated particle. Logic in processor cores
can also be subject to soft errors where a latch can also flip due to a particle strike or similar
event. Busses transmitting data can experience soft errors due to clock drift or electronic
noise.
Interleaving data so that two adjacent bits flipping in an array do not cause an undetected multi-bit flip in a data word is another important design technique.
Ultimately, when data is critical, soft error events must be detected immediately and inline to avoid relying on bad data, because periodic diagnostics are insufficient to catch an intermittent problem before damage is done.
The simplest approach to detecting many soft error events is parity protection on data, which can detect a single bit flip. When such simple single-bit error detection is deployed, however, the impact of a single bit upset is bad data. Discovering bad data without being able to correct it results in the termination of an application, or even of a system, so long as data correctness is important.
To prevent such a soft error from having a system impact, it is necessary not simply to detect a bit flip, but also to correct it. This requires more hardware than simple parity. It has become
common now to deploy a bit correcting error correction code (ECC) in caches that can contain
modified data. Because such flips can occur in more than just caches, however, such ECC
codes are widely deployed in POWER9 processors in critical areas on busses, caches and so
forth.
Protecting a processor from more than just data errors requires more than just ECC checking
and correction. CRC checking with a retry capability is used on a number of busses, for
example.
Significantly, POWER processors since POWER6 have been designed with sufficient error detection not only to notice typical soft error upsets impacting calculations, but to notice them quickly enough to allow processor operations to be retried. Where retry is successful, as expected for temporary events, system operation continues without application outages.
However, if a solid fault is continually being corrected, a second fault that occurs will typically produce data that is not correctable. This results in the need to terminate, at least, whatever is using the data.
In many system designs, when a solid fault occurs in something like a processor cache, the
management software on the system (the hypervisor or operating system) can be signaled to
migrate the failing hardware off the system.
This is called predictive deallocation. Successful predictive deallocation allows for the system
to continue to operate without an outage. To restore full capacity to the system, however, the
failed component still needs to be replaced, resulting in a service action.
Examples include a spare data lane on a bus, a spare bit-line in a cache, having caches split
up into multiple small sections that can be deallocated, or a spare DRAM module on a DIMM.
Sometimes redundant components are not actively in use unless a failure occurs. For
example, a processor will only actively use one clock source at a time even when redundant
clock sources are provided.
In contrast, if a system is said to have “n+1” fan redundancy, all “n+1” fans are normally active in the system in the absence of a failure. If a fan fails, the system runs with “n” fans. In such a case, power and thermal management code compensates by increasing fan speed or making adjustments according to operating conditions per the power management mode and policy.
If, on the first phase failure, the system continues to operate and no call-out is made for repair, the first failing phase is considered a spare. After that failure (the spare is said to be used), the VRM can experience another phase failure with no outage, which maintains the required n+1 redundancy. If a second phase fails, a “redundant” phase is said to have failed, and a call-out for repair is made.
To a significant degree this error handling is contained within the processor hardware itself.
However, service diagnostics firmware, depending on the error, can aid in the recovery. When fully virtualized, tasks such as migrating off a predictively failed component can also be performed transparently to the operating system, without specific operating system involvement.
The PowerVM hypervisor is capable of creating logical partitions with virtualized processor and memory resources. When these resources are virtualized by the hypervisor, the hypervisor has the capability of deallocating fractional resources from each partition when necessary to remove a component such as a processor core or logical memory block (LMB).
When an I/O device is directly under the control of the operating system, the error handling of the device is the device driver's responsibility. However, I/O can be virtualized through the Virtual I/O Server (VIOS) offering, meaning that I/O redundancy can be achieved independently of the operating system.
2.2.8 Build system level RAS rather than just processor and memory RAS
IBM builds Power Systems servers with the understanding that every item that can fail in a system is a
potential source of outage.
Although building a strong base of availability for the computational elements such as the
processors and memory is important, it is hardly sufficient to achieve application availability.
The failure of a fan, a power supply, a voltage regulator, or I/O adapter might be more likely
than the failure of a processor module designed and manufactured for reliability.
Scale-out servers will maintain redundancy in the power and cooling subsystems to avoid
system outages due to common failures in those areas. Concurrent repair of these
components is also provided.
For the Enterprise systems, a higher investment in redundancy is made. The E980 system, for example, is designed from the start with the expectation that the system must be generally shielded from the failure of these other components by incorporating redundancy within the service infrastructure (such as redundant service processors, redundant processor boot images, and so forth). Emphasis is also placed on the components themselves being highly reliable and meant to last.
This level of RAS investment extends beyond what is expected and often what is seen in
other server designs. For example, at the system level such selective sparing includes such
elements as a spare voltage phase within a voltage regulator module.
Further, the error detection and fault isolation capabilities are intended to enable retry and
other mechanisms to avoid outages due to soft errors and to allow for use of self-healing
features. This requires a detailed approach to error detection.
This approach is beneficial to systems as they are deployed by end-users, but also has
benefits in the design, simulation and manufacturing test of systems as well.
Putting this level of RAS into the hardware cannot be an after-thought. It must be integral to
the design from the beginning, as part of an overall system architecture for managing errors.
Therefore, during the architecture and design of a processor, IBM places a considerable
emphasis on developing structures within it specifically for error detection and fault isolation.
Each subsystem in the processor hardware has registers devoted to collecting and reporting
fault information as faults occur. The design for error checking is rigorous and detailed. The value of data is checked generally wherever it is stored. This is true, of course, for data used in computations, but also for nearly any other data structure, including arrays used only to store performance and debug data.
Error checkers are derived for logic structures by using various techniques, such as checking the validity of state-machine transitions, defining and checking protocols for generated commands, and doing residue checking for certain computational instructions, and by other means, in an effort to detect faults before the resulting impact propagates beyond the detecting sub-system. The exact number of checkers and type of mechanisms is not as important as the point that the processor is designed for detailed error checking; much more than is required simply to report during runtime that a fault has occurred.
All these errors feed a data reporting structure within the processor. There are registers that
collect the error information. When an error occurs, that event typically results in the
generation of an interrupt.
The error detection and fault isolation capabilities maximize the ability to categorize errors by
severity and handle faults with the minimum impact possible. Such a structure for error
handling can be abstractly illustrated in Figure 2-1 on page 11 and is discussed throughout
the rest of this section.
Ideally this code primarily handles recoverable errors including orchestrating the
implementation of certain “self-healing” features such as use of spare DRAM modules in
memory, purging and deleting cache lines, using spare processor fabric bus lanes, and so
forth.
Code within a hypervisor does have control over certain system virtualized functions,
particularly as it relates to I/O including the PCIe controller and certain shared processor
accelerators. Generally, errors in these areas are signaled to the hypervisor.
In addition, there is still a reporting mechanism for what amounts to the more traditional
machine-check or checkstop handling.
In an IBM POWER7 generation system, the PRD (Processor Runtime Diagnostics) code was said to run and manage most errors, whether the fault occurred at runtime, at system IPL time, or after a system checkstop, which is the descriptive term for entire system termination by the hardware due to a detected error.
In IBM POWER8®, the processor module included a Self-Boot Engine (SBE), which loaded code on the processors to bring the system up to the point where the hypervisor could be initiated. Certain faults in early steps of that IPL process were managed by this code, and PRD ran as host code as part of the boot process.
In POWER9 processor-based systems, during normal operation the PRD code is run in a
special service partition in the system on the POWER9 processors using the hypervisor to
manage the partition. This has the advantage in systems with a single service processor of
allowing the PRD code to run during normal system operation even if the service processor is
faulty.
In the rare event that a system outage resulted from a problem, the service processor had
access not only to the basic error information stating what kind of fault occurred, but also
access to considerable information about the state of the system hardware – the arrays and
data structures that represent the state of each processing unit in the system, and also
additional debug and trace arrays that can be used to further understand the root cause of
faults.
Even if a severe fault caused system termination, this access provided the means for the
service processor to determine the root cause of the problem, deallocate the failed
components, and allow the system to restart with failed components removed from the
configuration.
POWER8 gained the Self-Boot Engine, which allowed processors to run code and boot using the POWER8 processors themselves, to speed up the process and provide for parallelization across multiple nodes in the high-end system. During the initial stages of the IPL process, the boot engine code itself handled certain errors, and the PRD code ran as an application in later stages if necessary.
In POWER9 the design has changed further so that during normal operation the PRD code
itself runs in a special hypervisor-partition under the management of the hypervisor. This has
the advantage of continuing to allow the PRD code to run even if the service processor is
non-functional (important in nonredundant environments.)
In the event of a fault, the code running in the hypervisor can restart the partition (reloading and restarting the PRD).
The system service processors are also still monitored at runtime by the hypervisor code and
can report errors if the service processors are not communicating.
The PowerVM hypervisor uses a distributed model across the server’s processor and
memory resources. In this approach some individual hypervisor code threads can be started
and terminated as needed when a hypervisor resource is required. Ideally when a partition
needs to access a hypervisor resource, a core that was running the partition will then run a
hypervisor code thread.
Certain faults that might impact a PowerVM thread result in a system outage if they occur. This can be by PowerVM termination, or by the hardware determining that, for PowerVM integrity, the system needs to checkstop.
The design cannot be viewed as a physical partitioning approach. There are not multiple
independent PowerVM hypervisors running in a system. If, for fault isolation purposes, it is
desired to have multiple instances of PowerVM and hence multiple physical partitions,
separate systems can be used.
Not designing a single system to have multiple physical partitions reflects the belief that the
best availability can be achieved if each physical partition runs in completely separate
hardware. Otherwise there is a concern that when resources for separate physical partitions
come together in a system, even with redundancy,
there can be some common access point and the possibility of a “common mode” fault that
impacts the entire system.
Note: Oracle has specific considerations with regards to what they support as Public
Cloud. The following documents provide further details.
Licensing Oracle Software in the Cloud Computing Environment
https://www.oracle.com/assets/cloud-licensing-070579.pdf
Oracle Database Support for Non-Oracle Public Cloud Environments
(MyOracleSupport Doc ID 2688277.1)
These documents describe the support and licensing policies that Oracle applies to those various public cloud offerings.
Off-premises means that the service provider owns, manages, and assumes all responsibility for the data centers, hardware, and infrastructure on which its customers' workloads run, and typically provides high-bandwidth network connectivity to ensure high performance and rapid access to applications and data.
Private cloud combines many of the benefits of cloud computing (including elasticity, scalability, and ease of service delivery) with the access control, security, and resource customization of on-premises infrastructure.
There are several motivators driving enterprises to switch from traditional IT to a cloud computing model to run enterprise infrastructure more effectively and expand the business:
- Lower IT costs: offload some or most of the costs and effort of purchasing, installing, configuring, and managing your own on-premises infrastructure.
- Improve agility and time-to-value: start using enterprise applications in minutes, instead of waiting weeks or months for IT to respond to a request, purchase and configure supporting hardware, and install software.
- Scale more easily and cost-effectively: instead of purchasing excess capacity that sits unused during slow periods, you can scale capacity up and down in response to spikes and dips in traffic.
Many companies choose hybrid cloud to establish a mix of public and private cloud resources, with a level of orchestration between them, which gives an organization the flexibility to choose the optimal cloud for each application or workload and to move workloads freely between the two clouds as circumstances change.
IBM Power Systems provides a cloud-ready platform that integrates with most of the needed tools and software to enable customers on their journey to cloud and hybrid multicloud.
More technical details about IBM Private Cloud Solution with Shared Utility Capacity can be
read at the following website:
http://www.redbooks.ibm.com/redpieces/pdfs/sg248478.pdf
All LPARs running Oracle DB on Power Systems can be assigned to a capped pool of
processors to benefit from CPU sharing mechanisms provided by the hypervisor (shared
processor pool). LPARs that require more CPU resource can get additional capacity from
other LPARs in that pool that cede idle CPU cycles to the pool, but overall CPU capacity is
hard limited by the shared CPU pool.
A given VM/LPAR can be explicitly limited to the maximum number of cores/processors it has
access to. This technology is consistent with Oracle’s hard partitioning guidelines for Oracle
Licensing terms and conditions: https://www.oracle.com/assets/partitioning-070609.pdf.
Power Enterprise Pools 2.0 has no effect on shared processor pools or the LPAR capping mechanism. It relies on real CPU consumption and does not interact with the resources assigned to a given LPAR. This means that Power Enterprise Pools 2.0 is not a technology to reduce software licensing costs, but rather one to optimize hardware acquisition costs.
The IBM Power Virtual Server environment consists of Power Systems servers, PowerVM
Hypervisor, and AIX operating systems that are certified for Oracle DB 12c, 18c, and 19c and
other Oracle SW Products (Application, Middleware). This same stack is used by tens of
thousands of customers in their current IT environments. Oracle publishes its certifications of the PowerVM hypervisor, its features, and AIX 7.1 and 7.2, and confirms support of these features at the following websites:
https://www.oracle.com/database/technologies/virtualization-matrix.html
https://support.oracle.com/portal/
As per Oracle Software Technical Policies document, “Technical support is provided for
issues (including problems you create) that are demonstrable in the currently supported
release(s) of an Oracle licensed program, running unaltered, and on a certified hardware,
database and operating system configuration, as specified in your order or program
documentation”:
https://www.oracle.com/us/support/library/057419.pdf
The environment leverages LPARs and adheres to Oracle's hard partitioning guidelines, as
long as LPM is not used with those LPARs running Oracle software. Of course, Oracle
licensing terms and conditions are always based on the contract between the customer and
Oracle:
https://www.oracle.com/assets/partitioning-070609.pdf
The customer is responsible for complying with the terms of the existing licensing contract with Oracle, regardless of the deployment option chosen: on-premises or IBM Power Systems Virtual Server.
AIX and Power Systems provide choice and flexibility and prevent client lock-in to a specific vendor. Customers can select the correct location to deploy their Oracle workloads while leveraging the value-add of Power Systems for Oracle. Such choice allows them to take advantage of their investments in software and tailored database options licenses by moving, migrating, or building Oracle environments in another location and switching back as business requirements dictate, as shown in Figure 3-3.
You have the flexibility to run your journey to the cloud at your own pace.
This section addresses the three steps to move from a traditional Power Systems IT landscape to a hybrid (on-premises plus off-premises) POWER infrastructure.
Within this transformation of traditional IT, organizations are also modernizing traditional software into containerized applications to help improve operational efficiency and cloud integration from multiple vendors, and to build a more unified cloud strategy. Red Hat OpenShift is fully
enabled and supported on IBM Power Systems to rapidly build, deploy and manage
cloud-native applications. IBM Cloud Paks are lightweight, enterprise-grade, modular cloud
solutions, integrating a container platform, containerized IBM middleware and open source
components, and common software services for development and management. IBM
introduced IBM Cloud Paks to address those application transformation needs and offer a
faster, more reliable way to build, move and manage on the cloud.
Thanks to Red Hat OpenShift, paired with IBM Cloud Paks, customers gain enterprise-ready,
containerized software solutions for an open, faster and more secure way to move core
business applications to any cloud (starting with an on-premises private cloud).
More details about Red Hat OpenShift and IBM Cloud Pak® on IBM Power Systems are available at:
https://www.redbooks.ibm.com/abstracts/sg248486.html
The same technology stack as on-premises allows users to develop and customize those new
environments and bring them back on-premises (or vice versa) without risks and huge effort.
The following example illustrates how to export a custom AIX image from on-premises Power
Systems infrastructure to IBM Power Systems Virtual Server using PowerVC and IBM Cloud
Object Storage. PowerVC images built with custom content can be exported in OVA format
into an IBM Cloud Object Storage bucket to be imported into the boot image catalog of your
Power Virtual Server Instance.
This allows you to quickly build environments based on your own customized AIX image. The process can be easily reversed to migrate an AIX LPAR that was built and customized in Power Virtual Server back on-premises:
https://ibm.biz/HybridOracleDBonPOWER9-Part1
https://ibm.biz/HybridOracleDBonPOWER9-Part2
Clients can now avoid the need to build their own data center for DR purposes, and can choose to subscribe to Power Virtual Server or to other alternative off-premises Power Systems infrastructure providers, or start migrating production environments.
Clients can subscribe for the minimal virtual server configuration and expand the
configuration depending on the needs at the time of failover or workload needs. IBM
Infrastructure as a Service (IaaS) management expertise and experience is leveraged to
provide the services relating to the hardware and infrastructure. Applications, databases and
operating systems layers remain under customer or out-sourced management control.
IBM Cloud Paks include solutions such as IBM Multicloud Manager, which allows customers to adopt VMs and containers in a hybrid multicloud environment while leveraging their current infrastructure investments. This multicloud self-service management platform empowers developers and administrators to meet business demands, as shown at the following website:
https://www.ibm.com/docs/en/cloud-paks/cp-management
This platform allows you to efficiently manage and deliver services through end-to-end
automation while enabling developers to build applications aligned with enterprise policies
and using open source Terraform to manage and deliver cloud infrastructure as code.
With Red Hat Ansible Tower, users can now define one master workflow that ties different
areas of IT together, designed to cover a hybrid infrastructure without being stopped at
specific technology silos. Red Hat Ansible ensures your cloud deployments work seamlessly
across off-premises, on-premises, or hybrid cloud as easily as you can build a single system.
Figure 3-6 on page 24 shows the high-level hybrid multicloud reference architecture, inclusive of the major industry hardware platforms: IBM Power Systems, IBM Z®, and x86. Power Systems is architected to economically scale mission-critical, data-intensive applications, either virtual-machine based or containerized, delivering industry-leading reliability to run them and reducing the cost of operations with built-in virtualization to optimize capacity utilization. It also provides flexibility and choice to deploy those containerized applications or virtual machines in the cloud of your choice.
Figure 3-6 applies to virtual machines (logical partitions) running AIX and therefore allows you to automate the deployment and management of the Oracle Database on AIX across the management stack.
There are other orchestration tools available, such as VMware vRealize. The IBM partnership with
VMware provides clients with the vRealize Suite to unify applications and infrastructure
management across IBM Power Systems, x86 servers and IBM Z environments.
The following sections describe how to take advantage of key components of the Hybrid
Cloud Journey on POWER for your Oracle Database. Examples about automated Oracle
Database environment deployments are illustrated.
The document “Deploying Oracle Database as a Service with IBM Cloud PowerVC Manager
1.3.1” illustrates step by step the tasks to build an Oracle DB as a Service (DBaaS) offering
on AIX/Power Systems by developing a reference image with Oracle Database and Oracle
Grid Infrastructure installed. A post-provisioning step then updates the Oracle configuration
according to the IP address, hostname, and database name defined during the LPAR creation process by way of PowerVC:
https://www.ibm.com/support/pages/node/6355627
Note: Implementation details and all developed scripts to implement Oracle Database as a Service are included. You can re-use them as-is to start your private cloud journey, and then customize and enhance them as per your own constraints and requirements. They are available at the following website:
https://github.com/ppc64le/devops-automation/tree/master/terraform/oracle/powervc-oradbaas
Users of the DBaaS service can select from a set of deployable images, like Oracle Database
12c or 19c, with JFS2 or with ASM, and then provide customization parameters like database
name or required storage capacity by way of the PowerVC interface. The DBaaS service
administrator can also define constraints and any specific approval requirements as needed
before a request is fulfilled.
After the user request is approved all later steps are fully automated.
IBM Cloud PowerVC Manager sends the deployment request to PowerVC with the provided
customization parameters. IBM PowerVC then does all the heavy lifting - creating the LPAR,
by way of the HMC, allocating and mapping storage and then starting the LPAR. In the LPAR,
after boot, Cloud-Init evaluates the provided parameters and adjusts hostname, network
setting, database name and any other customization accordingly. Refer to Figure 3-7.
The deployment is then completed by running the post-deployment scripts which were
included in the deployment image at capture time. Cloud-Init is the technology that takes user
inputs and configures the operating system and software on the deployed virtual machine.
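That flow can be sketched as cloud-init user data. The host name, script path, and database name below are hypothetical placeholders for illustration only, not values from the offering described here.
#cloud-config
# Minimal user-data sketch (hypothetical values). Cloud-init applies the host
# identity and then runs a post-deployment script that was captured in the image.
hostname: oradb01
fqdn: oradb01.example.com
runcmd:
  # Adjust the Oracle configuration (listener, instance) to the new host identity
  # and the database name chosen by the requesting user.
  - /opt/deploy/post_deploy_oracle.sh -d ORCL01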
PowerVC relies on capturing and restoring images. This is the first step to convert a traditional on-premises Power Systems infrastructure to a private cloud and to offer services and templates to users. Figure 3-8 shows the high-level preparation steps taken to provide the different service levels (IaaS, PaaS, and DBaaS), with each further refined image providing enhanced functionality.
This requires maintaining image sets and handling a large set of images to offer wide flexibility and choice with regards to the combination of operating system version and software stack version. However, PowerVC can be convenient when, for instance, large databases must be cloned. The capture and restore process saves time and avoids reinstalling the software stack and exporting and importing the Oracle DB.
To increase agility and choice in the services that you offer to users, you add an orchestration and a decomposition solution. This brings you to the next level of hybrid cloud, as such orchestration tooling applies to both on-premises and off-premises environments.
The following link illustrates how to utilize IBM Terraform and Service Automation, instead of
PowerVC, to provide the control point and user interface for DBaaS for an Oracle Database,
while re-using existing PowerVC image:
https://www.ibm.com/support/pages/node/6355775
This image approach for provisioning limits the operating system and database versions that you offer to users, as it increases the number of images to build and maintain. Decomposing the build of a deployed Oracle Database environment into several steps results in longer deployment times, but allows the reuse of parts in other services and provides higher flexibility and customization to the user of the DBaaS offering. Terraform is an open source tool, created by HashiCorp, that handles the provisioning of cloud and infrastructure resources. These resources are described to Terraform using a high-level configuration language, stored in configuration files, or templates. From these templates, Terraform generates an execution plan describing what it will do, and executes the plan to build the described infrastructure.
Figure 3-10 on page 28 shows the decomposition of an Oracle DBaaS on AIX service to
deploy an Oracle Database either on-premises or in an IBM Power Systems Virtual Server
with the capability to customize each of the steps based on user input.
Note that a set of related sample source code and scripts illustrating this modified approach to providing a DBaaS service is available in a public GitHub repository at the following website:
https://github.com/FredD07/OracleDeployment
Important: You need to decide in advance what you want to include in the image and what you want to make customizable. For instance, we decided to include in the base AIX image the Oracle user and group creation, in addition to applying our AIX and Oracle best practices settings, to have a ready-to-use AIX image for Oracle. This operating system configuration and customization can alternatively be implemented by way of a post-deploy scripted template that is responsible for setting the AIX and Oracle prerequisites and best practices (a minimal sketch of such a post-deploy approach follows Figure 3-10).
Figure 3-10 Dividing an Oracle DBaaS to deploy it on an IBM Power System Virtual Server
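As a minimal sketch of that alternative, and assuming cloud-init is used for the post-deployment customization, a scripted template could create the Oracle user and group and apply a few common AIX settings instead of baking them into the image. The commands below are standard AIX administration commands, but the user, group, and limit values are illustrative assumptions only and must be aligned with your own AIX and Oracle best practices.
#cloud-config
# Hypothetical post-deploy alternative to baking Oracle prerequisites into the image.
runcmd:
  # Create the dba group and the oracle user (illustrative names).
  - mkgroup dba
  - mkuser pgrp=dba groups=dba home=/home/oracle oracle
  # Raise the maximum number of processes per user, a common Oracle on AIX setting.
  - chdev -l sys0 -a maxuproc=16384
  # Remove file size and data segment limits for new processes (illustrative values).
  - chsec -f /etc/security/limits -s default -a fsize=-1 -a data=-1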
The first part of this DBaaS provisioning workflow is the LPAR creation. Depending on the end-user selection, the corresponding template is called to create the LPAR either on Power Systems on-premises or in Power Virtual Server (off-premises), using an image with a certain version of AIX and all Oracle prerequisites and best practices set. The number and size of the additional volumes that will host the Oracle DB have been integrated into this "LPAR Creation Template" to customize the storage layout according to the DBA requirements for the new environment.
The second part of this provisioning is about the installation of Oracle Grid Infrastructure and the configuration of Oracle ASM with custom disk settings. The Oracle Home path can be set by default or modified by the user, in addition to the Oracle ASM instance password and the Oracle software version to install. This example offers the installation of an Oracle 18c or 19c version.
A similar template has been developed to install the Oracle Database Engine and offer similar
customization and choice, for example, Oracle Home Path of the Oracle DB installation
directory, Oracle DB Version to install.
Last but not least is the creation of an Oracle Database stand-alone instance. For each template we defined a set of input and output parameters (a usage sketch follows this list):
Input parameters define the customization you want to offer for the execution of each template. The list is not exhaustive, and you can easily extend or replace the parameters we defined to customize the Oracle software component installation.
Output parameters are the results of creation tasks performed during the execution of a template and can be used by other templates executed after it.
Additional parameters can be set to go further into the customization and fit your own requirements, constraints, and needs.
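The following is a minimal sketch of how such parameters are typically passed and chained (the variable and output names are assumptions, not the names used in the sample repository):
terraform apply -auto-approve \
  -var "oracle_version=19c" \
  -var "oracle_home=/u01/app/oracle/product/19.0.0/dbhome_1"
LPAR_IP=$(terraform output -raw lpar_ip)   # output parameter consumed by the next template
                                           # (-raw requires Terraform 0.14 or later)
echo "Next template will target ${LPAR_IP}"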
More details about Terraform and IBM Power Systems Virtual Server code is found at the
following website:
https://cloud.ibm.com/docs/terraform?topic=terraform-power-vsi
The following link illustrates an example where a new Oracle Database is created either on-premises or off-premises in IBM Power Systems Virtual Server. It shows the orchestration of those independent Terraform templates combined to create such a Database as a Service offering for the user. It also addresses the deployment of an application tier followed by its database tier:
https://ibm.biz/HybridOracleDBDeploymentonPOWER
can be used to update programs and configuration on hundreds of servers at once, but the
process remains the same.
Red Hat Ansible is open source software that is easy to install by way of yum, and it can be supported with a subscription from Red Hat. IBM has created an extensive set of Red Hat Ansible modules for the Power Systems user community, ranging from operating system management to cloud management and everything in between. You can use those modules to codify key maintenance and operational tasks for AIX and the software stack so that you can focus on business priorities.
A notable way to rapidly jumpstart your Red Hat Ansible project is Red Hat Ansible Galaxy, as shown in Figure 3-11. Galaxy is a no-cost site for finding, downloading, and sharing community-developed roles. You can explore the Community Ansible Collection for IBM Power Systems at the following website:
https://galaxy.ansible.com/ibm/power_aix
You can download the Supported Red Hat Ansible Collection for IBM Power Systems from
Automation Hub (Red Hat Ansible subscription required). Refer to the following websites:
https://access.redhat.com/articles/3642632
https://cloud.redhat.com/ansible/automation-hub/ibm/power_aix
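As a minimal sketch, the community collection can be installed from Galaxy as follows:
ansible-galaxy collection install ibm.power_aix
ansible-doc -l | grep ibm.power_aix   # list the modules provided by the collection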
The Red Hat Ansible experience is identical across POWER and x86 servers. The same steps can be repeated in the IBM Power Systems Virtual Server, in public cloud environments, and on-premises. Some Red Hat Ansible modules or playbooks can be platform or operating system specific. Clients can then develop their own playbooks to build a platform-agnostic management solution.
The following section illustrates an Oracle DBaaS on AIX and Power Systems using Red Hat
Ansible modules. It relies on the similar workflow used with the IBM Terraform and Service
Automation example as shown in Figure 3-12 on page 30.
This Red Hat Ansible playbook assumes the AIX LPAR has already been created. It can be extended with the creation of the AIX LPAR by way of an additional Red Hat Ansible module. Such a module is in charge of using either the OpenStack APIs with PowerVC or the Power Virtual Server APIs to create the AIX LPAR that hosts the Oracle Database.
The deployment of an Oracle Database has been decomposed into several Red Hat Ansible roles to allow flexibility and reuse of those independent roles in other playbooks (a sample invocation follows the list):
1. preconfig: This role performs AIX configuration tasks that are needed for Oracle
installation. It sets all AIX and Oracle best practices and prerequisites.
2. oracle_install: This role performs the Oracle binary installation.
3. oracle_createdb: This role creates the database instance using Oracle dbca utility and
custom parameters such as DB password, DB SID, and so on.
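The following is a minimal sketch of invoking a playbook that applies these roles (the inventory file, playbook name, and variable names are assumptions):
ansible-playbook -i inventories/oracle_aix.ini oracle_dbaas.yml \
  -e "ora_sid=ORCL ora_db_password=MyPassw0rd ora_version=19c"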
Figure 3-12 Oracle on AIX and Power Systems Red Hat Ansible Playbook example
Note: There is a playbook and roles project under development at the following link:
https://github.com/IBM/ansible-power-aix-oracle
You can re-use them as-is to start your first Oracle deployment on AIX using Red Hat Ansible and then customize and enhance them according to your own constraints and requirements.
The following document illustrates the deployment of an Oracle DBaaS with vRealize Automation, vRealize Orchestrator, and IBM PowerVC. It leverages most of the concepts and deployment logic introduced previously with the PowerVC and IBM Terraform and Service Automation examples:
Deploying Oracle DBaaS with vRealize Automation 7.2, vRealize Orchestrator and IBM
PowerVC:
https://www.ibm.com/support/pages/node/6355637
This chapter describes different methods to do a migration, lists things to consider, and shows
pros and cons of the different methods.
4.1 Motivation
Migrating a database appears to be a straightforward process, but when analyzed critically it
can become a complex undertaking when one or more of the following conditions exist:
The source database is very large.
The source and target operating systems are dissimilar.
The source and target systems are geographically remote.
The migration window time frame is limited or non-existent.
Downtime to perform test migrations is hard or impossible to allocate.
During migration you will need a downtime for the application. Depending on the size of the
database (the amount of data that has to be copied) and the available downtime window you
have to carefully plan the migration.
Do not forget that a migration does not only mean moving the database to the new platform. It
includes at least (but can also contain more than) the following steps:
Analyze the source environment.
Evaluate and select a migration method.
First test migration + application test.
Second test migration + application test.
Live migration.
Plan for ample preparation time, especially if several databases and departments are involved, because a migration is not a single task; it is rather a project.
Note: Key aspects are also rollback options and point of no return in the migration
process.
Endianness
Endianness is the order in which the processor stores and interprets multi-byte data. Systems are either big-endian or little-endian. Big-endian systems store the most significant byte of multi-byte data at the first memory address and the least significant byte at the last memory address. Little-endian systems store data the opposite way. See Figure 4-1.
Oracle gives you a list of which endianess is used for each platform as shown in Table 4-1 on
page 34.
The choice will always be a trade-off between effort (preparation, planning, performing the migration), tool cost (especially as logical replication methods are often costly), database size, available downtime, and available network speed.
You need to script all migration steps to be able to reproduce them (first test, second test, and live migration), especially when migrating a complex environment, a high number of databases, or critical systems.
For a hardware refresh while staying on the same platform you also have to plan your
approach.
Note: You must be aware that using Live Partition Mobility to move an LPAR to a new Power Systems server is technically feasible and simple. However, Oracle licensing terms state that when you use Live Partition Mobility, you need to buy Oracle licenses for all CPU cores in both the old and the new server. This means this method usually cannot be used for migration.
If any issues are revealed during the first test migration (error in scripts that had to be fixed,
errors in the migrated database during application testing, and so on) you fix the scripts and
go for the second test migration.
Depending on the migration method used, you may need a downtime for your source system
(this also applies to all following test migrations of course).
If any issues are revealed, consider repeating this step (3rd test migration) to avoid surprises
during live migration as much as possible.
The application needs to be shut down. After the migration you need to run basic or comprehensive functional application tests (depending on the allotted time) before taking the Go or No Go decision. Remember to start your first full backup at this point.
If not stated otherwise, you need downtime for the entire length of the process.
You will need to create an empty target database including all needed tablespaces. The imp utility imports just the application schema and data into the new database; SYS/SYSTEM level objects are not transferred.
You can configure exp to export only one table. This allows you to create scripts that can be run in parallel, so several exp/imp processes can run at the same time even though exp/imp has no direct implementation of parallelism. However, you either need good planning to take care of foreign key constraints, or you need to import without constraints and enable them after all tables are finished.
At the end of the imp process indexes will be created. You can skip this and create scripts
that start index creation (in parallel) as soon as each table is finished with data transfer.
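The following is a minimal sketch of per-table parallel export and import (credentials, schema, and table names are assumptions); each exp/imp pair runs as its own operating system process:
for t in CUSTOMERS ORDERS ORDER_ITEMS; do
  exp system/manager tables=APP.$t file=/stage/${t}.dmp log=/stage/exp_${t}.log &
done
wait
# On the target, import without constraints and indexes, which are handled separately:
for t in CUSTOMERS ORDERS ORDER_ITEMS; do
  imp system/manager fromuser=APP touser=APP tables=$t \
      file=/stage/${t}.dmp log=/stage/imp_${t}.log constraints=n indexes=n &
done
wait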
In summary, if you have small databases, this is an easy migration method that is quick to set up. For any reasonably sized database this approach will be too slow.
The expdp/impdp command line tools connect from the client over the network to the database. However, they just initiate the export or import; the data is written to a directory on the database server that is defined in the database. It is also possible to run the impdp command in such a way that the export is done directly over a database link between the target and source databases, which saves time and space during the migration. The transfer speed is considerably higher compared to exp/imp, and built-in parallelism is available. Only table data is transferred; indexes are recreated.
You will also need to create an empty target database including all needed tablespaces.
SYS/SYSTEM level objects will not be transferred.
With Oracle Data Pump you cannot use UNIX named pipes (FIFO) to run import while export
is still running. Data is written to a file system on the database server (into several files when
parallelism is used). You can however write data to an NFS file system which is exported from
the target database server. This means that no additional copying of the dump after export
has finished is needed, import can start immediately.
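The following is a minimal sketch of both variants (the directory object, database link, and credentials are assumptions):
# Variant 1: export to an NFS directory exported from the target server, then import:
expdp system/manager FULL=y DIRECTORY=dp_nfs DUMPFILE=exp_%U.dmp PARALLEL=4 LOGFILE=expdp.log
impdp system/manager FULL=y DIRECTORY=dp_nfs DUMPFILE=exp_%U.dmp PARALLEL=4 LOGFILE=impdp.log
# Variant 2: import directly over a database link; no dump files are written:
impdp system/manager FULL=y NETWORK_LINK=src_db_link PARALLEL=4 LOGFILE=impdp_nl.log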
As a summary, Oracle Data Pump can be used for medium databases when there is
adequate downtime with manageable effort.
There are certain data types (for example, LONG and LONG RAW) that cannot be queried using a database link. If such data types exist for a table, you can combine this method with an exp/imp or Oracle Data Pump transfer for those tables.
This approach can also be classified as a logical data transfer method.
This allows a reduced-downtime migration (only the replication has to be stopped and the application has to be reconfigured to access the target database). However, the tools are mostly quite expensive.
Note: Oracle Data Guard running as a Logical Standby cannot be used in this context as
to set it up you first have to create a Physical Standby which can afterwards be converted
into a Logical Standby. So the constraints of setting up an Oracle Data Guard mirror still
apply.
If source and target are the same or a combination listed here, Oracle Data Guard is a good
method to perform the migration.
An additional use of Oracle Data Guard is to obtain a copy of the database that can be used for test migrations. Because the source database must be down for migration, it can be hard to perform test migrations for databases that must be highly available. However, if you first create an Oracle Data Guard mirror for the migration, you can simply stop replication, activate the copy database, and perform the migration steps from there.
Note: The procedure used in this document creates a convert script (convert_mydb.rman)
which includes all datafiles, not only those which contain rollback segments. Edit the script
to remove surplus datafiles. Another optimization is to allocate more than one disk channel
so all datafiles are converted in parallel. This saves overall conversion time.
The SQL query to identify the datafiles with rollback segments is:
select distinct(file_name)
from dba_data_files a, dba_rollback_segs b
where a.tablespace_name=b.tablespace_name;
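The following is a minimal sketch of an edited convert run with more than one disk channel (the datafile names, platform name, and target format are assumptions and must match your environment):
rman target / <<'EOF'
RUN {
  ALLOCATE CHANNEL c1 DEVICE TYPE DISK;
  ALLOCATE CHANNEL c2 DEVICE TYPE DISK;
  CONVERT DATAFILE '/oradata/mydb/undotbs01.dbf', '/oradata/mydb/system01.dbf'
    FROM PLATFORM 'AIX-Based Systems (64-bit)'
    FORMAT '/target_stage/%N_%f.dbf';
}
EOF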
One optimization that can be done when the target database is using a file system (not ASM)
is to create a Data Guard mirror where the file system from the target server which finally
contains the datafiles is NFS-exported to the server running the Data Guard instance. Using
this approach, after the Data Guard mirror is in sync, all datafiles are already at their final
location. This eliminates the time needed to copy the datafiles from source to target.
https://www.oracle.com/technetwork/database/features/availability/maa-wp-11g-platf
ormmigrationtts-129269.pdf
The high-level steps necessary are similar to the ones described in 4.3.7, “Cross-Platform Transportable Tablespaces” on page 38.
Imagine having a 20 TB database and a downtime window of 2 hours. Even if all actions are scripted, this leaves a maximum of 80 minutes for conversion. This means that each second 20*1024/(60*80) = 4.3 GB must be read and 4.3 GB written, a total of 8.6 GB/s on average over the entire conversion time. It might be a challenge to achieve this.
The needed downtime mostly depends on the amount of data changed in the source database since the last incremental backup. If few changes were made, this step does not take much time, especially compared to a full conversion during the downtime window. The time needed for the incremental backups can be further reduced by using Block Change Tracking, which allows RMAN to read only the data blocks that have changed since the previous backup.
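The following is a minimal sketch of enabling Block Change Tracking (the tracking file path is an assumption):
sqlplus -s / as sysdba <<'EOF'
ALTER DATABASE ENABLE BLOCK CHANGE TRACKING
  USING FILE '/u01/app/oracle/bct/rman_change_track.f';
SELECT status, filename FROM v$block_change_tracking;
EOF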
Cross-Platform Incremental backup is not a complete migration method in itself, but can be
combined with other methods such as Transportable Database, Cross-Platform Transportable
Tablespace, or Full Transportable Export/Import.
Unfortunately, the scripts provided by Oracle (My Oracle Support, Doc ID 2471245.1) do not
support running against a Data Guard Standby database.1
1 But we found a way to enhance them to make this scenario possible by creating a version that will run certain
commands against the target primary database instead but gets the critical data from the source standby database
by way of a database link. Contact the Systems Lab Services Migration Factory for more details about this
approach: https://www.ibm.com/blogs/systems/tag/ibm-systems-lab-services-migration-factory/
1. Setup:
/stage will hold the storage for the backup files and shared configuration files
On source side create standby, wait for sync. Activate block change tracking.
On target create an empty database plus Data Guard mirror.
2. Prepare and Roll-Forward:
On source (standby), run “xttdriver.pl --backup” (script from My Oracle Support).
On target (primary + standby), run “xttdriver.pl --restore”.
Repeat this process any number of times to keep target as closely in sync with source as
possible.
3. Transport (Downtime necessary if final migration, or stop sync to standby for test
migration):
Source:
– Alter tablespaces to read only.
– xttdriver.pl --backup (last time).
• (FTEX): expdp system/xxx full=y transportable=always [version=12].
Target:
– xttdriver.pl --restore (last time).
– Drop tablespaces to prepare for import of datafiles.
• (FTEX): impdp system/xxx ... transport_datafiles=‘...‘,‘...‘ ...
Table 4-2 gives some suggestions on which migration methods can be recommended for
different scenarios, but there might be more things to consider based on each case specifics.
Scenario: Different platform / same endianness.
Recommended method: Transportable Database (TDB) (10.2+).
Supporting techniques: Source standby (on NFS from target if on file system); Incremental Backup; target standby.
Expected downtime: Less than one hour with enhancements.

Scenario: Different platform / different endianness.
Recommended method: Full Transportable Export/Import (FTEX) (12c+) / Cross Platform Transportable Tablespaces (XTTS) (10g+).
Supporting techniques: Source standby; Incremental Backup; target standby.
Expected downtime: Less than one hour with enhancements.
We usually do not prefer methods that are not mentioned in the table, but logical replication might be appropriate in cases where small downtimes are required (less than one hour) and other methods do not work.
To deal with these challenges and to make a structured approach to complex migration
projects, we have developed the Migration Factory Framework.
We have used the Migration Factory Framework (including some earlier versions)
successfully for several large migration projects, but it is still a work in progress which we
intend to enhance and improve further over time.
Menu program
The Menu program is the front-end part of the Migration Factory Framework, used by the
migrators to perform the steps in the migration workflow as it is defined in the Menu
configuration files. It is a python3 program with a simple character based user interface.
Each migration has a line in a database list file which contains information such as database
name, migration type (trial or live), which menu file to use (for example, which workflow), and
some more. When starting the program, the user is asked to choose a database to be
migrated, and if the database is found in the database list file, the corresponding line is locked
exclusively for this user.
If another user tries to work on the same database while the line is locked, the Menu program does not allow it.
The user is then shown a section from the workflow with the next available step marked with
an arrow and a few choices, including to execute the next step. Refer to Example 4-1.
If a step is executed successfully, the arrow advances to the next step in the workflow.
If a step fails, this is indicated by a crossed out arrow, and the workflow does not advance.
You now have to fix the problem and then resubmit the failed step. (Failed steps are, however, the exception rather than the rule.)
The steps in the workflow can also be grouped into larger units, called tasks. In the database
list file, each task can be associated with a specific date and time when this task needs to be
executed. This does not happen automatically, but when the user starts a new task in the
Menu program, the date and time are checked, and the user has to explicitly agree if the task
needs to be executed ahead of time. This is useful for the beginning of the downtime, for
example, or when the workflow has to pause to allow for some manipulation outside the
workflow.
Each task also has a task ID associated to it which has to be entered by the user before a
new task starts. This can be used, for example, to synchronize the workflow to a change
management system. It also prevents the user from entering a new task accidentally by just
quickly progressing through the menu (although this is strongly discouraged, since the user is
expected to read through the output of each step carefully to spot any errors not caught by the
logic in the playbook).
The user can also decide to quit the menu after each step. In this case, the current status of
the migration is stored in a status file and the migration is unlocked in the database list.
Another user (or the same user) can later start the Menu program and, after choosing the
database, automatically continue at the same step where the last user quit the menu.
Users can also work at different migrations at the same time by starting multiple copies of the
Menu program in different terminal windows and choosing different databases from the
database list file.
Menu configurations
The Menu program workflows are fully configurable. You can define the menu structure,
including submenus, tasks, and the individual steps as shown in Example 4-2 on page 45.
For each step, there is a corresponding Red Hat Ansible playbook that gets started when the
step is executed. We have sample workflows for some migration methods available and we
plan to create a more comprehensive collection.
:t:A:
:m:install "Setup source reference and target":
{
:a:a:xtt_setup "Setup scripts for xtt":
}
:t:B:
:m:prep_target "Prepare target database"
{
:a:a:ftex_drop_ts_target "Drop tablespaces in target database"
}
:t:C:
:m:increment "Incremental backup and restore"
{
:a:a:xtt_backup_reference "Take incremental backup from reference node"
:a:a:xtt_transfer "Transfer backup info to target"
:a:a:xtt_restore "Apply incremental backup to target"
}
:t:D:
:m:conv2snap "Convert source standby to snapshot standby"
{
:aT:a:dg_convert_snapshot "Convert source standby to snapshot standby"
}
:t:E:!:
:m:ftex "Perform FTEX migration"
{
:aL:a:ftex_ts_ro_primary "Set tablespaces to read only on primary"
:aT:a:ftex_ts_ro_reference "Set tablespaces to read only on reference node"
:a:a:xtt_backup_reference "Take final backup from reference node"
:a:a:xtt_transfer "Transfer backup info to target"
:a:a:xtt_restore "Apply final incremental backup to target"
:a:a:ftex_drop_ts_target "Drop tablespaces in target database"
:a:a:ftex_apply "Perform FTEX migration"
}
:t:F:
:m:conv2phys "Convert source standby to physical standby"
{
:aT:a:dg_convert_physical "Convert source standby to physical standby"
}
Example 4-2 on page 45 shows the menu configuration file for the workflow of the “Combined
Method” discussed in 4.4, “Combined Method for optimized migration” on page 40.
When a playbook is run through the menu program, all output is displayed on the panel and is
checked by the user for errors that have occurred but not been caught by the logic of the
playbook. For later reference, and as a documentation aid, all output is also collected in log
files in the respective project directory.
Helper programs
There are some helper programs available to assist in the preparation of the inventory file for
the Red Hat Ansible playbooks. When a Red Hat Ansible playbook runs, it needs an inventory
file which contains the hostnames of the machines used in the migration and many variables
needed by the playbooks. The helper programs will guide you through the process of creating
the inventory file by asking some basic questions about the migration and then provide you
with a template or sample inventory file that might need some further adaptation to your
specific environment.
The simplest install is a stand-alone Oracle install with JFS2 file systems. ASM requires the installation of Grid Infrastructure, which has the additional overhead of an ASM instance. When correctly configured, the performance difference between JFS2 and ASM is negligible.
Implementing Oracle is a team effort with tasks for the DBA and the Systems, Storage and
Network administrators. Getting the pre-requisites correct is key to a successful install.
5.2 Firmware
Your firmware must be as up-to-date as possible.
Check the current FW level using the prtconf | grep "Firmware" AIX command from
your LPAR.
Example output:
# prtconf | grep "Firmware"
Platform Firmware level: VL940_027
Firmware Version: IBM,FW940.00 (VL940_027)
The latest firmware level can be downloaded from: https://www-945.ibm.com/support/fixcentral
5.3 AIX
The 19c database software is certified against AIX 7.1 and AIX 7.2. The minimum levels required are:
AIX 7.1 TL5 SP1.
AIX 7.2 TL2 SP1.
Note: Unlike older versions, Oracle 19c is not certified on AIX 6.1.
For best performance and reliability, it is recommended to install the latest server firmware and the latest AIX TL and SP levels.
The Oracle installation documentation covers the majority of tasks required to install Grid
Infrastructure and the Oracle Database software and is at the following website:
https://docs.oracle.com/en/database/oracle/oracle-database/19/install-and-upgrade.
html
The Oracle documentation does not cover all of the best practices. Some steps are omitted there, and we cover them in this publication for completeness or to highlight their importance.
One commonly used tool is the TightVNC server. The TightVNC server is not installed on AIX by default, but you can find it as part of the AIX Toolbox for Linux Applications.
The AIX Toolbox for Linux Applications can be downloaded at no cost from the following link:
https://www.ibm.com/support/pages/aix-toolbox-linux-applications-overview
You will also need to install the TightVNC viewer or another VNC viewer on your desktop to
access the session.
Connect as the oracle or grid user and start the VNC server. You are prompted for a
password if this is the first time VNC server has been used.
Export the DISPLAY variable with the value indicated by the vncserver.
Run the command xhost + to allow external connections to the vncserver.
Connect to the vncserver using the vncviewer you have installed using the same host and
port that the vncserver is running on and enter the password you have set.
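As a minimal sketch, the sequence looks as follows (the host name and display number are assumptions):
su - oracle
vncserver                        # note the display number returned, for example :1
export DISPLAY=myaixhost:1
xhost +                          # allow external connections to the X server
# Connect with a VNC viewer to myaixhost:1 and enter the VNC password you set.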
The unzip utility needs to be version 6. You cannot work around this with an older version of unzip because it cannot handle the file size. You cannot use the JRE to do the unzip because the permissions on the files will be wrong and the runInstaller will not work.
Unzip is also included in the AIX Toolbox for Linux Applications. If you do not want to install
the full toolbox, you can download the rpm for unzip 6 from the following page:
https://www.ibm.com/support/pages/aix-toolbox-linux-applications-overview
The latest version at the time this publication was written is unzip-6.0-3.aix6.1.ppc.rpm.
Ulimit values need to be set for the root, grid, and oracle users. Without the correct values, even copying the binaries onto the file system can fail because the zip file size exceeds the default ulimit maximum file size. Setting the values to unlimited helps to avoid any issues:
chuser threads='-1' nproc='-1' fsize='-1' data='-1' rss='-1' nofiles='-1' root
By default, the input/output completion port (IOCP) is set to Defined. To enable IOCP, set
IOCP to Available and perform the following commands as root:
mkdev -l iocp0
chdev -l iocp0 -P -a autoconfig='available'
Verify the settings with the lsdev command to confirm the IOCP status is set to Available:
# lsdev | grep iocp
iocp0 Available I/O Completion Ports
Note: If IOCP was not previously defined, an AIX reboot is mandatory. Without this setting the Oracle Database installer fails. The Grid Infrastructure installer does not check the status of IOCP, but your ASM disks will not be available.
Note: You cannot apply updates that have a lower build sequence identifier.
If you update AIX, it is recommended to relink the Oracle Home binaries. This is done as the oracle user with the command relink all.
The latest AIX Technology Level (TL) or Service Pack (SP) can be downloaded from IBM Fix
Central at the following website:
https://www.ibm.com/support/fixcentral/
5.4.1 Configure LPAR profile for Shared Processor LPAR performance (best
practice)
If you have shared processors, correctly set and adjust the entitled capacity (EC) and virtual processor (VP) settings (rule of thumb: up to a 30% gap between the EC and VP settings) to mitigate CPU folding activity.
You can verify the current entitled capacity and virtual processor values from the lparstat output.
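As a minimal sketch, the relevant values can be displayed as follows (the exact field labels can vary slightly between AIX levels):
lparstat -i | grep -E "Entitled Capacity|Online Virtual CPUs"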
5.4.2 Check that Power Saver Mode is set to Maximum Performance Mode
(best practice)
This is done by way of the HMC (from the Advanced System Management (ASM) menu or the Command Line Interface (CLI)). Maximum Performance Mode is the default on models from the S924 to the E980.
To access the Advanced System Management menu from the HMC select the server and
navigate to:
Example output:
lspwrmgmt -m SERVER1 -r sys | cut -d, -f4,5
curr_power_saver_mode=Enabled,curr_power_saver_mode_type=max_perf
2 is the highest level of protection, 0 is the lowest despite the misleading name.
The Speculative execution fully enabled option is described in the documentation as follows:
This optional mode is designed for systems where the hypervisor, operating system, and
applications can be fully trusted. Enabling this option can expose the system to
CVE-2017-5753, CVE-2017- 5715, and CVE-2017-5754. This includes any partitions that
are migrated (using Live Partition Mobility) to this system. This option has the least
possible impact on the performance at the cost of possible exposure to both User
accessible data and System data.
If your whole environment is already protected against Spectre & Meltdown vulnerabilities,
then you can disable the security on the POWER9 frame by way of the ASM interface from
the HMC.
Removing the overhead of this protection can reduce execution time by as much as 6%.
Note: This change is only possible when the server is powered off.
Example output:
System configuration: type=Shared mode=Uncapped smt=8 lcpu=16 mem=16384MB
psize=16 ent=1.50
Example output:
Processor Implementation Mode: POWER 9
Note: POWER 9 mode is only possible with AIX 7.2 and not with AIX 7.1.
If at some time you need to use live partition mobility to a POWER8 server, then remain on
POWER8 mode.
Core and memory affinity map must be as close as possible for optimal performance. Check it
with the command lssrad -va from the AIX instance running on your LPAR. This reports the
logical map view.
The output of the command lssrad -va is shown in Figure 5-2 on page 54.
The left side of Figure 5-2 shows CPUs and memory that are not aligned at all. In this case, consider shutting down the frame and restarting the LPARs, starting with the biggest, to align the CPU and memory allocation. The result then looks more like the right side of Figure 5-2.
Tip: With shared processor systems running RAC, it is strongly suggested to set the vpm_xvcpus parameter from schedo to 2 to avoid RAC node evictions under light database workload conditions (schedo -p -o vpm_xvcpus=2).
5.5 Memory
The install document suggests that the minimum RAM for a database installation is 1 GB but
2 GB is recommended unless you are using Grid Infrastructure, then 8 GB is recommended.
From the testing that we have performed, it is clear that the database software can be installed with a small amount of memory, but dbca fails with less than 8 GB of RAM (plus swap space). We therefore recommend 8 GB of RAM as a minimum without Grid Infrastructure and 24 GB with it.
In the environment we created to validate this document, we allocated 8 GB but clearly the
key factor in defining the memory for a partition is the memory required for the SGA and PGA
or memory target of the database instance for the workload that you will be running.
It is not sufficient to add the addresses to /etc/hosts. Oracle needs to be able to detect them
using nslookup. You need to add the DNS details in the resolv.conf file on your server.
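As a minimal sketch (the name server addresses, domain, and host name are assumptions):
cat >> /etc/resolv.conf <<'EOF'
nameserver 10.1.1.10
nameserver 10.1.1.11
domain example.com
EOF
nslookup myaixhost.example.com   # Oracle must be able to resolve the host through DNS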
Set rfc1323=1, either system wide with the no command or on a per-interface basis with the chdev command. Check the setting with the command no -Fa | grep rfc1323 (if system wide) or lsattr -El <interface name> -a rfc1323.
Increase the TCP socket buffer space for receiving and for sending from the 16 K default to 256 K (tcp_recvspace=262144 and tcp_sendspace=262144), either system wide or on a per-interface basis. Check with the command no -Fa | egrep "tcp_recvspace|tcp_sendspace" (if system wide) or lsattr -El <interface name> -a tcp_recvspace -a tcp_sendspace.
Enable Ethernet flow control (flow_ctrl=yes) to prevent retransmissions that result in network congestion. Use the command lsattr -El <interface name> -a flow_ctrl to check it.
Note: To take advantage of the flow control efficiency as much as possible, it must be
enabled on all network components including the network switch.
Enable jumbo frames (jumbo_frame=yes) on a per-interface basis. Check with the command lsattr -El <interface name> -a jumbo_frame.
Note: The network switch must be jumbo frame capable with jumbo frame support
enabled.
Virtual Ethernet
Enable largesend by setting mtu_bypass=on on a per-interface basis. Check with the command lsattr -El <interface name> -a mtu_bypass.
Set the following buffer parameters to 4096 using the chdev command on the Virtual Ethernet adapter (inherited from the SEA):
min_buf_tiny, max_buf_tiny, min_buf_small, and max_buf_small
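The following is a minimal sketch of applying these settings (the interface and adapter names are assumptions; verify the exact attribute names on your adapters with lsattr -El):
no -p -o rfc1323=1 -o tcp_recvspace=262144 -o tcp_sendspace=262144         # system wide
chdev -l en0 -a rfc1323=1 -a tcp_recvspace=262144 -a tcp_sendspace=262144  # or per interface
chdev -l ent0 -a flow_ctrl=yes -P          # physical adapter, applied at the next reboot
chdev -l ent0 -a jumbo_frame=yes -P
chdev -l en0 -a mtu_bypass=on              # largesend on the virtual Ethernet interface
chdev -l ent1 -a min_buf_tiny=4096 -a max_buf_tiny=4096 \
      -a min_buf_small=4096 -a max_buf_small=4096 -P   # virtual Ethernet buffers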
The ntpd daemon maintains the system time of day in synchronism with Internet standard
time servers by exchanging messages with one or more configured servers at designated poll
intervals.
Under ordinary conditions, ntpd adjusts the clock in small steps so that the timescale is
effectively continuous and without discontinuities. Under conditions of extreme network
congestion, the roundtrip delay jitter can exceed three seconds and the synchronization
distance, which is equal to one-half the roundtrip delay plus error budget terms, can become
very large. The ntpd algorithms discard sample offsets exceeding 128 ms, unless the interval
during which no sample offset is less than 128 ms exceeds 900s. The first sample after that,
no matter what the offset, steps the clock to the indicated time. In practice this reduces the
false alarm rate where the clock is stepped in error to a vanishingly low incidence.
As a result of this behavior, after the clock has been set, the offset rarely strays more than 128 ms, even under extreme cases of network path congestion. Sometimes, in particular when ntpd is first started, the error might exceed 128 ms. With RAC, this behavior is unacceptable. If the -x
option is included on the command line, the clock will never be stepped and only slew
corrections will be used.
Note: The -x option ensures that time is not set back, which is the key issue.
The -x option sets the threshold to 600 s, which is well within the accuracy window to set the
clock manually.
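The following is a minimal sketch of one common way to add the -x option to the AIX NTP daemon (you can also edit the xntpd entry in /etc/rc.tcpip):
chssys -s xntpd -a "-x"              # add -x to the subsystem arguments
stopsrc -s xntpd; startsrc -s xntpd  # restart the daemon
lssrc -s xntpd                       # confirm that the subsystem is active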
After the cluster is installed, you can ignore the following message in one of the alertORA.log
of the cluster:
[ctssd(2687544)]CRS-2409:The clock on host rac1 is not synchronous with the mean
cluster time. No action has been taken as the Cluster Time Synchronization Service
is running in observer mode.
Note: RAC also has its own time sync method, but this is overridden by NTPD.
The following groups are optional and can be used for different roles if you want to limit
access:
mkgroup -A id=54324 backupdba #(for backup and restore)
mkgroup -A id=54325 dgdba #(for dataguard)
mkgroup -A id=54326 kmdba #(for encryption key management)
The following group is new with 19c and is for RAC administration:
mkgroup -A id=54330 racdba
For a stand-alone database with ASM create the grid user with the following groups:
mkuser id='54322' pgrp='oinstall' groups=dba,asmadmin,asmdba,oper,asmoper
home='/home/grid' grid
For RAC create the grid user with the following groups:
mkuser id='54322' pgrp='oinstall'
groups=dba,asmadmin,asmdba,racdba,oper,asmoper home='/home/grid' grid
The following directories are those suggested by Oracle in the install documents:
mkdir -p /u01/app/19.0.0/grid
mkdir -p /u01/app/grid
mkdir -p /u01/app/oracle
mkdir -p /u01/app/oraInventory
mkdir -p /u01/app/oracle/product/19.0.0/dbhome_1
chown -R grid:oinstall /u01
chown oracle:oinstall /u01/app/oracle
chmod -R 775 /u01/
Ulimit values need to be set for the root, grid, and oracle users. Setting the values to unlimited avoids any issues:
chuser threads='-1' nproc='-1' fsize='-1' data='-1' rss='-1' nofiles='-1'
oracle
chuser threads='-1' nproc='-1' fsize='-1' data='-1' rss='-1' nofiles='-1' grid
If the user is going to be the owner of the listener, then check the section on 64K page
promotion.
5.8.1 Turn off Oracle Online Patching in your environment if not required (best
practice)
Update your oracle AIX user .profile file with the following:
export MPROTECT_TXT=OFF
This setting prevents CPU usage from skyrocketing in the case of memory page reclaims under certain circumstances.
5.9 Storage
This topic is covered in more detail in Chapter 8, “Oracle Database and storage” on
page 109.
Note: Normal practice is to use external redundancy on AIX because the redundancy is
already provided by the storage.
Set the AU size to at least 4 MB; the default of 1 MB is too small (valid values are 1, 2, 4, 8, 16, 32, or 64 MB). The ASM Allocation Unit size needs to be a multiple of the underlying storage block size, and a larger AU size is recommended for big LUN sizes. Moreover, larger AU sizes provide performance advantages for data warehouse workloads.
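The following is a minimal sketch of creating a disk group with a 4 MB allocation unit (the disk group name and disk paths are assumptions); run it as the grid user against the ASM instance:
sqlplus -s / as sysasm <<'EOF'
CREATE DISKGROUP DATA EXTERNAL REDUNDANCY
  DISK '/dev/rhdisk2', '/dev/rhdisk3'
  ATTRIBUTE 'au_size' = '4M', 'compatible.asm' = '19.0';
EOF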
5.9.2 Set the Queue depth to allow enough I/O throughput (best practice)
To check the current queue_depth value for a hdisk device, use the command:
# lsattr -El hdisk1 -a queue_depth
To check the range of settings allowed by the driver, run the command:
# lsattr -Rl hdisk1 -a queue_depth
1...256 (+1)
To set a new value with your own device and value, use the command:
chdev -l hdisk1 -a queue_depth=256 –P
The value of max_transfer needs to be set to a minimum of 0x100000. Its value can be checked using the lsattr command and can be set using the chdev command in the same way.
The algorithm attribute is set to shortest_queue, again using the chdev and lsattr commands.
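As a minimal sketch, the three attributes can be checked and set together (hdisk1 is an assumption; -P applies the change at the next reboot or device reconfiguration):
lsattr -El hdisk1 -a queue_depth -a max_transfer -a algorithm
chdev -l hdisk1 -a queue_depth=256 -a max_transfer=0x100000 -a algorithm=shortest_queue -P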
Example output:
# lsfs -q /u01
Name Nodename Mount Pt VFS Size Options Auto
Accounting
/dev/oralv -- /u01 jfs2 1002438656 noatime,rw yes no
(lv size: 1002438656, fs size: 1002438656, block size: 4096, sparse files:
yes, inline log: yes, inline log size: 512, EAformat: v1, Quota: no, DMAPI: no,
VIX: yes, EFS: no, ISNAPSHOT: no, MAXEXT: 0, MountGuard: no)
For the Oracle Database, set multiple I/O queues by configuring several LUNs to improve I/O
flow particularly if the queue depth value is low.
Spread all logical volumes across the LUN set (-e'x' flag) and use the noatime option and inline JFS2 logging. For each Logical Volume (LV):
mklv -y'<LV name>' -t'jfs2' -e'x' <VG name> <LV size in PP unit>
Check LV spreading with the command lslv <LV name> | grep "INTER-POLICY", which must report "maximum".
Sample output:
INTER-POLICY: maximum RELOCATABLE: yes
For datafiles: crfs -v jfs2 -d'<LV name>' -m'<JFS2 mount point>' -A'yes' -p'rw' -a options='noatime' -a agblksize='4096' -a logname='INLINE' -a isnapshot='no'.
For redolog and control files: crfs -v jfs2 -d'<LV name>' -m'<JFS2 mount point>'
-A'yes' -p'rw' -a options='noatime' -a agblksize='512' -a logname='INLINE' -a
isnapshot='no'.
Example output:
# lsfs -q /oracle
Name Nodename Mount Pt VFS Size Options Auto
Accounting
/dev/oralv -- /oracle jfs2 1002438656
noatime,cio,rw yes no
(lv size: 1002438656, fs size: 1002438656, block size: 4096, sparse files:
yes, inline log: yes, inline log size: 512, EAformat: v1, Quota: no, DMAPI: no,
VIX: yes, EFS: no, ISNAPSHOT: no, MAXEXT: 0, MountGuard: no)
hdisk2 is available to use for ASM. You see that the Physical Volume Identifier (PVID) and
volume group are both set to none. If this is not the case ASM cannot see the disk.
If the PVID is not showing as none reset it with the following command:
chdev -l hdisk2 -a pv=clear
The permissions on the rhdisk devices are set to 660, and they are owned by the grid user and the asmadmin group. If you need to rename the devices, you can do so using the rendev command.
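The following is a minimal sketch of preparing a disk for ASM (the device names are assumptions):
chdev -l hdisk2 -a pv=clear        # clear the PVID if one is set
chown grid:asmadmin /dev/rhdisk2   # character device used by ASM
chmod 660 /dev/rhdisk2
rendev -l hdisk2 -n hdiskasm01     # optional: rename the device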
Installer (mandatory)
The grid infrastructure installer requests 79 GB of free space to install GI but actually only
needs less than 5 GB at this stage because the binaries are already unpacked on the disk.
If ASM fails to see your disks, check the steps in 5.3.3, “Operating system changes (mandatory)” on page 50, because incorrect permissions and ownership can prevent the installer from seeing the disks.
The runfixup.sh script fixes any minor configuration issues related to the prerequisites, but it does not apply the recommended best practices.
During the Grid Infrastructure Install process you can ignore the space requirement and the
swap space requirement.
If you are running AIX 7.2 TL 2 SP1, the documentation also recommends installing APAR:
IJ06143.
The following package is also required for 19c but is not in the documentation:
xlfrte.aix61 15.1.0.9
If you do not install this package, you get the error PRVF-7532: Package "xlfrte.aix61" is missing.
When you click FC (Fix Central), you are required to log in using your IBM id. Scroll down to find the download link as shown in Figure 5-5 on page 64.
Run the smitty command to confirm that the following filesets have been installed:
File:
I:xlfrte 15.1.0.0
S:xlfrte 15.1.0.10
I:xlfrte.aix61 15.1.0.0
S:xlfrte.aix61 15.1.0.10
Note: This document does not cover the requirements for NLS language settings.
The documentation states the minimum as around 11 GB. You need more space for staging the zip file, and with the log files that are created in the Oracle Home, you need around 15 GB as a minimum to have a usable environment without any patching. In our test install, the final size of the installation alone was 12.6 GB without the zip file or any database logs or trace files.
Setting the user and group IDs to be the same on all servers helps if you need to transfer files or data between servers.
Create the file system where the Oracle Home will be created.
Set the permissions on the file system so that the Oracle user has read, write and execute
permission.
For example:
chown oracle.oinstall /u01
chmod 755 /u01
You require three directories for a stand-alone install: one for the inventory, one for the ORACLE_BASE, and one for the ORACLE_HOME. The ORACLE_HOME used to be a subdirectory of the ORACLE_BASE; this is no longer the case.
In this scenario, we create a file system /u01 for all the Oracle files.
/u01/app/oraInventory was created for the inventory files.
/u01/app/oracle was created for the Oracle Base.
/u01/app/product was created for the ORACLE_HOME to be created as
/u01/app/product/19c.
From this directory launch the unzip command for the zip file as the oracle user.
As the root user, run the rootpre.sh script which is found in the ORACLE_HOME/clone
directory.
For ease of access for the runInstaller, the VNCserver is launched as the user who will be the
owner of the Oracle binaries and from the directory that will become the ORACLE HOME.
You can check that the display is working correctly by using xclock. If the clock appears
cancel it and you are ready to run the runInstaller.
You can install Standard Edition but you are limited to using 2 sockets and just 16 threads.
At this stage the installer checks the pre-requisites and configuration. There are no APARs
listed that are missing.
The installer warned us that we did not have sufficient swap space, but we chose to ignore
the warning → next.
The summary installation window prompts us to click Install → Install.
The next action required is to run the orainstRoot.sh and the root.sh scripts as root.
On completion click OK.
The finish window is displayed with the message: The registration of Oracle Database was
successful.
Click Close to finish.
For this scenario, we created the /oradata file system owned by the oracle user for the
datafiles and redo logs.
This does not follow best practices; it is just for demonstration purposes. Your redo logs should be on a separate file system with a block size of 512 bytes. The datafiles need to be in a file system mounted with the cio option, or alternatively you can set the parameter filesystemio_options to setall. Further details are given in the best practices documents linked to in the reference section of this document.
1. Accept the default "Create a database" → Next.
2. Choose the advanced option → Next.
3. Choose Oracle Single Instance database and the template "General Purpose or
Transaction Processing". This will be fine for the majority of non data warehouse
environments → Next.
4. Choose your Oracle SID → Next.
5. Choose the database storage attributes option and change the database file location away from the Oracle base directory to the mount point chosen for your datafiles (we often use /oradata, for example) → Next.
6. If you choose to define a Fast Recovery Area it needs to be big enough to contain your
RMAN backup files and archive log files. If you do not define it you can create your backup
location later → Next.
7. The Specify Network Configuration Details page allows you to define a listener. Upon
starting the database will be picked up by the default listener so we do not need to
configure this section → Next.
8. Unless specifically required Oracle Database Vault and Oracle Label Security are not
needed → Next.
The next page has several tabs for defining memory, character sets, connection mode and sample schemas. Unless you have specific requirements, you can ignore all of the tabs except memory.
Note that automatic memory management is once again the default (unlike 18c). If there are other databases running in the same partition, they are not taken into account in the proposed memory size. The PGA rarely extends to its full allocation, but as a general rule the combined PGA and SGA of all databases on a partition must not exceed 70% of the available memory on the server or partition.
9. We do not recommend the use of Memory Target. It is a useful tool if you do not know the memory size required, but assigning SGA and PGA targets avoids memory resize operations that can impact performance → Next.
10.By default the database is configured to include Enterprise Manager Database Express; deselect this option if you do not want it to be installed → Next.
11.The Database User Credentials page allows you to set SYS and SYSTEM passwords. If
this is a demonstration or test environment you can set a simple password, Oracle warns
that it does not match the expected standard → Next.
12.Accept the default option of Create database. Click the All Initialization Parameters to
change the database parameters that are written to the spfile.
13.For the initialization parameters, click Show advanced parameters and set filesystemio_options to setall. Select the option to include it in the spfile, and close.
14.You will need to modify the redo log definition in the Customize Storage window.
15.The default size of the redo logs is 200 MB, which you need to increase for log groups 1, 2 and 3. The recommendation from Oracle is to switch redo logs about three times each hour. This avoids waits caused by too many switches, which have a performance impact, but it may be too infrequent for some critical workloads. In most of our benchmark testing we set a redo log size of between 1 GB and 10 GB, depending on the intensity of the workload.
16.The default location for the redo logs is in the same place as the datafiles, change this if
you have created a separate file system as recommended in the best practices
documentation.
17.Click Apply each time you change a value or your changes are lost. After all redo log
changes have been applied click OK.
18.Click Finish to start the database creation.
After the creation completes, you are presented with another opportunity to change Password Management options, or you can just close the DBCA utility.
For memory allocation, we recommend setting SGA and PGA targets. Setting memory
targets can result in frequent memory resize operations which degrade performance.
As a general rule the total memory taken by the SGA and PGA must not exceed 70% of the
memory allocated on the machine to allow enough memory for the Oracle processes and
operating system.
DISCLAIMERS:
In this scenario we describe the experience in our test environment. The procedures
described in this chapter are not intended to replace any official product documentation;
official product documentation must always be followed.
The test environment we use provides just an example for specific tasks that have to be
performed for deploying and verifying the configuration. Due to the flexibility of the IBM
Power Systems platform, your environment can differ significantly from our test
environment.
Your platform design needs to consider all recommendations for functionality,
performance, availability and disaster recovery provided by the products guidelines.
1
IBM Spectrum Scale (formerly known as General Parallel File System) provides the shared storage infrastructure required for Oracle
RAC deployments.
The choice of deploying Oracle RAC on IBM POWER9 with AIX and IBM Spectrum Scale
provides a strong combination of performance, scalability, and availability for your mission
critical databases and applications. The combination (Oracle RAC, Power, AIX and Spectrum
Scale) goes back a long way, starting with Oracle RAC 9i3 to the most recent announcement
for Oracle RAC 19c4.
For additional information about IBM POWER9 systems see also the following IBM
Redbooks:
https://www.redbooks.ibm.com/Redbooks.nsf/pages/power9?Open
6.2.1 CPU
POWER9 provides advanced processor capabilities, of which we mention the following (the list is not exhaustive) that we consider the most relevant for our deployment:
High frequency super scalar RISC (Reduced Instruction Set Computing) architecture.
Parallel execution of multiple threads (8-way Simultaneous Multi-Threading - SMT).
Optimized execution pipeline and integrated accelerators.
Highly granular allocation capabilities (0.01 processor increments).
Dynamic CPU sparing.
6.2.2 Memory
Memory management for POWER9 servers includes features such as:
Dynamic memory allocation (add/remove) - DLPAR memory - This is the traditional memory management feature based on PowerVM and Reliable Scalable Cluster Technology (RSCT).
2
Availability is a combination of hardware and software working in synergy for providing the uptime required by your
applications’ service level agreement. POWER9 platform combined with the AIX operating system and
management software can provide “five nines” uptime (99.999% availability), see:
https://www.ibm.com/it-infrastructure/us-en/resources/power/five-nines-power9.
3
For latest support matrix see: https://www.oracle.com/database/technologies/tech-generic-unix-new.html
4 See http://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/FLASH10907
Active Memory Expansion (a priced feature), which provides in-memory data compression and thereby expands the effective memory capacity of your system. This feature relies on the on-chip compression capability of the Power processor.
Active Memory Sharing, which provides a physical pool of memory shared among a group of partitions. This feature is implemented in PowerVM.
6.2.3 Networking
IBM PowerVM provides advanced networking virtualization options for LPARs, such as:
Virtual Ethernet using Shared Ethernet Adapter in the VIO server for external network
access.
Dedicated physical Ethernet adapters (standard Ethernet or RDMA capable).
SR-IOV logical ports (either dedicated or shared).
Virtual Network Interface Card (vNIC) client using shared SR-IOV logical ports in the VIO
Server.
Availability for LPAR access to the external network can be provided by various techniques and combinations thereof:
Redundant VIO Server configuration with SEA failover (with optional dual Ethernet adapters and link aggregation in each VIO Server).
Dual Ethernet adapters (physical, virtual, SR-IOV, or vNIC) configuration in LPAR with
network interface backup configuration in AIX.
Redundant VIO Servers with vNIC failover.
– 24 GB of RAM.
– AIX 7.2 Technology Level 4, Service Pack 2.
Storage access:
– Virtual Fibre Channel (NPIV) protected by AIX multipathing with dual VIO Servers
configuration and redundant SAN fabrics.
Storage volumes
– Internal (non-shared): two LUNs (100 GB each); Internal storage is used for AIX rootvg
and for Oracle binaries5 (grid and database).
– Shared:
• Three 10GB LUNs for tiebreaker disks (Spectrum Scale cluster management).
• Eight 50GB LUNs for data (for shared Oracle storage).
Networking (see Example 6-1):
– Public network: - Virtual I/O Ethernet protected by SEA failover, identified on both
nodes as ent0.
– Private network: vNIC adapter protected by vNIC failover, identified on both nodes as
ent2.
A high level diagram for our test environment is shown in Figure 6-1. Node1 and Node2 are
LPARs in two distinct IBM POWER9 servers.
5
Oracle Database binaries can be deployed also on shared storage, but for availability during upgrades, we decided
not to use shared storage for database binaries.
Private LAN uses virtual Network Interface Card (vNIC), protected with dual VIO Servers
and vNIC failover. Configuration diagram for node1 is shown in Figure 6-5 (node2 uses a
similar configuration). For additional information and recommendations see the following
article on IBM Support for vNIC configuration. The private LAN is configured as ent2 on
both nodes.
The two node cluster configuration starts by installing the base operating systems and
required packages for deploying Oracle RAC solution (Oracle Clusterware and Oracle
Database).
The AIX installation (AIX 7.2 TL4 SP2) procedure is beyond the scope of this document. Use the installation method of your choice to deploy the operating system and additional packages required for your environment. A description of the required AIX operating system level and additional packages can be found in the following Oracle documents:
Grid Infrastructure Installation and Upgrade Guide - 19c for IBM AIX on POWER Systems
(64-Bit), E96274-03.
Oracle Database Installation Guide - 19c for IBM AIX on POWER Systems (64-bit),
E96437-04.
For the latest version for these documents always check the following URL:
https://docs.oracle.com/en/database/oracle/oracle-database/19/install-and-upgra
de.html
While we can execute all commands on each node individually, we can, optionally, use the distributed systems management toolkit for AIX, which allows us to run commands from a single node on multiple nodes (in our case, two nodes).
The packages used for this purpose are named dsm.core and dsm.dsh. These provide the
tools to execute commands and copy files on multiple nodes using a single point of control
(one of the nodes in the cluster).
Tip: The distributed systems management tools need to be installed on only one of the nodes in your cluster (the control node).
Example 6-2 shows the distributed systems management configuration. We use the dsh command to execute commands on both nodes in the cluster. Because the remote command uses /usr/bin/ssh, ssh must be configured to execute commands on all nodes without prompting the user for a password. Make sure your nodes are configured for password-less remote ssh command execution.
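As a minimal sketch, a command can be run on both nodes as follows (the node names are assumptions):
dsh -n node1,node2 "oslevel -s"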
Important: Password-less remote commands execution with secure shell (ssh) is also
required for Spectrum Scale configuration, and for Oracle RAC installation and
configuration.
Per the Oracle Grid installation instructions, we have also configured the ssh server with:
LoginGraceTime 0
Example 6-5 shows the real memory we have configured on our nodes.
Example 6-6 displays the AIX operating system packages we have installed on both nodes in preparation for the Oracle Clusterware and Oracle Database installation.
Tip: If AIX toolbox for Linux packages are installed (or other open source packages) we
recommend that you set the AIX MANPATH variable (in /etc/environment) to include the
path to the man pages of the installed products (for example, add /opt/freeware/man to
your current MANPATH).
Example 6-7 shows the I/O completion ports for our cluster nodes.
Example 6-8 shows the VMM6 tuning parameters on our cluster nodes.
Example 6-8 Virtual memory manager (VMM) parameters - same on both cluster nodes
# vmo -L minperm%
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
minperm% 3 3 3 1 100 %% memory D
--------------------------------------------------------------------------------
# vmo -L maxperm%
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
maxperm% 90 90 90 1 100 %% memory D
minperm%
maxclient%
--------------------------------------------------------------------------------
# vmo -L maxclient%
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
maxclient% 90 90 90 1 100 %% memory D
maxperm%
minperm%
--------------------------------------------------------------------------------
# vmo -L strict_maxclient
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
strict_maxclient 1 1 boolean d
strict_maxperm
--------------------------------------------------------------------------------
# vmo -L strict_maxperm
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
strict_maxperm 0 0 boolean d
strict_maxclient
--------------------------------------------------------------------------------
Example 6-9 on page 79 shows the AIX CPU folding configuration on our nodes. For more information about virtual processor folding, see:
https://www.ibm.com/support/pages/aix-virtual-processor-folding-misunderstood
Example 6-10 shows the maximum user processes and block size allocation (system
parameters) on our cluster nodes.
The network configuration for our test environment consists of the following:
Network tuning options parameters as shown in Example 6-11.
--------------------------------------------------------------------------------
# no -L tcp_recvspace
--------------------------------------------------------------------------------
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
tcp_recvspace 64K 16K 64K 4K 8E-1 byte C
sb_max
--------------------------------------------------------------------------------
# no -L rfc1323
--------------------------------------------------------------------------------
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
rfc1323 1 0 1 0 1 boolean C
--------------------------------------------------------------------------------
# no -L sb_max
--------------------------------------------------------------------------------
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
sb_max 4M 1M 4M 4K 8E-1 byte D
--------------------------------------------------------------------------------
# no -L ipqmaxlen
--------------------------------------------------------------------------------
NAME CUR DEF BOOT MIN MAX UNIT TYPE
DEPENDENCIES
--------------------------------------------------------------------------------
ipqmaxlen 512 100 512 100 2G-1 numeric R
--------------------------------------------------------------------------------
Network interfaces are shown in Example 6-12. For our test environment we use ent0 for
Public LAN and ent2 for Private LAN (RAC Interconnect).
The IP network configuration for our nodes is shown in Example 6-13. We use the netstat -i command to display the configuration on both nodes. You can use a distributed shell (for example, dsh) to issue commands on both nodes from a single terminal. For dsh configuration, see “Tools for distributed commands (optional)” on page 75.
<node2> (aop93cl093)
-------------------------------------------------------------------------------
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 fa.f6.ea.66.14.20 12474675 0 11019992 0 0
en0 1500 129.xx.xx aop93cl093 12474675 0 11019992 0 0
en1 1500 link#3 fa.16.3e.23.22.6c 505 0 404 0 0
en1 1500 10.10.1 node2-ic 505 0 404 0 0
en2 1500 link#4 62.66.4a.82.88.7 30714349 0 28337844 0 0
en2 1500 10.20.1 node2-ic2 30714349 0 28337844 0 0
lo0 16896 link#1 3607774 0 3607774 0 0
lo0 16896 127 loopback 3607774 0 3607774 0 0
lo0 16896 loopback 3607774 0 3607774 0 0
The host names and basic network configuration are shown in Example 6-14.
<node2> (aop93cl093)
-------------------------------------------------------------------------------
authm 65536 Authentication Methods True
bootup_option no Use BSD-style Network Configuration True
gateway Gateway True
hostname aop93cl093.pbm.ihost.com Host Name True
rout6 IPv6 Route True
route net,-hopcount,0,,0,129.xx.xx.xxx Route True
Network name resolution for our test environment is configured to use static IP addresses resolved to IP labels locally (/etc/hosts) and a Domain Name Server (DNS, configured in /etc/resolv.conf), in this particular order. The network name resolution order is set in /etc/netsvc.conf.
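As an illustration, an /etc/netsvc.conf entry that enforces this resolution order looks like the following (a minimal sketch; adapt it to your resolver setup):
hosts = local, bind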
Important:
All nodes in your cluster MUST resolve the network names identically (same IP
labels for same IP addresses, and in the same order).
Oracle cluster configuration requires three IP addresses resolvable by DNS for
SCAN-VIPa configuration. The SCAN-VIP addresses are used for client access and
are dynamically assigned to nodes in the cluster by Oracle Clusterware.
a. SCAN VIP - Single Client Access Name Virtual IP, which is configured as a single entry in your DNS that resolves to three IP addresses in a round-robin fashion.
Time synchronization
As a general rule, in any cluster environment, time synchronization between cluster nodes is
required.
In our test environment we use Network Time Protocol (NTP) client on both nodes, which is
configured in /etc/ntp.conf, as shown in Example 6-15 on page 82.
After you have made the changes in your environment, you must restart the xntpd service to pick up the changes (stopsrc -s xntpd; startsrc -s xntpd). To check the xntpd status, you can use the command shown in Example 6-16.
Check that both nodes in the cluster use the same NTP configuration. You can verify the time with the date command, as shown in Example 6-17.
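A minimal sketch of such a check, run from the control node with dsh, is shown here (xntpd is the AIX NTP subsystem name):
# dsh "lssrc -s xntpd" | dshbak -c
# dsh date | dshbak -c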
The users we created have the properties and capabilities shown in Example 6-19.
# dsh lsuser -f -a capabilities fsize cpu data stack core rss nofiles grid | dshbak -c
HOSTS -------------------------------------------------------------------------
aop93cl093, aop93cld24
-------------------------------------------------------------------------------
grid:
capabilities=CAP_NUMA_ATTACH,CAP_BYPASS_RAC_VMM,CAP_PROPAGATE
fsize=-1
cpu=-1
data=-1
stack=-1
core=2097151
rss=-1
nofiles=-1
We have set the logon password for the grid and oracle users and created a file in each user's home directory, as shown in Example 6-20.
## For node1:
# su - oracle
$ cat ~/.oraenv
export ORACLE_HOME=/u02/app/oracle/product/19.0.0/dbhome_1
export ORACLE_SID=itsodb1
export PATH=$PATH:${ORACLE_HOME}/bin
$ exit
##For node2:
# su - oracle
$ cat ~/.oraenv
export ORACLE_HOME=/u02/app/oracle/product/19.0.0/dbhome_1
export ORACLE_SID=itsodb2
export PATH=$PATH:${ORACLE_HOME}/bin
$ exit
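The file can then be sourced in the oracle user's session before working with the database, as in the following usage sketch (the SID shown is the one from node1 in our environment):
$ . ~/.oraenv
$ echo $ORACLE_SID
itsodb1
$ sqlplus / as sysdba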
Important: Storage configuration must be adjusted to fit your configuration needs. Refer to
the following documentation:
Chapter 7 in Oracle manual Grid Infrastructure Installation and Upgrade Guide - 19c for
IBM AIX on POWER Systems (64-Bit), E96274-03.
Chapter 1 in Oracle manual Oracle Database Installation Guide - 19c for IBM AIX on
POWER Systems (64-bit), E96437-04.
The local storage configuration consists of two non-shared volumes (LUNs), each 100 GB in size, on each cluster node. One disk volume is used for rootvg (the AIX operating system) and the second volume is used for the Oracle binaries.
Local (non-shared) volumes are shown in Example 6-21.
Paging space is also configured per Oracle requirements, as shown in Example 6-22.
Note that, as per Oracle requirements, 16 GB of swap space is required as a minimum for
systems with real memory larger than 16 GB.
Note: You need to properly size the memory and paging space for your environment to
avoid the “out of paging space” error in AIX.
Local file systems used for our environment are shown in Example 6-23. Note that the
/u02 file system is used for Oracle binaries (typically, /u01 is used as the name of the file
system - however, this is just a convention).
Tip: Example 6-24 shows that an LV is used for JFS2 logging operations. You can also configure your JFS2 file systems with inline logging.
HOSTS -------------------------------------------------------------------------
aop93cl093
-------------------------------------------------------------------------------
Filesystem GB blocks Used Available Capacity Mounted on
/dev/fslv05 90.00 46.82 43.18 53% /u02
Example 6-25 Directory structure for Oracle Grid and Oracle Database
##Oracle Grid
# dsh mkdir -p /u02/app/19.0.0/grid
# dsh mkdir -p /u02/app/grid
# dsh mkdir -p /u02/app/oracle
# dsh chown -R grid:oinstall /u02
# dsh chown oracle:oinstall /u02/app/oracle
# dsh chmod -R 775 /u02/
##Oracle Database
# dsh mkdir -p /u02/app/oracle
# dsh mkdir -p /u02/app/oraInventory
# dsh chown -R oracle:oinstall /u02/app/oracle
# dsh chown -R oracle:oinstall /u02/app/oraInventory
# dsh chmod -R 775 /u02/app
Note: Since we do not use ASM in our environment, we do not prepare shared storage for this purpose.
Note: The Spectrum Scale configuration in this section is provided for your reference. Your
configuration can differ, depending on your deployment requirements.
https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.3/com.ibm.spectrum.scale.v5r03.doc/bl1xx_library_prodoc.htm
https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html
Configuration steps
1. Spectrum Scale software installed on our nodes is shown in Example 6-26 on page 88.
We have also set the path variable ($PATH) to include the Spectrum Scale binaries path.
# echo $PATH
/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin:/usr/java7_64/jre/bin:/usr/jav
a7_64/bin:/usr/lpp/mmfs/bin:/u02/app/19.0.0/grid/bin
2. Example 6-27 shows the network configuration we have used for our Spectrum Scale
cluster.
For our cluster configuration we have used en0.
Since Spectrum Scale is used as back-end storage for Oracle RAC, we have configured
local name resolution (/etc/hosts) for Spectrum Scale.
HOSTS -------------------------------------------------------------------------
aop93cl093
-------------------------------------------------------------------------------
en0:
flags=1e084863,814c0<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64B
IT,CHECKSUM_OFFLOAD(ACTIVE),LARGESEND,CHAIN>
inet XXX.XX.XX.93 netmask 0xffffff00 broadcast XXX.XX.XX.255
tcp_sendspace 262144 tcp_recvspace 262144 rfc1323 1
3. We have tested the secure shell password-less access between Spectrum Scale cluster
nodes, as shown in Example 6-28 on page 89. Detailed ssh configuration is provided in
IBM Spectrum Scale Version 5.0.3: Concepts, Planning, and Installation Guide,
SC27-9567-03.
4. Example 6-29 shows the disks we have used for Spectrum Scale. We have configured these disks with SCSI-3 Persistent Reservation (only one of the disks is shown as an example).
Notes:
Spectrum Scale changes the PR_key_value upon cluster services startup.
During the Network Shared Disk (NSD) configuration step, Spectrum Scale identifies the disk devices uniquely on both nodes, even if the disk device names (as seen by the AIX operating system) are not the same.
For consistency, we recommend that the disk devices in AIX use the same names on both nodes. This can be achieved by identifying the disks on both systems based on their UUID (lsattr -El hdiskYY |grep unique_id or lscfg -vpl hdiskYY|grep hdiskYY) and then using the AIX rendev command (see the sketch that follows this note).
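The following is a hedged sketch of this renaming (the hdisk names are hypothetical; always compare the unique_id values on both nodes before renaming):
# lsattr -El hdisk12 -a unique_id
# rendev -l hdisk12 -n hdisk9
In this sketch, hdisk12 on node2 reports the same unique_id as hdisk9 on node1 and is therefore renamed to hdisk9.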
5. Example 6-30 on page 90 shows the node descriptor file we have created for the Spectrum Scale cluster definition.
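As an illustration only, a node descriptor file and the corresponding cluster creation command might look like the following (the node designations, cluster name, and remote shell options are assumptions and not necessarily the exact values we used):
# cat gpfs_nodes
aop93cld24:quorum-manager
aop93cl093:quorum-manager
# mmcrcluster -N gpfs_nodes -C oracluster -r /usr/bin/ssh -R /usr/bin/scp -A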
7. Example 6-32 shows the Spectrum Scale license we have applied in our cluster.
Summary information
---------------------
Number of nodes defined in the cluster: 2
Number of nodes with server license designation: 2
Number of nodes with FPO license designation: 0
Number of nodes with client license designation: 0
Number of nodes still requiring server license designation: 0
Number of nodes still requiring client license designation: 0
This node runs IBM Spectrum Scale Advanced Edition
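The server license designation shown above can be applied with a command similar to the following (a sketch using our node names):
# mmchlicense server --accept -N aop93cld24,aop93cl093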
8. Example 6-33 shows the disk descriptor files we have created for our cluster and used to add the NSDs to the cluster configuration. We have created the following files:
– gpfs_tie_disks, used to define the NSDs that are later used for the cluster tiebreaker configuration.
– gpfs_data_disks_oradata, used to define the NSDs that are later used for the Oracle shared data file system.
# cat gpfs_data_disks_oradata
%nsd: device=/dev/hdisk9
nsd=data11
servers=aop93cld24,aop93cl093
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=/dev/hdisk10
nsd=data12
servers=aop93cld24,aop93cl093
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=/dev/hdisk11
nsd=data13
servers=aop93cld24,aop93cl093
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=/dev/hdisk12
nsd=data14
servers=aop93cld24,aop93cl093
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=/dev/hdisk13
nsd=data21
servers=aop93cld24,aop93cl093
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=/dev/hdisk14
nsd=data22
servers=aop93cld24,aop93cl093
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=/dev/hdisk15
nsd=data23
servers=aop93cld24,aop93cl093
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=/dev/hdisk16
nsd=data24
servers=aop93cld24,aop93cl093
usage=dataAndMetadata
failureGroup=2
pool=system
9. Example 6-34 shows the NSD creation and verification in our cluster. Note that at this time
the NSDs are not assigned to any file system.
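As a sketch, the NSDs are created from the descriptor files and then verified as follows (the file names are the ones listed above):
# mmcrnsd -F gpfs_tie_disks
# mmcrnsd -F gpfs_data_disks_oradata
# mmlsnsd -X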
10.Example 6-35 shows the cluster started and configured with the NSD tiebreaker. Note the asterisk (*) after the Quorum value (1) in the last mmgetstate command output. This signifies that the Spectrum Scale cluster uses node quorum with tiebreaker disks.
# mmgetstate -aL
Node number Node name Quorum Nodes up Total nodes GPFS state Remarks
----------------------------------------------------------------------------------
1 aop93cld24 2 2 2 active quorum node
2 aop93cl093 2 2 2 active quorum node
# mmchconfig tiebreakerDisks="tie1;tie2;tie3"
........
# mmlsconfig tieBreakerDisks
tiebreakerDisks tie1;tie2;tie3
# mmgetstate -aL
Node number Node name Quorum Nodes up Total nodes GPFS state Remarks
----------------------------------------------------------------------------------
1 aop93cld24 1* 2 2 active quorum node
2 aop93cl093 1* 2 2 active quorum node
Example 6-36 shows the cluster configuration. Note the following parameters:
worker1Threads 48
workerThreads 512
prefetchThreads 72
These parameters are recommended for deploying Oracle RAC database configurations
as per Oracle’s Doc ID 2587696.1.
autoload yes
dmapiFileHandleSize 32
minReleaseLevel 5.0.3.0
ccrEnabled yes
cipherList AUTHONLY
usePersistentReserve yes
failureDetectionTime 10
worker1Threads 48
workerThreads 512
prefetchThreads 72
minQuorumNodes 1
tiebreakerDisks tie1;tie2;tie3
adminMode central
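If these values are not already set, they can be changed with the mmchconfig command, as in the following sketch (some attributes take effect only after Spectrum Scale is restarted; see the mmchconfig man page):
# mmchconfig workerThreads=512,prefetchThreads=72,worker1Threads=48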
11.Example 6-37 shows the file system creation and activation. For Oracle RAC
deployments, check the My Oracle Support Document Doc ID 2587696.1 for Spectrum
Scale file system parameters recommendations.
Tip: The file system we create contains a single pool, named “system”. The system pool
contains disks for both data and metadata. The data and metadata disks are divided
into two failure groups (1 and 2). Data and metadata mirroring is configured for all files.
The two copies of each data and metadata block are stored in separate failure groups.
In addition to the two failure groups used for data and metadata, system pool also
contains a disk that is used as a file system descriptor quorum (failure group 3).
We create the file system first and then we add one disk to be used as file system
descriptor quorum. The NSD descriptor file we use for this purpose is:
# cat gpfs_tie_oradata_disk
%nsd: device=/dev/hdisk6
nsd=tie1
usage=descOnly
failureGroup=3
Important: Certain Spectrum Scale file system parameters, such as the file system
block size (and others as well) cannot be changed after file system creation. These
must be set at file system creation.
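The following is a hedged sketch of how such a file system might be created and the descriptor-quorum disk added afterwards (the block size, replication, and mount point values are assumptions; see My Oracle Support Doc ID 2587696.1 and the mmcrfs man page for the recommended values):
# mmcrnsd -F gpfs_tie_oradata_disk
# mmcrfs oradata -F gpfs_data_disks_oradata -T /oradata -B 512K -m 2 -M 2 -r 2 -R 2 -A yes
# mmadddisk oradata -F gpfs_tie_oradata_disk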
# mmmount all -a
# mmlsmount all -L
XXX.XX.XX.123 aop93cld24
XXX.XX.XX.93 aop93cl093
Tip: For additional details about the parameters used at file system creation time, see the man page of the mmcrfs command.
Note: The distribution of the three file system descriptors is one per failure group.
12.Example 6-39 shows the directory structure and permissions we have configured for deploying Oracle Clusterware and one Oracle RAC database. The Oracle Cluster Registry (OCR) and Vote files are deployed in the /oradata/crs_files2 directory, while the Oracle data files are deployed in the /oradata/itsodb directory.
Important: We do not provide step-by-step Oracle Grid and Oracle Database installation instructions, as these are documented by Oracle and guided by the Oracle Universal Installer (OUI). Instead, we focus on the configuration parameters we have selected for our deployment.
During the installation and configuration of Oracle software, certain actions can be
performed on a single node, while some of the tasks must be performed on both nodes.
Tip: In our test environment we used an older version of VNC. You can use the graphical display server of your choice (for example, TightVNC).
The VNC configuration files can be found in each user’s home directory, in the ~/.vnc folder.
Example 6-40 shows the VNC server configured for grid and oracle users.
Example 6-40 VNC started for grid and oracle users (node1)
# rpm -qa |grep vnc
vnc-3.3.3r2-6.ppc
# ps -aef |grep vnc |egrep "grid|oracle"
grid 9372104 1 0 Sep 29 - 0:00 Xvnc :2 -desktop X -httpd
/opt/freeware/vnc/classes -auth /home/grid/.Xauthority -geometry 1024x768 -depth 8 -rfbwait
120000 -rfbauth /home/grid/.vnc/passwd -rfbport 5902 -nolisten local -fp
/usr/lib/X11/fonts/,/usr/lib/X11/fonts/misc/,/usr/lib/X11/fonts/75dpi/,/usr/lib/X11/fonts/100dpi
/,/usr/lib/X11/fonts/ibm850/,/usr/lib/X11/fonts/Type1/
We log on as grid and unpack the Oracle 19c grid_home.zip installation archive in the directory shown in Example 6-41 on page 96.
$ unzip /mnt1/db/19.3/AIX.PPC64_193000_grid_home.zip
.........
We log on as oracle and unpack the Oracle 19c db_home.zip installation archive in the directory shown in Example 6-42.
$ unzip /mnt1/db/19.3/AIX.PPC64_193000_db_home.zip
.........
Note: Before starting the installation process, check the My Oracle Support document
INS-06006 GI RunInstaller Fails If OpenSSH Is Upgraded to 8.x (Doc ID 2555697.1).
Tip: You can "pre-patch" the 19.3 binaries before running the installer with an Oracle
Release Update which contains the fix for this issue. In this case, you do not need the
previously mentioned My Oracle Support Document.
The patch ID that contains the fix for this issue is 32545008, which is part of the April 2021 Release Update (RU):
gridSetup.sh -applyRU /<Staging_Path>/grid/32545008
where the <Staging_Path> is the staging directory where the RU from Apr 2021 has been
unpacked (in our test environment this is /u02/stage).
At the end of the Oracle Grid installation process we check the results using the following
procedure:
Note: In our test configuration we have used only one SCAN VIP. Standard deployments
recommended by Oracle need three SCAN VIP addresses, configured in DNS.
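Typical verification commands at this stage include the following (a sketch; these are standard Grid Infrastructure utilities run with the grid environment set):
# crsctl check cluster -all
# crsctl stat res -t
# olsnodes -n -s
# srvctl config scan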
The Oracle Cluster Registry (OCR) and Vote files are shown in Example 6-44.
Example 6-44 OCR and Vote files (located in shared file system)
# ocrcheck -config
Oracle Cluster Registry configuration is :
Device/File Name : /oradata/crs_files2/ocr1
Device/File Name : /oradata/crs_files2/ocr2
Device/File Name : /oradata/crs_files2/ocr3
During the Oracle Database software installation process we chose to install the software only. The instance and database are created later using the Oracle Database Configuration Assistant (dbca) utility.
From the VNC terminal window, logged on as oracle, we launch the Oracle Database 19c
Installer:
$ cd /u02/app/oracle/product/19.0.0/dbhome_1
$ ./runInstaller
We choose Set Up Software Only, click Next, and select Oracle Real Application Clusters database installation.
We follow the instructions and finalize the software installation. We check the new Oracle Clusterware resources that are created, as shown in Example 6-45. At this time, the database has not yet been created; therefore, the instance resources are OFFLINE.
After the database software has been installed, from the VNC terminal window, we launch
the Database Configuration Assistant:
$ /u02/app/oracle/product/19.0.0/dbhome_1/dbca
We select Create a Database, then follow the instructions. We have chosen to configure a
database named itsodb, with two instances (itsodb1/node1 and itsodb2/node2). The
shared file system location for this database is /oradata/itsodb.
For our installation we also select the sample schemas and Oracle Enterprise Manager (EM) Database Express.
After the installation ends, we check the configuration using the commands shown in
Example 6-46.
--------------------------------------------------------------------------------
ora.LISTENER.lsnr
ONLINE ONLINE aop93cl093 STABLE
ONLINE ONLINE aop93cld24 STABLE
ora.net1.network
ONLINE ONLINE aop93cl093 STABLE
ONLINE ONLINE aop93cld24 STABLE
ora.ons
ONLINE ONLINE aop93cl093 STABLE
ONLINE ONLINE aop93cld24 STABLE
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.ASMNET1LSNR_ASM.lsnr(ora.asmgroup)
1 OFFLINE OFFLINE STABLE
2 OFFLINE OFFLINE STABLE
3 OFFLINE OFFLINE STABLE
ora.LISTENER_SCAN1.lsnr
1 ONLINE ONLINE aop93cl093 STABLE
ora.aop93cl093.vip
1 ONLINE ONLINE aop93cl093 STABLE
ora.aop93cld24.vip
1 ONLINE ONLINE aop93cld24 STABLE
ora.asm(ora.asmgroup)
1 OFFLINE OFFLINE STABLE
2 OFFLINE OFFLINE STABLE
3 OFFLINE OFFLINE STABLE
ora.asmnet1.asmnetwork(ora.asmgroup)
1 OFFLINE OFFLINE STABLE
2 OFFLINE OFFLINE STABLE
3 OFFLINE OFFLINE STABLE
ora.cvu
1 ONLINE ONLINE aop93cl093 STABLE
ora.itsodb.db
1 ONLINE ONLINE aop93cld24 Open,HOME=/u02/app/o
racle/product/19.0.0
/dbhome_1,STABLE
2 ONLINE ONLINE aop93cl093 Open,HOME=/u02/app/o
racle/product/19.0.0
/dbhome_1,STABLE
ora.qosmserver
1 ONLINE ONLINE aop93cl093 STABLE
ora.scan1.vip
1 ONLINE ONLINE aop93cl093 STABLE
--------------------------------------------------------------------------------
Connecting to (ADDRESS=(PROTOCOL=tcp)(HOST=)(PORT=1521))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for IBM/AIX RISC System/6000: Version 19.0.0.0.0
- Production
Start Date 08-OCT-2020 09:13:29
Uptime 2 days 6 hr. 56 min. 11 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u02/app/19.0.0/grid/network/admin/listener.ora
Listener Log File
/u02/app/19.0.0/grid_base/diag/tnslsnr/aop93cld24/listener/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=129.40.93.123)(PORT=1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=129.40.93.22)(PORT=1521)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcps)(HOST=aop93cld24)(PORT=5500))(Security=(my_wa
llet_directory=/u02/app/oracle/product/19.0.0/dbhome_1/admin/itsodb/xdb_wallet))(P
resentation=HTTP)(Session=RAW))
Services Summary...
Service "8939c30fdc8c01b2e0530af11a191106" has 1 instance(s).
Instance "itsodb1", status READY, has 2 handler(s) for this service...
Service "b07f99e1d79401e2e05381285d5dd6c5" has 1 instance(s).
Instance "itsodb1", status READY, has 2 handler(s) for this service...
Service "itsodb" has 1 instance(s).
Instance "itsodb1", status READY, has 2 handler(s) for this service...
Service "pdb" has 1 instance(s).
Instance "itsodb1", status READY, has 2 handler(s) for this service...
7.1 Monitoring
We recommend that clients monitor both the server and the database.
For daily monitoring in a production environment, we recommend that nmon is launched with
the following options:
nmon -s60 -c1440 -f -d -V -^ -L -A -M
Note: The 60-second capture interval for continuous monitoring might be too short, as it generates a large amount of data for analysis. AIX by default runs topas_nmon collection. You can disable it if nmon is used instead.
The command can be added to crontab to run at midnight each night to produce a daily nmon
monitoring file. These files can be useful as a reference if performance issues occur.
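A minimal crontab entry for this might look like the following (a sketch; the output directory is an example and must exist):
0 0 * * * cd /home/nmon/logs && /usr/bin/nmon -s60 -c1440 -f -d -V -^ -L -A -M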
If you are investigating a specific issue that runs for a shorter time period you can decrease
the interval length by way of the -s parameter to capture more detailed performance data. In
benchmarks we often set the capture interval length to 10 seconds and run the nmon
command for the duration of the test workload that we want to monitor.
There are other options that can be useful to activate when you are investigating a specific issue. For example, if you are investigating a CPU spike, it is useful to add the following option:
-T Includes the top processes and their call parameters in the output.
You can find the full documentation for the nmon monitoring tool at the following website:
https://www.ibm.com/docs/en/aix/7.2?topic=n-nmon-command
We recommend that you collect nmon data on your AIX LPARs and also on all involved Virtual
I/O Servers.
After you have captured the nmon data, it can be analyzed by a number of tools, most frequently the nmon analyser tool.
This tool is an Excel spreadsheet that turns the raw data into a readable report. You can download it and find the documentation at the following website:
https://developer.ibm.com/technologies/systems/articles/au-nmon_analyser/
Alternative post-processing tools for nmon data can be found at this link:
http://nmon.sourceforge.net/pmwiki.php?n=Site.Nmon-Analyser
Note: Enterprise Manager Cloud Control allows the same data to be reviewed in real time, but we do not cover this topic in this book.
To create AWR reports, you need to acquire the Oracle software license for the chargeable
add-on option Diagnostic and Tuning Pack, and then activate the Server Manageability Pack
option. It can be enabled with the following command by way of sqlplus:
ALTER SYSTEM SET control_management_pack_access='DIAGNOSTIC+TUNING' SCOPE=BOTH;
The AWR report provides operating system data for the server but the numbers are not
always 100% accurate for a number of reasons:
1. The operating system level CPU data is for the server or partition and not for the individual
database instance.
2. Oracle does not record DB CPU use correctly for AIX. You will never see 90% CPU used for a database on AIX when using SMT2, SMT4, or SMT8, because Oracle does not include time spent waiting on CPU from threads other than the first thread on each core.
3. The %DB time can add up to over 100. Despite this, knowing the proportion of activity in
relation to other events is still a valuable metric.
4. I/O statistics need to be compared with the same statistics from the AIX hdisk layer, and
the storage system. The observed values should closely match. If there is a significant
discrepancy, then there are bottlenecks somewhere in the I/O stack or SAN infrastructure
which need to be investigated.
The AWR report is based on the delta between two metric snapshots in time. The longer the time period between those two snapshots, the harder it is to know exactly what happened. For example, the reported CPU utilization data is the same for a process that ran at 100% CPU for 6 minutes or at 10% CPU for 1 hour when the AWR report covers a 1-hour time period.
The maximum frequency for automated AWR snapshots is every 10 minutes. When
investigating a performance issue, particularly one that does not last for long, frequent
snapshots can be beneficial. It helps to eliminate some of the "background noise" of normal
activity and focuses on the issue.
If you manually capture AWR snapshots for a specific issue, creating two snapshots during the peak allows you to capture a clear picture of what is happening during the peak.
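The following is a sketch of capturing the snapshots manually from sqlplus and then generating the report with the standard Oracle-supplied script:
SQL> exec dbms_workload_repository.create_snapshot();
(reproduce or wait for the peak, then take the second snapshot)
SQL> exec dbms_workload_repository.create_snapshot();
SQL> @?/rdbms/admin/awrrpt.sql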
From Oracle 12c onwards, the AWR report contains the output from ADDM. This can be extremely useful for finding performance issues that have already been detected by the Oracle Database, and often a solution is proposed.
If you do not have the license for the Diagnostic and Tuning Pack, then STATSPACK is still available, but it provides significantly less information and analysis.
7.4.1 CPU
100% CPU consumption is not necessarily an issue. The true indicator of an overloaded system is the run queue. If the run queue is higher than the number of CPU threads, then clearly the system is CPU bound. This information can be found in the PROC tab of the nmon report.
Running nmon from the command line allows us to see CPU use in real time and to review the top processes using CPU, allowing us to determine whether an Oracle process or some other workload is consuming the CPU resources.
The AWR report can help to identify individual SQL requests that can be consuming
excessive CPU time.
7.4.2 Memory
Oracle asks for a page space (or swap space) that is twice the size of the memory available at
the time of running the installer. On AIX we do not want to have any paging to paging space at
all. If the paging space is being used, then there is, or was, a shortage of physical memory
forcing AIX to page memory pages to paging space. You can see the paging space related
I/O activity in the PAGE tab of the nmon report. Current usage of paging space can be
determined by running lsps -a.
Note: There are workloads where not using CIO can be beneficial. You can see the types
of memory allocated in the MEMNEW tab in the nmon report.
If insufficient memory has been reserved for the operating system and the database
connections, then a high number of connections can result in memory swapping to paging
space.
7.4.3 I/O
The nmon report allows us to check the general performance of the I/O (with the -d option).
The vmstat -v command helps to detect a lack of physical buffers (pbufs) in logical volumes, as shown in Figure 7-1. This applies to JFS2, not ASM.
You can also find this information in the BBBP tab of your nmon report, which contains the output of vmstat -v at the beginning of the capture and again at the end.
By comparing the two, you can determine whether the number of blocked pbufs has increased during the nmon capture. If this is the case, it can be resolved by adding more LUNs or by increasing pv_pbuf_count by way of the lvmo command.
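A hedged sketch of checking and raising pv_pbuf_count for a volume group follows (the VG name and value are examples; validate changes outside of peak hours):
# lvmo -v DATAVG -a
# lvmo -v DATAVG -o pv_pbuf_count=2048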
The iostat -D command shows whether there is queuing at the operating system level, as shown in Figure 7-2. In Figure 7-2, the avgserv of the I/O reads is 0.2 ms, which is good, but the avgtime of 2.2 ms spent in the queue is clearly an issue. This can be resolved by increasing the queue depth on the disk or by adding more LUNs. The queue avgtime is also shown in the nmon DISKWAIT tab (with option -d).
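A sketch of checking and increasing the hdisk queue depth follows (the disk name and value are examples; the -P flag defers the change until the device is reconfigured or the system is rebooted, and the maximum supported value depends on the storage vendor):
# lsattr -El hdisk9 -a queue_depth
# chdev -l hdisk9 -a queue_depth=64 -P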
If there is queuing at the Fibre Channel adapter level, you can see it in the output of the fcstat command (Figure 7-3).
The num_cmd_elems attribute is a similar setting to queue_depth, but for the adapters. It is set using the command chdev -l fcsX -a num_cmd_elems=YYY, where fcsX is the name of the Fibre Channel adapter and YYY is the sum of the queue_depth values divided by the number of FC adapters. Fibre Channel information is also in the nmon BBBF tab (with nmon option -^).
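A sketch of checking the current adapter setting and looking for signs of exhaustion in the fcstat output follows (the adapter name and new value are examples; changing num_cmd_elems also requires the adapter to be reconfigured or the system rebooted):
# lsattr -El fcs0 -a num_cmd_elems
# fcstat fcs0 | grep -i "No Command Resource Count"
# chdev -l fcs0 -a num_cmd_elems=2048 -P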
If you are using VIO servers, check for any contention at that level. For example, if the VIO
servers are starved of CPU they can become a bottleneck for network or I/O adapters.
Note: num_cmd_elems in the VIO Servers is typically set to the maximum supported by the SAN and the adapter, and must not be set lower than what is configured in the client LPARs.
Oracle provides a tool called orion, which can be used to test I/O bandwidth and latency. This tool is used by the I/O calibrate function in the Oracle Database. The difference between the two is that orion does not require a database to be created in order to work, whereas I/O calibrate updates the statistics in the database for use by the Oracle optimizer.
Both of these tools return a value that allows you to compare the I/O capacity of different
environments. They must be used with caution, particularly in an active production
environment as they try to saturate the I/O resources during their test.
This chapter highlights the key components involved in database I/O and then discusses a proven approach to utilizing the available server and storage resources in support of database I/O.
The AIX JFS2 file system is the simplest and best-performing option for deploying a single database instance on a cooked file system. The depicted JFS2 file system cache is in most cases not utilized, because for the majority of Oracle workload types it is recommended to use the JFS2 Concurrent I/O (CIO) feature of AIX to minimize constraints on concurrent write access to data files. CIO and a correct file system layout can provide I/O performance like raw devices or Oracle Automatic Storage Management (ASM), while still providing the convenience of a cooked file system.
For Oracle Real Application Cluster deployments, ASM or IBM Spectrum Scale (Chapter 6,
“Oracle RAC for AIX with IBM Spectrum Scale” on page 69) are typically chosen to provide
the shared concurrent access to data from multiple servers.
Note: Starting with Oracle 12c, raw devices (disks or raw Logical Volumes (LVs)) are only supported as devices for ASM. Refer to Oracle Doc ID 578455.1.
Figure 8-2 illustrates a typical deployment based on Fibre Channel (FC) attached external storage, where the Fibre Channel Host Bus Adapters, physical or virtual (NPIV), are physically allocated to the AIX LPAR. This configuration is typically only used for workloads with high I/O requirements where sharing of FC adapters with other logical partitions (LPARs) in the physical server is not feasible.
Workload consolidation resulting in improved resource utilization and reduced TCO typically
drives the sharing of FC adapters between multiple LPARs and Figure 8-3 shows a typical
deployment with dual Virtual I/O Servers (VIOS) where the physical FC adapters are
presented as N_Port ID virtualization (NPIV) adapters in the AIX LPAR. The use of virtual
SCSI (vscsi) devices to map storage space from VIOS to client LPAR for use by an Oracle
Database is discouraged as it introduces increased I/O latency and CPU usage in the VIOS
as compared to the NPIV technology.
To simplify the illustration only two FC adapters/ports are shown per VIOS. Those two ports
connect to different SAN switches for highest reliability. The number of connections and active
paths into the storage sub-system are configured to support the aggregated peak throughput
requirements driven through the server FC adapters to that storage.
The physical adapters in the VIO Servers have a command queue which restricts the number
of concurrent I/O which can be driven by that adapter/port at any point in time. The size of that
queue is specified by way of the num_cmd_elems adapter parameter (fcsX). The maximum
value for this setting is dependent on the adapter type. Older adapters typically support a
value up to 2048, but more current 16 Gbit adapters support a value of 3200 or even 4096.
Before setting or changing this value, verify with the storage vendor if the SAN can support
the number of FC adapter ports times num_cmd_elems concurrent pending I/O.
The max_xfer_size setting for a physical FC adapter has a dual meaning. On one side it specifies the maximum size of an I/O driven against the SAN, and on the other it influences how much DMA physical memory is allocated to the adapter. This parameter value is typically never reduced, and is only increased if advised by IBM support. Effective maximum I/O sizes are typically configured at the AIX hdisk layer and need to be lower than or equal to the value of max_xfer_size.
The NPIV adapter (virtual HBA) in the client LPAR also has a parameter num_cmd_elems
constraining the number of concurrent I/O driven by the LPAR by way of the respective NPIV
adapter. Note that num_cmd_elems in the client LPAR is limited to a value of 2048.
Storage capacity is being made available to the client LPAR in the form of one or more
storage volumes with assigned Logical Unit Numbers (LUNs) which can be accessed by way
of a defined set of NPIV HBAs in the client LPAR. The AIX multi-path device driver, or storage
vendor specific alternative, automates the distribution of physical I/O to the available volumes
over all available paths.
Each volume is represented as a single hdisk, or vendor specific device name like
hdiskpower, in AIX. Each hdisk has a SCSI command queue and its depth is controlled by
way of the queue_depth parameter. The maximum queue_depth value is 256 and it specifies
how many I/O can be concurrently active to the underlying volume over all defined paths. If
the hdisk queue is full, additional I/O is placed into a separate wait queue which is utilized in a
FIFO fashion.
The size restriction of the hdisk queue_depth drives the need to define and map more than
one volume for an Oracle Database to use for data or redo. A good starting point for storing
Oracle data, index, temp files is eight volumes and for redo data a different set of volumes at
minimum four, but often eight volumes. For Oracle usage those volumes are less than 2 TB
and typically less than 1 TB in size. If more than 8 TB of space is required, additional volumes are mapped, again in multiples of eight. Eight was chosen based on typical
characteristics of today's SAN-attached storage solutions where the number of controllers in
the SAN-attached storage are multiples of two for redundancy and I/O to a specific volume is
typically routed by way of an "owning" controller. Only if that controller becomes unavailable is I/O re-routed to an alternative path.
To emphasize the point, even if all LUNs are spread over all physical storage space in the
SAN-attached storage, you still want to configure several volumes to not be I/O constrained
by the queue depth of a single or small number of hdisks.
Recent AIX releases and FC adapters added the support for multiple I/O queues to enable
significantly higher I/O rates. For further details see the following link:
https://www.ibm.com/developerworks/aix/library/au-aix-performance-improvements-fc-fcoe-trs/
The initial approach was based on the idea to isolate database files based on function and
usage; for example, define different pools of storage for data files and index files. Key
observations of this approach:
1. Manually intensive planning and manual file-by-file data placement is time consuming, resource intensive, and iterative.
2. The best performance is achievable, but only with continuous maintenance.
3. It can lead to I/O hotspots over time that impact throughput capacity and performance.
The current approach, which is also the idea behind ASM, is based on "SAME" (Stripe And Mirror Everything). The "Stripe" part is for performance and load balancing and is relevant for this discussion. The "Mirror" part, which provides redundancy and availability, is in a SAN-attached environment implemented in the storage subsystem rather than in AIX, and is not discussed further here. Oracle Databases with ASM on SAN-attached storage typically use "EXTERNAL" as the redundancy setting.
In ASM, devices (typically raw hdisks) are grouped together by way of an ASM disk group. ASM automatically stripes all data within a disk group over all underlying devices. Typically you have disk groups for:
DATA: Data/index/temp.
FRA: Redo.
OCR: OCR/vote.
In the context of a production database deployment in AIX, the DATA disk group has eight (or a multiple of eight) devices, and the FRA disk group has at minimum four, but more typically also eight, devices. It is recommended to separate the OCR/Vote files (and the management database, if configured) into an independent disk group so that the ASM disk groups for a database can be taken offline without having to shut down the Oracle cluster services. Further material on Oracle ASM can be found at the following website:
https://www.oracle.com/database/technologies/rac/asm.html
The remainder of this section will discuss how data striping can be efficiently implemented
with the AIX logical volume manager for a single instance database persisting its data in AIX
JFS2 file systems.
Figure 8-4 illustrates the key concepts of the AIX logical volume manager.
Like ASM, AIX groups hdisks into volume groups. For the deployment of a production Oracle
Database a minimum of three AIX volume groups is recommended:
1. ORAVG: Oracle binaries.
2. DATAVG: Oracle data/index/temp.
3. REDOVG: Oracle redo; flash recovery.
ORAVG can contain a single volume, or two. DATAVG contains eight volumes (or a multiple of eight), and REDOVG contains at minimum four, but preferably eight, volumes.
To stripe data over all hdisk devices in a volume group, AIX supports two approaches, which are discussed in more detail in the remainder of this section:
1. Striping based on logical volume striping.
2. Striping by way of "PP striping" / "PP spreading"; as an example, look at the orabin LV in Figure 8-4.
Note: The specified block size (agblksize) needs to be adjusted to 512 for the file system containing redo log files.
Create the DATAVG volume group and specify all the corresponding hdisks. Note the
specified PP size of 256 MB and the implications for future growth:
mkvg -S -s 256 -y DATAVG $disklist
Create the striped logical volume with a 1 MB strip size over all hdisks in DATAVG:
mklv -y oradatalv -S 1M -a e -t jfs2 -x <max # PP> DATAVG <size of LV in PP>
$disklist
Create the JFS2 file system with an inline log on the just-created LV:
crfs -v jfs2 -d'oradatalv' -m /oradata -A'yes' -p'rw' -a agblksize='4096' -a
logname='INLINE'
Mount the file system:
mount /oradata
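After creation, the strip distribution over the hdisks can be verified with the standard LVM commands, for example:
# lslv oradatalv
# lsvg -l DATAVG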
If utilizing striped logical volumes, the following needs to be understood and planned for:
All allocated volumes and hdisks in a VG need to be of the same size.
The LV is striped over N hdisks (typically all hdisks in a VG) from one VG.
The LV space allocation can only grow in multiples of N times the PP size.
– Example: N=8, PP size=1 GB: the LV grows in increments of 8 GB (8 GB, 16 GB, 24 GB, and so on).
A file system (FS) on top of a striped LV always grows with the same increments (N * PP), so no space is wasted. Note that you must create the striped LV first and then create the JFS2 file system on top of the existing striped LV.
If any of the N hdisks runs out of available PP, attempts to grow the LV will fail. There are
two options to resolve this:
1. Grow the underlying volume(s) dynamically by way of SAN methods and discover the new
size dynamically by way of chvg -g <VG name>. Minimum size increase for each volume is
PP size and increases are in full multiples of PP size.
2. Add another N hdisks (volumes) to the VG and expand the LV to those new volumes. This
is called a "stripe column". The system administrator manually adds the hdisks to the VG
and then expands the LV accordingly.
Option 2 requires significant SAN and storage subsystem changes and additional work by the AIX administrator as well. Adding volumes also impacts, for example, FlashCopy configurations.
Note: The specified block size (agblksize) needs to be adjusted to 512 for the file system
containing redo log files.
Create the DATAVG volume group and specify all the corresponding hdisks. Note the
specified PP size of 8 MB and the implications for future growth discussed later. This PP
size is the strip size used for striping the data over all hdisks in the VG and is selected as
small as possible while still providing sufficient space in the VG for database data with less
than about 60,000 PPs. Alternative larger PP sizes are 16 MB, 32 MB and potentially 64
MB, but at that size there is a good likelihood to see hotspots which rotate over the disks:
mkvg -S -s 8 -y DATAVG $disklist
Create the logical volume with maximum spread over all hdisks in DATAVG. The value for <size of LV in PP> must be evenly divisible by the number of disks in the VG:
mklv -y oradatalv -e x -a e -t jfs2 -x <max # PP> DATAVG <size of LV in PP>
$disklist
Create JFS2 file system with inline log on just created LV:
crfs -v jfs2 -d'oradatalv' -m /oradata -A'yes' -p'rw' -a agblksize='4096' -a
logname='INLINE'
Mount the file system:
mount /oradata
If utilizing PP spreading to stripe data over all disks in a VG, the following needs to be understood and planned for:
All allocated volumes and hdisks in a VG are of the same size.
The PP size needs to be planned carefully, as management of a VG with more than 60,000 PPs becomes slower.
The LV is PP spread over M hdisks (typically all hdisks in VG and a multiple of 8) from one
VG.
The LV space allocation can grow in multiples of one PP size but grows in multiples of M
PP sizes for balanced I/O distribution.
A file system (FS) on top of a PP-spread LV always grows with the same increments (one or more PPs, preferably M * PP for balanced I/O distribution), so no space is wasted. Note that you must create the LV first and then create the JFS2 file system on top of the existing PP-spread LV.
If any of the M hdisks in the VG runs out of available PP, AIX skips that hdisk in the
round-robin allocation from eligible hdisks in the VG. If no eligible hdisk has an available
PP, further attempts to grow the LV will fail.
There are two options to resolve the <out-of-space> condition:
1. (Preferred option) Grow the underlying volume(s) dynamically by way of SAN methods
and discover the new size dynamically by way of chvg -g <VG name>. Minimum size
increase per volume is PP size.
2. Add another K hdisks (volumes) to the VG and then execute "reorgvg <VG name>" to
redistribute the allocated space evenly over all hdisks in the VG. K >= 1.
Option 2 requires significant SAN changes and additional work by the AIX administrator. Adding volumes also impacts, for example, FlashCopy configurations.
reorgvg is an I/O-intensive operation. Note that there is no method to specify which PPs will be migrated, and you can end up with some original data files still only being backed by the initial M hdisks, resulting in some disks in a VG being accessed more than others.
9.1 Introduction
Supported networking technologies for Oracle Database on AIX are Ethernet and InfiniBand.
Several communication adapters with different speeds and feature sets are supported for IBM
Power Systems servers. For details about IBM Power Systems servers and supported
communication adapters, check the following website:
https://www.ibm.com/support/knowledgecenter
For the latest supported network technologies for Oracle RAC refer to the following website:
https://www.oracle.com/database/technologies/tech-generic-unix-new.html
After the options overview, a sample configuration is presented that illustrates how Shared Ethernet Adapter (SEA) functionality can be utilized to deploy two independent RAC clusters onto the same two physical Power Systems servers.
The use of dedicated network adapters can be the choice for the Oracle Real Application
Cluster Interconnect in a large environment.
In addition to sharing of physical network adapters between LPARs, SEA technology also
enables Live Partition Mobility (LPM). LPM enables the migration of an active LPAR between
two separate physical Power servers without having to stop the applications or the operating
system. Live Partition Mobility is supported for LPARs running single instance Oracle
Databases or Oracle RAC. For additional information about supported configurations, see the
following website:
https://www.oracle.com/database/technologies/virtualization-matrix.html
Live Partition Mobility is fully supported for the client access network and the RAC interconnect. The graphic shows only a single network; note that for Oracle RAC, a minimum of two independent networks is required.
For aggregation of clusters with high packet rates on the RAC interconnect, it is better to
utilize dedicated physical adapters for the highest user(s) as high packet rates drive
significant CPU utilization in the respective VIO Servers.
Note: Single Root I/O Virtualization (SR-IOV) is a new technology that is available with the latest AIX releases and IBM Power Systems servers. However, at the time this publication was written, it was not yet supported by Oracle.
When Oracle support for SR-IOV-based technology becomes available, it is highly recommended to utilize this new technology to share network adapters more efficiently between multiple client LPARs.
The configuration shown in Figure 9-2 is the minimal recommended configuration for an
Oracle RAC cluster which also provides protection against networking single points of failure
utilizing SEA failover.
If higher bandwidth is needed, physical Ethernet link aggregation over two or more physical
network ports in the VIO Servers can be used for SEA backing adapters. Another option is to
configure multiple SEA adapters in VIO Servers and two or more corresponding Virtual I/O
Ethernet Adapters in the client LPAR. Oracle High Availability IP (HAIP) can then aggregate
those virtual network ports for the RAC interconnect providing availability and high bandwidth
by distributing network traffic over all specified interfaces.
Related publications
The publications listed in this section are considered particularly suitable for a more detailed
discussion of the topics covered in this book.
IBM Redbooks
The following IBM Redbooks publications provide additional information about the topic in this
document. Note that some publications referenced in this list might be available in softcopy
only.
IBM Power Systems Private Cloud with Shared Utility Capacity: Featuring Power
Enterprise Pools 2.0, SG24-8478-01
IBM PowerVC Version 2.0 Introduction and Configuration, SG24-8477
Red Hat OpenShift V4.X and IBM Cloud Pak on IBM Power Systems Volume 2,
SG24-8486
You can search for, view, download or order these documents and other Redbooks,
Redpapers, Web Docs, draft and additional materials, at the following website:
ibm.com/redbooks
Online resources
These websites are also relevant as further information sources:
Licensing Oracle Software in the Cloud Computing Environment
https://www.oracle.com/assets/cloud-licensing-070579.pdf
On-premises: IBM Private Cloud with Dynamic Capacity
https://www-01.ibm.com/common/ssi/ShowDoc.wss?docURL=/common/ssi/rep_ca/1/897/ENUS120-041/index.html&lang=en&request_locale=en
Off-premises: IBM Power Systems Virtual Server
https://www.ibm.com/cloud/power-virtual-server
SG24-8485-00
Printed in U.S.A.
ibm.com/redbooks