Reliability Centered Maintenance
INTRODUCTION
With a few exceptions, preventive maintenance has been considered the most advanced and effective
maintenance technique available for use by industrial and facility maintenance organizations. A Preventive
Maintenance (PM) program is based on the assumption of a "fundamental cause-and-effect relationship between
scheduled maintenance and operating reliability. This assumption was based on the intuitive belief that because
mechanical parts wear out, the reliability of any equipment [is] directly related to operating age. It therefore
followed that the more frequently equipment was overhauled, the better protected it was against the likelihood of
failure. The only problem was in determining what age limit was necessary to assure reliable operation."
Nowlan and Heap reached the conclusion that, "a maintenance policy based exclusively on some maximum
operating age would, no matter what the age limit, have little or no effect on the failure rate."
In separate independent studies, it was noted that a difference existed between the perceived and the intrinsic
design life for the majority of equipment and components. In fact, it was discovered that in many cases equipment
greatly exceeded the perceived or stated design life.
Reliability-Centered Maintenance (RCM) is the optimum mix of reactive, time- or interval-based, condition-based, and proactive maintenance practices. The basic application of each strategy is shown in Fig. 1. These
principal maintenance strategies, rather than being applied independently, are integrated to take advantage of
their respective strengths in order to maximize facility and equipment reliability while minimizing life-cycle
costs.
RCM includes reactive, time-based, condition-based, and proactive tasks. In addition, a user should understand
system boundaries and facility envelopes, system/equipment functions, functional failures, and failure modes, all
of which are critical components of the RCM program.
DESCRIPTION
Preventive Maintenance (PM) assumes that failure probabilities can be determined statistically for individual
machines and components, and parts can be replaced or adjustments can be performed in time to preclude
failure. For example, a common practice has been to replace or renew bearings after so many operating hours
assuming that bearing failure rate increases with time in service.
Fig. 2, Bearing Life Scatter, shows the failure distribution of a group of thirty identical 6309 deep groove ball
bearings installed on bearing life test machines and run to failure. The wide variation in bearing life is obvious and
precludes the use of any effective time-based maintenance strategy.
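The scatter described above can be put into numbers with a short sketch. The Python below is an illustration only and is not part of the bearing study: the function name life_scatter and the sample lives are hypothetical, and B10 is estimated simply as the empirical 10th percentile of the observed run-to-failure lives.

```python
import statistics


def life_scatter(lives_hours):
    """Summarize run-to-failure lives: B10 estimate, median, and max/min ratio."""
    if len(lives_hours) < 2:
        raise ValueError("need at least two run-to-failure lives")
    # statistics.quantiles with n=10 returns the nine decile cut points;
    # the first one is the age by which roughly 10% of the items have failed.
    b10 = statistics.quantiles(lives_hours, n=10)[0]
    return {
        "b10": b10,
        "median": statistics.median(lives_hours),
        "spread_ratio": max(lives_hours) / min(lives_hours),
    }


# Hypothetical lives in operating hours (NOT the data behind Fig. 2).
sample = [120, 310, 450, 700, 950, 1400, 2100, 3300, 5200, 8800]
print(life_scatter(sample))
```

A large spread_ratio means that any single replacement age either discards most of the useful life of the long-lived items or lets the short-lived items fail before they are replaced.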
Fortunately, computer advances in the 1990s have made it possible in many cases to identify the precursors of
failure, quantify equipment condition, and schedule the appropriate repair with a higher degree of confidence than
was possible when performing strictly interval-based maintenance relying upon usually erroneous estimates of
when a component might fail. Also, it has been discovered recently that there are many different equipment
failure characteristics, only a small number of which are age- or use-related. This new knowledge has increased
the emphasis on Condition Monitoring (CM), often referred to as Condition-Based Maintenance, which has
caused a reduction in the reliance upon time-based PM.
It should not be inferred from the above that all interval-based maintenance should be replaced by condition-based maintenance. In fact, interval-based maintenance is appropriate for those instances where abrasive,
erosive, or corrosive wear takes place, material properties change due to fatigue, embrittlement, etc., and/or a
clear correlation between age and functional reliability exists.
In addition, for those systems or components where no failure consequences in terms of
mission, environment, safety, security, or Life-Cycle Cost (LCC) exist, maintenance should not be
performed, i.e., the equipment should be run to failure and replaced.
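The selection rules in the preceding paragraphs can be sketched as a small function. This is a deliberate simplification and not the decision logic tree of Fig. 3: the three yes/no inputs, the names Strategy and select_strategy, and the fallback to a proactive action when neither interval-based nor condition-based maintenance applies are all assumptions made for illustration.

```python
from enum import Enum


class Strategy(Enum):
    RUN_TO_FAILURE = "reactive (run to failure)"
    INTERVAL_BASED = "time- or interval-based (PM)"
    CONDITION_BASED = "condition-based (CM)"
    PROACTIVE = "proactive (e.g., redesign)"


def select_strategy(has_failure_consequences: bool,
                    age_related_wear: bool,
                    condition_measurable: bool) -> Strategy:
    """Pick a strategy from three yes/no answers (illustrative only)."""
    # No consequence for mission, environment, safety, security, or LCC:
    # do no maintenance and replace the item when it fails.
    if not has_failure_consequences:
        return Strategy.RUN_TO_FAILURE
    # Clear correlation between age and reliability (wear, fatigue, etc.).
    if age_related_wear:
        return Strategy.INTERVAL_BASED
    # Precursors of failure can be identified and quantified.
    if condition_measurable:
        return Strategy.CONDITION_BASED
    # Assumption: otherwise fall back to a proactive action.
    return Strategy.PROACTIVE


print(select_strategy(True, False, True).value)
```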
The concept of RCM has been adopted across several government and industry operations as a strategy for
performing maintenance. RCM applies maintenance strategies based on consequence and cost of failure. In
addition, RCM seeks to minimize maintenance and improve reliability throughout the life-cycle by using proactive
techniques such as improved design specifications, integration of condition monitoring in the commissioning
process, and the Age Exploration (AE) process.
A. RCM Principles
B. Types of RCM
There are several ways to conduct and implement an RCM program. The program can be based on rigorous
Failure Modes and Effects Analysis (FMEA), complete with mathematically-calculated probabilities of failure
based on design or historical data, intuition or common-sense, and/or experimental data and modeling. These
approaches may be called Classical, Rigorous, Intuitive, Streamlined, or Abbreviated. Other terms sometimes
used for these same approaches include Concise, Preventive Maintenance (PM) Optimization, Reliability Based,
and Reliability Enhanced. All are applicable. The decision of what technique to use should be left to the end user
and be based on:
Consequences of failure
Probability of failure
Historical data available
Risk tolerance
Resource availability
Classical/Rigorous RCM
Benefits: Classical or rigorous RCM provides the most knowledge and data concerning
system functions, failure modes, and maintenance actions addressing functional failures of any of
the RCM approaches. Rigorous RCM analysis is the method first proposed and documented by
Nowlan and Heap and later modified by John Moubray, Anthony M. Smith, and others. In addition,
this method should produce the most complete documentation of all the methods addressed here.
Concerns: Classical or rigorous RCM historically has been based primarily on the FMEA with
little, if any, analysis of historical performance data. In addition, rigorous RCM analysis is extremely
labor intensive and often postpones the implementation of obvious condition monitoring tasks.
Applications: This approach should be limited to the following three situations:
The consequences of failure result in catastrophic risk in terms of environment,
health, or safety, and/or complete economic failure of the business unit.
The resultant reliability and associated maintenance cost is still unacceptable after
performing and implementing a streamlined type FMEA.
The system/equipment is new to the organization and insufficient corporate
maintenance and operational knowledge exists on function and functional failures.
Abbreviated/Intuitive/Streamlined RCM
Benefits: The intuitive approach identifies and implements the obvious, usually condition-based, tasks with minimal analysis. In addition, it culls or eliminates low-value maintenance tasks
based on historical data and Maintenance and Operations (M&O) personnel input. The intent is to
minimize the initial analysis time in order to realize early wins that help offset the cost of the FMEA
and condition monitoring capabilities development.
Concerns: Reliance on historical records and personnel knowledge can introduce errors into
the process that may lead to missing hidden failures where a low probability of occurrence exists.
In addition, the intuitive process requires that at least one individual has a thorough understanding
of the various condition monitoring technologies.
Applications: This approach should be utilized when:
The function of the system/equipment is well understood.
Functional failure of the system/equipment will not result in loss of life or
catastrophic impact on the environment or business unit.
For these reasons, the streamlined or intuitive approach has been recommended
for DOS, NASA, and NAVFAC facilities. In addition, a streamlined or intuitive approach has been
successfully used in both discrete and continuous manufacturing facilities.
C. RCM Analysis
The RCM analysis should carefully consider and answer the following questions:
What does the system or equipment do; what are the functions?
What functional failures are likely to occur?
What are the likely consequences of these functional failures?
What can be done to reduce the probability of the failure(s), identify the onset of failure(s),
or reduce the consequences of the failure(s)?
Answers to these four questions can be used with the decision logic tree depicted in Fig. 3, Reliability-Centered
Maintenance (RCM) Decision Logic Tree, to determine the maintenance approach for the equipment item or
system.
Note that the analysis process as depicted in Fig. 3 has only four possible outcomes:
D. Failure
Failure is the cessation of proper function or performance. RCM examines failure at several levels: the system
level, sub-system level, component level, and sometimes even the parts level. The goal of an effective
maintenance organization is to provide the required system performance at the lowest cost. This means that the
maintenance approach must be based on a clear understanding of failure at each of the system levels. System
components can be degraded or even failed and still not cause a system failure. A simple example is the failed
headlamp on an automobile. That failed component has little effect on the overall system performance.
Conversely, several degraded components may combine to cause the system to have failed, even though no
individual component has itself failed.
a method of dividing a system into subsystems when its complexity makes an analysis by other means difficult:
A system boundary or interface definition contains a description of the inputs and outputs
that cross each boundary.
The facility envelope is the physical barrier created by a building, enclosure, or other
structure; e.g., a cooling tower or tank.
Standardize on selecting boundaries. For example, a pump could include the first
upstream/downstream isolation valve, the coupling, and associated gauges. The motor would
include the electrical circuit from the load side of the motor control center but not the coupling.
The intent is to develop a series of modular FMEAs and assemble them as if they were Lego Blocks and select
the maintenance actions based on the consequences of risk determined by the criticality and probability factors
defined in Tables 1 and 2 respectively.
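A minimal sketch of the modular idea, assuming each subsystem FMEA is kept as a list of entries that carry a Table 1 criticality ranking and a Table 2 probability ranking. The class and function names, the example rankings, and the use of criticality times probability as the risk score are assumptions for illustration; the text does not prescribe how the two factors are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureModeEntry:
    subsystem: str
    failure_mode: str
    criticality: int  # Table 1 ranking, 1 to 10
    probability: int  # Table 2 ranking, 1 to 10

    @property
    def risk(self) -> int:
        # Assumption: risk is scored as criticality x probability.
        return self.criticality * self.probability


def assemble(*modules):
    """Merge per-subsystem FMEA modules and sort by descending risk."""
    entries = [entry for module in modules for entry in module]
    return sorted(entries, key=lambda entry: entry.risk, reverse=True)


# Hypothetical modules for a pump and its motor (rankings are made up).
pump_fmea = [FailureModeEntry("pump", "seal leak", 4, 6)]
motor_fmea = [
    FailureModeEntry("motor", "winding failure", 7, 2),
    FailureModeEntry("motor", "bearing seizure", 6, 5),
]

for entry in assemble(pump_fmea, motor_fmea):
    print(entry.subsystem, entry.failure_mode, entry.risk)
```

Because each module is self-contained, a pump or motor FMEA written once can be reused wherever the same boundary definition applies, with only the rankings adjusted for the local consequences.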
Function and Functional Failure
The function defines the performance expectation and can have many elements. Elements include physical
properties, operation performance including output tolerances, and time requirements such as continuous
operation or limited required availability.
Functional failures are descriptions of the various ways in which a system or subsystem can fail to meet the
functional requirements designed into the equipment. A system or subsystem that is operating in a degraded
state but does not impact any of the requirements addressed in System and System Boundary, has not
experienced a functional failure.
It is important to determine all of the functions of an item that are significant in a given operational context. By
clearly defining the functions' non-performance, the functional failure becomes clearly defined. For example, it is
not enough to define the function of a pump to move water. The function of the pump must be specific and
defined in such terms as flow rate, discharge pressure, vibration levels, B10 (L10) Life, efficiency, etc. (Reliability
HotWire)
Failure Modes
Failure modes are equipment- and component-specific failures that result in the functional failure of the system or
subsystem. For example, a machinery train composed of a motor and pump can fail catastrophically due to the
complete failure of the windings, bearings, shaft, impeller, controller, or seals. In addition, a functional failure also
occurs if the pump performance degrades such that there is insufficient discharge pressure or flow to meet
operating requirements. These operational requirements should be considered when developing maintenance
tasks.
Dominant failure modes are those failure modes responsible for a significant proportion of all the failures of the
item. They are the most common modes of failure.
Not all failure modes or causes warrant preventive or condition-based maintenance because the likelihood of
their occurring is remote or their effect is inconsequential.
Reliability
Reliability is the probability that an item will survive a given operating period, under specified operating
conditions, without failure. It is usually expressed as B10 (L10) Life and/or Mean Time to Failure (MTTF) or Mean Time
Between Failure (MTBF). The conditional probability of failure measures the probability that an item entering
a given age interval will fail during that interval. If the conditional probability of failure increases with age, the item
shows wear-out characteristics. The conditional probability of failure reflects the overall adverse effect of age on
reliability. It is not a measure of the change in an individual equipment item.
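The definition of conditional probability of failure can be sketched directly from run-to-failure records. The function name, the interval edges, and the sample ages below are hypothetical; the calculation simply divides the number of items failing within an age interval by the number that entered it.

```python
def conditional_failure_probability(failure_ages, interval_edges):
    """Return P(fail in interval | survived to its start) for each interval.

    failure_ages: age at failure of every item in a run-to-failure population.
    interval_edges: increasing ages; interval i covers [edges[i], edges[i+1]).
    An interval that no item entered yields None.
    """
    results = []
    for start, end in zip(interval_edges, interval_edges[1:]):
        entering = sum(1 for age in failure_ages if age >= start)
        failing = sum(1 for age in failure_ages if start <= age < end)
        results.append(failing / entering if entering else None)
    return results


# Hypothetical failure ages in operating hours.
ages = [150, 450, 500, 800, 900, 950, 1000, 1200]
print(conditional_failure_probability(ages, [0, 400, 800, 1200]))
```

Values that rise from one interval to the next indicate wear-out characteristics for the population; roughly constant values indicate random failure.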
Failure rate or frequency plays a relatively minor role in maintenance programs because it is too simple a
measure. Failure frequency is useful in making cost decisions and determining maintenance intervals, but it tells
nothing about which maintenance tasks are appropriate or about the consequences of failure. A maintenance
solution should be evaluated in terms of the safety, security, or economic consequences it is intended to prevent.
A maintenance task must be applicable (i.e., prevent failures or ameliorate failure consequences) in order to be
effective.
Failure Characteristics
Conditional probability of failure (Pcond) curves fall into six basic types, as graphed (Pcond versus time) in Figs. 4
and 5, Random Conditional Probability of Failure Curves and Age Related Conditional Probability of
Failure Curves. The percentage of equipment conforming to each of the six wear patterns, as determined in
three separate studies, is also shown in both figures.
The failure characteristics shown in Figs. 4 and 5, Random Conditional Probability of Failure Curves, were
first noted in the previously cited book, Reliability-Centered Maintenance. Follow-on studies in Sweden in
1973, and by the U.S. Navy in 1983, produced similar results. In these three studies, random failures accounted
for 77-92% of the total failures and age-related failures for the remaining 8-23%.
The basic difference between the failure patterns of complex and simple items has important implications for
maintenance. Single-piece and simple items frequently demonstrate a direct relationship between reliability and
age. This is particularly true where factors such as metal fatigue or mechanical wear are present or where the
items are designed as consumables (short or predictable life spans). In these cases an age limit based on
operating time or stress cycles may be effective in improving the overall reliability of the complex item of which
they are a part.
Complex items frequently demonstrate some infant mortality, after which their failure probability increases
gradually or remains constant. A marked wear-out age is not common. In many cases scheduled overhaul
increases the overall failure rate by introducing a high infant mortality rate into an otherwise
stable system.
Preventing Failure
Every equipment item has a characteristic that can be called resistance to or margin to failure. Using
equipment subjects it to stress that can result in failure when the stress exceeds the resistance to failure. Fig.
6, Preventing Failure, depicts this concept graphically. The figure shows that failures may be prevented or item
Stress is dependent on use and may be highly variable. It may increase, decrease, or remain constant with use or
time. A review of the failures of a large number of nominally identical simple items would disclose that the majority
had about the same age at failure, subject to statistical variation, and that these failures occurred for the same
reason. If one is considering preventive maintenance for some simple item and can find a way to measure its
resistance to failure, he or she can use that information to help select a preventive task.
Adding excess material or changing the type of material that wears away or is consumed can increase resistance
to failure or reduce the rate of degradation. Excess strength may be provided to compensate for loss from corrosion or
fatigue. The most common method of restoring resistance is by replacing the item. The resistance to failure of a
simple item decreases with use or time (age), but a complex unit consists of hundreds of interacting simple items
(parts) and has a considerable number of failure modes. In the complex case, the mechanisms of failure are the
same, but they are operating on many simple component parts simultaneously and interactively so that failures
no longer occur for the same reason at the same age. For these complex units, it is unlikely that one can design a
maintenance task unless there are a few dominant or critical failure modes.
Failure Modes and Effects Analysis (FMEA)
FMEA is applied to each system, sub-system, and component identified in the boundary definition. For every
function identified, there can be multiple failure modes. The FMEA addresses each system function (and, since
failure is the loss of function, all possible failures) and the dominant failure modes associated with each failure,
and then examines the consequences of the failure. What effect did the failure have on the mission or operation,
the system, and on the machine?
Even though there are multiple failure modes, often the effects of the failure are the same or very similar in
nature. That is, from a system function perspective, the outcome of any component failure may result in the
system function being degraded.
Likewise, similar systems and machines will often have the same failure modes. However, the system use will
determine the failure consequences. For example, the failure modes of a ball bearing will be the same regardless
of the machine. However, the dominant failure mode will often change from machine to machine, the cause of the
failure may change, and the effects of the failure will differ.
Fig. 7, FMEA Worksheet, provides an example of an FMEA worksheet.
Table 1. Criticality/Severity Categories
Ranking  Effect            Comment
1        None              No reason to expect failure to have any effect on safety, health, environment, or mission.
2        Very Low          Minor disruption to facility function. Repair to failure can be accomplished during trouble call.
3        Low               Minor disruption to facility function. Repair to failure may be longer than trouble call but does not delay mission.
4        Low to Moderate   Moderate disruption to facility function. Some portion of mission may need to be reworked or process delayed.
5        Moderate          Moderate disruption to facility function. 100% of mission may need to be reworked or process delayed.
6        Moderate to High  Moderate disruption to facility function. Some portion of mission is lost.
7        High
8        Very High
9        Hazard
10       Hazard
Reliability, Maintainability, and Supportability Guidebook, Third Edition, Society of Automotive Engineers, Inc.,
Warrendale, PA, 1995.
The Probability of Occurrence (of Failure) is also based on work in the automotive industry. Table 2, Probability
of Occurrence Categories, provides one possible method of quantifying the probability of failure. If
historical data is available, it will provide a powerful tool in establishing the ranking. If historical data is not
available, a ranking may be estimated based on experience with similar systems in the facilities area. The
statistical ("Effect") column in Table 2 can be based on operating hours, days, cycles, or another unit that provides a
consistent measurement approach. The statistical bases ("Comment") may be adjusted to account for local
conditions. For example, one organization changed the statistical approach for rankings 1 through 5 to better
reflect the number of cycles of the system being analyzed.
Table 2. Probability of Occurrence Categories
Ranking  Effect    Comment
1        1/10,000  Remote probability of occurrence; unreasonable to expect failure to occur.
2        1/5,000   Low failure rate. Similar to past design that has, in the past, had low failure rates for given volume/loads.
3        1/2,000   Low failure rate. Similar to past design that has, in the past, had low failure rates for given volume/loads.
4        1/1,000   Occasional failure rate. Similar to past design that has, in the past, had similar failure rates for given volume/loads.
5        1/500     Moderate failure rate. Similar to past design that has, in the past, had moderate failure rates for given volume/loads.
6        1/200     Moderate to high failure rate. Similar to past design that has, in the past, had moderate failure rates for given volume/loads.
7        1/100     High failure rate. Similar to past design that has, in the past, had high failure rates that has caused problems.
8        1/50      High failure rate. Similar to past design that has, in the past, had high failure rates that has caused problems.
9        1/20      Very High failure rate. Almost certain to cause problems.
10       1/10+     Very High failure rate. Almost certain to cause problems.
Reliability, Maintainability, and Supportability Guidebook, Third Edition, Society of Automotive Engineers, Inc.,
Warrendale, PA, 1995.
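Where historical data is available, the ranking in Table 2 can be looked up mechanically. The sketch below is one possible reading, not a prescribed method: the function name occurrence_ranking is hypothetical, and the rule of assigning the lowest ranking whose rate is not exceeded by the observed rate (so that rates falling between two rows round up) is an assumption.

```python
from fractions import Fraction

# (ranking, failures per unit) from the "Effect" column of Table 2. The unit
# is whatever consistent measure is in use: operating hours, days, cycles, etc.
TABLE_2 = [
    (1, Fraction(1, 10_000)),
    (2, Fraction(1, 5_000)),
    (3, Fraction(1, 2_000)),
    (4, Fraction(1, 1_000)),
    (5, Fraction(1, 500)),
    (6, Fraction(1, 200)),
    (7, Fraction(1, 100)),
    (8, Fraction(1, 50)),
    (9, Fraction(1, 20)),
    (10, Fraction(1, 10)),
]


def occurrence_ranking(failures: int, units: int) -> int:
    """Map an observed failure rate to a Table 2 ranking (rounding up)."""
    if units <= 0 or failures < 0:
        raise ValueError("units must be positive and failures non-negative")
    rate = Fraction(failures, units)
    for ranking, threshold in TABLE_2:
        if rate <= threshold:
            return ranking
    # Anything above 1/10 falls in the open-ended "1/10+" row.
    return 10


# Hypothetical history: 3 failures in 900 cycles (1/300) lands in ranking 6.
print(occurrence_ranking(3, 900))
```

Exact fractions are used so that a rate sitting precisely on a table boundary, such as 1/500, is not pushed into the next category by floating-point rounding.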
F. RCM Implementation
There is no one set path for successfully implementing RCM because RCM is more than just performing a Failure
Modes and Effects Analysis (FMEA), adopting condition monitoring techniques, and/or optimizing a maintenance
and overhaul program through the application of an Age Exploration (AE) process. A successful RCM
implementation process first must recognize what and where the source of return on investment (ROI) resides.
The source(s) of ROI may be tangible and/or intangible. For the former, a quantifiable business case may be
developed based on financial benefit (savings, cost avoidance, reduced Work in Progress (WIP) and/or reduced
liability) to the organization while for the latter, the benefit may be unquantifiable (employee skills, morale,
customer relations, etc.) In either case, a baseline and goal must be established through some mechanism such
as internal or external benchmarking, which results in a defined gap between the "As-Is" and the "To-Be" state
and the ROI identified for closing all or a portion of the gap.
Remember, caveat emptor. That is, RCM is not for everyone, and very few organizations will benefit from
implementing all elements of a classical RCM program. RCM, like all tools and processes, has an element
of diminishing return. Not all the elements of RCM that are applicable to a nuclear power plant, the aircraft
industry, and/or a 24/7 continuous process plant in a sold-out condition will be applicable to a batch process
operation or a non-production facility. However, there are a few truths everyone should follow, for which there is no
need to pilot or perform an FMEA. They are:
KPIs. In doing this, the cost of obtaining data for the KPIs and the relative value they add to the overall program
must be calculated. While advocating doing the right things within the maintenance program with life-cycle cost as
a driver, the cost of capturing the supporting KPIs must also be watched closely.
1. Benchmark Selection
After selection of the appropriate KPIs is complete, benchmarks should be established. These characterize the
organization's goals and/or progress points for using KPIs as a tool for maintenance optimization and continuous
improvement. Benchmarks may be derived from the organizational goals and objectives or they may be selected
from a survey performed with similar organizations. These benchmarks will be used as a target for growth and to
evaluate risk associated with non-achievement of progress.
2. Utilization of KPIs
After benchmarks are established and data collection has begun, the information must be acted on in a timely
manner to maintain continuity within all of the processes that are counting on KPIs as a performance
enhancement tool. To take full advantage of the benefits of KPIs, they should be displayed in public areas.
The tracking and publication of KPIs inform people of what is important, what the goals are, and where they
stand with respect to performance expectations. The impact of displaying KPIs often has an immediate effect on
the workers in the functional area being measured. In addition, KPIs are an integral part of any Team Charter as
they allow the Team and Management to determine Team priorities and measure productivity.
MIL-STD-2194 (SH), Infrared Thermal Imaging Survey Procedure for Electrical Equipment
MIL-STD-2173 (AS), Reliability-Centered Maintenance Requirements for Naval Aircraft,
Weapons Systems and Support Equipment (U.S. Naval Air Systems Command)
NAVAIR 00-25-403, Guidelines for the Naval Aviation Reliability Centered Maintenance
Process (U.S. Naval Air Systems Command)
MIL-P-24534, Planned Maintenance System: Development of Maintenance Requirement
Cards, Maintenance Index Pages, and Associated Documentation (U.S. Naval Sea Systems
Command)
SAE JA1011, Evaluation Criteria for Reliability-Centered Maintenance (RCM) Processes
SAE JA1012, A Guide to Reliability-Centered Maintenance (RCM)
S9081-AB-GIB-010/MAINT, Reliability-Centered Maintenance Handbook (U.S. Naval Sea
Systems Command)
ADDITIONAL RESOURCES
WBDG
Building / Space Types
Applicable to equipment in all building types and space types
Design Objectives
Cost-Effective, Functional / Operational, Historic Preservation: Update Building Systems
Appropriately, Productive: Assure Reliable Systems and Spaces, Productive: Promote Health and
Well-Being, Secure / Safe: Ensure Occupant Safety and Health, Secure / Safe: Provide Security for
Building Occupants and Assets, Sustainable: Optimize Operational and Maintenance Practices
Products and Systems
Applicable to most Systems
Federal Agencies
Organizations
Publications
Complete Building Equipment Maintenance Desk Book, Second Edition by Sheldon J. Fuchs.
Englewood, NJ: Prentice-Hall, 1992.
Continuous Commissioning Guidebook for Federal Energy Managers (PDF 840 KB) DOE
Federal Energy Management Program (FEMP), October 2002.
Dependability Management, Part 3-11: Application Guide, Reliability Centered
Maintenance by International Electrotechnical Commission. Document No. 56/651/EDIS.
Maintainability: A Key to Effective Serviceability and Maintenance Management by B.S.
Blanchard, D. Verma and E.L. Peterson. New York: John Wiley & Sons, Inc., 1995.
Other