Chapter 9. Failure Root Cause Prevention
and equipment failures is known as Root Cause Failure Analysis (RCFA). RCFA is labor- and
time-intensive, so it is applied only after a serious incident justifies its use. Reserving RCFA
for investigating major failures ensures that major failures will continue to occur. You might find
and remove some contributing causes, but thousands of defects in your business will stay
behind to create more future catastrophes. Failures are the result of multiple failed processes.
There is never just one cause of equipment failure. There are at least four, and usually more,
contributing factors to a machine failure event. Using Root Cause Failure Analysis will not
discover all of the contributing factors because many of them are hidden in the distant past,
while others started in other places far from your operation. In the Plant Wellness Way, the first
thing you do after a failure is review the processes for the causes that contributed to the event.
Figure 20.1 shows how this root cause removal strategy works to prevent the many defects at
the bottom of the failure pyramid from becoming big troubles that you later must fix one at a time.
9.1 Improve the Process Design
World-class operations recognize the interconnectivity of their processes and work hard to ensure
the right results at every stage in every process. Figure 20.2 shows a failure in product assembly.
The root cause traces back to the item's manufacture, after which it left that process and entered another, then
a second and a third. The defective item started its life elsewhere and ended up causing problems
during assembly. There are innumerable opportunities for errors and defects to occur in all
processes. Process after process connects with others, causing a tangled web of interaction. Errors,
mistakes, and defects can come from everywhere. Any process that goes wrong has an impact on
numerous others downstream. Much time, money, and resources will be wasted.
There is an insightful story told of the late Sir Ernest Shackleton, one of the great early South
Pole explorers. On board his ship bound for Antarctica, he watched a man tie a knot in a rope
that was holding down vital supplies. Shackleton saw that it was the wrong knot for the job. In
wild seas, it would come loose and all the goods and supplies would be lost. Shackleton went to
the man and asked him about his experience at sea. He learned that the man was new to
seafaring. With patience and thoroughness, Shackleton taught him how to tie the correct knot,
one that would be secure in all weather and sea conditions. His comment to the new seafarer is
insightful for all of us who want successful outcomes: “There is always only one knot that is
An important asset management indicator to collect and present is where failures arose during
the equipment life cycle. Today’s failures started in the past when their causes were initiated in
previous processes. Tracing the parts replaced on corrective and breakdown work orders back
through the processes they traveled lets you observe their life cycle. Where you find problem
causes, you stop them so that they cannot arise in the future. If a part’s failure was started by
an error at an external repair shop, it will happen again if you don’t get the shop to fix the
causes. A stress-induced failure from shaft misalignment indicates that your equipment
The failed item or part is used to start the review of its life. The failure is the last event in a
long chain of causes and effects. The failure mode site on the part contains evidence of the
causes of its failure. The causes that came together to fail the part passed through your
processes undetected until they combined to initiate the failure. It is necessary to find the culprit
When a weakness is found, the process is reengineered to remove the opportunity for out-of-
control variation. The process redesign is trialed, and the successful solution is documented and
implemented. The people inside and outside the organization affected by the change are trained in
A few examples of life-cycle process monitoring measures used to find weakness in processes by
observing their effects on the operation are listed below. The indicators are simply a count of the
processes used during the life of failed and replaced parts. A pie chart or bar chart of the number
of maintenance corrective work orders and breakdown work orders per category for a period
shows how regularly these indicators of process design weakness arise in an operation.
The measures are selected to find the weak life-cycle processes that are making your machines
fail, so that you can identify what more to do to make each process more robust, antifragile,
and successful.
• When failed equipment parts were serviced by external vendors
• Whether repaired equipment had service duty specifications
• Who did the previous repairs or replacements
• Whether repaired equipment had been run using ACE 3T operating procedures
• Count of the number of events when the equipment was run overloaded
• The equipment repairs in which parts were drawn from our store
• The equipment repairs in which parts were purchased direct
• Whether equipment repairs were done to ACE 3T procedures
These measures let you target your process redesign to build more successful maintenance
and operating processes. As time goes by and data accumulate, you can develop additional
subcategories within the measures to focus on finding the specific process step that starts the
defects causing the repairs and breakdowns.
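
The per-category count described above can be sketched in a few lines of Python. This is a minimal illustration only, not a prescribed Plant Wellness Way tool: the WorkOrder record, the category labels, and the sample work orders are all hypothetical placeholders, and in practice the data would come from your own maintenance system.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkOrder:
    """Hypothetical work order record; field names are assumptions."""
    order_id: str
    order_type: str               # e.g. "corrective", "breakdown", "preventive"
    categories: tuple[str, ...]   # life-cycle measures that applied to the failed part


def count_by_category(orders):
    """Count corrective and breakdown work orders per life-cycle category."""
    counts = Counter()
    for order in orders:
        if order.order_type in ("corrective", "breakdown"):
            # set() so one work order is counted at most once per category
            counts.update(set(order.categories))
    return counts


def text_bar_chart(counts, width=30):
    """Render the counts as plain-text bars, largest category first."""
    if not counts:
        return []
    peak = max(counts.values())
    lines = []
    for category, n in counts.most_common():
        bar = "#" * max(1, round(width * n / peak))
        lines.append(f"{category:<26} {bar} {n}")
    return lines


# Illustrative placeholder data only; these are not real plant records.
SAMPLE_ORDERS = [
    WorkOrder("WO-001", "breakdown", ("external_vendor_service", "run_overloaded")),
    WorkOrder("WO-002", "corrective", ("parts_from_store",)),
    WorkOrder("WO-003", "breakdown", ("external_vendor_service", "parts_purchased_direct")),
    WorkOrder("WO-004", "preventive", ("parts_from_store",)),  # not a failure, so excluded
]

counts = count_by_category(SAMPLE_ORDERS)
for line in text_bar_chart(counts):
    print(line)
```

Subcategories can be added later by recording finer-grained labels on each work order (for example, one label per external repair shop) without changing the counting logic.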
expertly keep their equipment parts well within the capability of the materials of construction.
The experience of high-reliability organizations is that equipment failure starts with poor
business system process control. The necessary systems and controls that produce high
reliability are either not present or not followed. Equipment failures then result from out-of-control
variation. The organization’s quality management system fails first, and then the equipment is
failed by the system. To fix management system failures, it is necessary to understand how
business processes can fail. By understanding how each process step can fail, you build in the
use and enforce. To get high reliability, the experience of high-reliability organizations tells us,
we must put into the business the processes, the specialist technical knowledge, and the right
activities done correctly that cause high reliability. Don’t begrudge drawing a process flow
diagram for each of your processes, and for each step in a process, to identify the hundreds of
ways the processes could fail. Risks can live anywhere, and you need to see all of the places
where your problems can start. Figure 20.3 traces a machine component manufacturing
process step down to the fundamental tasks and actions. The work flow details expose