Chapter 9. Failure Root Cause Prevention

Highly reliable organizations proactively focus on preventing problems from entering their
operation and removing those that remain. They set control mechanisms, standards, and
checkpoints to spot and stop the defects that turn into future failures. They look for what can
go wrong before it does and prevent its causes from happening. Instead of having problems
and then investigating their causes, they imagine their problems and proactively act to
eliminate their possibility. If your operation is suffering equipment and production problems,
don’t start by trying to discover why they happen and how to solve them. First, look at your
processes. The vast majority of your production problems are caused by bad business process
design. Fix your process weaknesses and do the new training, then put the answers to use. The
problems disappear because they no longer can exist in your company.
Solving the real problem means finding its true causes. The technique used to investigate plant
and equipment failures is known as Root Cause Failure Analysis (RCFA). RCFA is manpower and
time intensive, and so it is only applied after a serious incident justifies its use.1 Reserving RCFA
for investigating major failures ensures that major failures will continue to occur. You might find
and remove some contributing causes, but thousands of defects in your business will stay
behind to create more future catastrophes. Failures are the result of multiple failed processes.
There is never just one cause of equipment failure. There are at least four, and usually more,
contributing factors to a machine failure event. Using Root Cause Failure Analysis will not
discover all of the contributing factors because many of them are hidden in the distant past,
while others started in other places far from your operation. In the Plant Wellness Way, the first
thing you do after a failure is review the processes for the causes that contributed to the event.
Figure 20.1 shows how this root cause removal strategy works to prevent the many defects at
the bottom of the failure pyramid from becoming big troubles that you later must fix one at a
time with RCFA.
9.1 Improve the Process Design
World-class operations recognize the interconnectivity of their processes and work hard to ensure
the right results at every stage in every process. Figure 20.2 shows a failure in product assembly.
The root cause traces back to the item’s manufacture, after which it left that process and entered another, then
a second and a third. The defective item started its life elsewhere and ended up causing problems
during assembly. There are innumerable opportunities for errors and defects to occur in all
processes. Process after process connects with others, causing a tangled web of interaction. Errors,
mistakes, and defects can come from everywhere. Any process that goes wrong has an impact on
numerous others downstream. Much time, money, and resources will be wasted.
There is an insightful story told of the late Sir Ernest Shackleton, one of the great early South
Pole explorers. On board his ship bound for Antarctica, he watched a man tie a knot in a rope
that was holding down vital supplies. Shackleton saw that it was the wrong knot for the job. In
wild seas, it would come loose and all the goods and supplies would be lost. Shackleton went to
the man and asked him about his experience at sea. He learned that the man was new to
seafaring. With patience and thoroughness, Shackleton taught him how to tie the correct knot,
one that would be secure in all weather and sea conditions. His comment to the new seafarer is
insightful for all of us who want successful outcomes: “There is always only one knot that is
right for the situation.”

Shackleton’s method of failure prevention is the technique used in the Plant Wellness Way: do
what stops the causes of failures from starting. First, put the right practices into your processes
and make sure they are done right every time. In the Plant Wellness Way, when things fail, the
first question you should ask is, what is wrong with the process? You can skip the RCFA, but you
cannot skip finding and fixing the design faults and missing quality controls in your processes.
9.1.1 Prevent the Chance of Failure Starting
The necessities for high equipment reliability cannot be left to luck. If Shackleton had left it to the
new seafarer to realize that he was using the wrong knot for the job, the expedition would have
failed. Like Shackleton, you must find and remove the risks in your processes before they destroy
your operation. Do the same for your business that Shackleton did for the Antarctic expedition:
look for where troubles will start in your processes, then introduce, teach and use the right
practices so that risks will never arise.
9.1.2 Identify Where Your Equipment Problems Begin
An important asset management indicator to collect and present is where failures arose during
the equipment life cycle. Today’s failures started in the past when their causes were initiated in
previous processes. Tracing the parts replaced on corrective and breakdown work orders back
through the processes they traveled lets you observe their life cycle. Where you find problem
causes, you stop them so that they cannot arise in the future. If a part’s failure was started by
an error at an external repair shop, it will happen again if you don’t get the shop to fix the
causes. A stress-induced failure from shaft misalignment indicates that your equipment
installation process is weak.

You seek to understand whether your reliability troubles are in fact attributable to
manufacturing defects, subcontractor mistakes, production process causes, material selection
causes, equipment installation troubles, operating errors, vendor-produced causes,
procurement errors, warehouse management failure, poor workmanship, and so on.
The failed item or part is used to start the review of its life. The failure is the last event in a
long chain of causes and effects. The failure mode site on the part contains evidence of the
causes of its failure. The causes that came together to fail the part passed through your
processes undetected until they combined to initiate the failure. It is necessary to find the culprit
processes and fix them.

This is not a root cause analysis investigation to find the actual cause of failure. It is an
investigation of process design weakness to identify the presence of failure-causing steps.
Typically, an experienced discipline maintenance engineer or design engineer, or a career
maintenance supervisor, or maintenance planner would identify all of the processes associated
with the replaced parts.
A process is weak if it does not prevent all of the Physics of Failure Factors that damage or
destroy the part. Finding answers to the eight life-cycle questions (see Chapter 9) is a good
place to start an investigation. Using evidence from the failure, process maps of all of the
processes used during the failed part’s life cycle are reviewed for risks that could have allowed
defects and causes of the failure to arise and remain active.
When a weakness is found, the process is reengineered to remove the opportunity for out-of-
control variation. The process redesign is trialed, and the successful solution is documented and
implemented. The people inside and outside the organization affected by the change are trained in
the proper use of the new process.
A few examples of life-cycle process monitoring measures used to find weakness in processes by
observing their effects on the operation are listed below. The indicators are simply a count of the
processes used during the life of failed and replaced parts. A pie chart or bar chart of the number
of maintenance corrective work orders and breakdown work orders per category for a period
shows how regularly these indicators of process design weakness arise in an operation.
The measures are selected with the intention of finding the weak life-cycle processes that are
making your machines fail in order to identify what more to do to make a process more robust,
antifragile and successful.
• When failed equipment parts were serviced by external vendors
• Whether repaired equipment had service duty specifications
• Who did the previous repairs or replacements
• Whether repaired equipment had been run using ACE 3T operating procedures
• Count of the number of events when the equipment was run overloaded
• The equipment repairs in which parts were drawn from our store
• The equipment repairs in which parts were purchased direct
• Whether equipment repairs were done to ACE 3T procedures
These measures let you target your process redesign to build more successful maintenance
and operating processes. As time goes by and data accumulate, you can develop additional
subcategories within the measures to focus on finding the specific process step that starts the
defects causing the repairs and breakdowns.
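
The counting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the work order records, their field names, and the category labels are all hypothetical, and a real tally would draw on your own maintenance system's corrective and breakdown work orders.

```python
from collections import Counter

# Hypothetical work-order records. The field names and category labels are
# assumptions for illustration, not taken from any real maintenance system.
work_orders = [
    {"id": "WO-001", "type": "breakdown",
     "categories": ["external vendor service", "parts from store"]},
    {"id": "WO-002", "type": "corrective",
     "categories": ["run overloaded"]},
    {"id": "WO-003", "type": "breakdown",
     "categories": ["external vendor service", "run overloaded"]},
    {"id": "WO-004", "type": "corrective",
     "categories": ["parts purchased direct"]},
]


def count_by_category(orders):
    """Count how many work orders fall under each life-cycle category."""
    counts = Counter()
    for order in orders:
        # set() ensures a category listed twice on one order counts once.
        counts.update(set(order["categories"]))
    return counts


category_counts = count_by_category(work_orders)

# Text bar chart, most frequent category first.
for category, count in category_counts.most_common():
    print(f"{category:<28} {'#' * count} ({count})")
```

The categories with the longest bars point to the life-cycle processes to review first. Subcategories can be added later simply by using finer labels on each work order.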
9.2 Behaviors of High-Reliability Organizations

The U.S. nuclear aircraft carrier fleet and nuclear submarine fleet are renowned worldwide as
high-reliability organizations. Starting with vision and leadership, it took a lot of consistent,
persistent effort, and some tragic failures to get there.
The nuclear submarine USS Thresher sank with 129 people on board on April 10, 1963.
Although the vessel was not recovered from its resting place 2.5 kilometers deep, the naval
investigation review board used photographic and retrieved evidence, along with laboratory
tests, to identify failed brazed pipe joints as the most likely cause of the incident.
The loss triggered a complete review of naval nuclear vessel design and operating procedures.
Even though the fleet’s equipment was built and maintained to high quality standards, and its
personnel had specialist technical training, the quality
control requirements became more demanding. Designs were simplified to remove complexity
and to behave in known ways. Quality control in manufacture was improved. Operating
practices became more stringent to remove the chance of variation. All crew members had to
reach expert status in their discipline and equipment if they were to remain on the ship or
submarine.
The organizational structure on U.S. nuclear fleet vessels is unusual. The crew members are
the experts in running the ship and keeping it safe; the officers are there to support the crew in
their efforts and to address issues that might reduce the crew’s effectiveness.
That structure makes the operating crew more important to the ship’s survival than the officers
—a true inverted organization with managers at the bottom working for the producers at the
top.
Central to the success of high-reliability organizations is the realization that everything can
go wrong. The only sure protection is to know exactly what is happening with the equipment
throughout the plant all the time. The equipment must be set up perfectly at the start and then
monitored to ensure that it behaves exactly as it should when it is used. What you don’t
understand, you don’t do yourself—instead, you get help from those who do know until you are
trained and expert enough to do the task. Human error is acknowledged and addressed
through teamwork, in which people help each other constantly and documented checks,
counter checks, and double checks are a way of life.
High-reliability organizations proactively control every process and every step in those
processes. Nothing is unimportant because consequential effects mean that the smallest risk
can be the start of the biggest catastrophe. This requires a dedication to diligence beyond what
people in commercial industry expect and are paid to do. High reliability cannot be bought with
money—it lives in the hearts and minds of people who want to be the best at what they do and
are respected by their peers and managers for that expertise because it is so valuable to the
success of the organization.
The U.S. nuclear fleet’s equipment is designed for simplicity, high reliability, and
maintainability. The business systems in use demand proof of compliance with best practices.
Its crews are educated to be a technical knowledge repository on their plant. Its people are
trained to act skillfully in a highly reliable manner.
The organization is structured to put knowledgeable experts immediately at the situation of risk
and danger and bring the power of teamwork into play. Those are key reasons why the U.S.
nuclear fleet is a high-reliability organization.
9.3 Limitations of Our Materials of Construction
We live in a probabilistic universe whose physics produces divergence and sudden change
in the way matter behaves when it reaches critical points.3 Unless the physics of a situation is
controlled, you can get sudden changes in the behavior of your materials of construction. The
failure of equipment parts and the resulting poor reliability and safety are direct results of
exceeding the physical and chemical boundaries of the materials of construction. Poor reliability
and poor safety are to be expected in organizations in which people do not know the limits of
their machinery and do not understand what is happening to the parts inside them.
People create high reliability when they know the engineering of their plant and process and
expertly keep their equipment parts well within the capability of the materials of construction.
The experience of high-reliability organizations is that equipment failure starts with poor
business system process control. The necessary systems and controls that produce high
reliability are either not present or not followed. Equipment failures then result from out-of-control
variation. The organization’s quality management system fails first, and then the equipment is
failed by the system. To fix management system failures, it is necessary to understand how
business processes can fail. By understanding how each process step can fail, you build in the
correct risk controls needed to achieve high reliability at each step.

You cause your own equipment reliability through the quality management systems that you
use and enforce. To get high reliability, the experience of high-reliability organizations tells us,
we must put into the business the processes, the specialist technical knowledge, and the right
activities done correctly that cause high reliability. Don’t begrudge drawing a process flow
diagram for each of your processes, and for each step in a process, to identify the hundreds of
ways the processes could fail. Risks can live anywhere, and you need to see all of the places
where your problems can start. Figure 20.3 traces a machine component manufacturing
process step down to the fundamental tasks and actions. The work flow details expose
opportunity for failure everywhere.

Once you go into the details of your own processes, you’ll see an enormous number of risks
you were not even aware of. The presence of those risks means that things can go wrong, and
they will with a frequency that is dependent on the designs of the processes used during the
life cycle and whether they were constructed to stop or prevent each risk from arising.
In an organization using a Plant Wellness Way system of reliability, the problems and
troubles caused by your processes are uncovered using Chance of Success Mapping to identify
the risks. At each step, you list what has gone wrong in the past and what could go wrong in the
future. For each risk, you develop mitigations to proactively prevent them from happening. You
compile a list of the changes needed to maximize each step’s chance of success, and that is
your plan to create a far more successful process. In this way, you design and build highly
successful operations and equipment without roots of failure inside them.
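
A step-by-step list of risks and mitigations like the one described above can be held in a simple data structure. The Python sketch below is a minimal illustration, not the Plant Wellness Way's own tool. The step names, risks, mitigations, and chance-of-success figures are hypothetical. Multiplying the step figures together to estimate the whole process is an assumption made here: it treats the steps as a series in which every step must succeed and no step affects another.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass
class Risk:
    description: str  # what has gone wrong or could go wrong
    mitigation: str   # proactive action that prevents it


@dataclass
class Step:
    name: str
    chance_of_success: float  # assumed estimate between 0 and 1
    risks: list = field(default_factory=list)


def process_chance_of_success(steps):
    """Multiply step chances, assuming steps are in series and independent."""
    return prod(step.chance_of_success for step in steps)


def improvement_plan(steps):
    """List every mitigation as (step name, risk, mitigation), in step order."""
    return [(s.name, r.description, r.mitigation) for s in steps for r in s.risks]


# Hypothetical process steps and figures, for illustration only.
steps = [
    Step("Receive bearing", 0.98,
         [Risk("Wrong part supplied", "Check part number against work order")]),
    Step("Fit bearing", 0.90,
         [Risk("Bearing damaged during fitting", "Use a documented fitting procedure"),
          Risk("Contamination during fitting", "Clean work area before unpacking")]),
    Step("Align shaft", 0.95,
         [Risk("Shaft misalignment", "Measure and record alignment against a tolerance")]),
]

overall = process_chance_of_success(steps)
plan = improvement_plan(steps)

print(f"Estimated chance of success for the whole process: {overall:.4f}")
for step_name, risk, mitigation in plan:
    print(f"{step_name}: {risk} -> {mitigation}")
```

Under the series assumption, the weakest step limits the whole process, so the mitigations for the step with the lowest figure are usually the first candidates for the plan.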
THE END