Rel Code
Published by:
Albany Interactive Pty Ltd
22 Honeywood Court
Samford, Queensland
Australia 4520
1. INTRODUCTION
1.1 Aim of RELCODE
The aim of RELCODE is to deal with several problems in the field of reliability and maintenance
management:
1.2 Methods
RELCODE uses well documented statistical and analytical methods which are widely accepted by
reliability and maintenance engineers. These include:
• Many recent extensions of Weibull analysis which take advantage of the power of the modern
computer
• Calculation of inspection intervals for hidden failures and for condition monitoring, as used for
example in Reliability Centered Maintenance
RELCODE is thus a valuable tool for the reliability or maintenance professional.
1.3 Database
RELCODE incorporates a database (using Microsoft Access) which is designed to hold the
information required to carry out reliability and replacement analysis on your data. This includes
data on age at failure and on known successful performance of items which have run without
failure. It also has provision for entering cost and usage data and for recording the current
reliability parameters and replacement policy for your items. As you enter your data it is
automatically stored in this database and is then easily retrieved for analysis or amendment using
familiar Windows methods. There are also options for importing and exporting data in formats
compatible with spreadsheets, and for importing data from earlier versions of RELCODE.
The failure of equipment usually can be described by one or more of the following types of failure
rate:
The WEIBULL distribution can identify any one of the above failure patterns, i.e. Decreasing,
Constant or Increasing failure rate, but if more than one pattern is present it produces an averaging of
the several patterns.
By the use of these models, RELCODE enables relevant failure patterns to be determined from the
user's data. The mean life of the item can be estimated, as well as the reliability to any age, and
confidence limits for reliability.
Identifying the type of failure rate pattern helps you to determine the root cause as to why items are
failing. EARLY LIFE failure is normally associated with manufacture or installation defects. Items
which have survived the early life failure or burn-in period are more reliable than the average new
item. For improvements in reliability, the manufacture or installation problems must be eliminated,
or a period of accelerated aging introduced in order to eliminate early failures in service.
RANDOM failures can indicate a cause of failure which is external to the item itself, and which
causes a stress in excess of the design strength. An example is a nail in a tire or a brick through a
window. RANDOM failures also occur in complex and electronic equipment where there are a very
large number of possible causes of failure.
WEAROUT failures are due to wear or fatigue and occur in many mechanical components such as
gears or valves or in perishable items such as seals and filters. They may be sudden in nature or may
take the form of a slow degradation which eventually reaches a level which is formally defined as a
failure. The ability to detect the onset of wearout when superimposed on Early Life or Random
failures is a feature of the Bi-Weibull distribution.
For the maintenance engineer, major concerns are the setting of preventive maintenance intervals
and the forecasting of spare parts requirements. RELCODE assists the user in evaluating the
following three replacement policy options:
1.6 The Weibull Distribution
A more flexible distribution is the Weibull distribution, which has a constant which allows it to
assume a variety of shapes. This constant is called "Beta" (β) and is termed the "Shape Parameter".
Figure 1.2 illustrates the form that the Weibull takes for a variety of Beta values.
Figure 1.2.
All probability distributions have their own equation. The Weibull's is:
It is not really important that we fully understand this equation, since RELCODE does all the
necessary calculations. The parameter eta (η) is called the characteristic life and is defined as the time
before which there is a 63.2% risk of failure. The 63.2% figure arises from the mathematics of the
Weibull equation, as (1 - (1/e)) where e is the base of natural logarithms. This is the value of the
Weibull cumulative distribution function when t = η. The distribution shown in Figure 1.2 is the Two-
Parameter Weibull, the two parameters being β and η.
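The 63.2% property can be checked numerically. The following is a minimal sketch, assuming the standard two-parameter Weibull cumulative distribution function F(t) = 1 - exp(-(t/η)^β); the function name and the value of eta are illustrative only and are not part of RELCODE.

```python
import math

def weibull_cdf(t, beta, eta):
    """Probability of failure before age t (standard two-parameter Weibull)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

# At t = eta the exponent is -1 whatever the shape parameter,
# so F(eta) = 1 - 1/e, which is about 0.632.
eta = 1000.0  # hypothetical characteristic life
values_at_eta = [weibull_cdf(eta, beta, eta) for beta in (0.5, 1.0, 2.0, 3.5)]
print([round(v, 4) for v in values_at_eta])
```

Whatever value of Beta is used, the risk of failure before the characteristic life is the same.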
Figure 1.3. Three Parameter Weibull Distribution
Reliability analysis sometimes requires a distribution with even more flexibility than the two or three
parameter Weibull. Authors such as Kao (1959), Clark (1991) and Hastings and Ang (1995) have
proposed distributions based on combining two (or possibly more) Weibull models in various ways.
Distributions of this type are referred to as bi-Weibull distributions. Bi-Weibull distributions can be
formed by adding Weibull probability density functions, multiplying reliability functions, adding hazard
functions, or in other ways, and can involve varying numbers of parameters.
In RELCODE we make use of a specific type of bi-Weibull distribution, described in Hastings and Ang
(1995) as the Hastings distribution. This is formed by adding together a Weibull Two Parameter and a
Weibull Three Parameter Hazard Function, as detailed in Chapter 12. This provides a distribution with
five parameters, which can represent combinations of any two failure patterns, that is any two of burn-
in, random and wearout patterns. This can mean, for example, burn-in followed by wearout, or random
followed by another random with a higher failure rate. This provides an extremely flexible distribution
model which can fit the vast majority of observed failure patterns. If the data do not require the full five
parameters, it will revert to a Weibull distribution. Here we shall refer to this distribution as the bi-
Weibull distribution, unless it is necessary to distinguish between this particular form of bi-Weibull and
others, when we shall refer to it as Hastings distribution, or Hastings bi-Weibull distribution.
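The idea of adding two hazard functions can be sketched as follows. This is an illustration under stated assumptions, not RELCODE's implementation: the parameter names (beta1, eta1, gamma, beta2, eta2) and the example values are hypothetical, and the standard Weibull hazard formulas are assumed.

```python
import math

def weibull_hazard(t, beta, eta, gamma=0.0):
    """Weibull hazard with minimum life gamma, for ages t > 0."""
    if t <= gamma:
        return 0.0  # this failure mode has not started yet
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1.0)

def bi_weibull_hazard(t, beta1, eta1, gamma, beta2, eta2):
    """Sum of a two-parameter and a three-parameter Weibull hazard."""
    if t <= 0:
        raise ValueError("age must be positive")
    return weibull_hazard(t, beta1, eta1) + weibull_hazard(t, beta2, eta2, gamma)

def bi_weibull_reliability(t, beta1, eta1, gamma, beta2, eta2):
    """Adding hazards is equivalent to multiplying the two reliabilities."""
    h1 = (t / eta1) ** beta1
    h2 = ((t - gamma) / eta2) ** beta2 if t > gamma else 0.0
    return math.exp(-(h1 + h2))

# Hypothetical example: random failures, with wearout starting at 3000 hours.
params = dict(beta1=1.0, eta1=10000.0, gamma=3000.0, beta2=3.0, eta2=2000.0)
print(bi_weibull_hazard(1000.0, **params), bi_weibull_hazard(4000.0, **params))
```

Before the minimum life of the second term the hazard is that of the first pattern alone; after it, the wearout term is superimposed.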
2. INSTALLING AND RUNNING RELCODE
a. Start Windows.
c. RELCODE will normally automatically start to install, but if it does not then click Start, Run.
At the Command prompt type X:\SETUP where X is the letter reference of your CD drive and
press the Enter key or click the OK button.
Click Start, Programs. Locate the Reliability Analysis program group and click it, or locate and
then double click the RELCODE icon.
Click the Start button to go to the Item Header Screen shown in Figure 2.2.
To exit from RELCODE click the Close button on the bottom right of the current screen. This will
take you back to a preceding screen. Continue until you reach the Start screen (Figure 2.1),
where you click the Exit button to exit RELCODE.
RELCODE is concerned with the analysis of the reliability of items which, when they fail, are
replaced by new items, or are repaired to a good-as-new standard. We shall take as an example a
drive belt on a crane hoist motor. The first thing to emphasize is that the data relating to the
component you are analyzing should be taken from records pertaining to the same kind of belt
working on the same kind of motor - in other words it should be as homogeneous as possible.
Let us assume that you have data which shows the total hours run by the crane at the time when drive
belt replacements occurred. To analyze the reliability of the belts, we need to know the age of the
individual belts when they were replaced, and whether the replacements were due to the belt failing or
for some other reason. We also need to know the ages of surviving belts which at the time of the
analysis have not failed but are still running. In other words, we need data on both the failures and
the successful performance of the components in question.
You should extract the data from your records into a worksheet in the style of Figure 3.1:
Odometer reading   Hours since previous event   Event
(hours)            (age of belt)
0
2634               2634                         F
9002               6368                         S
13791              4789                         F
17331              3540 (Surviving)             S
"F" means that the component was replaced as a result of failure - it was a "Failure Replacement".
The concept of “Failure” can include situations where the belt actually fails in service, and situations
where it is deemed to have deteriorated to a point where it is prudent to replace it. The latter situation is
a “Near Failure Condition” replacement in which the condition is such that the belt can be regarded as
having failed. The definition of a failure event is at the discretion of the user, in that you could choose to
analyze failures of only one particular type, or you could regard all sources of failure as “failures” for a
particular study. The key thing is that it is important to use a consistent definition for failure events in
any given data set.
"S" refers to "Suspensions". Suspensions or suspended items are items for which we have successful
performance data, but which have not failed nor reached a near failure condition. One source of
suspensions can be Preventive Replacements carried out not as a result of failure, or even near failure.
But there are other conditions where you would record "Suspensions", for example, if the crane were
taken out of service, or sold, or the engine were replaced with a new (or reconditioned) one, the old
drive belt being removed with the old engine. Another source of suspended items is items currently
running which have not failed. Thus, the last entry in Figure 3.1 is the current odometer reading for the
crane, and gives one more suspension "reading" on a time when the drive belt was still working. These
"suspensions" are an integral part of the RELCODE formula, and in fact allow the calculation of much
more accurate reliability and replacement solutions than would be possible if only failure data were
employed.
Getting back to your worksheet, you would continue organizing your data in the same way as in Figure
3.1 for other similar drive belts. As you can see, the age at each event is obtained by subtracting the
preceding odometer reading from the reading at that event.
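The worksheet arithmetic can be sketched in a few lines. This is only an illustration using the readings of Figure 3.1; the function and variable names are hypothetical.

```python
def ages_from_odometer(readings):
    """Age at each event = difference between consecutive odometer readings."""
    return [later - earlier for earlier, later in zip(readings, readings[1:])]

readings = [0, 2634, 9002, 13791, 17331]  # crane hours, from Figure 3.1
events = ["F", "S", "F", "S"]             # last entry is the belt still running

ages = ages_from_odometer(readings)
records = list(zip(ages, events))
print(records)  # [(2634, 'F'), (6368, 'S'), (4789, 'F'), (3540, 'S')]
```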
RELCODE analyses data relating to the failure or suspension of components or equipment. Up to 1008
entry records are allowed for a given type of item. Failure data about a component is quite simply the
time it took for it to fail or the distance it traveled before failure etc. (depending on the units involved).
Suspension data relates to those items which were withdrawn from use for some reason other than
failure, or continue in use without failure. For each component the following information is entered:
• The age of the item when it failed or was suspended (i.e. after how many hours or kilometers etc.).
• If several failures occurred at the same age then we can enter a frequency. Similarly, if several
suspensions occurred at the same age we can enter a frequency. The default frequency for any entry
is 1. If both failures and suspensions occurred at the same age, make two separate entries, one for
failures and one for suspensions.
Figure 3.2 shows an example of data suitable for input into RELCODE. This data represents more
extensive drive belt reliability data than that shown in Figure 3.1. RELCODE will accept the data in any
sequence by age. RELCODE sorts the data by age during the course of data entry.
3.4 Database
You can create a new RELCODE database by copying an existing one. For example, you can copy the
original database RELDAT97.MDB supplied with RELCODE. This contains a few demo items which
can optionally be deleted later. The new database can be opened by using the Change Database option
on the Item Header Screen.
Enter the Item Reference, up to 70 characters long, and the age unit, up to 10 characters long.
The age unit is the unit in which age at failure or suspension is measured. For example, the units might
be kilometers, hours, cycles, miles, or any other measure of usage. The name of the unit is purely
descriptive. The units appear in appropriate positions in the output. In our example the units are
hours. Other parameters such as an Item Description and various cost data can also be entered at this
time, or can be added later. Click the Ok button to save the Item in your database. To enter failure and
suspension data for this item, click the Event Data Entry button on the right hand side of the screen. The
Event Data Screen will then appear. This is illustrated in Figure 3.4.
If you are at the Item Header Screen, click the Event Data Entry button on the right hand side of the
screen. The Event Data Screen will then appear. This is illustrated in Figure 3.4.
The Record Numbers are created and maintained automatically. Each event record can contain data
indicating the occurrence of one or more failures at a stated age, or one or more suspensions at a stated
age.
New event records are created by entering data at the panel on the right of the Event Data screen. To
add a record, click the New Record Button and then enter data in the fields below it:
Then click the Save button and the data will be written into the grid on the left of the screen, or if you
decide not to save the data you have just entered, click Cancel. Data can be entered in any sequence by
age and will be automatically sorted when you exit this screen. The default entries are F for failure and 1
for frequency.
The age data can have up to six digits in front of the decimal point and are displayed by the computer
with two digits after the decimal point. In some applications it may be necessary to scale the data.
Age values of less than 0.01 are not accepted. If failures do occur at zero age, enter a small age value to
represent them.
To print the current data, click the Print Data button. Figure 3.5 shows the result for the current data.
Record   Age    Event   Frequency
1        1584   F       1
2        2634   F       1
3        3540   S       1
4        4136   F       2
5        4789   F       1
6        5579   F       1
7        6368   S       3
Once we have entered our data we can analyze it. To start the analysis we click the Analysis Menu
button on either the Item Header Screen or the Event Data Screen. The Analysis Menu Screen shown in
Figure 4.1 then appears.
To fit a distribution to our data we click the button labeled: “Model Optimization - Fit Distribution”,
shown in Figure 4.1. RELCODE then fits a range of distributions using a variety of techniques, and
produces results which, for the data of Chapter 3, are as shown in Figure 4.2. The various distributions
and fitting methods are referred to as Models 1 to 6. RELCODE also recommends a preferred model at
the bottom of the screen (above the command buttons). In Figure 4.2 we see that the preferred model is
a Weibull 2 Parameter distribution, with parameters obtained by Maximum Model Accuracy (Model 3).
Data Summary. The first part of Figure 4.3 summarises our data, giving the numbers of failures and
suspensions, and the total of failures and suspensions.
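The counts in a data summary of this kind follow directly from the event records and their frequencies. The sketch below assumes the seven records printed in Figure 3.5 make up the whole data set; the variable names are illustrative.

```python
# (age, event, frequency) as listed in Figure 3.5
records = [
    (1584, "F", 1), (2634, "F", 1), (3540, "S", 1), (4136, "F", 2),
    (4789, "F", 1), (5579, "F", 1), (6368, "S", 3),
]

# Each record counts once per unit of frequency.
failures = sum(freq for _, event, freq in records if event == "F")
suspensions = sum(freq for _, event, freq in records if event == "S")
total = failures + suspensions
print(failures, suspensions, total)  # 6 4 10
```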
Fitted Parameters. The second part shows the parameters of the fitted distribution. In this case
these are a Shape Parameter (BETA) of 2.17 and a Scale Parameter (ETA) of 5827.32. Also shown is
the value of the Mean Life (or MTBF) which is 5160.70. This is the average life which can be
expected from the drive belts, as calculated from the distribution model. Also, we get the
Characteristic Life which is the expected age to which 36.8% of items will survive and the standard
deviation which is a measure of the spread of the failure ages.
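The relationship between the fitted parameters and the Mean Life can be sketched using the standard Weibull moment formulas, which are assumed here rather than taken from RELCODE itself. With the rounded parameters quoted above, the mean comes out very close to the 5160.70 shown.

```python
import math

beta, eta = 2.17, 5827.32  # fitted parameters quoted in the text

# Standard two-parameter Weibull moments (assumed formulas).
mean_life = eta * math.gamma(1.0 + 1.0 / beta)
variance = eta ** 2 * (
    math.gamma(1.0 + 2.0 / beta) - math.gamma(1.0 + 1.0 / beta) ** 2
)
std_dev = math.sqrt(variance)

# Proportion of items surviving to the characteristic life: exp(-1), about 36.8%.
survival_at_eta = math.exp(-1.0)
print(round(mean_life, 1), round(std_dev, 1), round(survival_at_eta, 3))
```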
Goodness of Fit Test. The third part of Figure 4.3 addresses the question of whether the fitted
distribution provides a good fit, in a statistical sense, to the data. The test used is the Hastings-Ang
Model Accuracy test, described in Ang and Hastings (1994). This article also discusses the merits of
Model Accuracy Test. The "Goodness of Fit" section shows the Model Accuracy, which is defined as
100% × (1 − D), where D is the average (root mean square) distance of the points from the fitted line
on a linear probability scale. A Model Accuracy of 100% means that all the points lie exactly on the
fitted line, i.e. a perfect fit. In this case the Model Accuracy is 99.46%. This means that the average
(root mean square) distance of the points from the line is 0.0054 or 0.54% on a linear probability scale.
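The link between the Model Accuracy and the root mean square distance can be sketched as below. This assumes the accuracy is 100% times one minus the RMS distance, which is consistent with the 99.46% and 0.0054 figures above. How RELCODE obtains the observed probability for each point is not described here, so the observed values are simply passed in, and the numbers used are hypothetical.

```python
import math

def model_accuracy(observed, fitted):
    """100 * (1 - RMS distance) between observed and fitted probabilities (0 to 1)."""
    squared = [(o - f) ** 2 for o, f in zip(observed, fitted)]
    rms = math.sqrt(sum(squared) / len(squared))
    return 100.0 * (1.0 - rms)

# Hypothetical points: every point is 0.0054 away from the fitted curve.
observed = [0.5000, 0.5000]
fitted = [0.5054, 0.4946]
print(round(model_accuracy(observed, fitted), 2))  # 99.46
```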
Figure 4.3 shows the critical values of model accuracy given by the Hastings-Ang test, at various
confidence levels. The critical value at any given level of confidence is the value of model accuracy
which would cause us to reject the model with that level of confidence. The critical value at the 80%
Confidence Level is 90.96%. If the observed Model Accuracy is less than 90.96%, we can say with 80%
confidence that the model given in Figure 4.3 does not fit the data. In this example the observed Model
Accuracy is 99.46%, and we do not reject the hypothesis that the model fits the data. Corresponding
statements can be made at the 90%, 95% and 99% confidence levels using the values shown.
As a qualitative measure, Model Accuracies are graded as follows in relation to the critical values. Note
that the critical values decrease as the confidence levels increase.
Thus, in the example, the Model Accuracy is Good. We conclude that the Weibull distribution, with the
parameters shown, provides a good fit to the data in this example. RELCODE incorporates extended
• Indicate a preferred model and allow us to select this model or any of the other models if we
wish,
Thus we determine an appropriate distribution model for our data, an indication of whether the model
is statistically valid, and also the parameters of our model which may already contain valuable
information. In the case where the result is a two parameter Weibull distribution we can infer the
shape of the distribution and the corresponding failure rate from the Shape Parameter, beta. If beta is
less than 1 this indicates burn-in failures, if it is equal to or close to 1 it indicates random failures and
if it is greater than 1, as in this case, it indicates wearout failures. The Mean Life or MTBF is also
estimated.
Thus we conclude that for the given data we have wearout failures and a Mean Life estimated at
5160 hours.
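The interpretation of the Shape Parameter can be written as a simple rule. In this sketch the tolerance used to decide what counts as "close to 1" is an arbitrary assumption chosen for illustration; it is not a value used by RELCODE.

```python
def failure_pattern(beta, tolerance=0.1):
    """Classify a two-parameter Weibull shape parameter.

    The tolerance for 'close to 1' is an illustrative assumption.
    """
    if abs(beta - 1.0) <= tolerance:
        return "random"
    return "burn-in" if beta < 1.0 else "wearout"

print(failure_pattern(2.17))  # wearout
```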
Further insight into the reliability of our items can be obtained from the graphs which are discussed
in the next chapter. To view these graphs, click the Graphs Menu button at the bottom of the
parameters screen, Figure 4.3.
Model 1 fits the two parameter Weibull by transforming the data to Weibull probability scales and then
using linear regression to fit a straight line. This is equivalent to the probability paper method, widely
used for manual calculations.
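The regression approach can be sketched as follows. This is a minimal illustration, not RELCODE's code: it assumes the cumulative failure probability for each point is already known (the treatment of suspensions and plotting positions is not covered), and it regresses the transformed probability on log age, which is one common convention.

```python
import math

def fit_weibull_by_regression(ages, cum_probs):
    """Least-squares line through (ln t, ln(-ln(1 - F))); returns (beta, eta)."""
    xs = [math.log(t) for t in ages]
    ys = [math.log(-math.log(1.0 - f)) for f in cum_probs]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta = sxy / sxx                  # slope of the fitted line
    intercept = y_bar - beta * x_bar  # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta

# Synthetic check: points lying exactly on a Weibull with beta = 2, eta = 1000.
true_beta, true_eta = 2.0, 1000.0
ages = [200.0, 500.0, 900.0, 1400.0]
cum_probs = [1.0 - math.exp(-((t / true_eta) ** true_beta)) for t in ages]
beta_hat, eta_hat = fit_weibull_by_regression(ages, cum_probs)
print(round(beta_hat, 3), round(eta_hat, 1))
```

Because the synthetic points lie exactly on a Weibull line, the regression recovers the parameters used to generate them.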
Model 4 fits the three parameter Weibull by subtracting a minimum life value, transforming the data to
Weibull probability scales and then using linear regression to fit a straight line. This is equivalent to the
probability paper method for the three parameter Weibull distribution. The best value of the minimum
life is found by a search process.
Model Accuracy. Models 3, 5 and 6 fit the data using the Model Accuracy method. This method
searches for distribution parameters such that the root mean square distance of the observed points
from the fitted line is as small as possible, on linear scales of age and reliability. By contrast, the
Linear Regression method (despite its name) uses the highly non-linear Weibull probability paper
scales. The Linear Regression method can be unduly influenced by points at very low or very high
probability levels where the scale of the graph is very spread out. The Model Accuracy method
applies uniform scaling in seeking a curve which minimises the average (root mean square) distance
of the points from the line. The Model Accuracy method is therefore recommended, but the Linear
Regression models are available because they emulate the widely used graph paper methods.
Model 3 fits the two parameter Weibull distribution using the Model Accuracy method.
Model 5 fits the three parameter Weibull distribution using the Model Accuracy method.
Model 6 fits the bi-Weibull distribution using the Model Accuracy method.
RELCODE selects the preferred model on the basis of Relative Model Quality. This uses the Model
Accuracy for the model, but does not simply choose the model with the highest accuracy. This is
because a model with more parameters can be expected to give higher accuracy than a model with
fewer parameters even when no real improvement exists. To allow for this, a parameter factor is
introduced. The Parameter Factor is an allowance for the number of parameters estimated. This
factor has been determined by simulation, and is the average improvement in model accuracy
obtained when fitting the same data with 3 or 5 parameters as opposed to 2 parameters.
RELCODE selects the preferred model by calculating the relative model quality as follows:
The 2 parameter Weibull model obtained by linear regression (Model 1) is taken as a reference point
and will always have a relative model quality of zero. The model with the highest relative model
quality is the preferred model. In the event of a tie, the lowest numbered model is chosen.
RELCODE will present your reliability data in a number of graphical forms. Click the Graphs
Menu button on the bottom of the Parameters Screen (Figure 4.3) and the Graphs Menu shown in
Figure 5.1 will appear. Five different graphs can then be obtained by clicking the option buttons
on the Graphs Menu.
The Reliability Function is a plot of Reliability, shown on the vertical scale as a percentage ranging
from 100% to 0%, against Age. From the graph we can read off the reliability to any selected age, or the
age for which we have any selected reliability value. For example, we can find the age to which the
reliability level is 90% (known as the B10 Life). To find this, enter the graph from the left at the 90%
level. The corresponding age is approximately 2100 hours, and this is the B10 Life for the drive belts.
To return to the Graphs Menu, click Other Graphs. To return to the distribution parameters screen click
Close.
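The B10 Life read from the graph can also be calculated directly. The sketch below assumes the two-parameter Weibull reliability function R(t) = exp(-(t/η)^β) and uses the fitted parameters from Chapter 4; the function name is illustrative.

```python
import math

def age_at_reliability(reliability, beta, eta):
    """Age t at which R(t) = exp(-(t/eta)**beta) equals the given reliability."""
    return eta * (-math.log(reliability)) ** (1.0 / beta)

b10_life = age_at_reliability(0.90, beta=2.17, eta=5827.32)
print(round(b10_life))
```

The result is roughly 2070 hours, consistent with the value of approximately 2100 hours read from the graph.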
The Weibull plot has the component age, t, on a logarithmic horizontal scale. The vertical scale is the
cumulative probability of failure, F(t), transformed to ln(ln(1/(1-F(t)))). The data are plotted on these
scales and a straight line is drawn through them to represent a Weibull distribution. The equation of the
line corresponds to the parameters of the currently selected distribution model.
The next graph menu option is the Probability Density Function shown in Figure 5.5. The
probability density function has the property that the area under the curve between any two ages
gives the probability that a new item will fail in that age range. Note that the vertical scale is
failures per 10000 hours. The horizontal and vertical scaling are carried out automatically by
RELCODE.
The next option on the Graphs Menu is the Hazard Function. This shows the failure rate among
items which have survived to any given age. Figure 5.6 illustrates this.
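The probability density and hazard functions can be sketched for the two-parameter Weibull case. The formulas are the standard ones and are assumed here; the ages chosen are arbitrary, and the results are scaled to "per 10000 hours" to match the graph scale mentioned above.

```python
import math

def weibull_reliability(t, beta, eta):
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """Failure rate among items that have survived to age t."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def weibull_pdf(t, beta, eta):
    """Density = hazard at age t times the proportion surviving to age t."""
    return weibull_hazard(t, beta, eta) * weibull_reliability(t, beta, eta)

beta, eta = 2.17, 5827.32  # fitted parameters from Chapter 4
for age in (1000, 3000, 5000):
    print(age,
          round(weibull_pdf(age, beta, eta) * 10000, 3),
          round(weibull_hazard(age, beta, eta) * 10000, 3))
```

With Beta greater than 1 the hazard rises steadily with age, which is the wearout pattern.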
This concludes the reliability distribution graphs. To return to the Graphs Menu, click Other Graphs.
To return to the distribution parameters screen click Close.
6.1 Background
A common problem for maintenance managers is to determine a policy to adopt in regard to the
replacement of components which do, or may, fail. The appropriate policy will depend on such factors
as:
• The reliability of the component as a function of operating life, and in particular whether wearout
occurs
• Whether a practical method of condition monitoring exists for the component, how effective it is in
predicting failure and what options it may provide for component replacement
• The costs arising if we need to replace a component at an inconvenient time as the result of actual
failure, or the detection of a near-failure condition
• The costs associated with replacement of the component before failure on an age basis, at a
convenient time, e.g. at a routine maintenance time.
• Safety considerations
• Other possible maintenance actions such as overhaul or cannibalisation and the extent to which
these restore the component to "as good as new" condition.
In our analysis of replacement policies in this chapter we shall consider two types of situations in which
component replacements occur. We shall refer to these as Failure Replacement and Age Based
Preventive Replacement. The definitions of these are as follows:
(a) Failure Replacement. A failure replacement is a replacement which occurs following the failure
of a component in service, or following the identification of an unfavorable condition which
leads us to promptly replace the component, within a short time of the condition being detected.
We could refer to this as a “failure or near-failure condition” replacement, however, the term
“failure” replacement will be used.
(b) Age Based Preventive Replacement. An age based preventive replacement is replacement of a
component which has not failed, but which has reached an age deemed appropriate for
preventive replacement. An example where this occurs is with “lifed” components in aircraft.
Under a policy of replacement only on failure, only failure replacements are carried out and there is
no preventive replacement. This includes near-failure-condition replacements.
Note that under both the Age-Based and Block Replacement Policies, some failure
replacements will occur.
Age Based preventive replacement can only be worthwhile if two conditions hold:
(a) The failure rate of the components is increasing, or will increase before another age based
preventive replacement opportunity occurs
(b) The cost of failure replacement is greater than the cost of age based preventive replacement.
Thus age based preventive replacement is not appropriate if the failure rate (hazard function) is
decreasing (Burn-In Failures) or constant (Random Failures), because the new replacement item will not
be any more reliable than the one it replaces. It is important to analyse data to determine whether
wearout is present before jumping to the conclusion that age based preventive replacement is
appropriate.
Even if wearout occurs, the choice of policy will depend also on the cost of age based preventive
replacement being less than the cost of failure replacement. Age based preventive replacement policies
result in loss of useful life of the components which are removed before failure. For preventive
replacement to be worthwhile, this loss must be more than compensated by cost savings resulting from
fewer failure replacements. This can only occur if failure replacements are expensive when compared
to preventive replacements. The determination of an optimal (i.e. minimum cost) policy will depend on
a trade off between these factors.
The cost of making an age based preventive replacement is usually less than the cost of failure
replacement. This is because we can arrange for age based preventive replacements to be made at a
prearranged time so as to avoid loss of production. Also, if age based preventive replacement of a given
type of component is carried out as part of a routine service or overhaul, the repair cost tends to be
reduced as the replacement can be done as part of the other work.
Where a good condition monitoring technique is applicable, this tends to work against age based
preventive replacement. A good condition monitoring technique will have the following characteristics:
• Not prone to giving false alarms, that is, indicating imminent failure when in fact the component
will last for a considerable time
• Gives a consistent indication of the Delay Time, that is the time between potential failure being
indicated and actual failure occurring.
If a good condition monitoring system is available, the cost of Failure Replacement (which includes
near-failure-condition replacements) will be reduced. For example, if a bearing can be monitored
regularly and a condition (such as a vibration level) can be identified which accurately predicts when a
failure will occur within a few days, then we can make a replacement the day after detecting this
condition. This may not be as cheap as an age based preventive replacement, but may be cheaper than
an actual in-service failure.
• the long run average cost for any age-based replacement policy specified by the user;
• the long run average cost for any block replacement policy specified by the user;
• the long run average cost for a policy of replacement only on failure.
In the previous section we saw that the question of the costs of failure replacement and preventive
replacement must be addressed if we are to determine the best replacement policy. Before considering
how to determine an optimal policy, we shall look in more detail at the cost factors involved.
a) The cost of the component itself, as purchased from the supplier, including taxes where relevant;
d) Inventory carrying costs, e.g. cost of capital tied up, warehouse costs, insurance;
e) Exceptionally, costs may be lowered if available spares are excessive in quantity for some reason,
e.g. by cannibalisation
g) Cost of secondary damage which may be caused when the component under consideration fails
Of the cost factors just outlined, item (f), the cost of lost production, etc., may be the most difficult to
estimate. The extent of lost production could vary considerably depending on just when a replacement
occurs. In practice we may need to make a management judgement on the figure we use here. An
advantage of RELCODE is that it is easy to carry out a "what if" analysis with different cost figures and
see what effect this would have on the replacement policy.
For the age based policy it will be necessary to keep records of the ages of particular components. For
the block policy, and the replace only on failure policy, the ages of particular components are not
needed.
Before starting replacement analysis we must first have entered our event data and determined the
life distribution parameters.
Also, replacement analysis requires some additional data (e.g. costs) which can either be entered at
the Item Header Screen or in response to prompts in the course of the Replacement Analysis. In
this introduction we shall enter this data in response to prompts in the course of the analysis, and
conclude by illustrating this data at the Item Header Screen.
To start replacement analysis, click the Replacement Analysis button on either the Analysis Menu
Screen (Figure 4.1) or the Weibull or Bi-Weibull Parameter Screen (Figure 4.3). When we select
the Replacement Analysis Menu, if we have not previously entered our Cost of Failure and Cost of
Preventive Action data, we will be prompted for them, as shown in Figure 6.1. The theoretical
background to the data requirements and calculation of minimum cost replacement policies is given in
Chapter 15.
The values entered in Figure 6.1 are a Failure Replacement cost of $1000 and Preventive
Replacement cost of $110.
The dotted horizontal line is drawn at the cost of a policy of replacement only on failure. This gives a
visual indication of the relative savings (if any) from a preventive replacement policy. The vertical
scale will be automatically adjusted to show reasonable numeric values. We can see from the graph
that the minimum costs occur with preventive replacement at about 2000 hours, when the cost is
about $10 per 100 hours. The panel shows the exact values.
This option calculates the cheapest replacement age. It compares the cost per age unit for replacement
only on failure with the cost of a policy of replacing individual components when they reach a certain
age, known as the preventive replacement age. The preventive replacement age is varied using a
search procedure to try to find the value of the preventive replacement age which minimizes costs,
and whether the resulting costs are cheaper than those associated with a policy of replacement only on
failure. The results are in the form shown in Figure 6.4. If a policy of replacement only on failure is
the cheapest then the output will show this.
In the present example, the cheapest age-based replacement policy is preventive replacement at 2080
hours. The cost is then $0.0995 per hour, or $9.95 per 100 hours.
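As an illustration of the search described above, the following Python sketch evaluates the cost per hour of an age-based policy and scans a grid of candidate ages for the cheapest one. It assumes the standard age replacement cost model (expected cost per cycle divided by expected cycle length) and a two parameter Weibull life distribution. The function names and the values BETA = 2.0 and ETA = 5000 hours are hypothetical and are not the parameters fitted by RELCODE, so the output will not reproduce Figure 6.4.

```python
import math


def weibull_reliability(t, beta, eta):
    """Probability that an item survives to age t (two parameter Weibull)."""
    return math.exp(-((t / eta) ** beta))


def truncated_mean_life(tp, beta, eta, steps=2000):
    """Integral of R(t) from 0 to tp, by the trapezoidal rule."""
    h = tp / steps
    total = 0.5 * (1.0 + weibull_reliability(tp, beta, eta))  # R(0) = 1
    for i in range(1, steps):
        total += weibull_reliability(i * h, beta, eta)
    return total * h


def age_cost_rate(tp, cost_failure, cost_preventive, beta, eta):
    """Average cost per unit age when items are replaced at age tp or on failure."""
    r = weibull_reliability(tp, beta, eta)
    cost_per_cycle = cost_preventive * r + cost_failure * (1.0 - r)
    return cost_per_cycle / truncated_mean_life(tp, beta, eta)


def cheapest_age(candidate_ages, cost_failure, cost_preventive, beta, eta):
    """Candidate age with the lowest cost rate (simple grid search)."""
    return min(
        candidate_ages,
        key=lambda tp: age_cost_rate(tp, cost_failure, cost_preventive, beta, eta),
    )


# Hypothetical Weibull parameters, NOT the values fitted in this manual.
BETA, ETA = 2.0, 5000.0
COST_FAILURE, COST_PREVENTIVE = 1000.0, 110.0  # costs entered in Figure 6.1

best_age = cheapest_age(range(100, 10001, 100), COST_FAILURE, COST_PREVENTIVE, BETA, ETA)
best_rate = age_cost_rate(best_age, COST_FAILURE, COST_PREVENTIVE, BETA, ETA)
# Replacement only on failure: one failure cost per mean life.
failure_only_rate = COST_FAILURE / (ETA * math.gamma(1.0 + 1.0 / BETA))

print(f"Cheapest age: {best_age} hours at ${best_rate:.4f} per hour")
print(f"Replacement only on failure: ${failure_only_rate:.4f} per hour")
```

The comparison with failure_only_rate corresponds to the dotted horizontal line on the cost graph: preventive replacement is only worthwhile when the best cost rate falls below it.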
The results given by RELCODE for this and other replacement ages are rounded to multiples of a
scale factor which is automatically calculated. The scale factor is the single column increment used in
Figure 6.3 and similar graphs.
If we want to know the Cheapest Age-Based Replacement Policy for a range of replacement costs, we
can go back to the Item Header Screen and alter the costs as often as we wish.
The third option at the Planned Replacement Menu (Figure 6.2) calculates the average cost per unit time
for any user specified value of the preventive replacement age. Initially we are asked to enter the
specified preventive replacement age, as shown in Figure 6.5. In the example the optimal preventive
replacement age, as shown in Figure 6.4, is 2080 hours. This is an odd amount and it is more likely that
we would want to specify the preventive replacement age as a round figure, reasonably close to the
optimal value. For example, we may choose 2000 hours as the specified replacement age. We therefore
enter 2000 as the specified age in Figure 6.5.
When we click OK, RELCODE calculates the costs for this preventive replacement age and also the
proportion of preventive replacements and the proportion of failure replacements. The results are
shown in Figure 6.6.
By returning to this option, we can compare the costs for two or more preventive replacement ages
and see how significant the difference in cost is. The advantage of being able to do this is that, for
example, we may carry out a major service at 2500 hours and it would be more convenient to replace
the drive belts at 2500 hours rather than at 2080 or 2000 hours. We can compare the
various policies in terms of both cost and the proportion of failure replacements.
The aim of this analysis is to estimate the number of replacement parts that will be required to cover
both failure replacements and preventive replacements for a given annual component utilization.
The calculation is based on an assumption of steady state average conditions and in practice it may be
prudent to carry more spare parts as a safety stock and as an allowance for transient effects.
For example, suppose that each crane has two similar drive belts, and that we operate a fleet of 15
such cranes. This means that the number of components “at risk” will be 2 x 15 = 30. Suppose also
that the cranes have an average utilization of 2500 hours per year. The analysis will calculate the
average number of replacement components needed when preventive replacement occurs at a specified
age.
We are asked to enter the specified preventive replacement age along with the number of components
at risk and the annual component utilization. The total annual requirement for replacement
components under the current replacement policy is then calculated. Figure 6.7 shows the data entry
prompts.
Figure 6.8 shows the results of the calculations. We see that, for our current data, and for a
preventive replacement age of 2000 hours, the steady state average annual requirement for spare parts
will total 38.85, or 39 in round figures, of which 35.26 (35 if rounded) will be preventive
replacements and 3.60 (4 if rounded) will be failure replacements. This also gives us a
figure for the expected number of in-service failures per annum under this policy.
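The steady state calculation can be sketched in Python as follows. This is a minimal illustration which assumes that the annual requirement equals the total component hours per year divided by the average life achieved per component under the policy, with a two parameter Weibull life distribution. The function names and the Weibull parameters BETA = 2.0 and ETA = 5000 hours are hypothetical, so the printed numbers will differ from those in Figure 6.8.

```python
import math


def weibull_reliability(t, beta, eta):
    """Probability that an item survives to age t (two parameter Weibull)."""
    return math.exp(-((t / eta) ** beta))


def truncated_mean_life(tp, beta, eta, steps=2000):
    """Average life achieved per component when replaced at age tp or on failure."""
    h = tp / steps
    total = 0.5 * (1.0 + weibull_reliability(tp, beta, eta))  # R(0) = 1
    for i in range(1, steps):
        total += weibull_reliability(i * h, beta, eta)
    return total * h


def annual_spares(tp, components_at_risk, annual_utilization, beta, eta):
    """Return (total, preventive, failure) replacements per year in steady state."""
    component_hours = components_at_risk * annual_utilization
    total = component_hours / truncated_mean_life(tp, beta, eta)
    survive = weibull_reliability(tp, beta, eta)  # proportion reaching age tp
    return total, total * survive, total * (1.0 - survive)


# 2 belts x 15 cranes = 30 components at risk, 2500 hours per year (from the text).
# BETA and ETA are hypothetical, NOT the fitted Drive Belt parameters.
total, preventive, failures = annual_spares(2000.0, 30, 2500.0, beta=2.0, eta=5000.0)
print(f"Total {total:.2f}, preventive {preventive:.2f}, failure {failures:.2f} per year")
```

Because the result is a long run average, it makes no allowance for safety stock or for transient effects when a policy is first introduced.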
6.13 Conclusion
In this chapter we have seen how RELCODE helps us to analyse our options in relation to
preventive replacement of components, and in particular, how we can:
• Graphically display the relationship between cost and preventive replacement age (Figure 6.3)
• Calculate the minimum cost age based replacement policy (Figure 6.4)
• Calculate costs for any preventive replacement age which we specify (Figure 6.6)
• Calculate annual steady state average replacement parts requirements for any selected policy,
and for the policy of replace only on failure (Figure 6.8).
We can save the cost and replacement age parameters which we have entered by clicking the Save
Parameters button on the results screens, such as Figure 6.8. The parameters are then saved in the
database and also appear on the Item Header Screen as shown in Figure 6.9.
Figure 6.8 Spare Parts Annual Requirement for Age Replacement Policy
Under a block replacement policy, all components are replaced simultaneously - in a block, at certain
intervals of time. Items which fail in between the block replacement times are replaced when they fail
(or reach an identified near-failure-condition), these being failure replacements. At the time of block
replacement, all items are replaced including those which have been subject to failure replacement. We
refer to the time between block replacements as the block replacement interval.
The block replacements are preventive replacements. The cost of a block preventive replacement may
differ from that of an age based preventive replacement. Usually a block replacement will be cheaper
(per component replaced) because there are economies of scale in doing many replacements at the same
time.
In our example, let the cost of block preventive replacement be $60 per item. We set this value at the
Item Header Screen, by selecting the relevant item, amending the Preventive Action Cost field (bottom
left part of screen) and clicking the OK button. Figure 7.1 shows the result. At this stage the other
data are the same as in Chapter 6; in particular, the cost of failure replacement is still $1000.
Block replacement policies can be analyzed using the “Block Based Replacement Policies” options on
the Replacement Analysis Menu shown in Figure 6.2.
Figure 7.1 Item Header Screen with Amended Preventive Action Cost
This option will calculate the cost for a policy of replacement only on failure and the cost of block
replacement. If block replacement is cheaper, then the cheapest block replacement interval is found.
The results are in the form shown in Figure 7.3.
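A block policy can be costed in a similar way to an age-based policy, except that the expected number of failures within a block interval is needed. The Python sketch below assumes the standard block replacement cost model, in which the cost per item per unit time is the block replacement cost plus the failure cost multiplied by the expected number of failures in the interval, all divided by the interval. The expected number of failures is approximated by a simple discretized renewal equation. The function names, the step count and the Weibull parameters are hypothetical and this is not necessarily the method used by RELCODE.

```python
import math


def weibull_cdf(t, beta, eta):
    """Cumulative probability of failure by age t (two parameter Weibull)."""
    return 1.0 - math.exp(-((t / eta) ** beta))


def expected_failures(interval, beta, eta, steps=200):
    """Approximate expected failures per item position in (0, interval].

    Discretized renewal equation: a first failure in step j contributes one
    failure plus the expected failures of the new item in the remaining time.
    """
    h = interval / steps
    cdf = [weibull_cdf(i * h, beta, eta) for i in range(steps + 1)]
    m = [0.0] * (steps + 1)
    for i in range(1, steps + 1):
        m[i] = sum((1.0 + m[i - j]) * (cdf[j] - cdf[j - 1]) for j in range(1, i + 1))
    return m[steps]


def block_cost_rate(interval, cost_failure, cost_block, beta, eta):
    """Average cost per item per unit time for a given block replacement interval."""
    return (cost_block + cost_failure * expected_failures(interval, beta, eta)) / interval


# Costs from the example: $1000 per failure, $60 per item for block replacement.
# BETA and ETA are hypothetical, NOT the fitted Drive Belt parameters.
BETA, ETA = 2.0, 5000.0
candidates = range(250, 5001, 250)
best_interval = min(candidates, key=lambda t: block_cost_rate(t, 1000.0, 60.0, BETA, ETA))
best_rate = block_cost_rate(best_interval, 1000.0, 60.0, BETA, ETA)
print(f"Cheapest block interval: {best_interval} hours at ${best_rate:.4f} per hour")
```

The discretization slightly understates the number of failures, since at most one failure is counted per time step, so a finer step gives a closer approximation at the price of a longer run time.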
This option calculates the average cost per unit time for any value of the block replacement interval
entered by the user. In this case RELCODE first prompts for the specified block replacement interval.
In the example the optimal block replacement interval was 1522.32 hours, as shown in Figures 7.2 and
7.3. In practice we may wish to pick a round number close to this value, such as 1500 hours. We enter
this in response to the prompt as shown in Figure 7.4.
The analysis is similar to that for the age-based policy described in Section 6.12. When we click the
“Spare Parts Requirements - Block Policy” button at Figure 6.2, we are asked to enter the specified
block replacement interval along with the number of components at risk and the annual utilization per
component.
The total number of components needed per annum is then calculated. Figure 7.6 shows the data
entry prompts. Figure 7.7 shows the result.
The median rank points are such that we are 50% confident that the reliability is greater or less than
the median rank value. Statistical theory allows us to place other points on the reliability graph, which
correspond to percentages other than 50%. For example, at a given failure age, we can calculate a
probability such that we are 95% confident that the reliability is greater than that value. This is a 95%
lower confidence limit for the reliability at the corresponding age.
Confidence limits are values such that we are confident at some stated level (e.g. 95% confidence) that the value
taken by a variable, if a trial is repeated, will lie in a certain range. The “certain range” depends on the numbers
involved, and whether we are talking about “upper”, “lower” or “two-sided” confidence limits.
The larger the sample and the more failures we have observed, the tighter the confidence limits will be.
Conversely, if we have a small sample or very few failures, the confidence limits will be wide. We cannot make
any absolute statement about probabilistic reliability, only that we have a certain level of confidence of a certain
level of reliability, e.g. 90% confidence of 90% reliability.
We shall continue with the example for which the data is given in Figure 3.2. For this data, the
points plotted in Figure 5.2 are the median rank or best point estimates of the reliability against
age. At the Analysis Menu, Figure 4.1, click the “Confidence Limits for Reliability” button. The
screen shown in Figure 8.1 will appear.
In Figure 8.1, look at the table in the centre, and in particular at the columns headed “Age
(Hours)” and “Median Rank”. These are the values of Age and Reliability at which the points are
plotted in Figure 5.1. For example, the second row of the table in Figure 8.1 has an Age of 2634
hours and a Median Rank of 0.84,
and this corresponds to the second point from the left in Figure 5.2. Note that Figure 5.2 works in
percentages, whereas Figure 8.1 works in probabilities, that is 0 to 1, so that 0.84 in Figure 8.1
corresponds to 84% in Figure 5.2.
Figure 8.1 gives confidence limits for the reliability at all the failure ages in our data. For
example, in row 2 of the table, under the heading “Lower One Sided” in the 95% column is the
value 0.61. This means that we are 95% confident that the reliability to 2634 hours is greater than
0.61. Under the heading “Upper One Sided” we see the value 0.97. This means we are 95%
confident that the reliability to 2634 hours is less than 0.97. Values in the other columns are
confidence limits at 99% and 90% levels.
The values in the table in Figure 8.1 are also known as “Ranks”. The Median Ranks are the
central values which are exceeded with probability 50%. Ranks can also be tabulated in terms of
We refer to these confidence limits as distribution free, because no assumptions about the form of
the life distribution are required to calculate them.
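Ranks of this kind can be computed directly from the binomial distribution. The Python sketch below is a minimal illustration for a complete sample, that is, one with no suspensions. It finds the cumulative failure probability at which the chance of having seen at least i failures out of n equals the chosen level. The function names are hypothetical, and the treatment of suspended items used by RELCODE is not covered here.

```python
import math


def prob_at_least(p, i, n):
    """Probability that at least i of n items have failed, each failing with probability p."""
    return sum(math.comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(i, n + 1))


def rank(i, n, level):
    """Rank of the i-th failure out of n at the given level (0.5 gives the median rank).

    Solved by bisection, since prob_at_least increases steadily with p.
    """
    low, high = 0.0, 1.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if prob_at_least(mid, i, n) < level:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def reliability_limits(i, n, confidence=0.95):
    """Return (lower limit, median rank, upper limit) for the reliability at the i-th failure."""
    lower = 1.0 - rank(i, n, confidence)
    median = 1.0 - rank(i, n, 0.5)
    upper = 1.0 - rank(i, n, 1.0 - confidence)
    return lower, median, upper


# Hypothetical complete sample of 10 items, second failure.
lower, median, upper = reliability_limits(2, 10)
print(f"Lower {lower:.2f}, median {median:.2f}, upper {upper:.2f}")
```

The ranks are expressed here as failure probabilities and converted to reliabilities by subtracting from 1, which is why the lower reliability limit comes from the higher rank.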
Click the “Graph” button on the Confidence Limits screen, Figure 8.1, and a graph of the results
will appear. This is illustrated in Figure 8.2. The Median Rank Points appear as circles, as in
Figure 5.2, and the upper and lower one sided 95% confidence limits are shown as small dashes.
This gives us a graphical indication of the spread of values that could occur. Taking a two sided
viewpoint, we are 90% confident that the reliability versus age will lie between the small dashed
points plotted.
The MTBF, or Mean Time Between Failures, is a concept which is widely used in reliability
analysis.
In the case of an item which has a constant failure rate (random failures), the MTBF is the mean
life. The MTBF also arises from a situation where components are replaced on failure, and in time
a steady state is reached regardless of the failure distribution function. The MTBF is then literally
the Mean Time Between Failures. Again the MTBF will be the mean life of the components.
RELCODE will calculate a point (or “best”) estimate for the MTBF for our current data and will
also calculate confidence limits. To do this from the Analysis Menu, Figure 4.1, click the
“Confidence Limits for Reliability” button. The screen shown in Figure 8.1 will appear. Now
click the “Confidence Limits for the MTBF” button at the bottom left of the screen, Figure 8.1.
It is important to note that this analysis is only valid for items subject to random failures, so that we
should only use this analysis if our reliability analysis has yielded a two parameter Weibull
distribution with a BETA value close to 1. By “close to 1”, we might regard values in the range
0.8 to 1.4 as being satisfactory, although accuracy decreases as we move away from the value
BETA = 1.
Since the data entered in Chapter 3 has not yielded a Random Failure pattern, we shall use other
data in this example. This is the Switch Failure Data as shown in Figure 8.3.
To analyse the data of Figure 8.3 we first enter it into RELCODE, then select “Analyse Data” and
“Model Optimization - Fit Distribution”. The preferred model is a two parameter Weibull with a
Beta value of 1.01. This is well within the range which we can regard as “Random”. The
parameters are as shown in Figure 8.4.
Returning to the Analysis Menu, we select “Confidence Limits for Reliability” and then
“Confidence Limits for the MTBF”. This yields the result shown in Figure 8.5.
From Figure 8.5 we see that the point estimate of the MTBF under the assumption of random failures is
1257 operations. Figure 8.5 gives a two sided 90% confidence interval for the MTBF, that is values
such that we are 90% confident that the MTBF lies between the lower and upper values given. This also
means that we are 95% confident that the MTBF lies above the lower value, and 95% confident that it
lies below the upper value.
Also, the value of the lower confidence limit depends on whether or not the trial ends on a failure. In the
Switch example, all the switches run to failure, so the trial does end on a failure. We see therefore that
the results are:
Hence we can state that we are 90% confident that the value of the MTBF lies between these values. In
reliability studies we are usually concerned mainly with the lower confidence limit. For example, there
may be a contract requirement to demonstrate that the MTBF exceeds a certain value with 95%
confidence. In this case we can say that the MTBF of the switches exceeds 800 operations with 95%
confidence.
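For readers who wish to check limits of this kind by hand, the Python sketch below uses the widely used chi-squared method for a constant failure rate. This manual does not state the formula RELCODE applies, so the method is an assumption, and the function names are hypothetical. The failure count of 10 and the total of 12570 operations are also hypothetical, chosen only so that the point estimate equals the 1257 operations quoted above.

```python
import math


def chi2_cdf_even(x, dof):
    """Chi-squared distribution function for an even number of degrees of freedom."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, dof // 2):
        term *= half / j
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even(prob, dof):
    """Value x with chi2_cdf_even(x, dof) == prob, found by bisection."""
    low, high = 0.0, 1.0
    while chi2_cdf_even(high, dof) < prob:
        high *= 2.0
    for _ in range(100):
        mid = (low + high) / 2.0
        if chi2_cdf_even(mid, dof) < prob:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


def mtbf_limits(total_time, failures, level=0.90, ends_on_failure=True):
    """Return (point estimate, lower limit, upper limit) for a two sided interval."""
    alpha = 1.0 - level
    point = total_time / failures
    # The lower limit uses 2r degrees of freedom if the trial ends on a failure,
    # otherwise 2r + 2; the upper limit uses 2r in both cases.
    lower_dof = 2 * failures if ends_on_failure else 2 * failures + 2
    lower = 2.0 * total_time / chi2_quantile_even(1.0 - alpha / 2.0, lower_dof)
    upper = 2.0 * total_time / chi2_quantile_even(alpha / 2.0, 2 * failures)
    return point, lower, upper


# Hypothetical trial: 10 failures in a total of 12570 operations, ending on a failure.
point, lower, upper = mtbf_limits(12570.0, 10, level=0.90, ends_on_failure=True)
print(f"MTBF {point:.0f}, 90% two sided interval {lower:.0f} to {upper:.0f} operations")
```

The lower limit of a two sided 90% interval is also a one sided 95% lower limit, which is the figure normally quoted when demonstrating a contractual MTBF.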
The inspection interval is assumed to be short relative to the MTBF. On average the item fails at its
MTBF, half way through an inspection interval. Hence:
A = M / (M + I/2)
I = 2 * M * ((1 - A)/A)
where A is the availability, M is the MTBF and I is the inspection interval.
This formula is used by RELCODE to give a suggested inspection interval. Initially we obtain a
distribution model using RELCODE, either by data analysis or by direct entry of parameters using the
“No Data” option. This gives an estimated value for the Mean Time Between Failures (MTBF) of
the device. At the Analysis Menu, click the Inspection Interval button to obtain a suggested value for
the inspection interval.
Figure 9.1 shows an example. The data here is from the Hydraulic Seal example of Chapter 16. This
has an MTBF of 120 months. For a 99% availability, the formula in this section gives:
I = 2 * M * ((1 - A)/A) = 2 * 120 * ((1 - .99) / .99) = 2.42 months.
This is the value shown. The value is not rounded by RELCODE and we will usually use judgement
in deciding a rounded value, for example 2 or 3 months in this case. A shorter interval will
correspond to a higher availability. As we see from Figure 9.1, RELCODE also shows the intervals
for 98% and 99.5% availability, and a graph of Availability against Inspection Interval, an example of
which is shown in Figure 9.2.
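The formula is simple enough to check by hand or with a few lines of code. The following Python sketch applies it to the MTBF of 120 months used in the example; the function names are hypothetical.

```python
def inspection_interval(mtbf, availability):
    """Suggested inspection interval: I = 2 * M * ((1 - A) / A)."""
    return 2.0 * mtbf * (1.0 - availability) / availability


def availability_for_interval(mtbf, interval):
    """Availability achieved by a given interval: A = M / (M + I/2)."""
    return mtbf / (mtbf + interval / 2.0)


MTBF_MONTHS = 120.0
for target in (0.98, 0.99, 0.995):
    interval = inspection_interval(MTBF_MONTHS, target)
    print(f"Availability {target:.1%}: inspect every {interval:.2f} months")

# Effect of rounding the 99% interval to a whole number of months.
for rounded in (2.0, 3.0):
    achieved = availability_for_interval(MTBF_MONTHS, rounded)
    print(f"Interval {rounded:.0f} months gives availability {achieved:.4f}")
```

The second loop shows how the availability changes when the calculated interval is rounded to a convenient value.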
Theory. Let R(t) be the survival function of the PF Interval and I be the condition monitoring
interval.
If a potential failure condition emerges at time x measured from the last inspection, then the
probability that the item survives to the next inspection is R(I-x). We can reasonably assume that the
rate of emergence of potential failure conditions does not
vary significantly over the inspection interval. Also, the potential failure condition will emerge in
some interval.
The probability, p, of the condition being detected is therefore given by the average value of the
survival function R(t) over the interval:
$$p = \frac{1}{I} \int_0^I R(t)\,dt \qquad (9.1)$$
Note that the integral of the reliability function is the Truncated Mean Life function which appears in
a number of other analyses.
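Equation 9.1 can be evaluated numerically for any survival function. The Python sketch below does this for a PF Interval modelled by a two parameter Weibull distribution, using the trapezoidal rule. The function names and the parameter values (BETA = 2.0, ETA = 6 months) are hypothetical and do not correspond to the example in Figure 9.3.

```python
import math


def weibull_reliability(t, beta, eta):
    """Survival function R(t) of the PF Interval (two parameter Weibull)."""
    return math.exp(-((t / eta) ** beta))


def detection_probability(interval, beta, eta, steps=1000):
    """Average of R(t) over 0..interval, as in equation 9.1 (trapezoidal rule)."""
    h = interval / steps
    total = 0.5 * (1.0 + weibull_reliability(interval, beta, eta))  # R(0) = 1
    for i in range(1, steps):
        total += weibull_reliability(i * h, beta, eta)
    return total * h / interval


# Hypothetical PF Interval distribution, in months.
BETA, ETA = 2.0, 6.0
for interval in (1.0, 2.0, 3.0, 6.0):
    p = detection_probability(interval, BETA, ETA)
    print(f"Monitoring every {interval:.0f} months: detection probability {p:.3f}")
```

As the monitoring interval lengthens, the probability of detecting the potential failure before it becomes a functional failure falls.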
To use RELCODE to estimate the condition monitoring interval, we need to make RELCODE model
the PF Interval. Having set this distribution either by fitting a distribution to data or by using the “No
Data” option to set the parameters, we can then use the Condition Monitoring Interval option on the
Analysis Menu. The result is illustrated in Figure 9.3. This example is based on the item “PF
In addition to fitting distribution parameters to our data, RELCODE allows us to enter any values of
the parameters that we wish. It will then carry out a goodness of fit test for those parameters in
relation to the current data. We can also proceed to the Replacement Analysis options with these
parameters.
The Display Distribution Parameters option is on the Analysis Menu, see Figure 4.1. When we click
the Display Distribution Parameters button, the screen shown in Figure 9.5 appears. The screen is
labeled, Display or Amend Distribution Parameters. This is because, in addition to displaying the
existing values of the distribution parameters, we can actually change the values of the parameters,
should we wish to do so. Note that the example in this chapter relates to the Drive Belt data, so if
you are continuing on from Chapter 8 you will need to return to the Item Header Screen and select the
Drive Belt data in order to get the numeric results shown in Figure 9.5.
If we have previously fitted a distribution, the screen will show the current parameter values. We can
amend these values if we wish. If we then press the OK button, a goodness of fit test will be carried
out, relating the amended parameter values to the current data.
For example, we can use this facility to test whether a negative exponential distribution would fit the
existing data. To do this, we change the shape parameter BETA from its value of 2.17, to the value
1.0 and press the OK button. The result is shown in Figure 9.6.
Figure 9.6 shows that the negative exponential distribution is rejected with 99% confidence as a fit to
the current data. If we wish to retain the amended values of the distribution parameters, we click the
Save Parameters button. Otherwise the amended parameters will not be saved.
The Previous Analysis Summary summarizes the results of the analysis for the current item. It can be
printed, saved to a file, or sent to the clipboard from where it can be incorporated into other
documents.
Data exported from Windows RELCODE is in a single file format described in Section 10.3 of this
chapter. Windows RELCODE will also import data from files in this format.
Users of DOS RELCODE can have data files in several formats. DOS users wishing to transfer
data to Windows RELCODE must send their data to a file in one of the DOS RELCODE formats
which appears in Table 10.1 below. Details of these formats are given in the DOS RELCODE
Users Manual.
The following procedure can then be used to read a suitably formatted file into RELCODE for
Windows.
1. Click the Import Data button on the Item Header Screen (Figure 10.1)
2. Select the file type and filename (Figure 10.2) and click the OK button. In the example the
selected file is called manual.rud. The data format in the file must correspond to the selected
file type.
3. The data will be imported to the RELCODE Access database and will appear on screen.
Figures 10.3 and 10.4 show the data which has been added from file manual.rud.
RELCODE will automatically create files in the formats which Windows RELCODE can read.
However, we may have data from other sources which we wish to import into Windows RELCODE. To
import our data we put it into the format described in this section. First we describe this format, then we
discuss creating the file from a spreadsheet.
The layout of the data file in RELCODE Windows Standard Format is as follows:
Row 1: Title
Row 2: Ageunit, Failure Replacement Cost, Preventive Action Cost
Row 3: Age, Event type (F=failure or S=suspension), Frequency
Subsequent rows are similar to row 3. In rows 2 and 3, commas are used to separate the variables.
The Title is up to 70 characters long and the Ageunit is up to 11 characters long. An example is
given in Figure 10.5
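The layout above is straightforward to read or check with a short program. The Python sketch below parses text in this layout and validates the field lengths and event types described. The function name, the returned dictionary keys and the sample data are hypothetical, and the sketch assumes that frequencies are whole numbers; it is not a copy of Figure 10.5.

```python
import csv


def parse_relcode_text(text):
    """Parse text laid out as Title / Ageunit,costs / Age,Event type,Frequency rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    title = lines[0].strip()
    if len(title) > 70:
        raise ValueError("Title is limited to 70 characters")

    # Row 2: Ageunit, Failure Replacement Cost, Preventive Action Cost
    age_unit, cost_failure, cost_preventive = (
        field.strip() for field in next(csv.reader([lines[1]]))
    )
    if len(age_unit) > 11:
        raise ValueError("Ageunit is limited to 11 characters")

    # Row 3 onwards: Age, Event type (F or S), Frequency
    events = []
    for row in csv.reader(lines[2:]):
        age, event_type, frequency = (field.strip() for field in row)
        event_type = event_type.upper()
        if event_type not in ("F", "S"):
            raise ValueError(f"Unknown event type: {event_type}")
        events.append((float(age), event_type, int(frequency)))

    return {
        "title": title,
        "age_unit": age_unit,
        "failure_cost": float(cost_failure),
        "preventive_cost": float(cost_preventive),
        "events": events,
    }


# Hypothetical sample data in the layout described above.
SAMPLE = """Drive belt sample (hypothetical)
Hours,1000,110
1200,F,1
2634,F,1
3000,S,2
"""

parsed = parse_relcode_text(SAMPLE)
print(parsed["title"], parsed["age_unit"], len(parsed["events"]), "event rows")
```

Running a check of this kind before importing can catch a misplaced comma or an invalid event type in a file prepared by hand or from a spreadsheet.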
Figure 10.6 An Excel Spreadsheet showing RELCODE Data. Output the data in .CSV Format
to obtain a file similar in format to Figure 10.5
In Figure 10.6, the left hand column represents the row numbers and the top row represents the
column letters and these are not part of the data.
Data can be exported from RELCODE to an ASCII file. To do this proceed as follows:
1. Select the relevant Item at the Item Header Screen (Figure 10.1)
2. Click the Export Data button. A pop up Export Data screen will appear as shown in Figure
10.7.
3. Select or enter the name of the file to which you wish to send the data and click OK.
When we click the Save Results to File button we will be prompted for a file name. The results
will then be written to the file in a text format. This file can then be printed, or read by another
program as devised by the user.
Some non-standard versions of RELCODE have an automatic file reading feature which allows the
user to specify a file from which data will be automatically read when the Item Header screen is
first loaded. This section applies only to those installations where the automatic read feature has
been specifically provided.
The file to be read can be in either RELCODE for Windows Data Format or DOS RELCODE
Standard Ungrouped Data Format. The RELCODE for Windows Data Format is defined in
Section 10.2 of this chapter. The DOS RELCODE Standard Ungrouped Data Format is defined
in the DOS RELCODE User Manual.
To activate automatic reading the user runs RELCODE and at the Start Screen (Figure 2.1) clicks
the (special feature) Set Automatic Read button. The Set Automatic Read Screen shown in Figure
10.8 then appears.
The file from which data is to be automatically read is referred to as the Auto-Read file. To use
the Auto-Read feature, at the Set Automatic Read Screen the user enters the following:
When the user clicks the OK button, this information is stored in the local file DBNAME.TXT
(which also includes the name of the current or default Access database used by RELCODE).
When Auto-Read is active and the Item Header Screen (Figure 2.2) is first loaded, RELCODE will
attempt to read the Auto-Read file. If successful, the data in this file will be imported into the
database and will be selected as the current item. To view the event data, click the Event Data
Entry button. The data can then be analysed as normal.
11.1 General
The ultimate goal in the development of systems is to achieve such a high level of reliability that
maintenance is no longer needed. Whilst this goal may be some way off, steps toward it benefit from
the measurement of reliability. In this and following chapters we shall introduce the concepts of
statistical reliability analysis as a basis for measuring reliability, understanding different reliability
patterns and selecting appropriate maintenance policies and reliability improvement approaches.
• Measure reliability as a basis for system acceptance, quality assurance and continuous
improvement
• Identify appropriate preventive maintenance or replacement policies
• Make comparisons between competing designs, versions, products
• Establish mean life and other parameters for spare parts planning
• Establish failure rate patterns as an aid to identifying the root cause of failure
• to introduce the following statistical distribution functions used in reliability analysis and to
provide an understanding of their meaning:
hazard function (failure rate)
failure probability density function
reliability function
distribution function
• to show graphs illustrating the various functions, as a basis for understanding reliability data
analysis.
11.2 Definitions
Reliability: The ability of an item to perform a required function under stated conditions for a
stated period of time.
11.2.3 Failure
The definition of reliability just given implies that we are able to distinguish between a failure on the
one hand and successful performance on the other. In some cases, the onset of failure is a clear cut
event, for example, a metal filament light globe either works or it does not. However, in many cases,
it will be necessary to carefully define what constitutes a failure.
For example, a catalyst used in a chemical reaction can deteriorate gradually, and if a given level of
catalytic action were important, in a certain application, it would be necessary to define this in
specifying when the catalyst had “failed”. Comparisons of the reliability of systems can only be fairly
made if “failure” is defined in the same way for each type of item. The following definition of failure
is given in BS4779.
From the foregoing discussion we see that the "required function" must be defined clearly in relation
to any specific application if consistent measurement of reliability is to be achieved.
Reliability: The probability that an item will perform a required function under stated
conditions for a stated duration of operating life.
Using standard statistical concepts, we regard the time to failure as a random variable, which we shall
denote by the symbol T. Let the variable t be the measure of the operating life in appropriate units;
for example, hours, kilometres, cycles. Then the reliability to age t is the probability that an item
survives to age t without failure.
Many items exhibit one or more of three phases of failure, known as Burn In, Random and Wearout
Failures. These are summarised below.
The Bath Tub Curve is a schematic plot of the failure rate or hazard function for an item which
exhibits all three phases of failure. Figure 11-1 illustrates this.
[Figure 11-1: The Bath Tub Curve. Vertical axis: Failure Rate; horizontal axis: Age]
The failure rate which appears on the vertical axis of the Bath Tub Curve is the probability per unit
time that an item fails in the next small time interval, given that it has survived so far. This
quantity is also referred to as the instantaneous failure rate, hazard function or, in the case of
human life, as the force of mortality.
From the Bath Tub Curve we see that the failure rate is high during the Burn In phase, is relatively
low and constant during the Random phase, and then increases again in the Wearout phase.
In practice, items may not exhibit all these phases of failure. Manufacturers may artificially age
items to eliminate Burn In failures, (referred to as Stress Screening) and the onset of Wearout may lie
outside the normal range of operating life. Thus the Random failure phase is often regarded as the
most important, and the failure rate is regarded as being roughly constant over the operating life of
equipment. However, it would be unwise to dismiss Burn In and Wearout failures too lightly.
Pattern A is the bath tub curve discussed in the previous section. Pattern B is constant failure rate
followed by wearout. This may occur if early life failures are eliminated by stress screening.
Pattern C is gradually increasing failure rate. This is typical of items which are subject to
corrosion or chemical wear.
Pattern D is an initially increasing failure rate followed by a constant failure rate. Here new items
are resistant to excess stress, but after a while this resistance is lost and random failures occur.
Pattern E is constant failure rate, that is random failures only. This pattern is common. It usually
indicates failure causes which are external to the item itself, such as a metal object breaking a pump
vane or a nail in a tyre.
It helps identify the root cause of failure. There is a tendency to assume that wearout causes most
failures, but in fact, burn-in and random failures are more common. The identification of the
failure pattern or patterns will give a useful indicator of how to find, and hence eliminate the
physical cause of failure.
Burn-in failures are a sign of defective manufacture, installation, set up or maintenance. When
they are present, attention should focus on checking new or recently refurbished items for correct
assembly, etc.
Random failures, as already indicated, occur typically due to sudden external stresses in excess of
installed strength. This will include misuse or misadventure failures.
Wearout in fact has several patterns. One is characterised by gradually increasing failure rate,
typical of corrosion, dirt build up, chemical or erosive wear. The second is a sudden, sharp
increase in failure rate, typical of fatigue failure or the conventional wear of rubbing or other
mechanical action.
Whilst the above mechanisms often occur in association with the failure patterns indicated, they are
only a broad guide and exceptions may occur in practice.
Also, the patterns shown in Figure 11-2 do not constitute an exhaustive list. For example, we
sometimes have a constant failure rate followed by another (usually higher) constant failure rate.
So far we have introduced some basic concepts of statistical reliability, particularly the various failure
rate patterns. Now we shall look more formally at a range of functions and equations used in
reliability analysis.
The failure probability density function (p.d.f.) is a plot which is such that the area under the curve
between any two ages is equal to the probability that a new item fails in the given age interval. This
differs from the failure rate curve in which the probability of failure is conditional on the item having
survived to the current age. Figure 11-3 shows a schematic failure p.d.f. exhibiting all three failure
phases. Note that in the failure p.d.f. the curve falls to zero at the right hand end, whereas the Bath
Tub Curve, Figure 11-1, continues to rise.
To illustrate the difference between the two curves we can use the human analogy. For an 85 year old
person the Bath Tub Curve will show the probability of death before age 86, which is relatively high,
then for an 86 year old person the probability of death before age 87, which is higher still, and so on.
By contrast the failure p.d.f. will show the probability that a newly born baby will die at age 85,
which is low, followed by the probability that a newly born baby will die at age 86, which is lower
still, and so on.
For the probability density function the area under the curve between any two ages t1, t2, gives the
probability that a new item will fail in that age interval. The total area under the curve adds up to 1,
because every item is certain to fail at some time.
$$\text{Probability of failure in interval } t_1 \text{ to } t_2 = \int_{t_1}^{t_2} f(t)\,dt \qquad (11.2)$$

$$\int_0^{\infty} f(t)\,dt = 1 \qquad (11.3)$$
The reliability function corresponds to the probability that an item survives to any given age.
For an item which starts to operate at age t = 0, the reliability function is the probability that failure
does not occur in the interval 0 to t. We denote this by R(t)
Figure 11-4 schematically illustrates a reliability function R(t) and also the cumulative probability of
failure or distribution function F(t). In Figure 11-4 the vertical scale represents the reliability, or
probability of survival, expressed as a percentage. The horizontal scale represents the age of the
item.
The reliability function is related to the failure probability density function by the fact that the
reliability to age t is 1 minus the area under the failure probability density function up to age t
$$R(t) = 1 - \int_0^t f(u)\,du \qquad (11.5)$$
Figure 11-4 - Reliability Function and Cumulative Probability of Failure or Distribution Function F(t)
Figure 11-4 schematically illustrates this function, F(t). It also shows the complementary nature of the
reliability and distribution functions, corresponding to equation 11.8.
$$F(t) = \int_0^t f(u)\,du \qquad (11.10)$$
The hazard function can be related to the reliability function R(t) and the probability density
function f(t) as follows. The probability of failure in the interval t to t + δt is f(t).δt and is also
R(t).h(t).δt .
We can derive a general relationship between the reliability and the hazard function from the
preceding equations.
$$R(t) = \exp\left(-\int_0^t h(u)\,du\right) \qquad (11.18)$$
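The relationship between the reliability function and the hazard function can be checked numerically. The Python sketch below integrates a hazard function and compares the result with the reliability calculated directly. A two parameter Weibull hazard is used as the test case; the function names and the parameter values are hypothetical.

```python
import math


def weibull_hazard(t, beta, eta):
    """Hazard function h(t) of a two parameter Weibull distribution."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def weibull_reliability(t, beta, eta):
    """Reliability function R(t) of a two parameter Weibull distribution."""
    return math.exp(-((t / eta) ** beta))


def reliability_from_hazard(t, hazard, steps=1000):
    """R(t) = exp(-integral of h(u) du from 0 to t), using the midpoint rule.

    The midpoint rule never evaluates the hazard at age 0, where a
    decreasing (burn in) hazard would be infinite.
    """
    h = t / steps
    cumulative_hazard = sum(hazard((i + 0.5) * h) for i in range(steps)) * h
    return math.exp(-cumulative_hazard)


# Hypothetical parameters: wearout pattern with BETA = 2.
BETA, ETA = 2.0, 1000.0
age = 500.0
via_hazard = reliability_from_hazard(age, lambda u: weibull_hazard(u, BETA, ETA))
direct = weibull_reliability(age, BETA, ETA)
print(f"R({age:.0f}) from hazard: {via_hazard:.6f}, direct: {direct:.6f}")
```

The two values agree, which illustrates that the hazard function alone is sufficient to determine the reliability function.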
11.12 Conclusion
In this chapter we have introduced the basic terms used in the statistical analysis of reliability data
and the statistical functions, hazard function, failure p.d.f., reliability function and cumulative
distribution function (cdf or df). In the next chapter we shall introduce particular forms of these
functions which are widely used in reliability analysis.
12.1 Introduction
Decisions about aspects of reliability and maintenance depend to a significant extent on an assessment
of when items will fail. We shall rarely know exactly when failure will occur, and the best scientific
assessment will normally involve statistically fitting a distribution to failure data.
There are several reasons for using standard distribution models and standard procedures for
reliability analysis. These are:
(a) The desire for objectivity. Using a standard technique allows us to treat data from varied
sources in similar style and in this way assists with engineering judgement across a broad
spectrum of applications.
(b) The need for automating data analysis. The existence of a standard procedure means that this
procedure can be followed by technical staff and that it can also be computerised, leading to
efficient treatment of data and the extraction of useful information in a cost effective way.
(c) The merits of the techniques have been established in many studies, and lead to directly useful
information such as the values of the distribution parameters.
The equations for the various reliability functions then are as follows, expressed in terms of the
parameter, λ.
F(t) = 1 - exp [-λt] 12.1
R(t) = exp [-λt] 12.2
f(t) = λexp [-λt] 12.3
h(t) = λ 12.4
Mean Life = 1/λ = MTBF 12.6
MTBF = Mean Time Between Failures
The negative exponential distribution is a special case of the Weibull distribution (discussed in the
next section), with Weibull Parameters β = 1, and η = 1/λ.
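Equations 12.1 to 12.6 translate directly into code. The following Python sketch is illustrative only; the function names and the failure rate of 0.001 per hour are assumptions made for the example.

```python
import math

def exp_cdf(t, lam):
    """Cumulative probability of failure, equation 12.1."""
    return 1.0 - math.exp(-lam * t)

def exp_reliability(t, lam):
    """Reliability, equation 12.2."""
    return math.exp(-lam * t)

def exp_pdf(t, lam):
    """Failure probability density, equation 12.3."""
    return lam * math.exp(-lam * t)

def exp_hazard(t, lam):
    """Hazard function, equation 12.4: constant at every age."""
    return lam

def exp_mean_life(lam):
    """Mean life (MTBF), equation 12.6."""
    return 1.0 / lam

lam = 0.001                      # hypothetical failure rate per hour
mtbf = exp_mean_life(lam)
print(mtbf, exp_reliability(mtbf, lam))
```

The reliability evaluated at an age equal to the MTBF is exp(-1), so only about 37% of items survive to the mean life under this model.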
Figure 12-1 shows the negative exponential failure rate (a constant). The graph was produced by a
program which generates Weibull plots and is the special case where BETA = 1.
Figure 12-2 shows the negative exponential probability density function, corresponding to equation
12.3.
Figure 12-3 shows the negative exponential distribution cumulative distribution function,
corresponding to equation 12.1.
Introduction
The Weibull Distribution (pioneered by Swedish researcher Waloddi Weibull in the 1950s) can
represent any one phase of failure, that is, Burn In, Random or Wearout, depending on the Shape
Parameter of the distribution. It cannot represent the existence of all three failure phases (or even two
phases) for the same item. Nevertheless it is found that the Weibull distribution provides a good
statistical model for many practical purposes, and is usually superior to other models with the same
number of parameters.
The equations of the Weibull distribution contain a parameter called BETA, denoted by the Greek
letter β, which is known as the shape parameter. The shape of the Weibull probability density
function and other functions is different for different values of BETA.
When the value of BETA is less than 1, the Weibull distribution represents a pattern of Burn-In
failures. For BETA equal to 1 the Weibull distribution reduces to the negative exponential
distribution. For BETA greater than 1, the Weibull distribution represents wearout failures. The
larger the value of BETA, the more pronounced is the wearout effect. BETA values in the range 1.5
to 2.5 may indicate some blend of random and wearout failures. This relationship between the BETA
value and the phase of failure is summarised in Table 12.1.
BETA value    Phase of failure
<1            Burn In
1             Random
>1            Wearout
The Weibull distribution also has a scale parameter ETA (η), known as the Characteristic Life. This
parameter is related to the mean of the distribution and corresponds to the age by which 63.2% of
items have failed.
There is also a three parameter version of the Weibull distribution which we shall consider in a later
section.
The equations of the Weibull distribution, based on shape parameter β (BETA) and Characteristic
Life η (ETA) are as shown in Table 12.2 below.
Coefficient of Variation (σ / µ):
$$\frac{\sigma}{\mu} = \left\{ \frac{\Gamma[(\beta+2)/\beta]}{\{\Gamma[(\beta+1)/\beta]\}^{2}} - 1 \right\}^{1/2}$$
Characteristic Life
The Characteristic Life ETA (η) has the property that when t = η, then the cumulative distribution
function takes the value: 1 - exp (-1) = 0.632 for every β.
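The Characteristic Life property can be checked numerically. The Python sketch below uses the standard two parameter Weibull forms, which we assume are those tabulated in Table 12.2; the function names and the value ETA = 1000 are illustrative assumptions.

```python
import math

def weibull_cdf(t, beta, eta):
    """Cumulative probability of failure F(t) = 1 - exp(-(t/eta)^beta)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

def weibull_reliability(t, beta, eta):
    """Reliability R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """Hazard h(t) = (beta/eta) * (t/eta)^(beta-1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)

def weibull_pdf(t, beta, eta):
    """Probability density f(t) = h(t) * R(t), for t > 0."""
    return weibull_hazard(t, beta, eta) * weibull_reliability(t, beta, eta)

def weibull_mean(beta, eta):
    """Mean life = eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)

def weibull_cv(beta):
    """Coefficient of variation (sigma / mu); depends on beta only."""
    g1 = math.gamma((beta + 1.0) / beta)
    g2 = math.gamma((beta + 2.0) / beta)
    return math.sqrt(g2 / g1 ** 2 - 1.0)

eta = 1000.0                     # hypothetical characteristic life
for beta in (0.75, 1.0, 2.0, 3.3):
    print(beta, weibull_cdf(eta, beta, eta), weibull_mean(beta, eta), weibull_cv(beta))
```

For every BETA the cumulative probability of failure at age ETA comes out the same, while the mean life and the coefficient of variation change with the shape parameter.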
Graphs illustrating various functions of the Weibull distribution for different values of the shape
parameter BETA (β) are given in this section and subsequent sections. These graphs were created by
the RELCODE software package.
First, in Figure 12-4, we illustrate the Hazard Function or failure rate h(t), as this shows clearly the
relationship between the Weibull shape parameter and the phases of failure represented in the Bath
Tub curve (Figure 11-1).
The vertical scale of the hazard function graph is measured in failures per unit of operating life and
the graph represents the instantaneous failure rate at any age.
The horizontal scale for this and other graphs in this section is the Operating Life or age in
appropriate units.
Figure 12.4 shows the Weibull Hazard Function for BETA = 0.75. As BETA is less than 1, the
hazard function or failure rate decreases with age, corresponding to Burn-In failures.
We have noted previously that for BETA = 1.0, the Weibull distribution reduces to the negative
exponential distribution, so the hazard function for BETA = 1 is a constant and in fact, η (ETA) is
the conventional Mean Time Between Failures or MTBF in this case.
For BETA = 2.0 the hazard function increases at a constant rate and corresponds to Pattern C in
Figure 11-2. This represents what may be termed "gradual wearout" or possibly a blend of random
and wearout failures, in contrast to the case where BETA = 3.3. For values of BETA above 2.0, the
gradient of the hazard function increases with age, representing a stronger or more marked wearout
effect than for lower values of BETA.
The vertical scale of the probability density function graph is measured in Failures per Unit of
Operating Life. The simplest way to interpret this graph is by recalling that the area under the curve
between any two values of Operating Life, is the probability that a new item will fail in that age range.
For BETA < 1, represented by Figure 12-7, the Weibull p.d.f. is skewed to the left and goes to
infinity at age zero. This represents the Burn-In failure phase with a high initial failure probability
density which then decreases.
For BETA = 1, recall that the Weibull distribution reduces to the Negative Exponential distribution
represented by Figure 12-2. The value of the failure probability density at age zero is 1/η, and the pdf
then decreases with age.
For BETA greater than 1 we illustrate two cases in Figure 12-5. Firstly, for BETA = 2.0, the
gradient at the origin is initially positive and decreases gradually. This corresponds to the gradual
wearout or combined random/wearout situation. Secondly, for BETA = 3.3, the Weibull distribution
takes on a bell shape which is very similar to the Normal distribution. These figures illustrate the
versatility of the Weibull distribution in representing a family of probability density functions which
includes the Negative Exponential, a shape comparable to the Normal, and a range of other shapes
representative of different failure patterns. It is this flexibility which has made the Weibull
distribution popular with reliability engineers.
Graphs of the Weibull reliability function R(t) for BETA values of 0.75, 1.0, 2.0 and 3.3 are shown
in Figure 12-6.
The vertical scale of reliability function graphs is Probability expressed as a percentage and the
horizontal scale is Operating Life. The graph shows the probability that a new item will survive to the
corresponding age.
For BETA < 1, the Weibull reliability function falls steeply at first and then flattens out. The
reliability will approach zero as the age tends to infinity, but with BETA < 1, this approach is very
gradual. The reliability will have the value 36.79% when the age is equal to ETA, whatever the value
of BETA.
For BETA = 1, the Weibull distribution reduces to the Negative Exponential distribution. The curve
falls quite steeply at first, though not as steeply as for BETA <1. After the age value ETA, the curve
asymptotically approaches the zero level at a faster rate than for BETA <1.
For BETA = 2.0, the curve remains high initially and then falls at a fairly steady rate, finally
approaching the zero value asymptotically. This is the gradual wearout case.
For BETA = 3.3, the Weibull reliability function remains close to the 100% level initially and then
falls sharply, indicating strong wearout. The zero level is approached more rapidly than in the
previous cases and is reached within the range of the graph (within the accuracy of the plot).
Further flexibility can be introduced into the Weibull distribution by adding a third parameter which is
a location parameter and is usually denoted by the symbol gamma (γ). The probability of failure is
zero for t<γ and then follows a Weibull distribution with origin at age γ. Gamma is often referred to
as the Minimum Life parameter and can be used in conjunction with any values of BETA and ETA.
Figure 12-7 illustrates a Weibull p.d.f. with GAMMA = 12 and BETA = 2. We see that the failure
probability density is zero between 0 and 12 and then follows the usual Weibull pattern for BETA =
2. Figure 12-8 shows the corresponding c.d.f.
The three parameter Weibull distribution may give a better fit to given failure data than the two
parameter distribution. From a mathematical viewpoint, giving the distribution an extra parameter
allows it more flexibility leading to a closer (or at least as close) fit to any given data. Although
gamma is usually called the "Minimum Life" this does not guarantee that no failures will occur below
this value in the future.
Figure 12-8 - Three Parameter Weibull Cumulative Distribution Function with Minimum
Life = 12 Units
The first of these hazard functions is a two parameter Weibull hazard function with the equation:
$$h(t) = \lambda\theta(\lambda t)^{\theta-1} \qquad (12.7)$$
In equation 12.7, t is the component age, h(t) is the hazard function at age t, λ is the reciprocal of
a scale parameter and θ is a shape parameter. The case where θ = 1 corresponds to a constant
failure rate λ.
The second hazard function is a three parameter Weibull hazard function, which becomes operative
for t > γ. The equation is:
$$h(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1} \qquad (12.8)$$
In equation 12.8, β, η and γ are respectively shape, scale and location parameters, as in the three
parameter Weibull distribution.
Adding the two hazard functions gives Hastings bi-Weibull distribution, for which the hazard and
reliability equations are:
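Hazard
$$h(t) = \lambda\theta(\lambda t)^{\theta-1} + \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1}, \quad t \ge \gamma \qquad (12.10)$$
Reliability
$$R(t) = e^{-(\lambda t)^{\theta}}, \quad 0 < t < \gamma \qquad (12.11)$$
$$R(t) = e^{-[(\lambda t)^{\theta} + ((t-\gamma)/\eta)^{\beta}]}, \quad t \ge \gamma \qquad (12.12)$$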
In equations 12.7 to 12.12, θ is not confined to values less than or equal to 1, and β is not confined
to values greater than 1, although the values do often conform to these ranges in practice.
γ ≥ 0, η > 0, β > 0, λ ≥ 0, θ ≥ 0.
Example
Figures 12-9, 12-10 and 12-11 show respectively the Hastings bi-Weibull hazard function,
probability density function and reliability function for the following parameter values:
λ = 0.01, θ = 0.6, γ = 40, η = 40, β = 3.0.
This example corresponds to a combination of burn-in and wearout failures. The range of shapes
which can be taken by the Hastings bi-Weibull distribution is large. Any combination of two
Weibull failure rate patterns can be accommodated, for example, burn-in plus wearout, random
plus wearout, burn-in plus random, random plus another random starting later. β is not required to
be greater than 1, nor θ less than 1. In practice, the ability of the Hastings distribution to detect the
onset of wearout is one of its main advantages.
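The bi-Weibull hazard and reliability functions can be sketched in Python as follows, using the example parameter values given above. The function names are our own, and the sketch assumes t > 0 (and, when β is less than 1, t > γ) so that the power terms are defined.

```python
import math

def biweibull_hazard(t, lam, theta, gamma, eta, beta):
    """Sum of the two Weibull hazards; the second applies only from age gamma."""
    h = lam * theta * (lam * t) ** (theta - 1.0)
    if t >= gamma:
        h += (beta / eta) * ((t - gamma) / eta) ** (beta - 1.0)
    return h

def biweibull_reliability(t, lam, theta, gamma, eta, beta):
    """Reliability: exp of minus the accumulated hazard of both phases."""
    exponent = (lam * t) ** theta
    if t >= gamma:
        exponent += ((t - gamma) / eta) ** beta
    return math.exp(-exponent)

# Parameter values of the example: burn-in phase plus wearout starting at age 40.
params = dict(lam=0.01, theta=0.6, gamma=40.0, eta=40.0, beta=3.0)

for age in (5.0, 20.0, 40.0, 60.0, 80.0):
    print(age, biweibull_hazard(age, **params), biweibull_reliability(age, **params))
```

With these values the hazard falls with age up to 40 units and rises thereafter, which is the combination of burn-in and wearout described in the example.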
12.10 Discussion
The negative exponential, Weibull and bi-Weibull distributions form a family of distributions of
gradually increasing complexity.
The Weibull includes the negative exponential as a special case and extends the range of models to
include Pattern C and Pattern D as an approximate three parameter Weibull. It will also provide a
solution for the strictly decreasing failure rate pattern and for the strong wearout case which do not
occur in Figure 11-2 but do exist in practice. However, a negative aspect is that where a “double”
pattern exists, Weibull fitting can tend to obscure this, since it is bound to average out the two
phases.
The bi-Weibull distribution includes the Weibull as a special case, but allows two failure phases so
that patterns B and F are now covered. Also, as Figure 12-9 shows, the Hastings bi-Weibull
distribution can provide a fair approximation for the Bath Tub, Pattern A. Thus the whole range of
patterns is effectively covered.
In a simple case like a metal filament light globe, failure is a clear cut event. In other cases, an
item might deteriorate to a point where it is no longer usable, and then be taken out of service
without experiencing actual failure to perform. An example of this would be a tyre which wears to
a point where its use is no longer legal.
In some cases an item may be repaired in the course of its life. For example, a tyre may have a
puncture which is repaired and the item then continues in service. Some punctures, however, may
be so serious that the tyre is not repaired but is discarded. Assuming that we are interested in how
long tyres last before they are replaced, then any event which requires replacement of the tyre is a
failure. This includes both severe punctures which require the tyre to be discarded and normal
wear which reaches the legal limit. A “normal” puncture, which is repaired so that the tyre
continues in use, is not a failure in this case. These repairs are known as “bad as old” repairs,
since the tyre will continue to work, but its condition of wear will still reflect its age.
The definition of “failure” is essentially up to the analyst and any logically consistent definition
may be applied, provided that the results take due account of the definition used.
Pursuing the tyre example, in an application where the kilometres between wheel changes were
of interest (perhaps because the vehicle operated in a remote location where puncture repair
facilities were not available) then normal punctures may be regarded as failures for purposes of
analysis. We shall assume that the analyst has defined failure in a way suited to his purposes.
When replacement occurs and data on the new item forms part of our analysis, we assume that the
new item is similar to the original item when new. This may be because the new item actually is
new, or because it is “as good as new”.
There may be situations where repairs leave an item in a condition which is neither “as bad as old”
nor “as good as new”, but we shall not consider these.
Where suspensions occur, it is essential to take them into account if valid reliability analysis is to be
carried out. However, for the present we shall consider situations where all items run to failure,
leaving the analysis of the suspended item case until later.
2. Estimate the parameters for the assumed model type from the data, yielding a fitted model.
3. Statistically test the hypothesis that the data could have arisen at random from the fitted
model. This gives a measure of the goodness-of-fit of the model.
4. Repeat with other model types and use a statistical test of model quality to decide which
model is most appropriate.
5. Use the preferred model as an aid to determining appropriate replacement policies or for
other management decisions.
The plot is of the cumulative probability of failure. A special probability paper is used, the scales
of which are modified so that any Weibull distribution function appears as a straight line on the
graph paper. The vertical axis represents the cumulative percentage failures and the horizontal axis
represents the age at failure. The horizontal scale is logarithmic, whilst the vertical scale
represents log[log[1/(1 − α)]] for probability α. This transformation converts the Weibull
distribution function into a straight line.
13.5.1 Example
As an example of Weibull plotting, consider the data in Table 13-1 which relates to a sample of 10
switches. The switches themselves are labeled A,B,C, and so on, and the number of operating cycles
to failure were observed as shown in Table 13.1.
Switch   Operating cycles to failure
A        1980
B        760
C        120
D        210
E        2170
F        3800
G        700
H        1350
I        1100
J        380
1 Sort the data by increasing age at failure, determining the order number of each failure.
2 Estimate the cumulative probability of failure at each failure age using the median rank
formula, equation 13-2.
3 Plot the cumulative probability of failure against age on Weibull paper.
4 Determine the Weibull parameters by fitting a straight line to the data on the probability
paper.
Equation 13.2 is Benard's formula. It estimates what is known as the Median Rank of the cumulative
probability of failure. This formula is used by RELCODE.
To illustrate the application of equation 13.2, consider the somewhat extreme case where we had only
one failure (a sample of 1). We would not expect the age of this failure to represent the age by which
100% of items in the underlying population would fail. It would be more realistic to regard the single
age at failure as representing the age by which 50% of the underlying population would fail. Benard's
formula for i = 1 and N = 1 gives a probability level of (1 − 0.3)/(1 + 0.4) = 0.5, that is, 50%.
The next step in the analysis is to extend Table 13.2 to show the Cumulative Probability of Failure
Estimator as given by equation 13.2. This is shown in Table 13.3. As an example of the
calculation, consider the tenth failure, for which i = 10. We have: (10 − 0.3)/(10 + 0.4) = 0.933, that is, 93.3%.
RELCODE will perform a Weibull plot, plotting the Median Rank (expressed as a percentage) against
the corresponding number of operations to failure, t. RELCODE then estimates the Weibull
parameters ETA (η) and ΒΕΤΑ (β). Figure 13-2 shows this.
β = 1.01
η = 1344 Operations
Since β is close to 1, we deduce that the switches are subject to random failures and that the MTBF is
1344 operations. This concludes the Weibull analysis for these switches.
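The four plotting steps can be imitated numerically. The Python sketch below applies Benard's median rank formula, (i − 0.3)/(N + 0.4), to the switch data and fits a straight line on Weibull scales by ordinary least squares. Regressing the vertical coordinate on the horizontal one is an assumption made for illustration; the line-fitting details used by RELCODE are not stated here, so the estimates may differ slightly from the values quoted above.

```python
import math

# Operating cycles to failure for switches A to J (Table 13.1).
cycles = [1980, 760, 120, 210, 2170, 3800, 700, 1350, 1100, 380]

def median_ranks(n):
    """Benard's estimate of the cumulative probability of failure for i = 1..n."""
    return [(i - 0.3) / (n + 0.4) for i in range(1, n + 1)]

def fit_weibull_by_regression(ages):
    """Least squares line through (ln t, ln(-ln(1 - F))) on Weibull scales."""
    t = sorted(ages)                               # step 1: order the failures
    n = len(t)
    ranks = median_ranks(n)                        # step 2: median ranks
    x = [math.log(v) for v in t]                   # step 3: Weibull paper scales
    y = [math.log(-math.log(1.0 - f)) for f in ranks]
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((a - x_bar) ** 2 for a in x)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    beta = sxy / sxx                               # step 4: slope is BETA
    eta = math.exp(x_bar - y_bar / beta)           # age at which the line gives 63.2%
    return beta, eta

beta, eta = fit_weibull_by_regression(cycles)
print(beta, eta)
```

The slope of the fitted line estimates BETA, and the age at which the line crosses the 63.2% level estimates ETA.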
Constant (or approximately constant) failure rates arise frequently - and are often assumed to occur
without being really verified. Random failures arise particularly with:
• failures arising from some erratic external cause, e.g. metal object damaging a pump vane;
• electronic equipment;
• very complex equipment, or aggregations of equipment with some components replaced;
• as an approximation for any other hazard function over a short interval.
The aim of this analysis is to show how to estimate reliability in the case of random failures, in
particular
The random failure rate case corresponds to a Weibull distribution with β = 1. The hazard function
then reduces to a constant h(t) = 1/η = λ, a constant referred to as the failure rate. This distribution
is most commonly referred to as the Negative Exponential Distribution, details of which are given in
Chapter 12. The MTBF is the reciprocal of the failure rate.
If m items are observed and t_i is the observed operating time of the i-th item, the total service-hours,
T, is found by adding up the operating times of the items. In equation form this is expressed by:
$$T = \sum_{i=1}^{m} t_i \qquad (13.4)$$
It is immaterial whether any specific item fails, is suspended or replaced, since the failure rate for all
items is assumed to be constant and independent of age.
Confidence limits are values such that we are confident at some stated level (e.g. 90% confidence)
that the true value will not be greater than or less than the specified limit.
A one sided 90% lower confidence limit for the MTBF would be a value such that we are 90%
confident that the true MTBF exceeds the value given.
A one sided 90% upper confidence limit for the MTBF would be a value such that we are 90%
confident that the true MTBF is less than the value given.
A two sided 90% confidence interval is such that we are 90% confident that the true value lies
between the upper and lower boundaries of the interval. In this case, there is a 5% chance of the
value being below the lower limit and a 5% chance of it being above the upper limit.
In reliability analysis we are usually interested in lower one sided confidence limits.
There is a special case where we have zero failures. In this case, no upper limit can be specified,
but the lower limit is the elapsed service-hours, T, multiplied by the value in Table 13.4 for n
= 0.
Confidence limits for the MTBF are then found using Table 13.4. If testing terminated at the n-th
failure, we enter the table at the corresponding number of failures n, and use the “End on Fail”
columns for the lower limit. We select the column according to the confidence limit we are
seeking, that is, 99%, 95% or 90%, lower or upper. If testing terminated after a certain time, but
not specifically at a failure, then we use the “End on Time” columns.
The value in the table is then multiplied by the point estimate of the MTBF (equation 13.5) to give
the required confidence limit.
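The arithmetic of this procedure is sketched below in Python. We assume that the point estimate of equation 13.5 is the usual one, total service time divided by the number of failures; the table factor must still be read from the confidence limit table, and the function names and the five operating times are our own illustrations.

```python
def total_service_time(operating_times):
    """Equation 13.4: sum of the observed operating times of all items."""
    return sum(operating_times)

def mtbf_point_estimate(total_time, failures):
    """Assumed form of equation 13.5: T / n (requires at least one failure)."""
    if failures <= 0:
        raise ValueError("point estimate needs at least one failure")
    return total_time / failures

def mtbf_confidence_limit(total_time, failures, table_factor):
    """Confidence limit = table factor multiplied by the point estimate."""
    return table_factor * mtbf_point_estimate(total_time, failures)

# Hypothetical operating times for five items, failed or not.
print(total_service_time([20.0, 35.0, 10.0, 25.0, 10.0]))

# Values quoted in the text: T = 100, n = 10, factor 0.637 for the 95% lower limit.
point = mtbf_point_estimate(100.0, 10)
lower_95 = mtbf_confidence_limit(100.0, 10, 0.637)
print(point, lower_95)
```

The same function gives an upper limit when the corresponding upper-limit factor is supplied.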
T = 100, n = 10
Failures, n = 10
Lower Conf Limit = 95%
Table entry = 0.637 (end on failure case)
0 - 50 100 4
50 - 100 200 4
Solution
The question asks for a two sided 90% confidence interval. This corresponds to finding the lower and upper
one sided 95% confidence limits.
Result. MTBF Point Estimate = 4286 hours, 90% two sided interval is 2743 to 7089 hours.
Suspended items may arise for a number of reasons. One case is where these items are continuing in
service and simply have not yet failed.
A procedure for dealing with suspended items is illustrated in the following example:
An example of reliability data which includes both failures and suspended items is shown in Table
14.1. This example relates to diesel engines in earthmoving plant. Failure is defined as a situation
where an engine is replaced by a new or overhauled (good as new) engine.
The method used to allow for suspended items is a refinement of the Age Sensitive Method described
in Hastings and Bartlett (forthcoming 1997). The Age Sensitive Method is an improvement on the
method described by Herd (1960) and Johnson (1964). The use of the Hastings-Bartlett Age
Sensitive Method improves the accuracy of RELCODE in estimating reliability from your data,
relative to the Herd-Johnson method.
In the case where there are no suspended items we made use of the failure order-number, i, and the
total number of items, N, in estimating the cumulative probability of failure, using equation 13.2.
With suspended items we make use of a modified order-number, mi, which allows for the
suspended items. The modified order number is calculated using formulae given in this section.
The following symbols will be used:
i = failure order-number
j = suspension order-number
e = event order-number
N = total number of events
ei = event-number of failure i
ej = event number of suspension j
mi = modified order-number of failure i
S(i) = set of suspensions occurring at or after failure i-1 and before failure i.
This set may be empty.
fi = age at which failure i occurs
sj = age at which suspension j occurs
m*i = N + 1 - mi
e*i = N + 1 - ei
αj = the proportion of the current inter-failure interval which has elapsed
when suspension j occurs.
f0 = m0 = e0 = 0
$$m_i - m_{i-1} = m^*_{i-1}\left[1 - \frac{e^*_i}{e^*_{i-1}} \prod_{S(i)} \frac{e^*_j + 1 - \alpha_j}{e^*_j - \alpha_j}\right] \qquad (14.2)$$
In equation 14.2, the product is taken over suspensions in the set S(i). If this set is empty the
product term has value 1, and the equation reduces to:
$$m_i - m_{i-1} = m^*_{i-1}\left(1 - \frac{e^*_i}{e^*_{i-1}}\right) \qquad (14.3)$$
The calculation is illustrated by the following example, based on the data in Table 14.1. Consider
Event 1 in Table 14.1. As this first event is a failure, we apply equation 14.3 and get
= 1.259
m2 = 2.259
Event 5 is a failure and there are no suspensions in between events 4 and 5 so we use equation 14.3,
giving,
m3 - m2 = (6 + 1 - 2.259)/(6 + 1 - 4) = 1.58
We have now calculated all the modified order-numbers. We then use equation 14.4 to calculate the
median ranks. Table 14.2 summarizes the results. Once the median ranks have been calculated, the
usual probability plotting technique can be applied.
Table 14-2 Modified Order Numbers and Median Ranks
Event Order   Hours   Status       Failure     Modified Order   Median Rank
Number e      Run                  Number i    Number mi        (mi - 0.3)/(N + 0.4)
1             3895    Failure      1           1                11%
2             4733    Suspension
3             7886    Suspension
4             9063    Failure      2           2.259            31%
5             10030   Failure      3           3.839            55%
6             12123   Suspension
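The calculation of the modified order-numbers can be sketched in Python as follows. The sketch takes α_j to be (s_j − f_{i−1})/(f_i − f_{i−1}), which is our reading of the definition of α_j given in the list of symbols, and it assumes that the events are supplied in order of age with no suspension tied to a zero-length inter-failure interval. The function names are our own.

```python
def modified_order_numbers(events):
    """events: list of (age, is_failure) in increasing age order.
    Returns the modified order-number m_i of each failure, in order."""
    n = len(events)
    m_prev, e_prev, f_prev = 0.0, 0, 0.0      # m0 = e0 = f0 = 0
    pending = []                              # suspensions since the last failure
    result = []
    for e, (age, is_failure) in enumerate(events, start=1):
        if not is_failure:
            pending.append((e, age))
            continue
        product = 1.0                         # product over the set S(i)
        for e_j, s_j in pending:
            alpha = (s_j - f_prev) / (age - f_prev)
            e_star_j = n + 1 - e_j
            product *= (e_star_j + 1 - alpha) / (e_star_j - alpha)
        m_star_prev = n + 1 - m_prev
        ratio = (n + 1 - e) / (n + 1 - e_prev)
        m = m_prev + m_star_prev * (1.0 - ratio * product)
        result.append(m)
        m_prev, e_prev, f_prev = m, e, age
        pending = []
    return result

def median_rank(m, n):
    """Median rank from a modified order-number: (m - 0.3) / (N + 0.4)."""
    return (m - 0.3) / (n + 0.4)

# Diesel engine data: (hours run, True for failure / False for suspension).
engines = [(3895, True), (4733, False), (7886, False),
           (9063, True), (10030, True), (12123, False)]

for m in modified_order_numbers(engines):
    print(round(m, 3), round(100 * median_rank(m, len(engines))))
```

Carrying α_j at full precision gives modified order-numbers that agree with Table 14-2 to within rounding in the third decimal place.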
RELCODE uses the analytical method just described to get the modified order numbers and median
ranks which are then used in fitting distribution models. The Confidence Limits Table in RELCODE,
shown at Figure 14.1, shows the values of the modified order numbers, which correspond to those in
Table 14.2. Figure 14.1 also shows the median ranks for the reliability and these values are one
minus the values in Table 14.2. Fitting a distribution model using RELCODE yields the following
parameters:
Figure 14.1 Confidence Interval Screen - Showing the Modified Order-Numbers and
Median Ranks.
1 290 S 1
2 334 S 1
3 452 F 1
4 695 F 1
5 769 F 1
6 1668 F 1
7 2150 S 1
8 2210 S 1
9 2252 S 1
10 2467 S 1
11 2607 S 1
12 2662 F 1
13 3212 S 1
14 3260 F 1
15 3576 F 1
16 3820 S 1
17 3852 S 1
18 3984 S 1
19 4011 S 1
20 4203 S 1
21 4454 S 1
22 4636 F 1
23 4818 F 1
24 5041 F 1
25 5134 F 1
We enter the data into RELCODE in the usual way and at the Analysis Menu select Model
Optimization - Fit Distribution. RELCODE fits a range of models and recommends a preferred
model. The results of fitting the various models to the data of Table 14.3 are shown in Figure
14.3. This procedure was introduced in Chapter 4. In this case the preferred model is Model 6,
the bi-Weibull Distribution.
RELCODE selects a preferred model on the basis of relative model quality which is defined in
Section 4.8. In Figure 14.3 we see that the relative model quality for the bi-Weibull distribution is
noticeably higher than for the other distribution models. The fitted parameters of the bi-Weibull
are shown in Figure 14.4.
Not all examples give such a clear cut interpretation as this one, but the bi-Weibull is generally
very valuable in identifying multiple failure rate patterns.
A reliability plot showing the data and the fitted distribution is given in Figure 14.5. We can see
that the reliability falls relatively slowly at first and then falls sharply at about 4500 hours. This
corresponds to the situation which we have identified from the parameters shown in Figure 14.4.
Figure 14.6 shows the bi-Weibull hazard function for this example, with the sharp rise in the
failure rate starting at about 4500 hours.
Figure 14.7 shows the bi-Weibull plot on Weibull Probability Scales. The change in slope between
the random and wearout phases is clearly apparent.
Figure 14.8 shows the plot generated when we fit the 2 parameter Weibull to the Oscillating Axle
Bush data. From a simple manual plot we might regard this model as satisfactory. This would
mean that we would miss the sharp increase in failure rate that occurs at about 4500 hours. We
could therefore be over optimistic in assessing the reliability of this component and fail to recognise
the need for preventive replacement and/or design review.
This concludes the bi-Weibull example. The example will, however, be referred to further in our
discussion of the various models and fitting techniques later in this chapter.
The models and fitting methods used by RELCODE were introduced in Chapter 4. Here we give
some additional details regarding these. For all models except model 2, we calculate the median
This method was regarded as adequate before the advent of more advanced computer based
techniques. It suffers from the problem that the Weibull probability paper transformation is highly
non-linear. This means that the distance of a point from the fitted line represents different
probabilities, depending on where you are on the paper. A point one centimetre from a line near
the bottom left of the paper has a probability error of about 0.1%, whereas a point one centimetre
from a line near the centre of the paper has a probability error of about 10%. This distortion can
cause the fitted distribution to be statistically rejected, even though parameter values can be found
(by other methods) which give a good fit.
The maximum likelihood method is based on the concept that if a failure occurs at age t, then the
likelihood of this event, for a given underlying distribution model, is given by the value of the
probability density function for that distribution at age t, f(t). Formulas for fitting the two
parameter Weibull distribution by maximum likelihood are given by Nelson (1981), pages 340-341.
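For the simpler case in which every item has run to failure, the standard maximum likelihood equations for the two parameter Weibull can be solved numerically. The Python sketch below is such a solution by bisection, applied to the switch data of Table 13.1. It is not the general procedure given by Nelson, makes no allowance for suspended items, and the function name and search bounds are our own assumptions.

```python
import math

def weibull_mle_complete(ages, tol=1e-10):
    """Maximum likelihood estimates (beta, eta) when all items have failed.
    Requires positive ages that are not all equal."""
    n = len(ages)
    t_max = max(ages)
    u = [a / t_max for a in ages]             # scaled ages, avoids overflow
    log_u = [math.log(v) for v in u]
    mean_log = sum(log_u) / n

    def score(beta):
        # Likelihood equation for beta; its root is the estimate.
        w = [v ** beta for v in u]
        weighted = sum(wi * li for wi, li in zip(w, log_u)) / sum(w)
        return weighted - 1.0 / beta - mean_log

    lo, hi = 0.01, 100.0                      # assumed search range for beta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    eta = t_max * (sum(v ** beta for v in u) / n) ** (1.0 / beta)
    return beta, eta

cycles = [1980, 760, 120, 210, 2170, 3800, 700, 1350, 1100, 380]
beta_ml, eta_ml = weibull_mle_complete(cycles)
print(beta_ml, eta_ml)
```

The maximum likelihood estimates will generally differ somewhat from those obtained by fitting a line on probability paper, since the two methods weight the data differently.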
Recognising the distorting effect of probability paper, Ang and Hastings (1994) used the
probability error as a basis for fitting distributions, and introduced the term Model Accuracy to
represent the value of a coefficient of conformance based on probability error. The probability
error is the difference between the reliability level of an observed data point ri and corresponding
model value, vi of the reliability at the same age. The root mean square probability error (RMSPE)
is given by:
$$\mathrm{RMSPE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(r_i - v_i)^2} \qquad (14.9)$$
Ang and Hastings (1994) used the mean absolute probability error rather than the root mean square
probability error. The change to the root mean square probability error has been made to reflect
the fact that errors are likely to be normally distributed.
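Equation 14.9 is straightforward to compute. In the Python sketch below the observed reliabilities are one minus the median ranks of Table 14.2, and the model values come from a two parameter Weibull whose parameters (β = 2.0, η = 12000 hours) are hypothetical and chosen only to illustrate the calculation.

```python
import math

def rmspe(observed, model):
    """Root mean square probability error between observed reliabilities r_i
    and model reliabilities v_i at the same ages (equation 14.9)."""
    if len(observed) != len(model) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    n = len(observed)
    return math.sqrt(sum((r - v) ** 2 for r, v in zip(observed, model)) / n)

# Failure ages and observed reliabilities (1 - median rank) from Table 14.2.
ages = [3895.0, 9063.0, 10030.0]
observed = [1.0 - (m - 0.3) / 6.4 for m in (1.0, 2.259, 3.839)]

# Hypothetical model for illustration only.
beta, eta = 2.0, 12000.0
model = [math.exp(-((t / eta) ** beta)) for t in ages]

print(rmspe(observed, model))
```

A fitting method based on this measure searches for the parameter values that make the returned figure as small as possible.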
To decide whether a distribution model is statistically valid we carry out a test of goodness of fit.
For the case where all items fail, the Kolmogorov-Smirnov (KS) type of test is applicable. A
version specifically suited to the Weibull distribution is described by D’Agostino and Stephens
(1986).
However, the presence of suspended items invalidates KS type tests. If there are quite a few
suspended items occurring between any pair of failures, the KS test may reject a model which is, in
fact, quite accurate. For this reason, Ang and Hastings (1994) developed the Model Accuracy
Test, which is based on model accuracy statistics, and which is applicable with or without
suspended items.
2. An observed value of a test statistic, for example, the Model Accuracy statistic given by
equations 14.9, 14.10.
4. A critical value for the test statistic at the given confidence level.
5. We reject the hypothesis (Ho) at a given confidence level if the observed value of the test
statistic is less than the critical value.
Rejected Model. Figure 14-9 shows the goodness of fit test for the 2 parameter Weibull model
fitted by linear regression, for the Axle Bush data. This corresponds to the Weibull Plot shown in
Figure 14.8. In this case the hypothesis that the distribution fits the data is rejected. RELCODE
gives us two indications that this is not a good model for this data, firstly by recommending the bi-
Weibull model in this case, secondly by rejecting the model in the goodness of fit test. If we
simply used the Weibull plot and did not apply a suitable goodness of fit test we might have
accepted the 2 parameter Weibull result and concluded that we had a beta value of 1.5 indicative of
Figure 14.9 Parameters and Goodness of Fit test for the 2 Parameter Weibull Model for the
Oscillating Axle Bush
The distortion introduced by the Weibull paper, discussed in Section 14.7.1 - Model 1, leads to the
need for a fitting method which avoids these distortions. This is achieved by using a computer
search technique to find parameter values which minimise the root mean square probability error.
Equivalently, we are maximising the model accuracy as defined by equations 14.9, 14.10.
This suggests a much stronger wearout pattern than that found by Model 1. In the linear regression
method, greater weight was given to the early failures because of the scale distortion of the Weibull
paper. The accuracy of Model 3 is greater than Model 1, but still is not high enough to be
statistically acceptable.
In this example, this model does not find a 3 parameter Weibull which improves on the 2 parameter
result of Model 1.
Consider a situation where an item undergoes a trial and is either a success or a failure. Suppose that
success occurs with probability p and failure with probability 1-p. The trial is then known as a
Bernoulli trial. If n independent Bernoulli trials are carried out, each with the same probability of
success, p, then the probability of exactly x successes occurring is
$$\binom{n}{x} p^x (1-p)^{n-x} \qquad (14.17)$$
where x is an integer, 0 ≤ x ≤ n. These probabilities are the successive terms of the binomial expansion of
$$(p + (1-p))^n \qquad (14.18)$$
The number of successes in n independent Bernoulli trials, each with the same probability of success,
p, is a binomial random variable which we denote B:n,p.
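As a small illustration of equation 14.17, the binomial probabilities can be evaluated directly. The function name and the values n = 2, p = 0.9 are chosen here for illustration only and do not come from RELCODE.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent Bernoulli trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Illustrative values: two trials, each with success probability 0.9.
n, p = 2, 0.9
terms = [binomial_pmf(x, n, p) for x in range(n + 1)]
print(terms)       # probabilities of 0, 1 and 2 successes
print(sum(terms))  # the terms of the expansion sum to 1 (within rounding)
```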
Estimation of Reliability
The reliability of an item in a situation corresponding to a Bernoulli trial is its probability of success.
Suppose we test n items and observe x successes and wish to make a statement about the reliability of
the items. We assume that the trials are Bernoulli with the same chance of success each time. The
sampling distribution of the number of successes is then a binomial distribution. The sample mean
x/n is an unbiased estimator of the probability of success, p,
$$\hat{p} = x/n \qquad (14.20)$$
The lowest value of p, say, pL, for which there is probability (1 - α) of getting more than x successes
in n trials is the lower one sided confidence limit for p at the 100 α % level.
The highest value of p, say pU, for which there is probability α of getting more than x successes in n
trials is the upper 100 α % confidence limit for p, given by
For given x, n, α we can obtain pL and pU from equations 14.21 and 14.22. RELCODE gives these
confidence limits.
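A numerical sketch of how such limits can be found is given below. It follows the wording used in this section, in which each limit is the value of p at which the probability of more than x successes equals a stated value. Equations 14.21 and 14.22 are not reproduced here, so this should be read as an illustration of the idea and not as RELCODE's own routine; the function names and the bisection method are assumptions.

```python
from math import comb

def prob_more_than(x, n, p):
    """Probability of more than x successes in n Bernoulli trials."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1, n + 1))

def solve_p(x, n, target, tol=1e-12):
    """Find p such that prob_more_than(x, n, p) equals target, by bisection.

    The probability rises from 0 to 1 as p goes from 0 to 1, so bisection works.
    """
    if not 0 <= x < n:
        raise ValueError("this sketch needs 0 <= x < n")
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if prob_more_than(x, n, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Two items tested, one success observed, limits at the 5% level.
p_lower = solve_p(1, 2, 0.05)
p_upper = solve_p(1, 2, 0.95)
print(round(p_lower, 4), round(p_upper, 4))
```

For two items with one success the probability of more than one success is simply p squared, so the two results are the square roots of 0.05 and 0.95, about 0.2236 and 0.9747.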
The binomial terms for this case are, if p is the probability of success:
Number of successes    Probability
0                      (1 - p)^2
1                      2p(1 - p)
2                      p^2
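At the time of the first failure we have 1 success and so, for the 5% level, the lower confidence limit
is such that the probability of more than one success is 0.05. From the binomial terms above, the
probability of more than one success is p^2, so
$$p_L^2 = 0.05, \qquad p_L = 0.224$$
For the upper limit, the probability of more than one success is 0.95, so
$$p_U^2 = 0.95, \qquad p_U = 0.975$$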
At the time of the second failure we have zero successes so the lower limit is given by
14.10 Conclusion
RELCODE provides a range of model fitting and testing options which goes well beyond the basic
Weibull plotting technique.
15.1 Introduction
The background to planned replacement analysis for components is given in Chapter 6 for age
based replacement and in Chapter 7 for block replacement. In this chapter the underlying
mathematics for these models is described. This includes the “replace only on failure” strategy
which is considered first.
The simplest form of replacement policy is to replace only on failure. That is, we carry out failure
replacements only, and do not do any preventive replacements.
If only failure replacements are carried out, the average cost per unit time in the long run will be given
by
$$G_{ROOF} = \frac{C_f}{\mu} \qquad (15.2)$$
where
GROOF = Average cost per unit time for replacement only on failure
µ = Mean Life of components (with no preventive replacement)
Cf = Cost of Failure Replacement
In an age-based preventive replacement policy, items are replaced under the following rules:
A "preventive replacement age", denoted tp, is set as part of the policy. If an item fails before age tp a
failure replacement is made. If an item survives to age tp a preventive replacement is made.
To implement this policy in practice we need to record when each replacement occurs so that the age
of every component is known. The saving from preventive replacement may depend on the
preventive replacement occurring at a convenient time, e.g. at the next routine service. At the time of
such a service, any component which had reached (or exceeded) its preventive replacement age would
be replaced.
Items which fail before the preventive replacement age still require failure replacement.
[Figure: F(tp) marked at the Preventive Replacement Age, tp]
Figure 15-2 shows in schematic form a typical sequence of events under an Age-Based Preventive
Replacement Policy. Starting with a new component, the first component survives to
age tp, when it is replaced on a preventive basis. This is indicated by the symbol P in the figure.
The second component fails before age tp, so a Failure Replacement occurs, indicated by F. The
sequence of replacements continues, with preventive replacement occurring whenever a component
survives to age tp, and failure replacement occurring otherwise.
Figure 15-2 - Schematic Sequence of Events for an Age-Based Preventive Replacement Policy
P = Preventive Replacement
F = Failure Replacement
tp = Preventive Replacement Age
Sequence of Events
tp <tp tp tp <tp
________P_______F____________P____________P_____F____Time
The cheapest Age-Based Preventive Replacement Policy is the one which has the lowest long run cost
per unit time. This cost is derived by determining the average replacement cost per component and
the average life per component, and then dividing the average cost by the average life. The average
cost per component is given by:
Average cost per component = (Cost of Failure Replacement x Probability of Failure Replacement)
+ (Cost of Preventive Replacement x Probability of Preventive Replacement)

$$\text{Average cost per component} = C_f F(t_p) + C_p [1 - F(t_p)] \qquad (15.3)$$
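In symbols, and denoting the failure probability density function by f(t), the truncated mean life is given
by:
$$\text{Truncated mean life} = \int_0^{t_p} t\, f(t)\, dt + t_p [1 - F(t_p)] \qquad (15.4)$$
Using integration by parts, equation 15.4 can also be expressed as shown in equation
15.5, and it is this version of the equation which is used in RELCODE.
$$\text{Truncated mean life} = \int_0^{t_p} [1 - F(t)]\, dt \qquad (15.5)$$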
$$G = \frac{C_f F(t_p) + C_p [1 - F(t_p)]}{\int_0^{t_p} [1 - F(t)]\, dt} \qquad (15.7)$$
The minimum cost policy is found by evaluating G for a range of values of tp, and choosing the value
which gives a minimum.
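The evaluation of G over a range of ages can be sketched as follows. This is a minimal illustration, assuming a three parameter Weibull life distribution and a simple trapezoidal rule for the integral in equation 15.7; the function names, the step count and the candidate ages are assumptions and are not taken from RELCODE. The parameter values and costs are those quoted for Exercise 1 (Bearing) in the exercises and solutions.

```python
import math

def weibull_cdf(t, beta, eta, gamma=0.0):
    """Three parameter Weibull F(t); no failures occur before age gamma."""
    if t <= gamma:
        return 0.0
    return 1.0 - math.exp(-(((t - gamma) / eta) ** beta))

def truncated_mean_life(tp, beta, eta, gamma=0.0, steps=2000):
    """Integral of [1 - F(t)] from 0 to tp (equation 15.5), trapezoidal rule."""
    h = tp / steps
    total = 0.5 * (1.0 + (1.0 - weibull_cdf(tp, beta, eta, gamma)))
    for k in range(1, steps):
        total += 1.0 - weibull_cdf(k * h, beta, eta, gamma)
    return total * h

def cost_rate(tp, cf, cp, beta, eta, gamma=0.0):
    """Cost per unit time G for preventive replacement age tp (equation 15.7)."""
    f_tp = weibull_cdf(tp, beta, eta, gamma)
    average_cost = cf * f_tp + cp * (1.0 - f_tp)
    return average_cost / truncated_mean_life(tp, beta, eta, gamma)

def best_age(candidates, cf, cp, beta, eta, gamma=0.0):
    """Candidate age with the lowest cost per unit time."""
    return min(candidates, key=lambda tp: cost_rate(tp, cf, cp, beta, eta, gamma))

# Exercise 1 values: BETA = 1.17, ETA = 13.28 weeks, GAMMA = 6.08 weeks.
beta, eta, gamma = 1.17, 13.28, 6.08
cf, cp = 1000.0, 100.0
g4 = cost_rate(4.0, cf, cp, beta, eta, gamma)
g8 = cost_rate(8.0, cf, cp, beta, eta, gamma)
tp_star = best_age([4.0 * k for k in range(1, 11)], cf, cp, beta, eta, gamma)
print(round(g4, 2), round(g8, 2), tp_star)
```

With these inputs the cost at 4 weeks is 100/4 = 25.00 per week, since no failures are expected before GAMMA, and the cost at 8 weeks comes out close to the 23.93 quoted in the solution. The small difference is to be expected from the rounded parameter values. Among ages that are multiples of four weeks, 8 weeks gives the lowest cost.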
The cost G will vary with the preventive replacement age in the way shown in Figure 15.3. The cost
per unit time, G, will have a minimum value G*A which will occur at the optimal preventive
replacement age t*p. RELCODE will find t*p, searching in increments of width equivalent to the
horizontal space occupied by one character on a screen 80 characters wide.
The asymptotic value of G as tp increases is GROOF, given by equation 15.2. If the cost of failure
replacement is only slightly greater than the cost of preventive replacement, or the wearout effect is
only slight (BETA only slightly greater than one) then the minimum in Figure 15.3 will be very
shallow. In such a case, for all practical purposes the optimal policy is to replace only on failure, and
RELCODE will indicate this.
Figure 15-3 Variation of Cost with Preventive Replacement Age for Age Based Preventive Replacement Policy
The truncated mean life of a component for a given age based preventive replacement policy is given
by equation 15.5. RELCODE uses equation 15.5 in conjunction with equation 15.8 to determine the
number of spare or replacement parts which will (on average) be required.
The average proportion of failure replacements will be F(tp) and this is used to estimate the number of
failure replacements and preventive replacements for a given replacement policy.
In a block preventive replacement policy, all components are replaced simultaneously, in a block, at
fixed intervals of time (or operating life). Items which fail in between the block replacement times are
replaced when they fail, these being failure replacements. At the time of block replacement, all items
are replaced including those which have been subject to failure replacement. We refer to the time
between block replacement as the block replacement interval.
The block replacements are preventive replacements. The cost of a block preventive replacement may
differ from that of an age based preventive replacement. Usually a block replacement will be cheaper
(per component replaced) because there are economies of scale in doing many replacements at the
same time.
The interval between block replacements may be expressed in terms of calendar time, or in terms of
operating hours where this is more appropriate. In the latter case all components would be assumed to
operate concurrently. An example of this type of policy is light bulbs in a street, where a block policy
would involve replacing all the light bulbs in a single pass at certain intervals of time. In addition,
individual light bulbs which failed in between the block replacement intervals would be replaced on
failure.
Figure 15-5 shows a typical time interval in a block replacement policy in schematic form. There are
10 lamps in a street and initially all the light bulbs are new. As time goes by individual light bulbs fail
and are replaced. These are failure replacements. In the figure, the first bulb to fail is in Lamp
Number 4. Later failures occur in other lamps, and the bulb in Lamp 4 in fact fails again before the
block replacement time interval, xp, is reached. At time xp, all the light bulbs currently in use are
replaced, regardless of age.
We then have a situation identical to the starting position in that all the light bulbs are new. Thus,
subsequent time intervals will see, in a statistical sense, a repeat of the original pattern, although, of
course, the timing and number of individual failures will depend on chance.
In the block replacement policy, the aim is to choose the value of the interval xp which minimises
costs.
Figure 15-13 - Block Preventive Replacement Policy in Schematic Form : Light Bulbs in Street Lamps.
Lamp Number
1 _____________________________________
2 __________________________F__________
3 _____________________________________
4 _____F__________________________F____
5 _____________________________________
6 _____________________________________
7 _____________________________________
8 ___________________________F_________
9 ________________F____________________
10 _____________________________________
0 Time Axis xp
To determine the average replacement cost per unit time we need an expression R(x) defined as the
mean number of failure replacements per component in time x. The average cost per unit time is then
g(xp), given by
$$g(x_p) = \frac{C_f R(x_p) + C_p}{x_p} \qquad (15.12)$$
For the case where no preventive replacements are made the cost is given by
For a discrete life distribution model in which fi is the probability that a currently new component will
fail in age interval i the renewal function Ri can be derived as follows. Let ri be the mean number of
renewals per component in the ith time interval. Then ri is given by
$$r_1 = f_1$$
$$r_2 = f_2 + r_1 f_1$$
$$r_n = f_n + \sum_{i=1}^{n-1} r_i f_{n-i}, \quad n = 2, 3, 4, \ldots \qquad (15.14)$$
Rn is given by
$$R_1 = r_1$$
$$R_n = R_{n-1} + r_n, \quad n = 2, 3, 4, \ldots \qquad (15.15)$$
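The recursion in equations 14 and 15 of this chapter, together with the block cost of equation 15.12, can be sketched as below. The discrete failure probabilities, the costs and the function names are hypothetical and are used only to show the mechanics; each age interval is taken to have unit width.

```python
def renewal_function(f):
    """Return (r, R) from equations 15.14 and 15.15.

    f[0] is unused padding; f[i] is the probability that a new component
    fails in age interval i. r[n] is the mean number of renewals in
    interval n and R[n] is the cumulative mean number up to interval n.
    """
    n_max = len(f) - 1
    r = [0.0] * (n_max + 1)
    big_r = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        r[n] = f[n] + sum(r[i] * f[n - i] for i in range(1, n))
        big_r[n] = big_r[n - 1] + r[n]
    return r, big_r

def block_cost_rate(big_r, n, cf, cp, width=1.0):
    """Cost per unit time for a block interval of n intervals (equation 15.12)."""
    return (cf * big_r[n] + cp) / (n * width)

# Hypothetical wearout pattern over five intervals.
f = [0.0, 0.0, 0.05, 0.15, 0.4, 0.4]
r, R = renewal_function(f)

cf, cp = 1000.0, 100.0  # hypothetical costs per component
costs = {n: block_cost_rate(R, n, cf, cp) for n in range(1, len(f))}
best_n = min(costs, key=costs.get)
print(R[1:], best_n)
```

For this made-up pattern the cumulative renewals are 0, 0.05, 0.2, 0.6025 and 1.0175, and the lowest cost per unit time occurs at a block interval of two intervals.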
To use the No Failure Data screen, you enter estimated information about the item. The entries are as
follows:
1. You select a failure pattern from a choice of Random Failures, Gradual Wearout or Steep Wearout.
2. You enter an estimated value for the mean life of the item.
3. Optionally you can enter an age of onset of failures, which is applicable in cases where you consider that
there would be an initial period in which there would be a negligible probability of failures occurring.
The default value of the age of onset of failures is zero.
The user does not need to be concerned with Weibull analysis in this case, but as a matter of information,
RELCODE will use the entries at this screen to establish an equivalent Weibull distribution model. The
Random Failure pattern converts to a Beta value (shape parameter) of 1, Gradual Wearout to a Beta value of 2
and Steep Wearout to a Beta value of 3.5. The age of onset of failures provides a Gamma value (location
parameter). The Eta value (characteristic life) is calculated by RELCODE from the Mean Life and the other
parameter values.
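The conversion described above can be illustrated with the standard relationship between the mean of a Weibull distribution and its parameters, Mean Life = Gamma + Eta x Γ(1 + 1/Beta). The manual does not state the exact calculation RELCODE performs, so the sketch below is an assumption based on that identity, and the function and dictionary names are hypothetical.

```python
import math

# Beta values stated in the text for each failure pattern.
BETA_BY_PATTERN = {
    "Random Failures": 1.0,
    "Gradual Wearout": 2.0,
    "Steep Wearout": 3.5,
}

def equivalent_weibull(pattern, mean_life, onset_age=0.0):
    """Return (beta, eta, gamma) for a failure pattern and estimated mean life.

    Assumes mean_life = gamma + eta * Gamma(1 + 1/beta), the standard
    expression for the mean of a three parameter Weibull distribution.
    """
    beta = BETA_BY_PATTERN[pattern]
    gamma = onset_age
    eta = (mean_life - gamma) / math.gamma(1.0 + 1.0 / beta)
    return beta, eta, gamma

# Illustrative entries: gradual wearout, mean life 100, onset of failures at 10.
beta, eta, gamma = equivalent_weibull("Gradual Wearout", 100.0, 10.0)
print(beta, round(eta, 2), gamma)
```

For the Random Failures pattern the gamma function term equals 1, so Eta is simply the Mean Life less the age of onset of failures.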
A second approach to the situation where we have no failures involves making a “three point estimate”.
Specifically we estimate the ages, t80, t50, t20 to which the component has 80%, 50% and 20% reliability.
We then enter these ages as though they were failures, and this enables RELCODE to fit a distribution. If we also
have data regarding the cost of failure replacement and the cost of preventive replacement we can then proceed to
solve the replacement problem.
As an example, consider the following. A diaphragm valve in a slurry pump is subject to failure. No data
records are available, but the three point estimate shown in Figure 16.2 has been made. These estimates are as
follows. Firstly we estimate the age at which we consider that 80% of the diaphragms will still be surviving. In
this case we estimate this as 60 months (5 years). Thus, we expect that 20% of the diaphragms will last less than
5 years, and 80% will last longer than 5 years.
Secondly, we estimate the age by which we expect that 50% will have failed. In this case we estimate this at 72
months (6 years). Thus we expect that half the diaphragms in these pumps will last for less than 6 years and that
half of them will last for longer than 6 years. Thirdly, we estimate the age by which we expect that 80% will
have failed. In this case we estimate this age as 90 months (7.5 years). Thus we expect that 80% will have failed
before they are 7.5 years old, whilst 20% will last for longer than 7.5 years. We also estimate the cost of failure
replacement as $1000 and the cost of preventive replacement as $200. These estimates are shown in Figure 16.2,
along with the number of diaphragms in the population (16) and the average annual utilization per component,
which in this case is 12 months, since the pumps are all used all the time.
16.5 Results
We shall not show all the screens for the example, but only the main results. Figure 16.4 shows the reliability
plot obtained when RELCODE uses the three point estimate data from Figure 16.3. The three point estimate
corresponds to the three points shown on the graph, which we see are at (or very close to) the 80%, 50% and
20% reliability levels. Thus RELCODE has determined a Weibull distribution which corresponds closely to our
reliability estimates. In this case the result is a three parameter Weibull.
We can then proceed to carry out a replacement policy analysis in the usual way. The age based replacement
policy cost graph is shown in Figure 16.5. The optimal solution is to replace the diaphragm at approximately
50 months. We can continue with other analyses in the usual way.
The three point estimate method has allowed us to estimate the distribution function of the components, and we
can then proceed with any of the RELCODE analyses using that distribution. If we subsequently get real data
for this component we can replace the three point estimate data by the real data and repeat the analysis. If we
get more information, but not actual data, we can change our three point estimate if we wish. It is advisable to
note, say in the Item Memo field on the Item Header Screen, when we are using an estimate and not real data.
The three point estimate method relies on the fact that if we have three failures, then on a reliability plot,
these will be plotted at approximately the 80%, 50% and 20% reliability levels.
R% = 100 * (1 - P)
i R%
1 79.4
2 50.0
3 20.6
Within plotting accuracy, we see that the points plotted will therefore be at approximately the 80%, 50% and
20% reliability levels. Thus, by estimating the ages corresponding to these reliability levels we provide data
enabling RELCODE to fit a distribution which suits our estimates. It is possible that the data may give a poor
fit to the best Weibull model found, in which case the method is inappropriate and should not be relied on.
RELCODE will give an indication of this in the Goodness of Fit Test.
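The plotting positions tabulated above can be reproduced with a short calculation. The formula for P is not shown in this passage, so the sketch assumes the commonly used median rank approximation P = (i - 0.3)/(n + 0.4), which happens to give exactly the tabulated values for n = 3; the function name is hypothetical.

```python
def plotted_reliability_percent(i, n):
    """Reliability level (%) at which the i-th of n failures is plotted.

    Assumes the median rank approximation P = (i - 0.3) / (n + 0.4).
    """
    p = (i - 0.3) / (n + 0.4)
    return 100.0 * (1.0 - p)

levels = [round(plotted_reliability_percent(i, 3), 1) for i in (1, 2, 3)]
print(levels)  # [79.4, 50.0, 20.6]
```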
16.7 Conclusion
The methods given in this chapter let us analyse situations even though no data is available. We can do this either
by using the "No Failure Data" screen, or by using the Three Point Estimate method. In either case, we estimate
a suitable life distribution, and then use RELCODE in the usual way to derive preventive replacement ages and
other results.
Exercise 1 - Bearing
Heavy duty bearings in a steel forging plant have failed after the following numbers of weeks of
operation.
1. Use RELCODE to select a Weibull life distribution model, estimate the parameters and the
mean life.
2. The cost of Preventive Replacement is $100 and the cost of Failure Replacement is $1000.
Determine the optimal replacement policy and the corresponding cost per week.
3. In practice, the forge has a major service every four weeks. Preventive replacement of the
bearing can be carried out as part of this maintenance activity.
a. At what age (a multiple of four weeks) should the bearing be replaced to minimise costs?
b. If there is a safety argument for keeping the number of in service failures as low as
possible within reasonable costs, what should the replacement policy be?
Support your conclusions by giving the costs for some alternative policies.
4. There are four similar forging plants and each works for 50 weeks per year. Estimate the
number of replacement parts required per year if the policy is preventive replacement at age
8 weeks. How many failure replacements will occur per year (steady state average) under
this policy?
Truck 1 Truck 2
51220 45380
68060 103510
Truck 1 Truck 2
105680 132720
3. For the RELCODE preferred model and for the two parameter Weibull model, examine the
Reliability Function and the Weibull Probability Paper plots, and the Goodness of Fit Test
results. Which model do you consider to be the most appropriate and why?
4. What type of failure pattern(s) is/are indicated (EARLY LIFE, RANDOM, WEAROUT?)
5. The Preventive Replacement Cost is $100 and the Failure Replacement Cost is $1000.
Determine the optimal preventive replacement age, and the cost under this policy, and the
saving of this policy when compared with a policy of replacement only on failure.
6. Preventive replacement can only be carried out at odometer readings which are multiples of
5,000 kms. Select an appropriate preventive replacement age. What is the cost
($/kilometer) for this policy? How does this compare with the cost for the optimal policy?
7. If the company has a fleet of 6 similar dump trucks, each of which averages 50,000
kilometers per year, estimate the number of seal replacements which will be needed per
year, under an appropriate replacement policy.
8. If 6 dump trucks average 50,000 kilometers per year, estimate the average number of in-
service seal failures which will occur per year, given that the policy is to replace seals on a
preventive basis at 30,000 kilometers.
The cloth filter on a sugar centrifuge is currently replaced on a preventive basis if a suitable
opportunity occurs and the cloth has been in use for at least 20 hours. The cloth is also replaced on
failure. The following data are available for the most recent six cloth replacements.
Cloth Age in Hours Comment
1. Use RELCODE to analyse the failures and estimate the following parameters:
3. The company has three centrifuges which each run an average of 400 hours per month. Estimate
the number of replacement cloths required per month under the existing and recommended
replacement policies.
1. Use RELCODE to determine a suitable life distribution model and estimate the parameters of the
distribution and the resulting model accuracy. Use this distribution to answer the remaining parts
of this question.
Exercise 5 - Insulator
A new type of insulator for high voltage electric power lines is being trialled.
An analysis of six insulators operating in an accelerated test environment, designed to simulate
coastline conditions, shows that the insulators fail to operate to specification, or survive, after the
following numbers of months.
The design called for 95% confidence of 90% reliability over an 18 month period in this trial. Do the
data indicate that this criterion has been met?
Exercise 6 - Maintainability
An aircraft maintenance check takes the following times in minutes to complete on six successive
occasions:
Find a suitable distribution to fit this data and estimate the maintainability, given a maintenance time
constraint of 60 minutes. (Note. The maintainability is the probability that the maintenance is
completed within the maintenance time constraint.)
Exercise 1 - Bearing
1. A three parameter Weibull model is selected, with GAMMA = 6.08 weeks, BETA = 1.17,
ETA = 13.28 weeks, Mean Life = 18.66 weeks.
Replacement age (weeks)   Cost per week ($)   In-service failures (%)
4                         25.00               0
8                         23.93               10
a. Replace at 8 weeks
b. Replace at 4 weeks. The preferred model indicates zero in-service failures in this case,
but it is possible in practice that some failures could occur, as the model is statistical.
4. 23.01, 2.44
3. 72.58%. From Reliability column in the Life Distribution Function Tabulations, under the
Replacement Analysis Menu (interpolating).
No. Use: Analysis Menu, Confidence Limits for Reliability, Graph. Interpolating between the lower
blue bars would indicate a lower 95% confidence limit of about 50% at 18 months. This is below the
required level of 90%, so the criterion has not been met.
Exercise 6 - Maintainability
Enter the maintenance times as “failures”. Fitted distribution is Weibull 2 parameter with BETA =
7.54, ETA = 58.24.
The maintainability with a maintenance time constraint of 60 minutes is given by the probability that
maintenance is completed in less than or equal to 60 minutes. This is obtained from the Cumulative
Distribution Function graph, or by subtracting from 1 the value in the Reliability column in the Life
Distribution Function Tabulations, under the Replacement Analysis Menu. To reach this menu in the
full system you will need to enter values for the replacement costs. If replacement analysis is not
being used, enter any arbitrary costs, e.g. 1000, 100. The Reliability value in the “Life Distribution
Function Tabulations” Table at 60 minutes is 0.2859. The maintainability is therefore
1 - 0.2859 = 0.7141, or approximately 71%.
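As a cross-check, the same figure can be obtained directly from the fitted parameters, assuming the standard 2 parameter Weibull reliability function R(t) = exp(-(t/ETA)^BETA). The function name is illustrative only.

```python
import math

def weibull_reliability(t, beta, eta):
    """Two parameter Weibull reliability R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))

# Fitted values from the solution: BETA = 7.54, ETA = 58.24 minutes.
reliability_60 = weibull_reliability(60.0, 7.54, 58.24)
maintainability = 1.0 - reliability_60
print(round(reliability_60, 3), round(maintainability, 3))
```

The result agrees with the tabulated Reliability value of 0.2859 to about three decimal places; any small difference comes from rounding of the fitted parameters.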
Ang J Y T (1994) “Model Accuracy and Goodness of Fit for the Weibull Distribution with Suspended
Items”, PhD thesis, Monash University.
Ang J Y T and Hastings N A J (1994) “Model Accuracy and Goodness of Fit for the Weibull
Distribution with Suspended Items”, Microelectronics and Reliability, 34, 7, 1177-1184.
Clark W B (1991) “Analysis of Reliability Data for Mechanical Systems”, Proc. Annual Reliability and
Maintainability Symp., IEEE, 1991, pp 438-9.
D’Agostino R B and Stephens M A (1986) “Goodness of Fit Techniques”, Marcel Dekker, Inc, New
York.
Epstein B (1960) “Estimation from Life Test Data”, IRE Transactions on Reliability and Quality Control,
April 1960, pp 104-107.
Hastings N A J and Bartlett H J G (1997), “Estimating the Failure Order-Number from Reliability
Data with Suspended Items”, IEEE Transactions on Reliability.
Hallinan, A J (1993) “A Review of the Weibull Distribution”, Journal of Quality Technology, 25, 85-
93.
Herd R.G. (1960) “Estimation of Reliability from Incomplete Data”, Proc. 6th National Symposium on
Reliability and Quality Control, 202-217.
Johnson L.G. (1964) “Theory and Technique of Variation Research”, Elsevier, Amsterdam.
Kao J H K (1959) “A Graphical Estimation of Mixed Weibull Parameters in Life Testing Electronic
Tubes”, Technometrics, 1, 4, 389-407.
Kao J H K (1960) “A Summary of Some New Techniques on Failure Analysis”, Proc 6th National
Symposium on Reliability and Quality, Washington DC, Jan 11-13, 1960, 190-201.
Lawless J F (1982) “Statistical Models and Methods for Lifetime Data”, Wiley, New York.
Natesan and Jardine A (1986) “Graphical Estimation of Mixed Weibull Parameters for Ungrouped
Multicensored Data”, Maintenance Management International, 115-127.
Nelson W (1981) “Applied Life Data Analysis”, J Wiley and Sons, New York.
O’Connor P D T (1991), “Practical Reliability Engineering”, J Wiley and Sons, Chichester, UK.