EXAKT-Reliability Centered Knowledge Book

Reliability-centered
Knowledge
Using Maintenance Databases for Reliability Analysis and Improvement
Page
Part 1. Knowledge Management 13
Part 2. Condition Based Maintenance 83
Part 3. Reliability Centered Maintenance 201
By:
Murray Wiseman
Daming Lin
September 2005
Page 1
Optimal Maintenance Decisions (OMDEC) Inc 2004
Preface
This book provides the course notes for a CBM (condition based maintenance¹) training
session presented in three parts:
1. Database attributes required for reliability analysis,
2. Smart condition based maintenance, and
3. Reliability-centered maintenance.
We begin by introducing the term “reliability-centered knowledge” to imply that
structured and valid information will drive reliability improvement. The book and course
draw liberally from RCM practice, invoking such concepts as “FMEA”, “decision
analysis”, and “age exploration”. Chapters 1 and 2 outline a reliability-centered
information strategy, while Chapter 3 describes principles and techniques for using the
information from a well-conceived information strategy. Chapter 4 sets forth the practical
infusion, into the CMMS, of the reliability-centered principles developed in Chapters 1, 2
and 3.
Part 2 offers an introduction and background to CBM that is ultimately extended to a
CBM optimizing methodology developed at the University of Toronto (called EXAKT).
We begin Part 2 with a theoretical development of CBM, a history of CBM, and a
discussion of the reasons for selecting CBM as a proactive task. Next we expose the
anatomy of CBM, specifically its three sub-processes – data acquisition, signal
processing, and decision making. The fundamentals of CBM are explored further,
including the RCM concept of the “P-F interval”², which we describe and reconcile (in
Chapter 9) with the principles of CBM optimization. The development of the
relationship between data and risk ensues, using a time-based maintenance example. The
ideas are then extended to CBM using the Weibull PHM³ model (Chapter 10).
The growing volumes of data that flood the ever-diminishing resources of today’s
maintenance departments compel us to automate the CBM process in all three of its steps:
the acquiring of the data, its interpretation, and the decision of when and how to act upon
that data.
At this point readers are invited to work through a step-by-step exercise during which
they encounter the basic features of CBM statistical modeling software. We proceed to
build an optimal decision model using a reduced set of haul truck transmission oil
analysis data. In the exercise that follows, the users deploy the model that they have
previously created. They examine its automated analysis, reporting, and database
functionality. In a second exercise we explore the vital issue of data validation. The
example has been taken from a CBM project at a coal mine in which invalid data,
missing data, faulty failure definition, the impact of oil changes on oil analysis data, and
cost sensitivity analysis are all encountered, and their respective remedies explored. At
this time, we introduce an advanced topic – the analysis of complex items⁴. We define a
complex item and describe the data structure needed for representing complex items in a
decision model.
¹ Also called Predictive Maintenance (PdM), Condition Monitoring (CM), and On-condition maintenance.
² The term P-F Interval was coined by John Moubray to represent the concept described by Nowlan and
Heap for the period between the appearance of a potential failure and the occurrence of a functional failure.
See The Elusive P-F Curve, Chapter 9, page 106.
³ The PHM (proportional hazard model) extends the age based reliability model developed by Waloddi
Weibull in the 1950’s to one developed by Cox in the 1970’s that adds condition monitoring and
performance data to the age-reliability relationship.
Chapter 11 introduces expert systems for CBM decision making. It describes, in detail, a
successful methodology applied to vibration analysis. The chapter closes by proposing a
hybrid system combining the respective advantages of an expert system with those of a
statistical modeling system. Chapter 12 unifies the principles of prognostics and
diagnostics by outlining a methodology known as case based reasoning, which extends
RCM knowledge to automated diagnostics. Chapter 13 reviews the technical literature
in a thorough survey of signal processing and decision making approaches used in CBM.
Reliability-centered maintenance (RCM) forms the philosophical framework of this
work. Effective CBM flows from RCM analyses. We have, for this reason, included Part
3, an RCM manual, which unites the first two pillars of RCM (FMEA and the decision
algorithm) with the third pillar, “Age Exploration”, developed throughout Parts 1 and 2.
I hope you enjoy the book and invite your comments at murray@omdec.com.
Murray Wiseman
Optimal Maintenance Decisions (OMDEC) Inc.
⁴ Complex items are items that are subject to more than one failure mode.
Introduction by Andrew Jardine
Over the past decade, in my work as principal investigator at the CBM laboratory and
during my travels and speaking engagements, people have asked what inspired the EXAKT
development project. The answer to that question is quite simple. Condition based
maintenance is the most desirable form of maintenance, yet former students, now
maintenance professionals, tell me that they often find that their current CBM programs,
such as oil analysis, don’t deliver the intended results. I asked them how “exactly” their
staff interpret condition monitoring data. In other words, how do they decide whether or
not to remove an item for repair? Their answers led me to investigate whether a more
rigorous decision methodology might improve the payback on the rather large investment
they were making in condition based maintenance.
I found that two approaches were being used to interpret and act upon CBM data. One
method arrived at decisions by recalling solid experience and engineering knowledge that
a known level of a monitored variable indicates the initiation of a particular failure mode.
The second relied on “trend analysis” as the basis for making the
“maintain-now-or-continue-operating” decision. Looking closely at the data and results in both cases, I
found that, while the former achieved, generally, the expected benefits, the latter failed to
provide measurable return on the investment in the fixed and running costs of the CBM
program.
In the first case, CBM detection of, for example, diesel fuel in lubricating oil, reflects the
“ground truth” of a failed condition – that is, a leaking of fuel past the sealing surfaces of
some interface, perhaps the piston, ring, and cylinder wall. Similarly, coolant in the lube
oil, reflects the breakdown of some interface, possibly a gasket, separating the cooling
and lubricating fluids. However, where “data trending” is the principal method for
decisions, the relationship between monitored data and the failure mechanism is often
vague. We rely on a palpable deviation from some “normal” trend to alert us to a
problem.
Although this sounds like a reasonable approach, it works only if the data clearly reflects
a developing failure. But such is often not the case. Usually, several separate or
interrelated phenomena affect the monitored data. Although common sense would have us
believe that monitored signals from the machine must contain its health information, we
often know little about the nature of that relationship. For example, if the operator of a
nuclear reactor alters the temperature of the sealing fluid in the cooling water pump, then
the leak rate, normally used to monitor seal health, would tend to decrease, even if the
seal were, indeed, beginning to fail. The interpretation of trends thus becomes
complicated. Add to this random noise, the effects of load variation, and more than one
failure mode, and you can imagine that attempting trend analysis of multiple data
streams, emanating from complex systems, might frustrate the well-intentioned
maintenance planner or engineer.
This problem posed a unique challenge. The condition monitoring phrase “equipment
health” brings to mind the idea of human health. I looked at the medical field where the
problem of symptom-based prognostics is well known. The concept of “risk factors” that
associate medical test results with specific illnesses seemed perfectly analogous to the
problem of risk based decisions in maintenance. Cox’s proportional hazard model in the
1970’s had proved useful in the detection of illnesses and in the prediction of human
survival. I applied these ideas, first, to jet propulsion engines, and discovered that we
could model the risk of engine failure in terms of the oil analysis results of iron and
chromium, and the engine’s accumulated flight hours since overhaul. That work proved
very encouraging. So much so, that we set out to develop a general purpose software
platform for PHM (proportional hazard modeling) prediction. Over the past decade, at the
CBM laboratory of the University of Toronto, we gradually improved the program by
applying it to many industrial CBM situations. It has reached the stage now, where it
should be made commercially available to the mainstream of the physical asset
management community. That is the reason OMDEC was spun off from the CBM lab.
I have often been asked why we called the program “EXAKT”, implying that CBM is an
“exact” science, while in fact the methodology of EXAKT is based on probabilities and
statistics. Certainly, I can see why some people think that the name “EXAKT” and the
probabilistic nature of failure are incongruous. Most managers, however, understand risk.
They instinctively weigh probabilities when making decisions in the normal course of
their activities. If they were told “exactly” the risk levels associated with alternative
decisions, they would find such information helpful indeed. Otherwise stated, if they
knew “exactly” with what level of confidence they may accept a residual life estimate for
some operating physical asset, they could adjust their operational and maintenance plans
accordingly.
Self-guided tutorial exercises are a good way to provide a comfort factor to potential
users. EXAKT is actually a usable tool. But, because EXAKT evolved as a research
platform, some people have formed the impression that it is too difficult for them. This
book sets out to dissolve that feeling. Besides a sound treatment of the founding
principles of CBM based on RCM derived knowledge, it contains step-by-step tutorials
that convey a number of common data problem solving techniques.
I take great pleasure in writing this introduction to “Reliability-centered Knowledge”. I
am certain that it will add substantially to the success of its readers’ CBM endeavours.
Andrew Jardine
Principal Investigator, CBM Lab
Professor, Mechanical and Industrial Engineering
University of Toronto
Contents:
Part 1. Knowledge Management ________________________________________ 13
Chapter 1. The knowledge elements ____________________________________ 13
Introduction________________________________________________________ 13
The Work Order UML Class Diagram__________________________________ 15
Incorporating RCM knowledge attributes _______________________________ 15
The Seven Knowledge elements of RCM ________________________________ 16
The “failure code” problem ___________________________________________ 18
Chapter 2. Requirements of Information ________________________________ 19
Data Structure______________________________________________________ 21
Implementing a Reliability Knowledge Base _____________________________ 22
Other “FMEA” data types and definitions _______________________________ 27
Conclusions ________________________________________________________ 30
Chapter 3. Using maintenance information ______________________________ 33
Introduction________________________________________________________ 33
The problem with failure rates ________________________________________ 34
How to use maintenance data? ________________________________________ 35
Age Exploration Procedures __________________________________________ 38
Random Failure____________________________________________________ 38
Failure Finding Intervals_____________________________________________ 39
Measuring Reliability Improvement ____________________________________ 41
Refining the maintenance program_____________________________________ 44
Assessing the effectiveness of a CBM Program ___________________________ 44
Improving the program through failure mode assessment ___________________ 46
Software analytic tools ______________________________________________ 47
CBM (on-condition maintenance) benefits analysis________________________ 49
Engineering Change Assessment ______________________________________ 51
Keeping Track of Components ________________________________________ 52
Introduction_______________________________________________________ 52
Recording Events for Reliability Analysis _______________________________ 52
Keeping track of system component ages________________________________ 53
Significant components______________________________________________ 54
Suspended Animation________________________________________________ 55
Handling meter anomalies ___________________________________________________ 55
Marginal analysis ___________________________________________________ 57
Chapter 4. Acquiring Maintenance Information __________________________ 58
Introduction________________________________________________________ 58
Lexicon ____________________________________________________________ 59
The purpose of the EWOP ____________________________________________ 60
Work order documentation procedures for the EWOP ____________________ 60
The events table_____________________________________________________ 65
The RCM knowledge base ____________________________________________ 66
Uniqueness of a work order ___________________________________________ 66
Examples __________________________________________________________ 66
Summary and Conclusions____________________________________________ 71
Chapter 5. Assessing “What-if” from maintenance information______________ 73
Introduction________________________________________________________ 73
Modeling a simple system using SPAR __________________________________ 73
Objective of the analysis_____________________________________________ 74
The system function ________________________________________________ 74
Running the program _______________________________________________ 75
Remarks _________________________________________________________ 76
Repair effectiveness ________________________________________________ 76
Applying Preventive Maintenance _____________________________________ 78
Optimizing PM ____________________________________________________ 80
Part 2. Condition Based Maintenance ___________________________________ 83
Chapter 6. Deciding on CBM _________________________________________ 83
Introduction________________________________________________________ 83
Why do CBM?______________________________________________________ 84
History of CBM _____________________________________________________ 87
Chapter 7. Anatomy of CBM __________________________________________ 91
Data Acquisition ____________________________________________________ 91
Signal Processing____________________________________________________ 95
Decision Making ___________________________________________________ 100
Chapter 8. CBM Fundamentals_______________________________________ 103
The fundamental premise of CBM ____________________________________ 103
CBM Program Criteria _____________________________________________ 103
CBM Monitoring Frequency ________________________________________ 103
Estimating the PF Interval __________________________________________ 105
Chapter 9. The Elusive P-F Curve ____________________________________ 106
Are failures required – multiple levels of intrusiveness? ___________________ 108
Discussion of Case 2_______________________________________________ 109
Discussion of Case 1_______________________________________________ 111
Chapter 10. Optimizing CBM _________________________________________ 113
Developing a Maintenance Risk Model ________________________________ 113
The traditional risk model___________________________________________ 113
Combining Data and Risk___________________________________________ 114
The Optimal Risk _________________________________________________ 116
A Time Based Maintenance Model ____________________________________ 118
Blending in Cost __________________________________________________ 123
A Condition Based Maintenance Model ________________________________ 125
Automated CBM Decision Making ___________________________________ 126
Example 1 Creating and deploying a decision model______________________ 127
Example 2 Data validation __________________________________________ 131
Example 3 Complex Items __________________________________________ 146
Example 4 Data transformations______________________________________ 150
References ________________________________________________________ 151
Chapter 11. CBM Decision Making with Expert Systems ___________________ 152
Step 1 Data normalization ___________________________________________ 153
Step 2 The screening matrix__________________________________________ 154
Step 3 Cepstrum analysis ____________________________________________ 154
Step 4 Demodulation________________________________________________ 155
Step 5 Component specific diagnostic matrices __________________________ 157
Step 6 Decision making______________________________________________ 157
A proposed hybrid decision tool ______________________________________ 160
The ABB fault simulator____________________________________________ 160
Chapter 12. Case based reasoning______________________________________ 165
Introduction_______________________________________________________ 165
Efficient Troubleshooting____________________________________________ 166
Case Base Development _____________________________________________ 168
Terminology _____________________________________________________ 168
Building a knowledge domain _______________________________________ 169
Building a case ___________________________________________________ 170
Case Study ________________________________________________________ 171
The seed case base__________________________________________________ 174
Performance measurement __________________________________________ 175
Conclusions _______________________________________________________ 175
Chapter 13. A survey of signal processing and decision technologies for CBM __ 177
Introduction_______________________________________________________ 177
Data acquisition ____________________________________________________ 178
Signal processing___________________________________________________ 178
Signal processing _________________________________________________ 179
Value type data analysis ____________________________________________ 184
Data analysis combining event data and condition monitoring data __________ 184
Maintenance decision support ________________________________________ 186
Diagnostics ______________________________________________________ 186
Prognostics ______________________________________________________ 192
Multiple sensor data fusion __________________________________________ 197
Concluding remarks ________________________________________________ 199
Part 3. Reliability Centered Maintenance ________________________________ 201
Chapter 14. Pillars of RCM ___________________________________________ 201
Introduction_______________________________________________________ 201
RCM Execution Strategies ___________________________________________ 203
Chapter 15. Failure Modes and Effects Analysis __________________________ 204
Question 1 – Functional Analysis _____________________________________ 204
The process ______________________________________________________ 204
Example 1 _______________________________________________________ 207
Example 2 _______________________________________________________ 210
Example 3 _______________________________________________________ 212
Example 4 _______________________________________________________ 213
Question 2 – Failure Analysis ________________________________________ 214
The process ______________________________________________________ 214
Example 1 _______________________________________________________ 214
Example 2 _______________________________________________________ 215
Example 3 _______________________________________________________ 215
Question 3 – Failure modes analysis ___________________________________ 216
The process ______________________________________________________ 216
Example 1 _______________________________________________________ 218
Example 2 _______________________________________________________ 219
Example 3 _______________________________________________________ 220
Question 4 – Effects analysis _________________________________________ 220
The process ______________________________________________________ 220
Example 1 _______________________________________________________ 221
Example 2 _______________________________________________________ 230
Example 3 _______________________________________________________ 231
Chapter 16. The RCM Decision Algorithm_______________________________ 233
Questions 5, 6, and 7 ________________________________________________ 233
The process ______________________________________________________ 233
Example 1 _______________________________________________________ 235
Example 2 _______________________________________________________ 236
Example 3 _______________________________________________________ 239
Example 4 _______________________________________________________ 243
Chapter 17. Integrating Reliability Information - MIMOSA _________________ 249
UML Class Diagrams _______________________________________________ 249
Chapter 18. Managing Strategy________________________________________ 254
Introduction_______________________________________________________ 254
Extending the Maintenance Audit_____________________________________ 255
Physical asset management inputs, outputs, and control __________________ 256
Physical Asset Management Effectiveness Indicators (KPIs)_______________ 257
Choosing between model 1 and model 2 ________________________________ 258
Drilling down from the KPIs _________________________________________ 260
How to start _______________________________________________________ 262
Chapter 19. Appendices ______________________________________________ 263
Appendix 1. EWOP details __________________________________________ 263
Used components and components in suspended animation ________________ 263
The EWOP’s Impact on the Work Process______________________________ 265
Using the EWOP prototype software __________________________________ 267
The onion skins of CBM____________________________________________ 268
The EWOP and EXAKT____________________________________________ 269
Appendix 5 The EWOP and EXAKT__________________________________ 269
Appendix 2. _______________________________________________________ 271
The role of the RCM Facilitator - Five Skill Areas: _______________________ 271
Appendix 3. _______________________________________________________ 276
Sizing the analysis_________________________________________________ 276
Selecting the significant items _______________________________________ 278
Appendix 4. _______________________________________________________ 278
Failure finding intervals for complex items (multiple failure modes and devices) 278
Appendix 5. _______________________________________________________ 280
Truck description _________________________________________________ 280
Appendix 6. _______________________________________________________ 288
Terminology used: ________________________________________________ 288
Various definitions of “Life” ________________________________________ 290
Appendix 7. _______________________________________________________ 290
Time to Failure - Relationship among hazard, reliability, and probability density
functions ________________________________________________________ 290
Appendix 8. _______________________________________________________ 293
Random failure survival curve _______________________________________ 293
Appendix 9. _______________________________________________________ 293
Inherent reliability characteristics_____________________________________ 293
Appendix 10. ______________________________________________________ 294
Failure mode depth of causality ______________________________________ 294
Appendix 11. Cost Comparison of CBM Policies ________________________ 295
Appendix 12. ______________________________________________________ 300
Expected failure time for an item whose maintenance policy is time-based ____ 300
Appendix 13. ______________________________________________________ 302
Default RCM decision diagram answers in the absence of operating experience 302
Appendix 14. ______________________________________________________ 303
Additional Relcode examples ________________________________________ 303
Appendix 15. EXAKT Exercises ______________________________________ 307
Appendix 16. References to Chapter 13.________________________________ 326
Part 1. Knowledge Management
Chapter 1. The knowledge elements
Introduction
The quest for information consumes the physical asset manager, more so than his
counterparts in any other sector of the organization. Physical asset managers seek out
information technology products that promise to help them decide how to deploy their
forces intelligently.
To what extent may maintenance professionals influence the design of the technology
they acquire? In a perfect market economy, they, as a group, might exercise control over
the features and cost of technical products and services in which they invest. If a product
does not meet their desires, then, in the utopia of unlimited choice and instant
information, they will simply select one that does. Of course, we neither live nor
consume in an ideal marketplace. Does its imperfection frustrate our practical needs by
substituting perceived ones? What can we do to exercise due influence over the design of
new products and services destined to find their way into our technological tool box?
A look at the other side of the coin – the producer’s viewpoint – may prove enlightening.
In the pursuit of open standards for their products’ unhindered electronic
interconnectivity, technology producers often form trade associations. Such organizations
have demonstrated unprecedented collaboration (even amongst otherwise acute
competitors) in defining a technology framework with which to address their common
market, and to do so with utmost efficiency. We list, in Table 1-1, a sampling of four
such associations, along with their respective website slogans and some representative
members.
Table 1-1
Website: www.mimosa.org
Slogan: “a non-profit trade association which develops and encourages adoption of open
information standards for Operations and Maintenance”
Typical sponsors/members: Emerson Process Management CSI, The DEI Group, DLI Engineering
Corp., Indus International, Inc., Rockwell Automation, SAIC, Siemens AG

Website: www.osacbm.org
Slogan: “the Open System Architecture for Condition Based Maintenance”
Typical sponsors/members: Boeing, Caterpillar, Oceana Sensor Technologies, Rockwell
Automation, ARL, Office of Naval Research

Website: www.opcfoundation.org
Slogan: “Dedicated to interoperability in automation”
Typical sponsors/members: ABB Automation, Acsis, Inc, Advanced Engineering, Inc.,
Advanced, Measurement & …

Website: www.hartcomm.org
Slogan: Real-time connections … helping you lower maintenance cost, increase plant
availability, improve plant operations, and facilitate regulatory compliance.
Typical sponsors/members: ABB Automation Products, Action Instruments, Adaptive
Instruments LLC, Advanced Flow Technologies Co., Agar Corporation Inc., American Level
Instruments …

Such focused business energy, without doubt, propels advanced open enterprise
application integration (EAI)⁵ technology that will yield cost savings and efficiency for
technology vendor and user alike. On the other hand, a healthy balance between the
influence of producers and that of users can help keep those benefits flowing equitably.
Far from discouraging such valuable technological advances in enterprise integration
standards, we propose to amplify their benefits with a growing understanding of the
fundamental knowledge elements governing failure behavior as revealed by
“reliability-centered maintenance”⁶.
The technology industry seeks out the maintenance professional. That individual labors
relentlessly in pursuit of overall equipment effectiveness at lowest cost. Seldom
possessing adequate time or resources to research and analyze the multitude of failures
and reliability problems encountered, he has come to rely upon a network of suppliers,
who prefer to be known as “solution providers”.
In this book we address the technology of information in physical asset management. We
explore the issue of information management as it relates to maintenance effectiveness.
We examine the importance of “knowledge harvesting” methods such as case based
reasoning (CBR), an approach to incremental, sustained learning where each new
experience is retained, making it available for solving future problems.
Effective learning requires a well worked out set of methods in order to extract relevant
knowledge from experience, integrate that experience into an existing knowledge
structure, and index it for later matching with similar cases. We hope that the principles
described here will help maintenance professionals to understand and to more clearly
express their reliability information needs in the torrent of new products and services that
may otherwise overwhelm them.
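The retain-and-retrieve cycle of case based reasoning described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and field names (Case, CaseBase, symptoms, failure_mode, remedy) are hypothetical, and the Jaccard overlap used to rank candidate cases is a stand-in similarity measure, not the matching method developed later in this book.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """One retained maintenance experience (all field names are hypothetical)."""
    symptoms: frozenset
    failure_mode: str
    remedy: str


@dataclass
class CaseBase:
    cases: list = field(default_factory=list)
    index: dict = field(default_factory=dict)  # symptom -> positions of cases showing it

    def retain(self, case):
        """Integrate a new experience and index it by each of its symptoms."""
        position = len(self.cases)
        self.cases.append(case)
        for symptom in case.symptoms:
            self.index.setdefault(symptom, set()).add(position)

    def retrieve(self, observed):
        """Return the stored case most similar to the observed symptoms, or None."""
        observed = frozenset(observed)
        candidates = set()
        for symptom in observed:
            candidates |= self.index.get(symptom, set())
        best, best_score = None, 0.0
        for position in sorted(candidates):
            case = self.cases[position]
            # Jaccard overlap: shared symptoms divided by all symptoms involved (an assumption)
            score = len(observed & case.symptoms) / len(observed | case.symptoms)
            if score > best_score:
                best, best_score = case, score
        return best
```

Each call to retain() adds one experience to the knowledge structure and updates the index, so that a later retrieve() only has to compare the new problem against cases that share at least one symptom with it.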
⁵ Open Applications Group – Enterprise Application Integration OAG-EAI
⁶ Reliability-centered maintenance (RCM) themes and principles are invoked throughout this book and
developed in detail in Chapter 14 (page 201).
The Work Order UML Class Diagram
The UML⁷ (Unified Modeling Language) is a graphical modeling language used to
develop computerized business solutions. In this section we invoke the UML to help us
clarify some of the business processes of maintenance. A physical asset management
(maintenance) information system revolves about the work order. It is the focal point for
the request, manpower allocation, procurement, execution, and historical documentation
of a maintenance action. First we represent the work order in a UML class diagram as in
Figure 1-1.
Figure 1-1: A UML class diagram representation of a work order

The work order icon of Figure 1-1 represents a WorkOrder class. By “class”, we mean a
specification of what a work order should be and do. A class diagram icon consists of
three parts, from top to bottom, that hold, respectively, its name, its descriptive attributes,
and what it does (or what can be done to it) – its operations. A UML icon will seldom
show everything there is to know about its entity, but rather will expose just enough detail
to convey the intended ideas about it. At times the un-detailed class diagram of Figure
1-1 will be adequate as a high level view for the discussion needs of the moment. At
other times, for some more descriptive purpose, we will need to convey additional details
about objects of the workorder class, for example, as in Figure 1-2.
.
Figure 1-2: A UML class with some attribute and operation information exposed. In this
view the data types of the attributes, such as Integer, String, Double, and Date, are shown.
Parentheses “()” indicate an operation associated with the work order.
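As a rough illustration of the class icon, the three compartments (name, attributes, operations) map naturally onto a class in code. The sketch below uses Python; the attribute names (job_no, description, labor_hours, date_opened, date_closed) and the close operation are hypothetical stand-ins chosen only to show one attribute of each data type mentioned in the caption. They are not taken from the figure.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class WorkOrder:
    """Name compartment: the class is called WorkOrder."""

    # Attribute compartment (all names are illustrative assumptions).
    job_no: int                         # UML Integer
    description: str                    # UML String
    labor_hours: float                  # UML Double
    date_opened: date                   # UML Date
    date_closed: Optional[date] = None  # empty until the work is done

    # Operation compartment: something that can be done to a work order.
    def close(self, when: date) -> None:
        """Mark the work order as completed on the given date."""
        if when < self.date_opened:
            raise ValueError("a work order cannot close before it opens")
        self.date_closed = when


wo = WorkOrder(1001, "Replace pump seal", 3.5, date(2005, 9, 1))
wo.close(date(2005, 9, 2))
print(wo.date_closed)  # 2005-09-02
```

Each work order actually raised would be an object (instance) of this class, in the same way that wo is above.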
Incorporating RCM knowledge attributes

What attributes of a work order might we consider important when addressing an
organization’s reliability needs? The pivotal maintenance research project by Nowlan and
7
For a thorough discussion of the UML see “The Unified Modeling Language User Guide”, Grady Booch,
James Rumbaugh, Ivar Jacobson, ISBN 0-201-57168-4. Addison-Wesley 1998.
Page 15
Heap8 entitled “Reliability-centered Maintenance” or RCM9 sheds considerable light on
this question. Nowlan and Heap investigated the failures of airplanes over three decades,
and, based on a remarkably comprehensive study, discovered those informational
elements that are essential to understanding the requirements of a maintenance program.
SAE Standard JA101110 encapsulates those information requirements in seven RCM
(reliability-centered maintenance) questions:
The Seven Knowledge elements11 of RCM

1. What are the functions and associated desired standards of performance of
the asset in its present operating context (functions)?
2. In what ways can it fail to fulfill its functions (functional failures)?
3. What causes each functional failure (failure modes)?
4. What happens when each failure occurs (failure effects)?
5. In what way does each failure matter (failure consequences)?
6. What should be done to predict or prevent each failure (proactive tasks
and task intervals)?
7. What should be done if a suitable proactive task cannot be found (default
actions)?
Organizations that methodically seek out the answers to these questions quickly realize
significant improvements in the efficiency and effectiveness of their maintenance
processes. Endowing the work order class with these informational attributes lays down
the basis for a “reliability-centered knowledge base” or a “reliability information
system”. Such a structure must store useful knowledge with the intent to sustain and
improve the overall operating effectiveness of our physical assets. Figure 1-3 illustrates
the purpose of recorded maintenance data.
A maintainer, in the course of his or her day-to-day inspection, troubleshooting, and
repair activities, must record clearly which functions of the asset have been (or were in
the process of being12) lost, compromised or threatened, in what way, why, what
happened, and how it mattered.
8
F. Stanley Nowlan, Howard F. Heap, Reliability-Centered Maintenance, United Airlines under the sponsorship of the Office
of Assistant Secretary of Defense (Manpower, Reserve Affairs and Logistics), 1978.
9
Reliability-Centered Maintenance is a process for determining the maintenance requirements of a physical asset by
addressing the consequences of failure and seeking the most cost effective preventive or mitigating tasks.
10
Society of Automotive Engineers, SAE JA 1011 Issued Aug1999 Evaluation Criteria for Reliability-Centered Maintenance
(RCM) Processes
11
We use the expression “knowledge element” interchangeably with the phrase “RCM question” in order
to emphasize that knowledge drives decisions to do the right maintenance at the right time. The 7
knowledge elements constitute the framework of our reliability centered knowledge to be physically
enshrined in our CMMS.
12
All corrective activities are motivated either by a functional failure or a potential failure
Page 16
Figure 1-3: The use of recorded information in maintenance
The information in the left box of Figure 1-3 is sometimes referred to as “as-found”
information. Precise, consistent language populating structured CMMS13 records is grist
for the mill of continuous improvement. While reliability-centered maintenance initially
analyzes “what could happen” to a physical asset, maintainers using a similar conceptual
framework, add information on “what did happen”. The term “PM” in the right hand box
of Figure 1-3 refers to “preventive maintenance” in its broadest sense. PM, in this
context, includes any type of pro-active scheduled inspection (CBM), overhaul, failure
finding activity, or even an engineering or process modification14. To portray the
knowledge retention characteristics of our maintenance information system in the light of
RCM thinking, we redraw the work order UML icon as in Figure 1-4.
Figure 1-4: Reliability informational attributes of the work order class
13
Computerized maintenance management system (CMMS), also known as a Maintenance information
management system (MIMS)
14
However, a modification, although very often carried out by maintenance personnel, is not
“maintenance” in the strict sense. It is a design improvement to the inherent reliability (capability) of the
asset.
Page 17
The five failure descriptive attributes exposed by the WorkOrder class icon of Figure 1-4
are precisely those of the first five RCM questions (page 16). Additional work order
attributes will describe what was actually done. The rigorous representation of historical
failure information (of Figure 1-4) contrasts starkly with popular attempts to define and
capture failure codes.
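A minimal sketch of how the five failure-descriptive attributes might be held together in code follows. It assumes Python and uses a hypothetical class name (FailureDescription), hypothetical field names, and invented example values; the actual attribute names of Figure 1-4 may differ. The helper method only shows that a record left partly blank can be detected before it enters the knowledge base.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class FailureDescription:
    """The first five RCM knowledge elements, as recorded on a work order."""

    function: str
    functional_failure: str
    failure_mode: str
    failure_effects: str
    failure_consequences: str

    def missing_elements(self) -> List[str]:
        """Return the names of any knowledge elements left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Invented example values, for illustration only.
record = FailureDescription(
    function="Pump at least 800 L/min of water",
    functional_failure="Pumps less than 800 L/min",
    failure_mode="Impeller worn",
    failure_effects="",
    failure_consequences="Operational",
)
print(record.missing_elements())  # ['failure_effects']
```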
The “failure code” problem

Many maintenance professionals who seek information with which to improve Overall
Equipment Effectiveness15 seize upon the opportunity to record failures in the form of
“failure codes”. Such short descriptive acronyms or phrases appear as a “good first step”
towards acquiring useful knowledge about failure behavior. Ideally, failure codes should
present themselves in the form of configurable, context sensitive16 drop-down lists (or
checkboxes) for convenient, yet accurate, failure classification. The maintainer selects a
failure code while completing the work order form. Unfortunately, failure codes seldom
meet expectations for PM assessment and improvement.
Pick lists of maintenance failure codes are often difficult to choose from and prone to
error. The selection items are often too general or do not adequately fit a given situation.
Alternatively, long lists of precise codes suffer from “choice overload”, resulting in the
overuse of the default “Other”. Without doubt, effective and accurate lists are the
ultimate objective of reliability-centered knowledge systems. But deciding what selection
choices to place on such pick lists is no trivial matter. Some intermediary process is
required that facilitates the day-to-day recording of useful reliability knowledge in the
short term and eventually evolves into the provision of accurate, robust
pick lists. Chapter 3. (page 33) will address the problem of failure code development
and suggest an approach that is reasonable, simple, robust and progressive. That
approach, elaborated in Chapter 4. (page 58), will unify failure mode records in the RCM
worksheet (knowledge base) with the failure codes in the work order database.
15
The OEE (Availability x Productivity x Quality) tracks maintenance effectiveness, where: Availability =
(scheduled time – downtime due to all forms of maintenance)/(scheduled time). Productivity = Product rate
setting/Desired product rate. Quality = (Product – Scrap)/Product. Additionally, tracking Reliability =
MTTF, will provide further measures of maintenance effectiveness. Two OEE models are thoroughly
described in Chapter 18. on page 254.
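The OEE arithmetic of this footnote can be sketched in a few lines of Python. The function name, parameter names, and the sample figures are assumptions made for illustration; only the three ratios and their product come from the footnote.

```python
def oee(scheduled_time, downtime, rate_setting, desired_rate, product, scrap):
    """Overall Equipment Effectiveness = Availability x Productivity x Quality."""
    availability = (scheduled_time - downtime) / scheduled_time
    productivity = rate_setting / desired_rate
    quality = (product - scrap) / product
    return availability * productivity * quality


# Hypothetical month: 720 scheduled hours, 72 hours of maintenance downtime,
# line set to 90 units/h against a desired 100 units/h, 50 scrapped of 1000.
value = oee(720, 72, 90, 100, 1000, 50)
print(round(value, 4))  # 0.7695
```

Here availability is 0.90, productivity 0.90, and quality 0.95, so the three modest shortfalls compound to an OEE of roughly 77 percent.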
16
Only those failure codes appropriate to the equipment and the symptom should appear
Page 18
Chapter 2. Requirements of Information
In order to set up a reliability information system or to add functionality to an existing
CMMS (computerized maintenance management system) consistent with the goals
outlined in Chapter 1, we would need to provide for the following “reliability-
centered” requirements described by Nowlan and Heap:
1. To determine the types of failures the equipment is actually exposed to as well as
their frequencies
2. To expose the consequences of each failure, ranging from direct safety hazards
through serious operational consequences, high repair costs, long out-of-service
times for repair, to a deferred need to correct inexpensive functional failures
3. To confirm that functional failures originally classified as evident (during RCM
or other initial analysis) are in fact evident to operational staff during the normal
performance of duties
4. To identify the circumstances of failure in order to determine whether the failure
occurred as a result of normal operation or was due to some external factor
(accidental damage)
5. To confirm that on-condition17 (CBM) inspections are really measuring the
reduction in resistance18 to a particular failure mode
6. To inform us of the actual rates19 of reduction in failure resistance in order that
we may determine optimum inspection intervals
7. To record the mechanism involved in certain failure modes in order to identify
new forms of on-condition inspection (CBM) or parts that require design
improvement
8. To identify those tasks assigned originally but that do not prove applicable and
effective20
9. To identify maintenance packages that are generating few trouble reports and
are candidates for longer interval schedules
10. To identify items that are not generating trouble reports
11. To record the working ages of assets and components at which failures occur in
order to:
1. assess and refine inspection or scheduled repair intervals
2. build CBM data interpretive decision models21
These requirements demand the provision of tools that will enable the systematic
collection, storage, and retrieval of historical experience that is relevant to asset
17
On-condition maintenance: The detection of a potential failure. Also known as condition based
maintenance (CBM) and predictive maintenance.
18
Increased probability of failure as reflected by a condition indicator or by indicators of imposed stress
19
The PF (potential failure to functional failure) interval coined by John Moubray. See Chapter 9. ( p 106)
20
Applicable: A task is technically feasible. Effective: A task accomplishes the intended objective
21
Such as proportional hazard models that estimate the statistical significance, to risk of failure, of age and
measurement observations, and operational factors in order to predict remaining useful life. (See Chapter
10. page 113)
Page 19
reliability. They must meet the physical asset manager’s need to assess the applicability
and effectiveness of a given proactive task. An applicable pro-active task is one that is
technically feasible. An effective task deals satisfactorily22 with the consequences of the
failure that it addresses. Nowlan and Heap used the term age exploration23 to describe
any technique that analyzes information revealed from maintenance tasks. Using age
exploration methods we assess a task’s real applicability and effectiveness, and, if
necessary, we modify the maintenance program accordingly. Figure 2-1 represents, in a
“UML context diagram” a high level view of a reliability information system meeting the
afore-listed eleven requirements.
Figure 2-1: UML Context diagram of a Reliability Information System and various actors who
interact with it. The term “Use Case” refers to the performance of some operation required by the
user. For example, a maintainer who “completes a work order” and a supervisor who “audits a
maintenance record” describe two use cases.
A context diagram such as that of Figure 2-1 shows merely an overall proposed system
and the persons (or other systems) that we intend should interact with it. The relative
22
That is, it either avoids the consequences entirely or reduces the consequences and probability of failure to a satisfactory level.
23
Age Exploration: Any analysis procedure that examines historical maintenance data in order to alter the
maintenance plan for improved physical asset reliability.
Page 20
impact of the actors, the sequence and the details of their use cases are fully described in
other UML diagrams24. Persons or entities other than those portrayed in Figure 2-1 may
interact with the reliability information system. They may include vendors, specialists, or
even automated “intelligent agents”25. Each actor inter-relates with the system in different
ways, the details of which may be described in other diagrams of the UML.
Data Structure
A simplified data structure for the reliability information system of Figure 2-1 could
resemble that of Figure 2-2.
Figure 2-2: Data model. Each table lists its column names.
The bold column names of Figure 2-2 designate a Primary Key (the values in the column
must be unique and non null) and Foreign Key (the values in the column must be
populated with a value from the primary key of the related table). This relationship is a
direct enabler of reliability analysis. It allows the incidences of an important failure mode
to be counted, studied, and correlated with two types of data: 1. working age, and 2.
monitored condition indicators. (Such analyses will be explored thoroughly in Chapter
10. Optimizing CBM page 113)
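The key relationship can be tried out with a small in-memory database. The following is a sketch only, using Python's standard sqlite3 module. RCMREF and Job_No appear in the text; the other column names (Failure_Mode, Working_Age) and all row values are assumptions, and the real tables of Figure 2-2 carry more columns than shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE RCM_Table (
    RCMREF       INTEGER PRIMARY KEY,   -- unique and non null
    Failure_Mode TEXT NOT NULL
);
CREATE TABLE WorkOrders (
    Job_No      INTEGER PRIMARY KEY,
    RCMREF      INTEGER NOT NULL REFERENCES RCM_Table(RCMREF),  -- foreign key
    Working_Age REAL                    -- e.g. operating hours at the event
);
""")
conn.executemany("INSERT INTO RCM_Table VALUES (?, ?)",
                 [(1, "Bearing seized"), (2, "Seal leaking")])
conn.executemany("INSERT INTO WorkOrders VALUES (?, ?, ?)",
                 [(101, 1, 4200.0), (102, 2, 1500.0), (103, 1, 3800.0)])

# Count the incidences of each failure mode and relate them to working age.
rows = conn.execute("""
    SELECT r.Failure_Mode, COUNT(*), AVG(w.Working_Age)
    FROM WorkOrders AS w JOIN RCM_Table AS r ON r.RCMREF = w.RCMREF
    GROUP BY r.RCMREF ORDER BY r.RCMREF
""").fetchall()
print(rows)  # [('Bearing seized', 2, 4000.0), ('Seal leaking', 1, 1500.0)]

# A work order that points at a non-existent RCM record is rejected.
rejected = False
try:
    conn.execute("INSERT INTO WorkOrders VALUES (104, 99, 100.0)")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

Because every work order must reference an existing RCM record, the grouping query can count and average by failure mode without any free-text matching.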
Note that the database table “RCM_Table” contains all of the RCM questions. The
one-to-many cardinality arrow (with the three pronged reverse arrowhead) of Figure 2-2
indicates that each row of the WorkOrders table must relate to a row in RCM_Table. That
is, a work order is an instance of an RCM_Table record. This constraint represents a
problem in current PM programs managed by most existing CMMSs because:
1. A single work order can cover, for example, the overhaul of an entire system or
product line with multiple components and failure modes, and
2. A single CBM (condition based maintenance) inspection work order can span
multiple systems.
24
Such as sequence, use case, and others.
25
Automated “watchdogs” that analyze data and recommend (or implement) an action. See (UML Class
Diagrams page 249).
Page 21
Yet, the proposed reliability information system must respect the one-to-many integrity
constraint between the WorkOrders table and the RCM table. Without such a
relationship we could not trace the decision roots of a pro-active programmed task, and
consequently we could not scrutinize the records of each pro-active task with regard to its
applicability and effectiveness.26 Without such an ability, we cannot question and
improve our maintenance strategy. Hence a conflict arises between the proposed
reliability-centered knowledge base and existing maintenance processes that execute
through the CMMS. We may resolve the difficulty in a number of ways that will depend
on the current CMMS data structure. We offer one solution in Figure 2-3, which shows
the primary key of WorkOrders expanded to include an additional work order attribute
called “Sub_No”.
Figure 2-3: WorkOrders with extended primary key
Under this schema every work order can be related to a specific record in the knowledge
base table “RCM”. Now the Job_No may represent a group of (child) work orders each
corresponding to a particular failure mode (i.e. record) in the RCM table27. We will
propose a comprehensive solution to this problem in Chapter 4. Acquiring Maintenance
Information (page 58).
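One possible reading of the extended key is sketched below with Python's standard sqlite3 module. Job_No, Sub_No, RCMREF, and the table names come from the text; the Failure_Mode column and all row values are assumptions. The footnote above notes that the idea had not yet been implemented, so this is an illustration of the schema only, not a tested CMMS design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE RCM (
    RCMREF       INTEGER PRIMARY KEY,
    Failure_Mode TEXT NOT NULL
);
CREATE TABLE WorkOrders (
    Job_No INTEGER NOT NULL,            -- the parent job, e.g. one overhaul
    Sub_No INTEGER NOT NULL,            -- child work order within the job
    RCMREF INTEGER NOT NULL REFERENCES RCM(RCMREF),
    PRIMARY KEY (Job_No, Sub_No)        -- extended (composite) primary key
);
""")
conn.executemany("INSERT INTO RCM VALUES (?, ?)",
                 [(1, "Bearing seized"), (2, "Coupling misaligned")])

# One overhaul job (Job_No 500) split into two child work orders,
# each tied to exactly one failure mode record.
conn.executemany("INSERT INTO WorkOrders VALUES (?, ?, ?)",
                 [(500, 1, 1), (500, 2, 2)])

children = conn.execute(
    "SELECT Sub_No, RCMREF FROM WorkOrders WHERE Job_No = 500 ORDER BY Sub_No"
).fetchall()
print(children)  # [(1, 1), (2, 2)]
```

The job as a whole can still be planned and costed under a single Job_No, while each child row keeps the one-to-many link to the knowledge base intact.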
Implementing a Reliability Knowledge Base

Two fundamental themes dominate this chapter: 1) The use of unambiguous language to
describe the results of each work order task, and 2) The participation of maintainers,
their supervisors and support staff in the continuous evaluation and improvement of the
knowledge base. Smith28 sums up the objective (of reliability data) by stating, “The
26
This means, for example, that a task, say, “Inspect panels for loose contacts” may be traced right back to
the performance requirement of the asset, providing an auditable trail that may be scrutinized in order to
evaluate or upgrade the maintenance plan at some later time.
27
To the authors’ knowledge, this idea has not yet been implemented. Nevertheless, careful consideration
and testing of its practicality will likely prove a worthwhile endeavour.
28
Smith, A.M. and Watson, I.A. (1980). Common cause failures — a dilemma in perspective. Reliability
Engineering 1, 127-142.
Page 22
problem is knowing what to collect, how to make sense of the wealth of data that one can
gather, and what to do with it.” We seek a data structure that captures the information
whose subsequent analysis will either confirm the underlying assumptions of each PM
task or point out conflicts with recorded observation and thereby suggest specific PM
effectiveness (policy) improvements. That structure must enable the fulfillment of all
eleven requirements outlined earlier (page 19) in this chapter.
The primary purpose of analyzing failure and inspection data is to assess PM
effectiveness in order to improve its impact on cost and reliability, availability, and
quality. Improvement is then actually achieved (in the light of the collected information)
by adjusting PM tasks (including on-condition decision making procedures) and task
intervals, adding new tasks, and eliminating unnecessary (over intensive) PM tasks issued
by the CMMS. The process of improvement is a continuous one that responds to growing
knowledge of the reliability characteristics (Appendix 9. page 293) of an item and
changes in design or in operating context.
A maintainer, having executed a repair task (resulting from a failure, or potential failure),
will need to complete the work order form by providing data in sufficient detail and in the
proper format for a reliability information system. This activity is illustrated in Figure
2-4 by another type of UML diagram called a Use Case diagram. In it, the actor, in this
case the Maintainer, is shown interacting with the system in order to perform the action
(the use case inside the oval), “Complete the work order form”.
Figure 2-4: Use Case Diagram - Complete the work order form
Step 1
Rather than a conventional pick list of failure codes, the user (maintainer) should be able
to conveniently display the RCM table records for a given item. The great advantage of
presenting the failure modes in the full context of the function, functional failure, effects,
and consequences is that (unlike the use of failure codes) there will be little ambiguity or
Page 23
uncertainty29 in how to categorize the failure. Subsequent reliability analysis of these
records will, therefore, be founded upon precise historical data.
The RCM data may be referenced by a CMMS user through a multi-row database form, a
spreadsheet tool (e.g. MS Excel), or a commercial RCM database application integrated
with the CMMS30.
Step 2
At this point, once the records are displayed, there are two possibilities:
1. An appropriate record in the RCM table, that accurately describes the current situation,
will be found, or
2. An appropriate record will not be found.
Step 3
If an appropriate record is found, the user will select that RCM identification number
(called RCMREF) for insertion into the WorkOrders table. Additional table attributes will
be required, some of which are indicated in Figure 2-5.
Figure 2-5: Additional attributes in WorkOrders

The additional fields, Event_Type and Additional, require explanation. There are only
two types of failure to choose from in the Event_Type field. The failure is either 1) a
Functional failure, or 2) a Potential failure.31 The Additional field may be used to
describe the condition indicator and its current observation level, for example, the length
and number of cracks in a structural component, or a specific vibration feature’s value.32
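Steps 1 to 3 can be sketched as follows. This is a simplified Python illustration under stated assumptions: the class and function names, the sample item "Pump P-101", and its records are hypothetical, and the maintainer's visual choice from the displayed records is represented simply by passing in the chosen RCMREF (or None when nothing fits).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EventType(Enum):
    """The two failure types offered in the Event_Type field."""
    FUNCTIONAL_FAILURE = "Functional failure"
    POTENTIAL_FAILURE = "Potential failure"


@dataclass
class RcmRecord:
    rcmref: int          # RCM identification number (RCMREF)
    item: str
    failure_mode: str


@dataclass
class WorkOrderEntry:
    job_no: int
    rcmref: int
    event_type: EventType
    additional: str = ""  # condition indicator and its observed level


# Hypothetical knowledge base contents.
RCM_TABLE: List[RcmRecord] = [
    RcmRecord(1, "Pump P-101", "Bearing seized"),
    RcmRecord(2, "Pump P-101", "Seal leaking"),
]


def records_for_item(item: str) -> List[RcmRecord]:
    """Step 1: display the RCM records for the given item."""
    return [r for r in RCM_TABLE if r.item == item]


def complete_work_order(job_no: int, item: str, chosen_rcmref: Optional[int],
                        event_type: EventType,
                        additional: str = "") -> Optional[WorkOrderEntry]:
    """Steps 2 and 3: reference the chosen record if it exists for the item."""
    for record in records_for_item(item):
        if record.rcmref == chosen_rcmref:
            return WorkOrderEntry(job_no, record.rcmref, event_type, additional)
    return None  # no appropriate record was found


entry = complete_work_order(7001, "Pump P-101", 1, EventType.POTENTIAL_FAILURE,
                            "Vibration velocity 7.1 mm/s")
print(entry.rcmref, entry.event_type.value)  # 1 Potential failure
```

A return value of None corresponds to the second possibility of Step 2, in which no existing record describes the situation.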
Step 4 Extending the Use Case if no record is found

We describe the more challenging case next. It is the situation where no existing record in
the RCM table adequately describes the current failure (or potential failure). Either an
29
see The “failure code” problem page 18
30
Acquiring Maintenance Information (page 58) suggests one type of user interface.
31
Other events in this field may be “suspension” and “suspended animation”. See Chapter 4. Acquiring
Maintenance Information (page 58)
32
Providing a vital link between the CMMS and various CBM and other plant systems (see Integrating
Reliability Information page 249). The description of the symptoms, as reflected by various monitoring or
inspection techniques, should be appended to the Effects of the referenced record in RCM_Table.
Page 24
RCM analysis has not yet been completed for the item in question, or the present
situation had been overlooked in the RCM analysis for the item. The maintainer must
insert a record in RCM. We “extend” the use case “Complete the work order form” to
cover this new situation. The extended33 UML Use Case diagram is shown in Figure 2-6.
Figure 2-6: Extending the Use Case "Complete the work order Form"
The proposed course of events (of Figure 2-6) challenges the maintainer, the supervisor,
the maintenance engineer and all parties dedicated to high quality information in the
system. Ideally, the RCM knowledge base would have been pre-populated by the RCM
team, assembled expressly for that purpose.34 Those persons would have deliberated and
determined the answers to the seven RCM questions (page 16) covering the entire item.
The RCM analysis process in which they engage is highly structured and well facilitated.
Can we expect the maintainer to provide information of the same high quality, but
without the benefit of adequate time and resources normally accorded to an RCM team?
No. Nevertheless, valuable experience and knowledge about a failure (or potential
33
The return arrow (with the unfilled arrow head) in Figure 2-6 labelled with the <<extend>> “stereotype”
indicates that the additional use case is sometimes required.
34
The RCM process is described fully in Part 3. on page 201
Page 25
failure) must be captured at this opportune moment35. How can the Maintainer
accomplish this in little time, working alone? He cannot. The system must provide audit,
approval, support, and educational functionality to assist the maintainer in this effort.
Figure 2-7 displays three new fields in RCM (Approval, Deletion, and Last_Update) that
may be used for this purpose.
Figure 2-7: Adding 3 more fields: Approval, Deletion, and Last_Update, to RCM_Table.
Before describing the approval function, it must be emphasized that the quality (hence
usefulness) of the reliability knowledge base depends mostly on human collaboration
(and less so on computer systems). The quality of the RCM table records does not rest
entirely upon the shoulders of the maintainer, and he must not perceive it to be so. All
personnel will contribute to the ultimate integrity of the records added to the RCM table.
The act of doing so will grow their understanding of the behavior and consequences of an
item’s failure.
The entire process of successfully completing the information fields in the reliability
knowledge base depends on at least six supporting functions:
1. Simplified guidelines and training document
2. Accessible examples
3. Supervision, discussion, and transfer of practical experience
4. Expert support for RCM concepts (hotline)
5. Evolving failure codes36
6. An audit and approval process
35
If not then, when?
36
Without doubt, effective and accurate lists are the ultimate objective of reliability and OEE centered
information systems. But deciding what choices to place on such picklists is no trivial matter. Some
intermediary process is required that will facilitate the day-to-day recording of useful reliability related data
in the short term, but additionally, must eventually evolve to the provision of accurate, robust picklists.
Chapter 4. (page 58) will address the problem of failure code development and suggest an approach that is
reasonable, simple, and progressive.
Page 26
During the process of maintaining the reliability knowledge base, personnel will discover
the most common error – incorrect choice of the failure mode causality depth. We treat
this question in detail in Appendix 10. on page 294. Figure 2-8 illustrates a suggested
discussion and approval process in another UML diagram type known as a Sequence
diagram.
Figure 2-8: A sequence diagram illustrating the creation, approval and discussion of an RCM record
As its name suggests, a sequence diagram shows the timing of various interactions
associated with a use case, say “Inserting an RCM record”. Time proceeds from top to
bottom. The diagram focuses on the messages that are transmitted amongst the
interacting “objects” at various times, thus defining a sequence of messages. The UML
sequence diagram of Figure 2-8 indicates that the Maintainer has created a record. The
newly created record “object” (appearing slightly lower down on the time line) signals
the Maintenance Supervisor that he should verify and approve the information. When that
is done the RCM_Record object signals the Maintainer that he may review any changes
made. Finally the Maintainer may issue a signal indicating that a face-to-face discussion
is desired. While the message passing takes place internally in software and is transparent
to the user, the audit and review functionality provides for confidence in data integrity.
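The message sequence just described can be mimicked with a small Python sketch. The field names approval, deletion, and last_update follow Figure 2-7, but their data types (two flags and a timestamp), the class and method names, and the wording of the messages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class RcmRecord:
    failure_mode: str
    created_by: str
    approval: bool = False              # set once a supervisor has verified
    deletion: bool = False              # flags the record for removal
    last_update: datetime = field(default_factory=datetime.now)
    messages: List[str] = field(default_factory=list)  # log of signals sent

    def __post_init__(self) -> None:
        # Creating the record signals the supervisor to verify and approve it.
        self.messages.append("to Supervisor: please verify and approve")

    def approve(self, supervisor: str,
                revised_failure_mode: Optional[str] = None) -> None:
        """Supervisor approves, optionally correcting the wording first."""
        if revised_failure_mode is not None:
            self.failure_mode = revised_failure_mode
        self.approval = True
        self.last_update = datetime.now()
        self.messages.append(
            f"to {self.created_by}: approved by {supervisor}, please review any changes")

    def request_discussion(self) -> None:
        """Maintainer asks for a face-to-face discussion."""
        self.messages.append("to Supervisor: face-to-face discussion requested")


rec = RcmRecord("Brg fail", created_by="Maintainer")
rec.approve("Supervisor", "Bearing seized due to lack of lubrication")
rec.request_discussion()
for message in rec.messages:  # messages appear in time order, top to bottom
    print(message)
```

The order of entries in the message log plays the role of the vertical time axis of the sequence diagram.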
Other “FMEA” data types and definitions

Derivatives of the original FMEA worksheets and process have added to and subtly
altered the meanings of the four knowledge elements: Function, Failure, Failure Mode,
and Failure Effect; as defined by Nowlan and Heap and most recently enshrined in the
Page 27
SAE JA1011-1999 standard37. Contending standards and older standards such as
“FMECA” (Failure Modes, Effects, Criticality Analysis – MIL-STD-1629A) and
FMEA/AIAG (Automotive Industry Action Group 1995)38 use several of the same words
and phrases but ascribe to them different meanings. Hence alternate definitions of FMEA
terminology have been and continue to be used extensively in many industries.
Understandably, this has led to confusion and miscommunication. Table 2-1 provides a
comparison of alternative terminology.
Table 2-1
| Terminology | Non SAE-JA1011 definition | SAE-JA1011 definition |
| --- | --- | --- |
| FMEA | A systematic tool for identifying: effects or consequences of a potential product or process failure, methods to eliminate or reduce the chance of a failure occurring | Different definition: A tool for determining the functions, functional failures, causes, and effects of a failure of an item in its operating context |
| Potential Failure | Incorrect material choice, inappropriate specifications, operator assembling part incorrectly, excess variation in process resulting in out-spec products. Example: Air Bag (excessive air bag inflator force, operator may not install air bag properly on assembly line such that it may not engage during impact) | Different definition: An indicator that a failure mode has occurred and is in the process of degrading to a functional failure. At the time of detection, however, it has no dire consequences. |
| Basic and Secondary functions | Basic Function: ingress to and egress from vehicle, Secondary function: protect occupant from noise | Similar definition: Primary function: why item purchased / installed. Secondary function: All other functions (protective, environmental, appearance, control-containment-comfort, health and safety, efficiency, structure-superfluous). See page 220. |
| Failure Mode | Physical description of a failure. e.g. noise enters at door-to-roof interface | Different definition: The cause (at a practical causality depth) of a failure. |
| Failure Effects | Impact of failure on people, equipment. E.g. driver dissatisfaction. | Different definition: The typical worst case scenario of relevant events touched off by a failure mode occurring before, during, and after the failure. The scenario will encompass those events at the local/component level, the system/equipment level, the organizational level, and even the external/societal/environmental level as appropriate. |
| Failure | | Describes the way in which an item’s function is lost or compromised. Includes partial or total loss of function and describes the precise manner in which the function fails to perform. |
| Failure Cause/Mechanism | Refers to the underlying (root) cause of a failure. E.g. insufficient door seal. | Somewhat different definition: In SAE JA-1011 there is only one active definition for these terms. That is to say: “Failure Cause” = “Failure Mode” = “Failure Mechanism” = “Root Cause”. It is the failure mode (or modes) retained in an analysis (for example, from a cause and effect diagram if required.) for which there is a practical consequence mitigating activity. |
| Severity | A rating corresponding to the seriousness of an effect of a "potential failure mode". (scale: 1-10) | None. |
| Occurrence | A rating corresponding to the rate at which a first level cause and its resultant failure mode will occur over the design life (scale 1-10) | None. |
| Detection | A rating corresponding to the likelihood that the detection methods or current controls will detect the potential failure mode (scale 1-10) | None. |
| Risk Priority Number (RPN) | Severity × Occurrence × Detection | None. Note that SAE JA1011 does not preclude the use of RPN. Neither does RPN detract from SAE JA1011, but merely adds another dimension to the analysis, if required. |
| Consequences | Unclear or varied. | None in FMEA but are addressed in the subsequent decision process of RCM. The consequences of failure are one of: 1. Hidden, 2. Safety, health, Environmental, 3. Operational, or 4. Non-Operational. |
37
SAE JA 1011 Issued Aug1999 Evaluation Criteria for Reliability-Centered Maintenance (RCM)
Processes
38
GM, Ford, and Chrysler Quality documents.
It is to be emphasized, with regard to these comparative terms of reference, that neither
definition set is “right” or “wrong” per se. Rather, the alternate meanings should be
recognized as being different. Discussions among operators, engineers, and maintainers
should clarify, at the outset, which set of definitions is to be used.
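The Risk Priority Number listed in Table 2-1 is a simple product of three ratings, each on a scale of 1 to 10. A short Python sketch follows; the function name and the sample ratings are assumptions, and the range check merely reflects the scales given in the table.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """RPN = Severity x Occurrence x Detection, each rated from 1 to 10."""
    for name, rating in (("severity", severity),
                         ("occurrence", occurrence),
                         ("detection", detection)):
        if not 1 <= rating <= 10:
            raise ValueError(f"{name} must be between 1 and 10, got {rating}")
    return severity * occurrence * detection


# Hypothetical ratings for a single failure mode.
rpn = risk_priority_number(severity=8, occurrence=3, detection=5)
print(rpn)  # 120
```

With these scales the RPN ranges from 1 to 1000.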
Conclusions
Higher demands on increasingly sophisticated systems and greater complexity of
equipment raise the risk of failure and its consequences. Maintainers must respond by
collecting better data, analyzing it thoroughly, and acting upon the results of their
analyses.
The great advantage of recording failure modes in the CMMS in the full context of the
function, functional failure, effects, and consequences is that there will be little ambiguity
or uncertainty about how to categorize the current failure. Subsequent reliability
analysis39 of these records will, thereafter, be founded upon precise historical data
concerning failure, its causes, effects, and consequences.
39
Refers to age exploration – analyzing the information gained from the execution of maintenance tasks.
That analysis is directed at OEE (defined in glossary on page 288) improvement and cost reduction without
compromising safety and the environment.
Page 30
The general approach to PM assessment and improvement is a double-barreled cannon –
(1) a program of scheduled RCM analysis reviews of significant items40 by a team of
domain experts, and (2) a systematic process for supplementing that knowledge with
accurate historical information. Both these activities populate the same knowledge base.
The former exercises a rigorous process for establishing consensus on an item’s
maintenance characteristics. The latter accumulates reliability data in the field, extending
the knowledge of, and validating the assumptions of the former. Although currently rare,
cross fertilization of the two processes is immensely valuable and will inevitably vitalize
both.
The RCM reliability knowledge base will, ultimately, contain a record for every failure
mode that may reasonably occur in the organization’s asset hierarchy. As the knowledge
base grows, managers, maintenance engineers, planners, and reliability specialists may
apply rich software enabled data analysis and modeling tools41 to optimize their PM and
CBM decisions.
One further argument in favor of systematic work order documentation procedures, such
as those discussed thus far, can be found in the diagnostic experiences related to a
turbofan engine. Why encourage the feedback of maintenance information from the
field to complement and enrich the RCM analysis? Figure 2-9 provides an answer,
which we may reasonably extend to many other installed systems. The Venn diagram
illustrates the gap found between the list of anticipated failure modes and those actually
experienced throughout a large fleet of engines.
40
Item: A group of one or more parts or assemblies that is convenient to treat as a single entity for
reliability analysis. Items are defined at a high enough level of indenture so that their failures may be
clearly related to failure of the equipment as a whole. (See Appendix 3. Sizing the analysis page 276.)
Significant item: An item whose failures:
· Are not evident under normal circumstances, or
· Can directly negatively impact safety or the environment, or
· Can have direct major economic or operational impact.
41
For example, the EXAKT software for developing optimized CBM decision models, and other tools such
as Pareto and Weibull analysis, and real time productivity and maintenance performance management
systems (Managing Strategy Chapter 18. page 254).
Page 31
Figure 2-9 FMEA Anticipated and actual failure modes experienced
Page 32
Chapter 3. Using maintenance information
More important, by directing both scheduled tasks and intensive age exploration at those
items which are truly significant at the equipment level, the ultimate result will be
equipment with a degree of inherent reliability that is consistent with the state of the art
and the capabilities of maintenance technology
– Nowlan and Heap, Reliability-centered maintenance
Introduction
Few will deny that maintenance departments are very good at amassing data. Fewer still
will argue, though, that we are equally adept at analyzing and interpreting the data that
ends up in our computerized maintenance related systems. We store vast amounts of data,
mainly because it is technologically possible to do so. It often costs relatively little or
nothing to add another field to a form, another sensor to an assembly, or to record another
output from a control system.
We tend to defer, indefinitely, any serious consideration of the data itself. It may be
useful in the future – therefore, we collect it now. In Chapter 1. (page 13), we examined
the structures and procedures required for “maintenance data integrity”. We also
proposed a framework for collecting and managing maintenance data whose format and
content will be useful to those responsible for asset reliability – managers and planners.
They are the individuals who plan maintenance and who, therefore, must consider the
complexity of factors represented in Figure 3-1. In this book we survey the state-of-the-
art of maintenance information analysis techniques.
Figure 3-1: Maintenance planning factors42
42
GE Power Systems Heavy-Duty Gas Turbine Operating and Maintenance Considerations Robert Hoeft
and Eric Gebhardt GE Energy Services Atlanta, GA
Page 33
The problem with failure rates
It is sometimes thought that experience derived from others in the form of failure rates
can be useful in reliability-centered43 decision-making. Electricité de France44 notes that
“Data available in the literature cannot be used … it is related to equipment which, from many
viewpoints (operating conditions, maintenance, environment, etc) is very different … Moreover, it
scarcely provides information about the samples used to derive the data and rarely mentions
parameters other than the operating failure rate. For all these reasons, Electricité de France does
not consider the information provided by these tables as very … reliable.”
Even when failure rates are gathered from equipment operating in similar contexts, their
value to reliability investigations is limited. Failure rate is merely the inverse of an item’s
MTTF (or average life).45 Average life alone does not allow us to determine the right
intervals for PM tasks. As an example, consider that many items (most bearings and other
complex components) fail randomly. Only 37% (see Figure 3-4 on page 38 and
43
The expression, “reliability-centered” refers to decisions taken with the objective of sustaining OEE and
reliability while keeping costs acceptably low.
44
Dorey, J. (1981). Consideration of the reliability of pumps, derived from the first year of experience of
the SRDF, the reliability data collection system of Electricité de France. Reliability Engineering 2, 179-
192.
45
For items that fail randomly
Page 34
Appendix 8. on page 293) of such items will survive to their average life. For other types
of failure behavior, for example, for items that wear out, the timing of a PM task would
depend on the item’s useful life – the age to which most of the items survive. The useful
life (Figure 3-2) of an item, however, is unrelated, in any simple way, to its MTTF.46
Figure 3-2: Useful life of an item, the age to which most items survive
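The weak link between MTTF and useful life can be illustrated numerically. The following is a minimal Python sketch, not a method taken from the text. It assumes a two-parameter Weibull life distribution (our assumption; no distribution is named here) and interprets "the age to which most items survive" as the age reached by 90% of items. All function names and numbers are hypothetical.

```python
import math

def weibull_mttf(shape, scale):
    """Mean life of a two-parameter Weibull distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)

def scale_for_mttf(mttf, shape):
    """Scale parameter that yields the requested MTTF for a given shape."""
    return mttf / math.gamma(1.0 + 1.0 / shape)

def age_survived_by(fraction, shape, scale):
    """Age t at which the survival probability exp(-(t/scale)**shape) equals `fraction`."""
    return scale * (-math.log(fraction)) ** (1.0 / shape)

MTTF = 1000.0  # hypothetical, in hours of working age
for shape in (1.0, 2.0, 4.0):  # 1.0 = random failure; larger values = wear-out
    scale = scale_for_mttf(MTTF, shape)
    useful_life = age_survived_by(0.90, shape, scale)
    print(f"shape={shape}: MTTF={weibull_mttf(shape, scale):.0f} h, "
          f"90% of items survive to {useful_life:.0f} h")
```

All three hypothetical items share the same MTTF, yet the age reached by 90% of them differs several-fold, which is why average life alone cannot set a PM interval.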
Where the consequences of failure are economic only, failure rate can help decide
whether PM is effective. The cost of a scheduled PM task over a long time period47
should be substantially less than the failure rate × time period × average cost of a failure
and its (economic) consequences. This is illustrated formally in Equation 3-1.
Cost of PM in period < failure rate × time period × average failure cost over period
Equation 3-1: Justifiable cost of PM
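Equation 3-1 can be checked with a few lines of Python. This is a sketch under stated assumptions: the function name and all numbers are hypothetical, and the test is a plain "less than", whereas the text asks for "substantially less", so a margin would normally be added.

```python
def pm_is_justified(pm_cost_in_period, failure_rate, period, avg_failure_cost):
    """Equation 3-1: compare the cost of PM over a period with the expected
    cost of failures over the same period (failure rate x period x cost)."""
    expected_failure_cost = failure_rate * period * avg_failure_cost
    return pm_cost_in_period < expected_failure_cost, expected_failure_cost

# Hypothetical numbers: 0.002 failures per hour of working age, a period of
# 5000 hours, 4000 (any currency) per failure, and 15000 spent on PM.
justified, expected = pm_is_justified(15000.0, 0.002, 5000.0, 4000.0)
print(justified, expected)
```

The period should be expressed in the same working-age units as the failure rate (hours, trips, tons hauled, and so on).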
Failure rate can also help decide on stocking levels and economic order points for spare
parts48. Finally, knowing the failure rate of a protective device will permit a
determination of an adequate failure finding inspection interval required to achieve a
specified availability (Equation 3-2 page 40).
How to use maintenance data?

If the failure rate alone is insufficient in helping us to schedule on-condition (CBM)
inspection, overhaul, or discard tasks, we therefore must ask two vital questions:
46
If one mistakenly bases scheduled renewals on MTBF, one will usually grossly underestimate the
number of failures expected to be prevented. (See Appendix 6. Various definitions of “Life” page 290.)
47
Note that we interpret “time period” in Equation 3-1 to mean “working age”. It is the usage
measurement or accumulated stress on the physical asset since installation or major overhaul. Use calendar
time only when the equipment functions regularly in time. More commonly we measure working age in
the specific engineering units of production. For example, for a skip hoist in an underground mine –
number of trips, for a haul truck in an open pit mine – tons of ore hauled, and so on.
48
“Spares” manual. Software developed by the CBM Laboratory at University of Toronto.
Page 35
1. What additional data do we require?, and
2. How can we use it?
The preceding chapters provided some answers to question 1. We drew from the thinking
of reliability-centered maintenance (RCM) in order to describe a data structure into which
maintenance personnel may compile their day-to-day observations. This chapter focuses
on the second question – how to analyze the data in maintenance databases so that we
may make full use of that information. Primarily, we wish to use it to optimize49 every-
day maintenance management decisions.
To start with, we describe some of the reliability information methodologies used in
commercial and military aviation. The following five characteristics distinguish
maintenance in those industries:
1. The overarching characteristic in the commercial aviation industry is

governmental oversight of maintenance information and intense scrutiny of all
reliability documentation procedures.
2. All PM tasks in commercial aviation derive from the MSG350 program. The
equally weighted pillars of that program are:
• Initial information gathering (known as, failure modes and effects analysis or
FMEA)
• Initial PM (including failure finding, CBM, redesign) decision making
• Continuous information analysis (age exploration51) for improvement52
The improvement in failure management and reliability accomplished over the

period that MSG3 evolved is portrayed in the graph of Figure 3-3. (Roughly, 40%
of accidents are maintenance related.)
49
Optimal decisions should support a stated objective. For example, minimum cost, maximum availability,
a specified reliability, or some set of performance measures tailored to the asset in its current operating
context.
50
Maintenance Steering Group 3, the defining document for Reliability-centered maintenance in
commercial aviation upon which the technical and regulatory infrastructure is based.
51
Age Exploration: Any analysis procedure that examines historical maintenance data in order to alter the
maintenance plan for improved physical asset reliability.
52
Including engineering changes and their assessment.
Page 36
Figure 3-3: Improvement in aviation safety over 3 decades
3. The pressure to control costs in the face of intense competition characterizes

maintenance management in commercial aviation.
4. Economic, safety, and regulatory factors have forced an intimate relationship

between the maintaining (operating) organization and the equipment
manufacturers. Tangible outcomes from such collaborations have been quantum
improvements in equipment safety and maintainability. Of greatest impact are
design features permitting easy detection of potential failures53, built-in test
equipment, instrumentation rendering evident otherwise hidden functions, and
backup systems for all vital functions.
5. With regard to information management, perhaps the single most distinctive

activity in aviation maintenance is the careful recording and thorough analysis of
as-found information that results from a maintenance task. Knowledge
management techniques such as case-based reasoning (see Chapter 12. page 165)
have, not surprisingly, germinated in that industry.
It is to be noted that these five characteristics of maintenance in commercial aviation are

not dissimilar to those influencing most large industries today. (Witness: 1. increased
government oversight through agencies such as OSHA54 and EPA55, 2. the growing need
to improve reliability in the face of global competition, 3. tightening cost-price pressures,
4. greater collaboration between the OEM56 and the user as a result of maintenance
outsourcing and remote diagnostics and servicing, and 5. the growing understanding of
53
Potential failure: an indicated failure that has clearly initiated and is in the process of deteriorating to
become a functional failure that may have safety, environmental, operational, or major economic
consequences.
54
Occupational Safety and Health Administration, http://www.osha.gov/
55
Environmental Protection Agency, http://www.epa.gov/
56
Original Equipment Manufacturer
Page 37
the need to collect and analyze “good” data.) Hence, we can reasonably predict that
analogous problem resolution strategies and age exploration procedures will spread to the
broader marketplace.
Age Exploration Procedures

The purpose of age exploration is to acquire deep knowledge of equipment failure
characteristics. We may accomplish this objective, primarily, by analyzing the data and
observations associated with maintenance activities. Data will come from both the
results of proactive scheduled maintenance tasks (for example, inspections or overhauls),
and from maintenance tasks that have been provoked by a failure. Valid data will include
the results of CBM inspections, the records of functional and potential failures, and the
records of preventive renewals. The quality of stored data will determine the validity
(confidence level) of the conclusions drawn57.
Random Failure
Consider the common asset behavior known as “random failure”.58 We stated earlier
(page 34) that only 37% of randomly failing items survive until their mean time to failure
(MTTF). Figure 3-4 illustrates this behavior.
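[Plot for Figure 3-4. Vertical axis: probability of survival without failure (0 to 1).
Horizontal axis: age in multiples of the MTBF. Plotted values: 0.78, 0.61, 0.47, 0.37,
0.29, and 0.22 at 0.25, 0.50, 0.75, 1, 1.25, and 1.50 × MTBF.]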
Figure 3-4: Survival probability (also known as the Reliability) for an item whose failure behavior is
random59
The calculation of survival probability (reliability) at each quarter multiple of the MTTF
shown on the graph is provided in Appendix 8. (page 293). An item whose short term
57
The American Petroleum Institute API 580, process for Risk Based Inspection draws the relationship
between quality of data and risk.
58
Despite the expression “random” we may estimate the risk of failure in a small interval at any age (given
that the item has survived to that age) to be a constant value equal to 1/MTBF.
59
Nowlan and Heap, Reliability-Centered Maintenance
Page 38
risk60 of failure remains constant throughout its life is said to fail randomly. It does not
age, as would an item whose short term risk of failure increases as it gets older. Although
its conditional probability of failure curve (see Appendix 7. on page 290) is flat, a
randomly failing item’s probability of survival curve (Figure 3-4) decreases exponentially
with age. That is to say, it drops by a constant percentage (of its current value) in each
subsequent interval. The survival graph of Figure 3-4 illustrates this phenomenon. The
time axis is divided into equal lengths and the survival probability after each interval is
indicated on the curve. The percent decrease in survival probability from time 0 to time
0.25 MTBF time units is (100 – 78)/100 or 22%. Similarly the percentage drop from time
0.25 to 0.50 MTBF time units is (78 – 61)/78, which is also 22%, and so on. We take
advantage of this “exponential” behavior, in the following section, to help determine
inspection (failure finding) intervals of an important class of equipment whose failure
behavior is characteristically random61 – safety devices.
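The exponential behavior described above can be reproduced in a few lines. The sketch below assumes the standard exponential survival function R(t) = exp(−t/MTTF) for a randomly failing item; the function name is ours.

```python
import math

def survival_probability(age, mttf):
    """Probability that a randomly failing item survives to `age`."""
    return math.exp(-age / mttf)

mttf = 1.0  # work in multiples of the MTTF
ages = [0.25 * k for k in range(0, 7)]
survival = [survival_probability(a, mttf) for a in ages]
for a, r in zip(ages, survival):
    print(f"{a:.2f} x MTTF: survival {r:.2f}")

# Fractional drop in survival over each quarter-MTTF interval.
drops = [(survival[i] - survival[i + 1]) / survival[i]
         for i in range(len(survival) - 1)]
print([round(d, 2) for d in drops])
```

Rounded to two decimals, the survival values match those marked on Figure 3-4, and every interval shows the same fractional drop of about 22%.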
Failure Finding Intervals

An inspection to detect a hidden failure62 may be considered a preventive maintenance
(PM) task since its purpose is to “prevent” a multiple failure63. In setting up such a
preventive maintenance program, we must assess the effectiveness of any inspection
interval strategy that we decide upon. Because most safety/protective devices fail
randomly, the following explanation, with respect to Figure 3-4, will help clarify the
problem.
The divisions along the time axis of the survival graph of Figure 3-4 have been marked at
an arbitrary ¼ the item’s MTTF. Let us suppose that we carry out our failure finding
inspections at those same intervals. The following statements follow from this particular
(exponential survival) age-reliability relationship:
1. Whatever the age of the item, its survival probability at each inspection is 78% of its
value at the previous inspection (as explained in Appendix 8. (page 293)).
2. Its availability in a very small interval immediately after an inspection would be
100%.
3. Its availability in a small interval immediately before the next inspection will be
78%.
4. The average availability of the item will be about 89% (the average of 78 and 100).
Therefore an inspection policy of 1/4 the average life will provide an average
availability of about 89%.
60
By short term risk, we mean the conditional probability of failure in a small interval. It is the difference
between the probability of reaching the interval and the probability of surviving it, divided by the
probability of reaching the interval.
61
The assumption of exponentiality for an item that does not wear out (such as most safety devices and
complex items) is, in fact, a conservative one. – N & H.
62
Hidden Failure: A failure of a protective function, e.g. a safety limit switch, that would normally go
undiscovered until the function that it was protecting, e.g. a high level limit switch, also fails.
63
Multiple Failure: A failure of a protected function at a time when its protective function is already in a
failed state.
Page 39
5. If this degree of availability is inadequate (that is if we need to achieve a higher
average probability that the device will be operational), we must reduce the
interval – that is, increase the frequency of our inspections. The failure finding
interval (I) as a function of the desired availability and mean time to failure (M)
of the protective (safety) device may be calculated from the formula of Equation
3-2.
I = 2 × (1 − desired availability) × M
Equation 3-2 Failure finding interval for desired device availability. Equation is valid for
safety devices whose availabilities are greater than 95%64.
To give us a feel for the numbers generated from Equation 3-2, Table 3-1 shows
those failure finding (inspection) intervals needed to ensure the specified
availabilities for a safety device whose mean-time-between-failure is 3 years.
Table 3-1
Required safety device    Inspection interval as a    Example: MTTF = 3 years
availability              % of MTBF (I/M x 100)       Inspection interval to achieve
                                                      required availability
99.999%                   0.002%                      ½ hour
99.99%                    0.02%                       5 hours
99.97%                    0.06%                       15 hours
99%                       2%                          22 days
98.5%                     3%                          33 days
98%                       4%                          44 days
96%                       8%                          88 days
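The intervals of Table 3-1 follow directly from Equation 3-2. The Python sketch below recomputes them for an MTTF of 3 years (taken as 8760 hours per year, our assumption). The helper `average_availability` is our addition: it averages the exponential survival curve over one inspection interval, assuming the device is restored to working order at each inspection, and serves only as a cross-check on the linear approximation.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: continuous operation

def failure_finding_interval(desired_availability, mttf):
    """Equation 3-2: I = 2 x (1 - desired availability) x M."""
    return 2.0 * (1.0 - desired_availability) * mttf

def average_availability(interval, mttf):
    """Average of exp(-t/M) over one inspection interval of a randomly
    failing device that is working at the start of the interval."""
    x = interval / mttf
    return (1.0 - math.exp(-x)) / x

mttf_hours = 3.0 * HOURS_PER_YEAR
for availability in (0.99999, 0.9999, 0.9997, 0.99, 0.985, 0.98, 0.96):
    interval = failure_finding_interval(availability, mttf_hours)
    achieved = average_availability(interval, mttf_hours)
    print(f"{availability:.3%}: {interval:8.1f} h = {interval / 24.0:5.1f} days, "
          f"achieved {achieved:.4%}")
```

Applied to the quarter-MTTF example, `average_availability(0.25, 1.0)` gives roughly 88.5%, close to the 89% obtained by averaging the two end points.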
Usually, the manufacturer of a safety device declares its MTTF. The results of failure
finding inspections, however, should be recorded by the user in the CMMS (as described
in Chapters 1 and 2), so that a reliability software product65 may ascertain true average
life and failure behavior of the device (or group of similar devices) under actual working
conditions. Equation 3-2 is valid only in the range of high availability (>95%). Knowing
the reliability of the device, the problem of determining the appropriate failure finding
interval is thus reduced (by Equation 3-2) to the problem of knowing what availability
the asset managers, the owners, the users, and the environmental and safety authorities
will accept for the device in question.
In fact, it is of greater interest to specify the maximum mean time between multiple
failure66 that interested parties are prepared to accept. We use Equation 3-3 to calculate
the appropriate failure finding interval, Iff, knowing the mean-time-to-failure of the safety
device (Msd), that of the protected function (Mpf), and the maximum tolerable risk of a
multiple failure, i.e. the mean-time-to-multiple-failure (Mmf).
64
Covers most electro-mechanical safety devices
65
EXAKT, Relcode, SuperSMITH, and others.
66
Multiple Failure: A failure of a protected function at a time when its protective function is already in a
failed state.
Page 40
$$ I_{ff} = \frac{2 \times M_{sd} \times M_{pf}}{M_{mf}} $$
Equation 3-3: Failure finding interval for risk of a multiple failure
Equation 3-3 describes the simplest configuration of a single device protecting a single
function. Appendix 4. (page 278) provides several extensions of this formula that cover a
variety of common situations involving multiple devices in parallel or in series, multiple
modes of failure, and other configurations.
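A short Python sketch of Equation 3-3 for the single-device case follows. The numbers are hypothetical. The cross-check rests on an assumption we state explicitly: a multiple failure occurs when the protected function fails (at a rate of 1/Mpf) while the device is unavailable, and the device's average unavailability is approximately Iff / (2 × Msd), as in Equation 3-2.

```python
def ffi_for_multiple_failure_risk(m_sd, m_pf, m_mf):
    """Equation 3-3: I_ff = 2 x M_sd x M_pf / M_mf (all in the same time unit)."""
    return 2.0 * m_sd * m_pf / m_mf

# Hypothetical values, in years
m_sd = 10.0    # mean time to failure of the safety device
m_pf = 5.0     # mean time to failure of the protected function
m_mf = 1000.0  # tolerable mean time between multiple failures

interval_years = ffi_for_multiple_failure_risk(m_sd, m_pf, m_mf)
# Average device unavailability implied by that interval (see Equation 3-2)
implied_unavailability = interval_years / (2.0 * m_sd)
print(f"inspect every {interval_years * 52:.1f} weeks; "
      f"device unavailability {implied_unavailability:.2%}")
# Cross-check: M_pf divided by the unavailability recovers M_mf
print(m_pf / implied_unavailability)
```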
Failure finding interval where only cost is at issue

Where safety and environmental issues are not at stake, then the selection of a failure
finding inspection interval for “safety” devices is strictly a matter of cost. The cost
factors are 1) the average cost of an inspection, and 2) the expected average cost of a
multiple failure. Thus we seek the optimal failure finding interval that results in the
lowest total cost (number of instances of an inspection × cost of an inspection + number
of multiple failures × the cost of a multiple failure). The optimal failure finding interval
for the simple case of a single safety device is given in Equation 3-4.
$$ I_{off} = \sqrt{\frac{2 \times M_{sd} \times M_{pf} \times C_{ff}}{C_{mf}}} $$
Equation 3-4
where:
Ioff = optimal failure finding interval
Cff = average cost of an inspection
Cmf = average cost of a multiple failure
Appendix 4. (page 278) provides a formula for the optimal failure finding interval for
multiple redundant safety devices.
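The cost trade-off can be explored numerically. The sketch below is ours, under stated assumptions: the long-run cost per unit time is taken as Cff / I for inspections plus Cmf × I / (2 × Msd × Mpf) for multiple failures, and the closed-form optimum is the square-root expression obtained by minimizing that sum. All values are hypothetical.

```python
import math

def cost_rate(interval, m_sd, m_pf, c_ff, c_mf):
    """Long-run cost per unit time: inspections plus multiple failures.
    Assumes device unavailability is approximately interval / (2 x M_sd)."""
    inspections_per_unit_time = 1.0 / interval
    multiple_failures_per_unit_time = interval / (2.0 * m_sd * m_pf)
    return (c_ff * inspections_per_unit_time
            + c_mf * multiple_failures_per_unit_time)

def optimal_ffi(m_sd, m_pf, c_ff, c_mf):
    """Interval that minimizes cost_rate (closed form)."""
    return math.sqrt(2.0 * m_sd * m_pf * c_ff / c_mf)

# Hypothetical values: times in years, costs in one currency
m_sd, m_pf, c_ff, c_mf = 8.0, 4.0, 200.0, 50000.0
best = optimal_ffi(m_sd, m_pf, c_ff, c_mf)

# Brute-force check over candidate intervals from 0.001 to 5 years
grid = [k / 1000.0 for k in range(1, 5001)]
numeric_best = min(grid, key=lambda i: cost_rate(i, m_sd, m_pf, c_ff, c_mf))
print(f"closed form {best:.3f} years, grid search {numeric_best:.3f} years")
```

The grid search and the closed form agree to within the grid spacing, which supports reading the optimum as a square root: the product Msd × Mpf has units of time squared.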
Measuring Reliability Improvement

One of the objectives of the reliability information system proposed in Chapter 2. (page
19), is the ability to gauge how well the maintenance plan is proceeding – that is, how
successfully the continuous reliability improvement and cost reduction objectives are
being attained. Many organizations manage fleets of similar equipment units.
Commercial aviation experience is particularly applicable to those other types of fleet
operators. The conditional probability of failure curves of Figure 3-5 show how age-
Page 41
exploration methods resulted in engineering and maintenance improvements that
gradually overcame the dominant failure modes on the JT8D engine67.
[Plot for Figure 3-5, titled "The effects of gradual improvement". Vertical axis:
conditional probability of failure for 100-hour intervals (0 to 0.4). Horizontal axis:
operating age since last shop visit (flight hours, 0 to 8000). One curve for each period:
June – August 1964, August – Oct. 1964, Oct. – December 1964, January – Feb. 1966,
May – July 1967, and October – December 1971.]
Figure 3-5: Assessing reliability improvement
The conditional probability of failure is the most common and useful way of describing
an item’s age-reliability relationship. It is the probability that the item fails in an
upcoming small interval of time, given that it has survived to the start of that interval.
Mathematically, it is expressed by Equation 3-5.68
$$ \text{Conditional prob. of failure in interval} = \frac{\text{probability of entering interval} - \text{probability of surviving interval}}{\text{probability of entering interval}} $$
Equation 3-5: Conditional Probability of Failure
Note how, in Figure 3-5, the conditional probability of failure curve continued to flatten
until it eventually showed no relationship of engine failure risk to operating age. During
the seven year period from 1964 to 1971 dominant failure modes were detected and
removed by redesign.
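Equation 3-5 is straightforward to apply to a list of survival probabilities taken at the boundaries of successive intervals. In the Python sketch below, the first series uses the survival values of Figure 3-4; the second series is hypothetical and stands for an item that wears out. The function name is ours.

```python
def conditional_failure_probabilities(survival):
    """Equation 3-5 applied to successive intervals.
    survival[i] is the probability of surviving to the start of interval i,
    survival[i + 1] the probability of surviving to its end."""
    return [
        (entering - leaving) / entering
        for entering, leaving in zip(survival, survival[1:])
    ]

# Survival values from Figure 3-4 (random failure)
random_failure = [1.00, 0.78, 0.61, 0.47, 0.37, 0.29, 0.22]
# Hypothetical survival values for an item that wears out
wear_out = [1.00, 0.99, 0.95, 0.85, 0.65, 0.35, 0.10]

print([round(p, 2) for p in conditional_failure_probabilities(random_failure)])
print([round(p, 2) for p in conditional_failure_probabilities(wear_out)])
```

The first series stays near 0.22 in every interval (a flat curve, apart from rounding in the plotted values), while the second rises steadily with age.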
67
Report AD-A066-578, “Reliability-Centered Maintenance”, F. Stanley Nowlan, Howard F. Heap,
National Technical Information Service, U.S. Department of Commerce, 1978 (Figures 2, 3, 4, 5, and 7 to
18 have been reproduced from this reference document.)
68
The definition of conditional probability of failure is more thoroughly elaborated in Appendix 7. on page
290
Page 42
The example of Figure 3-5 portrays a typical improvement pattern that applies to new
equipment (or equipment where effective PM had not been applied previously). The key
questions are:
1. What is the optimum reliability state?

2. How quickly can we achieve the optimum reliability state?
3. What actions do we take to accelerate the process? and
4. How do we measure our progress to that end?
Age exploration as a generic and integral part of an information strategy will help to
achieve our reliability improvement goals in minimal time and at lowest cost.
Furthermore, one may predict the expected rate of reliability improvement by
assuming that it will occur exponentially (that is, at a constant percentage of
improvement over the reliability of each preceding period). Figure 3-6 illustrates the
prediction and the reality.
[Plot for Figure 3-6, titled "Decreasing failure rate". Vertical axis: failure rate
(failures per 1000 hours), 0.1 to 2.0. Horizontal axis: calendar year, 1963 to 1972.
Curves: experience and forecast, with the date of the forecast marked.]
Figure 3-6: Exponential reliability improvement
Improvements in reliability as a result of applying a maintenance information strategy
based on the principles of age exploration may be expressed as a decrease in failure rate.
The graph of Figure 3-6 shows the actual failure rates of the JT8D engine compared to
the forecast improvement in reliability. The forecast is characteristically exponential
when age exploration is used. The temporary deviation from the forecasted level between
1969 and 1971 was the result of the appearance of a new dominant failure mode that took
several years to resolve by redesign. Regardless of whether the improvement follows an
exponential or some other pattern, the point is that good information recording
procedures will ascertain the validity of a given improvement initiative.
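An exponential forecast of this kind, and its comparison with recorded experience, can be sketched as follows. The numbers are hypothetical and are not read from Figure 3-6. The function names and the 10% tolerance are our own choices.

```python
def forecast_failure_rates(initial_rate, fractional_improvement, periods):
    """Failure rate in each period when every period improves on the
    previous one by the same fraction (exponential improvement)."""
    return [initial_rate * (1.0 - fractional_improvement) ** k
            for k in range(periods)]

def improvement_confirmed(actual, forecast, tolerance=0.10):
    """For each period, True when the recorded failure rate is no worse
    than the forecast plus the given fractional tolerance."""
    return [a <= f * (1.0 + tolerance) for a, f in zip(actual, forecast)]

# Hypothetical figures, in failures per 1000 hours
forecast = forecast_failure_rates(initial_rate=1.0,
                                  fractional_improvement=0.20, periods=6)
recorded = [1.02, 0.79, 0.66, 0.60, 0.44, 0.33]
print([round(f, 3) for f in forecast])
print(improvement_confirmed(recorded, forecast))
```

A period flagged False, like the fourth one here, is a prompt to look for a newly dominant failure mode of the kind described above.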
Page 43
Refining the maintenance program
Knowledge of how and when to improve a maintenance program comes principally from
two information sources:
a. observations (of potential failures) by the maintainer during the course of

carrying out scheduled (preventive) tasks, and
b. the experience of functional failure.
Once the maintenance program goes into effect, age exploration of the results of the
scheduled tasks provides the basis for adjusting the initial conservative task intervals set
up by the RCM analysis team. And as further data becomes available the default69
decisions, made in the absence of information, are gradually eliminated from the
program. The process is portrayed symbolically in Figure 3-7.
Figure 3-7: Using information to improve the maintenance program
Assessing the effectiveness of a CBM Program

In Chapter 8. (page 103) we will discuss the requirements that a CBM task be both
applicable and effective. The applicability and effectiveness requirements of a CBM task
performed on a physical asset are:
69
“Default” here does not refer to RCM question 7 (page 16), but rather, to the conservative (default)
answers to the questions of the RCM algorithm in the absence of experience. The default answers are
provided in Appendix 13. on page 302.
Page 44
1. Applicability: An indicator of an initiating failure process (reduced failure
resistance) can be detected and measured, and there is sufficient warning time in
which to proact, and
2. Effectiveness: The task will entirely avoid or reduce, to a tolerable degree, the
failure consequences, at an acceptable cost
Figure 3-8 demonstrates how we may assess pre-condition 2, “Is the CBM task
effective?”. The graph plots the age-reliability relationship for the two types of failure:
a. functional failure, and

b. potential failure.
Recalling the data structure proposed in Figure 2-5 on page 24, we note that the CMMS
must accommodate the distinction between potential and functional failure as recorded
on a work order. Reliability analysis software70 may process that data and generate the
conditional probability of failure graph of Figure 3-8 and thus assist in the evaluation of
the merits of the CBM program. The upper curve shows the conditional probability curve
for all removals including both functional failures and potential failures. The lower curve
(line) shows the conditional probability of functional failures as reported by operating
personnel and recorded on the work order.
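[Plot for Figure 3-8. Vertical axis: conditional probability of failure for 200-hour
intervals (0 to 0.4). Horizontal axis: operating age (0 to 4000). Upper curve: total
removals. Lower curve: functional failures. The band between them: potential failures.]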
Figure 3-8 Conditional probability of functional and potential failure
The distance between these two curves is the conditional probability of detecting a
potential failure as a result of on-condition inspections. This difference between the
Total removals and Functional failures conditional probability curves represents the
effectiveness of the existing CBM program. Functional failures may have safety,
70
See Software analytic tools page 47
Page 45
operational or economic consequences. Potential failures, by definition, do not have
safety or (significant) operational or economic consequences.
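The curves of Figure 3-8 can be approximated from work order records that distinguish functional failures from potential failures. The Python sketch below uses a simple life-table calculation. The records are hypothetical and the function is ours. Units suspended within an interval are counted as exposed for the whole interval, which is a simplification that dedicated reliability software would refine.

```python
def conditional_probabilities(records, interval_width, n_intervals):
    """Per-interval conditional probability of functional failure, potential
    failure, and total removal. `records` holds (age_at_event, event) pairs,
    where event is 'functional', 'potential' or 'suspended'."""
    rows = []
    for k in range(n_intervals):
        start, end = k * interval_width, (k + 1) * interval_width
        # Units that reached the start of this interval
        entering = sum(1 for age, _ in records if age >= start)
        if entering == 0:
            break
        in_interval = [event for age, event in records if start <= age < end]
        functional = in_interval.count("functional") / entering
        potential = in_interval.count("potential") / entering
        rows.append((start, functional, potential, functional + potential))
    return rows

# Hypothetical work order history: working age in hours and type of event
records = [
    (150, "potential"), (320, "functional"), (450, "potential"),
    (610, "potential"), (640, "suspended"), (780, "functional"),
    (790, "potential"), (950, "potential"), (980, "suspended"),
    (990, "potential"),
]
rows = conditional_probabilities(records, interval_width=200, n_intervals=5)
for start, functional, potential, total in rows:
    print(f"{start:4d} h: functional {functional:.2f}, "
          f"potential {potential:.2f}, total removals {total:.2f}")
```

The potential column is the gap between the total removals and functional failures curves discussed above.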
An analysis of Figure 3-8 determines that no scheduled overhaul of this unit will offer
additional value because the conditional probability of functional failure is independent
of the equipment's working age71 (as a result of the on-condition maintenance tasks that
have been performed). Scheduled overhaul, where effective on-condition maintenance is
in place, will, therefore, be ineffective. In fact, we would not want to reduce the
incidence of potential failures except by redesign, since the inspections that detect them
are clearly effective in reducing the number of functional failures.
Improving the program through failure mode assessment

The knowledge management suggestions of Chapters 1 and 2 propose procedures and
structures for distinguishing potential failures from functional failures. The entity
relationship diagram of Figure 2-2 also provided for the recording of distinct failure
modes (as documented in the RCM knowledge base). A reliability analysis such as that of
Figure 3-9 will deliver deep insight and benefits to the physical asset manager.
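[Plot for Figure 3-9. Vertical axis: conditional probability of failure. Horizontal axis:
operating age. Upper curve: total removals, divided into unverified failures and verified
failures. The verified failures are divided into failure modes A, B, and C, with infant
mortality marked on failure mode A.]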
Figure 3-9 Multiple failure modes in an item

The graph of Figure 3-9 shows the various age-reliability relationships that can be
extracted from the work order history for an item subject to several failure modes. The
upper curve shows the combined conditional probability for all reported failures. The
71
Low and constant conditional probability of failure curves are characteristic of a well maintained item.
Page 46
distance between the upper curve and the next lower one represents the probability of
unverified failures (those not attributed to a failure mode) from unknown causes.
To determine how we might improve the reliability of this item we must examine the
contributions of each failure mode to the total verified failures. For example, failure
modes A and B show no increase with increasing age; hence any attempt to reduce the
adverse age relationship must be directed at failure mode C. There is also a relatively
high conditional probability of failure immediately after a shop visit as a result of notable
infant mortality from failure mode A. The higher incidence of early failures from this
failure mode could be due to a problem in shop procedures. If so, the difficulty might be
overcome by changing shop specifications either to improve quality control or to break in
a repaired unit before it is returned to service.
Software analytic tools

We have seen in this chapter that graphical representations of the data in a CMMS
provide insights into failure behavior and the effectiveness of proactive maintenance
programs. We can generate and study such graphs provided we have implemented
information gathering procedures as were described in Chapter 1. and Chapter 2. Many
software tools can tap that knowledge. The following exercises use the Relcode software
package to illustrate the procedure for generating age-reliability relationship graphs.
Example 1
Heavy duty bearings in a steel forging plant have failed after the following number of
weeks of operation.
Age at Failure
(Weeks)
8
12
14
16
24
24 unfailed
This data may be entered into the Relcode data entry screen as shown in Figure 3-10.
Figure 3-10: Relcode data entry for steel forging plant bearings
Note that record five holds the remaining unfailed (suspended) bearings. The analysis is
performed by the software and the graph of the hazard function, which differs from the
Page 47
conditional probability of failure graph only by a constant (see Appendix 7.), can be
displayed as in Figure 3-11.
Figure 3-11: Hazard rate graph indicates that there is a period of about 5 weeks where the
conditional failure probability is negligible, followed by a period where conditional failure
probability increases with working age.
Exercise 2
Records from two heavy duty dumper trucks show that fan belt failures occurred at the
following odometer readings (kilometers, from new).
Truck 1 Truck 2
51220 45380
68060 103510
At present the odometer readings are:
Truck 1 Truck 2
105680 132720
We populate the six Relcode records with the age values: 51220, 68060-51220, 45380,
103510-45380, 105680-68060, and 132720-103510 as illustrated in Figure 3-12.
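The six age values can be derived mechanically from the odometer readings. The Python sketch below is ours and is independent of Relcode. It assumes that each truck started with a new fan belt at zero kilometers, that a new belt was fitted at each failure, and that the belt currently in service counts as a suspension (unfailed).

```python
def belt_ages(failure_odometer_readings, current_odometer):
    """Convert one truck's odometer history into fan belt ages.
    Returns (age_km, failed) pairs; the belt still in service is a
    suspension (failed=False)."""
    ages = []
    previous = 0
    for reading in failure_odometer_readings:
        ages.append((reading - previous, True))
        previous = reading
    ages.append((current_odometer - previous, False))
    return ages

truck_1 = belt_ages([51220, 68060], 105680)
truck_2 = belt_ages([45380, 103510], 132720)
for age, failed in truck_1 + truck_2:
    print(age, "failed" if failed else "suspended")
```

This yields four failure ages and two suspension ages, one suspension per truck.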
Page 48
Figure 3-12: Relcode data entry for heavy duty dumper truck fan belts
Figure 3-13
Additional examples in the use of Relcode are given in Appendix 14. on page 303.
CBM (on-condition maintenance) benefits analysis

Comparing the cost effectiveness of an existing maintenance policy and that of a
proposed new one helps to evaluate the benefits of proceeding with the proposed policy.
By a policy in CBM we mean how we define a potential failure72. Setting the threshold
(for declaring a potential failure) too low (too conservatively) causes a greater number of
premature replacements, driving our long-run PM costs unnecessarily high. If the MTTR
(Mean-Time-To-Repair) is significant, it will also reduce our long-run equipment
availability.
72
For many failure modes the measurement level at which a potential failure is declared is based on
judgment and experience. The EXAKT methodology recognizes the probabilistic nature of a potential
failure and therefore defines a “best” decision (way of setting an action limit) based on a stated long-run
optimizing objective.
On the other hand, if we set our alert level too high (too liberally) we will experience a
larger number of failures than necessary and incur unnecessary costs (and possibly
health, safety, and operational consequences) and excessive downtime. Our goal is to set
our potential failure declaration (data interpretation) policy at the optimal position (best
compromise) between the two poles. The EXAKT methodology is a form of age
exploration. It models the ages of previous potential and functional failures and
preventive renewals together with the condition data73 leading up to those events. It
blends in the failure’s economic consequences, and generates an optimal policy for
declaring potential failures. The effectiveness of a proposed new policy may be compared
with that of current practices by using the software’s “cost comparison” function. The
details on how to evaluate a proposed policy are provided in the Appendix (page 295).
CBM effectiveness is related, ultimately, to how “good” the condition data is. That is,
to what degree it holds information that, in some way, reflects the degradation process in
the item (and/or to what extent it measures the accumulated external stress imposed on
the item). CBM effectiveness is also, quite obviously, highly related to the ratio of the
average cost of a preventive action to the average economic consequences of failure.
Lastly, CBM effectiveness depends, as well, on the quality of data collection, processing,
and analysis.
The EXAKT manual explains CBM effectiveness in the following terms:
When some policy (of PM74) is applied to the data, the cost is defined as the average
realized cost. The “average realized cost” is the realized cost for all failed and
preventively replaced histories divided by the total realized time for failed and
preventively replaced histories. The formula is:
Cost = (#(failures)*(C+K) + #(prev. repl.)*C) /
       (totalworkingage(failures) + totalworkingage(prev. repl.)).
Where C is the average cost of a proactive task and K is the average additional cost of
the economic consequences of failure (secondary damage, fines, lost sales, and so on.)75
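The average realized cost formula quoted above translates directly into a few lines of code. The sketch below is illustrative only: the function and parameter names are assumptions, and the numbers in the usage example are invented, not taken from any EXAKT data set.

```python
def average_realized_cost(n_failures, n_preventive,
                          age_failures, age_preventive, C, K):
    """Average realized cost per unit of working age.

    n_failures:     number of histories that ended in failure
    n_preventive:   number of histories that ended in preventive replacement
    age_failures:   total working age accumulated by the failed histories
    age_preventive: total working age accumulated by the preventively
                    replaced histories
    C: average cost of a proactive task
    K: average additional cost of the consequences of failure
    """
    total_cost = n_failures * (C + K) + n_preventive * C
    total_age = age_failures + age_preventive
    if total_age <= 0:
        raise ValueError("total working age must be positive")
    return total_cost / total_age


# Hypothetical figures: 3 failures, 7 preventive replacements.
cost = average_realized_cost(
    n_failures=3, n_preventive=7,
    age_failures=24000, age_preventive=46000,
    C=1000, K=4000,
)
print(round(cost, 4))  # cost per unit of working age
```

Evaluating the same function on the histories that would result from two different policies gives the two figures that a cost comparison sets side by side.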
CBM Effectiveness Comparison

In order to assess CBM cost effectiveness, we can consider the average historical
maintenance and failure costs, per unit of working age, of applying the current policy.
We may calculate the projected (total failure and maintenance) costs per unit of working
age under a proposed “optimal” policy. And finally, we may compare both to the
calculated cost (of maintenance and failure per unit of working age) with no proactive
maintenance policy whatsoever. Figure 3-14 provides an example of an evaluation of a
policy proposed by an EXAKT analysis of the data.
Figure 3-14 Results of evaluating a proposed policy.
73
Observations, operating data, machinery signals, etc. from which a potential failure may be deduced.
74
“PM” in the general sense of proactive maintenance referring here to a policy of scheduled inspections
(on-condition maintenance), scheduled rework, or scheduled discard.
75
The EXAKT methodology is thoroughly examined in Chapter 10. Optimizing CBM on page 113.
In Figure 3-14 the projected average cost per unit of working age is 84% of that of the
current policy. The percentage of preventive (versus reactive) maintenance incidents will
be 98.79%, which is 230.5% greater than that of the existing policy. However, the
mean-time-between-replacements (3326) will be less than the current value (8775.29). This
means that we will intervene more frequently in order to realize a net cost saving of
16%.
Engineering Change Assessment

From the foregoing, it behooves physical asset managers to scrutinize and improve PM
continually. All maintenance data collected should, primarily, serve this purpose. The
data structure proposed in Chapter 2. enables such an examination. The assessment
should target each significant PM task in order to evaluate 1 - its applicability, and 2 - its
effectiveness. We may assess the effectiveness of current and proposed PM activities -
CBM, time based renewals (overhaul), and scheduled failure-finding inspection tasks.
Database and reliability software applications should be used to facilitate this essential
work.
Engineering changes to a physical component or a process, undertaken to improve
reliability, should likewise undergo assessment. Figure 3-15 illustrates an example of an
analysis of an engineering change.
Figure 3-15: Number of premature removals (vertical axis, 0 to 10) by calendar quarter,
1971 to 1973 (horizontal axis). Annotated events: start of borescope inspection at
125-cycle intervals; inspection interval reduced to 30 cycles; modification started;
modification completed; inspection requirement removed.
Figure 3-15 depicts the history of the C-sump problem in the General Electric CF6-6
engine on the Douglas DC-10. The on-condition task instituted to control this problem
had to be reduced to 30-cycle intervals in order to prevent all functional failures. The
precise cause of this failure was never pinpointed; however, both the inspection task and
the redesigned part covered all possibilities. Once modification of all in-service engines
was complete no further potential failures were found, and the inspection requirement
was eventually eliminated.
The work order documentation practices proposed in Chapters 1 and 2 facilitate an
evaluation of engineering modifications, as to their effectiveness with respect to their
stated objective.
Keeping Track of Components
Introduction
Reliability analysis relies on the availability of systematic records of events undergone by
significant components. The procedures described in this section will enable
computerized maintenance management systems to fill an important role in reliability
analysis.
Recording Events for Reliability Analysis

All of the analysis methods that we have described depend on the integrity of the data
collected in the CMMS database as well as that in the CBM and other operational
databases. Data from all sources must support the computation of a significant item’s
working age. To perform reliability analysis, we must know the working age at each
important event in an item’s lifecycle. Such events include, at the very least, a beginning
(B) event (at installation, overhaul, or replacement) and an ending (E) event. The two
principal ending events are:
1. Ending with failure (EF), and

2. Ending by suspension (ES).
An ES event is a removal of the item from operation for any reason other than failure76.
Additionally, there are two categories of failure (EF) events:
A. Functional failure
B. Potential failure
Each of these event types must be identified in a CMMS work order record. The way in
which these events are recorded was introduced in Chapter 1. and Chapter 2. A practical
approach for recording events will be outlined in Chapter 4.
Keeping track of system component ages

As a result of the continual process of repair and replacement of field parts and the
incorporation of design modifications, the significant components of any complex system
(such as an engine that has been in service for some time) will be of widely disparate
ages. The overall age identified with an engine is the age of its nameplate. The
nameplate is useful in referring to individual engines. An actual engine unit in the
operating fleet will consist of parts older or younger than its nameplate. It is sometimes
necessary to keep track, not just of the age of each engine, but of the ages of all
significant parts from which it is assembled. The database structure presented in the
section Data Structure (page 21) of Chapter 2. accommodates this requirement by
identifying the five knowledge elements each time a work order is closed. By keeping
track of the beginning and ending events for each significant component, its working age
at any time may be calculated by software.
The foregoing raises the question “What is a significant component?” We must decide,
therefore, whether or not a part is, in fact, significant, so as to justify tracking its
individual lifecycle. A part should be considered significant if it is related to a failure
mode, whose consequences are significant. The example of an ingot transporter in the
following section illustrates the notion of significance.
76
An item may be removed or reworked as the result of a scheduled task, or because it was expedient to do
so at that particular time. Such an event would be an “ES” event. If the item failed (a functional failure), or
failure was imminent (a potential failure), that event would be classified as “EF”.
Significant components
Appendix 10. (page 294) illustrates the variable depth of causality at which a failure
mode may be reported. Consider an ingot transporter in a steel mill (Figure 3-16).
Figure 3-16: Ingot transporter with two hydraulic pumps

It has two identical hydraulic pumps, one on the left and one on the right side. A failure
of either pump will cause a failure of the equipment as a whole. A pump itself may fail
due to a number of causes – seal failure, gear wear, and so on. We have a choice to make
regarding the depth of causality (see Appendix 10. page 294) at which to report reliability
related data (on the completed work order) regarding these pumps. We could decide to
define the failure mode as “hydraulic pump failed” without specifying whether it was the
pump on the right or left side. Our CMMS records would not distinguish which of the
two pumps failed.
Had the two pumps been piped in parallel as backups for one another, this level of
causality might be sufficient since the failure of either one would have no “direct”
operational consequences.77 However, in this case the consequences of a failure of either
pump are, in fact, operational. Hence it is worthwhile to define two failure modes78:
“hydraulic system failure due to failure of left pump” and “hydraulic system failure due
to failure of right pump”. The family of conditional probability curves (Figure 3-9 page
46) for the transporter might therefore include one curve representing the age-reliability
relationship of the left pump and one representing that of the right pump. If these pumps
are relatively quick and easy to replace we would probably stop there. That is, the
consequences of failure do not justify specifying the failure modes of the pump itself (for
example gear worn, diaphragm leaks, etc). In some other system, operating in another
context, an identical pump could require deeper failure mode analysis than that elected to
be performed here. There are no hard-and-fast rules on the depth of casality at which to
manage a failure mode. It depends entirely on operating context.
77
It does have non-operational consequences since the failed pump would have to be repaired in order to
avoid a multiple failure which would indeed have operational consequences.
78
And therefore two corresponding records in the RCM table. This is entirely a judgement call, in which
supervisors, maintainers, and operators decide that the consequences are severe enough for them to take the
extra trouble to understand the behavior of the pump at each location.
Keeping track of the working ages of individual components upon which we wish to
perform reliability analysis can be a daunting data challenge. However, the EXAKT
maintenance information methodology simplifies the chore by introducing two new
concepts:
1. Suspended animation, and

2. Marginal analysis.
Suspended Animation
Suspended animation is a period in which a component (equipment or module) is out of
operation for a reason other than failure. For example, one cylinder of a gas compressor
may be taken off line for minor maintenance.79 We introduce two new events in order to
deal with this (and similar) situations:
1. BSA - the beginning of a suspended animation interval, and

2. ESA - the ending of a suspended animation interval.
Typically, meters or process computers provide the working age for the system as a
whole, for example, the throughput of a production line. The production line consists of
many equipment units. If an individual item or component of an item is taken off line for
an extended period of time, the work order should mark the event with “BSA” and record
the system working age. When the component is returned to service, the work order
should mark that event with “ESA” and record the system working age. The software
will, via these recorded events, keep track of the working age of every significant
component or equipment, without the necessity of individual working age meters for
every significant item.
Handling meter anomalies

Suppose a meter records equipment operating time or production throughput, and, it has
been damaged and replaced. Perhaps, in some other instance, the meter rolls through to
“zero” after reaching its numerical limit. Or, we need to reset the meter to a different
value for operational reasons. In all these cases, we may easily lose track of the physical
asset’s actual working age, not to mention that of its significant components. Not to
worry. We may conveniently use the BSA and ESA events to maintain an accurate record
of working age, not only of the equipment unit as a whole, but also for each of its
significant components.
79
A “minor” maintenance event is one that does not rejuvenate the equipment. It may be an adjustment or
recalibration or an alignment. It may, or may not, impact condition monitoring data.
Figure 3-17 Events table (partial) for a fleet of haul trucks
Consider the events table of Figure 3-17. The events B and BSA indicate, respectively,
the beginning of the life of an item (an equipment unit) and the beginning of a period of
suspended animation.
Equipment stops but meter keeps running

Ident  Date      WorkingAge  Event  Comments
HT-06  12/30/93           0  B      Unit installed
HT-06  5/4/94          5000  BSA    Unit stops
HT-06  8/12/94         6000  ESA    Unit starts
HT-06  4/23/95        10524  EF     Unit fails
Figure 3-18 Equipment stops but meter keeps running
Figure 3-18 indicates an interval of suspended animation having occurred from "5/4/94"
to "8/12/94". Assume that, for some reason, regardless of the unit’s inactivity, the meter
continued to count working age. Some time following its return to service, the unit failed
at a recorded age of 10524, which, obviously, is not the true working age at the end of its
lifecycle. Fortunately, the recorded “artificial” BSA and ESA events permit software
to compute, easily, the total accumulated working age at the failure event (EF) as
(5000 - 0) + (10524 - 6000) = 9524.
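The calculation for HT-06 amounts to summing the meter differences over the intervals in which the unit was actually in service. The following Python sketch shows one way this could be done; it is not the EXAKT implementation, and the function name and the (meter reading, event code) tuple layout are assumptions. The event codes B, BSA, ESA, EF, and ES are those defined in the text.

```python
def accumulated_working_age(events):
    """Sum the working age over in-service intervals.

    events: chronological list of (meter_reading, event_code) for one lifecycle.
    B and ESA open an in-service interval; BSA, EF, and ES close one.
    """
    total = 0
    running_since = None  # meter reading at which the current interval began
    for reading, code in events:
        if code in ("B", "ESA"):
            running_since = reading
        elif code in ("BSA", "EF", "ES"):
            if running_since is None:
                raise ValueError(f"{code} recorded while item was not in service")
            total += reading - running_since
            running_since = None
        else:
            raise ValueError(f"unknown event code: {code}")
    return total


# Events for HT-06, taken from Figure 3-18.
ht06 = [(0, "B"), (5000, "BSA"), (6000, "ESA"), (10524, "EF")]
print(accumulated_working_age(ht06))  # 9524
```

Because only differences between readings inside an in-service interval are used, the 1000 units counted by the meter while the unit was stopped never enter the total.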
Component removed but equipment keeps operating

Ident  Date      WorkingAge  Event  Comments
HT-07  1/19/94            0  B      Unit installed – all comps new
HT-07  8/3/94          4500  B1SA   Significant component removed
HT-07  12/29/94        7523  E2FF   Component 2 failed
HT-07  7/19/96        19575  B2FF   Component 2 renewed
Now consider the equipment item "HT-07". On "8/3/94" a significant component was
removed, but the component was part of a system that continued to operate. Failure of a
second component in the system occurred on 12/29/94, while the first component was still
out of service. Once again, accurate working ages may be tracked. The B1SA event tells
the software that component one’s life has been suspended at 4500 - 0 = 4500 hours.
Until an E1SA event occurs, Component 1 is considered to be in suspended animation.
Meter reset
Observe equipment "HT-17" (Table 3-2). There was a meter malfunction on 4/22/96 and
the meter was reset (arbitrarily) to 1000 hours the next day. At meter reading 4274 failure
occurred. The lifecycle working age at failure is (7230 - 0) + (4274 - 1000) = 10504
hours.
Table 3-2 Documenting a meter reset

Ident  Date      WorkingAge  Event  Comments
HT-17  3/14/95            0  B
HT-17  4/22/96         7230  BSA    Meter reset
HT-17  4/23/96         1000  ESA    Meter reset
HT-17  12/21/96        4274  B
Suppose a CBM inspection was made on "7/23/96", at which time the meter indicated
2035. Then, the actual working age at the time of the CBM inspection would be
computed automatically by the software as (7230 - 0) + (2035 - 1000) = 8265.
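A CBM inspection usually falls between two recorded events, so the software needs the true working age at an arbitrary meter reading, not only at an ending event. The sketch below illustrates one possible approach for the HT-17 meter reset; the function name and data layout are assumptions and do not describe EXAKT's internal code. It adds the working age accumulated in closed intervals to the progress made since the most recent B or ESA event.

```python
def true_working_age(events, meter_reading):
    """True working age at a meter reading taken after the last recorded event.

    events: chronological list of (meter_reading, event_code) recorded so far.
            The item must be in service, that is, the last event is B or ESA.
    meter_reading: current meter value, for example at a CBM inspection.
    """
    accumulated = 0       # working age from intervals already closed by a BSA
    segment_start = None  # meter value at the start of the open interval
    for reading, code in events:
        if code in ("B", "ESA"):
            segment_start = reading
        elif code == "BSA":
            if segment_start is None:
                raise ValueError("BSA recorded while item was not in service")
            accumulated += reading - segment_start
            segment_start = None
        else:
            raise ValueError(f"unexpected event code: {code}")
    if segment_start is None:
        raise ValueError("item is not in service at this reading")
    return accumulated + (meter_reading - segment_start)


# Events for HT-17 up to and including the meter reset (Table 3-2).
ht17 = [(0, "B"), (7230, "BSA"), (1000, "ESA")]
print(true_working_age(ht17, 2035))  # 8265, the inspection on 7/23/96
print(true_working_age(ht17, 4274))  # 10504, the reading at failure
```

The BSA and ESA pair here does not represent a real stoppage; it simply records the meter value before and after the reset so that the discontinuity can be bridged.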
Marginal analysis
We note, as well, that “Events” such as B1, E1F,
B1SA, and E1SA refer to a specific component (or failure mode), rather than to the
equipment as a whole. In Example 3 Complex Items (page 146) of Chapter 10. we will
show how the data structure of EXAKT accommodates the accurate tracking of
individual component ages, simply by keeping track of the date and working age of each
significant component’s installation and ending events.80 In Chapter 4. we will suggest
specific database practices and procedures for practical application of the important
principles associated with tracking significant components and failure modes.
80
In Chapter 17. (page 249) we describe the MIMOSA strategy to track the dates and working ages of
equipment assets that are moved from one operational “segment” to another.
Chapter 4. Acquiring Maintenance Information
Introduction
EWOP is an acronym for EXAKT Work Order Processor. The EWOP is a CMMS-integrated
process for managing maintenance data in support of reliability. EXAKT
CBM optimization projects and all reliability (and reliability-centered maintenance)
analysis81 endeavors will benefit from intelligible data.
The EWOP addresses an inherent weakness in the customary management of
maintenance and repair work orders. The weakness lies in the fact that a single work
order often addresses combinations of items, functions, failures, and failure modes. For
example, a single work order may cover the inspections of an entire group or category of
equipment. Worse, it may address several failures on a given unit, each of which may be
due to different causes. Yet, each work order form provides only one set of data entry
fields. This requires the maintainer to choose and report on a subset of his observations
related to the work order – a difficult intellectual challenge in most cases. To compound
the complexity of documenting his work order, the maintainer must then select from a
drop down list, failure codes that apply, often only peripherally, or not at all, to the
current situation as found. This results, not surprisingly, in the overuse of the default
choice, “other”. Figure 4-1 depicts a simplified but typical work order form.
Figure 4-1 A Simplified Work Order Form
81
We make the distinction between “Reliability Analysis” (RA) and “Reliability-centered Maintenance
Analysis” (RCMA). The former is the study of what did happen, while the latter is the study of what could
happen. RCMA could include a consideration of RA studies. RCMA develops a knowledge base that
results in an initial maintenance strategy. RA is used to continuously update the knowledge base and to
sanity-check and correct any assumptions or mistakes made due to incomplete information at the time of
RCMA. Nowlan and Heap used the term “age exploration” to describe any type of RA technique.
The underlying work order data structure reflected in Figure 4-1, cannot adequately
represent the variety of maintenance knowledge elements that can apply to any given
maintenance or repair situation. Without such information, we lack the ability, during
subsequent analysis of the asset’s work order history, to reconstruct faithfully what
actually happened in each instance. Our inability to use work order histories effectively
for reliability analysis, represents a fundamental problem in physical asset management.
To be of value, work order historical data must isolate (and report on) the five basic
reliability knowledge elements:
o What function was lost, compromised or threatened?

o In what way - failure description (including whether the loss of function was total,
partial, or potential)?
o What was the cause (at a practical depth in the causality chain)?
o What happened - effects (the sequence of relevant events preceding, during, and
following the failure)? and
o How does it matter (consequences to the user/owner/society)?
Because reliability analysis (for example, Weibull, EXAKT, Pareto, Jack-knife, and
many others) requires this degree of information granularity, we seldom observe analyses
based on fact in the typical maintenance organization. The inability to perform fact-based
reliability analysis makes it difficult, and often impossible, to improve the overall
equipment effectiveness (OEE) of physical assets. On the other hand, if a reliability
analysis can be performed, attaining policy improvement is usually straightforward.
EWOP enables the use of the CMMS historical database for reliability analysis!
Lexicon
Item: An assemblage of components that is convenient to analyze and monitor as a
group. Often, an equipment unit specified in the CMMS asset hierarchy qualifies as an
item.
Component: A significant subassembly or part of an item, for which it is worthwhile to
analyze and monitor its reliability.
Significant: Describes whether an event has hidden consequences or consequences that
are economically important, operationally important, or can impact health, safety, or the
environment.
Failure mode: A cause of a failure described at a practical depth in the causality chain.
Note: In the EWOP (and in reliability analysis in general) the terms component and
failure mode may be used interchangeably. Where a significant component is affected by
a dominant failure mode the terms are equivalent. However where a component’s
reliability is affected by more than one reasonably likely failure mode, our interest
centers on the failure mode rather than on the component. Either way, our analysis
proceeds identically. Predictive models, therefore, will sometimes apply to a component
and they will sometimes apply to a failure mode.
EXAKT: EXAKT is a reliability analysis methodology that creates CBM data

interpretation (or predictive) models. It does so by recognizing patterns of historical
events in an asset that relate, in some way, to the data in one or more CBM and process
databases. We wish to use the data in those databases to decide when and how to
maintain the asset. When EXAKT finds a statistically valid relationship, it produces a
predictive CBM model that supports such maintenance decisions. EXAKT-generated
decisions are said to be “optimal”, meaning that they support the attainment of some
objective related to the maintenance of the asset. The optimal model is implemented by
an “intelligent agent” that silently monitors new data and generates optimal decisions
regarding whether to maintain an item at the current time or to continue operating for an
additional period (the observation interval).
The purpose of the EWOP
The EWOP extracts information from the work order historical database and, using that
information, performs these two important functions:
1. Adds to the organization’s reliability-centered knowledge base, and

2. Generates an Events table for reliability analysis.
Work order documentation procedures for the EWOP
The EWOP requires that maintenance personnel adopt a new approach regarding the way
in which they document their work orders at the completion of a maintenance or repair
task. A work order is the principal vehicle for the acquisition of information revealed
from the field, describing the “as-found” state of a physical asset. Such information
assists engineers in conducting reliability analysis (RA). RA requires an accurate
chronology of events related to the failure behavior of a physical asset. This section
suggests a practical way of documenting a work order. Good documentation will support
subsequent reliability analysis, which, in turn, will improve reliability and lower cost.
Figure 4-2 Work order with EWOP’s required data fields
At a minimum, an (EWOP-friendly) work order should provide the following sixteen items
of data:
1. Work order number
Uniquely identifies a work order.
2. Item
A collection of components or assemblies that is convenient to consider and analyze as a
group. An equipment or system as it is defined in the CMMS asset register often
identifies an item.
3. Date out
The date/time that an Item is taken out of service.
4. Date back
The date/time that an Item is returned to service.
5. Work performed
What was done to the Item.
6. By
Who executed the work.
7. Working age out (Wageout)
Working age should closely relate to the accumulated work performed and stress
undergone by the item. (For example kilowatt-hours, fuel consumed, production units
produced, and so on are sufficiently accurate ways to measure working age). Wageout is
the working age at Date out.
8. Working age back (Wageback)
Working age at the Date back.
9. RCMREF
A reference to a record in the RCM knowledge base.
10. Function
The function that was lost, compromised, or threatened by the events that provoked the
issuance of the work order.
11. Failure
The way in which the function was lost, compromised, or threatened. For example, total
loss of function, partial loss of function, or potential loss of function. (For example the
failure “Transmits only 60% to 80% of the required torque of 900 ft-lb” speaks directly to
the function “To transmit 900 ft-lb of torque to the wheel”.)
12. Cause
What caused the loss (compromise or threat) to the function. Identify the cause at a
practical depth in the causality chain. (For example, clutch slips, or clutch slips due to
worn disks, or clutch slips due to worn disks caused by oil leak, or clutch slips due to
worn disks caused by oil leak as a result of an incorrect seal fitting, etc.) The depth of
causality at which we state the cause (failure mode) is dictated by the practicality at
which the organization can do something about the cause and its consequences. The
causality level is decided through discussion among maintainers, supervisors, engineers,
and planners. This is the most challenging, yet rewarding, part of the EWOP approach to
maintenance information management.
13. Effects
The sequence of relevant events (that are worthwhile to record), within the component,
within the equipment, and within the organization that led up to the failure, occurred
during the failure, and occurred following the failure.82
14. Consequences
Based on the Effects, a determination of how the failure matters. Select one of 1. Hidden,
2. Safety, health, environment, 3. Operational, and 4. Non-operational.
15. Event type

o FF – the ending and renewal of a component (failure mode) due to a functional
failure
o PF – the ending and renewal of a component (failure mode) due to having
detected a potential failure in time to avoid the more dire consequences of an FF.83
o S – the ending and renewal of a component (failure mode) for any reason other
than (functional or potential) failure. (For example preventively replacing the
component.)
o B – the beginning of the life of a component in the item (if not FF, PF, or S) 84
o BSA – the beginning of a period of temporary removal (suspended animation) of
a component from the item.
o ESA – the return of the same component to the item after a period of suspended
animation
o SA – the beginning and ending of a period of suspended animation if reported on
the same work order.
82
To the degree that the consequences of the failure are significant the Effects should answer these
questions:
A. What sequence of events (internally and organization wide) could be touched off by the failure
mode?
B. How does the failure make itself known? What observable events lead up to the failure?
C. How is safety or the environment impacted? (do not mention the words "safety" or
"environment")
D. How is production impacted? (quality, cost, customer service)
E. Is there any additional damage caused by the failure?
F. How long will it take and what actions must be accomplished to correct the failure?
G. How does the likelihood of this failure depend on deeper causes? Has it happened before? How
often? Under what circumstances?
83
It is important to distinguish between the event types FF and PF. If the consequences of failure have been
largely avoided or mitigated due to having detected the failure, select PF. In subsequent reliability analyses,
the hazard rates for FF and PF events may be compared and an evaluation of CBM effectiveness may be
made.
84
Perhaps the failure was covered by a previous work order. The item functioned for a period of time
without this component (e.g. one cylinder of a gas compressor) and the component was finally re-instated
in the current work order.
o MR – the minor repair of the item. It does not renew any components. Sometimes
it will impact the monitored data. For example, a calibration, a shaft alignment, an
oil change, the balancing of an impeller, and so on.
In most cases one of the first three (FF, PF, and S) will apply. Properly selecting the
Event type from the drop down list and providing additional information in the “Work
performed” field will allow the EWOP to create the required Events (for subsequent
reliability analysis) in the Events table. Several examples are given beginning on page 66
in the Section “Examples”.
16. Additional Information
Anything else that needs to be communicated relative to the work order.85
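The sixteen data items listed above can be pictured as a single record type. The Python sketch below is only an illustration of that structure: the class names, field names, and types are assumptions, not the EWOP schema. The function, failure, and cause texts come from the clutch example above; the item name, working ages, times, effects text, and the choice of consequence category are invented for the example.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from enum import Enum


class EventType(Enum):
    FF = "FF"    # functional failure
    PF = "PF"    # potential failure
    S = "S"      # suspension (ending for a reason other than failure)
    B = "B"      # beginning of a component life
    BSA = "BSA"  # beginning of suspended animation
    ESA = "ESA"  # ending of suspended animation
    SA = "SA"    # suspended animation begun and ended on one work order
    MR = "MR"    # minor repair


class Consequence(Enum):
    HIDDEN = 1
    SAFETY_HEALTH_ENVIRONMENT = 2
    OPERATIONAL = 3
    NON_OPERATIONAL = 4


@dataclass
class WorkOrder:
    work_order_number: str
    item: str
    date_out: datetime
    date_back: datetime
    work_performed: str
    by: str
    wage_out: float    # working age at date_out
    wage_back: float   # working age at date_back
    rcmref: int        # reference to a record in the RCM knowledge base
    function: str
    failure: str
    cause: str
    effects: str
    consequences: Consequence
    event_type: EventType
    additional_information: str = ""

    def __post_init__(self):
        # Basic sanity check: an item cannot return before it was taken out.
        if self.date_back < self.date_out:
            raise ValueError("date_back precedes date_out")


wo = WorkOrder(
    work_order_number="9",
    item="Haul truck drive train",           # hypothetical item name
    date_out=datetime(2004, 5, 14, 8, 0),    # times are hypothetical
    date_back=datetime(2004, 5, 14, 16, 0),
    work_performed="Replaced clutch disks",
    by="Maintainer A",
    wage_out=5000.0,
    wage_back=5000.0,
    rcmref=1101,
    function="To transmit 900 ft-lb of torque to the wheel",
    failure="Transmits only 60% to 80% of the required torque of 900 ft-lb",
    cause="Clutch slips due to worn disks",
    effects="Reduced tractive effort noticed by the operator",
    consequences=Consequence.OPERATIONAL,
    event_type=EventType.FF,
)
print(len(fields(WorkOrder)))  # number of data items in the record
```

Restricting the event type and consequences to enumerated values mirrors the drop down lists described in the text and avoids free-text variants of the same code.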
One may argue that this degree of “wordiness” and detail in completing a work order,
especially where a general overhaul was performed, is onerous and excessive. One would
agree, however, that it is worthwhile to adopt a degree of informational completeness that
is proportional to the consequences and probability of the failure of the equipment in
question. Recalling the example of the ingot transporter in the section Significant
components on page 54, the repair person will devote less time and apply a smaller
amount of detail to a less critical equipment, say, one whose functions are duplicated by a
backup system. Recall too, that most of the 16 reliability information elements for a
failure mode of an item, need be entered manually, only once. Thereafter, in future
incidences, the RCM record is merely referenced (using information element 9) in the
work order record. (Recall the advantages of the one-to-many integrity constraint
illustrated in Figure 2-2 on page 21). Moreover, the Event codes described in the next
section constitute accurate failure codes that will eventually reduce the clerical
verbosity of this approach, almost to zero.
85
Could contain “pseudo” work orders as structured text. This might be in the case of CMMSs that do not
permit the creation on demand of additional work orders, which are needed to cover unique item-function-
failure-causes.
The events table
Figure 4-3 The events table
The events table (Figure 4-3) is one of the two important outputs of the EWOP.
Reliability analyses (such as Weibull, Pareto, and Proportional hazard modeling) require
information on the beginning and ending events of a component or a failure mode. The
most significant events in the life of a component are 1. its installation, 2. its ending due
to failure, and 3. its ending due to a reason other than failure. Reliability analysis makes
use of the dates and working ages at which events occur. Events define a component’s
life cycles. Reliability analysis discovers the relationship between a component’s
working age and its failure probability86. Proportional hazard modeling takes RA one step
further by discovering the three-way relationship between age, reliability, and condition
monitoring87 data.
86
It is often referred to as the age-reliability relationship.
87
Of all types: performance data, process data, sensor data, and external data such as environmental factors.
The RCM knowledge base
Figure 4-4 The RCM knowledge base.
Figure 4-4 shows examples of records in the RCM knowledge base – the second
important output of the EWOP. The EWOP added these records to the RCM knowledge
base, by using the information from the work order.88
Uniqueness of a work order
The EWOP requires that a work order relate to a single combination of the fields:
1. item
10. function
11. failure
12. cause
If more than one combination of these four elements occurs (and is significant) in a work
order, the combinations should be extracted into as many sub work orders as required. If
the CMMS cannot be adapted to generate additional work orders on demand (in order to
accommodate each additional significant combination of data elements), the text area
(“Additional information”) of the work order may be used for this purpose. (See Appendix
The short term process page 265.)
Examples
Example 1
A component has suffered a functional failure and is renewed. Select: “FF”. The
EWOP will generate the EnnnFF and BnnnFF events in the Events table, where “nnn” is
88
“As-found” information complements and sanity checks the knowledge developed during reliability-
centered maintenance analyses. The reliability knowledge base is, in this way, said to be “living”.
Page 66
the RCMREF of the RCM knowledge base record that describes the current item-
function-failure-cause.
Figure 4-5 Work order with the Event Type "FF" selected
The work order of Figure 4-5 illustrates the event type “FF” having been selected. The
EWOP, upon encountering “FF” generates the two events of Figure 4-6.
Figure 4-6 Events generated from Work order 9 with event type "FF"
Examine the Events table of Figure 4-6, in particular the column “Event”. On May 14,
2004, work order 9 was issued. The work order generated the ending event with failure,
“E1101FF”, and the beginning event, “B1101FF”. Note that the RCM reference number,
1101, was sandwiched between the first letter and the last two. Thus, if we were
analyzing (building a model of) the failure behavior of a particular failure mode we
could easily identify all events in the database that refer to that item-function-failure-
cause.
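The naming convention just described (a B or E prefix, the RCM reference number, then a suffix) lends itself to mechanical decoding. The following is a minimal sketch, not the EWOP's own code: the names `EventCode`, `parse_event`, and `events_for` are hypothetical, and the pattern is inferred only from the event codes shown in this chapter (such as E1101FF, B852B, E852BSA, and BmeterESA).

```python
import re
from typing import NamedTuple


class EventCode(NamedTuple):
    prefix: str   # "B" = beginning event, "E" = ending event
    rcm_ref: str  # RCM reference number, or "meter" for meter-reset events
    suffix: str   # e.g. "FF", "PF", "S", "B", "BSA", "ESA"


# Assumed layout: one prefix letter, the RCM reference, then capital letters.
_EVENT = re.compile(r"([BE])(\d+|meter)([A-Z]+)")


def parse_event(code: str) -> EventCode:
    """Split an event code such as 'E1101FF' into its three parts."""
    match = _EVENT.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized event code: {code!r}")
    return EventCode(*match.groups())


def events_for(rcm_ref: str, codes):
    """Return every event code that refers to the given RCM record."""
    return [c for c in codes if parse_event(c).rcm_ref == rcm_ref]


if __name__ == "__main__":
    sample = ["E1101FF", "B1101FF", "E890PF", "B890PF"]
    print(parse_event("E1101FF"))
    print(events_for("1101", sample))
```

Filtering by `rcm_ref` is the step that lets an analyst collect all events for one item-function-failure-cause before mapping them into a model.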
When building the predictive model, we will conveniently “map” (for example by using
the mapping dialog of EXAKT, described in the Appendix Exercise page 313 ) the event
“E1101FF” to the event EF in the model named “Crusher 1017 B Failure mode 1101”.
Page 67
Finally, in this instance, no RCM reference existed prior to the work order. Therefore, the
EWOP generated record “1101” (using the information provided on the work orders) and
inserted it into the knowledge base (RCM table). The record will be quality checked
by a reliability engineer, planner, or analyst or other person versed in the use of RCM
language and concepts. This is one important way in which the reliability-centered
knowledge base grows.89
Example 2
A component has revealed a potential failure and is renewed. The technician will
select “PF” in the Event type field. Similarly to Example 1, the EWOP will generate the
event records “EnnnPF” and “BnnnPF” as illustrated in Figure 4-7.
Figure 4-7 Events generated from a work order with the Event type "PF"
A second important aspect of this work order is that it relates to an already known failure
mode, hence the reference to RCM record 890. The technician checked the knowledge
base and discovered that the item-function-failure-cause is known90. Therefore, he
references (rather than duplicates) the known reliability information. The referenced
RCM record is illustrated in Figure 4-8.
Figure 4-8 The RCM record "890" referenced in Work order 10
Example 3
A component has been renewed preventively. The technician selects “S” (for
suspension) in the Event type field. No failure (neither a PF nor a FF) has occurred.
Figure 4-9 Events generated from a work order with the Event type of "S"
89
The other way is through reliability-centered maintenance analysis (RCMA). While RA analyzes failure
and potential failure events that have occurred, RCMA considers and analyzes reasonably likely failure
modes that may or may not have occurred in the past. The two processes populate the same knowledge
base. Each benefits from the experience and thinking of the other.
90
Either it has occurred before, or it was considered and included in a RCMA project.
Page 68
When building a predictive model of failure mode 1124 (i.e. RCM record 1124), a
reliability engineer or analyst will map the events E1124S and B1124S to the events ES
(ending by suspension) and B (beginning of life cycle) of failure mode “1124” of item
“Crusher 1017B”.
Example 4
A new component has been installed but no component was replaced. Only one
event is generated.
Figure 4-10 A single event generated for an installation of a component for the first time
Examine the second record of Figure 4-10. Work order 22 covers the installation of a
component where there was none. Hence EWOP inserts a single Event “B852B”. EWOP
knows to insert only a beginning event because the technician selected “B” (rather than
FF, PF, or S) in the Event type field.
A month earlier on work order 21, the “component” 852 was removed, possibly for a
minor repair (such as cleaning, adjustment, etc) with the intention of placing the part back
in service at a later date. The component was supposedly placed in “suspended
animation” (hence the event E852BSA). However, it was decided to renew the
component fully before placing it back in service. That is why we have a “B” event
where an ESA event was expected.
It will be quite clear to an analyst, however, that the E852BSA event should be mapped to an
ES event, leaving no ambiguity that the component’s lifecycle was suspended (rather
than having been placed in suspended animation as originally thought).
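Examples 1 to 4 suggest a simple rule linking the Event type field to the generated event records. The sketch below captures that rule under stated assumptions: the function name `generate_events` is hypothetical, only the four event types named so far (FF, PF, S, and B) are handled, and the actual EWOP logic may well differ.

```python
def generate_events(rcm_ref, event_type):
    """Return the event codes implied by an Event type selection.

    Assumption drawn from Examples 1 to 4: FF, PF, and S each close one
    life cycle and open the next (an E event and a B event), while B
    records a first-time installation (a single B event).
    """
    if event_type in ("FF", "PF", "S"):
        return [f"E{rcm_ref}{event_type}", f"B{rcm_ref}{event_type}"]
    if event_type == "B":
        return [f"B{rcm_ref}B"]
    raise ValueError(f"unsupported event type: {event_type!r}")


if __name__ == "__main__":
    print(generate_events(1101, "FF"))  # functional failure and renewal
    print(generate_events(1124, "S"))   # preventive renewal (suspension)
    print(generate_events(852, "B"))    # first-time installation
```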
Example 5
The item’s meter was reset
Figure 4-11 Meter reset
Page 69
Work order 13 covers the repair of a component (whose dominant failure mode is
documented in RCM record 1120). However, it was necessary to reset the meter at this
time. (This could have been for operational reasons, because the meter reached its maximum, or
because the meter was replaced.) Of course, a meter reset such as this will erroneously
impact the recorded ages of every significant item and failure mode whose reliability we
wish to track. How do we make sure our reliability data is not compromised by such an
event?
The answer is simply to include the phrase “meter reset” in the Work performed field.
The EWOP will understand that the meter was reset and it will insert two artificial events
into the Events table. The events will be labelled “BmeterESA” and “EmeterBSA”. This
tells the analyst that when modeling any component or failure mode of the item, he
should map these events to ESA and BSA in the reliability analysis project for the
component or failure mode under scrutiny. The analysis software will internally adjust
the working ages of every component whose histories include ESA and BSA events.91
Example 6
A used component was installed replacing a failed component.
Figure 4-12 Used component of age 4000 installed
Work order 23 covered the functional failure and renewal of component (failure mode)
881 on November 14. Observe (in Figure 4-13) the way in which work order 23 was
documented. The technician inserted the phrase “used (4000)” into the text of the
Work performed field. This tells the EWOP that a used component of age 4000, rather than
a new component, was installed.
91
The internal adjustment by reliability software of the individual component age is discussed in Chapter 4.
Page 70
Figure 4-13 Specifying that a used component was installed
The EWOP, therefore, subtracts 4000 from the B event, thereby “instantly” aging the
component by 4000. In addition it generates a “Start Monitoring” (SM) event to indicate
that no CBM results will apply to this component for the first 4000 time units of its life.
In any subsequent analysis of the reliability of component (or failure mode) 881, the
meaning will be clear. Appendix 1. EWOP details (page 263) provides further discussion
on the analysis regarding replacement by a used component.
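The arithmetic behind the “used (4000)” convention can be illustrated in a few lines. This is only a sketch under assumptions: we suppose that the B event carries a working-age (meter) reading, the function names are hypothetical, and the meter readings in the usage lines are invented purely for illustration.

```python
import re


def parse_used_age(work_performed):
    """Extract the age from a phrase such as 'used (4000)'; 0.0 if absent."""
    match = re.search(r"used\s*\((\d+(?:\.\d+)?)\)", work_performed, re.IGNORECASE)
    return float(match.group(1)) if match else 0.0


def beginning_reading(meter_at_install, used_age):
    """Reading to record for the B event: shifted back by the prior age."""
    return meter_at_install - used_age


def working_age(meter_now, b_event_reading):
    """Component age at a later meter reading."""
    return meter_now - b_event_reading


if __name__ == "__main__":
    used = parse_used_age("Replaced failed unit, used (4000)")
    b_reading = beginning_reading(10000.0, used)   # hypothetical meter reading
    print(used, b_reading, working_age(12500.0, b_reading))
```

Shifting the B event back by the prior age means that any later age calculation automatically includes the 4000 units the component had already accumulated.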
Example 7
A component is placed in suspended animation and the same component returns
from suspended animation
Figure 4-14 A component placed in suspended animation and later on returns to service
In work order 19 component 800 was removed for some minor reason and was placed
into suspended animation. Four months later it was reinstated on work order 20. Once
again the events E800BSA and B800ESA will be mapped to events BSA and ESA
respectively when modeling component (failure mode) 800.
You may be wondering why E (for “ending event”) is the prefix to E800BSA, which marks
the beginning of a period of suspended animation. The reason for this seeming oxymoron
is that the beginning of a period of suspended animation is, from the component’s point
of view, the end of a segment of its life-cycle.
Summary and Conclusions
We have described a basic limitation in current CMMS procedures related to historical
record keeping for reliability analysis. To date, the CMMS has been used primarily for
work order management, planning, scheduling, and spare parts provisioning.
Maintenance professionals recognize the CMMS database as a source of practical
knowledge with which they may improve maintenance policy. This chapter suggests a
Page 71
simple process for reliability information management, assisted by a software tool called
EWOP. The EWOP provides a quick, consistent, and inexpensive way to capture field
information needed by all reliability analysis techniques and software92.
We built our methods upon our understanding of the fundamental reliability-centered
knowledge elements introduced in Chapter 1 and Chapter 2. By actually implementing
an EXAKT CBM optimized model, we have set in motion a broader process of building
and using a living reliability-centered knowledge base. In addition to more effective
CBM, we will, without doubt, quickly discover a multitude of visible, spin-off benefits to
other physical asset management initiatives undertaken by our organization. All physical
asset management improvement programs (for example RCM, TPM, RBI, Six-sigma, and
many others) depend on good information. Because the information in a reliability-
centered knowledge base has been rendered in its most fundamental form, it may be
analyzed, processed, and used advantageously within the structure of any other physical
asset management improvement methodology.
We hope that the growing use of EWOP will encourage CMMS builders to supplement
their systems with similar features. Additionally, we encourage reliability consultants to
embrace and teach the principles of living reliability-centered knowledge. A
demonstration version of the EWOP may be obtained by communicating with the author.
92
Including reliability-centered maintenance analysis, FMEA, FMECA, root cause failure analysis, risk
based inspection, HAZOP, FRACAS, six-sigma, and many others that require the consideration of facts
based on experience.
Page 72
Chapter 5. Assessing “What-if” from
maintenance information
Introduction
We gather information in the course of our day-to-day maintenance activities in order to
deepen our understanding of failure so that we may better manage its causes and control
its consequences. We use our growing knowledge of the causes and effects of failure to
improve reliability. By "reliability improvement" we mean the attainment of desired
levels of availability, reliability, operating/maintenance cost, yield, production rate,
safety, and environmental integrity of each significant physical asset in its operating
context.
How do we improve any part of maintenance? Invariably, by adding or altering an aspect
of some maintenance policy. Every maintenance department, consciously or
unconsciously, operates according to a set of policies. Policies may have been written
down explicitly as guidelines and procedures, or they may have originated long ago and
persist as habit and tradition. The physical asset manager, in his primary role, monitors
the effectiveness of currently active policies. Those policies govern the reliability of the
significant items that fall within the compass of his responsibilities.
In the preceding chapter, we described methods and tools for using the CMMS to report
the outputs of an existing maintenance policy. For example, the graph of Figure 3-8 (page
45) reports on the effectiveness of our current CBM program. And the graphs of Figure
3-9 (page 46) describe the actual failure behavior of items. They provide clues as to
whether a different maintenance policy or physical modification may act to our
advantage.
All the previous methods help us track the effectiveness - the maintenance outputs - of
past and present policies. They do not predict what would happen in the future if a
maintenance policy were altered. The capacity to perform “what if” analysis on the
future impact of policy changes, would, no doubt, assist the physical asset manager. He
could, thereafter, ask questions of the type, “What will the
downtime/availability/reliability/cost be of my system if I double/triple/halve the
overhaul frequency?” We can perform decision analyses such as these by building and
running a model. In this chapter we examine the powerful modeling technique known as
Monte Carlo Simulation.
Modeling a simple system using SPAR93
Assume that we have operated and recorded, in our CMMS, failure and installation
events of a simple item over a number of years. We note from these records, that the
average life (MTTF) was 0.5 years. We observed the average repair time (MTTR) to be
93
Monte Carlo Simulation software available from Clockwork Solutions, www.clockworksolutions.com.
This exercise was compiled by Naaman Gurvitz of Clockwork.
Page 73
10 days (0.0274 years) and that the actual repair time was normally distributed with a
10% standard deviation. We desire, at this time, to predict the maintenance performance
for this item over the next two years under a variety of alternative policies and conditions.
Objective of the analysis

To predict maintenance performance for various failure distributions and
maintenance policies:
1. Perfect repair
2. Imperfect repair
   a. Various repair effectiveness values
3. Periodic overhaul (time-based maintenance)
   a. Perfect repair
   b. Imperfect repair
We proceed to build a model by providing SPAR™ with three types of information:
1. the system function (the reliability block diagram) using the Graphical System
Function Generator,
2. the failure and repair behavior, using the Input Generator, and
3. the maintenance policies, using the Bubble Logic Generator.
The system function
Figure 5-1 The reliability block diagram for a single line replaceable unit (LRU) named "SGN"
Figure 5-1 presents the simplest of reliability block diagrams containing a single line
replaceable unit.
Failure behaviors
As a hypothetical set of cases for our examination, we will assume 4 possible failure
distributions for the single LRU of Figure 5-1: one exponential distribution, and three
Weibull distributions with shape parameters (β) of 1.5, 2.5, and 3.5. An exponential
distribution’s single parameter is the item’s MTTF, which in this case is 0.5 years. For the
three Weibull distributions, we may calculate the second (scale) parameter, λ, using the equation:
$$\lambda = \left( \frac{\Gamma(1 + 1/\beta)}{\mathrm{MTTF}} \right)^{\beta}$$
Equation 5-1
Page 74
where Γ is the gamma function94 and MTTF = 0.5 years. Equation 5-1 yields the following
values for the Weibull scale parameter, λ:
λ          β
2.4230     1.5
4.2035     2.5
7.82445    3.5
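Equation 5-1 can also be evaluated directly rather than by table lookup. The short sketch below assumes the Weibull form in which reliability is exp(-λ·t^β), which is the form consistent with Equation 5-1; the function name `weibull_scale` is our own. Computed values should land close to the tabulated ones, although small differences in the later digits are plausible when gamma values are read from a table.

```python
import math


def weibull_scale(mttf, beta):
    """Scale parameter lambda from Equation 5-1, given the MTTF and shape beta."""
    return (math.gamma(1.0 + 1.0 / beta) / mttf) ** beta


if __name__ == "__main__":
    for beta in (1.5, 2.5, 3.5):
        print(f"beta = {beta}: lambda = {weibull_scale(0.5, beta):.4f}")
```

Setting β = 1 recovers the exponential case, where λ is simply 1/MTTF.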
We can now enter, into the SPAR™ program, the parameters of the 4 failure
distributions, and the parameters of the normal distribution of repair time (mean 0.0274
years, standard deviation 0.00274 years). We specify a service time observation window of
2 years and run the program.
Running the program

SPAR generates the prediction graphs for availability, downtime, and failure illustrated in
Figure 5-2, Figure 5-3, and Figure 5-4 respectively.
Figure 5-2 Graphs of predicted availability over 2 years for each of the 4 distributions
94
The value of the gamma function Γ(x) for any x may be looked up in a table, much as
values of sin(x) are looked up in trigonometric tables.
Page 75
Figure 5-3 Predicted average downtime over 2 years for each of 4 distributions
Figure 5-4 Predicted number of failures in a two year period for each of 4 failure distributions
Remarks
We may conclude that it is technically feasible (knowing the failure and repair
distributions) to analyze and predict maintenance performance. At this point we
increase the level of realism one notch by considering policies where repair effectiveness
will be less than “perfect”.
Repair effectiveness
We define “repair effectiveness” as a reduction in age. Following a perfect repair we
would “reset” a component’s age to zero. That is, age conservation for a 100% effective
maintenance action is “0”. If the repair is imperfect we use the SPAR program’s bubble
Page 76
logic to instruct the calculation engine to conserve a portion of the item’s age after repair.
Assume, for example, that a “minimal” repair will actually conserve 99% of an item’s
age95. We enter this information into SPAR using its Bubble Logic generator tool. SPAR
then generates the following Dynamic Logical Sentence (DLS):
At Collision
START DLS (1)
Comment: Setting age upon repair
1.1 If LRU 1 in current system is repaired now
Set age of LRU 1 in current system to .99*age at last failure
1.1 End Of If
END DLS (1)
The DLS tells the calculation engine to treat repair as “minimal”. We run the analysis
once again. This time, however, the predictive results will account for the minimal nature
of the repair. We refer to such repairs as “as bad as old”. Compare the results of the
following graphs (Figure 5-5, Figure 5-6, and Figure 5-7) to the previous ones (Figure
5-2, Figure 5-3, and Figure 5-4 ) where a perfect repair policy was assumed.
Figure 5-5 Predicted availability under a minimal (“as bad as old”) repair policy
95
For example, to get the equipment back into production quickly, the policy may be to replace only the
failed component(s), leaving the others in the unit to continue aging.
Page 77
Figure 5-6 Predicted downtime under a minimal (“as bad as old”) repair policy
Figure 5-7 Predicted number of failures under a minimal (“as bad as old”) repair policy
We note that the repair policy "as bad as old" leads to lower system performance than in
the "as good as new" case. This is expected. However, it is not true (comparing the blue
lines and bars of each set of graphs) for the case of an exponential failure distribution.
That is because the exponential distribution is "ageless"; a unit whose failure distribution
is exponential is always as good as new! At this point we ratchet up the level of realism
another notch by adding preventive maintenance (periodic overhauls) to our maintenance
policy for this item.
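The comparison between perfect and minimal repair can be reproduced in spirit with a few lines of Monte Carlo code. The sketch below is not SPAR and makes no claim to match its engine: it assumes a Weibull reliability of the form exp(-λ·t^β), draws each failure conditionally on the item's current age, truncates negative repair times at zero, and applies the age conservation rule of the DLS above (age after repair = conserve × age at last failure). All function and parameter names are our own.

```python
import math
import random


def weibull_scale(mttf, beta):
    """Scale parameter lambda of Equation 5-1."""
    return (math.gamma(1.0 + 1.0 / beta) / mttf) ** beta


def simulate(beta, conserve, mttf=0.5, mttr=0.0274, horizon=2.0,
             runs=2000, seed=1):
    """Return (average failures, average downtime) over the horizon.

    conserve = 0.0 models perfect repair ("as good as new");
    conserve = 0.99 models the minimal repair of the text.
    """
    rng = random.Random(seed)
    lam = weibull_scale(mttf, beta)
    failures = 0
    downtime = 0.0
    for _ in range(runs):
        t, age = 0.0, 0.0
        while True:
            u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
            # Age at next failure, conditional on having survived to `age`.
            fail_age = (age ** beta - math.log(u) / lam) ** (1.0 / beta)
            t += fail_age - age
            if t >= horizon:
                break
            failures += 1
            repair = max(0.0, rng.gauss(mttr, 0.1 * mttr))
            downtime += min(repair, horizon - t)
            t += repair
            if t >= horizon:
                break
            age = conserve * fail_age  # age conservation after repair
    return failures / runs, downtime / runs


if __name__ == "__main__":
    for label, conserve in (("perfect repair", 0.0), ("minimal repair", 0.99)):
        n_f, t_d = simulate(beta=2.5, conserve=conserve)
        print(f"{label}: {n_f:.2f} failures, {t_d:.3f} years down")
```

Running the same comparison with `beta=1.0` (the exponential case) should give nearly identical results for both policies, since the time to the next failure then no longer depends on the item's age.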
Applying Preventive Maintenance

The purpose of preventive maintenance is to reduce the future chance of unplanned
failures, or, in other words, to rejuvenate the component. In this model we shall assume
that preventive maintenance reduces the component age back to zero (as good as new).
Preventive maintenance is an “external” event that influences the system. We add to our
Page 78
current minimal repair policy a proposed preventive maintenance schedule. We do this by
using SPAR’s Input Generator tool.
Through a series of dialogs, we modify the current project by telling SPAR to apply PM
periodically at 6-month intervals. We also indicate to SPAR that the PM duration is 14
days (0.0384 years). By default, the PM is considered to apply zero age conservation,
which is what we want. As previously, we run the program and generate the maintenance
performance prediction graphs of Figure 5-8, Figure 5-9, and Figure 5-10.
Figure 5-8 Time Dependent Availability for Weibull β=2.5 distribution, (a) Perfect Repair, (b)
Minimal Repair, (c) Minimal repair and Periodic Maintenance
Figure 5-9 Average Downtime for Weibull β=2.5 distribution, (a) Perfect Repair, (b) Minimal Repair,
(c) Minimal repair and Periodic Maintenance
Page 79
Figure 5-10 Number of Failures for Weibull β=2.5 distribution, (a) Perfect Repair, (b) Minimal
Repair, (c) Minimal repair and Periodic Maintenance
Optimizing PM
It is usual to define an optimal PM policy as one that minimizes lifecycle cost. Lifecycle
cost would include the cost of lost production due to failure and maintenance. We set up
the variables of our optimization problem as follows:
Variable     Definition
Td           total downtime (due to either PM or failure) of the system
Cd           cost of downtime per unit time (i.e. production loss)
Nf           number of failures
Cf           cost per failure (not including downtime but only fixed costs such as man-hours, spare parts and so on)
Nm           number of preventive maintenance operations
Cm           cost of a maintenance operation (not including downtime)
Total cost   Cost = Cd * Td + Cf * Nf + Cm * Nm
We proceed to determine the optimal maintenance strategy for, say, the case of the
Weibull failure distribution with shape factor = 2.5 and an “as bad as old” repair policy.
Three possible maintenance strategies are:
1. No maintenance.
2. Preventive maintenance every 6 months.
3. Preventive maintenance every 3 months.
The cases of no maintenance and maintenance every 6 months have already been run. We
easily run another case with maintenance every 3 months. Then we have SPAR display the
comparative results graphs of Figure 5-11 and Figure 5-12.
Page 80
Figure 5-11 Average Downtime for Weibull β=2.5 distribution with Minimal Repair and: 1. No
Maintenance, 2 Maintenance Every Six Months, and 3. Maintenance Every Three Months
Figure 5-12 Average number of failures for Weibull β=2.5 distribution with Minimal Repair and: 1.
No Maintenance, 2. Maintenance Every Six Months, and 3. Maintenance Every Three Months
Using these results we set up the following spreadsheet calculating cost as Cost = Cd *
Td + Cf * Nf + Cm * Nm:
On the lower row of this spreadsheet we have applied the following values for this
exercise:
Page 81
Variable   Definition                                                                                                Value
Cd         cost of downtime per unit time (i.e. production loss)                                                     $0.10
Cf         cost per failure (not including downtime but only fixed costs such as man-hours, spare parts and so on)   $10
Cm         cost of a maintenance operation (not including downtime)                                                  $1
We enter the downtimes (from Figure 5-11) and the number of failures (from Figure
5-12) into the spreadsheet. The number of PM events (0, 3, and 7) for each case is
calculated by hand (e.g. the number of 3-month interval PMs that will take place in 24
months = 7). We conclude that the most cost effective policy of the three alternatives is to
perform preventive maintenance every 3 months. However, a change in the relative costs
of failures versus those of maintenance versus those of lost production during downtime
will likely change the best policy.
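The spreadsheet calculation is simple enough to sketch in code. The cost formula and the unit costs (Cd, Cf, Cm) come from the text; everything else is an assumption. In particular, the downtime and failure figures in the usage lines are placeholders invented for illustration (they are not read from Figure 5-11 or Figure 5-12), and `pm_count` assumes that no PM is performed at time zero or at the very end of the window, which reproduces the hand-calculated counts of 3 and 7.

```python
def pm_count(horizon_months, interval_months=None):
    """Number of PMs inside the window (none at time 0 or at the end)."""
    if interval_months is None:  # no preventive maintenance
        return 0
    return (horizon_months - 1) // interval_months


def total_cost(td, nf, nm, cd=0.10, cf=10.0, cm=1.0):
    """Cost = Cd * Td + Cf * Nf + Cm * Nm, with the unit costs of the text."""
    return cd * td + cf * nf + cm * nm


if __name__ == "__main__":
    # PLACEHOLDER inputs: (Td, Nf) pairs are hypothetical, not simulation output.
    policies = {
        "no PM": (None, 200.0, 12.0),
        "PM every 6 months": (6, 150.0, 6.0),
        "PM every 3 months": (3, 140.0, 3.0),
    }
    for name, (interval, td, nf) in policies.items():
        nm = pm_count(24, interval)
        print(f"{name}: Nm = {nm}, cost = {total_cost(td, nf, nm):.2f}")
```

Changing `cd`, `cf`, or `cm` and rerunning the loop is a quick way to see how the preferred policy shifts with the relative costs.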
Page 82
Part 2. Condition Based Maintenance
On-condition inspections, which make it possible to preempt functional failures by
potential failures, are the most effective tool of preventive maintenance – Nowlan and
Heap, Reliability-centered Maintenance.
Chapter 6. Deciding on CBM
Introduction
Most courses and books on CBM (also known as “Predictive maintenance”, “On-
condition maintenance”, and “Condition monitoring”) focus much attention on the
“technology” of acquiring and manipulating condition monitoring data. CBM hardware
and software providers provide excellent training to their customers in the efficient use of
their products and services. This book, on the other hand, explores the informational
processes underlying the technology of condition based maintenance.
We seek to perform CBM tasks that are applicable (feasible and practical) and effective
(accomplish the intended objective). CBM may, in a sense, be thought of as the most
“noble” or preferred form of maintenance for these reasons: Wherever applicable and
effective, CBM is:
1) the least intrusive,
2) the least expensive, and
3) the least tolerant of failure.
The third of these points requires some explanation. We perform time-based (preventive)
maintenance at a time prior to the age at which we expect the item to fail. In other words,
at an age to which most items of the kind in question survive (see Figure 3-2 on page 35).
By definition then, we expect that some items will fail prior to the scheduled preventive
renewal. We are prepared, therefore, to tolerate a relatively small number of failures.
CBM, on the other hand, is designed to intervene at the point of potential failure. Figure
6-1 illustrates CBM theory.
Page 83
Figure 6-1: CBM theory
Figure 6-1 describes the assumptions upon which CBM is based. They are:
1) The potential failure is reliably detectable,
2) The P-F interval (the time between detectable potential failure and functional
failure) can be estimated,
3) The inspection interval has been set to one half the P-F interval, and
4) The inspection interval has been ascertained to allow adequate warning time in
which to react appropriately to the potential failure.
Whenever a CBM task fails to accomplish its objective, it means that we have overlooked
or misjudged one or more of these assumptions. In such cases CBM is said to be
ineffective. Despite extensive application of technology and labor, many sophisticated
CBM programs deliver negligible net benefits. On the other hand, a great number of
simple, inexpensive CBM inspection programs reap enormous benefits. Why do these
advantages often not scale up with added technology, as we would expect they should?
We will explore this issue in subsequent sections as we proceed.
Why do CBM?
Intuitively, condition based maintenance would seem almost universally desirable. If it
can detect an impending failure, thereby allowing us to react quickly enough to prevent,
or to avoid the dire consequences of failure, why not do as much CBM as possible?
Figure 6-2 displays the RCM process with which we may analyze the applicability and
effectiveness of any pro-active maintenance task.
Page 84
Figure 6-2: The reliability-centered maintenance process
At the top of Figure 6-2 we note the two global intellectual activities that characterize all
human progress: A) analysis, followed by B) a decision to act based upon that analysis.
Reliability-centered maintenance (RCM) emerged in the 1980s as a maintenance
analysis and decision process of great power. RCM frames the maintenance analysis and
decision process in seven questions (listed on page 16). Question 1 identifies each of
the item’s performance requirements. Question 2 lists the functional failures associated
with each performance requirement. Question 3 enumerates every reasonably likely
cause (called a failure mode) of each functional failure. In Question 4 (effects analysis),
we express the scenario of noteworthy events touched off by the failure mode96. By
carefully considering the effects of Question 4, we may respond to Question 5 to
determine whether the consequences of the failure are: 1) hidden, 2) safety or
environmental, 3) operational (production related), or 4) non-operational (maintenance
impact only). Question 6 and Question 7 are answered by applying the RCM decision
algorithm (Figure 6-3).
We select the appropriate vertical branch (H, S, O/P, or M) of the decision diagram of
Figure 6-3 depending on the answer to Question 5 (consequences). Most97 of the tasks of
rows 3 and 4 of the decision algorithm designate “default” activities. When no single
applicable and effective pro-active task can be found, the decision algorithm directs us to
perform the default tasks. In Part 3. (page 201) we will exercise the RCM process in
great detail.
96
In the previous chapter on case based reasoning, we saw how effects analysis can be structurally
extended to enable the use of diagnostic algorithms – a specialized application of CBM.
97
The maintenance policy applied to manage a particular failure mode can be “Two or more of above”, a
proactive (not a default) action.
Page 85
Figure 6-3: The RCM decision algorithm
Once the decision analysis of steps 5, 6, and 7 has been completed, the RCM process is
complete, and we may proceed to the resourcing phase, illustrated by Figure 6-4. During
this implementation phase we set up our CMMS with detailed plans and schedules. We
specify the labor, parts, materials, and skills necessary to execute the set of tasks.
Furthermore, the human resource department provides any necessary manpower and
training.
Page 86
Figure 6-4: The RCM 7-step process followed by planning, scheduling, and resourcing
The panorama of Figure 6-4 places CBM (and all our maintenance tactics) into a
strategic context. Every policy action in our maintenance program is traceable back to
one or more functional requirements of a physical asset. We have, thus, answered the
question posed by the title of this section, “Why do CBM?”
History of CBM
Physical asset managers attempt to implement policies that maintain the functionality of
machinery and other production assets at a level required by their users, owners, and by
society at large. They select "proactive maintenance" as their first line of defense against
the causes of equipment failure. By applying routine inspection (condition based
maintenance aka CBM, on-condition maintenance, and predictive maintenance) or
periodic renewal (preventive maintenance aka PM, scheduled overhaul), or redesign,
they seek to avoid the consequences of failure. Of the three tactics they prefer to consider
CBM first, because it is usually less expensive and less intrusive. Although data is
plentiful and can be collected and processed in every situation, CBM is appropriate only
when it is both applicable (technically feasible) and effective (economically justifiable).
Applicability implies a non-ambiguous indicator of failure initiation and sufficient time
to proact.
Page 87
Preventive maintenance is the routine renewal of physical assets or their components.
Condition based maintenance is the routine inspection of a physical asset to determine
whether a failure process is underway. If failure has begun, the goal is to take an action
which will somehow avoid or reduce the consequences of failure. If the remedial action
(for example a cleaning or adjustment) can be performed on the spot, at the time of the
inspection, most companies consider the inspection activity as belonging to their
preventive maintenance (PM) program98.
Condition based maintenance (aka on-condition maintenance, predictive maintenance,
and others) first appeared in the late 1940's at the Rio Grande Railway Company, where it
was used to detect coolant and fuel leaks in a diesel engine's lubricating oil. The company
achieved outstanding economic success in reducing engine failure by performing maintenance
whenever "any" glycol or fuel was detected in the engine oil. The U.S. Army, impressed
by the relative ease with which physical asset availability could be improved, adopted
those techniques and developed others. During the 50's, 60's, and early 70s CBM grew in
popularity and a vibrant CBM technology industry emerged providing training, products,
and services which came to be known as "predictive maintenance".
Commercialization of CBM coincided with the dawn of the "information age" and CBM
took on a new "flavor". Technology entrepreneurs conjectured that, if simple physical
measurements, such as vibration amplitude or oil viscosity, could provide such useful
benefits, then collecting the data in computers and trending it over time would, likely,
provide a far deeper insight into the state of a machine's health. Hence the 1980s and
1990s witnessed a soaring rise in the use of computers, software, and data collectors in
maintenance shops throughout the industrial world.
In reality, even in the midst of impressive information technology growth, most day-to-
day CBM success stories still derive from the basic application of the original,
uncomplicated form of CBM. For example, the detection of unbalance in a rotating
machine, of glycol or fuel in an engine oil, or of mechanical looseness, soft foot, or shaft
misalignment seldom requires the degree of sophistication (and related expense) of the
variety of technology bells and whistles happily proffered by the CBM industry.99
At the same time (as the growth of CBM), the information technology revolution
impacted another part of physical asset management - the computerized control of
maintenance materials, labor, and historical records. These products became known as
computerized maintenance management systems (CMMS). There was, however, a
striking difference between the CBM and CMMS approaches.
While CBM technology vendors required their clients to adhere to highly structured
procedures for data collection and storage, CMMS vendors, on the other hand, hailed the
concept of 'flexibility' and emphasized their products' "ease of adaptation" to their clients'
98
Rather than to their CBM program. “PM” is being used here interchangeably with “TBM” (time based
maintenance)
99
Nevertheless, the CBM technology vendors offer powerful hardware and software that, when applied
effectively, meet the objectives of CBM.
Page 88
existing business processes. As a consequence of their much vaunted "user friendliness"
no common practices of data classification gathered sufficient critical mass to achieve
standardization - not even within a given organization, let alone in an industry, or in the
physical asset management community at large.
It is in this context that the third millennium, the age of connectivity, finds the state of
maintenance information. Maintenance technology vendors are poised to inject the latest
generation of "integration technology" into their traditional market. But the lack of a
common data model impedes smooth penetration.
The Machinery Information Management Open Systems Alliance (MIMOSA) was
formed in 1994 by key CBM and maintenance technology vendors to address the
problem. The result of their labors over the past decade is the impressive Common
Relational Information Schema (CRIS) and associated enabling tools. The CRIS
accommodates many physical asset management concepts within its data structure and
has the flexibility to adapt as required. It is continuously maintained and updated by
MIMOSA (www.mimosa.org).
Hence we may foretell the day when disparate production and physical asset management
systems will communicate seamlessly thanks to MIMOSA and other standardized
information protocols such as OSA-CBM (Open System Architecture for Condition Based
Maintenance), STEP (Standard for the Exchange of Product model data), OPC (formerly OLE
for process control), OAG (Open Applications Group), and others.
Connectivity to this degree of intimacy implies that process and maintenance information
from multiple platforms will materialize in a universally accessible format (CRIS) and, in
that homogenized form, may be intelligently processed for optimum decision making.
Optimization seeks to achieve some objective: the lowest average cost of maintenance,
highest asset availability, or a specified effective reliability. It is onto this stage that the
"CBM Optimizing Intelligent Agent" enters.
EXAKT, a CBM optimizing software developed by the CBM Laboratory at the
University of Toronto, is an intelligent agent. More precisely, it is a platform for
developing intelligent agents that are designed to interpret condition data (CBM
measurements) in combination with concurrent historical data from the CMMS. The
agent reduces both data sets to a clear decision - i.e. whether to intervene and perform
maintenance at this time or to allow the equipment to continue operating. It does so by
considering the economic consequences of failure, the cost of repair, and the risk of
failure in an upcoming period. It generates a recommendation that supports a stated
management objective - either to minimize cost or to maximize the asset's availability or
to achieve a particular desired key performance indicator (KPI) such as the ratio of
planned-to-breakdown maintenance.
What does the future have in store for CBM? The CBM process consists of three sub-
processes: data acquisition, signal processing, and decision making. Data acquisition is
already highly technologically advanced. "Signal processing" in CBM filters out of the
Page 89
data the operational and environmental information, so that what is left is a "condition
indicator" that reflects the degree of deterioration of some targeted failure mode. New
signal processing methodologies based on a variety of disciplines (wavelet analysis,
principal component analysis, inference engines, and neural net classifiers to name a few)
are being developed in research institutions and universities around the world. (Chapter 7.
describes a few such techniques.) Their effect will be to make it technically feasible to
track and manage ever increasing numbers of failure modes.
Page 90
Chapter 7. Anatomy of CBM
Having understood, from Figure 6-4 (page 87), that the decision to perform CBM flows
from a fundamental analysis of the physical asset’s maintenance requirements, we turn
our attention to the composition of a CBM task. We keep the over-riding concerns in
mind. That is, we elect to conduct only applicable and effective CBM procedures. Figure
7-1 portrays three distinct CBM sub-processes, each of which must satisfy the
applicability and effectiveness criteria in order for CBM to add value to a maintenance
program.
Figure 7-1: CBM sub-processes
Data Acquisition
Data acquisition is the first and, one might assert, the easiest of the three CBM sub-
processes to implement. Assisted by advanced sensor, signal transmission, and storage
technologies, we can, without too much effort, implement systems that collect and store
impressive amounts of data. The predictive maintenance industry has organized100 to
provide communication standards and protocols endowing their products with
unprecedented capability to share process and condition monitoring data. Because
commercial-off-the-shelf (COTS) data acquisition hardware and software products can be
used across a range of industries, data acquisition enjoys more commercial exposure than
do the other sub-processes of CBM. Some maintenance technology consumers imagine
that, once they set up elaborate data acquisition, storage, and display systems, they will
have overcome the major hurdle to effective CBM. Some pay scant attention to the
choice of the data they decide to collect, adopting a when-in-doubt-collect-it-anyway-it-
might-be-useful attitude. Their data choices are influenced largely by the capabilities of
the technology rather than by a pre-assessment of how well the collected data will reflect
an evolving failure mode.
By way of illustration, there are two important reasons why bearings fail:
• Overheating – the most common cause being over-lubrication, and

100
Some typical organizations are provided in the Introduction on page 13
Page 91
• Contaminants in the bearing oil – water being the dominant one
Were we to consider CBM a form of maintenance inspection (rather than a hi-tech
maintenance process), we would demand that monitored data relate clearly to the failure
modes with which we are most concerned. Moreover, from an information management
perspective, we would require that our CBM and CMMS databases store, in the case of
centrifugal pumps, for example, such “mundane” types of data as those described by the
McNally Institute101:
1. Bearing oil reservoir levels and bearing case temperatures.102

2. Incidences of leakage from the stuffing box, gaskets, bearing seals, cracks or
holes in the piping or pump casting.
3. Abnormal noise such as that sometimes heard when air is leaking into a
mechanical seal or pipefitting. (Vacuum leaks can be checked with smoke.)
4. Odors indicating high temperature.
5. Colors indicating a component has been subjected to abnormal heat.103
6. Blackened oil indicating that it has been subjected to high temperatures.
7. Excessive vibration detected either with the use of instruments, or by one of the
senses.
8. Malfunctioning environmental controls on stuffing boxes, discernable by
measuring the temperature difference between the inlet and outlet lines.
9. Positions of control and isolation valves throughout the system while the pump is
running steadily.
10. Flow, differential pressure, power consumption, temperatures (in the volute and
stuffing box), shaft speed, liquid levels (sight glasses)
11. In cartridge seals, estimates of face loads by measurement of the gap that held the
retention clips.
Tradespersons and operators make these types of observations routinely. Sometimes, they
take appropriate corrective action. Seldom, however, does the observation, or the failure
mode104 discovered as a result of the observation, appear methodically as a record in the
maintenance history database. Invaluable sources of reliability data such as these elude
most maintenance information record keeping processes. Rather, those historical records
contain, mainly, descriptions of maintenance activities performed, without reference to
the conditions that inspired those actions. The McNally Institute goes on to enumerate
the possible causes of the elevated temperatures in the stuffing box as:
101
http://www.mcnallyinstitute.com/CDweb/p-html/p027.htm
102
Lubricating oil has a useful life of thirty years at thirty degrees centigrade (86°F) and its life is cut in
half for every ten degree centigrade (18°F) increase in temperature. We may assume the temperature in the
bearing is at least ten degrees centigrade (18°F) higher than the oil sump temperature. At elevated
temperatures the oil will carbonize by first forming a "varnish like" film that will turn into a hard black
coke at these higher temperatures. It is these formed solids that will destroy the bearing.
103
For example, overheated stainless steel turns straw yellow, brown, blue and black at respective
temperatures of approximately 400, 500, 600, and 650 degrees Celsius.
104
The opposite side of the coin. The five knowledge elements (page 15) will neatly express these
observations in a work order record of the CMMS.
Page 92
• Loss of circulation in the stuffing box cooling jacket.
• Loss of cooling in the bearing case cooling sump.
• Something is cooling the outside of the bearing casing causing the outside
diameter of the bearing to shrink, increasing the load.
• The bearing was installed incorrectly.
• The bearing is over lubricated. The oil level is too high or there is too much
grease in the bearing.
• The lubricating oil is contaminated with water.
• The shaft is overloaded because the pump is operating off of the B.E.P. (best
efficiency point).
• There is too much axial thrust of the shaft.
• Misalignment, unbalance, etc.
Oil sampling will indicate the following conditions that are a prelude to (or an indication
of) serious failure.
• Water is getting into the oil.

• Oil additives are no longer present and functioning.
• The oil is carbonizing due to high temperature.
• Solids due to corrosion, bearing-cage destruction, or some other reason are
present.
By monitoring pump suction and discharge pressure in concert with product flow and
motor amperage, the following failure modes may be detected:
• Wrong size pump.

• Pump operating far from best efficiency point raising the likelihood of shaft
deflection.
• Motor close to an overload condition.
• Impeller needs adjustment or the wear rings need replacement.
• Poor operating practices.
• Source product tank at wrong level or suction lines are clogging.
• Getting close to cavitation.
Most failure modes occur randomly rather than by a wearing out of a component. For
example, were wear the dominant failure mode in bearings, they would, on the average,
survive 50 or even 100 years. But, industrial bearings undergo accelerated wear initiated
by randomly occurring internal or environmental events, for example a shock load,
excessive heat, or water ingress causing lubricant failure. Bearing life is, in addition,
highly influenced by initial conditions, for example, how it was stored and handled prior
to installation, and how it was installed.
Randomness being the rule rather than the exception, is it reasonable for us to assume
that we will usually find a monotonically rising trend of some monitored variable
throughout a component’s lifecycle, from which we may predict its failure? A more
reasonable approach to CBM would be to monitor the equipment and its operating
Page 93
context for signs of conditions causing abnormal stress that, if allowed to persist, will be
destructive. Doctors monitor cholesterol to determine whether our arteries are in danger of
clogging. At a certain level, they order a corrective action, usually a change in lifestyle.
Maintainers monitor oil levels to avoid the consequences of over- or under-lubrication.
Vibration analysts determine a condition of foundation weakness, shaft misalignment or
of rotor imbalance, that, if uncorrected, will lead to serious failure.
These examples illustrate that CBM is a viable maintenance strategy for avoiding failure
altogether. Yet CBM can also track and predict some failure modes from some point in
time after their random initiation to their ultimate functional failure. It has been
estimated105 that twenty percent of failure modes proceed in a predictable enough manner
following their detection (their potential failure), that a repair action may be planned and
executed prior to the loss of asset functionality. A spalled bearing, for example, emits
bearing tones that can be detected automatically through processing of the spectral data
assisted by cepstrum analysis. The bearing may continue to operate adequately from this
point for several months prior to a failure that would render it non-functional.
It seems, then, from the preceding, that there are two classes of CBM:
1. the detection of abnormal stresses106 on a system that, if uncorrected, will
provoke a failure that has not yet initiated, and
2. the detection of a failure that has already begun, but has not progressed to
the point where a required function has been lost.
In either situation, CBM is said to be effective, as long as the consequences of failure are
reduced (or avoided entirely) at an acceptable cost. In the case of the first CBM class,
and, pursuing our example of a centrifugal pump, we might notice a rising trend in the
temperature of the stuffing box. If it gets too hot, we are going to have a problem. We had
better correct the condition if we do not want to experience a premature (random) seal
failure. The McNally Institute describes the following seal failure modes that will be
provoked by excessive stuffing box temperatures:
• The product can change its state, insofar as ceasing to act as a lubricant, but
partially transforming into a destructive solid.
• The product can vaporize, expand and blow the seal faces open leaving solids
between the faces.
• The product can become viscous, interfering with the free movement of the
springs and bellows.
• The product can become an adherent, gluing the lapped faces together or making
the moveable components inoperable.
• The product can crystallize interfering with the moving parts of the seal.
105
Moubray, J, Reliability-centered Maintenance, 2nd Ed. Butterworth 1999.
106
We will learn in Chapter 10 (page 113) that these two classes of CBM are characterized by two types of
CM variables: 1) internal variables that reflect the state of the asset with respect to its deterioration due to
a failure mode, and 2) external variables that measure the level of stress that influences the probability that
a failure will occur. A CBM decision model may incorporate either or both types of variables.
Page 94
• Excessive heat can cause the product to build a film on the faces (hot oil as an
example) impeding sliding of the components and making them inoperable.
• Corrosion increases with increasing temperatures.
• Thermal expansion may cause seal faces to go out of flat, pressed-in carbon faces
to loosen in their holder, and the bellows’ vibration dampers to stick to the
shaft sleeve, opening the faces.
• Heat can damage the faces of the plated materials and filled carbon face types.
• Expansion of air pockets in some carbon faces can cause pits in the lapped faces.
• High heat levels can cause elastomers to experience compression set problems,
resulting in leakage or in some cases complete failure.
A change in stuffing box pressures can cause:
• The product to vaporize opening the lapped faces.

• O-rings and other elastomer designs to extrude and jam the sliding components.
• Lapped seal faces to distort and go out of flat.
• A stuffing box vacuum that can blow open unbalanced seals.
• A differential pressure across the elastomer that can cause ethylene oxide to
penetrate into the elastomer and destroy it as it expands in the lower pressure side.
When monitoring temperature and pressure in the stuffing box area we will note these
changes. Then, by applying our knowledge-based rules, we will have adequate time to
react before seal failure occurs. Knowledge-based rules form our CBM policy. Without a
CBM policy, regardless of the number of sensors scattered throughout our process, the
amount of data storage capacity, or the sophistication of the software “shell”, our CBM
program will ultimately prove ineffective.
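To make the idea of a knowledge-based rule concrete, the sketch below encodes three simple rules for the stuffing box readings discussed above. It is only an illustration: the limit values, the field names, and the rule wording are all hypothetical assumptions, and real limits would come from the seal manufacturer's data and from experience with the product being pumped.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    temperature_c: float  # stuffing box temperature at the inspection
    pressure_kpa: float   # stuffing box pressure at the inspection


# Hypothetical limits, chosen only for illustration.
TEMP_ALARM_C = 80.0                  # absolute temperature limit
TEMP_RISE_ALARM_C = 5.0              # allowed rise between two inspections
PRESSURE_BAND_KPA = (150.0, 400.0)   # normal operating band


def evaluate(previous: Reading, current: Reading) -> List[str]:
    """Apply the rules to the latest inspection and list any findings."""
    findings = []
    if current.temperature_c >= TEMP_ALARM_C:
        findings.append("temperature above alarm limit")
    if current.temperature_c - previous.temperature_c >= TEMP_RISE_ALARM_C:
        findings.append("temperature rising abnormally")
    low, high = PRESSURE_BAND_KPA
    if not low <= current.pressure_kpa <= high:
        findings.append("pressure outside normal band")
    return findings


if __name__ == "__main__":
    # A 7 degree rise and a low pressure trigger two of the three rules.
    print(evaluate(Reading(62.0, 300.0), Reading(69.0, 120.0)))
```

Each finding would normally be tied to a corrective action and recorded in the CMMS alongside the work order, so that the condition that inspired the action is not lost.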
Besides data acquisition, two additional sub-processes challenge our ingenuity prior to
implementing applicable and effective CBM.
Signal Processing
Signal processing in CBM filters out of the acquired data all information that
pertains to the operation of the asset and its environment. In other words, the processed
signal should not reflect changes in load or operational conditions, but should react only
to real changes in asset health, with respect to the deterioration by a failure mode that we
are targeting with the CBM task. A variety of signal processing techniques have been
(and continue to be) developed by industry and academic research organizations. We
sometimes refer to signal processing, particularly in vibration analysis, as feature
extraction. We process a raw time waveform signal (using an algorithm) in order to
extract one or more features (condition indicators) that measure the evolution of
particular conditions affecting or occurring in our physical asset. Figure 7-2, Figure 7-3,
Figure 7-4, and Figure 7-5 illustrate a small sample of the wide diversity of CBM signal
processing techniques addressing specific failure modes.
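As a minimal sketch of feature extraction, the function below reduces a raw time waveform to a handful of scalar condition indicators (RMS, peak, crest factor, and kurtosis). These are generic statistics often used in vibration work, not the algorithms behind the figures cited here, and on their own they do nothing to remove the effects of load or operating conditions. The function name and the choice of features are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def extract_features(waveform: Sequence[float]) -> Dict[str, float]:
    """Reduce a raw time waveform to a few scalar condition indicators.

    Assumes a non-empty, non-constant signal (otherwise a division by zero occurs).
    """
    n = len(waveform)
    mean = sum(waveform) / n
    centered = [x - mean for x in waveform]
    variance = sum(x * x for x in centered) / n
    fourth_moment = sum(x ** 4 for x in centered) / n
    rms = math.sqrt(sum(x * x for x in waveform) / n)
    peak = max(abs(x) for x in waveform)
    return {
        "rms": rms,                                # overall energy of the signal
        "peak": peak,                              # largest excursion
        "crest_factor": peak / rms,                # sensitive to impacts
        "kurtosis": fourth_moment / variance ** 2  # "spikiness" of the signal
    }


if __name__ == "__main__":
    # One full period of a pure sine wave: smooth, with no impacts.
    sine = [math.sin(2 * math.pi * k / 1000) for k in range(1000)]
    print(extract_features(sine))
```

For a pure sine wave the crest factor is close to 1.41 and the kurtosis close to 1.5. Impulsive events, such as those produced by a damaged rolling element, tend to raise both values, which is why such features are tracked over time.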
Page 95
Figure 7-2: Stress Wave Analysis, www.swantech.com
Stress wave analysis, illustrated in Figure 7-2, tracks the failure modes associated with
roller and groove damage.
Figure 7-3: Shaft condition monitoring, www.gaussbusters.com/whatisscm.html

The shaft condition monitoring CBM techniques of Figure 7-3 are said to address a
number of failure modes: shaft rubs at bearings and seals due to oil whip, coupling
misalignment, growth due to thermal effects, lubrication loss, or oil contamination; blade
erosion due to wet steam causing charge separation and cavitation; charge separation and
spark discharge due to dry steam at the inlet to a turbine with partial admission; loss of
shaft grounding; intermittent ground fault due to a torn copper leaf; shorted insulation at
bearings, seals, and couplings; stator core lamination shorts; diode failure in generator
excitation; and excessive transients in pulse width modulated rotor and/or stator electrical
supply. Signal processing is required to discriminate among these failure causes.
Page 96
Figure 7-4: Petri-nets for monitoring a manufacturing process.
Figure 7-4 illustrates the graphical language of Petri-nets used to simulate manufacturing
processes in an integrated circuit chip. Deviations from expected timing of activities may
be tracked and related to specific modes of failure.
Figure 7-5: Chaos mathematics applied to CBM signal processing

Many seemingly random signals, when represented in state space using a branch of
mathematics known as chaos theory, display patterns, deviations from which may be
tracked and related to specific modes of failure.
Page 97
Figure 7-6: Continuous oil analysis and treatment system (www.thermal-lube.com)
Figure 7-6 illustrates that an effective CBM system may act as one half of an automatic
control loop. Although most CBM programs operate in a manual control loop by
directing a maintenance renewal task, the continuous oil analysis and treatment (COAT)
system uses CBM condition data in an automatic control system. First it extracts features
from a lubrication or cooling fluid’s infrared signature. The arrow on the left of Figure
7-6 represents the signal processing algorithm that extracts the current additive level from
the infrared spectrum. The additive level then can be tracked and trended in time. Other
extracted features (i.e. condition indicators such as oxidation, additive content, and
contamination) can be used similarly. In this case Figure 7-6 portrays the automated
replenishment of depleted oil additives.
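Tracking and trending an extracted feature can be sketched very simply. The example below fits a least-squares straight line to a series of additive-level readings and estimates when the line will reach a replenishment limit. The linear depletion assumption, the function names, and the sample numbers are all hypothetical; they are not taken from the COAT system.

```python
from typing import Optional, Sequence, Tuple


def linear_trend(times: Sequence[float], values: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean


def time_to_limit(times: Sequence[float], values: Sequence[float],
                  limit: float) -> Optional[float]:
    """Time at which the fitted line reaches `limit`, or None if not depleting."""
    slope, intercept = linear_trend(times, values)
    if slope >= 0:
        return None  # the additive level is not falling
    return (limit - intercept) / slope


if __name__ == "__main__":
    weeks = [0, 1, 2, 3]
    additive_pct = [100, 90, 80, 70]   # hypothetical readings
    print(time_to_limit(weeks, additive_pct, limit=40))
```

A depletion rate that is markedly steeper than normal is the more interesting signal, since it may point to a mechanical failure mode or process fault rather than to ordinary consumption.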
Figure 7-6 raises a question about the CBM role of oil analysis. Oil analysis laboratory
services constitute a significant part of the CBM technology industry. Lubricant suppliers
often subsidize those services as an important component of their marketing plan.
Publicity for these programs focuses on extending lubricant change intervals. Although
laudable for a supplier to attempt to reduce clients’ product consumption, this point of
view may be misleading. The emphasis on reducing oil consumption diverts attention
from the essential purpose of a CBM task – to reduce or entirely avoid the consequences
of equipment failure. The cost of lubricant usually pales in comparison with the cost of a
major asset failure. Furthermore, cost alone cannot measure the hidden, environmental,
and safety related consequences of failure. When lubricant additive drops at an abnormal
rate, the user more rightly concerns himself with the mechanical failure mode or process
fault whose effects include abnormal additive depletion. Some lubricant vendors and
their sales persons have not yet embraced this viewpoint. They continue to stress reduced
Page 98
lubricant consumption in promoting their CBM service offerings. In the introduction to
Part 1 (page 13) we offered some explanation for this point of view.
There are as many signal processing techniques as there are different physical
applications. Chapter 11, CBM Decision Making with Expert Systems (page 152),
describes several practical techniques for extracting vibration features to be processed by
a rule based expert system. Chapter 13, A survey of signal processing and
decision technologies for CBM (page 177), presents a broad review of the technical
literature.
Figure 7-7: Wavelet-comblet processing of gearbox vibration signal107

Figure 7-7 describes a CBM signal processing algorithm that targets the failure mode
“gear tooth fails due to fatigue crack”108. The photograph at the top left of Figure 7-7
illustrates the development of a crack in tooth number 10 of the driven gear in a single-
stage helical gear reducer. The time waveform signal covering one revolution of the
driven gear appears in the top right. Note the amplitude and frequency modulation
occurring at 17 milliseconds into the revolution. This usually indicates gear tooth
damage; however, some sort of processing is required if this information is to be used in a
practical CBM program for determining the timing of a pro-active maintenance task. In
this algorithm, a family of wavelets is constructed to decompose the gear motion error
signal and to extract the residual error signal for gear fault detection. The bottom left
107
Masters Thesis, A new wavelet basis for the decomposition of gear motion error signals and its
application to gearbox diagnostics, Antonio John Miller, 1999
108
Failure modes should consist of a noun and a verb (could be passive form) usually
followed by a clause, such as “due to …” describing the appropriate causality level for
the failure in question. The techniques of failure mode determination are explained fully
in Part 3, “Reliability Centered Maintenance”, on page 201.
Page 99
quadrant of Figure 7-7 displays the signal for a single gear revolution and shows that
tooth number 10 has a motion pattern exhibiting high deviation from ideal motion and
differing from that of the other teeth. Finally the signal processing algorithm plots a
single indicator, called the “fault growth parameter” that is tracked over macro time (e.g.
weeks, months, years).
Although the algorithm accomplishes the objective of signal processing, that is, a
monotonically increasing condition indicator revealing failure development, one
crucial question remains for the completion of the CBM process. The three lines and
the question “Where” on the fault growth graph of Figure 7-7 illustrate the question:
“When shall we intervene and perform a gearbox overhaul or change-out? At the first rise
in value? At the second? Or at the 280 time unit point, when a third leveling off occurs at
an FGP (fault growth parameter) value of 18?” The answer to this last question is at the
heart of the third CBM sub-process: decision making.
Decision Making
Decision making represents the final, and often overlooked, CBM sub-process. After
collecting, processing, and storing the current set of condition data, the maintenance
planner, manager, or engineer decides whether an intervention at this point in time is
“optimal”. Figure 3-1 (page 33) illustrates the complexity of factors that will affect his
decision. He desires to make that decision, as far as is possible, in a methodical
manner that will bear scrutiny with respect to the objective of the organization and the
current operation of the asset, a tall order. For CBM to render effective service, we
must apply the same degree of rigor to this decision making step as we have to the data
acquisition and signal processing steps. The CBM Laboratory at the University of Toronto
has created EXAKT, a CBM decision software tool.
Page 100
Figure 7-8: EXAKT decision tool
Figure 7-8 describes how the EXAKT software decision tool may be used. The top left of
Figure 7-8 shows a graphical representation of the software’s output. The vertical axis
measures the weighted sum of risk factors found significant by the software’s
proportional hazards model. The horizontal axis indicates the item’s working age. A
point on the graph represents current asset condition with respect to one or more failure
modes included in the model. If it falls in the green (bottom left) region, the optimal
decision model recommends no action; if in the yellow (light strip), preventive action
should take place prior to the next CBM inspection; if in the red (dark region in top
right), take immediate preventive action.
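The shape of such a decision chart can be sketched with a Weibull proportional hazards model, in which the hazard at working age t with covariates z is h(t, z) = (β/η)(t/η)^(β−1) exp(Σ γ_i z_i). The code below is a simplified illustration only, not EXAKT's algorithm: the parameter values and the two hazard limits that separate the green, yellow, and red regions are hypothetical, whereas an optimizing tool would derive its boundaries from cost or availability data.

```python
import math
from typing import Sequence


def weibull_phm_hazard(age: float, covariates: Sequence[float],
                       beta: float, eta: float, gammas: Sequence[float]) -> float:
    """Hazard of a Weibull PHM at a given working age and covariate vector."""
    composite = sum(g * z for g, z in zip(gammas, covariates))  # weighted sum (vertical axis)
    return (beta / eta) * (age / eta) ** (beta - 1) * math.exp(composite)


def boundary(age: float, hazard_limit: float, beta: float, eta: float) -> float:
    """Weighted covariate sum at which the hazard equals `hazard_limit` at this age."""
    return math.log(hazard_limit * eta / beta) - (beta - 1) * math.log(age / eta)


def classify(age: float, covariates: Sequence[float], beta: float, eta: float,
             gammas: Sequence[float], warn_limit: float, act_limit: float) -> str:
    """Map the current point to a region using two hypothetical hazard limits."""
    hazard = weibull_phm_hazard(age, covariates, beta, eta, gammas)
    if hazard >= act_limit:
        return "red"
    if hazard >= warn_limit:
        return "yellow"
    return "green"


if __name__ == "__main__":
    params = dict(beta=2.0, eta=1000.0, gammas=[0.1])  # hypothetical model
    for age in (100.0, 500.0, 1000.0):
        print(age, classify(age, [10.0], warn_limit=0.002, act_limit=0.005, **params))
```

With a shape parameter β greater than 1, the boundary returned by `boundary` falls as working age increases, so the same covariate reading that is acceptable on a young item can call for action on an old one.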
Note two important characteristics of the decision graphical output: 1) the condition
indicator exhibits considerable random fluctuation, and 2) the boundaries between safe
and critical operation vary with working age. First, signal processing has not produced a
monotonically increasing condition indicator. This is a common situation encountered in
CBM: signal processing has not fully accounted for, and therefore has not filtered out,
random operational or environmental factors. Secondly, the varying limit boundary tells us
that EXAKT has determined that, in addition to the sum of the weighted monitored condition
indicators, the item’s working age also strongly influences its risk of failure.
Suppose that a preventive replacement costs $100, one third of the average cost of
repairing the failed item. Then the decision model can be optimized, by using
this ratio, to adjust the boundaries so that, in the long run, they guide the
interpretation of the condition monitoring data towards the lowest total cost of maintenance.
That is, the model will interpret day-to-day CBM inspection data neither too
conservatively nor too liberally, but will recommend an optimal interpretation which
balances cost and failure probability. Similarly, if maximum availability is the optimizing
objective, then the decision model will use the ratio of the mean-time-to-return-to-service
(MTTR) for the preventive and failure situations, in order to deliver routine decisions that
will support this objective.
Sometimes, maintenance organizations are required to attain certain key performance
indicators relative to a benchmark, for example a target ratio of 90%:10% for planned
versus breakdown maintenance. Again, we instruct the EXAKT model to achieve the
desired objective. An organization may specify a desired mission reliability (for
example, a survival probability of at least 99.99%) in a specified time interval. Once more
the interpretation policy of the CBM program will be adjusted for the required reliability
(survival probability in the interval).
The table at the bottom of Figure 7-8 compares the cost of a proposed optimal (EXAKT)
CBM data interpretation policy with that of an existing policy and with that of a
run-to-failure policy. Note that in this example the optimal policy results in a
mean-time-between-replacements of 1781, which is about 45% of that of the current policy
(MTBR = 3944), or roughly 55% shorter. That is, we are intervening more often in order to
reduce total (proactive and reactive maintenance) costs to 51.53% of their original level.
Preventive actions in the
proposed policy would account for 96.6% of incidences compared to only 20% under the
Page 101
current policy. CBM decision optimization provides working decision models that may
be used to automate the interpretation of CBM condition monitoring data, in the
achievement of a specified maintenance objective.
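The trade-off between intervening more often and paying less overall can be sketched as a long-run cost rate: the expected cost of one replacement cycle divided by the mean time between replacements. The example below takes the MTBR values and preventive fractions quoted above for the two policies and combines them with the hypothetical $100 preventive cost and threefold failure cost from the earlier illustration. That combination is an assumption made only to show the calculation. The cost data behind Figure 7-8 are not given in the text, so the sketch should not be expected to reproduce the 51.53% figure.

```python
def cost_rate(preventive_fraction: float, mtbr: float,
              preventive_cost: float, failure_cost: float) -> float:
    """Long-run cost per unit time: expected cost per cycle / mean cycle length."""
    failure_fraction = 1.0 - preventive_fraction
    expected_cycle_cost = (preventive_fraction * preventive_cost
                           + failure_fraction * failure_cost)
    return expected_cycle_cost / mtbr


if __name__ == "__main__":
    PREVENTIVE_COST = 100.0   # hypothetical, from the $100 illustration
    FAILURE_COST = 300.0      # hypothetical, three times the preventive cost

    current = cost_rate(0.20, 3944.0, PREVENTIVE_COST, FAILURE_COST)
    proposed = cost_rate(0.966, 1781.0, PREVENTIVE_COST, FAILURE_COST)
    print(f"current policy : {current:.5f} per unit time")
    print(f"proposed policy: {proposed:.5f} per unit time")
    print(f"ratio          : {proposed / current:.3f}")
```

The saving grows quickly as the cost of a failure rises relative to the cost of a preventive replacement, which is why a credible estimate of that ratio matters more than the precision of either cost taken alone.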
Page 102
Chapter 8. CBM Fundamentals
The fundamental premise of CBM
CBM Program Criteria

A proposed CBM program must satisfy these three criteria:
1. There is a clear warning that the equipment has entered a "failing" state, and
2. The warning time is long enough for someone to take action to mitigate the
consequences of failure, and
3. The average cost to perform CBM on an asset is less than the average cost of the
consequences of failure over the long run.
So obvious are these CBM program essentials, that we often gloss over them in our rush
to implement high technology solutions. In a sampling of 100 maintenance
organizations109 over 3 years, all had full-blown CBM programs in place that did not
satisfy one or more of the above criteria. Why is this the case? The following sections on
the nature of the maintenance technology industry shed some light on this question.
CBM Monitoring Frequency
[Figure 8-1 content: conditional probability of failure and functional performance versus
time for two bearings. Bearing A: very critical, MTBF = 3.5 years, conditional probability
of failure 1/3.5, noise starts 2 weeks before failure (P-F = 2 weeks), inspection interval
1 week. Bearing B: not so critical, MTBF = 7 years, conditional probability of failure 1/7,
noise starts 2 days before failure (P-F = 2 days), inspection interval 1 day.]
Assertions:
1. The lower the Mean Time Between Failure (MTBF), the more frequently you monitor?
2. The more critical, the more frequently you monitor?
Figure 8-1: Two bearings’ criticality and reliability110
109
Survey by the author of participants of the Physical Asset Management Certification course given twice
yearly by the University of Toronto’s Professional Development Center.
110
John Moubray, Aladon RCM practitioner’s course.
Page 103
Figure 8-1 illustrates how maintenance technology vendors, pre-occupied with explaining
the features of their products, often fail to address more fundamental issues. Users of
CBM frequently wonder how often they should monitor a particular piece of equipment. CBM
technology providers, typically, offer two answers:
1) It depends on the equipment’s reliability, or
2) It depends on the criticality.
Figure 8-1 examines each of these assertions by considering two bearings labeled, “A”
and “B”. Most rolling element bearings fail randomly111. Hence their conditional
probabilities of failure are shown to be straight lines (failure pattern F). We will assume
that Bearing A (MTTF of 3.5 years) is half as reliable as Bearing B whose MTTF is 7
years. It follows that the conditional probabilities of failure of these items are
approximately 1/3.5 and 1/7 per year respectively112. This is indicated by the relative heights of
the two lines representing the two bearings’ risks of failure. Suppose we are told by an
experienced employee that Bearing A begins emitting a rumbling sound and then
invariably fails between two weeks and two months later. And, another operator, in
describing his experience with Bearing B in a high rotational speed application tells us
that it issues a distinct whining noise and invariably fails between 2 days and 2 weeks
later. In the case of Bearing A we would reasonably suggest sampling at an interval of 1
week, while, for Bearing B a reasonable sampling interval would be 1 day. Comparing
these conclusions with the first assertion: “The lower the Mean Time To Failure (MTTF),
the more frequently you monitor?” we must reject it because we have just deduced that it
is not necessarily true. That is we have demonstrated a situation where it is appropriate to
monitor a more reliable item (Bearing B) 7 times more frequently than a less reliable item
(Bearing B)
Now we turn to the second assertion, “The more critical, the more frequently you
monitor”. Let us suppose that we are told that Bearing A is very critical while Bearing B
has a backup system and therefore is far less critical. The sampling intervals deduced above
do not change: the less critical Bearing B must still be inspected daily, and the very
critical Bearing A only weekly. Once again, in this particular case,
the assertion has been shown to be false. We conclude, therefore, that neither criticality
nor reliability can be used to determine CBM inspection frequency. Rather, we must
focus our attention on confidently detecting a potential failure and reliably estimating the
PF interval, as discussed next.
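The two-bearing comparison can be reproduced numerically. The following is only a minimal sketch: it assumes the rule of thumb implied above (inspect at half the shortest reported warning period), and the class name, field names, and criticality scale are hypothetical, introduced purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bearing:
    name: str
    mtbf_years: float     # mean time between failures
    min_pf_days: float    # shortest reported warning period (P-F interval)
    criticality: int      # hypothetical scale: higher means more critical

    @property
    def failure_rate_per_year(self) -> float:
        # For random failures the rate is roughly constant at 1/MTBF.
        return 1.0 / self.mtbf_years

    @property
    def inspection_interval_days(self) -> float:
        # Assumed rule of thumb: half the shortest P-F interval.
        return self.min_pf_days / 2.0


brg_a = Bearing("A", mtbf_years=3.5, min_pf_days=14, criticality=2)
brg_b = Bearing("B", mtbf_years=7.0, min_pf_days=2, criticality=1)

for b in (brg_a, brg_b):
    print(f"Bearing {b.name}: failure rate {b.failure_rate_per_year:.3f}/yr, "
          f"criticality {b.criticality}, "
          f"inspect every {b.inspection_interval_days:g} day(s)")
```

Neither `mtbf_years` nor `criticality` enters the interval calculation; only the warning period does, which is the point of the example.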
111
For a discussion of random failure, see Random Failure on page 38.
112
For an explanation of this derivation see the chapter “Reliability Centered Maintenance”.
Page 104
Estimating the PF Interval
[Figure 8-2 (flowchart):
1. Is CBM for the failure mode in question applicable? (Is there a clearly identifiable
condition indicator? Is the warning time adequate?) If NO: CBM not applicable or not
effective -> descend to next task type in the RCM algorithm. If YES: continue.
2. Is CBM for the failure mode in question effective? (Is there an economical CBM task and
interval that will avoid or reduce, to a tolerable level, the consequences of the failure?)
If NO: CBM not applicable or not effective -> descend to next task type in the RCM
algorithm. If YES: continue.
3. Is the warning period of the order of days, weeks, or months?
4. How many days (weeks, months)? Answer: X days (weeks, months).
5. Initial inspection interval = X/2 days (weeks, months).]
Figure 8-2: Estimating the P-F Interval

Figure 8-2 describes a facilitated process for arriving at a consensus estimate of the PF
interval. The RCM team considers first the applicability, then the effectiveness, of a
particular CBM task. The session leader (facilitator) asks the questions in Figure 8-2 in
order to arrive at a reasonable starting point upon which to base the initial (conservative)
inspection interval. Later, as a result of age exploration, the inspection task interval may be
widened as experience of the rate of failure progression is gained. A systematic
methodology for converting experience to an optimal decision policy is provided next in
Chapter 9 and Chapter 10.
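The questions of Figure 8-2 can be written down as a small decision function. This is a sketch only; the function name and argument names are hypothetical, and the two yes/no answers are assumed to come from the facilitated RCM session rather than from any calculation.

```python
from typing import Optional


def initial_inspection_interval(applicable: bool,
                                effective: bool,
                                warning_period: Optional[float]) -> Optional[float]:
    """Return the initial inspection interval, or None if CBM is rejected.

    warning_period is the team's consensus estimate X of the P-F interval,
    expressed in whatever unit (days, weeks, months) the team agreed on.
    The result is in the same unit.
    """
    if not (applicable and effective):
        # CBM not applicable or not effective:
        # descend to the next task type in the RCM algorithm.
        return None
    if warning_period is None or warning_period <= 0:
        raise ValueError("a positive warning period estimate is required")
    # Conservative starting point: half the estimated warning period.
    return warning_period / 2.0


# Example: the team agrees on a warning period of about 2 weeks.
print(initial_inspection_interval(True, True, 2.0))   # interval in weeks
print(initial_inspection_interval(True, False, 2.0))  # CBM rejected
```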
Page 105
Chapter 9. The Elusive P-F Curve
J. Moubray coined the phrase "P-F interval". He used it to highlight two pre-requisites of
CBM, namely:
- A clear indicator of decreased failure resistance - the potential failure, and
- A reasonably consistent warning period prior to functional failure - the P-F interval
Both these requirements are captured in the well known empirical graph of failure
resistance versus working age (Figure 9-1).
Figure 9-1
The P-F interval is a deceptively simple idea. Deceptive, because it takes for granted that
we have previously defined "P" (the potential failure). Of the two concepts, “P” and “P-
F”, it is the former, however, that poses the greater challenge. Therefore, before
addressing the P-F interval, we need to determine when and how to declare a potential
failure.
Figure 9-1 implies that if we could monitor a condition indicator that tracks the resistance
to failure, then declaring the potential failure level would be an easy matter. Two
stumbling blocks, unfortunately, arise and obstruct our plan. The obstacles to the
implementation of Figure 9-1 are:
1. A single condition indicator that faithfully tracks the resistance-to-failure curve is
rare, and
2. The resistance-to-failure curve itself is rarely available.
Condition monitoring data, on the other hand, is abundant. How may we overcome
obstacles 1 and 2? That is, how may we apply CBM to the numerous physical assets
where condition monitoring data abounds, yet, where few alert limits have been defined?
This (setting of the declaration level of the potential failure) is the problem encountered
by many asset managers deluged with condition monitoring data. The unavoidable
Page 106
question facing any implementer of a CBM program is where to set the potential
failure. Which indicator, from among many monitored variables, should he select for this
purpose? At what level? When the physics of the situation are not well known (as is often
the case), a “policy” for declaring a potential failure is far from obvious.
Why does Figure 9-1 stubbornly elude our grasp? The reason is that this graph is often
not 2-dimensional, but multi-dimensional. There is one dimension for each significant
risk factor. The curve of Figure 9-1, therefore, loses its simple geometrical form.
This is where software comes to the rescue.
EXAKT summarizes the risk factors associated with working age and monitored
variables and creates a new kind of graph by transforming the significant risk information
onto a 2-dimensional optimal decision graph. Dr. Dragan Banjevic, CBM Lab director,
captured the multi-dimensionality of Figure 9-1 in two ways. First, he combined the
significant monitored variables (other than age) into a risk-weighted sum. That became
the y-axis. Then he transformed the age-related risk factor into the shape of the limit
boundary. One 2-dimensional graph, Figure 9-2, shows all aspects influencing risk. They
include economic factors as well as the failure probability associated with each significant
variable.
Figure 9-2
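The idea of collapsing several monitored variables into one risk-weighted sum (the y-axis of Figure 9-2) can be sketched as follows. The variable names and weights are invented for illustration; in practice the weights would be estimated from historical data by the modeling software, and this sketch does not claim to reproduce EXAKT's internal calculation.

```python
def composite_covariate(readings: dict, weights: dict) -> float:
    """Risk-weighted sum of the significant monitored variables.

    Only variables that carry a weight contribute; readings for
    variables judged not significant are ignored.
    """
    return sum(weights[name] * readings[name] for name in weights)


# Hypothetical weights for two significant variables.
weights = {"iron_ppm": 0.05, "vibration_mm_s": 0.30}

# One inspection; oil temperature is recorded but carries no weight.
readings = {"iron_ppm": 40.0, "vibration_mm_s": 5.0, "oil_temp_c": 85.0}

y = composite_covariate(readings, weights)
print(f"composite value plotted on the y-axis: {y:.2f}")
```

A single number per inspection is what allows the decision to be drawn on a 2-dimensional graph against working age.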
EXAKT handles the probabilistic nature of P and the P-F interval properly. EXAKT does
not assume a deterministic113 P or P-F interval. Instead it draws (from historical records)
a probabilistic relationship among all significant factors (including working age). It uses
that relationship to estimate the remaining useful life at any given moment. One of the
benefits of this approach is the ability to deal with noisy data, illustrated in Figure 9-3.
On the left side of Figure 9-3 are 3 examples of ideal data. Note how the monitored
113
That is, it recognizes that a potential failure and the ensuing functional failure tend to occur randomly
according to some probability distribution.
Page 107
values increase monotonically, with the red alarm set conveniently to the potential failure
declaration level. Unfortunately condition monitoring data seldom looks like this.
On the right side of Figure 9-3 is data from the nasty real world. It contains random
fluctuations and trends that contradict one another. In other words, the usual situation!
EXAKT alleviates randomness (see Exercise 4 page 324) and conflicting trend data (see
Exercise 2 page 131).
Figure 9-3
Summarizing, EXAKT overcomes both obstacles to the application of Figure 9-1:
- It uncovers the weighted combination of monitored variables that most truly
reflects degraded failure resistance, and
- It provides a virtual failure resistance curve that accounts for multiple risk factors.
- It sets the “P” (potential failure alert limit) dynamically so as to optimize risk.114
- It provides a residual life estimate and optimal recommendation, based on risk
and cost.
Are failures required – multiple levels of intrusiveness?

By definition, a potential failure has no dire consequences. Often a less intrusive form of
CBM is used to decide when a more intrusive inspection is required. For example, oil
analysis results often indicate that a problem is occurring in a complex system, such as an
engine, but do not specify which component is failing, nor which failure mode is
occurring. In those cases a physical inspection requiring additional forms of testing is
114
for a required objective (such as low overall cost or high availability).
Page 108
desirable115. Should the physical inspection (a more intrusive form of CBM) uncover a
potential failure, then a model relating the less intrusive measurements to the findings of
the more intrusive inspections is desirable. Still, a functional failure will not have yet
occurred. With ever increasing amounts of data being captured from the control platform,
two (or more) levels of intrusiveness of CBM are often desirable. Hence we may build
decision models that predict potential failures thereby avoiding functional failures
altogether.
Here are two typical situations that the CBM Lab has encountered.
Case 1. A single asset, say a pump, has been operating for 30 years without failure. We
will probably have, for this pump, a large database of condition data (for example
vibration, flow rates, motor current, etc.) taken at regular intervals, but no failure
data. Alternatively, we may have a brand new pump of a new design on which we
possess no experience at all.
Case 2. A maintenance department is responsible for maintaining a piece of equipment and/or
fleets of similar equipment. Over the years, it has accumulated large databases (or files
of paper reports) containing condition monitoring data. During the same period
the maintenance team will have operated a CMMS, and will have recorded
(more or less accurately) the failures that occurred and the maintenance that has
been performed on these assets.
Both these cases meet the criteria for CBM described in Chapter 8. CBM Fundamentals
(page 103).
Discussion of Case 2
In most plants, functional failures and numerous potential failures do occur. The
following would be a typical scenario for the development of one or more CBM
optimization models:
1. We have a machine (or sometimes a fleet of machines).
2. Over time we record various measurements on a periodic (daily, weekly, monthly, etc.)
basis. For example: load, vibration, amperage, phase, or whatever else may be
appropriate. Those readings would also include working age measured in some
service usage unit that describes the accumulating stress on the machine, say fuel
consumed or widgets produced. In EXAKT we call each set of measurements,
taken at more or less regular intervals, an Inspection.
3. Occasionally we see some anomaly in the data, and we feel that we should do a deeper
(more intrusive) "Inspection". Or we may be doing a time based maintenance
task. In either case we physically inspect one or more components in the
machine. We find that one of the components is in a failing state. We have, thus,
115
For example compression tests, pressure and ignition traces, or even partial disassembly for more
intrusive visual inspection.
Page 109
discovered a potential failure.116 We record this observation in the CMMS as an
event which we might name "EFP1" (ending with potential failure type 1 –
which may be a potential failure of component X or of failure mode Y, for
example).
4. We repeat steps 1 to 3 over time. That is how we normally accumulate a "sample" of
condition and event data. (By the way, we are making use of an important
function of our CMMS by populating it with this type of data. After all, we paid
good money for the CMMS. Why not use its historical data recording capabilities
to their fullest?117)
5. Sometimes (as will happen) we will have missed detecting a potential failure soon
enough, and we will experience a real (functional) failure. This, as well, becomes
part of our historical database (i.e. our sample).
6. Over time we will have experienced several failure modes at the potential failure
stage, and perhaps one or two actual functional failures. (Now, at last, we have a
good sample). We analyze this sample in EXAKT and we build a model that can
be used for automated prediction (residual life estimation) and optimal CBM
decision making.
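The "sample" accumulated in steps 1 to 6 can be pictured as a list of life-cycles, each holding its inspections and the event that ended it. The sketch below is illustrative only: the class and field names are hypothetical and do not describe the EXAKT or CMMS database schema. The event codes "EFP1" (ending with potential failure) and "EF" (ending with functional failure) follow the naming used in this book.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Inspection:
    working_age: float            # e.g. hours, fuel consumed, widgets produced
    readings: Dict[str, float]    # monitored variables at this inspection


@dataclass
class LifeCycle:
    unit_id: str
    inspections: List[Inspection] = field(default_factory=list)
    ending_event: str = ""        # "EF", "EFP1", ... or "" if still in service
    ending_age: float = 0.0


def summarize(sample: List[LifeCycle]) -> Dict[str, int]:
    """Count how each life-cycle in the sample ended."""
    counts = {"potential": 0, "functional": 0, "in_service": 0}
    for lc in sample:
        if lc.ending_event.startswith("EFP"):
            counts["potential"] += 1
        elif lc.ending_event == "EF":
            counts["functional"] += 1
        else:
            counts["in_service"] += 1
    return counts


sample = [
    LifeCycle("P1", [Inspection(4000.0, {"vibration": 6.1})], "EFP1", 4200.0),
    LifeCycle("P1", [Inspection(3500.0, {"vibration": 7.4})], "EF", 3900.0),
    LifeCycle("P2", [Inspection(1200.0, {"vibration": 2.0})]),
]
print(summarize(sample))
```

A sample dominated by potential-failure endings, with few or no functional failures, is the situation described in step 6.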
The important point to note in this hypothetical sequence, is that model building
using EXAKT does not require us to have endured catastrophic or expensive functional
failures. EXAKT was designed to extend current CBM decision making capability. The
results of whatever current methods are being used to record condition data and event
data may be analyzed by EXAKT in order to build an optimal CBM data interpretation
model. That model can then be used as a policy (i.e. an alarm limit) for the future
detection of a specific failure mode while it is in its “potential failure” stage.
Of course in the real world, maintainers have not recorded failures, potential failures,
and other events as carefully as they perhaps would have, had they known about
EXAKT's data analysis capabilities. Not to worry. EXAKT contains many data checking
and validation procedures that help us "clean" our (less than meticulous) data. Usually,
we are able to analyze that data and provide the maintenance department with a good
predictive model. Or, at the very least, with some fresh new ideas on how to improve the
effectiveness of their current CBM program. Tutorials 2, 3, and 4 on the OMDEC
website118 demonstrate some of our data cleansing techniques.
Though building a database can take a long time, whatever we do, the clock will tick and
years will elapse. Either, during that time, we use standard procedures to record what
happened, or we populate our CMMS history database haphazardly. Opting for the
former adds negligible cost but confers, in the short term, expanded awareness and better
116
Nowlan and Heap described the method of “opportunity sampling”. When a unit becomes available to
maintenance staff for whatever reason, the opportunity is taken to inspect for all potential failures that may
be present.
117
Interviewer’s note: The sub-menu item “Data Strategy” under the menu item “Reliability” on the
OMDEC website describes how to use your CMMS in this way.
118
Under menu item “CBM Optimization”
Page 110
communication among our maintainers, operators, supervisors, and engineers. In the
longer term, good historical information offers deeper understanding through analysis.
Discussion of Case 1
EXAKT offers two solutions for the “no data” situation, depending on which of these two
situations applies:
1. If we have some expert knowledge about the failure of the pump from the
maintenance personnel or from the OEM, or we have some failure data from a
similar pump (e.g., an earlier design of pump that we have used in the past), the
Bayesian approach would be the most appropriate solution. EXAKT’s upcoming
version implements Bayesian modeling. That is, it incorporates expert judgment
of the relative risks associated with various condition indicators to build a prior
model. EXAKT, subsequently and continuously, updates the model as actual
failure or potential failure data accrues.
2. In a second situation, let’s assume that we know nothing about the failure of the
pump. The Bayesian approach can still be applied by assuming a non-informative
prior distribution for the CBM model parameters. As in the first
situation, EXAKT continuously updates the model (as operational, failure,
and condition monitoring data accumulate). Of course, the prior model,
based on a non-informative prior distribution, will initially have no predictive
value. Until the model evolves, the best we can do is to apply statistical process
control methods or judgement limits to certain “features” of vibration, oil
analysis, or other CBM data. In other words, we proceed in the usual, or traditional, way that
CBM is done.
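The mechanics of "start from a prior, then update as data accrues" can be illustrated with the simplest textbook case: a constant failure rate with a gamma prior. This is an assumption chosen for brevity, not the model that EXAKT implements, and the function names and numbers are hypothetical.

```python
def update_gamma_prior(shape: float, rate: float,
                       failures: int, exposure_years: float):
    """Conjugate update for a constant failure rate (failures per year).

    Prior: rate ~ Gamma(shape, rate). After observing `failures` events
    in `exposure_years` of operation, the posterior is
    Gamma(shape + failures, rate + exposure_years).
    """
    return shape + failures, rate + exposure_years


def posterior_mean(shape: float, rate: float) -> float:
    return shape / rate


# Situation 1: expert judgment worth about "1 failure in 5 years" of evidence.
a, b = 1.0, 5.0
print("prior mean rate:", posterior_mean(a, b))

# Five more years of operation reveal 2 failures (or potential failures).
a, b = update_gamma_prior(a, b, failures=2, exposure_years=5.0)
print("posterior mean rate:", posterior_mean(a, b))

# Situation 2: a nearly non-informative prior carries almost no weight,
# so the estimate is driven by the data alone.
a0, b0 = update_gamma_prior(1e-6, 1e-6, failures=2, exposure_years=5.0)
print("posterior mean rate, vague prior:", posterior_mean(a0, b0))
```

With the informative prior the estimate moves gradually from the expert's view toward the observed record; with the vague prior it has no useful value until data arrives, which mirrors the second situation above.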
One might infer from the foregoing that we must simply revert to our existing CBM
procedures until data becomes available. This is not quite the case. The EXAKT
approach provides two distinct advantages over previous CBM methods:
The first is that EXAKT measures, monitors, and reports on the effectiveness of the
evolving predictive model. This provides maintenance managers with a clear picture of
whether and how their CBM programs are improving.
Secondly, and even more importantly, the EXAKT methodology imposes a novel
business discipline on the maintenance data acquisition process itself. Technicians,
reliability engineers, and managers alike, quickly experience the benefits of having
understood and duly recorded the five RCM knowledge elements119 prior to closing each
work order.
119
The first five RCM knowledge elements (i.e. questions) are: “What function was lost or compromised?”,
“In what way (e.g. full, partial, functional or potential failure) was it lost?”, “Why?”, “What happened?”,
and “How did it matter?”
Page 111
One may ask a question of any method that purports to use data from the past to
predict the future. Conditions in the past could have been entirely different from
conditions in the future. How can one claim that the model developed from past data has
any validity at all? If operating conditions, rates, materials, and environmental factors all
change from their values in the past, how good will be the results of the model applied in
the future? A gut response to that question might be, “No good at all!”. But if we stop to
consider the nature of a model, we discover that it’s not as black as that. Consider the
internal indicators that we include in a model – vibration features, throughput, wear
particle size and quantity, component temperature, and so on. Then consider the range of
circumstances that occurred in the past with regard to these variables and their
relationship to a targeted failure mode. Although external conditions may have changed,
the internal physics associated with a failure mode, captured in the statistical model, are
still valid. If, however, the new conditions provoke entirely new failure modes that have
never occurred, the model cannot predict those new failure modes because the sample
upon which it was built contains no failures or potential failures of that kind.
Page 112
Chapter 10. Optimizing CBM
Developing a Maintenance Risk Model

When the physics of a failure are not completely understood (as is often the case), we
cannot specify a potential failure as a single alarm level. Nevertheless, we often possess
multiple indicators, that we know relate to an item’s remaining useful life. But we do not
know the nature of that relationship. Additionally, measurable external factors, such as
duty cycle, operating environment, and minor maintenance, likewise, influence the
propensity of an item to fail. The best we can do under these circumstances is to deduce
those key risk variables in order to specify the probability of failure occurring in a given
interval.
The foregoing appears to rule out the use of CBM, according to a criterion we stipulated
in Chapter 8. (CBM Program Criteria page 103 ) that "an unambiguous potential failure
must be detectible".
We recognize, nonetheless, that asset management, as does business in general, requires

us to manage risk. Seldom do we have complete information, yet we still must make the
best decisions possible. Therefore, in this chapter, we broaden the meaning of the word
“unambiguous” to include the ability to specify a probability that failure will occur in an
interval, given a set of observations. Rather than precluding CBM, the precondition of
Chapter 8. now requires us to:
1. develop applicable signal processing methods, and

2. establish a CBM data interpretation policy
The most difficult part of CBM is the latter – establishing a data interpretation policy. At
what point in time do we declare that a potential failure has occurred? How do we use
past experience to a) assess, and b) improve our CBM (potential failure declaration)
policy? This chapter will present a methodology to do just that. We begin by finding a
relationship between data and risk.
The traditional risk model

Maintenance departments often introduce CBM based solely on the availability of some
new technology for acquiring inspection data. Data acquisition technologies include data
collectors, control system historians, wireless sensors, and the five human senses. The
Page 113
abundance of data, typically, exceeds our capability to apply simple rules for interpreting
it. Often we imagine that, even without a CBM data interpretation policy, if we display
the data on a graph, a potential failure will emerge as an obvious deviation from a trend
line. Sometimes this is true, but more often the data is unfathomable, exhibiting random
fluctuations and contradictory indications, with no particular potential failure making
itself obvious.
Figure 10-1: Two assumptions 1. The condition indicator is tracking resistance to failure, and 2. the
alert limit (potential failure) is constant with working age.
Figure 10-1 represents a simple CBM decision model. We may think of a model as a
measuring stick. When the monitored value exceeds some predetermined level we declare
a potential failure. In this chapter we discuss a systematic methodology for determining
the appropriate level at which to declare a potential failure. Most CBM data interpretive
policies currently use the simple model of Figure 10-1. When we apply such a model we
make an important assumption.
We assume that, whatever the item’s working age, the indicator level at which to declare
a potential failure, will remain constant. While this assumption may be valid for some
failure modes, it is not necessarily so. Many items, particularly those that are in direct
contact with the product (e.g. liquids, hot gases, solids) or the environment, exhibit wear-
out and aging behavior. Younger machinery, for example, may tolerate higher loads and
vibration levels than older machines of the same type that have logged more fatigue
inducing cycles or exposure to corrosive environments. Experience120 reveals that, as
some items age, their potential failures occur at decreasing levels of the same condition
indicator. The precise relationship linking condition indicator, working age, operating
profile, and potential failure emerges from a CBM optimization analysis whose principles
are discussed next.
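The difference between the constant alert limit of Figure 10-1 and a limit that falls with working age can be sketched in a few lines. The linear decline used here is purely an assumption for illustration; the true shape of the limit would emerge from the optimization analysis, and all names and numbers are hypothetical.

```python
def potential_failure_constant(value: float, limit: float) -> bool:
    """Simple model of Figure 10-1: one alert limit for every working age."""
    return value >= limit


def potential_failure_age_dependent(value: float, working_age: float,
                                    limit_at_zero: float,
                                    decline_per_hour: float) -> bool:
    """Alert limit that decreases with working age (assumed linear here)."""
    limit = max(limit_at_zero - decline_per_hour * working_age, 0.0)
    return value >= limit


# The same vibration reading, 6.0, on a young and on an old machine.
reading = 6.0
print(potential_failure_constant(reading, limit=8.0))
print(potential_failure_age_dependent(reading, 1000.0, 8.0, 0.0005))
print(potential_failure_age_dependent(reading, 5000.0, 8.0, 0.0005))
```

Under the constant limit the reading never triggers an alert; under the age-dependent limit it is acceptable at 1,000 hours but is declared a potential failure at 5,000 hours.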
Combining Data and Risk
120
See wheel motor decision model discussed on page 131.
Page 114
Figure 10-2: Data and risk121
To begin our discussion of a CBM risk model, we show two graphs in Figure 10-2. The
upper plot is a typical history of an item’s monitored data. The lower plot associates a
failure risk with each data point.122 Assuming that we have discovered the relationship
between data and risk, we would next wish to develop a general policy that tells us what
level of risk is the right one for deciding to preventively renew an item at a given
moment. In many business contexts the right risk level is the one that minimizes the total
cost of preventive and reactive maintenance. The next section describes how to merge
cost and risk into a single decision model.
121
Lecture by Prof. Andrew Jardine, based on notes by Dr. Dragan Banjevic, CBM Laboratory, University
of Toronto
122
If the data were not related to failure risk, there would be no purpose to monitoring it in a CBM
program. Our challenge, then, is to discover and use the true relationship between data and risk.
Page 115
The Optimal Risk
Figure 10-3: What level of risk do we want to maintain at?
Risk is a value that combines the probability of failure with the consequences of failure.
At the extreme left of the risk line, a very conservative maintenance policy may result in
high risk because of high cost and low availability, while on the extreme right the level
of risk is high due to low availability and low reliability123. Let us assume that we
know how our monitored data relates to the risk of failure (Figure 10-2). We then hasten
to ask the question in Figure 10-3, “What level of risk do we wish our condition based
maintenance policies to attain?”
We do not want to set our alert limits too low (too conservatively) nor too high (too
liberally). Whichever CM (condition monitoring) data interpretation policy we adopt for
declaring a potential failure will depend on our objective (optimizing objective). Figure
10-3 suggests three possible optimizing objectives for consideration: minimum cost,
maximum availability, or a specified reliability124. We may at certain times, depending on
market conditions, desire to operate near highest availability, and under other circumstances
at lowest cost or at some specified reliability. We may also wish to operate at some
compromise state among the three objectives at a corresponding point on the risk line.
123
Total cost of failure and preventive maintenance approaches that of a run-to-failure policy.
124
For example a survival probability of, say, 98% over a specified mission, say, 6 months.
Page 116
“Risk management” equates to understanding the tradeoffs when adopting a particular
CBM policy, and adjusting that policy as the operating context changes.
One would expect the cost-versus-risk curve to resemble the trough-shaped one
shown. If we wished to operate at near zero risk125, it is logical that we would be required to
spend prohibitive sums on pro-active maintenance in order to attain such a degree of
assurance and comfort. On the other hand, if we desired to throw caution to the winds and
do no pro-active maintenance whatsoever, then the risk would likely be quite high. In fact,
the average cost per unit time will approach the average cost of an individual failure
divided by the item’s mean time between failures. Hence Figure 10-3 shows that the cost curve
plateaus to the right. Somewhere between these two extremes we would expect to incur
lower costs. The risk level that engenders the lowest cost126 is said to be the optimal risk.
When a manager makes a decision in a random circumstance (such as equipment failure),

he is always taking risk. Figure 10-3 illustrates that he must manage two kinds of risk
when deciding whether or not to intervene and renew an equipment. One is the risk of
failure before the next inspection (if he chooses not to maintain proactively). If a failure
event occurs, he suffers the loss due to the failure’s consequences. The other is the risk of
intervening too early. If this happens, he incurs the loss of useful service life of the
equipment and unnecessary preventive replacement cost. There is no way to avoid either
risk. One can only try to choose a decision that is "optimal". The objective (e.g. lowest
cost, highest availability, minimum threshold reliability, and so on) he selects for his
optimized policy depends on the operating context of the equipment item.
Let us state here the difference between hazard and risk. Hazard, also referred to as
instantaneous failure rate, is the probability of failure per unit time for a unit that has
survived up until the present time127. Risk (of failure), is defined as the combination of
probability and consequence. Probability is the likelihood of a failure occurring and
consequence is a measure of the damage that could occur as a result of the failure (in
terms of injury, fatalities, property damage, and operational and non-operational
consequences). Increased risk results from increased probability and higher degree of
consequence. On the other end of the risk spectrum (the left side of Figure 10-3), there is
another risk of interest – the risk of renewing a unit too early (over-maintaining). One
needs to manage and balance these two risks to find the most appropriate decision.
Hence we may express our problem in the form of two questions:
1. What is the optimal risk for an item, and

2. How does observable data from an item map to its risk of failure.
If we can, somehow, discover the answer to these questions, we can use observable data
to make optimal proactive decisions – the goal of CBM. We tackle this problem in the
125
Risk might be quantified in any number of ways, e.g. conditional probability of failure, failure rate,
reliability, etc. Or it may include a consideration of the consequences of failure.
126
Or highest availability, or specified reliability, or some desired compromise
127
See Appendix 7. on page 290 for a more complete definition of hazard.
Page 117
next section by examining, first, the simpler case of a preventive (time based)
maintenance optimal model.
A Time Based Maintenance Model128
Figure 10-4: Typical lives of a component
Figure 10-4 chronicles a typical item through a number of its life-cycles. An event B and
another event EF mark, respectively, the beginning and ending-with-failure of each
life-cycle.
Figure 10-5: 6 life-cycles ending in failure. No scheduled renewal policy

In Figure 10-5 we plot the six life-cycles in terms of their working age since renewal129,
rather than in the calendar time scale. That is, we reset the item’s hour meter to zero each
time it is renewed. In this form we could be referring to a single item’s consecutive life-
cycles or to a fleet of similar items operating concurrently. Either way, we observe, in
Figure 10-5, that a run-to-failure policy is in effect. The item is permitted to fail with no
preventive intervention attempted. CF represents the (average) cost of a failure. The
average cost per unit of working age of that (no preventive maintenance) policy will then
be the total cost of all failures divided by the total operating time of all units. This
average cost per unit time is expressed by Equation 10-1.
128
This discussion was developed by Dr. Dragan Banjevic, director, CBM Laboratory, University of
Toronto.
129
We are assuming that each renewal is a total (as good as new) repair (as opposed to a partial repair).
Page 118
$$\text{Cost/hr} = \frac{6\,C_F}{t_1 + t_2 + t_3 + t_4 + t_5 + t_6}$$
Equation 10-1
Figure 10-6: Life-cycles with a preventive maintenance policy
In Figure 10-6 we decide to try to improve matters by applying a preventive maintenance

policy of renewing each unit when it attains a working age of tA. We may call this policy
“PM Policy A”. In applying Policy A, according to Figure 10-6, we succeed in
preventing 5 of the 6 failures. Consequently, we incur the cost CR (cost of a preventive
renewal) 5 times and the cost of a failure, CF, 1 time. The operating time covered by these
costs is 5 × tA + 1 × t6. The average cost per unit of working age of total maintenance
(preventive and reactive) is expressed by Equation 10-2.
$$\text{Cost/hr} = \frac{5\,C_R + 1\,C_F}{5\,t_A + t_6}$$
Equation 10-2
Next, suppose that someone, feels that the cost of PM is too high and decides to extend
the interval for pro-active item renewal, say to time tB. This policy is illustrated by Figure
10-7.
Page 119
Figure 10-7: Life-cycles with an extended interval preventive maintenance policy
Similar to the cost calculation for Policy A, the average cost of maintenance resulting
from Policy B would be that expressed by Equation 10-3.
$$\text{Cost/hr} = \frac{2\,C_R + 4\,C_F}{2\,t_B + t_1 + t_2 + t_3 + t_6}$$
Equation 10-3
The $64 question at this point is, which of the policies of Figure 10-8 is the optimal one?
Table 10-1 applies some numerical values to this problem in order to illustrate how total
cost may vary depending on the preventive maintenance policy chosen. Note the
sensitivity of total cost to the ratio CR/CF (preventive renewal cost to failure cost).
Table 10-1: Possible costs of 3 policies and 3 different failure consequence costs CF

Policy                         Run to Failure    A: tA = 2,500 h          B: tB = 4,000 h
Operational time               20,000 h          14,250 h                 19,000 h
Cost formula                   6CF / 20,000 h    (5CR + CF) / 14,250 h    (2CR + 4CF) / 19,000 h
Cost if CR=$2,000, CF=$2,500   $0.75/h           $0.88/h                  $0.74/h
Cost if CR=$2,000, CF=$5,000   $1.50/h           $1.05/h                  $1.26/h
Cost if CR=$2,000, CF=$10,000  $3.00/h           $1.40/h                  $2.32/h
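The arithmetic behind Table 10-1 is simple enough to check by hand or with a short script. The sketch below recomputes each cell from the counts of preventive renewals and failures and the operating hours given in the table; the function and variable names are hypothetical.

```python
def cost_per_hour(n_preventive: int, n_failures: int,
                  c_r: float, c_f: float, operating_hours: float) -> float:
    """Average maintenance cost per hour of working age.

    (number of preventive renewals x CR + number of failures x CF)
    divided by the total operating time covered by those costs.
    """
    return (n_preventive * c_r + n_failures * c_f) / operating_hours


# (preventive renewals, failures, operating hours) for each policy
policies = {
    "Run to failure": (0, 6, 20_000),
    "A (tA = 2,500 h)": (5, 1, 14_250),
    "B (tB = 4,000 h)": (2, 4, 19_000),
}

C_R = 2_000  # cost of a preventive renewal, from the table
for c_f in (2_500, 5_000, 10_000):
    for name, (n_p, n_f, hours) in policies.items():
        cost = cost_per_hour(n_p, n_f, C_R, c_f, hours)
        print(f"CF=${c_f:>6,}  {name:<18} ${cost:.2f}/h")
```

As failures become more expensive relative to preventive renewals, the ranking of the three policies changes, which is the sensitivity the table is meant to show.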
Page 120
Which policy will result in the lowest total cost of preventive and reactive maintenance
per unit of working age? A? B? Or a “no scheduled maintenance” policy?
Figure 10-8: Possible PM policies
To answer that question, we suspect that we need to pose two more questions about the
item:
1. How does its failure risk vary with its age? and,
2. How do we combine the costs, CP and CF, with failure risk to arrive at an optimal
PM decision?
Tackling Question 1 first, we seek an equation to describe the lower curve of Figure 10-2
(page 115) that relates risk to working age. As it happens, in the early 1950s, Professor
Waloddi Weibull modestly suggested that his equation relating risk to age “might
sometimes render service”. The reliability community at the time reacted negatively to
the presumption that such a simple formula could work. But Weibull persisted and the
United States Air Force sponsored his research over the next 25 years. The Weibull hazard
model reshaped the discipline of reliability, having shown itself applicable to a
surprisingly wide cross-section of items and operating environments. The Weibull risk
model is given in Equation 10-4.
$$h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1}$$
Equation 10-4 The Weibull risk model
In the Weibull equation h(t) represents the hazard rate130, and β and η are constants
known respectively as the Weibull shape and scale parameters. If we had a methodology
to estimate the Weibull parameters from a set of lifetime histories of an item or fleet, we
would have answered Question 1, posed earlier on page 121.
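To make Equation 10-4 concrete, the hazard can be evaluated directly. The sketch below is illustrative only: the function names are our own, and the second function applies the short-interval approximation described in footnote 130 (hazard multiplied by the length of a short interval).

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)


def approx_failure_probability(t, dt, beta, eta):
    """Approximate conditional probability of failure in a short interval
    of length dt starting at working age t, given survival to t."""
    return weibull_hazard(t, beta, eta) * dt


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    beta, eta = 2.0, 10_000.0
    for age in (1_000.0, 5_000.0, 10_000.0):
        print(age, weibull_hazard(age, beta, eta))
```

A shape parameter greater than 1 gives a hazard that rises with working age, a value of exactly 1 gives a constant hazard of 1/η, and a value below 1 gives a hazard that falls with age.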
Figure 10-9: Computer estimation of the Weibull shape and scale parameters
As it happens, Figure 10-9 describes the output of such a methodology applied to heavy
haul truck transmissions131. The software used to perform this calculation was
EXAKT132. It uses numerical algorithms to process CMMS (computerized maintenance
management system) historical data. It estimates the Weibull shape and scale parameters.
Figure 10-9 shows the Weibull equation with the estimated shape and scale values.
Hence, for this explicit example, Figure 10-9 answers Question 1 (page 121): “How does
an item’s failure risk vary with its working age?”
130
Hazard is the instantaneous risk of failure at a time t. The conditional probability of failure can be
calculated from the hazard rate by multiplying its value by the length of the desired short interval.
131
The software and data for this example, as well as a tutorial, may be downloaded from
www.omdec.com.
132
CBM Optimizing software developed by the CBM Laboratory at the University of Toronto. A trial
version together with the working databases used in these examples is available at www.omdec.com .
Blending in Cost
We turn our attention, now, to Question 2 (of page 121): “How do we combine the costs,
CP and CF, with failure risk to arrive at an optimal PM decision?”
Figure 10-10: Probability density function can be drawn once the Weibull parameters have been
estimated.
Having answered Question 1 (page 121), “How does its failure risk vary with its age?”,
we can, with the help of a computer program133, draw the curve of Figure 10-10, known
as the “Probability Density Function”, represented by f(t) (defined in Appendix 7. on
page 290). This graph has some convenient characteristics:
1. The area under the curve up to a time t is equal to the probability that the item will
fail prior to time t. This value is known as the “Cumulative Probability of Failure”
and is represented by F(t). And,
2. The area under the remaining part of the curve is equal to the probability that the
item will survive to time t. That value is known as the Survival Function134 and
is represented by R(t). And,
3. Because an item will fail eventually, the area under the entire curve is equal to 1.
It follows, then, that F(t) = 1 – R(t).
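For the Weibull hazard of Equation 10-4 these three functions have well-known closed forms, stated here for convenience using the same symbols β and η:

$$R(t) = \exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right], \qquad F(t) = 1 - R(t), \qquad f(t) = h(t)\,R(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1}\exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right]$$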
In the graph of Figure 10-10, tp represents the time at which we plan to carry out a
preventive renewal of the item. We may, with the help of Figure 10-10, express the
expected average (over many life-cycles) cost, CE, of maintaining that item. CE will
include preventive and reactive maintenance. The expected cost of preventive repair will
be:
the average cost of an individual preventive repair, CP, times the probability that
the item will survive to tp,
while the expected cost of failure will be:
the average cost of an individual failure, CF, times the probability that the item
will fail prior to tp.
133
Relcode (see page 47), for example.
134
Sometimes called the “Reliability Function”
Equation 10-5 compactly expresses the sum of these two.
$$C_E = C_P R(t_p) + C_F\,(1 - R(t_p))$$
Equation 10-5
In precisely the same way, Equation 10-6 expresses the item’s expected operating time, tE.
$$t_E = t_p R(t_p) + t_f\,(1 - R(t_p))$$
where tf = time to failure
Equation 10-6
Finally, the total cost of maintenance per unit of working age will be CE divided by
tE, and is given by Equation 10-7.
$$\frac{C_E}{t_E} = \frac{C_P R(t_p) + C_F\,(1 - R(t_p))}{t_p R(t_p) + \int_0^{t_p} t f(t)\,dt}$$
Equation 10-7 (footnote 135) Total cost per unit time of maintenance as a function of PM policy tP.
Equation 10-7 provides us with the relationship we seek in order to answer Question 2
(page 121). Both R(t) and f(t) are derivable (see Appendix 7.) from the hazard rate of
Equation 10-4. We may use a computer numerical algorithm136 to plot CE/tE for all values
of tP. Figure 10-11 shows just such a graph. The tP corresponding to the minimum cost
on the curve will be the optimal PM policy. Thus, we have answered Question 2 (page
121): “How do we combine the costs, CP and CF, with failure risk to arrive at an optimal
PM decision?”.
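The search for the minimum of Equation 10-7 can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the algorithm used by EXAKT: it assumes the closed-form Weibull survival function R(t) = exp(-(t/η)^β), approximates the integral with the trapezoidal rule, and scans a grid of candidate renewal ages. All function names and all parameter values are hypothetical.

```python
import math


def reliability(t, beta, eta):
    """Weibull survival function R(t) = exp(-(t/eta)**beta)."""
    return math.exp(-((t / eta) ** beta))


def t_times_density(t, beta, eta):
    """Integrand t * f(t), where f(t) = h(t) * R(t); it is 0 at t = 0."""
    if t <= 0.0:
        return 0.0
    hazard = (beta / eta) * (t / eta) ** (beta - 1)
    return t * hazard * reliability(t, beta, eta)


def cost_rate(tp, c_p, c_f, beta, eta, steps=500):
    """Equation 10-7: expected cost per unit of working age for renewal age tp > 0."""
    r = reliability(tp, beta, eta)
    dt = tp / steps
    integral = 0.0
    prev = 0.0
    for i in range(1, steps + 1):  # trapezoidal rule on [0, tp]
        cur = t_times_density(i * dt, beta, eta)
        integral += 0.5 * (prev + cur) * dt
        prev = cur
    return (c_p * r + c_f * (1.0 - r)) / (tp * r + integral)


def optimal_renewal_age(c_p, c_f, beta, eta, t_max, n_grid=400):
    """Grid search for the tp in (0, t_max] that minimises cost_rate."""
    candidates = [t_max * k / n_grid for k in range(1, n_grid + 1)]
    return min(candidates, key=lambda tp: cost_rate(tp, c_p, c_f, beta, eta))


if __name__ == "__main__":
    # Hypothetical values: wear-out behaviour (beta > 1) and a failure
    # that costs five times as much as a preventive renewal.
    best = optimal_renewal_age(c_p=1.0, c_f=5.0, beta=3.0, eta=1000.0, t_max=3000.0)
    print("approximate optimal renewal age:", best)
    print("cost rate at that age:", cost_rate(best, 1.0, 5.0, 3.0, 1000.0))
```

A grid search is used here only because it mirrors the idea of plotting CE/tE for all values of tP and reading off the minimum. When β is 1 or less the hazard does not increase with age, and the cost rate keeps falling as tp grows, so the search simply returns the largest candidate.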
135
If you are curious as to how the second term, $\int_0^{t_p} t f(t)\,dt$, in the denominator of
Equation 10-7 was derived from tf(1-R(tp)), you may look at the derivation given in
Appendix 12. on page 300.
136
Such as is available in EXAKT
Figure 10-11: EXAKT solution output for optimal PM policy.
A Condition Based Maintenance Model
Now that we have established a model that relates risk, cost, and working age, we ask the
question, “What if we had additional information (beyond working age) that would
also reflect the risk of failure?” For example, vibration analysis data, oil analysis data,
visual inspection data, operating profile changes, or other signals from the machine or
process would surely bear upon failure risk, would they not? Can we therefore extend
our risk equation of page 122? Yes, we can. The Weibull model can be extended with a
new term as shown in Figure 10-12.
Figure 10-12: Proportional hazard model.
Note that the extended hazard model137 has additional parameters γi that determine the
influence of their respective measured values (called covariates). Where previously, in
the case of preventive maintenance, we determined the best working age at which to
renew the asset, we desire now to determine the best levels of all the significant
covariates and the working age at which to intervene and perform a preventive renewal of
the asset. That is to say, we wish to completely define a potential failure.
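The extended hazard can be sketched as follows, assuming the usual form of the Weibull proportional hazard model, in which the Weibull baseline of Equation 10-4 is multiplied by the exponential of a weighted sum of the covariates. The function name, the covariate values, and the coefficient values are hypothetical.

```python
import math


def phm_hazard(t, covariates, beta, eta, gammas):
    """Weibull baseline hazard multiplied by exp(sum(gamma_i * z_i)).

    covariates: measured values z_i at working age t (e.g. ppm of iron, lead)
    gammas:     one coefficient gamma_i per covariate
    """
    if len(covariates) != len(gammas):
        raise ValueError("one gamma coefficient is needed per covariate")
    baseline = (beta / eta) * (t / eta) ** (beta - 1)
    return baseline * math.exp(sum(g * z for g, z in zip(gammas, covariates)))


if __name__ == "__main__":
    # Hypothetical parameters: two covariates with made-up coefficients.
    beta, eta, gammas = 2.0, 10_000.0, [0.05, 0.20]
    print(phm_hazard(5_000.0, [0.0, 0.0], beta, eta, gammas))   # baseline only
    print(phm_hazard(5_000.0, [20.0, 3.0], beta, eta, gammas))  # elevated readings
```

With all covariates at zero the expression reduces to the age-only Weibull hazard, and a positive γi means that a higher reading of covariate i raises the hazard at every working age.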
Automated CBM Decision Making

The staff of the typical maintenance department performs a multitude of activities on a
daily basis. As human resources (for poring over endless CBM reports) grow scarce,
acute global competition forces industry to acquire “intelligent agents” for the
automated interpretation of data. Intelligent agents are programs that monitor data by
applying one or more CBM interpretive models in order to return an optimal decision to
the computerized maintenance management system. The process is illustrated
conceptually in Figure 10-13.
Figure 10-13 Vision of automated CBM decision making
137
Known as the PHM (proportional hazard model) developed by Cox. Cox, D. R. 1972. “Regression
Models and Life Tables (with Discussion).” Journal of the Royal Statistical Society, Series B 34:187—220.
In the following examples we detail the process for building and deploying a CBM
intelligent agent.
Example 1 Creating and deploying a decision model

The reader may work through this exercise using the EXAKT demonstration program and
accompanying data on the CD distributed with this book138. The complete set of
step-by-step instructions may be found in the Appendix (page 307).
The exercise will demonstrate the basic functions of the EXAKT model building platform
and the EXAKT decision agent software. Example 1 uses a reduced set of oil analysis
data from a fleet of haul truck transmissions to build a proportional hazards model. By
following the steps in the Appendix, you will create and deploy this model as an
“intelligent agent” that silently and automatically monitors future condition monitoring
data, returning an optimized decision (whether or not to remove and repair the
transmission) as each new set of condition monitoring readings is received. The model
constitutes a “policy” for making optimized decisions. Such a policy will minimize some
undesirable feature, such as excessive cost, or maximize some wanted feature, such as
availability. The decision agent provides a remaining useful life estimate based on the
current condition of the equipment, its age, and all relevant maintenance and operational
events that have occurred.
Figure 10-14 The Steps in building an EXAKT decision model

Figure 10-14 shows the EXAKT user interface as a flow diagram for executing each step
in the model building process:
1. Data preparation,
2. Weibull PHM,
3. Transition Probability Model,
4. Decision Model, and
5. Decisions
138
It may also be downloaded from www.omdec.com
A short description of each step, using the data from Example 1, follows.
1 Data preparation
Figure 10-15 General project data

The data preparation step, including data cleaning, is often the most difficult part of any
model building exercise. The difficulty lies mainly in the inconsistent recording of
maintenance and repair events in the past, where maintenance staff did not use procedures
similar to those suggested in Chapter 4. Acquiring Maintenance Information (page 58).
Example 2 will describe several data preparation and data cleaning methods.
Figure 10-15 illustrates the setting up of the project’s descriptive information. The “CBM
Model” field of Figure 10-15 provides the name by which a predictive CBM decision
model will be known, usually the name of a component or a failure mode whose
deterioration we wish to detect and monitor. The model, as its primary function, should
enable us to declare, and thereby act upon, a potential failure at the most advantageous
moment.
2 Weibull PHM
Figure 10-16 Testing possible significant variables
In the Weibull PHM step we test the degree to which monitored variables have
the potential to predict the failure mode under analysis. In Figure 10-16 we are setting
up the test of the combination of the oil analysis measurements of dissolved iron and lead.
Figure 10-17 provides the results of the test.
Figure 10-17 The results of the trial of iron and lead as PHM model covariates
Note the various text information we set up in Figure 10-15 as it appears in the report of
Figure 10-17. The “Summary of Events and Censored Values” table tells us about the
size of our sample (how many life cycles) and the breakdown of those that actually ended
in failure and those that were preventively renewed.
In the “Summary of Estimated Parameters (based on ML method)” table we can see the
results of the “maximum likelihood”139 method applied to the sample of 13 lifetimes.
The results of a number of statistical estimation methods140 are shown in this table
(Standard error, Wald, DF, p-Value, Exp of Estimate, and 95% Confidence Interval). The
software considers the results of each statistical procedure and displays the conclusion in
the column “Sign” (abbreviation for ‘significance’). “Y” indicates that “Shape” (i.e. the
working age), Iron, and Lead have been found to be significant to the probability of
failure in the upcoming observation interval.
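As an illustration of how a significance column of this kind might be derived for a single covariate coefficient, the sketch below computes a Wald statistic as the square of the estimate divided by its standard error and refers it to a chi-square distribution with 1 degree of freedom. The function name, the sample numbers, and the 0.05 threshold are assumptions made for illustration, not values taken from the EXAKT report.

```python
import math


def wald_test(estimate, std_error):
    """Return (Wald statistic, p-value) for the hypothesis that the parameter is 0."""
    wald = (estimate / std_error) ** 2
    # For a chi-square variable with 1 degree of freedom,
    # P(X > w) = erfc(sqrt(w / 2)).
    p_value = math.erfc(math.sqrt(wald / 2.0))
    return wald, p_value


if __name__ == "__main__":
    # Hypothetical estimate and standard error for one covariate coefficient.
    wald, p = wald_test(estimate=0.052, std_error=0.018)
    print(f"Wald = {wald:.2f}, p-value = {p:.4f}")
    print("significant" if p < 0.05 else "not significant")  # assumed threshold
```

A small p-value indicates that the data are unlikely if the coefficient were truly zero, which is the sense in which a covariate is reported as significant.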
139
A “fitting” algorithm that estimates the parameters of the proportional hazard model so that it best fits
the data.
140
See the EXAKT user’s manual for detailed explanations on these statistical tests.
3 Transition probability model
So far we have built a proportional hazard model (PHM). That model provides us with a
failure probability (hazard rate) knowing the working age of the item and the values of a
set of significant condition monitoring variables at that working age. However, in order
to complete the predictive capability of the model, we must have a way to describe the
behavior of those variables. The method used in EXAKT is known as the Markov chain
transition probability matrix. Figure 10-18 shows the matrix for iron at each of five states
whose boundaries were proposed by the software.
Figure 10-18 Transition probability matrix

In Figure 10-18 each cell contains the probability of jumping to the state designated by its
column heading if we are currently in the state designated by its row heading. For
example, if the state of iron is “9-18” the probability of failure in the next interval is 13%.
The matrix of Figure 10-18 represents only a single dimension. It assumes that the value
of the second significant variable, lead, remains constant. Based on the transitions of all
data values in the past to their new values at the subsequent inspections, the software
generates a probability matrix for each combination of states of the significant variables.
The resulting multi-dimensional matrix is combined with the PHM, the next-to-final step
towards building the predictive decision model.
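The idea of a transition probability matrix for a single covariate can be sketched with a simple frequency count over consecutive inspections. This is a deliberately simplified illustration, not EXAKT's estimation procedure: it ignores the time between inspections, and the state boundaries, readings, and function names are hypothetical.

```python
from bisect import bisect_right


def to_state(value, boundaries):
    """Index of the band containing value.

    boundaries are ascending cut points; with [9, 18] the bands are
    value < 9 (state 0), 9 <= value < 18 (state 1), value >= 18 (state 2).
    """
    return bisect_right(boundaries, value)


def transition_matrix(histories, boundaries):
    """Estimate P[i][j], the fraction of moves from state i that end in state j.

    histories: one list of successive readings per item lifetime.
    Rows for states that were never left are returned as all zeros.
    """
    n = len(boundaries) + 1
    counts = [[0] * n for _ in range(n)]
    for readings in histories:
        states = [to_state(v, boundaries) for v in readings]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


if __name__ == "__main__":
    # Hypothetical iron readings (ppm) from two lifetimes.
    histories = [[1, 10, 20], [1, 1, 10]]
    for row in transition_matrix(histories, boundaries=[9, 18]):
        print(row)
```

Each row of the result sums to 1 whenever at least one transition out of that state was observed, which matches the reading of a row as "given that we are in this state now".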
4 Decision model
This last step in the model building process requires us to provide the relative average
costs of a preventive renewal of the component (or failure mode) following the
declaration of a potential failure, as well as the typical worst-case cost141 if a functional
failure were to occur. The results of the decision model applied to one of the equipment
units are shown in Figure 10-19.
141
Based on an assessment of the “typical worst case scenario”. All models are based on assumptions. The
EXAKT model assumes that a manager, through experience, can envision a balanced portrait of the
events surrounding a failure and their consequences. Sensitivity analysis (a function of the software) helps
us to sanity check these assumptions and their impact upon the model’s decisions.
Figure 10-19 Results of applying optimal decision model retroactively to an item.
In the table of Figure 10-19 we may read the average cost per unit of working age (associated
with failure and proactive maintenance) for the current policy (0.449), the proposed
optimal (EXAKT) policy (0.378), and the “no pro-active maintenance” policy (1.522). The
mean-time-between-replacement (MTBR) includes both preventive and failure
replacements. It is 8775 working age units currently. By adopting the EXAKT decision
model, we would intervene far more often (every 3326 working age units), but at a cost
per working age unit of 0.378 (84% of the cost of the current policy).
5 Decisions
Upon building and testing the model in EXAKTm (EXAKT for Modelling), we export
the model to an external database where it may be deployed by EXAKTd (the EXAKT
decision agent).
Figure 10-20 Decision report on a fleet of Haul trucks

A report such as that shown in Figure 10-20 may be generated on demand or
automatically within the CMMS. Only one ModelName “Trans Oil Anal” is shown in
Figure 10-20. However, such a report may list any number of model names corresponding
to failure modes targeted by CBM.
Example 2 Data validation

Most maintenance departments have yet to adopt standard data management procedures
such as those described in Chapters 1 (page 13) to 4 (page 58). Hence data validation,
after the fact, will necessarily form part of the CBM automated data interpretation
modeling process. Cardinal River Coals Ltd. was a 50/50 joint venture between Luscar
Ltd. and Consol of Canada, Inc. The mine is located approximately 50 km south of
Hinton, Alberta on the eastern slopes of the Rocky Mountains. The coal produced from
the mine was low sulfur, high quality coking coal used for steel making. Cardinal River
Coals Ltd. opened in 1970 as a multiple open pit mine using the truck and shovel mining
method. Annual production at the mine called for the removal of 21 million cubic meters
of rock and 2.8 million tons of coal. The mine won multiple awards for land
reclamation and for creating wildlife habitat. Oil analysis results from a fleet of 55 haul truck
wheel motors were analyzed along with their respective failures and repairs over a nine-
year period.
Extensive planetary gear or sun gear (Figure 10-21) damage necessitates replacement of
one or more major internal components in a general overhaul. There were 26 haul trucks
at the mine site, each having two wheel motors. With 3 spare wheel motors the fleet
numbered 55. Oil analysis was carried out monthly.
Figure 10-21: Wheelmotor
The mine’s computerized maintenance management system (CMMS) recorded wheel
motor removals due either to failure or as the result of a decision to carry out “preventive
maintenance”. The decision to perform PM was made by assessing oil analysis data and
by taking into account the wheelmotor’s age. Costs and details associated with past
removals were available. Additionally, an oil analysis database contained a vast history of
condition monitoring test results – some 50,000 records covering the same time period as
the removal history.
The Existing CBM Program

In most maintenance departments, oil analysis reports are received from a commercial
laboratory and are summarily reviewed by a maintenance planner or supervisor. The
laboratory usually points out sudden increases in the concentration of wear metals or
contaminant elements such as silicon (Si) in the oil sample. At Cardinal, staff noted the
amount of sediment (weight filtrate on a filter patch) and parts per million (ppm) of five
elements: iron, silicon, chrome, nickel, and titanium. A decision to remove the unit (for
overhaul) was based on a visual perusal of the reported values of these elements in
conjunction with the unit's age. The policy was to rebuild the units after about 20,000
hours of operation regardless of the CBM data – earlier if the metal levels were
abnormally high. Our objective, in this example, is to assess the applicability and
effectiveness of the existing CBM program and to improve it by using the EXAKT age
exploration methodology.
Validating Event Data

The first and most important step in any CBM optimization project is to ascertain the
validity of the data available both in the CMMS and in the CBM databases. We begin by
applying EXAKT’s DataCheck function to the records of both databases. The result of
this operation is a report similar to that shown in Figure 10-22.
Figure 10-22: DataCheck report in synchronized view with Inspections and Events tables.
The DataCheck report addresses a common problem in historical CMMS databases.
Often work order records omit the description of what was found when examining the
item prior to its repair.142 The report of Figure 10-22 issues the warning,
Check whether this history is temporary suspended or “EF/ES” is missing.
whenever it deduces that an ending event, either EF (ending with failure) or ES (ending
by suspension) may be missing. The analyst must investigate the actual work orders or
the comments in the work order record in order to ascertain whether a failure or a
preventive renewal of the item occurred, or whether the item is currently in operation.
Each valid history for a wheel motor must have a Beginning event (B), an Ending event
(EF for failure, or ES for suspension (such as a preventive removal)) and Inspection
events in between.
The DataCheck report of Figure 10-22 may issue additional comments and warnings. For
example:
“The first event of this history is not a “B” event”, or

“This record has the same WorkingAge as the previous record …”, or
“This record has a larger WorkingAge than the next record …”, and
so on
By methodically investigating each of the warning messages, with the automatically
synchronized records in the Events and Inspection tables (Figure 10-22), the analyst
eventually corrects all of the “logical” errors and omissions in the database. The output of
the data-checking tool points out errors of inconsistency of events and dates in the
CMMS and condition databases. The software deduces, from the dates and working ages,
the sets of data that comprise individual histories. For each history that it finds without an
ending, it asks whether the ending event should be designated as a suspension (ES), a
temporary suspension (TS, which is denoted by *ES in the software) or a failure (EF).
(“Temporary suspension” means that the item is currently in operation).
The DataCheck function also points out anomalies that may indicate data problems such
as two inspections on the same day, or working ages and calendar dates out of
synchronization. All of these logical errors would have compromised the model’s
accuracy. Most of these types of errors can easily be corrected by inserting the missing
Beginning and Ending events for each history.
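The kind of logical checks described above can be sketched in a few lines. This is a hypothetical illustration of the rules stated in the text, not EXAKT's DataCheck function: the function name, the record layout (working age, event code), and the wording of the messages are our own assumptions.

```python
def check_history(events):
    """Return a list of warnings for one history.

    events: (working_age, code) tuples in recorded order. Codes follow the
    text: "B" beginning, "EF" ending with failure, "ES" ending by suspension;
    any other code is treated here as an inspection.
    """
    if not events:
        return ["History is empty"]
    warnings = []
    if events[0][1] != "B":
        warnings.append("First event of this history is not a 'B' event")
    if events[-1][1] not in ("EF", "ES"):
        warnings.append("History may be temporarily suspended or 'EF/ES' is missing")
    for i in range(1, len(events)):
        prev_age, cur_age = events[i - 1][0], events[i][0]
        if cur_age == prev_age:
            warnings.append(f"Record {i} has the same working age as the previous record")
        elif cur_age < prev_age:
            warnings.append(f"Record {i - 1} has a larger working age than the next record")
    return warnings


if __name__ == "__main__":
    # Hypothetical history with no ending event and one out-of-order age.
    history = [(0, "B"), (500, "I"), (400, "I"), (900, "I")]
    for message in check_history(history):
        print(message)
```

A history that produces no warnings still needs a human check of the work orders, since the rules only detect inconsistencies, not a failure that was wrongly recorded as a suspension.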
Validating Inspection data

Inspection records can be examined graphically using various combinations of covariates,
dates, ranges, and scales. For example, the cross graph of Figure 10-23 reveals
statistically “unusual” values of silicon forming a horizontal line at exactly 900 parts per
million (PPM).
142
The roots of such data integrity problems were discussed in Part 1.
Figure 10-23 Cross graph of Si and Working age for the entire fleet over 9 years
Inquiries at the commercial laboratory that performed the oil analyses revealed
that, for a period of time, the photo-multiplier tube on the spectrometer was
saturating at exactly 900 PPM. In other words, all values of silicon above 900 were
truncated to 900 PPM. A similar situation occurred for iron above 2500 PPM. If not
detected, this could play havoc with the building of the PHM (proportional hazard
model). To solve this problem we call up the cross graph of Silicon versus Iron displayed
in Figure 10-24.
Figure 10-24: Cross graph of correlation between Si and Fe showing data errors
Figure 10-24 reveals strong correlation between Silicon and Iron, as well as an obvious
dog leg in the graph where Si plateaus at 900 ppm. We note too that a few values above
900 ppm appear after the spectrometer was repaired and that they fall on the correlation line. It is reasonable,
therefore, to correct the values of 900 ppm by substituting the values of iron × the slope
of the correlation line as was done in Figure 10-25.
Figure 10-25 Corrected values of silicon
In this instance, knowing the errors in the laboratory test data, it was possible to
compensate for them in the database used to build the model. For example, the
truncated values of Si were replaced with 1.2 × Fe. The factor of 1.2 was
determined from the initial slope of the cross graph (a correlation graph) of Fe-Si and
values obtained after the saturation defect was corrected. The truncated Fe values were
not corrected since there were too few of them to influence the model.
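The substitution described above can be sketched as a small function. The saturation level of 900 ppm and the factor of 1.2 come from the text; the function name is hypothetical, and the guard that never lowers a clipped reading below 900 ppm is our own assumption rather than part of the original procedure.

```python
SI_SATURATION_PPM = 900.0  # level at which the spectrometer clipped silicon
SI_PER_FE_SLOPE = 1.2      # slope of the Si-versus-Fe correlation line


def correct_silicon(si_ppm, fe_ppm):
    """Estimate the true silicon value for a clipped reading.

    Readings equal to the saturation level are replaced by slope * Fe.
    Assumption: the estimate is never allowed to fall below the clipped
    value, because a saturated reading means the true value was at least 900.
    """
    if si_ppm == SI_SATURATION_PPM:
        return max(SI_SATURATION_PPM, SI_PER_FE_SLOPE * fe_ppm)
    return si_ppm


if __name__ == "__main__":
    # Hypothetical (Si, Fe) readings in ppm.
    for si, fe in [(450.0, 300.0), (900.0, 1000.0), (900.0, 500.0)]:
        print(si, fe, "->", correct_silicon(si, fe))
```

In practice the correction would be restricted to the period in which the instrument was known to be saturating, so that a genuine reading of exactly 900 ppm taken after the repair is left alone.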
Determining correlation between covariates is useful both to provide insight into the data,
and in understanding the models generated by the software. For example, if Fe and Ni
are highly correlated, the software would confirm that there is no point in including nickel
in the model since it provides no additional information regarding
the probability of failure. Thus, if the software concludes that nickel is “insignificant”,
inspecting the correlation graphs helps one understand the
reasonableness of such an indication. These correlations are the result of wear of a
metallic alloy component present in the unit.
The effects of minor maintenance or equipment calibrations
Figure 10-26 The effect of an oil change

When building the PHM we must account for any minor maintenance
work that is done, such as changing the oil in the wheel motor. For example, Figure
10-26 illustrates that the actual transition path of oil measurements was from A to B to C
to D. If we did not account for the oil change, then the software would assume that the
transition was A to B to D. This would be misleading and would tend to overestimate the
risk of failure143. EXAKT properly handles minor maintenance events that impact
monitored variables.
In the EXAKT data preparation phase we set up initialization conditions associated with
certain events. The model is told what covariate values should be associated with those
minor corrective events, such as an oil change (OC). By the same token, events such as
balancing a rotor, or aligning a shaft should be recorded whenever they occur. During
model setup, approximate initialization vibration levels will have been assigned to these
events in the CovariatesOnEvent table, so that the model can properly recognize that
decreases in covariate values are the result of a minor maintenance event.
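The effect of recording a minor maintenance event can be illustrated with a sketch that builds the list of covariate transitions for one history. It is a hypothetical illustration of the idea, not EXAKT's implementation: the function name, the record layout, and the reset value of zero after an oil change are assumptions.

```python
def transition_pairs(records, reset_values):
    """Return (from_value, to_value) pairs for one covariate.

    records:      (working_age, event_code, value) tuples in age order;
                  value is None for events that carry no measurement.
    reset_values: maps an event code such as "OC" (oil change) to the
                  covariate level assumed immediately after that event.
    """
    pairs = []
    last = None
    for _age, code, value in records:
        if code in reset_values:
            last = reset_values[code]  # restart the path from the post-event level
            continue
        if value is None:
            continue
        if last is not None:
            pairs.append((last, value))
        last = value
    return pairs


if __name__ == "__main__":
    # Hypothetical iron readings (ppm) with an oil change between B and D.
    records = [(100, "I", 10.0), (200, "I", 30.0), (210, "OC", None), (300, "I", 8.0)]
    print(transition_pairs(records, {"OC": 0.0}))  # A to B, then C to D
    print(transition_pairs(records, {}))           # oil change ignored: A to B, B to D
```

When the oil change is ignored, the drop from 30 to 8 ppm appears as a natural transition of the covariate, which is the misleading A to B to D path described above.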
Figure 10-27 shows ‘missing’ or ‘irregular’ oil changes and obvious gaps due to
incomplete records144. Oil ages of 7000-8000 hours are indicated, which is quite unlikely
with the use of mineral oils in this application. The site changed to synthetics about two
years earlier to eliminate the need for regular oil changes. However, most histories
containing missing oil changes occurred prior to 1997. It was thought that this
information needed to be recovered from the commercial laboratory’s files.
Unfortunately these files, too, were incomplete and inconsistent with the dates and
working ages in the work order database.
143
By associating failure with decreasing levels of wear metals.
144
Chapter 1. (page 13) and Chapter 2. (page 19) addressed and offered a solution strategy to this common
problem.
Figure 10-27 Missing oil change events
Fortunately, however, these 'missing' oil changes did not significantly affect the model
since they were relatively few in number with respect to all of the known oil changes.
That is, there were a sufficient number of known oil changes in the database for the
model to account for their effect on the measured data.
Building the Proportional Hazards Model (PHM)

After all the obvious data errors are eliminated or corrected by using the DataCheck
function and the rich assortment of graphical analytical tools in the EXAKT toolbox, the
proportional hazard model may be generated using the techniques of Example 1 (page
127). Figure 10-12 on page 125 shows that the risk of failure is a function of both
working age and the “significant” condition data. By following the iterative procedure
learned in Example 1, which is based on Cox [ref 1972], the insignificant covariates are
systematically eliminated, and potentially good models are tested to see how well they
represent the actual data. One of these methods used by the software is known as residual
graphical analysis, illustrated in Figure 10-28.
Figure 10-28 Graphical analysis of maximum likelihood estimation (MLE) residuals
Each point on the residual graph of Figure 10-28 represents a history, that is, a lifetime of
a wheel motor from its installation to its removal. The sample used to build the model
consists of many histories drawn from the entire fleet. The graph shows an unusual point
that is well above the 95% upper limit. This leads one to investigate the underlying data
corresponding to this residual (i.e. this particular lifecycle). It was discovered (Figure
10-29) that some ‘unusual’ data were included in that history which appear to violate the
model that we are attempting to build.
Figure 10-29 Unusually high values of Fe and Si unexplained by a failure event
The Fe values in the left-circled region of Figure 10-29 have an inexplicable pattern. Fe
jumps to high values, but truncated at 2500 PPM due to instrument saturation, and
remains in the same range for a few more inspections. Then, the readings fall back to low
values. No events were recorded to explain these sudden jumps.
Having no event data to support such high values of Fe and Si, the model was
regenerated and the fit tested again after removing that history from the working data set.
Statistical and graphical goodness-of-fit testing procedures are applied by the software as
part of the modeling procedure. The model’s fit to the data improved immediately. The
model building algorithms no longer had to accommodate obviously contradictory and
misleading information.
Validating Human Decisions

The foregoing describes data-related problems that were encountered and that were
relatively easy to correct using the statistical and graphical tools available in the
software’s function arsenal.
However, a different (and more fundamental) problem occurred regarding the definition
of wheel motor failure. These units seldom failed “functionally”. That is, no haul truck
needed to be taken out of service immediately while it was hauling a load of rock or coal.
Nevertheless, to develop a CBM policy (model) we must have some objective definition
of failure. Initially, the mechanics’ remarks (on the work order) were used for this
purpose. For example,
"High iron in oil sample and high hours, removed and replaced wheel motor."
This event was then classified as a “failure”. However, on reviewing the re-builder's
report attached to each invoice it became clear that some events initially classified as a
failure should be treated as a suspension and vice versa. For example, if the gears had
been replaced because they failed an ultrasonic test or they were obviously in a failed
state then that event should be classified as a failure. But if the gears were replaced
simply because it was expedient to do so, or if the unit was only generally rebuilt with no
real internal damage, then that event should be considered a suspension.
With the definition of suspension and of failure thus clarified, a proportional-hazards
model was found which was shown by the software’s report (Figure 10-30) to be a “good
fit” by the statement “not rejected”.
Figure 10-30: Results of the "Goodness of Fit" test

The model containing the covariates iron and sediments was found to be good, both by
graphical residual analysis (such as that of Figure 10-28 on page 139) and by the
Kolmogorov-Smirnov statistical test (Figure 10-30) applied automatically by the software.
The results of the analysis and the proportional hazard model are displayed in Figure
10-31.
Figure 10-31: The proportional hazard model145 for a haul truck wheel motor
Obtaining the CBM Optimal Decision Model

After finding the PHM we are next ready to establish the optimal decision model [ref
Jardine et al 1997] that incorporates economic considerations along with the risk estimate
obtained from the PHM. Using the decision model building methodology that we learned
in Example 1 (page 127) a cost ratio of 3:1 ($20K for preventive replacement cost, $60K
for failure replacement cost, based on the invoices of past repairs by outside contractors)
was blended into the model. The resulting decision model applied to a wheel motor is
depicted in Figure 10-32. Using EXAKT’s cost comparison function (described in
Appendix 11., page 295), the software calculates the expected savings. These are shown
in Figure 10-33 for various economic conditions, represented by three possible cost
ratios.
Figure 10-32: Optimal CBM decision model applied to a set of oil analysis data for a wheel motor
Application of decision model

Once the decision model was built, data was examined from previous histories to see
what the decision model would have recommended for situations in which the wheel
145
Covariate significance is tested by the Wald statistic, the square of the standardized estimate of the
parameter which follows a chi square distribution with 1 degree of freedom. (Note: A few missing sediment
values had been replaced by the values from previous inspections prior to the analysis, hence the reason for
using the notation CorrSed).
motor failed. One illustration of such a history is shown in Figure 13. This graph provides
a recommended decision based on inspection data (covariates and working age).
The decision ‘Replace immediately’ was suggested by the model (as illustrated in Figure
10-32) for the first time at the inspection point at working age = 11384 hrs, 276 hours
(about eleven and a half days) prior to failure (reported at 11660 hrs). The following
inspection at working age = 11653 hours, 7 hours prior to failure, also suggests the
replacement of the wheel motor. The first warning may have been sufficient, given a sample
turnaround time of 48 hours, to prevent the consequences of failure. Even prior to 11384
hours it can be seen from the decision graph that the measurements were approaching a
replacement recommendation. Note that the zero points on the graph
indicate default measurement values of zero (imputed by the software) immediately
following oil changes.
The economic benefit associated with basing the maintenance policy on the Decision
Policy Graph of Figure 10-32 is revealed by an economic investigation using
EXAKT’s sensitivity analysis function. Under current economic conditions, Figure 10-33
indicates a potential saving of between 20% and 30% compared to current practice.
Figure 10-33: Expected savings for various coal market conditions
Note that for the cost ratio of 3:1 (first section of Figure 10-33) no
operational savings were accounted for since, at the time of this study, unfavorable coal
market conditions caused the mine to operate below its capacity. However, as market
conditions improve, higher cost ratios would be used since the capital assets of the mine
will be used at maximum capacity. Current strip ratios (total material removed versus
sellable material) would also affect the savings associated with increased availability and
reliability of the units. The sensitivity analysis function of EXAKT, illustrated in Figure
10-34, demonstrates the sensitivity of the overall savings to changes or inaccuracies in the
cost ratio.
Sensitivity analysis
Figure 10-34: Sensitivity of the CBM model to economic and geological conditions affecting the cost
consequences of haul truck failure
In real situations, the actual ratio of failure and preventive replacement costs may not be
well known. Furthermore the dynamics of industry are such that costs can change with
changing technology, production, and market conditions. Therefore one would like to
know to what degree the true total cost per unit time and the optimal policy would
change with changes in cost ratio. The software enables sensitivity analysis to be
undertaken and generates the graph and corresponding tabular data of Figure 10-34.
The curves on the graph are interpreted as follows.
Solid Line: If the actual cost ratio (CR) of today differs from that specified when the
model was built, the current policy (as dictated by the Optimal Replacement Graph of
Figure 10-32 on page 142) may no longer be optimal. The line indicates the increase (in
percent) that will be incurred above the optimal cost/unit time by adhering to the current
(no longer optimal) policy. For example, if the actual cost ratio is 5 and our model was
built with CR=3, then the increase in the cost incurred by following that (original optimal)
policy is around 6% (5.98%). In other words the solid line represents the sensitivity of
costs to changes in CR.
Dashed Line: Again, assume the actual cost ratio has strayed from what was used when the
model was built. If the model is rebuilt using the new ratio, the dashed line tells how much
the new optimal cost would differ from that of the original model. In other words the
dashed line represents the sensitivity146 of the optimal policy to changes in CR.
The graph indicates that moderate overestimation of the cost ratio does not
significantly affect the average long run cost but provides a more conservative policy
from the point of view of risk of failure. In a frequently (perhaps seasonally) changing
cost situation it could be worthwhile to dynamically rebuild the CBM optimization model
each time it is applied by using a cost ratio fed from an ERP (enterprise resource
planning) system that takes account of current market conditions.
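The solid-line quantity can be reproduced in miniature. The sketch below does not use the CBM decision model; it swaps in a simpler age-based replacement policy with a Weibull life distribution, because that policy can be optimized in a few lines. The Weibull parameters, the normalized preventive cost and the search grid are all assumptions made for illustration.

```python
import math

BETA, ETA = 2.5, 10_000.0     # hypothetical Weibull shape and scale (hours)
C_PREV = 1.0                  # preventive replacement cost (normalized)
STEP = 50.0
AGES = [STEP * i for i in range(1, 601)]    # candidate replacement ages

def reliability(t):
    return math.exp(-((t / ETA) ** BETA))

def cost_rate_curve(cost_ratio):
    """Cost per unit time when replacing at each candidate age (or on failure)."""
    c_fail = cost_ratio * C_PREV
    rates, area, prev_r = [], 0.0, 1.0
    for age in AGES:
        r = reliability(age)
        area += 0.5 * (prev_r + r) * STEP   # trapezoid rule: expected cycle length
        prev_r = r
        rates.append((C_PREV * r + c_fail * (1.0 - r)) / area)
    return rates

def penalty_pct(built_cr, actual_cr):
    """Percent cost increase from keeping the policy built with built_cr
    when the true cost ratio is actual_cr."""
    built = cost_rate_curve(built_cr)
    actual = cost_rate_curve(actual_cr)
    stale_index = built.index(min(built))   # replacement age chosen by the old model
    best = min(actual)
    return 100.0 * (actual[stale_index] - best) / best

print(f"{penalty_pct(3, 5):.1f}% above the optimum")
```

The penalty is zero when the built and actual ratios agree and grows as they diverge, which is the behavior the solid line displays. The numerical value printed here depends on the assumed parameters and should not be compared with the 5.98% of Figure 10-34.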
The cost analysis summary shown on Figure 10-33 (page 143) indicates a saving of 25%,
when CR=3, over the “replace only on failure” (ROOF) policy, whose costs approximate
those of the site’s past policy. Decision model results are also calculated for cost ratios of
5 and 6. As the cost ratio increases we observe an increase in both the optimal policy
cost and the savings. The optimal decision models in these cases indicate that
more frequent preventive replacements (from 74% to 91%) will result from applying the
optimal decision policy in order to avoid costly failures. (Note: There is a slight
discrepancy between the expected time between replacements for the ROOF policy, when
CR=3 and CR=5 and 6. This is due to the numerical calculation procedure.)
The steps in the appendix for Exercise 3 (Data Validation, page 319) contain a hyperlink
to a database file with which the reader may reproduce the analyses and graphs of
Example 2.
146
Note that the sensitivity graphs assume that only Cf (failure replacement cost) changes and Cr
(preventive replacement cost) remains unchanged.
Example 3 Complex Items
A simple item has one dominant failure mode, while a complex item has more than one
failure mode or failure-susceptible component. A CBM program typically acquires
inspection data (e.g. oil analysis, vibration, performance data) for an entire system, such
as an engine. Thus, a single system identifier (say “Engine 7483”) labels inspection data
records from which more than one failure mode can be deduced147. Each failure mode will
have its own age-reliability-CMdata148 relationship and, hence, its own CBM decision model.
In 2003, the Condition Based Maintenance Laboratory at the University of Toronto
developed a data structure and methodology for the predictive analysis of complex
systems - items containing multiple components and subject to a variety of failure modes.
The example of this tutorial is a single reduction gearbox that contains two gears
(referred to as Gear1 and Gear2). We concern ourselves, in this item, with
the failure mode “tooth fails due to root crack”, which can occur on either gear. We treat
this unit, therefore, as a complex item having two failure modes.149 A CBM policy must
consider all reasonably likely failure modes whose potential failures are detectable in the
condition monitoring data set. The policy must distinguish data patterns characterizing
one failure mode from those characterizing another. The policy must advise on which
potential failure mode is imminent and provide a residual life estimate.
The EXAKT software uses the term Marginal Analysis150 to indicate that a complex item
is being analyzed. We develop our CBM decision models within a “working model”
database (whose filename is typically of the form equipmenttype_WMOD.mdb). We
refer to this database as the WMOD (working model) database. To that WMOD database
we “attach” tables from an external database (typically named
equipmenttype_MES.mdb) that contains data transferred from or linked to the CMMS
and one or more CBM and/or process databases.
If the table names in the equipmenttype_MES.mdb database have the extension “_MA”
(see the table structures of Figure 10-35 below), that will tell EXAKT to perform a
“marginal analysis”. Using marginal analysis we build several CBM decision models,
each corresponding to a specific component or to a specific failure mode. Figure 10-35
147
By selecting and processing the data in different ways.
148
Nowlan and Heap used the phrase “age-reliability relationship” to categorize the probabilistic failure
behavior of an item with respect to its working age. With proportional hazard modeling (PHM) Cox
introduced extra information, condition monitoring (CM) data, that bears on failure behavior. Hence we
have appended a third expression “CMdata” to the phrase. In CBM we can now conveniently refer to the
“age-reliability-CMdata relationship”.
149
In this example, for simplicity and clarity, we will ignore faults associated with bearings or other types
of gear or shaft faults. Nevertheless, EXAKT imposes no limit on the number of failure modes or
components to be included in a complex item.
150
The word “marginal” refers to an analysis on one component, assuming that there is no cross-failure
causality among components or failure modes in the complex item. In the future, EXAKT will deal with the
more general case where one failure mode can provoke or influence another.
illustrates the structure of a database for multiple failure modes occurring in a single
equipment item.
Inspections_MA    Events_MA     EventsDescription_MA   VarDescription_MA   CovariatesOnEvent_MA

Ident             Ident         EventName              VariableName        Event
Date              Date          P                      MeasureUnit         StartingDate
WorkingAge        WorkingAge    Comment                WarnLimit1          EndingDate
Covariate1Name    Event                                WarnLimit2          Covariate1Name
Covariate2Name    Comment                              …                   Covariate2Name
…                                                      Comment             …
Comment                                                                    Comment
Figure 10-35 Structure of a MES database to be analyzed using "Marginal Analysis"
The tables of Figure 10-35 are identical (except for the suffix “_MA” in their table
names) to those of the analysis of simple items (of examples 1 and 2).
Three new tables (Figure 10-36), however, have been added to the MES marginal
analysis database structure. Each component or failure mode in a complex item will
behave according to its individual risk model. When complex items are to be analyzed
(and their failure modes to be modeled) we need a way to tell each model which data
in the database applies to it. For example, one component may fail at a
particular time while another component is still in good working condition. Hence we
need a structured way to indicate the event that each component (or failure mode) has
undergone. The supplementary tables of Figure 10-36 fulfill that role. The table
“IdentToModel” relates a decision model to specific equipment units of a fleet of similar
equipment. It tells the decision agent to which specific equipment units each model
should be applied. For example, if certain engines of the fleet do not have turbochargers,
then a model predicting the failure of the bearing in the turbocharger should not be
applied to the non-turbocharged engines in the fleet.
Similarly, the “EventToModel” table tells the model which events in the common
database apply to the failure mode that it is predicting. The “VarToModel” table maps
monitored variables to a specific model.
IdentToModel    EventToModel       VarToModel

ModelName       ModelName          ModelName
IdentName       InputEventName     InputVariableName
Date            OutputEventName    OutputVariableName
                InputP             VariableDataType
                OutputP            MeasureUnit
                                   WarnLimit1
                                   WarnLimit2
                                   …
Figure 10-36 New tables in MES required for mapping to an individual failure mode model
At the beginning of the modeling exercise, the IdentToModel, EventToModel, and
VarToModel tables are empty. EXAKT populates them automatically when the user
performs the mapping operation guided by an EXAKT dialog.
The phrases “Input…” and “Output…” appear in several field names of the tables
EventToModel and VarToModel. These fields map values in the general database to
their values in a specific model. For example, “failure of suction valve 3” in the database
would be mapped to the event “EF” in the model that was built to predict the failure
behavior of “Suction Valve 3”. Hence in a single equipment we may have, for example,
two failure events, EF1 and EF2, and two suspension events, ES1 and ES2. We need to
tell a particular model (of a particular failure mode or component) which event records
(for example, those with the values B1 or B2, EF1 or EF2, and ES1 or ES2 in the
database) to use as the beginning, failure, and suspension events respectively for the
failure mode currently being modeled or predicted. In this exercise, we need to tell the
model for Gear1 to use the events B1, EF1, and ES1 as the beginning, failure, and
suspension events (B, EF, and ES). We perform this mapping in a data mapping dialog
such as that shown in step 6 of the tutorial exercise in the Appendix. EXAKT stores the
results of the mappings in the EventToModel, IdentToModel, and VarToModel tables.
Although this mapping of data is difficult to understand in the abstract, don’t despair. It
will become clear as we work through this exercise.
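The Input/Output mapping described above can be pictured as a simple lookup. The sketch below is not EXAKT code and does not reproduce the tutorial database; the mapping rows and event records are hypothetical and follow only the general B1/EF1/ES1 pattern given in the text.

```python
# Hypothetical EventToModel-style rows:
# (ModelName, InputEventName in the database, OutputEventName in the model)
EVENT_TO_MODEL = [
    ("Gear1", "B1", "B"), ("Gear1", "EF1", "EF"), ("Gear1", "ES1", "ES"),
    ("Gear2", "B2", "B"), ("Gear2", "EF2", "EF"), ("Gear2", "ES2", "ES"),
]

# Hypothetical event records from the common database: (Ident, WorkingAge, Event)
EVENTS = [
    ("GearboxA", 0, "B1"),
    ("GearboxA", 0, "B2"),
    ("GearboxA", 4100, "EF1"),
    ("GearboxA", 4100, "ES2"),
]

def events_for_model(model_name, events, mapping):
    """Select the records that apply to one model and rename their events."""
    lookup = {inp: out for name, inp, out in mapping if name == model_name}
    return [(ident, age, lookup[event])
            for ident, age, event in events if event in lookup]

print(events_for_model("Gear1", EVENTS, EVENT_TO_MODEL))
print(events_for_model("Gear2", EVENTS, EVENT_TO_MODEL))
```

Each model sees only its own beginning, failure, and suspension events, expressed in the generic B, EF, and ES vocabulary, even though all records live in one shared table.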
Let us look then at the Events_MA table (Figure 10-37) for the equipment item
GearboxA to be analyzed.
Figure 10-37 Events_MA table for a gearbox with two failure modes.
Note that, in Figure 10-37, there is no B1 or B2 to distinguish the beginnings of the
lifecycles of the individual components (Gear1 and Gear2), but only a single “B” event.
Why? Because, in this particular equipment, the maintenance department adheres to a
policy that when one gear fails, both are replaced. Therefore we have chosen to use the
event “B” to mark the life beginnings of both components151. We have chosen to use
“EF1” to designate the failure of Gear1 and “EF2” to represent the failure of Gear2.
Now let us examine the Inspections_MA table of Figure 10-38.
Figure 10-38 View of the first 23 records of Inspections_MA table

The Inspections_MA table is a single data source from which multiple failure modes in
various components of an equipment may be predicted. Hence the decision agent at each
CBM inspection will generate individual recommendations and residual life estimates for
each component or failure mode covered by a CBM decision model. Each model will
refer to a specific failure mode or component. The maintenance manager will be advised,
therefore, not only that an equipment failure is imminent, but also which component or
failure mode is concerned. This is precisely the information needed to plan maintenance
and order parts.
Once the decision models have been built and deployed, a typical optimized CBM
recommendation report covering both failure modes at a point in time might resemble
that of Figure 10-39.
151
Therefore in Step 6 “B” in the database will be mapped to “B” in both models.
Figure 10-39 EXAKT output for two failure modes, GearOne and GearTwo
The report of Figure 10-39 tells a maintenance planner that Gear1 needs to be replaced,
but Gear2 is still in good condition. There is another type of information provided by
these decision models that would cause a manager or planner to reconsider the policy of
replacing both gears when either one fails. That information is given by the shape of the
boundary separating the green and red regions. It is a straight horizontal line. That tells us
that, for these gears, the probability of failure at any time is independent of age. Hence
there is likely little or no benefit in replacing Gear2 with the objective of prolonging its
life relative to the failure mode “tooth fracture”152.
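The link between an age-independent risk and the model parameters can be checked numerically. The sketch below assumes the Weibull proportional hazards form introduced in Chapter 10, in which a Weibull baseline hazard is multiplied by an exponential function of the covariates. All parameter and covariate values are hypothetical.

```python
import math

def weibull_phm_hazard(age, covariates, beta, eta, gammas):
    """Hazard = Weibull baseline (shape beta, scale eta) x exp(sum of gamma * Z)."""
    baseline = (beta / eta) * (age / eta) ** (beta - 1.0)
    return baseline * math.exp(sum(g * z for g, z in zip(gammas, covariates)))

Z = [1.2]            # hypothetical value of one health indicator
GAMMAS = [0.9]       # hypothetical covariate coefficient

# With shape parameter 1 the hazard depends on the covariates only, not on age
young = weibull_phm_hazard(500.0, Z, beta=1.0, eta=8000.0, gammas=GAMMAS)
old = weibull_phm_hazard(7500.0, Z, beta=1.0, eta=8000.0, gammas=GAMMAS)

# With shape parameter above 1 the same readings imply more risk at greater age
young_wearout = weibull_phm_hazard(500.0, Z, beta=2.0, eta=8000.0, gammas=GAMMAS)
old_wearout = weibull_phm_hazard(7500.0, Z, beta=2.0, eta=8000.0, gammas=GAMMAS)
```

When the shape parameter equals 1, a young and an old gear with the same indicator readings carry the same risk, which is consistent with the horizontal boundary seen in the decision graph.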
You may perform the steps in the Appendix (page ) that create and deploy the
model that we have just described. Follow these steps using the EXAKT modeling and
EXAKT decision programs.
Example 4 Data transformations

Often data, in its original form, cannot be used to monitor degradation in the health of a
component and predict its failure. Exercise 4 (data smoothing and fixing shape factor to
1) on page 324 describes methods by which to transform and process CM data for use in
a decision model.
152
One might choose to replace the gear if wearout (indicated, for example, by excessive backlash) is a
significant failure mode in this system. The monitored health indicators, H1 and H2 in the model, however,
target the weakening structure of a gear tooth (see Gear Tooth Failure). A separate model, perhaps
based on backlash inspection or some other vibration feature, should be built for the failure mode “gear
tooth wear”.
References
Cox, D.R., (1972) “Regression models and life tables (with discussion)”, J. Roy. Stat.
Soc. B, Vol. 34, pp. 187-220.
Jardine, A.K.S., Banjevic, D. and Makis, V., (1997) “Optimal replacement policy and the
structure of software for condition-based maintenance”, Journal of Quality in
Maintenance Engineering, Vol. 3, No. 2, pp. 109-119.
Campbell, J.D. and Jardine A.K.S. (Editors), (2001) Maintenance Excellence: Optimizing
Equipment Life-Cycle Decisions, Marcel Dekker, (Chapter 12: Optimizing Condition
Based Maintenance, by M. Wiseman).
Chapter 11. CBM Decision Making with Expert Systems
We learned in Chapter 7. (page 95) that, depending on the physics governing a given
application, we may choose from a variety of algorithms with which to carry out the signal
processing portion of CBM. Decision making (the third CBM sub-process) proceeds
similarly, using one or more of a diverse array of decision support tools. In Chapter 10.
Example 1 Creating and deploying a decision model (page 127) we developed a CBM
decision policy using statistical modeling techniques and software. A decision policy
helps maintenance personnel to interpret and act upon a set of condition monitoring
(CM) data. Where extensive human knowledge and experience are available, they too may
be used to build a CBM decision policy. A rule-based expert system encapsulates known
relationships between CM data and the deterioration in an asset that takes place due to
one or more failure modes. An algorithm (known as an inference engine) applies the
knowledge base to the current set of CM data. In this chapter we describe an expert
system developed by DLI Engineering153 called ExpertALERT™.
Figure 11-1 CBM signal processing and Decision making using an Expert System
153
www.dliengineering.com, Automated Bearing Wear Detection, Alan Friedman, Published in Vibration
Institute Proceedings 2004
Figure 11-1 outlines the signal processing and decision making portions of this CBM
approach. It traces the flow of information through the signal processing steps (steps 1-5)
and the decision making procedure (step 6) that uses a rule-based expert system.
Each machine to be monitored is set up with permanent test points154 positioned
strategically (Figure 11-2) in relation to the components of interest. The equipment is
monitored using ExpertALERT™ over a period of time thereby establishing a baseline
spectrum for each test point155 and each orientation. The baseline spectra are updated
automatically by the software and set at the average + 1 standard deviation.
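A baseline of "average + 1 standard deviation" can be computed bin by bin from a history of spectra. The sketch below is an illustration only: the text does not say whether the software averages in VdB or in linear units, nor which form of the standard deviation it uses, so averaging VdB values and using the population standard deviation are both assumptions. The history values are hypothetical.

```python
import statistics

def baseline_spectrum(history):
    """history: list of spectra, each a list of amplitudes (VdB) of equal length.
    Returns mean + 1 standard deviation for every spectral bin."""
    baseline = []
    for bin_values in zip(*history):           # gather one bin across all surveys
        mean = statistics.mean(bin_values)
        sd = statistics.pstdev(bin_values)     # population SD: an assumption
        baseline.append(mean + sd)
    return baseline

# Three hypothetical surveys of a two-bin spectrum
history = [[100.0, 80.0], [104.0, 80.0], [102.0, 80.0]]
baseline = baseline_spectrum(history)
```

A bin that never varies keeps its average as the baseline, while a bin that fluctuates from survey to survey is given a correspondingly higher baseline, so that normal scatter is less likely to be reported as an exceedance.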
Figure 11-2 An example of test point locations showing the three axes - Axial, Radial, and Tangential
The six steps of Figure 11-1 are described in each of the following sections.
Step 1 Data normalization

We desire to scale the abscissa of the spectrum in multiples (orders) of the forcing
frequency.156 If the shaft speed is known (from a tachometer signal), the algorithm
accomplishes this directly. If it is not known, the algorithm chooses a strong peak in a
window around the nominal speed, or around each of a number of nominal speeds (in the
case of a variable speed drive), and matches peaks, harmonics and sidebands in order to
determine the correct speed for normalizing the spectrum.
The normalization procedure also converts vibration amplitudes to a logarithmic scale in
units of VdB. This assists in the visualization of significant, yet low energy, peaks
alongside the dominant peaks due to the fundamental forcing frequency. The VdB scale
simplifies the interpretation of changes in vibration levels, for example:
o A 6 VdB increase = a doubling of vibration amplitude
154
Test points may be equipped with permanent triaxial accelerometers, or a triaxial accelerometer connected
to a portable data collector may be used. The barcoded test points must offer a solid screwed mounting for
the accelerometer.
155
In both a low and high frequency range
156
This simplifies distinguishing the non-synchronous peaks and their sidebands from the dominant forcing
shaft frequency and its harmonics, a necessary step in the diagnostic process.
o A 20 VdB increase = a tenfold increase in vibration amplitude.
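The two rules of thumb above follow from the decibel definition for amplitude quantities, in which a level change equals 20 times the base-10 logarithm of the amplitude ratio. A minimal sketch, assuming that definition (the reference level cancels out of any change, so none is needed here):

```python
import math

def amplitude_ratio(delta_vdb):
    """Amplitude ratio corresponding to a change of delta_vdb decibels."""
    return 10.0 ** (delta_vdb / 20.0)

def vdb_change(ratio):
    """Decibel change corresponding to a given amplitude ratio."""
    return 20.0 * math.log10(ratio)

for delta in (6, 20):
    print(f"+{delta} VdB -> amplitude x {amplitude_ratio(delta):.2f}")
```

Strictly, an exact doubling corresponds to slightly more than 6 VdB, so "6 VdB = doubling" is a convenient rounding.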
Step 2 The screening matrix

Next, automated spectral peak extraction and a noise floor calculation are performed. The
resulting data populates a “screening matrix”. The columns of the screening matrix
represent 10 preselected orders of shaft rate (for example 1x, 2x, …, 10x), the two highest
non-synchronous peaks in a low and high range spectrum, and a noise floor157 value.
As an example, let us assume an equipment item has two test points. Then the screening
matrix will have (10 orders + 2 peaks x 2 ranges) x 3 orientations x 2 test points + 1 noise
floor = 85 columns. One row of the screening matrix will hold the changes in amplitude
from the previous inspection. A second row will hold the deviations from the baseline
spectrum. A third row will hold the corresponding vibration amplitudes. Hence, in this
instance, 85 columns x 3 rows = 255 extracted features will have been placed into the
screening matrix, ready for further processing.
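The column count above can be reproduced, and extended to other numbers of test points, with a few lines. The constants come from the text; the function and variable names are illustrative only.

```python
ORDERS = 10          # preselected orders of shaft rate (1x to 10x)
NONSYNC_PEAKS = 2    # highest non-synchronous peaks kept per frequency range
RANGES = 2           # low and high range spectra
ORIENTATIONS = 3     # axial, radial, tangential
ROWS = 3             # change since last inspection, deviation from baseline, amplitude

def screening_columns(test_points):
    """Number of columns in the screening matrix for a given number of test points."""
    per_orientation = ORDERS + NONSYNC_PEAKS * RANGES
    return per_orientation * ORIENTATIONS * test_points + 1   # + 1 noise floor column

columns = screening_columns(test_points=2)
features = columns * ROWS
print(columns, features)
```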
The noise floor calculation measures any general increase in random noise. Both impacts
and random noise in a time waveform cause the spectrum to become elevated. As
bearings wear, they typically produce larger quantities of non-periodic vibration and
impacts. This raises the noise floor of the spectrum. The automated diagnostic system
uses an algorithm to calculate the level of the noise floor. This value is then compared to
a baseline value. Increases in noise floor level add to the severity (see step 6) of the
bearing wear diagnosis and may even trigger a diagnosis in certain cases when bearing
tones are not evident.
Step 3 Cepstrum analysis

A cepstrum transformation158 of the FFT spectrum is performed next. A cepstrum (Figure
11-3) highlights series of spectral peaks that are evenly spaced in the spectrum. These are
called harmonics. Harmonics can be synchronous (multiples of shaft speed) or
non-synchronous. The algorithm searches the spectrum for non-synchronous harmonics and
any sidebands. If found, they are flagged as possible bearing tones, to be processed further
in steps 5 and 6.
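The idea that evenly spaced spectral peaks collapse into a single cepstral peak can be demonstrated on a synthetic spectrum. The sketch below is deliberately simplified: it starts from a made-up spectrum rather than a measured waveform, omits the logarithm that a true cepstrum applies, and uses a slow direct transform so that it needs only the standard library. The bin count and peak spacing are hypothetical.

```python
import cmath
import math

N = 512          # number of spectral bins (hypothetical)
SPACING = 16     # spacing of the harmonic family, in bins (hypothetical)

# Synthetic spectrum: a family of evenly spaced peaks, each three bins wide
spectrum = [0.0] * N
for centre in range(0, N, SPACING):
    spectrum[centre] += 1.0
    spectrum[(centre - 1) % N] += 0.5
    spectrum[(centre + 1) % N] += 0.5

def dft_magnitude(values, k):
    """Magnitude of the k-th discrete Fourier coefficient (direct evaluation)."""
    n = len(values)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * i / n)
                   for i, v in enumerate(values)))

# "Spectrum of the spectrum": the strongest line identifies the peak spacing
magnitudes = {k: dft_magnitude(spectrum, k) for k in range(1, N // 2 + 1)}
k_peak = max(magnitudes, key=magnitudes.get)
estimated_spacing = N / k_peak
print(f"estimated spacing: {estimated_spacing} bins")
```

In a real analysis the recovered spacing would be expressed in orders and compared with 1x (synchronous families and sidebands) or with a non-integer value such as 3.61x (a candidate bearing tone).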
157
An increase in the noise floor level is an indication of impacting and non-periodic (or random)
vibration. Both of these are associated with later stage bearing wear.
158
One may say in a general sense that the more harmonics and sidebands present, the worse the condition
of the bearing. Thus, not only does one wish to know if a peak is part of a larger family of peaks, one also
wants to get an idea of how much energy is contained in the series. Cepstrum analysis is used for
automating this task. The Cepstrum is a power spectrum of a power spectrum of a waveform; therefore, any
periodicities in the spectrum (such as harmonic series or sideband families) will clearly appear as a peak in
the Cepstrum.
Figure 11-3 Cepstrum showing peaks with 1x and 3.61x spacings
Figure 11-4 Spectrum showing the synchronous and non-synchronous harmonics and their 1x
spaced sidebands. The abscissa is scaled in “orders” or multiples of the shaft speed.
The physics of each situation dictate the signal processing method selected. Non-
synchronous peaks, such as those at 3.61 and 7.22 orders (Figure 11-4), are candidates for
“bearing tones” that signal bearing faults. If, in addition, the non-synchronous peaks
display sidebands spaced at orders of the shaft speed, an inner race defect is likely. Figure
11-5 illustrates the physical explanation for bearing tones and the appearance of
sidebands, with respect to an inner race spall or crack.
Figure 11-5 Physical explanation of non-synchronous peaks and their 1x sidebands related to an
inner race spall.
Step 4 Demodulation
Demodulation (also called “envelope detection”) is a signal processing technique used by
ExpertALERT to supplement and verify the information drawn from the cepstrum and
spectrum analyses. Demodulation provides an independent confirmation of bearing
defects.
If there is a spall on a bearing race, each time a ball passes it will impact and “ring” the
bearing causing it to resonate at high frequencies. The resulting vibrations can be
demodulated in order to extract the forcing frequency that is causing the ringing. The
forcing frequencies will appear as peaks in the demodulated spectrum. If they match the
bearing tones from the screening matrix and the cepstrum, they provide further
confirmation of a bearing defect. A distinct advantage of demodulation is that high
frequencies do not travel far in a machine. Thus the demodulation process can localize
the defective bearing. For example, if you see bearing tones at the same frequency in the
narrow band spectral data from two different locations on the machine, and the demod
data has matching peaks at one location (but not the other), you can assume that the
location with the matching peaks is the one with the bearing problem. The spectra of
Figure 11-6, Figure 11-7, Figure 11-8, and Figure 11-9 illustrate this point precisely.159
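Envelope detection can be illustrated with a synthetic signal in which a high-frequency "ringing" is modulated at the rate of the impacts. The sketch below is not the ExpertALERT implementation. It uses simple rectification in place of a full demodulator (which would normally band-pass filter around the resonance first), and the sample rate, carrier, and defect frequency are hypothetical.

```python
import cmath
import math

FS = 1024             # samples per second (hypothetical)
CARRIER_HZ = 200.0    # resonance excited by each impact (hypothetical)
DEFECT_HZ = 13.0      # repetition rate of the impacts (hypothetical)

t = [i / FS for i in range(FS)]     # one second of data
signal = [(1.0 + 0.8 * math.cos(2 * math.pi * DEFECT_HZ * x))
          * math.sin(2 * math.pi * CARRIER_HZ * x) for x in t]

# Envelope by rectification; subtract the mean so the DC term does not dominate
rectified = [abs(s) for s in signal]
mean = sum(rectified) / len(rectified)
envelope = [r - mean for r in rectified]

def dft_magnitude(values, k):
    """Magnitude of the k-th discrete Fourier coefficient (direct evaluation)."""
    n = len(values)
    return abs(sum(v * cmath.exp(-2j * math.pi * k * i / n)
                   for i, v in enumerate(values)))

# With exactly one second of data, coefficient k corresponds to k Hz
low_band = {k: dft_magnitude(envelope, k) for k in range(1, 101)}
detected_hz = max(low_band, key=low_band.get)
print(f"strongest envelope line: {detected_hz} Hz")
```

The raw signal contains energy only near the carrier, yet the envelope spectrum shows its strongest line at the impact rate, which is the forcing frequency one then compares with the bearing tones found in the earlier steps.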
Figure 11-6 Spectrum from motor location showing bearing tone peak
Figure 11-7 Demodulated spectrum from motor location showing matching peak
159
Alan Friedman, DLI Engineering, Demodulation - June 1999 issue of P/PM
Figure 11-8 Spectrum from pump location showing same bearing tone
Figure 11-9 Demodulated spectrum from pump location, but showing no bearing tones. Hence
ExpertALERT can conclude that the bearing defect is on the motor.
Step 5 Component specific diagnostic matrices

The screening matrix is transformed into component specific diagnostic matrices
(CSDMs). This transformation extracts values at specific frequencies that characterize
possible faults in a given component. It is interesting to note that the techniques of Steps
2, 3, and 4 require no specific knowledge of bearing geometry (e.g. number of rolling
elements, inner and outer race diameters, pitch diameter, and so on) for the accurate
detection of developing faults. Nevertheless, the CSDM may include specific
frequencies based on bearing manufacturing data. Knowledge rules may refer to these
frequencies, thus extending diagnostic confidence.
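When bearing geometry is available, the expected bearing tones can be calculated in orders of shaft speed. The sketch below uses the standard kinematic formulas for a bearing whose inner race turns with the shaft and whose outer race is stationary. These formulas are general textbook relationships and are not taken from ExpertALERT; the geometry in the usage line is hypothetical.

```python
import math

def bearing_tone_orders(n_elements, element_dia, pitch_dia, contact_angle_deg=0.0):
    """Return (outer race, inner race) defect frequencies in orders of shaft speed."""
    ratio = (element_dia / pitch_dia) * math.cos(math.radians(contact_angle_deg))
    bpfo = 0.5 * n_elements * (1.0 - ratio)   # ball pass frequency, outer race
    bpfi = 0.5 * n_elements * (1.0 + ratio)   # ball pass frequency, inner race
    return bpfo, bpfi

# Hypothetical bearing: 9 rolling elements, 7.9 mm elements, 38.5 mm pitch diameter
bpfo, bpfi = bearing_tone_orders(n_elements=9, element_dia=7.9, pitch_dia=38.5)
print(f"outer race {bpfo:.2f}x, inner race {bpfi:.2f}x")
```

Both results are non-integer multiples of shaft speed, which is why bearing tones show up as non-synchronous peaks, and the two always sum to the number of rolling elements.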
Step 6 Decision making

Steps 1 to 5 may be considered the signal processing portion of ExpertALERT. They
extract informative features from the raw vibration data upon which the reasoning engine
of the expert system may now operate. Step 6 performs the decision making function,
interpreting the extracted features and identifying the likely fault. In Step 6 each CSDM
is processed through a series of diagnostic templates consisting of rules that pass or fail
every fault known to occur in the component. Furthermore, the expert system computes a
score based on the feature’s exceedance above the threshold value coded in each rule.160
The knowledge in the diagnostic templates was developed from an understanding of the
physics of the machinery and its causal relationship with the monitored data.
A simple example is the rule for imbalance. This rule checks the matrix elements (of the
CSDM) that contain the rotational rate levels and exceedances over baseline. The rule
then determines whether these values are high in a radial direction. If so, other checks
determine that the problem is not misalignment or looseness. Finally, the algorithm
confirms the imbalance diagnosis.
Motor (VdB at 1x)
Orientation    Amplitude    Exceedance over baseline
Axial          105          7
Radial         118          10
Tangential     117          10

Pump (VdB at 1x)
Orientation    Amplitude    Exceedance over baseline
Axial          104          9
Radial         113          9
Tangential     92           2
Figure 11-10 Vertical pump and 1x vibration readings

As an example, consider, for simplicity, only the 1x vibration levels of the vertical motor
and centrifugal pump (with coupling) in Figure 11-10. Excessive 1x vibration may
indicate motor imbalance, pump imbalance, angular misalignment, foundation horizontal
flexibility, a radial or thrust bearing clearance problem, or motor cooling fan blade
damage. Expert system rules based on knowledge of the configuration need to deduce the
fault and identify the faulty component.
Looking at the axial and radial data at both locations we might surmise angular
misalignment since 1x axial is abnormally high at both motor and pump. Alternatively, it
could be motor imbalance or pump imbalance, since 1x radial is abnormally high at either
end and radial is higher than axial. Axial motion is, in fact, characteristic (due to rocking)
of unbalance in a vertical pump. Another characteristic of a vertical pump is that one
direction, the direction of external structural support, is always stiffer than the other
directions. The radial axis in this case is the direction of structural flexibility, so that
radially, the pump is being “wagged” by the motor imbalance. The low 1x levels at the
160
Rule thresholds form a matrix that includes both absolute amplitudes and exceedances over the (mean +
1 sigma) baseline.
pump in the tangential direction can be explained by the fact that the tangential axis is the
direction of high structural stiffness and therefore the tangential component of the
vibration due to motor imbalance does not transmit to the pump.
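The first screening step in the reasoning above, deciding which 1x readings are abnormal, can be sketched with the values of Figure 11-10. The exceedance limit of 6 VdB is a hypothetical threshold chosen for illustration and is not an ExpertALERT setting; the subsequent reasoning about structural stiffness is left to the rules and is not modeled here.

```python
# Figure 11-10 readings at 1x: (amplitude in VdB, exceedance over baseline in VdB)
READINGS = {
    "Motor": {"Axial": (105, 7), "Radial": (118, 10), "Tangential": (117, 10)},
    "Pump":  {"Axial": (104, 9), "Radial": (113, 9),  "Tangential": (92, 2)},
}
EXCEEDANCE_LIMIT = 6    # hypothetical threshold in VdB

def abnormal_directions(readings, limit):
    """List, for each location, the directions whose exceedance reaches the limit."""
    return {
        location: [axis for axis, (_, exceed) in axes.items() if exceed >= limit]
        for location, axes in readings.items()
    }

flags = abnormal_directions(READINGS, EXCEEDANCE_LIMIT)

# Radial amplitude is higher than axial at both locations, as the discussion notes
radial_dominant = all(axes["Radial"][0] > axes["Axial"][0]
                      for axes in READINGS.values())
print(flags, radial_dominant)
```

Only the pump tangential reading stays within the limit, which matches the observation that the tangential axis is the stiff direction at the pump.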
Rules are activated by machinery component type (for example, in the preceding,
“vertical motor pump set with coupling”) as defined by the user in the ExpertALERT
software. A rule for bearing wear in a compressor will look slightly different from the
rule for bearing wear in an AC motor. Each individual machine component type may
have numerous rules for bearing wear. If the extracted features satisfy the
requirements for a rule, it means the fault condition exists.
After information has been extracted from the spectra as described above in steps 1 to 5,
it is passed through all of the rule templates that apply to the general machine type to see
if any faults exist. The rules are empirically based on thousands of machine tests
collected over more than 20 years and are constantly refined as new information becomes
available. If a rule is edited for any reason, the change is run through all past diagnoses to
ensure that it does not change any previously correct results.
A typical rule looks something like this in terms of its logic:
1. If the sum of the exceedance over baseline of all perceived bearing tones in all
three axes and all test points (Cepstrum confirmed) is higher than a threshold, or
the sum of the noise floor readings from all spectra has increased over the
baseline or alarm by a certain amount, then the rule passes.
2. If the sum of the amplitudes of all of the perceived bearing tones exceeds some
threshold then the rule passes.
3. If none of the perceived bearing tones are above a minimum threshold, the rule
does not pass.
4. If the sum of the shaft rate harmonics from 16x to 100x is above some value,
add to the severity.
5. If the noise floor is above some level add to the severity, and if it’s above a higher
level, add more to the severity.
6. If the sum of the other un-defined peaks that were not confirmed by Cepstrum is
above some threshold, add more to the severity.
7. If sub harmonics of the shaft rate have exceeded the baseline by a certain amount,
add to the severity.
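The seven items above can be sketched as a single function that returns a pass/fail result and a severity score. This is an interpretation, not the ExpertALERT rule: the items are applied literally in order (so item 3 overrides items 1 and 2), each "add to the severity" is counted as one point, and every threshold is a placeholder value invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Features:
    tone_exceedance_sum: float      # cepstrum-confirmed tones, exceedance over baseline
    tone_amplitude_sum: float       # sum of amplitudes of the perceived bearing tones
    max_tone_amplitude: float       # largest single bearing tone
    noise_floor_rise: float         # increase of the noise floor over baseline
    noise_floor_level: float        # absolute noise floor level
    high_harmonics_sum: float       # shaft rate harmonics from 16x to 100x
    unconfirmed_peaks_sum: float    # peaks not confirmed by cepstrum
    subharmonic_exceedance: float   # sub harmonics of shaft rate, over baseline

# Placeholder thresholds (assumptions, not values from the software)
LIMITS = {"tone_exceedance": 20, "noise_floor_rise": 6, "tone_amplitude": 250,
          "min_tone": 70, "high_harmonics": 300, "noise_floor_lo": 60,
          "noise_floor_hi": 70, "unconfirmed": 200, "subharmonic": 6}

def bearing_wear_rule(f, lim=LIMITS):
    """Return (passes, severity) following items 1 to 7 in order."""
    passes = (f.tone_exceedance_sum > lim["tone_exceedance"]      # item 1
              or f.noise_floor_rise > lim["noise_floor_rise"]     # item 1
              or f.tone_amplitude_sum > lim["tone_amplitude"])    # item 2
    if f.max_tone_amplitude < lim["min_tone"]:                    # item 3
        passes = False
    if not passes:
        return False, 0
    severity = 1
    if f.high_harmonics_sum > lim["high_harmonics"]:              # item 4
        severity += 1
    if f.noise_floor_level > lim["noise_floor_lo"]:               # item 5
        severity += 1
    if f.noise_floor_level > lim["noise_floor_hi"]:               # item 5, higher level
        severity += 1
    if f.unconfirmed_peaks_sum > lim["unconfirmed"]:              # item 6
        severity += 1
    if f.subharmonic_exceedance > lim["subharmonic"]:             # item 7
        severity += 1
    return True, severity

sample = Features(25, 200, 80, 2, 65, 100, 50, 0)
print(bearing_wear_rule(sample))
```

Separating the pass/fail conditions from the severity additions mirrors the structure of the list: the first three items decide whether the fault is reported at all, and the remaining four only grade it.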
Note that these rules are empirically based; that is, the rule thresholds for
absolute levels or for exceedances over a baseline have been tweaked until they yield
the correct answer as determined by a human expert and/or direct field feedback.
In other words, the thresholds mentioned in the example rule above have been tuned to
give the correct answer for any machine to which this particular rule applies.
There are sufficient rule templates for each machine type to catch practically all possible
bearing wear patterns that may exist in the data.
Once a fault has been diagnosed, the user will continue to monitor the machine and look
for changes in severity of the fault. The rate at which the severity increases gives a good
indication of when the bearings should be overhauled.
The amounts by which the values in the CSDM exceed the threshold values (set up in the
rules based on experience and knowledge) are scored and converted into a relative
severity. This provides a normalized scale with which to judge the state of health of each
component. Thus the relative severity for all components in the equipment can be trended
on a single graph, as in Figure 11-11. The graph provides a decision support tool for
performing a corrective action on a component whose severity is high or has increased
substantially. In the following section, we will propose to extend the automated diagnosis
one step further to estimate remaining life and provide an optimized repair decision.
Figure 11-11 Severity graphs for an equipment item with three components
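As a rough illustration of how exceedances might be turned into a relative severity and then trended, consider the following Python sketch. The scoring formula (sum of fractional exceedances over threshold, expressed as a percentage) and the straight-line trend are assumptions made for the example; the actual scoring scheme is not specified here.

```python
def relative_severity(values, thresholds):
    """Sum of fractional exceedances over threshold, as a percentage (assumed scheme)."""
    total = 0.0
    for name, value in values.items():
        limit = thresholds[name]
        if value > limit:
            total += (value - limit) / limit
    return 100.0 * total


def severity_trend(days, severities):
    """Least-squares slope of severity against time, in severity units per day."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(severities) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, severities))
    return sxy / sxx


# Hypothetical readings for one component on three test dates.
thresholds = {"bearing_tones": 10.0, "noise_floor": 10.0}
history = [
    (0, {"bearing_tones": 11.0, "noise_floor": 5.0}),
    (30, {"bearing_tones": 12.0, "noise_floor": 5.0}),
    (60, {"bearing_tones": 13.0, "noise_floor": 5.0}),
]
days = [d for d, _ in history]
scores = [relative_severity(v, thresholds) for _, v in history]
rate = severity_trend(days, scores)  # severity units gained per day
```

A rising slope on such a trend is the kind of signal that would prompt a closer look at the component.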
A proposed hybrid decision tool

Following step 6, the automated diagnostic tools hand over their findings to the human
decision makers. Can we process each diagnostic fault and its respective severity one
step further to provide:
1. A residual life estimate relative to each failure mode, and
2. An optimized decision as to whether
i. to effect repair immediately,
ii. to repair within a particular time period from the current time, or
iii. to continue operation until the next inspection?
The severity values computed for each fault, as well as the absolute and relative values of
the relevant features, may be used as covariates in a proportional hazard model such as
that described in Chapter 10. The next section describes the ABB fault simulator that
may be used to demonstrate this proposed extension to ExpertALERT’s output report.
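To make the idea of severity as a covariate concrete, the sketch below evaluates a hazard of the usual Weibull proportional hazards form, h(t, Z) = (β/η)(t/η)^(β−1) exp(γ1 Z1 + ... + γm Zm). The parameter values and the choice of covariates are hypothetical; in practice they would be estimated from life cycle and condition data.

```python
import math


def weibull_phm_hazard(age, covariates, beta, eta, gammas):
    """Hazard at a given working age and covariate vector (Weibull PHM form)."""
    baseline = (beta / eta) * (age / eta) ** (beta - 1.0)
    risk = math.exp(sum(g * z for g, z in zip(gammas, covariates)))
    return baseline * risk


# Hypothetical model: one covariate, the bearing wear severity score.
beta, eta, gammas = 2.0, 1000.0, [0.01]

low = weibull_phm_hazard(500.0, [100.0], beta, eta, gammas)
high = weibull_phm_hazard(500.0, [300.0], beta, eta, gammas)
ratio = high / low  # equals exp(0.01 * (300 - 100)) at any fixed age
```

The ratio shows the defining property of the model: at a fixed working age, a change in the covariates scales the hazard by a factor that does not depend on age.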
The ABB fault simulator
Figure 11-12 The fault simulator (top left) gradually induces one or more failure modes (for example,
misalignment or unbalance). The failure mode (unbalance) causes the failure mechanism (right) to
proceed towards failure. The failure is the loss of function to hold the Tee in place by spring friction
forces under the stress of vibration forces transmitted through the structure.
In the fault simulator, a spring and friction failure mechanism has been set up with the
following characteristics desirable for the study of a failure modeling and prediction
methodology.
1. A functional failure is clearly defined (by the release of the tee causing the golf
ball to trigger a switch).
2. The (random variable) time to failure can depend both on working age and CM
data.
3. A life cycle can be as small as 1 minute, permitting a large sample of life cycles
from which to build and subsequently test the predictive model.
How ‘predictive’ can such a model be?

The “goodness” (predictability) of the model depends on two factors:
1. How good the data is (its intrinsic information content regarding a progressing
failure mode), and
2. How big the sample is (the number of life cycles used to build the model).
The “better” the data, the smaller the sample you need. The less the data correlates with
the targeted failure mode, the larger the sample you need for obtaining a good model.
Figure 11-13 Running recommendations from the EXAKT agent
Figure 11-13 displays the running prognostic results that are updated at each inspection.
The “Optimal Maintenance Decision” may be one of:
1. Continue operation, or
2. Plan to replace in a specified number of time units, or
3. Replace immediately.
The “Estimated Time to Failure” is the time to replacement estimate (TRE). TRE is an
estimate of the time at which a replacement or overhaul will occur either by PM (as a
result of the CBM optimal decision policy recommendation) or by failure. The TRE is
not to be confused with the residual life estimate (RLE), which estimates the time to failure
only (replacement by PM is not considered). Both TRE and RLE are interesting figures
for maintenance personnel. TRE, however, may be the more interesting to people who are
concerned with maintenance management, e.g., production planning, manpower
scheduling, spare parts management. RLE, on the other hand, may be more interesting to
people involved in equipment design, procurement, and specification of reliability or risk
of the unit.
Figure 11-14 Key CBM performance indicators

Figure 11-14 shows the console display of the CBM program KPIs for the demonstration
fault simulator unit running an EXAKT optimal decision policy. The predictability of the
CBM policy is measurable. It is reflected in the “Time to Failure Estimate Performance”.
This figure is the average error in the TRE calculated at each inspection of every life
cycle. A histogram (Figure 11-15) is another way to indicate the predictive performance
of the model.
Figure 11-15 Histogram showing the errors in replacement time estimate over 678 inspections. For
example, the TREs calculated at 412 inspections were within 5% of the actual (functional or potential)
failure time.
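The average error and the histogram can both be derived from the same list of inspection records. In the Python sketch below we assume that each record holds the predicted and the actual working age at replacement, and that the error is expressed as a percentage of the actual value; both the record layout and the error definition are assumptions made for illustration.

```python
from collections import Counter


def tre_errors(inspections):
    """Percentage error of each estimate; inspections are (predicted, actual) pairs."""
    return [100.0 * abs(p - a) / a for p, a in inspections]


def error_histogram(errors, bin_width=5.0):
    """Count errors in bins [0, 5), [5, 10), ... percent."""
    counts = Counter(int(e // bin_width) for e in errors)
    return {(b * bin_width, (b + 1) * bin_width): counts[b] for b in sorted(counts)}


# Hypothetical records from four inspections.
records = [(102.0, 100.0), (96.0, 100.0), (110.0, 100.0), (100.0, 100.0)]
errors = tre_errors(records)
average_error = sum(errors) / len(errors)
histogram = error_histogram(errors)
```

Here three of the four estimates fall in the first bin (within 5% of the actual value), which is the kind of statement the histogram in Figure 11-15 supports.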
The hazard function curves (in Figure 11-16) for potential failures and functional failures
provide an overall performance check on the effectiveness of the CBM program.
Figure 11-16 Hazard functions for potential and functional failures
If the difference between TF (total failures) and the FF (functional failures) hazard curves
is small, that indicates that the CBM program is effective. That is, functional failures
(those that have important consequences) are being preempted by the CBM detection and
correction of potential failures (that have no, or relatively minor, consequences).
Figure 11-17 ExpertALERT operating on the ABB Asset Optimizer Workplace
Figure 11-17 illustrates a typical report issued by ExpertALERT. It contains quantitative
information relating to the detected fault as well as a recommendation and a “Figure of
Merit” indicating the fault severity. The CBM demo links these outputs from
ExpertALERT to an EXAKT decision agent. The agent applies a model of the severity
ratings and other relevant data extracted and computed by ExpertALERT. The new
combined report contains not only a structured identification and severity rating of the
fault, but also an optimized recommendation including an estimate of the time to failure.
Chapter 12. Case based reasoning
It is not enough to improve just incrementally from your
past performance or that of other company divisions. To
compete globally, you must look everywhere to learn new
methods. Make yourself a student of the best of the best,
particularly in unrelated business sectors.
– John D. Campbell161
Introduction
A thin line separates diagnostics from prognostics. Condition based maintenance
(described in Chapter 6, page 73) detects potential failures, which, in themselves,
provoke relatively minor consequences. When maintenance personnel detect and repair
potential failures, they avoid the dire consequences of a functional failure. In a similar
vein, diagnostics begin with the detection of a “fault” indicator, which, in and of itself,
often has few or no consequences, but which portends a more serious functional failure.
Hence the diagnostic process often meets the RCM criterion for “on-condition
maintenance” as stipulated by Nowlan and Heap (see page 83). One or more of a variety
of fault indicators can initiate the troubleshooting process. Some warn of failure of back-
up functions. Others indicate the failing performance of some function in a subsystem. In
all cases, we require a quick and efficient process, based on the application of knowledge
and experience, that will trace the fault indicator to its root cause (that is, its failure
mode), whereupon we will remediate the cause through a repair or replacement action.
A case based reasoning system extends the five fundamental reliability-centered
knowledge elements introduced in Chapter 1. (page 13). Its database structure
classifies knowledge such that day-to-day experience may increment, expand, and refine
the knowledge base. Not only does CBR result in quick, efficient, guided162 information
retrieval but the process itself enables collaboration, retention, revision, and reuse of
knowledge.
161
Uptime, Strategies for Excellence in Maintenance Management, Productivity Press, 1995
162
The quality of that guidance impacts the cost and time of diagnosis.
Intelligent agents assist maintenance troubleshooters through case based reasoning (CBR).
Figure 12-1. The case based reasoning troubleshooting process.
Figure 12-1 illustrates four of the five CBR functions:
1. To identify a range of candidate (possible) solutions given the symptom(s),
2. To gather additional information that confirms or refutes the possible solutions,
3. To determine the most likely solutions (symptoms similar to the situation as
described),
4. To evaluate the proposed solution, and
5. To update the knowledge base (Figure 12-3) by learning from this experience.
Efficient Troubleshooting
Intelligent troubleshooting poses the right questions in the best order. A well designed
case based reasoning system guides the technician or engineer along the most practical
and least costly path to a solution. It poses questions and suggests solutions by
considering relevant data and by evaluating:
⇒ Similarity of past cases to the current symptoms
⇒ Frequency of occurrence of similar cases
⇒ Cost and time to get an answer
⇒ Cost and time of repairs
⇒ Information gain - the ability of a question to eliminate inappropriate solutions
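The interplay of these criteria can be sketched with a toy case base. In the Python example below, similarity is the fraction of observed attributes that match a case, frequency weights the cases, and each unanswered question is ranked by its information gain (reduction in entropy over the frequency-weighted cases) per unit cost. The cases, costs, and scoring formulas are hypothetical and are not the algorithm of any particular CBR product.

```python
import math
from collections import defaultdict

# Hypothetical lawnmower cases: frequency of occurrence and expected observations.
CASES = [
    {"name": "clogged air filter", "freq": 6,
     "obs": {"starts": "hard", "smoke": "black", "surges": "yes"}},
    {"name": "stale fuel", "freq": 3,
     "obs": {"starts": "hard", "smoke": "none", "surges": "yes"}},
    {"name": "fouled spark plug", "freq": 1,
     "obs": {"starts": "no", "smoke": "none", "surges": "no"}},
]
COST = {"starts": 1.0, "smoke": 1.5, "surges": 1.0}  # assumed cost to get an answer


def similarity(case, observed):
    """Fraction of the observed attributes whose values match the case."""
    if not observed:
        return 0.0
    hits = sum(case["obs"].get(a) == v for a, v in observed.items())
    return hits / len(observed)


def entropy(weights):
    """Shannon entropy (bits) of a set of non-negative weights."""
    total = sum(weights)
    return -sum(w / total * math.log2(w / total) for w in weights if w > 0)


def information_gain(cases, attribute):
    """Expected entropy reduction from learning the value of one attribute."""
    before = entropy([c["freq"] for c in cases])
    groups = defaultdict(list)
    for c in cases:
        groups[c["obs"].get(attribute)].append(c["freq"])
    total = sum(c["freq"] for c in cases)
    after = sum(sum(g) / total * entropy(g) for g in groups.values())
    return before - after


def rank_questions(cases, unanswered):
    """Order questions by information gain per unit cost, best first."""
    return sorted(unanswered,
                  key=lambda a: information_gain(cases, a) / COST[a],
                  reverse=True)


next_questions = rank_questions(CASES, ["starts", "smoke", "surges"])
```

With these numbers the question about smoke is asked first: although it costs more to answer, it separates the most frequent case from the other two.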
Figure 12-2: A typical CBR CaseBank SpotLight™ session
Figure 12-2 shows a typical CBR session. The troubleshooting conversational “assistant”
does not demand that the technician answer every question. The user may elect to answer
or ignore any question, and may provide answers in the most convenient order. The tool
suggests but does not enforce a specific question order. At each step, as the diagnostic
effort unfolds, the CBR program re-sequences questions and re-prioritizes solutions by
re-evaluating all information known up to that point.
Data may be entered manually, or it may be retrieved automatically by querying relevant
databases or intelligent test equipment. As the trouble-shooter probes the symptoms, the
CBR algorithms “reconsider” the data, pose new questions, and re-evaluate the possible
solutions until the user determines that the solution has been found.
The CBR tool elicits notes and additional observations during a session where such
observations are lacking in the case base. Subject matter experts163 monitor each
completed session, harvesting the data, where appropriate, for case-base development.
Figure 12-3 illustrates this continuing process.
Figure 12-3: Managing the knowledge base

Figure 12-3 describes the most significant feature of a case based reasoning system, that
of continuous enhancement of the knowledge base. Knowledge integrity is assured
through expert review and classification of all completed sessions.
Case Base Development

The development of a case based reasoning system requires the use of software. In this
chapter we review the product SpotLight164. Before embarking on any new development
system, we must first assimilate a specialized set of terminology:
Terminology
Subject: An item of interest.
Domain / subject breakdown: A tree structure of parents and children that describe the
knowledge area to be captured. A subject may have multiple parents.
Attribute: A characteristic that is measurable, testable, or observable. It is attached to
one or more parent subjects in the domain.
Attribute structure: Name, Description, Question, Values, References.
163
This takes place off site as a web application service or is performed by on-site subject matter experts
(maintenance engineer, planner, or technician) trained in the use of the software.
164
CaseBank Technologies, www.casebank.com
Attribute types: Logical (T/F, Y/N), Symbolic list (Corroded, cracked, loose), Ordered
list (none, low, med, hi), Integer, Real, Multi-valued (several selections may be valid at
once, e.g.: one or more fault codes shown on a display unit).
Attribute categories: Symptomatic (e.g. vibration level), Root causes (e.g. Piston –
Status – seized, free, sticking), Configuration (e.g. Power rating HP – 130, 150)
Observation: Assignment of a value to an attribute to describe the current scenario (e.g.
Master Caution Light – illuminated).
Case (aka Solution): Concise information representing a type of problem. A case most often
represents a failure, but could be an operating error or a normal condition that is often
misinterpreted as a problem.
Casebase (aka Knowledgebase): A repository of cases upon which the reasoning engine
operates.
Session: The data created in the process of matching the characteristics of the current
problem to the cases in the casebase. A session is an “instance” of a case.
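The relationships among these terms can be sketched as a small data model. The Python classes below are an illustrative assumption about how attributes, observations, cases, and sessions might be represented; they are not the schema of the SpotLight product.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    question: str
    kind: str           # "logical", "symbolic", "ordered", "integer", "real" or "multi"
    values: tuple = ()  # allowed values for symbolic, ordered and multi-valued types

    def accepts(self, value) -> bool:
        """Check that a value is valid for this attribute type."""
        if self.kind == "logical":
            return isinstance(value, bool)
        if self.kind in ("symbolic", "ordered"):
            return value in self.values
        if self.kind == "integer":
            return isinstance(value, int) and not isinstance(value, bool)
        if self.kind == "real":
            return isinstance(value, (int, float)) and not isinstance(value, bool)
        if self.kind == "multi":
            return len(value) > 0 and set(value) <= set(self.values)
        return False


@dataclass
class Case:
    title: str
    observations: dict  # attribute name -> expected value


@dataclass
class Session:
    observations: dict = field(default_factory=dict)

    def observe(self, attribute: Attribute, value) -> None:
        """Record an observation: the assignment of a value to an attribute."""
        if not attribute.accepts(value):
            raise ValueError(f"{value!r} is not valid for {attribute.name}")
        self.observations[attribute.name] = value


caution = Attribute("Master Caution Light", "Is the Master Caution Light illuminated?",
                    "symbolic", ("illuminated", "off"))
wear = Attribute("Wear level", "How much wear is visible?",
                 "ordered", ("none", "low", "med", "hi"))
session = Session()
session.observe(caution, "illuminated")
session.observe(wear, "low")
```

Validating each observation against its attribute type is one simple way to keep session data consistent enough to be matched against the cases in the casebase.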
Building a knowledge domain

The domain and its cases are built in the SpotLight “Domain/Case Editor”. The domain
evolves as cases are developed. The following is an extract from a domain subject
breakdown:
Subjects (displayed in the domain in upper case characters) can be physical components or they
can be categories used to index physical components (e.g. COMPLAINTS CONCERNING
SNOWBLOWER OPERATION).
Attributes (displayed in lower case and prefixed by a “?”) are observable properties or
behaviours. They are separable into sub-categories (e.g. Engine sounds .. “With auger clutch
engaged”, or “With traction clutch engaged”). The attribute “Lawnmower equipment malfunction”
may have the values “Poor cut - uneven”, “Hard to push”, “Vibrates excessively”, “Starter rope
hard to pull”, “Normal”. Attribute details (name, description, question, reference, observation
cost, observation time, comments, similarities, values, attribute links) are added to the domain
and edited using the Domain/Case editor.
Expansion of the multi-valued attribute “? LANDING GEAR control panel advisory lights”
Building a case
We build a case by populating it with the following information:
1. Title: In the form [Problem Description] due to [Root Cause]165.
“Lawnmower performance is unsatisfactory due to a restricted (clogged) air filter”
2. Source: The source of the case, either field experience or a document such as a
manual.
3. Description: A detailed description of the problem.
“Lawnmower runs erratically and the performance is unsatisfactory, starts with difficulty,
surges, loss of power, overheating, runs poorly at top no-load speed”
4. Observations: A structured description of the case’s attributes and their values.
5. Cause166: A case can have only one root cause167.
165
Corresponding to the RCM terminology for “Failure” and “Failure Mode” respectively
166
Recall the RCM “Failure Mode”
6. Explanation168: The explanation may include: 1) how the fault caused the
symptoms, 2) the physical working of the affected component to explain the
failure, 3) the chain of events that led to the identification of the root cause.
7. Repair: The Repair details generally include what was done to correct the
problem, as well as any repair references, e.g. parts and supplies needed, the
sequence of procedures (preparation, execution, testing), special tools needed,
safety information, effort required (for example, person hours), and cost (direct
labour, overhead, parts, etc).
8. Reference: References for a case may include: diagrams, video/audio clips, and
documents that illustrate observations, repair instructions, or explain the case.
9. Lessons: Lessons for the case may include: tips for avoiding mistakes during
troubleshooting as demonstrated by the case, tips for avoiding mistakes during
repair, emphasis on key observations or procedures that are new and not common
knowledge, comments regarding any general principles learned from the case.
10. Edit history: The Edit History shows who made changes to the case, the status of
the case, the date the case was changed, and the comments for the change.
Case Study
Over a period of two months during 2004 the fault “NOSE STEERING illuminated” was
detected in a fleet of aircraft. Around the world several people were grappling with a
similar nose wheel steering problem. The knowledge building process amalgamated the
notes from similar sessions. The notes are presented in Figure 12-4.
Previous Notes:
Solution cases: #4137-Nosewheel Steering sluggish due to partial blockage of the Steering
Manifold Inlet Filter.
2004-12-23 14:15:16 GMT by Vincent, Dominic (Closed)
Nose wheel Steering sluggish. Hydraulic supply line checked for debris as fault had only
become evident after a #2 edp failure a week or so earlier. Debris was found in the filter gauze
in the elbow. Once cleaned steering function checked ''satis''. We are looking back to see if any
of our other A/C that have had edp failures have suffered from steering problems as steering
hydraulic supply pipes not checked after a failure.
2004-12-15 19:58:34 GMT by Gray, Stuart (Closed)
In Service Engineering informs me that as well as the inlet and return filters in the PTU selector
valve in system #1 (Service Bulletin 84-29-13) and the alternate landing gear extend system
filter unions at the bypass valve and at the reservoir intake, (which we have all come across),
there are a few others that merit attention. They are: Rudder PCUs have inlet filter unions;
Elevator PCUs have inlet filter unions; Flap Power Unit has an inlet filter union; If you have an
ongoing fault that appears after a system has been contaminated, (i.e after an EDP failure) a
look at these filters might be worth your while.
2004-12-13 16:23:42 GMT by Gosling, Tom (Referred)
167
However, the root cause can comprise more than one contributory cause, each expressed in the
form of attributes and values that define the case.
168
Recall the RCM “Failure Effects”. The CBR system extends the RCM knowledge elements with
additional structure that enables the application of diagnostic algorithms in software.
The manifold has two filters at the inlet port. The first one is located within the swivel joint P/N
SJ504-917-2. The second is an inlet filter P/N FSHX0511200B located within the manifold
downstream of the inlet port. This information is being added to the Goodrich CMM. This is not
an AMM level component as it is part of the steering manifold.
2004-12-13 16:20:07 GMT by Gosling, Tom (Open)
Figure 12-4 Troubleshooting session notes

A search169 and analysis of the session notes quickly revealed that all of the affected
aircraft had previously experienced a hydraulic pump failure. This focused the
investigation. It turned out that a tiny filter inside an elbow (Figure 12-5) on the
NWS unit, which nobody knew about and which did not appear in the manuals, was partially
blocked. Within a week, the knowledge analyst added the new case to the case base,
whereupon it became available to the world.
Figure 12-5 Inlet Elbow filter blockage as a result of a prior hydraulic pump failure
The knowledge base was updated to include a new set of observations - a structured
description of attributes and their values. The attributes and their values are presented in
Figure 12-6.
169
Using an enhanced search function provided in the case editor software.
Figure 12-6 Observations for a new case added to the knowledge base
Explanation
The recent failure of the engine driven hydraulic pump had contaminated the system. Some of the
contamination had collected in the steering manifold inlet elbow filter and had remained there
after the flushing. The partial blockage of the inlet filter caused a flow restriction to the hydraulic
manifold, which resulted in the sluggish performance when maneuvering. The reduction, however,
was not enough to trigger the P-SW (pressure switch) fault in the steering control unit.
Repair
The steering manifold hydraulic supply filter was cleaned.
References
- AIPC 32-51-16-01 - NLG Steering Manifold
Lessons
1. The manifold has two filters at the inlet port.
2. The first one is located within the swivel joint P/N SJ504-917-2.
3. The second is an inlet filter P/N FSHX0511200B located within the manifold downstream
of the inlet port.
This information is being added to the Goodrich CMM. This is not an AMM level component as it
is part of the steering manifold.
The seed case base
Before implementing CBR in a maintenance organization, we must first build a seed case
base of a sufficient170 number of cases. Figure 12-7 illustrates the development of the
seed case base from 1) existing work order and troubleshooting records, 2) failure modes
and effects analysis records, and 3) OEM maintenance and troubleshooting (fault
isolation) manuals.
Figure 12-7: The seed case base
Casebase Growth
Once the CBR system has been deployed with a seed casebase, the system itself becomes a
powerful knowledge capture mechanism, and the casebase grows as new cases are
discovered during its use. The chart in Figure 12-8 illustrates the expected pattern as the
case base matures. The left side of the graph shows low initial usage as the seed case
base is deployed in stages, gradually bringing on more and more users until it is part of
normal operations.
170
In order that the tool may inspire sufficient confidence from the outset that it be used and developed
upon.
Performance measurement
Figure 12-8: CBR performance results

CBR measures its own performance by tracking the usage, the hit rate and the monthly
average number of solved sessions. Figure 12-8 depicts the growing usage and accuracy
of CBR in diagnosing a jet propulsion engine product line over two years.
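These three measures are simple to compute from a log of sessions. The Python sketch below assumes each session is recorded as a (month, solved) pair and takes the hit rate to be the fraction of sessions in a month that ended with a confirmed solution; the exact definition used in Figure 12-8 is not stated, so this is an assumption.

```python
from collections import defaultdict


def cbr_kpis(sessions):
    """Return (usage per month, hit rate per month, average solved sessions per month)."""
    usage = defaultdict(int)
    solved = defaultdict(int)
    for month, was_solved in sessions:
        usage[month] += 1
        solved[month] += int(was_solved)
    if not usage:
        return {}, {}, 0.0
    hit_rate = {m: solved[m] / usage[m] for m in usage}
    average_solved = sum(solved.values()) / len(usage)
    return dict(usage), hit_rate, average_solved


# Hypothetical session log.
log = [("2004-11", True), ("2004-11", False), ("2004-12", True), ("2004-12", True)]
usage, hit_rate, average_solved = cbr_kpis(log)
```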
Conclusions
The scale and unabated growth of mechanization and automation in all walks of human
endeavor gave rise to the diagnostic approach known as case-based reasoning. CBR
extends the structure of the knowledge gained through the application of reliability-
centered maintenance. Along with advanced condition monitoring tasks, CBR assists the
modern maintainer to satisfy increasingly pressing economic, environmental and safety
demands for:
• Better first-time fix of both potential and functional failures
• Cost reduction / Cost avoidance
o Less troubleshooting time
o Rapid planning for unscheduled maintenance events
o Reduced (unnecessary) parts replacements
o Reduced unscheduled service interruptions
• Increased asset availability
• Preservation and use of intellectual assets
o Capture of “walking knowledge” prior to retirement or attrition
o Maximized utility of new staff
o Focused efforts of expert staff on toughest problems
Case based diagnostic reasoning, encompassing detection, processing, and decision
sub-processes, is truly a form of condition based maintenance (CBM), whose principles we
described in detail in the preceding chapters.
Chapter 13. A survey of signal processing
and decision technologies for CBM
Introduction
In previous chapters we learned that Condition-based Maintenance recommends actions
based on information acquired through observation and analysis. We noted, moreover,
that the CBM process itself contains three sub-processes or steps: data acquisition, signal
processing, and maintenance decision making.
Figure 13-1 The three CBM steps
Chapter 12. Case based reasoning (page 165) pointed out, in regard to complex systems,
that prognostics are often indistinguishable from diagnostics, where both aim to identify
the occurrence of a potential failure.
Hundreds of theoretical and practical research papers on CBM appear every year in
scientific journals, conference proceedings and technical reports. In this chapter we
provide an overview of recent developments in the diagnostics and prognostics of
systems. We will mention a number of models, algorithms, and technologies for signal
processing and maintenance decision making. Given the increased use of multiple
sensors, we will also discuss various techniques for data fusion. The chapter concludes
with a brief discussion of current practices and possible future trends in CBM. The
purpose of this survey of advanced methods of signal processing and decision making is
not to instruct the reader in the use of these new techniques, but merely to provide the
maintenance professional with references to the source material so that he or she can
investigate alternatives when encountering various situations where a CBM solution is
proposed.
Reliability has always been an important criterion in the selection of industrial
equipment. Good equipment design is essential for processes requiring high reliability.
However, no amount of design effort will prevent deterioration over time. Machinery and
systems operate under stress in an environment that is characterized by randomness.
Maintenance is the major way in which we assure the user of the asset a satisfactory level
of reliability. Physical asset managers look towards CBM as an efficient form of
maintenance, which, they expect, will assist them in the avoidance or reduction of risk.
That is, they seek to reduce, to an acceptable level, the combined impact of the
probability of failure and its consequences. A CBM program, if properly established and
effectively implemented, can significantly reduce overall cost by reducing the number
and/or extent of unnecessary preventive maintenance operations, while still achieving the
desired reliability.
Let us begin by reviewing, briefly, the first CBM step, data acquisition.
Data acquisition
Data acquisition, the essential first step in the CBM task, is a process for collecting and
storing useful information that emanates from operating physical assets. Data collected in
a CBM program is of two main types: “event” data and condition monitoring (CM) data.
Event data tells us what happened, for example, an installation, a breakdown, or an
overhaul. Event data also tells us what was done, for example, a minor repair, a
preventive maintenance action, an oil change, and so on. CM data consists of
observational measurements that we believe are, in some way, related to the deteriorating
health or state of the physical asset.
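One way to see why both kinds of data are needed is to try to assemble life cycles from them: event data supplies the beginning and the ending of each life, while CM data supplies the inspections in between. The Python sketch below is a minimal illustration for a single asset; the record fields and event names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    hours: float  # operating hours meter reading at the event
    kind: str     # "install", "failure", "preventive" or "minor"


@dataclass
class Inspection:
    hours: float
    readings: dict  # CM measurements, e.g. {"iron_ppm": 12.0}


def life_cycles(events, inspections):
    """Group inspections into life cycles bounded by an install and an ending event."""
    cycles, start = [], None
    for ev in sorted(events, key=lambda e: e.hours):
        if ev.kind == "install":
            start = ev.hours
        elif ev.kind in ("failure", "preventive") and start is not None:
            inside = [i for i in inspections if start <= i.hours <= ev.hours]
            cycles.append({"start": start, "end": ev.hours,
                           "ended_by": ev.kind, "inspections": inside})
            start = None
        # Minor repairs do not end a life cycle in this sketch.
    return cycles


events = [Event(0, "install"), Event(300, "minor"), Event(500, "failure"),
          Event(500, "install"), Event(900, "preventive")]
inspections = [Inspection(100, {"iron_ppm": 5.0}), Inspection(400, {"iron_ppm": 14.0}),
               Inspection(700, {"iron_ppm": 6.0})]
cycles = life_cycles(events, inspections)
```

Without the event records there is no way to tell whether a given life ended in failure or in preventive replacement, which is exactly the information a reliability model needs.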
CM data can include vibration data, acoustics data, oil analysis data, temperature,
pressure, moisture, humidity, and any other physical observations, including visual clues,
that relate to the condition of an operating physical asset in its environment. A variety
of sensors (microsensors, ultrasonic sensors, acoustic emission sensors, thermographic
imagers, etc) have been designed to collect different types of data [11,12]. Wireless
technologies such as Bluetooth have provided an alternative to more expensive hard
wired data communication. Information systems such as Computerized Maintenance
Management Systems (CMMS), Enterprise Resource Planning (ERP) systems, control
system historians, and CBM databases have been developed for data storage and handling
[13]. With the rapid development of computer and advanced sensor technologies, data
acquisition technologies have become more powerful and less expensive, resulting in
exponentially growing databases of CM data.
Event data and CM data are equally important in CBM. In practice, however, engineers
and managers tend to place more emphasis on the latter and sometimes neglect the
former. Overlooking event data may have grown from the mistaken belief that it is not
valuable to fault prediction as long as the condition monitoring data seems to be working
well. We tend to overlook event data, in part, because we lack the knowledge and
methods to use it. Event data is at least as helpful as CM data in assessing machine
health. It augments our ability to judge the significance of CM data with respect to
specific failure modes. The use of event data is discouraged by the fact that its collection
usually implies manual data entry. Once a human is involved, everything becomes more
complicated and error-prone. Choosing the “simple” solution, that of removing the
human element, is hasty and ill-advised. Rather, it is preferable to equip humans with
tools and procedures171 with which to capture event data accurately, in a meaningful
format, and in sufficient detail.
Signal processing
Under the topic of signal processing we include a necessary preliminary step - data
cleaning. Data, especially event data, particularly when it is entered manually, always
171
Such as those developed in Chapter 4. (page 58)
contains errors. Data cleaning is meant to ensure that clean (error-free) data is used for
subsequent analysis and modeling. Data errors are caused by many factors, including the
human factor mentioned previously. Errors in CM data may be caused by sensor faults,
which are handled by sensor fault isolation [14]. In general, there is no simple, single
method to clean data. Sometimes manual examination is required. Graphical tools are
helpful in finding and removing data errors. Data cleaning is indeed a vast subject area.
In Example 2 Data validation on page 131 (Chapter 10) we touched upon various
aspects of data cleaning.
The next step in signal processing is data analysis. A variety of models, algorithms and
tools are described in the technical literature. Their purpose is to analyze data in order to
better understand and interpret it. The choice of which model, algorithm, or tool to use
for data analysis depends primarily on the type of data collected. Condition monitoring
data falls into three principal types:
Value: Data collected at a specific time epoch as single valued variables. For
example, oil analysis data, temperature, pressure, humidity are all value type data.
Waveform: Data collected at a specific time epoch as a time series of values. For
example, vibration data and acoustic data are of the waveform type.
Multi-dimension: Data collected at a specific time epoch as multi-dimensional
values. The most common multi-dimensional data is image data, for example
infrared thermographs, X-ray images, visual images, etc.
Although we have been using the term more broadly to describe the entire data analysis
phase of CBM, “signal processing” usually refers most specifically to waveform and
multi-dimension data analysis. A large variety of signal processing techniques have been
developed to analyze and interpret these types of data. Their purpose is to extract useful
information from the raw signal in order to perform diagnostics and prognostics. The
signal processing procedure for extracting information relevant to targeted failure modes
is often called “feature extraction”.
Waveform data analysis
The most common waveform data in condition monitoring are vibration signals and
acoustic emissions. Other waveform data include ultrasonic signals, motor current, partial
discharge, and others. In the literature, there are three main categories of waveform data
analysis: time-domain analysis, frequency-domain analysis and time-frequency analysis.
Time-domain analysis is directly based on the time waveform itself. Traditional
time-domain analysis calculates characteristic features from time waveform signals as
descriptive statistics, for example: mean, peak, peak-to-peak interval, standard deviation,
crest factor, RMS (root mean square), and high order statistics such as skewness and
kurtosis. These
features are usually called time-domain features. A popular time-domain analysis
approach is the time synchronous average (TSA). The idea of TSA is to use the ensemble
average of the raw signal over a number of revolutions in an attempt to remove or reduce
noise and effects from other sources, so as to enhance the signal components of interest.
A brief review of TSA was given by Dalpiaz [15] and some drawbacks of TSA were
pointed out by Miller [16]. Most of the references on TSA can be found in [15,16].
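The time-domain features and the TSA described above can be sketched in a few lines of Python. The sketch makes several simplifying assumptions: the peak-to-peak feature is taken as the amplitude range (maximum minus minimum), the moments use population (divide by n) formulas, and the TSA assumes the signal has already been resampled so that every shaft revolution contains the same whole number of samples.

```python
import math


def time_domain_features(x):
    """Descriptive statistics of a waveform (not defined for a constant signal)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    rms = math.sqrt(sum(v * v for v in x) / n)
    peak = max(abs(v) for v in x)
    return {
        "mean": mean,
        "std": std,
        "rms": rms,
        "peak": peak,
        "peak_to_peak": max(x) - min(x),
        "crest_factor": peak / rms,
        "skewness": sum((v - mean) ** 3 for v in x) / (n * std ** 3),
        "kurtosis": sum((v - mean) ** 4 for v in x) / (n * std ** 4),
    }


def time_synchronous_average(x, samples_per_rev):
    """Average the signal sample by sample over all complete revolutions."""
    revs = len(x) // samples_per_rev
    return [sum(x[r * samples_per_rev + k] for r in range(revs)) / revs
            for k in range(samples_per_rev)]


features = time_domain_features([1.0, -1.0, 1.0, -1.0])
averaged = time_synchronous_average([1.0, 2.0, 3.0, 3.0, 4.0, 5.0], samples_per_rev=3)
```

Components that repeat every revolution survive the averaging, while noise and components not synchronous with the shaft tend to cancel as the number of revolutions grows.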
More advanced approaches to time-domain analysis apply time series models to
waveform data. The main idea of time series modeling is to fit the waveform data to a
parametric time series model and extract features based on this parametric model. The
popular models used in the literature are the AR (autoregressive) model and the ARMA
(autoregressive moving average) model. An ARMA model of order $p, q$, denoted by
ARMA($p, q$), is expressed by
$$x_t = a_1 x_{t-1} + \cdots + a_p x_{t-p} + \varepsilon_t - b_1 \varepsilon_{t-1} - \cdots - b_q \varepsilon_{t-q}$$
where $x$ is the waveform signal, the $\varepsilon$’s are independent and normally distributed with
mean 0 and constant variance $\sigma^2$, and $a_i$, $b_i$ are model coefficients. An AR model of order $p$ is
a special case of ARMA($p, q$) with $q = 0$. Poyhonen et al [17] applied the AR model to
vibration signals collected from an induction motor and used AR model coefficients as
extracted features. Baillie and Mathew [18] compared the performance of three
autoregressive time series modeling techniques (the AR model, back propagation neural
networks, and radial basis function networks) in bearing fault diagnostics. Garga [19]
proposed using AR modeling followed by dimension reduction for machinery fault
diagnostics. Recently, Zhan [20] used a state space model representation of an AR model
to analyze vibration signals for fault detection.
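To illustrate how AR coefficients can serve as extracted features, the sketch below fits an AR($p$) model by ordinary least squares. The estimator, the function name `fit_ar`, and the simulated AR(2) signal are assumptions made for this example; the cited studies may use other estimators, such as Yule-Walker or Burg.

```python
import numpy as np

def fit_ar(x, p):
    """Least-squares estimate of AR(p) coefficients a_1..a_p and noise variance."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # The row for time t holds the lagged values x[t-1], ..., x[t-p].
    lagged = np.column_stack([x[p - k : len(x) - k] for k in range(1, p + 1)])
    target = x[p:]
    coeffs, *_ = np.linalg.lstsq(lagged, target, rcond=None)
    residual = target - lagged @ coeffs
    return coeffs, residual.var()

# Simulate x_t = 0.6 x_{t-1} - 0.3 x_{t-2} + e_t and recover the coefficients.
rng = np.random.default_rng(1)
true_a = np.array([0.6, -0.3])
n = 5000
noise = rng.normal(0.0, 1.0, n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = true_a[0] * x[t - 1] + true_a[1] * x[t - 2] + noise[t]

est_a, noise_var = fit_ar(x, p=2)
```

The vector `est_a` is the kind of low-dimensional feature vector that could then be passed to a classifier.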
There are many other time-domain analysis techniques to analyze waveform data for
machinery fault diagnostics. Some of them are briefly described as follows. Wang et al
[21] introduced three nonlinear diagnostic methods for rotating machine fault diagnosis.
These three methods are pseudo-phase portrait, singular spectrum analysis, and
correlation dimension. Pseudo-phase portrait is simple for computer execution and is
sensitive to some fault types. Wang and Lin [22] used a statistical approach known as
singular value decomposition to obtain the pseudo-phase portrait. Singular spectrum
analysis can reveal the complexity of a signal and reduce the noise. Correlation
dimension can provide some intrinsic information on an underlying dynamical system.
Koizumi [23] also considered the application of correlation dimension to fault diagnosis.
Wang et al [24] applied both correlation dimension and bispectrum for rotating machine
fault diagnosis. Zhuge and Lu [25] proposed a modified least mean square algorithm to
model the non-stationary impulse-like signals for reciprocating machine fault diagnosis.
Baydar et al investigated the use of a multivariate statistical technique known as principal
component analysis (PCA) in gear fault diagnostics [26].
Frequency-domain analysis is based on the transformed signal in the frequency domain.
The advantage of frequency-domain analysis over time-domain analysis is its ability to
easily identify and isolate certain frequency components of interest. The most widely
used conventional analysis is spectrum analysis by means of FFT (fast Fourier
transform). The main idea of spectrum analysis is either to look at the whole spectrum or
to look closely at certain frequency components of interest and thus extract features from
the signal (see, e.g. [27-29]). The most commonly used tool in spectrum analysis is the
power spectrum. It is defined as $E[X(f) X^*(f)]$, where (and throughout this section)
$X(f)$ is the Fourier transform of signal $x(t)$, $E$ denotes expectation and “$*$” denotes
complex conjugate. Some useful auxiliary tools for spectrum analysis are graphical
presentation of the spectrum, frequency filters, envelope analysis (also called amplitude
demodulation) [30-32], side band structure analysis [33], etc. Descriptions of the above
mentioned techniques for FFT based spectrum can be found in textbooks such as [34,35]
and will not be discussed in detail here. Another useful transform, the Hilbert transform, has
also been used for machine fault detection and diagnostics [30,36].
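The following sketch shows, under simplifying assumptions, how a power spectrum estimate and envelope analysis (amplitude demodulation) can be combined. The single-record periodogram stands in for the expectation in the definition above, the envelope is taken as the magnitude of the analytic signal computed with an FFT-based Hilbert transform, and the amplitude-modulated test signal (400 Hz carrier, 25 Hz modulation) is synthetic.

```python
import numpy as np

def power_spectrum(x, fs):
    """One-sided periodogram estimate of |X(f)|^2 for a single record."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x - x.mean())
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs, np.abs(spectrum) ** 2 / len(x)

def envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based Hilbert transform)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    spectrum = np.fft.fft(x)
    weights = np.zeros(n)          # keep DC, double positive frequencies
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        weights[1 : n // 2] = 2.0
    else:
        weights[1 : (n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(spectrum * weights))

fs = 2048
t = np.arange(fs) / fs             # one second of data
carrier_hz, modulation_hz = 400.0, 25.0
x = (1.0 + 0.5 * np.cos(2 * np.pi * modulation_hz * t)) * np.cos(2 * np.pi * carrier_hz * t)

freqs, pxx = power_spectrum(x, fs)
env_freqs, env_pxx = power_spectrum(envelope(x), fs)
carrier_peak = freqs[np.argmax(pxx)]
envelope_peak = env_freqs[np.argmax(env_pxx)]
```

The raw spectrum is dominated by the carrier, with the modulation visible only as sidebands, whereas the spectrum of the envelope peaks directly at the modulation frequency. This is why envelope analysis is attractive for faults that modulate a high-frequency resonance.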
Despite the wide acceptance of the power spectrum, other useful spectra for signal
processing have been developed and have been shown to have their own advantages over
the FFT spectrum in certain cases. Cepstrum has the capability to detect harmonics and
sideband patterns in the power spectrum. There are several versions or definitions of
cepstrum [35]. Among them, the power cepstrum, which is defined as the inverse Fourier
transform of the logarithmic power spectrum, is the most commonly used. A modified
cepstrum analysis was proposed in [37]. A high order spectrum, i.e. bispectrum or
trispectrum, can provide more diagnostic information than the power spectrum for non-
Gaussian signals. In the literature, high order spectrum is also called high order statistics
[38]. This name comes from the fact that bispectrum and trispectrum are actually the
Fourier transforms of the third- and fourth-order statistics of the time waveform,
respectively. But this name could be confused with the time-domain high order statistics.
Bispectrum and trispectrum are defined as
$$B(f_1, f_2) = E[X(f_1) X(f_2) X^*(f_1 + f_2)]$$
and
$$T(f_1, f_2, f_3) = E[X(f_1) X(f_2) X(f_3) X^*(f_1 + f_2 + f_3)]$$
respectively. Bispectrum and trispectrum can be normalized to obtain bicoherence and
tricoherence as
$$\beta(f_1, f_2) = \frac{|B(f_1, f_2)|}{\sqrt{E[|X(f_1) X(f_2)|^2]\, E[|X(f_1 + f_2)|^2]}}$$
and
$$\tau(f_1, f_2, f_3) = \frac{|T(f_1, f_2, f_3)|}{\sqrt{E[|X(f_1) X(f_2) X(f_3)|^2]\, E[|X(f_1 + f_2 + f_3)|^2]}}$$
respectively. Bispectrum analysis has been shown to have wide application in machinery
diagnostics for various mechanical systems such as gears [39], bearings [40], rotating
machines [41,42] and induction machines [43,24]. Li [44] investigated the application of
the bispectrum diagonal slice $B(f, f)$ to gear fault diagnostics. Yang [40] used the
bispectrum diagonal slice, the bicoherence diagonal slice $\beta(f, f)$, the summed bispectrum,
and the summed bicoherence for bearing fault diagnostics. Application of both bispectrum
and trispectrum to bearing fault diagnostics was discussed in [45]. A new technique
called holospectrum was introduced by Qu [46] to integrate all the information of phase,
amplitude and frequency of a waveform signal. Application of holospectrum to machine
fault diagnostics was studied in [47,48]. A review on holospectrum and its applications
was given by Qu [49] (in Chinese).
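A rough numerical illustration of the bispectrum definition is sketched below. The expectation is approximated by averaging over signal segments, and the test signals are synthetic: three cosines at FFT bins $k_1$, $k_2$ and $k_1 + k_2$, whose phases are either coupled (the third phase equals the sum of the first two) or independent. All names and parameter values are assumptions chosen for the example.

```python
import numpy as np

def bispectrum_at(segments, k1, k2):
    """Average X(f1) X(f2) X*(f1+f2) over segments at FFT bins k1, k2."""
    spectra = np.fft.fft(segments, axis=1)
    triple = spectra[:, k1] * spectra[:, k2] * np.conj(spectra[:, k1 + k2])
    return triple.mean()

def make_segments(rng, n_seg, n, k1, k2, coupled):
    """Three cosines at bins k1, k2, k1+k2 with random phases per segment."""
    phi1 = rng.uniform(0, 2 * np.pi, (n_seg, 1))
    phi2 = rng.uniform(0, 2 * np.pi, (n_seg, 1))
    phi3 = phi1 + phi2 if coupled else rng.uniform(0, 2 * np.pi, (n_seg, 1))
    w = 2 * np.pi * np.arange(n) / n
    return (np.cos(k1 * w + phi1) + np.cos(k2 * w + phi2)
            + np.cos((k1 + k2) * w + phi3))

rng = np.random.default_rng(2)
n_seg, n, k1, k2 = 256, 128, 10, 17
coupled = make_segments(rng, n_seg, n, k1, k2, coupled=True)
uncoupled = make_segments(rng, n_seg, n, k1, k2, coupled=False)

b_coupled = abs(bispectrum_at(coupled, k1, k2))
b_uncoupled = abs(bispectrum_at(uncoupled, k1, k2))
```

Both signal sets have identical power spectra, yet only the phase-coupled set yields a large bispectrum magnitude, because the triple products add coherently across segments. This is the sense in which a high order spectrum carries information that the power spectrum discards.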
Generally speaking, there are two classes of approaches for power spectrum estimation.
The first covers the non-parametric approaches that estimate the autocorrelation sequence
of the signal and subsequently apply a Fourier transform to the estimated autocorrelation
sequence. For details, see [50]. The second class includes the parametric approaches that
build a parametric model for the signal and then estimate power spectrum based on the
fitted model. Among them, AR spectrum [51-53] and ARMA spectrum [54] based on AR
model and ARMA model respectively are the two most commonly used parametric
spectra for machinery fault diagnostics.
One limitation of frequency-domain analysis is its inability to handle non-stationary
waveform signals, which are very common when machinery faults occur. Thus, time-
frequency analysis, which investigates waveform signals in both the time and frequency
domains, has been developed for non-stationary waveform signals. Traditional time-
frequency analysis uses time-frequency distributions, which represent the energy or
power of waveform signals in two-dimensional functions of both time and frequency.
The short-time Fourier transform (STFT, also called the spectrogram) [55,56] and the Wigner-
Ville distribution [57-60] are the most popular time-frequency distributions. Cohen [61]
reviewed a class of time-frequency distributions which include spectrogram, Wigner-
Ville distribution, Choi-Williams and others. The idea of a spectrogram is to divide the
whole waveform signal into segments with a short time window and then apply a Fourier
transform to each segment. Spectrogram has some limitations in time-frequency
resolution due to signal segmentation. It can be applied only to non-stationary signals
with slow change in their dynamics. Bilinear transforms such as Wigner-Ville
distribution are not based on signal segmentation and thus overcome the time-frequency
resolution limitation of spectrogram. However, bilinear transforms have one main
disadvantage: interference terms formed by the transformation itself. These
interference terms make interpretation of the estimated distribution difficult [62].
Improved transforms such as the Choi-Williams distribution have been developed to
overcome this difficulty. Gu et al [63] applied singular value decomposition to extract
features from the time-frequency distribution. Loughlin [64] used a set of conditional
time-frequency moments as characteristic features for fault diagnosis.
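The segment-and-transform idea behind the spectrogram can be sketched as follows. The Hann window, the window length, the hop size, and the test signal (a tone that jumps from 64 Hz to 128 Hz after one second) are illustrative assumptions.

```python
import numpy as np

def spectrogram(x, fs, window_len, hop):
    """Squared-magnitude STFT using a Hann window on overlapping segments."""
    x = np.asarray(x, dtype=float)
    window = np.hanning(window_len)
    starts = range(0, len(x) - window_len + 1, hop)
    frames = np.array([x[s : s + window_len] * window for s in starts])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    times = (np.array(list(starts)) + window_len / 2) / fs   # segment centres
    freqs = np.fft.rfftfreq(window_len, d=1.0 / fs)
    return times, freqs, power

fs = 1024
t = np.arange(2 * fs) / fs
# Non-stationary signal: 64 Hz for the first second, 128 Hz afterwards.
x = np.where(t < 1.0, np.sin(2 * np.pi * 64 * t), np.sin(2 * np.pi * 128 * t))

times, freqs, power = spectrogram(x, fs, window_len=256, hop=128)
dominant = freqs[np.argmax(power, axis=1)]   # strongest frequency per segment
```

A 256-sample window at this sampling rate gives a frequency resolution of 4 Hz and a time resolution of a quarter of a second. Shortening the window sharpens one at the expense of the other, which is the resolution limitation noted above.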
Another transform for time-frequency analysis is the wavelet transform. Wavelet theory
has been developing rapidly in the past decade and has wide application [65]. A
continuous wavelet transform is defined as
$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^*\!\left(\frac{t - b}{a}\right) dt$$
where $x(t)$ is the waveform signal, $a$ is the scale parameter, $b$ is the time parameter and
$\psi(\cdot)$ is a wavelet, which is a zero average oscillatory function centered around zero with
a finite energy, and “$*$” denotes complex conjugate. Commonly used wavelets are
Morlet, Mexican hat, Haar, etc. Similar to Fourier transform, the wavelet transform has
its discrete form, which is obtained by discretizing $a$ and $b$, and expressing $x(t)$ in
discrete form. Similar to FFT, a fast wavelet transform is likewise available for the
calculation.
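As a sketch of the discretized transform, the fragment below approximates $W(a, b)$ by a Riemann sum at every sample time, using a real-valued Mexican hat wavelet. The wavelet normalization, the scales, and the single-impulse test signal are assumptions made for illustration. This direct evaluation is not a fast wavelet transform and would be inefficient for long records.

```python
import numpy as np

def mexican_hat(u):
    """Real-valued Mexican hat wavelet (unnormalised second Gaussian derivative)."""
    return (1.0 - u ** 2) * np.exp(-0.5 * u ** 2)

def cwt(x, dt, scales):
    """Riemann-sum approximation of W(a, b) at every sample time b."""
    x = np.asarray(x, dtype=float)
    t = np.arange(len(x)) * dt
    out = np.empty((len(scales), len(x)))
    for i, a in enumerate(scales):
        # kernel[j, k] = psi((t_k - b_j) / a); psi is real, so conjugation is a no-op.
        kernel = mexican_hat((t[None, :] - t[:, None]) / a)
        out[i] = kernel @ x * dt / np.sqrt(a)
    return out

dt = 1.0 / 256
x = np.zeros(512)
x[300] = 1.0                        # a single impulse, e.g. an impact event
scales = [0.01, 0.02, 0.05]         # seconds

W = cwt(x, dt, scales)
scalogram = W ** 2                  # |W(a, b)|^2 for a real wavelet
peak_index = np.argmax(scalogram, axis=1)
```

At every scale the scalogram peaks at the sample where the impulse occurs, which shows how the time parameter $b$ localizes a transient that a Fourier spectrum would spread over all frequencies.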
Wavelet analysis of a waveform signal expresses the signal in a series of oscillatory
functions with different frequencies at different times by dilations via the scale parameter
$a$ and translations via the time parameter $b$. Similar to the power spectrum and the phase
spectrum in Fourier analysis, a scalogram defined as $|W(a, b)|^2$ and a wavelet phase
spectrum defined as the phase angle of the complex variable $W(a, b)$ are used to interpret
the signal. Wavelet transformation has been successfully applied to fault diagnostics of
gears [66,67], bearings [68,69] and other mechanical systems [70,71]. Dalpiaz and Rivola
[72] assessed and compared the effectiveness and reliability of wavelet transform to other
vibration signal analysis techniques for fault detection and diagnostics. Baydar and Ball
[73] applied wavelet transform to both acoustic signals and vibration signals for gear
tooth fault diagnostics. Addison et al [74] investigated the use of low-oscillation complex
wavelets, Mexican hat and Morlet wavelets, as feature detection tools. Wavelet analysis
using Haar wavelet was considered in [75,76]. Miller [77] used a wavelet basis as a comb
filter to decompose vibration signals for gear fault diagnostics. A graphical tool called
wavelet polar maps to display wavelet amplitude and phase was proposed in [78] and was
applied to gear fault diagnostics in [79]. Wavelet transform combined with Fourier
transform to enhance feature extraction capability was proposed in [80]. A more
advanced transform, known as wavelet packet transform, was studied and applied to
machinery fault diagnostics in [81-83]. A new technique known as basis pursuit, based on
a general wavelet packet dictionary, was applied to rolling element bearing fault
diagnostics in [84]. It was shown that basis pursuit has some advantages over other
commonly used wavelet analysis approaches. A recent review with more references on
the applications of wavelet transform in machine condition monitoring and fault
diagnostics was given in [85].
Image processing
Image processing is similar to but more complicated than waveform signal processing
due to one more dimension involved. In practice, raw images are usually very
complicated and immediate information for fault detection is unavailable. In these cases,
image processing techniques must be powerful enough to extract useful features from raw
images for fault diagnosis — see [86,87] for descriptions and discussions on image
processing tools and algorithms. Image processing seems unnecessary when raw images
provide sufficient and clear information under visual examination to identify patterns and
detect faults. However, image processing can help in extracting features for automatic
fault detection in such situations. In addition to raw images obtained via data acquisition,
some waveform processing techniques such as time-frequency analysis also produce
images. In these situations, image processing can be combined with waveform processing
to obtain better results.
A few examples of applying image processing techniques in condition monitoring and
fault diagnosis and prognosis are as follows. Wang and MacFadden [88] applied image
processing techniques to spectrograms for early gear fault detection and diagnostics.
Utsumi et al [89] used a wavelet transform to analyze ferrographic images for bearing
diagnosis. Heger and Pandit [90] considered a wavelet-based segmentation approach to
image processing for the condition monitoring and fault diagnostics of grinding tools.
Ellwein et al [91] combined image processing techniques with waveform power spectrum
density to identify a region of interest (ROI) for fault discrimination enhancement.
Value type data analysis

Value type data includes both raw data obtained via data acquisition and feature values
extracted from raw signals via signal processing. Value type data looks much simpler
than waveform and image data. However, complexity lies in the correlation structure
when the number of variables is large. Multivariate analysis techniques such as PCA and
independent component analysis (ICA) are very useful to handle data with complicated
correlation structure. For example, Stellman et al [92] applied PCA to spectroscopic data
to monitor the condition of a lubricant in helicopter rotary gearboxes. Allgood and
Upadhyaya [93] performed PCA on certain descriptive statistics for DC motor
diagnostics and prognostics. ICA is an extension of PCA and will be discussed later.
When the number of variables is large, dimension reduction techniques such as PCA and
projection pursuit can be used for data reduction. For a review on dimension reduction
techniques, see [94]. An example of applying dimension reduction techniques for
machine fault diagnostics is given in [19].
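A minimal sketch of PCA used for dimension reduction is given below. It is computed through the singular value decomposition of the centred data matrix. The six synthetic "features" driven by two latent factors, the mixing matrix, and the noise level are invented for the example.

```python
import numpy as np

def pca(data, n_components):
    """Project rows of `data` onto the leading principal components."""
    data = np.asarray(data, dtype=float)
    centred = data - data.mean(axis=0)
    _, singular, vt = np.linalg.svd(centred, full_matrices=False)
    explained = singular ** 2 / np.sum(singular ** 2)
    scores = centred @ vt[:n_components].T
    return scores, vt[:n_components], explained[:n_components]

# Six correlated condition indicators generated from two hidden factors.
rng = np.random.default_rng(4)
latent = rng.normal(size=(500, 2))
mixing = np.array([[1.0, 0.8, 0.5, 0.0, 0.2, -0.3],
                   [0.0, 0.3, -0.6, 1.0, 0.9, 0.4]])
data = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

scores, components, explained = pca(data, n_components=2)
```

Because the six variables are strongly correlated, two principal components capture almost all of the variance, and the two-column `scores` matrix can replace the original six columns in later analysis.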
Trend analysis techniques such as regression analysis and time series modeling are
commonly used techniques for analyzing value type data. For example, Grimmelius et al
[95] developed a prototype condition monitoring and diagnostics system for compression
refrigeration plants using a regression analysis model to predict healthy system behavior.
Yang et al [96] established an ARMA model to extract features from on-line data for
power equipment diagnosis. Sinha [97] applied both polynomial regression and an
ARMA model to predict the trend of vibration peak amplitude for turbine fault
diagnostics and prognostics.
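A simple instance of trend analysis by regression is sketched below. A straight line is fitted to a vibration amplitude history and extrapolated to an alarm level. The data, the units, and the alarm limit of 7.0 are hypothetical, and a linear trend is assumed purely for illustration.

```python
import numpy as np

# Hypothetical vibration peak amplitude (mm/s) recorded every 50 operating hours.
hours = np.arange(0, 1000, 50, dtype=float)
rng = np.random.default_rng(3)
amplitude = 2.0 + 0.004 * hours + rng.normal(0.0, 0.05, hours.size)

# First-degree polynomial regression: amplitude ~ slope * hours + intercept.
slope, intercept = np.polyfit(hours, amplitude, deg=1)

alarm_level = 7.0                   # hypothetical alarm limit
hours_to_alarm = (alarm_level - intercept) / slope
```

Extrapolating this far beyond the observed range is only as trustworthy as the assumed trend shape, so in practice such a projection would be re-estimated as new measurements arrive.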
Data analysis combining event data and condition monitoring data

Data analysis for event data only is well known as “reliability analysis”, which fits the
event data to a time between events probability distribution and uses the fitted
distribution for further analysis. In condition-based maintenance, however, additional
information, namely condition monitoring data, is available. It is beneficial to analyze event
data and condition monitoring data together. This combined data analysis can be
accomplished by building a mathematical model that describes the underlying mechanism
of a fault or a failure. The model built on both event and condition monitoring data is the
basis for maintenance decision support (diagnostics and prognostics), which will be
discussed in the next section.
A time-dependent proportional hazards model (PHM) is suitable for analyzing both event
and condition monitoring data together. It has a hazard function of the form
$$h(t) = h_0(t) \exp(\gamma_1 x_1(t) + \cdots + \gamma_p x_p(t))$$
where $h_0(t)$ is a baseline hazard function, $x_1(t), \ldots, x_p(t)$ are covariates which are
functions of time, and $\gamma_1, \ldots, \gamma_p$ are coefficients. The baseline hazard function $h_0(t)$ can
be in non-parametric or parametric form. A commonly used parametric baseline hazard
function is the Weibull hazard function, which is the hazard function of the Weibull
distribution. A PHM with a Weibull baseline hazard function is called a Weibull PHM.
Jardine et al [98] proposed using a Weibull PHM to analyze aircraft and marine
engine failure data together with the metal concentration measurements of the engine oil.
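To make the Weibull PHM concrete, the sketch below evaluates the hazard with the standard two-parameter Weibull baseline $h_0(t) = (\beta/\eta)(t/\eta)^{\beta-1}$. The parameter values and the two covariates (imagined here as iron and copper concentrations in ppm) are hypothetical and are not fitted to any data.

```python
import math

def weibull_phm_hazard(t, covariates, beta, eta, gammas):
    """h(t) = (beta/eta) * (t/eta)**(beta-1) * exp(sum(gamma_i * x_i(t)))."""
    baseline = (beta / eta) * (t / eta) ** (beta - 1.0)
    return baseline * math.exp(sum(g * x for g, x in zip(gammas, covariates)))

# Hypothetical parameters: shape, scale (hours), and covariate coefficients.
beta, eta = 2.0, 1000.0
gammas = (0.05, 0.02)

h_clean = weibull_phm_hazard(500.0, (0.0, 0.0), beta, eta, gammas)
h_worn = weibull_phm_hazard(500.0, (10.0, 5.0), beta, eta, gammas)
hazard_ratio = h_worn / h_clean     # equals exp(0.05 * 10 + 0.02 * 5)
```

At the same working age, the covariate readings multiply the baseline hazard by a factor that depends only on the readings and the coefficients. This is the "proportional" property that lets age and condition data be combined in one risk estimate.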
An extension of the PHM is the proportional intensity model (PIM), which adopts a
stochastic process setting and assumes a similar form for the intensity function of the
stochastic process. Vlok et al [99] studied the application of PIM to analyze failure and
diagnostic measurement data from bearings.
In reliability centered maintenance (RCM) [100], the concept known as the “P-F interval”
is used to describe failure patterns in condition monitoring. A P-F interval is the time
interval between a potential failure (P), which is identified by a condition indicator, and a
functional failure (F). A P-F interval is a useful concept with which to determine an
appropriate interval for periodic condition monitoring. A condition monitoring interval is
usually set to the P-F interval divided by an integer. In practice, however, it is usually
difficult to quantify the P-F interval (see Chapter 9, The Elusive P-F Curve, page 106).
Goode et al [101] assumed two Weibull distributions for the P-F interval and the I-P
interval, i.e. from machine installation to a potential failure. Using the statistical process
control (SPC) methods on historical data, they separated each machine life cycle into two
zones: a stable zone and a failure zone. They used the stable zone duration times to fit a
Weibull distribution for the I-P interval. Similarly, they used the failure zone duration
times to fit the Weibull distribution for the P-F interval. Based on these two fitted
distributions combined with the condition monitoring process, machine prognosis was
derived.
A hidden Markov model (HMM) [102,103] is another model for analyzing event and
condition monitoring data together. An HMM consists of two stochastic processes: a
Markov chain with a finite number of states describing an underlying failure mechanism,
and an observation process that depends on the hidden state. Bunks et al [104] applied an
HMM to analyze Westland helicopter data, which consist of gearbox fault class
information and vibration measurements surrounding the occurrence of various faults.
The fault classes were treated as states in the hidden Markov chain, whereas the vibration
measurements were treated as realizations of the observation process. The HMM trained
on lab test data was then applied to fault classification for a data set from an operating
gearbox. Dong and He [105] proposed a more general model, the hidden semi-Markov model
(HSMM), for hydraulic pump diagnostics. It was shown that HSMM outperforms HMM
in pump diagnostics.
Lin and Makis [106] proposed using a partially observable stochastic model to describe
the underlying failure mechanism of a system undergoing condition monitoring. The
proposed model is similar to that of an HMM but it has some distinguishing
characteristics. One (failure) state is observable, while the partially hidden state process is
continuous in time. The observation process, however, is discrete in time. These
characteristics are more realistic in relation to actual condition monitoring processes. The
model parameters were estimated using both event and condition monitoring data. The
fitted model is used for subsequent diagnostics and prognostics. A fast recursive
parameter estimation procedure for a partially observable stochastic model was given in
[107].
Other models in the literature that can be used to analyze both event and condition
monitoring data are models using the delay time concept [108] and stochastic process
models such as a gamma process [109].
Maintenance decision support

The ultimate goal and final step of a CBM program is maintenance decision making.
Sufficient and efficient decision support will result in maintenance personnel’s taking the
“right” maintenance actions given the current known information. Jardine [110] reviewed
and compared several commonly used CBM decision strategies. They included trend
analysis that is rooted in statistical process control, expert systems, and neural networks.
Wang and Sharp [111] discussed the decision aspect of CBM and reviewed the recent
development in modeling CBM decision support.
Diagnostics
Machine fault diagnostics is a discovery procedure based on mapping information in the
measurement space and/or features in the feature space to machine faults in the fault
space. From an “RCM” perspective, a machine fault may or may not have immediate
consequences. If a fault does not have immediate consequences, other than those
necessary to diagnose and repair it, it is a potential failure. The diagnostic action
following the detection of a potential failure will be a proactive activity, initiated, often,
by a condition based maintenance process. A common example is an alarm generated by
a “rule” applied to the data in a control system historian. Besides a potential failure, a
diagnostic alarm may also expose an otherwise hidden functional failure, usually the
failure of a protective or backup device. The failure of a hidden function has the
immediate consequence that a “multiple” failure is, from that moment on, highly
probable. This topic was developed in Failure Finding Intervals of Chapter 3, on page 39.
The diagnostic mapping process is also called pattern recognition. Traditionally, pattern
recognition was a manual exercise, performed with the assistance of graphical tools such
as a power spectrum graph, a phase spectrum graph, a cepstrum graph, an AR spectrum
graph, a spectrogram, a wavelet scalogram, a wavelet phase graph, and so on. However,
manual pattern recognition requires expertise in the specific area of the diagnostic
application. It is slow and expensive, requiring highly trained and skilled personnel.
Therefore, automatic pattern recognition is highly desirable. This can be achieved by
classification of signals based on the information and/or features extracted from the
signals. In the following sections, different machine fault diagnostic approaches are
discussed with emphasis on statistical approaches and artificial intelligence approaches.
Machine diagnostics with emphasis on practical issues was discussed in [112]. Various
topics in fault diagnosis with emphasis on model-based and artificial intelligence
approaches were covered in a recent co-authored book [113].
Statistical approaches
A common method of fault diagnostics is to detect whether a specific fault is present or
not based on the available condition monitoring information without intrusive inspection
of the machine. This fault detection problem can be described as a hypothesis test
problem with null hypothesis H0: Fault A is present, against alternative hypothesis H1:
Fault A is not present. In a concrete fault diagnostic problem, hypotheses H0 and H1 are
interpreted into an expression using specific models or distributions, or the parameters of
a specific model or distribution. Test statistics are then constructed to summarize the
condition monitoring information so as to be able to decide whether to accept the null
hypothesis H0 or reject it. See [114-116] for some examples of using hypothesis testing
for fault diagnosis. Recently, a framework for fault diagnosis, called structured
hypothesis tests, was proposed for conveniently handling complicated multiple faults of
different types [117].
A conventional approach, statistical process control, which was originally developed in
quality control theory, has been well developed and widely used in fault detection and
diagnostics. The principle of SPC is to measure the deviation of the current signal from a
reference signal representing the normal condition to see whether the current signal is
within the control limits or not. An example of using SPC for damage detection was
discussed in [118].
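The SPC principle can be sketched with a basic Shewhart-style check. Control limits are set at the reference mean plus or minus three standard deviations, and current readings outside the limits are flagged. The three-sigma rule, the function names, and the readings are assumptions for this illustration.

```python
import statistics

def control_limits(reference, k=3.0):
    """Lower and upper limits from data representing the normal condition."""
    centre = statistics.mean(reference)
    sigma = statistics.stdev(reference)
    return centre - k * sigma, centre + k * sigma

def out_of_control(values, limits):
    """Indices of readings that fall outside the control limits."""
    lower, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical RMS vibration readings (mm/s).
reference = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9, 2.0, 2.0]
current = [2.0, 2.1, 2.3, 2.6, 2.9]

limits = control_limits(reference)
alarms = out_of_control(current, limits)   # → [3, 4]
```

With these numbers the limits are about 1.65 and 2.35, so the last two readings raise alarms while the reading of 2.3 does not.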
Cluster analysis, as a multivariate statistical analysis method, is a statistical classification
approach that groups signals into different fault categories on the basis of the similarity of
the characteristics or features they possess. It seeks to minimize within-group variance
and maximize between-group variance. The result of cluster analysis is a number of
heterogeneous groups with homogeneous contents. There are substantial differences
between the groups, but the signals within a single group are similar. Application of
cluster analysis in machinery fault diagnosis was discussed in [119,120]. A natural way
of signal grouping is based on certain distance measures or similarity measures between
two signals. These measures are usually derived from certain discriminant functions in
statistical pattern recognition [121]. Commonly used distance measures are Euclidean
distance, Mahalanobis distance, Kullback-Leibler distance and Bayesian distance. See
[122-125] for some examples of using these distance metrics for fault diagnostics. Ding
et al [122] introduced a new distance metric called quotient distance for engine fault
diagnosis. Pan et al [126] proposed an extended symmetric Itakura distance for
signals in time-frequency representations, for example Wigner-Ville distributions. In
addition to distance measures, the feature vector correlation coefficient is a similarity
measure commonly used for signal classification in machinery fault diagnosis [125].
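The choice of distance measure can change a classification, as the following sketch suggests. It assigns a two-element feature vector to the class with the nearest mean, once by Euclidean distance and once by Mahalanobis distance. The class names, means, and covariance matrices are hypothetical.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Distance from x to a class mean, scaled by the class covariance."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def classify(x, class_stats):
    """Assign x to the class whose mean is nearest in Mahalanobis distance."""
    return min(class_stats, key=lambda name: mahalanobis(x, *class_stats[name]))

# Hypothetical class statistics: (mean vector, covariance matrix).
class_stats = {
    "normal": (np.array([0.0, 0.0]), np.diag([0.01, 1.0])),
    "fault": (np.array([1.0, 2.0]), np.diag([0.25, 4.0])),
}
x = np.array([0.5, 0.0])

euclidean_label = min(class_stats, key=lambda name: np.linalg.norm(x - class_stats[name][0]))
mahalanobis_label = classify(x, class_stats)
d_normal = mahalanobis(x, *class_stats["normal"])
```

The point lies closer to the "normal" mean in Euclidean terms, but the first feature of the normal class varies so little that the point is five standard deviations away from it. The Mahalanobis distance therefore assigns the point to the "fault" class.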
Many clustering algorithms are available for distinguishing the signal groups [127]. A
commonly used algorithm in machine fault classification is the nearest neighbour
algorithm that fuses the two closest groups into a new group and calculates the distance
between two groups as the distance of the nearest neighbour in the two separate groups
[128]. The boundary between two adjacent groups is determined by the discriminant
function used. A piecewise linear discriminant function was used and thus piecewise
linear boundaries were obtained for bearing condition classification in [129]. A technique
called support vector machine (SVM) is usually employed to optimize a boundary curve
in the sense that the distance of the closest point to the boundary curve is maximized. The
support vector machine approach applied to machine fault diagnosis was considered in
[17,130].
The hidden Markov model (HMM) described earlier can also be used for fault
classification. Early applications of HMM in fault classification and diagnostics treated
the real machine faulty states and the machine normal state as the hidden states of the
HMM [104,131]. Two recent applications of HMM in fault classification assumed an
HMM with hidden states having no physical meaning for two machine conditions
(normal and faulty) [132,133]. The trained HMMs are then used to decode an observation
for fault classification in a machine whose condition is unknown. Xu and Ge [134]
presented an intelligent fault diagnosis system based on a hidden Markov model. Ye et al
[135] considered the application of a two-dimensional HMM based on time-frequency analysis
for fault diagnosis.
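One way to picture HMM-based classification is sketched below. A separate discrete HMM is assumed for each machine condition, the likelihood of an observation sequence under each model is computed with the scaled forward algorithm, and the condition with the larger likelihood is chosen. The model parameters are invented rather than trained, and the observations are imagined as vibration levels quantized into three symbols (0 = low, 1 = medium, 2 = high).

```python
import numpy as np

def log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    alpha = start * emit[:, obs[0]]
    log_prob = 0.0
    for symbol in obs[1:]:
        scale = alpha.sum()
        log_prob += np.log(scale)
        alpha = (alpha / scale) @ trans * emit[:, symbol]
    return log_prob + np.log(alpha.sum())

# Hypothetical two-state models; the hidden states carry no physical meaning.
start = np.array([0.5, 0.5])
trans = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
models = {
    "normal": np.array([[0.7, 0.2, 0.1],
                        [0.5, 0.4, 0.1]]),
    "faulty": np.array([[0.1, 0.3, 0.6],
                        [0.1, 0.2, 0.7]]),
}

observation = [2, 2, 1, 2, 2, 1, 2]
scores = {name: log_likelihood(observation, start, trans, emit)
          for name, emit in models.items()}
decoded = max(scores, key=scores.get)
```

Rescaling `alpha` at every step avoids numerical underflow on long sequences while leaving the final log-likelihood unchanged.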
Artificial intelligence approaches

Artificial intelligence (AI) techniques have been increasingly applied to machine
diagnosis and have shown improved performance over conventional approaches. In the
literature, two popular AI techniques for machine diagnosis are artificial neural networks
(ANN) and expert systems (ES). Other AI techniques include fuzzy logic systems (FLS),
fuzzy-neural networks (FNN), neural-fuzzy systems (NFS), and evolutionary algorithms
(EA). A review of recent developments in applications of AI techniques for induction
machine stator fault diagnostics was given by Siddique et al [136].
An artificial neural network is a computational model that mimics the human brain. It
consists of simple processing elements connected together in a complex layer structure.
The model approximates a complex nonlinear function with multi-input and multi-output.
One processing element comprises a node and a weight. The artificial neural network
learns the unknown function by adjusting its weights with observations of input and
output. This process is usually called training of an artificial neural network. There are
various neural network models. The feedforward neural network (FFNN) is the most
widely used neural network structure in machine fault diagnosis [137-140]. A special
FFNN, the multilayer perceptron (MLP) with the back propagation (BP) training algorithm,
is the most commonly used neural network model for pattern recognition and
classification. Hence it is popular in machine fault diagnostics as well [140,141,142]. The
BP neural networks, however, have two main limitations: 1) difficulty of determining the
appropriate network structure and the number of nodes; 2) slow convergence of the
training process.
A cascade correlation neural network (CCNN) does not require initial determination of
the network structure and the number of nodes. CCNN can be used in cases where on-line
training is preferable. Spoerre [143] applied CCNN to bearing fault classification and
showed that CCNN can result in utilizing the minimum network structure for fault
recognition with satisfactory accuracy. Other neural network models applied in machine
diagnostics are radial basis function neural networks [18], recurrent neural networks
[144,145] and counter propagation neural networks (CPNN) [146]. The above ANN
models usually use supervised learning algorithms which require external input such as a
priori knowledge about the target or desired output. For example, a common practice of
training a neural network model is to use a set of experimental data with known (seeded)
faults. This training process is supervised learning. In contrast to supervised learning,
unsupervised learning does not require external input. An unsupervised neural network
learns by itself using new information available. Wang and Too [38] applied
unsupervised neural networks, a self-organizing map (SOM), and learning vector
quantization (LVQ) to the detection of rotating machine faults. Tallam et al [147]
proposed several self-commissioning and on-line training algorithms for FFNN applied
particularly to electric machine fault diagnostics. Sohn et al [116] used an autoassociative
neural network to separate the effect of damage on the extracted features from those
caused by the environmental and vibration variations of the system. Then a sequential
probability ratio test was performed on the normalized features for damage classification.
In contrast to neural networks, which acquire knowledge by training on observed data
with known inputs and outputs, expert systems utilize domain expert knowledge in a
computer program with an automated inference engine to perform reasoning for problem
solving. Three main reasoning methods for ES used in the area of machinery diagnostics
are rule-based reasoning [148-150], case-based reasoning [151,152] and model-based
reasoning [153]. Another reasoning method, negative reasoning, was introduced to
mechanical diagnosis by Hall et al [154]. Stanek et al [155] compared case-based and
model-based reasoning and proposed to combine them for a lower cost solution to
machine condition assessment and diagnosis. Unlike other reasoning methods, negative
reasoning deals with negative information, which by its absence or lack of symptoms is
indicative of meaningful inferences.
Expert systems and neural networks have known limitations. A significant limitation of
rule-based expert systems is combinatorial explosion, which refers to the computation
problem caused when the number of rules increases exponentially as the number of
variables increases. Another important limitation is consistency maintenance, which
refers to the process by which the system decides when some of the variables need to be
recomputed in response to changes in other values. Two important limitations of neural
networks are the difficulty of explaining the trained model in physical terms and the
difficulty of the training process. It is natural then to attempt a combination of both
techniques in order to combine their respective advantages thus improving performance
in a hybrid system. For instance, Silva et al [156] used two neural networks, SOM and
adaptive resonance theory (ART), combined with an expert system based on Taylor's tool
life equation to classify tool wear state. DePold and Gass [157] studied the applications
of neural networks and expert systems in a modular intelligent and adaptive system for
gas turbine diagnostics and prognostics. Yang et al [158] presented an approach for
integrating case-based reasoning ES with an ART-Kohonen neural network to enhance
fault diagnosis. It was shown that the proposed approach outperforms the self-organizing
feature map (SOFM) based system with respect to classification rate.
In condition monitoring practice, knowledge from domain-specific experts is usually
inexact. Therefore expert system reasoning on domain knowledge is often imprecise.
Measures of the uncertainties in knowledge and reasoning are required in order that an
ES may provide more robust problem solving capability. Commonly used uncertainty
measures are probability, fuzzy membership functions in fuzzy logic theory, and belief
functions in belief networks theory. An example of applying fuzzy logic to machine fault
classification was given in [159] to classify frequency spectra representing various rolling
element bearing faults. A comparison between conventional rule-based expert systems
and belief networks applied to machine diagnostics was given in [160]. Du and Yeung
[161] introduced an approach called fuzzy transition probability, which combines
transition probability (Markov process) with fuzzy sets, to monitor
progressive faults. Fuzzy logic is usually applied in combination with other
techniques such as neural networks and expert systems. For example, Zhang et al [162]
developed a fuzzy neural network for fault diagnosis of rotary machines to improve the
recognition rate of pattern recognition, especially in the case when sample data are
similar. Lou and Loparo [125] employed an adaptive neural-fuzzy inference system as a
diagnostic classifier for bearing fault diagnosis. Liu et al [163] applied fuzzy logic and
expert systems to build a fuzzy expert system for bearing fault detection. Chang et al
[164] built a system for decision making support in a power plant using both a rule-based
ES and fuzzy logic.
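As a minimal illustration of how fuzzy membership functions express inexact knowledge, the sketch below grades a vibration reading against three overlapping classes. The triangular shape, the class names, and the breakpoints (in mm/s) are assumptions made for illustration only; they are not taken from the works cited above.

```python
def triangular(x, a, b, c):
    """Triangular membership function: 0 at a and c, rising to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets for RMS vibration velocity (mm/s): (a, b, c) per class.
FUZZY_SETS = {
    "normal": (-1.0, 2.0, 5.0),
    "warning": (2.0, 5.0, 8.0),
    "alarm": (5.0, 8.0, 11.0),
}


def memberships(reading):
    """Degree of membership of a reading in each fuzzy class."""
    return {name: triangular(reading, *abc) for name, abc in FUZZY_SETS.items()}


def classify(reading):
    """Pick the class with the highest membership grade."""
    grades = memberships(reading)
    return max(grades, key=grades.get)
```

A reading can belong partly to two classes at once, which is how a fuzzy classifier represents the borderline cases that a crisp alarm limit cannot.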
Neural networks and expert systems have also been combined with other AI techniques
to enhance machine diagnostic systems. Garga et al [165] proposed a hybrid reasoning
approach combining neural networks, fuzzy logic and expert systems to integrate domain
knowledge and test operational data. Evolutionary algorithms [166], which mimic the
natural evolution process of a population, have also been shown to have merit when
applied to machine diagnostics. Genetic algorithms (GA) are the most widely used type
of EA. Sampath et al [167] proposed a GA-based optimization approach to gas turbine
diagnostics. Examples of ANNs combined with GA and other EAs for
machine fault classification and diagnostics can be found in [168-170].
Other approaches
Model-based approaches [171,172] form another class of machine fault diagnostic
approaches. They utilize physics-specific, explicit mathematical models of the
monitored machine. Based on this explicit model, residual generation methods such as
the Kalman filter, parameter estimation (or system identification), and parity relations are
used to obtain signals, called residuals, which indicate fault presence in the machine. The
residuals are evaluated to detect, isolate and identify the fault(s). This general procedure is
illustrated in Figure 13-2. Model-based approaches can be more effective than other
approaches if a correct and accurate model is built. However, explicit mathematical
modeling may not be feasible for complex systems.
Figure 13-2: General flowchart of a model-based approach
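The residual idea can be sketched in a few lines. The example below assumes a hypothetical first-order model of the healthy machine, x[k+1] = a*x[k] + b*u[k], runs it in parallel with the measured signal, and raises an alarm when the residual exceeds a fixed threshold. The model, its coefficients, the injected bias fault, and the threshold are all illustrative assumptions, not a method from the cited works.

```python
import numpy as np


def simulate_model(u, a=0.9, b=0.1, x0=0.0):
    """Explicit (assumed) model of the healthy machine: x[k+1] = a*x[k] + b*u[k]."""
    x = np.empty(len(u))
    x[0] = x0
    for k in range(len(u) - 1):
        x[k + 1] = a * x[k] + b * u[k]
    return x


def first_alarm(measured, predicted, threshold):
    """Index of the first sample whose residual magnitude exceeds the threshold."""
    residual = measured - predicted
    alarms = np.flatnonzero(np.abs(residual) > threshold)
    return int(alarms[0]) if alarms.size else None


rng = np.random.default_rng(0)
u = np.ones(100)                                        # constant input
healthy = simulate_model(u)                             # model prediction
measured = healthy + rng.normal(0.0, 0.05, size=100)    # noisy measurement
measured[60:] += 1.0                                    # injected bias fault at sample 60
alarm_index = first_alarm(measured, healthy, threshold=0.5)
```

The quality of the detection depends entirely on how well the model matches the healthy machine, which is the practical limitation noted above for complex systems.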

Various model-based diagnostic approaches have been applied to fault diagnosis of a
variety of mechanical systems such as gearboxes [173,174], bearings [175-177], rotors
[178,179], and cutting tools [180]. Bartelmus [181,182] used mathematical modeling and
computer simulation to aid signal processing and interpretation. Hansen et al [183]
proposed an approach to more robust diagnosis based on the fusion of sensor-based and
model-based information. Vania and Pennacchi [184] developed methods to
measure the accuracy of the results obtained with model-based techniques aimed at
identifying faults in rotating machines. The information provided by these methods was
shown to be very helpful for precise fault identification as well as for evaluating the
confidence of the diagnostic decision.
Petri nets, as a general purpose graphical tool for describing relations existing between
conditions and events [185], have been applied recently to machine fault detection and
diagnostics. Propes [186] used a fuzzy Petri net to describe operating mode transition and
to detect a mode change event for fault detection and diagnosis in complex systems.
Yang [187] proposed a hybrid Petri-net modeling method coupled with fault-tree analysis
and Kalman filtering for early failure detection and fault isolation. Yang et al [188]
introduced an approach for integrating case-based reasoning with Petri net for fault
diagnosis of induction motors. The integrated approach was shown to outperform the
conventional case-based reasoning expert system.
Prognostics
Compared with diagnostics, the literature on prognostics is much smaller. There are two
main prediction types in machine prognostics. The most obvious and widely used is the
prediction of how much time is left before a failure (or one or more faults or
“potential failures”) occurs, given the current machine condition and the past (and future)
operating profile. The time left before observing a failure is usually called “remaining
useful life” or RUL.
In many situations, especially when a fault or a failure has catastrophic consequences
(e.g. in a nuclear power plant), it is desirable to predict the chance that a machine operates
without a fault or a failure up to some future time (for example, the next inspection),
given the machine’s current condition and its past operational profile. In the general
maintenance context, the probability that a machine operates without fault until the next
inspection is a useful reference in determining whether or not the
inspection interval is appropriate.
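This second type of prediction can be written compactly. As a sketch, suppose the machine has survived to age $t$, and let $R(\cdot)$ denote its reliability (survival) function and $h(\cdot)$ its hazard rate. The symbols $t$, $\Delta$, $R$ and $h$ are notation introduced here for illustration, and the dependence on the current condition is omitted for brevity. The probability of operating without failure until the next inspection, $\Delta$ time units ahead, is then:

```latex
R(t + \Delta \mid t) \;=\; \frac{R(t + \Delta)}{R(t)}
\;=\; \exp\!\left( -\int_{t}^{t+\Delta} h(u)\, du \right)
```

A value of this conditional reliability below the acceptable level suggests that the inspection interval $\Delta$ is too long.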
Most of the papers in the literature of machine prognostics discuss only the former type
of prognostics, namely RUL estimation. Only a small number of papers address the
second type of prognostics [106,189]. In the following sections, we discuss 1. RUL
estimation, 2. prognostics that incorporate maintenance actions or policies, and 3. the
determination of the appropriate condition monitoring interval.
Remaining useful life

RUL, also called remaining service life, residual life, or remnant life, refers to the time
left before observing a failure, given the current machine age, its condition, and the past
operation profile. Note here that the definition of failure is crucial to the interpretation of
RUL. Although there is some controversy in current industrial practice, a formal
definition of failure can be found in many reliability textbooks.
Prognosis requires knowledge (or data) on the fault propagation process as well as
knowledge (or data) on the failure mechanism. The fault propagation process is usually
tracked by a trending or forecasting model for certain condition variables. There are two
ways of describing the failure. The first assumes that failure depends on the condition
variables (which reflect the actual fault level) and a predetermined boundary. The most
commonly used failure definition in this case is simple: failure occurs when the fault
reaches the predetermined level.
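A minimal sketch of this first failure definition follows. It assumes, purely for illustration, that the condition variable follows a linear trend; the fitted line is extrapolated to the predetermined failure level and the RUL is the time remaining from the last observation to that crossing. The function name and the sample data are hypothetical.

```python
import numpy as np


def rul_from_linear_trend(times, values, failure_level):
    """Extrapolate a fitted straight line to the failure level and return the RUL."""
    slope, intercept = np.polyfit(times, values, 1)
    if slope <= 0:
        # No upward trend: the fitted line never reaches the failure level.
        return float("inf")
    crossing_time = (failure_level - intercept) / slope
    return max(0.0, crossing_time - times[-1])


# Hypothetical condition readings taken at inspections 0..4.
rul = rul_from_linear_trend([0, 1, 2, 3, 4], [1.0, 1.5, 2.0, 2.5, 3.0], failure_level=5.0)
```

In practice the trending model is rarely linear, and the works cited below replace the straight line with time series, stochastic process, or neural network models.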
The second builds a model for the failure mechanism using available historical data.
Various definitions of failure can be used. A failure can be defined as the event that the
machine is operating at an unsatisfactory level (a partial failure); or, it can be a total
functional failure when the machine cannot perform its intended function at all; or it can
be a breakdown when the machine stops operating; or it can be the attainment of a
potential failure condition defined in terms of acceptable risk. Similar to diagnosis, the
prognostic methods fall into three main categories: statistical approaches, artificial
intelligence approaches and model-based approaches.
Goode et al [101] used SPC to separate the whole machine life into two intervals, the I-P
(Installation-Potential failure) interval in which the machine is running correctly and the
P-F (Potential failure-Functional failure) in which the machine is running with a problem.
Based on two Weibull distributions assumed for the I-P and P-F time intervals
respectively, failure prediction was derived in the two intervals and the RUL was
estimated. Yan et al [190] employed a logistic regression model to calculate the
probability of failure for given condition variables and an ARMA time series model to
trend the condition variables for failure prediction. A predetermined level of failure
probability was used to estimate the RUL. Phelps et al [191] proposed to track sensor-
level test-failure probability vectors instead of the physical system or sensor parameters
for prognostics. A Kalman filter with an associated interacting multiple model (IMM)
was used to perform the tracking.
Two statistical models in survival analysis, PHM and PIM, are useful tools for RUL
estimation in combination with a trending model for the fault propagation process.
Banjevic and Jardine [192] discussed RUL estimation for a Markov failure time process
which includes a joint model of PHM and a Markov property for the covariate evolution
as a special case. Vlok et al [99] applied PIM with covariate extrapolation to estimate
bearing residual life. HMM, a stochastic process model discussed earlier, is also a
powerful tool for RUL estimation [193,194]. Lin and Makis [195] introduced a partially
observable continuous-discrete stochastic process model to describe the hidden evolution
process of the machine state associated with the observation process. RUL estimation, as
one of the prediction tasks, was generated by the model. Wang et al [109] proposed a
stochastic process, called a “gamma process”, with hazard rate as the residual life
prediction criterion. The condition information considered was expert judgment based on
vibration analysis. Wang [108] used the residual delay time concept and stochastic
filtering theory to derive the residual life distribution.
AI techniques applied to RUL estimation have been considered by some researchers.
Zhang and Ganesan [196] used self-organizing neural networks, for multivariable
trending of the fault development, to estimate the residual life of a bearing system. Wang
and Vachtsevanos [197] applied dynamic wavelet neural networks to predict the fault
propagation process and estimate the RUL as the time left before the fault reaches a given
value. Yam et al [198] applied a recurrent neural network for predicting the machine
condition trend. Dong et al [199] utilized a grey model and a BP neural network to
predict machine condition. Wang et al [200] compared the results of applying recurrent
neural networks and neural-fuzzy inference systems to predict the fault damage
propagation trend. Chinnam and Baruah [201] presented a neural-fuzzy approach to
estimating RUL for the situation where neither failure data nor a specific failure
definition model is available, but domain experts with strong experiential knowledge are
on hand.
Model-based approaches to prognosis require specific failure mechanism knowledge and
theory relevant to the monitored machine. Ray and Tangirala [202] used a nonlinear
stochastic model of fatigue crack dynamics for real-time computation of the time-
dependent damage rate and accumulation in mechanical structures. Li et al [203,204]
introduced two defect propagation models via failure mechanism modeling for RUL
estimation of bearings. Oppenheimer and Loparo [178] applied a physical model for
predicting the machine condition in combination with a fault strengths-to-life model,
based on a crack growth law, to estimate RUL. Chelidze and Cusumano [205] proposed a
general method for tracking the evolution of a hidden damage process given a situation
where a slowly evolving damage process is related to a fast, directly observable dynamic
system. Luo et al [206] introduced an integrated prognostic process based on data from
model-based simulations under nominal and degraded conditions. Kacprzynski et al [207]
proposed fusing the physics of failure modeling with relevant diagnostic information for
helicopter gear prognosis.
A different way of applying model-based approaches to prognosis is to derive the explicit
relationship between the condition variables and the lifetimes (current lifetime and failure
lifetime) via failure mechanism modeling. Two examples of research along this line are
[208] for machines considered as energy processors subject to vibration monitoring and
[209] for bearings with vibration monitoring. Lesieutre et al [210] developed a
hierarchical modeling approach for system simulation to assess RUL. Engel et al [211]
discussed some practical issues regarding accuracy, precision and confidence of the RUL
estimates.
Prognostics incorporating maintenance policies

The aim of machine prognosis is to provide decision support for maintenance actions. As
such, it is natural to include maintenance policies in the consideration of the machine
prognostic process. This makes the situation more complicated since extra effort is
needed to describe the nature of maintenance policies. We are particularly interested
in policies governing the broad class of maintenance actions known as “CBM”, which
this review sets out to describe. Compared to conventional maintenance,
mathematical models applicable to the CBM scenario are much fewer [212]. See also
[213] for more recent references on maintenance modeling.
The main idea of prognostics incorporating maintenance policies is to optimize the
maintenance policies according to certain criteria such as risk, cost, reliability and
availability. Risk is defined as the combination of failure probability and consequence.
Usually, consequence can be measured by cost. In this case, the risk criterion is
equivalent to the cost criterion. However, there are some cases, for example, critical
equipment in a power plant, in which consequence cannot be estimated by cost. In these
scenarios, a probability or reliability criterion would be more appropriate. Since the cost
criterion applies to most situations, it is not surprising that the literature in CBM
optimization is dominated by cost-based CBM optimization. The consequence analysis
technique discussed in [214] is a general risk evaluation tool for CBM optimization based
on various kinds of criteria.
In condition monitoring, the systems monitored fall into two
categories: completely observable systems and partially observable systems. For a
completely observable system, the machine state can be completely observed or
identified. The information collected from this system is called direct information. For a
partially observable system, the machine condition cannot be fully observed or identified.
The information obtained from this system is called indirect information, which is
related to, but does not fully reveal, the real machine state. In the text to follow, we discuss
various models and methods for these two types of systems.
First, we consider completely observable systems. Wang [215] developed a CBM model
based on a random coefficient growth model where the coefficients of the regression
growth model are assumed to follow known distribution functions. The model was used
to determine the optimal critical level and inspection interval in CBM in terms of a
criterion of interest, which can be cost, downtime or reliability. In a series of works
[216-218], a stochastic model, the gamma process, was used to describe the deterioration
process; the system was considered to have failed if its condition jumped above a pre-set failure
level; a sequential (or non-periodic) inspection interval was assumed. Grall et al [216]
went on to assume a multi-level control-limit rule replacement policy and obtained the
optimal thresholds and inspection scheduling by minimizing the expected maintenance
cost per unit time. Castanier et al [217] assumed a multi-level control-limit rule
repair/replacement policy and obtained optimal thresholds and inspection scheduling
based on a cost criterion and an availability criterion as well. Dieulle et al [218] assumed
a one-level replacement policy and a sequentially chosen inspection interval using a
maintenance scheduling function, and obtained the optimal threshold and inspection
scheduling by minimizing the global cost per unit time. Amari and McLaughlin [219]
utilized a Markov chain to describe the CBM model for a deterioration system subject to
periodic inspection. The optimal inspection frequency and maintenance threshold were
found to maximize the system availability.
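The flavor of these threshold-optimization models can be conveyed by a small Monte Carlo sketch. It simulates a gamma deterioration process inspected at a fixed interval (a simplification of the sequential inspection schemes cited above), replaces preventively when the observed level reaches a control limit, replaces correctively when the level is found above the failure level, and estimates the long-run cost per unit time as total cost divided by total time. All parameter values and cost figures are hypothetical.

```python
import numpy as np


def cost_rate(threshold, fail_level=20.0, tau=1.0, shape_rate=1.0, scale=1.0,
              c_inspect=1.0, c_prevent=20.0, c_fail=100.0,
              n_cycles=2000, seed=0):
    """Estimate cost per unit time of a control-limit policy by simulation."""
    rng = np.random.default_rng(seed)
    total_cost = 0.0
    total_time = 0.0
    for _ in range(n_cycles):
        level, age = 0.0, 0.0
        while True:
            # Gamma-distributed deterioration increment over one inspection interval.
            level += rng.gamma(shape_rate * tau, scale)
            age += tau
            total_cost += c_inspect
            if level >= fail_level:        # failure found at inspection
                total_cost += c_fail
                break
            if level >= threshold:         # preventive replacement
                total_cost += c_prevent
                break
        total_time += age
    return total_cost / total_time


rate_with_limit = cost_rate(threshold=14.0)
rate_run_to_failure = cost_rate(threshold=20.0)   # limit equal to the failure level
```

Evaluating `cost_rate` over a grid of thresholds and inspection intervals gives a crude version of the optimization that the cited papers carry out analytically.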
Berenguer et al [220] presented a CBM structure for continuously deteriorating
multi-component systems, which allows cost savings by performing simultaneous maintenance
actions. Barata et al [221] used Monte-Carlo simulation to model continuously monitored
deteriorating systems, non-repairable single components or multi-component repairable
systems. Then optimal degradation thresholds of maintenance intervention were found to
minimize the expected total system cost over a given mission time by a direct search.
Marseguerra et al [222] used GA to find the optimal thresholds in the previous work by
simultaneously optimizing two typical objectives of interest, profit and availability.
Hosseini et al [223] employed generalized stochastic Petri nets to represent a CBM model
for a system subject to deterioration failures and Poisson failures. It was assumed that
deterioration failures are restored by major repair and Poisson failures are restored by
minimal repair. The optimal maintenance policy and inspection interval were then found
to maximize system throughput.
We turn now to the consideration of partially observable systems. Ohnishi et al [224]
applied a Markov decision process model for a discrete-time deterioration system to find
the optimal replacement policy in which minimal repair is used to restore a failure if the
decision is not to replace. Hontelez et al [225] formulated the decision process as a
discrete Markov decision problem based on a continuous deterioration process to find the
optimum maintenance policy with respect to cost. Aven [226] presented a counting
process approach to determining the replacement policy minimizing the long run
expected cost. Barbera et al [227] proposed a CBM model assuming exponential
failures with a failure rate that depends on the condition variables, and fixed inspection intervals.
The optimal maintenance action was then found to minimize the long-run average cost of
maintenance actions and failures. Barbera et al [228] extended the previous work to the
case of two-unit series systems. Christer et al [229] used a state space model and the
Kalman filter to predict the erosion condition of the inductors in an induction furnace
conditional on the indirect measurements to date. Then a replacement cost model was
developed to obtain the optimal replacement policy given all available information.
Kumar and Westberg [230] proposed a reliability based approach for estimating the
optimal maintenance time interval or the optimal threshold of the maintenance policy to
minimize the total cost per unit time. The authors used PHM to identify the importance of
monitored variables and a total time on test (TTT) plot to find the optimal solution. Makis
and Jardine [231] established a CBM model using a Markov process to describe the
evolution process of condition variables and a PHM to describe the failure mechanism
which depends both on age and condition variables. This CBM model was further
elaborated in [232]. The optimal replacement policy of the hazard control limit type was
then determined by minimizing the long-run expected total cost per unit time. Makis et al
[233] applied optimal stopping theory to find the replacement policy maximizing the total
expected profit during the machine life where no assumption of monotonicity of the
signal process is made. Makis and Jiang [234] presented a framework for CBM
optimization based on a continuous-discrete stochastic model. The evolution of the
hidden machine state was described by a continuous-time Markov process, and the
condition monitoring process was described by a discrete-time observation stochastic
process which depends on the hidden machine state. Then the optimal replacement policy
was found to minimize the long run expected cost per unit time using optimal stopping
theory. Wang [235] applied a stochastic recursive control model for CBM optimization
based on the assumptions that the item monitored follows a two-period failure process
in which the first period is one of normal life and the second, one of potential failure. A stochastic
recursive filtering model was used to predict the residual life, and then a decision model was
established to recommend the optimal maintenance actions. The optimal condition
monitoring intervals were determined by a hybrid of simulation and analytical analysis.
Okumura and Okino [236] constructed a generalized condition-based maintenance model,
in which residual life loss and replacement preparation lead-time are included. The
optimal inspection time vector and warning level of the target maintained system under a
constraint on the preventive replacement probability were obtained by minimizing the long-run
average incurred cost per unit time. Barros et al [237] considered an optimal CBM policy
for a two-unit parallel system for which unit-level monitoring information is imperfect
and/or partial.
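The hazard control limit policy mentioned above for the PHM-based model can be illustrated as follows. The sketch assumes a Weibull baseline hazard multiplied by an exponential function of the condition variables (the Weibull PHM form), and compares the resulting hazard with a control limit. The parameter names `beta`, `eta`, `gammas` and the numerical values are hypothetical, and the control limit is simply given here rather than obtained by cost minimization.

```python
import math


def weibull_phm_hazard(age, covariates, beta, eta, gammas):
    """Hazard h(t, z) = (beta/eta) * (t/eta)**(beta-1) * exp(sum(gamma_i * z_i))."""
    baseline = (beta / eta) * (age / eta) ** (beta - 1.0)
    return baseline * math.exp(sum(g * z for g, z in zip(gammas, covariates)))


def recommend_replacement(age, covariates, beta, eta, gammas, control_limit):
    """Control-limit rule: replace when the current hazard reaches the limit."""
    return weibull_phm_hazard(age, covariates, beta, eta, gammas) >= control_limit


# Hypothetical case: one covariate (e.g. an oil-analysis reading) with weight 0.5.
h_clean = weibull_phm_hazard(100.0, [0.0], beta=1.0, eta=500.0, gammas=[0.5])
h_dirty = weibull_phm_hazard(100.0, [2.0], beta=1.0, eta=500.0, gammas=[0.5])
```

The same working age can therefore lead to different decisions depending on the condition data, which is what distinguishes this policy from age-based replacement.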
Condition monitoring interval

There are two broad types of condition monitoring: continuous and periodic. In
continuous monitoring, a machine is monitored continuously (usually by mounted sensors)
and a warning alarm is triggered whenever something wrong is detected. Two limitations of
continuous monitoring are: 1) it is often expensive; 2) the continuous monitoring of raw
signals produces large volumes of data, including noise, which can make diagnostics
difficult and inaccurate. Periodic monitoring is therefore used because it is more cost
effective. Diagnostics from periodic monitoring are often more accurate because they use
filtered and/or processed data. Of course, the risk of periodic monitoring is the
possibility of missing some failure events that occur between successive inspections
([34], p. 131).
An important issue relevant to periodic monitoring is the determination of the condition
monitoring interval. Optimal design of the condition monitoring interval (or inspection
interval) has been studied together with optimal threshold design in some of the works
discussed in the previous section [215-219,223,230,235,236]. The following research
works considered condition monitoring interval determination only. Christer and Wang
[238] derived a simple model to find the optimal time for next inspection based upon the
wear condition obtained up to current inspection. The criterion is to minimize the
expected cost per unit time over the time interval between the current inspection and the
next inspection time. Okumura [239] used a delay-time model to obtain the optimal
sequential inspection intervals of a CBM policy for a deteriorating system by minimizing
the long-run average cost per unit time. Goode et al [240] used the model developed in
[101] to determine the length of the next condition monitoring interval for a given risk
level. Wang [241] developed a model for optimal condition monitoring intervals based on
the failure delay time concept and the conditional residual time concept. Condition
monitoring is assumed to be performed at a fixed interval over the
whole life and, in addition, at a dynamic interval in the failure delay-time
period, recognizing that more frequent monitoring might be needed in this later period.
A hybrid of simulation and analytical procedures was used to find the optimal intervals
based on one of five cost criterion functions.
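Choosing the next monitoring interval for a given risk level can be sketched with a deliberately simple model. The example assumes a single two-parameter Weibull life distribution (not the two-interval model of [101]) and returns the longest interval for which the probability of failure before the next inspection, given survival to the current age, does not exceed the stated risk. The function name and parameter values are hypothetical.

```python
import math


def next_interval_for_risk(age, risk, beta, eta):
    """Longest interval d with P(fail in (age, age + d] | survived to age) <= risk.

    For a Weibull(beta, eta) life: ((age + d)/eta)**beta = (age/eta)**beta - ln(1 - risk).
    """
    if not 0.0 < risk < 1.0:
        raise ValueError("risk must lie strictly between 0 and 1")
    target = (age / eta) ** beta - math.log(1.0 - risk)
    return eta * target ** (1.0 / beta) - age


# With wear-out behavior (beta > 1) the allowed interval shrinks as the machine ages.
early = next_interval_for_risk(age=100.0, risk=0.1, beta=2.0, eta=500.0)
late = next_interval_for_risk(age=1000.0, risk=0.1, beta=2.0, eta=500.0)
```

This reproduces, in a crude way, the idea that monitoring should become more frequent later in life.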
Multiple sensor data fusion
For a complex system, a single sensor is limited in its capability of collecting enough data
for accurate condition monitoring, fault diagnosis and prognosis. Multiple sensors are
needed in order to do a better job. With the rapid development of computer science and
advanced sensor technology, there has been an increasing trend in the use of multiple
sensors for condition monitoring, fault diagnosis and prognosis. Data collected from
different sensors may contain dissimilar partial information on the same machine’s
condition. The problem is how to combine all partial information obtained from
different sensors for accurate machine diagnosis and prognosis. The solution to this
problem is the subject of multisensor data fusion.
There are many techniques for multisensor data fusion. They can be grouped into three
main approaches: (1) data-level fusion, (2) feature-level fusion, and (3) decision-level
fusion. For more discussion on these three approaches, see [242,243]. Heger and Pandit
[90] used a data-level fusion approach to fuse images obtained by multidirectional
illumination to generate an image with a high degree of relevant information for grinding
tool condition monitoring and fault diagnostics. Liu and Wang [244] briefly reviewed
some applications of these three multisensor data fusion approaches to machine diagnosis
and prognosis, and applied a feature-level fusion approach called Cascade-Correlation
neural network for rotating imbalance diagnosis. Diagnostics based on multisensor
data fusion were shown to outperform diagnostics based on a single sensor. Wang and
Wang [245] used a decision-level data fusion approach called Dempster-Shafer evidence
theory for diesel engine fault diagnosis. Kozlowski et al [246] proposed a model-based
approach to battery diagnostics using decision-level data fusion. Byington et al [247]
explored methods for fusing non-commensurate oil and vibration features for better
gearbox fault diagnostics and prognostics. Mannan et al [248] applied a radial basis
function neural network to fuse the features extracted from images of machined surfaces
and acoustic signals generated during the machining process. The results were applied to
the diagnostics of cutting tools. Hannah et al [249] discussed frameworks in data fusion
applications for condition monitoring and diagnostic engineering. Data fusion combined
with CBM optimization was studied in [250,251]. Assessment and evaluation of data and
information fusion strategies were discussed in [252,253]. Wang and Wang [254]
discussed the reliability and self-diagnosis of sensors in a multisensor data fusion
diagnostic system.
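Decision-level fusion with Dempster-Shafer evidence theory can be illustrated by Dempster's rule of combination. In the sketch below each sensor assigns belief mass to sets of fault hypotheses; products of masses whose hypothesis sets intersect are accumulated on the intersection, and the result is renormalized by the non-conflicting mass. The two fault names and the mass values are hypothetical.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule for two mass functions keyed by frozenset hypotheses."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            combined[overlap] = combined.get(overlap, 0.0) + wa * wb
        else:
            conflict += wa * wb          # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


IMBALANCE = frozenset({"imbalance"})
MISALIGNMENT = frozenset({"misalignment"})
ANY = IMBALANCE | MISALIGNMENT           # "either fault": uncommitted belief

sensor_1 = {IMBALANCE: 0.6, ANY: 0.4}    # e.g. evidence from a vibration sensor
sensor_2 = {MISALIGNMENT: 0.5, ANY: 0.5} # e.g. evidence from a second sensor
fused = combine(sensor_1, sensor_2)
```

The mass left on the full set `ANY` represents what neither sensor could decide, which is the feature that distinguishes this approach from a purely probabilistic fusion.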
In a mechanical system with multiple sensors installed, data collected from each sensor
may be a complicated mixture of data from several sources. But only some of the sources
are related to a particular machine condition of interest. The problem is to separate the
various sources for better machine diagnosis and prognosis by fusing the observed
multisensor data. The technique for solving this problem is known as blind source
separation (BSS) [255]. Recently, BSS has received increasing attention in the area of
machine fault diagnostics and prognostics. The general idea behind BSS is shown in
Figure 13-3. It is assumed that the source signals $S(t) = [s_1(t), \ldots, s_n(t)]$, generated from
$n$ unknown independent sources, and the noise signals $N(t)$, independent of the source
signals, are combined together by an unknown mixing process. The mixed result is
observed at the channel output as an $m$-dimensional ($m \ge n$) signal
$X(t) = [x_1(t), \ldots, x_m(t)]$. A formula for the mixing process can be written as
$$X(t) = f(S(t), N(t))$$
where $f$ is generally a non-linear, time-dependent function. A commonly used form for
the mixing process separates the signal and noise, i.e., $X(t) = f(S(t)) + N(t)$. The
objective of BSS is to find a separating function that is applied to the observed signals
$X(t)$ to obtain an estimate of the source signals $S(t)$.
Figure 13-3: General idea of BSS
In the literature, there are two categories of mixing process: instantaneous and
convolutive. A mixing process is instantaneous if $f(\cdot)$ is a time-independent
(memoryless) function, and convolutive otherwise. The convolutive mixing
process is more common, especially for mechanical systems. The instantaneous mixing
model is also called an “independent component analysis” (ICA) model, which is a
natural extension of PCA. For a survey of ICA theory and methods, see [256]. Several
authors applied ICA together with other signal processing techniques for condition
monitoring and machine fault diagnosis [257-260]. Tian et al [261] used ICA in
frequency domain and wavelet filtering for gearbox fault diagnostics. Zhang et al [262]
studied ICA for partially blind source separation of diagnostic signals for bearing faults
with prior knowledge. For a convolutive mixing process, BSS is more complicated. Gelle
et al [263] compared two approaches, namely a temporal approach and a frequency
approach, to solving the BSS problem of rotating machine signals for monitoring and
diagnosis purposes. They further studied the application of the temporal approach to
bearing fault diagnostics [264]. Tse and Zhang [265] applied the BSS based method of
second order statistics to separate aggregated vibration signals generated from a number
of mechanical components for machine fault diagnostics. Vilela et al [266] used the
temporal de-correlation approach to separate the mixed acoustic signals for machine
monitoring and fault diagnosis. Serviere et al [267] applied BSS to separate noisy
harmonic signals for rotating machine diagnostics on a semi-blind mixing basis.
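For the instantaneous case, the separation idea can be sketched for two sources and two channels. The example below assumes a linear, memoryless mixing with small additive noise, whitens the observations, and then searches for the rotation that makes the outputs as non-Gaussian as possible (measured by excess kurtosis). This is a toy variant of ICA written for illustration, not one of the algorithms used in the cited papers; the mixing matrix and source distributions are hypothetical.

```python
import numpy as np


def excess_kurtosis(y):
    y = (y - y.mean()) / y.std()
    return np.mean(y ** 4) - 3.0


def whiten(x):
    """Center the rows of x (channels x samples) and decorrelate them to unit variance."""
    centered = x - x.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(np.cov(centered))
    return (eigvec / np.sqrt(eigval)).T @ centered


def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])


def separate_two_sources(x, n_angles=360):
    """Estimate two sources (up to order, sign and scale) from two mixed channels."""
    z = whiten(x)

    def contrast(theta):
        y = rotation(theta) @ z
        return excess_kurtosis(y[0]) ** 2 + excess_kurtosis(y[1]) ** 2

    angles = np.linspace(0.0, np.pi / 2.0, n_angles, endpoint=False)
    best = max(angles, key=contrast)
    return rotation(best) @ z


rng = np.random.default_rng(0)
n = 5000
sources = np.vstack([rng.uniform(-1.0, 1.0, n), rng.laplace(0.0, 1.0, n)])
mixing = np.array([[1.0, 0.5], [0.3, 1.0]])              # unknown to the separator
observed = mixing @ sources + rng.normal(0.0, 0.01, (2, n))
estimated = separate_two_sources(observed)
```

The recovered signals match the sources only up to ordering, sign and scale, which is an inherent ambiguity of blind separation. Convolutive mixtures require considerably more elaborate methods.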
Concluding remarks
In this chapter, we have summarized recent research and developments in machinery
diagnostics and prognostics used in implementing CBM. Various techniques, models and
algorithms were reviewed. Of the three main steps of a CBM program, namely, data
acquisition, signal processing, and maintenance decision making, we focused on the latter
two. Finally we discussed various techniques for multiple sensor data fusion.
Although advanced maintenance techniques are available in the literature, CBM
is under-employed by maintenance departments. Commercial predictive maintenance
solution providers have not kept pace with recent advances in signal processing and
decision support. Yet in many situations, especially where both maintenance and failure
are very costly, well-developed and well-managed condition-based maintenance is
clearly a better choice than current time-based, or inadequate condition-based,
maintenance policies. Expert knowledge of both the application field and of reliability
and maintenance theory is required for selecting and implementing effective
condition-based maintenance policies in each operating context.
Among the reasons that advanced maintenance technologies have not been well
implemented in industry are: 1) lack of data due to incorrect data collecting approaches
(see page 176); 2) lack of efficient communication between theory developers and
practitioners in the area of reliability and maintenance; 3) lack of efficient validation
approaches; and 4) difficulty in communicating the principles of CBM to business policy
makers and management executives.
With the rapid development of MEMS (micro-electro-mechanical systems) technology, future
trends in CBM research will include the design of intelligent devices capable of
continuously monitoring their own health (see, e.g. [268]). Fast and robust on-line signal
processing algorithms are crucial to the design of intelligent devices. Such novel
technology will, no doubt, stimulate increased research interest in this area.
Another trend in CBM research is a growing collaboration among different, yet
individually specialised, CBM research groups, for the joint development of integrated
platforms for enhanced diagnostics and prognostics (see [2] for an application of this
idea).
Page 200
Part 3. Reliability Centered Maintenance
Chapter 14. Pillars of RCM

Introduction
The goal of maintenance has evolved since the industrial revolution. Maintenance has
seen its focus shift, almost imperceptibly, from being the “fixers” to becoming the
guarantors of reliability, availability, maintainability, and productivity, safely and at
lowest cost. The pivotal study by S. Nowlan and H. Heap chronicles the development of the
MSG-3 (Maintenance Steering Group) report, forming the basis and regulatory structure
for maintenance procedures used in commercial and military aviation. Their work,
entitled “Reliability-centered Maintenance” (RCM), reveals thoughtful solutions to the
multitude of challenges that confront the maintenance departments of all industries. The
study describes a tested process by which organizations, through well considered
proactive maintenance tasks, may attain their business objectives. Much of the material in
this course and book has been borrowed from the Nowlan and Heap study.
SAE JA1011 defines RCM as
“… a specific process used to identify the policies which must be implemented to
manage the failure modes which could cause the functional failure of any physical
asset in a given operating context.”
Figure 14-1 illustrates RCM’s three pillars:
1) The initial information gathering and analysis process, called “failure modes and
effects analysis” (FMEA)
2) The decision algorithm, and
3) The on-going information gathering and analysis process, called “age
exploration”.
Page 201
Figure 14-1: The three pillars of reliability-centered maintenance
One or more teams of knowledgeable personnel conduct RCM analysis on every significant
item in the organization’s asset hierarchy. The RCM worksheet of Figure 14-2 records the
results of FMEA (columns 2 to 5) and those of the decision algorithm (columns 6 to 13).
The third pillar of RCM is a follow-on continuous process. It refines the conservative
assumptions that, for lack of historical information at the time, were inevitably made in
the initial (pillars 1 and 2) RCM analyses.
Figure 14-2: The RCM Worksheet form for recording the seven RCM information elements specified
by the standard SAE JA1011. See detailed guide in Figure 16-4 on page 247.
Page 202
The upper areas of the RCM worksheet of Figure 14-2 record the asset’s “operating
context”. The completed worksheets form the organization’s evolving reliability
knowledge repository or knowledge base. A thorough knowledge base documents the
failure behavior, the consequences of failure, and the reasons for performing each
proactive task. It combines the experience of the personnel who maintain and operate the
equipment with the knowledge of manufacturing, design, and process experts. In this
chapter we present the detailed methodology for conducting RCM analysis. We focus, in
Part 3, on the first two pillars: 1. FMEA, the initial information gathering, and 2. the
decision algorithm. (Pillar 3, ongoing age exploration, has been developed in Part 1,
Chapters 1 to 5.)
In the following chapters we consider the seven questions or “knowledge elements” of RCM
analysis. The 7-question framework of RCM suggests a sequential process, but this is not
the case. RCM analysis is an iterative process. When analyzing one question we leap ahead,
mentally, anticipating the viewpoint of subsequent questions. Nevertheless, RCM is
structured.
RCM Execution Strategies

In the following chapters we emphasize the popular and successful team approach to
carrying out RCM analysis. An individual with experience in the process of RCM is
known as the RCM facilitator. He or she leads a team of subject matter experts that
includes the operators and maintainers of the equipment under analysis. The RCM
facilitator plays a vital role that balances the free expression of multiple viewpoints with
the need to progress quickly and methodically within a structured forum. The RCM
facilitator or team leader’s role is described in the checklist of the Appendix on page 271.
While RCM teams, often referred to as “facilitated review groups”, provide excellent
results in most industrial or plant settings, other RCM execution strategies are appropriate
to specific situations. J.C. Leverette points out that NAVAIR does not always use the
facilitated review group approach. NAVAIR has conducted numerous RCM analyses
using dedicated analysts.172 In those situations, the analysis is performed by one or more
RCM analysts who gather information from all relevant sources including system experts,
operators and maintainers. Typically the analyst is an RCM expert with anywhere from
some to extensive knowledge of the equipment he or she is analyzing. Situations
involving new acquisitions or new technology, where the majority of available data may
be engineering or test data, are often most efficiently analyzed by one or two technical
specialists.
172
“RCM in the Public Domain: An Overview of the US Naval Air Systems Command's RCM Process” by
J.C. Leverette and Andres Echeverry, Anteon Corporation. Originally presented at RCM-2005 - The
Reliability Centered Maintenance Managers' Forum, www.reliabilityweb.com
Page 203
Chapter 15. Failure Modes and Effects Analysis
Figure 15-1: The reliability-centered maintenance process
Question 1 – Functional Analysis
The process
The right maintenance activity addresses the preservation of function. This sounds
obvious, yet maintenance professionals are often astonished to discover that the
functions of the machinery under their control have been inadequately or incompletely
identified. Consequently the failures of those functions and their causes have, by and
large, escaped conscious effort to deal with them. Function identification in an item is
neither obvious nor trivial. Consider, as a familiar example, your own automobile. The
following exercise173 in RCM facilitation illustrates the subtlety and importance of rigor
in functional analysis.
Figure 15-2 Your car

"Your Car" - What are its functions? Asking this question of a group of automobile
owners invariably elicits the standard answer, “To get from A to B”. We pursue this
incomplete answer with the following question, “What is it, in the preceding function
statement (To get from A to B), that distinguishes your car from your feet?” In other
words, what makes us want to use a car rather than our feet to get from A to B? After
considerable discussion the group appends “at speeds up to 85 mph” to the evolving
function statement. Where there is a wide diversity of opinion, we need to establish a
consensus174 of what users really want from a physical asset.
173
The car functional analysis was developed by J. Moubray and Aladon. www.aladon.com
The function statement now reads “To get from A to B at speeds up to 85 mph”. We
ask subsequently, “Is there anything (in that function statement) that distinguishes our
car from a motor bike?” The function statement once again is amended to “To get a
driver and up to 4 passengers from A to B at speeds up to 85 mph”. We might at this
point ask, “Is there anything (in that function statement) that distinguishes a car from
a small helicopter?” Answer, “… while traveling along paved roads” (as opposed to
cross country). And so on.
Eventually we obtain a fair idea of what the owner and user want the asset to do. Notice
how we arrived at this function statement. We didn't ask how fast we want to go. We
asked, “What distinguishes a car from feet?”, thus raising the requirements for both speed
and distance and any other distinct “car” functionality. We may continue in this manner
to ask about secondary functions175. For example, to the question of what the
environmental requirements are, someone might respond with the single word, “emissions”,
which is a noun, not a function. Adopting the form of a function statement, we could say,
“To emit less than (whatever the regulations of the locale) ppm NOx, CO, CO2, and so
on.” In many countries, vehicles that do not comply are taken off the road, making this
function a maintenance priority.
What about safety and structural integrity? These may be expressed in a function
statement as “To allow passenger cell to deform by X cm in a 30 kph head-on
collision.” Functions relating to control, containment, and comfort may be similarly
revealed. They might include “To vary speed between -20 and 140 mph” and “To isolate
the occupants from the elements”. When a function statement contains no quantitative
standard, it implies “absolute” (isolation in this case). The comfort associated function
“To enable operator to vary temperature between (whatever limits)” implies an air
conditioner. As we walk through these secondary functions, we learn about the importance
of consensus. The function having to do with appearance, “To look acceptable”, raises the
question, “acceptable to whom?”. This may be of vital importance in a given operating
context, but is often impossible to quantify. In such cases an understanding must be
reached between user/owner and maintainer. Protective functions are of singular importance
and were described in Chapter 3. (page 39).
174
This is a very important point. Understanding the requirements of the asset and agreement among
maintainers and users will ensure that the maintenance program preserves the right function.
175
A primary function is usually the reason that the owner purchased and installed the asset. Secondary
functions may include protective functions, environmental functions, appearance requirements, control and
containment functions, health and safety functions, economy and efficiency functions, structural and
superficial functions.
Page 205
Economy/efficiency functions might include “To consume < .010 l/km under (standard
urban cycle, steady speed 100 km, etc.) conditions”. Superfluous functions refer to
components that were installed for an original operating context but are no longer used
in the new one. Often the redundant equipment is judged more expensive to remove than to
leave in place, and it is left where it is. However, these functions may still fail (a
fact that is often overlooked), and thus still need to be documented in the RCM functional
discovery and subsequent analysis.
Hence, the RCM process begins with the first of the seven RCM questions: “What are
the asset’s functional performance requirements in its operating context?” The
validity of all that follows will depend upon the thoroughness with which the functions
are identified and analyzed. An item will have, typically, from 15 to 50 primary and
secondary functions176. The RCM team, using a structured methodology, discovers and
records each of the item’s functions. The process comprises the following activities:
1. Team members look closely at the asset under investigation, by examining its
drawings, schematics, photographs and even by conducting physical walkarounds.
Components suggest the functions that are to be recorded.
2. The team reviews, agrees upon, and documents the asset’s operating context (see
top area of the RCM Worksheet of Figure 14-2 on page 202).
3. The team members refer to all helpful documents and recall individual
experiences while listing the item’s functions.
4. Each function statement begins with the word “To” and is followed by a verb.
5. Each function statement specifies one or more quantitative performance
standards.
6. The team agrees upon and records the item’s actual performance requirements,
not its design specifications nor its installed capacity.
a. The team considers and records the requirements of the user, the owner,
and society at large.
7. Example of a function statement: To drill one hole in a work piece to a depth of
18 cm ± 0.001 mm in 15 seconds, of diameter 10 mm ± 0.001 mm whose center
deviates no more than .0001 mm, at an average rate of 3.5 holes per minute. Note
that we specify in the function statement, the quality requirements of accuracy
and consistency.
8. The team proceeds to identify and document all primary and secondary functions.
(Primary functions usually describe why the asset has been purchased.)
a. The group identifies and documents all secondary functions by
examining drawings and schematic diagrams, or even by walking around
the physical item.
b. The group ascertains that all secondary functions have been exposed by
reviewing the PEACHES mnemonic: Protective, Environmental,
Appearance, Control/containment/comfort, Health and safety,
Economy/efficiency, Structural integrity and superfluous177 functions.
176
If an item has more than this number of functions, one or more subcomponents should be “removed”
from the item and analyzed as separate items. This can be done easily at any time. See Appendix 3. “Sizing
the analysis” on page 276
9. The team devotes special care to hidden functions – a function whose failure will
go unnoticed under ordinary circumstances.
a. The team uses code phrases to imply that a function is hidden (e.g. to be
capable of, to be able to, …) or that it is protected by a hidden function
(e.g. to heat X liters of water to 140C in Y minutes, in the presence of a
standby heater.)
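Some of the phrasing rules in the activities above lend themselves to a mechanical first check before the team reviews a draft statement. The following is a minimal sketch only, not part of the RCM process as described: the function name, the dictionary keys, and the use of a digit as a proxy for a quantitative standard are all assumptions made for illustration.

```python
import re

# Code phrases that, per the text, signal a hidden function.
HIDDEN_PHRASES = ("to be capable of", "to be able to")


def check_function_statement(statement: str) -> dict:
    """Run simple form checks on one RCM function statement (hypothetical helper)."""
    text = statement.strip()
    words = text.split()
    return {
        # Activity 4: begins with "To" and has at least one word after it.
        # Whether that word is a verb is left to the team; nothing is parsed here.
        "starts_with_to": len(words) >= 2 and words[0] == "To",
        # Activity 5: a digit serves as a crude proxy for a quantitative standard.
        "has_quantity": re.search(r"\d", text) is not None,
        # Activity 9a: a code phrase marks the function as hidden.
        "flagged_hidden": text.lower().startswith(HIDDEN_PHRASES),
    }


car = check_function_statement(
    "To get a driver and up to 4 passengers from A to B at speeds up to 85 mph"
)
check_valve = check_function_statement(
    "To be capable of preventing loss of cabin pressure by backflow"
)
```

A statement that reports no quantity is not necessarily wrong, since a function statement without a quantitative standard implies an “absolute” requirement. The check merely prompts the team to confirm that the omission is deliberate.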
Example 1
The item under investigation is a passenger rail car truck (also known as a bogie). A
drawing of the truck is given in Figure 15-3. Its detailed description is given in Appendix
5. on page 280.
Figure 15-3: A passenger rail car truck

With the aid of the truck assembly drawing of Figure 15-3 and its knowledge and
experience with this asset, the RCM team identifies each of the truck’s functions.
Figure 15-4
177
Functions that were required at one time but currently are unused in the current process or product.
Page 207
The team begins by formulating a statement that captures the truck’s primary function.
The schematic of Figure 15-4 suggests the requirement of “support”, and indicates that
there are two trucks per car. We might, then, propose the function statement: “To support
half the weight of a rail car”. However, when we examine Figure 15-3, various internal
components suggest additional functions. For example, the wheel sets and their bearings
suggest a “rolling” requirement. The suspension components (dampers, air bag, torsion
bar) suggest a “smooth ride” requirement. RCM structured language style encourages us
to broaden the functional statement by including both these notions. We rewrite the
function statement as in Figure 15-5.
Figure 15-5
Notice that we added two quantitative performance specifications, “up to 26.5 T” and “up
to 120 kph” to the function statement. Experienced RCM analysts strive to compose
succinct yet descriptive and quantitative function statements. They try to include as many
functional elements as is practical in a single, clear, grammatically correct function
statement. Such attention to structured phrasing and economy of words keeps the size and
complexity of the entire analysis to manageable proportions.
As the RCM team examines the technical descriptions of Figure 19-6 through Figure
19-14 (in Appendix 5. page 280), it records the functions suggested by each component.
For example, the rubber chevrons of the primary suspension (Figure 19-9 page 283) and
the dampers and air bags of the secondary suspension (Figure 19-12 page 286) suggest
function statement 2 given in Figure 15-6.
Function | Functional failure | Failure mode | Failure effects
2 To insulate passengers from shocks caused by crossing rail joints, bumps and to minimize transient oscillations after crossing such bumps. | | |
Figure 15-6
The rubberized component, “traction link” of Figure 19-12 on page 285, suggests
function “3” of Figure 15-7.

Function | Functional failure | Failure mode | Failure effects
3 To insulate passengers from jerks during acceleration and braking | | |
Figure 15-7
The components “torsion bar” and “torsion bar turnbuckles” of Figure 19-6 on page 280
suggest function statement “4” of Figure 15-8.

Function | Functional failure | Failure mode | Failure effects
4 To control the roll angle of the car body relative to the truck | | |
Figure 15-8
In a similar manner, the RCM team examines the drawings and documentation on the
truck and lists the remaining functions in the worksheet as illustrated in Figure 15-9.
Components reviewed suggest the following functions:
• the “air bag” (Figure 19-12 page 286) suggests function “5”,
• the brake (Figure 19-6 page 280) suggests function “6”,
• the auxiliary spring (Figure 19-9 page 283) function “7”,
• the “towing points” (frame description on page 286) function “8”,
• the “axle rod” (Figure 19-9) function “9”,
• the “emergency spring” (Figure 19-12 page 286) function “10”,
• the lateral damper and lateral stop components (Figure 19-11 page 285) function
“11”, and
• the “split pin” (Figure 19-9) function “12”.

Function | Functional failure | Failure mode | Failure effects
5 To ensure that the carriage floor is level with the platforms when train stops at a station | | |
6 To assist in stopping the train at up to 0.88 m/s² | | |
7 To prevent direct contact between axle box and truck frame under severe bounce conditions | | |
8 To permit the truck to be lifted and/or the car to be towed easily | | |
9 To ensure that wheel sets remain attached to truck while truck is being lifted | | |
10 To insulate the car from shocks to some extent if the air bag fails | | |
11 To limit lateral movement of car relative to truck | | |
12 To prevent traction link retaining nut from coming undone | | |
Page 209
Figure 15-9: Rail car truck functions 5 to 12
When examining the drawings on pages 280 to 286, the functions listed in Figure 15-9
will not be obvious to those unfamiliar with the item under analysis. This fact
underscores the importance of selecting RCM team members who have used and
maintained the asset over a number of years. For this reason, we discourage the use of
outside consultants to perform RCM analysis.
Example 2
Figure 15-10: The air-conditioning pack in the Douglas DC-10. The location of the three packs in the
nose-wheel compartment is indicated at the upper right. (Based on Airesearch maintenance
materials)
The air-conditioning pack depicted in Figure 15-10 is the cooling portion of the Douglas
DC-10 air-conditioning system. This subsystem was classified as significant during the
first review of the DC-10 systems because of its size, complexity, and cost. There are
three independent installations of this system, located in the unpressurized nose-wheel
side compartment of the airplane (see top right of Figure 15-10). Hot high-pressure air,
which has been bled from the compressor section of the engine, enters the pack through a
flow-control valve and is cooled and dehumidified by a heat exchanger and the turbine of
an air-cycle refrigeration machine. The cool air is then directed through a distribution
duct to a manifold in the pressurized area of the airplane, where it is mixed with hot trim
air and distributed to the various compartments. The performance of each pack is
controlled by a pack temperature controller. Each pack is also monitored by cockpit
instrumentation and can be controlled manually if there is trouble with the automatic
control system.
The pack itself consists of the heat exchanger, the air cycle machine (which has air
bearings), an anti-ice valve, a water separator, and a check valve at the pressure
bulkhead to prevent backflow and cabin depressurization if there is a duct failure in the
unpressurized area. The duct is treated as part of the distribution system; similarly the
flow-control valve through which air enters the pack is part of the pneumatic system.
The pack temperature controller is part of a complex temperature-control system and is
also not analyzed as part of the air-conditioning pack.
Item description: Pack delivers temperature-controlled air to conditioned-air distribution
ducts of airplane. Major assemblies are heat exchanger, air-cycle machine, anti-ice valve,
water separator, and bulkhead check valve.
Redundancies and protective features (include instrumentation): The three packs are
completely independent. Each pack has a check valve to prevent loss of cabin pressure in
case of duct failure in unpressurized nose-wheel compartment. Flow to each pack is
modulated by a flow-control valve which provides automatic over-temperature protection
backed up by an over-temperature trip off. Full cockpit instrumentation for each pack
includes indicators for pack flow, turbine inlet temperature, pack-temperature valve
position, and pack discharge temperature.
Reliability data:
Built-in test equipment (described): none
Can aircraft be dispatched with item inoperative? If so list any limitations which must be
observed: Yes. No operating restrictions with one pack inoperative.
Hidden functions: Yes
Functions | Functional failures | Failure modes | Failure effects
1 To supply air to conditioned air distribution ducts at the temperature called for by pack temperature controller | | |
2 To be capable of preventing loss of cabin pressure by backflow if the duct fails in unpressurized nose-wheel compartment | | |
Figure 15-11: Worksheet for air conditioning pack with its operating context area completed and the
primary function listed
The top sections of the RCM worksheet of Figure 15-11 record the item’s operating
context. In an industrial environment, operating context will often include such details
as shift arrangements, plant location, customer service requirements, market conditions,
seasonal effects, and so on: anything that sheds light on the asset’s special operating
conditions, requirements, and restrictions. The operating context will greatly assist the
team when it answers question 5, “What are the consequences?”. Note the phrase “To be
capable of …” in function statement 2 of Figure 15-11. This code phrase alerts us to the
fact that the function is hidden. That is, under ordinary circumstances, as long as the
duct is intact, no member of the operating crew will be aware that the protective
backflow function has failed. Once again, hidden functions are often difficult, if not
impossible, for those unfamiliar with the asset to discern, which again emphasizes the
importance of choosing RCM analysis team members from among the most experienced
maintenance and operational staff.
Example 3
Item description: Distributed control system (DCS)
Redundancies and protective features (include instrumentation):
Built-in test equipment (describe):
Operating context: Continuous process. Unionized. 500 employees. See business plan. Biggest product
Ethylene. Can also produce gasoline. Two lines: 1. Material flow 2. Olefins. Raw material safely stored at
high pressure (6000 MPa) in storage underground caverns. It is pipelined to production facilities. Ethylene
converted to polyethylene. There is a "hot side" and "cold side". Raw material undergoes cracking
(breaking carbon chains) and becomes ethylene. The plant extends over several acres (a square kilometer).
The DCS (distributed control system) is integral to the entire production line. There are 3 different types of
DCS. Recently there has been a benzene spill. Environmental excursions occur occasionally. Installed in
1996. Capital expenditures have been curtailed recently. Individual heaters can be shut down for
maintenance.
Hidden functions: Yes, UPS
Functions | Functional failures | Failure modes | Failure effects
To provide safe, secure, uninterrupted, redundant, cost effective, continuous process control and monitoring according to the target product of the day, within the parameters specified by product specification and by current environmental regulations, in the presence of a UPS (uninterruptible power supply) | | |
To alarm on abnormal conditions in the process in real time | | |
To allow manual intervention | | |
To interface with other control systems | | |
To graphically present the process to the operators | | |
To exchange data with other control systems | | |
To capture historical data | | |
To provide the means to alter control logic | | |
To backup/restore configuration data | | |
To execute batch recipes within the continuous process, for example cleaning cycles | | |
To provide safe shutdown in the event of a hardware failure. | | |
To alert the operator, in real time, when some part of the DCS hardware or a field device fails. | | |
To be immune from physical, electromagnetic, electronic, environmental intrusion | | |
To be ergonomic | | |
To conform to NEMA standards | | |
Example 4
In listing the functions of an item, the RCM team thinks about each component of the
item. One of the functions of a tire tread (e.g. on airplanes or haul trucks) is to provide a
renewable surface that protects the carcass of the tire so that it can be retreaded. This
function is not the most obvious one, and it might well be overlooked in listing the tire
functions; nevertheless, it is important from an economic standpoint. Repeated use of the
tire wears away the tread, and if wear continues to the point at which the carcass cannot
be retreaded, a functional failure has occurred. Although we are focusing on the item’s
functions, thinking about the failures experienced by an item, for example the retreading
failure described in Figure 15-12, assists us in the function discovery process.
[Figure: depth of remaining tread versus exposure (number of landings), with scheduled
inspections 1 to 5. Labels: potential failure, potential failure observed, functional
failure, retread, resistance restored.]
Figure 15-12: The use of potential failures to prevent functional failures.

When tread depth reaches the potential-failure stage, the tire is removed and retreaded
(recapped). This process restores the original tread, and hence the original failure
resistance, so that the tire never reaches the functional-failure stage. Function 1 of Figure
15-13 states this requirement in RCM form.

1 To provide a renewable surface that protects the carcass of the tire so that it can be retreaded
Figure 15-13
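The inspection logic of Figure 15-12 can be sketched in a few lines. This is an illustrative sketch only: the function name, the tread depths, and both threshold values are hypothetical and are not taken from the source.

```python
def first_actionable_inspection(depths_mm, potential_failure_mm, functional_failure_mm):
    """Return (inspection number, action) for the first inspection that needs action.

    depths_mm holds the tread depth measured at each scheduled inspection, in order.
    All depths and thresholds used with this function are hypothetical values.
    """
    for number, depth in enumerate(depths_mm, start=1):
        if depth <= functional_failure_mm:
            # The carcass can no longer be retreaded: the function has been lost.
            return number, "functional failure"
        if depth <= potential_failure_mm:
            # Potential failure observed: remove the tire and retread it.
            return number, "retread"
    return None, "continue in service"


# Five scheduled inspections, as in Figure 15-12 (depths invented for illustration).
readings = [10.0, 8.0, 6.0, 3.5, 2.0]
result = first_actionable_inspection(readings, potential_failure_mm=4.0,
                                     functional_failure_mm=1.5)
```

With these invented readings the potential failure is observed at the fourth inspection, so the tire is retreaded before it can reach the functional-failure depth.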
Question 2 – Failure Analysis
The process
“Functional failure” describes the way in which an asset will fail to perform one of its
functions. We examine each function that we exposed in the preceding functional
analysis. We consider all of the ways that the function can fail. The following points must
be accounted for in answering the question “In what ways can the function fail?”:
1. List each way in which the item can fail to meet each performance requirement
that has been explicitly stated, or implied, in the function statement.
2. Take special care to distinguish between partial and complete failures because
they usually have different causes. For example, “unable to pump at all”, and
“unable to deliver the required 800 lpm” are distinct failures having different
causes.
3. Only functional failures (those that have consequences) are listed in this step.
(Potential failures that preempt functional failures are analyzed and described
when answering question 4 “What are the failure effects?”).
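The distinction between complete and partial failure in point 2 can be made concrete with the pump example. The sketch below is an assumption-laden illustration: the function name and the returned wording are hypothetical, and only the 800 lpm requirement comes from the text.

```python
def classify_pump_state(delivered_lpm: float, required_lpm: float = 800.0) -> str:
    """Classify a pump against its stated performance requirement (hypothetical helper)."""
    if delivered_lpm <= 0:
        # Complete failure: no flow at all.
        return "complete failure: unable to pump at all"
    if delivered_lpm < required_lpm:
        # Partial failure: some flow, but below the required standard.
        return "partial failure: unable to deliver the required flow"
    return "function met"


states = [classify_pump_state(flow) for flow in (0.0, 650.0, 800.0)]
```

The two failed states are listed separately on the worksheet because, as noted above, they usually have different causes.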
Example 1
Ctrl. No. | Function Statements (Quantitative Performance Requirements) | Failed States (Ways Performance is Lost) | Failure Causes
1 | To provide smooth rolling support for half the weight of a passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph | Fails to provide support |
5 | | Unable to support the car on the rails at 120 kph |
16 | | Fails to provide rolling support |
21 | | Fails to provide a smooth ride |
Figure 15-14: Listing the failed states for a primary function of the truck.
In Figure 15-14 the RCM team has identified 4 ways in which the function, “To provide
smooth rolling support for half the weight of a passenger car (up to 26.5 tons) on the rails at
speeds up to 120 kph” may be lost or compromised. Here we begin to recognize the value
of having “loaded” our function statement with multiple functional elements. Note how
each functional failure (failed state) addresses one of those functional elements.
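One way to keep track of this correspondence is to record the functional elements of a statement and check that each has at least one failed state. The sketch below is a hypothetical illustration: the class, its method names, and the element labels are assumptions, while the statement and the four failed states are those of Figure 15-14.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionStatement:
    text: str
    elements: list[str]  # functional elements packed into the statement
    failed_states: dict[str, list[str]] = field(default_factory=dict)

    def add_failed_state(self, element: str, description: str) -> None:
        """Record a failed state against one functional element."""
        if element not in self.elements:
            raise ValueError(f"unknown functional element: {element}")
        self.failed_states.setdefault(element, []).append(description)

    def uncovered_elements(self) -> list[str]:
        """List the elements that have no failed state recorded yet."""
        return [e for e in self.elements if e not in self.failed_states]


truck = FunctionStatement(
    text=("To provide smooth rolling support for half the weight of a passenger "
          "car (up to 26.5 tons) on the rails at speeds up to 120 kph"),
    # Element labels are an assumed reading of the statement, for illustration.
    elements=["support", "speed on the rails", "rolling", "smooth ride"],
)
truck.add_failed_state("support", "Fails to provide support")
truck.add_failed_state("speed on the rails",
                       "Unable to support the car on the rails at 120 kph")
truck.add_failed_state("rolling", "Fails to provide rolling support")
remaining_before = truck.uncovered_elements()
truck.add_failed_state("smooth ride", "Fails to provide a smooth ride")
remaining_after = truck.uncovered_elements()
```

An empty result from `uncovered_elements()` suggests that every element of the loaded function statement has been addressed by at least one failed state.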
Page 214
Example 2
Functions | Functional failures
1 To supply air to conditioned air distribution ducts at the temperature called for by pack temperature controller | A Conditioned air is not supplied at called-for temperature
2 To be capable of preventing loss of cabin pressure by backflow if the duct fails in unpressurized nose-wheel compartment | A No protection against backflow
Figure 15-15: Listing the failed states for a function of the air conditioning pack.
Example 3
Item description: Distributed control system (DCS)
Functions | Functional failures | Failure modes
To provide safe, secure, uninterrupted, redundant, cost effective, continuous process control and monitoring according to the target product of the day, within the parameters specified by product specification and by current environmental regulations, in the presence of a UPS (uninterruptible power supply) | Fails to provide security | Unauthorized usage of console either when unattended or if password stolen
 | Unable to log in | Password forgotten
 | Unable to protect against loss of control | UPS has failed
 | Control lost | Complete loss of communication with ring
 | | Complete loss of communication with controller node
 | | All consoles fail
 | | Complete loss of communication on module bus
 | | Complete loss of communication on slave bus
 | | Console LAN fails
 | Redundancy lost | Console hardware or software fails
 | | Controller hardware or software fails
 | | Communication hardware or software fails
 | | Power supply fails
 | | IO card fails
Question 3 – Failure modes analysis
The process
The third step, listing the reasonably likely failure modes, answers the third RCM
question: “What causes the failure?”. It is particularly important in this step that we
keep two objectives of the RCM process in mind. They are:
1. Always gain team consensus for each analysis result, and
2. Neither bog down the analysis in excessive detail, nor overlook important details
by being too superficial.
The failure modes analysis step is particularly difficult, prone to error, and liable to
waste precious time for two reasons:
1) Deciding which failure modes are important enough to be listed is a judgment
(subjective) call, and
2) Deciding how deeply to drill down the causality chain is also a judgment
(subjective) call.
These two problems, if not carefully handled by the RCM team and the RCM facilitator,
can bog down the analysis or jeopardize its quality. Too much detail is likely to stall
progress, while a superficial analysis can be costly and dangerous.
In deciding which failure modes to list and which to reject, the facilitator urges the team
to keep the operating context in mind. In contexts where the consequences of failure are
severe the group will agree to list certain failure modes that they would not bother to
include, were the consequences less harsh. It is vital that each member give serious
consideration to the failure mode, and that, collectively, the group balances likelihood
and consequences in deciding whether to include it.
For example, suppose the failure mode “Pump damaged by flying object” were raised in
the course of an RCM session. The RCM team will consider the likelihood and
consequences of failure. In most operating contexts this failure mode would be excluded
from analysis. However, if the pump were operating in a nuclear facility that happened to
be on the path of a busy airplane flight corridor, the team could reasonably decide to
include it. Since operating contexts vary, no template or hard-and-fast rules can dictate
the level of detail (i.e. how many failure modes to include) needed in a given operating
context.
It may happen that an irresolvable difference in opinion emerges among the team
members as to whether or not to include a particular failure mode. In general, the group
decides to err on the side of conservatism. Under no circumstances should this or any
other RCM decision be put to a vote. That would defeat one of the goals of RCM –
ownership of the decisions by the people that they impact. Should the team be unable to
arrive at consensus, the facilitator notes and records the dissenting opinion.
The second difficult question, “How deeply to drill down the causality chain”, if not
carefully considered, will affect the quality and efficiency of the RCM process. Selecting
the failure mode causality depth requires particular vigilance by the RCM facilitator and
the team. The short answer to the question is, “to the level at which the organization can
deal, in a practical way, with the cause of failure”. Figure 15-16 illustrates the almost
limitless choices for selecting causality depth.
Each level of indentation answers one further “Why?” (eight levels in all):

Ventilation system fails
    Fan fails
        Motor fails
            Motor trips
                Airways clogged with dirt
                    Inadequate design
                Defective sensor
            Bearing seized
                Lubricant allowed to run dry
                Wrong lubricant
                    Improperly labeled
                        Stores error
                    Label misread
                        Inattention
                        Insufficient training
        Power drive fails
            Belts failed
                Incorrectly installed
                    Insufficient training
                        Employee turnover
                            Poor working conditions
                    Missing documentation
                        Inadequate document control
                            …
                    Inadequate tools
                        …
                Incorrectly specified
                    …
    Distribution system fails
        Duct fails
            Duct clogged
                …
            Duct pierced
                …
            Damper failed
                …
    …
Figure 15-16: Multiple levels of failure causes
By keeping the operating context in mind at all times as well as the organization’s
practical capabilities to deal with the failure cause, the team members arrive at consensus
in selecting the most appropriate causality depth at which to list the failure mode. The
secret of selecting the right level is knowing when to stop asking “Why?”. For example,
certain organizations contract out the maintenance of all motors and a response time has
been agreed upon. The RCM team may decide (depending on the frequency of
occurrence and the consequences) to select the appropriate depth at which to record the
failure mode as the fourth or fifth “why” of Figure 15-16.
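The idea of stopping at a chosen “why” can be sketched in code. The following is a minimal illustration only, not part of any RCM tool: the `Cause` class and the `failure_modes_at_depth` function are hypothetical names, and the tree is a fragment of our reading of Figure 15-16. The depth itself remains a team judgment; the code merely shows what the recorded failure modes would be once a depth has been agreed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    """One link in a causality chain; `causes` answers the next "Why?"."""
    description: str
    causes: List["Cause"] = field(default_factory=list)


def failure_modes_at_depth(node: Cause, depth: int) -> List[str]:
    """List the causes found `depth` "whys" below `node`.

    A branch that runs out before `depth` contributes its deepest known cause.
    """
    if depth == 0 or not node.causes:
        return [node.description]
    modes: List[str] = []
    for cause in node.causes:
        modes.extend(failure_modes_at_depth(cause, depth - 1))
    return modes


# Fragment of Figure 15-16 (illustrative; not the complete figure)
tree = Cause("Ventilation system fails", [
    Cause("Fan fails", [
        Cause("Motor fails", [
            Cause("Motor trips", [
                Cause("Airways clogged with dirt"),
                Cause("Defective sensor"),
            ]),
            Cause("Bearing seized", [
                Cause("Lubricant allowed to run dry"),
                Cause("Wrong lubricant"),
            ]),
        ]),
        Cause("Power drive fails", [Cause("Belts failed")]),
    ]),
])

# An organization that contracts out motor maintenance might stop here:
print(failure_modes_at_depth(tree, 2))
# One that repairs its own motors might ask "Why?" once more:
print(failure_modes_at_depth(tree, 3))
```

Stopping two levels below the root records “Motor fails” and “Power drive fails”; one level deeper records “Motor trips”, “Bearing seized” and “Belts failed”.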
Example 1
The RCM team analyzing the truck has recorded the failure modes as given in Figure
15-17 through Figure 15-20.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide support
Failure modes:
    Weld in frame fails due to fatigue
    Wheel collapses due to fatigue
    Axle fails due to fatigue
    Truck frame component fails due to fatigue
Figure 15-17: Causes of functional failure "Fails to provide support"

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure modes:
    Differential wear of steel treads on the same axle
    Spalling on wheel tread
    Wheel flange shears off
    Chevron rubber shears
    Tie bar rod axle rod slackens off
    Chevron rubber settles
    Chevron rubber elastically yields
    Traction link bolt comes adrift
    Traction link falls off due to fatigue
Figure 15-18: Causes of functional failure "Unable to support the car on the rails at 120 kph"
Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide rolling support
Failure modes:
    Bearing collapses due to fatigue failure of cage, rollers, spacer or inner or outer race
    Bearing collapses due to excessive clearance in housing
    Bearing collapses due to bumpy rails
    Bearing fails due to under lubrication
    Plug falls out of axle box cover
    Bearing fails due to over lubrication
    Moisture in lubricant causes bearing to fail
Figure 15-19: Causes of functional failure "Fails to provide rolling support"

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide a smooth ride
Failure modes:
    Flats worn on wheel tread
Figure 15-20: Causes of functional failure "Fails to provide a smooth ride"
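The worksheets of Figure 15-17 through Figure 15-20 share one structure: a function, its functional failures, and the failure modes under each. A minimal sketch of that hierarchy follows, assuming a plain nested mapping is adequate. The names `Worksheet`, `FUNCTION_1` and `rows` are illustrative only and do not come from any CMMS or RCM package, and only two of the four functional failures are entered.

```python
from typing import Dict, Iterator, List, Tuple

# function statement -> functional failure -> list of failure modes
Worksheet = Dict[str, Dict[str, List[str]]]

FUNCTION_1 = (
    "To provide smooth rolling support for half the weight of a passenger car "
    "(up to 26.5 tons) on the rails at speeds up to 120 kph"
)

worksheet: Worksheet = {
    FUNCTION_1: {
        # Figure 15-17
        "Fails to provide support": [
            "Weld in frame fails due to fatigue",
            "Wheel collapses due to fatigue",
            "Axle fails due to fatigue",
            "Truck frame component fails due to fatigue",
        ],
        # Figure 15-20
        "Fails to provide a smooth ride": [
            "Flats worn on wheel tread",
        ],
    }
}


def rows(ws: Worksheet) -> Iterator[Tuple[str, str, str]]:
    """Flatten the hierarchy into (function, functional failure, failure mode) rows."""
    for function, failures in ws.items():
        for functional_failure, modes in failures.items():
            for mode in modes:
                yield function, functional_failure, mode


for _, functional_failure, mode in rows(worksheet):
    print(f"{functional_failure}: {mode}")
```

Flattening the hierarchy this way yields one row per failure mode, which is the form the effects worksheet takes under Question 4.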
Example 2
Function: 1. To supply air to conditioned air distribution ducts at the temperature called for
by the pack temperature controller
Functional failure: A. Conditioned air is not supplied at called-for temperature
Failure modes:
    Air-cycle machine seized
    Ram-air passages in heat exchanger blocked
    Anti-ice valve fails
    Water separator fails

Function: 2. To prevent loss of cabin pressure by backflow if the duct fails in the
unpressurized nose-wheel compartment
Functional failure: A. No protection against backflow
Failure modes:
    Bulkhead check valve fails
Figure 15-21: Failure mode analysis of the air conditioning pack
Note the causality levels and the level of detail (how many failure modes are listed) in
Figure 15-21. For the failure mode “air-cycle machine seized”, for example, the team
stopped at the level of the air cycle machine and looked no deeper. This was a balanced
judgment that weighed the consequences of failure against the frequency of occurrence of
this particular failure mode. Once again, no “template solution” will substitute for due
consideration by a team of knowledgeable, involved persons.
Example 3
Function: To provide safe, secure, uninterrupted, redundant, cost effective, continuous
process control and monitoring according to the target product of the day, within the
parameters specified by product specification and by current environmental regulations, in
the presence of a UPS (uninterruptible power supply)

Functional failure: Fails to provide security
Failure mode: Unauthorized usage of console either when unattended or if password stolen

Functional failure: Unable to log in

Functional failure: Unable to protect against loss of control

Functional failure: Control lost

Functional failure: Redundancy lost
Question 4 – Effects analysis
The process
The team records the entire relevant scenario surrounding the failure mode under
consideration. The text should answer all of the following questions:
A. What sequence of events (internally and organization wide) could be touched off
by the failure mode?
B. How does the failure make itself known? What observable events lead up to the
failure?
C. How is safety or the environment impacted? (do not mention the words "safety"
or "environment")
D. How is production impacted? (quality, cost, customer service)
E. Is there any additional damage caused by the failure?
F. How long will it take and what actions must be accomplished to correct the
failure?
G. How does the likelihood of this failure depend on deeper causes? Has it happened
before? How often? Under what circumstances?
Descriptive answers to questions A through G will enable the team to respond,
subsequently, to RCM question 5 “What are the consequences?” In question B we are
anticipating RCM question 6, where we will need to decide whether some CBM task may
be appropriate.
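One way to keep an effects description complete is to hold the answers to questions A through G in separate fields and check which remain blank. The sketch below is only an illustration under that assumption. The `EffectsRecord` class, its field names, and the `unanswered` helper are hypothetical, and the assignment of sentences to fields in the usage example is our own reading of the truck frame weld example.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class EffectsRecord:
    """Free-text answers to effects questions A through G (hypothetical field names)."""
    sequence_of_events: str = ""              # A
    evidence_of_failure: str = ""             # B
    safety_environment_impact: str = ""       # C
    production_impact: str = ""               # D
    secondary_damage: str = ""                # E
    corrective_action_and_downtime: str = ""  # F
    likelihood_and_history: str = ""          # G


def unanswered(record: EffectsRecord) -> List[str]:
    """Return the names of the questions the team has not yet answered."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


weld_fatigue = EffectsRecord(
    sequence_of_events="The truck as a whole collapses, most likely when the car "
                       "is full of passengers and going round a corner.",
    evidence_of_failure="A crack longer than 100 mm, found during other inspections.",
    safety_environment_impact="The train would almost certainly be derailed.",
    corrective_action_and_downtime="Replace the truck; downtime 16 hours.",
)

print(unanswered(weld_fatigue))
```

A facilitator could run such a check before the team moves on to question 5. The list it returns is a prompt for discussion and does not judge the quality of the answers.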
Example 1
Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide support
Failure mode: Weld in frame fails due to fatigue
Effects: The truck as a whole collapses. This is most likely to occur when the car is most
heavily loaded - in other words when it is full of passengers, and probably while the train
is going round a corner. As a result, it would almost certainly be derailed. At present, the
truck is replaced when a crack longer than 100 mm is found. (Such a crack would be found
during the course of other inspections that occur often enough to detect it.) Downtime to
replace truck on its own 16 hours.
Figure 15-22: Identifying potential failures
Note that the description of the effects anticipates question 6. It describes how the
functional failure develops, defines the potential failure (Figure 15-22) as a crack length
of 100 mm, and states the likelihood that the potential failure would be found
opportunistically during the course of other inspections.
Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide support
Failure mode: Wheel collapses due to fatigue
Effects: The truck as a whole collapses. This is most likely to occur when the car is most
heavily loaded - in other words when it is full of passengers, and probably while the train
is going round a corner. As a result, it would almost certainly be derailed. Only one
cracked wheel has been found to date. It takes 8 hours to replace a wheel.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide support
Failure mode: Axle fails due to fatigue
Effects: The truck as a whole collapses. This is most likely to occur when the car is most
heavily loaded - in other words when it is full of passengers, and probably while the train
is going round a corner. As a result, it would almost certainly be derailed. No axles have
failed so far.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide support
Failure mode: Truck frame component fails due to fatigue
Effects: Initial cracking is likely to lead to frame distortion, which could make the truck
unstable enough to derail the train. As before, this is most likely to happen when heavily
loaded - in other words, when it is full of passengers, and probably while the train is going
round a corner. So far, the only frame component which has shown signs of failing has been
the transom, which cracked and has since been reinforced with a steel plate. Downtime to
replace a truck is 16 hours.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Differential wear of steel treads on the same axle
Effects: If the difference between wheel diameters is greater than 2 mm, the possibility of
derailment at speeds near 120 kph increases. Downtime to re-profile a pair of wheels is 3
hours.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Spalling on wheel tread
Effects: This could lead to differential wear. If the difference between wheel diameters is
greater than 2 mm, the possibility of derailment at speeds near 120 kph increases. Downtime
to re-profile a pair of wheels is 3 hours.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Wheel flange shears off
Effects: This failure is only likely to happen to a flange which has been weakened by
excessive wear. It is most likely to happen on a heavily loaded train going round a corner at
high speed, which would almost certainly lead to a derailment. Downtime to replace a set of
wheels 3 hours.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Chevron rubber shears
Effects: Truck frame rests directly on the axle box bump stop. Wheel loading is unevenly
distributed and wheels are prevented from moving off-axis during curving - both of these
conditions may cause derailment under adverse conditions of load and speed. Downtime to
replace the chevron rubber about 16 hours. (The clearance between the bump stop and the
truck frame should be 30 +1/-0 mm.)
Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Tie bar rod axle rod slackens off
Effects: Wheel arch could distort and chevron rubber could shear. Truck frame rests directly
on the axle box bump stop. Wheel loading is unevenly distributed and wheels are prevented
from moving off-axis during curving - both of these conditions may cause derailment under
adverse conditions of load and speed. Time to tighten axle rod nut in Depot 15 minutes.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Chevron rubber settles
Effects: Settling could cause excessive contact between vertical bump stop and wheel arch.
This would restrict wheel set movement during curving, and could cause derailment under
severely adverse conditions of load and speed. Clearance should be 30 +1/-0 mm. Time to
replace chevron rubber 4 hours. See also function 2.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Chevron rubber elastically yields
Effects: Settling could cause excessive contact between vertical bump stop and wheel arch.
This would restrict wheel set movement during curving, and could cause derailment under
severely adverse conditions of load and speed. Clearance should be 30 +1/-0 mm. Time to
replace chevron rubber 4 hours. See also function 2.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Traction link bolt comes adrift
Effects: The traction link falls off at one end, so the traction center is connected to the
truck by only one link. Asymmetric load on the remaining link damages the bushes,
interfering with ride comfort and possibly twisting the link mounting plates. This in turn
causes the second traction link to shear off, which would mean that the truck is only
connected to the car by the air bags. A twisted mounting could also restrict truck movement
during curving, which may lead to derailment under adverse conditions of load and speed.
One end of the traction link could also hit the ground in such a way that the truck frame or
traction center has to vault over it, causing a spectacularly nasty derailment. Time to
replace a traction link bolt two hours. (Note that the nuts on the traction link bolts are
held in place by split pins, which means that this failure should not occur if the split pin
is in place - see also function 11.)

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Unable to support the car on the rails at 120 kph
Failure mode: Traction link falls off due to fatigue
Effects: The traction link falls off at one end, so the traction center is connected to the
truck by only one link. Asymmetric load on the remaining link damages the bushes,
interfering with ride comfort and possibly twisting the link mounting plates. This in turn
causes the second traction link to shear off, which would mean that the truck is only
connected to the car by the air bags. A twisted mounting could also restrict truck movement
during curving, which may lead to derailment under adverse conditions of load and speed.
One end of the traction link could also hit the ground in such a way that the truck frame or
traction center has to vault over it, causing a spectacularly nasty derailment. Time to
replace a traction link five hours.
Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide rolling support
Failure mode: Bearing collapses due to fatigue failure of cage, rollers, spacer or inner or
outer race
Effects: Collapsed bearing causes a "hot box", and train must stop at the next station to
evacuate passengers which causes a traffic delay of 20-60 minutes. It is also possible that a
failed bearing could cause a derailment. The hot box melts the chevron causing it to emit
smoke. The chevron also collapses, damaging the tie-bar and axle. Time to replace a wheel
set complete with bearing and axle box 8 hours.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide rolling support
Failure mode: Bearing collapses due to excessive clearance in housing
Effects: If the axle box liner bore exceeds the bearing outer race external diameter by more
than 0.6 mm, relative movement between the liner and outer race causes excessive vibration
and collapse of the bearing. This causes a hot box, and train must stop at the next station
to evacuate passengers which causes a traffic delay of 20-60 minutes. It is also possible
that a failed bearing could cause a derailment. The hot box melts the chevron causing it to
emit smoke. The chevron also collapses, damaging the tie-bar and axle. Time to replace a
wheel set complete with bearing and axle box 8 hours.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide rolling support
Failure mode: Bearing collapses due to bumpy rails
Effects: Excessive interaction between railhead and wheel sets applies shock loads to
bearings, leading to either fracture of bearing components or accelerated fatigue failure.
This causes a hot box, and train must stop at the next station to evacuate passengers which
causes a traffic delay of 20-60 minutes. It is also possible that a failed bearing could
cause a derailment. The hot box melts the chevron causing it to emit smoke. The chevron
also collapses, damaging the tie-bar and axle. Time to replace a wheel set complete with
bearing and axle box 8 hours. Rails to be analyzed separately.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide rolling support
Failure mode: Bearing fails due to under lubrication
Effects: Seized bearing causes a hot box, and train must stop at the next station to
evacuate passengers which causes a traffic delay of 20-60 minutes. It is also possible that a
failed bearing could cause a derailment. The hot box melts the chevron causing it to emit
smoke. The chevron also collapses, damaging the tie-bar and axle. Time to grease an axle
box 30 minutes.
Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide rolling support
Failure mode: Plug falls out of axle box cover
Effects: Lubricant drains out, causing bearing to seize resulting in a hot box. Train must
stop at the next station to evacuate passengers which causes a traffic delay of 20-60
minutes. It is also possible that a failed bearing could cause a derailment. The hot box
melts the chevron causing it to emit smoke. The chevron also collapses, damaging the tie-bar
and axle. Wheel set would be replaced if plug was found to be missing. Time required to do
so 8 hours.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide rolling support
Failure mode: Bearing fails due to over lubrication
Effects: Over-lubrication leads to excessive churning and eventual breakdown of lubricant,
causing bearing to seize resulting in a hot box. Train must stop at the next station to
evacuate passengers which causes a traffic delay of 20-60 minutes. It is also possible that a
failed bearing could cause a derailment. The hot box melts the chevron causing it to emit
smoke. The chevron also collapses, damaging the tie-bar and axle. It is felt that this
failure is unlikely to occur because the amount of lubricant is controlled.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide rolling support
Failure mode: Moisture in lubricant causes bearing to fail
Effects: Moisture in lubricant reduces its lubricating effectiveness and may also cause the
bearing to corrode, in both cases leading to bearing failure resulting in a hot box. Train
must stop at the next station to evacuate passengers which causes a traffic delay of 20-60
minutes. It is also possible that a failed bearing could cause a derailment. The hot box
melts the chevron causing it to emit smoke. The chevron also collapses, damaging the tie-bar
and axle. Time to replace wheel set is 8 hours.

Function: 1. To provide smooth rolling support for half the weight of a passenger car (up to
26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide a smooth ride
Failure mode: Flats worn on wheel tread
Effects: A wheel flat longer than 40 mm is likely to affect ride comfort. It will also damage
the railhead. The noise and vibration caused by a flat wheel tread are usually detected
quickly by Operations. Time to re-profile a wheel set on the under floor lathe is 3 hours.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Air bag leaks via top plate of car bolster faster than it can be pumped in
Effects: Air bag deflates, so forces are transmitted between truck and car through the layer
and emergency springs only. This causes a sharper ride, but train does not have to be
withdrawn from service immediately. Time to replace air bag 8 hours. See also function 5.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Steel wire inside airbag fails
Effects: Air bag fabric cannot contain the air pressure on its own, so bag bursts causing
forces to be transmitted through layer and emergency springs only. This causes a sharper
ride, but train does not have to be withdrawn from service immediately. Time to replace air
bag 8 hours. See also 44 and 45.
Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Chevron spring rubber settles
Effects: Reduced clearance causes more frequent contact between vertical bump stop and wheel
arch over bumps. This reduces ride quality and increases stresses on all truck components.
See also 10 above. Time to replace chevron 8 hours.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Chevron elastically yields
Effects: Reduced clearance causes more frequent contact between vertical bump stop and wheel
arch over bumps. This reduces ride quality and increases stresses on all truck components.
See also 11 above. Time to replace chevron 8 hours.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Damper non-return valve fails in closed position
Effects: Damper "seizes" and transmits shocks directly from truck frame to underside of car
(in the case of the vertical damper) or to traction center (in the case of the horizontal
damper). This reduces ride quality and increases stresses on all truck components. Time to
replace a defective damper in Depot 1 hour.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Damper oil viscosity increased by dirt or oxidation
Effects: Damper becomes steadily stiffer until it eventually seizes altogether, transmitting
shocks directly from truck frame to underside of car (in the case of the vertical damper) or
to traction center (in the case of the horizontal damper). This reduces ride quality and
increases stresses on all truck components. Time to replace a defective damper in Depot 1
hour.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Excessive metal-to-metal contact between damper piston and cylinder
Effects: Damper becomes steadily stiffer until it eventually seizes altogether, transmitting
shocks directly from truck frame to underside of car (in the case of the vertical damper) or
to traction center (in the case of the horizontal damper). This reduces ride quality and
increases stresses on all truck components. Time to replace a defective damper in Depot 1
hour.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Layer spring stiffness decreases
Effects: Serious loss of stiffness means that secondary suspension is provided by the air
bag only. This reduces ride comfort and increases shock loads especially on the air bag
itself. Time to replace layer spring at Depot 8 hours. See also 45.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to insulate passengers adequately
Failure mode: Air bag, layer spring, and emergency spring all fail
Effects: Car has no secondary suspension at all, so all shocks which pass through the
primary suspension are transmitted directly to the car. Ride becomes very rough and stresses
on local truck components are severely increased. Replacement of the three suspension
components takes 8 hours at the Depot.
Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to minimize oscillations
Failure mode: Oil leaks out of damper seals (vertical or horizontal damper)
Effects: In the case of the vertical damper, full damping capability would have to be
provided by the damper opposite, which might not be able to cope and hence which might also
fail rapidly itself. Even if the opposite damper did not fail, damping efficiency is impaired
so oscillations are not effectively damped, which could cause discomfort on longer journeys.
There is only one horizontal damper, so the effect of loss of this damper is immediate.
Under damping also increases cyclic stresses on other suspension components, especially the
torsion bar, which could shorten the life of these components. Time to replace a defective
damper in Depot 1 hour.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to minimize oscillations
Failure mode: Damper non-return valve fails in open position
Effects: In the case of the vertical damper, full damping capability would have to be
provided by the damper opposite, which might not be able to cope and hence which might also
fail rapidly itself. Even if the opposite damper did not fail, damping efficiency is impaired
so oscillations are not effectively damped, which could cause discomfort on longer journeys.
There is only one horizontal damper, so the effect of loss of this damper is immediate.
Under damping also increases cyclic stresses on other suspension components, especially the
torsion bar, which could shorten the life of these components. Time to replace a defective
damper in Depot 1 hour.

Function: 2. To insulate passengers from shocks caused by crossing rail joints and bumps, and
to minimize transient oscillations after crossing such bumps
Functional failure: Fails to minimize oscillations
Failure mode: Damper mounting bolts become detached
Effects: Dampers come adrift and oscillations are not effectively damped, which causes
discomfort and may induce motion sickness on longer journeys. Horizontal damper could be
dragged along a rail. It may also drop off in front of a wheel, possibly leading to
derailment. Time to replace a defective damper in Depot 1 hour.

Function: 3. To insulate passengers from jerks during acceleration and braking
Functional failure: Fails to insulate passengers from jerky stops and starts
Failure mode: Compound spring retaining nut fails, leading to dislocation of the compound
spring
Effects: The car body is still supported by the secondary suspension, but the center pivot
crashes back and forth against the traction center when starting and stopping. This causes a
jerky ride and considerably increases shock loads on the truck and local car components
(especially the center pivot, traction center and air bags). A dislocated spring could also
prevent the truck from curving correctly, which may lead to a derailment under adverse
conditions of load and speed. Time to rectify this defect 2 hours at the Depot. (Note that
the retaining nut is held in place by the split pin, so this failure would not occur if the
split pin is in place.)
3 To insulate passengers Fails to insulate Compound The car body is still supported by the secondary
from jerks during passengers spring rubber suspension, but the center pivot crashes back and
acceleration and braking from jerky deteriorates forth against the traction center when starting and
stops and starts stopping. This causes a jerky ride and considerably
increases shock loads on the truck and local car
components (especially the center pivot, traction
center and air bags). A dislocated spring could also
prevent the truck from curving correctly, which
may lead to a derailment under adverse conditions
of load and speed. Time to rectify this defect 2
hours at the Depot.
3 To insulate passengers Fails to insulate Traction link Starting and stopping forces are damped only by the
from jerks during passengers rubber bush compound spring, which leads to a jerky ride and a
acceleration and braking from jerky fails general increase in shock loads. Time to replace
stops and starts bush 2 hours.
4 To control the roll Fails to control Torsion bar If the torsion bar shears, one end of the car body
angle of the car body the roll angle of shears lurches from side to side during cornering. This
relative to the truck the car body at could disturb and possibly frighten passengers. The
all car also becomes highly unstable and the resulting
loss of balance could lead to derailment, especially
if a heavily loaded car was going at high speed
round a corner. Time to replace the torsion bar in
Depot 4 hours.
4 To control the roll Fails to control Torsion bar The torsion bar would rotate by itself and cause
angle of the car body the roll angle of retaining key noise and vibration. However, the torsion bar would
relative to the truck the car body at fails not be sheared, so derailment is unlikely to occur.
all Time to replace the torsion bar in Depot 4 hours.
4 To control the roll Fails to control Torsion bar Torsion bar has nothing to act against, causing one
angle of the car body the roll angle of turnbuckle end of the car to lurch from side to side during
relative to the truck the car body at fastening cornering, disturbing and possibly frightening
all comes passengers. The car also becomes highly unstable
undone and the resulting loss of balance could lead to
derailment, especially if a heavily loaded car was
going at high speed round a corner. Time to
reconnect the turnbuckle in Depot 4 hours.
4 To control the roll Fails to control Torsion bar Excessive clearance means that the torsion bar rests
angle of the car body the roll angle of bearing worn directly on the edge of the bearing housing. The
relative to the truck the car body at due to lack resulting point load on the torsion bar greatly
all of increases the chances of the bar shearing, causing
lubrication instability and a possible derailment. Time to
replace this bearing at Deport 4 hours.
5 To ensure that the Unable to Air bag leaks If the step is not level with the platform, a
carriage floor is level ensure that via top plate passenger could trip and fall. Time to replace air
with the platforms when carriage floor is of car bolster bag at Deport 8 hours. See also 22 above
train stops at a station level with the faster than it
platform can be
pumped in
5 To ensure that the Unable to Air bag If the step is not level with the platform, a
carriage floor is level ensure that bursts passenger could trip and fall. Time to replace air
with the platforms when carriage floor is bag at Depot 8 hours.
train stops at a station level with the
platform
5 To ensure that the Unable to Leveling Air bag cannot be charged efficiently so carriage
carriage floor is level ensure that valve floor cannot be aligned with platform before
with the platforms when carriage floor is turnbuckle passengers start moving on and off the train. This
train stops at a station level with the loose means that a passenger could trip and fall. This
platform failure occurred quite often in the past, but the
locknut and spring washer were replaced by a nylon
washer, and it has not happened for a year.
5 To ensure that the Unable to Layer spring Car body sags, which can be compensated for
carriage floor is level ensure that stiffness initially by adding adjustment shims. Serious loss
with the platforms when carriage floor is decreases of stiffness means that shims can no longer
train stops at a station level with the compensate. Time to replace layer spring at depot 8
platform hours.
6 To assist in stopping Completely Brake pad One worn pad is unlikely to affect the stopping
the train at up to 0.88 unable to assist worn more performance of the whole train, but a number of
m/s2 in stopping the than 10 mm worn pads could do so. Pads are usually replaced
train when wear exceeds 7 mm and it takes 20 minutes to
repair a pad in the Depot.
6 To assist in stopping Completely Brake disk One worn disc would not have a significant
the train at up to 0.88 unable to assist wear exceeds effect on the stopping performance of the whole
m/s2 in stopping the 2.5 mm train, but several worn disks would do so. Disks are
train re-profiled on the under floor wheel lathe when
wear exceeds 2 mm. This takes 2 hours.
6 To assist in stopping Completely Brake pad Brake pad holder scratches the disk, so the disk has
the train at up to 0.88 unable to assist falls off to be re-profiled (2 hours) and brake pad replaced
m/s2 in stopping the (20 minutes). One worn disc would not have a
train significant effect on the braking performance but
several worn discs would do so.
7 To prevent direct Unable to Vertical The axle box could hammer against the truck frame
contact between axle box prevent contact bump stop when passing over bumps, leading to deformation
and truck frame under between axle missing of the axle box and possible accelerated failure of
severe bounce conditions box and truck the axle bearings. Time to replace the bump stop in
under severe Depot up to 8 hours.
bounce
conditions
8 To permit the truck to Truck cannot Lifting point This failure could occur while the truck is
be lifted and/or the car to be lifted or car fails due to suspended in mid-air, which means that it could fall
be towed easily towed easily wear or onto somebody. Time to repair eye by welding 3
corrosion hours.
8 To permit the truck to Truck cannot Lifting point Eye could be weakened or the truck could be
be lifted and/or the car to be lifted or car damaged by improperly secured for lifting, causing a suspended
be towed easily towed easily external truck to fall, possibly onto somebody. Time to fit
force new eye 3 hours.
8 To permit the truck to Truck cannot Lifting point Truck could not be lifted at all using the eye, so
be lifted and/or the car to be lifted or car sheared off alternative arrangements would have to be made.
be towed easily towed easily by external
force
9 To ensure that wheel Wheel set falls Tie bar Wheel set could drop onto somebody while the
sets remain attached to off truck while fractures truck is suspended in mid-air. Time to replace the
truck while truck is truck is being tie bar up to 8 hours in the Depot.
being lifted lifted
10 To insulate the car Incapable of Emergency This failure on its own has no effect. If the air bag
from shocks to some insulating the spring fails and the emergency spring both fail, secondary
extent if the air bag fails car if the air suspension has to be provided by the layer spring
bag fails on its own. 30 above explains what happens if air
bag, layer spring and emergency spring all fail.
Time to replace the emergency spring at Depot 8
hours.
11 To limit lateral Unable to limit Lateral bump Under extreme conditions of lateral load, car bolster
movement of car relative lateral stop rubbers stool could hit truck frame, reducing ride comfort
to truck movement of worn away and generally increasing shock loads. Time to
car relative to replace lateral bump stop rubber at Depot 8 hours.
truck
11 To limit lateral Unable to limit Lateral bump Under extreme conditions of lateral load, car bolster
movement of car relative lateral stop falls off stool could hit truck frame, reducing ride comfort
to truck movement of and generally increasing shock loads. Time to
car relative to replace lateral bump stop rubber at Depot 8 hours.
truck
12 To prevent traction Unable to Split pin falls This failure only matters if the retaining nut starts
link retaining nut from prevent traction out coming loose. If the retaining bolt falls out, effects
coming undone link retaining are described in 12 above. Time to replace split pin
nut from falling at Depot 1 hour.
off bolt
13 To prevent compound Unable to Split pin falls This failure only matters if the retaining nut starts
spring retaining nut from prevent the out coming loose. If the retaining nut falls off, the
coming undone compound compound spring would fall off. Large clearance
spring retaining between the center pivot and the center plate would
nut from falling cause fierce vibrations in the car compartment and
off further damage to the bolster stool. Time to replace
split pin in Depot 1 hour.
Example 2
Function 1: To supply air to conditioned air distribution ducts at the temperature called for
by pack temperature controller
Failed state A: Conditioned air is not supplied at called-for temperature
Failure mode 1: Air-cycle machine seized
Effects: Reduced pack flow, anomalous readings on pack-flow indicator and other instruments
Failure mode 2: Blocked ram-air passages in heat exchanger
Effects: High turbine-inlet temperature and partial closure of flow-control valve by
over-temperature protection, with resulting reduction in pack airflow
Failure mode 3: Failure of anti-ice valve
Effects: If valve fails in open position, increasing impact discharge temperature; if valve
fails in closed position, reduced pack airflow
Failure mode 4: Failure of water separator
Effects: Condensation (water drops, fog, or ice crystals) in cabin

Function 2: To be able to prevent loss of cabin pressure by backflow if the duct fails in
unpressurized nose-wheel compartment
Failed state A: No protection against backflow
Failure mode 1: Failure of bulkhead check valve
Effects: None (hidden function); if duct and/or connectors fail in pack bay, loss of cabin
pressure by backflow, and airplane must descend to lower altitude
Example 3
Function: To provide safe, secure, uninterrupted, redundant, cost effective, continuous process
control and monitoring according to the target product of the day, within the parameters
specified by product specification and by current environmental regulations, in the presence
of a UPS (uninterruptible power supply)

Failed state: Fails to provide security
Failure mode: Unauthorized usage of console either when unattended or if password stolen
Effects: An unauthorized and untrained person gains access to an operating console or an
engineering console. This may lead to a condition where loss of life or environmental
disaster can occur. In this eventuality legal or civil proceedings will likely be brought
against the Company.

Failed state: Unable to log in
Failure mode: Password forgotten
Effects: Operator unable to control the plant. Operator would look for another console which
has a log in. In a worst case scenario all consoles would be locked out and emergency
shutdown would be initiated if the operator suspects abnormal operation at that particular
time.

Failed state: Unable to protect against loss of control
Failure mode: UPS has failed
Effects: Under normal conditions this failure would be noticed by the operator who checks the
alarms in the normal execution of his daily tasks.

Failed state: Control lost
Failure mode: Complete loss of communication with ring
Effects: Unreliable or no data shown on console. Operator loses ability to control the plant.
Emergency shutdown initiated. The most common cause of this failure in the past has been
contractors inadvertently cutting cables. This is likely to take at least 2 hours to one day
to fix, entailing a loss of production. This failure mode is considered to be a rare event.
Failure mode: Complete loss of communication with controller node
Effects: One node goes off line. This could be preceded by any of dirt fouling of fan,
moisture penetration, RF interference, or electronic component failure. Partial or complete
shutdown depending on importance of node. Unreliable or no data shown on console. Operator
loses ability to control the plant. Emergency shutdown initiated. The most common cause of
this failure in the past has been contractors inadvertently cutting cables. This is likely to
take at least 2 hours to one day to fix, entailing a loss of production. This has happened
occasionally in the past.
Chapter 16. The RCM Decision Algorithm
Questions 5, 6, and 7
The process
While failure analysis may have some small intrinsic interest of its own, the reason for
our concern with failure is its consequences. These may range from the modest cost of
replacing a failed component to the possible destruction of a piece of equipment,
devastating harm to the environment, or the loss of lives. Thus all reliability-centered
maintenance, including the need for redesign, is indicated not by the frequency of a
particular failure, but by the nature of its consequences. Any preventive-maintenance
program is therefore based on the following precept:
The consequences of a failure determine the priority of the maintenance activities or
design improvement required to prevent its occurrence.
The more complex any piece of equipment is, the more ways there are in which it can
fail. All failure consequences, however, can be grouped in the following four categories:
1. Hidden-failure (H) consequences, which have no direct impact, but increase the
likelihood of a multiple failure
2. Safety or environmental (S) consequences
3. Operational (O) consequences, which involve indirect economic loss as well as
the direct cost of repair
4. Nonoperational maintenance (M) consequences, which involve only the direct
cost of repair
Example 1 shows several of the records from the full analysis of the rail passenger car
Truck. In the column “H S P M” we decide, from the effects description, whether the
consequences are hidden, safety or environmental, production (operational) or
maintenance (nonoperational). We test each of the four possible consequences in this
order, and we stop as soon as we ascertain that the circumstances (effects) of the
failure mode provoke the consequence being tested.
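The test order described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not part of any RCM software: `EffectsAssessment` and `classify_consequence` are hypothetical names, and the three yes/no answers are assumed to come from the analyst's reading of the effects description.

```python
from dataclasses import dataclass


@dataclass
class EffectsAssessment:
    """Analyst's yes/no judgments about a failure mode (hypothetical structure)."""
    hidden: bool                   # failure goes unnoticed on its own
    safety_or_environmental: bool  # could hurt people or the environment
    operational: bool              # causes production (operational) loss


def classify_consequence(a: EffectsAssessment) -> str:
    """Test H, S, P, M in that order and stop at the first one that applies."""
    if a.hidden:
        return "H"
    if a.safety_or_environmental:
        return "S"
    if a.operational:
        return "P"
    # Nothing else applies: only the direct cost of repair remains.
    return "M"


# A failure that is evident, unsafe, and also stops production is recorded as "S",
# because safety is tested before production.
print(classify_consequence(EffectsAssessment(False, True, True)))
```

Because the tests are ordered, a failure mode receives exactly one letter even when several kinds of consequence are present.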
Example 1
Ctrl. No.: 1
Function statement: To provide smooth rolling support for half the weight of a passenger car
(up to 26.5 tons) on the rails at speeds up to 120 kph
Failed state: Fails to provide support
Failure mode: Weld in frame fails due to fatigue
Effects: The truck as a whole collapses. This is most likely to occur when the car is most
heavily loaded - in other words when it is full of passengers, and probably while the train
is going round a corner. As a result, it would almost certainly be derailed. At present, the
truck is replaced when a crack longer than 100 mm is found. (Such a crack would be found
during the course of other inspections that occur often enough to detect it.) Downtime to
replace truck on its own 16 hours.
Decision matrix (worksheet heading):
H C T D M
S C T 2 M
P C T N M
M C T N M
Proposed task: Inspect frame for cracks greater than 100 mm
Initial interval: To be included with other scheduled tasks
By:
The RCM decision algorithm is represented by the matrix of Figure 16-1, which is also
included in the heading of the decision half of the RCM worksheet.
Figure 16-1 RCM Decision Diagram. Redesign, “R”, is mandatory in rows “H” and “S” if no
proactive task reduces the consequences of failure to a tolerable level. The full text of each cell is
given below
We execute the RCM decision logic by beginning at the top left of the matrix. We decide
upon the appropriate row (the branch of the decision tree corresponding to the consequence
that was previously attributed to the failure mode) and work towards the right. The letter in
each cell of the matrix represents a question (step) in the RCM decision algorithm. The
full text of the questions (below) should be recited explicitly as the decision diagram is
being traversed. Avoid the tendency to abbreviate the questions so much that their
meaning is lost or distorted.
Full text of decision diagram questions
H. Is the function's failed state hidden? That is, will the failure go unnoticed until another
function fails or some extraordinary event occurs?
S. Does the failure affect safety, health, or the environment?
O. Can the failure provoke operational (production) consequences? These include cost,
quality, and customer service.
M. Are the only consequences those that affect maintenance or the maintenance budget?
C. Is a condition based maintenance (CBM) task applicable? Can it reliably detect the
'failing' state early enough to reduce the failure's probability and/or its consequences to a
tolerable level? Is it effective? Does it make economic sense to perform this task at the
frequency required?
T. Is a time based maintenance task applicable? Is there an age (useful life) at which the
probability of failure due to this failure mode increases rapidly, and do most items
survive to this age? Effective: Can a routine (TBM) task reduce the failure's probability
and/or its consequences to a tolerable level? Two types of time based tasks are considered
under this heading: 1) Scheduled Overhaul, and 2) Scheduled Discard, the latter being
mandatory for a “safe-life” item178.
D. Is a detection task applicable? Will it reduce the multiple failure's probability to a
tolerable level? Is it effective? Is it practical to do the task at the required interval?
2. Can a combination of 2 or more TBM and CBM tasks be applicable (avoid or reduce
the safety consequences to a tolerable level)? Are they effective (practical)?
N. Neither time based nor condition based activities need be scheduled.
R. A hardware, software, or procedural modification that will reduce the failure's
probability and/or its consequences to a tolerable level is mandatory (H or S) or may be
desirable (P or M).
For the failure mode (cause) “Weld in frame fails due to fatigue” we ask whether the
failure is hidden. Since the failure’s direct effects will be clearly visible (probably
catastrophic) to operating personnel, this failure is not hidden. Therefore we proceed to
the next cell to the right and ask whether there is a CBM task that is applicable and
effective. We need search no further than the effects description to learn that it is entirely
feasible to detect a crack at the potential failure stage of 100 mm length. It will be
effective (economically feasible to do so) because there will be ample opportunity to
perform this inspection often enough during other routine work (to be described in
subsequent rows of the analysis). Hence we stop at that point and enter “C” under the
second column of the matrix.
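The traversal of the decision matrix can be sketched as follows. This is a hedged illustration only: the row contents are taken from the worksheet heading of Example 3 (H: C T D R, S: C T 2 R, P: C T N R, M: C T N R), while the function name `select_decision`, the `answers` dictionary, and the `redesign_desirable` flag are assumptions introduced for the example. The sketch also assumes that the weld failure mode is assigned to row “S”, since its effects include derailment.

```python
from typing import Dict

# Rows of the decision matrix, keyed by the consequence letter.
DECISION_MATRIX = {
    "H": ("C", "T", "D", "R"),
    "S": ("C", "T", "2", "R"),
    "P": ("C", "T", "N", "R"),
    "M": ("C", "T", "N", "R"),
}


def select_decision(consequence: str, answers: Dict[str, bool],
                    redesign_desirable: bool = False) -> str:
    """Walk one row cell by cell and return the first cell that applies.

    answers maps a question letter ("C", "T", "D", "2") to the analyst's
    yes/no answer; unanswered questions count as "no".
    """
    for cell in DECISION_MATRIX[consequence]:
        if cell == "R":
            # Rows H and S: no proactive task worked, so redesign is mandatory.
            return "R"
        if cell == "N":
            # Rows P and M: no scheduled task, unless redesign is judged desirable.
            return "R" if redesign_desirable else "N"
        if answers.get(cell, False):
            return cell
    raise AssertionError("unreachable: every row ends in R")


# Weld in frame fails due to fatigue: a CBM task (crack inspection) is
# applicable and effective, so the walk stops at the first cell.
print(select_decision("S", {"C": True}))
```

Keeping the rows in a table rather than in nested conditionals mirrors the worksheet heading, so a change to the matrix does not require a change to the traversal.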
Example 2
Two functions have been listed for the air-conditioning pack. Its basic function is to
supply air to the distribution duct at the temperature called for by the pack controller.
We apply the decision algorithm to this function first.
178
An item that ages (e.g. through fatigue, wear, or corrosion), whose failure has safety or
environmental consequences, and whose potential failure is not adequately detectable.
H. Is the occurrence of a failure evident to the operating crew during performance
of normal duties?

Ctrl. No.: 1
Function: To supply air to conditioned air distribution ducts at the temperature called for
by pack temperature controller
Failed state: Conditioned air is not supplied at called-for temperature
Failure mode: Air-cycle machine seized
Effects: Reduced pack flow, anomalous readings on pack-flow indicator and other instruments
Failure mode: Blocked ram-air passages in heat exchanger
Effects: High turbine-inlet temperature and partial closure of flow-control valve by
over-temperature protection, with resulting reduction in pack airflow
Failure mode: Failure of anti-ice valve
Effects: If valve fails in open position, increasing impact discharge temperature; if valve
fails in closed position, reduced pack airflow
Failure mode: Failure of water separator
Effects: Condensation (water drops, fog, or ice crystals) in cabin
Proposed task: None. This functional failure has no significant consequences; reclassify as
nonsignificant.

Ctrl. No.: 2
Function: To be able to prevent loss of cabin pressure by backflow if the duct fails in
unpressurized nose-wheel compartment
Failure mode: Failure of bulkhead check valve
Effects: None (hidden function); if duct and/or connectors fail in pack bay, loss of cabin
pressure by backflow, and airplane must descend to lower altitude
Proposed task: Disconnect duct to manifold and examine check valve for wear,
Any one of the failure modes listed will result in changes in the pack’s performance, and
these anomalies will be reflected by the cockpit instruments. Hence the functional failure
in this case can be classified as evident.
The loss of function in itself does not affect operating safety; however, each of the
failure modes must be examined for possible secondary damage:
S. Does the failure cause a loss of function or secondary damage that could have a
direct adverse effect on operating safety?
Engineering study of the design of this item shows that none of the failure modes cause
any damage to surrounding items, so the answer to this question is no.
The next question concerns operational consequences:
O. Does the failure have a direct adverse effect on operational capability?
Because the packs are fully replicated, the aircraft can be dispatched with no operating
restrictions when any one pack is inoperative. Therefore there is no immediate need for
corrective maintenance. In fact, the aircraft can be dispatched even if two units are
inoperative, although in this event operation would be restricted to altitudes of less than
25,000 feet.
On this basis we would reclassify the air-conditioning pack as a functionally
nonsignificant item. Failure of any one of the three packs to perform its basic function
will be evident, and therefore reported and corrected. A single failure has no effect on
safety or operational capability, and since replacement of the failed unit can be deferred,
there are no economic consequences other than the direct costs of corrective
maintenance. Under these circumstances scheduled maintenance is unlikely to be cost-
effective, and the costs cannot be assessed in any event until after the equipment enters
service. Thus in developing a prior-to-service program there is no need to make an
intensive search for scheduled tasks that might prevent this type of failure.
When we examine the second function of the air-conditioning pack, however, we find an
element that does require scheduled maintenance. The bulkhead check valve, which
prevents backflow in case of a duct failure, is of lightweight construction and flutters
back and forth during normal operation. Eventually mechanical wear will cause the
flapper to disengage from its hinge mount, and if the duct in the unpressurized nose-wheel
compartment should rupture, the valve will not seal the entrance to the pressurized cabin.
To analyze this second type of failure we start again with the first question in the decision
diagram:
H. Is the occurrence of a failure evident to the operating crew during performance
of normal duties?
The crew will have no way of knowing whether the check valve has failed unless there is
also a duct failure. Thus the valve has a hidden function, and scheduled maintenance is
required to avoid the risk of multiple failure – failure of the check valve, followed at
some later time by failure of the duct. Although the first failure would have no
operational consequences, this multiple failure would necessitate descent to a lower
altitude, and the airplane could not be dispatched after landing until repairs were made.
With a “no” answer to this first question, proposed tasks for the check valve fall in the hidden-
function branch of the decision diagram:
C. Is an on-condition task to detect potential failures both applicable and effective?
Engineering advice is that the duct can be disconnected and the valve checked for signs
of wear. Hence an on-condition task is applicable. To be effective the inspections must
be scheduled at short enough intervals to ensure adequate availability of the hidden
function. On the basis of experience with other fleets, an initial interval of 10,000 hours
is specified, and the analysis of this function is complete.
In this case inspecting the valve for wear costs no more than inspecting for failed valves
and is preferable because of the economic consequences of a possible multiple failure. If
a multiple failure had no operational consequences, scheduled inspections would still be
necessary to protect the hidden function; however, they would probably have been
scheduled at longer intervals as a failure-finding task.
Example 3
Item Number: Loop 2-Olefins
Item Description: Distributed control system. Continuous process. Unionized. 500 employees. See
business plan. Biggest product: ethylene. Can also produce gasoline. Two lines: 1. Material flow 2. Olefins.
Raw material safely stored at high pressure (6000 MPa) in underground storage caverns. It is pipelined to
production facilities. Ethylene converted to polyethylene. There is a "hot side" and "cold side". Raw material
undergoes cracking (breaking carbon chains) and becomes ethylene. The plant extends over several acres (a
square kilometre). The DCS is integral to the entire production line. There are 3 different types of DCS.
Recently there has been a benzene spill. Environmental excursions occur occasionally. Installed 1996. Capital
expenditures have been curtailed recently. Individual heaters can be shut down for maintenance.
Worksheet columns: Ctrl. No.; Function; Failed States; Failure Modes; Effects; decision matrix;
Proposed task; Initial Intv’l; By
Decision matrix (worksheet heading, columns 1 to 4):
H C T D R
S C T 2 R
P C T N R
M C T N R

Ctrl. No.: 1
Function: To provide safe, secure, uninterrupted, redundant, cost effective, continuous process
control and monitoring according to the target product of the day, within the parameters
specified by product specification and by current environmental regulations, in the presence
of a UPS (uninterruptible power supply)
Failed state: Fails to provide security
Failure mode: Unauthorized usage of console either when unattended or if password stolen
Effects: An unauthorized and untrained person gains access to an operating console or an
engineering console. This may lead to a condition where loss of life or environmental
disaster can occur. In this eventuality legal or civil proceedings will likely be brought
against the Company.
Consequence: S
Decision: R
Proposed task: An authentication system (ID card, biometric, etc.) is mandatory

Ctrl. No.: 2
Failed state: Unable to log in
Failure mode: Password forgotten
Effects: Operator unable to control the plant. Operator would look for another console which
has a log in. In a worst case scenario all consoles would be locked out and emergency
shutdown would be initiated if the operator suspects abnormal operation at that particular
time.
Consequence: P
Decision: R
Proposed task: Require logout at shift change.

Ctrl. No.: 3
Failed state: Unable to protect against loss of control
Failure mode: UPS has failed
Effects: Under normal conditions this failure would be noticed by the operator who checks the
alarms in the normal execution of his daily tasks.
Consequence: M
Decision: N

Ctrl. No.: 4
Failed state: Control lost
Failure mode: Complete loss of communication with ring
Effects: Unreliable or no data shown on console. Operator loses ability to control the plant.
Emergency shutdown initiated (see l2). The most common cause of this failure in the past has
been contractors inadvertently cutting cables. This is likely to take at least 2 hours to one
day to fix, entailing a loss of production. This is considered to be a rare event.

Ctrl. No.: 5
Failure mode: Complete loss of communication with controller node
Effects: One node goes off line. This could be preceded by any of dirt fouling of fan,
moisture penetration, RF interference, or electronic component failure. Partial or complete
shutdown depending on importance of node. Unreliable or no data shown on console. Operator
loses ability to control the plant. Emergency shutdown initiated (see l2). The most common
cause of this failure in the past has been contractors inadvertently cutting cables. This is
likely to take at least 2 hours to one day to fix, entailing a loss of production. This has
happened occasionally.

Ctrl. No.: 6
Failure mode: All consoles fail

Ctrl. No.: 7
Failure mode: Complete loss of communication on module bus

Ctrl. No.: 8
Failure mode: Complete loss of communication on slave bus

Ctrl. No.: 9
Failure mode: Console LAN fails

Ctrl. No.: 10
Failed state: Redundancy lost
Failure mode: Console hardware or software fails

Ctrl. No.: 11
Failure mode: Controller hardware or software fails

Ctrl. No.: 12
Failure mode: Power supply fails

Ctrl. No.: 13
Failure mode: IO cards
Note that no attempt is made to design the proposed authentication system. RCM analysis
leaves the detailed redesign to a separate group, assembled for that specific purpose,
in which specialists are on hand.
Example 4
Figure 16-2 The shock-strut assembly on the main landing gear of the Douglas DC-10. The outer
cylinder is a structurally significant item.
Structures Worksheet: type of Aircraft Douglas DC-10-10
Item Number: 101
No. per aircraft: 2
Item Name: Shock-strut outer cylinder
Major area: main landing gear
Vendor part/model no: PN ARG 7002-505
Zones: 144, 145
Description/location details: Shock-strut assembly is located on main landing gear; SSI
consists of outer cylinder (both faces)
Design criterion:
Damage tolerant element: __
Safe-life element: Yes
Inspection access:
Internal: Yes
External: Yes
Material (include manufacturer's trade name): Steel alloy 4330 MOD (Douglas TRICENT 300 M)
Redundancy and external detectability: No redundancies; only one cylinder each landing gear,
left and right wings. No external detectability of internal corrosion.
Is element inspected via a related SSI? If so, list SSI no.: No
Classification of item (significant/nonsignificant): significant
Fatigue-test data
Expected fatigue life:
Crack propagation:
Established safe-life: 46,800 landings 70,200 oper. hours
Design conversion ratio: 1.5 operating hours/flight cycle

Ratings
Residual strength: -
Fatigue life: -
Crack growth: -
Corrosion: 1
Accidental damage: 4
Class no.: 1
Controlling factor: Corrosion

Inspection (int./ext), proposed task, and initial interval
Internal: Magnetic-particle inspection for cracking and detailed visual inspection for
corrosion. Initial interval: sample at 6000 to 9000 hours and at 12000 to 15000 hours to
establish best interval.
External: General inspection of outer surface. Initial interval: during pre-flight
walkarounds and at A checks.
External: Detailed visual inspection for corrosion and cracking. Initial interval: not to
exceed 1,000 hours (C check).
Remove and discard at life limit. Initial interval: 34,800 hours.
Figure 16-3 RCM Worksheet for structurally significant items
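The worksheet quotes life figures in both landings and operating hours, linked by the design conversion ratio of 1.5 operating hours per flight cycle. The short sketch below only checks that arithmetic; the function names are hypothetical, one landing is assumed to equal one flight cycle, and the sketch does not attempt to explain why the discard limit (34,800 hours) differs from the 70,200-hour figure.

```python
HOURS_PER_FLIGHT_CYCLE = 1.5  # design conversion ratio from the worksheet


def landings_to_hours(landings: float) -> float:
    """Convert flight cycles (landings) to operating hours."""
    return landings * HOURS_PER_FLIGHT_CYCLE


def hours_to_landings(hours: float) -> float:
    """Convert operating hours to flight cycles (landings)."""
    return hours / HOURS_PER_FLIGHT_CYCLE


# 46,800 landings correspond to the 70,200 operating hours shown on the worksheet.
print(landings_to_hours(46_800))   # 70200.0
# The 34,800-hour discard limit corresponds to 23,200 landings at the same ratio.
print(hours_to_landings(34_800))   # 23200.0
```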
The worksheet of Figure 16-3 differs from that of the previous examples. This form
applies to the analysis of structurally significant items. All structurally significant items
fall into one of two categories:
1. Damage-tolerant item: A monolithic or multiple load path item in which
a crack or complete failure of an element will not reduce residual strength
below the safety level prior to detection, or
2. Safe-life item: A structurally significant item whose potential failure is
not reliably detectable.
Table 16-1 explains the rating system for the first 5 columns of Figure 16-3. The analysis
shows the treatment of a safe-life item in an airline context. Because the shock-strut outer
cylinder on the main landing gear of the Douglas DC-10 has been classified a safe-life
item it must be discarded before a fatigue crack is expected to occur. Hence it is not rated
for residual strength, fatigue life, or crack propagation characteristics (the first three
columns of Figure 16-3). The Class Number of column 6 is set to the minimum of the
ratings in columns 1 to 5. The “controlling factor” is the factor whose column holds that
minimum.
Safe-life limits are only effective, however, if nothing prevents the item from reaching
them. In the case of structural items, there are two factors that introduce this possibility –
corrosion and accidental damage. Experience has shown that landing-gear cylinders of
this type are subject to two corrosion problems. First, the outer cylinder is susceptible to
corrosion from moisture that enters the joints at which other components are attached;
second, high-strength steels such as 4330 MOD are subject to stress corrosion in some of
the same areas. The item is given a corrosion rating of 1, which therefore results in an
overall class number of 1.
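The rule for the class number and the controlling factor can be illustrated with a minimal sketch. The function name `class_number` and the dictionary layout are assumptions made for the example; the ratings are those shown on the worksheet of Figure 16-3, where the first three factors are not rated because the item is a safe-life item.

```python
from typing import Dict, Optional, Tuple


def class_number(ratings: Dict[str, Optional[int]]) -> Tuple[int, str]:
    """Return (class number, controlling factor) for one structural item.

    Unrated factors (None) are ignored; the class number is the lowest
    remaining rating, and the controlling factor is the factor that holds it.
    """
    rated = {factor: r for factor, r in ratings.items() if r is not None}
    if not rated:
        raise ValueError("at least one factor must be rated")
    controlling = min(rated, key=lambda factor: rated[factor])
    return rated[controlling], controlling


shock_strut = {
    "residual strength": None,  # not rated for a safe-life item
    "fatigue life": None,
    "crack growth": None,
    "corrosion": 1,
    "accidental damage": 4,
}
print(class_number(shock_strut))  # (1, 'corrosion')
```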
The onset of corrosion is more predictable in a well-developed design than in a new one.
Previous operation of a similar design in a similar environment has shown that severe
corrosion is likely to develop by 15,000 to 20,000 hours (five to seven years of
operation). It can be detected only by inspection of the internal joints after shop
disassembly; hence this inspection will be performed only in conjunction with scheduled
inspections of the landing-gear assembly. This corrosion inspection requirement is,
therefore, one of the controlling factors in establishing the shop-inspection interval.
It is customary to start such inspections at a conservative interval and increase the
interval at a rate determined by experience and the condition of the first units inspected.
The initial requirement is therefore established as inspection of one sample between
6,000 and 9,000 hours and one sample between 12,000 and 15,000 hours to establish the
ongoing interval179. During the shop visits for these inspections any damage to the
structural parts of the assembly is repaired as necessary and the systems parts of the
assembly are usually reworked. Thus the combined process is often referred to as
landing-gear rework.
In addition to the corrosion rating, the shock-strut cylinder is rated for susceptibility to
accidental damage. The cylinder is exposed to relatively infrequent damage from rocks
and other debris thrown up by the wheels. The material is also hard enough to resist most
such damage. Its susceptibility is therefore very low, and the rating is 4. However,
because the damage is random and cannot be predicted, a general check of the outer
cylinder, along with the other landing-gear parts, is included in the walkaround
inspections and the A check, with a detailed inspection of the outer cylinder scheduled at
the C-check interval.
179
Intervals derived from age exploration, such as these, are continuously refined as experience
with the item accrues.
Table 16-1
Reduction in Fatigue life of Crack- Susceptibility to Susceptibility to
residual strength element propagation rate corrosion accidental damage
No. of Ratio of Ratio of Ratio of Exposure as a
rating
rating
rating
rating
elements that fatigue life to interval to corrosion-free result of
can fail design goal fatigue-life age to fatigue- location
without design goal life design
reducing goal
strength below
damage
tolerant level
One 1 1/8 1 1/8 1 High 1
Two or 2 ¼ 2 ¼ 2 Moderate 2
more180
Two or 3 3/8 3 3/8 3 Low 3
more181
Two or 4 ½ 4 ½ 4 Very low 4
more182
180
75% reduction in the margin between ultimate and damage tolerant level
181
182
Page 246
The worksheet guide of Figure 16-4 summarizes the processes of Part 3. Reliability Centered Maintenance.
Figure 16-4: RCM worksheet guide
Page 247
Page 248
Chapter 17. Integrating Reliability Information - MIMOSA
When ideas achieve currency, momentous change ensues. A unified approach to
information sharing in operations and maintenance (O&M) has been gathering
momentum over the past decade. MIMOSA183, the OPC Foundation184, and the ISA185
have launched OpenO&M, a comprehensive, open information architecture for
unfettered technical collaboration in the modern O&M environment.
UML Class Diagrams
Figure 17-1: MIMOSA UML Class diagram “EventCore”
Compatible information exchange among diverse plant and business sub-systems
encourages the use of “Agents” for effective analysis and decision making. Intelligent
agents may thus access and combine data from many different sources, for example
process control systems, maintenance systems, and financial systems throughout the
enterprise. Agents return decisions or recommendations that, by design, benefit the
organization. For example, an EXAKT™ intelligent agent monitors the CBM and the
CMMS databases. It automatically generates and sends to the CMMS optimal
recommendations on when and how to perform a maintenance intervention on an asset.
Figure 17-1 is a MIMOSA UML186 class diagram that shows the role of an intelligent
agent187.
183
www.mimosa.org
184
OPC Foundation (www.opcfoundation.org)
185
The Instrumentation, Systems, and Automation Society (www.isa.org)
Each of the boxes in the UML class diagram (Figure 17-1) represents a class188. Lines
joining the classes represent relationships. The relationships of Figure 17-1 are called
associations189. The red line with the diamond head, joining the
AssetRecommendation and Database entities, is an aggregation190 (a “whole/part”
association)191. The cardinality (“1” and “*” on the ends) indicates that the relationship is
one-to-many. One database maintains many AssetRecommendation records.
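The one-to-many aggregation can be pictured in code. The following is only a minimal sketch, not part of the MIMOSA specification: the class names come from Figure 17-1, while the fields (asset_id, advice, name) and the sample values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssetRecommendation:
    """The "part": one recommendation record. Field names are hypothetical."""
    asset_id: str
    advice: str


@dataclass
class Database:
    """The "whole": cardinality 1 on this end, * on the recommendation end."""
    name: str
    recommendations: List[AssetRecommendation] = field(default_factory=list)

    def add(self, rec: AssetRecommendation) -> None:
        # One database maintains many AssetRecommendation records.
        self.recommendations.append(rec)


cmms = Database("CMMS")
cmms.add(AssetRecommendation("PUMP-01", "replace at next shutdown"))
cmms.add(AssetRecommendation("PUMP-02", "continue monitoring"))
print(len(cmms.recommendations))
```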
MIMOSA has published a series of 15 such class diagrams. By studying these diagrams
we may understand the utility of and the reasoning behind the MIMOSA Common
Relational Information Schema (CRIS).
The MIMOSA classes, Segment and Asset (of Figure 17-1), require explanation. A
segment is a production process or sub-process or physical area192 on a site. An asset is
a piece of equipment (with a unique serial number) that can be allocated to a segment.
186
The UML is the Unified Modeling Language. See “The Unified Modeling Language User Guide” by
Grady Booch, James Rumbaugh, Ivar Jacobson, Addison-Wesley 1999, ISBN 0201571684
187
An intelligent agent is an automated entity that processes data and makes decisions and
recommendations. The MIMOSA Agent class may also include humans and organizations who fulfill the
same role.
188
A class is a specification for an object. An object can represent some physical item used in the business
process – for example a work order record in a database table of work orders.
189
An association is one of the four types of relationships: 1. Dependency, 2. Association, 3.
Generalization, and 4. Realization
190
A structural relationship such as a table belonging to a database or a printed circuit belonging to an
electronic device.
191
This means that an object of the whole has objects of the part.
192
For example “Compressor Room 1”
Figure 17-2: MIMOSA UML Class diagram "RegCore"
The association joining Segment and Site in Figure 17-2 has a solid diamond head.
The solid diamond indicates that a Segment belongs uniquely to a Site193. On the other
hand, an Asset (with an unfilled diamond head association) is only loosely
associated with a site. A piece of equipment can, in principle, be moved to another Site.
Furthermore, an asset can be removed from one segment and installed on another. The
association line joining Segment and Asset in Figure 17-2 reveals that relationship.
Examine that association line. Note the AssetUtilizationHistory class connected by a
dashed line to the association line. The AssetUtilizationHistory class is called an
association class. It provides further clarity on the nature of that association. In this
relationship the objects (database records) of the AssetUtilizationHistory class record the
removals and the installations of assets on segments. These records provide the
suspension and failure “Events” for an EXAKT CBM optimization model194.
193
For example “Compressor Room 1” is strongly related to a Site.
194
Or any type of reliability (age exploration) analysis: Weibull, Pareto, Scatter, Cause and Effect, and
others.
Figure 17-3: MIMOSA UML Class diagram "DiagProgAsset"

Figure 17-3 shows three more associations between the Agent and other MIMOSA
classes. The Agent has the role of “Originator” of AssetRemainingLife records and
AssetHealth records. It also “Creates” AssetProposedEvent records. Consistent
with RCM functional and failure analysis, the AssetProposedEvent has a “Proposed
impact” on AssetFunction.
We may conclude from the various MIMOSA UML class diagrams that the
MIMOSA and OPC OpenO&M architecture recognizes the vital role of intelligent agents
in maintenance decision planning.
Figure 17-4 Optimal recommendation and residual life estimate

An EXAKT agent outputs the data required to populate the MIMOSA Common
Relational Information Schema (CRIS) tables AssetRecommendation,
AssetProposedEvent, AssetHealth, and AssetRemainingLife.
The agent uses a statistical “data interpretation” model that has been built by correlating
historical event and condition monitoring data. The model accounts for the current
operating context of an asset. Finally, the model supports the objectives that the
enterprise defines for that asset, for example:
1. failure and repair cost minimization,
2. asset availability maximization,
3. a required mean time between repairs,
4. a required ratio of planned to emergency work, or
some combination of these or other metrics.
By consistently adhering to the optimal recommendations delivered by the EXAKT
intelligent agent, an organization will achieve its stated long-run objectives for each asset
monitored.
Chapter 18. Managing Strategy
Introduction
Improvement concepts such as “the maintenance dashboard”, “key performance
indicators”, and “benchmarking of the best of breed” resonate in the physical asset
management community. They are the stock in trade of the maintenance management
consultant. A far-sighted vision and a well-conceived strategy, followed by a detailed
implementation effort, will, we expect, transform the maintenance function into an
ordered and controllable process.
With minor variations, two schools of thought dominate the scores of philosophies that
contend in the maintenance improvement marketplace. The symbolic “pyramid of
excellence” (Figure 18-1), and the metaphoric “RCM house” (Figure 18-2) convey their
respective, and somewhat conflicting, paths to “world class physical asset management”.
Figure 18-1 The Pyramid of Excellence195 Figure 18-2 The RCM "House"196
Order, rather than content, differentiates the two approaches. The former initializes its
improvement cycle by establishing a suitable maintenance infrastructure (tiers one and
two at the base of the pyramid). The latter insists that we retain (for the present) our
existing systems and structures, but that we begin (the improvement process) by
analyzing each significant physical asset’s functions, its failures, failure causes, effects,
and consequences. Doing so will determine the appropriate maintenance requirements
– the foundation (of the house of Figure 18-2). Proponents of the “Pyramid of
Excellence” emphasize culture change as an explicit management process. The RCM
camp contends that maintenance culture will adapt naturally with systematic RCM
education and implementation. The former devotes attention to effective planning and
scheduling, while the latter focuses on developing, through RCM analysis, the proactive
cyclic tasks (TBM and CBM) of the maintenance plan197.
195
From Uptime, John Dixon Campbell, Productivity Press, 1995
196
From the RCM II Practitioner’s course, John Moubray 1999
197
Along with the defaults “no scheduled maintenance” and redesign.
Summarizing, students of the “Pyramid” school defer reliability-centered maintenance
(RCM) analysis to a future time by placing it up on the third tier. They see the processes
(such as data management and planning) on tier 2 as pre-requisites for reliability analysis
(RCM). Advocates of the alternative point of view (the “House”), consign “systems” to
the roof (the last element to be erected in an improvement plan), while positioning RCM
analysis as the foundation. In this chapter we review software products such as Strategy
Manager™ 198, Real-Time Production Intelligence™199, and Real-Time Production
Management™200 that seek to unify the two201 approaches.
Extending the Maintenance Audit

Occasionally, corporate management elects to perform a “maintenance assessment” at
one or more of its sites. Independent consultants conduct the audit, principally, by
interviewing a cross-section of maintenance and operating staff, and, by reviewing
various CMMS reports and budget documents. Sometimes the consultants map out flow
diagrams that trace the business processes used in the maintenance department.
Occasionally a consultant will observe and take notes during regular maintenance
meetings. In all cases, to complete the audit, the consultant team delivers a final report
evaluating the maintenance department’s performance relative to a set of benchmarks that
characterize the “best” organizations. The report recommends projects and changes that
will narrow the “gaps” to achieving excellence in each of the (ten) areas of Figure 18-1.
The consultants propose a project priority sequence based on the value of the
improvement and its ease of implementation.
New software-enabled methodologies extend the reach of the occasional maintenance
audit by offering continuous day-to-day performance visibility and control. To
accomplish this function they integrate (using OpenO&M202 standards) with the CMMS,
process computers, and other plant systems.
198
Available from DEI Group (www.dei-group.com)
199
Available from ABB (www.abb.com)
200
Available from OSISoft (www.osisoft.com)
201
systems first or reliability first
202
See www.mimosa.org and the article EXAKT and MIMOSA
Physical asset management inputs, outputs, and control
Figure 18-3 Physical Asset Management inputs, outputs, and control

Figure 18-3 illustrates the two feedback “control” loops of physical asset management.
Their ultimate output achieves the corporate vision. Each feedback arrow represents a
management function:
1. To adjust maintenance policy in response to KPI achievement gaps.
2. To adjust KPI targets in response to vision achievement gaps.
Arrow 3 represents the way that maintenance policy relates to the KPIs, and arrow 4
represents how the actual KPIs achieve corporate vision. The physical asset manager
strives, first, to discover the intricate relationships governing how policy impacts KPIs.
Secondly, he seeks to know how achievement of the KPI targets will impact the balance
sheet and the corporation’s societal responsibilities of custodianship. We might express
the steps to world class performance as:
1. Start with a vision.
2. Set the target KPIs that are needed to achieve the vision.
3. Set up the maintenance policy with the intent to achieve the KPI targets.
4. Execute the policy, measure the KPIs, and perform various types of age-exploration
analysis.
5. Evaluate KPI target gaps and the results of age exploration. Based on that
evaluation implement new policies or enforce current ones.
6. Evaluate performance relative to corporate vision. Add or alter KPI targets
accordingly.
Note that the center block of Figure 18-3 specifies both KPIs and Age Exploration203.
KPIs often summarize the results of a maintenance policy. They seldom direct us to
specific policy changes regarding individual assets. On the other hand, age exploration
analyses (for example, Pareto analyses) focus our attention on individual significant items
whose collective performance governs the KPIs.
203
A broad category of methods of analysis of failure and maintenance data. The analyses target ways to
improve current proactive maintenance policies on significant items in order to improve reliability and/or
lower cost. See Chapter 3.
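A Pareto analysis of this kind can be sketched in a few lines. The incident log below is invented for illustration, and the function name pareto is our own; the point is only that ranking items by accumulated loss shows which few significant items account for most of a KPI shortfall.

```python
from collections import defaultdict

# Hypothetical incident log: (significant item, downtime hours).
incidents = [
    ("Crusher 1", 40.0),
    ("Conveyor 3", 5.0),
    ("Crusher 1", 35.0),
    ("Pump 7", 15.0),
    ("Conveyor 3", 5.0),
]


def pareto(records):
    """Return (item, hours, cumulative share) sorted by descending hours."""
    totals = defaultdict(float)
    for item, hours in records:
        totals[item] += hours
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    result, running = [], 0.0
    for item, hours in ranked:
        running += hours
        result.append((item, hours, running / grand_total))
    return result


ranking = pareto(incidents)
for item, hours, share in ranking:
    print(f"{item:12s} {hours:6.1f} h  cumulative {share:.0%}")
```

In this invented log, one item accounts for three quarters of the downtime, which is the kind of signal a summary KPI alone would not reveal.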
The foregoing implies that our maintenance management system (CMMS) must embody
reliability-centered information (as outlined in Chapters 1, 2, and 3). Specifically, for
each significant item, the five reliability-centered knowledge elements:
1. “What function was lost or compromised?”,
2. “How (full, partial, potential, functional failure)?”,
3. “Why?”,
4. “What happened?”, and
5. “How did it matter?”
will populate the database upon which the analyses operate. Furthermore, using our
system, we establish the relationship between the significant consequences of failure
(knowledge element 5) and the KPIs that achieve the corporate vision. In practical terms,
we use the performance management system to classify each incident (maintenance work
order, or production log item) involving downtime, speed loss, or quality loss, as one of
11 to 19 of Table 18-1. Additionally, we document the five RCM knowledge elements
that characterize each incident (see Figure 18-4).
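As a rough illustration of what such an incident record might hold, consider the sketch below. The class and field names are hypothetical (they are not taken from any CMMS or from the EWOP), and the sample values are invented; only the five questions and the three loss categories come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class LossType(Enum):
    DOWNTIME = "downtime"
    SPEED = "speed"
    QUALITY = "quality"


@dataclass
class Incident:
    """One work order or production log item with its five knowledge elements."""
    significant_item: str
    loss_type: LossType   # loss category the incident counts against
    function_lost: str    # 1. What function was lost or compromised?
    failure_type: str     # 2. How (full, partial, potential, functional failure)?
    cause: str            # 3. Why?
    effects: str          # 4. What happened?
    consequences: str     # 5. How did it matter?


example = Incident(
    significant_item="Crusher 1",
    loss_type=LossType.DOWNTIME,
    function_lost="Reduce ore to the required size",
    failure_type="functional",
    cause="Drive belt broken",
    effects="Crusher stopped; belt replaced",
    consequences="Four hours of lost production",
)
print(example.loss_type.value, "-", example.cause)
```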
Physical Asset Management Effectiveness Indicators (KPIs)204
Effectiveness KPIs classify productivity losses as: Downtime, Speed, and Quality
losses.
Table 18-1
Theoretical production time 1
    Valuable operating time 8
    Losses
        Quality losses:   MM, P, E
        Speed losses:     MM, P, E
        Downtime losses:  MM, P, E
Two effectiveness KPI models

1. Production Economy
2. Dupont Analysis
Table 18-2 Model 1 Production Economy
Time                           Losses
Available production time 2    External losses 3
Gross operating time 4         Downtime losses 5
Net operating time 6           Speed losses 7
Valuable operating time 8      Quality losses 9
Technical losses 10: Quality 11, Speed 12, Downtime 13
    Quality 11:   MM 14, P 15
    Speed 12:     MM 16, P 17
    Downtime 13:  MM 18, P 19
MM=machine malfunctioning, P=process
External losses 3
    Unplanned:
    1. Shortage of personnel
    2. Shortage of materials (quantity, quality)
    3. Environmental deals
    Planned:
    1. Modification, major mtce
    2. Limited need
    3. Social (policy not to produce weekends, holidays, etc.)
204
Bert Mijten, Real-Time Production Intelligence, ABB Review, Feb 2004
Table 18-3 Model 2 Dupont (“planned production time” or “six big losses”) model
Time                                                     Losses
Planned production time = Available production time 2   Planned losses 3
Gross operating time 4                                   Downtime losses 5
Net operating time 6                                     Speed losses 7
Valuable operating time 8                                Quality losses 9
The six big losses:
    Downtime losses 5:  Setup and adjustment (Big loss 5), Equipment failure (Big loss 6)
    Speed losses 7:     Reduced speed (Big loss 3), Idling and minor stops (Big loss 4)
    Quality losses 9:   Reduced yield from startup (Big loss 1), Defects in process (Big loss 2)
Planned losses 3:
    - Authorized breaks (pauses) during the working day
    - Not working during weekends
    - Scheduled stoppages for product changes
    - Modifications and improvements to equipment (work financed under investments)
    - Decreased production time due to lack of demand for the product
    - Planned maintenance work (inspections, preventive maintenance, improvements)
    - Problems in production planning
    - Saturation of machines (upstream and downstream)
    - Authorized shop floor meetings
    - Classroom or ‘on the job’ training sessions for operators
    - External power cuts
    - Lack of raw material
Choosing between model 1 and model 2

Of the two models (Production economy vs. Dupont analysis) Model 1 is more generally
applicable. It is easier to allocate incidents to their causes (MM, P, or E) than to the “six
big losses” of Table 18-3.
External losses are losses that cannot be altered by the production or maintenance team.
Planned down time losses are down time losses that were planned. Note that planned
down time losses (of Model 2) are specifically down time losses, whereas external losses
(of Model 1), can be speed, quality, and downtime losses. For instance, speed losses
because of environmental deals are external losses but are not planned down time losses.
Similarly, quality losses that are caused by the raw materials are external losses but are
not down time. Hence Model 1 discriminates more easily between losses controllable by
maintenance (and operations) and those that are outside of its control, than does Model 2.
Furthermore external losses are not always planned. For instance, an external power cut
or lack of raw materials is an external loss, but is not a planned down time loss. Hence,
‘available production time’ is different in the two models. Therefore the KPIs calculated
in Table 18-4 will have different values depending on whether Model 1 or Model 2 is
used. However, by leveraging the next generation of management software, we may, if
required, convert the five RCM knowledge elements associated with each incident (from
Model 1) into the six big losses (of Model 2).
Table 18-4 Productivity KPIs
Planning factor, PF: 2/1
Availability factor, A: 4/2
Performance factor, P: 6/4. In terms of speed losses,
    P = No. of parts produced / (Gross oper. time / Theoretical cycle time)
Quality factor, Q: 8/6 = Approved product / Total produced, or from a table of
    Quality level (1, 2, 3), Quality spec, and Loss factor
Valuable operating time: 8 = no. approved / reference throughput
OEE: 8/2 = A × P × Q
Total OEE: 8/1 (= OEE × PF)
Theoretical cycle time: Time units/part = 1 / reference throughput. Determine by a
    neutral instance. Consult both production and maintenance. Do not include
    external losses (imposed by legal/hygiene).
Table 18-5 Example: (using the Production Economy model definitions)
Description                                                      Item   Value
January to September                                             1      273 days = 6552 hours
Weekends 78 days, holidays 7 days, vacation 11 days = 96 days           273-96=177
No production on night shift                                            177 x 2 x 8 = 2816 hours
Breaks 1 hour/day                                                       2816-177=2640
Lack of personnel/raw material estimated at 5%                   2      .95x2640=2508
Downtime (unplanned) recorded was 551 hours                      5      551
There are four production lines whose reference throughput and approved product for the
period under study are given in Table 18-6.
Table 18-6
          Reference throughput   No. approved   Valuable operating time 8
          (kg/h)                 (kg)           = no. approved / reference throughput (hrs)
Line 1    1500                   720000         480
Line 2    750                    334000         445
Line 3    900                    160000         178
Line 4    680                    36000          53
Total valuable operating time                   1156
OEE KPI results
OEE            8/2              1156/2508
PF             2/1              2508/6552
Total OEE      8/1              1156/6552
Availability   4/2 = (2-5)/2    (2508-551)/2508
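The ratios above can be evaluated directly. The sketch below takes the time totals from Table 18-5 and the line data from Table 18-6 as given; the variable names are our own, and the per-line rounding of valuable operating time is an assumption made to match the whole-hour figures in Table 18-6.

```python
# Time totals from Table 18-5 (hours); item numbers as in Tables 18-2 and 18-4.
theoretical_time = 6552.0   # item 1
available_time = 2508.0     # item 2
downtime = 551.0            # item 5

# Table 18-6: line -> (reference throughput in kg/h, approved product in kg)
lines = {
    "Line 1": (1500.0, 720000.0),
    "Line 2": (750.0, 334000.0),
    "Line 3": (900.0, 160000.0),
    "Line 4": (680.0, 36000.0),
}

# Item 8: valuable operating time = no. approved / reference throughput,
# rounded to whole hours per line (assumption, to match Table 18-6).
valuable_time = sum(round(approved / rate) for rate, approved in lines.values())

gross_time = available_time - downtime                # item 4 = 2 - 5
planning_factor = available_time / theoretical_time   # PF = 2/1
availability = gross_time / available_time            # A = 4/2
oee = valuable_time / available_time                  # OEE = 8/2
total_oee = valuable_time / theoretical_time          # Total OEE = 8/1

print(f"Valuable operating time: {valuable_time} h")
print(f"PF = {planning_factor:.3f}, A = {availability:.3f}")
print(f"OEE = {oee:.3f}, Total OEE = {total_oee:.3f}")
```

With these inputs the OEE comes to roughly 46%, the planning factor to roughly 38%, the total OEE to roughly 18%, and the availability to roughly 78%.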
Drilling down from the KPIs

Mastery of the processes 1, 2, 3, and 4 of Figure 18-3 poses the greatest challenge
to maintenance performance management. The next generation of maintenance
performance management software will dissect every KPI into its constituent incidents
and knowledge elements (as in Figure 18-4).
Figure 18-4 A Performance Management system drills down from the KPI (for example, Quality
Loss) to invoke analysis procedures that guide the physical asset manager to continuous policy
improvement
Figure 18-4 illustrates that historical data (contained in plant systems) fuel reliability
analyses such as Pareto, age-reliability relationships, and optimal CBM decision
graphs205. Those methodologies steer us towards improved maintenance policies. The
CMMS, the control system historian, CBM databases, and other plant systems feed
information to the performance management system. The performance management
system, in the hands of the physical asset manager, outputs continually improving
physical asset management policies. Today, the maintenance world hovers at the
threshold of bridging two remaining gaps that impede “excellence” in asset performance
management. They are:
1. CMMS workorders do not yet record reliability-centered knowledge
(Chapters 1-3).
2. The RCM knowledge base is not yet fully integrated with the CMMS,
process historian, and CBM databases (Chapter 15).
With these final capabilities in hand, we may anticipate rewarding times ahead for
physical asset management. Nevertheless, there remains the question of how to actually
begin the journey to OEE improvement at lowest cost in each particular enterprise.
205
These may be called age-reliability-significant factor relationships
How to start
We set about the task of identifying the significant items whose failure to perform as
required impedes the achievement of corporate vision (both on the balance sheet and with
regard to our social responsibilities of custodianship). They may be expressed in a grid
similar to the following:
Table 18-7
Significant   Problem                      Current          Target           What’s it
Item                                       performance      performance      worth206?
System 1      Environmental excursions     X lb per month   Y lb per month   $Z1
System 2      Hi cost of operation         $X per ton       $Y per ton       $Z2
              and/or maintenance
System 3      Low reliability              X mtbf           Y mtbf           $Z3
System 4      Low availability             X%               Y%               $Z4
System 5      Quality/yield                X%               Y%               $Z5
System 6      Poor safety record           X incidents/mo   Y incidents/mo   $Z6
And so on …
Upon extracting from the grid the projects that deserve immediate attention, proceed to
elaborate a practical schedule for RCM training and for performing the RCM analyses
and the new information gathering processes (of Chapters 1, 2, and 3). The RCM schedule
will depend on the practically achievable rate at which the RCM analyses may
proceed, as determined by resource availability (trained RCM analysts and facilitators).
The RCM analyses, as they proceed, will generate specific requirements and ROI
estimates for new maintenance tasks and the redesign of significant items, systems, and
operating procedures. The KPI and age exploration results will monitor, guide, and
motivate the continuing process of improvement.
206
to achieve the target.
Chapter 19. Appendices
Appendix 1. EWOP details
Used components and components in suspended animation
A common situation is that of a hydraulic cylinder on a truck suspension system. A
typical scenario follows. The mechanic removes the cylinder from a truck. The truck
meter reads 12000. He checks the records for that truck and finds that the cylinder was
installed at 8000. So he puts a tag on the cylinder saying that its age is 4000. The tag does
not identify where the cylinder came from. The cylinder has no identifying marks. The
cylinder goes to the shop where there are several other cylinders waiting for a rebuild.
Except for the tags with their ages, the cylinders all look the same. Suddenly there is an
emergency. No cylinders have been rebuilt yet. So they have to install the used cylinder
on another truck whose cylinder has just failed. Hence a cylinder of age 4000 will have
been installed, say at position B on Truck 17. Will this situation compromise accurate
reliability analysis?
This same scenario could apply to hydraulic pumps, or electric motors, or valves, or
significant parts of any kind. Few organizations bother to track all significant components
as individuals. However, using the simple methods of the EWOP, we need know only
when they have been removed from their host and when they have been replaced. We
will know, too, (or can estimate from observation) the age of a used component at the
time that it was re-installed, usually as an emergency repair. The CBM lab207 introduced
the practical option of looking at the lifetimes of components from the point of view of
their host rather than from that of the components themselves.
Figure 19-1 A component in suspended animation

The essential idea, used by EXAKT, is that the age of any component can be calculated at
any time just by having recorded the host’s meter reading at time of installation. This
method handles situations such as meter resets, used components, components removed
temporarily, and countless other situations that occur in the anarchic world of
maintenance. Figure 19-1 illustrates a component in suspended animation.
207
At the University of Toronto, the birthplace of EXAKT for CBM optimization.
The working age line of an item is shown proceeding from left to right. Various meter
counts are indicated along the way (1000, 2000, 4000, and 6000). At 2000, a component
has been removed from the item. It was re-installed in the same item at 4000. The event
BSA marks the Beginning of Suspended Animation and ESA marks the Ending of
Suspended Animation for the component in question. The “gap” is the duration of the
suspended animation.
Assume that the component fails (at time T) 2000 working age units after its
reinstallation, i.e. at 6000. What is its (component) working age at failure? With B
denoting the meter reading at which the component began its life (1000 in this example),
it is given by the formula:
Age at failure = (BSA - B) + (T - ESA)
Or,
(2000 - 1000) + (6000 - 4000) = 3000
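The same calculation, written as a small function, may make the roles of the four meter readings clearer. This is only a sketch of the formula above; the function and parameter names are ours, not EXAKT's.

```python
def working_age_at_failure(b, bsa, esa, t):
    """Component working age at host meter reading t.

    b   -- host meter reading when the component began its life (B event)
    bsa -- host meter reading at the beginning of suspended animation
    esa -- host meter reading at the end of suspended animation
    t   -- host meter reading at failure
    """
    if not (b <= bsa <= esa <= t):
        raise ValueError("expected b <= bsa <= esa <= t")
    # Age accrues only while the component is installed: before BSA and after ESA.
    return (bsa - b) + (t - esa)


age = working_age_at_failure(b=1000, bsa=2000, esa=4000, t=6000)
print(age)  # 3000
```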
Figure 19-2 Installation of a used component
When the work order tells the EWOP that a used component (of say, age 4000) as in
Figure 19-2 was installed at item meter reading 5000, the EWOP generates the B event
with a working age of 1000 and an SM (start monitoring) event at 5000. The component’s
calculated age at failure is 7000. The SM event tells the model that no CBM monitoring
events (in the item) apply to this component prior to its installation at 5000.
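The back-dating of the B event can be sketched as follows. The function name is hypothetical, and the failure reading of 8000 is an assumption chosen to be consistent with the stated age at failure of 7000; it is not given in the text.

```python
def used_component_events(age_at_install, install_meter):
    """Return the (B, SM) host meter readings for a used component.

    The B event is back-dated so that (meter reading - B) equals the
    component's working age. SM marks the reading from which condition
    monitoring data apply to this component.
    """
    b = install_meter - age_at_install
    sm = install_meter
    return b, sm


b, sm = used_component_events(age_at_install=4000, install_meter=5000)
failure_meter = 8000  # assumed host reading at failure, not stated in the text
age_at_failure = failure_meter - b
print(b, sm, age_at_failure)  # 1000 5000 7000
```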
The EWOP’s Impact on the Work Process
Clearly, the EWOP implies a deep change in the thinking process related to the
completion of work orders and to the management of historical maintenance records in
general. The EWOP will impact the maintenance work order process both in the short
and long terms. In the short term, we recommend that the EWOP be used for specific
analysis projects by one or more reliability analysts or maintenance engineers. This will
introduce the EWOP gradually, prior to general acceptance in the global work order
process.
The short term process

EWOP Version 1.4 makes it convenient to endow any CMMS with a living RCM
process.
Previous versions of the EWOP required all 16 data elements to be present in the CMMS
work order structure. They also required that the CMMS have the ability to create ad hoc
work orders on demand, whenever a work order involved more than one unique
significant item-function-failure-cause. EWOP 1.4 does not require restructuring of the
existing CMMS database. The user (usually an engineer) may begin applying reliability-
centered knowledge in the CMMS immediately. All required fields as well as pseudo-
workorders are handled entirely by the EWOP using “option 4”.208
A pseudo-work order is a virtual work order that is embedded in the long text field of a
parent work order. This is required where technicians are not permitted, by the current
CMMS rules, to create ad hoc work orders on demand. If they have a situation of
multiple unique item-function-failure-causes to report - no problem. They can create as
many pseudo work orders as they need. They do this by adding any significant unique
item-function-failure-causes that they wish to include, in the CMMS long text field using
EWOP's structured free text format. The method is illustrated in the "Additional" field of
Figure 19-3.
Figure 19-3 Work order form. Many of the 16 data elements are in the text field "Additional"
The CMMS field “Additional” is a long text field (sql_longvarchar), sometimes called a
"memo" field. In this field users will enter all the additional information that the EWOP
needs. Using the native CMMS fields and the additional information in the long text
field, the EWOP will generate the necessary Events table and the RCM records.
Furthermore, it will update the long text field of the work order (including its pseudo-
work orders embedded in the text field). It will parse the free text and insert, at the
appropriate places, any RCM record references that were newly generated by the
workorder (or pseudo-work order) via the EWOP.
208
To use Option 4 of EWOP, edit the ewop.cfg file and change “,1” to “,4”
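To make the idea of pseudo-work orders in a long text field concrete, here is a minimal parsing sketch. It is not the EWOP's parser. The "~~" separator and the "RCMREF :" label appear in the Option 4 exercise later in this appendix; the "key : value" line layout and the ITEM and CAUSE keys are assumptions made purely for illustration.

```python
def parse_additional(text):
    """Split a long text field into pseudo-work orders (dicts of key -> value)."""
    records = []
    for chunk in text.split("~~"):
        fields = {}
        for line in chunk.splitlines():
            if ":" not in line:
                continue  # ignore free text that is not a "key : value" pair
            key, _, value = line.partition(":")
            fields[key.strip().upper()] = value.strip()
        if fields:
            records.append(fields)
    return records


# Hypothetical content of an "Additional" field holding two pseudo-work orders.
sample = """ITEM : Crusher 1
CAUSE : Drive belt broken
RCMREF :
~~
ITEM : Crusher 1
CAUSE : Bearing seized
RCMREF : 12
"""

orders = parse_additional(sample)
# Pseudo-work orders with an empty RCMREF still need an RCM record reference.
needs_rcm_record = [o for o in orders if not o.get("RCMREF")]
print(len(orders), len(needs_rcm_record))
```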
Eventually, once maintenance engineers and managers recognize that the EWOP method
of integrating RCM thinking with the work order process will return continuing benefits,
they will, no doubt, request a reconfiguration of the CMMS, as described in the section
“The long term process” below. In the short term, the EWOP will empower maintenance
analysts and engineers to conduct specific reliability projects. A suggested list of
activities follows:
1. The reliability analyst/engineer retroactively updates CMMS historical records
that are relevant to his current analysis project by using the EWOP 16 knowledge
element structure.
2. The analyst runs EWOP.
3. EWOP inserts records into the Events and (as required) into the RCM tables.
4. A database autonumber incrementing primary key called "RCMREF" keeps track
of new records inserted into the RCM table.
5. The analyst quality checks all new records added to the RCM table.
6. Subsequent runs of EWOP will not duplicate those RCM table records because
the RCM reference numbers will have been added in an automatic update of the
work orders that generated them.
7. The analyst may decide to delete or update an RCM record and make any
corresponding corrections in the CMMS.
8. The RCM table persists as a permanent, growing knowledge base.
9. Using the Events table, the analyst proceeds to apply various software
procedures (e.g. EXAKT, Weibull, Pareto, Scatter, Jack-knife, etc.).
10. After a period of time, the analyst may wish to update his analysis or perform a
new analysis.
11. He edits additional CMMS records and runs EWOP on all or selected CMMS
records. Once again any new records will be added, automatically, to the RCM
table as required.
12. An entirely new Events table will be generated for the new analysis.
13. Once again the analyst checks the RCM table for errors. He updates related
CMMS work order records with any new RCMREF numbers.
14. Eventually the knowledge base will grow to a size where it will be attractive for
use in the general maintenance information process. See next section "The long
term process".
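Steps 3, 4, and 6 describe a process that is safe to re-run: a work order that already carries an RCMREF does not produce a second RCM record. The sketch below illustrates that behaviour with an in-memory SQLite database. It is not the EWOP implementation; the table layout, column names, and function name are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE rcm (rcmref INTEGER PRIMARY KEY AUTOINCREMENT, cause TEXT);
    CREATE TABLE workorders (id INTEGER PRIMARY KEY, cause TEXT, rcmref INTEGER);
    INSERT INTO workorders (cause) VALUES ('Drive belt broken'), ('Bearing seized');
""")


def generate_rcm_records(conn):
    """Create an RCM record for each work order that has no RCMREF yet."""
    pending = conn.execute(
        "SELECT id, cause FROM workorders WHERE rcmref IS NULL"
    ).fetchall()
    for wo_id, cause in pending:
        cur = conn.execute("INSERT INTO rcm (cause) VALUES (?)", (cause,))
        # Write the new autonumber back so that later runs skip this work order.
        conn.execute(
            "UPDATE workorders SET rcmref = ? WHERE id = ?", (cur.lastrowid, wo_id)
        )
    conn.commit()
    return len(pending)


first = generate_rcm_records(conn)   # both work orders receive a new RCM record
second = generate_rcm_records(conn)  # nothing is pending, so nothing is duplicated
rcm_count = conn.execute("SELECT COUNT(*) FROM rcm").fetchone()[0]
print(first, second, rcm_count)
```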
The long term process
The EWOP approach will become integrated into the everyday work process. This will
happen once the value of the growing reliability knowledge base becomes apparent to
users (and their CMMS vendors). At this point several new CMMS features will have
been introduced (noted below in parentheses) that adapt to the EWOP methodology. The
work process will henceforth consist of the following steps.
1. Maintenance technician completes a job, and proceeds to update the CMMS by
completing the work order form (similar to Figure 19-3).
2. He (or she) recognizes that the work order actually refers to more than one unique
Item-function-failure-cause. He generates as many "sub-workorders" (an added
CMMS feature) as required to accommodate the structured information that he
needs to record. A discussion of sub workorders is given in Chapter 1 of
"Reliability-centered Knowledge".
3. In order to complete each sub-work order, he displays (an added CMMS feature)
the RCM table. He attempts to locate an RCM record that accurately describes a
situation similar to his current "sub-work order".
4. If he is successful, he relates the sub-work order to the RCM record (manually by
entering the RCMREF auto number into the sub-work order record, or
automatically by virtue of a new CMMS feature). He enters the following
reliability data into the sub-work order: dateback, dateout, workingageback,
workingageout, failuretype. (The sub work order is now an instance of a record in
the knowledge base.)
5. The technician may wish to edit the RCM record at this time to include any new
knowledge discovered during the execution of the work order. Usually, he will
append to, or modify, the Effects field. He may update the Consequences field.
(For example, a failure mode previously thought to be evident, was in fact
hidden.)
6. If no RCMREF is found the information in the work order record will be added
automatically (new CMMS feature) to the RCM table, and the RCM table auto
number will be entered automatically (new CMMS feature) into the work order
record.
7. The CMMS will allow supervisors and reliability specialists to audit and approve
changes to the RCM table made by a technician. (new CMMS quality auditing
feature).
Using the EWOP prototype software
1. Unzip the ewop.zip file into a folder.
2. Set up an ODBC DSN as follows: Control Panel, Administrative tools, Data
Sources (ODBC), System DSN, Add, Microsoft Access Driver (*.mdb), Finish,
Data source name: EWOP, Select, Navigate to ewop_MES.mdb, OK.
3. Install the .Net Framework if it is not already installed:
http://www.asp.net/download-1.1.aspx
4. Hit EWOP.exe in Windows Explorer.
5. Hit the Work orders, Events, RCM, and Items buttons and examine the records of
the various tables.
6. Hit the EWOP button. Hit <enter> at the prompt
7. Hit the Events and RCM buttons. Examine the Events records and new RCM
records.
8. Hit the Work orders button and note the RCMREF field filled with the
appropriate autonumber of the RCM record.
9. Hit Initialize DB to re-initialize the database
10. Verify that the Events table has been emptied and the RCM table reduced to 4
records, and the RCMREF values in many of the Work orders have been
removed.
11. Hit the EWOP button again.
12. At the prompt type CRU% and hit <enter>.
13. Go back to the Events table and verify that only the work orders for the crushers
have been processed. Hit the Initialize DB button to prepare for the next exercise
(Option 4).
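The CRU% entry in step 12 behaves like a SQL LIKE pattern, where % matches any run of characters. The sketch below illustrates that kind of filter using Python's built-in sqlite3 module rather than the Access database of the prototype. The table name, column names and item codes are hypothetical, and treating an empty entry as "all work orders" is an assumption based on steps 6 and 12.

```python
import sqlite3

# In-memory stand-in for the work order table (hypothetical schema and data)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workorders (wo_id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany(
    "INSERT INTO workorders (item) VALUES (?)",
    [("CRUSHER-01",), ("CONVEYOR-07",), ("CRUSHER-02",)],
)


def select_work_orders(connection, pattern=""):
    """Return work orders whose item matches a LIKE pattern.

    An empty pattern (just hitting <enter>) is treated as "match everything".
    """
    pattern = pattern or "%"
    cursor = connection.execute(
        "SELECT wo_id, item FROM workorders WHERE item LIKE ? ORDER BY wo_id",
        (pattern,),
    )
    return cursor.fetchall()


crusher_orders = select_work_orders(conn, "CRU%")   # crushers only
all_orders = select_work_orders(conn)               # every work order
```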
Option 4 – short term process
1. Edit ewop.cfg file. Change “Option : ,1” to “Option : ,4” (Save it.)
2. Hit the Work orders Option 4, Events, RCM, and Items buttons and examine the
records of the various tables, especially the long text field “Additional”. Notice
that there are actually three child or sub records (separated by ~~) within that text
field. Note that most RCMREF : values in the text are empty.
3. Hit the EWOP button. Hit <enter> at the prompt.
4. Hit the Events and RCM buttons. Examine the Events records and new RCM
records.
5. Hit the Work orders Option 4 button and note the RCMREF field in the “pseudo”
workorders of the Additional field filled with the appropriate autonumber of the
RCM record.
6. Hit Initialize DB to re-initialize the database
7. Verify that the Events table has been emptied and the RCM table reduced to 4
records, and the RCMREF values in many of the Additional fields of the Option 4
work orders have been removed.
8. Hit the EWOP button again.
9. At the prompt type CRU% and hit <enter>.
10. Go back to the Events table and verify that only the Option 4 work orders for the
crushers have been processed.
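The idea of child records packed into the “Additional” text field can be illustrated with a short parser. This is a sketch under assumptions: only the ~~ separator and the "RCMREF :" label come from the exercise above. The one "name : value" pair per line layout, the other field names and the sample text are hypothetical.

```python
def parse_additional(text):
    """Split an "Additional" text field into sub records.

    Sub records are separated by "~~"; each line inside a sub record is
    assumed to hold one "name : value" pair.
    """
    records = []
    for chunk in text.split("~~"):
        fields = {}
        for line in chunk.strip().splitlines():
            name, separator, value = line.partition(":")
            if separator:                      # skip lines without a colon
                fields[name.strip()] = value.strip()
        if fields:
            records.append(fields)
    return records


# Hypothetical field content with three sub records, two without an RCMREF
sample = (
    "RCMREF : \nCause : Worn liner\n"
    "~~\n"
    "RCMREF : 3\nCause : Broken belt\n"
    "~~\n"
    "RCMREF : \nCause : Seized bearing"
)
sub_records = parse_additional(sample)
unrelated = [r for r in sub_records if r["RCMREF"] == ""]
```

The sub records with an empty RCMREF are the ones still waiting to be related to (or added to) the RCM table.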
The onion skins of CBM

By definition, CBM is an inspection of one type or another. The results of an inspection
should reveal knowledge of a deteriorating component or failure mode. Sometimes, we
must perform some sort of signal processing and interpretation process to arrive at a
decision as to whether to:
A. intervene immediately or within some time period, or to
B. continue operating for another observation interval.
Often, however, we cannot make such a clear decision from the condition monitoring
data. We lack an adequate signal processing method and/or decision model with which
to discriminate patterns in the data that relate unambiguously to a targeted failure mode.
We know something is wrong, but we don’t know which, of a number of possible failure
modes, is deteriorating. We don’t know which part of the equipment is failing.
We may, then, perform an exploratory inspection. We escalate from a less intrusive,
purely monitoring, type of inspection to one that requires a more intrusive activity. Oil
analysis of an engine’s crankcase oil may indicate an increasing trend of some wear
metal, such as iron. Concerned, we perform a compression check, in the hope that more
information will narrow down the list of possible failure modes. Further escalations in
inspection intensity might include a pressure/ignition trace, and eventually, a partial or
complete dismantling of the engine.
Each progressively intrusive and costly layer of CBM deepens the process of discovery.
If we find during the compression check that there is indeed a ring sealing problem, we
may learn from this experience. We could attempt to find patterns in the (relatively
inexpensive) oil analysis data, that relate to poor compression. If, through the modeling
process, we find such a relationship, we could use it, thereafter, as a decision model (or
rule) to tell us when it is advisable to perform a compression check.
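As a toy illustration of such a rule, the sketch below picks the iron level that best separates past oil samples followed by a poor compression check from those followed by a good one. It is a deliberately simple stand-in for a real modeling process (it is not the EXAKT method), and the readings are invented purely for the example.

```python
def best_threshold(readings):
    """Find the iron level (ppm) that best separates good from poor compression.

    readings: list of (iron_ppm, poor_compression) pairs from past inspections.
    Returns (threshold, misclassified_count) for the rule
    "order a compression check when iron_ppm >= threshold".
    """
    best = None
    for threshold in sorted({ppm for ppm, _ in readings}):
        # Count samples where the rule disagrees with what was actually found
        errors = sum((ppm >= threshold) != poor for ppm, poor in readings)
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best


# Invented history: (iron ppm in oil sample, compression check found a problem?)
history = [
    (12, False), (15, False), (18, False), (22, False),
    (27, True), (31, True), (35, True),
]
threshold, errors = best_threshold(history)


def compression_check_advised(iron_ppm):
    """Decision rule derived from the history above."""
    return iron_ppm >= threshold
```

With this invented history the rule separates the two groups without error. Real data are rarely so clean, which is why a statistical model is normally preferred to a single threshold.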
The consequences of failure are still minor – the discovery of poor compression is
considered a “potential failure”. It would eventually deteriorate and cause a functional
failure whose consequences would indeed be operational, safety related, or economically
important. The point to note is that the development of the decision model (a rule for
issuing a work order for a compression check) did not require us to have experienced a
functional failure. We prefer, naturally, to model potential failures rather than to build
our decision models upon the experience of functional failures that have dire
consequences.
The EWOP encourages the development of decision models that warn of potential
failures. Technicians, in the course of carrying out various preventive tasks, using EWOP
methods, will document their observations in the systematic RCM form. By analysing the
resulting knowledge base, particularly the effects, the events leading up to a failure cause,
we will without doubt develop better inspection and decision techniques with fewer
functional failures.
The EWOP and EXAKT
Appendix 5 The EWOP and EXAKT
EXAKT, as does any RA methodology, requires accurate historical data. Without prior
guidelines, such as those proposed by the EWOP, good data has been difficult to attain.
The EWOP methodology teaches the principles of reliability-centered knowledge.
Analysts can begin using EWOP methods immediately for specific reliability analysis
projects.
Figure 17 EWOP main menu
The EWOP is, in one sense, a reversal of the way RA projects have been done in the past.
Traditionally, we have extracted sets of records from the CMMS. Then we embarked on a
process of "data cleaning" (using other software packages such as EXAKT, Excel,
MatLab, and others), to deal with data anomalies (usually missing or undocumented
events).
The EWOP, on the other hand, focuses considerable energy on the data source before
attempting data extraction. The reliability analyst supplements the information on each
work order related to the item under analysis. He does this on site, with the assistance of
those who participated in the maintenance events that concern the item. The enriched
information, presented in the consistent 16-data element format, enables the EWOP to
extract records from the CMMS directly into an Events table.
The EWOP brings substantial advantages to an EXAKT (or any RA) project. By applying
thorough work order documentation methods, within the existing CMMS, an analyst:
1. makes a permanent improvement in the CMMS itself,
2. builds a persistent knowledge base using the fundamental RCM elements,
3. sets up tangible methods for effective work order documentation that may be
emulated and adopted by the maintenance professional as part of regular
maintenance practices (teaching by example),
4. leverages the powerful database and display tools of the CMMS to help clean the
data before extraction, and
5. updates decision models easily, by repeating the EWOP extraction procedure on
new information in the source database.
Appendix 2.
The role of the RCM Facilitator - Five Skill Areas:

1. Administration 2. Animation 3. Clarity 4. Time Management 5. Focus
The quality (hence the success) of each RCM analysis will depend heavily on how well
the facilitator has mastered and executes his skills209. Those skills are outlined in Table
19-1: RCM facilitator’s checklist. The facilitator’s skill and vigilance will prevent the
analysis from being dangerously superficial, or, conversely, from becoming bogged down
and stalled in unnecessary detail. The novice facilitator should refer often to this
scorecard throughout the RCM project, and continually self-evaluate h(is)(er)
performance, (initially under the watchful eye of an experienced RCM practitioner) with
respect to each of the items in Table 19-1.
Table 19-1: RCM facilitator’s checklist

1.0 Administration Score
Shortly after the RCM analysis has been completed, assemble the 12345
worksheets and supporting documentation (drawings, photographs) into
a coherent, readable dossier for review and authorization by a
designated auditor.
In the planning phase, before an RCM analysis begins, ensure that 12345
potentially useful documentation (drawings, schematics, manuals,
standard operating procedures, maintenance and operational histories,
etc) are readily accessible for reference during the sessions. Discuss the
general RCM objectives, beforehand, with resource people210, outside
the team, so they may respond quickly if called upon to provide
clarification or information when required during the course of the
analysis.
Assist in the selection of the appropriately skilled RCM team members. 12345
Assist in the initial decomposition of the asset/plant into manageable 12345
significant items for individual RCM analyses. Position the item’s
boundaries so that it can be analyzed in 6 to 14 three-hour sessions. Ensure
that an item has not been defined at too low a level of indenture where
failure modes would be difficult to relate to the failure of the equipment
as a whole. During the analysis, decide how the failure modes of a
subsystem should be handled – whether to 1) break out the subsystem
for more convenient, separate, analysis later, or 2) consider each of the
subsystem's failure modes as part of the main analysis, or 3) consider
the subsystem's failure modes as a single failure mode, or 4) consider
(as part of the main analysis) each of the subsystem's dominant failure
mode(s) singly and the other failure modes lumped under the title
“others”.
209
As well as on the depth of the collective knowledge and experience of the RCM team members.
210
For example, process specialists, OEM engineers, safety experts, etc.
Assist in the development of the item’s operating context 12345
Assist in the scheduling of the RCM sessions 12345
Report regularly on progress to the RCM sponsor. Call upon h(im)(er) 12345
for help in resolving technical, organizational, or human issues as they
arise
Assist in the preparation of the presentation (by a team member) to 12345

management at the end of the analysis
Provide team members access to the evolving RCM worksheet as the 12345
analysis unfolds from session to session.
2.0 Animation Score
Recognize and be sensitive to each personality type. Help each team 12345
member contribute fully to the RCM process by using one or more of
these techniques: Gently discourage the extrovert from monopolizing
the floor by (following a tirade) asking a question to another team
member. ("George, what do you think about that?") Encourage the
introvert by asking h(im)(er) questions and by assigning short research
tasks between sessions on unclear issues. (calling a vendor, checking a
log sheet, etc). Ask h(im)(er) to report on h(is)(er) findings at the
beginning of the next meeting. Be careful not to harass h(im)(er).
Recognize when true consensus is achieved. Never permit a vote. Keep 12345
in mind that a lone dissenter may be right. Record h(is)(er) position and
ask h(im)(er) to “agree to disagree” until further elucidating information
comes along.
Sustain the morale of the group by summarizing progress at the 12345

beginning of each session, and by always being positive about the
process. Express praise and gratitude when someone makes a
noteworthy contribution to the analysis.
At the beginning of the first session of the RCM analysis, help the team 12345
set and agree upon the ground rules (smoking, punctuality, etc)
Recognize when the team simply “does not know” (about some aspect 12345
of the asset) by being alert to statements beginning with "I think ..." or "I
believe ...". Assign short research tasks to team members to find out.
Remind participants of the objectives and importance of the analysis and 12345
that they have been chosen to participate because of their knowledge
and experience.
With an inexperienced team be alert to misunderstandings of the process 12345
and the meanings of questions. Use timeouts to clarify points of RCM
procedure when required. Common misunderstandings are a) confusing
failed states and failure modes, b) confusing average life (MTBF), useful
life, and Bn life, c) not distinguishing potential failure from functional
failure, d) not recognizing the difference between a failure finding task and
an on-condition task.
Be alert to answering the wrong question. This could occur at any time 12345
throughout the RCM process. An example is the raising of an
operational consequence when the process has moved onto the safety
and environmental branch of the decision diagram.
Safeguard the self-esteem of each team member. Recognize that “loss 12345
of face” may occur by persons formerly considered knowledgeable.
Soften the blow by emphasizing (in timeouts and anecdotes) that RCM
is, above all else, a learning forum to bridge the discontinuities in the
knowledge of individuals by gaining synergy from the collective
perspectives of the team.
3.0 Clarity Score
Input the answers to the RCM questions into the RCM worksheet. 12345
While entering the answers, retain team members’ wording as much as 12345
possible. Occasionally, when necessary suggest ways of expressing the
answers more succinctly in written form. Revise and correct the text
outside the meeting without altering what was said and meant during the
session. When in doubt obtain approval from the team for extensive
word-smithing. Avoid jargon. That is, ensure that the technical terms
used on the worksheet will be understood by everyone on the site.
4.0 Time Management Score
Following an RCM decision to modify an asset or operating procedure, 12345

resist the (almost overwhelming) enticement to redesign the asset (or
operating procedure) during the RCM meeting. Allow the team to go
only so far as to elaborate the redesign requirement. Do NOT, under
any circumstances embark on a design brainstorming process during the
session. Schedule an offline session for those persons eager to perform
the redesign.
Remind the team of the time allotted to the current analysis and the rate 12345
of progress necessary to attain that goal.
Keep the pace of analysis (all 7 steps) at an average rate of 6 failure 12345
modes per hour.
Indicate that about 1/3 of the time will be dedicated to defining the 12345
functions, 1/3 on failures, modes, and effects (FMEA), and 1/3 on
consequences, decisions, and task definition and assignment.
5.0 Focus on the process Score
Ask the RCM questions. Never answer them. (If the team may have 12345
made a technical error or omission rephrase the questions to probe in a
particular direction or ask that a particular point be checked between
sessions.)
Call a timeout when necessary to remind or explain pertinent RCM 12345

process concepts.
Elaborate the asset's operating context at the beginning of the analysis. 12345
Keep it uppermost in the team’s mind throughout the analysis.
Ensure that the 7 RCM questions are asked completely, in the manner, 12345
and the order prescribed by SAE JA1011211.
Resist the tendency to skip questions, or parts of questions by taking 12345

their answers for granted. In particular ask, explicitly, each question
(page 236) along the appropriate logic branch of the decision diagram.
The RCM process must be performed rigorously. In spite of the
repetitious nature of the process do not abbreviate the questions so much
that their meaning is lost or distorted.
Pay strict attention to the following issues with respect to each of the 12345
SAE JA1011 RCM questions (5.1 to 5.7)...
5.1 What are the functions and associated desired standards of

performance of the asset in its present operating context
(functions)?
Ask the team to uncover the primary functions, the secondary functions, 12345
including all hidden functions. Afterwards invoke the PEACHES
mnemonic to double check that all functions have been listed.
211
SAE JA1011, http://www.sae.org, Title: Evaluation Criteria for Reliability-Centered
Maintenance (RCM) Processes
Direct the team to include as many quantitative performance 12345
requirements as practical in each function statement to fully describe the
users’ (owners, societal) objectives for the asset. The function statement
usually begins with “To …” or “Not to …”. Avoid the use of “and”
between two verbs.
Simplify (reduce the size of) the function list by deciding when a certain 12345
function may be more conveniently included as a failure mode of
another functional failure. For example, the function "Not to trip when
the liquid level is below 100 hectoliters" preferably should be included
as the failure mode "pump trips due to grounded electrical contact" of
the primary function "To pump x liters ... ".
Encourage the team to use code phrases to imply a hidden function (e.g. 12345
to be capable of, to be able to, …to heat to 140C in the presence of a
standby heater.)
5.2 In what ways can it fail to fulfill its functions (functional failures)?
Ensure that each quantitative performance requirement within an 12345

individual function statement is addressed. Separate partial and total loss
with respect to each requirement.
5.3 What causes each functional failure (failure modes)?
Pay particular attention to the number of failure modes to be included 12345

and to their depth of causality. The list should be tempered by the
reasonable likelihood of occurrence and by the gravity of the
consequences (always keeping the operating context in mind.) More
serious consequences would tend to lengthen the list of failure modes.
The depth (number of times to ask “why”) of causality at which to
specify a failure mode is likewise operating context sensitive. The depth
should be that at which the organization can do something about the
failure or its consequences.
5.4 What happens when each failure occurs (failure effects)?
Extract from the team the sequence of events (internally and 12345
organization-wide) that lead up to and could be touched off by the
failure mode. Also describe:
• how does the failure make itself known?
• how is safety or the environment impacted? (without mentioning the
words "safety" or "environment")
• how is production impacted? (quality, cost, customer service)
• is there any additional (secondary) damage caused by the failure?
• how long will it take and what actions must be accomplished to correct
the failure?
• How does the likelihood of this failure depend on deeper causes? Has
it happened before? Under what circumstances?
5.5 In what way does each failure matter (failure consequences)?
Carefully examine the failure effects as elaborated in 5.4 above and 12345
select one of the four possible consequences (H, S, O, N).
5.6 What should be done to predict, prevent, or mitigate the
consequences of each failure (proactive tasks and task intervals)?
For CBM tasks, explore alternative technologies, and expose the true 12345
costs of the proposed program. For all proactive tasks consider the long
run costs of the task and those of the failure consequences it is
designed to reduce or prevent.
Set the proactive task intervals. For CBM estimate P-F interval, or if 12345
applicable212, use a risk based non-deterministic approach such as
EXAKT. For TBM estimate the useful life regarding the failure mode in
question.
5.7 What should be done if a suitable proactive task cannot be found
(default actions)?
The three possible default actions: run-to-failure, failure detection, and 12345
redesign must be considered when so directed by the decision diagram.
For hidden failures, the detection interval must account for the tolerable
level of risk (probability and consequences) of a multiple failure.
Ensure that the team has considered all practical aspects of the task that 12345
has been selected. The task descriptions must contain enough detail213 to
ensure that no misunderstanding is possible when it is transcribed into
the maintenance system.
Appendix 3.
Sizing the analysis

The RCM facilitator, at the outset, makes a most important decision – to define the
boundaries of the item being analyzed. RCM can be applied at almost any level of the
asset hierarchy. However, Figure 19-4 illustrates the compromise to be considered when
selecting a level at which to define our item. At a higher level the item’s functions and
functional failures are more clearly related to the performance requirements of the
equipment as a whole – an advantage.
Figure 19-4
212
Historical event and condition monitoring data is available and the consequences of failure are serious
enough to justify the analysis effort. The EXAKT analysis should be performed off-line.
213
However, safety and intricate task details should be considered offline (with the possible participation of
safety, process, engineering, and vendor experts where needed).
Time is one of the facilitator’s prime considerations. The more failure modes that need to
be considered, the longer the analysis will take. Experience tells us that we should size
the item so that it may be analyzed in 5 to 15 three-hour sessions214. A well run
analysis averages 6 failure modes per hour. Hence a small analysis would contain about
90 failure modes while a large one would analyze about 270. These figures make it
apparent that the facilitator must carefully control the process, lest it flounder by not
achieving the analysis of the item (as defined) in the allotted time. Such occurrences
could jeopardize215 the entire RCM initiative.
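The arithmetic behind these figures is simply sessions × hours per session × failure modes per hour. A minimal sketch follows. The function names are ours, and the pace of 6 failure modes per hour is the average quoted above, not a guarantee for any particular team.

```python
import math

HOURS_PER_SESSION = 3          # three-hour sessions
FAILURE_MODES_PER_HOUR = 6     # average pace of a well run analysis


def failure_modes_covered(sessions):
    """Number of failure modes that fit into a given number of sessions."""
    return sessions * HOURS_PER_SESSION * FAILURE_MODES_PER_HOUR


def sessions_needed(failure_modes):
    """Number of whole sessions needed for an estimated failure mode count."""
    per_session = HOURS_PER_SESSION * FAILURE_MODES_PER_HOUR
    return math.ceil(failure_modes / per_session)


small = failure_modes_covered(5)     # 5 sessions  -> 90 failure modes
large = failure_modes_covered(15)    # 15 sessions -> 270 failure modes
```

Run the other way, sessions_needed gives the facilitator a quick check on whether a proposed item boundary fits within the 5 to 15 session range.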
214
Depending on the item’s complexity as reflected by the number of its reasonably likely failure modes.
215
By over-running the budgeted time and resources, and by discouraging both team members and upper
management through non-attainment of milestones.
Selecting the significant items
Figure 19-5: Selecting the significant items for analysis

Figure 19-5 depicts the initial significant item selection process. The criteria of
significance and “hidden” dictate which items need to be analyzed within the RCM
project. Prioritization of the analyses lies outside the scope of RCM.216 Whatever priority
sequence has been chosen, the analyses are scheduled and team members assigned, taking
into account operational and personnel constraints. The schedule provides a concrete set
of objectives and milestones for the RCM project.
216
The method and details of project priority are industry specific. RCM may then proceed according to the
schedule generated by whichever priority method is used. Variants of RCM (such as Turbo RCM, PMO
2000, and RCM Cost) provide structured priority systems.
Appendix 4.
Failure finding intervals for complex items (multiple failure modes and
devices)
Failure finding interval for devices with more than one failure mode.
\[ I_{ff} = \frac{2 \times M_{pf}}{M_{mf} \times \left( \frac{1}{M_{sd1}} + \frac{1}{M_{sd2}} + \frac{1}{M_{sd3}} \right)} \]
where:
Iff = failure finding interval
Mpf = reliability (mean time between failure) of the protected function
Mmf = tolerable mean time between multiple failure
Msd1 = mean time between failure due to failure mode 1 of the safety device
(Msd2 and Msd3 likewise for failure modes 2 and 3)
Failure finding interval for redundant devices (based on the linear approximation).
\[ I_{ff} = M_{sd} \times \left[ \frac{(n + 1) \, M_{pf}}{M_{mf}} \right]^{\frac{1}{n}} \]
where:
n = number of redundant devices of the same kind.
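The two interval formulas above (multiple failure modes, and n redundant devices) are easy to evaluate numerically. The sketch below is a minimal illustration, reading the garbled "1 M sd" terms as reciprocals 1/Msd. The function names and the example values are assumptions, and all times must be in the same unit (years here).

```python
def ffi_multiple_modes(m_pf, m_mf, m_sd_modes):
    """Failure finding interval for a device with several failure modes.

    m_pf: mean time between failures of the protected function
    m_mf: tolerable mean time between multiple failures
    m_sd_modes: mean time between failures of the safety device,
                one value per failure mode
    """
    combined_rate = sum(1.0 / m for m in m_sd_modes)
    return 2.0 * m_pf / (m_mf * combined_rate)


def ffi_redundant(m_sd, m_pf, m_mf, n):
    """Failure finding interval for n redundant devices (linear approximation)."""
    return m_sd * ((n + 1) * m_pf / m_mf) ** (1.0 / n)


# Illustrative values in years (not taken from any real asset)
single_mode = ffi_multiple_modes(m_pf=10, m_mf=1000, m_sd_modes=[50])
three_modes = ffi_multiple_modes(m_pf=10, m_mf=1000, m_sd_modes=[50, 100, 200])
one_device = ffi_redundant(m_sd=50, m_pf=10, m_mf=1000, n=1)
two_devices = ffi_redundant(m_sd=50, m_pf=10, m_mf=1000, n=2)
```

With a single failure mode and a single device the two formulas agree, which is a useful sanity check. Adding failure modes shortens the interval, while adding a redundant device lengthens it.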
Failure finding interval for voting systems.
\[ I_{ff} = M_{sd} \times \left[ \frac{(n - r)! \, (r + 1) \times M_{pf}}{n! \times M_{mf}} \right] \]
Voting systems are usually called k out of n systems, where:
n = number of sensors in parallel
k = number of sensors needed to activate the safety action
r = number of sensors which must be failed for the safety system to fail
so: r = n - k + 1
Optimal failure finding interval for parallel redundant devices where only cost is a
factor
\[ I_{off} = \left[ \frac{(M_{sd})^{n} \, (n + 1) \, M_{pf} \, C_{ff}}{n \times C_{mf}} \right]^{\frac{1}{n+1}} \]
where:
Ioff = optimal failure finding interval
Cff = average cost of an inspection
Cmf = average cost of a multiple failure
n = number of redundant safety devices of the same kind.
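The cost-based interval can be checked numerically. The sketch below is built on stated assumptions, not on anything given in the text: the cost rate is taken as the inspection cost per unit time plus the multiple-failure cost per unit time under the same linear approximation used for redundant devices; the protected-function term is read as Mpf; and the exponent on the bracket is taken as 1/(n + 1), which is what minimizing this cost rate yields and what keeps the result in units of time. Function names and values are illustrative.

```python
def cost_rate(interval, m_sd, m_pf, c_ff, c_mf, n):
    """Assumed long-run cost per unit time for a given inspection interval.

    Inspection cost rate: c_ff / interval.
    Multiple-failure cost rate: c_mf times the demand rate (1 / m_pf) times the
    average probability that all n devices are failed, (interval / m_sd)**n / (n + 1).
    """
    inspection = c_ff / interval
    multiple_failure = c_mf * (interval / m_sd) ** n / ((n + 1) * m_pf)
    return inspection + multiple_failure


def optimal_ffi(m_sd, m_pf, c_ff, c_mf, n):
    """Closed-form interval that minimizes cost_rate (exponent 1 / (n + 1))."""
    bracket = (m_sd ** n) * (n + 1) * m_pf * c_ff / (n * c_mf)
    return bracket ** (1.0 / (n + 1))


# Illustrative values: times in years, costs in any single currency
params = dict(m_sd=50, m_pf=10, c_ff=100, c_mf=100_000, n=1)
best = optimal_ffi(**params)

# Compare the cost at the closed-form interval with nearby intervals
cost_at_best = cost_rate(best, **params)
cost_shorter = cost_rate(0.9 * best, **params)
cost_longer = cost_rate(1.1 * best, **params)
```

For a single device (n = 1) the expression reduces to a square root, and the cost at the computed interval is lower than at intervals 10% shorter or longer.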
Appendix 5.
Truck description
1. General Description
Each car is mounted on two four-wheel trucks having a wheelbase of 2500 mm. The
trailer trucks which are fitted to each type of trailer car are un-motored and are not fitted
with a parking brake. All trucks are fitted with disc brake equipment.
Figure 19-6: Rail car truck
The method of construction: Side frames and transoms are steel fabrications and utilize
closed box sections to give lightweight structurally efficient trucks.
The primary suspension consists of rubber/steel chevrons which mount the axle box to
the truck frame. The inherent damping within the chevron assemblies avoids the
necessity for supplementary dampers in the primary suspension. The axle box also houses
a rubber bump stop, which serves to prevent direct contact between the truck frame and
the axle box under severe bounce conditions.
The secondary suspension consists of two elements which are interposed between the
truck side frames and the car bolsters. The two elements are a layer spring and an air
spring. Under normal conditions, the effective suspension stiffness is a result of the two
springs connected in series. In the event of the air spring being deflated, the car will rest
on emergency springs which are located on top of the layer springs. The cars can still be
used in service, but with a reduced quality of ride. Vertical oscillations of the car are
damped by two hydraulic shock absorbers, these being mounted on either side of the
truck between the truck side frame and the air spring top plate on the car bolsters. Lateral
oscillations are damped by a hydraulic damper which is mounted between the traction
center and the truck side frame. Lateral displacements are limited by resilient and positive
stops. Body roll is controlled by a torsion bar which is housed in the transom of each
truck and connected to the body by a suitable linkage. A leveling valve mounted on the
car controls the air pressure in the air springs and maintains a constant floor height
independent of passenger loading. The traction center is connected to the truck frame by
horizontal traction links. The ends of the traction links contain composite metal/rubber
bushes to ensure that tractive and braking forces are transferred to the car as smoothly
as possible.
2. Wheel sets
Figure 19-7: Wheel set
Wheels of BR-PB profile, of mono-block construction, are shrunk onto solid one-piece
axles which run in double roller bearing axle boxes. The wheels are specified to BS
468 class D, oil hardened and tempered. The axles are manufactured from low alloy steel
conforming to the BR specification 109A. The wheels are shrunk onto the axles and the
wheel set is balanced in accordance with BR specification 163. To effect removal of the
wheels, the hubs are drilled with two diametrically opposite oil injection holes. The
gearwheel is also fitted with an oil injection hole to assist removal. The axle ends are
suitably center drilled to allow wheel turning on a wheel lathe.
3. Axle box
Figure 19-8: Axle box

The axle box is a forged aluminum alloy body fitted with a shell liner which provides the
housing for two self aligning bearings which are directly mounted on the axle.
Machined on each side of the axle box body is a mounting to carry the primary
suspension, each chevron being retained to the mounting with two bolts. At the top of the
forging is a machined circular housing to accommodate the axle box rubber bump stop.
A sealing collar is abutted up to a shoulder on the axle, and an open cover is fitted over it.
Labyrinth grooves in the collar and cover prevent leakage of grease from the rear of the
axle box. The front of the axle box is sealed by either a front cover or the housing of a
frequency generator via an adaptor plate. The axle box is lubricated with a lithium base
grease such as Shell Alvania 3, Exxon Beacon 3, or a comparable approved grease.
4. Primary Suspension
The arrangement of the primary suspension shows the tie bar arrangement under the axle
box. The tie bar arrangement consists of a spacer tube, tie bar, locating rings and suitable
fasteners. The tie bar serves two purposes: it ensures the wheel arch structural integrity
and also allows the truck to be lifted from its wheel sets. When a complete truck is lifted,
the load of the wheel sets is supported by the tie bars via the axle boxes.
Figure 19-9: Primary suspension
Cast steel chevron holders are located in, and welded to, web plates attached to each
wheel arch. The correct space between the bump stop housing and the top of the wheel
arch is adjusted by use of shim plates fastened under the top of the wheel arches.
5. Traction Center
The tractive and braking forces are transmitted to the center pivot via the traction center.
The center pivot is bolted to the bolster stool which is riveted to the car bolster. Shims are
fitted between the bolster stool and the center pivot to ensure the interface height between
the center pivot and truck is correct.
Figure 19-10: Air spring top plates
Figure 19-11 shows the assembled arrangement of the traction center. The lateral bump
stop assemblies limit the possible lateral body movement relative to the truck. Each bump
stop assembly consists of a bonded rubber/steel bump stop and a fixed stop, such that
any lateral movement is unrestricted until the center pivot comes into contact with the
bump stop; further movement is then resisted by elastic deformation of the bump stop
until the fixed stops are met. The correct dimensions from the truck center to the rubber
stops and the fixed stops, and between the fixed stop and the truck transom. Tractive and
braking forces are transferred from the truck frame to the traction center by two traction
links (Figure 19-12). The traction links house resiliently mounted bushes in each eye, so
that the forces are transferred as smoothly as possible.
Forces are transferred from the traction center via the center pivot to the car body. The
center pivot pin is retained in the traction center by a rubber compound spring. Lateral
movements of the traction center relative to the transom are hydraulically damped by a shock
absorber which is connected to one side of the truck frame and to the traction center.
5. Secondary Suspension
The secondary suspension consists of a series of elements mounted on each truck side
frame (see Figure 19-6). The stiffness of the suspension in normal service conditions is a
result of an air spring and a layer spring acting in series.
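For two springs acting in series, the combined stiffness follows from adding their compliances. Writing k_air and k_layer for the stiffnesses of the air spring and the layer spring (symbols introduced here for illustration only; no values are given in this description):

```latex
\[
\frac{1}{k_{\text{eff}}} = \frac{1}{k_{\text{air}}} + \frac{1}{k_{\text{layer}}}
\quad\Longrightarrow\quad
k_{\text{eff}} = \frac{k_{\text{air}} \, k_{\text{layer}}}{k_{\text{air}} + k_{\text{layer}}}
\]
```

The effective stiffness is therefore always lower than that of the softer of the two springs, so under normal conditions the ride is governed mainly by the softer element.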
Figure 19-11: Traction center assembly

The air input to the air spring is controlled by a leveling valve which is mounted on the
car bolster and connected via a turnbuckle linkage to the truck. The result is a suspension system
whose characteristics can be varied to suit load conditions.
The air spring is connected to the car bolster by an air spring top plate. These plates can
only be fitted in a certain manner (Figure 19-14) and serve as both the mechanical and
pneumatic connection to the car. The lower sealing face of the air spring seals onto the
top of the layer spring assembly. The layer spring consists of a series of rubber and metal
elements bonded together. A plate on the top of the layer spring serves as the sealing face
for the air spring and as a housing for the emergency spring.
In the event of the air spring being deflated the car will rest on the top of the emergency
spring. The emergency spring comprises a metal/rubber assembly and has a low friction
surface fitted to its upper surface. This low friction surface allows the use of a vehicle in
service with a deflated air suspension, albeit with a reduced quality of ride.
Figure 19-12: Secondary suspension
Vertical oscillations are damped by two hydraulic shock absorbers, one on each side of the
truck adjacent to the secondary suspension. The dampers are mounted on brackets on
the truck side frame at one end and on the air spring top plate at the other end.
Roll of car body relative to truck is controlled by an anti-roll torsion bar, which is housed
in the truck transom and connected to the car body by a turnbuckle linkage.
6. Frame
The trailer truck frame is a jig-built welded structure comprising two side frames, a
center transom, and two headstocks. The side frames form enclosed box sections which
are internally braced to provide the optimum strength to weight relationship. The side
frames are symmetrical in profile about their centers with a wheel arch at each end. The
two side frames are joined at their centers by a transom assembly consisting of top and
bottom plates with vertical plates forming a box section structure. Two transverse tubes
are welded integrally into this structure. One of these tubes houses the torsion bar. The
ends of each side frame are joined together by headstocks.
Cast steel chevron holders are located in, and welded to, web plates attached to the front
and rear of each wheel arch. Brackets are located at the bottom of the wheel arches to
locate the tie rod assemblies under each axle box.
Four towing points are fitted, two to each side frame, in-board of the wheel arches. The
points can also be used as lifting points, when handling individual trucks.
Brackets are welded to the outside of each side frame: two provide mounting points for
the vertical dampers and the others house the bearings for the torsion bar.
Under the top of each wheel arch, location points are provided to accommodate shims.
The shims ensure the correct clearance between the axle box and the truck.
A torsion bar passes through one of the lateral tubes mounted in the transom and through
holes in the side frames, adjacent to the bearing housing brackets.
The mounting brackets for the traction links are welded diametrically opposite each other
under the transom, fore and aft of the center aperture.
Figure 19-13: Air spring top plates

Lateral resilient and fixed stop assemblies are bolted to the aperture of the transom, to
limit the possible movement of the traction center.
At the center of one of the headstocks, a mounting bracket for the AWS is welded to the
bottom plate. The AWS (automatic warning system) receiver is resiliently mounted and
the correct height above rail level is adjusted by the use of spacer washers.
Figure 19-14: Rail car truck
Actuators for the wheel mounted discs are mounted on each headstock. The actuators on
the trailer trucks are not fitted with parking brake facility.
Appendix 6.
Terminology used:
Age Exploration: Any analysis procedure that examines historical data in order to
improve the maintenance plan by increasing an item’s reliability, availability,
maintainability, productivity, or by reducing cost. (Also called “reliability analysis”).
Applicable: A task is technically feasible and practical. For a condition based
maintenance task it means that a potential failure can be detected and assessed well
enough in advance of a functional failure to avoid or reduce its consequences. For a
scheduled overhaul it means that the item has a useful life.
Availability: (total scheduled time – downtime)/total scheduled time. Or,
MTTF/(MTTF+MTTR)
Complex item: An item subject to more than one reasonably likely failure mode.
Condition data: Inspection/measurement data (temperature, vibration, wear, yield, visual
observation, performance, etc) from which a potential failure may be deduced.
Conditional probability of failure:
\[
\text{Conditional probability of failure in interval} =
\frac{\text{probability of entering interval} - \text{probability of surviving interval}}
     {\text{probability of entering interval}}
\]
The interval must be small compared to the average life of the item. It is the probability
of failure in an interval given that the item survives to the start of that interval.
Covariate: A condition indicator. A condition data variable or transformation of one or
more variables to be tested in a proportional hazard model.
Decision Model: A method for interpreting condition data. An optimized decision model
is one which maximizes or minimizes some objective (e.g. availability or cost
respectively). A decision model may be developed that achieves some performance
measure such as a specified mission reliability or a required preventive to corrective
maintenance ratio.
Effective: A task accomplishes the intended objective – to lessen satisfactorily or to
avoid entirely the consequences of a failure.
Failure: Two types: 1. Potential failure – an unambiguous indication that a functional
failure is imminent (degraded failure resistance), and 2. Functional failure – the partial or
total loss of one of an item’s functions
Inspections: Observations (physically (human senses) or electronically acquired) related
to an item’s operation and maintenance from which a potential failure may be deduced.217
Item: A group of one or more parts or assemblies that is convenient to treat as a single
entity for reliability analysis. Items are defined at a high enough level of indenture so that
their failures may be clearly related to failure of the equipment as a whole and low
enough so that the number of failure modes is reasonable (<50-60).
Mean time to failure (MTTF): The average life of an item. Can be estimated by totaling
the lives of an item or fleet over a period of time and dividing by the number of items.
Mean time between failure (MTBF): The MTTF plus the MTTR.
Mean time to return to service (MTTR): The average time needed to return a failed item
to service. (Also called the maintainability.)
Multiple Failure: A failure of a protected function at a time when its protective function
is already in a failed state
OEE: (Availability x Productivity x Quality) tracks maintenance effectiveness, where:
Availability = (scheduled time - downtime due to all forms of maintenance)/(scheduled
time). Productivity = Product rate setting/Desired product rate. Quality = (Product -
Scrap)/Product. Additionally, tracking Reliability = MTTF, will provide further insight
into benchmarks for maintenance effectiveness.
On-condition maintenance: The detection of a potential failure. Also known as
condition based maintenance (CBM) and predictive maintenance (PdM).
PM: Preventive Maintenance. Scheduled tasks that include: failure finding218, on-
condition (aka CBM, predictive maintenance), rework, and discard tasks.
Reliability: Usually defined as an item’s MTTF. Sometimes described as a survival
probability of the item for a given mission duration.
Reliability analysis: Synonym for “Age Exploration”: Any analysis procedure that
examines historical data in order to improve the maintenance plan by increasing an item’s
reliability, availability, maintainability, productivity, or by reducing cost.
217
In some contexts (e.g. gas turbines) “Inspections” refer to major overhauls.
218
Inspections to discover functional failures that would otherwise remain hidden until the function is
called upon by some other failure or exceptional event.
Reliability-centered: Adjective indicating the aim of sustaining and improving OEE and
reliability.
Reliability-centered maintenance: A (7-question) process used to determine the
maintenance requirements of an asset in its operating context.
Sample: Observations of an item’s (or group of similar219 items’) installations, failures,
preventive renewals, significant events, and condition data over a period of time.
Significant events: Operational or maintenance events that impact an item’s failure
resistance or its condition data.
Significant item: An item whose failures:
• Are not evident under normal circumstances, or
• Can directly negatively impact safety or the environment, or
• Can have direct major economic or operational impact.
Suspended220: Refers to replacement (discard) or rework of an item for any reason other
than its failure.
Useful Life: The age at which the conditional probability of failure begins to increase
and to which most items of the same kind survive. See Figure 3-2 on page 35.
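The ratio definitions in this glossary (Availability, OEE, and the conditional probability of failure) can be checked with a few lines of Python. The following is only a minimal sketch: the function names and all of the input figures are hypothetical, chosen to exercise the formulas as they are worded above.

```python
def availability(scheduled_time, downtime):
    """Availability = (total scheduled time - downtime) / total scheduled time."""
    return (scheduled_time - downtime) / scheduled_time


def availability_from_means(mttf, mttr):
    """Alternative form given in the glossary: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def oee(availability_value, productivity, quality):
    """OEE = Availability x Productivity x Quality."""
    return availability_value * productivity * quality


def conditional_probability_of_failure(p_enter, p_survive):
    """(P(entering interval) - P(surviving interval)) / P(entering interval)."""
    return (p_enter - p_survive) / p_enter


# Hypothetical figures, used only to illustrate the arithmetic.
a = availability(scheduled_time=720.0, downtime=36.0)
print(f"Availability          : {a:.3f}")
print(f"Availability (means)  : {availability_from_means(950.0, 50.0):.3f}")
print(f"OEE                   : {oee(a, productivity=0.90, quality=0.98):.3f}")
print(f"Cond. prob. of failure: {conditional_probability_of_failure(0.80, 0.60):.3f}")
```

In the last call, 80% of items are assumed to enter the interval and 60% to survive it, so one quarter of the items that enter the interval fail within it.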
Various definitions of “Life”

When someone says that an item has an operating life of 2000 hours what does that
mean?
No items fail before reaching 2000 hours?

No critical item failures occur before 2000 hours?
Half the items fail before 2000 hours?
The average age (mean time to failure) of failed items is 2000 hours?
The conditional probability of failure is constant below 2000 hours?
Some part in the item has a life limit of 2000 hours?
N% of the engines fail before 2000 hours?
Answer: All of the above. Thus it is important to clarify what we mean by “life” in any
given discussion.
Appendix 7.
Time to Failure - Relationship among hazard, reliability, and probability density functions
219
Similar physically and in operating context
220
There are actually 3 types of suspensions: left, right, and interval. EXAKT also has the concept of
“temporary” suspension that refers to items that are still operating. In most contexts in the present manual
we mean “right” suspensions.
f(t) is the probability density function (PDF). It is the usual way of representing a failure
distribution. As density equals mass per unit of volume, probability density is the
probability of failure per unit time221. When multiplied by the length of a small time interval
at t, the product is the probability of failure in that interval. It is the basic description of
the time to failure of an item. The PDF is often estimated from real life data. It resembles a
histogram222 of the number of failures of an item in consecutive intervals. All other functions
related to an item’s reliability can be derived from it. For example:
F(t) is the cumulative distribution function (CDF). It is the area under the f(t) curve from 0 to t.
(Sometimes called unreliability or the cumulative probability of failure.)
R(t) is the survival function. (Also called the reliability function.) R(t) = 1-F(t)
h(t) is the hazard function223. (At various times called the hazard rate, conditional failure rate,
instantaneous failure probability, instantaneous failure rate, failure rate, the inverse of failure resistance, failure
risk, and risk.) h(t) = f(t)/R(t)
221
However the analogy is accurate only if we imagine a volume of non-uniform mass. The density of a
small volume element is the mass of that element divided by its volume
222
A histogram is a vertical bar chart on which the bars are placed along a horizontal axis scaled in units of
working age. The widths of the bars are uniform, representing equal working age intervals. The height of
each bar represents the fraction of items that failed in the interval. If the bars are very narrow then their
outline approaches the pdf.
223
Often, the two terms "conditional probability of failure" and "hazard rate" are used interchangeably in
many RCM and practical maintenance references. In those references the definition for both terms is: the
conditional probability that an item will fail during an age interval given that the item enters (or survives)
to that age interval. This definition is not the one usually meant in reliability theoretical works when they
refer to “hazard rate” or “hazard function”. Nowlan and Heap point out that the hazard rate may be
considered as the limit of the ratio (R(t)-R(t+L))/(R(t)*L) as the age interval L tends to zero.
To summarize, "hazard rate" and "conditional probability of failure" are often used interchangeably (in
more practical maintenance books). The “hazard rate” is commonly used in most reliability theory books.
The conditional probability of failure is more popular with reliability practitioners and is used in RCM
books such as those of N&H and Moubray. There are two versions of the definition for either "hazard
rate" or "conditional probability of failure":
1. h(t) = f(t)/R(t)
2. h(t) = (R(t)-R(t+L))/R(t).
where L is the length of an age interval. Actually, when you divide the right hand side of the second
definition by L and let L tend to 0, you get the first expression.
Since
F(t) = 1 − R(t)
then, differentiating,
\[
\frac{dF(t)}{dt} = -\frac{dR(t)}{dt} = f(t)
\]
Dividing the second definition by L and letting L tend to 0 (and applying the limit definition of the
derivative)
MTTF is the average time to failure. (Also called the mean time to failure, expected time to failure,
average life.)
\[
MTTF = \int_0^{\infty} t\,f(t)\,dt
\]
H(t) is the conditional probability of failure. It is the probability that the item fails in a
time interval [t1 to t2] given that it has not failed up to t1. It is approximately equal to h(t) multiplied
by the length of the time interval of interest. Its graph has the same shape as that of the hazard
function, differing by a constant factor that depends on the interval width being considered.
H(t) = (R(t1) − R(t2))/R(t1)
\[
h(t) = \lim_{L \to 0} \frac{R(t) - R(t+L)}{L\,R(t)}
     = -\frac{1}{R(t)}\left[\frac{dR(t)}{dt}\right]
     = \frac{f(t)}{R(t)}
\]
Note that, in the second version, t is not continuous as in the first version. For example, you may have
t=0,100,200,300,... and L=100.
Actually, not only the hazard function, but pdf, cdf, reliability function and cumulative hazard function
have two versions of their definitions as above. The first version is defined over a continuous range of age t
while the second one is defined over discrete age intervals, e.g., (0,100), (100,200), (200,300), ... Roughly,
we can say the second definition is a discrete version of the first definition.
The first expression is useful in reliability theory and is mainly used for theoretical development. The
second expression is useful for reliability practitioners, since in practice people usually divide the age
horizon into a number of equal age intervals. The pdf, cdf, reliability function, and hazard function may all
be calculated using age intervals. The results are similar to histograms, rather than continuous functions
obtained using the first version of the definitions.
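The relationships among f(t), F(t), R(t), h(t), and the interval-based conditional probability of failure H can be checked numerically. The sketch below assumes a Weibull life distribution with hypothetical shape and scale values (BETA and ETA are illustrative choices, not values taken from this book). It shows that H divided by the interval length L approaches h(t) as L shrinks, which is the limit argument made in the footnote above.

```python
import math

BETA = 2.0     # Weibull shape parameter (assumed for illustration)
ETA = 1000.0   # Weibull scale parameter, in hours (assumed for illustration)


def R(t):
    """Survival (reliability) function of the assumed Weibull distribution."""
    return math.exp(-((t / ETA) ** BETA))


def F(t):
    """Cumulative distribution function: F(t) = 1 - R(t)."""
    return 1.0 - R(t)


def f(t):
    """Probability density function, the derivative of F(t)."""
    return (BETA / ETA) * (t / ETA) ** (BETA - 1.0) * R(t)


def h(t):
    """Hazard function, first (continuous) definition: h(t) = f(t) / R(t)."""
    return f(t) / R(t)


def H(t1, t2):
    """Conditional probability of failure in [t1, t2]: (R(t1) - R(t2)) / R(t1)."""
    return (R(t1) - R(t2)) / R(t1)


t = 500.0
print(f"h({t:.0f}) = {h(t):.6f} per hour")
for L in (100.0, 10.0, 1.0):
    # Dividing the interval-based value by L approximates the hazard rate.
    print(f"L = {L:6.1f}   H = {H(t, t + L):.6f}   H / L = {H(t, t + L) / L:.6f}")
```

With wide intervals the two definitions differ noticeably; with narrow intervals H/L and h(t) agree closely, which is why the discrete version is adequate for practical work with short age intervals.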
Appendix 8.
Random failure survival curve
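[Figure: Random failure survival curve. Vertical axis: probability of survival without failure (0 to 1). Horizontal axis: age as a multiple of the MTBF. Plotted values: 0.78, 0.61, 0.47, 0.37, 0.29, and 0.22 at 0.25, 0.50, 0.75, 1, 1.25, and 1.50 × the MTBF.]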
By definition:
\[
MTTF = \int_0^{\infty} R(t)\,dt
\]
where R(t) is the survival probability at time t.
But R(t) = e^{−Lt} by definition for an exponential (random) failure; then
\[
MTTF = \int_0^{\infty} R(t)\,dt = \int_0^{\infty} e^{-Lt}\,dt
     = \left[\frac{1}{-L}\,e^{-Lt}\right]_0^{\infty}
     = \frac{1}{-L}\,(0 - 1) = \frac{1}{L}
\]
At t = MTTF = 1/L
\[
R(t) = e^{-Lt} = e^{-1} = 0.37
\]
That is, the probability of survival to the MTTF is 37%.
If we calculate the survival probability (reliability) at ¼ the MTTF we get
\[
R(t) = e^{-Lt} = e^{-0.25} = 0.78
\]
Likewise we will get the values of 0.78, 0.61, 0.47, 0.37, 0.29, and 0.22 for the times 0.25/L,
0.5/L, 0.75/L, 1/L, 1.25/L, and 1.5/L respectively.
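The survival values quoted above can be reproduced directly from R(t) = e^{−Lt}. In this short sketch the failure rate is an arbitrary assumed value; the result depends only on the age expressed as a multiple of the MTTF.

```python
import math


def survival(t, failure_rate):
    """Exponential (random failure) survival function R(t) = exp(-L * t)."""
    return math.exp(-failure_rate * t)


failure_rate = 1.0 / 2000.0   # L, failures per hour (assumed for illustration)
mttf = 1.0 / failure_rate     # MTTF = 1 / L for the exponential case

multiples = (0.25, 0.50, 0.75, 1.00, 1.25, 1.50)
values = [round(survival(k * mttf, failure_rate), 2) for k in multiples]

for k, r in zip(multiples, values):
    print(f"age = {k:.2f} x MTTF   probability of survival = {r:.2f}")
```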
Appendix 9.
Inherent reliability characteristics224

Table 19-2
Inherent reliability characteristic | Impact on PM applicability and effectiveness
Failure consequences | Determine the significance of items for scheduled maintenance; establish the definition of task effectiveness; determine default strategy when no applicable and effective PM task can be found
Visibility of functional failure to operating crew under normal circumstances | Determines the need for a failure-finding task to ensure that failure is detected
Ability to measure/detect reduced resistance to failure | Determines applicability of on-condition tasks
Rate at which failure resistance decreases with operating age once a potential failure225 occurs | Determines interval for on-condition tasks
Age-reliability relationship | Determines applicability of rework and discard tasks
Age-reliability-covariate relationship | Determines the key risk factors for interpreting on-condition data
Cost of corrective maintenance | Helps establish PM task effectiveness (except for safety and environment impacting failures)
Cost of preventive maintenance | Helps establish PM task effectiveness (except for safety and environment impacting failures)
Need for safe-life limits to prevent safety or environment failures | Determines applicability and interval of safe-life discard tasks
Need for servicing and lubrication | Determines applicability and interval of servicing and lubrication tasks
224
Modified from Report AD-A066-578, “Reliability-Centered Maintenance”, F. Stanley Nowlan, Howard
F. Heap, National Technical Information Service, U.S. Department of Commerce, 1978
Appendix 10.
Failure mode depth of causality

Table 19-3 illustrates the variability of the depth of causality. At which level should we
specify a failure mode? It depends entirely on the item’s operating context, which itself may
change. The choice of the correct level is not always clear. It may take a period of time
for the context to become better understood. Therefore, it will not be unusual, at a second or
third discussion between a maintainer and his supervisor, for a decision to be
made to “drill deeper” or less deep when describing the cause of a particular failure in an
item. Table 19-3 helps us understand that, as perspective sharpens, it will eventually
focus on a choice of failure mode level that strikes the ‘best’ balance between the depth
of causality and the organization’s practical proactive capability to manage the failure
mode. In other words, how deeply one drills into the cause of a failure should be a
function of the organization’s ability to do something to prevent it or reduce its
consequences. Invariably, this is a matter of judgment and consensus. The procedures
described in Part 3. have been designed to permit that discussion to settle quickly and
undisputedly into a common understanding.
Table 19-3: failure mode depth of causality
Why? | Why? | Why? | Why? | Why? | Why? | Why? | Why?
Ventilation system fails | Fan fails | Motor fails | Motor trips | Airways clogged with dirt | Inadequate design | |
 | | | | Defective sensor | | |
 | | | Bearing seized | Lubricant allowed to run dry | | |
 | | | | Wrong lubricant | Improperly labeled | Stores error |
 | | | | | Label misread | Inattention |
 | | | | | | Insufficient training |
 | | Power drive fails | Belts failed | Incorrectly installed | … | … | …
 | | | | Incorrectly specified | … | … | …
Distribution system fails | Duct fails | Duct clogged | … | … | … | … |
 | | Duct pierced | … | … | … | … |
 | | Damper failed | … | … | … | … |
… | … | … | … | … | … | … | …
225
A potential failure is a measurable indicator of reduced resistance to failure.
Appendix 11. Cost Comparison of CBM Policies

1. Current: The actual cost per unit of working age of preventive rework/discards
plus the costs associated with failures, resulting from existing practice as
determined from the sample226 of historical data
2. Optimal: The rework/discards and failures that would have resulted had an
optimal decision policy been used to assess (and decide upon) each and every one
of the condition data observations in the sample. There are three ways of
calculating the results of the optimal policy to help predict its true effectiveness:
a. Applied: The cost of the policy obtained from applying the optimal model
retroactively to the sample.
226
Sample: Observations of an item’s (or group of similar items’) installations, failures, preventive
renewals, significant events, and condition data over a period of time.
b. Fitted: The curve of the EXAKT decision chart is fitted to the actual data
so as to minimize “average” realized cost.
i. Fitted, Method A: Suspensions227 considered as preventive
renewals.
ii. Fitted, Method B: Suspensions not counted228
c. Theoretical: The warning level curve is selected to minimize “expected”
cost.229
3. No scheduled maintenance (NSM): The policy of not using any proactive
(neither scheduled nor on-condition) maintenance.
Rather than describing, in rigorous detail, the various calculation methods mentioned in 2
above, an example of an effectiveness assessment of a CBM policy by comparing these
alternative policies is given below. This data is derived from diesel engines and applies to
a fleet of 300 T haul trucks.
Example of CBM Effectiveness Comparison

Table 19-4 Summary of Events
Policy | Sample size | Failed | Replaced | Undecided230 | %Suspended
Current | 13 | 6 | 3 | 4 | 30.8
Applied | 13 | 1 (see footnote 231) | 6 | 6 | 46.2
Fitted A | 13 | 1,1 | 5 | 7,4 | 53.8
Fitted B | 13 | 1 | 8 | 4 | 30.8
In row “Current” of Table 19-4 we find that of the 13 actual histories in the sample 6
failed, 3 were replaced, and 4 are “undecided” – i.e. we do not know whether they will
eventually fail or be preventively replaced. At the present time they are still operating.
An optimized model in CBM is a tool for interpreting condition data in order to declare a
potential failure so that a required objective is met (minimal average cost, maximum
uptime, a reliability goal, or some other performance metric). The model is derived from
past equipment failure behavior as a function of age and monitored condition data.
Applying the optimized interpretation model retroactively to the data (row “Applied” of Table
19-4), we see that 1 would have failed, 6 would have been replaced and 6 would have
been undecided. The result looks very promising since 5 out of 6 failures would have
been prevented. However, our final assessment must take into account how much of the
total operational time we have “exchanged” for such a decrease in failure rate. That is to
say we may have been too cautious having preventively intervened (premature
227
Right suspensions. Equipment that is currently still operating at the time of the sample.
228
We are considering two sets of calculations for the analyst to consider. It is a kind of best and worst
case, with the actual situation being somewhere in the middle.
229
Another calculation to help judge how well the EXAKT derived policy will do in the future
230
“Undecided” means that it is unknown whether the item would have failed. The item was either still in
operation or had been replaced preventively in the actual data set (sample)
231
The optimal policy applied to the data would have permitted one failure to occur. That is the prediction
method would have “missed” one time.
replacements) too often resulting in an expensive PM policy. We evaluate this by using
Table 19-5.
Table 19-5 Cost Comparison Summary
Policy (risk level) | Cost/unit time | Compared to Current | Preventive Replacements | Compared to Current | MTBR | Compared to Current
Current | 0.391 | 100% | 53.85% | 100% | 8458.92 | 100%
Applied (0.638) | 0.195 | 49.78% | 92.31% | 171.43% | 7113.54 | 84.10%
Theoretical (0.638) | 0.157 | 40.26% | 97.74% | 181.53% | 7070.09 | 83.58%
Fitted (1.259) | 0.182 | 46.43% | 92.31% | 171.43% | 7627.00 | 90.17%
No Scheduled Maintenance | 0.638 | 163.14% | 0.0% | 0.0% | 9405.25 | 111.19%
Procedure for interpreting Table 19-4 and Table 19-5

First we examine Table 19-4. If the number of failed histories of the Current policy (row
1) is significantly reduced by the optimal policy (rows 2 and 3), then we may conclude
that applying the optimal policy will significantly influence day-to-day decisions.
However, it may or may not produce a true cost reduction. In the example of Table 19-4
we see that:
• the total number of histories (sample size) is 13,
• with the current policy
o 6 histories failed,
o 3 were preventively replaced, and
o 4 are still in operation.
When the optimal policy was applied to the data set,
• 1 history would have failed,
• 6 would have been preventively replaced, and
• 6 would have been undecided232.
From this we may conclude that the number of failures would have been significantly
reduced. The cost ratio used in the optimization calculation was 6000:1000. In Chapter
10. “Optimizing CBM” page 145 we perform a sensitivity analysis to determine how
changes in the ratio will impact the optimal policy.
Next, in Table 19-5 we compare the cost per operating hour of the Current policy with
that of the optimal Applied policy to see whether there is any significant reduction in
232
Still functioning at the sample cut-off date.
total maintenance costs233. This should be the main criterion234 for assessment. From
Table 19-5 we see that the current policy cost is $0.391/h, and the optimal policy cost is
$0.195/h. This reduction in the cost of about 50% is significant.
We may also compare the MTBR for both policies. If there is a significant reduction in
MTBR (mean time between repairs, either preventive or as the result of failure) the
optimal policy is being cautious in reducing failures (due to high cost ratio). If the
MTBRs are similar, then the analysis is telling us that our condition indicating
measurements (interpreted by the model) are a relatively accurate predictor of oncoming
failures.
In the example, the current policy cost is $0.391/h, and the optimal policy cost is $0.195/h.
The reduction in cost is about 50%235. The percent of preventive replacements for the
Current policy is 53.85%236, and for the Applied optimal policy, 92.31%237. MTBR is
8458.92h for the Current policy, and 7113.54h for the Applied optimal policy. All this
leads us to the conclusion that there is much to be gained by optimization.
Next compare the cost of the optimal Applied policy to that of the Theoretical one. If
these two costs are similar, we may conclude that the theoretical model fits the data
properly. In the example the cost of the applied policy is $0.195/h, and that of the
theoretical one is $0.157/h. This difference is not very large (considering the sample
size). Theoretically, then, we expect 97.74% preventive replacement, but only 92.31% =
12/13 would have been realized by applying the optimal policy. Similarly, theoretically
we expect the MTBR to be 7070.09h, but 7113.54h would have been realized. (For this
sample size, these two values are very close).
We now compare the results of the Fitted and Applied policies. Close cost values favor
the conclusion that the optimal model is a good one. A significant difference in the costs
may mean that some part of the theoretical model may be improved, possibly the method
of classifying inspection value ranges238. In the example, the cost of the fitted policy is
$0.182/h, close to the cost $0.195/h of the applied policy. Both policies have one failed
history, but different MTBRs - 7627h for the fitted policy, and 7113.54h for the applied.
This means that the fitted policy would have been more accurate in selecting the moment
for rework or discard239.
In summary:
1. The above analysis provides a way to judge the potential of a proposed CBM
policy.
233
The combined costs of all failures and all preventive repairs in the sample period.
234
The analysis may also be done from the point of view of maximizing total availability, in which case
costs would be replaced by “downtime” using the relationship Avail = uptime/ (uptime+downtime).
235
50.22% = 100% - 49.78%, 49.78% = 0.195/0.391
236
(3+4)/13
237
12/13
238
In the transition probability model.
239
One might ask, why not use the fitted policy then. Answer: the fitted policy can be obtained only after
the fact. The purpose of evaluating a proposed policy in this way is to help judge its future effectiveness.
2. It uses various sets of calculations to probe the robustness of the proposed model
3. It is a tool that a statistician uses to gain a degree of comfort by arriving at similar
numbers using calculations at both sides of the envelope of possible solutions.
The assessment procedures described here provide not only an objective way to assess
actual (current) PM policy but ways to predict and evaluate the cost advantages of future
optimized policies.
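The percentages used in this comparison can be recomputed from the counts in Table 19-4 and the MTBR values in Table 19-5. The sketch below is an illustration only: the function and variable names are hypothetical, and the preventive replacement percentage follows the convention of footnotes 236 and 237, that is, all histories that did not fail, divided by the sample size. Cost ratios recomputed from the rounded costs in Table 19-5 may differ in the last digits from the tabulated percentages, presumably because the table was produced from unrounded values.

```python
SAMPLE_SIZE = 13

# (failed, replaced, undecided) per policy, taken from Table 19-4.
events = {
    "Current": (6, 3, 4),
    "Applied": (1, 6, 6),
}

# Cost per unit time and MTBR per policy, taken from Table 19-5.
cost = {"Current": 0.391, "Applied": 0.195, "Theoretical": 0.157, "Fitted": 0.182}
mtbr = {"Current": 8458.92, "Applied": 7113.54, "Theoretical": 7070.09, "Fitted": 7627.00}


def percent_suspended(undecided, n=SAMPLE_SIZE):
    """Share of histories still undecided (suspended) in the sample."""
    return 100.0 * undecided / n


def percent_preventive(failed, n=SAMPLE_SIZE):
    """Share of histories that did not end in failure (footnotes 236 and 237)."""
    return 100.0 * (n - failed) / n


def relative_to_current(table, policy):
    """Value for a policy expressed as a percentage of the Current policy."""
    return 100.0 * table[policy] / table["Current"]


for policy, (failed, replaced, undecided) in events.items():
    print(f"{policy:8s} suspended = {percent_suspended(undecided):5.1f}%   "
          f"preventive = {percent_preventive(failed):5.2f}%")

for policy in ("Applied", "Theoretical", "Fitted"):
    print(f"{policy:12s} cost vs Current = {relative_to_current(cost, policy):6.2f}%   "
          f"MTBR vs Current = {relative_to_current(mtbr, policy):6.2f}%")
```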
Appendix 12.
Expected failure time for an item whose maintenance policy is time-based

Let Tc be the time of a cycle, tp the preventive maintenance time interval, and T the
failure time.
The expected cycle length E(Tc) will be the planned maintenance time tp multiplied by the
probability that planned maintenance does occur, plus the expected failure time (knowing
that failure occurs before tp) multiplied by the probability that failure occurs before tp.
This is expressed mathematically by:
\[
E(T_c) = t_p\,P(T > t_p) + E(T \mid T \le t_p)\,P(T \le t_p)
\]
Equation 19-1
The term E(T | T ≤ tp) in Equation 19-1 is the expected time to failure, given that
failure occurs prior to scheduled maintenance, under a policy where scheduled
maintenance is carried out at time tp. We wish to show that it can be expressed as
\[
E(T \mid T \le t_p) = \frac{\int_0^{t_p} t\,f(t)\,dt}{1 - R(t_p)}
\]
First, we recognize that the conditional distribution function of T | T ≤ tp is
\[
F_c(t) = P(T \le t \mid T \le t_p) =
\begin{cases}
1, & t > t_p \\[4pt]
\dfrac{P(T \le t)}{P(T \le t_p)}, & t \le t_p
\end{cases}
=
\begin{cases}
1, & t > t_p \\[4pt]
\dfrac{F(t)}{1 - R(t_p)}, & t \le t_p
\end{cases}
\]
Equation 19-2
In the first part of Equation 19-2, we have simply defined the distribution function of
(T≤t given that T≤tp) as Fc(t) = P(T ≤ t | T ≤ tp). We will call this conditional
distribution function “Fc(t)”. (Recall the definition of a distribution function in Appendix
7. on page 290.)
Now, moving towards the right in Equation 19-2, the top condition “1, where t>tp” is
easy to understand. We know that failure will have occurred prior to tp (with 100%
certainty) because T≤tp is our hypothesis in Fc(t).
The bottom condition, P(T ≤ t)/P(T ≤ tp) for t ≤ tp, requires us to know that the conditional
probability P(A|B) is P(A ∩ B)/P(B), where A = T≤t and B = T≤tp.
But we know that the intersection of T≤t and T≤tp is T≤t (see footnote240)
In the rightmost part of Equation 19-2 we apply the definition of F(t) to the numerator
and denominator. And, of course, we know that F(t) = 1-R(t). (See Appendix 7. on page
290.)
Then the conditional density function of T | T ≤ tp is
\[
f_c(t) =
\begin{cases}
0, & t > t_p \text{ or } t < 0 \\[4pt]
\dfrac{f(t)}{1 - R(t_p)}, & 0 \le t \le t_p
\end{cases}
\]
Equation 19-3
We have used, in Equation 19-3, the fact that the density function is the first derivative of
the distribution function.
Therefore,
\[
E(T \mid T \le t_p) = \int_0^{\infty} t\,f_c(t)\,dt
= \int_0^{t_p} \frac{t\,f(t)}{1 - R(t_p)}\,dt
= \frac{\int_0^{t_p} t\,f(t)\,dt}{1 - R(t_p)}
\]
Equation 19-4
Here, in Equation 19-4, we have invoked the definition of “Expectation” as the integral of
the product of t and the density function. From this point on it’s just a matter of
substituting expressions from Equation 19-3.
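Equations 19-1 and 19-4 can be checked numerically. The sketch below is a minimal illustration under stated assumptions: it uses an exponential life distribution with a hypothetical failure rate LAM, evaluates the integral in Equation 19-4 with a simple trapezoidal rule, and compares the resulting E(Tc) with the integral of R(t) from 0 to tp. That last identity is a standard result for the expected value of min(T, tp); it is not stated in this appendix and is used here only as an independent check.

```python
import math


def trapezoid(g, a, b, n=20000):
    """Composite trapezoidal rule for the integral of g over [a, b]."""
    step = (b - a) / n
    total = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        total += g(a + i * step)
    return total * step


def conditional_mean_failure_time(f, R, tp):
    """Equation 19-4: integral of t*f(t) over [0, tp], divided by 1 - R(tp)."""
    return trapezoid(lambda t: t * f(t), 0.0, tp) / (1.0 - R(tp))


def expected_cycle_length(f, R, tp):
    """Equation 19-1, with P(T > tp) = R(tp) and P(T <= tp) = 1 - R(tp)."""
    p_fail = 1.0 - R(tp)
    return tp * R(tp) + conditional_mean_failure_time(f, R, tp) * p_fail


LAM = 0.001   # failure rate per hour (hypothetical value for illustration)


def R_exp(t):
    """Exponential survival function."""
    return math.exp(-LAM * t)


def f_exp(t):
    """Exponential probability density function."""
    return LAM * math.exp(-LAM * t)


tp = 500.0
cond_mean = conditional_mean_failure_time(f_exp, R_exp, tp)
numeric = expected_cycle_length(f_exp, R_exp, tp)
closed_form = (1.0 - math.exp(-LAM * tp)) / LAM   # integral of R(t) from 0 to tp

print(f"E(T | T <= tp) = {cond_mean:.2f} h")
print(f"E(Tc) from Equation 19-1 = {numeric:.3f} h, check value = {closed_form:.3f} h")
```

As expected, the conditional mean failure time is less than tp, and the expected cycle length lies between it and tp.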
240
Because t≤tp, the intersection of T≤t and T≤tp is actually T≤min(t,tp)=t.
Appendix 13.
Default RCM decision diagram answers in the absence of operating experience
Table 19-6 The default answer to be used in developing an initial scheduled-maintenance program in
the absence of data from actual operating experience.
Decision question | Default answer to be used in case of uncertainty | Stage at which question can be answered: initial program (with default) | Stage at which question can be answered: ongoing program (operating data) | Possible adverse consequences of default condition | Default consequences eliminated with subsequent operating information
IDENTIFICATION OF SIGNIFICANT ITEMS
Is the item clearly nonsignificant? | No: classify item as significant | X | X | Unnecessary analysis | No
EVALUATION OF FAILURE CONSEQUENCES
Is the occurrence of a failure evident to the operating crew during performance of normal duties? | No (except for critical secondary damage): classify function as hidden | X | X | Unnecessary inspections that are not cost-effective | Yes
Does the failure cause a loss of function or secondary damage that could have a direct adverse effect on operating safety and the environment? | Yes: classify consequences as critical | X | X | Unnecessary redesign or scheduled maintenance that is not cost-effective | No for the redesign; yes for scheduled maintenance
Does the failure have a direct adverse effect on operational capability? | Yes: classify consequences as operational (production) | X | X | Scheduled maintenance that is not cost-effective | Yes
EVALUATION OF PROPOSED TASKS
Is an on-condition task to detect potential failures technically feasible? | Yes: include on-condition task in the program | X | X | Scheduled maintenance that is not cost-effective | Yes
If an on-condition task is technically feasible (effective), is it worthwhile? | Yes: assign inspection intervals short enough to make the task effective | X | X | Scheduled maintenance that is not cost-effective | Yes
Is a rework task to reduce the failure rate applicable? | No (unless there are real and applicable data): assign item to no scheduled maintenance | -- | X | Delay in exploiting opportunity to reduce costs | Yes
If a rework task is applicable, is it effective? | No (unless there are real and applicable data): assign item to no scheduled maintenance | -- | X | Unnecessary redesign (safety) or delay in exploiting opportunity | No for redesign; yes for scheduled maintenance
Is a discard task to avoid failures or reduce the failure rate applicable? | No (except for safe-life items): assign item to no scheduled maintenance | X (safe life only) | X (economic life) | Delay in exploiting opportunity to reduce costs | Yes
If a discard task is applicable, is it effective? | No (except for safe-life items): assign item to no scheduled maintenance | X (safe life only) | X (economic life) | Delay in exploiting opportunity to reduce costs | Yes
Appendix 14.
Additional Relcode examples

Exercise 3
The cloth filter on a sugar centrifuge is currently replaced on a preventive basis if a
suitable opportunity occurs and the cloth has been in use for at least 20 hours. The cloth
is also replaced on failure. The following data are available for 10 hour time intervals of
cloth life.
Age in Hours    Failure Replacements    Preventive Replacements
0-9.99 14 0
10-19.99 5 0
20-29.99 2 4
30-39.99 1 8
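Before entering these figures into Relcode, it can help to look at how the risk of failure changes from one age interval to the next. The sketch below is not the Relcode method; it is a standard actuarial (life-table) estimate, and it assumes that every cloth in the table ends in exactly one interval and that preventive replacements leave the at-risk group, on average, halfway through their interval.

```python
# Grouped cloth-filter data from Exercise 3:
# (age interval in hours, failure replacements, preventive replacements)
intervals = [
    ("0-9.99", 14, 0),
    ("10-19.99", 5, 0),
    ("20-29.99", 2, 4),
    ("30-39.99", 1, 8),
]


def life_table(rows):
    """Actuarial estimate of the conditional failure probability per interval."""
    at_risk = sum(f + p for _, f, p in rows)  # cloths entering the first interval
    table = []
    for label, failures, preventive in rows:
        # Assumption: preventive replacements (suspensions) are exposed to
        # failure for half of the interval on average.
        effective = at_risk - preventive / 2
        table.append((label, at_risk, failures / effective))
        at_risk -= failures + preventive
    return table


for label, n, q in life_table(intervals):
    print(f"{label:>9}  at risk: {n:2d}  conditional failure probability: {q:.3f}")
```

A conditional failure probability that is highest in the first interval would point toward early-life failures, which is the kind of pattern the Weibull fit is meant to quantify.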
Figure 19-15: Relcode data entry for cloth filters
Figure 19-16
Exercise 4
A metropolitan transport company operates a fleet of similar buses. Engine failures
necessitating replacement have occurred in the kilometer ranges shown in the following
table which also shows the number of engines currently running in each age range.
Age Range (Kilometers)    Failure Replacements    Survivors
0-49,999 2 35
50,000-99,999 8 27
100,000-149,999 33 12
150,000-199,999 44 62
Figure 19-17: Relcode data entry for engines
Figure 19-18
Exercise 5
A new type of car has recently been released and is subject to warranty. An analysis of
warranty claims shows several alternator failures, although, as a proportion of the whole
population the numbers are quite small.
The available data are as follows:
Age Range (Kilometers)    Failure Replacements    Survivors
0-49,999 1 48
50,000-99,999 2 123
100,000-149,999 3 56
150,000-199,999 4 44
Figure 19-19: Relcode data entry for alternator failure warranties
Figure 19-20
Appendix 15. EXAKT Exercises
The instructions under “Steps to follow” are minimal so as to keep them simple. The
“Detailed explanation” for each step gives more background. Whenever an EXAKT menu
option or icon is mentioned, it should be clicked in the EXAKT program. When database
tables are mentioned, they should be double-clicked.
Exercise 1
Convention used: Meaning:
X instruction to close the current sub-window (or pane)
Building the CBM Optimal Decision Model

Detailed Explanation Steps to follow
Step 1
Detailed explanation: Install the EXAKT program from the Flash player user interface on the CD (or from www.omdec.com).
Steps to follow: EXAKT, Install Exakt, follow prompts

Step 2
Detailed explanation: Install the data files from the CD's Flash player user interface. (Alternatively, download them from www.omdec.com and place them in a folder on your hard drive. Modify the path given in step 4 and step 2 (of Section 2) depending on where you install the data files.)
Steps to follow: EXAKT, Install data files

Step 3
Detailed explanation: Launch “EXAKT for Modeling”. This is the program for validating and analyzing condition monitoring and event data and for building the optimized CBM (condition based maintenance) model.
Steps to follow: Start, “Exakt for Modeling”

Step 4
Detailed explanation: Load the working model database (Transmission_WMOD.mdb).
Steps to follow: File, Open, navigate to c:\Program Files\Exakt\data, Transmission_WMOD.mdb, Open
Step 5
Detailed explanation: From the EXAKT – Modeling program, attach the sample measurements and events database (Transmission_MES.mdb) to the Exakt working model database. After executing the steps below you may examine the attachment script by again hitting Modeling, Data Set-up. You will note that it creates an ODBC (open database connectivity) link to an external database called “Transmission_MES.mdb” and has attached a number of tables. It has applied its own internal names to two of the tables using the A=B syntax, but other tables are attached directly since their names are already consistent with EXAKT’s internal names for those tables.
Steps to follow: Modeling (on the Menu bar), Data setup, type in the attachment script (actually it is already keyed in for you), Execute, Save
Step 6
Detailed explanation: Notice that the attached tables have now become visible and accessible in the tree structure of the left pane. In the next steps you will examine each one of those tables to become familiar with their content and structure, starting with Inspections. Open the Inspections table. Note the column names and content. Ident, Date, and WorkingAge are key words used by EXAKT. “Ident” is the unique name of each unit of a specific type of Item to be analyzed. An item is a significant system, subsystem, or component upon which it is convenient and desirable to conduct a reliability analysis. An item may consist of several components and may undergo several failure modes. But in this introductory section of the tutorial we will keep it simple and assume that the item is a simple item. The “Date” may be in date or date/time format. If condition monitoring inspections are more frequent than once every 24 hours, the date/time format must be used. The WorkingAge is a measure such as hours of operation, fuel consumed, thousands of feet of steel rolled, or any other measurement that reflects the accumulated usage or stress on the item. Calendar time can only be used if the units operate regularly in time, which is a rare situation. Databases of production records, hour meters, or counters must be used to acquire useful WorkingAge data. The remaining columns contain the condition monitoring data, which we refer to as condition data.
Steps to follow: Inspections, X
Step 7
Detailed explanation: Now examine the Events data table. Contrasted with the Inspections table, its information represents the other side of the coin. Both Event and Inspection data are required for CBM optimization. The EXAKT modeling process is one of correlation of Events (of all kinds) and Inspections (that is, condition data). Condition data often comes from specialized databases provided by CBM product or service vendors. Common examples are oil analysis and vibration analysis. These databases are invariably well organized and consistently populated. The Events data, on the other hand, often comes from the organization’s CMMS (computerized maintenance management system) and from production databases. (The records in the CMMS, typically, have been less rigorously kept than the others. Hence EXAKT contains tools and techniques to validate and get the CMMS data into shape.) The basic required Events are: 1) Beginning (an item has been placed into service), designated by B; 2) Ending by Failure (EF); and 3) Ending by Suspension (ES). By “suspension” we mean that the item has been taken out of service for any reason other than failure. For example, it may have been preventively replaced. Once again the Ident, Date of the Event, and WorkingAge are required fields. The Event itself is recorded in the fourth column. “OC” in this example represents an “oil change” event. Any event which affects the condition data (in this case it would initialize the wear metals and contaminant elements to zero) must be included in the model.
Steps to follow: Events, X
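The role of the B, EF and ES events can be illustrated with a short sketch that groups event rows into histories and counts how each history ended. This is only an illustration of the bookkeeping, not EXAKT's internal code; the function name, the tuple layout and the sample rows are hypothetical, and the rows for each unit are assumed to be already in chronological order.

```python
from collections import defaultdict


def summarize_histories(events):
    """Count how histories end, given (ident, working_age, code) event rows.

    A history starts at B and ends at EF (failure) or ES (suspension).
    A history still open at the end of the data is counted as a temporary
    suspension (the unit is assumed to be still in operation).
    """
    by_ident = defaultdict(list)
    for ident, working_age, code in events:
        by_ident[ident].append(code)

    counts = {"EF": 0, "ES": 0, "temporary": 0}
    for codes in by_ident.values():
        history_open = False
        for code in codes:
            if code == "B":
                history_open = True
            elif code in ("EF", "ES") and history_open:
                counts[code] += 1
                history_open = False
            # Other events (for example OC) do not start or end a history.
        if history_open:
            counts["temporary"] += 1
    return counts


# Hypothetical event rows for two units.
sample_events = [
    ("17-66", 0, "B"), ("17-66", 500, "OC"), ("17-66", 9100, "EF"),
    ("17-66", 0, "B"),
    ("17-67", 0, "B"), ("17-67", 7200, "ES"),
    ("17-67", 0, "B"),
]
print(summarize_histories(sample_events))
```

The sum of the three counts corresponds to what the modeling reports later call the sample size: the number of histories, whatever their ending.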
Step 8
Detailed explanation: Examine the CovariatesOnEvent table. We must provide the “initialization values” for each event. Note that in this case we are initializing wear metals and contaminants to zero and additives to their new-condition levels. We may also establish calendar periods for which these initialized values are to be used. (For example, the brand or grade of lubricating oil may be changed periodically.)
Steps to follow: CovariatesOnEvent, X

Step 9
Detailed explanation: Examine the EventsDescription table. The column P (for precedence) tells the EXAKT program in which order to consider separate events that occur at the same date/time. For example, if an oil sample is drawn from an oil drain, we would wish that the sequence of the Inspection precede that of the oil change. The inspection event is implicitly given the precedence “0”.
Steps to follow: EventsDescription, X

Step 10
Detailed explanation: Examine the Models table. It contains no records yet. That is because you have not yet begun building a model. This table is populated automatically by EXAKT as you proceed. The only time you might access this table manually would be to delete certain sub-model(s) that you do not wish to retain. A sub-model is one of any number of models that are tested in the modeling process. The sub-model that is considered the best is then exported to become the intelligent agent that will provide decision optimization on a particular item’s condition data.
Steps to follow: Models, X
Step 11
Detailed explanation: Now that we have examined the internal and external database tables we are ready to proceed with the development of a rudimentary CBM optimization model. We turn our attention to the right hand window pane containing buttons arranged in a flow chart of activities. We enter the general project data.
Steps to follow: Data Preparation, General Event Data, Project Title: Haul Trucks, CBM Model: Trans Oil Anal, Description: 350 T Transmission Oil Analysis, Time Unit: Hrs., OK

Step 12
Detailed explanation: Next we instruct EXAKT to assemble the Events and Inspections into a single table C_Inspections to be used for subsequent calculations. Depending on which version of EXAKT you are using there are a number of alternative buttons we may hit. But for this exercise please choose the option similar to “Covariates – Complete”. After hitting this button two more tables will appear in the left pane, C_Events and C_Inspections.
Steps to follow: With Covariates (Complete)

Step 13
Detailed explanation: Examine the C_Inspections table. Note that the records of both tables (Events and Inspections) have been combined and arranged in chronological order in the single table C_Inspections. Inspection (condition monitoring) record events are designated by an *. The other event records have monitored data (covariate) values set to their initialized levels according to the CovariatesOnEvent table discussed previously.
Steps to follow: C_Inspections, X
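The combining of Events and Inspections into one chronological table, with precedence used to order records that share a date/time, can be sketched as follows. This is an illustration only and not EXAKT's implementation. The function name and tuple layout are hypothetical; the precedence values for B, EF and ES are the ones entered in Exercise 2 of this appendix, while the value for OC is an assumption.

```python
from datetime import datetime

# Lower precedence is considered first when records share a date/time.
# B, EF, ES values follow Exercise 2; the OC value is an assumption.
PRECEDENCE = {"EF": 2, "ES": 3, "OC": 4, "B": 6}


def merge_records(inspections, events):
    """Return one chronological list of (ident, when, code) rows.

    Inspections are marked '*' and implicitly carry precedence 0, so an oil
    sample drawn at an oil drain sorts ahead of the oil change itself.
    """
    rows = [(ident, when, 0, "*") for ident, when in inspections]
    rows += [(ident, when, PRECEDENCE[code], code) for ident, when, code in events]
    # Sort by unit, then date/time, then precedence to break ties.
    rows.sort(key=lambda r: (r[0], r[1], r[2]))
    return [(ident, when, code) for ident, when, _, code in rows]


# Hypothetical records: a sample taken at the same moment as an oil change.
drain_time = datetime(2004, 3, 1, 8, 0)
inspections = [("17-66", drain_time)]
events = [("17-66", drain_time, "OC"), ("17-66", datetime(2004, 1, 5, 8, 0), "B")]
for row in merge_records(inspections, events):
    print(row)
```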
Step 14
Detailed explanation: Now let’s begin the “modeling” phase of the analysis. Hit the “Modeling” button in the “Transmissions Oil Analysis(*):2” window, not the “Modeling” menu item. After executing steps A below, the Trans Oil Anal (ilcm) report window appears. Examine the report. The “Summary of Events and Censored Values” presents the overall summary of the data being analyzed. A “Sample Size” of 13 means that there are 13 histories or lifetimes having a beginning and some kind of ending event. Of the 13 histories, 6 ended in failure, 3 (Censored (Def)) ended prior to a failure, and 4 (Censored (Temp)) units are currently in operation at the time of building this model. They are referred to in EXAKT as “temporary suspensions” and are identified automatically by the software. The next tabulation, “Summary of Estimated Parameters”, provides the results of our first sub-model “ilcm”. The column “Sign.” indicates whether the “Parameter” is significant, that is, whether it has been found to be statistically related to failure. The Shape (i.e. WorkingAge), Iron, and Lead are designated as significant (at this point in the analysis) while Calcium and Magnesium are not. Note that Magnesium has the highest p-value; the p-value represents the relative probability that Magnesium has no significant impact on risk of failure. The next step is to try a different model by eliminating the lowest impact variable, Magnesium. Close the window and execute steps B and C to create 2 more sub-models. Notice that we are successively removing the covariate with the highest reported p-value. After hitting “OK” you will receive an alert warning message from EXAKTm telling you that the procedure is over. This is normal for samples of small size (low number of histories ending in failure). You may safely ignore this message by hitting OK in the message box. Each of the reports produced from the different models may be printed (Ctrl-P). The columns in the reports are explained in the Exakt Manual accessible from the Windows Start menu.
Steps to follow:
A. Modeling, Weibull PHM, Select Covariates, sub-model Name: ilcm, Iron, →, →, →, →, OK, X
B. Select Covariates, sub-model Name: ilc, Magnesium, ←, OK, get “Warning: The procedure is over …” X X
C. Select Covariates, sub-model Name: il, Calcium, ←, OK, “Warning: The procedure is over …” X X
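To make the roles of the shape parameter and the covariates concrete, the sketch below evaluates the usual form of the Weibull proportional hazards model, in which a Weibull baseline hazard in working age is multiplied by an exponential term in the covariates. The parameter names (beta, eta, gammas) and all numeric values are assumptions chosen for illustration; they are not estimates from the transmission data set.

```python
import math


def weibull_phm_hazard(working_age, covariates, beta, eta, gammas):
    """Hazard h(t, Z) = (beta/eta) * (t/eta)**(beta - 1) * exp(sum(gamma_i * z_i)).

    beta   -- shape parameter (the effect of working age)
    eta    -- scale parameter, in the same units as working_age
    gammas -- dict of covariate name -> coefficient
    """
    baseline = (beta / eta) * (working_age / eta) ** (beta - 1)
    linear_term = sum(gammas[name] * value for name, value in covariates.items())
    return baseline * math.exp(linear_term)


# Hypothetical parameters for a two-covariate sub-model such as "il".
params = {"beta": 2.0, "eta": 10000.0, "gammas": {"Iron": 0.05, "Lead": 0.3}}

low = weibull_phm_hazard(8000, {"Iron": 10, "Lead": 2}, **params)
high = weibull_phm_hazard(8000, {"Iron": 40, "Lead": 2}, **params)
print(f"hazard with low iron:  {low:.6f} per hour")
print(f"hazard with high iron: {high:.6f} per hour")
```

A covariate reported as not significant is one whose coefficient cannot be distinguished from zero, so dropping it leaves the computed hazard essentially unchanged.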
Step 15
Detailed explanation: At this point we have a sub-model with covariates and shape parameter that are all significant. We may conclude that this, therefore, is potentially an acceptable model for failure risk prediction. To be rigorous, we should test one last possible combination: a sub-model with iron alone. (We choose Iron as it is the variable with the lowest p-value and thus is likely to have the strongest relationship to failure.) The report tells us that this is also a potentially good predictive model (i.e. iron alone is still significant). In the next step we decide which of the two sub-models should be retained and later deployed.
Steps to follow: Select Covariates, sub-model Name: i, Lead, ←, OK, X

Step 16
Detailed explanation: After executing the steps below the “PHM Parameter Estimation - Comparison” report is displayed. The “N” in the second column is telling you that the sub-model “i” is not close to the base sub-model “il”. This means that this simpler sub-model is not as good as il and that we would be losing confidence by using it rather than the more complete model “il”.
Steps to follow: Comparative Report, Compare: il, i, →, OK, X

Step 17
Detailed explanation: In this step we examine the results of statistical testing performed by EXAKT on the retained sub-model, il. Reactivate this model with the steps below. Use the menu item “Modeling”.
Steps to follow: Modeling (menu item), Select Current Model, Sub model: il, OK.

Step 18
Detailed explanation: Now hit the Modeling button (not the Modeling menu item). The third table of the “PHM Goodness of Fit Test” tells us that the proportional hazards model we constructed for risk as a function of working age and the two significant covariates “fits” the data well enough for it to be used with a confidence of 95%. The test used for this is known as the Kolmogorov-Smirnov test and is well accepted as a statistical tool. The test shows that the model is not rejected at the 5% significance level, i.e. it is accepted at a 95% confidence level.
Steps to follow: Modeling, Weibull PHM, Summary Report, X
Step 19
Detailed explanation: After executing the steps of (A) below we see that EXAKT has created a set of bands (listed under Interval Start Points) or “transition” states for Lead with which to build a “transition probability model”. The transition probability model calculates the probability of jumping to another state at the next inspection interval. (An example of what we mean by jumping to another state will be given below in step 20.) Execute step (B) and notice the transition bands provided for Iron are quite different. This is because historical iron measurements are scattered throughout an entirely different range of values. This can be ascertained using EXAKT's cross-graph function (see user guide). Execute step (C) to close the window.
Steps to follow:
(A) Transition Probability Model, Covariate Bands, Covariate: Lead
(B) select Covariate Iron
(C) OK

Step 20
Detailed explanation: Execute step A. Notice that the two buttons “Display Matrix” and “Display Survival” become active. Let’s examine the Display Survival function report. Set WorkingAge to, say, 8000 hours, and Observation Interval to, say, 200 hours (assuming, for example, that our asset is currently at age 8000 and we are interested in knowing its risk of failure in the next 200 hours). The “Markov Chain Model Survival Probability matrix” report is displayed. The probabilities of Iron values jumping to another state and the probability of failure in the upcoming interval are displayed in a tabular format. (This table represents only a part of the entire set of transition probabilities taken into account by the model, since we have chosen to ignore the other significant covariate, Lead, in this report. To include more than one covariate in the visual report would require the representation of multi-dimensional matrices. Instead this report allows us to see how a single variable changes irrespective of the others.) Looking at the table we see, for example, that the cell "0-4.004" and "4.004-9.009" has the entry 0.301615. This means that there is a 30.1615% probability that iron will be in that state at the next monitoring interval. Hence this report provides the probabilities of being in any state at some future time. (Of course, this report is provided for analysis purposes only while building the model. The transition probabilities are fully integrated into the final decision model that will be deployed in section 2.)
Steps to follow: (A) Transition Rates, Display Survival, Working Age: 8000, Observation Interval: 200, Report. Close the report and the “Display Survival Probabilities” dialog.
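The idea of a covariate jumping between bands from one inspection to the next can be sketched with a small transition matrix. The three bands and every probability below are hypothetical and are not taken from the EXAKT report; the sketch only shows how a table of one-interval transition probabilities yields the probability of being in each band several inspections ahead.

```python
def step(distribution, matrix):
    """Advance one inspection interval: new[j] = sum_i distribution[i] * matrix[i][j]."""
    size = len(matrix)
    return [
        sum(distribution[i] * matrix[i][j] for i in range(size))
        for j in range(size)
    ]


def after(distribution, matrix, intervals):
    """State probabilities after the given number of inspection intervals."""
    for _ in range(intervals):
        distribution = step(distribution, matrix)
    return distribution


# Hypothetical one-interval transition probabilities between three iron bands
# (low, medium, high). Each row sums to 1.
TRANSITIONS = [
    [0.70, 0.25, 0.05],
    [0.00, 0.60, 0.40],
    [0.00, 0.00, 1.00],
]

start = [1.0, 0.0, 0.0]  # the unit is currently in the low band
for n in (1, 2, 3):
    probabilities = after(start, TRANSITIONS, n)
    print(n, [round(p, 4) for p in probabilities])
```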
Step 21
Detailed explanation: Now for the final step in developing a decision optimization model. We blend into the model the economics governing the failure and repair of this item. That is, we apply the average cost of a preventive repair, C, and the average cost (including consequential costs) of a failure, C+K. (It is rarely necessary to have great precision in these amounts for relative costs. The cost sensitivity function of EXAKT allows us to confirm this for the decision model in question. Its usage is described in the EXAKT help file guide.) After hitting the Report Icon (which you'll find to the left of the Print Icon on the Tool Bar), the “Condition Based Replacement Policy – Cost Analysis” report appears. Examine the “Summary of Cost Analysis” table below the Cost Function graph. It is telling you that by adhering to the interpretive decisions of the model, an optimal long run ratio of preventive to failure replacements will be 98.8:1.2, which will result in a cost savings of 75.1% relative to a replacement-only-at-failure policy. (The cost comparison reporting function similarly compares the optimal EXAKT policy with existing practice. Its usage is described in the EXAKT help file guide.)
Steps to follow: Decision Model, Decision Model Parameters, Replacement (C): 1200, Failure (C+K): 6000, Cost Unit: $, Inspection Interval: 250, OK, Full report Icon (two icons to the left of the Print Icon), X
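The arithmetic behind a cost comparison of this kind can be sketched as follows. The costs (1200 and 6000) and the 98.8:1.2 replacement mix come from the step above. The two mean working ages between replacements are hypothetical placeholders, since they are not given here, so the printed savings figure is illustrative and is not expected to reproduce the 75.1% in the EXAKT report.

```python
def cost_per_replacement(cost_preventive, cost_failure, fraction_failure):
    """Average cost of one replacement for a given mix of preventive and failure work."""
    return (1 - fraction_failure) * cost_preventive + fraction_failure * cost_failure


def savings(rate_policy, rate_reference):
    """Fractional saving of one cost rate relative to a reference cost rate."""
    return 1 - rate_policy / rate_reference


C = 1200.0          # preventive replacement cost, from step 21
C_PLUS_K = 6000.0   # failure replacement cost, from step 21

average_cost = cost_per_replacement(C, C_PLUS_K, fraction_failure=0.012)
print(f"average cost per replacement under the policy: ${average_cost:.2f}")

# Hypothetical mean working ages (hours) between replacements.
mean_life_policy = 9000.0
mean_life_run_to_failure = 11000.0

rate_policy = average_cost / mean_life_policy
rate_run_to_failure = C_PLUS_K / mean_life_run_to_failure
print(f"illustrative saving: {savings(rate_policy, rate_run_to_failure):.1%}")
```

The point of the sketch is that the saving depends on cost per unit of working age, not on cost per replacement alone: a policy that replaces somewhat earlier can still be far cheaper if it converts most failures into preventive work.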
Step 22
Detailed explanation: We have been, up to now, building a model based on the historical data from the entire fleet. We may now test the model on any individual unit either for the current situation (i.e. the latest data available in the database, called "LH" for last history) or we may look at any other history retroactively. The steps below display the reports of the latest monitored values of each unit. Four graphs are shown, one for each of the four units 17-66, 17-67, 17-77 and 17-79. By examining the four graphs we see that none are in alarm at the current moment when this snapshot of the data has been made. If the weighted sum of the significant covariates (i.e. the y-axis plotted variable) falls in the green region, no action is necessary; in the yellow, the item should be renewed before the next monitoring interval; in the red, the item should be repaired or replaced immediately. It should be noted that these boundaries vary with working age, which reflects the analysis findings that working age, as well as Iron and Lead, are significant failure risk factors. At some point in the past the values for 17-67 hit the red zone. This may indicate a spurious laboratory result that was corrected in a follow-up verification. (For modeling, known incorrect data should be removed from consideration.) Note that the x-axis scale differs from graph to graph depending on the current age of the unit.
Steps to follow: Decisions, 17-66, shift+17-79, Report, X, Report Icon, PgDn, PgDn, PgDn, X
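The reading of the decision graph can be sketched as a small classification routine: compute the weighted sum of the significant covariates and compare it with two limits that depend on working age. Everything numeric below is an assumption made for illustration, including the coefficients and the shape of the two limit curves; the actual limits are computed by EXAKT from the fitted model and the cost data.

```python
import math


def composite_score(covariates, gammas):
    """Weighted sum of covariates (the quantity plotted on the y-axis)."""
    return sum(gammas[name] * value for name, value in covariates.items())


def zone(score, working_age, red_limit, yellow_limit):
    """Classify a reading as 'red', 'yellow' or 'green' at the given working age."""
    if score >= red_limit(working_age):
        return "red"      # repair or replace immediately
    if score >= yellow_limit(working_age):
        return "yellow"   # renew before the next monitoring interval
    return "green"        # no action necessary


# Hypothetical coefficients and age-dependent limits (limits fall as age rises).
GAMMAS = {"Iron": 0.05, "Lead": 0.3}


def red_limit(working_age):
    return 12.0 - 0.8 * math.log(working_age)


def yellow_limit(working_age):
    return 11.0 - 0.8 * math.log(working_age)


reading = {"Iron": 60, "Lead": 10}
score = composite_score(reading, GAMMAS)
for age in (1000, 8000):
    print(age, zone(score, age, red_limit, yellow_limit))
```

With these made-up limits the same oil analysis reading falls in a less severe zone on a young unit than on an old one, which is the behaviour described above for boundaries that vary with working age.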
Step 23
Detailed explanation: The analysis and model building phase is complete. We are now ready to export the optimal decision model we created into our maintenance system environment (where it has access to continuously renewing data) so that it can do its job. Activate the pane on the left by clicking it. By hitting Save as instructed below, you are sending the model to a database located on the network. But before you do so, we will, for expedience, copy the script onto the clipboard as instructed. Then hit Save. You will notice that several new table links to an external database have been added to the tree in the left pane. Now that the ODBC links have been set up, we proceed to the actual export step next.
Steps to follow: Hit anywhere in explorer (left) pane, ModelDbase, Connect to Database Script, key in the script for exporting the model (actually it has been keyed in for you in this sample), select the entire script, ctrl-c, Save

Step 24
Detailed explanation: After executing the steps below you may examine the tables DecModels, UnitToModel, DecCovariatesOnEvent, DecEventsDescription (by double clicking on the file names in the tree view of the left pane) to see just what information has been exported to the external database. Please proceed to Section II of this tutorial in order to deploy the decision model that you have just created. You may close the EXAKT Modeling (EXAKTm) program.
Steps to follow: ModelDbase, Store the Decision model
Deploying the Decision Model as an Intelligent Agent

Step 1
Detailed explanation: In this section we run the “agent” manually. (It can also be set up to run automatically.) After you execute the steps below, the user interface of the EXAKTd decision agent appears.
Steps to follow: Start, Programs, Exakt, Exakt for Decisions

Step 2
Detailed explanation: Execute the steps below to create a working database for decisions (Transmission_WDEC.mdb).
Steps to follow: File, New, Navigate to c:\Program Files\Exakt\data, Transmission_WDEC.mdb, Create

Step 3
Detailed explanation: Now we will link (via ODBC) to the database where we previously exported our model (Step 24 of Section I). After executing the steps below you will see the name of the Model you created, “Trans Oil Anal”, in the top left pane.
Steps to follow: Setup, Connect to model database script, ctrl-v (or copy and paste the script shown here), hit Save
DATABASE="Transmission_DMDR.mdb";
ATTACH DecModels, UnitToModel, DecCovariatesOnEvent, DecEventsDescription, Decisions

Step 4
Detailed explanation: After executing this step, you will see each of the units whose optimal decisions for oil analysis will be governed by this model. (New units may be added easily in the EXAKTd program.)
Steps to follow: Expand “Trans Oil Anal”

Step 5
Detailed explanation: By selecting any unit in the top left pane, we see a list of properties but no values. We will next run the agent manually on the latest available set of condition monitoring oil analysis data.
Steps to follow: 11-66

Step 6
Detailed explanation: Now you will re-select the Model “Trans Oil Anal” and execute the decision agent by following the steps below.
Steps to follow: Trans Oil Anal, Reports, Create reports, Calculate time to replace
Step 7
Detailed explanation: The results of the entire fleet have been analyzed and decisions have been returned for each unit. You may examine the reports of each fleet member by following the steps below.
Steps to follow: Report icon, expand report window, PgDn, PgDn, PgDn, X (of the sub-window or pane, not the main window)

Step 8
Detailed explanation: With “Trans Oil Anal” selected you can conveniently examine the optimal decisions for the entire fleet on one list in the right window. You are actually examining the contents of the Decisions table of the Transmission_DMDR.mdb database. This database can be accessed easily by any program, such as your CMMS. This implies that the decision model’s operation and its results may be integrated within existing maintenance system software. In other words, the EXAKTd program need not be used at all. However, it does have a very convenient user interface and several useful functions, some of which are described in the following steps.
Steps to follow:
Reports, Create new report list, New Report List Name: Indoor trucks, OK
Reports, Create new report list, New Report List Name: Outdoor trucks, OK
Select “Trans Oil Anal”, Select 17-66 + 17-67, ctrl-c, Select Indoor Trucks, ctrl-v
Select “Trans Oil Anal”, Select 17-77 + 17-79, ctrl-c, Select Outdoor Trucks, ctrl-v

Step 9
Detailed explanation: Now we will use the new report lists to help manage our trucks by department.
Steps to follow:
Select Indoor Trucks, Reports, Create Reports, Calculate time to replace
Select Outdoor Trucks, Reports, Create Reports, Calculate time to replace

Step 10
Detailed explanation: This completes this section of the Tutorial. This has been a minimal exercise to demonstrate a small portion of the EXAKT functionality. Please refer to the On-line guide (available on your Start | Programs | EXAKT menu) for a much more detailed treatment of the subject of CBM optimization.
Exercise 2 Complex Items

Building the CBM Optimal Decision Model
Install the required database files from the CD (menu item “CBM Optimization”) or from
this (link) if you are connected to the internet.
Step 1
Detailed explanation: Open the EXAKT modeling program.
Steps to follow: Start, “EXAKT Tools”, “EXAKT for Modeling”

Step 2
Detailed explanation: The databases ComplexItemsDemo_MES.mdb and ComplexItemsDemo_WMOD.mdb are to be used for this tutorial. If installed from the menu on the CD they will be in the folder “C:\Program Files\Exakt\tutorial2”.
Steps to follow: File, Open, navigate to /…/ComplexItemsDemo_WMOD.mdb, Open

Step 3
Detailed explanation: After executing the instructions below, the required tables are now accessible in the left pane of the EXAKT (for modelling) window.
Steps to follow: Modelling (on the Menu bar), Data setup, type in the attachment script (actually it is already keyed in for you), Execute, Save

Step 4
Detailed explanation: Initiate the project by giving it a title, and naming the first model (failure mode) to be analyzed as “Gear1”.
Steps to follow: Data Preparation, Enter General Data, Project Title: “GearboxA 2 failure modes”, CBM Model: “Gear1”, Description: “Tooth failure on Gear1”, Time Unit: “Hrs.”, OK

Step 5
Detailed explanation: After executing the instruction below, the dialog “Marginal Analysis Data Sheet” (shown in step 6) appears. In this dialog we set up the mappings in the tables of Figure 2.
Steps to follow: Marginal Analysis
Step 6
Detailed explanation: By executing the instructions below, we will assign:
• Idents (that tell EXAKT which idents, i.e. units, are to have their data included in the predictive model that we are currently building).
• Events (that tell EXAKT which named events in the database the model should use internally as B, EF and ES respectively).
• Variables (that tell EXAKT which variables to use and how to rename them for the model we are building. This allows the decision agent to display short meaningful names in the optimal decision graph.)
These mappings for the CBM Model “Gear1” are shown in the dialog reproduced below.
Steps to follow: Idents, Check GearboxA, Events Selection, B, Select Event: B, Precedence: 6, Apply, EF1, Select Event: EF, Precedence: 2, Apply, EF2, Select Event: ES, Precedence: 3, Apply, Variable Selection, Health_Indicator1, Select Variable: H1, Apply, Health_Indicator2, Select Variable: H2, Apply, OK.
[In the above you may be wondering why we are mapping EF2 to ES. The reason is that EF2 is a failure mode of Gear2 (to be modeled next). The current policy is to replace Gear1 preventively when Gear2 fails. Hence the failure of Gear2 marks the suspension (ES) of Gear1. Thirdly, the variable name Health_Indicator1 in the database is mapped to the variable name H1 used by the model. Shorter names are more convenient in building the model.]
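The effect of these mappings can be sketched in a few lines: database event names are translated into the model's internal B, EF and ES codes, and database variable names into the short names used by the model. The two dictionaries reproduce the Gear1 mappings entered in step 6; the function names, record layout and sample rows are hypothetical.

```python
# Mappings entered in step 6 for the CBM model "Gear1".
GEAR1_EVENTS = {"B": "B", "EF1": "EF", "EF2": "ES"}
GEAR1_VARIABLES = {"Health_Indicator1": "H1", "Health_Indicator2": "H2"}


def to_model_events(raw_events, event_map):
    """Translate (ident, working_age, name) rows to internal codes; skip unmapped events."""
    return [
        (ident, working_age, event_map[name])
        for ident, working_age, name in raw_events
        if name in event_map
    ]


def to_model_variables(inspection, variable_map):
    """Keep only the mapped variables of one inspection record, under their short names."""
    return {short: inspection[long] for long, short in variable_map.items()}


# Hypothetical database rows for the single unit GearboxA.
raw_events = [
    ("GearboxA", 0, "B"),
    ("GearboxA", 420, "EF2"),   # Gear2 failed: a suspension from Gear1's viewpoint
    ("GearboxA", 0, "B"),
    ("GearboxA", 610, "EF1"),   # Gear1 failed: a failure for the Gear1 model
]
print(to_model_events(raw_events, GEAR1_EVENTS))
print(to_model_variables({"Health_Indicator1": 3.2, "Health_Indicator2": 1.1}, GEAR1_VARIABLES))
```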
After completing the previous step, seven new tables appear in the left
pane: “CMI_Events”, “CMI_Inspections”, “Events”, “Inspections”,
“Histories”, “EventsDescription”, and “VarDescription”.
Step 7
Detailed explanation: We will now proceed to build the model for Gear1. After executing the instructions below a report appears. “Shape” is reported as non-significant (“N”).
Steps to follow: Modeling (on flow diagram), Weibull PHM, Select Covariates, sub-model Name: H1, H1 →, OK, X

Step 8
Detailed explanation: By rejecting “Shape”, the software is telling us that age is not a significant risk factor for the fracture of a tooth on Gear1. Therefore we will remove age from the model by fixing the shape parameter to “1”.
Steps to follow: Modeling, Weibull PHM, Select Covariates, sub-model Name: H1_B1, Fix shape parameter=1: check, OK, X
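Why fixing the shape parameter to 1 removes age from the model can be seen from the usual form of the Weibull proportional hazards model. The symbols below (β for shape, η for scale, γ for the coefficient of H1) are notation assumed here for illustration and are not defined in this exercise.

```latex
h(t, H1) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1} e^{\gamma\, H1}
\quad\xrightarrow{\;\beta = 1\;}\quad
h(H1) = \frac{1}{\eta}\, e^{\gamma\, H1}
```

With β = 1 the factor in working age t equals 1 for every t, so the risk of tooth fracture depends only on the health indicator and not on how long the gear has been in service.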
Step 8
Detailed explanation: Build the decision model. The dialogs for the Transition Probability Model (“Covariates Bands and Groups”) and the Decision Model Parameters are shown below.
Steps to follow: Transition Probability Model, Transition Rates, OK, Decision Model, Decision Model Parameters, Replacement (C): 1000, Failure (C+K): 6000, Cost Unit: $, Inspection Interval: 30, OK, Full Report Icon, scroll down to “Summary of Cost Analysis” table, X

Detailed explanation: Executing the instructions below displays this report. The report indicates that the gear has failed, but that the failure would have been predicted two sample intervals ago (60 hours) had the model been available.
Steps to follow: Decisions, GearboxA, All Histories, Select “GearboxA[1]”, Report, Full Report Icon, PgDn, PgDn, PgDn … X

Step 10
Detailed explanation: We have created and tested a decision model for Gear1. We may now, in the same way, generate a decision model for Gear2.
Steps to follow: Repeat steps 4 to 9, making obvious changes for the modeling of Gear2.

Step 11
Detailed explanation: Create a new database “ComplexItemsDemo_DMDR.mdb” with seven tables:
1. DecCovariatesOnEvent
2. DecEventsDescription
3. Decisions
4. UnitToModel
5. DecModels
6. DecVarToModel
7. DecEventToModel
Steps to follow: For this tutorial, ComplexItemsDemo_DMDR.mdb has already been created for you, so you do not have to do anything for this step. However, this can be easily done using the EXAKT tool Data Preparation for EXAKT. The procedure is: Start, EXAKT tools, Data Preparation for EXAKT, File, Build Corporate Database, Use Predefined Template, Decision Models (DMDR), Filename: ComplexItemsDemo_DMDR.mdb, Save, Enter Covariate Name: H1, Enter, H2, Enter, Marginal Analysis Format: Check, OK, File, Exit
Step 11
Detailed explanation: Attach the 7 tables from ComplexItemsDemo_DMDR.mdb by following the instructions below. The attached tables will appear in the tree view in EXAKT’s left pane.
Steps to follow: Activate the left pane Window, ModelDBase (on the Menu bar), Connect to Model Database Script, type or copy and paste the following script into the editing window that appears, hit Save.
DATABASE = "ComplexItemsDemo_DMDR.mdb";
ATTACH DecCovariatesOnEvent, DecEventsDescription, UnitToModel, DecEventToModel, DecVarToModel, Decisions, DecModels

Step 12
Detailed explanation: Assuming you have previously completed building the model for Gear2 (Step 10), execute the instructions below. This will save this model to the DMDR database.
Steps to follow: Activate left pane, ModelDBase, Store

Step 13
Detailed explanation: Make the model for Gear1 the current model.
Steps to follow: Modeling (on the menu bar), Select current model, CBM Model: Gear1, Submodel: H1_B1, OK

Step 14
Detailed explanation: Now Store the model for Gear1 to the DMDR database.
Steps to follow: Activate left pane, ModelDBase, Store

Step 15
Detailed explanation: Congratulations. You have created and exported two decision models for a complex item.
Steps to follow: Close the EXAKTm program.
Deploying the CBM Optimal Decision Model as an Intelligent Agent

Step 1
Detailed explanation: In this section we will manually run the “agent” so that it applies the two models that we have created to the current data. (It can also be set up to run automatically.) After you execute the steps below, the user interface of the EXAKTd decision agent appears.
Steps to follow: Start, Programs, “Exakt Tools”, “Exakt for Decisions”

Step 2
Detailed explanation: Execute the steps below to open ComplexItemsDemo_WDEC.mdb.
Steps to follow: File, Open, navigate to …, ComplexItemsDemo_WDEC.mdb, Open

Step 3
Detailed explanation: Attach the 7 tables from ComplexItemsDemo_DMDR.mdb. The left pane will show both models, and the list of equipment to which these models will be applied. In this case there is only one piece of equipment, “GearboxA”. However, if there had been a fleet of similar equipment, the units would have been listed below each model.
Steps to follow: Setup, Connect to Model Database Script, type (or copy and paste) this script into the window, hit Save
DATABASE="ComplexItemsDemo_DMDR.mdb";
ATTACH DecModels, UnitToModel, DecEventsDescription, DecCovariatesOnEvent, Decisions, DecEventToModel, DecVarToModel

Step 4
Detailed explanation: Select the model “Gear One” and execute the decision agent.
Steps to follow: Gear1, Reports, Create reports, Calculate time to replace

Step 5
Detailed explanation: Repeat step 4 for the model “Gear2”.
Steps to follow: Gear2, Reports, Create reports, Calculate time to replace

Step 6
Detailed explanation: The prognostic results are now available for GearboxA. Examine the results using any of the 3 ways below.
Steps to follow:
1. Click on “Gear1” or “Gear2”.
2. Expand “Gear One” or “Gear Two” and click on the gear unit that you are interested in (only GearboxA in this example).
3. Click on a gear unit, View, View Model Report.
Exercise 3 (Data Validation)

Data is the fuel of reliability improvement. This is especially true in condition based
maintenance. This exercise will provide you with a deep insight into the value of good
data practices, particularly regarding the records of the as-found condition of physical
assets.

Step 1
Explanation: In this exercise we will examine some of the data validation tools in EXAKT.
Required actions: Download the wheelmotor oil analysis data from wheelmotor.zip. (Not necessary if you are working from the CD and have hit "Get/Init datafiles")

Step 2
Explanation: This is a check for logical (chronological sequencing) errors. Examine the Data Check report. It will give you an overall picture of the sample, and indicate errors such as missing beginning or ending events.
Required actions: Start “EXAKT for Modeling” (resize windows so that these instructions and the EXAKT window can be viewed simultaneously), File, Open, Navigate to locate the file Mar2004CRC_WMOD (in c:\Program Files\Exakt\tutorial3\Mar2004CRC_WMOD if you extracted it from the CD), Open, Modeling (on menu bar), Select Current Model, CBM Model: PHM(no OC), OK, Activate Left pane (by clicking on it), Edit, Check Database, Data, Scroll down and look at this report, Reduce and Close the Report

Step 3
Explanation: Executing the required actions should give you the following screen.
Required actions:
A) Left pane, Open DataCheck table, double-click on “Description” column heading, View (menu bar), Inspections, Include Events View, OK
B) Arrange windows and panes so that the Inspections and Events window covers the top two-thirds of the screen and the DataCheck window the bottom third. The top window should have four panes.
Step 4
Explanation: The tables and views are all in automatic synchronization. This makes it easy to find and correct errors, as we shall see in subsequent steps. EXAKT has no way of distinguishing between missing ending events and “temporary” suspensions241. Therefore you will see many requests to “Check whether this history is temporary suspended or "EF/ES" is missing.” Make sure that all such indicated records correspond to units that are currently operating. EXAKT will then treat them as temporary suspensions. Otherwise the message means that you are missing an ending event, either an EF or an ES, and you must manually add the missing record. If the lifetime corresponding to the message is in fact ongoing at the moment, then you can ignore this message.
Required actions: DataCheck Window
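The reasoning in step 4 can be sketched outside EXAKT as a small check. This is only an illustration, not EXAKT's own logic: the tuple layout, the beginning-event code "B", and the function names are assumptions, and only the EF and ES codes come from the text above.

```python
from collections import defaultdict

# EF and ES are the ending-event codes named in the text.
# Any other code (for example a hypothetical "B" for beginning) is not an ending.
ENDING_EVENTS = {"EF", "ES"}

def find_open_histories(events):
    """Return idents whose last recorded event is not an ending event.

    `events` is a list of (ident, working_age, event_code) tuples,
    where each ident stands for one history (an assumption of this sketch).
    """
    by_ident = defaultdict(list)
    for ident, age, code in events:
        by_ident[ident].append((age, code))
    open_idents = []
    for ident, rows in by_ident.items():
        rows.sort(key=lambda row: row[0])  # order by working age
        if rows[-1][1] not in ENDING_EVENTS:
            open_idents.append(ident)
    return sorted(open_idents)

def classify_open_histories(events, operating_now):
    """Label each open history as a temporary suspension or a data error."""
    labels = {}
    for ident in find_open_histories(events):
        if ident in operating_now:
            labels[ident] = "temporary suspension"
        else:
            labels[ident] = "missing EF/ES"
    return labels
```

The second argument plays the role of the analyst's knowledge: only a person who knows which units are running today can tell a temporary suspension from a missing record.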
Step 5
Explanation: The 47th record of the DataCheck table has the description “This record can't be properly identified. It has the same Ident, Date, WAge, and Event as the previous record:Id=5503R 2, Date=...”
Required actions: DataCheck window, scroll to Record 47 and place cursor in Ident field of Record 47.

Step 6
Explanation: Note that record 819 is flagged in the Inspections table and the Events table likewise has its pointer positioned at record 404.
Required actions: Inspections window, widen the Date column so the full date is visible, scroll up 1 row on the scroll bar so that record 818 is visible

Step 7
Explanation: Note that record 818 corresponds to an oil sample taken on the same equipment on the same day. EXAKT is suspicious about this and is asking you to verify the dates and working ages for these two records. Maintenance planning personnel tell us that record 819 must be an error. Therefore we may delete it.242
Required actions: Delete record 819 (by selecting the row (with Al = 143) and hitting the Delete key).

Step 8
Explanation: Here is a similar type of problem. But in this case two samples have the same working age but different calendar dates. EXAKT is not pleased with this situation and is asking you to do something about it. You need to check if the equipment was really idle for one month.
Required actions: DataCheck window, record 53, Inspections window, scroll up one row so that records 6204 and 6203 are visible.

Step 9
Explanation: In this way one goes systematically through the database records, as indicated by the DataCheck table, correcting the anomalies that are pointed out by EXAKT.
Required actions: Do not bother making any more corrections for purposes of this exercise. Close the Inspections and DataCheck windows.
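The two kinds of anomaly met in steps 5 to 8 (two samples on the same day, and two samples with the same working age on different dates) can be expressed as a short check. The sketch below is a hypothetical illustration, not EXAKT's DataCheck procedure; the record layout and function name are assumptions.

```python
from datetime import date

def flag_suspect_records(records):
    """Compare each record with the previous one for the same ident.

    `records` is a list of dicts with keys "ident", "date" and
    "working_age", already ordered as in the Inspections table.
    Returns (zero-based index, reason) pairs for suspicious records.
    """
    flags = []
    for i in range(1, len(records)):
        prev, cur = records[i - 1], records[i]
        if prev["ident"] != cur["ident"]:
            continue  # a new unit starts here, nothing to compare
        if prev["date"] == cur["date"]:
            flags.append((i, "same date as previous record"))
        elif prev["working_age"] == cur["working_age"]:
            flags.append((i, "same working age, different date"))
    return flags
```

A flag is only a question for the analyst. As in step 7, the decision to delete or correct a record rests on what maintenance personnel can confirm.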
Step 10
Explanation: After following the required actions you will have reproduced the graph below.
Required actions: View, Cross Graph, maximize window, Table: Inspections, Horizontal: WorkingAge, Vertical: SI, Condition: Si<1000, Show

241 A temporary suspension is a cut-off of a lifetime that is still ongoing. It has been “temporarily” suspended by the snapshot of the data at the time of analysis.
242 Deletions and changes should always be carried out on a copy of the database. You should keep a record of all changes that you have made to the data, then save the summary with the database as a dated version. It is convenient to do this on a write-once CD. That way you can easily go back to some previous version of the database if you have made changes that need to be reversed. These are proper work habits for modelers.
Step 11
Explanation: After following the required actions you will have reproduced the following graph.
Required actions: Horizontal: Fe, Vertical: Si, delete “Si<1000”, Show, reduce, X
Step 12
Explanation: Examine the OutputVarScript. It uses a succinct data query language to conveniently transform combinations of existing covariates into new covariates for building and testing risk models. The “*(=, >, or < statement)”, shown on several lines of this program, is read “where statement true”. The statement of interest is the next to last:
    CorrSi=Si*(Si<>900)+1.2*Fe*(Si=900);
It is telling the program to return the actual value of Si where Si<>900 and to use 1.2*Fe where Si=900.
Required actions: Database explorer pane (left pane), OutputVarScript, X
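For readers who want to see the arithmetic behind the CorrSi statement, here is a minimal Python sketch of the same idea. It is not EXAKT script; the function name is an assumption. It relies on the fact that a Python comparison evaluates to True or False, which behave as 1 and 0 in arithmetic, mirroring the “where statement true” reading.

```python
def corr_si(si, fe):
    """Return Si unchanged unless Si equals 900, then return 1.2 * Fe.

    Each comparison acts as a 0/1 switch, so exactly one of the
    two terms contributes to the result.
    """
    return si * (si != 900) + 1.2 * fe * (si == 900)

# A reading of Si = 350 passes through; a reading of Si = 900 is
# replaced by a value derived from the iron reading.
normal_reading = corr_si(350, 40)
replaced_reading = corr_si(900, 50)
```

With Si = 350 and Fe = 40 the result is the Si value itself. With Si = 900 and Fe = 50 the result is 1.2 times 50.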
Step 13
Explanation: After following the required actions you will have reproduced the graph shown below.
Required actions: Modeling (on menu bar), Create Model Input tables, Complete data, View, Cross Graph, maximize, Table: C_Inspections, Horizontal: Fe, Vertical: CorrSI, Show, reduce, X
Step 14
Explanation: EXAKT correctly handles events (such as oil changes, adjustments, alignments, calibrations and other minor maintenance) that impact condition data. The required actions will display the table illustrated below. It is often useful to display the events and inspections in a single table. Note the regularity of the oil change events. For a period of about 5 months, from 7/6/94 to 11/21/94, no oil change (OC) events are indicated, whereas oil changes were previously performed about every month. We suspect that the oil changes occurred but were not reported.
Required actions: Modeling (on menu bar), Select Current Model, CBM Model: PHM(with OC), OK, Activate Left pane (Database explorer pane), Modeling (on menu bar), Create Model Input tables, Complete data, Database pane, C_Inspections, Scroll to record 356, reduce and close the C_Inspections table
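A gap such as the one described in step 14 can also be found programmatically. The following sketch is an illustration only: the function name and the threshold are assumptions, and the dates 7/6/94 and 11/21/94 are read as month/day/year.

```python
from datetime import date

def find_event_gaps(event_dates, max_gap_days):
    """Return (start, end, days) for consecutive events further apart
    than `max_gap_days`."""
    ordered = sorted(event_dates)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        days = (later - earlier).days
        if days > max_gap_days:
            gaps.append((earlier, later, days))
    return gaps

# Hypothetical oil change dates. Only the two dates bounding the gap
# are taken from the text; the others are invented for the example.
oil_changes = [
    date(1994, 5, 5),
    date(1994, 6, 6),
    date(1994, 7, 6),
    date(1994, 11, 21),
    date(1994, 12, 20),
]
suspect_gaps = find_event_gaps(oil_changes, max_gap_days=45)
```

With roughly monthly oil changes, a threshold of 45 days isolates the single interval between 7/6/94 and 11/21/94.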
Step 15
Explanation: Executing the required actions will display a graph similar to that found on page. One history falls outside the 5-95% lines, violating the estimate of the proportional hazard model. One of the lifecycles does not fit the model. Why?
Required actions: Modeling (on menu bar), Select Current Model, CBM Model: PHM(noHistExcl), Submodel: FeCorrSed, OK, Procedures panel, Modeling, Weibull PHM, In Order of Appearance, close the graph
Step 16
Explanation: Follow the required actions. When we scroll down to the last row, we see the history number (also shown below) of the offending history. The number is found to be 64.
Required actions: Database pane, “Residuals: PHM(noHistExcl)(FeCorrSed) #1”, click on the “Residual” column header to order the records by Residual, scroll down to last row, note the History Number of 64, close the table
Step 17
Explanation: History numbers (such as 64) are applied by EXAKT to the life cycles in chronological order. We must identify which life cycle of which unit is the offending one. Following the required actions, we find that the history (life cycle) is the 2nd history of unit 5509R (see dialog box).
Required actions: Procedures panel, Decisions, All Histories, Select History 5501L[1] (that is, the first lifetime of the left wheelmotor of haul truck 5501), hit the DnArrow key 63 times, Close
Step 18
Explanation: We need to examine the cause of the offending history. The required actions reproduce the table and graph below. From this figure, we observe that the cause of the offending history is the unusually high values of Fe and Si not explained by a failure event. A reasonable way to obtain a better-fitting model is to assume that a maintenance event was not properly recorded and to exclude this history from the model.
Required actions: View, Inspections, Include Events View, View by history, Select All, Uncheck, move all variables to “Unselect” position, move Iron and Si to “Selected” position (as shown in image), OK, Select 5509R[2]
Exercise 4 (data smoothing and fixing shape factor to 1)
Random fluctuation of monitored condition data characterizes many otherwise
straightforward CBM applications. In this exercise we use monitored pressure test data
that reflect the deterioration of a sealing system in a nuclear fuel rod manipulating
mechanism. For additional background and details on this application, you may refer to
the document www.omdec.com/articles/reliability/paperCandu.html.
Step Explanation Required actions

Step 1
Required actions: Download the database files from www.omdec.com or from the OMDEC CD.

Step 2
Required actions: Start “EXAKT for Modeling”, File, Open, Navigate to locate the file candu_WMOD database file (in c:\Program Files\Exakt\tutorial4\ if extracted from the CD)
Step 3
Explanation: Note the randomness yet increasing nature (generally rising slope) of the data. Although it is obvious that the item ages in a fairly linear fashion, how does one make a decision at any given inspection if the data is so erratic? How do we know if a high reading is due to noise or to a deteriorating failure mode? The following steps in EXAKT provide a solution to this problem.
Required actions: Activate left (database explorer view) pane, View, Inspections, OK, Ident drop down list, hit various idents and observe their corresponding sets of inspection data, reduce the inspections window, close (X) the inspections window.
Step 4
Explanation: EXAKT provides a way to perform “smoothing transformations” of the data. In the OutputVarScript window you will see a small program that transforms the original variable LeakRate into the transformed variables leakSmooth and leakSmoothAve. EXAKT’s programming language provides several smoothing functions. Smooth() and SmoothAve() are smoothing functions that take parameters to adjust the way in which they transform the variables.
Required actions: Database pane, OutputVarScript, X
(Note that we have defined 4 new variables from the original LeakRate and WorkingAge variables: leakSmooth0, leakSmooth, leakSmoothAve0, and leakSmoothAve. By reading the comments in this script and by studying (in the Guide and Manual) the definitions of the various EXAKT transformation functions such as Smooth(), SmoothAve(), Last() and NonDecr(), you will soon get to understand how these transformations work. For now, just continue to step 5.)
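To give a feel for what a smoothing transformation does to noisy readings, here is a minimal sketch of a trailing moving average. It is a generic technique chosen for illustration and is not claimed to be the algorithm behind EXAKT's Smooth() or SmoothAve(); the function name, the window parameter, and the sample readings are assumptions.

```python
def trailing_average(values, window):
    """Replace each reading by the mean of itself and up to
    `window - 1` readings before it.

    Only past and present readings are used, so the smoothed value
    at the latest inspection is available when a decision is needed.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical leak-rate readings with a rising trend and noise.
leak_rate = [1.0, 3.0, 2.0, 6.0]
leak_smooth = trailing_average(leak_rate, window=2)
```

A wider window removes more noise but responds more slowly to a genuine change in the trend, which is the trade-off that the parameters of any smoothing function control.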
Step 5
Explanation: The required actions generate the decision graphs of the model built directly on the original (untransformed) data. Observe how much randomness there is in the inspection data. Such randomness may bias the model and may make it difficult to clearly apply an optimal decision.
Required actions:
A) Modeling (on menu bar), Select Current Model, CBM Model: Seals, Submodel: LR_b1, OK, Procedures panel, Decisions, Select Ident: 5EH1, scroll down to last row, shift+8WH4, Report, Close, PageDown or PageUp, X
B) Modeling (on Procedures panel), Weibull PHM, Select Covariates (note the variable used for this model LR_b1 is LeakRate), Cancel
Step 6
Explanation: The model LR_Smooth0 uses a variable that has been smoothed by the Smooth() function in EXAKT. On the decision graphs, we observe that we have eliminated the randomness of the previous submodel. But we have another problem. We observe a drooping artifact243 at the end of every history. This causes a poor model and a poor decision recommendation because the current value of the condition indicator leakSmooth0 is erroneously low! In step 7 we will correct this problem with a further transformation.
Required actions:
Repeat Step 5A but select the submodel LR_Smooth0 instead of LR_b1
Repeat Step 5B but note the variable used for this model LR_Smooth0 is leakSmooth0, Cancel
Step 7
Explanation: The adjusted smoothed variable produces a better model and a better decision recommendation. Note that the randomness of the data is further reduced and the drooping artifact has been corrected.
Required actions:
Repeat Step 5A but this time use the submodel LR_Smooth
Repeat Step 5B but this time note that the variable used in the submodel LR_Smooth is leakSmooth
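The text does not say how Smooth() computes its values or what produces the droop at the end of each history. One common mechanism, assumed here purely for illustration, is a centered smoothing window that runs past the last inspection and treats the missing future readings as zero. The sketch below contrasts that behavior with a window that is simply truncated at the ends. All names and values are hypothetical.

```python
def centered_average(values, half_width, pad_with_zeros):
    """Centered moving average with two ways of handling the ends.

    pad_with_zeros=True  : readings outside the history count as zero,
                           which pulls the end values down (a droop).
    pad_with_zeros=False : the window shrinks at the ends, so only
                           real readings are averaged.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = i - half_width, i + half_width
        if pad_with_zeros:
            total = sum(values[j] for j in range(lo, hi + 1) if 0 <= j < n)
            out.append(total / (2 * half_width + 1))
        else:
            window = values[max(0, lo):min(n, hi + 1)]
            out.append(sum(window) / len(window))
    return out

# A perfectly flat hypothetical history makes the artifact easy to see.
flat = [10.0, 10.0, 10.0, 10.0, 10.0]
drooping = centered_average(flat, half_width=1, pad_with_zeros=True)
corrected = centered_average(flat, half_width=1, pad_with_zeros=False)
```

In the padded version the last smoothed value falls below the true level of 10 even though nothing changed in the data, which is the kind of erroneously low current value that would mislead a decision.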
Step 8
Explanation: Now that we have seen some techniques for pre-processing data to eliminate confusing noise, we may look more closely at the model itself. You may be wondering about the naming convention we adopted for the model “LR_Smooth_b1”. The “b1” part of the name indicates that we have fixed Beta, the shape factor, to 1. We will proceed to learn why we did this.
Step 9
Explanation: We note, in carrying out the required actions, that this Submodel “LR_Smooth” uses the transformed variable leakSmooth and that the “Fix shape factor to 1” checkbox is unchecked.
Required actions: Modeling (on Procedures panel), Weibull PHM, Select Covariates, Cancel
Step 10
Explanation: Upon executing the required actions, we note that the model is rejected by the Kolmogorov-Smirnov test. The test is telling us that the hypothesis that the model is “good” (fits the data) must be rejected.
Required actions:
Residual Analysis, Summary Report, scroll down (note that the goodness of fit hypothesis is rejected), reduce window, X
Look at the modeling results in the orange framed "Parameters" window inside the Procedures window. Note the NS (not significant) indication after Shape = 1.35644.
Step 11
Explanation: EXAKT has told us in step 10 that working age is not significant. In fact it is highly significant, so much so that it correlates closely with the LeakRate. Thus EXAKT is really telling us that the LeakRate itself contains all the information we need to establish a good predictive model, and that we should remove WorkingAge as a significant factor from the model by setting Shape to 1.
Required actions:
Modeling (on menu bar), Select Current Model, LR_Smooth_b1, Modeling (on Procedures panel), Weibull PHM (note that the shape parameter has been fixed to 1 for this submodel), Cancel
Residual Analysis, Summary Report, expand and scroll down (note that the goodness of fit hypothesis is not rejected), X
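Why fixing the shape factor to 1 removes working age from the model can be seen from the hazard function. The sketch below assumes the commonly used Weibull PHM form h(t, Z) = (beta/eta) * (t/eta)^(beta - 1) * exp(sum of gamma_i * Z_i); the numerical values of eta, gamma and the covariate are hypothetical, and only the shape value 1.35644 is taken from step 10.

```python
import math

def weibull_phm_hazard(t, beta, eta, gammas, covariates):
    """Hazard rate at working age t (t > 0) for covariate values Z.

    beta   : shape factor
    eta    : scale factor
    gammas : one coefficient per covariate
    """
    linear = sum(g * z for g, z in zip(gammas, covariates))
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(linear)

# Hypothetical parameters: eta = 1000 and one covariate (a smoothed
# leak rate of 20) with coefficient 0.05.
eta, gammas, z = 1000.0, [0.05], [20.0]

# With the shape fixed to 1 the age term (t/eta)**0 is always 1,
# so the hazard depends on the covariate alone.
young_b1 = weibull_phm_hazard(100.0, 1.0, eta, gammas, z)
old_b1 = weibull_phm_hazard(5000.0, 1.0, eta, gammas, z)

# With a shape above 1 the same covariate value gives a higher
# hazard at a greater working age.
young_free = weibull_phm_hazard(100.0, 1.35644, eta, gammas, z)
old_free = weibull_phm_hazard(5000.0, 1.35644, eta, gammas, z)
```

With the shape equal to 1, the two ages give identical hazards, so all the predictive information is carried by the covariate, which is the interpretation given in step 11.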
Step 12
Explanation: Similar results can be found for the models LR_SmoothAve0_b1 and LR_SmoothAve_b1. You may go ahead and examine these models using the techniques you have learned in this exercise.
243 An artifact is an inaccurate observation that is due to the observation method.
Appendix 16. References to Chapter 13.
[1] K. F. Martin, A Review by Discussion of Condition Monitoring and Fault-

Diagnosis in Machine-Tools, International Journal of Machine Tools &
Manufacture, 34 (1994) 527-551.
[2] J. Lee, R. Abujamra, A. K. S. Jardine, D. Lin, D. Banjevic, An integrated platform

for diagnostics, prognostics and maintenance optimization, in: The IMS'2004
International Conference on Advances in Maintenance and in Modeling,
Simulation and Intelligent Monitoring of Degradations, Arles, France, 2004.
[3] S. Nandi, H. A. Toliyat, Condition monitoring and fault diagnostic of electrical

machines - a review, in: Thirty-Fourth IAS Annual Meeting, Vol. 1, Phoenix, AZ
USA, 1999, pp. 197-204.
[4] H. C. Pusey, M. J. Roemer, Assessment of turbomachinery condition monitoring

and failure prognosis technology, Shock and Vibration Digest, 31 (1999) 365-371.
[5] W. Q. Wang, F. Ismail, M. Farid Golnaraghi, Assessment of gear damage

monitoring techniques using vibration measurements, Mechanical Systems and
Signal Processing, 15 (2001) 905-922.
[6] M. Wang, A. J. Vandermaar, K. D. Srivastava, Review of condition assessment of

power transformers in service, IEEE Electrical Insulation Magazine, 18 (2002) 12-
25.
[7] A. El-Shafei, N. Rieger, Automated diagnostics of rotating machinery, in: 2003

ASME Turbo Expo, Vol. 4, Atlanta, GA, United States, 2003, pp. 491-498.
[8] T. K. Saha, Review of modern diagnostic techniques for assessing insulation

condition in aged transformers, IEEE Transactions on Dielectrics and Electrical
Insulation, 10 (2003) 903-917.
[9] R. M. Tallam, S. B. Lee, G. Stone, G. B. Kliman, J. Yoo, T. G. Habetler, R. G.

Harley, A survey of methods for detection of stator related faults in induction
machines, in: IEEE International Symposium on Diagnostics for Electric
Machines, Power Electronics and Drives, Proceedings, New York, 2003, pp. 35-
46.
[10] G. Sabnavis, R. G. Kirk, M. Kasarda, D. Quinn, Cracked shaft detection and

diagnostics: a literature review, The Shock and Vibration Digest, 36 (2004) 287-
296.
[11] H. Austerlitz, Data acquisition techniques using PCs, Academic Press, San Diego,
Calif., 2003.
[12] N. V. Kirianaki, S. Y. Yurish, N. O. Shpak, V. P. Deynega, Data Acquisition and
Signal Processing for Smart Sensors, John Wiley and Sons, Ltd., Chichester, West
Sussex, England, 2002.
[13] C. Davies, R. M. Greenough, The use of information systems in fault diagnosis, in:
Proceedings of the 16th National Conference on Manufacturing Research,
University of East London, UK, 2000.
[14] R. Xu, C. Kwan, Robust isolation of sensor failures, Asian Journal of Control, 5
(2003) 12-23.
[15] G. Dalpiaz, A. Rivola, R. Rubini, Effectiveness and sensitivity of vibration

processing techniques for local fault detection in gears, Mechanical Systems and
[16] A. J. Miller, A New Wavelet Basis For The Decomposition Of Gear Motion Error
Signals And Its Application To Gearbox Diagnostics, M.Sc. Thesis, Graduate
Program in Acoustics, The Pennsylvania State University, State College, PA,
1999.
[17] S. Poyhonen, P. Jover, H. Hyotyniemi, Signal processing of vibrations for

condition monitoring of an induction motor, in: ISCCSP : 2004 First International
Symposium on Control, Communications and Signal Processing, New York, 2004,
pp. 499-502.
[18] D. C. Baillie, J. Mathew, A comparison of autoregressive modeling techniques for

fault diagnosis of rolling element bearings, Mechanical Systems and Signal
Processing, 10 (1996) 1-17.
[19] A. K. Garga, B. T. Elverson, D. C. Lang, AR modeling with dimension reduction

for machinery fault classification, in: Critical Link: Diagnosis to Prognosis,
Haymarket, 1997, pp. 299-308.
[20] Y. Zhan, V. Makis, A. K. S. Jardine, Adaptive model for vibration monitoring of

rotating machinery subject to random deterioration, Journal of Quality in
Maintenance Engineering, 9 (2003) 351-375.
[21] W. J. Wang, J. Chen, X. K. Wu, Z. T. Wu, The application of some non-linear

methods in rotating machinery fault diagnosis, Mechanical Systems and Signal
Processing, 15 (2001) 697-705.
[22] W. J. Wang, R. M. Lin, The application of pseudo-phase portrait in machine

condition monitoring, Journal of Sound and Vibration, 259 (2003) 1-16.
[23] T. Koizumi, N. Tsujiuchi, Y. Matsumura, Diagnosis with the correlation integral in

time domain, Mechanical Systems and Signal Processing, 14 (2000) 1003-1010.
[24] W. J. Wang, Z. T. Wu, J. Chen, Fault identification in rotating machinery using the
correlation dimension and bispectra, Nonlinear Dynamics, 25 (2001) 383-393.
[25] Q. Zhuge, Y. Lu, Signature analysis for reciprocating machinery with adaptive
signal-processing, Proceedings of the Institution of Mechanical Engineers Part C-
Journal of Mechanical Engineering Science, 205 (1991) 305-310.
[26] N. Baydar, Q. Chen, A. Ball, U. Kruger, Detection of incipient tooth defect in

helical gears using multivariate statistics, Mechanical Systems and Signal
Processing, 15 (2001) 303-321.
[27] R. R. Schoen, T. G. Habetler, Effects of time-varying loads on rotor fault detection

in induction machines, IEEE Transactions on Industry Applications, 31 (1995)
900-906.
[28] R. G. T. De Almeida, S. A. Da Silva Vicente, L. R. Padovese, New technique for

evaluation of global vibration levels in rolling bearings, Shock and Vibration, 9
(2002) 225-234.
[29] Z. Liu, X. Yin, Z. Zhang, D. Chen, W. Chen, Online rotor mixed fault diagnosis
way based on spectrum analysis of instantaneous power in squirrel cage induction
motors, IEEE Transactions on Energy Conversion, 19 (2004) 485-490.
[30] D. Ho, R. B. Randall, Optimisation of bearing diagnostic techniques using

simulated and actual bearing fault signals, Mechanical Systems and Signal
Processing, 14 (2000) 763-788.
[31] R. B. Randall, J. Antoni, S. Chobsaard, The relationship between spectral

correlation and envelope analysis in the diagnostics of bearing faults and other
cyclostationary machine signals, Mechanical Systems and Signal Processing, 15
(2001) 945-962.
[32] J. R. Stack, R. G. Harley, T. G. Habetler, An amplitude modulation detector for

fault diagnosis in rolling element bearings, IEEE Transactions on Industrial
Electronics, 51 (2004) 1097-1102.
[33] G. W. Blankenship, R. Singh, Analytical solution for modulation sidebands

associated with a class of mechanical oscillators, Journal of Sound and Vibration,
179 (1995) 13-36.
[34] S. Goldman, Vibration Spectrum Analysis : A Practical Approach, Industrial Press,

New York, 1999.
[35] C. M. Harris, A. G. Piersol, Harris' Shock and Vibration Handbook, McGraw-Hill,

2002.
[36] M. A. Minnicino, H. J. Sommer, Detecting and quantifying friction nonlinearity
using the Hilbert transform, in: Health Monitoring and Smart Nondestructive
Evaluation of Structural and Biological Systems III, 5394, Bellingham, 2004, pp.
419-427.
[37] N. T. van der Merwe, A. J. Hoffman, A modified cepstrum analysis applied to

vibrational signals, in: Proceedings of 14th International Conference on Digital
Signal Processing (DSP2002), Vol. 2, Santorini, Greece, 2002, pp. 873-876.
[38] C.-C. Wang, G.-P. J. Too, Rotating machine fault detection based on HOS and
artificial neural networks, Journal of Intelligent Manufacturing, 13 (2002) 283-293.
[39] L. Xiong, T. Shi, S. Yang, R. B. K. N. Rao, A novel application of wavelet-based

bispectrum analysis to diagnose faults in gears, International Journal of
COMADEM, 5 (2002) 31-38.
[40] D.-M. Yang, A. F. Stronach, P. Macconnell, J. Penman, Third-order spectral

techniques for the diagnosis of motor bearing condition using artificial neural
networks, Mechanical Systems and Signal Processing, 16 (2002) 391-411.
[41] T. W. S. Chow, G. Fei, Three phase induction machines asymmetrical faults

identification using bispectrum, IEEE Transactions on Energy Conversion, 10
(1995) 688-693.
[42] N. Arthur, J. Penman, Inverter fed induction machine condition monitoring using
the bispectrum, in: Proceedings of the IEEE Signal Processing Workshop on
Higher-Order Statistics, Banff, Alta., Canada, 1997, pp. 67-71.
[43] B. E. Parker, H. A. Ware, D. P. Wipf, W. R. Tompkins, B. R. Clark, E. C. Larson,

H. V. Poor, Fault diagnostics using statistical change detection in the bispectral
domain, Mechanical Systems and Signal Processing, 14 (2000) 561-570.
[44] W. Li, G. Zhang, T. Shi, S. Yang, Gear crack early diagnosis using bispectrum
diagonal slice, Chinese Journal of Mechanical Engineering (English Edition), 16
(2003) 193-196.
[45] A. C. McCormick, A. K. Nandi, Bispectral and trispectral features for machine

condition diagnosis, IEE Proceedings-Vision, Image and Signal Processing, 146
(1999) 229-234.
[46] L. Qu, X. Liu, G. Peyronne, Y. Chen, The holospectrum: A new method for rotor
surveillance and diagnosis, Mechanical Systems and Signal Processing, 3 (1989)
255-267.
[47] C. B. Yu, H. B. He, Y. Xu, F. L. Chen, Identification method of acoustic
information flow of bearing state, in: Condition Monitoring '97, 1997, pp. 311-315.
[48] Y. D. Chen, R. Du, Diagnosing spindle defects using 4-D holospectrum, Journal of
Vibration and Control, 4 (1998) 717-732.
[49] L. Qu, D. Shi, Holospectrum during the past decade: review & prospect,
Zhendong Ceshi Yu Zhenduan/Journal of Vibration, Measurement & Diagnosis,
18 (1998) 235-242 (in Chinese).
[50] M. H. Hayes, Statistical Digital Signal Processing and Modeling, John Wiley and
Sons, New York, 1996.
[51] C. K. Mechefske, J. Mathew, Fault detection and diagnosis in low speed rolling
element bearing. Part I: The use of parametric spectra, Mechanical Systems and
[52] J. P. Dron, L. Rasolofondraibe, C. Couet, A. Pavan, Fault detection and monitoring

of a ball bearing benchtest and a production machine via autoregressive spectrum
analysis, Journal of Sound and Vibration, 218 (1998) 501-525.
[53] J. R. Stack, T. G. Habetler, R. G. Harley, Bearing fault detection via autoregressive

stator current modeling, IEEE Transactions on Industry Applications, 40 (2004)
740-747.
[54] M. J. E. Salami, A. Gani, T. Pervez, Machine condition monitoring and fault

diagnosis using spectral analysis techniques, in: Proceedings of the First
International Conference on Mechatronics (ICOM '01), Vol. 2, Kuala Lumpur,
Malaysia, 2001, pp. 690-700.
[55] W. J. Wang, P. D. McFadden, Early detection of gear failure by vibration analysis

I. Calculation of the time-frequency distribution, Mechanical Systems and Signal
Processing, 7 (1993) 193-203.
[56] F. A. Andrade, I. Esat, M. N. M. Badi, Gearbox fault detection using statistical

methods, time-frequency methods (STFT and Wigner-Ville distribution) and
harmonic wavelet - A comparative study, in: Proceedings of COMADEM '99,
Chipping Norton, 1999, pp. 77-85.
[57] Q. Meng, L. Qu, Rotating machinery fault diagnosis using Wigner distribution,
Mechanical Systems and Signal Processing, 5 (1991) 155-166.
[58] M.-C. Pan, H. Van Brussel, P. Sas, B. Verbeure, Fault diagnosis of joint backlash,
Journal of Vibration and Acoustics, Transactions of the ASME, 120 (1998) 13-24.
[59] I. S. Koo, W. W. Kim, The development of reactor coolant pump vibration
monitoring and a diagnostic system in the nuclear power plant, ISA Transactions,
39 (2000) 309-316.
[60] N. Baydar, A. Ball, A comparative study of acoustic and vibration signals in

detection of gear failures using Wigner-Ville distribution, Mechanical Systems and
[61] L. Cohen, Time-frequency distribution - a review, Proceedings of the IEEE, 77

(1989) 941-981.
[62] P. Bonato, R. Ceravolo, A. De Stefano, M. Knaflitz, Bilinear time-frequency

transformations in the analysis of damaged structures, Mechanical Systems and
[63] S. Gu, J. Ni, J. Yuan, Non-stationary signal analysis and transient machining
process condition monitoring, International Journal of Machine Tools and
Manufacture, 42 (2002) 41-51.
[64] P. Loughlin, F. Cakrak, L. Cohen, Conditional moments analysis of transients with

application to helicopter fault data, Mechanical Systems and Signal Processing, 14
(2000) 511-522.
[65] R. K. Young, Wavelets Theory and Its Applications, Kluwer Academic Publishers,
Boston, 1993.
[66] W. J. Staszewski, G. R. Tomlinson, Application of the wavelet transform to fault

detection in a spur gear, Mechanical Systems and Signal Processing, 8 (1994) 289-
307.
[67] W. J. Wang, P. D. McFadden, Application of wavelets to gearbox vibration signals

for fault detection, Journal of Sound and Vibration, 192 (1996) 927-939.
[68] R. Rubini, U. Meneghetti, Application of the envelope and wavelet transform

analyses for the diagnosis of incipient faults in ball bearings, Mechanical Systems
and Signal Processing, 15 (2001) 287-302.
[69] G. Y. Luo, D. Osypiw, M. Irle, On-line vibration analysis with fast continuous
wavelet algorithm for condition monitoring of bearing, Journal of Vibration and
Control, 9 (2003) 931-947.
[70] N. Aretakis, K. Mathioudakis, Wavelet analysis for gas turbine fault diagnostics,
Journal of Engineering for Gas Turbines and Power, 119 (1997) 870-876.
[71] G. O. Chandroth, W. J. Staszewski, Fault detection in internal combustion engines
using wavelet analysis, in: Proceedings of COMADEM '99, Chipping Norton,
1999, pp. 7-15.
[72] G. Dalpiaz, A. Rivola, Condition monitoring and diagnostics in automatic

machines: comparison of vibration analysis techniques, Mechanical Systems and
[73] N. Baydar, A. Ball, Detection of gear failures via vibration and acoustic signals
using wavelet transform, Mechanical Systems and Signal Processing, 17 (2003)
787-804.
[74] P. S. Addison, J. N. Watson, T. Feng, Low-oscillation complex wavelets, Journal

of Sound and Vibration, 254 (2002) 733-762.
[75] Y.-G. Xu, Y.-L. Yan, Research on Haar spectrum in fault diagnosis of rotating
machinery, Applied Mathematics and Mechanics (English Edition), 12 (1991) 61-
66.
[76] H. K. Tonshoff, X. Li, C. Lapp, Application of fast Haar transform and concurrent
learning to tool-breakage detection in milling, IEEE/ASME Transactions on
Mechatronics, 8 (2003) 414-417.
[77] A. J. Miller, K. M. Reichard, A new wavelet basis for automated fault diagnostics
of gear teeth, in: Inter-Noise 99: Proceedings of the 1999 International Congress
on Noise Control Engineering, Vol. 1-3, Poughkeepsie, 1999, pp. 1597-1602.
[78] D. Boulahbal, F. Golnaraghi, F. Ismail, Amplitude and phase wavelet maps for the
detection of cracks in geared systems, Mechanical Systems and Signal Processing,
13 (1999) 423-436.
[79] G. Meltzer, N. P. Dien, Fault diagnosis in gears operating under non-stationary

rotational speed using polar wavelet amplitude maps, Mechanical Systems and
[80] C. Wang, R. X. Gao, Wavelet transform with spectral post-processing for

enhanced feature extraction, IEEE Transactions on Instrumentation and
Measurement, 52 (2003) 1296-1301.
[81] G. G. Yen, K.-C. Lin, Wavelet packet feature extraction for vibration monitoring,
IEEE Transactions on Industrial Electronics, 47 (2000) 650-667.
[82] S. Zhang, J. Mathew, L. Ma, Y. Sun, Best basis-based intelligent machine fault
diagnosis, Mechanical Systems and Signal Processing, 19 (2005) 357-370.
[83] H. A. Toliyat, K. Abbaszadeh, M. M. Rahimian, L. E. Olson, Rail defect diagnosis
using wavelet packet decomposition, IEEE Transactions on Industry Applications,
39 (2003) 1454-1461.
[84] H. Yang, J. Mathew, L. Ma, Fault diagnosis of rolling element bearings using basis
pursuit, Mechanical Systems and Signal Processing, 19 (2005) 341-356.
[85] Z. K. Peng, F. L. Chu, Application of the wavelet transform in machine condition

monitoring and fault diagnostics: A review with bibliography, Mechanical Systems
and Signal Processing, 18 (2004) 199-221.
[86] J. C. Russ, The Image Processing Handbook, CRC Press, Boca Raton, 2002.
[87] M. S. Nixon, A. S. Aguado, Feature Extraction and Image Processing, Newnes,

Oxford, 2002.
[88] W. J. Wang, P. D. McFadden, Early detection of gear failure by vibration analysis

II. Interpretation of the time-frequency distribution using image processing
techniques, Mechanical Systems and Signal Processing, 7 (1993) 205-215.
[89] S. Utsumi, Z. Kawasaki, K. Matsu-Ura, M. Kawada, Use of wavelet transform and

fuzzy system theory to distinguish wear particles in lubricating oil for bearing
diagnosis, Electrical Engineering in Japan, 134 (2001) 36-44.
[90] T. Heger, M. Pandit, Optical wear assessment system for grinding tools, Journal of
Electronic Imaging, 13 (2004) 450-461.
[91] C. Ellwein, S. Danaher, U. Jager, Identifying regions of interest in spectra for

classification purposes, Mechanical Systems and Signal Processing, 16 (2002)
211-222.
[92] C. M. Stellman, K. J. Ewing, F. Bucholtz, I. D. Aggarwal, Monitoring the

degradation of a synthetic lubricant oil using infrared absorption, fluorescence
emission and multivariate analysis: A feasibility study, Lubrication Engineering,
55 (1999) 42-52.
[93] G. O. Allgood, B. R. Upadhyaya, A model-based high-frequency matched filter

arcing diagnostic system based on principal component analysis (PCA) clustering,
in: Applications and Science of Computational Intelligence III, 4055, Bellingham,
2000, pp. 430-440.
[94] I. K. Fodor, A Survey of Dimension Reduction Techniques, Lawrence Livermore

National Laboratory (LLNL) Technical Report, UCRL-ID-148494, University of
California, Livermore, CA, 2002.
[95] H. T. Grimmelius, J. K. Woud, G. Been, On-line failure diagnosis for compression
refrigeration plants, International Journal of Refrigeration, 18 (1995) 31-41.
[96] L. Yang, M. Z. Yang, Z. Yan, B. Z. Shi, Extraction of symptom for on-line
diagnosis of power equipment based on method of time series analysis, in:
Proceedings of the 6th International Conference on Properties and Applications of
Dielectric Materials, Vol. 1, Xi'an, China, 2000, pp. 314-317.
[97] B. K. Sinha, Trend prediction from steam turbine responses of vibration and
eccentricity, Proceedings of the Institution of Mechanical Engineers Part A-Journal
of Power and Energy, 216 (2002) 97-104.
[98] A. K. S. Jardine, P. M. Anderson, D. S. Mann, Application of the Weibull
proportional hazard model to aircraft and marine engine failure data, Quality and
Reliability Engineering International, 3 (1987) 77-82.
[99] P.-J. Vlok, M. Wnek, M. Zygmunt, Utilising statistical residual life estimates of
bearings to quantify the influence of preventive maintenance actions, Mechanical
Systems and Signal Processing, 18 (2004) 833-847.
[100] J. Moubray, Reliability-centred maintenance, Butterworth-Heinemann, Oxford,
1997.
[101] K. B. Goode, J. Moore, B. J. Roylance, Plant machinery working life prediction
method utilizing reliability and condition-monitoring data, Proceedings of the
Institution of Mechanical Engineers Part E-Journal of Process Mechanical
Engineering, 214 (2000) 109-122.
[102] L. R. Rabiner, Tutorial on hidden Markov models and selected applications in
speech recognition, Proceedings of the IEEE, 77 (1989) 257-286.
[103] R. J. Elliott, L. Aggoun, J. B. Moore, Hidden Markov Models: Estimation and
Control, Springer-Verlag, New York, 1995.
[104] C. Bunks, D. McCarthy, T. Al-Ani, Condition-based maintenance of machines
using Hidden Markov Models, Mechanical Systems and Signal Processing, 14
(2000) 597-612.
[105] M. Dong, D. He, Hidden semi-markov models for machinery health diagnosis and
prognosis, in: Papers Presented at NAMRC 32, Vol. 32, Charlotte, NC, United
States, 2004, pp. 199-206.
[106] D. Lin, V. Makis, Recursive filters for a partially observable system subject to
random failure, Advances in Applied Probability, 35 (2003) 207-227.
[107] D. Lin, V. Makis, On-line parameter estimation for a failure-prone system subject
to condition monitoring, Journal of Applied Probability, 41 (2004) 211-220.
[108] W. Wang, A model to predict the residual life of rolling element bearings given
monitored condition information to date, IMA Journal of Management
Mathematics, 13 (2002) 3-16.
[109] W. Wang, P. A. Scarf, M. A. J. Smith, On the application of a model of
condition-based maintenance, Journal of the Operational Research Society, 51 (2000)
1218-1227.
[110] A. K. S. Jardine, Optimizing condition based maintenance decisions, in:
Proceedings of the Annual Reliability and Maintainability Symposium, 2002, pp.
90-97.
[111] W. Wang, J. Sharp, Modelling condition-based maintenance decision support, in:
Condition Monitoring: Engineering the Practice, Bury St Edmunds, 2002,
pp. 79-98.
[112] J. H. Williams, A. Davies, P. R. Drake, Condition-based maintenance and machine
diagnostics, Chapman & Hall, London, 1994.
[113] J. Korbicz, J. M. Koscielny, Z. Kowalczuk, W. Cholewa, Fault Diagnosis,
Springer, Berlin, 2004.
[114] J. Ma, C. J. Li, Detection of Localized Defects in Rolling Element Bearings Via
Composite Hypothesis Test, Mechanical Systems and Signal Processing, 9 (1995)
63-75.
[115] Y. W. Kim, G. Rizzoni, V. I. Utkin, Developing a fault tolerant power-train
control system by integrating design of control and diagnostics, International
Journal of Robust and Nonlinear Control, 11 (2001) 1095-1114.
[116] H. Sohn, K. Worden, C. R. Farrar, Statistical damage classification under changing
environmental and operational conditions, Journal of Intelligent Material Systems
and Structures, 13 (2002) 561-574.
[117] M. Nyberg, A general framework for fault diagnosis based on statistical hypothesis
testing, in: Twelfth International Workshop on Principles of Diagnosis (DX 2001),
Via Lattea, Italian Alps, 2001, pp. 135-142.
[118] M. L. Fugate, H. Sohn, C. R. Farrar, Vibration-based damage detection using
statistical process control, Mechanical Systems and Signal Processing, 15 (2001)
707-721.
[119] V. A. Skormin, L. J. Popyack, V. I. Gorodetski, M. L. Araiza, J. D. Michel,
Applications of cluster analysis in diagnostics-related problems, in: Proceedings of
the 1999 IEEE Aerospace Conference, Vol. 3, Snowmass at Aspen, CO, USA,
1999, pp. 161-168.
[120] M. Artes, L. Del Castillo, J. Perez, Failure prevention and diagnosis in machine
elements using cluster, in: Proceedings of the Tenth International Congress on
Sound and Vibration, Stockholm, Sweden, 2003, pp. 1197-1203.
[121] J. Schurmann, Pattern Recognition: A Unified View of Statistical and Neural
Approaches, John Wiley & Sons, New York, 1996.
[122] H. Ding, X. Gui, S. Yang, An approach to state recognition and knowledge-based
diagnosis for engines, Mechanical Systems and Signal Processing, 5 (1991)
257-266.
[123] W. J. Staszewski, K. Worden, G. R. Tomlinson, Time-frequency analysis in
gearbox fault detection using the Wigner-Ville distribution and pattern recognition,
[124] S. K. Goumas, M. E. Zervakis, G. S. Stavrakakis, Classification of washing
machines vibration signals using discrete wavelet analysis for feature extraction,
IEEE Transactions on Instrumentation and Measurement, 51 (2002) 497-508.
[125] X. Lou, K. A. Loparo, Bearing fault diagnosis based on wavelet transform and
fuzzy inference, Mechanical Systems and Signal Processing, 18 (2004) 1077-1095.
[126] M.-C. Pan, P. Sas, H. Van Brussel, Machine condition monitoring using signal
classification techniques, Journal of Vibration and Control, 9 (2003) 1103-1120.
[127] A. R. Webb, Statistical Pattern Recognition, John Wiley and Sons, West Sussex,
England, 2002.
[128] C. K. Mechefske, J. Mathew, Fault detection and diagnosis in low speed rolling
element bearing. Part II: The use of nearest neighbour classification, Mechanical
[129] Q. Sun, P. Chen, D. Zhang, F. Xi, Pattern recognition for automatic machinery
fault diagnosis, Journal of Vibration and Acoustics, Transactions of the ASME,
126 (2004) 307-316.
[130] M. Guo, L. Xie, S.-Q. Wang, J.-M. Zhang, Research on an integrated ICA-SVM
based framework for fault diagnosis, in: Proceedings of the 2003 IEEE
International Conference on Systems, Man and Cybernetics, Vol. 3, Washington,
DC, USA, 2003, pp. 2710-2715.
[131] J. Ying, T. Kirubarajan, K. R. Pattipati, A. Patterson-Hine, A hidden Markov
model-based algorithm for fault diagnosis with partial and imperfect tests, IEEE
Transactions on Systems, Man and Cybernetics, Part C (Applications and
Reviews), 30 (2000) 463-473.
[132] M. Ge, R. Du, Y. Xu, Hidden Markov model based fault diagnosis for stamping
processes, Mechanical Systems and Signal Processing, 18 (2004) 391-408.
[133] Z. Li, Z. Wu, Y. He, C. Fulei, Hidden Markov model-based fault diagnostics
method in speed-up and speed-down process for rotating machinery, Mechanical
[134] Y. Xu, M. Ge, Hidden Markov model-based process monitoring system, Journal of
Intelligent Manufacturing, 15 (2004) 337-350.
[135] D. Ye, Q. Ding, Z. Wu, New method for faults diagnosis of rotating machinery
based on 2-dimension hidden Markov model, in: Proceedings of the International
Symposium on Precision Mechanical Measurement, Vol. 4, Hefei, China, 2002,
pp. 391-395.
[136] A. Siddique, G. S. Yadava, B. Singh, Applications of artificial intelligence
techniques for induction machine stator fault diagnostics: Review, in: Proceedings
of the IEEE International Symposium on Diagnostics for Electric Machines, Power
Electronics and Drives, New York, 2003, pp. 29-34.
[137] M. J. Roemer, C. Hong, S. H. Hesler, Machine health monitoring and life
management using finite element-based neural networks, Journal of Engineering
for Gas Turbines and Power-Transactions of the ASME, 118 (1996) 830-835.
[138] E. C. Larson, D. P. Wipf, B. E. Parker, Gear and bearing diagnostics using neural
network-based amplitude and phase demodulation, in: Proceedings of the 51st
Meeting of the Society for Machinery Failure Prevention Technology, Virginia
Beach, VA, 1997, pp. 511-521.
[139] B. Li, M.-Y. Chow, Y. Tipsuwan, J. C. Hung, Neural-network-based motor rolling
bearing fault diagnosis, IEEE Transactions on Industrial Electronics, 47 (2000)
1060-1069.
[140] Y. Fan, C. J. Li, Diagnostic rule extraction from trained feedforward neural
networks, Mechanical Systems and Signal Processing, 16 (2002) 1073-1081.
[141] B. A. Paya, I. I. Esat, M. N. M. Badi, Artificial neural network based fault
diagnostics of rotating machinery using wavelet transforms as a preprocessor,
[142] B. Samanta, K. R. Al-Balushi, Artificial neural network based fault diagnostics of
rolling element bearings using time-domain features, Mechanical Systems and
[143] J. K. Spoerre, Application of the cascade correlation algorithm (CCA) to bearing
fault classification problems, Computers in Industry, 32 (1997) 295-304.
[144] D. W. Dong, J. J. Hopfield, K. P. Unnikrishnan, Neural networks for engine fault
diagnostics, in: Neural Networks for Signal Processing VII, New York, 1997, pp.
636-644.
[145] C. J. Li, T.-Y. Huang, Automatic structure and parameter training methods for
modeling of mechanical systems by recurrent neural networks, Applied
Mathematical Modelling, 23 (1999) 933-944.
[146] P. Deuszkiewicz, S. Radkowski, On-line condition monitoring of a power
transmission unit of a rail vehicle, Mechanical Systems and Signal Processing, 17
(2003) 1321-1334.
[147] R. M. Tallam, T. G. Habetler, R. G. Harley, Self-commissioning training
algorithms for neural networks with applications to electric machine fault
diagnostics, IEEE Transactions on Power Electronics, 17 (2002) 1089-1095.
[148] Y. H. Yoon, E. S. Yoon, K. S. Chang, Process fault diagnostics using the
integrated graph model, in: On-Line Fault Detection and Supervision in the
Chemical Process Industries, Oxford, 1993, pp. 89-94.
[149] C. H. Hansen, R. K. Autar, J. M. Pickles, Expert systems for machine fault
diagnosis, Acoustics Australia, 22 (1994) 85-90.
[150] M. F. Baig, N. Sayeed, Model-based reasoning for fault diagnosis of twin-spool
turbofans, Proceedings of the Institution of Mechanical Engineers, Part G: Journal
of Aerospace Engineering, 212 (1998) 109-116.
[151] Z. Y. Wen, J. Crossman, J. Cardillo, Y. L. Murphey, Case base reasoning in
vehicle fault diagnostics, in: Proceedings of the International Joint Conference on
Neural Networks 2003, Vol. 1-4, New York, 2003, pp. 2679-2684.
[152] M. Bengtsson, E. Olsson, P. Funk, M. Jackson, Technical design of condition
based maintenance system-a case study using sound analysis and case-based
reasoning, in: Maintenance and Reliability Conference - Proceedings of the 8th
Congress, Knoxville, USA, 2004.
[153] M. L. Araiza, R. Kent, R. Espinosa, Real-time, embedded diagnostics and
prognostics in advanced artillery systems, in: 2002 IEEE Autotestcon
Proceedings, Systems Readiness Technology Conference, New York, 2002, pp.
818-841.
[154] D. L. Hall, R. J. Hansen, D. C. Lang, The negative information problem in
mechanical diagnostics, Journal of Engineering for Gas Turbines and Power-
Transactions of the ASME, 119 (1997) 370-377.
[155] M. Stanek, M. Morari, K. Frohlich, Model-aided diagnosis: An inexpensive
combination of model-based and case-based condition assessment, IEEE
Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews,
31 (2001) 137-145.
[156] R. G. Silva, R. L. Reuben, K. J. Baker, S. J. Wilcox, Tool wear monitoring of
turning operations by neural network and expert system classification of a feature
set generated from multiple sensors, Mechanical Systems and Signal Processing,
12 (1998) 319-332.
[157] H. R. DePold, F. D. Gass, The application of expert systems and neural networks
to gas turbine prognostics and diagnostics, Journal of Engineering for Gas
Turbines and Power, 121 (1999) 607-612.
[158] B.-S. Yang, T. Han, Y.-S. Kim, Integration of ART-Kohonen neural network and
case-based reasoning for intelligent fault diagnosis, Expert Systems with
Applications, 26 (2004) 387-395.
[159] C. K. Mechefske, Objective machinery fault diagnosis using fuzzy logic,
[160] G. C. Collins, J. R. Bourne, A. J. Brodersen, C. F. Lo, Comparison of rule-based
and belief-based systems for diagnostic problems, in: Proceedings of The Second
International Conference on Industrial and Engineering Applications of Artificial
Intelligence and Expert Systems (IEA/AIE - 89), Vol. 2, New York, USA, 1989,
pp. 785-793.
[161] R. Du, K. Yeung, Fuzzy transition probability: A new method for monitoring
progressive faults. Part 1: The theory, Engineering Applications of Artificial
Intelligence, 17 (2004) 457-467.
[162] S. Zhang, T. Asakura, X. L. Xu, B. J. Xu, Fault diagnosis system for rotary
machine based on fuzzy neural networks, JSME International Journal. Series C:
Mechanical Systems, Machine Elements and Manufacturing, 46 (2003) 1035-1041.
[163] T. I. Liu, J. H. Singonahalli, N. R. Iyer, Detection of roller bearing defects using
expert system and fuzzy logic, Mechanical Systems and Signal Processing, 10
(1996) 595-614.
[164] S. H. Chang, K. S. Kang, S. S. Choi, H. G. Kim, H. K. Jeong, C. U. Yi,
Development of the on-line operator aid system OASYS using a rule-based expert
system and fuzzy logic for nuclear power plants, Nuclear Technology, 112 (1995)
266-294.
[165] A. K. Garga, K. T. McClintic, R. L. Campbell, C. C. Yang, M. S. Lebold, T. A.
Hay, C. S. Byington, Hybrid reasoning for prognostic learning in CBM systems,
in: 2001 IEEE Aerospace Conference Proceedings, Vol. 1-7, New York, 2001, pp.
2957-2969.
[166] D. B. Fogel, An introduction to simulated evolutionary optimization, IEEE
Transactions on Neural Networks, 5 (1994) 3-14.
[167] S. Sampath, S. Ogaji, R. Singh, D. Probert, Engine-fault diagnostics: an
optimisation procedure, Applied Energy, 73 (2002) 47-70.
[168] Z. Y. Chen, Y. Y. He, F. L. Chu, J. Y. Huang, Evolutionary strategy for
classification problems and its application in fault diagnostics, Engineering
Applications of Artificial Intelligence, 16 (2003) 31-38.
[169] Y.-C. Huang, C.-M. Huang, Evolving wavelet networks for power transformer
condition monitoring, IEEE Transactions on Power Delivery, 17 (2002) 412-416.
[170] G.-T. Yan, G.-F. Ma, Fault diagnosis of diesel engine combustion system based on
neural networks, in: Proceedings of 2004 International Conference on Machine
Learning and Cybernetics, Vol. 5, Shanghai, China, 2004, pp. 3111-3114.
[171] J. J. Gertler, Fault Detection and Diagnosis in Engineering Systems, Marcel
Dekker, Inc., New York, 1998.
[172] S. Simani, C. Fantuzzi, R. J. Patton, Model-based Fault Diagnosis in Dynamic
Systems Using Identification Techniques, Springer, London, 2003.
[173] I. Howard, S. Jia, J. Wang, The dynamic modelling of a spur gear in mesh
including friction and a crack, Mechanical Systems and Signal Processing, 15
(2001) 831-838.
[174] W. Y. Wang, Towards dynamic model-based prognostics for transmission gears,
in: Component and Systems Diagnostics, Prognostics, and Health Management II,
4733, Bellingham, 2002, pp. 157-167.
[175] D. C. Baillie, J. Mathew, Nonlinear model-based fault diagnosis of bearings, in:
Proceedings of an International Conference on Condition Monitoring, Swansea,
UK, 1994, pp. 241-252.
[176] K. A. Loparo, M. L. Adams, W. Lin, M. F. Abdel-Magied, N. Afshari, Fault
detection and diagnosis of rotating machinery, IEEE Transactions on Industrial
Electronics, 47 (2000) 1005-1014.
[177] K. A. Loparo, A. H. Falah, M. L. Adams, Model-based fault detection and
diagnosis in rotating machinery, in: Proceedings of the Tenth International
Congress on Sound and Vibration, Stockholm, Sweden, 2003, pp. 1299-1306.
[178] C. H. Oppenheimer, K. A. Loparo, Physically based diagnosis and prognosis of
cracked rotor shafts, in: Component and Systems Diagnostics, Prognostics, and
Health Management II, 4733, Bellingham, 2002, pp. 122-132.
[179] A. S. Sekhar, Model-based identification of two cracks in a rotor system,
[180] G. H. Choi, G. S. Choi, Application of minimum cross entropy to model-based
monitoring in diamond turning, Mechanical Systems and Signal Processing, 10
(1996) 615-631.
[181] W. Bartelmus, Mathematical modelling and computer simulations as an aid to
gearbox diagnostics, Mechanical Systems and Signal Processing, 15 (2001)
855-871.
[182] W. Bartelmus, Diagnostic information on gearbox condition for mechatronic
systems, Transactions of the Institute of Measurement and Control, 25 (2003)
451-465.
[183] R. J. Hansen, D. L. Hall, S. K. Kurtz, A new approach to the challenge of
machinery prognostics, Journal of Engineering for Gas Turbines and Power, 117
(1995) 320-325.
[184] A. Vania, P. Pennacchi, Experimental and theoretical application of fault
identification measures of accuracy in rotating machine diagnostics, Mechanical
[185] R. David, H. Alla, Petri nets for modeling of dynamic systems - a survey,
Automatica, 30 (1994) 175-202.
[186] N. C. Propes, A fuzzy Petri net based mode identification algorithm for fault
diagnosis of complex systems, in: System Diagnosis and Prognosis: Security and
Condition Monitoring Issues III, 5107, Bellingham, 2003, pp. 44-53.
[187] S. K. Yang, A condition-based failure-prediction and processing-scheme for
preventive maintenance, IEEE Transactions on Reliability, 52 (2003) 373-383.
[188] B.-S. Yang, S. K. Jeong, Y.-M. Oh, A. C. C. Tan, Case-based reasoning system
with Petri nets for induction motor fault diagnosis, Expert Systems with
Applications, 27 (2004) 301-311.
[189] C. R. Farrar, F. Hemez, G. Park, A. N. Robertson, H. Sohn, T. O. Williams, A
Coupled Approach to Developing Damage Prognosis Solutions, in: Damage
Assessment of Structures - The 5th International Conference on Damage
Assessment of Structures (DAMAS 2003), Southampton, UK, 2003.
[190] J. Yan, M. Koc, J. Lee, A prognostic algorithm for machine performance
assessment and its application, Production Planning and Control, 15 (2004)
796-801.
[191] E. Phelps, P. Willett, T. Kirubarajan, A statistical approach to prognostics, in:
Component and Systems Diagnostics, Prognosis and Health Management, 4389,
Bellingham, 2001, pp. 23-34.
[192] D. Banjevic, A. K. S. Jardine, Calculation of reliability function and remaining
useful life for a Markov failure time process, To appear in IMA Journal of
Management Mathematics, 2005.
[193] R. B. Chinnam, P. Baruah, Autonomous diagnostics and prognostics through
competitive learning driven HMM-based clustering, in: Proceedings of the
International Joint Conference on Neural Networks 2003, Vol. 1-4, New York,
2003, pp. 2466-2471.
[194] C. Kwan, X. Zhang, R. Xu, L. Haynes, A novel approach to fault diagnostics and
prognostics, in: Proceedings of the 2003 IEEE International Conference on
Robotics and Automation, Vol. 1-3, New York, 2003, pp. 604-609.
[195] D. Lin, V. Makis, Filters and parameter estimation for a partially observable
system subject to random failure with continuous-range observations, Advances in
Applied Probability, 36 (2004) 1212-1230.
[196] S. Zhang, R. Ganesan, Multivariable trend analysis using neural networks for
intelligent diagnostics of rotating machinery, Transactions of the ASME. Journal
of Engineering for Gas Turbines and Power, 119 (1997) 378-384.
[197] P. Wang, G. Vachtsevanos, Fault prognostics using dynamic wavelet neural
networks, AI EDAM-Artificial Intelligence for Engineering Design Analysis and
Manufacturing, 15 (2001) 349-365.
[198] R. C. M. Yam, P. W. Tse, L. Li, P. Tu, Intelligent predictive decision support
system for condition-based maintenance, International Journal of Advanced
Manufacturing Technology, 17 (2001) 383-391.
[199] Y.-L. Dong, Y.-J. Gu, K. Yang, W.-K. Zhang, A combining condition prediction
model and its application in power plant, in: Proceedings of 2004 International
Conference on Machine Learning and Cybernetics, Vol. 6, Shanghai, China, 2004,
pp. 3474-3478.
[200] W. Q. Wang, M. F. Golnaraghi, F. Ismail, Prognosis of machine health condition
using neuro-fuzzy systems, Mechanical Systems and Signal Processing, 18 (2004)
813-831.
[201] R. B. Chinnam, P. Baruah, A neuro-fuzzy approach for estimating mean residual
life in condition-based maintenance systems, International Journal of Materials and
Product Technology, 20 (2004) 166-179.
[202] A. Ray, S. Tangirala, Stochastic modeling of fatigue crack dynamics for on-line
failure prognostics, IEEE Transactions on Control Systems Technology, 4 (1996)
443-451.
[203] Y. Li, S. Billington, C. Zhang, T. Kurfess, S. Danyluk, S. Liang, Adaptive
prognostics for rolling element bearing condition, Mechanical Systems and Signal
Processing, 13 (1999) 103-113.
[204] Y. Li, T. R. Kurfess, S. Y. Liang, Stochastic prognostics for rolling element
bearings, Mechanical Systems and Signal Processing, 14 (2000) 747-762.
[205] D. Chelidze, J. P. Cusumano, A dynamical systems approach to failure prognosis,
Journal of Vibration and Acoustics, 126 (2004) 2-8.
[206] J. Luo, A. Bixby, K. Pattipati, L. Qiao, M. Kawamoto, S. Chigusa, An interacting
multiple model approach to model-based prognostics, in: System Security and
Assurance, Vol. 1, Washington, DC, USA, 2003, pp. 189-194.
[207] G. J. Kacprzynski, A. Sarlashkar, M. J. Roemer, Predicting remaining life by
fusing the physics of failure modeling with diagnostics, Journal of Metal, 56
(2004) 29-35.
[208] C. Cempel, H. G. Natke, M. Tabaszewski, A passive diagnostic experiment with
ergodic properties, Mechanical Systems and Signal Processing, 11 (1997) 107-117.
[209] J. Qiu, C. Zhang, B. B. Seth, S. Y. Liang, Damage mechanics approach for bearing
lifetime prognostics, Mechanical Systems and Signal Processing, 16 (2002)
817-829.
[210] G. A. Lesieutre, L. Fang, U. Lee, Hierarchical failure simulation for machinery
prognostics, in: Critical Link: Diagnosis to Prognosis, Haymarket, 1997,
pp. 103-110.
[211] S. J. Engel, B. J. Gilmartin, K. Bongort, A. Hess, Prognostics, the real issues
involved with predicting life remaining, in: 2000 IEEE Aerospace Conference
Proceedings, Vol. 6, New York, 2000, pp. 457-469.
[212] P. A. Scarf, On the application of mathematical models in maintenance, European
Journal of Operational Research, 99 (1997) 493-506.
[213] D. Lugtigheid, D. Banjevic, A. K. S. Jardine, Modelling repairable system
reliability with explanatory variables and repair and maintenance actions, IMA
Journal of Management Mathematics, 15 (2004) 89-110.
[214] J. E. Campbell, B. M. Thompson, L. P. Swiler, Consequence analysis in predictive
health monitoring systems, in: Proceedings of Probabilistic Safety Assessment and
Management, Vol. I and II, Amsterdam, 2002, pp. 1353-1358.
[215] W. Wang, A model to determine the optimal critical level and the monitoring
intervals in condition-based maintenance, International Journal of Production
Research, 38 (2000) 1425-1436.
[216] A. Grall, C. Berenguer, L. Dieulle, A condition-based maintenance policy for
stochastically deteriorating systems, Reliability Engineering & System Safety, 76
(2002) 167-180.
[217] B. Castanier, C. Berenguer, A. Grall, A sequential condition-based
repair/replacement policy with non-periodic inspections for a system subject to
continuous wear, Applied Stochastic Models in Business and Industry, 19 (2003)
327-347.
[218] L. Dieulle, C. Berenguer, A. Grall, M. Roussignol, Sequential condition-based
maintenance scheduling for a deteriorating system, European Journal of
Operational Research, 150 (2003) 451-461.
[219] S. V. Amari, L. McLaughlin, Optimal design of a condition-based maintenance
model, in: Proceedings of the Annual Reliability and Maintainability Symposium,
Los Angeles, CA, USA, 2004, pp. 528-533.
[220] C. Berenguer, A. Grall, B. Castanier, Simulation and evaluation of condition-based
maintenance policies for multi-component continuous-state deteriorating systems,
in: Foresight and Precaution, Vol. 1-2, Rotterdam, 2000, pp. 275-282.
[221] J. Barata, C. G. Soares, M. Marseguerra, E. Zio, Simulation modelling of
repairable multi-component deteriorating systems for 'on condition' maintenance
optimisation, Reliability Engineering and System Safety, 76 (2002) 255-264.
[222] M. Marseguerra, E. Zio, L. Podofillini, Condition-based maintenance optimization
by means of genetic algorithms and Monte Carlo simulation, Reliability
Engineering and System Safety, 77 (2002) 151-165.
[223] M. M. Hosseini, R. M. Kerr, R. B. Randall, An inspection model with minimal and
major maintenance for a system with deterioration and Poisson failures, IEEE
Transactions on Reliability, 49 (2000) 88-98.
[224] M. Ohnishi, T. Morioka, T. Ibaraki, Optimal minimal-repair and replacement
problem of discrete-time Markovian deterioration system under incomplete state
information, Computers and Industrial Engineering, 27 (1994) 409-412.
[225] J. A. M. Hontelez, H. H. Burger, D. J. D. Wijnmalen, Optimum condition-based
maintenance policies for deteriorating systems with partial information, Reliability
Engineering and System Safety, 51 (1996) 267-274.
[226] T. Aven, Condition based replacement policies-a counting process approach,
Reliability Engineering and System Safety, 51 (1996) 275-281.
[227] F. Barbera, H. Schneider, P. Kelle, A condition based maintenance model with
exponential failures and fixed inspection intervals, Journal of the Operational
Research Society, 47 (1996) 1037-1045.
[228] F. Barbera, H. Schneider, E. Watson, A condition based maintenance model for a
two-unit series system, European Journal of Operational Research, 116 (1999)
281-290.
[229] A. H. Christer, W. Wang, J. M. Sharp, A state space condition monitoring model
for furnace erosion prediction and replacement, European Journal of Operational
Research, 101 (1997) 1-14.
[230] D. Kumar, U. Westberg, Maintenance scheduling under age replacement policy
using proportional hazards model and TTT-plotting, European Journal of
Operational Research, 99 (1997) 507-515.
[231] V. Makis, A. K. S. Jardine, Optimal replacement in the proportional hazards
model, INFOR, 30 (1992) 172-183.
[232] D. Banjevic, A. K. S. Jardine, V. Makis, M. Ennis, A control-limit policy and
software for condition-based maintenance optimization, INFOR, 39 (2001) 32-50.
[233] V. Makis, X. Jiang, A. K. S. Jardine, A condition-based maintenance model, IMA
Journal of Mathematics Applied in Business and Industry, 9 (1998) 201-210.
[234] V. Makis, X. Jiang, Optimal replacement under partial observations, Mathematics
of Operations Research, 28 (2003) 382-394.
[235] W. B. Wang, A stochastic control model for on line condition based maintenance
decision support, in: 6th World Multiconference on Systemics, Cybernetics and
Informatics, Vol. VI, Proceedings - Industrial Systems and Engineering I, Orlando,
2002, pp. 370-374.
[236] S. Okumura, N. Okino, Optimisation of inspection time vector and warning level
in CBM considering residual life loss and constraint on preventive replacement
probability, International Journal of COMADEM, 6 (2003) 10-18.
[237] A. Barros, A. Grall, C. Berenguer, A maintenance policy optimized with imperfect
and/or partial monitoring, in: Proceedings of the Annual Reliability and
Maintainability Symposium, New York, 2003, pp. 406-411.
[238] A. H. Christer, W. Wang, A simple condition monitoring model for a direct
monitoring process, European Journal of Operational Research, 82 (1995)
258-269.
[239] S. Okumura, An inspection policy for deteriorating processes using delay-time
concept, International Transactions in Operational Research, 4 (1997) 365-375.
[240] K. B. Goode, B. J. Roylance, J. Moore, Development of model to predict condition
monitoring interval times, Ironmaking and Steelmaking, 27 (2000) 63-68.
[241] W. Wang, Modelling condition monitoring intervals: A hybrid of simulation and
analytical approaches, Journal of the Operational Research Society, 54 (2003)
273-282.
[242] D. L. Hall, J. Llinas, Handbook of Multisensor Data Fusion, CRC Press, Boca
Raton, FL, 2001.
[243] D. L. Hall, S. A. H. McMullen, Mathematical techniques in multi-sensor data
fusion, Artech House, Boston, 2004.
[244] Q. Liu, H.-P. Wang, A case study on multisensor data fusion for imbalance
diagnosis of rotating machinery, (AI EDAM) Artificial Intelligence for
Engineering Design, Analysis and Manufacturing, 15 (2001) 203-210.
[245] H. F. Wang, J. P. Wang, Fault diagnosis theory: method and application based on
multisensor data fusion, Journal of Testing and Evaluation, 28 (2000) 513-518.
[246] J. D. Kozlowski, C. S. Byington, A. K. Garga, M. J. Watson, T. A. Hay,
Model-based predictive diagnostics for electrochemical energy sources, in: 2001 IEEE
Aerospace Conference, Vol. 6, Big Sky, MT, 2001, pp. 63149-63164.
[247] C. S. Byington, T. A. Merdes, J. D. Kozlowski, Fusion techniques for vibration
and oil debris/quality in gearbox failure testing, in: Proceedings of Condition
Monitoring '99, Chipping Norton, 1999, pp. 113-128.
[248] M. A. Mannan, A. A. Kassim, M. Jing, Application of image and sound analysis
techniques to monitor the condition of cutting tools, Pattern Recognition Letters,
21 (2000) 969-979.
[249] P. Hannah, A. Starr, P. Bryanston-Cross, Condition monitoring and diagnostic
engineering - A data fusion approach, in: Condition Monitoring and Diagnostic
Engineering Management, Amsterdam, 2001, pp. 275-282.
[250] R. Willetts, A. G. Starr, D. Banjevic, A. K. S. Jardine, A. Doyle, Optimising
complex CBM decisions using hybrid fusion methods, in: Condition Monitoring
and Diagnostic Engineering Management, Amsterdam, 2001, pp. 909-918.
[251] A. Starr, R. Willetts, P. Hannah, W. Hu, D. Banjevic, A. K. S. Jardine, Data fusion
applications in intelligent condition monitoring, in: Recent Advances in
Computers, Computing and Communications, 2002, pp. 110-115.
[252] M. J. Roemer, G. J. Kacprzynski, R. F. Orsagh, Assessment of data and knowledge
fusion strategies for prognostics and health management, in: 2001 IEEE Aerospace
Conference Proceedings, Vol. 6, Big Sky, MT, USA, 2001, pp. 2979-2988.
[253] Z. B. Zhang, J. Wang, Y. Tian, H. Q. Zheng, Assessment of information fusion
strategies for diagnostics and prognostics, in: Proceedings of ISTM/2003: 5th
International Symposium on Test and Measurement, Vol. 1-6, Beijing, 2003, pp.
1901-1903.
[254] J. P. Wang, H. F. Wang, The reliability and self-diagnosis of sensors in a
multisensor data fusion diagnostic system, Journal of Testing and Evaluation, 31
(2003) 370-377.
[255] S. S. Haykin, Unsupervised Adaptive Filtering, John Wiley and Sons, New York,
2000.
[256] A. Hyvarinen, Survey on independent component analysis, Neural Computing
Surveys, 2 (1999) 94-128.
[257] L. Li, L. Qu, Machine diagnosis with independent component analysis and
envelope analysis, in: International Conference on Industrial Technology:
'Productivity Reincarnation through Robotics and Automation', Vol. 2, Bangkok,
Thailand, 2002, pp. 1360-1364.
[258] J. P. Barnard, C. Aldrich, Diagnostic monitoring of internal combustion engines by
use of independent component analysis and neural networks, in: 2003 International
Joint Conference on Neural Networks, Vol. 2, Portland, OR, USA, 2003,
pp. 869-872.
[259] Z. S. Chen, Y. M. Yang, G. J. Shen, X. S. Wen, Early diagnosis of helicopter
gearboxes based on independent component analysis, in: Proceedings of
ISTM/2003: 5th International Symposium on Test and Measurement, Vol. 1-6,
Beijing, 2003, pp. 3383-3386.
[260] X. J. Ma, Z. H. Hao, Multisensor data fusion based on independent component
analysis for fault diagnosis of rotor, in: Advances in Neural Networks - ISNN
2004, Pt 1, 3173, Berlin, 2004, pp. 744-749.
[261] X. H. Tian, J. Lin, K. R. Fyfe, M. J. Zuo, Gearbox fault diagnosis using
independent component analysis in the frequency domain and wavelet filtering, in:
Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and
Signal Processing, Vol. II, Speech II; Industry Technology Tracks; Design and
Implementation of Signal Processing Systems; Neural Networks for Signal
Processing, New York, 2003, pp. 245-248.
[262] H. J. Zhang, L. S. Qu, B. G. Xu, G. G. Wen, Partially blind source separation of
the diagnostic signals with prior knowledge, in: Condition Monitoring and
Diagnostic Engineering Management, Amsterdam, 2001, pp. 177-184.
[263] G. Gelle, M. Colas, C. Serviere, BSS for fault detection and machine monitoring -
time or frequency domain approach?, in: Proceedings of International Workshop
on Independent Component Analysis and Blind Signal Separation (ICA 2000),
Helsinki, Finland, 2000, pp. 555-560.
[264] G. Gelle, M. Colas, C. Serviere, Blind source separation: a new pre-processing tool
for rotating machines monitoring?, IEEE Transactions on Instrumentation and
Measurement, 52 (2003) 790-795.
[265] P. W. Tse, J. Zhang, The use of blind-source-separation algorithm for mechanical
signal separation and machine fault diagnosis, in: 2003 ASME International
Mechanical Engineering Congress, Vol. 24, Washington, DC, United States, 2003,
pp. 57-63.
[266] R. M. Vilela, J. C. Metrolho, J. C. Cardoso, Machine and industrial monitorization
system by analysis of acoustic signatures, in: Proceedings of the 12th IEEE
Mediterranean Electrotechnical Conference (MELECON 2004), Vol. 1,
Dubrovnik, Croatia, 2004, pp. 277-279.
[267] C. Serviere, P. Fabry, Blind source separation of noisy harmonic signals for
rotating machine diagnosis, Journal of Sound and Vibration, 272 (2004) 317-339.
[268] F. M. Discenzo, P. J. Unsworth, K. A. Loparo, H. Marcy, Self-diagnosing
intelligent motors: a key enabler for next generation manufacturing systems, IEE
Colloquium (Digest), (1999) 15-18 (3/1-3/4).