Knowledge
Using Maintenance Databases for Reliability Analysis and Improvement
Page
Part 1. Knowledge Management 13
Part 2. Condition Based Maintenance 83
Part 3. Reliability Centered Maintenance 201
By:
Murray Wiseman
Daming Lin
September 2005
Page 1
Optimal Maintenance Decisions (OMDEC) Inc 2004
Preface
This book provides the course notes for a CBM (condition based maintenance1) training
session presented in three parts: knowledge management, condition based maintenance, and
reliability centered maintenance.
The growing volumes of data that flood the ever-diminishing resources of today’s
maintenance departments compel us to automate the CBM process in all three of its steps:
the acquiring of the data, its interpretation, and the decision of when and how to act upon
that data.
1
Also called Predictive Maintenance (PdM), Condition Monitoring (CM), and On-condition maintenance.
2
The term P-F Interval was coined by John Moubray to represent the concept described by Nowlan and
Heap for the period between the appearance of a potential failure and the occurrence of a functional failure.
See The Elusive P-F Curve, Chapter 9, page 106.
3
The PHM (proportional hazard model) extends the age based reliability model developed by Waloddi
Weibull in the 1950’s to one developed by Cox in the 1970’s that adds condition monitoring and
performance data to the age-reliability relationship.
At this point readers are invited to work through a step-by-step exercise during which
they encounter the basic features of CBM statistical modeling software. We proceed to
build an optimal decision model using a reduced set of haul truck transmission oil
analysis data. In the exercise that follows, the users deploy the model that they have
previously created. They examine its automated analysis, reporting, and database
functionality. In a second exercise we explore the vital issue of data validation. The
example has been taken from a CBM project at a coal mine in which invalid data,
missing data, faulty failure definition, the impact of oil changes on oil analysis data, and
cost sensitivity analysis are all encountered, and their respective remedies explored. At
this time, we introduce an advanced topic – the analysis of complex items4. We define a
complex item and describe the data structure needed for representing complex items in a
decision model.
Chapter 11 introduces expert systems for CBM decision making. It describes, in detail, a
successful methodology applied to vibration analysis. The chapter closes by proposing a
hybrid system combining the respective advantages of an expert system with those of a
statistical modeling system. Chapter 12 unifies the principles of prognostics and
diagnostics by outlining a methodology known as case based reasoning, which extends
RCM knowledge to automated diagnostics. Chapter 13 reviews the technical literature
in a thorough survey of signal processing and decision making approaches used in CBM.
I hope you enjoy the book and invite your comments at murray@omdec.com.
Murray Wiseman
Optimal Maintenance Decisions (OMDEC) Inc.
4
Complex items are items that are subject to more than one failure mode.
Introduction by Andrew Jardine
Over the past decade, in my work as principal investigator at the CBM laboratory and
during my travels and speaking engagements, people have asked me what inspired the EXAKT
development project. The answer to that question is quite simple. Condition based
maintenance is the most desirable form of maintenance, yet former students, now
maintenance professionals, tell me that their current CBM programs, such as oil
analysis, often don’t deliver the intended results. I asked them how “exactly” their
staff interpret condition monitoring data. In other words, how do they decide whether or
not to remove an item for repair? Their answers led me to investigate whether a more
rigorous decision methodology might improve the payback on the rather large investment
they were making in condition based maintenance.
I found that two approaches were being used to interpret and act upon CBM data. One
method arrived at decisions by recalling solid experience and engineering knowledge that
a known level of a monitored variable indicates the initiation of a particular failure mode.
The second relied on “trend analysis” as the basis for making the
“maintain-now-or-continue-operating” decision. Looking closely at the data and results in both cases, I
found that, while the former achieved, generally, the expected benefits, the latter failed to
provide measurable return on the investment in the fixed and running costs of the CBM
program.
In the first case, CBM detection of, for example, diesel fuel in lubricating oil, reflects the
“ground truth” of a failed condition – that is, a leaking of fuel past the sealing surfaces of
some interface, perhaps the piston, ring, and cylinder wall. Similarly, coolant in the lube
oil, reflects the breakdown of some interface, possibly a gasket, separating the cooling
and lubricating fluids. However, where “data trending” is the principal method for
decisions, the relationship between monitored data and the failure mechanism is often
vague. We rely on a palpable deviation from some “normal” trend to alert us to a
problem.
Although this sounds like a reasonable approach, it works only if the data clearly reflects
a developing failure. But such is often not the case. Usually, several separate or inter-
related phenomena affect the monitored data. Although common sense would have us
believe that monitored signals from the machine must contain its health information, we
often know little about the nature of that relationship. For example, if the operator of a
nuclear reactor alters the temperature of the sealing fluid in the cooling water pump, then
the leak rate, normally used to monitor seal health, would tend to decrease, even if the
seal were, indeed, beginning to fail. The interpretation of trends, thus, becomes
complicated. Add to this, random noise, the effects of load variation, and more than one
failure mode, and you can imagine that attempting trend analysis of multiple data
streams, emanating from complex systems, might frustrate the well-intentioned
maintenance planner or engineer.
This problem posed a unique challenge. The condition monitoring phrase “equipment
health” brings to mind the idea of human health. I looked at the medical field where the
problem of symptoms based prognostics is well known. The concept of “risk factors” that
associate medical test results with specific illnesses seemed perfectly analogous to the
problem of risk based decisions in maintenance. Cox’s proportional hazard model in the
1970’s had proved useful in the detection of illnesses and in the prediction of human
survival. I applied these ideas, first, to jet propulsion engines, and discovered that we
could model the risk of engine failure in terms of the oil analysis results of iron and
chromium, and the engine’s accumulated flight hours since overhaul. That work proved
very encouraging. So much so, that we set out to develop a general purpose software
platform for PHM (proportional hazard modeling) prediction. Over the past decade, at the
CBM laboratory of the University of Toronto, we gradually improved the program by
applying it to many industrial CBM situations. It has reached the stage now, where it
should be made commercially available to the mainstream of the physical asset
management community. That is the reason OMDEC was spun off from the CBM lab.
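The risk model described above can be illustrated with a small sketch. It assumes the commonly used Weibull form of the proportional hazards model, in which a Weibull baseline hazard in working age is multiplied by an exponential term in the monitored covariates. The function name, the parameter values, and the covariate readings are hypothetical and are not taken from the jet engine study.

```python
import math


def weibull_phm_hazard(age, beta, eta, gammas, covariates):
    """Hazard h(t, Z) = (beta/eta) * (t/eta)**(beta - 1) * exp(sum(gamma_i * z_i)).

    age        -- working age t (for example, flight hours since overhaul)
    beta, eta  -- Weibull shape and scale parameters (assumed values)
    gammas     -- covariate weights, one per monitored variable (assumed values)
    covariates -- current readings, for example iron and chromium in ppm
    """
    baseline = (beta / eta) * (age / eta) ** (beta - 1.0)
    risk_multiplier = math.exp(sum(g * z for g, z in zip(gammas, covariates)))
    return baseline * risk_multiplier


# Hypothetical parameters: shape, scale in flight hours, weights for iron and chromium.
BETA, ETA = 2.0, 5000.0
GAMMAS = (0.05, 0.3)

# Same age, two different oil analysis results (iron ppm, chromium ppm).
low = weibull_phm_hazard(2000.0, BETA, ETA, GAMMAS, covariates=(10.0, 1.0))
high = weibull_phm_hazard(2000.0, BETA, ETA, GAMMAS, covariates=(40.0, 3.0))
print(high > low)
```

With positive weights, higher iron and chromium readings at the same age give a higher hazard, which is the sense in which the model combines age with condition data.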
I have often been asked why we called the program “EXAKT”, implying that CBM is an
“exact” science, while in fact the methodology of EXAKT is based on probabilities and
statistics. Certainly, I can see why some people think that the name “EXAKT” and the
probabilistic nature of failure are incongruous. Most managers, however, understand risk.
They instinctively weigh probabilities when making decisions in the normal course of
their activities. If they were told “exactly” the risk levels associated with alternative
decisions, they would find such information helpful indeed. Otherwise stated, if they
knew “exactly” with what level of confidence they may accept a residual life estimate for
some operating physical asset, they could adjust their operational and maintenance plans
accordingly.
Self-guided tutorial exercises are a good way to give potential users a level of
comfort. EXAKT is actually a usable tool. But, because EXAKT evolved as a research
platform, some people have formed the impression that it is too difficult for them. This
book sets out to dissolve that feeling. Besides a sound treatment of the founding
principles of CBM based on RCM derived knowledge, it contains step-by-step tutorials
that convey a number of common data problem solving techniques.
Andrew Jardine
Principal Investigator, CBM Lab
Professor, Mechanical and Industrial Engineering
University of Toronto
Contents:
Part 1. Knowledge Management ________________________________________ 13
Chapter 1. The knowledge elements ____________________________________ 13
Introduction________________________________________________________ 13
The Work Order UML Class Diagram__________________________________ 15
Incorporating RCM knowledge attributes _______________________________ 15
The Seven Knowledge elements of RCM ________________________________ 16
The “failure code” problem ___________________________________________ 18
Chapter 2. Requirements of Information ________________________________ 19
Data Structure______________________________________________________ 21
Implementing a Reliability Knowledge Base _____________________________ 22
Other “FMEA” data types and definitions _______________________________ 27
Conclusions ________________________________________________________ 30
Chapter 3. Using maintenance information ______________________________ 33
Introduction________________________________________________________ 33
The problem with failure rates ________________________________________ 34
How to use maintenance data? ________________________________________ 35
Age Exploration Procedures __________________________________________ 38
Random Failure____________________________________________________ 38
Failure Finding Intervals_____________________________________________ 39
Measuring Reliability Improvement ____________________________________ 41
Refining the maintenance program_____________________________________ 44
Assessing the effectiveness of a CBM Program ___________________________ 44
Improving the program through failure mode assessment ___________________ 46
Software analytic tools ______________________________________________ 47
CBM (on-condition maintenance) benefits analysis________________________ 49
Engineering Change Assessment ______________________________________ 51
Keeping Track of Components ________________________________________ 52
Introduction_______________________________________________________ 52
Recording Events for Reliability Analysis _______________________________ 52
Keeping track of system component ages________________________________ 53
Significant components______________________________________________ 54
Suspended Animation________________________________________________ 55
Handling meter anomalies ___________________________________________ 55
Marginal analysis ___________________________________________________ 57
Chapter 4. Acquiring Maintenance Information __________________________ 58
Introduction________________________________________________________ 58
Lexicon ____________________________________________________________ 59
The purpose of the EWOP ____________________________________________ 60
Work order documentation procedures for the EWOP ____________________ 60
The events table_____________________________________________________ 65
The RCM knowledge base ____________________________________________ 66
Uniqueness of a work order ___________________________________________ 66
Examples __________________________________________________________ 66
Summary and Conclusions____________________________________________ 71
Chapter 5. Assessing “What-if” from maintenance information______________ 73
Introduction________________________________________________________ 73
Modeling a simple system using SPAR __________________________________ 73
Objective of the analysis_____________________________________________ 74
The system function ________________________________________________ 74
Running the program _______________________________________________ 75
Remarks _________________________________________________________ 76
Repair effectiveness ________________________________________________ 76
Applying Preventive Maintenance _____________________________________ 78
Optimizing PM ____________________________________________________ 80
Part 2. Condition Based Maintenance ___________________________________ 83
Chapter 6. Deciding on CBM _________________________________________ 83
Introduction________________________________________________________ 83
Why do CBM?______________________________________________________ 84
History of CBM _____________________________________________________ 87
Chapter 7. Anatomy of CBM __________________________________________ 91
Data Acquisition ____________________________________________________ 91
Signal Processing____________________________________________________ 95
Decision Making ___________________________________________________ 100
Chapter 8. CBM Fundamentals_______________________________________ 103
The fundamental premise of CBM ____________________________________ 103
CBM Program Criteria _____________________________________________ 103
CBM Monitoring Frequency ________________________________________ 103
Estimating the PF Interval __________________________________________ 105
Chapter 9. The Elusive P-F Curve ____________________________________ 106
Are failures required – multiple levels of intrusiveness? ___________________ 108
Discussion of Case 2_______________________________________________ 109
Discussion of Case 1_______________________________________________ 111
Chapter 10. Optimizing CBM _________________________________________ 113
Developing a Maintenance Risk Model ________________________________ 113
The traditional risk model___________________________________________ 113
Combining Data and Risk___________________________________________ 114
The Optimal Risk _________________________________________________ 116
A Time Based Maintenance Model ____________________________________ 118
Blending in Cost __________________________________________________ 123
A Condition Based Maintenance Model ________________________________ 125
Automated CBM Decision Making ___________________________________ 126
Example 1 Creating and deploying a decision model______________________ 127
Example 2 Data validation __________________________________________ 131
Example 3 Complex Items __________________________________________ 146
Example 4 Data transformations______________________________________ 150
References ________________________________________________________ 151
Chapter 11. CBM Decision Making with Expert Systems ___________________ 152
Step 1 Data normalization ___________________________________________ 153
Step 2 The screening matrix__________________________________________ 154
Step 3 Cepstrum analysis ____________________________________________ 154
Step 4 Demodulation________________________________________________ 155
Step 5 Component specific diagnostic matrices __________________________ 157
Step 6 Decision making______________________________________________ 157
A proposed hybrid decision tool ______________________________________ 160
The ABB fault simulator____________________________________________ 160
Chapter 12. Case based reasoning______________________________________ 165
Introduction_______________________________________________________ 165
Efficient Troubleshooting____________________________________________ 166
Case Base Development _____________________________________________ 168
Terminology _____________________________________________________ 168
Building a knowledge domain _______________________________________ 169
Building a case ___________________________________________________ 170
Case Study ________________________________________________________ 171
The seed case base__________________________________________________ 174
Performance measurement __________________________________________ 175
Conclusions _______________________________________________________ 175
Chapter 13. A survey of signal processing and decision technologies for CBM __ 177
Introduction_______________________________________________________ 177
Data acquisition ____________________________________________________ 178
Signal processing___________________________________________________ 178
Signal processing _________________________________________________ 179
Value type data analysis ____________________________________________ 184
Data analysis combining event data and condition monitoring data __________ 184
Maintenance decision support ________________________________________ 186
Diagnostics ______________________________________________________ 186
Prognostics ______________________________________________________ 192
Multiple sensor data fusion __________________________________________ 197
Concluding remarks ________________________________________________ 199
Part 3. Reliability Centered Maintenance ________________________________ 201
Chapter 14. Pillars of RCM ___________________________________________ 201
Introduction_______________________________________________________ 201
RCM Execution Strategies ___________________________________________ 203
Chapter 15. Failure Modes and Effects Analysis __________________________ 204
Question 1 – Functional Analysis _____________________________________ 204
The process ______________________________________________________ 204
Example 1 _______________________________________________________ 207
Example 2 _______________________________________________________ 210
Example 3 _______________________________________________________ 212
Example 4 _______________________________________________________ 213
Question 2 – Failure Analysis ________________________________________ 214
The process ______________________________________________________ 214
Example 1 _______________________________________________________ 214
Example 2 _______________________________________________________ 215
Example 3 _______________________________________________________ 215
Question 3 – Failure modes analysis ___________________________________ 216
The process ______________________________________________________ 216
Example 1 _______________________________________________________ 218
Example 2 _______________________________________________________ 219
Example 3 _______________________________________________________ 220
Question 4 – Effects analysis _________________________________________ 220
The process ______________________________________________________ 220
Example 1 _______________________________________________________ 221
Example 2 _______________________________________________________ 230
Example 3 _______________________________________________________ 231
Chapter 16. The RCM Decision Algorithm_______________________________ 233
Questions 5, 6, and 7 ________________________________________________ 233
The process ______________________________________________________ 233
Example 1 _______________________________________________________ 235
Example 2 _______________________________________________________ 236
Example 3 _______________________________________________________ 239
Example 4 _______________________________________________________ 243
Chapter 17. Integrating Reliability Information - MIMOSA _________________ 249
UML Class Diagrams _______________________________________________ 249
Chapter 18. Managing Strategy________________________________________ 254
Introduction_______________________________________________________ 254
Extending the Maintenance Audit_____________________________________ 255
Physical asset management inputs, outputs, and control __________________ 256
Physical Asset Management Effectiveness Indicators (KPIs)_______________ 257
Choosing between model 1 and model 2 ________________________________ 258
Drilling down from the KPIs _________________________________________ 260
How to start _______________________________________________________ 262
Chapter 19. Appendices ______________________________________________ 263
Appendix 1. EWOP details __________________________________________ 263
Used components and components in suspended animation ________________ 263
The EWOP’s Impact on the Work Process______________________________ 265
Using the EWOP prototype software __________________________________ 267
The onion skins of CBM____________________________________________ 268
The EWOP and EXAKT____________________________________________ 269
Appendix 5 The EWOP and EXAKT__________________________________ 269
Appendix 2. _______________________________________________________ 271
The role of the RCM Facilitator - Five Skill Areas: _______________________ 271
Appendix 3. _______________________________________________________ 276
Sizing the analysis_________________________________________________ 276
Selecting the significant items _______________________________________ 278
Appendix 4. _______________________________________________________ 278
Failure finding intervals for complex items (multiple failure modes and devices) 278
Appendix 5. _______________________________________________________ 280
Truck description _________________________________________________ 280
Appendix 6. _______________________________________________________ 288
Terminology used: ________________________________________________ 288
Various definitions of “Life” ________________________________________ 290
Appendix 7. _______________________________________________________ 290
Time to Failure - Relationship among hazard, reliability, and probability density
functions ________________________________________________________ 290
Appendix 8. _______________________________________________________ 293
Random failure survival curve _______________________________________ 293
Appendix 9. _______________________________________________________ 293
Inherent reliability characteristics_____________________________________ 293
Appendix 10. ______________________________________________________ 294
Failure mode depth of causality ______________________________________ 294
Appendix 11. Cost Comparison of CBM Policies ________________________ 295
Appendix 12. ______________________________________________________ 300
Expected failure time for an item whose maintenance policy is time-based ____ 300
Appendix 13. ______________________________________________________ 302
Default RCM decision diagram answers in the absence of operating experience 302
Appendix 14. ______________________________________________________ 303
Additional Relcode examples ________________________________________ 303
Appendix 15. EXAKT Exercises ______________________________________ 307
Appendix 16. References to Chapter 13.________________________________ 326
Part 1. Knowledge Management
Introduction
The quest for information consumes the physical asset manager, more so than his
counterparts in any other sector of the organization. Physical asset managers seek out
information technology products that promise to help them decide how, intelligently, to
deploy their forces.
To what extent may maintenance professionals influence the design of the technology
they acquire? In a perfect market economy, they, as a group, might exercise control over
the features and cost of technical products and services in which they invest. If a product
does not meet their desires, then, in the utopia of unlimited choice and instant
information, they will simply select one that does. Of course, we neither live nor
consume in an ideal marketplace. Does its imperfection frustrate our practical needs by
substituting perceived ones? What can we do to exercise due influence over the design of
new products and services destined to find their way into our technological tool box?
A look at the other side of the coin – the producer’s viewpoint – may prove enlightening.
In the pursuit of open standards for their products’ unhindered electronic inter-
connectivity, technology producers often form trade associations. Such organizations
have demonstrated unprecedented collaboration (even amongst otherwise acute
competitors) in defining a technology framework with which to address their common
market, and to do so with utmost efficiency. We list, in Table 1-1, a sampling of four
such associations, along with their respective website slogans and some representative
members.
Table 1-1
Website | Slogan | Typical sponsors/members
… | … in automation” | Advanced Engineering, Inc., Advanced, Measurement & …
www.hartcomm.org | Real-time connections … helping you lower maintenance cost, increase plant availability, improve plant operations, and facilitate regulatory compliance. | ABB Automation Products, Action Instruments, Adaptive Instruments LLC, Advanced Flow Technologies Co., Agar Corporation Inc., American Level Instruments …
Such focused business energy, without doubt, propels advanced open enterprise
application integration (EAI)5 technology that will yield cost savings and efficiency for
technology vendor and user alike. On the other hand, a healthy balance can help keep
those benefits flowing equitably. Far from discouraging such valuable technological
advances in enterprise integration standards, we propose to amplify their benefits with a
growing understanding of the fundamental knowledge elements governing failure
behavior as revealed by “reliability-centered maintenance”6.
The technology industry seeks out the maintenance professional. That individual labors
relentlessly in pursuit of overall equipment effectiveness at lowest cost. Seldom
possessing adequate time or resources to research and analyze the multitude of failures
and reliability problems encountered, he has come to rely upon a network of suppliers,
who prefer to be known as “solution providers”.
Effective learning requires a well worked out set of methods in order to extract relevant
knowledge from experience, integrate that experience into an existing knowledge
structure, and index it for later matching with similar cases. We hope that the principles
described here will help maintenance professionals to understand and to more clearly
express their reliability information needs in the torrent of new products and services that
may otherwise overwhelm them.
5
Open Applications Group – Enterprise Application Integration OAG-EAI
6
Reliability-centered maintenance (RCM) themes and principles are invoked throughout this book and
developed in detail in Chapter 14. (page 201).
The Work Order UML Class Diagram
The UML7 (Unified Modeling Language) is a graphical modeling language used to
develop computerized business solutions. In this section we invoke the UML to help us
clarify some of the business processes of maintenance. A physical asset management
(maintenance) information system revolves about the work order. It is the focal point for
the request, manpower allocation, procurement, execution, and historical documentation
of a maintenance action. First we represent the work order in a UML class diagram as in
Figure 1-1.
7
For a thorough discussion of the UML see “The Unified Modeling Language User Guide”, Grady Booch,
James Rumbaugh, Ivar Jacobson, ISBN 0-201-57168-4. Addison-Wesley 1998.
Heap8 entitled “Reliability-centered Maintenance” or RCM9 sheds considerable light on
this question. Nowlan and Heap investigated the failures of airplanes over three decades,
and, based on a remarkably comprehensive study, discovered those informational
elements that are essential to understanding the requirements of a maintenance program.
SAE Standard JA1011 (footnote 10) encapsulates those information requirements in seven RCM
(reliability-centered maintenance) questions:
8
F. Stanley Nowlan, Howard F. Heap, Reliability-Centered Maintenance, United Airlines under the sponsorship of the Office
of Assistant Secretary of Defence (Manpower, Reserve Affairs and Logistics), 1978.
9
Reliability-Centered Maintenance is a process for determining the maintenance requirements of a physical asset by
addressing the consequences of failure and seeking the most cost effective preventive or mitigating tasks.
10
Society of Automotive Engineers, SAE JA1011, issued Aug 1999, Evaluation Criteria for Reliability-Centered Maintenance
(RCM) Processes
11
We use the expression “knowledge element” interchangeably with the phrase “RCM question” in order
to emphasize that knowledge drives decisions to do the right maintenance at the right time. The 7
knowledge elements constitute the framework of our reliability centered knowledge to be physically
enshrined in our CMMS.
12
All corrective activities are motivated either by a functional failure or a potential failure
Figure 1-3: The use of recorded information in maintenance
The information in the left box of Figure 1-3 is sometimes referred to as “as-found”
information. Precise, consistent language populating structured CMMS13 records is grist
for the mill of continuous improvement. While reliability-centered maintenance initially
analyzes “what could happen” to a physical asset, maintainers using a similar conceptual
framework, add information on “what did happen”. The term “PM” in the right hand box
of Figure 1-3 refers to “preventive maintenance” in its broadest sense. PM, in this
context, includes any type of pro-active scheduled inspection (CBM), overhaul, failure
finding activity, or even an engineering or process modification14. To portray the
knowledge retention characteristics of our maintenance information system in the light of
RCM thinking, we redraw the work order UML icon as in Figure 1-4.
13
Computerized maintenance management system (CMMS), also known as a Maintenance information
management system (MIMS)
14
However, a modification, although very often carried out by maintenance personnel, is not
“maintenance” in the strict sense. It is a design improvement to the inherent reliability (capability) of the
asset.
The five failure descriptive attributes exposed by the WorkOrder class icon of Figure 1-4
are precisely those of the first five RCM questions (page 16). Additional work order
attributes will describe what was actually done. The rigorous representation of historical
failure information (of Figure 1-4) contrasts starkly with popular attempts to define and
capture failure codes.
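As a rough illustration, a work order record carrying the five failure descriptive attributes might be sketched as follows. The attribute names follow the usual wording of the first five RCM questions in SAE JA1011 (function, functional failure, failure mode, failure effects, failure consequences). The class layout, field names, and sample values are assumptions made for this sketch and are not the attributes shown in Figure 1-4.

```python
from dataclasses import dataclass, fields


@dataclass
class WorkOrder:
    """Hypothetical work order record with RCM-style failure attributes."""
    work_order_id: str
    asset_id: str
    # The five failure descriptive attributes (RCM questions 1 to 5).
    function: str
    functional_failure: str
    failure_mode: str
    failure_effects: str
    failure_consequences: str
    # Additional attribute describing what was actually done.
    work_performed: str = ""


# Sample values are invented for illustration only.
order = WorkOrder(
    work_order_id="WO-0001",
    asset_id="PUMP-12",
    function="Transfer water at not less than 800 litres per minute",
    functional_failure="Transfers less than 800 litres per minute",
    failure_mode="Impeller worn",
    failure_effects="Flow drops gradually and the downstream tank level falls",
    failure_consequences="Operational",
    work_performed="Replaced impeller",
)

print([f.name for f in fields(order)][2:7])
```

Recording the "as-found" information in these fixed fields, rather than in free text or a single failure code, keeps each work order comparable with the corresponding row of the RCM analysis.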
Pick lists of maintenance failure codes are often difficult to choose from and prone to
error. The selection items are often too general or do not adequately fit a given situation.
Or, alternatively, long lists of precise codes suffer from “choice overload” resulting in the
overuse of the default “Other”. Without doubt, effective and accurate lists are the
ultimate objective of reliability-centered knowledge systems. But deciding what selection
choices to place on such pick lists is no trivial matter. Some intermediary process is
required that will facilitate the day-to-day recording of useful reliability knowledge in the
short term, but additionally, must eventually evolve to the provision of accurate, robust
pick lists. Chapter 3. (page 33) will address the problem of failure code development
and suggest an approach that is reasonable, simple, robust and progressive. That
approach, elaborated in Chapter 4. (page 58), will unify failure mode records in the RCM
worksheet (knowledge base) with the failure codes in the work order database.
15
The OEE (Availability x Productivity x Quality) tracks maintenance effectiveness, where: Availability =
(scheduled time – downtime due to all forms of maintenance)/(scheduled time). Productivity = Product rate
setting/Desired product rate. Quality = (Product – Scrap)/Product. Additionally, tracking Reliability =
MTTF, will provide further measures of maintenance effectiveness. Two OEE models are thoroughly
described in Chapter 18. on page 254.
16
Only those failure codes appropriate to the equipment and the symptom should appear
Chapter 2. Requirements of Information
In order to set up a reliability information system or to add functionality to an existing
CMMS (computerized maintenance management system) consistent with the goals
outlined in Chapter 1, we would need to provide for the following “reliability-
centered” requirements described by Nowlan and Heap:
These requirements demand the provision of tools that will enable the systematic
collection, storage, and retrieval of historical experience that is relevant to asset
17
On-condition maintenance: The detection of a potential failure. Also known as condition based
maintenance (CBM) and predictive maintenance.
18
Increased probability of failure as reflected by a condition indicator or by indicators of imposed stress
19
The PF (potential failure to functional failure) interval coined by John Moubray. See Chapter 9. ( p 106)
20
Applicable: A task is technically feasible. Effective: A task accomplishes the intended objective
21
Such as proportional hazard models that estimate the statistical significance, to risk of failure, of age and
measurement observations, and operational factors in order to predict remaining useful life. (See Chapter
10. page 113)
reliability. They must meet the physical asset manager’s need to assess the applicability
and effectiveness of a given proactive task. An applicable pro-active task is one that is
technically feasible. An effective task deals satisfactorily22 with the consequences of the
failure that it addresses. Nowlan and Heap used the term age exploration23 to describe
any technique that analyzes information revealed from maintenance tasks. Using age
exploration methods we assess a task’s real applicability and effectiveness, and, if
necessary, we modify the maintenance program accordingly. Figure 2-1 represents, in a
“UML context diagram” a high level view of a reliability information system meeting the
afore-listed eleven requirements.
Figure 2-1: UML Context diagram of a Reliability Information System and various actors who
interact with it. The term “Use Case” refers to the performance of some operation required by the
user. For example, a maintainer “completes a work order” and a supervisor “audits a maintenance
record” are two use cases.
A context diagram such as that of Figure 2-1 shows merely an overall proposed system
and the persons (or other systems) that we intend should interact with it. The relative
22
That is, it entirely avoids, or reduces to a satisfactory level, the consequences and probability of failure.
23
Age Exploration: Any analysis procedure that examines historical maintenance data in order to alter the
maintenance plan for improved physical asset reliability.
impact of the actors, the sequence and the details of their use cases are fully described in
other UML diagrams24. Persons or entities other than those portrayed in Figure 2-1 may
interact with the reliability information system. They may include vendors, specialists, or
even automated “intelligent agents”25. Each actor inter-relates with the system in different
ways, the details of which may be described in other diagrams of the UML.
Data Structure
A simplified data structure for the reliability information system of Figure 2-1 could
resemble that of Figure 2-2.
Figure 2-2: Data model. Each table lists its column names.
The bold column names of Figure 2-2 designate a Primary Key (the values in the column
must be unique and non null) and Foreign Key (the values in the column must be
populated with a value from the primary key of the related table). This relationship is a
direct enabler of reliability analysis. It allows the incidences of an important failure mode
to be counted, studied, and correlated with two types of data: 1. working age, and 2.
monitored condition indicators. (Such analyses will be explored thoroughly in Chapter
10. Optimizing CBM page 113)
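As a minimal sketch of this relationship, the following Python fragment builds two such tables in an in-memory SQLite database and counts the incidences of each failure mode. The table names and RCMREF come from the text; the other column names (Failure_Mode, Job_No, Working_Age) and the sample rows are assumptions made for illustration only.

```python
import sqlite3

# Illustrative schema only: column names other than RCMREF are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.executescript("""
CREATE TABLE RCM_Table (
    RCMREF       INTEGER PRIMARY KEY,  -- primary key: unique and non-null
    Failure_Mode TEXT NOT NULL
);
CREATE TABLE WorkOrders (
    Job_No      INTEGER PRIMARY KEY,
    RCMREF      INTEGER NOT NULL REFERENCES RCM_Table(RCMREF),  -- foreign key
    Working_Age REAL  -- working age at the event, e.g. operating hours
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO RCM_Table VALUES (?, ?)", [
    (1, "Bearing seizes due to lack of lubrication"),
    (2, "Seal leaks due to wear"),
])
conn.executemany("INSERT INTO WorkOrders VALUES (?, ?, ?)", [
    (1001, 1, 5200.0), (1002, 2, 3100.0), (1003, 1, 6100.0),
])

def failure_mode_counts(connection):
    """Count work orders per failure mode and average their working age."""
    return connection.execute("""
        SELECT r.RCMREF, r.Failure_Mode, COUNT(w.Job_No), AVG(w.Working_Age)
        FROM RCM_Table r LEFT JOIN WorkOrders w ON w.RCMREF = r.RCMREF
        GROUP BY r.RCMREF ORDER BY r.RCMREF
    """).fetchall()

def rejects_orphan(connection):
    """A work order pointing at a non-existent RCM record must be refused."""
    try:
        connection.execute("INSERT INTO WorkOrders VALUES (1004, 99, 10.0)")
    except sqlite3.IntegrityError:
        return True
    return False

for row in failure_mode_counts(conn):
    print(row)
```

Because every work order is forced to carry a valid RCMREF, the count and the working ages for a given failure mode can be retrieved with a single join, which is the starting point for the age and condition analyses mentioned above.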
Note that the database table, “RCM_Table” contains all of the RCM questions. The one-
to-many cardinality arrow (with the three pronged reverse arrowhead) of Figure 2-2
indicates that each row of the WorkOrders table must relate to a row in RCM_Table. That
is, a work order is an instance of an RCM_Table record. This constraint represents a
problem in current PM programs managed by most existing CMMSs because:
1. A single work order can cover, for example, the overhaul of an entire system or
product line with multiple components and failure modes, and
2. A single CBM (condition based maintenance) inspection work order can span
multiple systems.
24
Such as sequence, use case, and others.
25
Automated “watchdogs” that analyze data and recommend (or implement) an action. See UML Class
Diagrams (page 249).
Yet the proposed reliability information system must respect the one-to-many integrity
constraint between the WorkOrders table and the RCM table. Without such a
relationship we could not trace the decision roots of a pro-active programmed task, and
consequently we could not scrutinize the records of each pro-active task with regard to its
applicability and effectiveness.26 Without that ability, we cannot question and
improve our maintenance strategy. Hence a conflict arises between the proposed
reliability-centered knowledge base and existing maintenance processes that execute
through the CMMS. We may resolve the difficulty in a number of ways that will depend
on the current CMMS data structure. We offer one solution in Figure 2-3, which shows
the primary key of WorkOrders expanded to include an additional work order attribute
called “Sub_No”.
Under this schema every work order can be related to a specific record in the knowledge
base table “RCM”. Now the Job_No may represent a group of (child) work orders each
corresponding to a particular failure mode (i.e. record) in the RCM table27. We will
propose a comprehensive solution to this problem in Chapter 4. Acquiring Maintenance
Information (page 58).
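The expanded key can be sketched as follows, again in Python with an in-memory SQLite database. Job_No, Sub_No and RCMREF are the names used in the text; the Failure_Mode column, the table layout and the sample rows are hypothetical and serve only to show one overhaul job recorded as several child work orders.

```python
import sqlite3

# Illustrative schema: Job_No, Sub_No and RCMREF come from the text; the rest is assumed.
db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
CREATE TABLE RCM (
    RCMREF       INTEGER PRIMARY KEY,
    Failure_Mode TEXT NOT NULL
);
CREATE TABLE WorkOrders (
    Job_No INTEGER NOT NULL,
    Sub_No INTEGER NOT NULL,
    RCMREF INTEGER NOT NULL REFERENCES RCM(RCMREF),
    PRIMARY KEY (Job_No, Sub_No)  -- expanded primary key
);
""")
db.executemany("INSERT INTO RCM VALUES (?, ?)", [
    (10, "Pump bearing seizes"),
    (11, "Pump seal leaks"),
    (12, "Coupling shears"),
])
# One overhaul job (Job_No 2001) recorded as three child work orders.
db.executemany("INSERT INTO WorkOrders VALUES (?, ?, ?)", [
    (2001, 1, 10), (2001, 2, 11), (2001, 3, 12),
])

def child_work_orders(job_no):
    """Return (Sub_No, Failure_Mode) for every child of the given job."""
    return db.execute(
        "SELECT w.Sub_No, r.Failure_Mode FROM WorkOrders w "
        "JOIN RCM r ON r.RCMREF = w.RCMREF "
        "WHERE w.Job_No = ? ORDER BY w.Sub_No", (job_no,)).fetchall()

print(child_work_orders(2001))
```

Each child row still relates to exactly one RCM record, so the integrity constraint is preserved while the parent Job_No continues to represent the whole job in the CMMS.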
26
This means, for example, that a task, say, “Inspect panels for loose contacts” may be traced right back to
the performance requirement of the asset, providing an auditable trail that may be scrutinized in order to
evaluate or upgrade the maintenance plan at some later time.
27
To the author’s knowledge, this idea has not yet been implemented. Nevertheless, careful consideration
and testing of its practicality will likely prove a worthwhile endeavour.
28
Smith, A.M. and Watson, I.A. (1980). Common cause failures — a dilemma in perspective. Reliability
Engineering 1, 127-142.
problem is knowing what to collect, how to make sense of the wealth of data that one can
gather, and what to do with it.” We want a data structure that captures the information
whose subsequent analysis will either confirm the underlying assumptions of each PM
task or point out conflicts with recorded observation and thereby suggest specific PM
effectiveness (policy) improvements. That structure must enable the fulfillment of all
eleven requirements outlined earlier in this chapter (page 19).
A maintainer, having executed a repair task (resulting from a failure, or potential failure),
will need to complete the work order form by providing data in sufficient detail and in the
proper format for a reliability information system. This activity is illustrated in Figure
2-4 by another type of UML diagram called a Use Case diagram. In it, the actor, in this
case the Maintainer, is shown interacting with the system in order to perform the action
(the use case inside the oval), “Complete the work order form”.
Figure 2-4: Use Case Diagram - Complete the work order form
Step 1
Rather than choosing from a conventional pick list of failure codes, the user (maintainer) should
be able to display, conveniently, the RCM table records for a given item. The great advantage of
presenting the failure modes in the full context of the function, functional failure, effects,
and consequences is that (unlike the use of failure codes) there will be little ambiguity or
uncertainty29 in how to categorize the failure. Subsequent reliability analysis of these
records will, therefore, be founded upon precise historical data.
The RCM data may be referenced by a CMMS user through a multi-row database form, a
spreadsheet tool (e.g. MS Excel), or a commercial RCM database application integrated
with the CMMS30.
Step 2
At this point, once the records are displayed, there are two possibilities:
1. An appropriate record in the RCM table, that accurately describes the current situation,
will be found, or
2. An appropriate record will not be found.
Step 3
If an appropriate record is found, the user will select that RCM identification number
(called RCMREF) for insertion into the WorkOrders table. Additional table attributes will
be required some of which are indicated in Figure 2-5.
29
see The “failure code” problem page 18
30
Acquiring Maintenance Information (page 58) suggests one type of user interface.
31
Other events in this field may be “suspension” and “suspended animation”. See Chapter 4. Acquiring
Maintenance Information (page 58)
32
Providing a vital link between the CMMS and various CBM and other plant systems (see Integrating
Reliability Information page 249). The description of the symptoms, as reflected by various monitoring or
inspection techniques, should be appended to the Effects of the referenced record in RCM_table.
RCM analysis has not yet been completed for the item in question, or the present
situation had been overlooked in the RCM analysis for the item. The maintainer must
insert a record in RCM. We “extend” the use case “Complete the work order form” to
cover this new situation. The extended33 UML Use Case diagram is shown in Figure 2-6.
Figure 2-6: Extending the Use Case "Complete the work order Form"
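The steps of the use case, including its extension, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular CMMS: the class and function names are hypothetical, the "provisional" flag is our own device for marking records added by a maintainer, and an exact text match stands in for the maintainer's judgement when choosing a record.

```python
from dataclasses import dataclass, field

@dataclass
class RcmRecord:
    rcmref: int
    item: str
    failure_mode: str
    provisional: bool = False  # True when added by a maintainer, pending review

@dataclass
class KnowledgeBase:
    records: list = field(default_factory=list)

    def records_for(self, item):
        """Step 1: the RCM records to display for the given item."""
        return [r for r in self.records if r.item == item]

    def add_provisional(self, item, failure_mode):
        """Extended use case: insert a new record when none fits."""
        next_ref = max((r.rcmref for r in self.records), default=0) + 1
        record = RcmRecord(next_ref, item, failure_mode, provisional=True)
        self.records.append(record)
        return record

def complete_work_order(kb, job_no, item, observed_failure_mode):
    # Step 2: look for a record that describes the current situation.
    # An exact text match stands in for the maintainer's judgement here.
    match = next((r for r in kb.records_for(item)
                  if r.failure_mode == observed_failure_mode), None)
    if match is None:
        match = kb.add_provisional(item, observed_failure_mode)
    # Step 3: the work order carries the reference of the chosen record.
    return {"Job_No": job_no, "RCMREF": match.rcmref}

kb = KnowledgeBase([RcmRecord(1, "Pump P-101", "Seal leaks due to wear")])
print(complete_work_order(kb, 3001, "Pump P-101", "Seal leaks due to wear"))
print(complete_work_order(kb, 3002, "Pump P-101", "Impeller eroded by cavitation"))
```

In both branches the work order ends up with a valid RCMREF, so the link between the work order and the knowledge base is never left empty.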
The proposed course of events (of Figure 2-6) challenges the maintainer, the supervisor,
the maintenance engineer and all parties dedicated to high quality information in the
system. Ideally, the RCM knowledge base would have been pre-populated by the RCM
team, assembled expressly for that purpose.34 Those persons would have deliberated and
determined the answers to the seven RCM questions (page 16) covering the entire item.
The RCM analysis process in which they engage is highly structured and well facilitated.
Can we expect the maintainer to provide information of the same high quality, but
without the benefit of adequate time and resources normally accorded to an RCM team?
No. Nevertheless, valuable experience and knowledge about a failure (or potential
33
The return arrow (with the unfilled arrow head) in Figure 2-6, labelled with the <<extend>> “stereotype”,
indicates that the additional use case is sometimes required.
34
The RCM process is described fully in Part 3. on page 201
failure) must be captured at this opportune moment35. How can the Maintainer
accomplish this in little time, working alone? He cannot. The system must provide audit,
approval, support, and educational functionality to assist the maintainer in this effort.
Figure 2-7 displays three new fields in RCM (Approval, Deletion, and Last_Update) that
may be used for this purpose.
Figure 2-7: Adding 3 more fields: Approval, Deletion, and Last_Update, to RCM_Table.
Before describing the approval function, it must be emphasized that the quality (hence
usefulness), of the reliability knowledge base depends mostly on human collaboration
(and less so on computer systems). Quality in the RCM table records does not rest entirely
upon the shoulders of the maintainer, and he must not perceive that it does. All
personnel will contribute to the ultimate integrity of the records added to the RCM table.
The act of doing so will grow their understanding of the behavior and consequences of an
item’s failure.
The entire process of successfully completing the information fields in the reliability
knowledge base depends on, at least, six supporting functions:
35
If not then, when?
36
Without doubt, effective and accurate lists are the ultimate objective of reliability and OEE centered
information systems. But deciding what choices to place on such picklists is no trivial matter. Some
intermediary process is required that will facilitate the day-to-day recording of useful reliability related data
in the short term, but additionally, must eventually evolve to the provision of accurate, robust picklists.
Chapter 4. (page 58) will address the problem of failure code development and suggest an approach that is
reasonable, simple, and progressive.
During the process of maintaining the reliability knowledge base, personnel will discover
the most common error – incorrect choice of the failure mode causality depth. We treat
this question in detail in Appendix 10. on page 294. Figure 2-8 illustrates a suggested
discussion and approval process in another UML diagram type known as a Sequence
diagram.
Figure 2-8: A sequence diagram illustrating the creation, approval and discussion of an RCM record
As its name suggests, a sequence diagram shows the timing of various interactions
associated with a use case, say “Inserting an RCM record”. Time proceeds from top to
bottom. The diagram focuses on the messages that are transmitted amongst the
interacting “objects” at various times, thus defining a sequence of messages. The UML
sequence diagram of Figure 2-8 indicates that the Maintainer has created a record. The
newly created record “object” (appearing slightly lower down on the time line) signals
the Maintenance Supervisor that he should verify and approve the information. When that
is done the RCM_Record object signals the Maintainer that he may review any changes
made. Finally the Maintainer may issue a signal indicating that a face-to-face discussion
is desired. While the message passing takes place internally in software and is transparent
to the user, the audit and review functionality provides for confidence in data integrity.
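The message sequence just described might be sketched as follows. The class, its methods and the message texts are hypothetical; only the order of the signals (Maintainer creates, Supervisor approves, Maintainer reviews and may request a discussion) and the Approval and Last_Update fields are taken from the text.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RcmRecord:
    failure_mode: str
    approval: str = "pending"  # stands in for the Approval field
    last_update: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    messages: list = field(default_factory=list)  # (recipient, text), in time order

    def notify(self, recipient, text):
        """Record a signal sent by the record object and stamp Last_Update."""
        self.messages.append((recipient, text))
        self.last_update = datetime.now(timezone.utc)

def create_record(failure_mode):
    """The Maintainer creates a record; the record signals the Supervisor."""
    record = RcmRecord(failure_mode)
    record.notify("Maintenance Supervisor", "verify and approve")
    return record

def approve(record, revised_failure_mode=None):
    """The Supervisor approves, optionally revising; the record signals the Maintainer."""
    if revised_failure_mode is not None:
        record.failure_mode = revised_failure_mode
    record.approval = "approved"
    record.notify("Maintainer", "review changes")

def request_discussion(record):
    """The Maintainer asks for a face-to-face discussion."""
    record.notify("Maintenance Supervisor", "face-to-face discussion requested")

rec = create_record("Contactor fails open")
approve(rec, "Contactor fails open due to pitted contacts")
request_discussion(rec)
for recipient, text in rec.messages:
    print(recipient, "<-", text)
```

Keeping the messages on the record itself gives an auditable trail of who was asked to do what, which is the property that supports confidence in data integrity.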
SAE JA1011-1999 standard37. Contending standards and older standards such as
“FMECA” (Failure Modes, Effects and Criticality Analysis – MIL-STD-1629A) and
FMEA/AIAG (Automotive Industry Action Group 1995)38 use several of the same words
and phrases but ascribe to them different meanings. Hence alternate definitions of FMEA
terminology have been and continue to be used extensively in many industries.
Understandably, this has led to confusion and miscommunication. Table 2-1 provides a
comparison of alternative terminology.
Table 2-1: Comparison of alternative terminology

Terminology: FMEA
Non SAE-JA1011 definition: A systematic tool for identifying: effects or consequences of a potential product or process failure, methods to eliminate or reduce the chance of a failure occurring.
SAE-JA1011 definition: Different definition: A tool for determining the functions, functional failures, causes, and effects of a failure of an item in its operating context.

Terminology: Potential Failure
Non SAE-JA1011 definition: Incorrect material choice, inappropriate specifications, operator assembling part incorrectly, excess variation in process resulting in out-spec products. Example: Air Bag (excessive air bag inflator force, operator may not install air bag properly on assembly line such that it may not engage during impact).
SAE-JA1011 definition: Different definition: An indicator that a failure mode has occurred and is in the process of degrading to a functional failure. At the time of detection, however, it has no dire consequences.

Terminology: Basic and Secondary functions
Non SAE-JA1011 definition: Basic function: ingress to and egress from vehicle. Secondary function: protect occupant from noise.
SAE-JA1011 definition: Similar definition: Primary function: why item purchased / installed. Secondary function: All other functions (protective, environmental, appearance, control-containment-comfort, health and safety, efficiency, structure-superfluous). See page 220.

Terminology: Failure Mode
Non SAE-JA1011 definition: Physical description of a failure, e.g. noise enters at door-to-roof interface.
SAE-JA1011 definition: Different definition: The cause (at a practical causality depth) of a failure.

Terminology: Failure Effects
Non SAE-JA1011 definition: Impact of failure on people, equipment, e.g. driver dissatisfaction.
SAE-JA1011 definition: Different definition: The typical worst case scenario of relevant events touched off by a failure mode occurring before, during, and after the failure. The scenario will encompass those events at the local/component level, the system/equipment level, the organizational level, and even the external/societal/environmental level as appropriate.

Terminology: Failure
Definition: Describes the way in which an item’s function is lost or compromised. Includes partial or total loss of function and describes the precise manner in which the function fails to perform.

Terminology: Failure Cause/Mechanism
Non SAE-JA1011 definition: Refers to the underlying (root) cause of a failure, e.g. insufficient door seal.
SAE-JA1011 definition: Somewhat different definition: In SAE JA-1011 there is only one active definition for these terms. That is to say: “Failure Cause” = “Failure Mode” = “Failure Mechanism” = “Root Cause”. It is the failure mode (or modes) retained in an analysis (for example, from a cause and effect diagram if required) for which there is a practical consequence mitigating activity.

Terminology: Severity
Non SAE-JA1011 definition: A rating corresponding to the seriousness of an effect of a "potential failure mode" (scale 1-10).
SAE-JA1011 definition: None.

Terminology: Occurrence
Non SAE-JA1011 definition: A rating corresponding to the rate at which a first level cause and its resultant failure mode will occur over the design life (scale 1-10).
SAE-JA1011 definition: None.

Terminology: Detection
Non SAE-JA1011 definition: A rating corresponding to the likelihood that the detection methods or current controls will detect the potential failure mode (scale 1-10).
SAE-JA1011 definition: None.

Terminology: Risk Priority Number (RPN)
Non SAE-JA1011 definition: Severity × Occurrence × Detection.
SAE-JA1011 definition: None. Note that SAE JA1011 does not preclude the use of RPN. Neither does RPN detract from SAE JA1011, but merely adds another dimension to the analysis, if required.

Terminology: Consequences
Non SAE-JA1011 definition: Unclear or varied.
SAE-JA1011 definition: None in FMEA but are addressed in the subsequent decision process of RCM. The consequences of failure are one of:
1. Hidden,
2. Safety, health, Environmental,
3. Operational, or
4. Non-Operational.

37
SAE JA1011, issued August 1999: Evaluation Criteria for Reliability-Centered Maintenance (RCM)
Processes
38
GM, Ford, and Chrysler Quality documents.
Conclusions
The great advantage of recording failure modes in the CMMS in the full context of the
function, functional failure, effects, and consequences is that there will be little ambiguity
or uncertainty about how to categorize the current failure. Subsequent reliability
analysis39 of these records will, thereafter, be founded upon precise historical data
concerning failure, its causes, effects, and consequences.
39
Refers to age exploration – analyzing the information gained from the execution of maintenance tasks.
That analysis is directed at OEE (defined in glossary on page 288) improvement and cost reduction without
compromising safety and the environment.
The general approach to PM assessment and improvement is a double-barreled cannon –
(1) a program of scheduled RCM analysis reviews of significant items40 by a team of
domain experts, and (2) a systematic process for supplementing that knowledge with
accurate historical information. Both these activities populate the same knowledge base.
The former exercises a rigorous process for establishing consensus on an item’s
maintenance characteristics. The latter accumulates reliability data in the field, extending
the knowledge of, and validating the assumptions of the former. Although currently rare,
cross fertilization of the two processes is immensely valuable and will inevitably vitalize
both.
The RCM reliability knowledge base will, ultimately, contain a record for every failure
mode that may reasonably occur in the organization’s asset hierarchy. As the knowledge
base grows, managers, maintenance engineers, planners, and reliability specialists may
apply rich software enabled data analysis and modeling tools41 to optimize their PM and
CBM decisions.
One further argument in favor of systematic work order documentation procedures, such
as those discussed thus far, can be found in the diagnostic experiences related to a
Turbofan engine. Why encourage the feedback of maintenance information from the
field? To complement and to enrich the RCM analysis? Figure 2-9 provides an answer,
which we may reasonably extend to many other installed systems. The Venn diagram
illustrates the gap found between the list of anticipated failure modes and those actually
experienced throughout a large fleet of engines.
40
Item: A group of one or more parts or assemblies that is convenient to treat as a single entity for
reliability analysis. Items are defined at a high enough level of indenture so that their failures may be
clearly related to failure of the equipment as a whole. (See Appendix 3. Sizing the analysis page 276.)
Significant item: An item whose failures:
· Are not evident under normal circumstances, or
· Can directly negatively impact safety or the environment, or
· Can have direct major economic or operational impact.
41
For example, the EXAKT software for developing optimized CBM decision models, and other tools such
as Pareto and Weibull analysis, and real time productivity and maintenance performance management
systems (Managing Strategy Chapter 18. page 254).
Figure 2-9: FMEA anticipated and actual failure modes experienced
Chapter 3. Using maintenance information
More important, by directing both scheduled tasks and intensive age exploration at those
items which are truly significant at the equipment level, the ultimate result will be
equipment with a degree of inherent reliability that is consistent with the state of the art
and the capabilities of maintenance technology
– Nowlan and Heap, Reliability-centered maintenance
Introduction
Few will deny that maintenance departments are very good at amassing data. Fewer still
will argue, though, that we are equally adept at analyzing and interpreting the data that
ends up in our computerized maintenance related systems. We store vast amounts of data,
mainly because it is technologically possible to do so. It often costs relatively little or
nothing to add another field to a form, another sensor to an assembly, or to record another
output from a control system.
We tend to defer, indefinitely, any serious consideration of the data itself. It may be
useful in the future – therefore, we collect it now. In Chapter 1. (page 13), we examined
the structures and procedures required for “maintenance data integrity”. We also
proposed a framework for collecting and managing maintenance data whose format and
content will be useful to those responsible for asset reliability – managers and planners.
They are the individuals who plan maintenance and who, therefore, must consider the
complexity of factors represented in Figure 3-1. In this book we survey the state-of-the-
art of maintenance information analysis techniques.
42
Hoeft, R. and Gebhardt, E. Heavy-Duty Gas Turbine Operating and Maintenance Considerations.
GE Power Systems, GE Energy Services, Atlanta, GA.
The problem with failure rates
It is sometimes thought that experience derived from others in the form of failure rates
can be useful in reliability-centered43 decision-making. Electricité de France44 notes that
“Data available in the literature cannot be used … it is related to equipment which, from many
viewpoints (operating conditions, maintenance, environment, etc) is very different … Moreover, it
scarcely provides information about the samples used to derive the data and rarely mentions
parameters other than the operating failure rate. For all these reasons, Electricité de France does
not consider the information provided by these tables as very … reliable.”
Even when failure rates are gathered from equipment operating in similar contexts, their
value to reliability investigations is limited. Failure rate is merely the inverse of an item’s
MTTF (or average life).45 Average life alone does not allow us to determine the right
intervals for PM tasks. As an example, consider that many items (most bearings and other
complex components) fail randomly. Only 37% (see Figure 3-4 on page 38 and
43
The expression, “reliability-centered” refers to decisions taken with the objective of sustaining OEE and
reliability while keeping costs acceptably low.
44
Dorey, J. (1981). Consideration of the reliability of pumps, derived from the first year of experience of
the SRDF, the reliability data collection system of Electricité de France. Reliability Engineering 2, 179-
192.
45
For items that fail randomly
Appendix 8. on page 293) of such items will survive to their average life. For other types
of failure behavior, for example, for items that wear out, the timing of a PM task would
depend on the item’s useful life – the age to which most of the items survive. The useful
life (Figure 3-2) of an item, however, is unrelated, in any simple way, to its MTTF.46
Figure 3-2: Useful life of an item, the age to which most items survive
Where the consequences of failure are economic only, failure rate can help decide
whether PM is effective. The cost of a scheduled PM task over a long time period47
should be substantially less than the failure rate × time period × average cost of a failure
and its (economic) consequences. This is illustrated formally in Equation 3-1.
Cost of PM in period < failure rate × time period × average cost of a failure (including its economic consequences)
Equation 3-1: Justifiable cost of PM
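A short worked example may make Equation 3-1 concrete. The function names and all figures below are hypothetical, and the failure rate is taken to be the rate that would apply without the PM task; the sketch simply evaluates both sides of the inequality.

```python
def expected_failure_cost(failure_rate, period, avg_failure_cost):
    """Right-hand side of Equation 3-1: failure rate x period x average cost of a failure."""
    return failure_rate * period * avg_failure_cost

def pm_is_justified(pm_cost_in_period, failure_rate, period, avg_failure_cost):
    """True when the PM cost over the period is below the expected cost of failures."""
    return pm_cost_in_period < expected_failure_cost(failure_rate, period, avg_failure_cost)

# Hypothetical figures: 0.0005 failures per operating hour, a 10,000 hour period,
# and an average cost of 8,000 per failure including its economic consequences.
rate, hours, cost_per_failure = 0.0005, 10_000, 8_000

print(expected_failure_cost(rate, hours, cost_per_failure))    # about 40,000
print(pm_is_justified(15_000, rate, hours, cost_per_failure))  # True
print(pm_is_justified(50_000, rate, hours, cost_per_failure))  # False
```

Since the text asks for the PM cost to be substantially less than the expected failure cost, the strict inequality tested here should be read as a minimum condition rather than a sufficient one.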
Failure rate can also help decide on stocking levels and economic order points for spare
parts48. Finally, knowing the failure rate of a protective device will permit a
determination of an adequate failure finding inspection interval required to achieve a
specified availability (Equation 3-2 page 40).
46
If one mistakenly bases scheduled renewals on MTBF, one will usually grossly underestimate the
number of failures expected to be prevented. (See Appendix 6. Various definitions of “Life” page 290.)
47
Note that we interpret “time period” in Equation 3-1 to mean “working age”. It is the usage
measurement or accumulated stress on the physical asset since installation or major overhaul. Use calendar
time only when the equipment functions regularly in time. More commonly we measure working age in
the specific engineering units of production. For example, for a skip hoist in an underground mine –
number of trips, for a haul truck in an open pit mine – tons of ore hauled, and so on.
48
“Spares” manual. Software developed by the CBM Laboratory at University of Toronto.
1. What additional data do we require?, and
2. How can we use it?
The preceding chapters provided some answers to question 1. We drew from the thinking
of reliability-centered maintenance (RCM) in order to describe a data structure into which
maintenance personnel may compile their day-to-day observations. This chapter focuses
on the second question – how to analyze the data in maintenance databases so that we
may make full use of that information. Primarily, we wish to use it to optimize49 every-
day maintenance management decisions.
49
Optimal decisions should support a stated objective. For example, minimum cost, maximum availability,
a specified reliability, or some set of performance measures tailored to the asset in its current operating
context.
50
Maintenance Steering Group 3, the defining document for Reliability-centered maintenance in
commercial aviation upon which the technical and regulatory infrastructure is based.
51
Age Exploration: Any analysis procedure that examines historical maintenance data in order to alter the
maintenance plan for improved physical asset reliability.
52
Including engineering changes and their assessment.
Figure 3-3: Improvement in aviation safety over 3 decades
the need to collect and analyze “good” data.) Hence, we can reasonably predict that
analogous problem resolution strategies and age exploration procedures will spread to the
broader marketplace.
Random Failure
Consider the common asset behavior known as “random failure”.58 We stated earlier
(page 34) that only 37% of randomly failing items survive until their mean time to failure
(MTTF). Figure 3-4 illustrates this behavior.
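[Plot: probability of survival without failure (vertical axis, 0 to 1) against age in multiples of the MTBF (horizontal axis). Plotted survival probabilities: 0.78 at 0.25, 0.61 at 0.50, 0.47 at 0.75, 0.37 at 1, 0.29 at 1.25, and 0.22 at 1.50 × the MTBF.]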
Figure 3-4: Survival probability (also known as the Reliability) for an item whose failure behavior is
random59
The calculation of survival probability (reliability) at each quarter multiple of the MTTF
shown on the graph is provided in Appendix 8. (page 293). An item whose short term
57
The American Petroleum Institute’s API 580 process for Risk Based Inspection draws the relationship
between quality of data and risk.
58
Despite the expression “random” we may estimate the risk of failure in a small interval at any age (given
that the item has survived to that age) to be a constant value equal to 1/MTBF.
59
Nowlan and Heap, Reliability-Centered Maintenance
risk60 of failure remains constant throughout its life is said to fail randomly. It does not
age, as would an item whose short term risk of failure increases as it gets older. Although
its conditional probability of failure curve (see Appendix 7. on page 290) is flat, a
randomly failing item’s probability of survival curve (Figure 3-4) decreases exponentially
with age. That is to say, it drops by a constant percentage (of its current value) in each
subsequent interval. The survival graph of Figure 3-4 illustrates this phenomenon. The
time axis is divided into equal lengths and the survival probability after each interval is
indicated on the curve. The percentage decrease in survival probability from time 0 to time
0.25 MTBF time units is (100 – 78)/100, or 22%. Similarly, the percentage drop from time
0.25 to 0.50 MTBF time units is (78 – 61)/78, which is also 22%, and so on. We take
advantage of this “exponential” behavior, in the following section, to help determine
inspection (failure finding) intervals of an important class of equipment whose failure
behavior is characteristically random61 – safety devices.
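The constant percentage drop can be checked numerically. The following is a minimal sketch, assuming the standard exponential survival function R(t) = exp(-t/MTBF) for a randomly failing item; the function and variable names are illustrative only.

```python
import math

def survival_probability(age_in_mtbf):
    """Reliability of a randomly failing item at an age given in multiples of its MTBF."""
    return math.exp(-age_in_mtbf)

# Quarter multiples of the MTBF, as on the time axis of Figure 3-4
ages = [0.25 * k for k in range(7)]  # 0, 0.25, ..., 1.50
survival = [survival_probability(a) for a in ages]

# Fractional drop in survival probability over each successive interval
drops = [(prev - cur) / prev for prev, cur in zip(survival, survival[1:])]

for age, value, drop in zip(ages[1:], survival[1:], drops):
    print(f"{age:4.2f} x MTBF: survival {value:.2f}, drop over interval {drop:.0%}")
```

Every interval shows the same drop of about 22%, and the survival value at 1 x MTBF is about 0.37, in line with the figures quoted above.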
The divisions along the time axis of the survival graph of Figure 3-4 have been marked at
an arbitrary ¼ the item’s MTTF. Let us suppose that we carry out our failure finding
inspections at those same intervals. The following statements follow from this particular
(exponential survival) age-reliability relationship:
1. Whatever the age of the item, its survival probability at each inspection is 78% of its
value at the previous inspection (as explained in Appendix 8, page 293).
2. Its availability in a very small interval immediately after an inspection would be
100%.
3. Its availability in a small interval immediately before the next inspection will be
78%.
4. The average availability of the item will be about 89% (the average of 78% and 100%).
Therefore an inspection interval of 1/4 of the average life will provide an average
availability of about 89%.
60
By short term risk, we mean the conditional probability of failure in a small interval. It is the
difference between the probability of reaching the interval and the probability of surviving the
interval, divided by the probability of reaching the interval.
61
The assumption of exponentiality for an item that does not wear out (such as most safety devices and
complex items) is, in fact, a conservative one. – N & H.
62
Hidden Failure: A failure of a protective function, e.g. a safety limit switch, that would normally go
undiscovered until the function that it was protecting, e.g. a high level limit switch, also fails.
63
Multiple Failure: A failure of a protected function at a time when its protective function is already in a
failed state.
5. If this degree of availability is inadequate (that is if we need to achieve a higher
average probability that the device will be operational), we must reduce the
interval – that is, increase the frequency of our inspections. The failure finding
interval (I) as a function of the desired availability and mean time to failure (M)
of the protective (safety) device may be calculated from the formula of Equation
3-2.
$$I = 2 \times (1 - \text{desired availability}) \times M$$
Equation 3-2 Failure finding interval for desired device availability. Equation is valid for
safety devices whose availabilities are greater than 95%64.
To give us a feel for the numbers generated from Equation 3-2, Table 3-1 shows
those failure finding (inspection) intervals needed to ensure the specified
availabilities for a safety device whose mean-time-between-failure is 3 years.
Table 3-1
Required safety device availability | Inspection interval as a % of MTBF (I/M x 100) | Example: MTTF = 3 years. Inspection interval to achieve required safety device availability
99.999% | 0.002% | ½ hour
99.99%  | 0.02%  | 5 hours
99.97%  | 0.06%  | 15 hours
99%     | 2%     | 22 days
98.5%   | 3%     | 33 days
98%     | 4%     | 44 days
96%     | 8%     | 88 days
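The entries of Table 3-1 can be reproduced directly from Equation 3-2. The sketch below is illustrative only; the function name and the 365-day year are assumptions, not part of the original text.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year

def failure_finding_interval(desired_availability, mttf):
    """Equation 3-2: I = 2 x (1 - desired availability) x M, in the units of mttf."""
    if not 0.95 < desired_availability < 1.0:
        raise ValueError("Equation 3-2 applies only to availabilities above 95%")
    return 2.0 * (1.0 - desired_availability) * mttf

mttf_hours = 3 * HOURS_PER_YEAR  # the 3-year example of Table 3-1
for availability in (0.99999, 0.9999, 0.9997, 0.99, 0.985, 0.98, 0.96):
    hours = failure_finding_interval(availability, mttf_hours)
    percent_of_mttf = 100.0 * hours / mttf_hours
    print(f"{availability:.3%}: {percent_of_mttf:.3f}% of MTTF, "
          f"{hours:.1f} hours ({hours / 24:.1f} days)")
```

The printed intervals round to the values in the table, for example about half an hour for 99.999% and about 22 days for 99%.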
Usually, the manufacturer of a safety device declares its MTTF. The results of failure
finding inspections, however, should be recorded by the user in the CMMS (as described
in Chapters 1 and 2), so that a reliability software product65 may ascertain true average
life and failure behavior of the device (or group of similar devices) under actual working
conditions. Equation 3-2 is valid only in the range of high availability (>95%). Knowing
the reliability of the device, the problem of determining the appropriate failure finding
interval is thus reduced (by Equation 3-2) to the problem of knowing what availability
the asset managers, the owners, the users, and the environmental and safety authorities
will accept for the device in question.
In fact, it is of greater interest to specify the maximum mean time between multiple
failure66 that interested parties are prepared to accept. We use Equation 3-3 to calculate
the appropriate failure finding interval, Iff, knowing the mean-time-to-failure of the safety
device (Msd), that of the protected function (Mpf), and the maximum tolerable risk of a
multiple failure, i.e. the mean-time-to-multiple-failure (Mmf).
64
Covers most electro-mechanical safety devices
65
EXAKT, Relcode, SuperSMITH, and others.
66
Multiple Failure: A failure of a protected function at a time when its protective function is already in a
failed state.
$$I_{ff} = \frac{2 \times M_{sd} \times M_{pf}}{M_{mf}}$$
Equation 3-3: Failure finding interval for risk of a multiple failure
Equation 3-3 describes the simplest configuration of a single device protecting a single
function. Appendix 4. (page 278) provides several extensions of this formula that cover a
variety of common situations involving multiple devices in parallel or in series, multiple
modes of failure, and other configurations.
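As a worked illustration, the sketch below reads Equation 3-3 as I_ff = 2 x M_sd x M_pf / M_mf for the single-device, single-function case. The numerical values are hypothetical and chosen only to show the arithmetic; all three mean times must be expressed in the same unit.

```python
def failure_finding_interval_for_risk(m_sd, m_pf, m_mf):
    """Equation 3-3: I_ff = 2 x M_sd x M_pf / M_mf (all arguments in the same time unit)."""
    return 2.0 * m_sd * m_pf / m_mf

# Hypothetical example, in years: safety device MTTF of 3, protected function
# MTTF of 10, and a tolerated mean time to multiple failure of 1000.
interval_years = failure_finding_interval_for_risk(m_sd=3.0, m_pf=10.0, m_mf=1000.0)

# Device availability implied by that interval, obtained by rearranging Equation 3-2
implied_availability = 1.0 - interval_years / (2.0 * 3.0)

print(f"Failure finding interval: {interval_years:.2f} years "
      f"(about {interval_years * 365:.0f} days)")
print(f"Implied safety device availability: {implied_availability:.1%}")
```

With these hypothetical inputs the interval is about 22 days, which corresponds to the 99% availability column of Table 3-1 for a device with a 3-year MTTF.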
$$I_{off} = \sqrt{\frac{2 \times M_{sd} \times M_{pf} \times C_{ff}}{C_{mf}}}$$
Equation 3-4
where:
Ioff = optimal failure finding interval
Cff = average cost of an inspection
Cmf = average cost of a multiple failure
Appendix 4. (page 278) provides a formula for the optimal failure finding interval for
multiple redundant safety devices.
exploration methods resulted in engineering and maintenance improvements that
gradually overcame the dominant failure modes on the JT8D engine67.
[Figure 3-5 plot not reproduced; surviving labels: axis tick 0.4, curve label “June – August 1964”]
Note how, in Figure 3-5, the conditional probability of failure curve continued to flatten
until it eventually showed no relationship of engine failure risk to operating age. During
the seven year period from 1964 to 1971 dominant failure modes were detected and
removed by redesign.
67
Report AD-A066-578, “Reliability-Centered Maintenance”, F. Stanley Nowlan, Howard F. Heap,
National Technical Information Service, U.S. Department of Commerce, 1978 (Figures 2, 3, 4, 5, and 7 to
18 have been reproduced from this reference document.)
68
The definition of conditional probability of failure is more thoroughly elaborated in Appendix 7. on page
290
The example of Figure 3-5 portrays a typical improvement pattern that applies to new
equipment (or equipment where effective PM had not been applied previously). The key
questions are:
Age exploration as a generic and integral part of an information strategy will help to
achieve our reliability improvement goals in minimal time and at lowest cost.
Furthermore, one may predict the expected rate of reliability improvement by
assuming that it will occur exponentially (that is, at a constant percentage of
improvement over the reliability of each preceding period). Figure 3-6 illustrates the
prediction and the reality.
[Figure 3-6 plot: failure rate (failures per 1000 hours, scale 0.1 to 2.0) against the years 1963 to 1972, showing the “Experience” and “Forecast” curves and the date of forecast. The horizontal axis label printed in the source is “Operating age since last shop visit (flight hours)”.]
Figure 3-6: Exponential reliability improvement
Improvements in reliability as a result of applying a maintenance information strategy
based on the principles of age exploration may be expressed as a decrease in failure rate.
The graph of Figure 3-6 shows the actual failure rates of the JT8D engine compared to
the forecast improvement in reliability. The forecast is characteristically exponential
when age exploration is used. The temporary deviation from the forecasted level between
1969 and 1971 was the result of the appearance of a new dominant failure mode that took
several years to resolve by redesign. Regardless of whether the improvement follows an
exponential or some other pattern, the point is that good information recording
procedures make it possible to ascertain the validity of a given improvement initiative.
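An exponential improvement forecast of the kind described above can be sketched as follows. The starting failure rate and the yearly improvement percentage are hypothetical values chosen for illustration; they are not read from Figure 3-6.

```python
def forecast_failure_rate(initial_rate, improvement_per_period, periods):
    """Failure rate falling by a constant fraction of its current value each period."""
    return [initial_rate * (1.0 - improvement_per_period) ** k
            for k in range(periods + 1)]

# Hypothetical: 1.0 failures per 1000 hours, improving by 20% per year for 5 years
forecast = forecast_failure_rate(initial_rate=1.0, improvement_per_period=0.20, periods=5)

for year, rate in enumerate(forecast):
    print(f"year {year}: {rate:.3f} failures per 1000 hours")
```

Failure rates recorded in the CMMS can then be compared, period by period, against such a forecast to judge whether an improvement initiative is delivering.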
Refining the maintenance program
Knowledge of how and when to improve a maintenance program comes principally from
two information sources:
Once the maintenance program goes into effect, age exploration of the results of the
scheduled tasks provides the basis for adjusting the initial conservative task intervals set
up by the RCM analysis team. And as further data becomes available the default69
decisions, made in the absence of information, are gradually eliminated from the
program. The process is portrayed symbolically in Figure 3-7.
69
“Default” here does not refer to RCM question 7 (page 16), but rather, to the conservative (default)
answers to the questions of the RCM algorithm in the absence of experience. The default answers are
provided in Appendix 13. on page 302.
1. Applicability: An indicator of an initiating failure process (reduced failure
resistance) can be detected and measured, and there is sufficient warning time in
which to proact, and
2. Effectiveness: The task will entirely avoid or reduce, to a tolerable degree, the
failure consequences, at an acceptable cost
Figure 3-8 demonstrates how we may assess pre-condition 2, “Is the CBM task
effective?”. The graph plots the age-reliability relationship for the two types of failure:
Recalling the data structure proposed in Figure 2-5 on page 24, we note that the CMMS
must accommodate the distinction between potential and functional failure as recorded
on a work order. Reliability analysis software70 may process that data and generate the
conditional probability of failure graph of Figure 3-8 and thus assist in the evaluation of
the merits of the CBM program. The upper curve shows the conditional probability curve
for all removals including both functional failures and potential failures. The lower curve
(line) shows the conditional probability of functional failures as reported by operating
personnel and recorded on the work order.
[Figure 3-8 plot: conditional probability of failure for 200 hour intervals (vertical axis, 0.1 to 0.4) against operating age since last shop visit (flight hours, 0 to 4000), with curves for total removals, potential failures, and functional failures.]
Figure 3-8 Conditional probability of functional and potential failure
The distance between these two curves represents the conditional probability of detecting
potential failures as a result of on-condition inspections. This difference between the Total
removals and Functional failures conditional probability curves therefore represents the
effectiveness of the existing CBM program. Functional failures may have safety,
70
See Software analytic tools page 47
operational or economic consequences. Potential failures, by definition, do not have
safety or (significant) operational or economic consequences.
An analysis of Figure 3-8 determines that no scheduled overhaul of this unit will offer
additional value because the conditional probability of functional failure is independent
of the equipment's working age71 (as a result of the on-condition maintenance tasks that
have been performed). Scheduled overhaul, where effective on-condition maintenance is
in place, will, therefore, be ineffective. In fact, we would not want to reduce the
incidence of potential failures except by redesign since they are clearly effective in
reducing the number of functional failures.
[Figure 3-9 plot: family of conditional probability curves against operating age, showing total removals and failure modes A, B, and C, with infant mortality marked.]
71
Low and constant conditional probability of failure curves are characteristic of a well maintained item.
distance between the upper curve and next lower one represents the probability of
unverified failures (those not attributed to a failure mode) from unknown causes.
To determine how we might improve the reliability of this item we must examine the
contributions of each failure mode to the total verified failures. For example, failure
modes A and B show no increase with increasing age; hence any attempt to reduce the
adverse age relationship must be directed at failure mode C. There is also a relatively
high conditional probability of failure immediately after a shop visit as a result of notable
infant mortality from failure mode A. The higher incidence of early failures from this
failure mode could be due to a problem in shop procedures. If so, the difficulty might be
overcome by changing shop specifications either to improve quality control or to break in
a repaired unit before it is returned to service.
Example 1
Heavy duty bearings in a steel forging plant have failed after the following number of
weeks of operation.
Age at Failure
(Weeks)
8
12
14
16
24
24 unfailed
This data may be entered into the Relcode data entry screen as shown in Figure 3-10.
Figure 3-10: Relcode data entry for steel forging plant bearings
Note that record five holds the remaining unfailed (suspended) bearings. The analysis is
performed by the software and the graph of the hazard function, which differs from the
conditional probability of failure graph only by a constant (see Appendix 7), can be
displayed as in Figure 3-11.
Figure 3-11: Hazard rate graph indicates that there is a period of about 5 weeks where the
conditional failure probability is negligible, followed by a period where conditional failure
probability increases with working age.
Exercise 2
Records from two heavy duty dumper trucks show that fan belt failures occurred at the
following odometer readings (kilometers, from new).
Truck 1 Truck 2
51220 45380
68060 103510
At present the odometer readings are:
Truck 1 Truck 2
105680 132720
We populate the six Relcode records with the age values: 51220, 68060-51220, 45380,
103510-45380, 105680-68060, and 132720-103510 as illustrated in Figure 3-12.
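The six age values can be derived mechanically from the odometer readings. The sketch below is illustrative only and is not part of Relcode. It assumes that each truck started with a new belt at 0 km, that each failed belt was replaced immediately, and that the belt currently fitted on each truck is an unfailed (suspended) record.

```python
def belt_lifetimes(failure_odometer_km, current_odometer_km):
    """Return (age_km, failed) records for one truck.

    Each failure closes a belt life measured from the previous failure (or from
    new); the belt still running at the current reading is a suspension.
    """
    records = []
    previous = 0
    for reading in failure_odometer_km:
        records.append((reading - previous, True))
        previous = reading
    records.append((current_odometer_km - previous, False))
    return records

truck_1 = belt_lifetimes([51220, 68060], current_odometer_km=105680)
truck_2 = belt_lifetimes([45380, 103510], current_odometer_km=132720)

for name, records in (("Truck 1", truck_1), ("Truck 2", truck_2)):
    for age_km, failed in records:
        print(f"{name}: {age_km} km, {'failed' if failed else 'suspended'}")
```

This yields four failure ages (51220, 16840, 45380, and 58130 km) and two suspensions (37620 and 29210 km).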
Figure 3-12: Relcode data entry for heavy duty dumper truck fan belts
Figure 3-13
Additional examples in the use of Relcode are given in Appendix 14. on page 303.
72
For many failure modes the measurement level at which a potential failure is declared is based on
judgment and experience. The EXAKT methodology recognizes the probabilistic nature of a potential
failure and therefore defines a “best” decision (way of setting an action limit) based on a stated long-run
optimizing objective.
On the other hand, if we set our alert level too high (too liberally) we will experience a
larger number of failures than necessary and incur unnecessary costs (and possibly
health, safety, and operational consequences) and excessive downtime. Our goal is to set
our potential failure declaration (data interpretation) policy at the optimal position (best
compromise) between the two poles. The EXAKT methodology is a form of age
exploration. It models the ages of previous potential and functional failures and
preventive renewals together with the condition data73 leading up to those events. It
blends in the failure’s economic consequences, and generates an optimal policy for
declaring potential failures. The effectiveness of a proposed new policy may be compared
with that of current practices by using the software’s “cost comparison” function. The
details on how to evaluate a proposed policy are provided in the Appendix (page 295).
CBM effectiveness is related, ultimately, to how “good” the condition data is. That is,
to what degree it holds information that, in some way, reflects the degradation process in
the item (and/or to what extent it measures the accumulated external stress imposed on
the item). CBM effectiveness is also, quite obviously, highly related to the ratio of the
average cost of a preventive action to the average economic consequences of failure.
Lastly, CBM effectiveness depends, as well, on the quality of data collection, processing,
and analysis.
When some policy (of PM74) is applied to the data, the cost is defined as the average
realized cost. The “average realized cost” is the realized cost for all failed and
preventively replaced histories divided by the total realized time for failed and
preventively replaced histories. The formula is:
Where C is the average cost of a proactive task and K is the average additional cost of
the economic consequences of failure (secondary damage, fines, lost sales, and so on.)75
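The verbal definition of average realized cost can be sketched as below. This is one reading of that definition, not the EXAKT implementation: it assumes that a preventive replacement costs C and that a failure costs C + K, and the history data are hypothetical.

```python
def average_realized_cost(histories, c, k):
    """Total realized cost divided by total realized time.

    histories: list of (working_age, ended_in_failure) pairs, one per failed or
    preventively replaced history.
    c: average cost of a proactive task (assumed to apply to every replacement).
    k: average additional cost of the consequences of a failure.
    """
    total_cost = sum(c + k if failed else c for _, failed in histories)
    total_time = sum(age for age, _ in histories)
    return total_cost / total_time

# Hypothetical histories: working age in hours, and whether the history ended in failure
histories = [(1000, False), (800, True), (1200, False)]
cost_rate = average_realized_cost(histories, c=100.0, k=400.0)
print(f"Average realized cost: {cost_rate:.4f} per hour of working age")
```

Evaluating the same function on the histories under the current policy and under a proposed policy gives comparable costs per unit of working age.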
73
Observations, operating data, machinery signals, etc from which a potential failure may be deduced.
74
“PM” in the general sense of proactive maintenance referring here to a policy of scheduled inspections
(on-condition maintenance), scheduled rework, or scheduled discard.
75
The EXAKT methodology is thoroughly examined in Chapter 10. Optimizing CBM on page 113.
maintenance and failure costs, per unit of working age, of applying the current policy.
We may calculate the projected (total failure and maintenance) costs per unit of working
age under a proposed “optimal” policy. And finally, we may compare both to the
calculated cost (of maintenance and failure per unit of working age) with no proactive
maintenance policy whatsoever. Figure 3-14 provides an example of an evaluation of a
policy proposed by an EXAKT analysis of the data.
In Figure 3-14 the projected average cost per unit of working age is 84% of that of the
current policy. The percentage of preventive (versus reactive) maintenance incidents will
be 98.79%, which is 230.5% greater than that of the existing policy. However, the
mean-time-between-replacements (3326) will be less than the current value (8775.29). This
means that we will intervene more frequently, in order to realize a net cost saving of
16%.
[Figure 3-15 plot: number of premature removals (vertical axis, 0 to 10) by calendar quarter, 1971 to 1973, annotated “Modification completed” and “Inspection requirement removed”.]
Figure 3-15
Figure 3-15 depicts the history of the C-sump problem in the General Electric CF6-6
engine on the Douglas DC-10. The on-condition task instituted to control this problem
had to be reduced to 30-cycle intervals in order to prevent all functional failures. The
precise cause of this failure was never pinpointed; however, both the inspection task and
the redesigned part covered all possibilities. Once modification of all in-service engines
was complete no further potential failures were found, and the inspection requirement
was eventually eliminated.
Introduction
Reliability analysis relies on the availability of systematic records of events undergone by
significant components. The procedures described in this section will enable
computerized maintenance management systems to fill an important role in reliability
analysis.
working age. To perform reliability analysis, we must know the working age at each
important event in an item’s lifecycle. Such events include, at the very least, a beginning
(B) event (at installation, overhaul, or replacement) and an ending (E) event. The two
principal ending events are:
An ES event is a removal of the item from operation for any reason other than failure76.
A. Functional failure
B. Potential failure
Each of these event types must be identified in a CMMS work order record. The way in
which these events are recorded was introduced in Chapter 1. and Chapter 2. A practical
approach for recording events will be outlined in Chapter 4.
The foregoing raises the question “What is a significant component?” We must decide,
therefore, whether or not a part is, in fact, significant, so as to justify tracking its
individual lifecycle. A part should be considered significant if it is related to a failure
mode whose consequences are significant. The example of an ingot transporter in the
following section illustrates the notion of significance.
76
An item may be removed or reworked as the result of a scheduled task, or because it was expedient to do
so at that particular time. Such an event would be an “ES” event. If the item failed (a functional failure), or
failure was imminent (a potential failure), that event would be classified as “EF”.
Significant components
Appendix 10. (page 294) illustrates the variable depth of causality at which a failure
mode may be reported. Consider an ingot transporter in a steel mill (Figure 3-16).
Had the two pumps been piped in parallel as backups for one another, this level of
causality might be sufficient since the failure of either one would have no “direct”
operational consequences.77 However, in this case the consequences of a failure of either
pump are, in fact, operational. Hence it is worthwhile to define two failure modes78:
“hydraulic system failure due to failure of left pump” and “hydraulic system failure due
to failure of right pump”. The family of conditional probability curves (Figure 3-9 page
46) for the transporter might therefore include one curve representing the age-reliability
relationship of the left pump and one representing that of the right pump. If these pumps
are relatively quick and easy to replace we would probably stop there. That is, the
consequences of failure do not justify specifying the failure modes of the pump itself (for
example gear worn, diaphragm leaks, etc). In some other system, operating in another
context, an identical pump could require deeper failure mode analysis than that elected to
be performed here. There are no hard-and-fast rules on the depth of causality at which to
manage a failure mode. It depends entirely on operating context.
77
It does have non-operational consequences since the failed pump would have to be repaired in order to
avoid a multiple failure which would indeed have operational consequences.
78
And therefore two corresponding records in the RCM table. This is entirely a judgement call, in which
supervisors, maintainers, and operators decide that the consequences are severe enough for them to take the
extra trouble to understand the behavior of the pump at each location.
Keeping track of the working ages of individual components upon which we wish to
perform reliability analysis can be a daunting data challenge. However the EXAKT
maintenance information methodology simplifies the chore by introducing two new
concepts:
Suspended Animation
Suspended animation is a period in which a component (equipment or module) is out of
operation, but not because of failure. For example, one cylinder of a gas compressor
may be taken off line for minor maintenance.79 We introduce two new events in order to
deal with this (and similar) situations:
Typically, meters or process computers provide the working age for the system as a
whole, for example, the throughput of a production line. The production line consists of
many equipment units. If an individual item or component of an item is taken off line for
an extended period of time, the work order should mark the event with “BSA” and record
the system working age. When the component is returned to service, the work order
should mark that event with “ESA” and record the system working age. The software
will, via these recorded events, keep track of the working age of every significant
component or equipment, without the necessity of individual working age meters for
every significant item.
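The bookkeeping described above can be sketched as follows. This is a simplified illustration rather than the EXAKT implementation; the event codes follow the text ("B", "BSA", "ESA", and "E" for any ending event), and the ages used after the first suspension are hypothetical.

```python
def component_working_age(events):
    """Accumulate a component's working age from system working-age readings.

    events: chronological list of (event_code, system_working_age), where
    "B" marks the beginning, "BSA" and "ESA" the beginning and end of a period
    of suspended animation, and "E" an ending event.
    """
    age = 0.0
    running_since = None  # system working age at which the component last started running
    for code, system_age in events:
        if code in ("B", "ESA"):
            running_since = system_age
        elif code in ("BSA", "E") and running_since is not None:
            age += system_age - running_since
            running_since = None
    return age

# The component runs from 0 to 4500, is suspended until 6000, then runs to 9000
events = [("B", 0), ("BSA", 4500), ("ESA", 6000), ("E", 9000)]
print(component_working_age(events))  # 4500 + 3000 = 7500.0
```

Only the system working age at each recorded event is needed; no separate meter on the component is required.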
79
A “minor” maintenance event is one that does not rejuvenate the equipment. It may be an adjustment or
recalibration or an alignment. It may, or may not, impact condition monitoring data.
Figure 3-17 Events table (partial) for a fleet of haul trucks
Consider the events table of Figure 3-17. Events such as B and BSA indicate, respectively,
the beginning of the life of an item (an equipment unit) and the beginning of a period of
suspended animation.
Now consider the equipment item "HT-07". On "8/3/94" a significant component was
removed but the component was part of a system that continued to operate. Failure of a
second component in the system occurred on 12/29/94, while the first component was still
out of service. Once again, accurate working ages may be tracked. The B1SA event tells
the software that component one’s life has been suspended at 4500 - 0 = 4500 hours.
Until an E1SA event occurs, Component 1 is considered to be in suspended animation.
Meter reset
Observe equipment "HT-17" (Table 3-2). There was a meter malfunction on 4/22/96 and
the meter was reset (arbitrarily) to 1000 hours the next day. At meter reading 4274 failure
occurred. The lifecycle working age at failure is (7230 - 0) + (4274 - 1000) = 10504
hours.
Table 3-2 Documenting a meter reset
Suppose a CBM inspection was made on "7/23/96", at which time the meter indicated
2035. Then, the actual working age at the time of the CBM inspection would be
computed automatically by the software as (7230 - 0) + (2035 - 1000) = 8265.
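The meter-reset arithmetic can be written out as a small function. This is a minimal sketch of the calculation shown above, not the software's own code; the function and parameter names are illustrative.

```python
def working_age_after_reset(reading_before_reset, reset_to, current_reading,
                            start_reading=0):
    """Lifecycle working age when the meter was reset once during the lifecycle."""
    age_before_reset = reading_before_reset - start_reading
    age_after_reset = current_reading - reset_to
    return age_before_reset + age_after_reset

# Values from the HT-17 calculations: last reading of 7230 before the reset to 1000
failure_age = working_age_after_reset(7230, 1000, 4274)
inspection_age = working_age_after_reset(7230, 1000, 2035)
print(failure_age, inspection_age)  # 10504 8265
```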
Marginal analysis
We note, as well, in the events table that “Events” such as B1, E1F,
B1SA, and E1SA refer to a specific component (or failure mode), rather than to the
equipment as a whole. In Example 3 Complex Items (page 146) of Chapter 10. we will
show how the data structure of EXAKT accommodates the accurate tracking of
individual component ages, simply by keeping track of the date and working age of each
significant component’s installation and ending events.80 In Chapter 4. we will suggest
specific database practices and procedures for practical application of the important
principles associated with tracking significant components and failure modes.
80
In Chapter 17. (page 249) we describe the MIMOSA strategy to track the dates and working ages of
equipment assets that are moved from one operational “segment” to another.
Chapter 4. Acquiring Maintenance Information
Introduction
EWOP is an acronym for EXAKT Work Order Processor. The EWOP is a CMMS
integrated process for managing maintenance data in support of reliability. EXAKT
CBM optimization projects and all reliability (and reliability-centered maintenance)
analysis81 endeavors will benefit from intelligible data.
81
We make the distinction between “Reliability Analysis” (RA) and “Reliability-centered Maintenance
Analysis” (RCMA). The former is the study of what did happen, while the latter is the study of what could
happen. RCMA could include a consideration of RA studies. RCMA develops a knowledge base that
results in an initial maintenance strategy. RA is used to continuously update the knowledge base and to
sanity-check and correct any assumptions or mistakes made due to incomplete information at the time of
RCMA. Nowlan and Heap used the term “age exploration” to describe any type of RA technique.
The underlying work order data structure reflected in Figure 4-1 cannot adequately
represent the variety of maintenance knowledge elements that can apply to any given
maintenance or repair situation. Without such information, we lack the ability, during
subsequent analysis of the asset’s work order history, to reconstruct faithfully what
actually happened in each instance. Our inability to use work order histories effectively
for reliability analysis represents a fundamental problem in physical asset management.
To be of value, work order historical data must isolate (and report on) the five basic
reliability knowledge elements:
Because reliability analysis (for example, Weibull, EXAKT, Pareto, Jack-knife, and
many others) requires this degree of information granularity, we seldom observe analyses
based on fact in the typical maintenance organization. The inability to perform
fact-based reliability analysis makes it difficult, and often impossible, to improve the overall
equipment effectiveness (OEE) of physical assets. On the other hand, if a reliability
analysis can be performed, attaining policy improvement is usually straightforward.
EWOP enables the use of the CMMS historical database for reliability analysis!
Lexicon
Failure mode: A cause of a failure described at a practical depth in the causality chain.
Note: In the EWOP (and in reliability analysis in general) the terms component and
failure mode may be used interchangeably. Where a significant component is affected by
a dominant failure mode the terms are equivalent. However where a component’s
reliability is affected by more than one reasonably likely failure mode, our interest
centers on the failure mode rather than on the component. Either way, our analysis
proceeds identically. Predictive models, therefore, will sometimes apply to a component
and they will sometimes apply to a failure mode.
The EWOP extracts information from the work order historical database and, using that
information, performs these two important functions:
The EWOP requires that maintenance personnel adopt a new approach regarding the way
in which they document their work orders at the completion of a maintenance or repair
task. A work order is the principal vehicle for the acquisition of information revealed
from the field, describing the “as-found” state of a physical asset. Such information
assists engineers in conducting reliability analysis (RA). RA requires an accurate
chronology of events related to the failure behavior of a physical asset. This section
suggests a practical way of documenting a work order. Good documentation will support
subsequent reliability analysis, which, in turn, will improve reliability and lower cost.
Figure 4-2 Work order with EWOP’s required data fields
At a minimum a (EWOP friendly) work order should provide the following sixteen items
of data:
2. Item
3. Date out
4. Date back
5. Work performed
6. By
Working age should closely relate to the accumulated work performed and stress
undergone by the item. (For example kilowatt-hours, fuel consumed, production units
produced, and so on are sufficiently accurate ways to measure working age). Wageout is
the working age at Date out.
9. RCMREF
10. Function
The function that was lost, compromised, or threatened by the events that provoked the
issuance of the work order.
11. Failure
The way in which the function was lost, compromised, or threatened. For example, total
loss of function, partial loss of function, or potential loss of function. (For example the
failure “Transmits only 60% to 80% of the required torque of 900 ft-lb” speaks directly to
the function “To transmit 900 ft-lb of torque to the wheel”.)
12. Cause
What caused the loss (compromise or threat) to the function. Identify the cause at a
practical depth in the causality chain. (For example, clutch slips, or clutch slips due to
worn disks, or clutch slips due to worn disks caused by oil leak, or clutch slips due to
worn disks caused by oil leak as a result of an incorrect seal fitting, etc.) The depth of
causality at which we state the cause (failure mode) is dictated by the practicality at
which the organization can do something about the cause and its consequences. The
causality level is decided through discussion among maintainers, supervisors, engineers,
and planners. This is the most challenging, yet rewarding, part of the EWOP approach to
maintenance information management.
13. Effects
The sequence of relevant events (that are worthwhile to record), within the component,
within the equipment, and within the organization that led up to the failure, occurred
during the failure, and occurred following the failure82
14. Consequences
Based on the Effects, a determination of how the failure matters. Select one of 1. Hidden,
2. Safety, health, environment, 3. Operational, and 4. Non-operational.
82
To the degree that the consequences of the failure are significant the Effects should answer these
questions:
A. What sequence of events (internally and organization wide) could be touched off by the failure
mode?
B. How does the failure make itself known? What observable events lead up to the failure?
C. How is safety or the environment impacted? (do not mention the words "safety" or
"environment")
D. How is production impacted? (quality, cost, customer service)
E. Is there any additional damage caused by the failure?
F. How long will it take and what actions must be accomplished to correct the failure?
G. How does the likelihood of this failure depend on deeper causes? Has it happened before? How
often? Under what circumstances?
83
It is important to distinguish between the event types FF and PF. If the consequences of failure have been
largely avoided or mitigated due to having detected the failure, select PF. In subsequent reliability analyses,
the hazard rates for FF and PF events may be compared and an evaluation of CBM effectiveness may be
made.
84
Perhaps the failure was covered by a previous work order. The item functioned for a period of time
without this component (e.g. one cylinder of a gas compressor) and the component was finally re-instated
in the current work order.
o MR – the minor repair of the item. It does not renew any components. Sometimes
it will impact the monitored data. For example, a calibration, a shaft alignment, an
oil change, the balancing of an impeller, and so on.
In most cases one of the first three (FF, PF, and S) will apply. Properly selecting the
Event type from the drop down list and providing additional information in the “Work
performed” field will allow the EWOP to create the required Events (for subsequent
reliability analysis) in the Events table. Several examples are given beginning on page 66
in the Section “Examples”.
One may argue that this degree of “wordiness” and detail in completing a work order,
especially where a general overhaul was performed, is onerous and excessive. One would
agree, however, that it is worthwhile to adopt a degree of informational completeness that
is proportional to the consequences and probability of the failure of the equipment in
question. Recalling the example of the ingot transporter in the section Significant
components on page 54, the repair person will devote less time and apply less
detail to less critical equipment, say, equipment whose functions are duplicated by a
backup system. Recall too that most of the 16 reliability information elements for a
failure mode of an item need be entered manually only once. Thereafter, in future
incidents, the RCM record is merely referenced (using information element 9) in the
work order record. (Recall the advantages of the one-to-many integrity constraint
illustrated in Figure 2-2 on page 21.) Moreover, the Event codes described in the next
section constitute accurate failure codes that will eventually reduce the clerical verbosity of this
approach almost to zero.
85
Could contain “pseudo” work orders as structured text. This might be in the case of CMMSs that do not
permit the creation on demand of additional work orders, which are needed to cover unique item-function-
failure-causes.
The events table
The events table (Figure 4-3) is one of the two important outputs of the EWOP.
Reliability analyses (such as Weibull, Pareto, and Proportional hazard modeling) require
information on the beginning and ending events of a component or a failure mode. The
most significant events in the life of a component are 1. its installation, 2. its ending due
to failure, and 3. its ending due to a reason other than failure. Reliability analysis makes
use of the dates and working ages at which events occur. Events define a component’s
life cycles. Reliability analysis discovers the relationship between a component’s
working age and its failure probability86. Proportional hazard modeling takes RA one step
further by discovering the three-way relationship between age, reliability, and condition
monitoring data87.
86
It is often referred to as the age-reliability relationship.
87
Of all types: performance data, process data, sensor data, and external data such as environmental factors.
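The way in which beginning and ending events delimit life cycles can be sketched in a few lines of Python. This is an illustration only: the tuple layout, the function name, and the sample working ages are assumptions made for the sketch, not the structure of the EWOP Events table. The labels B, EF (ending by failure), and ES (ending by suspension) are the generic model events used later in this chapter.

```python
from typing import List, Tuple

def life_cycles(events: List[Tuple[float, str]]) -> List[Tuple[float, bool]]:
    """Turn chronological (working_age, label) events into (lifetime, failed) pairs.

    label is "B" (beginning), "EF" (ending by failure) or "ES" (ending by
    suspension). Events are assumed to be listed in the order they occurred.
    """
    cycles = []
    begin_age = None
    for age, label in events:
        if label == "B":
            begin_age = age
        elif label in ("EF", "ES"):
            if begin_age is None:
                continue  # ending with no recorded beginning: ignored in this sketch
            cycles.append((age - begin_age, label == "EF"))
            begin_age = None
        else:
            raise ValueError(f"unknown event label: {label!r}")
    return cycles

# Hypothetical history: one failure, then one preventive renewal (suspension).
history = [(0, "B"), (4200, "EF"), (4200, "B"), (9100, "ES"), (9100, "B")]
print(life_cycles(history))  # [(4200, True), (4900, False)]
```

The resulting list of lifetimes, each flagged as a failure or a suspension, is the kind of input that an age-based (for example Weibull) analysis works from. The last "B" event opens a life cycle that is still in progress, so it produces no lifetime yet.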
The RCM knowledge base
Figure 4-3 shows examples of records in the RCM knowledge base – the second
important output of the EWOP. The EWOP added these records to the RCM knowledge
base, by using the information from the work order.88
The EWOP requires that a work order relate to a single combination of the fields:
1. item
10. function
11. failure
12. cause
If more than one combination of these four elements occurs (and is significant) in a work
order, the combinations should be extracted into as many sub work orders as required. If the CMMS
cannot be adapted to generate additional work orders on demand (in order to
accommodate each additional significant combination of data elements), the text area
(“Additional information”) of the work order may be used for this purpose. (See Appendix The
short term process page 265.)
Examples
Example 1
A component has suffered a functional failure and is renewed. Select: “FF”. The
EWOP will generate the EnnnFF and BnnnFF events in the Events table, where “nnn” is
the RCMREF of the RCM knowledge base record that describes the current item-
function-failure-cause.
88
“As-found” information complements and sanity checks the knowledge developed during reliability-
centered maintenance analyses. The reliability knowledge base is, in this way, said to be “living”.
Figure 4-5 Work order with the Event Type "FF" selected
The work order of Figure 4-5 illustrates the event type “FF” having been selected. The
EWOP, upon encountering “FF” generates the two events of Figure 4-6.
Figure 4-6 Events generated from Work order 9 with event type "FF"
Examine the Events table of Figure 4-6, in particular the column “Event”. On May 14,
2004, work order 9 was issued. The work order generated the ending event with failure,
“E1101FF”, and the beginning event, “B1101FF”. Note that the RCM reference number,
1101, was sandwiched between the first letter and the last two. Thus, if we were
analyzing (building a model of) the failure behavior of a particular failure mode we
could easily identify all events in the database that refer to that item-function-failure-
cause.
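Because the RCM reference is embedded in the event label, selecting all events of one failure mode reduces to a simple pattern match. The following Python sketch illustrates the idea. The function names and the regular expression are assumptions for illustration and are not part of the EWOP; labels that do not carry a numeric reference are simply skipped.

```python
import re
from typing import List, NamedTuple

class EventCode(NamedTuple):
    prefix: str      # "B" (beginning) or "E" (ending)
    rcmref: int      # RCM knowledge base record number
    event_type: str  # e.g. "FF", "PF", "S"

# One leading letter, the numeric RCM reference, then the event type letters.
_PATTERN = re.compile(r"^([BE])(\d+)([A-Z]+)$")

def parse_event_code(code: str) -> EventCode:
    """Split a label such as 'E1101FF' into its three parts."""
    match = _PATTERN.match(code)
    if match is None:
        raise ValueError(f"unrecognized event code: {code!r}")
    return EventCode(match.group(1), int(match.group(2)), match.group(3))

def events_for_failure_mode(codes: List[str], rcmref: int) -> List[str]:
    """Return the labels that refer to one RCM record, skipping unparsable ones."""
    selected = []
    for code in codes:
        match = _PATTERN.match(code)
        if match is not None and int(match.group(2)) == rcmref:
            selected.append(code)
    return selected

print(parse_event_code("E1101FF"))
print(events_for_failure_mode(["E1101FF", "B1101FF", "E890PF", "B890PF"], 1101))
```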
When building the predictive model, we will conveniently “map” (for example by using
the mapping dialog of EXAKT, described in the Appendix Exercise page 313 ) the event
“E1101FF” to the event EF in the model named “Crusher 1017 B Failure mode 1101”.
Finally, in this instance, no RCM reference existed prior to the work order. Therefore, the
EWOP generated record “1101” (using the information provided on the work order) and
inserted it into the knowledge base (RCM table). The record will be quality checked
by a reliability engineer, planner, analyst, or other person versed in the use of RCM
language and concepts. This is one important way in which the reliability-centered
knowledge base grows.89
Example 2
A component has revealed a potential failure and is renewed. The technician will
select “PF” in the Event type field. Similarly to Example 1, the EWOP will generate the
event records “EnnnPF” and “BnnnPF” as illustrated in Figure 4-7.
Figure 4-7 Events generated from a work order with the Event type "PF"
A second important aspect of this work order is that it relates to an already known failure
mode, hence the reference to RCM record 890. The technician checked the knowledge
base and discovered that the item-function-failure-cause is known90. Therefore, he
references (rather than duplicates) the known reliability information. The referenced
RCM record is illustrated in Figure 4-8.
Example 3
A component has been renewed preventively. The technician selects “S” (for
suspension) in the Event type field. No failure (neither a PF nor a FF) has occurred.
Figure 4-9 Events generated from a work order with the Event type of "S"
89
The other way is through reliability-centered maintenance analysis (RCMA). While RA analyzes failure
and potential failure events that have occurred, RCMA considers and analyzes reasonably likely failure
modes that may or may not have occurred in the past. The two processes populate the same knowledge
base. Each benefits from the experience and thinking of the other.
90
Either it has occurred before, or it was considered and included in a RCMA project.
When building a predictive model of failure mode 1124 (i.e. RCM record 1124), a
reliability engineer or analyst will map the events E1124S and B1124S to the events ES
(ending by suspension) and B (beginning of life cycle) of failure mode “1124” of item
“Crusher 1017B”.
Example 4
A new component has been installed but no component was replaced. Only one
event is generated.
Figure 4-10 A single event generated for an installation of a component for the first time
Examine the second record of Figure 4-10. Work order 22 covers the installation of a
component where there was none. Hence EWOP inserts a single Event “B852B”. EWOP
knows to insert only a beginning event because the technician selected “B” (rather than
FF, PF, or S) in the Event type field.
A month earlier on work order 21, the “component” 852 was removed, possibly for a
minor repair (such as cleaning, adjustment, etc) with the intention of placing the part back
in service at a later date. The component was supposedly placed in “suspended
animation” (hence the event E852BSA). However, it was decided to renew the
component fully before placing it back in service. That is why we have a “B” event
where an ESA event was expected.
It will be quite clear to an analyst, however, that the E852BSA event should be mapped to an
ES event, leaving no ambiguity that the component’s life cycle was suspended (rather
than having been placed in suspended animation as originally thought).
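The rule illustrated by Examples 1 through 4, which maps the selected Event type to the labels written to the Events table, can be summarized in a short Python sketch. The function name is hypothetical, and only the four event types covered so far (FF, PF, S, and B) are handled. The other event types are deliberately left out rather than guessed.

```python
from typing import List

def generate_event_codes(rcmref: int, event_type: str) -> List[str]:
    """Return the event labels implied by one work order.

    FF, PF and S end the current life cycle and begin a new one;
    B records a first-time installation, so only a beginning event results.
    """
    if event_type in ("FF", "PF", "S"):
        return [f"E{rcmref}{event_type}", f"B{rcmref}{event_type}"]
    if event_type == "B":
        return [f"B{rcmref}B"]
    raise ValueError(f"event type not covered by this sketch: {event_type!r}")

print(generate_event_codes(1101, "FF"))  # ['E1101FF', 'B1101FF']
print(generate_event_codes(1124, "S"))   # ['E1124S', 'B1124S']
print(generate_event_codes(852, "B"))    # ['B852B']
```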
Example 5
Work order 13 covers the repair of a component (whose dominant failure mode is
documented in RCM record 1120). However, it was necessary to reset the meter at this
time. (This could have been for operational reasons, the meter reached its maximum, or
because the meter was replaced.) Of course, a meter reset such as this will erroneously
impact the recorded ages of every significant item and failure mode whose reliability we
wish to track. How do we make sure our reliability data is not compromised by such an
event?
The answer is simply to include the phrase “meter reset” in the Work performed field.
The EWOP will understand that the meter was reset and it will insert two artificial events
into the Events table. The events will be labelled “BmeterESA” and “EmeterBSA”. This
tells the analyst that when modeling any component or failure mode of the item, he
should map these events to ESA and BSA in the reliability analysis project for the
component or failure mode under scrutiny. The analysis software will internally adjust
the working ages of every component whose histories include ESA and BSA events.91
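The text does not say how the analysis software performs this adjustment. One plausible approach, shown below as a Python sketch under that stated assumption, is to carry forward an offset equal to the meter reading lost at each reset. The record layout and the sample readings are hypothetical. The markers follow the mapping described above: BSA marks the last reading before the reset and ESA marks the first reading after it.

```python
from typing import List, Optional, Tuple

def continuous_working_ages(
    records: List[Tuple[float, Optional[str]]]
) -> List[float]:
    """Convert raw meter readings into working ages that ignore meter resets.

    Each record is (meter_reading, marker). The marker is "BSA" for the last
    reading before a reset, "ESA" for the first reading after it, else None.
    """
    offset = 0.0           # working age hidden by all earlier resets
    reading_at_bsa = None  # meter reading when the current gap began
    ages = []
    for reading, marker in records:
        if marker == "BSA":
            reading_at_bsa = reading
        elif marker == "ESA":
            if reading_at_bsa is None:
                raise ValueError("ESA event without a preceding BSA event")
            offset += reading_at_bsa - reading
            reading_at_bsa = None
        ages.append(reading + offset)
    return ages

# Hypothetical readings: the meter is reset to zero after reaching 9900 units.
raw = [(9000, None), (9900, "BSA"), (0, "ESA"), (500, None)]
print(continuous_working_ages(raw))  # [9000.0, 9900.0, 9900.0, 10400.0]
```

With the offset applied, the reading of 500 taken after the reset is correctly interpreted as a working age of 10400.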
Example 6
Work order 23 covered the functional failure and renewal of component (failure mode)
881 on November 14. Observe (in Figure 4-13) the way in which the work order 23 was
documented. The phrase “used (4000)” was inserted by the technician into the text
of the Work performed field. This tells the EWOP that a used component, one that had already
accumulated 4000 time units of working age, was installed rather than a new one.
91
The internal adjustment by reliability software of the individual component age is discussed in Chapter 4.
Figure 4-13 Specifying that a used component was installed
The EWOP, therefore, subtracts 4000 from the B event, thereby “instantly” aging the
component by 4000. In addition it generates a “Start Monitoring” (SM) event to indicate
that no CBM results will apply to this component for the first 4000 time units of its life.
In any subsequent analysis of the reliability of component (or failure mode) 881, the
meaning will be clear. Appendix 1. EWOP details (page 263) provides further discussion
on the analysis regarding replacement by a used component.
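The arithmetic behind the “used (4000)” convention can be sketched as follows. The parsing pattern, the function names, and the installation age of 12000 are assumptions for illustration only. The sketch is not the EWOP's own implementation.

```python
import re

def prior_use(work_performed: str) -> float:
    """Extract the prior working age from a phrase such as 'used (4000)'."""
    match = re.search(r"used\s*\(\s*(\d+(?:\.\d+)?)\s*\)", work_performed, re.IGNORECASE)
    return float(match.group(1)) if match else 0.0

def beginning_event_age(install_age: float, work_performed: str) -> float:
    """Working age recorded for the B event: the installation age less prior use."""
    return install_age - prior_use(work_performed)

def component_age(item_age: float, begin_age: float) -> float:
    """Age of the component when the item has reached item_age."""
    return item_age - begin_age

# Hypothetical case: a used component is fitted when the item's working age is 12000.
begin = beginning_event_age(12000, "Replaced clutch pack, used (4000)")
print(begin)                        # 8000.0
print(component_age(12000, begin))  # 4000.0 at the moment of installation
print(component_age(13000, begin))  # 5000.0 after a further 1000 units
```

Moving the beginning event back by 4000 means that every later age calculation for this component automatically includes its earlier service.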
Example 7
Figure 4-14 A component placed in suspended animation and later on returns to service
In work order 19 component 800 was removed for some minor reason and was placed
into suspended animation. Four months later it was reinstated on work order 20. Once
again the events E800BSA and B800ESA will be mapped to events BSA and ESA
respectively when modeling component (failure mode) 800.
You may be wondering why E (for “ending event”) is the prefix to E800BSA, which marks
the beginning of a period of suspended animation. The reason for this seeming oxymoron
is that the beginning of a period of suspended animation is, from the component’s point
of view, the end of a segment of its life cycle.
simple process for reliability information management, assisted by a software tool called
EWOP. The EWOP provides a quick, consistent, and inexpensive way to capture field
information needed by all reliability analysis techniques and software92.
We hope that the growing use of EWOP will encourage CMMS builders to supplement
their systems with similar features. Additionally, we encourage reliability consultants to
embrace and teach the principles of living reliability-centered knowledge. A
demonstration version of the EWOP may be obtained by communicating with the author.
92
Including reliability-centered maintenance analysis, FMEA, FMECA, root cause failure analysis, risk
based inspection, HAZOP, FRACAS, six-sigma, and many others that require the consideration of facts
based on experience.
Chapter 5. Assessing “What-if” from
maintenance information
Introduction
We gather information in the course of our day-to-day maintenance activities in order to
deepen our understanding of failure so that we may better manage its causes and control
its consequences. We use our growing knowledge of the causes and effects of failure to
improve reliability. By "reliability improvement" we mean the attainment of desired
levels of availability, reliability, operating/maintenance cost, yield, production rate,
safety, and environmental integrity of each significant physical asset in its operating
context.
In the preceding chapter, we described methods and tools for using the CMMS to report
the outputs of an existing maintenance policy. For example, the graph of Figure 3-8 (page
45) reports on the effectiveness of our current CBM program. And the graphs of Figure
3-9 (page 46) describe the actual failure behavior of items. They provide clues as to
whether a different maintenance policy or physical modification may act to our
advantage.
All the previous methods help us track the effectiveness - the maintenance outputs - of
past and present policies. They do not predict what would happen in the future if a
maintenance policy were altered. The capacity to perform “what if” analysis on the
future impact of policy changes, would, no doubt, assist the physical asset manager. He
could, thereafter, ask questions of the type, “What will the
downtime/availability/reliability/cost be of my system if I double/triple/halve the
overhaul frequency?” We can perform decision analyses such as these by building and
running a model. In this chapter we examine the powerful modeling technique known as
Monte Carlo Simulation.
Assume that we have operated a simple item over a number of years and recorded its failure
and installation events in our CMMS. We note from these records that the
average life (MTTF) was 0.5 years. We observed the average repair time (MTTR) to be
10 days (0.0274 years) and that the actual repair time was normally distributed with a
standard deviation of 10% of the mean (0.00274 years). We desire, at this time, to predict the
maintenance performance for this item over the next two years under a variety of alternative
policies and conditions.
93
Monte Carlo Simulation software available from Clockwork Solutions, www.clockworksolutions.com
This exercise was compiled by Naaman Gurvitz of Clockwork.
Figure 5-1 The reliability block diagram for a single line replaceable unit (LRU) named "SGN"
Figure 5-1 presents the simplest of reliability block diagrams containing a single line
replaceable unit.
Failure behaviors
As a hypothetical set of cases for our examination, we will assume 4 possible failure
distributions for the single LRU of Figure 5-1: an exponential distribution, and Weibull
distributions with shape parameters (β) of 1.5, 2.5, and 3.5. An exponential distribution’s
single parameter is the item’s MTTF, which in this case is 0.5 years. For the three Weibull
distributions, we may calculate the second (scale) parameter, λ, using the equation:
$$\lambda = \left[ \frac{\Gamma(1 + 1/\beta)}{\mathrm{MTTF}} \right]^{\beta}$$
Equation 5-1
where Γ is the gamma function94 and MTTF = 0.5 years. Equation 5-1 yields the following
values for the Weibull scale parameter, λ:
λ β
2.4230 1.5
4.2035 2.5
7.82445 3.5
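The scale parameters can be checked with a few lines of Python. The sketch assumes that Equation 5-1 reads λ = [Γ(1 + 1/β) / MTTF]^β, which corresponds to a reliability function R(t) = exp(−λ t^β). The function names are ours. Under this reading the computed values agree with those tabulated above to within a fraction of a percent. Small differences are to be expected if the tabulated values were derived from a rounded gamma table.

```python
import math

def weibull_lambda(mttf: float, beta: float) -> float:
    """Scale parameter lambda for R(t) = exp(-lambda * t**beta), given the MTTF."""
    return (math.gamma(1.0 + 1.0 / beta) / mttf) ** beta

def weibull_mttf(lam: float, beta: float) -> float:
    """Inverse relation: the MTTF implied by lambda and beta."""
    return math.gamma(1.0 + 1.0 / beta) / lam ** (1.0 / beta)

MTTF = 0.5  # years
for beta in (1.5, 2.5, 3.5):
    lam = weibull_lambda(MTTF, beta)
    # Round trip: the implied MTTF should come back as 0.5 years.
    print(f"beta = {beta}: lambda = {lam:.4f}, implied MTTF = {weibull_mttf(lam, beta):.4f}")
```

For β = 1 the same formula gives λ = 1/MTTF = 2 failures per year, which is the exponential case.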
We can now enter, into the SPAR™ program, the parameters of the 4 failure
distributions, and the parameters of the normal distribution of repair time (a mean of 0.0274 years
and a standard deviation of 0.00274 years). We specify a service time observation window of 2
years and run the program.
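To give a feel for what such a simulation does, the following Python sketch runs a bare-bones Monte Carlo model of the same single item under perfect repair. It is not the SPAR calculation engine and makes no claim to reproduce its results. The function name, the number of runs, and the treatment of the end of the observation window are our assumptions. Failure times are drawn from R(t) = exp(−λ t^β), and the exponential case is β = 1 with λ = 1/MTTF = 2.

```python
import random

def simulate_perfect_repair(lam, beta, mttr=0.0274, repair_sd=0.00274,
                            window=2.0, runs=4000, seed=0):
    """Estimate mean failures, mean downtime and availability over the window.

    After each repair the item is treated as new (perfect repair).
    All times are in years.
    """
    rng = random.Random(seed)
    scale = lam ** (-1.0 / beta)  # scale in the (t/scale)**beta convention
    total_failures = 0
    total_downtime = 0.0
    for _ in range(runs):
        t = 0.0
        while True:
            t += rng.weibullvariate(scale, beta)  # time to next failure
            if t >= window:
                break
            total_failures += 1
            repair = max(0.0, rng.gauss(mttr, repair_sd))
            total_downtime += min(repair, window - t)  # count only downtime inside the window
            t += repair
    mean_failures = total_failures / runs
    mean_downtime = total_downtime / runs
    availability = 1.0 - mean_downtime / window
    return mean_failures, mean_downtime, availability

failures, downtime, availability = simulate_perfect_repair(lam=2.0, beta=1.0)
print(f"exponential: {failures:.2f} failures, {downtime:.3f} years down, "
      f"availability {availability:.3f}")
```

Calling the function again with the Weibull parameters tabulated above gives the corresponding estimates for the other three distributions.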
Figure 5-2 Graphs of predicted availability over 2 years for each of the 4 distributions
94
The value of the gamma function Γ(x) for any x may be looked up in a table, in the same way that
values of, for example, sin(x) are looked up in trigonometric tables.
Figure 5-3 Predicted average downtime over 2 years for each of 4 distributions
Figure 5-4 Predicted number of failures in a two year period for each of 4 failure distributions
Remarks
We may conclude that it is technically feasible (knowing the failure and repair
distributions) to analyze and predict maintenance performance. At this point we
increase the level of realism one notch by considering policies where repair effectiveness
will be less than “perfect”.
Repair effectiveness
We define “repair effectiveness” as a reduction in age. Following a perfect repair we
would “reset” a component’s age to zero. That is, age conservation for a 100% effective
maintenance action is “0”. If the repair is imperfect we use the SPAR program’s bubble
logic to instruct the calculation engine to conserve a portion of the item’s age after repair.
Assume, for example, that a “minimal” repair will actually conserve 99% of an item’s
age95. We enter this information into SPAR using its Bubble Logic generator tool. SPAR
then generates the following Dynamic Logical Sentence (DLS):
At Collision
START DLS (1)
Comment: Setting age upon repair
1.1 If LRU 1 in current system is repaired now
Set age of LRU 1 in current system to .99*age at last failure
1.1 End Of If
END DLS (1)
The DLS tells the calculation engine to treat repair as “minimal”. We run the analysis
once again. This time, however, the predictive results will account for the minimal nature
of the repair. We refer to such repairs as “as bad as old”. Compare the results of the
following graphs (Figure 5-5, Figure 5-6, and Figure 5-7) to the previous ones (Figure
5-2, Figure 5-3, and Figure 5-4 ) where a perfect repair policy was assumed.
Figure 5-5 Predicted availability under a minimal (“as bad as old”) repair policy
95
For example, to get the equipment back into production quickly, the policy may be to replace only the
failed component(s), leaving the others in the unit to continue aging.
Figure 5-6 Predicted downtime under a minimal (“as bad as old”) repair policy
Figure 5-7 Predicted number of failures under a minimal (“as bad as old”) repair policy
We note that the repair policy "as bad as old" leads to lower system performance than in
the "as good as new" case. This is expected. However, it is not true (comparing the blue
lines and bars of each set of graphs) for the case of an exponential failure distribution.
That is because the exponential distribution is "ageless"; a unit whose failure distribution
is exponential is always as good as new! At this point we ratchet up the level of realism
another notch by adding preventive maintenance (periodic overhauls) to our maintenance
policy for this item.
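The effect of age conservation, including the “ageless” behavior of the exponential distribution, can be reproduced with a small Python sketch. It is our own illustration and not SPAR's engine. We assume R(t) = exp(−λ t^β), draw the time to the next failure conditionally on the item's current age, and set the age after each repair to the conserved fraction of the age at failure. The function names and the number of runs are assumptions.

```python
import math
import random

def next_failure_interval(age, lam, beta, rng):
    """Sample the time to the next failure for an item that has survived to `age`.

    Solves R(age + x) / R(age) = u for x, with R(t) = exp(-lam * t**beta).
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    x = (age ** beta - math.log(u) / lam) ** (1.0 / beta) - age
    return max(0.0, x)      # guard against tiny negative rounding errors

def mean_failures(lam, beta, conservation, window=2.0,
                  mttr=0.0274, repair_sd=0.00274, runs=2000, seed=0):
    """Average number of failures in the window for a given age conservation.

    conservation = 0.0 is a perfect repair; 0.99 is the minimal repair of the text.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        t, age = 0.0, 0.0
        while True:
            x = next_failure_interval(age, lam, beta, rng)
            t += x
            if t >= window:
                break
            total += 1
            age = conservation * (age + x)  # age retained after the repair
            t += max(0.0, rng.gauss(mttr, repair_sd))
    return total / runs

for label, lam, beta in (("exponential", 2.0, 1.0), ("Weibull 2.5", 4.2035, 2.5)):
    perfect = mean_failures(lam, beta, 0.0)
    minimal = mean_failures(lam, beta, 0.99)
    print(f"{label}: perfect repair {perfect:.2f}, minimal repair {minimal:.2f} failures")
```

For β = 1 the sampled interval does not depend on the current age at all, which is why the two repair policies give essentially the same result in the exponential case.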
current minor repair policy a proposed preventive maintenance schedule. We do this by
using SPAR’s Input Generator tool.
Through a series of dialogs, we modify the current project, by telling SPAR to apply PM
periodically at 6 month intervals. We also indicate to SPAR that the PM duration is 14
days (0.0384 years). By default, the PM is considered to apply zero age conservation,
which is what we want. As previously, we run the program and generate the maintenance
performance prediction graphs of Figure 5-8, Figure 5-9, and Figure 5-10.
Figure 5-8 Time Dependent Availability for Weibull β=2.5 distribution, (a) Perfect Repair, (b)
Minimal Repair, (c) Minimal repair and Periodic Maintenance
Figure 5-9 Average Downtime for Weibull β=2.5 distribution, (a) Perfect Repair, (b) Minimal Repair,
(c) Minimal repair and Periodic Maintenance
Figure 5-10 Number of Failures for Weibull β=2.5 distribution, (a) Perfect Repair, (b) Minimal
Repair, (c) Minimal repair and Periodic Maintenance
Optimizing PM
It is usual to define an optimal PM policy as one that minimizes lifecycle cost. Lifecycle
cost would include the cost of lost production due to failure and maintenance. We set up
the variables of our optimization problem as follows:
Variable      Definition
Td            total down time (due to either PM or failure) of the system
Cd            cost of downtime per unit time (i.e. production loss)
Nf            number of failures
Cf            cost per failure (not including downtime but only fixed costs such as man-hours, spare parts and so on)
Nm            number of preventive maintenance operations
Cm            cost of a maintenance operation (not including downtime)
Total Cost    Cost = Cd * Td + Cf * Nf + Cm * Nm
We proceed to determine the optimal maintenance strategy for, say, the case of the
Weibull failure distribution with shape factor = 2.5 and an “as bad as old” repair policy.
Three possible maintenance strategies are:
1. No maintenance.
2. Preventive maintenance every 6 months.
3. Preventive maintenance every 3 months.
The cases of no maintenance and maintenance every 6 months have already been run. We
easily run another case with maintenance every 3 months. Then we have SPAR display the
comparative results graphs of Figure 5-11 and Figure 5-12.
Figure 5-11 Average Downtime for Weibull β=2.5 distribution with Minimal Repair and: 1. No
Maintenance, 2 Maintenance Every Six Months, and 3. Maintenance Every Three Months
Figure 5-12 Average number of failures for Weibull β=2.5 distribution with Minimal Repair and: 1.
No Maintenance, 2. Maintenance Every Six Months, and 3. Maintenance Every Three Months
Using these results we set up the following spreadsheet calculating cost as Cost = Cd *
Td + Cf * Nf + Cm * Nm:
On the lower row of this spreadsheet we have applied the following values for this
exercise:
Variable    Definition                                                          Value
Cd          cost of downtime per unit time (i.e. production loss)               $0.10
Cf          cost per failure (not including downtime but only fixed costs       $10
            such as man-hours, spare parts and so on)
Cm          cost of a maintenance operation (not including downtime)            $1
We enter the downtimes (from Figure 5-11) and the number of failures (from Figure
5-12) into the spreadsheet. The number of PM events for each case (0, 3, and 7) is
calculated by hand. (For example, PMs at 3 month intervals fall at months 3, 6, and so on up
to month 21 of the 24 month window, which gives 7.) We conclude that the most cost effective
policy of the three alternatives is to perform preventive maintenance every 3 months.
However, a change in the relative costs of failures versus those of maintenance versus those
of lost production during downtime will likely change the best policy.
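The spreadsheet arithmetic is simple enough to express directly. The Python sketch below shows the cost formula and the count of PM events within the observation window. The function names are ours, and we assume that a PM falling exactly at the end of the window is not counted, which reproduces the counts of 0, 3, and 7 used above. The downtime and failure figures are not repeated here. They would be read from Figure 5-11 and Figure 5-12.

```python
import math
from typing import Optional

def pm_count(window_months: int, interval_months: Optional[int] = None) -> int:
    """Number of PM events falling strictly inside the observation window."""
    if interval_months is None:  # no preventive maintenance
        return 0
    return math.ceil(window_months / interval_months) - 1

def total_cost(cd: float, td: float, cf: float, nf: float, cm: float, nm: float) -> float:
    """Cost = Cd * Td + Cf * Nf + Cm * Nm."""
    return cd * td + cf * nf + cm * nm

# PM counts for the three strategies over a 24 month window.
print([pm_count(24, interval) for interval in (None, 6, 3)])  # [0, 3, 7]
```

Once the simulated downtime Td and number of failures Nf of each strategy are known, calling total_cost with Cd = 0.10, Cf = 10, Cm = 1 and the corresponding PM count ranks the three strategies. Td must be expressed in the same time unit as Cd.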
Part 2. Condition Based Maintenance
On-condition inspections, which make it possible to preempt functional failures by
potential failures, are the most effective tool of preventive maintenance – Nowlan and
Heap, Reliability-centered Maintenance.
Introduction
Most courses and books on CBM (also known as “Predictive maintenance”, “On-
condition maintenance”, and “Condition monitoring”) focus much attention on the
“technology” of acquiring and manipulating condition monitoring data. CBM hardware
and software providers provide excellent training to their customers in the efficient use of
their products and services. This book, on the other hand, explores the informational
processes underlying the technology of condition based maintenance.
We seek to perform CBM tasks that are applicable (feasible and practical) and effective
(accomplish the intended objective). CBM may, in a sense, be thought of as the most
“noble” or preferred form of maintenance for these reasons: Wherever applicable and
effective, CBM is:
1) the least intrusive,
2) the least expensive, and
3) the least tolerant of failure.
The third of these points requires some explanation. We perform time-based (preventive)
maintenance at a time prior to the age at which we expect the item to fail; in other words,
at an age to which most items of the kind in question survive (see Figure 3-2 on page 35).
By definition then, we expect that some items will fail prior to the scheduled preventive
renewal. We are prepared, therefore, to tolerate a relatively small number of failures.
CBM, on the other hand, is designed to intervene at the point of potential failure. Figure
6-1 illustrates CBM theory.
Figure 6-1: CBM theory
Figure 6-1 describes the assumptions upon which CBM is based. They are:
Whenever a CBM task fails to accomplish its objective, it means that we have overlooked
or misjudged one or more of these assumptions. In such cases CBM is said to be
ineffective. Despite extensive application of technology and labor, many sophisticated
CBM programs deliver negligible net benefits. On the other hand, a great number of
simple, inexpensive CBM inspection programs reap enormous benefits. Why do these
benefits often fail to scale up with added technology, as we would expect them to?
We will explore this issue in subsequent sections.
Why do CBM?
Intuitively, condition based maintenance would seem almost universally desirable. If it
can detect an impending failure, thereby allowing us to react quickly enough to prevent
the failure or to avoid its dire consequences, why not do as much CBM as possible?
Figure 6-2 displays the RCM process with which we may analyze the applicability and
effectiveness of any pro-active maintenance task.
Figure 6-2: The reliability-centered maintenance process
At the top of Figure 6-2 we note the two global intellectual activities that characterize all
human progress: A) analysis, followed by B) a decision to act based upon that analysis.
Reliability-centered maintenance (RCM) emerged in the 1980s as a maintenance
analysis and decision process of great power. RCM frames the maintenance analysis and
decision process in seven questions (listed on page 16). Question 1 identifies each of
the item’s performance requirements. Question 2 lists the functional failures associated
with each performance requirement. Question 3 enumerates every reasonably likely
cause (called a failure mode) of each functional failure. In Question 4 (effects analysis),
we express the scenario of noteworthy events touched off by the failure mode96. By
carefully considering the effects of Question 4, we may respond to Question 5 to
determine whether the consequences of the failure are: 1) hidden, 2) safety or
environmental, 3) operational (production related), or 4) non-operational (maintenance
impact only). Question 6 and Question 7 are answered by applying the RCM decision
algorithm (Figure 6-3).
We select the appropriate vertical branch (H, S, O/P, or M) of the decision diagram of
Figure 6-3 depending on the answer to Question 5 (consequences). Most97 of the tasks of
rows 3 and 4 of the decision algorithm designate “default” activities. When no single
applicable and effective pro-active task can be found, the decision algorithm directs us to
perform the default tasks. In Part 3. (page 201) we will exercise the RCM process in
great detail.
96
In the previous chapter on case based reasoning, we saw how effects analysis can be structurally
extended to enable the use of diagnostic algorithms – a specialized application of CBM.
97
The maintenance policy applied to manage a particular failure mode can be “Two or more of above”, a
proactive (not a default) action.
Figure 6-3: The RCM decision algorithm
Once the decision analysis of steps 5, 6, and 7 has been completed, the RCM process is
complete, and we may proceed to the resourcing phase, illustrated by Figure 6-4. During
this implementation phase we set up our CMMS with detailed plans and schedules. We
specify the labor, parts, materials, and skills necessary to execute the set of tasks.
Furthermore, the human resource department provides any necessary manpower and
training.
Figure 6-4: The RCM 7-step process followed by planning, scheduling, and resourcing
The panorama of Figure 6-4 places CBM (and all our maintenance tactics) into a
strategic context. Every policy action in our maintenance program is traceable back to
one or more functional requirements of a physical asset. We have, thus, answered the
question posed by the title of this section, “Why do CBM?”
History of CBM
Physical asset managers attempt to implement policies that maintain the functionality of
machinery and other production assets at a level required by their users, owners, and by
society at large. They select "proactive maintenance" as their first line of defense against
the causes of equipment failure. By applying routine inspection (condition based
maintenance aka CBM, on-condition maintenance, and predictive maintenance) or
periodic renewal (preventive maintenance aka PM, scheduled overhaul), or redesign,
they seek to avoid the consequences of failure. Of the three tactics they prefer to consider
CBM first, because it is usually less expensive and less intrusive. Although data is
plentiful and can be collected and processed in every situation, CBM is appropriate only
when it is both applicable (technically feasible) and effective (economically justifiable).
Applicability implies a non-ambiguous indicator of failure initiation and sufficient time
to proact.
Preventive maintenance is the routine renewal of physical assets or their components.
Condition based maintenance is the routine inspection of a physical asset to determine
whether a failure process is underway. If failure has begun, the goal is to take an action
which will somehow avoid or reduce the consequences of failure. If the remedial action
(for example a cleaning or adjustment) can be performed on the spot, at the time of the
inspection, most companies consider the inspection activity as belonging to their
preventive maintenance (PM) program98.
Commercialization of CBM coincided with the dawn of the "information age" and CBM
took on a new "flavor". Technology entrepreneurs conjectured that, if simple physical
measurements, such as vibration amplitude or oil viscosity, could provide such useful
benefits, then collecting the data in computers and trending it over time would, likely,
provide a far deeper insight into the state of a machine's health. Hence the 1980s and
1990s witnessed a soaring rise in the use of computers, software, and data collectors in
maintenance shops throughout the industrial world.
In reality, even in the midst of impressive information technology growth, most
day-to-day CBM success stories still derive from the basic application of the original,
uncomplicated form of CBM. For example, the detection of unbalance in a rotating
machine, of glycol or fuel in an engine oil, or of mechanical looseness, soft foot, or shaft
misalignment seldom requires the degree of sophistication (and related expense) of the
variety of technology bells and whistles happily proffered by the CBM industry.99
At the same time (as the growth of CBM), the information technology revolution
impacted another part of physical asset management - the computerized control of
maintenance materials, labor, and historical records. These products became known as
computerized maintenance management systems (CMMS). There was, however, a
striking difference between the CBM and CMMS approaches.
While CBM technology vendors required their clients to adhere to highly structured
procedures for data collection and storage, CMMS vendors, on the other hand, hailed the
concept of 'flexibility' and emphasized their products' "ease of adaptation" to their clients'
98 Rather than to their CBM program. “PM” is being used here interchangeably with “TBM” (time based
maintenance).
99 Nevertheless, the CBM technology vendors offer powerful hardware and software that, when applied
effectively, meet the objectives of CBM.
existing business processes. As a consequence of their much vaunted "user friendliness"
no common practices of data classification gathered sufficient critical mass to achieve
standardization - not even within a given organization, let alone in an industry, or in the
physical asset management community at large.
It is in this context that the new millennium, the age of connectivity, finds the state of
maintenance information. Maintenance technology vendors are poised to inject the latest
generation of "integration technology" into their traditional market. But the lack of a
common data model impedes smooth penetration.
Hence we may foretell the day when disparate production and physical asset management
systems will communicate seamlessly thanks to MIMOSA and other standardized
information protocols such as OSA-CBM (Open System Architecture for Condition-Based
Maintenance), STEP (Standard for the Exchange of Product model data), OPC (formerly OLE
for Process Control), OAG (Open Applications Group), and others.
Connectivity to this degree of intimacy implies that process and maintenance information
from multiple platforms will materialize in a universally accessible format (CRIS) and, in
that homogenized form, may be intelligently processed for optimum decision making.
Optimization seeks to achieve some objective: the lowest average cost of maintenance,
highest asset availability, or a specified effective reliability. It is onto this stage that the
"CBM Optimizing Intelligent Agent" enters.
What does the future have in store for CBM? The CBM process consists of three
sub-processes: data acquisition, signal processing, and decision making. Data acquisition is
already highly technologically advanced. "Signal processing" in CBM filters operational
and environmental information out of the data, so that what is left is a "condition
indicator" that reflects the degree of deterioration of some targeted failure mode. New
signal processing methodologies based on a variety of disciplines (wavelet analysis,
principal component analysis, inference engines, and neural net classifiers, to name a few)
are being developed in research institutions and universities around the world. (Chapter 7
describes a few such techniques.) Their effect will be to make it technically feasible to
track and manage ever increasing numbers of failure modes.
Chapter 7. Anatomy of CBM
Having understood, from Figure 6-4 (page 87), that the decision to perform CBM flows
from a fundamental analysis of the physical asset’s maintenance requirements, we turn
our attention to the composition of a CBM task. We keep the over-riding concerns in
mind. That is, we elect to conduct only applicable and effective CBM procedures. Figure
7-1 portrays three distinct CBM sub-processes, each of which must satisfy the
applicability and effectiveness criteria in order for CBM to add value to a maintenance
program.
Data Acquisition
Data acquisition is the first and, one might assert, the easiest of the three CBM sub-
processes to implement. Assisted by advanced sensor, signal transmission, and storage
technologies, we can, without too much effort, implement systems that collect and store
impressive amounts of data. The predictive maintenance industry has organized100 to
provide communication standards and protocols endowing their products with
unprecedented capability to share process and condition monitoring data. Because
commercial-off-the-shelf (COTS) data acquisition hardware and software products can be
used across a range of industries, data acquisition enjoys more commercial exposure than
do the other sub-processes of CBM. Some maintenance technology consumers imagine
that, once they set up elaborate data acquisition, storage, and display systems, they will
have overcome the major hurdle to effective CBM. Some pay scant attention to the
choice of the data they decide to collect, adopting a when-in-doubt-collect-it-anyway-it-
might-be-useful attitude. Their data choices are influenced largely by the capabilities of
the technology rather than by a pre-assessment of how well the collected data will reflect
an evolving failure mode.
By way of illustration, there are two important reasons why bearings fail:
• Contaminants in the bearing oil – water being the dominant one
Tradespersons and operators make these types of observations routinely. Sometimes, they
take appropriate corrective action. Seldom, however, does the observation, or the failure
mode104 discovered as a result of it, appear methodically as a record in the
maintenance history database. Invaluable sources of reliability data such as these elude
most maintenance information record keeping processes. Rather, those historical records
contain, mainly, descriptions of maintenance activities performed, without reference to
the conditions that inspired those actions. The McNally Institute goes on to enumerate
the possible causes of the elevated temperatures in the stuffing box as:
101 http://www.mcnallyinstitute.com/CDweb/p-html/p027.htm
102 Lubricating oil has a useful life of thirty years at thirty degrees centigrade (86°F) and its life is cut in
half for every ten degree centigrade (18°F) increase in temperature. We may assume the temperature in the
bearing is at least ten degrees centigrade (18°F) higher than the oil sump temperature. At elevated
temperatures the oil will carbonize by first forming a "varnish like" film that will turn into a hard black
coke at these higher temperatures. It is these formed solids that will destroy the bearing.
103 For example, overheated stainless steel turns straw yellow, brown, blue and black at respective
temperatures of approximately 400, 500, 600, and 650 degrees Celsius.
104 The opposite side of the coin. The five knowledge elements (page 15) will neatly express these
observations in a work order record of the CMMS.
• Loss of circulation in the stuffing box cooling jacket.
• Loss of cooling in the bearing case cooling sump.
• Something is cooling the outside of the bearing casing causing the outside
diameter of the bearing to shrink, increasing the load.
• The bearing was installed incorrectly.
• The bearing is over lubricated. The oil level is too high or there is too much
grease in the bearing.
• The lubricating oil is contaminated with water.
• The shaft is overloaded because the pump is operating away from its B.E.P. (best
efficiency point).
• There is too much axial thrust on the shaft.
• Misalignment, unbalance, etc.
Oil sampling will indicate the following conditions that are a prelude to (or an indication
of) serious failure.
By monitoring pump suction and discharge pressure in concert with product flow and
motor amperage, the following failure modes may be detected:
Most failure modes occur randomly rather than by a wearing out of a component. For
example, were wear the dominant failure mode in bearings, they would, on the average,
survive 50 or even 100 years. But, industrial bearings undergo accelerated wear initiated
by randomly occurring internal or environmental events, for example a shock load,
excessive heat, or water ingress causing lubricant failure. Bearing life is, in addition,
highly influenced by initial conditions, for example, how it was stored and handled prior
to installation, and how it was installed.
Randomness being the rule rather than the exception, is it reasonable for us to assume
that we will usually find a monotonically rising trend of some monitored variable
throughout a component’s lifecycle, from which we may predict its failure? A more
reasonable approach to CBM would be to monitor the equipment and its operating
context for signs of conditions causing abnormal stress that, if allowed to persist, will be
destructive. Doctors monitor cholesterol to determine whether our arteries are in danger of
clogging. At a certain level, they order a corrective action, usually a change in lifestyle.
Maintainers monitor oil levels to avoid the consequences of over- or under-lubrication.
Vibration analysts detect foundation weakness, shaft misalignment, or
rotor imbalance that, if uncorrected, will lead to serious failure.
These examples illustrate that CBM is a viable maintenance strategy for avoiding failure
altogether. Yet CBM can also track and predict some failure modes from some point in
time after their random initiation to their ultimate functional failure. It has been
estimated105 that twenty percent of failure modes proceed in a predictable enough manner
following their detection (their potential failure) that a repair action may be planned and
executed prior to the loss of asset functionality. A spalled bearing, for example, emits
bearing tones that can be detected automatically through processing of the spectral data
assisted by cepstrum analysis. The bearing may continue to operate adequately from this
point for several months prior to a failure that would render it non-functional.
It seems, then, from the preceding that there are two classes of CBM:
In either situation, CBM is said to be effective, as long as the consequences of failure are
reduced (or avoided entirely) at an acceptable cost. In the case of the first CBM class,
and, pursuing our example of a centrifugal pump, we might notice a rising trend in the
temperature of the stuffing box. If it gets too hot, we are going to have a problem. We had
better correct the condition if we do not want to experience a premature (random) seal
failure. The McNally Institute describes the following seal failure modes that will be
provoked by excessive stuffing box temperatures:
• The product can change its state, ceasing to act as a lubricant and
partially transforming into a destructive solid.
• The product can vaporize, expand and blow the seal faces open leaving solids
between the faces.
• The product can become viscous, interfering with the free movement of the
springs and bellows.
• The product can become an adherent, gluing the lapped faces together or making
the moveable components inoperable.
• The product can crystallize interfering with the moving parts of the seal.
105 Moubray, J., Reliability-centered Maintenance, 2nd Ed., Butterworth, 1999.
106 We will learn in Chapter 10 (page 113) that these two classes of CBM are characterized by two types of
CM variables: 1) internal variables that reflect the state of the asset with respect to its deterioration due to
a failure mode, and 2) external variables that measure the level of stress that influences the probability that
a failure will occur. A CBM decision model may incorporate either or both types of variables.
• Excessive heat can cause the product to build a film on the faces (hot oil as an
example) impeding sliding of the components and making them inoperable.
• Corrosion increases with increasing temperatures.
• Thermal expansion may cause seal faces to go out of flat, loosening of pressed-in
carbon faces in their holder, and sticking of the bellows’ vibration dampers to the
shaft sleeve and opening the faces.
• Heat can damage the faces of the plated materials and filled carbon face types.
• Expansion of air pockets in some carbon faces can cause pits in the lapped faces.
• High heat levels can cause elastomers to experience compression set problems,
resulting in leakage or in some cases complete failure.
When monitoring temperature and pressure in the stuffing box area we will note these
changes. Then, by applying our knowledge based rules, we will have adequate time to
react before seal failure occurs. Knowledge based rules form our CBM policy. Without a
CBM policy, regardless of the number of sensors scattered throughout our process, the
amount of data storage capacity, or the sophistication of the software “shell”, our CBM
program will ultimately prove ineffective.
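The notion of knowledge based rules acting on stuffing box readings can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the units, the three alert limits, and the wording of the recommended actions are hypothetical and would, in practice, come from knowledge of the pumped product and the seal design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    """One stuffing box observation (assumed units: deg C and kPa)."""
    temperature_c: float
    pressure_kpa: float


# Hypothetical alert limits, for illustration only.
TEMP_ALERT_C = 80.0        # absolute temperature limit
TEMP_RISE_ALERT_C = 5.0    # allowed rise between consecutive inspections
PRESSURE_LOW_KPA = 150.0   # minimum stuffing box pressure


def evaluate(previous: Reading, current: Reading) -> List[str]:
    """Apply simple knowledge based rules and return the recommended actions."""
    actions = []
    if current.temperature_c >= TEMP_ALERT_C:
        actions.append("check cooling jacket circulation")
    if current.temperature_c - previous.temperature_c >= TEMP_RISE_ALERT_C:
        actions.append("investigate rising temperature trend")
    if current.pressure_kpa < PRESSURE_LOW_KPA:
        actions.append("check stuffing box pressure")
    return actions


if __name__ == "__main__":
    print(evaluate(Reading(70.0, 200.0), Reading(82.0, 200.0)))
```

The point of the sketch is that the rules, not the sensors or the storage, constitute the CBM policy: the same readings lead to no action at all if no rule has been written for them.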
Besides data acquisition, two additional sub-processes challenge our ingenuity prior to
implementing applicable and effective CBM.
Signal Processing
Signal processing in CBM is the filtering out, from the acquired data, of all information that
pertains to the operation of the asset and its environment. In other words, the processed
signal should not reflect changes in load or operational conditions, but should react only
to real changes in asset health, with respect to the deterioration by a failure mode that we
are targeting with the CBM task. A variety of signal processing techniques have been
(and continue to be) developed by industry and academic research organizations. We
sometimes refer to signal processing, particularly in vibration analysis, as feature
extraction. We process a raw time waveform signal (using an algorithm) in order to
extract one or more features (condition indicators) that measure the evolution of
particular conditions affecting or occurring in our physical asset. Figure 7-2, Figure 7-3,
Figure 7-4, and Figure 7-5 illustrate a small sample of the wide diversity of CBM signal
processing techniques addressing specific failure modes.
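One very simple way to picture the removal of operational information is to fit the normal relationship between a measurement and the load on a healthy machine, and then to track only the deviation from that relationship. The sketch below is an illustration under stated assumptions, not one of the techniques shown in the figures: it assumes a linear dependence of vibration on load, and the function names and sample numbers are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def condition_indicator(load, vibration, baseline_load, baseline_vibration):
    """Residual vibration after removing the load effect learned on healthy data."""
    a, b = fit_line(baseline_load, baseline_vibration)
    return [v - (a + b * l) for l, v in zip(load, vibration)]


if __name__ == "__main__":
    # Hypothetical healthy-machine baseline: vibration rises with load.
    base_load = [50.0, 60.0, 70.0, 80.0]
    base_vib = [2.0, 2.4, 2.8, 3.2]
    # New readings: the second one is higher than its load alone explains.
    print(condition_indicator([55.0, 75.0], [2.2, 3.6], base_load, base_vib))
```

A residual near zero means the reading is explained by load alone; a persistent positive residual is a candidate condition indicator for the targeted failure mode.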
Figure 7-2: Stress Wave Analysis, www.swantech.com
Stress wave analysis, illustrated in Figure 7-2, tracks the failure modes associated with
roller and groove damage.
Figure 7-4: Petri-nets for monitoring a manufacturing process.
Figure 7-4 illustrates the graphical language of Petri-nets used to simulate manufacturing
processes in an integrated circuit chip. Deviations from expected timing of activities may
be tracked and related to specific modes of failure.
Figure 7-6: Continuous oil analysis and treatment system (www.thermal-lube.com)
Figure 7-6 illustrates that an effective CBM system may act as one half of an automatic
control loop. Although most CBM programs operate in a manual control loop by
directing a maintenance renewal task, the continuous oil analysis and treatment (COAT)
system uses CBM condition data in an automatic control system. First it extracts features
from a lubrication or cooling fluid’s infrared signature. The arrow on the left of Figure
7-6 represents the signal processing algorithm that extracts the current additive level from
the infrared spectrum. The additive level then can be tracked and trended in time. Other
extracted features (i.e. condition indicators such as oxidation, additive content, and
contamination) can be used similarly. In this case Figure 7-6 portrays the automated
replenishment of depleted oil additives.
Figure 7-6 raises a question about the CBM role of oil analysis. Oil analysis laboratory
services constitute a significant part of the CBM technology industry. Lubricant suppliers
often subsidize those services as an important component of their marketing plan.
Publicity for these programs focuses on extending lubricant change intervals. Although
it is laudable for a supplier to attempt to reduce clients’ product consumption, this point of
view may be misleading. The emphasis on reducing oil consumption diverts attention
from the essential purpose of a CBM task – to reduce or entirely avoid the consequences
of equipment failure. The cost of lubricant usually pales in comparison with the cost of a
major asset failure. Furthermore, cost alone cannot measure the hidden, environmental,
and safety related consequences of failure. When lubricant additive drops at an abnormal
rate, the user more rightly concerns himself with the mechanical failure mode or process
fault whose effects include abnormal additive depletion. Some lubricant vendors and
their sales persons have not yet embraced this viewpoint. They continue to stress reduced
lubricant consumption in promoting their CBM service offerings. In the introduction to
Part 1 (page 13) we offered some explanation for this point of view.
There are as many signal processing techniques as there are different physical
applications. Chapter 11, CBM Decision Making with Expert Systems (page 152), describes
several practical techniques for the extraction of vibration features to be processed by a
rule based expert system. Chapter 13, A survey of signal processing and
decision technologies for CBM (page 177), presents a broad review of the technical
literature.
Decision Making
Decision making represents the final, and often overlooked, CBM sub-process. After
collecting, processing, and storing the current set of condition data, the maintenance
planner, manager, or engineer decides whether an intervention at this point in time is
“optimal”. Figure 3-1 (page 33) illustrates the complexity of factors that will affect his
decision. He desires to make that decision, as far as is possible, in a methodical
manner that will bear scrutiny with respect to the objective of the organization and the
current operation of the asset – a tall order. For CBM to render effective service, we
must apply the same degree of rigor to this decision making step as we have done to the data
acquisition and signal processing steps. The CBM Laboratory at the University of Toronto has
created EXAKT, a CBM decision software tool.
Figure 7-8: EXAKT decision tool
Figure 7-8 describes how the EXAKT software decision tool may be used. The top left of
Figure 7-8 shows a graphical representation of the software’s output. The vertical axis
measures the weighted sum of risk factors found significant by the software’s
proportional hazard model. The horizontal axis indicates the item’s working age. A
point on the graph represents current asset condition with respect to one or more failure
modes included in the model. If it falls in the green (bottom left) region, the optimal
decision model recommends no action; if in the yellow (light strip), preventive action
should take place prior to the next CBM inspection; if in the red (dark region in top
right), take immediate preventive action.
Note two important characteristics of the decision graphical output: 1) the condition
indicator exhibits considerable random fluctuation, and 2) the boundaries between safe
and critical operation vary with working age. Signal processing has not produced a
monotonically increasing condition indicator. This is a common situation encountered in
CBM. Signal processing has not fully accounted for, and therefore has not filtered out, random
operational or environmental factors. Secondly, the varying limit boundary tells us that
EXAKT has determined that, in addition to the sum of the weighted monitored condition
indicators, the item’s working age also strongly influences its risk of failure.
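Why a limit on the weighted sum should vary with working age can be seen from a small sketch of a Weibull proportional hazard model, in which the hazard is a baseline age term multiplied by the exponential of the weighted sum of condition indicators. The sketch is not EXAKT's algorithm: the parameter values, the indicator names, and the use of a single fixed hazard limit as the boundary are all assumptions made for illustration.

```python
import math


def weighted_sum(gammas, covariates):
    """Risk-weighted sum of the monitored condition indicators (the graph's y-axis)."""
    return sum(g * z for g, z in zip(gammas, covariates))


def hazard(age, covariates, beta, eta, gammas):
    """Weibull proportional hazard: baseline age term times exp(weighted sum)."""
    baseline = (beta / eta) * (age / eta) ** (beta - 1.0)
    return baseline * math.exp(weighted_sum(gammas, covariates))


def limit_on_weighted_sum(age, beta, eta, hazard_limit):
    """Largest weighted sum that keeps the hazard at or below hazard_limit at this age."""
    return math.log(hazard_limit * eta / beta) - (beta - 1.0) * math.log(age / eta)


# Hypothetical parameters, chosen only for illustration.
BETA, ETA = 2.0, 1000.0      # shape > 1 means that risk also grows with age
GAMMAS = [0.05, 0.002]       # weights for two assumed indicators (e.g. iron and lead ppm)
HAZARD_LIMIT = 0.004         # assumed acceptable hazard per unit of working age

if __name__ == "__main__":
    for age in (250.0, 500.0, 1000.0):
        print(age, round(limit_on_weighted_sum(age, BETA, ETA, HAZARD_LIMIT), 3))
```

With a shape parameter greater than 1, the printed limit falls as age increases: an older item reaches the same hazard at a lower value of the weighted sum, which is the behavior of a boundary that slopes downward with working age.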
Suppose that a preventive replacement costs $100, one third of the average cost to
repair the failed item. Then, the decision model can be optimized, by using
this ratio, to adjust the boundaries so that, in the long run, they guide the condition
monitoring data’s interpretation towards achieving the lowest total cost of maintenance.
That is, the model will interpret day-to-day CBM inspection data neither too
conservatively nor too liberally, but will recommend an optimal interpretation which
balances cost and failure probability. Similarly, if maximum availability is the optimizing
objective, then the decision model will use the ratio of the mean-time-to-return-to-service
(MTTR) for the preventive and failure situations, in order to deliver routine decisions that
will support this objective.
The table at the bottom of Figure 7-8 compares the cost of a proposed optimal (EXAKT)
CBM data interpretation policy with that of an existing policy and with that of a
run-to-failure policy. Note that in this example the optimal policy results in a
mean-time-between-replacements (MTBR) of 1781, which is about 45% of that of the current
policy (MTBR = 3944), that is, roughly 55% shorter. We are therefore intervening more often in
order to gain a net decrease to 51.53% of the original (proactive and reactive maintenance) costs.
Preventive actions in the
proposed policy would account for 96.6% of incidences compared to only 20% under the
current policy. CBM decision optimization provides working decision models that may
be used to automate the interpretation of CBM condition monitoring data, in the
achievement of a specified maintenance objective.
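The comparison of policies rests on a simple long-run argument: the cost per unit of working age is the expected cost of one replacement cycle divided by the mean time between replacements. The sketch below applies that formula to the MTBR values and preventive fractions quoted above. The two cost figures are assumptions (the $100 and the 3:1 ratio of the earlier example); the text does not give the cost inputs behind Figure 7-8, so the sketch is not expected to reproduce the 51.53% figure.

```python
def cost_rate(mtbr, preventive_fraction, preventive_cost, failure_cost):
    """Long-run cost per unit of working age for one replacement policy."""
    failure_fraction = 1.0 - preventive_fraction
    cost_per_cycle = (preventive_fraction * preventive_cost
                      + failure_fraction * failure_cost)
    return cost_per_cycle / mtbr


# MTBR values and preventive fractions are those quoted in the text;
# the two costs are assumptions made for illustration.
PREVENTIVE_COST = 100.0
FAILURE_COST = 300.0

current = cost_rate(3944.0, 0.20, PREVENTIVE_COST, FAILURE_COST)
optimal = cost_rate(1781.0, 0.966, PREVENTIVE_COST, FAILURE_COST)

if __name__ == "__main__":
    print(f"current policy: {current:.4f} per unit of working age")
    print(f"optimal policy: {optimal:.4f} per unit of working age")
    print(f"optimal as a share of current: {optimal / current:.1%}")
```

Trying a larger failure cost shows how strongly the saving depends on the ratio of failure cost to preventive cost, which is why that ratio is the key input to the optimization.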
Chapter 8. CBM Fundamentals
The fundamental premise of CBM
1. A clear warning that your equipment has entered a "failing" state, and
2. The warning time is long enough for someone to take action to mitigate the
consequences of failure, and
3. The average cost to perform CBM on an asset is less than the average cost of the
consequences of failure over the long run.
So obvious are these CBM program essentials, that we often gloss over them in our rush
to implement high technology solutions. In a sampling of 100 maintenance
organizations109 over 3 years, all had full blown CBM programs in place that did not
satisfy one or more of the above criteria. Why is this the case? The following sections on
the nature of the maintenance technology industry shed some light on this question.
Assertions:
1. The lower the Mean Time Between Failure (MTBF), the
more frequently you monitor?
2. The more critical, the more frequently you monitor?
109 Survey by the author of participants of the Physical Asset Management Certification course given twice
yearly by the University of Toronto’s Professional Development Center.
110 John Moubray, Aladon RCM practitioner’s course.
Figure 8-1 illustrates how maintenance technology vendors, pre-occupied with explaining
the features of their products, often fail to address more fundamental issues. Users of
CBM frequently wonder how often they should monitor a particular piece of equipment. CBM
technology providers, typically, offer two answers:
Figure 8-1 examines each of these assertions by considering two bearings labeled “A”
and “B”. Most rolling element bearings fail randomly111. Hence their conditional
probabilities of failure are shown to be straight lines (failure pattern F). We will assume
that Bearing A (MTTF of 3.5 years) is half as reliable as Bearing B, whose MTTF is 7
years. It follows that the conditional probabilities of failure of these items are
approximately 1/3.5 and 1/7 per year respectively112. This is indicated by the relative heights of
the two lines representing the two bearings’ risks of failure. Suppose we are told by an
experienced employee that Bearing A begins emitting a rumbling sound and then
invariably fails between two weeks and two months later. Another operator,
describing his experience with Bearing B in a high rotational speed application, tells us
that it issues a distinct whining noise and invariably fails between 2 days and 2 weeks
later. In the case of Bearing A we would reasonably suggest sampling at an interval of 1
week, while for Bearing B a reasonable sampling interval would be 1 day. Comparing
these conclusions with the first assertion, “The lower the Mean Time To Failure (MTTF),
the more frequently you monitor”, we must reject it because we have just deduced that it
is not necessarily true. That is, we have demonstrated a situation where it is appropriate to
monitor a more reliable item (Bearing B) 7 times more frequently than a less reliable item
(Bearing A).
Now, we turn to the second assertion, “The more critical, the more frequently you
monitor”. Let us suppose that we are told that Bearing A is very critical while Bearing B
has a backup system and therefore is far less critical. The sampling intervals deduced
above do not change: the critical Bearing A is still monitored weekly and the less critical
Bearing B daily. Once again, in this particular case,
the assertion has been shown to be false. We conclude, therefore, that neither criticality
nor reliability can be used to determine CBM inspection frequency. Rather, we must
focus our attention on confidently detecting a potential failure and reliably estimating the
PF interval, as discussed next.
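The reasoning about Bearings A and B can be reproduced in a few lines. The sketch assumes one particular rule, namely an inspection interval equal to half of the shortest observed warning period; this rule is an assumption that happens to reproduce the 1 week and 1 day intervals of the example, and the function names are hypothetical.

```python
def conditional_failure_rate(mttf_years):
    """Approximate conditional probability of failure per year for random failures."""
    return 1.0 / mttf_years


def inspection_interval(shortest_pf_days, safety_factor=2.0):
    """Inspection interval as a fraction of the shortest warning period (assumed rule)."""
    return shortest_pf_days / safety_factor


# Values from the example: MTTF in years, shortest warning period in days.
bearings = {
    "A": {"mttf_years": 3.5, "shortest_pf_days": 14.0},
    "B": {"mttf_years": 7.0, "shortest_pf_days": 2.0},
}

if __name__ == "__main__":
    for name, b in bearings.items():
        rate = conditional_failure_rate(b["mttf_years"])
        interval = inspection_interval(b["shortest_pf_days"])
        print(f"Bearing {name}: rate {rate:.3f} per year, inspect every {interval:g} days")
```

Note that the MTTF never enters the interval calculation. The less reliable Bearing A ends up with the longer interval because only the warning period determines how often we must look.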
111 For a discussion of random failure, see Random Failure on page 38.
112 For an explanation of this derivation see the chapter “Reliability Centered Maintenance”.
Estimating the PF Interval
Chapter 9. The Elusive P-F Curve
J. Moubray coined the phrase "P-F interval". He used it to highlight two pre-requisites of
CBM, namely:
• A clear indicator of decreased failure resistance - the potential failure, and
• A reasonably consistent warning period prior to functional failure - the P-F interval
Both these requirements are captured in the well known empirical graph of failure
resistance versus working age (Figure 9-1).
Figure 9-1
The P-F interval is a deceptively simple idea. Deceptive, because it takes for granted that
we have previously defined "P" (the potential failure). Of the two concepts, “P” and “P-
F”, it is the former, however, that poses the greater challenge. Therefore, before
addressing the P-F interval, we need to determine when and how to declare a potential
failure.
Figure 9-1 implies that if we could monitor a condition indicator that tracks the resistance
to failure, then declaring the potential failure level would be an easy matter. Two
stumbling blocks, unfortunately, arise and obstruct our plan. The obstacles to the
implementation of Figure 9-1 are:
Condition monitoring data, on the other hand, is abundant. How may we overcome
obstacles 1 and 2? That is, how may we apply CBM to the numerous physical assets
where condition monitoring data abounds, yet, where few alert limits have been defined?
This (setting of the declaration level of the potential failure) is the problem encountered
by many asset managers deluged with condition monitoring data. The unavoidable
question facing any implementer of a CBM program is where to set the potential
failure. Which indicator, from among many monitored variables, should he select for this
purpose? At what level? When the physics of the situation are not well known (as is often
the case), a “policy” for declaring a potential failure is far from obvious.
Why does Figure 9-1 stubbornly elude our grasp? The reason is that this graph is often
not 2-dimensional, but multi-dimensional. There is one dimension for each significant
risk factor. The curve of Figure 9-1, therefore, loses its simple geometrical form.
This is where software comes to the rescue.
EXAKT summarizes the risk factors associated with working age and monitored
variables and creates a new kind of graph by transforming the significant risk information
onto a 2-dimensional optimal decision graph. Dr. Dragan Banjevic, CBM Lab director,
captured the multi-dimensionality of Figure 9-1 in two ways. First, he combined the
significant monitored variables (other than age) into a risk-weighted sum. That became
the y-axis. Then he transformed the age-related risk factor into the shape of the limit
boundary. One 2-dimensional graph, Figure 9-2, shows all aspects influencing risk. They
include economic factors as well as the failure probability associated with each significant
variable.
Figure 9-2
EXAKT handles the probabilistic nature of P and the P-F interval properly. EXAKT does
not assume a deterministic113 P or P-F interval. Instead it draws (from historical records)
a probabilistic relationship among all significant factors (including working age). It uses
that relationship to estimate the remaining useful life at any given moment. One of the
benefits of this approach is the ability to deal with noisy data, illustrated in Figure 9-3.
On the left side of Figure 9-3 are 3 examples of ideal data. Note how the monitored
113 That is, it recognizes that a potential failure and the ensuing functional failure tend to occur randomly
according to some probability distribution.
values increase monotonically, with the red alarm set conveniently to the potential failure
declaration level. Unfortunately condition monitoring data seldom looks like this.
On the right side of Figure 9-3 is data from the nasty real world. It contains random
fluctuations and trends that contradict one another. In other words, the usual situation!
EXAKT copes with randomness (see Exercise 4, page 324) and conflicting trend data (see
Exercise 2, page 131).
Figure 9-3
114 for a required objective (such as low overall cost or high availability).
desirable115. Should the physical inspection (a more intrusive form of CBM) uncover a
potential failure, then a model relating the less intrusive measurements to the findings of
the more intrusive inspections is desirable. Still, a functional failure will not have yet
occurred. With ever increasing amounts of data being captured from the control platform,
two (or more) levels of intrusiveness of CBM are often desirable. Hence we may build
decision models that predict potential failures thereby avoiding functional failures
altogether.
Here are two typical situations that the CBM Lab has encountered.
Case 1. A single asset, say a pump, has been operating for 30 years without failure. We
will probably have, for this pump, a large database of condition data (for example
vibration, flow rates, motor current, etc.) taken at regular intervals, but no failure
data. Alternatively, we may have a brand new pump of a new design on which we
possess no experience at all.
Both these cases meet the criteria for CBM described in Chapter 8. CBM Fundamentals
(page 103).
Discussion of Case 2
In most plants, functional failures and numerous potential failures do occur. The
following would be a typical scenario for the development of one or more CBM
optimization models:
1. We have a machine (or sometimes a fleet of machines).
2. Over time we record various measurements on a periodic (daily, weekly, monthly, etc.)
basis. For example: load, vibration, amperage, phase, or whatever else may be
appropriate. Those readings would also include working age measured in some
service usage unit that describes the accumulating stress on the machine. Say fuel
consumed, or widgets produced. In EXAKT we call each set of measurements
taken at more or less regular intervals, an Inspection.
3. Occasionally we see some anomaly in the data, and we feel that we should do a deeper
(more intrusive) "Inspection". Or we may be doing a time based maintenance
task. In either case we physically inspect one or more components in the
machine. We find that one of the components is in a failing state. We have, thus,
115
For example compression tests, pressure and ignition traces, or even partial disassembly for more
intrusive visual inspection.
discovered a potential failure.116 We record this observation in the CMMS as an
event which we might name "EFP1" (ending with potential failure type 1 –
which may be a potential failure of component X or of failure mode Y, for
example).
4. We repeat steps 1 to 3 over time. That is how we normally accumulate a "sample" of
condition and event data. (By the way, we are making use of an important
function of our CMMS by populating it with this type of data. After all, we paid
good money for the CMMS. Why not use its historical data recording capabilities
to their fullest?117)
5. Sometimes (as will happen) we will have missed detecting a potential failure soon
enough, and we will experience a real (functional) failure. This, as well, becomes
part of our historical database (i.e. our sample).
6. Over time we will have experienced several failure modes at the potential failure
stage, and perhaps one or two actual functional failures. (Now, at last, we have a
good sample). We analyze this sample in EXAKT and we build a model that can
be used for automated prediction (residual life estimation) and optimal CBM
decision making.
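The sample accumulated in steps 1 to 6 can be pictured as a set of histories, each holding periodic Inspections and ending in an event. The Python sketch below is only an illustration under assumed names: the classes `Inspection` and `History`, their fields, the unit identifiers, and the readings are all hypothetical and do not represent the schema of EXAKT or of any CMMS. Only the event codes EF and EFP1 come from the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Inspection:
    working_age: float   # service usage units, e.g. fuel consumed
    readings: dict       # e.g. {"vibration": 2.1, "amperage": 41.0}


@dataclass
class History:
    unit_id: str
    inspections: list = field(default_factory=list)
    end_event: Optional[str] = None   # "EFP1", "EF", or None if still running
    end_age: Optional[float] = None

    def add_inspection(self, working_age, **readings):
        self.inspections.append(Inspection(working_age, readings))

    def close(self, event, working_age):
        self.end_event = event
        self.end_age = working_age


def summarize(sample):
    """Count how many histories ended with each event code."""
    counts = {}
    for history in sample:
        key = history.end_event or "in operation"
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical sample: two potential failures, one functional failure,
# and one unit that is still running.
sample = []
for unit, event, age in [("P-1", "EFP1", 5200.0), ("P-2", "EF", 6100.0),
                         ("P-3", "EFP1", 4800.0), ("P-4", None, None)]:
    h = History(unit)
    h.add_inspection(1000.0, vibration=1.8, amperage=40.5)
    h.add_inspection(2000.0, vibration=2.4, amperage=41.2)
    if event:
        h.close(event, age)
    sample.append(h)

print(summarize(sample))
```

The point of the sketch is that potential failures (EFP1) count as ending events in their own right, so a usable sample can exist even when functional failures (EF) are rare.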
The important point to note in this hypothetical sequence is that model building
using EXAKT does not require us to have endured catastrophic or expensive functional
failures. EXAKT was designed to extend current CBM decision making capability. The
results of whatever current methods are being used to record condition data and event
data may be analyzed by EXAKT in order to build an optimal CBM data interpretation
model. That model can then be used as a policy (i.e. an alarm limit) for the future
detection of a specific failure mode while it is in its “potential failure” stage.
Of course in the real world, maintainers have not recorded failures, potential failures,
and other events as carefully as they perhaps would have, had they known about
EXAKT's data analysis capabilities. Not to worry. EXAKT contains many data checking
and validation procedures that help us "clean" our (less than meticulous) data. Usually,
we are able to analyze that data and provide the maintenance department with a good
predictive model. Or, at the very least, with some fresh new ideas on how to improve the
effectiveness of their current CBM program. Tutorials 2, 3, and 4 on the OMDEC
website118 demonstrate some of our data cleansing techniques.
Though building a database can take a long time, whatever we do, the clock will tick and
years will elapse. Either, during that time, we use standard procedures to record what
happened, or we populate our CMMS history database haphazardly. Opting for the
former adds negligible cost but confers, in the short term, expanded awareness and better
116
Nowlan and Heap described the method of “opportunity sampling”. When a unit becomes available to
maintenance staff for whatever reason, the opportunity is taken to inspect for all potential failures that may
be present.
117
Interviewer’s note: The sub-menu item “Data Strategy” under the menu item “Reliability” on the
OMDEC website describes how to use your CMMS in this way.
118
Under menu item “CBM Optimization”
communication among our maintainers, operators, supervisors, and engineers. In the
longer term, good historical information offers deeper understanding through analysis.
Discussion of Case 1
EXAKT offers two solutions for the “no data” situation, depending on which of these two
circumstances applies:
1. If we have some expert knowledge about the failure of the pump from the
maintenance personnel or from the OEM, or we have some failure data from a
similar pump (e.g., an earlier design of pump that we have used in the past), the
Bayesian approach would be the most appropriate solution. EXAKT’s upcoming
version implements Bayesian modeling. That is, it incorporates expert judgment
of the relative risks associated with various condition indicators to build a prior
model. EXAKT, subsequently and continuously, updates the model as actual
failure or potential failure data accrues.
2. In a second situation, let’s assume that we know nothing about the failure of the
pump. The Bayesian approach can still be applied by assuming a non-informative
prior distribution for the CBM model parameters. As in the first
situation, EXAKT continuously updates the model (as operational, failure,
and condition monitoring data accumulate). Of course, the prior model,
based on a non-informative prior distribution, initially, will have no predictive
value. Until the model evolves, the best we can do is to apply statistical process
control methods or judgement limits to certain “features” of vibration, oil
analysis, or other CBM data. In other words, the usual, or traditional, way that
CBM is done.
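The text does not describe how EXAKT performs its Bayesian updating, so the following is a deliberately simplified stand-in rather than EXAKT's method. It assumes a constant failure rate with a Gamma prior (a standard conjugate pair) purely to illustrate the two situations above: an informative prior built from expert judgment, and a nearly non-informative prior that lets the data dominate. All numbers are hypothetical.

```python
def update(shape, rate, failures, exposure):
    """Gamma(shape, rate) prior on a constant failure rate, updated with
    `failures` events observed over `exposure` units of working age."""
    return shape + failures, rate + exposure


def mean_failure_rate(shape, rate):
    """Mean of a Gamma(shape, rate) distribution."""
    return shape / rate


# Informative prior (hypothetical expert judgment): about 1 failure per
# 10,000 hours, weighted as if 2 failures had already been observed.
informative = (2.0, 20000.0)
# Nearly non-informative prior: carries almost no weight.
vague = (0.001, 0.001)

# New evidence: 1 failure (or potential failure) in 5,000 hours.
post_informative = update(*informative, failures=1, exposure=5000.0)
post_vague = update(*vague, failures=1, exposure=5000.0)

print(mean_failure_rate(*post_informative))  # prior and data combined
print(mean_failure_rate(*post_vague))        # driven almost entirely by data
```

With the vague prior the estimate is essentially the raw observed rate, which is why such a model has little predictive value until enough data has accrued.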
One might infer from the foregoing that we must simply revert to our existing CBM
procedures until data becomes available. This is not quite the case. The EXAKT
approach provides two distinct advantages over previous CBM methods:
The first is that EXAKT measures, monitors, and reports on the effectiveness of the
evolving predictive model. This provides maintenance managers with a clear picture of
whether and how their CBM programs are improving.
Secondly, and even more importantly, the EXAKT methodology imposes a novel
business discipline on the maintenance data acquisition process itself. Technicians,
reliability engineers, and managers alike, quickly experience the benefits of having
understood and duly recorded the five RCM knowledge elements119 prior to closing each
work order.
119
The first five RCM knowledge elements (i.e. questions) are: “What function was lost or compromised?”,
“In what way (e.g. full, partial, functional or potential failure) was it lost?”, “Why?”, “What happened?”,
and “How did it matter?”
One may question any method that purports to use data from the past to
predict the future. Conditions in the past could have been entirely different from
conditions in the future. How can one claim that the model developed from past data has
any validity at all? If operating conditions, rates, materials, and environmental factors all
change from their values in the past, how good will be the results of the model applied in
the future? A gut response to that question might be, “No good at all!”. But if we stop to
consider the nature of a model, we discover that it’s not as black as that. Consider the
internal indicators that we include in a model – vibration features, throughput, wear
particle size and quantity, component temperature, and so on. Then consider the range of
circumstances that occurred in the past with regard to these variables and their
relationship to a targeted failure mode. Although external conditions may have changed,
the internal physics associated with a failure mode, captured in the statistical model, are
still valid. If, however, the new conditions provoke entirely new failure modes that have
never occurred, the model cannot predict those new failure modes because the sample
upon which it was built contains no failures or potential failures of that kind.
Chapter 10. Optimizing CBM
The foregoing appears to rule out the use of CBM, according to a criterion we stipulated
in Chapter 8. (CBM Program Criteria page 103 ) that "an unambiguous potential failure
must be detectible".
The most difficult part of CBM is the latter – establishing a data interpretation policy. At
what point in time do we declare that a potential failure has occurred? How do we use
past experience to a) assess, and b) improve our CBM (potential failure declaration)
policy? This chapter will present a methodology to do just that. We begin by finding a
relationship between data and risk.
abundance of data, typically, exceeds our capability to apply simple rules for interpreting
it. Often we imagine that, even without a CBM data interpretation policy, if we display
the data on a graph, a potential failure will emerge as an obvious deviation from a trend
line. Sometimes this is true, but more often the data is unfathomable, exhibiting random
fluctuations and contradictory indications, with no particular potential failure making
itself obvious.
Figure 10-1: Two assumptions 1. The condition indicator is tracking resistance to failure, and 2. the
alert limit (potential failure) is constant with working age.
Figure 10-1 represents a simple CBM decision model. We may think of a model as a
measuring stick. When the monitored value exceeds some predetermined level we declare
a potential failure. In this chapter we discuss a systematic methodology for determining
the appropriate level at which to declare a potential failure. Most CBM data interpretive
policies currently use the simple model of Figure 10-1. When we apply such a model we
make an important assumption.
We assume that, whatever the item’s working age, the indicator level at which to declare
a potential failure will remain constant. While this assumption may be valid for some
failure modes, it is not necessarily so. Many items, particularly those that are in direct
contact with the product (e.g. liquids, hot gases, solids) or the environment, exhibit wear-
out and aging behavior. Younger machinery, for example, may tolerate higher loads and
vibration levels than older machines of the same type that have logged more fatigue
inducing cycles or exposure to corrosive environments. Experience120 reveals that, as
some items age, their potential failures occur at decreasing levels of the same condition
indicator. The precise relationship linking condition indicator, working age, operating
profile, and potential failure emerges from a CBM optimization analysis whose principles
are discussed next.
120
See wheel motor decision model discussed on page 131.
Figure 10-2: Data and risk121
To begin our discussion of a CBM risk model, we show two graphs in Figure 10-2. The
upper plot is a typical history of an item’s monitored data. The lower plot associates a
failure risk with each data point.122 Assuming that we have discovered the relationship
between data and risk, we would next wish to develop a general policy that tells us what
level of risk is the right one for deciding to preventively renew an item at a given
moment. In many business contexts the right risk level is the one that minimizes the total
cost of preventive and reactive maintenance. The next section describes how to merge
cost and risk into a single decision model.
121
Lecture by Prof. Andrew Jardine, based on notes by Dr. Dragan Banjevic, CBM Laboratory, University
of Toronto
122
If the data were not related to failure risk, there would be no purpose to monitoring it in a CBM
program. Our challenge, then, is to discover and use the true relationship between data and risk.
The Optimal Risk
Risk is a value that combines the probability of failure with the consequences of failure.
At the extreme left of the risk line, a very conservative maintenance policy may result in
high risk because of high cost and low availability, while on the extreme right the level
of risk is high due to low availability and low reliability123. Let us assume that we
know how our monitored data relates to the risk of failure (Figure 10-2). We then hasten
to ask the question in Figure 10-3, “What level of risk do we wish our condition based
maintenance policies to attain?”
We do not want to set our alert limits too low (too conservatively) nor too high (too
liberally). Whichever CM (condition monitoring) data interpretation policy we adopt for
declaring a potential failure will depend on our objective (optimizing objective). Figure
10-3 suggests three possible optimizing objectives for consideration: minimum cost,
maximum availability, or a specified reliability124. We may at certain times, depending on
market conditions, desire to operate near highest availability, and under other circumstances
at lowest cost or at some specified reliability. We may also wish to operate at some
compromise state among the three objectives at a corresponding point on the risk line.
123
Total cost of failure and preventive maintenance approaches that of a run-to-failure policy.
124
For example a survival probability of, say, 98% over a specified mission, say, 6 months.
“Risk management” equates to understanding the tradeoffs when adopting a particular
CBM policy, and adjusting that policy as the operating context changes.
One would expect the cost-versus-risk curve to resemble the trough shaped one
shown. If we wished to operate at near zero risk125 it is logical that we will be required to
spend prohibitive sums on pro-active maintenance in order to attain such a degree of
assurance and comfort. On the other hand, if we desired to throw caution to the winds and
do no pro-active maintenance whatsoever, then the risk will likely be quite high. In fact,
the average cost per unit of working age will approach the average cost of an individual
failure divided by the item’s mean time between failures. Hence Figure 10-3 shows that the cost curve
plateaus to the right. Somewhere between these two extremes we would expect to incur
lower costs. The risk level that engenders the lowest cost126 is said to be the optimal risk.
Let us state here the difference between hazard and risk. Hazard, also referred to as
instantaneous failure rate, is the probability of failure per unit time for a unit that has
survived up until the present time127. Risk (of failure), is defined as the combination of
probability and consequence. Probability is the likelihood of a failure occurring and
consequence is a measure of the damage that could occur as a result of the failure (in
terms of injury, fatalities, property damage, and operational and non-operational
consequences). Increased risk results from increased probability and higher degree of
consequence. On the other end of the risk spectrum (the left side of Figure 10-3), there is
another risk of interest – the risk of renewing a unit too early (over-maintaining). One
needs to manage and balance these two risks to find the most appropriate decision.
If we can, somehow, discover the answer to these questions, we can use observable data
to make optimal proactive decisions – the goal of CBM. We tackle this problem in the
125
Risk might be quantified in any number of ways, e.g. conditional probability of failure, failure rate,
reliability, etc. Or it may include a consideration of the consequences of failure.
126
Or highest availability, or specified reliability, or some desired compromise
127
See Appendix 7. on page 290 for a more complete definition of hazard.
next section by examining, first, the simpler case of a preventive (time based)
maintenance optimal model.
Figure 10-4 chronicles a typical item through a number of its life-cycles. An event B and
another event EF mark, respectively, the beginning and ending-with-failure of each life-cycle.
128
This discussion was developed by Dr. Dragan Banjevic, director, CBM Laboratory, University of
Toronto.
129
We are assuming that each renewal is a total (as good as new) repair (as opposed to a partial repair).
$$\text{Cost/hr} = \frac{6C_F}{t_1 + t_2 + t_3 + t_4 + t_5 + t_6}$$
Equation 10-1
$$\text{Cost/hr} = \frac{5C_R + 1C_F}{5t_A + t_6}$$
Equation 10-2
Next, suppose that someone feels that the cost of PM is too high and decides to extend
the interval for pro-active item renewal, say to time tB. This policy is illustrated by Figure
10-7.
Figure 10-7: Life-cycles with an extended interval preventive maintenance policy
Similar to the cost calculation for Policy A, the average cost of maintenance resulting
from Policy B would be that expressed by Equation 10-3.
$$\text{Cost/hr} = \frac{2C_R + 4C_F}{2t_B + t_1 + t_2 + t_3 + t_6}$$
Equation 10-3
The $64 question at this point is, which of the policies of Figure 10-8 is the optimal one?
Table 10-1 applies some numerical values to this problem in order to illustrate how total
cost may vary depending on the preventive maintenance policy chosen. Note the
sensitivity of total cost to the ratio CR/CF (preventive repair cost to failure cost).
Table 10-1 Possible costs of 3 policies and 3 different failure consequence costs CF
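As a rough illustration of the kind of comparison Table 10-1 makes, the sketch below evaluates Equations 10-1 to 10-3 for six life-cycles. The lifetimes, the intervals tA = 500 and tB = 1200, and the costs are invented for this example and are not the values of Table 10-1. They are chosen only so that, as in the equations, one life-cycle fails before tA and four fail before tB.

```python
def cost_per_hour(lifetimes, c_r, c_f, t_p=None):
    """Average maintenance cost per hour over a set of life-cycles.

    Each life-cycle that would have lasted at least t_p is renewed
    preventively at t_p (cost c_r); otherwise it fails (cost c_f).
    With t_p=None nothing is renewed preventively (run-to-failure).
    """
    total_cost = 0.0
    total_time = 0.0
    for t in lifetimes:
        if t_p is not None and t >= t_p:
            total_cost += c_r
            total_time += t_p
        else:
            total_cost += c_f
            total_time += t
    return total_cost / total_time


# Hypothetical lifetimes t1..t6 (hours) and costs.
lifetimes = [800, 900, 1000, 1500, 2000, 400]
c_r, c_f = 1000.0, 5000.0

no_pm = cost_per_hour(lifetimes, c_r, c_f)              # Equation 10-1
policy_a = cost_per_hour(lifetimes, c_r, c_f, t_p=500)  # Equation 10-2
policy_b = cost_per_hour(lifetimes, c_r, c_f, t_p=1200) # Equation 10-3

print(round(no_pm, 3), round(policy_a, 3), round(policy_b, 3))
```

With these made-up numbers Policy A is the cheapest of the three. If the failure cost is lowered until it equals the preventive repair cost (c_f = 1000), the ranking reverses and run-to-failure becomes the cheapest, which illustrates the sensitivity to the ratio CR/CF.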
Which policy will result in the lowest total cost of preventive and reactive maintenance
per unit of working age? A? B? Or a “no scheduled maintenance” policy?
To answer that question, we suspect that we need to pose two more questions about the
item:
1. How does its failure risk vary with its age? and,
2. How do we combine the costs, CP and CF, with failure risk to arrive at an optimal
PM decision?
Tackling Question 1 first, we seek an equation to describe the lower curve of Figure 10-2
(page 115) that relates risk to working age. As it happens, in the early 1950’s, Professor
Waloddi Weibull modestly suggested that his equation relating risk to age “might
sometimes render service”. The reliability community at the time reacted negatively to
the presumption that such a simple formula could work. But Weibull persisted and the
United States Air Force sponsored his research over the next 25 years. The Weibull hazard
model reshaped the discipline of reliability, having shown itself applicable to a
surprisingly wide cross-section of items and operating environments. The Weibull risk
model is given in Equation 10-4.
$$h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta - 1}$$
Equation 10-4 The Weibull risk model
In the Weibull equation h(t) represents the hazard rate130, and β and η are constants
known respectively as the Weibull shape and scale parameters. If we had a methodology
to estimate the Weibull parameters from a set of lifetime histories of an item or fleet, we
would have responded to Question 1 posed earlier on page 121.
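Equation 10-4 is simple enough to evaluate directly. The sketch below does so in Python; the shape and scale values are hypothetical and are not the estimates obtained for the haul truck transmissions. It also applies the approximation that the conditional probability of failure over a short interval is the hazard multiplied by the interval length.

```python
def weibull_hazard(t, beta, eta):
    """Equation 10-4: h(t) = (beta/eta) * (t/eta)**(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)


beta, eta = 2.0, 1000.0   # hypothetical shape and scale parameters
interval = 10.0           # a short interval, in working age units

for age in (250.0, 500.0, 1000.0):
    h = weibull_hazard(age, beta, eta)
    # Approximate conditional probability of failure in the next interval.
    print(age, h, h * interval)
```

With a shape parameter greater than 1 the hazard rises with working age (wear-out); with a shape parameter of exactly 1 it is constant at 1/eta, and age based renewal brings no benefit.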
Figure 10-9: Computer estimation of the Weibull shape and scale parameters
As it happens, Figure 10-9 describes the output of such a methodology applied to heavy
haul truck transmissions131. The software used to perform this calculation was
EXAKT132. It uses numerical algorithms to process CMMS (computerized maintenance
management system) historical data. It estimates the Weibull shape and scale parameters.
Figure 10-9 shows the Weibull equation with the estimated shape and scale values.
Hence, for this explicit example, Figure 10-9 answers Question 1 (page 121): “How does
an item’s failure risk vary with its working age?”
130
Hazard is the instantaneous risk of failure at a time t. The conditional probability of failure can be
calculated from the hazard rate by multiplying its value by the length of the desired short interval.
131
The software and data for this example, as well as a tutorial, may be downloaded from
www.omdec.com.
132
CBM Optimizing software developed by the CBM Laboratory at the University of Toronto. A trial
version together with the working databases used in these examples is available at www.omdec.com .
Blending in Cost
We turn our attention, now, to Question 2 (of page 121): “How do we combine the costs,
CP and CF, with failure risk to arrive at an optimal PM decision?”
Figure 10-10: Probability density function can be drawn once the Weibull parameters have been
estimated.
Having answered Question 1 (page 121), “How does its failure risk vary with its age?”,
we can (with the help of a computer program133) draw the curve of Figure 10-10, known
as the “Probability Density Function”, represented by f(t) (defined in Appendix 7. on
page 290). This graph has some convenient characteristics:
1. The area under the curve up to a time t is equal to the probability that the item will
fail prior to time t. This value is known as the “Cumulative Probability of Failure”
and is represented by F(t). And,
2. The area under the remaining part of the curve is equal to the probability that the
item will survive to time t. That value is known as the Survival Function134 and
is represented by R(t). And,
3. Because an item will fail eventually, the area under the entire curve is equal to 1.
It follows, then, that F(t) = 1 – R(t).
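The three characteristics above can be checked numerically. The sketch below assumes the standard Weibull forms R(t) = exp(-(t/η)^β) and f(t) = h(t)·R(t), with hypothetical parameter values and a shape parameter of at least 1 so that f(0) is finite. It integrates the density with the trapezoid rule and compares the area with 1 – R(t).

```python
import math


def reliability(t, beta, eta):
    """Survival function R(t) of the Weibull distribution."""
    return math.exp(-((t / eta) ** beta))


def pdf(t, beta, eta):
    """Probability density f(t) = h(t) * R(t)."""
    hazard = (beta / eta) * (t / eta) ** (beta - 1)
    return hazard * reliability(t, beta, eta)


def area_under_pdf(t, beta, eta, steps=10000):
    """Trapezoid-rule estimate of F(t), the area under f from 0 to t."""
    width = t / steps
    total = 0.5 * (pdf(0.0, beta, eta) + pdf(t, beta, eta))
    for i in range(1, steps):
        total += pdf(i * width, beta, eta)
    return total * width


beta, eta, t = 2.0, 1000.0, 800.0   # hypothetical values
F = area_under_pdf(t, beta, eta)
R = reliability(t, beta, eta)
print(F, R, F + R)   # F + R should be very close to 1
```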
In the graph of Figure 10-10, tp represents the time at which we plan to carry out a
preventive renewal of the item. We may, with the help of Figure 10-10, express the
expected average (over many life-cycles) cost, CE, of maintaining that item. CE will
include preventive and reactive maintenance. The expected cost of preventive repair will
be:
the average cost of an individual preventive repair, CP, times the probability that
the item will survive to tp,
133
Relcode (see page 47 ), for example.
134
Sometimes called the “Reliability Function”
and the expected cost of reactive repair will be the average cost of an individual failure,
CF, times the probability that the item will fail prior to tp.
$$C_E = C_P R(t_p) + C_F\,(1 - R(t_p))$$
Equation 10-5
In precisely the same way Equation 10-6 expresses the item’s expected operating time, tE.
$$t_E = t_p R(t_p) + \int_0^{t_p} t f(t)\,dt$$
Equation 10-6
$$\frac{C_E}{t_E} = \frac{C_P R(t_p) + C_F\,(1 - R(t_p))}{t_p R(t_p) + \int_0^{t_p} t f(t)\,dt}$$
Equation 10-7135 Total cost per unit time of maintenance as a function of PM policy tP.
Equation 10-7 provides us with the relationship we seek in order to answer Question 2
(page 121). Both R(t) and f(t) are derivable (see Appendix 7.) from the hazard rate of
Equation 10-4. We may use a computer numerical algorithm136 to plot CE/tE for all values
of tP. Figure 10-11 shows just such a graph. The tP corresponding to the minimum cost
on the curve will be the optimal PM policy. Thus, we have answered Question 2 (page
121): “How do we combine the costs, CP and CF, with failure risk to arrive at an optimal
PM decision?”.
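A plain grid search is enough to illustrate how the optimal tP can be read off the cost curve of Equation 10-7. The sketch below is not EXAKT's algorithm. It assumes a Weibull lifetime with a shape parameter greater than 1, hypothetical parameter and cost values, and a trapezoid rule for the integral in the denominator.

```python
import math


def reliability(t, beta, eta):
    return math.exp(-((t / eta) ** beta))


def pdf(t, beta, eta):
    return (beta / eta) * (t / eta) ** (beta - 1) * reliability(t, beta, eta)


def cost_rate(t_p, c_p, c_f, beta, eta, steps=1000):
    """Equation 10-7: expected cost per unit of working age for policy t_p."""
    r = reliability(t_p, beta, eta)
    width = t_p / steps
    # Trapezoid rule for the integral of t*f(t) from 0 to t_p.
    # The integrand is 0 at t = 0, so only the upper endpoint is halved.
    integral = 0.5 * t_p * pdf(t_p, beta, eta)
    for i in range(1, steps):
        t = i * width
        integral += t * pdf(t, beta, eta)
    integral *= width
    return (c_p * r + c_f * (1.0 - r)) / (t_p * r + integral)


# Hypothetical inputs: shape 3, scale 1000, failure ten times the cost of PM.
beta, eta, c_p, c_f = 3.0, 1000.0, 1.0, 10.0
candidates = range(50, 3001, 10)
best_tp = min(candidates, key=lambda tp: cost_rate(tp, c_p, c_f, beta, eta))
best_rate = cost_rate(best_tp, c_p, c_f, beta, eta)
late_rate = cost_rate(5000, c_p, c_f, beta, eta)   # effectively run-to-failure

print(best_tp, best_rate, late_rate)
```

For very large tP the cost rate levels off at the run-to-failure value, which matches the plateau on the right of the cost curve discussed earlier.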
135
If you are curious as to how the second term $\int_0^{t_p} t f(t)\,dt$ in the denominator of
Equation 10-7 was derived from tf(1-R(tp)), you may look at the derivation given in
Appendix 12. on page 300.
136
Such as is available in EXAKT
Figure 10-11: EXAKT solution output for optimal PM policy.
Now that we have established a model that relates risk, cost, and working age we ask the
question, “What if we had additional information (in addition to working age) that would
also reflect the risk of failure?”. For example, vibration analysis data, oil analysis data,
visual inspection data, operating profile changes, or other signals from the machine or
process would surely bear upon failure risk, would they not? Can we therefore extend
our risk equation of page 122? Yes we can. The Weibull model can be extended with a
new term as shown in Figure 10-12.
Note that the extended hazard model137 has additional parameters γi that determine the
influence of their respective measured values (called covariates). Where previously, in
the case of preventive maintenance, we determined the best working age at which to
renew the asset, we desire now to determine the best levels of all the significant
covariates and the working age at which to intervene and perform a preventive renewal of
the asset. That is to say, we wish to completely define a potential failure.
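The sketch below shows how covariates enter such a model. It assumes the usual Weibull proportional hazards form, in which the Weibull hazard of Equation 10-4 is multiplied by an exponential of the weighted covariates, exp(γ1·Z1 + γ2·Z2 + ...). The parameter values and the covariate names (iron and lead in ppm) are hypothetical and are not fitted results.

```python
import math


def phm_hazard(t, beta, eta, gammas, covariates):
    """Weibull hazard scaled by exp(sum of gamma_i * Z_i)."""
    baseline = (beta / eta) * (t / eta) ** (beta - 1)
    score = sum(g * z for g, z in zip(gammas, covariates))
    return baseline * math.exp(score)


beta, eta = 2.0, 10000.0       # hypothetical shape and scale
gammas = [0.05, 0.10]          # hypothetical weights for iron and lead
age = 5000.0

clean_oil = phm_hazard(age, beta, eta, gammas, [0.0, 0.0])
dirty_oil = phm_hazard(age, beta, eta, gammas, [20.0, 5.0])

# At the same working age, the ratio depends only on the covariates.
print(clean_oil, dirty_oil, dirty_oil / clean_oil)
```

Two items of identical working age can therefore carry very different hazards, which is why the decision depends on the covariate levels and the working age together.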
137
Known as the PHM (proportional hazard model) developed by Cox. Cox, D. R. 1972. “Regression
Models and Life Tables (with Discussion).” Journal of the Royal Statistical Society, Series B 34:187—220.
In the following examples we detail the process for building and deploying a CBM
intelligent agent.
The exercise will demonstrate the basic functions of the EXAKT model building platform
and the EXAKT decision agent software. Example 1 uses a reduced set of oil analysis
data from a fleet of haul truck transmissions to build a proportional hazards model. By
following the steps in the Appendix, you will create and deploy this model as an
“intelligent agent” that silently and automatically monitors future condition monitoring
data, returning an optimized decision (whether or not to remove and repair the
transmission) as each new set of condition monitoring readings is received. The model
constitutes a “policy” for making optimized decisions. Such a policy will minimize some
undesirable feature, such as excessive cost, or maximize some wanted feature, such as
availability. The decision agent provides a remaining useful life estimate based on the
current condition of the equipment, its age, and all relevant maintenance and operational
events that have occurred.
138
It may also be downloaded from www.omdec.com
3. Transition Probability Model,
4. Decision Model, and
5. Decisions
A short description of each step, using the data from Exercise 1, follows.
1 Data preparation
Figure 10-15 illustrates the setting up of the project’s descriptive information. The “CBM
Model” field of Figure 10-15 provides the name by which a predictive CBM decision
model will be known, usually the name of a component or a failure mode whose
deterioration we wish to detect and monitor. The model, as its primary function, should
enable us to declare, and thereby, act upon a potential failure at the most advantageous
moment.
2 Weibull PHM
Figure 10-16 Testing possible significant variables
In the Weibull PHM step we proceed to test the degree to which monitored variables have
the potential to predict the failure mode under analysis. In Figure 10-16 we are setting
up the test of the combination of the oil analysis measurements of dissolved iron and lead.
Figure 10-17 provides the results of the test.
Figure 10-17 The results of the trial of iron and lead as PHM model covariates
Note the various text information we set up in Figure 10-15 as it appears in the report of
Figure 10-17. The “Summary of Events and Censored Values” table tells us about the
size of our sample (how many life cycles) and the breakdown of those that actually ended
in failure and those that were preventively renewed.
In the “Summary of Estimated Parameters (based on ML method)” table we can see the
results of the “maximum likelihood”139 method applied to the sample of 13 lifetimes.
The results of a number of statistical estimation methods140 are shown in this table
(Standard error, Wald, DF, p-Value, Exp of Estimate, and 95% Confidence Interval). The
software considers the results of each statistical procedure and displays the conclusion in
the column “Sign” (abbreviation for ‘significance’). “Y” indicates that “Shape” (i.e. the
working age), Iron, and Lead have been found to be significant to the probability of
failure in the upcoming observation interval.
139
A “fitting” algorithm that estimates the parameters of the proportional hazard model so that it best fits
the data.
140
See the EXAKT user’s manual for detailed explanations on these statistical tests.
3 Transition probability model
So far we have built a proportional hazard model (PHM). That model provides us with a
failure probability (hazard rate) knowing the working age of the item and the values of a
set of significant condition monitoring variables at that working age. However, in order
to complete the predictive capability of the model, we must have a way to describe the
behavior of those variables. The method used in EXAKT is known as the Markov chain
transition probability matrix. Figure 10-18 shows the matrix for iron at each of five states
whose boundaries were proposed by the software.
The matrix of Figure 10-18 represents only a single dimension. It assumes that the value
of the second significant variable, lead, remains constant. Based on the transitions of all
data values in the past to their new values at the subsequent inspections, the software
generates a probability matrix for each combination of states of the significant variables.
The resulting multi-dimensional matrix is combined with the PHM, the next-to-final step
towards building the predictive decision model.
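The idea of a transition probability matrix can be illustrated for a single covariate by counting how often a reading moves from one state to another between consecutive inspections. This is a simplified sketch, not EXAKT's estimation procedure: it assumes equally spaced inspections, and the state boundaries and iron readings are hypothetical.

```python
from bisect import bisect_right


def to_state(value, boundaries):
    """Index of the band that `value` falls into (0 = lowest band)."""
    return bisect_right(boundaries, value)


def transition_matrix(histories, boundaries):
    """Estimate P[i][j] = probability of moving from state i to state j
    between consecutive inspections, by counting observed transitions."""
    n = len(boundaries) + 1
    counts = [[0] * n for _ in range(n)]
    for readings in histories:
        states = [to_state(v, boundaries) for v in readings]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Hypothetical iron readings (ppm) from two histories, three states.
boundaries = [10.0, 20.0]
histories = [[5.0, 8.0, 12.0, 25.0], [3.0, 15.0, 18.0, 22.0]]
P = transition_matrix(histories, boundaries)
for row in P:
    print([round(p, 3) for p in row])
```

A row of zeros means that no transition out of that state was observed in the sample, a reminder that the matrix is only as informative as the inspection history behind it.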
4 Decision model
This last step in the model building process requires us to provide the relative average
costs of a preventive renewal of the component (or failure mode) following the
declaration of a potential failure, as well as the typical worst case cost141 if a functional
failure were to occur. The results of the decision model applied to one of the equipment
units are shown in Figure 10-19.
141
Based on an assessment of the “typical worst case scenario”. All models are based on assumptions. The
EXAKT model assumes that a manager, through experience, can envision a balanced portrait of the
events surrounding a failure and their consequences. Sensitivity analysis (a function of the software) helps
us to sanity check these assumptions and their impact upon the model’s decisions.
Figure 10-19 Results of applying optimal decision model retroactively to an item.
In the table of Figure 10-19 we may read the average cost per working age unit (associated
with failure and proactive maintenance) for the current policy (.449), the proposed
optimal (EXAKT) policy (.378), and the “no pro-active maintenance” policy (1.522). The
mean-time-between-replacement (MTBR) includes both preventive and failure
replacements. It is 8775 working age units currently. By adopting the EXAKT decision
model, we would intervene far more often (every 3326 working age units), but at a cost
per working age unit of 0.378 (84% of the cost of the current policy).
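The comparison quoted from Figure 10-19 is straightforward arithmetic, shown here as a short check using only the figures given in the text.

```python
current, optimal, no_pm = 0.449, 0.378, 1.522   # cost per working age unit
mtbr_current, mtbr_optimal = 8775, 3326         # working age units

cost_ratio = optimal / current                  # optimal policy vs current
saving_vs_no_pm = 1.0 - optimal / no_pm         # optimal policy vs no PM
intervention_ratio = mtbr_current / mtbr_optimal

print(round(cost_ratio * 100), "% of the current policy's cost")
print(round(saving_vs_no_pm * 100), "% lower than with no pro-active maintenance")
print(round(intervention_ratio, 1), "times as many replacements")
```

The lower cost despite more frequent replacement reflects the trade-off described earlier: each preventive renewal costs less than the functional failure it avoids.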
5 Decisions
Upon building and testing the model in EXAKTm (EXAKT for Modelling), we export
the model to an external database where it may be deployed by EXAKTd (the EXAKT
decision agent).
Coals Ltd. opened in 1970 as a multiple open pit mine using the truck and shovel mining
method. Annual production at the mine called for the removal of 21 million cubic meters
of rock and 2.8 million tons of coal. The mine won multiple awards for land
reclamation and for creating wildlife habitat. Oil analysis results from a fleet of 55 haul truck
wheel motors were analyzed along with their respective failures and repairs over a nine-
year period.
Extensive planetary gear or sun gear (Figure 10-21) damage necessitates replacement of
one or more major internal components in a general overhaul. There were 26 haul trucks
at the mine site, each having two wheel motors. With 3 spare wheel motors the fleet
numbered 55. Oil analysis was carried out monthly.
condition monitoring test results – some 50,000 records covering the same time period as
the removal history.
Figure 10-22: DataCheck report in synchronized view with Inspections and Events tables.
The DataCheck report addresses a common problem in historical CMMS databases.
Often work order records omit the description of what was found when examining the
item prior to its repair.142 The report of Figure 10-22 issues a warning
whenever it deduces that an ending event, either EF (ending with failure) or ES (ending
by suspension), may be missing. The analyst must investigate the actual work orders or
the comments in the work order record in order to ascertain whether a failure or a
preventive renewal of the item occurred, or whether the item is currently in operation.
Each valid history for a wheel motor must have a Beginning event (B), an Ending event
(EF for failure, or ES for suspension (such as a preventive removal)) and Inspection
events in between.
The DataCheck report of Figure 10-22 may issue additional comments and warnings. For
example:
The DataCheck function also points out anomalies that may indicate data problems such
as two inspections on the same day, or working ages and calendar dates out of
synchronization. All of these logical errors would have compromised the model’s
accuracy. Most of these types of errors can easily be corrected by inserting the missing
Beginning and Ending events for each history.
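The kinds of checks described above can be sketched in a few lines. This is not the DataCheck implementation; the record layout, the event code "INSP", and the function name are assumptions made for illustration. The sketch tests one history for a missing Beginning or Ending event, for two inspections on the same day, and for working ages that run backwards against calendar dates.

```python
from dataclasses import dataclass


@dataclass
class Record:
    date: str           # ISO date string, e.g. "1996-02-05" (assumed layout)
    working_age: float  # working age in hours
    kind: str           # "B", "EF", "ES", or "INSP" (hypothetical inspection code)


def check_history(records: list[Record]) -> list[str]:
    """Return a list of warnings for one history (one lifetime of an item)."""
    if not records:
        return ["empty history"]
    warnings = []
    recs = sorted(records, key=lambda r: r.date)  # calendar order
    if recs[0].kind != "B":
        warnings.append("missing Beginning event (B)")
    if recs[-1].kind not in ("EF", "ES"):
        warnings.append("missing Ending event (EF or ES)")
    # Working age should never decrease as calendar time advances.
    for prev, cur in zip(recs, recs[1:]):
        if cur.working_age < prev.working_age:
            warnings.append(f"working age decreases on {cur.date}")
    # Two inspections recorded on the same day suggest a data problem.
    seen_dates = set()
    for r in recs:
        if r.kind == "INSP":
            if r.date in seen_dates:
                warnings.append(f"two inspections on {r.date}")
            seen_dates.add(r.date)
    return warnings


history = [
    Record("1996-01-05", 0, "B"),
    Record("1996-02-05", 510, "INSP"),
    Record("1996-02-05", 515, "INSP"),
    Record("1996-03-05", 480, "INSP"),
]
for message in check_history(history):
    print(message)
```

A history that passes all of these checks is not necessarily correct, but one that fails any of them should be investigated before it is used to build a model.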
142
The roots of such data integrity problems were discussed in Part 1.
Figure 10-23 Cross graph of Si and Working age for the entire fleet over 9 years
On investigating the commercial laboratory that performed the oil analyses, we
discovered that, for a period of time, the photo-multiplier tube on the spectrometer was
saturating at exactly 900 PPM. In other words, all values of silicon above 900 were
truncated to 900 PPM. A similar situation occurred for iron above 2500 PPM. If not
detected, this could play havoc with the building of the PHM (proportional hazard
model). To solve this problem we call up the cross graph of Silicon versus Iron displayed
in Figure 10-24.
Figure 10-24: Cross graph of correlation between Si and Fe showing data errors
Figure 10-24 reveals strong correlation between Silicon and Iron, as well as an obvious
dog leg in the graph where Si plateaus at 900 ppm. We note too that a few points recorded after
the spectrometer was repaired fall on the correlation line. It is reasonable,
therefore, to correct the values of 900 ppm by substituting the values of iron × the slope
of the correlation line as was done in Figure 10-25.
Figure 10-25 Corrected values of silicon
In this instance, knowing the errors in the laboratory test data, it was possible to
compensate for them in the database used to build the model. For example, the
truncated values of ‘Si’ were replaced with 1.2 × Fe. The factor of 1.2 was
determined from the initial slope of the cross graph (a correlation graph) of Fe-Si and
values obtained after the saturation defect was corrected. The truncated Fe values were
not corrected since there were too few of them to influence the model.
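The correction described above can be sketched as follows. The saturation level (900 ppm) and the slope (1.2) come from the text; the function name is ours, and the guard that never lowers a saturated reading below 900 ppm is our own assumption, not something the study reports.

```python
SI_SATURATION_PPM = 900.0  # level at which the silicon reading saturated
SLOPE_SI_PER_FE = 1.2      # slope of the Si versus Fe correlation line


def correct_silicon(si_ppm: float, fe_ppm: float) -> float:
    """Replace a saturated Si reading by the value implied by the Fe reading."""
    if si_ppm >= SI_SATURATION_PPM:
        estimate = SLOPE_SI_PER_FE * fe_ppm
        # Assumption: a saturated reading is known to be at least 900 ppm,
        # so the corrected value is never allowed to fall below that level.
        return max(estimate, SI_SATURATION_PPM)
    return si_ppm  # unsaturated readings are left untouched


for si, fe in [(450.0, 380.0), (900.0, 1000.0), (900.0, 500.0)]:
    print(si, fe, "->", correct_silicon(si, fe))
```

Such a substitution is only defensible when, as here, the cause of the error is known and the correlation between the two covariates is strong.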
Determining correlation between covariates is useful both to provide insight into the data,
and in understanding the models generated by the software. For example, if ‘Fe’ and ‘Ni’
are highly correlated, the software would confirm that there is no point in including nickel
in the model since it has been determined to provide no additional information regarding
the probability of failure. Thus, if the software concludes that nickel is “insignificant”,
then by inspecting the correlation graphs one could therefore understand the
reasonableness of such an indication. These correlations are the result of wear of a
metallic alloy component present in the unit.
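A correlation graph can be summarized numerically by a correlation coefficient. The sketch below computes the Pearson coefficient for two covariates; the Fe and Ni readings are invented for illustration and do not come from the study.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical oil analysis readings (ppm) from successive inspections.
fe = [12.0, 25.0, 40.0, 61.0, 88.0, 120.0]
ni = [1.1, 2.4, 4.2, 5.9, 9.0, 11.8]

r = pearson(fe, ni)
print(f"Correlation between Fe and Ni: {r:.3f}")
```

A coefficient close to 1 tells us that one covariate carries nearly the same information as the other, which is consistent with the software reporting one of them as insignificant.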
The effects of minor maintenance or equipment calibrations
In the EXAKT data preparation phase we set up initialization conditions associated with
certain events. The model is told what covariate values should be associated with those
minor corrective events, such as an oil change (OC). By the same token, events such as
balancing a rotor, or aligning a shaft should be recorded whenever they occur. During
model setup approximate initialization vibration levels will have been assigned to these
events in the CovariatesOnEvent table, so that the model can properly recognize that
decreases in covariate values are the result of a minor maintenance event.
Figure 10-27 shows ‘missing’ or ‘irregular’ oil changes and obvious gaps due to
incomplete records144. Oil ages of 7000-8000 hours are indicated which is quite unlikely
with the use of mineral oils in this application. The site changed to synthetics about two
years earlier to eliminate the need for regular oil changes. However, most histories
containing missing oil changes occurred prior to 1997. It was thought that this
information needed to be recovered from the commercial laboratory’s files.
Unfortunately these files, too, were incomplete and inconsistent with the dates and
working ages in the work order database.
143
By associating failure with decreasing levels of wear metals.
144
Chapter 1. (page 13) and Chapter 2. (page 19) addressed and offered a solution strategy to this common
problem.
Figure 10-27 Missing oil change events
Fortunately, however, these 'missing' oil changes did not significantly affect the model
since they were relatively few in number with respect to all of the known oil changes.
That is, there were a sufficient number of known oil changes in the database for the
model to account for their effect on the measured data.
Figure 10-28 Graphical analysis of maximum likelihood estimation (MLE) residuals
Each point on the residual graph of Figure 10-28 represents a history, that is, a lifetime of
a wheel motor from its installation to its removal. The sample used to build the model
consists of many histories drawn from the entire fleet. The graph shows an unusual point
that is well above the 95% upper limit. This leads one to investigate the underlying data
corresponding to this residual (i.e. this particular lifecycle). It was discovered (Figure
10-29) that some ‘unusual’ data were included in that history, which appear to violate the
model that we are attempting to build.
Figure 10-29 Unusually high values of Fe and Si unexplained by a failure event
The Fe values in the left-circled region of Figure 10-29 have an inexplicable pattern. Fe
jumps to high values (truncated at 2500 PPM by instrument saturation) and
remains in the same range for a few more inspections. Then the readings fall back to low
values. No events were recorded to explain these sudden jumps.
Having no event data to support such high values of Fe and Si, the model was
regenerated and the fit tested again after removing that history from the working data set.
Statistical and graphical goodness-of-fit testing procedures are applied by the software as
part of the modeling procedure. The model’s fit to the data improved immediately. The
model building algorithms no longer had to accommodate obviously contradictory and
misleading information.
However, a different (and more fundamental) problem occurred regarding the definition
of wheel motor failure. These units seldom failed “functionally”. That is, no haul truck
needed to be taken out of service immediately while it was hauling a load of rock or coal.
Nevertheless, to develop a CBM policy (model) we must have some objective definition
of failure. Initially, the mechanics’ remarks (on the work order) were used for this
purpose. For example,
"High iron in oil sample and high hours, removed and replaced wheel motor."
This event was then classified as a “failure”. However, on reviewing the re-builder's
report attached to each invoice it became clear that some events initially classified as a
failure should be treated as a suspension and vice versa. For example, if the gears had
been replaced because they failed an ultrasonic test or they were obviously in a failed
state then that event should be classified as a failure. But if the gears were replaced
simply because it was expedient to do so, or if the unit was only generally rebuilt with no
real internal damage, then that event should be considered a suspension.
Figure 10-31: The proportional hazard model145 for a haul truck wheel motor
Figure 10-32: Optimal CBM decision model applied to a set of oil analysis data for a wheel motor
145
Covariate significance is tested by the Wald statistic, the square of the standardized estimate of the
parameter which follows a chi square distribution with 1 degree of freedom. (Note: A few missing sediment
values had been replaced by the values from previous inspections prior to the analysis, hence the reason for
using the notation CorrSed).
motor failed. One illustration of such a history is shown in Figure 13. This graph provides
a recommended decision based on inspection data (covariates and working age).
The decision ‘Replace immediately’ was suggested by the model (as illustrated in Figure
10-32) for the first time at the inspection point at working age = 11384 hrs, 276 hours
(about two weeks) prior to failure (reported at 11660 hrs). The following inspection at
working age = 11653 hours, 7 hours prior to failure, also suggests the replacement of the
wheel motor. The first warning may have been sufficient, given sample turnaround time
of 48 hours, to prevent the consequences of failure. Even prior to 11384 hours it can be
seen from the decision graph that the results of the measurements indicate that a
replacement recommendation was imminent. Note that the zero points on the graph
indicate default measurement values of zero (imputed by the software) immediately
following oil changes.
The economic benefit associated with basing the maintenance policy on the Decision
Policy Graph of Figure 10-32 is exposed through an economic investigation using
EXAKT’s sensitivity analysis function. Under current economic conditions, Figure 10-33
indicates a potential saving of 20% to 30% compared to current practice.
It is to be noted that for the cost ratio of 3:1 (first section of Figure 10-33) no
operational savings were accounted for since at the time of this study, unfavorable coal
market conditions caused the mine to operate below its capacity. However, as market
conditions improve higher cost ratios would be used since the capital assets of the mine
will be used at maximum capacity. Current strip ratios (total material removed versus
sellable material) would also affect the savings associated with increased availability and
reliability of the units. The sensitivity analysis function of EXAKT, described in Figure
10-34, demonstrates the sensitivity of the overall savings to changes or inaccuracies in the
cost ratio.
Sensitivity analysis
Figure 10-34: Sensitivity of the CBM model to economic and geological conditions affecting the cost
consequences of haul truck failure
In real situations, the actual ratio of failure and preventive replacement costs may not be
well known. Furthermore the dynamics of industry are such that costs can change with
changing technology, production, and market conditions. Therefore one would like to
know to what degree the true total cost per unit time and the optimal policy would
change with changes in cost ratio. The software enables sensitivity analysis to be
undertaken and generates the graph and corresponding tabular data of Figure 10-34.
The curves on the graph are interpreted as follows.
Solid line: If the actual cost ratio (CR) of today differs from that specified when the
model was built, the current policy (as dictated by the Optimal Replacement Graph of
Figure 10-32 on page 142) may no longer be optimal. The line indicates the percentage
increase that will be incurred above the optimal cost per unit time by adhering to the
current (no longer optimal) policy. For example, if the actual cost ratio is 5 and our model
was built with CR=3, then the increase in the cost incurred by following that (original
optimal) policy is around 6% (5.98). In other words, the solid line represents the
sensitivity of costs to changes in CR.
Dashed line: Again, assume the actual cost ratio has strayed from what was used when
the model was built. If the model is rebuilt using the new ratio, the dashed line tells how
much the new optimal cost would differ from that of the original model. In other words,
the dashed line represents the sensitivity146 of the optimal policy to changes in CR.
The graph indicates that moderate overestimation of the cost ratio does not
significantly affect the average long run cost but provides a more conservative policy
from the point of view of risk of failure. In a frequently (perhaps seasonally) changing
cost situation it could be worthwhile to dynamically rebuild the CBM optimization model
each time it is applied by using a cost ratio fed from an ERP (enterprise resource
planning) system that takes account of current market conditions.
The cost analysis summary shown on Figure 10-33 (page 143) indicates a saving of 25%,
when CR=3, over the “replace only on failure” (ROOF) policy, whose costs approximate
those of the site’s past policy. Decision model results are also calculated for cost ratios of
5 and 6. As the cost ratio increases we can observe an increase in both the optimal policy
cost and the savings. The optimal decision models in these cases indicate
more frequent preventive replacements (from 74% to 91%) will result from applying the
optimal decision policy in order to avoid costly failures. (Note: There is a slight
discrepancy between the expected time between replacements for the ROOF policy, when
CR=3 and CR=5 and 6. This is due to the numerical calculation procedure.)
The steps in the appendix for Exercise 3 (Data Validation) page 319 contain a hyperlink
to a database file with which the reader may reproduce the analyses and graphs of
Example 2.
146
Note that the sensitivity graphs assume that only Cf (failure replacement cost) changes and Cr
(preventive replacement cost) remains unchanged.
Example 3 Complex Items
A complex item is an item with more than one failure mode or failure-susceptible
component. A simple item has a dominant failure mode, while a complex item has
several failure modes. A CBM program, typically, acquires inspection data (e.g. oil
analysis, vibration, performance data) for an entire system, such as an engine. Thus, a
single system identifier (say “Engine 7483”) labels inspection data records from which
more than one failure mode can be deduced147. Each failure mode will have its own age-
reliability-CMdata148 relationship, and hence, its own CBM decision model.
The example of this tutorial is a single reduction gearbox that contains two gears
(referred to as Gear1 and Gear2). We concern ourselves, in this item, with
the failure mode “tooth fails due to root crack”, which can occur on either gear. We treat
this unit, therefore, as a complex item having two failure modes.149 A CBM policy must
consider all reasonably likely failure modes whose potential failures are detectable in the
condition monitoring data set. The policy must distinguish data patterns characterizing
one failure mode from those characterizing another. The policy must advise on which
potential failure mode is imminent and provide a residual life estimate.
The EXAKT software uses the term Marginal Analysis150 to indicate that a complex item
is being analyzed. We develop our CBM decision models within a “working model”
database (whose filename is typically of the form equipmenttype_WMOD.mdb). We
refer to this database as the WMOD (working model) database. To that WMOD database
we “attach” tables from an external database (typically named
equipmenttype_MES.mdb) that contains data transferred from or linked to the CMMS
and one or more CBM and/or process databases.
If the table names in the equipmenttype_MES.mdb database have the extension “_MA”
(see the table structures of Figure 10-35 below), that will tell EXAKT to perform a
“marginal analysis”. Using marginal analysis we build several CBM decision models,
each corresponding to a specific component or to a specific failure mode. Figure 10-35
147
By selecting and processing the data in different ways.
148
Nowlan and Heap used the phrase “age-reliability relationship” to categorize the probabilistic failure
behavior of an item with respect to its working age. With proportional hazard modeling (PHM) Cox
introduced extra information, condition monitoring (CM) data, that bears on failure behavior. Hence we
have appended a third expression “CMdata” to the phrase. In CBM we can now conveniently refer to the
“age-reliability-CMdata relationship”.
149
In this example, for simplicity and clarity, we will ignore faults associated with bearings or other types
of gear or shaft faults. Nevertheless, EXAKT imposes no limit on the number of failure modes or
components to be included in a complex item.
150
The word “marginal” refers to an analysis on one component, assuming that there is no cross-failure
causality among components or failure modes in the complex item. In the future, EXAKT will deal with the
more general case where one failure mode can provoke or influence another.
illustrates the structure of a database for multiple failure modes occurring in a single
equipment item.
The tables of Figure 10-35 are identical (except for the suffix “_MA” in their table
names) to those of the analysis of simple items (of examples 1 and 2).
Three new tables (Figure 10-36), however, have been added to the MES marginal
analysis database structure. Each component or failure mode in a complex item will
behave according to its individual risk model. When complex items are to be analyzed
(and their failure modes to be modeled) we need a way to tell each model which data
in the database applies to it. For example, one component’s failure may occur at a
particular time, but another component will still be in good working condition. Hence we
need a structured way to indicate the event that each component (or failure mode) has
undergone. The supplementary tables of Figure 10-36 fulfill that role. The table
“IdentToModel” relates a decision model to specific equipment units of a fleet of similar
equipment. It tells the decision agent to which specific equipment units each model
should be applied. For example, if certain engines of the fleet do not have turbo chargers,
then a model predicting the failure of the bearing in the turbo charger should not be
applied to the non-turbocharged engines in the fleet.
Similarly, the “EventToModel” table tells the model which events in the common
database apply to the failure mode that it is predicting. The “VarToModel” table maps
monitored variables to a specific model.
Figure 10-36 New tables in MES required for mapping to an individual failure mode model
The phrases “Input…” and “Output…” appear in several field names of the tables
EventToModel and VarToModel. These fields map their values in the general database to
their values in a specific model. For example, “failure of suction valve 3” in the database
would be mapped to the event “EF” in the model that was built to predict the failure
behavior of “Suction Valve 3”. Hence in a single equipment item we may have, for example,
two failure events, EF1 and EF2, and two suspension events, ES1 and ES2. We need to
tell a particular model (of a particular failure mode or component), which event records
(for example, those with the values B1 or B2, EF1 or EF2, and ES1 or ES2 in the
database) to use as the beginning, failure, and suspension events respectively for the
failure mode currently being modeled or predicted. In this exercise, we need to tell the
model for Gear1 to use the events B1, EF1, and ES1 as the beginning, failure, and
suspension events (B, EF, and ES). We perform this mapping in a data mapping dialog
such as that shown in step 6 of the tutorial exercise in the Appendix. EXAKT stores the
results of the mappings in the EventToModel, IdentToModel, and VarToModel tables.
Although this mapping of data is difficult to understand in the abstract, don’t despair. It
will become crystal clear as we work through this exercise.
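The idea of the mapping can be illustrated with a small sketch. The rows below imitate an EventToModel-style table; the tuple layout, the function name, and the sample event records are assumptions made for illustration and do not reproduce the actual EXAKT table structure. Event codes that are not mapped for a given model are simply skipped here; how the other component's events should be treated is a modeling decision that this sketch leaves out.

```python
# Hypothetical mapping rows: (model name, event code in database, event code in model)
EVENT_TO_MODEL = [
    ("Gear1", "B1", "B"), ("Gear1", "EF1", "EF"), ("Gear1", "ES1", "ES"),
    ("Gear2", "B2", "B"), ("Gear2", "EF2", "EF"), ("Gear2", "ES2", "ES"),
]


def events_for_model(model: str, events: list[tuple[float, str]]) -> list[tuple[float, str]]:
    """Translate database event codes into the codes used by one model."""
    mapping = {db: out for m, db, out in EVENT_TO_MODEL if m == model}
    return [(age, mapping[code]) for age, code in events if code in mapping]


# Hypothetical event records: (working age, event code as stored in the database)
events = [(0, "B1"), (0, "B2"), (4200, "EF2"), (5100, "ES1")]

print(events_for_model("Gear1", events))
print(events_for_model("Gear2", events))
```

Each model thus sees only the generic events B, EF, and ES, even though the common database distinguishes the components.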
Let us look then at the Events_MA table (Figure 10-37) for the equipment item
GearboxA to be analyzed.
Figure 10-37 Events_MA table for a gearbox with two failure modes.
Note that, in Figure 10-37, there is no B1 or B2 to distinguish the beginnings of the
lifecycles of the individual components (Gear1 and Gear2), but only a single “B” event.
Why? Because, in this particular equipment, the maintenance department adheres to a
policy that when one gear fails, both are replaced. Therefore we have chosen to use the
event “B” to mark the life beginnings of both components151. We have chosen to use
“EF1” to designate the failure of Gear1 and “EF2” to represent the failure of Gear2.
Now let us examine the Inspections_MA table of Figure 10-38.
Once the decision models have been built and deployed a typical optimized CBM
recommendation report covering both failure modes at a point in time might resemble
that of Figure 10-39.
151
Therefore in Step 6 “B” in the database will be mapped to “B” in both models.
Figure 10-39 EXAKT output for two failure modes, GearOne and GearTwo
The report of Figure 10-39 tells a maintenance planner that Gear1 needs to be replaced,
but Gear2 is still in good condition. There is another type of information provided by
these decision models that would cause a manager or planner to reconsider the policy of
replacing both gears when either one fails. That information is given by the shape of the
boundary separating the green and red regions. It is a straight horizontal line. That tells us
that, for these gears, the probability of failure at any time is independent of age. Hence
there is likely little or no benefit in replacing Gear2 with the objective of prolonging its
life relative to the failure mode “tooth fracture”152.
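The meaning of an age-independent risk can be illustrated with the hazard function of a proportional hazard model. The sketch below assumes the Weibull form of the PHM (a Weibull baseline multiplied by an exponential function of the covariates); the function name and all parameter values are hypothetical. When the shape parameter equals 1, working age drops out of the hazard and only the covariates matter.

```python
from math import exp


def phm_hazard(t: float, beta: float, eta: float,
               gammas: list[float], covariates: list[float]) -> float:
    """Weibull PHM hazard: (beta/eta) * (t/eta)**(beta-1) * exp(sum(gamma_i * z_i))."""
    baseline = (beta / eta) * (t / eta) ** (beta - 1)
    return baseline * exp(sum(g * z for g, z in zip(gammas, covariates)))


# Hypothetical parameters: one covariate with coefficient 0.02 and reading 30.
gammas, z = [0.02], [30.0]

# Shape parameter 1: the hazard is the same at 100 and at 9000 working age units.
flat_early = phm_hazard(100.0, 1.0, 5000.0, gammas, z)
flat_late = phm_hazard(9000.0, 1.0, 5000.0, gammas, z)

# Shape parameter 2: the hazard grows with working age (wear-out behavior).
rising_early = phm_hazard(100.0, 2.0, 5000.0, gammas, z)
rising_late = phm_hazard(9000.0, 2.0, 5000.0, gammas, z)

print(flat_early, flat_late)
print(rising_early, rising_late)
```

In the age-independent case a newly installed gear is no less likely to fail than an old one showing the same condition readings, which is why replacing a healthy gear buys little for this failure mode.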
You may perform the steps in the Appendix (page ) that create and deploy the
model that we have just described. Follow these steps using the EXAKT modeling and
EXAKT decision programs.
152
One might choose to replace the gear if wearout, for example, indicated by excessive backlash, is a
significant failure mode in this system. The monitored health indicators, H1 and H2 in the model, however,
are targeting the weakening structure of a gear tooth (see Gear Tooth Failure). A separate model, perhaps
based on backlash inspection or some other vibration feature, should be built for the failure mode “gear
tooth wear”.
References
Cox, D.R. (1972) “Regression models and life tables (with discussion)”, J. Roy. Stat.
Soc. B, Vol. 34, pp. 187-220.
Jardine, A.K.S., Banjevic, D. and Makis, V. (1997) “Optimal replacement policy and the
structure of software for condition-based maintenance”, Journal of Quality in
Maintenance Engineering, Vol. 3, No. 2, pp. 109-119.
Campbell, J.D. and Jardine, A.K.S. (Editors) (2001) Maintenance Excellence: Optimizing
Equipment Life-Cycle Decisions, Marcel Dekker (Chapter 12: Optimizing Condition
Based Maintenance, by M. Wiseman).
Chapter 11. CBM Decision Making with Expert Systems
Depending on the physics governing a given application, we learned, in Chapter 7. (page
95), that we may choose from a variety of algorithms with which to carry out the signal
processing portion of CBM. Decision making (the third CBM sub-process) proceeds
similarly, using one or more of a diverse array of decision support tools. In Chapter 10.
Example 1 Creating and deploying a decision model (page 127) we developed a CBM
decision policy using statistical modeling techniques and software. A decision policy
assists maintenance personnel to interpret and act upon a set of condition monitoring
(CM) data. Extensive human knowledge and experience may be available with which to
build a CBM decision policy. A rule-based expert system encapsulates known
relationships between CM data and the deterioration in an asset that takes place due to
one or more failure modes. An algorithm (known as an inference engine) applies the
knowledge base to the current set of CM data. In this chapter we describe an expert
system developed by DLI Engineering153 called ExpertALERT™.
Figure 11-1 CBM signal processing and Decision making using an Expert System
153
www.dliengineering.com, Automated Bearing Wear Detection, Alan Friedman, Published in Vibration
Institute Proceedings 2004
Figure 11-1 outlines the signal processing and decision making portions of this CBM
approach. It traces the flow of information through the signal processing steps (steps 1-5)
and the decision making procedure (step 6) that uses a rule-based expert system.
Figure 11-2 An example of test point locations showing the three axes - Axial, Radial, and Tangential
The six steps of Figure 11-1 are described in each of the following sections.
o An increase of 20 VdB = a tenfold increase in vibration amplitude.
As an example, let us assume an equipment item has two test points. Then the screening
matrix will have (10 orders + 2 peaks x 2 ranges) x 3 orientations x 2 test points + 1 noise
floor = 85 columns. One row of the screening matrix will hold the changes in amplitude
from the previous inspection. A second row will hold the deviations from the baseline
spectrum. A third row will hold the corresponding vibration amplitudes. Hence, in this
instance, 85 × 3 rows = 255 extracted features will have been placed into the screening
matrix, ready for further processing.
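The decibel relationship and the column count above can be checked with a few lines of arithmetic. The function and parameter names are ours; the counts (10 orders, 2 peaks in each of 2 ranges, 3 orientations, 1 noise floor value) are those given in the text.

```python
def amplitude_ratio(delta_vdb: float) -> float:
    """Amplitude ratio corresponding to a change in level of delta_vdb decibels."""
    return 10 ** (delta_vdb / 20)


def screening_columns(test_points: int, orders: int = 10, peaks: int = 2,
                      ranges: int = 2, orientations: int = 3,
                      noise_floor: int = 1) -> int:
    """Number of columns in the screening matrix for one equipment item."""
    return (orders + peaks * ranges) * orientations * test_points + noise_floor


columns = screening_columns(test_points=2)
features = columns * 3  # three rows: change, deviation from baseline, amplitude

print(f"20 VdB corresponds to an amplitude ratio of {amplitude_ratio(20):.0f}")
print(f"{columns} columns, {features} extracted features")
```

Because the scale is logarithmic, each further 20 VdB multiplies the amplitude by another factor of ten.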
The noise floor calculation measures any general increase in random noise. Both impacts
and random noise in a time waveform cause the spectrum to become elevated. As
bearings wear, they typically produce larger quantities of non-periodic vibration and
impacts. This raises the noise floor of the spectrum. The automated diagnostic system
uses an algorithm to calculate the level of the noise floor. This value is then compared to
a baseline value. Increases in noise floor level add to the severity (see step 6) of the
bearing wear diagnosis and may even trigger a diagnosis in certain cases when bearing
tones are not evident.
157
An increase in the noise floor level is an indication of impacting and non-periodic (or random)
vibration. Both of these are associated with later stage bearing wear.
158
One may say in a general sense that the more harmonics and sidebands present, the worse the condition
of the bearing. Thus, not only does one wish to know if a peak is part of a larger family of peaks, one also
wants to get an idea of how much energy is contained in the series. Cepstrum analysis is used for
automating this task. The Cepstrum is a power spectrum of a power spectrum of a waveform; therefore, any
periodicities in the spectrum (such as harmonic series or sideband families) will clearly appear as a peak in
the Cepstrum.
Figure 11-4 Spectrum showing the synchronous and non-synchronous harmonics and their 1x spaced
sidebands. The abscissa is scaled in “orders” or multiples of the shaft speed.
Figure 11-3 Cepstrum showing peaks with 1x
and 3.61x spacings
The physics of each situation dictate the signal processing method selected. Non-
synchronous peaks, such as those at 3.61 and 7.22 orders (Figure 11-4), are candidates for
“bearing tones” that signal bearing faults. If, in addition, the non-synchronous peaks
display sidebands spaced at orders of the shaft speed, an inner race defect is likely. Figure
11-5 illustrates the physical explanation for bearing tones and the appearance of
sidebands, with respect to an inner race spall or crack.
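The screening logic described above can be sketched as follows. This is not the ExpertALERT algorithm; the peak list, the tolerance, and the requirement that a sideband be present on both sides of a peak are assumptions made for illustration. Peaks are expressed in orders of the shaft speed, using the 3.61 and 7.22 order values mentioned in the text.

```python
def is_synchronous(order: float, tol: float = 0.02) -> bool:
    """True when the peak lies at (nearly) a whole multiple of shaft speed."""
    return abs(order - round(order)) <= tol


def has_1x_sidebands(center: float, peaks: list[float], tol: float = 0.02) -> bool:
    """True when peaks exist one order below and one order above the center peak."""
    below = any(abs(p - (center - 1)) <= tol for p in peaks)
    above = any(abs(p - (center + 1)) <= tol for p in peaks)
    return below and above


# Hypothetical peak list (in orders) extracted from a spectrum.
peaks = [1.0, 2.0, 2.61, 3.61, 4.61, 6.22, 7.22, 8.22]

non_synchronous = [p for p in peaks if not is_synchronous(p)]
inner_race_candidates = [p for p in non_synchronous if has_1x_sidebands(p, peaks)]

print("Non-synchronous peaks:", non_synchronous)
print("Candidate bearing tones with 1x sidebands:", inner_race_candidates)
```

With this peak list the sidebands themselves are non-synchronous, so the sideband test is what separates the bearing tones from the peaks that merely surround them.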
Figure 11-5 Physical explanation of non-synchronous peaks and their 1x sidebands related to an
inner race spall.
Step 4 Demodulation
Demodulation (also called “envelope detection”) is a signal processing technique used by
ExpertALERT to supplement and verify the information drawn from the cepstrum and
spectrum analyses. Demodulation provides an independent confirmation of bearing
defects.
If there is a spall on a bearing race, each time a ball passes it will impact and “ring” the
bearing causing it to resonate at high frequencies. The resulting vibrations can be
demodulated in order to extract the forcing frequency that is causing the ringing. The
forcing frequencies will appear as peaks in the demodulated spectrum. If they match the
bearing tones from the screening matrix and the cepstrum, they provide further
confirmation of a bearing defect. A distinct advantage of demodulation is that high
frequencies do not travel far in a machine. Thus the demodulation process can localize
the defective bearing. For example, if you see bearing tones in the narrow band spectral
data from two different locations on the machine at the same frequency, and the demod
data has matching peaks at one location (but not the other), you can assume that the
location showing the peak in both spectra is the one with the bearing problem. The spectra of Figure 11-6, Figure
11-7, Figure 11-8, and Figure 11-9 illustrate this point precisely.159
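The localization argument can be expressed as a simple comparison. The sketch below is our own illustration, not ExpertALERT code; the function name, the dictionary layout, and the peak values are hypothetical. A location is retained only when the bearing tone appears both in its narrow band spectrum and in its demodulated spectrum.

```python
def locate_defect(spectrum_peaks: dict[str, list[float]],
                  demod_peaks: dict[str, list[float]],
                  tone: float, tol: float = 0.02) -> list[str]:
    """Return locations where the tone appears in both spectrum and demod data."""
    def contains_tone(peaks: list[float]) -> bool:
        return any(abs(p - tone) <= tol for p in peaks)

    return [
        location
        for location, peaks in spectrum_peaks.items()
        if contains_tone(peaks) and contains_tone(demod_peaks.get(location, []))
    ]


# Hypothetical peak lists (in orders) at two measurement locations.
spectrum = {"motor": [1.0, 3.61], "pump": [1.0, 3.61]}
demodulated = {"motor": [3.61], "pump": []}

print(locate_defect(spectrum, demodulated, tone=3.61))
```

In this invented case the tone is visible at both locations in the ordinary spectra, but only the motor location confirms it in the demodulated data.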
Figure 11-6 Spectrum from motor location showing bearing tone peak
Figure 11-7 Demodulated spectrum from motor location showing matching peak
159
Alan Friedman, DLI Engineering, Demodulation - June 1999 issue of P/PM
Figure 11-8 Spectrum from pump location showing same bearing tone
Figure 11-9 Demodulated spectrum from pump location, but showing no bearing tones. Hence
ExpertALERT can conclude that the bearing defect is on the motor.
interpreting the extracted features and identifying the likely fault. In Step 6 each CSDM
is processed through a series of diagnostic templates consisting of rules that pass or fail
every fault known to occur in the component. Furthermore, the expert system computes a
score based on the feature’s exceedance above the threshold value coded in each rule.160
The knowledge in the diagnostic templates was developed from an understanding of the
physics of the machinery and its causal relationship with the monitored data.
A simple example is the rule for imbalance. This rule checks the matrix elements (of the
CSDM) that contain the rotational rate levels and exceedances over baseline. The rule
then determines whether these values are high in a radial direction. If so, other checks
determine that the problem is not misalignment or looseness. Finally, the algorithm
confirms the imbalance diagnosis.
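The flow of the imbalance rule can be sketched as follows. This is an illustrative approximation only: the `OneXLevels` structure, the threshold values, and the specific misalignment and looseness checks are assumptions made for the example, not the contents of an actual diagnostic template.

```python
from dataclasses import dataclass


@dataclass
class OneXLevels:
    """Exceedance over baseline of the 1x (rotational rate) peak, per axis.

    Units and field names are hypothetical.
    """
    axial: float
    radial: float
    tangential: float


def imbalance_rule(levels, harmonic_exceedance=0.0,
                   high=6.0, harmonic_limit=3.0):
    """Return True when the 1x pattern is consistent with imbalance.

    high and harmonic_limit are placeholder thresholds, not tuned values.
    """
    # Step 1: is the rotational rate level high in a radial direction?
    radial_high = levels.radial >= high
    # Step 2 (assumed check): misalignment would show axial at least as high as radial.
    not_misalignment = levels.axial < levels.radial
    # Step 3 (assumed check): looseness would raise the higher shaft rate harmonics.
    not_looseness = harmonic_exceedance < harmonic_limit
    # Step 4: only then is the imbalance diagnosis confirmed.
    return radial_high and not_misalignment and not_looseness


motor_end = OneXLevels(axial=4.0, radial=9.0, tangential=2.0)
imbalance_confirmed = imbalance_rule(motor_end)
```

A production rule would read these values from the CSDM and would also score the exceedance, as described above.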
Looking at the axial and radial data at both locations we might surmise angular
misalignment since 1x axial is abnormally high at both motor and pump. Alternatively, it
could be motor imbalance or pump imbalance, since 1x radial is abnormally high at either
end and radial is higher than axial. Axial motion is, in fact, characteristic (due to rocking)
of unbalance in a vertical pump. Another characteristic of a vertical pump is that one
direction, the direction of external structural support, is always stiffer than the other
directions. The radial axis in this case is the direction of structural flexibility, so that
radially, the pump is being “wagged” by the motor imbalance. The low 1x levels at the
160
Rule thresholds form a matrix that includes both absolute amplitudes and exceedances over the (mean +
1 sigma) baseline.
pump in the tangential direction can be explained by the fact that the tangential axis is the
direction of high structural stiffness and therefore the tangential component of the
vibration due to motor imbalance does not transmit to the pump.
Rules are activated by machinery component type (for example, in the preceding,
“vertical motor pump set with coupling”) as defined by the user in the ExpertALERT
software. A rule for bearing wear in a compressor will look slightly different from the
rule for bearing wear in an AC motor. Each individual machine component type may
have numerous rules for bearing wear. If the extracted features satisfy the
requirements for a rule, it means the fault condition exists.
After information has been extracted from the spectra as described above in steps 1 to 5,
it is passed through all of the rule templates that apply to the general machine type to see
if any faults exist. The rules are empirically based on thousands of machine tests
collected over more than 20 years and are constantly refined as new information becomes
available. If a rule is edited for any reason, the change is run through all past diagnoses to
ensure that it does not change any previously correct results.
1. If the sum of the exceedance over baseline of all perceived bearing tones in all
three axes and all test points (Cepstrum confirmed) is higher than a threshold, or
the sum of the noise floor readings from all spectra has increased over the
baseline or alarm by a certain amount, then the rule passes.
2. If the sum of the amplitudes of all of the perceived bearing tones exceeds some
threshold then the rule passes.
3. If none of the perceived bearing tones are above a minimum threshold, the rule
does not pass.
4. If the sum of the shaft rate harmonics from 16x to 100x is above some value,
add to the severity.
5. If the noise floor is above some level add to the severity, and if it’s above a higher
level, add more to the severity.
6. If the sum of the other undefined peaks that were not confirmed by Cepstrum is
above some threshold, add more to the severity.
7. If sub harmonics of the shaft rate have exceeded the baseline by a certain amount,
add to the severity.
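The seven steps above might be coded along the following lines. This is a sketch under stated assumptions: the `BearingFeatures` fields, every value in `THRESHOLDS`, and the one-point severity increments are hypothetical placeholders, since the text gives no numeric thresholds or scoring weights.

```python
from dataclasses import dataclass


@dataclass
class BearingFeatures:
    """Features extracted for one machine test (hypothetical names and units)."""
    tone_amplitudes: list       # Cepstrum-confirmed bearing tones, all axes and points
    tone_exceedances: list      # the same tones, as exceedance over baseline
    noise_floor: float          # summed noise floor reading
    noise_floor_rise: float     # rise of the noise floor over baseline
    high_harmonics_sum: float   # sum of shaft rate harmonics 16x to 100x
    unconfirmed_peaks_sum: float
    subharmonic_exceedance: float


# Placeholder thresholds; real values are tuned empirically per machine type.
THRESHOLDS = {
    "min_tone": 70.0, "exceedance_sum": 20.0, "noise_rise": 6.0,
    "amplitude_sum": 250.0, "high_harmonics": 300.0, "noise_floor": 60.0,
    "noise_floor_high": 70.0, "unconfirmed_sum": 150.0, "subharmonic": 6.0,
}


def bearing_wear_rule(f, t=THRESHOLDS):
    """Return (passed, severity) for the example bearing wear rule."""
    # Step 3: no tone above the minimum threshold means the rule does not pass.
    if not any(a >= t["min_tone"] for a in f.tone_amplitudes):
        return False, 0
    # Steps 1 and 2: any one of these conditions passes the rule.
    passed = (sum(f.tone_exceedances) > t["exceedance_sum"]
              or f.noise_floor_rise > t["noise_rise"]
              or sum(f.tone_amplitudes) > t["amplitude_sum"])
    if not passed:
        return False, 0
    severity = 1
    # Steps 4 to 7: each satisfied condition adds to the severity.
    if f.high_harmonics_sum > t["high_harmonics"]:
        severity += 1
    if f.noise_floor > t["noise_floor"]:
        severity += 1
    if f.noise_floor > t["noise_floor_high"]:
        severity += 1
    if f.unconfirmed_peaks_sum > t["unconfirmed_sum"]:
        severity += 1
    if f.subharmonic_exceedance > t["subharmonic"]:
        severity += 1
    return True, severity


worn = BearingFeatures([85.0, 90.0, 80.0], [8.0, 10.0, 6.0], 72.0, 4.0, 320.0, 100.0, 2.0)
healthy = BearingFeatures([50.0, 55.0], [1.0, 0.0], 40.0, 0.0, 100.0, 20.0, 0.0)
```

With these placeholder numbers, `bearing_wear_rule(worn)` passes with a severity of 4 and `bearing_wear_rule(healthy)` does not pass.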
Note that these rules are empirically based. That is, the rule thresholds for absolute
levels or for exceedances over a baseline have been tuned until the rule yields the
correct answer, as determined by a human expert and/or direct field feedback, for any
machine to which the rule applies.
There are sufficient rule templates for each machine type to catch practically all possible
bearing wear patterns that may exist in the data.
Once a fault has been diagnosed, the user will continue to monitor the machine and look
for changes in severity of the fault. The rate at which the severity increases gives a good
indication of when the bearings should be overhauled.
The amounts by which the values in the CSDM exceed the threshold values (set up in the
rules based on experience and knowledge) are scored and converted into a relative
severity. This provides a normalized scale with which to judge the state of health of each
component. Thus the relative severity for all components in the equipment can be trended
on a single graph, as in Figure 11-11. The graph provides a decision support tool for
performing a corrective action on a component whose severity is high or has increased
substantially. In the following section, we propose to extend the automated diagnosis
one step further to estimate remaining life and provide an optimized repair decision.
Figure 11-11 Severity graphs for an equipment item with three components
The severity values computed for each fault, as well as the absolute and relative values of
the relevant features, may be used as covariates in a proportional hazard model such as
that described in Chapter 10. The next section describes the ABB fault simulator that
may be used to demonstrate this proposed extension to ExpertALERT’s output report.
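As a rough illustration of how severity values could enter a proportional hazard model as covariates, the sketch below evaluates a hazard with a Weibull baseline. The Weibull form and every parameter value (`beta`, `eta`, `gammas`) are assumptions chosen for the example, not fitted results.

```python
import math


def phm_hazard(age, covariates, beta, eta, gammas):
    """Hazard h(t, Z) = (beta/eta) * (t/eta)**(beta - 1) * exp(sum(gamma_i * z_i)).

    age        working age t (same units as eta)
    covariates covariate values z_i, for example fault severity scores
    beta, eta  Weibull shape and scale parameters (hypothetical values in the demo)
    gammas     covariate coefficients gamma_i (hypothetical values in the demo)
    """
    baseline = (beta / eta) * (age / eta) ** (beta - 1)
    return baseline * math.exp(sum(g * z for g, z in zip(gammas, covariates)))


# Same working age, with and without a bearing wear severity of 2.
h_no_fault = phm_hazard(100.0, [0.0, 0.0], beta=2.0, eta=100.0, gammas=[0.5, 0.1])
h_with_fault = phm_hazard(100.0, [2.0, 0.0], beta=2.0, eta=100.0, gammas=[0.5, 0.1])
risk_ratio = h_with_fault / h_no_fault
```

With these made-up coefficients, a severity of 2 on the first covariate multiplies the hazard by exp(1.0), which shows how a rising severity trend translates into rising failure risk at a given age.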
Figure 11-12 The fault simulator (top left) gradually induces one or more failure modes (for example,
misalignment or unbalance). The failure mode (unbalance) causes the failure mechanism (right) to
proceed towards failure. The failure is the loss of function to hold the Tee in place by spring friction
forces under the stress of vibration forces transmitted through the structure.
In the fault simulator, a spring and friction failure mechanism has been set up with the
following characteristics desirable for the study of a failure modeling and prediction
methodology.
1. A functional failure is clearly defined (by the release of the tee causing the golf
ball to trigger a switch).
2. The (random variable) time to failure can depend both on working age and CM
data.
3. A life cycle can be as small as 1 minute, permitting a large sample of life cycles
from which to build and subsequently test the predictive model.
Figure 11-13 Running recommendations from the EXAKT agent
Figure 11-13 displays the running prognostic results that are updated at each inspection.
The “Optimal Maintenance Decision” may be one of:
1. Continue operation, or
2. Plan to replace in a specified number of time units, or
3. Replace immediately
The “Estimated Time to Failure” is the time to replacement estimate (TRE). TRE is an
estimate of the time at which a replacement or overhaul will occur either by PM (as a
result of the CBM optimal decision policy recommendation) or by failure. The TRE is
not to be confused with the residual life estimate (RLE), which estimates the time to failure
only (replacement by PM is not considered). Both TRE and RLE are interesting figures
for maintenance personnel. TRE, however, may be the more interesting to people who are
concerned with maintenance management, e.g., production planning, manpower
scheduling, spare parts management. RLE, on the other hand, may be more interesting to
people involved in equipment design, procurement, and specification of reliability or risk
of the unit.
cycle. A histogram (Figure 11-15) is another way to indicate the predictive performance
of the model.
Figure 11-15 Histogram showing the errors in replacement time estimate over 678 inspections. For
example, the TREs calculated at 412 inspections were within 5% of the actual (functional or potential)
failure time.
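A histogram of this kind can be tabulated with a few lines of code. The sketch below is illustrative only: the function names, the 5% bin width, and the sample numbers are assumptions, not the data behind Figure 11-15.

```python
def relative_errors(estimates, actuals):
    """Absolute error of each TRE as a fraction of the actual replacement time."""
    return [abs(e - a) / a for e, a in zip(estimates, actuals)]


def error_histogram(errors, bin_width=0.05, n_bins=5):
    """Count errors per bin; the last bin collects everything beyond the range."""
    counts = [0] * n_bins
    for err in errors:
        counts[min(int(err / bin_width), n_bins - 1)] += 1
    return counts


# Made-up TREs against a true replacement time of 100 time units.
tre = [98.0, 112.0, 131.0, 60.0]
actual = [100.0, 100.0, 100.0, 100.0]
histogram = error_histogram(relative_errors(tre, actual))
share_within_5_percent = histogram[0] / len(tre)
```

The first bin gives the number of inspections whose TRE fell within 5% of the actual time, the figure quoted in the caption above.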
The hazard function curves (in Figure 11-16) for potential failures and functional failures
provide an overall performance check on the effectiveness of the CBM program.
If the difference between the TF (total failures) and the FF (functional failures) hazard curves
is small, the CBM program is effective. That is, functional failures
(those that have important consequences) are being preempted by the CBM detection and
correction of potential failures (which have no or relatively minor consequences).
Chapter 12. Case based reasoning
It is not enough to improve just incrementally from your
past performance or that of other company divisions. To
compete globally, you must look everywhere to learn new
methods. Make yourself a student of the best of the best,
particularly in unrelated business sectors.
– John D. Campbell161
Introduction
A thin line separates diagnostics from prognostics. Condition based maintenance
(described in Chapter 6, page 73) detects potential failures, which, in themselves,
provoke relatively minor consequences. When maintenance personnel detect and repair
potential failures, they avoid the dire consequences of a functional failure. In a similar
vein, diagnostics begin with the detection of a “fault” indicator, which, in and of itself,
often has few or no consequences, but which portends a more serious functional failure.
Hence the diagnostic process often meets the RCM criterion for “on-condition
maintenance” as stipulated by Nowlan and Heap (see page 83). One or more of a variety
of fault indicators can initiate the troubleshooting process. Some warn of failure of back-
up functions. Others indicate the failing performance of some function in a subsystem. In
all cases, we require a quick and efficient process, based on the application of knowledge
and experience, that will trace the fault indicator to its root cause (that is, its failure
mode), whereupon we will remediate the cause through a repair or replacement action.
161
Uptime, Strategies for Excellence in Maintenance Management, Productivity Press, 1995
162
The quality of that guidance impacts the cost and time of diagnosis.
Intelligent agents assist maintenance troubleshooters through case based reasoning (CBR).
Efficient Troubleshooting
Intelligent troubleshooting poses the right questions in the best order. A well designed
case based reasoning system guides the technician or engineer along the most practical
and least costly path to a solution. It poses questions and suggests solutions by
considering relevant data and by evaluating:
⇒ Similarity of past cases to the current symptoms
⇒ Frequency of occurrence of similar cases
⇒ Cost and time to get an answer
⇒ Cost and time of repairs
⇒ Information gain - the ability of a question to eliminate inappropriate solutions
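Two of the criteria listed above, similarity to past cases and information gain, can be sketched in a few lines. This is a simplified illustration, not the algorithm of any particular CBR product: the tiny lawnmower case base, the attribute names, and the equal weighting of cases are all assumptions, and the frequency, cost, and time criteria are left out.

```python
import math
from collections import Counter

# Hypothetical miniature case base: case title -> attribute values.
CASES = {
    "clogged air filter": {"starts": "hard", "surges": "yes", "smoke": "black"},
    "stale fuel": {"starts": "hard", "surges": "yes", "smoke": "none"},
    "fouled spark plug": {"starts": "no", "surges": "no", "smoke": "none"},
    "dull blade": {"starts": "easy", "surges": "no", "smoke": "none"},
}


def similarity(case, observations):
    """Fraction of the current observations that the case matches."""
    if not observations:
        return 0.0
    matches = sum(1 for attr, value in observations.items() if case.get(attr) == value)
    return matches / len(observations)


def rank_cases(cases, observations):
    """Case titles ordered from most to least similar."""
    return sorted(cases, key=lambda name: similarity(cases[name], observations),
                  reverse=True)


def information_gain(cases, attribute):
    """Expected reduction in uncertainty (bits) from asking about one attribute.

    Assumes every remaining case is equally likely.
    """
    total = len(cases)
    groups = Counter(case.get(attribute) for case in cases.values())
    remaining = sum((size / total) * math.log2(size) for size in groups.values())
    return math.log2(total) - remaining


def best_question(cases, asked=()):
    """Attribute not yet asked about that eliminates the most candidate cases."""
    candidates = {attr for case in cases.values() for attr in case} - set(asked)
    return max(sorted(candidates), key=lambda attr: information_gain(cases, attr))


next_question = best_question(CASES)
top_case = rank_cases(CASES, {"starts": "hard", "smoke": "black"})[0]
```

Here the question about starting behaviour is asked first because it splits the four cases most evenly. A fuller sketch would also weigh each question by the cost and time needed to answer it.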
Figure 12-2: A typical CBR CaseBank SpotLight™ session
Figure 12-2 shows a typical CBR session. The troubleshooting conversational “assistant”
does not demand that the technician answer every question. The user may elect to answer
or ignore any question, and may provide answers in the most convenient order. The tool
suggests but does not enforce a specific question order. At each step, as the diagnostic
effort unfolds, the CBR program re-sequences questions and re-prioritizes solutions by
re-evaluating all information known up to that point.
The CBR tool elicits notes and additional observations during a session where such
observations are lacking in the case base. Subject matter experts163 monitor each
completed session, harvesting the data, where appropriate, for case-base development.
Figure 12-3 illustrates this continuing process.
Terminology
Subject: An item of interest.
Domain / subject breakdown: A tree structure of parents and children that describe the
knowledge area to be captured. A subject may have multiple parents.
Attribute: A characteristic that is measurable, testable, or observable. It is attached to
one or more parent subjects in the domain.
Attribute structure: Name, Description, Question, Values, References.
163
This takes place off site as a web application service or is performed by on-site subject matter experts
(maintenance engineer, planner, or technician) trained in the use of the software.
164
CaseBank Technologies, www.casebank.com
Attribute types: Logical (T/F, Y/N), Symbolic list (Corroded, cracked, loose), Ordered
list (none, low, med, hi), Integer, Real, Multi-valued (several selections may be valid at
once, e.g.: one or more fault codes shown on a display unit).
Attribute categories: Symptomatic (e.g. vibration level), Root causes (e.g. Piston –
Status – seized, free, sticking), Configuration (e.g. Power rating HP – 130, 150)
Observation: Assignment of a value to an attribute to describe the current scenario (e.g.
Master Caution Light – illuminated).
Case (aka Solution): Concise information representing a type of problem. Most often
representing a failure, but could be an operating error or a normal condition that is often
misinterpreted as a problem.
Casebase (aka Knowledgebase): A repository of cases upon which the reasoning engine
operates.
Session: The data created in the process of matching the characteristics of the current
problem to the cases in the casebase. A session is an “instance” of a case.
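The terms defined above map naturally onto simple data structures. The sketch below is one possible representation, assuming Python dataclasses; the class layout and field names are illustrative and do not describe the schema of any actual CBR product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    """A measurable, testable, or observable characteristic of a subject."""
    name: str
    question: str
    values: tuple                  # allowed values, e.g. an ordered or symbolic list
    category: str = "symptomatic"  # or "root cause", "configuration"


@dataclass(frozen=True)
class Observation:
    """Assignment of a value to an attribute for the current scenario."""
    attribute: Attribute
    value: str

    def __post_init__(self):
        if self.value not in self.attribute.values:
            raise ValueError(f"{self.value!r} is not valid for {self.attribute.name}")


@dataclass
class Case:
    """Concise information representing one type of problem."""
    title: str
    root_cause: str
    observations: list = field(default_factory=list)


@dataclass
class Session:
    """Data created while matching the current problem to the case base."""
    observations: list = field(default_factory=list)

    def matching_observations(self, case):
        """Number of session observations that also appear in the case."""
        return sum(1 for obs in self.observations if obs in case.observations)


caution_light = Attribute(
    name="Master Caution Light",
    question="Is the Master Caution Light illuminated?",
    values=("illuminated", "extinguished"),
)
observed = Observation(caution_light, "illuminated")
example_case = Case("Example problem due to example cause", "example cause", [observed])
session = Session([observed])
```

The validation in `Observation` reflects the idea that an attribute carries its own list of permissible values, so a session cannot record an answer the case base does not understand.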
Subjects (displayed in the domain in upper case characters) can be physical components or
they can be categories used to index physical components (e.g. COMPLAINTS CONCERNING
SNOWBLOWER OPERATION).
Building a case
We build a case by populating it with the following information:
1. Title: In the form [Problem Description] due to [Root Cause]165. For example:
“Lawnmower performance is unsatisfactory due to a restricted (clogged) air filter”
2. Source: The source of the case, either field experience or a document such as a
manual.
3. Description: A detailed description of the problem. For example:
“Lawnmower runs erratically and the performance is unsatisfactory, starts with difficulty,
surges, loss of power, overheating, runs poorly at top no-load speed”
4. Observations: A structured description of the case’s attributes and their values.
5. Cause166: A case can have only one root cause167.
6. Explanation168: The explanation may include: 1) how the fault caused the
symptoms, 2) the physical working of the affected component to explain the
failure, 3) the chain of events that led to the identification of the root cause.
7. Repair: The Repair details generally include what was done to correct the
problem, as well as any repair references, e.g. parts and supplies needed, the
sequence of procedures (preparation, execution, testing), special tools needed,
safety information, effort required (for example, person hours), and cost (direct
labour, overhead, parts, etc).
8. Reference: References for a case may include: diagrams, video/audio clips, and
documents that illustrate observations, repair instructions, or explain the case.
9. Lessons: Lessons for the case may include: tips for avoiding mistakes during
troubleshooting as demonstrated by the case, tips for avoiding mistakes during
repair, emphasis on key observations or procedures that are new and not common
knowledge, comments regarding any general principles learned from the case.
10. Edit history: The Edit History shows who made changes to the case, the status of
the case, the date the case was changed, and the comments for the change.
165
Corresponding to the RCM terminology for “Failure” and “Failure Mode” respectively
166
Recall the RCM “Failure Mode”
Case Study
Over a period of two months during 2004 the fault “NOSE STEERING illuminated” was
detected in a fleet of aircraft. Around the world several people were grappling with a
similar nose wheel steering problem. The knowledge building process amalgamated the
notes from similar sessions. The notes are presented in Figure 12-4.
Previous Notes:
Solution cases: #4137-Nosewheel Steering sluggish due to partial blockage of the Steering
Manifold Inlet Filter.
2004-12-23 14:15:16 GMT by Vincent, Dominic (Closed)
Nose wheel Steering sluggish. Hydraulic supply line checked for debris as fault had only
become evident after a #2 edp failure a week or so earlier. Debris was found in the filter gauze
in the elbow. Once cleaned steering function checked ''satis''. We are looking back to see if any
of our other A/C that have had edp failures have suffered from steering problems as steering
hydraulic supply pipes not checked after a failure.
2004-12-15 19:58:34 GMT by Gray, Stuart (Closed)
In Service Engineering informs me that as well as the inlet and return filters in the PTU selector
valve in system #1 (Service Bulletin 84-29-13) and the alternate landing gear extend system
filter unions at the bypass valve and at the reservoir intake, (which we have all come across),
there are a few others that merit attention. They are: Rudder PCUs have inlet filter unions;
Elevator PCUs have inlet filter unions; Flap Power Unit has an inlet filter union; If you have an
ongoing fault that appears after a system has been contaminated, (i.e after an EDP failure) a
look at these filters might be worth your while.
2004-12-13 16:23:42 GMT by Gosling, Tom (Referred)
167
However, the root cause can comprise more than one contributory cause, expressed in the
form of attributes and values that define the case.
168
Recall the RCM “Failure Effects”. The CBR system extends the RCM knowledge elements with
additional structure that enables the application of diagnostic algorithms in software.
The manifold has two filters at the inlet port. The first one is located within the swivel joint P/N
SJ504-917-2. The second is an inlet filter P/N FSHX0511200B located within the manifold
downstream of the inlet port. This information is being added to the Goodrich CMM. This is not
an AMM level component as it is part of the steering manifold.
2004-12-13 16:20:07 GMT by Gosling, Tom (Open)
Figure 12-5 Inlet Elbow filter blockage as a result of a prior hydraulic pump failure
The knowledge base was updated to include a new set of observations - a structured
description of attributes and their values. The attributes and their values are presented in
Figure 12-6.
169
Using an enhanced search function provided in the case editor software.
Figure 12-6 Observations for a new case added to the knowledge base
Explanation
The recent failure of the engine driven hydraulic pump had contaminated the system. Some of the
contamination had collected in the steering manifold inlet elbow filter and had remained there
after the flushing. The partial blockage of the inlet filter caused a flow restriction to the hydraulic
manifold, which resulted in the sluggish performance when maneuvering. The reduction however,
was not enough to trigger the P-SW (pressure switch) fault in the steering control unit.
Repair
The steering manifold hydraulic supply filter was cleaned.
References
- AIPC 32-51-16-01 - NLG Steering Manifold
Lessons
1. The manifold has two filters at the inlet port.
2. The first one is located within the swivel joint P/N SJ504-917-2.
3. The second is an inlet filter P/N FSHX0511200B located within the manifold downstream
of the inlet port.
This information is being added to the Goodrich CMM. This is not an AMM level component as it
is part of the steering manifold.
The seed case base
Before implementing CBR in a maintenance organization, we must first build a seed case
base of a sufficient170 number of cases. Figure 12-7 illustrates the development of the
seed case base from 1) existing work order and troubleshooting records, 2) failure modes
and effects analysis records, and 3) OEM maintenance and troubleshooting (fault
isolation) manuals.
Casebase Growth
Having deployed the CBR system with a Seed Casebase, the system itself becomes a
powerful knowledge capture mechanism, and the casebase grows as new cases are
discovered during its use. The chart in Figure 12-8 illustrates the expected pattern as the
case base matures. The left side of the graph shows low initial usage as the seed case
base is deployed in stages, gradually bringing on more and more users until it is part of
normal operations.
170
So that the tool inspires enough confidence from the outset to be used and developed
further.
Performance measurement
Conclusions
The scale and unabated growth of mechanization and automation in all walks of human
endeavor gave rise to the diagnostic approach known as case-based reasoning. CBR
extends the structure of the knowledge gained through the application of reliability-
centered maintenance. Along with advanced condition monitoring tasks, CBR assists the
modern maintainer to satisfy increasingly pressing economic, environmental and safety
demands for:
• Better first-time fix of both potential and functional failures
• Cost reduction / Cost avoidance
o Less troubleshooting time
o Rapid planning for unscheduled maintenance events
o Reduced (unnecessary) parts replacements
o Reduced unscheduled service interruptions
• Increased asset availability
• Preservation and use of intellectual assets
o Capture of “walking knowledge” prior to retirement or attrition
o Maximized utility of new staff
o Focused efforts of expert staff on toughest problems
Case based diagnostic reasoning, encompassing detection, processing, and decision
sub-processes, is truly a form of condition based maintenance (CBM), whose principles we
described in detail in the preceding chapters.
Chapter 13. A survey of signal processing
and decision technologies for CBM
Introduction
In previous chapters we learned that Condition-based Maintenance recommends actions
based on information acquired through observation and analysis. We noted, moreover,
that the CBM process itself contains three sub-processes or steps: data acquisition, signal
processing, and maintenance decision making.
Chapter 12, Case based reasoning (page 165), pointed out, in regard to complex systems,
that prognostics are often indistinguishable from diagnostics, where both aim to identify
the occurrence of a potential failure.
Hundreds of theoretical and practical research papers on CBM appear every year in
scientific journals, conference proceedings and technical reports. In this chapter we
provide an overview of recent developments in the diagnostics and prognostics of
systems. We will mention a number of models, algorithms, and technologies for signal
processing and maintenance decision making. Given the increased use of multiple
sensors, we will also discuss various techniques for data fusion. The chapter is concluded
with a brief discussion on current practices and possible future trends in CBM. The
purpose of this survey of advanced methods of signal processing and decision making is
not to instruct the reader in the use of these new techniques, but merely to provide the
maintenance professional with references to the source material so that he or she can
investigate alternatives when encountering various situations where a CBM solution is
proposed.
Let us begin by reviewing, briefly, the first CBM step, data acquisition.
Data acquisition
Data acquisition, the essential first step in the CBM task, is a process for collecting and
storing useful information that emanates from operating physical assets. Data collected in
a CBM program is of two main types: “event” data and condition monitoring (CM) data.
Event data tells us what happened, for example, an installation, a breakdown, or an
overhaul. Event data also tells us what was done, for example, a minor repair, a
preventive maintenance action, an oil change, and so on. CM data consists of
observational measurements that we believe are, in some way, related to the deteriorating
health or state of the physical asset.
CM data can include vibration data, acoustics data, oil analysis data, temperature,
pressure, moisture, humidity, and any other physical observations, including visual clues,
that relate to the condition of an operating physical asset in its environment. A variety
of sensors (microsensors, ultrasonic sensors, acoustic emission sensors, thermographic
imagers, etc) have been designed to collect different types of data [11,12]. Wireless
technologies such as Bluetooth have provided an alternative to more expensive hard
wired data communication. Information systems such as Computerized Maintenance
Management Systems (CMMS), Enterprise Resource Planning (ERP) systems, control
system historians, and CBM databases have been developed for data storage and handling
[13]. With the rapid development of computer and advanced sensor technologies, data
acquisition technologies have become more powerful and less expensive, resulting in
exponentially growing databases of CM data.
Event data and CM data are equally important in CBM. In practice, however, engineers
and managers tend to place more emphasis on the latter and sometimes neglect the
former. Overlooking event data may have grown from the mistaken belief that it is not
valuable to fault prediction as long as the condition monitoring data seems to be working
well. We tend to overlook event data, in part, because we lack the knowledge and
methods to use it. Event data is at least as helpful as CM data in assessing machine
health. It augments our ability to judge the significance of CM data with respect to
specific failure modes. The use of event data is discouraged by the fact that its collection
usually implies manual data entry. Once a human is involved, everything becomes more
complicated and error-prone. Choosing the “simple” solution, that of removing the
human element, is hasty and ill-advised. Rather, it is preferable to equip humans with
tools and procedures171 with which to capture event data accurately, in a meaningful
format, and in sufficient detail.
Signal processing
Under the topic of signal processing we include a necessary preliminary step - data
cleaning. Data, especially event data, particularly when it is entered manually, always
171
Such as those developed in Chapter 4. (page 58)
contains errors. Data cleaning is meant to ensure that clean (error-free) data is used for
subsequent analysis and modeling. Data errors are caused by many factors, including the
human factor mentioned previously. Errors in CM data may be caused by sensor faults,
which are handled by sensor fault isolation [14]. In general, there is no simple, single
method to clean data. Sometimes manual examination is required. Graphical tools are
helpful in finding and removing data errors. Data cleaning is indeed a vast subject area.
In Example 2, Data validation, on page 131 (Chapter 10), we touched upon various
aspects of data cleaning.
The next step in signal processing is data analysis. A variety of models, algorithms and
tools are described in the technical literature. Their purpose is to analyze data in order to
better understand and interpret it. The choice of which model, algorithm, or tool to use
for data analysis depends primarily on the type of data collected. Condition monitoring
data falls into three principal types:
Value: Data collected at a specific time epoch as single valued variables. For
example, oil analysis data, temperature, pressure, humidity are all value type data.
Waveform: Data collected at a specific time epoch as a time series of values. For
example, vibration data and acoustic data are of the waveform type.
Multi-dimension: Data collected at a specific time epoch as multi-dimensional values. For
example, image data such as thermographs are of the multi-dimension type.
Although we have been using the term more broadly to describe the entire data analysis
phase of CBM, “signal processing” usually refers most specifically to waveform and
multi-dimension data analysis. A large variety of signal processing techniques have been
developed to analyze and interpret these types of data. Their purpose is to extract useful
information from the raw signal in order to perform diagnostics and prognostics. The
signal processing procedure for extracting information relevant to targeted failure modes
is often called “feature extraction”.
Signal processing
Waveform data analysis
The most common waveform data in condition monitoring are vibration signals and
acoustic emissions. Other waveform data include ultrasonic signals, motor current, partial
discharge, and others. In the literature, there are three main categories of waveform data
analysis: time-domain analysis, frequency-domain analysis and time-frequency analysis.
Time-domain analysis is directly based on the time waveform itself. Traditional time-domain
analysis calculates characteristic features from time waveform signals as
descriptive statistics, for example mean, peak, peak-to-peak interval, standard deviation,
crest factor, and high order statistics such as RMS (root mean square), skewness, and
kurtosis. These
features are usually called time-domain features. A popular time-domain analysis
approach is time synchronous average (TSA). The idea of TSA is to use the ensemble
average of the raw signal over a number of evolutions in an attempt to remove or reduce
noise and effects from other sources, so as to enhance the signal components of interest.
A brief review of TSA was given by Dalpiaz [15] and some drawbacks of TSA were
pointed out by Miller [16]. Most of the references on TSA can be found in [15,16].
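The time-domain features and the TSA idea described above can be sketched as follows. This is a minimal illustration on synthetic data. The function names are hypothetical, and the TSA shown assumes the signal has already been resampled to a fixed, known number of samples per revolution.

```python
import math


def time_domain_features(x):
    """Common descriptive statistics of a waveform x (a list of floats)."""
    n = len(x)
    mean = sum(x) / n
    peak = max(abs(v) for v in x)
    rms = math.sqrt(sum(v * v for v in x) / n)
    variance = sum((v - mean) ** 2 for v in x) / n
    std = math.sqrt(variance)
    return {
        "mean": mean,
        "peak": peak,
        "peak_to_peak": max(x) - min(x),
        "std": std,
        "rms": rms,
        "crest_factor": peak / rms,
        "skewness": sum((v - mean) ** 3 for v in x) / n / std ** 3,
        "kurtosis": sum((v - mean) ** 4 for v in x) / n / variance ** 2,
    }


def time_synchronous_average(x, samples_per_rev):
    """Ensemble average over whole revolutions; non-synchronous content averages out."""
    revs = len(x) // samples_per_rev
    return [sum(x[r * samples_per_rev + i] for r in range(revs)) / revs
            for i in range(samples_per_rev)]


# Features of a pure sine: crest factor is sqrt(2), kurtosis is 1.5.
sine = [math.sin(2 * math.pi * 4 * k / 64) for k in range(64)]
features = time_domain_features(sine)

# TSA demo: a repeating pattern plus a disturbance that flips sign every revolution.
pattern = [0.0, 1.0, 0.0, -1.0]
noisy = [p + (0.5 if rev % 2 == 0 else -0.5) for rev in range(4) for p in pattern]
averaged = time_synchronous_average(noisy, samples_per_rev=4)
```

In the TSA demo the disturbance cancels over the four revolutions and the underlying pattern is recovered. Real signals need an accurate tachometer reference and many more revolutions to achieve a similar effect.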
An ARMA model of order (p, q), denoted ARMA($p$, $q$), is expressed by
$$x_t = a_1 x_{t-1} + \cdots + a_p x_{t-p} + \varepsilon_t - b_1 \varepsilon_{t-1} - \cdots - b_q \varepsilon_{t-q}$$
where $x$ is the waveform signal, the $\varepsilon$’s are independent and normally distributed with mean 0
and constant variance $\sigma^2$, and $a_i$, $b_i$ are model coefficients. An AR model of order $p$ is
a special case of ARMA($p$, $q$) with $q = 0$. Poyhonen et al [17] applied the AR model to
vibration signals collected from an induction motor and used AR model coefficients as
extracted features. Baillie and Mathew [18] compared the performance of three
autoregressive time series modeling techniques: AR model, back propagation neural
networks, and radial basis function networks to bearing fault diagnostics. Garga [19]
proposed using AR modeling followed by dimension reduction for machinery fault
diagnostics. Recently, Zhan [20] used a state space model representation of an AR model
to analyze vibration signals for fault detection.
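To make the idea of using AR coefficients as extracted features concrete, the sketch below fits an AR($p$) model by solving the Yule-Walker equations with the Levinson-Durbin recursion. This is a generic textbook estimator chosen for illustration, not the specific procedure of any of the cited papers. The simulated signal and its coefficients (0.6, -0.3) are made up.

```python
import random


def autocovariance(x, max_lag):
    """Biased sample autocovariance for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    return [sum(d[k] * d[k + lag] for k in range(n - lag)) / n
            for lag in range(max_lag + 1)]


def fit_ar(x, order):
    """Estimate AR coefficients a_1..a_p and the noise variance (Levinson-Durbin)."""
    r = autocovariance(x, order)
    coeffs = []
    noise_var = r[0]
    for k in range(1, order + 1):
        acc = r[k] - sum(coeffs[j] * r[k - 1 - j] for j in range(k - 1))
        reflection = acc / noise_var
        coeffs = [coeffs[j] - reflection * coeffs[k - 2 - j]
                  for j in range(k - 1)] + [reflection]
        noise_var *= 1.0 - reflection ** 2
    return coeffs, noise_var


def simulate_ar(coeffs, n, seed=0):
    """Generate n samples of x_t = sum(a_j * x_{t-j}) + e_t with unit-variance noise."""
    rng = random.Random(seed)
    x = [0.0] * len(coeffs)
    for _ in range(n):
        x.append(sum(a * x[-1 - j] for j, a in enumerate(coeffs)) + rng.gauss(0.0, 1.0))
    return x[len(coeffs):]


signal = simulate_ar([0.6, -0.3], n=5000)
ar_features, residual_variance = fit_ar(signal, order=2)
```

The estimated coefficients in `ar_features` should land close to the values used in the simulation. In a diagnostic setting such a coefficient vector would be computed per measurement and passed on to a classifier or trend analysis.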
There are many other time-domain analysis techniques to analyze waveform data for
machinery fault diagnostics. Some of them are briefly described as follows. Wang et al
[21] introduced three nonlinear diagnostic methods for rotating machine fault diagnosis.
These three methods are pseudo-phase portrait, singular spectrum analysis, and
correlation dimension. Pseudo-phase portrait is simple for computer execution and is
sensitive to some fault types. Wang and Lin [22] used a statistical approach known as
singular value decomposition to obtain the pseudo-phase portrait. Singular spectrum
analysis can reveal the complexity of a signal and reduce the noise. Correlation
dimension can provide some intrinsic information on an underlying dynamical system.
Koizumi [23] also considered the application of correlation dimension to fault diagnosis.
Wang et al [24] applied both correlation dimension and bispectrum for rotating machine
fault diagnosis. Zhuge and Lu [25] proposed a modified least mean square algorithm to
model the non-stationary impulse-like signals for reciprocating machine fault diagnosis.
Baydar et al investigated the use of a multivariate statistical technique known as principal
component analysis (PCA) in gear fault diagnostics [26].
easily identify and isolate certain frequency components of interest. The most widely
used conventional analysis is spectrum analysis by means of FFT (fast Fourier
transform). The main idea of spectrum analysis is either to look at the whole spectrum or
to look closely at certain frequency components of interest and thus extract features from
the signal (see, e.g. [27-29]). The most commonly used tool in spectrum analysis is the
power spectrum. It is defined as E[X(f) X*(f)], where (and throughout this section)
X(f) is the Fourier transform of signal x(t), E denotes expectation and “∗” denotes
complex conjugate. Some useful auxiliary tools for spectrum analysis are graphical
presentation of the spectrum, frequency filters, envelope analysis (also called amplitude
demodulation) [30-32], side band structure analysis [33], etc. Descriptions of the above
mentioned techniques for FFT based spectrum can be found in textbooks such as [34,35]
and will not be discussed in detail here. Another useful transform, Hilbert transform, has
also been used for machine fault detection and diagnostics [30,36].
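The power spectrum definition E[X(f) X*(f)] can be sketched as follows, replacing the expectation with an average over signal segments. This is an illustrative estimate only: it uses a direct O(n²) discrete Fourier transform instead of an FFT for the sake of self-containment, and the sample rate, tone frequencies and amplitudes are assumed values.

```python
import cmath
import math


def dft(x):
    """Direct discrete Fourier transform (O(n^2); an FFT would be used in practice)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def power_spectrum(segments):
    """Estimate E[X(f) X*(f)] by averaging |X(f)|^2 over equal-length segments."""
    n = len(segments[0])
    acc = [0.0] * n
    for seg in segments:
        for k, value in enumerate(dft(seg)):
            acc[k] += (value * value.conjugate()).real
    return [a / len(segments) for a in acc]


# Assumed test signal: an 8 Hz component plus a weaker 20 Hz component.
sample_rate = 64.0
n = 64
segments = []
for s in range(4):
    phase = 0.5 * s
    segments.append([
        math.cos(2 * math.pi * 8.0 * t / sample_rate + phase)
        + 0.3 * math.cos(2 * math.pi * 20.0 * t / sample_rate)
        for t in range(n)
    ])

spectrum = power_spectrum(segments)
peak_bin = max(range(n // 2 + 1), key=lambda k: spectrum[k])
peak_frequency = peak_bin * sample_rate / n
print(peak_frequency)
```

Picking the largest bin, or the amplitude at a known fault frequency, is one simple way to turn the spectrum into a feature.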
Despite the wide acceptance of the power spectrum, other useful spectra for signal
processing have been developed and have been shown to have their own advantages over
the FFT spectrum in certain cases. Cepstrum has the capability to detect harmonics and
sideband patterns in the power spectrum. There are several versions or definitions of
cepstrum [35]. Among them, the power cepstrum, which is defined as the inverse Fourier
transform of the logarithmic power spectrum, is the most commonly used. A modified
cepstrum analysis was proposed in [37]. A high order spectrum, i.e. bispectrum or
trispectrum, can provide more diagnostic information than the power spectrum for non-
Gaussian signals. In the literature, high order spectrum is also called high order statistics
[38]. This name comes from the fact that bispectrum and trispectrum are actually the
Fourier transforms of the third- and fourth-order statistics of the time waveform,
respectively. But this name could be confused with the time-domain high order statistics.
Bispectrum and trispectrum are defined as
$$B(f_1, f_2) = E[X(f_1) X(f_2) X^*(f_1 + f_2)]$$
and
$$T(f_1, f_2, f_3) = E[X(f_1) X(f_2) X(f_3) X^*(f_1 + f_2 + f_3)]$$
respectively. The corresponding normalized quantities, bicoherence and tricoherence, are defined as
$$\beta(f_1, f_2) = \frac{|B(f_1, f_2)|}{\sqrt{E[|X(f_1) X(f_2)|^2]\, E[|X(f_1 + f_2)|^2]}}$$
and
$$\tau(f_1, f_2, f_3) = \frac{|T(f_1, f_2, f_3)|}{\sqrt{E[|X(f_1) X(f_2) X(f_3)|^2]\, E[|X(f_1 + f_2 + f_3)|^2]}}$$
respectively. Bispectrum analysis has been shown to have wide application in machinery
diagnostics for various mechanical systems such as gears [39], bearings [40], rotating
machines [41,42] and induction machines [43,24]. Li [44] investigated the application of
bispectrum diagonal slice B ( f , f ) to gear fault diagnostics. Yang [40] used both
bispectrum diagonal slice and bicoherence diagonal slice β ( f , f ) , summed bispectrum,
and summed bicoherence for bearing fault diagnostics. Application of both bispectrum
and trispectrum to bearing fault diagnostics was discussed in [45]. A new technique
called holospectrum was introduced by Qu [46] to integrate all the information of phase,
amplitude and frequency of a waveform signal. Application of holospectrum to machine
fault diagnostics was studied in [47,48]. A review on holospectrum and its applications
was given by Qu [49] (in Chinese).
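To make the bispectrum and bicoherence discussed above concrete, here is a hedged sketch of a segment-averaged bicoherence estimate at a single frequency pair. It assumes the common normalization in which |B(f1, f2)| is divided by the square root of E[|X(f1)X(f2)|²] E[|X(f1 + f2)|²], so that the value lies between 0 and 1. The test signals, bin indices and segment count are hypothetical.

```python
import cmath
import math
import random


def dft_bin(x, k):
    """Single DFT coefficient X(k) of a real sequence."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))


def bicoherence(segments, k1, k2):
    """Segment-averaged bicoherence at the bin pair (k1, k2)."""
    num = 0j
    p12 = 0.0
    p3 = 0.0
    for seg in segments:
        x1, x2, x3 = dft_bin(seg, k1), dft_bin(seg, k2), dft_bin(seg, k1 + k2)
        num += x1 * x2 * x3.conjugate()
        p12 += abs(x1 * x2) ** 2
        p3 += abs(x3) ** 2
    m = len(segments)
    return abs(num / m) / math.sqrt((p12 / m) * (p3 / m))


def make_segments(coupled, rng, n=32, k1=3, k2=5, count=64):
    """Three tones; the third has phase p1 + p2 only when 'coupled' is True."""
    segments = []
    for _ in range(count):
        p1 = rng.uniform(0, 2 * math.pi)
        p2 = rng.uniform(0, 2 * math.pi)
        p3 = p1 + p2 if coupled else rng.uniform(0, 2 * math.pi)
        segments.append([
            math.cos(2 * math.pi * k1 * t / n + p1)
            + math.cos(2 * math.pi * k2 * t / n + p2)
            + math.cos(2 * math.pi * (k1 + k2) * t / n + p3)
            for t in range(n)
        ])
    return segments


rng = random.Random(1)
coupled_value = bicoherence(make_segments(True, rng), 3, 5)
uncoupled_value = bicoherence(make_segments(False, rng), 3, 5)
print(coupled_value, uncoupled_value)
```

Both signals have the same power spectrum, but only the phase-coupled one yields a bicoherence near one. This is the kind of extra information that the power spectrum cannot provide.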
Generally speaking, there are two classes of approaches for power spectrum estimation.
The first covers the non-parametric approaches that estimate the autocorrelation sequence
of the signal and subsequently apply a Fourier transform to the estimated autocorrelation
sequence. For details, see [50]. The second class includes the parametric approaches that
build a parametric model for the signal and then estimate power spectrum based on the
fitted model. Among them, AR spectrum [51-53] and ARMA spectrum [54] based on AR
model and ARMA model respectively are the two most commonly used parametric
spectra for machinery fault diagnostics.
Another transform for time-frequency analysis is the wavelet transform. Wavelet theory
has been developing rapidly in the past decade and has wide application [65]. A
continuous wavelet transform is defined as
$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^*\left(\frac{t - b}{a}\right) dt$$
where x(t) is the waveform signal, a is the scale parameter, b is the time parameter and
ψ(·) is a wavelet, which is a zero average oscillatory function centered around zero with
a finite energy, and “∗” denotes complex conjugate. Commonly used wavelets are
Morlet, Mexican hat, Haar, etc. Similar to Fourier transform, the wavelet transform has
its discrete form, which is obtained by discretizing a and b, and expressing x(t) in
discrete form. Similar to FFT, a fast wavelet transform is likewise available for the
calculation.
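The discrete, fast form of the wavelet transform can be illustrated with the Haar wavelet, the simplest of those listed above. The sketch below is a minimal orthonormal Haar decomposition for signals whose length is a power of two. The function names and the test signal (a constant level with one impulse) are assumptions made for the example.

```python
import math


def haar_step(x):
    """One level of the orthonormal Haar transform: pairwise sums and differences."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def haar_dwt(x):
    """Full decomposition.

    Returns (final approximation, detail bands ordered from finest to coarsest).
    """
    if not x or len(x) & (len(x) - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(x)
    details = []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


# Constant signal with a single impulse at sample 5 (illustrative).
signal = [1.0] * 16
signal[5] = 5.0

approx, details = haar_dwt(signal)
finest = details[0]
impulse_pair = max(range(len(finest)), key=lambda i: abs(finest[i]))
energy = sum(a * a for a in approx) + sum(d * d for band in details for d in band)
print(impulse_pair, energy)
```

The impulse appears as a single large coefficient in the finest detail band, localized in time. The total energy of the coefficients equals that of the signal because the transform is orthonormal.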
Image processing
Image processing is similar to but more complicated than waveform signal processing
due to one more dimension involved. In practice, raw images are usually very
complicated and immediate information for fault detection is unavailable. In these cases,
image processing techniques must be powerful enough to extract useful features from raw
images for fault diagnosis — see [86,87] for descriptions and discussions on image
processing tools and algorithms. Image processing seems unnecessary when raw images
provide sufficient and clear information under visual examination to identify patterns and
detect faults. However, image processing can help in extracting features for automatic
fault detection in such situations. In addition to raw images obtained via data acquisition,
some waveform processing techniques such as time-frequency analysis also produce
images. In these situations, image processing can be combined with waveform processing
to obtain better results.
Trend analysis techniques such as regression analysis and time series modeling are
commonly used for analyzing value type data. For example, Grimmelius et al
[95] developed a prototype condition monitoring and diagnostics system for compression
refrigeration plants using a regression analysis model to predict healthy system behavior.
Yang et al [96] established an ARMA model to extract features from on-line data for
power equipment diagnosis. Sinha [97] applied both polynomial regression and an
ARMA model to predict the trend of vibration peak amplitude for turbine fault
diagnostics and prognostics.
A time-dependent proportional hazards model (PHM) is suitable for analyzing both event
and condition monitoring data together. It has a hazard function of the form
$$h(t) = h_0(t) \exp(\gamma_1 x_1(t) + \cdots + \gamma_p x_p(t))$$
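The hazard function above can be evaluated numerically once a baseline hazard is chosen. The sketch below assumes a Weibull baseline h0(t) = (β/η)(t/η)^(β-1), which is a common choice but an assumption here. The shape, scale, covariate and coefficient values are hypothetical.

```python
import math


def weibull_baseline_hazard(t, shape, scale):
    """Assumed baseline hazard h0(t) = (shape/scale) * (t/scale)**(shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def phm_hazard(t, covariates, gammas, shape, scale):
    """h(t) = h0(t) * exp(sum_i gamma_i * x_i(t)) for covariate values x_i(t)."""
    risk = math.exp(sum(g * x for g, x in zip(gammas, covariates)))
    return weibull_baseline_hazard(t, shape, scale) * risk


# Hypothetical example: one covariate (say, iron in ppm) at working age 500 h.
h0 = weibull_baseline_hazard(500.0, shape=2.0, scale=1000.0)
h = phm_hazard(500.0, covariates=[10.0], gammas=[0.05], shape=2.0, scale=1000.0)
print(h0, h)
```

With these numbers the covariate multiplies the age-based hazard by exp(0.5). This shows how condition monitoring data modifies the age-reliability relationship.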
In reliability centered maintenance (RCM) [100], the concept known as the “P-F interval”
is used to describe failure patterns in condition monitoring. A P-F interval is the time
interval between a potential failure (P), which is identified by a condition indicator, and a
functional failure (F). A P-F interval is a useful concept with which to determine an
appropriate interval for periodic condition monitoring. A condition monitoring interval is
usually set to the P-F interval divided by an integer. In practice, however, it is usually
difficult to quantify the P-F interval (see Chapter 9. The Elusive P-F Curve page 106).
Goode et al [101] assumed two Weibull distributions for the P-F interval and the I-P
interval, i.e. from machine installation to a potential failure. Using the statistical process
control (SPC) methods on historical data, they separated each machine life cycle into two
zones: a stable zone and a failure zone. They used the stable zone duration times to fit a
Weibull distribution for the I-P interval. Similarly, they used the failure zone duration
times to fit the Weibull distribution for the P-F interval. Based on these two fitted
distributions combined with the condition monitoring process, machine prognosis was
derived.
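Two of the calculations mentioned above can be sketched briefly: the monitoring interval as the P-F interval divided by an integer, and a failure probability derived from a Weibull distribution fitted to P-F durations. The conditional probability formula is a standard reliability identity used here for illustration, not the derivation given in [101]. All numerical values are assumed.

```python
import math


def monitoring_interval(pf_interval, divisor):
    """Condition monitoring interval as the P-F interval divided by an integer."""
    return pf_interval / divisor


def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t/scale)**shape)."""
    return math.exp(-((t / scale) ** shape))


def conditional_failure_probability(elapsed, horizon, shape, scale):
    """P(functional failure within 'horizon' | already 'elapsed' into the P-F zone)."""
    return 1.0 - (
        weibull_reliability(elapsed + horizon, shape, scale)
        / weibull_reliability(elapsed, shape, scale)
    )


interval = monitoring_interval(pf_interval=90.0, divisor=3)
early = conditional_failure_probability(10.0, 5.0, shape=2.0, scale=30.0)
late = conditional_failure_probability(20.0, 5.0, shape=2.0, scale=30.0)
print(interval, early, late)
```

For a shape parameter above one, the probability of failing in the next few days grows the longer the machine has been in the failure zone.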
A hidden Markov model (HMM) [102,103] is another model for analyzing event and
condition monitoring data together. A HMM consists of two stochastic processes: a
Markov chain with a finite number of states describing an underlying failure mechanism,
and an observation process that depends on the hidden state. Bunks et al [104] applied a
HMM to analyze Westland helicopter data which consists of gearbox fault class
information and vibration measurements surrounding the occurrence of various faults.
The fault classes were treated as states in the hidden Markov chain, whereas the vibration
measurements were treated as realizations of the observation process. The trained HMM
using lab test data was then applied to fault classification for a data set from an operating
gearbox. Dong and He [105] proposed a more general model, hidden semi-Markov model
(HSMM), for hydraulic pump diagnostics. It was shown that HSMM outperforms HMM
in pump diagnostics.
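As a small illustration of treating machine conditions as hidden states and measurements as observations, the sketch below decodes the most likely state sequence with the Viterbi algorithm. The two states, the discretized "low"/"high" vibration observations and every probability are hypothetical. A real application would estimate these parameters from training data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for a discrete-observation HMM."""
    best = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    back = []
    for obs in observations[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in states:
            # Best predecessor of state s at this step.
            prev = max(states, key=lambda p: prev_best[p] * trans_p[p][s])
            best[s] = prev_best[prev] * trans_p[prev][s] * emit_p[s][obs]
            pointers[s] = prev
        back.append(pointers)
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


states = ["normal", "faulty"]
start_p = {"normal": 0.9, "faulty": 0.1}
trans_p = {
    "normal": {"normal": 0.9, "faulty": 0.1},
    "faulty": {"normal": 0.05, "faulty": 0.95},
}
emit_p = {
    "normal": {"low": 0.8, "high": 0.2},
    "faulty": {"low": 0.3, "high": 0.7},
}

path = viterbi(["low", "low", "high", "high", "high"], states, start_p, trans_p, emit_p)
print(path)
```

The decoded path switches from normal to faulty at the point where sustained high vibration makes the faulty state the better explanation.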
Lin and Makis [106] proposed using a partially observable stochastic model to describe
the underlying failure mechanism of a system undergoing condition monitoring. The
proposed model is similar to that of a HMM but it has some distinguishing
characteristics. One (failure) state is observable, while the partially hidden state process is
continuous in time. The observation process, however, is discrete in time. These
characteristics are more realistic in relation to actual condition monitoring processes. The
model parameters were estimated using both event and condition monitoring data. The
fitted model is used for subsequent diagnostics and prognostics. A fast recursive
parameter estimation procedure for a partially observable stochastic model was given in
[107].
Other models in the literature that can be used to analyze both event and condition
monitoring data are models using the delay time concept [108] and stochastic process
models such as a gamma process [109].
Diagnostics
Machine fault diagnostics is a discovery procedure based on mapping information in the
measurement space and/or features in the feature space to machine faults in the fault
space. From an “RCM” perspective, a machine fault may or may not have immediate
consequences. If a fault does not have immediate consequences, other than those
necessary to diagnose and repair it, it is a potential failure. The diagnostic action
following the detection of a potential failure will be a proactive activity, initiated, often,
by a condition based maintenance process. A common example is an alarm generated by
a “rule” applied to the data in a control system historian. Besides a potential failure, a
diagnostic alarm may also expose an otherwise hidden functional failure, usually the
failure of a protective or backup device. The failure of a hidden function has the
immediate consequence that a “multiple” failure is, from that moment on, highly
probable. This topic was developed under Failure Finding Intervals in Chapter 3, on page
39.
The diagnostic mapping process is also called pattern recognition. Traditionally, pattern
recognition was a manual exercise, performed with the assistance of graphical tools such
as a power spectrum graph, a phase spectrum graph, a cepstrum graph, an AR spectrum
graph, a spectrogram, a wavelet scalogram, a wavelet phase graph, and so on. However,
manual pattern recognition requires expertise in the specific area of the diagnostic
application. It is slow and expensive, requiring highly trained and skilled personnel.
Therefore, automatic pattern recognition is highly desirable. This can be achieved by
classification of signals based on the information and/or features extracted from the
signals. In the following sections, different machine fault diagnostic approaches are
discussed with emphasis on statistical approaches and artificial intelligence approaches.
Machine diagnostics with emphasis on practical issues was discussed in [112]. Various
topics in fault diagnosis with emphasis on model-based and artificial intelligence
approaches were covered in a recent co-authored book [113].
Statistical approaches
A common method of fault diagnostics is to detect whether a specific fault is present or
not based on the available condition monitoring information without intrusive inspection
of the machine. This fault detection problem can be described as a hypothesis test
problem with null hypothesis H0: Fault A is present, against alternative hypothesis H1:
Fault A is not present. In a concrete fault diagnostic problem, hypotheses H0 and H1 are
interpreted into an expression using specific models or distributions, or the parameters of
a specific model or distribution. Test statistics are then constructed to summarize the
condition monitoring information so as to be able to decide whether to accept the null
hypothesis H0 or reject it. See [114-116] for some examples of using hypothesis testing
for fault diagnosis. Recently, a framework for fault diagnosis, called structured
hypothesis tests, was proposed for conveniently handling complicated multiple faults of
different types [117].
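The hypothesis-testing formulation can be sketched with a simple one-sided z-test. Following the text, the null hypothesis H0 is that Fault A is present. The sketch assumes that under H0 a monitored feature is normally distributed with a known fault-level mean and known standard deviation. The feature, its values and the significance level are hypothetical.

```python
import math
from statistics import NormalDist, mean


def fault_present_test(samples, fault_mean, sigma, alpha=0.05):
    """Test H0: 'Fault A is present' (feature mean equals fault_mean).

    H0 is rejected when the sample mean is significantly below fault_mean.
    Returns (z statistic, lower-tail p-value, reject_h0).
    """
    z = (mean(samples) - fault_mean) / (sigma / math.sqrt(len(samples)))
    p_value = NormalDist().cdf(z)
    return z, p_value, p_value < alpha


# Hypothetical vibration RMS readings (mm/s); fault level assumed to be 5.0.
z_low, p_low, reject_low = fault_present_test([3.1, 2.9, 3.4, 3.0], 5.0, 1.0)
z_high, p_high, reject_high = fault_present_test([5.2, 4.7, 5.1, 4.9], 5.0, 1.0)
print(reject_low, reject_high)
```

In the first case H0 is rejected, so the data do not support the presence of Fault A. In the second case H0 is retained.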
two signals. These measures are usually derived from certain discriminant functions in
statistical pattern recognition [121]. Commonly used distance measures are Euclidean
distance, Mahalanobis distance, Kullback-Leibler distance and Bayesian distance. See
[122-125] for some examples of using these distance metrics for fault diagnostics. Ding
et al [122] introduced a new distance metric called quotient distance for engine fault
diagnosis. Pan et al [126] proposed an extended symmetric Itakura distance for
signals in time-frequency representations, for example the Wigner-Ville distributions. In
addition to distance measures, the feature vector correlation coefficient is a similarity
measure commonly used for signal classification in machinery fault diagnosis [125].
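A few of the distance measures named above can be written down directly. The sketch below is illustrative only: the Mahalanobis distance is simplified by assuming a diagonal covariance matrix, the Kullback-Leibler distance is computed for discrete distributions, and the reference feature vectors and class labels are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.dist(u, v)


def mahalanobis_diagonal(u, v, variances):
    """Mahalanobis distance under the simplifying assumption of a diagonal covariance."""
    return math.sqrt(sum((a - b) ** 2 / s for a, b, s in zip(u, v, variances)))


def kullback_leibler(p, q):
    """Kullback-Leibler distance between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def nearest_class(feature, references, distance):
    """Label of the reference vector closest to the feature vector."""
    return min(references, key=lambda label: distance(feature, references[label]))


# Hypothetical reference feature vectors for two machine conditions.
references = {"normal": [1.0, 0.2], "bearing fault": [3.0, 1.5]}
label = nearest_class([2.6, 1.2], references, euclidean)
print(label)
```

Swapping the `distance` argument is enough to compare how different metrics classify the same feature vector.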
Many clustering algorithms are available for distinguishing the signal groups [127]. A
commonly used algorithm in machine fault classification is the nearest neighbour
algorithm that fuses the two closest groups into a new group and calculates the distance
between two groups as the distance of the nearest neighbour in the two separate groups
[128]. The boundary between two adjacent groups is determined by the discriminant
function used. A piecewise linear discriminant function was used and thus piecewise
linear boundaries were obtained for bearing condition classification in [129]. A technique
called support vector machine (SVM) is usually employed to optimize a boundary curve
in the sense that the distance of the closest point to the boundary curve is maximized. The
support vector machine approach applied to machine fault diagnosis was considered in
[17,130].
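The nearest neighbour algorithm described above, which repeatedly fuses the two closest groups and measures group distance by their nearest members, can be sketched as follows. This is a minimal single-linkage implementation with Euclidean distance. The two-dimensional feature points are hypothetical.

```python
import math


def single_linkage(points, n_groups):
    """Fuse the two closest groups until n_groups remain.

    The distance between two groups is the distance between their nearest members.
    """
    groups = [[p] for p in points]

    def group_distance(a, b):
        return min(math.dist(p, q) for p in a for q in b)

    while len(groups) > n_groups:
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: group_distance(groups[ij[0]], groups[ij[1]]),
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]
    return groups


# Hypothetical feature points from two machine conditions.
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (2.0, 2.1), (2.2, 1.9)]
groups = single_linkage(points, n_groups=2)
print(groups)
```

The brute-force search over all group pairs is adequate for a small example. Larger data sets would call for a more efficient clustering routine.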
The hidden Markov model (HMM) described earlier can also be used for fault
classification. Early applications of HMM in fault classification and diagnostics treated
the real machine faulty states and the machine normal state as the hidden states of the
HMM [104,131]. Two recent applications of HMM in fault classification assumed a
HMM with hidden states having no physical meaning for two machine conditions
(normal and faulty) [132,133]. The trained HMMs are then used to decode an observation
for fault classification in a machine whose condition is unknown. Xu and Ge [134]
presented an intelligent fault diagnosis system based on a hidden Markov model. Ye et al
[135] considered the application of a two-dimensional HMM based on time-frequency analysis
for fault diagnosis.
An artificial neural network is a computational model that mimics the human brain. It
consists of simple processing elements connected together in a complex layer structure.
The model approximates a complex nonlinear function with multi-input and multi-output.
One processing element comprises a node and a weight. The artificial neural network
learns the unknown function by adjusting its weights with observations of input and
output. This process is usually called training of an artificial neural network. There are
various neural network models. The feedforward neural network (FFNN) is the most
widely used neural network structure in machine fault diagnosis [137-140]. A special
FFNN, the multilayer perceptron (MLP) with the back propagation (BP) training algorithm,
is the most commonly used neural network model for pattern recognition and
classification. Hence it is popular in machine fault diagnostics as well [140,141,142]. The
BP neural networks, however, have two main limitations: 1) difficulty of determining the
appropriate network structure and the number of nodes; 2) slow convergence of the
training process.
A cascade correlation neural network (CCNN) does not require initial determination of
the network structure and the number of nodes. CCNN can be used in cases where on-line
training is preferable. Spoerre [143] applied CCNN to bearing fault classification and
showed that CCNN can result in utilizing the minimum network structure for fault
recognition with satisfactory accuracy. Other neural network models applied in machine
diagnostics are radial basis function neural networks [18], recurrent neural networks
[144,145] and counter propagation neural networks (CPNN) [146]. The above ANN
models usually use supervised learning algorithms which require external input such as a
priori knowledge about the target or desired output. For example, a common practice of
training a neural network model is to use a set of experimental data with known (seeded)
faults. This training process is supervised learning. In contrast to supervised learning,
unsupervised learning does not require external input. An unsupervised neural network
learns by itself using new information available. Wang and Too [38] applied
unsupervised neural networks, a self-organizing map (SOM), and learning vector
quantization (LVQ) to the detection of rotating machine faults. Tallam et al [147]
proposed several self-commissioning and on-line training algorithms for FFNN applied
particularly to electric machine fault diagnostics. Sohn et al [116] used an autoassociative
neural network to separate the effect of damage on the extracted features from those
caused by the environmental and vibration variations of the system. Then a sequential
probability ratio test was performed on the normalized features for damage classification.
Expert systems and neural networks have known limitations. A significant limitation of
rule-based expert systems is combinatorial explosion, which refers to the computation
problem caused when the number of rules increases exponentially as the number of
variables increases. Another important limitation is consistency maintenance, which
refers to the process by which the system decides when some of the variables need to be
recomputed in response to changes in other values. Two important limitations of neural
networks are the difficulty of giving the trained model a physical explanation and the
difficulty of the training process. It is natural then to attempt a combination of both
techniques in order to combine their respective advantages thus improving performance
in a hybrid system. For instance, Silva et al [156] used two neural networks, SOM and
adaptive resonance theory (ART), combined with an expert system based on Taylor's tool
life equation to classify tool wear state. DePold and Gass [157] studied the applications
of neural networks and expert systems in a modular intelligent and adaptive system for
gas turbine diagnostics and prognostics. Yang et al [158] presented an approach for
integrating case-based reasoning ES with an ART-Kohonen neural network to enhance
fault diagnosis. It was shown that the proposed approach outperforms the self-organizing
feature map (SOFM) based system with respect to classification rate.
Neural networks and expert systems have also been combined with other AI techniques
to enhance machine diagnostic systems. Garga et al [165] proposed a hybrid reasoning
approach combining neural network, fuzzy logic and expert systems to integrate domain
knowledge and test operational data. Evolutionary algorithms [166], which mimic the
natural evolution process of a population, have also been shown to have merit when
applied to machine diagnostics. Genetic algorithms (GA) are the most widely used type
of EA. Sampath et al [167] proposed a GA-based optimization approach to gas turbine
diagnostics. Several examples of ANN incorporating GA and other EA algorithms for
machine fault classification and diagnostics are [168-170].
Other approaches
Another class of machine fault diagnostic approaches comprises the model-based approaches
[171,172]. These approaches utilize physics-specific, explicit mathematical models of the
monitored machine. Based on this explicit model, residual generation methods such as
Kalman filter, parameter estimation (or system identification), and parity relations are
used to obtain signals, called residuals, which indicate fault presence in the machine. The
residuals are evaluated to detect, isolate and identify the fault(s). This general procedure is
illustrated in Figure 13-2 . Model-based approaches can be more effective than other
approaches if a correct and accurate model is built. However, explicit mathematical
modeling may not be feasible for complex systems.
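The residual idea can be shown with the simplest possible generator: run an explicit model in parallel with the machine and subtract its prediction from the measurement. The sketch below assumes a first-order discrete-time model x[k+1] = a·x[k] + b·u[k] and a fixed alarm threshold. It stands in for, and is much cruder than, the Kalman filter, parameter estimation and parity relation methods named above. All parameter values are hypothetical.

```python
def simulate_model(inputs, a, b, x0=0.0):
    """Output of the assumed model x[k+1] = a*x[k] + b*u[k], starting at x0."""
    x = x0
    outputs = []
    for u in inputs:
        outputs.append(x)
        x = a * x + b * u
    return outputs


def residuals(measured, inputs, a, b):
    """Residual = measured output minus model-predicted output."""
    predicted = simulate_model(inputs, a, b)
    return [m - p for m, p in zip(measured, predicted)]


def detect(residual_series, threshold):
    """Sample indices at which the residual magnitude exceeds the threshold."""
    return [k for k, r in enumerate(residual_series) if abs(r) > threshold]


inputs = [1.0] * 10
healthy = simulate_model(inputs, a=0.8, b=0.2)
# Inject a hypothetical sensor bias fault of 0.5 from sample 6 onwards.
measured = [y + (0.5 if k >= 6 else 0.0) for k, y in enumerate(healthy)]

alarms = detect(residuals(measured, inputs, a=0.8, b=0.2), threshold=0.1)
print(alarms)
```

With real, noisy measurements the threshold would have to be chosen statistically. That is where the residual evaluation step comes in.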
Petri nets, as a general purpose graphical tool for describing relations existing between
conditions and events [185], have been applied recently to machine fault detection and
diagnostics. Propes [186] used a fuzzy Petri net to describe operating mode transition and
to detect a mode change event for fault detection and diagnosis in complex systems.
Yang [187] proposed a hybrid Petri-net modeling method coupled with fault-tree analysis
and Kalman filtering for early failure detection and fault isolation. Yang et al [188]
introduced an approach for integrating case-based reasoning with Petri net for fault
diagnosis of induction motors. The integrated approach was shown to outperform the
conventional case-based reasoning expert system.
Prognostics
Compared with diagnostics, the literature on prognostics is much smaller. There are two
main prediction types in machine prognostics. The most obvious and widely used is the
prediction of how much time is left before a failure (or one or more faults or
“potential failures”) occurs, given the current machine condition and the past (and future)
operating profile. The time left before observing a failure is usually called “remaining
useful life” or RUL.
Most of the papers in the literature of machine prognostics discuss only the former type
of prognostics, namely RUL estimation. Only a small number of papers address the
second type of prognostics [106,189]. In the following sections, we discuss 1. RUL
estimation, 2. prognostics that incorporate maintenance actions or policies, and 3. the
determination of the appropriate condition monitoring interval.
Prognosis requires knowledge (or data) on the fault propagation process as well as
knowledge (or data) on the failure mechanism. The fault propagation process is usually
tracked by a trending or forecasting model for certain condition variables. There are two
ways of describing the failure. The first assumes that failure depends on the condition
variables (which reflect the actual fault level) and a predetermined boundary. The most
commonly used failure definition in this case is simple: failure occurs when the fault
reaches the predetermined level.
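The first failure definition leads to a very simple RUL estimate: fit a trend to the condition variable and find when it reaches the predetermined level. The sketch below assumes a linear trend fitted by least squares, which is only one of many possible trending models. The measurement times, values and failure level are hypothetical.

```python
def fit_line(times, values):
    """Least-squares slope and intercept of values against times."""
    n = len(times)
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean


def remaining_useful_life(times, values, failure_level):
    """Time from the last measurement until the fitted trend reaches failure_level.

    Returns None when the trend is not increasing (no crossing is predicted).
    """
    slope, intercept = fit_line(times, values)
    if slope <= 0:
        return None
    crossing_time = (failure_level - intercept) / slope
    return max(0.0, crossing_time - times[-1])


# Hypothetical fault-level readings taken every 10 operating hours.
times = [0.0, 10.0, 20.0, 30.0, 40.0]
levels = [1.0, 1.5, 2.0, 2.5, 3.0]
rul = remaining_useful_life(times, levels, failure_level=4.0)
print(rul)
```

Here the trend rises by 0.05 per hour and reaches the failure level at 60 hours, so about 20 hours remain after the last reading.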
The second builds a model for the failure mechanism using available historical data.
Various definitions of failure can be used. A failure can be defined as the event that the
machine is operating at an unsatisfactory level (a partial failure); or, it can be a total
functional failure when the machine cannot perform its intended function at all; or it can
be a breakdown when the machine stops operating; or it can be the attainment of a
potential failure condition defined in terms of acceptable risk. Similar to diagnosis, the
prognostic methods fall into three main categories: statistical approaches, artificial
intelligence approaches and model-based approaches.
Goode et al [101] used SPC to separate the whole machine life into two intervals, the I-P
(Installation-Potential failure) interval in which the machine is running correctly and the
P-F (Potential failure-Functional failure) interval in which the machine is running with a problem.
Based on two Weibull distributions assumed for the I-P and P-F time intervals
respectively, failure prediction was derived in the two intervals and the RUL was
estimated. Yan et al [190] employed a logistic regression model to calculate the
probability of failure for given condition variables and an ARMA time series model to
trend the condition variables for failure prediction. A predetermined level of failure
probability was used to estimate the RUL. Phelps et al [191] proposed to track sensor-
level test-failure probability vectors instead of the physical system or sensor parameters
for prognostics. A Kalman filter with an associated interacting multiple model (IMM)
was used to perform the tracking.
Two statistical models in survival analysis, PHM and PIM, are useful tools for RUL
estimation in combination with a trending model for the fault propagation process.
Banjevic and Jardine [192] discussed RUL estimation for a Markov failure time process
which includes a joint model of PHM and a Markov property for the covariate evolution
as a special case. Vlok et al [99] applied PIM with covariate extrapolation to estimate
bearing residual life. HMM, a stochastic process model discussed earlier, is also a
powerful tool for RUL estimation [193,194]. Lin and Makis [195] introduced a partially
observable continuous-discrete stochastic process model to describe the hidden evolution
process of the machine state associated with the observation process. RUL estimation, as
one of the prediction tasks, was generated by the model. Wang et al [109] proposed a
stochastic process, called a “gamma process”, with hazard rate as the residual life
prediction criterion. The condition information considered was expert judgment based on
vibration analysis. Wang [108] used the residual delay time concept and stochastic
filtering theory to derive the residual life distribution.
AI techniques applied to RUL estimation have been considered by some researchers.
Zhang and Ganesan [196] used self-organizing neural networks, for multivariable
trending of the fault development, to estimate the residual life of a bearing system. Wang
and Vachtsevanos [197] applied dynamic wavelet neural networks to predict the fault
propagation process and estimate the RUL as the time left before the fault reaches a given
value. Yam et al [198] applied a recurrent neural network for predicting the machine
condition trend. Dong et al [199] utilized a grey model and a BP neural network to
predict machine condition. Wang et al [200] compared the results of applying recurrent
neural networks and neural-fuzzy inference systems to predict the fault damage
propagation trend. Chinnam and Baruah [201] presented a neural-fuzzy approach to
estimating RUL for the situation where neither failure data nor a specific failure
definition model is available, but domain experts with strong experiential knowledge are
on hand.
mathematical models applicable to the CBM scenario are much fewer [212]. See also
[213] for more recent references on maintenance modeling.
In condition monitoring, no matter what machines are monitored, they fall into two
categories: completely observable systems and partially observable systems. For a
completely observable system, the machine state can be completely observed or
identified. The information collected from this system is called direct information. For a
partially observable system, the machine condition cannot be fully observed or identified.
The information obtained from this system is called indirect information, which is
somehow related to the real machine state. In the text to follow, we discuss various
models and methods for evaluating, through modeling, these two types of systems.
First, we consider completely observable systems. Wang [215] developed a CBM model
based on a random coefficient growth model where the coefficients of the regression
growth model are assumed to follow known distribution functions. The model was used
to determine the optimal critical level and inspection interval in CBM in terms of a
criterion of interest, which can be cost, downtime or reliability. In a series of works
[216-218], a stochastic model, the gamma process, was used to describe the deterioration
process; the system was considered to have failed if its condition jumped above a pre-set
failure level, and a sequential (or non-periodic) inspection interval was assumed. Grall et al [216]
went on to assume a multi-level control-limit rule replacement policy and obtained the
optimal thresholds and inspection scheduling by minimizing the expected maintenance
cost per unit time. Castanier et al [217] assumed a multi-level control-limit rule
repair/replacement policy and obtained optimal thresholds and inspection scheduling
based on a cost criterion and an availability criterion as well. Dieulle et al [218] assumed
a one-level replacement policy and a sequentially chosen inspection interval using a
maintenance scheduling function, and obtained the optimal threshold and inspection
scheduling by minimizing the global cost per unit time. Amari and McLaughlin [219]
utilized a Markov chain to describe the CBM model for a deterioration system subject to
periodic inspection. The optimal inspection frequency and maintenance threshold were
found to maximize the system availability.
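The flavour of these threshold optimization models can be conveyed with a small Monte Carlo sketch. It is not a reproduction of any cited model. It assumes periodic inspection every time unit, gamma-distributed deterioration increments with shape 1 and scale 1, a single preventive replacement threshold, and failure that is detected only at inspection. The cost values and failure level are hypothetical.

```python
import random


def cost_rate(threshold, failure_level, n_cycles, rng,
              inspection_cost=1.0, preventive_cost=50.0, corrective_cost=500.0):
    """Simulated long-run cost per unit time of a control-limit policy."""
    total_cost = 0.0
    total_time = 0
    for _ in range(n_cycles):
        level = 0.0
        while True:
            # Deterioration over one inspection period (gamma increment).
            level += rng.gammavariate(1.0, 1.0)
            total_time += 1
            total_cost += inspection_cost
            if level >= failure_level:
                total_cost += corrective_cost  # failed: corrective replacement
                break
            if level >= threshold:
                total_cost += preventive_cost  # degraded: preventive replacement
                break
    return total_cost / total_time


failure_level = 10.0
candidates = [4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
rates = {
    m: cost_rate(m, failure_level, n_cycles=2000, rng=random.Random(42))
    for m in candidates
}
best_threshold = min(rates, key=rates.get)
print(best_threshold, rates[best_threshold])
```

A threshold equal to the failure level means that no preventive replacement ever takes place. The direct search over candidate thresholds shows how much an intermediate control limit can lower the cost rate under these assumptions.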
actions. Barata et al [221] used Monte-Carlo simulation to model continuously monitored
deteriorating systems, non-repairable single components or multi-component repairable
systems. Then optimal degradation thresholds of maintenance intervention were found to
minimize the expected total system cost over a given mission time by a direct search.
Marseguerra et al [222] used GA to find the optimal thresholds in the previous work by
simultaneously optimizing two typical objectives of interest, profit and availability.
Hosseini et al [223] employed generalized stochastic Petri nets to represent a CBM model
for a system subject to deterioration failures and Poisson failures. It was assumed that
deterioration failures are restored by major repair and Poisson failures are restored by
minimal repair. The optimal maintenance policy and inspection interval were then found
to maximize system throughput.
with the first period of a normal life and the second, of a potential failure. A stochastic
recursive filtering model was used to predict the residual life, and then a decision model was
established to recommend the optimal maintenance actions. The optimal condition
monitoring intervals were determined by a hybrid of simulation and analytical analysis.
Okumura and Okino [236] constructed a generalized condition-based maintenance model,
in which residual life loss and replacement preparation lead-time are included. The
optimal inspection time vector and warning level of the target maintained system under a
constraint on the preventive replacement probability were obtained by minimizing the long-run
average incurred cost per unit time. Barros et al [237] considered an optimal CBM policy
for a two-unit parallel system of which unit-level monitoring information is imperfect
and/or partial.
For a complex system, a single sensor is limited in its capability of collecting enough data
for accurate condition monitoring, fault diagnosis and prognosis. Multiple sensors are
needed in order to do a better job. With the rapid development of computer science and
advanced sensor technology, there has been an increasing trend in the use of multiple
sensors for condition monitoring, fault diagnosis and prognosis. Data collected from
different sensors may contain dissimilar partial information on the same machine’s
condition. The problem is knowing how to combine all partial information obtained from
different sensors for accurate machine diagnosis and prognosis. The solution to this
problem is the subject of multisensor data fusion.
There are many techniques for multisensor data fusion. They can be grouped into three
main approaches: (1) data-level fusion, (2) feature-level fusion, and (3) decision-level
fusion. For more discussion on these three approaches, see [242,243]. Heger and Pandit
[90] used a data-level fusion approach to fuse images obtained by multidirectional
illumination to generate an image with a high degree of relevant information for grinding
tool condition monitoring and fault diagnostics. Liu and Wang [244] briefly reviewed
some applications of these three multisensor data fusion approaches to machine diagnosis
and prognosis, and applied a feature-level fusion approach called Cascade-Correlation
neural network for rotating imbalance diagnosis. Diagnostics based on multisensor
data fusion were shown to outperform diagnostics based on a single sensor. Wang and
Wang [245] used a decision-level data fusion approach called Dempster-Shafer evidence
theory for diesel engine fault diagnosis. Kozlowski et al [246] proposed a model-based
approach to battery diagnostics using decision-level data fusion. Byington et al [247]
explored the methods to fuse non-commensurate oil and vibration features for better
gearbox fault diagnostics and prognostics. Mannan et al [248] applied a radial basis
function neural network to fuse the features extracted from images of machined surfaces
and acoustic signals generated during the machining process. The results were applied to
the diagnostics of cutting tools. Hannah et al [249] discussed frameworks in data fusion
applications for condition monitoring and diagnostic engineering. Data fusion combined
with CBM optimization was studied in [250,251]. Assessment and evaluation of data and
information fusion strategies were discussed in [252,253]. Wang and Wang [254]
discussed the reliability and self-diagnosis of sensors in a multisensor data fusion
diagnostic system.
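To make decision-level fusion concrete, the following minimal sketch applies Dempster's rule of combination, the core operation of Dempster-Shafer evidence theory, to two sensors. It is an illustration only and not the method of any paper cited above: the fault hypotheses, the mass values, and the names `combine`, `vibration` and `oil` are all assumptions.

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to a mass, and the
    masses of one function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            combined[common] = combined.get(common, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # evidence that cannot both be true
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Renormalize by the non-conflicting mass.
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}

# Hypothetical frame of discernment and sensor opinions.
THETA = frozenset({"bearing_wear", "gear_wear"})
vibration = {frozenset({"bearing_wear"}): 0.6, THETA: 0.4}
oil = {frozenset({"bearing_wear"}): 0.5,
       frozenset({"gear_wear"}): 0.3,
       THETA: 0.2}

fused = combine(vibration, oil)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(sorted(hypothesis), round(mass, 3))
```

With these assumed inputs the conflicting mass is 0.18, so the fused belief in bearing wear is 0.62 / 0.82, or roughly 0.756, which is higher than either sensor assigned on its own.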
In a mechanical system with multiple sensors installed, data collected from each sensor
may be a complicated mixture of data from several sources. But only some of the sources
are related to a particular machine condition of interest. The problem is to separate the
various sources for better machine diagnosis and prognosis by fusing the observed
multisensor data. The technique for solving this problem is known as blind source
separation (BSS) [255]. Recently, BSS has received increasing attention in the area of
machine fault diagnostics and prognostics. The general idea behind BSS is shown in
Figure 13-3. It is assumed that the source signals $S(t) = [s_1(t), \ldots, s_n(t)]$, generated from
$n$ unknown independent sources, and the noise signals $N(t)$, independent of the source
signals, are combined together by an unknown mixing process. The mixed result is
observed at the channel output as an $m$-dimensional ($m \geq n$) signal
$X(t) = [x_1(t), \ldots, x_m(t)]$. A formula for the mixing process can be written as
$$X(t) = f(S(t), N(t))$$
In the literature, there are two categories of mixing process: instantaneous and
convolutive. A mixing process is instantaneous if $f(\cdot)$ is a time-independent
(memoryless) function, and convolutive otherwise. The convolutive mixing
process is more common, especially for mechanical systems. The instantaneous mixing
model is also called an “independent component analysis” (ICA) model, which is a
natural extension of PCA. For a survey of ICA theory and methods, see [256]. Several
authors applied ICA together with other signal processing techniques for condition
monitoring and machine fault diagnosis [257-260]. Tian et al [261] used ICA in
frequency domain and wavelet filtering for gearbox fault diagnostics. Zhang et al [262]
studied ICA for partially blind source separation of diagnostic signals for bearing faults
with prior knowledge. For a convolutive mixing process, BSS is more complicated. Gelle
et al [263] compared two approaches, namely a temporal approach and a frequency
approach, to solving the BSS problem of rotating machine signals for monitoring and
diagnosis purposes. They further studied the application of the temporal approach to
bearing fault diagnostics [264]. Tse and Zhang [265] applied the BSS based method of
second order statistics to separate aggregated vibration signals generated from a number
of mechanical components for machine fault diagnostics. Vilela et al [266] used the
temporal de-correlation approach to separate the mixed acoustic signals for machine
monitoring and fault diagnosis. Serviere et al [267] applied BSS to separate noisy
harmonic signals for rotating machine diagnostics on a semi-blind mixing basis.
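As an illustration of the two categories of mixing process, the mixing function is often taken to be linear with additive noise. The forms below are a common special case offered as a sketch, not a statement of what the cited works assume; the $m \times n$ mixing matrix $A$, the filter matrices $A_k$, and the filter length $K$ are notation introduced here, with $S(t)$, $N(t)$ and $X(t)$ treated as column vectors.

$$X(t) = A\,S(t) + N(t) \qquad \text{(instantaneous)}$$

$$X(t) = \sum_{k=0}^{K} A_k\,S(t-k) + N(t) \qquad \text{(convolutive)}$$

In the instantaneous form each observation depends only on the sources at the same instant, which is the memoryless condition on $f$. In the noise-free case, ICA then looks for an unmixing matrix $W$ such that $W X(t)$ recovers the sources up to permutation and scaling. In the convolutive form each observation also depends on past source values, which is one reason the separation problem is harder.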
Concluding remarks
In this chapter, we have summarized recent research and developments in machinery
diagnostics and prognostics used in implementing CBM. Various techniques, models and
algorithms were reviewed. Of the three main steps of a CBM program, namely, data
acquisition, signal processing, and maintenance decision making, we focused on the latter
two. Finally we discussed various techniques for multiple sensor data fusion.
Although advanced maintenance techniques are available in the literature, CBM
is under-employed by maintenance departments. Commercial predictive maintenance
solution providers have not kept pace with recent advances in signal processing and
decision support. Yet in many situations, especially where both maintenance and failure
are very costly, well-developed and well-managed condition-based maintenance is
clearly a better choice than current time-based, or inadequate condition-based,
maintenance policies. Expert knowledge of both the application field and of reliability
and maintenance theory is required for selecting and implementing effective condition
based maintenance policies in each operating context.
Among the reasons that advanced maintenance technologies have not been well
implemented in industry are: 1) lack of data due to incorrect data collecting approaches
(see page 176); 2) lack of efficient communication between theory developers and
practitioners in the area of reliability and maintenance; 3) lack of efficient validation
approaches; 4) difficulty of communicating the principles of CBM to business policy
makers and management executives.
Part 3. Reliability Centered Maintenance
1) The initial information gathering and analysis process, called “failure modes and
effects analysis” (FMEA)
2) The decision algorithm, and
3) The on-going information gathering and analysis process, called “age
exploration”.
Figure 14-1: The three pillars of reliability-centered maintenance
Figure 14-2: The RCM Worksheet form for recording the seven RCM information elements specified
by the standard SAE JA-1011. See detailed guide in Figure 16-4 on page 247.
The upper areas of the RCM worksheet of Figure 14-2 record the asset’s “operating
context”. The completed worksheets form the organization’s evolving reliability
knowledge repository or knowledge base. A thorough knowledge base documents the
failure behavior, the consequences of failure, and the reasons for performing each
proactive task. It combines the experience of the personnel who maintain and operate the
equipment with the knowledge of manufacturing, design, and process experts. In this
chapter we present the detailed methodology for conducting RCM analysis. We focus, in
Part 3, on the first two pillars: (1) FMEA, the initial information gathering, and (2) the
decision algorithm. (Pillar 3, ongoing age exploration, has been developed in Part 1,
Chapters 1 to 5.)
While RCM teams, often referred to as “facilitated review groups”, provide excellent
results in most industrial or plant settings, other RCM execution strategies are appropriate
to specific situations. J.C. Leverette points out that NAVAIR does not always use the
facilitated review group approach. NAVAIR has conducted numerous RCM analyses
using dedicated analysts.172 In those situations, the analysis is performed by one or more
RCM analysts who gather information from all relevant sources including system experts,
operators and maintainers. Typically the analyst is an RCM expert with anywhere from
some to extensive knowledge of the equipment he or she is analyzing. Situations
involving new acquisitions or new technology, where the majority of available data may
be engineering or test data, are often most efficiently analyzed by one or two technical
specialists.
172
RCM in the Public Domain: An Overview of the US Naval Air Systems Command's RCM process By
JC Leverette and Andres Echeverry, Anteon Corporation Originally presented at RCM-2005 - The
Reliability Centered Maintenance Managers' Forum, www.reliabilityweb.com
Chapter 15. Failure Modes and Effects
Analysis
The process
The right maintenance activity addresses the preservation of function. An obvious
proclamation, yet, to his astonishment, the maintenance professional discovers that the
functions of the machinery under his control were inadequately or incompletely
identified. Consequently the failures of those functions and their causes have, by and large,
escaped his conscious effort to deal with them. Function identification in an item is
neither obvious nor trivial. Consider, as a familiar example, your own automobile. The
following exercise173 in RCM facilitation illustrates the subtlety and importance of rigor
in functional analysis.
173
The car functional analysis was developed by J. Moubray and Aladon. www.aladon.com
incomplete answer with the following question, “What is it, in the preceding function
statement (To get from A to B), that distinguishes your car from your feet?” In other
words, what makes us want to use a car rather than our feet to get from A to B? Upon
considerable discussion the group appends “at speeds up to 85 mph” to the evolving
function statement. Where there is a wide diversity of opinion, we need to establish a
consensus174 of what users really want from a physical asset.
The function statement now reads “To get from A to B at speeds up to 85 mph”. We
ask subsequently, “Is there anything (in that function statement) that distinguishes our
car from a motor bike?” The function statement once again is amended to “To get a
driver and up to 4 passengers from A to B at speeds up to 85 mph”. We might at this
point ask, “Is there anything (in that function statement) that distinguishes a car from
a small helicopter?” Answer, “ … while traveling along paved roads” (as opposed to
cross country). And so on.
Eventually we obtain a fair idea of what the owner and user want the asset to do. Notice
how we arrived at this function statement. We didn't ask how fast we want to go. We
asked, “What distinguishes a car from feet”, thus raising the requirements for both speed
and distance and any other distinct “car” functionality. We may continue in this manner
to ask about secondary functions175. For example, to the question of what are the
environmental requirements, someone might respond with the single word, “emissions”,
which is a noun not a function. Adopting the form of a function statement, we could say,
“To emit less than (whatever the regulations of the locale) ppm NOx, CO, CO2, and so
on.” In many countries, vehicles that do not comply are off the road, making this function
a maintenance priority.
What about safety and structural integrity? These may be expressed in a function
statement as “To allow passenger cell to deform by X cm in a 30 kph head-on
collision.” Functions relating to control, containment, and comfort may be similarly
revealed. Control/Containment/Comfort associated functions might include “To vary
speed between -20 and 140 mph.” and “To isolate the occupants from the elements.”
When a function statement contains no quantitative standard, it implies “absolute”
(isolation in this case). The comfort associated function “To enable operator to vary
temperature between (whatever limits)” implies an air conditioner. As we walk
through these secondary functions, we learn about the importance of consensus. The
function having to do with appearance: “To look acceptable” begs the question,
“acceptable to whom?”. This may be of vital importance in a given operating context,
but is often impossible to quantify. In such cases an understanding must be reached
between user/owner and maintainer. Protective functions are of singular importance and
were described in Chapter 3. (page 39).
174
This is a very important point. Understanding the requirements of the asset and agreement among
maintainers and users will ensure that the maintenance program preserves the right function.
175
A primary function is usually the reason that the owner purchased and installed the asset. Secondary
functions may include protective functions, environmental functions, appearance requirements, control and
containment functions, health and safety functions, economy and efficiency functions, structural and
superficial functions.
Economy/efficiency functions might include “To consume < .010 l/km under (standard
urban cycle, steady speed 100 km, etc.) conditions.” Superfluous functions belong to
components that were installed for an original operating context but are no longer used
in the new one. Often the redundant equipment is judged more expensive to remove than
to keep, and it is left where it is. However, these functions may still fail (a fact that is
often overlooked) and thus still need to be documented in the RCM functional discovery
and subsequent analysis.
Hence, the RCM process begins with the first of the seven RCM questions: “What are
the asset’s functional performance requirements in its operating context?” The
validity of all that follows will depend upon the thoroughness with which the functions
are identified and analyzed. An item will have, typically, from 15 to 50 primary and
secondary functions176. The RCM team, using a structured methodology, discovers and
records each of the item’s functions. The process comprises the following activities:
1. Team members look closely at the asset under investigation, by examining its
drawings, schematics, photographs and even by conducting physical walkarounds.
Components suggest the functions that are to be recorded.
2. The team reviews, agrees upon, and documents the asset’s operating context (see
top area of the RCM Worksheet of Figure 14-2 on page 202).
3. The team members refer to all helpful documents and recall individual
experiences while listing the item’s functions.
4. Each function statement begins with the word “To” and is followed by a verb.
5. Each function statement specifies one or more quantitative performance
standards.
6. The team agrees upon and records the item’s actual performance requirements,
not its design specifications nor its installed capacity.
a. The team considers and records the requirements of the user, the owner,
and society at large.
7. Example of a function statement: To drill one hole in a work piece to a depth of
18 cm ± 0.001 mm in 15 seconds, of diameter 10 mm ± 0.001 mm whose center
deviates no more than .0001 mm, at an average rate of 3.5 holes per minute. Note
that we specify in the function statement, the quality requirements of accuracy
and consistency.
8. The team proceeds to identify and document all primary and secondary functions.
(Primary functions usually describe why the asset has been purchased.)
a. The group identifies and documents all secondary functions by
examining drawings and schematic diagrams, or even by walking around
the physical item.
b. The group ascertains that all secondary functions have been exposed by
reviewing the PEACHES mnemonic: Protective, Environmental,
Appearance, Control, containment, comfort, Health and safety,
Economy, efficiency, Structural integrity and superfluous177 functions.
9. The team devotes special care to hidden functions, that is, functions whose failure
will go unnoticed under ordinary circumstances.
a. The team uses code phrases to imply that a function is hidden (e.g. to be
capable of, to be able to, …) or that it is protected by a hidden function
(e.g. to heat X liters of water to 140C in Y minutes, in the presence of a
standby heater.)
176
If an item has more than this number of functions, one or more subcomponents should be “removed”
from the item and analyzed as separate items. This can be done easily at any time. See Appendix 3. “Sizing
the analysis” on page 276
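Some of the drafting rules in the list above lend themselves to a simple automated check. The sketch below is a rough aid under stated assumptions, not part of the RCM method itself: all names are hypothetical, the presence of a digit stands in for "quantitative performance standard", and the verb that should follow "To" is not verified.

```python
import re

# Code phrases that flag a hidden function, taken from the text above.
HIDDEN_CODE_PHRASES = ("to be capable of", "to be able to")

# Secondary-function categories of the PEACHES mnemonic.
PEACHES = (
    "Protective",
    "Environmental",
    "Appearance",
    "Control, containment, comfort",
    "Health and safety",
    "Economy, efficiency",
    "Structural integrity and superfluous",
)

def check_function_statement(statement):
    """Return a list of warnings for a draft function statement."""
    warnings = []
    text = statement.strip()
    if not text.lower().startswith("to "):
        warnings.append("does not begin with 'To'")
    if not re.search(r"\d", text):
        # Heuristic only: a statement with no number implies an absolute standard.
        warnings.append("no quantitative performance standard found")
    return warnings

def is_marked_hidden(statement):
    """True if the statement opens with one of the hidden-function code phrases."""
    return statement.strip().lower().startswith(HIDDEN_CODE_PHRASES)

def uncovered_categories(covered):
    """List the PEACHES categories the team has not yet reviewed."""
    return [category for category in PEACHES if category not in covered]

good = "To get a driver and up to 4 passengers from A to B at speeds up to 85 mph"
print(check_function_statement(good))
print(check_function_statement("emissions"))
print(is_marked_hidden("To be capable of preventing loss of cabin pressure by backflow"))
print(uncovered_categories({"Protective", "Environmental"}))
```

A warning is a prompt for discussion rather than a verdict; a statement such as “To look acceptable” has no number yet may still be the right wording once user and maintainer agree on what it means.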
Example 1
The item under investigation is a passenger rail car truck (also known as a bogie). A
drawing of the truck is given in Figure 15-3. Its detailed description is given in Appendix
5. on page 280.
Figure 15-4
177
Functions that were required at one time but currently are unused in the current process or product.
The team begins by formulating a statement that captures the truck’s primary function.
The schematic of Figure 15-4 suggests the requirement of “support”, and indicates that
there are two trucks per car. We might, then, propose the function statement: “To support
half the weight of a rail car”. However, when we examine Figure 15-3, various internal
components suggest additional functions. For example, the wheel sets and their bearings
suggest a “rolling” requirement. The suspension components (dampers, air bag, torsion
bar) suggest a “smooth ride” requirement. RCM structured language style encourages us
to broaden the functional statement by including both these notions. We rewrite the
function statement as in Figure 15-5.
Figure 15-5
Notice that we added two quantitative performance specifications, “up to 26.5 T” and “up
to 120 kph” to the function statement. Experienced RCM analysts strive to compose
succinct yet descriptive and quantitative function statements. They try to include as many
functional elements as is practical in a single, clear, grammatically correct function
statement. Such attention to structured phrasing and economy of words keeps the size and
complexity of the entire analysis to manageable proportions.
As the RCM team examines the technical descriptions of Figure 19-6 through Figure
19-14 (in Appendix 5. page 280), it records the functions suggested by each component.
For example, the rubber chevrons of the primary suspension (Figure 19-9 page 283) and
the dampers and air bags of the secondary suspension (Figure 19-12 page 286) suggest
function statement 2 given in Figure 15-6.
The rubberized component, “traction link” of Figure 19-12 on page 285, suggests
function “3” of Figure 15-7.
3 To insulate passengers from jerks during acceleration and braking
Figure 15-7
The components “torsion bar” and “torsion bar turnbuckles” of Figure 19-6 on page 280
suggest function statement “4” of Figure 15-8.
Figure 15-8
In a similar manner, the RCM team examines the drawings and documentation on the
truck and lists the remaining functions in the worksheet as illustrated in Figure 15-9.
Components reviewed suggest the following functions:
• the “air bag” (Figure 19-12 page 286) suggests function “5”,
• the brake (Figure 19-6 page 280) suggests function “6”,
• the auxiliary spring (Figure 19-9 page 283) function “7”,
• the “towing points” (frame description on page 286) function “8”,
• the “axle rod” (Figure 19-9) function “9”,
• the “emergency spring” (Figure 19-12 page 286) function “10”,
• the lateral damper and lateral stop components (Figure 19-11 page 285) function
“11”, and
• the “split pin” (Figure 19-9) function “12”.
Figure 15-9: Rail car truck functions 5 to 12
When examining the drawings on pages 280 to 286, the functions listed in Figure 15-9
will not be obvious to those unfamiliar with the item under analysis. This fact
underscores the importance of selecting RCM team members who have used and
maintained the asset over a number of years. For this reason, we discourage the use of
outside consultants to perform RCM analysis.
Example 2
Figure 15-10: The air-conditioning pack in the Douglas DC-10. The location of the three packs in the
nose-wheel compartment is indicated at the upper right. (Based on Airesearch maintenance
materials)
The air-conditioning pack depicted in Figure 15-10 is the cooling portion of the Douglas
DC-10 air-conditioning system. This subsystem was classified as significant during the
first review of the DC-10 systems because of its size, complexity, and cost. There are
three independent installations of this system, located in the unpressurized nose-wheel
side compartment of the airplane (see top right of Figure 15-10). Hot high-pressure air,
which has been bled from the compressor section of the engine, enters the pack through a
flow-control valve and is cooled and dehumidified by a heat exchanger and the turbine of
an air-cycle refrigeration machine. The cool air is then directed through a distribution
duct to a manifold in the pressurized area of the airplane, where it is mixed with hot trim
air and distributed to the various compartments. The performance of each pack is
controlled by a pack temperature controller. Each pack is also monitored by cockpit
instrumentation and can be controlled manually if there is trouble with the automatic
control system.
The pack itself consists of the heat exchanger, the air cycle machine (which has air
bearings), an anti-ice valve, a water separator, and a check valve at the pressure
bulkhead to prevent backflow and cabin depressurization if there is a duct failure in the
unpressurized area. The duct is treated as part of the distribution system; similarly the
flow-control valve through which air enters the pack is part of the pneumatic system.
The pack temperature controller is part of a complex temperature-control system and is
also not analyzed as part of the air-conditioning pack.
as: shift arrangements, plant location, customer service requirements, market conditions,
seasonal effects, and so on – anything that sheds light on the asset’s special operating
conditions, requirements, and restrictions. The operating context will greatly assist the
team, when it answers question 5, “What are the consequences?”. Note the italicized
phrase “To be capable of … ” in function statement 2 of Figure 15-11. This code phrase
alerts us to the fact that the function is hidden. That is, under ordinary circumstances, as long
as the duct is intact, no member of the operating crew will be aware that the protective
backflow function has failed. Once again, hidden functions are often difficult, if not
impossible, for those unfamiliar with the asset, to discern, emphasizing again the
importance of choosing RCM analysis team members from among the most experienced
maintenance and operational staff.
Example 3
Item description: Distributed control system (DCS)
Redundancies and protective features (include instrumentation):
To provide safe shutdown in the event of a hardware failure.
To alert the operator, in real time, when some part of the DCS hardware or a field device fails.
To be immune from physical, electromagnetic, electronic, environmental intrusion
To be ergonomic
To conform to NEMA standards
Example 4
In listing the functions of an item, the RCM team thinks about each component of the
item. One of the functions of a tire tread (e.g. on airplanes or haul trucks) is to provide a
renewable surface that protects the carcass of the tire so that it can be retreaded. This
function is not the most obvious one, and it might well be overlooked in listing the tire
functions; nevertheless, it is important from an economic standpoint. Repeated use of the
tire wears away the tread, and if wear continues to the point at which the carcass cannot
be retreaded, a functional failure has occurred. Although we are focusing on the item’s
functions, thinking about the failures experienced by an item, for example the retreading
failure described in Figure 15-12, assists us in the function discovery process.
[Figure 15-12: plot of depth of remaining tread against scheduled inspections 1 to 5. Labels: potential failure, potential failure observed, functional failure, retread, resistance restored.]
the tire so that it can be
retreaded
Figure 15-13
The process
“Functional failure” describes the way in which an asset will fail to perform one of its
functions. We examine each function that we exposed in the preceding functional
analysis. We consider all of the ways that the function can fail. The following points must
be accounted for in answering the question “In what ways can the function fail?”:
1. List each way in which the item can fail to meet each performance requirement
that has been explicitly stated, or implied, in the function statement.
2. Take special care to distinguish between partial and complete failures because
they usually have different causes. For example, “unable to pump at all”, and
“unable to deliver the required 800 lpm” are distinct failures having different
causes.
3. Only functional failures (those that have consequences) are listed in this step.
(Potential failures that preempt functional failures are analyzed and described
when answering question 4 “What are the failure effects?”).
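The three points above suggest a simple record structure in which each function carries its own list of failed states, with partial and complete losses kept apart. The following is a minimal sketch under stated assumptions: the class and field names, the wording of the pump function statement, and the "1-A" style of code are illustrative and are not prescribed by the worksheet.

```python
from dataclasses import dataclass, field
from string import ascii_uppercase
from typing import List

@dataclass
class FunctionalFailure:
    failed_state: str
    complete: bool  # True for total loss of function, False for partial loss
    failure_modes: List[str] = field(default_factory=list)  # filled in at the next step

@dataclass
class Function:
    number: int
    statement: str
    failures: List[FunctionalFailure] = field(default_factory=list)

    def coded_failures(self):
        """Pair each failed state with a code such as '1-A' (function number, failure letter)."""
        return [(f"{self.number}-{ascii_uppercase[i]}", ff.failed_state)
                for i, ff in enumerate(self.failures)]

pump = Function(
    number=1,
    statement="To deliver not less than 800 lpm of water",  # hypothetical wording
    failures=[
        FunctionalFailure("Unable to pump at all", complete=True),
        FunctionalFailure("Unable to deliver the required 800 lpm", complete=False),
    ],
)

for code, state in pump.coded_failures():
    print(code, state)
```

Keeping the complete and partial failures as separate records leaves room for each to receive its own, usually different, list of causes.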
Example 1
Ctrl. No. | Function Statements (Quantitative Performance Requirements) | Failed States (Ways Performance is Lost) | Failure Causes
1 | To provide smooth rolling support for half the weight of a passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph | Fails to provide support |
5 | | Unable to support the car on the rails at 120 kph |
Example 2
Functions | Functional failures | Failure modes | Failure effects
1 To supply air to conditioned air distribution ducts at the temperature called for by pack temperature controller | A Conditioned air is not supplied at called-for temperature | |
2 To be capable of preventing loss of cabin pressure by backflow if the duct fails in unpressurized nose-wheel compartment | A No protection against backflow | |
Figure 15-15: Listing the failed states for a function of the air conditioning pack.
Example 3
Item description: Distributed control system (DCS)
Functions | Functional failures | Failure modes | Failure effects
To provide safe, secure, uninterrupted, redundant, cost effective, continuous process control and monitoring according to the target product of the day, within the parameters specified by product specification and by current environmental regulations, in the presence of a UPS (uninterruptible power supply) | Fails to provide security | Unauthorized usage of console either when unattended or if password stolen |
 | Unable to log in | Password forgotten |
 | Unable to protect against loss of control | UPS has failed |
 | Control lost | Complete loss of communication with ring |
 | | Complete loss of communication with controller node |
 | | All consoles fail |
 | | Complete loss of communication on module bus |
 | | Complete loss of communication on slave bus |
 | | Console LAN fails |
 | Redundancy lost | Console hardware or software fails |
 | | Controller hardware or software fails |
 | | Communication hardware or software fails |
 | | Power supply fails |
 | | IO card fails |
The process
The third step, listing the reasonably likely failure modes, answers the third RCM
question: “What causes the failure?”. It is particularly important in this step that we
keep two objectives of the RCM process in mind. They are:
The failure modes analysis step is particularly difficult, prone to error, and liable to waste
precious time, for two reasons:
These two problems, if not carefully handled by the RCM team and the RCM facilitator,
can bog down the analysis or jeopardize its quality. Too much detail is likely to stall
progress, while a superficial analysis can be costly and dangerous.
In deciding which failure modes to list and which to reject, the facilitator urges the team
to keep the operating context in mind. In contexts where the consequences of failure are
severe the group will agree to list certain failure modes that they would not bother to
include, were the consequences less harsh. It is vital that each member give serious
consideration to the failure mode, and that, collectively, the group balances likelihood
and consequences in deciding whether to include it.
For example, suppose the failure mode “Pump damaged by flying object” were raised in
the course of an RCM session. The RCM team will consider the likelihood and
consequences of failure. In most operating contexts this failure mode would be excluded
from analysis. However, if the pump were operating in a nuclear facility that happened to
be on the path of a busy airplane flight corridor, the team could reasonably decide to
include it. Since operating contexts vary, no template or hard-and-fast rules can dictate
the level of detail (i.e. how many failure modes to include) needed in a given operating
context.
It may happen that an irresolvable difference in opinion emerges among the team
members as to whether or not to include a particular failure mode. In general, the group
decides to err on the side of conservatism. Under no circumstances should this or any
other RCM decision be put to a vote. That would defeat one of the goals of RCM –
ownership of the decisions by the people that they impact. Should the team be unable to
arrive at consensus, the facilitator notes and records the dissenting opinion.
The second difficult question, “How deeply to drill down the causality chain”, if not
carefully considered, will affect the quality and efficiency of the RCM process. Selecting
the failure mode causality depth requires particular vigilance by the RCM facilitator and
the team. The short answer to the question is, “to the level at which the organization can
deal, in a practical way, with the cause of failure”. Figure 15-16 illustrates the almost
limitless choices for selecting causality depth.
Example 1
The RCM team analyzing the truck has recorded the failure modes as given in Figure
15-17 through Figure 15-20.
Functions | Functional failures | Failure modes | Failure effects
To provide smooth rolling support for half the weight of a passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph | Fails to provide rolling support | Bearing collapses due to fatigue failure of cage, rollers, spacer or inner or outer race |
 | | Bearing collapses due to excessive clearance in housing |
 | | Bearing collapses due to bumpy rails |
 | | Bearing fails due to under lubrication |
 | | Plug falls out of axle box cover |
 | | Bearing fails due to over lubrication |
 | | Moisture in lubricant causes bearing to fail |
Figure 15-19: Causes of functional failure "Fails to provide rolling support"
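The function, functional failure and failure mode hierarchy of Figure 15-19 lends itself to a simple nested record. The following is a minimal sketch only; the class names, field names and the helper count_failure_modes are hypothetical and are not taken from any RCM software described in these notes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FunctionalFailure:
    """One way in which a function can be lost, with its recorded causes."""
    description: str
    failure_modes: List[str] = field(default_factory=list)


@dataclass
class Function:
    """A function statement and the functional failures recorded against it."""
    statement: str
    functional_failures: List[FunctionalFailure] = field(default_factory=list)


# Content transcribed from Figure 15-19 (the structure itself is an assumption).
rolling_support = Function(
    statement=(
        "To provide smooth rolling support for half the weight of a "
        "passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph"
    ),
    functional_failures=[
        FunctionalFailure(
            description="Fails to provide rolling support",
            failure_modes=[
                "Bearing collapses due to fatigue failure of cage, rollers, "
                "spacer or inner or outer race",
                "Bearing collapses due to excessive clearance in housing",
                "Bearing collapses due to bumpy rails",
                "Bearing fails due to under lubrication",
                "Plug falls out of axle box cover",
                "Bearing fails due to over lubrication",
                "Moisture in lubricant causes bearing to fail",
            ],
        )
    ],
)


def count_failure_modes(function: Function) -> int:
    """Total number of failure modes recorded under a function."""
    return sum(len(ff.failure_modes) for ff in function.functional_failures)


print(count_failure_modes(rolling_support))
```

Keeping the failure modes as a list under each functional failure makes the level of detail chosen by the team visible at a glance, without implying any rule for how long that list should be.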
Example 2
Function 1: To supply air to conditioned air distribution ducts at the temperature called for by pack temperature controller
Functional failure A: conditioned air is not supplied at called-for temperature
Failure modes:
  air-cycle machine seized
  ram-air passages in heat exchanger blocked
  anti-ice valve fails
  water separator fails

Function 2: To prevent loss of cabin pressure by backflow if the duct fails in unpressurized nose-wheel compartment
Functional failure A: No protection against backflow
Failure modes:
  bulkhead check valve fails
Figure 15-21: Failure mode analysis of the air conditioning pack
Note the causality level and the number of failure modes listed in
Figure 15-21. For the failure mode “air-cycle machine seized”, for example, the team
stopped at the level of the air cycle machine and looked no deeper. This was a balanced
judgment that weighed the consequences of failure against the frequency of occurrence of
this particular failure mode. Once again, no “template solution” will substitute for due
consideration by a team of knowledgeable, involved persons.
Example 3
Item description: Distributed control system (DCS)
Function: To provide safe, secure, uninterrupted, redundant, cost effective, continuous process control and monitoring according to the target product of the day, within the parameters specified by product specification and by current environmental regulations, in the presence of a UPS (uninterruptible power supply)
Functional failures:
  Fails to provide security
    Failure mode: Unauthorized usage of console either when unattended or if password stolen
  Unable to log in
  Unable to protect against loss of control
  Control lost
  Redundancy lost
The process
The team records the entire relevant scenario surrounding the failure mode under
consideration. The text should answer all of the following questions:
A. What sequence of events (internally and organization-wide) could be touched off
by the failure mode?
B. How does the failure make itself known? What observable events lead up to the
failure?
C. How is safety or the environment impacted? (do not mention the words "safety"
or "environment")
D. How is production impacted? (quality, cost, customer service)
E. Is there any additional damage caused by the failure?
F. How long will it take and what actions must be accomplished to correct the
failure?
G. How does the likelihood of this failure depend on deeper causes? Has it happened
before? How often? Under what circumstances?
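Questions A to G above can serve as a completeness checklist for each failure effects entry. The following is a minimal sketch, assuming one free-text field per question; the class FailureEffects, its field names and the helper unanswered are hypothetical names chosen for illustration. The draft entry is deliberately left partial to show the check at work.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FailureEffects:
    """One free-text answer per question A to G; None means not yet answered."""
    sequence_of_events: Optional[str] = None              # question A
    evidence_of_failure: Optional[str] = None             # question B
    safety_environment_impact: Optional[str] = None       # question C
    production_impact: Optional[str] = None               # question D
    secondary_damage: Optional[str] = None                # question E
    corrective_action_and_downtime: Optional[str] = None  # question F
    likelihood_and_history: Optional[str] = None          # question G


def unanswered(effects: FailureEffects) -> List[str]:
    """Return the names of the questions that still have no answer."""
    return [f.name for f in fields(effects) if not getattr(effects, f.name)]


# A partially completed draft, used only to demonstrate the checklist.
draft = FailureEffects(
    sequence_of_events="The truck as a whole collapses.",
    evidence_of_failure="A crack is found during other inspections.",
    corrective_action_and_downtime="Replace truck; downtime 16 hours.",
)

for name in unanswered(draft):
    print("Still to be answered:", name)
```

A facilitator could run such a check before closing an entry, so that an unanswered question is a recorded decision rather than an oversight.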
Example 1
Function 1: To provide smooth rolling support for half the weight of a passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph
Functional failure: Fails to provide support
Failure mode: Weld in frame fails due to fatigue
Failure effects: The truck as a whole collapses. This is most likely to occur when the car is most heavily loaded - in other words when it is full of passengers, and probably while the train is going round a corner. As a result, it would almost certainly be derailed. At present, the truck is replaced when a crack longer than 100 mm is found. (Such a crack would be found during the course of other inspections that occur often enough to detect it.) Downtime to replace the truck on its own is 16 hours.
Note that the description of the effects anticipates question 6 by describing the evolution
of the functional failure: it defines the potential failure (Figure 15-22) at a crack length
of 100 mm and states that the potential failure would likely be found opportunistically
during the course of other inspections.
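The potential failure in this example reduces to a single threshold on a measured crack length. The following is a minimal sketch, assuming the crack length is recorded in millimetres at each inspection; the function name is_potential_failure and the constant name are hypothetical, and only the 100 mm limit comes from the example above.

```python
# Limit quoted in the example: the truck is replaced when a crack
# longer than 100 mm is found.
CRACK_LIMIT_MM = 100.0


def is_potential_failure(crack_length_mm: float,
                         limit_mm: float = CRACK_LIMIT_MM) -> bool:
    """Return True when a measured crack exceeds the potential failure limit."""
    if crack_length_mm < 0:
        raise ValueError("crack length cannot be negative")
    return crack_length_mm > limit_mm


# Illustrative readings only (not data from the source).
for reading in (35.0, 100.0, 120.0):
    print(reading, is_potential_failure(reading))
```

The comparison is strict because the text speaks of a crack "longer than" 100 mm; a different organization might reasonably choose an inclusive limit.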
Function Statement Failure Failure Effects
mode
1 To provide smooth Fails to provide Wheel The truck as a whole collapses. This is most likely
rolling support for half support collapses due to occur when the car is most heavily loaded - in
the weight of a passenger to fatigue other words when it is full of passengers, and
car (up to 26.5 tons) on probably while the train is going round a corner. As
the rails at speeds up to a result, it would almost certainly be derailed. Only
120 kph one cracked wheel has been found to date. It takes 8
hours to replace a wheel
1 To provide smooth Fails to provide Axle fails The truck as a whole collapses. This is most likely
rolling support for half support due to to occur when the car is most heavily loaded - in
the weight of a passenger fatigue other words when it is full of passengers, and
car (up to 26.5 tons) on probably while the train is going round a corner. As
the rails at speeds up to a result, it would almost certainly be derailed. No
120 kph axles have failed so far.
1 To provide smooth Fails to provide Truck frame Initial cracking is likely to lead to frame distortion,
rolling support for half support component which could make the truck unstable enough to
the weight of a passenger fails due to derail the train. As before, this is most likely to
car (up to 26.5 tons) on fatigue happen when heavily loaded - in other words, when
the rails at speeds up to it is full of passengers, and probably while the train
120 kph is going round a corner. So far, the only frame
component which has shown signs of failing has
been the transom, which cracked and has since been
reinforced with a steel plate. Downtime to replace a
truck is 16 hours.
1 To provide smooth Unable to Differential If the difference between wheel diameters is greater
rolling support for half support the car wear of steel than 2 mm, the possibility of derailment at speeds
the weight of a passenger on the rails at treads on the near 120 kph increases. Downtime to re-profile a
car (up to 26.5 tons) on 120 kph same axle pair of wheels is 3 hours.
the rails at speeds up to
120 kph
1 To provide smooth Unable to Spalling on This could lead to differential wear. If the
rolling support for half support the car wheel tread difference between wheel diameters is greater than
the weight of a passenger on the rails at 2 mm, the possibility of derailment at speeds near
car (up to 26.5 tons) on 120 kph 120 kph increases. Downtime to re-profile a pair of
the rails at speeds up to wheels is 3 hours.
120 kph
1 To provide smooth Unable to Wheel flange This failure is only likely to occur on a flange which has
rolling support for half support the car shears off been weakened by excessive wear. It is most likely
the weight of a passenger on the rails at to happen on a heavily loaded train going round a
car (up to 26.5 tons) on 120 kph corner at high speed, which would almost certainly
the rails at speeds up to lead to a derailment. Downtime to replace a set of
120 kph wheels 3 hours.
1 To provide smooth Unable to Chevron Truck frame rests directly on the axle box bump
rolling support for half support the car rubber shears stop. Wheel loading is unevenly distributed and
the weight of a passenger on the rails at wheels are prevented from moving off-axis during
car (up to 26.5 tons) on 120 kph curving - both of these conditions may cause
the rails at speeds up to derailment under adverse conditions of load and
120 kph speed. Downtime to replace the chevron rubber
about 16 hours. (The clearance between the bump
stop and the truck frame should be 30 +1/-0 mm)
1 To provide smooth Unable to Tie bar rod Wheel arch could distort and chevron rubber could
rolling support for half support the car axle rod shear. Truck frame rests directly on the axle box
the weight of a passenger on the rails at slackens off bump stop. Wheel loading is unevenly
car (up to 26.5 tons) on 120 kph
the rails at speeds up to distributed and wheels are prevented from moving
120 kph off-axis during curving - both of these conditions
may cause derailment under adverse conditions of
load and speed. Time to tighten axle rod nut in
Depot 15 minutes.
1 To provide smooth Unable to Chevron Settling could cause excessive contact between
rolling support for half support the car rubber settles vertical bump stop and wheel arch. This would
the weight of a passenger on the rails at restrict wheel set movement during curving, and
car (up to 26.5 tons) on 120 kph could cause derailment under severely adverse
the rails at speeds up to conditions of load and speed. Clearance should be
120 kph 30 +1/-0 mm. Time to replace chevron rubber 4
hours. See also function 2.
1 To provide smooth Unable to Chevron Settling could cause excessive contact between
rolling support for half support the car rubber vertical bump stop and wheel arch. This would
the weight of a passenger on the rails at elastically restrict wheel set movement during curving, and
car (up to 26.5 tons) on 120 kph yields could cause derailment under severely adverse
the rails at speeds up to conditions of load and speed. Clearance should be
120 kph 30 +1/-0 mm. Time to replace chevron rubber 4
hours. See also function 2.
1 To provide smooth Unable to Traction link The traction link falls off at one end, so the traction
rolling support for half support the car bolt comes center is connected to the truck by only one link.
the weight of a passenger on the rails at adrift Asymmetric load on the remaining link damages
car (up to 26.5 tons) on 120 kph the bushes, interfering with ride comfort and
the rails at speeds up to possibly twisting the link mounting plates. This in
120 kph turn causes the second traction link to shear off,
which would mean that the truck is only connected
to the car by the air bags. A twisted mounting could
also restrict truck movement during curving, which
may lead to derailment under adverse conditions of
load and speed. One end of the traction link could
also hit the ground in such a way that the truck
frame or traction center has to vault over it, causing
a spectacularly nasty derailment. Time to replace a
traction link bolt two hours (note that the nuts on
the traction link bolts are held in place by split pins,
which means that this failure should not occur if the
split pin is in place - see also function 11)
1 To provide smooth Unable to Traction link The traction link falls off at one end, so the traction
rolling support for half support the car falls off due center is connected to the truck by only one link.
the weight of a passenger on the rails at to fatigue Asymmetric load on the remaining link damages
car (up to 26.5 tons) on 120 kph the bushes, interfering with ride comfort and
the rails at speeds up to possibly twisting the link mounting plates. This in
120 kph turn causes the second traction link to shear off,
which would mean that the truck is only connected
to the car by the air bags. A twisted mounting could
also restrict truck movement during curving, which
may lead to derailment under adverse conditions of
load and speed. One end of the traction link could
also hit the ground in such a way that the truck
frame or traction center has to vault over it, causing
a spectacularly nasty derailment. Time to replace a
traction link five hours.
1 To provide smooth Fails to provide Bearing Collapsed bearing causes a "hot box", and train
rolling support for half rolling support collapses due must stop at the next station to evacuate passengers
the weight of a passenger to fatigue which causes a traffic delay of 20-60 minutes. It is
car (up to 26.5 tons) on failure of also possible that a failed bearing could cause a
the rails at speeds up to cage, rollers, derailment. The hot box melts the chevron causing
120 kph spacer or it to emit smoke. The chevron also collapses,
inner or damaging the tie-bar and axle. Time to replace a
outer race wheel set complete with bearing and axle box 8
hours.
1 To provide smooth Fails to provide Bearing If the axle box liner bore exceeds the bearing outer
rolling support for half rolling support collapses due race external diameter by more than 0.6 mm,
the weight of a passenger to excessive relative movement between the liner and outer race
car (up to 26.5 tons) on clearance in causes excessive vibration and collapse of the
the rails at speeds up to housing bearing. This causes a hot box, and train must stop
120 kph at the next station to evacuate passengers which
causes a traffic delay of 20-60 minutes. It is also
possible that a failed bearing could cause a
derailment. The hot box melts the chevron causing
it to emit smoke. The chevron also collapses,
damaging the tie-bar and axle. Time to replace a
wheel set complete with bearing and axle box 8
hours.
1 To provide smooth Fails to provide Bearing Excessive interaction between railhead and wheel
rolling support for half rolling support collapses due sets applies shock loads to bearings, leading to
the weight of a passenger to bumpy either fracture of bearing components or accelerated
car (up to 26.5 tons) on rails fatigue failure. This causes a hot box, and train
the rails at speeds up to must stop at the next station to evacuate passengers
120 kph which causes a traffic delay of 20-60 minutes. It is
also possible that a failed bearing could cause a
derailment. The hot box melts the chevron causing
it to emit smoke. The chevron also collapses,
damaging the tie-bar and axle. Time to replace a
wheel set complete with bearing and axle box 8
hours. Rails to be analyzed separately.
1 To provide smooth Fails to provide Bearing fails Seized bearing causes a hot box, and train must stop
rolling support for half rolling support due to under at the next station to evacuate passengers which
the weight of a passenger lubrication causes a traffic delay of 20-60 minutes. It is also
car (up to 26.5 tons) on possible that a failed bearing could cause a
the rails at speeds up to derailment. The hot box melts the chevron causing
120 kph it to emit smoke. The chevron also collapses,
damaging the tie-bar and axle. Time to grease an
axle box 30 mins.
1 To provide smooth Fails to provide Plug falls out Lubricant drains out, causing bearing to seize
rolling support for half rolling support of axle box resulting in a hot box. Train must stop at the next
the weight of a passenger cover station to evacuate passengers which causes a
car (up to 26.5 tons) on traffic delay of 20-60 minutes. It is also possible
the rails at speeds up to that a failed bearing could cause a derailment. The
120 kph hot box melts the chevron causing it to emit smoke.
The chevron also collapses, damaging the tie-bar
and axle. Wheel set would be replaced if plug was
found to be missing. Time required to do so 8
hours.
1 To provide smooth Fails to provide Bearing fails Over-lubrication leads to excessive churning and
rolling support for half rolling support due to over eventual breakdown of lubricant, causing bearing to
the weight of a passenger lubrication seize resulting in a hot box. Train must stop at the
car (up to 26.5 tons) on next station to evacuate passengers which causes a
the rails at speeds up to traffic delay of 20-60 minutes. It is also possible
120 kph that a failed bearing could cause a derailment. The
hot box melts the chevron causing it to emit smoke.
The chevron also collapses, damaging the tie-bar
and axle. It is felt that this failure is unlikely to
occur because the amount of lubricant is controlled.
1 To provide smooth Fails to provide Moisture in Moisture in lubricant reduces its lubricating
rolling support for half rolling support lubricant effectiveness and may also cause the bearing to
the weight of a passenger causes corrode, in both cases leading to bearing failure
car (up to 26.5 tons) on bearing to resulting in a hot box. Train must stop at the next
the rails at speeds up to fail station to evacuate passengers which causes a
120 kph traffic delay of 20-60 minutes. It is also possible
that a failed bearing could cause a derailment. The
hot box melts the chevron causing it to emit smoke.
The chevron also collapses, damaging the tie-bar
and axle. Time to replace wheel set is 8 hours.
1 To provide smooth Fails to provide Flats worn A wheel flat longer than 40 mm is likely to affect
rolling support for half a smooth ride on wheel ride comfort. It will also damage the railhead. The
the weight of a passenger tread noise and vibration caused by a flat wheel tread is
car (up to 26.5 tons) on usually detected quickly by Operations. Time to re-
the rails at speeds up to profile a wheel set on the under floor lathe is 3
120 kph hours.
2 To insulate passengers Fails to insulate Air bag leaks Air bag deflates, so forces are transmitted between
from shocks caused by passengers via top plate truck and car through the layer and emergency
crossing rail joints, adequately of car bolster springs only. This causes a sharper ride, but train
bumps and to minimize faster than it does not have to be withdrawn from service
transient oscillations can be immediately. Time to replace air bag 8 hours. See
after crossing such pumped in also function 5.
bumps.
2 To insulate passengers Fails to insulate Steel wire Air bag fabric cannot contain the air pressure on its
from shocks caused by passengers inside airbag own, so bag bursts causing forces to be transmitted
crossing rail joints, adequately fails through layer and emergency springs only. This
bumps and to minimize causes a sharper ride, but train does not have to be
transient oscillations withdrawn from service immediately. Time to
after crossing such replace air bag 8 hours. See also 44 and 45.
bumps.
2 To insulate passengers Fails to insulate Chevron Reduced clearance causes more frequent contact
from shocks caused by passengers spring rubber between vertical bump stop and wheel arch over
crossing rail joints, adequately settles bumps. This reduces ride quality and increases
bumps and to minimize stresses on all truck components. See also 10 above.
transient oscillations Time to replace chevron 8 hours.
after crossing such
bumps.
2 To insulate passengers Fails to insulate Chevron Reduced clearance causes more frequent contact
from shocks caused by passengers elastically between vertical bump stop and wheel arch over
crossing rail joints, adequately yields bumps. This reduces ride quality and increases
bumps and to minimize stresses on all truck components. See also 11 above.
transient oscillations Time to replace chevron 8 hours.
after crossing such
bumps.
2 To insulate passengers Fails to insulate Damper non- Damper "seizes" and transmits shocks directly from
from shocks caused by passengers return valve truck frame to underside of car (in the case of the
crossing rail joints, adequately fails in vertical damper) or to traction center (in the case of
bumps and to minimize closed the horizontal damper). This reduces ride quality
transient oscillations position and increases stresses on all truck components.
after crossing such Time to replace a defective damper in Depot 1
bumps. hour.
2 To insulate passengers Fails to insulate Damper oil Damper becomes steadily stiffer until it eventually
from shocks caused by passengers viscosity seizes altogether, transmitting shocks directly from
crossing rail joints, adequately increased by truck frame to underside of car (in the case of the
bumps and to minimize dirt or vertical damper) or to traction center (in the case of
transient oscillations oxidation the horizontal damper). This reduces ride quality
after crossing such and increases stresses on all truck components.
bumps. Time to replace a defective damper in Depot 1
hour.
2 To insulate passengers Fails to insulate Excessive Damper becomes steadily stiffer until it eventually
from shocks caused by passengers metal-to- seizes altogether, transmitting shocks directly from
crossing rail joints, adequately metal contact truck frame to underside of car (in the case of the
bumps and to minimize between vertical damper) or to traction center (in the case of
transient oscillations damper the horizontal damper). This reduces ride quality
after crossing such piston and and increases stresses on all truck components.
bumps. cylinder Time to replace a defective damper in Depot 1
hour.
2 To insulate passengers Fails to insulate Layer spring Serious loss of stiffness means that secondary
from shocks caused by passengers stiffness suspension is provided by the air bag only. This
crossing rail joints, adequately decreases reduces ride comfort and increases shock loads
bumps and to minimize especially on the air bag itself. Time to replace
transient oscillations layer spring at Depot 8 hours. See also 45.
after crossing such
bumps.
2 To insulate passengers Fails to insulate Air bag, Car has no secondary suspension at all, so all
from shocks caused by passengers layer spring, shocks which pass through the primary suspension
crossing rail joints, adequately and are transmitted directly to the car. Ride becomes
bumps and to minimize emergency very rough and stresses on local truck components
transient oscillations spring all fail are severely increased. Replacement of the three
after crossing such suspension components takes 8 hours at the Depot.
bumps.
2 To insulate passengers Fails to Oil leaks out In the case of the vertical damper, full damping
from shocks caused by minimize of damper capability would have to be provided by the damper
crossing rail joints, oscillations seals opposite, which might not be able to cope and
bumps and to minimize (vertical or hence which might also fail rapidly itself. Even if
transient oscillations horizontal the opposite damper did not fail, damping
after crossing such damper) efficiency is impaired so oscillations are not
bumps. effectively damped, which could cause discomfort
on longer journeys. There is only one horizontal
damper, so the effect of loss of this damper is
immediate. Under damping also increases cyclic
stresses on other suspension components, especially
the torsion bar, which could shorten the life of these
components. Time to replace a defective damper in
Depot 1 hour.
2 To insulate passengers Fails to Damper non In the case of the vertical damper, full damping
from shocks caused by minimize return valve capability would have to be provided by the damper
crossing rail joints, oscillations fails in open opposite, which might not be able to cope and
bumps and to minimize position hence which might also fail rapidly itself. Even if
transient oscillations the opposite damper did not fail, damping
after crossing such efficiency is impaired so oscillations are not
bumps. effectively damped, which could cause discomfort
on longer journeys. There is only one horizontal
damper, so the effect of loss of this damper is
immediate. Under damping also increases cyclic
stresses on other suspension components, especially
the torsion bar, which could shorten the life of these
components. Time to replace a defective damper in
Depot 1 hour.
2 To insulate passengers Fails to Damper Dampers come adrift and oscillations are not
from shocks caused by minimize mounting effectively damped, which causes discomfort and
crossing rail joints, oscillations bolts become may induce motion sickness on longer journeys.
bumps and to minimize detached Horizontal damper could be dragged along a rail. It
transient oscillations may also drop off in front of a wheel, possibly
after crossing such leading to derailment. Time to replace a defective
bumps. damper in Depot 1 hour.
3 To insulate passengers Fails to insulate Compound The car body is still supported by the secondary
from jerks during passengers spring suspension, but the center pivot crashes back and
acceleration and braking from jerky retaining nut forth against the traction center when starting and
stops and starts fails, leading stopping. This causes a jerky ride and considerably
to increases shock loads on the truck and local car
dislocation components (especially the center pivot, traction
of the center and air bags). A dislocated spring could also
compound prevent the truck from curving correctly, which
spring may lead to a derailment under adverse conditions
of load and speed. Time to rectify this defect 2
hours at the Depot. (Note that the retaining nut is
held in place by the split pin, so this failure would
not occur if the split pin is in place)
3 To insulate passengers Fails to insulate Compound The car body is still supported by the secondary
from jerks during passengers spring rubber suspension, but the center pivot crashes back and
acceleration and braking from jerky deteriorates forth against the traction center when starting and
stops and starts stopping. This causes a jerky ride and considerably
increases shock loads on the truck and local car
components (especially the center pivot, traction
center and air bags). A dislocated spring could also
prevent the truck from curving correctly, which
may lead to a derailment under adverse conditions
of load and speed. Time to rectify this defect 2
hours at the Depot.
3 To insulate passengers Fails to insulate Traction link Starting and stopping forces are damped only by the
from jerks during passengers rubber bush compound spring, which leads to a jerky ride and a
acceleration and braking from jerky fails general increase in shock loads. Time to replace
stops and starts bush 2 hours.
4 To control the roll Fails to control Torsion bar If the torsion bar shears, one end of the car body
angle of the car body the roll angle of shears lurches from side to side during cornering. This
relative to the truck the car body at could disturb and possibly frighten passengers. The
all car also becomes highly unstable and the resulting
loss of balance could lead to derailment, especially
if a heavily loaded car was going at high speed
round a corner. Time to replace the torsion bar in
Depot 4 hours.
4 To control the roll Fails to control Torsion bar The torsion bar would rotate by itself and cause
angle of the car body the roll angle of retaining key noise and vibration. However, the torsion bar would
relative to the truck the car body at fails not be sheared, so derailment is unlikely to occur.
all Time to replace the torsion bar in Depot 4 hours.
4 To control the roll Fails to control Torsion bar Torsion bar has nothing to act against, causing one
angle of the car body the roll angle of turnbuckle end of the car to lurch from side to side during
relative to the truck the car body at fastening cornering, disturbing and possibly frightening
all comes passengers. The car also becomes highly unstable
undone and the resulting loss of balance could lead to
derailment, especially if a heavily loaded car was
going at high speed round a corner. Time to
reconnect the turnbuckle in Depot 4 hours.
4 To control the roll Fails to control Torsion bar Excessive clearance means that the torsion bar rests
angle of the car body the roll angle of bearing worn directly on the edge of the bearing housing. The
relative to the truck the car body at due to lack resulting point load on the torsion bar greatly
all of increases the chances of the bar shearing, causing
lubrication instability and a possible derailment. Time to
replace this bearing at Depot 4 hours.
5 To ensure that the Unable to Air bag leaks If the step is not level with the platform, a
carriage floor is level ensure that via top plate passenger could trip and fall. Time to replace air
with the platforms when carriage floor is of car bolster bag at Depot 8 hours. See also 22 above
train stops at a station level with the faster than it
platform can be
pumped in
5 To ensure that the Unable to Air bag If the step is not level with the platform, a
carriage floor is level ensure that bursts passenger could trip and fall. Time to replace air
with the platforms when carriage floor is bag at Depot 8 hours.
train stops at a station level with the
platform
5 To ensure that the Unable to Leveling Air bag cannot be charged efficiently so carriage
carriage floor is level ensure that valve floor cannot be aligned with platform before
with the platforms when carriage floor is turnbuckle passengers start moving on and off the train. This
train stops at a station level with the loose means that a passenger could trip and fall. This
platform failure occurred quite often in the past, but the
locknut and spring washer were replaced by a nylon
washer, and it has not happened for a year.
5 To ensure that the Unable to Layer spring Car body sags, which can be compensated for
carriage floor is level ensure that stiffness initially by adding adjustment shims. Serious loss
with the platforms when carriage floor is decreases of stiffness means that shims can no longer
train stops at a station level with the compensate. Time to replace layer spring at depot 8
platform hours.
6 To assist in stopping Completely Brake pad One worn pad is unlikely to affect the stopping
the train at up to 0.88 unable to assist worn more performance of the whole train, but a number of
m/s² in stopping the than 10 mm worn pads could do so. Pads are usually replaced
train when wear exceeds 7 mm and it takes 20 minutes to
replace a pad in the Depot.
6 To assist in stopping Completely Brake disk One worn disc would not have a significant
the train at up to 0.88 unable to assist wear exceeds effect on the stopping performance of the whole
m/s² in stopping the 2.5 mm train, but several worn disks would do so. Disks are
train re-profiled on the under floor wheel lathe when
wear exceeds 2 mm. This takes 2 hours.
6 To assist in stopping Completely Brake pad Brake pad holder scratches the disk, so the disk has
the train at up to 0.88 unable to assist falls off to be re-profiled (2 hours) and brake pad replaced
m/s² in stopping the (20 minutes). One worn disc would not have a
train significant effect on the braking performance but
several worn discs would do so.
7 To prevent direct Unable to Vertical The axle box could hammer against the truck frame
contact between axle box prevent contact bump stop when passing over bumps, leading to deformation
and truck frame under between axle missing of the axle box and possible accelerated failure of
severe bounce conditions box and truck the axle bearings. Time to replace the bump stop in
under severe Depot up to 8 hours.
bounce
conditions
8 To permit the truck to Truck cannot Lifting point This failure could occur while the truck is
be lifted and/or the car to be lifted or car fails due to suspended in mid-air, which means that it could fall
be towed easily towed easily wear or onto somebody. Time to repair eye by welding 3
corrosion hours.
8 To permit the truck to Truck cannot Lifting point Eye could be weakened or the truck could be
be lifted and/or the car to be lifted or car damaged by improperly secured for lifting, causing a suspended
be towed easily towed easily external truck to fall, possibly onto somebody. Time to fit
force new eye 3 hours.
8 To permit the truck to Truck cannot Lifting point Truck could not be lifted at all using the eye, so
be lifted and/or the car to be lifted or car sheared off alternative arrangements would have to be made.
be towed easily towed easily by external
force
9 To ensure that wheel Wheel set falls Tie bar Wheel set could drop onto somebody while the
sets remain attached to off truck while fractures truck is suspended in mid-air. Time to replace the
truck while truck is truck is being tie bar up to 8 hours in the Depot.
being lifted lifted
Page 229
Optimal Maintenance Decisions (OMDEC) Inc 2004
Function Statement Failure Failure Effects
mode
10 To insulate the car Incapable of Emergency This failure on its own has no effect. If the air bag
from shocks to some insulating the spring fails fails and the emergency spring both fail, secondary
extent if the air bag fails car if the air suspension has to be provided by the layer spring
bag fails on its own. 30 above explains what happens if air
bag, layer spring and emergency spring all fail.
Time to replace the emergency spring at Depot 8
hours.
11 To limit lateral Unable to limit Lateral bump Under extreme conditions of lateral load, car bolster
movement of car relative lateral stop rubbers stool could hit truck frame, reducing ride comfort
to truck movement of worn away and generally increasing shock loads. Time to
car relative to replace lateral bump stop rubber at Depot 8 hours.
truck
11 To limit lateral Unable to limit Lateral bump Under extreme conditions of lateral load, car bolster
movement of car relative lateral stop falls off stool could hit truck frame, reducing ride comfort
to truck movement of and generally increasing shock loads. Time to
car relative to replace lateral bump stop rubber at Depot 8 hours.
truck
12 To prevent traction Unable to Split pin falls This failure only matters if the retaining nut starts
link retaining nut from prevent traction out coming loose. If the retaining bolt falls out, effects
coming undone link retaining are described in 12 above. Time to replace split pin
nut from falling at Depot 1 hour.
off bolt
13 To prevent compound Unable to Split pin falls This failure only matters if the retaining nut starts
spring retaining nut from prevent the out coming loose. If the retaining nut falls off, the
coming undone compound compound spring would fall off. Large clearance
spring retaining between the center pivot and the center plate would
nut from falling cause fierce vibrations in the car compartment and
off further damage to the bolster stool. Time to replace
split pin in Depot 1 hour.
Example 2
Function 1: To supply air to conditioned air distribution ducts at the temperature called for by pack temperature controller
Functional failure A: Conditioned air is not supplied at called-for temperature
Failure mode 1: Air-cycle machine seized
Failure effects: Reduced pack flow, anomalous readings on pack-flow indicator and other instruments
Failure mode 2: Blocked ram-air passages in heat exchanger
Failure effects: High turbine-inlet temperature and partial closure of flow-control valve by over-temperature protection, with resulting reduction in pack airflow
Failure mode 3: Failure of anti-ice valve
Failure effects: If valve fails in open position, increase in pack discharge temperature; if valve fails in closed position, reduced pack airflow
Failure mode 4: Failure of water separator
Failure effects: Condensation (water drops, fog, or ice crystals) in cabin

Function 2: To be able to prevent loss of cabin pressure by backflow if the duct fails in unpressurized nose-wheel compartment
Functional failure A: No protection against backflow
Failure mode 1: Failure of bulkhead check valve
Failure effects: None (hidden function); if duct and/or connectors fail in pack bay, loss of cabin pressure by backflow, and airplane must descend to lower altitude
Example 3
Item description: Distributed control system (DCS)

Function: To provide safe, secure, uninterrupted, redundant, cost effective, continuous process control and monitoring according to the target product of the day, within the parameters specified by product specification and by current environmental regulations, in the presence of a UPS (uninterruptible power supply)

Functional failure: Fails to provide security
Failure mode: Unauthorized usage of console either when unattended or if password stolen
Failure effects: An unauthorized and untrained person gains access to an operating console or an engineering console. This may lead to a condition where loss of life or environmental disaster can occur. In this eventuality legal or civil proceedings will likely be brought against the Company.

Functional failure: Unable to log in
Failure mode: Password forgotten
Failure effects: Operator unable to control the plant. Operator would look for another console which has a log in. In a worst case scenario all consoles would be locked out and emergency shutdown would be initiated if the operator suspects abnormal operation at that particular time.

Functional failure: Unable to protect against loss of control
Failure mode: UPS has failed
Failure effects: Under normal conditions this failure would be noticed by the operator who checks the alarms in the normal execution of his daily tasks.

Functional failure: Control lost
Failure mode: Complete loss of communication with ring
Failure effects: Unreliable or no data shown on console. Operator loses ability to control the plant. Emergency shutdown initiated. The most common cause of this failure in the past has been contractors inadvertently cutting cables. This is likely to take at least 2 hours to one day to fix, entailing a loss of production. This failure mode is considered to be a rare event.
Failure mode: Complete loss of communication with controller node
Failure effects: One node goes off line. This could be preceded by any of dirt fouling of fan, moisture penetration, RF interference, electronic component failure. Partial or complete shutdown depending on importance of node. Unreliable or no data shown on console. Operator loses ability to control the plant. Emergency shutdown initiated. The most common cause of this failure in the past has been contractors inadvertently cutting cables. This is likely to take at least 2 hours to one day to fix, entailing a loss of production. This has happened occasionally in the past.
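The three examples above share one record structure: a function has functional failures, each functional failure has failure modes, and each failure mode carries its effects. A minimal Python sketch of that hierarchy follows. The class and field names are illustrative assumptions, not part of any RCM standard, and the function statement and effects are abbreviated from Example 3.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FailureMode:
    description: str  # the cause, e.g. "Password forgotten"
    effects: str      # what happens when this failure mode occurs


@dataclass
class FunctionalFailure:
    description: str
    failure_modes: List[FailureMode] = field(default_factory=list)


@dataclass
class Function:
    statement: str
    functional_failures: List[FunctionalFailure] = field(default_factory=list)


def flatten(function: Function) -> Iterator[Tuple[str, str, str, str]]:
    """Yield one worksheet row per failure mode."""
    for ff in function.functional_failures:
        for fm in ff.failure_modes:
            yield (function.statement, ff.description, fm.description, fm.effects)


# Abbreviated data from Example 3 (hypothetical shortening of the full text).
dcs = Function(
    statement="To provide safe, secure, uninterrupted process control and monitoring",
    functional_failures=[
        FunctionalFailure("Unable to log in", [
            FailureMode("Password forgotten", "Operator unable to control the plant."),
        ]),
        FunctionalFailure("Control lost", [
            FailureMode("Complete loss of communication with ring",
                        "Emergency shutdown initiated."),
            FailureMode("Complete loss of communication with controller node",
                        "Partial or complete shutdown depending on importance of node."),
        ]),
    ],
)

rows = list(flatten(dcs))
```

Flattening produces one row per failure mode, which matches the way the worksheets repeat the function and functional failure for each cause.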
Chapter 16. The RCM Decision Algorithm
Questions 5, 6, and 7
The process
While failure analysis may have some small intrinsic interest of its own, the reason for
our concern with failure is its consequences. These may range from the modest cost of
replacing a failed component to the possible destruction of a piece of equipment,
devastating harm to the environment, or the loss of lives. Thus all reliability-centered
maintenance, including the need for redesign, is indicated not by the frequency of a
particular failure, but by the nature of its consequences. Any preventive-maintenance
program is therefore based on the following precept:
The more complex any piece of equipment is, the more ways there are in which it can
fail. All failure consequences, however, can be grouped in the following four categories:
1. Hidden-failure (H) consequences, which have no direct impact, but increase the
likelihood of a multiple failure
2. Safety or environmental (S) consequences
3. Operational (O) consequences, which involve indirect economic loss as well as
the direct cost of repair
4. Nonoperational maintenance (M) consequences, which involve only the direct
cost of repair
Example 1 shows several of the records from the full analysis of the rail passenger car
truck. In the column “H S P M” we decide, from the effects description, whether the
consequences are hidden, safety or environmental, production (operational) or
maintenance (non operational). We test each of the four possible consequences in this
order, and we stop as soon as we ascertain that the circumstances (effects) of the
failure mode provoke the consequence being tested.
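The ordered test described above can be sketched in a few lines of Python. This is only an illustration: the function name and its boolean parameters are assumptions, and in practice each answer comes from the analysts' reading of the effects description.

```python
def classify_consequence(hidden: bool,
                         safety_or_environment: bool,
                         operational: bool) -> str:
    """Test H, S, P, M in that order and stop at the first that applies."""
    if hidden:
        return "H"  # hidden failure
    if safety_or_environment:
        return "S"  # safety or environmental consequences
    if operational:
        return "P"  # production (operational) consequences
    return "M"      # only the direct cost of repair remains


# A failure that is evident and affects both safety and production
# is recorded as "S", because S is tested before P.
example = classify_consequence(hidden=False,
                               safety_or_environment=True,
                               operational=True)
```

Because the tests stop at the first hit, each failure mode receives exactly one of the four letters.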
Example 1
Worksheet columns: Ctrl. No.; Function Statements; Failed States; Failure Modes; Effects; H S P M; decision matrix (Figure 16-1); Proposed task; Initial Intv’l; By

Ctrl. No.: 1
Function statement: To provide smooth rolling support for half the weight of a passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph
Failed state: Fails to provide support
Failure mode: Weld in frame fails due to fatigue
Effects: The truck as a whole collapses. This is most likely to occur when the car is most heavily loaded, in other words when it is full of passengers, and probably while the train is going round a corner. As a result, it would almost certainly be derailed. At present, the truck is replaced when a crack longer than 100 mm is found. (Such a crack would be found during the course of other inspections that occur often enough to detect it.) Downtime to replace truck on its own 16 hours.
Proposed task: Inspect frame for cracks greater than 100 mm
Initial interval: To be included with other scheduled tasks
The RCM decision algorithm is represented by the matrix of Figure 16-1, which is also
included in the heading of the decision half of the RCM worksheet.
Figure 16-1 RCM Decision Diagram. Redesign, “R”, is mandatory in rows “H” and “S” if no
proactive task reduces the consequences of failure to a tolerable level. The full text of each cell is
given below.
We execute the RCM decision logic by beginning at the top left of the matrix. We decide
upon the appropriate row (the branch of the decision tree corresponding to the consequence that
was previously attributed to the failure mode) and work towards the right. The letter in
each cell of the matrix represents a question (step) in the RCM decision algorithm. The
full text of the questions (below) should be recited explicitly as the decision diagram is
being traversed. Avoid the tendency to abbreviate the questions so much that their
meaning is lost or distorted.
Full text of decision diagram questions
H. Is the function's failed state hidden? That is, will the failure go unnoticed until another
function fails or some extraordinary event occurs?
S. Does the failure affect safety, health, or the environment?
O. Can the failure provoke operational (production) consequences? These include cost,
quality, and customer service.
M. Are the only consequences those that affect maintenance or the maintenance budget?
C. Is a condition based maintenance (CBM) task applicable? Can it reliably detect the
'failing' state early enough to reduce the failure's probability and/or its consequences to a
tolerable level? Is it effective? Does it make economic sense to perform this task at the
frequency required?
T. Is a time based maintenance task applicable? Is there an age (useful life) at which the
probability of failure due to this failure mode increases rapidly, and do most items
survive to this age? Is it effective? Can a routine (TBM) task reduce the failure's probability
and/or its consequences to a tolerable level? Two types of time based tasks are considered
under this heading: 1) Scheduled Overhaul, and 2) Scheduled Discard, the latter being
mandatory for a “safe-life” item178.
D. Is a detection task applicable? Will it reduce the multiple failure's probability to a
tolerable level? Is it effective? Is it practical to do the task at the required interval?
2. Can a combination of 2 or more TBM and CBM tasks be applicable (avoid or reduce
the safety consequences to a tolerable level)? Are they effective (practical)?
N. No time based or condition based activities need be scheduled.
R. A hardware, software, or procedural modification that will reduce the failure's
probability and/or its consequences to a tolerable level is mandatory (H or S) or may be
desirable (P or M).
For the failure mode (cause) “Weld in frame fails due to fatigue” we ask whether the
failure is hidden. Since the failure’s direct effects will be clearly visible (probably
catastrophic) to operating personnel, this failure is not hidden. Therefore we proceed to
the next cell to the right and ask whether there is a CBM task that is applicable and
effective. We need search no further than the effects description to learn that it is entirely
feasible to detect a crack at the potential failure stage of 100 mm length. The task will be
effective (economically feasible) because there will be ample opportunity to
perform this inspection often enough during other routine work (to be described in
subsequent rows of the analysis). Hence we stop at that point and enter “C” under the
second column of the matrix.
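The traversal just described can be sketched as a lookup over the rows of the decision matrix. The row layout below is an assumed reading of Figure 16-1 and the question texts (C and T in every row, D only for hidden failures, 2 only for safety failures, with R or N as the default), and the names are hypothetical. Placing the weld failure in row “S” is likewise an assumption, based on the derailment described in its effects.

```python
# Assumed reading of Figure 16-1: questions asked, left to right, in each row.
DECISION_ROWS = {
    "H": ["C", "T", "D"],
    "S": ["C", "T", "2"],
    "P": ["C", "T"],
    "M": ["C", "T"],
}


def rcm_decision(consequence: str, answers: dict) -> str:
    """Return the letter of the first question answered 'yes'.

    `answers` maps a question letter to True/False; a missing letter
    counts as 'no'. If no proactive task qualifies, redesign (R) is the
    default in rows H and S, and no scheduled maintenance (N) otherwise.
    """
    for question in DECISION_ROWS[consequence]:
        if answers.get(question, False):
            return question
    return "R" if consequence in ("H", "S") else "N"


# Weld in frame fails due to fatigue: a CBM task (crack inspection) is
# applicable and effective, so the traversal stops at "C".
weld_decision = rcm_decision("S", {"C": True})
```

In rows P and M the sketch returns “N” as the default; the text notes that redesign may still be desirable there, which would be a separate judgment outside this lookup.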
Example 2
Two functions have been listed for the air-conditioning pack. Its basic function is to
supply air to the distribution duct at the temperature called for by the pack controller.
178
An item whose failure has safety or environmental consequences and whose potential failure is not
adequately detectable, and the item ages (e.g. fatigue, wear, corrosion...)
We apply the decision algorithm to this function first.
[RCM worksheet for the air-conditioning pack. Columns: Ctrl. No.; Function Statements; Failed States; Failure Modes; Effects; H S P M; decision matrix (Figure 16-1); Proposed task; Initial Intv’l; By]
Any one of the failure modes listed will result in changes in the pack’s performance, and
these anomalies will be reflected by the cockpit instruments. Hence the functional failure
in this case can be classified as evident.
The loss of function in itself does not affect operating safety; however, each of the
failure modes must be examined for possible secondary damage:
S. Does the failure cause a loss of function or secondary damage that could have a
direct adverse effect on operating safety?
Engineering study of the design of this item shows that none of the failure modes causes
any damage to surrounding items, so the answer to this question is no.
Because the packs are fully replicated, the aircraft can be dispatched with no operating
restrictions when any one pack is inoperative. Therefore there is no immediate need for
corrective maintenance. In fact, the aircraft can be dispatched even if two units are
inoperative, although in this event operation would be restricted to altitudes of less than
25,000 feet.
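The dispatch rule in the paragraph above can be written as a small function. This is a sketch only: the function name is hypothetical, and the text does not say what applies when more than two packs are inoperative, so that case is rejected rather than guessed.

```python
from typing import Optional


def dispatch_altitude_limit_ft(inoperative_packs: int) -> Optional[int]:
    """Return the altitude ceiling in feet, or None when there is no restriction."""
    if inoperative_packs < 0:
        raise ValueError("number of inoperative packs cannot be negative")
    if inoperative_packs <= 1:
        return None      # dispatch with no operating restrictions
    if inoperative_packs == 2:
        return 25_000    # operation restricted to altitudes below 25,000 feet
    raise ValueError("the text does not cover more than two inoperative packs")


one_pack_out = dispatch_altitude_limit_ft(1)
two_packs_out = dispatch_altitude_limit_ft(2)
```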
When we examine the second function of the air-conditioning pack, however, we find an
element that does require scheduled maintenance. The bulkhead check valve, which
prevents backflow in case of a duct failure, is of lightweight construction and flutters
back and forth during normal operation. Eventually mechanical wear will cause the
flapper to disengage from its hinge mount, and if the duct in the unpressurized nose-wheel
compartment should rupture, the valve will not seal the entrance to the pressurized cabin.
To analyze this second type of failure we start again with the first question in the decision
diagram:
The crew will have no way of knowing whether the check valve has failed unless there is
also a duct failure. Thus the valve has a hidden function, and scheduled maintenance is
required to avoid the risk of multiple failure – failure of the check valve, followed at
some later time by failure of the duct. Although the first failure would have no
operational consequences, this multiple failure would necessitate descent to a lower
altitude, and the airplane could not be dispatched after landing until repairs were made.
With a no answer to question 1, proposed tasks for the check valve fall in the
hidden-function branch of the decision diagram:
Engineering advice is that the duct can be disconnected and the valve checked for signs
of wear. Hence an on-condition task is applicable. To be effective the inspections must
be scheduled at short enough intervals to ensure adequate availability of the hidden
function. On the basis of experience with other fleets, an initial interval of 10,000 hours
is specified, and the analysis of this function is complete.
In this case inspecting the valve for wear costs no more than inspecting for failed valves
and is preferable because of the economic consequences of a possible multiple failure. If
a multiple failure had no operational consequences, scheduled inspections would still be
necessary to protect the hidden function; however, they would probably have been
scheduled at longer intervals as a failure-finding task.
Example 3
Item Number: Loop 2-Olefins
Item Description: Distributed control system. Continuous process. Unionized. 500 employees. See
business plan. Biggest product: ethylene. Can also produce gasoline. Two lines: 1. Material flow 2. Olefins.
Raw material is safely stored at high pressure (6000 MPa) in underground storage caverns. It is pipelined to
production facilities. Ethylene is converted to polyethylene. There is a "hot side" and a "cold side". Raw material
undergoes cracking (breaking carbon chains) and becomes ethylene. The plant extends over several acres (a
square kilometre). The DCS is integral to the entire production line. There are 3 different types of DCS.
Recently there has been a benzene spill. Environmental excursions occur occasionally. Installed 1996. Capital
expenditures have been curtailed recently. Individual heaters can be shut down for maintenance.
Function statement: To provide safe, secure, uninterrupted, redundant, cost effective, continuous process control and monitoring according to the target product of the day, within the parameters specified by product specification and by current environmental regulations, in the presence of a UPS (uninterruptible power supply)

Failed state: Fails to provide security
1 Failure mode: Unauthorized usage of console either when unattended or if password stolen
Effects: An unauthorized and untrained person gains access to an operating console or an engineering console. This may lead to a condition where loss of life or environmental disaster can occur. In this eventuality legal or civil proceedings will likely be brought against the Company.
Consequence: S
Decision: R
Proposed task: An authentication system (ID card, biometric, etc.) is mandatory

Failed state: Unable to log in
2 Failure mode: Password forgotten
Effects: Operator unable to control the plant. Operator would look for another console which has a log in. In a worst case scenario all consoles would be locked out and emergency shutdown would be initiated if the operator suspects abnormal operation at that particular time.
Consequence: P
Decision: R
Proposed task: Require logout at shift change.

Failed state: Unable to protect against loss of control
3 Failure mode: UPS has failed
Effects: Under normal conditions this failure would be noticed by the operator who checks the alarms in the normal execution of his daily tasks.
Consequence: M
Decision: N

Failed state: Control lost
4 Failure mode: Complete loss of communication with ring
Effects: Unreliable or no data shown on console. Operator loses ability to control the plant. Emergency shutdown initiated. (see l2). The most common cause of this failure in the past has been contractors inadvertently cutting cables. This is likely to take at least 2 hours to one day to fix, entailing a loss of production. This is considered to be a rare event.
5 Failure mode: Complete loss of communication with controller node
Effects: One node goes off line. This could be preceded by any of dirt fouling of fan, moisture penetration, RF interference, electronic component failure. Partial or complete shutdown depending on importance of node. Unreliable or no data shown on console. Operator loses ability to control the plant. Emergency shutdown initiated. (see l2). The most common cause of this failure in the past has been contractors inadvertently cutting cables. This is likely to take at least 2 hours to one day to fix, entailing a loss of production. This has happened occasionally.
6 Failure mode: All consoles fail
7 Failure mode: Complete loss of communication on module bus
8 Failure mode: Complete loss of communication on slave bus
9 Failure mode: Console LAN fails

Failed state: Redundancy lost
10 Failure mode: Console hardware or software fails
11 Failure mode: Controller hardware or software fails
12 Failure mode: Power supply fails
13 Failure mode: IO cards
Note that no attempt is made to design the proposed authentication system. RCM analysis
leaves the detailed redesign to a separate group, assembled for that specific purpose,
in which the necessary specialists are on hand.
Example 4
Figure 16-2 The shock-strut assembly on the main landing gear of the Douglas DC-10. The outer
cylinder is a structurally significant item.
Structures Worksheet: type of Aircraft Douglas DC-10-10
Item Number: 101
No. per aircraft: 2
Item Name: Shock-strut outer cylinder
Major area: main landing gear
Vendor part/model no: PN ARG 7002-505
Zones: 144, 145
Description/location details: Shock-strut assembly is located on main landing gear; SSI consists of outer cylinder (both faces)
Design criterion:
Damage tolerant element: __
Safe-life element: Yes
Inspection access:
Internal: Yes
External: Yes
Material (include manufacturer's trade name): Steel alloy 4330 MOD (Douglas TRICENT 300 M)
Redundancy and external detectability: No redundancies; only one cylinder each landing gear, left and right wings. No external detectability of internal corrosion.
Is element inspected via a related SSI? If so, list SSI no.: No
Classification of item (significant/nonsignificant): significant
Fatigue-test data
Expected fatigue life:
Crack propagation:
Established safe-life: 46,800 landings, 70,200 oper. hours
Design conversion ratio: 1.5 operating hours/flight cycle
Rating columns: Residual strength; Fatigue life; Crack growth; Corrosion; Accidental damage; Class no.; Controlling factor; Inspection (int./ext); Proposed task; Initial interval
1. Damage-tolerant item: A monolithic or multiple load path item in which
a crack or complete failure of an element will not reduce residual strength
below the safety level prior to detection, or
2. Safe-life item: A structurally significant item whose potential failure is
not reliably detectable.
Table 16-1 explains the rating system for the first 5 columns of Figure 16-3. The analysis
shows the treatment of a safe-life item in an airline context. Because the shock-strut outer
cylinder on the main landing gear of the Douglas DC-10 has been classified as a safe-life
item, it must be discarded before a fatigue crack is expected to occur. Hence it is not rated
for residual strength, fatigue life, or crack propagation characteristics (the first three
columns of Figure 16-3). The Class Number of column 6 is set to the minimum of the
ratings in columns 1 to 5. The “controlling factor” is that which corresponds to the minimum (of
the 5 columns).
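The two safe-life figures on the structures worksheet are linked by the design conversion ratio. The short Python check below works through that arithmetic; it assumes one landing per flight cycle, and the function names are illustrative only.

```python
SAFE_LIFE_LANDINGS = 46_800      # established safe-life, from the worksheet
HOURS_PER_FLIGHT_CYCLE = 1.5     # design conversion ratio, from the worksheet


def safe_life_hours(landings: int, hours_per_cycle: float) -> float:
    """Convert a safe-life limit in landings to operating hours."""
    return landings * hours_per_cycle


def landings_remaining(accumulated_landings: int) -> int:
    """Landings left before the item must be discarded (never negative)."""
    return max(SAFE_LIFE_LANDINGS - accumulated_landings, 0)


limit_hours = safe_life_hours(SAFE_LIFE_LANDINGS, HOURS_PER_FLIGHT_CYCLE)
```

The result, 70,200 operating hours, agrees with the value printed on the worksheet.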
Safe-life limits are only effective, however, if nothing prevents the item from reaching
them. In the case of structural items, there are two factors that introduce this possibility –
corrosion and accidental damage. Experience has shown that landing-gear cylinders of
this type are subject to two corrosion problems. First, the outer cylinder is susceptible to
corrosion from moisture that enters the joints at which other components are attached;
second, high-strength steels such as 4330 MOD are subject to stress corrosion in some of
the same areas. The item is given a corrosion rating of 1, which results, therefore, in an
overall class number of 1.
The onset of corrosion is more predictable in a well-developed design than in a new one.
Previous operation of a similar design in a similar environment has shown that severe
corrosion is likely to develop by 15,000 to 20,000 hours (five to seven years of
operation). It can be detected only by inspection of the internal joints after shop
disassembly; hence this inspection will be performed only in conjunction with scheduled
inspections of the landing-gear assembly. This corrosion inspection requirement is,
therefore, one of the controlling factors in establishing the shop-inspection interval.
In addition to the corrosion rating, the shock-strut cylinder is rated for susceptibility to
accidental damage. The cylinder is exposed to relatively infrequent damage from rocks
and other debris thrown up by the wheels. The material is also hard enough to resist most
179
Age exploration derived intervals such as these are continuously refined as experience with the item is
accrued.
such damage. Its susceptibility is therefore very low, and the rating is 4. However,
because the damage is random and cannot be predicted, a general check of the outer
cylinder, along with the other landing-gear parts, is included in the walkaround
inspections and the A check, with a detailed inspection of the outer cylinder scheduled at
the C-check interval.
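The class number and controlling factor for the shock-strut cylinder follow directly from its ratings. The sketch below applies the minimum rule to the two ratings given in the text (corrosion 1, accidental damage 4). The dictionary keys and function name are assumptions, and factors that are not rated for a safe-life item are represented as None.

```python
from typing import Dict, Optional, Tuple


def class_number(ratings: Dict[str, Optional[int]]) -> Tuple[int, str]:
    """Return (class number, controlling factor) as the minimum rated factor."""
    rated = {name: value for name, value in ratings.items() if value is not None}
    if not rated:
        raise ValueError("at least one factor must be rated")
    controlling = min(rated, key=rated.get)
    return rated[controlling], controlling


# Safe-life item: the first three factors are not rated.
shock_strut = {
    "residual strength": None,
    "fatigue life": None,
    "crack propagation": None,
    "corrosion": 1,
    "accidental damage": 4,
}

strut_class, strut_factor = class_number(shock_strut)
```

For this item the class number is 1 and the controlling factor is corrosion, as stated in the text.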
Table 16-1
Each factor is rated from 1 to 4 against the measure shown:
Reduction in residual strength: No. of elements that can fail without reducing strength below damage tolerant level
Fatigue life of element: Ratio of fatigue life to design goal
Crack-propagation rate: Ratio of interval to fatigue-life design goal
Susceptibility to corrosion: Ratio of corrosion-free age to fatigue-life design goal
Susceptibility to accidental damage: Exposure as a result of location

Measure and rating, by row:
One 1 1/8 1 1/8 1 High 1
Two or more180 2 ¼ 2 ¼ 2 Moderate 2
Two or more181 3 3/8 3 3/8 3 Low 3
Two or more182 4 ½ 4 ½ 4 Very low 4
180
75% reduction in the margin between ultimate and damage tolerant level
181
50% reduction in the margin between ultimate and damage tolerant level
182
25% reduction in the margin between ultimate and damage tolerant level
The worksheet guide of Figure 16-4 summarizes the processes of Part 3. Reliability Centered Maintenance.
Chapter 17. Integrating Reliability
Information - MIMOSA
When ideas achieve currency, momentous change ensues. A unified approach to
information sharing in operations and maintenance (O&M) has been gathering
momentum over the past decade. MIMOSA183, the OPC Foundation184, and the ISA185
have launched OpenO&M, a comprehensive, open information architecture for
unfettered technical collaboration in the modern O&M environment.
recommendations on when and how to perform a maintenance intervention on an asset.
Figure 17-1 is a MIMOSA UML186 class diagram that shows the role of an intelligent
agent187.
Each of the boxes in the UML class diagram (Figure 17-1) represents a class188. Lines
joining the classes represent relationships. The relationships of Figure 17-1 are called
associations189. The red line, with the diamond head, joining the
AssetRecommendation and Database entities, is an aggregation190 (a “whole/part”
association.)191 The cardinality (“1” and “*” on the ends) indicates that the relationship is
one-to-many. One database maintains many AssetRecommendation records.
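In code, a one-to-many aggregation of this kind is commonly expressed as a "whole" object holding a list of "part" objects. The Python sketch below illustrates the idea only. The attribute names (`name`, `asset_id`, `text`) are hypothetical and are not taken from the MIMOSA CRIS schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssetRecommendation:
    asset_id: str  # hypothetical identifier of the asset concerned
    text: str      # the recommendation itself


@dataclass
class Database:
    name: str
    # The "*" end of the association: zero or more recommendations.
    recommendations: List[AssetRecommendation] = field(default_factory=list)

    def add(self, recommendation: AssetRecommendation) -> None:
        self.recommendations.append(recommendation)


db = Database("maintenance")
db.add(AssetRecommendation("TX-101", "Replace transmission at next shutdown"))
db.add(AssetRecommendation("TX-102", "Continue operating; resample oil in 250 hours"))
```

Each recommendation belongs to one database (the "1" end), while the database can hold any number of recommendations.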
MIMOSA has published a series of 15 such class diagrams. By studying these diagrams
we may understand the utility of and the reasoning behind the MIMOSA Common
Relational Information Schema (CRIS).
The MIMOSA classes, Segment and Asset (of Figure 17-1), require explanation. A
segment is a production process or sub-process or physical area192 on a site. An asset is
a piece of equipment (with a unique serial number) that can be allocated to a segment.
186
The UML is the Unified Modeling Language. See “The Unified Modeling Language User Guide” by
Grady Booch, James Rumbaugh, Ivar Jacobson, Addison-Wesley 1999, ISBN 0201571684
187
An intelligent agent is an automated entity that processes data and makes decisions and
recommendations. The MIMOSA Agent class may also include humans and organizations who fulfill the
same role.
188
A class is a specification for an object. An object can represent some physical item used in the business
process – for example a work order record in a database table of work orders.
189
An association is one of the four types of relationships: 1. Dependency, 2. Association, 3.
Generalization, and 4. Realization
190
A structural relationship such as a table belonging to a database or a printed circuit belonging to an
electronic device.
191
This means that an object of the whole has objects of the part.
192
For example “Compressor Room 1”
Figure 17-2: MIMOSA UML Class diagram "RegCore"
The association joining Segment and Site in Figure 17-2 has a solid diamond head.
The solid diamond indicates that a Segment belongs uniquely to a Site193. On the other
hand, an Asset (with an unfilled diamond head association) is only loosely
associated with a site. A piece of equipment can, in principle, be moved to another Site.
Furthermore, an asset can be removed from one segment and installed on another. The
association line joining Segment and Asset in Figure 17-2 reveals that relationship.
Examine that association line. Note the AssetUtilizationHistory class, connected by a
dashed line to the association line. The AssetUtilizationHistory class is called an
association class. It provides further clarity on the nature of that association. In this
relationship the objects (database records) of the AssetUtilizationHistory class record the
removals and the installations of assets on segments. These records provide the
suspension and failure “events” for an EXAKT CBM optimization model194.
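To illustrate how installation and removal records can yield failure and suspension events, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the `removed_due_to_failure` flag, and the use of operating hours are not taken from the MIMOSA schema or from EXAKT.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AssetUtilizationHistory:
    asset_id: str
    segment_id: str
    installed_at: float               # hypothetical: operating hours at installation
    removed_at: Optional[float]       # None while the asset is still installed
    removed_due_to_failure: bool = False


def to_events(history: List[AssetUtilizationHistory],
              now: float) -> List[Tuple[str, float, str]]:
    """Return (asset_id, time in service, 'failure' or 'suspension') per record."""
    events = []
    for record in history:
        end = record.removed_at if record.removed_at is not None else now
        failed = record.removed_at is not None and record.removed_due_to_failure
        kind = "failure" if failed else "suspension"
        events.append((record.asset_id, end - record.installed_at, kind))
    return events


history = [
    AssetUtilizationHistory("A1", "Compressor Room 1", 0.0, 4200.0, True),
    AssetUtilizationHistory("A2", "Compressor Room 1", 4200.0, 9000.0, False),
    AssetUtilizationHistory("A3", "Compressor Room 1", 9000.0, None),
]
events = to_events(history, now=12000.0)
```

A removal for any reason other than failure, or an asset still in service, is treated here as a suspension, which is the usual convention in age-reliability analysis.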
193
For example “Compressor Room 1” is strongly related to a Site.
194
Or any type of reliability (age exploration) analysis: Weibull, Pareto, Scatter, Cause and Effect, and
others.
We may conclude from the various MIMOSA UML class diagrams that the
MIMOSA and OPC OpenO&M architecture recognizes the vital role of intelligent agents
in maintenance decision planning.
The agent uses a statistical “data interpretation” model that has been built by correlating
historical event and condition monitoring data. The model accounts for the current
operating context of an asset. Finally, the model supports the user’s requirements
regarding that asset for which the enterprise defines its objectives, for example:
Chapter 18. Managing Strategy
Introduction
Improvement concepts such as “the maintenance dashboard”, “key performance
indicators”, and “benchmarking of the best of breed” resonate in the physical asset
management community. They are the stock in trade of the maintenance management
consultant. A far-sighted vision and a well-conceived strategy followed by a detailed
implementation effort, will, we expect, transform the maintenance function into an
ordered and controllable process.
With minor variations, two schools of thought dominate the scores of philosophies that
contend in the maintenance improvement marketplace. The symbolic “pyramid of
excellence” (Figure 18-1), and the metaphoric “RCM house” (Figure 18-2) convey their
respective, and somewhat conflicting, paths to “world class physical asset management”.
Figure 18-1 The Pyramid of Excellence195 Figure 18-2 The RCM "House"196
Order, rather than content, differentiates the two approaches. The former initializes its
improvement cycle by establishing a suitable maintenance infrastructure (tiers one and
two at the base of the pyramid). The latter insists that we retain (for the present) our
existing systems and structures, but that we begin (the improvement process) by
analyzing each significant physical asset’s functions, its failures, failure causes, effects,
and consequences. Doing so will determine the appropriate maintenance requirements,
the foundation (of the house of Figure 18-2). Proponents of the “Pyramid of
Excellence” emphasize culture change as an explicit management process. The RCM
camp contends that maintenance culture will adapt naturally with systematic RCM
education and implementation. The former devotes attention to effective planning and
scheduling, while the latter focuses on developing, through RCM analysis, the proactive
cyclic tasks (TBM and CBM) of the maintenance plan197.
195
From Uptime, John Dixon Campbell, Productivity Press, 1995
196
From the RCM II Practitioner’s course, John Moubray 1999
197
Along with the defaults “no scheduled maintenance” and redesign.
Summarizing, students of the “Pyramid” school defer reliability-centered maintenance
(RCM) analysis to a future time by placing it on the third tier. They see the processes
on tier 2 (such as data management and planning) as prerequisites for reliability analysis
(RCM). Advocates of the alternative point of view (the “House”) consign “systems” to
the roof (the last element to be erected in an improvement plan), while positioning RCM
analysis as the foundation. In this chapter we review software products such as Strategy
Manager™ 198, Real-Time Production Intelligence™199, and Real-Time Production
Management™200 that seek to unify the two201 approaches.
New software-enabled methodologies extend the reach of the occasional maintenance
audit by offering continuous day-to-day performance visibility and control. To
accomplish this, they integrate (using O&MOpen202 standards) with the CMMS,
process computers, and other plant systems.
198
Available from DEI Group (www.dei-group.com)
199
Available from ABB (www.abb.com)
200
Available from OSISoft (www.osisoft.com)
201
systems first or reliability first
202
See www.mimosa.org and the article EXAKT and MIMOSA
Physical asset management inputs, outputs, and control
Arrow 3 represents the way that maintenance policy relates to the KPIs, and arrow 4
represents how the actual KPIs achieve corporate vision. The physical asset manager
strives, first, to discover the intricate relationships governing how policy impacts KPIs.
Secondly, he seeks to know how achievement of the KPI targets will impact the balance
sheet and the corporation’s societal responsibilities of custodianship. We might express
the steps to world class performance as:
Note that the center block of Figure 18-3 specifies both KPIs and Age Exploration203.
KPIs often summarize the results of a maintenance policy. They seldom direct us to
203
A broad category of methods of analysis of failure and maintenance data. The analyses target ways to
improve current proactive maintenance policies on significant items in order to improve reliability and/or
lower cost. See Chapter 3.
specific policy changes regarding individual assets. On the other hand, age exploration
analyses (for example, Pareto analyses) focus our attention on individual significant items
whose collective performance governs the KPIs.
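As an illustration of the kind of age exploration analysis mentioned above, the following minimal Python sketch ranks items by accumulated downtime, Pareto style. The incident records, the item names, and the 80 percent cut-off are hypothetical and chosen only for the example.

```python
from collections import defaultdict

def pareto(incidents, cutoff=0.8):
    """Rank items by total downtime and flag the 'vital few'.

    incidents: iterable of (item, downtime_hours) pairs (hypothetical data).
    Returns a list of (item, total_hours, cumulative_share, is_vital).
    """
    totals = defaultdict(float)
    for item, hours in incidents:
        totals[item] += hours
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

    rows, cumulative = [], 0.0
    for item, hours in ranked:
        # An item is "vital" if the cut-off had not been reached before it.
        is_vital = cumulative < cutoff * grand_total
        cumulative += hours
        rows.append((item, hours, cumulative / grand_total, is_vital))
    return rows

# Hypothetical downtime incidents taken from work orders.
incidents = [("Crusher 1", 40.0), ("Conveyor 3", 5.0), ("Crusher 1", 30.0),
             ("Pump 7", 15.0), ("Mill 2", 10.0)]
rows = pareto(incidents)
for row in rows:
    print(row)
```

In this made-up data set two items account for 85 percent of the downtime, so they would be the first candidates for a closer look at their maintenance policies.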
The foregoing implies that our maintenance management system (CMMS) must embody
reliability-centered information (as outlined in Chapters 1, 2, and 3). Specifically, for
each significant item, the five reliability-centered knowledge elements:
1. “What function was lost or compromised?”,
2. “How (full, partial, potential, functional failure)?”,
3. “Why?”,
4. “What happened?”, and
5. “How did it matter?”
will populate the database on which the analyses operate. Furthermore, using our
system, we establish the relationship between the significant consequences of failure
(knowledge element 5) and the KPIs that achieve the corporate vision. In practical terms,
we use the performance management system to classify each incident (maintenance work
order, or production log item) involving downtime, speed loss, or quality loss, as one of
items 11 to 19 of Table 18-1. Additionally, we document the five RCM knowledge elements
that characterize each incident (see Figure 18-4).
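One way to picture such an incident record is sketched below in Python. The field names, the `LossType` and `CauseClass` enumerations, and the sample values are assumptions made for illustration; they are not taken from any particular CMMS or performance management product, and the reading of E as "external" is likewise assumed.

```python
from dataclasses import dataclass
from enum import Enum

class LossType(Enum):
    DOWNTIME = "downtime"
    SPEED = "speed"
    QUALITY = "quality"

class CauseClass(Enum):
    MM = "machine malfunctioning"
    P = "process"
    E = "external"  # assumed meaning of "E"

@dataclass
class Incident:
    source: str              # work order or production log reference
    loss_type: LossType
    cause_class: CauseClass
    hours_lost: float
    # The five reliability-centered knowledge elements:
    function_lost: str       # 1. What function was lost or compromised?
    failure_type: str        # 2. How (full, partial, potential, functional)?
    cause: str               # 3. Why?
    effects: str             # 4. What happened?
    consequences: str        # 5. How did it matter?

def losses_by_type(incidents):
    """Sum the hours lost for each loss type."""
    totals = {loss: 0.0 for loss in LossType}
    for inc in incidents:
        totals[inc.loss_type] += inc.hours_lost
    return totals

# Hypothetical incidents.
log = [
    Incident("WO-1001", LossType.DOWNTIME, CauseClass.MM, 6.0,
             "To pump slurry", "functional", "Bearing seized",
             "Pump tripped on overload", "Operational"),
    Incident("LOG-0042", LossType.SPEED, CauseClass.P, 2.5,
             "To convey ore", "partial", "Belt slipping",
             "Line ran at reduced rate", "Operational"),
]
totals = losses_by_type(log)
print(totals[LossType.DOWNTIME], totals[LossType.SPEED], totals[LossType.QUALITY])
```

Keeping the loss classification and the five knowledge elements on the same record is what lets a KPI figure be traced back to individual significant items.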
Effectiveness KPIs classify productivity losses as: Downtime, Speed, and Quality
losses.
Table 18-1
Theoretical production time (1) divides into valuable operating time (8) and losses.
Losses: Quality losses, Speed losses, and Downtime losses, each subdivided by cause into MM, P, and E.
204
Bert Mijten, Real-Time Production Intelligence, ABB Review, Feb 2004
Technical losses (10): Quality (11), Speed (12), Downtime (13)
Technical losses by cause (14 to 19): MM and P for each of Quality, Speed, and Downtime
Further listed items: 1. Modification, major maintenance; 2. Limited need; 3. Social (policy not to produce on weekends, holidays, etc.)
MM = machine malfunctioning, P = process
Table 18-3 Model 2 Dupont (“planned production time” or “six big losses”) model
External losses are losses that cannot be altered by the production or maintenance team.
Planned down time losses are down time losses that were planned. Note that planned
down time losses (of Model 2) are specifically down time losses, whereas external losses
(of Model 1), can be speed, quality, and downtime losses. For instance, speed losses
because of environmental deals are external losses but are not planned down time losses.
Similarly, quality losses that are caused by the raw materials are external losses but are
not down time. Hence Model 1 discriminates more easily between losses controllable by
maintenance (and operations) and those that are outside of its control, than does Model 2.
Furthermore, external losses are not always planned. For instance, an external power cut
or a lack of raw materials is an external loss, but is not a planned down time loss. Hence,
‘available production time’ is different in the two models. Therefore the KPIs calculated
in Table 18-4 will have different values depending on whether Model 1 or Model 2 is
used. However, by leveraging the next generation of management software, we may, if
required, convert the five RCM knowledge elements associated with each incident (from
Model 1) into the six big losses (of Model 2).
Table 18-5 Example: (using the Production Economy model definitions)
There are four production lines whose reference throughput and approved product for the
period under study are given in Table 18-6.
Table 18-6
Figure 18-4 A Performance Management system drills down from the KPI (for example, Quality
Loss) to invoke analysis procedures that guide the physical asset manager to continuous policy
improvement
Figure 18-4 illustrates that historical data (contained in plant systems) fuel reliability
analyses such as Pareto, age-reliability relationships, and optimal CBM decision
graphs205. Those methodologies steer us towards improved maintenance policies. The
CMMS, the control system historian, CBM databases, and other plant systems feed
information to the performance management system. The performance management
system, in the hands of the physical asset manager, outputs continually improving
physical asset management policies. Today, the maintenance world hovers at the
threshold of bridging two remaining gaps that impede “excellence” in asset performance
management. They are:
With these final capabilities in hand, we may anticipate rewarding times ahead for
physical asset management. Nevertheless, there remains the question of how to actually
begin the journey to OEE improvement at lowest cost in each particular enterprise.
205
These may be called age-reliability-significant factor relationships
How to start
We set about the task of identifying the significant items whose failure to perform as
required impedes the achievement of corporate vision (both on the balance sheet and with
regard to our social responsibilities of custodianship). They may be expressed in a grid
similar to the following:
Table 18-7
Upon extracting from the grid the projects that deserve immediate attention, proceed to
elaborate a practical schedule for RCM training, for performing the RCM analyses,
and for the new information gathering processes (of chapters 1, 2, and 3). The RCM schedule
will depend on 1) the practically achievable rate at which the RCM analyses may
proceed, as determined by resource availability (trained RCM analysts and facilitators).
The RCM analyses, as they proceed, will generate specific requirements and ROI
estimates for new maintenance tasks and the redesign of significant items, systems, and
operating procedures. The KPI and age exploration results will monitor, guide, and
motivate the continuing process of improvement.
206
to achieve the target.
Chapter 19. Appendices
This same scenario could apply to hydraulic pumps, or electric motors, or valves, or
significant parts of any kind. Few organizations bother to track all significant components
as individuals. However, using the simple methods of the EWOP, we need know only
when they have been removed from their host and when they have been replaced. We
will know, too, (or can estimate from observation) the age of a used component at the
time that it was re-installed, usually as an emergency repair. The CBM lab207 introduced
the practical option of looking at the lifetimes of components from the point of view of
their host rather than from that of the components themselves.
207
At the University of Toronto, the birthplace of EXAKT for CBM optimization.
temporarily, and countless other situations that occur in the anarchic world of
maintenance. Figure 15 illustrates a component in suspended animation.
The working age line of an item is shown proceeding from left to right. Various meter
counts are indicated along the way (1000, 2000, 4000, and 6000). At 2000, a component
has been removed from the item. It was re-installed in the same item at 4000. The event
BSA marks the Beginning of Suspended Animation and ESA marks the Ending of
Suspended Animation for the component in question. The “gap” is the duration of the
suspended animation.
Assume that the component fails (at time T) 2000 working age units after its
reinstallation, i.e. at 6000. What is its (component) working age at failure? It is given by
the formula:
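A form consistent with the numbers in this example, assuming that the component was new when the item's working age was zero and using the labels $T$, $BSA$, and $ESA$ from the text, is:

$$\text{component working age at failure} = T - (ESA - BSA) = 6000 - (4000 - 2000) = 4000$$

That is, the gap spent in suspended animation is subtracted from the item's working age at the moment of failure.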
When the work order tells the EWOP that a used component (of say, age 4000) as in
Figure 19-2 was installed at item meter reading 5000, the EWOP generates the B event
with a working age of 1000 and a SM (start monitoring) event at 5000. The component’s
calculated age at failure is 7000. The SM event tells the model that no CBM monitoring
events (in the item) apply to this component prior to its installation at 5000.
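The two calculations above can be sketched in a few lines of Python. The function names are hypothetical, and the failure reading of 8000 used in the second case is an assumption inferred from the stated result (the text gives the calculated age of 7000 but not the item meter reading at failure).

```python
def age_with_suspension(failure_reading, bsa, esa, birth_reading=0):
    """Component working age at failure when the component spent the
    interval from bsa to esa (item working age) removed from its host."""
    gap = esa - bsa
    return (failure_reading - birth_reading) - gap

def virtual_birth_reading(install_reading, age_at_install):
    """Item meter reading at which a used component would have been new.
    This is the working age attached to the B (beginning) event."""
    return install_reading - age_at_install

# Suspended animation: removed at 2000, re-installed at 4000, failed at 6000.
age_a = age_with_suspension(6000, bsa=2000, esa=4000)
print(age_a)

# Used component of age 4000 installed at item meter reading 5000.
b_event = virtual_birth_reading(5000, 4000)
age_b = 8000 - b_event  # 8000 is an assumed failure reading
print(b_event, age_b)
```

Both cases reduce to the same idea: express the component's life on the host's working age line, then remove any interval during which the component was not accumulating age in that host.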
The EWOP’s Impact on the Work Process
Clearly, the EWOP implies a deep change in the thinking process related to the
completion of work orders and to the management of historical maintenance records in
general. The EWOP will impact the maintenance work order process both in the short
and long terms. In the short term, we recommend that the EWOP be used for specific
analysis projects by one or more reliability analysts or maintenance engineers. This will
introduce the EWOP gradually, prior to general acceptance in the global work order
process.
Previous versions of the EWOP required all 16 data elements to be present in the CMMS
work order structure. They also required that the CMMS have the ability to create ad hoc
work orders on demand, whenever a work order involved more than one unique
significant item-function-failure-cause. EWOP 1.4 does not require restructuring of the
existing CMMS database. The user (usually an engineer) may begin applying
reliability-centered knowledge in the CMMS immediately. All required fields, as well as
pseudo-work orders, are handled entirely by the EWOP using “option 4”.208
A pseudo-work order is a virtual work order that is embedded in the long text field of a
parent work order. This is required where technicians are not permitted, by the current
CMMS rules, to create ad hoc work orders on demand. If they need to report
multiple unique item-function-failure-causes, they can create as
many pseudo-work orders as they need. They do this by adding each significant unique
item-function-failure-cause that they wish to include to the CMMS long text field, using
the EWOP's structured free text format. The method is illustrated in the "Additional" field of
Figure 16.
Figure 19-3 Work order form. Many of the 16 data elements in the text field "Additional"
The CMMS field, “Additional” is a long text field (sql_longvarchar), sometimes called a
"memo" field. In this field users will enter all the additional information that the EWOP
needs. Using the native CMMS fields and the additional information in the long text
field, the EWOP will generate the necessary Events table and the RCM records.
Furthermore, it will update the long text field of the work order (including its pseudo-
work orders embedded in the text field). It will parse the free text and insert, at the
208
To use Option 4 of EWOP, edit the ewop.cfg file and change “,1” to “,4”
appropriate places, any RCM record references that were newly generated by the
work order (or pseudo-work order) via the EWOP.
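To make the idea of pseudo-work orders embedded in a long text field concrete, here is a minimal Python sketch of a parser. It assumes a layout in which records are separated by `~~` and each record holds `NAME : value` lines such as `RCMREF :`. The other field names and values are hypothetical, and the EWOP's actual structured free text format may differ.

```python
def parse_additional(text):
    """Split a long text field into pseudo-work-order records.

    Returns a list of dicts mapping upper-cased field names to values.
    """
    records = []
    for chunk in text.split("~~"):
        fields = {}
        for line in chunk.splitlines():
            name, sep, value = line.partition(":")
            if sep and name.strip():
                fields[name.strip().upper()] = value.strip()
        if fields:
            records.append(fields)
    return records

def missing_rcmref(records):
    """Indices of records that do not yet reference an RCM record."""
    return [i for i, rec in enumerate(records) if not rec.get("RCMREF")]

# Hypothetical content of an "Additional" field with three sub-records.
sample = """ITEM : CRU-01
CAUSE : Liner worn
RCMREF :
~~
ITEM : CRU-01
CAUSE : Drive belt broken
RCMREF : 17
~~
ITEM : CNV-03
CAUSE : Idler seized
RCMREF :
"""
records = parse_additional(sample)
print(len(records), missing_rcmref(records))
```

A tool working this way would fill in the empty `RCMREF` entries and write the text back to the work order, so the parent record and its embedded children stay in one place in the CMMS.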
Eventually, once maintenance engineers and managers recognize that the EWOP method
of integrating RCM thinking with the work order process will return continuing benefits,
they will, no doubt, request a reconfiguration of the CMMS, as described in the section
“Long term process” below. In the short term, the EWOP will empower maintenance
analysts and engineers to conduct specific reliability projects. A suggested list of
activities follows:
The EWOP approach will become integrated into the everyday work process. This will
happen once the value of the growing reliability knowledge base becomes apparent to
users (and their CMMS vendors). At this point several new CMMS features will have
been introduced (noted below in parentheses) that adapt to the EWOP methodology. The
work process will henceforth consist of the following steps.
1. Maintenance technician completes a job, and proceeds to update the CMMS by
completing the work order form (similar to Figure 2)
2. He (or she) recognizes that the work order actually refers to more than one unique
Item-function-failure-cause. He generates as many "sub-workorders" (an added
CMMS feature) as required to accommodate the structured information that he
needs to record. A discussion of sub workorders is given in Chapter 1 of
"Reliability-centered Knowledge".
3. In order to complete each sub-work order, he displays (an added CMMS feature)
the RCM table. He attempts to locate an RCM record that accurately describes a
situation similar to his current "sub-work order".
4. If he is successful, he relates the sub-work order to the RCM record (manually by
entering the RCMREF auto number into the sub-work order record, or
automatically by virtue of a new CMMS feature). He enters the following
reliability data into the sub-work order: dateback, dateout, workingageback,
workingageout, failuretype. (The sub work order is now an instance of a record in
the knowledge base.)
5. The technician may wish to edit the RCM record at this time to include any new
knowledge discovered during the execution of the work order. Usually, he will
append to, or modify, the Effects field. He may update the Consequences field.
(For example, a failure mode previously thought to be evident, was in fact
hidden.)
6. If no RCMREF is found, the information in the work order record will be added
automatically (new CMMS feature) to the RCM table, and the RCM table auto
number will be entered automatically (new CMMS feature) into the work order
record.
7. The CMMS will allow supervisors and reliability specialists to audit and approve
changes to the RCM table made by a technician. (new CMMS quality auditing
feature).
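Steps 3, 4, 6, and 7 above can be sketched as follows in Python. This is only an illustration: the dictionary keys are hypothetical, the RCM table is a plain list, and an exact match on item, function, failure, and cause stands in for the technician's judgment of what counts as a "similar" situation.

```python
KEY_FIELDS = ("item", "function", "failure", "cause")

def link_sub_work_order(rcm_table, sub_wo):
    """Relate a sub-work order to an RCM record, creating one if needed.

    Returns (rcm_record, created).
    """
    key = tuple(sub_wo[f] for f in KEY_FIELDS)
    for record in rcm_table:
        if tuple(record[f] for f in KEY_FIELDS) == key:
            sub_wo["rcmref"] = record["rcmref"]   # step 4: relate to existing record
            return record, False

    # Step 6: no record found, so add one and store its auto number.
    new_ref = max((r["rcmref"] for r in rcm_table), default=0) + 1
    record = {f: sub_wo[f] for f in KEY_FIELDS}
    record.update(rcmref=new_ref,
                  effects=sub_wo.get("effects", ""),
                  approved=False)                 # step 7: awaits audit
    rcm_table.append(record)
    sub_wo["rcmref"] = new_ref
    return record, True

# Hypothetical RCM table and sub-work orders.
rcm_table = [{"rcmref": 4, "item": "CRU-01", "function": "To crush ore",
              "failure": "Fails to crush", "cause": "Liner worn",
              "effects": "Throughput drops", "approved": True}]
known = {"item": "CRU-01", "function": "To crush ore",
         "failure": "Fails to crush", "cause": "Liner worn",
         "workingageback": 12500, "failuretype": "FF"}
new = {"item": "CRU-01", "function": "To crush ore",
       "failure": "Fails to crush", "cause": "Drive belt broken",
       "effects": "Crusher stops", "failuretype": "FF"}

rec1, created1 = link_sub_work_order(rcm_table, known)
rec2, created2 = link_sub_work_order(rcm_table, new)
print(known["rcmref"], created1, new["rcmref"], created2)
```

Marking a newly created record as unapproved reflects the auditing step, in which a supervisor or reliability specialist reviews changes made by a technician.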
10. Verify that the Events table has been emptied, that the RCM table has been reduced
to 4 records, and that the RCMREF values in many of the work orders have been
removed.
11. Hit the EWOP button again.
12. At the prompt, type CRU% and hit <enter>.
13. Go back to the Events table and verify that only the work orders for the crushers
have been processed. Hit the Initialize DB button to prepare for the next exercise
(Option 4).
1. Edit the ewop.cfg file. Change “Option : ,1” to “Option : ,4” and save it.
2. Hit the Work orders Option 4, Events, RCM, and Items buttons and examine the
records of the various tables, especially the long text field “Additional”. Notice
that there are actually three child or sub records (separated by ~~) within that text
field. Note that most RCMREF : values in the text are empty.
3. Hit the EWOP button. Hit <enter> at the prompt.
4. Hit the Events and RCM buttons. Examine the Events records and new RCM
records.
5. Hit the Work orders Option 4 button and note that the RCMREF field in the “pseudo”
work orders of the Additional field is now filled with the appropriate autonumber of the
RCM record.
6. Hit Initialize DB to re-initialize the database.
7. Verify that the Events table has been emptied, that the RCM table has been reduced
to 4 records, and that the RCMREF values in many of the Additional fields of the
Option 4 work orders have been removed.
8. Hit the EWOP button again.
9. At the prompt, type CRU% and hit <enter>.
10. Hit the Events button again and verify that only the Option 4 work orders for the
crushers have been processed.
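The CRU% entry typed at the prompt resembles an SQL LIKE pattern, in which % matches any run of characters. Whether the EWOP passes the entry to the database in exactly this way is an assumption. The following Python sketch, using an in-memory SQLite table with hypothetical work order and item identifiers, only illustrates how such a pattern would restrict processing to the crushers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE workorders (wo_id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany(
    "INSERT INTO workorders (wo_id, item) VALUES (?, ?)",
    [(1, "CRU-01"), (2, "CNV-03"), (3, "CRU-02"), (4, "PMP-07")],
)

def select_work_orders(connection, pattern="%"):
    """Return work orders whose item matches the LIKE pattern.
    The default pattern "%" selects every work order."""
    cursor = connection.execute(
        "SELECT wo_id, item FROM workorders WHERE item LIKE ? ORDER BY wo_id",
        (pattern,),
    )
    return cursor.fetchall()

crushers = select_work_orders(conn, "CRU%")
everything = select_work_orders(conn)
print(crushers)
print(len(everything))
```

Under this reading, pressing <enter> without typing a pattern corresponds to selecting all work orders, which matches the behavior described in the earlier steps of the exercise.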
Often, however, we cannot make such a clear decision from the condition monitoring
data. We lack an adequate signal processing method and/or decision model with which
to discriminate patterns in the data that relate unambiguously to a targeted failure mode.
We know something is wrong, but we don’t know which, of a number of possible failure
modes, is deteriorating. We don’t know which part of the equipment is failing.
We may, then, perform an exploratory inspection. We escalate from a less intrusive,
purely monitoring, type of inspection to one that requires a more intrusive activity. Oil
analysis of an engine’s crankcase oil may indicate an increasing trend of some wear
metal, such as iron. Concerned, we perform a compression check, in the hope that more
information will narrow down the list of possible failure modes. Further escalations in
inspection intensity might include a pressure/ignition trace, and eventually, a partial or
complete dismantling of the engine.
Each progressively intrusive and costly layer of CBM deepens the process of discovery.
If we find during the compression check that there is indeed a ring sealing problem, we
may learn from this experience. We could attempt to find patterns in the (relatively
inexpensive) oil analysis data, that relate to poor compression. If, through the modeling
process, we find such a relationship, we could use it, thereafter, as a decision model (or
rule) to tell us when it is advisable to perform a compression check.
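As a deliberately simple stand-in for the modeling process (it is not the proportional hazards modeling used by EXAKT), the Python sketch below searches past records for the iron level that best separates engines later found to have poor compression from those found to be sound. The data and names are hypothetical.

```python
def best_threshold(history):
    """Find the iron level that best predicts poor compression.

    history: list of (iron_ppm, poor_compression_found) pairs.
    The candidate rule is "check compression when iron_ppm >= threshold".
    Returns (threshold, misclassified_count); ties keep the lowest threshold.
    """
    best = None
    for threshold in sorted({ppm for ppm, _ in history}):
        errors = sum((ppm >= threshold) != poor for ppm, poor in history)
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best

def needs_compression_check(iron_ppm, threshold):
    """Decision rule derived from the history."""
    return iron_ppm >= threshold

# Hypothetical oil analysis results paired with compression check findings.
history = [(12, False), (15, False), (18, False), (22, True),
           (25, False), (31, True), (40, True)]
threshold, errors = best_threshold(history)
print(threshold, errors, needs_compression_check(28, threshold))
```

The point of the sketch is that the rule is learned from potential failures (poor compression found on inspection), so no functional failure has to occur before the rule can be built.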
The consequences of failure are still minor – the discovery of poor compression is
considered a “potential failure”. It would eventually deteriorate and cause a functional
failure whose consequences would indeed be operational, safety related, or economically
important. The point to note is that the development of the decision model (a rule for
issuing a work order for a compression check) did not require us to have experienced a
functional failure. We prefer, naturally, to model potential failures rather than to build
our decision models upon the experience of functional failures that have dire
consequences.
The EWOP encourages the development of decision models that warn of potential
failures. Technicians, in the course of carrying out various preventive tasks, using EWOP
methods, will document their observations in the systematic RCM form. By analysing the
resulting knowledge base, particularly the effects, the events leading up to a failure cause,
we will without doubt develop better inspection and decision techniques with fewer
functional failures.
EXAKT, as does any RA methodology, requires accurate historical data. Without prior
guidelines, such as those proposed by the EWOP, good data has been difficult to attain.
The EWOP methodology teaches the principles of reliability-centered knowledge.
Analysts can begin using EWOP methods immediately for specific reliability analysis
projects.
Figure 17 EWOP main menu
The EWOP is, in one sense, a reversal of the way RA projects have been done in the past.
Traditionally, we have extracted sets of records from the CMMS. Then we embarked on a
process of "data cleaning" (using other software packages such as EXAKT, Excel,
MATLAB, and others) to deal with data anomalies (usually missing or undocumented
events).
The EWOP, on the other hand, focuses considerable energy on the data source before
attempting data extraction. The reliability analyst supplements the information on each
work order related to the item under analysis. He does this on site, with the assistance of
those who participated in the maintenance events that concern the item. The enriched
information, presented in the consistent 16-data element format, enables the EWOP to
extract records from the CMMS directly into an Events table.
The EWOP brings substantial advantages to an EXAKT (or any RA) project. By applying
thorough work order documentation methods, within the existing CMMS, an analyst:
Appendix 2.
The quality (hence the success) of each RCM analysis will depend heavily on how well
the facilitator has mastered and executes his skills209. Those skills are outlined in Table
19-1: RCM facilitator’s checklist. The facilitator’s skill and vigilance will prevent the
analysis from being dangerously superficial, or, conversely, from becoming bogged down
and stalled in unnecessary detail. The novice facilitator should refer often to this
scorecard throughout the RCM project, and continually self-evaluate h(is)(er)
performance, (initially under the watchful eye of an experienced RCM practitioner) with
respect to each of the items in Table 19-1.
In the planning phase, before an RCM analysis begins, ensure that
potentially useful documentation (drawings, schematics, manuals,
standard operating procedures, maintenance and operational histories,
etc.) is readily accessible for reference during the sessions. Discuss the
general RCM objectives beforehand with resource people210 outside
the team, so that they may respond quickly if called upon to provide
clarification or information during the course of the
analysis.
Score: 1 2 3 4 5
Assist in the selection of appropriately skilled RCM team members.
Score: 1 2 3 4 5
(as part of the main analysis) each of the subsystem's dominant failure
mode(s) singly and the other failure modes lumped under the title
“others”.
Report regularly on progress to the RCM sponsor. Call upon h(im)(er)
for help in resolving technical, organizational, or human issues as they
arise.
Score: 1 2 3 4 5
Provide team members access to the evolving RCM worksheet as the
analysis unfolds from session to session.
Score: 1 2 3 4 5
2.0 Animation
Recognize and be sensitive to each personality type. Help each team
member contribute fully to the RCM process by using one or more of
these techniques: Gently discourage the extrovert from monopolizing
the floor by (following a tirade) asking a question of another team
member ("George, what do you think about that?"). Encourage the
introvert by asking h(im)(er) questions and by assigning short research
tasks between sessions on unclear issues (calling a vendor, checking a
log sheet, etc.). Ask h(im)(er) to report on h(is)(er) findings at the
beginning of the next meeting. Be careful not to harass h(im)(er).
Score: 1 2 3 4 5
Recognize when true consensus is achieved. Never permit a vote. Keep
in mind that a lone dissenter may be right. Record h(is)(er) position and
ask h(im)(er) to “agree to disagree” until further elucidating information
comes along.
Score: 1 2 3 4 5
At the beginning of the first session of the RCM analysis, help the team
set and agree upon the ground rules (smoking, punctuality, etc.).
Score: 1 2 3 4 5
Recognize when the team simply “does not know” (about some aspect
of the asset) by being alert to statements beginning with "I think ..." or "I
believe ...". Assign short research tasks to team members to find out.
Score: 1 2 3 4 5
Remind participants of the objectives and importance of the analysis and
that they have been chosen to participate because of their knowledge
and experience.
Score: 1 2 3 4 5
Be alert to answering the wrong question. This could occur at any time
throughout the RCM process. An example is the raising of an
operational consequence when the process has moved on to the safety
and environmental branch of the decision diagram.
Score: 1 2 3 4 5
Safeguard the self-esteem of each team member. Recognize that “loss
of face” may be felt by persons formerly considered knowledgeable.
Soften the blow by emphasizing (in timeouts and anecdotes) that RCM
is, above all else, a learning forum to bridge the discontinuities in the
knowledge of individuals by gaining synergy from the collective
perspectives of the team.
Score: 1 2 3 4 5
3.0 Clarity
Input the answers to the RCM questions into the RCM worksheet.
Score: 1 2 3 4 5
While entering the answers, retain team members’ wording as much as
possible. Occasionally, when necessary, suggest ways of expressing the
answers more succinctly in written form. Revise and correct the text
outside the meeting without altering what was said and meant during the
session. When in doubt, obtain approval from the team for extensive
word-smithing. Avoid jargon. That is, ensure that the technical terms
used on the worksheet will be understood by everyone on the site.
Score: 1 2 3 4 5
4.0 Time Management
Remind the team of the time allotted to the current analysis and the rate
of progress necessary to attain that goal.
Score: 1 2 3 4 5
Keep the pace of analysis (all 7 steps) at an average rate of 6 failure
modes per hour.
Score: 1 2 3 4 5
Indicate that about 1/3 of the time will be dedicated to defining the
functions, 1/3 to failures, modes, and effects (FMEA), and 1/3 to
consequences, decisions, and task definition and assignment.
Score: 1 2 3 4 5
5.0 Focus on the process
Ask the RCM questions. Never answer them. (If the team may have
made a technical error or omission, rephrase the questions to probe in a
particular direction or ask that a particular point be checked between
sessions.)
Score: 1 2 3 4 5
Elaborate the asset's operating context at the beginning of the analysis.
Keep it uppermost in the team’s mind throughout the analysis.
Score: 1 2 3 4 5
Ensure that the 7 RCM questions are asked completely, in the manner
and the order prescribed by SAE JA1011.211
Score: 1 2 3 4 5
Pay strict attention to the following issues with respect to each of the
SAE JA1011 RCM questions (5.1 to 5.7)...
Score: 1 2 3 4 5
Ask the team to uncover the primary functions and the secondary functions,
including all hidden functions. Afterwards invoke the PEACHES
mnemonic to double check that all functions have been listed.
Score: 1 2 3 4 5
211
SAE JA1011, http://www.sae.org, Title: Evaluation Criteria for Reliability-Centered
Maintenance (RCM) Processes
Direct the team to include as many quantitative performance
requirements as practical in each function statement to fully describe the
users’ (owners, societal) objectives for the asset. The function statement
usually begins with “To …” or “Not to …”. Avoid the use of “and”
between two verbs.
Score: 1 2 3 4 5
Simplify (reduce the size of) the function list by deciding when a certain
function may be more conveniently included as a failure mode of
another functional failure. For example, the function "Not to trip when
the liquid level is below 100 hectoliters" preferably should be included
as the failure mode "pump trips due to grounded electrical contact" of
the primary function "To pump x liters ... ".
Score: 1 2 3 4 5
Encourage the team to use code phrases to imply a hidden function (e.g.
to be capable of, to be able to, …to heat to 140C in the presence of a
standby heater.)
Score: 1 2 3 4 5
5.2 In what ways can it fail to fulfill its functions (functional failures)?
• How does the likelihood of this failure depend on deeper causes? Has
it happened before? Under what circumstances?
5.5 In what way does each failure matter (failure consequences)?
Set the proactive task intervals. For CBM, estimate the P-F interval or, if
applicable212, use a risk based non-deterministic approach such as
EXAKT. For TBM, estimate the useful life regarding the failure mode in
question.
Score: 1 2 3 4 5
5.7 What should be done if a suitable proactive task cannot be found
(default actions)?
The three possible default actions (run-to-failure, failure detection, and
redesign) must be considered when so directed by the decision diagram.
For hidden failures, the detection interval must account for the tolerable
level of risk (probability and consequences) of a multiple failure.
Score: 1 2 3 4 5
Ensure that the team has considered all practical aspects of the task that
has been selected. The task descriptions must contain enough detail213 to
ensure that no misunderstanding is possible when they are transcribed into
the maintenance system.
Score: 1 2 3 4 5
Appendix 3.
212
Historical event and condition monitoring data is available and the consequences of failure are serious
enough to justify the analysis effort. The EXAKT analysis should be performed off-line.
213
However, safety and intricate task details should be considered offline (with the possible participation of
safety, process, engineering, and vendor experts where needed).
Figure 19-4
asset hierarchy. However, Figure 19-4 illustrates the compromise to be considered when
selecting a level at which to define our item. At a higher level the item’s functions and
functional failures are more clearly related to the performance requirements of the
equipment as a whole – an advantage.
Time is one of the facilitator’s prime considerations. The more failure modes that need to
be considered, the longer the analysis will take. Experience tells us that we should size
the item so that it may be analyzed in 5 to 15 three-hour sessions214. A well run
analysis averages 6 failure modes per hour. Hence a small analysis would contain about
90 failure modes while a large one would analyze about 270. These figures make it
apparent that the facilitator must carefully control the process, lest it flounder by not
achieving the analysis of the item (as defined) in the allotted time. Such occurrences
could jeopardize215 the entire RCM initiative.
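The sizing arithmetic above (sessions times hours per session times failure modes per hour) can be checked with a short Python sketch. The function names are hypothetical; the rates are those quoted in the text.

```python
import math

SESSION_HOURS = 3    # three-hour sessions
MODES_PER_HOUR = 6   # average pace of a well run analysis

def failure_mode_capacity(sessions):
    """Failure modes that can be analyzed in the given number of sessions."""
    return sessions * SESSION_HOURS * MODES_PER_HOUR

def sessions_needed(failure_modes):
    """Whole sessions needed to analyze the given number of failure modes."""
    return math.ceil(failure_modes / (SESSION_HOURS * MODES_PER_HOUR))

small = failure_mode_capacity(5)
large = failure_mode_capacity(15)
print(small, large, sessions_needed(200))
```

Run the other way round, the same numbers let a facilitator estimate how many sessions to budget once the likely count of failure modes for an item is known.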
214
Depending on the item’s complexity as reflected by the number of its reasonably likely failure modes.
215
By over-running the budgeted time and resources, and by discouraging both team members and upper
management through non-attainment of milestones.
Selecting the significant items
Appendix 4.
Failure finding intervals for complex items (multiple failure modes and
devices)
Failure finding interval for devices with more than one failure mode.
$$I_{ff} = \frac{2 \times M_{pf}}{M_{mf} \times \left(\dfrac{1}{M_{sd1}} + \dfrac{1}{M_{sd2}} + \dfrac{1}{M_{sd3}}\right)}$$
where:
216
The method and details of project priority are industry specific. RCM may then proceed according to the
schedule generated by whichever priority method is used. Variants of RCM (such as Turbo RCM, PMO
2000, RCM Cost and provide structured priority systems.
Iff = failure finding interval
Mpf = reliability (mean time between failure) of the protected function
Mmf = tolerable mean time between multiple failure
Msd1 = mean time between failure due to failure mode 1 of the safety device
Msd2 = mean time between failure due to failure mode 2 of the safety device
Msd3 = mean time between failure due to failure mode 3 of the safety device
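As a hedged illustration, the formula for a device with several failure modes can be evaluated as follows. The sketch reads the formula as Iff = 2 × Mpf / (Mmf × (1/Msd1 + 1/Msd2 + 1/Msd3)); the function name and the numerical values are hypothetical and are not taken from the text.

```python
def ffi_multiple_failure_modes(m_pf, m_mf, m_sd_modes):
    """Failure finding interval for a safety device with several failure modes.

    m_pf:       mean time between failures of the protected function
    m_mf:       tolerable mean time between multiple failures
    m_sd_modes: mean time between failures of the safety device, one value
                per failure mode
    All values must share one time unit; the result is in that unit.
    """
    combined_rate = sum(1.0 / m for m in m_sd_modes)  # 1/Msd1 + 1/Msd2 + ...
    return 2.0 * m_pf / (m_mf * combined_rate)


# Hypothetical example, all values in years.
interval = ffi_multiple_failure_modes(m_pf=10.0, m_mf=1000.0,
                                      m_sd_modes=[20.0, 50.0, 100.0])
print(round(interval, 3))  # 0.25 (years, i.e. about every 3 months)
```

Passing the failure modes as a list means the same function covers a device with any number of modes, not only three.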
Failure finding interval for redundant devices (based on the linear approximation).
$$I_{ff} = M_{sd} \times \left[\frac{(n+1)\,M_{pf}}{M_{mf}}\right]^{\frac{1}{n}}$$
where:
n = number of redundant devices of the same kind.
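The redundant-device formula, read as Iff = Msd × ((n + 1) × Mpf / Mmf)^(1/n), can be sketched in the same way. The function name and the example values are assumptions made for illustration only.

```python
def ffi_redundant_devices(m_sd, m_pf, m_mf, n):
    """Failure finding interval for n redundant devices of the same kind
    (linear approximation). All times must share one unit."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return m_sd * ((n + 1) * m_pf / m_mf) ** (1.0 / n)


# Hypothetical example, all values in years.
single = ffi_redundant_devices(m_sd=20.0, m_pf=10.0, m_mf=1000.0, n=1)
double = ffi_redundant_devices(m_sd=20.0, m_pf=10.0, m_mf=1000.0, n=2)
print(round(single, 2), round(double, 2))  # 0.4 3.46
```

With n = 1 the expression reduces to 2 × Msd × Mpf / Mmf, the single-device case. In this example, adding a second redundant device allows a much longer interval for the same tolerable multiple-failure rate.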
Optimal failure finding interval for parallel redundant devices where only cost is a
factor
$$I_{off} = \left[\frac{(M_{sd})^{n}\,(n+1)\,M_{pf}\,C_{ff}}{n \times C_{mf}}\right]^{\frac{1}{n+1}}$$
where:
Ioff = optimal failure finding interval
Cff = average cost of an inspection
Cmf = average cost of a multiple failure
n = number of redundant safety devices of the same kind.
Appendix 5.
Truck description
1. General Description
Each car is mounted on two four-wheel trucks having a wheelbase of 2500 mm. The
trailer trucks which are fitted to each type of trailer car are un-motored and are not fitted
with a parking brake. All trucks are fitted with disc brake equipment.
Figure 19-6: Rail car truck
Method of construction: side frames and transoms are steel fabrications that use
closed box sections to give lightweight, structurally efficient trucks.
The primary suspension consists of rubber/steel chevrons which mount the axle box to
the truck frame. The inherent damping within the chevron assemblies avoids the
necessity for supplementary dampers in the primary suspension. The axle box also houses
a rubber bump stop, which serves to prevent direct contact between the truck frame and
the axle box under severe bounce conditions.
The secondary suspension consists of two elements which are interposed between the
truck side frames and the car bolsters. The two elements are a layer spring and an air
spring. Under normal conditions, the effective suspension stiffness is a result of the two
springs connected in series. In the event of the air spring being deflated, the car will rest
on emergency springs which are located on top of the layer springs. The cars can still be
used in service, but with a reduced quality of ride. Vertical oscillations of the car are
damped by two hydraulic shock absorbers mounted on either side of the
truck between the truck side frame and the air spring top plate on the car bolsters. Lateral
oscillations are damped by a hydraulic damper which is mounted between the traction
center and the truck side frame. Lateral displacements are limited by resilient and positive
stops. Body roll is controlled by a torsion bar which is housed in the transom of each
truck and connected to the body by a suitable linkage. A leveling valve mounted on the
car controls the air pressure in the air springs and maintains a constant floor height
independent of passenger loading. The traction center is connected to the truck frame by
horizontal traction links. The ends of the traction links contain composite metal/rubber
bushes to ensure that tractive and braking forces are transferred to the car as smoothly
as possible.
2. Wheel sets
Figure 19-7: Wheel set
Wheels of BR-PB profile and mono-block construction are shrunk onto solid one-piece
axles which run in double roller bearing axle boxes. The wheels are specified to BS
468 class D, oil hardened and tempered. The axles are manufactured from low alloy steel
conforming to the BR specification 109A. The wheels are shrunk onto the axles and the
wheel set is balanced in accordance with BR specification 163. To effect removal of the
wheels, the hubs are drilled with two diametrically opposite oil injection holes. The
gearwheel is also fitted with an oil injection hole to assist removal. The axle ends are
suitably center drilled to allow wheel turning on a wheel lathe.
3. Axle box
Machined on each side of the axle box body is a mounting to carry the primary
suspension, each chevron being retained to the mounting with two bolts. At the top of the
forging is a machined circular housing to accommodate the axle box rubber bump stop.
A sealing collar is abutted up to a shoulder on the axle, and an open cover is fitted over it.
Labyrinth grooves in the collar and cover prevent leakage of grease from the rear of the
axle box. The front of the axle box is sealed by either a front cover or the housing of a
frequency generator via an adaptor plate. The axle box is lubricated with a lithium base
grease such as Shell Alvania 3, Exxon Beacon 3, or a comparable approved grease.
4. Primary Suspension
The arrangement of the primary suspension shows the tie bar arrangement under the axle
box. The tie bar arrangement consists of a spacer tube, tie bar, locating rings and suitable
fasteners. The tie bar serves two purposes: it ensures the structural integrity of the wheel arch
and it allows the truck to be lifted from its wheel sets. When a complete truck is lifted, the
load of the wheel sets is supported by the tie bars via the axle boxes.
Figure 19-9: Primary suspension
Cast steel chevron holders are located in, and welded to, web plates attached to each
wheel arch. The correct space between the bump stop housing and the top of the wheel
arch is adjusted by use of shim plates fastened under the top of the wheel arches.
5. Traction Center
The tractive and braking forces are transmitted to the center pivot via the traction center.
The center pivot is bolted to the bolster stool which is riveted to the car bolster. Shims are
fitted between the bolster stool and the center pivot to ensure the interface height between
the center pivot and truck is correct.
Figure 19-10: Air spring top plates
Figure 19-11 shows the assembled arrangement of the traction center. The lateral bump
stop assemblies limit the possible lateral body movement relative to the truck. Each bump
stop assembly consists of a bonded rubber/steel bump stop and a fixed stop, such that
any lateral movement is unrestricted until the center pivot comes into contact with the
bump stop; further movement is then resisted by elastic deformation of the bump stop
until the fixed stops are met. The correct dimensions from the truck center to the rubber
stops and the fixed stops, and between the fixed stop and the truck transom. Tractive and
braking forces are transferred from the truck frame to the traction center by two traction
links (Figure 19-12). The traction links house resiliently mounted bushes in each eye, so
that the forces are transferred as smoothly as possible.
Forces are transferred from the traction center via the center pivot to the car body. The
center pivot pin is retained in the traction center by a rubber compound spring. Lateral
movements of the traction center relative to the transom are hydraulically damped by a shock
absorber which is connected to one side of the truck frame and to the traction center.
5. Secondary Suspension
The secondary suspension consists of a series of elements mounted on each truck side
frame (see Figure 19-6). The stiffness of the suspension in normal service conditions is a
result of an air spring and a layer spring acting in series.
The air spring is connected to the car bolster by an air spring top plate. These plates can
only be fitted in a certain manner (Figure 19-14) and serve as both the mechanical and
pneumatic connection to the car. The lower sealing face of the air spring seals onto the
top of the layer spring assembly. The layer spring consists of a series of rubber and metal
elements bonded together. A plate on the top of the layer spring serves as the sealing face
for the air spring and as a housing for the emergency spring.
In the event of the air spring being deflated the car will rest on the top of the emergency
spring. The emergency spring comprises a metal/rubber assembly and has a low friction
surface fitted to its upper surface. This low friction surface allows the use of a vehicle in
service with a deflated air suspension, albeit with a reduced quality of ride.
Figure 19-12: Secondary suspension
Vertical oscillations are damped by two hydraulic shock absorbers, one on each side of the
truck adjacent to the secondary suspension. The dampers are mounted on brackets on
the truck side frame at one end and on the air spring top plate at the other end.
Roll of the car body relative to the truck is controlled by an anti-roll torsion bar, which is housed
in the truck transom and connected to the car body by a turnbuckle linkage.
6. Frame
The trailer truck frame is a jig-built welded structure comprising two side frames, a
center transom, and two headstocks. The side frames form enclosed box sections which
are internally braced to provide the optimum strength to weight relationship. The side
frames are symmetrical in profile about their centers with a wheel arch at each end. The
two side frames are joined at their centers by a transom assembly consisting of top and
bottom plates with vertical plates forming a box section structure. Two transverse tubes
are welded integrally into this structure. One of these tubes houses the torsion bar. The
ends of each side frame are joined together by headstocks.
Cast steel chevron holders are located in, and welded to, web plates attached to the front
and rear of each wheel arch. Brackets are located at the bottom of the wheel arches to
locate the tie rod assemblies under each axle box.
Four towing points are fitted, two to each side frame, in-board of the wheel arches. The
points can also be used as lifting points, when handling individual trucks.
Brackets are welded to the outside of each side frame: two provide mounting points for
the vertical dampers and the others house the bearings for the torsion bar.
Under the top of each wheel arch, location points are provided to accommodate shims.
The shims ensure the correct clearance between the axle box and the truck.
A torsion bar passes through one of the lateral tubes mounted in the transom and through
holes in the side frames, adjacent to the bearing housing brackets.
The mounting brackets for the traction links are welded diametrically opposite each other
under the transom, fore and aft of the center aperture.
At the center of one of the headstocks, a mounting bracket for the AWS is welded to the
bottom plate. The AWS (automatic warning system) receiver is resiliently mounted and
the correct height above rail level is adjusted by the use of spacer washers.
Figure 19-14: Rail car truck
Actuators for the wheel mounted discs are mounted on each headstock. The actuators on
the trailer trucks are not fitted with parking brake facility.
Appendix 6.
Terminology used:
Age Exploration: Any analysis procedure that examines historical data in order to
improve the maintenance plan by increasing an item’s reliability, availability,
maintainability, productivity, or by reducing cost. (Also called “reliability analysis”).
Applicable: A task is technically feasible and practical. For a condition based
maintenance task it means that a potential failure can be detected and assessed well
enough in advance of a functional failure to avoid or reduce its consequences. For a
scheduled overhaul it means that the item has a useful life.
Availability: (total scheduled time – downtime)/total scheduled time. Or,
MTTF/(MTTF+MTTR)
Complex item: An item subject to more than one reasonably likely failure mode.
Condition data: Inspection/measurement data (temperature, vibration, wear, yield, visual
observation, performance, etc) from which a potential failure may be deduced.
Conditional probability of failure:
$$\text{Conditional probability of failure in interval} = \frac{P(\text{entering interval}) - P(\text{surviving interval})}{P(\text{entering interval})}$$
The interval must be small compared to the average life of the item. The result is the probability
that the item fails in an interval, given that it survives to the start of that interval.
Covariate: A condition indicator. A condition data variable or transformation of one or
more variables to be tested in a proportional hazard model.
Decision Model: A method for interpreting condition data. An optimized decision model
is one which maximizes or minimizes some objective (e.g. availability or cost
respectively). A decision model may be developed that achieves some performance
measure such as a specified mission reliability or a required preventive to corrective
maintenance ratio.
Effective: A task accomplishes the intended objective – to lessen satisfactorily or to
avoid entirely the consequences of a failure.
Failure: Two types: 1. Potential failure – an unambiguous indication that a functional
failure is imminent (degraded failure resistance), and 2. Functional failure – the partial or
total loss of one of an item’s functions.
Inspections: Observations (physically (human senses) or electronically acquired) related
to an item’s operation and maintenance from which a potential failure may be deduced.217
Item: A group of one or more parts or assemblies that is convenient to treat as a single
entity for reliability analysis. Items are defined at a high enough level of indenture so that
their failures may be clearly related to failure of the equipment as a whole and low
enough so that the number of failure modes is reasonable (<50-60).
Mean time to failure (MTTF): The average life of an item. Can be estimated by totaling
the lives of an item or fleet over a period of time and dividing by the number of items.
Mean time between failure (MTBF): The MTTF plus the MTTR.
Mean time to return to service (MTTR): The mean time to return to service. (Also
called the maintainability.)
Multiple Failure: A failure of a protected function at a time when its protective function
is already in a failed state
OEE: (Availability x Productivity x Quality) tracks maintenance effectiveness, where:
Availability = (scheduled time - downtime due to all forms of maintenance)/(scheduled
time). Productivity = Product rate setting/Desired product rate. Quality = (Product -
Scrap)/Product. Additionally, tracking Reliability = MTTF, will provide further insight
into benchmarks for maintenance effectiveness.
On-condition maintenance: The detection of a potential failure. Also known as
condition based maintenance (CBM) and predictive maintenance (PdM).
PM: Preventive Maintenance. Scheduled tasks that include: failure finding218, on-
condition (aka CBM, predictive maintenance), rework, and discard tasks.
Reliability: Usually defined as an item’s MTTF. Sometimes described as a survival
probability of the item for a given mission duration.
Reliability analysis: Synonym for “Age Exploration”: Any analysis procedure that
examines historical data in order to improve the maintenance plan by increasing an item’s
reliability, availability, maintainability, productivity, or by reducing cost.
217
In some contexts (e.g. gas turbines) “Inspections” refer to major overhauls.
218
Inspections to discover functional failures that would otherwise remain hidden until the function is
called upon by some other failure or exceptional event.
Reliability-centered: Adjective indicating the aim of sustaining and improving OEE and
reliability.
Reliability-centered maintenance: A (7-question) process used to determine the
maintenance requirements of an asset in its operating context.
Sample: Observations of an item’s (or group of similar219 items’) installations, failures,
preventive renewals, significant events, and condition data over a period of time.
Significant events: Operational or maintenance events that impact an item’s failure
resistance or its condition data.
Significant item: An item whose failures:
• Are not evident under normal circumstances, or
• Can directly negatively impact safety or the environment, or
• Can have direct major economic or operational impact.
Suspended220: Refers to replacement (discard) or rework of an item for any reason other
than its failure.
Useful Life: The age at which the conditional probability of failure begins to increase
and to which most items of the same kind survive. See Figure 3-2 on page 35.
Answer: All of the above. Thus it is important to clarify what we mean by “life” in any
given discussion.
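The Availability and OEE entries in the terminology above lend themselves to a short calculation. The following is a minimal sketch under stated assumptions: the function names and the sample figures (a 720-hour schedule, 36 hours of downtime, and so on) are hypothetical, and only the formulas themselves come from the definitions.

```python
def availability(scheduled_time, downtime):
    """(total scheduled time - downtime) / total scheduled time."""
    return (scheduled_time - downtime) / scheduled_time


def availability_from_mttf(mttf, mttr):
    """MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


def oee(avail, productivity, quality):
    """Availability x Productivity x Quality."""
    return avail * productivity * quality


# Hypothetical month: 720 scheduled hours, 36 hours of maintenance downtime.
a = availability(scheduled_time=720.0, downtime=36.0)      # 0.95
p = 90.0 / 100.0                 # product rate setting / desired product rate
q = (1000.0 - 20.0) / 1000.0     # (product - scrap) / product
print(round(a, 4), round(oee(a, p, q), 4))                  # 0.95 0.8379
```

The two availability functions agree when downtime is due only to repairs: an item with an MTTF of 684 hours and an MTTR of 36 hours also has an availability of 0.95.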
Appendix 7.
219
Similar physically and in operating context
220
There are actually 3 types of suspensions: left, right, and interval. EXAKT also has the concept of
“temporary” suspension that refers to items that are still operating. In most contexts in the present manual
we mean “right” suspensions.
f(t) is the probability density function (PDF). It is the usual way of representing a failure
distribution. As density equals mass per unit of volume, probability density is the probability
of failure per unit time221. When multiplied by the length of a small time interval at t, the
product is the probability of failure in that interval. It is the basic description of the time to
failure of an item. The PDF is often estimated from real life data. It resembles a histogram222
of the number of failures of an item in consecutive intervals. All other functions related to an
item’s reliability can be derived from it. For example:
F(t) is the cumulative distribution function (CDF). It is the area under the f(t) curve from 0 to t.
(Sometimes called unreliability or the cumulative probability of failure.)
R(t) is the survival function. (Also called the reliability function.) R(t) = 1-F(t)
h(t) is the hazard function223. (At various times called the hazard rate, conditional failure rate,
instantaneous failure probability, instantaneous failure rate, failure rate, the inverse of failure resistance, failure
risk, and risk.) h(t) = f(t)/R(t)
221
However the analogy is accurate only if we imagine a volume of non-uniform mass. The density of a
small volume element is the mass of that element divided by its volume
222
A histogram is a vertical bar chart on which the bars are placed along a horizontal axis scaled in units of
working age. The width of the bars is uniform, representing equal working age intervals. The height of
each bar represents the fraction of items that failed in the interval. If the bars are very narrow then their
outline approaches the pdf.
223
Often, the two terms "conditional probability of failure" and "hazard rate" are used interchangeably in
many RCM and practical maintenance references. In those references the definition for both terms is: the
conditional probability that an item will fail during an age interval given that the item enters (or survives)
to that age interval. This definition is not the one usually meant in reliability theoretical works when they
refer to “hazard rate” or “hazard function”. Nowlan and Heap point out that the hazard rate may be
considered as the limit of the ratio (R(t)-R(t+L))/(R(t)*L) as the age interval L tends to zero.
To summarize, "hazard rate" and "conditional probability of failure" are often used interchangeably (in
more practical maintenance books). The “hazard rate” is commonly used in most reliability theory books.
The conditional probability of failure is more popular with reliability practitioners and is used in RCM
books such as those of N&H and Moubray. There are two versions of the definition for either "hazard
rate" or "conditional probability of failure":
1. h(t) = f(t)/R(t)
2. h(t) = (R(t)-R(t+L))/R(t).
where L is the length of an age interval. Actually, when you divide the right hand side of the second
definition by L and let L tend to 0, you get the first expression.
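Since
F(t) = 1 - R(t)
then, differentiating,
$$\frac{dF(t)}{dt} = -\frac{dR(t)}{dt} = f(t)$$
Dividing the second definition by L and letting L tend to 0 (and applying the limit definition of the
derivative) gives
$$\lim_{L \to 0} \frac{R(t) - R(t+L)}{L\,R(t)} = -\frac{1}{R(t)}\,\frac{dR(t)}{dt} = \frac{f(t)}{R(t)} = h(t)$$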
MTTF is the average time to failure. (Also called the mean time to failure, expected time to failure,
average life.)
$$\text{MTTF} = \int_0^{\infty} t\,f(t)\,dt$$
H(t) is the conditional probability of failure. It is the probability that the item fails in a
time interval [t1 to t2] given that it has not failed up to t1. It is approximately equal to h(t) multiplied
by the length of the time interval of interest. Its graph has the same shape as that of the hazard
function, differing by a constant factor that depends on the interval width being considered.
H(t) = (R(t1)-R(t2))/R(t1)
Actually, not only the hazard function, but pdf, cdf, reliability function and cumulative hazard function
have two versions of their definitions as above. The first version is defined over a continuous range of age t
while the second one is defined over discrete age intervals, e.g., (0,100), (100,200), (200,300), ... Roughly,
we can say the second definition is a discrete version of the first definition.
The first expression is useful in reliability theory and is mainly used for theoretical development. The
second expression is useful for reliability practitioners, since in practice people usually divide the age
horizon into a number of equal age intervals. The pdf, cdf, reliability function, and hazard function may all
be calculated using age intervals. The results are similar to histograms, rather than continuous functions
obtained using the first version of the definitions.
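The relationship between the continuous and interval versions of these functions can be checked numerically. The sketch below assumes a two-parameter Weibull failure distribution purely for illustration. The shape `beta`, the scale `eta`, the age `t` and the interval length `L` are hypothetical values, and the function names are not from the text.

```python
import math


def weibull_reliability(t, beta, eta):
    """R(t): probability of surviving to age t."""
    return math.exp(-((t / eta) ** beta))


def weibull_pdf(t, beta, eta):
    """f(t): probability density of failure at age t (t > 0)."""
    return (beta / eta) * (t / eta) ** (beta - 1) * weibull_reliability(t, beta, eta)


def hazard(t, beta, eta):
    """h(t) = f(t) / R(t), the continuous (first) definition."""
    return weibull_pdf(t, beta, eta) / weibull_reliability(t, beta, eta)


def conditional_probability_of_failure(t1, t2, beta, eta):
    """H = (R(t1) - R(t2)) / R(t1), the interval (second) definition."""
    r1 = weibull_reliability(t1, beta, eta)
    r2 = weibull_reliability(t2, beta, eta)
    return (r1 - r2) / r1


beta, eta = 2.0, 1000.0   # hypothetical shape and scale (hours)
t, L = 500.0, 10.0        # age and interval length (hours)

H = conditional_probability_of_failure(t, t + L, beta, eta)
print(f"H = {H:.5f}, h(t) * L = {hazard(t, beta, eta) * L:.5f}")
```

For an interval that is short relative to the item's life, the two printed values are nearly equal, which is the sense in which the second definition is a discrete version of the first.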
Appendix 8.
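[Figure: probability of survival without failure (vertical axis, 0 to 1) plotted against operating
age in multiples of the MTBF (horizontal axis). Plotted survival values: .78 at 0.25, .61 at 0.50,
.47 at 0.75, .37 at 1, .29 at 1.25, and .22 at 1.50 times the MTBF. The .50 survival level is also
marked.]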
By definition:
$$\text{MTTF} = \int_0^{\infty} R(t)\,dt$$
where R(t) is the survival probability at time t.
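The survival values listed for this appendix (.78, .61, .47, .37, .29, .22 at 0.25 to 1.50 times the MTBF) are consistent with an exponential survival curve, R(t) = exp(-t/MTBF). The sketch below assumes that this is the distribution behind the chart. It reproduces the values and checks the integral definition of MTTF numerically. The function names and the 1000-hour MTBF are hypothetical.

```python
import math


def survival_exponential(x):
    """Survival probability at age x, with x expressed in multiples of the MTBF."""
    return math.exp(-x)


def mttf_from_survival(mtbf, upper_multiple=40.0, steps=20_000):
    """Approximate MTTF = integral of R(t) dt from 0 to infinity.

    Uses the trapezoidal rule, truncated at upper_multiple x MTBF, where the
    remaining tail of exp(-t/MTBF) is negligible.
    """
    h = upper_multiple * mtbf / steps
    total = 0.5 * (1.0 + math.exp(-upper_multiple))
    for i in range(1, steps):
        total += math.exp(-(i * h) / mtbf)
    return total * h


for x in (0.25, 0.50, 0.75, 1.0, 1.25, 1.50):
    print(f"{x:.2f} x MTBF -> {survival_exponential(x):.2f}")

print(round(mttf_from_survival(1000.0), 1))  # close to 1000.0
```

Under this assumption only about 37% of items survive to the MTBF, and the 50% survival point falls at ln 2, or about 0.69, times the MTBF.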
Appendix 9.
224
Modified from Report AD-A066-578, “Reliability-Centered Maintenance”, F. Stanley Nowlan, Howard
F. Heap, National Technical Information Service, U.S. Department of Commerce, 1978
operating crew under normal circumstances | ensure that failure is detected
Ability to measure/detect reduced resistance to failure | Determines applicability of on-condition tasks
Rate at which failure resistance decreases with operating age once a potential failure225 occurs | Determines interval for on-condition tasks
Age-reliability relationship | Determines applicability of rework and discard tasks
Age-reliability-covariate relationship | Determines the key risk factors for interpreting on-condition data.
Cost of corrective maintenance | Helps establish PM task effectiveness (except for safety and environment impacting failures).
Cost of preventive maintenance | Helps establish PM task effectiveness (except for safety and environment impacting failures).
Need for safe-life limits to prevent safety or environment failures | Determines applicability and interval of safe-life discard tasks
Need for servicing and lubrication | Determines applicability and interval of servicing and lubrication tasks
Appendix 10.
225
A potential failure is a measurable indicator of reduced resistance to failure.
Why? | Why? | Why? | Why? | Why? | Why? | Why? | Why?
Ventilation system fails | Fan fails | Motor fails | Motor trips | Airways clogged with dirt. | Inadequate design
Defective sensor
Bearing seized | Lubricant allowed to run dry
Wrong lubricant | Improperly labeled | Stores error
Label misread | Inattention
Insufficient training
Power drive fails | Belts failed | Incorrectly installed | … | … | …
Incorrectly specified | … | … | …
Distribution system fails | Duct fails | Duct clogged | … | … | … | …
Duct pierced | … | … | … | …
Damper failed | … | … | … | …
… | … | … | … | … | … | … | …
226
Sample: Observations of an item’s (or group of similar items’) installations, failures, preventive
renewals, significant events, and condition data over a period of time.
b. Fitted: The curve of the EXAKT decision chart is fitted to the actual data
so as to minimize “average” realized cost.
i. Fitted, Method A: Suspensions227 considered as preventive
renewals.
ii. Fitted, Method B: Suspensions not counted.228
c. Theoretical: The warning level curve is selected to minimize “expected”
cost.229
3. No scheduled maintenance (NSM): The policy of not using any proactive
(neither scheduled nor on-condition) maintenance.
Rather than describing, in rigorous detail, the various calculation methods mentioned in 2
above, an example of an effectiveness assessment of a CBM policy by comparing these
alternative policies is given below. This data is derived from diesel engines and applies to
a fleet of 300 T haul trucks.
In row “Current” of Table 19-4 we find that of the 13 actual histories in the sample 6
failed, 3 were replaced, and 4 are “undecided” – i.e. we do not know whether they will
eventually fail or be preventively replaced. At the present time they are still operating.
An optimized model in CBM is a tool for interpreting condition data in order to declare a
potential failure so that a required objective is met (minimal average cost, maximum
uptime, a reliability goal, or some other performance metric.). The model is derived from
past equipment failure behavior as a function of age and monitored condition data.
Applying the optimized interpretation model retroactively to the data (row 3 of Table
19-4), we see that 1 would have failed, 6 would have been replaced and 6 would have
been undecided. The result looks very promising since 5 out of 6 failures would have
been prevented. However, our final assessment must take into account how much of the
total operational time we have “exchanged” for such a decrease in failure rate. That is to
say we may have been too cautious having preventively intervened (premature
227
Right suspensions. Equipment that is currently still operating at the time of the sample.
228
We are considering two sets of calculations for the analyst to consider. It is a kind of best and worst
case, with the actual situation being somewhere in the middle.
229
Another calculation to help judge how well the EXAKT derived policy will do in the future
230
“Undecided” means that it is unknown whether the item would have failed. The item was either still in
operation or had been replaced preventively in the actual data set (sample)
231
The optimal policy applied to the data would have permitted one failure to occur. That is the prediction
method would have “missed” one time.
replacements) too often resulting in an expensive PM policy. We evaluate this by using
Table 19-5.
From this we may conclude that the number of failures would have been significantly
reduced. The cost ratio used in the optimization calculation was 6000:1000. In Chapter
10. “Optimizing CBM” page 145 we perform a sensitivity analysis to determine how
changes in the ratio will impact the optimal policy.
Next, in Table 19-5 we compare the cost per operating hour of the Current policy with
that of the optimal Applied policy to see whether there is any significant reduction in
232
Still functioning at the sample cut-off date.
total maintenance costs233. This should be the main criterion234 for assessment. From
Table 19-5 we see that the current policy cost is $0.391/h, and the optimal policy cost is
$0.195/h. This reduction in the cost of about 50% is significant.
We may also compare the MTBR for both policies. If there is a significant reduction in
MTBR (mean time between repairs, either preventive or as the result of failure) the
optimal policy is being cautious in reducing failures (due to high cost ratio). If the
MTBRs are similar, then the analysis is telling us that our condition indicating
measurements (interpreted by the model) are a relatively accurate predictor of oncoming
failures.
In the example, the current policy cost is $0.391/h, and the optimal policy cost is $0.195/h.
The reduction in cost is about 50%235. The percent of preventive replacements for the
Current policy is 53.85%236, and for the Applied optimal policy, 92.31%237. MTBR is
8458.92h for the Current policy, and 7113.54h for the Applied optimal policy. All this
leads us to the conclusion that there is much to be gained by optimization.
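The comparison between the Current and Applied policies can be recomputed from the figures quoted above. The sketch below is illustrative only: the dictionary keys and function name are assumptions, and because it works from the rounded published costs, its cost reduction may differ in the last digits from the value given in footnote 235.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Figures quoted in the text for the 13 histories in the sample.
histories = 13
current = {"cost_per_hour": 0.391, "non_failures": 3 + 4, "mtbr_hours": 8458.92}
applied = {"cost_per_hour": 0.195, "non_failures": 12, "mtbr_hours": 7113.54}

cost_cut = percent_reduction(current["cost_per_hour"], applied["cost_per_hour"])
mtbr_cut = percent_reduction(current["mtbr_hours"], applied["mtbr_hours"])
share_current = 100.0 * current["non_failures"] / histories
share_applied = 100.0 * applied["non_failures"] / histories

print(f"cost reduction: about {cost_cut:.0f}%")
print(f"preventive share: {share_current:.2f}% -> {share_applied:.2f}%")
print(f"MTBR reduction: {mtbr_cut:.1f}%")
```

Run on these figures, the calculation shows the cost per hour roughly halving while the MTBR falls by about 16%, which is the trade-off between fewer failures and earlier interventions that the text asks the analyst to weigh.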
Next compare the cost of the optimal Applied policy to that of the Theoretical one. If
these two costs are similar, we may conclude that the theoretical model fits the data
properly. In the example the cost of the applied policy is $0.195/h, and that of the
theoretical one is $0.157/h. This difference is not very large (considering the sample
size). Theoretically, then, we expect 97.74% preventive replacement, but only 92.31% =
12/13 would have been realized by applying the optimal policy. Similarly, theoretically
we expect the MTBR to be 7070.09h, but 7113.54h would have been realized. (For this
sample size, these two values are very close).
We now compare the results of the Fitted and Applied policies. Close cost values favor
the conclusion that the optimal model is a good one. A significant difference in the costs
may mean that some part of the theoretical model may be improved, possibly the method
of classifying inspection value ranges238. In the example, the cost of the fitted policy is
$0.182/h, close to the cost $0.195/h of the applied policy. Both policies have one failed
history, but different MTBRs - 7627h for the fitted policy, and 7113.54h for the applied.
This means that the fitted policy would have been more accurate in selecting the moment
for rework or discard239.
In summary:
1. The above analysis provides a way to judge the potential of a proposed CBM
policy.
233
The combined costs of all failures and all preventive repairs in the sample period.
234
The analysis may also be done from the point of view of maximizing total availability, in which case
costs would be replaced by “downtime” using the relationship Avail = uptime/ (uptime+downtime).
235
50.22% = 100% - 49.78%, 49.78% = 0.195/0.391
236
(3+4)/13
237
12/13
238
In the transition probability model.
239
One might ask, why not use the fitted policy then. Answer: the fitted policy can be obtained only after
the fact. The purpose of evaluating a proposed policy in this way is to help judge its future effectiveness.
2. It uses various sets of calculations to probe the robustness of the proposed model.
3. It is a tool that a statistician uses to gain a degree of comfort by arriving at similar
numbers using calculations at both sides of the envelope of possible solutions.
The assessment procedures described here provide not only an objective way to assess
actual (current) PM policy but ways to predict and evaluate the cost advantages of future
optimized policies.
Appendix 12.
The expected life cycle Tc will be the planned maintenance time tp multiplied by the
probability that planned maintenance does occur, plus the expected failure time (knowing
that failure occurs before tp) multiplied by the probability that failure occurs before tp.
The term E(T | T ≤ tp) in Equation 19-1 is the expected time to failure, given that
failure occurs prior to scheduled maintenance, under a policy where scheduled
maintenance is carried out at time tp. We wish to show that it can be expressed as
$$E(T \mid T \le t_p) = \frac{\int_0^{t_p} t\,f(t)\,dt}{1 - R(t_p)}$$
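$$F_c(t) = P(T \le t \mid T \le t_p) = \begin{cases} 1, & t > t_p \\[4pt] \dfrac{P(T \le t)}{P(T \le t_p)}, & t \le t_p \end{cases} = \begin{cases} 1, & t > t_p \\[4pt] \dfrac{F(t)}{1 - R(t_p)}, & t \le t_p \end{cases}$$
Equation 19-2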
In the first part of Equation 19-2, we have simply defined the distribution function of
(T≤t given that T≤tp) as Fc(t) = P(T ≤ t | T ≤ tp). We will call this conditional
distribution function “Fc(t)”. (Recall the definition of a distribution function in Appendix
7. on page 290.)
Now, moving towards the right in Equation 19-2, the top condition “1, where t>tp” is
easy to understand. We know that failure will have occurred prior to tp (with 100%
certainty) because T≤tp is our hypothesis in Fc(t).
The bottom condition, $\dfrac{P(T \le t)}{P(T \le t_p)},\ t \le t_p$, requires us to know that the conditional
probability P(A|B) is $\dfrac{P(A \cap B)}{P(B)}$, where A = T≤t and B = T≤tp.
But we know that the intersection of T≤t and T≤tp is T≤t (see footnote 240).
In the rightmost part of Equation 19-2 we apply the definition of F(t) to the numerator
and denominator. And, of course, we know that F(t) = 1-R(t). (See Appendix 7. on page
290.)
$$f_c(t) = \begin{cases} 0, & t > t_p \text{ or } t < 0 \\ \dfrac{f(t)}{1 - R(t_p)}, & 0 \le t \le t_p \end{cases}$$
Equation 19-3
We have used, in Equation 19-3, the fact that the density function is the first derivative of
the distribution function.
Therefore,
$$E(T \mid T \le t_p) = \int_0^{\infty} t f_c(t)\,dt = \int_0^{t_p} \frac{t f(t)}{1 - R(t_p)}\,dt = \frac{\int_0^{t_p} t f(t)\,dt}{1 - R(t_p)}$$
Equation 19-4
Here, in Equation 19-4, we have invoked the definition of “Expectation” as the integral of
the product of t and the density function. From this point on it’s just a matter of
substituting expressions from Equation 19-3.
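As a numerical sanity check on Equation 19-4, the following minimal sketch evaluates E(T | T ≤ tp) and the expected life cycle for a Weibull life distribution. The Weibull form, the parameter values (beta = 2, eta = 1000 hours, tp = 800 hours) and all function names are assumptions chosen for illustration; none of them come from the text.

```python
import math

# Assumption: a two-parameter Weibull life distribution with beta >= 1.
# beta, eta and tp below are illustrative values only.

def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_density(t, beta, eta):
    """f(t) = -dR/dt for the Weibull distribution; taken as 0 for t <= 0."""
    if t <= 0.0:
        return 0.0
    return (beta / eta) * (t / eta) ** (beta - 1.0) * weibull_reliability(t, beta, eta)

def simpson(func, a, b, n=2000):
    """Composite Simpson's rule with an even number n of sub-intervals."""
    h = (b - a) / n
    total = func(a) + func(b)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * func(a + i * h)
    return total * h / 3.0

def conditional_mean_failure_time(tp, beta, eta):
    """E(T | T <= tp) computed as in Equation 19-4."""
    numerator = simpson(lambda t: t * weibull_density(t, beta, eta), 0.0, tp)
    return numerator / (1.0 - weibull_reliability(tp, beta, eta))

def expected_cycle_length(tp, beta, eta):
    """tp * R(tp) + E(T | T <= tp) * (1 - R(tp)), the expected life cycle Tc."""
    r = weibull_reliability(tp, beta, eta)
    return tp * r + conditional_mean_failure_time(tp, beta, eta) * (1.0 - r)

if __name__ == "__main__":
    print(conditional_mean_failure_time(800.0, 2.0, 1000.0))
    print(expected_cycle_length(800.0, 2.0, 1000.0))
```

A convenient cross-check is that the expected cycle length computed this way should agree with the integral of R(t) from 0 to tp, which follows from integrating t f(t) by parts.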
240
Because t≤tp, the intersection of T≤t and T≤tp is actually T≤min(t,tp)=t.
Appendix 13.
[...] condition task is technically feasible (effective), is it worthwhile? | [...] inspection intervals short enough to make the task effective. | | | [...] maintenance that is not cost-effective |
Is a rework task to reduce the failure rate applicable? | No (unless there are real and applicable data): assign item to no scheduled maintenance. | -- | X | Delay in exploiting opportunity to reduce costs | Yes
If a rework task is applicable, is it effective? | No (unless there are real and applicable data): assign item to no scheduled maintenance. | -- | X | Unnecessary redesign (safety) or delay in exploiting opportunity | No for redesign; yes for scheduled maintenance
Is a discard task to avoid failures or reduce the failure rate applicable? | No (except for safe-life items): assign item to no scheduled maintenance. | X (safe life only) | X (economic life) | Delay in exploiting opportunity to reduce costs | Yes
If a discard task is applicable, is it effective? | No (except for safe-life items): assign item to no scheduled maintenance. | X (safe life only) | X (economic life) | Delay in exploiting opportunity to reduce costs | Yes
Appendix 14.
Figure 19-15: Relcode data entry for cloth filters
Figure 19-16
Exercise 4
A metropolitan transport company operates a fleet of similar buses. Engine failures
necessitating replacement have occurred in the kilometer ranges shown in the following
table which also shows the number of engines currently running in each age range.
Age Range (Kilometers) | Failures | Engines Currently Running
0-49,999 | 2 | 35
50,000-99,999 | 8 | 27
100,000-149,999 | 33 | 12
150,000-199,999 | 44 | 62
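One way to make a first pass at grouped data of this kind is an actuarial (life-table) estimate of reliability at the end of each age range. The sketch below is only an illustration under stated assumptions: it treats engines still running in a range as suspensions exposed, on average, for half of that range, and the function name and output layout are hypothetical. It is not necessarily the method the exercise intends.

```python
def actuarial_life_table(failures, running):
    """Life-table estimate from grouped failures and suspensions.

    failures[i] : engines that failed in age range i
    running[i]  : engines currently running whose age lies in range i
    Returns one tuple per range:
    (units entering the range, conditional failure probability,
     estimated reliability at the end of the range).
    """
    at_risk = sum(failures) + sum(running)
    reliability = 1.0
    rows = []
    for failed, suspended in zip(failures, running):
        # Assumption: suspensions are exposed for half the range on average.
        effective = at_risk - suspended / 2.0
        q = failed / effective if effective > 0 else 0.0
        reliability *= 1.0 - q
        rows.append((at_risk, q, reliability))
        at_risk -= failed + suspended
    return rows

if __name__ == "__main__":
    # Data from the Exercise 4 table.
    for row in actuarial_life_table([2, 8, 33, 44], [35, 27, 12, 62]):
        print(row)
```

The resulting reliability estimates could then be plotted against the upper bound of each age range, for example on Weibull paper, to judge the failure pattern.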
Figure 19-18
Exercise 5
A new type of car has recently been released and is subject to warranty. An analysis of
warranty claims shows several alternator failures, although, as a proportion of the whole
population the numbers are quite small.
The available data are as follows:
Age Range (Kilometers) | Failure Replacements | Survivors
0-49,999 | 1 | 48
50,000-99,999 | 2 | 123
100,000-149,999 | 3 | 56
150,000-199,999 | 4 | 44
Figure 19-20
Appendix 15. EXAKT Exercises
The instructions in the right column of the following table are minimal so as to keep
them simple. The left column provides more detailed explanation. Whenever an
EXAKT menu option or icon is mentioned, it should be clicked in the EXAKT program.
When database tables are mentioned, they should be double clicked.
Exercise 1
Convention used: Meaning:
X instruction to close the current sub-window (or pane)
may be in date or date/time format. If condition
monitoring inspections are more frequent than once every
24 hours, the date/time format must be used. The
WorkingAge is a measure such as hours of operation, fuel
consumed, thousands of feet of steel rolled, or any other
measurement that reflects the accumulated usage or
stress on the item. Calendar time can only be used if the
units operate regularly in time – a rare situation.
Databases of production records, hour meters, or
counters must be used to acquire useful WorkingAge
data. The remaining columns contain the condition
monitoring data which we refer to as condition data.
Step 7
Explanation: Now examine the Events data table. Contrasted with the Inspections table, its information represents the other side of the coin. Both Event and Inspection data are required for CBM optimization. The EXAKT modeling process is one of correlation of Events (of all kinds) and Inspections (that is, condition data). Condition data often comes from specialized databases provided by CBM product or service vendors. Common examples are oil analysis and vibration analysis. These databases are invariably well organized and consistently populated. The Events data, on the other hand, often comes from the organization’s CMMS (computerized maintenance management system) and from production databases. (The records in the CMMS, typically, have been less rigorously kept than the others. Hence EXAKT contains tools and techniques to validate and get the CMMS data into shape.) The basic required Events are: 1) Beginning (an item has been placed into service), designated by B; 2) Ending by Failure (EF); and 3) Ending by Suspension (ES). By “suspension” we mean that the item has been taken out of service for any reason other than failure. For example, it may have been preventively replaced. Once again the Ident, Date of the Event, and WorkingAge are required fields. The Event itself is recorded in the fourth column. “OC” in this example represents an “oil change” event. Any event which affects the condition data (in this case it would initialize the wear metals and contaminant elements to zero) must be included in the model.
Instructions: Events, X
Step 8
Explanation: Examine the CovariatesOnEvent table. We must provide the “initialization values” for each event. Note that in this case we are initializing wear metals and contaminants to zero and additives to their new-condition levels. We may also establish calendar periods for which these initialized values are to be used. (For example, the brand or grade of lubricating oil may be changed periodically.)
Instructions: CovariatesOnEvent, X
Step 9
Explanation: Examine the EventsDescription table. The column P (for precedence) tells the EXAKT program in which order to consider separate events that occur at the same date/time. For example, if an oil sample is drawn from an oil drain, we would wish the Inspection to precede the oil change in the sequence. The inspection event is implicitly given the precedence “0”.
Instructions: EventsDescription, X
Step 10
Explanation: Examine the Models table. It contains no records yet. That is because you have not yet begun building a model. This table is populated automatically by EXAKT as you proceed. The only time you might access this table manually would be to delete certain sub-model(s) that you do not wish to retain. A sub-model is one of any number of models that are tested in the modeling process. The sub-model that is considered the best is then exported to become the intelligent agent that will provide decision optimization on a particular item’s condition data.
Instructions: Models, X
Step 11
Explanation: Now that we have examined the internal and external database tables we are ready to proceed with the development of a rudimentary CBM optimization model. We turn our attention to the right hand window pane containing buttons arranged in a flow chart of activities. We enter the general project data.
Instructions: Data Preparation, General Event Data, Project Title: Haul Trucks, CBM Model: Trans Oil Anal, Description: 350 T Transmission Oil Analysis, Time Unit: Hrs., OK
Step 12
Explanation: Next we instruct EXAKT to assemble the Events and Inspections into a single table, C_Inspections, to be used for subsequent calculations. Depending on which version of EXAKT you are using there are a number of alternative buttons we may hit. But for this exercise please choose the option similar to “Covariates – Complete”. After hitting this button two more tables will appear in the left pane, C_Events and C_Inspections.
Instructions: With Covariates (Complete)
Step 13
Explanation: Examine the C_Inspections table. Note that the records of both tables (Events and Inspections) have been combined and arranged in chronological order in the single table C_Inspections. Inspection (condition monitoring) record events are designated by an *. The other event records have monitored data (covariate) values set to their initialized levels according to the CovariatesOnEvent table discussed previously.
Instructions: C_Inspections, X
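The combination of Events and Inspections described in steps 8, 9, 12 and 13 can be pictured with a small sketch. This is not EXAKT's implementation; the record layout, the function name, the precedence values and the sample data are all hypothetical, and it assumes that records sharing a date are ordered by ascending precedence, with inspections at precedence 0.

```python
def combine_events_and_inspections(inspections, events, precedence, init_values):
    """Merge inspection and event records into one chronological table.

    precedence  : hypothetical mapping of event code -> precedence number
    init_values : hypothetical mapping of event code -> covariate values
                  to which the condition data are initialized by that event
    """
    combined = []
    for rec in inspections:
        combined.append({
            "ident": rec["ident"], "date": rec["date"],
            "working_age": rec["working_age"],
            "event": "*",                    # inspections are marked with *
            "precedence": 0,                 # implicit precedence of an inspection
            "covariates": dict(rec["covariates"]),
        })
    for rec in events:
        combined.append({
            "ident": rec["ident"], "date": rec["date"],
            "working_age": rec["working_age"],
            "event": rec["event"],
            "precedence": precedence[rec["event"]],
            "covariates": dict(init_values.get(rec["event"], {})),
        })
    # ISO date strings sort chronologically; ties are broken by precedence.
    combined.sort(key=lambda r: (r["ident"], r["date"], r["precedence"]))
    return combined

if __name__ == "__main__":
    inspections = [{"ident": "17-66", "date": "2004-03-01", "working_age": 500,
                    "covariates": {"Iron": 12.0, "Lead": 3.0}}]
    events = [{"ident": "17-66", "date": "2004-03-01", "working_age": 500, "event": "OC"},
              {"ident": "17-66", "date": "2004-01-01", "working_age": 0, "event": "B"}]
    zero = {"Iron": 0.0, "Lead": 0.0}
    for row in combine_events_and_inspections(inspections, events,
                                              {"B": 1, "OC": 2},
                                              {"B": zero, "OC": zero}):
        print(row)
```

In this sample the oil sample and the oil change share a date, and the inspection is listed first, as the precedence rule in step 9 requires.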
Step 14
Explanation: Now let’s begin the “modeling” phase of the analysis. Hit the “Modeling” button in the “Transmissions Oil Analysis(*):2” window, not the “Modeling” menu item. After executing step A on the right, the Trans Oil Anal (ilcm) report window appears. Examine the report. The “Summary of Events and Censored Values” presents the overall summary of the data being analyzed. A “Sample Size” of 13 means that there are 13 histories or lifetimes having a beginning and some kind of ending event. Of the 13 histories, 6 ended in failure, 3 (Censored (Def)) ended prior to a failure, and 4 (Censored (Temp)) units are currently in operation at the time of building this model. The latter are referred to in EXAKT as “temporary suspensions” and are identified automatically by the software. The next tabulation, “Summary of Estimated Parameters”, provides the results of our first sub-model “ilcm”. The column “Sign.” indicates whether the “Parameter” is significant, that is, whether it has been found to be statistically related to failure. The Shape (i.e. WorkingAge), Iron, and Lead are designated as significant (at this point in the analysis) while Calcium and Magnesium are not. Note that Magnesium has the highest p-value; the higher the p-value, the weaker the evidence that Magnesium has any impact on the risk of failure. The next step is to try a different model by eliminating the lowest impact variable, Magnesium. Close the window and execute steps B and C to create 2 more sub-models. Notice that we are successively removing the covariate with the highest reported p-value. After hitting “OK” you will receive a warning message from EXAKTm telling you that the procedure is over. This is normal for samples of small size (low number of histories ending in failure). You may safely ignore this message by hitting OK in the message box. Each of the reports produced from the different models may be printed (Ctrl-P). The columns in the reports are explained in the EXAKT Manual accessible from the Windows Start menu.
Instructions:
A. Modeling, Weibull PHM, Select Covariates, sub-model Name: ilcm, Iron, →, →, →, →, OK, X
B. Select Covariates, sub-model Name: ilc, Magnesium, ←, OK, get “Warning: The procedure is over …”, X, X
C. Select Covariates, sub-model Name: il, Calcium, ←, OK, “Warning: The procedure is over …”, X, X
Step 15
Explanation: At this point we have a sub-model with covariates and shape parameter that are all significant. We may conclude that this, therefore, is potentially an acceptable model for failure risk prediction. To be rigorous, we should test one last possible combination: a sub-model with Iron alone. (We choose Iron as it is the variable with the lowest p-value and thus is likely to have the strongest relationship to failure.) The report tells us that this is also a potentially good predictive model (i.e. Iron alone is still significant). In the next step we decide which of the two sub-models should be retained and later deployed.
Instructions: Select Covariates, sub-model Name: i, Lead, ←, OK, X
Step 16
Explanation: After executing the steps on the right the “PHM Parameter Estimation - Comparison” report is displayed. The “N” in the second column is telling you that the sub-model “i” is not close to the base sub-model “il”. This means that this simpler sub-model is not as good as “il” and that we would be losing confidence by using it rather than the more complete model “il”.
Instructions: Comparative Report, Compare: il, i, →, OK, X
Step 17
Explanation: In this step we examine the results of statistical testing performed by EXAKT on the retained sub-model, il. Reactivate this model with the steps on the right. Use the menu item “Modeling”.
Instructions: Modeling (menu item), Select Current Model, Sub model: il, OK.
Step 18
Explanation: Now hit the Modeling button (not the Modeling menu item). The third table of the “PHM Goodness of Fit Test” tells us that the proportional hazards model we constructed for risk as a function of working age and the two significant covariates “fits” the data well enough for it to be used with a confidence of 95%. The test used for this is known as the Kolmogorov-Smirnov test and is well accepted as a statistical tool. The test shows that the model is not rejected at the 5% significance level, i.e. it is accepted at a 95% confidence level.
Instructions: Modeling, Weibull PHM, Summary Report, X
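To make "risk as a function of working age and the two significant covariates" concrete, the sketch below evaluates a hazard of the usual Weibull proportional hazards form, h(t, Z) = (β/η)(t/η)^(β−1) exp(Σ γᵢ Zᵢ). The parameter values and the function name are invented for illustration; they are not the estimates that EXAKT reports for sub-model il.

```python
import math

def weibull_phm_hazard(age, covariates, beta, eta, gammas):
    """Hazard rate of a Weibull proportional hazards model.

    age        : working age (same units as eta)
    covariates : dict of covariate name -> current measured value
    gammas     : dict of covariate name -> coefficient (hypothetical values)
    """
    weighted_sum = sum(gammas[name] * covariates[name] for name in gammas)
    baseline = (beta / eta) * (age / eta) ** (beta - 1.0)
    return baseline * math.exp(weighted_sum)

if __name__ == "__main__":
    # Hypothetical parameters, for illustration only.
    params = dict(beta=1.8, eta=20000.0, gammas={"Iron": 0.05, "Lead": 0.12})
    low = weibull_phm_hazard(8000.0, {"Iron": 5.0, "Lead": 1.0}, **params)
    high = weibull_phm_hazard(8000.0, {"Iron": 25.0, "Lead": 1.0}, **params)
    print(low, high, high / low)
```

With the age held fixed, the ratio of the two hazards depends only on the change in the covariates, which is the "proportional" property of the model.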
Step 19
Explanation: After executing the steps of (A) on the right we see that EXAKT has created a set of bands (listed under Interval Start Points) or “transition” states for Lead with which to build a “transition probability model”. The transition probability model calculates the probability of jumping to another state at the next inspection interval. (An example of what we mean by jumping to another state will be given below in step 20.) Execute step (B) and notice that the transition bands provided for Iron are quite different. This is because historical iron measurements are scattered throughout an entirely different range of values. This can be ascertained using EXAKT's cross-graph function (see user guide). Execute step (C) to close the window.
Instructions:
(A) Transition Probability Model, Covariate Bands, Covariate: Lead
(B) select Covariate Iron
(C) OK
Step 20
Explanation: Execute step A. Notice that the two buttons “Display Matrix” and “Display Survival” become active. Let’s examine the Display Survival function report. Set WorkingAge to, say, 8000 hours, and Observation Interval to, say, 200 hours (assuming, for example, that our asset is currently at age 8000 and we are interested in knowing its risk of failure in the next 200 hours). The “Markov Chain Model Survival Probability matrix” report is displayed. The probabilities of Iron values jumping to another state and the probability of failure in the upcoming interval are displayed in a tabular format. (This table represents only a part of the entire set of transition probabilities taken into account by the model, since we have chosen to ignore the other significant covariate, Lead, in this report. To include more than one covariate in the visual report would require the representation of multi-dimensional matrices. Instead, this report allows us to see how a single variable changes irrespective of the others.) Looking at the table we see, for example, that the cell for "0-4.004" and "4.004-9.009" has the entry 0.301615. This means that there is a 30.1615% probability that iron, currently in the first of these states, will be in the second at the next monitoring interval. Hence this report provides the probabilities of being in any state at some future time. (Of course, this report is provided for analysis purposes only while building the model. The transition probabilities are fully integrated into the final decision model that will be deployed in section 2.)
Instructions: (A) Transition Rates, Display Survival, Working Age: 8000, Observation Interval: 200, Report. Close the report and the “Display Survival Probabilities” dialog.
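The idea of "jumping to another state" can be sketched with a small Markov chain calculation. In the matrix below only the entry 0.301615 comes from the report discussed above; the third band, every other probability, and the function names are hypothetical, and the sketch ignores the probability of failure that the real report also carries.

```python
def step_distribution(distribution, transition_matrix):
    """Advance a state distribution by one monitoring interval."""
    n = len(transition_matrix)
    return [sum(distribution[i] * transition_matrix[i][j] for i in range(n))
            for j in range(n)]

def distribution_after(distribution, transition_matrix, intervals):
    """State distribution after a given number of monitoring intervals."""
    for _ in range(intervals):
        distribution = step_distribution(distribution, transition_matrix)
    return distribution

# Iron bands: "0-4.004", "4.004-9.009", and a hypothetical "above 9.009".
# Row i holds the probabilities of moving from band i to each band.
# Only 0.301615 is taken from the text; all other entries are invented.
IRON_TRANSITIONS = [
    [0.650000, 0.301615, 0.048385],
    [0.100000, 0.700000, 0.200000],
    [0.020000, 0.180000, 0.800000],
]

if __name__ == "__main__":
    start = [1.0, 0.0, 0.0]  # iron currently in the lowest band
    print(distribution_after(start, IRON_TRANSITIONS, 1))
    print(distribution_after(start, IRON_TRANSITIONS, 3))
```

After one interval the second entry of the distribution equals the matrix cell itself; over several intervals the probabilities spread across the bands.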
Step 21
Explanation: Now for the final step in developing a decision optimization model. We blend into the model the economics governing the failure and repair of this item. That is, we apply the average cost of a preventive repair, C, and the average cost (including consequential costs) of a failure, C+K. (It is rarely necessary to have great precision in these amounts for relative costs. The cost sensitivity function of EXAKT allows us to confirm this for the decision model in question. Its usage is described in the EXAKT help file guide.) After hitting the Report Icon (which you'll find to the left of the Print Icon on the Tool Bar), the “Condition Based Replacement Policy – Cost Analysis” report appears. Examine the “Summary of Cost Analysis” table below the Cost Function graph. It is telling you that by adhering to the interpretive decisions of the model, an optimal long run ratio of preventive to failure replacements will be 98.8:1.2, which will result in a cost savings of 75.1% relative to a replacement-only-at-failure policy. (The cost comparison reporting function similarly compares the optimal EXAKT policy with existing practice. Its usage is described in the EXAKT help file guide.)
Instructions: Decision Model, Decision Model Parameters, Replacement (C): 1200, Failure (C+K): 6000, Cost Unit: $, Inspection Interval: 250, OK, Full report Icon (two icons to the left of the Print Icon), X
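The arithmetic behind these figures can be sketched as follows. The 98.8:1.2 ratio and the costs of $1200 and $6000 are taken from the step above; the function names are hypothetical, and the cost rates used for the savings calculation are invented placeholders, because the mean times between replacements that determine them are not given in this section.

```python
def average_cost_per_replacement(preventive_fraction, preventive_cost, failure_cost):
    """Long-run average cost of one replacement event under a mixed policy."""
    return (preventive_fraction * preventive_cost
            + (1.0 - preventive_fraction) * failure_cost)

def relative_savings(policy_cost_rate, baseline_cost_rate):
    """Fractional saving of a policy's cost per unit time versus a baseline."""
    return 1.0 - policy_cost_rate / baseline_cost_rate

if __name__ == "__main__":
    # Values from step 21: 98.8% preventive, C = 1200, C + K = 6000.
    print(average_cost_per_replacement(0.988, 1200.0, 6000.0))
    # Hypothetical cost rates ($ per hour), chosen only to show the calculation.
    print(relative_savings(0.249, 1.0))
```

With the step 21 inputs the average cost per replacement works out to about $1257.6. A savings percentage compares costs per unit of working age, so it also depends on how long items last between replacements under each policy.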
Step 22
Explanation: We have been, up to now, building a model based on the historical data from the entire fleet. We may now test the model on any individual unit either for the current situation (i.e. the latest data available in the database, called "LH" for last history) or we may look at any other history retroactively. The steps on the right display the reports of the latest monitored values of each unit. Four graphs are shown, one for each of the four units 17-66, 17-67, 17-77 and 17-79. By examining the four graphs we see that none are in alarm at the moment this snapshot of the data was taken. If the weighted sum of the significant covariates (i.e. the y-axis plotted variable) falls in the green region, no action is necessary; in the yellow, the item should be renewed before the next monitoring interval; in the red, the item should be repaired or replaced immediately. It should be noted that these boundaries vary with working age, which reflects the analysis finding that working age, as well as Iron and Lead, are significant failure risk factors. At some point in the past the values for 17-67 hit the red zone. This may indicate a spurious laboratory result that was corrected in a follow-up verification. (For modeling, known incorrect data should be removed from consideration.) Note that the x-axis scale differs from graph to graph depending on the current age of the unit.
Instructions: Decisions, 17-66, shift+17-79, Report, X, Report Icon, PgDn, PgDn, PgDn, X
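The green, yellow and red rule described above can be expressed as a small classifier. The shapes and numbers of the two boundary curves below are purely hypothetical stand-ins for the limits that the decision model computes; only the three recommendations are taken from the text.

```python
import math

def classify(weighted_sum, working_age, red_limit, yellow_limit):
    """Return the decision zone for the current weighted covariate sum.

    red_limit and yellow_limit are functions of working age that return
    the zone boundaries (hypothetical here; the real ones come from the model).
    """
    if weighted_sum >= red_limit(working_age):
        return "red"      # repair or replace immediately
    if weighted_sum >= yellow_limit(working_age):
        return "yellow"   # renew before the next monitoring interval
    return "green"        # no action necessary

# Hypothetical boundaries that fall as working age (hours) increases.
def red_limit(age):
    return 6.0 - 0.5 * math.log(age / 1000.0)

def yellow_limit(age):
    return red_limit(age) - 1.0

if __name__ == "__main__":
    for age in (2000.0, 8000.0):
        print(age, classify(4.2, age, red_limit, yellow_limit))
```

The same weighted sum can fall in different zones at different ages, which mirrors the remark that the boundaries vary with working age.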
Step 23
Explanation: The analysis and model building phase is complete. We are now ready to export the optimal decision model we created into our maintenance system environment (where it has access to continuously renewing data) so that it can do its job. Activate the pane on the left by clicking it. By hitting save as instructed on the right, you are sending the model to a database located on the network. But before you do so, we will, for expedience, copy the script onto the clipboard as instructed. Then hit save. You will notice that several new table links to an external database have been added to the tree in the left pane. Now that the ODBC links have been set up, we proceed to the actual export step next.
Instructions: Hit anywhere in explorer (left) pane, ModelDbase, Connect to Database Script, key in the script for exporting the model (actually it has been keyed in for you in this sample), select the entire script, ctrl-c, Save
Step 24
Explanation: After executing the steps on the right you may examine the tables DecModels, UnitToModel, DecCovariatesOnEvent, DecEventsDescription (by double clicking on the file names in the tree view of the left pane) to see just what information has been exported to the external database. Please proceed to Section II of this tutorial in order to deploy the decision model that you have just created. You may close the EXAKT Modeling (EXAKTm) program.
Instructions: ModelDbase, Store the Decision model
following the steps on the right.
Step 7
Explanation: The results of the entire fleet have been analyzed and decisions have been returned for each unit. You may examine the reports of each fleet member by following the steps on the right.
Instructions: Report icon, expand report window, PgDn, PgDn, PgDn, X (of the sub-window or pane, not the main window)
Step 8
Explanation: With “Trans Oil Anal” selected you can conveniently examine the optimal decisions for the entire fleet on one list in the right window. You are actually examining the contents of the Decisions table of the Transmissions_DMDR.mdb database. This database can be accessed easily by any program, such as your CMMS. This implies that the decision model’s operation and its results may be integrated within existing maintenance system software. In other words, the EXAKTd program need not be used at all. However, it does have a very convenient user interface and several useful functions, some of which are described in the following steps.
Instructions:
Reports, Create new report list, New Report List Name: Indoor trucks, OK
Reports, Create new report list, New Report List Name: Outdoor trucks, OK
Select “Trans Oil Anal”, Select 17-66 + 17-67, ctrl-c, Select Indoor Trucks, ctrl-v
Select “Trans Oil Anal”, Select 17-77 + 17-79, ctrl-c, Select Outdoor Trucks, ctrl-v
Step 9
Explanation: Now we will use the new report lists to help manage our trucks by department.
Instructions:
Select Indoor Trucks, Reports, Create Reports, Calculate time to replace
Select Outdoor Trucks, Reports, Create Reports, Calculate time to replace
Step 10
Explanation: This completes this section of the Tutorial. This has been a minimal exercise to demonstrate a small portion of the EXAKT functionality. Please refer to the On-line guide (available on your Start | Programs | EXAKT menu) for a much more detailed treatment of the subject of CBM optimization.
After executing the instructions on the right, the required
tables are now accessible in the left pane of the EXAKT
(for modelling) window.
Step 6
Explanation: By executing the instructions on the right, we will assign:
• Idents (that tell EXAKT which idents, i.e. units, are to have their data included in the predictive model that we are currently building).
• Events (that tell EXAKT which named events in the database the model should use internally as B, EF and ES respectively).
• Variables (that tell EXAKT which variables to use and how to rename them for the model we are building. This allows the decision agent to display short meaningful names in the optimal decision graph.)
These mappings for the CBM Model “Gear1” are shown in the dialog reproduced below.
Instructions: Idents, Check GearboxA, Events Selection, B, Select Event: B, Precedence: 6, Apply, EF1, Select Event: EF, Precedence: 2, Apply, EF2, Select Event: ES, Precedence: 3, Apply, Variable Selection, Health_Indicator1, Select Variable: H1, Apply, Health_Indicator2, Select Variable: H2, Apply, OK.
[In the above you may be wondering why we are mapping EF2 to ES. The reason is that EF2 is a failure mode of Gear2 (to be modeled next). The current policy is to replace Gear1 preventively when Gear2 fails. Hence the failure of Gear2 marks the suspension (ES) of Gear1. Thirdly, the variable name Health_Indicator1 in the database is mapped to the variable name H1 used by the model. Shorter names are more convenient in building the model.]
After completing the previous step, seven new tables appear in the left
pane: “CMI_Events”, “CMI_Inspections”, “Events”, “Inspections”,
“Histories”, “EventsDescription”, and “VarDescription”.
We will now proceed to build the model for Gear1.
After executing the instructions on the right a report
appears. “Shape” is reported as non-significant “N”.
Step 8
Explanation: Build the decision model. The dialogs for the Transition Probability Model (“Covariates Bands and Groups”) and the Decision Model Parameters are shown below.
Instructions: Transition Probability Model, Transition Rates, OK, Decision Model, Decision Model Parameters, Replacement (C): 1000, Failure (C+K): 6000, Cost Unit: $, Inspection Interval: 30, OK, Full [...]
Explanation: Executing the instructions on the right displays this report. The report indicates that the gear has failed, but that the failure would have been predicted two sample intervals ago (60 hours) had the model been available.
Instructions: Decisions, GearboxA, All Histories, Select “GearboxA[1]”, Report, Full Report Icon, PgDn, PgDn, PgDn … X
Step 10
Explanation: We have created and tested a decision model for Gear1. We may now, in the same way, generate a decision model for Gear2.
Instructions: Repeat steps 4 to 9, making obvious changes for the modeling of Gear2.
Step 11
Explanation: Create a new database “ComplexItemsDemo_DMDR.mdb” with seven tables:
1. DecCovariatesOnEvent
2. DecEventsDescription
3. Decisions
4. UnitToModel
5. DecModels
6. DecVarToModel
7. DecEventToModel
Instructions: For this tutorial, ComplexItemsDemo_DMDR.mdb has already been created for you, so you do not have to do anything for this step.
However, this can be easily done using the EXAKT tool “Data Preparation for EXAKT”. The procedure is: Start, EXAKT tools, Data Preparation for EXAKT, File, Build Corporate Database, Use Predefined Template, Decision Models (DMDR), Filename: ComplexItemsDemo_DMDR.mdb, Save, Enter Covariate Name: H1, Enter, H2, Enter, Marginal Analysis Format: Check, OK, File, Exit
Step 11
Explanation: Attach the 7 tables from ComplexItemsDemo_DMDR.mdb by following the instructions to the right. The attached tables will appear in the tree view in EXAKT’s left pane.
Instructions: Activate the left pane Window, ModelDBase (on the Menu bar), Connect to Model Database Script, type or copy and paste the following script into the editing window that appears.
    DATABASE = "ComplexItemsDemo_DMDR.mdb";
    ATTACH DecCovariatesOnEvent, DecEventsDescription, UnitToModel, DecEventToModel, DecVarToModel, Decisions, DecModels
hit Save.
Step 12
Explanation: Assuming you have previously completed building the model for Gear2 (Step 10), execute the instructions on the right. This will save this model to the DMDR database.
Instructions: Activate left pane, ModelDBase, Store
Step 13
Explanation: Make the model for Gear1 the current model.
Instructions: Modeling (on the menu bar), Select current model, CBM Model: Gear1, Submodel: H1_B1, OK
Step 14
Explanation: Now Store the model for Gear1 to the DMDR database.
Instructions: Activate left pane, ModelDBase, Store
Step 15
Explanation: Congratulations. You have created and exported two decision models for a complex item.
Instructions: Close the EXAKTm program.
Step 4
Explanation: Select the model “Gear One” and execute the decision agent.
Instructions: Gear1, Reports, Create reports, Calculate time to replace
Step 5
Explanation: Repeat step 4 for the model “Gear2”.
Instructions: Gear2, Reports, Create reports, Calculate time to replace
Step 6
Explanation: The prognostic results are now available for GearboxA. Examine the results using any of the 3 ways on the right.
Instructions:
1. Click on “Gear1” or “Gear2”.
2. Expand “Gear One” or “Gear Two” and click on the gear unit that you are interested in (only GearboxA in this example).
3. Click on a gear unit, View, View Model Report.
The tables and views are all in automatic synchronization.
This makes it easy to find and correct errors, as we shall see
in subsequent steps.
241
A temporary suspension is a cut off of a life time that is still ongoing. It has been “temporarily”
suspended by the snapshot of the data at the time of analysis.
242
Deletions and changes should always be carried out on a copy of the database. You should keep a record
of all changes that you have made to the data then save the summary with the database as a dated version.
It is convenient to do this on a read-once CD. That way you can easily go back to some previous version of
the database if you have made changes that need to be reversed. These are proper work habits for modelers.
Step 11
Explanation: After following the instructions on the right you will have reproduced the following graph.
Instructions: SI, Condition: Si<1000, Show Horizontal: Fe, Vertical Si, delete “Si<1000”, Show, reduce, X
Step 14
Explanation: EXAKT handles events (such as oil changes, adjustments, alignments, calibrations and other minor maintenance) that impact condition data in a correct manner. The instructions on the right will display the table illustrated below. It is often useful to display the events and inspections in a single table. Note the regularity of the oil change events.
Instructions: Modeling (on menu bar), Select Current Model, CBM Model: PHM(with OC), OK, Activate Left pane (Database explorer pane), Modelling (on menu bar), Create Model Input tables, Complete data, Database pane, C_Inspections, Scroll to record 356, reduce and close the C_Inspections table
Step 16
Explanation: Follow the instructions on the right and when we scroll down to the last row, we see the history number (also shown below) of the offending history. The number is found to be 64.
Instructions: Database pane, “Residuals: PHM(noHistExcl)(FeCorrSed) #1”, click on the “Residual” column header to order the records by Residual, scroll down to last row, note the History Number of 64, close the table
Exercise 4 (data smoothing and fixing shape factor to 1)
Random fluctuation of monitored condition data characterizes many otherwise straightforward
CBM applications. In this exercise we use the monitored pressure test data,
which reflects the deterioration of a sealing system in a nuclear fuel rod manipulating
mechanism. For additional background and details on this application, you may refer to
the document www.omdec.com/articles/reliability/paperCandu.html.
Explanation: [...] randomness of the previous submodel. But we have another problem. We observe a drooping artifact243 at the end of every history. This causes a poor model and a poor decision recommendation because the current value of the condition indicator leakSmooth0 is erroneously low! In step 7 we will correct this problem with a further transformation.
Instructions: [...] model LR_Smooth0 is leakSmooth0, Cancel
Step 7
Explanation: The adjusted smoothed variable produces a better model and a better decision recommendation. Note that the randomness of the data is further reduced and the drooping artifact has been corrected.
Instructions:
Repeat Step 5A but this time use the submodel LR_Smooth.
Repeat Step 5B but this time note that the variable used in the submodel LR_Smooth is leakSmooth.
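The text does not say how leakSmooth0 and leakSmooth are computed. As one plausible illustration of how a drooping artifact can arise at the end of a history and how it can be avoided, the sketch below compares a centered moving average that treats missing neighbours as zero with one that shrinks its window at the ends. Both functions, the window width and the sample values are assumptions made for this example.

```python
def centered_average_zero_padded(values, half_width):
    """Centered moving average in which points beyond the data count as zero.

    Near the end of the series the average is pulled down: a 'drooping' artifact.
    """
    n = len(values)
    window = 2 * half_width + 1
    smoothed = []
    for i in range(n):
        total = 0.0
        for j in range(i - half_width, i + half_width + 1):
            if 0 <= j < n:
                total += values[j]
        smoothed.append(total / window)
    return smoothed

def centered_average_shrinking(values, half_width):
    """Centered moving average that averages only the points that exist."""
    n = len(values)
    smoothed = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

if __name__ == "__main__":
    leak_rate = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # hypothetical readings
    print(centered_average_zero_padded(leak_rate, 2))
    print(centered_average_shrinking(leak_rate, 2))
```

For a steadily rising series the zero-padded version understates the latest value, which is the most important one for a current decision, while the shrinking-window version does not.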
Step 8
Explanation: Now that we have seen some techniques for pre-processing data to eliminate confusing noise, we may look more closely at the model itself. You may be wondering about the naming convention we adopted for the model “LR_Smooth_b1”. The “b1” part of the name indicates that we have fixed Beta, the shape factor, to 1. We will proceed to learn why we did this.
Step 9
Explanation: We note, in carrying out the steps on the right, that this Submodel “LR_Smooth” uses the transformed variable leakSmooth and that the “Fix shape factor to 1” checkbox is unchecked.
Instructions: Modeling (on Procedures panel), Weibull PHM, Select Covariates, Cancel
Step 10
Explanation: Upon executing the steps at the right, we note that the model is rejected by the Kolmogorov-Smirnov test. The test is telling us that the hypothesis that the model is “good” (fits the data) must be rejected.
Instructions:
Residual Analysis, Summary Report, scroll down (note that the goodness of fit hypothesis is rejected), reduce window, X
Look at the modeling results in the orange framed "Parameters" window inside the Procedures window. Note the NS (not significant) indication after Shape = 1.35644.
EXAKT has told us in step 10 that working
age is not significant. In fact it is highly Modeling (on menu bar), Select Current Model,
significant, so much so that it correlates LR_Smooth_b1, Modeling (on Procedures panel),
closely with the LeakRate. Thus EXAKT is Weibull PHM, (note that the shape parameter has
really telling us that the LeakRate itself been fixed to 1 for this submodel), Cancel
11
contains all the information we need, to
establish a good predictive model, and it is Residual Analysis, Summary Report, expand and
telling us that we should remove scroll down. (note that the goodness of fit
WorkingAge as a significant factor from hypothesis is not rejected), X
the model by setting Shape to 1.
Similar results can be found for models:
LR_SmoothAve0_b1, and
12 LR_SmoothAve_b1. You may go ahead
examine these models using the tecniques
you have learned in this exercise
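Steps 6 and 7 do not state how leakSmooth0 and leakSmooth are computed, so the following
Python sketch is only a hypothetical illustration of how an end-of-history droop of this
kind can arise and be removed. It assumes a centered moving average; the function names,
the window size, and the leak-rate readings are all invented for the example and are not
taken from EXAKT or from the exercise data.

```python
def centered_mean_zero_padded(values, half_width):
    """Centered moving average that treats missing neighbours as zero.

    Hypothetical stand-in for a smoothing step that droops at the ends:
    the sum is always divided by the full window size, even where part
    of the window falls outside the history.
    """
    n = len(values)
    window = 2 * half_width + 1
    out = []
    for i in range(n):
        total = 0.0
        for j in range(i - half_width, i + half_width + 1):
            if 0 <= j < n:
                total += values[j]
        out.append(total / window)
    return out


def centered_mean_shrinking(values, half_width):
    """Same average, but taken only over the readings that actually exist."""
    n = len(values)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)
        hi = min(n, i + half_width + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


# Made-up leak-rate history that rises toward the end of the history.
leak_rate = [10.0, 12.0, 11.0, 13.0, 14.0, 16.0, 18.0]

drooping = centered_mean_zero_padded(leak_rate, 1)
adjusted = centered_mean_shrinking(leak_rate, 1)

print("last raw reading :", leak_rate[-1])
print("zero-padded mean :", round(drooping[-1], 2))
print("adjusted mean    :", round(adjusted[-1], 2))
```

In this invented history the zero-padded average ends near 11.3 although the last readings
are 16 and 18, while the adjusted average ends at 17.0. A decision based on the first value
would see a condition indicator that is erroneously low, which is the effect described in
step 6.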
243
An artifact is an inaccurate observation that is due to the observation method.
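The effect of fixing the shape parameter to 1, discussed in steps 8 to 11, can be seen from
the usual form of the Weibull proportional hazards model,
h(t, Z) = (Beta/Eta) * (t/Eta)^(Beta - 1) * exp(Gamma * Z). The Python sketch below
evaluates that hazard for a single covariate. The scale, the covariate coefficient, and the
covariate value are arbitrary illustrative numbers, not EXAKT output; only the shape value
1.35644 is taken from step 10.

```python
import math


def weibull_phm_hazard(age, covariate, shape, scale, gamma):
    """Weibull PHM hazard with one covariate.

    age       : working age t (must be > 0)
    covariate : value Z of the condition indicator (e.g. a smoothed leak rate)
    shape     : Beta, the Weibull shape factor
    scale     : Eta, the Weibull scale factor
    gamma     : covariate coefficient
    """
    baseline = (shape / scale) * (age / scale) ** (shape - 1.0)
    return baseline * math.exp(gamma * covariate)


# Illustrative (assumed) parameter values.
SCALE = 1000.0
GAMMA = 0.5
Z = 2.0

for shape in (1.35644, 1.0):
    young = weibull_phm_hazard(100.0, Z, shape, SCALE, GAMMA)
    old = weibull_phm_hazard(5000.0, Z, shape, SCALE, GAMMA)
    print(f"shape={shape}: hazard at age 100 = {young:.6f}, at age 5000 = {old:.6f}")
```

With Beta = 1 the age term (t/Eta)^(Beta - 1) equals 1, so the hazard is the same at every
working age and depends only on the covariate. With Beta = 1.35644 the hazard grows with
age for the same covariate value.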
Appendix 16. References to Chapter 13.
[11] H. Austerlitz, Data acquisition techniques using PCs, Academic Press, San Diego,
Calif., 2003.
[12] N. V. Kirianaki, S. Y. Yurish, N. O. Shpak, V. P. Deynega, Data Acquisition and
Signal Processing for Smart Sensors, John Wiley and Sons, Ltd., Chichester, West
Sussex, England, 2002.
[13] C. Davies, R. M. Greenough, The use of information systems in fault diagnosis, in:
Proceedings of the 16th National Conference on Manufacturing Research,
University of East London, UK, 2000.
[14] R. Xu, C. Kwan, Robust isolation of sensor failures, Asian Journal of Control, 5
(2003) 12-23.
[16] A. J. Miller, A New Wavelet Basis For The Decomposition Of Gear Motion Error
Signals And Its Application To Gearbox Diagnostics, M.Sc. Thesis, Graduate
Program in Acoustics, The Pennsylvania State University, State College, PA,
1999.
[24] W. J. Wang, Z. T. Wu, J. Chen, Fault identification in rotating machinery using the
correlation dimension and bispectra, Nonlinear Dynamics, 25 (2001) 383-393.
[25] Q. Zhuge, Y. Lu, Signature analysis for reciprocating machinery with adaptive
signal-processing, Proceedings of the Institution of Mechanical Engineers Part C-
Journal of Mechanical Engineering Science, 205 (1991) 305-310.
[29] Z. Liu, X. Yin, Z. Zhang, D. Chen, W. Chen, Online rotor mixed fault diagnosis
way based on spectrum analysis of instantaneous power in squirrel cage induction
motors, IEEE Transactions on Energy Conversion, 19 (2004) 485-490.
[36] M. A. Minnicino, H. J. Sommer, Detecting and quantifying friction nonlinearity
using the Hilbert transform, in: Health Monitoring and Smart Nondestructive
Evaluation of Structural and Biological Systems III, 5394, Bellingham, 2004, pp.
419-427.
[38] C.-C. Wang, G.-P. J. Too, Rotating machine fault detection based on HOS and
artificial neural networks, Journal of Intelligent Manufacturing, 13 (2002) 283-293.
[42] N. Arthur, J. Penman, Inverter fed induction machine condition monitoring using
the bispectrum, in: Proceedings of the IEEE Signal Processing Workshop on
Higher-Order Statistics, Banff, Alta., Canada, 1997, pp. 67-71.
[44] W. Li, G. Zhang, T. Shi, S. Yang, Gear crack early diagnosis using bispectrum
diagonal slice, Chinese Journal of Mechanical Engineering (English Edition), 16
(2003) 193-196.
[46] L. Qu, X. Liu, G. Peyronne, Y. Chen, The holospectrum: A new method for rotor
surveillance and diagnosis, Mechanical Systems and Signal Processing, 3 (1989)
255-267.
[47] C. B. Yu, H. B. He, Y. Xu, F. L. Chen, Identification method of acoustic
information flow of bearing state, in: Condition Monitoring '97, 1997, pp. 311-315.
[48] Y. D. Chen, R. Du, Diagnosing spindle defects using 4-D holospectrum, Journal of
Vibration and Control, 4 (1998) 717-732.
[49] L. Qu, D. Shi, Holospectrum during the past decade: review & prospect,
Zhendong Ceshi Yu Zhenduan/Journal of Vibration, Measurement & Diagnosis,
18 (1998) 235-242 (in Chinese).
[50] M. H. Hayes, Statistical Digital Signal Processing and Modeling, John Wiley and
Sons, New York, 1996.
[51] C. K. Mechefske, J. Mathew, Fault detection and diagnosis in low speed rolling
element bearing. Part I: The use of parametric spectra, Mechanical Systems and
Signal Processing, 6 (1992) 297-307.
[57] Q. Meng, L. Qu, Rotating machinery fault diagnosis using Wigner distribution,
Mechanical Systems and Signal Processing, 5 (1991) 155-166.
[58] M.-C. Pan, H. Van Brussel, P. Sas, B. Verbeure, Fault diagnosis of joint backlash,
Journal of Vibration and Acoustics, Transactions of the ASME, 120 (1998) 13-24.
[59] I. S. Koo, W. W. Kim, The development of reactor coolant pump vibration
monitoring and a diagnostic system in the nuclear power plant, ISA Transactions,
39 (2000) 309-316.
[63] S. Gu, J. Ni, J. Yuan, Non-stationary signal analysis and transient machining
process condition monitoring, International Journal of Machine Tools and
Manufacture, 42 (2002) 41-51.
[65] R. K. Young, Wavelets Theory and Its Applications, Kluwer Academic Publishers,
Boston, 1993.
[69] G. Y. Luo, D. Osypiw, M. Irle, On-line vibration analysis with fast continuous
wavelet algorithm for condition monitoring of bearing, Journal of Vibration and
Control, 9 (2003) 931-947.
[70] N. Aretakis, K. Mathioudakis, Wavelet analysis for gas turbine fault diagnostics,
Journal of Engineering for Gas Turbines and Power, 119 (1997) 870-876.
[71] G. O. Chandroth, W. J. Staszewski, Fault detection in internal combustion engines
using wavelet analysis, in: Proceedings of COMADEM '99, Chipping Norton,
1999, pp. 7-15.
[73] N. Baydar, A. Ball, Detection of gear failures via vibration and acoustic signals
using wavelet transform, Mechanical Systems and Signal Processing, 17 (2003)
787-804.
[75] Y.-G. Xu, Y.-L. Yan, Research on Haar spectrum in fault diagnosis of rotating
machinery, Applied Mathematics and Mechanics (English Edition), 12 (1991) 61-
66.
[76] H. K. Tonshoff, X. Li, C. Lapp, Application of fast Haar transform and concurrent
learning to tool-breakage detection in milling, IEEE/ASME Transactions on
Mechatronics, 8 (2003) 414-417.
[77] A. J. Miller, K. M. Reichard, A new wavelet basis for automated fault diagnostics
of gear teeth, in: Inter-Noise 99: Proceedings of the 1999 International Congress
on Noise Control Engineering, Vol. 1-3, Poughkeepsie, 1999, pp. 1597-1602.
[78] D. Boulahbal, F. Golnaraghi, F. Ismail, Amplitude and phase wavelet maps for the
detection of cracks in geared systems, Mechanical Systems and Signal Processing,
13 (1999) 423-436.
[81] G. G. Yen, K.-C. Lin, Wavelet packet feature extraction for vibration monitoring,
IEEE Transactions on Industrial Electronics, 47 (2000) 650-667.
[82] S. Zhang, J. Mathew, L. Ma, Y. Sun, Best basis-based intelligent machine fault
diagnosis, Mechanical Systems and Signal Processing, 19 (2005) 357-370.
[83] H. A. Toliyat, K. Abbaszadeh, M. M. Rahimian, L. E. Olson, Rail defect diagnosis
using wavelet packet decomposition, IEEE Transactions on Industry Applications,
39 (2003) 1454-1461.
[84] H. Yang, J. Mathew, L. Ma, Fault diagnosis of rolling element bearings using basis
pursuit, Mechanical Systems and Signal Processing, 19 (2005) 341-356.
[86] J. C. Russ, The Image Processing Handbook, CRC Press, Boca Raton, 2002.
[90] T. Heger, M. Pandit, Optical wear assessment system for grinding tools, Journal of
Electronic Imaging, 13 (2004) 450-461.
[95] H. T. Grimmelius, J. K. Woud, G. Been, On-line failure diagnosis for compression
refrigeration plants, International Journal of Refrigeration, 18 (1995) 31-41.
[97] B. K. Sinha, Trend prediction from steam turbine responses of vibration and
eccentricity, Proceedings of the Institution of Mechanical Engineers Part A-Journal
of Power and Energy, 216 (2002) 97-104.
[99] P.-J. Vlok, M. Wnek, M. Zygmunt, Utilising statistical residual life estimates of
bearings to quantify the influence of preventive maintenance actions, Mechanical
Systems and Signal Processing, 18 (2004) 833-847.
[105] M. Dong, D. He, Hidden semi-markov models for machinery health diagnosis and
prognosis, in: Papers Presented at NAMRC 32, Vol. 32, Charlotte, NC, United
States, 2004, pp. 199-206.
[106] D. Lin, V. Makis, Recursive filters for a partially observable system subject to
random failure, Advances in Applied Probability, 35 (2003) 207-227.
[107] D. Lin, V. Makis, On-line parameter estimation for a failure-prone system subject
to condition monitoring, Journal of Applied Probability, 41 (2004) 211-220.
[108] W. Wang, A model to predict the residual life of rolling element bearings given
monitored condition information to date, IMA Journal of Management
Mathematics, 13 (2002) 3-16.
[114] J. Ma, C. J. Li, Detection of Localized Defects in Rolling Element Bearings Via
Composite Hypothesis Test, Mechanical Systems and Signal Processing, 9 (1995)
63-75.
[117] M. Nyberg, A general framework for fault diagnosis based on statistical hypothesis
testing, in: Twelfth International Workshop on Principles of Diagnosis (DX 2001),
Via Lattea, Italian Alps, 2001, pp. 135-142.
[119] V. A. Skormin, L. J. Popyack, V. I. Gorodetski, M. L. Araiza, J. D. Michel,
Applications of cluster analysis in diagnostics-related problems, in: Proceedings of
the 1999 IEEE Aerospace Conference, Vol. 3, Snowmass at Aspen, CO, USA,
1999, pp. 161-168.
[120] M. Artes, L. Del Castillo, J. Perez, Failure prevention and diagnosis in machine
elements using cluster, in: Proceedings of the Tenth International Congress on
Sound and Vibration, Stockholm, Sweden, 2003, pp. 1197-1203.
[125] X. Lou, K. A. Loparo, Bearing fault diagnosis based on wavelet transform and
fuzzy inference, Mechanical Systems and Signal Processing, 18 (2004) 1077-1095.
[126] M.-C. Pan, P. Sas, H. Van Brussel, Machine condition monitoring using signal
classification techniques, Journal of Vibration and Control, 9 (2003) 1103-1120.
[127] A. R. Webb, Statistical Pattern Recognition, John Wiley and Sons, West Sussex,
England, 2002.
[128] C. K. Mechefske, J. Mathew, Fault detection and diagnosis in low speed rolling
element bearing. Part II: The use of nearest neighbour classification, Mechanical
Systems and Signal Processing, 6 (1992) 309-316.
[129] Q. Sun, P. Chen, D. Zhang, F. Xi, Pattern recognition for automatic machinery
fault diagnosis, Journal of Vibration and Acoustics, Transactions of the ASME,
126 (2004) 307-316.
[130] M. Guo, L. Xie, S.-Q. Wang, J.-M. Zhang, Research on an integrated ICA-SVM
based framework for fault diagnosis, in: Proceedings of the 2003 IEEE
International Conference on Systems, Man and Cybernetics, Vol. 3, Washington,
DC, USA, 2003, pp. 2710-2715.
[131] J. Ying, T. Kirubarajan, K. R. Pattipati, A. Patterson-Hine, A hidden Markov
model-based algorithm for fault diagnosis with partial and imperfect tests, IEEE
Transactions on Systems, Man and Cybernetics, Part C (Applications and
Reviews), 30 (2000) 463-473.
[132] M. Ge, R. Du, Y. Xu, Hidden Markov model based fault diagnosis for stamping
processes, Mechanical Systems and Signal Processing, 18 (2004) 391-408.
[133] Z. Li, Z. Wu, Y. He, C. Fulei, Hidden Markov model-based fault diagnostics
method in speed-up and speed-down process for rotating machinery, Mechanical
Systems and Signal Processing, 19 (2005) 329-339.
[134] Y. Xu, M. Ge, Hidden Markov model-based process monitoring system, Journal of
Intelligent Manufacturing, 15 (2004) 337-350.
[135] D. Ye, Q. Ding, Z. Wu, New method for faults diagnosis of rotating machinery
based on 2-dimension hidden Markov model, in: Proceedings of the International
Symposium on Precision Mechanical Measurement, Vol. 4, Hefei, China, 2002,
pp. 391-395.
[138] E. C. Larson, D. P. Wipf, B. E. Parker, Gear and bearing diagnostics using neural
network-based amplitude and phase demodulation, in: Proceedings of the 51st
Meeting of the Society for Machinery Failure Prevention Technology, Virginia
Beach, VA, 1997, pp. 511-521.
[140] Y. Fan, C. J. Li, Diagnostic rule extraction from trained feedforward neural
networks, Mechanical Systems and Signal Processing, 16 (2002) 1073-1081.
[142] B. Samanta, K. R. Al-Balushi, Artificial neural network based fault diagnostics of
rolling element bearings using time-domain features, Mechanical Systems and
Signal Processing, 17 (2003) 317-328.
[145] C. J. Li, T.-Y. Huang, Automatic structure and parameter training methods for
modeling of mechanical systems by recurrent neural networks, Applied
Mathematical Modelling, 23 (1999) 933-944.
Proceedings, Systems Readiness Technology Conference, New York, 2002, pp.
818-841.
[157] H. R. DePold, F. D. Gass, The application of expert systems and neural networks
to gas turbine prognostics and diagnostics, Journal of Engineering for Gas
Turbines and Power, 121 (1999) 607-612.
[158] B.-S. Yang, T. Han, Y.-S. Kim, Integration of ART-Kohonen neural network and
case-based reasoning for intelligent fault diagnosis, Expert Systems with
Applications, 26 (2004) 387-395.
[161] R. Du, K. Yeung, Fuzzy transition probability: A new method for monitoring
progressive faults. Part 1: The theory, Engineering Applications of Artificial
Intelligence, 17 (2004) 457-467.
[162] S. Zhang, T. Asakura, X. L. Xu, B. J. Xu, Fault diagnosis system for rotary
machine based on fuzzy neural networks, JSME International Journal. Series C:
Mechanical Systems, Machine Elements and Manufacturing, 46 (2003) 1035-1041.
[164] S. H. Chang, K. S. Kang, S. S. Choi, H. G. Kim, H. K. Jeong, C. U. Yi,
Development of the on-line operator aid system OASYS using a rule-based expert
system and fuzzy logic for nuclear power plants, Nuclear Technology, 112 (1995)
266-294.
[169] Y.-C. Huang, C.-M. Huang, Evolving wavelet networks for power transformer
condition monitoring, IEEE Transactions on Power Delivery, 17 (2002) 412-416.
[170] G.-T. Yan, G.-F. Ma, Fault diagnosis of diesel engine combustion system based on
neural networks, in: Proceedings of 2004 International Conference on Machine
Learning and Cybernetics, Vol. 5, Shanghai, China, 2004, pp. 3111-3114.
[173] I. Howard, S. Jia, J. Wang, The dynamic modelling of a spur gear in mesh
including friction and a crack, Mechanical Systems and Signal Processing, 15
(2001) 831-838.
[176] K. A. Loparo, M. L. Adams, W. Lin, M. F. Abdel-Magied, N. Afshari, Fault
detection and diagnosis of rotating machinery, IEEE Transactions on Industrial
Electronics, 47 (2000) 1005-1014.
[185] R. David, H. Alla, Petri nets for modeling of dynamic systems - a survey,
Automatica, 30 (1994) 175-202.
[186] N. C. Propes, A fuzzy Petri net based mode identification algorithm for fault
diagnosis of complex systems, in: System Diagnosis and Prognosis: Security and
Condition Monitoring Issues III, 5107, Bellingham, 2003, pp. 44-53.
[188] B.-S. Yang, S. K. Jeong, Y.-M. Oh, A. C. C. Tan, Case-based reasoning system
with Petri nets for induction motor fault diagnosis, Expert Systems with
Applications, 27 (2004) 301-311.
[194] C. Kwan, X. Zhang, R. Xu, L. Haynes, A novel approach to fault diagnostics and
prognostics, in: Proceedings of the 2003 IEEE International Conference on
Robotics and Automation, Vol. 1-3, New York, 2003, pp. 604-609.
[195] D. Lin, V. Makis, Filters and parameter estimation for a partially observable
system subject to random failure with continuous-range observations, Advances in
Applied Probability, 36 (2004) 1212-1230.
[196] S. Zhang, R. Ganesan, Multivariable trend analysis using neural networks for
intelligent diagnostics of rotating machinery, Transactions of the ASME. Journal
of Engineering for Gas Turbines and Power, 119 (1997) 378-384.
[199] Y.-L. Dong, Y.-J. Gu, K. Yang, W.-K. Zhang, A combining condition prediction
model and its application in power plant, in: Proceedings of 2004 International
Conference on Machine Learning and Cybernetics, Vol. 6, Shanghai, China, 2004,
pp. 3474-3478.
[202] A. Ray, S. Tangirala, Stochastic modeling of fatigue crack dynamics for on-line
failure prognostics, IEEE Transactions on Control Systems Technology, 4 (1996)
443-451.
[209] J. Qiu, C. Zhang, B. B. Seth, S. Y. Liang, Damage mechanics approach for bearing
lifetime prognostics, Mechanical Systems and Signal Processing, 16 (2002) 817-
829.
[211] S. J. Engel, B. J. Gilmartin, K. Bongort, A. Hess, Prognostics, the real issues
involved with predicting life remaining, in: 2000 IEEE Aerospace Conference
Proceedings, Vol. 6, New York, 2000, pp. 457-469.
[215] W. Wang, A model to determine the optimal critical level and the monitoring
intervals in condition-based maintenance, International Journal of Production
Research, 38 (2000) 1425-1436.
[222] M. Marseguerra, E. Zio, L. Podofillini, Condition-based maintenance optimization
by means of genetic algorithms and Monte Carlo simulation, Reliability
Engineering and System Safety, 77 (2002) 151-165.
[235] W. B. Wang, A stochastic control model for on line condition based maintenance
decision support, in: 6th World Multiconference on Systemics, Cybernetics and
Informatics, Vol. VI, Proceedings - Industrial Systems and Engineering I, Orlando,
2002, pp. 370-374.
[236] S. Okumura, N. Okino, Optimisation of inspection time vector and warning level
in CBM considering residual life loss and constraint on preventive replacement
probability, International Journal of COMADEM, 6 (2003) 10-18.
[242] D. L. Hall, J. Llinas, Handbook of Multisensor Data Fusion, CRC Press, Boca
Raton, FL, 2001.
[244] Q. Liu, H.-P. Wang, A case study on multisensor data fusion for imbalance
diagnosis of rotating machinery, (AI EDAM) Artificial Intelligence for
Engineering Design, Analysis and Manufacturing, 15 (2001) 203-210.
[245] H. F. Wang, J. P. Wang, Fault diagnosis theory: method and application based on
multisensor data fusion, Journal of Testing and Evaluation, 28 (2000) 513-518.
[247] C. S. Byington, T. A. Merdes, J. D. Kozlowski, Fusion techniques for vibration
and oil debris/quality in gearbox failure testing, in: Proceedings of Condition
Monitoring '99, Chipping Norton, 1999, pp. 113-128.
[255] S. S. Haykin, Unsupervised Adaptive Filtering, John Wiley and Sons, New York,
2000.
[257] L. Li, L. Qu, Machine diagnosis with independent component analysis and
envelope analysis, in: International Conference on Industrial Technology:
`Productivity Reincarnation through Robotics and Automation', Vol. 2, Bangkok,
Thailand, 2002, pp. 1360-1364.
Joint Conference on Neural Networks, Vol. 2, Portland, OR, USA, 2003, pp. 869-
872.
[263] G. Gelle, M. Colas, C. Serviere, BSS for fault detection and machine monitoring -
time or frequency domain approach?, in: Proceedings of International Workshop
on Independent Component Analysis and Blind Signal Separation (ICA 2000),
Helsinki, Finland, 2000, pp. 555-560.
[264] G. Gelle, M. Colas, C. Serviere, Blind source separation: a new pre-processing tool
for rotating machines monitoring?, IEEE Transactions on Instrumentation and
Measurement, 52 (2003) 790-795.
[267] C. Serviere, P. Fabry, Blind source separation of noisy harmonic signals for
rotating machine diagnosis, Journal of Sound and Vibration, 272 (2004) 317-339.
[268] F. M. Discenzo, P. J. Unsworth, K. A. Loparo, H. Marcy, Self-diagnosing
intelligent motors: a key enabler for next generation manufacturing systems, IEE
Colloquium (Digest), (1999) 15-18 (3/1-3/4).