
Water Research 43 (2009) 1641–1653
Available at www.sciencedirect.com
journal homepage: www.elsevier.com/locate/watres
Fault tree analysis for integrated and probabilistic risk analysis of drinking water systems
Andreas Lindhe a,*, Lars Rosén a, Tommy Norberg b, Olof Bergstedt a,c

a Department of Civil and Environmental Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
b Department of Mathematical Sciences, Göteborg University and Chalmers University of Technology, SE-412 96 Göteborg, Sweden
c Göteborg Vatten, Box 123, SE-424 23 Angered, Sweden
Article info

Article history:
Received 23 July 2008
Received in revised form 21 December 2008
Accepted 22 December 2008
Published online 3 January 2009

Keywords: Drinking water system; Risk analysis; Fault tree; Integrated; Probabilistic; Customer Minutes Lost; Uncertainties

Abstract

Drinking water systems are vulnerable and subject to a wide range of risks. To avoid sub-optimisation of risk-reduction options, risk analyses need to include the entire drinking water system, from source to tap. Such an integrated approach demands tools that are able to model interactions between different events. Fault tree analysis is a risk estimation tool with the ability to model interactions between events. Using fault tree analysis on an integrated level, a probabilistic risk analysis of a large drinking water system in Sweden was carried out. The primary aims of the study were: (1) to develop a method for integrated and probabilistic risk analysis of entire drinking water systems; and (2) to evaluate the applicability of Customer Minutes Lost (CML) as a measure of risk. The analysis included situations where no water is delivered to the consumer (quantity failure) and situations where water is delivered but does not comply with water quality standards (quality failure). Hard data as well as expert judgements were used to estimate probabilities of events and uncertainties in the estimates. The calculations were performed using Monte Carlo simulations. CML is shown to be a useful measure of risks associated with drinking water systems. The method presented provides information on risk levels, probabilities of failure, failure rates and downtimes of the system. This information is available for the entire system as well as its different sub-systems. Furthermore, the method enables comparison of the results with performance targets and acceptable levels of risk. The method thus facilitates integrated risk analysis and consequently helps decision-makers to minimise sub-optimisation of risk-reduction options.

© 2008 Elsevier Ltd. All rights reserved.
1. Introduction

Efficient risk management is of primary importance to water utilities. Access to a reliable supply of drinking water and safe water quality are basic requirements for human health and economic development (IWA, 2004). In the third edition of the Guidelines for Drinking-water Quality, published by the World Health Organization (WHO), it is pointed out that a comprehensive risk management approach is the most effective way to ensure the safety of drinking water supply (WHO, 2004). To achieve an acceptable level of risk, it is crucial to analyse the risk and, based on tolerability criteria, evaluate the risk and alternative options for risk reduction.
* Corresponding author. Tel.: +46 31 772 2060; fax: +46 31 772 2107.
E-mail addresses: andreas.lindhe@chalmers.se (A. Lindhe), lars.rosen@chalmers.se (L. Rosén), tommy@chalmers.se (T. Norberg), olof.bergstedt@vatten.goteborg.se (O. Bergstedt).
0043-1354/$ – see front matter © 2008 Elsevier Ltd. All rights reserved.
doi:10.1016/j.watres.2008.12.034
As part of risk management, WHO recommends preparation of Water Safety Plans (WSPs), including system assessment, operational monitoring and management plans (Davison et al., 2005; WHO, 2004). To prioritise hazards, WHO suggests that they be ranked using a risk matrix with discretised probability and consequence scales. This qualitative (or semi-quantitative) method is common in many disciplines and the main advantages are that it is simple to use and the result is easy to communicate. However, the method is not suitable for modelling complex systems with interactions between components and events. Burgman (2005) emphasises that risk-ranking methods assume a discrete nature of hazards, do not provide quantitative estimates and lack a procedure for uncertainty analysis, see also Cox (2008). To further support the WSP approach and risk management of drinking water systems in general, quantitative tools for risk analysis are also needed. A quantification of the risk facilitates, for example, comparison with other risks and acceptable levels of risk in absolute terms as well as quantitative estimations of the efficiency of risk-reduction options.

An important aspect when conducting risk analyses of drinking water systems is to consider the entire system, from source to tap (e.g. WHO, 2004). This means that the water source as well as the treatment system and the distribution network all the way to the consumers' taps should be taken into consideration. The main reasons for adopting an integrated approach are: (1) the existence of interactions between events, i.e. chains of events, needs to be considered; and (2) failure in one part of the system may be compensated for by other parts, i.e. the system has an inherent redundancy. If these circumstances are not considered, important information can be overlooked. In an integrated analysis it should be possible to compare the contribution made by different sub-systems to the risk in order to avoid sub-optimisation of risk-reduction options. It may not be worthwhile, for example, to increase the safety at an already efficient and safe treatment plant if no resources are spent on maintenance of the distribution system. Since resources for risk reduction are limited, it is necessary to prioritise and choose the most suitable option. The importance of an integrated approach is advocated by many, e.g. WHO (2004), IWA (2004), CDW/CCME (2004) and NHMRC/NRMMC (2004).

Fault tree analysis is a risk estimation tool with the ability to model interactions between events. A fault tree models the occurrence of an event based on the occurrence or non-occurrence of other events (Bedford and Cooke, 2001). This paper presents a method for integrated risk analysis of drinking water systems based on a probabilistic fault tree analysis. The fault tree method has been devised to estimate not only the probability of failure but also the mean failure rate and mean downtime of the system. Furthermore, the consequences of failures are included in the method and risk levels are quantified using a measure called Customer Minutes Lost (CML). The method considers the entire supply system, from source to tap, and takes water quantity as well as water quality aspects into consideration. The primary aims of the study were: (1) to develop a method for integrated and probabilistic risk analysis of entire drinking water systems; and (2) to evaluate the applicability of CML as a measure of risk.

2. Conceptual model

A drinking water system is commonly described as a supply chain composed of three main sub-systems: raw water, treatment and distribution. Together, these sub-systems cover the entire supply chain, from the water source to the consumers' taps. Along the supply chain there are hazards that may harm the system in different ways. The hazardous events may be different but their consequences are usually categorised as quantity- or quality-related. This means that either the ability to deliver water to the consumers or the water quality itself is affected. According to Gray (2005) the objective of water treatment is to produce an adequate and continuous supply of water that is chemically, bacteriologically and aesthetically acceptable. In addition to being bacteriologically safe, the water should also be microbiologically safe. It is not only pathogenic bacteria that may cause harm to public health; there are also viruses, protozoa and other biological contaminants. The objectives of water treatment can be divided into quantity and quality objectives and consequently they are in line with the two categories of consequences.

The overall failure event included in the method is supply failure, defined as including: (1) quantity failure, i.e. no water is delivered to the consumer; and (2) quality failure, i.e. water is delivered but does not comply with water quality standards. Fig. 1 illustrates the two categories of failure as well as the main types of event that may cause these failures. Quantity failure may be caused either by component failure, e.g. pipe damage, or by unacceptable water quality (raw water or drinking water) causing the water utility to stop the delivery. Quality failure may occur because unacceptable water quality is not detected, so that no action is possible, or because unacceptable quality is detected but no action is taken or it is not possible to stop delivery. The latter case may arise, for example, when the water utility decides to use raw water of unacceptable quality in order to avoid a water shortage.

Supply failure occurs because of events in one or more of the three main sub-systems (raw water, treatment and distribution). However, if failure occurs in one sub-system another may compensate and thereby prevent supply failure. Hence, to model a drinking water system its inherent ability to compensate for failure must be considered. For this reason the occurrence of failures could be as described in Fig. 2. The figure includes the entire system, from source to tap, showing that failure may occur in any part of it. It also illustrates that the different sub-systems can compensate for failure. To determine the contribution made by each sub-system to the risk, failure in one part of the system is based on the assumption that the previous parts operate correctly (i.e. no failures in previous parts). It is assumed, for example, that no raw water failures have occurred when failures in the treatment are identified. Table 1 presents criteria for quantity as well as quality failures in the three sub-systems.

3. Fault tree analysis

A fault tree analysis is a structured process that identifies potential causes of system failure. A fault tree illustrates the
[Figure 1. Supply failure is divided into two categories, each with its main causes.
Quantity failure (Q = 0): no water is delivered to the consumer. Causes: failure of components in the system (e.g. pumps or pipes); events related to unacceptable water quality causing the water utility to stop the delivery.
Quality failure (Q > 0, C'): water is delivered but does not comply with water quality standards. Causes: unacceptable water quality is detected but no action is taken or it is not possible to stop the delivery; unacceptable water quality is not detected and no action is thus possible.
Q = flow (Q = 0, no water is delivered to the consumer; Q > 0, water is delivered). C' = the drinking water does not comply with water quality standards.]
Fig. 1 – Categories of supply failure and their main causes.
interactions between different events using logic gates, and shows how the events may lead to system failure, i.e. the top event. The top event is a critical situation that causes system failure and the occurrence of the top event is described in terms of occurrence or non-occurrence of other events (Bedford and Cooke, 2001). Starting with the top event, the tree is developed until the required level of detail is reached. Events whose causes have been further developed are intermediate events, and events that terminate branches are basic events. While the top event can be seen as a system failure, the basic events are component failures. For a further description of fault tree analysis and its application in risk analysis see e.g. Rausand and Høyland (2004) and Bedford and Cooke (2001).

In order to structure the fault tree of a drinking water system, four types of logic gates were identified. A Markovian approach was used with mean failure rate $\lambda$ and mean downtime $1/\mu$ (see e.g. Rausand and Høyland, 2004). The mean time to failure is $1/\lambda$, hence the probability of failure can be written as $P_F = \lambda/(\lambda + \mu)$. By replacing each basic event in the logic gates with a Markov process, equations for calculating the mean failure rate and mean downtime for the output events were developed. One of the main reasons for using the failure rate and downtime, and not just the probability of failure, is to facilitate elicitation of expert judgements. Since both the failure rate and downtime need to be considered when estimating the probability, these are estimated
[Figure 2. Schematic of the supply chain with the three sub-systems raw water, treatment (water treatment plant) and distribution (service reservoirs and water tower), annotated as follows.
Raw water: events in the raw water system cause insufficient/unacceptable raw water quantity/quality and neither the treatment nor the distribution is able to compensate.
Treatment: given no failure in the raw water, events in the treatment cause insufficient/unacceptable drinking water quantity/quality and the distribution is unable to compensate.
Distribution: given no failure in either the raw water or the treatment, events in the distribution system cause insufficient/unacceptable drinking water quantity/quality.]
Fig. 2 – Illustration of how failures (quantity or quality) in different parts of the system may cause supply failure.
Table 1 – Criteria for quantity and quality failures in raw water, treatment and distribution.

Sub-system | Category of failure | Failure criteria
Raw water | Quantity failure | Not enough raw water is transferred to the treatment plant(s), making it impossible to produce enough drinking water (the supply is less than the demand). Treatment and distribution are unable to compensate.
Raw water | Quality failure | The raw water quality does not comply with health-based water quality standards. Treatment and distribution are unable to compensate.
Treatment | Quantity failure | No raw water quantity failure. The water transferred from the treatment plant(s) is less than the demand. Distribution is unable to compensate.
Treatment | Quality failure | No raw water quality failure. The drinking water produced does not comply with health-based water quality standards. Distribution is unable to compensate.
Distribution | Quantity failure | No raw water or treatment quantity failure. Water cannot be delivered to the consumer.
Distribution | Quality failure | No raw water or treatment quality failure. The water quality does not comply with health-based water quality standards.
separately to maintain transparency. Norberg et al. (2008) present a comprehensive description of the theoretical foundations of the logic gates presented here.

3.1. OR-gate

The output of an OR-gate (Fig. 3) occurs if at least one of the input events occurs. The OR-gate corresponds to a series system with n independent events where the probability of failure can be calculated using Equation (1). Supply failure, for example, may occur if there is failure in the raw water, treatment or distribution systems. Only one of the sub-systems needs to fail to cause supply failure. Using the mean failure rates and mean downtimes of the input events, Equations (2)–(4) are used to calculate the output event of an OR-gate.

$$P_F = 1 - \prod_{i=1}^{n} (1 - P_i) \tag{1}$$

$$\lambda = \sum_{i=1}^{n} \lambda_i \tag{2}$$

$$\mu = \sum_{i=1}^{n} \lambda_i \cdot \frac{\prod_{i=1}^{n} \mu_i}{\prod_{i=1}^{n} (\lambda_i + \mu_i) - \prod_{i=1}^{n} \mu_i} \tag{3}$$

$$P_F = \frac{\lambda}{\lambda + \mu} = 1 - \prod_{i=1}^{n} \frac{\mu_i}{\lambda_i + \mu_i} \tag{4}$$

Fig. 3 – Fault tree with an OR-gate.

3.2. AND-gate

An AND-gate (Fig. 4) is used to model events that must occur simultaneously in order for the output event to occur. The AND-gate corresponds to a parallel system where the probability of failure is calculated as the product of the n independent events' probabilities, see Equation (5). For example, if a water utility can use raw water from two different water sources, both must be unavailable to cause raw water shortage (provided that one source is sufficient to meet the water demand). The output event of an AND-gate is calculated using Equations (6)–(8).

$$P_F = \prod_{i=1}^{n} P_i \tag{5}$$

$$\mu = \sum_{i=1}^{n} \mu_i \tag{6}$$

$$\lambda = \sum_{i=1}^{n} \mu_i \cdot \frac{\prod_{i=1}^{n} \lambda_i}{\prod_{i=1}^{n} (\lambda_i + \mu_i) - \prod_{i=1}^{n} \lambda_i} \tag{7}$$

$$P_F = \frac{\lambda}{\lambda + \mu} = \prod_{i=1}^{n} \frac{\lambda_i}{\lambda_i + \mu_i} \tag{8}$$

Fig. 4 – Fault tree with an AND-gate.

3.3. Variants of the AND-gate

To adequately model a system's inherent ability to compensate for failure, the AND-gate needs to be extended. It
must include what in reliability applications is called cold standby and imperfect switching. If, for example, a pump station supplying a high altitude area with drinking water breaks down, water stored in the water tower can supply the consumers for a limited time. If the water tower is not in use due, for example, to failure or maintenance work, the water tower cannot compensate at all for failure (failure on demand). When the water tower operates normally and does not fail on demand, it is able to compensate until a failure occurs or it is emptied (failure during operation).

The first variant of the AND-gate (Fig. 5) is designed primarily for situations when the ability to compensate is limited in time. The output event of the first variant of the AND-gate is calculated using Equations (9)–(11), where q is the probability of failure on demand (that is, the switcher does not function or the standby is unavailable), and the mean failure rate ($\lambda$) is used to model failure during operation.

$$\mu = \mu_1 \tag{9}$$

$$P_F = \frac{\lambda_1}{\lambda_1 + \mu_1} \prod_{i=2}^{n} \frac{\lambda_i + q_i \mu_1}{\lambda_i + \mu_1} \tag{10}$$

$$\lambda = \mu \frac{P_F}{1 - P_F} \tag{11}$$

Fig. 5 – The first variant of the AND-gate, including the ability of the system to compensate for failure.

A second variant of the AND-gate is needed to model situations where the ability to compensate may recover after it has failed. This may arise, for example, when raw water of unacceptable quality is used but may be compensated for by treatment. If the unacceptable water quality cannot be compensated for at all, failure on demand arises. However, if the unacceptable quality can be compensated for, failure does not arise until the treatment efficiency is affected due to failure in the treatment. When the failure has been taken care of, the treatment recovers and is able to compensate again until a new failure occurs. The output event of this second variant of the AND-gate (Fig. 6) is calculated using Equations (12)–(14). The equations apply only when one component compensates for failure. If multiple components could compensate, a regular AND-gate can be used to combine the events. The output of the regular AND-gate is used as the compensating input event in the second variant of the AND-gate.

$$P_F = \frac{\lambda_1 \left[ \lambda_2 + q_2 (\mu_1 + \mu_2) \right]}{(\lambda_1 + \mu_1)(\lambda_2 + \mu_1 + \mu_2)} \tag{12}$$

$$\lambda = \frac{\mu_1 \lambda_1 q_2 (\lambda_2 + \mu_1 + \mu_2) + \lambda_1 \lambda_2 (1 - q_2)(\mu_1 + \mu_2)}{(\lambda_1 + \mu_1)(\lambda_2 + \mu_1 + \mu_2)(1 - P_F)} \tag{13}$$

$$\mu = \frac{\mu_1 \lambda_1 q_2 (\lambda_2 + \mu_1 + \mu_2) + \lambda_1 \lambda_2 (1 - q_2)(\mu_1 + \mu_2)}{(\lambda_1 + \mu_1)(\lambda_2 + \mu_1 + \mu_2) P_F} \tag{14}$$

Fig. 6 – The second variant of the AND-gate, including a component's ability to recover after failure.

4. Generic fault tree structure of a drinking water system

The conceptual model and the logic gates form the basis for a generic fault tree structure of a drinking water system. Fig. 7 illustrates possible failure paths. When the system operates normally, failure may occur in any of the three sub-systems and given failure in one sub-system another can either compensate or fail to compensate. It is also possible for more than one sub-system to fail at the same time. To simplify the figure, no compensation or failure is treated as the same event when the previous sub-system has failed. These two events could have been illustrated separately but since both cause failure they are merged here.

Based on the alternatives in Fig. 7, a generic fault tree structure applicable to drinking water systems is suggested (Fig. 8). The system is broken down into its three main sub-systems, and the top event, supply failure, may occur due to failure in any one of them. In each sub-system, quantity or quality failures may occur. The first variant of the AND-gate is used to illustrate that failure (quantity or quality) in one sub-system may be compensated for by other sub-systems. A drinking water system can thus not be considered a traditional series system, where failure in one sub-system automatically causes system failure. The transfer gates in Fig. 8 indicate that the fault tree is further developed elsewhere (Fig. 9). Although the same transfer gates can be found in all three sub-systems, they do not refer to exactly the same events. For example, component failure in the treatment is not exactly the same event as component failure in the distribution. The generic fault tree structure (Figs. 8 and 9) includes the major events and the relationships between them. However, in a practical application the events need to be developed further for the structure to correspond properly to system properties and enable estimations to be made of the required variables for the basic events.

Fig. 9 shows that quantity failure occurs due to component failure or unacceptable water quality. For the unacceptable quality to cause quantity failure, three events need
[Figure 7. Event tree starting from normal operation:
Operation
  No raw water failure
    No treatment failure
      No distribution failure: Success
      Distribution failure: Failure
    Treatment failure
      Distribution compensation: Success
      No distribution compensation or failure: Failure
  Raw water failure
    Treatment compensation
      No distribution failure: Success
      Distribution failure: Failure
    No treatment compensation or failure
      Distribution compensation: Success
      No distribution compensation or failure: Failure]
Fig. 7 – Possible paths leading to failure (quantity or quality). Each branch illustrates a situation where failure occurs or does not occur and compensation is possible or not possible.
to occur simultaneously: the water quality needs to be unacceptable, the unacceptable quality needs to be detected and the water utility needs to decide to stop the delivery. If the water utility decides not to stop the delivery, a quality failure occurs instead (Fig. 9). Quality failure may also occur when the water quality is unacceptable although the quality deviation is not detected and hence no action is possible.
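
The OR-gate and AND-gate relations of Sections 3.1 and 3.2 can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names and the input rates are hypothetical, and each input event is assumed to be given as a pair (failure rate, repair rate) in the same time unit.

```python
from math import prod


def failure_probability(lam, mu):
    """P_F = lam / (lam + mu) for a two-state Markov event."""
    return lam / (lam + mu)


def or_gate(events):
    """Combine (lam, mu) pairs with Equations (2) and (3)."""
    lam = sum(l for l, _ in events)
    prod_mu = prod(m for _, m in events)
    prod_sum = prod(l + m for l, m in events)
    mu = lam * prod_mu / (prod_sum - prod_mu)
    return lam, mu


def and_gate(events):
    """Combine (lam, mu) pairs with Equations (6) and (7)."""
    mu = sum(m for _, m in events)
    prod_lam = prod(l for l, _ in events)
    prod_sum = prod(l + m for l, m in events)
    lam = mu * prod_lam / (prod_sum - prod_lam)
    return lam, mu


# Hypothetical input events: (failures per year, repairs per year).
events = [(0.1, 50.0), (0.5, 200.0), (2.0, 1000.0)]
or_lam, or_mu = or_gate(events)
and_lam, and_mu = and_gate(events)

# The gate outputs should reproduce Equations (1) and (5).
p_inputs = [failure_probability(l, m) for l, m in events]
print(failure_probability(or_lam, or_mu), 1 - prod(1 - p for p in p_inputs))
print(failure_probability(and_lam, and_mu), prod(p_inputs))
```

Returning the pair (failure rate, repair rate) rather than only the probability lets the output of one gate serve directly as an input to the next gate higher up in the tree.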
[Figure 8. Generic fault tree. The top event, supply failure, is connected through an OR-gate to raw water failure, treatment failure and distribution failure. Each of these is divided into a quantity failure (Q = 0) and a quality failure (Q > 0, C'). Raw water quantity and quality failures combine the respective failure (transfer in gate 1 or 2) with "Treatment fails to compensate" and "Distribution fails to compensate" through the first variant of the AND-gate. Treatment quantity and quality failures combine the respective failure with "Distribution fails to compensate" in the same way. Distribution quantity and quality failures refer directly to transfer in gates 1 and 2.
Legend: OR-gate; first variant of AND-gate; transfer in (the fault tree is developed further elsewhere). Q = flow (Q = 0, no water is delivered to the consumer; Q > 0, water is delivered). C' = the drinking water does not comply with water quality standards.]
Fig. 8 – Generic fault tree illustrating the two categories of failure in the three sub-systems. The transfer in gates (1 and 2) refer to the transfer out gates in Fig. 9.
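
The first variant of the AND-gate used in Fig. 8 (Equations (9)–(11)) can be sketched as follows. This is an illustration under stated assumptions: the function name, the argument layout and the pump station and water tower figures are hypothetical and are not taken from the study.

```python
from math import prod


def and_gate_variant1(main, standbys):
    """First variant of the AND-gate.

    main     -- (lam_1, mu_1) of the failing event
    standbys -- [(lam_i, q_i), ...] for the compensating events i = 2..n,
                where q_i is the probability of failure on demand
    """
    lam1, mu1 = main
    p_f = lam1 / (lam1 + mu1) * prod(
        (lam_i + q_i * mu1) / (lam_i + mu1) for lam_i, q_i in standbys
    )                                  # Equation (10)
    mu = mu1                           # Equation (9)
    lam = mu * p_f / (1.0 - p_f)       # Equation (11)
    return lam, mu, p_f


# Hypothetical example: a pump station failing 0.5 times per year with a
# mean downtime of one day (mu_1 = 365 per year), backed up by a water tower
# that is unavailable on demand with probability 0.05 and is emptied after
# about 12 h on average (lam_2 = 730 per year).
pump_station = (0.5, 365.0)
water_tower = [(730.0, 0.05)]
lam, mu, p_f = and_gate_variant1(pump_station, water_tower)
print(lam, mu, p_f)
```

With q = 1 the standby never helps and the expression reduces to the probability of failure of the main event alone, which gives a quick sanity check on any implementation.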
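[Figure 9. Two sub-trees with transfer out gates.
Quantity failure (1): OR-gate over "Component failure" and "Unacceptable quality causing delivery stop". The latter is an AND-gate over "Unacceptable quality", "Detection" and "Decision to stop delivery".
Quality failure (2): OR-gate over "Detected quality failure" (unacceptable quality is detected but there is no delivery stop) and "Non-detected quality failure" (unacceptable quality and no detection).
Legend: transfer out; OR-gate; AND-gate.]
Fig. 9 – Schematic illustration of how quantity (1) and quality (2) failures may occur. The transfer out gates refer to corresponding transfer in gates in Fig. 8.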
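
For the second variant of the AND-gate in Section 3.3 (Equations (12)–(14)), a short numerical sketch helps to confirm that the three equations are mutually consistent. The function name and the input values below are hypothetical.

```python
def and_gate_variant2(lam1, mu1, lam2, mu2, q2):
    """Second variant of the AND-gate with one compensating event.

    lam1, mu1 -- failure and repair rate of the failing event
    lam2, mu2 -- failure and recovery rate of the compensating event
    q2        -- probability that compensation fails on demand
    Requires p_f > 0, i.e. not both lam2 == 0 and q2 == 0.
    """
    denom = (lam1 + mu1) * (lam2 + mu1 + mu2)
    p_f = lam1 * (lam2 + q2 * (mu1 + mu2)) / denom            # Equation (12)
    numerator = (mu1 * lam1 * q2 * (lam2 + mu1 + mu2)
                 + lam1 * lam2 * (1.0 - q2) * (mu1 + mu2))
    lam = numerator / (denom * (1.0 - p_f))                   # Equation (13)
    mu = numerator / (denom * p_f)                            # Equation (14)
    return lam, mu, p_f


# Hypothetical rates per year.
lam, mu, p_f = and_gate_variant2(lam1=1.0, mu1=50.0, lam2=4.0, mu2=100.0, q2=0.1)
print(lam, mu, p_f)

# Without recovery (mu2 = 0) the result should agree with the first variant:
# mu = mu1 and P_F as in Equation (10) with n = 2.
lam0, mu0, p0 = and_gate_variant2(1.0, 50.0, 4.0, 0.0, 0.1)
print(mu0, p0)
```

Setting the recovery rate to zero is a convenient check, since the second variant should then collapse to the first one.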
5. Estimation of risk

5.1. Risk

According to Kaplan and Garrick (1981), see also Kaplan (1997), the question "What is the risk?" actually comprises three questions: "What can happen?"; "How likely is it?"; and "What are the consequences?". Hence, the answers to these three questions together describe the risk. According to this widely accepted description, risk is expressed as a combination of the frequency, or probability, of occurrence and the consequences of a hazardous event, see e.g. IEC (1995), ISO/IEC (2002) and the European Commission (2000).

5.2. Risk as Customer Minutes Lost

For the purpose of this study, the consequences of failures (quantity and quality) are defined by the duration of failure and number of people affected. Since two attributes are used to describe the consequences, the evaluation of the results can be described as a multi-attribute problem. By multiplying the two attributes, the consequences are expressed in terms of Customer Minutes Lost (CML). In order to maintain transparency the estimated number of CML, as well as other results of the analysis, are presented separately for quantity and quality failure. The risk, expressed as the expected CML, is calculated as

$$R = \lambda \frac{1}{\mu} C \tag{15}$$

where $\lambda$ is the mean failure rate, $1/\mu$ the mean downtime and $C$ the expected proportion of consumers affected by failure. Since $C$ is expressed as a proportion, the risk is estimated for the average consumer. When estimating the risk in terms of CML, the utility of the two attributes used to define the consequence (affected proportion of consumers and mean downtime) is assumed to be independent. Note that Equation (15) is an approximation, valid when $1/\mu \ll 1/\lambda$. To consider that a system cannot fail when it is in its failure mode, the true failure rate should be calculated as

$$u = \frac{1}{\frac{1}{\lambda} + \frac{1}{\mu}} = \frac{\lambda \mu}{\lambda + \mu} \tag{16}$$

When based on the true failure rate ($u$), the risk is expressed as

$$R = \frac{\lambda \mu}{\lambda + \mu} \cdot \frac{1}{\mu} \cdot C = \frac{\lambda}{\lambda + \mu} C \tag{17}$$

The probability of failure is defined as $P_F = \lambda/(\lambda + \mu)$, and Equation (17) is therefore equivalent to

$$R = P_F C \tag{18}$$

The risk is consequently calculated as the probability of failure multiplied by the proportion of consumers affected. However, it is not meaningful to define the affected proportion of consumers only at the top event (supply failure). Instead, a lower and suitable level for defining consequences must be identified. This level should be as close as possible to the top event and only have events that are combined by means of OR-gates above it. It is also important that an intermediate
event, of which the consequences are defined, does not include events with totally different consequences. If these criteria are fulfilled the total risk may be calculated as a sum of the risks caused by different events. In the generic fault tree (Fig. 8), it is suitable to define consequences for each type of failure (quantity and quality) in the three sub-systems. The fault tree in Fig. 8 is generalised and in a real-world application the quantity and quality failures for each sub-system are preferably divided into different main types of event. These main types of event would constitute a suitable level for defining consequences. Since the consequences are defined for several (n) events, the total risk is

$$R = \sum_{i=1}^{n} P_{F_i} C_i \tag{19}$$

If the affected proportions of consumers are defined at a level in the fault tree where it is plausible that some of the events may occur simultaneously, the risks posed by the events may not be additive and Eq. (19) may not be valid. It should be noted that two events with different probabilities, durations and number of people affected may cause the same level of risk (expressed as CML). CML has previously been used as a performance indicator in the drinking water sector in the Netherlands (Blokker et al., 2005).

6. Uncertainties and input data

The method presented is probabilistic and therefore all input variables are replaced by probability distributions. A Bayesian approach is applied and the risk is calculated by means of Monte Carlo simulations, taking uncertainties of estimations into consideration (Fig. 10). An important aspect of Monte Carlo simulations is the required number of iterations, further discussed in the result section.

Variables $\lambda$ and $\mu$ are modelled as exponential rates using Gamma distributions. The use of constant failure rates is justified by: (1) any system that is maintained in such a way that it is restored to being close to an "as good as new" condition at irregular times has approximately constant failure rate, provided maintenance occurs sufficiently often. The maintenance occasions in a drinking water system of a component represented by a basic event in the fault tree are sometimes random (e.g. condition- or opportunity-based) and sometimes planned (e.g. age- or clock-based); and (2) according to Rausand and Høyland (2004), it has been shown that the superposition of an infinite number of independent stationary renewal processes is a homogeneous Poisson process. Rausand and Høyland (2004) claim that this result is often used as a justification for assuming the time between system failures to be exponentially distributed.

The proportion of consumers affected ($C$) and the probability of failure on demand ($q$) are modelled by Beta distributions. The main reason for using Gamma and Beta distributions is the fact that they are conjugate to the exponential and binomial models respectively. This means, for example, that a prior Gamma distribution updated with hard data results in a posterior Gamma distribution. Hence, the distributional classes of the prior and posterior distributions are the same. The use of conjugate distributions thus facilitates a Bayesian approach, in which hard data can be used for a mathematically formal updating of previous knowledge. Furthermore, the Beta and Gamma distributions are flexible and capable of attaining a wide variety of shapes.

Hard data, such as measurements and statistics on events, expert judgements and combinations of these, are used as input data in the fault tree analysis. Expert judgements are used when hard data is not available or too limited. The experts, mainly water utility experts, are asked to estimate a plausible maximum and minimum value of the variable of interest. These estimates are used as percentiles when estimating the distribution. However, mean or median values should also be considered to ensure that a suitable distribution is obtained. The use of variables $\lambda$ and $\mu$ to calculate the probability of failure facilitates expert judgements. Instead of estimating the probability of failure, which can be difficult, the experts estimate the mean failure rate ($\lambda$) and mean duration of failure ($1/\mu$), and based on this the probability of failure is calculated as $P_F = \lambda/(\lambda + \mu)$.

The Gamma distribution has one shape parameter (r) and one scale parameter (s) and the Beta distribution has two shape parameters (a, b). Hence, the Gamma and Beta distributions can be defined either by the two standard parameter
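[Figure 10. Probability distributions of the input variables are propagated to a distribution of the probability of failure, $P_F = \lambda/(\lambda + \mu)$, which in turn is combined with the proportion of consumers affected to give a distribution of the risk, $R = P_F C$.]
Fig. 10 – Illustration of how the uncertainties are taken into consideration when calculating the risk using Monte Carlo simulations.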
sets (r, s) and (a, b) respectively, or by two other parameters such as the 5th and 95th percentiles. Table 2 summarises possible parameters that can be used to define probability distributions representing the different variables. To describe the use of expert judgments and estimation of probability distributions we consider the event pump failure. Water utility personnel may assess the time to failure to be in the interval of 7–15 years and when failure occurs they assume it will take 1–24 h to repair it. If these values are considered as the 5th and 95th percentiles (i.e. reflecting reliable estimates of the intervals), variables λ and μ are estimated as Gamma distributions with parameters (r = 19.1, s = 0.005) and (r = 1.4, s = 2300) respectively.

Table 2 Possible parameters for estimating probability distributions for model variables.

| Variable and distribution | Hard data | Expert judgment |
| --- | --- | --- |
| Failure rate (λ), Gamma | The number of events registered (r − 1) and the specific time period (1/s). | Two percentiles of the distribution of the time to failure (1/λ). |
| Duration of failure⁻¹ (μ), Gamma | The number of events (r − 1) and the total duration of the events (1/s). | Two percentiles of the distribution of duration of failure (1/μ). |
| Probability of failure on demand (q), Beta | Statistics on the reliability of the compensating component (q). | Two percentiles of the distribution of probability of failure on demand (q). |
| Proportion of consumers affected (C), Beta | Statistics on the proportion of consumers affected by specific events (C). | Two percentiles of the distribution of proportion of consumers affected (C). |

7. Method application

An integrated and probabilistic fault tree analysis was conducted for the drinking water system in Gothenburg, Sweden. By applying the method to a system as extensive and complex as the one in Gothenburg, the method was tested and conditions specific to drinking water systems were incorporated. The method application is presented here with a focus on methodological aspects. Lindhe et al. (2008) presented the application in Gothenburg with a focus on aspects of the specific drinking water system.

7.1. Gothenburg water supply

Gothenburg is the second largest city in Sweden and approximately 500,000 people are supplied with drinking water. The raw water supply is solely based on surface water. The main water source is a river, although a number of lakes are also used as reservoirs and reserve water sources. The system includes two treatment plants with roughly the same production capacity and similar treatment processes, including chemical flocculation, sedimentation, filtration and disinfection. The distribution network is approximately 1700 km in length and, to assure sufficient pressure in network areas at high altitudes, the water head is raised through booster stations. To meet peaks in the water demand, service reservoirs are used. The water quality in the river and the treatment plants is monitored online. Additional analyses, e.g. microbial, are also made in the water sources and at the treatment plants and different locations in the distribution system.

7.2. Fault tree model

The fault tree of the Gothenburg system was based on the generic fault tree structure presented in Section 4. Supply failure may thus occur in the raw water, treatment or distribution (Fig. 8). Within each sub-system, quantity and quality failures may occur. The first variant of the AND-gate was used to model failure in one sub-system being compensated for by other sub-systems. The failure events as well as the structure of the fault tree were identified and compiled in close collaboration with water utility personnel. Both previous and possible future events were included. The drinking water quality was considered unacceptable when "unfit for human consumption", a criterion based on the Swedish quality standards for drinking water (SLVFS, 2001:30). In total, the fault tree was composed of 116 basic events, 100 intermediate events and 101 logic gates. To describe the basic events, 214 probability distributions were estimated for the different variables, i.e. failure rates, downtimes and probabilities of failure on demand. Hard data was used to define eight of the variables and the main source of data was consequently expert judgments.

The raw water part of the fault tree was structured to illustrate which of the two treatment plants is affected by failure. For quantity failure to occur all raw water sources must be unavailable for at least one treatment plant and the treatment and distribution systems must fail to compensate. The traditional AND-gate was used to model that all water sources must be simultaneously unavailable. In addition, the first variant of the AND-gate was used to model the ability to compensate for failure by means of increased production capacity at the non-affected treatment plant as well as stored water in service reservoirs at the treatment plants and in the distribution system.

To model quality failures in the raw water the second variant of the AND-gate was used. Hence, unacceptable raw water quality may be compensated for by the treatment. The probability that unacceptable water quality cannot be compensated for at all was represented by the probability of failure on demand. Estimates of the failure rate and downtime, i.e. how often the treatment efficiency is affected and for how long, were provided by the treatment part of the fault tree.

Quantity failure in the treatment may also be compensated for by means of increased production capacity at the non-affected treatment plant and service reservoirs at the treatment plants and in the distribution system. There is no subsequent sub-system that could compensate for failure in the distribution. However, the distribution system itself may compensate for quantity failures. If, for example, water
cannot be transferred to a delivery zone due to pump failure, water stored in water towers in that zone may be used.

7.3. Tolerability criteria

When risks have been analysed they need to be evaluated to determine whether the level of risk is acceptable or not. Sometimes tolerability criteria already exist whereas in other cases an acceptable level of risk needs to be defined for the specific analysis at hand. A combination of the two alternatives may also be required.

The City of Gothenburg has worked out an action plan which, among other things, includes performance targets for the supply of drinking water (Göteborg Vatten, 2006). These targets are politically established and can be considered as acceptable levels of risk. One target related to the reliability of the supply, i.e. water quantity, is defined as:

Duration of interruption in delivery to the average consumer shall, irrespective of the reason, be less than a total of 10 days in 100 years.

This target corresponds to an acceptable risk level of 144 CML per year for the average consumer (10 days × 1440 min per day, spread over 100 years). In the Results section this target is compared with the results of the fault tree analysis.

8. Results

The calculations were performed using Monte Carlo simulation. A total of 10 simulations, each consisting of 10,000 iterations, were made. Results from one such simulation are presented in Figs. 11 and 12. In Fig. 11 the risk level in (mean) CML, probability of failure, mean failure rate and mean downtime are presented for the entire system as well as the three sub-systems. The results are presented separately for quantity and quality failure. A histogram of the quantity-related CML is given in Fig. 12.

The purpose of repeating the main simulation 10 times was to get an idea of how close 10,000 iterations are to convergence. For the quantity-related CML, the standard deviation was 1.3, 6.3 and 42.5 for P05, P50 and P95 respectively, where e.g. P05 denotes the 5th percentile of the empirical distribution. In all cases the coefficient of variation was around 0.02. Similar results were obtained for quality-related CML. This variation would hardly be noticeable if plotted in the corresponding plot in Fig. 11 (see also Fig. 12) and was consequently regarded as being sufficiently small.

The results show that the total risk level (CML) is mainly due to raw water failures. This is valid for both quantity and quality failure. The reason for this is long downtimes and the fact that a large number of people are affected when the first part of the supply chain fails. It should be noted that the probability of failure is a function of the mean failure rate and mean downtime. However, by studying all three variables additional information is obtained.

The probability of failure differs between the sub-systems and the probability of distribution failure is highest. Hence, the total probability of failure is governed mainly by the probability of distribution failure. Similar to the probability of failure, the total failure rate is influenced mostly by the
[Fig. 11 graphic: two rows of four panels, one row for quantity failure and one for quality failure. Panels: Risk (CML per year), Probability of failure, Failure rate (mean failure rate, year⁻¹) and Downtime (mean downtime, h). Each panel shows P05, P50 and P95 for Tot., Raw w., Treat. and Distr.]
Fig. 11 Histograms showing the 5th, 50th and 95th percentiles for quantity and quality failure. For each of the four
parameters the result is presented for the entire system (Tot.) as well as the three main sub-systems. Note that the scales
are not the same for quantity and quality failure.
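The convergence check behind these percentiles (repeating the simulation and comparing P05, P50 and P95 between runs) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' model: a single basic event with the pump-failure Gamma parameters from Section 6 stands in for the whole fault tree, the Beta(2, 8) distribution for the proportion of consumers affected is hypothetical, and CML per year is assumed to be P_F · C multiplied by the number of minutes in a year, which is consistent with the 144 CML target. The function names are likewise hypothetical.

```python
import random
import statistics

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def one_simulation(n_iter, rng):
    """Return (P05, P50, P95) of CML per year for one Monte Carlo run."""
    cml = []
    for _ in range(n_iter):
        lam = rng.gammavariate(19.1, 0.005)  # failure rate [1/year], pump example
        mu = rng.gammavariate(1.4, 2300.0)   # repair rate [1/year], pump example
        c = rng.betavariate(2.0, 8.0)        # ASSUMPTION: proportion affected
        cml.append(lam / (lam + mu) * c * MINUTES_PER_YEAR)
    cml.sort()
    return tuple(cml[round(q * (n_iter - 1))] for q in (0.05, 0.50, 0.95))


def convergence_check(n_runs=10, n_iter=10_000, seed=1):
    """Standard deviation and coefficient of variation of each percentile across runs."""
    rng = random.Random(seed)
    runs = [one_simulation(n_iter, rng) for _ in range(n_runs)]
    summary = {}
    for name, values in zip(("P05", "P50", "P95"), zip(*runs)):
        sd = statistics.stdev(values)
        summary[name] = (sd, sd / statistics.mean(values))
    return summary


for name, (sd, cv) in convergence_check().items():
    print(f"{name}: sd = {sd:.4f}, coefficient of variation = {cv:.3f}")
```

A small coefficient of variation across runs indicates that the number of iterations per run is large enough for the reported percentiles to be stable. The values printed by this toy model are not comparable with those reported for the Gothenburg system.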
frequent distribution failures. For quantity as well as quality failure, the mean downtime is highest for the raw water sub-system. However, the frequent failures in the distribution system have a short mean downtime and consequently the total mean downtime is in approximately the same range as the distribution failures.

Fig. 11 also shows the uncertainty of the results and for some variables there is a large difference between the percentiles, indicating a high degree of uncertainty in the results. The probabilistic approach also enabled a comparison of the quantity-related total risk level with the acceptable level of risk defined by the City of Gothenburg. Fig. 12 shows the quantity-related CML, including uncertainties, and the tolerability criterion. The probability of exceeding the criterion is 0.84 ± 0.01 with 95% confidence.

[Fig. 12 graphic: Quantity failure (entire system); x-axis CML (0 to 2000), y-axis Probability (0 to 0.05).]
Fig. 12 Uncertainty distribution of quantity-related CML for the entire system. The acceptable level of risk (144 CML per year for the average consumer) is indicated by a solid vertical line and the probability of exceeding the acceptable level is 0.84.

In addition to the quantified risk levels and other calculated variables, the fault tree can be analysed qualitatively. By studying the structure of the fault tree, information on what may cause failure and the interaction between different events and parts of the systems is provided. Hence, a person not involved in the fault tree construction can acquire valuable information by studying the fault tree.

9. Discussion

For the Gothenburg system it was shown that the raw water system contributes most to the total risk level, expressed as CML, although it is the distribution system that contributes most to the probability of failure due to frequent failures. These findings are confirmed by existing knowledge of the system. By studying not only the level of risk but also the probability of failure, failure rate and downtime, information on the dynamic behaviour of the system is provided. A traditional fault tree analysis, not applying the Markovian approach and including the consequences, would only have provided information on the probability of failure and the dynamic behaviour of the system would not have been possible to calculate.

The results of methods such as the one presented in this paper are not easy to verify since, for example, many rare events are combined to calculate the risk. In order to evaluate the quantity-related criterion defined by the City of Gothenburg, the local water utility has estimated the total duration of interruption in delivery to the average consumer to be 61 days in 100 years, i.e. 878 CML per year. This estimation includes frequent failures as well as what is considered the most important rare events. Comparing the quantity-related CML values of the fault tree analysis (Figs. 11 and 12) with the estimated value of 878 CML shows that the results are in the same range. However, it should be noted that expert judgments by the water utility personnel are used as a basis for both results.

It is important to study not only the results for the top event but also at lower levels in the fault tree. For example, the three sub-systems should be compared to see in what way they contribute to the risk. The results for the Gothenburg system show that the probability of raw water failure is low but when a failure occurs the mean downtime is long and many people are affected. The treatment has a low mean failure rate, short mean downtime and little impact on the total risk. The distribution system causes frequent failures but due to the short mean downtime and few people affected its contribution to the total risk is small.

Two sub-systems may cause the same number of CML but the probability of failure and the number of people affected may differ. Two sub-systems may also have the same probability of failure but different failure rates and downtimes. Properties like these can be identified using the fault tree and are important to know about when evaluating a system and suggesting ways of reducing the risk.

A fault tree should be constructed so that it represents circumstances of the actual system instead of being fitted to actual data. When hard data is missing or insufficient, expert judgements must be used. The fault tree construction is an iterative process where the structure and the results are evaluated continuously to ensure that a proper model is developed. A fault tree model makes it possible to evaluate each basic event as well as the intermediate events, depending on which is most suitable.

Uncertainty is an important part of the concept of risk and the probabilistic approach used in this method enables different types of uncertainty analysis. First of all, uncertainties regarding the results provide information on the variation in the calculated variables. The uncertainties may, for example, be due to modelling uncertainties, variable uncertainty or natural variability. The simulation approach also facilitates calculations of rank correlation coefficients. The rank correlation coefficients show how much each variable in the model contributes to the uncertainty of the result. The rank correlation coefficient can thus be used to identify where in the model new information is most, and least, valuable in reducing uncertainties in the results. Consequently, this information can be used to guide further studies. The probabilistic approach also enables, for example, calculation of the probability of exceeding acceptable levels of risk. The application to the Gothenburg system showed that the
probability of exceeding the quantity-related criterion was 0.84. Results of this nature provide the decision-maker with additional information. It is not only important to define a tolerable level of risk but also a level of certainty by which the risk should not be exceeded. In addition to the quantity-related criterion used in this study, other criteria can also be used and compared with the results of the fault tree analysis. For example, criteria related to the risk level as well as failure rate, downtime and probability of failure can be used for both quantity- and quality-related failures. It should also be noted that criteria can be defined and evaluated for the entire system as well as its specific parts.

The calculation of CML was possible since estimates of the affected proportions of consumers were included in the fault tree. The use of CML as a measure of risk is based on the assumption that the uncertainties of the probability of failure and the affected proportion of consumers are independent. Furthermore, the use of CML implies that two events that cause the same level of risk but have different failure rates and downtimes and affect different numbers of people are regarded as being equally severe. In order to distinguish such events from each other, the calculated CML values should be evaluated together with information on the probabilities of failure and/or affected proportions of consumers. Since the probability of failure is defined by the failure rate and downtime of the system, these variables provide additional information that is important to consider when evaluating the results.

The model does not quantify the actual health risk and the CML related to quality does not include the health effect of drinking water that does not comply with quality standards. However, since the CML related to quality corresponds to the minutes per year that the average consumer is exposed to drinking water not complying with quality standards, it provides valuable information. The CML related to quality could, for example, be one important input in a Quantitative Microbial Risk Assessment (QMRA). In the Gothenburg case the criterion for quality failure was defined as "unfit for human consumption". This criterion can be defined differently depending on the purpose of the study. Combining a QMRA with a detailed system description, represented by a fault tree model, enables a focused search for best options to reduce the health risk, which would otherwise have been very difficult. It is also possible to learn from actual quality failures by detailing them using a fault tree, see e.g. Risebro et al. (2007). The information gained can then be used to improve fault tree models for similar systems.

Due to the function of a drinking water system, it cannot be regarded as a simple series system where failure in one part of the system automatically affects the consumer. Consequently, integrated analyses, including the entire system and its ability to compensate for failure, are required. The fault tree method presented facilitates integrated risk analysis of drinking water systems and thus also minimises sub-optimisation of risk-reduction options. An advantage of the fault tree method is that, in addition to providing risk estimations, it can also be used to evaluate risk-reduction options. By changing the fault tree structure, e.g. adding events or changing the input data, risk-reduction options can be modelled. The work of evaluating risk-reduction options by means of the fault tree method is in progress, see e.g. Rosén et al. (2008). The Bayesian approach, using Beta and Gamma distributions, enables a mathematically formal updating of previous knowledge as new hard data becomes available. Hence, expert judgements can be combined with hard data and the model can be updated continuously.

Compared to simpler methods for risk analysis, such as risk ranking by using risk matrices with discretised probability and consequence scales, the fault tree method enables modelling of chains of events and interconnections between events. The fault tree method also quantifies the level of risk and the dynamic behaviour of the system, which facilitates comparison with other risks and acceptable levels of risk. However, since risk ranking and the fault tree method provide different results and the latter method requires more time, data and training, the methods fulfil different demands.

10. Conclusions

The main conclusions of this study are:

- The fault tree method presented here can be used to perform integrated risk analysis of drinking water systems from source to tap. It includes the inherent ability of the system to compensate for failure. Hence, it supports decision-makers in the task of minimising sub-optimisation of risk-reduction options.
- Customer Minutes Lost (CML) is shown to be a valuable measure of risk since performance targets (acceptable levels of risk) can be defined using this measure. However, since different probability and consequence values can result in the same risk, calculated CML values should be analysed and compared in combination with information on the probabilities of failure and/or consequences.
- The possibility to estimate not only the probability of failure but also the mean failure rate and mean downtime at each intermediate level of the fault tree provides valuable information about the dynamic behaviour of the system.
- The probabilistic approach enables uncertainty analysis and calculations of the probability of exceeding defined performance targets and acceptable levels of risk.
- Incorporation of expert judgements is facilitated by using the mean failure rate and mean downtime to model estimates of probabilities. The use of Gamma and Beta distributions enables a Bayesian approach with mathematically formal updating of the analysis as new hard data becomes available.

The construction of the fault tree, analysis of available data, expert judgements and the analysis of results facilitate discussions of risk as well as the function of the system. Hence, it should be stressed that not only the results of the calculations are valuable but also the actual process of performing the fault tree analysis. This, in combination with the ability to model risk-reduction options, makes the fault tree method an important source of support in decision-making.
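The Bayesian updating with Gamma and Beta distributions mentioned in the conclusions can be sketched with the standard conjugate rules. This is a minimal illustration, not necessarily the scheme used in the study: it assumes that failures follow a Poisson process with constant rate (Gamma prior on λ) and that failures on demand are binomial (Beta prior on q). The class names, the observation of 2 failures in 10 years and the 1 failure in 4 demands are hypothetical. The prior parameters are those of the pump-failure example in Section 6, with s read as a scale parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaRate:
    """Gamma distribution for a failure rate, with shape r and scale s."""
    shape: float
    scale: float

    @property
    def mean(self) -> float:
        return self.shape * self.scale

    def update(self, n_events: int, exposure_years: float) -> "GammaRate":
        # Conjugate Gamma-Poisson update: the shape gains the event count and
        # the inverse scale (rate parameter) gains the observation time.
        inverse_scale = 1.0 / self.scale + exposure_years
        return GammaRate(self.shape + n_events, 1.0 / inverse_scale)


@dataclass(frozen=True)
class BetaProbability:
    """Beta distribution for a probability of failure on demand."""
    a: float
    b: float

    @property
    def mean(self) -> float:
        return self.a / (self.a + self.b)

    def update(self, failures: int, demands: int) -> "BetaProbability":
        # Conjugate Beta-binomial update.
        return BetaProbability(self.a + failures, self.b + demands - failures)


prior = GammaRate(shape=19.1, scale=0.005)  # expert judgment, pump example
posterior = prior.update(n_events=2, exposure_years=10.0)  # HYPOTHETICAL hard data
print(f"prior mean rate: {prior.mean:.4f} per year")
print(f"posterior mean rate: {posterior.mean:.4f} per year")
```

Because the posterior has the same distributional form as the prior, it can be used directly as the input distribution in the next simulation and updated again when further hard data arrive.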
Acknowledgements

This study has been carried out within the framework of the TECHNEAU project (Technology Enabled Universal Access to Safe Water), funded by the European Commission (contract number 018320), and with support from the Swedish Water & Wastewater Association and the City of Gothenburg. We would like to thank the City of Gothenburg for its valuable and fruitful collaboration. We also appreciate the comments and suggestions given by the three anonymous reviewers which helped to improve the paper.

References

Bedford, T., Cooke, R.M., 2001. Probabilistic Risk Analysis: Foundations and Methods. Cambridge University Press, Cambridge.
Blokker, M., Ruijg, K., de Kater, H., 2005. Introduction of a substandard supply minutes performance indicator. Water Asset Management International 1 (3), 19–22.
Burgman, M.A., 2005. Risks and Decisions for Conservation and Environmental Management. Cambridge University Press, Cambridge.
CDW/CCME, 2004. From Source to Tap: Guidance on the Multi-Barrier Approach to Safe Drinking Water. Federal-Provincial-Territorial Committee on Drinking Water and Canadian Council of Ministers of the Environment Water Quality Task Group, Health Canada.
Cox, A.L., 2008. What's wrong with risk matrices? Risk Analysis 28 (2), 497–512.
Davison, A., Howard, G., Stevens, M., Callan, P., Fewtrell, L., Deere, D., Bartram, J., 2005. Water Safety Plans: Managing Drinking-water Quality from Catchment to Consumer. WHO/SDE/WSH/05.06. World Health Organisation, Geneva.
European Commission, 2000. First Report on the Harmonisation of Risk Assessment Procedures, Part 2: Appendices 26–27 October 2000. Health and Consumer Protection Directorate-General.
Göteborg Vatten, 2006. Action Plan Water: Long-term Goals for the Water Supply in Gothenburg (in Swedish). City of Gothenburg.
Gray, N.F., 2005. Water Technology: an Introduction for Environmental Scientists and Engineers, second ed. Elsevier Butterworth-Heinemann, Oxford.
IEC, 1995. Dependability Management - Part 3: Application Guide - Section 9: Risk Analysis of Technological Systems. International Electrotechnical Commission. International Standard IEC 300-3-9.
ISO/IEC, 2002. Guide 73 Risk Management - Vocabulary - Guidelines for Use in Standards. International Organization for Standardization and International Electrotechnical Commission.
IWA, 2004. The Bonn Charter for Safe Drinking Water. International Water Association, London.
Kaplan, S., 1997. The words of risk analysis. Risk Analysis 17 (4), 407–417.
Kaplan, S., Garrick, B.J., 1981. On the quantitative definition of risk. Risk Analysis 1 (1), 11–27.
Lindhe, A., Rosén, L., Norberg, T., Pettersson, T.J.R., Bergstedt, O., Åström, J., Bondelind, M., 2008. Integrated risk analysis from source to tap: case study Göteborg. Paper presented at the 6th Nordic Drinking Water Conference, Oslo, 9–11 June.
NHMRC/NRMMC, 2004. National Water Quality Management Strategy: Australian Drinking Water Guidelines. National Health and Medical Research Council and Natural Resource Management Ministerial Council, Australian Government.
Norberg, T., Rosén, L., Lindhe, A., 2008. Added value in fault tree analyses. In: Martorell, et al. (Eds.), Safety, Reliability and Risk Analysis: Theory, Methods and Applications. Taylor & Francis Group, London, pp. 1041–1048.
Rausand, M., Høyland, A., 2004. System Reliability Theory: Models, Statistical Methods, and Applications, second ed. Wiley-Interscience.
Risebro, H.L., Doria, M.F., Andersson, Y., Medema, G., Osborn, K., Schlosser, O., Hunter, P.R., 2007. Fault tree analysis of the causes of waterborne outbreaks. Journal of Water and Health 5 (1), 1–18.
Rosén, L., Bergstedt, O., Lindhe, A., Pettersson, T.J.R., Johansson, A., Norberg, T., 2008. Comparing raw water options to reach water safety targets using an integrated fault tree model. Paper presented at the International Water Association Conference, Water Safety Plans: Global Experiences and Future Trends, Lisbon, 12–14 May.
SLVFS, 2001:30. National Food Administration Ordinance on Drinking Water. Swedish National Food Administration (in Swedish).
WHO, 2004. Guidelines for Drinking-water Quality, third ed., Vol. 1, Recommendations. World Health Organization, Geneva.
