f33 FT Computing Lec02 Measures

Fault-Tolerant Computing
Motivation, Background, and Tools
Oct. 2006
Terminology, Models, and Measures
Slide 1
About This Presentation

This presentation has been prepared for the graduate
course ECE 257A (Fault-Tolerant Computing) by
Behrooz Parhami, Professor of Electrical and Computer
Engineering at University of California, Santa Barbara.
The material contained herein can be used freely in
classroom teaching or any other educational setting.
Unauthorized uses are prohibited. Behrooz Parhami
Edition: First (released Oct. 2006)
Slide 2

Impairments to Dependability
(Figure: scattered impairment terms: Flaw, Error, Failure, Fault, Hazard, Bug, Degradation, Defect, Intrusion, Malfunction, Crash)
Slide 5
The Fault-Error-Failure Cycle

Aspect       Impairment
Structure    Fault
State        Error
Behavior     Failure
Callout: Includes both components and design
(Figure labels: Fault; Correct signal; 0; Replaced with NAND?)
Schematic diagram of the Newcastle hierarchical model and the impairments within one level.
Slide 6
The Four-Universe Model

Universe        Impairment
Physical        Failure
Logical         Fault
Informational   Error
External        Crash
Cause-effect diagram for Avižienis' four-universe model of impairments to dependability.
Slide 7
Unrolling the Fault-Error-Failure Cycle
Aspect       Impairment
Structure    Fault
State        Error
Behavior     Failure
(Figure labels: First Cycle, Second Cycle)

Abstraction   Impairment    Level
Component     Defect        Low-Level
Logic         Fault         Low-Level
Information   Error         Mid-Level
System        Malfunction   Mid-Level
Service       Degradation   High-Level
Result        Failure       High-Level
Cause-effect diagram for an extended six-level view of impairments to dependability.
Slide 8
Multilevel Model
Level         State            Impairment class
(none)        Ideal
Component     Defective        Low-Level Impaired
Logic         Faulty           Low-Level Impaired
Information   Erroneous        Mid-Level Impaired
System        Malfunctioning   Mid-Level Impaired
Service       Degraded         High-Level Impaired
Result        Failed           High-Level Impaired
Legend: Initial Entry, Deviation, Remedy, Tolerance
Slide 9
Analogy for the Multilevel Model

An analogy for our multi-level model of dependable computing. Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow.
Concentric reservoirs are analogs of the six model levels, with defect being innermost
Wall heights represent inter-level latencies
Inlet valves represent avoidance techniques
Drain valves represent tolerance techniques
Slide 10
Why Our Concern with Dependability?

Reliability of an n-transistor system, each transistor having failure rate $\lambda$:
$R(t) = e^{-n \lambda t}$
There are only 3 ways of making systems more reliable:
Reduce $\lambda$
Reduce n
Reduce t
Alternative: Change the reliability formula by introducing redundancy in the system
(Figure: plot of $e^{-n \lambda t}$ versus nt, horizontal axis from 10^4 to 10^10, vertical axis from 0.0 to 1.0; marked values .9999, .9990, .9900, .9048, .3679)
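As a rough numerical illustration of the reliability formula on this slide, the Python sketch below evaluates R(t) = exp(-n λ t) for several values of the product nt. The function name `series_reliability` and the failure rate of 1e-10 per transistor-hour are assumptions chosen for illustration; the slide text does not state a value for λ.

```python
import math

def series_reliability(n, lam, t):
    """Reliability of n components, each with constant failure rate lam,
    all of which must survive for time t: R(t) = exp(-n * lam * t)."""
    return math.exp(-n * lam * t)

# Assumed (hypothetical) failure rate per transistor-hour.
lam = 1e-10
for nt in (1e6, 1e7, 1e8, 1e9, 1e10):
    # n and t only matter through their product, so fix t = 1 and vary n.
    print(f"nt = {nt:.0e}  R = {series_reliability(nt, lam, 1.0):.4f}")
```

With this assumed λ the printed reliabilities are 0.9999, 0.9990, 0.9900, 0.9048, and 0.3679, the same numbers that appear as marks on the slide's plot. Because only the product n λ t enters the formula, halving any one of the three factors has the same effect on R.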
Slide 11
Highly Dependable Computer Systems

Long-life systems: Fail-slow, Rugged, High-reliability
Spacecraft with multiyear missions, systems in inaccessible locations
Methods: Replication (spares), error coding, monitoring, shielding
Safety-critical systems: Fail-safe, Sound, High-integrity
Flight control computers, nuclear-plant shutdown, medical monitoring
Methods: Replication with voting, time redundancy, design diversity
High-availability systems: Fail-soft, Robust, High-availability
Telephone switching centers, transaction processing, e-commerce
Methods: HW/info redundancy, backup schemes, hot-swap, recovery
Just as performance enhancement techniques gradually migrate from
supercomputers to desktops, so too dependability enhancement
methods find their way from exotic systems into personal computers
Slide 12
Aspects of Dependability
(Figure: circular diagram of dependability aspects. Legible labels: Reliability (MTTF); Point and Interval Availability (MTBF, MTTR); Maintainability; Serviceability; Performability (MCBF); Robustness; Resilience; Safety (Risk, consequence); Security; Integrity)
Slide 13
Concepts from Probability Theory

Probability density function: pdf
$f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t) / dt$
Cumulative distribution function: CDF
$F(t) = \mathrm{prob}[x \le t] = \int_0^t f(x)\,dx$
Expected value of x
$E_x = \int_{-\infty}^{+\infty} x f(x)\,dx = \sum_k x_k f(x_k)$
Variance of x
$\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx = \sum_k (x_k - E_x)^2 f(x_k)$
Covariance of x and y
$\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$
(Figure: lifetimes of 20 identical systems, with the resulting CDF F(t), vertical axis 0.0 to 1.0, and pdf f(t), vertical axis 0.00 to 0.05, each plotted against Time from 0 to 50)
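The discrete forms of the definitions on this slide can be sketched in a few lines of Python. The helper names (`expected_value`, `variance`, `cdf`) and the four equally likely lifetimes are hypothetical, chosen only to keep the arithmetic easy to check by hand.

```python
def expected_value(xs, ps):
    """E_x = sum_k x_k f(x_k) for a discrete distribution."""
    return sum(x * p for x, p in zip(xs, ps))

def variance(xs, ps):
    """sigma_x^2 = sum_k (x_k - E_x)^2 f(x_k)."""
    mean = expected_value(xs, ps)
    return sum((x - mean) ** 2 * p for x, p in zip(xs, ps))

def cdf(xs, ps, t):
    """F(t) = prob[x <= t], the accumulated probability mass up to t."""
    return sum(p for x, p in zip(xs, ps) if x <= t)

# Hypothetical lifetimes (arbitrary time units), each with probability 1/4.
lifetimes = [10, 20, 30, 40]
probs = [0.25, 0.25, 0.25, 0.25]

print(expected_value(lifetimes, probs))  # 25.0
print(variance(lifetimes, probs))        # 125.0
print(cdf(lifetimes, probs, 20))         # 0.5
```

The same functions apply to an empirical sample such as the 20 lifetimes in the slide's figure, by giving each observed lifetime probability 1/20.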
Slide 14
Some Simple Probability Distributions

(Figure: CDF F(x) and pdf f(x) plots for the Uniform, Normal, Exponential, and Binomial distributions)
Slide 15
Reliability and MTTF

Reliability: R(t)
Probability that system remains in the Good state through the interval [0, t]
(Figure: two-state nonrepairable system; Start state Good, with a Failure transition to Failed)
$R(t + dt) = R(t)\,[1 - z(t)\,dt]$, where z(t) is the hazard function
$R(t) = 1 - F(t)$, where F(t) is the CDF of the system lifetime, or its unreliability
Constant hazard function $z(t) = \lambda \Rightarrow R(t) = e^{-\lambda t}$ (exponential reliability law)
(system failure rate is independent of its age)
Mean time to failure: MTTF
$\mathrm{MTTF} = \int_0^{+\infty} t f(t)\,dt = \int_0^{+\infty} R(t)\,dt$
The first integral is the expected value of lifetime; the second is the area under the reliability curve (easily provable)
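The step from the hazard-function relation to the exponential law, and the "easily provable" area identity for MTTF, can be sketched as follows. This is the standard textbook derivation, written out under the assumptions that R(0) = 1, that f(t) = dF(t)/dt = -dR(t)/dt, and that t R(t) tends to 0 as t grows; it is not reproduced from the slide.

```latex
\begin{align*}
R(t+dt) = R(t)\,[1 - z(t)\,dt]
  &\;\Rightarrow\; \frac{dR(t)}{dt} = -z(t)\,R(t)
  \;\Rightarrow\; R(t) = \exp\!\Big(-\int_0^t z(x)\,dx\Big) \\
z(t) = \lambda
  &\;\Rightarrow\; R(t) = e^{-\lambda t} \\
\mathrm{MTTF} = \int_0^{+\infty} t\,f(t)\,dt
  &= -\int_0^{+\infty} t\,\frac{dR(t)}{dt}\,dt
   = \big[-t\,R(t)\big]_0^{+\infty} + \int_0^{+\infty} R(t)\,dt
   = \int_0^{+\infty} R(t)\,dt
\end{align*}
```

For the exponential law the last integral evaluates to 1/λ, so a constant failure rate of λ gives MTTF = 1/λ.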
Slide 16
Failure Distributions of Interest

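Exponential: $z(t) = \lambda$; $R(t) = e^{-\lambda t}$; $\mathrm{MTTF} = 1/\lambda$
  Discrete version: Geometric, $R(k) = q^k$
Rayleigh: $z(t) = 2\lambda(\lambda t)$; $R(t) = e^{-(\lambda t)^2}$; $\mathrm{MTTF} = (1/\lambda)\,\sqrt{\pi}/2$
Weibull: $z(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$; $R(t) = e^{-(\lambda t)^{\alpha}}$; $\mathrm{MTTF} = (1/\lambda)\,\Gamma(1 + 1/\alpha)$
  Discrete version: Discrete Weibull
Erlang: $\mathrm{MTTF} = k/\lambda$
Gamma: Erlang and exponential are special cases
Normal: Reliability and MTTF formulas are complicated
  Discrete version: Binomial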
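A short Python sketch can confirm that the Weibull formulas on this slide contain the exponential and Rayleigh cases (α = 1 and α = 2). The function names and the value of λ are hypothetical; `math.gamma` is the standard-library gamma function.

```python
import math

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))

def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t) ** (alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_mttf(lam, alpha):
    """MTTF = (1 / lam) * Gamma(1 + 1 / alpha)."""
    return math.gamma(1.0 + 1.0 / alpha) / lam

lam = 0.001  # hypothetical parameter, per hour

# alpha = 1 reduces to the exponential law: MTTF = 1 / lam.
print(weibull_mttf(lam, 1.0), 1.0 / lam)
# alpha = 2 reduces to the Rayleigh law: MTTF = (1 / lam) * sqrt(pi) / 2.
print(weibull_mttf(lam, 2.0), math.sqrt(math.pi) / (2.0 * lam))
```

Each print statement shows the general Weibull result next to the closed form for the special case, so the two numbers on a line should agree up to floating-point rounding.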
Slide 17
Comparing Reliabilities
Reliability difference: $R_2 - R_1$
Reliability gain: $R_2 / R_1$
Reliability improvement factor:
$\mathrm{RIF}_{2/1} = [1 - R_1(t_M)] / [1 - R_2(t_M)]$
Example: $[1 - 0.9] / [1 - 0.99] = 10$
Reliability improvement index:
$\mathrm{RII} = \log R_1(t_M) / \log R_2(t_M)$
Mission time extension:
$\mathrm{MTE}_{2/1}(r_G) = T_2(r_G) - T_1(r_G)$
Mission time improvement factor:
$\mathrm{MTIF}_{2/1}(r_G) = T_2(r_G) / T_1(r_G)$
(Figure: reliability functions R1(t) and R2(t) for Systems 1 and 2; System Reliability (R), 0.0 to 1.0, versus Time (t); marked points R1(tM), R2(tM), rG, T1(rG), tM, T2(rG), MTTF1, MTTF2)
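The comparison measures on this slide are simple enough to compute directly. The sketch below reproduces the slide's RIF example and then, as an assumption for illustration only, takes both systems to follow the exponential law so that the mission time T(rG) has a closed form. The function names and the failure rates are hypothetical.

```python
import math

def rif(r1, r2):
    """Reliability improvement factor RIF_2/1 = (1 - R1) / (1 - R2)."""
    return (1.0 - r1) / (1.0 - r2)

def rii(r1, r2):
    """Reliability improvement index RII = log R1 / log R2."""
    return math.log(r1) / math.log(r2)

def mission_time(lam, r_goal):
    """Time T(rG) at which an exponential system's reliability
    exp(-lam * t) drops to the goal value r_goal."""
    return -math.log(r_goal) / lam

# The slide's example: R1(tM) = 0.9, R2(tM) = 0.99.
print(round(rif(0.9, 0.99), 6))  # 10.0

# Hypothetical exponential systems, System 2 having half the failure rate.
lam1, lam2, r_goal = 2e-4, 1e-4, 0.95
t1, t2 = mission_time(lam1, r_goal), mission_time(lam2, r_goal)
print("MTE  =", t2 - t1)   # mission time extension
print("MTIF =", t2 / t1)   # mission time improvement factor
```

For two exponential systems the MTIF equals the ratio of failure rates regardless of the goal reliability, which is why it comes out as 2 (up to rounding) here.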
Slide 18
Availability, MTTR, and MTBF

(Interval) Availability: A(t)
Fraction of time that system is in the Up state during the interval [0, t]
(Figure: two-state repairable system; Start state Up, Failure transition to Down, Repair transition back to Up)
Steady-state availability: $A = \lim_{t \to \infty} A(t)$
Pointwise availability: a(t)
Probability that system is available at time t
$A(t) = (1/t) \int_0^t a(x)\,dx$
Availability = Reliability, when there is no repair
Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair)
$A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR}) = \mathrm{MTTF} / \mathrm{MTBF} = \mu / (\lambda + \mu)$ (Will justify this equation later)
Repair rate: $\mu$, with $1/\mu = \mathrm{MTTR}$
In general, $\mu \gg \lambda$, leading to $A \approx 1$
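A small Python sketch shows how the steady-state availability formula on this slide is used in practice, and that its MTTF/MTTR form and its rate form agree. The function names and the sample MTTF and MTTR values are hypothetical.

```python
def steady_state_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

def availability_from_rates(lam, mu):
    """Equivalent form A = mu / (lam + mu), with lam = 1/MTTF, mu = 1/MTTR."""
    return mu / (lam + mu)

# Hypothetical system: fails every 1000 h on average, repaired in 2 h.
mttf, mttr = 1000.0, 2.0
a = steady_state_availability(mttf, mttr)
print(f"A = {a:.6f}")
print(f"rate form = {availability_from_rates(1.0 / mttf, 1.0 / mttr):.6f}")
print(f"expected downtime per 8760-hour year = {(1.0 - a) * 8760:.1f} h")
```

Because the repair rate is much larger than the failure rate in this example, A is close to 1, and the downtime figure is often the more informative way to report it.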
Slide 19
System Up and Down Times

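Short repair time implies good Maintainability (serviceability)
(Figure: two-state repairable system, Start state Up, with Failure and Repair transitions to and from Down; timeline of Up and Down intervals against Time, with instants 0, t1, t'1, t2, t'2, marking the time to first failure, the time between failures, and the repair time)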
Slide 20
Performability and MCBF

Performability: P
Composite measure, incorporating both performance and reliability
(Figure: three-state degradable system; Start state Up 2, Partial failure to Up 1, Failure to Down, with Partial repair and Repair transitions back)
Simple example: Worth of Up2 twice that of Up1
pUpi = probability system is in state Upi
P = 2 pUp2 + pUp1
pUp2 = 0.92, pUp1 = 0.06, pDown = 0.02, P = 1.90
(system performance equiv. to that of 1.9 processors on average)
Question: What is system availability here?
Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails:
$\mathrm{PIF} = (2 - 2 \times 0.92) / (2 - 1.90) = 1.6$
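The slide's performability example can be reproduced with a few lines of Python. The function names are hypothetical; the state probabilities and the two formulas are the ones given on the slide.

```python
def performability(p_up2, p_up1):
    """P = 2 * pUp2 + pUp1: Up2 is worth two processors, Up1 one."""
    return 2.0 * p_up2 + p_up1

def pif(p_up2, p_up1):
    """Performability improvement factor relative to a fail-hard system
    whose performability is 2 * pUp2:
    PIF = (2 - 2 * pUp2) / (2 - P)."""
    return (2.0 - 2.0 * p_up2) / (2.0 - performability(p_up2, p_up1))

p_up2, p_up1, p_down = 0.92, 0.06, 0.02
# The three state probabilities must account for all of the time.
assert abs(p_up2 + p_up1 + p_down - 1.0) < 1e-12

print(round(performability(p_up2, p_up1), 2))  # 1.9
print(round(pif(p_up2, p_up1), 2))             # 1.6
```

The PIF has the same shape as the RIF: each system's shortfall from the ideal value of 2 is compared, so a factor of 1.6 means the degradable system loses 1.6 times less performance than the fail-hard one.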
Slide 21
System Up, Partially Up, and Down Times

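Important to prevent direct transitions to the Down state (coverage)
(Figure: three-state system with Start state Up 2, Up 1, and Down, linked by Partial failure, Partial repair, Failure, and Repair transitions; timeline of Up, Partially Up, and Down intervals against Time, with instants 0, t1, t'1, t2, t'2, t3, t'3, marking Partial Failure, Partial Repair, Total Failure, and Repair events)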
Slide 22
Integrity and Safety

Risk: Prob. of being in Unsafe Failed state
There may be multiple unsafe states, each with a different consequence (cost)
(Figure: three-state fail-safe system; Start state Good, with Failure transitions to Safe Failed and to Unsafe Failed)
Simple analysis
Lump Safe Failed state with Good state; proceed as in reliability analysis
More detailed analysis
Even though Safe Failed state is more desirable than Unsafe Failed, it is still not as desirable as the Good state; so keeping it separate makes sense
For example, if a repair transition is introduced between Safe Failed and Good states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability
Slide 23
