
Fault-Tolerant Computing

Motivation, Background, and Tools

Oct. 2006

Terminology, Models, and Measures

Slide 1

About This Presentation


This presentation has been prepared for the graduate
course ECE 257A (Fault-Tolerant Computing) by
Behrooz Parhami, Professor of Electrical and Computer
Engineering at the University of California, Santa Barbara.
The material contained herein can be used freely in
classroom teaching or any other educational setting.
Unauthorized uses are prohibited. © Behrooz Parhami

Edition   Released    Revised
First     Oct. 2006

Terminology, Models, and Measures for Dependability

Impairments to Dependability

[Figure: word collage of impairment terms: Flaw, Error, Failure, Fault, Hazard, Bug, Degradation, Intrusion, Malfunction, Defect, Crash]

The Fault-Error-Failure Cycle


Aspect      Impairment
Structure   Fault
State       Error
Behavior    Failure

(Structure includes both components and design.)

[Figure: circuit example with the labels "Fault", "Correct signal", "0", and "Replaced with NAND?"]

Schematic diagram of the Newcastle hierarchical model and the impairments within one level.

The Four-Universe Model


Universe        Impairment
Physical        Failure
Logical         Fault
Informational   Error
External        Crash

Cause-effect diagram for Avižienis's four-universe model of impairments to dependability.

Unrolling the Fault-Error-Failure Cycle

Aspect      Impairment
Structure   Fault
State       Error
Behavior    Failure

Unrolling the cycle twice (First Cycle, Second Cycle) yields six levels:

Abstraction   Impairment    Level
Component     Defect        Low-Level
Logic         Fault         Low-Level
Information   Error         Mid-Level
System        Malfunction   Mid-Level
Service       Degradation   High-Level
Result        Failure       High-Level

Cause-effect diagram for an extended six-level view of impairments to dependability.

Multilevel Model
Level         State            Impairment class
(none)        Ideal
Component     Defective        Low-Level Impaired
Logic         Faulty           Low-Level Impaired
Information   Erroneous        Mid-Level Impaired
System        Malfunctioning   Mid-Level Impaired
Service       Degraded         High-Level Impaired
Result        Failed           High-Level Impaired

Legend (transition types in the state diagram): Initial Entry, Entry, Deviation, Remedy, Tolerance

Analogy for the Multilevel Model


An analogy for our multi-level model of dependable computing. Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow.

Concentric reservoirs are analogs of the six model levels, with defect being innermost
Wall heights represent inter-level latencies
Inlet valves represent avoidance techniques
Drain valves represent tolerance techniques

Why Our Concern with Dependability?


Reliability of an n-transistor system, each transistor having failure rate $\lambda$:

$R(t) = e^{-n \lambda t}$

There are only 3 ways of making systems more reliable:
Reduce $\lambda$
Reduce n
Reduce t

Alternative: Change the reliability formula by introducing redundancy in the system

[Figure: plot of $e^{-n \lambda t}$ on a vertical axis from 0.0 to 1.0 versus nt on a logarithmic axis from 10^4 to 10^10; values marked along the curve: .9999, .9990, .9900, .9048, .3679]
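
The effect of n, the failure rate, and t on a system with no redundancy can be sketched numerically. This is a minimal illustration, assuming a hypothetical per-transistor failure rate of 1e-10 per hour (the slides do not state a value); `series_reliability` is a name introduced here for the example.

```python
import math

def series_reliability(n: float, failure_rate: float, t: float) -> float:
    """R(t) = exp(-n * lambda * t): all n components must survive until t."""
    return math.exp(-n * failure_rate * t)

# Hypothetical value, chosen only for illustration.
lam = 1e-10  # failures per hour, per transistor

# Reliability falls as the product n*t grows.
for n, t in ((1e4, 1e2), (1e6, 1e2), (1e8, 1e2)):
    r = series_reliability(n, lam, t)
    print(f"n = {n:.0e}, t = {t:.0e} h, R = {r:.4f}")
```

Because only the product of n, the failure rate, and t appears in the exponent, halving any one of the three has the same effect on reliability.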

Highly Dependable Computer Systems


Long-life systems: Fail-slow, Rugged, High-reliability
  Examples: Spacecraft with multiyear missions, systems in inaccessible locations
  Methods: Replication (spares), error coding, monitoring, shielding

Safety-critical systems: Fail-safe, Sound, High-integrity
  Examples: Flight control computers, nuclear-plant shutdown, medical monitoring
  Methods: Replication with voting, time redundancy, design diversity

High-availability systems: Fail-soft, Robust, High-availability
  Examples: Telephone switching centers, transaction processing, e-commerce
  Methods: HW/info redundancy, backup schemes, hot-swap, recovery

Just as performance enhancement techniques gradually migrate from supercomputers to desktops, so too dependability enhancement methods find their way from exotic systems into personal computers.

Aspects of Dependability
[Figure: word collage of dependability aspects and measures. Legible terms include: Reliability, MTTF; Availability (pointwise av., interval av.), MTTR, MTBF; Maintainability; Serviceability; Performability, MCBF; Safety; Risk, consequence; Security; Integrity; Robustness; Resilience]

Concepts from Probability Theory


Probability density function: pdf
$f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t) / dt$

Cumulative distribution function: CDF
$F(t) = \mathrm{prob}[x \le t] = \int_0^t f(x) \, dx$

Expected value of x
$E_x = \int_{-\infty}^{+\infty} x f(x) \, dx = \sum_k x_k f(x_k)$

Variance of x
$\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x) \, dx = \sum_k (x_k - E_x)^2 f(x_k)$

Covariance of x and y
$\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[x y] - E_x E_y$

[Figure: three plots against Time (0 to 50): lifetimes of 20 identical systems; the CDF F(t), vertical axis 0.0 to 1.0; the pdf f(t), vertical axis 0.00 to 0.05]
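
The discrete (summation) forms of the expected value, variance, and covariance can be sketched as follows. The function names and the sample values are illustrative assumptions, not part of the slides.

```python
def expected_value(xs, probs):
    """E_x = sum over k of x_k * f(x_k)."""
    return sum(x * p for x, p in zip(xs, probs))

def variance(xs, probs):
    """sigma_x^2 = sum over k of (x_k - E_x)^2 * f(x_k)."""
    mean = expected_value(xs, probs)
    return sum((x - mean) ** 2 * p for x, p in zip(xs, probs))

def covariance(pairs, probs):
    """E[x y] - E_x E_y for joint outcomes (x, y) with joint probabilities."""
    ex = sum(x * p for (x, _), p in zip(pairs, probs))
    ey = sum(y * p for (_, y), p in zip(pairs, probs))
    exy = sum(x * y * p for (x, y), p in zip(pairs, probs))
    return exy - ex * ey

# Hypothetical lifetimes (in time units) and their probabilities.
lifetimes = [10, 20, 30]
probs = [0.2, 0.5, 0.3]
print(expected_value(lifetimes, probs), variance(lifetimes, probs))
```

The covariance helper uses the second form of the identity, E[x y] minus the product of the means, which avoids a second pass over the centered values.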

Some Simple Probability Distributions


[Figure: four panels (Uniform, Exponential, Normal, Binomial), each showing the CDF F(x), rising to 1, and the pdf f(x)]

Reliability and MTTF


Reliability: R(t)
Probability that the system remains in the Good state through the interval [0, t]

$R(t + dt) = R(t) \, [1 - z(t) \, dt]$, where $z(t)$ is the hazard function
$R(t) = 1 - F(t)$, where $F(t)$ is the CDF of the system lifetime, or its unreliability

Constant hazard function: $z(t) = \lambda \Rightarrow R(t) = e^{-\lambda t}$
(exponential reliability law; system failure rate is independent of its age)

Mean time to failure: MTTF
$\mathrm{MTTF} = \int_0^{+\infty} t f(t) \, dt = \int_0^{+\infty} R(t) \, dt$
The first integral is the expected value of the lifetime; the second is the area under the reliability curve (the equality is easily provable).

[Figure: two-state nonrepairable system. States: Good (start state) and Failed; a Failure transition leads from Good to Failed]
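
The claim that MTTF equals the area under the reliability curve can be checked numerically for the exponential law, where the closed form is one over the failure rate. This is a rough sketch: the failure rate is a hypothetical value, and `mttf_numeric` is a helper introduced here that uses a plain trapezoidal rule over a truncated interval.

```python
import math

def mttf_numeric(reliability, t_max, steps=100_000):
    """Trapezoidal approximation of the integral of R(t) over [0, t_max]."""
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    total += sum(reliability(i * dt) for i in range(1, steps))
    return total * dt

lam = 0.001  # hypothetical failure rate, per hour

# Integrate far enough out that the truncated tail is negligible.
approx = mttf_numeric(lambda t: math.exp(-lam * t), t_max=50 / lam)
print(approx, 1 / lam)
```

The upper limit of 50 mean lifetimes is an arbitrary cutoff; the remaining tail of the exponential is far below the discretization error.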

Failure Distributions of Interest


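Exponential: $z(t) = \lambda$
$R(t) = e^{-\lambda t}$
$\mathrm{MTTF} = 1/\lambda$

Rayleigh: $z(t) = 2\lambda(\lambda t)$
$R(t) = e^{-(\lambda t)^2}$
$\mathrm{MTTF} = (1/\lambda) \sqrt{\pi} / 2$

Weibull: $z(t) = \alpha \lambda (\lambda t)^{\alpha - 1}$
$R(t) = e^{-(\lambda t)^{\alpha}}$
$\mathrm{MTTF} = (1/\lambda) \, \Gamma(1 + 1/\alpha)$

Erlang: $\mathrm{MTTF} = k/\lambda$

Gamma: Erlang and exponential are special cases

Normal: Reliability and MTTF formulas are complicated

Discrete versions
Geometric (for exponential): $R(k) = q^k$
Discrete Weibull (for Weibull)
Binomial (for normal)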
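
Since the exponential and Rayleigh laws are the Weibull law with shape parameter 1 and 2, one set of functions covers all three. The sketch below uses Python's standard `math.gamma`; the function names and parameter values are illustrative assumptions.

```python
import math

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))

def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t) ** (alpha - 1); use t > 0 when alpha < 1."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_mttf(lam, alpha):
    """MTTF = (1 / lam) * Gamma(1 + 1 / alpha)."""
    return (1.0 / lam) * math.gamma(1.0 + 1.0 / alpha)

lam = 0.5  # hypothetical scale parameter
print(weibull_mttf(lam, 1))                            # exponential case: 1 / lam
print(weibull_mttf(lam, 2), math.sqrt(math.pi) / (2 * lam))  # Rayleigh case
```

With a shape parameter above 1 the hazard grows with age (wear-out), and below 1 it shrinks (infant mortality), which is why the Weibull family is a common fit for measured lifetimes.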

Comparing Reliabilities
Reliability difference: $R_2 - R_1$
Reliability gain: $R_2 / R_1$

Reliability improvement factor
$\mathrm{RIF}_{2/1} = [1 - R_1(t_M)] / [1 - R_2(t_M)]$
Example: $[1 - 0.9] / [1 - 0.99] = 10$

Reliability improvement index
$\mathrm{RII} = \log R_1(t_M) / \log R_2(t_M)$

Mission time extension
$\mathrm{MTE}_{2/1}(r_G) = T_2(r_G) - T_1(r_G)$

Mission time improvement factor
$\mathrm{MTIF}_{2/1}(r_G) = T_2(r_G) / T_1(r_G)$

[Figure: reliability functions $R_1(t)$ and $R_2(t)$ for Systems 1 and 2, plotted as System Reliability (R), 0.0 to 1.0, versus Time (t). Marked on the plot: $R_1(t_M)$ and $R_2(t_M)$ at mission time $t_M$; $T_1(r_G)$ and $T_2(r_G)$ at the goal reliability $r_G$; MTTF1 and MTTF2 on the time axis]
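
The comparison measures can be tried on two hypothetical systems. The sketch below assumes exponential reliability for both systems and reads T(rG) as the time at which reliability drops to the goal value rG; both are assumptions made for the example, as are the function names and failure rates.

```python
import math

def rif(r1, r2):
    """Reliability improvement factor of system 2 over system 1."""
    return (1 - r1) / (1 - r2)

def rii(r1, r2):
    """Reliability improvement index: log R1 / log R2."""
    return math.log(r1) / math.log(r2)

def mission_time(lam, r_goal):
    """Time at which exp(-lam * t) falls to r_goal (exponential law assumed)."""
    return -math.log(r_goal) / lam

print(rif(0.9, 0.99))  # the slide's example, approximately 10

lam1, lam2 = 2e-3, 1e-3  # hypothetical failure rates, per hour
r_goal = 0.9
t1, t2 = mission_time(lam1, r_goal), mission_time(lam2, r_goal)
mte = t2 - t1    # mission time extension
mtif = t2 / t1   # mission time improvement factor
print(mte, mtif)
```

For two exponential systems, both RII and MTIF reduce to the ratio of the failure rates, so they carry the same information; the measures differ once the reliability curves have different shapes.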

Availability, MTTR, and MTBF


(Interval) Availability: A(t)
Fraction of time that the system is in the Up state during the interval [0, t]

Steady-state availability: $A = \lim_{t \to \infty} A(t)$

Pointwise availability: a(t)
Probability that the system is available at time t
$A(t) = (1/t) \int_0^t a(x) \, dx$

Availability = Reliability, when there is no repair

Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair)

$A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}} = \frac{\mu}{\lambda + \mu}$
(Will justify this equation later)

Repair rate: $\mu$, with $1/\mu = \mathrm{MTTR}$
In general, $\mu \gg \lambda$, leading to $A \approx 1$

[Figure: two-state repairable system. States: Up (start state) and Down; a Failure transition leads from Up to Down and a Repair transition leads from Down to Up]
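
The two equivalent forms of steady-state availability, one in terms of mean times and one in terms of failure and repair rates, can be compared directly. The numbers below are hypothetical and the function names are introduced only for this sketch.

```python
def steady_state_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR), where MTTF + MTTR is the MTBF."""
    return mttf / (mttf + mttr)

def availability_from_rates(failure_rate, repair_rate):
    """A = mu / (lambda + mu), with lambda = 1/MTTF and mu = 1/MTTR."""
    return repair_rate / (failure_rate + repair_rate)

mttf, mttr = 999.0, 1.0  # hypothetical values, in hours
a_times = steady_state_availability(mttf, mttr)
a_rates = availability_from_rates(1 / mttf, 1 / mttr)
print(a_times, a_rates)
```

With repair a thousand times faster than failure, availability is already within 0.1% of 1, which is the sense of the remark that a repair rate much larger than the failure rate leads to availability near 1.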

System Up and Down Times


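Short repair time implies good Maintainability (serviceability)

[Figure: two-state repairable system (Up as start state; Failure and Repair transitions between Up and Down), together with a timeline of the system alternating between Up and Down. Time axis marks: 0, t1, t'1, t2, t'2. Labeled intervals: Time to first failure, Time between failures, Repair time]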
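
An up/down timeline of this kind can be turned into an interval availability figure by summing the down intervals. The sketch below assumes each outage is recorded as a (failure time, repair completion time) pair, that outages do not overlap, and that the observation window starts at time 0; the function name and the sample log are hypothetical.

```python
def interval_availability(outages, horizon):
    """Fraction of [0, horizon] spent in the Up state.

    outages: non-overlapping (fail_time, repair_done_time) pairs.
    """
    down = 0.0
    for fail, back in outages:
        # Clip each outage to the observation window [0, horizon].
        start = min(max(fail, 0.0), horizon)
        end = min(max(back, 0.0), horizon)
        down += max(end - start, 0.0)
    return (horizon - down) / horizon

# Hypothetical log: failures at t = 10 and 30, repaired at t = 12 and 33.
log = [(10.0, 12.0), (30.0, 33.0)]
print(interval_availability(log, horizon=50.0))
```

Shorter repair intervals raise the result directly, which is the link between maintainability and availability.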

Performability and MCBF


Performability: P
Composite measure, incorporating both performance and reliability

[Figure: three-state degradable system. States: Up2 (start state), Up1, Down. Transitions: Partial failure (Up2 to Up1), Failure (to Down), Partial repair, Repair]

Simple example
Worth of Up2 is twice that of Up1
$p_{\mathrm{Up}i}$ = probability that the system is in state Up$i$
$P = 2 p_{\mathrm{Up2}} + p_{\mathrm{Up1}}$

$p_{\mathrm{Up2}} = 0.92$, $p_{\mathrm{Up1}} = 0.06$, $p_{\mathrm{Down}} = 0.02$, $P = 1.90$
(system performance equivalent to that of 1.9 processors on average)

Question: What is system availability here?

Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails:
$\mathrm{PIF} = (2 - 2 \times 0.92) / (2 - 1.90) = 1.6$
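
The performability arithmetic can be reproduced as a worth-weighted sum over states. This sketch uses the state probabilities from the example; the dictionary layout and the function name are illustrative choices.

```python
def performability(state_probs, worth):
    """Sum of (probability of state) * (worth of state) over all states."""
    return sum(state_probs[s] * worth[s] for s in state_probs)

probs = {"Up2": 0.92, "Up1": 0.06, "Down": 0.02}
worth = {"Up2": 2, "Up1": 1, "Down": 0}

P = performability(probs, worth)

# Fail-hard baseline: worth 2 only while both processors are up.
fail_hard = 2 * probs["Up2"]

# Performability improvement factor, using a best possible value of 2.
pif = (2 - fail_hard) / (2 - P)
print(round(P, 2), round(pif, 2))
```

The PIF compares shortfalls from the ideal value of 2, in the same way that RIF compares unreliabilities.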

System Up, Partially Up, and Down Times


Important to prevent direct transitions to the Down state (coverage)

[Figure: three-state degradable system (Up2 as start state, Up1, Down; transitions: Partial failure, Failure, Partial repair, Repair), together with a timeline of the system moving among Up, Partially Up, and Down. Time axis marks: 0, t1, t'1, t2, t'2, t3, t'3. Labeled events: Partial Failure, Total Failure, Partial Repair, Repair]

Integrity and Safety


Risk: Probability of being in the Unsafe Failed state
There may be multiple unsafe states, each with a different consequence (cost)

Simple analysis
Lump the Safe Failed state with the Good state; proceed as in reliability analysis

More detailed analysis
Even though the Safe Failed state is more desirable than Unsafe Failed, it is still not as desirable as the Good state; so keeping it separate makes sense

For example, if a repair transition is introduced between the Safe Failed and Good states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability

[Figure: three-state fail-safe system. States: Good (start state), Safe Failed, Unsafe Failed; Failure transitions lead from Good to each of the two failed states]
