DEPENDABLE COMPUTING AND FAULT TOLERANCE: CONCEPTS AND TERMINOLOGY

Jean-Claude Laprie
LAAS-CNRS, 7, Avenue du Colonel Roche, 31400 Toulouse, France

ABSTRACT

This paper provides a conceptual framework for expressing the attributes of what constitutes dependable and reliable computing:
- the impairments to dependability: faults, errors, and failures,
- the means for dependability: fault-avoidance, fault-tolerance, error-removal, and error-forecasting,
- the measures of dependability: reliability, availability, maintainability, and safety.
Emphasis is put on the dependability impairments and on fault-tolerance.

FOREWORD

This paper is aimed at giving informal but precise definitions characterizing the various attributes of computing systems dependability. It is a contribution to the work undertaken within the "Reliable and Fault Tolerant Computing" scientific and technical community [Avi 78, Ran 78, Car 79, Lap 79, And 81, Sie 82, Cri 84] in order to propose clear and widely acceptable definitions for some basic concepts.

The work presented here has been conducted within the framework of the subcommittee "Fundamental Concepts and Terminology", which is common to the IFIP WG 10.4 "Reliable Computing and Fault Tolerance" and to the IEEE Computer Society TC "Fault-Tolerant Computing".

(The work presented here was partly performed while the author was with UCLA, Los Angeles, CA.)

Reprinted from FTCS-15, 1985, pp. 2-11. 0-8186-7150-5/96 $5.00 © 1996 IEEE. Proceedings of FTCS-25, Volume III.

This paper is an edited version of [Lap 84]. Going backward in time, a milestone was the special meeting devoted to fundamental concepts held in conjunction with FTCS-11. A special session was organized at FTCS-12, devoted to the presentation of viewpoints elaborated in several places and institutions [And 82, Avi 82, Kop 82, Lap 82, Lee 82, Rob 82].
The previous versions of this paper were then discussed during the 1983 and 1984 Winter and Summer meetings of the IFIP WG 10.4.

The paper proceeds by refinements: dependability is first introduced as a global concept. Fault-tolerance is then detailed.

The guidelines which have governed this presentation can be summed up as follows:
- search for the minimum number of concepts enabling the dependability attributes to be expressed,
- use of terms which are identical to (whenever possible) or as close as possible to those generally used; as a rule, a term which has not been defined shall retain its ordinary sense (as given by any dictionary),
- emphasis on integration (as opposed to specialization) [Gol 82].

In each section, concise definitions are given first, then they are heavily commented in order to (attempt to) show the wide applicability of the adopted presentation.

Boldface characters are used when a term is defined, italic characters being an invitation to focus the reader's attention.

THE DEPENDABILITY CONCEPT

BASIC DEFINITIONS AND ASSOCIATED TERMINOLOGY

Computer system dependability is the quality of the delivered service such that reliance can justly be placed on this service (*).

The service delivered by a system is the system behavior as it is perceived by another special system(s) interacting with the considered system: its user(s).

A system failure occurs when the delivered service deviates from the specified service, where the service specification is an agreed description of the expected service. The failure occurred because the system was erroneous: an error is that part of the system state which is liable to lead to failure, i.e. to the delivery of a service not complying with the specified service. The cause (in its phenomenological sense) of an error is a fault.

Upon occurrence, a fault creates a latent error, which becomes effective when it is activated; when the error affects the delivered service, a failure occurs.
Stated in other terms, an error is the manifestation in the system of a fault, and a failure is the manifestation on the service of an error.

Achieving a dependable computing system calls for the combined utilization of a set of methods which can be classed into:
- fault-avoidance: how to prevent, by construction, fault occurrence,
- fault-tolerance: how to provide, by redundancy, service complying with the specification in spite of faults having occurred or occurring,
- error-removal: how to minimize, by verification, the presence of latent errors,
- error-forecasting: how to estimate, by evaluation, the presence, the creation and the consequences of errors.

Fault-avoidance and fault-tolerance may be seen as constituting dependability procurement: how to provide the system with the ability to deliver the specified service; error-removal and error-forecasting may be seen as constituting dependability validation: how to reach confidence in the system's ability to deliver the specified service.

(*) This definition is adapted from [Car 82], where "trustworthiness and continuity" has been replaced by "quality".

The life of a system is perceived by its users as an alternation between two states of the delivered service with respect to the specified service:
- service accomplishment, where the service is delivered as specified,
- service interruption, where the delivered service is different from the specified service.

The events which constitute the transitions between these two states are the failure and the restoration. Quantifying this accomplishment-interruption alternation leads to the two main measures of dependability:
- reliability: a measure of the continuous service accomplishment (or, equivalently, of the time to failure) from a reference initial instant,
- availability: a measure of the service accomplishment with respect to the alternation of accomplishment and interruption.

COMMENTS

1. On the introduction of dependability as a generic concept

[...] reliability in its general meaning (reliable system) [...]

In regard to the term "dependability", it is noteworthy that from an etymological point of view, the term "reliability" would be more appropriate: ability to rely upon. Although dependability is synonymous with reliability, it brings in the notion of dependence at a second level. This may be felt as a negative connotation at first sight, when compared to the positive notion of [...] "relier", itself from the Latin "religare", to bind back: re-, back, and ligare, to fasten, to tie. The necessary solidarity for reaching reliability is present! The French word for reliability, "fiabilité", traces back to the 12th century, to the word "fiableté" whose meaning was "character of being trustworthy"; the Latin origin is "fidare", a popular verb meaning "to trust".

Finally, it is interesting that viewing dependability as a more general concept than reliability, availability, etc., and embodying the latter terms, has already been attempted in the past (see e.g. [Hos 60]), although with less generality than here, since the goal was then to define a measure.

2. On the notion of service and its specification

From its very definition, the service delivered by a system is clearly an abstraction of its behavior. It is noteworthy that this abstraction is highly dependent on the application where the computer system is employed. An example of this dependence is the important role played in the abstraction by time: the time granularities of the system and of its user(s) are generally different.

Concerning specification, what is essential within the present context is that it is a description of the expected service which is agreed upon by two persons or corporate bodies: the system supplier (in a broad sense of the term: designer, builder, vendor, etc.) and its (human) user(s).
What precedes does not mean that a service specification will not change once established. This would be simple ignorance of the facts of life, which imply change. The changes may be motivated by [...] specification. What is important is that the specification is (re)agreed upon. It is noteworthy that such matters as performance, observability, readiness, etc. can be captured by an appropriately stated specification.

3. On the notions of fault, error and failure

First, some illustrative examples:
- a programmer's mistake is a fault; the consequence is a (latent) error in the written software (erroneous instruction or piece of data); upon activation (activation of the module where the error resides and an appropriate input pattern activating the erroneous instruction, instruction sequence or piece of data) the error becomes effective; when this effective error produces erroneous data (in value or in the timing of their delivery) which affect the delivered service, a failure occurs,
- a short-circuit occurring in an integrated circuit is a fault; the consequence (connection stuck at a Boolean value, modification of the circuit function, etc.) is an error which will remain latent as long as it is not activated, the continuation of the process being identical to the previous example,
- an electromagnetic perturbation of sufficient energy is a fault; when (for instance) acting on a memory's inputs, it will create an error if active when the memory is in the write position; the error will remain latent until the erroneous memory location(s) is (are) read, etc.,
- an inappropriate man-machine interaction performed (inadvertently or deliberately) by an operator during the operation of the system is a fault; the resulting altered data is an error, etc.,
- a maintenance or operating manual writer's mistake is a fault; the consequence is an error in the corresponding manual (erroneous directives) which will remain latent as long as the directives are not acted upon in order to face a given situation, etc.
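
The chain described in the first example (fault, latent error, activation, effective error, failure) can be sketched in a few lines of Python. This is only an illustration, not part of the original paper: the `Component` class, its `inject_fault` and `serve` methods, and the "doubling" service specification are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class ErrorState(Enum):
    LATENT = "latent"
    EFFECTIVE = "effective"


def specified_service(x: int) -> int:
    """Hypothetical agreed specification: return twice the input."""
    return 2 * x


@dataclass
class Component:
    """Toy software module that is supposed to implement specified_service."""
    # Hypothetical state: inputs whose handling has been corrupted by a fault.
    erroneous_inputs: Set[int] = field(default_factory=set)
    error_state: Optional[ErrorState] = None  # None: no error present

    def inject_fault(self, bad_input: int) -> None:
        # The fault (e.g. a programmer's mistake) creates a latent error.
        self.erroneous_inputs.add(bad_input)
        self.error_state = ErrorState.LATENT

    def serve(self, x: int) -> int:
        if x in self.erroneous_inputs:
            # Activation: this input pattern exercises the erroneous part,
            # so the error becomes effective and corrupts the output.
            self.error_state = ErrorState.EFFECTIVE
            return 2 * x + 1
        return specified_service(x)


def is_failure(component: Component, x: int) -> bool:
    # A failure occurs when the delivered service deviates from the specified one.
    return component.serve(x) != specified_service(x)
```

In this sketch, serving an input that does not exercise the corrupted part leaves the error latent and the service as specified; only the activating input turns it into an effective error that reaches the delivered service, which is the failure.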
From the above examples, it is easily understood that the error latency duration may vary considerably, depending upon the fault, the considered system utilization, etc. These examples also explain why an error was defined as being liable to lead to a failure. Whether or not an error will effectively lead to a failure depends on several factors:
- the activation conditions according to which a latent error will become effective,
- the system composition, and especially the amount of available redundancy:
  - explicit redundancy (in order to ensure fault-tolerance) which is directly intended to prevent an error from leading to failure,
  - implicit redundancy (it is in fact difficult to build a system without any form of redundancy) which may have the same (unexpected) result as explicit redundancy,
- the very definition of a failure from the user's viewpoint, e.g. in the notion of "acceptable error rate" (implicitly: before considering that a failure has occurred) in data transmission.

These examples finally enable the introduction of the notion of fault classes, which are classically [Avi 78] physical faults and human-made faults. Proposed definitions are as follows:
- physical faults: adverse physical phenomena, either internal (physico-chemical disorders: threshold changes, short-circuits, open circuits, ...) or external (environmental perturbations: electromagnetic perturbations, temperature, vibration, ...),
- human-made faults, which may be:
  - design faults, committed either a) during the system initial design (broadly speaking, from requirement specification to implementation) or during subsequent modifications, or b) during the establishment of operating or maintenance procedures,
  - interaction faults: inadvertent or deliberate violations of operating or maintenance procedures.

Faults, errors and failures are all undesired circumstances.
At first sight, the three notions are necessary: a) the occurrence of an undesired circumstance affecting the service (failure) is felt by the user(s) and assessed, b) an (internal) system undesired circumstance (error) is detected, c) the undesired circumstance able to give rise to a system error (fault), and later on to a service failure, is either avoided or tolerated. Assignment of the terms fault, error and failure to the phenomenological cause, the system and the service undesired circumstances simply takes into account current usage: fault-avoidance or tolerance, error detection, failure rate.

Moreover, it seems important to differentiate between the cause with respect to activation and propagation (error causing failure), and the cause with respect to the (suspected) originating phenomenon(a) (fault causing error).

It could be argued that with such a reasoning (fault viewed as phenomenological cause of error), one can go "a long way back". For instance, if we look back at two of the previously given examples:
- why did the programmer make a mistake?
- why did the short occur in the integrated circuit?

In fact, recursion stops at the cause which is intended to be avoided or tolerated. If fault-avoidance is meaningful within this context (when a cause is avoided, its effects are of little interest), it may not be so for fault-tolerance; in fact, it is the cause which is tolerated, through processing its effects: a fault is thus the adjudged cause of an error [Mor 83].

Furthermore, such a view is consistent with the distinction between human and physical faults in that a computing system is a human creation and as such any fault is ultimately human-made since it represents human inability to master the complexity of the phenomena which govern the behavior of a system.
In an absolute way, distinguishing between physical and human-made faults (especially design faults affecting the system) may be considered as unnecessary; however it is of importance when considering (current) methods and techniques for procuring and validating dependability. If the above-mentioned recursion is not stopped, a fault is nothing else than a failure of a system having interacted or interacting with the considered system; examples follow:
- a design fault is identifiable as a designer failure,
- an internal physical fault is due to a latent error (the "physics reliability" community rarely characterizes failures as "sudden, nonpredictable and irreversible") originating from the hardware production,
- physical external faults and (human-made) interaction faults are identifiable as failures due to another design fault: the inability to foresee all the situations the system will be faced with during its operational life.

Up to now, a system has been considered as a whole, emphasizing its externally perceived behavior; a definition of a system complying with this "black box" view is: an entity having interacted, interacting, or liable to interact with other entities, thus other systems. The behavior is then simply what the system does [Zie 76]. What enables it to do what it does is the structure of the system or its organization. Adopting the spirit of [And 81], a system, from a structural ("white box") viewpoint, is a set of components bound together in order to interact; a component is another system, etc. The recursion stops when a system is considered as being atomic: any further internal structure cannot be discerned, or is not of interest and can be ignored.

The term "component" has to be understood in a broad sense: layers of a computing system as well as intra-layer components; in addition, a component being itself a system, it embodies the interrelation(s) of the components of which it is composed.
From these definitions, the discussion of whether "failure" applies to a system or to a component is simply irrelevant, since a component is itself a system. When atomic systems are dealt with, the classical notion of failure comes naturally. It is also noteworthy that from the preceding view of system structure, the notions of service and specification apply equally naturally to the components. This is especially interesting in the design process, when using off-the-shelf components, either hardware or software [Hor 84]: what is of actual interest is the service they are able to provide, not their detailed (internal) behavior.

This structured view enables fault pathology to be made more precise; the creation and action mechanisms of faults, errors and failures may be summarized as follows:
1) A fault creates one or several latent errors in the component where it occurs; physical faults can directly affect the physical layer components only, whereas human-made faults may affect any component.
2) The properties governing errors may be stated as follows:
   a) a latent error becomes effective once it is activated,
   b) an error may cycle between its latent and effective states,
   c) an effective error may, and in general does, propagate from one component to another; by propagating, an error creates other (new) errors.
   From these properties it may be deduced that an effective error within a component may originate from:
   - activation of a latent error within the same component,
   - an effective error propagating within the same component or from another component.
3) A component failure occurs when an error affects the service delivered (as a response to request(s)) by the component.
4) These properties apply to any component of a system.

In the preceding, the intransitive form of "propagate" was intentionally used: an error does not propagate itself, it just propagates. Although "propagate" was retained due to its wide use, better words would probably be "spread", or "breed".
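
Properties 2c and 3 above can be illustrated with a small propagation sketch. The following Python is a hedged illustration only: it assumes, pessimistically, that every effective error reaches the service of its component and creates a new error in each component consuming that service, unless the component is listed as tolerant (explicit redundancy confining the error). The function name `propagate`, the `consumers` mapping and the component names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, Set


def propagate(
    consumers: Dict[str, Iterable[str]],
    origin: str,
    tolerant: FrozenSet[str] = frozenset(),
) -> Set[str]:
    """Return the components that end up containing an error.

    consumers maps a component to the components that use its service.
    A component in `tolerant` still contains the error, but its delivered
    service is unaffected, so nothing propagates past it.
    """
    erroneous = {origin}
    queue = deque([origin])
    while queue:
        current = queue.popleft()
        if current in tolerant:
            continue  # error confined: no component failure, no propagation
        for nxt in consumers.get(current, ()):
            if nxt not in erroneous:
                # The failure of `current` acts as a fault for `nxt`
                # and creates a new error there.
                erroneous.add(nxt)
                queue.append(nxt)
    return erroneous
```

For example, with a chain sensor, filter, controller, actuator, an error originating in the sensor reaches every downstream component, whereas declaring the filter tolerant confines the error to the sensor and the filter.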
Three final comments:
i) A given error in a given component may be subsequent to different faults. For instance, an error in a physical component (e.g. stuck at ground voltage) may result from:
   - a physical fault (e.g. threshold change) acting at the physical layer comprising the component,
   - an information error (e.g. erroneous microinstruction), caused by a design fault (e.g. programmer mistake), propagating top-down through the layers and leading to a short between two circuit outputs for a duration long enough to provoke a short-circuit having the same effect as the threshold change.
iii) The adjective "deliberate" in the definition of human-made interaction faults is clearly intended to include "undesired accesses" in the sense of computer security and privacy; however, the corresponding methods and techniques will not be addressed in the sequel.

4. On fault-avoidance and tolerance, error-removal and forecasting

All the "how to's" which appear in the basic definitions are in fact goals which cannot be fully reached, as all the corresponding activities are human activities, and thus imperfect. These imperfections bring in dependencies which explain why it is only the combined utilization of the above methods (preferably at each step of the design and implementation process) which can lead to a dependable computing system. These dependencies can be sketched as follows: in spite of construction rules (imperfect in order to be workable), faults occur; hence the need for error-removal; error-removal is itself imperfect, as are the off-the-shelf components of the system, hence the need for error-forecasting; our increasing dependence on computing systems brings in fault-tolerance, which in turn necessitates construction rules, and thus error-removal, error-forecasting, etc. It has to be noted that the process is even more recursive than it appears from the above: current computer systems are so complex that their design and implementation need computerized tools in order to be cost-effective (in a broad sense, including the capability of succeeding within acceptable delays). These tools have themselves to be dependable, and so on.

The preceding reasoning explains why [...] the terms are often associated, e.g. "V and V" [Boe 79], the distinction being related to the difference between "building the system right" (related to verification) and the "right system" (related to validation). What is proposed is simply an extension of this concept: the answer to the question "am I building the right system?" [...] of the same activity (validation) is moreover of great interest as it enables a better understanding of the notion of coverage, and thus of an important problem introduced by the above recursion(s): the validation of the validation, or how to reach confidence in the methods and tools used in building confidence in the system. Coverage refers here to a measure of the representativity of the situations to which the system is submitted during its validation with respect to the actual situations it will be confronted with during its operational life. Finally, "validation" stems from "validity", which encapsulates two notions:
- validity at a given moment, which relates to error-removal,
- validity for a given duration, which relates to error-forecasting.

5. On the dependability measures

The term "probability" has intentionally not been employed in the given definitions, so as to keep the discussion informal, and to reinforce the physical significance of the defined measures. However, as the considered circumstances are non-deterministic, random variables are associated with them, and the measures which are dealt with are probabilities; this is strictly speaking correct: a probability can be defined mathematically as a measure.
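
To make the physical significance of these measures concrete, the sketch below computes empirical estimates from one observed alternation of service accomplishment and service interruption. It is an illustration under stated assumptions, not the paper's formalism: the function name `estimate_measures`, the event encoding, and the choice of simple sample means are all hypothetical, the system is assumed to start in service accomplishment at time 0, and a single observed history only gives point estimates of the underlying probabilistic measures.

```python
from typing import Dict, List, Optional, Tuple


def estimate_measures(
    events: List[Tuple[float, str]], horizon: float
) -> Dict[str, Optional[float]]:
    """Estimate measures from a time-ordered list of (time, kind) events.

    kind is "failure" or "restoration"; observation runs from 0 to horizon.
    """
    up_times: List[float] = []    # completed periods of service accomplishment
    down_times: List[float] = []  # completed periods of service interruption
    last = 0.0
    accomplishing = True
    for t, kind in events:
        if accomplishing and kind == "failure":
            up_times.append(t - last)
            accomplishing = False
        elif not accomplishing and kind == "restoration":
            down_times.append(t - last)
            accomplishing = True
        else:
            raise ValueError(f"unexpected event {kind!r} at time {t}")
        last = t
    # The final, still-running accomplishment period counts towards
    # availability but not towards time to failure (no failure ended it).
    total_up = sum(up_times) + ((horizon - last) if accomplishing else 0.0)
    return {
        "mean_time_to_failure": sum(up_times) / len(up_times) if up_times else None,
        "mean_time_to_restoration": sum(down_times) / len(down_times) if down_times else None,
        "availability": total_up / horizon,
    }
```

With failures at times 10 and 30, restorations at times 12 and 34, and an observation horizon of 40, the completed accomplishment periods last 10 and 18, the interruptions last 2 and 4, and the share of time spent in service accomplishment is 34/40.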
Only two basic measures have been considered, reliability and availability, whereas a third one, maintainability, is usually considered, which may be defined as a measure of the continuous service interruption, or equivalently, of the time to restoration. This measure is no less important than those previously defined; it was not introduced earlier because it may, at least conceptually, be deduced from the other two. It is noteworthy that availability embodies the failure frequency and the accomplishment time duration at each accomplishment-interruption alternation.

A system may not, and generally does not, always fail in the same way. This immediately brings in the notion of the consequences of a failure upon the other systems with which the considered system is interacting, i.e. its environment; several failure modes can generally be distinguished, ordered according to the increasing severity of their consequences. A special case of great interest is that of systems which exhibit two failure modes whose severities differ considerably:
- benign failures, where the consequences are of the same order of magnitude (generally in terms of cost) as those of the service delivered in the absence of failure,
- malign or catastrophic failures, where the consequences [...]

Through grouping the states of service accomplishment and service interruption subsequent to benign failures into a safe state (in the sense of being free from damage, not from danger), the generalization of reliability leads to an additional measure: a measure of continuous safeness, or equivalently, a measure of the time to catastrophic failure, i.e. safety. It is worth noting that a direct generalization of the availability, thus providing a measure of safeness with respect to the alternation of safeness and interruption after catastrophic failure, would not provide a significant measure.
When a catastrophic failure has occurred, the consequences are generally so important that system restoration is not of prime importance for at least the two following reasons:
- it comes second to repairing (in the broad sense of the term, including legal aspects) the consequences of the catastrophe,
- the lengthy period prior to being allowed to operate the system again (investigation commissions) would lead to meaningless numerical values.

However, a "hybrid" reliability-availability measure can be defined: a measure of the service accomplishment with respect to the alternation of accomplishment and interruption after benign failure. This measure is of interest in that it provides a quantification of the [...]

REFERENCES

[...] Fault-Tolerant Computing, Los Angeles, June 1982, pp. [...]

Gra 78  J. N. Gray, "Notes on data base operating systems", in Operating Systems, Lecture Notes in Computer Science, Berlin: Springer-Verlag, 1978.

Hay 76  J. P. Hayes, "A graph model for fault-tolerant computing systems", IEEE Trans. on Computers, vol. C-25, no. 9, Sept. 1976, pp. 875-884.

Hor 84  E. Horowitz, J. B. Munson, "An expansive view of reusable software", IEEE Trans. on Software Engineering, vol. SE-10, no. 5, Sept. 1984, pp. 477-487.

Hos 60  L. E. Hosford, "Measures of dependability", Operations Research, vol. 8, no. 1, 1960.

Kop 82  H. Kopetz, "The failure fault (FF) model", Proc. 12th Int. Symp. on Fault-Tolerant Computing, Los Angeles, June 1982, pp. 14-17.

Lam 81  B. W. Lampson, "Atomic transactions", in Distributed Systems: Architecture and Implementation, Lecture Notes in Computer Science 105, Berlin: Springer-Verlag, 1981.

Lap 79  J. C. Laprie, A. Costes, R. Troy, "Dependability: requirements and solutions", Proc. SEE Conf. on Electrical and Electronic Systems Dependability, Toulouse, France, Oct. 1979; in French.

Lap 82  J. C. Laprie, A. Costes, "Dependability: a unifying concept for reliable computing", Proc. 12th Int. Symp. on Fault-Tolerant Computing, Los Angeles, June 1982, pp. 18-21.

Lap 84  J. C. Laprie, "Dependable computing and fault tolerance: concepts and terminology", IFIP WG 10.4, Summer 1984 meeting, Kissimmee, Florida; LAAS Research Report No. 84.035, June 1984.

Lee 82  P. A. Lee, D. E. Morgan, Eds., "Fundamental concepts of fault tolerant computing, progress report", Proc. 12th Int. Symp. on Fault-Tolerant Computing, Los Angeles, June 1982, pp. 34-38.

Mor 83  D. Morgan, W. C. Carter, A. Hopkins, "Report to IFIP WG 10.4: concepts and terminology", Draft, IFIP WG 10.4 Summer 83 meeting, Como, Italy, June 1983.

Ran 75  B. Randell, "System structure for software fault tolerance", IEEE Trans. on Software Engineering, vol. SE-1, no. 2, June 1975, pp. 220-232.

Ran 78  B. Randell, P. A. Lee, P. C. Treleaven, "Reliability issues in computing system design", Computing Surveys, vol. 10, no. 2, June 1978, pp. 123-165.

Rob 82  A. S. Robinson, "A user oriented perspective of fault tolerant system models and terminologies", Proc. 12th Int. Symp. on Fault-Tolerant Computing, Los Angeles, June 1982, pp. 22-28.

Sie 82  D. P. Siewiorek, R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982.

Tay 80  D. J. Taylor, D. E. Morgan, J. P. Black, "Redundancy in data structures: improving software fault tolerance", IEEE Trans. on Software Engineering, vol. SE-6, no. 6, Nov. 1980, pp. 585-594.

Wen 78  J. H. Wensley, L. Lamport, J. Goldberg, M. W. Green, K. N. Levitt, P. M. Melliar-Smith, R. E. Shostak, C. B. Weinstock, "SIFT: the design and analysis of a fault-tolerant computer for aircraft control", Proceedings of the IEEE, vol. 66, no. 10, Oct. 1978, pp. 1240-1255.

Zie 76  B. P. Zeigler, Theory of Modeling and Simulation, New York: John Wiley, 1976.
