Fault Handling in Multi Core Processors

MULTI-CORE COMMERCIAL OFF-THE-SHELF (COTS) UNDER THE IMPLEMENTATION OF FAULT TOLERANCE

¹ Minahil Ahmad, M.Phil. Student, Lahore Garrison University, minahilahmad77@hotmail.com
² Zakka-ur-Rehman, M.Phil. Student, Lahore Garrison University, zakka727@gmail.com
ABSTRACT
Fault-tolerant design is increasingly relevant to high-performance IT systems, whose fault rates rise
as structure sizes shrink. As an alternative to purely hardware solutions, software-based error
tolerance mechanisms can increase the reliability of commercial off-the-shelf (COTS) multicore
processors at lower cost while remaining efficient enough for practical use. This paper considers a
hybrid software/hardware approach that targets Intel's current x86 Core and Xeon multi-core
platforms. Hardware transactional memory (Intel TSX) is used to create implicit checkpoints and to
perform fast rollback. Redundant execution of processes and comparison of signatures over their
computations provide error detection, and transactional wrapping enables error recovery. Existing
applications are extended for fault-tolerant redundant execution by post-link binary
instrumentation. Further hardware enhancements increase the applicability of the approach; evaluated
with the SPEC CPU 2006 benchmarks, the performance overhead is 47% on average, assuming that the
proposed hardware support exists. In this paper, we gather the hardware- and software-level
techniques that help to remove faults in multi-core COTS processors and keep the system working
properly.
Keywords: Commercial off-the-shelf (COTS), multicore processors, fault tolerance, Sphere of
Replication (SoR).
1. INTRODUCTION

Faults in computer systems can never be eliminated entirely, although several measures have been
taken to remove them. A fault can be removed only after it has been detected and located. Faults
also need to be checked again, both to confirm them and to calculate the downtime they cause,
measured from the start to the end of the fault. Some systems, known as dependable server systems,
help to check for a fail-stop while a specific task is running. A fail-operational system is built
to detect faults and errors by itself and to keep the task fault free by removing the error.

Error correction code (ECC) is a technique used to remove transient errors on the memory path, and
it helps in controlling pipelining. This kind of implementation also resolves errors left behind by
transient faults simply by doubling the amount of resources used or the number of cycles. The
locking mechanism in COTS multicore processors is complex to update and requires additional work.

Power management can be synchronized with performance in step-by-step cycles. A modular system can
help here: dual modular redundancy (DMR) uses two redundant processes and supports only the
detection of a malfunction, whereas triple modular redundancy (TMR) [Florian Haas, Sebastian Weis,
Theo Ungerer] also makes the multicore processor error tolerant. A multicore processor in a COTS
architecture requires costly technology for error detection and tolerance, because high performance
calls for large, specialized compilers that support unlimited recovery abilities.
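The difference between DMR and TMR mentioned in the introduction can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names `dmr_check` and `tmr_vote` are hypothetical, and the replica results are assumed to be hashable Python values rather than hardware signals.

```python
from collections import Counter


def dmr_check(result_a, result_b):
    """DMR: two replicas can detect a mismatch, but not decide which one is wrong."""
    return result_a == result_b


def tmr_vote(results):
    """TMR (or any N-modular setup): majority vote over replica results.

    Returns the majority value and the indices of the replicas that disagree.
    Raises RuntimeError when no strict majority exists.
    """
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty
```

With two replicas a mismatch can only stop or retry the computation, while with three replicas the vote masks the single faulty result and execution can continue.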
2. ORGANIZATION AND WORKING

COTS multicore processors [N. Oh, P. P. Shirvani, and E. J. McCluskey] make use of the
Transactional Synchronization Extensions (TSX) in a reverse form for error tolerance. The flowchart
in Fig. 1 shows the dependent systems and then moves on to the impairments, which show how the
system causes an interrupt. The means of the system show whether an error is a hardware error or a
software error: for a hardware error, the system must be reconstructed from the point where the
damage is caused, and for a software error, the error-prone software or code must be validated. A
COTS multicore processor needs several attributes to fulfill the requirements of the system, and
together these lead to a system that is fault avoidant, fault free, and fault tolerant. See Fig. 1.
Figure 1: Organization and Working for COTS Multicore processors
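The way a transaction can serve error tolerance (the state at the start of the transaction acts as an implicit checkpoint, and a detected mismatch leads to rollback and re-execution) can be sketched in software. This is an emulation under stated assumptions, not Intel TSX itself: the state is a plain dictionary with string keys, private copies stand in for the transactional write set, and `signature` and `run_transaction` are hypothetical helper names.

```python
import copy
import hashlib


def signature(state):
    """Hypothetical signature: a hash over the sorted items of the state."""
    return hashlib.sha256(repr(sorted(state.items())).encode()).hexdigest()


def run_transaction(state, step, max_retries=3):
    """Execute `step` redundantly on two private copies of `state`.

    `step` mutates a dict in place. The shared `state` is left untouched
    until both copies agree, so it plays the role of the checkpoint.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_retries + 1):
        leading = copy.deepcopy(state)
        trailing = copy.deepcopy(state)
        step(leading)
        step(trailing)
        if signature(leading) == signature(trailing):
            state.clear()
            state.update(leading)  # commit the agreed result
            return attempt
        # Mismatch: drop both copies (rollback to the checkpoint) and retry.
    raise RuntimeError("transaction aborted: replicas kept disagreeing")
```

A transient fault that corrupts only one of the two executions causes a single retry, while the committed state never contains the corrupted value.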
3. FAULT AVOIDANCE OR PREVENTION

Avoiding faults of any type in the system and the software is the basic step in making a
fault-tolerant system. System requirements and specifications are important factors during
construction, along with design models that make them reusable for implementing formal methods.
Each COTS multicore processor is built upon certain specifications that make it more efficient than
other systems, and it should also be fault tolerant.

3.1. System requirements specification

For the hardware and software system, we need to build a setup that is complete with all the
required means. It should be free of any kind of systematic or logical error. The software setup
should maintain correct communication between the software engineer and the system. The hardware
system [N. Oh, P. P. Shirvani, and E. J. McCluskey] should establish complete communication between
the hardware and the technical, mechanical, and electrical engineers. A system that fulfills all the
requirements is the desired system for the user as well as for firms.

3.2. Formal methods

Formal methods are taken into account for both hardware and software. In hardware, these methods
help to construct a complete system that is dependable. In software, they are used to maintain the
entire task, whether mathematical or coding, and most importantly the usage of the instruction set.

Formal methods are generally used for small systems. Because a COTS multicore processor is a large
system, developers no longer use them for it; instead, they apply fault removal techniques when
building up a system.

4. FAULT REMOVAL

Fault removal techniques are dependability-enhancing techniques that rest on two basic activities:
verification and validation. They improve software and hardware dependability by detecting existing
faults using verification and validation (V&V) methods and eliminating the detected faults. Fault
removal techniques contribute to system dependability through testing, formal inspection, and
formal design proofs.

4.1. Testing

This is the basic phase. All the requirements needed for the construction of the COTS
multiprocessor system are tested to confirm that they work well, and the components are then
switched on to check whether the connections they form are correct.

4.2. Inspection

Inspection is a very important factor in fault removal. It is carried out by companies that want no
errors and need their systems to work well. In this phase, small groups of people are formed; each
group inspects the region assigned to it and tests all of its parts to make the system fault free
[D.J.Sorin].

4.3. Design

This phase is related to the formal methods discussed earlier. It tests the pattern in which
instructions flow. Design is one of the most important things to get right when working on
multicore processors, so this is a complex phase.

To check the hardware design of a multi-core processor, we check whether the processors are
correctly connected and whether they can work properly without causing any kind of fault [Florian
Haas, Sebastian Weis, Theo Ungerer].
5. ARCHITECTURE OF COTS MULTICORE PROCESSORS
Figure 2: Software COTS Multicore Processor Architecture

Figure 3: Hardware COTS Multicore Processor Architecture
Figure 4: Basic COTS Multicore Processor Architecture

6. SOFTWARE vs. HARDWARE IMPLEMENTATION

6.1. Software reuse

Software reuse is very attractive for a variety of reasons. Software reusability implies a saving
in development cost, since it reduces the number of components that must be originally developed.
It is also popular as a means of increasing dependability, because software that has been well
exercised is less likely to fail (many faults have already been identified and corrected). In
addition, object-oriented paradigms and techniques encourage and support software reuse. However,
it is important to recognize that different measures of dependability may not be improved equally
by reuse of software.

6.2. Hardware reuse

When constructing a COTS multi-core processor, the connections are made strong enough not to cause
any kind of damage. To make a system fault tolerant in hardware, some vendors may use hot-swappable
disks when there is a risk of disk failure [ASHOK KUMAR].

7. APPROACH

7.1. Software centric

In the software-centric approach [D.J.Sorin], faults can arise in the application and its
libraries. They are detected using a Sphere of Replication (SoR), a boundary drawn around the
replicated components: when the output of faulty software leaves the boundary of the SoR, it can be
detected as faulty. Here the technique of Process Level Redundancy (PLR) [A. Shye, J. Blomstedt,
T. Moseley, V. J. Reddi, and D. A. Connors] is used. It makes the software redundant, so that a
backup is available for the fault-causing application and libraries.
Figure 5 Software Centric
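The idea of process-level redundancy with a check at the boundary of the Sphere of Replication can be sketched as follows. This is a simplified illustration and not the PLR implementation from the cited work: it assumes the replicated program is a small Python snippet, compares only the final standard output and exit code instead of intercepting system calls, and uses the hypothetical names `run_redundant` and `vote`.

```python
import subprocess
import sys
from collections import Counter


def vote(outputs):
    """Majority vote over replica outputs; returns (winner, faulty_indices)."""
    winner, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("replicas disagree without a majority")
    faulty = [i for i, out in enumerate(outputs) if out != winner]
    return winner, faulty


def run_redundant(code, replicas=3, timeout=30):
    """Run `code` in several independent Python processes and vote on the output.

    Each replica is a separate operating-system process, so a fault in one
    address space cannot silently corrupt the others. Only what leaves the
    processes (stdout and exit code) is compared.
    """
    procs = [
        subprocess.Popen([sys.executable, "-c", code],
                         stdout=subprocess.PIPE, text=True)
        for _ in range(replicas)
    ]
    outputs = [(p.communicate(timeout=timeout)[0], p.returncode) for p in procs]
    return vote(outputs)
```

A call such as `run_redundant("print(sum(range(10)))")` starts three replicas and releases their common output only if a majority agrees on it.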
7.2. Hardware centric

In the hardware-centric approach, the fault lies at the hardware level, basically in the
processors. These faults can be detected using the Sphere of Replication (SoR), because the output
of a faulty processor leaves the boundary of the SoR. The processors can be made redundant using
several techniques, lock stepping among them [Bingbing Xia, Fei Qiao, Huazhong Yang, and Hui Wang].
Figure 6 Hardware Centric
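Lock stepping, named above as one way to make processors redundant, can be pictured with a small software model. It is a sketch under stated assumptions: real lock stepping is done by a hardware comparator, whereas here each "core" is a Python callable, the state is an immutable value such as an integer, and `run_lockstep` and `LockstepTrap` are hypothetical names.

```python
class LockstepTrap(Exception):
    """Raised by the comparator when the cores disagree (models the trap signal)."""


def run_lockstep(program, cores, initial):
    """Execute `program` on all `cores` one instruction at a time.

    `program` is a list of functions mapping a state to a new state.
    Each core is a callable `core(instr, state) -> state`.
    After every instruction the comparator checks that all cores agree.
    """
    states = [initial for _ in cores]
    for step, instr in enumerate(program):
        states = [core(instr, s) for core, s in zip(cores, states)]
        if any(s != states[0] for s in states[1:]):
            raise LockstepTrap(f"mismatch at step {step}: {states}")
    return states[0]
```

Because the comparison happens after every step, an error is caught in the step where it first becomes visible, at the price of keeping every core busy with the same work.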
8. COMPARISON OF SOFTWARE vs. HARDWARE ON MULTICORE COTS PROCESSOR APPLYING FAULT TOLERANCE
Table 1
9. REDUNDANCY TECHNIQUES

Multiple redundancy techniques are used for error detection. N-modular redundancy is a widely used
technique for fault detection in multi-core processors: different cores process the same data in
isolation and compare their results with each other in order to detect errors.

A well-known redundancy technique is Dual Modular Redundancy (DMR), which uses two elements. DMR
detects an error when the results of the two elements differ. Another technique, Triple Modular
Redundancy (TMR), detects errors by majority voting. This leads to higher availability, since the
system can continue execution by masking the faulty element [H. Mushtaq, Z. Al-Ars, and K.
Bertels]. These techniques are further classified into spatial and temporal redundancy techniques.
Spatial redundancy means that the calculation is done on multiple distinct elements, while temporal
redundancy means that the calculation is performed multiple times.

Typically, spatial redundancy is used on multi-core processors, where it is cost efficient. In
multi-core systems, usually not all cores are in use at the same time, so the unused cores can
execute redundant threads for fault detection.

10. SIMULTANEOUS AND REDUNDANT MULTITHREADING

Another redundant execution approach is Simultaneous and Redundant Multithreading (SRT), which uses
the hardware feature called Simultaneous Multithreading (SMT). SRT executes copies of a program as
two independent hardware threads that share the resources of a processor core. The two dominant
forms of redundant execution are the lockstep configuration and redundant multithreading, with or
without loose lock stepping [D.J.Sorin].

11. LOCKSTEPPING

Lock stepping is a hardware-based technique for fault tolerance. The program is replicated, and
multiple processor cores execute it and compute results for the same program. An additional
hardware module replicates the program on all processors and compares the results. If all the
results are identical, there is no error. If the results differ, an error has occurred during
execution, and the comparison module generates a trap signal to the processors [H. Mushtaq, Z.
Al-Ars, and K. Bertels].

Timing plays an important role in this approach, because all cores must deliver their results at
the same time for the results to be matched correctly.

12. SOFTWARE BASED INSTRUCTION LEVEL FAULT TOLERANCE

Later, G. A. Reis et al. proposed SWIFT (Software Implemented Fault Tolerance), a software-based,
instruction-level, single-threaded approach to redundancy and fault tolerance. SWIFT is a
compiler-based technique: the compiler duplicates instructions at compile time and inserts
comparison instructions at strategic points in the program. During execution, values are computed
twice and compared so that a fault cannot affect the output of the program. SWIFT improves on prior
techniques in many ways. It requires no hardware changes and relies on software enhancements
instead.

13. RECOVERY

When a fault is detected, it is important to recover from it before it causes a system failure. A
fault-tolerant system must therefore also implement a recovery mechanism. There are
two options that can be adopted to recover from an error: error handling and fault handling.
Either or both can be used for recovery.

Error Handling: In error handling, errors are removed from the system without removing the source
of the fault. Two techniques are mostly used: checkpoint and repair, and masking. With
checkpointing, the state of the system is saved at regular intervals, and when an error is detected
the system is rolled back to its previous valid state using the checkpoint. With masking, the
component containing the error is masked by majority voting on the states of the other redundant
components. The state of the erroneous component may then be restored from the state of one of the
non-erroneous redundant components [Dimitris Gizopoulos, Sarita V. Adve, Daniel Sorin, Arijit
Biswas, Xavier Vera]. Masking is more efficient than checkpoint and repair because it does not
involve rollback.

Fault Handling: Fault handling involves isolating the faulty component and recovering the system
from the fault. Tasks that were being computed on a faulty core need to be reassigned to a working
core or a spare core; this is known as reassignment. A faulty component can be repaired in a
reconfigurable system through reconfiguration [N. Aggarwal, P. Ranganathan, N. P. Jouppi, and J. E.
Smith]. The faulty component is isolated from the non-faulty components so that the error is not
propagated to other components.

CONCLUSION

Multicore processors provide structural redundancy, which can be exploited by different fault
tolerance techniques. There are multiple approaches to fault tolerance in multicore COTS
processors, some of which have been discussed above. They include EDDI (error detection by
duplicated instructions), redundancy techniques such as SRT and the N-modular approach, lock
stepping, and SWIFT. Each technique has its pros and cons.

EDDI consumes a considerable amount of memory, because each location of an original instruction
needs a shadow instruction in memory. The duplication of memory results in significant hardware
cost, which also leads to significant cache cost [A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi,
and D. A. Connors].

Lock stepping also has some disadvantages regarding the processor cores. Basically, the need for
low-level hardware determinism leads to restrictions in hardware design. First, lock stepping leads
to a huge hardware overhead: for fault detection, two entire processor cores are required, which
doubles the total hardware cost. Second, non-determinism can occur in modern off-the-shelf
hardware; for instance, differences in hardware bits lead to a lockstep error even if they have no
effect on code execution. Third, clock domain crossing leads to asynchronous events, which subvert
hardware determinism. Additionally, deterministic execution over the entire hardware lifetime
cannot be tested. Nevertheless, an implementation in hardware is preferable from the point of view
of fault tolerance.

REFERENCE

1. Ashok Kumar P, Thanasekhar B. Fault Tolerance in Multi-Core Processors Using Flexible Redundant
Threading. Chennai, India : s.n., 7 July 2014.

2. Bingbing Xia, Fei Qiao, Huazhong Yang, and Hui Wang. A Fault-tolerant Structure for Reliable
Multi-core Systems Based on Hardware-Software Co-design. s.l. : Institute of Circuits and Systems,
Dept. of Electronic Engineering, 2001.

3. Florian Haas, Sebastian Weis, Theo Ungerer. Fault-tolerant Execution on COTS Multi-core
processors with transactional support. USA, Germany : s.n., 2013.

4. Dimitris Gizopoulos, Sarita V. Adve, Daniel Sorin, Arijit Biswas, Xavier Vera. Architectures for
Online Error Detection and Recovery in multi-core processors. USA, Greece : s.n., 2011.

5. Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, Daniel A. Connors. Using
Process-Level Redundancy to Exploit Multiple Cores for Transient fault tolerance. Colorado : s.n.,
2012.

6. H. Mushtaq, Z. Al-Ars, and K. Bertels, "Survey of Fault Tolerance Techniques for Shared Memory
Multicore/Multiprocessor Systems," in IEEE International Design and Test Workshop, Dec. 2011.

7. D.J.Sorin, Fault Tolerant Computer Architecture, Synthesis Lectures on Computer Architecture,
Morgan & Claypool Publish., 2009.

8. G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. Swift: software implemented
fault tolerance. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on,
pages 243–254, 2005.

9. A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors. Plr: A software approach to
transient fault tolerance for multicore architectures. Dependable and Secure Computing, IEEE
Transactions on, vol. 6, pages 135–148, 2009.

10. T. C. Bressoud and F. B. Schneider. Hypervisor-based fault tolerance. In Proceedings of the
fifteenth ACM symposium on Operating systems principles, SOSP '95, pages 1–11, New York, NY, USA,
1995. ACM.

11. N. Oh, P. P. Shirvani, and E. J. McCluskey. Error detection by duplicated instructions in
super-scalar processors. IEEE Transactions on Reliability, 51(1):63–75, March 2002.

12. G. A. Reis, J. Chang, N. Vachharajani, R. Rangan, and D. I. August. Swift: software implemented
fault tolerance. In Code Generation and Optimization, 2005. CGO 2005. International Symposium on,
pages 243–254, 2005.

13. https://www4.cs.fau.de/Lehre/WS14/MS_AKSS/papers/07-Paper-Stefan_Reif.pdf

14. http://ce-publications.et.tudelft.nl/publications/94_survey_of_fault_tolerance_techniques_for_shared_memory_multic.pdf

15. A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors. Plr: A software approach to
transient fault tolerance for multicore architectures. Dependable and Secure Computing, IEEE
Transactions on, vol. 6, pages 135–148, 2009.

16. N. Aggarwal, P. Ranganathan, N. P. Jouppi, and J. E. Smith. Configurable isolation: building
high availability systems with commodity multi-core processors. SIGARCH Comput. Archit. News,
vol. 35, pages 470–481, June 2007.
