
Towards Optimal Fault Tolerant Scheduling in Computational Grid

Muhammad Imran¹, Iftikhar Azim Niaz¹, Sajjad Haider², Naveed Hussain², M. A. Ansari³
¹Faculty of Computing, Riphah International University, Islamabad
²Information Technology Department, National University of Modern Languages, Islamabad
³Computer Science Department, Federal Urdu University of Arts, Science & Technology, Islamabad
{mimran, ianiaz}@riphah.edu.pk, {sajjad, naveedhussain}@numl.edu.pk, drmaansari@fuuast.edu.pk

Abstract— The grid environment poses significant challenges due to the diverse failures encountered during job execution. Computational grids provide the main execution platform for long-running jobs. Such jobs require a long commitment of grid resources; therefore, fault tolerance in such an environment cannot be ignored. Most grid middleware have either ignored failure issues or developed ad hoc solutions, and most existing fault tolerance techniques are application dependent and cause a cognitive problem.
This paper examines existing fault detection and tolerance techniques in various middleware. We propose a fault tolerant layered grid architecture with a cross-layered design. In our approach, the Hybrid Particle Swarm Optimization (HPSO) algorithm and the Anycast technique are used in conjunction with the Globus middleware. We have adopted a proactive and reactive fault management strategy for centralized and distributed environments.
The proposed strategy helps in identifying the root cause of failures and in resolving the cognitive problem. Our strategy minimizes computation and communication, thus achieving higher reliability. Anycast limits the effect of Denial of Service/Distributed Denial of Service (DDoS) attacks nearest to the source of the attack, thus achieving better security. Significant performance improvement is achieved by using Anycast before HPSO, and the selection of more reliable nodes results in less checkpointing overhead.

I. INTRODUCTION

Grid computing is a type of parallel and distributed system that consists of resources having heterogeneous architectures, being geographically distributed, and interconnected via unreliable network media. It enables the sharing of geographically distributed autonomous resources owned and managed dynamically by multiple organizations. The computational grid, popular for constructing large-scale meta-computing, provides dependable, consistent, pervasive and inexpensive access to high performance computational capabilities. These grids are well suited for long-running applications that require a long commitment of grid resources. For example, a job that takes weeks of execution on a single system can be executed on a computational grid in minutes, depending upon how many computational resources are available on the grid.
Grid middleware is a software suite that is deployed on each participating machine to enable it to participate in a grid environment. Each middleware provides some essential functionality in order to execute user jobs successfully.
Due to the dynamic, heterogeneous and geographically distributed nature of the grid, a user job is always prone to different kinds of errors, failures and faults [1]. The grid environment poses significant challenges due to the diverse failures encountered during job execution. A survey of grid users on fault tolerance has revealed how difficult it is to run applications on grid environments susceptible to a wide range of failures [4]. Each middleware should therefore have a fault tolerance mechanism in order to execute user jobs reliably.
Fault tolerance is the preservation of the delivery of expected services despite the presence of fault-caused errors within the system itself. Errors are detected and corrected, and permanent faults are located and removed, while the system continues to deliver acceptable service [2]. As resource failure is the rule rather than the exception in the grid [3], fault tolerance mechanisms, both proactive and reactive, must be an essential part of each grid middleware. Most fault tolerance techniques developed for grid computing are reactive in nature; for example, replication and checkpointing are used in the grid, but they are only able to deal with crash failures.
Grid middleware have either ignored failure issues or implemented solutions on an ad hoc basis. Most fault detection and recovery techniques built so far are application dependent, resulting in problems like overwhelming a single layer, i.e. the cognitive problem [4], and increased inter-layer communication. In the traditional layered grid architecture [8], cross layer communication is not possible, i.e. an intermediate layer cannot communicate with both its upper and lower layers. The concept proposed in [1] multicasts the address of the execute machine in order to select a backup, which generates a huge amount of network traffic. This increases network communication, which leads to increased unreliability, reduced safety and network delays.
A grid resource broker performs resource discovery, scheduling, and processing of application jobs on distributed grid resources. It uses a grid information service that maintains the status of grid resources. An artificial-intelligence-based metaheuristic algorithm, Hybrid Particle Swarm Optimization (HPSO), for task allocation has already been proposed in [9]. Evolutionary algorithms like HPSO perform two steps, i.e. exploration and exploitation, and can take more time on large problems.
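The exploration/exploitation loop that HPSO builds on can be illustrated with a minimal particle swarm sketch. This is a plain PSO, not the hybrid algorithm of [9]; the objective function, bounds and parameter values are placeholders of our own choosing:

```python
import random

def pso_maximize(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO: velocity updates explore the solution space,
    attraction to personal/global bests exploits good regions."""
    pos = [[random.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best position seen per particle
    pbest_val = [f(p) for p in pos]
    g = max(range(n_particles), key=pbest_val.__getitem__)
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position seen overall
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Toy "reliability" surface peaking at 0.6 in every dimension.
random.seed(1)
reliability = lambda x: -sum((xi - 0.6) ** 2 for xi in x)
best, best_val = pso_maximize(reliability, dim=3)
```

The larger the solution space, the more iterations this loop needs to converge, which is exactly the cost the paper proposes to cut by shrinking the candidate set with Anycast first.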

1-4244-1494-6/07/$25.00 © 2007 IEEE

Authorized licensed use limited to: Dharmsinh Desai University. Downloaded on August 18,2010 at 07:11:05 UTC from IEEE Xplore. Restrictions apply.
We discuss the proposed scheme with respect to the Globus toolkit [19]. A noticeable flaw in Globus is its lack of support for fault tolerance: it does not provide checkpointing for saving long-running computation. Although it uses resource brokers like Condor [20], Condor's checkpointing feature is not supported in Globus [1].
We have proposed that Anycast should be used instead of multicast, since in Anycast data is routed to the nearest or best destination as seen by the network topology or the routing protocol's measure of distance. Using HPSO after Anycast not only minimizes the reliability calculation time but, even in the worst case, calculates the reliability within a reasonable time. The proposed strategy manages faults in both a proactive and a reactive manner and can work in both centralized and distributed environments. We believe that the proposed solution results in increased reliability and safety.
The rest of the paper is organized as follows. In Section II, existing fault detection and tolerance techniques are analyzed. In Section III, the proposed system is discussed, consisting of a fault tolerant layered grid architecture and a cross-layered grid design using Anycast and HPSO. In Section IV, the proposed architecture for fault tolerance in centralized and distributed environments is discussed. Section V presents the evaluation of the proposed solution. In Sections VI and VII we conclude and discuss future work.

II. EXISTING FAULT TOLERANCE TECHNIQUES

Most of the existing fault management mechanisms are reactive, incomplete and application dependent. For example, if a job execution machine fails during execution, the job is submitted to another machine from the start. This technique is known as Retry. We cannot afford such techniques for compute-intensive jobs that require huge computational resources. By overburdening a single layer, communication among layers is increased. Another problem is the cognitive problem: it becomes very difficult to detect, identify, isolate and recover from failures. An extension to the classification [5, 6, 7] of errors, failures and faults, with their expected occurrence at the appropriate grid layer, has already been presented in [11].
A proactive approach to job scheduling in the computational grid has been proposed in [10]. It schedules jobs based on history about grid resources maintained in the grid information service.
Another agent-oriented proactive fault tolerant framework has been proposed in [6], where agents deal with individual faults proactively. Agents maintain information about hardware conditions, executing-process memory consumption, available resources, network conditions and component mean time to failure.
To our knowledge, none of these techniques or frameworks deals with failures in both a proactive and a reactive way, and the available solutions do not apply fault tolerance at the corresponding grid layer. They just follow the traditional layered approach, in which a layer can only communicate with its upper or lower layer.
Some of the existing fault detection and tolerance techniques used in various middleware are summarized in Table I. It clearly shows that each middleware uses its own technique for fault detection and tolerance.

TABLE I. FAULT DETECTION AND TOLERANCE TECHNIQUES USED IN PARALLEL AND GRID COMPUTING

System     Fault Detection              Fault Tolerance
Globus     Heart beat monitor           Retry
Legion     Pinging and timeout          Checkpoint-recovery
Condor-G   Polling machine              Condor checkpoint-recovery
NetSolve   Generic heart beat monitor   Retry on another available machine
Mentat     Polling                      Replication
CoG Kits   N/A                          N/A

Software fault tolerance is provided by software diversity. Diversity can be introduced into software systems by constructing diverse replicas that solve the same problem in different ways (different algorithms, different programming languages, etc.) [4].
A traditional layered grid design is depicted in Fig. 1. It shows that a layer cannot communicate with its upper and lower layers simultaneously.

Fig. 1. Traditional layered grid design

It becomes difficult to implement a fault tolerant layered grid design that requires fault tolerance techniques at each corresponding layer.
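The detection-plus-recovery pairs surveyed in Table I can be illustrated with a small sketch combining heartbeat-timeout detection with the Retry policy. The class and parameter names are our own illustration; real middleware such as the Globus Heartbeat Monitor is considerably more elaborate:

```python
import time

class HeartbeatMonitor:
    """Declares a node failed when no heartbeat arrives within `timeout` seconds."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = {}

    def beat(self, node, now=None):
        self.last_beat[node] = time.time() if now is None else now

    def failed(self, node, now=None):
        now = time.time() if now is None else now
        return now - self.last_beat.get(node, float("-inf")) > self.timeout

def retry(job, nodes, monitor, now=None):
    """Retry policy: (re)submit the job to the first node whose heartbeat is fresh."""
    for node in nodes:
        if not monitor.failed(node, now):
            return node
    raise RuntimeError("no live node available for " + job)

monitor = HeartbeatMonitor(timeout=5.0)
monitor.beat("10.0.0.2", now=0.0)   # stale by t = 10
monitor.beat("10.0.0.3", now=8.0)   # still fresh at t = 10
chosen = retry("job-1", ["10.0.0.2", "10.0.0.3"], monitor, now=10.0)
```

Note that the retried job restarts from scratch on the new machine, which is exactly the cost the text calls unaffordable for compute-intensive jobs.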

III. PROPOSED SOLUTION

We believe that fault management should be deployed on each participating machine in the computational grid. In this way a single point of failure can be avoided. Our proposed concept consists of the following components.

A. Fault Tolerant Layered Grid Architecture
We believe that all fault detection and tolerance mechanisms should be implemented at each corresponding grid layer. One way is to implement them through agents/components for each grid resource. An agent keeps monitoring and updating the reliability of its particular resource at the corresponding grid layer. A combined reliability factor for a particular machine can then be calculated from the reliability factors of its layers. Furthermore, the checkpointing intensity overhead can be minimized based on system reliability. This scheme can only work as proactive fault management.

B. Cross Layer Design
The idea of cross layer design in the grid with respect to Quality of Service (QoS) has already been proposed in [12]. Fig. 2 shows a cross layer design in the computational grid where each layer can communicate with the layers both below and above it, unlike the traditional layered approach. Cross layer design in the grid provides flexibility in implementing different mechanisms.

Fig. 2. Cross layered design in computational grid

C. Anycast
Anycast is a network addressing and routing scheme whereby data is routed to the "nearest" or "best" destination [13] based on the following:
> as viewed by the routing topology;
> according to the routing protocol's measure of distance.
An extension of Anycast has been proposed in [18], which delivers a message to k nodes of a group. Applications of Anycast were largely unexplored in the past, but recent work [14] shows it is emerging as a new communication paradigm. Anycast has two important characteristics from the fault tolerance and safety point of view:
1. Reliability. Anycast applications typically feature external "heartbeat" monitoring that helps to avoid black holes, providing automatic failover.
2. Property of sinking DDoS attacks. Anycast can limit the scope of Denial of Service/Distributed Denial of Service attacks nearest to the point of origination.
Deploying Anycast does not require any special software/firmware; it just leverages existing infrastructure.

D. Hybrid Particle Swarm Optimization (HPSO)
A metaheuristic evolutionary algorithm for maximizing the reliability of distributed systems has been proposed in [9]. It finds near-optimal resources for task execution based on their reliability, which is calculated from certain factors. Like all modern metaheuristic algorithms, it performs two steps, i.e. exploration and exploitation. Looking for new candidate solutions in the solution space is called exploration, while refining good specific solutions towards local optima is called exploitation. It continuously explores the solution space and in the worst case might run for a long time, which is not affordable.

IV. ARCHITECTURE OVERVIEW

This section presents the architectural view of the proposed system with respect to the cross-layered grid design. Fig. 3 shows our proposed architecture with respect to Globus.

Fig. 3. Anycast & HPSO in cross layered grid design

We propose that HPSO can be incorporated at any of the three higher layers, i.e. grid core middleware, user level middleware, or application. It would be most useful to have HPSO at the core grid middleware in order to minimize cross layer communication: as the grid information service, i.e. the Monitoring and Discovery Service (MDS), has all the information about the resources in the grid, HPSO can use that information at the same layer to find better machines in terms of reliability. HPSO can be used with resource brokers like Condor at the user level middleware, but that requires cross layer communication with MDS. Similarly, it can be implemented at the application layer, which would require cross layer communication with both MDS and the user level middleware.
We also propose that Anycast be incorporated at the grid fabric layer in the cross-layered grid design. Using Anycast before HPSO will not only reduce network communication but also reduce the reliability calculation time of HPSO, because Anycast limits the solution space for HPSO, which in the worst case takes a long time. Although this may yield a local optimum, the proposed solution targets huge problem sizes. Fig. 4 shows the effect of Anycast before running HPSO.
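The intended interplay — Anycast pruning the candidate set before the reliability search runs — can be sketched as follows. The node records and the greedy stand-in for HPSO are illustrative only:

```python
def anycast_prefilter(nodes, k):
    """Anycast proxy: keep only the k nodes nearest by the routing metric,
    shrinking the solution space the reliability search must explore."""
    return sorted(nodes, key=lambda n: n["distance"])[:k]

def most_reliable(nodes):
    """Greedy stand-in for the HPSO search over the reduced space."""
    return max(nodes, key=lambda n: n["reliability"])

nodes = [
    {"name": "10.0.0.2", "distance": 1, "reliability": 0.70},
    {"name": "10.0.0.3", "distance": 2, "reliability": 0.95},
    {"name": "10.0.0.5", "distance": 3, "reliability": 0.90},
    {"name": "10.0.0.6", "distance": 9, "reliability": 0.99},
]
candidates = anycast_prefilter(nodes, k=3)  # the distant 10.0.0.6 is dropped
chosen = most_reliable(candidates)
```

Note the trade-off the text concedes: the globally most reliable node (here 10.0.0.6) is excluded by the prefilter, so the result is a local optimum reached in far less search time.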

Fig. 4. Effect of Anycast before running HPSO

The proposed concept can work in both centralized and distributed environments. Both scenarios are discussed in the next sections.

A. Centralized environment
The centralized environment has been discussed in detail in [15], together with its proposed architecture, but [15] does not describe how to select a reliable primary and backup server for sending checkpoints. Fig. 5 depicts the steps involved in selecting a near optimal server for sending checkpoints. The 'execute machine' would Anycast to select a primary server as backup. The candidate servers reply with their machine information, and HPSO finds the optimal server for sending checkpoints. The primary server performs similar steps to select a secondary backup server.

Fig. 5. Selecting optimal primary server for sending checkpoints

Fig. 6 depicts the 'execute machine' sending checkpoints to the primary server, which forwards them to the secondary backup server. In case a primary server fails, the backup becomes primary and selects its own backup following similar steps.

Fig. 6. Selecting optimal primary server for sending checkpoints

B. Distributed environment
Distributed fault management for the computational grid has been proposed in [1], but the question of how to select an optimal or near optimal node as the execute or backup machine remains unanswered. The concept proposed there was that the submit machine would multicast the address of the execute machine to request a backup, and all candidate backups would reply to the execute machine with their consent to serve as backup. Two problems are identified with this: first, increased communication; second, a DDoS attack can affect network performance severely.

Our proposed approach is depicted in Fig. 7. To select machines for job execution, the submit machine would Anycast. The nearest candidate machines would reply with their 'machine information', and the submit machine would run HPSO on the information received. In this way, near optimal machines in terms of reliability would be selected for task execution. Another benefit is that constraints with limited resources would also be satisfied.

Fig. 7. Selecting execute machine for task execution

The 'execute machine' would perform similar steps to select a backup machine. Fig. 8 presents this scenario.

Fig. 8. Selecting backup machine for sending checkpoints
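The primary/secondary checkpoint chain of Figs. 5-8 can be sketched as follows. The selection callback stands in for the Anycast-plus-HPSO step, and the addresses mirror the figures but are purely illustrative:

```python
class CheckpointChain:
    """Execute machine sends checkpoints to a primary backup, which forwards
    them to a secondary; on primary failure the secondary is promoted and a
    replacement secondary is selected."""
    def __init__(self, primary, secondary, select_backup):
        self.primary, self.secondary = primary, secondary
        self.select_backup = select_backup  # stand-in for Anycast + HPSO selection
        self.store = {}                     # server -> {task: checkpointed state}

    def checkpoint(self, task, state):
        # Replicate the checkpoint on both the primary and the secondary.
        for server in (self.primary, self.secondary):
            self.store.setdefault(server, {})[task] = state

    def on_primary_failure(self):
        # Promote the secondary and select a fresh secondary backup.
        self.primary = self.secondary
        self.secondary = self.select_backup()

    def recover(self, task):
        return self.store[self.primary][task]

chain = CheckpointChain("10.0.0.2", "10.0.0.3", select_backup=lambda: "10.0.0.5")
chain.checkpoint("task-1", {"progress": 0.4})
chain.on_primary_failure()       # 10.0.0.3 promoted, 10.0.0.5 selected
state = chain.recover("task-1")  # the state survives on the promoted primary
```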

In case of 'execute machine' failure, the backup machine would take over execution from the most recent checkpoint and would select its own backup following similar steps.

V. EVALUATION

A. Reliability Calculation Time
Using the Stirling formula for interpolation [21], we can estimate the time required to calculate system reliability through HPSO from [9]:
Population size of the entire solution space = N
Population size of the localized solution space = k
Reliability calculation time required for 1 node = f(k), where k = 1
Reliability calculation time required for k nodes = f(k)
Reliability calculation time required for N nodes = f(k), where k = N
So, ∀n ∈ N, P(n); ∃g: g ⊂ N, and

f(k) = y_k = y_0 + u(Δy_0 + Δy_{-1})/2 + (u²/2!)Δ²y_{-1} + (u(u² − 1)/3!)(Δ³y_{-1} + Δ³y_{-2})/2 + …

where u = (x − x_0)/h, x is the period of interpolation of any node, and h is the node difference.

B. Performance Comparison of Multicast and Anycast
Total number of nodes = N
Group of nodes in N for multicast = g
Group of nodes in g that are nearest (Anycast) = k
The comparative analysis is based on distance and delay. With N the total number of nodes:
∀n ∈ N, P(n)
∃g: g ⊂ N, where g is a group in N, i.e. g_i, where i = 1, …, n
Avg(g) = (Σ g_i)/n, where each g_i is taken as g_i if g_i > 0 and as −g_i if g_i < 0
∃k: k ⊂ g ⊂ N, with delay factor |k| < |g| < |N|
For k: k_1 = d_1, k_2 = d_2, …, k_n = d_n, and min{d_1, d_2, …, d_n} = k
Avg(g) > k, and since k < Avg(g) < N, the delay factor of Anycast is smaller.

The more nodes there are, the higher the reliability calculation time of HPSO. The results from the mathematical model presented in Section V-A and [9] show the relationship presented in Fig. 9.

Fig. 9. No. of nodes vs reliability calculation time (sec)

As the process of checkpointing is itself an overhead, we need to minimize it to make the system efficient. The more reliable the system is, the less checkpointing overhead it needs. Table II shows that a system having 95% reliability requires checkpointing only after 50% of task execution, whereas a system having 55% reliability requires a checkpoint after every 10% of task execution.

TABLE II. RELIABILITY VS CHECKPOINTING INTENSITY REQUIRED

Reliability   Checkpointing intensity required
55%           10%
65%           20%
75%           30%
85%           40%
95%           50%

The lower the checkpointing intensity, the more time is required for checkpointing. Table III shows the relationship between checkpoint intensity and the time required to save checkpoints: a system with 10% checkpoint intensity takes 50 seconds for checkpointing, whereas a system with 50% checkpoint intensity takes only 10 seconds.

TABLE III. CHECKPOINTING INTENSITY VS TIME REQUIRED FOR CHECKPOINTING

Checkpointing intensity   Time required for checkpointing (seconds)
10%                       50
20%                       25
30%                       16.65
40%                       12.5
50%                       10
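Tables II and III are consistent with simple closed forms: a linear fit of intensity against reliability, and an inverse fit of time against intensity. The fits below are our own reading of the tables, not formulas stated in the paper:

```python
def checkpoint_intensity_pct(reliability_pct):
    """Table II pattern: intensity rises 10 points per 10 points of
    reliability, e.g. 55% -> 10%, 95% -> 50% (linear fit, our assumption)."""
    return reliability_pct - 45

def checkpoint_time_sec(intensity_pct):
    """Table III pattern: checkpointing time is inversely proportional to
    intensity, roughly 500 / intensity, e.g. 10% -> 50 s (our assumption)."""
    return 500.0 / intensity_pct

row_95 = checkpoint_intensity_pct(95)   # intensity required at 95% reliability
time_10 = checkpoint_time_sec(10)       # time spent at 10% intensity
```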
VI. FUTURE WORK

We have proposed a fault tolerant architecture with respect to Condor and Globus only. A detailed empirical evaluation is

required for other grid middleware, although it theoretically looks feasible.
We also plan to implement the proposed concept. The implementation of the proposed system consists of three parts: communication, HPSO and checkpointing. Communication uses Anycast; details of deploying IP Anycast have been presented in [16, 17]. We prefer to use kernel level checkpointing so that taking a process snapshot can be application independent. An algorithm for implementing HPSO has been described in detail in [9]; it can be implemented at the core grid middleware in the open source case, at the user level middleware, or embedded in the grid application.

VII. CONCLUSION

In this paper we have proposed that fault management mechanisms should be implemented at the appropriate grid layer. We have proposed a cross-layered grid design for fault management, and we also propose that Anycast should be used instead of multicast.
The proposed mechanism has the ability to handle diverse failures both proactively and reactively. It is proactive in the sense that it finds near optimal machines for job execution in terms of reliability. It behaves reactively in that, if a machine, server or link fails, job execution continues. The proposed mechanism can work in both centralized and distributed environments.
The proposed approach provides increased reliability by reducing communication and computation. Increased safety is achieved through Anycast, which limits the effect of DDoS attacks nearest to the source of the attack. Significant performance improvement is achieved by using Anycast before HPSO. The proposed architecture is also helpful in identifying the root cause of problems.

ACKNOWLEDGEMENTS

We are thankful to all of our friends and colleagues, including Mr. Imran Baig, Mr. Saeed Ullah, Mr. Muhammad Affaan and Mr. Muhammad Aqeel, who have been very helpful in giving their comments and valuable suggestions for the improvement of the proposed architecture. We are especially thankful to Mr. Nadeem Talib for helping in developing the mathematical models.

REFERENCES

[1] M. Affan and M. A. Ansari, "Distributed Fault Management for Computational Grids", Proc. of the Fifth International Conference on Grid and Cooperative Computing (GCC 2006), Changsha, Hunan, China, 2006, pp. 363-368.
[2] A. Avizienis, "The N-Version Approach to Fault-Tolerant Software", IEEE Transactions on Software Engineering, vol. 11, no. 12, 1985, pp. 1491-1501.
[3] M. Baker, R. Buyya and D. Laforenza, "Grids and Grid Technologies for Wide-Area Distributed Computing", Software - Practice and Experience (SPE), vol. 32, issue 15, 2002, pp. 1437-1466.
[4] R. Medeiros, W. Cirne, F. Brasileiro and J. Sauve, "Faults in Grids: Why are They so Bad and What Can Be Done about It?", Proc. of the 4th International Workshop on Grid Computing, Phoenix, Arizona, USA, 2003, pp. 18-24.
[5] Y. Derbal, "A New Fault-Tolerance Framework for Grid Computing", Journal of Multiagent and Grid Systems, vol. 2, no. 2, 2006, pp. 115-133.
[6] M. T. Huda, H. W. Schmidt and I. D. Peake, "An Agent Oriented Fault-tolerant Framework for Grid Computing", Proc. of the 1st International Conference on e-Science and Grid Computing (e-Science'05), Melbourne, Australia, 2005, pp. 304-311.
[7] K. Vaidyanathan and K. S. Trivedi, "Extended Classification of Software Faults based on Aging", Proc. of the 12th International Symposium on Software Reliability Engineering, Hong Kong, 2001, p. 99.
[8] P. Asadzadeh, R. Buyya, C. L. Kei, D. Nayyar and S. Venugopal, "Global Grids and Software Toolkits: A Study of Four Grid Middleware Technologies", High Performance Computing: Paradigm and Infrastructure, Wiley Press, USA, June 2005.
[9] P. Y. Yin, S. S. Yu, P. P. Wang and Y. T. Wang, "Task allocation for maximizing reliability of a distributed system using hybrid particle swarm optimization", Journal of Systems and Software, vol. 80, issue 5, 2007, pp. 724-735.
[10] B. Nazir and T. Khan, "Fault Tolerant Job Scheduling in Computational Grid", Proc. of the IEEE 2nd International Conference on Emerging Technologies (ICET 2006), Peshawar, Pakistan, 2006, pp. 708-713.
[11] S. Haider, M. Imran, I. A. Niaz, S. Ullah and M. A. Ansari, "Component based Proactive Fault Tolerant Scheduling in Computational Grid", Proc. of the IEEE 3rd International Conference on Emerging Technologies (ICET 2007), Rawalpindi, Pakistan, 2007.
[12] L. Chunlin and L. Layuan, "Joint QoS optimization for layered computational grid", Information Sciences, vol. 177, issue 15, August 2007, pp. 3038-3059.
[13] Wikipedia, "Anycast", http://en.wikipedia.org/wiki/Anycast [15 June 2007].
[14] M. Szymaniak, G. Pierre, M. S. Nikolova and M. van Steen, "Enabling Service Adaptability with Versatile Anycast", Concurrency and Computation: Practice and Experience, vol. 19, issue 13, 2007, pp. 1837-1863.
[15] N. Hussain, M. A. Ansari, M. M. Yasin, A. Rauf and S. Haider, "Fault Tolerance using Parallel Shadow Image Servers (PSIS) in Grid Based Computing Environment", Proc. of the IEEE 2nd International Conference on Emerging Technologies (ICET 2006), Peshawar, Pakistan, 2006, pp. 703-707.
[16] M. Oe and S. Yamaguchi, "Implementation and Evaluation of IPv6 Anycast", Proc. of INET 2000, Yokohama, Japan, 2000, http://www.isoc.org/inet2000/cdproceedings/
[17] K. Miller, "Deploying IP Anycast", http://www.net.cmu.edu/pres/anycast
[18] B. Wu and J. Wu, "K-Anycast Routing Schemes for Mobile Ad Hoc Networks", Proc. of the IEEE 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Rhodes Island, Greece, 2006.
[19] Globus Alliance, "Globus Toolkit", http://www.globus.org
[20] M. Litzkow, M. Livny and M. W. Mutka, "Condor - A Hunter of Idle Workstations", Proc. of the IEEE 8th International Conference on Distributed Computing Systems, San Jose, CA, USA, 1988, pp. 104-111.
[21] Springer, "Stirling Interpolation Formula", Encyclopedia of Mathematics, http://eom.springer.de/s/s087840.htm

