
ACO Algorithm-based Parallel Job Scheduling Investigation on Hadoop

1,2 Hengliang Shi*, 3 Guangyi Bai, 1 Zhenmin Tang

1 Computer School of Nanjing University of Science & Technology, shihl@noahsi.com.cn
2 Electronics and Information School of Henan University of Science & Technology
3 Noah IT Solution (Suzhou) Co., Ltd., bai@noahsi.com.cn

International Journal of Digital Content Technology and its Applications, Volume 5, Number 7, July 2011
doi: 10.4156/jdcta.vol5.issue7.35

Abstract
Researchers pursue Hadoop for its architectural advantages; however, the job scheduling shortcomings and workload imbalance of its built-in FIFO algorithm become key bottlenecks when large amounts of small-granularity jobs must be processed in a cloud computing environment. After analyzing FIFO's advantages and disadvantages, this paper proposes that ACO (Ant Colony Optimization) can dynamically improve job scheduling performance using a cost matrix, a hormone matrix and a probability matrix, and then simulates the scheduling of large amounts of jobs in a cloud computing scene. The experimental results show that the proposed algorithm handles job scheduling and workload imbalance better, while saving response time and improving throughput.

Keywords: ACO, Job Scheduling, Hadoop, Cloud Computing

1. Introduction

The pursuit of Hadoop is based on its architectural advantages, which make full use of all kinds of resources on the Internet, such as computing, network bandwidth and storage resources. Being a subproject of Apache, Hadoop has certain advantages in adopting the built-in FIFO (First In First Out) algorithm as its common scheduling method to deal with large data-intensive jobs [1, 2]. However, most of the jobs that cloud computing needs to deal with are small-granularity jobs, which leads to longer waiting time, higher resource consumption, lower flexibility and other drawbacks [3].
Under a cloud computing environment, with multi-user and large amounts of small-granularity concurrent job requests, how to properly dispatch jobs to different slave nodes to avoid underutilization and how to deal with workload imbalance are the bottlenecks that most strongly influence system performance [4].
Being the earliest SNS website to adopt Hadoop as its platform, Facebook suffers from the challenges of rapidly increasing users and large amounts of data due to its built-in FIFO job scheduling algorithm, although FIFO has advantages in coping with large-granularity jobs [4, 5]. Some research groups within Facebook considered dealing with the workload problems by building private clusters; however, this was too expensive to be justified for all kinds of multivariate, elastic applications. If these problems are not dealt with effectively, the cloud computing system will suffer lower computing efficiency, or even fail [3].
Being the next-generation Internet platform, cloud computing needs to deal with large amounts of concurrent jobs of different granularity. Clearly, a static job scheduling algorithm is not suitable for this application scene, which requires dispatching jobs to large numbers of machines asynchronously, whereas a dynamic job scheduling algorithm has good stochastic performance in this scope. Torque uses a fixed number of node machines, which does not match job complexity completely [2, 4]. And the HDFS system dispatches jobs to all node machines in the cluster, without considering the variation of the jobs requested and the available resources [5, 7]. Obviously, the two methods negatively influence resource utilization and degrade elastic computing to some degree.
With the feature of the master node machine's overall job scheduling, the Hadoop project can split the input file into many blocks, dispatch them to different slave node machines, and implement data locality to avoid large-scale data shuttling and to save much processing and I/O time. Being the overall scheduling machine, the master node machine has higher performance than common slave node machines [7, 8, 13, 15]. So the algorithm proposed in this paper is based on ignoring the
communication cost among large amounts of slave node machines.


ACO (Ant Colony Optimization) is an effective method for dealing with NP problems, with strong robustness, distributed capability, parallelism and scalability [9]. Based on Hadoop, a cloud computing platform is deployed on many common commodity PCs, which have a high failure possibility, and there is also the possibility that new node machines enter the cloud [10]. We can combine ACO's advantages with the requirements of cloud computing to realize a stable, efficient and scalable job scheduling model [11].
The built-in FIFO algorithm in Hadoop executes jobs sequentially according to the jobs' priority and arrival time [1]. However, this algorithm is unfair in scenes where a few large-granularity jobs coexist with lots of small-granularity jobs, because the execution of large-granularity jobs delays the start time of small-granularity jobs, which may need to execute before a deadline. These disadvantages produce many resource fragments, underutilization and poor flexibility [11, 12, 17]. The first solution to this problem in Hadoop was the HOD (Hadoop on Demand) project, which provides private MapReduce clusters over a large physical cluster using Torque [3, 13].
The original architecture of Hadoop has only two types of node: one is the unique master node machine, and the other is a large number of slave node machines. This type of architecture applies well to an environment with a few large-granularity data-intensive jobs; however, under lots of small-granularity CPU-intensive jobs it increases the workload, because job scheduling efficiency degrades and the master suffers from busy scheduling. So we propose to dispatch a small-granularity job to a few slave node machines, and to dispatch a large-granularity job to more slave node machines to cope with it. Logically, we establish a virtual layer over the slave node machines and under the master node machine layer; this virtual layer continually integrates and separates slave node machines according to the granularity of the incoming jobs [14, 15, 16].


Figure 1. Demo of Dynamic Job Dispatching

Fig. 1 demonstrates the dynamic job scheduling of 9 slave node machines on a cloud computing platform. When the 1st job is submitted, the master node machine dispatches the slave1, slave2 and slave3 node machines to execute it according to the job's granularity, priority and other influence factors, then dispatches the slave4, slave5 and slave6 node machines to the 2nd job, and the slave7, slave8 and slave9 node machines to the 3rd job respectively. When the 4th job (tagged as the yellow ellipse) is submitted, the master node machine dispatches slave6, slave7, slave8 and slave9 to it according to this job's granularity and other factors, implementing elastic integration and separation.

2. Problem description and computing model

The goal of job scheduling is to properly dispatch parallel jobs to slave node machines according to a scheduling policy, while meeting certain performance indexes and priority constraints, so as to shorten the total execution time, lower the computing cost and improve system efficiency.
To facilitate research on this problem, we make the following suppositions in regard to Hadoop's characteristics:

2.1. Suppose the communication cost among slave node machines under the Hadoop platform is ignored. The Hadoop architecture adopts data-locality storage, where computation occurs on the data storage node as far as possible, and it keeps two data replicas, on the same rack and on the nearest rack respectively, to avoid large amounts of data shuttle cost [2, 3, 7, 12].

2.2. Suppose all slave node machines have the same architecture under the Hadoop platform. In a heterogeneous setting, different core processor architectures would increase the complexity of the experiment and of the research model, as well as introduce hardware incompatibility [2, 3].


Figure 2. Computing Model of System

Fig. 2 demonstrates the improvement of the MapReduce model: the two hormone-updating modules of the ACO algorithm (local updating and global updating) are inserted into the master node and the slave nodes respectively. The figure also shows the user and broker front end, the job dispatcher and job recycler, the cloud computing resource content, and the monitoring of local job execution on the slave nodes.

3. Algorithm design and analysis

Under the above suppositions, a typical job scheduling problem is described as follows: n jobs need to be dispatched onto n node machines, one node machine only processes one job, and one job can only be executed by one node machine. Different dispatch plans have different execution costs and resource consumption; job scheduling is to find one plan that completes the jobs smoothly while ensuring availability, reliability and optimality.
Under the same processor performance conditions, a job's complexity is the key factor influencing its processing time: the higher the job's complexity, the more processing time is required; vice versa, a simple job needs less processing time [9]. Different client requests, such as ftp, mail, http and upload, need different processing times.
Definition 1. Cost matrix C_{n*n} = {C_ij | C_ij ∈ C_{n*n}, C_ij ≥ 0, i = 1, …, n; j = 1, …, n} stands for the processing cost of the i-th node machine completing the j-th job. The value of each element in the matrix is derived from the requested job's complexity and the processor's performance.
Definition 2. Hormone matrix T_{n*n} = {T_ij | T_ij ∈ T_{n*n}, i = 1, …, n; j = 1, …, n} stands for the value of
hormone when the j-th job is dispatched to the i-th node machine. The matrix is initialized as a constant matrix or zero, which means the values of all elements of the matrix are constant or 0 before the job scheduling.
Definition 3. Efficiency matrix V_{n*n} = {V_ij | V_ij ∈ V_{n*n}, i = 1, …, n; j = 1, …, n} stands for the efficiency value when the j-th job is dispatched to the i-th node machine. The matrix is initialized as 1/C_ij, that is to say, this matrix is inversely proportional to the cost matrix.
Definition 4. Job scheduling matrix R^k_{n*n} = {R^k_ij | R^k_ij ∈ R^k_{n*n}, i = 1, …, n; j = 1, …, n; k = 1, …, n} stands for the job scheduling plan produced by the k-th ant. The matrix is initialized as 0, and the value of each element is 1 or 0, that is, R^k_ij = 1 or 0: R^k_ij = 1 means the j-th job is dispatched to the i-th node machine, while R^k_ij = 0 means the j-th job is not dispatched to the i-th node machine.
This problem is thus transformed into how to dispatch slave node machines to complete all the submitted jobs with the minimum cost, that is, to find a job dispatch matrix R_{n*n}.
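To make Definitions 1-4 concrete, here is a minimal Python/NumPy sketch; the matrix size and the random cost values are illustrative assumptions, not values taken from the paper.

import numpy as np

n = 4                                     # assume n jobs and n slave node machines
rng = np.random.default_rng(0)
C = rng.uniform(1.0, 10.0, size=(n, n))   # cost matrix (Definition 1): C[i, j] = cost of node i running job j
T = np.ones((n, n))                       # hormone matrix (Definition 2), initialized as a constant matrix
V = 1.0 / C                               # efficiency matrix (Definition 3), element-wise inverse of the cost
R = np.zeros((n, n), dtype=int)           # one ant's scheduling matrix (Definition 4): R[i, j] = 1 if job j goes to node i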
Suppose there are n ants to complete all the submitted jobs. One trip of an ant stands for one job dispatching procedure, in which the ant needs n walks, each walk dispatching one job; the walks are tagged as s. When all the ants have completed one trip, one loop is finished; N_c stands for the number of loops. We also introduce two sets, Task = {task_1, task_2, …, task_n} and Node = {node_1, …, node_n}: Task stands for the set of jobs to be dispatched, and Node stands for the set of slave node machines to be given jobs.
Meanwhile, let us introduce an n-dimensional vector D^{Nc}_n with elements D^{Nc}_k, that is D^{Nc}_k ∈ D^{Nc}_n, standing for the k-th ant's cost during the Nc-th algorithm loop; the initial value of D^{Nc}_n is 0. Then we introduce the key probability matrix P^k_{n*n} with elements P^k_ij, that is P^k_ij ∈ P^k_{n*n}, standing for the probability of dispatching the j-th job to the i-th slave node machine. P^k_ij is related to the hormone matrix and the job efficiency matrix as follows:

    P^k_ij = (T_ij^α · V_ij^β) / Σ_{j=1}^{n} (T_ij^α · V_ij^β)

where α and β are the weighting exponents of the hormone value and the efficiency value respectively.

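As a hedged illustration of this transition rule, the following Python sketch computes the probabilities for one node over the remaining jobs; the exponents alpha and beta and their default values follow the common ACO formulation and are assumptions, not values given in the paper.

import numpy as np

def transition_probabilities(T_row, V_row, alpha=1.0, beta=2.0):
    # Weight of each remaining job j for a fixed node i: T_ij^alpha * V_ij^beta,
    # normalized so that the probabilities sum to 1.
    weights = (T_row ** alpha) * (V_row ** beta)
    return weights / weights.sum()

# Example: hormone and efficiency values of one node for three remaining jobs.
p = transition_probabilities(np.array([1.0, 1.0, 1.0]), np.array([0.5, 0.2, 0.8]))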
Definition 5. The arrival moment of job i is tagged as at(i), and the completion moment of job i is tagged as ct(i); each job has its priority, tagged as priority(i, t), meaning the priority of job i at moment t.

Definition 6. Suppose M jobs arrive during a given period, N jobs are submitted, and K jobs are rolled back to their original status because they exceeded the time limit. Some QoS indexes are defined as follows:

(1) The jobs' AET (Average Executing Time): AET = (1/N) · Σ_{i=1}^{N} (ct(i) − at(i));
(2) The jobs' WAET (Weighted Average Executing Time): WAET = Σ_{i=1}^{N} ((ct(i) − at(i)) · priority(i, ct(i))) / Σ_{i=1}^{N} priority(i, ct(i));
(3) The jobs' losing rate: ls = K / M × 100%;
(4) The slave node machines' workload ratio, which is sampled at every period.
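A small Python sketch of these QoS indexes follows; the (arrival, completion, priority) record layout is an assumed representation for illustration only.

def aet(jobs):
    # Average executing time over the N completed jobs.
    return sum(ct - at for at, ct, _ in jobs) / len(jobs)

def waet(jobs):
    # Executing time weighted by priority(i, ct(i)).
    num = sum((ct - at) * prio for at, ct, prio in jobs)
    den = sum(prio for _, _, prio in jobs)
    return num / den

def losing_rate(k_rolled_back, m_arrived):
    # Percentage of arrived jobs rolled back for exceeding the time limit.
    return 100.0 * k_rolled_back / m_arrived

# Example with three completed jobs given as (arrival, completion, priority).
jobs = [(0.0, 4.0, 2), (1.0, 3.0, 1), (2.0, 7.0, 3)]
print(aet(jobs), waet(jobs), losing_rate(1, 5))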
The algorithm is described as follows:

Initialize Nc = 1, and the Task and Node sets;
Do
    For (k = 1; k <= n; k++)
    Begin
        For (s = 1; s <= n; s++)    // one ant needs n walks, each walk dispatching one job
        Begin
            Randomly select an element Node_i from the Node set and compute the probabilities P^k_ij;
            find the maximum, tagged P^k_ijmax, and set R^k_ij = 1; then delete Node_i from the Node set
            and Task_j from the Task set; update the cost vector D^Nc_k = D^Nc_k + C_ij;
        End
    End
    For (k = 1; k <= n; k++)        // every ant updates its hormone matrix
    Begin
        Locally update T. Introduce the ants' hormone increment ΔT; the hormone
        gain is inverse to cost, ΔT = Q / Σ_{k=1}^{n} D^Nc_k; according to the k-th ant's R^k_{n*n} matrix,
        if R^k_ij = 1, set T_ij = T_ij + ΔT;
    End
    Globally update T: set the volatile parameter 0 < ρ < 1 to limit the infinite increment of T, T = T · (1 − ρ);
    Find the minimum element D^Nc_min of the cost vector D^Nc_n; if D^Nc_min < D^{Nc-1}_min, then set D_min = D^Nc_min;
    Nc = Nc + 1;
Until Nc >= Ncmax.
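The following compact Python sketch is one possible reading of the loop above, not the authors' implementation; the parameter values (Q, rho, alpha, beta, the number of loops) and the NumPy representation are assumptions. Calling aco_schedule(C) on a cost matrix such as the one sketched after Definition 4 returns a 0/1 dispatch matrix and its total cost.

import numpy as np

def aco_schedule(C, Nc_max=50, alpha=1.0, beta=2.0, rho=0.3, Q=1.0, seed=0):
    n = C.shape[0]
    V = 1.0 / C                              # efficiency matrix (Definition 3)
    T = np.ones((n, n))                      # hormone matrix (Definition 2)
    rng = np.random.default_rng(seed)
    best_cost, best_R = np.inf, None
    for _ in range(Nc_max):                  # Nc loops
        costs, plans = [], []
        for _ in range(n):                   # one trip per ant
            nodes, tasks = list(range(n)), list(range(n))
            R, cost = np.zeros((n, n), dtype=int), 0.0
            for _ in range(n):               # n walks: each walk dispatches one job
                i = nodes.pop(rng.integers(len(nodes)))    # randomly select a free node
                w = (T[i, tasks] ** alpha) * (V[i, tasks] ** beta)
                j = tasks.pop(int(np.argmax(w)))           # job with the maximum probability
                R[i, j] = 1
                cost += C[i, j]
            costs.append(cost)
            plans.append(R)
        dT = Q / sum(costs)                  # hormone increment, inverse to cost
        for R in plans:                      # local update along every ant's assignment
            T += dT * R
        T *= (1.0 - rho)                     # global update: evaporation with 0 < rho < 1
        if min(costs) < best_cost:
            best_cost = min(costs)
            best_R = plans[int(np.argmin(costs))]
    return best_R, best_cost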

4. Experiment deployment and analysis

The whole experiment is divided into two parts: the first identifies the ACO algorithm's static dispatch performance and dynamic dispatch performance; the second compares the workload balance performance and other QoS indexes of the ACO algorithm against Hadoop's built-in FIFO.
The first part of the experiment is implemented with Matlab 7.0 on the Windows XP platform. The static sub-step dispatches 20 jobs to 10 slave node machines; when the submitted jobs are completed, another 20 jobs are dispatched to the slave node machines, and then the two scheduling matrices are analyzed. The dynamic job scheduling dispatches 20 jobs to 10 slave node machines; during the processing of the submitted jobs, a second batch of 20 jobs is submitted to the master node machine to be dispatched dynamically, and the job scheduling matrix is analyzed.

Table 1. Experiment result of static scheduling

    Experiment    Sum of two loops    Sum of two execution times
    1             17                  27 ms
    2             16                  26 ms
    3             15                  25 ms

Table 2. Experiment result of dynamic scheduling

    Experiment    Loop times    Execution time
    1             8             14 ms
    2             8             14 ms
    3             7             12 ms

Because the static scheduling is divided into two phases, its total execution time is approximately twice that of dynamic scheduling, and so is its total number of loops. We therefore conclude that the ACO algorithm is better suited to continual small-granularity jobs because it saves processing time. In terms of optimization, static scheduling tends to find local optima, whereas dynamic scheduling tends to find the global optimum over the whole data scope. We can also forecast that, as the scale of cloud computing and of submitted jobs grows, the ACO algorithm is likely to achieve global optimization under long-running conditions with large amounts of submitted jobs; this experiment, after all, uses a rather short time slot.
The second part of the experiment is as follows.
Experiment platform: Ubuntu 9.10 OS, 10 Pentium IV processors and 1 high-performance master node machine. The first experiment executes dynamic scheduling of 100 jobs with the Hadoop 0.20.2 built-in FIFO algorithm; the second executes dynamic scheduling of 100 jobs with the ACO algorithm under the Eclipse platform; then the number of jobs is scaled out to compare the QoS indexes of the two algorithms, namely jobs' execution time, WAET, jobs' losing rate and workload rate.
From the left of Fig. 3, we can see that the execution times of the two algorithms are approximately equal when the number of jobs is around 300. As the number of jobs increases, the advantage of the ACO algorithm becomes evident.

From the middle of Fig. 3, we can see that a single job's WAET decreases as the number of jobs increases. We can also find that the ACO performance is better than FIFO, for its WAET is lower than FIFO's.

Figure 3. Jobs Execution Time, WAET, and Losing Rate Comparison of FIFO and ACO

From the right of Fig. 3, we test the jobs' losing rate of FIFO and ACO respectively. The FIFO line rises early and then drops, which is caused by the difference between the resource scale and the requirement scale: as soon as the waiting time of many jobs exceeds their deadlines, these jobs are lost. The ACO line, however, keeps dropping, and is steady when the number of submitted jobs reaches 1000.

Figure 4. Workload Rate of the 1#, 3# and 7# Slave Node Machines

We randomly select the 1#, 3# and 7# slave node machines as our test targets. Fig. 4 demonstrates the workload rate of the 1#, 3# and 7# slave machines respectively. Statistically, on the FIFO line we can find 3 sampling values of the 1# slave node machine above 80%, 3 points of 3# and 2 points of 7#, while no point of the ACO line exceeds 80%. Meanwhile, there is 1 sampling point of the FIFO line below 20%, and none on the ACO line.

The above facts demonstrate that the ACO algorithm is better suited to situations with large amounts of small-granularity jobs, while for a few large-granularity concurrent jobs FIFO has a certain advantage. At the same time, the ACO algorithm easily obtains a globally optimized job dispatch matrix, whereas FIFO only obtains a locally optimal job scheduling table.

5. Conclusion

This paper first analyzes the characteristics of application scenes under the cloud computing model, then identifies the shortcomings of Hadoop's built-in FIFO algorithm in dealing with large amounts of small-granularity concurrent jobs, and innovatively proposes an ACO algorithm to cope with these problems on the Hadoop platform. It compares the experimental results of static scheduling against dynamic scheduling, and also compares the QoS indexes of ACO against FIFO. From these experimental data, we can infer that ACO has clear advantages in coping with large amounts of small-granularity concurrent jobs, shortening response time, improving throughput, and easily obtaining the optimal job dispatch matrix.