Abstract
Hadoop is a leading open source tool that supports the realization of the Big Data revolution and is based on Google's pioneering MapReduce work in the field of ultra-large-scale data storage and processing. Instead of relying on expensive proprietary hardware, Hadoop clusters typically consist of hundreds or thousands of multi-core commodity machines. Instead of moving data to the processing nodes, Hadoop moves the code to the machines where the data reside, which is inherently more scalable. Hadoop can store a diversity of data types such as video files, structured or unstructured data, audio files, log files, and signal communication records. The capability to process a large amount of diverse data in a distributed and parallel fashion with built-in fault tolerance, using free software and cheap commodity hardware, makes a very compelling business case for the use of Hadoop as the Big Data platform of choice for most commercial and government organizations. However, making a MapReduce job that reads in and processes terabytes of data spanning tens or hundreds of machines complete in an acceptable amount of time can be challenging, as illustrated here. This paper first presents the Hadoop ecosystem in some detail and provides details of the MapReduce engine that is at Hadoop's core. The paper then discusses the various MapReduce schedulers and their impact on performance.
Keywords: Big Data, Hadoop Distributed File System, Performance, Scheduler, Hadoop Map/Reduce, YARN, Corona.
Figure 2: Hadoop MapReduce Data Flow
Figure 4: MapReduce Steps - Who does what

3.5 Shutdown Phase

MapReduce jobs are long-running batch jobs, taking anything from minutes to hours or even several days to run. Because this is a significant length of time, it is important for the user to obtain feedback on how the job is progressing. A job and each of its tasks have a status, which includes such things as the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, the values of the job's counters, and a status message or description (which may be set by user code). These statuses change over the course of the job and are communicated back to the client. When a task is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, the system can estimate the proportion of the reduce input processed. It does this by dividing the total progress into three parts, corresponding to the three phases of the reducer. Measurement of progress is somewhat nebulous, but it tells Hadoop that a task is doing something; for example, a task writing output records is making progress.

Tasks also have a set of counters that count various events as a task runs, either those built into the framework, such as the number of map output records written, or counters defined by users. If a task reports progress, it sets a flag to indicate that the status change should be sent to the TaskTracker. The flag is checked in a separate thread every few seconds, and if set it notifies the TaskTracker of the current task status. Meanwhile, the TaskTracker sends heartbeats to the JobTracker every five seconds (this is a minimum: the heartbeat interval actually depends on the size of the cluster, and for larger clusters it is longer, a source of much consternation since it can delay the time to start up a job's tasks), and the status of all the tasks being run by the TaskTracker is sent in the call. Counters are sent less frequently than every five seconds, because they can be relatively high-bandwidth. The JobTracker combines these updates to produce a global view of the status of all the jobs being run and their constituent tasks. Finally, the JobClient receives the latest status by polling the JobTracker every second. Clients can also use JobClient's getJob() method to obtain a RunningJob object instance, which contains all of the status information for the job.
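To make the counter and progress mechanics concrete, the following minimal sketch uses the classic org.apache.hadoop.mapred API. It is illustrative only: the class name ParsingMapper and the counter group and name are our own choices, not part of Hadoop.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Illustrative mapper: counts malformed records in a user-defined counter and
// reports progress through the Reporter, so the TaskTracker (and in turn the
// JobTracker) continues to see the task as live during long per-record work.
public class ParsingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  @Override
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    if (line.isEmpty()) {
      // User-defined counter, grouped under "ParsingMapper".
      reporter.incrCounter("ParsingMapper", "MALFORMED_RECORDS", 1);
      return;
    }
    output.collect(new Text(line.split("\\s+")[0]), ONE);
    // Explicitly flag progress; the status change is picked up by the
    // TaskTracker thread described above.
    reporter.progress();
  }
}
```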
When the JobTracker receives a notification that the last task of a job is complete, it changes the status of the job to successful. Then, when the JobClient polls for status, it learns that the job has completed successfully, prints a message to tell the user, and returns from the runJob() method. The JobTracker also sends an HTTP job notification if it is configured to do so; clients wishing to receive callbacks can configure this via the job.end.notification.url property. Lastly, the JobTracker cleans up its working state for the job and instructs the TaskTrackers to do the same (the intermediate output is deleted and other such cleanup tasks are performed).
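As a rough illustration of the client side of this flow (not code from the paper), the sketch below submits a job with the old JobClient API, registers a callback URL via job.end.notification.url, and polls the returned RunningJob for progress. The class name, input/output paths, and callback URL are placeholders, and the mapper is the one sketched earlier.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndWatch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitAndWatch.class);
    conf.setJobName("submit-and-watch");
    conf.setMapperClass(ParsingMapper.class);        // mapper from the earlier sketch
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Hadoop 1.x job-end callback; $jobId and $jobStatus are expanded by the framework.
    conf.set("job.end.notification.url",
             "http://example.org/jobdone?id=$jobId&status=$jobStatus");

    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);

    // Poll the JobTracker for status, much as JobClient.runJob() does internally.
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
                        job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(1000);
    }
    System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
  }
}
```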
4 MapReduce Scheduling Challenges

Hundreds of jobs (small, medium, and large) may be present on a Hadoop cluster for processing at any given time. How the map and reduce tasks of these jobs are scheduled has an impact on their completion time and consequently on the QoS requirements of these jobs. Hadoop uses a FIFO scheduler out of the box. Subsequently, two more schedulers were developed. First, Facebook developed the Fair Scheduler, which is meant to give fast response times for small jobs and reasonable finish times for production jobs. Second, the Capacity Scheduler was developed by Yahoo; this scheduler has named queues in which jobs are submitted. Queues are allocated a fraction of the total computing resource, and jobs have priorities. It is evident that no single scheduling algorithm and policy is appropriate for all kinds of job mixes; a mix or combination of scheduling algorithms may actually be better, depending on the workload characteristics.

Essentially, Hadoop is a multi-tasking software product that can process multiple data-sets for multiple jobs for many users at the same time. This means that Hadoop is also concerned with scheduling jobs in a way that makes optimum usage of the resources available on the compute cluster. Unfortunately, Hadoop started off not doing a good job of scheduling efficiently, and even today its scheduling algorithms and schedulers are not all that sophisticated.

At the inception of Hadoop, five or so years ago, the original scheduler was a first-in first-out (FIFO) scheduler woven into Hadoop's JobTracker. Even though it was simple, the implementation was inflexible and could not be tailored. After all, not all jobs have the same priority, and a higher priority job is not supposed to languish behind a low priority, long-running batch job. Around 2008, Hadoop introduced a pluggable scheduler interface which was independent of the JobTracker. The goal was to develop new schedulers that would help optimize scheduling based on particular job characteristics. Another advantage of this pluggable scheduler architecture is that greater experimentation is now possible, and specialized schedulers can be written to cater to an ever-increasing variety of Hadoop MapReduce applications.
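For illustration only, the Hadoop 1.x property names below are how a cluster operator would select one of these pluggable schedulers. In a real deployment they belong in the JobTracker's mapred-site.xml (and the Fair Scheduler's pool definitions in its allocation file), not in per-job client code; the file path and queue names here are made up.

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative only: these settings would normally live in the JobTracker's
// mapred-site.xml rather than being set programmatically.
public class SchedulerConfigSketch {

  static Configuration fairSchedulerExample() {
    Configuration conf = new Configuration();
    // Swap the default FIFO scheduler (JobQueueTaskScheduler) for the Fair Scheduler.
    conf.set("mapred.jobtracker.taskScheduler",
             "org.apache.hadoop.mapred.FairScheduler");
    // Pool definitions (minimum shares, weights) live in a separate allocation file.
    conf.set("mapred.fairscheduler.allocation.file", "/etc/hadoop/fair-scheduler.xml");
    // Per-TaskTracker slot counts, which bound how many map/reduce tasks run at once.
    conf.setInt("mapred.tasktracker.map.tasks.maximum", 2);
    conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    return conf;
  }

  static Configuration capacitySchedulerExample() {
    Configuration conf = new Configuration();
    // The Capacity Scheduler uses named queues, each allocated a share of the cluster.
    conf.set("mapred.jobtracker.taskScheduler",
             "org.apache.hadoop.mapred.CapacityTaskScheduler");
    conf.set("mapred.queue.names", "production,adhoc");
    return conf;
  }
}
```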
Before it can choose a task for the TaskTracker, the JobTracker must choose a job to select the task from. Having chosen a job, the JobTracker then chooses a task from that job. TaskTrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a TaskTracker may be able to run two map tasks and two reduce tasks simultaneously. The default scheduler fills empty map task slots before reduce task slots, so if the TaskTracker has at least one empty map task slot, the JobTracker will select a map task; otherwise, it will select a reduce task.

Based on published research [31, 3], it is very evident that a single scheduler is not adequate to obtain the best possible QoS out of a Hadoop MapReduce cluster subject to a varying workload. Also, we have seen that almost none of the schedulers consider the contention placed on the nodes due to the running of multiple tasks.

5 MapReduce Performance Challenges

Hundreds of thousands of jobs with varying demands on CPU, I/O, and network are executed on Hadoop clusters consisting of several hundred nodes. Tasks are scheduled on machines, in many cases with 16 or more cores each. Short jobs have a different completion time requirement than long jobs, and similarly production-level, high-priority jobs have a different quality of service requirement than ad hoc query type jobs. Predicting the completion time of Hadoop MapReduce jobs is an important research topic since, for large enterprises, correctly forecasting the completion time and efficiently scheduling Hadoop jobs directly affects the bottom line. A plethora of work is going on in the field of Hadoop MapReduce performance; we briefly discuss a few prominent recent efforts that are most relevant to this paper. In his 2011 technical report [16], Herodotou describes in detail a set of mathematical performance models for describing the execution of a MapReduce job on Hadoop. The models can be used to estimate the performance of MapReduce jobs as well as to find the optimal configuration settings to use when running the jobs. A set of 100 or so equations calculates the total cost of a MapReduce job based on different categories of parameters: Hadoop parameters, job profile statistics, and a set of parameters that define the I/O, CPU, and network cost of a job execution. In a 2010 paper, Kavulya et al. [18] analysed ten months' worth of Yahoo MapReduce logs and used an instance-based learning technique that exploits temporal locality to predict job completion times from historical data. Though the thrust of that paper is essentially analyzing the Yahoo log traces, it extends instance-based (nearest-neighbor) learning approaches: the algorithm first uses a distance-based approach to find a set of similar jobs in the recent past, and then generates regression models that predict the completion time of the incoming job from the Hadoop-specific input parameters listed in the set of similar jobs. In a 2010 paper [31], Verma et al. design a performance model that, for a given job (with a known profile) and its SLO (soft deadline), estimates the amount of resources required for job completion within the deadline. They also implement a Service Level Objective based scheduler that determines job ordering and the amount of resources to allocate for meeting the job deadlines.

In [3], the authors describe an analytic model they built to predict a MapReduce job's map phase completion time. The authors take into account the effect of contention at the compute nodes of a cluster and use a closed Queuing Network model [22] to estimate the completion time of a job's map phase based on the number of map tasks, the number of compute nodes, the number of slots on each node, and the total number of map slots available on the cluster.
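As a back-of-the-envelope illustration of the kind of estimate such models build on (and not the actual models of [16], [31], or [3]), the sketch below computes the classic lower and upper makespan bounds for n map tasks of known average and maximum duration running greedily on k map slots. The job profile numbers in main() are invented.

```java
// Toy estimate only: classic makespan bounds for n independent tasks of mean
// duration avg and maximum duration max, executed greedily on k slots.
public class MapPhaseEstimate {

  static double lowerBoundSeconds(int numMapTasks, int mapSlots, double avgTaskSec) {
    // Perfect packing: total work divided evenly across all slots.
    return (double) numMapTasks / mapSlots * avgTaskSec;
  }

  static double upperBoundSeconds(int numMapTasks, int mapSlots,
                                  double avgTaskSec, double maxTaskSec) {
    // Worst case for greedy assignment: all but one task packed evenly,
    // plus one straggler of maximum duration at the end.
    return (double) (numMapTasks - 1) / mapSlots * avgTaskSec + maxTaskSec;
  }

  public static void main(String[] args) {
    // Hypothetical job profile: 2,000 map tasks on a cluster exposing 400 map slots,
    // with an average map task of 45 s and a straggler of 120 s.
    System.out.printf("map phase: between %.0f s and %.0f s%n",
                      lowerBoundSeconds(2000, 400, 45.0),
                      upperBoundSeconds(2000, 400, 45.0, 120.0));
  }
}
```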
6 MapReduce - Looking Ahead

The discussion in the preceding sections was all based on current Hadoop (also known as Hadoop V1.0). In clusters consisting of thousands of nodes, having a single JobTracker manage all the TaskTrackers and the entire set of jobs has proved to be a big bottleneck. The architecture of a single JobTracker introduces inherent delays in scheduling jobs, which proved to be unacceptable for many large Hadoop users, and many times the startup time for a job is several tens of seconds [7]. The JobTracker also could not handle its combined responsibilities of cluster resource management and per-job scheduling and monitoring. Facebook's Corona addresses this by splitting those duties between a central cluster manager and lightweight per-job job trackers [28].

After the cluster manager receives resource requests from a job tracker, it pushes the resource grants back to the job tracker. Once the job tracker gets resource grants, it creates tasks and then pushes these tasks to the task trackers for running. There is no periodic heartbeat involved in this scheduling, so the scheduling latency is minimized. The cluster manager does not perform any tracking or monitoring of the job's progress, as that responsibility is left to the individual job trackers. This separation of duties allows the cluster manager to focus on making fast scheduling decisions. Job trackers track only one job each and have less code complexity. This allows Corona to manage many more jobs and achieve better cluster utilization. Figure 5 depicts a notional diagram of Corona's main components.

Figure 5: Corona's main components
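The push-based flow described above can be sketched with the deliberately simplified, hypothetical interfaces below. They are not Corona's actual API, only an illustration of the separation between a grant-issuing cluster manager and per-job trackers that push tasks without waiting for heartbeats.

```java
import java.util.List;

// Hypothetical interfaces, invented for illustration; not Corona's real API.
interface ClusterManager {
  // Returns grants (node + slot) as resources free up; no periodic heartbeat.
  List<ResourceGrant> requestResources(String jobId, int slotsWanted);
}

class ResourceGrant {
  final String nodeName;
  final int slotId;
  ResourceGrant(String nodeName, int slotId) {
    this.nodeName = nodeName;
    this.slotId = slotId;
  }
}

class PerJobTracker {
  private final ClusterManager clusterManager;
  private final String jobId;

  PerJobTracker(ClusterManager clusterManager, String jobId) {
    this.clusterManager = clusterManager;
    this.jobId = jobId;
  }

  void schedule(List<Runnable> pendingTasks) {
    // Ask for one slot per pending task; the cluster manager pushes grants back.
    List<ResourceGrant> grants = clusterManager.requestResources(jobId, pendingTasks.size());
    for (int i = 0; i < grants.size() && i < pendingTasks.size(); i++) {
      pushToTaskTracker(grants.get(i), pendingTasks.get(i));
    }
  }

  private void pushToTaskTracker(ResourceGrant grant, Runnable task) {
    // From here on, this job tracker (not the cluster manager) tracks the task.
    System.out.printf("pushing task to %s (slot %d)%n", grant.nodeName, grant.slotId);
  }
}
```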
[10] Flume, http://flume.apache.org/

[11] A. Ganapathi et al., Statistics-driven workload modeling for the Cloud, Data Engineering Workshops (ICDEW), 2010 IEEE.

[12] A. Ganapathi et al., Predicting Multiple Performance Metrics for Queries: Better Decisions Enabled by Machine Learning, Proc. International Conference on Data Engineering, 2009.

[13] Hadoop, Documentation and open source release, http://hadoop.apache.org/core/

[14] HBase, http://hbase.apache.org/

[15] Herodotos Herodotou, Fei Dong, and Shivnath Babu, No one (cluster) size fits all: automatic cluster sizing for data-intensive analytics, Proc. 2nd ACM Symposium on Cloud Computing, ACM, 2011.

[16] Herodotos Herodotou, Hadoop performance models, Cornell University Library, Report Number CS-2011-05, arXiv:1106.0940v1 [cs.DC].

[17] Hive, http://hive.apache.org/

[27] Jorda Polo et al., Resource-aware adaptive scheduling for MapReduce clusters, Middleware 2011, Springer Berlin Heidelberg, 2011, pp. 187-207.

[28] http://www.quora.com/Facebook-Engineering/How-is-Facebooks-new-replacement-for-MapReduce-Corona-different-from-MapReduce2

[29] Sqoop, http://sqoop.apache.org/

[30] http://hadoop.apache.org/docs/r0.18.3/mapred_tutorial.html

[31] Abhishek Verma, Ludmila Cherkasova, and Roy H. Campbell, ARIA: automatic resource inference and allocation for MapReduce environments, Proc. 8th ACM Intl. Conf. Autonomic Computing, ACM, 2011.

[32] E. Vianna et al., Modeling the Performance of the Hadoop Online Prototype, 23rd Intl. Symp. Computer Architecture and High Performance Computing (SBAC-PAD), 26-29 Oct. 2011, pp. 152-159.

[33] http://developer.yahoo.com/blogs/hadoop/anatomy-hadoop-o-pipeline-428.html

[34] Hailong Yang et al., MapReduce Workload Modeling with Statistical Approach, J. Grid Computing, Volume 10, Number 2 (2012).