
2013 Third International Conference on Advances in Computing and Communications

TaskTracker Aware Scheduling for Hadoop MapReduce

Jisha S Manjaly
Department of Computer Science and Engineering
Rajagiri School of Engineering and Technology
Mahatma Gandhi University
Kochi, India
jishamanjaly@yahoo.com

Varghese S Chooralil
Department of Computer Science and Engineering
Rajagiri School of Engineering and Technology
Mahatma Gandhi University
Kochi, India
varghesesc@rajagiritech.ac.in

Abstract— Hadoop is a framework for processing large amounts of data in parallel with the help of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Job scheduling is an important process in Hadoop MapReduce. Hadoop comes with three types of schedulers, namely the FIFO, Fair and Capacity schedulers, and the scheduler is now a pluggable component in the Hadoop MapReduce framework. When jobs depend on an external service such as a database or a web service, overloading that service may lead to task failures, and Hadoop then needs to re-run the failed tasks in other slots. To address this issue, TaskTracker aware scheduling is introduced. This scheduler enables users to configure a maximum load per TaskTracker in the job configuration itself. The algorithm will not allow a task to run and fail once the load on a TaskTracker reaches the threshold configured for the job. The scheduler also allows users to select the TaskTrackers per job in the job configuration.

Keywords- Hadoop, HDFS, BigData, MapReduce, JobTracker, TaskTracker, Scheduler
I. INTRODUCTION

Apache Hadoop is a software framework for processing BigData, such as data in the range of petabytes [5]. The framework was originally developed by Doug Cutting, the creator of Apache Lucene, as part of his web search engine Apache Nutch. Hadoop leverages the power of distributed computing with the help of the Google MapReduce framework and the Hadoop Distributed File System (HDFS) [6][7].
MapReduce is the framework for processing large volumes of data as key-value pairs [4]. MapReduce divides each job into two types of functions, map and reduce. Both the map and reduce functions take key-value pairs as input and emit the result as another set of key-value pairs. Each job is divided into a number of map tasks and reduce tasks. The input is initially processed by distributed map tasks, and the results are aggregated with the help of the reduce tasks [10].
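As an illustration of the map and reduce functions, a minimal word-count pair written against the classic org.apache.hadoop.mapred API is sketched below; this is essentially the standard example from the MapReduce tutorial [10], and the class names are ours.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    /** Map: emits (word, 1) for every token in an input line. */
    class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                out.collect(word, ONE);
            }
        }
    }

    /** Reduce: sums the counts emitted for each word. */
    class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text word, Iterator<IntWritable> counts,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (counts.hasNext()) {
                sum += counts.next().get();
            }
            out.collect(word, new IntWritable(sum));
        }
    }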
In conventional programming, the data is copied to the location where the job runs. In Hadoop MapReduce, by contrast, the job is copied to the location where the data resides. To achieve this data locality, Hadoop uses HDFS as its primary data storage area [6][7]. HDFS is designed for storing very large files with streaming access on commodity hardware. The data is stored as blocks, each an independent unit with a default size of 64 MB. The blocks are stored with a replication factor for handling hardware failures; the default replication factor is 3.

Hadoop uses a large number of commodity machines to create a cluster that solves BigData problems. The Hadoop cluster offers a great pool of resources to multiple users in a shared manner, and a good scheduler is required for sharing these resources fairly between users. The default scheduler in Hadoop MapReduce version 1 is the FIFO scheduler [1][3], which schedules jobs in the order in which they are submitted. Two more schedulers come with Hadoop MapReduce, the Fair Scheduler [8] and the Capacity Scheduler [9]. These schedulers fail to serve their purpose in some use cases. This paper exposes some of these use cases and presents a modified scheduler for handling them.

II. LITERATURE SURVEY

The main advantage of Hadoop MapReduce is that the data is distributed efficiently in HDFS and the program runs over the local data wherever possible [6][7], so data movement between the nodes of the cluster is reduced and performance is gained. For efficient processing of BigData using Hadoop, the data should therefore be present in HDFS. But in almost 90% of real-world use cases, the data is generated by legacy systems or applications and has to be moved into HDFS before the analytics can be performed.

The existing solution for copying remote data into HDFS consists of two stages. The first stage copies the data to a Hadoop edge node using a file transfer protocol such as ftp or sftp. In the second stage, the data is moved into HDFS using the Hadoop copyFromLocal or put command utility. Another solution is to copy the data directly from the remote machine into HDFS using the Hadoop file system API. In both methods, only a single file channel is opened from the source file system to HDFS for copying the data. The performance of file copying can be improved if multiple channels are opened from the source to HDFS, for example by turning the second approach into a map-reduce program [6][7].
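A minimal sketch of the second approach using the Hadoop FileSystem API is shown below; the paths are illustrative, and the configuration is assumed to point at the target cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml (fs.default.name)
            FileSystem hdfs = FileSystem.get(conf);

            // Single channel from the local file system into HDFS,
            // equivalent to "hadoop fs -copyFromLocal <src> <dst>"
            hdfs.copyFromLocalFile(new Path("/data/export/records.csv"),
                                   new Path("/user/hadoop/input/records.csv"));
            hdfs.close();
        }
    }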

For getting the maximum throughput from the MapReduce approach [10], the application needs control over the number of mappers executing concurrently on a cluster node; otherwise some map tasks may fail due to protocol limitations.

There are other use cases in which a MapReduce application needs to connect to an external database, either to read data as part of a big job or to write results back. Different users submit jobs to the Hadoop cluster, and there are restrictions on the number of simultaneous connections a database allows from a particular machine or user. The database engine rejects further connection requests once the connections reach that threshold, which again leads to map task failures. As per the design of the Hadoop MapReduce framework, a map task is retried 4 times before it is marked as failed, and if any task finally fails, the whole job is marked as failed. So the above scenarios may lead to the failure of jobs.

Hadoop deals with large volumes of data and most Hadoop jobs run for a long time, so it is a huge loss if a job fails after running for a long time. To avoid this, jobs need to be scheduled properly and the user needs some control over how they are scheduled.

The default scheduler shipped with the Apache Hadoop packaging is the FIFO scheduler, which schedules jobs in the order in which they are submitted [1]. The FIFO scheduler allows a job to utilize the entire cluster capacity, but it does not provide an option to control the number of map or reduce slots used by a job or user. So this scheduler is excluded from this work.

Other schedulers come along with the package of Hadoop


are Fair Scheduler and Capacity Scheduler. Both are almost
same in functionality. Fair Scheduler [8] is introduced by
Facebook and Capacity scheduler is introduced by Yahoo [11].

Fair Scheduler aims to give every user a fair share of the


resources available with cluster. Fair scheduler allocates a fair
share pool for each user. The fair share dynamically changes
when the number of users submitting the jobs changed. The
pool allocation is directly proportional to 1/N where N is the
number of users submitting jobs to the Cluster. It is also
possible to create custom pools with guaranteed minimum
number of map and reduce slots and weightage.
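In the simplest case (equal pool weights and no minimum shares), the fair share works out to the cluster capacity divided by the number of active users. For a hypothetical cluster with 40 map slots and N = 4 active users, each pool would receive 10 map slots:

    \text{fair share per pool} = \frac{\text{total slots}}{N} = \frac{40}{4} = 10 \text{ slots}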

The Fair Scheduler provides a way to control the total number of concurrent tasks per user, but it does not provide an option to set this value for a particular job. Jobs are scheduled using the hierarchical scheduling algorithm [2]. It also does not provide a configuration to fence some nodes from the execution. So we decided to modify the existing Fair Scheduler to support the desired functionalities.
III. PROPOSED SYSTEM

The following are the desired functionalities for the proposed scheduler:

• Control the number of concurrent tasks for a particular job on a TaskTracker.
• This value should be configurable via the job configuration itself, so the user can change it dynamically for different jobs.
• Fence some TaskTrackers from job execution.
• Execute different jobs on user-selected TaskTrackers.
• Support all the functionalities currently provided by the Fair Scheduler.

The proposed scheduler schedules jobs according to the current status of the TaskTrackers, and it is named accordingly. The proposed system, shown in figure 1, is divided into two components: a preprocessing module and the core scheduler module, which handles the actual scheduling. When a heartbeat is received from a TaskTracker, the TaskTracker information and the list of scheduled jobs are handed over to the preprocessor. The preprocessing module first compares the hostname of the TaskTracker against the list of TaskTrackers specified for the job. If this check succeeds, it computes the number of tasks currently running for the job on that TaskTracker.

Fig. 1. Proposed System

If the number of currently running tasks is less than the number specified in the job configuration, the job object and the TaskTracker information are handed over to the core scheduler module. The core scheduler is a modified Fair Scheduler [2] with a priority-enhanced algorithm.
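A minimal sketch of this preprocessing check is given below. The types and field names are hypothetical; the real implementation reads these values from the job configuration and works on Hadoop's job and tracker status objects.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    /** Hypothetical stand-in for the per-job settings read from the job configuration. */
    class JobLimits {
        final Set<String> hosts;          // TaskTrackers the job may run on
        final int maxTasksPerTracker;     // per-job task limit on one TaskTracker

        JobLimits(String hostsCsv, int maxTasksPerTracker) {
            this.hosts = new HashSet<String>(Arrays.asList(hostsCsv.split(",")));
            this.maxTasksPerTracker = maxTasksPerTracker;
        }
    }

    class Preprocessor {
        /**
         * A job may receive another task on a TaskTracker only if the tracker is
         * in the job's host list and the per-tracker limit has not been reached.
         */
        static boolean eligible(JobLimits job, String trackerHost, int runningOnTracker) {
            boolean hostAllowed = job.hosts.contains("ALL") || job.hosts.contains(trackerHost);
            boolean belowLimit = job.maxTasksPerTracker <= 0     // a value <= 0 means no per-job cap
                    || runningOnTracker < job.maxTasksPerTracker;
            return hostAllowed && belowLimit;
        }
    }

For example, eligible(new JobLimits("node1,node2", 2), "node3", 0) is false, because node3 is not in the job's host list.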
The TaskTracker Aware Scheduler algorithm is explained in Algorithm 1.

Algorithm 1. TaskTracker Aware Scheduling Algorithm

• Each job 'j' is initialized with j.wait = 0, j.priority = NORMAL
• Read and set the values of j.max_task_tasktracker and j.hosts_array from the job configuration
• A heartbeat is received from node 'n'
• Sort the jobs by the hierarchical scheduling policy, with priority
• For j in jobs do
  – If j.hosts_array contains n.hostname and j.max_task_tasktracker > (tasks running for job 'j' on node 'n') then
    • If 'j' has a node-local task 'T' on 'n' then
      • Return task 'T' to 'n'
    • Else if j.priority == Priority.HIGH then
      • Set j.wait = 0
      • Set j.priority = Priority.NORMAL
      • Return a task 'T' to node 'n'
    • Else
      • j.wait = j.wait + W1
    • End if
  – Else if j.wait > Max_Wait_time then
    • Set j.priority = HIGH
  – Else
    • j.wait = j.wait + W1
  – End if
• End for

where W1 is the time difference between heartbeats, and hierarchical scheduling sorts the jobs so that the job with the maximum number of tasks remaining to execute comes first in the job queue.
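The per-heartbeat loop of Algorithm 1 can be rendered in Java roughly as follows. This is a sketch with hypothetical types and stubbed task-selection helpers, not the actual implementation; the eligibility test corresponds to the preprocessing check of Section III, and the real code works on the Fair Scheduler's JobInProgress objects.

    import java.util.List;

    enum Priority { NORMAL, HIGH }

    /** Hypothetical per-job state; the real scheduler keeps this on the job object. */
    class JobState {
        long wait = 0;                       // accumulated waiting time in milliseconds
        Priority priority = Priority.NORMAL;
    }

    /** Placeholder for org.apache.hadoop.mapred.Task. */
    class Task { }

    abstract class TaskTrackerAwareLoop {
        static final long MAX_WAIT_TIME = 60_000;   // hypothetical Max_Wait_time threshold

        /**
         * Called once per heartbeat from TaskTracker 'tracker'.
         * 'jobs' is already sorted by the hierarchical policy, HIGH priority first.
         * 'w1' is the time between heartbeats (W1 in Algorithm 1).
         */
        Task assignOne(List<JobState> jobs, String tracker, long w1) {
            for (JobState j : jobs) {
                if (eligible(j, tracker)) {                 // host list and per-tracker limit hold
                    Task local = obtainNodeLocalTask(j, tracker);
                    if (local != null) {
                        return local;                       // prefer a data-local task
                    } else if (j.priority == Priority.HIGH) {
                        j.wait = 0;                         // waited long enough: run it anyway
                        j.priority = Priority.NORMAL;
                        return obtainAnyTask(j, tracker);   // non-local task
                    } else {
                        j.wait += w1;
                    }
                } else if (j.wait > MAX_WAIT_TIME) {
                    j.priority = Priority.HIGH;             // promote so the job is picked soon
                } else {
                    j.wait += w1;
                }
            }
            return null;                                    // nothing to launch on this heartbeat
        }

        abstract boolean eligible(JobState j, String tracker);
        abstract Task obtainNodeLocalTask(JobState j, String tracker);
        abstract Task obtainAnyTask(JobState j, String tracker);
    }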

IV. IMPLEMENTATION

All schedulers in Hadoop, including the TaskTracker Aware Scheduler, inherit from the TaskScheduler abstract class. This class provides access to a TaskTrackerManager, an interface to the JobTracker, as well as a Configuration instance. It also requires the scheduler to implement three methods: the lifecycle methods start and terminate, and a method called assignTasks to launch tasks on a given TaskTracker. Task assignment in Hadoop is reactive: TaskTrackers periodically send heartbeats to the JobTracker with their TaskTrackerStatus, which contains a list of running tasks, the number of slots on the node, and other information. The JobTracker then calls assignTasks on the scheduler to obtain tasks to launch, and these are returned with the heartbeat response.

Selection of tasks within a job is mostly done by the JobInProgress and JobScheduler classes. JobInProgress exposes two methods, obtainNewMapTask and obtainNewReduceTask, to launch a task of either type. Both methods may return either a Task object or null if the job does not wish to launch a task.
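A rough skeleton of how such a scheduler plugs in is shown below. This is our sketch, not the authors' code; the exact method signatures of TaskScheduler changed between Hadoop releases (for example, assignTasks took a TaskTrackerStatus in 0.20.x), so the types shown here are assumptions for Hadoop 1.0.x.

    // Sketch only: assumes the Hadoop 1.0.x TaskScheduler API.
    package org.apache.hadoop.mapred;   // TaskScheduler is package-private, so schedulers live here

    import java.io.IOException;
    import java.util.Collection;
    import java.util.List;

    import org.apache.hadoop.mapreduce.server.jobtracker.TaskTracker;

    public class TaskTrackerAwareScheduler extends TaskScheduler {

        @Override
        public void start() throws IOException {
            // initialize job listeners, pools and configuration (omitted)
        }

        @Override
        public void terminate() throws IOException {
            // release resources (omitted)
        }

        @Override
        public List<Task> assignTasks(TaskTracker tracker) throws IOException {
            // called on every heartbeat: decide which tasks, if any,
            // this TaskTracker should launch and return them
            return null;
        }

        @Override
        public Collection<JobInProgress> getJobs(String queueName) {
            return null;
        }
    }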
The TaskTracker Aware Scheduler overrides the assignTasks method with the TaskTracker Aware Scheduling algorithm in order to control the TaskTracker selection. The implementation introduces two configuration properties:

• mapred.tascheduler.tasks.max – the maximum number of tasks that can run on a TaskTracker for a single job. The value is an integer; a value <= 0 is treated as the maximum number of slots available.
• mapred.tascheduler.hosts – the hostnames of the TaskTrackers on which the job is allowed to run. The value is a comma-separated list of strings; the string "ALL" is treated as selecting all available TaskTrackers.

Users need to configure the above two properties in the job configuration object. A Java class JobSchedulableForTAS is implemented, which extends the Schedulable class present in the original Fair Scheduler. The configuration properties are read from the job object, and the preprocessing logic is applied at this point.
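For example, a job that should run at most two concurrent tasks per TaskTracker and only on two named nodes could set the properties as below; the hostnames and values are illustrative.

    import org.apache.hadoop.mapred.JobConf;

    public class ConfigureTasLimits {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setJobName("job-with-tracker-limits");

            // Properties read by the TaskTracker Aware Scheduler (values are examples)
            conf.setInt("mapred.tascheduler.tasks.max", 2);      // <= 0 means use all available slots
            conf.set("mapred.tascheduler.hosts", "node1,node2"); // or "ALL" for every TaskTracker

            // ... set mapper, reducer, input/output paths, then submit via JobClient.runJob(conf)
        }
    }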

The data flow for job execution using the TaskTracker Aware Scheduler is shown in figure 2.

Fig. 2. Data Flow Diagram

The steps are shown below:
1. Submit the Hadoop job to the JobTracker with a JobConf.
2. The JobTracker creates a JobInProgress object.
3. A heartbeat is received from a TaskTracker at the JobTracker.
4. The JobTracker passes the TaskTracker information to the TaskTracker Aware Scheduler.
5. The TaskTracker Aware Scheduler passes the TaskTracker availability and the jobs in the queue to the job scheduler.
6. The job scheduler iterates through the queued jobs and picks the corresponding JobInProgress object.
7. The JobInProgress object is handed over to the scheduling algorithm, which finds out whether the tasks present in the object match the TaskTracker configuration.
8. The task slot is handed over to the job scheduler if it matches the algorithm; otherwise null is returned. If the result is null, the job scheduler picks the next JobInProgress object from the queue and continues from step 7.
9. If step 8 succeeds, the status is updated in JobInProgress.
10. The task object is handed over to the scheduler.
11. The task object is handed over to the JobTracker.
12. The information is handed over to the TaskTracker.
13. The task is executed on the TaskTracker and the result is updated in HDFS.
V. EXPERIMENTS AND RESULTS
A Hadoop cluster is set up in the lab with Apache Hadoop version 1.0.4. The machines have Intel i3 processors, and each machine is configured with 2 map slots and 2 reduce slots, as shown in figure 3. The source code of the Fair Scheduler is downloaded from the Apache Hadoop website, and the TaskTracker Aware Scheduler is implemented by adding the preprocessor module and the priority handling logic to the original Fair Scheduler. Hadoop is configured to use the TaskTracker Aware Scheduler by adding the jar file to the lib folder and setting the property mapred.jobtracker.taskScheduler to org.apache.hadoop.mapred.TaskTrackerAwareScheduler. A word count sample program with NLineInputFormat is written for testing the different use cases.
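A sketch of how such a test job might be wired up in an old-API driver is shown below; WordCountMapper and WordCountReducer refer to the sketch in Section I, the paths and the lines-per-map value are illustrative, and the scheduler properties are set here for use case 3 below (max_TasksPerNode = 1).

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(WordCountDriver.class);
            conf.setJobName("tas-wordcount-test");

            // NLineInputFormat gives each mapper a fixed number of input lines,
            // which makes the number of map tasks easy to control in the tests
            conf.setInputFormat(NLineInputFormat.class);
            conf.setInt("mapred.line.input.format.linespermap", 1000);   // illustrative value

            conf.setMapperClass(WordCountMapper.class);
            conf.setReducerClass(WordCountReducer.class);
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            // Scheduler-specific limits under test
            conf.setInt("mapred.tascheduler.tasks.max", 1);
            conf.set("mapred.tascheduler.hosts", "ALL");

            FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/input"));
            FileOutputFormat.setOutputPath(conf, new Path("/user/hadoop/output"));
            JobClient.runJob(conf);
        }
    }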
The following use cases and combinations are tested and the results verified:
1. Submit the job with a single host name, shown in figure 4.
2. Submit the job with hostnames="ALL".
3. Submit the job with max_TasksPerNode=1, shown in figure 5.
4. Submit the job with max_TasksPerNode=2.

Fig. 3. Hadoop Cluster Machine List

Fig. 4. Job with a single host name

Fig. 5. Job with max_TasksPerNode=1

VI. CONCLUSION

A Hadoop cluster is set up in the lab with 3 nodes. Different types of schedulers are tested and the results verified. The TaskTracker Aware Scheduling algorithm is implemented and tested successfully, and its advantages over the existing algorithms are verified. The proposed solution overcomes the limitations of the existing schedulers described here and gives users more control over job execution.

REFERENCES
[1] B. Thirumala Rao and L. S. S. Reddy, "Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments," International Journal of Computer Applications (0975-8887), Volume 34, No. 9, November 2011.
[2] Matei Zaharia, Dhruba Borthakur, and Joydeep Sen Sarma, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," Yahoo! Research and University of California, Berkeley, 2009.
[3] Mark Yong, Nitin Garegrat, and Shiwali Mohan, "Towards a Resource Aware Scheduler in Hadoop," December 21, 2009.
[4] Tom White, "Hadoop: The Definitive Guide," Third Edition, 2012.
[5] Apache Hadoop, "http://hadoop.apache.org".
[6] Hadoop Distributed File System, "http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html".
[7] Hadoop Distributed File System, "http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_Distributed_File_System".
[8] Hadoop Fair Scheduler, "http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html".
[9] Hadoop Capacity Scheduler, "http://hadoop.apache.org/docs/stable/capacity_scheduler.html".
[10] Hadoop MapReduce Tutorial, "http://hadoop.apache.org/common/docs/current/mapred_tutorial.html".
[11] Hadoop Tutorial, "http://developer.yahoo.com/hadoop/".
