Abstract— Hadoop is a framework for processing large amounts of data in parallel with the help of the Hadoop Distributed File System (HDFS) and the MapReduce framework. Job scheduling is an important process in Hadoop MapReduce. Hadoop comes with three schedulers, namely the FIFO, Fair, and Capacity Schedulers, and the scheduler is now a pluggable component of the Hadoop MapReduce framework. When jobs depend on an external service such as a database or a web service, overloading that service may lead to task failures, and Hadoop then needs to re-run the failed tasks in other slots. To address this issue, TaskTracker aware scheduling is introduced. This scheduler lets users configure a maximum load per TaskTracker in the job configuration itself; the algorithm will not let a task run and fail once the load of a TaskTracker has reached the threshold configured for the job. The scheduler also allows users to select the TaskTrackers for each job in the job configuration.

Keywords— Hadoop, HDFS, BigData, MapReduce, JobTracker, TaskTracker, Scheduler
I. INTRODUCTION

Apache Hadoop is a software framework for processing BigData, such as data in the range of petabytes [5]. The framework was originally developed by Doug Cutting, the creator of Apache Lucene, as part of his web search engine Apache Nutch. Hadoop leverages the power of distributed computing with the help of the Google MapReduce framework and the Hadoop Distributed File System (HDFS) [6][7].
MapReduce is the framework for processing large volumes of data as key-value pairs [4]. MapReduce divides each job into two types of functions, map and reduce. Both map and reduce functions take key-value pairs as input and emit their results as another set of key-value pairs. Each job is divided into a number of map tasks and reduce tasks: the input is first processed by distributed map tasks, and the results are then aggregated by reduce tasks [10].
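To make the key-value contract concrete, the sketch below shows a minimal word-count mapper and reducer written against the Hadoop 1.x (org.apache.hadoop.mapred) API. The class names are illustrative and each class would normally live in its own source file; this is not code from the paper.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map: consumes (byte offset, line) pairs and emits (word, 1) pairs.
public class WordCountMap extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      out.collect(word, ONE);       // emit one key-value pair per word
    }
  }
}

// Reduce: aggregates all counts emitted for one word into a total.
class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text word, Iterator<IntWritable> counts,
                     OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (counts.hasNext()) {
      sum += counts.next().get();
    }
    out.collect(word, new IntWritable(sum));  // emit (word, total)
  }
}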
In conventional programming, the data is copied to the location where the job runs. In Hadoop MapReduce it is the other way around: the job is copied to the location where the data resides. To achieve this data locality, Hadoop uses HDFS as its primary data storage [6][7]. HDFS is designed for storing very large files with streaming access on commodity hardware. The data is stored as blocks, each an independent unit with a default size of 64 MB, and the blocks are replicated to tolerate hardware failures; the default replication factor is 3.
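Both defaults can also be overridden per file through the HDFS FileSystem API, which accepts a replication factor and a block size when a file is created. The following minimal sketch assumes a hypothetical path /data/sample.txt:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up *-site.xml settings
    FileSystem fs = FileSystem.get(conf);
    // create(path, overwrite, bufferSize, replication, blockSize):
    // 3 replicas and 64 MB blocks, matching the defaults described above.
    FSDataOutputStream out = fs.create(new Path("/data/sample.txt"),
        true, 4096, (short) 3, 64L * 1024 * 1024);
    out.writeBytes("hello hdfs\n");
    out.close();
  }
}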
Hadoop clusters are built from a large number of commodity machines to solve BigData problems. A Hadoop cluster offers a large pool of resources to multiple users in a shared manner, so a good scheduler is required to share the resources fairly between users. The default scheduler in Hadoop MapReduce version 1 is the FIFO scheduler [1][3], which runs jobs in the order in which they are submitted. Two more schedulers ship with Hadoop MapReduce: the Fair Scheduler [8] and the Capacity Scheduler [9]. These schedulers fall short in some use cases; this paper exposes some of those use cases and presents a modified scheduler that handles them.

II. LITERATURE SURVEY

The main advantage of Hadoop MapReduce is that the data is distributed efficiently in HDFS and the program runs on the local data wherever possible [6][7], so data movement between the nodes of the cluster is reduced and performance improves. Efficient processing of BigData with Hadoop therefore requires the data to be present in HDFS. But in almost 90% of real-world use cases, the data is generated by legacy systems or applications and must be moved into HDFS before the analytics can be performed.

The existing solution for copying remote data to HDFS consists of two stages. The first stage copies the data to a Hadoop edge node using a file transfer protocol such as FTP or SFTP. In the second stage, the data is moved into HDFS using the Hadoop copyFromLocal or put command utility. Another solution is to copy the data directly from the remote machine to HDFS using the Hadoop file system API. In both methods, only a single file channel is opened from the source file system to HDFS. The performance of file copying can be improved by opening multiple channels from the source to HDFS, for example by turning the second approach into a MapReduce program [6][7].
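As a minimal sketch of the second approach, the copy can be driven through the FileSystem API; the local and HDFS paths below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Single-channel copy, equivalent to: hadoop fs -copyFromLocal src dst
    fs.copyFromLocalFile(new Path("/staging/input.log"),
                         new Path("/data/input.log"));
    fs.close();
  }
}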
For getting the maximum throughput from the MapReduce approach [10], the application needs to get control of the …
The JobInProgress object exposes two methods, obtainNewMapTask and obtainNewReduceTask, to launch a task of either type. Both methods may either return a Task object or null if the job does not wish to launch a task.

The TaskTracker Aware Scheduler overrides the assignTasks method with an algorithm called TaskTracker Aware Scheduling to control the TaskTracker selection. The implementation introduces two configuration properties:

• mapred.tascheduler.tasks.max – the maximum number of tasks of a single job that can run on one TaskTracker. The value is an integer; a value <= 0 is treated as all available slots.
• mapred.tascheduler.hosts – the hostnames of the TaskTrackers on which the job should run, given as a comma-separated string. The string "ALL" is treated as selecting all available TaskTrackers.
Users need to configure these two properties in the job Configuration object. A Java class, JobSchedulableForTAS, is implemented as an extension of the Schedulable class present in the original Fair Scheduler; it reads the configuration properties from the Job object and applies the preprocessing logic at that point.

The data flow for job execution using the TaskTracker Aware Scheduler is shown in figure 2.
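A minimal sketch of a job configuration using these two properties is shown below; the hostnames are placeholders and the remaining job setup is omitted:

import org.apache.hadoop.mapred.JobConf;

public class TasConfigExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // At most one concurrent task of this job per TaskTracker.
    conf.setInt("mapred.tascheduler.tasks.max", 1);
    // Run only on these trackers; "ALL" would select every tracker.
    conf.set("mapred.tascheduler.hosts",
             "node1.example.com,node2.example.com");
    // ... set the mapper, reducer, input and output paths, then
    // submit the job with JobClient.runJob(conf).
  }
}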
Algorithm 1. TaskTracker Aware Scheduling Algorithm

• Each job j is initialized with j.wait = 0, j.priority = NORMAL
• Read and set the values of j.max_task_tasktracker and j.hosts_array from the job configuration
• A heartbeat is received from node n
• Sort the jobs by the hierarchical scheduling policy with priority
• For each job j in jobs do
  – If j.hosts_array contains n.hostname and j.max_task_tasktracker > running tasks of job j on node n then
    • If j has a node-local task T on n then
      • Return task T to n
    • Else if j.priority == Priority.HIGH then
      • Set j.wait = 0
      • Set j.priority = Priority.NORMAL
      • Return task T to node n
    • Else
      • j.wait = j.wait + W1
    • End if
  – Else if j.wait > Max_Wait_time then
    • Set j.priority = HIGH
  – Else
    • j.wait = j.wait + W1
  – End if
• End for

where W1 is the time difference between heartbeats, and hierarchical scheduling sorts the jobs so that the job with the maximum number of tasks remaining to execute comes first in the job queue.
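The sketch below restates the admission check at the core of the algorithm as self-contained Java. It is a simplified stand-in for the real scheduler code: JobInfo replaces the scheduler's per-job bookkeeping, and the node-local task preference is omitted.

import java.util.Arrays;
import java.util.List;

public class TasAdmissionSketch {
  static class JobInfo {
    List<String> hosts;        // parsed from mapred.tascheduler.hosts
    int maxTasksPerTracker;    // from mapred.tascheduler.tasks.max
    long wait = 0;             // accumulated waiting time
    boolean highPriority = false;
  }

  // Returns true if job j may launch a task on the given tracker.
  static boolean mayAssign(JobInfo j, String host, int runningOnHost,
                           long w1, long maxWaitTime) {
    boolean hostAllowed = j.hosts.contains("ALL") || j.hosts.contains(host);
    if (hostAllowed && runningOnHost < j.maxTasksPerTracker) {
      j.wait = 0;              // assignment succeeded: reset starvation state
      j.highPriority = false;
      return true;
    }
    j.wait += w1;              // W1: time between heartbeats
    if (j.wait > maxWaitTime) {
      j.highPriority = true;   // promote a starving job, as in Algorithm 1
    }
    return false;
  }

  public static void main(String[] args) {
    JobInfo j = new JobInfo();
    j.hosts = Arrays.asList("node1.example.com");
    j.maxTasksPerTracker = 1;
    // Allowed host with a free slot: the task may be assigned.
    System.out.println(mayAssign(j, "node1.example.com", 0, 3000, 60000));
    // Host not in the job's list: rejected, wait time accumulates.
    System.out.println(mayAssign(j, "node2.example.com", 0, 3000, 60000));
  }
}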
IV. IMPLEMENTATION
8. The job scheduler picks the next JobInProgress object from the queue and continues from step 7.
9. If step 8 succeeds, update the status in the JobInProgress.
10. Hand over the task object to the scheduler.
11. Hand over the task object to the JobTracker.
12. Hand over the information to the TaskTracker.
13. Execute the task in the TaskTracker and update the result in HDFS.
V. EXPERIMENTS AND RESULTS
A Hadoop cluster was set up in the lab with Apache Hadoop version 1.0.4 on machines with Intel i3 cores. Each machine is configured with 2 map slots and 2 reduce slots, as shown in figure 3. The source code of the Fair Scheduler was downloaded from the Apache Hadoop website, and the TaskTracker Aware Scheduler was implemented by adding the preprocessor module and the priority handling logic to the original Fair Scheduler. Hadoop is configured to use the TaskTracker Aware Scheduler by adding the jar file to the lib folder and setting the property mapred.jobtracker.taskScheduler to org.apache.hadoop.mapred.TaskTrackerAwareScheduler. A word count sample program with NLineInputFormat was written for testing the different use cases.

Fig. 3. Hadoop Cluster Machine List
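Based on the property name and class reported above, the corresponding mapred-site.xml entry would look like the following sketch (with the scheduler jar placed in $HADOOP_HOME/lib):

<!-- mapred-site.xml -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.TaskTrackerAwareScheduler</value>
</property>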
The following use cases and combinations were tested and the results verified (a sketch of such a submission follows the list):
1. Submit the job with a single host name, shown in figure 4.
2. Submit the job with hostnames="ALL".
3. Submit the job with max_TasksPerNode=1, shown in figure 5.
4. Submit the job with max_TasksPerNode=2.

Fig. 5. Job with max_TasksPerNode=1
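A sketch of how such a test job could be submitted is shown below. It reuses the WordCountMap and WordCountReduce classes sketched in the introduction; the paths are placeholders, and the two scheduler properties are set as in use case 3:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class TestJobRunner {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TestJobRunner.class);
    conf.setJobName("tas-wordcount");
    conf.setInputFormat(NLineInputFormat.class);  // one split per N input lines
    conf.setInt("mapred.line.input.format.linespermap", 1);
    conf.setMapperClass(WordCountMap.class);
    conf.setReducerClass(WordCountReduce.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    // Use case 3: all trackers eligible, one task per tracker at a time.
    conf.set("mapred.tascheduler.hosts", "ALL");
    conf.setInt("mapred.tascheduler.tasks.max", 1);
    FileInputFormat.setInputPaths(conf, new Path("/data/in"));
    FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
    JobClient.runJob(conf);
  }
}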
VI. CONCLUSION

A Hadoop cluster was set up in the lab with three nodes. The different types of schedulers were tested and their results verified. The TaskTracker Aware scheduling algorithm was implemented and tested successfully, and its advantages over the existing algorithms were verified. The proposed solution overcomes the limitations of the existing schedulers described here and gives users more control over job execution.

REFERENCES

[1] B. Thirumala Rao and L. S. S. Reddy, "Survey on Improved Scheduling in Hadoop MapReduce in Cloud Environments," International Journal of Computer Applications (0975-8887), vol. 34, no. 9, November 2011.
[2] Matei Zaharia, Dhruba Borthakur, and Joydeep Sen Sarma, "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling," Yahoo! Research and University of California, Berkeley, 2009.
[3] Mark Yong, Nitin Garegrat, and Shiwali Mohan, "Towards a Resource Aware Scheduler in Hadoop," December 21, 2009.
[4] Tom White, "Hadoop: The Definitive Guide," Third Edition, 2012.
[5] Apache Hadoop, http://hadoop.apache.org
[6] Hadoop Distributed File System, http://hadoop.apache.org/docs/r1.0.4/hdfs_design.html
[7] Hadoop Distributed File System, http://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_Distributed_File_System
[8] Hadoop Fair Scheduler, http://hadoop.apache.org/common/docs/r0.20.2/fair_scheduler.html
[9] Hadoop Capacity Scheduler, http://hadoop.apache.org/docs/stable/capacity_scheduler.html
[10] Hadoop MapReduce Tutorial, http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
[11] Hadoop Tutorial, http://developer.yahoo.com/hadoop/