Chapter 4
Introduction to MapReduce
MapReduce is a programming model suitable for processing huge volumes of data. Hadoop can run
MapReduce programs written in various languages such as Java, Ruby, Python, and C++, but Java is
used most often because Hadoop itself is built in Java. MapReduce consists of two
phases: Map and Reduce. Each node processes the data stored locally on that node.
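For example, consider counting words in the two input lines "hello world" and "hello hadoop". A
simplified trace of the job (the actual ordering of pairs may differ):

Map phase output : (hello, 1), (world, 1), (hello, 1), (hadoop, 1)
Shuffle and sort : (hadoop, [1]), (hello, [1, 1]), (world, [1])
Reduce phase output : (hadoop, 1), (hello, 2), (world, 1)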
Features of MapReduce
Specify the computation in terms of Map and Reduce functions.
Input to each phase is in key-value pairs.
Automatic fault tolerance.
Cross-platform support – MapReduce programs are usually written in Java but can be written in
other languages as well using Hadoop Streaming.
Clean abstraction for developers – MapReduce abstracts all the housekeeping away from the
developer, so the developer can concentrate on the problem domain.
Automatic parallel and distributed processing across multiple nodes.
Data locality.
i. Mapper Phase -
-- This is the very first phase in the execution of a MapReduce program.
-- Each Mapper operates on a single HDFS block.
-- The Mapper runs on the node where the block is stored.
-- The data in each split is passed to the mapping function to produce output values.
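Below is a minimal sketch of a word-count Mapper using the org.apache.hadoop.mapreduce API; the
class name WordCountMapper is illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line itself.
// Output key: a word; output value: the count 1.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for each token.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}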
Job Tracker
JobTracker is present on the master node of the Hadoop cluster.
JobTracker accepts job requests from clients.
It divides a job into multiple tasks and allocates the tasks to Task Trackers.
It responds to heartbeat messages from Task Trackers. If it misses the heartbeat 10
times in a row, i.e. for 30 secs, it considers that Task Tracker slow or dead.
It gathers the final output and informs the client of the success or failure status.
It is a single point of failure: if it dies, all the tasks are disrupted.
Task Tracker
Task Trackers are present on the slave nodes of the Hadoop cluster.
Task Trackers run the tasks assigned by the Job Tracker.
They periodically report the progress of their tasks to the Job Tracker via a heartbeat message every 3 secs.
[Figure: Hadoop cluster layout – the Master Node runs the Job Tracker; each Slave Node runs a Task Tracker alongside a Data Node.]
[Figure: Input side of a MapReduce job – Input File → Input Format → Input Split → Record Reader → Mapper → Reducer.]
Input Split : The input to a MapReduce job is divided into fixed-size pieces called Input Splits. An
Input Split is a chunk of the input file that is consumed by a single Mapper.
Record Reader : It is a predefined interface. It reads the records line by line and converts each
line into a key-value pair so that the Mapper can read it. By default TextInputFormat is assumed;
to use any other format, we specify it in the driver code.
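As a sketch, the input format is declared on the Job object in the driver; the wrapper class and
method name below are illustrative:

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatExample {
    // TextInputFormat is the default: keys are byte offsets (LongWritable),
    // values are whole lines (Text). Declaring it explicitly is optional;
    // any other InputFormat must be declared here in the driver.
    static void configureInput(Job job) {
        job.setInputFormatClass(TextInputFormat.class);
    }
}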
[Figure: End-to-end MapReduce flow – Input File(s) → Input Format → Mappers → Reducer → Output Format → Output File.]
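Continuing the word-count example, a minimal Reducer sketch (the class name WordCountReducer is
illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a word and the list of 1s emitted by the Mappers for that word.
// Output: the word and its total count.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}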
Overall Process
Step 3 Writing programs : - DriverCode.java, MapperCode.java, ReducerCode.java [DriverCode.java
will have the main method, as there should be only one main method.]
[Note: As Hadoop is a framework, we use plenty of predefined interfaces, abstract classes, and
concrete classes in these 3 classes. The compiler can recognize all these predefined classes,
interfaces, and abstract classes only because of the hadoop-core jar.]
[Note : We give the Driver Code here because the JVM must recognize the class having the main
method, as execution starts from the main method only.]
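A minimal sketch of such a driver, tying together the illustrative WordCountMapper and
WordCountReducer classes from the earlier sketches:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);  // lets the cluster locate the job jar
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Types of the final (word, count) output pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths are taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // waitForCompletion submits the job and blocks until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}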