
SANKHYANA CONSULTANCY SERVICES

Data Driven Decision Science

Chapter 4
Introduction to MapReduce
MapReduce is a programming model for processing huge volumes of data. Hadoop can run
MapReduce programs written in various languages such as Java, Ruby, Python, and C++, but Java
is used most often because Hadoop itself is built in Java. MapReduce consists of two phases:
Map and Reduce. Each node processes the data stored locally on that node.

Features of MapReduce
 Specify the computation in terms of a Map function and a Reduce function.
 Input to each phase is a set of key-value pairs.
 Automatic fault tolerance.
 Cross-language support – MapReduce programs are usually written in Java but can be written in
other languages as well using Hadoop Streaming.
 Clean abstraction for developers – MapReduce abstracts all housekeeping away from the
developer, so the developer can concentrate on the problem domain.
 Automatic parallel and distributed processing across multiple nodes.
 Data locality.

Key MapReduce Phases

i. Mapper Phase -
-- This is the very first phase in the execution of a MapReduce program.
-- Each Mapper operates on a single HDFS block.
-- The Mapper runs on the node where the block is stored.
-- The data in each split is passed to the mapping function to produce output key-value pairs.

ii. Shuffle & Sort Phase -


-- This phase starts after all map tasks are completed.
-- It is driven internally; Hadoop itself contains the code for it, so no user code is required.
-- It sorts and consolidates the intermediate data from all mappers.
-- Shuffling groups the intermediate data: all values associated with the same key are combined
into one collection.
-- Sorting orders the intermediate key-value pairs by key. Because of the shuffling phase, all
unique keys are compared with each other and emitted in sorted order; this sort is driven by the
keys' Comparable implementation.

iii. Reducer Phase -


-- Operates on the shuffled and sorted intermediate data.
-- Writes the final output to HDFS.
-- By default there is one reducer, and it may reside anywhere in the cluster.
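The three phases above can be sketched as a single-process word count in plain Java. This is an illustrative simulation only, with no Hadoop dependency; a real Hadoop job would instead extend the Mapper and Reducer classes of the Hadoop API. The TreeMap in the shuffle step mirrors the Comparable-based key sorting Hadoop performs.

```java
import java.util.*;

// Minimal simulation of the MapReduce phases (map -> shuffle & sort -> reduce)
// using a word count. Illustrative sketch only, not the Hadoop API.
public class MapReduceSimulation {

    // Map phase: emit a (word, 1) pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle & sort phase: group all values by key; the TreeMap keeps the
    // keys in sorted order, as Hadoop's Comparable-based sort does.
    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of values for each key.
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hi how are you", "how is your job");
        System.out.println(reduce(shuffleAndSort(map(input))));
        // prints {are=1, hi=1, how=2, is=1, job=1, you=1, your=1}
    }
}
```

Note how the reducer never sees raw mapper output: it only receives each unique key with the collection of all its values, already in sorted key order.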

SANKHYANA CONSULTANCY SERVICES


Data Driven Decision Science (Training/Consulting/Analytics)
1188, HNR Tower, 4th Floor, 24th Main, Near Parangipalya Bus Stop, Above Udupi Palace, 2nd
Sector, HSR Layout, Bangalore – 560102. Ph: 080 48147185, 48147186

MapReduce basic flow

[Fig : The JobTracker runs on the master node; each slave node runs a TaskTracker]

Job Tracker
 JobTracker is present in the master node of the Hadoop cluster.
 JobTracker accepts job requests from clients.
 It divides a job into multiple tasks and allocates the tasks to TaskTrackers.
 It responds to heartbeat messages from TaskTrackers. If it does not receive a heartbeat 10
times in a row, i.e. for 30 seconds, it considers that TaskTracker slow or dead.
 It gathers the final output and informs the client of success or failure.
 It is a single point of failure: if it dies, all running tasks are disrupted.

Task Tracker
 TaskTrackers are present in the slave nodes of the Hadoop cluster.
 TaskTrackers run the tasks assigned by the JobTracker.
 They periodically report task progress to the JobTracker via a heartbeat message every 3 seconds.

[Fig : The client submits a job to the JobTracker on the master node, which also hosts the
NameNode; each slave node runs a TaskTracker alongside a DataNode]


Input File → Input Format → Input Split → Record Reader → Mapper → Shuffle and Sort →
Reducer → Output Format → Output File

Fig : Basic MapReduce Flow With One Input File

Input Split : The input to a MapReduce job is divided into fixed-size pieces called input splits. An
input split is the chunk of the input file that is consumed by a single Mapper.
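The split size itself is derived from the HDFS block size, clamped between a configured minimum and maximum. The snippet below is a sketch of that calculation (the class and variable names here are illustrative, not the actual Hadoop API); with the default minimum and maximum, one split corresponds to one HDFS block.

```java
// Sketch of how the input split size is derived: the HDFS block size is
// clamped between the configured minimum and maximum split sizes,
// i.e. max(minSize, min(maxSize, blockSize)).
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block
        // Default minSize = 1 and maxSize = Long.MAX_VALUE give one split per block.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}
```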

Record Reader : It is a predefined interface. It reads the input line by line and converts each line
into a key-value pair so that the Mapper can read it. By default the record format is derived from
the InputFormat, which we specify in the driver code.
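For text input, the record reader emits one (key, value) pair per line, where the key is the offset of the line within the split and the value is the line itself. The sketch below simulates this on a plain String for illustration; it is not Hadoop's actual LineRecordReader, and it counts character offsets rather than byte offsets.

```java
import java.util.*;

// Simulated text record reader: for each line of a split, emit a
// (offset, line) pair, as a text-input record reader would for the Mapper.
public class SimpleRecordReader {
    static List<Map.Entry<Long, String>> read(String split) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : split.split("\n", -1)) {
            records.add(new AbstractMap.SimpleEntry<>(offset, line));
            offset += line.length() + 1; // +1 for the newline character
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : read("hi how are you\nhow is your job")) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
        // prints:
        // 0    hi how are you
        // 15   how is your job
    }
}
```

These (offset, line) pairs are exactly what the mapping function receives as its input key-value pairs.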


Input File(s) → Input Format → Input Split 1 / Input Split 2 → Record Reader (one per split) →
Mapper (one per split) → Shuffle and Sort → Reducer → Output Format → Output File

Fig : Basic MapReduce Flow with Two Input Files

Overall Process

Step 1 Creating a file


cat > file.txt
hi how are you
how is your job
how is your family
what is time now
what is the strength of Hadoop
ctrl+d (to save & exit)

Step 2 Loading file.txt from the local file system to HDFS


hdfs dfs -put file.txt file

Step 3 Writing programs : - DriverCode.java, MapperCode.java, ReducerCode.java [DriverCode.java
will have the main method, as there should be only one main method.]


Step 4 Compiling all the above .java files


javac -classpath $HADOOP_HOME/hadoop-core.jar *.java

[Note : As Hadoop is a framework, these three classes use many predefined interfaces, abstract
classes, and concrete classes. The compiler can recognize all of these predefined types only
because hadoop-core.jar is on the classpath.]

Step 5 Creating a jar file


jar cvf test.jar *.class

Step 6 Running the above test.jar on the file which is in HDFS


hadoop jar test.jar DriverCode file TestOutput

[Note : We specify DriverCode here so that the JVM can recognize the class having the main
method, as execution starts from the main method.]

