
SANKHYANA CONSULTANCY SERVICES

Data Driven Decision Science

Chapter 4
Introduction to MapReduce
MapReduce is a programming model for processing huge volumes of data. Hadoop can run
MapReduce programs written in various languages such as Java, Ruby, Python, and C++, but Java
is used most often because Hadoop itself is built in Java. MapReduce consists of two phases:
Map and Reduce. Each node processes the data stored locally on that node.

Features of MapReduce
 Specify the computation in terms of a Map function and a Reduce function.
 Input to each phase is a set of key-value pairs.
 Automatic fault tolerance.
 Cross-language support – MapReduce programs are usually written in Java but can be written in
other languages as well using Hadoop Streaming.
 Clean abstraction for developers – MapReduce abstracts all housekeeping away from the
developer, so the developer can concentrate on the problem domain.
 Automatic parallel and distributed processing across multiple nodes.
 Data locality.

Key MapReduce Phases

i. Mapper Phase -
-- This is the very first phase in the execution of a MapReduce program.
-- Each Mapper operates on a single HDFS block.
-- The Mapper runs on the node where the block is stored.
-- The data in each split is passed to the mapping function to produce output key-value pairs.

ii. Shuffle & Sort Phase -


-- This phase starts after all map tasks are completed.
-- It is driven internally; Hadoop itself contains the code for it, so no user code is required.
-- It sorts and consolidates the intermediate data from all mappers.
-- Shuffling groups the intermediate data: all values associated with the same key are combined
into one collection.
-- Sorting orders the intermediate key-value pairs by key. Because of the shuffling phase, all
unique keys are compared with each other and emitted in sorted order; this sort is driven by the
keys' Comparable implementation.

iii. Reducer Phase -


-- Operates on the shuffled and sorted intermediate data.
-- Writes the final output to HDFS.
-- By default there is one reducer, and it may reside anywhere in the cluster.
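The three phases above can be sketched as a single-process word count in plain Java. This is an illustrative simulation only, with no Hadoop dependency; a real Hadoop job would instead extend the Mapper and Reducer classes of the Hadoop API. The TreeMap in the shuffle step mirrors the Comparable-based key sorting Hadoop performs.

```java
import java.util.*;

// Minimal simulation of the MapReduce phases (map -> shuffle & sort -> reduce)
// using a word count. Illustrative sketch only, not the Hadoop API.
public class MapReduceSimulation {

    // Map phase: emit a (word, 1) pair for every word in every line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle & sort phase: group all values by key; the TreeMap keeps the
    // keys in sorted order, as Hadoop's Comparable-based sort does.
    static SortedMap<String, List<Integer>> shuffleAndSort(
            List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the list of values for each key.
    static SortedMap<String, Integer> reduce(SortedMap<String, List<Integer>> grouped) {
        SortedMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("hi how are you", "how is your job");
        System.out.println(reduce(shuffleAndSort(map(input))));
        // prints {are=1, hi=1, how=2, is=1, job=1, you=1, your=1}
    }
}
```

Note how the reducer never sees raw mapper output: it only receives each unique key with the collection of all its values, already in sorted key order.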

SANKHYANA CONSULTANCY SERVICES


Data Driven Decision Science (Training/Consulting/Analytics)
1188, HNR Tower, 4th Floor, 24th Main, Near Parangipalya Bus Stop, Above Udupi Palace, 2nd
Sector, HSR Layout, Bangalore – 560102. Ph: 080 48147185, 48147186

MapReduce basic flow

[Fig : The JobTracker runs on the master node; each slave node runs a TaskTracker]

Job Tracker
 JobTracker is present in the master node of the Hadoop cluster.
 JobTracker accepts job requests from clients.
 It divides a job into multiple tasks and allocates the tasks to TaskTrackers.
 It responds to heartbeat messages from TaskTrackers. If it does not receive a heartbeat 10
times in a row, i.e. for 30 seconds, it considers that TaskTracker slow or dead.
 It gathers the final output and informs the client of success or failure.
 It is a single point of failure: if it dies, all running tasks are disrupted.

Task Tracker
 TaskTrackers are present in the slave nodes of the Hadoop cluster.
 TaskTrackers run the tasks assigned by the JobTracker.
 They periodically report task progress to the JobTracker via a heartbeat message every 3 seconds.

[Fig : The client submits a job to the JobTracker on the master node, which also hosts the
NameNode; each slave node runs a TaskTracker alongside a DataNode]


Input File → Input Format → Input Split → Record Reader → Mapper → Shuffle and Sort →
Reducer → Output Format → Output File

Fig : Basic MapReduce Flow With One Input File

Input Split : The input to a MapReduce job is divided into fixed-size pieces called input splits. An
input split is the chunk of the input file that is consumed by a single Mapper.
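The split size itself is derived from the HDFS block size, clamped between a configured minimum and maximum. The snippet below is a sketch of that calculation (the class and variable names here are illustrative, not the actual Hadoop API); with the default minimum and maximum, one split corresponds to one HDFS block.

```java
// Sketch of how the input split size is derived: the HDFS block size is
// clamped between the configured minimum and maximum split sizes,
// i.e. max(minSize, min(maxSize, blockSize)).
public class SplitSize {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024; // 128 MB HDFS block
        // Default minSize = 1 and maxSize = Long.MAX_VALUE give one split per block.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));
    }
}
```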

Record Reader : It is a predefined interface. It reads the input line by line and converts each line
into a key-value pair so that the Mapper can read it. By default the record format is derived from
the InputFormat, which we specify in the driver code.
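For text input, the record reader emits one (key, value) pair per line, where the key is the offset of the line within the split and the value is the line itself. The sketch below simulates this on a plain String for illustration; it is not Hadoop's actual LineRecordReader, and it counts character offsets rather than byte offsets.

```java
import java.util.*;

// Simulated text record reader: for each line of a split, emit a
// (offset, line) pair, as a text-input record reader would for the Mapper.
public class SimpleRecordReader {
    static List<Map.Entry<Long, String>> read(String split) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : split.split("\n", -1)) {
            records.add(new AbstractMap.SimpleEntry<>(offset, line));
            offset += line.length() + 1; // +1 for the newline character
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : read("hi how are you\nhow is your job")) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
        // prints:
        // 0    hi how are you
        // 15   how is your job
    }
}
```

These (offset, line) pairs are exactly what the mapping function receives as its input key-value pairs.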


Input File(s) → Input Format → Input Split 1 / Input Split 2 → Record Reader (one per split) →
Mapper (one per split) → Shuffle and Sort → Reducer → Output Format → Output File

Fig : Basic MapReduce Flow with Two Input Files

Overall Process

Step 1 Creating a file


cat > file.txt
hi how are you
how is your job
how is your family
what is time now
what is the strength of Hadoop
ctrl+d (to save & exit)

Step 2 Loading file.txt from the local file system to HDFS


hdfs dfs -put file.txt file

Step 3 Writing programs : - DriverCode.java, MapperCode.java, ReducerCode.java [DriverCode.java
will have the main method, as there should be only one main method.]


Step 4 Compiling all the above .java files


javac -classpath $HADOOP_HOME/hadoop-core.jar *.java

[Note : As Hadoop is a framework, these three classes use many predefined interfaces, abstract
classes, and concrete classes. The compiler can recognize all of these predefined types only
because hadoop-core.jar is on the classpath.]

Step 5 Creating a jar file


jar cvf test.jar *.class

Step 6 Running the above test.jar on the file which is in HDFS


hadoop jar test.jar DriverCode file TestOutput

[Note : We specify DriverCode here so that the JVM can recognize the class having the main
method, as execution starts from the main method.]

