
Map Reduce Concepts

Job Tracker

The JobTracker is responsible for accepting jobs from clients, dividing those jobs into tasks, and assigning those tasks to be executed by worker nodes.

Task Tracker

The TaskTracker is a process on each worker node that manages the execution of the tasks currently assigned to that node. Each TaskTracker has a fixed number of slots for executing tasks (two map slots and two reduce slots by default).

MapReduce co-located with HDFS


(Diagram) The client submits a MapReduce job to the JobTracker; the JobTracker and the NameNode need not be on the same node. Each slave node (A, B, C) runs both a TaskTracker and a DataNode. Colocating TaskTrackers (compute nodes) with DataNodes yields high aggregate bandwidth across the cluster.

Introduction to MapReduce Framework

MapReduce is a programming model for parallel data processing. Hadoop can run MapReduce programs written in multiple languages, such as Java, Python, Ruby, and C++.
Map function: operates on a set of (key, value) pairs. Map is applied in parallel to the input data set and produces output keys and lists of values for each key, depending on the required functionality. Mapper output is partitioned, one partition per reducer, so the number of partitions equals the number of reduce tasks for that job.

Reduce function: operates on a set of (key, value) pairs. Reduce is then applied in parallel to each group of values sharing a key, again producing a collection of (key, value) pairs. The number of reducers can be set by the user.
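As a minimal sketch of that last point, the old JobConf API lets the driver fix the number of reduce tasks for a job; the driver class name below is a placeholder, not part of the original slides:

import org.apache.hadoop.mapred.JobConf;

public class ReducerCountExample {                  // placeholder driver class
    public static void main(String[] args) {
        JobConf conf = new JobConf(ReducerCountExample.class);
        conf.setNumMapTasks(10);    // only a hint; the real number follows the input splits
        conf.setNumReduceTasks(4);  // four reducers, hence four map-output partitions
    }
}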

How does a MapReduce algorithm work?

Understanding processing in a MapReduce framework


The user runs a program on the client computer, and the program submits a job. The job contains the input data, the Map/Reduce program, and configuration information. Two types of daemons control job execution: the JobTracker (master node) and the TaskTrackers (slave nodes). The job is sent to the JobTracker, which communicates with the NameNode and assigns parts of the job to TaskTrackers (a TaskTracker runs on each DataNode). A task is a single MAP or REDUCE operation over a piece of data. Hadoop divides the input to a MapReduce job into equal-sized splits. The JobTracker knows (from the NameNode) which node contains the data and which other machines are nearby. Task processes send heartbeats to their TaskTracker, and TaskTrackers send heartbeats to the JobTracker.

Understanding processing in a MapReduce framework (continued)


Any task that does not report progress within a certain time (default 10 minutes) is assumed to have failed; its JVM is killed by the TaskTracker, and the failure is reported to the JobTracker. The JobTracker reschedules any failed task, on a different TaskTracker where possible. If the same task fails four times, the whole job fails. Any TaskTracker that reports a high number of failed tasks on a particular node causes that node to be blacklisted, and the JobTracker stops assigning work to it. The JobTracker maintains and manages the status of each job; results from failed tasks are ignored. In summary: 1 JobTracker (master), n TaskTrackers (slaves), m tasks.

Map/Reduce data flow


The output of Map is stored on local disk, while the output of Reduce is stored in HDFS. When there is more than one reducer, the map tasks partition their output, creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but all records for a given key end up in the same partition. Partitioning can be controlled by a user-defined function; the default is a hash function. The shuffle is the flow of data between the map and reduce tasks.
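As an illustration of that default, the sketch below mirrors the logic of Hadoop's HashPartitioner; the class and method names are placeholders used only for illustration:

public class DefaultPartitionSketch {
    // Mirror of the default hash partitioning rule: clear the sign bit of the
    // key's hash code, then take the remainder modulo the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always lands in the same partition.
        System.out.println(partitionFor("hadoop", 4));
        System.out.println(partitionFor("mapreduce", 4));
    }
}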

Computing parallelism meets data locality: all map tasks are equivalent, so they can run in parallel, and all reduce tasks can also run in parallel. Input data on HDFS can be processed independently.

Therefore, a map task runs on whatever data is local (or closest) to a particular node in HDFS. For map task assignment, the JobTracker has an affinity for a node that holds a replica of the input data; if too much data happens to pile up on the same node, nearby nodes map it instead. This gives good performance and also improves recovery from partial failure of servers or storage during the operation: if one map or reduce task fails, the work can be rescheduled.

Data Distribution
In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded in. An underlying distributed file system (e.g., GFS or HDFS) splits large data files into chunks which are managed by different nodes in the cluster.
(Diagram) Input data: a large file is split into chunks, with one chunk of input data placed on each of Node 1, Node 2, and Node 3.

Even though the file chunks are distributed across several machines, they form a single namespace.

Keys and Values


The programmer in MapReduce has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer of a MapReduce program. In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs.

The map and reduce functions receive and emit (K, V) pairs
(Diagram) Input splits arrive as (K, V) pairs; the Map function turns them into intermediate (K, V) pairs; the Reduce function turns those into the final (K, V) output pairs.
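For reference, the sketch below gives a simplified view of this contract as it appears in the old org.apache.hadoop.mapred API (the real Mapper and Reducer interfaces also extend JobConfigurable and Closeable, which is omitted here; the Simple* names are placeholders):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Simplified sketch: the Mapper maps one input (K1, V1) pair to intermediate
// (K2, V2) pairs, and the Reducer folds a key and all of its values into
// final (K3, V3) pairs.
interface SimpleMapper<K1, V1, K2, V2> {
    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
}

interface SimpleReducer<K2, V2, K3, V3> {
    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
}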

Partitions
In MapReduce, intermediate output values are not usually reduced together. All values with the same key are presented to a single Reducer together. More specifically, a different subset of the intermediate key space is assigned to each Reducer; these subsets are known as partitions.
(Diagram) Different colors represent different keys, potentially coming from different Mappers; the partitions are the input to the Reducers.

Hadoop MapReduce: A Closer Look


(Diagram) On each node, files are loaded from the local HDFS store. An InputFormat splits each file, and RecordReaders (RR) turn the splits into input (K, V) pairs. Map tasks produce intermediate (K, V) pairs, which a Partitioner assigns to reducers; during shuffling, intermediate (K, V) pairs are exchanged between all nodes. On the reduce side the pairs are sorted, reduced into final (K, V) pairs, and written back to the local HDFS store through an OutputFormat.

Input Files
Input files are where the data for a MapReduce task is initially stored. The input files typically reside in a distributed file system (e.g., HDFS). The format of the input files is arbitrary: line-based log files, binary files, multi-line input records, or something else entirely.

InputFormat
How the input files are split up and read is defined by the InputFormat. InputFormat is a class that does the following: selects the files that should be used for input, defines the InputSplits that break up a file, and provides a factory for RecordReader objects that read the file.


InputFormat Types
Several InputFormats are provided with Hadoop:
TextInputFormat - Default format; reads lines of text files. Key: the byte offset of the line. Value: the line contents.
KeyValueInputFormat - Parses lines into (K, V) pairs. Key: everything up to the first tab character. Value: the remainder of the line.
SequenceFileInputFormat - A Hadoop-specific high-performance binary format. Key: user-defined. Value: user-defined.
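As an illustration, selecting one of these formats in the old mapred API is a single driver call; TextInputFormat is what you get when nothing is set, and the class corresponding to the KeyValueInputFormat row is KeyValueTextInputFormat (the driver class InputFormatChoice below is a placeholder):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;

public class InputFormatChoice {                    // placeholder driver class
    public static void main(String[] args) {
        JobConf conf = new JobConf(InputFormatChoice.class);
        // Ask for tab-separated (key, value) lines instead of the default
        // (byte offset, line contents) pairs of TextInputFormat.
        conf.setInputFormat(KeyValueTextInputFormat.class);
    }
}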

Input Splits
An input split describes the unit of work that comprises a single map task in a MapReduce program. By default, the InputFormat breaks a file up into 64 MB splits. By dividing the file into splits, we allow several map tasks to operate on a single file in parallel; if the file is very large, this can improve performance significantly through parallelism.

Each map task corresponds to a single input split
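A hedged sketch of influencing the split size per job follows, assuming the old configuration key mapred.min.split.size (newer releases use mapreduce.input.fileinputformat.split.minsize); the driver class name is a placeholder:

import org.apache.hadoop.mapred.JobConf;

public class SplitSizeExample {                     // placeholder driver class
    public static void main(String[] args) {
        JobConf conf = new JobConf(SplitSizeExample.class);
        // Ask for splits of at least 128 MB; the effective split size also
        // depends on the HDFS block size of the input files.
        conf.setLong("mapred.min.split.size", 128L * 1024 * 1024);
    }
}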

RecordReader
The input split defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (K, V) pairs suitable for reading by the Mapper.
(Diagram) The RecordReader is invoked repeatedly on the input until the entire split is consumed. Each invocation of the RecordReader leads to another call of the map function defined by the programmer.

Mapper and Reducer


The Mapper performs the user-defined work of the first phase of the MapReduce program; a new instance of Mapper is created for each input split.

The Reducer performs the user-defined work of the second phase of the MapReduce program; a new instance of Reducer is created for each partition, and for each key in the partition assigned to a Reducer, the Reducer is called once.


Partitioner
Each Mapper may emit (K, V) pairs to any partition; therefore, the map nodes must all agree on where to send the different pieces of intermediate data.

The Partitioner class determines which partition a given (K, V) pair will go to. The default partitioner computes a hash value for the key and assigns the pair to a partition based on this result.
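A minimal sketch of a user-defined partitioner in the old mapred API follows; the FirstLetterPartitioner class and its routing rule are hypothetical and exist only to illustrate the interface.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hypothetical partitioner: route each word to a partition based on its first letter.
public class FirstLetterPartitioner implements Partitioner<Text, IntWritable> {
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        return first % numPartitions;   // char values are non-negative, so this index is valid
    }

    public void configure(JobConf job) {
        // No per-job configuration is needed for this sketch.
    }
}

It would be registered on the job with conf.setPartitionerClass(FirstLetterPartitioner.class).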


Sort
Each Reducer is responsible for reducing the values associated with (several) intermediate keys. The set of intermediate keys on a single node is automatically sorted by MapReduce before the keys are presented to the Reducer.
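As a hedged aside, the order in which keys reach a Reducer can be customized per job; the comparator below is a hypothetical sketch that reverses the natural order of Text keys.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparator;

// Hypothetical comparator: flip the natural ordering of Text keys so the
// Reducer sees them in descending order.
public class DescendingTextComparator extends WritableComparator {
    public DescendingTextComparator() {
        super(Text.class, true);
    }

    @SuppressWarnings("rawtypes")
    public int compare(org.apache.hadoop.io.WritableComparable a,
                       org.apache.hadoop.io.WritableComparable b) {
        return -super.compare(a, b);   // negate to reverse the order
    }
}

It would be registered with conf.setOutputKeyComparatorClass(DescendingTextComparator.class).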

OutputFormat

The OutputFormat class defines the way (K,V) pairs produced by Reducers are written to output files

The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS


Several OutputFormats are provided by Hadoop:


TextOutputFormat - Default; writes lines in "key \t value" format.
SequenceFileOutputFormat - Writes binary files suitable for reading into subsequent MapReduce jobs.
NullOutputFormat - Generates no output files.
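As a hedged illustration of the chaining use case mentioned in the table, a driver can request binary output that a follow-up job reads directly; the driver class name below is a placeholder:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class ChainedJobDriver {                     // placeholder driver class
    public static void main(String[] args) {
        JobConf conf = new JobConf(ChainedJobDriver.class);
        // Write binary (key, value) records so a later MapReduce job can read
        // them back with SequenceFileInputFormat instead of re-parsing text.
        conf.setOutputFormat(SequenceFileOutputFormat.class);
    }
}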

Job Scheduling in MapReduce


In MapReduce, an application is represented as a job, and a job encompasses multiple map and reduce tasks. MapReduce in Hadoop comes with a choice of schedulers:

The default is the FIFO scheduler which schedules jobs in order of submission
There is also a multi-user scheduler called the Fair scheduler which aims to give every user a fair share of the cluster capacity over time


FIFO Scheduling

(Diagram: jobs wait in a single job queue and run one after another, in order of submission)

Fair Scheduling

(Diagram: jobs from the job queue share the cluster's task slots concurrently)

Fair Scheduler Basics


Group jobs into pools. Assign each pool a guaranteed minimum share. Divide excess capacity evenly between pools.

Pools are determined from a configurable job property


Default in 0.20: user.name (one pool per user)

Pools have properties:


Minimum map slots, minimum reduce slots, and a limit on the number of running jobs

Example Pool Allocations


(Diagram) The entire cluster has 100 slots, divided among the pools matei, jeff, tom (min share = 30), and ads (min share = 40). The running jobs receive: job 1 - 30 slots, job 2 - 15 slots, job 3 - 15 slots, job 4 - 40 slots.

Scheduling Algorithm
Split each pool's min share among its jobs, and split each pool's total share among its jobs. When a slot needs to be assigned: if there is any job below its min share, schedule it; otherwise schedule the job that we've been most unfair to (based on its deficit).
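A hedged sketch of this selection rule (not the actual Hadoop FairScheduler code) is given below; the class and field names are placeholders:

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Prefer any job still below its min share; otherwise pick the job with the
// largest deficit (the one treated most unfairly so far).
public class SlotAssignerSketch {

    static class JobInfo {
        final String name;
        final int runningSlots;
        final int minShare;   // this job's portion of its pool's min share
        final double deficit; // how far the job has fallen behind its fair share

        JobInfo(String name, int runningSlots, int minShare, double deficit) {
            this.name = name;
            this.runningSlots = runningSlots;
            this.minShare = minShare;
            this.deficit = deficit;
        }
    }

    static JobInfo pickJob(List<JobInfo> jobs) {
        return jobs.stream()
                .filter(j -> j.runningSlots < j.minShare)            // below min share? schedule it
                .findFirst()
                .orElseGet(() -> jobs.stream()
                        .max(Comparator.comparingDouble(j -> j.deficit)) // else: largest deficit
                        .orElse(null));
    }

    public static void main(String[] args) {
        List<JobInfo> jobs = Arrays.asList(
                new JobInfo("job1", 10, 10, 3.0),   // already at its min share
                new JobInfo("job2", 5, 10, 1.0));   // still below its min share
        System.out.println(pickJob(jobs).name);     // prints "job2"
    }
}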

Fault Tolerance in Hadoop


MapReduce can guide jobs to successful completion even when they run on a large cluster, where the probability of failures increases. The primary way MapReduce achieves fault tolerance is by restarting tasks.

If a TaskTracker (TT) fails to communicate with the JobTracker (JT) for a period of time (by default, 1 minute in Hadoop), the JT assumes that the TT in question has crashed. If the job is still in the map phase, the JT asks another TT to re-execute all Mappers that previously ran on the failed TT; if the job is in the reduce phase, the JT asks another TT to re-execute all Reducers that were in progress on the failed TT.

Speculative Execution
A MapReduce job is dominated by its slowest task. MapReduce therefore attempts to locate slow tasks (stragglers) and to run redundant (speculative) copies that will, optimistically, commit before the corresponding stragglers do. This process is known as speculative execution. Only one speculative copy of a straggler is allowed at a time. Whichever of the two copies of a task commits first becomes the definitive copy, and the other copy is killed by the JT.

Locating Stragglers
How does Hadoop locate stragglers?
Hadoop monitors each task's progress using a progress score between 0 and 1. If a task's progress score is less than (average − 0.2) and the task has run for at least 1 minute, it is marked as a straggler.

Example: a task T1 with progress score 2/3 is above the threshold and is not a straggler, while a task T2 with progress score 1/12 falls below it and is marked as a straggler.
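A minimal sketch of this straggler test, using the two example tasks above (the class and method names are placeholders):

public class StragglerCheck {
    // A task is flagged if its progress score falls more than 0.2 below the
    // average and it has run for at least one minute.
    static boolean isStraggler(double progress, double avgProgress, long runtimeMs) {
        return progress < (avgProgress - 0.2) && runtimeMs >= 60_000;
    }

    public static void main(String[] args) {
        double avg = (2.0 / 3 + 1.0 / 12) / 2;                   // average of the two example tasks
        System.out.println(isStraggler(2.0 / 3, avg, 120_000));  // T1: false, not a straggler
        System.out.println(isStraggler(1.0 / 12, avg, 120_000)); // T2: true, a straggler
    }
}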

What Makes MapReduce Unique?


MapReduce is characterized by: 1. Its simplified programming model, which allows the user to quickly write and test distributed systems

2. Its efficient and automatic distribution of data and workload across machines
3. Its flat scalability curve. Specifically, after a MapReduce program is written and functioning on 10 nodes, very little, if any, work is required to make that same program run on 1,000 nodes


Programming using MapReduce

WordCount is a simple application that counts the number of occurrences of each word in a given input file. Here we divide the code into three files: 1) WordCountMapper.java, 2) WordCountReducer.java, 3) Basic.java.

WordCountMapper.java

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

// Emits (word, 1) for every token in each input line.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);   // key = the word, value = 1
        }
    }
}

WordCountReducer.java

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

// Sums the counts emitted by the mappers for each word.
public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();   // add up all the 1s for this word
        }
        output.collect(key, new IntWritable(sum));
    }
}

Basic.java

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

// Driver class: configures and submits the WordCount job.
public class Basic {

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(Basic.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}

Executing the MapReduce program


1) Compile the three Java files (with the Hadoop libraries on the classpath); this produces the corresponding .class files.

2) Package the .class files into a single jar file with this command: jar cvf file_name.jar *.class

3) Run the jar on Hadoop with this command: bin/hadoop jar file_name.jar Basic input_file_name output_file_name
