Você está na página 1de 76

DEUTSCH-FRANZSISCHE SOMMERUNIVERSITT FR UNIVERSIT DT FRANCO-ALLEMANDE POUR JEUNES

NACHWUCHSWISSENSCHAFTLER 2011 CHERCHEURS 2011


CLOUD COMPUTING : HERAUSFORDERUNGEN UND CLOUD COMPUTING: DFIS ET OPPORTUNITS
MGLICHKEITEN

Hadoop MapReduce in Practice

Speaker: Pietro Michiardi


Contributors: Antonio Barbuzzi, Mario Pastorelli

Institution: Eurecom

Pietro Michiardi (Eurecom) Hadoop in Practice 1 / 76


Introduction

Overview of this Lecture

Hadoop: Architecture Design and Implementation Details


[45 minutes]
I HDFS
I Hadoop MapReduce
I Hadoop MapReduce I/O

Exercise Session:
I Warm up: WordCount and first design patterns [45 minutes]
I Exercises on various design patterns: [60 minutes]
F Pairs
F Stripes
F Order Inversion [HomeWork]
I Solved Exercise: PageRank in MapReduce [45 minutes]

Pietro Michiardi (Eurecom) Hadoop in Practice 2 / 76


Hadoop MapReduce

Hadoop MapReduce

Pietro Michiardi (Eurecom) Hadoop in Practice 3 / 76


Hadoop MapReduce Preliminaries

From Theory to Practice

The story so far


I Concepts behind the MapReduce Framework
I Overview of the programming model

Terminology
I MapReduce:
F Job: an execution of a Mapper and Reducer across a data set
F Task: an execution of a Mapper or a Reducer on a slice of data
F Task Attempt: instance of an attempt to execute a task
I Example:
F Running Word Count across 20 files is one job
F 20 files to be mapped = 20 map tasks + some number of reduce tasks
F At least 20 attempts will be performed... more if a machine crashes

Pietro Michiardi (Eurecom) Hadoop in Practice 4 / 76


Hadoop MapReduce HDFS in details

HDFS in details

Pietro Michiardi (Eurecom) Hadoop in Practice 5 / 76


Hadoop MapReduce HDFS in details

The Hadoop Distributed Filesystem

Large dataset(s) outgrowing the storage capacity of a single


physical machine
I Need to partition it across a number of separate machines
I Network-based system, with all its complications
I Tolerate failures of machines

Hadoop Distributed Filesystem[6, 7]


I Very large files
I Streaming data access
I Commodity hardware

Pietro Michiardi (Eurecom) Hadoop in Practice 6 / 76


Hadoop MapReduce HDFS in details

HDFS Blocks

(Big) files are broken into block-sized chunks


I NOTE: A file that is smaller than a single block does not occupy a
full blocks worth of underlying storage

Blocks are stored on independent machines


I Reliability and parallel access

Why is a block so large?


I Make transfer times larger than seek latency
I E.g.: Assume seek time is 10ms and the transfer rate is 100 MB/s,
if you want seek time to be 1% of transfer time, then the block size
should be 100MB

Pietro Michiardi (Eurecom) Hadoop in Practice 7 / 76


Hadoop MapReduce HDFS in details

NameNodes and DataNodes

NameNode
I Keeps metadata in RAM
I Each block information occupies roughly 150 bytes of memory
I Without NameNode, the filesystem cannot be used
F Persistence of metadata: synchronous and atomic writes to NFS

Secondary NameNode
I Merges the namespce with the edit log
I A useful trick to recover from a failure of the NameNode is to use the
NFS copy of metadata and switch the secondary to primary

DataNode
I They store data and talk to clients
I They report periodically to the NameNode the list of blocks they hold

Pietro Michiardi (Eurecom) Hadoop in Practice 8 / 76


Hadoop MapReduce HDFS in details

Anatomy of a File Read

NameNode is only used to get block location


I Unresponsive DataNode are discarded by clients
I Batch reading of blocks is allowed

External clients
I For each block, the NameNode returns a set of DataNodes holding
a copy thereof
I DataNodes are sorted according to their proximity to the client

MapReduce clients
I TaskTracker and DataNodes are colocated
I For each block, the NameNode usually1 returns the local DataNode

1
Exceptions exist due to stragglers.
Pietro Michiardi (Eurecom) Hadoop in Practice 9 / 76
Hadoop MapReduce HDFS in details

Anatomy of a File Write

Details on replication
I Clients ask NameNode for a list of suitable DataNodes
I This list forms a pipeline: first DataNode stores a copy of a
block, then forwards it to the second, and so on

Replica Placement
I Tradeoff between reliability and bandwidth
I Default placement:
F First copy on the same node of the client, second replica is off-rack,
third replica is on the same rack as the second but on a different node
F Since Hadoop 0.21, replica placement can be customized

Pietro Michiardi (Eurecom) Hadoop in Practice 10 / 76


Hadoop MapReduce HDFS in details

HDFS Coherency Model

Read your writes is not guaranteed


I The namespace is updated
I Block contents may not be visible after a write is finished
I Application design (other than MapReduce) should use sync() to
force synchronization
I sync() involves some overhead: tradeoff between
robustness/consistency and throughput

Multiple writers (for the same file) are not supported


I Instead, different files can be written in parallel (using MapReduce)

Pietro Michiardi (Eurecom) Hadoop in Practice 11 / 76


Hadoop MapReduce Hadoop MapReduce in details

How Hadoop MapReduce Works

Pietro Michiardi (Eurecom) Hadoop in Practice 12 / 76


Hadoop MapReduce Hadoop MapReduce in details

Anatomy of a MapReduce Job Run

Pietro Michiardi (Eurecom) Hadoop in Practice 13 / 76


Hadoop MapReduce Hadoop MapReduce in details

Job Submission

JobClient class
I The runJob() method creates a new instance of a JobClient
I Then it calls the submitJob() on this class

Simple verifications on the Job


I Is there an output directory?
I Are there any input splits?
I Can I copy the JAR of the job to HDFS?

NOTE: the JAR of the job is replicated 10 times

Pietro Michiardi (Eurecom) Hadoop in Practice 14 / 76


Hadoop MapReduce Hadoop MapReduce in details

Job Initialization

The JobTracker is responsible for:


I Create an object for the job
I Encapsulate its tasks
I Bookkeeping with the tasks status and progress

This is where the scheduling happens


I JobTracker performs scheduling by maintaining a queue
I Queueing disciplines are pluggable

Compute mappers and reducers


I JobTracker retrieves input splits (computed by JobClient)
I Determines the number of Mappers based on the number of input
splits
I Reads the configuration file to set the number of Reducers

Pietro Michiardi (Eurecom) Hadoop in Practice 15 / 76


Hadoop MapReduce Hadoop MapReduce in details

Task Assignment
Heartbeat-Based mechanism
I TaskTrackers periodically send heartbeats to the JobTracker
I TaskTracker is alive
I Heartbeat contains also information on availability of the
TaskTrackers to execute a task
I JobTracker piggybacks a task if TaskTracker is available

Selecting a task
I JobTracker first needs to select a job (i.e. scheduling)
I TaskTrackers have a fixed number of slots for map and reduce
tasks
I JobTracker gives priority to map tasks (WHY?)

Data locality
I JobTracker is topology aware
F Useful for map tasks
F Unused for reduce tasks
Pietro Michiardi (Eurecom) Hadoop in Practice 16 / 76
Hadoop MapReduce Hadoop MapReduce in details

Task Execution

Task Assignment is done, now TaskTrackers can execute


I Copy the JAR from the HDFS
I Create a local working directory
I Create an instance of TaskRunner

TaskRunner launches a child JVM


I This prevents bugs from stalling the TaskTracker
I A new child JVM is created per InputSplit
F Can be overridden by specifying JVM Reuse option, which is very
useful for custom, in-memory, combiners

Pietro Michiardi (Eurecom) Hadoop in Practice 17 / 76


Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort

The MapReduce framework guarantees the input to every


reducer to be sorted by key
I The process by which the system sorts and transfers map outputs
to reducers is known as shuffle

Shuffle is the most important part of the framework, where


the magic happens
I Good understanding allows optimizing both the framework and the
execution time of MapReduce jobs

Subject to continuous refinements

Pietro Michiardi (Eurecom) Hadoop in Practice 18 / 76


Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Map Side

Pietro Michiardi (Eurecom) Hadoop in Practice 19 / 76


Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Map Side


The output of a map task is not simply written to disk
I In memory buffering
I Pre-sorting

Circular memory buffer


I 100 MB by default
I Threshold based mechanism to spill buffer content to disk
I Map output written to the buffer while spilling to disk
I If buffer fills up while spilling, the map task is blocked

Disk spills
I Written in round-robin to a local dir
I Output data is partitioned corresponding to the reducers they will
be sent to
I Within each partition, data is sorted (in-memory)
I Optionally, if there is a combiner, it is executed just after the sort
phase
Pietro Michiardi (Eurecom) Hadoop in Practice 20 / 76
Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Map Side

More on spills and memory buffer


I Each time the buffer is full, a new spill is created
I Once the map task finishes, there are many spills
I Such spills are merged into a single partitioned and sorted output
file

The output file partitions are made available to reducers over


HTTP
I There are 40 (default) threads dedicated to serve the file partitions
to reducers

Pietro Michiardi (Eurecom) Hadoop in Practice 21 / 76


Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Map Side

Pietro Michiardi (Eurecom) Hadoop in Practice 22 / 76


Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Reduce Side

The map output file is located on the local disk of


TaskTracker
Another TaskTracker (in charge of a reduce task) requires
input from many other TaskTracker (that finished their map
tasks)
I How do reducers know which TaskTrackers to fetch map output
from?
F When a map task finishes it notifies the parent TaskTracker
F The TaskTracker notifies (with the heartbeat mechanism) the
jobtracker
F A thread in the reducer polls periodically the JobTracker
F TaskTrackers do not delete local map output as soon as a reduce
task has fetched them (WHY?)
Copy phase: a pull approach
I There is a small number (5) of copy threads that can fetch map
outputs in parallel

Pietro Michiardi (Eurecom) Hadoop in Practice 23 / 76


Hadoop MapReduce Hadoop MapReduce in details

Shuffle and Sort: the Reduce Side

The map outputs are copied to the the TraskTracker


running the reducer in memory (if they fit)
I Otherwise they are copied to disk

Input consolidation
I A background thread merges all partial inputs into larger, sorted
files
I Note that if compression was used (for map outputs to save
bandwidth), decompression will take place in memory

Sorting the input


I When all map outputs have been copied a merge phase starts
I All map outputs are sorted maintaining their sort ordering, in rounds

Pietro Michiardi (Eurecom) Hadoop in Practice 24 / 76


Hadoop MapReduce Hadoop I/O

Hadoop I/O

Pietro Michiardi (Eurecom) Hadoop in Practice 25 / 76


Hadoop MapReduce Hadoop I/O

I/O operations in Hadoop

Reading and writing data


I From/to HDFS
I From/to local disk drives
I Across machines (inter-process communication)

Customized tools for large amounts of data


I Hadoop does not use Java native classes
I Allows flexibility for dealing with custom data (e.g. binary)

Whats next
I Overview of what Hadoop offers
I For an in depth knowledge, use [7]

Pietro Michiardi (Eurecom) Hadoop in Practice 26 / 76


Hadoop MapReduce Hadoop I/O

Serialization

Transforms structured objects into a byte stream


I For transmission over the network: Hadoop uses RPC
I For persistent storage on disks

Hadoop uses its own serialization interface, Writable


I Comparison of types is crucial (Shuffle and Sort phase): Hadoop
provides a custom RawComparator, which avoids deserialization
I Custom Writable for having full control on the binary
representation of data
I Also external frameworks are allowed: enter Avro

Fixed-Length or variable-length encoding?


I Fixed-Length: when the distribution of values is uniform
I Variable-length: when the distribution of values is not uniform

Pietro Michiardi (Eurecom) Hadoop in Practice 27 / 76


Hadoop MapReduce Hadoop I/O

Hadoop MapReduce Types and Formats

Pietro Michiardi (Eurecom) Hadoop in Practice 28 / 76


Hadoop MapReduce Hadoop I/O

MapReduce Types

Input / output to mappers and reducers


I map: (k 1, v 1) [(k 2, v 2)]
I reduce: (k 2, [v 2]) [(k 3, v 3)]

Types:
I K types implement WritableComparable
I V types implement Writable

Pietro Michiardi (Eurecom) Hadoop in Practice 29 / 76


Hadoop MapReduce Hadoop I/O

What is a Writable

Hadoop defines its own classes for strings (Text), integers


(IntWritable), etc...

All keys are instances of WritableComparable


I Why comparable?

All values are instances of Writable

Pietro Michiardi (Eurecom) Hadoop in Practice 30 / 76


Hadoop MapReduce Hadoop I/O

Getting Data to the Mapper

Pietro Michiardi (Eurecom) Hadoop in Practice 31 / 76


Hadoop MapReduce Hadoop I/O

Reading Data

Datasets are specified by InputFormats


I InputFormats define input data (e.g. a file, a directory)
I InputFormats is a factory for RecordReader objects to extract
key-value records from the input source

InputFormats identify partitions of the data that form an


InputSplit
I InputSplit is a (reference to a) chunk of the input processed by
a single map
I Each split is divided into records, and the map processes each
record (a key-value pair) in turn
I Splits and records are logical, they are not physically bound to a file

Pietro Michiardi (Eurecom) Hadoop in Practice 32 / 76


Hadoop MapReduce Hadoop I/O

FileInputFormat and Friends

TextInputFormat
I Treats each newline-terminated line of a file as a value

KeyValueTextInputFormat
I Maps newline-terminated text lines of key SEPARATOR value

SequenceFileInputFormat
I Binary file of key-value pairs with some additional metadata

SequenceFileAsTextInputFormat
I Same as before but, maps (k.toString(), v.toString())

Pietro Michiardi (Eurecom) Hadoop in Practice 33 / 76


Hadoop MapReduce Hadoop I/O

Record Readers

Each InputFormat provides its own RecordReader


implementation

Examples:
I LineRecordReader
F Used by TextInputFormat, reads a line from a text file

I KeyValueRecordReader
F Used by KeyValueTextInputFormat

Pietro Michiardi (Eurecom) Hadoop in Practice 34 / 76


Hadoop MapReduce Hadoop I/O

Example: TextInputFormat, LineRecordReader

On the top of the Crumpetty Tree (0, On the top of the Crumpetty Tree)
The Quangle Wangle sat, (33, The Quangle Wangle sat,)
But his face you could not see, (57, But his face you could not see,)
On account of his Beaver Hat. (89, On account of his Beaver Hat.)

Pietro Michiardi (Eurecom) Hadoop in Practice 35 / 76


Hadoop MapReduce Hadoop I/O

Writing the Output

Pietro Michiardi (Eurecom) Hadoop in Practice 36 / 76


Hadoop MapReduce Hadoop I/O

Writing the Output

Analogous to InputFormat

TextOutputFormat writes key value <newline> strings to


output file

SequenceFileOutputFormat uses a binary format to pack


key-value pairs

NullOutputFormat discards output

Pietro Michiardi (Eurecom) Hadoop in Practice 37 / 76


Algorithm Design

Algorithm Design

Pietro Michiardi (Eurecom) Hadoop in Practice 38 / 76


Algorithm Design Preliminaries

Preliminaries

Pietro Michiardi (Eurecom) Hadoop in Practice 39 / 76


Algorithm Design Preliminaries

Algorithm Design

Developing algorithms involve:


I Preparing the input data
I Implement the mapper and the reducer
I Optionally, design the combiner and the partitioner

How to recast existing algorithms in MapReduce?


IIt is not always obvious how to express algorithms
IData structures play an important role
I Optimization is hard

The designer needs to bend the framework

Learn by examples
I Design patterns
I Synchronization is perhaps the most tricky aspect

Pietro Michiardi (Eurecom) Hadoop in Practice 40 / 76


Algorithm Design Preliminaries

Algorithm Design
Aspects that are not under the control of the designer
I Where a mapper or reducer will run
I When a mapper or reducer begins or finishes
I Which input key-value pairs are processed by a specific mapper
I Which intermediate key-value paris are processed by a specific
reducer

Aspects that can be controlled


I Construct data structures as keys and values
I Execute user-specified initialization and termination code for
mappers and reducers
I Preserve state across multiple input and intermediate keys in
mappers and reducers
I Control the sort order of intermediate keys, and therefore the order
in which a reducer will encounter particular keys
I Control the partitioning of the key space, and therefore the set of
keys that will be encountered by a particular reducer
Pietro Michiardi (Eurecom) Hadoop in Practice 41 / 76
Algorithm Design Preliminaries

Algorithm Design

MapReduce jobs can be complex


I Many algorithms cannot be easily expressed as a single
MapReduce job
I Decompose complex algorithms into a sequence of jobs
F Requires orchestrating data so that the output of one job becomes
the input to the next
I Iterative algorithms require an external driver to check for
convergence

Optimizations
I Scalability (linear)
I Resource requirements (storage and bandwidth)

Pietro Michiardi (Eurecom) Hadoop in Practice 42 / 76


Algorithm Design Local Aggregation

Warm up: WordCount and Local Aggregation

Pietro Michiardi (Eurecom) Hadoop in Practice 43 / 76


Algorithm Design Local Aggregation

Motivations
In the context of data-intensive distributed processing, the
most important aspect of synchronization is the exchange of
intermediate results
I This involves copying intermediate results from the processes that
produced them to those that consume them
I In general, this involves data transfers over the network
I In Hadoop, also disk I/O is involved, as intermediate results are
written to disk

Network and disk latencies are expensive


I Reducing the amount of intermediate data translates into
algorithmic efficiency

Combiners and preserving state across inputs


I Reduce the number and size of key-value pairs to be shuffled

Pietro Michiardi (Eurecom) Hadoop in Practice 44 / 76


Algorithm Design Local Aggregation

Word Counting in MapReduce

Pietro Michiardi (Eurecom) Hadoop in Practice 45 / 76


Algorithm Design Local Aggregation

Word Counting in MapReduce

Input:
I Key-value pairs: (docid, doc) stored on the distributed filesystem
I docid: unique identifier of a document
I doc: is the text of the document itself
Mapper:
I Takes an input key-value pair, tokenize the document
I Emits intermediate key-value pairs: the word is the key and the
integer is the value
The framework:
I Guarantees all values associated with the same key (the word) are
brought to the same reducer
The reducer:
I Receives all values associated to some keys
I Sums the values and writes output key-value pairs: the key is the
word and the value is the number of occurrences

Pietro Michiardi (Eurecom) Hadoop in Practice 46 / 76


Algorithm Design Local Aggregation

Technique 1: Combiners

Combiners are a general mechanism to reduce the amount of


intermediate data
I They could be thought of as mini-reducers

Example: word count


I Combiners aggregate term counts across documents processed by
each map task
I If combiners take advantage of all opportunities for local
aggregation we have at most m V intermediate key-value pairs
F m: number of mappers
F V : number of unique terms in the collection
I Note: due to Zipfian nature of term distributions, not all mappers will
see all terms

Pietro Michiardi (Eurecom) Hadoop in Practice 47 / 76


Algorithm Design Local Aggregation

Technique 2: In-Mapper Combiners

In-Mapper Combiners, a possible improvement


I Hadoop does not guarantee combiners to be executed

Use an associative array to cumulate intermediate results


I The array is used to tally up term counts within a single document
I The Emit method is called only after all InputRecords have been
processed

Example (see next slide)


I The code emits a key-value pair for each unique term in the
document

Pietro Michiardi (Eurecom) Hadoop in Practice 48 / 76


Algorithm Design Local Aggregation

In-Mapper Combiners

Pietro Michiardi (Eurecom) Hadoop in Practice 49 / 76


Algorithm Design Local Aggregation

In-Mapper Combiners

Summing up: a first design pattern, in-mapper combining


I Provides control over when local aggregation occurs
I Design can determine how exactly aggregation is done

Efficiency vs. Combiners


I There is no additional overhead due to the materialization of
key-value pairs
F Un-necessary object creation and destruction (garbage collection)
F Serialization, deserialization when memory bounded
I Mappers still need to emit all key-value pairs, combiners only
reduce network traffic

Pietro Michiardi (Eurecom) Hadoop in Practice 50 / 76


Algorithm Design Local Aggregation

In-Mapper Combiners

Precautions
I In-mapper combining breaks the functional programming paradigm
due to state preservation
I Preserving state across multiple instances implies that algorithm
behavior might depend on execution order
F Ordering-dependent bugs are difficult to find

Scalability bottleneck
I The in-mapper combining technique strictly depends on having
sufficient memory to store intermediate results
F And you dont want the OS to deal with swapping
I Multiple threads compete for the same resources
I A possible solution: block and flush
F Implemented with a simple counter

Pietro Michiardi (Eurecom) Hadoop in Practice 51 / 76


Algorithm Design Paris and Stripes

Exercise: Pairs and Stripes

Pietro Michiardi (Eurecom) Hadoop in Practice 52 / 76


Algorithm Design Paris and Stripes

Pairs and Stripes

A common approach in MapReduce: build complex keys


I Data necessary for a computation are naturally brought together by
the framework

Two basic techniques:


I Pairs
I Stripes

Next, we focus on a particular problem that benefits from


these two methods

Pietro Michiardi (Eurecom) Hadoop in Practice 53 / 76


Algorithm Design Paris and Stripes

Problem statement

The problem: building word co-occurrence matrices for large


corpora
I The co-occurrence matrix of a corpus is a square n n matrix
I n is the number of unique words (i.e., the vocabulary size)
I A cell mij contains the number of times the word wi co-occurs with
word wj within a specific context
I Context: a sentence, a paragraph a document or a window of m
words
I NOTE: the matrix may be symmetric in some cases

Motivation
I This problem is a basic building block for more complex operations
I Estimating the distribution of discrete joint events from a large
number of observations
I Similar problem in other domains:
F Customers who buy this tend to also buy that

Pietro Michiardi (Eurecom) Hadoop in Practice 54 / 76


Algorithm Design Paris and Stripes

Word co-occurrence: the Pairs approach


Input to the problem
I Key-value pairs in the form of a docid and a doc

The mapper:
I Processes each input document
I Emits key-value pairs with:
F Each co-occurring word pair as the key
F The integer one (the count) as the value
I This is done with two nested loops:
F The outer loop iterates over all words
F The inner loop iterates over all neighbors

The reducer:
I Receives pairs relative to co-occurring words
I Computes an absolute count of the joint event
I Emits the pair and the count as the final key-value output
F Basically reducers emit the cells of the matrix
Pietro Michiardi (Eurecom) Hadoop in Practice 55 / 76
Algorithm Design Paris and Stripes

Word co-occurrence: the Pairs approach

Pietro Michiardi (Eurecom) Hadoop in Practice 56 / 76


Algorithm Design Paris and Stripes

Word co-occurrence: the Stripes approach


Input to the problem
I Key-value pairs in the form of a docid and a doc

The mapper:
I Same two nested loops structure as before
I Co-occurrence information is first stored in an associative array
I Emit key-value pairs with words as keys and the corresponding
arrays as values

The reducer:
I Receives all associative arrays related to the same word
I Performs an element-wise sum of all associative arrays with the
same key
I Emits key-value output in the form of word, associative array
F Basically, reducers emit rows of the co-occurrence matrix

Pietro Michiardi (Eurecom) Hadoop in Practice 57 / 76


Algorithm Design Paris and Stripes

Word co-occurrence: the Stripes approach

Pietro Michiardi (Eurecom) Hadoop in Practice 58 / 76


Algorithm Design Paris and Stripes

Pairs and Stripes, a comparison

The pairs approach


I Generates a large number of key-value pairs (also intermediate)
I The benefit from combiners is limited, as it is less likely for a
mapper to process multiple occurrences of a word
I Does not suffer from memory paging problems

The stripes approach


I More compact
I Generates fewer and shorted intermediate keys
F The framework has less sorting to do
I The values are more complex and have serialization/deserialization
overhead
I Greately benefits from combiners, as the key space is the
vocabulary
I Suffers from memory paging problems, if not properly engineered

Pietro Michiardi (Eurecom) Hadoop in Practice 59 / 76


Algorithm Design Order Inversion

Order Inversion [Homework]

Pietro Michiardi (Eurecom) Hadoop in Practice 60 / 76


Algorithm Design Order Inversion

Computing relative frequenceies


Relative Co-occurrence matrix construction
I Similar problem as before, same matrix
I Instead of absolute counts, we take into consideration the fact that
some words appear more frequently than others
F Word wi may co-occur frequently with word wj simply because one of
the two is very common
I We need to convert absolute counts to relative frequencies f (wj |wi )
F What proportion of the time does wj appear in the context of wi ?

Formally, we compute:

N(wi , wj )
f (wj |wi ) = P 0
w 0 N(wi , w )

I N(, ) is the number of times a co-occurring word pair is observed


I The denominator is called the marginal
Pietro Michiardi (Eurecom) Hadoop in Practice 61 / 76
Algorithm Design Order Inversion

Computing relative frequenceies

The stripes approach


I In the reducer, the counts of all words that co-occur with the
conditioning variable (wi ) are available in the associative array
I Hence, the sum of all those counts gives the marginal
I Then we divide the the joint counts by the marginal and were done

The pairs approach


I The reducer receives the pair (wi , wj ) and the count
I From this information alone it is not possible to compute f (wj |wi )
I Fortunately, as for the mapper, also the reducer can preserve state
across multiple keys
F We can buffer in memory all the words that co-occur with wi and their
counts
F This is basically building the associative array in the stripes method

Pietro Michiardi (Eurecom) Hadoop in Practice 62 / 76


Algorithm Design Order Inversion

Computing relative frequenceies: a basic approach


We must define the sort order of the pair
I In this way, the keys are first sorted by the left word, and then by the
right word (in the pair)
I Hence, we can detect if all pairs associated with the word we are
conditioning on (wi ) have been seen
I At this point, we can use the in-memory buffer, compute the relative
frequencies and emit

We must define an appropriate partitioner


I The default partitioner is based on the hash value of the
intermediate key, modulo the number of reducers
I For a complex key, the raw byte representation is used to compute
the hash value
F Hence, there is no guarantee that the pair (dog, aardvark) and
(dog,zebra) are sent to the same reducer
I What we want is that all pairs with the same left word are sent to
the same reducer

Limitations of this approach


Pietro Michiardi (Eurecom)Hadoop in Practice 63 / 76
Algorithm Design Order Inversion

Computing relative frequenceies: order inversion

The key is to properly sequence data presented to reducers


I If it were possible to compute the marginal in the reducer before
processing the join counts, the reducer could simply divide the joint
counts received from mappers by the marginal
I The notion of before and after can be captured in the ordering of
key-value pairs
I The programmer can define the sort order of keys so that data
needed earlier is presented to the reducer before data that is
needed later

Pietro Michiardi (Eurecom) Hadoop in Practice 64 / 76


Algorithm Design Order Inversion

Computing relative frequenceies: order inversion

Recall that mappers emit pairs of co-occurring words as keys

The mapper:
I additionally emits a special key of the form (wi , )
I The value associated to the special key is one, that represtns the
contribution of the word pair to the marginal
I Using combiners, these partial marginal counts will be aggregated
before being sent to the reducers

The reducer:
I We must make sure that the special key-value pairs are processed
before any other key-value pairs where the left word is wi
I We also need to modify the partitioner as before, i.e., it would take
into account only the first word

Pietro Michiardi (Eurecom) Hadoop in Practice 65 / 76


Algorithm Design Order Inversion

Computing relative frequenceies: order inversion

Memory requirements:
I Minimal, because only the marginal (an integer) needs to be stored
I No buffering of individual co-occurring word
I No scalability bottleneck

Key ingredients for order inversion


I Emit a special key-value pair to capture the margianl
I Control the sort order of the intermediate key, so that the special
key-value pair is processed first
I Define a custom partitioner for routing intermediate key-value pairs
I Preserve state across multiple keys in the reducer

Pietro Michiardi (Eurecom) Hadoop in Practice 66 / 76


Algorithm Design PageRank

Solved Exercise: PageRank in MapReduce

Pietro Michiardi (Eurecom) Hadoop in Practice 67 / 76


Algorithm Design PageRank

Introduction

What is PageRank
I Its a measure of the relevance of a Web page, based on the
structure of the hyperlink graph
I Based on the concept of random Web surfer

Formally we have:
 1  X P(m)
P(n) = + (1 )
|G| C(m)
mL(n)

I |G| is the number of nodes in the graph


I is a random jump factor
I L(n) is the set of out-going links from page n
I C(m) is the out-degree of node m

Pietro Michiardi (Eurecom) Hadoop in Practice 68 / 76


Algorithm Design PageRank

PageRank in Details

PageRank is defined recursively, hence we need an interative


algorithm
I A node receives contributions from all pages that link to it

Consider the set of nodes L(n)


I A random surfer at m arrives at n with probability 1/C(m)
I Since the PageRank value of m is the probability that the random
surfer is at m, the probability of arriving at n from m is P(m)/C(m)

To compute the PageRank of n we need:


I Sum the contributions from all pages that link to n
I Take into account the random jump, which is uniform over all nodes
in the graph

Pietro Michiardi (Eurecom) Hadoop in Practice 69 / 76


Algorithm Design PageRank

PageRank in MapReduce: Example

Pietro Michiardi (Eurecom) Hadoop in Practice 70 / 76


Algorithm Design PageRank

PageRank in MapReduce: Example

Pietro Michiardi (Eurecom) Hadoop in Practice 71 / 76


Algorithm Design PageRank

PageRank in MapReduce

Sketch of the MapReduce algorithm


I The algorithm maps over the nodes
I Foreach node computes the PageRank mass the needs to be
distributed to neighbors
I Each fraction of the PageRank mass is emitted as the value, keyed
by the node ids of the neighbors
I In the shuffle and sort, values are grouped by node id
F Also, we pass the graph structure from mappers to reducers (for
subsequent iterations to take place over the updated graph)
I The reducer updates the value of the PageRank of every single
node

Pietro Michiardi (Eurecom) Hadoop in Practice 72 / 76


Algorithm Design PageRank

PageRank in MapReduce: Pseudo-Code

Pietro Michiardi (Eurecom) Hadoop in Practice 73 / 76


Algorithm Design PageRank

PageRank in MapReduce

Implementation details
I Loss of PageRank mass for sink nodes
I Auxiliary state information
I One iteration of the algorith
F Two MapReduce jobs: one to distribute the PageRank mass, the
other for dangling nodes and random jumps
I Checking for convergence
F Requires a driver program
F When updates of PageRank are stable the algorithm stops

Further reading on convergence and attacks


I Convergenge: [5, 2]
I Attacks: Adversarial Information Retrieval Workshop [1]

Pietro Michiardi (Eurecom) Hadoop in Practice 74 / 76


References

References I

[1] Adversarial information retrieval workshop.


[2] Monica Bianchini, Marco Gori, and Franco Scarselli.
Inside pagerank.
In ACM Transactions on Internet Technology, 2005.
[3] Silvio Lattanzi, Benjamin Moseley, Siddharth Suri, and Sergei
Vassilvitskii.
Filtering: a method for solving graph problems in mapreduce.
In Proc. of SPAA, 2011.
[4] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos.
Graphs over time: Densification laws, shrinking diamters and
possible explanations.
In Proc. of SIGKDD, 2005.

Pietro Michiardi (Eurecom) Hadoop in Practice 75 / 76


References

References II

[5] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry


Winograd.
The pagerank citation ranking: Bringin order to the web.
In Stanford Digital Library Working Paper, 1999.
[6] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert
Chansler.
The hadoop distributed file system.
In Proc. of the 26th IEEE Symposium on Massive Storage Systems
and Technologies (MSST). IEEE, 2010.
[7] Tom White.
Hadoop, The Definitive Guide.
OReilly, Yahoo, 2010.

Pietro Michiardi (Eurecom) Hadoop in Practice 76 / 76

Você também pode gostar