
Vlad Ureche

ROSEdu Tech Talks


Contents
● MapReduce
● Hadoop
● HDFS
● HBase
● Example
MapReduce (1)
● Google paper released in 2004
● labs.google.com/papers/mapreduce-osdi04.pdf
● Context
● Google cluster – many nodes – many hardware failures
● Lots of data
● Idea:
● Separate the administrative part from the algorithms
● Create a framework for all algorithms
● Move computation instead of moving data
MapReduce (2)
● “MapReduce is a programming model and an
associated implementation for processing
and generating large data sets”

● “Our implementation of MapReduce runs on a
large cluster of commodity machines and is
highly scalable: a typical MapReduce
computation processes many terabytes of
data on thousands of machines.”
MapReduce (3)

[Diagram: input data on nodes N1–N3 flows through MAP, then SHUFFLE and SORT, then REDUCE, with results written back to the nodes]
MapReduce (4)
● Map: <key1, value1> → List(<key2, value2>)
● Reduce: <key2, List(<value2>)> →
List(<value2>)

● Key1, key2 – anything that can be compared and checked for equality
● Value1, value2 – anything

● Map and Reduce functions are up to you!


● Fault tolerance, scheduling, concurrency – handled by the framework
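The two signatures above can be sketched as plain Java interfaces. This is a simplified model of the programming contract, not Hadoop's actual Mapper/Reducer API:

```java
import java.util.List;
import java.util.Map;

// Simplified model of the MapReduce programming contract.
// Illustrative interfaces only, not Hadoop's real API.
interface MapFn<K1, V1, K2, V2> {
    // Map: <key1, value1> -> List(<key2, value2>)
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFn<K2, V2> {
    // Reduce: <key2, List(value2)> -> List(value2)
    List<V2> reduce(K2 key, List<V2> values);
}
```

The framework owns everything between the two calls: it groups all intermediate pairs by key2 before invoking the reduce function.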
MapReduce (5)
● Example: count occurrences of 2-grams in a
book
● “The quick brown fox” →
– “The quick”
– “quick brown”
– “brown fox”
● Input: The book
● Map: <row number, row> → List(<2gram,
occur=1>)
● Reduce: <2gram,List(occur=1)> → sizeof(List)
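The 2-gram job above can be simulated in a single process to see the three phases end to end. A minimal sketch in plain Java, with no Hadoop dependency; the class and method names are illustrative:

```java
import java.util.*;

// Single-process simulation of the 2-gram counting job:
// map each row to (bigram, 1) pairs, shuffle by key, reduce to counts.
public class BigramCount {

    // Map: <row number, row> -> List(<2gram, 1>)
    static List<Map.Entry<String, Integer>> map(int rowNum, String row) {
        String[] words = row.trim().split("\\s+");
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(Map.entry(words[i] + " " + words[i + 1], 1));
        }
        return out;
    }

    // Reduce: <2gram, List(occur=1)> -> sizeof(List)
    static int reduce(String bigram, List<Integer> occurrences) {
        return occurrences.size();
    }

    static Map<String, Integer> run(List<String> rows) {
        // Shuffle phase: group intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < rows.size(); i++) {
            for (Map.Entry<String, Integer> e : map(i, rows.get(i))) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(e.getValue());
            }
        }
        // Reduce phase: one call per distinct key.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((k, v) -> counts.put(k, reduce(k, v)));
        return counts;
    }
}
```

Running `BigramCount.run(List.of("The quick brown fox"))` produces the three bigrams from the slide, each with count 1.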
MapReduce (6)
● When should you use MR?
● Lots of data
● Jobs can be parallel
● Lots of machines
● When not to use MR?
● Intensive computation on small data
● Jobs depend on each other
Hadoop (1)
● Hadoop is an open-source implementation of
the MapReduce framework
● Is a top-level project of the Apache Software Foundation
● Appeared two years after the MapReduce
paper
● Developed by companies:
● Yahoo
● Cloudera
● …
● And independent submitters
Hadoop (2)
● Used by everybody

http://wiki.apache.org/hadoop/PoweredBy
Hadoop (3)

[Diagram: a JobTracker coordinating five TaskTrackers]

● Completely automated
● Jobs are scheduled based on data locality
● Speculative execution
Hadoop (4)
● Code
● Is open source
● Java
● Build scripts
● Bash scripts
● Configuration files
Hadoop (5)
● Is part of a larger ecosystem
● HDFS – distributed file system
● HBase – distributed, column-oriented database
● Mahout – machine learning algorithm library
● Nutch – web crawler
● And lots of other stuff
Hadoop example
● Ad clicking log
● User information (Age, Location) database
● How could you use that to your advantage?

● Mahout – machine learning framework


HDFS (1)
● Distributed file system
● Modelled after the GFS paper
● labs.google.com/papers/gfs-sosp2003.pdf

● Stores multiple copies of data


● Seek time >> Scan time
● Move computation vs Move data
● Small File Problem (TM)
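The "seek time >> scan time" point can be made concrete with a back-of-the-envelope calculation. The figures used here (~10 ms per seek, ~100 MB/s sequential read, typical of a spinning disk) are illustrative assumptions, not from the slides:

```java
// Back-of-the-envelope: reading 1 GB as many small blocks vs. a few large ones.
// The 10 ms seek and 100 MB/s transfer figures are illustrative assumptions.
public class SeekVsScan {

    static double readTimeSeconds(long totalBytes, long blockBytes,
                                  double seekMs, double mbPerSec) {
        long seeks = totalBytes / blockBytes;            // one seek per block
        double seekTime = seeks * seekMs / 1000.0;       // total time lost seeking
        double scanTime = totalBytes / (mbPerSec * 1e6); // sequential transfer time
        return seekTime + scanTime;
    }

    public static void main(String[] args) {
        long oneGB = 1_000_000_000L;
        // 64 KB blocks: thousands of seeks dominate the read time.
        System.out.println(readTimeSeconds(oneGB, 64_000, 10, 100));
        // 128 MB blocks: seek overhead becomes negligible.
        System.out.println(readTimeSeconds(oneGB, 128_000_000, 10, 100));
    }
}
```

With small blocks the seeks cost an order of magnitude more than the transfer itself, which is why HDFS uses large blocks and why many tiny files (one block each) hurt.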
HDFS (2)

[Diagram: a NameNode, with a Secondary NameNode beside it, managing five DataNodes]

● 128MB blocks vs. 64K on a regular FS


● Limitation:
● Files cannot be edited (they can be appended to)
● Can be overcome
HDFS (3)
● Part of Hadoop
● Open source
● Java
● Build scripts
● Bash scripts
● Configuration files
HBase
● Distributed, column-oriented, sparse hash table
● Data is stored in HDFS
● Based on the BigTable paper by Google
● labs.google.com/papers/bigtable.html
HBase (2)
● Table
● Key
● Columns
– Column Families

● key=location:Romania;age:16;sex=M
● ads:copiutze.ro.clickProbability = 0.0018
● ads:copiutze.ro.bestPlacement = calendarPage
● …
● stats:clickProbability=0.0015
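The data model above (row key → column:qualifier → value, with most cells absent) can be modeled as nested sorted maps. This is a sketch of the BigTable/HBase storage model, not the HBase client API; the row key and column names below come from the slide:

```java
import java.util.*;

// Sketch of a BigTable/HBase-style sparse table: a sorted map from
// row key to "family:qualifier" columns to values.
// Not the real HBase client API.
public class SparseTable {
    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    String get(String rowKey, String column) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column); // absent cells cost nothing
    }
}
```

Usage mirroring the slide:

```java
SparseTable t = new SparseTable();
t.put("location:Romania;age:16;sex=M", "ads:copiutze.ro.clickProbability", "0.0018");
t.put("location:Romania;age:16;sex=M", "stats:clickProbability", "0.0015");
```

Sparseness is the point: a row stores only the columns it actually has, so millions of potential columns cost nothing for rows that never use them.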
HBase (3)

[Diagram: an HBase Master coordinating three Region Servers]

● Distribution scheme similar to that of HDFS

● Uses HDFS to store its files
Conclusion
● MapReduce
● Lots of input data
● Parallel jobs
● Lots of computers
● We could also talk about
● Mahout – machine learning
● Nutch – web crawling
● Lucene/Solr – search engine
● Pig, Cascading – frameworks over Hadoop
Questions?
Thank you!
