
Vlad Ureche

ROSEdu Tech Talks


Contents
● MapReduce
● Hadoop
● HDFS
● HBase
● Example
MapReduce (1)
● Google paper released in 2004
● labs.google.com/papers/mapreduce-osdi04.pdf
● Context
● Google cluster – many nodes – many hardware failures
● Lots of data
● Idea:
● Separate the administrative part from the algorithms
● Create a framework for all algorithms
● Move computation instead of moving data
MapReduce (2)
● “MapReduce is a programming model and an
associated implementation for processing
and generating large data sets”

● “Our implementation of MapReduce runs on a
large cluster of commodity machines and is
highly scalable: a typical MapReduce
computation processes many terabytes of
data on thousands of machines.”
MapReduce (3)

[Diagram: input data on nodes N1–N3 flows through MAP, then SHUFFLE and SORT, then REDUCE, with results written back to the nodes]
MapReduce (4)
● Map: <key1, value1> → List(<key2, value2>)
● Reduce: <key2, List(<value2>)> →
List(<value2>)

● Key1, key2 – anything that can be compared and checked for equality
● Value1, value2 – anything

● Map and Reduce functions are up to you!


● Fault tolerance, scheduling, concurrency – handled by the framework
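The two signatures above can be sketched as plain Java interfaces. This is a simplified model of the programming contract, not Hadoop's actual Mapper/Reducer API:

```java
import java.util.List;
import java.util.Map;

// Simplified model of the MapReduce programming contract.
// Illustrative interfaces only, not Hadoop's real API.
interface MapFn<K1, V1, K2, V2> {
    // Map: <key1, value1> -> List(<key2, value2>)
    List<Map.Entry<K2, V2>> map(K1 key, V1 value);
}

interface ReduceFn<K2, V2> {
    // Reduce: <key2, List(value2)> -> List(value2)
    List<V2> reduce(K2 key, List<V2> values);
}
```

The framework owns everything between the two calls: it groups all intermediate pairs by key2 before invoking the reduce function.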
MapReduce (5)
● Example: count occurrences of 2-grams in a
book
● “The quick brown fox” →
– “The quick”
– “quick brown”
– “brown fox”
● Input: The book
● Map: <row number, row> → List(<2gram,
occur=1>)
● Reduce: <2gram,List(occur=1)> → sizeof(List)
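The 2-gram job above can be simulated in a single process to see the three phases end to end. A minimal sketch in plain Java, with no Hadoop dependency; the class and method names are illustrative:

```java
import java.util.*;

// Single-process simulation of the 2-gram counting job:
// map each row to (bigram, 1) pairs, shuffle by key, reduce to counts.
public class BigramCount {

    // Map: <row number, row> -> List(<2gram, 1>)
    static List<Map.Entry<String, Integer>> map(int rowNum, String row) {
        String[] words = row.trim().split("\\s+");
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            out.add(Map.entry(words[i] + " " + words[i + 1], 1));
        }
        return out;
    }

    // Reduce: <2gram, List(occur=1)> -> sizeof(List)
    static int reduce(String bigram, List<Integer> occurrences) {
        return occurrences.size();
    }

    static Map<String, Integer> run(List<String> rows) {
        // Shuffle phase: group intermediate values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (int i = 0; i < rows.size(); i++) {
            for (Map.Entry<String, Integer> e : map(i, rows.get(i))) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                       .add(e.getValue());
            }
        }
        // Reduce phase: one call per distinct key.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((k, v) -> counts.put(k, reduce(k, v)));
        return counts;
    }
}
```

Running `BigramCount.run(List.of("The quick brown fox"))` produces the three bigrams from the slide, each with count 1.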
MapReduce (6)
● When should you use MR?
● Lots of data
● Jobs can be parallel
● Lots of machines
● When not to use MR?
● Intensive computation on small data
● Jobs depend on each other
Hadoop (1)
● Hadoop is an open-source implementation of
the MapReduce framework
● Is a top-level project of the Apache Software Foundation
● Appeared two years after the MapReduce
paper
● Developed by companies:
● Yahoo
● Cloudera
● …
● And independent submitters
Hadoop (2)
● Used by everybody

http://wiki.apache.org/hadoop/PoweredBy
Hadoop (3)

[Diagram: a JobTracker coordinating five TaskTrackers]

● Completely automated
● Jobs are scheduled based on data locality
● Speculative execution
Hadoop (4)
● Code
● Is open source
● Java
● Build scripts
● Bash scripts
● Configuration files
Hadoop (5)
● Is part of a larger ecosystem
● HDFS – distributed file system
● HBase – distributed, column-oriented database
● Mahout – machine learning algorithm library
● Nutch – web crawler
● And lots of other stuff
Hadoop example
● Ad clicking log
● User information (Age, Location) database
● How could you use that to your advantage?

● Mahout – machine learning framework


HDFS (1)
● Distributed file system
● Modelled after the GFS paper
● labs.google.com/papers/gfs-sosp2003.pdf

● Stores multiple copies of data


● Seek time >> Scan time
● Move computation vs Move data
● Small File Problem (TM)
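The "seek time >> scan time" point can be made concrete with a back-of-the-envelope calculation. The figures used here (~10 ms per seek, ~100 MB/s sequential read, typical of a spinning disk) are illustrative assumptions, not from the slides:

```java
// Back-of-the-envelope: reading 1 GB as many small blocks vs. a few large ones.
// The 10 ms seek and 100 MB/s transfer figures are illustrative assumptions.
public class SeekVsScan {

    static double readTimeSeconds(long totalBytes, long blockBytes,
                                  double seekMs, double mbPerSec) {
        long seeks = totalBytes / blockBytes;            // one seek per block
        double seekTime = seeks * seekMs / 1000.0;       // total time lost seeking
        double scanTime = totalBytes / (mbPerSec * 1e6); // sequential transfer time
        return seekTime + scanTime;
    }

    public static void main(String[] args) {
        long oneGB = 1_000_000_000L;
        // 64 KB blocks: thousands of seeks dominate the read time.
        System.out.println(readTimeSeconds(oneGB, 64_000, 10, 100));
        // 128 MB blocks: seek overhead becomes negligible.
        System.out.println(readTimeSeconds(oneGB, 128_000_000, 10, 100));
    }
}
```

With small blocks the seeks cost an order of magnitude more than the transfer itself, which is why HDFS uses large blocks and why many tiny files (one block each) hurt.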
HDFS (2)

[Diagram: a NameNode, with a Secondary NameNode beside it, managing five DataNodes]

● 128MB blocks vs. 64K on a regular FS


● Limitation:
● Files cannot be edited (they can be appended to)
● Can be overcome
HDFS (3)
● Part of Hadoop
● Open source
● Java
● Build scripts
● Bash scripts
● Configuration files
HBase
● Distributed, column-oriented, sparse hash table
● Data is stored in HDFS
● Based on the BigTable paper by Google
● labs.google.com/papers/bigtable.html
HBase (2)
● Table
● Key
● Columns
– Column Families

● key=location:Romania;age:16;sex=M
● ads:copiutze.ro.clickProbability = 0.0018
● ads:copiutze.ro.bestPlacement = calendarPage
● …
● stats:clickProbability=0.0015
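The data model above (row key → column:qualifier → value, with most cells absent) can be modeled as nested sorted maps. This is a sketch of the BigTable/HBase storage model, not the HBase client API; the row key and column names below come from the slide:

```java
import java.util.*;

// Sketch of a BigTable/HBase-style sparse table: a sorted map from
// row key to "family:qualifier" columns to values.
// Not the real HBase client API.
public class SparseTable {
    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    String get(String rowKey, String column) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column); // absent cells cost nothing
    }
}
```

Usage mirroring the slide:

```java
SparseTable t = new SparseTable();
t.put("location:Romania;age:16;sex=M", "ads:copiutze.ro.clickProbability", "0.0018");
t.put("location:Romania;age:16;sex=M", "stats:clickProbability", "0.0015");
```

Sparseness is the point: a row stores only the columns it actually has, so millions of potential columns cost nothing for rows that never use them.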
HBase (3)

[Diagram: an HBase Master coordinating three Region Servers]

● Distribution scheme similar to that of HDFS

● Uses HDFS to store its files
Conclusion
● MapReduce
● Lots of input data
● Parallel jobs
● Lots of computers
● We could also talk about
● Mahout – machine learning
● Nutch – web crawling
● Lucene/Solr – search engine
● Pig, Cascading – frameworks over Hadoop
Questions?
Thank you!
