
Hadoop
Agenda
Architecture of HDFS and MapReduce
Hadoop Streaming
Hadoop Pipes
Basics of HBase and ZooKeeper
Hadoop Distributed Filesystem
When a dataset outgrows the storage capacity of a single physical machine, it becomes necessary to partition it across a number of separate machines.
Filesystems that manage the storage across a network of machines are called distributed filesystems.
Since they are network-based, all the complications of network programming kick in, thus making distributed filesystems more complex than regular disk filesystems.
For example, one of the biggest challenges is making the filesystem tolerate node failure without suffering data loss.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed Filesystem.
Design of HDFS
Very large files
Very large in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access
HDFS is built around the idea that the most efficient data processing pattern is a write-once, read-many-times pattern. A dataset is typically generated or copied from source, then various analyses are performed on that dataset over time.

Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run on clusters of commodity hardware for which the chance of node failure across the cluster is high, at least for large clusters. HDFS is designed to carry on working without a noticeable interruption to the user in the face of such failure.
Where HDFS is not a good fit
Low-latency data access
Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.

Lots of small files
Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.
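A quick sketch of the arithmetic behind the 300 MB figure (assuming, as the 300 MB total implies, that each file contributes one file object plus one block object at roughly 150 bytes each):

1,000,000 files x (1 file object + 1 block object) x 150 bytes ≈ 300,000,000 bytes ≈ 300 MB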

Multiple writers, arbitrary file modifications
Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
Why Is a Block in HDFS So Large?
HDFS blocks are large compared to disk blocks, and the reason is to minimize the cost of seeks.
By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block.
Thus the time to transfer a large file made of multiple blocks operates at the disk transfer rate.
A quick calculation shows that if the seek time is around 10 ms and the transfer rate is 100 MB/s, then to make the seek time 1% of the transfer time, we need to make the block size around 100 MB.
The default is actually 64 MB, although many HDFS installations use 128 MB blocks.
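Written out, the calculation above (using the slide's example numbers of 10 ms seek time and 100 MB/s transfer rate, not universal constants) is:

transfer time = seek time / 0.01 = 10 ms / 0.01 = 1 s
block size ≈ transfer time x transfer rate = 1 s x 100 MB/s = 100 MB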
Advantage of HDFS
Moving Computation is Cheaper than Moving Data
A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the dataset is huge.
The assumption is that it is often better to migrate the computation closer to where the data is located rather than moving the data to where the application is running.
HDFS provides interfaces for applications to move themselves closer to where the data is located.
HDFS Architecture
HDFS has a master/slave architecture.
An HDFS cluster consists of a single NameNode, a master server that manages the filesystem namespace and regulates access to files by clients.
A number of DataNodes, usually one per node in the cluster, manage the storage attached to the nodes they run on and serve read/write requests from clients.
It also has a secondary namenode, which periodically merges the namespace image with the edit log.
A client reading data from HDFS
The client opens the file it wishes to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem (DFS).
DFS calls the namenode, using RPC, to determine the locations of the first few blocks in the file.
For each block, the namenode returns the addresses of the datanodes that have a copy of that block, and the datanodes are sorted according to their proximity to the client.
The DFS returns an FSDataInputStream (an input stream that supports file seeks) to the client for it to read data from. The client then calls read() on the stream.
DFSInputStream connects to the first (closest) datanode for the first block in the file.
Data is streamed from the datanode back to the client, which calls read() repeatedly on the stream.
When the end of the block is reached, DFSInputStream will close the connection to the datanode, then find the best datanode for the next block.
When the client has finished reading, it calls close() on the FSDataInputStream.
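A minimal client-side sketch of this read path using the Hadoop FileSystem API (the path "sample.txt" is just an example; error handling is omitted):

// ReadFromHdfs.java - open an HDFS file and copy it to stdout.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();               // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);                   // for HDFS this is a DistributedFileSystem
    FSDataInputStream in = fs.open(new Path("sample.txt")); // open() triggers the namenode RPC
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);       // read() streams data from the datanodes
    } finally {
      IOUtils.closeStream(in);                              // close() releases the datanode connection
    }
  }
}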
Distance between nodes
A client writing data to HDFS
The client creates the file by calling create() on DFS.
DFS makes an RPC call to the namenode to create a new file in the filesystem's namespace, with no blocks associated with it.
The namenode performs various checks to make sure the file doesn't already exist, and that the client has the right permissions to create the file.
If these checks pass, the namenode makes a record of the new file; otherwise, file creation fails and the client is thrown an IOException.
The DFS returns an FSDataOutputStream for the client to start writing data to.
As the client writes data, DFSOutputStream splits it into packets, which it writes to an internal queue, called the data queue.
The data queue is consumed by the DataStreamer, whose responsibility it is to ask the namenode to allocate new blocks by picking a list of suitable datanodes to store the replicas.
The list of datanodes forms a pipeline; we'll assume the replication level is three, so there are three nodes in the pipeline.
The DataStreamer streams the packets to the first datanode in the pipeline, which stores each packet and forwards it to the second datanode in the pipeline. Similarly, the second datanode stores the packet and forwards it to the third (and last) datanode in the pipeline.
DFSOutputStream also maintains an internal queue of packets that are waiting to be acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only when it has been acknowledged by all the datanodes in the pipeline.
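A minimal client-side sketch of the write path (the path "output.txt" and the sample bytes are illustrative only; buffering options, progress callbacks, and error handling are omitted):

// WriteToHdfs.java - create an HDFS file and write a few bytes to it.
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // create() makes the namenode RPC that registers the new, empty file
    FSDataOutputStream out = fs.create(new Path("output.txt"));
    try {
      // writes are buffered into packets and pushed down the datanode pipeline
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    } finally {
      out.close();  // flushes remaining packets and waits for acknowledgements
    }
  }
}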
What if a datanode fails while data is being written to it?
First, the pipeline is closed.
Any packets in the ack queue are added to the front of the data queue so that datanodes that are downstream from the failed node will not miss any packets.
The current block on the good datanodes is given a new identity, which is communicated to the namenode, so that the partial block on the failed datanode will be deleted if the failed datanode recovers later on.
The failed datanode is removed from the pipeline and the remainder of the block's data is written to the two good datanodes in the pipeline.
The namenode notices that the block is under-replicated, and it arranges for a further replica to be created on another node.
Replication
Comparison with Other Systems
Why can't we use databases with lots of disks to do large-scale batch analysis? Why is MapReduce needed?

The answer to these questions comes from another trend in disk drives: seek time is improving more slowly than transfer rate.
Seeking is the process of moving the disk's head to a particular place on the disk to read or write data. It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk's bandwidth.
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well.
For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
Example 1: Problem statement
What's the highest recorded global temperature for each year in the dataset?
Map and Reduce with example
Example 2: Word count
For the given sample input (two files containing "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop"), the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
The second map emits:
< Hello, 1>
< Hadoop, 1>
< Goodbye, 1>
< Hadoop, 1>
After applying Combiner and sorting
The output of the first map:
< Bye, 1>
< Hello, 1>
< World, 2>
The output of the second map:
< Goodbye, 1>
< Hadoop, 2>
< Hello, 1>
Final output of reducer
Thus the output of the job is:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
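A minimal word-count mapper and reducer sketch that would produce the pairs above (class names are illustrative; the reducer can also serve as the combiner because summing counts is commutative and associative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits <word, 1> for every token in the input line.
class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);   // e.g. <Hello, 1>, <World, 1>, ...
    }
  }
}

// Reducer (also usable as the combiner): sums the counts for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));  // e.g. <Hadoop, 2>
  }
}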
MapReduce logical data flow
Writing MapReduce Program
We need three things: a ma p function, a reduce function, and
some code to run the job.
The m a p function is represented by the Mapper class, which
declares an abstract map() method.
The Mapper class is a generic type, with four formal type
parameters that specify the i n p u t key, i n p u t v a l u e ,
o u t p u t key, a n d o u t p u t v a l u e types of the m ap function.
For the present example, the input key is a long integer
off set, the input value is a line of text, the output key is a
year, and the output value is an air temperature (an integer).
The map() method also provides an instance of Context to
write the output to. In this case, we write the year as a Text
object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable.
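A sketch of such a mapper, assuming a simplified record layout in which each input line is "year<TAB>temperature" (the real dataset's fixed-column parsing is not shown on these slides, so the parsing below is illustrative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line of text.
// Output key: the year; output value: the air temperature.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");    // assumed "year<TAB>temperature" layout
    String year = fields[0];
    int airTemperature = Integer.parseInt(fields[1]);
    context.write(new Text(year), new IntWritable(airTemperature));
  }
}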
Writing MapReduce Program
The reduce function is represented by the Reducer class, which declares an abstract reduce() method.
Again, four formal type parameters are used to specify the input and output types, this time for the reduce function.
The input types of the reduce function must match the output types of the map function: Text and IntWritable. In this case, the output types of the reduce function are also Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far.
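A sketch of the corresponding reducer (same caveats as the mapper sketch above):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Input: a year and all temperatures recorded for it; output: the year and its maximum.
public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());  // keep the highest temperature seen so far
    }
    context.write(key, new IntWritable(maxValue));
  }
}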
Data locality optimization
Hadoop does its best to run the map task on a node where the input data resides in HDFS. This is called the data locality optimization.
Data Flow
A MapReduce job is a unit of work that the client wants to be performed; it consists of:
- The input data
- The MapReduce program
- Configuration information
Hadoop runs the job by dividing it into tasks:
- map tasks
- reduce tasks
Nodes that control the job execution process
The jobtracker coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
Tasktrackers run tasks and send progress reports to the jobtracker, which keeps a record of the overall progress of each job.
If a task fails, the jobtracker can reschedule it on a different tasktracker.
Input splits
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or just splits.
Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split.
What is the optimal input split size?
Optimal split size == block size.
If the split spanned two blocks, it would be unlikely that any HDFS node stored both blocks, so some of the split would have to be transferred across the network to the node running the map task, which is clearly less efficient than running the whole map task using local data.
Map tasks write their output to local disk, not to HDFS. Why?
Map output is intermediate output: it's processed by reduce tasks to produce the final output, and once the job is complete the map output can be thrown away. So storing it in HDFS, with replication, would be overkill.
Reduce Task
Reduce tasks don't have the advantage of data locality; the input to a single reduce task is normally the output from all mappers.
MapReduce data flow with a single reduce task
Multiple reduce tasks
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task.
There can be many keys (and their associated values) in each partition, but the records for every key are all in a single partition.
The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well, as sketched below.
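A minimal partitioner sketch that buckets keys by hash, in the spirit of Hadoop's default hash partitioner (the modulo-by-number-of-reducers scheme is the standard approach; treat the class itself as illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Buckets map output keys into numPartitions partitions (one per reduce task).
// All records with the same key hash to the same partition.
public class HashBucketPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // mask off the sign bit so the result is non-negative, then take the modulo
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}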
MapReduce data flow with multiple reduce tasks
How many reduce tasks?
The number of reduce tasks is not governed by the size of the input, but is specified independently.
MapReduce data flow with no reduce tasks
Combiner function
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks.
Hadoop allows the user to specify a combiner function to be run on the map output.
The output of the combiner function is the input to the reducer (a driver sketch showing how a combiner is plugged in follows the figures below).
Without combiner function
If we use a combiner
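A minimal job-driver sketch showing where the combiner is plugged in. It reuses the hypothetical MaxTemperatureMapper and MaxTemperatureReducer from the earlier sketches; reusing the reducer as the combiner is valid here because taking a maximum is commutative and associative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {
  public static void main(String[] args) throws Exception {
    // newer Hadoop versions prefer Job.getInstance(conf, "max temperature")
    Job job = new Job(new Configuration(), "max temperature");
    job.setJarByClass(MaxTemperatureDriver.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path must not already exist

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);  // combiner runs on each map's output
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}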
How MapReduce works
At the highest level, there are four independent entities:
- The client, which submits the MapReduce job.
- The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.
- The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.
- The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.
How Hadoop runs a MapReduce job
Job submission process
The job submission process implemented by JobClient's submitJob() method does the following:
- Asks the jobtracker for a new job ID (step 2).
- Checks the output specification of the job.
- Computes the input splits for the job. If the splits cannot be computed, because the input paths don't exist, for example, then the job is not submitted and an error is thrown to the MapReduce program.
- Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID (step 3). The job JAR is copied with a high replication factor (the default, mapred.submit.replication, is 10), so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job.
- Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
Job initialization
When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue from where the job scheduler will pick it up and initialize it.
Initialization involves:
- creating an object to represent the job being run, which encapsulates its tasks
- bookkeeping information to keep track of the tasks' status and progress (step 5)
To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the JobClient from the shared filesystem (step 6).
It then creates one map task for each split.
The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the JobConf. The scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
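For illustration, setting this property from the old (mapred) API looks roughly like the following sketch (the value 2 is an arbitrary example):

import org.apache.hadoop.mapred.JobConf;

public class ReduceTaskCount {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    conf.setNumReduceTasks(2);                  // convenience setter for mapred.reduce.tasks
    // equivalently: conf.set("mapred.reduce.tasks", "2");
  }
}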
Task Assignment
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.
Heartbeats tell the jobtracker that a tasktracker is alive.
As part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value (step 7).
Tasktrackers have a fixed number of slots for map tasks and for reduce tasks.
The default scheduler fills empty map task slots before reduce task slots: if the tasktracker has at least one empty map task slot, the jobtracker will select a map task; otherwise, it will select a reduce task.
Task Execution
Once the tasktracker has been assigned a task:
It localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed from the distributed cache by the application to the local disk (step 8).
It creates a local working directory for the task, and un-jars the contents of the JAR into this directory.
It creates an instance of TaskRunner to run the task.
Task Runner
TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10).
Why a new JVM for each task?
Any bugs in the user-defined map and reduce functions don't affect the tasktracker (by causing it to crash or hang).
The child process communicates with its parent through the umbilical interface. It informs the parent of the task's progress every few seconds until the task is complete.
User Defined Counters
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write your map and reduce functions in languages other than Java.
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can use any language that can read standard input and write to standard output to write your MapReduce program, for example Ruby or Python.

hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
  -input input/sample.txt \
  -output output \
  -mapper src/main/ruby/mapper.rb \
  -reducer src/main/ruby/reducer.rb
Hadoop Pipes
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the tasktracker communicates with the process running the C++ map or reduce function. JNI is not used.

hadoop fs -put max_temperature bin/max_temperature
hadoop fs -put input/sample.txt sample.txt

hadoop pipes \
  -D hadoop.pipes.java.recordreader=true \
  -D hadoop.pipes.java.recordwriter=true \
  -input sample.txt \
  -output output \
  -program bin/max_temperature
Job status update
Data flow for two jobs
HBase
HBase is a distributed, column-oriented database built on top of HDFS.
HBase is the Hadoop application to use when you require real-time read/write random access to very large datasets.
ZooKeeper
ZooKeeper allows distributed processes to coordinate with each other.
Links
http://hadoop.apache.org/common/docs/current/hdfs_design.html
http://hadoop.apache.org/mapreduce/
http://www.cloudera.com/hadoop/
http://www.cloudera.com/
http://www.cloudera.com/hadoop-training/
http://www.cloudera.com/resources/?type=Training
http://blog.adku.com/2011/02/hbase-vs-cassandra.html
http://www.google.co.in/search?q=hbase+tutorial&ie=utf-8&oe=utf-8&aq=t&rls=o
