viraj@yahoo-inc.com
About Me
Principal Engineer in the Yahoo! Grid Team since May 2008
PhD from Rutgers University, NJ
Specialization in Data Streaming, Grid, Autonomic Computing
HDFS Federation
YARN
Hadoop 23 User Impact
Oozie
HCatalog
PIG
Hive
Map Reduce
HBase
File Format (HFile)
HDFS
Big Data
2004: Google Map Reduce, BigTable
2005: Hadoop
2006: Google Dremel
2006: PIG
2012: Cloudera Impala
2010: Microsoft Stream Insight
2010: Google Percolator
2011-2012: Berkeley Spark, Twitter Storm
Map
Reduce
Parallelism
Map is inherently parallel
Each list element processed independently
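A minimal sketch of the point above, assuming a word-count job: because each list element is mapped independently, the map calls can run concurrently, and a reduce then folds the partial results together. The names `count_words`, `merge_counts` and `word_count` are illustrative only and are not part of Hadoop; a thread pool stands in for the cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_words(line):
    """Map step: turn one input line into a {word: count} dictionary."""
    counts = {}
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial {word: count} dictionaries."""
    merged = dict(left)
    for word, count in right.items():
        merged[word] = merged.get(word, 0) + count
    return merged


def word_count(lines, workers=4):
    # Each line is mapped independently, so the calls can run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, lines))
    # The reduce folds the independent partial results into one answer.
    return reduce(merge_counts, partials, {})
```

Calling `word_count(["a b a", "b c"])` maps the two lines separately and merges the partial counts; no map call needs to see another line's data.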
HDFS
Data is organized into files and directories
HADOOP 23 FEATURES
HDFS FEDERATION
[Diagram: Namenode containing Namespace and Block Management; Datanodes providing Storage]
Block Storage
Block Management
Datanode cluster membership
Supports create/delete/modify/get block location operations
Manages replication and replica placement
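The block management duties listed above can be pictured with a toy model. This is a sketch under stated assumptions, not the HDFS implementation: the class `BlockManager`, its method names, and the "least-loaded datanode first" placement rule are all hypothetical simplifications.

```python
class BlockManager:
    """Toy model of the block management layer; not the HDFS implementation."""

    def __init__(self, replication=3):
        self.replication = replication
        self.live_datanodes = set()   # datanode cluster membership
        self.block_locations = {}     # block id -> set of datanode names

    def register_datanode(self, name):
        self.live_datanodes.add(name)

    def create_block(self, block_id):
        # Assumed placement rule: pick the least-loaded live datanodes.
        load = {dn: 0 for dn in self.live_datanodes}
        for holders in self.block_locations.values():
            for dn in holders:
                if dn in load:
                    load[dn] += 1
        targets = sorted(load, key=lambda dn: (load[dn], dn))[: self.replication]
        self.block_locations[block_id] = set(targets)
        return targets

    def get_block_locations(self, block_id):
        return sorted(self.block_locations.get(block_id, ()))

    def delete_block(self, block_id):
        self.block_locations.pop(block_id, None)

    def datanode_lost(self, name):
        """Drop a datanode and report blocks now below the replication target."""
        self.live_datanodes.discard(name)
        under_replicated = []
        for block_id, holders in self.block_locations.items():
            holders.discard(name)
            if len(holders) < self.replication:
                under_replicated.append(block_id)
        return sorted(under_replicated)
```

A real block manager would go on to schedule new replicas for the blocks returned by `datanode_lost`; the sketch stops at detecting them.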
Implemented as
Single Namespace Volume
[Diagram: Namenode with NS and Block Management; Blocks; Datanodes providing Storage]
Performance
File system operations throughput limited by a single node
on Block Storage
HBase
Foreign namespaces
HDFS Federation
[Diagram: namenodes NN-1 ... NN-k ... NN-n each serve one namespace (NS1 ... NS k ... Foreign NS n). Each namespace has its own block pool (Pool 1 ... Pool k ... Pool n) in the Block Storage layer. Datanode 1, Datanode 2 ... Datanode m provide common storage shared by all block pools.]
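The relationship between namenodes, block pools and shared datanodes can be sketched as follows. This is an illustration only: the classes `Datanode` and `Namenode`, the pool identifiers, and the simplification that every block is written to every datanode passed in are assumptions, not the HDFS design.

```python
class Datanode:
    """Toy datanode: common storage shared by every block pool."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # pool id -> set of block ids

    def store(self, pool_id, block_id):
        self.blocks.setdefault(pool_id, set()).add(block_id)

    def block_report(self, pool_id):
        # Each namenode only hears about the blocks in its own pool.
        return sorted(self.blocks.get(pool_id, ()))


class Namenode:
    """Toy namenode owning one namespace and one block pool."""

    def __init__(self, namespace, pool_id):
        self.namespace = namespace
        self.pool_id = pool_id
        self.files = {}  # path -> list of block ids

    def add_file(self, path, block_ids, datanodes):
        self.files[path] = list(block_ids)
        for block_id in block_ids:
            for dn in datanodes:
                dn.store(self.pool_id, block_id)
```

Because blocks are keyed by pool, two namenodes can use the same block id or the same path without interfering with each other, even when the blocks sit on the same datanode.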
Managing Namespaces
Client-side mount-table
[Diagram: a client-side mount table maps the top-level directories data, project, home and tmp to namespaces NS1, NS2, NS3 and NS4]
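A client-side mount table can be thought of as a prefix lookup performed before any namenode is contacted. The sketch below is a toy under stated assumptions: the class `MountTable` is hypothetical, and so is the particular assignment of directories to namespaces and the `hdfs://NS…` target strings, since the slide does not spell out which directory maps to which namespace.

```python
class MountTable:
    """Toy client-side mount table: path prefix -> target namespace URI."""

    def __init__(self, links):
        # Longest prefixes first, so a nested mount would win over its parent.
        self.links = sorted(links.items(), key=lambda kv: len(kv[0]), reverse=True)

    def resolve(self, path):
        for prefix, target in self.links:
            if path == prefix or path.startswith(prefix + "/"):
                return target + path[len(prefix):]
        raise KeyError(f"no mount point covers {path}")


# Hypothetical assignment of directories to namespaces.
table = MountTable({
    "/data": "hdfs://NS1/data",
    "/project": "hdfs://NS2/project",
    "/home": "hdfs://NS3/home",
    "/tmp": "hdfs://NS4/tmp",
})
```

With this table, `table.resolve("/home/viraj/x")` picks the namespace holding `/home`, so applications keep using one path tree while the data is spread over several namenodes.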
Hadoop 0.23:
<property>
<name>fs.default.name</name>
<value>viewfs://ClusterName/</value>
</property>
TaskTracker
Per-node agent
Manage tasks
Availability
Scalability - Clusters of 6,000-10,000 machines
Each machine with 16 cores, 48 GB/96 GB RAM, 24 TB/36 TB disks
Wire Compatibility
Design Methodology
Split up the two major functions of JobTracker
Cluster resource management
Application life-cycle management
Architecture
Architecture of YARN
Resource Manager
Global resource scheduler
Hierarchical queues
Node Manager
Per-machine agent
Manages the life-cycle of containers
Container resource monitoring
Application Master
Per-application
Manages application scheduling and task execution
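The division of labour among the three components can be sketched as a toy model. Everything here is an assumption made for illustration: the class and method names are not the YARN API, memory is the only resource tracked, hierarchical queues are left out, and tasks run synchronously instead of inside real containers.

```python
class NodeManager:
    """Per-machine agent: launches containers and tracks their memory."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = {}  # container id -> memory in MB

    def launch(self, container_id, memory_mb):
        self.containers[container_id] = memory_mb
        self.free_mb -= memory_mb

    def release(self, container_id):
        self.free_mb += self.containers.pop(container_id)


class ResourceManager:
    """Global scheduler: hands out containers, knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.next_id = 0

    def allocate(self, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                self.next_id += 1
                container_id = f"container_{self.next_id}"
                node.launch(container_id, memory_mb)
                return node, container_id
        return None  # no capacity right now


class ApplicationMaster:
    """Per-application: decides which task runs in which container."""

    def __init__(self, rm, tasks, memory_mb=1024):
        self.rm = rm
        self.pending = list(tasks)   # list of (name, callable) pairs
        self.memory_mb = memory_mb
        self.results = {}

    def run(self):
        while self.pending:
            grant = self.rm.allocate(self.memory_mb)
            if grant is None:
                break  # a real master would wait and ask again
            node, container_id = grant
            name, func = self.pending.pop(0)
            self.results[name] = func()   # stand-in for running in the container
            node.release(container_id)    # container finished
        return self.results
```

The point of the split is visible in the code: `ResourceManager` never touches task logic, so a different kind of application only needs a different `ApplicationMaster`.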
Application Master
Optional failover via application-specific checkpoint
MapReduce applications pick up where they left off
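One way to picture checkpoint-based failover is a master that records each finished task and, on restart, skips whatever an earlier attempt completed. This is a sketch only: the function `run_with_checkpoint`, the JSON checkpoint file, and the `fail_after` crash switch are hypothetical and do not describe how the MapReduce Application Master stores its state.

```python
import json
import os


def run_with_checkpoint(tasks, checkpoint_path, fail_after=None):
    """Run tasks in order, recording finished task names in a checkpoint file."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)  # resume: skip what an earlier attempt finished
    executed = []
    for name in tasks:
        if name in done:
            continue
        if fail_after is not None and len(executed) >= fail_after:
            raise RuntimeError("simulated Application Master crash")
        executed.append(name)         # stand-in for real task execution
        done.append(name)
        with open(checkpoint_path, "w") as handle:
            json.dump(done, handle)
    return executed
```

If a first attempt crashes after two of three tasks, a second call with the same checkpoint path executes only the remaining task, which is the "pick up where they left off" behaviour in miniature.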
Iterative processing
Enabled by allowing the use of a paradigm-specific Application Master
Run all on the same Hadoop cluster
Performance Improvements
Small Job Optimizations
Runs all tasks of a small job (i.e. a job with up to 3 to 4 tasks) entirely in the Application Master's JVM
Reduces JVM startup time and eliminates inter-node and inter-process data transfer during the shuffle phase
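The decision behind the small-job optimization can be sketched as a simple threshold check. The function `plan_job`, the default limit of 4 tasks (taken from the "3 to 4 tasks" figure above), and the container arithmetic are assumptions for illustration, not the actual scheduler logic.

```python
def plan_job(num_maps, num_reduces, small_job_limit=4):
    """Decide whether a job runs inside the Application Master or is distributed."""
    total = num_maps + num_reduces
    if total <= small_job_limit:
        # Small job: every task runs in the Application Master's own JVM,
        # so the only container needed is the master's.
        return {"mode": "in_application_master", "containers": 1}
    # Larger job: one container for the master plus one per task.
    return {"mode": "distributed", "containers": 1 + total}
```

Under these assumptions a job with 3 maps and 1 reduce needs a single container, while a job with 10 maps and 2 reduces needs 13, which is where the saved JVM startups come from.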
Surprisingly Stable
Web Services
Better Utilization of Resources at Yahoo!
No fixed partitioning between Map and Reduce Tasks
$HADOOP_COMMON_HOME
$HADOOP_MAPRED_HOME
$HADOOP_HDFS_HOME
New Usage
mapred queue -showacls
hdfs dfs -ls <path>
mapred job -kill <job_id>
Hadoop Java programs will not require any code change; however, users have to recompile with Hadoop 0.23.
If a code change is required, please let us know.
Hadoop 0.20.204 or 0.20.205: 0.9.1
Hadoop 23: 0.9.2
Oozie Version
Hadoop 0.20.205: 3.1.2
Hadoop 23: 3.1.4
Resource Manager
Acknowledgements
YARN: Robert Evans, Thomas Graves, Jason Lowe
References
0.23 Documentation
http://people.apache.org/~acmurthy/hadoop-0.23/
YARN Documentation
http://people.apache.org/~acmurthy/hadoop-0.23/hadoopyarn/hadoop-yarn-site/YARN.html