TigerHATS
www.tigerhats.org
Hadoop
Hadoop is an open-source implementation of the MapReduce platform and a distributed file system, written in Java. This module explains the basics of getting started with Hadoop so you can experiment and learn from the rest of this tutorial. It covers setting up the platform and connecting other tools to use it. Source: http://developer.yahoo.com/hadoop/tutorial/module3.html
What Hadoop is
- Inspired by Google's infrastructure
- Distributed file system similar to Google File System
- Parallel programming model similar to Google MapReduce
- Parallel database similar to Google Bigtable
- Open-source Java project
Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.
Hadoop
Distributed file system (HDFS)
HDFS, MapReduce
Commodity Hardware
- Typically a 2-level architecture
- Nodes are commodity PCs
- 30-40 nodes per rack
- Uplink from rack is 3-4 gigabit
- Rack-internal is 1 gigabit
HDFS Architecture
Cluster Membership
NameNode
Secondary NameNode
Client
- NameNode: maps a file to a file-id and a list of DataNodes
- DataNode: maps a block-id to a physical location on disk
- SecondaryNameNode: periodic merge of the transaction log
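The split between NameNode metadata and DataNode storage can be pictured with a small conceptual sketch (this is illustrative Python, not Hadoop's actual code; the paths and block ids are made up):

```python
# Conceptual sketch of the metadata each role keeps (not Hadoop source).
# NameNode: file name -> ordered list of block ids
# DataNode: block id -> physical location on that node's local disk
namenode = {
    "/logs/2009-03-01.txt": ["blk_1001", "blk_1002"],  # file split into blocks
}
datanode_1 = {
    "blk_1001": "/data/dfs/blk_1001",  # physical path on this DataNode
}

def read_file(path):
    """A client asks the NameNode for the block list, then reads each
    block directly from a DataNode that holds it."""
    blocks = namenode[path]
    return [datanode_1.get(b, "<on another DataNode>") for b in blocks]

locations = read_file("/logs/2009-03-01.txt")
# locations == ["/data/dfs/blk_1001", "<on another DataNode>"]
```

Note that file data never flows through the NameNode: it only serves metadata, and clients talk to DataNodes directly.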
DataNodes
Data Flow
Web Servers
Scribe Servers
Network Storage
Oracle RAC
Hadoop Cluster
MySQL
Image Source:
http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
Very Large Distributed File System
- 10K nodes, 100 million files, 10 PB
Assumes Commodity Hardware
- Files are replicated to handle hardware failure
- Detects failures and recovers from them
Optimized for Batch Processing
- Data locations exposed so that computations can move to where data resides
- Provides very high aggregate bandwidth
Runs in user space, on heterogeneous OSes
Data Coherency
- Write-once-read-many access model
- Client can only append to existing files
Files are broken up into blocks
- Typically 128 MB block size
- Each block replicated on multiple DataNodes
Intelligent Client
- Client can find the location of blocks
- Client accesses data directly from the DataNode
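The block-and-replica layout above can be sketched as a small placement routine (a toy round-robin placement for illustration only; real HDFS placement is rack-aware and the node names are made up):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size cited above
REPLICATION = 3                  # a typical HDFS replication factor

def plan_blocks(file_size, nodes, replication=REPLICATION):
    """Split a file into fixed-size blocks and assign each block's
    replicas to distinct DataNodes (round-robin here for simplicity)."""
    n_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    plan = []
    for i in range(n_blocks):
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((f"blk_{i}", replicas))
    return plan

plan = plan_blocks(300 * 1024 * 1024, ["node1", "node2", "node3", "node4"])
# A 300 MB file becomes 3 blocks, each replicated on 3 distinct nodes.
```

Because each block lives on several nodes, the loss of any single DataNode leaves every block still readable, and the scheduler can run a computation on whichever replica is closest.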
MapReduce Paradigm
- Simple data-parallel programming model designed for scalability and fault tolerance
- Framework for distributed processing of large data sets
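The model boils down to three phases: map emits (key, value) pairs, the framework groups values by key (shuffle), and reduce folds each group. A minimal single-process sketch, using the canonical word-count example (this simulates the model in plain Python; it is not the Hadoop API):

```python
from collections import defaultdict

def map_fn(line):
    # map: emit (word, 1) for every word in the input line
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # reduce: sum the counts emitted for one word
    return (word, sum(counts))

def mapreduce(lines, map_fn, reduce_fn):
    groups = defaultdict(list)
    for line in lines:                 # map phase
        for key, value in map_fn(line):
            groups[key].append(value)  # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())  # reduce phase

counts = mapreduce(["the cat", "the dog"], map_fn, reduce_fn)
# counts == {"the": 2, "cat": 1, "dog": 1}
```

In a real cluster the map calls run in parallel across DataNodes holding the input blocks, and the shuffle moves data over the network; the user-visible contract is just the two functions.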
At Google:
- Index construction for Google Search
- Article clustering for Google News
- Statistical machine translation
At Yahoo!:
- Web map powering Yahoo! Search
- Spam detection for Yahoo! Mail
At Facebook:
- Data mining
- Ad optimization
- Spam detection
Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo!
If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
- Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
- A single straggler may noticeably slow down a job
Takeaways
By providing a data-parallel programming model, MapReduce can control job execution in useful ways:
- Automatic division of a job into tasks
- Automatic placement of computation near data
- Automatic load balancing
- Recovery from failures & stragglers
Search
Input: (lineNumber, line) records
Output: lines matching a given pattern
Map: output the line if it matches the pattern
Reduce: identity function
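The search job is a pure map-side filter: only matching lines are emitted, and no aggregation is needed. A local sketch of the logic (plain Python simulating the job, not the Hadoop API):

```python
def search_map(line_number, line, pattern):
    # map: emit the record only when the line matches the pattern
    if pattern in line:
        yield (line_number, line)

def run_search(records, pattern):
    # records are (lineNumber, line) pairs, as in the slide;
    # reduce is the identity, so we just collect the map output
    out = []
    for n, line in records:
        out.extend(search_map(n, line, pattern))
    return out

hits = run_search(
    [(1, "error: disk full"), (2, "ok"), (3, "error: timeout")], "error")
# hits == [(1, "error: disk full"), (3, "error: timeout")]
```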
Sort
Input: (key, value) records
Output: same records, sorted by key
Map: identity function
Reduce: identity function
Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2)
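The trick works because an order-preserving partitioner sends lower key ranges to lower-numbered reducers; each reducer's output is locally sorted, so concatenating the outputs in reducer order is globally sorted. A toy sketch (it assumes integer keys in a known range; real jobs sample the keys to choose range boundaries):

```python
def h(key, num_reducers=3, key_space=100):
    # order-preserving partitioner: k1 < k2 implies h(k1) <= h(k2)
    # (assumes integer keys in [0, key_space); illustrative only)
    return min(key * num_reducers // key_space, num_reducers - 1)

def sort_job(records, num_reducers=3):
    partitions = [[] for _ in range(num_reducers)]
    for key, value in records:              # map: identity, then partition
        partitions[h(key, num_reducers)].append((key, value))
    out = []
    for part in partitions:                 # each reducer sorts its own range
        out.extend(sorted(part))
    return out                              # concatenation is globally sorted

data = [(42, "b"), (7, "a"), (99, "c"), (15, "d")]
result = sort_job(data)
# result == [(7, "a"), (15, "d"), (42, "b"), (99, "c")]
```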
Job 1:
- Create inverted index, giving (word, list(file)) records
Job 2:
- Map each (word, list(file)) to (count, word)
- Sort these records by count, as in the sort job
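The two-job pipeline can be simulated in-process to see the data shapes at each stage (illustrative Python, not the Hadoop API; the file names are made up):

```python
from collections import defaultdict

def job1_inverted_index(docs):
    # Job 1 -- map: (file, text) -> (word, file); reduce: collect files per word
    index = defaultdict(set)
    for filename, text in docs.items():
        for word in text.split():
            index[word].add(filename)
    return {word: sorted(files) for word, files in index.items()}

def job2_by_count(index):
    # Job 2 -- map: (word, list(file)) -> (count, word); then sort by count
    return sorted(((len(files), word) for word, files in index.items()),
                  reverse=True)

docs = {"a.txt": "hadoop mapreduce", "b.txt": "hadoop hdfs"}
index = job1_inverted_index(docs)
# index == {"hadoop": ["a.txt", "b.txt"], "mapreduce": ["a.txt"],
#           "hdfs": ["b.txt"]}
ranked = job2_by_count(index)
# ranked[0] == (2, "hadoop") -- the most widely used word
```

Re-keying by count in Job 2 is what lets the sort machinery of the previous example produce the popularity ranking.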
MapReduce in Hadoop
Three ways to write jobs in Hadoop:
- Java API
- Hadoop Streaming (for Python, Perl, etc.)
- Pipes API (C++)
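With Hadoop Streaming, the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value pairs on stdout; the framework sorts the mapper output by key before the reducer sees it. A word-count pair in that style (the functions below simulate the stdin/stdout pipeline locally so the logic can be exercised without a cluster):

```python
def mapper(lines):
    # streaming mapper: one "word\t1" line per word
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # streaming reducer: input arrives sorted by key, so equal keys
    # are adjacent and can be summed in a single pass
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

# simulate the framework: map, sort by key, reduce
result = list(reducer(sorted(mapper(["the cat", "the dog"]))))
# result == ["cat\t1", "dog\t1", "the\t2"]
```

On a cluster the same two programs would be shipped as separate scripts via the streaming jar; the sort between them is performed by the framework's shuffle, not by the user code.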
MapReduce architecture
Scope of MapReduce
Hadoop-MapReduce Tutorial
http://developer.yahoo.com/hadoop/tutorial/module3.html
http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html
Summary
- We introduced the MapReduce programming model for processing large-scale data
- We discussed the supporting Hadoop Distributed File System (HDFS)
- The concepts were illustrated using a simple example
- We reviewed some important parts of the example's source code
Using the Gfarm File System as a POSIX-compatible storage platform for Hadoop MapReduce applications
Download: www.shun0102.net/wp-content/uploads/PID2037887.pdf
Thank You