Why Hadoop
Simply put, Hadoop can transform the way you store and process data throughout your enterprise.
According to analysts, about 80% of the data in the world is unstructured, and until Hadoop, it was
essentially unusable in any systematic way. With Hadoop, for the first time you can combine all your
data and look at it as one.
E-tailing
Recommendation engines increase average order size by recommending complementary
products based on predictive analysis for cross-selling.
Cross-channel analytics: sales attribution, average order value, lifetime value (e.g., how many
in-store purchases resulted from a particular recommendation, advertisement, or promotion).
Event analytics: what series of steps (the "golden path") led to a desired outcome (e.g., purchase,
registration).
Financial Services
Compliance and regulatory reporting.
Risk analysis and management.
Fraud detection and security analytics.
CRM and customer loyalty programs.
Credit scoring and analysis.
Trade surveillance.
Government
Fraud detection and cybersecurity.
Compliance and regulatory analysis.
Energy consumption and carbon footprint management.
Retail/CPG
Merchandizing and market basket analysis.
Campaign management and customer loyalty programs.
Supply-chain management and analytics.
Event- and behavior-based targeting.
Market and consumer segmentations.
Telecommunications
Revenue assurance and price optimization.
Customer churn prevention.
Campaign management and customer loyalty.
Call Detail Record (CDR) analysis.
Network performance and optimization.
Hadoop Multi-node Architecture
The diagram illustrates the Hadoop architecture in simplified form: the MapReduce engine sits on top
of a distributed file system. Arrows represent data access, the large enclosing rectangles represent
the master and slave nodes, and the small rectangles represent functional units.
The file system layer can be any virtualized distributed file system. Hadoop performs best when
coupled with the Hadoop Distributed File System, because a physical data node, being
location/rack aware, can be placed closer to the TaskTracker that will access its data.
JobTracker:
The JobTracker is the service within Hadoop that farms out MapReduce tasks to specific nodes in the
cluster, ideally the nodes that have the data, or at least are in the same rack.
Hadoop Architecture:
Hive:
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query and analysis.
Using Hadoop was not easy for end users, especially those who were not familiar with the
MapReduce framework. End users had to write map/reduce programs even for simple tasks like getting
raw counts or averages. Hive was created to make it possible for analysts with strong SQL skills (but
meager Java programming skills) to run queries on huge volumes of data to extract patterns and
meaningful information. It provides an SQL-like language called HiveQL while maintaining full support
for map/reduce. In short, a Hive query is converted to MapReduce tasks.
The main building blocks of Hive are:
1. Metastore: stores the system catalog and metadata about tables, columns, partitions, etc.
2. Driver: manages the lifecycle of a HiveQL statement as it moves through Hive.
3. Query Compiler: compiles HiveQL into a directed acyclic graph of MapReduce tasks.
4. Execution Engine: executes the tasks produced by the compiler in proper dependency order.
5. HiveServer: provides a Thrift interface and a JDBC/ODBC server.
HBase:
HBase is the Hadoop application to use when you require real-time read/write random-access to
very large datasets.
It is a distributed column-oriented database built on top of HDFS.
HBase is not relational and does not support SQL, but given the proper problem space, it is able to do
what an RDBMS cannot: host very large, sparsely populated tables on clusters made from
commodity hardware.
Mahout:
Mahout is an open source machine learning library from Apache.
It is highly scalable.
Mahout aims to be the machine learning tool of choice when the collection of data to be
processed is very large, perhaps far too large for a single machine. At the moment, it primarily
implements recommender engines (collaborative filtering), clustering, and classification.
Sqoop:
Loading bulk data into Hadoop from production systems or accessing it from map-reduce applications
running on large clusters can be a challenging task. Transferring data using scripts is inefficient and
time-consuming.
How do we efficiently move data from an external storage into HDFS or Hive or HBase? Meet Apache
Sqoop. Sqoop allows easy import and export of data from structured data stores such as relational
databases, enterprise data warehouses, and NoSQL systems. The dataset being transferred is sliced up
into different partitions and a map-only job is launched with individual mappers responsible for
transferring a slice of this dataset.
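The slicing idea can be sketched in plain Java. The class and method names below are invented for illustration and are not part of the Sqoop API; each slice stands in for the portion of the dataset one mapper would transfer.

```java
import java.util.*;

// Toy model of Sqoop-style import: the dataset is sliced into partitions and
// each "mapper" would transfer one slice. Slice boundaries are illustrative.
public class Slicer {
    static List<List<Integer>> slices(List<Integer> rows, int numMappers) {
        List<List<Integer>> out = new ArrayList<>();
        int per = (rows.size() + numMappers - 1) / numMappers; // ceiling division
        for (int i = 0; i < rows.size(); i += per) {
            out.add(rows.subList(i, Math.min(i + per, rows.size())));
        }
        return out;
    }

    public static void main(String[] args) {
        // Five "rows" split across two mappers.
        System.out.println(slices(List.of(1, 2, 3, 4, 5), 2)); // [[1, 2, 3], [4, 5]]
    }
}
```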
ZooKeeper:
ZooKeeper is a distributed, open-source coordination service for distributed applications.
It exposes a simple set of primitives that distributed applications can build upon to implement
higher level services for synchronization, configuration maintenance, and groups and naming.
Overview of HDFS
HDFS has many similarities with other distributed file systems, but is different in several
respects. One noticeable difference is HDFS's write-once-read-many model that relaxes concurrency
control requirements, simplifies data coherency, and enables high-throughput access.
Another unique attribute of HDFS is the viewpoint that it is usually better to locate processing logic near
the data rather than moving the data to the application space.
HDFS rigorously restricts data writing to one writer at a time. Bytes are always appended to the end of a
stream, and byte streams are guaranteed to be stored in the order written.
HDFS has many goals. Here are some of the most notable:
Fault tolerance by detecting faults and applying quick, automatic recovery
Data access via MapReduce streaming
Simple and robust coherency model
Processing logic close to the data, rather than the data close to the processing logic
Portability across heterogeneous commodity hardware and operating systems
Scalability to reliably store and process large amounts of data
Economy by distributing data and processing across clusters of commodity personal computers
Efficiency by distributing data and logic to process it in parallel on nodes where data is located
Reliability by automatically maintaining multiple copies of data and automatically redeploying
processing logic in the event of failures
HDFS provides interfaces that let applications move their processing closer to where the data is
located, as described in the following section.
FileSystem (FS) shell: A command-line interface similar to common Linux and UNIX shells (bash, csh,
etc.) that allows interaction with HDFS data.
DFSAdmin: A command set that you can use to administer an HDFS cluster.
fsck: A subcommand of the Hadoop command/application. You can use the fsck command to check for
inconsistencies with files, such as missing blocks, but you cannot use the fsck command to correct
these inconsistencies.
Name nodes and data nodes: These have built-in web servers that let administrators check the current
status of a cluster.
HDFS offers this extensive feature set thanks to its simple, yet powerful, architecture.
HDFS architecture
HDFS is comprised of interconnected clusters of nodes where files and directories reside. An HDFS
cluster consists of a single node, known as a NameNode, that manages the file system namespace and
regulates client access to files. In addition, data nodes (DataNodes) store data as blocks within files.
Name nodes and data nodes
Within HDFS, a given name node manages file system namespace operations like opening, closing, and
renaming files and directories. A name node also maps data blocks to data nodes, which handle read
and write requests from HDFS clients. Data nodes also create, delete, and replicate data blocks
according to instructions from the governing name node.
As Figure 1 illustrates, each cluster contains one name node. This design facilitates a simplified model
for managing each namespace and arbitrating data distribution.
Relationships between name nodes and data nodes
Name nodes and data nodes are software components designed to run in a decoupled manner on
commodity machines across heterogeneous operating systems. HDFS is built using the Java
programming language; therefore, any machine that supports the Java programming language can run
HDFS. A typical installation cluster has a dedicated machine that runs a name node and possibly one
data node. Each of the other machines in the cluster runs one data node.
Communications protocols
All HDFS communication protocols build on the TCP/IP protocol. HDFS clients connect to a Transmission
Control Protocol (TCP) port opened on the name node, and then communicate with the name node
using a proprietary Remote Procedure Call (RPC)-based protocol. Data nodes talk to the name node
using a proprietary block-based protocol.
Data nodes continuously loop, asking the name node for instructions. A name node can't connect
directly to a data node; it simply returns values from functions invoked by a data node. Each data node
maintains an open server socket so that client code or other data nodes can read or write data. The host
or port for this server socket is known by the name node, which provides the information to interested
clients or other data nodes. See the Communications protocols sidebar for more about communication
between data nodes, name nodes, and clients.
The name node maintains and administers changes to the file system namespace.
File system namespace
HDFS supports a traditional hierarchical file organization in which a user or an application can create
directories and store files inside them. The file system namespace hierarchy is similar to most other
existing file systems; you can create, rename, relocate, and remove files.
HDFS also supports third-party file systems such as CloudStore and Amazon Simple Storage Service (S3).
Data replication
HDFS replicates file blocks for fault tolerance. An application can specify the number of replicas of a file
at the time it is created, and this number can be changed any time after that. The name node makes all
decisions concerning block replication.
Rack awareness
Typically, large HDFS clusters are arranged across multiple installations (racks). Network traffic between
different nodes within the same installation is more efficient than network traffic across installations. A
name node tries to place replicas of a block on multiple installations for improved fault tolerance.
However, HDFS allows administrators to decide on which installation a node belongs. Therefore, each
node knows its rack ID, making it rack aware.
HDFS uses an intelligent replica placement model for reliability and performance. Optimizing replica
placement makes HDFS unique from most other distributed file systems, and is facilitated by a
rack-aware replica placement policy that uses network bandwidth efficiently.
Large HDFS environments typically operate across multiple installations of computers. Communication
between two data nodes in different installations is typically slower than data nodes within the same
installation. Therefore, the name node attempts to optimize communications between data nodes. The
name node identifies the location of data nodes by their rack IDs.
Data organization
One of the main goals of HDFS is to support large files. The size of a typical HDFS block is 64MB.
Therefore, each HDFS file consists of one or more 64MB blocks. HDFS tries to place each block on
separate data nodes.
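As a quick sanity check on the arithmetic, the number of blocks a file occupies is its size divided by the block size, rounded up. The helper below is illustrative only and not a Hadoop API:

```java
// Illustrative helper (not a Hadoop API): how many fixed-size HDFS blocks
// a file of a given length occupies.
public class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB, the classic HDFS default

    // Ceiling division: a 200 MB file needs 4 blocks (3 full + 1 partial).
    static long blocksFor(long fileSize) {
        return (fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long twoHundredMb = 200L * 1024 * 1024;
        System.out.println(blocksFor(twoHundredMb)); // prints 4
    }
}
```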
File creation process
Manipulating files on HDFS is similar to the processes used with other file systems. However, because
HDFS is a multi-machine system that appears as a single disk, all code that manipulates files on HDFS
uses a subclass of the org.apache.hadoop.fs.FileSystem object.
The code shown in Listing 1 illustrates a typical file creation process on HDFS.
Staging to commit
When a client creates a file in HDFS, it first caches the data into a temporary local file and redirects
subsequent writes to that temporary file. When the temporary file accumulates enough data to fill an
HDFS block, the client reports this to the name node, which allocates a permanent block on a data
node. The client then closes the temporary file and flushes any remaining data to the newly allocated
block, and the name node commits the file to the file system.
Replication pipelining
When a client accumulates a full block of user data, it retrieves a list of data nodes that contains a
replica of that block from the name node. The client then flushes the full data block to the first data
node specified in the replica list. As the node receives chunks of data, it writes them to disk and
transfers copies to the next data node in the list. The next data node does the same. This pipelining
process is repeated until the replication factor is satisfied.
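The pipelining step can be modeled as a toy sketch in plain Java. The node names and in-memory storage map are invented for illustration; real data nodes stream 4 KB chunks to one another over sockets:

```java
import java.util.*;

// Toy model of replication pipelining: each "data node" stores the block,
// then forwards it to the next node in the replica list.
public class Pipeline {
    static Map<String, byte[]> storage = new LinkedHashMap<>();

    static void write(byte[] block, List<String> replicaNodes) {
        if (replicaNodes.isEmpty()) return;       // replication factor satisfied
        String node = replicaNodes.get(0);
        storage.put(node, block.clone());         // write locally...
        write(block, replicaNodes.subList(1, replicaNodes.size())); // ...then forward
    }

    public static void main(String[] args) {
        write("block-data".getBytes(), List.of("dn1", "dn2", "dn3"));
        System.out.println(storage.keySet()); // [dn1, dn2, dn3]
    }
}
```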
HDFS heartbeats
Several things can cause loss of connectivity between name and data nodes. Therefore, each data node
sends periodic heartbeat messages to its name node, so the latter can detect loss of connectivity if it
stops receiving them. The name node marks data nodes that do not respond to heartbeats as dead and
refrains from sending further requests to them. Data stored on a dead node is no longer available to an
HDFS client from that node, which is effectively removed from the system. If the death of a node causes
the replication factor of data blocks to drop below their minimum value, the name node initiates
additional replication to bring the replication factor back to a normalized state.
Figure 2 illustrates the HDFS process of sending heartbeat messages.
HDFS supports rebalancing data blocks using various models. One model might move data blocks from
one data node to another if the free space on a data node falls too low. Another model might
dynamically create additional replicas and rebalance other data blocks in a cluster if a sudden increase
in demand for a given file occurs. HDFS also provides the hadoop balancer command for manual
rebalancing tasks.
One common reason to rebalance is the addition of new data nodes to a cluster. When placing new
blocks, name nodes consider various parameters before choosing the data nodes to receive them. Some
of the considerations are:
Block-replica writing policies
Prevention of data loss due to installation or rack failure
Reduction of cross-installation network I/O
Uniform data spread across data nodes in a cluster
The cluster-rebalancing feature of HDFS is just one mechanism it uses to sustain the integrity of its data.
Other mechanisms are discussed next.
Data integrity
HDFS goes to great lengths to ensure the integrity of data across clusters. It uses checksum validation on
the contents of HDFS files by storing computed checksums in separate, hidden files in the same
namespace as the actual data. When a client retrieves file data, it can verify that the data received
matches the checksum stored in the associated file.
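The idea can be mimicked with any checksum algorithm. The sketch below uses the JDK's CRC32 purely as an analogy; HDFS actually keeps its own per-chunk checksums in hidden files, so this code is not how HDFS implements it:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Analogy for HDFS checksum validation: store a checksum when data is
// written, recompute it on read, and compare the two values.
public class ChecksumCheck {
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Returns true when the data read back matches the stored checksum.
    static boolean verify(byte[] data, long stored) {
        return checksum(data) == stored;
    }

    public static void main(String[] args) {
        byte[] block = "hello hdfs".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block);            // computed at write time
        System.out.println(verify(block, stored)); // true: data intact
        block[0] ^= 1;                             // simulate corruption
        System.out.println(verify(block, stored)); // false: corruption detected
    }
}
```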
The HDFS namespace is stored using a transaction log kept by each name node. The file system
namespace, along with file block mappings and file system properties, is stored in a file called FsImage.
When a name node is initialized, it reads the FsImage file along with other files, and applies the
transactions and state information found in these files.
Introduction to MapReduce
Introduction
MapReduce is a programming model designed for processing large volumes of data in parallel by
dividing the work into a set of independent tasks.
MapReduce programs are written in a particular style influenced by functional programming
constructs, specifically idioms for processing lists of data.
This module explains the nature of this programming model and how it can be used to write
programs which run in the Hadoop environment.
Goals for this Module:
Understand functional programming as it applies to MapReduce
Understand the MapReduce program flow
Understand how to write programs for Hadoop MapReduce
Learn about additional features of Hadoop designed to aid software development.
MapReduce Basics
Functional Programming Concepts
MapReduce programs are designed to compute large volumes of data in a parallel fashion. This
requires dividing the workload across a large number of machines.
This model would not scale to large clusters (hundreds or thousands of nodes) if the
components were allowed to share data arbitrarily.
The communication overhead required to keep the data on the nodes synchronized at all times
would prevent the system from performing reliably or efficiently at large scale.
Instead, all data elements in MapReduce are immutable, meaning that they cannot be updated.
If in a mapping task you change an input (key, value) pair, it does not get reflected back in the
input files; communication occurs only by generating new output (key, value) pairs which are
then forwarded by the Hadoop system into the next phase of execution.
List Processing
Conceptually, MapReduce programs transform lists of input data elements into lists of output
data elements.
A MapReduce program will do this twice, using two different list processing idioms: map, and
reduce. These terms are taken from several list processing languages such as LISP, Scheme, or
ML.
Mapping Lists
The first phase of a MapReduce program is called mapping. A list of data elements is provided,
one at a time, to a function called the Mapper, which transforms each element individually into an
output data element.
As an example of the utility of map: Suppose you had a function toUpper(str) which returns an
uppercase version of its input string. You could use this function with map to turn a list of strings
into a list of uppercase strings.
Note that we are not modifying the input string here: we are returning a new string that will
form part of a new output list.
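The same idiom can be shown in plain Java (an illustrative sketch, not Hadoop code; the class and method names are invented):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the map idiom: apply a function to every element of an input
// list, producing a NEW output list. The input list is never modified.
public class MapIdiom {
    static List<String> toUpperAll(List<String> input) {
        return input.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(toUpperAll(List.of("cat", "dog", "fish"))); // [CAT, DOG, FISH]
    }
}
```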
Reducing Lists
Reducing lets you aggregate values together. A reducer function receives an iterator of input
values from an input list. It then combines these values together, returning a single output
value.
Reducing is often used to produce "summary" data, turning a large volume of data into a smaller
summary of itself. For example, "+" can be used as a reducing function, to return the sum of a
list of input values.
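Again in plain Java, using "+" as the reducing function (illustrative, not Hadoop code):

```java
import java.util.List;

// Sketch of the reduce idiom: "+" collapses a list of input values into a
// single output value, the sum.
public class ReduceIdiom {
    static int sum(List<Integer> values) {
        return values.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        System.out.println(sum(List.of(1, 2, 3, 4))); // 10
    }
}
```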
Putting Them Together in MapReduce:
The Hadoop MapReduce framework takes these concepts and uses them to process large volumes of
information. A MapReduce program has two components: one that implements the mapper, and
another that implements the reducer. The Mapper and Reducer idioms described above are extended
slightly to work in this environment, but the basic principles are the same.
Keys and values: In MapReduce, no value stands on its own. Every value has a key associated with it.
Keys identify related values. For example, a log of time-coded speedometer readings from multiple cars
could be keyed by license-plate number; it would look like:
AAA-123    65mph, 12:00pm
ZZZ-789    50mph, 12:02pm
AAA-123    40mph, 12:05pm
CCC-456    25mph, 12:15pm
...
The mapping and reducing functions receive not just values, but (key, value) pairs. The output of each of
these functions is the same: both a key and a value must be emitted to the next list in the data flow.
Hadoop MapReduce is also less strict than formal functional languages about how the Mapper and Reducer work.
In more formal functional mapping and reducing settings, a mapper must produce exactly one
output element for each input element, and a reducer must produce exactly one output
element for each input list.
In MapReduce, an arbitrary number of values can be output from each phase; a mapper may
map one input into zero, one, or one hundred outputs.
A reducer may compute over an input list and emit one or a dozen different outputs.
Keys divide the reduce space: A reducing function turns a large list of values into one (or a few) output
values. In MapReduce, all of the output values are not usually reduced together. All of the values with
the same key are presented to a single reducer together. This is performed independently of any reduce
operations occurring on other lists of values, with different keys attached.
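This per-key grouping can be sketched in plain Java over the speedometer readings above. The reducing function here, maximum speed per plate, is an invented example, and the code is a local simulation, not Hadoop:

```java
import java.util.*;

// Sketch of "keys divide the reduce space": values are grouped by key and
// each key's group is reduced independently of all other keys.
public class KeyedReduce {
    static Map<String, Integer> maxSpeedPerPlate(List<Map.Entry<String, Integer>> readings) {
        Map<String, Integer> result = new TreeMap<>(); // sorted for stable output
        for (Map.Entry<String, Integer> r : readings) {
            // merge() applies the reducing function (max) per key.
            result.merge(r.getKey(), r.getValue(), Integer::max);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> readings = List.of(
            Map.entry("AAA-123", 65), Map.entry("ZZZ-789", 50),
            Map.entry("AAA-123", 40), Map.entry("CCC-456", 25));
        System.out.println(maxSpeedPerPlate(readings));
        // {AAA-123=65, CCC-456=25, ZZZ-789=50}
    }
}
```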
We can write a very similar program to this in Hadoop MapReduce; it is included in the Hadoop
distribution in src/examples/org/apache/hadoop/examples/WordCount.java. It is partially reproduced
below:
Second, the default input format used by Hadoop presents each line of an input file as a
separate input to the mapper function, not the entire file at a time. It also uses a StringTokenizer
object to break up the line into words. This does not perform any normalization of the input, so
"cat", "Cat" and "cat," are all regarded as different strings.
Note that the class-variable word is reused each time the mapper outputs another (word, 1)
pairing; this saves time by not allocating a new variable for each output.
The output.collect() method will copy the values it receives as input, so you are free to overwrite
the variables you use.
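The StringTokenizer behavior described above is easy to verify in isolation (plain Java, outside Hadoop; the helper class is invented for illustration):

```java
import java.util.*;

// Demonstrates the point above: StringTokenizer splits on whitespace only,
// so "cat", "Cat" and "cat," all come out as distinct, unnormalized tokens.
public class Tokens {
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line); // same class WordCount uses
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("cat Cat cat, dog")); // [cat, Cat, cat,, dog]
    }
}
```

Note the trailing comma survives in "cat,": any normalization (lowercasing, stripping punctuation) must be added in the mapper itself.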
Table 1: InputFormats provided by Hadoop

TextInputFormat: Default format; reads lines of text files. Key: the byte offset of the line.
Value: the line contents.
KeyValueInputFormat: Key: everything up to the first tab character. Value: the remainder of the line.
SequenceFileInputFormat: Key: user-defined. Value: user-defined.
The key associated with each line is its byte offset in the file.
The RecordReader is invoked repeatedly on the input until the entire InputSplit has been
consumed.
Each invocation of the RecordReader leads to another call to the map() method of the Mapper.
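The default (byte offset, line) pairing can be imitated over an in-memory string. This is a plain Java sketch, not the TextInputFormat implementation, and the offsets here are character offsets, which match HDFS byte offsets only for single-byte encodings:

```java
import java.util.*;

// Imitates TextInputFormat's record pairing: each line becomes one record
// whose key is the offset of the line's first character and whose value is
// the line contents.
public class LineRecords {
    static Map<Long, String> records(String file) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : file.split("\n", -1)) {
            out.put(offset, line);
            offset += line.length() + 1; // +1 for the newline separator
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(records("first\nsecond"));
        // {0=first, 6=second}
    }
}
```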
Mapper:
The Mapper performs the interesting user-defined work of the first phase of the MapReduce
program.
Given a key and a value, the map() method emits (key, value) pair(s) which are forwarded to the
Reducers.
Partition & Shuffle:
The process of moving map outputs to the reducers is known as shuffling.
A different subset of the intermediate key space is assigned to each reduce node; these subsets
(known as "partitions") are the inputs to the reduce tasks.
Each map task may emit (key, value) pairs to any partition; all values for the same key are always
reduced together regardless of which mapper is its origin.
Therefore, the map nodes must all agree on where to send the different pieces of the
intermediate data. The Partitioner class determines which partition a given (key, value) pair will
go to. The default partitioner computes a hash value for the key and assigns the partition based
on this result.
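That default behavior, a hash of the key modulo the number of reduce tasks as in Hadoop's HashPartitioner, can be sketched as follows (plain Java, not the actual Hadoop class):

```java
// Sketch of the default partitioning logic: the key's hash code, masked to
// be non-negative, modulo the number of reduce tasks.
public class DefaultPartition {
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key lands in the same partition, which
        // is what guarantees all values for a key reach the same reducer.
        System.out.println(partitionFor("AAA-123", 4) == partitionFor("AAA-123", 4)); // true
    }
}
```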
Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys.
The set of intermediate keys on a single node is automatically sorted by Hadoop before they are
presented to the Reducer.
Reduce:
A Reducer instance is created for each reduce task.
This is an instance of user-provided code that performs the second important phase of job-specific work.
For each key in the partition assigned to a Reducer, the Reducer's reduce() method is called
once.
This receives a key as well as an iterator over all the values associated with the key.
The values associated with a key are returned by the iterator in an undefined order.
The Reducer also receives as parameters OutputCollector and Reporter objects; they are used in
the same manner as in the map() method.
OutputFormat:
The (key, value) pairs provided to this OutputCollector are then written to output files. The way
they are written is governed by the OutputFormat.
The OutputFormat functions much like the InputFormat class described earlier.
The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS;
they all inherit from a common FileOutputFormat.
Table 2: OutputFormats provided by Hadoop

TextOutputFormat: Default; writes lines in "key \t value" form.
SequenceFileOutputFormat: Writes binary files suitable for reading into subsequent MapReduce jobs.
NullOutputFormat: Disregards its inputs.
RecordWriter: Much like how the InputFormat actually reads individual records through the
RecordReader implementation, the OutputFormat class is a factory for RecordWriter objects; these are
used to write the individual records to the files as directed by the OutputFormat.
The output files written by the Reducers are then left in HDFS for your use, either by another
MapReduce job, a separate program, or for human inspection.
Hadoop Streaming
Whereas Pipes is an API that provides close coupling between C++ application code and Hadoop,
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop
Mapper and Reducer implementations.
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a
MapReduce job. Both Mappers and Reducers receive their input on stdin and emit output (key, value)
pairs on stdout.
Input and output are always represented textually in Streaming. The input (key, value) pairs are written
to stdin for a Mapper or Reducer, with a 'tab' character separating the key from the value. The
Streaming programs should split the input on the first tab character on the line to recover the key and
the value. Streaming programs write their output to stdout in the same format: key \t value \n.
The inputs to the reducer are sorted so that while each line contains only a single (key, value) pair, all
the values for the same key are adjacent to one another.
Provided it can handle its input in the text format described above, any Linux program or tool can be
used as the mapper or reducer in Streaming. You can also write your own scripts in bash, python, perl,
or another language of your choice, provided that the necessary interpreter is present on all nodes in
your cluster.
Running a Streaming Job: To run a job with Hadoop Streaming, use the following command:
$ bin/hadoop jar contrib/streaming/hadoop-version-streaming.jar
The command as shown, with no arguments, will print some usage information. An example of how to
run real commands is given below:
$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper \
myMapProgram -reducer myReduceProgram -input /some/dfs/path \
-output /some/other/dfs/path
This assumes that myMapProgram and myReduceProgram are present on all nodes in the system ahead
of time. If this is not the case, but they are present on the node launching the job, then they can be
"shipped" to the other nodes with the -file option:
$ bin/hadoop jar contrib/streaming/hadoop-0.18.0-streaming.jar -mapper \
myMapProgram -reducer myReduceProgram -file \
myMapProgram -file myReduceProgram -input some/dfs/path \
-output some/other/dfs/path
Any other support files necessary to run your program can be shipped in this manner as well.
MapReduce API
Package org.apache.hadoop.mapreduce
Interface Summary
Counter
CounterGroup
JobContext
MapContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
TaskAttemptContext
TaskInputOutputContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
Class Summary
Cluster
ClusterMetrics
Counters
ID
InputFormat<K,V>
InputSplit
Job
JobID
JobStatus
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
MarkableIterator<VALUE>
OutputCommitter
OutputFormat<K,V>
Partitioner<KEY,VALUE>
QueueAclsInfo
QueueInfo
RecordReader<KEYIN,VALUEIN>
RecordWriter<K,V>
Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
TaskAttemptID
TaskCompletionEvent
TaskID
TaskTrackerInfo
Enum Summary
JobCounter
JobPriority
QueueState
TaskCompletionEvent.Status
TaskCounter
TaskType
Mapper
Constructor Detail
Mapper
public Mapper()
Method Detail
setup
protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException
Called once at the beginning of the task.
Throws:
IOException
InterruptedException
map
protected void map(KEYIN key,
VALUEIN value,
org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException
Called once for each key/value pair in the input split. Most applications should override this, but
the default is the identity function.
Throws:
IOException
InterruptedException
cleanup
protected void cleanup(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException
Called once at the end of the task.
Throws:
IOException
InterruptedException
run
public void run(org.apache.hadoop.mapreduce.Mapper.Context context)
throws IOException, InterruptedException
Expert users can override this method for more complete control over the execution of the
Mapper.
Parameters:
context
Throws:
IOException
InterruptedException
RecordReader
Constructor Detail
RecordReader
public RecordReader()
Method Detail
initialize
public abstract void initialize(InputSplit split, TaskAttemptContext context)
throws IOException,InterruptedException
Called once at initialization.
Parameters:
split - the split that defines the range of records to read
context - the information about the task
Throws:
IOException
InterruptedException
nextKeyValue
public abstract boolean nextKeyValue()
throws IOException,InterruptedException
Read the next key, value pair.
Returns:
true if a key/value pair was read
Throws:
IOException
InterruptedException
getCurrentKey
public abstract KEYIN getCurrentKey()
throws IOException, InterruptedException
Get the current key.
Returns:
the current key or null if there is no current key
Throws:
IOException
InterruptedException
getCurrentValue
public abstract VALUEIN getCurrentValue()
throws IOException, InterruptedException
Get the current value.
Returns:
the object that was read
Throws:
IOException
InterruptedException
getProgress
public abstract float getProgress()
throws IOException, InterruptedException
The current progress of the record reader through its data.
Returns:
a number between 0.0 and 1.0 that is the fraction of the data read
Throws:
IOException
InterruptedException
close
public abstract void close()
throws IOException
Close the record reader.
Specified by:
close in interface Closeable
Throws:
IOException
Reducer
Constructor Detail
Reducer
public Reducer()
Method Detail
setup
protected void setup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException
Called once at the start of the task.
Throws:
IOException
InterruptedException
reduce
protected void reduce(KEYIN key,
Iterable<VALUEIN> values,
org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException
This method is called once for each key. Most applications will define their reduce class by
overriding this method. The default implementation is an identity function.
Throws:
IOException
InterruptedException
cleanup
protected void cleanup(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException,InterruptedException
Called once at the end of the task.
Throws:
IOException
InterruptedException
run
public void run(org.apache.hadoop.mapreduce.Reducer.Context context)
throws IOException, InterruptedException
Advanced application writers can use the run(org.apache.hadoop.mapreduce.Reducer.Context)
method to control how the reduce task works.
Throws:
IOException
InterruptedException
Prior to Hadoop 0.20.x, a Map class had to extend MapReduceBase and implement the Mapper interface, as such:
public static class Map extends MapReduceBase implements Mapper {
...
}
and similarly, a map function had to use an OutputCollector and a Reporter object to emit (key, value)
pairs and send progress updates to the main program. A typical map function looked like:
public void map(K1 key, V1 value, OutputCollector output, Reporter reporter) throws IOException {
...
output.collect(key, value);
}
With the new Hadoop API, a mapper or reducer extends classes from the org.apache.hadoop.mapreduce
package, and there is no longer any need to implement an interface. Here is how a Map class is defined
in the new API:
public class MapClass extends Mapper {
...
}
and a map function uses Context objects to emit records and send progress updates. A typical map
function is now defined as:
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    ...
    context.write(key, value);
}
All of the changes described above for a Mapper apply in the same way to a Reducer.
Another major change is in the way a job is configured and controlled. Earlier, a MapReduce job was
configured through a JobConf object, and job control was done using an instance of JobClient. The main
body of a driver class used to look like:
JobConf conf = new JobConf(Driver.class);
conf.setPropertyX(..);
conf.setPropertyY(..);
...
...
JobClient.runJob(conf);
In the new Hadoop API, the same functionality is achieved as follows:
Configuration conf = new Configuration();
Job job = new Job(conf);
job.setJarByClass(Driver.class);
job.setPropertyX(..);
job.setPropertyY(..);
job.waitForCompletion(true);
Combiner:
The primary goal of combiners is to minimize the number of key/value pairs that will be shuffled
across the network between mappers and reducers, and thus to save as much bandwidth as possible.
E.g., take the word count example on a text containing the word "the" one million times. Without a
combiner, the mapper will send one million key/value pairs of the form <the,1>. With a combiner, it
will potentially send far fewer key/value pairs of the form <the,N>, with N a number potentially much
bigger than 1. That is just the intuition (see the references at the end of the post for more details).
Simply speaking, a combiner can be considered a mini-reducer that is applied, potentially several
times, during the map phase before sending the new (hopefully reduced) set of key/value pairs to the
reducer(s). This is why a combiner must implement the Reducer interface (or extend the Reducer class
as of Hadoop 0.20).
conf.setCombinerClass(Reduce.class);
Indeed, suppose 5 key/value pairs are emitted from the mapper for a given key k: <k,40>, <k,30>,
<k,20>, <k,2>, <k,8>. Without a combiner, the reducer receives the list <k,{40,30,20,2,8}> and the
mean output is 20. But if a mean-computing combiner were applied first to the two sets (<k,40>,
<k,30>, <k,20>) and (<k,2>, <k,8>) separately, then the reducer would receive the list <k,{30,5}> and
the output would be 17.5, which is unexpected behavior. A combiner is therefore only safe when the
reduce operation is associative and commutative (such as sum or max); the mean is not.
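This pitfall is easy to reproduce in plain Java. The sketch below simulates the shuffle with the numbers from the example above; it is a simulation of the arithmetic, not the Hadoop API:

```java
import java.util.Arrays;
import java.util.List;

// Simulates why a mean-computing reducer cannot simply be reused as a combiner:
// averaging is not an associative/commutative combine operation.
public class CombinerMeanPitfall {

    static double mean(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum() / (double) values.size();
    }

    // No combiner: the reducer sees all five values for key k.
    static double withoutCombiner() {
        return mean(Arrays.asList(40, 30, 20, 2, 8));
    }

    // A "mean" combiner first averages each map task's group,
    // then the reducer averages the two partial means.
    static double withWrongCombiner() {
        int partial1 = (int) mean(Arrays.asList(40, 30, 20)); // 30
        int partial2 = (int) mean(Arrays.asList(2, 8));       // 5
        return mean(Arrays.asList(partial1, partial2));
    }
}
```

Running the two methods yields 20.0 without the combiner and 17.5 with it, confirming that a combiner may only be substituted for a reducer when the operation distributes over partial groups.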
Performance Measurement:
Local Execution Mode using LocalJobRunner from Hadoop
Hadoop's LocalJobRunner can execute the same MapReduce physical plans locally. So we compile the
logical plan into a MapReduce physical plan and create the JobControl object corresponding to the
MapReduce plan. We just need to write a separate launcher that submits the job to the LocalJobRunner
instead of submitting it to an external JobTracker.
Pros
Code Reuse
No need to write and maintain
Different operators
Different logical to physical translators
Different launchers
The current framework does not have any progress reporting. With this approach we
will have it at no extra cost.
Cons
Not sure how stable LocalJobRunner is.
Found some bugs in hadoop-15 that make it practically useless for us right now; these have,
however, been fixed in hadoop-16.
Not sure how this will affect the example generator.
It definitely adds startup latency: as measured in hadoop-15, LocalJobRunner has about 5 seconds of
startup latency. Whether this matters depends on how and where we use LocalJobRunner. If we use it
strictly only when the user asks for local execution mode, it should not matter. Also, if the size of
the data is at least in the tens of MBs, the LocalJobRunner performs better than streaming tuples
through the plan of local operators.
<property>
<name>size</name>
<value>10</value>
<description>Size</description>
</property>
<property>
<name>weight</name>
<value>heavy</value>
<final>true</final>
<description>Weight</description>
</property>
<property>
<name>size-weight</name>
<value>${size},${weight}</value>
<description>Size and weight</description>
</property>
</configuration>
Assuming this configuration file is in a file called configuration-1.xml, we can access its
properties using a piece of code like this:
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
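As the size-weight property above suggests, Configuration expands ${property} references when a value is read. A rough plain-Java sketch of that expansion follows; it is a simulation for illustration, not Hadoop's Configuration class:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the ${property} expansion that Hadoop's Configuration
// performs when a value references other properties.
public class ConfigExpansion {
    private static final Pattern VAR = Pattern.compile("\\$\\{([^}]+)\\}");

    static String expand(String value, Map<String, String> props) {
        Matcher m = VAR.matcher(value);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Unknown variables are left untouched, as m.group(0) is the literal ${name}.
            String replacement = props.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

With the properties from configuration-1.xml, expanding "${size},${weight}" yields "10,heavy", matching what conf.get("size-weight") would return.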
Unless explicitly turned off, Hadoop by default specifies two resources, loaded in order
from the classpath: core-default.xml and core-site.xml.
Partitioner
A Partitioner is responsible for performing the partitioning.
In Hadoop, the default partitioner is HashPartitioner.
The number of partitions is equal to the number of reduce tasks for the job.
Why is it important?
First, it has a direct impact on the overall performance of your job: a poorly designed
partitioning function will not evenly distribute the load over the reducers, potentially losing
much of the benefit of the map/reduce distributed infrastructure.
Example
As you can see, the tokens are correctly ordered by number of occurrences on each reducer
(which is what Hadoop guarantees by default), but this is not what you need! You'd rather
expect something like:
where tokens are totally ordered across the reducers, from 1 to 30 occurrences on the first reducer and
from 31 to 14620 on the second. This would happen as a result of a correct partitioning function: all the
tokens having a number of occurrences of at most N (here 30) are sent to reducer 1 and the others are
sent to reducer 2, resulting in two partitions. Since the tokens are sorted on each partition, you get the
expected total order on the number of occurrences.
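The partitioning function described above can be sketched in plain Java. The cutoff N=30 is taken from the example, and hashPartition mirrors the formula the default HashPartitioner uses (hash code masked non-negative, modulo the number of reducers); this is an illustration, not the Hadoop Partitioner API:

```java
// Sketch of the partitioning described above: tokens with at most N
// occurrences go to the first reducer, all others to the second, so each
// locally sorted partition also forms a globally ordered range.
public class OccurrencePartitioner {
    private static final int CUTOFF = 30;   // N from the example; two partitions assumed

    static int getPartition(int occurrences) {
        return occurrences <= CUTOFF ? 0 : 1;
    }

    // For comparison: the formula used by the default HashPartitioner.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

Unlike hashPartition, which spreads keys evenly but arbitrarily, the range-based getPartition preserves a total order across reducers at the risk of skew if the cutoff is chosen poorly.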
Conclusion
Partitioning in map/reduce is a fairly simple concept, but it is important to get right. Most of the
time, the default partitioning based on a hash function is sufficient. But, as illustrated here, you
will sometimes need to modify the default behavior and customize the partitioning to suit your needs.
HDFS Accessibility
HDFS can be accessed from applications in many different ways. Natively, HDFS provides a FileSystem
Java API for applications to use. A C language wrapper for this Java API is also available. In addition, an
HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose
HDFS through the WebDAV protocol.
FS Shell
HDFS allows user data to be organized in the form of files and directories. It provides a command-line
interface called FS shell that lets a user interact with the data in HDFS. The syntax of this command
set is similar to other shells (e.g., bash, csh) that users are already familiar with. Here are some
sample action/command pairs:
Action                                                Command
Create a directory named /foodir                      bin/hadoop dfs -mkdir /foodir
Remove a directory named /foodir                      bin/hadoop dfs -rmr /foodir
View the contents of a file named /foodir/myfile.txt  bin/hadoop dfs -cat /foodir/myfile.txt
FS shell is targeted for applications that need a scripting language to interact with the stored data.
DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster. These are commands that
are used only by an HDFS administrator. Here are some sample action/command pairs:
Action                                       Command
Put the cluster in Safemode                  bin/hadoop dfsadmin -safemode enter
Generate a list of DataNodes                 bin/hadoop dfsadmin -report
Recommission or decommission DataNode(s)     bin/hadoop dfsadmin -refreshNodes
Browser Interface
A typical HDFS install configures a web server to expose the HDFS namespace through a
configurable TCP port. This allows a user to navigate the HDFS namespace and view the contents of
its files using a web browser.
HIVE Basics
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis.
Features of Hive
Hive supports indexing to provide acceleration.
Support for different storage types.
Hive stores metadata in an RDBMS, which significantly reduces the time needed to perform
semantic checks during query execution.
Hive can operate on compressed data stored in the Hadoop ecosystem.
Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining
tools; if none serves our need, we can create our own UDFs.
Hive supports SQL-like queries (HiveQL), which are implicitly converted into MapReduce
jobs.
HiveQL
While based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers
extensions not in SQL.
**Details will be provided later.
PIG Basics
Pig is a high-level platform for creating MapReduce programs used with Hadoop. The language
for this platform is called Pig Latin.
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these programs.
The salient property of Pig programs is that their structure is amenable to substantial
parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of
Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the
Hadoop subproject). Pig's language layer currently consists of a textual language called Pig
Latin, which has the following key properties:
Practical Development
Counters
A named counter that tracks the progress of a map/reduce job.
Counters represent global counters, defined either by the MapReduce framework or by
applications. Each Counter is named by an Enum and has a long for the value.
Counters are a useful channel for gathering statistics about the job. Counter values are much
easier to retrieve than log output for large distributed jobs, and they give you a record of the
number of times a condition occurred, which is more work to obtain from a set of logfiles.
Types of Counter
Built-in Counters
Hadoop maintains some built-in counters for every job, which report various metrics
for your job.
E.g., MapReduce task counters, filesystem counters.
Task Counters
Task counters gather information about tasks over the course of their execution, and
the results are aggregated over all the tasks in a job. Task counters are maintained by
each task attempt and periodically sent to the tasktracker and then to the jobtracker,
so they can be globally aggregated.
E.g., map input records, map skipped records.
Job counters
Job counters are maintained by the jobtracker. They measure job-level statistics, not
values that change while a task is running. For example, TOTAL_LAUNCHED_MAPS
counts the number of map tasks that were launched over the course of a job (including
ones that failed).
E.g., launched map tasks, launched reduce tasks.
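The counter mechanics described above can be sketched in plain Java: each counter is named by an enum, holds a long, and per-task counts are summed into job-wide totals. The enum and record data here are made up for illustration; this simulates the bookkeeping, not the Hadoop Counters API:

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

// Simulation of MapReduce counters: each counter is named by an enum and
// holds a long; per-task values are aggregated into job-wide totals.
public class CounterDemo {
    enum Records { INPUT, SKIPPED }

    // One "task attempt": counts its input records and the blanks it skips.
    static Map<Records, Long> countTask(String[] lines) {
        Map<Records, Long> counters = new EnumMap<>(Records.class);
        for (Records r : Records.values()) counters.put(r, 0L);
        for (String line : lines) {
            counters.merge(Records.INPUT, 1L, Long::sum);
            if (line.isEmpty()) counters.merge(Records.SKIPPED, 1L, Long::sum);
        }
        return counters;
    }

    // Global aggregation, as the jobtracker does across all tasks.
    static long total(Records which, List<Map<Records, Long>> perTask) {
        long sum = 0;
        for (Map<Records, Long> m : perTask) sum += m.get(which);
        return sum;
    }
}
```

The enum name plus long value is the whole contract; the framework merely sums the per-attempt maps into the job-wide view you see in the web UI.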
ChainMapper
The ChainMapper class allows the use of multiple Mapper classes within a single map task.
The Mapper classes are invoked in a chained (or piped) fashion: the output of the first becomes
the input of the second, and so on until the last Mapper; the output of the last Mapper is
written to the task's output.
The key functionality of this feature is that the Mappers in the chain do not need to be aware
that they are executed in a chain. This enables having reusable specialized Mappers that can be
combined to perform composite operations within a single task.
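The chaining idea can be illustrated with plain function composition in Java: each "mapper" is a self-contained transformation, unaware of its position in the chain. This sketches the concept only, not the ChainMapper API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Illustrates the ChainMapper idea: each mapper transforms records without
// knowing it is part of a chain; the chain pipes one output into the next input.
public class MapperChain {
    static String runChain(String record, List<Function<String, String>> mappers) {
        for (Function<String, String> mapper : mappers) {
            record = mapper.apply(record);   // output of one becomes input of the next
        }
        return record;
    }

    // Two reusable, single-purpose "mappers".
    static final Function<String, String> TRIM = String::trim;
    static final Function<String, String> TO_LOWER = String::toLowerCase;
}
```

Because each stage only sees its own input and output, the same small mappers can be recombined in different orders for different jobs, which is exactly the reuse benefit the text describes.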
Let's look at a particular Writable to see what we can do with it. We will use IntWritable, a
wrapper for a Java int. We can create one and set its value using the set() method:
IntWritable writable = new IntWritable();
writable.set(163);
Equivalently, we can use the constructor that takes the integer value:
IntWritable writable = new IntWritable(163);
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io package.
They form the class hierarchy shown in the figure.
Avro
Apache Avro is a language-neutral data serialization system.
Avro data is described using a language-independent schema.
Avro schemas are usually written in JSON, and data is usually encoded using a binary format, but
there are other options, too. There is a higher-level language called Avro IDL, for writing
schemas in a C-like language that is more familiar to developers. There is also a JSON-based data
encoder, which, being human-readable, is useful for prototyping and debugging Avro data.
Avro specifies an object container format for sequences of objects, similar to Hadoop's
sequence file. An Avro data file has a metadata section where the schema is stored, which
makes the file self-describing. Avro data files support compression and are splittable, which is
crucial for a MapReduce data input format.
Avro provides APIs for serialization and deserialization, which are useful when you want to
integrate Avro with an existing system, such as a messaging system where the framing format is
already defined. In other cases, consider using Avro's data file format.
Let's write a Java program to read and write Avro data to and from streams. We'll start with a simple
Avro schema for representing a pair of strings as a record:
{
"type": "record",
"name": "StringPair",
"doc": "A pair of strings.",
"fields": [
{"name": "left", "type": "string"},
{"name": "right", "type": "string"}
]
}
If this schema is saved in a file on the classpath called StringPair.avsc (.avsc is the conventional
extension for an Avro schema), then we can load it using the following two lines of code:
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(getClass().getResourceAsStream("StringPair.avsc"));
We can create an instance of an Avro record using the generic API as follows:
GenericRecord datum = new GenericData.Record(schema);
datum.put("left", "L");
datum.put("right", "R");
There are two important objects here: the DatumWriter and the Encoder. A DatumWriter
translates data objects into the types understood by an Encoder, which the latter writes to the
output stream. Here we are using a GenericDatumWriter, which passes the fields of
GenericRecord to the Encoder. We pass a null to the encoder factory since we are not reusing a
previously constructed encoder here.
Avro's object container file format is for storing sequences of Avro objects. It is very similar in
design to Hadoop's sequence files. A data file has a header containing metadata, including the
Avro schema and a sync marker, followed by a series of (optionally compressed) blocks
containing the serialized Avro objects.
Writing Avro objects to a data file is similar to writing to a stream. We use a DatumWriter, as
before, but instead of using an Encoder, we create a DataFileWriter instance with the
DatumWriter. Then we can create a new data file (which, by convention, has a .avro extension)
and append objects to it:
File file = new File("data.avro");
DatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(writer);
dataFileWriter.create(schema, file);
dataFileWriter.append(datum);
dataFileWriter.close();
The objects that we write to the data file must conform to the file's schema; otherwise, an
exception will be thrown when we call append().
Writing a SequenceFile
SequenceFile provides Writer, Reader, and SequenceFile.Sorter classes for writing, reading, and
sorting, respectively.
Hadoop has ways of splitting sequence files for doing jobs in parallel, even if they are
compressed, making them a convenient way of storing your data without making your own
format.
Hadoop provides two file formats for grouping multiple entries in a single file:
SequenceFile: A flat file which stores binary key/value pairs. The output of Map/Reduce
tasks is usually written into a SequenceFile.
MapFile: Consists of two SequenceFiles. The data file is identical to the SequenceFile
and contains the data stored as binary key/value pairs. The second file is an index file,
which contains a key/value map with seek positions inside the data file to quickly access
the data.
We started using the SequenceFile format to store log messages. It turned out that, while this
format seems to be well suited for storing log messages and processing them with Map/Reduce
jobs, the direct access to specific log messages is very slow. The API to read data from a
SequenceFile is iterator-based, so it is necessary to jump from entry to entry until the
target entry is reached.
Since one of our most important use cases is searching for log messages in real time, slow
random access performance is a show stopper.
MapFiles use two files: the index file stores the seek position of every n-th key in the data file,
and the data file stores the data as binary key/value pairs.
Therefore we moved to MapFiles. MapFiles have the disadvantage that a random access needs
to read from two separate files. This seems slow, but the indexes that store the seek
positions for our log entries are small enough to be cached in memory. Once the seek position
is identified, only relevant portions of the data file are read. Overall this leads to a nice
performance gain.
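The index-plus-seek scheme can be sketched with a TreeMap standing in for the cached in-memory index. The keys and byte offsets below are made up for illustration; a real MapFile stores the seek position of every n-th key and scans forward in the data file from the returned offset:

```java
import java.util.Map;
import java.util.TreeMap;

// Simulation of a MapFile lookup: a small index maps a sample of keys to
// byte offsets in the data file. A lookup finds the nearest indexed key at
// or before the target, then would scan forward from that offset.
public class MapFileIndexDemo {
    static final TreeMap<String, Long> INDEX = new TreeMap<>();
    static {
        // hypothetical sampled keys -> data-file offsets
        INDEX.put("apple", 0L);
        INDEX.put("cherry", 128L);
        INDEX.put("fig", 256L);
    }

    // Returns the data-file offset to start scanning from for the given key;
    // keys before the first indexed key start at offset 0.
    static long seekPosition(String key) {
        Map.Entry<String, Long> floor = INDEX.floorEntry(key);
        return floor == null ? 0L : floor.getValue();
    }
}
```

Because the index holds only every n-th key, it stays small enough to cache, while each lookup reads at most one index interval of the data file, which is why random access improves so much over the iterator-based SequenceFile scan.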
To create a SequenceFile, use one of its createWriter() static methods, which returns a
SequenceFile.Writer instance.
try {
    writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
    while ((line = buffer.readLine()) != null) {
        key.set(line);
        value.set(line);
        writer.append(key, value);
    }
} finally {
    IOUtils.closeStream(writer);
}
conf.setNumReduceTasks(1);
conf.setReducerClass(IdentityReducer.class);
conf.setOutputKeyClass(LongWritable.class);
conf.setOutputValueClass(Text.class);
conf.setOutputFormat(TextOutputFormat.class);
JobClient.runJob(conf);
return 0;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new MinimalMapReduceWithDefaults(), args);
System.exit(exitCode);
}
}
Input Formats
The InputFormat defines how to read data from a file into the Mapper instances. Hadoop comes with
several implementations of InputFormat; some work with text files and describe different ways in which
the text files can be interpreted. Others, like SequenceFileInputFormat, are purpose-built for reading
particular binary file formats.
Input Splits and Records
An input split is a chunk of the input that is processed by a single map. Each map processes a single split.
Each split is divided into records, and the map processes each record (a key-value pair) in turn.
FileInputFormat- the base class for input formats that use files as their data source.
FileInputFormat input paths- the input to a job is specified as a collection of paths:
public static void addInputPath(JobConf conf, Path path)
public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
public static void setInputPaths(JobConf conf, Path... inputPaths)
public static void setInputPaths(JobConf conf, String commaSeparatedPaths)
FileInputFormat input splits- FileInputFormat splits only large files, where "large" means larger
than an HDFS block.
Small files and CombineFileInputFormat- Hadoop works better with a small number of large files
than a large number of small files. One reason for this is that FileInputFormat generates splits in
such a way that each split is all or part of a single file; CombineFileInputFormat, by contrast,
packs many files into each split so that each mapper has more to process.
Text Input
Hadoop excels at processing unstructured text.
TextInputFormat- TextInputFormat is the default InputFormat.
KeyValueTextInputFormat- it is common for each line in a file to be a key-value pair, separated
by a delimiter such as a tab character.
NLineInputFormat- N refers to the number of lines of input that each mapper receives.
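KeyValueTextInputFormat's record parsing amounts to splitting each line at the first delimiter, a tab by default. A plain-Java sketch of that rule (an illustration of the parsing, not the Hadoop class):

```java
// Sketch of how a key-value line input format turns a line into a record:
// everything before the first tab is the key, the rest is the value.
public class KeyValueLineSplit {
    static String[] toRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] { line, "" };   // no delimiter: whole line is the key
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }
}
```

Only the first delimiter matters, so values are free to contain further tabs without breaking the record boundary.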
XML
Most XML parsers operate on whole XML documents, so a problem arises if a large XML document is made
up of multiple input splits. Using StreamXmlRecordReader, the page elements can be interpreted as
records for processing by a mapper.
Binary Input
SequenceFileInputFormat- Hadoop's sequence file format stores sequences of binary key-value
pairs.
SequenceFileAsTextInputFormat- SequenceFileAsTextInputFormat is a variant of
SequenceFileInputFormat that converts the sequence file's keys and values to Text objects.
SequenceFileAsBinaryInputFormat- SequenceFileAsBinaryInputFormat is a variant of
SequenceFileInputFormat that retrieves the sequence file's keys and values as opaque binary
objects.
Multiple Inputs
Although the input to a MapReduce job may consist of multiple input files (constructed by a
combination of file globs, filters, and plain paths), all of the input is interpreted by a single InputFormat
and a single Mapper.
MultipleInputs.addInputPath(conf,InputPath,TextInputFormat.class, Mapper.class)
Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database, using JDBC.
The corresponding output format is DBOutputFormat, which is useful for dumping job outputs (of
modest size) into a database.
Output Formats
Text Output -The default output format, TextOutputFormat, writes records as lines of text.
Binary Output
SequenceFileOutputFormat -As the name indicates, SequenceFileOutputFormat writes sequence
files for its output.
SequenceFileAsBinaryOutputFormat- SequenceFileAsBinaryOutputFormat is the counterpart to
SequenceFileAsBinaryInputFormat, and it writes keys and values in raw binary format into a
SequenceFile container.
MapFileOutputFormat- MapFileOutputFormat writes MapFiles as output.
Multiple Outputs
There are two special cases when it does make sense to allow the application to set the number of
partitions (or equivalently, the number of reducers):
Zero reducers
This is a vacuous case: there are no partitions, as the application needs to run only map tasks.
One reducer
It can be convenient to run small jobs to combine the output of previous jobs into a single file. This
should only be attempted when the amount of data is small enough to be processed comfortably by one
reducer.
MultipleOutputFormat- MultipleOutputFormat allows you to write data to multiple files whose names
are derived from the output keys and values.
Joins
MapReduce can perform joins between large datasets.
Example: inner join of two data sets.

Stations
station_id  station_loc
2           Pune
7           Mumbai

Records
st_id  st_name   temp
7      atlanta   111
7      atlanta   78
2      richmond  0
2      richmond  22
2      richmond  -11

JOIN result
station_id  station_loc  st_name   temp
2           Pune         richmond  0
2           Pune         richmond  22
2           Pune         richmond  -11
7           Mumbai       atlanta   111
7           Mumbai       atlanta   78
If the join is performed by the mapper, it is called a map-side join, whereas if it is performed by
the reducer it is called a reduce-side join.
If both datasets are too large for either to be copied to each node in the cluster, then we can
still join them using MapReduce with a map-side or reduce-side join, depending on how the
data is structured.
Distributed Cache
Side data can be shared using Hadoop's distributed cache mechanism: we can copy files
and archives to the task nodes when the tasks need them. Usually this is preferable to
passing the data through the JobConfiguration.
If both datasets are too large, then we cannot copy either of them to each node in
the cluster as we did for side data distribution.
Map-Side Joins
A map-side join between large inputs works by performing the join before the data reaches the map
function. To make this work, the inputs to each map must be partitioned and sorted: each input dataset
must be divided into the same number of partitions, and it must be sorted by the same key (the join
key) in each source.
Use CompositeInputFormat from the org.apache.hadoop.mapred.join package to run a map-side join.
The join expression is configured along these lines:
inner(tbl(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class,
"hdfs://localhost:8000/usr/data"),
tbl(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class,
"hdfs://localhost:8000/usr/activity"))
Reduce-Side Joins
Reduce-side joins are simpler than map-side joins, since the input datasets need not be
structured. But they are less efficient, as both datasets have to go through the MapReduce shuffle
phase, which brings the records with the same key together in the reducer. We can also use the
secondary sort technique to control the order of the records.
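Using the station and temperature tables above, the effect of a reduce-side inner join can be simulated in a few lines of plain Java: records are grouped by the join key, as the shuffle would do, and each "reduce" call crosses the matching rows. A simulation of the idea, not Hadoop code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simulation of a reduce-side join: the shuffle groups records by station id,
// then each "reduce" call joins the station row with its temperature rows.
public class ReduceSideJoinDemo {
    static List<String> join(Map<Integer, String> stations,
                             Map<Integer, List<Integer>> tempsByStation) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> e : tempsByStation.entrySet()) {
            String loc = stations.get(e.getKey());
            if (loc == null) continue;               // inner join: drop unmatched keys
            for (int temp : e.getValue()) {
                out.add(e.getKey() + "," + loc + "," + temp);
            }
        }
        return out;
    }
}
```

In a real job the two inputs would be tagged by source and shuffled together; the per-key grouping the HashMap performs here is exactly what the shuffle phase pays for over the network.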
Secondary Sort
The MapReduce framework sorts the records by key before they reach the reducers. For any
particular key, however, the values are not sorted.
It is possible to impose an order on the values by sorting and grouping the keys in a particular
way.
To illustrate the idea, consider the MapReduce program for calculating the maximum
temperature for each year. If we arranged for the values (temperatures) to be sorted in
descending order, we wouldn't have to iterate through them to find the maximum: we could
take the first for each year and ignore the rest. (This approach isn't the most efficient way to
solve this particular problem, but it illustrates how secondary sort works in general.)
To achieve this, we change our keys to be composite: a combination of year and temperature.
1901 35C
1900 35C
1900 34C
....
1900 34C
1901 36C
We want the sort order for keys to be by year (ascending) and then by temperature
(descending):
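The composite-key ordering can be sketched with a plain Java comparator: sort by year ascending, then temperature descending. After this sort, the first record in each year's group is that year's maximum, which is exactly what the secondary sort arranges:

```java
import java.util.Arrays;

// Composite (year, temperature) keys sorted year-ascending then
// temperature-descending: the first record of each year's group is
// that year's maximum temperature.
public class SecondarySortDemo {
    static int[][] sortComposite(int[][] keys) {
        int[][] sorted = keys.clone();
        Arrays.sort(sorted, (a, b) -> a[0] != b[0]
                ? Integer.compare(a[0], b[0])      // group by year, ascending
                : Integer.compare(b[1], a[1]));    // within a year, descending temperature
        return sorted;
    }
}
```

In Hadoop this comparator logic would live in the sort comparator for the composite key, while a grouping comparator that looks only at the year ensures all of a year's values reach the same reduce() call.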
HIVE
Hive runs on your workstation and converts your SQL query into a series of MapReduce
jobs for execution on a Hadoop cluster.
Hive organizes data into tables, which provide a means for attaching structure to data
stored in HDFS.
Metadata (such as table schemas) is stored in a database called the metastore.
Query Execution
Input data file: sample.txt
1950  34  1
1950  22  2
1950  11  2
1949  18  1
1949  42  1
Loading data
LOAD DATA LOCAL INPATH '/home/hadoop/Documents/hivedata/sample.txt'
OVERWRITE INTO TABLE records;
Running this command tells Hive to put the specified local file in its warehouse directory.
Example of dumping data out from a query into a file using silent mode.
You can suppress the informational messages (such as the time taken to run a query) using the -S
option at launch time.
Command: $HIVE_HOME/bin/hive -S -e 'select a.col from tab1 a' > a.txt
The Metastore
The metastore is the central repository of Hive metadata. The metastore is divided into two
pieces:
a service
the backing store for the data.
Using an embedded metastore is a simple way to get started with Hive; however, only one
embedded Derby database can access the database files on disk at any one time, which means
you can only have one Hive session open at a time that shares the same metastore. Trying to
start a second session gives the error:
Failed to start database 'metastore_db'
when it attempts to open a connection to the metastore.
The solution to supporting multiple sessions (and therefore multiple users) is to use a
standalone database. This configuration is referred to as a local metastore, since the metastore
service still runs in the same process as the Hive service, but connects to a database running in
a separate process, either on the same machine or on a remote machine.
MySQL is a popular choice for the standalone metastore. In this case,
javax.jdo.option.ConnectionURL is set to jdbc:mysql://host/dbname?createDatabaseIfNotExist=true,
and javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver.
(The user name and password should be set, too, of course.) The JDBC driver JAR file for MySQL
(Connector/J) must be on Hive's classpath, which is simply achieved by placing it in Hive's lib
directory.
Going a step further, there's another metastore configuration called a remote metastore,
where one or more metastore servers run in separate processes to the Hive service. This brings
better manageability and security, since the database tier can be completely firewalled off, and
the clients no longer need the database credentials.
A Hive service is configured to use a remote metastore by setting hive.metastore.local to false,
and hive.metastore.uris to the metastore server URIs, separated by commas if there is more
than one. Metastore server URIs are of the form thrift://host:port, where the port corresponds
to the one set by METASTORE_PORT when starting the metastore server.
Partitions
The advantage of this scheme is that queries restricted to a particular date or set of dates can
be answered much more efficiently, since they need to scan only the files in the partitions that
the query pertains to.
CREATE TABLE logs (ts INT, line STRING)
PARTITIONED BY (dt STRING, country STRING);
Buckets
Bucketing imposes extra structure on the table, which Hive can take advantage of when
performing certain queries.
The CLUSTERED BY clause to specify the columns to bucket on and the number of buckets
CREATE TABLE bucketed_users (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;
Here we are using the user ID to determine the bucket (which Hive does by hashing the value
and reducing modulo the number of buckets), so any particular bucket will effectively have a
random set of users in it.
Hive's SerDe
Internally, Hive uses a SerDe called LazySimpleSerDe for this delimited format, along with the
line-oriented MapReduce text input and output formats.
Hive-json-serde
This SerDe can be used to read data in JSON format. For example, if your JSON files had the
following contents:
{"field1":"data1","field2":100,"field3":"more data1","field4":123.001}
{"field1":"data2","field2":200,"field3":"more data2","field4":123.002}
{"field1":"data3","field2":300,"field3":"more data3","field4":123.003}
{"field1":"data4","field2":400,"field3":"more data4","field4":123.004}
The following steps can be used to read this data:
JOINS
Inner joins
where each match in the input tables results in a row in the output.
Sales table
Joe   2
Hank  4
Ali   0
Eve   3
Hank  2

Things table
2  Tie
4  Coat
3  Hat
1  Scarf

Inner join result
Joe   2  2  Tie
Hank  2  2  Tie
Eve   3  3  Hat
Hank  4  4  Coat
Outer joins
Outer joins allow you to find nonmatches in the tables being joined
PIG
Apache Pig is a high-level procedural language for querying large semi-structured data sets
using Hadoop and the MapReduce Platform.
Pig simplifies the use of Hadoop by allowing SQL-like queries to a distributed dataset. Explore
the language behind Pig and discover its use in a simple Hadoop cluster.
The Pig tutorial shows you how to run two Pig scripts in local mode and mapreduce mode.
Local Mode: To run the scripts in local mode, no Hadoop or HDFS installation is required. All
files are installed and run from your local host and file system.
Mapreduce Mode: To run the scripts in mapreduce mode, you need access to a Hadoop
cluster and HDFS installation.
Why Pig?
While programming Map and Reduce applications is not overly complex, doing so does require
some experience with software development.
Apache Pig changes this by creating a simpler procedural language abstraction over
MapReduce to expose a more Structured Query Language (SQL)-like interface for
Hadoop applications. So instead of writing a separate MapReduce application, you can
write a single script in Pig Latin that is automatically parallelized and distributed across a
cluster.
Pig Latin
Pig Latin is a relatively simple language that executes statements.
A statement is an operation that takes input (such as a bag, which represents a set of
tuples) and emits another bag as its output.
A bag is a relation, similar to a table, that you'll find in a relational database (where tuples
represent the rows, and individual tuples are made up of fields).
A script in Pig Latin often follows a specific format in which data is read from the file
system, a number of operations are performed on the data (transforming it in one or more
ways), and then the resulting relation is written back to the file system.
Pig has a rich set of data types, supporting not only high-level concepts like bags, tuples,
and maps, but also simple data types such as ints, longs, floats, doubles, chararrays, and
bytearrays.
Pig consists of a range of arithmetic operators (such as add, subtract, multiply, divide,
and module) in addition to a conditional operator called bincond that operates similar
to the C ternary operator. And as you'd expect, a full suite of comparison operators,
including rich pattern matching using regular expressions.
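Purely as an illustration (Pig's real types are Java classes, not Python objects), the data model and the bincond operator map roughly onto Python like this:

```python
# Rough Python analogues of Pig's data model (illustrative only).
# A tuple is an ordered set of fields; a bag is a collection of tuples;
# a map is a set of key/value pairs. Sample values are made up.
pig_tuple = ("alice", 32, 4.5)             # fields: chararray, int, float
pig_bag = [("alice", 32), ("bob", 41)]     # a relation: a bag of tuples
pig_map = {"city": "Reno", "zip": "89501"} # key/value pairs

# bincond behaves like the C ternary operator: (condition ? value1 : value2)
age = pig_tuple[1]
label = "senior" if age >= 40 else "junior"  # Pig: (age >= 40 ? 'senior' : 'junior')
print(label)  # junior
```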
A simple Pig Latin script
messages = LOAD 'messages';
warns = FILTER messages BY $0 MATCHES '.*WARN+.*';
STORE warns INTO 'warnings';
The above Pig Latin script shows the simplicity of this process in Pig. Of the three lines
shown, only one is the actual search. The first line simply reads the test data set (the messages
log) into a bag that represents a collection of tuples. You filter this data (the only entry in the
tuple, represented as $0, or field 1) with a regular expression, looking for the character
sequence WARN. Finally, you store this bag, which now represents all of those tuples from
messages that contain WARN, into a new file called warnings in the host file system.
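To see what those three lines compute, here is a rough Python rendering of the same pipeline (illustrative only; the sample log lines are made up, and this is not how Pig actually executes):

```python
import re

# Conceptual sketch of the three-line Pig script: LOAD reads tuples,
# FILTER keeps the tuples whose first field matches, STORE writes them out.
# The log lines below are invented sample data.
messages = [
    ("2011-12-10 INFO starting up",),
    ("2011-12-10 WARN disk nearly full",),
    ("2011-12-10 WARN heartbeat lost",),
]

# FILTER messages BY $0 MATCHES '.*WARN+.*'
warns = [t for t in messages if re.match(r".*WARN+.*", t[0])]

# STORE would write these surviving tuples to the 'warnings' file.
print(len(warns))  # 2
```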
List of Pig Latin relational operators
Operator  Description
FILTER    Select a set of tuples from a relation based on a condition.
JOIN      Join two or more relations based on a common field.
LOAD      Load data from the file system into a relation.
ORDER     Sort a relation based on one or more fields.
SPLIT     Partition a relation into two or more relations.
STORE     Store data from a relation into the file system.
If you had specified the STORE operator, it would have generated your data within a directory
of the name specified (not a simple regular file).
As shown, this code results in a listing of one or more files if Hadoop is running successfully.
Now, let's test Pig. Begin by starting Pig, then change the directory to your HDFS root to
determine whether you can see what you saw externally in HDFS (see Listing 4).
Listing 4. Testing Pig
$ pig
2011-12-10 06:39:44,276 [main] INFO org.apache.pig.Main - Logging error messages to...
2011-12-10 06:39:44,601 [main] INFO org.apache.pig.... Connecting to hadoop file \
system at: hdfs://0.0.0.0:8020
2011-12-10 06:39:44,988 [main] INFO org.apache.pig.... connecting to map-reduce \
job tracker at: 0.0.0.0:8021
grunt> cd hdfs:///
grunt> ls
hdfs://0.0.0.0/tmp <dir>
hdfs://0.0.0.0/user <dir>
hdfs://0.0.0.0/var <dir>
grunt>
So far, so good. You can see your Hadoop file system from within Pig, so now, try to read some
data into it from your local host file system. Copy a file from local to HDFS through Pig (see
Listing 5).
But what you want is a count of the unique shells specified within the passwd file. So, you use
the FOREACH operator to iterate over each tuple in your group and COUNT the number that
appear (see Listing 8).
Note: To execute this code as a script, simply type your script into a file, and then execute it as
pig myscript.pig.
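Conceptually, the GROUP-then-COUNT pipeline boils down to counting rows per shell value. A rough Python equivalent (illustrative only; the sample passwd lines are made up):

```python
from collections import Counter

# Sketch of what GROUP ... BY shell / FOREACH ... GENERATE COUNT(...)
# computes over a passwd-style file: accounts per login shell.
# These sample lines are invented, not real system data.
passwd_lines = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/bin/sh",
    "bin:x:2:2:bin:/bin:/bin/sh",
    "sync:x:4:65534:sync:/bin:/bin/sync",
]

# The seventh colon-separated field ($6 in Pig's zero-based notation)
# is the login shell.
shells = Counter(line.split(":")[6] for line in passwd_lines)
for shell, count in sorted(shells.items()):
    print(shell, count)
```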
Important Points
Pig has several built-in data types (chararray, float, integer).
PigStorage can parse standard line-oriented text files.
Pig can be extended with custom load types written in Java.
Pig doesn't read any data, and nothing really executes, until triggered by a DUMP or STORE.
Use FOREACH..GENERATE to pick out specific fields or generate new fields; this is also
referred to as a projection.
GROUP will create a new record with the group name and a bag of the tuples in each group.
You can reference a specific field in a bag with <bag>.field (e.g., models.model).
You can use aggregate functions like COUNT, MAX, etc. on a bag.
Use FILTER and FOREACH early to remove unneeded columns or rows to reduce temporary
output.
Use the PARALLEL keyword on GROUP operations to run more reduce tasks.
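The "nothing happens until DUMP or STORE" point deserves emphasis: Pig builds a logical plan and defers execution. Python generators give a loose, purely illustrative analogue (the rows below are made up):

```python
# Pig only executes its plan when a DUMP or STORE forces output.
# A Python generator pipeline behaves similarly: the filter below
# does no work until something consumes it.
executed = []

def load():
    for row in [("a", 1), ("b", 2), ("c", 3)]:
        executed.append(row)      # record that work actually happened
        yield row

plan = (r for r in load() if r[1] > 1)   # like FILTER: nothing runs yet
assert executed == []                    # no data has been read so far

result = list(plan)                      # like DUMP/STORE: forces execution
print(result)
```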
A quick word on writing UDFs in Pig
public class ComputeAverage extends EvalFunc {
And this is how you call it (r1, r2, r3, r4 are just columns/fields from another relation):
grunt> B = foreach A generate id, hid, com.pfalabs.test.ComputeAverage(r1,r2,r3,r4);
Just make sure you pack this into a jar and run this first:
grunt> register /path/to/your/jar/my-udfs.jar;
The number one question here is, how do you iterate through the values you can receive? I can
obviously push more fields into this function.
If it's a one-to-one function (for one value of input, you get one value of output), you can look at
pig-release-0.5.0/tutorial/src/org/apache/pig/tutorial/ExtractHour.java:
String timestamp = (String)input.get(0);
What do we have here? A DataBag that has an Iterator as its first element, an Iterator of
Tuple(s) that have your value as the first element... wow...
Next, we have the one-to-many functions. Luckily, we can use
pig-release-0.5.0/tutorial/src/org/apache/pig/tutorial/NGramGenerator.java as a reference.
....
// take the value
String query = (String)input.get(0);
// generate the output and push it to the return value
DataBag output = DefaultBagFactory.getInstance().newDefaultBag();
// it's a DataBag, so feel free to fill that up!
for (String ngram : ngrams) {
    Tuple t = DefaultTupleFactory.getInstance().newTuple(1);
    t.set(0, ngram);
    output.add(t);
}
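Stripped of the Pig API, a one-to-many UDF like NGramGenerator is just a function that turns one input value into a bag of tuples. A rough Python sketch (illustrative only; this is not Pig's DataBag API):

```python
# One input string yields a "bag" of n-gram tuples, mimicking the shape
# of a one-to-many Pig UDF. The function name and sample input are made up.
def ngram_bag(query, n=2):
    words = query.split()
    output = []                                      # stands in for the DataBag
    for i in range(len(words) - n + 1):
        output.append((" ".join(words[i:i + n]),))   # one-field tuple per n-gram
    return output

print(ngram_bag("the quick brown fox"))
```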
HBase is a distributed column-oriented database built on top of HDFS. HBase is the Hadoop
application to use when you require real-time read/write random-access to very large datasets.
HBase comes at the scaling problem from the opposite direction. It is built from the ground up
to scale linearly just by adding nodes. HBase is not relational and does not support SQL, but
given the proper problem space, it is able to do what an RDBMS cannot: host very large,
sparsely populated tables on clusters made from commodity hardware.
In Other Words
HBase is a key/value store. Specifically, it is a Sparse, Consistent, Distributed,
Multidimensional, Sorted map.
Map
HBase maintains maps of Keys to Values (key -> value). Each of these mappings is called a
"KeyValue" or a "Cell". You can find a value by its key... That's it.
Sorted
These cells are sorted by the key. This is a very important property as it allows for searching
("give me all values for which the key is between X and Y"), rather than just retrieving a value
for a known key.
Multidimensional
The key itself has structure. Each key consists of the following parts: row-key, column family,
column, and time-stamp. So the mapping is actually:
(rowkey, column family, column, timestamp) -> value
rowkey and value are just bytes (column family needs to be printable), so you can store
anything that you can serialize into a byte[] into a cell.
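A toy model of this map in Python (illustrative only; the timestamps and values are sample data, not HBase code) makes the sorted, multidimensional structure concrete:

```python
# Toy model of HBase's sorted multidimensional map:
# (rowkey, column family, column, timestamp) -> value.
# Keeping cells sorted by key is what makes range scans possible.
# All keys, timestamps, and values below are invented samples.
cells = {
    ("row1", "cf", "a", 1288380727188): b"value1",
    ("row3", "cf", "c", 1288380747365): b"value3",
    ("row2", "cf", "b", 1288380738440): b"value2",
}

# "Give me all values whose rowkey is between row1 (inclusive)
# and row3 (exclusive)" -- a scan, not a point lookup.
scan = [(k, v) for k, v in sorted(cells.items()) if "row1" <= k[0] < "row3"]
for key, value in scan:
    print(key[0], value)
```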
Sparse
This follows from the fact that HBase stores key -> value mappings and that a "row" is nothing
more than a grouping of these mappings (identified by the rowkey mentioned above). Unlike
NULL in most relational databases, no storage is needed for absent information: there is
simply no cell for a column that does not have a value. It also means that every value carries
all its coordinates with it.
Distributed
One key feature of HBase is that the data can be spread over hundreds or thousands of machines
and reach billions of cells. HBase manages the load balancing automatically.
Consistent
HBase makes two guarantees:
All changes with the same rowkey (see Multidimensional above) are atomic.
A reader will always read the last written (and committed) values.
HBASE Architecture
HBASE Characteristics
HBase uses the Hadoop Filesystem (HDFS) as its data storage engine
The advantage of this approach is that HBase doesn't need to worry about data
replication
The downside is that it is also constrained by the characteristics of HDFS, which is not
optimized for random read access.
Data is stored in a farm of Region Servers.
The "key-to-server" mapping is needed to locate the corresponding server, and this
mapping is stored as a table similar to other user data tables.
Also in the HBase architecture, there is a special machine playing the "role of master", which
monitors and coordinates the activities of all region servers (the heavy-duty worker nodes). To
the best of my knowledge, the master node is the single point of failure at this moment.
Regions
Tables are automatically partitioned horizontally by HBase into regions.
Each region comprises a subset of a table's rows.
A region is denoted by the table it belongs to, its first row, inclusive, and its last row,
exclusive.
Initially, a table comprises a single region, but as the size of the region grows and crosses a
configurable size threshold, it splits at a row boundary into two new regions of approximately
equal size.
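The split behavior can be sketched as follows (a purely illustrative toy model; the threshold and row names are made up, and real HBase splits on stored bytes, not row counts):

```python
# Toy model of region splitting: a region covers [first_row, last_row)
# and splits at a row boundary into two halves once it grows past a
# threshold. Table name, rows, and threshold are invented samples.
def maybe_split(region, threshold=4):
    table, rows = region["table"], region["rows"]
    if len(rows) <= threshold:
        return [region]                         # still small: no split
    mid = len(rows) // 2                        # split at a row boundary
    return [
        {"table": table, "rows": rows[:mid]},   # [first_row, mid_row)
        {"table": table, "rows": rows[mid:]},   # [mid_row, last_row)
    ]

region = {"table": "test",
          "rows": ["row1", "row2", "row3", "row4", "row5", "row6"]}
parts = maybe_split(region)
print([p["rows"][0] for p in parts])
```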
Locking
Row updates are atomic, no matter how many columns constitute the row-level
transaction. This keeps the locking model simple.
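A toy illustration of that guarantee (not HBase code; the names are made up): every column update for a row is applied and published as a unit.

```python
import threading

# Toy model of row-level atomicity: all column updates for one rowkey
# apply together or not at all. A real HBase region server does this
# internally; this sketch just shows the all-or-nothing shape.
table = {}
row_lock = threading.Lock()

def atomic_put(rowkey, updates):
    with row_lock:                      # one writer per row at a time
        row = dict(table.get(rowkey, {}))
        row.update(updates)             # apply every column change
        table[rowkey] = row             # publish the whole row at once

atomic_put("row1", {"cf:a": "value1", "cf:b": "value2"})
print(table["row1"])
```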
Implementation
Installation
Download a stable release from an Apache Download Mirror and unpack it on your local
filesystem. For example:
% tar xzf hbase-x.y.z.tar.gz
% hbase
Usage: hbase <command>
where <command> is one of:
shell
master
regionserver
zookeeper
rest
thrift
avro
migrate
hbck
Getting Started
Start HBase
$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out
Create a table named test with a single column family named cf. Verify its
creation by listing all tables, and then insert some values:
hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
row2
column=cf:b, timestamp=1288380738440, value=value2
row3
column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds
Now, disable and drop your table. This will clean up everything done above.
hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds
Stopping HBase
$ ./bin/stop-hbase.sh
stopping hbase...............
The HBase Shell is (J)Ruby's IRB with some HBase particular commands added. Anything
you can do in IRB, you should be able to do in the HBase Shell.
To run the HBase shell, do as follows:
$ ./bin/hbase shell
Type help and then <RETURN> to see a listing of shell commands and options. Browse at
least the paragraphs at the end of the help output for the gist of how variables and
command arguments are entered into the HBase shell; in particular, note how table
names, rows, columns, etc., must be quoted.
See Section 1.2.3, Shell Exercises, for examples of basic shell operation.
Scripting
For examples of scripting HBase, look in the HBase bin directory at the files that end in
*.rb. To run one of these files, do as follows:
$ ./bin/hbase org.jruby.Main PATH_TO_SCRIPT
Shell Tricks
irbrc
Create an .irbrc file for yourself in your home directory. Add customizations. A useful one
is command history, so commands are saved across shell invocations:
$ more .irbrc
require 'irb/ext/save-history'
IRB.conf[:SAVE_HISTORY] = 100
IRB.conf[:HISTORY_FILE] = "#{ENV['HOME']}/.irb-save-history"
See the Ruby documentation of .irbrc to learn about other possible configurations.
Outputting dates in exactly the HBase log format takes a little messing with
SimpleDateFormat.
Debug
Shell debug switch
You can set a debug switch in the shell to see more output -- e.g. more of the stack trace
on exception -- when you run a command:
hbase> debug <RETURN>
Overview
NoSQL?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't
an RDBMS that supports SQL as its primary access language. There are many types of NoSQL
databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a
distributed database. Technically speaking, HBase is really more a "Data Store" than a "Data Base"
because it lacks many of the features you find in an RDBMS, such as typed columns, secondary
indexes, triggers, and advanced query languages.
However, HBase has many features which support both linear and modular scaling. HBase
clusters expand by adding RegionServers that are hosted on commodity-class servers. If a cluster
expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and
processing capacity. An RDBMS can scale well, but only up to a point - specifically, the size of
a single database server - and for the best performance it requires specialized hardware and
storage devices. HBase features of note are: