
Hadoop-Hbase-Hive-Sqoop Configuration Documentation

Hadoop Distributed File System (HDFS) is the primary storage system used by
Hadoop applications. HDFS creates multiple replicas of data blocks and distributes
them on compute nodes throughout a cluster to enable reliable, extremely rapid
computations.
Hadoop MapReduce is a programming model and software framework for writing
applications that rapidly process vast amounts of data in parallel on large clusters of
compute nodes.
Hive is a data warehouse system for Hadoop that facilitates easy data summarization,
ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file
systems. Hive provides a mechanism to project structure onto this data and query the
data using a SQL-like language called HiveQL. At the same time this language also
allows traditional map/reduce programmers to plug in their custom mappers and
reducers when it is inconvenient or inefficient to express this logic in HiveQL.
The main building blocks of Hive are:
1. Metastore: stores the system catalog and metadata about tables, columns,
   partitions, etc.
2. Driver: manages the lifecycle of a HiveQL statement as it moves through Hive.
3. Query Compiler: compiles HiveQL into a directed acyclic graph of MapReduce tasks.
4. Execution Engine: executes the tasks produced by the compiler in proper
   dependency order.
5. HiveServer: provides a Thrift interface and a JDBC/ODBC server.
HBase is the Hadoop database. Think of it as a distributed, scalable, big data store.
Use HBase when you need random, realtime read/write access to your Big Data. This
project's goal is the hosting of very large tables (billions of rows by millions of
columns) atop clusters of commodity hardware. HBase is an open-source, distributed,
versioned, column-oriented store modeled after Google's "Bigtable: A Distributed
Storage System for Structured Data" by Chang et al. Just as Bigtable leverages the
distributed data storage provided by the Google File System, HBase provides
Bigtable-like capabilities on top of Hadoop and HDFS.

HBase features include:
Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables.
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with HBase.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters.
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
Extensible jruby-based (JIRB) shell.
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia, or via JMX.

Loading bulk data into Hadoop from production systems, or accessing it from
MapReduce applications running on large clusters, can be a challenging task.
Transferring data using scripts is inefficient and time-consuming.
How do we efficiently move data from external storage into HDFS, Hive, or
HBase? Meet Apache Sqoop. Sqoop allows easy import and export of data from
structured data stores such as relational databases, enterprise data warehouses,
and NoSQL systems. The dataset being transferred is sliced into partitions,
and a map-only job is launched in which each mapper is responsible for
transferring one slice of the dataset.
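
The slicing idea can be pictured with a small sketch. This is not Sqoop's own code: it only illustrates, assuming an integer key column named id with a known minimum and maximum, how a key range might be divided into one contiguous slice per mapper. The helper name split_ranges and the sample numbers are hypothetical.

```shell
# Illustrative only: divide the key range [min, max] into n contiguous
# slices, one per mapper. split_ranges is a hypothetical helper, not a
# Sqoop command.
split_ranges() {
    min=$1; max=$2; n=$3
    total=$((max - min + 1))      # number of keys in the range
    base=$((total / n))           # every slice gets at least this many keys
    extra=$((total % n))          # the first $extra slices get one more key
    lo=$min
    i=0
    while [ "$i" -lt "$n" ]; do
        size=$base
        if [ "$i" -lt "$extra" ]; then
            size=$((size + 1))
        fi
        hi=$((lo + size - 1))
        echo "mapper $i: id >= $lo AND id <= $hi"
        lo=$((hi + 1))
        i=$((i + 1))
    done
}

# Ten keys shared by four mappers
split_ranges 1 10 4
```

In a real import, Sqoop decides the boundaries itself; options such as --split-by and --num-mappers influence which column is used and how many slices are produced.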

ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services. All of
these kinds of services are used in some form or another by distributed applications.
Each time they are implemented there is a lot of work that goes into fixing the bugs
and race conditions that are inevitable. Because of the difficulty of implementing
these kinds of services, applications initially usually skimp on them, which makes
them brittle in the presence of change and difficult to manage. Even when done
correctly, different implementations of these services lead to management
complexity when the applications are deployed.

Pig is a platform for analyzing large data sets that consists of a high-level language for
expressing data analysis programs, coupled with infrastructure for evaluating these
programs. The salient property of Pig programs is that their structure is amenable to
substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces
sequences of Map-Reduce programs, for which large-scale parallel implementations
already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a
textual language called Pig Latin, which has the following key properties:
Ease of programming. It is trivial to achieve parallel execution of simple,
embarrassingly parallel data analysis tasks. Complex tasks comprised of
multiple interrelated data transformations are explicitly encoded as data flow
sequences, making them easy to write, understand, and maintain.
Optimization opportunities. The way in which tasks are encoded permits the
system to optimize their execution automatically, allowing the user to focus on
semantics rather than efficiency.
Extensibility. Users can create their own functions to do special-purpose
processing.

Install Linux
Install JDK 1.6
$ rpm -Uvh jdk1.6.0_33-linux-i586.rpm
Add a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that's not
required, it is recommended because it helps to separate the Hadoop installation from
other software applications and user accounts running on the same machine (think:
security, permissions, backups, etc.).
$ groupadd hadoop
$ useradd -G hadoop hduser
$ passwd hduser
Enter password: hadoop
This adds the user hduser and the group hadoop to your local machine.
Update hosts
Edit /etc/hosts and add one line per node:
<IP address of master machine> master
<IP address of slave machine> slave

In our case the hostnames are master, slave1, slave2, slave3 and slave4.
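
As a sketch of what the resulting entries might look like, the snippet below appends one line per node to a hosts file. The 192.168.0.x addresses are placeholders, not values from this setup. HOSTS_FILE is a hypothetical variable that defaults to a scratch file so the snippet can be tried safely; writing to /etc/hosts itself requires root.

```shell
# Scratch copy for experimenting; set HOSTS_FILE=/etc/hosts (as root) to apply.
HOSTS_FILE=${HOSTS_FILE:-$(mktemp)}

# Placeholder addresses: replace them with the real IPs of your machines.
cat >> "$HOSTS_FILE" <<'EOF'
192.168.0.1    master
192.168.0.2    slave1
192.168.0.3    slave2
192.168.0.4    slave3
192.168.0.5    slave4
EOF

cat "$HOSTS_FILE"
```

Every node needs the same entries so that each machine can resolve all the others by name.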

Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local
machine if you want to use Hadoop on it (which is what we want to do in this short
tutorial). We assume that you have SSH up and running on your machine and have
configured it to allow SSH public key authentication. First, we have to generate an SSH
key for the hduser user.
user@ubuntu:~$ su - hduser
hduser@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
9b:82:ea:58:b4:e0:35:d7:ff:19:66:a6:ef:ae:0e:d2 hduser@ubuntu
The key's randomart image is:
The second line will create an RSA key pair with an empty password. Generally, using
an empty password is not recommended, but in this case it is needed to unlock the key
without your interaction (you don't want to enter the passphrase every time Hadoop
interacts with its nodes).

Second, you have to enable SSH access to your master and slaves with this newly
created key.
On the master, run:
hduser@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
hduser@ubuntu:~$ scp $HOME/.ssh/id_rsa.pub slavesIP:$HOME/.ssh/authorized_keys
On each slave, run:
hduser@ubuntu:~$ scp $HOME/.ssh/id_rsa.pub masterIP:$HOME/.ssh/authorized_keys2
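
Copying id_rsa.pub directly onto authorized_keys replaces whatever keys the target file already holds. A hedged alternative is to append the key only when it is missing and to set the conventional permissions on the directory and file. The helper name authorize_key is hypothetical, and the demonstration uses a stand-in key in a scratch directory rather than a real ~/.ssh.

```shell
# authorize_key PUBKEY_FILE SSH_DIR
# Appends the public key to SSH_DIR/authorized_keys unless it is already there.
authorize_key() {
    pubkey_file=$1
    ssh_dir=$2
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
    # -x matches whole lines, -F treats the key as a literal string,
    # -f reads the pattern (the key line) from the public key file
    if ! grep -qxF -f "$pubkey_file" "$ssh_dir/authorized_keys"; then
        cat "$pubkey_file" >> "$ssh_dir/authorized_keys"
    fi
}

# Demonstration with a stand-in key in a scratch directory.
demo_dir=$(mktemp -d)
echo "ssh-rsa AAAAexamplekey hduser@master" > "$demo_dir/id_rsa.pub"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"
authorize_key "$demo_dir/id_rsa.pub" "$demo_dir/.ssh"   # second call adds nothing
wc -l < "$demo_dir/.ssh/authorized_keys"
```

On a real cluster the same function could be run on each slave after copying the master's id_rsa.pub there with scp. Where it is installed, ssh-copy-id does much the same job.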

Hadoop Installation
You have to download Hadoop from the Apache Download Mirrors and extract the
contents of the Hadoop package to a location of your choice. I picked /usr/local/hadoop.
Make sure to change the owner of all the files to the hduser user and hadoop group, for
example:
$ cd /usr/local
$ tar xzf hadoop-1.0.3.tar.gz
$ mv hadoop-1.0.3 hadoop
$ chown -R hduser:hadoop hadoop

Update $HOME/.bashrc
Add the following lines to the end of the $HOME/.bashrc file of user hduser. If you use a
shell other than bash, you should of course update its appropriate configuration files
instead of .bashrc.
# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop

# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/java/jdk1.6.0_33

# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

# Set hbase-related environment variables
export HBASE_HOME=/usr/local/hbase

Setting hadoop.tmp.dir
mkdir -p /app/hadoop/tmp
chown hduser:hadoop /app/hadoop/tmp
# ...and if you want to tighten up security, chmod from 755 to 750...
chmod 750 /app/hadoop/tmp

If you forget to set the required ownerships and permissions, you will see a
java.io.IOException when you try to format the name node.
Add the following snippets between the <configuration> ... </configuration> tags in the
respective configuration XML file.

In file conf/core-site.xml:
<!-- In: conf/core-site.xml -->
<!-- Property hadoop.tmp.dir, set to the /app/hadoop/tmp directory created above -->
<description>A base for other temporary directories.</description>

<!-- Property fs.default.name -->
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
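
A complete conf/core-site.xml for this setup might look like the sketch below. The host name master comes from the hosts section, and /app/hadoop/tmp is the directory created earlier. Port 54310 is an assumed example, not a value given in this document. CONF_DIR is a hypothetical variable that defaults to a scratch directory so that nothing real is overwritten.

```shell
# Scratch directory for the sketch; use /usr/local/hadoop/conf on a real node.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <!-- Port 54310 below is an assumed example; any free port works. -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
    <description>The name of the default file system.</description>
  </property>
</configuration>
EOF

cat "$CONF_DIR/core-site.xml"
```

The same file should be present on every node so that all daemons agree on where the NameNode lives.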

In file conf/mapred-site.xml:
<!-- In: conf/mapred-site.xml -->
<!-- Property mapred.job.tracker -->
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.</description>
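
A hedged sketch of a complete conf/mapred-site.xml follows. It assumes the JobTracker runs on the host named master. Port 54311 is an assumed example, not a value given in this document, and CONF_DIR is a hypothetical scratch-directory variable.

```shell
# Scratch directory for the sketch; use /usr/local/hadoop/conf on a real node.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Port 54311 below is an assumed example; any free port works. -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
    <description>The host and port that the MapReduce job tracker runs at.</description>
  </property>
</configuration>
EOF

cat "$CONF_DIR/mapred-site.xml"
```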

In file conf/hdfs-site.xml:
<!-- In: conf/hdfs-site.xml -->
<!-- Property dfs.replication -->
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.</description>
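
A hedged sketch of a complete conf/hdfs-site.xml follows. The replication factor of 3 is an assumption (it is Hadoop's usual default); it should not exceed the number of DataNodes in the cluster. CONF_DIR is a hypothetical scratch-directory variable.

```shell
# Scratch directory for the sketch; use /usr/local/hadoop/conf on a real node.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- The value 3 is an assumption; choose a value no larger than the
       number of DataNodes. -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication.</description>
  </property>
</configuration>
EOF

cat "$CONF_DIR/hdfs-site.xml"
```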

conf/masters (master only)
Despite its name, the conf/masters file defines on which machines Hadoop will start
secondary NameNodes in our multi-node cluster. In our case, this is just the master
machine. The primary NameNode and the JobTracker will always be the machines on
which you run the bin/start-dfs.sh and bin/start-mapred.sh scripts, respectively (the
primary NameNode and the JobTracker will be started on the same machine if you run
bin/start-all.sh). Note that you can also start a Hadoop daemon manually on a machine
via bin/hadoop-daemon.sh start [namenode | secondarynamenode | datanode |
jobtracker | tasktracker], which will not take the conf/masters and conf/slaves files into
account.
On master, update conf/masters so that it looks like this:
master
conf/slaves (master only)
This conf/slaves file lists the hosts, one per line, where the Hadoop slave daemons
(DataNodes and TaskTrackers) will be run. We want both the master box and the slave
box to act as Hadoop slaves because we want both of them to store and process data.
On master, update conf/slaves so that it looks like this:
master
slave1
slave2
slave3
slave4

Format the namenode
Format the NameNode on the master only:

hduser@ubuntu:~$ /usr/local/hadoop/bin/hadoop namenode -format

The output will look like this:
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop namenode -format
10/05/08 16:59:56 INFO namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
10/05/08 16:59:56 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop
10/05/08 16:59:56 INFO namenode.FSNamesystem: supergroup=supergroup
10/05/08 16:59:56 INFO namenode.FSNamesystem: isPermissionEnabled=true
10/05/08 16:59:56 INFO common.Storage: Image file of size 96 saved in 0 seconds.
10/05/08 16:59:57 INFO common.Storage: Storage directory .../hadoop-hduser/dfs/name has been
successfully formatted.
10/05/08 16:59:57 INFO namenode.NameNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/

Starting your multi-node cluster
Run the command:
hduser@ubuntu:~$ /usr/local/hadoop/bin/start-all.sh

Stopping your multi-node cluster
Run the command:
hduser@ubuntu:~$ /usr/local/hadoop/bin/stop-all.sh


HBase Installation
You have to download HBase from the Apache Download Mirrors and extract the
contents of the HBase package to a location of your choice. I picked /usr/local/hbase.
Make sure to change the owner of all the files to the hduser user and hadoop group, for
example:
$ cd /usr/local
$ tar xzf hbase-0.92.1.tar.gz
$ mv hbase-0.92.1 hbase
$ chown -R hduser:hadoop hbase

In file conf/hbase-env.sh, write:
export JAVA_HOME=/usr/java/jdk1.6.0_33

Open the file $HBASE_INSTALL_DIR/conf/hbase-site.xml and add the following properties.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Property hbase.master -->
<description>The host and port that the HBase master runs at.
A value of 'local' runs the master and a regionserver
in a single process.</description>

<!-- Property hbase.rootdir -->
<description>The directory shared by region servers.</description>

<!-- Property hbase.cluster.distributed -->
<description>The mode the cluster will be in. Possible values are
false: standalone and pseudo-distributed setups with managed Zookeeper
true: fully-distributed with unmanaged Zookeeper Quorum (see hbase-env.sh)
</description>

<!-- Property hbase.zookeeper.property.clientPort -->
<description>Property from ZooKeeper's config zoo.cfg.
The port at which the clients will connect.</description>

<!-- Property hbase.zookeeper.quorum -->
<value>master</value>
<description>Comma separated list of servers in the ZooKeeper Quorum.
For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
By default this is set to localhost for local and
pseudo-distributed modes of operation. For a
fully-distributed setup, this should be set to a full
list of Zookeeper quorum servers. If
HBASE_MANAGES_ZK is set in hbase-env.sh
this is the list of servers which we will start/stop
Zookeeper on.</description>
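
A hedged sketch of a complete conf/hbase-site.xml for this layout follows. Only the quorum value master is taken from this document. The other values are assumptions: hbase.rootdir must point at the same NameNode URI that fs.default.name uses (port 54310 is an assumed example), 60000 and 2181 are the customary default ports for the HBase master and ZooKeeper clients, and true is assumed for a multi-node cluster. CONF_DIR is a hypothetical scratch-directory variable.

```shell
# Scratch directory for the sketch; use /usr/local/hbase/conf on a real node.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

cat > "$CONF_DIR/hbase-site.xml" <<'EOF'
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- All values except the quorum are assumptions; see the text above. -->
  <property>
    <name>hbase.master</name>
    <value>master:60000</value>
  </property>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:54310/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.clientPort</name>
    <value>2181</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>master</value>
  </property>
</configuration>
EOF

cat "$CONF_DIR/hbase-site.xml"
```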

In our case, ZooKeeper and the HBase master both run on the same machine.
Open the file $HBASE_INSTALL_DIR/conf/hbase-env.sh and uncomment the following line:
export HBASE_MANAGES_ZK=true

Open the file $HBASE_INSTALL_DIR/conf/regionservers and add all the regionserver
machine names.


Note: Add hbase-master machine name only if you are running a regionserver on
hbase-master machine.


Starting the HBase cluster

Execute the following command on the hbase-master machine to start the HBase cluster:
$HBASE_INSTALL_DIR/bin/start-hbase.sh

Starting the HBase shell
$HBASE_INSTALL_DIR/bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Version: 0.20.6, r965666, Mon Jul 19 16:54:48 PDT 2010
hbase(main):001:0>

Now, create a table in HBase.
hbase(main):001:0> create 't1', 'f1'
0 row(s) in 1.2910 seconds
hbase(main):002:0>

Note: If the table is created successfully, everything is running fine.

Stopping the HBase cluster
Execute the following command on the hbase-master machine to stop the HBase cluster:
$HBASE_INSTALL_DIR/bin/stop-hbase.sh

Installing Hive
Once you unpack Hive, set the HIVE_HOME environment variable:
$ export HIVE_HOME=/usr/local/hive
Now that Hadoop and Hive are both installed and running, you need to create directories
for the Hive metastore and set their permissions.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /tmp
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/hive/warehouse
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /tmp
$ $HADOOP_HOME/bin/hadoop fs -chmod g+w /user/hive/warehouse

Create hive-site.xml in /usr/local/hive/conf
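
The contents of hive-site.xml are not given here. As a minimal sketch, assuming the only setting needed is the warehouse location created with the hadoop fs commands above, the file might contain a single hive.metastore.warehouse.dir property. CONF_DIR is a hypothetical variable that defaults to a scratch directory.

```shell
# Scratch directory for the sketch; use /usr/local/hive/conf on a real node.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

cat > "$CONF_DIR/hive-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Assumed minimal content: point Hive at the warehouse directory
       created in HDFS earlier. -->
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
  </property>
</configuration>
EOF

cat "$CONF_DIR/hive-site.xml"
```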





Test your Hive install.
$ $HIVE_HOME/bin/hive
hive> show tables;
Time taken: 6.374 seconds


Installing Sqoop
Untar the Sqoop archive to /usr/local/.
Download mysql-connector-java and place its JAR in the Sqoop lib directory.

[root@master ~]# chown hduser /usr/local/sqoop
[root@master ~]# chown hduser /usr/local/sqoop/conf
[root@master ~]# chmod 755 /usr/local/sqoop/conf
[root@master ~]# chmod 755 /usr/local/sqoop/

Copy hadoop-core-1.0.3.jar into the Sqoop lib directory.
Copy sqoop-sqlserver-1.0.jar and mysql-connector-java-5.1.21-bin.jar into the Sqoop lib directory.
Set the environment variables.
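
Which variables are meant is not listed here. A plausible minimal set, offered as an assumption, uses the install paths that appear earlier in this guide. SQOOP_HOME is a conventional name rather than one confirmed by this document, and the exact variables Sqoop checks can differ between releases.

```shell
# Assumed variable set, based on the install locations used in this guide.
export SQOOP_HOME=/usr/local/sqoop
export HADOOP_HOME=/usr/local/hadoop
export HBASE_HOME=/usr/local/hbase

# Make the sqoop launcher available on the command line.
export PATH=$PATH:$SQOOP_HOME/bin

echo "SQOOP_HOME is $SQOOP_HOME"
```

Adding these lines to the $HOME/.bashrc of hduser keeps them in place across logins, in the same way as the Hadoop variables.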
