Contents
1. Prerequisites for Pseudo-mode Cluster Installation
   1. Sun Java 6
   2. Adding a dedicated Hadoop system user
   3. Configuring SSH
2. Pseudo-mode Cluster Installation
   1. Configuration
   2. Formatting the name node
   3. Starting your single-node cluster
   4. Running a MapReduce job
3. Prerequisites for Fully-distributed Cluster Installation
   1. Networking
   2. SSH access
4. Fully-distributed Cluster Installation
   1. Configuration
   2. Formatting the name node
   3. Starting your single-node cluster
   4. Running a MapReduce job
5. Hadoop Web Interfaces
   1. HDFS Name Node Web Interface
   2. MapReduce Job Tracker Web Interface
   3. Task Tracker Web Interface
6. Points to remember
Configuring SSH
Hadoop requires SSH access to manage its nodes, i.e. the remote machines and, if you want to run Hadoop on it, your local machine. For our single-node setup of Hadoop we therefore need to configure SSH access to localhost for the hadoop user we created in the previous section. First, we have to generate an SSH key for the hadoop user. The command for this is:
$ ssh-keygen -t rsa -P ""
This command creates an RSA key pair with an empty passphrase. Generally, using an empty passphrase is not recommended, but in this case it is needed so that the key can be unlocked without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes). Second, you have to enable SSH access to your local machine with this newly created key. For this, type the command:
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The final step is to test the SSH setup by connecting to your local machine as the hadoop user. This step is also needed to save your local machine's host key fingerprint to the hadoop user's known_hosts file.
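Taken together, the SSH setup for the hadoop user can be sketched as follows. The directory creation, permission fixes, and the guard around ssh-keygen are additions for robustness and are not part of the original steps; run this as the hadoop user:

```shell
# Create the .ssh directory with the permissions sshd expects
mkdir -p "$HOME/.ssh"
chmod 700 "$HOME/.ssh"
# Generate an RSA key with an empty passphrase, unless one already exists
[ -f "$HOME/.ssh/id_rsa" ] || ssh-keygen -t rsa -P "" -f "$HOME/.ssh/id_rsa"
# Authorize the new key for password-less logins to this machine
cat "$HOME/.ssh/id_rsa.pub" >> "$HOME/.ssh/authorized_keys"
chmod 644 "$HOME/.ssh/authorized_keys"
```

Afterwards, connect once with ssh localhost and accept the host key fingerprint when prompted.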
Configuration
Open conf/hadoop-env.sh in the editor of your choice and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory, e.g.:
export JAVA_HOME=/cygdrive/c/"Program Files"/Java/jdk1.6.0_20
Set the PATH and CLASSPATH variables appropriately in your .profile file and load it.
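For example, assuming the JDK path from above and a Hadoop 0.20.2 install under the home directory (both install paths are placeholders; adjust them to your machine), the .profile entries could look like:

```shell
# Example .profile entries; the install paths below are placeholders
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0_20"
export HADOOP_HOME="$HOME/hadoop-0.20.2"
# Put the java and hadoop launchers on the PATH
export PATH="$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin"
# Add the Hadoop core jar to the CLASSPATH for compiling jobs
export CLASSPATH="$CLASSPATH:$HADOOP_HOME/hadoop-0.20.2-core.jar"
```

After editing, reload the file with . ~/.profile (or source ~/.profile).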
To view the started daemons:
$ jps
This should list the started daemons.
To create a directory in HDFS:
$ bin/hadoop fs -mkdir hdfs://localhost:9000/input
To copy files to HDFS:
$ bin/hadoop fs -copyFromLocal input/* hdfs://localhost:9000/input/
Run the MapReduce job
Now we actually run the WordCount example job. Run the program as follows:
$ bin/hadoop jar hadoop-0.18.3-examples.jar wordcount hdfs://localhost:9000/input hdfs://localhost:9000/output
To copy files from HDFS onto the local file system, use the command below:
$ bin/hadoop fs -copyToLocal hdfs://localhost:9000/output/part-r-00000 .
Networking
Both machines must be able to reach each other over the network. The easiest way is to put both machines in the same network with regard to hardware and software configuration, for example by connecting both machines via a single hub or switch and configuring the network interfaces to use a common network. To keep things simple, we will assign the IP address x.x.x.x to the master machine and y.y.y.y to the slave machine. Update /etc/hosts on both machines. For example:
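Assuming the placeholder addresses above (substitute your actual IP addresses), /etc/hosts on both machines would contain:

```
x.x.x.x    master
y.y.y.y    slave
```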
SSH access
The hadoop user on the master must be able to connect: 1) to its own user account on the master, i.e. ssh master in this context, not ssh localhost; 2) to the hadoop user account on the slave via a password-less SSH login. If you followed the pseudo-mode cluster prerequisites, you just have to add hadoop@master's public SSH key (which should be in $HOME/.ssh/id_rsa.pub) to the authorized_keys file of hadoop@slave (in that user's $HOME/.ssh/authorized_keys). You can do this manually or use the following command:
$ ssh-copy-id -i $HOME/.ssh/id_rsa.pub hadoop@slave
This command will prompt you for the login password of user hadoop on slave, then copy the public SSH key for you, creating the correct directory and fixing the permissions as necessary.
conf/slaves
The conf/slaves file lists the hosts, one per line, on which the Hadoop slave daemons (DataNodes and TaskTrackers) will run. We want both the master box and the slave box to act as Hadoop slaves, because we want both of them to store and process data. Update conf/slaves so that it looks like this:
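With both boxes acting as slaves, a conf/slaves file consistent with this tutorial's host names would be:

```
master
slave
```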
If you have additional slave nodes, just add them to the conf/slaves file, one per line. Make the following configuration entries on both the master and slave machines:
$ vi core-site.xml
$ vi hdfs-site.xml
$ vi mapred-site.xml
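The contents of these files were left out of this document; typical entries for a two-node Hadoop 0.20 cluster are sketched below. The host name master matches this tutorial, but the job tracker port and the dfs directory values are assumptions; adjust them to your setup.

conf/core-site.xml:

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```

conf/hdfs-site.xml:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hadoop-0.20.2/full/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hadoop-0.20.2/full/dfs/data</value>
  </property>
</configuration>
```

conf/mapred-site.xml:

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>
```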
To view the started daemons, run the following on each machine:
$ jps
This should list the started daemons on the master and on the slave.
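Since the master is also listed in conf/slaves, it runs the slave daemons in addition to the master daemons. A typical jps listing (daemon names assume Hadoop 0.20; process IDs omitted) would be:

```
Master:
NameNode
SecondaryNameNode
JobTracker
DataNode
TaskTracker

Slave:
DataNode
TaskTracker
```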
To create a directory in HDFS:
$ bin/hadoop fs -mkdir hdfs://localhost:9000/input
To copy files to HDFS:
$ bin/hadoop fs -copyFromLocal input/* hdfs://localhost:9000/input/
Run the MapReduce job
Now we actually run the WordCount example job. Run the program as follows:
$ bin/hadoop jar hadoop-0.18.3-examples.jar wordcount hdfs://localhost:9000/input hdfs://localhost:9000/output1
To copy files from HDFS onto the local file system, use the command below:
$ bin/hadoop fs -copyToLocal hdfs://localhost:9000/output1/part-r-00000 .
Hadoop comes with several web interfaces, which are by default available at these locations:
http://localhost:50070/ - web UI for HDFS name node(s)
http://localhost:50030/ - web UI for MapReduce job tracker(s)
http://localhost:50060/ - web UI for task tracker(s)
These web interfaces provide concise information about what's happening in your Hadoop cluster.
Points to remember
If there is any problem in the cluster, do not forget to go through the logs; they will give you all the details about the errors.

While starting a fully distributed cluster, if the daemons do not start on the slave machine and the datanode logs on the slave show "connection refused" errors for the master, there might be a problem with the firewall. Turn off the firewall and try again. You can turn the firewall off with the following commands on a RHEL machine (you need root access for this):
$ /etc/init.d/iptables save
$ /etc/init.d/iptables stop

Ensure that the sshd service is running on your machine. You can start the sshd service on RHEL as follows:
$ /etc/init.d/sshd start

While configuring SSH on a machine, ensure that the user's home directory is chmod 700 and the authorized_keys file in the .ssh directory is chmod 644.

If you see the error java.io.IOException: Incompatible namespaceIDs in the logs of a DataNode, chances are you are affected by issue HDFS-107 (formerly known as HADOOP-1212). At the moment there seem to be two options, as described below.
Option 1: Start from scratch
1. Stop the cluster.
2. Delete the data directory on the problematic DataNode: the directory is specified by dfs.data.dir in conf/hdfs-site.xml; if you followed this tutorial, the relevant directory is /home/hadoop/hadoop-0.20.2/full/dfs/data.
3. Reformat the NameNode (NOTE: all HDFS data is lost during this process!).
4. Restart the cluster.

If deleting all the HDFS data and starting from scratch does not sound like a good idea (it might be OK during the initial setup/testing), you might give the second approach a try.

Option 2: Updating the namespaceID of problematic DataNodes

1. Stop the DataNode.
2. Edit the value of namespaceID in the DataNode's current/VERSION file to match the value of the current NameNode.
3. Restart the DataNode.
If you followed the instructions in this tutorial, the full paths of the relevant files are:
NameNode: /home/hadoop/hadoop-0.20.2/full/dfs/name/current/VERSION
DataNode: /home/hadoop/hadoop-0.20.2/full/dfs/data/current/VERSION
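The edit in step 2 can be sketched with grep and sed. Since the real VERSION files only exist on a formatted cluster, this sketch demonstrates the edit on mock files under /tmp; on a real cluster you would use the dfs/name and dfs/data paths listed above:

```shell
# Create mock VERSION files to demonstrate the edit; a real cluster
# would use the dfs/name and dfs/data paths from the tutorial
mkdir -p /tmp/dfs-demo/name/current /tmp/dfs-demo/data/current
echo "namespaceID=1092135" > /tmp/dfs-demo/name/current/VERSION
echo "namespaceID=9999999" > /tmp/dfs-demo/data/current/VERSION
# Read the namespaceID recorded by the NameNode
NN_ID=$(grep '^namespaceID=' /tmp/dfs-demo/name/current/VERSION | cut -d= -f2)
# Overwrite the DataNode's namespaceID so the two match
sed -i "s/^namespaceID=.*/namespaceID=$NN_ID/" /tmp/dfs-demo/data/current/VERSION
cat /tmp/dfs-demo/data/current/VERSION
# prints namespaceID=1092135
```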