Hadoop Deployment Cheat Sheet

4/2/2018 Hadoop Deployment Cheat Sheet | Jethro
https://jethro.io/hadoop-deployment-cheat-sheet

Introduction
If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this
document can help you navigate the technology and terminology, and guide you in setting up and
configuring the system.
This document provides background information about the framework, the key distributions,
modules, components, and related products. It also provides single-node and multi-node Hadoop
installation commands and configuration parameters.
The final section includes tips and tricks to help you get started, and provides guidance on setting up a
Hadoop project.
Contents
Hadoop Distributions
Hadoop Modules
Hadoop Components
Hadoop Ecosystem
Major Hadoop Cloud Providers
Common Data Formats
Single Node Installation
Multi-node Installation
Backup HDFS Metadata
HDFS Basic Commands
HDFS Administration
YARN
MapReduce
Resource Manager UI
Secure Hadoop
Hadoop Tips and Tricks

Key Hadoop Distributions

Vendor          Strength
Apache Hadoop   The open source distribution from Apache
Hortonworks     A leading vendor committed to a 100% open source package
Cloudera        Hadoop file system with proprietary components for enterprise needs
MapR            Uses its own proprietary file system
IBM             Integration with IBM analytics products
Pivotal         Integration with Greenplum and Cloud Foundry (CF)

Hadoop Modules

Module      Description
Common      Common utilities that support the other Hadoop modules
HDFS        Hadoop Distributed File System: provides high-throughput access to application data on commodity hardware
YARN        Yet Another Resource Negotiator: a framework for cluster resource management, including job scheduling
MapReduce   Software framework for parallel processing of large data sets, based on YARN

Hadoop Components

Component / Module          Description
NameNode / HDFS             Maintains the directory tree of the HDFS file system (a.k.a. Hadoop inode)
Secondary NameNode / HDFS   Checkpointing helper for the NameNode, not a failover standby. It provides checkpoints of the
                            namespace by merging the edits file into the fsimage file
JournalNode / HDFS          Stores the shared edit log that keeps the active and standby NameNodes in sync, which supports
                            failover between NameNodes
DataNode / HDFS             Nodes (or servers) that store the actual data
NFS3 Gateway / HDFS         Daemons that enable NFS3 support
ResourceManager / YARN      Global daemon that arbitrates resources among all the applications in the Hadoop cluster
ApplicationMaster / YARN    Takes care of a single application: gets resources for it from the ResourceManager and works
                            with the NodeManager to consume them and monitor the tasks
NodeManager / YARN          Single-machine agent that is responsible for the containers, for monitoring their resource
                            usage (such as CPU and disk), and for reporting back to the ResourceManager
Container / YARN            Runs specific tasks on a specific machine for a specific application, based on allocated resources

Hadoop Ecosystem – Related Products

Product     Description
Ambari      A completely open-source management platform for provisioning, managing, monitoring and securing
            Apache Hadoop clusters
Apex        Big-data-in-motion platform based on YARN
Azkaban     Workflow job scheduling and management system for Hadoop
Flume       Reliable, distributed and available service that streams logs into HDFS
Knox        Authentication and access gateway service for Hadoop
HBase       Distributed non-relational database that runs on top of HDFS
Hive        Data warehouse system based on Hadoop
Mahout      Implementation of machine learning algorithms (clustering, classification and batch-based
            collaborative filtering) based on MapReduce
Impala      Enables low-latency SQL queries on HBase and HDFS
Oozie       Workflow job scheduling and management system for Hadoop
Ranger      Access policy manager for HDFS files, folders, databases, tables and columns
Spark       Cluster computing framework that utilizes YARN and HDFS. Supports streaming and batch jobs. Has an
            SQL-like interface and a machine learning library
Sqoop       CLI application for migrating data between an RDBMS and Hadoop
Tez         Application framework for running a complex Directed Acyclic Graph (DAG) of tasks, based on YARN
Pig         High-level platform (and script-like language) to create and run programs on MapReduce, Tez and Spark
ZooKeeper   Distributed name registry, synchronization service and configuration service that is used as a
            sub-system in Hadoop

Major Hadoop Cloud Providers

Cloud operator        Service name
Amazon Web Services   EMR (Elastic MapReduce)
IBM SoftLayer         IBM BigInsights
Microsoft Azure       HDInsight

Common Data Formats

Format         Description
Avro           Row-based serialization format with schemas defined in JSON; includes RPC support. Designed for
               systems that exchange data
Parquet        Columnar storage format
ORC            Fast columnar storage format
RCFile         Data placement format for relational tables
SequenceFile   Binary file format that stores records as key-value pairs of specific data types
Unstructured   Hadoop also supports various unstructured data formats

Single Node Installation

Requirement / Task                          Command
Java Installation / Check version           >java -version
Java Installation / Install                 >sudo apt-get -y update && sudo apt-get -y install default-jdk
Create User and Permissions / Create User   >useradd hadoop
                                            >passwd hadoop
                                            >mkdir /home/hadoop
                                            >chown -R hadoop:hadoop /home/hadoop
Create User and Permissions / Create keys   >su - hadoop
                                            >ssh-keygen -t rsa
                                            >cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
                                            >chmod 0600 ~/.ssh/authorized_keys
Install from tarball                        >wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
                                            >tar xzf hadoop-2.7.2.tar.gz
                                            >mv hadoop-2.7.2 hadoop
Environment / Env Vars                      Add the following lines to ~/.bashrc:
                                            export HADOOP_HOME=/home/hadoop/hadoop
                                            export HADOOP_INSTALL=$HADOOP_HOME
                                            export HADOOP_MAPRED_HOME=$HADOOP_HOME
                                            export HADOOP_COMMON_HOME=$HADOOP_HOME
                                            export HADOOP_HDFS_HOME=$HADOOP_HOME
                                            export YARN_HOME=$HADOOP_HOME
                                            export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
                                            export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
                                            Then reload the shell environment:
                                            >source ~/.bashrc
Environment / Set JAVA_HOME                 >vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
                                            export JAVA_HOME=/opt/jdk1.8.0_05/
                                            (adjust the path to your JDK location)
Configuration files / Edit if required      core-site.xml
                                            hdfs-site.xml
                                            mapred-site.xml
                                            yarn-site.xml
Format NameNode                             >hdfs namenode -format
Start System                                >cd $HADOOP_HOME/sbin/
                                            >start-dfs.sh
                                            >start-yarn.sh
Test System                                 >hdfs dfs -mkdir /user
                                            >hdfs dfs -mkdir /user/hadoop
                                            >hdfs dfs -put /var/log/httpd logs
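
After the start scripts have run, one quick sanity check is to compare the running Java processes against the daemons a single-node setup is normally expected to have. The following is a minimal sketch, not part of the original cheat sheet: missing_daemons is a hypothetical helper name, and the daemon list assumes that both start-dfs.sh and start-yarn.sh were run on the same node.

```shell
# Sketch: list the Hadoop daemons that are expected on a single node but absent
# from `jps` output. missing_daemons is a hypothetical helper name.
# It reads jps-style lines ("<pid> <ClassName>") on standard input.
missing_daemons() {
  local seen daemon
  seen="$(awk '{print $2}')"   # keep only the class-name column
  for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # -x matches whole lines, so NameNode does not match SecondaryNameNode
    printf '%s\n' "$seen" | grep -qx "$daemon" || echo "$daemon"
  done
}

# Usage on the node (jps ships with the JDK):
#   jps | missing_daemons
```

An empty result suggests that all five daemons are up; any printed name is a daemon whose log is worth checking.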

Multi-node Installation

Task                               Command
Configure hosts on each node       >vi /etc/hosts
                                   192.168.1.11 hadoop-master
                                   192.168.1.12 hadoop-slave-1
                                   192.168.1.13 hadoop-slave-2
Enable cross-node authentication   >su - hadoop
                                   >ssh-keygen -t rsa
                                   >ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
                                   >ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
                                   >ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
                                   >chmod 0600 ~/.ssh/authorized_keys
                                   >exit
Copy system                        >su - hadoop
                                   >cd /opt/hadoop
                                   >scp -r hadoop hadoop-slave-1:/opt/hadoop
                                   >scp -r hadoop hadoop-slave-2:/opt/hadoop
Configure Master                   >su - hadoop
                                   >cd /opt/hadoop/hadoop
                                   >vi etc/hadoop/masters
                                   //add your master node to the file:
                                   hadoop-master
                                   >vi etc/hadoop/slaves
                                   //add your slave nodes to the file, one hostname per line:
                                   hadoop-slave-1
                                   hadoop-slave-2
Format NameNode                    >su - hadoop
                                   >cd /opt/hadoop/hadoop
                                   >bin/hdfs namenode -format
Start system                       >sbin/start-all.sh

Note: the paths above follow the Hadoop 2.x layout. Hadoop 1.x keeps the masters and slaves files under conf/ and the
start scripts under bin/.

Backup HDFS Metadata

Task                                          Command
Stop the cluster                              >stop-all.sh
Perform cold backup of metadata directories   >cd /data/dfs/nn
                                              >tar -czvf /tmp/backup.tar.gz .
Start the cluster                             >start-all.sh
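
The cold backup above can be wrapped in a small function so that each archive gets a timestamped name instead of overwriting the previous one. This is a minimal sketch under stated assumptions: backup_nn_metadata is a hypothetical helper name, the default paths are illustrative (/data/dfs/nn is the metadata directory used above), and the cluster is assumed to be stopped before the function is called.

```shell
# Sketch: archive a (stopped) NameNode metadata directory into a timestamped tarball.
# backup_nn_metadata is a hypothetical helper; both default paths are assumptions.
backup_nn_metadata() {
  local src_dir="${1:-/data/dfs/nn}"   # metadata directory to back up
  local dest_dir="${2:-/tmp}"          # directory that receives the archive
  local stamp archive
  stamp="$(date +%Y%m%d-%H%M%S)"
  archive="${dest_dir}/nn-backup-${stamp}.tar.gz"

  if [ ! -d "$src_dir" ]; then
    echo "metadata directory not found: $src_dir" >&2
    return 1
  fi

  # -C switches into src_dir first, so the archive holds relative paths only
  tar -czf "$archive" -C "$src_dir" . || return 1
  echo "$archive"
}

# Usage (commented out so that sourcing this file has no side effects):
#   stop-all.sh && backup_nn_metadata /data/dfs/nn /tmp && start-all.sh
```

The function prints the path of the archive it wrote, which makes it easy to copy the file off the node afterwards.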

HDFS Basic Commands

Task                                               Command
List the content of the /data directory            >hdfs dfs -ls /data/
Upload a file from the local file system to HDFS   >hdfs dfs -put logs.csv /data/
Read the content of the file from HDFS             >hdfs dfs -cat /data/logs.csv
Change the permission of a file                    >hdfs dfs -chmod 744 /data/logs.csv
Set the replication factor of a file to 3          >hdfs dfs -setrep -w 3 /data/logs.csv
Check the size of the file                         >hdfs dfs -du -h /data/logs.csv
Move the file to the newly-created subdirectory    >hdfs dfs -mv logs.csv logs/
Remove directory from HDFS                         >hdfs dfs -rm -r logs

HDFS Administration

Task                                       Command
Balance the cluster storage                >hdfs balancer -threshold <percentage>
Run the NameNode                           >hdfs namenode
Run the secondary NameNode                 >hdfs secondarynamenode
Run a DataNode                             >hdfs datanode
Run the NFS3 gateway                       >hdfs nfs3
Run the RPC portmap for the NFS3 gateway   >hdfs portmap

YARN

Task                               Command
Show yarn help                     >yarn
Define configuration directory     >yarn [--config confdir]
Define log level                   >yarn [--loglevel loglevel]
                                   where loglevel is FATAL, ERROR, WARN, INFO, DEBUG or TRACE
User commands
Show Hadoop classpath              >yarn classpath
Show and kill application          >yarn application
Show application attempt           >yarn applicationattempt
Show container information         >yarn container
Show node information              >yarn node
Show queue information             >yarn queue
Administration commands
Start NodeManager                  >yarn nodemanager
Start Proxy web server             >yarn proxyserver
Start ResourceManager              >yarn resourcemanager
Run ResourceManager admin client   >yarn rmadmin
Start Shared Cache Manager         >yarn sharedcachemanager
Start Timeline Server              >yarn timelineserver

MapReduce

Task                                                Command
Submit the WordCount MapReduce job to the cluster   >hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output
Check the output of this job in HDFS                >hadoop fs -cat logs-output/*
Submit a Scalding job                               >hadoop jar scalding.jar com.twitter.scalding.Tool Scalding
Kill a MapReduce job                                >yarn application -kill <application ID>

Resource Manager UI

Resource           Default URI
NameNode           http://<host>:50070/
DataNode           http://<host>:50075/
Sec NameNode       http://<host>:50090/
Resource Manager   http://<host>:8088
HBase Master       http://<host>:60010
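
The default ports above can be turned into a quick reachability check. The following is a minimal sketch: ui_url and check_ui are hypothetical helper names, the service keywords are made up for this example, and the ports are the defaults listed above, so a cluster with customized ports would need different values.

```shell
# Sketch: build the default web UI address for a daemon on a given host.
# ui_url and check_ui are hypothetical helpers; ports are the defaults listed above.
ui_url() {
  local service="$1" host="$2" port
  case "$service" in
    namenode)          port=50070 ;;
    datanode)          port=50075 ;;
    secondarynamenode) port=50090 ;;
    resourcemanager)   port=8088  ;;
    hbase-master)      port=60010 ;;
    *) echo "unknown service: $service" >&2; return 1 ;;
  esac
  echo "http://${host}:${port}/"
}

# Report whether the UI answers within 5 seconds; -f makes curl fail on HTTP errors.
check_ui() {
  local url
  url="$(ui_url "$1" "$2")" || return 1
  if curl -sf -o /dev/null --max-time 5 "$url"; then
    echo "UP   $url"
  else
    echo "DOWN $url"
  fi
}

# Usage (hadoop-master is the host name from the multi-node example):
#   check_ui namenode hadoop-master
```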

Secure Hadoop

Aspect            Best Practice
Authentication    Define users
                  Enable Kerberos in Hadoop
                  Set up a Knox gateway to control access and authentication to the HDFS cluster
                  Integrate with the organization's SSO and LDAP
Authorization     Define groups
                  Define HDFS permissions
                  Define HDFS ACLs
                  Enable Ranger policies to control access to HDFS files, folders, databases, tables and columns
Audit             Enable process execution audit trail
Data Protection   Wire encryption with Knox or Hadoop

Hadoop Tips and Tricks

Project Concept
  Iterate cluster sizing to optimize performance and meet actual load patterns
Hardware
  Clusters with more nodes recover faster
  The higher the storage per node, the longer the recovery time
  Use commodity hardware:
    Use large slow disks (SATA) without RAID (3-6TB disks)
    Use as much RAM as is cost-effective (96-192GB RAM)
    Use mainstream CPU with as many cores as possible (8-12 cores)
  Invest in reliable hardware for the NameNodes
  NameNode RAM should be 2GB + 1GB for every 100TB of raw disk space
  Networking cost should be 20% of the hardware budget
  40 nodes is the critical mass to achieve the best performance/cost ratio
  Plan for a net storage capacity of 25% of raw storage capacity. This leaves 25% spare capacity and allows
  for 3 replicas (75% / 3 = 25%)
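
The two sizing rules of thumb above (NameNode RAM and net storage capacity) are simple arithmetic and can be sketched as shell functions. The function names are hypothetical, and rounding a partial 100TB block up to a full gigabyte is an assumption of this sketch, not something the rule itself specifies.

```shell
# Sketch: apply the sizing rules of thumb above. Function names are hypothetical.

# NameNode RAM in GB: 2 GB base + 1 GB per 100 TB of raw disk.
# Assumption: a partial 100 TB block is rounded up to a full GB.
namenode_ram_gb() {
  local raw_tb="$1"
  echo $(( 2 + ($raw_tb + 99) / 100 ))
}

# Net usable capacity in TB: keep 25% spare, then divide by 3 replicas,
# which is 25% of raw capacity (integer division, rounds down).
net_capacity_tb() {
  local raw_tb="$1"
  echo $(( $raw_tb * 75 / 100 / 3 ))
}
```

For a cluster with 400TB of raw disk, namenode_ram_gb 400 prints 6 and net_capacity_tb 400 prints 100.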
Operating System and JVM
  Must be 64-bit
  Set the file descriptor limit to 64K (ulimit)
  Enable time synchronization using NTP
  Speed up reads by mounting disks with NOATIME
  Disable hugepages
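
Two of the settings above, the file descriptor limit and the noatime mount option, are easy to verify on a Linux node. This is a minimal sketch: fd_limit_ok and has_noatime are hypothetical helper names, and /data in the usage lines is an assumed mount point for a data disk.

```shell
# Sketch: check two of the OS settings above. Helper names are hypothetical.

# Succeeds when the given open-file limit is at least 64K (65536) or unlimited.
fd_limit_ok() {
  local limit="$1"
  [ "$limit" = "unlimited" ] && return 0
  [ "$limit" -ge 65536 ]
}

# Succeeds when a comma-separated mount option string contains noatime.
has_noatime() {
  case ",$1," in
    *,noatime,*) return 0 ;;
    *)           return 1 ;;
  esac
}

# Example wiring (commented out so that sourcing this file has no side effects;
# /data is an assumed mount point):
#   fd_limit_ok "$(ulimit -n)" || echo "raise the open-file limit to 65536"
#   has_noatime "$(findmnt -no OPTIONS /data)" || echo "/data is not mounted with noatime"
```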
System
  Enable monitoring using Ambari
  Monitor the checkpoints of the NameNodes to verify that they occur at the correct times. This will enable you to
  recover your cluster when needed
  Avoid reaching 90% cluster disk utilization
  Balance the cluster periodically using balancer
  Edit metadata files using Hadoop utilities only, to avoid corruption
  Keep replication >= 3
  Place quotas and limits on users and project directories, as well as on tasks, to avoid cluster starvation
  Clean /tmp regularly; it tends to fill up with junk files
  Optimize the number of reducers to avoid system starvation
  Verify that the file system you selected is supported by your Hadoop vendor
Data and System Recovery
  Disk failure is not an issue
  DataNode failure is not a major issue
  NameNode failure is an issue even in a clustered environment
  Make regular backups of NameNode metadata
  Enable NameNode clustering using ZooKeeper
  Provide sufficient disk space for NameNode logging
  Enable trash in core-site.xml to avoid accidental permanent deletion (rm -r)
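
Trash is controlled by the fs.trash.interval property in core-site.xml, which gives the number of minutes a deleted file is kept before it is purged (0 disables trash). The following sketch prints the property element to paste inside the configuration element of core-site.xml. trash_property_xml is a hypothetical helper name, and the 1440-minute (one day) default is only an example value.

```shell
# Sketch: print the core-site.xml property that enables HDFS trash.
# trash_property_xml is a hypothetical helper. fs.trash.interval is in minutes;
# 0 disables trash. The 1440 default (one day) is an assumed example value.
trash_property_xml() {
  local minutes="${1:-1440}"
  cat <<EOF
<property>
  <name>fs.trash.interval</name>
  <value>${minutes}</value>
</property>
EOF
}

# After editing core-site.xml, the effective value can be read back with:
#   hdfs getconf -confKey fs.trash.interval
```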
© Copyright - Jethro Data