4/2/2018 Hadoop Deployment Cheat Sheet | Jethro
https://jethro.io/hadoop-deployment-cheat-sheet

Hadoop Deployment Cheat Sheet


Introduction
If you are using, or planning to use, the Hadoop framework for big data and Business Intelligence (BI), this
document can help you navigate some of the technology and terminology, and guide you in setting up and
configuring the system.

In this document we provide some background information about the framework, the key distributions,
modules, components, and related products. We also provide you with single and multi-node Hadoop
installation commands and configuration parameters.

The final section includes some tips and tricks to help you get started, and provides guidance in setting up a
Hadoop project.

Contents
Hadoop Distributions
Hadoop Modules
Hadoop Components
Hadoop Ecosystem
Major Hadoop Cloud Providers
Single Node Installation
Multi-node Installation
Backup HDFS Metadata
HDFS Basic Commands
HDFS Administration
YARN
MapReduce
Resource Manager UI
Secure Hadoop
Common Data Formats
Hadoop Tips and Tricks

Key Hadoop Distributions


Vendor          Strength
Apache Hadoop   The open source distribution from Apache
Hortonworks     A leading vendor committed to a 100% open source package
Cloudera        Hadoop filesystem w/ proprietary components for enterprise needs
MapR            Uses its own proprietary file system
IBM             Integration w/ IBM analytics products
Pivotal         Integration w/ Greenplum and Cloud Foundry (CF)

Hadoop Modules
Module      Description
Common      Common utilities. Supports other Hadoop modules
HDFS        Hadoop Distributed File System: provides high-throughput access to application data based on
            commodity hardware
YARN        Yet Another Resource Negotiator: a framework for cluster resource management including job
            scheduling
MapReduce   Software framework for parallel processing of large data sets based on YARN

Hadoop Components
Component / Module          Description
NameNode / HDFS             The directory tree of the Hadoop HDFS file system (a.k.a. Hadoop inode)
Secondary NameNode / HDFS   Checkpointing helper for the NameNode (it is not a high-availability standby). It
                            provides checkpoints of the namespace by merging the edits file into the fsimage file
JournalNode / HDFS          Node that stores the shared edit log used to keep the active and standby NameNodes in
                            sync, which supports failover between NameNodes
DataNode / HDFS             Nodes (or servers) that store the actual data
NFS3 Gateway / HDFS         Daemons that enable NFS3 support
ResourceManager / YARN      Global daemon that arbitrates resources among all the applications in the Hadoop cluster
ApplicationMaster / YARN    Takes care of a single application: gets resources for it from the ResourceManager and
                            works with the NodeManager to consume them and monitor the tasks
NodeManager / YARN          Single-machine agent that is responsible for the containers, for allocating and
                            monitoring resource usage such as CPU and disk, and for reporting back to the
                            ResourceManager
Container / YARN            Runs specific tasks on a specific machine for a specific application, based on
                            allocated resources

Hadoop Ecosystem - Related Products

Product     Description
Ambari      A completely open-source management platform for provisioning, managing, monitoring and securing
            Apache Hadoop clusters
Apex        Big data in motion platform based on YARN
Azkaban     Workflow job scheduling and management system for Hadoop
Flume       Reliable, distributed and available service that streams logs into HDFS
Knox        Authentication and access gateway service for Hadoop
HBase       Distributed non-relational database that runs on top of HDFS
Hive        Data warehouse system based on Hadoop
Mahout      Machine learning algorithm implementations (clustering, classification and batch-based
            collaborative filtering) based on MapReduce
Impala      Enables low-latency SQL queries on HBase and HDFS
Oozie       Workflow job scheduling and management system for Hadoop
Ranger      Access policy manager for HDFS files, folders, databases, tables and columns
Spark       Cluster computing framework that utilizes YARN and HDFS. Supports streaming and batch jobs. Has an
            SQL-like interface and a machine learning library
Sqoop       Data migration application between RDBMS and Hadoop using CLI
Tez         Application framework for running complex Directed Acyclic Graphs (DAGs) of tasks based on YARN
Pig         High-level platform (and script-like language) to create and run programs on MapReduce, Tez and Spark
ZooKeeper   Distributed name registry, synchronization service and configuration service that is used as a
            sub-system in Hadoop

Major Hadoop Cloud Providers


Cloud operator        Service name
Amazon Web Services   EMR (Elastic MapReduce)
IBM SoftLayer         IBM BigInsights
Microsoft Azure       HDInsight

Common Data Formats

Format         Description
Avro           Row-oriented binary serialization format whose schemas are defined in JSON. Includes RPC support.
               Designed for systems that exchange data.
Parquet        Columnar storage format
ORC            Fast columnar storage format
RCFile         Data placement format for relational tables
SequenceFile   Binary data format with a record of specific data types
Unstructured   Hadoop also supports various unstructured data formats

Single Node Installation


Requirement / Task    Command

Java Installation / Check version
>java -version

Java Installation / Install
>sudo apt-get -y update && sudo apt-get -y install default-jdk

Create User and Permissions / Create User
>useradd hadoop
>passwd hadoop
>mkdir /home/hadoop
>chown -R hadoop:hadoop /home/hadoop

Create User and Permissions / Create keys
>su - hadoop
>ssh-keygen -t rsa && cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 0600 ~/.ssh/authorized_keys

Install from release tarball
>wget http://apache.spd.co.il/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz && tar xzf hadoop-2.7.2.tar.gz && mv hadoop-2.7.2 hadoop

Environment / Env Vars
//add the following lines to ~/.bashrc so that they persist, then reload it:
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
>source ~/.bashrc

Environment / Set JAVA_HOME
>vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
//set JAVA_HOME to the location of your JDK, for example:
export JAVA_HOME=/opt/jdk1.8.0_05/

Configuration files / Edit if required
core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml

Format NameNode
>hdfs namenode -format

Start System
>cd $HADOOP_HOME/sbin/
>start-dfs.sh
>start-yarn.sh

Test System
//hdfs is on the PATH set above, so no bin/ prefix is needed:
>hdfs dfs -mkdir /user
>hdfs dfs -mkdir /user/hadoop
>hdfs dfs -put /var/log/httpd logs

Multi-node Installation
Task    Command

Configure hosts on each node
>vi /etc/hosts
192.168.1.11 hadoop-master
192.168.1.12 hadoop-slave-1
192.168.1.13 hadoop-slave-2

Enable cross node authentication
>su - hadoop
>ssh-keygen -t rsa
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-1
>ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-2
>chmod 0600 ~/.ssh/authorized_keys
>exit

Copy system
>su - hadoop
>cd /opt/hadoop
>scp -r hadoop hadoop-slave-1:/opt/hadoop
>scp -r hadoop hadoop-slave-2:/opt/hadoop

Configure Master
//the paths in this and the following steps follow the Hadoop 1.x layout; in Hadoop 2.x the slaves file
//is etc/hadoop/slaves and the start scripts are in sbin/
>su - hadoop
>cd /opt/hadoop/hadoop
>vi conf/masters
//add your master node to the file:
hadoop-master
>vi conf/slaves
//add your slave nodes to the file, one hostname per line:
hadoop-slave-1
hadoop-slave-2

Format NameNode
>su - hadoop
>cd /opt/hadoop/hadoop
>bin/hadoop namenode -format

Start system
>bin/start-all.sh

Backup HDFS Metadata

Task                                              Command
Stop the cluster                                  >stop-all.sh
Perform cold backup of the metadata directories   >cd /data/dfs/nn
                                                  >tar -czvf /tmp/backup.tar.gz .
Start the cluster                                 >start-all.sh
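
The cold backup step can be wrapped in a small helper so that each run produces a separate archive. The following is a minimal sketch, not part of the original cheat sheet: the function name backup_nn_metadata and the archive naming scheme are assumptions, and it should only be run while the cluster is stopped, as described above.

```shell
#!/bin/sh
# Archive a NameNode metadata directory into a timestamped, gzip-compressed tarball.
# Arguments: $1 = metadata directory (e.g. /data/dfs/nn), $2 = destination directory.
# Prints the path of the archive it created. (Hypothetical helper, for illustration.)
backup_nn_metadata() {
    src_dir=$1
    dest_dir=$2
    [ -d "$src_dir" ] || { echo "no such directory: $src_dir" >&2; return 1; }
    archive="$dest_dir/nn-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C switches into the source directory so the archive holds relative paths.
    tar -czf "$archive" -C "$src_dir" . || return 1
    echo "$archive"
}
```

A call such as `backup_nn_metadata /data/dfs/nn /tmp` would write the archive under /tmp and print its path. Keeping the destination outside the metadata directory avoids archiving the backup into itself.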

HDFS Basic Commands


Task                                               Command
List the content of the /data directory            >hdfs dfs -ls /data/
Upload a file from the local file system to HDFS   >hdfs dfs -put logs.csv /data/
Read the content of the file from HDFS             >hdfs dfs -cat /data/logs.csv
Change the permission of a file                    >hdfs dfs -chmod 744 /data/logs.csv
Set the replication factor of a file to 3          >hdfs dfs -setrep -w 3 /data/logs.csv
Check the size of the file                         >hdfs dfs -du -h /data/logs.csv
Move the file to the newly-created subdirectory    >hdfs dfs -mv logs.csv logs/
Remove directory from HDFS                         >hdfs dfs -rm -r logs

HDFS Administration
Task Command

Balance the cluster storage >hdfs balancer -threshold <percentage>

Run the NameNode >hdfs namenode

Run the secondary NameNode >hdfs secondarynamenode

Run a datanode >hdfs datanode

Run the NFS3 gateway >hdfs nfs3

Run the RPC portmap for the NFS3 gateway >hdfs portmap

YARN
Task                               Command
Show yarn help                     >yarn
Define configuration directory     >yarn [--config confdir]
Define log level                   >yarn [--loglevel loglevel] where loglevel is FATAL, ERROR, WARN, INFO, DEBUG or TRACE

User commands
Show Hadoop classpath              >yarn classpath
Show and kill application          >yarn application
Show application attempt           >yarn applicationattempt
Show container information         >yarn container
Show node information              >yarn node
Show queue information             >yarn queue

Administration commands
Start NodeManager                  >yarn nodemanager
Start Proxy web server             >yarn proxyserver
Start ResourceManager              >yarn resourcemanager
Run ResourceManager admin client   >yarn rmadmin
Start Shared Cache Manager         >yarn sharedcachemanager
Start Timeline Server              >yarn timelineserver

MapReduce
Submit the WordCount MapReduce job to the cluster   >hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount input logs-output
Check the output of this job in HDFS                >hadoop fs -cat logs-output/*
Submit a scalding job                               >hadoop jar scalding.jar com.twitter.scalding.Tool Scalding
Kill a MapReduce job                                >yarn application -kill <application_id>

Resource Manager UI
Resource           Default URI
NameNode           http://<host>:50070/
DataNode           http://<host>:50075/
Sec NameNode       http://<host>:50090/
Resource Manager   http://<host>:8088
HBase Master       http://<host>:60010
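
When scripting health checks it can help to derive these addresses from a daemon name. The sketch below is illustrative only: the function name hadoop_ui_url and the daemon keywords are assumptions, and the ports are simply the defaults listed in the table above, which newer releases may change.

```shell
#!/bin/sh
# Build the default web UI address for a daemon, using the ports from the table above.
# Arguments: $1 = daemon keyword (hypothetical names), $2 = host name.
hadoop_ui_url() {
    case $1 in
        namenode)          port=50070 ;;
        datanode)          port=50075 ;;
        secondarynamenode) port=50090 ;;
        resourcemanager)   port=8088 ;;
        hbase-master)      port=60010 ;;
        *) echo "unknown daemon: $1" >&2; return 1 ;;
    esac
    echo "http://$2:$port/"
}
```

For example, `hadoop_ui_url namenode hadoop-master` prints http://hadoop-master:50070/. If your cluster overrides the default ports in its configuration files, adjust the case branches accordingly.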

Secure Hadoop
Aspect            Best Practice
Authentication    Define users
                  Enable Kerberos in Hadoop
                  Set up Knox gateway to control access and authentication to the HDFS cluster
                  Integrate with the organization’s SSO and LDAP
Authorization     Define groups
                  Define HDFS permissions
                  Define HDFS ACLs
                  Enable Ranger policies to control access to HDFS files, directories, databases, tables and columns
Audit             Enable process execution audit trail
Data Protection   Wire encryption with Knox or Hadoop

Hadoop Tips and Tricks


Project Concept

Iterate cluster sizing to optimize performance and meet actual load patterns

Hardware

Clusters with more nodes recover faster

The higher the storage per node, the longer the recovery time

Use commodity hardware:


Use large slow disks (SATA) without RAID (3-6TB disks)
Use as much RAM as is cost-effective (96-192GB RAM)
Use mainstream CPU with as many cores as possible (8-12 cores)

Invest in reliable hardware for the NameNodes


NameNode RAM should be 2GB + 1GB for every 100TB raw disk space

Networking cost should be 20% of hardware budget

40 nodes is the critical mass to achieve best performance/cost ratio

Your actual net storage capacity should be 25% of raw storage capacity. This leaves 25% spare capacity, and allows
for 3 replicas
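
The two sizing rules in this list reduce to simple arithmetic. The sketch below works through them in shell; the function names are hypothetical, the rounding up of partial 100TB blocks is an assumption, and the results use integer arithmetic, so treat them as rough planning figures only.

```shell
#!/bin/sh
# NameNode RAM rule of thumb from the list above: 2 GB + 1 GB per 100 TB of raw disk.
# Argument: raw cluster disk space in TB. Partial blocks of 100 TB are rounded up
# (the rounding is an assumption, not stated in the rule).
namenode_ram_gb() {
    raw_tb=$1
    echo $(( 2 + (raw_tb + 99) / 100 ))
}

# Net (usable) capacity: keep 25% of raw space spare, then divide by the replica count.
# With 3 replicas this is 0.75 / 3 = 25% of raw capacity.
# Arguments: raw capacity in TB, replication factor (defaults to 3).
net_capacity_tb() {
    raw_tb=$1
    replicas=${2:-3}
    echo $(( raw_tb * 75 / 100 / replicas ))
}
```

For a cluster with 400TB of raw disk, `namenode_ram_gb 400` prints 6 and `net_capacity_tb 400` prints 100, which matches the 25% figure quoted above.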

Operating System and JVM

Must be 64-bit

Set file descriptor limit to 64K (ulimit)

Enable time synchronization using NTP

Speed up reads by mounting disks with NOATIME

Disable hugepages
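
The file descriptor recommendation can be checked from the shell before starting the daemons. This is a minimal sketch with a hypothetical function name, fd_limit_ok; it only inspects the soft limit of the current shell, and the mention of /etc/security/limits.conf assumes a Linux system that uses PAM limits.

```shell
#!/bin/sh
# Return success when a file descriptor limit meets the 64K recommendation.
# Argument: the limit to check, as reported by `ulimit -n` (a number or "unlimited").
fd_limit_ok() {
    limit=$1
    required=65536
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$required" ]
}

# Report on the current shell's soft limit for open files.
if fd_limit_ok "$(ulimit -n)"; then
    echo "open-file limit OK"
else
    echo "open-file limit below 65536: raise it (for example in /etc/security/limits.conf)"
fi
```

Run the check as the user that starts the Hadoop daemons, since limits are applied per user session.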

System

Enable monitoring using Ambari

Monitor the checkpoints of the NameNodes to verify that they occur at the correct times. This will enable you to
recover your cluster when needed

Avoid reaching 90% cluster disk utilization

Balance the cluster periodically using balancer

Edit metadata files using Hadoop utilities only, to avoid corruption

Keep replication >= 3

Place quotas and limits on users and project directories, as well as on tasks, to avoid cluster starvation

Clean /tmp regularly, as it tends to fill up with junk files

Optimize the number of reducers to avoid system starvation

Verify that the file system you selected is supported by your Hadoop vendor

Data and System Recovery

Disk failure is not an issue

DataNode failure is not a major issue

NameNode failure is an issue even in a clustered environment

Make regular backups of NameNode metadata

Enable NameNode clustering using ZooKeeper

Provide sufficient disk space for NameNode logging

Enable trash in core-site.xml to avoid accidental permanent deletion (rm -r)
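
Trash is controlled by the fs.trash.interval property, which sets how many minutes deleted files are kept before they are purged (0 disables the trash). The sketch below prints the property block to add to core-site.xml; the function name trash_property and the 1440-minute value are illustrative assumptions, not recommendations from the original text.

```shell
#!/bin/sh
# Print the core-site.xml property that enables the HDFS trash.
# Argument: retention time in minutes (fs.trash.interval); 0 disables the trash.
trash_property() {
    minutes=$1
    cat <<EOF
<property>
  <name>fs.trash.interval</name>
  <value>$minutes</value>
</property>
EOF
}

# Example: keep deleted files for one day (1440 minutes, an illustrative value).
trash_property 1440
```

Paste the printed block inside the configuration element of core-site.xml. Note that deleting with the -skipTrash option bypasses the trash even when it is enabled.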
