
Hadoop Configuration

There are a handful of files for controlling the configuration of a Hadoop installation; the most important ones are
listed in Table 9-1. This section covers MapReduce 1, which employs the jobtracker and tasktracker daemons.
Running MapReduce 2 is substantially different, and is covered in YARN Configuration on page 318.
Table 9-1. Hadoop configuration files
hadoop-env.sh (Bash script)
  Environment variables that are used in the scripts to run Hadoop.

core-site.xml (Hadoop configuration XML)
  Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce.

hdfs-site.xml (Hadoop configuration XML)
  Configuration settings for HDFS daemons: the namenode, the secondary namenode, and the datanodes.

mapred-site.xml (Hadoop configuration XML)
  Configuration settings for MapReduce daemons: the jobtracker and the tasktrackers.

masters (Plain text)
  A list of machines (one per line) that each run a secondary namenode.

slaves (Plain text)
  A list of machines (one per line) that each run a datanode and a tasktracker.

hadoop-metrics.properties (Java Properties)
  Properties for controlling how metrics are published in Hadoop (see Metrics on page 350).

log4j.properties (Java Properties)
  Properties for system logfiles, the namenode audit log, and the task log for the tasktracker child process (Hadoop Logs on page 173).
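
For example, minimal masters and slaves files are nothing more than lists of hostnames, one per line (the hostnames below are placeholders, not taken from the text above):

masters:
secondarynamenode1

slaves:
worker1
worker2
worker3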

These files are all found in the conf directory of the Hadoop distribution. The configuration directory can be relocated to another part of the filesystem (outside the Hadoop installation), as long as daemons are started with the --config option (or, equivalently, with the HADOOP_CONF_DIR environment variable set) specifying the location of this directory on the local filesystem.
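
As a sketch of what that looks like in practice (the path /opt/hadoop/conf is just a placeholder), you can either pass the --config option to the start scripts:

% start-dfs.sh --config /opt/hadoop/conf
% start-mapred.sh --config /opt/hadoop/conf

or, equivalently, set the environment variable before starting the daemons:

% export HADOOP_CONF_DIR=/opt/hadoop/conf
% start-all.sh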

YARN Configuration
YARN is the next-generation architecture for running MapReduce (and is described in YARN (MapReduce 2) on
page 194). It has a different set of daemons and configuration options to classic MapReduce (also called
MapReduce 1), and in this section we shall look at these differences and how to run MapReduce on YARN.
Under YARN you no longer run a jobtracker or tasktrackers. Instead, there is a single resource manager running on
the same machine as the HDFS namenode (for small clusters) or on a dedicated machine, and node managers
running on each worker node in the cluster.
The YARN start-all.sh script (in the bin directory) starts the YARN daemons in the cluster. This script will start a
resource manager (on the machine the script is run on), and a node manager on each machine listed in the slaves
file.
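For example (a minimal sketch, assuming the script and slaves file described above), you would run the script on the resource manager machine and then check which daemons are running with the JDK's jps tool:

% start-all.sh
% jps

The jps listing on this machine should now include a ResourceManager process, and on each worker node it should include a NodeManager process.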
YARN also has a job history server daemon that provides users with details of past job runs, and a web app proxy
server for providing a secure way for users to access the UI provided by YARN applications. In the case of
MapReduce, the web UI served by the proxy provides information about the current job you are running, similar to
the one described in The MapReduce Web UI on page 164. By default the web app proxy server runs in the same
process as the resource manager, but it may be configured to run as a standalone daemon.
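
As an illustrative sketch only (the yarn.web-proxy.address property name and the hostname and port are assumptions, not taken from the text above), giving the proxy its own address in yarn-site.xml is how you would configure it to run as a standalone daemon:

<!-- yarn-site.xml: assumed property for a standalone web app proxy -->
<property>
  <name>yarn.web-proxy.address</name>
  <value>proxyhost:8089</value> <!-- placeholder hostname and port -->
</property>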

YARN has its own set of configuration files, listed in Table 9-8; these are used in addition to those in Table 9-1.
Table 9-8. YARN configuration files
yarn-env.sh (Bash script)
  Environment variables that are used in the scripts to run YARN.

yarn-site.xml (Hadoop configuration XML)
  Configuration settings for YARN daemons: the resource manager, the job history server, the web app proxy server, and the node managers.
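
A minimal yarn-env.sh sketch (the paths are placeholders, and YARN_LOG_DIR and YARN_RESOURCEMANAGER_HEAPSIZE are assumed variable names rather than ones mentioned above):

# yarn-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-sun       # placeholder path to the JDK
export YARN_LOG_DIR=/var/log/hadoop-yarn       # assumed: where YARN daemon logs are written
export YARN_RESOURCEMANAGER_HEAPSIZE=1000      # assumed: resource manager heap size, in MB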

Important YARN Daemon Properties


When running MapReduce on YARN, the mapred-site.xml file is still used for general MapReduce properties, although the jobtracker- and tasktracker-related properties are not used. None of the properties in Table 9-4 are applicable to YARN, except for mapred.child.java.opts (and the related properties mapreduce.map.java.opts and mapreduce.reduce.java.opts, which apply only to map or reduce tasks, respectively). The JVM options specified in this way are used to launch the YARN child process that runs map or reduce tasks.
The configuration files in Example 9-4 show some of the important configuration properties for running
MapReduce on YARN.
Example 9-4. An example set of site configuration files for running MapReduce on YARN
<?xml version="1.0"?>
<!-- mapred-site.xml -->
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx400m</value>
    <!-- Not marked as final so jobs can include JVM debugging options -->
  </property>
</configuration>

<?xml version="1.0"?>
<!-- yarn-site.xml -->
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>resourcemanager:8040</value>
  </property>
  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/disk1/nm-local-dir,/disk2/nm-local-dir</value>
    <final>true</final>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>8192</value>
  </property>
</configuration>

The YARN resource manager address is controlled via yarn.resourcemanager.address, which takes the form of a host-port pair. In a client configuration this property is used to connect to the resource manager (using RPC); in addition, the mapreduce.framework.name property must be set to yarn for the client to use YARN rather than the local job runner.
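A sketch of the corresponding client-side settings (the hostname resourcemanager is carried over from Example 9-4 and is a placeholder):

<!-- mapred-site.xml on the client -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<!-- yarn-site.xml on the client -->
<property>
  <name>yarn.resourcemanager.address</name>
  <value>resourcemanager:8040</value>
</property>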
Although YARN does not honor mapred.local.dir, it has an equivalent property called yarn.nodemanager.local-dirs,
which allows you to specify which local disks to store intermediate data on. It is specified by a comma-separated
list of local directory paths, which are used in a round-robin fashion.
YARN doesn't have tasktrackers to serve map outputs to reduce tasks, so for this function it relies on shuffle handlers, which are long-running auxiliary services running in node managers. Since YARN is a general-purpose service, the shuffle handlers need to be explicitly enabled in yarn-site.xml by setting the yarn.nodemanager.aux-services property to mapreduce.shuffle.
Table 9-9 summarizes the important configuration properties for YARN.
Table 9-9. Important YARN daemon properties
yarn.resourcemanager.address
  Type: hostname and port
  Default value: 0.0.0.0:8040
  Description: The hostname and port that the resource manager's RPC server runs on.

yarn.nodemanager.local-dirs
  Type: comma-separated directory names
  Default value: /tmp/nm-local-dir
  Description: A list of directories where node managers allow containers to store intermediate data. The data is cleared out when the application ends.

yarn.nodemanager.aux-services
  Type: comma-separated service names
  Default value: (none)
  Description: A list of auxiliary services run by the node manager. A service is implemented by the class defined by the property yarn.nodemanager.aux-services.service-name.class. By default no auxiliary services are specified.

yarn.nodemanager.resource.memory-mb
  Type: int
  Default value: 8192
  Description: The amount of physical memory (in MB) which may be allocated to containers being run by the node manager.

yarn.nodemanager.vmem-pmem-ratio
  Type: float
  Default value: 2.1
  Description: The ratio of virtual to physical memory for containers. Virtual memory usage may exceed the allocation by this amount.
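
For example, to give containers more virtual memory headroom you could override the ratio in yarn-site.xml (the value 3.0 is purely illustrative, not a recommendation):

<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>3.0</value>
</property>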
