Apache Hadoop is an open-source software framework for storage and large-scale processing of data sets on clusters of commodity hardware.

Hadoop is an Apache top-level project being built and used by a global community of contributors and users. It is licensed under the Apache License 2.0. The Apache Hadoop framework is composed of the following modules:
Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
Hadoop Distributed File System (HDFS): a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
Hadoop MapReduce: a programming model for large-scale data processing.
Hadoop Cluster
A Hadoop cluster is a special type of computational cluster designed specifically for storing and analyzing huge amounts of unstructured data in a distributed computing environment.
Such clusters run Hadoop's open source distributed processing software on low-cost commodity computers. Typically one machine in the cluster is designated as the NameNode and another machine as the JobTracker; these are the masters. The rest of the machines in the cluster act as both DataNode and TaskTracker; these are the slaves. Hadoop clusters are often referred to as "shared nothing" systems because the only thing that is shared between nodes is the network that connects them.
Hadoop clusters are known for boosting the speed of data analysis applications. They also are highly scalable: if a cluster's processing power is overwhelmed by growing volumes of data, additional cluster nodes can be added to increase throughput. Hadoop clusters also are highly resistant to failure because each piece of data is copied onto other cluster nodes, which ensures that the data is not lost if one node fails.
The smallest cluster can be deployed using a minimum of four machines (for evaluation purposes only). One machine can be used to deploy all the master processes (NameNode, Secondary NameNode, JobTracker, HBase Master), the Hive Metastore, the WebHCat Server, and the ZooKeeper nodes. The other three machines can be used to co-deploy the slave nodes (TaskTrackers, DataNodes, and RegionServers).
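
As a rough sketch of how a client program addresses such a cluster, the following Java snippet obtains a handle to the file system served by the NameNode and lists the root directory. The address hdfs://namenode:9000 is a hypothetical placeholder for the NameNode host and port, not a value prescribed by Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ClusterClient {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode (hypothetical host and port).
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");

        // Obtain a handle to the distributed file system served by the NameNode.
        FileSystem fs = FileSystem.get(conf);

        // List the root directory to confirm the cluster is reachable.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}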

HDFS (Hadoop Distributed File System)
The Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. A Hadoop cluster nominally has a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode due to its criticality. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure calls (RPC) to communicate with each other.

HDFS stores large files (typically in the range of gigabytes to terabytes) across
multiple machines. It achieves reliability by replicating the data across multiple
hosts, and hence theoretically does not require RAID storage on hosts.
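
A minimal sketch of how this replication is controlled from client code is shown below; the local and HDFS paths are hypothetical, and dfs.replication is the standard property that sets the default replication factor for newly written files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for new files (3 is the usual HDFS default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; its blocks are replicated across DataNodes.
        Path local = new Path("/tmp/sample.log");    // hypothetical local file
        Path remote = new Path("/data/sample.log");  // hypothetical HDFS path
        fs.copyFromLocalFile(local, remote);

        // Raise the replication factor of the existing file to 4.
        fs.setReplication(remote, (short) 4);
        fs.close();
    }
}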

The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Usage of the highly portable Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.
The File System Namespace
HDFS supports a traditional hierarchical file organization. A user or an
application can create directories and store files inside these directories. The
file system namespace hierarchy is similar to most other existing file systems;
one can create and remove files, move a file from one directory to another, or
rename a file. HDFS does not yet implement user quotas or access permissions.

HDFS does not support hard links or soft links. However, the HDFS architecture
does not preclude implementing these features.
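The namespace operations described above map directly onto the HDFS client API. The following Java sketch creates a directory, writes a small file, renames (moves) it, and removes a directory; all paths are hypothetical examples, and the configuration is assumed to point at an existing cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Create a directory hierarchy (hypothetical paths).
        fs.mkdirs(new Path("/user/demo/input"));

        // Create a file and write a line into it.
        FSDataOutputStream out = fs.create(new Path("/user/demo/input/notes.txt"));
        out.writeBytes("hello hdfs\n");
        out.close();

        // Rename (move) the file into another directory.
        fs.rename(new Path("/user/demo/input/notes.txt"),
                  new Path("/user/demo/notes.txt"));

        // Remove the now-empty directory (second argument: recursive delete).
        fs.delete(new Path("/user/demo/input"), true);
        fs.close();
    }
}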
MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.[1][2]
A MapReduce program is composed of a Map() procedure that performs
filtering and sorting (such as sorting students by first name into queues, one
queue for each name) and a Reduce() procedure that performs a summary
operation (such as counting the number of students in each queue, yielding
name frequencies). The "MapReduce System" (also called "infrastructure" or
"framework") orchestrates the processing by marshalling the distributed
servers, running the various tasks in parallel, managing all communications and
data transfers between the various parts of the system, and providing for
redundancy and fault tolerance.
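
In the spirit of the student example above, the following is a hedged sketch of a Hadoop MapReduce job in Java that counts how many students share each first name. It assumes plain-text input in which each line begins with a student's first name (a hypothetical record layout); the class name and the input/output path arguments are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NameFrequency {

    // Map(): emit (firstName, 1) for every student record "firstName lastName ...".
    public static class NameMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text firstName = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().trim().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                firstName.set(fields[0]);
                context.write(firstName, ONE);
            }
        }
    }

    // Reduce(): sum the counts for each first name, yielding name frequencies.
    public static class NameReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "name frequency");
        job.setJarByClass(NameFrequency.class);
        job.setMapperClass(NameMapper.class);
        job.setCombinerClass(NameReducer.class);
        job.setReducerClass(NameReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A job like this would typically be packaged into a jar and submitted to the cluster with the hadoop jar command, passing the input and output directories as arguments.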
The model is inspired by the map and reduce functions commonly used in
functional programming, although their purpose in the MapReduce framework
is not the same as in their original forms. The key contributions of the
MapReduce framework are not the actual map and reduce functions, but the
scalability and fault-tolerance achieved for a variety of applications by
optimizing the execution engine once. As such, a single-threaded implementation of MapReduce will usually not be faster than a traditional implementation; the model becomes beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and the fault-tolerance features of the MapReduce framework come into play.
