Hadoop is an
Apache top-level project being built and used by a global community of
contributors and users. It is licensed under the Apache License 2.0. The Apache
Hadoop framework is composed of the following modules:
Hadoop Common: contains libraries and utilities needed by other Hadoop
modules.
Hadoop Distributed File System (HDFS): a distributed file system that stores
data on commodity machines, providing very high aggregate bandwidth across
the cluster.
Hadoop YARN: a resource-management platform responsible for managing
compute resources in clusters and using them for scheduling users'
applications.
Hadoop MapReduce: a programming model for large-scale data processing.
Hadoop Cluster
A Hadoop cluster is a special type of computational cluster designed
specifically for storing and analyzing huge amounts of unstructured data in a
distributed computing environment.
Such clusters run Hadoop's open source distributed processing software on
low-cost commodity computers. Typically one machine in the cluster is
designated as the NameNode and another as the JobTracker; these are the
masters. The rest of the machines in the cluster act as both DataNode and
TaskTracker; these are the slaves. Hadoop clusters are often referred to as
"shared nothing" systems because the only thing shared between nodes is the
network that connects them.
Hadoop clusters are known for boosting the speed of data analysis
applications. They also are highly scalable: If a cluster's processing power is
overwhelmed by growing volumes of data, additional cluster nodes can be
added to increase throughput. Hadoop clusters also are highly resistant to
failure because each piece of data is copied onto other cluster nodes, which
ensures that the data is not lost if one node fails.
The smallest cluster can be deployed using a minimum of four machines (for
evaluation purposes only). One machine can be used to deploy all the master
processes (NameNode, Secondary NameNode, JobTracker, HBase Master),
Hive Metastore, WebHCat Server, and ZooKeeper nodes. The other three
machines can be used to co-deploy the slave nodes (TaskTrackers, DataNodes,
and RegionServers).
HDFS stores large files (typically in the range of gigabytes to terabytes) across
multiple machines. It achieves reliability by replicating the data across multiple
hosts, and hence theoretically does not require RAID storage on hosts.
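The degree of replication that provides this reliability is configurable per
cluster (and per file). As a minimal sketch, the standard dfs.replication
property in hdfs-site.xml controls the default number of replicas; three is
the usual default:

```xml
<!-- hdfs-site.xml fragment: each block is stored on three DataNodes. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```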
HDFS does not support hard links or soft links. However, the HDFS architecture
does not preclude implementing these features.
MapReduce
MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed algorithm
on a cluster.[1][2]
A MapReduce program is composed of a Map() procedure that performs
filtering and sorting (such as sorting students by first name into queues, one
queue for each name) and a Reduce() procedure that performs a summary
operation (such as counting the number of students in each queue, yielding
name frequencies). The "MapReduce System" (also called "infrastructure" or
"framework") orchestrates the processing by marshalling the distributed
servers, running the various tasks in parallel, managing all communications and
data transfers between the various parts of the system, and providing for
redundancy and fault tolerance.
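The student example above can be sketched in a single process. This is a
hypothetical illustration of the three phases only (map, shuffle, reduce);
a real MapReduce system runs them in parallel across distributed workers:

```python
from collections import defaultdict

def map_phase(students):
    # Map: emit a (first_name, 1) pair for each student, analogous to
    # sorting students into one queue per name.
    for name in students:
        yield (name, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key before reduction.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: summarize each queue, here by counting its entries.
    return {name: sum(counts) for name, counts in groups.items()}

students = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]
frequencies = reduce_phase(shuffle(map_phase(students)))
print(frequencies)  # {'Alice': 3, 'Bob': 2, 'Carol': 1}
```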
The model is inspired by the map and reduce functions commonly used in
functional programming, although their purpose in the MapReduce framework
is not the same as in their original forms. The key contributions of the
MapReduce framework are not the actual map and reduce functions, but the
scalability and fault-tolerance achieved for a variety of applications by
optimizing the execution engine once. As such, a single-threaded
implementation of MapReduce will usually not be faster than a traditional
implementation. Only when the optimized distributed shuffle operation (which
reduces network communication cost) and fault tolerance features of the
MapReduce framework come into play, is the use of this model beneficial.
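The functional origins the model is named after can be seen in miniature with
Python's built-in map and functools.reduce; this single-threaded version
computes the same name frequencies with none of the framework's distributed
shuffle or fault tolerance, which is exactly why it offers no speed advantage:

```python
from functools import reduce

students = ["Alice", "Bob", "Alice", "Carol", "Bob", "Alice"]

# map: one (name, 1) pair per student.
pairs = map(lambda name: (name, 1), students)

def accumulate(counts, pair):
    # reduce step: fold each pair into a running dictionary of counts.
    name, one = pair
    counts[name] = counts.get(name, 0) + one
    return counts

frequencies = reduce(accumulate, pairs, {})
print(frequencies)  # {'Alice': 3, 'Bob': 2, 'Carol': 1}
```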