Class: CS-B
Roll no. : 2014BCS1150
Overview
History
The Hadoop Distributed File System (HDFS) is the primary storage system used by
Hadoop applications.
HDFS is a distributed file system that provides high-performance access to data
across Hadoop clusters.
First, Hadoop has to find out which nodes hold the data; for that it queries
a component called the name node.
After locating the data, it sends the job to each of those nodes.
Each processor then independently reads its local input and writes its result
to a local output file.
All of this happens in parallel. The local outputs are then combined (summed)
to give the final result.
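The flow above can be sketched in a few lines of Python. This is a toy simulation, not Hadoop itself: the "nodes" are just lists, the per-node work runs sequentially rather than in parallel, and the function and variable names are illustrative.

```python
from collections import Counter

def process_block(lines):
    """Each 'node' independently counts words in its locally stored block."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# Data blocks as they might reside on three different data nodes.
blocks = [
    ["the quick brown fox"],
    ["the lazy dog"],
    ["the fox"],
]

# In a real cluster these run in parallel; here we run them in turn.
local_outputs = [process_block(b) for b in blocks]

# The local outputs are summed to give the final result.
final = sum(local_outputs, Counter())
print(final["the"])  # 3
```

The key idea this illustrates is data locality: computation moves to where each block is stored, and only the small per-node results are combined at the end.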
Name Node
The name node is commodity hardware running the GNU/Linux operating
system and the name node software.
The system hosting the name node acts as the master server and performs the
following tasks:
Manages the file system metadata.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and
opening files and directories.
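The tasks above can be pictured with a toy sketch of the kind of metadata a name node keeps: a mapping from file paths to block IDs, on which operations like renaming act without touching the data blocks themselves. The names and structures here are illustrative, not Hadoop's actual internals.

```python
# Hypothetical name node metadata: file path -> list of block IDs.
metadata = {
    "/user/data/input.txt": ["blk_001", "blk_002"],
}

def rename(src, dst):
    """Rename a file by updating metadata only; the blocks stay put."""
    metadata[dst] = metadata.pop(src)

rename("/user/data/input.txt", "/user/data/renamed.txt")
print("/user/data/renamed.txt" in metadata)  # True
```

This is why metadata operations like rename are cheap in HDFS: only the master's bookkeeping changes, while the blocks remain where they are on the data nodes.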
Data Node
Data nodes manage the storage attached to the systems they run on.
They perform read-write operations on the file system, as per client
requests.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the name node.
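A minimal sketch of those block operations, assuming a made-up `DataNode` class (this is not Hadoop's real data node protocol, just an illustration of create, delete, and replicate):

```python
class DataNode:
    """Toy data node holding blocks in an in-memory dict."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def create_block(self, block_id, data):
        self.blocks[block_id] = data

    def delete_block(self, block_id):
        self.blocks.pop(block_id, None)

    def replicate_to(self, block_id, other):
        # Copy a block to another data node, as the name node would instruct.
        other.create_block(block_id, self.blocks[block_id])

dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.create_block("blk_001", b"payload")
dn1.replicate_to("blk_001", dn2)
print("blk_001" in dn2.blocks)  # True
```

In the real system these instructions come from the name node, which tracks replica counts and tells data nodes when to copy or delete blocks.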
Map-Reduce
A method for distributing computation across multiple nodes.
Each node processes the data that is stored at that node.
Consists of two main phases.
Map
Reduce
The MapReduce framework consists of a single master, the JobTracker, and one
slave TaskTracker per cluster node.
The master is responsible for resource management, tracking resource
consumption and availability, and scheduling the job's component tasks on the
slaves, monitoring them and re-executing failed tasks.
The slave TaskTrackers execute the tasks as directed by the master and
periodically report task status to the master.
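The master's re-execution of failed tasks can be sketched as a simple retry loop. This is an illustration of the idea only; the function names and the retry count are assumptions, not Hadoop's scheduler.

```python
def run_with_retries(task, attempts=3):
    """Run a task, re-executing it (as the master would) if it fails."""
    for attempt in range(1, attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            continue  # task failed on this slave; schedule it again
    raise RuntimeError("task failed on all attempts")

def flaky_task(attempt):
    # Fails on the first attempt, succeeds afterwards.
    if attempt < 2:
        raise RuntimeError("slave lost")
    return "done"

print(run_with_retries(flaky_task))  # done
```

This fault-tolerance is what lets MapReduce jobs finish even when individual slave nodes crash mid-task.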
Mapper:
Reads data as key/value pairs.
Outputs zero or more key/value pairs.
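A mapper in this spirit can be written as a short generator, here for the classic word-count example (the key is a line offset, which this mapper ignores; the function name is illustrative):

```python
def mapper(key, value):
    """Read one (key, value) pair; emit zero or more (key, value) pairs.
    key: line offset (ignored); value: a line of text."""
    for word in value.split():
        yield (word, 1)

pairs = list(mapper(0, "to be or not to be"))
print(pairs[0])  # ('to', 1)
```

Each input line yields one `(word, 1)` pair per word; the reduce phase would then sum the values for each distinct key.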