
HADOOP

Overview & Discussion

HADOOP FUNDAMENTALS
HDFS - Hadoop Distributed File System
MapReduce - An efficient technique for computing, processing & generating data.
YARN - Yet Another Resource Negotiator
Hadoop Common - Libraries, APIs, documentation.

HDFS - Hadoop Distributed File System

HDFS presents a Unix-like file system interface. It differs from a traditional Unix
file system mainly in that its storage is distributed across many machines.
The Hadoop Distributed File System (HDFS) is designed to reliably store very large
files across machines in a large cluster. It is inspired by the Google File System.
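To make this concrete, here is a minimal sketch of writing and then reading a file through the standard org.apache.hadoop.fs client API. The path /tmp/hello.txt and the class name are illustrative; the cluster address normally comes from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // client handle to the distributed FS

        Path file = new Path("/tmp/hello.txt");     // hypothetical example path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello, HDFS");            // blocks get replicated across DataNodes
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());       // reads stream straight from the DataNodes
        }
    }
}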

Map/Reduce
An efficient technique to process and compute data stored in HDFS.
The basic idea is to process the data in parallel:
Map: each node applies map() to its local data and stores the output.
Shuffle: the nodes redistribute the processed data (based on the output keys
generated by the map() function in the previous stage) so that all records with the
same key end up on the same node.
Reduce: each worker node now processes its group of keys in parallel.
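The classic word-count job shows all three stages; this is a sketch along the lines of the well-known example shipped with Hadoop, using the standard org.apache.hadoop.mapreduce API:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in this node's input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: after the shuffle, all counts for one word arrive together; sum them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}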

YARN - ResourceManager
In YARN, the ResourceManager is, primarily, a pure scheduler. In essence, it is
strictly limited to arbitrating the available resources in the system among the
competing applications (a market maker, if you will). It optimizes for cluster
utilization (keeping all resources in use all the time) against various constraints such
as capacity guarantees, fairness, and SLAs. To allow for different policy
constraints, the ResourceManager has a pluggable scheduler that allows
different algorithms, such as capacity and fair scheduling, to be used as necessary.
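The scheduler is selected through the standard yarn.resourcemanager.scheduler.class property in yarn-site.xml; a sketch that switches from the default CapacityScheduler to the FairScheduler:

<!-- yarn-site.xml: choose the ResourceManager's scheduling policy. -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <!-- The CapacityScheduler is the default; the FairScheduler is the usual alternative. -->
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>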

HADOOP COMMON
Contains the libraries, the JAR files, documentation, etc.
It can basically be seen as a common interface to all of Hadoop's
functionality and conventions.
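As an illustration, a Maven project pulls these shared libraries in as a single dependency; the version number below is illustrative and should match your cluster:

<!-- pom.xml: the shared Hadoop libraries and APIs as one artifact. -->
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-common</artifactId>
  <version>3.3.6</version> <!-- illustrative; use your cluster's version -->
</dependency>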

Questions, Debate?
STOP!! Make sure everything is clear and everyone understands and agrees with
what was explained above.

Pros of Hadoop
0. Free to use and widely used - open source, so tweaks, updates, fixes and
upgrades land constantly.
1. Maximizing utilization and throughput - the total time spent on actual work
should be maximized. Since we handle big data, we probably have lots of
computation to do, and one of Hadoop's strengths is that it lets us make the most
of our resources and maximize the utilization of the system.

Architecture view

The NameNode is the master node - the root of all the DataNodes - and it tracks where across the cluster each file's data is kept.
It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a
file. The NameNode responds to successful requests by returning a list of the relevant DataNode servers where the data lives.
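This metadata-only role is visible in the client API: a client can ask the NameNode where a file's blocks live and then read from those DataNodes directly. A minimal sketch, assuming a hypothetical file /data/big.log:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereIsMyFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/data/big.log")); // hypothetical file
        // The NameNode answers this from its metadata; no file data is transferred.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(String.join(",", block.getHosts())); // DataNodes holding this block
        }
    }
}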

JobTracker

The JobTracker is the MapReduce master: it
delegates computation over the data to the
worker nodes, aiming to run each task on the
DataNode that stores its input; if that node
has no free slot, it hops to the next node,
preferring to stay in the same rack.

1. Client applications submit jobs to the JobTracker.
2. The JobTracker talks to the NameNode to determine the location of the data.
3. The JobTracker locates TaskTracker nodes with available slots at or near the data.
4. The JobTracker submits the work to the chosen TaskTracker nodes.
5. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and
the work is scheduled on a different TaskTracker.
6. A TaskTracker notifies the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job
elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.
7. When the work is completed, the JobTracker updates its status.
8. Client applications can poll the JobTracker for information.

The JobTracker is a single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.

DataNode

A DataNode stores data in the Hadoop Distributed File System (HDFS). A functional filesystem has more than one DataNode, with data
replicated across them.
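Replication is controlled by the standard dfs.replication property in hdfs-site.xml; 3 is the shipped default:

<!-- hdfs-site.xml: how many DataNodes hold a copy of each block. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>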

TaskTracker


A TaskTracker is a node in the cluster that accepts tasks - Map, Reduce and Shuffle operations - from a JobTracker.
Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. When the
JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on
the same server that hosts the DataNode containing the data, and failing that, it looks for an empty slot on a machine in the
same rack.
The TaskTracker spawns a separate JVM process to do the actual work; this is to ensure that a process failure does not
take down the TaskTracker itself. The TaskTracker monitors these spawned processes, capturing their output and exit codes.
When a process finishes, successfully or not, the tracker notifies the JobTracker. The TaskTrackers also send out
periodic heartbeat messages to the JobTracker, to reassure the JobTracker that they are still alive. These
messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in
the cluster work can be delegated.
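In classic (MR1) deployments, the slot counts are per-TaskTracker settings in mapred-site.xml; the values below are illustrative and are typically tuned to the machine's cores and memory:

<!-- mapred-site.xml: how many map/reduce tasks this TaskTracker may run at once. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value> <!-- illustrative -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value> <!-- illustrative -->
</property>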

End of Presentation
Discussion, Questions, Debates, Pros & Cons, Use-Cases and such.
