Escolar Documentos
Profissional Documentos
Cultura Documentos
HADOOP FUNDAMENTALS
HDFS - Hadoop FileSystem
MapReduce - Efficient technique for computing, processing & generating data.
YARN - Yet Another Resource Negotiator
Hadoop common - Libraries, APIs, Documentations.
Map/Reduce
An efficient technique to process / compute data among HDFS.
Processing data parallely, Basic idea :
Mapping : Each node applies map() on its data and stores the output.
Shuffle : The node distributes the processed data (based by output keys that are
generated by map() func in earlier stage) - such that all data with same KEY are
stored in same node
Reduce - Worker node now computes each GROUP of keys in parallel.
YARN - ResourceManager
In YARN, the ResourceManager is, primarily, a pure scheduler. In essence, its
strictly limited to arbitrating available resources in the system among the
competing applications a market maker if you will. It optimizes for cluster
utilization (keep all resources in use all the time) against various constraints such
as capacity guarantees, fairness, and SLAs. To allow for different policy
constraints the ResourceManager has a pluggable scheduler that allows for
different algorithms such as capacity and fair scheduling to be used as necessary.
HADOOP COMMON
Contains the libraries, the JAR Files, Documentation etc.
Basically can be interpreted as some kind of Interface to all of hadoops
functionalities and principles.
Questions, Debate?
STOP!! Make sure everything is clear and everyone agrees / understands the
explained above.
Pros of Hadoop
0. FREE to use and Widely used - open source - lots of tweaks , update, fixed and
upgrades constantly.
1.Maximizing Utilization and Throughput - Meaning the total time of actual work is
to be maximized, as we handle bigdata we probably have lots of
computations/work to do, one of Hadoops pros is the possibility to maximize our
resources and maximize the utilization of the system.
Architecture view
The NameNode is the Master Node - a root of all the other DataNode, it tracks where across the cluster the file data is kept.
It does not store the data of these files itself.
Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a
file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data lives.
1.
2.
3.
4.
5.
6.
7.
8.
JobTracker is the , it
delegates data and/or computations to
DataNodes, it aims to store/compute
the data in the same datanode unless it
has no freespace, then it would hop to
next Node wishing to stay in the stay
rack.
The JobTracker is a point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
DataNode
A DataNode stores data in the [HadoopFileSystem]. A functional filesystem has more than one DataNode, with data
replicated across them.
End of Presentation
Discussion, Questions, Debates , Pros & Cons , Use-Cases and such.