Very few people in the computer science industry wouldn't have come across the terms Big
Data and Hadoop. These are buzzwords we come across quite frequently nowadays. Though
sometimes over-hyped, Big Data is a big deal for analytics companies and policy makers alike.
So let's see what this buzz is all about.
Ever since the onset of the Internet, massive amounts of user data have been generated. In the
last couple of years in particular, social media such as Facebook, Twitter, and blogging websites
have created humongous amounts of user data. According to Gartner, Big Data is very high-volume,
high-velocity data originating from a multitude of sources. Being created in a random fashion,
this data lacks structure. It can be analysed to support smarter and more efficient
decision making. Big Data differs from traditional data in two significant ways. First, it is
so large that it cannot be stored on a single machine. Second, it lacks the structure that traditional
data has. Because of these characteristics, handling and processing Big Data requires special
tools and techniques. This is where Hadoop kicks in.
Hadoop is an open source implementation of the Map-Reduce programming paradigm. Map-Reduce
is a programming paradigm introduced by Google for processing and analyzing very large
data-sets. Programs developed in this paradigm process the data-sets in parallel, so they can
be run across many servers without much effort. The reason this paradigm scales is the
inherently distributed way its solutions work: the big task is divided into many small jobs,
which run in parallel on different machines, and their outputs are then combined into the
solution for the original big task we started with. A typical use of Hadoop is analyzing user
behaviour on e-commerce websites in order to suggest new products to buy. This is traditionally
called a Recommendation System and can be found on all of the major e-commerce websites.
Hadoop can also be used for processing large graphs, such as Facebook's social graph. The reason
Hadoop has simplified parallel processing is that the developer doesn't have to care about
parallel programming concerns; a developer only writes functions describing how the data
should be processed.
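The division of labour described above can be sketched with the canonical word-count example. This is an illustrative Python model, not Hadoop's actual Java API; the function names are made up, and the phases run sequentially here, whereas Hadoop spreads them across the machines of the cluster.

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit (word, 1) pairs for one input split.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce phase: combine every value emitted for one key.
    return (word, sum(counts))

def run_job(lines):
    # Each line stands in for an input split handled by one map task;
    # map tasks are independent, so Hadoop can run them in parallel.
    mapped = [map_fn(line) for line in lines]
    # Shuffle: group the intermediate pairs by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    # One reduce task per key group; these are independent too.
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(run_job(["big data big deal", "data tools"]))
# {'big': 2, 'data': 2, 'deal': 1, 'tools': 1}
```

Note that the developer supplies only `map_fn` and `reduce_fn`; everything else (splitting, shuffling, scheduling) is the framework's job.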
Apache Hadoop Components
The Hadoop framework consists of two major components: storage and processing. First,
HDFS (the Hadoop Distributed File System) handles data storage across all the machines on
which the Hadoop cluster is running. Second, Map-Reduce handles the processing part of the
framework. Let's have a look at them individually.
HDFS
Namenode: it manages the namespace of the file system, keeping track of all files and
directories. The Namenode holds the mapping between each file and the blocks in which its data
is stored. All files are accessed through the Namenode and the Datanodes.
Datanode: it actually stores the data, in the form of blocks. Each Datanode keeps reporting to
the Namenode about the blocks it has stored, so that the Namenode stays aware and the data can
be accessed. The Namenode is thus the most crucial component and a single point of failure in
the system: without it, the data cannot be accessed.
Secondary Namenode: this node is responsible for checkpointing the metadata held by the
Namenode. In case of failure, these checkpoints can be used to restart the system.
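The split between metadata and data described above can be modelled in a few lines. This is a toy sketch, not real HDFS: class names, the round-robin placement, and the tiny 4-byte block size are all illustrative (real HDFS blocks are 64 or 128 MB, and blocks are replicated).

```python
BLOCK_SIZE = 4  # bytes, tiny for illustration only

class DataNode:
    def __init__(self):
        self.blocks = {}  # block id -> raw bytes; the datanode holds the data

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.file_map = {}  # file name -> [(block id, datanode index)]

    def write(self, name, data):
        # Split the file into fixed-size blocks, spread them round-robin
        # over the datanodes, and record where each block went.
        entries = []
        for i in range(0, len(data), BLOCK_SIZE):
            block_id = f"{name}#{i // BLOCK_SIZE}"
            node = (i // BLOCK_SIZE) % len(self.datanodes)
            self.datanodes[node].store(block_id, data[i:i + BLOCK_SIZE])
            entries.append((block_id, node))
        self.file_map[name] = entries

    def read(self, name):
        # A client asks the namenode where the blocks live, then fetches
        # each block from the datanode that owns it.
        return "".join(self.datanodes[n].blocks[b]
                       for b, n in self.file_map[name])

nn = NameNode([DataNode(), DataNode(), DataNode()])
nn.write("log.txt", "hello hadoop")
print(nn.read("log.txt"))  # hello hadoop
```

The key point the sketch makes: the namenode never touches file contents, only the file-to-block mapping, which is why losing it makes the data unreachable even though the bytes still sit on the datanodes.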
Map-Reduce
Map-Reduce is a programming paradigm in which every job is specified in terms of a map function
and a reduce function. Both kinds of task run in parallel on the cluster. The storage required for
this functionality is provided by HDFS. Following are the main components of Map-Reduce.
Job Tracker: Map-Reduce jobs are submitted to the Job Tracker. It talks to the
Namenode to locate the data, then submits tasks to the Task Tracker nodes. These Task
Tracker nodes have to report to the Job Tracker at regular intervals, signalling that they are
alive and working on their tasks. If a Task Tracker doesn't report, it is assumed to be dead and
its work is reassigned to another Task Tracker. The Job Tracker is again a single point of
failure: if it fails, we will not be able to track the jobs.
Task Tracker: the Task Tracker takes tasks from the Job Tracker. These tasks are map,
reduce, or shuffle operations. The Task Tracker creates a separate JVM process for each task to
make sure that a process failure doesn't bring down the Task Tracker itself. Task Trackers also
report to the Job Tracker continuously so that the Job Tracker can keep track of successful and
failed tasks.
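The heartbeat-and-reassignment behaviour described above can be sketched as follows. This is a simplified model, not the real Hadoop scheduler: the class, the tick-based clock, and the 3-tick timeout are invented for illustration.

```python
HEARTBEAT_TIMEOUT = 3  # ticks of silence before a tracker is declared dead (made up)

class JobTracker:
    def __init__(self):
        self.last_seen = {}    # task tracker id -> tick of its last heartbeat
        self.assignments = {}  # task tracker id -> list of assigned task ids
        self.clock = 0

    def assign(self, tracker, task):
        self.assignments.setdefault(tracker, []).append(task)
        self.last_seen.setdefault(tracker, self.clock)

    def heartbeat(self, tracker):
        # Task trackers report at regular intervals to say they are alive.
        self.last_seen[tracker] = self.clock

    def tick(self):
        # Advance time; any tracker silent for too long is considered dead
        # and its tasks are handed to the surviving trackers.
        self.clock += 1
        dead = [t for t, seen in self.last_seen.items()
                if self.clock - seen >= HEARTBEAT_TIMEOUT]
        for t in dead:
            orphaned = self.assignments.pop(t, [])
            del self.last_seen[t]
            survivors = list(self.assignments)
            if not survivors:
                continue  # no one left to take the work
            for i, task in enumerate(orphaned):
                self.assignments[survivors[i % len(survivors)]].append(task)

jt = JobTracker()
jt.assign("tt1", "map-0")
jt.assign("tt2", "map-1")
for _ in range(3):        # tt2 keeps reporting; tt1 has gone silent
    jt.tick()
    jt.heartbeat("tt2")
print(jt.assignments)     # {'tt2': ['map-1', 'map-0']}
```

Note that the model also shows the single point of failure: the trackers' liveness and assignments live only inside the one `JobTracker` object, so losing it loses all bookkeeping.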