
By Pallavi Mandal

Class: CS-B
Roll no. : 2014BCS1150
Overview
History
What is Big Data
How Hadoop works (HDFS, MapReduce)
Hadoop vs Conventional Database
Hadoop Ecosystem
Uses
Who Uses Hadoop

History
In 2005 Doug Cutting and Michael J. Cafarella
developed Hadoop to support distribution for the
Nutch search engine project.
The project was funded by Yahoo.
In 2006 Yahoo gave the project to the Apache
Software Foundation.
Cutting named the program after his son's toy
elephant.
Big Data
Big data is a collection of large datasets that cannot be
processed using traditional computing techniques.
The amount of data produced by us from the beginning of
time till 2003 was 5 billion gigabytes. Piled up in the form
of disks, it could fill an entire football field.
The same amount was created every two days in 2011,
and every ten minutes in 2013.
This rate is still growing enormously.
How it works?
Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is the primary storage system used by
Hadoop applications.
HDFS is a distributed file system that provides high-performance access to data
across Hadoop clusters.
First, Hadoop has to know on which nodes the data resides; for that it queries
something called the name node.
After locating the data, it sends the job to each of those nodes.
Each node then independently reads its input and writes the result to a local
output file.
All of this is done in parallel; the local outputs are then combined to give the
final result.
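The flow above can be sketched in plain Python. This is a toy simulation, not a real Hadoop API: the `name_node` dictionary, node names, and `run_job` helper are all made up for illustration.

```python
# Toy simulation of the flow described above: a "name node" maps file
# blocks to the data nodes holding them, the job is dispatched to each
# of those nodes, each node produces a local result, and the local
# results are combined into the final answer.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metadata: block -> (data node, values stored in that block).
name_node = {
    "block-0": ("node-A", [3, 1, 4]),
    "block-1": ("node-B", [1, 5, 9]),
    "block-2": ("node-C", [2, 6, 5]),
}

def local_task(node, values):
    # Each node independently processes only its local data.
    return sum(values)

def run_job(blocks):
    # "Query the name node" to locate the data, then run the tasks in parallel.
    with ThreadPoolExecutor() as pool:
        local_outputs = pool.map(lambda nv: local_task(*nv), blocks.values())
    # Combine the local outputs to give the final result.
    return sum(local_outputs)

print(run_job(name_node))  # → 36
```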
Name Node
The name node is the commodity hardware that contains the GNU/Linux
operating system and the name node software.
The system having the name node acts as the master server and it does the
following tasks:
Manages the file system metadata.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and
opening files and directories.
Data Node
These nodes manage the data storage of their system.
Data nodes perform read-write operations on the file systems, as per client
request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the name node.
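As a rough illustration of the name node instructing data nodes where to replicate blocks, here is a minimal sketch. The round-robin placement rule, node names, and replication factor below are assumptions for the example, not Hadoop's actual placement policy.

```python
# Minimal sketch of a name node choosing which data nodes should hold
# the replicas of each block (assumed round-robin logic, not Hadoop's
# real rack-aware placement policy).
DATA_NODES = ["node-A", "node-B", "node-C", "node-D"]
REPLICATION_FACTOR = 3

def place_block(block_id, nodes=DATA_NODES, rf=REPLICATION_FACTOR):
    # Pick rf distinct nodes for each block, rotating the starting node
    # so replicas spread evenly across the cluster.
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

for b in range(3):
    print(b, place_block(b))
```

The data nodes would then carry out the block creation and replication exactly as instructed by this placement decision.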
Map-Reduce
A method for distributing computation across multiple nodes.
Each node processes the data that is stored at that node.
Consists of two main phases.
Map
Reduce
The MapReduce framework consists of a single master (Job Tracker) and one
slave (Task Tracker) per cluster node.
The master is responsible for resource management, tracking resource
consumption/availability, scheduling the jobs' component tasks on the
slaves, monitoring them, and re-executing failed tasks.
The slave Task Trackers execute the tasks as directed by the master and
periodically report task status to the master.
Mapper:
Reads data as key/value pairs.
Outputs zero or more key/value pairs.
Shuffle and Sort:
Output from the mapper is sorted by key.
All values with the same key are guaranteed to go to the same machine.
Reducer:
Called once for each unique key.
Gets a list of all values associated with the key as input.
Outputs zero or more final key/value pairs, usually just one per input key.
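The three phases above can be mirrored in plain Python with the classic word-count example. This is a single-process simulation of the MapReduce model, not code that runs on a Hadoop cluster.

```python
# Word count, following the Mapper / Shuffle and Sort / Reducer phases.
from collections import defaultdict

def mapper(line):
    # Reads input, outputs zero or more (key, value) pairs.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle_and_sort(pairs):
    # Groups all values by key and sorts by key, so every value for a
    # given key ends up in the same group (on a cluster: same machine).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Called once per unique key; here, one output pair per input key.
    return (key, sum(values))

lines = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for line in lines for pair in mapper(line)]
result = [reducer(k, vs) for k, vs in shuffle_and_sort(mapped)]
print(result)
# → [('data', 2), ('hadoop', 2), ('processes', 1), ('stores', 1)]
```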
Hadoop vs Conventional Database
Hadoop data is distributed across many nodes, while in a
conventional database the data resides on one server.
Hadoop data is written once and read many times, while in a
conventional database data can be modified several times.
Hadoop supports NoSQL, while conventional databases use SQL.
Hadoop Ecosystem
Hardware
It is the lowest level.
It does not require buying special hardware, as Hadoop runs on
commodity hardware.
Hadoop Layer
It contains MapReduce and HDFS, i.e. the Hadoop Distributed File
System.
Tools Layer
A set of tools and utilities, such as RHadoop, used for statistical
data processing with the R programming language.
Tools for NoSQL-style querying, like Hive and Pig.
Tools for getting data in and out of the Hadoop file system, like
Sqoop.
Uses of Hadoop
Data-intensive text processing
Assembly of large genomes
Graph mining
Machine learning and data mining
Large scale social network analysis
Who Uses Hadoop?
References
YouTube
http://www.tutorialspoint.com/hadoop/
https://en.wikipedia.org/wiki/Apache_Hadoop
http://hortonworks.com/apache/mapreduce/
