
Course Topics

Week 1
Understanding Big Data; Introduction to HDFS

Week 2
Playing around with the Cluster; Data Loading Techniques

Week 3
Map-Reduce Basics, Types and Formats; Use-cases for Map-Reduce

Week 4
Analytics using Pig; Understanding Pig Latin

Week 5
Analytics using Hive; Understanding Hive QL

Week 6
NoSQL Databases; Understanding HBase

Week 7
Real-world Datasets and Analysis; Hadoop Project Environment

Week 8
Project Reviews; Planning a Career in Big Data

How it works
Live classes
Class recordings
Module-wise quizzes and coding assignments
24x7 on-demand technical support
Project work on large datasets
Online certification exam
Lifetime access to the Learning Management System

Complementary Java Classes

What is Big Data?

Facebook Example

Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network every day. On average, 3.2 billion likes and comments are posted on Facebook every day.

Twitter Example
Twitter has over 500 million registered users. The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK and Indonesia. 79% of US Twitter users are more likely to recommend brands they follow. 67% of US Twitter users are more likely to buy from brands they follow. 57% of all companies that use social media for business use Twitter.

Other Industrial Usecases


Insurance, Healthcare, Retail, Recommendations, Groupings, Genome Sequencing, Utilities

Hadoop Users

http://wiki.apache.org/hadoop/PoweredBy

Data volume is growing exponentially


Estimated Global Data Volume:

2011: 1.8 ZB
2015: 7.9 ZB

The world's information doubles every two years. Over the next 10 years:
The number of servers worldwide will grow by 10x
The amount of information managed by enterprise data centers will grow by 50x
The number of files enterprise data centers handle will grow by 75x

Source: http://www.emc.com/leadership/programs/digitaluniverse.htm, which was based on the 2011 IDC Digital Universe Study

Un-Structured Data is exploding

Why DFS?
Read 1 TB Data

1 Machine
4 I/O Channels, Each Channel 100 MB/s
45 Minutes

10 Machines
4 I/O Channels, Each Channel 100 MB/s
4.5 Minutes
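The slide's numbers follow from simple throughput arithmetic: total read bandwidth scales with the number of machines, so reading the same data across ten machines is ten times faster. A back-of-the-envelope check (the 45-minute figure on the slide is a rounding of ~42 minutes):

```python
# Back-of-the-envelope read times for the slide's scenario.
# 1 TB of data, each machine has 4 I/O channels at 100 MB/s per channel.
TB = 10**12  # bytes (decimal)
MB = 10**6

def read_time_minutes(machines, channels=4, channel_mb_per_s=100, data_bytes=TB):
    """Time to scan the data with reads spread evenly across all channels."""
    total_throughput = machines * channels * channel_mb_per_s * MB  # bytes/s
    return data_bytes / total_throughput / 60

print(round(read_time_minutes(1), 1))   # 1 machine  -> 41.7 minutes (slide rounds to 45)
print(round(read_time_minutes(10), 2))  # 10 machines -> 4.17 minutes (slide rounds to 4.5)
```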

What Is a Distributed File System (DFS)?

What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.

Companies using Hadoop: Yahoo, Google, Facebook, Amazon, AOL, IBM, and many more at
http://wiki.apache.org/hadoop/PoweredBy

Hadoop Eco-System

Hadoop Core Components:


HDFS (Hadoop Distributed File System): storage
MapReduce: processing
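The MapReduce processing model can be sketched without any Hadoop at all: a map step emits (key, value) pairs, a shuffle step groups values by key, and a reduce step aggregates each group. A minimal word-count sketch in plain Python (conceptual only, not Hadoop's actual Java API):

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big cluster", "data data everywhere"]
pairs = [p for line in lines for p in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 3, 'cluster': 1, 'everywhere': 1}
```

In real Hadoop the map and reduce functions run in parallel on many machines, with the shuffle handled by the framework; the logic per record is the same.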

What is HDFS?
HDFS - Hadoop Distributed File System
Highly fault-tolerant

High throughput
Suitable for applications with large data sets
Streaming access to file system data
Can be built out of commodity hardware
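HDFS achieves fault tolerance by splitting each file into large fixed-size blocks and replicating every block on several DataNodes. The toy sketch below illustrates the idea; the block size, node names, and round-robin placement are illustrative assumptions, not a real HDFS API (real HDFS placement is also rack-aware):

```python
# Toy illustration of HDFS block splitting and replica placement.
# 64 MB was the Hadoop 1.x default block size; replication factor defaults to 3.
BLOCK_SIZE = 64        # "MB" per block (illustrative)
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names

def split_into_blocks(file_size_mb, block_size=BLOCK_SIZE):
    # A file is cut into full-size blocks plus one final partial block.
    return [min(block_size, file_size_mb - i) for i in range(0, file_size_mb, block_size)]

def place_blocks(blocks, nodes=DATANODES, replication=REPLICATION):
    # Simple round-robin placement of each block's replicas across DataNodes.
    return [
        {"block": i, "size_mb": size,
         "replicas": [nodes[(i + r) % len(nodes)] for r in range(replication)]}
        for i, size in enumerate(blocks)
    ]

blocks = split_into_blocks(200)  # a 200 MB file -> blocks of 64, 64, 64, 8 MB
for placement in place_blocks(blocks):
    print(placement)
```

Losing any single DataNode leaves at least two replicas of every block, which is why commodity (failure-prone) hardware is acceptable.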

Main Components Of HDFS:


NameNode:
the master of the system
maintains and manages the blocks that are present on the DataNodes

DataNodes:
slaves deployed on each machine that provide the actual storage
responsible for serving read and write requests from clients


Secondary NameNode:
Not a hot standby for the NameNode
Connects to the NameNode every hour*
Housekeeping: backs up the NameNode metadata
The saved metadata can be used to rebuild a failed NameNode
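The asterisk on "every hour" is there because the checkpoint interval is configurable. In Hadoop 1.x this is the `fs.checkpoint.period` property (in seconds, default 3600); a sketch of overriding it in the configuration, assuming a Hadoop 1.x setup:

```xml
<!-- Illustrative snippet: checkpoint every 30 minutes instead of hourly.
     Property name is the Hadoop 1.x one; Hadoop 2 renamed it to
     dfs.namenode.checkpoint.period in hdfs-site.xml. -->
<property>
  <name>fs.checkpoint.period</name>
  <value>1800</value>
</property>
```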

[Diagram: the NameNode is a single point of failure; every hour it hands its metadata to the Secondary NameNode for safekeeping]

JobTracker and TaskTracker:

HDFS Architecture

Job Tracker


HDFS Client Creates a New File

Rack Awareness

Anatomy of a File Write:

Anatomy of a File Read:

Thank You
See You in Class Next Week
