Week 1: Understanding Big Data; Introduction to HDFS
Week 2: Playing Around with the Cluster; Data Loading Techniques
Week 3: Map-Reduce Basics, Types and Formats; Use Cases for Map-Reduce
Week 4: Analytics Using Pig; Understanding Pig Latin
Week 5: Analytics Using Hive; Understanding HiveQL
Week 6: NoSQL Databases; Understanding HBase
Week 7: Real-World Datasets and Analysis; Hadoop Project Environment
Week 8: Project Reviews; Planning a Career in Big Data
How it works
- Live classes
- Class recordings
- Module-wise quizzes and coding assignments
- 24x7 on-demand technical support
- Project work on large datasets
- Online certification exam
- Lifetime access to the Learning Management System
Facebook Example
Facebook users spend 10.5 billion minutes (almost 20,000 years) online on the social network. On average, 3.2 billion likes and comments are posted on Facebook every day.
Twitter Example
Twitter has over 500 million registered users. The USA's 141.8 million accounts represent 27.4 percent of all Twitter users, well ahead of Brazil, Japan, the UK, and Indonesia. 79% of US Twitter users are more likely to recommend brands they follow, 67% are more likely to buy from brands they follow, and 57% of all companies that use social media for business use Twitter.
Hadoop Users
http://wiki.apache.org/hadoop/PoweredBy
2011: 1.8 ZB
2015: 7.9 ZB
The world's information doubles every two years. Over the next 10 years:
- The number of servers worldwide will grow by 10x
- The amount of information managed by enterprise data centers will grow by 50x
- The number of files enterprise data centers handle will grow by 75x
Source: http://www.emc.com/leadership/programs/digitaluniverse.htm, which was based on the 2011 IDC Digital Universe Study
Why DFS?
Read 1 TB of data:
- 1 machine, 4 I/O channels, each channel 100 MB/s: 45 minutes
- 10 machines, 4 I/O channels each, each channel 100 MB/s: 4.5 minutes
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Hadoop is used in production by:
- Amazon
- AOL
- IBM
- And many more; see http://wiki.apache.org/hadoop/PoweredBy
Hadoop Eco-System
What is HDFS?
HDFS - Hadoop Distributed File System
- Highly fault-tolerant
- High throughput
- Suitable for applications with large data sets
- Streaming access to file system data
- Can be built out of commodity hardware
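One reason HDFS suits large files: each file is split into fixed-size blocks that are spread, and replicated, across DataNodes. A minimal sketch of that bookkeeping, assuming the Hadoop 1.x defaults of a 64 MB block size and a replication factor of 3 (both are configurable, and later Hadoop versions default to 128 MB blocks):

```python
import math

BLOCK_MB = 64        # assumed Hadoop 1.x default block size
REPLICATION = 3      # assumed default replication factor

def block_layout(file_mb):
    """Return (number of blocks, total block replicas stored cluster-wide)."""
    blocks = math.ceil(file_mb / BLOCK_MB)
    return blocks, blocks * REPLICATION

# A 1 TB (1,000,000 MB) file: 15,625 blocks, 46,875 replicas across the cluster
print(block_layout(1_000_000))  # → (15625, 46875)
```

Spreading replicas over many commodity machines is what gives HDFS both its fault tolerance (losing a node loses no data) and its throughput (many machines read blocks in parallel).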
DataNodes:
Slaves deployed on each machine in the cluster; they provide the actual storage and are responsible for serving read and write requests from clients.
Secondary NameNode:
- Not a hot standby for the NameNode
- Connects to the NameNode every hour*
- Does housekeeping and backs up the NameNode metadata
- The saved metadata can be used to rebuild a failed NameNode
The NameNode is a single point of failure; the Secondary NameNode receives its metadata every hour and keeps a secure copy.
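The hourly checkpoint interval marked with the asterisk above is configurable. A sketch of the relevant property, assuming the Hadoop 1.x name `fs.checkpoint.period` (renamed `dfs.namenode.checkpoint.period` in Hadoop 2.x); the value is in seconds:

```xml
<!-- hdfs-site.xml: how often the Secondary NameNode checkpoints the metadata -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value> <!-- default: one hour -->
</property>
```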
HDFS Architecture
Job Tracker
Rack Awareness
Thank You
See You in Class Next Week