

About this course

Course Logistics
Chapter Topics
Refer to the course objective and content PDF.
Course Objective
About your instructor
About you
Experience with Hadoop?
Experience as a developer?
Expectations from the course?
Module 1
Introduction to BIG Data
and its Need
Lesson 1: Introduction to BIG Data
Lesson 2: Big Data Analytics and why it's a need now
Lesson 3: Real Time Case Studies
Lesson 4: Traditional Vs. Big Data Approach
Lesson 5: Technologies within Big Data Eco System
Module Objectives
At the end of this module, you will learn to:
Introduction to BIG Data
A Few Examples of BIG Data
Big Data Real Time Case Studies
Why Big Data is a BUZZ and why it's a need now
Big Data Analytics
Comparison between Traditional and Big Data approach
Technologies within Big Data Eco System
Introduction to BIG Data
At the end of this lesson, you will learn to:
What is Big Data?
The 3 Vs of BIG Data
A Few Examples of Big Data
Why Big Data is a BUZZ!
Lesson 1
What is BIG Data?
When you hear the term BIG Data,
what is the first instant thought?
Volume, right?
Massive, huge, enormous quantities of digital stuff.
But it's not just the volume that makes BIG Data difficult to manage
and analyze; it's also the Variety and Velocity!
Big Data : Insight
Big data is a collection of data sets so large and complex that it becomes difficult
to process using on-hand database management tools or traditional data
processing applications.
Big Data platforms are equipped to handle the day-to-day data explosion.
Big data is difficult to work with using most relational database management systems.
Big data usually includes data sets with sizes beyond the ability of commonly
used software tools to capture, curate, manage, and process the data within a
tolerable elapsed time.
What do we Mean by Big Data?
BIG Data has three defining attributes, the 3 Vs. They are:
Data Volume,
Data Variety &
Data Velocity
Together, the 3 Vs constitute a comprehensive definition of BIG Data.
Using millions of transactions & events to analyze trends
and perform forecast!!
Turning 12 terabytes of Tweets created each day into
improved product sentiment analysis!!
Converting 350 billion annual meter readings to better
predict power consumption!!
Using fast paced real time transactions for predictive
Scrutinizing 5 million trade events created each day to
identify potential fraud!!
Analyzing 500 million daily call detail records in real-time
to predict customer churn faster!!
Collectively analyzing all forms of data (text, sensor data,
audio, video, click streams, log files ) gives new insights!!
Monitoring 100s of live video feeds from surveillance
cameras to target points of interest!!
Exploiting 80% data growth in images, video and
documents to improve customer satisfaction!!
3Vs Of BIG Data
Velocity: Batch Load, Near Time Data, Real Time Data
Variety: Semi Structured
Understanding BIG Data - Summary
Data that's an order of magnitude greater than you are accustomed to
- Gartner Analyst Doug Laney
BIG Data is a collection of data sets so large and complex that it becomes difficult
to process using on-hand Database Management Tools
- Wikipedia
3 Vs
- Volume
- Velocity &
- Variety
Few Examples of BIG Data
Facebook handles 40 billion photos from its user base and has more than 901 million
active users generating social interaction data.
RFID (radio frequency ID) systems generate up to 1,000 times the data of conventional
bar code systems.
10,000 payment card transactions are made every second around the world.
340 million tweets are sent per day. That's nearly 4,000 tweets per second.
More than 5 billion people are calling, texting, tweeting and browsing websites on mobile phones.
Boeing jet engines produce terabytes of operational information every 30 minutes they
turn. A four-engine jumbo jet can create 640 terabytes of data on just one Atlantic
crossing; multiply that by the more than 25,000 flights flown each day.
Walmart handles more than 1 million customer transactions every hour, which is
imported into databases estimated to contain more than 2.5 petabytes of data
Why BIG Data is a BUZZ!
A BIG Data platform can be used to analyze semi-structured and unstructured data along with
raw structured data.
BIG Data solutions are ideal for iterative and exploratory analysis when business measures
cannot be predetermined using a structured data set.
Big Data can be used to support predictive analytics and provide predictive enterprise
solutions using all forms of real transactions, in contrast to traditional DW/BI.
A few case studies for BIG Data would be:
Performing IT log Analytics
Identifying Fraud detection pattern
Sentiment Analytics using SocialMedia Feed
Executing usage analytics in Energy Sector
Analyzing competitor market penetration
So what does BIG Data mean to a business?
Profile customers & Gain Customer Trust
Determine pricing strategies
Identify competitive advantages
Better target advertising
Strengthen customer service
In this chapter you have learned
What is Big Data?
The 3 Vs of BIG Data
A Few Examples of Big Data
Why Big Data is a BUZZ!
BIG Data Analytics
& Why it's a Need Now?
At the end of this lesson, you will learn to:
What is Big Data Analytics?
Its advantages and challenges.
Why it has become a need now?
Big data as a complete Solution.
Big Data Analytics implementation.
Lesson 2
What is BIG Data Analytics
Big data analytics is the process of examining large amounts of data of a variety of types
(Structured, Semi-Structured or Unstructured), to uncover hidden patterns, unknown
correlations and other useful information.
The primary goal of big data analytics is to help companies make better business decisions
by analyzing data sources that may be left untapped by conventional business intelligence
(BI) programs
Underlying data may include Web server logs, Internet clickstream data, social media
activity reports, mobile-phone call detail records, information captured by sensors, IT Logs
Lack of skill set
High Initial Cost Involvement
Challenges in integrating BIG Data
Little Awareness of Technologies
Unavailability of matured BIG Data toolset
Making sense out of unstructured Data
Optimized usage of Organizational Data
Value add to existing BI Solutions
More accurate BI Results
Best bet to make Better Business Decisions
Why BIG Data Analytics is a need now?
Information is at center of a New
Wave Opportunity
and Organization needs
Deeper Insights
1 in 3 business leaders frequently make business decisions based on information they do not trust or do not have!
1 in 2 business leaders say they do not have access to the relevant information they require to do their job!
83% of CIOs cited BI as part of their visionary plans to enhance their competitiveness.
60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions.
As much data & content
In coming decade
Of World's available data is
Unstructured or Semi-Structured
Zettabytes of data by 2020
BIG Data Platform helps you combine varied data forms for making decisions
Why BIG Data Analytics is a need now?
BIG Data Platform provides you multi-channel Customer Sentiment Analytics
Who are the
BIGGEST influencers
and what are they
Call Centre
What people think about your company or product???
Why BIG Data Analytics is a need now?
New Information Sources
Traditional Sources
Twitter produces
7TB data everyday
2 Billion Internet
users as of now
Facebook produces
10 TB data everyday
4.6 Billion mobile
phones worldwide
Steady growth
of traditional data
Enormous Satellite
data growth
New media channels
emerging everyday
Digitization makes
exponential data growth
Future continues to bring new data sources with high data volume
BIG Data Platform ensures consolidation of ever growing varied data sets
Why BIG Data Analytics is a need now?
Imagine if we could
infections in newborns 24 hours earlier? (Physician)
apply social relationships of customers to prevent churn? (Call Centre Rep)
adjust credit lines as transactions are occurring to account for risk fluctuations? (Loan Officer)
determine whom to offer discounts at time of sale instead of offering to all? (Sales Associate)
BIG Data Platform can be used across industry for making Analytic Decisions
BIG Data: The Solution
The Solution Bring together any data source @ any velocity to generate insight
Analyze variety of data
@ enormous volume
Insight on streaming
Large volume
structured data
BIG Data
Multi channel
customer sentiment
Predict weather
patterns to optimize
capital expenditure
Make risk decisions
based on real time
transactional data
Identify criminal &
threats from disparate
Find life threatening
conditions in time to
Implementing BIG Data Analytics - Different Approaches
Interactive Exploration: for data analysts and data scientists who want to discover real-time patterns as they emerge from their BIG Data content.
Operational Reporting: for executives and operational managers who want summarized, pre-built, periodic reports on BIG Data content.
Indirect Batch Analysis: for data analysts and operational managers who want to analyze data trends based on predefined questions in their BIG Data content.
Low Medium High
Hbase, No-SQL, Analytic
Hive, No-SQL, Analytic
Hadoop, No-SQL, Analytic
Native Native, SQL ETL
Use Cases
BIG Data Platform
BI Platform
In-Memory Engine
BI Platform
BI Platform
OLAP Engine
Reports & Dashboards Multidimensional Analysis Multidimensional Analysis
Native Native SQL
In this chapter you have learned
What is Big Data Analytics?
Its advantages and challenges.
Why it has become a need now?
Big data as a complete Solution.
Big Data Analytics implementation.
Traditional Analytics
Big Data Analytics
At the end of this lesson, you will learn to:
The Traditional Approach
The BIG Data Approach
Traditional Vs. Big Data Approach
BIG Data Complements Traditional Enterprise Data Warehouse
Traditional Analytics Vs. Big Data Analytics
Lesson 3
The Traditional Approach: Business Requirement Drives Solution Design
New Requirements
require redesign &
Business executes queries to answer
questions over and over
IT designs a solution
with a set structure
& functionality
Business defines requirements
what questions should we ask
Well suited to
High Value, Structured Data
Repeated operations & processes
Relatively stable sources
Well understood requirements
Stretched by
Highly valuable data and content
Exploratory analysis
Volatile sources
Changing requirements
The BIG Data Approach: Information Sources drive Creative Discovery
Can be implemented for
Structured or Unstructured Data
Exploratory operations & processes
Relatively Unstable sources
Unknown Business Requirements
New insights drive
integration to
traditional technology
Business determines what questions to ask
by exploring data & relationships
IT delivers platform that
enables creative exploration
of all available data & content
Business & IT identify
available Information Sources
Traditional and BIG Data Approaches
Traditional approach vs BIG Data approach
Traditional Approach
Structured & Repeatable Analysis
BIG Data Approach
Iterative & Exploratory Analysis
Business Users
Determines what
questions to ask
Structures data to
answer questions
Delivers platform to
enable creative
Explores what
questions could be
Monthly Sales Report
- Profitability Analysis
- Customer Surveys
Brand sentiment
- Product Strategy
- Maximizing Utilization
BIG Data Complements Traditional Enterprise Data Warehouse
Traditional Sources New Sources
BIG Data Platform Data Warehouse
BIG Data shouldn't be a silo;
it must be an integrated part of your Enterprise Information Architecture
Traditional Analytics Platform vs. BIG Data Analytics Platform
Traditional DW Analytics Platform | BIG Data Analytics Platform
Gigabytes to terabytes | Petabytes to exabytes
Centralized data structure | Distributed data structure
Structured | Semi-structured and unstructured
Relational data model | Flat schemas
Batch-oriented data load process | Aimed at near-real-time analysis of the data
Analytics based on historical trends | Analytics based on real-time data
Data generated using conventional methods (data entry) | Data generated using unconventional methods such as RFID and sensor networks
In this chapter you have learned
The Traditional Approach
The BIG Data Approach
Traditional Vs. Big Data Approach
How Big Data Complements Traditional
Enterprise Data Warehouse
Traditional Analytics Vs. Big Data Analytics
Real Time Case Studies
At the end of this lesson, you will learn to:
Big Data Analytics: Use Cases
Big Data to Predict Your Customers' Behaviors
When to Consider a Big Data Solution
Big Data Real Time Case Studies
Lesson 4
BIG Data Analytics - Use Cases
Integrated Website Analytics
Competitive Pricing
Customer Segmentation
Predictive buying behavior
Market Campaign Management
Defense Intelligence Analysis
Threat Analytics
Customer Experience Analytics
Healthcare & Pharmaceutical
Drug Discovery
Customer Segmentation
Service Response Optimization
Financial Services
Fraud detection Analytics
Risk Modeling & Analysis
Inventory Optimization
Energy & Utilities
Customer Experience Analytics
Service Quality Optimization
Media & Content
Customer Satisfaction Analytics
Dispatch Optimization
Big Data to Predict Your Customers' Behaviors
Retailers like Wal-Mart and Kohl's are making use of sales, pricing, and economic
data, combined with demographic and weather data, to fine-tune merchandising
store by store and anticipate appropriate timing of store sales.
Online data services like eHarmony and Match.com are constantly observing
activity on their sites to optimize their matching algorithms to predict who will
hit it off with whom.
Google search queries on flu symptoms and treatments reveal weeks in advance
what flu-related volumes hospital emergency departments can expect.
BIG Data provides the capacity to predict the future before your rivals can, whether
they're companies or criminals. Currently, the NYPD is using a Big Data platform to
fight crime in Manhattan.
When to Consider a Big Data Solution
Big Data solutions are ideal for analyzing not only raw structured data, but semi
structured and unstructured data from wide variety of sources.
Big Data solutions are ideal when all, or most, of the data needs to be analyzed
versus a sample of the data; or a sampling of data isn't nearly as effective as a
larger set of data from which to derive analysis.
Big Data solutions are ideal for iterative and exploratory analysis when business
measures on data are not predetermined.
Big Data Real Time Case Study
TXU Energy Smart Electric Meters:
Because of smart meters, electricity providers can read the
meter once every 15 minutes rather than once a month. This
not only eliminates the need to send someone for meter
reading, but as the meter is read once every fifteen minutes,
electricity can be priced differently for peak and off-peak
hours. Pricing can be used to shape the demand curve during
peak hours, eliminating the need for creating additional
generating capacity just to meet peak demand, saving
electricity providers millions of dollars worth of investment in
generating capacity and plant maintenance costs.
Big Data Real Time Case Study (Contd.)
T-Mobile USA:
T-Mobile USA has integrated Big Data across multiple IT
systems to combine customer transaction and interactions
data in order to better predict customer defections. By
leveraging social media data (Big Data) along with transaction
data from CRM and Billing systems, T-Mobile USA has been
able to cut customer defections in half in a single quarter.
Big Data Real Time Case Study (Contd.)
US Xpress :
US Xpress, provider of a wide variety
of transportation solutions collects
about a thousand data elements
ranging from fuel usage to tire
condition to truck engine operations
to GPS information, and uses this
data for optimal fleet management
and to drive productivity saving
millions of dollars in operating costs.
Big Data Real Time Case Study (Contd.)
McLaren's Formula One racing team:
McLaren's Formula One racing team uses real-time car sensor
data during car races, identifies issues with its racing cars
using predictive analytics and takes corrective actions
proactively before it's too late!
In this chapter you have learned
Big Data Analytics: Use Cases
Big Data to Predict Your Customers' Behaviors
When to Consider a Big Data Solution
Big Data Real Time Case Studies
Like TXU smart meters, T-Mobile, US Xpress,
McLaren's Formula One racing team
Technologies within Big
Data Eco System
At the end of this lesson, you will learn to:
BIG Data Landscape
BIG Data Key Components
Components of Analytical Big-data Processing
Hadoop at a glance
Lesson 5
BIG Data Landscape
Hardware BIG Data
Data Management
Analytics Layer Application Layer Services
Vendors include
DELL, HP, Arista,
IBM, Cisco, EMC,
Open source
Non-Hadoop BIG
Data Frameworks
include Apache,
Distributed File
Data Integration
Data Quality &
include Apache,
DataStax, Pervasive,
Couchbase, IBM,
Oracle, Informatica,
Syncsort, Talend.
include Apache,
Hadapt, Attivio,
101Data, EMC, SAS
Institute, Digital
Revolution Analytics.
Data Visualization
BI Applications
Vendors include
Datameer, ClickFox,
Platfora, Tableau
Software, Tresata,
Pentaho, QlikTech,
Technical Support
Hosting / BIG
Data as a Service
Vendors include
Tresata, Tidemark,
Think Big Analytics,
Amazon Web
Services, Accenture,
BIG Data Key Components
[Diagram: layered key components. File system (e.g. HDFS) and NoSQL database (e.g. HBase, Cassandra); MapReduce engine; Pig, Hive (DW), Cascading, Kerberos; ETL (Extract, Transform & Load) and modeling tools (CR-X); fast-loading analytic database (e.g. Greenplum); ISV packages (ClickFox, Merced, etc.)]
Components of Analytical Big-data Processing
Raw massive data: kept on cheap commodity machines/servers, which are organized
into nodes and clusters.
File systems such as the Hadoop Distributed File System (HDFS), which manages the
retrieval and storing of data and the metadata required for computation. Other file
systems or databases such as HBase (a NoSQL tabular store) or Cassandra (a NoSQL
eventually consistent key-value store) can also be used.
Computation engine: instead of writing in Java, higher-level languages such as Pig (part of
the Hadoop ecosystem) can be used, simplifying the writing of computations.
Data warehouse Layer: Hive is a Data Warehouse layer built on top of Hadoop,
developed by Facebook programmers for BIG Data Platform.
Cascading is a thin Java library that sits on top of Hadoop that allows suites of
MapReduce jobs to be run and managed as a unit. It is widely used to develop special
Semi-automated modeling tools such as CR-X allow models to be developed interactively
at great speed, and can help set up the database that will run the analytics.
Analytic database: specialized scale-out analytic databases such as Greenplum or
Netezza, with very fast loading, load and reload the data for the analytic models.
ISV big data analytical packages such as ClickFox and Merced run against the
database to help address the business issues.
Hadoop at a Glance
It is not advisable to dig out the hole for a pool using only an ice cream scooper;
you need a big tool, and Hadoop is one among them!
Apache Hadoop is an open-source project which was inspired by BIG Data research of
Hadoop is best available tool for processing and storing herculean amounts of big Data.
Hadoop throws thousands of computers at big data problem, rather than using single
In Hadoop parlance, a group of coordinated computers is called a cluster, and the individual
computers in the cluster are called nodes.
Hadoop makes data mining, analytics, and processing of big data cheap and fast when
compared with other toolsets.
Hadoop is cheap, fast, flexible & scales to large amounts of big data storage &
Hadoop: A big tool for BIG Data
Looking at the data explosion, the real issue is not acquiring large amounts of data or
storing that data; it is what you do with your BIG Data!
With BIG Data and BIG Data Analytics, it's possible to:
Analyze millions of SKUs to determine optimal prices that maximize profit and clear inventory.
Recalculate entire risk portfolios in minutes and understand future possibilities to mitigate risk.
Mine customer data for insights that drive new strategies for customer acquisition, retention, campaign etc.
Quickly identify customers who matter the most.
Generate retail coupons at the point of sale based on the customer's current and past purchases
Send tailored recommendations at just the right time, while customers are in the right location
Analyze data from social media to detect new market trends and changes in demand.
Use clickstream analysis and data mining to detect fraudulent behavior.
Determine root causes of failures, issues & defects by investigating user sessions, network logs & sensors.
The working principle behind all big data platforms is to move the query to the data to
be processed, not the data to the query processor.
It's time to stop driving the car by looking only in the rear-view mirror (traditional BI)
and to look a step ahead as well: predictive analytics, using the power of BIG Data,
helps the organization take the right decision at the right point in time.
In this chapter you have learned
BIG Data Landscape
BIG Data Key Components
Components of Analytical Big-data Processing
Hadoop at a glance
Module 2
Introduction to Apache
Hadoop and its
Lesson 1: The Motivation for Hadoop
Lesson 2: Hadoop: Concepts and Architecture
Lesson 3: Hadoop and the Data Warehouse: When and Where to use which
Lesson 4: Introducing Hadoop Eco system components
Module Objectives
At the end of this module, you will learn to:
Introduction to Apache Hadoop
The motivation for Hadoop
The Basic concept of Hadoop
Hadoop Architecture
Hadoop Distributed File System (HDFS) and MapReduce
Right usage and scenarios for Hadoop
Introduction to key Hadoop Eco System Projects
The Motivation for Hadoop
At the end of this lesson, you will learn to:
What problems exist with traditional large scale computing systems
What requirements an alternative approach should have
How Hadoop addresses those requirements
Lesson 1
Traditional Large Scale Computation
Traditionally, computation has been processor bound
Relatively small amounts of data
Significant amount of complex processing performed on that data
For decades, the primary push was to increase the computing
power of a
single machine
Faster processor, more RAM
Distributed systems evolved to allow
developers to use multiple machines
for a single job
MPI: Message Passing Interface
PVM: Parallel Virtual Machine
Distributed Systems: Problems
Programming for traditional distributed systems is complex
Data exchange requires synchronization
Finite bandwidth is available
Temporal dependencies are complicated
It is difficult to deal with partial failures of the system
Ken Arnold, CORBA designer:
Failure is the defining difference between distributed and local Programming, so
you have to design distributed systems with the expectation of failure
Developers spend more time designing for failure than they do actually working
on the problem itself
CORBA: Common Object Request Broker Architecture
Distributed Systems: Data Storage
Typically, data for a distributed system is stored on a SAN
At compute time, data is copied to the compute nodes
Fine for relatively limited amounts of data
The Data Driven World
Modern systems have to deal with far more data than was the case in the past
Organizations are generating huge amounts of data
That data has inherent value, and cannot be discarded
Facebook over 70PB of data
eBay over 5PB of data
Many organizations are generating data at a rate of terabytes per day
Data Becomes the Bottleneck
Moore's Law has held firm for over 40 years
Processing power doubles every two years
Processing speed is no longer the problem
Getting the data to the processors becomes the bottleneck
Quick calculation
Typical disk data transfer rate: 75MB/sec
Time taken to transfer 100GB of data to the processor: approx. 22 minutes!
Assuming sustained reads
Actual time will be worse, since most servers have less than 100GB of RAM available
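The quick calculation above can be reproduced with a few lines of Python. This is a minimal sketch, assuming decimal units (1 GB = 1000 MB) and sustained reads; the `disks` parameter is an addition for illustration, showing what happens when the same data is spread over many disks that are read in parallel.

```python
def transfer_minutes(data_gb, rate_mb_per_sec, disks=1):
    """Minutes needed to stream data_gb gigabytes at a sustained per-disk rate."""
    data_mb = data_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    seconds = data_mb / (rate_mb_per_sec * disks)
    return seconds / 60


# One disk at 75MB/sec, as on the slide
single = transfer_minutes(100, 75)
# Hypothetical: the same 100GB spread evenly over 100 disks read in parallel
parallel = transfer_minutes(100, 75, disks=100)

print(f"one disk: {single:.1f} minutes")
print(f"100 disks in parallel: {parallel * 60:.1f} seconds")
```

Reading from one disk takes about 22 minutes, while the same data read from 100 disks at once takes seconds, which is the motivation for distributing the data.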
A new approach is needed
Partial Failure Support
The system must support partial failure
Failure of a component should result in a graceful
degradation of application performance
Not complete failure of the entire system
Data Recoverability
If a component of the system fails, its workload should be assumed by still
functioning units in the system
Failure should not result in the loss of any data
Component Recovery
If a component of the system fails and then recovers, it should be able to rejoin
the system
Without requiring a full restart of the entire system
Component failures during execution of a job should not affect the outcome of
the job
Adding load to the system should result in a
graceful decline in performance of
individual jobs
Not failure of the system
Increasing resources should support a
proportional increase in load capacity
Hadoop's History
Hadoop is based on work done by Google in the late
1990s/early 2000s
Specifically, on papers describing the Google File System
(GFS) published in 2003, and MapReduce published in 2004
This work takes a radical new approach to the
problem of distributed computing
Meets all the requirements we have for reliability and
Core concept: distribute the data as it is initially stored
in the system
Individual nodes can work on data local to those nodes
No data transfer over the network is required for initial
Core Hadoop Concepts
Applications are written in high level code
Developers need not worry about network programming, temporal dependencies
or low-level infrastructure
Nodes talk to each other as little as possible
Developers should not write code which communicates between nodes
Shared nothing architecture
Data is spread among machines in advance
Computation happens where the data is stored, wherever possible
Data is replicated multiple times on the system for increased availability and
Hadoop: Very High-Level Overview
When data is loaded into the system, it is split into blocks
Typically 64MB or 128MB
Map tasks (the first part of the MapReduce system) work on relatively small
portions of data
Typically a single block
A master program allocates work to nodes such that a Map task will work on a
block of data stored locally on that node whenever possible
Many nodes work in parallel, each on their own part of the overall dataset
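The overview above can be sketched in Python. This is an illustration only, not Hadoop's scheduler: the function names, node names and the "free slots" model are assumptions made for the example. It splits a file into 64MB blocks and gives each Map task to a node that already holds the block whenever one has a free slot.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the size of each block; the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])


def assign_map_tasks(block_locations, free_slots):
    """Pick a node for each block, preferring nodes that store the block.

    block_locations: {block_id: [nodes holding a copy of the block]}
    free_slots:      {node: number of Map tasks the node can still accept}
    Returns {block_id: (node, is_data_local)}.
    """
    slots = dict(free_slots)
    assignment = {}
    for block_id, holders in sorted(block_locations.items()):
        local = [n for n in holders if slots.get(n, 0) > 0]
        candidates = local or [n for n in sorted(slots) if slots[n] > 0]
        if not candidates:
            raise RuntimeError("no free task slots in the cluster")
        node = candidates[0]
        slots[node] -= 1
        assignment[block_id] = (node, node in holders)
    return assignment


blocks = split_into_blocks(200)  # a hypothetical 200MB file
locations = {0: ["node1", "node2"], 1: ["node2", "node3"],
             2: ["node1", "node3"], 3: ["node2"]}
assignment = assign_map_tasks(
    locations, {"node1": 1, "node2": 1, "node3": 1, "node4": 1})

print(blocks)
for block_id, (node, is_local) in assignment.items():
    print(block_id, node, "local" if is_local else "remote")
```

In this run three of the four tasks are data-local; the fourth block has to be read over the network because the only node holding it is already busy.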
Fault Tolerance
If a node fails, the master will detect that failure and re-assign the work to a
different node on the system
Restarting a task does not require communication with nodes working on other
portions of the data
If a failed node restarts, it is automatically added back to the system and
assigned new tasks
If a node appears to be running slowly, the master can redundantly execute
another instance of the same task
Results from the first to finish will be used
Known as speculative execution
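The two behaviours above can be sketched as follows. This is a toy model under stated assumptions, not Hadoop's implementation: the function names, the round-robin re-assignment and the timing numbers are all hypothetical.

```python
def reassign_failed(assignment, failed_node, nodes):
    """Move every task of failed_node to the remaining nodes, round-robin.

    Tasks on healthy nodes are left untouched, so no coordination with
    nodes working on other portions of the data is needed.
    """
    healthy = [n for n in nodes if n != failed_node]
    if not healthy:
        raise RuntimeError("no healthy nodes left")
    result = {}
    moved = 0
    for task, node in assignment.items():
        if node == failed_node:
            result[task] = healthy[moved % len(healthy)]
            moved += 1
        else:
            result[task] = node
    return result


def finish_time(primary_duration, backup_start, backup_duration):
    """Completion time when a speculative copy is started at backup_start.

    Whichever instance finishes first provides the result.
    """
    return min(primary_duration, backup_start + backup_duration)


tasks = {"map-0": "node1", "map-1": "node2", "map-2": "node1"}
after_failure = reassign_failed(tasks, "node1", ["node1", "node2", "node3"])

slow_task = finish_time(primary_duration=120, backup_start=40, backup_duration=30)
normal_task = finish_time(primary_duration=50, backup_start=40, backup_duration=30)

print(after_failure)
print(slow_task, normal_task)
```

The slow task completes at 70 seconds instead of 120 because the backup copy wins; for the normal task the original instance still finishes first.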
In this chapter you have learned
What problems exist with traditional
large-scale computing systems
What requirements an alternative
approach should have
How Hadoop addresses those requirements
Hadoop: Concepts and Architecture
At the end of this lesson, you will learn to:
What Hadoop is all about
Hadoop Components
What features the Hadoop Distributed File System (HDFS) provides
HDFS Architecture
The concepts behind MapReduce
Lesson 2
The Hadoop Project
Hadoop is an open-source project overseen by the Apache Software Foundation
Originally based on papers published by Google in 2003 and 2004
Hadoop committers work at several different organizations
Including Yahoo!, Facebook, LinkedIn
Hadoop Components
Hadoop consists of two core components
The Hadoop Distributed File System (HDFS)
MapReduce
There are many other projects based around
core Hadoop
Often referred to as the Hadoop Ecosystem
Pig, Hive, HBase, Flume, Oozie, Sqoop etc
Many are discussed later in the course
A set of machines running HDFS and
MapReduce is known as a Hadoop Cluster
Individual machines are known as nodes
A cluster can have as few as one node, as many as
several thousands
More nodes = more storage and processing capacity and, generally, better performance
Hadoop Components: HDFS
HDFS, the Hadoop Distributed File System, is responsible for storing data on the
Data is split into blocks and distributed across multiple nodes in the cluster
Each block is typically 64MB or 128MB in size
Each block is replicated multiple times
Default is to replicate each block three times
Replicas are stored on different nodes
This ensures both reliability and availability
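A small sketch can show why replicas on different nodes give reliability and availability. The placement rule below is a plain round-robin chosen for the example; it is an assumption, and real HDFS uses its own rack-aware placement policy. Node names are hypothetical.

```python
def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """Replicas still reachable for each block after one node fails."""
    return {block: [n for n in holders if n != failed_node]
            for block, holders in placement.items()}


nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(3, nodes)
after_failure = surviving_replicas(placement, "node2")

for block, holders in placement.items():
    print(block, holders, "->", after_failure[block])
```

Because the three copies of a block never share a node, losing any single node still leaves at least two copies of every block.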
The Data File is
broken up into 64MB
or 128 MB blocks
The Data Blocks are
replicated 3 times and
scattered amongst the
Hadoop Components: MapReduce
MapReduce is the system used to
process data in the Hadoop cluster
Consists of two phases: Map, and
then Reduce
Between the two is a stage
known as the shuffle and sort
Each Map task operates on a discrete
portion of the overall dataset
Typically one HDFS block of
After all Maps are complete, the
MapReduce system distributes the
intermediate data to nodes which
perform the Reduce phase
Much more on this later!
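The three stages named above (Map, shuffle and sort, Reduce) can be imitated in a single Python process. This is a minimal sketch for intuition, using the classic word count; it is not the Hadoop API, and the function names and sample data are assumptions. Each inner list stands in for one HDFS block handled by one Map task.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Shuffle and sort: group all values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(blocks):
    intermediate = [pair
                    for block in blocks      # one Map task per block
                    for line in block
                    for pair in map_phase(line)]
    return [reduce_phase(key, values)
            for key, values in shuffle_and_sort(intermediate)]


blocks = [["the cat sat", "the dog"], ["the cat ran"]]
result = word_count(blocks)
print(result)
```

On a real cluster the Map calls run in parallel on different nodes, and the grouping step moves intermediate data across the network to the nodes that run the Reduce phase.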
HDFS Basic Concepts
HDFS is a filesystem written in Java
Based on Google's GFS
Sits on top of a native filesystem
Such as ext3, ext4 or xfs
Provides redundant storage for massive amounts of data
Using commodity (relatively low-cost) computers
HDFS Basic Concepts (Contd)
HDFS performs best with a modest number of large files
Millions, rather than billions, of files
Each file typically 100MB or more
Files in HDFS are write once
No random writes to files are allowed
HDFS is optimized for large, streaming reads of files
Rather than random reads
How Files Are Stored
Files are split into blocks
Each block is usually 64MB or 128MB
Data is distributed across many machines at load time
Different blocks from the same file will be stored on different
This provides for efficient MapReduce processing (see later)
Blocks are replicated across multiple machines, known as DataNodes
Default replication is threefold
Meaning that each block exists on three different machines
A master node called the NameNode keeps track of which blocks make up a file,
and where those blocks are located
Known as the metadata
How Files Are Stored. Example
NameNode holds metadata for
the two files (Foo.txt and Bar.txt)
DataNodes hold the actual blocks
Each block will be 64MB or 128MB
in size
Each block is replicated three times
on the cluster
More On The HDFS NameNode
The NameNode daemon must be running at all times
If the NameNode stops, the cluster becomes inaccessible
Your system administrator will take care to ensure that the NameNode hardware is
The NameNode holds all of its metadata in RAM for fast access
It keeps a record of changes on disk for crash recovery
A separate daemon known as the Secondary NameNode takes care of some
housekeeping tasks for the NameNode
Be careful: The Secondary NameNode is not a backup NameNode!
CDH4 introduces NameNode High Availability
NameNode is not a single point of failure
Features an Active and a Standby NameNode
HDFS: Points To Note
Although files are split into 64MB or 128MB blocks, if a file is smaller than this the full
64MB/128MB will not be used
Blocks are stored as standard files on the DataNodes, in a set of directories specified in Hadoop's
configuration files
This will be set by the system administrator
Without the metadata on the NameNode, there is no way to access the files in the HDFS cluster
When a client application wants to read a file:
It communicates with the NameNode to determine which blocks make up the file, and which DataNodes those
blocks reside on
It then communicates directly with the DataNodes to read the data
The NameNode will not be a bottleneck
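The read path described above can be sketched with two toy classes. This is an illustration under stated assumptions: the class and method names are hypothetical and are not Hadoop's Java API, and the block contents are made up. The point is that the NameNode only answers the metadata question, while the bytes come straight from the DataNodes.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self):
        self._files = {}  # path -> list of (block_id, [DataNode names])

    def add_file(self, path, blocks):
        self._files[path] = list(blocks)

    def get_block_locations(self, path):
        return self._files[path]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, namenode, live_datanodes):
    """Ask the NameNode for locations, then read each block from a DataNode."""
    data = []
    for block_id, holders in namenode.get_block_locations(path):
        for name in holders:
            node = live_datanodes.get(name)
            if node is not None and block_id in node.blocks:
                data.append(node.read_block(block_id))
                break
        else:
            raise OSError(f"no live replica for block {block_id}")
    return b"".join(data)


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"Hello, "
dn2.blocks["blk_1"] = b"Hello, "
dn2.blocks["blk_2"] = b"HDFS"

namenode = NameNode()
namenode.add_file("/user/demo/Foo.txt",
                  [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])])

# dn3 is missing from the live set, so blk_2 is served by dn2
content = read_file("/user/demo/Foo.txt", namenode, {"dn1": dn1, "dn2": dn2})
print(content)
```

Without the entry in the NameNode there would be no way to tell which blocks belong to the file, which matches the point about metadata above.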
Accessing HDFS
Applications can read and write HDFS files directly via the Java API
Covered later in the course
Typically, files are created on a local filesystem and must be moved into HDFS
Likewise, files stored in HDFS may need to be moved to a machine's local filesystem
Access to HDFS from the command line is achieved with the hadoop fs command
Hadoop fs Examples
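A few common hadoop fs invocations are listed below as argument lists inside a small Python helper. The paths and file names are hypothetical, and the helper itself is only a convenience for this sketch; when the hadoop command is not installed it just prints the commands.

```python
import shutil
import subprocess

# Hypothetical paths, for illustration only
EXAMPLES = [
    ["hadoop", "fs", "-ls", "/user/training"],                    # list a directory in HDFS
    ["hadoop", "fs", "-mkdir", "/user/training/input"],           # create a directory in HDFS
    ["hadoop", "fs", "-put", "foo.txt", "/user/training/input"],  # copy local file -> HDFS
    ["hadoop", "fs", "-cat", "/user/training/input/foo.txt"],     # print a file stored in HDFS
    ["hadoop", "fs", "-get", "/user/training/input/foo.txt", "copy.txt"],  # HDFS -> local
    ["hadoop", "fs", "-rm", "/user/training/input/foo.txt"],      # delete a file from HDFS
]


def run_examples(commands=EXAMPLES):
    """Run the commands if hadoop is on the PATH, otherwise only print them."""
    if shutil.which("hadoop") is None:
        for cmd in commands:
            print(" ".join(cmd))
        return False
    for cmd in commands:
        subprocess.run(cmd, check=True)
    return True
```

Calling run_examples() on a machine without Hadoop shows the command lines exactly as they would be typed in a shell.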
Hands-On Exercise: Using HDFS
Aside: The Training Virtual Machine
During this course, you will perform numerous Hands-On Exercises using the
Training Virtual Machine (VM)
The VM has Hadoop installed in pseudo-distributed mode
This essentially means that it is a cluster consisting of a single node
Using a pseudo-distributed cluster is the typical way to test your code before
you run it on your full cluster
It operates almost exactly like a real cluster
A key difference is that the data replication factor is set to 1, not 3
Hands-On Exercise: Using HDFS
In this Hands-On Exercise you will gain familiarity with manipulating files in HDFS
Please refer to the Hands-On Exercise Manual
What is MapReduce
MapReduce is a method for distributing a task across multiple nodes
Each node processes data stored on that node
Where possible
Consists of two phases: Map and Reduce
Features of MapReduce
Automatic parallelization and distribution
Status and monitoring tools
A clean abstraction for programmers
MapReduce programs are usually written in Java
Can be written in any language using Hadoop Streaming
All of Hadoop is written in Java
MapReduce abstracts all the housekeeping away from the developer
Developer can concentrate simply on writing the Map and Reduce functions
Giant Data: MapReduce and Hadoop
In 2010, Facebook sat on top of a mountain of data; just one year later it had grown from 21 to 30 petabytes. If you were to store all of this data on 1TB hard disks and stack them on top of one another, you would have a tower twice as high as the Empire State Building in New York.
Enterprises like Google and Facebook use the MapReduce approach to process petabyte-range volumes of data. For some analyses, it is an attractive alternative to SQL databases, and Apache Hadoop exists as an open source implementation.
MapReduce: Automatically Distributed
Processing and analyzing such data needs to take place in a distributed process on multiple machines. However, this kind of processing has always been very complex, and much time is spent solving recurring problems, like processing in parallel, distributing data to the compute nodes, and, in particular, handling errors during processing. To free developers from these repetitive tasks, Google introduced the MapReduce framework.
MapReduce Framework
The MapReduce framework breaks down data processing into map, shuffle,
and reduce phases. Processing is mainly in parallel on multiple compute nodes.
MapReduce: Map Phase
The Map Phase
The Shuffle Phase
The Reduce Phase
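The three phases can be mimicked on a single machine with plain Python. This is a minimal sketch, assuming a hypothetical data set of (store, sale amount) records; a real framework runs the map and reduce calls in parallel on many nodes.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    # Map phase: each input record yields zero or more (key, value) pairs
    intermediate = [pair for record in records for pair in mapper(record)]
    # Shuffle phase: group all values that share the same key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reducer call per key, in sorted key order
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


# Hypothetical input: (store, sale amount)
sales = [("NYC", 10), ("LA", 5), ("NYC", 7), ("LA", 1), ("SF", 4)]


def sale_mapper(record):
    store, amount = record
    yield store, amount


def max_reducer(store, amounts):
    yield store, max(amounts)


result = run_mapreduce(sales, sale_mapper, max_reducer)
print(result)   # largest sale per store
```

Only sale_mapper and max_reducer are specific to the problem; the rest is the kind of housekeeping that the framework takes over.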
MapReduce Programming Example: Search Engine
A web search engine is a good example for the use of MapReduce.
A set of MapReduce programs is used to implement the PageRank algorithm, which
Google uses to evaluate the relevance of a page on the web.
Map Method:
Reduce Method:
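As an illustration only, one simplified PageRank iteration can be written as a map step and a reduce step. This is not Google's implementation: the three-page link graph is hypothetical, the damping factor 0.85 is a commonly used value that does not come from the slides, and pages without outgoing links are ignored.

```python
from collections import defaultdict

DAMPING = 0.85   # assumption: commonly used damping factor


def pagerank_map(page, rank, outlinks):
    # Each page shares its current rank equally among the pages it links to
    for target in outlinks:
        yield target, rank / len(outlinks)


def pagerank_reduce(page, contributions, num_pages):
    # New rank = base share plus the damped sum of incoming contributions
    return page, (1 - DAMPING) / num_pages + DAMPING * sum(contributions)


def pagerank_iteration(links, ranks):
    grouped = defaultdict(list)              # stands in for the shuffle phase
    for page, outlinks in links.items():
        for target, share in pagerank_map(page, ranks[page], outlinks):
            grouped[target].append(share)
    return dict(pagerank_reduce(p, grouped[p], len(links)) for p in links)


# Hypothetical link graph: A links to B and C, B links to C, C links to A
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1 / len(links) for page in links}
for _ in range(20):                          # each iteration is one MapReduce job
    ranks = pagerank_iteration(links, ranks)
print(ranks)
```

Page C is linked to by both A and B, so it ends up with a higher rank than B, which only receives half of A's rank.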
Schematic process of a MapReduce computation
The use of a combiner
The use of a combiner makes sense for arithmetic operations in particular.
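The effect of a combiner can be seen in a small word-counting sketch. The function names are hypothetical; the combiner runs on the Mapper's node and sums counts locally, so fewer pairs have to cross the network.

```python
from collections import Counter


def map_words(line):
    # Mapper: emit (word, 1) for every word on the line
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Runs on the Mapper's node: sum counts locally before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)
print(len(mapper_output), "pairs without a combiner")
print(len(combined), "pairs with a combiner:", combined)
```

This works for sums because partial sums can be added together in any order. An operation such as an average cannot be combined this naively, since an average of averages is generally not the overall average.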
MapReduce: The Big Picture
The Five Hadoop Daemons
Hadoop consists of five separate daemons
NameNode
Holds the metadata for HDFS
Secondary NameNode
Performs housekeeping functions for the NameNode
Is not a backup or hot standby for the NameNode
DataNode
Stores actual HDFS data blocks
JobTracker
Manages MapReduce jobs, distributes individual tasks to machines running the TaskTracker
TaskTracker
Instantiates and monitors individual Map and Reduce tasks
The Five Hadoop Daemons (contd)
Each daemon runs in its own Java Virtual Machine (JVM)
No node on a real cluster will run all five daemons
Although this is technically possible
We can consider nodes to be in two different categories:
Master Nodes
Run the NameNode, Secondary NameNode, JobTracker daemons
Only one of each of these daemons runs on the cluster
Slave Nodes
Run the DataNode and TaskTracker daemons
A slave node will run both of these daemons
Basic Cluster Configuration
Basic Cluster Configuration (Contd)
On very small clusters, the NameNode, JobTracker and Secondary NameNode
can all reside on a single machine
It is typical to put them on separate machines as the cluster grows beyond 20-30 nodes
Each dotted box on the previous diagram represents a separate Java Virtual
Machine (JVM)
Submitting A Job
When a client submits a job, its configuration information is packaged into an
XML file
This file, along with the .jar file containing the actual program code, is handed
to the JobTracker
The JobTracker then parcels out individual tasks to TaskTracker nodes
When a TaskTracker receives a request to run a task, it instantiates a separate JVM for
that task
TaskTracker nodes can be configured to run multiple tasks at the same time
If the node has enough processing power and memory
MapReduce: The JobTracker
MapReduce jobs are controlled by a software daemon known as the JobTracker
The JobTracker resides on a master node
Clients submit MapReduce jobs to the JobTracker
The JobTracker assigns Map and Reduce tasks to other nodes on the cluster
These nodes each run a software daemon known as the TaskTracker
The TaskTracker is responsible for actually instantiating the Map or Reduce task, and reporting progress back to the JobTracker
MapReduce: Terminology
A job is a full program
A complete execution of Mappers and Reducers over a dataset
A task is the execution of a single Mapper or Reducer over a slice of data
A task attempt is a particular instance of an attempt to execute a task
There will be at least as many task attempts as there are tasks
If a task attempt fails, another will be started by the JobTracker
Speculative execution (see later) can also result in more task attempts than completed tasks
MapReduce: The Mapper
The Mapper may use or completely ignore the input key
For example, a standard pattern is to read a line of a file at a time
The key is the byte offset into the file at which the line starts
The value is the contents of the line itself
Typically the key is considered irrelevant
If the Mapper writes anything out, the output must be in the form of key/value pairs
Example Mapper: Upper Case Mapper
Example Mapper: Explode Mapper
Example Mapper: Filter Mapper
Example Mapper: Changing Keyspaces
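The four example Mappers named above could look roughly like this in Python. These are sketches of the general patterns, not the code from the original slides; in particular, the "even number" test in the filter Mapper is a hypothetical predicate.

```python
def upper_case_mapper(key, value):
    # Transform: emit the same pair with key and value upper-cased
    yield key.upper(), value.upper()


def explode_mapper(key, value):
    # One input pair becomes many output pairs, one per character
    for char in value:
        yield key, char


def filter_mapper(key, value):
    # Emit the pair only if it passes a test (here: value is an even number)
    if value % 2 == 0:
        yield key, value


def change_keyspace_mapper(key, value):
    # The output key differs from the input key: here, the length of the value
    yield len(value), value


print(list(upper_case_mapper("foo", "bar")))
print(list(explode_mapper("foo", "bar")))
print(list(filter_mapper("foo", 7)), list(filter_mapper("foo", 10)))
print(list(change_keyspace_mapper("foo", "bar")))
```

Each Mapper is a generator, which mirrors the rule that a Mapper may emit zero, one or many key/value pairs per input record.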
MapReduce: The Reducer
After the Map phase is over, all the intermediate values for a given
intermediate key are combined together into a list
This list is given to a Reducer
There may be a single Reducer, or multiple Reducers
This is specified as part of the job configuration (see later)
All values associated with a particular intermediate key are guaranteed to go
to the same Reducer
The intermediate keys, and their value lists, are passed to the Reducer in
sorted key order
This step is known as the shuffle and sort
The Reducer outputs zero or more final key/value pairs
These are written to HDFS
In practice, the Reducer usually emits a single key/value pair for
each input key
Example Reducer: Sum Reducer
Example Reducer: Identity Reducer
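The two example Reducers can be sketched in the same style. The function names and sample keys are hypothetical; each Reducer receives one intermediate key together with the list of all values for that key.

```python
def sum_reducer(key, values):
    # Emit a single pair per key: the sum of all its values
    yield key, sum(values)


def identity_reducer(key, values):
    # Emit every value unchanged, one pair per value
    for value in values:
        yield key, value


print(list(sum_reducer("bear", [9, 3, -17, 44])))
print(list(identity_reducer("bear", [9, 3])))
```

The sum Reducer follows the common case of one output pair per input key, while the identity Reducer shows that a Reducer may also emit several pairs.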
MapReduce Example: Word Count
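Word Count can be simulated end to end in plain Python. This is a single-machine sketch with hypothetical input lines, not Hadoop's Java API; the Mapper receives the byte offset as its key and ignores it, as described earlier.

```python
from itertools import groupby


def wc_mapper(offset, line):
    # The input key (byte offset of the line) is ignored
    for word in line.lower().split():
        yield word, 1


def wc_reducer(word, counts):
    yield word, sum(counts)


def word_count(lines):
    intermediate, offset = [], 0
    for line in lines:
        intermediate.extend(wc_mapper(offset, line))
        offset += len(line.encode("utf-8")) + 1   # +1 for the newline
    intermediate.sort()                           # shuffle and sort by key
    output = []
    for word, group in groupby(intermediate, key=lambda kv: kv[0]):
        output.extend(wc_reducer(word, (count for _, count in group)))
    return output


counts = word_count(["the cat sat on the mat", "the aardvark sat on the sofa"])
print(counts)
```

The sort followed by groupby plays the role of the shuffle and sort: every value for a given word reaches the same wc_reducer call, and the words arrive in sorted order.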
MapReduce: Data Locality
Whenever possible, Hadoop will attempt to ensure that a Map task on a node is
working on a block of data stored locally on that node via HDFS
If this is not possible, the Map task will have to transfer the data across the
network as it processes that data
Once the Map tasks have finished, data is then transferred across the network
to the Reducers
Although the Reducers may run on the same physical machines as the
Map tasks, there is no concept of data locality for the Reducers
All Mappers will, in general, have to communicate with all Reducers
MapReduce: Is Shuffle and Sort a Bottleneck?
It appears that the shuffle and sort phase is a bottleneck
The reduce method in the Reducers cannot start until all Mappers have finished
In practice, Hadoop will start to transfer data from Mappers to Reducers as the
Mappers finish work
This avoids a huge burst of data transfer starting only when the last
Mapper finishes
Note that this behavior is configurable
The developer can specify the percentage of Mappers which should finish before
Reducers start retrieving data
The developer's reduce method still does not start until all intermediate data has been
transferred and sorted
MapReduce: Is a Slow Mapper a Bottleneck?
It is possible for one Map task to run more slowly than the others
Perhaps due to faulty hardware, or just a very slow machine
It would appear that this would create a bottleneck
The reduce method in the Reducer cannot start until every Mapper has finished
Hadoop uses speculative execution to mitigate this
If a Mapper appears to be running significantly more slowly than the others, a new
instance of the Mapper will be started on another machine, operating on the same data
The results of the first Mapper to finish will be used
Hadoop will kill off the Mapper which is still running
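The benefit of speculative execution can be estimated with a toy calculation. The rule below (start a backup when a task takes more than 1.5 times the median, and assume the backup runs at median speed) is an assumption made for this sketch; Hadoop's real heuristics are different and more involved.

```python
from statistics import median


def finish_times(durations, slow_factor=1.5):
    """Per-task finish time under a simple speculative-execution rule."""
    typical = median(durations)
    results = []
    for d in durations:
        if d > slow_factor * typical:
            # The backup attempt starts once the slowness is noticed (assumed:
            # after `typical` seconds) and runs at typical speed elsewhere
            backup_finish = typical + typical
            results.append(min(d, backup_finish))   # first attempt to finish wins
        else:
            results.append(d)
    return results


durations = [10, 11, 9, 60]            # hypothetical seconds; one straggler
with_speculation = finish_times(durations)
print(max(durations), "seconds until the Map phase ends without speculation")
print(max(with_speculation), "seconds with speculation")
```

Since the reduce method cannot start until the slowest Mapper has finished, shortening the straggler shortens the whole job.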
Hands-On Exercise: Running A MapReduce Job
In this Hands-On Exercise, you will run a MapReduce job on your pseudo-
distributed Hadoop cluster
Please refer to the Hands-On Exercise Manual
In this chapter you have learned
What Hadoop is all about
What the components of Hadoop are
Concept and detailed architecture of
What features the Hadoop Distributed File System (HDFS) provides
The concepts behind MapReduce
A few illustrations of MapReduce and how it works in practice
Hadoop and the Data Warehouse: When and Where to Use Which
At the end of this lesson, you will learn to:
Answer the question: when should I use Hadoop, and when should I put
the data into a data warehouse?
Hadoop Differentiators
Data Warehouse Differentiators
When and where to use which?
Lesson 3
Hadoop and the Data Warehouse
Figure 1. Before: Data flow of meter reading done manually
Figure 2. After: Meter reading every 5 or 60 minutes via smart meters
Hadoop Differentiators
Hadoop is the repository and refinery for
raw data.
Hadoop is a powerful, economical and
active archive.
Data Warehouse Differentiators
Data warehouse performance
Integrated data that provides business value
Interactive BI tools for end users
While there are certain use cases that are distinct to Hadoop or the data
warehouse, there is also overlap where either technology could be effective. The
following table is a good starting place for helping to decide which platform to use
based on your requirements.
When and Where to Use Which
In this chapter you have learned
The answer to the question: when should I use Hadoop, and when should I put the data into a data warehouse?
The Hadoop and data warehouse differentiators
When and where to use which
Introducing Hadoop Ecosystem Components
At the end of this lesson, you will learn to:
A quick overview of a few key Hadoop ecosystem projects, such as Hive,
Pig, Flume, Sqoop, Oozie, and HBase. The details of each one, with demos
and hands-on exercises, will be covered in separate modules.
Lesson 4
Other Ecosystem Projects: Introduction
The term Hadoop core refers to HDFS and MapReduce
Many other projects exist which use Hadoop core
Either both HDFS and MapReduce, or just HDFS
Most are Apache projects or Apache Incubator projects
Some others are not hosted by the Apache Software Foundation
These are often hosted on GitHub or a similar repository
We will investigate many of these projects later in the course
Following is an introduction to some of the most significant projects
Hive is an abstraction on top of MapReduce
Allows users to query data in the Hadoop cluster without knowing Java or MapReduce
Uses the HiveQL language
Very similar to SQL
The Hive Interpreter runs on a client machine
Turns HiveQL queries into MapReduce jobs
Submits those jobs to the cluster
Note: this does not turn the cluster into a relational database server
It is still simply running MapReduce jobs
Those jobs are created by the Hive Interpreter
Hive (contd)
Pig is an alternative abstraction on top of MapReduce
Uses a dataflow scripting language
Called PigLatin
The Pig interpreter runs on the client machine
Takes the PigLatin script and turns it into a series of MapReduce jobs
Submits those jobs to the cluster
As with Hive, nothing magical happens on the cluster
It is still simply running MapReduce jobs
Pig (Contd)
Flume provides a method to import data
into HDFS as it is generated
Rather than batch-processing the data later
For example, log files from a Web server
A high level diagram
Sqoop provides a method to import data from tables in a
relational database into HDFS
Does this very efficiently via a Map-only MapReduce job
Can also go the other way
Populate database tables from files in HDFS
We will investigate Sqoop later in the course.
Oozie allows developers to create a workflow of MapReduce jobs
Including dependencies between jobs
The Oozie server submits the jobs to the cluster in the correct sequence
We will investigate Oozie later in the course
HBase is the Hadoop database
A NoSQL datastore
Can store massive amounts of data
Gigabytes, terabytes, and even petabytes of data in a table
Scales to provide very high write throughput
Hundreds of thousands of inserts per second
Copes well with sparse data
Tables can have many thousands of columns
Even if most columns are empty for any given row
Has a very constrained access model
Insert a row, retrieve a row, do a full or partial table scan
Only one column (the row key) is indexed
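The constrained access model can be pictured as a sorted map keyed by row key. The toy class below is not the HBase client API; its name, methods and sample rows are hypothetical and only illustrate insert, retrieve and scan on sparse rows.

```python
import bisect


class TinyTable:
    """Toy sorted key-value table; each row is a sparse dict of column -> value."""

    def __init__(self):
        self._keys = []    # row keys kept sorted: the only "index"
        self._rows = {}

    def put(self, row_key, columns):
        # Insert a row, or add columns to an existing row
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key].update(columns)

    def get(self, row_key):
        # Retrieve a single row by its key (None if absent)
        return self._rows.get(row_key)

    def scan(self, start=None, stop=None):
        # Full scan, or partial scan over the key range [start, stop)
        lo = 0 if start is None else bisect.bisect_left(self._keys, start)
        hi = len(self._keys) if stop is None else bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


table = TinyTable()
table.put("user#002", {"info:name": "Bob"})
table.put("user#001", {"info:name": "Ann", "info:city": "Pune"})
table.put("visit#001", {"info:url": "/home"})

users = table.scan(start="user#", stop="user$")   # partial scan by key range
print(users)
```

Rows store only the columns they actually have, which is why sparse data is cheap. Finding rows by anything other than the row key would require a full scan, since no other column is indexed.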
HBase vs Traditional RDBMSs
In this chapter you have learned
The different Hadoop ecosystem projects, namely
HBase (a Hadoop datastore)