Big Data, Cloud Computing, & CDN Emerging Technologies PDF

Cloud Computing
Cloud Introduction
1
Cloud Computing
What does Cloud Computing do?
Provides online data storage
Enables configuration and accessing of online applications
Provides a variety of software usage
Provides computing platform and computing infrastructure
2
Cloud Computing
Application Example
Using Gmail on my smartphone to check e-mails
Receive an e-mail with an MS PowerPoint attachment file
However, MS PowerPoint and Windows OS are not installed on my smartphone!
Google Drive services (Google Docs, Sheets, and Slides) can be used to open the file
3
Cloud Computing
What is a Cloud?
Cloud can provide services through a public or private network or the Internet, where the service hosting system is at a remote location
Cloud can support various applications
E-mail, Web Conferencing, Games, Database Management, CRM (Customer Relationship Management), etc.
4
Cloud Computing
Cloud Models
5
Cloud Computing
Cloud Models
Public Cloud
Enables public systems and service access
Open architecture (e.g., e-mail)
Could be less secure due to openness
Private Cloud
Enables service access within an organization
Due to its private nature, it is more secure
6
Cloud Computing
Cloud Models
Community Cloud
Cloud accessible by a group of organizations
Hybrid Cloud
Hybrid Cloud = Public Cloud + Private Cloud
Private cloud supports critical activities
Public cloud supports non-critical activities
7
Cloud Computing
Cloud Service Models

The lower service model supports the management, computing power, and security of its upper service model
SaaS: Software as a Service
PaaS: Platform as a Service
IaaS: Infrastructure as a Service
8
Cloud Computing
Software as a Service (SaaS)
Provides a variety of software applications as a service to end users
Platform as a Service (PaaS)
Provides a program execution platform for applications, development tools, etc.
Infrastructure as a Service (IaaS)
Provides the fundamental computing and security resources for the entire cloud
Backup storage, computing power, VM (Virtual Machines), etc.
9
Cloud Computing

There are many other service models
XaaS = Anything as a Service
NaaS: N for Network as a Service
DaaS: D for Database as a Service
BaaS: B for Business as a Service
etc.
10
Cloud Computing
Cloud Benefits
11
Cloud Computing
Characteristics
12
Cloud Computing
REFERENCES
13
References
K. Kumar and Y. H. Lu, "Cloud Computing for Mobile Users: Can Offloading Computation Save Energy?," Computer, vol. 43, no. 4, pp. 51-56, Apr. 2010.
Wikipedia, http://www.wikipedia.org
Apple, iCloud, https://www.icloud.com
Google, Google Cloud, https://cloud.google.com/products [Accessed June 1, 2015]
Virtualization, Cisco's IaaS cloud, http://www.virtualization.co.kr/data/file/01_2/1889266503_6f489654_1.jpg [Accessed June 1, 2015]
Tutorialspoint, Cloud computing, http://www.tutorialspoint.com/cloud_computing/cloud_computing_tutorial.pdf
14
References
Image sources
AWS Simple Icons Storage Amazon S3 Bucket with Objects, By Amazon Web
Services LLC [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via
Wikimedia Commons
iCloud Logo, By EEIM (Own work) [Public domain], via Wikimedia Commons
MobileMe Logo, By Apple Inc. [Public domain], via Wikimedia Commons
15
IaaS
IaaS (Infrastructure as a Service)
Infrastructure support over the Internet

Cloud's Computing & Storage Resources
Computing Power
Storage Services
Software Packages & Bundles
VLAN (Virtual Local Area Network)
VM (Virtual Machine) Features
18
IaaS
VM (Virtual Machine) Administration

IaaS enables control of computing resources through
Administrative Access to VMs
Server Virtualization features
Access to computing resources is enabled by
Administrative Access to VMs
VM Administrative Command examples
Save data on cloud server
Start web server
Install new application
19
IaaS
IaaS Procedures
20
IaaS
IaaS Benefits
Flexible and Efficient Renting of Computer & Server Hardware
Rentable Resources
VM, Storage, Bandwidth, IP Addresses, Monitoring Services, Firewalls, etc.
Rent Payment Basis
Resource type
Usage time
Service packages
21
IaaS
IaaS Benefits
Portability & Interoperability with Legacy Applications
Enables portability based on infrastructure resources that are used through Internet connections
Enables a method to maintain interoperability with legacy applications and workloads between IaaS clouds
22
PaaS
PaaS (Platform as a Service)
Provides development & deployment tools for application development
Provides runtime environment for apps
23
Cloud Services
PaaS Types
Application Delivery-Only Environment
Stand-Alone Development Environment
Open Platform as a Service
Add-On Development Facilities
24
PaaS
PaaS Types
Application Delivery-Only Environment
Provides on-demand scaling & application security
Stand-Alone Development Environment
Provides an independent platform for a specific function
Open Platform as a Service
Provides open source software to run applications for
PaaS providers
Add-On Development Facilities
Enables customization to the existing SaaS platforms
25
PaaS
PaaS Benefits
26
PaaS
Benefits
Lower Administrative Overhead

User does not need to be involved in any
administration of the platform
Lower Total Cost of Ownership

User does not need to purchase any hardware,
memory, or server
27
PaaS
Benefits
Scalable Solutions
Automatic resource scaling based on application resource demand
More Current System Software

Cloud provider needs to maintain software
upgrades & patch installations
28
SaaS
SaaS (Software as a Service)

Provides software applications as a service to the user
Software that is deployed on a cloud server and is accessible through the Internet
29
SaaS
Characteristics
On Demand Availability
Cloud software is available anywhere that the
cloud is reachable via Internet
Easy Maintenance
No user software upgrade or maintenance needed
All supported by the cloud
Flexible Scale Up or Scale Down
Centralized Management & Data
30
SaaS
Characteristics
Enables a Shared Data Model
Multiple users can share a single
data model and database
Cost Effectiveness
Pay based on usage
No risk in buying the wrong software
Multitenant Programming Solutions
Multiple programmers are guaranteed to use the same software version
No version mismatch problems
31
Software-as-a-service
Open SaaS
Applications
32
Cloud Computing
Cloud Services
36
Cloud Services
Google Cloud
Google App Engine
Released as a preview in April 2008
PaaS (Platform as a Service) for web applications
Provides automatic scaling based on resource
demands and server load
Google Cloud Storage

Launched in May 2010
Online file storage service
37
Cloud Services
Google Cloud
Google BigQuery
Released in April 2012
Data analysis tool that uses SQL-like queries to
process big datasets in seconds
Google Compute Engine

Released in June 2012
IaaS (Infrastructure as a Service) support
to enable on demand launching of VMs (Virtual
Machines)
38
Cloud Services
Google Cloud
Google Cloud Endpoints
Released in November 2013
Tool to create services inside App Engine
Easily connects from Android, iOS, and JavaScript
clients
Google Cloud DNS (Domain Name System)

DNS service supported by the Google Cloud
39
Cloud Services
Google Cloud
Google Cloud Datastore

NoSQL (No Structured Query Language) data storage
Google Cloud SQL (Structured Query Language)

Released in February 2014
as GA (General Availability)
Fully managed MySQL database
40
Cloud Services
Amazon S3 (Simple Storage Service)

Online file storage web service offered by Amazon Web
Services
Public web service released in the United States in March
2006 and in Europe in November 2007
Provides storage through
web services interfaces
(REST, SOAP, and BitTorrent)
41
Cloud Services
Amazon Cloud Drive

Amazon Cloud Drive was released in
March 2011
Web storage application from Amazon
Storage Space Characteristics
Can be accessed from up to eight specific devices (e.g.,
mobile devices & different computers) and by using
different browsers on the same computer
42
Cloud Services
Amazon Cloud Drive
Cloud Player (Originally bundled)
Users can play music in their Cloud Drive from any

computer or Android device
Music browsing based on song titles, albums, artists,

genres (website only), and playlists
43
Cloud Services
Amazon Cloud Drive Options
Unlimited Photos
Unlimited storage for photos & raw data files
5 gigabytes of video storage
Unlimited Everything
Unlimited storage for photos, videos, documents, and
various files types
44
Cloud Services
iCloud
Developed by Apple, Inc.

Public release in October 2011
Cloud Storage & Cloud Computing
Operating system
OS X (10.7 Lion or later)
Microsoft Windows 7 or later
iOS 5 or later
45
Cloud Services
iCloud replaces MobileMe

MobileMe was a subscription-based collection of Apple's online services and software
MobileMe was replaced by iCloud
MobileMe ceased services in June 2012
MobileMe users were allowed to transfer to iCloud until July 2012
46
Cloud Services
iCloud Features
Email, Contacts, and Calendars
Find My Friends
Backup & Restore
Backup feature for device settings & data
iOS 5 or later required
Find My iPhone
Enables a user to track the location of an iOS device or
Mac
Formerly a feature of MobileMe
47
Cloud Services
iCloud Features
Can manage lost or stolen Apple devices

Back to My Mac
Enables remote log in to other computers that have
Back to My Mac installed (using the same Apple ID)
iWork for iCloud

Apple's iWork suite (Pages, Numbers, and Keynote)
made available on a web interface
48
Cloud Services
iCloud Features
Photo Stream
Can store most recent 1,000 photos
Free storage for up to 30 days
iCloud Photo Library

Stores all photos at original resolution
Stores photo metadata
Storage (Introduced in 2011)

5 GB of free storage per account
49
Cloud Services
iCloud Features
iCloud Drive
Can save photos, videos, documents, and apps
iCloud Keychain
Secure database for website and Wi-Fi passwords
Secure credit card & debit card management for quick access and auto-fill
50
Cloud Services
iCloud Features
iTunes Match
Scans the iTunes music library and matches its tracks
Serves tracks copied from CDs or other sources
51
Big Data
Big Data Examples
55
Big Data
New Flu Virus Starts in the U.S.!
H1N1 flu virus (which combined virus elements of the bird and swine (pig) flu) started to spread in the U.S. in 2009
U.S. CDC (Centers for Disease Control and Prevention) was only collecting diagnostic data from medical doctors once a week
Using the CDC information to find how the flu was spreading would have an approximate 2-week lag, which is far too slow compared to the speed of the virus spreading
56
Big Data
What vaccine was needed?
How much vaccine was needed?
Where was the vaccine needed?
Vaccine preparation and delivery plans could not be set up fast enough to safely prevent the virus from spreading out of control
57
Big Data

Fortunately, Google published a paper about how they could predict the spread of the winter flu in the U.S. accurately down to specific regions and states
This paper was published in the journal Nature a few weeks before the H1N1 virus made the headline news
58
Big Data

Millions of the most common search terms and millions of different mathematical models were tested on Google's database
Google receives more than 3 billion search queries a day
Analysis system was set to look for correlation between the frequency of certain search queries and the spread of the flu over time and space
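The correlation search described on this slide can be illustrated with a small sketch. The weekly counts below are made-up illustrative values (not Google or CDC data), and `pearson` is a hypothetical helper name; the real system tested far more terms and models than this toy ranking does.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly flu cases for one region (illustrative values only).
flu_cases = [10, 20, 40, 80, 60, 30]

# Hypothetical weekly search counts for two terms in the same region.
query_counts = {
    "flu symptoms": [100, 210, 390, 820, 590, 310],    # follows the outbreak
    "pizza delivery": [500, 480, 510, 495, 505, 490],  # unrelated term
}

# Rank search terms by how strongly their frequency follows the flu curve.
ranked = sorted(
    query_counts,
    key=lambda term: pearson(query_counts[term], flu_cases),
    reverse=True,
)
```

Terms at the top of `ranked` would be the candidates kept as flu indicators; a real analysis would also need held-out data to guard against coincidental correlations.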
59
Big Data
Google's method of analysis did not use data provided from hospitals or medical doctors
Google used Big Data analysis on the most common search terms people use
Google's system proved to be more accurate and faster than analyzing government statistics
60
Big Data
Wal-Mart
Wal-Mart's Data Warehouse
Stores 4 petabytes (4 × 10^15 bytes) of data
Records every single purchase
Approximately 267 million transactions a day from 6,000 stores worldwide are recorded
61
Big Data
Wal-Mart
Wal-Mart's Data Analysis
Focused on evaluating the effectiveness of pricing strategies and advertising campaigns
Seeking improvements in inventory management and supply chains
62
Big Data
Recommendation System using Big Data
Based on data analysis of simple elements
What users purchased in the past
Which items they have in their virtual shopping cart
Which items customers rated and liked
What influence the ratings had on other customers' purchases
63
Big Data
Amazon.com
Amazon.com's Recommendation System
Item-to-Item Collaborative Filtering Algorithm
Personalization of the Online Store
Customized to each customer
Each customer's store is based on the customer's personal interest
Example: For a new mother, the store will display baby supplies and toys
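The idea behind item-to-item collaborative filtering can be sketched in a few lines. This is a simplified illustration, not Amazon.com's production algorithm: the purchase data is hypothetical, and cosine similarity over "who bought what" is assumed here as the similarity measure.

```python
from math import sqrt
from collections import defaultdict

# Hypothetical purchase history: customer -> set of purchased items.
purchases = {
    "alice": {"diapers", "baby wipes", "rattle"},
    "bob": {"diapers", "baby wipes"},
    "carol": {"diapers", "rattle"},
    "dave": {"laptop", "mouse"},
}

def item_similarity(purchases):
    """Cosine similarity between items, based on which customers bought both."""
    buyers = defaultdict(set)  # item -> customers who bought it
    for customer, items in purchases.items():
        for item in items:
            buyers[item].add(customer)
    sim = defaultdict(dict)
    for a in buyers:
        for b in buyers:
            if a != b:
                common = len(buyers[a] & buyers[b])
                if common:
                    sim[a][b] = common / sqrt(len(buyers[a]) * len(buyers[b]))
    return sim

def recommend(customer, purchases, sim, top_n=2):
    """Score items the customer does not own by similarity to items they do own."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for item in owned:
        for other, s in sim.get(item, {}).items():
            if other not in owned:
                scores[other] += s
    return sorted(scores, key=lambda i: (-scores[i], i))[:top_n]

sim = item_similarity(purchases)
```

Because the similarity table is computed per item rather than per customer, it can be built offline and looked up quickly when a customer's page is rendered.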
64
Big Data
Citibank
Bank operations in 100 countries
Big Data analysis on the database of basic financial
transactions can enable Global insight on
investments, market changes, trade patterns, and
economic conditions
Many companies (e.g., Zara, H&M, etc.) work with
Citibank to locate new stores and factories
65
Big Data
Product Development & Sales

For example, a Smartphone takes significant time
and money to manufacture
In addition, the duration of popularity for a new
Smartphone is limited
To maximize sales, a company needs to manufacture
just the right amount of products and sell them in the
right locations
66
Big Data
Product Development & Sales

Too much will result in leftovers and a big waste for the company!
Too little will result in a lost opportunity for company profit and growth!
Big Data analysis can help find how many smartphones could be sold and where the products could be popular, based on common search terms that people use
Use this to also estimate how many products could be sold in a certain location
But why is this difficult?
67
Big Data
REFERENCES
68
References
V. Mayer-Schönberger and K. Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt, 2013.
T. White, Hadoop: The Definitive Guide. O'Reilly Media, 2012.
J. Venner, Pro Hadoop. Apress, 2009.
S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins, and N. Kruschwitz, "Big Data, Analytics and the Path From Insights to Value," MIT Sloan Management Review, vol. 52, no. 2, Winter 2011.
R. E. Bryant, R. H. Katz, and E. D. Lazowska, "Big-Data Computing: Creating Revolutionary Breakthroughs in Commerce, Science and Society," Computing Community Consortium, pp. 1-15, Dec. 2008.
G. Linden, B. Smith, and J. York, "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," IEEE Internet Computing, vol. 7, no. 1, pp. 76-80, Jan./Feb. 2003.
69
References
J. R. Galbraith, "Organizational Design Challenges Resulting From Big Data," Journal of Organization Design, vol. 3, no. 1, pp. 2-13, Apr. 2014.
S. Sagiroglu and D. Sinanc, "Big Data: A Review," Proc. IEEE International Conference on Collaboration Technologies and Systems, pp. 42-47, May 2013.
M. Chen, S. Mao, and Y. Liu, "Big Data: A Survey," Mobile Networks and Applications, vol. 19, no. 2, pp. 171-209, Jan. 2014.
X. Wu, X. Zhu, G. Q. Wu, and W. Ding, "Data Mining with Big Data," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97-107, Jan. 2014.
Z. Zheng, J. Zhu, and M. R. Lyu, "Service-Generated Big Data and Big Data-as-a-Service: An Overview," Proc. IEEE International Congress on Big Data, pp. 403-410, Jun./Jul. 2013.
70
References
I. Palit and C. K. Reddy, "Scalable and Parallel Boosting with MapReduce," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 10, pp. 1904-1916, 2012.
M.-Y. Choi, E.-A. Cho, D.-H. Park, C.-J. Moon, and D.-K. Baik, "A Database Synchronization Algorithm for Mobile Devices," IEEE Transactions on Consumer Electronics, vol. 56, no. 2, pp. 392-398, May 2010.
IBM, What is big data?, http://www.ibm.com/software/data/bigdata/what-is-big-data.html [Accessed June 1, 2015]
Hadoop Apache, http://hadoop.apache.org
Image sources
Walmart Logo, By Walmart [Public domain], via Wikimedia Commons
Amazon Logo, By Balajimuthazhagan (Own work) [CC BY-SA 3.0
(http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
71
Big Data
Big Data's 4 Vs
72
Big Data
Big Data's 4 V Big Challenges
Volume: Data Size
Variety: Data Formats
Velocity: Data Streaming Speeds
Veracity: Data Trustworthiness
73
Big Data
Volume: Data Size
40 Zettabytes (10^21 bytes) of data is predicted to be created by 2020
2.5 Quintillion (10^18) bytes of data are created every day
6 Billion (10^9) people have mobile phones
100 Terabytes (10^12 bytes) of data (at least) is stored by most U.S. companies
966 Petabytes (10^15 bytes) was the approximate storage size of the American manufacturing industry in 2009
74
Big Data
150 Exabytes (10^18 bytes) was the estimated size of data for health care throughout the world in 2011
More than 4 Billion (10^9) hours each month are used in watching YouTube
30 Billion pieces of content are exchanged every month on Facebook
200 Million monthly active users exchange 400 Million tweets every day
75
Big Data

1 Terabyte (10^12 bytes) of trade information is exchanged during every trading session at the New York Stock Exchange
100 sensors (approximately) are installed in modern cars to monitor fuel level, tire pressure, etc.
18.9 Billion network connections are predicted to exist by 2016
76
Big Data
1 out of 3 business leaders has experienced trust issues with their data when trying to make a business decision
$3.1 Trillion (10^12) a year is estimated to be wasted in the U.S. economy due to poor data quality
77
Big Data
New technology is needed to overcome these 4 V Big Data Challenges
Volume: Data Size
78
Big Data
HADOOP
83
Hadoop
Data Storage, Access, and Analysis
Hard drive storage capacity has increased tremendously
But the data read and write speeds to and from the hard drives have not improved significantly
Simultaneous parallel read and write of data with multiple hard disks requires advanced technology
84
Hadoop

Challenge 1: Hardware Failure
When using many computers for data storage and analysis, the probability that at least one computer will fail is very high
Challenge 2: Cost
To avoid losing data or computed analysis results, backup computers and memory are needed, which improves reliability but is very expensive
85
Hadoop

Challenge 3: Combining Analyzed Data
Combining the analyzed data is very difficult
If one part of the analyzed data is not ready, then the

overall combining process has to be delayed
If one part has errors in its analysis, then the overall

combined result may be unreliable and useless
86
Hadoop
Hadoop
Hadoop is a Reliable Shared Storage and Analysis System
Hadoop = HDFS + MapReduce + ...
HDFS provides Data Storage
HDFS: Hadoop Distributed FileSystem
MapReduce provides Data Analysis
MapReduce = Map Function + Reduce Function
87
Hadoop
DFS (Distributed FileSystem) is designed for storage

management of a network of computers
HDFS is optimized to store huge files with streaming

data access patterns
HDFS is designed to run on clusters of general

computers
88
Hadoop

HDFS was designed to be optimal in performance
for a WORM (Write Once, Read Many times) pattern,
which is a very efficient data processing pattern
HDFS was designed considering the time to read the

whole dataset to be more important than the time
required to read the first record
89
Hadoop
HDFS
HDFS clusters use 2 types of nodes
Namenode (master node)
Datanode (worker node)
90
Hadoop
HDFS: Namenode
Manages the filesystem namespace
Maintains the filesystem tree and the metadata for all the
files and directories in the tree
Stores this information on the local disk in 2 files
Namespace Image
Edit Log
91
Hadoop
HDFS: Datanodes
Workhorse of the filesystem
Store and retrieve blocks when requested by the

client or the namenode
Report back to the namenode periodically with lists

of blocks that were stored
92
Hadoop
MapReduce
MapReduce is a programming model that abstracts the analysis problem from stored data
MapReduce transforms the analysis problem into a computation process that uses a set of keys and values
93
Hadoop
MapReduce System Architecture
MapReduce was designed for tasks that consume

several minutes or hours on a set of dedicated trusted
computers connected with a broadband high-speed
network managed by a single master data center
94
Hadoop
MapReduce Characteristics
MapReduce uses a somewhat brute-force data analysis

approach
The entire dataset (or a big part of the dataset) is

processed for every query
Batch Query Processor model
95
Hadoop
MapReduce makes it possible to run an ad hoc query against the whole dataset within a reasonable time
Many distributed systems combine data from multiple sources (which is very difficult), but MapReduce does this in a very effective and efficient way
96
Hadoop
Technical Terms used in MapReduce
Seek Time is the delay in moving the disk head to the location on the disk where the data is read or written
Transfer Rate is the speed at which data is streamed to or from the disk
Transfer Rate has improved significantly more (i.e., now has much faster transfer speeds) compared to improvements in Seek Time (i.e., still relatively slow)
97
Hadoop
MapReduce
MapReduce gains performance enhancement through optimal balancing of Seeking and Transfer operations
Reduce Seek operations
Effectively use Transfer operations
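A back-of-the-envelope sketch shows why balancing seeks against transfers matters. All numbers here (10 ms seek time, 100 MB/s transfer rate, a 1 TB dataset of 100-byte records) are assumptions chosen for illustration, not measurements from the lecture, and the cost of writing data back is ignored.

```python
SEEK_TIME_S = 0.010         # assumed average seek time: 10 ms
TRANSFER_RATE_BPS = 100e6   # assumed sustained transfer rate: 100 MB/s

DATASET_BYTES = 1e12        # assumed dataset size: 1 TB
RECORD_BYTES = 100          # assumed record size: 100 bytes
TOTAL_RECORDS = DATASET_BYTES / RECORD_BYTES

def seek_dominated_time(num_records):
    """Time to visit records one by one, paying one seek per record."""
    return num_records * SEEK_TIME_S

def transfer_dominated_time(dataset_bytes):
    """Time to stream the whole dataset sequentially after a single seek."""
    return SEEK_TIME_S + dataset_bytes / TRANSFER_RATE_BPS

def cheaper_strategy(fraction_updated):
    """Return which access pattern is faster for a given fraction of records touched."""
    seeks = seek_dominated_time(TOTAL_RECORDS * fraction_updated)
    stream = transfer_dominated_time(DATASET_BYTES)
    return "seek" if seeks < stream else "stream"
```

Under these assumed numbers, touching only a tiny fraction of the records favors seeking, while touching even 1% of them is slower than streaming the entire dataset once.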
In the next lecture, we will compare MapReduce with a traditional RDBMS (Relational Database Management System)
98
Big Data
MapReduce vs.
RDBMS
103
Hadoop
MapReduce vs. RDBMS

RDBMS (Relational Database Management System) Characteristics
RDBMS is good for updating a small proportion of a big database
RDBMS uses a traditional B-Tree, which is highly dependent on the time required to perform seek operations
104
Hadoop
MapReduce vs. RDBMS
MapReduce is good for updating all (or a majority) of

a big database
MapReduce uses Sort and Merge to rebuild the

database, which depends more on transfer
operations
105
Hadoop
MapReduce vs. RDBMS
RDBMS is good for applications that require the

datasets of the database to be very frequently updated
(e.g., point queries or small dataset updates)
MapReduce is better for WORM (Write Once and Read
Many times) based data applications
MapReduce is a complementary system to RDBMS
106
Hadoop
MapReduce vs. RDBMS

                 RDBMS                      MapReduce
Data Size        Gigabytes (10^9)           Petabytes (10^15)
Access           Interactive & Batch        Batch
Updates          Read & Write Many Times    WORM (Write Once, Read Many Times)
Data Structure   Static Schema              Dynamic Schema
Integrity        High                       Low
Scalability      Nonlinear                  Linear
107
Hadoop
MapReduce vs. RDBMS: Data Types

Structured Data: Data that has a formally defined structure (e.g., XML documents or database tables)
Semi-Structured Data: Data that has a looser format where the data structure is used as a guide and may be ignored
Unstructured Data: Data that does not have any formal structure (e.g., plain text or image data)
108
Hadoop
MapReduce vs. RDBMS: Data Types

MapReduce is very effective on unstructured and semi-
structured data
Why?
MapReduce interprets data during the data processing sessions
MapReduce does not use intrinsic properties of the data as input keys or input values
The parameters used are selected by the person analyzing the data
109
Hadoop
MapReduce vs. RDBMS: Scalability

MapReduce has a programming model that is linearly
scalable
MapReduce Functions: 2 types

Map function
Reduce function
Both of these functions define a Key-Value pair mapping relation
(e.g., Key-Value pair 1 → Key-Value pair 2)
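The key-value mapping performed by the Map and Reduce functions can be sketched with a word count, run here in a single Python process rather than on Hadoop. The names `map_fn`, `reduce_fn`, and `run_job` are hypothetical and chosen for this illustration only.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: (line offset, line text) -> list of (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_fn(word, counts):
    """Reduce: (word, [1, 1, ...]) -> (word, total)."""
    return (word, sum(counts))

def run_job(records):
    """Run map, group by key, then reduce, all in one process."""
    # Map phase: apply map_fn to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Grouping phase: collect all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce_fn call per distinct key.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))

# Input records are (byte offset, line) pairs; the offsets are illustrative.
lines = [(0, "big data big cloud"), (19, "cloud data big")]
result = run_job(lines)
```

Each `map_fn` call depends only on its own record, so the map phase can be spread over as many machines as there are input splits, which is what makes the model scale linearly.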
110
Hadoop
Hadoop Release Series
Release 2.6.0 became available in Nov. 2014
Feature                          1.x                                0.22   2.x
Secure authentication            Yes                                No     Yes
Old configuration names          Yes
New configuration names          No                                 Yes    Yes
Old MapReduce API                Yes                                Yes    Yes
New MapReduce API                Yes (with some missing libraries)  Yes    Yes
MapReduce 1 runtime (Classic)    Yes                                Yes    No
MapReduce 2 runtime (YARN)       No                                 No     Yes
HDFS Federation                  No                                 No     Yes
HDFS High-Availability           No                                 No     Yes
111
Hadoop
Hadoop Release Series
2.x includes several major new features

MapReduce 2 is the new MapReduce runtime
implemented on a new system called YARN
YARN
Yet Another Resource Negotiator
General resource management system for
running distributed applications
112
Hadoop
Hadoop Release Series

HDFS Federation partitions the HDFS namespace
across multiple namenodes
Enables improved support for clusters with very
large numbers of files
HDFS High-Availability feature uses standby

namenodes for backup, and therefore, the namenode
is no longer a potential SPOF (Single Point of Failure)
113
Big Data
MapReduce
118
MapReduce
Hadoop


MapReduce = Map Function + Reduce Function
119
MapReduce
Scaling Out
Scaling out is done by the DFS (Distributed FileSystem), where the data is divided and stored in distributed computers & servers
Hadoop uses HDFS to move the MapReduce computation to several distributed computing machines, each of which processes its assigned part of the divided data
120
MapReduce
Jobs
MapReduce job is a unit of work that needs to be executed
Job consists of: Data input, MapReduce program, Configuration Information, etc.
Job is executed by dividing it into two types of tasks
Map Task
Reduce Task
121
MapReduce
Node types for Job execution
Job execution is controlled by 2 types of nodes

Jobtracker
Tasktracker
Jobtracker coordinates all jobs
Jobtracker schedules all tasks and assigns the tasks

to tasktrackers
122
MapReduce
Tasktracker will execute its assigned task

Tasktracker will send progress reports to the Jobtracker
Jobtracker will keep a record of the progress of all jobs executed
123
MapReduce
Data flow
Hadoop divides the input into input splits (or splits) suitable for the MapReduce job
Each split has a fixed size
Split size is commonly matched to the size of an HDFS block (64 MB) for maximum processing efficiency
124
MapReduce
Data flow
Map Task is created for each split
Map Task executes the map function for all records

within the split
Hadoop commonly executes the Map Task on the

node where the input data resides
125
MapReduce
Data flow
Data-Local Map Task
Data locality optimization does not need to use the cluster network
Data-local flow process shows why the Optimal Split Size = 64 MB HDFS Block Size
126
MapReduce
Data flow
Rack-Local Map Task
A node hosting the HDFS block replicas for a map task's input split could be running other map tasks
Job Scheduler will look for a free map slot on a node in the same rack as one of the blocks
[Figure: Map Task and HDFS Block placement across Node, Rack, and Data Center]
127
MapReduce
Data flow
Off-Rack Map Task

Needed when the Job Scheduler cannot perform data-local or rack-local map tasks
Uses inter-rack network transfer
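The preference order of data-local, then rack-local, then off-rack map tasks can be sketched as a small placement function. This is a toy model, not Hadoop's scheduler code: the cluster layout, the node names, and the functions `locality` and `pick_node` are all hypothetical.

```python
def locality(block_replicas, node, rack_of):
    """Classify a candidate node for a map task whose input block is on block_replicas."""
    if node in block_replicas:
        return "data-local"
    if rack_of[node] in {rack_of[r] for r in block_replicas}:
        return "rack-local"
    return "off-rack"

def pick_node(block_replicas, free_nodes, rack_of):
    """Among nodes with a free map slot, prefer data-local, then rack-local, then off-rack."""
    order = {"data-local": 0, "rack-local": 1, "off-rack": 2}
    return min(
        free_nodes,
        key=lambda n: (order[locality(block_replicas, n, rack_of)], n),
    )

# Hypothetical cluster layout: node -> rack.
rack_of = {"n1": "rackA", "n2": "rackA", "n3": "rackB", "n4": "rackB"}

# Nodes holding a replica of the input split's HDFS block.
replicas = {"n1"}
```

For example, if only `n2` and `n3` have free map slots, `pick_node(replicas, ["n3", "n2"], rack_of)` chooses `n2`, because it shares a rack with the replica on `n1` and so avoids inter-rack transfer.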
128
MapReduce
Map
Map task will write its output to the local disk
Map task output is not the final output; it is only the intermediate output
Reduce
Map task output is processed by Reduce Tasks to produce
the final output
Reduce Task output is stored in HDFS
For a completed job, the Map Task output can be
discarded
129
MapReduce
Single Reduce Task
Node includes Split, Map, Sort, and Output unit

Light blue arrows show data transfers in a node
Black arrows show data transfers between nodes
130
MapReduce
Single Reduce Task
Number of reduce tasks is specified

independently, and is not based on
the size of the input
131
MapReduce
Combiner Function
User specified function to run on the Map output
Forms the input to the Reduce function
Specifically designed to minimize the data transferred
between Map Tasks and Reduce Tasks
Solves the problem of limited network speed on the
cluster and helps to reduce the time in completing
MapReduce jobs
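The effect of a combiner can be sketched with a word count in plain Python. The function names and input splits are hypothetical; the sketch relies on summation being associative and commutative, which is what allows partial sums to be computed on the map side without changing the final result.

```python
from collections import Counter

def map_task(lines):
    """Emit one (word, 1) pair per word, as a plain map task would."""
    return [(word, 1) for line in lines for word in line.split()]

def combine(pairs):
    """Combiner: pre-sum counts per key on the map side before anything is sent."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

def reduce_all(list_of_pair_lists):
    """Reduce: sum counts per key across the output of all map tasks."""
    totals = Counter()
    for pairs in list_of_pair_lists:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

# Two hypothetical input splits, each handled by its own map task.
split_1 = ["flu flu cold", "flu"]
split_2 = ["cold flu"]

raw = [map_task(split_1), map_task(split_2)]
combined = [combine(pairs) for pairs in raw]

# Number of key-value pairs that would cross the network in each case.
pairs_without_combiner = sum(len(p) for p in raw)
pairs_with_combiner = sum(len(p) for p in combined)
```

Both `reduce_all(raw)` and `reduce_all(combined)` give the same totals, but fewer pairs are transferred when the combiner runs first.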
132
MapReduce
Multiple Reducers
Map tasks partition their output, each creating one partition for each reduce task
Each partition may contain many keys and their associated values
All records for a key are kept in a single partition
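Keeping all records for a key in a single partition is commonly achieved by hashing the key. The sketch below assumes CRC32 as a stable stand-in hash; the function names are hypothetical and this is not Hadoop's own partitioner code.

```python
import zlib

def partition_for(key, num_reducers):
    """Pick a reduce task for a key; a stable hash keeps the choice deterministic."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_output(pairs, num_reducers):
    """Split one map task's output into one partition per reduce task."""
    partitions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions

# Hypothetical output of a single map task.
map_output = [("flu", 1), ("cold", 1), ("flu", 1), ("fever", 1), ("cold", 1)]
parts = partition_output(map_output, num_reducers=2)
```

Because every map task applies the same function, all pairs with a given key land on the same reduce task regardless of which map task produced them.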
133
MapReduce
Multiple Reducers
Shuffle
Shuffle process is used in the data flow

between the Map tasks and Reduce tasks
134
MapReduce
Zero Reducer
Zero reducer uses

no shuffle process
Applied when all of the
processing can be carried
out in parallel Map tasks
135
Big Data
HDFS
140
HDFS
Hadoop


MapReduce = Map Function + Reduce Function
141
HDFS
DFS (Distributed FileSystem) is designed for storage

management of a network of computers
HDFS is optimized to store large terabyte size files

with streaming data access patterns
142
HDFS
HDFS was designed to be optimal in performance for

a WORM (Write Once,
Read Many times) pattern
HDFS is designed to run on clusters of general

computers & servers from multiple vendors
143
HDFS
HDFS Characteristics
HDFS is optimized for large scale and high throughput

data processing
HDFS does not perform well in supporting applications

that require minimum delay (e.g., tens of milliseconds
range)
144
HDFS
Blocks
Files in HDFS are divided into block-size chunks
64 Megabyte default block size
A block is the minimum amount of data that HDFS can read or write
Blocks simplify the storage and replication process
Provides fault tolerance & processing speed enhancement for larger files
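How a file maps onto blocks can be sketched with simple arithmetic. The 64 MB block size comes from this slide; the 200 MB file size and the function name `split_into_blocks` are assumptions made for the example.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size, as stated on the slide

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# Hypothetical 200 MB file: three full 64 MB blocks plus one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
```

Each block can then be stored and replicated independently, which is why a single file can be larger than any one disk in the cluster.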
145
HDFS
HDFS
HDFS clusters use 2 types of nodes
Namenode (master node)
Datanode (worker node)
146
HDFS
Namenode
Manages the filesystem namespace
Namenode keeps track of the datanodes that have
blocks of a distributed file assigned
Maintains the filesystem tree and the metadata for all
the files and directories in the tree
Stores this information on the local disk in 2 files
Namespace Image
Edit Log
147
HDFS
Namenode
Namenode holds the filesystem metadata in its memory
Namenode's memory size determines the limit to the number of files in a filesystem
But then, what is Metadata?
148
HDFS
Metadata
Comparable to the traditional concept of library card catalogs
Categorizes and describes the contents and context of

the data files
Maximizes the usefulness of the original data file by

making it easy to find and use
149
HDFS
Metadata Types
Structural Metadata
Focuses on the data structure's design and
specification
Descriptive Metadata
Focuses on the individual instances of application
data or the data content
150
HDFS
Datanodes
Workhorse of the filesystem
Store and retrieve blocks when requested by the client

or the namenode
Periodically report back to the namenode with lists of

blocks that were stored
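The periodic block reports can be pictured with a toy model in which the namenode rebuilds its block-to-datanode map from whatever each datanode last reported. The class, its methods, and the block and datanode names are hypothetical and far simpler than the real HDFS protocol.

```python
from collections import defaultdict

class Namenode:
    """Toy namenode that only tracks which datanodes hold which blocks."""

    def __init__(self):
        self.block_locations = defaultdict(set)  # block id -> datanode names

    def receive_block_report(self, datanode, block_ids):
        """Replace everything previously known about this datanode with its latest report."""
        for holders in self.block_locations.values():
            holders.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, block_id):
        """Return the datanodes currently known to store the block."""
        return sorted(self.block_locations.get(block_id, ()))

nn = Namenode()
nn.receive_block_report("dn1", ["blk_1", "blk_2"])
nn.receive_block_report("dn2", ["blk_2", "blk_3"])
```

A client asking for `blk_2` would be pointed at both `dn1` and `dn2`; if a later report from `dn1` no longer lists that block, only `dn2` remains.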
151
HDFS
Client Access
Client can access the filesystem (on behalf of the user)

by communicating with the namenode and datanodes
Client can use a filesystem interface (similar to POSIX (Portable Operating System Interface)) so the user code does not need to know about the namenode and datanodes to function properly
152
HDFS
Namenode Failure
Namenode keeps track of the datanodes that have blocks of a distributed file assigned
Without the namenode, the filesystem cannot be used
If the computer running the namenode malfunctions, then reconstruction of the files (from the blocks on the datanodes) would not be possible
Files on the filesystem would be lost
153
HDFS
Namenode Failure Resilience
Namenode failure prevention schemes
1. Namenode File Backup
2. Secondary Namenode
154
HDFS
1. Namenode File Backup

Back up the namenode files that form the persistent
state of the filesystem's metadata
Configure the namenode to write its persistent state
to multiple filesystems
The writes are synchronous and atomic
Common backup configuration: copy to the local disk
and to a remote filesystem
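
The idea of writing the persistent state to several locations, synchronously and atomically, can be sketched as follows. This is an illustrative sketch, not the namenode's implementation: the function name write_state_to_all and the directory and file names in the usage note are hypothetical. Each copy is replaced atomically through a temporary file and a rename, and the call returns only after every copy is on disk. The atomicity here is per directory, not across all directories at once.

```python
import os
import tempfile


def write_state_to_all(directories, filename, payload):
    """Write payload to filename in every directory; raise on the first failure."""
    for directory in directories:
        os.makedirs(directory, exist_ok=True)
        final_path = os.path.join(directory, filename)
        # Write to a temporary file in the same directory, force it to disk,
        # then rename it over the target so readers never see a partial file.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(payload)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_path, final_path)
        except BaseException:
            # Clean up the temporary file if anything went wrong.
            if os.path.exists(tmp_path):
                os.unlink(tmp_path)
            raise
```

For example, write_state_to_all(["/data/nn", "/mnt/remote/nn"], "image", state_bytes) would keep a local copy and a remote copy in step, where both paths are placeholders.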
155
HDFS
2. Secondary Namenode
Despite its name, the secondary namenode does not act
as a namenode
Secondary namenode periodically merges the
namespace image with the edit log to prevent the edit log
from becoming too large
Secondary namenode usually runs on a separate
computer to perform the merge process because this
requires significant processing capability and memory
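
The merge of the namespace image with the edit log can be sketched as replaying logged operations onto a copy of the image and then starting a fresh, empty log. The edit format used here (tuples such as ("create", path)) and the function names apply_edit and checkpoint are assumptions for illustration and are not the on-disk formats that Hadoop uses.

```python
import copy


def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = {"blocks": []}
    elif op == "add_block":
        namespace[path]["blocks"].append(edit[2])
    elif op == "delete":
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown edit operation: {op}")


def checkpoint(namespace_image, edit_log):
    """Replay the edit log onto a copy of the image.

    Returns the merged image and a new, empty edit log.
    """
    new_image = copy.deepcopy(namespace_image)
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []
```

Because the merge works on a full copy of the namespace, it needs roughly as much memory as the namenode itself, which is consistent with running it on a separate computer.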
156
HDFS
Hadoop 2.x Release Series HDFS Reliability Enhancements
HDFS Federation
HDFS HA (High-Availability)
157
HDFS
HDFS Federation
Allows a cluster to scale by adding namenodes
Each namenode manages a namespace volume and a
block pool
Namespace volume is made up of the metadata for
the namespace
Block pool contains all the blocks for the files in the
namespace
158
HDFS
HDFS Federation
Namespace volumes are all independent
Namenodes do not communicate with each other
Failure of a namenode is also independent of the other
namenodes
A namenode failure does not influence the
availability of another namenode's namespace
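
The independence of namespaces can be illustrated with a small routing table that maps path prefixes to namenodes. The mount-table idea, the class name FederatedNamespace, and the namenode names nn1 and nn2 are assumptions of this sketch. The slides state only that each namenode manages its own namespace volume and that failures are independent.

```python
class FederatedNamespace:
    """Hypothetical client-side table: path prefix -> responsible namenode."""

    def __init__(self, mounts):
        self.mounts = dict(mounts)
        self.failed = set()

    def namenode_for(self, path):
        # A prefix matches the path itself or any path beneath it.
        matches = [
            prefix for prefix in self.mounts
            if path == prefix or path.startswith(prefix.rstrip("/") + "/")
        ]
        if not matches:
            raise KeyError(f"no namenode manages {path}")
        # The longest matching prefix wins.
        return self.mounts[max(matches, key=len)]

    def mark_failed(self, namenode):
        self.failed.add(namenode)

    def is_available(self, path):
        # Only the namenode that owns this path matters for availability.
        return self.namenode_for(path) not in self.failed
```

If nn1 fails, paths under its prefix become unavailable while paths managed by nn2 are unaffected.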
159
HDFS
HDFS High-Availability
A pair of namenodes (primary and standby) is set up in
an active-standby configuration
Standby namenode keeps the latest edit log entries
and an up-to-date block mapping
When the primary namenode fails, the standby
namenode takes over serving client requests
160
HDFS
HDFS High-Availability
Although the standby namenode can take over
operation quickly (e.g., in a few tens of seconds), to
avoid unnecessary namenode switching, standby
namenode activation is executed only after a
sufficient observation period
(e.g., approximately a minute or a few minutes)
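
The observation period can be sketched as a controller that activates the standby only after the primary has been continuously unhealthy for a configured time. The FailoverController class, its report_health method, and the 60-second default are hypothetical choices for illustration. Real HDFS deployments use their own failover controller and timeouts.

```python
class FailoverController:
    """Toy controller that delays failover until an observation period passes."""

    def __init__(self, observation_period_s=60.0):
        self.observation_period_s = observation_period_s
        self.active = "primary"
        self._first_missed_at = None

    def report_health(self, primary_healthy, now_s):
        """Feed one health check result; return which namenode should be active."""
        if self.active != "primary":
            return self.active  # already failed over
        if primary_healthy:
            # A transient glitch has cleared, so reset the timer.
            self._first_missed_at = None
        elif self._first_missed_at is None:
            # First failed check: start observing.
            self._first_missed_at = now_s
        elif now_s - self._first_missed_at >= self.observation_period_s:
            # Unhealthy for the whole observation period: switch.
            self.active = "standby"
        return self.active
```

A single missed health check followed by a healthy one does not trigger a switch, which is the behaviour the observation period is meant to provide.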
161
CDN (Content Delivery Network)
CDN Introduction
166
CDN
Table of Contents
CDN Motivation & Structure
CDN Procedures
Hierarchical Content Delivery Model
CDN Market & Major Service Providers
CDN Research & Development
167
CDN
CDN Motivation
A CDN is a network constructed from a group of
strategically placed and geographically distributed
caching servers
A CDN is one of the most efficient ways for CPs (Content
Providers) to serve a large number of user devices while
reducing content download time and network traffic
168
CDN
CDN Motivation
Network traffic generated by mobile users (e.g., on smart
devices) is rapidly increasing
Mobile network performance is highly dependent on the
content download of multimedia data and applications
Several mobile network operators have suffered from service
outages or performance deterioration due to the significant
increase in use of mobile devices
169
CDN
CDN Structure
Using CDN, both content download time and network
traffic are reduced
[Figure: content request and delivery routes between the User and the Content Provider, with and without CDN; the Caching Server stores popular contents in advance]
170
CDN
CDN in Mobile Networks
Mobile communication networks have a stronger need
to reduce both traffic load and content delivery time
than broadband backbone networks, where capacity is
abundant and traffic load reduction may be less
critical
171
CDN
CDN Structure
A CDN usually consists of the CP (Content Provider) and
caching servers
CP possesses all contents to serve
Caching servers are distributed in the network and
hold copies of selected contents that the CP stores
172
CDN
CDN Structure
When a user requests content from its nearest
caching server, the server can deliver the
content if the requested content is in its cache
Otherwise, the caching server redirects the
user's request to the remotely located CP
173
CDN
CDN Procedures
When a user requests content from its nearest caching server, the
server can deliver the content if the requested content is in its
cache
174
CDN
CDN Procedures
If the requested content is not in the local server's cache,
the content request is redirected to the remotely located CP
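
The two procedures above (deliver from the cache on a hit, redirect to the CP on a miss) can be sketched as follows. The ContentProvider and CachingServer classes are hypothetical stand-ins. A real CDN typically redirects via DNS or HTTP rather than by a direct method call.

```python
class ContentProvider:
    """Origin that possesses all contents."""

    def __init__(self, contents):
        self.contents = dict(contents)

    def fetch(self, content_id):
        return self.contents[content_id]


class CachingServer:
    """Edge server holding copies of selected contents."""

    def __init__(self, content_provider, cached=None):
        self.cp = content_provider
        self.cache = dict(cached or {})
        self.hits = 0
        self.misses = 0

    def request(self, content_id):
        # Hit: deliver directly from the local cache.
        if content_id in self.cache:
            self.hits += 1
            return self.cache[content_id], "cache"
        # Miss: the request goes to the remotely located CP.
        self.misses += 1
        return self.cp.fetch(content_id), "content provider"
```

The hits and misses counters make it possible to compute the hit rate as hits / (hits + misses).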
175
CDN
Content Aging Procedure

Content aging is focused on delivering the most popular
contents to users in the most effective way
Dependent on
Location of caching servers
Number of caching servers
Limited memory size of caching servers
Content Aging
Delete expired contents from the cache server
Download updated contents from the CP
176
CDN
Content Aging Procedure
Each content has a content update period, its
TTL (Time to Live)
A few seconds for on-line trading
A few seconds for auction information
24 hours or more for movies
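
Content aging with a per-content TTL can be sketched as a cache that deletes expired entries and downloads a fresh copy from the CP on the next request. The AgingCache class and the fetch_from_cp callable are assumptions for illustration. Time is passed in explicitly as now_s so the behaviour is easy to follow.

```python
class AgingCache:
    """Toy cache server that applies a TTL to each stored content."""

    def __init__(self, fetch_from_cp):
        # fetch_from_cp: callable taking a content id and returning its data
        self.fetch_from_cp = fetch_from_cp
        self.entries = {}  # content_id -> (data, expires_at_s)

    def put(self, content_id, data, ttl_s, now_s):
        self.entries[content_id] = (data, now_s + ttl_s)

    def age(self, now_s):
        """Delete expired contents and return their ids."""
        expired = [
            cid for cid, (_, expires_at) in self.entries.items()
            if expires_at <= now_s
        ]
        for cid in expired:
            del self.entries[cid]
        return expired

    def get(self, content_id, ttl_s, now_s):
        self.age(now_s)
        if content_id not in self.entries:
            # Missing or expired: download the updated content from the CP.
            data = self.fetch_from_cp(content_id)
            self.put(content_id, data, ttl_s, now_s)
        return self.entries[content_id][0]
```

A short TTL (a few seconds) suits trading or auction data that changes quickly, while a TTL of 24 hours or more avoids repeatedly downloading large, stable contents such as movies.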
177
CDN Hierarchical
Content Delivery
180
Hierarchical Content Delivery
It is not possible for a caching server to save all
contents that the CP (Content Provider) serves
Retrieving contents from the remotely located CP can
cause a long content download time. In addition, a
large amount of traffic will be generated by each
server in support of the content packet routing
181

For the given cache size of each server, it is important
to maximize the hit rate of the local caching server
such that the requested contents do not have to be
retrieved from the CP
To accomplish this objective in the Internet in a
scalable way, hierarchical cooperative content delivery
techniques are used in providing content delivery to
local caching servers
182

CD & LCF (Content Distribution & Location Control
Functions) controls the overall content delivery process,
and has all content IDs of the CDN
CCF (Cluster Control Function) controls multiple CDPFs
(Content Delivery Processing Functions) and saves
content IDs of the cluster
CDPF stores and delivers the contents to the users
183
Hierarchical Content Delivery Network Example
184
Content Delivery Procedures
Case 1
Requested content is in the local cluster
Content request message is delivered to the CCF
CCF sends a session request message to the
CDPF to deliver the content to the user
CDPF delivers the content to the user
185
Case 1 Procedures
186

Case 2
Requested content is not in the local cluster, but
another local cluster (i.e., target cluster) has the
content
Procedures
Content request message is redirected from
the local cluster to the CD & LCF
Continued
187

Case 2
Procedures Continued
CD & LCF checks if the requested content is
in the other cluster
Requested content can be delivered from the
target cluster to the user directly, or through
the local cluster (the local cluster can store
the requested content)
188

Case 2 Procedures
189
Case 3
When the requested content is not in the CDN
Content request message is sent from the
CD & LCF to the CP
CP delivers the content to the user through
the local cluster
The requested content can be stored in
the local cluster
190
Content Delivery Procedure
Case 3 Procedures
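
The three delivery cases can be sketched with a simplified model in which each cluster stands for a CCF together with the contents held by its CDPFs, and the CDN object plays the CD & LCF role of knowing which cluster has which content. The class names Cluster and CDN, the method names, and the choice to always store the content in the local cluster are assumptions of this sketch. The slides describe storing as optional and do not specify message formats.

```python
class Cluster:
    """One CCF together with the contents stored by its CDPFs."""

    def __init__(self, name, contents=None):
        self.name = name
        self.contents = dict(contents or {})


class CDN:
    """Plays the CD & LCF role: knows the content IDs of every cluster."""

    def __init__(self, clusters, content_provider):
        self.clusters = {c.name: c for c in clusters}
        self.content_provider = dict(content_provider)  # contents at the CP

    def locate(self, content_id, exclude):
        # Look for another cluster (the target cluster) holding the content.
        for name, cluster in self.clusters.items():
            if name != exclude and content_id in cluster.contents:
                return cluster
        return None

    def request(self, local_name, content_id):
        local = self.clusters[local_name]
        # Case 1: the local cluster has the content.
        if content_id in local.contents:
            return local.contents[content_id], "case 1: local cluster"
        target = self.locate(content_id, exclude=local_name)
        if target is not None:
            # Case 2: another cluster has the content.
            data = target.contents[content_id]
            source = f"case 2: cluster {target.name}"
        else:
            # Case 3: the content is not in the CDN, so ask the CP.
            data = self.content_provider[content_id]
            source = "case 3: content provider"
        # Storing in the local cluster is optional; this sketch always does it.
        local.contents[content_id] = data
        return data, source
```

Because the local cluster stores what it relays, a repeated request for the same content falls under Case 1.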
191
CDN Market
194
CDN Market
Measuring the CDN Market Value

There are many ways to evaluate the value of the CDN market
Evaluation is related to the diverse range of CDN industry
participants
Example of industry participants
CSP (Communications Service Provider)
Industry manufacturers
CDN service providers
Content provider
195
CDN Market
Measuring the CDN Market Value
For communication service providers, the CDN's value
includes improving retail service delivery and supporting
their efforts to win and retain customers
For industry manufacturers, the market value is related to
the demand from telcos, content providers and other
businesses
196
CDN Market
CDN Market Size

2014 CDN Market size was $3.71 billion
CDN Market Components
Content delivery technologies, hardware, analytics,
monitoring, encoding, transparent caching, DRM
(Digital Rights Management), CMS (Content
Management System), OVP (Online Video Platform),
etc.
CDN Market Estimations
Expected to grow to $12.16 billion by 2019
Predicted 26.3% CAGR (Compound Annual Growth Rate) from
2014~2019
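
As a worked check of how the quoted figures relate, the compound annual growth rate over n years is (end / start)^(1/n) - 1. The sketch below applies that formula to the market sizes above, assuming five compounding years between 2014 and 2019. The function names cagr and project are illustrative. The computed values may differ slightly from the quoted ones because of rounding or a different base period in the original forecast.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values over a number of years."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def project(start_value, rate, years):
    """Value after compounding start_value at the given rate for years."""
    return start_value * (1.0 + rate) ** years


# Rate implied by the two quoted market sizes (in billions of dollars).
implied_rate = cagr(3.71, 12.16, 5)
# 2019 market size implied by the quoted 26.3% rate.
projected_2019 = project(3.71, 0.263, 5)

print(f"implied CAGR: {implied_rate:.1%}")
print(f"2019 size at 26.3%: ${projected_2019:.2f} billion")
```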
197
CDN Market
CDN Service Providers

Akamai has about 110,000 servers around the world.
Akamai's services include cloud computing, HD video
delivery, etc.
Amazon CloudFront delivers static and streaming
contents. Amazon CloudFront works seamlessly with
other Amazon Web and Cloud Service solutions
S3 (Simple Storage Service)
EC2 (Elastic Compute Cloud)
198
CDN Market
CDNetworks has POPs (Points of Presence) in 6
continents, including 20 POPs in China. World's
3rd largest, and Asia's #1, full-service provider
Level 3 supports a comprehensive encoding suite
for video data, and intelligent traffic manager
services (i.e., load balancing)
199
CDN Market
Limelight has 6,000 servers at 75 POPs (Points of
Presence), and more than 30 regional content delivery
centers in the U.S., Europe, and Asia
ChinaCache is a CDN market leader in China, which
has 127 POPs and 11,000 servers in China. CDN
services include hotlink protection, custom CNAME
for SSL, and Purge All.
200
CDN Market
Telcos with a CDN resale agreement
CDN Provider | Operator (Market Region)
Akamai | Verizon (US), NTT Communications (Japan), du (UAE), Telekom Malaysia (Malaysia)
CDNetworks | Andorra Telecom (Andorra), MegaFon (Russia), Telecom Italia Sparkle (Italy), SingTel (Singapore)
ChinaCache | China Mobile (China), HGC (International)
201
CDN Market
Telcos with a CDN resale agreement
CDN Provider | Operator (Market Region)
EdgeCast | AT&T (US), AAPT (Australia), Deutsche Telekom ICSS (Germany), Dogan Telecom (Turkey), Pacnet (Asia Pacific), Telus (Canada)
Jet-Stream | Telenet (Belgium), Ziggo (Netherlands)
Level 3 | Internexa (South America), MWeb (South Africa), STC (Saudi Arabia)
Limelight Networks | Bell Canada (Canada), Bestel (Mexico), Bharti Airtel (India), XO Communications (US)
202
CDN R&D
205
CDN

Content Aspects
Content Type based Differentiated Support
Data, Multimedia, Mobile Apps, etc.
Content Aging Control
Content Selection & Deletion
Content Replication Detection
Dynamic Page Publishing
Digital Rights Management
Live Event Management
206
CDN

System Aspects
Surrogate Server Location (Dynamic)
Storage Memory Size (Dynamic)
Content Delivery Method
Mobile Device Characteristics, Location
Network Latency
Security & Information Assurance
Anomaly Detection
User Authentication
Content Authentication
207
CDN
Mobile CDN Research & Development

Mobile wireless networks have additional challenges in
supporting CDN services, e.g.,
GPS & Navigation Information
Mobile TV
ITS (Intelligent Transportation System)
LBS (Location Based Service)
Efficient content provisioning is required to provide
scalable control over wide coverage areas while
providing high levels of QoS with limited resources
208
CDN
Mobile CDN Challenges

Mobile node constraints (limited storage, processing
power, input capability) due to the portable size of mobile
devices
Frequent network disconnections due to user mobility
Location oriented services regarding user mobility
Real time monitoring to obtain the real time status of mobile
users
209
CDN
CDN vs. Mobile CDN

Features | CDN | Mobile CDN [Future]
Content Type | Static, Dynamic, Streaming | Static, Dynamic, Streaming
Users Location | Fixed | Mobile, Fixed
Surrogate Location | Fixed | Fixed, [Mobile]
Surrogate Topology | ISP (Internet Service Provider) Local, Center of Service Area | BSs (Base Stations), RAN (Radio Access Network) Systems, [Mobile Devices]
Maintenance Complexity | Low~Medium | Medium~High [Dynamic]
Services | Multimedia & Data Services, etc. | Mobile Apps, LBS, [Mobile] Cloud, etc.
210
