
Online Slide

http://tinyurl.com/ycs5zqe8

Machine Learning
for
Data Science

Assoc. Prof. Dr. Thanachart Numnonda
Executive Director, IMC Institute

Mr. Aekanun Thongtae
Big Data Consultant, IMC Institute
Mr. Aekanun Thongtae
Related Experiences:
- Big Data Consultant at IMC Institute
- Director of Relations Network, Office of the Student Loans Fund
- Former Manager of Architecture and Prototype at EGA
- Former Manager of Research and Development at EGA
- Guest lecturer at many universities, teaching courses on Data Mining and Information Security

https://www.facebook.com/analyticsindeep

http://www.aekanun.com
Cross-Validation & Hyperparameters Tuning

Source: mapr.com

3 aekanun@imcinstitute.com
Cross-Validation & Hyperparameters Tuning

Source: databricks.com

4 aekanun@imcinstitute.com
Topic: Lecture & LAB

Introduction to Data Science

Introduction to Machine Learning

Jupyter (tool for data science)

Tools for Large Scale Machine Learning: Spark Core, Spark SQL, Spark MLlib

Methodology of Data Science

Clustering | LAB for Clustering

Classification | LAB for Classification

Recommendation

Course Supplement: Codes, VM on Cloud, Docker, Q&A

5 aekanun@imcinstitute.com
What is Data Science ?

- In a nutshell: The purpose of data science is to find patterns.


- Pattern finding means understanding, problem solving and prediction.

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Machine: How to Predict?

Machine Learning: Teach the machine with data

(Dependent Var. and/or Independent Var.)

Images: https://www.slideshare.net/KaiWaehner/r-spark-tensorflow-spark-applied-to-streaming-analytics
Data Science Methodology: CRISP-DM

- Cross-Industry Standard Process for Data Mining, commonly known by its acronym CRISP-DM, is a data mining process model that describes commonly used approaches that data mining experts use to tackle problems.

- Polls conducted at the same website (KDnuggets) in 2002, 2004, 2007 and 2014 show that it was the leading methodology used by industry data miners who decided to respond to the survey.

Source: Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13-22.

9 aekanun@imcinstitute.com
Step 1: Business Understanding
- Define the problem, project objectives and solution requirements from a business perspective.

- Define the analytic approach to solving it.

- Affects data requirements, technology requirements, success criteria, etc.

- Ex.
  - Alert after an event occurs: not predictive, but descriptive mining.
  - Recommendation system: not user profiles, but a focus on user preferences (Like, Love, etc.)

Adapted from Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016

10 aekanun@imcinstitute.com
Step 2: Data Understanding
- Identify data requirements for the analytic approach
  - Data sources: flat files, relational DB, HDFS
  - Data characteristics: Volume, Variety, Velocity
  - Data format & presentation: Structured/Unstructured, Categorical/Numerical
  - Directly affects the selection of data ingestion tools and the next steps of the methodology.
- Data collection
- Exploratory data analysis (EDA): an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
- Assess data quality: bias, inconsistency, duplication, missing values

(EDA plots: Histogram, Scatterplot, Boxplots)
Adapted from Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016

11 aekanun@imcinstitute.com
Hidden Biases of Big Data
If you analyzed tweets immediately before and after Hurricane Sandy, you would think that most people were supermarket shopping pre-Sandy and partying post-Sandy.

However, most of those tweets came from New Yorkers.

First of all, they're heavier Twitter users than, say, the coastal New Jerseyans, and second of all, the coastal New Jerseyans were worrying about other stuff like their house falling down and didn't have time to tweet.

Source: Cathy O'Neil et.al, Doing Data Science, O'Reilly Media, 2013
Image: wikipedia.org

12 aekanun@imcinstitute.com
Step 3: Data Preparation

- Data Preparation (Data Preprocessing):
  - The most time-consuming step, but it directly affects model accuracy.
  - Data cleansing: missing values, duplicate records, errors
  - Data transformation: normalization, standardization, discretization
  - Attribute selection: removing non-relevant attributes

Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015

13 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out

Missing Values
Anomalous values

Inconsistency: format,
presentation, unit of
measurement

Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015

14 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out

Missing Values
Anomalous values

Inconsistency: format,
presentation, unit of
measurement

Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015

15 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out

Missing Values
Anomalous values

Inconsistency: format,
presentation, unit of
measurement

Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015

16 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out

Missing Values
Anomalous values

Inconsistency: format,
presentation, unit of
measurement

Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015

17 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out

Single vs Separated

Missing Values
Anomalous values

Inconsistency: format,
presentation, unit of
measurement

Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015

18 aekanun@imcinstitute.com
GIGO: Garbage in, Garbage out

Unit of measurement ?

Missing Values
Anomalous values

Inconsistency: format,
presentation, unit of
measurement

Errors
Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015

19 aekanun@imcinstitute.com
Error Values

Errors in Testing Process


Unseen label: red.

20 aekanun@imcinstitute.com
Missing Values

Errors in Process of Vector Transformation

21 aekanun@imcinstitute.com
Handling Missing Data

Some of our field values are missing:

- Constant 0 for the numerical variable and the label "missing" for the categorical variable.

- Replacing missing field values with means or modes.

- Replacing missing field values with random draws from the distribution of the variable.

Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
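A minimal PySpark sketch of the mean/mode replacement idea above, assuming an existing DataFrame df with a hypothetical numerical column 'income' and a hypothetical categorical column 'job':

from pyspark.sql import functions as F

# Assumed: df already exists; the column names 'income' and 'job' are illustrative.
mean_income = df.select(F.avg("income")).first()[0]          # mean of the numerical variable
mode_job = (df.groupBy("job").count()
              .orderBy(F.desc("count"))
              .first()["job"])                                # mode of the categorical variable

# Replace missing values: mean for 'income', mode for 'job'
df_filled = df.na.fill({"income": mean_income, "job": mode_job})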

22 aekanun@imcinstitute.com
Transforming Numerical Var. into Categorical Var.
(Discretization)

(a)

(b)

Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015

23 aekanun@imcinstitute.com
Transforming: Scaling Difference of Values
using Min-Max Normalization

Source: D. Chantal et.al, Data Mining and Predictive Analytics, 2nd Edition, John Wiley & Sons, 2015
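A small pure-Python sketch of the min-max idea: each value x is rescaled to (x - min) / (max - min), which maps the column into [0, 1] (the example ages are made up):

def min_max_normalize(values):
    # x' = (x - min) / (max - min); assumes max > min
    lo, hi = min(values), max(values)
    return [(x - lo) / float(hi - lo) for x in values]

ages = [20, 35, 50, 65]
print(min_max_normalize(ages))   # [0.0, 0.333..., 0.666..., 1.0]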

24 aekanun@imcinstitute.com
Step 4: Modeling
- Machine Learning Algorithm
- Decision Trees
- Recommendation engines
- k-Nearest Neighbours
- Naive Bayes
- Logistic Regression
- K-Means
- Fuzzy C-Means
- Genetic algorithms
- etc.

Image:machinelearningmastery.com

25 aekanun@imcinstitute.com
Categories of Machine Learning

- Supervised Learning
- A set of example observations is provided as a training set.
- Goal: to learn an association between the inputs (features) and output
(target variable) by using the examples provided.

- Unsupervised Learning
- The input data is a feature matrix of observations without a target
variable.
- Scenarios:
- Used for exploratory analysis.
- A step before supervised learning.

Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
27
27 aekanun@imcinstitute.com
28
28 aekanun@imcinstitute.com
Spark ML Pipeline
Source: stats.stackexchange.com

29
29 aekanun@imcinstitute.com
Source: stats.stackexchange.com

30
30 aekanun@imcinstitute.com
Tasks of Machine Learning

- Classification/Regression

- If the target variable is categorical in nature, the problem is called Classification
  - Ex: spam or non-spam

- If the target variable is numerical, the problem is called Regression
  - Ex: predict the price of a house given features like number of rooms, square footage, etc.

Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Classification vs. Regression

Classification

Regression

Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Classification vs. Regression

Source: www.edureka.in/data-science
Tasks of Machine Learning
- Clustering
- Identify natural groupings or clusters of observations
that are similar to each other.

- Recommender Systems
- Predict user preference for a product or item given
historical preference data from other uses.

- Market Basket Analysis


- Identify patterns of association between items or
variables that co-occur in the same observation.
(Association rules)

Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Algorithm of Supervised Learning

- Decision Trees
- Mining Task: Classification/Regression (Regression Trees)

- Features: Handle categorical features, extend to the multiclass


classification setting

- Spark.ML: supports decision trees for binary and multiclass


classification and for regression, using both continuous (numerical)
and categorical features.

Source: spark.apache.org
Spark.ML: Decision Trees

Source: spark.apache.org
Algorithm of Supervised Learning

- Random Forest

- Mining Task: Classification/Regression

- Features:
- Correct for decision trees' habit of overfitting to their training set.
- Classification: Support random forests for binary and multiclass
classification
- Regression: Using both continuous and categorical features

- Process:
- a set of decision trees is constructed, each using a random
subset of the training examples and features.
- A vote based on the outputs of all the trees is used to make the
final decision.
Spark.ML: Random Forest

Source: spark.apache.org
Algorithm of Supervised Learning

- k-Nearest Neighbours
- Mining Task: Classification/Regression
- Process:
- Define a distance metric between observations
and use this metric for classification or
regression.
- Determine the k-nearest neighbors of any given
observation.
- For classification, an unseen observation is then
classified to the most popular class amongst its
k-nearest neighbors. Similarly, in regression, the
mean of the k-nearest neighbors is used.

Adapted from: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Algorithm of Supervised Learning

- Naive Bayes
- Mining Task: Classification
- Features:
- Bayes' theorem with strong (naive)
independence assumptions between the
features.
- Use Cases:
- Sex Classification
- Document Classification: Spam/Not
Algorithm of Supervised Learning

- Logistic Regression
- Mining Task:
- A regression model where the dependent
variable (DV) is categorical.
- Features: Binary dependent variable
Algorithm of Unsupervised Learning

- Clustering
- k-Means, MinHash, Hierarchical Clustering
- Hidden Markov Models
- Feature Extraction methods
- Neural Networks
Features Encoding Pipeline

- Required: Text (Categorical) -> StringIndexer -> Category Indices (Numerical: Discrete: Ordinal)

- Optional: Category Indices -> OneHotEncoder -> Binary vector (Numerical: Discrete: Nominal)

Adapted from: scalaformachinelearning.com
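A hedged PySpark sketch of this encoding pipeline, assuming sqlContext is available (as in the Jupyter/PySpark setup later in the course) and using a made-up 'country' column:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = sqlContext.createDataFrame([("TH",), ("US",), ("TH",), ("JP",)], ["country"])

indexer = StringIndexer(inputCol="country", outputCol="country_idx")      # text -> category indices
encoder = OneHotEncoder(inputCol="country_idx", outputCol="country_vec")  # indices -> binary vector

encoded = Pipeline(stages=[indexer, encoder]).fit(df).transform(df)
encoded.show()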


Design Choices

- Which features to use.


- Often there is a large space of possible features to choose from, and a
wise choice of features often results in significantly better model
outcomes.

- Which algorithms to use.


- For example, are we using decision tree, linear models, support vector
machines (SVMs), or a random forest?

- Choosing values for various parameters of each


model.
- For example, with random forest the number of trees and the
maximum depth of each tree needs to be specified.

Source: Ofer Mendelevitch et.al, Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
Step 5: Evaluation
- Evaluation: Model Performance
- Categorical Data: Confusion Matrix
- Numerical Data: Root Mean Squared Error

- Constructive Feedback Principle

- Check accuracy of the model prior to computing predicted values.


- Model Selection - Cross-Validation
  - Isn't really an evaluation metric that is used openly to communicate model accuracy.
  - But the result of cross-validation provides a good enough intuition to generalize the performance of a model.

(Figure: the trained model is evaluated against held-out testing data)

Adapted from Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale, Addison-Wesley Professional, 2016
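A sketch of cross-validation for model selection with Spark ML, assuming a prepared DataFrame train_df with 'features' and 'label' columns (hypothetical here); CrossValidator tries each parameter combination on k folds and keeps the best model:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

lr = LogisticRegression(featuresCol="features", labelCol="label")

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])   # hyperparameters to tune
        .addGrid(lr.maxIter, [10, 50])
        .build())

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)                    # 3-fold cross-validation

cv_model = cv.fit(train_df)                        # train_df is assumed to exist
best_model = cv_model.bestModel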

45 aekanun@imcinstitute.com
Evaluation - Confusion Matrix

Confusion Matrix   | Predicted Positive | Predicted Negative |
Actual Positive    | TP                 | FN                 | -> Sensitivity / Recall
Actual Negative    | FP                 | TN                 | -> Specificity (True Negative Rate)
                   | Positive Predictive Value (Precision) | Negative Predictive Value |

Source: analyticsvidhya.com

46 aekanun@imcinstitute.com
- We focus on Churn.
- Sensitivity or True Positive Rate means:
  - Proportion of all cases who are Churn that were detected.

- Sensitivity = TP / Actual Positive
             = TP / (TP + FN)

Confusion Matrix   | Predicted Positive | Predicted Negative |
Actual Positive    | TP                 | FN                 | -> Sensitivity / Recall
Actual Negative    | FP                 | TN                 | -> Specificity (True Negative Rate)
                   | Positive Predictive Value (Precision) | Negative Predictive Value |

47 aekanun@imcinstitute.com
Belief or Not?

- We focus on Churn.
- Positive Predictive Value means:
  - The chance that a person with a positive test truly will be Churn.

- PPV = TP / Predicted Positive
      = TP / (TP + FP)

Confusion Matrix   | Predicted Positive | Predicted Negative |
Actual Positive    | TP                 | FN                 | -> Sensitivity / Recall
Actual Negative    | FP                 | TN                 | -> Specificity (True Negative Rate)
                   | Positive Predictive Value (Precision) | Negative Predictive Value |

48 aekanun@imcinstitute.com
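The metrics above reduce to simple ratios of the confusion-matrix cells; a tiny Python sketch with made-up counts:

TP, FN, FP, TN = 40, 10, 5, 45          # hypothetical counts from a churn model

sensitivity = TP / float(TP + FN)       # recall / true positive rate   = 0.8
specificity = TN / float(TN + FP)       # true negative rate            = 0.9
precision   = TP / float(TP + FP)       # positive predictive value     ~ 0.89
accuracy    = (TP + TN) / float(TP + TN + FP + FN)   # 0.85

print(sensitivity)
print(specificity)
print(precision)
print(accuracy)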
Step 6: Deployment

Source: machinelearningmastery.com

49 aekanun@imcinstitute.com
Deployment
- Feedback
- Updates & Tuning

Source: bbi-consultancy.com

50 aekanun@imcinstitute.com
Data Science Methodology: CRISP-DM

Source: The CRISP-DM 1.0, CRISP-DM consortium, 2000.

51 aekanun@imcinstitute.com
Speaker

- Executive Director, IMC Institute
- Committee of the Council, Ubon Ratchathani University
- Chairman, Siameast Solutions Public Co., Ltd.
- Independent Director & President of Audit Committee, Thanachart Bank Public Co., Ltd.
- Independent Director, Vintcom Technology Public Co., Ltd.
- Independent Director, Humanica Ltd.
52
53
WALMART: Retail Industry

- Largest retailer in the world
- 20,000 stores in 28 countries
- Has had a Big Data and analytics department since 2004
- The world's largest private data cloud
- Processes 2.5 PB every hour

Problem Solving:
- Realtime analytics: product recommendation
- Right place, right time, right customer
- Monitors public social media conversations, and attempts to predict what products people will buy
54
Source: Big Data in Practice, Bernard Marr, 2016
WALMART: Retail Industry

Data:
- Data Café uses a database consisting of 200 billion rows of transactional data
- 200 other sources, including meteorological data, economic data, telecoms data, social media data, gas prices

Technology:
- 40 petabytes of data
- Hadoop (since 2011)
- Spark
- Cassandra
- R
- SAS
55
Source: Big Data in Practice, Bernard Marr, 2016
WALMART: Retail Industry

Results:
The Data Café system has led to a reduction in the time it takes from a problem being spotted in the numbers to a solution being proposed, from an average of two to three weeks down to around 20 minutes.

56
Source: Big Data in Practice, Bernard Marr, 2016
57
Netflix: Entertainment

- Streaming movie and TV service
- 65 million members in over 50 countries
- One-third of peak-time Internet traffic in the US

Problem Solving:
- To understand customer viewing habits
- Improve the number of hours customers spend streaming
- They launched the Netflix Prize

58
Source: Big Data in Practice, Bernard Marr, 2016
Netflix: Entertainment

Data:
- Customer ID, movie ID, rating and the date the movie was watched
- Streaming data

Technology:
- 3 petabytes of data
- Amazon Web Services
- Hadoop, Hive and Pig
- Originally used Oracle databases, but they switched to NoSQL and Cassandra

59
Source: Big Data in Practice, Bernard Marr, 2016
60
Netflix: Entertainment

Results
- They added 4.9 million new subscribers in Q1 2015, compared to four million in the same period in 2014.
- In Q1 2015 alone, Netflix members streamed 10 billion hours of content.

61
Source: Big Data in Practice, Bernard Marr, 2016
Uber: Transportation

- A smartphone app-based taxi booking service
- Now valued at $41 billion
- Firmly rooted in Big Data, and leveraging this data in a more effective way than traditional taxi firms

Problem Solving:
- Big Data principle of crowdsourcing
- Store and monitor data on every journey to determine demand, allocate resources and set fares
- Big Data-informed pricing, which they call surge pricing
62
Source: Big Data in Practice, Bernard Marr, 2016
Uber: Transportation

Data:
- Mixture of internal and external data
- GPS, traffic data
- Public transport routes

Technology:
- Hadoop data lake
- Apache Spark

63
Source: Big Data in Practice, Bernard Marr, 2016
Uber: Transportation

Results
This case is less about short-term results and more about the long-term development of a data-driven business model. But it's fair to say that without their clever use of data the company wouldn't have grown into the phenomenon they are.

64
Source: Big Data in Practice, Bernard Marr, 2016
Amazon

- One of the world's largest retailers of physical goods, virtual goods such as ebooks and streaming video and, more recently, Web services

Problem Solving:
- Recommendation engine technology based on collaborative filtering
- 360-degree view of you as an individual customer
- Monitor, track and secure the 1.5 billion items in its retail store
65
Source: Big Data in Practice, Bernard Marr, 2016
Amazon

Data:
- Data from users as they browse the site
- Location data and information about other apps used on your phone
- External datasets such as census information
- Streaming data

Technology:
- 187 million unique monthly website visitors
- Hewlett-Packard servers running Oracle on Linux
- 5 TB of data
66
Source: Big Data in Practice, Bernard Marr, 2016
Amazon

Results
Amazon have grown to become the largest online retailer in
the US based on their customer-focused approach to
recommendation technology. Last year, they took in nearly
$90 billion from worldwide sales.

67
Source: Big Data in Practice, Bernard Marr, 2016
Big Data

Source: http://www.datasciencecentral.com/ 68
Source: IBM 69
Source: IBM 70
Source: IBM 71
Source: IBM 72
Source: Bernard Marr 73
74
Source: A digital age vision of insurance services, CRIF Reference
75
76
Source: William EL KAIM, Enterprise Architecture and Technology Innovation
77
Source: Domo
Big Data : Why Now?

78
Source: William EL KAIM, Enterprise Architecture and Technology Innovation
Use Cases

Source Big Data Analytics with Hadoop: Phillippe Julio 79


We are forecasting the future based on the
past

80
81
What is Data Science?

Data science is the extraction of knowledge from data.


It employs techniques and theories drawn from
many fields within the broad areas of mathematics,
statistics, information theory and information technology

Methods that scale to Big Data are


of particular interest in data science

82
83
Data Scientist Lifecycle

84
Source Big Data: Understanding How Data Powers Big Business
What is Machine Learning?

Machine learning is the subfield of computer science that gives


computers the ability to learn without being explicitly
programmed. Evolved from the study of pattern recognition
and computational learning theory in artificial intelligence,
machine learning explores the study and construction of
algorithms that can learn from and make predictions on data.

85
86
Source: http://www.datamation.com/
What is Deep Learning?

Deep learning is a branch of machine learning based on a set


of algorithms that attempt to model high level abstractions in
data. In a simple case, there might be two sets of neurons:
ones that receive an input signal and ones that send
an output signal.

87
Deep Learning

A set of machine-learning techniques based on neural


networking.
It enables computers to recognize items of interest in large
quantities of unstructured and binary data, and to deduce
relationships without needing specific models or
programming instructions.

88
Source: http://www.information-management.com/
What is Data Mining?

Data mining is the application of specific algorithms for


extracting patterns from data.
The distinction between the KDD process
and the data-mining step (within the process)
is a central point

From Data Mining to Knowledge Discovery in Databases; 1996


89
Data Mining

"Data mining" was introduced in the 1990s, but data


mining is the evolution of a field with a long history
Data mining roots are traced back along three family
lines:
Classical statistics,
Artificial intelligence,
Machine learning.

Source: http://www.unc.edu/~xluan/258/datamining.html 90
What is Business intelligence?

Business intelligence (BI) is an umbrella


term that includes the applications,
infrastructure and tools, and best practices
that enable access to and analysis of information
to improve and optimize decisions and performance.

Gartner
91
92
93
Source Big Data: Understanding How Data Powers Big Business
94
Source Big Data: Understanding How Data Powers Big Business
Data Analytics Lifecycle

95
Source Big Data: Understanding How Data Powers Big Business
96
Source: Data Science and Critical Thinking, A.Scroll
97
Types of Data Scientist

Data Businesspeople

Data Creatives

Data Developers

Data Researchers

Source: www.edureka.in/data-science 98
Source: www.edureka.in/data-science 99
Big Data is changing the world

100
Big Data Challenges

Source Introduction to Data Science: Edureka 101


102
Big Data Technology

103
Big Data Analytics Reference Architectures

104
Source: SoftServe
Relational Reference Architectures

105
Source: SoftServe
Non-Relational Reference Architectures

106
Source: SoftServe
Data Discovery: Non-Relational Architecture

107
Source: SoftServe
Business Reporting: Hybrid Architecture

108
Source: SoftServe
109
110
Source: Xiaoxiao Shi
What is Hadoop?

A scalable fault-tolerant distributed system


for data storage and processing

Completely written in java


Open source & distributed under Apache license

111
Hadoop Environment

112
Source: Hadoop in Practice; Alex Holmes
113
Source: HDInsight Essentials - Second Edition
Hadoop Platform

114
Source: Octo Technology
Hadoop Ecosystem

115
Source: Apache Hadoop Operations for Production Systems, Cloudera
116
Source: The evolution and future of Hadoop storage: Cloudera
117
Source: The evolution and future of Hadoop storage: Cloudera
Hadoop Cluster

118
Source: HDInsight Essentials - Second Edition
When to use Hadoop?

119
Hadoop for Big Data Analytics

120
Source: Microsoft
What is Mahout?

Mahout is a Java library which implements Machine Learning techniques for clustering, classification and recommendation

121
Mahout in Apache Software

122
Why Mahout?

Apache License
Good Community
Good Documentation
Scalable
Extensible
Command Line Interface
Java Library

123
Mahout Architecture

124
Use Cases

125
Data Science: Core Components

126
Source www.edureka.in/data-science
Data Science Implementation

127
Source www.edureka.in/data-science
Social Media Use Case

128
Source www.edureka.in/data-science
Social Media Use Case: Classification

129
Source www.edureka.in/data-science
What is Spark?

An open source big data processing framework


built around speed, ease of use,
and sophisticated analytics.

Spark enables applications in Hadoop clusters


to run up to 100 times faster in memory
and 10 times faster even when running on disk.

130
Spark Framework

131
132
Source: http://www.informationweek.com/big-data
Source: Jump start into Apache Spark and Databricks 133
What is Spark?
Framework for distributed processing.
In-memory, fault tolerant data structures
Flexible APIs in Scala, Java, Python, SQL, R
Open source

134
Spark History
Founded by AMPlab, UC Berkeley
Created by Matei Zaharia (PhD Thesis)
Maintained by Apache Software Foundation
Commercial support by Databricks

135
Why Spark?
Handle Petabytes of data
Significant faster than MapReduce
Simple and intuitive APIs
General framework
Runs anywhere
Handles (most) any I/O
Interoperable libraries for specific use-cases

136
Spark Platform

137
138
Hadoop Processing Paradigms

Source: Cloudera 139


Hadoop Processing Paradigms

Source: William EL KAIM, Enterprise Architecture and Technology Innovation 140


Hadoop Processing Paradigms Evolutions

141
Source: William EL KAIM, Enterprise Architecture and Technology Innovation
Introduction to
Machine Learning

Assoc. Prof. Dr. Thanachart Numnonda


Executive Director
IMC Institute
What is Machine Learning?
Machine Learning

Branch of Artificial Intelligence and Statistics

Focuses on prediction based on known properties

Used as a sub-process in Data Mining.

Data Mining focuses on discovering new,


unknown properties.
Machine Learning Algorithm

Decision Trees

Recommendation engines

k-Nearest Neighbours

Naive Bayes

Logistic Regression

K-Means

Fuzzy C-Means

Genetic algorithms

etc.

Categories

Supervised Learning

Unsupervised Learning

Semi-Supervised Learning

Reinforcement Learning

Supervised Learning
The correct classes of the training data are known
Supervised Learning

Learn from the Data


Data is already labelled

Expert, Crowd-sourced or case-based labelling of


data.
Applications

Handwriting Recognition
Spam Detection
Information Retrieval
Personalisation based on ranks
Speech Recognition
Supervised Learning Algorithm

Decision Trees

k-Nearest Neighbours

Random Forest

Naive Bayes

Logistic Regression

Perceptron and Multi-level Perceptrons

Neural Networks

SVM and Kernel estimation

Unsupervised Learning
The correct classes of the training data are not
known
Unsupervised Learning

Source www.edureka.in/data-science
Unsupervised Learning

Finding hidden structure in data


Unlabelled Data
SMEs needed post-processing to verify, validate and use
the output
Used in exploratory analysis rather than predictive

analytics
Applications
Pattern Recognition
Groupings based on a distance measure
Group of People, Objects, ...
Unsupervised Learning Algorithm

Clustering

k-Means, MinHash, Hierarchical Clustering


Hidden Markov Models

Feature Extraction methods

Neural Networks

Types of Algorithms

Classification

Assigning records to predefined groups


Clustering

Splitting records into groups based on similarity


Recommendation

Takes users' behavior and from that tries to find


items users might like.
Recommendation
Recommendation

Types of Recommender Systems

Content Based Recommendations


Collaborative Filtering Recommendations
User-User Recommendations
Item-Item Recommendations
Dimensionality Reduction (SVD) Recommendations
Applications

Products you would like to buy


People you might want to connect with
Potential Life-Partners
Etc.
Use Cases

Recommend friends/dates/products

Classify content into predefined groups

Find similar content based on object properties

Find associations/patterns in actions/behaviors

Identify key topics in large collections of text

Detect anomalies in machine output

Ranking search results

Use Cases

Rating a Review/Comment: Yelp

Fraud detection : Credit card Providers

Decision Making : e.g. Bank/Insurance sector

Sentiment Analysis

Speech Understanding iPhone with Siri

Face Detection Facebooks Photo tagging

Spam Email Detection

Image Search (Similarity)

Clustering (KMeans) : Amazon Recommendations

Classification : Google News

Clustering
Source: Mahout in Action
Sample Data

Source: Mahout in Action


Distance Measures

Source www.edureka.in/data-science
Distance Measures

Euclidean distance measure

Squared Euclidean distance measure

Manhattan distance measure

Cosine distance measure
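A plain-Python sketch of the four measures listed above for two small 2-D points (the values are illustrative only):

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(manhattan(p, q))          # 7.0
print(cosine_distance(p, q))    # ~0.008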


Distance Measures
K-Means Clustering

Source: www.edureka.in/data-science
Example of K-Means Clustering
http://stanford.edu/class/ee103/visualizations/kmeans/kmeans.html
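A minimal K-Means sketch using Spark MLlib (the library used later in this course); the 2-D points are made up and a SparkContext sc is assumed to be available:

from numpy import array
from pyspark.mllib.clustering import KMeans

points = sc.parallelize([array([1.0, 1.0]), array([1.5, 2.0]),
                         array([8.0, 8.0]), array([8.5, 9.0])])

model = KMeans.train(points, k=2, maxIterations=10)    # k = chosen number of clusters
print(model.clusterCenters)                            # the two learned centroids
print(model.predict(array([1.2, 1.3])))                # cluster index of a new point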
K-Means with different
distance measures

Source: Mahout in Action


Choosing number of clusters
Feature Extraction
Transforming data into vectors

Source: Mahout in Action


Google search: Life Learning

Source: www.edureka.in/data-science
Step 1.1: Term Frequency (TF)

Source: www.edureka.in/data-science
Step 1.2: Normalized Term Frequency

Source: www.edureka.in/data-science
Step 2: Inversed Document Frequency (IDF)

Source: www.edureka.in/data-science
Step 3: TF*IDF

Source: www.edureka.in/data-science
Step 4: Cosine similarity

Source: www.edureka.in/data-science
Step 4: Cosine similarity

Source: www.edureka.in/data-science
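A compact pure-Python sketch of Steps 1-4 (term frequency, inverse document frequency, TF*IDF and cosine similarity); the two toy documents, the query and the log base are assumptions for illustration:

import math

docs = {"d1": "life learning makes life better".split(),
        "d2": "machine learning with data".split()}
query = "life learning".split()

def tf(term, tokens):                       # normalized term frequency
    return tokens.count(term) / float(len(tokens))

def idf(term, docs):                        # inverse document frequency
    n = sum(1 for d in docs.values() if term in d)
    return math.log(len(docs) / float(n), 10)

def tfidf_vector(tokens, vocab, docs):
    return [tf(t, tokens) * idf(t, docs) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

vocab = sorted(set(query))
q_vec = tfidf_vector(query, vocab, docs)
for name, tokens in docs.items():
    print(name, cosine(q_vec, tfidf_vector(tokens, vocab, docs)))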
Classification
Classification Process

Source: Mahout in Action


Classification Process

Source: www.edureka.in/data-science
Keywords of Classification

Model

Training data

Test data

Training

Feature: the input variables


[1]

Variable: the raw input variables


[1]

Predictor variable

Target variable

[1] Isabelle Guyon, An Introduction to Variable and Feature Selection, Journal of Machine Learning Research, 2003
Classification Workflow
Step 1: Training the model

Define target variable.


Collect historical data.
Define predictor variables.
Select a learning algorithm.
Use the learning algorithm to train the model.
Step 2: Evaluating the model

Run test data.


Adjust the input (use different predictor variables, different
algorithms, or both).
Step 3: Using the model in production

Choosing predictor variables

In this case, a target variable is color-fill.


Different data sets
Choosing a learning algorithm

Decision Trees

k-Nearest Neighbours

Forest Trees

Naive Bayes

Logistic Regression

Stochastic gradient descent (SGD)

Regression

Regression analysis is a statistical


process for estimating the
relationships among variables.
Regression means to predict the

output value using training data.


Popular one is Logistic regression
(binary regression)
Classification vs. Regression

Source: www.edureka.in/data-science
Supervised Learning

Source www.edureka.in/data-science
Decision Trees

A method for approximating discrete valued target


variables.
When to Consider Decision Trees

Instances describable by attribute-value pairs

Target function is discrete valued

Disjunctive hypothesis may be required

Possibly noisy training data

Examples:

Equipment or medical diagnosis


Credit risk analysis
Modeling calendar scheduling preferences
Example Training Data
Model
Top-Down Induction of Decision Trees

Main loop:

1) A <- the "best" decision attribute for next node


2) Assign A as decision attribute for node
3) For each value of A, create new descendant of node
4) Sort training examples to leaf nodes
5) If training examples perfectly classified, Then
STOP, Else iterate over new leaf nodes
Entropy (Information Theory)

S is a sample of training examples

p+ is the proportion of positive examples in S

p- is the proportion of negative examples in S
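A small sketch computing Entropy(S) = -p+ log2(p+) - p- log2(p-) from the positive/negative counts (the 9/5 split is the familiar textbook example):

import math

def entropy(positives, negatives):
    total = float(positives + negatives)
    result = 0.0
    for count in (positives, negatives):
        p = count / total
        if p > 0:                        # 0 * log(0) is treated as 0
            result -= p * math.log(p, 2)
    return result

print(entropy(9, 5))    # ~0.940 bits
print(entropy(7, 7))    # 1.0 (maximum impurity)
print(entropy(14, 0))   # 0.0 (pure set)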

Information Gain: Attribute Selection Measure

Source: www.edureka.in/data-science
Attribute Selection Example

Source: www.edureka.in/data-science
Source: cs.upc.edu
Source: cs.upc.edu
Source: cs.upc.edu
Missing Values - some Solutions

1. Ignore the examples


2. Substitute the values (mode/median)
Source: cs.upc.edu
Classification performance measure

Source: www.edureka.in/data-science
Naïve Bayes Classifier
Bayes Classifiers

Bayesian classifiers use Bayes theorem, which says


Bayes Classifiers : Example

Source: whttp://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
Bayes Classifiers : Example

Source: whttp://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
Bayes Classifiers (Cont.)

Source: whttp://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
Probability model of Bayes Classifiers

Source: whttp://www.cs.ucr.edu/~eamonn/CE/Bayesian%20Classification%20withInsect_examples.pdf
Recommendation
Differences

Clustering algorithms
Decide on their own which distinctions appear to
be important
Classification algorithms
Learn to mimic examples of correct decisions
Make a single decision with a very limited set of
possible outcomes
Recommendation algorithms
Select and rank the best of many possible
alternatives
Collaborative Filtering framework

Popularized by Amazon and others

Uses historical data (ratings, clicks, and purchases) to

provide recommendations
User-based: Recommend items by finding similar users.
This is often harder to scale because of the dynamic nature
of users.
Item-based: Calculate similarity between items and make
recommendations. Items usually don't change much, so
this often can be computed offline.
Slope-One: A very fast and simple item-based
recommendation approach applicable when users have
given ratings (and not just boolean preferences).
Recommendation: Example
Recommendation: Example

Source: Mahout in Action


Measuring Similarity: Pearson correlation

Source: Mahout in Action
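A plain-Python sketch of the Pearson correlation between two users' ratings of the same items (the ratings below are made up):

import math

def pearson(x, y):
    n = float(len(x))
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

user1 = [5.0, 3.0, 2.5, 4.0, 1.0]   # hypothetical ratings for the same five items
user2 = [4.5, 3.5, 2.0, 4.0, 1.5]
print(pearson(user1, user2))        # ~0.95 -> very similar taste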


Measuring Similarity: Euclidean distance

Source: Mahout in Action


Recommendation: Architecture
Item-based Recommendation

Source: www.edureka.in/data-science
Item-based Recommendation

Source: Mahout in Action


Hands-On
Launch a virtual cluster
on Google Cloud
(Note: You can skip this session if you use your own
computer or another cloud service)
Virtual Cluster

This lab will use a virtual server to install a Cloudera


docker using the following features:

Ubuntu Server 14.04 LTS


4vCPU, 15 GB memory,50 GB SSD
cloud.google.com
Create Google Cloud Project
Select Compute Engine
Select Create Instance
Create an instance with the following configuration

lambdaarch-batch
lambdaarch-batch
Connect via SSH in browser window

lambdaarch-batch
Connect to the instance

lambdaarch-batch:~$
Config Firewall: Select the instance

lambdaarch-batch
Config Firewall: Select the Network
Config Firewall: Select Add Firewall rules
Create a firewall rule with the following configuration
Come back and so forth

lambdaarch-batch:~$
Hands-On: Install a Docker Engine
Update OS (Ubuntu)
$ sudo su -
# apt-get update
Docker Installation
# apt-get install docker.io
Install Cloudera Quickstart on
Docker Container
Pull Cloudera Quickstart
# docker pull cloudera/quickstart:latest
Verify the image was successfully pulled
# docker images
Run Cloudera quickstart

SYNTAX: docker run --hostname=quickstart.cloudera


--privileged=true -t -i [OPTIONS] [IMAGE]
/usr/bin/docker-quickstart

# docker run -v /root:/mnt --hostname=quickstart.cloudera


--privileged=true -t -i -p 8888:8888 -p 8880:8880 -p
9092:9092 -p 2181:2181 -p 11122:11122 cloudera/quickstart
/usr/bin/docker-quickstart
Successful running the Cloudera image
Hue: Finding the instance's external IP address

lambdaarch-batch
Login to Hue: http://external-ip-address:8080

Username: cloudera Password: cloudera


Jupyter
Open source, Interactive data science

December 2016
Mr.Aekanun Thongtae
aekanun@imcinstitute.com

255 aekanun@imcinstitute.com
ISSUE: Python 2.6.x cannot run many necessary Python libraries.

Adapted from: www.cloudera.com

256 aekanun@imcinstitute.com
What is Jupyter ?

- Server-client application: allows you to edit and run your notebooks via a web browser.
- Two main components:
  - A kernel is a program that runs and introspects the user's code.
  - The dashboard of the application: can also be used to manage the kernels.
- IPython is the original project; Fernando Pérez started developing it in late 2001.
  - Lastly, in 2014, Project Jupyter started as a spin-off project from IPython.
  - IPython is now the name of the Python backend, which is also known as the kernel.

source: www.datacamp.com

257 aekanun@imcinstitute.com
Why is Jupyter ?

- IDEs
- Can be saved and easily shared in .ipynb JSON format.
- Statistical data visualization, such as Seaborn

Source: Jonathan W., Jupyter Notebook for Data Science Teams, Infinite Skills, 2016

258 aekanun@imcinstitute.com
1. Getting Started (this page may be skipped if the
container has been already run.)
- Delete the existing containers

# docker rm $(sudo docker ps -a -q)


# docker ps -a

- Run a Cloudera container

# docker run -v /home:/mnt --hostname=quickstart.cloudera


--privileged=true -t -i -p 8888:8888 -p 8880:8880 cloudera/quickstart
/usr/bin/docker-quickstart

259 aekanun@imcinstitute.com
- Install necessary applications

[root@quickstart /]# yum install wget -y


[root@quickstart /]# cd /mnt
[root@quickstart /]# wget https://bootstrap.pypa.io/get-pip.py
[root@quickstart /]# python get-pip.py
[root@quickstart /]# yum install nano -y

260 aekanun@imcinstitute.com
2. Install the Anaconda, Jupyter and some Modules

[root@quickstart /]# wget https://repo.continuum.io/archive/Anaconda2-4.2.0-Linux-x86_64.sh

[root@quickstart /]# bash Anaconda2-4.2.0-Linux-x86_64.sh

Enter Yes / Accept throughout the installation.

261 aekanun@imcinstitute.com
[root@quickstart /]# export PYSPARK_DRIVER_PYTHON=/root/anaconda2/bin/jupyter

[root@quickstart /]# export PYSPARK_DRIVER_PYTHON_OPTS="notebook --NotebookApp.open_browser=False --NotebookApp.ip='*' --NotebookApp.port=8880"

[root@quickstart /]# export PYSPARK_PYTHON=/root/anaconda2/bin/python

[root@quickstart /]# nano /root/.bashrc

262 aekanun@imcinstitute.com
- Run Jupyter notebook with Pyspark

[root@quickstart /]# pyspark --packages com.databricks:spark-csv_2.10:1.2.0

263 aekanun@imcinstitute.com
Stop all Hadoops services
#! /usr/bin/env bash
/etc/init.d/zookeeper-server stop
/etc/init.d/hadoop-hdfs-datanode stop
/etc/init.d/hadoop-hdfs-journalnode stop
/etc/init.d/hadoop-hdfs-namenode stop
/etc/init.d/hadoop-hdfs-secondarynamenode stop
/etc/init.d/hadoop-httpfs stop
/etc/init.d/hadoop-mapreduce-historyserver stop
/etc/init.d/hadoop-yarn-nodemanager stop
/etc/init.d/hadoop-yarn-resourcemanager stop
/etc/init.d/hbase-master stop
/etc/init.d/hbase-rest stop
/etc/init.d/hbase-thrift stop
/etc/init.d/hive-metastore stop
/etc/init.d/hive-server2 stop
/etc/init.d/sqoop2-server stop
/etc/init.d/spark-history-server stop
/etc/init.d/hbase-regionserver stop
/etc/init.d/hue stop
/etc/init.d/impala-state-store stop
/etc/init.d/oozie stop
/etc/init.d/solr-server stop
/etc/init.d/impala-catalog stop
/etc/init.d/impala-server stop
Difficult to
SCALE
Image:integralnet.co.uk

265
265 aekanun@imcinstitute.com
266
266 aekanun@imcinstitute.com
Image: linkedin.com
267
267 aekanun@imcinstitute.com
Which topic would we
like to focus ?

Source: dailykos.com

268
268 aekanun@imcinstitute.com
269
269
Image: linkedin.com/pulse/dealing-data-structured-unstructured-way-ronald-baan
270
270 aekanun@imcinstitute.com
Image: rodneyrohrmann.blogspot.com

271
271 aekanun@imcinstitute.com
Semi-structured Data

Image: Thomas Eri et.al, Big Data Fundamentals: Concepts, Drivers & Techniques, Prentice Hall, 2016

272
272 aekanun@imcinstitute.com
horizontal

273
273 aekanun@imcinstitute.com
Hadoop 2

Source: Tomcy John and Pankaj Misra, Data Lake for Enterprises, Packt Publishing, 2017

274
Aekanun Thongtae, aekanun@imcinstitute.com Apr 2017
MapReduce-Architecture View

Source: Tomcy John and Pankaj Misra, Data Lake for Enterprises, Packt Publishing, 2017

275
Aekanun Thongtae, aekanun@imcinstitute.com Apr 2017
276

Hadoop Environment

276
Source: Hadoop in Practice; Alex Holmes Linux
277

Why Cloud Computing ?

277
Source:
278

Why Cloud Computing ?

278
Source: Thomas Eri et.al, Big Data Fundamentals: Concepts, Drivers & Techniques, Prentice Hall, 2016
279

Hadoop Cluster

279
Source: HDInsight Essentials - Second Edition
280
280
Source: hadoop.apache.org
281
281
282
282
MapReduce Tutorial: A Word Count Example of MapReduce

- Text file: example.txt

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

- Finding: the number of occurrences of each unique word

Source: edureka.co
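A single-machine sketch of the two phases for the example.txt line above (a real MapReduce job runs the map over HDFS blocks in parallel and shuffles by key before the reduce):

from collections import defaultdict

line = "Dear Bear River Car Car River Deer Car Bear"

# Map phase: emit (word, 1) for every word
mapped = [(word, 1) for word in line.split()]

# Shuffle & Reduce phase: group by key and sum the counts
counts = defaultdict(int)
for word, one in mapped:
    counts[word] += one

print(dict(counts))   # {'Dear': 1, 'Bear': 2, 'River': 2, 'Car': 3, 'Deer': 1}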
283
283
Conclusion: MapReduce

- Parallel Processing
- The time taken to process the data gets reduced by a tremendous
amount
- Data Locality
- Instead of moving data to the processing unit, we are moving
processing unit to the data

Source: edureka.co

284
284
Introduction
A Petabyte Scale Data Warehouse Using Hadoop

The Apache Hive data warehouse software facilitates


reading, writing, and managing large datasets residing in
distributed storage using SQL. Structure can be projected
onto data already in storage. A command line tool and JDBC
driver are provided to connect users to Hive.

285
Big data Architecture and Analytics Platform
Architecture Overview

(Diagram: Hive architecture on top of MapReduce and HDFS. The Hive CLI, Web UI and Thrift API (browsing, queries, DDL) pass HiveQL to the Parser, Planner and Execution engine, which runs as MapReduce jobs over HDFS; the MetaStore and SerDe (Thrift, Jute, JSON, ...) describe how the data is stored.)

Hive.apache.org 286
Big Data Hadoop Workshop Danairat T., danairat@gmail.com: Thanachart N., thanachart@imcinstitute.com April 2015
Introduction to
Apache Spark

Mr.Aekanun Thongtae
Big Data Consultant
IMC Institute
288

Typical Hadoop Job: Map-Reduce

Issues

1. Streaming data processing


to perform near real-time
analysis.

2. Interactive querying of
large datasets so a data
scientist may run ad-hoc
queries on data set.

288
Source: dzone.com
Challenges with big data analytics

- Many tasks will result in too much time spent starting / stopping JVMs and too many small files.

- Computational challenges: latency due to disk I/O

- Analytical challenges: limitation of in-memory computation


- Business Intelligence:
- Dice and slice the data cube
- Correlation Analysis:
- Many many and many dimensions
- 100C2 combinations

Adapted from Srinivas Duvvuri; Bikramaditya, Singhal,Spark for Data Science, Packt Publishing, 2016
Evolution of big data analytics

- Hadoop's MapReduce model could not fit in well with machine learning algorithms that are iterative in nature.

Source: Srinivas Duvvuri; Bikramaditya, SinghalSpark for Data Science, Packt Publishing, 2016
Evolution of big data analytics

- Spark: Instead of redesigning all the algorithms, a general-purpose


engine was needed that could be leveraged by most of the algorithms
for in-memory computation on a distributed computing platform.

Source: Srinivas Duvvuri; Bikramaditya, SinghalSpark for Data Science, Packt Publishing, 2016
292

292
Source: dzone.com
293

Various Components of Apache Spark

293
Source: dzone.com
A fast and general engine for large scale data processing

An open source big data processing framework built around


speed, ease of use, and sophisticated analytics.

Spark enables applications in Hadoop clusters to run up to


100 times faster in memory and 10 times faster even when
running on disk.

294
Big Data Ecosystem

Ingestion Storage Processing Presentation

295
Spark: History

Founded by AMPlab, UC Berkeley


Created by Matei Zaharia (PhD Thesis)
Maintained by Apache Software Foundation
Commercial support by Databricks

296
297
What is Spark?
Framework for distributed processing.

In-memory, fault tolerant data structures

Flexible APIs in Scala, Java, Python, SQL, R

Open source

Source: dzone.com

298
Source: databricks.com

299
Spark Platform

Source: MapR Academy


300
301

The Master controls how data is


partitioned, and it takes advantage
of data locality while keeping track
of all the distributed data
computation on the Slave
machines.

If a certain Slave machine is


unavailable, the data on that
machine is reconstructed on other
available machine(s).

Master is currently a single point


of failure, but it will be fixed in
upcoming releases.

301
Source: dzone.com
Apache Spark

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

302 aekanun@imcinstitute.com
303

The Driver sends Tasks to the empty slots on the Executors when work has to be done:

303
Source: databricks.com
Apache Spark

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

304 aekanun@imcinstitute.com
RDD & Partition
- Though Spark sets the number
of partitions automatically
based on the cluster, we have
the liberty to set it manually by
passing it as a second
argument to the parallelize
function (for example,
sc.parallelize(data, 3)).
- A diagrammatic representation
of an RDD which is created
with a dataset with, say, 14
records (or tuples) and is
partitioned into 3, distributed
across 3 nodes:
The Spark engine
Source: Jump start into Apache Spark and Databricks
Big data Architecture and Analytics Platform
What is a RDD?

Resilient: if the data in memory (or on a node) is lost, it


can be recreated.
Distributed: data is chunked into partitions and stored in
memory across the cluster.
Dataset: initial data can come from a table or be created
programmatically

Big data Architecture and Analytics Platform


RDD:

Fault tolerance
Immutable
Three methods for creating RDD:
Parallelizing an existing collection
Referencing a dataset
Transformation from an existing RDD
Types of files supported:
Text files
Sequence Files
Hadoop InputFormat

Big data Architecture and Analytics Platform


RDD: Operations

Transformations: transformations are lazy (not computed


immediately)
Actions: the transformed RDD gets recomputed when an
action is run on it (default)

Big data Architecture and Analytics Platform


Direct Acyclic Graph (DAG)

Big data Architecture and Analytics Platform


Spark:Transformation

Big data Architecture and Analytics Platform


Spark:Transformation

Big data Architecture and Analytics Platform


Single RDD Transformation

Source: Jspark: A brain-friendly introduction


Big data Architecture and Analytics Platform
Multiple RDD Transformation

Source: Jspark: A brain-friendly introduction


Big data Architecture and Analytics Platform
Pair RDD Transformation

Source: Jspark: A brain-friendly introduction


Big data Architecture and Analytics Platform
Spark:Actions

Big data Architecture and Analytics Platform


Spark:Actions

Big data Architecture and Analytics Platform


Transformation & Action

Source: dzone.com

319
DEMO: Transformation & Action

(Diagram: an input RDD (RDD#1) is split into six block RDDs, block1_rdd ... block6_rdd (RDD#2-6). Source: dzone.com)

320
RDD Creation

hdfsData = sc.textFile("hdfs://data.txt")

Source: Pspark: A brain-friendly introduction


Big data Architecture and Analytics Platform
Hands-On: Word Count
(LAB I)

Focus: Transformation & Action

Big data Architecture and Analytics Platform


Workflow of Word Count

(Diagram: an input RDD (RDD#1) is split into six block RDDs, block1_rdd ... block6_rdd (RDD#2-6). Source: dzone.com)

323
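A sketch of the word count with RDD transformations and one action, assuming the PySpark shell from the Jupyter setup (sc available) and a hypothetical input path:

text_rdd = sc.textFile("hdfs:///user/cloudera/input.txt")       # hypothetical path

counts = (text_rdd.flatMap(lambda line: line.split())           # transformation: lines -> words
                  .map(lambda word: (word, 1))                  # transformation: (word, 1) pairs
                  .reduceByKey(lambda a, b: a + b))              # transformation: sum counts per word

print(counts.takeOrdered(10, key=lambda kv: -kv[1]))             # action: top 10 most frequent words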
Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Hands-On: Make a format of key,value

(LAB II)

Focus: Data Preparation

Big data Architecture and Analytics Platform


Flight Details Data

http://stat-computing.org/dataexpo/2009/the-data.html

Big data Architecture and Analytics Platform
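A sketch of the key/value preparation for this data set, assuming the dataexpo column order (UniqueCarrier at index 8, ArrDelay at index 14) and a hypothetical HDFS path:

raw = sc.textFile("hdfs:///user/cloudera/2008.csv")               # hypothetical path

header = raw.first()                                              # the attribute line
data = raw.filter(lambda line: line != header)                    # keep only the data/fact lines

pairs = (data.map(lambda line: line.split(','))
             .filter(lambda f: f[14] not in ('NA', ''))           # drop rows with missing ArrDelay
             .map(lambda f: (f[8], float(f[14]))))                # (UniqueCarrier, ArrDelay)

print(pairs.take(5))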


Big data Architecture and Analytics Platform
Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Header: Attributes

Data/Fact

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Each element of RDD is one line of the file

Result from take,collect operation is data structure of List

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

text ***

text ***

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter
text
Data Structure: List

Data Structure: List

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Hands-On: Make Data Analysis

(LAB III)

Find Best/Worst Airlines


*** Please reuse code of Lab II

Big data Architecture and Analytics Platform
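Reusing the (carrier, arrival delay) pairs from the Lab II sketch above, a possible way to rank airlines by average arrival delay:

# pairs: RDD of (UniqueCarrier, ArrDelay) built as in the Lab II sketch
avg_delay = (pairs.mapValues(lambda d: (d, 1))
                  .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                  .mapValues(lambda s: s[0] / s[1]))              # total delay / number of flights

print(avg_delay.takeOrdered(3, key=lambda kv: kv[1]))             # best: lowest average delay
print(avg_delay.takeOrdered(3, key=lambda kv: -kv[1]))            # worst: highest average delay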


Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Platform: Cloudera/Dataproc
Tools: Jupyter

Big data Architecture and Analytics Platform


Spark SQL
Mr.Aekanun Thongtae

Jan 2017

Big data Architecture and Analytics Platform


Spark Platform

Big data Architecture and Analytics Platform


Introduction to SparkSQL
Spark SQL is a Spark module for structured data processing

Spark SQL supports the execution of SQL queries written


using either a basic SQL syntax or HiveQL
Catalyst optimizer: built inside Spark SQL:
Built-in mechanism to fetch data from some external data
sources. For example, JSON, JDBC, Parquet, MySQL, Hive,
PostgreSQL, HDFS, S3, and so on
Designed to optimize all phases of query execution: analysis,
logical optimization, physical planning, and code generation to
compile parts of queries to Java bytecode.

Three ways to interact with Spark SQL: SQL, DataFrame API


and Dataset API.

344
Introduction to DataFrame
DataFrame is an immutable distributed collection of data.
Unlike an RDD, data is organized into named columns,
like a table in a relational database.

DataFrame API was built as one more level of


abstraction on top of Spark SQL.
Allows you to use a two-dimensional data structure that usually has labelled rows and columns, called a DataFrame (as in R and Python pandas)

The DataFrame API builds on the Spark SQL query


optimizer to automatically execute code efficiently on a
cluster of machines.
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

345
Architecture: Spark SQL & DataFrame

346

Source: Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
RDDs vs. DataFrames: Similarities
Both are fault-tolerant, partitioned data abstractions in
Spark

Both can handle disparate data sources

Both are lazily evaluated (execution happens when an


output operation is performed on them), thereby having
the ability to take the most optimized execution plan

Both APIs are available in all four languages: Scala,


Python, Java, and R

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

347
Source: Jump start into Apache Spark and Databricks
348
RDDs vs. DataFrames: Differences
DataFrames are a higher-level abstraction than RDDs.

The definition of RDD implies defining a Directed Acyclic


Graph (DAG) whereas defining a DataFrame leads to the
creation of an Abstract Syntax Tree (AST). An AST will
be utilized and optimized by the Spark SQL catalyst
engine.

RDD is a general data structure abstraction whereas a


DataFrame is a specialized data structure to deal with
two-dimensional, table-like data.

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

349
When to use RDDs?
Low-level transformation and actions and control on your
dataset;
Data is unstructured, such as media streams or streams
of text;
Manipulate your data with functional programming
constructs than domain specific expressions;
Don't care about imposing a schema, such as columnar
format, while processing or accessing data attributes by
name or column; and
Forgo some optimization and performance benefits
available with DataFrames and Datasets for structured
and semi-structured data.
Adapted from databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

350
Get Access to the SparkSQL
Use DataFrame API to entry point: SQLContext or
HiveContext

SQLContext: RDDs, JSON, JDBC

HiveContext: Hive Tables

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

351
Creating DataFrames: RDDs

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016
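A minimal sketch of turning an RDD into a DataFrame via Row objects, assuming sqlContext from the PySpark shell (the column names and values are made up):

from pyspark.sql import Row

rdd = sc.parallelize([("white", 5), ("green", 5), ("yellow", 6)])
row_rdd = rdd.map(lambda t: Row(color=t[0], length=t[1]))

df = sqlContext.createDataFrame(row_rdd)
df.printSchema()
df.show()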

352
Creating DataFrames: JSON
HDFS (Hadoop Ecosystem)

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

353
Creating DataFrames: JDBC

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

354
SparkSQL can leverage the Hive
metastore
Hive Metastore can also be leveraged by a wide
array of applications
Spark
Hive
Impala
Available from HiveContext

355
SparkSQL: HiveContext

Big data Architecture and Analytics Platform


Big data Architecture and Analytics Platform
Hands-on: Spark SQL

Big data Architecture and Analytics Platform


DataFrames Operations
(LAB I)

Big data Architecture and Analytics Platform


DataFrames Operations
Create a local collection of colors first

>>> colors = ['white','green','yellow','red','brown','pink']


Distribute the local collection to form an RDD
Apply map function on that RDD to get another RDD containing
colour, length tuples and convert that RDD to a DataFrame

>>> color_df = sc.parallelize(colors)


.map(lambda x:(x,len(x))).toDF(['color','length'])

Check the object type

>>> color_df
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

360
DataFrames Operations
Check the schema.

>>> color_df.dtypes

Check row count.

>>> color_df.count()

Look at the table contents. You can limit displayed rows by


passing parameter to show .

>>> color_df.show(2)
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

361
DataFrames Operations
List out column names.

>>> color_df.columns

Drop a column. The source DataFrame color_df remains the


same.
Spark returns a new DataFrame which is being passed to show.

>>> color_df.drop('length').show()

Convert to JSON format.

>>> color_df.toJSON().first()
Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

362
DataFrames Operations

The following example selects the colors having a length of four or


five only and label the column as "mid_length" filter.

>>> color_df.filter(color_df.length.between(4,5))
.select(color_df.color.alias("mid_length")).show()

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

363
DataFrames Operations

This example uses multiple filter criteria.

>>> color_df.filter(color_df.length > 4) .filter(color_df[0]!="white").show()

Sort the data on one or more columns with sort.

A simple single-column sort in the default (ascending) order.

>>> color_df.sort('color').show()

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

364
DataFrames Operations

First filter colors of length more than 4 and then sort on multiple
columns.
The Filtered rows are sorted first on the column length in default
ascending order.
Rows with same length are sorted on color in descending order.

>>> color_df.filter(color_df['length']>=4).sort('length',
'color', ascending=[True, False]).show()

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

365
DataFrames Operations
You can use orderBy instead, which is an alias to sort.

>>> color_df.orderBy('length','color').take(4)

Alternative syntax, for single or multiple columns.

>>> color_df.sort(color_df.length.desc(), color_df.color.asc()).show()

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

366
DataFrames Operations

GroupBy

>>> color_df.groupBy('length').count().show()

Adapted from Srinivas Duvvuri and Bikramaditya Singhal,Spark for Data Science, Packt Publishing, 2016

367
Large-Scale Machine Learning
using
Apache Spark MLlib & ML Pipeline

Mr.Aekanun Thongtae
IMC Institute
aekanun@imcinstitute.com

Sep 2017

368 aekanun@imcinstitute.com
Apache Spark

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

369 aekanun@imcinstitute.com
Apache Spark

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

370 aekanun@imcinstitute.com
What is MLlib?

Source: MapR Academy


371 aekanun@imcinstitute.com
What is MLlib?
MLlib is a Spark subproject providing machine
learning primitives:
initial contribution from AMPLab, UC Berkeley
shipped with Spark since version 0.8
33 contributors

372 aekanun@imcinstitute.com
MLlib Algorithms

Classification: logistic regression, linear support vector


machine(SVM), naive Bayes
Regression: generalized linear regression (GLM)
Collaborative filtering: alternating least squares (ALS)
Clustering: k-means
Decomposition: singular value decomposition (SVD),
principal component analysis (PCA)

373 aekanun@imcinstitute.com
MLlib Algorithms

Source: Mllib:Spark's Machine Learning Library, A. Talwalkar


374 aekanun@imcinstitute.com
MLlib: Benefits
Part of Spark
Scalable
Support: Python, Scala, Java
Broad coverage of applications & algorithms
Rapid developments in speed & robustness

375 aekanun@imcinstitute.com
ML Pipeline
- MLlib's goal is to make practical machine learning (ML)
scalable and easy.

- Databricks, jointly with AMPLab, UC Berkeley,


continues this effort by introducing a pipeline API to
MLlib for easy creation and tuning of practical ML
pipelines.

Source: databricks.com

376 aekanun@imcinstitute.com
ML Pipeline
- Leverage on Spark SQL

- A feature transformation can be viewed as appending


new columns created from existing columns.

Source: databricks.com

377 aekanun@imcinstitute.com
Hands-On: Basic Predictive Analytics
with MLlib and ML pipeline

(LAB I)
Focus: Process of the pipeline

Big data Architecture and Analytics Platform


Workflow: Prediction of Product Name

1. Transform the categorical variable into a numerical one.

Spark ML Pipeline
2. Combine the selected columns into a single vector column.

3. Define an algorithm.

4. Pipeline.

Training Set 5. Launch the pipeline and get a model.

Testing Set 6. Model Deployment (a minimal PySpark sketch of these steps follows below).

379
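The six steps above can be sketched with PySpark's ML Pipeline API. This is only a minimal illustration, not the exact lab code: the DataFrames train_df and test_df, the column names fruit, color and weight, and the choice of DecisionTreeClassifier are assumptions standing in for whatever the lab actually uses.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# 1. Transform the categorical label and feature into numerical indexes.
label_indexer = StringIndexer(inputCol="fruit", outputCol="indexedLabel")
color_indexer = StringIndexer(inputCol="color", outputCol="color_index")

# 2. Combine the selected columns into a single vector column.
assembler = VectorAssembler(inputCols=["color_index", "weight"], outputCol="features")

# 3. Define an algorithm (any pyspark.ml classifier can be plugged in here).
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")

# 4. Pipeline: chain all stages together.
pipeline = Pipeline(stages=[label_indexer, color_indexer, assembler, dt])

# 5. Launch the pipeline on the training set and get a model.
model = pipeline.fit(train_df)

# 6. Model deployment: apply the model to the testing set.
model.transform(test_df).select("fruit", "prediction").show()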
Define training set

Label / Target Variable

Big data Architecture and Analytics Platform


1. Transform a categorical variable to a numerical one

0.0 : apple
1.0 : pineapple
2.0 : grape

Big data Architecture and Analytics Platform


2. Combine the selected columns into a single vector column

Big data Architecture and Analytics Platform


3. Define an algorithm

Big data Architecture and Analytics Platform


4. Pipeline

5. Launch the pipeline and get a model

Big data Architecture and Analytics Platform


6. Model Deployment

Big data Architecture and Analytics Platform


Prediction of Loan Payment
using JumpStart Platform

Mr. Aekanun Thongtae


Big Data Consultant
IMC Institute
June 2017
Step 1:
Business Understanding

Images: Novellogycs, SA Global Advisors, Udemy Blog

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Business Understanding

Lending Club is a marketplace for personal loans that matches borrowers who are
seeking a loan with investors looking to lend money and make a return.

Current - Outstanding

Due Date → Grace Period: 15 Days → Late: 16 - 30 Days → Late: 31 - 120 Days → Default → Charged Off

Source: lendingclub.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


How Lending Club Works: Borrower

- Lending Club evaluates each borrower's credit score using past historical data and
assigns an interest rate to the borrower.

- A higher interest rate means that the borrower is riskier and less likely to
pay back the loan.

- A lower interest rate means that the borrower has a good credit history and is
more likely to pay back the loan.

- If the borrower accepts the interest rate, then the loan is listed on the
Lending Club marketplace.

Source: lendingclub.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


How Lending Club Works: Investor

- Investors are primarily interested in receiving a return on their investments.

- Approved loans are listed on the Lending Club website, where qualified
investors can browse recently approved loans, the borrower's credit score,
the purpose for the loan, and other information from the application.

- Once they're ready to back a loan, they select the amount of money they
want to fund.

- Once a loan's requested amount is fully funded, the borrower receives the
money they requested minus the origination fee that Lending Club charges.

Source: lendingclub.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Some Business Questions: Investor

Fully paid / Charge off (Default) ?

Late payment (Past Due) / Not ? How much principal has been paid so far ?

How much interest?

Current - Outstanding

Due Date → Grace Period: 15 Days → Late: 16 - 30 Days → Late: 31 - 120 Days → Default → Charged Off

Adapted from lendingclub.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Business Understanding

- Objective of this case study: Loan Payment

- Prediction of whether the loan will be fully paid or


defaulted

Source: rstudio-pubs-static.s3.amazonaws.com
Image: weiminwang.blog

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Grace Period

- How a customer is treated on a past-due payment


will often come down to their payment history; if there
is a pattern of late payments, the grace period may be
shortened or removed.

Sources: investopedia.com
Images: nrilifeinsurance.com, lendingmemo.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Step 2:
Data Understanding

Images: Funnelholic

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Data Understanding

- Focus on data on approved loans only.

- The raw Lending Club data contains 60 fields for each


loan originated. (https://resources.lendingclub.com/LCDataDictionary.xlsx)
- Dataset: 1,418,062 records

- Each loan consists of:


- All the details of the loans at the time of their issuance.

- Information related to the latest status of the loan.

Source: rstudio-pubs-static.s3.amazonaws.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Raw Data

Columns/
Field Names

Data/Fact

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Step 3:
Data Preprocessing

Images: Funnelholic

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Architecture and Flow of Data Processing

Loans issuance Loans issuance


and latest status and latest status
(LendingClub) (Kaggle)
Load to DataFrame

Load to DataFrame

Union
Output
Core
rawweb_df rawkaggle_df raw_df

398 aekanun@imcinstitute.com
Architecture and Flow of Data Processing

raw_df Input - Select only related


columns.
- Remove rows that
contain missing values.
Core
DataFrame
Load to

Register to Temp.
df_no_missing: table
- ONLY Columns that are related to
prediction.
- Clean
df
SQL

399 aekanun@imcinstitute.com
Architecture and Flow of Data Processing

Make an extraction of the month.
Input: df

SQL
Hive Table
Write to

DW (Parquet Format)

Table: personal_loan
- earliest_cr_line, last_credit_pull_d have already
been extracted as month only.
SQL

400 aekanun@imcinstitute.com
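A possible PySpark sketch of this month-extraction step, assuming the raw dates are strings such as "Dec-2015" and that df_no_missing is the cleaned DataFrame from the previous step (the actual lab code may differ):

from pyspark.sql.functions import split

# Keep only the month part of the two date columns, e.g. 'Dec-2015' -> 'Dec'.
month_df = (df_no_missing
    .withColumn("earliest_cr_line", split("earliest_cr_line", "-")[0])
    .withColumn("last_credit_pull_d", split("last_credit_pull_d", "-")[0]))

# Write the result to the data warehouse as a Parquet-backed Hive table.
month_df.write.mode("overwrite").format("parquet").saveAsTable("personal_loan")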
Architecture and Flow of Data Processing

Crosstab
(Frequency)

DW (Parquet Format)
Connect to

Table: personal_loan

Basic Statistics.
SQL

401 aekanun@imcinstitute.com
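For example, with a HiveContext (sqlContext) the warehouse table can be read back and summarized. The column names used here (grade, loan_status, loan_amnt, annual_inc) come from the Lending Club data dictionary and are illustrative only:

loan_df = sqlContext.table("personal_loan")

# Frequency crosstab between two categorical columns.
loan_df.stat.crosstab("grade", "loan_status").show()

# Basic statistics (count, mean, stddev, min, max) for numeric columns.
loan_df.describe("loan_amnt", "annual_inc").show()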
Architecture and Flow of Data Processing

DW (Parquet Format)

Table: personal_loan

SQL

Temp. table
Register to
Normalization
raw_df Output - annual_inc Input
crunched_data
- loan_amnt
SQL

402 aekanun@imcinstitute.com
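One way to implement this normalization is min-max scaling of the two columns. The sketch below assumes the crunched_data DataFrame from the diagram and reuses the name raw_df for the output, as the diagram does; the lab's actual normalization formula may differ:

from pyspark.sql.functions import col, min as sql_min, max as sql_max

# Compute the minimum and maximum of each column once.
stats = crunched_data.agg(
    sql_min("annual_inc").alias("inc_min"), sql_max("annual_inc").alias("inc_max"),
    sql_min("loan_amnt").alias("amt_min"), sql_max("loan_amnt").alias("amt_max")).collect()[0]

# Rescale both columns to the [0, 1] range.
raw_df = (crunched_data
    .withColumn("annual_inc", (col("annual_inc") - stats["inc_min"]) / (stats["inc_max"] - stats["inc_min"]))
    .withColumn("loan_amnt", (col("loan_amnt") - stats["amt_min"]) / (stats["amt_max"] - stats["amt_min"])))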
Features

403 aekanun@imcinstitute.com
Data Cleansing & Transformation

- Not all of the fields are intuitively useful for our


learning models, such as
- the loan ID and the month the last payment was received.
- URL

- Missing Values:
- Remove all tuples that contain them.

- Categorical Values such as address state:


- Transform them to Vector space

- Label Values:
- Values of charged off and fully paid are selected, but others are
removed

Source: stanford.edu

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017
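A minimal PySpark sketch of these cleansing rules; the exact list of dropped columns differs in the lab, and id, url and last_pymnt_d are just examples taken from the Lending Club data dictionary:

from pyspark.sql.functions import col

cleaned_df = (raw_df
    .drop("id").drop("url").drop("last_pymnt_d")   # fields that are not useful for learning
    .dropna()                                      # remove all tuples that contain missing values
    .filter(col("loan_status").isin("Fully Paid", "Charged Off")))  # keep only the two label values

# Categorical columns such as addr_state are indexed and vectorized later, inside the ML Pipeline.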


Training Data/Testing Data: 70/30

Training Data
252,531 Records

Testing Data
Records

Columns and their


data types

Source: lendingclub.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Step 4:
Data Modeling

Images: InfoAdvisors Blog

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Architecture and Flow of Data Processing

Testing set

raw_df Split

Training set SQL

Input
Transformation
Output - Numerical
- Vectors
Algorithm:
RandomForestClassifier

407 aekanun@imcinstitute.com
Training Data/Testing Data

Training Data
322,640 Records

Testing Data
3383 Records

Columns and their


data types

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Code for Data Transformation

Vector Space

SparseVector(4, {1: 1.0})

DenseVector([0.0, 1.0,
0.0, 0.0, 4.0, 40.0])

Features:
18
Columns

409 aekanun@imcinstitute.com
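The two vector types shown above can be constructed directly. In Spark 2.x they live in pyspark.ml.linalg (older releases use pyspark.mllib.linalg):

from pyspark.ml.linalg import SparseVector, DenseVector

# SparseVector(size, {index: value}) stores only the non-zero entries,
# e.g. a one-hot encoded categorical value:
SparseVector(4, {1: 1.0})                       # equivalent to (0.0, 1.0, 0.0, 0.0)

# DenseVector keeps every entry explicitly:
DenseVector([0.0, 1.0, 0.0, 0.0, 4.0, 40.0])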
Architecture and Flow of Data Processing

Algorithm:
RandomForestClassifier

Output

Model

Testing set
(Not yet transformed)

410 aekanun@imcinstitute.com
Code for Training Model

Source: lendingclub.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017
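The training code itself appears only as a screenshot. A minimal reconstruction with the ML Pipeline API might look as follows, where label_indexer and assembler stand for the transformation stages built in the previous step and training_set is the training DataFrame (these names are assumptions):

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=20)

pipeline = Pipeline(stages=[label_indexer, assembler, rf])
model = pipeline.fit(training_set)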


What is the Result Model ?

- Modeling a borrower's credit risk.


- A credit risk is the risk of default on a debt that may arise from a
borrower failing to make required payments.

- Prediction: Fully paid/Charged Off

Source: stanford.edu
Images: dimensionless.in

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Step 5:
Model Evaluation

Images: classeval.wordpress.com

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Evaluation - Confusion Matrix

Confusion Matrix Predicted

Positive Negative

Positive TP FN Sensitivity/Recall
Actual
Negative FP TN Specificity
(True Negative Rate)

Positive Negative
Predictive Predictive
(Precision)

Source: analyticsvidhya.com

414 aekanun@imcinstitute.com
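Given a predictions DataFrame with indexedLabel and prediction columns (as in the evaluation slide that follows), the four cells and the derived metrics can be computed directly. Treating 0.0 (Fully Paid) as the positive class is an assumption of this sketch:

tp = predictions.filter("indexedLabel = 0.0 AND prediction = 0.0").count()
fn = predictions.filter("indexedLabel = 0.0 AND prediction = 1.0").count()
fp = predictions.filter("indexedLabel = 1.0 AND prediction = 0.0").count()
tn = predictions.filter("indexedLabel = 1.0 AND prediction = 1.0").count()

accuracy    = float(tp + tn) / (tp + tn + fp + fn)
sensitivity = float(tp) / (tp + fn)   # recall of the positive class
specificity = float(tn) / (tn + fp)   # true negative rate
precision   = float(tp) / (tp + fp)   # positive predictive value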
Our Evaluation from Testing Data

- 0.0 means Fully Paid.

- 1.0 means Charged Off.

- IndexedLabel is the label/target
collected from the observations.

- Prediction is the label/target
predicted by the model.

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Training completed in 145.72 sec.

Trained 322,640 records!

Accuracy:

0.69
416 aekanun@imcinstitute.com
Believe it or not?

Positive Predictive Values:

0.89
417 aekanun@imcinstitute.com
Proportion of all actual (Fully Paid) cases that were detected

Sensitivity of Fully Paid:

0.71
418 aekanun@imcinstitute.com
Model Tuning

- Tuning algorithms parameter


- https://spark.apache.org/docs/latest/mllib-ensembles.html#random-f
orests
-

- Check whether there are imbalanced data.


- https://weiminwang.blog/2016/06/09/pyspark-tutorial-building-a-rand
om-forest-binary-classifier-on-unbalanced-dataset/?blogsub=confir
ming#subscribe-blog
-

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017
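As a sketch of the second point, the class distribution can be checked on the raw labels and, if it is heavily skewed, the majority class can be down-sampled with stratified sampling. The fractions below are illustrative, and 'Fully Paid' is only assumed to be the majority class:

# How imbalanced are the classes?
raw_df.groupBy("loan_status").count().show()

# One simple remedy: down-sample the majority class with stratified sampling.
balanced_df = raw_df.sampleBy("loan_status",
                              fractions={"Fully Paid": 0.2, "Charged Off": 1.0},
                              seed=42)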


Step 6:
Model Deployment

Images: http://machinelearningmastery.com/

Introduction to Machine Learning Aekanun Thongtae, aekanun@imcinstitute.com March 2017


Source: machinelearningmastery.com

421 aekanun@imcinstitute.com
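Deployment here simply means applying the fitted pipeline model to new, unlabeled loans and, optionally, persisting it for later use. The DataFrame new_loans_df and the HDFS path are placeholders:

# Score new loan applications with the trained model.
model.transform(new_loans_df).select("prediction").show(5)

# Persist the fitted PipelineModel for reuse (Spark 2.x ML persistence).
model.save("hdfs:///user/root/loan_rf_model")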
Hands-on
Clustering on Network Interaction

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


Part 1: Data Understanding and Data Preparation
PySpark: Organizing data in Linux filesystem

! cd /mnt
! rm -rf /mnt/kddcup.data*
! wget
http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data.gz
! gunzip kddcup.data.gz
! ls /mnt/kddcup.data* -lh

PySpark: Organizing data in HDFS

! hdfs dfs -rm /user/root/kddcup.data


! hdfs dfs -rm -r /user/root/sample_standardized
! hdfs dfs -put kddcup.data /user/root

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
PySpark: Import necessary modules

import sys
import os

try:
from pyspark import SparkContext, SparkConf
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.feature import StandardScaler
print ("Successfully imported Spark Modules")
except ImportError as e:
print ("Can not import Spark Modules", e)
sys.exit(1)

from collections import OrderedDict


from numpy import array
from math import sqrt

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
PySpark: Set up environment

max_k = 20
data_file = '/user/root/kddcup.data'

def parse_interaction(line):
"""
Parses a network data interaction.
"""
line_split = line.split(",")
clean_line_split = [line_split[0]]+line_split[4:-1]
return (line_split[-1], array([float(x) for x in
clean_line_split]))

def distance(a, b):


"""
Calculates the euclidean distance between two numeric RDDs
"""
return sqrt(
a.zip(b)
.map(lambda x: (x[0]-x[1]))
.map(lambda x: x*x)
.reduce(lambda a,b: a+b)
)

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
PySpark: Set up environment (Cont.)

def dist_to_centroid(datum, clusters):


"""
Determines the distance of a point to its cluster centroid
"""
cluster = clusters.predict(datum)# model evaluation
centroid = clusters.centers[cluster]
return sqrt(sum([x**2 for x in (centroid - datum)]))

def clustering_score(data, k):


clusters = KMeans.train(data, k, maxIterations=10, runs=5,
initializationMode="random")# modeling
result = (k, clusters, data.map(lambda datum:
dist_to_centroid(datum, clusters)).mean())
print "Clustering score for k=%(k)d is %(score)f" % {"k": k,
"score": result[2]}
return result

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
PySpark: Load raw data and create RDD

raw_data = sc.textFile(data_file)

PySpark: Make data transformation (1)

labels = raw_data.map(lambda line:


line.strip().split(",")[-1])

PySpark: Count by all different labels and print them

# count by all different labels and print them decreasingly


label_counts = labels.countByValue()
print(label_counts)

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
PySpark: Make data transformation (2) - remove and add
some attributes/features

parsed_data = raw_data.map(parse_interaction)

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
PySpark: Make data transformation (3) - make data
standardization

parsed_data_values = parsed_data.values().cache()

standardizer = StandardScaler(True, True)

standardizer_model =
standardizer.fit(parsed_data_values)

standardized_data_values =
standardizer_model.transform(parsed_data_values)

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
Part 2: Modeling: Calculate clustering score and
Model Evaluation:
PySpark: Call method, clustering_score() with standardized
data, Predefined k-value (10-20)

scores = list(map(lambda k:
clustering_score(standardized_data_values, k),
range(10,max_k+1,10))) # Call predefined functions; list() so scores can be reused below

# Obtain min score k


min_k = min(scores, key=lambda x: x[2])[0]

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
Part 3: Model Deployment

best_model = min(scores, key=lambda x: x[2])[1]


cluster_assignments_sample = standardized_data_values.map(lambda datum:
str(best_model.predict(datum))+","+",".join(map(str,datum))).sample(False,0.05)

Source: Jose A Dianes, KDD Cup 99 - PySpark, github.com


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
Hands-on
Recommendation of MovieLens
using Jupyter

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


The recommendation engine approach

An approach to build the recommendation engine using Spark:

1. Start the Spark environment such as Hadoop, HDFS,


Jupyter, etc.
2. Load the data.
3. Explore the data source.
4. Use the MLlib recommendation engine module to generate
the recommendations using ALS instance.
5. Generate the recommendations.
6. Evaluate the model.

Adapted from: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016
Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
Preparing Large Dataset
http://grouplens.org/datasets/movielens/

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


Part 1: Data Understanding and Data Preparation
PySpark: Organizing data in Linux filesystem

! cp -rf /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/


! mkdir -p /mnt/dataset/ml-100k
! ls -lR /mnt/dataset
! cd /mnt/dataset
! rm -rf u.data
! wget http://files.grouplens.org/datasets/movielens/ml-100k/u.data
! mv u.data ./dataset/ml-100k/
! ls -l ./dataset/ml-100k/
! head -5 ./dataset/ml-100k/u.data
! wc -l ./dataset/ml-100k/u.data

PySpark: Organizing data in HDFS

! hdfs dfs -rm ./u.data


! hdfs dfs -put ./dataset/ml-100k/u.data ./
! hdfs dfs -ls ./

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
PySpark: Generating RDD

data = sc.textFile("./u.data")

PySpark: Import necessary modules for make recommendation

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

PySpark: Transform RDD to DataFrame

ratings = data.map(lambda l: l.split('\t')) \
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))

df = ratings.toDF(['user','product','rating'])

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016


Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017
PySpark: Data Exploratory - Distribution of Ratings

import numpy as np
import matplotlib.pyplot as plt
n_groups = 5
x = df.groupBy("rating").count().select('count')
xx = x.rdd.flatMap(lambda x: x).collect()
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 1
opacity = 0.4
rects1 = plt.bar(index, xx, bar_width,
alpha=opacity,
color='b',
label='ratings')
plt.xlabel('ratings')
plt.ylabel('Counts')
plt.title('Distribution of ratings')
plt.xticks(index + bar_width, ('1.0', '2.0', '3.0', '4.0', '5.0'))
plt.legend()
plt.tight_layout()
plt.show()

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


PySpark: Data Exploratory - Distribution of Ratings

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


PySpark: Data Exploratory - Crosstabulation

df.stat.crosstab("user", "rating").show()

PySpark: Data Exploratory - Average rating given by each


user

df.groupBy('user').agg({'rating': 'mean'}).show(5)

PySpark: Data Exploratory - Average rating per movie

import pyspark.sql.functions as func


df.groupBy('product').agg(func.mean('rating').alias('mean_rating')).sort('mean_rating', ascending=False).show(5)

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


Part 2: Modeling: Build the recommendation
model using ALS

PySpark: Separate training set and test set

(training, test) = df.randomSplit([0.8, 0.2])

PySpark: Call the ALS.train() method to train the model

rank = 10
numIterations = 10
model = ALS.train(training,rank,numIterations)

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


Part 3: Model Deployment - Make Predictions
PySpark: Filter out the rating

testdata = test.map(lambda p: (p[0], p[1]))

PySpark: Try to predict rating for a pair of user and movie

pred_ind = model.predict(1, 5)

PySpark: Predict ratings for all pairs of user and movie

predictions = model.predictAll(testdata).map(lambda r:
((r[0], r[1]), r[2]))

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


PySpark: Call a method to generate the top-N item
recommendations for users

recommedItemsToUsers = model.recommendProductsForUsers(10)

recommedItemsToUsers.take(2)

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


Part 4: Model Evaluation
PySpark: Create a ratesAndPreds object by joining the
original ratings and predictions

ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)

PySpark: Calculate the mean squared error

MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()

Source: Suresh Kumar Gorakala, Building Recommendation, Packt Publishing, 2016

Apache Spark in Action Aekanun Thongtae, aekanun@imcinstitute.com Mar 2017


Prediction of Loan Payment
using Google Cloud
Platform

Mr. Aekanun Thongtae


Big Data Consultant
IMC Institute
Sep 2017
Setting up Dataproc and Jupyter
for
Spark Programming

Hadoop Workshop using Cloudera on Amazon EC2 Thanachart Numnonda, thanachart@imcinstitute.com December 2016
Launch Dataproc using gcloud command:
with Jupyter Notebook I) Launch Cloud Shell

Hive.apache.org

Hadoop Workshop using Cloudera on Amazon EC2 Thanachart Numnonda, thanachart@imcinstitute.com December 2016
Launch Dataproc using gcloud command
with Jupyter Notebook
II) Type the following command
Changed to be your name.

gcloud dataproc clusters create aekanun-datascience \


--zone us-central1-a \
--master-machine-type=n1-standard-2 \
--worker-machine-type=n1-standard-2 \
--initialization-actions \
gs://dataproc-initialization-actions/jupyter/jupyter.sh

Hive.apache.org

Hadoop Workshop using Cloudera on Amazon EC2 Thanachart Numnonda, thanachart@imcinstitute.com December 2016
Hadoop Workshop using Cloudera on Amazon EC2 Thanachart Numnonda, thanachart@imcinstitute.com December 2016
Launch the Jupyter notebook
<<public ip>> :8123

Code for the prediction: https://goo.gl/tcaAa1


Hive.apache.org

Hadoop Workshop using Cloudera on Amazon EC2 Thanachart Numnonda, thanachart@imcinstitute.com December 2016
Course Supplement

- Test for regular expression: http://regexr.com

- ALL PySpark Codes:


- Basic Spark Core and its operations
- Spark Core: LAB1
- Spark Core: LAB2
- LAB3 using Spark Core
- LAB3 using Spark SQL
- Basic Machine Learning
- Prediction of Loan Payment
- Network Clustering using ML Pipeline
Course Supplement
- Python with Twitter Stream:
https://drive.google.com/file/d/0B6nJWVBexOxyS
WsyRU81VXVtVUk/view?usp=sharing
Recommended Books

452 aekanun@imcinstitute.com
www.facebook.com/imcinstitute

453
Thank you

thanachart@imcinstitute.com aekanun@imcinstitute.com
www.facebook.com/imcinstitute www.facebook.com/analyticsindeep
www.aekanun.com
www.slideshare.net/imcinstitute
www.thanachart.org

454
A Machine Learning Issue

- Overfitting: Learning the parameters of a


prediction function and testing it on the same
data.

- A model that would just repeat the labels of the


samples that it has just seen would have a perfect
score but would fail to predict anything useful on
yet-unseen data.

Source: scikit-learn.org
Model selection (a.k.a. hyperparameter tuning)

- Tuning may be done for individual Estimators


such as LogisticRegression, or for entire Pipelines
which include multiple algorithms, featurization,
and other steps.

- Users can tune an entire Pipeline at once, rather


than tuning each element in the Pipeline
separately.
Model selection (a.k.a. hyperparameter tuning)

- MLlib supports model selection using tools such


as CrossValidator and TrainValidationSplit. These
tools require the following items:

- Estimator: algorithm or Pipeline to tune

- Set of ParamMaps: parameters to choose


from, sometimes called a parameter grid to
search over

- Evaluator: metric to measure how well a fitted
Model does on held-out test data (a sketch using these three items follows below)
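A minimal PySpark sketch of these three items working together, reusing the pipeline and rf stage from the loan example; the grid values, fold count and the training_set / testing_set names are illustrative:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Set of ParamMaps: the grid of candidate hyperparameters for the Random Forest stage.
param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [20, 50, 100])
              .addGrid(rf.maxDepth, [5, 10])
              .build())

# Estimator = the whole Pipeline; Evaluator = metric on held-out data.
cv = CrossValidator(estimator=pipeline,
                    estimatorParamMaps=param_grid,
                    evaluator=BinaryClassificationEvaluator(labelCol="indexedLabel"),
                    numFolds=3)

cv_model = cv.fit(training_set)                 # k-fold cross-validation over the grid
predictions = cv_model.transform(testing_set)   # uses the best model found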
A Tool for Model Selection: Cross-Validation

Source: Mark Peng, General Tips for participating Competitions


A Tool for Model Selection: Cross-Validation

Source: Mark Peng, General Tips for participating Competitions


Python overtakes R, becomes the leader in Data
Science, Machine Learning platforms

460 aekanun@imcinstitute.com
Python overtakes R, becomes the leader in Data
Science, Machine Learning platforms

Source: kdnuggets.com

461 aekanun@imcinstitute.com
Python overtakes R, becomes the leader in Data
Science, Machine Learning platforms

Source: kdnuggets.com

462 aekanun@imcinstitute.com
Statistics

Source: mph.ufl.edu

463 aekanun@imcinstitute.com
