Essential Readings:
a. Chapter 12 from Fundamentals of Business Analytics by Prasad and Acharya of Wiley
b. Chapter B10 from Business Driven Technology by Paige Baltzan of McGraw Hill
Recommended Readings:
a. Chapter 17 from Data Warehousing, Data Mining & OLAP by Alex Berson and Stephen J. Smith of McGraw Hill
Self-learning:
https://www.youtube.com/watch?v=BB2O4VCu5j8
https://www.youtube.com/watch?v=2rzI4QHwwOs
WHEN TO APPLY DATA MINING?
ATTRIBUTE: Characteristics
WELL-STRUCTURED BUSINESS PROBLEM
Can be:
- described with a high degree of completeness
- solved with a high degree of certainty
UNSTRUCTURED BUSINESS PROBLEM
Cannot be:
- described with a high degree of completeness
- resolved with a high degree of certainty
WHEN IS DATA MINING TECHNOLOGY APPROPRIATE? When:
- the sample size and the number of variables are both large
UNDERSTANDING...
What category of people will be interested in our new
investment scheme?
SUPERVISED LEARNING
Prediction:
- Simple linear regression
- Multiple linear regression
Classification:
- Decision trees
- Linear Regression
- Naive Bayes
- Bayesian Method
- Neural Network
UNSUPERVISED LEARNING
Clustering:
- K-Means Methods
- Hierarchical Methods
Association:
- Market Basket Analysis (MBA)
REAL-WORLD PROBLEMS
A. What will be the stock price of “Coal India” today?
B. Will this customer repay the loan?
C. Which book will he/she purchase next if he/she has
already purchased “Five point someone”?
D. Do I really know my customers?
A. What will be the stock price of “Coal India” today? - I need Regression/Prediction
B. Will this customer repay the loan? - I need Classification
C. Which book will he/she purchase next if he/she has already purchased “Five point someone”? - I need Association
D. Do I really know my customers? - I need Clustering
1. Supervised Learning
Prediction/Regression
1. Supervised Learning
Classification
A. Purchase or not,
B. Loyalty of customer (high, medium, low),
C. Loan Risk (high, medium, low)
D. Classify a car loan applicant as a good or a poor
credit risk
AN EXAMPLE OF CLASSIFICATION TRAINING DATA
The risk classes for this training data have been specified and this data is used in
building a model (e.g. decision tree)
CLASSIFICATION ALGORITHMS
(I) LINEAR REGRESSION
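Linear Regression
Classify a point (x, y) by the sign of a linear function: w0 + w1 x + w2 y >= 0
Regression computes the weights wi from data by minimizing squared error to ‘fit’ the data
Not flexible enough: the boundary between classes is always a single straight line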
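As a rough illustration of the rule w0 + w1 x + w2 y >= 0, the sketch below fits the three weights by ordinary least squares and classifies points by the sign of the result. The sample points, the class names "blue" and "green", the +1/-1 target encoding, and the function names are all assumptions made for this example; they do not come from the slides.

```python
def solve(a, b):
    """Solve the linear system a * w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_classifier(points, targets):
    """Least-squares fit of w0 + w1*x + w2*y to targets of +1 or -1."""
    rows = [(1.0, x, y) for x, y in points]
    # Normal equations: (A^T A) w = A^T t
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve(ata, atb)


def predict(w, x, y):
    """Apply the decision rule w0 + w1*x + w2*y >= 0."""
    return "blue" if w[0] + w[1] * x + w[2] * y >= 0 else "green"


# Hypothetical training points for the two classes.
blue = [(6, 1), (7, 3), (6, 4), (8, 2)]
green = [(1, 1), (2, 3), (1, 4), (3, 2)]
w = fit_linear_classifier(blue + green, [1.0] * len(blue) + [-1.0] * len(green))
print([predict(w, x, y) for x, y in blue + green])
```

Because the boundary is one straight line, this kind of model works here only because the two made-up groups can be split by a line; regions with corners or holes are out of reach.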
CLASSIFICATION ALGORITHMS
(II) DECISION TREES
if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue
[Figure: X-Y plane partitioned into blue and green regions by the rules above; X axis marked at 2 and 5]
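The rules on this slide translate directly into nested conditions. The sketch below is a plain transcription; the function name and the test points are chosen for illustration only.

```python
def classify(x, y):
    """Decision tree from the slide, written as nested conditions."""
    if x > 5:
        return "blue"
    elif y > 3:
        return "blue"
    elif x > 2:
        return "green"
    else:
        return "blue"


# One sample point for each of the four branches.
for point in [(6, 0), (3, 4), (3, 1), (1, 1)]:
    print(point, classify(*point))
```

Each test compares one attribute with one threshold, so the tree carves the plane into axis-aligned rectangles; here the green region is the box 2 < X <= 5, Y <= 3.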
CLASSIFICATION ALGORITHMS
(III) NEURAL NETS
Can select more complex regions
Can be more accurate
Can also overfit the data: find patterns in random noise
2. UNSUPERVISED LEARNING
(i) Clustering
A. What to do with this data?
B. Am I still missing something interesting?
(ii) Association
A. What he/she might buy next,
B. Which products to keep together.
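For the association questions above, a minimal sketch of market basket counting might look like this. The baskets are made-up data, and the helper names `pair_counts` and `confidence` are hypothetical; real tools use more elaborate algorithms, but the underlying counts are the same idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def pair_counts(baskets):
    """Count how many baskets contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts


def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


print(pair_counts(baskets).most_common(1))       # pair bought together most often
print(confidence(baskets, "bread", "butter"))    # how often bread buyers also buy butter
```

The most frequent pair suggests which products to keep together, and a high confidence for "bread, then butter" suggests what a bread buyer might buy next.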
CAN WE SAY WHEN A CLUSTERING SOLUTION IS GOOD?
Homogeneity and separation principles
Homogeneity: Elements within a cluster are close to each other
Separation: Elements in different clusters are far from each other
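One simple way to put numbers on the homogeneity and separation principles is to compare average distances within clusters against average distances between clusters. The measures below are one possible choice among many, and the sample clusters and function names are assumptions for illustration.

```python
from itertools import combinations
from math import dist


def homogeneity(clusters):
    """Average distance between points in the same cluster (lower is better)."""
    d = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(d) / len(d)


def separation(clusters):
    """Average distance between points in different clusters (higher is better)."""
    d = [dist(p, q)
         for a, b in combinations(clusters, 2)
         for p in a
         for q in b]
    return sum(d) / len(d)


# Hypothetical clustering of six 2-D points into two groups.
clusters = [
    [(0, 0), (0, 1), (1, 0)],
    [(10, 10), (10, 11), (11, 10)],
]
print(homogeneity(clusters), separation(clusters))
```

A solution looks good under these measures when the first number is small relative to the second.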
ARE THEY SIMILAR?
AN EXAMPLE WITHOUT TRAINING DATA
We now have no prior information about the classes. Normally the data in cluster
analysis is not categorical. A more typical cluster analysis example may include
numerical attributes like age, salary, length of residence at the current address, loan
amount sought, etc.
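To make clustering on numerical attributes concrete, here is a small k-means sketch over two attributes. The applicant records (age in years, salary in thousands), the starting centroids, and the function name are invented for this example; no class labels are supplied, so the groups emerge from the data alone.

```python
from math import dist


def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid is the attribute-wise mean; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical applicants: (age, salary in thousands).
applicants = [(25, 30), (27, 32), (24, 28), (52, 90), (55, 95), (50, 88)]
centroids, clusters = kmeans(applicants, centroids=[applicants[0], applicants[3]])
print(centroids)
print(clusters)
```

In practice, attributes measured on very different scales (age versus loan amount, say) are usually rescaled first so that no single attribute dominates the distance.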
DATA MINING APPLICATIONS:
Customer segmentation
All industries can take advantage of DM to discover discrete
segments in their customer bases by considering additional
variables beyond traditional analysis.
Warranties
Manufacturers need to predict the number of customers who will
submit warranty claims and the average cost of those claims.
Frequent flier incentives
Airlines can identify groups of customers that can be given
incentives to fly more.
Data Mining Applications
Financial Industry, Banks, Businesses, E-commerce
Stock and investment analysis
Identify loyal customers vs. risky customers
Predict customer spending
Risk management
Sales forecasting
Detect patterns of fraudulent credit card use
DATA MINING PRODUCTS
Predictive Dynamix
Model 1
THINGS TO BE NOTED
Data mining is a tool, not a magic wand
TEXT MINING
WHAT IS TEXT MINING?
Traditional data mining uses ‘structured data’ (an n x p matrix)
CHALLENGES OF TEXT MINING
A. Very high number of possible “dimensions”: all possible word and phrase types in the language
B. Unstructured: records (= docs) are not structurally identical
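A small bag-of-words sketch shows both challenges at once: documents of different shapes are forced into one n x p matrix, and p grows with every new word type. The three call-center-style sentences and the function names are invented for illustration.

```python
import re

# Hypothetical documents of differing length and structure.
docs = [
    "Customer called about a lost credit card",
    "Customer asked about mortgage rates",
    "Lost card reported and replacement card requested",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def document_term_matrix(docs):
    """Return the vocabulary and one row of word counts per document."""
    vocab = sorted({w for d in docs for w in tokenize(d)})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for d in docs:
        row = [0] * len(vocab)
        for w in tokenize(d):
            row[index[w]] += 1
        matrix.append(row)
    return vocab, matrix


vocab, matrix = document_term_matrix(docs)
print(len(docs), "documents x", len(vocab), "dimensions")
```

Even three short sentences need more than a dozen columns, most of them zero for any given document; a real corpus pushes the column count into the tens of thousands.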
AMAZON.COM
OF MICE AND MEN: TEXT STATS
EXAMPLE 1: DECISION SUPPORT USING BANK CALL CENTER DATA
The Information Source:
Call center records
Example:
[Chart: values for Cleveland, New York, and Boston; vertical axis from 0 to 1000]
3. SUMMARIZING / WORD CLOUDS
Takes text as input, finds the most interesting words, and displays them graphically
Blogs do this
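The counting step behind a word cloud can be sketched as follows, assuming "most interesting" is approximated by "most frequent after dropping common stop words". That approximation, the tiny stop-word list, and the sample text are assumptions of this example; drawing the cloud itself is left out.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stop-word list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it"}


def top_words(text, n=3):
    """Return the n most frequent non-stop-words with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)


sample = ("Text mining turns text into data. Mining text is harder than "
          "mining tables, and text is everywhere.")
print(top_words(sample, 2))
```

A word cloud would then draw each returned word at a size proportional to its count.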
4. SUMMARIZATION
A. Text summarization helps a user decide whether a lengthy document
meets their needs and is worth reading for further information.
TEXT MINING PRODUCTS