USE OF DATA MINING AND TEXT MINING (MACHINE LEARNING)

SESSION – 3
Learning Objectives
A. To gain an in-depth understanding of data mining and text mining and their applications across business functions
Key Content:
Data Mining and Text Mining
a. Basic Concepts
b. Methodology and Techniques (Supervised vs. Unsupervised learning)
c. Applications of Mining in Business Context

Essential Readings:
a. Chapter 12 from Fundamentals of Business Analytics by Prasad and Acharya of Wiley
b. Chapter B10 from Business Driven Technology by Paige Baltzan of McGraw Hill
Recommended Readings:
a. Chapter 17 from Data Warehousing, Data Mining & OLAP by Alex Berson and Stephen J. Smith of McGraw Hill

Self-learning:
a. https://www.youtube.com/watch?v=BB2O4VCu5j8
b. https://www.youtube.com/watch?v=2rzI4QHwwOs

WHEN TO APPLY DATA MINING?

ATTRIBUTE | WELL-STRUCTURED BUSINESS PROBLEM | UNSTRUCTURED BUSINESS PROBLEM
Characteristics | Can be described with a high degree of completeness and solved with a high degree of certainty | Cannot be described with a high degree of completeness or resolved with a high degree of certainty
Goal | Find the best solution | Find a reasonable/acceptable solution
Complexity | From very simple to complex; usually lies within one discipline | From complex to very complex; as a rule, lies at the interface of multiple disciplines
Example | Sample size identification | Customer feedback evaluation
Key business question | What customer sample size is needed to detect a difference of at least two purchases in sales of Product A? | Is there any relationship between customer perception of company reputation/brand equity and sales of the company's product portfolio?

WHEN IS DATA MINING TECHNOLOGY APPROPRIATE? When…

- the business problem is unstructured
- the data include a mixture of interval, nominal, ordinal, count, and text variables, with a number of non-numeric variables
- the set of inputs includes many irrelevant and redundant variables
- the relationships among variables could be non-linear
- the data are highly heterogeneous, with a large percentage of outliers and missing values
- the sample size and the number of variables are both large

UNDERSTANDING…
- What category of people will be interested in our new investment scheme?
- What category of students are likely to have trouble with specific subjects of the course?
- Who is more susceptible to intestinal cancer?
- What are the typical buying patterns of categories of people?
- Analyze fault reports to detect early signs of failures.
- Choose potential candidates to be targeted for a promotion drive.
DATA MINING ALGORITHMS AND TYPES
SUPERVISED LEARNING VS. UNSUPERVISED LEARNING

Supervised learning:
- Discover patterns in the data that relate data attributes to a target (class) attribute.
- These patterns are then used to predict the values of the target attribute in future data instances.
- Training data include both the input and the target (class) attribute.
- These methods are usually fast and more accurate.
- The model should give reasonable, acceptable results when new instances are given as input without their target being known a priori.

Unsupervised learning:
- The data have no target attribute.
- We want to explore the data to find some intrinsic structure in them.
- The model is not provided with the correct/acceptable result for each instance during training.

SUPERVISED LEARNING
- Prediction: simple linear regression, multiple linear regression
- Classification: decision trees, linear regression, Naive Bayes, Bayesian method, neural network

UNSUPERVISED LEARNING
- Clustering: K-means methods, hierarchical methods
- Association: market basket analysis (MBA)

REAL WORLD PROBLEMS…
A. What will be the stock price of "Coal India" today?
B. Will this customer repay the loan?
C. Which book will he/she purchase next if he/she has already purchased "Five Point Someone"?
D. Do I really know my customers?

A. What will be the stock price of "Coal India" today? - I need Regression/Prediction
B. Will this customer repay the loan? - I need Classification
C. Which book will he/she purchase next if he/she has already purchased "Five Point Someone"? - I need Association
D. Do I really know my customers? - I need Clustering

1. SUPERVISED LEARNING
Prediction/Regression

A. Today's stock price,
B. 2019's flat price in Juhu,
C. Today's weather,
D. Your salary in 2020,
E. The salary of an individual who owns a sports car,
F. The number of minutes before a thunderstorm will reach a given location.
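
A minimal sketch of simple linear regression, the first prediction technique named in this session. The data (years of experience versus salary in thousands) and the function name are invented for illustration and are not from the slides.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) that minimise squared error for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data: years of experience -> salary (thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

intercept, slope = fit_simple_linear_regression(years, salary)
predicted_salary = intercept + slope * 6  # prediction for an unseen input
print(intercept, slope, predicted_salary)
```

The target here is a number, which is what separates prediction/regression from classification, where the target is a category.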

1. SUPERVISED LEARNING
Classification

A. Purchase or not,
B. Loyalty of customer (high, medium, low),
C. Loan risk (high, medium, low),
D. Classify a car loan applicant as a good or a poor credit risk.

AN EXAMPLE OF CLASSIFICATION TRAINING DATA

Owns Home? | Married | Gender | Employed | Credit Rating | Risk Class
Yes | Yes | Male | Yes | A | B
No | No | Female | Yes | A | A
Yes | Yes | Female | Yes | B | C
Yes | No | Male | No | B | B
No | Yes | Female | Yes | B | C
No | No | Male | Yes | B | A
Yes | No | Female | No | B | B

The risk classes for this training data have been specified, and this data is used in
building a model (e.g. decision tree)
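
As a sketch of how a decision tree might start on this table, the code below computes the information gain of each attribute and picks the best first split. Information gain is one common splitting criterion and is an assumption here; the slides do not say which criterion is used. The column identifiers are illustrative names for the table's headers.

```python
import math
from collections import Counter

COLUMNS = ["owns_home", "married", "gender", "employed", "credit_rating"]

# The seven training rows from the table; the last value is the risk class.
ROWS = [
    ("Yes", "Yes", "Male", "Yes", "A", "B"),
    ("No", "No", "Female", "Yes", "A", "A"),
    ("Yes", "Yes", "Female", "Yes", "B", "C"),
    ("Yes", "No", "Male", "No", "B", "B"),
    ("No", "Yes", "Female", "Yes", "B", "C"),
    ("No", "No", "Male", "Yes", "B", "A"),
    ("Yes", "No", "Female", "No", "B", "B"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, column_index):
    """Reduction in class entropy obtained by splitting on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column_index], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(COLUMNS)}
best_attribute = max(gains, key=gains.get)
print(best_attribute, round(gains[best_attribute], 3))
```

On these seven rows the attribute with the highest gain is home ownership, so a tree built with this criterion would split on it first and then repeat the procedure inside each branch.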

CLASSIFICATION ALGORITHMS
(I) LINEAR REGRESSION
- Linear regression: w0 + w1 x + w2 y >= 0
- Regression computes the weights wi from data to minimize squared error to 'fit' the data
- Not flexible enough (the decision boundary is a single straight line)
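
A rough sketch of the idea: fit w0, w1, w2 by least squares to targets coded +1 and -1, then classify with the rule w0 + w1 x + w2 y >= 0. The points, the +1/-1 coding, and the function names are assumptions made for illustration; the slides give only the decision rule.

```python
def solve_linear_system(a, b):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - known) / m[r][r]
    return w


def fit_least_squares(points, targets):
    """Minimise squared error of w0 + w1*x + w2*y via the normal equations."""
    design = [[1.0, x, y] for x, y in points]
    ata = [[sum(row[i] * row[j] for row in design) for j in range(3)] for i in range(3)]
    atb = [sum(row[i] * t for row, t in zip(design, targets)) for i in range(3)]
    return solve_linear_system(ata, atb)


def classify(w, x, y):
    """Apply the slide's rule: positive class when w0 + w1*x + w2*y >= 0."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1


# Hypothetical, linearly separable points with targets coded +1 / -1.
points = [(3, 3), (4, 3), (3, 4), (4, 4), (0, 0), (1, 0), (0, 1), (1, 1)]
targets = [1, 1, 1, 1, -1, -1, -1, -1]

weights = fit_least_squares(points, targets)
predictions = [classify(weights, x, y) for x, y in points]
print(weights, predictions)
```

Whatever the data, the fitted boundary is a straight line, which is the inflexibility the slide refers to.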

CLASSIFICATION ALGORITHMS
(II) DECISION TREES

if X > 5 then blue
else if Y > 3 then blue
else if X > 2 then green
else blue

[Figure: regions of the X-Y plane defined by these rules; X-axis ticks at 2 and 5]
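
The rules on this slide translate directly into code. The sketch below is a literal transcription of them; only the function name is invented.

```python
def classify_point(x, y):
    """Decision-tree rules from the slide, tested in the same order."""
    if x > 5:
        return "blue"
    if y > 3:
        return "blue"
    if x > 2:
        return "green"
    return "blue"


# A few sample points, one per branch of the tree.
for point in [(6, 0), (1, 4), (3, 1), (1, 1)]:
    print(point, classify_point(*point))
```

Each test compares one variable against one threshold, so the tree carves the plane into axis-aligned rectangles: the green region is 2 < X <= 5 with Y <= 3, and everything else is blue.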

CLASSIFICATION ALGORITHMS
(III) NEURAL NETS
- Can select more complex regions
- Can be more accurate
- Can also overfit the data: find patterns in random noise

2. UNSUPERVISED LEARNING
(i) Clustering
A. What to do with this data?
B. Am I still missing something interesting?

(ii) Association
A. What he/she might buy next,
B. Which products to keep together.
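
For the association questions above, a minimal market basket sketch: it computes the support of an itemset and the confidence of a rule such as "bread -> butter". The transactions and item names are made up for illustration.

```python
# Hypothetical shopping baskets, one set of items per transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def count_containing(itemset, baskets):
    """Number of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count_containing(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets with the antecedent, the fraction that also have the consequent."""
    both = count_containing(set(antecedent) | set(consequent), baskets)
    return both / count_containing(antecedent, baskets)


rule_support = support({"bread", "butter"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(rule_support, rule_confidence)
```

A rule with high support and high confidence suggests what a customer might buy next and which products to keep together.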

CAN WE SAY WHEN A CLUSTERING SOLUTION IS GOOD?
Homogeneity and separation principles
- Homogeneity: elements within a cluster are close to each other
- Separation: elements in different clusters are further apart from each other

Clustering is not an easy task!
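
One simple way to put numbers on the two principles is to compare the average distance between points in the same cluster with the average distance between points in different clusters. This is a sketch with invented 2-D points; Euclidean distance and these two averages are assumptions, not measures named on the slides.

```python
import math
from itertools import combinations


def mean_within_cluster_distance(clusters):
    """Homogeneity: average distance between pairs of points in the same cluster."""
    distances = [
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    ]
    return sum(distances) / len(distances)


def mean_between_cluster_distance(clusters):
    """Separation: average distance between pairs of points in different clusters."""
    distances = [
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    ]
    return sum(distances) / len(distances)


# Hypothetical clustering solution with two tight, well-separated groups.
clusters = [
    [(0, 0), (0, 1), (1, 0)],
    [(10, 10), (10, 11), (11, 10)],
]

within = mean_within_cluster_distance(clusters)
between = mean_between_cluster_distance(clusters)
print(within, between)
```

A solution looks good by these principles when the within-cluster average is small relative to the between-cluster average.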

ARE THEY SIMILAR?

AN EXAMPLE OF DATA WITHOUT CLASS LABELS

Owns Home? | Married | Gender | Employed | Credit Rating
Yes | Yes | Male | Yes | A
No | No | Female | Yes | A
Yes | Yes | Female | Yes | B
Yes | No | Male | No | B
No | Yes | Female | Yes | B
No | No | Male | Yes | B
Yes | No | Female | No | B

We now have no prior information about the classes. Normally the data in cluster
analysis is not categorical. A more typical cluster analysis example may include
numerical attributes like age, salary, length of residence at current address, loan
amount sought, etc.
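
A minimal K-means sketch on two of the numerical attributes mentioned above, age and salary. The six records are invented, and the starting centroids are fixed by hand so the run is reproducible; practical implementations usually pick them at random and standardise attributes that are on different scales.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Hypothetical applicants: (age in years, salary in thousands).
applicants = [(25, 30), (27, 32), (26, 31), (50, 90), (52, 95), (51, 92)]

labels, centroids = k_means(applicants, initial_centroids=[applicants[0], applicants[3]])
print(labels)
print(centroids)
```

No class column is used anywhere: the groups come only from how close the records are to each other.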

DATA MINING APPLICATIONS:
Customer segmentation
- All industries can take advantage of DM to discover discrete segments in their customer bases by considering additional variables beyond traditional analysis.
Warranties
- Manufacturers need to predict the number of customers who will submit warranty claims and the average cost of those claims.
Frequent flier incentives
- Airlines can identify groups of customers that can be given incentives to fly more.

DATA MINING APPLICATIONS
Financial Industry, Banks, Businesses, E-commerce
- Stock and investment analysis
- Identify loyal customers vs. risky customers
- Predict customer spending
- Risk management
- Sales forecasting
- Detect patterns of fraudulent credit card use

Retail and Marketing
- Customer buying patterns/demographic characteristics
- Market basket analysis
- Trend analysis

DATA MINING PRODUCTS

[Product logos not reproduced; legible names: Predictive Dynamix, Model 1]

THINGS TO BE NOTED…
Data mining is a tool, not a magic wand
- Data mining will not automatically discover solutions without guidance
- Accuracy is not always the most important measure of data mining
- To get meaningful results, you must understand your data

Find meaningful and effective patterns

TEXT MINING
WHAT IS TEXT MINING?
- Traditional data mining uses 'structured data' (n x p matrix)
- Text mining is the analysis of 'free-form text', also referred to as 'unstructured data'
- There is not enough time or patience to read everything. Can we extract the most vital kernels of information?

CHALLENGES OF TEXT MINING
A. Very high number of possible "dimensions"
   - All possible word and phrase types in the language!
B. Unstructured
   - Records (= docs) are not structurally identical
C. Complex relationships between concepts in text
   - "AOL merges with Time-Warner"
   - "Time-Warner is bought by AOL"
D. Ambiguity and context sensitivity
   - automobile = car = vehicle = Toyota
   - Apple (the company) or apple (the fruit)
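
To make challenges A and C concrete, the sketch below turns the two AOL sentences into sets of words. The tokenizer and the overlap measure (Jaccard similarity) are illustrative choices, not something the slides prescribe.

```python
import re


def tokenize(text):
    """Lowercase the text and keep words, treating hyphenated names as one token."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


sentence_a = "AOL merges with Time-Warner"
sentence_b = "Time-Warner is bought by AOL"

words_a = set(tokenize(sentence_a))
words_b = set(tokenize(sentence_b))

vocabulary = sorted(words_a | words_b)  # one "dimension" per distinct word
shared = words_a & words_b
jaccard = len(shared) / len(vocabulary)

print(vocabulary)
print(shared, round(jaccard, 2))
```

Two short sentences already need seven dimensions, and the only words they share are the company names. Nothing at the word level says that "merges with" and "is bought by" describe related events, which is why relationships between concepts are hard to capture.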
APPLICATIONS OF TEXT MINING
1. TEXT ANALYSIS
- Which words are most present?
- Which words are most surprising?
- Which words help define the document?
- What are the interesting text phrases?

AMAZON.COM

DATA MINING - VOLINSKY - 2011 - COLUMBIA UNIVERSITY


OF MICE AND MEN: CONCORDANCE
A concordance is an alphabetized list of the most frequently occurring words in a book,
excluding common words such as "of" and "it." The font size of a word is proportional to
the number of times it occurs in the book.
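
A concordance of this kind can be sketched in a few lines. The sample sentence is invented (it is not a quotation from the novel), and the stop-word list and the points-per-occurrence scaling are assumptions.

```python
import re
from collections import Counter

# Small assumed list of common words to exclude.
STOP_WORDS = {"of", "it", "the", "a", "and", "to", "in", "is"}


def concordance(text, top_n):
    """Alphabetized (word, count) pairs for the top_n most frequent non-stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return sorted(counts.most_common(top_n))


def font_sizes(entries, points_per_occurrence=10):
    """Font size proportional to how often each word occurs."""
    return {word: count * points_per_occurrence for word, count in entries}


# Invented sample text for illustration only.
sample = (
    "The men talk of the ranch and of the rabbits, "
    "and the ranch and the rabbits and the ranch is the dream."
)

entries = concordance(sample, top_n=2)
sizes = font_sizes(entries)
print(entries)
print(sizes)
```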

OF MICE AND MEN: TEXT STATS



2. SENTIMENT ANALYSIS
- Automatically determine tone in text: positive, negative or neutral
- Typically uses collections of good and bad words
- Analysis of call records as input into the decision-making process of a bank's management
- Quick answers to important questions
   - Which offices receive the most angry calls?
   - Which products have the fewest satisfied customers?
   - ("Angry" and "Satisfied" are recognizable sentiments)
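
A minimal sketch of the "collections of good and bad words" approach. The two word lists are tiny, invented examples; a real lexicon would be far larger.

```python
import re

# Assumed mini-lexicons for illustration only.
POSITIVE_WORDS = {"good", "great", "happy", "love", "satisfied", "helpful"}
NEGATIVE_WORDS = {"bad", "angry", "hate", "hates", "poor", "terrible", "unhappy"}


def sentiment(text):
    """Label text by counting positive words minus negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment("He hates his statement format"))
print(sentiment("The staff were helpful and I am satisfied"))
print(sentiment("I called about my balance"))
```

Word counting of this kind ignores negation and context, so "not satisfied" would still count as positive here.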

EXAMPLE 1: DECISION SUPPORT USING BANK CALL CENTER DATA
The Information Source:
- Call center records
- Example:

AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, “mr stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the required balance on the account.”

[Figure: bar chart "Negative Calls Related to Bank Statements" for Cleveland, New York, and Boston; vertical axis from 0 to 1000]
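
The chart above could be produced by counting negative statement-related calls per city. The sketch below assumes a field layout read off the single example record (7th field = city, 10th = topic code, 11th = free-text note); that layout, the records, and the word list are all assumptions for illustration.

```python
import csv
from collections import Counter

NEGATIVE_WORDS = {"hates", "angry", "unhappy"}  # assumed mini-lexicon

# Hypothetical records shaped like the example; every value is invented.
RAW_RECORDS = '''\
ID1, 01, 0101, PCC, 021, 0000001, NEW YORK, NY, AGENT1, STMT, "customer hates the statement format"
ID2, 01, 0101, PCC, 021, 0000002, BOSTON, MA, AGENT2, STMT, "customer asked for a copy of the statement"
ID3, 01, 0101, PCC, 021, 0000003, NEW YORK, NY, AGENT3, STMT, "customer is angry about the statement layout"
ID4, 01, 0101, PCC, 021, 0000004, CLEVELAND, OH, AGENT4, LOAN, "customer is unhappy with the loan rate"
'''


def negative_statement_calls_by_city(raw):
    """Count calls whose topic is STMT and whose note contains a negative word."""
    counts = Counter()
    for fields in csv.reader(raw.splitlines(), skipinitialspace=True):
        city, topic, note = fields[6], fields[9], fields[10]
        is_negative = any(word in NEGATIVE_WORDS for word in note.lower().split())
        if topic == "STMT" and is_negative:
            counts[city] += 1
    return counts


calls_by_city = negative_statement_calls_by_city(RAW_RECORDS)
print(calls_by_city)
```

The structured fields say where and about what the call was, while the free-text note supplies the sentiment; combining the two is what supports the management decision.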
3. SUMMARIZING / WORD CLOUDS
Takes text as input, finds the most interesting words, and displays them graphically.
Blogs do this.

4. SUMMARIZATION
A. Text summarization is immensely helpful for figuring out whether a lengthy document meets the user's needs and is worth reading for further information.

B. Software can summarize the document in the time it would take the user to read the first paragraph.

C. The key to summarization is to reduce the length and detail of a document while retaining its main points and overall meaning.
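
One simple way to approximate this is extractive summarization: score each sentence by how frequent its words are in the whole document and keep the top-scoring sentences in their original order. This is a sketch of that one approach, with an assumed stop-word list and an invented three-sentence document; the slides do not name a specific method.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "in", "to", "was", "of", "and", "is"}  # assumed list


def content_words(text):
    """Lowercased words with common stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the whole text."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    frequency = Counter(content_words(text))
    scores = [sum(frequency[w] for w in content_words(s)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in kept)


# Invented document for illustration.
document = (
    "Data mining finds patterns in data. "
    "The weather was pleasant. "
    "Text mining applies data mining to text."
)

one_sentence_summary = summarize(document, max_sentences=1)
two_sentence_summary = summarize(document, max_sentences=2)
print(one_sentence_summary)
print(two_sentence_summary)
```

Summing word frequencies favours longer sentences, so a fuller version would normalise scores by sentence length.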

TEXT MINING PRODUCTS
