Você está na página 1de 76

DATA MINING

CS C415/I S C415
I Semester 2009-10
Prof. Navneet Goyal
Department of Computer Science & I nformation Systems,
BI TS, Pilani.
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 2
Text & Reference Books
Text Book
Tan P. N. , Steinbach M & Kumar V. Introduction to Data
Mining Pearson Education, 2006.
Reference Books
Han J & Kamber M, Data Mining: Concepts and
Techniques, Morgan Kaufmann Publishers, Second
Edition, 2006
Dunhum M.H. & Sridhar S. Data Mining-Introductory
and Advanced Topics, Pearson Education, 2006.
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 3
Tsunami of Data
How much data we generate daily?
Phone calls
sms
ATM
Air/train reservations
Registration
Attending classes
Email
Internet browsing
Blogging
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 4
Tsunami of Data
Some Interesting Figures
1.5 billion internet users at the end of 2008
2.2 billion by 2013
400 million mobile phone users in India (Mar.
2009)
771 million by 2013
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 5
The Course
Covers 2 aspects of Data Mining
Algorithmic or Data aspect
Domain or application aspect
Synergy!
What is expected from you?
You are doing mining!
Be prepared to get your hands
dirty
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 6
Latest News
IBM acquires SPSS for $1.2 b in
cash
SPSS Data Mining Software
Clementine
PASW Modeler (new name for
Clementine)
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 7
Success Stories
Beer & Diapers
Polo Shirts & Barbie Dolls
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 8
New Generation
Integrated Information
Systems
DW & OLAP Technology
Data Mining & KDD

XML based Databases
Web Mining

+
2000 onwards
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 9
Searching a phone number in a phone
book

Searching a keyword on Google

Generating histograms of salaries for
different age groups

Issuing SQL query to a database and
reading the reply
What is NOT Data Mining?
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 10
What is NOT Data Mining?
Originally a statistician term
Overusing of data to draw invalid inferences
Bonferroni's theorem warns us that if there are too
many possible conclusions to draw, some will be true
for purely statistical reasons, with no physical
validity.
Famous example: David Rhine, a parapsychologist" at Duke
in the 1950's tested students for extrasensory perception" by
asking them to guess 10 cards - red or black. He found
about 1/1000 of them guessed all 10, and instead of
realizing that is what you'd expect from random guessing,
declared them to have ESP. When he retested them, he
found they did no better than average.

His conclusion: telling people they have ESP causes them
to lose it!

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 11
Data Mining is NOT
Data Warehousing
(Deductive) query processing
SQL/ Reporting
Software Agents
Expert Systems
Online Analytical Processing (OLAP)
Statistical Analysis Tool
Data visualization

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 12
What is Data Mining?
Discovery of useful summaries of data -
Ullman
Extracting or Mining knowledge form large
amounts of data
The efficient discovery of previously
unknown patterns in large databases
Technology which predict future trends
based on historical data
It helps businesses make proactive and
knowledge-driven decisions
Data Mining vs. KDD
The name Data Mining a misnomer?
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 13
Data mining:
Extraction of interesting (non-trivial,
implicit, previously unknown and
potentially useful) information or
patterns from data in large databases
What Is Data Mining?
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 14
Data Mining
Data mining is ready for application in
the business & scientific community
because it is supported by three
technologies that are now sufficiently
mature:
Massive data collection
Powerful multiprocessor computers
Data mining algorithms
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 15
Data Mining:
On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational
databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
WWW
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 16
Data Mining
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 17
Data Mining:
A KDD Process
Data mining: the core
of knowledge
discovery process.
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 18
Stages of Data Mining
Process

1. Data gathering, e.g., data warehousing.
2. Data cleansing: eliminate errors and/or bogus data,
e.g., patient fever = 125.
3. Feature extraction: obtaining only the interesting
attributes of the data, e.g., date acquired is
probably not useful for clustering celestial objects, as
in Skycat.
4. Pattern extraction and discovery. This is the stage
that is often thought of as data mining and is where
we shall concentrate our effort.
5. Visualization of the data.
6. Evaluation of results; not every discovered fact is
useful, or even true! Judgment is necessary before
following your software's conclusions.

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 19
Data Mining
Figure taken from M H Dunham book on Data Mining

DATA
MINING
Predictive
Classification
Regression
Time Series
Analysis
Prediction
Descriptive
Clustering
Summarization
Association
Rules
Sequence
Discovery
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 20
Predictive Model
Making prediction about values of
data using known results from
different data
Example: Credit Card Company
Every purchase is placed in 1 of 4
classes
1. Authorize
2. Ask for further identification before authorizing
3. Do not authorize
4. Do not authorize but contact police
Two functions of Data Mining
1. Examine historical data to determine how the
data fit into 4 classes
2. Apply the model to each new purchase

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 21
Descriptive Model
Identifies patterns or relationship in data
Example: Later
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 22
Two Important Terms
Supervised Learning
Training Data Set
Model is told to which class each training data
belongs
Learning by example
Example CLASSIFICATION
Similar to Discriminate Analysis in Statistics
Unsupervised Learning
Class-label of training set is not known
No. of classes also may not be known
Learning by observation
Example CLUSTERING

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 23
Data Mining Applications
Some examples of successes":
1. Decision trees constructed from bank-loan histories to
produce algorithms to decide whether to grant a loan.

2. Patterns of traveler behavior mined to manage the sale
of discounted seats on planes, rooms in hotels,etc.

3. Diapers and beer" Observation that customers who buy
diapers are more likely to by beer than average allowed
supermarkets to place beer and diapers nearby, knowing
many customers would walk between them. Placing
potato chips between increased sales of all three items.
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 24
Data Mining Applications
Some examples of successes":
4. Skycat and Sloan Sky Digital Sky Survey: clustering sky
objects by their radiation levels in different bands
allowed astronomers to distinguish between galaxies,
nearby stars, and many other kinds of celestial objects.
(168 million records and some 500 attributes)
for details see http://www.sdss.org/dr1/

5. Comparison of the genotype of people with/without a
condition allowed the discovery of a set of genes that
together account for many cases of diabetes. This sort of
mining has become much more important as the human
genome has fully been decoded

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 25
Examples
BANK AGENT:
Must I grant a mortgage to this customer?
SUPERMARKET MANAGER:
When customers buy eggs, do they also buy oil?
PERSONNEL MANAGER:
What kind of employees do I have?
AGRICULTURAL SCIENTIST:
What would be the wheat yield this year?
NETWORK ADMINISTRATOR:
Which website visitor is a hacker?
Which incoming mail is a spam?
TRADER in a RETAIL COMPANY:
How many flat TVs do we expect to sell next month?

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 26
Classification Example
Example taken from Jos Hernndez-Orallo slides, Spain
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 27
Association Rule:
Example
Example taken from Jos Hernndez-Orallo slides, Spain
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 28
Clustering: Example
Example taken from Jos Hernndez-Orallo slides, Spain
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 29
Examples of Discovered
Patterns
o Association rules
o 98% of people who purchase diapers also
buy beer
o Classification
o People with age less than 25 and salary >
40k drive sports cars
o Similar time sequences
o Stocks of companies A and B perform
similarly
o Outlier Detection
o Residential customers for telecom company
with businesses at home
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 30
Anomaly Detection/Outliers
o Objects whose characteristics are significantly
different from the rest of the data
o Such observations are known as ANOMALIES or
OUTLIERS
o False alarms to be avoided
o Applications
o Fraud detection
o Network intrusions
o Unusual patterns of disease
o Ecosystem disturbances


10/16/2014 Dr. Navneet Goyal, BITS, Pilani 31
Association Rules &
Frequent Itemsets
Market-Basket Analysis
Grocery Store: Large no. of ITEMS
Customers fill their market baskets with
subset of items
98% of people who purchase diapers also
buy beer
Used for shelf management
Used for deciding whether an item should be
put on sale

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 32
Classification
Customers name, age income_level and
credit _rating known
Training Set
Use classification algorithm to come up
with classification rules
If age between 31 & 40 and income_level=
High, then credit_rating = Excellent
New Data(customer): Sachin, age=31,
income_level=High implies
credit_rating=Excellent
Classifier Accuracy?
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 33
Clustering
Given points in some space, often a high-
dimensional space
Group the points into a small number of
clusters
Each cluster consisting of points that are
near in some sense
Points in the same cluster are similar and
are dissimilar to points in other clusters
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 34
Clustering: Examples
Cholera outbreak in London








Skycat clustered 2x10
9
sky objects into stars, galaxies,
quasars, etc. Each object was a point in a space of 7
dimensions, with each dimension representing radiation in
one band of the spectrum.
The Sloan Sky Survey is a more ambitious attempt to
catalog and cluster the entire visible universe

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 35
AR Support & Confidence
Transaction Database



For minimum support = 50%, minimum
confidence = 50%, we have the following rules
1 => 3 with 50% support and 66%
confidence
3 => 1 with 50% support and 100%
confidence

Transaction Id Purchased Items
1 {1, 2, 3}
2 {1, 4}
3 {1, 3}
4 {2, 5, 6}
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 36
Mining Association Rules
2 Step Process
1. Find all frequent Itemsets is all itemsets
satisfying min_sup
2. Generate strong ARs from frequent
itemsets ie ARs satisfying min_sup &
min_conf

Classification & Prediction
What is Classification?
What is Prediction?
Any relationship between the two?
Supervised or Unsupervised?
Issues
Applications
Algorithms
Classifier Accuracy
Prediction:
models continuous-valued functions, i.e.,
predicts unknown or missing values


Classification:
predicts categorical class labels
classifies data (constructs a model) based on
the training set and the values (class labels)
in a classifying attribute and uses it in
classifying new data
Classification & Prediction
Given a database D={t
1
,t
2
,,t
n
} and a
set of classes C={C
1
,,C
m
}, the
Classification Problem is to define a
mapping f:DC where each t
i
is assigned
to one class.
Prediction is similar, but may be viewed
as having infinite number of classes.
Classification & Prediction
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 40
Applications
Credit approval
Target marketing
Medical diagnosis
Treatment effectiveness analysis
Image recognition

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 41
Some More Applications
Teachers classify students grades as A,
B, C, D, or E.
Identify mushrooms as poisonous or
edible.
Predict when a river will flood.
Identify individuals with credit risks.
Speech recognition
Pattern recognition
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 42
Grading: A Simple Example
If x >= 90 then grade
=A.
If 80<=x<90 then
grade =B.
If 70<=x<80 then
grade =C.
If 60<=x<70 then
grade =D.
If x<50 then grade
=E.
F
>=60 <50
>=80 <80
x
>=70 <70
x
B
A
x
C
D
>=90 <90
x
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 43
Classification Example:
Letter Recognition
View letters as constructed from 5 components:
Letter C
Letter E
Letter A
Letter D
Letter F
Letter B
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 44
Classification:
A Two-Step Process
Model construction: describing a set of
predetermined classes
Each tuple/sample is assumed to belong to a
predefined class, as determined by the class label
attribute
The set of tuples used for model construction:
training set
The model is represented as classification rules,
decision trees, or mathematical formulae
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 45
Accuracy of Classification
Classification is a fuzzy problem, the
correct answer may depend on user
%age of tuples places in correct
class
Cost of incorrect assignment

Performance Evaluation
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 46
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Wo rth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium

Example taken from M H Dunham book on Data Mining

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 47
Classification Performance
True Positive
False Negative False Positive
True Negative
Example taken from M H Dunham book on Data Mining book slides

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 48
Classification:
Model Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classification
Algorithms
IF rank = professor
OR years > 6
THEN tenured = yes
Classifier
(Model)
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 49
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?

Classification:
Use the Model
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 50
Clustering
Clustering of data is a method by which
large sets of data is grouped into
clusters of smaller sets of similar data.

Objects in one cluster have high
similarity to each other and are
dissimilar to objects in other clusters.

It is an example of unsupervised
learning.

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 51
Clustering Applications
Segment customer database based
on similar buying patterns.
Group houses in a town into
neighborhoods based on similar
features.
Identify similar documents based on
keywords
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 52
Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs
Land use: Identification of areas of similar land use in an
earth observation database
Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost
City-planning: Identifying groups of houses according to their
house type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should
be clustered along continent faults
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 53
Clustering vs. Classification
No prior knowledge
Number of clusters
Meaning of clusters
Unsupervised learning

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 54
Continuum of Analysis

OLTP




OLAP


Data Mining

Primitive &
Canned
Analysis
Complex
Ad-hoc
Analysis
Automated
Analysis
SQL
Specialized
Algorithms
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 55
Summary
Predictive Modeling
Building a model for the target or
dependent variable as a function
of the explanatory or independent
variables
Classification: discrete target
variables
Regression/Prediction: continuous
target variables

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 56
Summary
Association Analysis
Discover patterns that describe
strongly associated features in the data
Cluster Analysis
Groups closely related observations so
that observations that belong to the
same cluster are more similar to each
other than observations belonging to
other clusters

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 57
Examples
Type of a flower
Yield of a crop
Type of intrusion
Market-basket analysis
Fraud detection
Similar documents using keywords
Combinations are common

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 58
Recent Research Trends
Stream Data Mining
Agent based Data Mining
Incremental Algorithms for Data Mining
Text Data Mining
Multi-relational Data Mining
Sampling Techniques for Data Mining


10/16/2014 Dr. Navneet Goyal, BITS, Pilani 59
Other Data Mining Tasks
Summarization
One of the many criteria used to compare universities by the
US: News & World Report is the average SAT or ACT score*.
Summarization is used to estimate the type & intellectual level
of the student body
Sequence discovery
Unlike MBA analysis requires items to be purchased at the
same time, here we look for items purchased over time in some
order. Temporal ARs fall under this category. Example:
sequence of pages frequently accessed on a website. 70% of
the users of page A follow one of the following patterns of
behavior (A,B,C) or (A,D,B,C) or (A,E,B,C).

* [GM 99] Dunham Book

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 60
Important Conferences &
Journals
Conferences
ACM SIGKDD Int. Conf. on KD & DM (KDD)
IEEE ICDM
SIAM ICDM
PKDD
PAKDD
ICKM
ICDE
ICML



10/16/2014 Dr. Navneet Goyal, BITS, Pilani 61
Important Conferences &
Journals
Journals
IEEE Txs. on Knowledge & Data Engg.
Data Mining & Knowledge Discovery
Knowledge & Information Systems
Intelligent Data Analysis
Information Systems
Journal of Intelligent Information Systems



10/16/2014 Dr. Navneet Goyal, BITS, Pilani 62
Web Resources
Text book's website at:
http://wwwusers.cs.umn.edu/~kumar/dmbook/.
PowerPoints for the text book are available via
anonymous ftp at:
ftp://ftp.aw.com/cseng/authors/tan
Other Resources:
http://web.ccsu.edu/datamining/resources.html

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 63
Data Mining, Statistics, &
Machine Learning
Similarities
Differences
Interplay
Scope
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 64
Data Mining Communities
Several different communities have laid claim to
DM
1. Statistics.
2. AI, where it is called machine learning."
3. Researchers in clustering algorithms.
4. Visualization researchers.
5. Databases. We'll be taking this approach, of
course, concentrating on the challenges that
appear when the data is large and the
computations complex. In a sense, data mining
can be thought of as algorithms for executing
very complex queries on non-main-memory
data.

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 65
History of Data Mining
Emerged late 1980s
Flourished 1990s
Roots traced back along three
family lines
Classical Statistics
Artificial Intelligence
Machine Learning

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 66
Statistics
Foundation of most DM technologies
Regression analysis, standard
distribution/deviation/variance, cluster
analysis, confidence intervals
Building blocks
Significant role in todays data
mining but alone is not powerful
enough
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 67
Artificial Intelligence
Heuristics vs. Statistics
Human-thought-like processing
Requires vast computer processing
power
Supercomputers

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 68
Machine Learning
Union of statistics and AI
Blends AI heuristics with advanced
statistical analysis
Machine Learning let computer programs
learn about data they study - make different decisions
based on the quality of studied data
using statistics for fundamental concepts and adding
more advanced AI heuristics and algorithms

10/16/2014 Dr. Navneet Goyal, BITS, Pilani 69
Data Mining
Adoption of the Machine learning
techniques to the real world problems
Union: Statistics, AI, Machine learning
Used to find previously hidden trends or
patterns
Finding increasing acceptance in science
and business areas which need to
analyze large amount of data to discover
trends which could not be found
otherwise
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 70
Data Mining &
Statistics
Statistically speaking students doing DM course
at BITS are not good at statistics
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 71
Data Mining &
Statistics
Statistical ideas and methods are
fundamental to DM
Size of the data set is the most
fundamental difference
Statistician hundreds or thousands of
data points
Data Miner many millions or billions of data
points are not unexpected
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 72
Data Mining &
Statistics
Wal Mart in 1994 20 million transactions
11 TB database of customer txs. in 1998
AT&T has 100 m customers making 300 m long
distance calls in a day
Mobil Oil aims to store 100 TB of data on oil
exploration
Statistics is a primary process - data collected
with particular questions in mind
DM is secondary process data collected for
some other purpose
DM has also to deal with non traditional nature
of the data sets involved
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 73
Data Mining &
Statistics
Statistics is definitely a descriptive DM task
Some DM applications determine correlations
among data
Relationships are not causal
Relationship like AR
60% of the time when customers buy hot dog,
they also buy juice
No relationship at all except that they are
bought together just a coincidence
Sampling is used in DM in statistics it is
referred to as statistical inference
DM is meant to be used by business user and
not by statistician
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 74
Data Mining & ML
Students learn all the bad things from historical
data (read seniors)
Attending classes is a waste of time
Copying assignments is rewarding
Make up is your birth right
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 75
Data Mining & ML
ML is an area of AI that examines how to write
programs that can learn
In DM, ML is used for prediction or classification
In ML, the program makes a prediction and
then, based on feedback, learns
Learns through feedback, domain knowledge,
and examples
2 types of ML
Supervised: learns by example & uses training set and
correct answers
Unsupervised: data exists but there is no knowledge
of correct answer
ML research has focused on the learning portion
than on the creation of useful information
10/16/2014 Dr. Navneet Goyal, BITS, Pilani 76
Data Mining & ML
ML tries to do things that are difficult for
humans to do

ML tries to develop learning techniques that can
mimic human behavior

Objective of DM is to find information that can
be used o provide information to humans (not
take their place)

Você também pode gostar