Concepts and Techniques
Chapter 1
Introduction
November 10,
Chapter 1. Introduction
Evolution of Database Technology
1960s:
1970s:
1980s:
1990s:
2000s:
Alternative names
Other Applications
Where does the data come from? Credit card transactions, loyalty cards,
discount coupons, customer complaint calls, plus (public) lifestyle studies
Target marketing
Resource planning
Competition
Medical insurance
Retail industry
Anti-terrorism
Data mining: core of the knowledge discovery process
[Figure: knowledge discovery process flow]
Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
[Figure: layers from top to bottom, with the associated role in parentheses]
Decision Making (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
[Figure: Data Mining at the center, surrounded by contributing disciplines]
Machine Learning, Pattern Recognition, Statistics, Algorithm, Visualization, Other Disciplines
High-dimensionality of data
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
Knowledge to be mined
Techniques utilized
Applications adapted
General functionality
Object-relational databases
Multimedia database
Text databases
Cluster analysis
Class label is unknown: Group data to form new classes, e.g., cluster
houses to find distribution patterns
Maximizing intra-class similarity & minimizing interclass similarity
Outlier analysis
Outlier: Data object that does not comply with the general behavior
of the data
Noise or exception? Useful in fraud detection, rare events analysis
Trend and evolution analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera -> large SD memory
Periodicity analysis
Similarity-based analysis
Other pattern-directed or statistical analyses
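As a rough illustration of the outlier analysis item above, the sketch below flags values that deviate strongly from the rest of a small data set. The z-score test, the threshold of 2.0, the function name find_outliers, and the sample amounts are all assumptions chosen for illustration; the slides do not prescribe a particular method.

```python
from statistics import mean, pstdev


def find_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds `threshold`.

    Assumption: a z-score test is used only as one simple way to detect
    objects that do not comply with the general behavior of the data.
    """
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is far from the others.
amounts = [20, 22, 19, 21, 23, 20, 500]
outliers = find_outliers(amounts)
print(outliers)
```

Whether a flagged value is noise or a meaningful exception (for example, a fraudulent transaction) still has to be decided by the analyst.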
Interestingness measures
Approaches
First generate all the patterns and then filter out the
uninteresting ones
Generate only the interesting patterns: mining query
optimization
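The two approaches can be contrasted with a small sketch on item pairs. It assumes, purely for illustration, that "interesting" means "occurs in at least MIN_COUNT transactions"; the transactions, the threshold, and both function names are hypothetical. The second function prunes early by skipping items that are not frequent on their own, since a pair cannot be more frequent than either of its items.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]
MIN_COUNT = 2  # assumed interestingness threshold


def generate_then_filter(transactions, min_count):
    """Approach 1: count every pair, then drop the uninteresting ones."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


def generate_only_promising(transactions, min_count):
    """Approach 2: only count pairs built from individually frequent items."""
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in item_counts.items() if c >= min_count}
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(t & frequent), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_count}


all_then_filtered = generate_then_filter(transactions, MIN_COUNT)
pruned = generate_only_promising(transactions, MIN_COUNT)
print(all_then_filtered == pruned)
```

Both functions return the same pairs here; the second simply counts fewer candidates along the way.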
Task-relevant data
Background knowledge
Visualization/presentation of discovered patterns
Characterization
Discrimination
Association
Classification/prediction
Clustering
Outlier analysis
Schema hierarchy
Set-grouping hierarchy
Operation-derived hierarchy
Rule-based hierarchy
low_profit_margin(X) <= price(X, P1) and cost(X, P2) and
(P1 - P2) < $50
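The rule-based hierarchy example above can be read as a predicate over an item's price and cost. The following is a minimal sketch of that reading; the function name, the sample items, and the fallback label "other" are hypothetical and not part of the slides.

```python
def is_low_profit_margin(price, cost, threshold=50.0):
    """Rule from the slide: low_profit_margin(X) holds when (P1 - P2) < $50,
    where P1 is the price of X and P2 is its cost."""
    return (price - cost) < threshold


# Hypothetical items as (price, cost) pairs in dollars.
items = {
    "pen": (3.0, 1.0),
    "laptop": (900.0, 700.0),
}

# Map each item to a higher-level concept using the rule.
labels = {
    name: "low_profit_margin" if is_low_profit_margin(price, cost) else "other"
    for name, (price, cost) in items.items()
}
print(labels)
```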
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B), classification
reliability or accuracy, certainty factor, rule strength, rule
quality, discriminating weight, etc.
Utility
potential usefulness, e.g., support (association), noise
threshold (description)
Novelty
not previously known, surprising (used to remove
redundant rules, e.g., Illinois vs. Champaign rule
implication support ratio)
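Two of the measures named above, support (utility) and confidence (certainty), can be computed directly from transaction counts. The sketch below follows the slide's definition P(A|B) = #(A and B) / #(B); the transactions and function names are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, a, b):
    """P(A|B) = #(A and B) / #(B), as defined on the slide."""
    a, b = set(a), set(b)
    n_b = sum(1 for t in transactions if b <= t)
    if n_b == 0:
        return 0.0  # B never occurs, so the ratio is undefined; report 0
    n_ab = sum(1 for t in transactions if (a | b) <= t)
    return n_ab / n_b


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

sup = support(transactions, {"bread", "milk"})
conf = confidence(transactions, {"milk"}, {"bread"})
print(sup, conf)
```

Here bread and milk appear together in 2 of 4 transactions, and milk appears in 2 of the 3 transactions that contain bread.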
Motivation
Design
Loose coupling
[Figure: components, from top to bottom]
Knowledge-Base
Database or Data Warehouse Server
Data cleaning, integration, and selection
Database, Data Warehouse, World-Wide Web, Other Info Repositories
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio,
stream, Web
User interaction
Summary
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
ACM SIGMOD
VLDB
(IEEE) ICDE
WWW, SIGIR
Journals
KDD Explorations
Statistics
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
Visualization
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2nd ed., 2005