Dr. Mohsen Kahani
http://www.um.ac.ir/~kahani/
Motivation:
“Necessity is the Mother of Invention”
[Figure: Data Mining and Knowledge Discovery at the intersection of Machine Learning, Visualization, Statistics, and Databases]
[Figure: the knowledge discovery process, with stages labeled Raw Data, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, and Understanding]
Data Exploration
Statistical Analysis, Querying and Reporting
Descriptive Method
- …finding human-interpretable patterns that describe the data…
Many approaches:
Statistics,
Decision Trees,
Neural Networks,
...
Linear Regression
w0 + w1 x + w2 y >= 0
Regression computes the weights wi from the data
to minimize squared error and ‘fit’ the data
Not flexible enough
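As a minimal sketch of the idea above, the snippet below fits the weights w0, w1, w2 by least squares using NumPy and then classifies a point by the sign of w0 + w1 x + w2 y. The six data points and the +1 / -1 label encoding are made up for illustration and do not come from the slides.

```python
import numpy as np

# Hypothetical 2-D points (x, y) with class labels encoded as +1 / -1.
points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
                   [4.0, 4.5], [5.0, 4.0], [4.5, 5.0]])
labels = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])

# Design matrix with a leading column of ones, so that w[0] plays the role of w0.
design = np.column_stack([np.ones(len(points)), points])

# Least squares: choose w to minimize the squared error ||design @ w - labels||^2.
w, *_ = np.linalg.lstsq(design, labels, rcond=None)

def classify(x, y):
    """Return +1 if w0 + w1*x + w2*y >= 0, otherwise -1."""
    return 1 if w[0] + w[1] * x + w[2] * y >= 0 else -1

predictions = [classify(x, y) for x, y in points]
print("weights:", w)
print("predictions:", predictions)
```

Because the decision boundary is a single straight line, a data set whose classes are not linearly separable cannot be fit this way, which is the "not flexible enough" limitation noted above.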
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Splitting Attributes (decision tree):
Refund
  Yes: NO
  No: MarSt
    Married: NO
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES

The splitting attribute at a node is determined based on the Gini index.
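To make the Gini index concrete, the sketch below computes it for the ten training records in the table and for three candidate splits. The function names `gini` and `split_gini` are illustrative, and the 80K income threshold is taken from the tree in the figure; this shows only the impurity calculation, not a full tree-induction algorithm.

```python
from collections import Counter

# Training records from the table: (refund, marital_status, income_k, cheat).
records = [
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

def gini(labels):
    """Gini index of a list of class labels: 1 - sum over classes of p_c^2."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def split_gini(rows, key):
    """Weighted Gini index of the partitions obtained by grouping rows on key(row)."""
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row[-1])
    return sum(len(group) / len(rows) * gini(group) for group in groups.values())

root_gini = gini([row[-1] for row in records])
refund_gini = split_gini(records, lambda row: row[0])
marital_gini = split_gini(records, lambda row: row[1])
income_gini = split_gini(records, lambda row: row[2] > 80)

for name, value in [("no split", root_gini), ("Refund", refund_gini),
                    ("Marital Status", marital_gini), ("Income > 80K", income_gini)]:
    print(f"{name}: {value:.3f}")
```

A lower weighted Gini index means purer child nodes, so a Gini-based learner prefers the candidate split with the smallest value at each node.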
[Figure: neural network with Inputs, a Hidden Layer, and an Output]
Dr. Kahani - Expert Systems and Knowledge Engineering
Neural Networks (cont.)
[Figure: Training Set -> Learn Classifier -> Model]
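Once a model exists, classification amounts to applying it to records. As a rough illustration, the sketch below hand-codes the Refund / MarSt / TaxInc decision tree shown earlier as a Python function (it is written by hand here, not learned) and checks it against the ten training records. The function name `predict_cheat` and the treatment of an income of exactly 80K are assumptions.

```python
def predict_cheat(refund, marital_status, income_k):
    """Apply the Refund / MarSt / TaxInc decision tree to one record."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Single or Divorced: split on taxable income.
    # The tree labels its branches "< 80K" and "> 80K"; sending exactly 80K
    # to the "No" branch is an assumption made for this sketch.
    return "Yes" if income_k > 80 else "No"

# Training set: (refund, marital_status, income_k, cheat).
training_set = [
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

# Fraction of training records whose predicted class matches the recorded class.
correct = sum(predict_cheat(refund, marital, income) == cheat
              for refund, marital, income, cheat in training_set)
accuracy = correct / len(training_set)
print(f"training accuracy: {accuracy:.2f}")
```

Accuracy on the training set says little about how a model behaves on unseen records, which is why a separate set of held-out records is normally used for evaluation.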
Classification Application
Direct Marketing
Fraud Detection
Customer Attrition/Churn
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
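The two rules can be checked against the five transactions with the usual support and confidence measures. These measures are not named on the slide, so treat the sketch below as one common way to score such rules; the function names are illustrative.

```python
# The five market-basket transactions, one set of items per TID.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    return support(antecedent | consequent) / support(antecedent)

rules = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]
for lhs, rhs in rules:
    print(sorted(lhs), "-->", sorted(rhs),
          f"support={support(lhs | rhs):.2f}",
          f"confidence={confidence(lhs, rhs):.2f}")
```

For {Milk} --> {Coke}, three of the five transactions contain both items and four contain Milk, giving support 3/5 and confidence 3/4. For {Diaper, Milk} --> {Beer} the values are 2/5 and 2/3.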
Inventory Management
Deviation Detection:
…discovering most significant changes in data from
previously measured or normative values…
V. Kumar, M. Joshi, Tutorial on High Performance Data Mining.
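One simple way to picture deviation detection is to compare new measurements with a baseline of previously measured values and flag those that lie far from it. The sketch below uses a z-score with a threshold of 3; the baseline numbers, the new values, and the threshold are all hypothetical, and real deviation-detection methods are often more elaborate.

```python
from statistics import mean, stdev

# Hypothetical baseline: previously measured daily transaction counts.
baseline = [100, 102, 98, 101, 99, 103, 97]
# Hypothetical new measurements to compare against the baseline.
new_values = [101, 96, 130]

mu = mean(baseline)
sigma = stdev(baseline)  # sample standard deviation of the baseline

def deviations(values, threshold=3.0):
    """Return (value, z_score) pairs whose |z_score| exceeds threshold."""
    flagged = []
    for value in values:
        z_score = (value - mu) / sigma
        if abs(z_score) > threshold:
            flagged.append((value, z_score))
    return flagged

flagged = deviations(new_values)
for value, z_score in flagged:
    print(f"value {value} deviates from the baseline (z = {z_score:.1f})")
```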
Sample
Extract a portion of the dataset for data mining
Explore
Modify
Create, select, and transform variables with the intention of building
a model
Model
Specify a relationship of variables that reliably predicts a desired
goal
Assess
Evaluate the practical value of the findings and the model resulting
from the data mining effort
Data Mining Methodology:
CRISP-DM
Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment
CRISP-DM tasks and their outputs, by phase:

Business Understanding
  Determine Business Objectives: Background, Business Objectives, Business Success Criteria
  Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine Data Mining Goal: Data Mining Goals, Data Mining Success Criteria

Data Understanding
  Collect Initial Data: Initial Data Collection Report
  Describe Data: Data Description Report
  Explore Data: Data Exploration Report
  Verify Data Quality: Data Quality Report

Data Preparation
  Data Set: Data Set Description
  Select Data: Rationale for Inclusion / Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes, Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data

Modeling
  Select Modeling Technique: Modeling Technique, Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings, Models, Model Description
  Assess Model: Model Assessment, Revised Parameter Settings

Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria, Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions, Decision

Deployment
  Plan Deployment: Deployment Plan
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
  Produce Final Report: Final Report, Final Presentation
  Review Project: Experience Documentation
Data Mining and Discrimination
Can discrimination be based on features like sex,
age, national origin?
In some areas (e.g. mortgages, employment), some
features cannot be used for decision making
In other areas, these features are needed to assess the
risk factors
E.g. people of African descent are more susceptible to
sickle cell anemia
Data Mining and Privacy
Can information collected for one purpose be used for mining
data for another purpose?
In Europe, generally no, without explicit consent
In the US, generally yes
Companies routinely collect information about customers and
use it for marketing, etc.
People may be willing to give up some of their privacy in
exchange for some benefits
See Data Mining And Privacy Symposium,
www.kdnuggets.com/gpspubs/ieee-expert-9504-priv.html
Data Mining and Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
Replace sensitive personal data with an anonymous ID
Give randomized outputs
Multi-party computation over distributed data
…
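Two of these ideas can be sketched in a few lines. The example below replaces an identifier with a keyed hash (HMAC-SHA256) and randomizes yes/no answers with the classic randomized-response scheme, which is one possible reading of "randomized outputs". The key, the function names, the 75% truth probability, and the simulated answers are all hypothetical. A keyed hash yields a pseudonym rather than full anonymity, and multi-party computation is not shown.

```python
import hashlib
import hmac
import random

# Hypothetical secret key; in practice it would be stored apart from the mined data.
SECRET_KEY = b"replace-with-a-real-secret"

def anonymous_id(personal_id: str) -> str:
    """Map a sensitive identifier to a stable pseudonym via HMAC-SHA256."""
    digest = hmac.new(SECRET_KEY, personal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def randomized_response(truth: bool, rng: random.Random, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, otherwise a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

# Simulated population: 30% of 1000 people have the sensitive attribute.
rng = random.Random(0)
true_answers = [True] * 300 + [False] * 700
reported = [randomized_response(answer, rng) for answer in true_answers]

# E[observed] = 0.75 * true_rate + 0.25 * 0.5, so invert to estimate true_rate.
observed = sum(reported) / len(reported)
estimated_rate = (observed - 0.25 * 0.5) / 0.75

print("pseudonym:", anonymous_id("customer-12345"))
print(f"estimated rate: {estimated_rate:.2f}")
```

No individual reported answer can be trusted, yet the aggregate rate is still recoverable, which fits the point that data mining looks for patterns, not people.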
[Figure: expectations versus performance of data mining over time (1990, 1998, 2000, 2002), showing rising expectations, over-inflated expectations, disappointment, then growing acceptance and mainstreaming]
Final Remarks