Concepts and
Techniques
May 5, 2015
Chapter 5: Concept Description: Characterization and Comparison
What is concept description?
Discussion
Summary
Concept description:
can handle complex data types of the
attributes and their aggregations
a more automated process
OLAP:
restricted to a small number of
dimension and measure types
user-controlled process
Data generalization
A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
[Figure: conceptual levels 1 to 5]
Approaches:
Data cube approach (OLAP approach)
Attribute-oriented induction approach
Strength
Limitations
Attribute-Oriented Induction
Example
Initial Relation

Name            Gender  Major    Birth-Place            Birth_date  Residence  Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     ...        687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     ...        253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     ...        420-5232  3.83

Generalization applied to each attribute:
Name: removed
Gender: retained
Major: Sci, Eng, Bus
Birth-Place: Country
Birth_date: Age range
Residence: City
Phone #: removed
GPA: Excl, VG, ...

Prime Generalized Relation

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22

Crosstab of Birth_Region by Gender

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
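The generalization step in this example can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm as given on the slides: the concept hierarchies, the GPA cut points, the reference year 1998, and all function names are assumptions made for the sketch. Name and phone number are dropped, the remaining attributes are climbed one level, and identical generalized tuples are merged while their counts accumulate.

```python
from collections import Counter

# Hypothetical concept hierarchy for major (illustrative only).
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science"}


def generalize_birth_place(place):
    """Climb city -> country, then map the country to Canada or Foreign."""
    country = place.split(",")[-1].strip()
    return "Canada" if country == "Canada" else "Foreign"


def generalize_age(birth_date, ref_year=1998):
    """Turn 'd-m-yy' into a 5-year age range; ref_year is an assumption."""
    year = 1900 + int(birth_date.split("-")[-1])
    low = ((ref_year - year) // 5) * 5
    return f"{low}-{low + 5}"


def generalize_gpa(gpa):
    """Map a numeric GPA to a label; the cut points are assumptions."""
    if gpa >= 3.75:
        return "Excellent"
    if gpa >= 3.5:
        return "Very-good"
    return "Good"


def attribute_oriented_induction(rows):
    """Generalize each tuple, then merge identical tuples and count them."""
    counts = Counter()
    for _name, gender, major, birth_place, birth_date, gpa in rows:
        generalized = (
            gender,                               # retained as-is
            MAJOR_TO_FIELD.get(major, major),     # major -> field
            generalize_birth_place(birth_place),  # place -> region
            generalize_age(birth_date),           # date -> age range
            generalize_gpa(gpa),                  # number -> label
        )
        counts[generalized] += 1
    return counts


students = [
    ("Jim Woodman", "M", "CS", "Vancouver, BC, Canada", "8-12-76", 3.67),
    ("Scott Lachance", "M", "CS", "Montreal, Que, Canada", "28-7-75", 3.70),
    ("Laura Lee", "F", "Physics", "Seattle, WA, USA", "25-8-70", 3.83),
]
prime_relation = attribute_oriented_induction(students)
for generalized_tuple, count in prime_relation.items():
    print(generalized_tuple, count)
```

With these assumed hierarchies the two CS students collapse into one generalized tuple, which is the merging behaviour the prime generalized relation relies on.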
Presentation of Generalized Results
Generalized relation:
Cross tabulation:
Visualization techniques:
Presentation: Generalized Relation
Presentation: Crosstab
Implementation by Cube Technology
Similarity:
Differences:
Attribute Relevance Analysis
Why?
Which dimensions should be included?
How high level of generalization?
Automatic vs. interactive
Reduce # attributes; easy to understand patterns
What?
statistical method for preprocessing data
How?
Data Collection
Analytical Generalization
Relevance Analysis
On selected dimension/level
Relevance Measures
Information-Theoretic Approach
Decision tree
each internal node tests an attribute
each branch corresponds to attribute value
each leaf node assigns a classification
ID3 algorithm
build decision tree based on training objects
with known class labels to classify testing
objects
rank attributes with information gain measure
minimal height
[Figure: decision tree rooted at Outlook]
Outlook
  sunny: Humidity
    high: no
    normal: yes
  overcast: yes
  rain: Wind
    strong: no
    weak: yes
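As a small illustration of how such a tree classifies an object, the sketch below stores the Outlook tree as nested dictionaries and walks it from the root to a leaf. The dictionary layout and the `classify` function are assumptions made for this example, not part of the slides.

```python
# Internal nodes test an attribute; each branch is an attribute value;
# each leaf is a class label ("yes" or "no").
TREE = {
    "attribute": "Outlook",
    "branches": {
        "sunny": {
            "attribute": "Humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rain": {
            "attribute": "Wind",
            "branches": {"strong": "no", "weak": "yes"},
        },
    },
}


def classify(tree, sample):
    """Follow the branch matching the sample's value until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = sample[node["attribute"]]
        node = node["branches"][value]
    return node


sample = {"Outlook": "sunny", "Humidity": "high", "Wind": "weak"}
print(classify(TREE, sample))
```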
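Expected information needed to classify a sample, with s_i samples of class C_i (i = 1, ..., m) out of s samples in total:
$$I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$
Entropy of attribute A with values {a_1, ..., a_v}, where s_{ij} is the number of samples of class C_i with A = a_j:
$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$
Information gained by branching on A:
$$Gain(A) = I(s_1, \ldots, s_m) - E(A)$$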
Example: Analytical Characterization
Task
Mine general characteristics describing graduate
students using analytical characterization
Given
attributes name, gender, major, birth_place,
birth_date, phone#, and gpa
Gen(a_i) = concept hierarchies on a_i
U_i = attribute analytical thresholds for a_i
T_i = attribute generalization thresholds for a_i
R = attribute relevance threshold
Example: Analytical Characterization (cont'd)
1. Data collection
target class: graduate student
contrasting class: undergraduate student
2. Analytical generalization using Ui
attribute removal
attribute generalization
Example: Analytical characterization (2)

Candidate relation for the target class: graduate students (total count 120)

gender  major        birth_country  age_range  gpa        count
M       Science      Canada         20-25      Very_good  16
F       Science      Foreign        25-30      Excellent  22
M       Engineering  Foreign        25-30      Excellent  18
F       Science      Foreign        25-30      Excellent  25
M       Science      Canada         20-25      Excellent  21
F       Engineering  Canada         20-25      Excellent  18

Candidate relation for the contrasting class: undergraduate students (total count 130)

gender  major        birth_country  age_range  gpa        count
M       Science      Foreign        <20        Very_good  18
F       Business     Canada         <20        Fair       20
M       Business     Canada         <20        Fair       22
F       Science      Canada         20-25      Fair       24
M       Engineering  Foreign        20-25      Very_good  22
F       Engineering  Canada         <20        Excellent  24
Example: Analytical characterization (3)
3. Relevance analysis
Calculate expected info required to classify an arbitrary tuple:
$$I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$$
For major=Science: s11 = 84 (number of grad students in Science), s21 = 42 (number of undergrad students in Science), I(s11, s21) = 0.9183
For major=Engineering: s12 = 36, s22 = 46, I(s12, s22) = 0.9892
For major=Business: s13 = 0, s23 = 42, I(s13, s23) = 0
Expected information if the tuples are partitioned on major:
$$E(major) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873$$
Calculate information gain for each attribute:
$$Gain(major) = I(s_1, s_2) - E(major) = 0.2115$$
Information gain for all attributes:
Gain(gender) = 0.0003
Gain(birth_country) = 0.0407
Gain(major) = 0.2115
Gain(gpa) = 0.4490
Gain(age_range) = 0.5971
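The arithmetic for major can be checked with a short script. This is a sketch under the standard entropy definitions; the function names `info` and `expected_info` are chosen for this example only. The per-major counts are the (graduate, undergraduate) pairs from the two candidate relations.

```python
from math import log2


def info(*counts):
    """Expected information I(s_1, ..., s_m) for the given class counts."""
    total = sum(counts)
    # Classes with a zero count contribute nothing (0 * log 0 is taken as 0).
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def expected_info(partitions):
    """E(A): weighted average of I over the partitions induced by attribute A."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * info(*p) for p in partitions)


# (graduate, undergraduate) counts for Science, Engineering, Business.
major_partitions = [(84, 42), (36, 46), (0, 42)]

i_all = info(120, 130)
e_major = expected_info(major_partitions)
gain_major = i_all - e_major

print(f"I(120, 130) = {i_all:.4f}")
print(f"E(major)    = {e_major:.4f}")
print(f"Gain(major) = {gain_major:.4f}")
```

The printed values should agree with the figures above to about three decimal places; the last digit of the gain can differ depending on whether rounding is applied before or after the subtraction.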
Example: Analytical characterization (5)
R = 0.1
remove irrelevant/weakly relevant attributes from
candidate relation => drop gender, birth_country
remove contrasting class candidate relation
major        age_range  gpa        count
Science      20-25      Very_good  16
Science      25-30      Excellent  47
Science      20-25      Excellent  21
Engineering  20-25      Excellent  18
Engineering  25-30      Excellent  18
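The step from the candidate relation to this working relation can be sketched as follows: attributes whose gain falls below R are dropped, and tuples that become identical are merged by summing their counts. The function name `working_relation` and the data layout are assumptions made for this illustration.

```python
from collections import Counter

ATTRIBUTES = ("gender", "major", "birth_country", "age_range", "gpa")
GAINS = {
    "gender": 0.0003,
    "birth_country": 0.0407,
    "major": 0.2115,
    "gpa": 0.4490,
    "age_range": 0.5971,
}
R = 0.1  # attribute relevance threshold

# Target-class candidate relation; the last field of each row is the count.
TARGET = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]


def working_relation(rows, attributes, gains, threshold):
    """Keep attributes with gain >= threshold and merge identical tuples."""
    keep = [i for i, name in enumerate(attributes) if gains[name] >= threshold]
    merged = Counter()
    for row in rows:
        key = tuple(row[i] for i in keep)
        merged[key] += row[-1]  # accumulate the count column
    return [attributes[i] for i in keep], merged


kept, relation = working_relation(TARGET, ATTRIBUTES, GAINS, R)
print(kept)
for key, count in relation.items():
    print(key, count)
```

The two Science, 25-30, Excellent tuples differ only in gender, so they merge once gender is dropped, which is where the count of 47 comes from.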
Task
Compare graduate and undergraduate
students using discriminant rule.
DMQL query
use Big_University_DB
mine comparison as grad_vs_undergrad_students
in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
for graduate_students
where status in graduate
versus undergraduate_students
where status in undergraduate
analyze count%
from student
Given
attributes name, gender, major, birth_place,
birth_date, residence, phone# and gpa
Gen(a_i) = concept hierarchies on attributes a_i
1. Data collection
target and contrasting classes
3. Synchronous generalization
controlled by user-specified dimension thresholds
prime target and contrasting class(es)
relations/cuboids
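Prime generalized relation for the target class (graduate students), surviving rows:

Birth_country  Age_range  Gpa        Count%
...            25-30      Good       2.32%
...            Over_30    Very_good  5.86%
Other          Over_30    Excellent  4.68%

Prime generalized relation for the contrasting class (undergraduate students):

Birth_country  Age_range  Gpa        Count%
Canada         15-20      Fair       5.53%
Canada         15-20      Good       4.53%
Canada         25-30      Good       5.02%
Other          Over_30    Excellent  0.68%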
5. Presentation
as generalized relations, crosstabs, bar
charts, pie charts, or rules
contrasting measures to reflect comparison
between target and contrasting classes
e.g. count%
Cj = target class
qa = a generalized tuple that covers some tuples of the target class, but can also cover some tuples of the contrasting class(es)
d-weight, range: [0, 1]
$$d\text{-weight} = \frac{count(q_a \in C_j)}{\sum_{i=1}^{m} count(q_a \in C_i)}$$
Example: Quantitative Discriminant Rule

Status         Birth_country  Age_range  Gpa   Count
Graduate       Canada         25-30      Good  90
Undergraduate  Canada         25-30      Good  210

Count distribution between graduate and undergraduate students for a generalized tuple

$$\forall X,\ graduate\_student(X) \Leftarrow birth\_country(X) = \text{"Canada"} \wedge age\_range(X) = \text{"25-30"} \wedge gpa(X) = \text{"good"} \quad [d: 30\%]$$
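The 30% in this rule is the graduate count divided by the count over all classes for the same generalized tuple, 90 / (90 + 210). A minimal sketch, with a function name chosen for this example:

```python
def d_weight(counts_by_class, target):
    """Share of a generalized tuple's count that falls in the target class."""
    total = sum(counts_by_class.values())
    return counts_by_class[target] / total


# Counts for the tuple (Canada, 25-30, Good) in each class.
counts = {"Graduate": 90, "Undergraduate": 210}
graduate_weight = d_weight(counts, "Graduate")
print(f"d-weight for Graduate: {graduate_weight:.0%}")
```

A d-weight near 1 means the tuple is found almost only in the target class; a value near 0 means it mostly describes the contrasting class.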
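Class Description
Quantitative characteristic rule: necessary
Quantitative discriminant rule: sufficient
Quantitative description rule: necessary and sufficient
$$\forall X,\ target\_class(X) \Leftrightarrow condition_1(X)\ [t: w_1, d: w'_1] \vee \cdots \vee condition_n(X)\ [t: w_n, d: w'_n]$$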
Example: Quantitative Description Rule

Location/item  TV                    Computer              Both_items
               Count  t-wt    d-wt   Count  t-wt    d-wt   Count  t-wt  d-wt
Europe         80     25%     40%    240    75%     30%    320    100%  32%
N_Am           120    17.65%  60%    560    82.35%  70%    680    100%  68%
Both_regions   200    20%     100%   800    80%     100%   1000   100%  100%

Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998

$$\forall X,\ Europe(X) \Leftrightarrow (item(X) = \text{"TV"})\ [t: 25\%, d: 40\%] \vee (item(X) = \text{"computer"})\ [t: 75\%, d: 30\%]$$
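Both weights in the crosstab come from the same cell count, divided by a different total: the t-weight divides by the row total (all items sold in that region), the d-weight by the column total (that item across all regions). The sketch below recomputes them from the four base counts; the dictionary layout and function name are assumptions for this example.

```python
# Units sold (in thousands), indexed by region and then item.
SALES = {
    "Europe": {"TV": 80, "Computer": 240},
    "N_Am": {"TV": 120, "Computer": 560},
}


def weights(crosstab, region, item):
    """Return (t_weight, d_weight) for one cell of the crosstab."""
    count = crosstab[region][item]
    row_total = sum(crosstab[region].values())              # all items in region
    col_total = sum(row[item] for row in crosstab.values())  # item in all regions
    return count / row_total, count / col_total


for region in SALES:
    for item in SALES[region]:
        t, d = weights(SALES, region, item)
        print(f"{region:7s} {item:9s} t-wt={t:.2%} d-wt={d:.2%}")
```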
Motivation
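Mean
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
Weighted arithmetic mean
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Median
estimated by interpolation
$$median = L_1 + \left( \frac{n/2 - (\sum f)_l}{f_{median}} \right) c$$
Mode
Empirical formula:
$$mean - mode = 3 \times (mean - median)$$
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$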
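These measures are straightforward to compute directly. The sketch below is illustrative only: the function names are chosen for this example, and `grouped_median` assumes equal-width intervals with L1 as the lower boundary of the median interval, (Σf)_l as the total frequency below it, f_median as its frequency, and c as its width.

```python
def mean(xs):
    return sum(xs) / len(xs)


def weighted_mean(xs, ws):
    return sum(w * x for w, x in zip(ws, xs)) / sum(ws)


def grouped_median(lower, width, freqs):
    """Interpolated median for equal-width intervals starting at `lower`."""
    n = sum(freqs)
    below = 0  # total frequency of the intervals before the current one
    for k, f in enumerate(freqs):
        if below + f >= n / 2:
            l1 = lower + k * width  # lower boundary of the median interval
            return l1 + (n / 2 - below) / f * width
        below += f


def variance(xs):
    """Sample variance from squared deviations about the mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def variance_shortcut(xs):
    """Same quantity from the sum of squares and the squared sum."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data), variance(data), variance_shortcut(data))
print(weighted_mean([1, 2, 3], [1, 1, 2]))
print(grouped_median(lower=0, width=10, freqs=[2, 3, 5]))
```

The shortcut form needs only one pass over the data, at the cost of being more sensitive to floating-point cancellation when the values are large relative to their spread.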
Boxplot Analysis
[Figure: a boxplot]
Visualization of Data Dispersion: Boxplot Analysis
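Variance
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$$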
Histogram Analysis
Quantile Plot
Scatter plot
Loess Curve
Summary
Discussion
http://www.cs.sfu.ca/~han/dmbook