Clustering
Discover correlations and natural groupings in the data
Pattern representation
Clustering
Data abstraction
Cluster validation
Pattern Representation
Number of classes
Number of available patterns
Feature selection
Feature extraction
Pattern Proximity
Nominal attributes:
$d(x_i, x_j) = \frac{n - x}{n}$
where $n$ is the number of attributes and $x$ is the number of attributes with the same value in both instances.
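A minimal sketch of this distance in Python (the function name is illustrative):

```python
def matching_distance(xi, xj):
    """Simple matching distance for nominal attributes:
    (n - x) / n, where n is the number of attributes and
    x is the number of attributes with identical values."""
    assert len(xi) == len(xj)
    n = len(xi)
    x = sum(1 for a, b in zip(xi, xj) if a == b)
    return (n - x) / n

# Example: 2 of 4 attributes match, so d = (4 - 2) / 4 = 0.5
print(matching_distance(["sunny", "hot", "high", "FALSE"],
                        ["sunny", "mild", "high", "TRUE"]))
```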
Clustering Techniques

Clustering
  Hierarchical
    Single Link
    Complete Link
    CobWeb
  Partitional
    Square Error: K-means
    Mixture Resolution: Expectation Maximization
Technique Characteristics
Agglomerative vs Divisive
Hard vs Fuzzy
More Characteristics
Monothetic vs Polythetic
Incremental vs Non-Incremental
Hierarchical Clustering
Dendrogram
[Dendrogram figure: similarity on the vertical axis; leaves A, B, C, D, E, F, G are merged bottom-up into nested clusters]
Hierarchical Algorithms
Single-link
Complete-link
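Before looking at complete-link in detail, a small sketch contrasting the two linkage criteria (helper names are illustrative; any distance function can be plugged in):

```python
# Inter-cluster distance under the two linkage criteria:
# single-link uses the closest pair, complete-link the farthest pair.
def single_link(c1, c2, dist):
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, dist):
    return max(dist(a, b) for a in c1 for b in c2)

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
c1, c2 = [(0, 0), (1, 0)], [(3, 0), (5, 0)]
print(single_link(c1, c2, euclid))    # 2.0 (closest pair)
print(complete_link(c1, c2, euclid))  # 5.0 (farthest pair)
```

Single-link tends to produce elongated, chained clusters; complete-link favors compact ones.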
Complete-Link
[Figure: two clouds of points labeled 1 and 2 with cluster centers marked *; complete-link merges the pair of clusters whose farthest members are closest]
Partitional Clustering
K-Means
Predetermined number of clusters.
Start with seed clusters of one element each (the seeds).
Assign Instances to Clusters
New Clusters
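A minimal sketch of this seed-assign-recompute loop, assuming numeric instances represented as tuples (the names and the fixed iteration count are illustrative choices, not Weka's implementation):

```python
import random

def kmeans(points, k, iters=20):
    # Seed the clusters with k randomly chosen single elements.
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assign each instance to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        # Recompute each centroid as the mean of its new cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(dim) / len(c) for dim in zip(*c))
    return centroids, clusters
```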
Discussion: k-means
Clustering in Weka
CobWeb
The category utility of k clusters:

$CU(C_1, C_2, \ldots, C_k) = \frac{1}{k} \sum_{l} \Pr[C_l] \sum_i \sum_j \left( \Pr[a_i = v_{ij} \mid C_l]^2 - \Pr[a_i = v_{ij}]^2 \right)$
Why divide by k?
Category Utility
Without the division by $k$ it would always be best for each instance to have its own cluster: every $\Pr[a_i = v_{ij} \mid C_l]$ is then 1, which is overfitting!
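A small sketch of category utility for nominal attributes, following the formula above (function name illustrative):

```python
from collections import Counter

def category_utility(clusters):
    """clusters: list of clusters; each cluster is a list of
    instances, each instance a tuple of nominal attribute values."""
    k = len(clusters)
    data = [x for c in clusters for x in c]
    n = len(data)
    n_attrs = len(data[0])
    # Unconditional term: sum_i sum_j Pr[a_i = v_ij]^2
    base = sum((cnt / n) ** 2
               for i in range(n_attrs)
               for cnt in Counter(x[i] for x in data).values())
    cu = 0.0
    for c in clusters:
        p_c = len(c) / n
        # Conditional term: sum_i sum_j Pr[a_i = v_ij | C_l]^2
        cond = sum((cnt / len(c)) ** 2
                   for i in range(n_attrs)
                   for cnt in Counter(x[i] for x in c).values())
        cu += p_c * (cond - base)
    # Dividing by k penalizes one-instance-per-cluster solutions.
    return cu / k
```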
[Weather data table; Play column: No, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No]
Start by putting the first instance in its own cluster.
[Tree figure: instances a, b, c each in their own cluster under the root; the placement with highest category utility is chosen]
Adding Instance f
First instance not to get its own cluster:
[Tree figure: f is placed in the cluster containing e]
Add Instance g
Look at the instances:
E) Rainy Cool Normal FALSE
F) Rainy Cool Normal TRUE
G) Overcast Cool Normal TRUE
[Tree figure: g joins the cluster containing e and f]
Add Instance h
Look at the instances:
A) Sunny Hot High FALSE
D) Rainy Mild High FALSE
H) Sunny Mild High FALSE
Rearrange: the best host and the runner-up are merged into a single cluster before h is added.
[Tree figure: the clusters containing a and d are merged, then h joins the merged cluster]
(Splitting is also possible.)
Final Hierarchy
[Tree figure: the final CobWeb hierarchy over all 14 instances]
What next?
Dendrogram Clusters
[Dendrogram figure: the hierarchy cut into clusters]
What do a, b, c, d, h, k, and l have in common?
Numerical Attributes
For numeric attributes, assume a normal distribution; category utility becomes

$CU(C_1, C_2, \ldots, C_k) = \frac{1}{k} \sum_{l} \Pr[C_l] \, \frac{1}{2\sqrt{\pi}} \sum_i \left( \frac{1}{\sigma_{il}} - \frac{1}{\sigma_i} \right)$
Discussion
Advantages
Disadvantages
the division by k,
an artificial minimum value for the variance of numeric attributes,
an ad hoc cutoff value.
Probabilistic Perspective
Mixture Resolution
[Figure: a mixture of two normal distributions over a numeric attribute]
Given some data, how can you determine the parameters of each cluster's distribution?
Problems
We do not know which cluster each instance belongs to. For cluster A,

$\Pr[x \mid A] = f(x; \mu_A, \sigma_A), \qquad f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
EM Algorithm
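A minimal sketch of EM for a one-dimensional, two-cluster Gaussian mixture (illustrative, not Weka's implementation; the initialization and iteration count are arbitrary choices):

```python
import math, random

def em_gaussian_mixture(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    mu = random.sample(xs, 2)   # initial means: two random instances
    sigma = [1.0, 1.0]          # initial standard deviations
    pi = [0.5, 0.5]             # mixing weights
    pdf = lambda x, m, s: (math.exp(-(x - m) ** 2 / (2 * s * s))
                           / (math.sqrt(2 * math.pi) * s))
    for _ in range(iters):
        # E-step: probability that each instance belongs to each cluster.
        w = []
        for x in xs:
            p = [pi[j] * pdf(x, mu[j], sigma[j]) for j in (0, 1)]
            z = sum(p)
            w.append([pj / z for pj in p])
        # M-step: re-estimate parameters from the weighted instances.
        for j in (0, 1):
            wj = sum(wi[j] for wi in w)
            mu[j] = sum(wi[j] * x for wi, x in zip(w, xs)) / wj
            sigma[j] = math.sqrt(sum(wi[j] * (x - mu[j]) ** 2
                                     for wi, x in zip(w, xs)) / wj) or 1e-6
            pi[j] = wj / len(xs)
    return mu, sigma, pi
```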
Extension is straightforward if we assume the attributes are independent. If attributes are dependent, they can be treated jointly using the bivariate normal distribution. Nominal attributes are handled with discrete probabilities instead of normal densities.
EM using Weka
Options
Other Clustering
Applications
Image segmentation
Data Mining:
DM Clustering Challenges
Other (General) Challenges
Shape of clusters
Minimum domain knowledge (e.g.,
knowing the number of clusters)
Noisy data
Insensitivity to instance order
Interpretability and usability
Clustering for DM
Practical Partitional Clustering Algorithms
Large-Scale Problems
CLARANS:
Similar to CLARA
Draws samples randomly while searching
More effective than PAM and CLARA
Hierarchical Methods
BIRCH Mechanism
Phase I: scan the database to build an initial in-memory CF (clustering feature) tree.
Phase II: apply a clustering algorithm to the leaf nodes of the CF tree.
Conclusion
[Weather data table; Play column: No, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No]
Item Sets

1-item sets:
Outlook=sunny (5)
Outlook=overcast (4)
Outlook=rainy (5)
Temp=cool (4)
Temp=mild (6)

2-item sets:
Outlook=sunny & Temp=mild (2)
Outlook=sunny & Temp=hot (2)
Outlook=sunny & Humidity=normal (2)
Outlook=sunny & Windy=true (2)

3-item sets:
Outlook=sunny & Temp=hot & Humidity=high (2)
Outlook=sunny & Temp=hot & Play=no (2)
Outlook=sunny & Humidity=normal & Play=yes (2)
Outlook=sunny & Humidity=high & Windy=false (2)
Outlook=sunny & Humidity=high & Play=no (3)

4-item sets:
Outlook=sunny & Temp=hot & Humidity=high & Play=no (2)
Outlook=sunny & Humidity=high & Windy=false & Play=no (2)
Outlook=overcast & Temp=hot & Windy=false & Play=yes (2)
Outlook=rainy & Temp=mild & Windy=false & Play=yes (2)
Outlook=rainy & Humidity=normal & Windy=false & Play=yes (2)
Accuracy: 4/4, 4/6, 4/6, 4/7, 4/8, 4/9, 4/12
Accuracy: 2/2, 2/2, 2/2
Overall
58 association rules meet the minimum coverage and accuracy requirements, many involving humidity = normal in the antecedent or play = yes as the consequent.
Justification
Item Set 1: {Humidity = high}
Coverage(1) = Number of times humidity is high
Item Set 2: {Windy = false}
Coverage (2) = Number of times windy is false
Item Set 3: {Humidity = high, Windy = false}
Coverage (3) = Number of times humidity is high and
windy is false
Coverage(3) ≤ Coverage(1)
Coverage(3) ≤ Coverage(2)
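A small sketch making the argument concrete, with transactions as Python sets of attribute=value items (names illustrative):

```python
def coverage(item_set, transactions):
    """Number of transactions containing every item in the set."""
    return sum(1 for t in transactions if item_set <= t)

def accuracy(antecedent, consequent, transactions):
    """Accuracy (confidence) of the rule antecedent -> consequent."""
    return (coverage(antecedent | consequent, transactions)
            / coverage(antecedent, transactions))

# Adding items can only shrink coverage, never grow it.
ts = [{"humidity=high", "windy=false"}, {"humidity=high"}, {"windy=false"}]
print(coverage({"humidity=high", "windy=false"}, ts))  # 1
print(coverage({"humidity=high"}, ts))                 # 2
print(coverage({"windy=false"}, ts))                   # 2
```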
(A B C), (A B D), (A C D), (A C E)
Merge to generate 4-item sets:
(A B C D), (A C D E)
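A sketch of this merge step, assuming item sets are kept as sorted tuples (function name illustrative; full Apriori additionally prunes candidates that contain an infrequent subset):

```python
def merge_candidates(k_item_sets):
    """Join sorted k-item sets sharing their first k-1 items,
    e.g. (A,B,C) + (A,B,D) -> (A,B,C,D)."""
    out = set()
    for s in k_item_sets:
        for t in k_item_sets:
            if s[:-1] == t[:-1] and s[-1] < t[-1]:
                out.add(s + (t[-1],))
    return sorted(out)

three = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E")]
print(merge_candidates(three))  # [('A','B','C','D'), ('A','C','D','E')]
```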
Generating Rules
If windy = false and play = no then outlook = sunny and humidity = high (meets min. coverage and accuracy)
The double-consequent rule can only hold if both single-consequent rules hold:
If windy = false and play = no then outlook = sunny (meets min. coverage and accuracy)
If windy = false and play = no then humidity = high
Efficiency Improvement
Build candidate rules, then check them for accuracy.
Apriori Algorithm
Difficulties
Solution?
A partitioning-based, divide-and-conquer approach (as opposed to bottom-up candidate generation).
FP-Tree
Database (min. support = 3):

TID  Items              Frequent Items
100  F,A,C,D,G,I,M,P    F,C,A,M,P
200  A,B,C,F,L,M,O      F,C,A,B,M
300  B,F,H,J,O          F,B
400  B,C,K,S,P          C,B,P
500  A,F,C,E,L,P,M,N    F,C,A,M,P

Header table (item → head of node links): F, C, A, B, M, P

FP-tree:
Root
  F:4
    C:3
      A:3
        M:2
          P:2
        B:1
          M:1
    B:1
  C:1
    B:1
      P:1
Computational Effort
Each tree node stores: item name, count, node link.
Each header-table entry stores: item name, head of node link.
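A minimal sketch of these structures (class and function names illustrative): each node carries the three fields above, and the header table is a dict from item to the head of its node-link chain.

```python
class FPNode:
    """One FP-tree node: item name, count, and a node link chaining
    all nodes that carry the same item (for the header table)."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}      # item -> child FPNode
        self.node_link = None   # next node carrying the same item

def insert(root, transaction, header):
    """Insert one frequency-ordered transaction, sharing common prefixes."""
    node = root
    for item in transaction:
        if item in node.children:
            node.children[item].count += 1
        else:
            child = FPNode(item, node)
            node.children[item] = child
            # Prepend the new node to this item's node-link chain.
            child.node_link = header.get(item)
            header[item] = child
        node = node.children[item]

root, header = FPNode(None, None), {}
for t in ["FCAMP", "FCABM", "FB", "CBP", "FCAMP"]:
    insert(root, t, header)
print(root.children["F"].count)  # 4, matching the tree above
```

Inserting the five frequency-ordered transactions from the previous slide into an empty root reproduces the tree shown there.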
Comments
Mining Patterns
Example
[FP-tree as above, with the nodes on P's link chain highlighted]
Following P's node links from the header table yields two prefix paths:
<F:4, C:3, A:3, M:2, P:2>  (P occurs twice on this path)
<C:1, B:1, P:1>            (P occurs once on this path)
Frequent pattern: (P:3)
Rule Generation
Example

ID  Items
10  A,C,D,E,F
20  A,B,E
30  C,E,F
40  A,C,D,F
50  C,E,F
NOTE
With minimum support 2, the item counts C:4, E:4, F:4, A:3, D:2 fix the order C, E, F, A, D for building the conditional DBs.
Transactions in frequency order: CEFAD, EA, CEF, CFAD, CEF

D-cond DB (D:2): CEFA, CFA → Output: CFAD:2
A-cond DB (A:3): CEF, E, CF → Output: A:3
EA-cond DB (EA:2): C → Output: EA:2
F-cond DB (F:4): CE:3, C:1 → Output: CF:4, CEF:3
E-cond DB (E:4): C:3 → Output: E:4, CE:3
Output: C:4
[Taxonomy figure: Clothes → {Outerwear → {Jackets, Ski Pants}, Shirts}; Footwear → {Shoes, Hiking Boots}]
Why Taxonomy?
However:
Example

Transactions:
ID  Items
10  Shirt
20  Jacket, Hiking Boots
30  Ski Pants, Hiking Boots
40  Shoes
50  Shoes
60  Jacket

Frequent item sets (with ancestors added):
Item Set                   Support
{Jacket}                   2
{Outerwear}                3
{Clothes}                  4
{Shoes}                    2
{Hiking Boots}             2
{Footwear}                 4
{Outerwear, Hiking Boots}  2
{Clothes, Hiking Boots}    2
{Outerwear, Footwear}      2
{Clothes, Footwear}        2

Rules:
Rule                       Support  Confidence
Outerwear → Hiking Boots   2        2/3
Outerwear → Footwear       2        2/3
Hiking Boots → Outerwear   2        2/2
Hiking Boots → Clothes     2        2/2
Interesting Rules
Many ways in which the interestingness of a rule can be evaluated based on its ancestors. For example:

Rule ID  Rule                   Support
1        Clothes → Footwear     10
2        Outerwear → Footwear   8
3        Jackets → Footwear     4

Item       Support
Clothes    5
Outerwear  2
Jackets    1

From rule 1 and the item supports, rule 2's expected support is 10 × (2/5) = 4; its actual support of 8 is twice that, so rule 2 is interesting. Rule 3's expected support from rule 2 is 8 × (1/2) = 4, exactly its actual support, so rule 3 adds nothing new.
Discussion
Support
Optimized rules
Maximal frequent item sets
Closed item sets
What makes for an interesting rule?
Algorithm Construction
Bottom-up, counting: Apriori-like algorithms (Apriori*, AprioriTID, DIC)
Bottom-up, intersecting: Partition
Top-down, counting: FP-Growth*
Top-down, intersecting: Eclat
(* = have discussed)
No algorithm dominates the others!
Applications
Applications to recommender systems
Recommender
Classification Approach
Product associations
User associations
Advantages
Single-Consequent Rules
Association rules: all possible item combinations as the consequent.
Associations for recommenders: in between.
Classification: one single item as the consequent.