Berlin Chen
Graduate Institute of Computer Science & Information Engineering National Taiwan Normal University
References:
1. Data Mining: Concepts, Models, Methods and Algorithms, Chapter 8
2. Data Mining: Concepts and Techniques, Chapter 6
Given:
A database of transactions
Each transaction is a list of items (purchased by a customer in a visit)
Find:
All rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications:
Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples:
Rule form: Body → Head [support, confidence]
buys(x, diapers) → buys(x, beers) [0.5%, 60%]
major(x, CS) ∧ takes(x, DB) → grade(x, A) [1%, 75%]
Example: let minimum support = 50% and minimum confidence = 50%.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Rules obtained:
A → C (support 50%, confidence 66.6%)
C → A (support 50%, confidence 100%)
Use the obtained large (frequent) itemsets to generate the association rules whose confidence is above a predefined minimum threshold, i.e., with confidence as the criterion
For rule A → C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
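The support and confidence computation above can be sketched directly on the four-transaction example. The helper names `support` and `confidence` are my own, not from the slides:

```python
# Support/confidence check for the rule A -> C on the four-transaction example.
transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """support(antecedent ∪ consequent) / support(antecedent)."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({"A", "C"}, transactions))        # 0.5
print(confidence({"A"}, {"C"}, transactions))   # 0.666...
```

Note that confidence is just a conditional relative frequency: of the three transactions containing A, two also contain C.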
Ck = Lk−1 ⋈ Lk−1 = {X ∪ Y | X, Y ∈ Lk−1 and |X ∩ Y| = k−2}
Or more specifically, li, lj ∈ Lk−1 are joined if
(li[1] = lj[1]) ∧ (li[2] = lj[2]) ∧ ... ∧ (li[k−2] = lj[k−2]) ∧ (li[k−1] < lj[k−1])
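The Apriori candidate-generation step (join on the first k−2 items, then prune candidates with an infrequent subset) can be sketched as follows; the function name `apriori_gen` follows the textbook convention, the rest is my own sketch:

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Candidate k-itemsets from the frequent (k-1)-itemsets L_prev (join + prune)."""
    sorted_sets = [tuple(sorted(s)) for s in L_prev]
    candidates = set()
    for li in sorted_sets:
        for lj in sorted_sets:
            # Join step: first k-2 items equal, last item of li < last item of lj.
            if li[:-1] == lj[:-1] and li[-1] < lj[-1]:
                c = frozenset(li) | frozenset(lj)
                # Prune step: every (k-1)-subset of c must itself be frequent.
                if all(frozenset(sub) in L_prev
                       for sub in combinations(sorted(c), k - 1)):
                    candidates.add(c)
    return candidates

L2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}
print(apriori_gen(L2, 3))  # {frozenset({'A', 'B', 'C'})}
```

Here {B, C, D} is produced by the join but removed by the prune, because its subset {C, D} is not in L2.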
Example transactions: {A, B}, {A, D, E}, {B, D}
A, B, D are frequent 1-itemsets; however, {A, B}, {A, D}, {B, D} are not frequent 2-itemsets
Example
For each nonempty proper subset s of a frequent itemset l, output the rule s → (l − s) if its confidence meets the minimum confidence threshold
Nonempty subsets of {I1, I2, I5}: {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}
Resulting association rules (if the confidence threshold is 70%):
I1 ∧ I2 → I5 with c = 2/4
I1 ∧ I5 → I2 with c = 2/2
I2 ∧ I5 → I1 with c = 2/2
I1 → I2 ∧ I5 with c = 2/6
I2 → I1 ∧ I5 with c = 2/7
I5 → I1 ∧ I2 with c = 2/2
Only the rules with confidence ≥ 70% are output
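The rule-generation step for a single frequent itemset can be sketched as below. The support counts are the ones used in the I1/I2/I5 example; the function name `rules_from_itemset` is my own:

```python
from itertools import combinations

# Support counts from the slide's worked example.
sup = {
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
}

def rules_from_itemset(l, sup, min_conf):
    """Emit (antecedent, consequent, confidence) for every passing rule s -> l - s."""
    l = frozenset(l)
    out = []
    for r in range(1, len(l)):                       # all nonempty proper subsets
        for s in map(frozenset, combinations(sorted(l), r)):
            conf = sup[l] / sup[s]
            if conf >= min_conf:
                out.append((set(s), set(l - s), conf))
    return out

for s, rest, c in rules_from_itemset({"I1", "I2", "I5"}, sup, 0.70):
    print(sorted(s), "->", sorted(rest), f"confidence={c:.2f}")
```

Exactly the three rules with confidence 100% ({I1, I5} → {I2}, {I2, I5} → {I1}, {I5} → {I1, I2}) survive the 70% threshold.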
However, the overall percentage of students eating cereal is 75%, which is larger than 66%: P(eat cereal) > P(eat cereal | play basketball), i.e., 0.75 > 0.66 => negatively associated!
Or, require support(A, B) − support(A)·support(B) > k for some threshold k — a kind of statistical (linear) independence test. E.g., for the association rule in the previous example:
support(play basketball, eat cereal) − support(play basketball)·support(eat cereal) = 0.4 − 0.6 × 0.75 = −0.05 < 0 (negatively associated!)
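This independence test (often called leverage) is one line of arithmetic; the function name `leverage` is my own:

```python
# Leverage test: support(A, B) - support(A) * support(B).
# A negative value means A and B are negatively associated.
def leverage(sup_ab, sup_a, sup_b):
    return sup_ab - sup_a * sup_b

lev = leverage(0.40, 0.60, 0.75)  # play-basketball / eat-cereal example
print(lev)  # approximately -0.05 -> negatively associated
```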
An itemset whose corresponding bucket count is below the support threshold cannot be frequent and thus should be removed
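The hash-bucket pruning idea can be sketched as follows. This is a toy illustration, not the slides' exact scheme: the 7-bucket table and the character-sum hash function `h` are my own assumptions, and different pairs may share a bucket (so the test can only rule pairs out, never confirm them):

```python
from itertools import combinations

def h(pair, n_buckets=7):
    # Deterministic toy hash function (an illustrative assumption).
    return sum(ord(ch) for item in pair for ch in item) % n_buckets

def bucket_counts(db, n_buckets=7):
    """While scanning the DB, hash every 2-itemset of each transaction into a bucket."""
    counts = [0] * n_buckets
    for t in db:
        for pair in combinations(sorted(t), 2):
            counts[h(pair, n_buckets)] += 1
    return counts

def may_be_frequent(pair, counts, min_count, n_buckets=7):
    # If the whole bucket is below min_count, the pair itself cannot be frequent.
    return counts[h(tuple(sorted(pair)), n_buckets)] >= min_count

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
counts = bucket_counts(db)
print(may_be_frequent({"A", "C"}, counts, min_count=2))  # True  (bucket is large enough)
print(may_be_frequent({"B", "E"}, counts, min_count=2))  # False (bucket below threshold)
```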
Reasons:
No candidate generation, no candidate tests
Uses a compact data structure (the FP-tree)
Eliminates repeated database scans
Basic operations are counting and FP-tree building
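The FP-tree construction step can be sketched as below — a minimal sketch covering only node insertion (no header table, no mining step), with class and function names of my own choosing:

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_count):
    """Scan 1: count items. Scan 2: insert each transaction's frequent items,
    ordered by descending global frequency, as a path from the root."""
    freq = Counter(item for t in db for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = FPNode(None, None)
    for t in db:
        path = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in path:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, freq

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
root, freq = build_fp_tree(db, min_count=2)
print(freq)                       # {'A': 3, 'B': 2, 'C': 2}
print(root.children["A"].count)   # 3 (transactions sharing prefix A share one branch)
```

Sharing prefixes is what makes the structure compact: the three transactions containing A all pass through a single node of count 3.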
Concept Hierarchy
Here we focus on finding frequent itemsets with items belonging to the same concept level
We say the first rule is an ancestor of the second rule. A rule is redundant if its support is close to the expected value based on the rule's ancestor: it does not offer any additional information and is less general than its ancestor
Apriori property: every subset of a frequent predicate set must also be frequent