Mining Frequent Patterns, Associations, and Correlations:
Basic Concepts and Techniques
Sources: Han et al., Data Mining: Concepts and Techniques, 3rd Edition; Tan et al. (2016), Introduction to Data Mining.
Web resources: www.digitalvidya.com;
https://www.solver.com/xlminer/help/association-rules
Example rule: Computer ⇒ antivirus software [support = 2%, confidence = 60%]
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
Computationally prohibitive!
The total number of possible rules that can be extracted from a data set with d items is
$$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^{d} - 2^{d+1} + 1$$
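For example, a data set with d = 6 items gives
$$R = 3^{6} - 2^{7} + 1 = 729 - 128 + 1 = 602$$
possible rules, which is the setting for the pruning figure quoted next.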
More than 80% of these rules are discarded after applying minsup = 20%
and minconf = 50%, so most of the computation is wasted.
To avoid performing needless computations, it would be useful to prune
the rules early without having to compute their support and
confidence values.
Example of rules (from the transaction table below):

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

{Milk, Diaper} ⇒ {Beer} (s = 0.4, c = 0.67)
{Milk, Beer} ⇒ {Diaper} (s = 0.4, c = 1.0)
{Diaper, Beer} ⇒ {Milk} (s = 0.4, c = 0.67)
{Beer} ⇒ {Milk, Diaper} (s = 0.4, c = 0.67)
{Diaper} ⇒ {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} ⇒ {Diaper, Beer} (s = 0.4, c = 0.5)
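To make the s and c computations concrete, here is a minimal Python sketch (the helper support_count and the set-of-sets representation are illustrative, not from the source) that reproduces the numbers for the first rule:

```python
# Five transactions from the table above, each as a set of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

antecedent, consequent = {"Milk", "Diaper"}, {"Beer"}
s = support_count(antecedent | consequent) / len(transactions)        # 2/5 = 0.4
c = support_count(antecedent | consequent) / support_count(antecedent)  # 2/3
print(f"s = {s:.2f}, c = {c:.2f}")  # s = 0.40, c = 0.67
```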
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
If the itemset {Milk, Diaper, Beer} is infrequent, then all six candidate
rules above can be pruned immediately, without any need to
compute their confidence values.
Thus, we may decouple the support and confidence requirements.
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup. These
itemsets are called frequent itemsets.
2. Rule Generation
– Extract all the high confidence rules from each frequent
itemset found in step 1, where each rule is a binary
partitioning of a frequent itemset. These rules are called
strong rules.
Frequent itemset generation (Step 1) is still computationally
expensive compared with rule generation (Step 2).
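As a rough illustration of Step 2 (a sketch under the same set-of-sets representation used earlier, not the source's code): every binary partition of a frequent itemset is a candidate rule, and only those meeting minconf are kept as strong rules.

```python
from itertools import combinations

def rules_from_itemset(freq_itemset, transactions, minconf):
    """Enumerate all binary partitions A => (freq_itemset - A) of a
    frequent itemset and keep the high-confidence ("strong") rules."""
    freq_itemset = frozenset(freq_itemset)
    n_both = sum(1 for t in transactions if freq_itemset <= t)
    strong = []
    for r in range(1, len(freq_itemset)):
        for ante in map(frozenset, combinations(freq_itemset, r)):
            n_ante = sum(1 for t in transactions if ante <= t)
            confidence = n_both / n_ante
            if confidence >= minconf:
                strong.append((set(ante), set(freq_itemset - ante), confidence))
    return strong
```

Called on {Milk, Diaper, Beer} with the five transactions above and minconf = 0.6, this returns four of the six candidate rules; {Diaper} ⇒ {Milk, Beer} and {Milk} ⇒ {Diaper, Beer} are dropped (c = 0.5).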
Frequent Itemset Generation (a lattice structure can be used to list all possible itemsets)
[Figure: itemset lattice over items {A, B, C, D, E}, from the null set at the top through all 1-, 2-, 3-, and 4-itemsets down to ABCDE; with d items the lattice holds 2^d − 1 non-empty candidate itemsets.]
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
[Figure: each of the N transactions in the TID table above is matched against the list of M candidate itemsets; w denotes the maximum transaction width.]
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive, since M grows as 2^d !!! [N: number of
transactions; M = 2^d − 1: number of candidate itemsets; w: maximum transaction width]
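The O(NMw) cost is visible in a brute-force sketch (illustrative names, assuming the same set representation as above):

```python
from itertools import chain, combinations

def brute_force_support(transactions, items):
    """Enumerate all M = 2^d - 1 non-empty candidate itemsets and scan
    all N transactions for each; every subset test touches up to w items."""
    candidates = chain.from_iterable(
        combinations(sorted(items), k) for k in range(1, len(items) + 1)
    )
    counts = {}
    for cand in map(frozenset, candidates):                       # M candidates
        counts[cand] = sum(1 for t in transactions if cand <= t)  # N scans each
    return counts
```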
Strategies for reducing computational complexity of Frequent Itemset
Generation
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
$$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$$
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
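For instance, in the five-transaction table above: s({Milk}) = 4/5 ≥ s({Milk, Diaper}) = 3/5 ≥ s({Milk, Diaper, Beer}) = 2/5; adding items to an itemset can only leave its support unchanged or shrink it.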
Conversely, if an itemset is infrequent, then all its supersets must be
infrequent too.
E.g., if the itemset {a, b} is infrequent, then the entire sub-graph
containing the supersets of {a, b} can be pruned immediately,
without any need to compute their support values.
Apriori was the first association rule mining algorithm to use support-based
pruning to control the exponential growth of candidate itemsets.
[Figure: the same itemset lattice, where {A, B} is found to be infrequent; the entire sub-graph of supersets of {A, B} (ABC, ABD, ABE, ..., ABCDE) is pruned.]
Illustrating the Apriori Principle (Ref: Table 1). Assumption: support threshold =
minimum support count = 3.
[Figure: support-count tables for single items, pairs (2-itemsets), and triplets (3-itemsets).]
Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  • Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  • Prune candidate itemsets containing subsets of length k that are infrequent
  • Count the support of each candidate by scanning the DB
  • Eliminate candidates that are infrequent, leaving only those that are frequent
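A compact Python sketch of this method (a level-wise Apriori under the same set-of-sets representation; names are illustrative, not the source's code):

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Level-wise search: frequent 1-itemsets first, then repeatedly
    generate, prune, count, and eliminate (k+1)-candidates."""
    transactions = [frozenset(t) for t in transactions]
    items = set().union(*transactions)
    # k = 1: frequent itemsets of length 1
    frequent = {
        frozenset([i]) for i in items
        if sum(1 for t in transactions if i in t) >= minsup_count
    }
    all_frequent, k = set(frequent), 1
    while frequent:
        # Generate length-(k+1) candidates from length-k frequent itemsets.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k + 1}
        # Prune candidates containing an infrequent length-k subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Count support by scanning the DB; eliminate infrequent candidates.
        frequent = {c for c in candidates
                    if sum(1 for t in transactions if c <= t) >= minsup_count}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

On the five-transaction table above with minsup_count = 3, this yields {Bread}, {Milk}, {Diaper}, {Beer} and the pairs {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Diaper, Beer}; no 3-itemset survives.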