
Mining Frequent Patterns, Associations and

Correlations
Basic Concepts and Techniques
Source : Books: Han et al. , 3rd Edition ; Tan et al. (2016)
Web Resource: www.digitalvidya.com
https://www.solver.com/xlminer/help/association-rules



 Basic Concepts



What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a Laptop?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
 Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis,
Web log (click stream) analysis, and DNA sequence analysis.



Why Is Frequent Pattern Mining Important?

 Freq. pattern: An intrinsic and important property of datasets
 Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Helps in classification and cluster analysis



Association Rule Mining

 Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Table 1: Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

General examples of association rules (not based on Table 1):
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!



Definition: Frequent Itemset
 Itemset
– A collection of one or more items
 Example: {Milk, Bread, Diaper}
– k-itemset: {X1, X2, X3, …, Xk}
 An itemset that contains k items

 Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2 (transactions as in Table 1)

 Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold



Definition: Association Rule
 Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
– Here X: {Milk, Diaper} and Y: {Beer}

 Rule Evaluation Metrics
– Support (s)
 Fraction of transactions that contain both X and Y (probability that a transaction contains X ∪ Y)
– Confidence (c)
 Conditional probability that a transaction having X also contains Y

Example (transactions as in Table 1): {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
(A small computation sketch follows below.)
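To make the two metrics concrete, here is a minimal Python sketch (not from the slides) that recomputes s and c for {Milk, Diaper} → {Beer} over the Table 1 transactions; the support_count helper and the hard-coded transaction list are assumptions for illustration only.

```python
# Sketch: support and confidence of {Milk, Diaper} -> {Beer} over Table 1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")
```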
Example: Market Basket Analysis

 As a manager of an electronics mega store, you want to learn more about the buying habits of customers, e.g.
“Which groups or sets of items are customers likely to purchase on a given trip to the store?”
For this, market basket analysis is performed on the retail data of customer transactions.
The results can be used to plan marketing or advertising strategies, design a new catalogue, or design different store layouts.
E.g. if it is revealed that customers who buy laptops also buy antivirus s/w, then both can be displayed together to increase the sales of both items. Alternatively, they can be displayed at a distance to entice customers to buy other items while heading for the s/w.
(Computer-printer): a discount on printers can promote the sale of both.



Example: Market Basket Analysis

Each item can be thought of as having a Boolean variable associated with it, denoting the presence or absence of the item. Then, each basket can be represented as a Boolean vector of values assigned to these variables.
The Boolean vectors can be analyzed to reveal buying patterns, i.e. items that are frequently associated or purchased together.
These patterns can be represented in the form of association rules.

Ex. Rule: Computer => antivirus s/w [support = 2%, confidence = 60%]

2% of the transactions under analysis show that computers and antivirus s/w are purchased together.
60% of the customers who purchased a computer also purchased antivirus s/w.
(A small representation sketch follows below.)
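A minimal sketch of the Boolean-vector representation described above, assuming a small fixed item vocabulary; the variable names are illustrative and not taken from the book.

```python
# Sketch: representing each basket as a Boolean vector over a fixed item vocabulary.
items = ["Bread", "Milk", "Diaper", "Beer", "Coke", "Eggs"]  # assumed vocabulary
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
]

# One Boolean vector per basket: True if the item is present, False otherwise.
vectors = [[item in basket for item in items] for basket in baskets]
for v in vectors:
    print([int(flag) for flag in v])   # e.g. [1, 1, 0, 0, 0, 0] for {Bread, Milk}
```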



Example: Market Basket Analysis

Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
The thresholds are set by experienced users or domain experts.
Additional analysis can be performed to discover correlations between associated items.



Association and Causality

Association analysis results should be interpreted with caution.


Inference based on association rules does not necessarily imply
causality.
Instead, it suggests a strong co-occurrence relationship between items
in the antecedent and consequent of the rule.
Causality requires knowledge about the cause and effect attributes in
the data and involves relationships occurring over time (e.g. ozone
depletion leads to global warming)



Association Rule Mining Task

 Given a set of transactions T, the goal of association rule mining is to find all rules having minimum support and confidence:
– support ≥ minsup threshold
– confidence ≥ minconf threshold

 Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
 Computationally prohibitive!



Computational Complexity
 Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1

If d = 6, R = 602 rules (a numeric check follows below)
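As a quick numeric check (not part of the original slides), the sketch below counts all rules by enumerating antecedent/consequent sizes with Python's math.comb and compares the result against the closed form 3^d − 2^(d+1) + 1; the function name total_rules is hypothetical.

```python
from math import comb

def total_rules(d):
    """Count all rules X -> Y with X, Y non-empty and disjoint, drawn from d items."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(total_rules(d))            # 602
print(3**d - 2**(d + 1) + 1)     # 602, matches the closed form
```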



Computational Complexity

Of these 602 rules, more than 80% are discarded after applying minsup = 20% and minconf = 50%, so most of the computation is wasted.
To avoid performing needless computations, it would be useful to prune the rules early, without having to compute their support and confidence values.

An initial step for improving the performance of association rule mining algorithms is to decompose the support and confidence requirements.

Note that the support of a rule X → Y depends only on the support of the corresponding itemset X ∪ Y.
For example, the following rules have identical support because they involve items from the same itemset: {Beer, Diaper, Milk}
Mining Association Rules

Example of rules (transactions as in Table 1):
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence



Mining Association Rules

If the itemset {Milk, Diaper, Beer} is infrequent, then all 6 candidate rules can be pruned immediately, without any need to compute their confidence values.
Thus, we may decouple the support and confidence requirements.

A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:



Mining Association Rules

 Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup. These itemsets are called frequent itemsets.

2. Rule Generation
– Extract all the high confidence rules from each frequent
itemset found in step 1, where each rule is a binary
partitioning of a frequent itemset. These rules are called
strong rules.
 Frequent itemset generation (Step 1) is still
computationally expensive as compared to Rule
generation (Step 2)
Frequent Itemset Generation (a lattice structure can be used to list all possible itemsets)

Itemset lattice for five items {A, B, C, D, E}:
null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets.
Computational Complexity (An Illustration)

Itemset lattice for I = {a, b, c, d, e}

If I is a frequent itemset, each of its subsets will also be frequent.
A frequent itemset with k items can generate 2^k − 1 frequent itemsets, excluding the null set.
Because k can be very large in practical situations, the search space of itemsets that needs to be explored will be exponentially large.

A brute-force approach for finding frequent itemsets requires computing the support count for every candidate itemset in the lattice structure.
So, each candidate has to be compared against every transaction.
If the candidate is contained in a transaction, its support count is incremented.
E.g. the support count of {Bread, Milk} is incremented 3 times because this itemset is contained in transactions 1, 4 and 5.


Frequent Itemset Generation

 Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database (transactions as in Table 1)
– Match each transaction against every candidate
– Complexity ~ O(NMw) => expensive, since M = 2^d !!!
[N: number of transactions; M: number of candidate itemsets (2^d); w: maximum transaction width]
(A sketch of this counting loop follows below.)
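A rough sketch of this brute-force counting loop, assuming the Table 1 transactions; every non-empty itemset is treated as a candidate and matched against every transaction, which is exactly the O(NMw) cost discussed above. Names and data are illustrative only.

```python
from itertools import combinations

# Table 1 transactions (N = 5)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))       # d = 6 unique items

# Brute force: every non-empty subset of items is a candidate (2^d - 1 of them).
support = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        c = frozenset(candidate)
        # Match the candidate against every transaction (the O(NMw) part).
        support[c] = sum(1 for t in transactions if c <= t)

print(support[frozenset({"Bread", "Milk"})])     # 3 (transactions 1, 4, 5)
```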
Strategies for reducing computational complexity of Frequent Itemset
Generation

 Reduce the number of candidates (M)


– Complete search: M = 2^d
– Use pruning techniques to reduce M

 Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases
– Used by some mining algorithms

 Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction



Reducing Number of Candidates

 Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent

 Apriori principle holds due to the following property of the support measure:

\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support (a small check is sketched below)
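As a small sanity check of the anti-monotone property (an illustration, not textbook code), the sketch below verifies on the Table 1 transactions that removing an item from any itemset never decreases its support count.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Check: s(X) >= s(Y) whenever X is a subset of Y (anti-monotone property).
for k in range(2, len(items) + 1):
    for Y in map(frozenset, combinations(items, k)):
        assert all(sigma(Y - {item}) >= sigma(Y) for item in Y)
print("anti-monotone property holds on this data")
```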
Conversely, if an itemset is infrequent, then all its supersets must be infrequent too.
E.g. if the itemset {a, b} is infrequent, then the entire sub-graph containing the supersets of {a, b} can be pruned immediately, without any need to compute their support counts.

This strategy of trimming the exponential search space based on the support measure is called support-based pruning. (Ref: anti-monotone property explained on the previous slide)

Apriori is the first association rule mining algorithm that uses support-based pruning to control the exponential growth of candidate itemsets.



Illustrating Apriori Principle

[Itemset lattice for {A, B, C, D, E}: the 2-itemset AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned without computing their support counts.]
Illustrating Apriori Principle (Ref: Table 1). Assumption: support threshold = 60%, which is equivalent to minsup = 3

Items (1-itemsets):
Bread 4, Milk 4, Diaper 4, Beer 3, Coke 2, Eggs 1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
{Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3, {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3

Triplets (3-itemsets):
{Bread, Milk, Diaper} 2

Minimum Support = 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets
With support-based pruning: 6 + 6 + 1 = 13 candidate itemsets (a sketch reproducing these counts follows below)
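The 41-versus-13 comparison can be reproduced with the short sketch below, assuming the Table 1 data and a simplified candidate-generation step (a k-itemset is a candidate only if all of its (k−1)-subsets are frequent); names and details are illustrative rather than the textbook's exact procedure.

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
minsup = 3

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Without pruning: all 1-, 2- and 3-itemsets over the 6 items.
print(sum(comb(len(items), k) for k in (1, 2, 3)))            # 41

# With support-based pruning: a k-itemset is a candidate only if
# all of its (k-1)-subsets are frequent.
frequent = {frozenset([i]) for i in items if sigma(frozenset([i])) >= minsup}
total = len(items)                                            # 6 candidate 1-itemsets
for k in (2, 3):
    pool = sorted(set().union(*frequent))
    candidates = [frozenset(c) for c in combinations(pool, k)
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    total += len(candidates)
    frequent = {c for c in candidates if sigma(c) >= minsup}
print(total)                                                  # 6 + 6 + 1 = 13
```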



Illustrating Apriori Principle (Ref: Table 1). Assumption: support threshold = 60%, which is equivalent to minsup = 3

Since the only candidate triplet, {Bread, Milk, Diaper}, has count 2 and does not satisfy the minsup condition, the algorithm terminates. The frequent patterns are identified as
{Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Beer, Diaper}



Apriori Algorithm

 Method:

– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
 Generate length (k+1) candidate itemsets from length k
frequent itemsets
 Prune candidate itemsets containing subsets of length k that
are infrequent
 Count the support of each candidate by scanning the DB
 Eliminate candidates that are infrequent, leaving only those
that are frequent
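Below is a compact, illustrative sketch of this level-wise method on the Table 1 transactions, using a simplified candidate-generation step (combinations over the items of the current frequent itemsets, pruned by the subset check); the function apriori and its variable names are assumptions, not the textbook's pseudocode.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining with support-based pruning (sketch)."""
    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    # k = 1: frequent itemsets of length 1
    frequent = {frozenset([i]): sigma(frozenset([i])) for i in items
                if sigma(frozenset([i])) >= minsup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Generate length-k candidates from length-(k-1) frequent itemsets,
        # pruning any candidate that has an infrequent (k-1)-subset.
        pool = sorted(set().union(*frequent))
        candidates = [frozenset(c) for c in combinations(pool, k)
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
        # Count support by scanning the DB and keep only the frequent candidates.
        frequent = {c: sigma(c) for c in candidates if sigma(c) >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for itemset, count in apriori(transactions, minsup=3).items():
    print(set(itemset), count)   # frequent 1-itemsets plus the four frequent pairs
```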

