
Mining Frequent Patterns, Associations and

Correlations
Basic Concepts and Techniques
Source : Books: Han et al. , 3rd Edition ; Tan et al. (2016)
Web Resource: www.digitalvidya.com
https://www.solver.com/xlminer/help/association-rules



 Basic Concepts



What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
– What products were often purchased together?— Beer and diapers?!
– What are the subsequent purchases after buying a Laptop?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?
 Applications
– Basket data analysis, cross-marketing, catalog design, sale campaign analysis,
Web log (click stream) analysis, and DNA sequence analysis.



Why Is Frequent Pattern Mining Important?

 Freq. pattern: An intrinsic and important property of datasets
 Foundation for many essential data mining tasks
– Association, correlation, and causality analysis
– Helps in classification and cluster analysis



Association Rule Mining

 Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Table 1: Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

General examples of association rules (not based on Table 1):
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!



Definition: Frequent Itemset
 Itemset
– A collection of one or more items
 Example: {Milk, Bread, Diaper}
– k-itemset: {X1, X2, X3, …, Xk}
 An itemset that contains k items

 Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2 (transactions as in Table 1)

 Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold



Definition: Association Rule
 Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}
– Here X: {Milk, Diaper} and Y: {Beer}

 Rule Evaluation Metrics
– Support (s)
 Fraction of transactions that contain both X and Y (probability that a transaction contains X ∪ Y)
– Confidence (c)
 Conditional probability that a transaction having X also contains Y

Example (transactions as in Table 1): {Milk, Diaper} → {Beer}
s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 ≈ 0.67
(A small computation sketch follows below.)
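To make the two metrics concrete, here is a minimal Python sketch (not from the slides) that recomputes s and c for {Milk, Diaper} → {Beer} over the Table 1 transactions; the support_count helper and the hard-coded transaction list are assumptions for illustration only.

```python
# Sketch: support and confidence of {Milk, Diaper} -> {Beer} over Table 1.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")
```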
Example: Market Basket Analysis

 As a manager of an electronics mega store, you want to learn more about the buying habits of customers, e.g.
“Which groups or sets of items are customers likely to purchase on a given trip to the store?”
For this, market basket analysis is performed on the retail data of customer transactions.
The results can be used to plan marketing or advertising strategies, design a new catalogue, or design different store layouts.
E.g. if it is revealed that customers who buy laptops also buy antivirus s/w, then both can be displayed together to increase the sales of both items. Alternatively, they can be displayed at a distance to entice customers to buy other items while heading for the s/w.
(Computer-printer): a discount on printers can promote the sale of both.



Example: Market Basket Analysis

Each item can be thought of as having a Boolean variable associated with it, denoting the presence or absence of the item. Then, each basket can be represented as a Boolean vector of values assigned to these variables.
The Boolean vectors can be analyzed to reveal buying patterns, i.e. items that are frequently associated or purchased together.
These patterns can be represented in the form of association rules.

Ex. Rule: Computer => antivirus s/w [support = 2%, confidence = 60%]

2% of the transactions under analysis show that computers and antivirus s/w are purchased together.
60% of the customers who purchased a computer also purchased antivirus s/w.
(A small representation sketch follows below.)
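A minimal sketch of the Boolean-vector representation described above, assuming a small fixed item vocabulary; the variable names are illustrative and not taken from the book.

```python
# Sketch: representing each basket as a Boolean vector over a fixed item vocabulary.
items = ["Bread", "Milk", "Diaper", "Beer", "Coke", "Eggs"]  # assumed vocabulary
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
]

# One Boolean vector per basket: True if the item is present, False otherwise.
vectors = [[item in basket for item in items] for basket in baskets]
for v in vectors:
    print([int(flag) for flag in v])   # e.g. [1, 1, 0, 0, 0, 0] for {Bread, Milk}
```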



Example: Market Basket Analysis

Association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold.
The thresholds are set by experienced users or domain experts.
Additional analysis can be performed to discover correlations between associated items.



Association and Causality

Association analysis results should be interpreted with caution.


Inference based on association rules does not necessarily imply
causality.
Instead, it suggests a strong co-occurrence relationship between items
in the antecedent and consequent of the rule.
Causality requires knowledge about the cause and effect attributes in
the data and involves relationships occurring over time (e.g. ozone
depletion leads to global warming)



Association Rule Mining Task

 Given a set of transactions T, the goal of association rule mining is to find all rules having minimum support and confidence:
– support ≥ minsup threshold
– confidence ≥ minconf threshold

 Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
 Computationally prohibitive!



Computational Complexity
 Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

R = \sum_{k=1}^{d-1} \binom{d}{k} \sum_{j=1}^{d-k} \binom{d-k}{j} = 3^d - 2^{d+1} + 1

If d = 6, R = 602 rules (a numeric check follows below)
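As a quick numeric check (not part of the original slides), the sketch below counts all rules by enumerating antecedent/consequent sizes with Python's math.comb and compares the result against the closed form 3^d − 2^(d+1) + 1; the function name total_rules is hypothetical.

```python
from math import comb

def total_rules(d):
    """Count all rules X -> Y with X, Y non-empty and disjoint, drawn from d items."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

d = 6
print(total_rules(d))            # 602
print(3**d - 2**(d + 1) + 1)     # 602, matches the closed form
```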



Computational Complexity

Of these 602 rules, more than 80% are discarded after applying minsup = 20% and minconf = 50%, so most of the computation is wasted.
To avoid performing needless computations, it would be useful to prune the rules early, without having to compute their support and confidence values.

An initial step for improving the performance of association rule mining algorithms is to decompose the support and confidence requirements.

Note that the support of a rule X → Y depends only on the support of the corresponding itemset X ∪ Y.
For example, the following rules have identical support because they involve items from the same itemset: {Beer, Diaper, Milk}
Mining Association Rules

Example of rules (transactions as in Table 1):
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence



Mining Association Rules

If the itemset {Milk, Diaper, Beer} is infrequent, then all 6 candidate rules can be pruned immediately, without any need to compute their confidence values.
Thus, we may decouple the support and confidence requirements.

A common strategy adopted by many association rule mining algorithms is to decompose the problem into two major subtasks:



Mining Association Rules

 Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup. These itemsets are called frequent itemsets.

2. Rule Generation
– Extract all the high confidence rules from each frequent
itemset found in step 1, where each rule is a binary
partitioning of a frequent itemset. These rules are called
strong rules.
 Frequent itemset generation (Step 1) is still
computationally expensive as compared to Rule
generation (Step 2)
Frequent Itemset Generation (a lattice structure can be used to list all possible itemsets)

Itemset lattice for five items {A, B, C, D, E}:
null
A  B  C  D  E
AB  AC  AD  AE  BC  BD  BE  CD  CE  DE
ABC  ABD  ABE  ACD  ACE  ADE  BCD  BCE  BDE  CDE
ABCD  ABCE  ABDE  ACDE  BCDE
ABCDE

Given d items, there are 2^d possible candidate itemsets.
Computational Complexity (An Illustration)

Itemset lattice for I = {a, b, c, d, e}

If I is a frequent itemset, each of its subsets will also be frequent.
A frequent itemset with k items can generate 2^k − 1 frequent itemsets, excluding the null set.
Because k can be very large in practical situations, the search space of itemsets that needs to be explored will be exponentially large.

A brute-force approach for finding frequent itemsets requires computing the support count for every candidate itemset in the lattice structure.
So, each candidate has to be compared against every transaction.
If the candidate is contained in a transaction, its support count is incremented.
E.g. the support count of {Bread, Milk} is incremented 3 times because this itemset is contained in transactions 1, 4 and 5.


Frequent Itemset Generation

 Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database (transactions as in Table 1)
– Match each transaction against every candidate
– Complexity ~ O(NMw) => expensive, since M = 2^d !!!
[N: number of transactions; M: number of candidate itemsets (2^d); w: maximum transaction width]
(A sketch of this counting loop follows below.)
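A rough sketch of this brute-force counting loop, assuming the Table 1 transactions; every non-empty itemset is treated as a candidate and matched against every transaction, which is exactly the O(NMw) cost discussed above. Names and data are illustrative only.

```python
from itertools import combinations

# Table 1 transactions (N = 5)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))       # d = 6 unique items

# Brute force: every non-empty subset of items is a candidate (2^d - 1 of them).
support = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        c = frozenset(candidate)
        # Match the candidate against every transaction (the O(NMw) part).
        support[c] = sum(1 for t in transactions if c <= t)

print(support[frozenset({"Bread", "Milk"})])     # 3 (transactions 1, 4, 5)
```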
Strategies for reducing computational complexity of Frequent Itemset
Generation

 Reduce the number of candidates (M)


– Complete search: M = 2^d
– Use pruning techniques to reduce M

 Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases
– Used by some mining algorithms

 Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or
transactions
– No need to match every candidate against every
transaction



Reducing Number of Candidates

 Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent

 Apriori principle holds due to the following property of the support measure:

\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support (a small check is sketched below)
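As a small sanity check of the anti-monotone property (an illustration, not textbook code), the sketch below verifies on the Table 1 transactions that removing an item from any itemset never decreases its support count.

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Check: s(X) >= s(Y) whenever X is a subset of Y (anti-monotone property).
for k in range(2, len(items) + 1):
    for Y in map(frozenset, combinations(items, k)):
        assert all(sigma(Y - {item}) >= sigma(Y) for item in Y)
print("anti-monotone property holds on this data")
```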
Conversely, if an itemset is infrequent, then all its supersets must be infrequent too.
E.g. if the itemset {a, b} is infrequent, then the entire sub-graph containing the supersets of {a, b} can be pruned immediately, without any need to compute their support counts.

This strategy of trimming the exponential search space based on the support measure is called support-based pruning. (Ref: anti-monotone property explained on the previous slide)

Apriori is the first association rule mining algorithm that uses support-based pruning to control the exponential growth of candidate itemsets.



Illustrating Apriori Principle

[Itemset lattice for {A, B, C, D, E}: the 2-itemset AB is found to be infrequent, so all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned without computing their support counts.]
Illustrating Apriori Principle (Ref: Table 1). Assumption: support threshold = 60%, which is equivalent to minsup = 3

Items (1-itemsets):
Bread 4, Milk 4, Diaper 4, Beer 3, Coke 2, Eggs 1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):
{Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3, {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3

Triplets (3-itemsets):
{Bread, Milk, Diaper} 2

Minimum Support = 3

If every subset is considered: 6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41 candidate itemsets
With support-based pruning: 6 + 6 + 1 = 13 candidate itemsets (a sketch reproducing these counts follows below)
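The 41-versus-13 comparison can be reproduced with the short sketch below, assuming the Table 1 data and a simplified candidate-generation step (a k-itemset is a candidate only if all of its (k−1)-subsets are frequent); names and details are illustrative rather than the textbook's exact procedure.

```python
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))
minsup = 3

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Without pruning: all 1-, 2- and 3-itemsets over the 6 items.
print(sum(comb(len(items), k) for k in (1, 2, 3)))            # 41

# With support-based pruning: a k-itemset is a candidate only if
# all of its (k-1)-subsets are frequent.
frequent = {frozenset([i]) for i in items if sigma(frozenset([i])) >= minsup}
total = len(items)                                            # 6 candidate 1-itemsets
for k in (2, 3):
    pool = sorted(set().union(*frequent))
    candidates = [frozenset(c) for c in combinations(pool, k)
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    total += len(candidates)
    frequent = {c for c in candidates if sigma(c) >= minsup}
print(total)                                                  # 6 + 6 + 1 = 13
```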



Illustrating Apriori Principle (Ref: Table 1). Assumption: support threshold = 60%, which is equivalent to minsup = 3

Since the only candidate triplet, {Bread, Milk, Diaper}, has count 2 and does not satisfy the minsup condition, the algorithm terminates. The frequent patterns are identified as
{Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Beer, Diaper}



Apriori Algorithm

 Method:

– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
 Generate length (k+1) candidate itemsets from length k
frequent itemsets
 Prune candidate itemsets containing subsets of length k that
are infrequent
 Count the support of each candidate by scanning the DB
 Eliminate candidates that are infrequent, leaving only those
that are frequent
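Below is a compact, illustrative sketch of this level-wise method on the Table 1 transactions, using a simplified candidate-generation step (combinations over the items of the current frequent itemsets, pruned by the subset check); the function apriori and its variable names are assumptions, not the textbook's pseudocode.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise frequent itemset mining with support-based pruning (sketch)."""
    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted(set().union(*transactions))
    # k = 1: frequent itemsets of length 1
    frequent = {frozenset([i]): sigma(frozenset([i])) for i in items
                if sigma(frozenset([i])) >= minsup}
    all_frequent = dict(frequent)
    k = 2
    while frequent:
        # Generate length-k candidates from length-(k-1) frequent itemsets,
        # pruning any candidate that has an infrequent (k-1)-subset.
        pool = sorted(set().union(*frequent))
        candidates = [frozenset(c) for c in combinations(pool, k)
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
        # Count support by scanning the DB and keep only the frequent candidates.
        frequent = {c: sigma(c) for c in candidates if sigma(c) >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for itemset, count in apriori(transactions, minsup=3).items():
    print(set(itemset), count)   # frequent 1-itemsets plus the four frequent pairs
```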

