
Association Rules

Market Basket Analysis

Association Rules
Usually applied to market baskets, but other
applications are possible
Useful rules contain novel and actionable
information: e.g., on Thursdays, grocery customers are
likely to buy diapers and beer together
Trivial rules contain already-known information: e.g.,
people who buy maintenance agreements are the ones
who have also bought large appliances
Some novel rules may not be useful: e.g., new
hardware stores most commonly sell toilet rings

Association Rule: Basic Concepts


Given: (1) a set of transactions, where (2) each transaction is a
set of items (e.g., the items purchased by a customer in one visit)
Find: all rules that correlate the presence of one set
of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories
also get automotive services done

Applications

Retailing (what other products should the store stock up on?)
Attached mailing in direct marketing
Market basket analysis (what do people buy together?)
Catalog design (which items should appear next to each other?)

What Is Association Rule Mining?


Association rule mining:
Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.

Examples.
Rule form: Body ⇒ Head [support, confidence].
buys(x, diapers) ⇒ buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) ⇒ grade(x, CS, A)
[1%, 75%]

Rule Measures: Support and Confidence


[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]

Find all rules X & Y ⇒ Z with minimum confidence and support
Support, s: probability that a transaction contains {X, Y, Z}
Confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
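A minimal sketch (not from the slides) of how these two measures are computed over the transaction table above:

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support of lhs union rhs, divided by support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "C"}, transactions))       # 0.5   -> A => C has 50% support
print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> C => A has 100% confidence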

Mining Association Rules: An Example

Min. support 50%, min. confidence 50%.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle (Agrawal & Srikant, 1994):

Any subset of a frequent itemset must also be frequent

Mining Frequent Itemsets: the Key Step


Find the frequent itemsets: the sets of items that
have minimum support
A subset of a frequent itemset must also be a frequent
itemset
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be
frequent itemsets

Iteratively find frequent itemsets with cardinality from 1
to k (k-itemsets)

Use the frequent itemsets to generate association rules.

The Apriori Algorithm


Generate C1: all candidate 1-itemsets (the unique items)
Generate L1: the 1-itemsets with minimum support
Join step: Ck is generated by forming the Cartesian product of
Lk-1 with L1, dropping any candidate that has an infrequent
(k-1)-subset, since a (k-1)-itemset that is not frequent
cannot be a subset of a frequent k-itemset

Prune step: Lk is generated by selecting from Ck those itemsets
with minimum support
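A minimal sketch of the algorithm as just described (generate Ck from Lk-1, then select Lk by minimum support); an illustrative implementation, not optimized code from the slides, run on database D from the worked example on the next slide:

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]
    min_count = min_sup * len(transactions)

    def keep_frequent(candidates):
        counts = {c: sum(set(c) <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = sorted({i for t in transactions for i in t})
    Lk = keep_frequent([(i,) for i in items])            # L1
    frequent = dict(Lk)
    frequent_items = [i for (i,) in Lk]
    while Lk:
        # Join step: extend each frequent (k-1)-itemset with each frequent
        # item; drop candidates with an infrequent (k-1)-subset, since any
        # subset of a frequent itemset must also be frequent.
        Ck = set()
        for itemset in Lk:
            for i in frequent_items:
                if i > itemset[-1]:
                    cand = itemset + (i,)
                    if all(sub in frequent
                           for sub in combinations(cand, len(cand) - 1)):
                        Ck.add(cand)
        # Prune step (as this slide names it): keep candidates with min support.
        Lk = keep_frequent(sorted(Ck))
        frequent.update(Lk)
    return frequent

# Database D from the worked example on the next slide:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, min_sup=0.5))
# {(1,): 2, (2,): 3, (3,): 3, (5,): 3, (1, 3): 2, (2, 3): 2,
#  (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}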

The Apriori Algorithm: An Example

Min. support 50% (i.e., count >= 2 of 4 transactions).

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for counts of candidate 1-itemsets, C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1 (candidates with minimum support):
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

Generate C2 from L1:
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D for counts of C2:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Generate C3: {1 3 5} and {2 3 5}; {1 3 5} is dropped because its subset {1 5} is not frequent.

Scan D for counts of C3, giving L3:
itemset   sup
{2 3 5}   2

Is Apriori Fast Enough? Performance Bottlenecks

The core of the Apriori algorithm:
Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
Use database scans and pattern matching to collect counts for the candidate
itemsets

The bottleneck of Apriori: candidate generation

Huge candidate sets:
10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one
needs to generate 2^100 ≈ 10^30 candidates

Multiple scans of the database:
Needs n+1 scans, where n is the length of the longest pattern

Methods to Improve Apriori's Efficiency

Hash-based itemset counting: a k-itemset whose
corresponding hash-bucket count is below the threshold
cannot be frequent (see the sketch below)
Transaction reduction: a transaction that does not contain
any frequent k-itemset is useless in subsequent scans
Sampling: mine on a subset of the given data, with a lowered
support threshold plus a method to determine completeness
Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent
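A minimal sketch of the hash-based counting idea (in the spirit of DHP); the bucket count and hash function here are illustrative choices, not from the slides. While scanning transactions, every 2-itemset is hashed into a small table; a 2-itemset whose bucket total is below the support threshold cannot itself be frequent:

from itertools import combinations

def pair_bucket_counts(transactions, n_buckets=7):
    counts = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[hash(pair) % n_buckets] += 1
    return counts

def may_be_frequent(pair, buckets, min_count):
    # If the whole bucket holds fewer than min_count occurrences, no pair
    # hashed into it (including this one) can reach min_count.
    return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_count

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
buckets = pair_bucket_counts(D)
survivors = [p for p in combinations([1, 2, 3, 5], 2)
             if may_be_frequent(p, buckets, min_count=2)]
print(survivors)  # pairs passing the bucket filter (a superset of the frequent pairs)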

How to Count Supports of Candidates?


Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates

Method:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and
counts
An interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction (a simplified sketch follows)
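A simplified counting sketch: the hash-tree described above is replaced here by a plain hash map from candidate itemset to count, which already captures the key idea of the subset function, namely enumerating only the k-subsets of each transaction and probing for them rather than scanning the full candidate list:

from itertools import combinations

def count_candidates(transactions, candidates, k):
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Subset function: enumerate the k-subsets of t and look each one up.
        for sub in combinations(sorted(t), k):
            if sub in counts:
                counts[sub] += 1
    return counts

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
print(count_candidates(D, C2, k=2))
# {(1, 2): 1, (1, 3): 2, (1, 5): 1, (2, 3): 2, (2, 5): 3, (3, 5): 2}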

Criticism of Support and Confidence


Example 1 (Aggarwal & Yu, PODS'98):
Among 5000 students,
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the
overall percentage of students eating cereal is 75%, which is higher than
66.7%
not play basketball ⇒ eat cereal [35%, 87.5%] has lower support but
higher confidence!
play basketball ⇒ not eat cereal [20%, 33.3%] is more informative,
although it has lower support and confidence

             basketball   not basketball   sum (row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum (col.)   3000         2000             5000
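In terms of lift (defined on a later slide): lift(play basketball ⇒ eat cereal) = P(cereal | basketball) / P(cereal) = 66.7% / 75% ≈ 0.89 < 1, so playing basketball and eating cereal are in fact negatively correlated despite the rule's high confidence.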

Criticism of Support and Confidence


Example 2:
X and Y: positively correlated
X and Z, Y and Z: negatively correlated
Yet the support and confidence of X ⇒ Z dominate

We need a measure of dependent or correlated events:
P(B|A)/P(B) is called the lift of rule A ⇒ B

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Rule    Support   Confidence
X ⇒ Y   25%       50%
X ⇒ Z   37.5%     75%
Y ⇒ Z   12.5%     50%

Other Interestingness Measures: lift


Lift = P(B|A) / P(B) = P(A ∧ B) / (P(A) · P(B))
Takes both P(A) and P(B) into consideration
P(A ∧ B) = P(A) · P(B) if A and B are independent events
A and B are negatively correlated if lift < 1, and positively correlated if lift > 1

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Rule    Support   Lift
X ⇒ Y   25%       2
X ⇒ Z   37.5%     0.86
Y ⇒ Z   12.5%     0.57
Extensions

Multiple-Level Association Rules


Items often form a hierarchy.
Items at the lower levels are expected
to have lower support.
Rules regarding itemsets at
appropriate levels could be
quite useful.
A transaction database can be
encoded based on dimensions and levels.

[Item hierarchy diagram: Food → milk, bread; milk → skim, 2%; bread → wheat, white; with brand-level items such as Fraser and Sunset at the leaves]

Mining Multi-Level Associations


A top-down, progressive deepening approach:
First find high-level strong rules (ancestors):
milk ⇒ bread [20%, 60%]
Then find their lower-level weaker rules (descendants):
2% milk ⇒ wheat bread [6%, 50%]

Variations of mining multiple-level association rules:
Level-crossing association rules:
2% milk ⇒ Wonder wheat bread
Association rules with multiple, alternative hierarchies:
2% milk ⇒ Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

Uniform support: the same minimum support for all levels
+ Only one minimum support threshold; no need to examine itemsets
containing any item whose ancestors do not have minimum
support
- Lower-level items do not occur as frequently, so if the support
threshold is
too high: we miss low-level associations
too low: we generate too many high-level associations

Reduced support: reduced minimum support at lower levels
Needs modification to the basic algorithm

Uniform Support
Multi-level mining with uniform support

Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 5%):  2% Milk [support = 6%]   Skim Milk [support = 4%, below threshold]

Reduced Support
Multi-level mining with reduced support

Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 3%):  2% Milk [support = 6%]   Skim Milk [support = 4%]

Multi-level Association: Redundancy Filtering

Some rules may be redundant due to ancestor
relationships between items.
Example:
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

We say the first rule is an ancestor of the second rule.

A rule is redundant if its support is close to the
expected value based on the rule's ancestor: e.g., if about one
quarter of the milk sold is 2% milk, the expected support of the
second rule is 8% × 1/4 = 2%, exactly what is observed, so the
second rule adds no new information.

Multi-Level Mining: Progressive Deepening

A top-down, progressive deepening approach:
First mine high-level frequent items:
milk (15%), bread (10%)
Then mine their lower-level "weaker" frequent itemsets:
2% milk (5%), wheat bread (4%)

Different min-support thresholds across levels lead to
different algorithms (a sketch follows below):
If adopting the same min-support across all levels,
then prune an itemset t if any of t's ancestors is infrequent.

If adopting reduced min-support at lower levels,
then examine only those descendants whose ancestors are
frequent and whose own support is at least the reduced min-support.
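A minimal sketch of this progressive deepening with reduced support, assuming a two-level hierarchy given as a child-to-parent map; the item names, transactions, and thresholds are illustrative, not from the slides:

def item_support(item, transactions, parent):
    """Fraction of transactions containing `item` or one of its children."""
    hits = sum(any(i == item or parent.get(i) == item for i in t)
               for t in transactions)
    return hits / len(transactions)

def progressive_deepening(transactions, parent, minsup_l1, minsup_l2):
    # Level 1: mine the high-level (ancestor) items first.
    top = set(parent.values())
    freq_l1 = {a for a in top
               if item_support(a, transactions, parent) >= minsup_l1}
    # Level 2: descend only into children of frequent ancestors,
    # using the reduced threshold.
    children = [c for c, p in parent.items() if p in freq_l1]
    freq_l2 = {c for c in children
               if item_support(c, transactions, parent) >= minsup_l2}
    return freq_l1, freq_l2

# Transactions hold leaf-level items; `parent` encodes the hierarchy.
D = [{"2% milk", "wheat bread"}, {"skim milk"}, {"2% milk"}, {"white bread"}]
parent = {"2% milk": "milk", "skim milk": "milk",
          "wheat bread": "bread", "white bread": "bread"}
f1, f2 = progressive_deepening(D, parent, minsup_l1=0.5, minsup_l2=0.4)
print(sorted(f1), sorted(f2))   # ['bread', 'milk'] ['2% milk']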

Sequence pattern mining


A sequence/event database consists of sequences of values
or events changing with time
Data is recorded at regular intervals
Characteristic features of sequence data:
trend, cycle, seasonal, irregular

Examples:
Financial: stock prices, inflation
Biomedical: blood pressure
Meteorological: precipitation

Two types of sequence data


Event series
Record events that happen at certain times
E.g., network logins

Time series
Record changes of certain (typically numeric)
values over time
E.g., stock price movements, blood pressure

Event series
A series can be represented in two ways:
As a sequence (string) of events:
Empty space if no event occurs at a certain time
Hard to represent multiple simultaneous events
As a set of tuples {(time, event)}:
Allows for multiple events at the same time
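A small illustration of the two representations above; the event names and times are made up for this example:

# 1) As a string: one slot per time unit, "-" where no event occurred.
#    Two events at the same time cannot be shown in a single slot.
string_series = "AB-C-DE"

# 2) As a set of (time, event) tuples: simultaneous events are no problem.
tuple_series = {(0, "A"), (1, "B"), (3, "C"), (5, "D"), (5, "E")}

# Events occurring at time 5 (note D and E co-occur here):
print(sorted(e for (t, e) in tuple_series if t == 5))  # ['D', 'E']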

Types of interesting info


Which events happen often (not too interesting)
Which groups of events happen often:
People who rent Star Wars also rent Star Trek

Which sequences of events happen often:
Renting Star Wars, then Empire Strikes Back, then
Return of the Jedi, in that order

Associations of events within a time window:
People who rent Star Wars tend to rent Empire
Strikes Back within one week

Similarity/Difference with
Association Rules
Similarities:
Groups of events : frequent itemsets
Associations : association rules

Differences:
Notion of (time) windows:
People who rent Star Wars tend to rent Empire Strikes
Back within one week
Ordering of events is important

Episodes
A partially ordered sequence of events. Three kinds:

Serial: A → B (B follows A)
Parallel: A, B (B follows A or A follows B; the order is immaterial)
General: order between A and B unknown or immaterial,
but both A and B precede C

Sub-episode / super-episode
If A, B & C occur within a time window:
A & B is a sub-episode of A, B & C
A, B & C is a super-episode of A, B, C, A & B, B & C (and A & C)

Frequent episodes / Episode Rules


Frequent episodes
Find episodes that appear often

Episode rules
Used to emphasize the effect of events on episodes
Support/confidence defined as in association rules

Example (window size = 11)


AB-C-DEABE-F-A-DFECDAABBCDE

Episode Rules: Example

[Episode rule diagram: left-hand episode (A, B) ⇒ right-hand episode (A, B, C)]

Window size 10: support 4%, confidence 80%

Meaning: given that the episode on the left appears,
the episode on the right appears 80% of the time.
This essentially says that when (A, B) appears,
C also appears (within the given window size)

Mining episode rules


Apriori principle for episodes:
An episode can be frequent only if all of its sub-episodes are frequent

Thus an Apriori-based algorithm can be applied

However, there are a few tricky issues

Mining episode rules


Recognizing episodes in sequences:
Parallel episodes: standard association rule techniques
Serial/general episodes: finite-state-machine-based
recognition
Alternative: count parallel episodes first, then use them to
generate candidate episodes of the other types

Counting the number of windows (see the sketch below):
One event appears in w windows for window size w
OK if the sequence is long, as the ratios even out
However, when the sequence is short, the edges can
dominate
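A minimal sketch of window counting for a parallel episode over the example sequence from the earlier slide, assuming windows slide one time unit at a time and may overhang both ends of the series (the convention under which a single event falls inside exactly w windows, and the reason short sequences are dominated by edge effects):

def window_support(series, episode, w):
    """Fraction of width-w windows whose events contain all of `episode`.
    `series` is a string with one event (or '-' for no event) per time unit."""
    episode = set(episode)
    starts = range(-(w - 1), len(series))   # windows may overhang the ends
    n_windows = len(series) + w - 1
    hits = sum(episode <= (set(series[max(s, 0):s + w]) - {"-"})
               for s in starts)
    return hits / n_windows

print(window_support("AB-C-DEABE-F-A-DFECDAABBCDE", {"A", "B"}, w=11))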
