
Association Rules

Market Basket Analysis

Association Rules
Usually applied to market baskets, but other
applications are possible
Useful rules contain novel and actionable
information: e.g., on Thursdays, grocery customers are
likely to buy diapers and beer together
Trivial rules contain already-known information: e.g.,
people who buy maintenance agreements are the ones
who have also bought large appliances
Some novel rules may not be useful: e.g., new
hardware stores most commonly sell toilet rings

Association Rule: Basic Concepts


Given: (1) a set of transactions, where (2) each transaction is a
set of items (e.g., the items purchased by a customer in one visit)
Find: all rules that correlate the presence of one set
of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories
also get automotive services done

Applications

Retailing (what other products should the store stock up on?)
Attached mailing in direct marketing
Market basket analysis (what do people buy together?)
Catalog design (which items should appear next to each other?)

What Is Association Rule Mining?


Association rule mining:
Finding frequent patterns, associations, correlations, or
causal structures among sets of items or objects in
transaction databases, relational databases, and other
information repositories.

Examples.
Rule form: Body ⇒ Head [support, confidence].
buys(x, diapers) ⇒ buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) ⇒ grade(x, CS, A)
[1%, 75%]

Rule Measures: Support and Confidence


[Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both]

Find all rules X & Y ⇒ Z with minimum confidence and support
Support, s: probability that a transaction contains {X, Y, Z}
Confidence, c: conditional probability that a transaction having {X, Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
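A minimal sketch (not from the slides) of how these two measures are computed over the transaction table above:

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support of lhs union rhs, divided by support of lhs."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "C"}, transactions))       # 0.5   -> A => C has 50% support
print(confidence({"A"}, {"C"}, transactions))  # 0.666 -> A => C has 66.6% confidence
print(confidence({"C"}, {"A"}, transactions))  # 1.0   -> C => A has 100% confidence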

Mining Association Rules: An Example

Min. support 50%, min. confidence 50%.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle (Agrawal & Srikant, 1994):

Any subset of a frequent itemset must also be frequent

Mining Frequent Itemsets: the Key Step


Find the frequent itemsets: the sets of items that
have minimum support
A subset of a frequent itemset must also be a frequent
itemset
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be
frequent itemsets

Iteratively find frequent itemsets with cardinality from 1
to k (k-itemsets)

Use the frequent itemsets to generate association rules.

The Apriori Algorithm


Generate C1: all candidate 1-itemsets (the unique items)
Generate L1: the 1-itemsets with minimum support
Join step: Ck is generated by forming the Cartesian product of
Lk-1 with L1, dropping any candidate that has an infrequent
(k-1)-subset, since a (k-1)-itemset that is not frequent
cannot be a subset of a frequent k-itemset

Prune step: Lk is generated by selecting from Ck those itemsets
with minimum support
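A minimal sketch of the algorithm as just described (generate Ck from Lk-1, then select Lk by minimum support); an illustrative implementation, not optimized code from the slides, run on database D from the worked example on the next slide:

from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    transactions = [set(t) for t in transactions]
    min_count = min_sup * len(transactions)

    def keep_frequent(candidates):
        counts = {c: sum(set(c) <= t for t in transactions) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = sorted({i for t in transactions for i in t})
    Lk = keep_frequent([(i,) for i in items])            # L1
    frequent = dict(Lk)
    frequent_items = [i for (i,) in Lk]
    while Lk:
        # Join step: extend each frequent (k-1)-itemset with each frequent
        # item; drop candidates with an infrequent (k-1)-subset, since any
        # subset of a frequent itemset must also be frequent.
        Ck = set()
        for itemset in Lk:
            for i in frequent_items:
                if i > itemset[-1]:
                    cand = itemset + (i,)
                    if all(sub in frequent
                           for sub in combinations(cand, len(cand) - 1)):
                        Ck.add(cand)
        # Prune step (as this slide names it): keep candidates with min support.
        Lk = keep_frequent(sorted(Ck))
        frequent.update(Lk)
    return frequent

# Database D from the worked example on the next slide:
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(D, min_sup=0.5))
# {(1,): 2, (2,): 3, (3,): 3, (5,): 3, (1, 3): 2, (2, 3): 2,
#  (2, 5): 3, (3, 5): 2, (2, 3, 5): 2}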

The Apriori Algorithm: An Example

Min. support 50% (i.e., count >= 2 of 4 transactions).

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D for counts of candidate 1-itemsets, C1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L1 (candidates with minimum support):
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

Generate C2 from L1:
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D for counts of C2:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Generate C3: {1 3 5} and {2 3 5}; {1 3 5} is dropped because its subset {1 5} is not frequent.

Scan D for counts of C3, giving L3:
itemset   sup
{2 3 5}   2

Is Apriori Fast Enough? Performance Bottlenecks

The core of the Apriori algorithm:
Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
Use database scans and pattern matching to collect counts for the candidate
itemsets

The bottleneck of Apriori: candidate generation

Huge candidate sets:
10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
To discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one
needs to generate 2^100 ≈ 10^30 candidates

Multiple scans of the database:
Needs n+1 scans, where n is the length of the longest pattern

Methods to Improve Apriori's Efficiency

Hash-based itemset counting: a k-itemset whose
corresponding hash-bucket count is below the threshold
cannot be frequent (see the sketch below)
Transaction reduction: a transaction that does not contain
any frequent k-itemset is useless in subsequent scans
Sampling: mine on a subset of the given data, with a lowered
support threshold plus a method to determine completeness
Dynamic itemset counting: add new candidate itemsets only
when all of their subsets are estimated to be frequent
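A minimal sketch of the hash-based counting idea (in the spirit of DHP); the bucket count and hash function here are illustrative choices, not from the slides. While scanning transactions, every 2-itemset is hashed into a small table; a 2-itemset whose bucket total is below the support threshold cannot itself be frequent:

from itertools import combinations

def pair_bucket_counts(transactions, n_buckets=7):
    counts = [0] * n_buckets
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            counts[hash(pair) % n_buckets] += 1
    return counts

def may_be_frequent(pair, buckets, min_count):
    # If the whole bucket holds fewer than min_count occurrences, no pair
    # hashed into it (including this one) can reach min_count.
    return buckets[hash(tuple(sorted(pair))) % len(buckets)] >= min_count

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
buckets = pair_bucket_counts(D)
survivors = [p for p in combinations([1, 2, 3, 5], 2)
             if may_be_frequent(p, buckets, min_count=2)]
print(survivors)  # pairs passing the bucket filter (a superset of the frequent pairs)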

How to Count Supports of Candidates?


Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates

Method:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and
counts
An interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction (a simplified sketch follows)
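A simplified counting sketch: the hash-tree described above is replaced here by a plain hash map from candidate itemset to count, which already captures the key idea of the subset function, namely enumerating only the k-subsets of each transaction and probing for them rather than scanning the full candidate list:

from itertools import combinations

def count_candidates(transactions, candidates, k):
    counts = {c: 0 for c in candidates}
    for t in transactions:
        # Subset function: enumerate the k-subsets of t and look each one up.
        for sub in combinations(sorted(t), k):
            if sub in counts:
                counts[sub] += 1
    return counts

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
print(count_candidates(D, C2, k=2))
# {(1, 2): 1, (1, 3): 2, (1, 5): 1, (2, 3): 2, (2, 5): 3, (3, 5): 2}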

Criticism of Support and Confidence


Example 1 (Aggarwal & Yu, PODS'98):
Among 5000 students,
3000 play basketball
3750 eat cereal
2000 both play basketball and eat cereal
play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the
overall percentage of students eating cereal is 75%, which is higher than
66.7%
not play basketball ⇒ eat cereal [35%, 87.5%] has lower support but
higher confidence!
play basketball ⇒ not eat cereal [20%, 33.3%] is more informative,
although it has lower support and confidence

             basketball   not basketball   sum (row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum (col.)   3000         2000             5000
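In terms of lift (defined on a later slide): lift(play basketball ⇒ eat cereal) = P(cereal | basketball) / P(cereal) = 66.7% / 75% ≈ 0.89 < 1, so playing basketball and eating cereal are in fact negatively correlated despite the rule's high confidence.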

Criticism of Support and Confidence


Example 2:
X and Y: positively correlated
X and Z, Y and Z: negatively correlated
Yet the support and confidence of X ⇒ Z dominate

We need a measure of dependent or correlated events:
P(B|A)/P(B) is called the lift of rule A ⇒ B

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Rule    Support   Confidence
X ⇒ Y   25%       50%
X ⇒ Z   37.5%     75%
Y ⇒ Z   12.5%     50%

Other Interestingness Measures: lift


Lift = P(B|A) / P(B) = P(A ∧ B) / (P(A) · P(B))
Takes both P(A) and P(B) into consideration
P(A ∧ B) = P(A) · P(B) if A and B are independent events
A and B are negatively correlated if lift < 1, and positively correlated if lift > 1

X   1 1 1 1 0 0 0 0
Y   1 1 0 0 0 0 0 0
Z   0 1 1 1 1 1 1 1

Rule    Support   Lift
X ⇒ Y   25%       2
X ⇒ Z   37.5%     0.86
Y ⇒ Z   12.5%     0.57
Extensions

Multiple-Level Association Rules


Items often form a hierarchy.
Items at the lower levels are expected
to have lower support.
Rules regarding itemsets at
appropriate levels could be
quite useful.
A transaction database can be
encoded based on dimensions and levels.

[Item hierarchy diagram: Food → milk, bread; milk → skim, 2%; bread → wheat, white; with brand-level items such as Fraser and Sunset at the leaves]

Mining Multi-Level Associations


A top-down, progressive deepening approach:
First find high-level strong rules (ancestors):
milk ⇒ bread [20%, 60%]
Then find their lower-level weaker rules (descendants):
2% milk ⇒ wheat bread [6%, 50%]

Variations of mining multiple-level association rules:
Level-crossing association rules:
2% milk ⇒ Wonder wheat bread
Association rules with multiple, alternative hierarchies:
2% milk ⇒ Wonder bread

Multi-level Association: Uniform Support vs. Reduced Support

Uniform support: the same minimum support for all levels
+ Only one minimum support threshold; no need to examine itemsets
containing any item whose ancestors do not have minimum
support
- Lower-level items do not occur as frequently, so if the support
threshold is
too high: we miss low-level associations
too low: we generate too many high-level associations

Reduced support: reduced minimum support at lower levels
Needs modification to the basic algorithm

Uniform Support
Multi-level mining with uniform support

Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 5%):  2% Milk [support = 6%]   Skim Milk [support = 4%, below threshold]

Reduced Support
Multi-level mining with reduced support

Level 1 (min_sup = 5%):  Milk [support = 10%]
Level 2 (min_sup = 3%):  2% Milk [support = 6%]   Skim Milk [support = 4%]

Multi-level Association: Redundancy Filtering

Some rules may be redundant due to ancestor
relationships between items.
Example:
milk ⇒ wheat bread [support = 8%, confidence = 70%]
2% milk ⇒ wheat bread [support = 2%, confidence = 72%]

We say the first rule is an ancestor of the second rule.

A rule is redundant if its support is close to the
expected value based on the rule's ancestor: e.g., if about one
quarter of the milk sold is 2% milk, the expected support of the
second rule is 8% × 1/4 = 2%, exactly what is observed, so the
second rule adds no new information.

Multi-Level Mining: Progressive Deepening

A top-down, progressive deepening approach:
First mine high-level frequent items:
milk (15%), bread (10%)
Then mine their lower-level "weaker" frequent itemsets:
2% milk (5%), wheat bread (4%)

Different min-support thresholds across levels lead to
different algorithms (a sketch follows below):
If adopting the same min-support across all levels,
then prune an itemset t if any of t's ancestors is infrequent.

If adopting reduced min-support at lower levels,
then examine only those descendants whose ancestors are
frequent and whose own support is at least the reduced min-support.
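A minimal sketch of this progressive deepening with reduced support, assuming a two-level hierarchy given as a child-to-parent map; the item names, transactions, and thresholds are illustrative, not from the slides:

def item_support(item, transactions, parent):
    """Fraction of transactions containing `item` or one of its children."""
    hits = sum(any(i == item or parent.get(i) == item for i in t)
               for t in transactions)
    return hits / len(transactions)

def progressive_deepening(transactions, parent, minsup_l1, minsup_l2):
    # Level 1: mine the high-level (ancestor) items first.
    top = set(parent.values())
    freq_l1 = {a for a in top
               if item_support(a, transactions, parent) >= minsup_l1}
    # Level 2: descend only into children of frequent ancestors,
    # using the reduced threshold.
    children = [c for c, p in parent.items() if p in freq_l1]
    freq_l2 = {c for c in children
               if item_support(c, transactions, parent) >= minsup_l2}
    return freq_l1, freq_l2

# Transactions hold leaf-level items; `parent` encodes the hierarchy.
D = [{"2% milk", "wheat bread"}, {"skim milk"}, {"2% milk"}, {"white bread"}]
parent = {"2% milk": "milk", "skim milk": "milk",
          "wheat bread": "bread", "white bread": "bread"}
f1, f2 = progressive_deepening(D, parent, minsup_l1=0.5, minsup_l2=0.4)
print(sorted(f1), sorted(f2))   # ['bread', 'milk'] ['2% milk']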

Sequence pattern mining


A sequence/event database consists of sequences of values
or events changing with time
Data is recorded at regular intervals
Characteristic features of sequence data:
trend, cycle, seasonal, irregular

Examples:
Financial: stock prices, inflation
Biomedical: blood pressure
Meteorological: precipitation

Two types of sequence data


Event series
Record events that happen at certain times
E.g., network logins

Time series
Record changes of certain (typically numeric)
values over time
E.g., stock price movements, blood pressure

Event series
A series can be represented in two ways:
As a sequence (string) of events:
Empty space if no event occurs at a certain time
Hard to represent multiple simultaneous events
As a set of tuples {(time, event)}:
Allows for multiple events at the same time
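A small illustration of the two representations above; the event names and times are made up for this example:

# 1) As a string: one slot per time unit, "-" where no event occurred.
#    Two events at the same time cannot be shown in a single slot.
string_series = "AB-C-DE"

# 2) As a set of (time, event) tuples: simultaneous events are no problem.
tuple_series = {(0, "A"), (1, "B"), (3, "C"), (5, "D"), (5, "E")}

# Events occurring at time 5 (note D and E co-occur here):
print(sorted(e for (t, e) in tuple_series if t == 5))  # ['D', 'E']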

Types of interesting info


Which events happen often (not too interesting)
Which groups of events happen often:
People who rent Star Wars also rent Star Trek

Which sequences of events happen often:
Renting Star Wars, then Empire Strikes Back, then
Return of the Jedi, in that order

Associations of events within a time window:
People who rent Star Wars tend to rent Empire
Strikes Back within one week

Similarity/Difference with
Association Rules
Similarities:
Groups of events : frequent itemsets
Associations : association rules

Differences:
Notion of (time) windows:
People who rent Star Wars tend to rent Empire Strikes
Back within one week
Ordering of events is important

Episodes
A partially ordered sequence of events. Three kinds:

Serial: A → B (B follows A)
Parallel: A, B (B follows A or A follows B; the order is immaterial)
General: order between A and B unknown or immaterial,
but both A and B precede C

Sub-episode / super-episode
If A, B & C occur within a time window:
A & B is a sub-episode of A, B & C
A, B & C is a super-episode of A, B, C, A & B, B & C (and A & C)

Frequent episodes / Episode Rules


Frequent episodes
Find episodes that appear often

Episode rules
Used to emphasize the effect of events on episodes
Support/confidence defined as in association rules

Example (window size = 11)


AB-C-DEABE-F-A-DFECDAABBCDE

Episode Rules: Example

[Episode rule diagram: left-hand episode (A, B) ⇒ right-hand episode (A, B, C)]

Window size 10: support 4%, confidence 80%

Meaning: given that the episode on the left appears,
the episode on the right appears 80% of the time.
This essentially says that when (A, B) appears,
C also appears (within the given window size)

Mining episode rules


Apriori principle for episodes:
An episode can be frequent only if all of its sub-episodes are frequent

Thus an Apriori-based algorithm can be applied

However, there are a few tricky issues

Mining episode rules


Recognizing episodes in sequences:
Parallel episodes: standard association rule techniques
Serial/general episodes: finite-state-machine-based
recognition
Alternative: count parallel episodes first, then use them to
generate candidate episodes of the other types

Counting the number of windows (see the sketch below):
One event appears in w windows for window size w
OK if the sequence is long, as the ratios even out
However, when the sequence is short, the edges can
dominate
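A minimal sketch of window counting for a parallel episode over the example sequence from the earlier slide, assuming windows slide one time unit at a time and may overhang both ends of the series (the convention under which a single event falls inside exactly w windows, and the reason short sequences are dominated by edge effects):

def window_support(series, episode, w):
    """Fraction of width-w windows whose events contain all of `episode`.
    `series` is a string with one event (or '-' for no event) per time unit."""
    episode = set(episode)
    starts = range(-(w - 1), len(series))   # windows may overhang the ends
    n_windows = len(series) + w - 1
    hits = sum(episode <= (set(series[max(s, 0):s + w]) - {"-"})
               for s in starts)
    return hits / n_windows

print(window_support("AB-C-DEABE-F-A-DFECDAABBCDE", {"A", "B"}, w=11))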
