Association rule mining is one of the fundamental research topics in data mining and knowledge discovery that identifies interesting relationships between itemsets in datasets and predicts the associative and correlative behaviors for new data. Rooted in market basket analysis, there are a great number of techniques developed for association rule mining. They include frequent pattern discovery, interestingness, complex associations, and multiple data source mining. This paper introduces the up-to-date prevailing association rule mining methods and advocates the mining of complete association rules, including both positive and negative association rules. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 97-116 DOI: 10.1002/widm.10
and an evaluation criterion. The last challenge comes with the constraints such as accuracy requirements and time and space limits.

As an introduction to association analysis, this article focuses on the basic concepts, typical techniques, and some applications. The article is organized as follows: The section 'Association Rules' introduces the basic concepts and the Apriori algorithm concerning association rule discovery. The section 'Complete Association Rule Analysis: Mining both Positive and Negative Rules' describes a complete association analysis that identifies both positive and negative association rules of interest. In section 'Applications of Association Rules', we list some of the main applications of association rules. Conclusion remarks are given in the fifth section.

ASSOCIATION RULES

This section introduces some representative work on association rule mining, including the supp-conf framework, the Apriori algorithm, and some research directions of association rule mining.

The Support-Confidence Framework

Let I = {i1, i2, ..., iN} be a set of literals or items. For example, milk, sugar, and bread for purchase in a store are items. Assume D is a set of transactions over I, called the transaction database, in which a transaction is a set of items, that is, a subset of I. A transaction has an associated unique identifier called Transaction IDentifier (TID).

A set of items is referred to as an itemset. For simplicity, an itemset {i1, i2, i3} is sometimes written as i1i2i3. The number of items in an itemset is the length (or the size) of the itemset. Itemsets of some length k are referred to as k-itemsets.

Each itemset has an associated statistical measure called support, denoted as supp or p. The supp is either the proportion of transactions in the database that contain the itemset or the number of transactions that contain the itemset. Formally, for an itemset X ⊆ I, p(X) is defined as the fraction of transactions in D containing X, or

p(X) = (1/n) Σ_{i=1}^{n} 1(X ⊆ Di)

where the database D is viewed as a vector of n records (or transactions) Di such that each record is a set of items.

An itemset X in a transaction database D is called a large (or frequent) itemset if its support p(X) is equal to, or greater than, a threshold of minimal support (minsup, ms), which is given by users or experts.

An association rule is an implication X → Y that describes the existence of a relationship between itemsets X and Y, where X, Y ⊂ I and X ∩ Y = ∅ (itemsets X and Y must not intersect). Each association rule has two quality measurements: support and confidence. The confidence, denoted as conf, is the ratio of the number of transactions that include all items in the consequent as well as the antecedent (namely, the support) to the number of transactions that include all items in the antecedent. For X → Y, they are defined as follows:

(1) The supp of X → Y is the supp of XY (i.e., X ∪ Y), where XY means both itemsets X and Y occur at the same time; that is, supp = the frequency of occurring patterns, or p(XY).
(2) The conf of X → Y is the ratio p(XY)/p(X); that is, conf = the strength of implication.

Association rules provide information of this type in the form of 'if-then' statements. These rules are computed from the data and, unlike the if-then rules in logic, association rules are probabilistic in nature. In addition to the antecedent (the 'if' part) and the consequent (the 'then' part), an association rule has the supp and conf measurements that express the degree of uncertainty about the rule. In association analysis, the antecedent and consequent are sets of items (called itemsets) that are disjoint (without any items in common).

Association rule mining seeks interesting associations and/or correlation relationships between frequent itemsets in datasets. Association rules show attribute-value conditions that occur frequently together in a given dataset. A typical and widely used example of association rule mining is market basket analysis. The first effort on mining association rules is based on a supp-conf framework as follows.

The supp-conf framework (Agrawal et al.3). Let I be the set of items in database D, X, Y ⊆ I be itemsets, X ∩ Y = ∅, p(X) ≠ 0, and p(Y) ≠ 0. Assume the minimal support (minsup, or ms) and minimal confidence (minconf, or mc) are given by users or experts. Then X → Y is a valid association rule if p(XY) ≥ ms and conf(X → Y) ≥ mc.

Accordingly, association rule mining can be broken down into two subproblems as follows.

(1) Generating all itemsets that have a supp greater than, or equal to, the user-specified ms. That is, identifying all frequent itemsets.
(2) Generating all rules that have the mc in the following way: for a frequent itemset Z, any X ⊂ Z, and Y = Z − X, if the conf of a rule X → Y is greater than, or equal to, the mc (or p(Z)/p(X) ≥ mc), then it can be extracted as a valid rule.

98    © 2011 John Wiley & Sons, Inc.    Volume 1, March/April 2011

WIREs Data Mining and Knowledge Discovery    Fundamentals of association rules

With the above decomposed subproblems, the supp-conf framework is a simple and easy-to-understand two-step process. The first step is to search for frequent itemsets and the second step generates association rules.

The Apriori Algorithm

The complexity of an association rule mining system is heavily dependent upon the identification of frequent itemsets. The most prevailing algorithm to perform this identification is the Apriori algorithm.

Agrawal and Srikant4 observed an interesting downward closure property, called Apriori, among frequent k-itemsets: a k-itemset is frequent only if all of its subitemsets are frequent. Accordingly, the frequent 1-itemsets are searched in the first scan of the database; then the frequent 1-itemsets are used to generate candidate frequent 2-itemsets, which are checked against the database to obtain the frequent 2-itemsets. Generally, the frequent (k-1)-itemsets are used to generate candidate frequent k-itemsets, which are checked against the database to obtain the frequent k-itemsets. This process iterates until no more frequent k-itemsets can be generated for some k. This is the essence of the Apriori algorithm.4 It is described as follows:

Algorithm 1. Apriori
Input: D: a database; ms: minimum support;
Output: F: a set of frequent itemsets of interest;

(1) let F ← {};
(2) let L1 ← {frequent 1-itemsets}; F ← F ∪ L1;
(3) for (k = 2; Lk-1 ≠ {}; k++) do begin
    // Generate all possible frequent k-itemsets of interest in D.
    (3.1) let Temk ← {{x1, ..., xk-2, xk-1, xk} | {x1, ..., xk-2, xk-1} ∈ Lk-1 ∧ {x1, ..., xk-2, xk} ∈ Lk-1};
    (3.2) for each transaction t in D do begin
        // Check which k-itemsets are included in transaction t.
        let Temt ← the k-itemsets in t that are also contained in Temk;
        for each itemset A in Temt do
            let A.count ← A.count + 1;
        end for
    end
    (3.3) let Lk ← {c | c ∈ Temk ∧ (p(c) = (c.count/|D|) ≥ ms)};
    (3.4) let F ← F ∪ Lk;
end (3)
(4) output F;
(5) return.

The Apriori algorithm generates all frequent itemsets in a given database D. The initialization is done in Step (1). Step (2) generates L1 of all frequent 1-itemsets in D in the first pass of D.

Step (3) generates Lk for k ≥ 2 by a loop, where Lk is the set of all frequent k-itemsets of interest in the kth pass of D, and the end condition of the loop is Lk-1 = {}. For each pass of the database in Step (3), say pass k, there are four substeps as follows: Step (3.1) generates Temk of all k-itemsets in D, where each k-itemset in Temk is generated by two frequent itemsets in Lk-1. Each itemset in Temk is counted in D by a loop in Step (3.2). Then Lk is generated in Step (3.3), which is the set of all potentially useful frequent k-itemsets in Temk, where all frequent k-itemsets in Lk meet ms. Finally, Lk is added to F in (3.4).

Step (4) outputs the frequent itemsets of potential interest in F. The procedure ends in Step (5).

An Example

Let I = {A, B, C, D, E} and the transaction universe be TID = {100, 200, 300, 400}^a.

TABLE 1 | A Transaction Database

TID | Item A | Item B | Item C | Item D | Item E
100 | A |   | C | D |
200 |   | B | C |   | E
300 | A | B | C |   | E
400 |   | B |   |   | E

In Table 1, 100, 200, 300, and 400 are the unique identifiers of the four transactions: A = sugar, B = bread, C = coffee, D = milk, and E = cake. Each row in the table can be taken as a transaction. We can identify frequent itemsets (the first step of the supp-conf framework) from these transactions using the Apriori algorithm (see 'How Apriori Works'), and association rules from the frequent itemsets (the second step of the supp-conf framework) using the supp-conf framework in 'Generating Association Rules' (see below). Let

(1) ms = 50% (to be frequent, an itemset must occur in at least two transactions); and
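The levelwise search of Algorithm 1 can be sketched in Python on the Table 1 transactions. This is a minimal illustration, not the paper's code; the function name `apriori` and the data layout are ours.

```python
from itertools import combinations

# Transactions from Table 1 (TIDs 100-400).
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def apriori(transactions, ms):
    """Return all frequent itemsets whose support >= ms (a fraction)."""
    n = len(transactions)
    support = lambda X: sum(X <= t for t in transactions) / n
    # L1: frequent 1-itemsets, found in the first pass over the database.
    items = {i for t in transactions for i in t}
    Lk = [frozenset({i}) for i in sorted(items) if support({i}) >= ms]
    frequent = {X: support(X) for X in Lk}
    k = 2
    while Lk:
        # Tem_k: candidates joined from two frequent (k-1)-itemsets.
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Downward closure: every (k-1)-subset must itself be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        Lk = [c for c in cands if support(c) >= ms]
        frequent.update({c: support(c) for c in Lk})
        k += 1
    return frequent

F = apriori(D, ms=0.5)
```

With ms = 50% this reproduces the frequent 1- and 2-itemsets of Tables 2 and 3 below, plus the single frequent 3-itemset BCE.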
TABLE 2 | Frequent 1-Itemsets and Their Frequencies in Table 1

Itemset | Frequency | ≥ ms
{A} | 2 | Y
{B} | 3 | Y
{C} | 3 | Y
{E} | 3 | Y

TABLE 3 | Frequent 2-Itemsets and Their Frequencies in Table 1

Itemset | Frequency | ≥ ms
{A,C} | 2 | Y
{B,C} | 2 | Y
{B,E} | 3 | Y
{C,E} | 2 | Y
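The second step, generating rules from a frequent itemset, can be sketched for the frequent 3-itemset BCE. The helper names are ours, and mc = 60% is an assumption here, chosen to be consistent with the '≥ mc' column of Tables 5 and 6 below.

```python
from itertools import combinations

# Transactions from Table 1.
D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def supp(X):
    """Support of itemset X as a fraction of all transactions."""
    return sum(set(X) <= t for t in D) / len(D)

def rules_from(Z, mc):
    """Split frequent itemset Z into X -> Z-X and keep rules with conf >= mc."""
    Z = frozenset(Z)
    out = []
    for r in range(1, len(Z)):
        for X in map(frozenset, combinations(sorted(Z), r)):
            conf = supp(Z) / supp(X)  # conf(X -> Y) = p(XY)/p(X)
            if conf >= mc:
                out.append((X, Z - X, conf))
    return out

rules = rules_from({"B", "C", "E"}, mc=0.6)
```

For BCE this yields the six rules of Tables 5 and 6, with confidences 100% for BC → E and CE → B and 66.7% for the other four.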
TABLE 5 | Association Rules with 1-Item Consequences from 3-Itemsets

Rule Number | Rule | Confidence (%) | Support (%) | ≥ mc
Rule 1 | BC → E | 100 | 50 | Y
Rule 2 | BE → C | 66.7 | 50 | Y
Rule 3 | CE → B | 100 | 50 | Y

TABLE 6 | Association Rules with 2-Item Consequences from 3-Itemsets

Rule Number | Rule | Confidence (%) | Support (%) | ≥ mc
Rule 4 | B → CE | 66.7 | 50 | Y
Rule 5 | C → BE | 66.7 | 50 | Y
Rule 6 | E → BC | 66.7 | 50 | Y

large. For example, suppose there are 1000 items in a given large database and the average number of items in each transaction is six. Then there are almost 10^15 possible itemsets to be counted in the database. These have led to some research directions as follows:

Frequent pattern (itemset) mining, such as the FP-growth method5 for frequent pattern (itemset) mining, frequent closed patterns,6 and sampling techniques7 for algorithm scale-up, has attracted much attention in data mining. Well-known mining methods include, for example, data structures for association rule mining,8 hashing techniques,9 partitioning,10,11 sampling,12 anytime mining,13 parallel and distributed mining,9,14-16 and integrating mining with relational database systems.17

Interestingness measures the strength of the relationship between itemsets X and Y. The prevailing measures are as follows:
(7) Certainty factor21

CF(X → Y) = CF(X, Y) = (p(Y|X) − p(Y)) / (1 − p(Y)) = (p(XY) − p(X)p(Y)) / (p(X)(1 − p(Y)))

Note that, if p(Y) > p(Y|X), CF(X, Y) is defined as

CF(X, Y) = (p(Y|X) − p(Y)) / (−p(Y)) = (p(XY) − p(X)p(Y)) / (−p(X)p(Y))

(8) Laplace measure

laplace(X → Y) = laplace(X, Y) = (p(XY) + 1) / (p(X) + 2)

Obviously, it is very similar to the confidence.

(9) J measure22

J(X → Y) = J(X, Y) = p(X) [ p(Y|X) log(p(Y|X) / p(Y)) + (1 − p(Y|X)) log((1 − p(Y|X)) / (1 − p(Y))) ]

Discovery of complex association rules, for example, quantitative association rules,23 causal rules,24 and multilevel and multidimensional association rules.25-27

Quantitative association rule mining is designed for analyzing quantitative data that are over categorical attributes. An item over a categorical attribute can be expressed as either an interval (a continuous set of attribute values) or a single value, called a quantitative item. For example, 'Salary ∈ [50k, 70k]' is a quantitative item. If X is a quantitative item, X can be valued in a certain interval F. The supp of X is the sum of the supps of all values in F. A quantitative association rule is a relationship between X and Y of the form X → Y, where X and Y are quantitative items. Quantitative association rule mining is based on the supp-conf framework, and so are causal rule mining and multilevel and multidimensional association rule mining.

Automation of Mining Association Rules

Apriori-based mining algorithms are based on the assumption that users can specify the minimum support for their databases. That is, a frequent itemset (or an association rule) is interesting if its supp is larger than or equal to the minimum support. This creates a challenging issue: performances of these algorithms heavily depend on some user-specified thresholds. For example, if the minimum-support value is too big, nothing might be found in a database, whereas a small minimum support might lead to poor mining performance and many generations of uninteresting association rules. Therefore, users are unreasonably required to know details of the database to be mined to specify a suitable threshold. However, Han et al.6 have pointed out that setting the minimum support is quite subtle, which can hinder the widespread applications of these algorithms; our own experience of mining transaction databases also tells us that the setting is by no means an easy task. In particular, even though a minimum support is explored under the supervision of an experienced miner, we cannot examine whether or not the results (mined with the hunted minimum support) are just what users want. This means that the minimum-support setting is a key issue in automatic association rule mining.

Current techniques for addressing the minimum-support issue are as follows: In proposals for marketing, Piatetsky-Shapiro and Steingold28 proposed to identify only the top 10% or 20% of the prospects with the highest score. Han et al.6,25 designed a strategy to mine top-k frequent patterns for effectiveness and efficiency. In proposals for interesting itemset discovery, Cohen et al.29 developed a family of effective algorithms for finding interesting associations. In proposals for dealing with temporal data, Roddick and Rice30 discussed independent thresholds and context-dependent thresholds for measuring time-varying interestingness of events. In proposals for exploring new strategies, Hipp and Guntzer31 presented a new mining approach that postpones constraints from mining to evaluation. In proposals for identifying new patterns, Wang et al.32,33 designed a conf-driven mining strategy without minimum support. However, these approaches only attempt to avoid specifying the minimum support.

Different from traditional association rule mining methods, database-independent mining techniques have been developed,34,35 in which users can specify a threshold of supp for a mining task without being required to know any of the database. This has provided a way of developing automatic association rule mining systems.

Association Analysis for Different Data Sources

For example, data sources may be multiple, heterogeneous, incomplete, and dynamic. Well-known mining methods include local pattern analysis,36,37 selecting relevant databases toward multidatabase mining,38 peculiarity discovery,39 local mining for finding sequential patterns,40 bridging the local and global analysis for noise cleansing,41 classification from multiple sources,42 distributed data mining,43 and so on.
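For illustration, the certainty factor, Laplace measure, and J measure above can be written directly as functions of the supports. This is a sketch with our own function names; the formulas follow the definitions literally, and edge cases such as p(Y) = 1 are not handled.

```python
import math

def certainty_factor(pXY, pX, pY):
    """CF(X -> Y) for the positive-dependence case:
    (p(XY) - p(X)p(Y)) / (p(X)(1 - p(Y)))."""
    return (pXY - pX * pY) / (pX * (1 - pY))

def laplace(pXY, pX):
    """Laplace measure: (p(XY) + 1) / (p(X) + 2)."""
    return (pXY + 1) / (pX + 2)

def j_measure(pXY, pX, pY):
    """J(X -> Y) = p(X)[ p(Y|X) log(p(Y|X)/p(Y))
                        + (1 - p(Y|X)) log((1-p(Y|X))/(1-p(Y))) ]."""
    pY_given_X = pXY / pX
    t1 = pY_given_X * math.log(pY_given_X / pY) if pY_given_X > 0 else 0.0
    t2 = (1 - pY_given_X) * math.log((1 - pY_given_X) / (1 - pY)) if pY_given_X < 1 else 0.0
    return pX * (t1 + t2)
```

Under independence (p(XY) = p(X)p(Y)) both the certainty factor and the J measure are 0, which is why they serve as interestingness measures rather than pure frequency measures.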
p(t¬c)/p(t) = 0.35/0.4 = 0.875 > mc. Therefore t → ¬c is a valid rule from the database.

Mining negative association rules is a difficult task due to the fact that there are essential differences between positive and negative association rule mining. We illustrate this using an example as follows. Consider a transaction database TD = {(A, B, D); (B, C, D); (B, D); (B, C, D, E); (A, B, D, F)}, which has five transactions, separated by semicolons^d. Each transaction contains several items, separated by commas. There are at least 818 possible negative association rules generated from the 49 infrequent itemsets in TD when minsupp = 0.4. This means there are essential differences between negative association rule mining and positive association rule mining.

Because negative association rules are hidden in infrequent itemsets (with lower frequency), traditional pruning techniques are inefficient for identifying infrequent itemsets of interest.21 This means we must exploit alternative strategies to (1) confront an exponential search space consisting of all possible itemsets, frequent and infrequent, in a database; (2) detect which of the infrequent itemsets can generate negative association rules; (3) perceive which of the negative association rules are really useful to applications; and (4) measure the interestingness of both positive and negative association rules. These problems are very different from those faced by discovering positive association rules. And it is rather difficult to identify negative association rules of interest in databases.

In this subsection, we have not introduced algorithms for identifying infrequent itemsets and negative association rules of interest. They will be presented in the next subsection.

A Framework for Complete Association Analysis

As we have mentioned above, there can be an exponential number of infrequent itemsets in a database, and only some of them are useful for mining association rules of interest. Therefore, pruning is critical to efficiently discover complete associations of interest. Accordingly, in this subsection, we first design a pruning strategy and the mining framework, then a procedure for identifying frequent and infrequent itemsets of interest, and finally the algorithm for generating complete associations of interest.

A Pruning Strategy^e and a Complete Association Mining Framework

According to the interest factor,18,19 we use an interestingness function R(X, Y) = |p(XY) − p(X)p(Y)|20 and a threshold minimum interestingness (mi). It means that a rule X → Y is of potential interest when R(X, Y) ≥ mi, and XY is referred to as a potentially interesting itemset. Including this R(X, Y) mechanism and the CF model,21 we can formally define the function that Z is a frequent itemset of potential interest as follows:

fipi(Z) = (p(Z) ≥ ms) ∧ (∃X, Y ⊂ Z ∧ Z = XY) ∧ fipis(X, Y)

fipis(X, Y) = (X ∩ Y = ∅) ∧ (g(X, Y, mc, mi) = 1)

where g(X, Y, mc, mi) = s(X, Y) ∨ s(Y, X), and

s(X, Y) = 1, if R(X, Y) ≥ mi ∧ CF(X, Y) ≥ mc; 0, otherwise

On the contrary, to mine negative association rules, all itemsets for possible negative association rules in a given database need to be considered. For example, if X → ¬Y can be discovered as a valid rule, then p(X¬Y) ≥ ms must hold. If ms is high, p(X¬Y) ≥ ms would mean that p(XY) < ms, and itemset XY cannot be generated as a frequent itemset in existing association analysis algorithms. In other words, XY is an infrequent itemset. However, there are too many infrequent itemsets in databases, and we must define some conditions for identifying infrequent itemsets of interest.

If X is a frequent itemset and Y is an infrequent itemset with frequency 1 in a large database, then X → ¬Y certainly looks like a valid negative rule because p(X) ≥ ms, p(Y) ≈ 0, p(X¬Y) ≈ p(X) ≥ ms, and conf(X → ¬Y) = p(X¬Y)/p(X) ≈ 1 ≥ mc. This could indicate that the rule X → ¬Y is valid, and the number of this type of itemsets in a given database can be very large. For example, rarely purchased products in a supermarket are always infrequent itemsets.

However, in practice, more attention is paid to frequent itemsets, and any patterns mined in databases would mostly involve frequent itemsets only. This means that if X → ¬Y (or ¬X → Y, or ¬X → ¬Y) is a negative rule of interest, X and Y would be frequent itemsets. In other words, no matter whether association rules are positive or negative, we are only interested in relationships among frequent itemsets. To operationalize this insight, we can use the support measure p. If p(X) ≥ ms and p(Y) ≥ ms, the rule X → ¬Y is of potential interest, and XY is referred to as a potentially interesting itemset.

Including the above insight, the R(X, Y) mechanism, and the CF model, Z is an infrequent itemset of
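The pruning conditions s and g above can be sketched numerically. The function names are ours; R and the positive-case CF are as defined earlier, and the thresholds mc and mi are passed in.

```python
def r_measure(pXY, pX, pY):
    """R(X, Y) = |p(XY) - p(X)p(Y)|, the interestingness function."""
    return abs(pXY - pX * pY)

def cf(pXY, pX, pY):
    """Certainty factor CF(X, Y), positive-dependence branch."""
    return (pXY - pX * pY) / (pX * (1 - pY))

def s(pXY, pX, pY, mc, mi):
    """s(X, Y) = 1 iff R(X, Y) >= mi and CF(X, Y) >= mc, else 0."""
    return int(r_measure(pXY, pX, pY) >= mi and cf(pXY, pX, pY) >= mc)

def g(pXY, pX, pY, mc, mi):
    """g(X, Y, mc, mi) = s(X, Y) or s(Y, X).
    fipis(X, Y) additionally requires X and Y to be disjoint."""
    return max(s(pXY, pX, pY, mc, mi), s(pXY, pY, pX, mc, mi))
```

With the worked example's thresholds mc = 0.6 and mi = 0.05 and the supports p(BF) = 0.3, p(B) = 0.6, p(F) = 0.5, g evaluates to 0, which is why BF is pruned from PL later in the example.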
(7) output PL and NL;
(8) return.

The procedure All Itemsets Of Interest generates all frequent and infrequent itemsets of interest in a given database D, where PL is the set of all frequent itemsets of interest in D, and NL is the set of all infrequent itemsets of interest in D. PL and NL contain only frequent and infrequent itemsets of interest, respectively.

The initialization is done in Step (1). Step (2) generates L1 of all frequent 1-itemsets in database D in the first pass of D.

Step (3) generates Lk for k ≥ 2 by a loop, where Lk is the set of all frequent k-itemsets of interest in the kth pass of D, and the end condition of the loop is Lk-1 = {}. For each pass of the database in Step (3), say, pass k, there are four substeps as follows: Step (3.1) generates Temk of all k-itemsets in D, where each k-itemset in Temk is generated by two frequent itemsets in Lk-1. Each itemset in Temk is counted in D by a loop in Step (3.2). Then Lk is generated in Step (3.3); Lk is the set of all potentially useful frequent k-itemsets in Temk, where all frequent k-itemsets in Lk meet ms. Finally, Lk is added to PL in Step (3.4).

Step (4) generates NL, that is, the set of all infrequent itemsets, whose supports do not meet ms. NL is the set of all potentially useful infrequent itemsets in D.

Steps (5) and (6) select all frequent and infrequent itemsets of interest. In Step (5), if an itemset Z in PL does not satisfy fipi(Z), then Z is an uninteresting frequent itemset and is removed from PL, until all uninteresting frequent itemsets are removed from PL. In Step (6), if an itemset (X, Y) in NL does not satisfy iipi(XY), then (X, Y) is an uninteresting infrequent itemset and is removed from NL, until all uninteresting infrequent itemsets are removed from NL.

Step (7) outputs the frequent and infrequent itemsets of potential interest in PL and NL. The procedure ends in Step (8).

Identifying Complete Associations of Interest

Let I be the set of items in a database TD, XY ⊂ I be an itemset, X ∩ Y = ∅, p(X) ≠ 0, p(Y) ≠ 0, and ms, mc, and mi > 0 be given by the user. There are four possible rules between X and Y as follows:

(i) If p(XY) ≥ ms, R(X, Y) ≥ mi, and CF(X, Y) ≥ mc, then X → Y is a positive rule of interest.
(ii) If p(X¬Y) ≥ ms, p(X) ≥ ms, p(Y) ≥ ms, R(X, ¬Y) ≥ mi, and CF(X, ¬Y) ≥ mc, then X → ¬Y is a negative rule of interest.
(iii) If p(¬XY) ≥ ms, p(X) ≥ ms, p(Y) ≥ ms, R(¬X, Y) ≥ mi, and CF(¬X, Y) ≥ mc, then ¬X → Y is a negative rule of interest.
(iv) If p(¬X¬Y) ≥ ms, p(X) ≥ ms, p(Y) ≥ ms, R(¬X, ¬Y) ≥ mi, and CF(¬X, ¬Y) ≥ mc, then ¬X → ¬Y is a negative rule of interest.

In the above, Case (i) defines positive association rules of interest, whereas the others (Cases ii, iii, and iv) are negative association rules of interest, where p(∗) ≥ ms guarantees that an association rule describes the relationship between two frequent itemsets; the mi requirement makes sure that the association rule is of interest; and CF(∗, ∗) ≥ mc specifies the conf constraint.

Let D be a database, and ms, mc, and mi be given by the user. Our algorithm for extracting both positive and negative association rules with the CF model for conf checking is designed as follows:

Algorithm 3. Complete Association
Input: D: a database; ms, mc, mi: threshold values;
Output: association rules;

Step (1) call procedure All Itemsets Of Interest;
Step (2) // Generate positive association rules in PL.
    for each frequent itemset Z in PL do
        for each expression XY = Z and X ∩ Y = ∅ do begin
            if fipis(X, Y) then
                if CF(X, Y) ≥ mc then
                    output the rule X → Y with confidence CF(X, Y) and support p(Z);
                if CF(Y, X) ≥ mc then
                    output the rule Y → X with confidence CF(Y, X) and support p(Z);
        end for;
Step (3) // Generate all negative association rules in NL.
    for each itemset (X, Y) in NL do
        if iipis(X, Y) then begin
            if CF(¬X, Y) ≥ mc then
                output the rule ¬X → Y
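The four cases (i)-(iv) can be checked mechanically for a pair (X, Y), given only the supports p(X), p(Y), and p(XY); the supports of the negated combinations follow by inclusion-exclusion. This is a hedged sketch in our own names, not Algorithm 3 itself.

```python
def rules_of_interest(pXY, pX, pY, ms, mc, mi):
    """Report which of the four rule forms (i)-(iv) pass the support,
    interestingness (R), and certainty-factor (CF) thresholds."""
    if pX < ms or pY < ms:  # X and Y must both be frequent
        return []
    # (form name, p(antecedent), p(consequent), p(antecedent & consequent))
    forms = [
        ("X->Y",   pX,     pY,     pXY),
        ("X->~Y",  pX,     1 - pY, pX - pXY),
        ("~X->Y",  1 - pX, pY,     pY - pXY),
        ("~X->~Y", 1 - pX, 1 - pY, 1 - pX - pY + pXY),
    ]
    kept = []
    for name, a, b, ab in forms:
        R = abs(ab - a * b)  # interestingness of the (possibly negated) pair
        if ab >= ms and R >= mi:
            cf = (ab - a * b) / (a * (1 - b))  # certainty factor
            if cf >= mc:
                kept.append(name)
    return kept
```

For a strongly positively correlated pair, both X → Y and ¬X → ¬Y can pass, which matches the intuition that positive and negative rules are two sides of the same dependence.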
Itemset | Frequency | Support
A | 5 | 0.5
B | 6 | 0.6
C | 5 | 0.5
D | 6 | 0.6
E | 3 | 0.3
F | 5 | 0.5
AB | 3 | 0.3
AD | 3 | 0.3
BC | 3 | 0.3
BD | 5 | 0.5
BF | 3 | 0.3
CD | 3 | 0.3
CF | 3 | 0.3
ABD | 3 | 0.3
AB); (C, AD); (C, BD); (C, BF); (C, ABD)} is a set of candidate infrequent itemsets. Because of CD and CF in PL, {(C, E); (C, AB); (C, AD); (C, BD); (C, BF); (C, ABD)} is the set of potentially useful infrequent itemsets.

(4) For itemset D, because the intersection of any one of {E, F, AB, BC, BF, CF} in PL and D is empty, {(D, E); (D, F); (D, AB); (D, BC); (D, BF); (D, CF)} is a set of candidate infrequent itemsets. Because of ABD in PL, {(D, E); (D, F); (D, BC); (D, BF); (D, CF)} is the set of potentially useful infrequent itemsets.

(5) For itemset E, because the intersection of any one of {F, AB, AD, BC, BD, BF, CD, CF, ABD} in PL and E is empty, {(E, F); (E, AB); (E, AD); (E, BC); (E, BD); (E, BF); (E, CD); (E, CF); (E, ABD)} is a set of candidate infrequent itemsets. Because none of them are in PL, all of them are potentially useful infrequent itemsets.

(6) For itemset F, because the intersection of any one of {AB, AD, BC, BD, CD, ABD} in PL and F is empty, {(F, AB); (F, AD); (F, BC); (F, BD); (F, CD); (F, ABD)} is a set of candidate infrequent itemsets. Because none of them are in PL, all of them are potentially useful infrequent itemsets.

(7) For itemset AB, because the intersection of any one of {CD, CF} in PL and AB is empty, {(AB, CD); (AB, CF)} is a set of candidate infrequent itemsets. Because none of them are in PL, all of them are potentially useful infrequent itemsets.

(8) For itemset AD, because the intersection of any one of {BC, BF, CF} in PL and AD is empty, {(AD, BC); (AD, BF); (AD, CF)} is a set of candidate infrequent itemsets. Because none of them are in PL, all of them are potentially useful infrequent itemsets.

(9) For itemset BC, because no itemset in PL is applicable to BC to generate candidate infrequent itemsets, there is no potentially useful infrequent itemset in this case.

(10) For itemset BD, because only CF in PL is applicable to BD to generate candidate infrequent itemsets and the union is not in PL, (BD, CF) is a potentially useful infrequent itemset.

(11) For itemset BF, because only CD in PL is applicable to BF to generate candidate infrequent itemsets and the union is not in PL, (BF, CD) is a potentially useful infrequent itemset.

(12) For itemset CD, because no itemset in PL is applicable to CD to generate candidate infrequent itemsets, there is no potentially useful infrequent itemset in this case.

(13) For itemset CF, because only ABD in PL is applicable to CF to generate candidate infrequent itemsets and the union is not in PL, (CF, ABD) is a potentially useful infrequent itemset.

(14) For itemset ABD, no itemset in PL is applicable to ABD to generate candidate infrequent itemsets.

Therefore, we have

NL = {(A, C, 2); (A, E, 2); (A, F, 2); (A, BC, 2); (A, BF, 1); (A, CD, 2); (A, CF, 1); (B, E, 0); (B, CD, 2); (C, E, 1); (C, AB, 2); (C, AD, 2); (C, BD, 2); (C, BF, 2); (C, ABD, 2); (D, E, 1); (D, F, 2); (D, BC, 2); (D, BF, 2); (D, CF, 1);
(E, F, 1); (E, AB, 0); (E, AD, 0); (E, BC, 0); (E, BD, 0); (E, BF, 0); (E, CD, 1); (E, CF, 0); (E, ABD, 0); (F, AB, 1); (F, AD, 1); (F, BC, 2); (F, BD, 2); (F, CD, 1); (F, ABD, 1); (AB, CD, 2); (AB, CF, 1); (AD, BC, 2); (AD, BF, 1); (AD, CF, 1); (BD, CF, 1); (BF, CD, 1); (CF, ABD, 1)}

There are 43 pairs of infrequent itemsets, written in the form (X, Y, x), which is only used to simplify the description: X and Y are itemsets, and x is the frequency of the itemset XY.

Step (5) is a loop of pruning uninteresting itemsets in PL. We illustrate this step with the frequent 2-itemset BF and the 3-itemset ABD as follows:

(i) Considering BF, for B → F we have p(BF) = 0.3 = ms, and

R(B, F) = |p(BF) − p(B)p(F)| = |0.3 − 0.6 × 0.5| = 0 < mi

CF(B, F) = (p(BF) − p(B)p(F)) / (p(B)(1 − p(F))) = (0.3 − 0.6 × 0.5) / (0.6(1 − 0.5)) = 0 < mc

This means s(B, F) = 0 and the function fipi is false. Similarly, for F → B we have s(F, B) = 0. Therefore, g(F, B, 0.6, 0.05) = 0, the function fipi is false, and BF is uninteresting and is removed from PL.

(ii) Considering ABD, for AB → D we have p(ABD) = 0.3 = ms, and

R(AB, D) = |p(ABD) − p(AB)p(D)| = |0.3 − 0.3 × 0.6| = 0.12 > mi,

satisfying all the conditions for a frequent itemset of potential interest. And ABD is not removed from PL.

Step (6) is also a loop of pruning uninteresting itemsets in NL. Like the above, we illustrate this step with the pairs (A, C) and (E, BC) as follows:

(a) Considering (A, C), for A → ¬C we have p(AC) = 0.2 < ms, p(A) = 0.5 > ms, p(C) = 0.5 > ms, p(A¬C) = 0.3 > ms, and

R(A, ¬C) = |p(A¬C) − p(A)p(¬C)| = |0.3 − 0.5 × 0.5| = 0.05 = mi

CF(A, ¬C) = (p(A¬C) − p(A)p(¬C)) / (p(A)(1 − p(¬C))) = (0.3 − 0.5 × 0.5) / (0.5(1 − 0.5)) = 0.2 < mc

This means t(A, ¬C) = 0.

For ¬C → A, we have p(AC) = 0.2 < ms, p(A) = 0.5 > ms, p(C) = 0.5 > ms, p(¬CA) = 0.3 > ms, and

R(¬C, A) = |p(¬CA) − p(¬C)p(A)| = |0.3 − 0.5 × 0.5| = 0.05 = mi

CF(¬C, A) = (p(¬CA) − p(¬C)p(A)) / (p(¬C)(1 − p(A))) = (0.3 − 0.5 × 0.5) / (0.5(1 − 0.5)) = 0.2 < mc

This means t(¬C, A) = 0.

For ¬A → C, we have p(AC) = 0.2 < ms, p(A) = 0.5 > ms, p(C) = 0.5 > ms, p(¬AC) = 0.3 > ms, and

R(¬A, C) = |p(¬AC) − p(¬A)p(C)| = |0.3 − 0.5 × 0.5| = 0.05 = mi

CF(¬A, C) = (p(¬AC) − p(¬A)p(C)) / (p(¬A)(1 − p(C)))
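The arithmetic in this walkthrough can be checked directly. This is a sketch with our own variable names; the supports are read off the support table above (a 10-transaction example database).

```python
# Supports from the running example's support table.
p = {"A": 0.5, "B": 0.6, "C": 0.5, "F": 0.5, "BF": 0.3, "AC": 0.2}

# Step (5): R(B, F) = |p(BF) - p(B)p(F)| = |0.3 - 0.6 * 0.5| = 0 < mi,
# so BF is pruned from PL.
R_BF = abs(p["BF"] - p["B"] * p["F"])

# Step (6): for A -> ~C, with p(A~C) = p(A) - p(AC) = 0.3.
pA_notC = p["A"] - p["AC"]
p_notC = 1 - p["C"]
R_A_notC = abs(pA_notC - p["A"] * p_notC)                          # 0.05, equal to mi
CF_A_notC = (pA_notC - p["A"] * p_notC) / (p["A"] * (1 - p_notC))  # 0.2, below mc
```

Both values agree with the hand computation: (A, C) passes the interestingness threshold but fails the confidence (CF) threshold, so no rule between A and ¬C survives.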
algorithm.38 The idea is to first identify the association between a frequent pattern and a class label, and then the discovered association rules are used for predicting unlabelled data. From published reports on associative classification, it can be more accurate than typical classification methods, such as C4.5. Other main reports include emerging patterns-based classifiers in Dong and Li60 and Li et al.,61 classification based on multiple association rules in Li et al.,62 classification based on predictive association rules in Yin and Han,63 and the classifier Refined Classification Based on Top-k rule groups (RCBT) in Cong et al.64

Another important application is clustering, mainly of high-dimensional data. A well-established application, CLustering In QUEst (CLIQUE), is given in Agrawal et al.,65 which is an Apriori-based dimension-growth subspace clustering algorithm. It integrates density-based and grid-based clustering methods. The Apriori property is used to find clusterable subspaces, and dense units are identified. The algorithm then finds adjacent dense grid units in the selected subspaces using a depth-first search. Clusters are formed by combining these units using a greedy growth scheme. An entropy-based subspace clustering algorithm for mining numerical data, called ENCLUS, was proposed by Cheng et al.66 Beil et al.67 proposed a method for frequent term-based text clustering. Wang et al.68 proposed pCluster, a pattern similarity-based clustering method for microarray data analysis, and demonstrated its effectiveness and efficiency for finding subspace clusters in a high-dimensional space.

algorithm,72 the LiveSet-Driven method,73 and the ConSGapMiner technique.74

Also, association rule mining techniques have been successfully applied to discover patterns and knowledge from the Web.75,76 This includes Web usage mining, Web structure mining, and Web content mining. An early application of association rule mining to Web data is the analysis of users' browsing behaviors, called Web usage mining. It includes user grouping, page association, and sequential click-through analysis. Web content mining identifies potentially useful information within Web pages, whereas Web structure mining discovers useful structural linkage among Web pages. Other applications include, for example, discovering XML query patterns for caching,56 hyperlink assessment,58 and filtering Web recommendation lists.59

Applications to Other Subjects
For trustworthy software development, association rule mining techniques have been applied to software bug mining60 and software change history.77–79 For example, Liu et al.80 developed a method to classify the structured traces of program executions using software behavior graphs. It utilizes a frequent graph mining technique. Suspicious buggy regions are identified through the capture of the classification accuracy change, which is measured incrementally during program execution.

Other applications include identifying complex spatial relationships,54 detecting adverse drug reactions,55 and exploring the relationship between urban land surface temperature and biophysical/social parameters.57
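The associative-classification idea described earlier in this section (mine rules from frequent patterns to class labels, then predict unlabelled data) can be sketched as follows. This is a minimal brute-force illustration, not the CBA, CPAR, or RCBT algorithms cited in the text; the dataset and thresholds are invented for the example:

```python
# Minimal associative-classification sketch: enumerate small antecedents,
# keep itemset -> class rules meeting support/confidence thresholds,
# and predict with the highest-ranked matching rule.
from itertools import combinations

def mine_class_rules(records, min_sup=0.3, min_conf=0.7, max_len=2):
    """records: list of (itemset, class_label).
    Returns rules as (antecedent, class_label, support, confidence)."""
    n = len(records)
    items = sorted({i for itemset, _ in records for i in itemset})
    rules = []
    for k in range(1, max_len + 1):
        for ante in combinations(items, k):
            covered = [c for itemset, c in records if set(ante) <= itemset]
            if not covered:
                continue
            for label in set(covered):
                sup = covered.count(label) / n
                conf = covered.count(label) / len(covered)
                if sup >= min_sup and conf >= min_conf:
                    rules.append((frozenset(ante), label, sup, conf))
    # rank by confidence, then support (CBA-style ordering)
    rules.sort(key=lambda r: (-r[3], -r[2]))
    return rules

def predict(rules, itemset, default=None):
    # fire the first (best-ranked) rule whose antecedent matches
    for ante, label, _, _ in rules:
        if ante <= set(itemset):
            return label
    return default

data = [({"bread", "milk"}, "buyer"),
        ({"bread", "butter"}, "buyer"),
        ({"beer"}, "non-buyer"),
        ({"bread", "milk", "butter"}, "buyer"),
        ({"beer", "chips"}, "non-buyer")]
rules = mine_class_rules(data)
print(predict(rules, {"bread", "milk"}))  # buyer
```

Ranking rules by confidence and then support mirrors the ordering heuristic of classic associative classifiers; the published systems add pruning and multiple-rule voting that this sketch omits.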
b From the definition of Temk, there is only one candidate in the second iteration.
c Some data are adapted from Refs 18, 21.
d It is adapted from Ref 21.
e The techniques are similar to that in Ref 21.
f For convenience of identifying negative rules, an infrequent itemset XY in NL is written as (X, Y).
g The data are slightly different from that in Wu et al.21
ACKNOWLEDGEMENTS
This work was supported in part by the Australian Research Council under grant DP0985456,
the Natural Science Foundation (NSF) of China under grant 90718020, the China 973 Program under grant 2008CB317108, the Research Program of China Ministry of Personnel for
Overseas-Return High-level Talents, the MOE Project of Key Research Institute of Humanities
and Social Sciences at Universities (07JJD720044), and the Guangxi NSF (Key) grants.
REFERENCES
1. Frawley WJ, Piatetsky-Shapiro G, Matheus CJ. Knowledge discovery in databases: an overview. AI Magazine 1992, 13:57–70.
2. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery: an overview. Adv Knowledge Discov Data Min 1996, 1–34.
3. Agrawal R, Imielinski T, Swami A. Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data. 1993, 207–216.
4. Agrawal R, Srikant R. Fast algorithms for mining association rules in large databases. Proceedings of the 20th International Conference on Very Large Databases. 1994, 487–499.
5. Han J, et al. Mining frequent patterns without candidate generation. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), Dallas, TX, May 2000, 1–12.
6. Han J, Wang J, Lu Y, Tzvetkov P. Mining top-K frequent closed patterns without minimum support. In: Proceedings of ICDM. 2002, 211–218.
7. Zhang C, Zhang S, Webb G. Identifying approximate itemsets of interest in large databases. Appl Intell 2003, 18:91–104.
8. Yan X, Zhang C, Zhang S. On data structures for association rule discovery. Appl Artif Intell 2007, 21:57–79.
9. Park J, Chen M, Yu P. An effective hash-based algorithm for mining association rules. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1995, 175–186.
10. Savasere A, Omiecinski E, Navathe S. An efficient algorithm for mining association rules in large databases. Proceedings of the 21st International Conference on Very Large Databases. 1995, 432–444.
11. Zhang S, Wu X. Large scale data mining based on data partitioning. Appl Artif Intell 2001, 15:129–139.
12. Toivonen H. Sampling large databases for association rules. Proceedings of the 22nd International Conference on Very Large Databases. 1996, 134–145.
13. Zhang S, Zhang C. Anytime mining for multiuser applications. IEEE Trans Syst Man Cybern A Syst Hum 2002, 32:515–521.
14. Agrawal R, Shafer J. Parallel mining of association rules. IEEE Trans Knowledge Data Eng 1996, 8:962–969.
15. Cheung D, Han J, Ng V, Wong C. Maintenance of discovered association rules in large databases: an incremental updating technique. Proceedings of the 12th IEEE International Conference on Data Engineering. 1996, 106–114.
16. Zaki M, et al. New algorithms for fast discovery of association rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97), 1997, 283–286.
17. Sarawagi S, Thomas S, Agrawal R. Integrating mining with relational database systems: alternatives and implications. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998, 343–354.
18. Brin S, Motwani R, Silverstein C. Beyond market baskets: generalizing association rules to correlations. Proceedings of the ACM SIGMOD Conference. 1997, 265–276.
19. Silverstein C, Brin S, Motwani R. Beyond market baskets: generalizing association rules to dependence rules. Data Min Knowl Discov 1998, 2:39–68.
20. Piatetsky-Shapiro G. Discovery, analysis, and presentation of strong rules. Knowledge Discovery in Databases, 1991, 229–248.
21. Wu X, Zhang C, Zhang S. Efficient mining of both positive and negative association rules. ACM Trans Inf Syst 2004, 22:381–405.
22. Wang K, Tay W, Liu B. An interestingness-based interval merger for numeric association rules. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, USA, 1998, 121–127.
23. Srikant R, Agrawal R. Mining generalized association rules. Proceedings of the 21st International Conference on Very Large Databases. 1995, 407–419.
24. Zhang S, Zhang C. Discovering causality in large databases. Appl Artif Intell 2002, 16:333–358.
25. Han J, Pei J, Yin Y, Mao R. Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Min Knowledge Discov 2004, 8:53–87.
26. Han J, Kamber M. Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems, 2006.
27. Kamber M, Han J, Chiang J. Metarule-guided mining of multi-dimensional association rules using data cubes. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1997, 207–210.
28. Piatetsky-Shapiro G, Steingold S. Measuring lift quality in database marketing. SIGKDD Explor 2000, 2:76–80.
29. Cohen E, Datar M, Fujiwara S, Gionis A, Indyk P, Motwani R, Ullman JD, Yang C. Finding interesting associations without support pruning. IEEE Trans Knowledge Data Eng 2001, 13:64–78.
30. Roddick JF, Rice S. What's interesting about cricket? On thresholds and anticipation in discovered rules. SIGKDD Explor 2001, 3:1–5.
31. Hipp J, Guntzer U. Is pushing constraints deeply into the mining algorithms really what we want? SIGKDD Explor 2002, 4:50–55.
32. Wang K, He Y, Cheung D, Chin F. Mining confident rules without support requirement. In: Proceedings of the 10th ACM International Conference on Information and Knowledge Management. 2001, 89–96.
33. Wang K, He Y, Han J. Pushing support constraints into association rules mining. IEEE Trans Knowledge Data Eng 2003, 15:642–658.
34. Yan X, Zhang C, Zhang S. ARMGA: identifying interesting association rules with genetic algorithms. Appl Artif Intell 2005, 19:677–689.
35. Zhang S, Wu X, Zhang C, Lu J. Computing the minimum-support for mining frequent patterns. Knowledge Inf Syst 2008, 15:233–257.
36. Zhang S, Wu X, Zhang C. Multi-database mining. IEEE Computational Intelligence Bulletin, June 2003, 2:5–13.
37. Zhang S, Zaki M. Mining multiple data sources: local pattern analysis. Data Min Knowledge Discov 2006, 12:121–125.
38. Liu H, Lu H, Yao J. Identifying relevant databases for multi-database mining. Proceedings of PAKDD. 1998, 210–221.
39. Zhong N, Yao Y, Ohshima M. Peculiarity oriented multidatabase mining. IEEE Trans Knowl Data Eng 2003, 15:952–960.
40. Kum H, Chang J, Wang W. Sequential pattern mining in multi-databases via multiple alignment. Data Min Knowledge Discov 2006, 12:151–180.
41. Zhu X, Wu X, Chen Q. Bridging local and global data cleansing: identifying class noise in large, distributed data sets. Data Min Knowledge Discov 2006, 12:275–308.
42. Ling C, Yang Q. Discovering classification from data of multiple sources. Data Min Knowledge Discov 2006, 12:181–201.
43. Zaki M. Parallel and distributed association mining: a survey. IEEE Concurrency. 1999.
44. Wu X, Zhang C, Zhang S. Mining both positive and negative association rules. In: Proceedings of the 19th International Conference on Machine Learning, Sydney, Australia, July 2002, 658–665.
45. Goncalves E, Mendes I, Plastino A. Mining exceptions in databases. AI 2004: Advances in Artificial Intelligence, 17th Australian Joint Conference on Artificial Intelligence. 2004, 1076–1081.
46. Pedreshi D, Ruggieri S, Turini F. Discrimination-aware data mining. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008, 560–568.
47. Shimada K, Hirasawa K, Hu J. Class association rule mining with chi-squared test using genetic network programming. IEEE International Conference on Systems, Man and Cybernetics (SMC06), 2006, 5338–5344.
48. Zhao L, Zaki MJ, Ramakrishnan N. BLOSOM: a framework for mining arbitrary Boolean expressions. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2006, 827–832.
49. Wan Q, An A. An efficient approach to mining indirect associations. J Intell Inf Syst 2006, 27:135–158.
50. Antonie M, Zaiane O. Mining positive and negative association rules: an approach for confined rules. Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases. 2004, 27–38.
51. Tan P-N, Kumar V, Kuno H. Using SAS for mining indirect associations in data. In: Proceedings of the Western Users of SAS Software Conference. 2001.
52. Tan P, Kumar V, Srivastava J. Indirect association: mining higher order dependencies in data. In: Principles of Data Mining and Knowledge Discovery. Springer, Lyon, France, 2000, 632–637.
53. Tan P, Kumar V, Srivastava J. Selecting the right interestingness measure for association patterns. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 2002, 32–41.
54. Munro R, Chawla S, Sun P. Complex spatial relationships. Third IEEE International Conference on Data Mining (ICDM'03). 2003, 227.
55. Jin HW, Chen J, He H, Williams GJ, Kelman C, O'Keefe CM. Mining unexpected temporal associations: applications in detecting adverse drug reactions. IEEE Trans Inf Technol Biomed 2008, 12:488–500.
56. Chen L, Bhowmick SS, Chia LT. Mining positive and negative association rules from XML query patterns for caching. DASFAA-05. 2005, 736–747.
57. Rajasekar U, Weng Q. Application of association rule mining for exploring the relationship between urban land surface temperature and biophysical/social parameters. Photogramm Eng Remote Sensing 2009, 75:385–396.
58. Kazienko P, Pilarczyk M. Hyperlink assessment based on web usage mining. Proceedings of the Seventeenth Conference on Hypertext and Hypermedia. 2006, 85–88.
59. Kazienko P. Filtering of web recommendation lists using positive and negative usage patterns. Knowledge-Based Intelligent Information and Engineering Systems. 2007, 1016–1023.
60. Dong G, Li J. Efficient mining of emerging patterns: discovering trends and differences. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1999, 43–52.
61. Li J, Dong G, Ramamohanarao K. Instance-based classification by emerging patterns. Principles of Data Mining and Knowledge Discovery (PKDD-00), 2000, 191–200.
62. Li J, Ramamohanarao K, Dong G. Combining the strength of pattern frequency and distance for classification. Knowledge Discovery and Data Mining (PAKDD-01), 2001, 455–466.
63. Yin X, Han J. CPAR: classification based on predictive association rules. Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, May 1–3, 2003, Student Paper 5.
64. Cong G, Tan K, Tung A, Xu X. Mining top-k covering rule groups for gene expression data. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2005, 670–681.
65. Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1998, 94–105.
66. Cheng CH, Fu AW, Zhang Y. Entropy-based subspace clustering for mining numerical data. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'99), 1999, 84–93.
67. Beil F, Ester M, Xu X. Frequent term-based text clustering. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery in Databases (KDD'02), 2002, 436–442.
68. Wang H, Wang W, Yang J, Yu PS. Clustering by pattern similarity in large data sets. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002, 418–427.
69. Yan X, Zhang C, Zhang S. Identifying software component association with genetic algorithm. International Journal of Software Engineering and Knowledge Engineering, 2004, 14:441–447.
70. Yan X, Zhang C, Zhang S. On data structures for association rule discovery. Appl Artif Intell 2007, 21:57–79.
71. Beyer K, Ramakrishnan R. Bottom-up computation of sparse and iceberg cubes. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1999, 359–370.
72. Imielinski T, Khachiyan L, Abdulghani A. Cubegrades: generalizing association rules. Data Min Knowl Discov 2002, 6:219–258.
73. Dong G, Han J, Lam J, Pei J, Wang K, Zou W. Mining constrained gradients in multi-dimensional databases. IEEE Trans Knowl Data Eng 2004, 16:922–938.
74. Ji X, Bailey J, Dong G. Mining minimal distinguishing subsequence patterns with gap constraints. In: Proceedings of the International Conference on Data Mining (ICDM'05), 2005, 194–201.
75. Kosala R, Blockeel H. Web mining research: a survey. ACM SIGKDD Explorations, 2000, 2:1–15.
76. Srivastava J, Cooley R, Deshpande M, Tan PN. Web usage mining: discovery and applications of usage patterns from web data. ACM SIGKDD Explorations, 2000, 1:12–23.
77. Shirabad J, Lethbridge T, Matwin S. Mining the maintenance history of a legacy software system. ICSM-2003. 2003, 95–104.
78. Ying A, Murphy G, Ng R, Chu-Carroll M. Predicting source code changes by mining change history. IEEE Trans Software Eng 2004, 30:574–586.
79. Zhao Q, Bhowmick S. Mining history of changes to web access patterns. PKDD-2004. 2004, 521–523.
80. Liu C, Yan X, Yu H, Han J, Yu P. Mining behavior graphs for "backtrace" of noncrashing bugs. In: Proceedings of the 2005 SIAM International Conference on Data Mining (SDM'05), Newport Beach, 2005, 286–297.