
Enterprise Warehousing and Information Systems

The Apriori Algorithm

Student: Ionela-Beatrice Firan


The information revolution is generating mountains of data from sources as diverse as business and science. One of the greatest challenges is how to turn these rapidly expanding data into accessible and actionable knowledge.
Data mining is the automated discovery of non-trivial, implicit,
previously unknown, and potentially useful information or
patterns embedded in databases. Briefly stated, it refers to extracting or mining knowledge from large amounts of data.
The motivation for data mining is a suspicion that there might be
nuggets of useful information hiding in the masses of unanalyzed
or underanalyzed data, and therefore methods for locating
interesting information from data would be useful. From the
beginning, data mining research has been driven by its
applications. While industries such as finance have long recognized the benefits of data mining, its techniques can be effectively applied in many other areas and can be performed on a variety of data stores, including relational databases, transaction databases, and data warehouses.
Generally speaking, there are two classes of data mining: descriptive and predictive.

Descriptive mining summarizes or characterizes general properties of the data in data repositories, while predictive mining performs inference on current data to make predictions based on historical data.
One of the fundamental methods from the prospering field of data
mining is the generation of association rules that describe
relationships between items in data sets. The original motivation for searching for association rules came from the need to analyze so-called supermarket transaction data, that is, to explore customer
behavior in terms of purchased products. Association rules
describe how often items are purchased together.
The first algorithm for mining association rules, the Apriori algorithm, was introduced by Agrawal and Srikant in 1994. The motivation behind introducing the
Apriori algorithm was the progress that was made at that time in
bar-code technology, which enabled retail supermarkets to store
large quantities of sales data in their databases.
The collected data was referred to as market basket data, or just
basket data.
Basically, an association rule is an implication X ⇒ Y, where X and Y are disjoint sets of items. The meaning of such rules is quite intuitive: let DB be a transaction database, where each transaction T ∈ DB is a set of items. An association rule X ⇒ Y then expresses that whenever a transaction T contains X, this transaction T also contains Y with probability conf. The probability conf is called the rule confidence and is supplemented by further quality measures such as rule support and interest. The support sup is simply the number of transactions that contain all items in the antecedent and consequent parts of the rule. (The support is sometimes expressed as a percentage of the total number of records in the database.) The confidence conf is the ratio of the number of transactions that contain all items in the consequent as well as the antecedent to the number of transactions that contain all items in the antecedent.
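For example (with hypothetical numbers): in a database of 100 transactions, if 25 transactions contain bread and 20 of those also contain butter, then the rule bread ⇒ butter has support sup = 20 (i.e., 20%) and confidence conf = sup(bread, butter) / sup(bread) = 20 / 25 = 0.8.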
The Apriori algorithm discovers association rules in data. For
example, "if a customer purchases a razor and after shave, then
that customer will purchase shaving cream with 80% confidence."
The association mining problem can be decomposed into two
subproblems:
Find all combinations of items, called frequent itemsets, whose
support is greater than the minimum support.
Use the frequent itemsets to generate the desired rules. The idea is that if, for example, ABC and BC are frequent, then the rule "BC implies A" holds if the ratio of support(ABC) to support(BC) is at least as large as the minimum confidence. Note that the rule will have minimum support because ABC is frequent. ODM (Oracle Data Mining) Association supports only single-consequent rules (ABC implies D).
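A minimal sketch of this confidence check in Java (the class and names below are illustrative and assume the support counts of the frequent itemsets have already been computed; they are not part of any particular library):

    import java.util.Map;
    import java.util.Set;

    // Illustrative sketch: decide whether the rule BC => A should be emitted,
    // given precomputed support counts for the frequent itemsets.
    class RuleCheck {
        static boolean ruleHolds(Map<Set<String>, Integer> support,
                                 Set<String> antecedent,    // e.g. {B, C}
                                 Set<String> wholeItemset,  // e.g. {A, B, C}
                                 double minConfidence) {
            // confidence(BC => A) = support(ABC) / support(BC)
            double conf = (double) support.get(wholeItemset) / support.get(antecedent);
            return conf >= minConfidence;
        }
    }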
The number of frequent itemsets is governed by the minimum support parameter. The number of rules generated is governed
by the number of frequent itemsets and the confidence
parameter. If the confidence parameter is set too high, there may
be frequent itemsets in the association model but no rules.
The most common way to store data collected in various areas is
in relational databases. The development of Information and Communication Technology has led to a huge volume of stored data and to the inability of traditional methods to extract useful information and knowledge from it. For this reason, data mining has developed as a specific field. Mining association rules is one of the commonly used methods in data mining. Association rules model dependencies between items in transactional data. Most data mining systems work with data stored in flat files. However, it has been shown that it is beneficial to implement data mining algorithms within a DBMS, and that using SQL to discover patterns in data can bring certain advantages.

Since the generation of frequent item sets is the most expensive part in terms of resources and time, many algorithms have been developed for this task. Most algorithms use a method that builds candidate itemsets, which are sets of potentially frequent itemsets, and then tests them.
Support for these candidates is determined by taking into account the whole database D. The process of generating candidate itemsets uses the information about the frequency of all candidates already checked. The procedure relies on the downward closure of frequent itemsets: all subsets of a frequent itemset must also be frequent. This allows removing from the candidate itemsets those sets that contain at least one subset of items that is not frequent.
After generation, the number of occurrences of each candidate in the database is counted, in order to retain only those with support greater than minsup.
Then we can move on to the next iteration. The whole process ends when there are no more potential frequent itemsets.
The best-known algorithm that uses the above mechanism is Apriori. On this basis, variants such as AprioriTid, AprioriAll, AprioriSome, and AprioriHybrid were developed.
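A minimal, unoptimized Java sketch of this level-wise loop (illustrative data structures and names, not the original implementation):

    import java.util.*;

    public class AprioriSketch {
        // Returns all frequent itemsets of the transaction database db,
        // i.e. all itemsets contained in at least minsup transactions.
        public static Set<Set<String>> frequentItemsets(List<Set<String>> db, int minsup) {
            Set<Set<String>> allFrequent = new HashSet<>();

            // Level 1: count individual items.
            Map<Set<String>, Integer> counts = new HashMap<>();
            for (Set<String> t : db)
                for (String item : t)
                    counts.merge(Set.of(item), 1, Integer::sum);
            Set<Set<String>> frequent = keepFrequent(counts, minsup);

            while (!frequent.isEmpty()) {
                allFrequent.addAll(frequent);
                // Join k-itemsets into (k+1)-candidates, pruning every candidate
                // that has an infrequent k-subset (downward closure).
                Set<Set<String>> candidates = new HashSet<>();
                for (Set<String> a : frequent)
                    for (Set<String> b : frequent) {
                        Set<String> union = new HashSet<>(a);
                        union.addAll(b);
                        if (union.size() == a.size() + 1 && allSubsetsFrequent(union, frequent))
                            candidates.add(union);
                    }
                // Count each surviving candidate against the whole database.
                counts = new HashMap<>();
                for (Set<String> t : db)
                    for (Set<String> c : candidates)
                        if (t.containsAll(c)) counts.merge(c, 1, Integer::sum);
                frequent = keepFrequent(counts, minsup);
            }
            return allFrequent;
        }

        private static Set<Set<String>> keepFrequent(Map<Set<String>, Integer> counts, int minsup) {
            Set<Set<String>> result = new HashSet<>();
            for (Map.Entry<Set<String>, Integer> e : counts.entrySet())
                if (e.getValue() >= minsup) result.add(e.getKey());
            return result;
        }

        private static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> frequent) {
            for (String item : candidate) {
                Set<String> subset = new HashSet<>(candidate);
                subset.remove(item);
                if (!frequent.contains(subset)) return false;
            }
            return true;
        }
    }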
To see how the Apriori algorithm works, we will use Weka (Waikato Environment for Knowledge Analysis), a popular suite of machine learning software written in Java.
Weka is a workbench that contains a collection of visualization
tools and algorithms for data analysis and predictive modeling,
together with graphical user interfaces for easy access to this
functionality.
I took a sample nominal dataset.
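The same run can also be reproduced programmatically through Weka's Java API. The sketch below assumes the 14-instance dataset is Weka's sample weather.nominal.arff (an assumption; the file path is illustrative) and explicitly sets the default parameters discussed below:

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RunApriori {
        public static void main(String[] args) throws Exception {
            // Load a nominal dataset (path is illustrative).
            Instances data = DataSource.read("weather.nominal.arff");

            Apriori apriori = new Apriori();
            apriori.setNumRules(10);              // stop once 10 rules are found
            apriori.setMinMetric(0.9);            // minimum confidence
            apriori.setUpperBoundMinSupport(1.0); // starting support level
            apriori.setDelta(0.05);               // support decrement per iteration
            apriori.setLowerBoundMinSupport(0.1); // "minimum minimum" support
            // apriori.setOutputItemSets(true);   // also print the frequent item sets

            apriori.buildAssociations(data);      // run the iterated Apriori search
            System.out.println(apriori);          // rules, support levels, cycle count
        }
    }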

It outputs 10 rules, ranked according to the confidence measure given in parentheses after each one. The number following a rule's antecedent shows how many instances satisfy the antecedent; the number following the conclusion shows how many instances satisfy the entire rule (this is the rule's support). Because both numbers are equal for all 10 rules, the confidence of every rule is exactly 1.
Weka's Apriori runs the basic algorithm several times. It uses the same user-specified minimum confidence value throughout. The support level is expressed as a proportion of the total number of instances (14 in this case), as a ratio between 0 and 1. The minimum support level starts at a certain value (upperBoundMinSupport, default 1.0). In each iteration the support is decreased by a fixed amount (delta, default 0.05, i.e., 5% of the instances) until either a certain number of rules has been generated (numRules, default 10 rules) or the support reaches a certain "minimum minimum" level (lowerBoundMinSupport, default 0.1), because rules are generally uninteresting if they apply to less than 10% of the dataset. These four values can all be specified by the user.
The Associator output text area shows that the algorithm managed to generate 10 rules. This is based on a minimum confidence level of 0.9, which is the default and is also shown in the output. The Number of cycles performed, which is shown as 17, indicates that Apriori was actually run 17 times to generate these rules, with 17 different values for the minimum support. The final value, which corresponds to the output that was generated, is 0.15 (corresponding to 0.15 × 14 ≈ 2 instances). By looking at the options in the Generic Object Editor window, you can see that the initial value for the minimum support is 1 by default, and that delta is 0.05. Now, 1 − 17 × 0.05 = 0.15, so this explains why a minimum support value of 0.15 is reached after 17 iterations. Note that minimum support is decreased by delta before the basic Apriori algorithm is run for the first time.
The Associator output text area also shows how many frequent item sets were found, based on the last value of the minimum support that was tried (0.15 in this example). In this case, given a minimum support of two instances, there are 12 item sets of size 1, 47 item sets of size 2, 39 item sets of size 3, and six item sets of size 4. By setting outputItemSets to true before running the algorithm, all those different item sets and the number of instances that support them are shown.
