
A Fast Parallel Algorithm for Discovering Frequent Patterns

Kawuu W. Lin
Department of Computer Science and Information
Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan, R.O.C.
linwc@cc.kuas.edu.tw

Yu-Chin Luo
Department of Computer Science and Information
Engineering
National Kaohsiung University of Applied Sciences,
Kaohsiung, Taiwan, R.O.C.
kim-x@yahoo.com.tw
Abstract
Fast discovery of frequent patterns is the most
extensively discussed problem in data mining
because of its wide applications. As the size of the
database increases, the computation time and the
required memory increase severely. The difficulty of
mining large databases has motivated research on
parallel and distributed algorithms to solve the
problem. Most past studies parallelize the
computation by dividing the database and
distributing the parts to other nodes for mining.
This approach risks leaking data and is evidently
not suitable for sensitive domains such as health
care. In this paper, we propose a novel data mining
algorithm named FD-Mine that efficiently utilizes
the nodes to discover frequent patterns in cloud
computing environments while preserving data
privacy. Empirical evaluations under various
simulation conditions show that FD-Mine delivers
excellent performance in terms of scalability and
execution time.
Keywords: Data mining; cloud computing;
association rule mining; frequent pattern mining;
privacy preserved
I. Introduction
With the progress of information technology, data
mining techniques have been extensively applied in
various domains. The goal of data mining is to
discover hidden, useful information in large
databases. The discovered information can support
decision making, aid commercial promotion, and so
forth. Data mining includes four main topics:
association rule mining [2], sequential pattern
mining [3], clustering [11], and classification [5].
Among these, the problem of frequent pattern
mining, i.e., association rule mining and sequential
pattern mining, is the most discussed because of its
wide applications.
The basic idea of the frequent pattern mining
problem is to discover the patterns whose frequency
of appearance in the database is greater than a
specified threshold. An association rule is defined as
X=>Y, where X and Y are sets of items. The concept
of association rule mining is to discover the sets of
items that tend to be associated with others in the
database. The studies on association rule mining can
be classified into two types: 1) the generate-and-test
(Apriori-like) approach [2] and 2) the frequent
pattern growth (FP-growth-like) approach [6]. The
Apriori-like methods iteratively generate candidate
itemsets of size (k+1) from frequent itemsets of size
k and scan the database repeatedly to test the
frequency of each candidate itemset. As a result,
Apriori-like methods suffer from a large number of
candidate itemsets, especially when the support
threshold is small. For this reason, Han et al.
[6] proposed a novel data structure, named frequent
pattern tree (FP-tree), in which the transactions are
compressed and stored. A mining algorithm, namely
FP-growth, was also proposed for discovering the
frequent patterns from the FP-tree. FP-growth needs
only two scans of the physical database and
therefore greatly improves the execution time.
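The two database scans behind the FP-tree can be illustrated with a minimal Python sketch. It is not the implementation of [6]: the nested-dictionary tree, the function name build_fp_tree, the alphabetical tie-breaking rule, and the absolute count threshold (an item is kept when its count is at least min_count) are assumptions made for illustration only.

```python
from collections import Counter

def build_fp_tree(transactions, min_count):
    """Build a toy FP-tree as nested dicts: item -> [count, children]."""
    # Scan 1: count in how many transactions each item appears.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_count}

    # Scan 2: insert each transaction, keeping only frequent items,
    # ordered by descending support (ties broken alphabetically).
    root = {}
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1      # shared prefixes only increase a counter
            node = entry[1]
    return frequent, root

transactions = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "d"],
    ["b", "c"],
]
header, tree = build_fp_tree(transactions, min_count=2)
```

On these four toy transactions, item d is dropped after the first scan, so it never enters the tree, and the three transactions that start with a share a single branch.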
As the size of the database increases, the
computation time and the required memory increase
severely. Many studies on association rule mining
were proposed mainly to improve efficiency in terms
of execution time. In the past decades, parallel and
distributed computing (PDC) techniques have
attracted extensive attention for their ability to
manage and process large amounts of data. The
difficulty of mining large databases has motivated
research on parallel and distributed algorithms to
solve the problem [7], [8], [10], [13], [14]. The main
approach of the existing studies is to divide the
database and then distribute each part to nodes or
processors for mining, with the goal of distributing
the computational load. During the mining process,
the nodes exchange required transactions with each
other. The workload of exchanging data among
nodes becomes heavy when the average transaction
length is long or the database is large. Although
many algorithms have been proposed, the execution
efficiency of frequent pattern mining remains a
challenge because of the data explosion. In addition
to the exchange workload, data privacy is also a
major concern, since this kind of algorithm
duplicates the database to every node in the PDC
architecture. This approach is evidently not suitable
for sensitive domains such as health care.
In this paper, we propose a novel data mining
method named FD-Mine that efficiently utilizes the
cloud nodes to quickly discover frequent patterns in
cloud computing environments while preserving
data privacy. Empirical evaluations under various
simulation conditions show that FD-Mine delivers
excellent performance in terms of scalability and
execution time.
Authorized licensed use limited to: LA TROBE UNIVERSITY. Downloaded on June 13,2010 at 08:00:36 UTC from IEEE Xplore. Restrictions apply.
The rest of this paper is organized as follows.
Section 2 briefly reviews related work. Section 3
presents the architecture and the data mining
algorithm. Section 4 reports the empirical
performance evaluation. Section 5 gives the
conclusions.
II. Related Work
To improve the performance of association rule
mining, many researchers have tried to distribute
the mining computation over more than one
processor or node. In [9], the authors proposed a
parallel algorithm named Parallel FP-tree (PFP-tree),
based on the FP-tree data structure, for mining
frequent patterns on message passing
multiprocessor systems. The algorithm divides the
database into several non-overlapping parts
according to the number of available processors,
and lets each processor construct its FP-tree by
exchanging the necessary information with other
processors. Because the algorithm runs on a single
node, the data exchange takes place within that
node, so the overhead might not be severe. To
parallelize frequent pattern mining, past studies
relied mainly on the database dividing method [4],
[15]. The database is divided equally or by some
criteria, and each part is sent to a node for mining.
An approach that duplicates the database to other
nodes risks leaking the data, so data privacy cannot
be preserved by this approach.
Note that in cloud computing environments,
network latency is an important issue that should be
carefully considered. The targeted database is
generally large in mining applications. Transmitting
the database and exchanging large amounts of data
over the internet greatly degrades performance. In
[12], the proposed method, named QFP-growth,
divides the database equally and constructs the
FP-trees from the assigned parts of the database.
The FP-trees are then merged into a single FP-tree
to complete the mining task.
The data transmission overhead was studied in [14].
The authors observed that the time spent
exchanging transactions is much longer than the
mining time. To exchange transactions efficiently
among nodes in the database dividing approach,
TPFP-tree was proposed. It uses a transaction
identification set (Tidset) to select the transactions
directly instead of scanning the physical database.
The Tidset is a table recording the IDs of the
transactions that contain a certain item, so the
memory required by the Tidset is of the same size
as the assigned partial database. TPFP-tree is
therefore limited by the size of the targeted
database.
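The Tidset idea can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the data structure of [14]: the function names build_tidset and select_transactions and the use of list positions as transaction IDs are assumptions.

```python
from collections import defaultdict

def build_tidset(transactions):
    """Map each item to the IDs of the transactions that contain it."""
    tidset = defaultdict(list)
    for tid, items in enumerate(transactions):
        for item in set(items):
            tidset[item].append(tid)   # IDs are appended in ascending order
    return dict(tidset)

def select_transactions(transactions, tidset, item):
    """Fetch the transactions containing `item` by ID, without a full scan."""
    return [transactions[tid] for tid in tidset.get(item, [])]

transactions = [["a", "b"], ["b", "c"], ["a", "c"], ["c"]]
tidset = build_tidset(transactions)
with_c = select_transactions(transactions, tidset, "c")
```

The table holds one ID per item occurrence, which is why its size grows with the assigned partial database.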
To balance the computational load of TPFP-tree,
the authors of [15] proposed the BTP-tree algorithm,
a balanced Tidset-based parallel FP-tree algorithm
for mining frequent patterns. The algorithm divides
the database equally into p parts, where p is the
number of nodes. The partial databases are sent to
the nodes individually. Each node builds the Tidset
and the header table for its assigned database. A
global header table, named GHT, is derived by
gathering the header tables of all nodes into one
table and filtering out the items whose support is
smaller than the threshold. Before executing the
mining task, the BTP-tree algorithm calculates a
performance index for each node and records the
sum of the performance indexes. The mining task is
then separated into p sub-tasks, where the load of
each sub-task is measured by the number of items
in the header table. The task assignment is decided
by the performance indexes. After the task
assignment, each node constructs its Tidset for fast
selection. The required transactions are exchanged
among nodes to generate the new sub-databases by
referring to the items of the header tables. Finally,
FP-growth is performed on each node to discover
the frequent patterns, and the frequent patterns of
all nodes are gathered to obtain the complete set of
frequent patterns.
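The derivation of the global header table can be sketched as follows. This is a simplified illustration rather than the procedure of [15]: the function name global_header_table, the dictionary representation of a header table (item to local count), and the absolute count threshold are assumptions.

```python
from collections import Counter

def global_header_table(local_header_tables, min_count):
    """Merge per-node item counts and drop items below the threshold."""
    gathered = Counter()
    for table in local_header_tables:
        gathered.update(table)   # adds the local counts item by item
    return {item: n for item, n in gathered.items() if n >= min_count}

# Header tables reported by two nodes for their partial databases.
node_tables = [
    {"a": 3, "b": 1, "c": 2},
    {"a": 2, "b": 1, "d": 4},
]
ght = global_header_table(node_tables, min_count=3)
```

Items b and c are frequent on no node and remain below the threshold after merging, so only a and d survive in the global table.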
III. Proposed Algorithm: FD-Mine
In this section, we describe the proposed algorithm,
which efficiently distributes the computation in
cloud computing environments. The cloud
architecture for mining frequent patterns is
introduced in Section 3.1. In Section 3.2, we
formulate the problem. The details of the proposed
algorithm are described in Section 3.3.
3.1 Proposed Cloud Architecture for Frequent
Pattern Mining
Note that in cloud computing environments, data
privacy is an important issue. Since the clouds are
physically distributed and each cloud node provides
only its computing capacity, the trustworthiness of
the nodes cannot be guaranteed. Therefore, to
preserve data privacy, only a node that is known to
be safe, rather than every node, may access the
database. In our architecture, we call this node the
trusted node or kernel node, and the cloud in which
it resides the kernel cloud. For efficient data
transmission among clouds, each cloud has exactly
one node that connects to other clouds, called the
connection node and abbreviated as conn-node. If a
node N needs data from the trusted node, N asks
the conn-node of its cloud whether the conn-node
already has the data. If it does, N downloads the
data from the conn-node via the intranet. Otherwise,
the data is first duplicated to the conn-node via the
internet, and then N downloads it from the
conn-node via the intranet. With this transmission
policy, each piece of data crosses the internet at
most once per cloud, which keeps the network
latency low.
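The transmission policy amounts to a per-cloud cache at the conn-node. The following Python sketch simulates it in a single process; the class names TrustedNode and ConnNode, the transfer counter, and the key "fp_tree" are hypothetical and stand in for real network calls.

```python
class TrustedNode:
    """Holds the data; every fetch from here counts as an internet transfer."""
    def __init__(self, data):
        self.data = data
        self.internet_transfers = 0

    def fetch(self, key):
        self.internet_transfers += 1
        return self.data[key]

class ConnNode:
    """Single gateway of a cloud; caches what it pulled from the trusted node."""
    def __init__(self, trusted):
        self.trusted = trusted
        self.cache = {}

    def get(self, key):
        if key not in self.cache:                       # not yet in this cloud
            self.cache[key] = self.trusted.fetch(key)   # internet transfer
        return self.cache[key]                          # served via intranet

trusted = TrustedNode({"fp_tree": b"compressed-tree-bytes"})
cloud_a, cloud_b = ConnNode(trusted), ConnNode(trusted)

# Three nodes in cloud A and two nodes in cloud B request the same data.
for conn in (cloud_a, cloud_a, cloud_a, cloud_b, cloud_b):
    conn.get("fp_tree")
```

Five requests cause only two internet transfers, one per cloud; every later request in the same cloud is answered from the conn-node.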
[Figure: physical machines hosting the database, the trusted node (virtual machine), connection nodes (virtual machines), and computing nodes (virtual machines)]
Figure 1. Proposed architecture for frequent pattern mining.
In this architecture, each conn-node maintains a
table that records the status of the nodes in its
cloud. The information recorded for each node
consists of the node's ID and its availability. All of
the tables are then gathered at the kernel node, so
that the kernel node has complete information about
the computing capacity in terms of available nodes.
The information is updated periodically.
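A small Python sketch of how the kernel node might combine the status tables is given below. The table layout (node ID mapped to an availability flag) follows the description above, while the function names gather_status and idle_nodes and the cloud labels are assumptions.

```python
def gather_status(conn_tables):
    """Merge per-cloud status tables into the kernel node's global view."""
    return {
        (cloud, node_id): available
        for cloud, table in conn_tables.items()
        for node_id, available in table.items()
    }

def idle_nodes(global_table):
    """List the (cloud, node ID) pairs currently available for mining."""
    return sorted(key for key, available in global_table.items() if available)

# One table per conn-node: node ID -> availability.
conn_tables = {
    "cloud1": {"n1": True, "n2": False},
    "cloud2": {"n1": True},
}
global_table = gather_status(conn_tables)
available = idle_nodes(global_table)
```

Keying the merged table by cloud and node ID keeps nodes with the same local ID in different clouds apart.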
3.2 Mining frequent patterns in cloud computing
environments
One characteristic of the proposed algorithm is that
data privacy is preserved. Unlike the parallel
Apriori-like algorithms [4], which need to duplicate
the database to remote nodes, or the BTP-tree
algorithm [15], which distributes parts of the
database directly to cloud nodes, our architecture
and algorithms permit only the kernel node to
access the database. Besides the privacy leakage of
the conventional algorithms, the time they require
to duplicate the physical database is considerable.
The data structure used by the proposed algorithms
is based on that of FP-growth. The FP-tree is a data
structure that stores the frequent items in
compressed form. Because the items with support
smaller than the support threshold are filtered out
and the filtered transactions are merged into the
FP-tree, retrieving the complete transaction of any
user from the FP-tree is impossible. Moreover, the
FP-tree is often implemented as a linked structure,
and our algorithm further compresses the FP-tree
with ZIP to reduce the transmission time, so the
transactions cannot be readily reconstructed. Data
privacy can thus be preserved.
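The compression step can be illustrated with Python's standard library. This is a sketch under assumptions: the text only says that the FP-tree is compressed with ZIP, so the JSON serialization, the nested-list tree layout, and the use of zlib (which implements DEFLATE, the method commonly used inside ZIP archives) are illustrative choices, and pack_tree and unpack_tree are hypothetical names.

```python
import json
import zlib

def pack_tree(tree):
    """Serialize an FP-tree (nested dicts/lists) and compress it."""
    return zlib.compress(json.dumps(tree, sort_keys=True).encode("utf-8"))

def unpack_tree(blob):
    """Inverse of pack_tree, run on the receiving node."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Toy FP-tree: item -> [count, children]. Infrequent items never enter it.
tree = {"a": [3, {"b": [2, {"c": [1, {}]}]}], "b": [1, {"c": [1, {}]}]}
blob = pack_tree(tree)
restored = unpack_tree(blob)
```

The receiving node recovers exactly the tree that was sent. For a tree this small the compressed form is not necessarily shorter than the input; the saving is expected only for the large trees that the transmission step is concerned with.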
3.3 FD-Mine algorithm
The purpose of FD-Mine is fast mining. In cloud
computing environments, distributing the mining
computation entails data transmission over the
network. In BTP-tree [15], the database is divided
equally into several parts that are sent to the
available nodes. The nodes then request the required
data from each other to finish the mining task. In
practice, the database is often large. This approach
therefore not only leaks the data but also incurs a
large amount of data transmission over the network,
so its performance is expected to be poor.
An intuitive way to save time is to minimize the
amount of data transmission. The proposed FD-Mine
is designed to transmit as little data as possible in
order to save the time spent on network latency and
disk I/O. The algorithm is presented in Figure 2.
We describe the details of FD-Mine below. The
trusted node TN follows the FP-tree construction
algorithm, scanning the database twice, and
constructs the corresponding FP-tree, which is
stored in TN (line 1). The next step is to obtain the
header table HT (line 2) and to divide HT into |N|
disjoint sets, stored in IS (line 3). Since the frequent
patterns are not predictable, HT is divided randomly
with the goal of balancing the load of each node.
For execution efficiency, the most important issue
is that the amount of data transmission should be
minimized. To this end, the FP-tree constructed on
TN is duplicated to each idle node. In cloud
computing environments, we also consider the
problem of network latency. Since internet latency
is always larger than intranet latency, the FP-tree
duplication should be done over the intranet.
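The random division of the header table can be sketched in Python as follows. The text states only that HT is divided randomly into |N| disjoint sets to balance the load; the shuffle followed by round-robin dealing, the function name divide_header_table, and the seed parameter are assumptions made for this sketch.

```python
import random

def divide_header_table(items, num_nodes, seed=None):
    """Randomly split the header-table items into num_nodes disjoint sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items round-robin so set sizes differ by at most one.
    return [set(shuffled[i::num_nodes]) for i in range(num_nodes)]

header_items = ["a", "b", "c", "d", "e", "f", "g"]
item_sets = divide_header_table(header_items, num_nodes=3, seed=42)
```

Every item lands in exactly one set, and each node later mines only the conditional items of its own set. Balancing the number of items does not guarantee equal mining time, since the cost per item is not known in advance.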
Because intranet transfers are cheaper, the FP-tree
duplication proceeds as follows. First, the algorithm
selects an idle node n (line 5) and selects the
connection node cn of n from the cloud architecture
C (line 6). If cn has no duplicated FP-tree, TN
duplicates one to cn (lines 7 to 9). Note that, to
minimize the transmission overhead, the FP-tree
should be compressed in advance. Afterwards, node
n obtains the compressed FP-tree via the intranet
and decompresses it (line 10). After receiving the
FP-tree, node n is assigned a subset of IS (line 11)
and batch-runs FP-growth for each conditional item
in the subset to mine the frequent patterns (lines 12
and 13). Each node thus needs only one data
transmission, i.e., the FP-tree duplication, and the
transmission takes place over the intranet to
minimize the network latency. After all of the |N|
disjoint sets have been processed, the frequent
patterns are returned (line 15).
Algorithm FD-Mine
Input: A transaction database DB, a minimum support
threshold, the trusted node TN, and a set of
nodes N with cloud architecture C
Output: The complete set of frequent patterns, FP
1 TN.FPT ← FP-tree constructed from DB
// TN reads the DB and constructs the corresponding FP-tree
2 HT ← getHT(FPT)
// Obtain the header table of FPT
3 IS ← divideHT(|N|)
// Randomly divide the items of HT into |N| disjoint sets
4 FOR i = 1 TO |IS|
5   n ← selectNode(N, i) // Select the ith node
6   cn ← selectConnNode(n, C)
    // Select the conn-node of n
7   IF (isExistFPT(cn) == FALSE)
8     cn.FPT ← TN.FPT
      // Duplicate FPT from TN if cn does not have FPT
9   ENDIF
10  n.FPT ← cn.FPT
    // Duplicate FPT from the conn-node of n
11  is_i ← getSet(IS, i) // Obtain the ith set of IS
12  fp_i ← N_i.BatchFPGrowth(is_i)
    // Batch-run FP-growth for each conditional item in is_i to
    // mine the frequent patterns
13  FP ← FP ∪ fp_i
14 ENDFOR
15 RETURN FP
Figure 2. FD-Mine Algorithm.
IV. Experimental Results
To evaluate the performance of the proposed
algorithm, we use IBM's Quest Synthetic Data
Generator [1] to generate the workload data for
mining. The experiments were conducted on a cloud
system with three clouds. The first cloud contains
four nodes, including the kernel node, each
equipped with an E8400 2.4GHz CPU, 1GB of
available RAM, and 320GB of disk storage. The
second and third clouds contain four and three
nodes respectively, each equipped with a P8600
2.4GHz CPU, 1GB of available RAM, and 160GB of
disk storage. Note that the kernel node is
responsible for receiving the requests and is not
used for mining. Therefore, ten nodes in total can be
used for mining in the system. Because very few
parallel, privacy-preserving algorithms for frequent
pattern mining exist, and most existing parallel
algorithms follow the database dividing approach,
we select BTP-tree for the performance comparison.
It is one of the most efficient algorithms for
parallelizing the mining task on grid systems. Both
FD-Mine and BTP-tree were implemented in Java,
with the message passing among nodes and the
remote function calls implemented using Java RMI.
4.1 Effects of varying the number of cloud nodes
In the following experiments, we investigate the
performance of FD-Mine in terms of execution time
by varying the number of cloud nodes from 1 to 10.
The results for dataset T20.I5.N100K.D100K are
described. The support threshold is set to 0.03%,
which is a very small value, in order to verify the
performance of both algorithms, FD-Mine and
BTP-tree. Figure 3 shows that the required execution
time of FD-Mine and BTP-tree decreases as the
number of nodes increases. The execution time of
FD-Mine is almost the same as that of BTP-tree
when only one node is available. This is expected
because both perform FP-growth on a single node.
The execution time of FD-Mine is slightly longer
than that of BTP-tree when the number of nodes is
2 or 3. This is because the time spent on FP-tree
compression and decompression exceeds the time
needed to directly transmit the divided parts of the
database. With more than 3 nodes, FD-Mine shows
the advantage of sending the tree after compression:
less time is required to complete the whole mining
task.
[Figure: execution time plotted against the number of nodes]
Figure 3. The execution time for FD-Mine and BTP-tree with
number of nodes varied on dataset T20.I5.N100K.D100K.
Figure 4 shows the impact on execution time when
the average transaction length is increased to 40.
FD-Mine delivers better performance than BTP-tree
when the number of nodes is greater than 2. The
reason is that BTP-tree, a database dividing
approach, needs to exchange transactions among
the nodes, and its performance suffers from the
large number of exchanged transactions.
Figure 5 shows the performance of FD-Mine and
BTP-tree when the number of transactions is set to
200K. In this experiment, FD-Mine outperforms
BTP-tree when the number of nodes is greater than
2, which demonstrates the intrinsic drawback of the
database dividing approach. Across this series of
experiments, FD-Mine not only preserves data
privacy but also delivers better performance than
BTP-tree in terms of execution time, especially when
the database is large.
4.2 Effects of varying the parameters of dataset
In this section, we study the effects of varying the
support threshold and the parameters of the data
generator, namely the number of transactions and
the average transaction length. Two algorithms,
FD-Mine and BTP-tree, are compared in the
experiment.
In Figure 6, we explore the impact on execution
time of varying the support threshold from 0.05% to
0.01% with ten cloud nodes. FD-Mine always
requires less time than BTP-tree. The efficiency of
FD-Mine in execution time is mainly achieved by
reducing the transmission overhead and the disk
I/O time. In this experiment, FD-Mine requires on
average only about 82% of the execution time of
BTP-tree.
[Figure: execution time plotted against the number of nodes]
Figure 4. The execution time for FD-Mine and BTP-tree with
number of nodes varied on dataset T40.I5.N100K.D100K.
[Figure: execution time plotted against the number of nodes]
Figure 5. The execution time for FD-Mine and BTP-tree with
number of nodes varied on dataset T20.I5.N100K.D200K.
[Figure: execution time plotted against the support threshold (%), from 0.05 to 0.01]
Figure 6. The execution time for FD-Mine and BTP-tree with
support threshold varied on dataset T20.I5.N100K.D100K.
V. Conclusions
In this paper, we have presented an algorithm
named FD-Mine that efficiently utilizes the cloud
nodes to discover frequent patterns in cloud
computing environments while preserving data
privacy. The proposed FD-Mine is composed of
two algorithms, namely HD-Mine and FD-Mine.
Conventional algorithms for mining datasets with a
large number of frequent patterns are limited by the
available memory. The proposed HD-Mine is able to
discover the frequent patterns from this kind of
dataset by merging the memory of several nodes.
The proposed FD-Mine focuses on the fast discovery
of frequent patterns by utilizing the cloud nodes,
and is useful for applications that emphasize
real-time mining. Another important characteristic
of the proposed algorithms is that data privacy is
preserved. Unlike the parallel Apriori-like
algorithms, which need to duplicate the database to
remote nodes, or the BTP-tree algorithm, which
distributes parts of the database directly to cloud
nodes, our architecture and algorithms never
duplicate the database and permit only the kernel
node to access it. Empirical evaluations under
various simulation conditions show that FD-Mine
delivers excellent performance in terms of
scalability and execution time.
Acknowledgement
This research was partially supported by the
National Science Council, Taiwan, ROC, under Grant
No. 97-2218-E-151-003-MY2.
References
[1] R. Agrawal and R. Srikant, Quest Synthetic Data Generator,
IBM Almaden Research Center, San Jose, California,
http://www.almaden.ibm.com/cs/quest/syndata.html.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining association
rules between sets of items in large databases," Proc. ACM SIGMOD
Intl. Conf. Management of Data, 1993.
[3] R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. of
the 11th Int'l Conf. on Data Engineering, 1995, pp. 3-14.
[4] R. Agrawal and J. C. Shafer, "Parallel Mining of Association
Rules," IEEE Transactions on Knowledge and Data Engineering,
December 1996.
[5] R. J. Bayardo, Jr., "Brute-force mining of high-confidence
classification rules," Proceedings of the 3rd International
Conference on Knowledge Discovery and Data Mining (KDD'97),
Newport Beach, California, USA.
[6] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns Without
Candidate Generation," Proc. of ACM Int. Conf. on Management of
Data (SIGMOD), pp. 1-12, 2000.
[7] J. D. Holt and S. M. Chung, "Parallel mining of association rules
from text databases on a cluster of workstations," Proceedings of
18th International Symposium on Parallel and Distributed
Processing, 2004, pp. 86.
[8] P. Iko and M. Kitsuregawa, "Shared Nothing Parallel Execution
of FPgrowth," DBSJ Letters, Volume 2, No. 1, 2003, pp. 43-46.
[9] A. Javed and A. Khokhar, "Frequent Pattern Mining on Message
Passing Multiprocessor Systems," Distributed and Parallel
Databases, Volume 16, Issue 3, 2004, pp. 321-334.
[10] T. Li, S. Zhu, and M. Ogihara, "A New Distributed Data Mining
Model Based on Similarity," Symposium on Applied Computing,
2003, pp. 432-436.
[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases
with Noise," Proc. 2nd Int. Conf. on Knowledge Discovery and
Data Mining, Portland, OR, AAAI Press, 1996, pp. 226-231.
[12] Y. Qiu, Y. J. Lan, and Q. S. Xie, "An improved algorithm of
mining from FP-tree," Proceedings of the Third International
Conference on Machine Learning and Cybernetics, pp. 26-29,
2004.
[13] E.-H. S. Han, G. Karypis, and V. Kumar, "Scalable parallel data
mining for association rules," IEEE Transactions on Knowledge and
Data Engineering, 12(3):352-377, 2000.
[14] J. Zhou and K.-M. Yu, "Tidset-based Parallel FP-tree Algorithm
for the Frequent Pattern Mining Problem on PC Clusters," Lecture
Notes in Computer Science 5036, 2008, pp. 18-28.
[15] J. Zhou and K.-M. Yu, "Balanced Tidset-based Parallel FP-tree
Algorithm for the Frequent Pattern Mining on Grid System," Fourth
International Conference on Semantics, Knowledge and Grid, 2008.