Escolar Documentos
Profissional Documentos
Cultura Documentos
VOL. 28,
1245
INTRODUCTION
1041-4347 2015 IEEE. Translations and content mining are permitted for academic research only. Personal use is also permitted, but republication/redistribution
requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
1246
VOL. 28,
NO. 5,
MAY 2016
TABLE 1
Database D and eXternal Utility Table XUT
ui; t:
i2Xt
X
t2TSX
uX; t
X X
ui; t:
t2TSX i2X
LIU ET AL.: MINING HIGH UTILITY PATTERNS IN ONE PHASE WITHOUT GENERATING CANDIDATES
RELATED WORKS
High utility pattern mining problem is closely related to frequent pattern mining, including constraint-based mining.
In this section, we briefly review prior works both on frequent pattern mining and on utility mining, and discuss
how our work connects to and differs from the prior works.
1247
1248
VOL. 28,
NO. 5,
MAY 2016
LIU ET AL.: MINING HIGH UTILITY PATTERNS IN ONE PHASE WITHOUT GENERATING CANDIDATES
by a pseudo projection, for example, TSfa, bg can be projected from TSfbg without materialization, and thus we
can compute the utility of the pattern and a utility upper
bound used for pruning in an efficient and scalable way.
X
t2TSY
ufpeX; t; t
X
t2TSY
uY; t uY :
u
t
1249
1250
uS [ X minU; 8S W ^ S 6 ;:
(4)
t2TSfig[X;i2S
t2TSfig[X;i2W
u
t
then
(5)
uW [ X; t
t2TSW [X
uW n S; t
t2TSW nS[X
uW [ X
ufjg; t
t2TSfjg[X^j2W nS
uW [ X min
j2W
X
t2TSX
NO. 5,
MAY 2016
t2TSX
uS [ X < minU; 8S W:
VOL. 28,
u
t
LIU ET AL.: MINING HIGH UTILITY PATTERNS IN ONE PHASE WITHOUT GENERATING CANDIDATES
1251
Fig. 2. Pruned version of the reverse set enumeration tree in Fig. 1, showing how d2 HUP works given minU 30.
1252
VOL. 28,
NO. 5,
MAY 2016
LIU ET AL.: MINING HIGH UTILITY PATTERNS IN ONE PHASE WITHOUT GENERATING CANDIDATES
1253
TABLE 2
Characteristics of Six Datasets
Dataset
T10I6D1M
WebView-1
Chess
Chain-store
T20I6D1M
Foodmart
COMPARATIVE EVALUATION
jtj
jIj
jDj
Type
10 : 33
2:5 : 267
37 : 37
7:2 : 170
20 : 49
4:8 : 27
1,000
497
76
46,086
1,000
1,559
933;493
59;602
3;197
1;112;949
999;287
34;015
mixed
sparse
dense
sparse
mixed
dense
UP
UPG , IHUPTWU , and TwoPhase are 1 to 2 orders, 2 to 4
orders, and over 3 orders of magnitude more than the number of patterns enumerated by d2 HUP respectively. The reasons are as follows.
1254
VOL. 28,
NO. 5,
MAY 2016
TABLE 3
Patterns Enumerated by d2 HUP, HUIMiner, and Candidates
by UP
UPG, IHUPTWU, TwoPhase
(a) T10I6D1M
minU
hups(ml)
d2 HUP HUIM.
UP+
IHUP+
T.P.
0.5%
0.1%
.01%
.005%
25 (1)
389 (8)
873K (17)
1.6M (17)
639
1,092
1.06M
1.8M
639
1,871
1.21M
2M
673
8,735
1.7M
2.4M
711
32K
2.1M
2.9M
226K
611K
-
d2 HUP HUIM.
UP+
IHUP+
T.P.
19
27
15.7M
-
159
171
1,809
-
71K
12M
-
95K
-
UP+
IHUP+
T.P.
34
1.34M
43.7M
-
299K
43.7M
-
304K
-
UP+
IHUP+
T.P.
1,123
3,876
66.6K
225K
1,577
7,117
286K
9.48M
628K
7.3M
-
(b) WebView-1
minU
hups(ml)
3.2%
3%
2.1%
1.9%
2 (1)
2 (1)
6 (1)
11K (148)
11
21
203
160K
(c) Chess
minU
hups(ml)
d2 HUP HUIM.
60%
30%
20%
10%
0 (0)
7.4K (16)
1.1M (20)
124M (25)
0
13.9K
1.08M
83.6M
0
16.8K
1.61M
158M
(d) Chain-store
minU
hups(ml)
0.25%
0.1%
0.01%
.005%
17 (1)
80 (2)
3,839 (6)
12K (11)
d2 HUP HUIM.
152
1,661
21.4K
43.1K
135
1,629
28.4K
71.2K
(e) T20I6D1M
minU
hups(ml)
d2 HUP HUIM.
0.5%
0.1%
.01%
.005%
25 (1)
669 (11)
923K (17)
1.8M (17)
786
2,302
1.24M
2.35M
UP+
IHUP+
T.P.
798
66.4K
2.92M
9.55M
3,311
207K
6.1M
23.4M
334K
-
d2 HUP HUIM.
UP+
IHUP+
T.P.
1,553
1,590
1.58M
17.5M
1,559
26.1K
-
1,559
-
1.2M
-
786
10.3K
1.54M
2.81M
(f) Foodmart
minU
hups(ml)
0.1%
198 (1)
0.02%
1,467 (27)
0.015% 1.72M (27)
0.01% 74.2M (27)
d2 HUP is up to 6.6, 6.8, 7.8, 261, 472, and 1,502 times faster
than UP
UPG on T20I6D1M, T10I6D1M, Chain-store, WebView-1, Foodmart, and Chess respectively. The reasons are
as follows:
1,556
2,097
3.78M
91.5M
UP
UPG , IHUPTWU , and TwoPhase cannot.
efficient than UP
UPG , IHUPTWU , and TwoPhase. In particular,
LIU ET AL.: MINING HIGH UTILITY PATTERNS IN ONE PHASE WITHOUT GENERATING CANDIDATES
1255
Our d2 HUP algorithm uses the least amount of memory because d2 HUP uses CAUL that is more compact
than the vertical data structure by HUIMiner, and
d2 HUP does not materialize candidates in memory
while UP
UPG , IHUPTWU , and TwoPhase do.
1256
VOL. 28,
Fig. 8. The lookahead is extremely helpful with dense datasets, for example, for Chess, and large datasets, for example,
for T10I6D1M.
Third, in terms of pruning the search space, more irrelevant item filtering (Corollaries 2 and 3) by increasing g, as
depicted by rec, and more CAUL materialization by
increasing f, as depicted by mat, are very helpful as the
utility upper bounds become tighter, which also decreases
the running time with sparse data, for example, for
WebView-1. However, it comes with additional computational overhead and thus the running time does not always
decrease with the increase of g and f, for example, for
Chain-store.
In short, while the setting of g 1 and f 0 with lookahead is good, we recommend to use the setting of g 3
and f 0:5 with lookahead as the default.
NO. 5,
MAY 2016
ACKNOWLEDGMENTS
This work was supported in part by the National Natural Science Foundation of China (61272306), and the Zhejiang Provincial Natural Science Foundation of China
(LY12F02024). The authors would like to express their
gratitude to the anonymous reviewers.
REFERENCES
[1]
[2]
[3]
[4]
[5]
LIU ET AL.: MINING HIGH UTILITY PATTERNS IN ONE PHASE WITHOUT GENERATING CANDIDATES
[6]
[7]
[8]
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
[32]
1257