ABSTRACT
The popularity of XML has led to the production of large numbers of XML
documents. Developing an approach to association rule mining on native XML
databases is therefore an important research topic. FP-growth, which is based
on the FP-tree algorithm, performs more efficiently than other association rule
mining methods, but it cannot be applied to native XML databases. Hence, we
adapt an improved FP-tree algorithm, called the Frequent Pattern Split method
(FP-split for short), for fast association rule mining from native XML
databases. Experiments with various parameters, such as different minimum
supports, different numbers of items, and large amounts of data, show that the
FP-split method is time-efficient for mining association rules from native XML
databases. In addition, further experiments show that the proposed method
performs better than the FP-tree construction algorithm on transaction
databases.
Keywords: data mining, association rule, XML, DTD, XML schema, XML
database.
Step 1. Analyzing DTD or XML Schema
Given each collection Ci, the definitions of the XML document structure provided by the DTD or XML Schema are parsed. This yields the tag names that may occur in the XML document, each tag's occurrence count, and the content model of each tag. Definition 1 below explains the minimum occurrence count of an element in the XML document and its content model.

[Definition 1] Minimum occurrence count of an element in the XML document and its content model
Let f0(t) ∈ Z+ ∪ {0}, in which t ∈ TG, Z+ is the set of positive integers, and f0(t) indicates the minimum occurrence count of element t in the XML document.
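As a rough illustration of Definition 1, the sketch below derives a minimum occurrence count and a content-model flag from DTD element declarations. It is not the paper's Algorithm 1. It assumes the standard DTD occurrence indicators (`?` and `*` allow zero occurrences, `+` or no indicator require at least one), handles only comma-separated sequences and `(#PCDATA)`, and codes the content model as 0 for sub-elements and 1 for character data, as in the worked example. The toy DTD, the underscore-separated element names, and the function name `analyze_dtd` are hypothetical.

```python
import re

# Matches declarations such as <!ELEMENT food (snack*, drink*)>
ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)\s*>")

# Hypothetical toy DTD, loosely modelled on the transaction example.
TOY_DTD = """
<!ELEMENT transaction_log (customer_data, transaction_item)>
<!ELEMENT customer_data (identifier, gender)>
<!ELEMENT transaction_item (food?, supply?)>
<!ELEMENT identifier (#PCDATA)>
<!ELEMENT gender (#PCDATA)>
<!ELEMENT food (snack*, drink*)>
<!ELEMENT supply (other*)>
<!ELEMENT snack (#PCDATA)>
<!ELEMENT drink (#PCDATA)>
<!ELEMENT other (#PCDATA)>
"""

def analyze_dtd(dtd_text, root):
    """Return (min_occurrence, content_model) dicts keyed by element name."""
    children = {}       # parent -> list of (child, minimum inside that parent)
    content_model = {}  # 0 = sub-elements, 1 = character data
    for name, body in ELEMENT_RE.findall(dtd_text):
        body = body.strip()
        if body == "#PCDATA":
            content_model[name] = 1
            children[name] = []
            continue
        content_model[name] = 0
        kids = []
        for part in body.split(","):
            part = part.strip()
            indicator = part[-1] if part[-1] in "?*+" else ""
            child = part.rstrip("?*+")
            # '?' and '*' permit zero occurrences; '+' or nothing requires one.
            kids.append((child, 0 if indicator in ("?", "*") else 1))
        children[name] = kids

    # An element under an optional ancestor may be absent from the document,
    # so the document-level minimum is capped by the parent's minimum.
    min_occurrence = {root: 1}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child, local_min in children.get(parent, []):
            min_occurrence[child] = min(min_occurrence[parent], local_min)
            stack.append(child)
    return min_occurrence, content_model

min_occ, model = analyze_dtd(TOY_DTD, "transaction_log")
print(min_occ["identifier"], min_occ["food"], model["food"], model["snack"])
```

Under these assumptions, mandatory character-data elements such as `identifier` get a minimum of 1, while `food` and everything nested below it get 0.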
Case 1: item βi ∈ Sα. For item α, item βi is a trivial item.
Case 2: item βi ∈ Rα. For item α, item βi is a trivial item.

[Property]
For item α, if item βi is a trivial item, the frequent itemset α∪βi may create a trivial association rule, i.e. α→βi or βi→α. Therefore, trivial association rules should not be created from the frequent itemset α∪βi.

Algorithm 2 modifies the TD-FP-growth Algorithm proposed by Wang et al. [17] so that it applies to semi-structured XML documents.

Algorithm 2: Adaptive TD-FP-growth
Input: an FP-split tree
Output: frequent patterns
Begin
Mine-tree (L, H)
{ For each entry α in H
    If ( H(α) >= minsup ) and ( α not exist in (SL and RL) )
    output αL
    create a new header table Hα by calling function Build-subtable (α)
    Mine-tree (αL, Hα) }
Build-subtable (α)
{ For each node u on the Link of α
    walk up the path from u once do
    if encounter a J_node v
    then link v into the Link of J in Hα
    count(v) = count(v) + count(u)
    Hα(J) = Hα(J) + count(u) }
End Algorithm 2

Phase 2: Creating association rules
A rule is established only when its confidence crosses the threshold determined by the user.

4 AN ELABORATION FOR THE PROPOSED APPROACH

This section uses an example to illustrate how XML documents in a native XML database are analyzed and transformed into an FP-split tree, as well as how association rules with complete information are mined. The DTD in Figure 2 is applied to define the structure of XML documents. Based on the two definition methods, six XML documents are derived as seen in Figure 3.

Step 1. Parsing DTD or XML Schema
Algorithm 1 is utilized to parse the DTD in Figure 8 and acquire information about the defined element names, the number of element occurrences and the content model of elements. The set of element names is TG={transaction log, customer data, transaction item, identifier, gender, food, supply, electrical appliance, snack, bread, drink, bath, cleanser, other, audio, air conditioning}. The minimum occurrence count of each element is …, f0(air conditioning)=0. The content model of each element is fC(transaction log)=0, fC(customer data)=0, fC(transaction item)=0, fC(identifier)=,…, and fC(air conditioning)=1. Results are shown in Table 4. All elements are categorized into the sets TS, TM and TO.
Elements with a content model of sub-element and a minimum occurrence count of zero are listed in the TS set, TS={food, supply, electrical appliance}. Elements with a content model of character data and a minimum occurrence count of 1 are listed in the TM set, TM={identifier, gender}. Elements with a content model of character data and a minimum occurrence count of zero are listed in the TO set, TO={snack, bread, drink, bath, cleanser, other, audio, air conditioning}. This categorization is summarized in Table 5. In the DTD, the elements “transaction log”, “customer data” and “transaction item” are sub-elements in terms of content model, and their minimum occurrence count is one. This indicates that these three elements are bound to occur in every XML document. In addition, they are not used to describe character data. Therefore, mining these elements is trivial.

Step 2. Creating equivalence classes of items
After the DTD or XML Schema is parsed, element names in TS, TM and TO are matched with tag names in each document to establish equivalence classes of items as described in Algorithm 2.
Take the six XML documents in Figure 9 for instance. In the first XML document, we find that {food, supply} ⊆ TS. Based on Case 1 of creating equivalence classes, equivalence classes of tags are created, i.e. ECfood={1} and ECsupply={1}. {identifier, gender} ⊆ TM. According to Case 2, equivalence classes of character data can be created: ECidentifier(A001)={1} and ECgender(male)={1}. {snack, drink, other} ⊆ TO. According to Case 3, not only equivalence classes of tags but also equivalence classes of character data can be created, i.e. ECfood(snack)={1}, ECfood(drink)={1} and ECsupply(other)={1}, and the equivalence classes of the character data these tags describe, ECfood(snack(puff))={1}, ECfood(drink(beer))={1}, ECsupply(other(diaper))={1} and ECsupply(other(feeding bottle))={1}. The equivalence classes in all XML documents are listed in Table 6.

Step 3. Support of equivalence classes
When each ECι is created, its support can be computed at the same time to find frequent items. The support of each item ι is the number of XML documents contained in the equivalence class of the item. Therefore, |ECidentifier(A001)|=1, |ECgender(male)|=3, |ECfood|=4, |ECfood(snack)|=2, |ECfood(drink)|=4,…, |ECsupply(bath(bath foam))|=1.
We set the minimum support to 2. Any ECι under the minimum support is deleted, and the remaining ECι are kept.
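The matching of document tags against TS, TM and TO in Step 2 can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the XML markup of the two documents is hypothetical (reconstructed from the items listed for documents 1 and 2, with document 2 abbreviated), element names use underscores because XML names cannot contain spaces, and `build_equivalence_classes` is an invented helper name. Parsing relies on Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Element categories from the worked example (underscores replace spaces).
TS = {"food", "supply", "electrical_appliance"}
TM = {"identifier", "gender"}
TO = {"snack", "bread", "drink", "bath", "cleanser", "other",
      "audio", "air_conditioning"}

# Hypothetical markup for documents 1 and 2 (document 2 is abbreviated).
DOCUMENTS = {
    1: """<transaction_log>
            <customer_data><identifier>A001</identifier><gender>male</gender></customer_data>
            <transaction_item>
              <food><snack>puff</snack><drink>beer</drink></food>
              <supply><other>diaper</other><other>feeding bottle</other></supply>
            </transaction_item>
          </transaction_log>""",
    2: """<transaction_log>
            <customer_data><identifier>A002</identifier><gender>male</gender></customer_data>
            <transaction_item>
              <food><bread>strawberry bread</bread><drink>milk</drink></food>
            </transaction_item>
          </transaction_log>""",
}

def build_equivalence_classes(documents):
    """Map each item to the set of document numbers that contain it."""
    ec = defaultdict(set)
    for doc_id, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # ElementTree has no parent pointer, so build a child -> parent map.
        parent_of = {child: parent for parent in root.iter() for child in parent}
        for elem in root.iter():
            text = (elem.text or "").strip()
            if elem.tag in TS:      # Case 1: the tag itself is the item
                ec[elem.tag].add(doc_id)
            elif elem.tag in TM:    # Case 2: tag(character data)
                ec[f"{elem.tag}({text})"].add(doc_id)
            elif elem.tag in TO:    # Case 3: parent(tag) and parent(tag(data))
                parent = parent_of[elem].tag
                ec[f"{parent}({elem.tag})"].add(doc_id)
                ec[f"{parent}({elem.tag}({text}))"].add(doc_id)
    return dict(ec)

ec = build_equivalence_classes(DOCUMENTS)
for item in sorted(ec):
    print(item, sorted(ec[item]))
```

Elements outside TS, TM and TO, such as the always-present `transaction_log`, produce no item, which mirrors the observation that mining them is trivial.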
Figure 2: DTD of transaction data

Table 6: The set of equivalence classes in all documents
Item                                                    EC
identifier(A001)                                        1
gender(male)                                            1, 2, 3
food                                                    1, 2, 3, 4
food(snack)                                             1, 4
food(drink)                                             1, 2, 3, 4
food(snack(puff))                                       1
food(drink(beer))                                       1, 3
supply                                                  1, 3, 5, 6
supply(other)                                           1, 3
supply(other(diaper))                                   1, 3
supply(other(feeding bottle))                           1
identifier(A002)                                        2
food(bread)                                             2, 4
food(bread(strawberry bread))                           2
food(bread(chocolate bread))                            2
food(bread(crunch top sweet roll))                      2
food(drink(milk))                                       2, 4
identifier(A003)                                        3
food(drink(root beer))                                  3
supply(cleanser)                                        3
supply(cleanser(dish-washing liquid))                   3
identifier(A004)                                        4
gender(female)                                          4, 5, 6
food(snack(peak cracker))                               4
food(bread(toast))                                      4
food(bread(croissant))                                  4
identifier(A005)                                        5
supply(bath)                                            5, 6
supply(bath(shampoo))                                   5, 6
supply(bath(conditioner))                               5, 6
electrical appliance                                    5
electrical appliance(air conditioning)                  5
electrical appliance(air conditioning(electric fan))    5
identifier(A006)                                        6
supply(bath(bath foam))                                 6
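As a small worked example of the support filtering in Step 3 and the confidence check in Phase 2, the sketch below uses a few rows copied from Table 6. It assumes that items whose support equals the minimum support are kept, and it computes the support of a two-item set by intersecting equivalence classes. That intersection is a standard vertical-format identity used here only for illustration; the paper itself mines patterns from the FP-split tree. The helper names are hypothetical.

```python
# A few rows copied from Table 6: item -> documents that contain it.
EC = {
    "identifier(A001)": {1},
    "gender(male)": {1, 2, 3},
    "gender(female)": {4, 5, 6},
    "food": {1, 2, 3, 4},
    "food(snack)": {1, 4},
    "food(drink)": {1, 2, 3, 4},
    "supply": {1, 3, 5, 6},
    "supply(bath)": {5, 6},
    "supply(bath(bath foam))": {6},
}

def frequent_items(ec, min_support):
    """Keep items whose support (class size) reaches min_support (assumed >=)."""
    return {item: docs for item, docs in ec.items() if len(docs) >= min_support}

def confidence(ec, antecedent, consequent):
    """conf(A -> B) = |EC_A ∩ EC_B| / |EC_A| (illustrative, not FP-split itself)."""
    return len(ec[antecedent] & ec[consequent]) / len(ec[antecedent])

frequent = frequent_items(EC, min_support=2)
print(sorted(frequent))
print(confidence(EC, "supply(bath)", "gender(female)"))
print(confidence(EC, "gender(female)", "supply(bath)"))
```

With a minimum support of 2, `identifier(A001)` and `supply(bath(bath foam))` are dropped. A rule is then reported only if its confidence reaches the user-defined threshold.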
5 EXPERIMENTAL RESULTS

T    … document
D    XML document numbers
N    Total number of items in DB
I    Average maximal frequent itemset size
Note: k denotes 1000

Figure 7: Mining time with various minimum supports (run time in seconds versus minimum support in %)

5.2 Experimental analysis of FP-split Algorithm and FP-tree Algorithm
The FP-tree Algorithm can only be applied to conventional transaction databases and is not feasible for native XML databases. The FP-split Algorithm can be applied to both types of database. In this section, the efficiency of the FP-split Algorithm and the FP-tree Algorithm is compared on a conventional transaction database.
The synthetic data generator Assoc.gen is used to create several groups of transaction data. Various parameters are designed as well to prove that …
Figure 10: The run time with various transaction numbers
Figure 12: Utility rate of memory
(Plots compare FP-tree and FP-split; axis labels: run time (sec.), memory (MB), transaction numbers (k).)

5.2.4 Comparative analysis with various item numbers
In Figure 11, the data parameter setting is T20.I10.D100k, and the minimum support is 1%. Item numbers are varied to evaluate the efficiency of tree construction. Figure 11 shows that, regardless of the number of items, the FP-split Algorithm is more efficient than the FP-tree Algorithm in constructing the tree.

5.2.6 Comparisons of FP-split Algorithm and FP-tree Algorithm
The FP-split Algorithm is superior to the FP-tree Algorithm in the execution efficiency of tree construction for three reasons. Their differences are summarized in Table 8.
1. The FP-split Algorithm scans the database only once, while the FP-tree Algorithm has to scan it twice.
2. With the FP-split Algorithm, only candidate items with a length of 1 need to be sorted and filtered. With the FP-tree Algorithm, not only candidate items with a length of 1 but all items in every transaction log need to be sorted and filtered.
Table 8: Comparison of FP-split Algorithm with FP-tree Algorithm

Figure 13: Different numbers of documents and … (run time in seconds versus number of documents and transactions (k))
Figure 16: The individual time span (run time in seconds versus number of documents and transactions (k))

6 CONCLUSIONS

In this paper, we proposed a fast algorithm, the Frequent Pattern Split method, for extracting complete information from native XML databases. By parsing the DTD or XML Schema, the FP-split method helps users mine association rules easily and efficiently from large numbers of XML documents of the same structure, without needing to understand either the structure of the XML documents or their corresponding syntax. In addition, the proposed method can mine multi-level association rules between character data and tags.
Finally, experiments with various parameters show that the proposed method is a fast association rule mining approach. The FP-split method can not only be applied to native XML databases but can also be applied efficiently to transaction databases for mining association rules.

7 REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Large Databases,” Proceedings of the ACM SIGMOD Conference on Management of Data, pp. 207-216 (1993).
[2] D. Braga, A. Campi, M. Klemettinen, and P. Lanzi, “Mining Association Rules from XML Data,” Proceedings of the 4th International Conference on Data Warehousing and Knowledge Discovery (2002).
[3] S. Chan, T. Dillon, and A. Siu, “Applying a Mediator Architecture Employing XML to Retailing Inventory Control,” The Journal of Systems and Software, Vol. 60, pp. 239-248 (2002).
[4] J. Fong, H. K. Wong, and Z. Cheng, “Converting Relational Database into XML Documents with DOM,” Information and Software Technology, Vol. 45, pp. 335-355 (2003).
[5] J. Han and M. Kamber, “Data Mining: Concepts and Techniques,” CA: Morgan Kaufmann Publishers (2001).
[6] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach,” Data Mining and Knowledge Discovery, Vol. 8, pp. 53-87 (2004).
[8] … http://www.almaden.ibm.com/software/quest/Resources/datasets/syndata.html (2005).
[9] C. F. Lee, S. W. Changchien, and W. T. Wang, “Association Rules Mining for Native XML Database,” Department of Information Management, Chaoyang University of Technology, CYUT-IM-TR-2003-011 (2003).
[10] J. W. Lee, K. Lee, and W. Kim, “Preparations for Semantics-Based XML Mining,” Proceedings of the IEEE International Conference on Data Mining, pp. 345-352 (2001).
[11] C. H. Moh, E. P. Lim, and W. K. Ng, “DTD-Miner: A Tool for Mining DTD from XML Documents,” Proceedings of the 2nd International Workshop on Advanced Issues of E-Commerce and Web-based Information Systems, pp. 144-151 (2000).
[12] J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, and B. Reinwald, “Efficiently Publishing Relational Data as XML Documents,” Proceedings of the 26th International Conference on Very Large Databases, pp. 65-76 (2000).
[13] M. Ströbel, “An XML Schema Representation for the Communication Design of Electronic Negotiations,” Computer Networks, Vol. 39, pp. 661-680 (2002).
[14] D. Suciu, “On Database Theory and XML,” ACM SIGMOD Special Section on Advanced XML Data Processing, Vol. 30, Issue 3, pp. 39-45 (2001).
[15] J. W. Wan and G. Dobbie, “Extracting Association Rules from XML Documents Using XQuery,” Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pp. 94-97 (2003).
[16] J. W. Wan and G. Dobbie, “Mining Association Rules from XML Data Using XQuery,” Proceedings of the Second Workshop on Australasian Information Security, Data Mining and Web Intelligence, and Software Internationalisation, Vol. 32, pp. 169-174 (2004).
[17] K. Wang, L. Tang, J. Han, and J. Liu, “Top Down FP-Growth for Association Rule Mining,” Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pp. 334-340 (2002).
[18] Q. Wei and G. Chen, “Mining Generalized Association Rules with Fuzzy Taxonomic Structures,” Proceedings of the 4th America Fuzzy Information Processing Society, pp. 477-481 (1999).
[19] L. H. Yang, M. L. Lee, W. Hsu, and S. Acharya, “Mining Frequent Query Patterns from XML Queries,” Proceedings of the 8th International Conference on Database Systems for Advanced Applications, pp. 355-362 (2003).