2.1. Pre-processing
Figure 4. Seven base clusters of the suffix tree

• Building a list of base clusters from internal nodes that contain at least one snippet at their leaf nodes. Base clusters whose common phrases contain only query words are then eliminated. The base clusters obtained from Figure 4 are listed in Table 1. In Table 1 the query is “Jefferson Clinton”; therefore, the clusters labeled “Jefferson Clinton”, “Jefferson” and “Clinton” are filtered out.

Table 1: A list of base clusters

2.3. A New Base Cluster Combining Algorithm with a New Partial Phrase Join Operation

Since snippets may share more than one phrase with other snippets, each snippet might appear in a number of base clusters. Therefore, base clusters obtained from a suffix tree can overlap in their snippets. Overlap of snippets indicates that snippets often have several topics [7][11]. However, high overlap of irrelevant snippets across multiple clusters can hurt cluster quality. In addition, STC with n-grams may give too many base clusters, and it cannot discover a true common phrase when the length of the n-gram is shorter than the length of the true common phrase.

$$
A \oplus B =
\left\{
\begin{array}{l}
a_0 \,\oplus \\
a_1 = b_0 \\
a_2 = b_1 \\
\quad\vdots \\
a_n = b_{n-1} \\
\oplus\, b_n
\end{array}
\right\}
\quad \text{if } \bigl(A(d) \subseteq B(d) \ \text{or} \ B(d) \subseteq A(d)\bigr)
\tag{1}
$$

where $A$ and $B$ are base clusters, $A(d)$ is the set of snippets in cluster $A$, $B(d)$ is the set of snippets in cluster $B$, $\{a_0, a_1, \ldots, a_n\}$ is the set of terms that appear in the label of cluster $A$, and $\{b_0, b_1, \ldots, b_n\}$ is the set of terms that appear in the label of cluster $B$.

The base cluster combining technique uses the new join operator to define a new common phrase for the new cluster when merging a pair of similar base clusters. As an example, given that
A = {president, William, Jefferson} (0,1)(0,2) and
B = {William, Jefferson, Clinton} (1,0)(1,1)(1,2)(0,3)
a new common phrase is defined as
A⊕B = {president, William, Jefferson, Clinton} (1,1)(1,2)
After combining a pair of similar base clusters, we remove the snippets that belong to the new cluster from the old base cluster. Therefore, the result of combining clusters A and B is
B = {William, Jefferson, Clinton} (1,0)(0,3)
A⊕B = {president, William, Jefferson, Clinton} (1,1)(1,2)
For the base clusters in Table 1, the results of combining pairs of similar base clusters are shown in Table 2.

Table 2: Results of Content Based Combining Base
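The join in Equation (1) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `BaseCluster` and `partial_phrase_join` are hypothetical, snippets are reduced to plain integer identifiers (assuming the second number of each pair in the example identifies the snippet, since this section does not define the first number), and the requirement of at least two label terms is an added assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BaseCluster:
    """Hypothetical container for a base cluster (not from the paper)."""
    label: tuple[str, ...]      # terms of the common phrase, in order
    snippets: frozenset[int]    # snippet ids (assumed: second number of each pair)


def partial_phrase_join(a: BaseCluster, b: BaseCluster):
    """Return (merged, a_rest, b_rest), or None when Eq. (1) does not apply."""
    # Assumption: labels need at least two terms so that an overlap exists.
    if len(a.label) < 2 or len(a.label) != len(b.label):
        return None
    # Label condition of Eq. (1): a_1..a_n must equal b_0..b_{n-1}.
    if a.label[1:] != b.label[:-1]:
        return None
    # Snippet condition of Eq. (1): one snippet set contains the other.
    if not (a.snippets <= b.snippets or b.snippets <= a.snippets):
        return None
    shared = a.snippets & b.snippets
    # New common phrase: a_0 followed by b_0..b_n, i.e. a's label plus b_n.
    merged = BaseCluster(a.label + b.label[-1:], shared)
    # Snippets that moved to the new cluster are removed from the old ones.
    a_rest = BaseCluster(a.label, a.snippets - shared)
    b_rest = BaseCluster(b.label, b.snippets - shared)
    return merged, a_rest, b_rest


# The example from the text, with snippets reduced to their assumed ids.
a = BaseCluster(("president", "William", "Jefferson"), frozenset({1, 2}))
b = BaseCluster(("William", "Jefferson", "Clinton"), frozenset({0, 1, 2, 3}))
merged, a_rest, b_rest = partial_phrase_join(a, b)
print(merged.label)             # ('president', 'William', 'Jefferson', 'Clinton')
print(sorted(merged.snippets))  # [1, 2]
print(sorted(b_rest.snippets))  # [0, 3]
```

Under these assumptions cluster A ends up with no snippets, which agrees with the example listing only B and A⊕B as the result. The join is order-sensitive: calling `partial_phrase_join(b, a)` returns `None` because the labels do not overlap in that direction.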
$$
f(\text{query}) =
\begin{cases}
100, & \text{if a query word appears in the phrase} \\
1, & \text{if no query word appears in the phrase}
\end{cases}
$$

Performance of precision or relevant web documents cluster for queries. As shown in Figure 5.
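The weighting function f(query) can be written as a short Python sketch. The function name `query_weight` is hypothetical, and two points are assumptions not stated in this section: terms are compared case-insensitively, and a single matching query word is enough for the higher weight.

```python
def query_weight(phrase_terms, query_terms):
    """Return 100 if any query word appears in the phrase, otherwise 1."""
    # Assumption: matching ignores letter case.
    query = {term.lower() for term in query_terms}
    # Assumption: one matching query word is sufficient.
    if any(term.lower() in query for term in phrase_terms):
        return 100
    return 1


print(query_weight(["data", "mining", "tools"], ["data", "mining"]))  # 100
print(query_weight(["neural", "networks"], ["data", "mining"]))       # 1
```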
Table 5: Example cluster labels for the query “data mining”