
A New Web Search Result Clustering based on True Common Phrase Label Discovery

Jongkol Janruang* and Worapoj Kreesuradej**


Faculty of Information Technology
King Mongkut’s Institute of Technology Ladkrabang
Bangkok, 15320 Thailand
Email: tawan48@gmail.com* and worapoj@it.kmitl.ac.th**

Abstract

Web search results clustering is a navigation aid that helps users explore search results. Correct cluster labels are therefore important, since they index the set of web documents in each cluster. Suffix Tree Clustering (STC) clusters and labels search results quickly and automatically. However, STC is inadequate because it generates interrupted cluster labels when the n-gram technique is used. In this paper, we propose an approach for web search results clustering and labeling based on a new suffix tree data structure and a new base cluster combining algorithm with a new partial phrase join operation. The algorithm for constructing the data structure is incremental and runs in linear time. Thus, the proposed approach is suitable for on-the-fly clustering and labeling of web search results. The proposed approach provides cluster labels that are more readable and are true common phrases of the web documents in a cluster, compared with conventional web search result clustering. Experimental results also show that the proposed approach performs better than conventional web search result clustering.

Keywords: web search results clustering, incremental clustering, content-based combining, a new suffix tree.

1. Introduction

Several approaches follow the description-comes-first concept, such as web search results clustering using the Lingo algorithm [1, 2, 3, 4], SHOC [5] and FIHC [6]. These are not incremental clustering algorithms.

Unlike the other algorithms, the suffix tree clustering (STC) algorithm, which is an algorithm for clustering search results, is incremental. Therefore, web search result clustering based on this algorithm is a promising approach for handling a long list of snippets returned by search engines. However, the original STC algorithm can often construct long paths in the suffix tree, particularly when the same snippets are fed to the STC algorithm [7, 8, 9, 10, 11, 12]. Hua-Jun Zeng et al. [12] introduced an improved suffix tree with n-grams to deal with this problem of the original suffix tree. However, the suffix tree with n-grams can discover only partial common phrases when the length of the n-gram is shorter than the length of the true common phrase. As an example, given that a true common phrase is "President William Jefferson Clinton", a suffix tree with 2-grams can discover only the partial common phrases "President William", "William Jefferson" and "Jefferson Clinton." In this case, STC with n-grams gives too many base clusters. In addition, a cluster label obtained from STC with n-grams cannot be a true common phrase when the length of the n-gram is shorter than the length of the true common phrase.

Here, this paper proposes a new approach to web search result clustering that deals with these problems. The new approach still uses a suffix tree with n-grams. However, it also introduces a new base cluster combining technique with a new partial phrase join operation to find a true common phrase. The new approach provides higher precision and cluster labels that are true common phrases of the web documents in a cluster, compared with approaches based on the previous STC algorithms.
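To make the n-gram limitation above concrete, the following minimal Python sketch (an illustration only, not part of the proposed algorithm) generates word-level n-grams; with n = 2 the four-word phrase is split into exactly the three partial phrases listed above.

def word_ngrams(phrase, n):
    """Return all word-level n-grams of a phrase as tuples of words."""
    words = phrase.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

phrase = "President William Jefferson Clinton"
print(word_ngrams(phrase, 2))
# [('President', 'William'), ('William', 'Jefferson'), ('Jefferson', 'Clinton')]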

2. A New Web Search Result Clustering based on True Common Phrase Label Discovery

Clustering web search results means grouping each snippet returned by the search engine with others that share common content, and generating new cluster labels from the interrupted cluster labels produced by the n-gram technique. The new approach consists of four phases; its algorithm is given in Figure 1.

Basically, the new approach has four phases: (1) pre-processing, in which documents are "cleaned" using a stemming algorithm and non-word tokens are stripped; (2) base cluster identification, using a suffix tree data structure with n-grams. This data structure can be constructed incrementally, in time linear in the size of the collection. Each node of the suffix tree represents a group of snippets, and the label of the node represents their common phrase; (3) combining similar base clusters, using a binary similarity measure between base clusters; and (4) ranking base clusters according to their interesting scores. The phases of the new approach are discussed in the following sections.

Figure 1. The pseudo-code of web search results clustering.

2.1. Pre-processing

Pre-processing selects the most suitable terms, i.e., those that best describe the content. The terms are transformed using a stemming algorithm. Non-word tokens, such as articles, pronouns and prepositions, are eliminated [13].
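The following is a minimal sketch of this pre-processing step in Python. The stop-word list and the use of NLTK's PorterStemmer are assumptions made for illustration; the paper itself only specifies Porter's stemming algorithm [13] and the removal of non-word tokens.

# A minimal pre-processing sketch; the stop-word list below and the choice of
# NLTK's PorterStemmer are illustrative assumptions, not prescribed by the paper.
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "he", "she", "it", "they",
              "of", "in", "on", "to", "and"}

def preprocess(snippet):
    """Lowercase, drop non-word tokens and stop words, and stem the rest."""
    stemmer = PorterStemmer()
    tokens = re.findall(r"[a-z]+", snippet.lower())   # keep alphabetic word tokens only
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The President William Jefferson Clinton visited Thailand."))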

2.2. Base Cluster Identification

Base cluster identification identifies base clusters using the suffix tree with the n-gram technique. The major steps are as follows.

• Building the suffix tree with n-grams. As an example, a suffix tree with n ≤ 3 grams is shown in Figure 3, given that the set of 5 snippets shown in Figure 2 is used for building the suffix tree.

Figure 2. A set of 5 snippets.

Figure 3. A suffix tree with n-grams (where n ≤ 3).

• As shown in Figure 4, the suffix tree from Figure 3 is then updated by compacting internal nodes that do not contain snippets and have only one link. The label of a node is defined to be the concatenation of the edge labels on the path from the root to that node.

Figure 4. Seven base clusters of the suffix tree.

• Building a list of base clusters from the internal nodes that contain at least one snippet at their leaf nodes. Then, base clusters whose common phrases contain only query words are eliminated. According to Figure 4, the resulting list of base clusters is shown in Table 1. In Table 1, the query is "Jefferson Clinton"; therefore, the clusters labeled "Jefferson Clinton", "Jefferson" and "Clinton" are filtered out.

Table 1: A list of base clusters.

2.3. A New Base Cluster Combining Algorithm with a New Partial Phrase Join Operation

Since snippets may share more than one phrase with other snippets, each snippet might appear in a number of base clusters. Therefore, the base clusters obtained from a suffix tree can have overlapping snippets. Overlap of snippets indicates that snippets often cover several topics [7][11]. However, a high overlap of irrelevant snippets across multiple clusters can hurt cluster quality. In addition, STC with n-grams may give too many base clusters, and it cannot discover a true common phrase when the length of the n-gram is shorter than the length of the true common phrase.

To reduce the number of base clusters and to find a true common phrase, a new base cluster combining technique with a new partial phrase join operation is proposed. The phrase join operator is shown in Eq. 1.

A \oplus B = \left\{ \begin{array}{l} a_0 \,\oplus \\ a_1 = b_0 \\ a_2 = b_1 \\ \;\;\vdots \\ a_n = b_{n-1} \\ \oplus\, b_n \end{array} \right\} \;\text{if}\; \bigl( A(d) \subseteq B(d) \;\text{or}\; B(d) \subseteq A(d) \bigr) \qquad (1)

where A and B are base clusters, A(d) is the set of snippets in cluster A, B(d) is the set of snippets in cluster B, \{a_0, a_1, \ldots, a_n\} is the set of terms that appear in the label of cluster A, and \{b_0, b_1, \ldots, b_n\} is the set of terms that appear in the label of cluster B.

The base cluster combining technique uses the new join operator to define a new common phrase for the new cluster created when merging a pair of similar base clusters. As an example, given that

A = {president, William, Jefferson} (0,1)(0,2) and
B = {William, Jefferson, Clinton} (1,0)(1,1)(1,2)(0,3),

the new common phrase is defined as

A ⊕ B = {president, William, Jefferson, Clinton} (1,1)(1,2).

After combining a pair of similar base clusters, we remove the snippets that belong to the new cluster from the old base clusters. Therefore, the result of combining cluster A and cluster B is

B = {William, Jefferson, Clinton} (1,0)(0,3)
A ⊕ B = {president, William, Jefferson, Clinton} (1,1)(1,2).

According to Table 1, the results of combining pairs of similar base clusters are shown in Table 2.

Table 2: Results of content-based combining of base clusters.
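The sketch below is a minimal Python rendering of the partial phrase join of Eq. 1 under simplifying assumptions: a base cluster is taken to be a label (list of terms) plus a set of snippet ids, whereas the paper also tracks term positions; the names BaseCluster and phrase_join are illustrative and not from the paper.

from dataclasses import dataclass, field

@dataclass
class BaseCluster:
    terms: list                                   # label terms, e.g. ["president", "william", "jefferson"]
    snippets: set = field(default_factory=set)    # ids of the snippets grouped under this label

def phrase_join(a, b):
    """Partial phrase join A ⊕ B (Eq. 1); returns None when A and B cannot be joined.

    The join applies when one cluster's snippet set is contained in the other's and
    the tail of A's label (a_1 .. a_n) equals the head of B's label (b_0 .. b_{n-1});
    the joined label is then a_0 .. a_n followed by b_n.
    """
    if not (a.snippets <= b.snippets or b.snippets <= a.snippets):
        return None
    if a.terms[1:] != b.terms[:-1]:
        return None
    joined = BaseCluster(terms=a.terms + b.terms[-1:],
                         snippets=a.snippets & b.snippets)
    # Snippets absorbed by the new cluster are removed from the old base clusters,
    # as described after the worked example in Section 2.3.
    a.snippets -= joined.snippets
    b.snippets -= joined.snippets
    return joined

# Joining the labels from the worked example (snippet bookkeeping simplified to ids 0 and 1).
A = BaseCluster(["president", "william", "jefferson"], {0})
B = BaseCluster(["william", "jefferson", "clinton"], {0, 1})
AB = phrase_join(A, B)
print(AB.terms)                   # ['president', 'william', 'jefferson', 'clinton']
print(AB.snippets, B.snippets)    # {0} {1}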

2.4. Ranking Clusters

Ranking clusters means reordering the clusters according to their interesting scores. In this paper, the equation for computing an interesting score, shown in Eq. 2, is modified from Hua-Jun Zeng et al. [12].

S(m) = |d| \cdot f(|m_p|) \cdot f(query) \cdot \sum_i tfidf(p_i, d) \qquad (2)

f(|m_p|) = \begin{cases} 0, & \text{if } |m_p| = 1 \\ |m_p|, & \text{if } 2 \le |m_p| \le 8 \\ \alpha, & \text{if } |m_p| > 8 \end{cases}

f(query) = \begin{cases} 100, & \text{if a query word appears in the phrase} \\ 1, & \text{if no query word appears in the phrase} \end{cases}

where |d| is the number of snippets in cluster m, |m_p| is the number of words in the phrase of m, f(query) is the query word adjustment, and tfidf(p_i, d) is a score calculated from Eq. 3. tfidf(p_i, d) is an inverse phrase frequency and is defined as

tfidf(p_i, d) = (1 + \log(tf(p_i, d))) \cdot \log(1 + N / df(p_i)) \qquad (3)

where df(p_i) is the number of snippets in which phrase p_i appears, tf(p_i, d) is the number of occurrences of phrase p_i in snippet d, and N is the total number of snippets in our snippet set. According to Table 2, the interesting score of each cluster is shown in Table 3.

Table 3: Results of the ranking.
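As a rough sanity check on Eqs. 2 and 3, the Python sketch below computes the interesting score for a single hypothetical cluster; the (tf, df) statistics, the value of α, and the helper names are illustrative assumptions, not values or code from the paper.

import math

ALPHA = 8   # assumed value for the α penalty in Eq. 2; the paper does not specify it

def f_phrase_length(num_words):
    """Phrase-length adjustment f(|m_p|) from Eq. 2."""
    if num_words == 1:
        return 0
    if 2 <= num_words <= 8:
        return num_words
    return ALPHA

def f_query(phrase_words, query_words):
    """Query-word adjustment f(query) from Eq. 2."""
    return 100 if any(q in phrase_words for q in query_words) else 1

def tfidf(tf, df, n_total):
    """Inverse phrase frequency tfidf(p_i, d) from Eq. 3."""
    return (1 + math.log(tf)) * math.log(1 + n_total / df)

def interesting_score(num_snippets, phrase_words, query_words, phrase_stats, n_total):
    """S(m) from Eq. 2; phrase_stats maps each phrase p_i of the cluster to its (tf, df)."""
    total = sum(tfidf(tf, df, n_total) for tf, df in phrase_stats.values())
    return num_snippets * f_phrase_length(len(phrase_words)) * f_query(phrase_words, query_words) * total

# Hypothetical cluster labeled "president william jefferson clinton" with 12 snippets.
stats = {"president william jefferson clinton": (3, 5)}   # made-up (tf, df) numbers
print(interesting_score(12, ["president", "william", "jefferson", "clinton"],
                        ["jefferson", "clinton"], stats, n_total=100))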

3. Experimental Results

Due to the lack of a standard dataset for testing web search results clustering, we had to build a small test dataset. For this purpose, we defined a set of queries for which search results were collected from Dmoz.com [14]. The results of the proposed approach are shown in Table 4.

Table 4: Queries used to generate the test collection.

query           Num. of clusters   Num. of snippets
computer        7                  7619
music           7                  7073
search          10                 7970
"data mining"   6                  320
apple           7                  1765
disney          8                  1318
Thailand        6                  583
jaguar          5                  637
matrix          7                  1365
total                              28,650

The precision, i.e., the proportion of relevant web documents per cluster, for each query is shown in Figure 5.

Figure 5. The average precision.

The percentage of true common phrases equals 100; we evaluate this by comparing the cluster labels with the whole snippets. Our cluster labels are true common phrases and are more readable than those of the conventional STC with n-gram technique (STC + n-gram). Tables 4 and 5 give example cluster labels for the query words "Thailand" and "data mining".

Table 4: Example cluster labels for the query word "Thailand".

Table 5: Example cluster labels for the query word "data mining".

4. Conclusion

This paper proposes a new approach to web search result clustering that improves on the performance of approaches that use the previous STC algorithms. According to the preliminary experimental results, the new approach provides smaller clusters and more readable cluster labels than the approaches using the previous STC algorithms.

5. References
[1] Jerzy Stefanowski and Stanislaw Osinski, "An Algorithm for Clustering of Web Search Results", Master's thesis, Poznan University of Technology, Poland, 2003.

[2] Stanislaw Osinski and Dawid Weiss, "A Concept-Driven Algorithm for Clustering Search Results", IEEE Intelligent Systems, 2005.

[3] Stanislaw Osinski and Dawid Weiss, "Conceptual Clustering Using Lingo Algorithm: Evaluation on Open Directory Project Data", Institute of Computing Science, Poznan University of Technology, 2004.

[4] Stanislaw Osinski, Jerzy Stefanowski and Dawid Weiss, "Lingo: Search Results Clustering Algorithm Based on Singular Value Decomposition", Institute of Computing Science, Poznan University of Technology, 2003.

[5] Dell Zhang and Yisheng Dong, "Semantic, Hierarchical, Online Clustering of Web Search Results", Proceedings of the 6th Asia Pacific Web Conference (APWEB), Hangzhou, China, April 2004.

[6] B. Fung, K. Wang, and M. Ester, "Large Hierarchical Document Clustering using Frequent Itemsets", Simon Fraser University, BC, Canada, May 2003.

[7] Oren E. Zamir, "Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results", Doctoral Dissertation, University of Washington, 1999.

[8] Oren Zamir and Oren Etzioni, "Grouper: A Dynamic Clustering Interface to Web Search Results", WWW8/Computer Networks, Amsterdam, Netherlands, 1999.

[9] Dawid Weiss and Jerzy Stefanowski, "A Clustering Interface for Web Search Results in Polish and English", Master's thesis, Poznan University of Technology, Poland, 2001.

[10] Jerzy Stefanowski and Dawid Weiss, "Carrot2 and Language Properties in Web Search Results Clustering", Institute of Computing Science, Poznan University of Technology, 2002.

[11] Oren Zamir and Oren Etzioni, "Document Clustering: A Feasibility Demonstration", International ACM SIGIR, 1998, pp. 46-54.

[12] Hua-Jun Zeng et al., "Learning to Cluster Web Search Results", SIGIR'04, Peking University, 2004.

[13] M. F. Porter, "An algorithm for suffix stripping", Program, 14(3), pp. 130-137, 1980.

[14] Open Directory Project. http://dmoz.org.
