ISBN: 978-15-08772460-24

Date: 8.3.2015
PROCEEDINGS OF INTERNATIONAL CONFERENCE ON RECENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY

EXPOSURE OF DOCUMENT SIMILARITY USING SEQUENTIAL CLUSTERING ALGORITHM
Karthika.M*, Mrs.P.Sumathi**
*Department of Computer Science and Engineering,
K.S.R Institute for Engineering and Technology, Namakkal, Tamilnadu,
Email: karthikamarppan@gmail.com
**Department of Computer Science and Engineering,
K.S.R Institute for Engineering and Technology, Namakkal, Tamilnadu,
ABSTRACT
Document similarity measures are crucial components of many text analysis tasks, including information retrieval, document categorization, and document clustering. Extracting features from documents is a preliminary task in mining. On the basis of the extracted features, the similarity between two documents is calculated. A new similarity measure is proposed to calculate the similarity between two documents with respect to a feature. The proposed measure takes the following three cases into account: the feature appears in both documents, the feature appears in only one document, and the feature appears in none of the documents. The similarity increases as the difference between the two values associated with a present feature decreases. Furthermore, the contribution of the difference is usually scaled. The similarity decreases when the number of presence-absence features increases. An absent feature has no contribution to the similarity. The proposed measure is extended to estimate the similarity between two sets of documents.
Keywords: Document classification, document clustering, similarity measures, clustering algorithms
I. INTRODUCTION

Data mining is the process of extracting implicit, previously unknown, and potentially useful information from data. Document clustering, a subset of data clustering, organizes documents into different groups, called clusters, where the documents in each cluster share some common properties according to a defined similarity measure. Document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize information. Due to the unstable growth of information accessed from the web, efficient access to and analysis of information are gravely desired. Document processing plays an important role in information retrieval, data mining, and web search. Text mining attempts to discover novel, previously unknown information by applying techniques from data mining. Clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm in which clustering methods try to identify natural groupings of text documents, so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity.

Generally, document clustering methods attempt to segregate the documents into groups, where each group represents some topic that is different from the topics represented by the other groups.

II. EXISTING SYSTEM

A number of measures have been proposed for computing the similarity between two vectors. Similarity measures have been widely used in text classification and clustering algorithms.

EUCLIDEAN SIMILARITY

The Euclidean distance measure is defined as the root of the squared differences between the respective coordinates of d1 and d2:

dist(d1, d2) = sqrt( Σi (d1i − d2i)² )

Cosine similarity measures the cosine of the angle between d1 and d2 as follows:

cos(d1, d2) = (d1 · d2) / (|d1| |d2|)

where A · B denotes the inner product of the two vectors A and B. In this module, the cosine similarity between two documents is computed.

K-MEANS CLUSTERING

K-means is the most significant flat clustering algorithm. The objective function of K-means is to minimize the average squared distance of objects from their cluster centers, where a cluster center is given by the mean, or centroid, of the objects in a cluster. The ideal cluster in K-means is a sphere with the centroid as its center of gravity. Ideally, the clusters should not overlap. A measure of how well the centroids represent the members of their clusters is the Residual Sum of Squares (RSS), the squared distance of each vector from its centroid summed over all vectors. K-means can start by selecting K randomly chosen objects as the initial cluster centers. It then moves the cluster centers around in space in order to minimize RSS. This is done iteratively by repeating two steps until a stopping criterion is met: one is reassigning objects to the cluster with the nearest centroid, and the other is recomputing each centroid based on the current members of its cluster.

PAGE COUNT

The search pattern is entered, in which the first word and the last word are taken. In the web pages, the phrase is checked such that the pattern is the first word, any number of words, and the last word. During the pattern extraction, up to the skip-count number of words may appear in excess in the phrase found in the web pages. The search pattern is found in the web pages, and the page names are added to a list.

III. PROPOSED SIMILARITY MEASURE

Let a document d with m features w1, w2, ..., wm be represented as an m-dimensional vector, i.e., d = <d1, d2, ..., dm>. If wi, 1 ≤ i ≤ m, is absent in the document, then di = 0; otherwise, di > 0. The following properties are preferable for a similarity measure between two documents:

1) The presence or absence of a feature is more significant than the difference between the two values associated with a present feature. Consider two features wi and wj and two documents d1 and d2. Suppose wi does not appear in d1 but it appears in d2. Then wi is considered to have no relationship with

d1 while it has some relationship with d2. In this case, d1 and d2 are unrelated in terms of wi. If wj appears in both d1 and d2, then wj has some relationship with d1 and d2 simultaneously. In this case, d1 and d2 are similar to some degree in terms of wj. For the above two cases, it is reasonable to say that wi carries more weight than wj in determining the similarity degree between d1 and d2. For example, assume that wi is absent in d1 but appears in d2, and wj appears in both d1 and d2. Then wi is considered to be more important than wj in determining the similarity between d1 and d2, although the differences of the feature values in both cases are the same.

2) The similarity degree should increase when the difference between two non-zero values of a specific feature decreases. For example, the similarity involved with d13 = 2 and d23 = 20 should be smaller than that involved with d13 = 2 and d23 = 3.

3) The similarity degree should decrease when the number of presence-absence features increases. For a presence-absence feature of d1 and d2, d1 and d2 are dissimilar in terms of this feature. Therefore, as the number of presence-absence features increases, the dissimilarity between d1 and d2 increases and thus the similarity decreases. For example, the similarity between the documents <1, 0, 1> and <1, 1, 0> should be smaller than that between the documents <1, 0, 1> and <1, 0, 0>.

4) Two documents are least similar to each other if none of the features have non-zero values in both documents. Let d1 = <d11, d12, ..., d1m> and d2 = <d21, d22, ..., d2m>. If d1i · d2i = 0 and d1i + d2i > 0 for 1 ≤ i ≤ m, then d1 and d2 are least similar to each other. As mentioned earlier, d1 and d2 are dissimilar in terms of a presence-absence feature. Since all of the features are presence-absence features, the dissimilarity reaches its maximum in this case. For example, the two documents <x, 0, y> and <0, z, 0>, with x, y, and z being non-zero numbers, are least similar to each other.

5) The similarity measure is supposed to be symmetric. That is, the similarity degree between d1 and d2 should be the same as that between d2 and d1.

6) The value distribution of a feature is considered; that is, the standard deviation of the feature is taken into account for its contribution to the similarity between two documents.

Similarity between Two Documents

Based on the preferable properties mentioned above, we propose a similarity measure, called SMTP (Similarity Measure for Text Processing), for two documents d1 = <d11, d12, ..., d1m> and d2 = <d21, d22, ..., d2m>. Define a function F as follows:
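The equation images for F and the final SMTP score did not survive extraction. The sketch below is a reconstruction consistent with the three cases described in the next section, not a verbatim copy of the paper's formula: `sigma[j]` plays the role of the standard deviation σj of feature wj's non-zero values, and `lmbda` is an assumed penalty constant for presence-absence features.

```python
import math

def smtp(d1, d2, sigma, lmbda=1.0):
    """Sketch of the SMTP similarity for two feature vectors.

    sigma[j] is the standard deviation of the non-zero values of
    feature j in the training set; lmbda is the penalty constant
    for presence-absence features (both names are assumptions here).
    """
    num = 0.0   # sum of per-feature scores
    den = 0.0   # counts features present in at least one document
    for x, y, s in zip(d1, d2, sigma):
        if x > 0 and y > 0:
            # Case a) feature appears in both documents: lower bound 0.5,
            # decaying with a Gaussian of the scaled value difference.
            num += 0.5 * (1.0 + math.exp(-((x - y) / s) ** 2))
            den += 1.0
        elif x > 0 or y > 0:
            # Case b) feature appears in only one document: a fixed
            # negative contribution, ignoring the value's magnitude.
            num -= lmbda
            den += 1.0
        # Case c) feature absent from both documents: no contribution.
    if den == 0.0:
        return 0.0
    f = num / den
    # Shift and scale F into [0, 1], so least-similar documents score 0.
    return (f + lmbda) / (1.0 + lmbda)
```

With this form, two identical all-present vectors score 1, documents sharing no features score 0 (property 4), and symmetry (property 5) holds because each case treats d1 and d2 interchangeably.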

The proposed measure takes into account the following three cases: a) the feature considered appears in both documents, b) the feature appears in only one document, and c) the feature appears in none of the documents. For the first case, we set a lower bound of 0.5 and diminish the similarity as the difference between the feature values of the two documents increases, scaled by a Gaussian function, where σj is the standard deviation of all non-zero values for feature wj in the training data set. For the second case, we set a negative constant, disregarding the magnitude of the non-zero feature value. For the last case, the feature has no contribution to the similarity.

Similarity between Two Document Sets

This method is extended to measure the similarity between two document sets, which can be considered as an average score of the features occurring in at least one of the two documents. Based on this perspective, the similarity between two document sets is designed to calculate an average score of the features occurring in the two sets. Let G1 and G2 be two document sets containing q1 and q2 documents, respectively, i.e., G1 = {d11, d12, ..., d1q1} and G2 = {d21, d22, ..., d2q2}, where dsj = <dsj1, dsj2, ..., dsjm>, s ∈ {1, 2}, and 1 ≤ j ≤ q1 or 1 ≤ j ≤ q2. The function F between G1 and G2 is defined analogously to the two-document case, and the similarity measure, SSMTP, for G1 and G2 is obtained from it in the same way.

IV. EXPECTED OUTCOME

Thus, a novel similarity measure is applied between two documents. Several desirable properties are embedded in this measure. For example, the similarity measure is symmetric. The presence or absence of a feature is considered more essential than the difference between the values associated with a present feature. The similarity degree increases when the number of presence-absence feature pairs decreases. Two documents are least similar to each other if none of the features have non-zero values in both documents. Besides, it is desirable to consider the value distribution of a feature for its contribution to the similarity between two documents. The proposed scheme has also been extended to measure the similarity between two sets of documents.
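The set-level equations for F and SSMTP were likewise lost in extraction. Purely as an illustration of "an average score" across two document sets, the sketch below averages a document-level similarity over all cross-set pairs; this pairwise form, and the cosine measure used at the document level, are assumptions for the example, not the paper's exact set-level definition.

```python
from itertools import product
from math import sqrt

def cosine(a, b):
    # Document-level similarity used here only for illustration.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def set_similarity(G1, G2, doc_sim=cosine):
    """Average the document-level similarity over all cross-set pairs."""
    pairs = list(product(G1, G2))
    return sum(doc_sim(d1, d2) for d1, d2 in pairs) / len(pairs)
```

For the proposed scheme itself, `doc_sim` would be replaced by the SMTP function of Section III; the averaging skeleton stays the same, and the result remains symmetric in G1 and G2.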


IAETSD 2015: ALL RIGHTS RESERVED

www.iaetsd.in
