ISBN: 978-15-08772460-24

Date: 8.3.2015
PROCEEDINGS OF INTERNATIONAL CONFERENCE ON RECENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY

EXPOSURE OF DOCUMENT SIMILARITY USING SEQUENTIAL CLUSTERING ALGORITHM
Karthika.M*, Mrs.P.Sumathi**
*Department of Computer Science and Engineering,
K.S.R Institute for Engineering and Technology, Namakkal, Tamilnadu,
Email: karthikamarppan@gmail.com
**Department of Computer Science and Engineering,
K.S.R Institute for Engineering and Technology, Namakkal, Tamilnadu,
ABSTRACT
Document similarity measures are crucial components of many text analysis tasks, including information retrieval, document categorization, and document clustering. Extracting features from documents is a preliminary task in mining. On the basis of the extracted features, the similarity between two documents is calculated. A new similarity measure is proposed to calculate the similarity between two documents with respect to a feature. The proposed measure takes the following three cases into account: the feature appears in both documents, the feature appears in only one document, and the feature appears in none of the documents. The similarity increases as the difference between the two values associated with a present feature decreases. Furthermore, the contribution of the difference is usually scaled. The similarity decreases when the number of presence-absence features increases. An absent feature has no contribution to the similarity. The proposed measure is extended to estimate the similarity between two sets of documents.
Keywords: Document classification, document clustering, similarity measures, clustering algorithms
I. INTRODUCTION

Data mining is the process of extracting implicit, previously unknown, and potentially useful information from data. Document clustering, a subset of data clustering, organizes documents into different groups, called clusters, where the documents in each cluster share some common properties according to a defined similarity measure. Document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize information. Due to the unstable growth of information accessed from the web, efficient access to and analysis of information are gravely desired. Document processing plays an important role in information retrieval, data mining, and web search. Text mining attempts to discover novel, previously unknown information by applying techniques from data mining. Clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm in which clustering methods try to identify natural groupings of text documents, so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity.

Generally, document clustering methods attempt to segregate the documents into groups, where each group represents some topic that is different from the topics represented by the other groups.

II. EXISTING SYSTEM

A number of measures have been proposed for computing the similarity between two vectors. Similarity measures have been widely used in text classification and clustering algorithms.

EUCLIDEAN SIMILARITY

The Euclidean distance measure is defined as the root of the squared differences between the respective coordinates of d1 and d2:

dist(d1, d2) = sqrt( Σi (d1i − d2i)² )

Cosine similarity measures the cosine of the angle between d1 and d2 as follows:

cos(d1, d2) = (d1 · d2) / (|d1| |d2|)

where A · B denotes the inner product of the two vectors A and B. In this module, the cosine similarity between two documents is computed.

K-MEANS CLUSTERING

K-means is the most significant flat clustering algorithm. The objective function of K-means is to minimize the average squared distance of objects from their cluster centers, where a cluster center is given by the mean, or centroid, of the objects in a cluster. The ideal cluster in K-means is a sphere with the centroid as its center of gravity. Ideally, the clusters should not overlap. A measure of how well the centroids represent the members of their clusters is the Residual Sum of Squares (RSS), the squared distance of each vector from its centroid summed over all vectors. K-means can start by selecting K randomly chosen objects as the initial cluster centers. It then moves the cluster centers around in space in order to minimize RSS. This is done iteratively by repeating two steps until a stopping criterion is met: one is reassigning objects to the cluster with the nearest centroid, and the other is recomputing each centroid based on the current members of its cluster.

PAGE COUNT

The search pattern is entered, in which the first word and the last word are taken. In the web pages, the phrase is checked such that the pattern is the first word, any number of words, and the last word. During the pattern extraction, up to the skip-count number of words may appear in excess in the phrase found in the web pages. The search pattern is found in the web pages, and the page names are added to a list.

III. PROPOSED SIMILARITY MEASURE

Let a document d with m features w1, w2, ..., wm be represented as an m-dimensional vector, i.e., d = <d1, d2, ..., dm>. If wi, 1 ≤ i ≤ m, is absent in the document, then di = 0; otherwise, di > 0. The following properties are preferable for a similarity measure between two documents:

1) The presence or absence of a feature is more significant than the difference between the two values associated with a present feature. Consider two features wi and wj and two documents d1 and d2. Suppose wi does not appear in d1 but it appears in d2. Then wi is considered to have no relationship with

d1 while it has some relationship with d2. In this case, d1 and d2 are unrelated in terms of wi. If wj appears in both d1 and d2, then wj has some relationship with d1 and d2 simultaneously. In this case, d1 and d2 are similar to some degree in terms of wj. For the above two cases, it is reasonable to say that wi carries more weight than wj in determining the similarity degree between d1 and d2. For example, assume that wi is absent in d1 but appears in d2, and wj appears in both d1 and d2. Then wi is considered to be more important than wj in determining the similarity between d1 and d2, although the differences of the feature values in both cases are the same.

2) The similarity degree should increase when the difference between two non-zero values of a specific feature decreases. For example, the similarity involved with d13 = 2 and d23 = 20 should be smaller than that involved with d13 = 2 and d23 = 3.

3) The similarity degree should decrease when the number of presence-absence features increases. For a presence-absence feature of d1 and d2, d1 and d2 are dissimilar in terms of this feature. Therefore, as the number of presence-absence features increases, the dissimilarity between d1 and d2 increases and thus the similarity decreases. For example, the similarity between the documents <1, 0, 1> and <1, 1, 0> should be smaller than that between the documents <1, 0, 1> and <1, 0, 0>.

4) Two documents are least similar to each other if none of the features have non-zero values in both documents. Let d1 = <d11, d12, ..., d1m> and d2 = <d21, d22, ..., d2m>. If d1i · d2i = 0 and d1i + d2i > 0 for 1 ≤ i ≤ m, then d1 and d2 are least similar to each other. As mentioned earlier, d1 and d2 are dissimilar in terms of a presence-absence feature. Since all of the features are presence-absence features, the dissimilarity reaches its maximum in this case. For example, the two documents <x, 0, y> and <0, z, 0>, with x, y, and z being non-zero numbers, are least similar to each other.

5) The similarity measure is supposed to be symmetric. That is, the similarity degree between d1 and d2 should be the same as that between d2 and d1.

6) The value distribution of a feature is considered; that is, the standard deviation of the feature is taken into account for its contribution to the similarity between two documents.

Similarity between Two Documents

Based on the preferable properties mentioned above, we propose a similarity measure, called SMTP (Similarity Measure for Text Processing), for two documents d1 = <d11, d12, ..., d1m> and d2 = <d21, d22, ..., d2m>. Define a function F as follows:
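The equation images for F and the final SMTP score did not survive extraction. The sketch below is a reconstruction consistent with the three cases described in the next section, not a verbatim copy of the paper's formula: `sigma[j]` plays the role of the standard deviation σj of feature wj's non-zero values, and `lmbda` is an assumed penalty constant for presence-absence features.

```python
import math

def smtp(d1, d2, sigma, lmbda=1.0):
    """Sketch of the SMTP similarity for two feature vectors.

    sigma[j] is the standard deviation of the non-zero values of
    feature j in the training set; lmbda is the penalty constant
    for presence-absence features (both names are assumptions here).
    """
    num = 0.0   # sum of per-feature scores
    den = 0.0   # counts features present in at least one document
    for x, y, s in zip(d1, d2, sigma):
        if x > 0 and y > 0:
            # Case a) feature appears in both documents: lower bound 0.5,
            # decaying with a Gaussian of the scaled value difference.
            num += 0.5 * (1.0 + math.exp(-((x - y) / s) ** 2))
            den += 1.0
        elif x > 0 or y > 0:
            # Case b) feature appears in only one document: a fixed
            # negative contribution, ignoring the value's magnitude.
            num -= lmbda
            den += 1.0
        # Case c) feature absent from both documents: no contribution.
    if den == 0.0:
        return 0.0
    f = num / den
    # Shift and scale F into [0, 1], so least-similar documents score 0.
    return (f + lmbda) / (1.0 + lmbda)
```

With this form, two identical all-present vectors score 1, documents sharing no features score 0 (property 4), and symmetry (property 5) holds because each case treats d1 and d2 interchangeably.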

The proposed measure takes into account the following three cases: a) the feature considered appears in both documents, b) the feature appears in only one document, and c) the feature appears in none of the documents. For the first case, we set a lower bound of 0.5 and diminish the similarity as the difference between the feature values of the two documents increases, scaled by a Gaussian function, where σj is the standard deviation of all non-zero values for feature wj in the training data set. For the second case, we set a negative constant, disregarding the magnitude of the non-zero feature value. For the last case, the feature has no contribution to the similarity.

Similarity between Two Document Sets

This method is extended to measure the similarity between two document sets, which can be considered as an average score of the features occurring in at least one of the two documents. Based on this perspective, the similarity between two document sets is designed to calculate an average score of the features occurring in the two sets. Let G1 and G2 be two document sets containing q1 and q2 documents, respectively, i.e., G1 = {d11, d12, ..., d1q1} and G2 = {d21, d22, ..., d2q2}, where dsj = <dsj1, dsj2, ..., dsjm>, s ∈ {1, 2}, and 1 ≤ j ≤ q1 or 1 ≤ j ≤ q2. The function F between G1 and G2 is defined analogously to the two-document case, and the similarity measure, SSMTP, for G1 and G2 is obtained from it in the same way.

IV. EXPECTED OUTCOME

Thus, a novel similarity measure is applied between two documents. Several desirable properties are embedded in this measure. For example, the similarity measure is symmetric. The presence or absence of a feature is considered more essential than the difference between the values associated with a present feature. The similarity degree increases when the number of presence-absence feature pairs decreases. Two documents are least similar to each other if none of the features have non-zero values in both documents. Besides, it is desirable to consider the value distribution of a feature for its contribution to the similarity between two documents. The proposed scheme has also been extended to measure the similarity between two sets of documents.
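The set-level equations for F and SSMTP were likewise lost in extraction. Purely as an illustration of "an average score" across two document sets, the sketch below averages a document-level similarity over all cross-set pairs; this pairwise form, and the cosine measure used at the document level, are assumptions for the example, not the paper's exact set-level definition.

```python
from itertools import product
from math import sqrt

def cosine(a, b):
    # Document-level similarity used here only for illustration.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def set_similarity(G1, G2, doc_sim=cosine):
    """Average the document-level similarity over all cross-set pairs."""
    pairs = list(product(G1, G2))
    return sum(doc_sim(d1, d2) for d1, d2 in pairs) / len(pairs)
```

For the proposed scheme itself, `doc_sim` would be replaced by the SMTP function of Section III; the averaging skeleton stays the same, and the result remains symmetric in G1 and G2.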


IAETSD 2015: ALL RIGHTS RESERVED

www.iaetsd.in
