Date: 8.3.2015
PROCEEDINGS OF INTERNATIONAL CONFERENCE ON RECENT INNOVATIONS IN ENGINEERING AND TECHNOLOGY
EXPOSURE OF DOCUMENT SIMILARITY USING SEQUENTIAL CLUSTERING ALGORITHM
Karthika.M*, Mrs.P.Sumathi**
*Department of Computer Science and Engineering,
K.S.R Institute for Engineering and Technology, Namakkal, Tamilnadu,
Email: karthikamarppan@gmail.com
**Department of Computer Science and Engineering,
K.S.R Institute for Engineering and Technology, Namakkal, Tamilnadu,
ABSTRACT
Document similarity measures are crucial components of many text analysis tasks, including information retrieval,
document categorization, and document clustering. Extracting features from documents is a preliminary task in text
mining. The similarity between two documents is calculated on the basis of the extracted features. A new similarity
measure is proposed to calculate the similarity between two documents with respect to a feature. The proposed measure
takes the following three cases into account: the feature appears in both documents, the feature
appears in only one document, and the feature appears in neither document. The similarity increases as the
difference between the two values associated with a present feature decreases. Furthermore, the contribution of the
difference is normally scaled. The similarity decreases as the number of presence-absence features increases. An
absent feature makes no contribution to the similarity. The proposed measure is extended to estimate the similarity
between two sets of documents.
Keywords: Document classification, document clustering, similarity measures, clustering algorithms
growth of accessing information from the web,
I.INTRODUCTION
www.iaetsd.in
similarity and low inter-cluster
ISBN: 978-15-08772460-24
II.EXISTING SYSTEM
EUCLIDEAN SIMILARITY
The Euclidean distance measure is defined
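As a minimal sketch, assuming each document is represented as a term-weight vector of equal length, the standard Euclidean distance can be computed as follows. The function name and the sample vectors are illustrative and do not come from the paper.

```python
import math


def euclidean_distance(a, b):
    """Standard Euclidean distance between two equal-length term-weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Square root of the sum of squared component-wise differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical term-weight vectors for two documents.
d1 = [3.0, 0.0, 4.0]
d2 = [0.0, 0.0, 0.0]
print(euclidean_distance(d1, d2))  # 5.0
```

A smaller distance indicates more similar documents, so a distance of zero means the two vectors are identical.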
PAGE COUNT
word and last word are taken. In the web pages, the
phrase is checked against the pattern: first word,
any number of words, and the last word. Throughout
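The pattern of a first word, any number of intervening words, and a last word can be sketched with a regular expression. This is only an illustration under assumptions: the function name `find_spans`, the `max_gap` limit on intervening words, and the case-insensitive matching are not taken from the paper.

```python
import re


def find_spans(text, first, last, max_gap=5):
    """Return spans that start with `first`, end with `last`,
    and contain at most `max_gap` words in between (shortest match first)."""
    pattern = re.compile(
        rf"\b{re.escape(first)}\b(?:\W+\w+){{0,{max_gap}}}?\W+{re.escape(last)}\b",
        re.IGNORECASE,
    )
    return [m.group(0) for m in pattern.finditer(text)]


# Hypothetical snippet of web page text.
snippet = "The apple is a sweet fruit."
print(find_spans(snippet, "apple", "fruit"))
```

Bounding the number of intervening words keeps the match local to one phrase; the bound itself is a tuning choice.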
where A·B denotes the inner product of the two
vectors A and B. Cosine similarity measures the
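A minimal sketch of cosine similarity, assuming A and B are equal-length term-weight vectors: the inner product A·B is divided by the product of the two vector lengths. Returning 0.0 for a zero vector is an assumption made here for convenience, and the function name is illustrative.

```python
import math


def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an empty document is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


# Orthogonal vectors share no terms, so their similarity is zero.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Because the vectors are normalized by their lengths, two documents with the same term proportions score close to 1 regardless of document length.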
K-MEANS CLUSTERING
significant flat
clustering algorithm. The objective function of K-means is to minimize the average squared distance of
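The K-means objective can be sketched with a few Lloyd iterations, assuming documents are numeric vectors and the initial centers are supplied by the caller. All function names and the sample points are illustrative; this is not the authors' implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def assign(points, centers):
    # Assignment step: each point goes to its nearest center.
    return [
        min(range(len(centers)), key=lambda k: squared_distance(p, centers[k]))
        for p in points
    ]


def kmeans(points, centers, iterations=10):
    """Alternate assignment and update steps from the given initial centers."""
    centers = [list(c) for c in centers]
    for _ in range(iterations):
        labels = assign(points, centers)
        for k in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = mean_vector(members)
    return assign(points, centers), centers


def average_squared_distance(points, labels, centers):
    """The quantity K-means tries to make small."""
    total = sum(squared_distance(p, centers[l]) for p, l in zip(points, labels))
    return total / len(points)


points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
labels, centers = kmeans(points, [[0.0, 0.0], [10.0, 10.0]])
print(labels, centers, average_squared_distance(points, labels, centers))
```

The result depends on the initial centers, so in practice several starting configurations are usually tried.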
$= \langle d_{21}, d_{22}, \ldots, d_{2m} \rangle$. If $d_{1i} d_{2i} = 0$, $d_{1i} + d_{2i} > 0$
of wj.
the
d1.
same.
2) The similarity degree should increase
3) The similarity degree should decrease
6) The value distribution of a feature is considered, that is
3.
Based on the preferable properties
SMTP (Similarity Measure for Text
>.
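The three cases named in the abstract (a feature present in both documents, in only one, or in neither) can be sketched as below. The Gaussian scaling by a per-feature standard deviation `sigma`, the penalty `lam` for presence-absence features, and the final normalization to the range [0, 1] are assumptions based on the commonly published form of SMTP, not on the text here. The function name and the value returned for two empty documents are likewise illustrative.

```python
import math


def smtp_similarity(d1, d2, sigma, lam=1.0):
    """Sketch of an SMTP-style similarity between two term-weight vectors.

    sigma holds one positive standard deviation per feature (assumed given).
    lam is the penalty for a feature present in only one document (assumed).
    """
    numerator = 0.0
    considered = 0  # features present in at least one document
    for x, y, s in zip(d1, d2, sigma):
        if x == 0 and y == 0:
            continue  # absent in both: no contribution
        considered += 1
        if x != 0 and y != 0:
            # Present in both: grows as the scaled difference shrinks.
            numerator += 0.5 * (1.0 + math.exp(-(((x - y) / s) ** 2)))
        else:
            # Present in only one: fixed penalty.
            numerator -= lam
    if considered == 0:
        return 0.0  # assumption: two empty documents are not compared
    f = numerator / considered
    return (f + lam) / (1.0 + lam)


sigma = [1.0, 1.0, 1.0]
print(smtp_similarity([1.0, 0.0, 2.0], [1.0, 0.0, 2.0], sigma))  # 1.0
print(smtp_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], sigma))  # 0.0
```

Identical documents reach the maximum score, while documents with no shared features reach the minimum, which matches the least-similar condition stated above.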
IV.EXPECTED OUTCOME
Thus, a novel similarity measure is applied
occurring in at least
G1 and G2 is defined to be