
WEB CONTENT MINING USING K-MEANS ALGORITHM

BY

OJEBISI, OLANREWAJU MICHAEL


090805038

SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCES

FACULTY OF SCIENCE, UNIVERSITY OF LAGOS
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR
THE AWARD OF THE DEGREE OF BACHELOR OF SCIENCE IN
COMPUTER SCIENCE

NOVEMBER, 2014

CERTIFICATION
This project work titled Web Content Mining Using K-Means Algorithm is submitted to the
Department of Computer Sciences, Faculty of Science of the University of Lagos, Nigeria
and is certified to be the original work carried out by OJEBISI OLANREWAJU MICHAEL
with matric number 090805038 under my direct supervision.

_______________________                    __________________________
Dr. O. B. Okunoye                          Date
Project Supervisor

DEDICATION
I dedicate this project work to the Almighty God who gave me strength and good health
during the course of this programme.


ACKNOWLEDGEMENT
First and foremost, I give thanks to Almighty God for granting me the good health, grace and opportunity to undertake this project. I would also like to say a big thank you to my great lecturers, and to my project supervisor, Dr. O. B. Okunoye, for the guidance, inspiration and constructive suggestions that helped me in the preparation of this project.
I also thank my mentors, friends and family at large for the moral support that ensured the successful completion of this project.


TABLE OF CONTENTS

TITLE PAGE ................................................................. i
CERTIFICATION ............................................................. ii
DEDICATION ............................................................... iii
ACKNOWLEDGEMENT ........................................................... iv
TABLE OF CONTENTS .......................................................... v
ABSTRACT ................................................................ viii
CHAPTER ONE ................................................................ 1
1.0 INTRODUCTION ........................................................... 1
1.1 Background of Study .................................................... 1
1.2 Problem Definition ..................................................... 1
1.3 Aims and Objectives .................................................... 2
1.4 Scope of Study ......................................................... 2
CHAPTER TWO ................................................................ 3
LITERATURE REVIEW .......................................................... 3
CHAPTER THREE ............................................................. 11
SYSTEM ANALYSIS AND DESIGN ................................................ 11
3.1 Requirement Specification ............................................. 11
3.1.1 Document Collection ................................................. 11
3.1.2 Document Pre-Processing ............................................. 11
3.1.2.1 Tokenization ...................................................... 11
3.1.2.2 Stop Word Removal ................................................. 11
3.1.2.3 Lemmatization ..................................................... 11
3.1.3 Predefined Number of Clusters ....................................... 11
3.1.4 Randomly Generated Centroid ......................................... 11
3.1.5 Convergence Criterion ............................................... 12
3.2 UML Design ............................................................ 12
3.2.1 Aims of Modeling .................................................... 13
3.2.2 Diagrams in the UML ................................................. 14
3.2.2.1 Use Case Diagram .................................................. 14
3.2.2.2 Activity Diagram .................................................. 15
3.2.2.3 Sequence Diagram .................................................. 16
3.2.2.4 Class Diagram ..................................................... 17
3.3 Methodology ........................................................... 18
CHAPTER FOUR .............................................................. 19
IMPLEMENTATION ............................................................ 19
4.1 Introduction .......................................................... 19
4.2 Programming Language .................................................. 19
4.2.1 Advantages of C# .................................................... 20
4.3 Requirements Analysis ................................................. 20
4.4 Hardware Requirements ................................................. 21
CHAPTER FIVE .............................................................. 22
TESTING AND ANALYSIS ...................................................... 22
5.0 Testing and Analysis of Sample Data ................................... 22
5.1 Document Input ........................................................ 22
CHAPTER SIX ............................................................... 27
CONCLUSION AND LIMITATIONS ................................................ 27
6.1 Conclusion ............................................................ 27
6.2 Limitations ........................................................... 27
REFERENCES ................................................................ 28
APPENDIX .................................................................. 29


ABSTRACT
Web Content Mining (WCM) involves techniques for the classification, summarizing and clustering of web content. This project is implemented using the k-means clustering algorithm: k-means creates clusters of documents based on their distances to the centroids of the clusters. The project was built using the C# programming language, with Microsoft Visio used to draw the use case, sequence and activity diagrams.


CHAPTER ONE
INTRODUCTION
1.1 Background of Study
Document clustering is the automatic grouping of text documents into clusters so that documents within a cluster have high similarity to one another but are dissimilar to documents in other clusters. Hence clustering is also known as unsupervised learning (G. Lakshmi, 1997).
This project, Web Content Mining Using K-Means, focuses mainly on text document clustering based on the k-means algorithm. There are diverse ways of mining information from the web, and the k-means technique is one of them.
Jaiswal (2007) worked on different clustering methods such as K-Means, Vector Space Model (VSM), Latent Semantic Indexing (LSI) and Fuzzy C-Means (FCM) for web clustering, and compared them.
Bouras and Tsogkas (2010) presented an enhanced model based on the k-means algorithm that uses information extracted from WordNet hypernyms in two ways:
I. enriching the bag of words used earlier in the clustering process, and
II. assisting the label generation procedure that follows it.
Murali Krishna and Durga Bhavani (2010) applied the well-known Apriori algorithm to mine frequent item sets and devised an efficient approach for text clustering based on those frequent item sets.
Maheshwari and Agrawal (2010) presented centroid-based text clustering for preprocessed data, which is a supervised approach to classify text into a set of predefined classes (Lama, 2013).
1.2 Problem Definition
Clustering is a method in which objects that are somehow similar in characteristics are grouped together. Finding information in web documents can be overwhelming for a user. Grouping similar documents into clusters helps the user find relevant information more quickly and easily. There are several methods for web document clustering, one of which is the partitioning method, which includes k-means and k-medoids. The main focus of this project is on k-means.
1.3 Aims and Objectives
The specific objectives of this study are:
1. To provide a group of similar records.
2. To help the user get the important information based on their needs.
3. To help the user save time instead of reading the whole document.
4. To provide quick information from large documents (Manjula.K.S, 2013).
1.4 Scope of Study
The scope of this project is to reorganize a set of documents into a smaller number of clusters, where each cluster provides a group of similar records.

CHAPTER TWO
LITERATURE REVIEW
2.1 Web Content Mining
Web content mining (WCM) is the mining of information from web page content. It describes the discovery of useful information from web documents. In web content mining, content comes in different forms, e.g. text, images, audio, video, metadata and hyperlinks.
WCM includes techniques for mining information such as clustering, summarizing and classification of web contents. It can also provide useful and interesting patterns about user needs. WCM builds on information retrieval and text mining, covering the extraction of information, text classification and clustering, and the visualization of information.
2.2 Techniques of Web Content Mining
Some of the techniques of web content mining include the following:
I. Structured data mining technique
II. Semi-structured data mining technique
III. Multimedia data mining technique
IV. Unstructured data mining technique
These are described below:


2.2.1 Structured Data Mining Technique
Structured data mining extracts information from structured data on web pages. A program for mining such data is usually called a wrapper. Structured data are typically data records retrieved from an underlying database and displayed in web pages with templates; sometimes the template is a table or a form. Extracting data records is useful because it enables one to obtain and integrate data from multiple sources, such as web sites and pages, to provide value-added services, e.g. customizable web information gathering, comparative shopping, meta-search, etc.
2.2.2 Semi-Structured Data Mining Technique
Semi-structured data lie at the point where the Web and databases come together: the Web deals with documents and databases with data. Structured data take the form of relational tables with numbers and strings, whereas semi-structured data allow a more natural representation of complex real-world objects such as papers, movies, pencils, etc. Newer representations for semi-structured data, such as XML, are variations on the Object Exchange Model. In the Object Exchange Model, data take the form of atomic or compound objects: atomic objects may be integers or strings, while compound objects refer to other objects through labeled edges. HTML is a special case of such intra-document structure. Users not only request information but are also keen to gain a better understanding of their queries. Because of this variety, semi-structured databases do not come with a conceptual schema; to make these databases more accessible to users, a rich conceptual model is needed, and traditional retrieval techniques cannot be applied to them directly.
2.2.3 Multimedia Data Mining Technique
Multimedia data mining can be defined as the process of finding interesting patterns from
media data such as audio, video, image and text that are not accessible by basic queries. The
purpose for this Multimedia data mining is to use the discovered patterns to improve decision
making. Multimedia data mining has therefore attracted significant research efforts in
developing methods and tools to organize, manage, search and perform domain specific tasks
for data from domains such as surveillance, meetings, broadcast news, sports, archives,
movies, medical data, as well as personal and online media collections.
2.2.4 Unstructured Data Mining Technique
Much of the web's content is unstructured, and a large number of web pages are in the form of free text. With this technique the data is searched and retrieved; the retrieved data is not necessarily meaningful and may be unknown information. Some techniques used to get relevant information from such data are:
2.2.4.1 Text Mining for Web Documents
Text mining is a subset of data mining techniques. It retrieves information from HTML web pages, which is a challenging task in itself: first because HTML web pages have multiple tags that must be identified to locate information, and secondly because web pages are highly unstructured (Govind Murari Upadhyay, 2013).

2.2.4.2 Topic Tracking
This is another technique, in which a registered user can track his or her topic of interest. The user registers with a topic, and whenever there is an update regarding that interest, the user is notified by message. For example, once we register with a provider company, information in the relevant area is determined and delivered to us. This also happens in the medical field: when new research comes into existence, the information is sent to doctors (Govind Murari Upadhyay, 2013).
2.3 Document Pre-Processing
Document pre-processing is the process of introducing a new document into the information retrieval system, in which each document introduced is represented by a set of index terms. The goal of document pre-processing is to represent the documents in such a way that their storage in the system and retrieval from the system are very efficient. Document pre-processing includes the following stages.
2.3.1 Tokenization
Tokenization is the process of cutting a given stream of text or character sequence into meaningful words called tokens. Tokens are grouped together as a semantic unit and used as input for document pre-processing.
2.3.1.1 Importance of Tokenization
Tokenization is used as a form of text segmentation in natural language processing, and as a unique symbol representation for sensitive data in data security, without compromising its security importance.
Usually tokenization occurs at the word level, but the definition of a word varies according to the context, so experimentation is carried out based on the following basic considerations for more accurate output:
All alphabetic characters in close proximity within a string are part of one token; likewise with numbers.
Whitespace characters such as spaces or line breaks, or punctuation characters, separate the tokens.
The resulting list of tokens may or may not contain punctuation and whitespace.
For example:
Input: man, woman and country
Output (tokens):
man
woman
country
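As a minimal sketch of this idea (not part of the project code; the class and method names are illustrative assumptions), tokenization by whitespace and punctuation can be written in C# as follows:

using System;
using System.Linq;

class TokenizerDemo
{
    // Splits raw text into lowercase tokens on whitespace and common punctuation.
    static string[] Tokenize(string text)
    {
        char[] delimiters = { ' ', '\t', '\n', '\r', ',', '.', ';', ':', '!', '?' };
        return text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries)
                   .Select(t => t.ToLowerInvariant())
                   .ToArray();
    }

    static void Main()
    {
        // Prints: man, woman, and, country (one per line).
        // The token "and" would later be dropped by stop word removal.
        foreach (string token in Tokenize("man, woman and country"))
            Console.WriteLine(token);
    }
}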
2.3.2 Stop Word Removal
Sometimes a very common word, which would appear to be of little significance in helping to select documents matching a user's need, is completely excluded from the vocabulary. These words are called stop words, and the technique is called stop word handling.
The general strategy for determining a stop list is to sort the terms by collection frequency and then take the most frequently used terms as the stop list, the members of which are discarded during indexing.
Some examples of stop words are: a, an, the, and, are, as, at, be, for, from, has, he, in, is, it, its, of, on, that, to, was, were, will, with, etc. (Lama, 2013).
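A minimal sketch of this technique is given below; the stop list shown is a small illustrative sample, not the full list a real system would use:

using System;
using System.Collections.Generic;
using System.Linq;

class StopWordDemo
{
    // A small illustrative stop list; production systems use much larger ones.
    static readonly HashSet<string> StopWords = new HashSet<string>
        { "a", "an", "the", "and", "are", "as", "at", "be", "for", "in", "is", "of", "on", "to" };

    // Keeps only the tokens that are not in the stop list.
    static string[] RemoveStopWords(string[] tokens) =>
        tokens.Where(t => !StopWords.Contains(t)).ToArray();

    static void Main()
    {
        string[] tokens = { "the", "clusters", "of", "documents" };
        Console.WriteLine(string.Join(" ", RemoveStopWords(tokens))); // clusters documents
    }
}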
2.3.3 Lemmatization
This is the process of reducing the inflected forms, or sometimes the derived forms, of a word to its base form so that they can be analyzed as a single term.
According to Plisson et al. (2005), lemmatization plays an important role during the document pre-processing step in many applications of text mining. Besides its use in the field of natural language processing and linguistics, it is also used to generate generic keywords for search engines or labels for concept maps.
Lemmatization and stemming are closely related, as the goal of both processes is to reduce the inflectional or derivationally related forms of a word to its base form. However, stemming is a heuristic process in which the ends of words or the affixes of derived words are chopped off to obtain the base form. Lemmatization goes through the whole morphological analysis of the word and uses a vocabulary to return the dictionary or base form of the word, which is called the lemma (Lama, 2013).
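To make the contrast concrete, the toy sketch below strips suffixes heuristically, as a stemmer does. It is only illustrative: real stemmers such as the Porter stemmer apply many ordered rules, and a true lemmatizer would instead perform a morphological analysis with a dictionary lookup.

using System;

class StemmerDemo
{
    // A toy suffix-stripping stemmer: chops common English endings.
    // A lemmatizer would instead return the dictionary form via vocabulary lookup.
    static string Stem(string word)
    {
        string[] suffixes = { "ation", "ing", "ies", "ed", "es", "s" };
        foreach (string s in suffixes)
            if (word.EndsWith(s) && word.Length > s.Length + 2)
                return word.Substring(0, word.Length - s.Length);
        return word;
    }

    static void Main()
    {
        Console.WriteLine(Stem("clustering")); // cluster
        Console.WriteLine(Stem("documents"));  // document
    }
}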

2.4 What Is Cluster Analysis?
Cluster analysis is very important in human activities: from childhood we learn to distinguish between cats and dogs, or between animals and plants, and so on. Through automated clustering, one can identify dense and sparse regions in object space and, therefore, discover overall distribution patterns and interesting correlations among data attributes. Cluster analysis is widely used in many applications, including market research, pattern recognition, data analysis, and image processing. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations. In simple words, clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Although classification is an effective means for distinguishing groups or classes of objects, it requires the often costly collection and labeling of a large set of training tuples or patterns, which the classifier uses to model each group. It is often more desirable to proceed in the reverse direction: first partition the set of data into groups based on data similarity (e.g., using clustering), and then assign labels to the relatively small number of groups (Jiawei Han).
The importance of document clustering is widely acknowledged by researchers such as Qiujun (2010), Jaiswal (2007) and Shah and Elbahesh (2004) for the better management, smart navigation, efficient filtering and concise summarization of large collections of documents such as the World Wide Web (WWW).

Figure 1(a): Scattered documents. Figure 1(b): Clustered documents (Kunwar, 2013)

In the figure above it is easy to see that the data is divided into three groups, and that the similarity criterion measured is distance: two or more objects belong to the same cluster if they are close to one another according to a given distance (here, Euclidean distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if it defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

2.4.1 Classification
Clustering algorithms are classified as follows:
1. Flat clustering: creates a set of clusters without any explicit structure that would relate clusters to each other; it is also called exclusive clustering.
2. Hierarchical clustering: creates a hierarchy of clusters.
3. Hard clustering: assigns each document/object to exactly one cluster.
4. Soft clustering: distributes each document/object over all clusters.
2.4.2 K-Means Clustering
Among the various types of clustering, the k-means clustering algorithm, developed by MacQueen, is one of the simplest and best known unsupervised learning algorithms that solve the well-known clustering problem. The k-means algorithm aims to partition a set of objects, based on their attributes/features, into k clusters, where k is a predefined or user-defined constant. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a way that it is closely related, in terms of a similarity function, to all objects in that cluster; similarity can be measured using different methods such as cosine similarity, Euclidean distance or Extended Jaccard.
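For reference, these three measures have the following standard textbook definitions, writing a and b for the term vectors of two documents (the Extended Jaccard form shown here is the usual definition; the implementation in the Appendix differs slightly in that it uses unsquared magnitudes):

\text{cosine}(a,b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}, \qquad
d(a,b) = \sqrt{\sum_i (a_i - b_i)^2}, \qquad
\text{EJ}(a,b) = \frac{a \cdot b}{\lVert a \rVert^2 + \lVert b \rVert^2 - a \cdot b}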
2.4.3 K-Means Algorithm
1. Choose k, the number of clusters to be determined.
2. Choose k objects randomly as the initial cluster centers.
3. Repeat:
   3.1. Assign each object to its closest cluster.
   3.2. Compute the new clusters, i.e. calculate the mean points.
4. Until:
   4.1. No cluster center changes (i.e. the centroids no longer change location), or
   4.2. No object changes its cluster (Kunwar, 2013).

Figure 2(a): Selection of seeds. Figure 2(b): Assignment of documents. Figure 2(c): Movement of centroids (Kunwar, 2013)

2.4.4 How Does the K-Means Clustering Algorithm Work?
The letter k in the k-means algorithm refers to the number of clusters we want to assign in the dataset. The k-means algorithm proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

E = \sum_{i=1}^{k} \sum_{p \in C_i} \lvert p - m_i \rvert^2

where E is the sum of the squared error over all objects, p is a point representing a given object, and m_i is the mean of cluster C_i (Jiawei Han).
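A compact, self-contained sketch of this loop on two-dimensional points is shown below. It is illustrative only (the full implementation used in this project appears in the Appendix) and uses plain Euclidean distance on the small example data worked through in Chapter Five:

using System;
using System.Linq;

class KMeansSketch
{
    // Euclidean distance between two points of equal dimension.
    static double Dist(double[] a, double[] b) =>
        Math.Sqrt(a.Zip(b, (x, y) => (x - y) * (x - y)).Sum());

    static void Main()
    {
        double[][] docs = { new[] { 1.0, 1 }, new[] { 2.0, 1 }, new[] { 4.0, 3 }, new[] { 5.0, 4 } };
        int k = 2;
        // Seed the centroids with the first two documents (D1 and D2).
        double[][] centroids = { (double[])docs[0].Clone(), (double[])docs[1].Clone() };
        int[] assign = new int[docs.Length];
        bool changed = true;

        while (changed)
        {
            changed = false;
            // Assignment step: each document goes to its nearest centroid.
            for (int i = 0; i < docs.Length; i++)
            {
                int best = Enumerable.Range(0, k).OrderBy(c => Dist(docs[i], centroids[c])).First();
                if (best != assign[i]) { assign[i] = best; changed = true; }
            }
            // Update step: recompute each centroid as the mean of its members.
            for (int c = 0; c < k; c++)
            {
                var members = docs.Where((d, i) => assign[i] == c).ToList();
                if (members.Count == 0) continue;
                for (int j = 0; j < 2; j++)
                    centroids[c][j] = members.Average(m => m[j]);
            }
        }
        Console.WriteLine(string.Join(" ", assign)); // 0 0 1 1
    }
}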


CHAPTER THREE
SYSTEM ANALYSIS AND DESIGN

3.1 Requirement Specification


An outline of the specification of the system is as follows:
3.1.1 Document Collection
Documents are stored in a text file for document pre-processing. The text file is then fed to the system by referencing its location on the local machine.

3.1.2 Document Pre-Processing

The goal of document pre-processing is to represent the documents in such a way that their storage in the system and retrieval from the system are very efficient. Document pre-processing includes the following stages.
3.1.2.1 Tokenization
This is the process of chopping a given stream of text or character sequence into words, phrases, symbols, or other meaningful elements called tokens, which are grouped together as a semantic unit and used as input for further processing such as parsing or text mining.
3.1.2.2 Stop Word Removal
Very common words that are of little help in selecting documents matching a user's need are completely excluded from the vocabulary. These words are called stop words, and the technique is called stop word handling.
3.1.2.3 Lemmatization
This is the process of reducing the inflected forms or sometimes the derived forms of a word
to its base form so that they can be analyzed as a single term.
3.1.3 Predefined Number Of Clusters
The number of clusters has to be chosen before the process can take place.

3.1.4 Randomly Generated Centroid

The centroid is randomly generated. It is also known as the mean or center of a cluster.
3.1.5 Convergence Criterion
This determines the number of iterations that the program will loop through, until the cluster means remain unchanged.
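In code, this check amounts to comparing the centroids from two successive iterations. A minimal sketch is given below (the helper name HasConverged is an assumption for illustration; the full check used in this project is the CheckStoppingCriteria method in the Appendix):

using System;

class ConvergenceDemo
{
    // Returns true when no centroid coordinate changed between two iterations.
    static bool HasConverged(double[][] prev, double[][] curr)
    {
        for (int c = 0; c < curr.Length; c++)
            for (int j = 0; j < curr[c].Length; j++)
                if (prev[c][j] != curr[c][j])
                    return false;
        return true;
    }

    static void Main()
    {
        // The final centroids from the Chapter Five example, unchanged between iterations.
        double[][] a = { new[] { 1.5, 1.0 }, new[] { 4.5, 3.5 } };
        double[][] b = { new[] { 1.5, 1.0 }, new[] { 4.5, 3.5 } };
        Console.WriteLine(HasConverged(a, b)); // True: the loop may stop.
    }
}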
The overall process of the system is described diagrammatically below:

Figure 3: Overall process of the system (Lama, 2013)


3.2 UML Design
The Unified Modeling Language (UML) is a standard language for specifying, visualizing, constructing and documenting a software system and its components. It is a graphical language which provides a vocabulary and a set of semantics and rules. The UML focuses on the conceptual and physical representation of the system. It is used to understand, design, configure, maintain and control information about such systems.


3.2.1 Aims of Modeling
I. Models help us to visualize a system as it is or as we want it to be.
II. Models permit us to specify the structure of a system.
III. Models give us a template that guides us in constructing a system.
IV. Models can document the decisions we have made.

3.2.2 Diagrams in the UML
A diagram is the graphical presentation of a set of elements, most often rendered as a connected graph of vertices and arcs. There are nine diagrams in the UML:
Class Diagram
Object Diagram
Use-Case Diagram
Sequence Diagram
Collaboration Diagram
State-Chart Diagram
Activity Diagram
Component Diagram
Deployment Diagram (G. Lakshmi, 1997)
In this project the system is designed using the following diagrams:


3.2.2.1 Figure 4: Use Case Diagram
[Use case diagram: the USER and a TOOL interact with the SYSTEM through the use cases Add Document, Document Preprocessing (Tokenization, Stop Words Removal, Stemming), Construct K-Means Representation, Assigning Document to Cluster, and Display Output.]

3.2.2.2 Figure 5: Activity Diagram
[Activity diagram: Add Document -> Preprocessing -> Clustering -> Display Output to the User.]

3.2.2.3 Figure 6: Sequence Diagram
[Sequence diagram: the User browses and adds a document; the document passes through Preprocessing and Clustering; documents are assigned to clusters and the output is displayed to the user.]

3.2.2.4 Figure 7: Class Diagram
[Class diagram of the clustering system.]

3.3 Methodology
The overall process used for this project consists of step-by-step actions performed to cluster web documents, using Rapid Application Development (RAD).
3.3.1 Document Pre-Processing
Document pre-processing is the process of introducing a new document to the information retrieval system, in which each document introduced is represented by a set of index terms. The document pre-processing process includes the following steps:
Tokenization
The detailed concept and explanation of tokenization are given in Section 2.3.1. The text used in this project is English, so instead of using complex methods, the tokenization process is accomplished by using spaces to split the sequence of characters into tokens.
Stop Word Removal
Sometimes extremely common words, which would appear to be of little value in helping select documents matching a user's need, are excluded from the vocabulary entirely. These words are stop words, and the process is called stop word removal. For the purpose of stop word removal, we create a list of stop words such as "a", "an", "the", and prepositions; the tokens contained in the stop word list are discarded.


CHAPTER FOUR
IMPLEMENTATION
4.1 Introduction
Implementation is the process of executing a plan or design to achieve some output. In this project, the implementation encompasses collecting text documents, feeding them into the system to go through the document pre-processing techniques, forwarding the pre-processed documents to the clustering system, and obtaining clusters of text documents as the final output.
4.2 Programming language
Visual c# was used to build this application. C# is a multi-paradigm programming language
encompassing imperative, declarative, functional ,generic, object-oriented (class-based), and
component oriented programming disciplines. It was developed by Microsoft within .NET
initiative and later approved as a standard by Ecma(ECMA-334) and ISo (iso/iec 23270).C#
is one of the programming languages designed for the common language infrastructure.
C# is intended to be a simple, modern, general purpose, object-oriented programming
language. Its development team is led by AnderHejiberg. The most recent version is C# 5.o,
which was released on August 2012 (contributors).
By design, C# is the programming language that mostly reflects the underlying common
language infrastructure (CLI). Most of its intrinsic types correspond to value types
implemented by CLI framework. However, the language specification does not state the
code generation requirements of the compiler: that is doesnot state that C# compiler must
target a common language Runtime,0r generate common intermediate language(CIL), or
generate any other specific format. Theoretically, a C# compiler could generate machine
code like traditional compilers of C++ or FORTRAN.
Some distinguishing features of C# are:
I. There are no global variables or functions. All methods and members must be declared within classes. Static members of public classes can substitute for global variables and functions.
II. In addition to the try...catch construct to handle exceptions, C# has a try...finally construct to guarantee execution of the code in the finally block.
III. Checked exceptions are not present in C# (in contrast to Java). This was a conscious decision based on the issues of scalability and versionability.
IV. Multiple inheritance is not supported, although a class can implement any number of interfaces. This was a design decision by the language's lead architect to avoid complication and to simplify architectural requirements throughout the CLI.
V. A C# namespace provides the same level of code isolation as a Java package or a C++ namespace, with very similar rules and features to a package (Wikipedia contributors).

4.2.1 Advantages of C#
1. It supports object orientation with classes, interfaces and abstract classes.
2. It supports functional programming, including delegates/method references and query expressions.
3. Classes can be defined within classes.
4. It formalizes the concept of get/set methods, so the code becomes more legible.
5. It supports a single-root (unified) type system, with signed and unsigned integers of 8, 16, 32 and 64 bits.
6. It supports operator overloading (Wikipedia contributors).
4.3 Requirements Analysis
A graphical user interface is designed so that the user can enter the parameters for the input document. The user can also browse, add documents and perform clustering.


Figure 8: User Interface

4.4 Hardware Requirements

The minimum hardware requirements suggested for the software to run efficiently are given below:
A computer system
111 GB free hard disk storage space
32-bit operating system
1.00 GB memory (RAM)


CHAPTER FIVE
TESTING AND ANALYSIS
5.0 Testing and Analysis of Sample Data

K-means is a method for creating clusters of documents. The k-means algorithm creates the clusters of the documents based on their distances to the centroids of the clusters. The output depends on the optimal calculation of the centroids of the clusters (Lama, 2013). The following sample data illustrates this:
5.1 Document Input

The table below shows the sample data, consisting of four documents (D1, D2, D3, D4) with the distinct terms Term1 and Term2, taken as document input for the experimental analysis.

DOCUMENTS   TERM 1   TERM 2
D1          1        1
D2          2        1
D3          4        3
D4          5        4

(Lama, 2013)


The matrix representation of the above data is given below:

         D1   D2   D3   D4
TERM1    1    2    4    5
TERM2    1    1    3    4

The input documents are now processed for clustering using the k-means algorithm.
Step 1:
Let k = 2, the number of clusters to be created.
Assume documents D1 and D2 as the initial centroids, and let C1 and C2 be the coordinates representing D1 and D2 respectively, so that C1 = (1,1) and C2 = (2,1) (Lama, 2013).
Step 2:
The distances between each document and the centroids C1 and C2 are calculated using the Euclidean distance formula.
The distance between document D3(4,3) and the first centroid C1(1,1) is calculated as
sqrt[(4-1)^2 + (3-1)^2] = sqrt(13) = 3.61,
and the distance between document D3 and the second centroid C2(2,1) is sqrt[(4-2)^2 + (3-1)^2] = 2.83.
The same method is applied to calculate the distances from D1, D2 and D4 to the centroids, and the final result is displayed in matrix form (Lama, 2013).
                          D1     D2     D3     D4
distance from C1(1,1):    0      1      3.61   5       Cluster 1
distance from C2(2,1):    1      0      2.83   4.24    Cluster 2
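The distances in this matrix can be reproduced with a few lines of C# (a verification sketch, not part of the project code):

using System;

class DistanceCheck
{
    // Euclidean distance between two 2-D points.
    static double Euclid(double x1, double y1, double x2, double y2) =>
        Math.Sqrt((x1 - x2) * (x1 - x2) + (y1 - y2) * (y1 - y2));

    static void Main()
    {
        double[,] docs = { { 1, 1 }, { 2, 1 }, { 4, 3 }, { 5, 4 } };   // D1..D4
        double[,] cents = { { 1, 1 }, { 2, 1 } };                      // C1, C2
        for (int c = 0; c < 2; c++)
        {
            for (int d = 0; d < 4; d++)
                Console.Write($"{Euclid(docs[d, 0], docs[d, 1], cents[c, 0], cents[c, 1]):0.00} ");
            Console.WriteLine($" <- distances from C{c + 1}");
        }
        // Output: 0.00 1.00 3.61 5.00  <- distances from C1
        //         1.00 0.00 2.83 4.24  <- distances from C2
    }
}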


The matrix above shows the documents and their distances from the centroids. Based on the minimum distance from the centroids, the documents are assigned to clusters: D1 is assigned to Cluster 1 as its distance to C1 is smaller than to C2, D2 is assigned to Cluster 2 as its distance to C2 is smaller than to C1, and likewise for the other documents (Lama, 2013).
Matrix representation of the documents in the clusters:

            D1   D2   D3   D4
Cluster 1   1    0    0    0
Cluster 2   0    1    1    1

In this matrix, the value 1 in a row indicates that a document belongs to that cluster, and the value 0 indicates that it is not a member (Lama, 2013).
Therefore, Cluster 1 has D1 as its member and Cluster 2 has D2, D3 and D4 as its members.
Step 3:
In this step, Iteration 1 of the algorithm starts, where the new centroid of each cluster is recalculated based on the new membership. Cluster 1 has only one member, D1, so its centroid remains C1. Cluster 2 has three members, so its new centroid is the average of the coordinates of the three members (D2, D3 and D4):
C2 = ((2+4+5)/3, (1+3+4)/3) = (11/3, 8/3)
After the new centroids are calculated, the distances of all the documents to them are calculated as in Step 2.
                               D1     D2     D3     D4
distance from C1(1,1):         0      1      3.61   5       Cluster 1
distance from C2(11/3,8/3):    3.14   2.36   0.47   1.89    Cluster 2

Based on the minimal distance to each centroid, the documents are redistributed to their nearest clusters.


            D1   D2   D3   D4
Cluster 1   1    1    0    0
Cluster 2   0    0    1    1

Based on the above document matrix, Cluster 1 has two members, D1 and D2, and Cluster 2 also has two members, D3 and D4 (Lama, 2013).
Step 4:
Cluster 1 has two members, D1 and D2, and Cluster 2 also has two members, D3 and D4. Thus, the new centroid for each cluster is recalculated:
C1 = average of the coordinates of D1 and D2 belonging to Cluster 1 = ((1+2)/2, (1+1)/2) = (3/2, 1)
C2 = average of the coordinates of D3 and D4 belonging to Cluster 2 = ((4+5)/2, (3+4)/2) = (9/2, 7/2)
This takes place in Iteration 2. The distances from each document to the new centroids are calculated as explained in Step 2.
                              D1     D2     D3     D4
distance from C1(3/2,1):      0.5    0.5    3.20   4.61    Cluster 1
distance from C2(9/2,7/2):    4.30   3.54   0.71   0.71    Cluster 2

Based on the distances calculated above, the documents are redistributed to their nearest clusters (Lama, 2013).

            D1   D2   D3   D4
Cluster 1   1    1    0    0
Cluster 2   0    0    1    1


Based on the above document matrix, Cluster 1 has two members, D1 and D2, and Cluster 2 also has two members, D3 and D4.
The results obtained from Step 3 and Step 4 in the document matrix are the same: the grouping of documents in Iteration 1 and Iteration 2 shows that the documents no longer move to new clusters. Thus, the computation of the k-means algorithm stops, and the following result was obtained (Lama, 2013).

DOCUMENT   TERM1   TERM2   RESULT (CLUSTER)
D1         1       1       Cluster 1
D2         2       1       Cluster 1
D3         4       3       Cluster 2
D4         5       4       Cluster 2

Result of the sample data after clustering (Lama, 2013)


CHAPTER SIX
CONCLUSION AND LIMITATIONS
6.1 Conclusion

This project involved a great deal of work on document clustering and focused on the various methods for document pre-processing, on which the project was completely based. A system was created for clustering text documents: various techniques were applied to the pre-processed documents, and finally the k-means clustering algorithm was used to create the clusters of text documents.
6.2 Limitations

I faced a number of limitations during the implementation of this project. Some of these are:
Lack of programming skills to deal with the project deliverables.
Lack of capital to buy materials online.
Lack of information and materials to use.


REFERENCES
G. Lakshmi, B. S. (1997). Clustering Engine.
Govind Murari Upadhyay, K. D. (2013). International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 11.
Jiawei Han, M. K. Data Mining: Concepts and Techniques.
Kunwar, S. (2013, January 26). Text Documents Clustering using K-Means Algorithm. Retrieved November 2014, from www.codeproject.com
Lama, P. (2013). Clustering System Based On Text Mining Using The K-means Algorithm.
Manjula.K.S, S. B. (2013). Extracting Summary from Documents Using. International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, Issue 8.
Wikipedia contributors. (n.d.). C Sharp (programming language). Retrieved from http://en.wikipedia.org/w/index.php?title=C_Sharp_(programming_language)&oldid=636995385


APPENDIX
PROGRAM LISTINGS
///////////////////////////////////////////MAIN CLASS///////////////////////////////////////////////////
using System;
using System.Collections.Generic;
using System.Linq;
using System.Windows.Forms;

namespace TextClustering
{
    static class Program
    {
        /// <summary>
        /// The main entry point for the application.
        /// </summary>
        [STAThread]
        static void Main()
        {
            Application.EnableVisualStyles();
            Application.SetCompatibleTextRenderingDefault(false);
            Application.Run(new TextClusteringGUI());
        }
    }
}

//////////////////////////////////////DOCUMENT CLUSTERING//////////////////////////////////////////
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace TextClustering
{
    /// <summary>
    /// Prepares cluster centers using the K-Means clustering algorithm.
    /// Document similarity is measured using cosine similarity.
    /// </summary>
    public static class DocumentClustering
    {
        private static int globalCounter = 0;

        private static int counter;

        /// <summary>
        /// Prepares the document clusters; the grouping of similar
        /// text documents is done here.
        /// </summary>
        /// <param name="k">number of initial cluster centers</param>
        /// <param name="documentCollection">document corpus</param>
        /// <param name="_counter">returns the number of iterations performed</param>
        /// <returns></returns>
        public static List<Centroid> PrepareDocumentCluster(int k,
            List<DocumentVector> documentCollection, ref int _counter)
        {
            globalCounter = 0;
            // Prepare k initial centroids and assign one randomly chosen document to each.
            List<Centroid> centroidCollection = new List<Centroid>();
            Centroid c;

            /*
             * Avoid repetition of random numbers: if the same number were generated more
             * than once, the same document would be added to more than one cluster,
             * so a HashSet collection is used to keep the numbers unique.
             */
            HashSet<int> uniqRand = new HashSet<int>();
            GenerateRandomNumber(ref uniqRand, k, documentCollection.Count);

            foreach (int pos in uniqRand)
            {
                c = new Centroid();
                c.GroupedDocument = new List<DocumentVector>();
                c.GroupedDocument.Add(documentCollection[pos]);
                centroidCollection.Add(c);
            }

            Boolean stoppingCriteria;
            List<Centroid> resultSet;
            List<Centroid> prevClusterCenter;

            InitializeClusterCentroid(out resultSet, centroidCollection.Count);

            do
            {
                prevClusterCenter = centroidCollection;

                // Assignment step: add each document to its closest cluster center.
                foreach (DocumentVector obj in documentCollection)
                {
                    int index = FindClosestClusterCenter(centroidCollection, obj);
                    resultSet[index].GroupedDocument.Add(obj);
                }

                // Update step: recompute the centroids from the new membership.
                InitializeClusterCentroid(out centroidCollection, centroidCollection.Count());
                centroidCollection = CalculateMeanPoints(resultSet);
                stoppingCriteria = CheckStoppingCriteria(prevClusterCenter, centroidCollection);
                if (!stoppingCriteria)
                {
                    // Initialize the result set for the next iteration.
                    InitializeClusterCentroid(out resultSet, centroidCollection.Count);
                }

            } while (stoppingCriteria == false);

            _counter = counter;
            return resultSet;
        }

        /// <summary>
        /// Generates unique random numbers and also ensures the generated numbers
        /// lie within the range of the total number of documents.
        /// </summary>
        /// <param name="uniqRand"></param>
        /// <param name="k"></param>
        /// <param name="docCount"></param>
        private static void GenerateRandomNumber(ref HashSet<int> uniqRand, int k, int docCount)
        {
            Random r = new Random();

            // If k exceeds the number of documents, use every document once.
            if (k > docCount)
            {
                do
                {
                    int pos = r.Next(0, docCount);
                    uniqRand.Add(pos);
                } while (uniqRand.Count != docCount);
            }
            else
            {
                do
                {
                    int pos = r.Next(0, docCount);
                    uniqRand.Add(pos);
                } while (uniqRand.Count != k);
            }
        }

        /// <summary>
        /// Initializes the result cluster centroids for the next iteration;
        /// this holds the result to be returned.
        /// </summary>
        /// <param name="centroid"></param>
        /// <param name="count"></param>
        private static void InitializeClusterCentroid(out List<Centroid> centroid, int count)
        {
            Centroid c;
            centroid = new List<Centroid>();
            for (int i = 0; i < count; i++)
            {
                c = new Centroid();
                c.GroupedDocument = new List<DocumentVector>();
                centroid.Add(c);
            }
        }

        /// <summary>
        /// Checks the stopping criteria for the iteration: if the centroids do not move
        /// their position, the criteria are met; alternatively, if the global counter
        /// exceeds its predefined limit (iteration threshold), the iteration terminates.
        /// </summary>
        /// <param name="prevClusterCenter"></param>
        /// <param name="newClusterCenter"></param>
        /// <returns></returns>
        private static Boolean CheckStoppingCriteria(List<Centroid> prevClusterCenter,
            List<Centroid> newClusterCenter)
        {
            globalCounter++;
            counter = globalCounter;
            if (globalCounter > 11000)
            {
                return true;
            }
            else
            {
                Boolean stoppingCriteria;
                // 1 = centroid has moved, 0 = centroid did not move its position.
                int[] changeIndex = new int[newClusterCenter.Count()];

                int index = 0;
                do
                {
                    int count = 0;
                    if (newClusterCenter[index].GroupedDocument.Count == 0 &&
                        prevClusterCenter[index].GroupedDocument.Count == 0)
                    {
                        index++;
                    }
                    else if (newClusterCenter[index].GroupedDocument.Count != 0 &&
                             prevClusterCenter[index].GroupedDocument.Count != 0)
                    {
                        // Compare the centroid vector component by component.
                        for (int j = 0; j < newClusterCenter[index].GroupedDocument[0].VectorSpace.Count(); j++)
                        {
                            if (newClusterCenter[index].GroupedDocument[0].VectorSpace[j] ==
                                prevClusterCenter[index].GroupedDocument[0].VectorSpace[j])
                            {
                                count++;
                            }
                        }

                        if (count == newClusterCenter[index].GroupedDocument[0].VectorSpace.Count())
                        {
                            changeIndex[index] = 0;
                        }
                        else
                        {
                            changeIndex[index] = 1;
                        }
                        index++;
                    }
                    else
                    {
                        index++;
                        continue;
                    }
                } while (index < newClusterCenter.Count());

                // If changeIndex contains a 1, some centroid moved, so the stopping
                // criteria are not yet met.
                if (changeIndex.Where(s => (s != 0)).Select(r => r).Any())
                {
                    stoppingCriteria = false;
                }
                else
                    stoppingCriteria = true;

                return stoppingCriteria;
            }
        }

        // Returns the index of the closest cluster centroid.
        private static int FindClosestClusterCenter(List<Centroid> clusterCenter, DocumentVector obj)
        {
            float[] similarityMeasure = new float[clusterCenter.Count()];

            for (int i = 0; i < clusterCenter.Count(); i++)
            {
                similarityMeasure[i] = SimilarityMatrics.FindCosineSimilarity(
                    clusterCenter[i].GroupedDocument[0].VectorSpace, obj.VectorSpace);
            }

            int index = 0;
            float maxValue = similarityMeasure[0];
            for (int i = 0; i < similarityMeasure.Count(); i++)
            {
                // If a document is equally similar to several centers, it stays with the
                // lowest-index cluster center to avoid a long loop.
                if (similarityMeasure[i] > maxValue)
                {
                    maxValue = similarityMeasure[i];
                    index = i;
                }
            }
            return index;
        }

        // Repositions the centroids.
        private static List<Centroid> CalculateMeanPoints(List<Centroid> _clusterCenter)
        {
            for (int i = 0; i < _clusterCenter.Count(); i++)
            {
                if (_clusterCenter[i].GroupedDocument.Count() > 0)
                {
                    for (int j = 0; j < _clusterCenter[i].GroupedDocument[0].VectorSpace.Count(); j++)
                    {
                        float total = 0;

                        foreach (DocumentVector vSpace in _clusterCenter[i].GroupedDocument)
                        {
                            total += vSpace.VectorSpace[j];
                        }

                        // Reassign the newly calculated mean on each cluster center;
                        // this repositions the centroid.
                        _clusterCenter[i].GroupedDocument[0].VectorSpace[j] =
                            total / _clusterCenter[i].GroupedDocument.Count();
                    }
                }
            }
            return _clusterCenter;
        }
        /// <summary>
        /// Finds the residual sum of squares (RSS); it measures how well a cluster centroid
        /// represents the members of its cluster. The RSS value can be used as the stopping
        /// criterion of the k-means algorithm: when the decrease in RSS falls below a
        /// threshold t, for small t, we can terminate the algorithm.
        /// </summary>
        private static void FindRSS(List<Centroid> newCentroid, List<Centroid> _clusterCenter)
        {
            //TODO:
        }
    }
}

/////////////////////////////////////Centroid///////////////////////////////////////////////
using System;
using System.Collections.Generic;
using System.Linq;

namespace TextClustering
{
public class Centroid
{
public List<DocumentVector> GroupedDocument { get; set; }
}
}
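The listing does not include the DocumentVector class that the code above references. Based on its usage (a VectorSpace array of term weights per document), a minimal definition would look like the following reconstruction; the Content property is an assumption about the original class:

using System;

namespace TextClustering
{
    /// <summary>
    /// Reconstructed sketch: one document and its term-weight vector.
    /// </summary>
    public class DocumentVector
    {
        // Raw text content of the document (assumed property).
        public string Content { get; set; }
        // Term weights over the corpus vocabulary, used by the clustering code.
        public float[] VectorSpace { get; set; }
    }
}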

//////////////////////////////Document Collection//////////////////////////////////////////
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace TextClustering
{
    /// <summary>
    /// Represents the collection of documents, or corpus.
    /// </summary>
    class DocumentsCollection
    {
        public List<String> DocumentList { get; set; }
    }
}

//////////////////////////////////////Stop Word Handler////////////////////////////////////////////////////


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace TextClustering
{
    /// <summary>
    /// Stop words are removed to make the indexing process effective. Terms with the
    /// highest term frequency across the document corpus can be considered stop words.
    /// </summary>
    public class StopWordsHandler
    {
        // Other stop words can be added to this list.
        public static string[] stopWordsList = new string[] { "The" };

        public static Boolean IsStopWord(string word)
        {
            if (stopWordsList.Contains(word))
                return true;
            else
                return false;
        }
    }
}

//////////////////////////////////Similarity Matrix//////////////////////////////////////
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;


namespace TextClustering
{
    /// <summary>
    /// Provides similarity measures between document vectors
    /// (cosine similarity, Euclidean distance and Extended Jaccard).
    /// </summary>
    public class SimilarityMatrics
    {
#region Cosine Similarity


public static float FindCosineSimilarity(float[] vecA, float[] vecB)
{
var dotProduct = DotProduct(vecA, vecB);
var magnitudeOfA = Magnitude(vecA);
var magnitudeOfB = Magnitude(vecB);
float result = dotProduct / (magnitudeOfA * magnitudeOfB);
//when 0 is divided by 0 it shows result NaN so return 0 in such case.
if (float.IsNaN(result))
return 0;
else
return (float)result;
}

#endregion

public static float DotProduct(float[] vecA, float[] vecB)


{

float dotProduct = 0;
for (var i = 0; i < vecA.Length; i++)
{
dotProduct += (vecA[i] * vecB[i]);
}

return dotProduct;
}

        // The magnitude of a vector is the square root of the dot product of the vector
        // with itself.
        public static float Magnitude(float[] vector)
        {
            return (float)Math.Sqrt(DotProduct(vector, vector));
        }

#region Euclidean Distance


        // Computes the similarity between two documents as the distance between their
        // point representations. It is translation invariant.
public static float FindEuclideanDistance(int[] vecA, int[] vecB)
{
float euclideanDistance = 0;
for (var i = 0; i < vecA.Length; i++)
{
euclideanDistance += (float)Math.Pow((vecA[i] - vecB[i]), 2);
}

return (float)Math.Sqrt(euclideanDistance);

}
#endregion

#region Extended Jaccard


//Combines properties of both cosine similarity and Euclidean distance
public static float FindExtendedJaccard(float[] vecA, float[] vecB)
{
var dotProduct = DotProduct(vecA, vecB);
var magnitudeOfA = Magnitude(vecA);
var magnitudeOfB = Magnitude(vecB);

return dotProduct / (magnitudeOfA + magnitudeOfB - dotProduct);

}
#endregion

}
}

40

SAMPLE INPUT/OUTPUT OF THE PROGRAM
[Screenshots of the program's sample input and output.]
