AUTOMATIC DISCOVERY OF ASSOCIATION ORDERS BETWEEN NAME AND ALIASES FROM THE
WEB USING ANCHOR TEXT BASED CO-OCCURRENCES

DEPARTMENT OF CSE
GDMM, NANDIGAMA

CHAPTER 1
INTRODUCTION

This paper mainly deals with information retrieval systems. Information retrieval
is the area in which users search for documents, information within documents,
and metadata about documents on the web. Many user queries involve retrieving
documents about personal names. Many celebrities and experts from various fields
are referred to by their original names on the web, and most queries to web search
engines include person names. For example, people might use Michael Jackson as a
query on a search engine to learn about him. The search engine returns the documents
that meet the information need expressed in the user's query. Celebrities and experts
are also referred to by their aliases on the web, and many web pages about a person
are written using aliases. For example, a newspaper article might refer to a person
by the original name, whereas a blogger might refer to the same person by a nickname.
A user will not be able to retrieve all the information about a person by using only
the personal name. To retrieve complete information about a person, one needs to
know that person's aliases on the web. Various types of words are used as aliases on
the web, so identifying aliases is helpful in information retrieval. The aliases are
extracted using a previously proposed alias extraction method. The search engine
expands a query on a person name by tagging the extracted aliases, so that it retrieves
the relevant web pages that refer to the person by the original name as well as by
aliases, thereby improving recall and mean reciprocal rank (MRR).

1.1 MOTIVATION
Searching for information about people on the web is one of the most common
activities of Internet users. Around 30% of search engine queries include person
names. However, retrieving information about people from web search engines can
become difficult when a person has nicknames or name aliases. For example, the
famous Japanese major league baseball player Hideki Matsui is often called
Godzilla on the web. A newspaper article on the baseball player might use the real
name, Hideki Matsui, whereas a blogger would use the alias, Godzilla, in a blog entry.
We will not be able to retrieve all the information about the baseball player if we only
use his real name. Identifying aliases of a name is therefore important in information
retrieval. To improve the recall of a web search on a person name, a search engine
can automatically expand a query using aliases of the name. In our previous
example, a user who searches for Hideki Matsui might also be interested in retrieving
documents in which Matsui is referred to as Godzilla. Consequently, we can expand a
query on Hideki Matsui using his alias name Godzilla.
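The query expansion described above can be sketched in a few lines. This is only an illustration, written in Python for brevity: the function name expand_query and the quoted-phrase OR syntax are assumptions, not the syntax of any particular search engine.

```python
def expand_query(name, aliases):
    """Build a disjunctive query that matches the name or any of its aliases."""
    # Skip aliases that merely repeat the name itself.
    terms = [name] + [a for a in aliases if a.lower() != name.lower()]
    # Quote each term so that multiword names are treated as phrases.
    return " OR ".join('"{}"'.format(term) for term in terms)


query = expand_query("Hideki Matsui", ["Godzilla"])
print(query)  # "Hideki Matsui" OR "Godzilla"
```

Documents that mention only the alias now match the expanded query, which is where the gain in recall comes from.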
For a given name, we define the set A of its aliases to be the set of all words or
multiword expressions that are used to refer to that entity on the web. For example,
Godzilla is a one-word alias for Hideki Matsui, whereas the alias the Fresh Prince
contains three words and refers to Will Smith. Various types of terms are used as
aliases on the web. For instance, in the case of an actor, the name of a role or the
title of a drama (or a movie) can later become an alias for the person (e.g., Fresh
Prince, Knight Rider). Titles or professions such as president, doctor, and professor
are also frequently used as aliases. Variants or abbreviations of names, such as Bill
for William, and acronyms, such as JFK for John Fitzgerald Kennedy, are also types
of name aliases that are observed frequently on the web.

1.2 OVERVIEW
An individual is typically referred to by numerous name aliases on the web.
Accurate identification of the aliases of a given person name is useful in various
web related tasks such as information retrieval, personal name disambiguation, and
relation extraction. We propose a method to extract aliases of a given personal name
from the web. Given a personal name, the proposed method first extracts a set of
candidate aliases. Second, we rank the extracted candidates according to the
likelihood of a candidate being a correct alias of the given name.
We define numerous ranking scores to evaluate candidate aliases using three
approaches: lexical pattern frequency, word co-occurrences in an anchor text graph,
and page counts on the web. To construct a robust alias detection system, we integrate
the different ranking scores into a single ranking function using ranking support
vector machines. We evaluate the proposed method on two data sets: an English
personal names data set and an English place names data set. The proposed method
outperforms numerous baselines and previously proposed name alias extraction
methods, achieving a statistically significant mean reciprocal rank (MRR) of 0.67.
Experiments carried out using location names and Japanese personal names suggest
the possibility of extending the proposed method to extract aliases for different types
of named entities, and for different languages. Moreover, the aliases extracted using
the proposed method are successfully utilized in an information retrieval task and
improve recall by 20 percent in a relation detection task.
To select the best aliases among the extracted candidates, we propose numerous
ranking scores based upon three approaches: lexical pattern frequency, word
co-occurrences in an anchor text graph, and page counts on the web. Moreover, using
real-world name alias data, we train a ranking support vector machine to learn the
optimal combination of the individual ranking scores and so construct a robust alias
extraction method.
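As a rough illustration of what "integrating the ranking scores into a single ranking function" means, the sketch below scores each candidate with a weighted sum of its individual scores and sorts the candidates best first. It is written in Python for brevity. The weights, the feature values, and the candidate names other than Godzilla are invented for the example; in the proposed method the weights would be learned by the ranking support vector machine rather than set by hand.

```python
def rank_candidates(features, weights):
    """Sort candidate aliases by the linear score w . x, best first."""
    def score(vector):
        return sum(w * v for w, v in zip(weights, vector))
    return sorted(features, key=lambda alias: score(features[alias]), reverse=True)


# Assumed feature order: [lexical pattern frequency,
#                         anchor text co-occurrence, page-count score]
weights = [0.5, 0.3, 0.2]          # made-up weights, not learned values
features = {
    "baseball": [0.1, 0.3, 0.4],   # made-up scores for illustration only
    "Godzilla": [0.9, 0.8, 0.7],
    "Yankees": [0.2, 0.6, 0.9],
}
ranking = rank_candidates(features, weights)
print(ranking)
```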
Along with the recent rapid growth of social media such as blogs, extracting and
classifying sentiment on the web has received much attention. However, when people
express their views about a particular entity, they do so by referring to the entity not
only by its real name but also by various aliases of the name. By aggregating
texts that use various aliases to refer to an entity, a sentiment analysis system can
produce an informed judgment related to the sentiment.

1.3 LIMITATIONS
Approximate string matching methods have an inherent limitation: they cannot
identify aliases which share no words or letters with the real name. For example,
approximate string matching methods would not identify Fresh Prince as an alias for
Will Smith.
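This limitation is easy to reproduce. The sketch below, in Python, uses the standard library's difflib.SequenceMatcher as a stand-in for an approximate string matcher; it is not the matcher used by any of the earlier alias extraction methods, and the name variant William Smith is only an illustrative input.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a character-level similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# A spelling variant shares most of its characters with the name ...
variant_score = similarity("William Smith", "Will Smith")
# ... whereas a true alias may share almost none.
alias_score = similarity("Fresh Prince", "Will Smith")
print(round(variant_score, 2), round(alias_score, 2))
```

The variant scores high and the genuine alias scores low, so any similarity threshold that accepts the first will reject the second.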
The most likely correct aliases are assigned a higher rank, so a set of candidate
aliases must always be ranked in descending order. Co-occurrences must be
considered not only in anchor texts but also across the web as a whole.
The different ranking scores must be integrated into a single ranking function,
which is a complex process. Comparing the SVM ranks with previously proposed
ranking methods is also difficult.

1.4 ORGANISATION OF DOCUMENTATION


This project documentation first states the definition and objective of the project
and then presents its design, followed by the implementation and testing phases.
It closes with the conclusion of the project and its possible future enhancements.

CHAPTER 2
LITERATURE SURVEY

2.1 INTRODUCTION
A literature survey is the most important step in the software development process.
Before developing the tool, it is necessary to determine the time factor, the economy,
and the company's strength. Once these things are satisfied, the next step is to
determine which operating system and language can be used for developing the tool.
Once the programmers start building the tool, they need a lot of external support.
This support can be obtained from senior programmers, from books, or from websites.
These considerations are taken into account before building the proposed system.

2.2 EXISTING SYSTEM


The existing namesake disambiguation algorithm assumes the real name of a
person to be given and does not attempt to disambiguate people who are referred to
only by aliases.
2.2.1 DISADVANTAGES OF EXISTING SYSTEM
1) Low MRR and AP scores on all data sets.
2) An overly complex hub discounting measure.

2.3 PROPOSED SYSTEM


The proposed method works on the aliases and finds the association orders
between the name and its aliases. This helps a search engine tag those aliases
according to their orders (first order associations, second order associations, and
so on), so as to substantially increase the recall and MRR of the search engine for
searches on person names.
The term recall is defined as the percentage of relevant documents that were in fact
retrieved for a search query on a search engine. The mean reciprocal rank of the
search engine for a given sample of queries is the average of the reciprocal ranks for
each query. The term word co-occurrence refers to the temporal property of two
words occurring on the same web page or in the same document on the web. Anchor
text is the clickable text on a web page, which points to a particular web document.
Anchor texts are used by search engine algorithms to provide relevant documents
for search results, because they point to the web pages that are relevant to the user
queries. Anchor texts are therefore helpful for finding the strength of association
between two words on the web. An anchor texts-based co-occurrence means that two
anchor texts from different web pages point to the same URL on the web. The anchor
texts which point to the same URL are called inbound anchor texts.
The proposed method finds the anchor texts-based co-occurrences between the name
and aliases using co-occurrence statistics, and ranks the name and aliases with a
support vector machine according to the co-occurrence measures, in order to get the
connections among the name and aliases for drawing the word co-occurrence graph.
A word co-occurrence graph is then created and mined by a graph mining algorithm
to get the hop distance between the name and each alias, which gives the association
orders of the aliases with the name. The search engine can now expand the
search query on a name by tagging the aliases according to their association orders to
retrieve all relevant pages, which in turn will increase the recall and achieve a
substantial MRR.
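The two evaluation measures defined above can be written down directly. The following Python sketch is illustrative only: the function names and the sample document and alias identifiers are assumptions, and recall is returned as a fraction (multiply by 100 for the percentage used in the definition).

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that were actually retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


def mean_reciprocal_rank(rankings, correct):
    """Average, over all queries, of 1/rank of the first correct answer.

    A query whose ranking contains no correct answer contributes 0.
    """
    total = 0.0
    for query, ranking in rankings.items():
        for rank, item in enumerate(ranking, start=1):
            if item in correct[query]:
                total += 1.0 / rank
                break
    return total / len(rankings)


r = recall(["d1", "d2", "d3"], ["d1", "d3", "d4", "d5"])
mrr = mean_reciprocal_rank(
    {"q1": ["a", "b"], "q2": ["x", "y", "z"]},
    {"q1": {"a"}, "q2": {"y"}},
)
print(r, mrr)  # 0.5 0.75
```

Here two of the four relevant documents are retrieved, and the first correct answers appear at ranks 1 and 2, giving (1/1 + 1/2) / 2.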

2.4 CONCLUSION
This paper presents a software application which avoids manual hours and helps
the customer to track the status of applications. It is not secure to maintain important
information manually; it is better to store it in a database, which helps in avoiding
conflicts. No specific training is required for the employees to use this software
application.

CHAPTER 3
ANALYSIS

3.1 INTRODUCTION
FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal is
put forth with a very general plan for the project and some cost estimates. During
system analysis, the feasibility study of the proposed system is carried out. This
is to ensure that the proposed system is not a burden to the company. For feasibility
analysis, some understanding of the major requirements for the system is essential.
The three key considerations involved in the feasibility analysis are:
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can pour into the
research and development of the system is limited, so the expenditures must be
justified. The developed system is well within the budget, and this was achieved
because most of the technologies used are freely available. Only the customized
products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical
requirements of the system. Any system developed must not place a high demand on
the available technical resources, since this would lead to high demands being placed
on the client. The developed system must have modest requirements, as only minimal
or no changes are required for implementing this system.
SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently. The
user must not feel threatened by the system, but must instead accept it as a necessity.
The level of acceptance by the users depends solely on the methods that are employed
to educate the users about the system and to make them familiar with it. Their level
of confidence must be raised so that they are also able to offer constructive criticism,
which is welcomed, as they are the final users of the system.

3.2 SOFTWARE SYSTEM SPECIFICATION


3.2.1 PURPOSE OF THE PROJECT
To train and evaluate the proposed method, there are two data sets: the personal
names data set and the place names data set. The personal names data set includes
people from various fields such as cinema, sports, politics, science, and mass media.
The place names data set contains aliases for US states. According to the definition
of co-occurrences, if two anchor texts co-occur in pointing to the same URL, then an
undirected edge is drawn between them to denote their co-occurrence. A word
co-occurrence graph is created for the name and aliases according to the first order
associations among them. Each name and alias is represented by a node in the graph.
Two nodes are connected if there is a first order association between them. An edge
between two nodes indicates that the anchor texts they bear co-occur according to the
definition of anchor text co-occurrences. Next, a graph mining algorithm identifies
the hop distance between nodes in order to obtain the first, second, and higher order
associations between the name and aliases.
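The edge rule above (two anchor texts are linked when they point to the same URL) can be sketched as follows. This is a minimal Python illustration: the function name, the adjacency-set representation, and the example.com URLs are assumptions, and it draws an edge for every co-occurring pair rather than only for the pairs selected by the ranking step.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(links):
    """Build an undirected graph from (anchor_text, url) pairs.

    Returns a dict mapping each anchor text to the set of anchor texts
    that point to at least one URL in common with it.
    """
    anchors_by_url = defaultdict(set)
    for anchor, url in links:
        anchors_by_url[url].add(anchor)

    graph = defaultdict(set)
    for anchors in anchors_by_url.values():
        # Every pair of inbound anchor texts of the same URL co-occurs.
        for a, b in combinations(sorted(anchors), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


links = [
    ("Hideki Matsui", "http://example.com/a"),
    ("Godzilla", "http://example.com/a"),
    ("Godzilla", "http://example.com/b"),
    ("baseball", "http://example.com/b"),
]
graph = build_cooccurrence_graph(links)
```

In this toy input, Hideki Matsui and baseball never share a URL, so they are connected only through Godzilla.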
3.2.2 SCOPE OF THE PROJECT
Extracting aliases of an entity is important for various tasks such as identification
of relations among entities, web search and entity disambiguation. To extract relations
among entities properly, one must first identify those entities. We propose a novel
approach to find aliases of a given name using automatically extracted lexical
patterns. We exploit a set of known names and their aliases as training data and extract
lexical patterns that convey information related to aliases of names from text snippets
returned by a web search engine. The patterns are then used to find candidate aliases
of a given name. We use anchor texts to design a word co-occurrence model and use it
to define various ranking scores to measure the association between a name and a
candidate alias. The ranking scores are integrated with page-count-based association
measures using support vector machines to leverage a robust alias detection method.
The proposed method outperforms numerous baselines and previous work on alias
extraction on a dataset of personal names, achieving a statistically significant mean
reciprocal rank of 0.6718. Experiments carried out using a dataset of location names
and Japanese personal names suggest the possibility of extending the proposed
method to extract aliases for different types of named entities and for other languages.
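To make the idea of lexical patterns concrete, the Python sketch below applies two hand-written patterns to text snippets and collects whatever fills the alias slot. The two patterns, the function name, and the sample snippets are invented for illustration; in the proposed approach the patterns are extracted automatically from training data, not written by hand.

```python
import re

# Hypothetical patterns: {name} is replaced by the query name and the
# parenthesised group captures the candidate alias.
PATTERNS = [
    r"{name},? also known as ([\w ]+?)[,.]",
    r"{name},? nicknamed ([\w ]+?)[,.]",
]


def extract_candidates(name, snippets):
    """Return the set of candidate aliases found for name in the snippets."""
    candidates = set()
    for template in PATTERNS:
        pattern = re.compile(template.format(name=re.escape(name)))
        for snippet in snippets:
            candidates.update(match.strip() for match in pattern.findall(snippet))
    return candidates


snippets = [
    "Will Smith, also known as the Fresh Prince, is an actor.",
    "Hideki Matsui, nicknamed Godzilla, played baseball.",
]
print(extract_candidates("Will Smith", snippets))
print(extract_candidates("Hideki Matsui", snippets))
```

The candidates found this way are only candidates; they still have to be ranked before any of them is accepted as an alias.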

SRS MODEL

Fig3.1: Steps to develop the system analysis model


The waterfall model is a sequential design process, often used in software
development processes, in which progress is seen as flowing steadily downwards (like
a waterfall) through the phases of Conception, Initiation, Analysis, Design,
Construction, Testing, Production/Implementation, and Maintenance. The waterfall
development model originates in the manufacturing and construction industries:
highly structured physical environments in which after-the-fact changes are
prohibitively costly, if not impossible. Since no formal software development
methodologies existed at the time, this hardware-oriented model was simply adapted
for software development.
The first known presentation describing the use of similar phases in software
engineering was given by Herbert D. Benington at the Symposium on Advanced
Programming Methods for Digital Computers on 29 June 1956. This presentation was
about the development of software for SAGE. In 1983 the paper was republished with
a foreword by Benington pointing out that the process was not in fact performed in a
strict top-down fashion, but depended on a prototype.
The first formal description of the waterfall model is often cited as a 1970 article
by Winston W. Royce, although Royce did not use the term "waterfall" in this article.
Royce presented this model as an example of a flawed, non-working model. This, in
fact, is how the term is generally used in writing about software development to
describe a critical view of a commonly used software development practice.

3.3 SYSTEM CONFIGURATION


3.3.1 HARDWARE SYSTEM CONFIGURATION
Processor             : Pentium III/IV
Speed                 : 1.1 GHz
RAM                   : 256 MB (min)
Hard Disk             : 20 GB
Floppy Drive          : 1.44 MB
Key Board             : Standard Windows Keyboard
Mouse                 : Two or Three Button Mouse
Monitor               : SVGA

3.3.2 SOFTWARE SYSTEM CONFIGURATION
Operating System      : Windows 95/98/2000/XP
Application Server    : Tomcat 5.0/6.x
Front End             : HTML, Java, JSP
Scripts               : JavaScript
Server Side Script    : Java Server Pages
Database              : MySQL
Database Connectivity : JDBC

3.4 SPECIFIC PROJECT SOFTWARE

Java Technology
The language was initially called Oak when it was conceived in 1991 and was
renamed Java in 1995. The primary motivation for this language was the need for a
platform-independent (i.e., architecture neutral) language that could be used to create
software to be embedded in various consumer electronic devices.
Java is a programmer's language.
Java is cohesive and consistent.
Except for those constraints imposed by the Internet environment, Java gives the
programmer full control.
Finally, Java is to Internet programming what C was to system programming.

Importance of Java to the Internet
Java has had a profound effect on the Internet, because Java expands the universe
of objects that can move about freely in cyberspace. In a network, two categories of
objects are transmitted between the server and the personal computer: passive
information and dynamic, active programs. Dynamic, self-executing programs cause
serious problems in the areas of security and portability. Java addresses those
concerns and, by doing so, has opened the door to an exciting new form of program
called the applet.
Java can be used to create two types of programs: applications and applets. An
application is a program that runs on our computer under the operating system of
that computer. It is more or less like one created using C or C++. Java's ability to
create applets is what makes it important. An applet is an application designed to be
transmitted over the Internet and executed by a Java compatible web browser. An
applet is actually a tiny Java program, dynamically downloaded across the network,
just like an image. The difference is that it is an intelligent program, not just a media
file. It can react to user input and dynamically change.
Java Virtual Machine (JVM)
Beyond the language, there is the Java virtual machine. The Java virtual
machine is an important element of the Java technology. The virtual machine can be
embedded within a web browser or an operating system. Once a piece of Java code is
loaded onto a machine, it is verified. As part of the loading process, a class loader is
invoked and performs byte code verification, which makes sure that the code that has
been generated by the compiler will not corrupt the machine that it is loaded on. Byte
code verification takes place at the end of the compilation process to make sure that
it is all accurate and correct. So byte code verification is integral to the compiling and
executing of Java code.
THE THREE OOP PRINCIPLES
A) ENCAPSULATION
Encapsulation is the mechanism that binds together code and the data it
manipulates, and keeps both safe from outside interference and misuse. One way to
think about encapsulation is as a protective wrapper that prevents the code and data
from being arbitrarily accessed by other code defined outside of the wrapper. Access
to the code and data inside the wrapper is tightly controlled through a well-defined
interface. To relate this to the real world, consider the automatic transmission on an
automobile. It encapsulates hundreds of bits of information about your engine, such as
how much you are accelerating, the pitch of the surface you are on, and the position of
the shift lever.
B) INHERITANCE
Inheritance is the process by which one object acquires the properties of
another object. This is important because it supports the concept of hierarchical
classification. Most knowledge is made manageable by hierarchical classification.
Inheritance interacts with encapsulation as well. If a given class encapsulates some
attributes, then any subclass will have the same attributes plus any that it adds as
part of its specialization.
C) POLYMORPHISM
Polymorphism (from the Greek, meaning "many forms") is a feature that
allows one interface to be used for a general class of actions. More generally, the
concept of polymorphism is often expressed by the phrase "one interface, multiple
methods". This means that it is possible to design a generic interface to a group of
related activities. This helps reduce complexity by allowing the same interface to be
used to specify a general class of action. It is the compiler's job to select the specific
action as it applies to each situation.
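The three principles can be seen together in one small example. The sketch below is written in Python rather than the project's Java, purely to keep it short, and the Shape, Rectangle, and Square classes are invented for illustration. Note one difference from the description above: Python selects the specific method at run time rather than at compile time.

```python
class Shape:
    """Base class; each subclass supplies its own area()."""

    def __init__(self, name):
        self._name = name  # state kept behind methods (encapsulation)

    def name(self):
        return self._name

    def area(self):
        raise NotImplementedError

    def describe(self):
        # One interface, multiple methods: area() resolves per subclass.
        return "{} with area {}".format(self.name(), self.area())


class Rectangle(Shape):
    def __init__(self, width, height, name="rectangle"):
        super().__init__(name)
        self._width = width
        self._height = height

    def area(self):
        return self._width * self._height


class Square(Rectangle):
    """Reuses the attributes and area() of Rectangle (inheritance)."""

    def __init__(self, side):
        super().__init__(side, side, name="square")


descriptions = [shape.describe() for shape in (Rectangle(2, 3), Square(4))]
print(descriptions)
```

The caller uses only describe() and never needs to know which concrete class it is dealing with.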
MYSQL
MySQL is a relational database management system (RDBMS), and ships
with no GUI tools to administer MySQL databases or manage data contained within
the databases. Users may use the included command line tools, or use MySQL
"front-ends", desktop software and web applications that create and manage MySQL
databases, build database structures, back up data, inspect status, and work with data
records. The official set of MySQL front-end tools, MySQL Workbench, is actively
developed by Oracle and is freely available for use.

Graphical Package
The official MySQL Workbench is a free integrated environment developed by
MySQL AB that enables users to graphically administer MySQL databases and
visually design database structures. MySQL Workbench replaces the previous
package of software, MySQL GUI Tools. Similar to other third-party packages, but
still considered the authoritative MySQL front end, MySQL Workbench lets users
manage database design and modeling, SQL development (replacing MySQL Query
Browser), and database administration (replacing MySQL Administrator).
MySQL Workbench is available in two editions: the regular free and open source
Community Edition, which may be downloaded from the MySQL website, and the
proprietary Standard Edition, which extends and improves the feature set of the
Community Edition.
MySQL ships with many command line tools, of which the main interface is the
'mysql' client. Third parties have also developed tools to manage, optimize, monitor,
and back up a MySQL server. All these tools work on *NIX type operating systems.
As of April 2009, MySQL offered MySQL 5.1 in two different variants: the open
source MySQL Community Server and the commercial Enterprise Server. MySQL 5.5
is offered under the same licences. They have a common code base and include the
following features:

A broad subset of ANSI SQL 99, as well as extensions
Cross-platform support
Stored procedures
Triggers
Cursors
Updatable Views
Information schema
Strict mode
X/Open XA distributed transaction processing (DTP) support; two phase commit as part of this, using Oracle's InnoDB engine
Independent storage engines
Transactions with the InnoDB and Cluster storage engines; savepoints with InnoDB
SSL support

Limitations
Like other SQL databases, MySQL does not currently comply with the full
SQL standard for some of the implemented functionality, including foreign key
references when using some storage engines other than the 'standard' InnoDB.
Triggers are currently limited to one per action/timing, i.e., a maximum of one after
insert and one before insert on the same table. There are no triggers on views.
MySQL, like most other transactional relational databases, is strongly limited by hard
disk performance. This is especially true in terms of write latency. Given the recent
appearance of very affordable consumer grade SATA interface solid-state drives that
offer zero mechanical latency, a fivefold speedup over even an eight drive RAID
array can be had for a smaller investment.
Windows Deployment
MySQL can be built and installed manually from source code, but this can be
tedious, so it is more commonly installed from a binary package unless special
customizations are required. On most Linux distributions the package management
system can download and install MySQL with minimal effort, though further
configuration is often required to adjust security and optimization settings.
Though MySQL began as a low-end alternative to more powerful proprietary
databases, it has gradually evolved to support higher-scale needs as well. It is still
most commonly used in small to medium scale single-server deployments.
Community
The MySQL server software itself and the client libraries use dual-licensing
distribution. They are offered either under GPL version 2, beginning from 28 June
2000 (which in 2009 was extended with a FLOSS License Exception), or under a
proprietary license.
Support can be obtained from the official manual. Free support additionally is
available in different IRC channels and forums. Oracle offers paid support via its
MySQL Enterprise products.
Apache Tomcat
Apache Tomcat (or simply Tomcat, formerly also Jakarta Tomcat) is an open
source web server and servlet container developed by the Apache Software
Foundation (ASF). Tomcat implements the Java Servlet and the JavaServer Pages
(JSP) specifications from Sun Microsystems, and provides a "pure Java" HTTP web
server environment for Java code to run. Apache Tomcat includes tools for
configuration and management.
Cluster
This component has been added to manage large applications. It is used for load
balancing, which can be achieved through many techniques. Clustering support
currently requires JDK version 1.5 or later.
High availability
A high-availability feature has been added to facilitate the scheduling of
system upgrades (e.g. new releases, change requests) without affecting the live
environment. This is done by dispatching live traffic requests to a temporary server on
a different port while the main server is upgraded on the main port. It is very useful in
handling user requests on high-traffic web applications.[2]
Web Application
User-based and system-based web application enhancements have also been added
to support deployment across a variety of environments. Tomcat also tries to
manage sessions as well as applications across the network.
A number of additional components may be used with Apache Tomcat. These
components may be built by users should they need them, or they can be downloaded
from one of the mirrors.
Features
Tomcat 7.x implements the Servlet 3.0 and JSP 2.2 specifications. It requires
Java version 1.6, although previous versions have run on Java 1.1 through 1.5.
Versions 5 through 6 saw improvements in garbage collection, JSP parsing,
performance and scalability. Native wrappers, known as "Tomcat Native", are
available for Microsoft Windows and Unix for platform integration.

3.5 CONTENT DIAGRAM OF THE PROJECT

Fig3.2: Architecture of proposed method
The architecture outlined in Fig3.2 comprises four main components, namely
computation of word co-occurrence statistics, ranking of anchor texts, creation of the
anchor text co-occurrence graph, and discovery of association orders. Anchor text
mining is used to measure the associations between anchor texts. A ranking support
vector machine (SVM) is used to rank the anchor texts with respect to each anchor
text, in order to identify the highest ranking anchor text for making first order
associations among anchor texts. The whole process carried out in the project is
described as follows.
The input is given to the Google search engine in the form of an allinanchor: query.
The search engine retrieves all the corresponding anchor texts and URLs for the given
input, and those anchor texts and URLs are kept in a table called the contingency
table, in which the anchor texts and URLs are arranged in two columns. After the
contingency table is created, the word co-occurrence frequency can be computed.
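A minimal Python sketch of this step is given below, assuming the two-column table is held as a list of (anchor text, URL) rows. The function name, the row data, and the short URL labels u1 to u5 are invented for illustration. The co-occurrence frequency of a name and an alias is taken here to be the number of distinct URLs that both anchor texts point to; the other three counts complete the usual 2x2 contingency view of the pair.

```python
def contingency_counts(rows, name, alias):
    """Count distinct URLs by which of the two anchor texts point to them."""
    all_urls = {url for _, url in rows}
    name_urls = {url for anchor, url in rows if anchor == name}
    alias_urls = {url for anchor, url in rows if anchor == alias}

    both = len(name_urls & alias_urls)        # co-occurrence frequency
    only_name = len(name_urls - alias_urls)
    only_alias = len(alias_urls - name_urls)
    neither = len(all_urls) - both - only_name - only_alias
    return {"both": both, "only_name": only_name,
            "only_alias": only_alias, "neither": neither}


rows = [
    ("Hideki Matsui", "u1"), ("Godzilla", "u1"),
    ("Hideki Matsui", "u2"), ("Godzilla", "u2"),
    ("Hideki Matsui", "u3"),
    ("Godzilla", "u4"),
    ("baseball", "u5"),
]
counts = contingency_counts(rows, "Hideki Matsui", "Godzilla")
print(counts)
```

Association measures for ranking can then be computed from these four counts.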
In the second part of the architecture, the training data set is presented as input
to the support vector machine. The support vector machine generates a ranking
function, which is given to the ranking algorithm. The ranking algorithm generates
the first order associations.
In the third part of the architecture, the graph drawing algorithm is applied to get
the connections among the name and aliases. From that algorithm the word
co-occurrence graph is created. The word co-occurrence graph is then mined by a
graph mining algorithm, which finds the hop distances. Finally, the association orders
are discovered using the hop distances.

3.6 ALGORITHM
3.6.1 Keyword Extraction Algorithm
Matsuo and Ishizuka proposed a keyword extraction algorithm that applies to a
single document without using a corpus. Frequent terms are extracted first, and then
a set of co-occurrences between each term and the frequent terms, i.e., occurrences
in the same sentences, is generated. The co-occurrence distribution shows the
importance of a term in the document. However, this method only extracts keywords
from a single document and does not correlate further documents using anchor text
based co-occurrence frequency.
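
The first two steps described above (extracting frequent terms, then counting
sentence-level co-occurrences with them) can be sketched as follows. This is only a
simplified illustration: the class name CooccurrenceSketch, the tokenization rule and
the minCount threshold are assumptions made for this example, and the statistical
scoring of the original Matsuo and Ishizuka method is not reproduced.

```java
import java.util.*;

/** Simplified sketch of sentence-level co-occurrence counting (hypothetical helper class). */
class CooccurrenceSketch {

    /** Splits text into sentences, then each sentence into lower-case word tokens. */
    static List<List<String>> tokenize(String text) {
        List<List<String>> sentences = new ArrayList<>();
        for (String s : text.split("[.!?]+")) {
            List<String> words = new ArrayList<>();
            for (String w : s.toLowerCase().split("[^a-z0-9]+")) {
                if (!w.isEmpty()) words.add(w);
            }
            if (!words.isEmpty()) sentences.add(words);
        }
        return sentences;
    }

    /** Returns the terms whose total frequency is at least minCount (assumed threshold). */
    static Set<String> frequentTerms(List<List<String>> sentences, int minCount) {
        Map<String, Integer> freq = new HashMap<>();
        for (List<String> s : sentences) {
            for (String w : s) freq.merge(w, 1, Integer::sum);
        }
        Set<String> frequent = new TreeSet<>();
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            if (e.getValue() >= minCount) frequent.add(e.getKey());
        }
        return frequent;
    }

    /** Counts in how many sentences the given term appears together with each frequent term. */
    static Map<String, Integer> cooccurrences(List<List<String>> sentences,
                                              String term, Set<String> frequent) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> s : sentences) {
            Set<String> distinct = new HashSet<>(s);
            if (!distinct.contains(term)) continue;
            for (String f : frequent) {
                if (!f.equals(term) && distinct.contains(f)) counts.merge(f, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A term whose co-occurrence counts are concentrated on a few frequent terms would be a
keyword candidate under the original method; the sketch stops at producing the counts.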

3.7 FLOW CHART


Start -> Extract Frequent Item Set -> Generate Occurrences in the Same Sentence ->
Extract a Keyword -> Stop
Fig3.3: Flow chart of the project

3.8 CONCLUSION
In this phase, we studied the software requirements specification for the project.
We identified all the components required to develop the project in this phase
itself, so that we have a clear idea of the requirements before designing the
project. We then proceed to the design phase, followed by the implementation
phase of the project.


CHAPTER 4
DESIGN


4.1 INTRODUCTION
4.1.1 INPUT DESIGN
The input design is the link between the information system and the user. It
comprises developing specifications and procedures for data preparation, and the
steps necessary to put transaction data into a usable form for processing. This can
be achieved by instructing the computer to read data from a written or printed
document, or by having people key the data directly into the system. The design
of input focuses on controlling the amount of input required, controlling errors,
avoiding delay, avoiding extra steps and keeping the process simple. The input is
designed so that it provides security and ease of use while retaining privacy.
Input design considered the following things:
What data should be given as input?
How should the data be arranged or coded?
The dialog to guide the operating personnel in providing input.
Methods for preparing input validations and the steps to follow when errors
occur.
4.1.2 OBJECTIVES
Input design is the process of converting a user-oriented description of the input
into a computer-based system. This design is important to avoid errors in the data
input process and to show management the correct direction for getting correct
information from the computerized system.
It is achieved by creating user-friendly screens for data entry that can handle a
large volume of data. The goal of designing input is to make data entry easier and
free from errors. The data entry screen is designed in such a way that all data
manipulations can be performed. It also provides record viewing facilities.
When data is entered, it is checked for validity. Data can be entered with the help
of screens. Appropriate messages are provided as and when needed so that the user
is never left confused. Thus the objective of input design is to create an input
layout that is easy to follow.
4.1.3 OUTPUT DESIGN

A quality output is one which meets the requirements of the end user and
presents the information clearly. In any system, the results of processing are
communicated to the users and to other systems through outputs. In output design it
is determined how the information is to be displayed for immediate need and also as
hard copy output. It is the most important and direct source of information to the
user. Efficient and intelligent output design improves the system's relationship
with the user and helps decision-making.
Designing computer output should proceed in an organized, well thought out
manner; the right output must be developed while ensuring that each output element
is designed so that people will find the system easy and effective to use. When
analysts design computer output, they should:
1. Identify the specific output that is needed to meet the requirements.
2. Select methods for presenting information.
3. Create documents, reports, or other formats that contain information produced by
the system.
The output form of an information system should accomplish one or more of the
following objectives:
Convey information about past activities, current status or projections of the
future.
Signal important events, opportunities, problems, or warnings.
Trigger an action.
Confirm an action.

4.2 UML DIAGRAMS


4.2.1 INTRODUCTION
UML is a method for describing the system architecture in detail using a
blueprint. UML represents a collection of best engineering practices that have proven
successful in the modeling of large and complex systems.
UML is a very important part of developing object-oriented software and of the
software development process.
UML uses mostly graphical notations to express the design of software projects.
Using the UML helps project teams communicate, explore potential designs, and
validate the architectural design of the software.
Definition

UML is a general-purpose visual modeling language that is used to specify,
visualize, construct, and document the artifacts of a software system.
UML is a language
It provides a vocabulary and rules for communication that focus on the conceptual
and physical representation of a system. So it is a modeling language.
UML Specifying
Specifying means building models that are precise, unambiguous and complete.
In particular, the UML addresses the specification of all the important analysis,
design and implementation decisions that must be made in developing and deploying a
software intensive system.
UML Visualization
The UML includes both graphical and textual representation. This makes it easy to
visualize the system and leads to better understanding.
UML Constructing
UML models can be directly connected to a variety of programming languages,
and the UML is sufficiently expressive and free from ambiguity to permit the direct
execution of models.
UML Documenting
UML provides a variety of documents in addition to raw executable code.

Fig 4.1: Modeling a System Architecture using views of UML


The use case view of a system encompasses the use cases that describe the
behavior of the system as seen by its end users, analysts, and testers. The design view

of a system encompasses the classes, interfaces, and collaborations that form the
vocabulary of the problem and its solution. The process view of a system
encompasses the threads and processes that form the system's concurrency and
synchronization mechanisms. The implementation view of a system encompasses the
components and files that are used to assemble and release the physical system. The
deployment view of a system encompasses the nodes that form the system's hardware
topology on which the system executes.
Uses of UML
The UML is intended primarily for software intensive systems. It has been used
effectively for such domains as
Enterprise Information Systems
Banking and Financial Services
Telecommunications
Transportation
Defense/Aerospace
Retail
Medical Electronics
Scientific Fields
Distributed Web
Building blocks of UML
The vocabulary of the UML encompasses 3 kinds of building blocks
Things
Relationships
Diagrams
Things
Things are the abstractions that are first class citizens in a model. Things are of
4 types:
Structural Things, Behavioral Things, Grouping Things, Annotational Things
Relationships
Relationships tie the things together. Relationships in the UML are
Dependency, Association, Generalization, Realization
UML Diagrams
A diagram is the graphical presentation of a set of elements, most often rendered
as a connected graph of vertices (things) and arcs (relationships).

There are two types of diagrams: Structural Diagrams and Behavioral Diagrams.
Structural Diagrams
The UML's four structural diagrams exist to visualize, specify, construct and
document the static aspects of a system. The static parts of a system can be viewed
using one of the following diagrams: Class Diagram, Object Diagram, Component
Diagram, and Deployment Diagram.
Behavioral Diagrams
The UML's five behavioral diagrams are used to visualize, specify, construct, and
document the dynamic aspects of a system. The UML's behavioral diagrams are
roughly organized around the major ways in which the dynamics of a system can be
modeled. Behavioral diagrams consist of
Use case Diagram, Sequence Diagram, Collaboration Diagram, State chart Diagram,
Activity Diagram
4.2.2 CLASS DIAGRAM
Class diagrams are widely used to describe the types of objects in a system and
their relationships. Class diagrams model class structure and contents using design
elements such as classes, packages and objects. Class diagrams can be drawn from
three different perspectives when designing a system: conceptual, specification, and
implementation. These perspectives become evident as the diagram is created and
help solidify the design. Class diagrams are arguably the most used UML diagram
type. The class diagram is the main building block of any object oriented solution.
It shows the classes in a system, the attributes and operations of each class, and
the relationships between classes. In most modeling tools a class has three parts:
the name at the top, attributes in the middle, and operations or methods at the
bottom. In large systems with many classes, related classes are grouped together to
create class diagrams. Different relationships between classes are shown by
different types of arrows.
Class: user
Attributes: +name, +password, +phone number, +gender
Operations: +upload(), +search()

Class: image
Attributes: +keyword, +alias, +name, +description
Operations: +insert(), +retrive()



Fig 4.2: class diagram


4.2.3 USE-CASE DIAGRAM
A use case is a set of scenarios that describe an interaction between a user and a
system. A use case diagram displays the relationships among actors and use cases.
The two main components of a use case diagram are use cases and actors.

Fig 4.3: Elements of use-case diagram


An actor represents a user or another system that will interact with the system
you are modeling. A use case is an external view of the system that represents some
action the user might perform in order to complete a task.
Contents:
Use cases
Actors
Dependency, Generalization, and association relationships
System boundary

[Actor: User. Use cases: enter character, select character]

Fig4.4: Use case diagram


4.2.4 SEQUENCE DIAGRAM

Sequence diagrams in UML show how objects interact with each other and the
order in which those interactions occur. It is important to note that they show the
interactions for a particular scenario. The processes are represented vertically and
interactions are shown as arrows.

[Participants: user, enter character. Messages: 1 : select character(),
2 : draw character()]

Fig4.5: Sequence diagram


4.2.5 COLLABORATION DIAGRAM
The communication diagram was called a collaboration diagram in UML 1. It is
similar to a sequence diagram, but the focus is on the messages passed between
objects. The same information can also be represented using a sequence diagram.

[Objects: user, enter character]

Fig4.6: Collaboration diagram


4.2.6 ACTIVITY DIAGRAM
Activity diagrams describe the workflow behavior of a system. Activity diagrams
are similar to state diagrams because activities are the state of doing something. The
diagrams describe the state of activities by showing the sequence of activities
performed. Activity diagrams can show activities that are conditional or parallel.
Activity diagrams should be used in conjunction with other modeling techniques
such as interaction diagrams and state diagrams. The main reason to use activity
diagrams is to model the workflow behind the system being designed. Activity
Diagrams are also useful for: analyzing a use case by describing what actions need to
take place and when they should occur; describing a complicated sequential
algorithm; and modeling applications with parallel processes.

User login -> Authenticated? (yes / No) -> Upload image -> Search image with
aliases -> Getting results


Fig 4.7: Activity diagram


4.2.7 STATE CHART DIAGRAMS
State chart diagrams are similar to activity diagrams, although the notation and
usage differ slightly. They are sometimes also known as state diagrams or state
machine diagrams. The state chart diagram below shows the basic states and actions.

[States: user, enter character, select character]

Fig4.8: State chart diagram


4.2.8 COMPONENT DIAGRAM
A component diagram displays the structural relationships between the components
of a software system. It is mostly used when working with complex systems that
have many components. The figure below shows a component diagram.

[Components: Home Page, user, Login, Register, Upload Image, Search Image with
Aliases, Getting Results]


Fig 4.9: Component diagram


4.2.9 DEPLOYMENT DIAGRAM
A deployment diagram shows the hardware of your system and the software that
runs on that hardware. Deployment diagrams are useful when your software solution
is deployed across multiple machines, each having a unique configuration.

[Nodes: Search Image with Aliases, Upload Image]

Fig 4.10: Deployment diagram

4.3 LOGICAL DESIGN


Logical design is a process through which requirements are translated into a
representation of software. Initially the representation depicts a holistic view of
the software. Subsequent refinement leads to a design representation that is very
close to source code.
The conceptual structure of a database is called a schema. A schema shows the
kinds of data that exist in a database and how these kinds of data are logically
related to each other. A schema can be regarded as a blueprint that portrays both
the kinds of data used in building a database and the logical relationships that
exist among them. At the minimum, the schema must represent all needed data items,
must correctly represent their interrelationships, and must be able to support all
reports. The schema is frequently depicted pictorially using Data Flow Diagrams
(DFD).
4.3.1 DATA FLOW DIAGRAMS
Data Flow Diagrams (DFD) depict information flow and the transforms that are
applied as data move from input to output. The DFD is also known as a Data Flow
Graph or Bubble Chart. It is the starting point of the design phase and functionally
decomposes the requirement specification down to the lowest level of detail. Thus, a
DFD describes what data flows (logical) rather than how the data is processed.
Data Flow Diagrams are made up of a number of symbols which represent system
components. Data flow modeling methods use four kinds of symbols. These symbols
are used to represent four kinds of system components: Processes, Data Stores,
External Entities, and Data Flows.


Processes
Processes show what the system does. Each process has one or more data inputs and
produces one or more data outputs. Processes are represented by rounded rectangles
in a DFD.
Data Stores
A file or data store is a repository of data. Processes can enter data into a store
or retrieve data from a data store. Each data store is represented by a line in the
DFD and has a unique name.
External Entities
External entities are outside the system, but they either supply input data to the
system or use the system output. They are entities over which the designer has no
control. They may be organizations or other bodies with which the system interacts.
Data Flows
Data flows model the passage of data in the system and are represented by the lines
joining the system components. An arrow indicates the direction of flow, and the
line is labeled with the name of the data flow. Flow of data in the system can take
place
Between two processes
From a data store to a process
From a process to a data store
From a source to a process
4.3.2 Level 0 DFD Diagram

User Login -> Upload Image -> Search Image with Aliases -> Getting Results

Fig 4.11: level 0 data flow diagram


First the user logs into the system with his user id and password and uploads the
image. The original name and nick names (aliases) of the celebrity are given at the
time of image uploading. After the image has been uploaded successfully, the user
searches for the image by its original name or by its aliases, and then gets the
results for the matching image.
4.3.3 Level 1 DFD diagram

User Login -> Upload Image (Alias Names) -> Search Image (With Alias Names) ->
Getting Results

Fig 4.12: level 1 dataflow diagram


First the user logs into the system with his user id and password if he is
authenticated. If the user is not authenticated, he must first register with the
system and only then log in. The image is then uploaded, and the original name and
aliases are given at that time. The user then searches for the image by alias name
and gets the appropriate results for that search. The only difference between the
level 0 and level 1 DFDs is that the level 1 DFD gives a more detailed description.

4.4 DATABASE DESIGN


Table name: Admin
Function  : This table contains two fields, name and password. The admin logs in
            with his name and password, after which he can browse and upload the
            photos.

Field       Type            Null
name        varchar(255)    NO
password    varchar(255)    NO

Table 4.1: Admin table

Table name: Comment
Function  : This table contains four fields: id, name, comment and date. It is used
            to store comments on the photos uploaded by users. The user's name is
            also displayed with his comments.

Field       Type            Null
id          varchar(255)    NO
name        varchar(255)    NO
comment     varchar(255)    NO
date        varchar(255)    NO

Table 4.2: Comment table


Table name: New
Function  : The user can log in with his login id and upload photos with their tags
            and descriptions. The likes and comments are also stored here, and
            other users can put a comment on an already uploaded photo.

Field         Type            Null
image         blob            NO
tag           varchar(255)    NO
iname         varchar(255)    NO
id            int(255)        NO
count         int(255)        NO
comment       int(255)        NO
description   varchar(255)    NO
like          int(255)        NO
date          varchar(255)    NO

Table 4.3: New table


Table name: Reg
Function  : A new user registers with a name and e-mail. Once the registration
            process is completed successfully, the user logs in with that id and
            can browse.

Field       Type            Null
name        varchar(255)    NO
email       varchar(255)    NO
gender      varchar(255)    NO
phone       varchar(255)    NO
date        varchar(255)    NO

Table 4.4: Registration table


E-R DIAGRAM
An ER model is an abstract way to describe a database. Describing a database
usually starts with a relational database, which stores data in tables. Some of the
data in these tables point to data in other tables. For instance, your entry in the
database could point to several entries, one for each of the phone numbers that are
yours. The ER model would say that you are an entity, each phone number is an
entity, and the relationship between you and the phone numbers is 'has a phone
number'. Diagrams created to design these entities and relationships are called
entity-relationship diagrams or ER diagrams.
An entity may be defined as a thing which is recognized as being capable of
an independent existence and which can be uniquely identified. A relationship
captures how entities are related to one another. Relationships can be thought of
as verbs linking two or more nouns. Entities and relationships can both have
attributes.
Peter Chen, the father of ER modeling, said in his seminal paper:
"The entity-relationship model adopts the more natural view that the real world
consists of entities and relationships. It incorporates some of the important semantic
information about the real world."
Limitations
ER models assume information content that can readily be represented in a
relational database.
They describe only a relational structure for this information.
They are inadequate for systems



Fig4.13: E-R Diagram

4.5 MODULE DESIGNING AND ORGANIZATION


There are five modules in our project:
1. Co-occurrences in anchor texts
2. Role of anchor texts
3. Anchor text co-occurrence frequency
4. Ranking anchor texts
5. Discovery of association orders


1. Co-occurrences in Anchor Texts


The proposed method first retrieves from the search engine all URLs that
correspond to the anchor texts in which the name and aliases appear. Most search
engines provide search operators to search in anchor texts on the web. For example,
Google provides the inanchor: and allinanchor: search operators to retrieve URLs
that are pointed to by the anchor text given as a query. The query
allinanchor:Hideki Matsui to Google returns all URLs pointed to by the anchor text
Hideki Matsui on the web. As another example, the picture of Arnold Schwarzenegger
shown in Fig4.14 is linked by four different anchor texts. According to the
definition of co-occurrences on anchor texts, Terminator and Predator are
co-occurring. Likewise, The Expendables and Governator are also co-occurring.

[Anchor texts linking to the picture: Terminator, The Expendables, Predator,
Governator]

Fig4.14: A picture of Arnold Schwarzenegger being linked by different anchor
texts on the web
2. Role of Anchor Texts
The main objective of a search engine is to provide the most relevant documents
for a user's query. Anchor texts play a vital role in search engine algorithms
because an anchor text is clickable text that points to a particular relevant page
on the web. Hence a search engine considers anchor text a main factor in retrieving
documents relevant to the user's query. Anchor texts are used in synonym extraction,
in ranking and classification of web pages, and in query translation in cross
language information retrieval systems.


3. Anchor Texts Co-occurrence Frequency


Two anchor texts appearing in different web pages are called inbound anchor
texts of a URL if they both point to that URL. The anchor text co-occurrence
frequency between two anchor texts is the number of different URLs on which they
co-occur. For example, if two anchor texts p and x are co-occurring, then p and x
point to the same URL. If the co-occurrence frequency between p and x is, say, k,
then p and x co-occur on k different URLs.
Anchor texts

Table 4.5: Anchor text co-occurrence frequency table
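
The definition above can be illustrated with a short Java sketch that computes the
co-occurrence frequency k from the rows of the contingency table. The class name
AnchorCooccurrence, the representation of a row as a two-element array (anchor text,
URL), and the sample URLs in the usage note are assumptions made for this example
only; they are not taken from the project code.

```java
import java.util.*;

/** Hypothetical helper that derives co-occurrence frequencies from (anchor text, URL) rows. */
class AnchorCooccurrence {

    /** Groups the rows of the contingency table by URL: URL -> set of anchor texts pointing to it. */
    static Map<String, Set<String>> anchorsByUrl(List<String[]> rows) {
        Map<String, Set<String>> byUrl = new HashMap<>();
        for (String[] row : rows) {
            // row[0] is the anchor text, row[1] is the URL it points to
            byUrl.computeIfAbsent(row[1], k -> new HashSet<>()).add(row[0]);
        }
        return byUrl;
    }

    /** Returns k, the number of distinct URLs that both p and x point to. */
    static int frequency(List<String[]> rows, String p, String x) {
        int k = 0;
        for (Set<String> anchors : anchorsByUrl(rows).values()) {
            if (anchors.contains(p) && anchors.contains(x)) k++;
        }
        return k;
    }
}
```

For instance, if "Hideki Matsui" and "Godzilla" both point to http://example.com/a and
http://example.com/b (made-up URLs), frequency(rows, "Hideki Matsui", "Godzilla")
evaluates to 2.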


4. Ranking Anchor Texts
A ranking SVM is used for ranking the aliases. The ranking SVM is trained with
training samples of names and aliases. All the co-occurrence measures for the anchor
texts of the training samples are computed and normalized into the range [0, 1].
The normalized values, termed feature vectors, are used to train the SVM and obtain
the ranking function, which is then applied to the given anchor texts of the name
and aliases. For each anchor text, the trained SVM uses the ranking function to rank
the other anchor texts with respect to their co-occurrence measures with it. The
highest-ranking anchor text is selected to make a first order association with the
anchor text for which the ranking was performed. Next, the word co-occurrence graph
is drawn for the name and aliases according to the first order associations between
them. However, the earlier method considered only the first order co-occurrences on
aliases to rank them and did not use the second order co-occurrences to improve
recall and achieve a substantial MRR for the web search engine.
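
The selection of first order associations can be sketched in Java as follows. This is
not the ranking SVM itself: the score matrix stands in for the output of the trained
ranking function, min-max scaling is only one possible way to normalize values into
[0, 1] (the normalization actually used is not specified here), and the class and
method names are hypothetical.

```java
import java.util.*;

/** Hypothetical sketch: normalize features and pick the top-ranked partner of each anchor text. */
class FirstOrderAssociation {

    /** Min-max normalizes raw values into [0, 1]; a constant input maps to all zeros. */
    static double[] normalize(double[] values) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[values.length];
        if (max > min) {
            for (int i = 0; i < values.length; i++) {
                out[i] = (values[i] - min) / (max - min);
            }
        }
        return out;
    }

    /**
     * For each anchor text i, returns the other anchor text j with the highest score[i][j].
     * score[i][j] is assumed to be the ranking score of anchor j with respect to anchor i.
     */
    static Map<String, String> topRanked(List<String> anchors, double[][] score) {
        Map<String, String> best = new LinkedHashMap<>();
        for (int i = 0; i < anchors.size(); i++) {
            int arg = -1;
            for (int j = 0; j < anchors.size(); j++) {
                if (j == i) continue; // an anchor text is not ranked against itself
                if (arg < 0 || score[i][j] > score[i][arg]) arg = j;
            }
            if (arg >= 0) best.put(anchors.get(i), anchors.get(arg));
        }
        return best;
    }
}
```

Each entry of the returned map corresponds to one first order association, that is,
one edge of the word co-occurrence graph.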


[Graph nodes: Hideki Matsui, Matsui, Godzilla, Baseball, Yankees, New York, Sports]

Fig 4.15: word co-occurrence graph


5. Discovery of Association Orders
Using the graph mining algorithm, the word co-occurrence graph is mined to find
the hop distances between nodes in the graph. The hop distance between two nodes is
measured by counting the number of edges on the path between them. The number of
edges yields the association order between the two nodes. According to the
definition, a node that lies n hops away from p has an n-order co-occurrence with p.
Hence the first, second and higher order associations between the name and aliases
are identified by finding the hop distances between them. The search engine can now
expand a query on a person name by tagging the aliases according to their
association orders with the name. The recall is thereby expected to improve
substantially, by 40%, in the relation detection task. Moreover, the search engine
is expected to achieve a substantial MRR for a sample of queries by returning
relevant search results.
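
A minimal Java sketch of this step is given below. It assumes that the word
co-occurrence graph is undirected and unweighted, so that a breadth-first search
yields the hop distance (taken here as the shortest path length) from the name to
every reachable node. The class name AssociationOrders, the expansionTerms method and
the edges used in the usage note are illustrative assumptions; the edges of Fig 4.15
are not reproduced.

```java
import java.util.*;

/** Hypothetical sketch: hop distances in the word co-occurrence graph via breadth-first search. */
class AssociationOrders {

    // Adjacency list of the undirected co-occurrence graph
    private final Map<String, Set<String>> adj = new HashMap<>();

    /** Adds a first order association (an undirected edge) between two anchor texts. */
    void addEdge(String a, String b) {
        adj.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        adj.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    /** Returns node -> hop distance from source; unreachable nodes are absent from the map. */
    Map<String, Integer> hopDistances(String source) {
        Map<String, Integer> dist = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        dist.put(source, 0);
        queue.add(source);
        while (!queue.isEmpty()) {
            String u = queue.remove();
            for (String v : adj.getOrDefault(u, Collections.emptySet())) {
                if (!dist.containsKey(v)) {
                    dist.put(v, dist.get(u) + 1); // v lies one hop further than u
                    queue.add(v);
                }
            }
        }
        return dist;
    }

    /** Aliases whose association order with the name is between 1 and maxOrder, nearest first. */
    List<String> expansionTerms(String name, int maxOrder) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, Integer> e : hopDistances(name).entrySet()) {
            if (e.getValue() >= 1 && e.getValue() <= maxOrder) terms.add(e.getKey());
        }
        return terms;
    }
}
```

With the made-up edges Hideki Matsui-Godzilla, Hideki Matsui-Matsui, Matsui-Yankees
and Yankees-New York, Godzilla and Matsui have a first order association with Hideki
Matsui, Yankees a second order association, and New York a third order association.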

4.6 CODING
4.6.1 ADMIN LOGIN PAGE


<%@page
import="com.oreilly.servlet.*,java.sql.*,databaseconnection.*,java.util.*,java.io.*,javax.servlet.*,javax.servlet.http.*"%>
<html>
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<%
Connection con = null;
PreparedStatement ps = null;
ResultSet rs = null;
// Credentials submitted by the admin login form.
String name = request.getParameter("aname");
String password = request.getParameter("password");
try{
    Class.forName("com.mysql.jdbc.Driver");
    con = DriverManager.getConnection("jdbc:mysql://localhost:3306/alias","root","root");
    // Bind the credentials as parameters instead of concatenating them into the SQL text.
    ps = con.prepareStatement("select * from admin where name=? AND password=?");
    ps.setString(1, name);
    ps.setString(2, password);
    rs = ps.executeQuery();
    if(!rs.next()){
        out.println("Enter correct username, password");
    }
    else{
        response.sendRedirect("upload.jsp");
    }
}
catch(Exception ex){
    out.println(ex);
}
finally{
    // Close JDBC resources in reverse order of creation, even if an error occurred.
    try{ if(rs != null) rs.close(); } catch(SQLException ignore){}
    try{ if(ps != null) ps.close(); } catch(SQLException ignore){}
    try{ if(con != null) con.close(); } catch(SQLException ignore){}
}
%>
</body>
</html>

4.6.2 USER LOGIN PAGE


<%@page
import="com.oreilly.servlet.*,java.sql.*,databaseconnection.*,java.util.*,java.io.*,javax.servlet.*,javax.servlet.http.*"%>
<html>
<head>
<title>Untitled Document</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body>
<%
Connection con = null;
PreparedStatement ps = null;
ResultSet rs = null;
// The login form submits the e-mail as "userid" and the name as "username".
String email = request.getParameter("userid");
String name = request.getParameter("username");
session.setAttribute("name",name);
session.setAttribute("email",email);
try{
    Class.forName("com.mysql.jdbc.Driver");
    con = DriverManager.getConnection("jdbc:mysql://localhost:3306/alias","root","root");
    // Bind the values as parameters instead of concatenating them into the SQL text.
    ps = con.prepareStatement("select * from reg where name=? AND email=?");
    ps.setString(1, name);
    ps.setString(2, email);
    rs = ps.executeQuery();
    if(!rs.next()){
        out.println("Enter correct username, password");
    }
    else{
        response.sendRedirect("search.jsp");
    }
}
catch(Exception ex){
    out.println(ex);
}
finally{
    // Close JDBC resources in reverse order of creation, even if an error occurred.
    try{ if(rs != null) rs.close(); } catch(SQLException ignore){}
    try{ if(ps != null) ps.close(); } catch(SQLException ignore){}
    try{ if(con != null) con.close(); } catch(SQLException ignore){}
}
%>
</body>
</html>
4.6.3 LIKE PAGE
<%@page
import="com.oreilly.servlet.*,java.sql.*,databaseconnection.*,java.util.*,java.io.*,javax.servlet.*,javax.servlet.http.*"%>
<%
// Id of the photo being liked, stored in the session by an earlier page.
String id=(String)session.getAttribute("id");
Connection con = null;
PreparedStatement ps = null;
try
{
    Class.forName("com.mysql.jdbc.Driver");
    con = DriverManager.getConnection("jdbc:mysql://localhost:3306/alias","root","root");
    // Increment the like counter in a single statement; the id is bound as a parameter.
    ps = con.prepareStatement("Update new set lyke=lyke+1 where id=?");
    ps.setString(1,id);
    int x=ps.executeUpdate();
    // Redirect only when a matching row was updated.
    if(x>0){
        response.sendRedirect("search3.jsp?message=success");
    }
}
catch (Exception e)
{
    out.println(e.getMessage());
}
finally
{
    // Release the statement and the connection even if an error occurred.
    try{ if(ps != null) ps.close(); } catch(SQLException ignore){}
    try{ if(con != null) con.close(); } catch(SQLException ignore){}
}
%>
4.6.4 COMMENT PAGE
<%@page
import="com.oreilly.servlet.*,java.sql.*,databaseconnection.*,java.util.*,java.io.*,javax.servlet.*,javax.servlet.http.*"%>
<%
// Id of the photo being commented on, stored in the session by an earlier page.
String id=(String)session.getAttribute("id");
Connection con = null;
PreparedStatement ps = null;
try{
    Class.forName("com.mysql.jdbc.Driver");
    con = DriverManager.getConnection("jdbc:mysql://localhost:3306/alias","root","root");
    // Increment the comment counter in a single statement; the id is bound as a parameter.
    ps = con.prepareStatement("Update new set comment=comment+1 where id=?");
    ps.setString(1,id);
    int x=ps.executeUpdate();
    // Redirect only when a matching row was updated.
    if(x>0){
        response.sendRedirect("search3.jsp?message=success");
    }
}
catch (Exception e)
{
    out.println(e.getMessage());
}
finally
{
    // Release the statement and the connection even if an error occurred.
    try{ if(ps != null) ps.close(); } catch(SQLException ignore){}
    try{ if(con != null) con.close(); } catch(SQLException ignore){}
}
%>

4.7 CONCLUSION
In this way we design the layout of the project that is to be implemented
during the construction phase. This gives a clear picture of the project before it is
coded, so any necessary enhancements can be made during this phase before
coding starts.

CHAPTER 5
IMPLEMENTATION

5.1 INTRODUCTION
Implementation is the stage of the project in which the theoretical design is turned
into a working system. It can therefore be considered the most critical stage in
achieving a successful new system and in giving the user confidence that the new
system will work and be effective.
The implementation stage involves careful planning, investigation of the existing
system and its constraints on implementation, design of methods to achieve
changeover, and evaluation of changeover methods. The implementation is carried out
in JSP, which is well suited to dynamic page design, data sharing and
mining concepts. For maintaining data we use MySQL as the database
back end.
Implementation is the process of converting a new system design into operation.
It is the phase that focuses on user training, site preparation and file conversion for
installing a candidate system. The important factor that should be considered here is
that the conversion should not disrupt the functioning of the organization.

The main purpose of Internet-based productivity applications, such as share
values, is to promote the productivity of human users as a group.

It is accepted in this context that high local responsiveness and high
concurrency are conducive to individual and group productivity.

The share amount is calculated for all values.

Consistency control in this environment must not only guarantee convergence
of replicated data, but also attempt to preserve the intentions of operations.

Operational transformation (OT) is a well-established method for optimistic
consistency control.

This paper analyzes the root of correctness problems in OT and establishes a
novel operational transformation framework for developing OT algorithms
and proving their correctness.

Operational transformation (OT) has been well accepted in group editors for
achieving high local responsiveness and unconstrained collaboration.

Remote operations are transformed before they are executed so that
inconsistencies are repaired.

5.2 METHOD OF IMPLEMENTATION


When the system is ready for implementation, emphasis switches to communication
with the finance department staff. Open discussion with the staff is important from the
beginning of the project. Staff can be expected to be concerned about the effect of
automation on their jobs, and the fear of redundancy or loss of status must be allayed
immediately. During the implementation phase it is important that all staff concerned
are apprised of the objectives and overall operation of the system. They will need
training on how computerization will change their duties and need to understand how
their role relates to the system as a whole. An organization-wide training program is
advisable; this can include demonstrations, newsletters, seminars etc.
The department should allocate a member of staff who understands the system
and the equipment, and make that person responsible for the smooth operation of the
system. An administrator should coordinate the users of the system.
Users should be informed about new aspects of the system that will affect them.
The features of the system should be explained with adequate documentation. New services
such as security, the on-line application form and back-ups must be advertised to the staff
when the time is ripe.
Existing documents such as employee loan details should be entered into the new
system. Since these files are very large, their conversion may continue long after
the system based on current files has been implemented. Hence we need to assign
responsibility for each activity.
The system may come into full operation via a number of possible routes.
A complete changeover at one point in time is conceptually the tidiest, but this
approach requires careful planning and coordination, particularly during the
changeover. A phased approach is also possible: the part of the system
relating to one operation or procedure is implemented first, progressing to more novel or
complex subsystems in the fullness of time. This is likely to be less traumatic, and it
gives the staff time to adjust to the new system. However, it depends on being able
to split the system into parts that can run without relying on the rest. This approach is
sensible when the consequences of failure are disastrous, but it will require extra staff
time. A further option, pilot operation, permits any problems to be tackled on a smaller
scale. Pilot operation generally means the implementation of the complete
system, but at one location or branch only.
5.2.1 FORM DESIGN
The various forms and their processing are given below.
Admin Login Screen
Purpose
The purpose of the screen is to allow the administrator to enter the system.
Details
The edit boxes are used to enter the username and password.
The send button is used to log in to the system.
Screen design

Image uploading screen


Purpose
The purpose of the screen is to let the administrator upload an image and give both
the name and the aliases of the image.
Details
The edit boxes are used to enter the details
The browse button is used to upload the image
The send button indicates the process completion.
Screen design

User login screen


Purpose
The purpose of the screen is to log the user in to the system.
Details
The edit boxes are used to enter the text.
The send button is used to log in to the system.
Screen design

Search screen
Purpose
The purpose of the screen is to search the images that have already been uploaded.
Details
The search bar is used to search the images.
Screen design

Like screen
Purpose
The purpose of the screen is to let the user like any of the images that have already
been uploaded.
Details
The like link is used to like the images.

Screen design
Comment screen
Purpose
The purpose of the screen is to let the user comment on any of the images that have
already been uploaded.
Details
The comment link is used to comment on the images; an edit box then appears for the
comment.
Screen design

5.2.2 OUTPUT SCREENS

Screen 1: welcome screen
Screen 2: admin login screen
Screen 3: image uploading screen
Screen 4: user registration screen
Screen 5: user login screen
Screen 6: search screen
Screen 7: image display screen
Screen 8: like screen
Screen 9: comment screen
Screen 10: comment posting screen

5.2.3 RESULT ANALYSIS


The proposed method computes anchor-text-based co-occurrences between the
given personal name and its aliases, and creates a word co-occurrence graph by
connecting the nodes that represent the name and the aliases according to
their first order associations with each other. A graph mining algorithm that finds
the hop distances between nodes is used to identify the association orders
between the name and the aliases. Ranking SVM is used to rank the anchor texts
according to the co-occurrence statistics in order to identify the anchor texts in the
first order associations. The web search engine can expand a query on a personal
name by tagging aliases in the order of their associations with the name to retrieve all
relevant results, thereby improving recall and achieving a substantial MRR compared
to that of previously proposed methods.
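The hop-distance step described above can be sketched as a breadth-first search over the co-occurrence graph. This is only a minimal illustration under our own assumptions: the class name AssociationOrders, its methods and the sample words are hypothetical and are not taken from the project source, and the hop distance from the name is simply read as the association order.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: class, method and sample names are hypothetical.
class AssociationOrders {

    // Undirected co-occurrence graph: word -> words it directly co-occurs with.
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    // Record a first order association between two words.
    public void addCoOccurrence(String a, String b) {
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    // Breadth-first search from the name; the hop distance of each reachable
    // node is treated here as its association order with the name.
    public Map<String, Integer> hopDistances(String name) {
        Map<String, Integer> distance = new LinkedHashMap<>();
        if (!adjacency.containsKey(name)) {
            return distance; // unknown name: nothing is reachable
        }
        Deque<String> queue = new ArrayDeque<>();
        distance.put(name, 0);
        queue.add(name);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String next : adjacency.get(current)) {
                if (!distance.containsKey(next)) {
                    distance.put(next, distance.get(current) + 1);
                    queue.add(next);
                }
            }
        }
        return distance;
    }

    public static void main(String[] args) {
        AssociationOrders graph = new AssociationOrders();
        graph.addCoOccurrence("name", "alias1");
        graph.addCoOccurrence("alias1", "alias2");
        graph.addCoOccurrence("name", "alias3");
        // alias1 and alias3 are at distance 1, alias2 at distance 2.
        System.out.println(graph.hopDistances("name"));
    }
}
```

Sorting the aliases by the returned distance gives one possible order in which they could be tagged to the query.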

5.3 CONCLUSION
In this way we implemented the project successfully, giving the user easy interaction
with the interfaces and enhanced security with less effort. We now proceed to
the next phase, testing, which is very important before delivering the project.

CHAPTER 6
TESTING AND VALIDATION

6.1 INTRODUCTION
Testing is the process of detecting errors. Testing plays a very critical role in
quality assurance and in ensuring the reliability of software. The results of testing are
also used later, during maintenance.
Testing is often thought of as demonstrating that a program works by showing that
it has no errors. The basic purpose of the testing phase, however, is to detect the errors
that may be present in the program. Hence one should not start testing with the intent of
showing that a program works; the intent should be to show that it does not work.
Testing is the process of executing a program with the intent of finding errors.
Software testing is a critical element of software quality assurance and represents
the ultimate review of software specification, design and coding. The increasing
visibility of software as a system element and the attendant costs associated with a
software failure are motivating forces for well planned, thorough testing. It is not
unusual for a software development organization to expend 40 percent of total project
effort on testing. Hence the importance of software testing and its implications with
respect to software quality cannot be overemphasized. Different types of testing have
been carried out for this system, and they are briefly explained below.
6.1.1 TESTING OBJECTIVES
The main objective of testing is to uncover a host of errors, systematically and
with minimum effort and time. Stating formally, we can say:

Testing is a process of executing a program with the intent of finding an error.

A successful test is one that uncovers an as yet undiscovered error.

A good test case is one that has a high probability of finding an error, if it exists.

The tests are inadequate to detect possibly present errors.

6.1.2 LEVELS OF TESTING


In order to uncover the errors present in different phases we have the concept of
levels of testing. This is a type of black box testing that is based on the specifications
of the software that is to be tested. The application is tested by providing input, and
the results are then examined to check that they conform to the functionality it was
intended for. The testing is conducted on a complete, integrated system to evaluate the
system's compliance with its specified requirements.
The basic levels of testing are as shown below

Client Needs <-> Acceptance Testing
Requirements <-> System Testing
Design <-> Integration Testing
Code <-> Unit Testing

Fig 6.1: Levels of testing

6.2 TEST PLAN


A test plan is a plan prepared by the testers to follow a systematic procedure
adopted to fulfill the process of testing. This includes the order in which the testing
scenarios are implemented.
As far as our system is concerned, we follow the procedures below:
1. Point out what is to be tested and where it is to be tested:
a) Test the presence of the database connection to obtain the data.
b) Check for the completeness of the code to run the system.
c) Decide the portions of code to be tested and not to be tested.
d) Implement the test scenarios and check the outputs against the specified ones.
2. Fail Criteria:
Here we specify the conditions where the system may fail and add them to test
conditions that are to be tested.
3. Test Cases:

The test case specification deals with the details of the test data that is used
to test the system.
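Step 1(a) of the plan, testing the presence of the database connection, could be automated along the following lines. This is a minimal sketch and not the project's own test code: the class ConnectionCheck and its method are hypothetical names, and the URL and credentials passed to it are whatever the deployment actually uses.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

// Illustrative sketch only: class and method names are hypothetical.
class ConnectionCheck {

    // Returns true only if a connection can be opened and reports itself valid.
    // Any SQLException (no driver, wrong URL, bad credentials) counts as a failure.
    static boolean canConnect(String url, String user, String password) {
        try (Connection con = DriverManager.getConnection(url, user, password)) {
            return con.isValid(2); // wait at most 2 seconds for the validity check
        } catch (SQLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The JDBC URL and credentials are supplied by the caller.
        if (args.length == 3) {
            System.out.println(canConnect(args[0], args[1], args[2]));
        }
    }
}
```

The try-with-resources statement closes the connection again, so the check does not leave a connection open after the test.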

6.3 TESTING STRATEGIES


A strategy for software testing integrates software test case design methods into a
well-planned series of steps that result in the successful construction of software.
Unit Testing
Unit testing focuses verification effort on the smallest unit of software, i.e. the
module. Using the detailed design and the process specifications, testing is done to
uncover errors within the boundary of the module. All modules must pass
the unit test before integration testing begins.
White Box Testing
White box testing mainly focuses on the internal performance of the product.
Here one part is taken at a time and tested thoroughly at statement level to find
the maximum possible number of errors. A loop is also constructed in such a way that the
part is tested within a range; that is, the part is executed at its boundary values and
within bounds for the purpose of testing.
Black Box Testing
This testing method considers a module as a single unit and checks the unit at its
interface and in its communication with other modules, rather than getting into details at
statement level. Here the module is treated as a black box that takes some
input and generates output. The outputs for a given set of input combinations are
forwarded to other modules.
Integration Testing
After the unit testing we have to perform integration testing. The goal here is to
see if modules can be integrated properly or not. This testing activity can be
considered as testing the design and hence the emphasis on testing module
interactions. It also helps to uncover a set of errors associated with interfacing. Here
the input to these modules will be the unit tested modules.
Integration testing is classified into two types:
1. Top-Down Integration Testing.
2. Bottom-Up Integration Testing.
In Top-Down Integration Testing modules are integrated by moving downward
through the control hierarchy, beginning with the main control module.
In Bottom-Up Integration Testing each sub module is tested separately and then
the full system is tested.
System Testing
System testing is an important phase without which the system cannot be
released to the end users. It is aimed at ensuring that all the processes
work accurately according to the specification.
Acceptance Testing
Acceptance Test is performed with realistic data of the client to demonstrate that
the software is working satisfactorily. Testing here is focused on external behavior of
the system; the internal logic of program is not emphasized.
6.3.1 TESTING ACTIVITIES
The activities of testing include:
Component Inspection: This finds faults in an individual component through
the manual inspection of its source code. Inspections can be conducted before
or after the unit test.
Usability Testing: This finds differences between the system and the users'
expectation of what it should do. Usability testing tests the users'
understanding of the system.

Unit Testing: This finds faults by isolating an individual component using test
stubs and drivers and by exercising the component using a test case. Unit
testing focuses on the building blocks of the software system, that is, objects
and subsystems. The most important unit testing techniques are:
1. Equivalence Testing or Black box testing: This black box testing
minimizes the number of test cases. The possible inputs are partitioned
into equivalence classes, and a test case is selected for each class.
Equivalence testing involves two steps: identification of the
equivalence classes and selection of the test inputs.

2. Path based testing or White box testing: This white box testing
technique identifies faults in the implementation of the component. The
assumption here is that, by exercising all possible paths through the code
at least once, most faults will trigger failures. The identification of paths
requires knowledge of the source code and data structures.
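The two steps of equivalence testing can be illustrated with a small sketch. Everything here is an assumption made for illustration: the report does not state the pin format, so the rule "exactly six digits" and the names PinEquivalenceTest, isValidPin, representatives and run are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: the pin rule and all names are hypothetical.
class PinEquivalenceTest {

    // Assumed rule for this example: a valid pin is exactly six digits.
    static boolean isValidPin(String pin) {
        return pin != null && pin.matches("\\d{6}");
    }

    // Step 1: identify the equivalence classes.
    // Step 2: select one representative input per class, with its expected result.
    static Map<String, Boolean> representatives() {
        Map<String, Boolean> cases = new LinkedHashMap<>();
        cases.put("123456", true);     // class: six digits (valid)
        cases.put("1234", false);      // class: too short
        cases.put("12345678", false);  // class: too long
        cases.put("12a456", false);    // class: contains a non-digit
        cases.put("", false);          // class: empty input
        return cases;
    }

    // Runs every representative and returns the number of failing cases.
    static int run() {
        int failures = 0;
        for (Map.Entry<String, Boolean> c : representatives().entrySet()) {
            if (isValidPin(c.getKey()) != c.getValue()) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failing cases: " + run());
    }
}
```

Five inputs stand in for the whole input space here, which is the point of partitioning: one test case per class rather than one per possible value.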
Integration Testing: This finds faults by integrating several components
together. Integration testing detects faults that have not been detected during
unit testing by focusing on small groups of components. Two or more
components are integrated and tested, and when no new faults are revealed,
additional components are added to the group.
System Testing: This focuses on the complete system, its functional and
nonfunctional requirements, and its target environment. All the testing
activities stated above are to be carried out on large-scale projects to obtain a
consistent system. Not all of them are applicable to small
projects that do not involve much analysis.

6.4 VALIDATION
Name     : Field validation test in signup
Location : newsignup.jsp
Input1   : All the fields filled correctly
Log1     : The system shows the user the welcome screen to his Inbox
Input2   : Invalid pin number
Log2     : A message "Enter Pin Correctly" is displayed.

Table 6.1: validations

6.5 CONCLUSION
In this way we also completed the testing phase of the project and ensured that
the system is ready to go live. Thus we developed a system that discovers the
association orders between a personal name and its aliases so that queries on
personal names can be expanded with those aliases.

CHAPTER 7
CONCLUSION

The proposed method computes anchor-text-based co-occurrences between the
given personal name and its aliases, and creates a word co-occurrence graph by making
connections between the nodes representing the name and the aliases based on their
first order associations with each other. A graph mining algorithm that finds the
hop distances between nodes is used to identify the association orders between the name
and the aliases. Ranking SVM is used to rank the anchor texts according to the
co-occurrence statistics in order to identify the anchor texts in the first order associations.
The web search engine can expand a query on a personal name by tagging aliases in
the order of their associations with the name to retrieve all relevant results, thereby
improving recall and achieving a substantial MRR compared to that of previously
proposed methods.
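For reference, the two evaluation measures named above can be computed as in the following sketch. It uses the standard definitions of recall and mean reciprocal rank (MRR); the class RetrievalMetrics, its method names and the sample page identifiers are hypothetical and are not part of the project code or its results.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: class, method and sample names are hypothetical.
class RetrievalMetrics {

    // Recall: fraction of the relevant pages that appear in the retrieved list.
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return (double) hits.size() / relevant.size();
    }

    // Reciprocal rank: 1 / (1-based position of the first relevant result), 0 if none.
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    // MRR: mean of the reciprocal ranks over all queries.
    // Both lists are assumed to have one entry per query, in the same order.
    static double meanReciprocalRank(List<List<String>> rankedPerQuery,
                                     List<Set<String>> relevantPerQuery) {
        if (rankedPerQuery.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (int q = 0; q < rankedPerQuery.size(); q++) {
            sum += reciprocalRank(rankedPerQuery.get(q), relevantPerQuery.get(q));
        }
        return sum / rankedPerQuery.size();
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("page2", "page1", "page3");
        Set<String> relevant = Set.of("page1", "page4");
        System.out.println(recall(ranked, relevant));         // prints 0.5
        System.out.println(reciprocalRank(ranked, relevant)); // prints 0.5
    }
}
```

In the sample, one of the two relevant pages is retrieved (recall 0.5) and the first relevant page is at rank 2 (reciprocal rank 0.5).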

CHAPTER 8
BIBILOGRAPHY

67
DEPARTMENT OF CSE
GDMM,NANDIGAMA

AUTOMATIC DISCOVERY OF ASSOCIATION ORDERS BETWEEN NAME AND ALIASES FROM THE
WEB USING ANCHOR TEXT BASED CO-OCCURENCES

Good Teachers are worth more than thousand books, we have them in Our
Department
References Made From
[1] J. Artiles, J. Gonzalo , and F. Verdejo, A Testbed for People
Searching Strategies in the WWW, Proc. SIGIR 05, pp. 569-570, 2005.
[2] R. Guha and A. Garg, Disambiguating People in Search, technical
report, Stanford Univ., 2004.
[3] D.Bollegala, Y. Matsuo, and M. Ishizuka , Automatic Discovery of
Personal Name Aliases from the Web, IEEE Transactions on Knowledge
and Data Engineering, vol. 23, No. 6, June 2011.
[4] Y. Matsuo, and M. Ishizuka, Keyword Extraction from a Single
Document using Word Co-occurrence Statistical Information,
International Journal on Artificial Intelligence Tools, 2004.
[5] W. Lu, L. Chien and H. Lee, Anchor Text Mining for Translation of
Web Queries: A Transitive Translation Approach, ACM Transactions on
Information Systems, Vol. 22, No. 2, Aprill 2004, Pages 242-269.
[6] Z. Liu, W. Yu, Y. Deng, Y. Wang, and Z. Bian, A Feature selection
Method for Document Clustering based on Part-of-Speech and Word Cooccurrence, Proceedings of 7th International Conference on Fuzzy
Systems and Knowledge Discovery (FSKD 10), pp. 2331-2334, Aug
2010.
68
DEPARTMENT OF CSE
GDMM,NANDIGAMA

AUTOMATIC DISCOVERY OF ASSOCIATION ORDERS BETWEEN NAME AND ALIASES FROM THE
WEB USING ANCHOR TEXT BASED CO-OCCURENCES

[7] F. Figueiredo, L. Rocha, T. Couto, T. Salles, M.A. Gonclaves, and W.


Meira Jr, Word Co-occurrence Features for Text Classification, Vol 36,
Issues 5, Pages 843-858, July 2011.
[8] G. Salton and C. Buckley, Term-Weighting Approaches in Automatic
Text Retrieval, Information processing and Management, vol. 24, pp.
513-523, 1988.
[9] T. Dunning, Accurate Methods for the Statistics of Surprise and
Coincidence, Computational Linguistics, vol. 19, pp. 61-74, 1993.
[10] K. Church and P. Hanks, Word Association Norms, Mutual
Information and Lexicography, Computational Linguistics, Vol. 16, pp.
22-29, 1991.
[11] T. Hisamitsu and Y. Niwa, Topic-Word Selection Based on
Combinatorial Probability, Proc. Natural Language Processing PacificRim Symp. (NLPRS 01), pp.289-296, 2001.
[12] F.Smadja, Retrieveing Collocations from Text: Xtract,
Computational Liguistics, Vol. 19, no 1, pp. 143-177, 1993.
[13] T. Joachims, Optimizing Search Engines using Clickthrough Data,
proc. ACM SIGKDD 02, 2002.
[14] D. Chakrabarti and C. Faloutsos , Graph Mining: Laws, Generators,
and Algorithms, ACM Computing Surveys, Vol. 38, March 2006, Article
2.
[15] C.C. Agarwal and H. Wang, Graph Data Management and Mining :
A Survey of Algorithms and Applications, DOI 10.1007/978-1-44196045-0_2,@ Springler Science+Business Media, LLC 2010.
Sites Referred:
http://java.sun.com
69
DEPARTMENT OF CSE
GDMM,NANDIGAMA

AUTOMATIC DISCOVERY OF ASSOCIATION ORDERS BETWEEN NAME AND ALIASES FROM THE
WEB USING ANCHOR TEXT BASED CO-OCCURENCES

http://www.sourcefordgde.com
http://www.networkcomputing.com/
http://www.roseindia.com/
http://www.java2s.com/
