Abstract
Project : SEARCH ENGINE WITH WEB CRAWLER
Front End : Core Java, JSP
Back End : File system & MySQL server
Web server : Tomcat web server
This project is an attempt to implement a search engine with a web crawler, so as to demonstrate how it helps people search the Web faster. A search engine is an information retrieval system designed to help find information stored on a computer system. The most visible, public form of a search engine is a Web search engine, which searches for information on the World Wide Web. Search engines provide an interface to a group of items that enables users to specify criteria about an item of interest and have the engine find the matching items. A web crawler (Web spider or Web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering. Search engines use spidering as a means of providing up-to-date data. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.
WebCrawler was the Internet's first search engine that performed keyword searches in both the names and texts of pages on the World Wide Web. WebCrawler's search engine performed two basic functions. First, it compiled an ongoing index of web addresses (URLs). WebCrawler retrieved and marked a document, analyzed the content of both its title and its full text, registered the relevant links it contained, and then stored the information in its database. When the user submitted a query in the form of one or more keywords, WebCrawler compared it with the information in its index and reported back any matches. WebCrawler's second function was searching the Internet in real time for the sites that matched a given query. It was carried out using exactly the same process, following links from one page to another.
Contents
1 Introduction
1.1 The Motivation
2 System Study
2.1 Proposed System
2.2 Technologies
2.2.1 Java
2.2.2 JDBC: (Java Database Connection)
2.2.3 Overview of the JDBC Process
2.2.4 Java Server Pages (JSP)
2.2.5 Advantages of JSP
2.2.6 JSP Architecture
3 Modules
3.1 Administrator Side
3.1.1 Page Settings
3.1.2 Log Settings
3.2 Search
3.3 Web Service
4 Working
4.1 Steps used in the implementation of Search engine
5 System Design
5.1 Data Flow Diagram
5.2 Data Base Design
6 Conclusion
References
Chapter 1
Introduction
Most people find what they're looking for on the World Wide Web by using search engines like Yahoo!, AltaVista, or Google. It is the search engines that finally bring your website to the notice of prospective customers. Hence it is better to know how these search engines actually work and how they present information to the customer initiating a search. When you ask a search engine to locate information, it is actually searching through the index it has created and not actually searching through the Web. Different search engines produce different rankings because not every search engine uses the same algorithm to search through the indices. Many leading search engines use a form of software program called spiders or crawlers to find information on the Internet and store it for search results in giant databases or indexes. Some spiders record every word on a Web site for their respective indexes, while others only report certain keywords listed in title tags or meta tags.
Search engines use spiders to index websites. When you submit your website pages to a search engine by completing its required submission page, the search engine spider will index your entire site. A spider is an automated program that is run by the search engine system. Search engine indexing collects, parses, and stores the data to facilitate fast and accurate information retrieval. Spiders are unable to index pictures or read text that is contained within graphics, so relying too heavily on such elements was a concern for online marketers.
WebCrawler was the Internet's first search engine that performed keyword searches in both the names and texts of pages on the World Wide Web. It won quick popularity and loyalty among surfers looking for information. WebCrawler was born in January 1994, during the Web's infancy. It was developed by Brian Pinkerton, a computer science student at the University of Washington, to cope with the complexity of the Web. Pinkerton's application, WebCrawler, could automatically scan the individual sites on the Web, register their content, and create an index that surfers could query with keywords to find Web sites relevant to their interests.
Chapter 2
System Study
2.1 Proposed System
In our proposed system, the search engine is implemented using a web crawler. The user can search for text queries; when a query is submitted, the downloaded web pages are searched and the ranked URLs are listed to the user. The ranking is based on the number of searched words present in each web page. The user also has the option of fetching news using the Yahoo API.
2.2 Technologies
The selection of a programming language depends on the system we need. Since the application is a web-based system, Java and its technologies are the most suitable. In the development of this application, JSP is used for the design of the web pages for both the user and the administrator.
2.2.1 Java
Java was introduced by Sun Microsystems in 1995 and instantly created a new sense of the interactive possibilities of the web. Originally it was called Oak, and it was mainly developed for writing software for consumer electronic devices. Both of the major web browsers include a Java Virtual Machine (JVM), and almost all major operating system developers (IBM, Microsoft and others) have added a Java compiler as part of their product offerings.
Java is a platform-independent language. It is the first programming language that is not tied to any particular hardware or operating system; programs developed in Java can be executed anywhere on any system. The internet helped to propel Java into the forefront of programming, and Java, in turn, has had a profound effect on the internet. Java is a true object-oriented language, expressly designed for use in the distributed environment of the internet.
The object model in Java is simple and easy to extend. Java can also be used to build small application modules, or applets, for use as part of a web page. Applets make it possible for web page users to interact with the page. Java could be easily incorporated into the web system, and the programs you create are portable across a network. The compiler's output is bytecode: a highly optimized set of instructions designed to be executed by the Java runtime system, regardless of the underlying processor. Translating a Java program into bytecode makes it easier to run the program in a wide variety of environments.
The major features of Java are:
Mainly, Java is platform-independent and portable. Java programs can be moved from one computer system to another, anywhere and anytime. Changes and upgrades in operating systems, processors and system resources will not force any changes in Java programs.
Secondly, Java is a true object-oriented language. Almost everything in Java is an object; all program code and data reside within objects and classes. It provides many safeguards to ensure reliable code. Java makes memory management much easier, and it has strict compile-time and runtime checking for data types. The object model in Java is simple and easy to extend.
Java is designed as a distributed language for creating applications on networks, with the ability to share both data and programs. Java applications can open and access remote objects on the internet as easily as they can in a local system. It is a small and simple language, designed to be easy for the professional programmer to learn and use effectively.
The Java environment includes a large number of development tools, and hundreds of classes and methods are part of the Java Standard Library (JSL), also known as the Application Programming Interface (API). The development tools that are part of Java are used as the front end for designing the GUI for the end users. Java is a general-purpose programming language that supports multi-threaded programs. This means that we need not wait for an application to finish one task before beginning another.
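As a minimal sketch of that last point (all class and task names here are illustrative, not part of this project), two tasks can run on separate threads so that neither waits for the other to finish:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThreadDemo {
    static final AtomicInteger tasksDone = new AtomicInteger();

    public static void main(String[] args) throws InterruptedException {
        // Two independent tasks: starting one does not wait for the other.
        Thread indexer = new Thread(() -> tasksDone.incrementAndGet());
        Thread responder = new Thread(() -> tasksDone.incrementAndGet());
        indexer.start();
        responder.start();
        indexer.join();   // wait for both tasks to complete
        responder.join();
        System.out.println(tasksDone.get() + " tasks completed concurrently");
    }
}
```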
2.2.2 JDBC: (Java Database Connection)
Practically every J2EE application saves, retrieves and manipulates information stored in a database, using services provided by a J2EE component. A J2EE component supplies database access using Java data objects contained in the JDBC application programming interface (API).
Sun Microsystems, Inc. met the challenge in 1996 with the creation of the JDBC driver and the JDBC API. The JDBC "driver" developed by Sun Microsystems, Inc. wasn't a driver at all; it was a specification that described the detailed functionality of a JDBC driver. The specification required a JDBC driver to be a translator that converted low-level proprietary DBMS messages to and from low-level messages understood by the JDBC API. This meant Java programmers could use high-level Java data objects defined in the JDBC API to write routines that interacted with the DBMS. JDBC drivers created by DBMS manufacturers have to:
- Open a connection between the DBMS and the J2EE component.
- Translate low-level equivalents of SQL statements sent by the J2EE component into messages that can be processed by the DBMS.
- Return data that conforms to the JDBC specification to the JDBC driver.
- Return information, such as error messages, that conforms to the JDBC specification to the JDBC driver.
- Provide transaction management routines that conform to the JDBC specification.
- Close the connection between the DBMS and the J2EE component.
2.2.3 Overview of the JDBC Process
This process is divided into five routines:
1. Perform connection and authentication to a database server.
2. Manage transactions.
3. Move SQL statements to a database engine for preprocessing and execution.
4. Execute stored procedures.
5. Inspect and modify the results of SELECT statements.
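The routines above might look as follows in a minimal JDBC sketch. The connection URL, credentials, table and column names are placeholders invented for illustration (the project's actual schema is not shown in this report), and step 4, stored procedures, is omitted for brevity:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class JdbcSketch {
    public static void main(String[] args) throws Exception {
        // 1. Connect and authenticate to the database server
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/searchengine", "user", "password")) {
            con.setAutoCommit(false); // 2. manage the transaction ourselves
            // 3. Send an SQL statement to the engine for preprocessing and execution
            PreparedStatement ps = con.prepareStatement(
                "SELECT url, hits FROM page_rank WHERE term = ? ORDER BY hits DESC");
            ps.setString(1, "crawler");
            // 5. Inspect the result of the SELECT statement
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + " " + rs.getInt("hits"));
                }
            }
            con.commit(); // end the transaction
        } // connection closed automatically by try-with-resources
    }
}
```

Running this sketch requires a live MySQL server and the MySQL JDBC driver on the classpath.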
A JSP page is compiled the first time the page is called. The page is compiled into a Java servlet class and remains in server memory, so subsequent calls to the page have very fast response times. The JSP specification supports the creation of XML documents. For simple XML generation, the XML tags may be included as static template portions of the JSP page. The JSP 2.0 specification describes a mapping between JSP pages and XML documents.
Chapter 3
Modules
There are three modules for the search engine with web crawler:
1. Administrator Side
2. Search
3. Web Service
3.2 Search
When a query is given by the user, the search engine checks for the corresponding index file. If it is not present, an index file is created with that query as its filename. All the web pages are then checked for the given query, and the addresses of the matching URLs are added to that index file. The number of occurrences of the given query term in each page is counted and recorded in a database.
The ranking is based on the count of the query term; the URLs are then listed out from the database in descending order.
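A minimal sketch of this ranking idea follows; the hard-coded page contents stand in for the crawler's downloaded cache, and all names and URLs are illustrative, not taken from the project:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RankPages {
    // Count non-overlapping occurrences of term in text, case-insensitively.
    static int countTerm(String text, String term) {
        String t = text.toLowerCase();
        String q = term.toLowerCase();
        int count = 0;
        for (int i = t.indexOf(q); i >= 0; i = t.indexOf(q, i + q.length())) {
            count++;
        }
        return count;
    }

    // Return URLs ranked by descending count of the query term.
    static List<String> rank(Map<String, String> pages, String term) {
        List<String> urls = new ArrayList<>(pages.keySet());
        urls.sort((a, b) -> countTerm(pages.get(b), term) - countTerm(pages.get(a), term));
        return urls;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("http://a.example", "crawler crawler index");
        pages.put("http://b.example", "search engine");
        pages.put("http://c.example", "crawler crawler crawler");
        System.out.println(rank(pages, "crawler"));
        // prints [http://c.example, http://a.example, http://b.example]
    }
}
```

In the project itself the per-page counts are stored in the database and the ordering is done there; this sketch only shows the counting and descending sort in memory.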
Chapter 4
Working
4.1 Steps used in the implementation of the search engine
The steps involved in the implementation of the search engine with web crawler are:
1. The necessary URLs are first downloaded into the cache by the administrator.
2. When the user submits a query, an independent cache for each index term is created, after checking whether it is already present.
3. The web pages are searched for the index terms, and the URLs containing the corresponding index terms are recorded into the database.
4. The count of the given query term in each web page is also recorded into the database.
5. Finally, the ranked URLs are listed out from the database in decreasing order.
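The crawl-frontier idea behind step 1 can be sketched as follows. The hard-coded link graph stands in for real fetched pages, and all names and URLs are illustrative; a real crawler would fetch each URL over HTTP and parse the hyperlinks out of the HTML:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FrontierSketch {
    // Visit every page reachable from the seeds, extending the frontier
    // with each newly discovered hyperlink; return the crawl order.
    public static List<String> crawl(Map<String, List<String>> links, List<String> seeds) {
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> visited = new LinkedHashSet<>();
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;         // already crawled
            for (String out : links.getOrDefault(url, List.of())) {
                if (!visited.contains(out)) frontier.add(out); // extend frontier
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "http://seed.example", List.of("http://a.example", "http://b.example"),
            "http://a.example", List.of("http://b.example"));
        System.out.println(crawl(links, List.of("http://seed.example")));
        // prints [http://seed.example, http://a.example, http://b.example]
    }
}
```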
Chapter 5
System Design
5.1 Data Flow Diagram
A Data Flow Diagram (DFD), or bubble chart, is a graphical tool for structured analysis. DFDs were introduced by DeMarco in 1978 and by Gane and Sarson in 1979. A DFD models how a system transforms data: data flows from external entities into processes, which pass it on to other processes, external entities, or files, creating output data flows. Data in files may also flow to processes as inputs.
There are various symbols used in a DFD. Bubbles represent processes. Named arrows indicate data flows. External entities are represented by rectangles; they lie outside the system, such as vendors or customers with whom the system interacts, and they either supply or consume data. Entities supplying data are known as sources, and those that consume data are called sinks. Data are stored in a data store by a process in the system. Each component in a DFD is labeled with a descriptive name, and process names are further identified with a number.
DFDs can be hierarchically organized, which helps in partitioning and analyzing large systems. As a first step, one Data Flow Diagram can depict an entire system, giving the system overview; this is called the context diagram, or level-0 DFD. The context diagram can be further expanded. The successive expansion of DFDs from the context diagram to those giving more detail is known as leveling of DFDs. Thus a top-down approach is used, starting with an overview and then working out the details.
The main merit of a DFD is that it can provide an overview of what data a system would process, what transformations of data are done, what files are used, and where the results flow.
The data flow diagram of the Search Engine With Web Crawler has been represented as a hierarchical DFD: the context-level DFD was drawn first, and then the processes were decomposed into several elementary levels and represented in order of importance.
5.2 Data Base Design
Chapter 6
Conclusion
Nowadays, there are many search engines, such as Google, Yahoo and AltaVista. We are trying to develop a search engine with some of the facilities of the current search engines, such as text search and news search. Still, there are limitations in our search engine.
References
[1] Ricardo Baeza-Yates: Modern Information Retrieval
[2] http://www.searchenginewatch.com/
[3] http://www.webcrawler.com/