
Search Engine with WebCrawler

Abstract
Project : Search Engine with Web Crawler
Front End : Core Java, JSP
Back End : File system & MySQL server
Web Server : Tomcat web server
This project is an attempt to implement a search engine with a web crawler, demonstrating how such a system helps people search the web faster. A search engine is an information retrieval system designed to help find information stored on a computer system. The most public, visible form of a search engine is a Web search engine, which searches for information on the World Wide Web. Search engines provide an interface to a collection of items that enables users to specify criteria about an item of interest and have the engine find the matching items. A web crawler (also called a Web spider or Web robot) is a program or automated script that browses the World Wide Web in a methodical, automated manner. This process is called Web crawling or spidering. Search engines use spidering as a means of providing up-to-date data. A web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier.

WebCrawler was the Internet's first search engine to perform keyword searches in both the names and the texts of pages on the World Wide Web. WebCrawler's search engine performed two basic functions. First, it compiled an ongoing index of web addresses (URLs): WebCrawler retrieved and marked a document, analyzed the content of both its title and its full text, registered the relevant links it contained, and then stored the information in its database. When the user submitted a query in the form of one or more keywords, WebCrawler compared it with the information in its index and reported back any matches. WebCrawler's second function was searching the Internet in real time for sites matching a given query. This was carried out using exactly the same process of following links from one page to another.

Contents
1 Introduction
1.1 The Motivation
2 System Study
2.1 Proposed System
2.2 Technologies
2.2.1 Java
2.2.2 JDBC (Java Database Connectivity)
2.2.3 Overview of the JDBC Process
2.2.4 Java Server Pages (JSP)
2.2.5 Advantages of JSP
2.2.6 JSP Architecture
3 Modules
3.1 Administrator Side
3.1.1 Page Settings
3.1.2 Log Settings
3.2 Search
3.3 Web Service
4 Working
4.1 Steps used in the implementation of Search engine
5 System Design
5.1 Data Flow Diagram
5.2 Database Design
6 Conclusion
References

Chapter 1
Introduction
Most people find what they're looking for on the World Wide Web by using search engines like Yahoo!, AltaVista, or Google. It is the search engines that finally bring your website to the notice of prospective customers. Hence it is worth knowing how these search engines actually work and how they present information to the customer initiating a search. When you ask a search engine to locate information, it actually searches through the index it has created, not through the Web itself. Different search engines produce different rankings because not every search engine uses the same algorithm to search through its indices. Many leading search engines use a form of software program called a spider or crawler to find information on the Internet and store it for search results in giant databases or indexes. Some spiders record every word on a website for their respective indexes, while others report only certain keywords listed in title tags or meta tags.

Search engines use spiders to index websites. When you submit your website pages to a search engine by completing its required submission page, the search engine spider will index your entire site. A spider is an automated program that is run by the search engine system. Search engine indexing collects, parses, and stores the data to facilitate fast and accurate information retrieval. Spiders are unable to index pictures or read text contained within graphics, so relying too heavily on such elements is a concern for online marketers.

WebCrawler was the Internet's first search engine to perform keyword searches in both the names and the texts of pages on the World Wide Web, and it quickly won popularity and loyalty among surfers looking for information. WebCrawler was born in January 1994, during the Web's infancy. It was developed by Brian Pinkerton, a computer science student at the University of Washington, to cope with the complexity of the Web. Pinkerton's application could automatically scan individual sites on the Web, register their content, and create an index that surfers could query with keywords to find websites relevant to their interests.

1.1 The Motivation


Primarily, the motivation is our interest in the area of information retrieval. Nowadays there are many search engines, such as Google, Yahoo, and AltaVista. We are trying to develop a search engine with some of the facilities of the current search engines, such as text search and news search.

Chapter 2
System Study
2.1 Proposed System
In our proposed system, the search engine is implemented using a web crawler. Users can search for text queries. When a query is submitted, the engine searches the downloaded web pages, and the ranked URLs are listed to the user. The ranking is based on the number of searched words present in each web page. The user also has the option of fetching news using the Yahoo API.

2.2 Technologies
The selection of a programming language depends on the system we need. Since the application is a web-based system, Java and its related technologies are the most suitable. In the development of this application, JSP is used to design the web pages for both the user and the administrator.
2.2.1 Java
Java was introduced by Sun Microsystems in 1995 and instantly created a new sense of the interactive possibilities of the web. Originally called Oak, it was mainly developed for software for consumer electronic devices. Both of the major web browsers include a Java Virtual Machine (JVM), and almost all major operating system developers (IBM, Microsoft, and others) have added a Java compiler to their product offerings.

Java is a platform-independent language. It was the first programming language not tied to any particular hardware or operating system: programs developed in Java can be executed anywhere, on any system. The Internet helped propel Java to the forefront of programming, and Java, in turn, has had a profound effect on the Internet. Java is a true object-oriented language, expressly designed for use in the distributed environment of the Internet. The object model in Java is simple and easy to extend. Java can also be used to build small application modules, or applets, for use as part of a web page; applets make it possible for users to interact with the page. Java is easily incorporated into web systems, and the programs you create are portable across a network. The compiler's output is bytecode, a highly optimized set of instructions designed to be executed by the Java runtime system, and understood by any processor that hosts a JVM. Translating a Java program into bytecode makes it easier to run the program in a wide variety of environments.

The major features of Java are:

Java is platform-independent and portable. Java programs can be moved from one computer system to another, anywhere and anytime. Changes and upgrades in operating systems, processors, and system resources will not force any changes in Java programs.

Java is a true object-oriented language. Almost everything in Java is an object: all program code and data reside within objects and classes. Java provides many safeguards to ensure reliable code, makes memory management much easier, and has strict compile-time and runtime checking of data types. The object model in Java is simple and easy to extend.

Java is designed as a distributed language for creating applications on networks, with the ability to share both data and programs. Java applications can open and access remote objects on the Internet as easily as they can in a local system. It is also a small and simple language, designed to be easy for the professional programmer to learn and use effectively.

The Java environment includes a large number of development tools, and hundreds of classes and methods are part of the Java Standard Library (JSL), also known as the Application Programming Interface (API). The development tools that are part of Java are used as the front end for designing the GUI for the end users. Java is a general-purpose programming language that supports multithreaded programs, which means an application need not wait for one task to finish before beginning another.
2.2.2 JDBC (Java Database Connectivity)
Practically every J2EE application saves, retrieves, and manipulates information stored in a database, using services provided by a J2EE component. A J2EE component gains database access using the Java data objects contained in the JDBC application programming interface (API).

Sun Microsystems, Inc. met this challenge in 1996 with the creation of the JDBC driver specification and the JDBC API. The JDBC driver developed by Sun Microsystems, Inc. wasn't a driver at all: it was a specification that described the detailed functionality of a JDBC driver. The specification required a JDBC driver to be a translator that converted low-level proprietary DBMS messages to and from the low-level messages understood by the JDBC API. This meant Java programmers could use the high-level Java data objects defined in the JDBC API to write routines that interact with the DBMS. A JDBC driver created by a DBMS manufacturer has to:

Open a connection between the DBMS and the J2EE component.
Translate low-level equivalents of SQL statements sent by the J2EE component into messages that can be processed by the DBMS.
Return data that conforms to the JDBC specification to the JDBC driver.
Return information, such as error messages, that conforms to the JDBC specification.
Provide transaction management routines that conform to the JDBC specification.
Close the connection between the DBMS and the J2EE component.
2.2.3 Overview of the JDBC Process
The JDBC process is divided into five routines:
Perform connection and authentication to a database server.
Manage transactions.
Move SQL statements to a database engine for preprocessing and execution.
Execute stored procedures.
Inspect and modify the results of SELECT statements.
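As a minimal sketch of these routines, the following Java fragment connects, sends a parameterized SQL statement for preprocessing, and inspects the result set. The JDBC URL, credentials, and the search_log table are hypothetical names for illustration only, not the project's actual schema; without a running MySQL server and its driver on the classpath, the catch block is reached instead.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcSketch {
    // Parameterized SQL sent to the driver for preprocessing (a pure helper).
    static String selectLogSql() {
        return "SELECT term, searched_at FROM search_log WHERE term = ?";
    }

    public static void main(String[] args) {
        // Hypothetical MySQL URL and credentials.
        String url = "jdbc:mysql://localhost:3306/searchengine";
        try (Connection con = DriverManager.getConnection(url, "user", "secret");
             PreparedStatement ps = con.prepareStatement(selectLogSql())) {
            ps.setString(1, "crawler");                 // bind the query term
            try (ResultSet rs = ps.executeQuery()) {    // inspect the results
                while (rs.next()) {
                    System.out.println(rs.getString("term") + " @ "
                            + rs.getTimestamp("searched_at"));
                }
            }
        } catch (SQLException e) {
            // Without a running MySQL server (and its driver), we land here.
            System.err.println("Database unavailable: " + e.getMessage());
        }
    }
}
```

The try-with-resources blocks close the statement and connection automatically, which corresponds to the "close the connection" routine above.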

2.2.4 Java Server Pages (JSP)


JSP is a technology based on the Java language that enables the development of dynamic websites. JSP was developed by Sun Microsystems to allow server-side development. Being based on the Java programming language, JSP offers proven portability and open standards. A JSP document can share data among users, access databases, and do all the things that require server intervention. JSP documents are compiled into Java bytecode, a binary format with fast and efficient runtime capabilities. JSP pages separate the page logic from its design and display. JSP technology is part of the Java technology family; JSP pages are not restricted to any specific platform or web server, and the JSP specification represents a broad spectrum of industry input.

A servlet is a program written in the Java programming language that runs on the server, as opposed to the browser (where applets run). JSP pages are compiled into servlets, so theoretically you could write servlets to support your web-based applications. However, JSP technology was designed to simplify the process of creating pages by separating web presentation from web content. In many applications, the response sent to the client is a combination of template data and dynamically generated data. In this situation, it is much easier to work with JSP pages than to do everything with servlets.

The JSP 2.1 specification is an important part of the Java EE 5 platform. There are a number of JSP technology implementations for different web servers. JSP technology is the result of industry collaboration and is designed to be an open, industry-standard method supporting numerous servers, browsers, and tools. JSP technology speeds development with reusable components and tags, instead of relying heavily on scripting within the page itself. All JSP implementations support a Java programming language-based scripting language, which provides inherent scalability and support for complex operations.

A JSP page is a page created by the web developer that includes JSP technology-specific and custom tags, in combination with other static (HTML or XML) tags. A JSP page has the extension .jsp or .jspx; this signals to the web server that the JSP engine will process elements on this page. JSP pages are typically compiled into Java platform servlet classes; as a result, JSP pages require a Java virtual machine that supports the Java platform servlet specification. Pages built using JSP technology are typically implemented using a translation phase that is performed once, the first time the page is called. The page is compiled into a Java servlet class and remains in server memory, so subsequent calls to the page have very fast response times. The JSP specification also supports the creation of XML documents: for simple XML generation, the XML tags may be included as static template portions of the JSP page, and the JSP 2.0 specification describes a mapping between JSP pages and XML documents.
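As a small illustration of static tags mixed with dynamic content (a hypothetical page, not one of the project's actual pages; the request parameter name q is assumed):

```
<%-- hello.jsp: static HTML template with one dynamic JSP expression --%>
<html>
  <body>
    <h1>Search Engine</h1>
    <p>You searched for: <%= request.getParameter("q") %></p>
  </body>
</html>
```

Everything outside the <%= ... %> expression is template data copied verbatim into the response; the expression is evaluated by the generated servlet on each request.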

2.2.5 Advantages of JSP


Scripting: server-side languages like ASP have one common drawback: they depend on relatively weak scripting languages for processing. JSP, by contrast, uses the powerful, fully object-oriented Java language for processing.

Write once, run anywhere: JSP technology brings the write-once-run-anywhere approach to interactive web pages. JSP pages can easily be moved across platforms without any changes.
2.2.6 JSP Architecture
The source code of a JSP page is essentially just HTML sprinkled here and there with special JSP tags and/or Java code enclosed in those tags. The file's extension is .jsp rather than the usual .html or .htm, and it tells the server that this document requires special handling. That special handling, accomplished with a web server extension or plug-in, involves four steps:

1. The JSP engine parses the page and creates a Java source file.
2. It then compiles the file produced in step 1 into a Java class file; the class created in this step is a servlet.
3. The servlet engine loads the servlet class for execution.
4. The servlet executes and streams the results back to the requester.

Steps 1 and 2 occur only once, when you first deploy or update the JSP. The servlet engine performs step 3 only upon the first request for that servlet since the last server restart; after that, the class loader has loaded the class once and it remains available for the life of that JVM. Finally, some application servers provide page caching, which can further improve performance and reduce the cost of serving requests.

Chapter 3
Modules
The search engine with web crawler has three modules:
1. Administrator Side
2. Search
3. Web Service

3.1 Administrator Side


In this module, the administrator downloads web pages and saves them in a file. The administrator also keeps track of the details of searching and can set the page details.

This module has a login session: by typing the correct username and password in the corresponding fields, we can enter the administrator side. Usernames and passwords are stored in the database, so only authorized people can log on to the administrator side.

The administrator side has two sub-modules.
3.1.1 Page Settings
In this sub-module, the administrator can set the font type and color of the content, and the background color, of the selected page. The administrator can choose any color and font for the content from a selected list of colors and fonts; the background color can likewise be selected from a list of colors. The administrator writes the selected colors and font to the database, and the pages change their appearance according to the values read from the database: the administrator first writes a particular color and font to the database, and whenever the page settings change, the database is updated.
3.1.2 Log Settings
The administrator can keep the details of searching: the searched word and the time and date of the search. These details are stored in the database; on the corresponding page we can see the log table containing the details of each search.

There is also a logout session for the administrator side, through which we can successfully log out.

3.2 Search
When a query is given by the user, the search engine checks for the corresponding index file. If it is not present, the engine makes an index file with that query as the filename. It then checks all the web pages for the given query and adds the addresses of the matching URLs to that index file. The number of occurrences of the query term in each page is counted and recorded in a database. The ranking is based on this count of the query term; the URLs are then listed from the database in descending order of count.
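The count-and-rank step can be sketched in plain Java. This is a simplified, self-contained sketch: page contents are given as in-memory strings rather than downloaded files, there is no database, and the names countTerm and rank are illustrative, not the project's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Ranker {
    // Count case-insensitive occurrences of the query term in one page's text.
    static int countTerm(String page, String term) {
        String p = page.toLowerCase(), t = term.toLowerCase();
        int count = 0;
        for (int i = p.indexOf(t); i >= 0; i = p.indexOf(t, i + t.length())) {
            count++;
        }
        return count;
    }

    // Rank pages (URL -> content) by descending occurrence count of the term,
    // keeping only pages that contain the term at least once.
    static List<String> rank(Map<String, String> pages, String term) {
        List<String> urls = new ArrayList<>();
        for (Map.Entry<String, String> e : pages.entrySet()) {
            if (countTerm(e.getValue(), term) > 0) {
                urls.add(e.getKey());
            }
        }
        urls.sort((a, b) ->
                countTerm(pages.get(b), term) - countTerm(pages.get(a), term));
        return urls;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://a.example", "crawler crawler crawler");
        pages.put("http://b.example", "search engine with web crawler");
        pages.put("http://c.example", "nothing relevant here");
        // a.example ranks first (3 hits), then b.example (1); c.example is omitted.
        System.out.println(rank(pages, "crawler"));
    }
}
```

In the real module, the per-page counts would be written to the database and the ordering produced by an ORDER BY clause rather than an in-memory sort.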

3.3 Web Service


The web service module provides the facility of getting instant news. News searching is done by means of XML parsing, and the news is mainly fetched from Yahoo.
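The XML-parsing step can be sketched with the JDK's built-in DOM parser. This is a simplified sketch: the RSS snippet is inlined as a string, whereas the real module would fetch the feed from the Yahoo news URL, and the element names follow the common RSS layout rather than any specific Yahoo schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NewsParser {
    // Extract the <title> of each <item> from an RSS document.
    static List<String> itemTitles(String rssXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        rssXml.getBytes(StandardCharsets.UTF_8)));
        List<String> titles = new ArrayList<>();
        NodeList items = doc.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList t = item.getElementsByTagName("title");
            if (t.getLength() > 0) {
                titles.add(t.item(0).getTextContent());
            }
        }
        return titles;
    }

    public static void main(String[] args) throws Exception {
        String rss = "<rss><channel>"
                + "<item><title>Headline one</title></item>"
                + "<item><title>Headline two</title></item>"
                + "</channel></rss>";
        System.out.println(itemTitles(rss)); // [Headline one, Headline two]
    }
}
```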

Chapter 4
Working
4.1 Steps used in the implementation of Search
engine
The Steps involved in the implementation of search engine with web
crawler:
1. The necessary URLs are first downloaded in to the cache by the
administrator.
2. When the user submit the query, independent cache for individual
index
terms are created after checking whether it is present or not.
3. The web pages are searched for finding the index terms and list out
the URLs containing the corresponding index terms are recorded into
the
database.
4. The count of the given query term in each webpage is also recorded
into
the database.
5. Finally,the ranked URLs are listed out in the decreasing order from
the
database.
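The download phase in step 1 follows the seed-and-frontier pattern described in the abstract. A minimal sketch, under stated assumptions: links are extracted with a simple regular expression (real crawlers use an HTML parser), and page contents are supplied in memory through a map that stands in for an HTTP download.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Crawler {
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    // Extract absolute hyperlinks from one page's HTML.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl: start from the seeds, and add every newly seen
    // link to the frontier until there is nothing left to visit.
    // `fetch` maps a URL to its HTML (a stand-in for an HTTP download).
    static Set<String> crawl(List<String> seeds, Map<String, String> fetch) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;          // already crawled
            String html = fetch.getOrDefault(url, "");
            frontier.addAll(extractLinks(html));      // grow the frontier
        }
        return visited;
    }
}
```

The visited set doubles as the crawl result: every URL it contains has been fetched once, and revisits are skipped, which is what keeps the frontier from growing forever on cyclic link structures.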

Chapter 5
System Design
5.1 Data Flow Diagram
A Data Flow Diagram (DFD), or bubble chart, is a graphical tool for structured analysis; DFDs were introduced by DeMarco in 1978 and Gane and Sarson in 1979. A DFD models how a system transforms data: data flows from external entities into processes, which pass their output flows on to other processes, external entities, or files. Data in files may also flow into processes as inputs.

There are various symbols used in a DFD. Bubbles represent processes, and named arrows indicate the data flows. External entities are represented by rectangles; they are outside the system, such as vendors or customers with whom the system interacts, and they either supply or consume data. Entities supplying data are known as sources, and those that consume data are called sinks. Data are stored in a data store by a process in the system. Each component in a DFD is labeled with a descriptive name, and process names are further identified with a number.

DFDs can be hierarchically organized, which helps in partitioning and analyzing large systems. As a first step, one Data Flow Diagram can depict an entire system, giving the system overview; this is called the context diagram, or level 0 DFD. The context diagram can then be expanded further. The successive expansion of DFDs from the context diagram to those giving more detail is known as leveling the DFD. Thus a top-down approach is used, starting with an overview and then working out the details.

The main merit of a DFD is that it can provide an overview of what data a system would process, what transformations of data are done, what files are used, and where the results flow.

The data flow diagram of the Search Engine with Web Crawler has been represented as a hierarchical DFD: the context-level DFD was drawn first, then the processes were decomposed into several elementary levels and represented in order of importance.
5.2 Database Design
Chapter 6
Conclusion
Nowadays there are many search engines, such as Google, Yahoo, and AltaVista. We have tried to develop a search engine with some of the facilities of the current search engines, such as text search and news search. Still, there are limitations in our search engine.

References
[1] R. Baeza-Yates: Modern Information Retrieval
[2] http://www.searchenginewatch.com/
[3] http://www.webcrawler.com/
