How A Search Engine Works - Report

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SYNERGY INSTITUTE OF ENGINEERING & TECHNOLOGY, DHENKANAL
SEMINAR ON
How a search engine works?
Seminar Report ’10 How a search engine works
Guided by : XxX
Submitted by:
SOVAN MISRA
CS-O7-42
0701230147
Dept. of CSE 2 S.I.E.T, Dhenkanal
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
SYNERGY INSTITUTE OF ENGINEERING AND TECHNOLOGY
DHENKANAL
CERTIFICATE
Certified that this is a bonafide record of the seminar entitled “HOW A
SEARCH ENGINE WORKS” done by “SOVAN MISRA” of the 7th semester,
Computer Science and Engineering, in the year 2010 in partial fulfilment
of the requirements for the award of the Degree of Bachelor of Technology
in Computer Science and Engineering of Synergy Institute of Engineering
and Technology, Dhenkanal.
XxX XxX
Seminar Guide Head of the Department

ACKNOWLEDGEMENT
I thank my seminar guide XxX, Lecturer, SIET, for her proper
guidance, and valuable suggestions. I am indebted to XxX, the HOD,
Computer Science department & other faculty members for giving me
an opportunity to learn and do this seminar. If not for the above
mentioned people my seminar would never have been completed
successfully. I once again extend my sincere thanks to all of them.
SOVAN MISRA

HOW SEARCH ENGINE WORKS?
INTRODUCTION
What is a search engine?
A search engine is a software program that searches a database and gathers
and reports information that contains, or is related to, specified terms.
Or
It is a website whose primary function is to search for, gather, and
report information available on the Internet or a portion of the
Internet.
Why Search Engine?
In today’s world, millions and billions of pieces of information are
available on the vast World Wide Web (WWW). Searching for a particular
piece of information by hand would cost the user a great deal of time.
We therefore need tools that make searching automatic, quick, and effortless.
So, to reduce the problem to a more or less manageable one, Web
search engines were introduced a few years ago.
Different Search engines:

History of search engines:-
In 1990, the first search engine, ARCHIE, was released. At that time there
was no World Wide Web. Data resided on defence contractor, university, and
government computers, and techies were the only people accessing the data.
The computers were interconnected by Telnet*.
File Transfer Protocol (FTP) was used for transferring files from computer
to computer.
There was no such thing as a browser, so information or data was
transferred in its native format and viewed using the software associated
with the file type.
Archie searched FTP servers and indexed their files into a searchable
directory.
In 1991, Gopherspace came into existence with the advent of Gopher.
It catalogued FTP sites, and the resulting catalogue became known as
Gopherspace.
In 1994, WebCrawler, a new type of search engine that indexed the entire
contents of a webpage, was introduced.
Between 1995 and 1998, many changes and developments occurred in the
world of search engines. Meta tags* in the webpage were first utilized by
some search engines to determine relevancy.
Search engine rank-checking software was introduced. It provided an
automated tool to determine a website's position and ranking within the
major search engines.
Around 1998, search engine algorithms were introduced to optimize
searching.
In 2000, marketers determined that pay-per-click campaigns were an easy
yet expensive approach to gaining top search rankings. To elevate sites in
the search engine rankings, websites started adding useful and relevant
content while optimizing their webpages for each specific search engine.
Search engine optimization (SEO) still continues today as the algorithms
keep improving.
TYPE OF SEARCH ENGINES:

On the basis of how they work, search engines are categorised into the
following groups:
* Crawler-based search engine.
* Directories.
* Hybrid search engines.
* Meta search engine.
CRAWLER BASED SEARCH ENGINE:
It uses automated software programs, known as “spiders”, “crawlers”,
“robots” or “bots”, to survey and categorise webpages.
A spider finds a web page, downloads it, and analyses the information
presented on the page. The webpage is then added to the search
engine's database.
When a user performs a search, the search engine checks its database
of webpages for the keywords the user searched for.
The results are ordered according to the search engine's ranking
algorithm and shown on the search engine result pages (SERPs).
Ex:-
GOOGLE (www.google.com)
ASK (www.ask.com)

SPIDER’S ALGORITHMS :
All spiders use the following algorithm for retrieving documents from the
web:
1) The algorithm uses a list of known URLs. This list contains at least one
URL to start with.
2) A URL is taken from the list and the corresponding document is
retrieved from the web.
3) The document is parsed to retrieve information for the index database
and to extract the embedded links to other documents.
4) The URLs of the links found in the document are added to the list of
known URLs.
5) If the list is empty or some limit is exceeded (number of documents
retrieved, size of the index database, etc.), the algorithm stops;
otherwise the algorithm continues at step 2.
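The retrieval loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `TOY_WEB`, `crawl`, and `max_documents` are hypothetical names, and a small in-memory dictionary stands in for fetching and parsing real pages.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs linked from that page.
# A real spider would download and parse pages over HTTP instead.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

def crawl(seed_urls, web, max_documents=100):
    """Return the URLs retrieved, in retrieval order."""
    known = deque(seed_urls)   # list of known URLs still to be visited
    seen = set(seed_urls)      # every URL ever placed on the list
    retrieved = []
    # Stop when the list is empty or the document limit is reached.
    while known and len(retrieved) < max_documents:
        url = known.popleft()        # take a URL from the list
        links = web.get(url, [])     # "retrieve" and "parse" the document
        retrieved.append(url)
        for link in links:           # add newly found URLs to the list
            if link not in seen:
                seen.add(link)
                known.append(link)
    return retrieved

print(crawl(["a.example"], TOY_WEB))
```

The `seen` set keeps the loop from revisiting a page when two pages link to each other.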
A crawler program treats the World Wide Web as a big graph with pages as
nodes and hyperlinks as arcs.
The crawler works with a simple goal: indexing all keywords in the
webpage titles.
Three data structures are needed for the crawler or spider algorithm:
• A large linear array, url_table
• Heap
• Hash table

url_table:
It is a large linear array that contains millions of entries.
Each entry contains two pointers:
• Pointer to URL
• Pointer to title
These are variable-length strings and are kept on the heap.
Heap:
It is a large unstructured chunk of virtual memory to which strings can be
appended.
Hash table:
It is the third data structure, with ‘n’ entries.
Any URL can be run through a hash function to produce a non-negative
integer less than ‘n’.
All URLs that hash to the value ‘k’ are hooked together on a linked list
starting at entry ‘k’ of the hash table.
Every entry in url_table is also entered into the hash table.
The main use of the hash table is to start with a URL and be able to quickly
determine whether it is already present in url_table.
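A minimal sketch of how the three structures could fit together is given below. It rests on several assumptions: Python lists stand in for the string heap and for the linked lists, `zlib.crc32` is just one possible hash function, and the names `N`, `hash_url`, `lookup`, and `insert` are hypothetical.

```python
import zlib

N = 8  # number of hash-table entries ('n' in the text); tiny for illustration

heap = []        # stands in for the string heap: strings are only appended
url_table = []   # each entry: (position of URL in heap, position of title in heap)
hash_table = [[] for _ in range(N)]  # entry k: chain of url_table indices

def hash_url(url):
    """Map a URL to a non-negative integer less than N."""
    return zlib.crc32(url.encode("utf-8")) % N

def lookup(url):
    """Return the url_table index of url, or None if it is not present."""
    for entry in hash_table[hash_url(url)]:
        if heap[url_table[entry][0]] == url:
            return entry
    return None

def insert(url, title):
    """Add url and title unless the URL is already known; return its index."""
    found = lookup(url)
    if found is not None:
        return found
    heap.append(url)
    heap.append(title)
    url_table.append((len(heap) - 2, len(heap) - 1))
    hash_table[hash_url(url)].append(len(url_table) - 1)
    return len(url_table) - 1
```

Calling `insert` twice with the same URL returns the same index both times, which is the quick membership check the hash table exists to provide.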

DATA STRUCTURE FOR CRAWLER
Building the index requires two phases:
• Searching (URL processing)
• Indexing
The heart of the search engine is a recursive procedure, process_url, which
takes a URL string as input.
Searching is done by the procedure process_url as follows:-
• It hashes the URL to see if it is already present in url_table. If so, it
is done and returns immediately.
• If the URL is not already known, its page is fetched.
• The URL and title are then copied to the heap, and pointers to these
two strings are entered in url_table.
• The URL is also entered into the hash table.
• Finally, process_url extracts all the hyperlinks from the page and
calls process_url once per hyperlink, passing the hyperlink’s URL as the
input parameter.
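The recursive procedure can be sketched as below. This is an illustration only: `TOY_PAGES` and `fetch` are hypothetical stand-ins for downloading and parsing real pages, and a Python dictionary (`known`) plays the role of the hash table.

```python
# Hypothetical toy web: URL -> (title, list of hyperlinks on the page).
TOY_PAGES = {
    "http://a.example/": ("Search Engine Basics", ["http://b.example/"]),
    "http://b.example/": ("How Spiders Crawl",
                          ["http://a.example/", "http://c.example/"]),
    "http://c.example/": ("Ranking With TF-IDF", []),
}

url_table = []   # entries of (url, title); stands in for pointers into the heap
known = {}       # stands in for the hash table: url -> url_table index

def fetch(url):
    """Pretend to download a page; returns (title, links) or None."""
    return TOY_PAGES.get(url)

def process_url(url):
    if url in known:                 # already present in url_table: done
        return
    page = fetch(url)                # not known yet: fetch the page
    if page is None:
        return
    title, links = page
    url_table.append((url, title))   # record the URL and title
    known[url] = len(url_table) - 1  # enter the URL into the hash table
    for link in links:               # recurse once per hyperlink
        process_url(link)

process_url("http://a.example/")
print(url_table)
```

On a web of realistic size the recursion depth would exceed what most language runtimes allow, so a practical crawler would likely replace the recursion with an explicit queue of pending URLs.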
Indexing is done as follows:-
• For each entry in url_table, the indexing procedure examines the title
and selects all words not on the stop list.
• Each selected word is written to a file as a line consisting of the
word followed by the current url_table entry number.
• When the whole table has been scanned, the file is sorted by word.
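The indexing phase can be sketched as follows. The stop list and the table contents are made-up examples, `build_index` is a hypothetical name, and the "file" is kept as an in-memory list of (word, entry number) lines for brevity.

```python
STOP_LIST = {"a", "an", "the", "of", "how", "with"}  # assumed stop words

# Hypothetical url_table contents: (url, title) per entry.
url_table = [
    ("http://a.example/", "The Basics of Search"),
    ("http://b.example/", "How Spiders Search the Web"),
]

def build_index(table, stop_list):
    """Return sorted (word, entry_number) lines, one per selected title word."""
    lines = []
    for entry_number, (_url, title) in enumerate(table):
        for word in title.lower().split():
            if word not in stop_list:
                lines.append((word, entry_number))
    lines.sort()   # corresponds to sorting the file by word
    return lines

for word, entry_number in build_index(url_table, STOP_LIST):
    print(word, entry_number)
```

After sorting, all entry numbers for one word sit on adjacent lines, so the set of pages for a keyword can be read off in one pass.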
Formulating queries:
• Submitting a keyword causes a request to be sent to the machine
where the index is located (the web server).
• The keyword is then looked up in the index database to find the set
of URL indices for each keyword.
• These indices are used to index into url_table to find all the titles
and URLs, which are then stored in the document server.
• These are then combined to form a web page and sent back to the user as
the response.
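A query lookup along these lines might look like the sketch below. The data is invented, `answer_query` is a hypothetical name, and combining several keywords by intersecting their sets of indices (AND semantics) is an assumption, since the text does not say how the sets are combined.

```python
# Hypothetical data: url_table entries and an index from word to entry numbers.
url_table = [
    ("http://a.example/", "Search Engine Basics"),
    ("http://b.example/", "Search Spiders"),
    ("http://c.example/", "Ranking Basics"),
]
index = {
    "search": {0, 1},
    "engine": {0},
    "basics": {0, 2},
    "spiders": {1},
    "ranking": {2},
}

def answer_query(query, index, table):
    """Return (title, url) pairs for entries whose titles contain every keyword."""
    keywords = query.lower().split()
    if not keywords:
        return []
    entry_sets = [index.get(word, set()) for word in keywords]
    matches = set.intersection(*entry_sets)   # assumption: AND semantics
    return [(table[i][1], table[i][0]) for i in sorted(matches)]

print(answer_query("search basics", index, url_table))
```

The returned list of titles and URLs is what would be formatted into the result page sent back to the user.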
Determining Relevance
The classic algorithm “TF-IDF” is used for determining relevance.
• It is a weight often used in information retrieval and text mining.
This weight is a statistical measure used to evaluate how important a
word is to a document in a collection.
• A high weight in TF-IDF is reached by a high term frequency (in the
given document) and a low document frequency of the term in the
whole collection of documents.
Term Frequency
• “Term Frequency” is based on the number of times a given term appears
in a document, normalised by the length of that document.
• It gives a measure of the importance of the term $t_i$ within the
particular document.
Term Frequency,
$$\mathrm{tf}_i = \frac{n_i}{\sum_k n_k}$$
where $n_i$ is the number of occurrences of the considered term, and
the denominator is the number of occurrences of all terms.
E.g.
If a document contains 100 total words and the word computer
appears 3 times, then the term frequency of the word computer in the
document is 0.03 (3/100).
Inverse Document Frequency
The “inverse document frequency” is a measure of the general importance
of the term (obtained by dividing the number of all documents by the
number of documents containing the term, and then taking the logarithm of
that quotient).
$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
Where,
• $|D|$ : total number of documents in the corpus
• $|\{d : t_i \in d\}|$ : number of documents where the term $t_i$
appears (that is, $n_i \neq 0$)
Inverse Document Frequency
There are different ways of calculating the IDF.
“Document Frequency” (DF) is found by determining how many
documents contain the word and dividing that count by the total number of
documents in the collection.
E.g.
1) If the word computer appears in 1,000 documents out of
a total of 10,000,000, then the DF is 0.0001
(1,000/10,000,000).
2) Alternatively, take the log of the inverse of the document frequency.
The natural logarithm is commonly used. In this
example we would have
IDF = ln(10,000,000 / 1,000) = ln(10,000) ≈ 9.21
The final TF-IDF score is then calculated either by dividing the “Term
Frequency” by the “Document Frequency” (first formula) or by multiplying
the “Term Frequency” by the logarithmic IDF (alternate formula).
E.g.
The TF-IDF score for computer in the collection would be:
1) TF-IDF = 0.03 / 0.0001 = 300, using the first formula.
2) If the alternate (logarithmic) formula is used, we would have
TF-IDF = 0.03 × 9.21 ≈ 0.28.
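The worked example can be checked with a few lines of Python. This sketch uses the logarithmic form of IDF with the natural logarithm; the function names are hypothetical.

```python
import math

def term_frequency(term_count, total_terms):
    """Occurrences of the term divided by the number of terms in the document."""
    return term_count / total_terms

def inverse_document_frequency(total_docs, docs_with_term):
    """Natural log of (all documents / documents containing the term)."""
    return math.log(total_docs / docs_with_term)

tf = term_frequency(3, 100)                          # "computer": 3 of 100 words
idf = inverse_document_frequency(10_000_000, 1_000)  # 1,000 of 10,000,000 documents
print(round(tf, 2), round(idf, 2), round(tf * idf, 2))  # prints: 0.03 9.21 0.28
```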

OTHER TYPES OF SEARCHING TECHNIQUES:
Directories
Human editors comprehensively check each website and
rank it, based on the information they find, using a pre-defined set of
rules.
There are two major directories :
Yahoo Directory (www.yahoo.com)
Open Directory (www.dmoz.org)
Hybrid Search Engines

Hybrid search engines use a combination of both crawler-based
results and directory results.

Examples of hybrid search engines are:
Yahoo (www.yahoo.com)
Google (www.google.com)
Meta Search Engines

Also known as Multiple Search Engines or Metacrawlers.
Meta search engines query several other Web search engine
databases in parallel and then combine the results in one list.
Examples of Meta search engines include:

Metacrawler (www.metacrawler.com)
Dogpile (www.dogpile.com)
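The way a meta search engine queries several engines in parallel and combines the results can be sketched as below. The two engine functions are hypothetical stubs rather than real services, the engine list is assumed to be non-empty, and interleaving by rank with duplicate removal is only one possible way to merge the lists.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search engines: each maps a query to a
# ranked list of URLs. A real meta search engine would call remote services.
def engine_one(query):
    return ["http://a.example/", "http://b.example/"]

def engine_two(query):
    return ["http://b.example/", "http://c.example/"]

def meta_search(query, engines):
    """Query all engines in parallel and merge the results into one list."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    merged, seen = [], set()
    # Interleave by rank so each engine's top hits come first; drop duplicates.
    for rank in range(max(len(results) for results in result_lists)):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged

print(meta_search("search engine", [engine_one, engine_two]))
```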
References:
http://computer.howstuffworks.com/internet/basics/search-engine.htm
http://searchenginewatch.com/2168031
http://www.infotoday.com/searcher/may01/liddy.htm
http://www.slideshare.net/jsuleiman/how-search-engines-work-presentation
Hey!
This is Sovan
Please send your feedbacks @
sovan107@gmail.com
