ENGINEERING SEMINAR ON
HOW A SEARCH ENGINE WORKS

Seminar Report ’10

Guided by: XxX
Submitted by: SOVAN MISRA
CS-O7-42
0701230147
Dept. of CSE, S.I.E.T, Dhenkanal
CERTIFICATE
XxX XxX
ACKNOWLEDGEMENT
SOVAN MISRA
INTRODUCTION
In 1990, the first search engine, ARCHIE, was released. At that time there
was no World Wide Web: data resided on defence contractor, university, and
government computers, and techies were the only people accessing it.
The computers were interconnected by Telnet*.
The File Transfer Protocol (FTP) was used for transferring files from
computer to computer.
There was no such thing as a browser, so information was transferred in
its native format and viewed using the software associated with each file
type.
Archie searched FTP servers and indexed their files into a searchable
directory.
In 1991, Gopherspace came into existence with the advent of Gopher.
It catalogued FTP sites, and the resulting catalogue became known as
Gopherspace.
In 1994, WebCrawler, a new type of search engine that indexed the entire
contents of a web page, was introduced.
Between 1995 and 1998, many changes and developments occurred in the
world of search engines. Meta tags* in the web page were first utilized by
some search engines to determine relevancy.
Search engine rank-checking software was introduced. It provided an
automated tool to determine a web site's position and ranking within the
major search engines.
Around 1998, search engine algorithms were introduced to optimize
searching.
In 2000, marketers determined that pay-per-click campaigns were an easy
yet expensive approach to gaining top search rankings. To elevate their
sites in the search engine rankings, websites started adding useful and
relevant content while optimizing their web pages for each specific search
engine. Search engine optimization (SEO) continues today as the algorithms
keep improving.
Ex:-
GOOGLE (www.google.com)
ASK (www.ask.com)
SPIDER’S ALGORITHM:
All spiders use the following algorithm for retrieving documents from the
web:
The algorithm uses a list of known URLs. This list contains at least one
URL to start with.
Each document is parsed to retrieve information for the index database and
to extract the embedded links to other documents.
The crawler program treats the World Wide Web as a big graph, with pages
as nodes and hyperlinks as arcs.
A crawler works with a simple goal: indexing all keywords in web page
titles.
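The crawl described above can be sketched as a breadth-first walk of the web graph. In this sketch the in-memory WEB dictionary, its URLs, and its titles are made-up stand-ins for pages a real crawler would fetch over HTTP and parse:

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (title, outgoing links).
# A real crawler would fetch each URL and parse the HTML instead.
WEB = {
    "http://a.example": ("Search Engines", ["http://b.example", "http://c.example"]),
    "http://b.example": ("Web Crawlers", ["http://a.example"]),
    "http://c.example": ("Indexing Basics", []),
}

def crawl(seed_urls):
    """Breadth-first walk of the web graph: pages are nodes, links are arcs.
    Returns an index mapping each keyword in a page title to its URLs."""
    index = {}
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    while frontier:
        url = frontier.popleft()
        if url not in WEB:               # unreachable page: skip it
            continue
        title, links = WEB[url]          # "parse" the fetched document
        for word in title.lower().split():
            index.setdefault(word, []).append(url)
        for link in links:               # follow the embedded links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return index

index = crawl(["http://a.example"])
```

Starting from the single seed URL, the loop reaches all three pages and indexes every title keyword, which matches the stated goal of the crawler.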
Three data structures are needed for the crawler (spider) algorithm:
• URL_Table: a large linear array in which each entry holds two
pointers, a pointer to the URL and a pointer to the title.
• Heap
• Hash table
Heap:
It is a large unstructured chunk of virtual memory to which strings can be
appended.
Hash table:
It is the third data structure, of size ‘n’ entries.
Any URL can be run through a hash function to produce a non-negative
integer less than ‘n’.
All URLs that hash to the value ‘k’ are hooked together on a linked list
starting at entry ‘k’ of the hash table.
Every entry in URL_Table is also entered into the hash table.
The main use of the hash table is to start with a URL and be able to
quickly determine whether it is already present in URL_Table.
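The interplay of the three structures can be modelled in a few lines of Python. This is a simplified sketch, not the report's implementation: the bucket count N, the use of Python lists for the heap and chains, and the function names are all illustrative.

```python
N = 1024               # number of hash-table buckets ('n' in the text)
heap = []              # chunk of memory to which strings are appended
url_table = []         # linear array: each entry = (url offset, title offset)
buckets = [[] for _ in range(N)]   # bucket k chains URL_Table indices

def append_to_heap(s):
    """Append a string to the heap and return its offset."""
    heap.append(s)
    return len(heap) - 1

def lookup(url):
    """Hash the URL to k < N, then scan bucket k's chain for a match."""
    k = hash(url) % N
    for entry_index in buckets[k]:
        url_off, _ = url_table[entry_index]
        if heap[url_off] == url:
            return entry_index
    return None

def insert(url, title):
    """Add a URL to URL_Table (if new) and hook it into the hash table."""
    if lookup(url) is not None:        # already seen: skip the duplicate
        return False
    entry = (append_to_heap(url), append_to_heap(title))
    url_table.append(entry)
    buckets[hash(url) % N].append(len(url_table) - 1)
    return True
```

Calling insert twice with the same URL returns True the first time and False the second, which is exactly the duplicate check the hash table exists to make fast.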
Formulating queries:
Determining Relevance
Relevance is commonly scored with the tf-idf weight,
tf-idf(t, d) = tf(t, d) × log( | D | / | {d ∈ D : t ∈ d} | )
Where,
• | D | : total number of documents in the corpus
• | {d ∈ D : t ∈ d} | : number of documents in which the term t appears
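A minimal Python sketch of tf-idf scoring, the standard relevance weight suggested by the |D| definition above; the corpus of word lists here is made up for illustration, and the sketch assumes the term occurs in at least one document:

```python
import math

def tf_idf(term, doc, corpus):
    """tf-idf weight of a term in one document of a corpus (list of word lists)."""
    tf = doc.count(term) / len(doc)                    # term frequency in doc
    containing = sum(1 for d in corpus if term in d)   # |{d : term in d}|
    idf = math.log(len(corpus) / containing)           # log(|D| / containing)
    return tf * idf

corpus = [
    ["search", "engine", "index"],
    ["web", "crawler", "index"],
    ["gopher", "archie"],
]
score = tf_idf("search", corpus[0], corpus)
```

A term that appears in every document gets idf = log(1) = 0, so common words contribute nothing to relevance, while rare terms like "search" above score highly.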
Directories
Hey!
This is Sovan.
Please send your feedback to
sovan107@gmail.com