Babita Naagar
2k12/IT/021
Contents
1. INTRODUCTION
2. SEARCH ENGINES
   WORKING
   HISTORY
   HOW GOOGLE MADE A BREAKTHROUGH
3. DEVELOPMENT OF GOOGLE
   HISTORY
   PAGERANK ALGORITHM
4. GOOGLE ARCHITECTURE
   OVERVIEW
   DATA STRUCTURES
   CRAWLING
   INDEXING
   SEARCHING
5. BIBLIOGRAPHY
INTRODUCTION
This project explores how search engines have developed over time and examines, in detail, the working of a search engine.
It then explains how Google, now the most powerful search engine in the world, brought a major breakthrough to the world of search engines. Since then, it has remained the most efficient and accurate search engine ever made.
It goes on to examine the architecture of the Google search engine and the data structures and concepts involved in crawling, indexing, and searching.
SEARCH ENGINES
A web search engine is a software system that is designed to search for information
on the World Wide Web. The search results are generally presented in a line of results
often referred to as search engine results pages (SERPs). The information may be a mix of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are
maintained only by human editors, search engines also maintain real-time information by
running an algorithm on a web crawler.
Web search engines work by storing information about many web pages, which they
retrieve from the HTML markup of the pages. These pages are retrieved by a Web
crawler (sometimes also known as a spider), an automated program that follows every link on the site. The site owner can exclude specific pages by using robots.txt.
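To make the crawl loop concrete, here is a minimal Python sketch (illustrative only, not Google's crawler): it checks robots.txt, fetches a page, extracts anchor links, and enqueues new URLs. The seed URL, page limit, and user agent "*" are arbitrary assumptions.

```python
# Minimal illustrative crawler: honor robots.txt, fetch pages, follow links.
from collections import deque
from html.parser import HTMLParser
from urllib import request, robotparser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect the href targets of all anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    robots = {}                  # one robots.txt parser per host
    frontier = deque([seed])     # URLs waiting to be fetched
    seen = {seed}
    pages = {}                   # url -> HTML source
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        host = "{0.scheme}://{0.netloc}".format(urlparse(url))
        rp = robots.get(host)
        if rp is None:
            rp = robotparser.RobotFileParser(host + "/robots.txt")
            try:
                rp.read()
            except OSError:
                pass
            robots[host] = rp
        if not rp.can_fetch("*", url):
            continue             # the site owner excluded this page
        try:
            html = request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```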
The search engine then analyzes the contents of each page to determine how it
should be indexed (for example, words can be extracted from the titles, page content,
headings, or special fields called meta tags). Data about web pages are stored in an index
database for use in later queries. A query from a user can be a single word. The index helps
find information relating to the query as quickly as possible. Some search engines, such
as Google, store all or part of the source page (referred to as a cache) as well as information
about the web pages, whereas others, such as AltaVista, store every word of every page
they find. This cached page always holds the actual search text since it is the one that was
actually indexed, so it can be very useful when the content of the current page has been
updated and the search terms are no longer in it. This problem might be considered a mild
form of linkrot, and Google's handling of it increases usability by satisfying the user's expectation that the search terms will appear on the returned page, in line with the principle of least astonishment. These cached pages are also useful because they may contain data that is no longer available elsewhere.
an "inverted index" by analyzing texts it locates. This first form relies much more heavily on
the computer itself to do the bulk of the work.
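As a toy illustration of an inverted index (the tokenizer, document IDs, and sample texts below are invented for the example), the following Python sketch maps each word to the sorted list of documents containing it:

```python
import re
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document IDs containing it.

    `documents` is a dict of {doc_id: text}; this toy tokenizer just
    lowercases the text and splits it on non-letters.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    1: "Search engines crawl the web",
    2: "The web is indexed by search engines",
    3: "Crawling precedes indexing",
}
index = build_inverted_index(docs)
print(index["web"])      # [1, 2]
print(index["search"])   # [1, 2]
```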
Most Web search engines are commercial ventures supported
by advertising revenue and thus some of them allow advertisers to have their listings ranked
higher in search results for a fee. Search engines that do not accept money for their search
results make money by running search related ads alongside the regular search engine
results. The search engines make money every time someone clicks on one of these ads.
The first web robot, the World Wide Web Wanderer, was created by Matthew Gray in 1993 and was used to generate an index called 'Wandex'. The purpose of the Wanderer was to measure the size of the World Wide Web,
which it did until late 1995. The web's second search engine Aliweb appeared in November
1993. Aliweb did not use a web robot, but instead depended on being notified by website
administrators of the existence at each site of an index file in a particular format.
JumpStation (created in December 1993 by Jonathon Fletcher) used a web robot to
find web pages and to build its index, and used a web form as the interface to its query
program. It was thus the first WWW resource-discovery tool to combine the three essential
features of a web search engine (crawling, indexing, and searching) as described below.
Because of the limited resources available on the platform it ran on, its indexing and hence
searching were limited to the titles and headings found in the web pages the crawler
encountered.
One of the first "all text" crawler-based search engines was WebCrawler, which came
out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage,
which has become the standard for all major search engines since. It was also the first one
widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon
University) was launched and became a major commercial endeavour.
Soon after, many search engines appeared and vied for popularity. These
included Magellan, Excite, Infoseek, Inktomi,Northern Light, and AltaVista. Yahoo! was
among the most popular ways for people to find web pages of interest, but its search
function operated on its web directory, rather than its full-text copies of web pages.
Information seekers could also browse the directory instead of doing a keyword-based
search.
Google adopted the idea of selling search terms in 1998, from a small search engine
company named goto.com. This move had a significant effect on the search engine business, which went from struggling to being one of the most profitable businesses on the Internet.
In 1996, Netscape was looking to give a single search engine an exclusive deal as the
featured search engine on Netscape's web browser. There was so much interest that
instead Netscape struck deals with five of the major search engines: for $5 million a year,
each search engine would be in rotation on the Netscape search engine page. The five
engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.
Search engines were also known as some of the brightest stars in the Internet
investing frenzy that occurred in the late 1990s. Several companies entered the market
spectacularly, receiving record gains during their initial public offerings. Some have taken
down their public search engine, and are marketing enterprise-only editions, such as
Northern Light. Many search engine companies were caught up in the dot-com bubble, a
speculation-driven market boom that peaked in 1999 and ended in 2001.
Around 2000, Google's search engine rose to prominence. The company achieved
better results for many searches with an innovation called PageRank, as was explained
in Anatomy of a Search Engine. This iterative algorithm ranks web pages based on the
number and PageRank of other web sites and pages that link there, on the premise that
good or desirable pages are linked to more than others. Google also maintained a minimalist
interface to its search engine. In contrast, many of its competitors embedded a search
engine in a web portal. In fact, Google search engine became so popular that spoof engines
emerged such as Mystery Seeker.
By 2000, Yahoo! was providing search services based on Inktomi's search engine.
Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in
2003. Yahoo! switched to Google's search engine until 2004, when it launched its own
search engine based on the combined technologies of its acquisitions.
Microsoft first launched MSN Search in the fall of 1998 using search results from
Inktomi. In early 1999 the site began to display listings from Looksmart, blended with results
from Inktomi. For a short time in 1999, MSN Search used results from AltaVista instead. In 2004, Microsoft began a transition to its own search technology, powered by its own web crawler (called msnbot).
Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29,
2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by
Microsoft Bing technology.
In 2012, following the April 24 release of Google Drive, Google released
the Beta version of Open Drive (available as a Chrome app) to enable the search of files
in the cloud. Open Drive has now been rebranded as Cloud Kite and utilizes Max Skuse's
search form algorithm. Cloud Kite is advertised as a "collective encyclopedia project based
on Google Drive public files and on the crowd sharing, crowd sourcing and crowd-solving
principles". Cloud Kite will also return search results from other cloud storage content
services including Dropbox, SkyDrive, Evernote and Box.
Google's annual revenue reached US$50 billion in 2012, which marked the first time the company had reached this feat, topping its 2011 total of US$38 billion.
DEVELOPMENT OF GOOGLE
HISTORY
Google in 1998
Beginning
Google began in March 1996 as a research project by Larry Page and Sergey Brin,
Ph.D. students at Stanford University.
In search of a dissertation theme, Page had been considering, among other things,
exploring the mathematical properties of the World Wide Web, understanding its link
structure as a huge graph. His supervisor, Terry Winograd, encouraged him to pick this idea
(which Page later recalled as "the best advice I ever got") and Page focused on the problem
of finding out which web pages link to a given page, based on the consideration that the
number and nature of such backlinks was valuable information for an analysis of that page
(with the role of citations in academic publishing in mind).
In his research project, nicknamed "BackRub", Page was soon joined by Brin, who was supported by a National Science Foundation Graduate Fellowship. Brin was already a close friend, whom Page had first met in the summer of 1995, when Page was part of a group of potential new students that Brin had volunteered to show around the campus. Both Brin and Page were working on the Stanford Digital Library Project (SDLP). The SDLP's goal was "to develop the enabling technologies for a single, integrated and universal digital library" and it was funded through the National Science Foundation, among other federal agencies.
Page's web crawler began exploring the web in March 1996, with Page's own
Stanford home page serving as the only starting point. To convert the backlink data that it
gathered for a given web page into a measure of importance, Brin and Page developed
the PageRank algorithm. While analyzing BackRub's output, which, for a given URL, consisted of a list of backlinks ranked by importance, the pair realized that a search engine
based on PageRank would produce better results than existing techniques (existing search
engines at the time essentially ranked results according to how many times the search term
appeared on a page).
A small search engine called "RankDex" from IDD Information Services (a subsidiary
of Dow Jones) designed by Robin Li was, since 1996, already exploring a similar strategy for
site-scoring and page ranking. The technology in RankDex would be patented and used later
when Li founded Baidu in China.
Convinced that the pages with the most links to them from other highly relevant
Web pages must be the most relevant pages associated with the search, Page and Brin
tested their thesis as part of their studies, and laid the foundation for their search engine.
By early 1997, the BackRub page described the state as follows:
Some Rough Statistics (from August 29th, 1996)
Total indexable HTML urls: 75.2306 Million
Total content downloaded: 207.022 gigabytes
...
BackRub is written in Java and Python and runs on several Sun Ultras and Intel
Pentiums running Linux. The primary database is kept on a Sun Ultra II with 28GB of disk.
Scott Hassan and Alan Steremberg have provided a great deal of very talented
implementation help. Sergey Brin has also been very involved and deserves many thanks.
-Larry Page page cs.stanford.edu
Late 1990s
Originally the search engine used the Stanford website with the domain google.stanford.edu. The domain google.com was registered on September 15, 1997.
Page and Brin later wrote that they believed Google would be better served, as shareholders and in all other ways, by a company that does good things for the world even if it forgoes some short-term gains.
PAGERANK ALGORITHM
PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. A PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.
Simplified algorithm
Assume a small universe of four web pages: A, B, C and D. Links from a page to itself,
or multiple outbound links from one single page to another single page, are ignored.
PageRank is initialized to the same value for all pages. In the original form of PageRank, the
sum of PageRank over all pages was the total number of pages on the web at that time, so
each page in this example would have an initial PageRank of 1. However, later versions of
PageRank, and the remainder of this section, assume a probability distribution between 0
and 1. Hence the initial value for each page is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon
the next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would
transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.
Suppose instead that page B had a link to pages C and A, page C had a link to page A,
and page D had links to all three pages. Thus, upon the next iteration, page B would transfer
half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C.
Page C would transfer all of its existing value, 0.25, to the only page it links to, A.
Since D had three outbound links, it would transfer one third of its existing value, or
approximately 0.083, to A. At the completion of this iteration, page A will have a PageRank
of 0.458.
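The worked example above can be checked with a short Python sketch; the link structure and the initial value of 0.25 come directly from the text, while everything else is illustrative scaffolding:

```python
# One iteration of simplified PageRank for the example above.
# Every page starts at 0.25; B links to {A, C}, C links to {A},
# and D links to {A, B, C}. A's out-links are unspecified in the example.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
rank = {page: 0.25 for page in links}

new_rank = {page: 0.0 for page in links}
for page, outgoing in links.items():
    if outgoing:
        share = rank[page] / len(outgoing)   # split PR equally over out-links
        for target in outgoing:
            new_rank[target] += share

print(round(new_rank["A"], 3))   # 0.458  (0.125 + 0.25 + 0.083)
```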
In the general case, the PageRank value for any page u can be expressed as:

PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}
i.e. the PageRank value for a page u is dependent on the PageRank values for each
page v contained in the set Bu (the set containing all pages linking to page u), divided by the
number L(v) of links from page v.
Damping factor
The PageRank theory holds that an imaginary surfer who is randomly clicking on links
will eventually stop clicking. The probability, at any step, that the person will continue is a
damping factor d. Various studies have tested different damping factors, but it is generally
assumed that the damping factor will be set around 0.85.
The damping factor is subtracted from 1 (and in some variations of the algorithm, the
result is divided by the number of documents (N) in the collection) and this term is then
added to the product of the damping factor and the sum of the incoming PageRank scores.
That is,

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

where p_1, ..., p_N are the pages in the collection, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on page p_j, and N is the total number of pages.
So any page's PageRank is derived in large part from the PageRanks of other pages.
The damping factor adjusts the derived value downward. The original paper, however, gave the following formula, which has led to some confusion:

PR(p_i) = 1 - d + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}
The difference between them is that the PageRank values in the first formula sum to
one, while in the second formula each PageRank is multiplied by N and the sum becomes N.
A statement in Page and Brin's paper that "the sum of all PageRanks is one" and claims by
other Google employees support the first variant of the formula above.
Page and Brin confused the two formulas in their most popular paper "The Anatomy
of a Large-Scale Hypertextual Web Search Engine", where they mistakenly claimed that the
latter formula formed a probability distribution over web pages.
Google recalculates PageRank scores each time it crawls the Web and rebuilds its
index. As Google increases the number of documents in its collection, the initial
approximation of PageRank decreases for all documents.
The formula uses a model of a random surfer who gets bored after several clicks and
switches to a random page. The PageRank value of a page reflects the chance that the
random surfer will land on that page by clicking on a link. It can be understood as a Markov
chain in which the states are pages, and the transitions, which are all equally probable, are
the links between pages.
If a page has no links to other pages, it becomes a sink and therefore terminates the
random surfing process. If the random surfer arrives at a sink page, it picks another URL at
random and continues surfing again.
When calculating PageRank, pages with no outbound links are assumed to link out to
all other pages in the collection. Their PageRank scores are therefore divided evenly among
all other pages. In other words, to be fair with pages that are not sinks, these random
transitions are added to all nodes in the Web, with a residual probability usually set to d =
0.85, estimated from the frequency that an average surfer uses his or her browser's
bookmark feature.
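Combining the damping factor with the sink-page convention described above gives the usual iterative computation. The sketch below is illustrative only; the iteration cap and convergence tolerance are arbitrary choices, and the example graph reuses the four-page universe from earlier.

```python
def pagerank(links, d=0.85, iterations=100, tol=1.0e-9):
    """Iteratively compute PageRank values that sum to 1.

    `links` maps each page to the list of pages it links to. Pages with
    no outbound links (sinks) are treated as linking to every page, as
    described in the text above.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - d) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = d * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # Sink page: distribute its rank evenly over all pages.
                share = d * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        converged = max(abs(new_rank[p] - rank[p]) for p in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank

print(pagerank({"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}))
```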
Other uses
Personalized PageRank is used by Twitter to present users with other accounts they
may wish to follow.
A version of PageRank has recently been proposed as a replacement for the
traditional Institute for Scientific Information (ISI) impact factor, and implemented
at Eigenfactor as well as at SCImago. Instead of merely counting total citations to a journal,
the "importance" of each citation is determined in a PageRank fashion.
A similar new use of PageRank is to rank academic doctoral programs based on their
records of placing their graduates in faculty positions. In PageRank terms, academic
departments link to each other by hiring their faculty from each other (and from
themselves).
PageRank has been used to rank spaces or streets to predict how many people
(pedestrians or vehicles) come to the individual spaces or streets. In lexical semantics it has
been used to perform Word Sense Disambiguation and to automatically
rank WordNet synsets according to how strongly they possess a given semantic property,
such as positivity or negativity.
A Web crawler may use PageRank as one of a number of importance metrics it uses
to determine which URL to visit during a crawl of the web. One of the early working
papers that were used in the creation of Google is Efficient crawling through URL
ordering, which discusses the use of a number of different importance metrics to determine
how deeply, and how much of a site Google will crawl. PageRank is presented as one of a
number of these importance metrics, though there are others listed such as the number of
inbound and outbound links for a URL, and the distance from the root directory on a site to
the URL.
The PageRank may also be used as a methodology to measure the apparent impact
of a community like the blogosphere on the overall Web itself. This approach therefore uses PageRank to measure the distribution of attention in reflection of the scale-free network paradigm.
In any ecosystem, a modified version of PageRank may be used to determine species
that are essential to the continuing health of the environment.
PageRank is also a useful tool for the analysis of protein networks in biology.
nofollow
In early 2005, Google implemented a new value, "nofollow", for the rel attribute of HTML link and anchor elements, so that website developers and bloggers can make links that Google will not consider for the purposes of PageRank; they are links that no longer
constitute a "vote" in the PageRank system. The nofollow relationship was added in an
attempt to help combat spamdexing.
As an example, people could previously create many message-board posts with links
to their website to artificially inflate their PageRank. With the nofollow value, message-board administrators can modify their code to automatically add rel="nofollow" to all
hyperlinks in posts, thus preventing PageRank from being affected by those particular posts.
This method of avoidance, however, also has various drawbacks, such as reducing the link
value of legitimate comments. (See: Spam in blogs#nofollow)
In an effort to manually control the flow of PageRank among pages within a website,
many webmasters practice what is known as PageRank Sculpting, which is the act of
strategically placing the nofollow attribute on certain internal links of a website in order to
funnel PageRank towards those pages the webmaster deemed most important. This tactic
has been used since the inception of the nofollow attribute, but may no longer be effective
since Google announced that blocking PageRank transfer with nofollow does not redirect
that PageRank to other links.
Deprecation
PageRank was once available for the verified site maintainers through the Google
Webmaster Tools interface. However on October 15, 2009, a Google employee confirmed
that the company had removed PageRank from its Webmaster Tools section, explaining that
"We've been telling people for a long time that they shouldn't focus on PageRank so much.
Many site owners seem to think it's the most important metric for them to track, which is
simply not true." In addition, The PageRank indicator is not available in Google's
own Chrome browser.
GOOGLE ARCHITECTURE
Overview
Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.
Data Structures
Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost.
BigFiles
BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit
integers. The allocation among multiple file systems is handled automatically. The BigFiles
package also handles allocation and deallocation of file descriptors, since the operating
systems do not provide enough for our needs. BigFiles also support rudimentary
compression options.
Repository
Document Index
The document index keeps information about each document. It is a fixed width
ISAM (Index sequential access mode) index, ordered by docID. The information stored in
each entry includes the current document status, a pointer into the repository, a document
checksum, and various statistics. If the document has been crawled, it also contains a
pointer into a variable width file called docinfo which contains its URL and title. Otherwise
the pointer points into the URLlist which contains just the URL. This design decision was
driven by the desire to have a reasonably compact data structure, and the ability to fetch a
record in one disk seek during a search.
Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL
checksums with their corresponding docIDs and is sorted by checksum. In order to find the
docID of a particular URL, the URL's checksum is computed and a binary search is performed
on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing
a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs.
This batch mode of update is crucial because otherwise we must perform one seek for every link, which, assuming one disk, would take more than a month for our 322 million link dataset.
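A rough Python sketch of this URL-to-docID lookup is shown below; the CRC32 checksum, the in-memory list standing in for the on-disk checksums file, and the sample URLs are all assumptions made for illustration.

```python
import bisect
import zlib

def make_checksum_table(url_to_docid):
    """Build a list of (checksum, docID) pairs sorted by checksum."""
    return sorted((zlib.crc32(url.encode()), doc_id)
                  for url, doc_id in url_to_docid.items())

def docid_for_url(table, url):
    """Binary-search the sorted checksum table for a URL's docID."""
    checksum = zlib.crc32(url.encode())
    i = bisect.bisect_left(table, (checksum, -1))
    if i < len(table) and table[i][0] == checksum:
        return table[i][1]
    return None   # URL not yet crawled

table = make_checksum_table({"http://example.com/": 7,
                             "http://example.org/a": 42})
print(docid_for_url(table, "http://example.org/a"))   # 42
```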
Lexicon
The lexicon has several different forms. One important change from earlier systems
is that the lexicon can fit in memory for a reasonable price. In the current implementation
we can keep the lexicon in memory on a machine with 256 MB of main memory. The
current lexicon contains 14 million words (though some rare words were not added to the
lexicon). It is implemented in two parts -- a list of the words (concatenated together but
separated by nulls) and a hash table of pointers. For various functions, the list of words has
some auxiliary information which is beyond the scope of this paper to explain fully.
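A toy sketch of the two-part lexicon layout follows; the word set and the pointer representation are simplified assumptions, not the actual on-disk format.

```python
def build_lexicon(words):
    """Pack words into one null-separated string plus a hash table of offsets."""
    parts = []
    offsets = {}        # word -> offset of the word inside the packed string
    position = 0
    for word in words:
        offsets[word] = position
        parts.append(word)
        position += len(word) + 1          # +1 for the null separator
    blob = "\0".join(parts) + "\0"
    return blob, offsets

def lookup(blob, offsets, word):
    """Return the stored word starting at its offset (or None if absent)."""
    start = offsets.get(word)
    if start is None:
        return None
    end = blob.index("\0", start)
    return blob[start:end]

blob, offsets = build_lexicon(["google", "search", "pagerank"])
print(lookup(blob, offsets, "pagerank"))   # pagerank
```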
Hit Lists
Inverted Index
The inverted index consists of the same barrels as the forward index, except that
they have been processed by the sorter. For every valid wordID, the lexicon contains a
pointer into the barrel that wordID falls into. It points to a doclist of docID's together with
their corresponding hit lists. This doclist represents all the occurrences of that word in all
documents.
An important issue is in what order the docID's should appear in the doclist. One
simple solution is to store them sorted by docID. This allows for quick merging of different
doclists for multiple word queries. Another option is to store them sorted by a ranking of
the occurrence of the word in each document. This makes answering one word queries
trivial and makes it likely that the answers to multiple word queries are near the start.
However, merging is much more difficult. Also, this makes development much more difficult
in that a change to the ranking function requires a rebuild of the index. We chose a
compromise between these options, keeping two sets of inverted barrels -- one set for hit
lists which include title or anchor hits and another set for all hit lists. This way, we check the
first set of barrels first and if there are not enough matches within those barrels we check
the larger ones.
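To illustrate why docID-sorted doclists make multi-word queries easy to merge, the sketch below intersects two sorted doclists in a single linear pass; the docIDs are invented examples.

```python
def intersect(doclist_a, doclist_b):
    """Merge two docID-sorted doclists, keeping docIDs present in both."""
    result = []
    i = j = 0
    while i < len(doclist_a) and j < len(doclist_b):
        if doclist_a[i] == doclist_b[j]:
            result.append(doclist_a[i])
            i += 1
            j += 1
        elif doclist_a[i] < doclist_b[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([2, 5, 9, 14, 21], [5, 7, 14, 30]))   # [5, 14]
```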
Crawling
Running a web crawler is a challenging task. It is virtually impossible to test a crawler without running it on a large part of the Internet: there are invariably hundreds of obscure problems which may occur on only one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior.
Systems which access large parts of the Internet need to be designed to be very robust and
carefully tested. Since large complex systems such as crawlers will invariably cause
problems, there needs to be significant resources devoted to reading the email and solving
these problems as they come up.
Indexing
Parsing -- Any parser which is designed to run on the entire Web must handle a
huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in
the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great
variety of other errors that challenge anyone's imagination to come up with equally creative
ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to
generate a lexical analyzer which we outfit with its own stack. Developing this parser which
runs at a reasonable speed and is very robust involved a fair amount of work.
Sorting -- In order to generate the inverted index, the sorter takes each of the
forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits
and a full text inverted barrel. This process happens one barrel at a time, thus requiring little
temporary storage. Also, we parallelize the sorting phase to use as many machines as we
have simply by running multiple sorters, which can process different buckets at the same
time. Since the barrels don't fit into main memory, the sorter further subdivides them into
baskets which do fit into memory based on wordID and docID. Then the sorter loads each
basket into memory, sorts it and writes its contents into the short inverted barrel and the
full inverted barrel.
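A toy version of this sorting step, with postings written as (wordID, docID, hit list) tuples and barrels as in-memory lists (both simplifications of the real on-disk structures):

```python
from itertools import groupby
from operator import itemgetter

def invert_barrel(forward_barrel):
    """Sort a forward barrel by wordID and group postings into doclists."""
    # forward_barrel: list of (wordID, docID, hit_list) tuples.
    ordered = sorted(forward_barrel, key=itemgetter(0, 1))
    inverted = {}
    for word_id, postings in groupby(ordered, key=itemgetter(0)):
        inverted[word_id] = [(doc_id, hits) for _, doc_id, hits in postings]
    return inverted

forward = [(17, 2, [4, 9]), (3, 2, [1]), (17, 1, [0]), (3, 5, [2, 8])]
print(invert_barrel(forward))
# {3: [(2, [1]), (5, [2, 8])], 17: [(1, [0]), (2, [4, 9])]}
```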
Searching
The goal of searching is to provide quality search results efficiently. Many of the large
commercial search engines seemed to have made great progress in terms of efficiency.
Therefore, we have focused more on quality of search in our research, although we believe
our solutions are scalable to commercial volumes with a bit more effort. The Google query evaluation process is shown in Figure 4.
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top results.

Figure 4. Google query evaluation.

To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.
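A simplified sketch of this evaluation loop is given below; the barrels are modeled as dictionaries from wordID to docID-sorted doclists, the ranking function is a placeholder, and the fallback to the full barrel happens only when the short barrels yield no match (the real system uses a richer threshold).

```python
MAX_MATCHES = 40_000   # response-time limit mentioned above

def evaluate(word_ids, short_barrel, full_barrel, rank):
    """Find documents matching every word, preferring title/anchor barrels."""
    def matches(barrel):
        doclists = [barrel.get(w, []) for w in word_ids]
        if not doclists or any(not d for d in doclists):
            return []
        common = set(doclists[0])
        for doclist in doclists[1:]:
            common &= set(doclist)
        return list(common)[:MAX_MATCHES]

    docs = matches(short_barrel)        # title and anchor hits first
    if not docs:
        docs = matches(full_barrel)     # fall back to the full-text barrel
    return sorted(docs, key=rank, reverse=True)

short = {1: [10, 20], 2: [20, 30]}
full = {1: [10, 20, 40, 50], 2: [20, 30, 50]}
print(evaluate([1, 2], short, full, rank=lambda d: d))   # [20]
```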
The Ranking System
For every matching document, Google counts hits of each type (title, anchor, URL, plain text, and so on) and, for multi-word queries, classifies pairs of hits by their proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights, and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can be displayed with
the search results using a special debug mode. These displays have been very helpful in
developing the ranking system.
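The IR score described above is essentially a dot product. In the sketch below, the hit types, weights, counts, and the cap used to dampen count-weights are invented numbers purely for illustration:

```python
# Illustrative IR score: dot product of count-weights and type-prox-weights.
# The categories and weight values below are made up for the example.
type_prox_weights = {
    ("title", "phrase"): 12.0,
    ("anchor", "phrase"): 10.0,
    ("plain", "close"): 4.0,
    ("plain", "far"): 1.0,
}

def count_weight(count, cap=8):
    """Dampen raw counts so that many repetitions stop helping after a point."""
    return min(count, cap)

def ir_score(hit_counts):
    """hit_counts maps (type, proximity) -> raw hit count for one document."""
    return sum(count_weight(c) * type_prox_weights.get(key, 0.0)
               for key, c in hit_counts.items())

print(ir_score({("title", "phrase"): 1, ("plain", "close"): 3, ("plain", "far"): 20}))
# 12.0 + 12.0 + 8.0 = 32.0
```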
Feedback
The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In
order to do this, we have a user feedback mechanism in the search engine. A trusted user
may optionally evaluate all of the results that are returned. This feedback is saved. Then
when we modify the ranking function, we can see the impact of this change on all previous
searches which were ranked. Although far from perfect, this gives us some idea of how a
change in the ranking function affects the search results.
Bibliography
The following sources were referred to in the making of this project:
1. The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page - http://infolab.stanford.edu/~backrub/google.html
2. PageRank - http://en.wikipedia.org/wiki/PageRank
3. The History of Web Search Engines (INFOGRAPHIC)
- http://www.wordstream.com/blog/ws/2010/08/03/web-search-engines-history
4. Web Search Engine - http://en.wikipedia.org/wiki/Web_search_engine#History
5. W3 Servers - http://www.w3.org/History/19921103hypertext/hypertext/DataSources/WWW/Servers.html
6. Google Architecture - http://highscalability.com/google-architecture
And several other related sites and blogs.