
Self-Study Project

How Google Works


Technology Behind the World's Most Powerful Search Engine

Babita Naagar
2k12/IT/021

Contents
1. INTRODUCTION
2. SEARCH ENGINES
   WORKING
   HISTORY AND DEVELOPMENT
   HOW GOOGLE MADE A BREAKTHROUGH
3. DEVELOPMENT OF GOOGLE
   HISTORY
   PAGERANK ALGORITHM
4. GOOGLE ARCHITECTURE
   OVERVIEW
   DATA STRUCTURES
   CRAWLING
   INDEXING
   SEARCHING
5. BIBLIOGRAPHY

INTRODUCTION
This project explores how search engines have developed over time and examines in
detail how a search engine works.
It then explains how Google, now the world's most powerful search engine, brought
about a major breakthrough in the world of search engines, and has since remained among
the most efficient and accurate search engines available.
It goes on to examine the architecture of the Google search engine and the data
structures and concepts involved in crawling, indexing and searching.

SEARCH ENGINES
A web search engine is a software system that is designed to search for information
on the World Wide Web. The search results are generally presented in a line of results
often referred to as search engine results pages (SERPs). The information may be a
mix of web pages, images and other types of files. Some search engines
also mine data available in databases or open directories. Unlike web directories, which are
maintained only by human editors, search engines also maintain real-time information by
running an algorithm on a web crawler.

WORKING OF SEARCH ENGINES

A search engine operates in the following order:


1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages, which they
retrieve from the HTML markup of the pages. These pages are retrieved by a Web
crawler (sometimes also known as a spider), an automated program which follows
every link on the site. The site owner can exclude specific pages by using robots.txt;
a minimal example is shown after this paragraph.
The search engine then analyzes the contents of each page to determine how it
should be indexed (for example, words can be extracted from the titles, page content,
headings, or special fields called meta tags). Data about web pages are stored in an index
database for use in later queries. A query from a user can be a single word. The index helps
find information relating to the query as quickly as possible. Some search engines, such
as Google, store all or part of the source page (referred to as a cache) as well as information
about the web pages, whereas others, such as AltaVista, store every word of every page
they find. This cached page always holds the actual search text since it is the one that was
actually indexed, so it can be very useful when the content of the current page has been
updated and the search terms are no longer in it. This problem might be considered a mild
form of linkrot, and Google's handling of it increases usability by satisfying user
expectations that the search terms will be on the returned webpage. This satisfies
the principle of least astonishment, since the user normally expects that the search terms
will be on the returned pages. Increased search relevance makes these cached pages very
useful as they may contain data that may no longer be available elsewhere.
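For illustration, a robots.txt file is a plain-text file placed at the root of a site; the paths and the specific crawler name below are hypothetical examples, not taken from any real site:

    # Hypothetical robots.txt served at http://example.com/robots.txt
    User-agent: *
    Disallow: /private/
    Disallow: /tmp/draft.html

    # Rules that apply only to one specific crawler
    User-agent: Googlebot
    Disallow: /no-crawl/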

Figure: High-level architecture of a standard Web crawler


When a user enters a query into a search engine (typically by using keywords), the
engine examines its index and provides a listing of best-matching web pages according to its
criteria, usually with a short summary containing the document's title and sometimes parts
of the text. The index is built from the information stored with the data and the method by
which the information is indexed. From 2007 the Google.com search engine has allowed
one to search by date by clicking "Show search tools" in the leftmost column of the initial
search results page, and then selecting the desired date range. Most search engines support
the use of the boolean operators AND, OR and NOT to further specify the search query.
Boolean operators are for literal searches that allow the user to refine and extend the terms
of the search. The engine looks for the words or phrases exactly as entered. Some search
engines provide an advanced feature called proximity search, which allows users to define
the distance between keywords. There is also concept-based searching where the research
involves using statistical analysis on pages containing the words or phrases you search for.
As well, natural language queries allow the user to type a question in the same form one
would ask it of a human. An example of such a site is ask.com.
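As a toy illustration of how the boolean operators described above map onto an index, the following Python sketch evaluates AND, OR and NOT against a tiny index mapping each word to the set of documents that contain it; the words and document IDs are made-up examples, not any real engine's data:

    # Toy index: word -> set of document IDs (hypothetical data)
    inverted_index = {
        "search":  {1, 2, 3},
        "engine":  {1, 3},
        "history": {2, 4},
    }

    def postings(word):
        """Return the set of documents containing the word."""
        return inverted_index.get(word, set())

    # "search AND engine" -> intersection of the two sets
    print(postings("search") & postings("engine"))    # {1, 3}
    # "search OR history" -> union
    print(postings("search") | postings("history"))   # {1, 2, 3, 4}
    # "search NOT history" -> set difference
    print(postings("search") - postings("history"))   # {1, 3}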
The usefulness of a search engine depends on the relevance of the result set it gives
back. While there may be millions of web pages that include a particular word or phrase,
some pages may be more relevant, popular, or authoritative than others. Most search
engines employ methods to rank the results to provide the "best" results first. How a search
engine decides which pages are the best matches, and what order the results should be
shown in, varies widely from one engine to another. The methods also change over time as
Internet usage changes and new techniques evolve. There are two main types of search
engine that have evolved: one is a system of predefined and hierarchically ordered
keywords that humans have programmed extensively. The other is a system that generates
an "inverted index" by analyzing texts it locates. This second form relies much more heavily on
the computer itself to do the bulk of the work.
Most Web search engines are commercial ventures supported
by advertising revenue and thus some of them allow advertisers to have their listings ranked
higher in search results for a fee. Search engines that do not accept money for their search
results make money by running search related ads alongside the regular search engine
results. The search engines make money every time someone clicks on one of these ads.

HISTORY AND DEVELOPMENT


During early development of the web, there was a list of webservers edited by Tim
Berners-Lee and hosted on the CERN webserver. One historical snapshot of the list in 1992
remains, but as more and more webservers went online the central list could no longer keep
up. On the NCSA site, new servers were announced under the title "What's New!"
The very first tool used for searching on the Internet was Archie. The name stands for
"archive" without the "v". It was created in 1990 by Alan Emtage, Bill Heelan and J. Peter
Deutsch, computer science students at McGill University in Montreal. The program
downloaded the directory listings of all the files located on public anonymous FTP (File
Transfer Protocol) sites, creating a searchable database of file names; however, Archie did
not index the contents of these sites since the amount of data was so limited it could be
readily searched manually.
The rise of Gopher (created in 1991 by Mark McCahill at the University of Minnesota)
led to two new search programs, Veronica and Jughead. Like Archie, they searched the file
names and titles stored in Gopher index systems. Veronica (Very Easy Rodent-Oriented Net-wide Index to Computerized Archives) provided a keyword search of most Gopher menu
titles in the entire Gopher listings.
Jughead (Jonzy's Universal Gopher Hierarchy Excavation And Display) was a tool for
obtaining menu information from specific Gopher servers. While the name of the search
engine "Archie" was not a reference to the Archie comic book series, "Veronica" and
"Jughead" are characters in the series, thus referencing their predecessor.
In the summer of 1993, no search engine existed for the web, though numerous
specialized catalogues were maintained by hand. Oscar Nierstrasz at the University of
Geneva wrote a series of Perl scripts that periodically mirrored these pages and rewrote them
into a standard format. This formed the basis for W3Catalog, the web's first primitive search
engine, released on September 2, 1993.
In June 1993, Matthew Gray, then at MIT, produced what was probably the first web
robot, the Perl-based World Wide Web Wanderer, and used it to generate an index called
'Wandex'. The purpose of the Wanderer was to measure the size of the World Wide Web,
which it did until late 1995. The web's second search engine Aliweb appeared in November
1993. Aliweb did not use a web robot, but instead depended on being notified by website
administrators of the existence at each site of an index file in a particular format.
JumpStation (created in December 1993 by Jonathon Fletcher) used a web robot to
find web pages and to build its index, and used a web form as the interface to its query
program. It was thus the first WWW resource-discovery tool to combine the three essential
features of a web search engine (crawling, indexing, and searching) as described below.
Because of the limited resources available on the platform it ran on, its indexing and hence
searching were limited to the titles and headings found in the web pages the crawler
encountered.
One of the first "all text" crawler-based search engines was WebCrawler, which came
out in 1994. Unlike its predecessors, it allowed users to search for any word in any webpage,
which has become the standard for all major search engines since. It was also the first one
widely known by the public. Also in 1994, Lycos (which started at Carnegie Mellon
University) was launched and became a major commercial endeavour.

Soon after, many search engines appeared and vied for popularity. These
included Magellan, Excite, Infoseek, Inktomi, Northern Light, and AltaVista. Yahoo! was
among the most popular ways for people to find web pages of interest, but its search
function operated on its web directory, rather than its full-text copies of web pages.
Information seekers could also browse the directory instead of doing a keyword-based
search.
Google adopted the idea of selling search terms in 1998, from a small search engine
company named goto.com. This move had a significant effect on the search engine business,
which went from struggling to being one of the most profitable businesses on the Internet.
In 1996, Netscape was looking to give a single search engine an exclusive deal as the
featured search engine on Netscape's web browser. There was so much interest that
instead Netscape struck deals with five of the major search engines: for $5 million a year,
each search engine would be in rotation on the Netscape search engine page. The five
engines were Yahoo!, Magellan, Lycos, Infoseek, and Excite.
Search engines were also known as some of the brightest stars in the Internet
investing frenzy that occurred in the late 1990s. Several companies entered the market
spectacularly, receiving record gains during their initial public offerings. Some have taken
down their public search engine, and are marketing enterprise-only editions, such as
Northern Light. Many search engine companies were caught up in the dot-com bubble, a
speculation-driven market boom that peaked in 1999 and ended in 2001.
Around 2000, Google's search engine rose to prominence. The company achieved
better results for many searches with an innovation called PageRank, as was explained
in Anatomy of a Search Engine. This iterative algorithm ranks web pages based on the
number and PageRank of other web sites and pages that link there, on the premise that
good or desirable pages are linked to more than others. Google also maintained a minimalist
interface to its search engine. In contrast, many of its competitors embedded a search
engine in a web portal. In fact, the Google search engine became so popular that spoof engines
emerged, such as Mystery Seeker.
By 2000, Yahoo! was providing search services based on Inktomi's search engine.
Yahoo! acquired Inktomi in 2002, and Overture (which owned AlltheWeb and AltaVista) in
2003. Yahoo! used Google's search engine until 2004, when it launched its own
search engine based on the combined technologies of its acquisitions.
Microsoft first launched MSN Search in the fall of 1998 using search results from
Inktomi. In early 1999 the site began to display listings from Looksmart, blended with results
from Inktomi. For a short time in 1999, MSN Search used results from AltaVista
instead. In 2004, Microsoft began a transition to its own search technology, powered by its
own web crawler (called msnbot).

Microsoft's rebranded search engine, Bing, was launched on June 1, 2009. On July 29,
2009, Yahoo! and Microsoft finalized a deal in which Yahoo! Search would be powered by
Microsoft Bing technology.
In 2012, following the April 24 release of Google Drive, Google released
the Beta version of Open Drive (available as a Chrome app) to enable the search of files
in the cloud. Open Drive has now been rebranded as Cloud Kite and utilizes Max Skuse's
search form algorithm. Cloud Kite is advertised as a "collective encyclopedia project based
on Google Drive public files and on the crowd sharing, crowd sourcing and crowd-solving
principles". Cloud Kite will also return search results from other cloud storage content
services including Dropbox, SkyDrive, Evernote and Box.

HOW GOOGLE MADE A BREAKTHROUGH


Google began in January 1996 as a research project by Larry Page and Sergey
Brin when they were both PhD students at Stanford University in Stanford, California.
While conventional search engines ranked results by counting how many times the
search terms appeared on the page, the two theorized about a better system that analyzed
the relationships between websites. They called this new technology PageRank; it
determined a website's relevance by the number of pages, and the importance of those
pages, that linked back to the original site.
A small search engine called "RankDex" from IDD Information Services designed
by Robin Li was, since 1996, already exploring a similar strategy for site-scoring and page
ranking. The technology in RankDex would be patented and used later when Li
founded Baidu in China.
Page and Brin originally nicknamed their new search engine "BackRub", because the
system checked backlinks to estimate the importance of a site. Eventually, they changed the
name to Google, originating from a misspelling of the word "googol", the number one
followed by one hundred zeros, which was picked to signify that the search engine was
intended to provide large quantities of information. Originally, Google ran under Stanford
University's website, with the domains google.stanford.edu and z.stanford.edu.
The domain name for Google was registered on September 15, 1997 and the
company was incorporated on September 4, 1998. It was based in the garage of a friend
(Susan Wojcicki) in Menlo Park, California. Craig Silverstein, a fellow PhD student at
Stanford, was hired as the first employee.
In May 2011, the number of monthly unique visitors to Google surpassed one billion
for the first time, an 8.4 percent increase from May 2010 (931 million). In January 2013,
Google announced it had earned US$50 billion in annual revenue for the year 2012. This
marked the first time the company had reached this feat, topping their 2011 total of
US$38 billion.

DEVELOPMENT OF GOOGLE
HISTORY
Figure: Google in 1998

Beginning
Google began in March 1996 as a research project by Larry Page and Sergey Brin,
Ph.D. students at Stanford University.
In search of a dissertation theme, Page had been considering, among other things,
exploring the mathematical properties of the World Wide Web, understanding its link
structure as a huge graph. His supervisor, Terry Winograd, encouraged him to pick this idea
(which Page later recalled as "the best advice I ever got") and Page focused on the problem
of finding out which web pages link to a given page, based on the consideration that the
number and nature of such backlinks was valuable information for an analysis of that page
(with the role of citations in academic publishing in mind).
In his research project, nicknamed "BackRub", Page was soon joined by Brin, who
was supported by a National Science Foundation Graduate Fellowship. Brin was already a
close friend, whom Page had first met in the summer of 1995, when Page was part of a group of
potential new students that Brin had volunteered to show around the campus. Both Brin
and Page were working on the Stanford Digital Library Project (SDLP). The SDLP's goal was
to "develop the enabling technologies for a single, integrated and universal digital library"
and it was funded through the National Science Foundation, among other federal agencies.
Page's web crawler began exploring the web in March 1996, with Page's own
Stanford home page serving as the only starting point. To convert the backlink data that it
gathered for a given web page into a measure of importance, Brin and Page developed
the PageRank algorithm. While analyzing BackRub's output (which, for a given URL,
consisted of a list of backlinks ranked by importance), the pair realized that a search engine
based on PageRank would produce better results than existing techniques (existing search
engines at the time essentially ranked results according to how many times the search term
appeared on a page).
A small search engine called "RankDex" from IDD Information Services (a subsidiary
of Dow Jones) designed by Robin Li was, since 1996, already exploring a similar strategy for
site-scoring and page ranking. The technology in RankDex would be patented and used later
when Li founded Baidu in China.
Convinced that the pages with the most links to them from other highly relevant
Web pages must be the most relevant pages associated with the search, Page and Brin
tested their thesis as part of their studies, and laid the foundation for their search engine.
By early 1997, the BackRub page described the state as follows:
Some Rough Statistics (from August 29th, 1996)
Total indexable HTML urls: 75.2306 Million
Total content downloaded: 207.022 gigabytes
...
BackRub is written in Java and Python and runs on several Sun Ultras and Intel
Pentiums running Linux. The primary database is kept on a Sun Ultra II with 28GB of disk.
Scott Hassan and Alan Steremberg have provided a great deal of very talented
implementation help. Sergey Brin has also been very involved and deserves many thanks.
-Larry Page, page@cs.stanford.edu
Late 1990s
Originally the search engine used the Stanford website with the
domain google.stanford.edu. The domain google.com was registered on September 15, 1997.
They formally incorporated their company, Google, on September 4, 1998, in a friend's
garage in Menlo Park, California.
The first patent filed under the name "Google Inc." was filed on August 31, 1999. This
patent, filed by Siu-Leong Iu, Malcom Davis, Hui Luo, Yun-Ting Lin, Guillaume Mercier, and
Kobad Bugwadia, is titled "Watermarking system and methodology for digital multimedia
content" and is the earliest patent filing under the assignee name "Google Inc."
Both Brin and Page had been against using advertising pop-ups in a search engine, or
an "advertising funded search engines" model, and they wrote a research paper in 1998 on
the topic while still students. They changed their minds early on and allowed simple text
ads.
By the end of 1998, Google had an index of about 60 million pages. The home page
was still marked "BETA", but an article in Salon.com already argued that Google's search
results were better than those of competitors like HotBot or Excite.com, and praised it for
being more technologically innovative than the overloaded portal sites (like Yahoo!,
Excite.com, Lycos, Netscape's Netcenter, AOL.com, Go.com and MSN.com) which at that
time, during the growing dot-com bubble, were seen as "the future of the Web", especially
by stock market investors.
In March 1999, the company moved into offices at 165 University Avenue in Palo
Alto, home to several other noted Silicon Valley technology startups. After quickly
outgrowing two other sites, the company leased a complex of buildings in Mountain View at
1600 Amphitheatre Parkway from Silicon Graphics (SGI) in 2003. The company has remained
at this location ever since, and the complex has since become known as the Googleplex (a
play on the word googolplex, a number that is equal to 1 followed by a googol of zeros). In
2006, Google bought the property from SGI for US$319 million.
2000s
The Google search engine attracted a loyal following among the growing number of
Internet users, who liked its simple design. In 2000, Google began
selling advertisements associated with search keywords. The ads were text-based to
maintain an uncluttered page design and to maximize page loading speed. Keywords were
sold based on a combination of price bid and click-throughs, with bidding starting at $0.05
per click. This model of selling keyword advertising was pioneered by Goto.com (later
renamed Overture Services, before being acquired by Yahoo! and rebranded as Yahoo!
Search Marketing). While many of its dot-com rivals failed in the new Internet marketplace,
Google quietly rose in stature while generating revenue.
Google's declared code of conduct is "Don't be evil", a phrase which they went so far
as to include in their prospectus (aka "S-1") for their 2004 IPO, noting that "We believe
strongly that in the long term, we will be better served, as shareholders and in all other
ways, by a company that does good things for the world even if we forgo some short term
gains."

THE PAGERANK ALGORITHM


PageRank works by counting the number and quality of links to a page to determine
a rough estimate of how important the website is. The underlying assumption is that more
important websites are likely to receive more links from other websites.
It is not the only algorithm used by Google to order search engine results, but it is the
first algorithm that was used by the company, and it is the best-known. Google uses an
automated web spider called Googlebot to actually count links and gather other information
on web pages.
PageRank is a link analysis algorithm and it assigns a numerical weighting to each
element of a hyperlinked set of documents, such as the World Wide Web, with the purpose
of "measuring" its relative importance within the set. The algorithm may be applied to any
collection of entities with reciprocal quotations and references. The numerical weight that it
assigns to any given element E is referred to as the PageRank of E and denoted by PR(E).
Other factors like Author Rank can contribute to the importance of an entity.
A PageRank results from a mathematical algorithm based on the webgraph, created
by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration
authority hubs such as cnn.com or usa.gov. The rank value indicates the importance of a
particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is
defined recursively and depends on the number and PageRank metric of all pages that link
to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a
high rank itself.
Numerous academic papers concerning PageRank have been published since Page
and Brin's original paper. In practice, the PageRank concept may be vulnerable to
manipulation. Research has been conducted into identifying falsely influenced PageRank
rankings. The goal is to find an effective means of ignoring links from documents with falsely
influenced PageRank.
Other link-based ranking algorithms for Web pages include the HITS
algorithm invented by Jon Kleinberg (used by Teoma and now Ask.com), the IBM CLEVER
project, the TrustRank algorithm and the Hummingbird algorithm.

The idea of formulating a link analysis problem as an eigenvalue problem was
probably first suggested in 1976 by Gabriel Pinski and Francis Narin, who worked
on scientometrics ranking scientific journals. PageRank was developed at Stanford
University by Larry Page and Sergey Brin in 1996 as part of a research project about a new
kind of search engine. Sergey Brin had the idea that information on the web could be
ordered in a hierarchy by "link popularity": a page is ranked higher as there are more links to
it. It was co-authored by Rajeev Motwani and Terry Winograd. The first paper about the
project, describing PageRank and the initial prototype of the Google search engine, was
published in 1998; shortly after, Page and Brin founded Google Inc., the company behind the
Google search engine. While just one of many factors that determine the ranking of Google
search results, PageRank continues to provide the basis for all of Google's web search tools.
The name "PageRank" plays off of the name of developer Larry Page, as well as the
concept of a web page. The word is a trademark of Google, and the PageRank process has
been patented (U.S. Patent 6,285,999). However, the patent is assigned to Stanford
University and not to Google. Google has exclusive license rights on the patent from
Stanford University. The university received 1.8 million shares of Google in exchange for use
of the patent; the shares were sold in 2005 for $336 million.
PageRank was influenced by citation analysis, developed early on by Eugene Garfield in
the 1950s at the University of Pennsylvania, and by Hyper Search, developed by Massimo
Marchiori at the University of Padua. In the same year PageRank was introduced (1998), Jon
Kleinberg published his important work on HITS. Google's founders cite Garfield, Marchiori,
and Kleinberg in their original papers.
A small search engine called "RankDex" from IDD Information Services designed
by Robin Li was, since 1996, already exploring a similar strategy for site-scoring and page
ranking. The technology in RankDex would be patented by 1999 and used later when Li
founded Baidu in China. Li's work would be referenced by some of Larry Page's U.S. patents
for his Google search methods.
Algorithm

PageRank is a probability distribution used to represent the likelihood that a person


randomly clicking on links will arrive at any particular page. PageRank can be calculated for
collections of documents of any size. It is assumed in several research papers that the
distribution is evenly divided among all documents in the collection at the beginning of the
computational process. The PageRank computations require several passes, called
"iterations", through the collection to adjust approximate PageRank values to more closely
reflect the theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is
commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5
means there is a 50% chance that a person clicking on a random link will be directed to the
document with the 0.5 PageRank.
Simplified algorithm

Assume a small universe of four web pages: A, B, C and D. Links from a page to itself,
or multiple outbound links from one single page to another single page, are ignored.
PageRank is initialized to the same value for all pages. In the original form of PageRank, the
sum of PageRank over all pages was the total number of pages on the web at that time, so
each page in this example would have an initial PageRank of 1. However, later versions of
PageRank, and the remainder of this section, assume a probability distribution between 0
and 1. Hence the initial value for each page is 0.25.
The PageRank transferred from a given page to the targets of its outbound links upon
the next iteration is divided equally among all outbound links.
If the only links in the system were from pages B, C, and D to A, each link would
transfer 0.25 PageRank to A upon the next iteration, for a total of 0.75.
Suppose instead that page B had a link to pages C and A, page C had a link to page A,
and page D had links to all three pages. Thus, upon the next iteration, page B would transfer
half of its existing value, or 0.125, to page A and the other half, or 0.125, to page C.
Page C would transfer all of its existing value, 0.25, to the only page it links to, A.
Since D had three outbound links, it would transfer one third of its existing value, or
approximately 0.083, to A. At the completion of this iteration, page A will have a PageRank
of 0.458.
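The first iteration of this worked example can be checked with a short sketch; this is a toy illustration of the simplified (undamped) update described above, not Google's implementation. The link structure is the one given in the paragraph: B links to A and C, C links to A, and D links to A, B and C.

    # Each page starts at 0.25; every outbound link transfers an equal share.
    links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
    pr = {page: 0.25 for page in links}

    def iterate(pr, links):
        new = {page: 0.0 for page in pr}
        for page, outlinks in links.items():
            for target in outlinks:
                new[target] += pr[page] / len(outlinks)
        return new

    pr = iterate(pr, links)
    print(round(pr["A"], 3))   # 0.458, as computed in the text above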

In other words, the PageRank conferred by an outbound link is equal to the
document's own PageRank score divided by the number of its outbound links, L. In the
example above:

PR(A) = PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D)

In the general case, the PageRank value for any page u can be expressed as:

PR(u) = Σ (over v in B_u) PR(v) / L(v)

i.e. the PageRank value for a page u is dependent on the PageRank values for each
page v contained in the set B_u (the set containing all pages linking to page u), divided by the
number L(v) of links from page v.
Damping factor

The PageRank theory holds that an imaginary surfer who is randomly clicking on links
will eventually stop clicking. The probability, at any step, that the person will continue is a
damping factor d. Various studies have tested different damping factors, but it is generally
assumed that the damping factor will be set around 0.85.
The damping factor is subtracted from 1 (and in some variations of the algorithm, the
result is divided by the number of documents (N) in the collection) and this term is then
added to the product of the damping factor and the sum of the incoming PageRank scores.
That is,

PR(A) = (1 - d)/N + d · (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...)

So any page's PageRank is derived in large part from the PageRanks of other pages.
The damping factor adjusts the derived value downward. The original paper, however, gave
the following formula, which has led to some confusion:

PR(A) = (1 - d) + d · (PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + ...)

The difference between them is that the PageRank values in the first formula sum to
one, while in the second formula each PageRank is multiplied by N and the sum becomes N.
A statement in Page and Brin's paper that "the sum of all PageRanks is one" and claims by
other Google employees support the first variant of the formula above.
Page and Brin confused the two formulas in their most popular paper "The Anatomy
of a Large-Scale Hypertextual Web Search Engine", where they mistakenly claimed that the
latter formula formed a probability distribution over web pages.
Google recalculates PageRank scores each time it crawls the Web and rebuilds its
index. As Google increases the number of documents in its collection, the initial
approximation of PageRank decreases for all documents.

The formula uses a model of a random surfer who gets bored after several clicks and
switches to a random page. The PageRank value of a page reflects the chance that the
random surfer will land on that page by clicking on a link. It can be understood as a Markov
chain in which the states are pages, and the transitions, which are all equally probable, are
the links between pages.
If a page has no links to other pages, it becomes a sink and therefore terminates the
random surfing process. If the random surfer arrives at a sink page, it picks another URL at
random and continues surfing again.
When calculating PageRank, pages with no outbound links are assumed to link out to
all other pages in the collection. Their PageRank scores are therefore divided evenly among
all other pages. In other words, to be fair with pages that are not sinks, these random
transitions are added to all nodes in the Web, with a residual probability usually set to d =
0.85, estimated from the frequency that an average surfer uses his or her browser's
bookmark feature.
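The damped iteration and the sink-page handling just described can be sketched as follows; this is an illustrative Python toy (uniform starting values, a fixed number of iterations, made-up link data), not Google's code:

    # Damped PageRank whose values sum to (approximately) one. A sink page
    # (no outbound links) spreads its rank evenly over all other pages.
    def pagerank(links, d=0.85, iterations=50):
        pages = list(links)
        n = len(pages)
        pr = {p: 1.0 / n for p in pages}          # uniform starting distribution
        for _ in range(iterations):
            new = {p: (1.0 - d) / n for p in pages}
            for page, outlinks in links.items():
                if outlinks:                       # normal page: split rank evenly
                    share = pr[page] / len(outlinks)
                    for target in outlinks:
                        new[target] += d * share
                else:                              # sink: divide among all other pages
                    others = [q for q in pages if q != page]
                    for target in others:
                        new[target] += d * pr[page] / len(others)
            pr = new
        return pr

    links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
    print(pagerank(links))   # the values sum to roughly 1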
Other uses

Personalized PageRank is used by Twitter to present users with other accounts they
may wish to follow.
A version of PageRank has recently been proposed as a replacement for the
traditional Institute for Scientific Information (ISI) impact factor, and implemented
at Eigenfactor as well as at SCImago. Instead of merely counting total citations to a journal,
the "importance" of each citation is determined in a PageRank fashion.
A similar new use of PageRank is to rank academic doctoral programs based on their
records of placing their graduates in faculty positions. In PageRank terms, academic
departments link to each other by hiring their faculty from each other (and from
themselves).
PageRank has been used to rank spaces or streets to predict how many people
(pedestrians or vehicles) come to the individual spaces or streets. In lexical semantics it has
been used to perform Word Sense Disambiguation and to automatically
rank WordNet synsets according to how strongly they possess a given semantic property,
such as positivity or negativity.
A Web crawler may use PageRank as one of a number of importance metrics it uses
to determine which URL to visit during a crawl of the web. One of the early working
papers that were used in the creation of Google is Efficient crawling through URL
ordering, which discusses the use of a number of different importance metrics to determine
how deeply, and how much of a site Google will crawl. PageRank is presented as one of a
number of these importance metrics, though there are others listed such as the number of
inbound and outbound links for a URL, and the distance from the root directory on a site to
the URL.
The PageRank may also be used as a methodology to measure the apparent impact
of a community like the Blogosphere on the overall Web itself. This approach therefore uses
PageRank to measure the distribution of attention in reflection of the Scale-free
network paradigm.
In any ecosystem, a modified version of PageRank may be used to determine species
that are essential to the continuing health of the environment.
PageRank is also a useful tool for the analysis of protein networks in biology.
nofollow
In early 2005, Google implemented a new value, "nofollow", for the rel attribute of
HTML link and anchor elements, so that website developers and bloggers can make links
that Google will not consider for the purposes of PageRank; they are links that no longer
constitute a "vote" in the PageRank system. The nofollow relationship was added in an
attempt to help combat spamdexing.
As an example, people could previously create many message-board posts with links
to their website to artificially inflate their PageRank. With the nofollow value, message-board administrators can modify their code to automatically insert "rel='nofollow'" into all
hyperlinks in posts, thus preventing PageRank from being affected by those particular posts.
This method of avoidance, however, also has various drawbacks, such as reducing the link
value of legitimate comments. (See: Spam in blogs#nofollow)
In an effort to manually control the flow of PageRank among pages within a website,
many webmasters practice what is known as PageRank Sculpting, which is the act of
strategically placing the nofollow attribute on certain internal links of a website in order to
funnel PageRank towards those pages the webmaster deemed most important. This tactic
has been used since the inception of the nofollow attribute, but may no longer be effective
since Google announced that blocking PageRank transfer with nofollow does not redirect
that PageRank to other links.
Deprecation

PageRank was once available for the verified site maintainers through the Google
Webmaster Tools interface. However, on October 15, 2009, a Google employee confirmed
that the company had removed PageRank from its Webmaster Tools section, explaining that
"We've been telling people for a long time that they shouldn't focus on PageRank so much.
Many site owners seem to think it's the most important metric for them to track, which is
simply not true." In addition, the PageRank indicator is not available in Google's
own Chrome browser.

The visible page rank is updated very infrequently.


On 6 October 2011, many users mistakenly thought Google PageRank was gone. As it
turns out, it was simply an update to the URL used to query the PageRank from Google.
PageRank is now one of 200 ranking factors that Google uses to determine a page's
popularity. Google Panda is one of the other strategies Google now relies on to rank the
popularity of pages. Even though PageRank is no longer directly important for SEO purposes,
the existence of back-links from more popular websites continues to push a webpage higher
up in search rankings.

GOOGLE ARCHITECTURE
Overview

Most of Google is implemented in C or C++ for efficiency and can run on either Solaris
or Linux.

In Google, the web crawling (downloading of web pages) is done by several
distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the
crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver
then compresses and stores the web pages into a repository. Every web page has an
associated ID number called a docID which is assigned whenever a new URL is parsed out of
a web page. The indexing function is performed by the indexer and the sorter. The indexer
performs a number of functions. It reads the repository, uncompresses the documents, and
parses them. Each document is converted into a set of word occurrences called hits. The hits
record the word, position in document, an approximation of font size, and capitalization.
The indexer distributes these hits into a set of "barrels", creating a partially sorted forward
index. The indexer performs another important function. It parses out all the links in every
web page and stores important information about them in an anchors file. This file contains
enough information to determine where each link points from and to, and the text of the
link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs
and in turn into docIDs. It puts the anchor text into the forward index, associated with the
docID that the anchor points to. It also generates a database of links which are pairs of
docIDs. The links database is used to compute PageRanks for all the documents.
The sorter takes the barrels, which are sorted by docID, and re-sorts them by wordID
to generate the inverted index. This is done in place so that little temporary space is needed
for this operation. The sorter also produces a list of wordIDs and offsets into the inverted
index. A program called DumpLexicon takes this list together with the lexicon produced by
the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a
web server and uses the lexicon built by DumpLexicon together with the inverted index and
the PageRanks to answer queries.

Major Data Structures


Google's data structures are optimized so that a large document collection can be
crawled, indexed, and searched with little cost. Although CPUs and bulk input/output rates
have improved dramatically over the years, a disk seek still requires about 10 ms to
complete. Google is designed to avoid disk seeks whenever possible, and this has had a
considerable influence on the design of the data structures.
BigFiles

BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit
integers. The allocation among multiple file systems is handled automatically. The BigFiles
package also handles allocation and deallocation of file descriptors, since the operating
systems do not provide enough for our needs. BigFiles also support rudimentary
compression options.
Repository

The repository contains the full HTML of every web page. Each page is compressed
using zlib (see RFC 1950). The choice of compression technique is a tradeoff between speed
and compression ratio. We chose zlib's speed over a significant improvement in compression
offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as
compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after
the other and are prefixed by docID, length, and URL, as can be seen in Figure 2. The repository
requires no other data structures to be used in order to access it. This helps with data
consistency and makes development much easier; we can rebuild all the other data
structures from only the repository and a file which lists crawler errors.

Figure 2. Repository Data Structure
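A minimal sketch of such a record layout, using Python's zlib and struct modules, is shown below; the exact field widths in the header are assumptions for illustration, not Google's on-disk format:

    import struct, zlib

    def pack_record(doc_id, url, html):
        compressed = zlib.compress(html.encode("utf-8"))
        url_bytes = url.encode("utf-8")
        # header: docID (8 bytes), URL length (2 bytes), compressed length (4 bytes)
        header = struct.pack(">QHI", doc_id, len(url_bytes), len(compressed))
        return header + url_bytes + compressed

    def unpack_record(buf):
        doc_id, url_len, comp_len = struct.unpack_from(">QHI", buf, 0)
        offset = struct.calcsize(">QHI")
        url = buf[offset:offset + url_len].decode("utf-8")
        html = zlib.decompress(buf[offset + url_len:offset + url_len + comp_len])
        return doc_id, url, html.decode("utf-8")

    record = pack_record(42, "http://example.com/", "<html>hello</html>")
    print(unpack_record(record))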

Document Index

The document index keeps information about each document. It is a fixed width
ISAM (Index sequential access mode) index, ordered by docID. The information stored in
each entry includes the current document status, a pointer into the repository, a document
checksum, and various statistics. If the document has been crawled, it also contains a
pointer into a variable width file called docinfo which contains its URL and title. Otherwise
the pointer points into the URLlist which contains just the URL. This design decision was
driven by the desire to have a reasonably compact data structure, and the ability to fetch a
record in one disk seek during a search.
Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL
checksums with their corresponding docIDs and is sorted by checksum. In order to find the
docID of a particular URL, the URL's checksum is computed and a binary search is performed
on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing
a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs.
This batch mode of update is crucial because otherwise we must perform one seek for every
link which assuming one disk would take more than a month for our 322 million link dataset.
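The checksum lookup described above can be sketched as follows; the checksum function here (crc32) and the example URLs are illustrative stand-ins, not the actual scheme used:

    import bisect, zlib

    def checksum(url):
        return zlib.crc32(url.encode("utf-8"))

    # sorted list of (checksum, docID) pairs, analogous to the checksums file
    table = sorted([(checksum("http://a.example/"), 1),
                    (checksum("http://b.example/"), 2),
                    (checksum("http://c.example/"), 3)])
    keys = [c for c, _ in table]

    def url_to_docid(url):
        c = checksum(url)
        i = bisect.bisect_left(keys, c)      # binary search on the checksums
        if i < len(keys) and keys[i] == c:
            return table[i][1]
        return None                          # URL not yet seen

    print(url_to_docid("http://b.example/"))   # 2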
Lexicon

The lexicon has several different forms. One important change from earlier systems
is that the lexicon can fit in memory for a reasonable price. In the current implementation
we can keep the lexicon in memory on a machine with 256 MB of main memory. The
current lexicon contains 14 million words (though some rare words were not added to the
lexicon). It is implemented in two parts -- a list of the words (concatenated together but
separated by nulls) and a hash table of pointers. For various functions, the list of words has
some auxiliary information which is beyond the scope of this paper to explain fully.
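The two-part layout described above (a null-separated word list plus a hash table of pointers) can be illustrated with a short sketch; the words are hypothetical and the structure is a toy, not Google's lexicon code:

    words = ["google", "search", "engine"]
    word_list = "\0".join(words) + "\0"        # null-separated word list
    offsets = {}
    pos = 0
    for w in words:
        offsets[w] = pos                       # hash table of pointers (offsets)
        pos += len(w) + 1                      # skip the word and its null byte

    def lookup(word):
        start = offsets[word]
        end = word_list.index("\0", start)
        return word_list[start:end]

    print(lookup("search"))   # "search"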
Hit Lists

A hit list corresponds to a list of occurrences of a particular word in a particular
document, including position, font, and capitalization information. Hit lists account for most
of the space used in both the forward and the inverted indices. Because of this, it is
important to represent them as efficiently as possible. We considered several alternatives
for encoding position, font, and capitalization -- simple encoding (a triple of integers), a
compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we
chose a hand optimized compact encoding since it required far less space than the simple
encoding and far less bit manipulation than Huffman coding. The details of the hits are
shown in Figure 3.
Our compact encoding uses two bytes for every hit. There are two types of hits:
fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta
tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size,
and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096).
Font size is represented relative to the rest of the document using three bits (only 7 values
are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a
capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type
of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits
for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us
some limited phrase searching as long as there are not that many anchors for a particular
word. We expect to update the way that anchor hits are stored to allow for greater
resolution in the position and docID hash fields. We use font size relative to the rest of the
document because when searching, you do not want to rank otherwise identical documents
differently just because one of the documents is in a larger font.
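A plain hit of this kind can be packed into two bytes as sketched below; the exact bit ordering is an assumption for illustration, and positions above 4095 are simply clamped here:

    def encode_plain_hit(capitalized, font_size, position):
        assert 0 <= font_size <= 6              # font size 7 is reserved for fancy hits
        position = min(position, 4095)          # only 12 bits of position are kept
        return (int(capitalized) << 15) | (font_size << 12) | position

    def decode_plain_hit(hit):
        return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

    hit = encode_plain_hit(True, 3, 120)
    print(hit.to_bytes(2, "big"), decode_plain_hit(hit))   # two bytes per hit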

The length of a hit list is stored before the hits themselves. To save space, the length
of the hit list is combined with the wordID in the forward index and the docID in the
inverted index. This limits it to 8 and 5 bits respectively (there are some tricks which allow
8 bits to be borrowed from the wordID). If the length is longer than would fit in that many
bits, an escape code is used in those bits, and the next two bytes contain the actual length.
Forward Index

Figure 3. Forward and Reverse Indexes and the Lexicon

The forward index is actually already partially sorted. It is stored in a number of
barrels (we used 64). Each barrel holds a range of wordID's. If a document contains words
that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of
wordID's with hitlists which correspond to those words. This scheme requires slightly more
storage because of duplicated docIDs, but the difference is very small for a reasonable
number of buckets and saves considerable time and coding complexity in the final indexing
phase done by the sorter. Furthermore, instead of storing actual wordID's, we store each
wordID as a relative difference from the minimum wordID that falls into the barrel the
wordID is in. This way, we can use just 24 bits for the wordID's in the unsorted barrels,
leaving 8 bits for the hit list length.
Inverted Index

The inverted index consists of the same barrels as the forward index, except that
they have been processed by the sorter. For every valid wordID, the lexicon contains a
pointer into the barrel that wordID falls into. It points to a doclist of docID's together with
their corresponding hit lists. This doclist represents all the occurrences of that word in all
documents.
An important issue is in what order the docID's should appear in the doclist. One
simple solution is to store them sorted by docID. This allows for quick merging of different
doclists for multiple word queries. Another option is to store them sorted by a ranking of
the occurrence of the word in each document. This makes answering one word queries
trivial and makes it likely that the answers to multiple word queries are near the start.
However, merging is much more difficult. Also, this makes development much more difficult
in that a change to the ranking function requires a rebuild of the index. We chose a
compromise between these options, keeping two sets of inverted barrels -- one set for hit
lists which include title or anchor hits and another set for all hit lists. This way, we check the
first set of barrels first and if there are not enough matches within those barrels we check
the larger ones.

Crawling the Web


Running a web crawler is a challenging task. There are tricky performance and
reliability issues and even more importantly, there are social issues. Crawling is the most
fragile application since it involves interacting with hundreds of thousands of web servers
and various name servers which are all beyond the control of the system.
In order to scale to hundreds of millions of web pages, Google has a fast distributed
crawling system. A single URLserver serves lists of URLs to a number of crawlers (we
typically ran about 3). Both the URLserver and the crawlers are implemented in Python.
Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web
pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per
second using four crawlers. This amounts to roughly 600K per second of data. A major
performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does
not need to do a DNS lookup before crawling each document. Each of the hundreds of
connections can be in a number of different states: looking up DNS, connecting to host,
sending request, and receiving response. These factors make the crawler a complex
component of the system. It uses asynchronous IO to manage events, and a number of
queues to move page fetches from state to state.
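The structure described here (many connections in flight at once, with a per-crawler DNS cache so each host is resolved only once) can be sketched with Python's standard asyncio library; this is a simplified stand-in, not Google's crawler, and the connection limit is just the rough figure quoted above:

    import asyncio, socket
    from urllib.parse import urlparse

    dns_cache = {}                                        # host -> resolved IP

    async def resolve(host):
        if host not in dns_cache:
            loop = asyncio.get_running_loop()
            infos = await loop.getaddrinfo(host, 80, type=socket.SOCK_STREAM)
            dns_cache[host] = infos[0][4][0]              # cache the first address
        return dns_cache[host]

    async def fetch(url, sem):
        parts = urlparse(url)
        async with sem:                                   # cap open connections
            ip = await resolve(parts.hostname)
            reader, writer = await asyncio.open_connection(ip, 80)
            request = f"GET {parts.path or '/'} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n"
            writer.write(request.encode())
            await writer.drain()
            body = await reader.read()                    # read until the server closes
            writer.close()
            return body

    async def crawl(urls, max_connections=300):           # ~300 connections, as above
        sem = asyncio.Semaphore(max_connections)
        return await asyncio.gather(*(fetch(u, sem) for u in urls))

    # asyncio.run(crawl(["http://example.com/"]))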
It turns out that running a crawler which connects to more than half a million
servers, and generates tens of millions of log entries generates a fair amount of email and
phone calls. Because of the vast number of people coming on line, there are always those
who do not know what a crawler is, because this is the first one they have seen. Almost
daily, we receive an email something like, "Wow, you looked at a lot of pages from my web
site. How did you like it?" There are also some people who do not know about the robots
exclusion protocol, and think their page should be protected from indexing by a statement
like, "This page is copyrighted and should not be indexed", which needless to say is difficult
for web crawlers to understand. Also, because of the huge amount of data involved,
unexpected things will happen. For example, our system tried to crawl an online game. This
resulted in lots of garbage messages in the middle of their game! It turns out this was an
easy problem to fix. But this problem had not come up until we had downloaded tens of
millions of pages. Because of the immense variation in web pages and servers, it is virtually
impossible to test a crawler without running it on a large part of the Internet. Invariably, there
are hundreds of obscure problems which may only occur on one page out of the whole web
and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior.
Systems which access large parts of the Internet need to be designed to be very robust and
carefully tested. Since large complex systems such as crawlers will invariably cause
problems, there needs to be significant resources devoted to reading the email and solving
these problems as they come up.

Indexing the Web

Parsing -- Any parser which is designed to run on the entire Web must handle a

huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in
the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great
variety of other errors that challenge anyone's imagination to come up with equally creative
ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to
generate a lexical analyzer which we outfit with its own stack. Developing this parser which
runs at a reasonable speed and is very robust involved a fair amount of work.

Indexing Documents into Barrels -- After each document is parsed, it is encoded


into a number of barrels. Every word is converted into a wordID by using an in-memory hash
table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the
words are converted into wordID's, their occurrences in the current document are
translated into hit lists and are written into the forward barrels. The main difficulty with
parallelization of the indexing phase is that the lexicon needs to be shared. Instead of
sharing the lexicon, we took the approach of writing a log of all the extra words that were
not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run
in parallel and then the small log file of extra words can be processed by one final indexer.
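A toy sketch of this step is shown below; the base lexicon, barrel partitioning (a simple modulo rather than true wordID ranges) and hit representation are simplified stand-ins for illustration only:

    base_lexicon = {"google": 0, "search": 1, "engine": 2}   # fixed base lexicon
    extra_words_log = []                                      # new words, logged to a file in reality
    lexicon = dict(base_lexicon)

    def word_to_id(word):
        if word not in lexicon:
            lexicon[word] = len(lexicon)
            extra_words_log.append(word)      # merged later by one final indexer
        return lexicon[word]

    def index_document(doc_id, text, barrels, num_barrels=4):
        for position, word in enumerate(text.lower().split()):
            wid = word_to_id(word)
            barrel = barrels.setdefault(wid % num_barrels, [])
            barrel.append((doc_id, wid, position))   # stand-in for a real hit entry

    barrels = {}
    index_document(1, "Google search engine architecture", barrels)
    print(barrels, extra_words_log)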

Sorting -- In order to generate the inverted index, the sorter takes each of the
forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits
and a full text inverted barrel. This process happens one barrel at a time, thus requiring little
temporary storage. Also, we parallelize the sorting phase to use as many machines as we
have simply by running multiple sorters, which can process different buckets at the same
time. Since the barrels don't fit into main memory, the sorter further subdivides them into
baskets which do fit into memory based on wordID and docID. Then the sorter loads each
basket into memory, sorts it and writes its contents into the short inverted barrel and the
full inverted barrel.

Searching

The goal of searching is to provide quality search results efficiently. Many of the large
commercial search engines seemed to have made great progress in terms of efficiency.
Therefore, we have focused more on quality of search in our research, although we believe
our solutions are scalable to commercial volumes with a bit more effort. The Google query
evaluation process is shown in Figure 4:

Figure 4. Google Query Evaluation
1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist
   in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist, go to step 4.
8. Sort the documents that have matched by rank and return the top k.

To put a limit on response time, once a certain number (currently 40,000) of matching
documents are found, the searcher automatically goes to step 8 in Figure 4. This means that
it is possible that sub-optimal results would be returned. We are currently investigating
other ways to solve this problem. In the past, we sorted the hits according to PageRank,
which seemed to improve the situation.

The Ranking System

Google maintains much more information about web documents than typical search
engines. Every hitlist includes position, font, and
capitalization information. Additionally, we factor in hits from anchor text and the PageRank
of the document. Combining all of this information into a rank is difficult. We designed our
ranking function so that no particular factor can have too much influence. First, consider the
simplest case -- a single word query. In order to rank a document with a single word query,
Google looks at that document's hit list for that word. Google considers each hit to be one
of several different types (title, anchor, URL, plain text large font, plain text small font, ...),
each of which has its own type-weight. The type-weights make up a vector indexed by type.
Google counts the number of hits of each type in the hit list. Then every count is converted
into a count-weight. Count-weights increase linearly with counts at first but quickly taper off
so that more than a certain count will not help. We take the dot product of the vector of
count-weights with the vector of type-weights to compute an IR score for the document.
Finally, the IR score is combined with PageRank to give a final rank to the document.
For a multi-word search, the situation is more complicated. Now multiple hit lists
must be scanned through at once so that hits occurring close together in a document are
weighted higher than hits occurring far apart. The hits from the multiple hit lists are
matched up so that nearby hits are matched together. For every matched set of hits, a
proximity is computed. The proximity is based on how far apart the hits are in the document
(or anchor) but is classified into 10 different value "bins" ranging from a phrase match to
"not even close". Counts are computed not only for every type of hit but for every type and
proximity. Every type and proximity pair has a type-prox-weight. The counts are converted
into count-weights and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can be displayed with
the search results using a special debug mode. These displays have been very helpful in
developing the ranking system.
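The single-word case described above can be sketched as a simple dot product; every weight, the tapering function, and the way the IR score is mixed with PageRank below are made-up illustrative choices, since the real values are not published:

    import math

    TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                    "plain_large": 3.0, "plain_small": 1.0}

    def count_weight(count, cap=8.0):
        # increases roughly linearly at first, then tapers off toward the cap
        return cap * (1.0 - math.exp(-count / 4.0))

    def ir_score(hit_counts):
        # dot product of count-weights with type-weights
        return sum(count_weight(hit_counts.get(t, 0)) * w
                   for t, w in TYPE_WEIGHTS.items())

    def final_rank(hit_counts, pagerank, ir_mix=0.7):
        # one simple way to combine IR score and PageRank; the real mix is unknown
        return ir_mix * ir_score(hit_counts) + (1.0 - ir_mix) * pagerank * 100

    print(final_rank({"title": 1, "plain_small": 12}, pagerank=0.004))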
Feedback

The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In
order to do this, we have a user feedback mechanism in the search engine. A trusted user
may optionally evaluate all of the results that are returned. This feedback is saved. Then
when we modify the ranking function, we can see the impact of this change on all previous
searches which were ranked. Although far from perfect, this gives us some idea of how a
change in the ranking function affects the search results.

Bibliography
The following sources have been referred to in the making of this project:
1. The Anatomy of a Large-Scale Hypertextual Web Search Engine By Sergey Brin and
Lawrence Page http://infolab.stanford.edu/~backrub/google.html
2. PageRank - http://en.wikipedia.org/wiki/PageRank
3. The History of Web Search Engines (INFOGRAPHIC)
- http://www.wordstream.com/blog/ws/2010/08/03/web-search-engines-history
4. Web Search Engine - http://en.wikipedia.org/wiki/Web_search_engine#History
5. W3 Servers - http://www.w3.org/History/19921103hypertext/hypertext/DataSources/WWW/Servers.html
6. Google Architecture - http://highscalability.com/google-architecture
And several other related sites and blogs.
