
University Of RUGKT IIIT Nuzvid

Faculty of Graph Theory and Applications

Google PageRank Algorithm


(Project Report)

Prepared By
Kalishavali Shaik
N091390-CSE-I

Supervised By
Kalpana Gangwar Madam
Graph theory and Applications

Dated:

December 10, 2014

Table of Contents

1 Cover Page
2 Abstract
3 Abbreviations
4 Chapters
4.1 Introduction
4.2 Review of Literature
4.3 Summary and Conclusions
5 References

Abstract
This paper presents different parallel implementations of Google's PageRank algorithm.
The purpose is to compare different methods for computing PageRank on large domains
of the Web. The iterative algorithms used are the Power method and the Arnoldi method.

We have implemented these algorithms in a parallel environment and created a basic
web crawler to gather test data. Tests have then been carried out with the different
algorithms using various test data.

The explicitly restarted Arnoldi method was shown to be superior to the normal Arnoldi
method as well as the Power method for high values of the damping factor. Results
also show that load balancing our parallel implementation was usually quite ineffective.

For smaller values of the damping factor, including 0.85 as Google uses, the Power method
is preferable. It is usually somewhat slower, but the memory used is significantly less.
For higher values of the damping factor, if very accurate results are needed, the restarted
Arnoldi method is preferable.

The importance of a web page is an inherently subjective matter, which depends on the
reader's interests, knowledge and attitudes. But there is still much that can be said
objectively about the relative importance of web pages. This paper describes PageRank, a
method for rating web pages objectively and mechanically, effectively measuring the human
interest and attention devoted to them.

Abbreviations
HITS: Hyperlink Induced Topic Search
DWPR: Distance Weighted Page Ranking
WWW: World Wide Web

Contents
1. Introduction
The World Wide Web creates many new challenges for information retrieval.
It is very large and heterogeneous. Current estimates are that there are over 150 million
web pages with a doubling life of less than one year. More importantly, the web pages are
extremely diverse, ranging from "What is Joe having for lunch today?" to journals about
information retrieval. In addition to these major challenges, search engines on the Web
must also contend with inexperienced users and pages engineered to manipulate search
engine ranking functions.
However, unlike "flat" document collections, the World Wide Web is
hypertext and provides considerable auxiliary information on top of the text of the
web pages, such as link structure and link text. In this paper, we take advantage of the
link structure of the Web to produce a global "importance" ranking of every web page.
This ranking, called PageRank, helps search engines and users quickly make sense of the
vast heterogeneity of the World Wide Web.
Today, search engines are most people's main tool for navigating the Internet. The user
enters a keyword or a combination of keywords to trigger the search. The search engine
searches the web pages available on the Internet and returns the result as an ordered
list of pages. This order should depend on the relevance and the importance of the pages.

Different search engines use different techniques to decide the importance of a page on
the web. Alta Vista uses the HITS (Hyperlink Induced Topic Search) algorithm to rank pages,
while Google uses the PageRank algorithm. Yahoo also uses some variation of the PageRank
algorithm that Google uses. There are many variations of the page ranking algorithm, such
as confidence-based page ranking, the weighted PageRank algorithm and the query-sensitive
self-adaptable web page ranking algorithm. This paper covers only one algorithm, the one
based on the formula given by Brin and Page, the founders of Google.

2. What is PageRank?
Most people start their web navigation with a search engine. Google is the most
widely used search engine nowadays. A user cannot go through all the pages returned by a
search, so the results should be ordered by their relevance and importance on the web.
Thus all the pages in the collection should be weighted and presented in the order of
their weights.
One of the most important factors that Google uses is PageRank. PageRank is a
numeric value that represents how important a page is on the web. Of course, PageRank
is not the only factor that decides the importance of a page, but it is one of them.
PageRank is described by one mathematical formula that seems very difficult at first, but
actually is not.

2.1 Formula:
The citation graph of the web is the main resource for the calculation of PageRank. In the
paper The Anatomy of a Large-Scale Hypertextual Web Search Engine, the founders of
Google, Sergey Brin and Lawrence Page, defined PageRank as:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The
parameter d is a damping factor, which can be set between 0 and 1. We usually set d to
0.85. C(A) is defined as the number of links going out of page A. The PageRank
of a page A is given as follows:
PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

This is the original formula that was published when PageRank was being developed.
Google probably uses a variation of it, but the details are not public.

Thus, when one page links to another page, the first page votes some PageRank to the
second page. The PageRank of a page is the sum of a constant (1 - d = 0.15) and the damped
sum of the votes from all pages pointing to it. The value of a vote from a particular page
depends on the PageRank and the total number of out-links of that page: the higher the
PageRank, the higher the value of the vote, and the higher the number of out-links, the
lower the value of the vote.
So every page distributes 85% of its PageRank evenly among all the pages to
which it points. E.g. if page A points to four other pages, then the term d * PR(A)/4
appears in the PageRank equation of each of those four pages. These four terms add up to
d * PR(A), i.e. 85% of the PageRank of A. (From here onwards I'll refer to PageRank as PR.)
Note that when a page votes for another page, it doesn't give away anything from its own
PR. It's just voting; the only difference is that the weight of a page's vote depends on
its own PR. It is the same as a shareholders' meeting, where the weight of a shareholder's
vote depends on the shares held.
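
The formula and the voting rule can be sketched in a few lines of Python. This is a minimal illustration only, not Google's implementation; the function name pagerank_of and the representation of inbound links as (PR, out-link count) pairs are assumptions made for this example.

```python
def pagerank_of(inbound, d=0.85):
    """Evaluate PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)).

    inbound: list of (PR(T), C(T)) pairs, one per page T that links to A.
    """
    return (1 - d) + d * sum(pr / c for pr, c in inbound)


# A page with PR 1.0 and four out-links votes d * 1.0 / 4 to each target.
single_vote = 0.85 * 1.0 / 4
# The four votes together amount to d * PR(A), i.e. 85% of PR(A).
total_voted = 4 * single_vote
# PR of a page whose only inbound link comes from that page.
pr_target = pagerank_of([(1.0, 4)])
```

Calling pagerank_of with an empty list gives the constant term 1 - d alone, which is the PR of a page that receives no votes.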

2.2 How to use formula?


If we look at the equation, it's obvious that the PR of a page depends on the PR of the
pages pointing to it, which may in turn depend on the PR of the original page if it has a
back link to them. E.g. if there are only two pages A and B which point to each other, then
the PR of each page depends on the PR of the other, which is not yet calculated. So where
do we start? Surprisingly, we can start with any assumed values of PR and repeat the
calculation of the PR values of all pages iteratively until they become stable.
Let's see an example: consider only two pages A and B, which point to
each other. We don't know what their PR should be to begin with, so let's take a guess at
1 and do some calculations.

PR(A) = (1 - d) + d * PR(B)
      = (1 - 0.85) + 0.85 * 1
      = 1
PR(B) = (1 - d) + d * PR(A)
      = (1 - 0.85) + 0.85 * 1
      = 1

Since the values are not changing, we can stop here. Now let's do the calculations with an
initial guess of 0:

PR(A) = 0.15 + 0.85 * 0
      = 0.15
PR(B) = 0.15 + 0.85 * 0.15
      = 0.2775

After second iteration:

PR(A) = 0.15 + 0.85 * 0.2775
      = 0.385875
PR(B) = 0.15 + 0.85 * 0.385875
      = 0.47799375

After third iteration:

PR(A) = 0.15 + 0.85 * 0.47799375
      = 0.5562946875
PR(B) = 0.15 + 0.85 * 0.5562946875
      = 0.622850484375

and so on. The values keep changing until they reach 1. If we start with values
greater than 1, the values decrease iteration by iteration and settle down to 1. If we
start the calculation with different values of PR(A) and PR(B), we again end up with
PR(A) = PR(B) = 1. Here we considered a symmetrical link structure between A and B, so
whatever values of PR(A) and PR(B) we start with, we end up with PR(A) = PR(B) = 1.
If we take an asymmetrical structure, e.g. only A points to B and B does not point to A,
then we again end up with the same final values whatever values we start with:
PR(A) = 0.15 and PR(B) = 0.15 + 0.85 * 0.15 = 0.2775.
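
The iteration for the two mutually linked pages can be sketched as follows. The function name and signature are assumptions for this illustration. PR(B) is computed from the freshly updated PR(A), which is how the worked example proceeds.

```python
def iterate_two_pages(pr_a, pr_b, d=0.85, rounds=3):
    """Iterate PR(A) = (1-d) + d*PR(B) and PR(B) = (1-d) + d*PR(A).

    Returns the (PR(A), PR(B)) pair reached after each round.
    """
    history = []
    for _ in range(rounds):
        pr_a = (1 - d) + d * pr_b
        pr_b = (1 - d) + d * pr_a  # uses the PR(A) just computed
        history.append((pr_a, pr_b))
    return history


# Start from a guess of 0, as in the worked example.
from_zero = iterate_two_pages(0.0, 0.0, rounds=3)
# Start from arbitrary values above 1 and iterate for much longer.
from_high = iterate_two_pages(5.0, 3.0, rounds=200)
```

The three rounds starting from 0 give the values of the three iterations worked out by hand, and the long run starting above 1 settles at PR(A) = PR(B) = 1.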

For more practice with different numbers of pages with different configurations of
citation graphs, visit:

www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.15&pgnms=&pgs=2&initpr=1&its=100&type=simple

3. How is PageRank used?


Google uses PR as one of the important factors in the search process to determine the
relevance of a page. So while searching, web pages are accessed in decreasing order of
PR. One can find the PR of one's own web page by using the Google toolbar
(http://toolbar.google.com/). But the output of this toolbar is somewhat unexpected. Our
normal PR calculation can yield PR values from 0.15 upwards without limit, but the toolbar
gives the PR of any page in the range 1 to 10. So actual PR values are divided into
intervals, each represented by one of the values from 1 to 10. The question is whether
these intervals are equidistant, i.e. whether the scale is linear. No one outside Google
knows. Different researchers give different answers to this question. According to Ian
Rogers, the intervals cannot be equidistant, which means the scale used by Google must be
logarithmic. As a result, the effort needed to increase PR from 2 to 3 is much less than
the effort needed to increase PR from 8 to 9. What does this effort consist of? By the end
of this paper, the reader should be able to answer this question.
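
To see why a logarithmic scale would make higher toolbar values much harder to reach, consider a purely hypothetical scale in which toolbar value n requires a raw PR of at least base**n. Whether Google's scale has this form is not public, and the base of 5 used below is an arbitrary assumption chosen only for illustration; the function names are also made up for this sketch.

```python
def raw_pr_needed(toolbar_value, base=5):
    """Raw PR needed for a toolbar value on a hypothetical log scale."""
    return base ** toolbar_value


def raw_pr_gap(lower, upper, base=5):
    """Extra raw PR needed to move from one toolbar value to a higher one."""
    return raw_pr_needed(upper, base) - raw_pr_needed(lower, base)


gap_2_to_3 = raw_pr_gap(2, 3)  # 125 - 25
gap_8_to_9 = raw_pr_gap(8, 9)  # 1953125 - 390625
ratio = gap_8_to_9 / gap_2_to_3
```

With this assumed base, the step from 8 to 9 needs 15625 times as much raw PR as the step from 2 to 3. For any base b greater than 1 the ratio is b**6, so the conclusion does not depend on the particular base chosen.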

4. Examples and Observations:


Google does not provide any information about methods to improve the PR values of a
page or site. But we can work through different examples and draw some observations from
them. Before starting the examples, I would like to explain the difference between the PR
of a page and the PR of a site. Google assigns a PR to each page on the web. But from the
site's point of view, the total PR of all pages of the site is important. Thus the PR of a
site is simply the sum of the PRs of all pages of the site.

Example 1.

PR(A) = (1 - d) + d (0)
      = 0.15
PR(B) = (1 - d) + d (0)
      = 0.15

In this example A and B don't have any inbound or outbound links. They come up
with PR values equal to 0.15. Thus we can conclude that the minimum value of PR of any
page is 0.15.

Example 2.

PR(A) = (1 - d) + d (PR(B) / C(B))
      = 0.15 + 0.85 (1/1)
      = 1
PR(B) = (1 - d) + d (0)
      = 0.15

Actually this is not the correct result. There are two concepts to consider: orphan pages
and dangling links. Orphan pages are those which don't have any inbound links. For a page
to be indexed by Google it must have at least one page linking to it. If a page is not in
the Google index, it and its links don't exist as far as the calculations are concerned.
Therefore, it can't have a PR value and it can't share PR with other pages. Page B is an
orphan page and thus will be eliminated during the PR calculation, so page A will not
receive any vote from page B.
Dangling links are links that go to pages that don't have any outbound links. These
links are dropped for the duration of the calculations. The link from B to A is a dangling
link and will be eliminated during the PR calculation. But from here onwards we consider
only a small part of the whole web. Thus even if no inbound links are shown for a
particular page, they are assumed to be present. We will therefore not consider these two
concepts any more in our examples.
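
Dropping dangling links can be sketched as a filter on a link table. The dictionary representation (page name mapped to the list of pages it links to) and the function name are assumptions for this illustration. The sketch makes a single pass only; removing links can leave further pages without outbound links, which a fuller treatment would handle by repeating the pass.

```python
def drop_dangling_links(links):
    """Remove links whose target page has no outbound links.

    links: dict mapping each page to the list of pages it links to.
    Targets that do not appear as keys are treated as having no out-links.
    """
    has_outbound = {page for page, targets in links.items() if targets}
    return {
        page: [t for t in targets if t in has_outbound]
        for page, targets in links.items()
    }


# Example 2: B links to A, and A has no outbound links.
example_2 = {"A": [], "B": ["A"]}
pruned = drop_dangling_links(example_2)
```

For Example 2 the link from B to A is removed, so A receives no vote from B. A pair of pages that link to each other is left unchanged, since neither link is dangling.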
From here onwards I'll not show all the calculations. Instead I'll show only the final
settled values of PR after a large number of iterations.

Example 3

[Figure: link diagram for Example 3. Pages A, B and C; every value shown is 1.0.]

Example 4.

[Figure: link diagram for Example 4. Values shown: A = 1.85, B = 0.575, C = 0.575.]

Fig. 1. Analysis of Page Rank [1]

This is a hierarchical structure. In general, the homepage of a website should have the
maximum PR. We need our PR to be concentrated at the homepage. Thus we can use a
hierarchical structure to channel a large proportion of the PR of the site to where we
want it.
Example 5:

[Figure: link diagram for Example 5. Values shown: External Site 1 = 1.0, A = 1.0, B = 0.575, C = 0.575.]

Example 6:

[Figure: link diagram for Example 6. Values shown: External Site 1 = 1.0, A = 2.6, B = 1.255, C = 1.255, External Site 2 = 0.638 (the value 1.215 also appears beside External Site 2).]

Fig. 2. Analysis of Page Rank [1]


Compare example 5 with example 6. Pages A, B and C are pages of the same site. External
site 1 points to page A, and page C points to an external site. In both of these examples,
the PR of external site 1 is assumed to be equal to 1, and the PR values of the remaining
pages are then calculated. Can we infer something from these examples? Definitely. There
is a PR leak in example 5, which is reduced in example 6. In example 5, page C gives its
entire vote to the external site, while in example 6 page C gives part of it to the
external site and part of it to page A. This part of the vote from C increases the PR
value of A and thus increases the weight of A's vote. In turn the PR of B and C increases,
which again increases the weight of C's vote. This is an iterative process, which finally
results in increased PR values for all of A, B and C. This is definitely in favor of the
PR of the whole site.
From these examples we can conclude that if any page of a site points to an
external page, we can reduce the PR leak by enlarging the citation network within the site.
Can we decrease the PR leak further by introducing reciprocal links between B and C? Let's
try this.

[Figure: link diagram with reciprocal links between B and C. Values shown: External Site 1 = 1.0, A = 2.146, B = 1.549, C = 1.720, External Site 2 = 1.215.]

Fig. 3. Analysis of Page Rank [1]


That's great! The average PR of the site increased, as we expected.
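
Experiments like the ones in this section can be repeated with a short script. The sketch below is one possible way to do it and is not the method used to produce the figures: the function name, the dictionary representation of links and both link structures are assumptions. External Site 1 is held at PR 1.0, as in the examples. In the first structure page C links only to an external site; in the second it also links back to A.

```python
def settle_pagerank(links, fixed=None, d=0.85, iterations=200):
    """Iterate PR(p) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to p.

    links: dict mapping each page to the list of distinct pages it links to.
    fixed: dict of pages whose PR is held constant (e.g. an external site).
    """
    fixed = fixed or {}
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: fixed.get(p, 1.0) for p in pages}
    for _ in range(iterations):
        votes = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                # Each out-link carries an equal share of the source's PR.
                votes[target] += pr[source] / len(targets)
        pr = {p: fixed.get(p, (1 - d) + d * votes[p]) for p in pages}
    return pr


# Assumed structure 1: Ext1 -> A, A -> B and C, C -> Ext2 only (PR leaks out).
leaky = settle_pagerank(
    {"Ext1": ["A"], "A": ["B", "C"], "C": ["Ext2"]}, fixed={"Ext1": 1.0}
)
# Assumed structure 2: as above, but C also links back to A.
less_leaky = settle_pagerank(
    {"Ext1": ["A"], "A": ["B", "C"], "C": ["Ext2", "A"]}, fixed={"Ext1": 1.0}
)

site = ("A", "B", "C")
site_total_leaky = sum(leaky[p] for p in site)
site_total_less_leaky = sum(less_leaky[p] for p in site)
```

With these assumed structures, A, B and C settle at 1.0, 0.575 and 0.575 in the first case, which are the values listed for Example 5, and the site total rises from 2.15 to about 2.70 once C links back to A.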

5. How to increase PageRank?


There are different ways to increase the PR of your site. From the examples illustrated
above we can come up with some ideas to increase the PR of a web site. Some standard ideas
are given below:
5.1 Add spam pages:
As shown in the figure, we can increase the total PR of a website by adding more pages to
the site, no matter what the contents of these pages are. The total PR of the site
increases, but we can't increase the average PR of the site (i.e. PR per page). The upper
limit on the average PR of a site is 1.

[Figure: link diagram with 1000 spam pages. Values shown: B = 281.6; a second page = 331.0; Spam 1, Spam 2, ..., Spam 1000 = 0.39 each.]

Fig. 4. Analysis of Page Rank [1]
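
The claim that spam pages raise the total PR but not the average can be checked with a small script. The link structure below is an assumption, not taken from the figure: page A links to B, B links to each of 1000 spam pages, and every spam page links back to A. The function name and the dictionary representation of links are likewise made up for this sketch.

```python
def settle_pagerank(links, d=0.85, iterations=300):
    """Iterate PR(p) = (1 - d) + d * sum(PR(t) / C(t)) over pages t linking to p.

    links: dict mapping each page to the list of distinct pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        votes = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                # Each out-link carries an equal share of the source's PR.
                votes[target] += pr[source] / len(targets)
        pr = {p: (1 - d) + d * votes[p] for p in pages}
    return pr


# Assumed loop: A -> B, B -> every spam page, every spam page -> A.
spam = [f"Spam {i}" for i in range(1, 1001)]
links = {"A": ["B"], "B": spam}
links.update({s: ["A"] for s in spam})

pr = settle_pagerank(links)
total = sum(pr.values())
average = total / len(pr)
```

Under this assumed structure the values settle near 331.1 for A, 281.6 for B and 0.39 for each spam page. The total over all 1002 pages is 1002, so the average stays at 1 per page however many spam pages are added.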

5.2 Join forums:


Forums are a great way to obtain links to your website. In most forums you are
allowed to have a signature, and in your signature you can put a link to your website. It
is also important to make sure the forum is somewhat related to your website. You will
still get credit if it's not, but if it is related then you accomplish two tasks at once:
people will come to know about your site, and this will help to increase its popularity.

5.3 Submit to search engine directories.


Search engine directories are a good way to get a free link to your website. They
also increase your chances of being listed higher on popular search engines like Google
and others. Most search engine directories allow you to submit your website for free.
This increases your web presence by getting you listed on another search engine, and it
is also a free link. Remember, the more links you have, the higher your PR will be.
5.4 Reciprocating Links:
You can search for sites on the same topic as your website. If the PR of such a
site is higher than that of your site, you can pay that site and create reciprocal links
between the two. There are link-farming sites, which exchange links with other sites.
However, this violates Google's rules: Google bans sites that participate in link farming.
So beware of link farming!

5.5 Contents:

The last and most important way to increase the PR of your site is to keep solid
content on your site. Other sites will link to your site of their own accord if you have
good content. There really is no substitute for good content!

Conclusion


Even though the formula for calculating PageRank seems difficult, it is easy to
understand. But when a simple calculation is applied hundreds of times over, the results
can seem complicated, and we cannot easily predict the result of these iterations. Surely,
more practice can yield more observations.
PageRank is an important factor in Google ranking, but it is only one of
the important factors considered. E.g. nowadays Google pays a lot of attention to
a link's anchor text when deciding the relevance of the target page.
But as PageRank is one of the important factors, one should be well aware of
PageRank while designing a website.

References
[1] Figures "Analysis of Page Rank": Wikipedia; http://pr.efactory.de/
Bing Liu, Web Data Mining, Springer International Edition
IEEE conference paper, Research on PageRank and Hyperlink Induced Topic Search in Web Structure Mining
Websites: Google; Wikipedia; http://pr.efactory.de/
www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html

