Prepared By
Kalishavali Shaik
N091390-CSE-I
Supervised By
Kalpana Gangwar Madam
Graph theory and Applications
Dated:
Table of Contents
1 Cover Page
2 Abstract
3 Abbreviations
4 Chapters
4.1 Introduction
4.2 Review of Literature
4.3 Summary and Conclusions
5 References
Abstract
This paper presents different parallel implementations of Google's PageRank algorithm.
The purpose is to compare different methods for computing PageRank on large domains
of the Web. The iterative algorithms used are the Power method and the Arnoldi method.
The explicitly restarted Arnoldi method was shown to be superior to the normal Arnoldi
method as well as the Power method for high values of the damping factor. Results
also show that load balancing our parallel implementation was usually quite ineffective.
For smaller values of the damping factor, including 0.85 as Google uses, the Power
method is preferable. It is usually somewhat slower, but the memory used is significantly
less. For higher values of the damping factor, if very accurate results are needed, the
restarted Arnoldi method is preferable.
Abbreviations
HITS: Hyperlink-Induced Topic Search
DWPR: Distance Weighted Page Ranking
WWW: World Wide Web
Contents
1. Introduction
The World Wide Web creates many new challenges for information retrieval.
It is very large and heterogeneous. Current estimates are that there are over 150 million
web pages with a doubling life of less than one year. More importantly, the web pages are
extremely diverse, ranging from "What is Joe having for lunch today?" to journals about
information retrieval. In addition to these major challenges, search engines on the Web
must also contend with inexperienced users and pages engineered to manipulate search
engine ranking functions.
However, unlike "flat" document collections, the World Wide Web is
hypertext and provides considerable auxiliary information on top of the text of the
web pages, such as link structure and link text. In this paper, we take advantage of the
link structure of the Web to produce a global "importance" ranking of every web page.
This ranking, called PageRank, helps search engines and users quickly make sense of the
vast heterogeneity of the World Wide Web.
Today, search engines have become the main way most people navigate the
Internet. The user enters a keyword or a combination of keywords to trigger the search.
The search engine searches the web pages available on the Internet and returns the
result as an ordered list of pages. This order should depend on the relevancy and the
importance of the pages.
Google uses the PageRank algorithm to rank pages. Yahoo also uses some
variation of the PageRank algorithm that Google uses. There are many variations of the
page ranking algorithm, such as confidence-based page ranking, the weighted PageRank
algorithm, and the query-sensitive self-adaptable web page ranking algorithm. This paper,
however, covers only one algorithm: the one based on the formula given by Brin and Page,
the founders of Google.
2. What is PageRank?
Most people start their web navigation with a search engine. Google is the most
famous search engine in use nowadays. When search results are presented, they should be
ordered by their relevancy and importance on the web. The user cannot go through all the
pages returned by a search. Thus all the pages in the collection should be
weighted and presented in the order of their weights.
One of the most important factors that Google uses is PageRank. PageRank is a
numeric value that represents how important a page is on the web. Of course, PageRank
is not the only factor that decides the importance of a page, but it is one of them.
PageRank is described by a mathematical formula that seems very difficult at first, but
actually is not.
2.1 Formula:
The citation graph of the web is the main resource for calculating PageRank. In the
paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine", the founders of
Google, Sergey Brin and Lawrence Page, defined PageRank as:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The
parameter d is a damping factor, which can be set between 0 and 1. We usually set d to
0.85. C(A) is defined as the number of links going out of page A. The PageRank
of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
This is the original formula that was published when PageRank was being developed.
It is probable that Google uses a variation of it, but the variation has not been disclosed.
Thus, when one page links to another page, the first page votes some PageRank to the
second page. The PageRank of a page is the sum of a constant (1-d = 0.15) and the damped
sum of the votes from all pages pointing to it. The value of the vote cast by a particular
page depends on its PageRank and on its total number of out-links: the higher the
PageRank, the higher the value of the vote, and the higher the number of out-links, the
lower the value of the vote.
So every page distributes 85% of its PageRank evenly among all the pages to
which it points. For example, if page A points to four other pages, then the term
d * PR(A)/4 appears in the PageRank equation of each of those four pages. Those four
terms add up to d * PR(A), i.e. 85% of the PageRank of A. (From here onwards I'll refer
to PageRank as PR.)
Note that when a page votes for another page, it doesn't give away anything from its own
PR. It is just voting; the only difference is that the weight of a page's vote depends on
its own PR. It is the same as a shareholders' meeting, where the weight of a shareholder's
vote depends on the shares held.
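The voting rule can be sketched in a few lines of Python. This is only an illustration of the published formula under stated assumptions: the function name `page_rank_of` and the representation of inbound pages as (PR, out-link count) pairs are hypothetical and do not come from the original paper.

```python
def page_rank_of(inbound, d=0.85):
    """Return PR(A) = (1 - d) + d * sum(PR(T) / C(T)).

    inbound is a list of (pr_of_T, outlink_count_of_T) pairs, one pair
    for each page T that links to A (hypothetical representation).
    """
    return (1 - d) + d * sum(pr / outlinks for pr, outlinks in inbound)


# A page with PR 1.0 and four out-links casts a vote of d * 1.0 / 4
# in the equation of each page it points to.
vote = 0.85 * 1.0 / 4

# A target page whose only inbound link comes from that page.
pr_target = page_rank_of([(1.0, 4)])

# A page with no inbound links receives only the constant term (1 - d).
pr_unlinked = page_rank_of([])
```

The four votes together amount to 0.85 of the voting page's PR, while the voting page's own PR is not reduced by casting them.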
Since the values are not changing, we can stop here. Now let's do the calculations with
an initial guess of 0:
and so on. The values keep changing until they reach 1. If we start with values
greater than 1, the values decrease iteration by iteration and
settle down to 1. If we start the calculation with different values of PR(A) and PR(B),
we still end up with PR(A) = PR(B) = 1. Here we considered a symmetrical link
structure between A and B, so whatever values of PR(A) and PR(B) we start with, we end up
with PR(A) = PR(B) = 1.
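The iteration for the symmetrical case can be reproduced with a short Python sketch. It assumes two pages A and B that link only to each other, so each has exactly one out-link. The function name `iterate_two_pages`, the iteration count, and the choice to reuse the freshly updated PR(A) when computing PR(B) are assumptions made for illustration.

```python
def iterate_two_pages(pr_a, pr_b, d=0.85, iterations=200):
    """Iterate the PageRank formula for A <-> B (each links to the other)."""
    for _ in range(iterations):
        # B's only out-link goes to A, so C(B) = 1.
        pr_a = (1 - d) + d * pr_b
        # A's only out-link goes to B, so C(A) = 1.
        pr_b = (1 - d) + d * pr_a
    return pr_a, pr_b


from_zero = iterate_two_pages(0.0, 0.0)     # initial guess below 1
from_one = iterate_two_pages(1.0, 1.0)      # initial guess already at 1
from_forty = iterate_two_pages(40.0, 40.0)  # initial guess above 1
```

All three starting points settle at values very close to 1 for both pages, because each update shrinks the distance to 1 by the factor d.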
If we take an asymmetrical structure, e.g. only A points to B and B does not point to A,
then we again end up with consistent values of PR(A) and PR(B), whatever values we
start with.
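The asymmetrical case can be checked in the same way. The sketch below assumes the only link is A to B and applies the formula literally, without the orphan and dangling-link adjustments. The function name `iterate_a_to_b` is hypothetical.

```python
def iterate_a_to_b(pr_a, pr_b, d=0.85, iterations=50):
    """Iterate the PageRank formula when the only link is A -> B."""
    for _ in range(iterations):
        # Nothing links to A, so only the constant term remains.
        pr_a = (1 - d)
        # A's single out-link goes to B, so C(A) = 1.
        pr_b = (1 - d) + d * pr_a
    return pr_a, pr_b


start_low = iterate_a_to_b(0.0, 0.0)
start_high = iterate_a_to_b(40.0, 40.0)
```

Both starting points give PR(A) of about 0.15 and PR(B) of about 0.2775 (that is, 0.15 + 0.85 * 0.15), so the settled values do not depend on the initial guess.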
For more practice with different numbers of pages and different configurations of
citation graphs, visit:
www.webworkshop.net/pagerank_calculator.php?lnks=2,10,15&iblprs=0.15,0.15,0.15,0.15&pgnms=&pgs=2&initpr=1&its=100&type=simple
Example 1.
PR(A) = (1-d) + d (0)
= 0.15
PR(B) = (1-d) + d (0)
= 0.15
In this example A and B don't have any inbound or outbound links. Both come up
with PR values equal to 0.15. Thus we can conclude that the minimum value of PR of any
page is 0.15.
Example 2.
Actually, this is not the correct result. There are two concepts to consider: orphan pages
and dangling links. Orphan pages are pages that don't have any inbound link. For a page to
be indexed by Google, it must have at least one page linking to it. If a page is not in the
Google index, it and its links don't exist as far as the calculations are concerned.
Therefore, it can't have a PR value and it can't share PR with other pages. Page B is
an orphan page and thus will be eliminated during the PR calculation, so page A will not
receive any vote from page B.
Dangling links are links that go to pages that don't have any outbound links. These
links are dropped for the duration of the calculations. The link from B to A is a dangling
link and will be eliminated during the PR calculation. From here onwards, however, we will
consider only a small part of the whole web. Thus even if no inbound links are shown for a
particular page, they are assumed to be present. We will therefore not consider these two
concepts any further in our examples.
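The two pruning rules can be sketched in Python. This is a minimal illustration under stated assumptions, not Google's actual procedure: the function name `prune_links`, the dictionary representation of the link graph, and the choice of a single pruning pass are all hypothetical.

```python
def prune_links(links):
    """Drop orphan pages and dangling links from a link graph.

    links maps each page to the list of pages it links to.
    Assumption: one pass only; pruning is not repeated on the result.
    """
    # Pages that receive at least one inbound link.
    has_inbound = {target for targets in links.values() for target in targets}

    # Orphan pages have no inbound link, so they and their links are removed.
    kept = {page: targets for page, targets in links.items() if page in has_inbound}

    # A dangling link points to a page that has no outbound links.
    def has_outbound(page):
        return bool(links.get(page))

    return {
        page: [t for t in targets if t in kept and has_outbound(t)]
        for page, targets in kept.items()
    }


# Hypothetical graph: O is an orphan page, and the link A -> D is dangling.
graph = {"O": ["A"], "X": ["A", "B"], "A": ["X", "D"], "B": ["A"], "D": []}
pruned = prune_links(graph)

# The two-page case from the text: B links to A, and nothing links to B.
two_pages = prune_links({"A": [], "B": ["A"]})
```

In the two-page case the orphan page B disappears together with its link, which is why page A receives no vote from it.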
From here onwards I'll not show all the calculations. Instead I'll show only the final
settled values of PR after a large number of iterations.
Example 3
[Figure: pages A, B and C, each with a settled PR of 1.0.]
Example 4.
[Figure: hierarchical structure of pages A, B and C. Settled values: A = 1.85, B = 0.575, C = 0.575.]
This is a hierarchical structure. In general, the homepage of a website should have the
maximum PR. We want the PR to be concentrated at the homepage, so we can use a
hierarchical structure to channel a large proportion of the site's PR to where we want it.
Example 5:
[Figure: External Site 1 = 1.0, A = 1.0, B = 0.575, C = 0.575.]
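Example 6:
[Figure: pages A, B and C with External Site 1 and External Site 2. Settled values in the first diagram: External Site 1 = 1.0, B = 1.255. Settled values in the second diagram: External Site 1 = 1.0, A = 2.146, B = 1.549, C = 1.720, External Site 2 = 1.215.]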
There are different ways to increase the PR of your site. From the examples illustrated
above we can come up with some ideas for increasing the PR of a web site. Some standard
ideas are given below:
5.1 Add spam pages:
As shown in the figure, we can increase the total PR of a website by adding more pages to
the site, no matter what the contents of these pages are. The total PR of the site
increases, but we can't increase the average PR of the site (i.e. PR per page). The upper
limit on the average PR of a site is 1.
[Figure: a site with 1000 spam pages. Settled values: B = 281.6; Spam 1, Spam 2, ..., Spam 1000 = 0.39 each; one further unlabeled page = 331.0.]
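The effect of spam pages on total and average PR can be explored with a small Python sketch. The link structure used here is an assumption, chosen because it roughly reproduces the values in the figure: page A links to B, B links to A and to every spam page, and every spam page links back to A. The function name `pagerank`, the starting value of 1.0 and the iteration count are also hypothetical.

```python
def pagerank(links, d=0.85, iterations=300):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over a link graph.

    links maps each page to the list of pages it links to.
    """
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        votes = {page: 0.0 for page in links}
        for source, targets in links.items():
            if not targets:
                continue  # a page without out-links casts no votes
            share = pr[source] / len(targets)
            for target in targets:
                votes[target] += share
        pr = {page: (1 - d) + d * votes[page] for page in links}
    return pr


# Assumed structure: A -> B, B -> A and all spam pages, spam pages -> A.
spam_pages = [f"Spam {i}" for i in range(1, 1001)]
links = {"A": ["B"], "B": ["A"] + spam_pages}
links.update({name: ["A"] for name in spam_pages})

pr = pagerank(links)
total_pr = sum(pr.values())
average_pr = total_pr / len(pr)
```

Under these assumptions the total PR grows with the number of pages (about 1002 for 1002 pages), while the average stays at about 1, and each spam page settles near 0.39.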
Forums are a great way to obtain links to your website. Most forums allow you to
have a signature, and in your signature you can put a link to your website. It is also
important to make sure the forum is somewhat related to your
website. You will still get credit if it is not, but if it is related to your website you
will accomplish two tasks at once: people will come to know about your site,
and this will help to increase the popularity of your site.
5.5 Contents:
The last and most important way to increase the PR of your site is to keep solid
content on your site. Other sites will automatically link to your site if it has good
content. There really is no substitute for good content.
Conclusion
References
Fig. Analysis of Page Rank [1]: Wikipedia, http://pr.efactory.de/
Bing Liu, Web Data Mining, Springer International Edition
IEEE conference paper: Research on PageRank and Hyperlink-Induced Topic Search in Web Structure Mining
Websites: Google, Wikipedia, http://pr.efactory.de/
www.math.cornell.edu/~mec/Winter2009/RalucaRemus/Lecture4/lecture4.html