
SEMINAR REPORT

Submitted By: AANCHAL GARG CSE


CONTENTS

- Web Crawlers
- Why do we need web crawlers
- Prerequisites of a Crawling System
- General Crawling Strategies
- Crawling Policies:
  - Selection policy
  - Re-visit policy
  - Politeness policy
  - Parallelization policy
- Web crawler architectures
- Distributed web crawling
- Focused Crawlers
- Examples of Web crawlers
WEB CRAWLERS

- Also known as spiders, robots, bots, aggregators, agents and intelligent agents.
- An Internet-aware program that can retrieve information from specific locations on the Internet.
- A program that collects documents by recursively fetching links from a set of starting pages; its behaviour depends mainly on the crawling policies used.
- It starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier.
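The seed/frontier loop described above can be written in a few dozen lines. The following is a minimal sketch, not the report's implementation; the seed URL, the page limit, and the decision to skip non-HTTP links are illustrative assumptions.

```python
# Minimal seed/frontier crawler sketch (illustrative values throughout).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # URLs waiting to be visited (the crawl frontier)
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue                 # skip unreachable pages
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)         # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)         # grow the frontier
    return visited


if __name__ == "__main__":
    print(crawl(["https://example.com/"]))
```

Real crawlers add the policies discussed in the following sections (selection, re-visit, politeness, parallelization) on top of this basic loop.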
WHY DO WE NEED WEB CRAWLERS?
- The Internet has a wide expanse of information.
- Finding relevant information requires an efficient mechanism.
- Web crawlers provide that capability to search engines.
PREREQUISITES OF A CRAWLING SYSTEM

- Flexibility: the system should be suitable for various scenarios.
- High Performance: the system needs to be scalable, from a minimum of one thousand pages per second up to millions of pages.
- Fault Tolerance: the system should process invalid HTML code, deal with unexpected Web server behavior, and handle stopped processes or interruptions in network services.
- Maintainability and Configurability: an appropriate interface is necessary for monitoring the crawling process, including:
  1. Download speed
  2. Statistics on the pages
  3. Amount of data stored
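As a rough illustration of the fault-tolerance and monitoring requirements above, the sketch below wraps each fetch in a retry loop and keeps running counters for download speed and stored data. The retry count, back-off scheme, and field names are assumptions for illustration, not part of the report.

```python
# Fault-tolerant fetch plus simple monitoring counters (illustrative sketch).
import time
from dataclasses import dataclass, field
from urllib.request import urlopen


@dataclass
class CrawlStats:
    started: float = field(default_factory=time.time)
    pages: int = 0
    bytes_stored: int = 0

    def pages_per_second(self) -> float:
        elapsed = max(time.time() - self.started, 1e-9)
        return self.pages / elapsed


def fetch_with_retries(url, stats, retries=3):
    """Tolerate transient network failures instead of stopping the whole crawl."""
    for attempt in range(retries):
        try:
            data = urlopen(url, timeout=10).read()
            stats.pages += 1
            stats.bytes_stored += len(data)
            return data
        except Exception:
            time.sleep(2 ** attempt)   # back off before the next attempt
    return None                        # give up on this URL, keep crawling others
```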
GENERAL CRAWLING STRATEGIES

- Breadth-First Crawling: the crawl is launched from an initial set of pages and proceeds by following the hypertext links leading to pages directly connected with this initial set.
- Repetitive Crawling: once pages have been crawled, some systems require the process to be repeated periodically so that indexes are kept up to date.
- Targeted Crawling: specialized search engines use heuristics in the crawling process in order to target a certain type of page.
- Random Walks and Sampling: random walks on Web graphs, combined with sampling, are used to estimate the size of the documents available online (sketched below).
- Deep Web Crawling: a lot of data accessible via the Web is currently contained in databases and may only be downloaded through appropriate requests or forms. The Deep Web is the name given to the part of the Web containing this category of data.
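To make the random-walk idea concrete, the toy sketch below runs a random walk over a small in-memory link graph and records visit frequencies; the graph, walk length, and restart rule are made-up illustrative data, not a method from the report.

```python
# Toy random walk over an in-memory link graph (illustrative data only).
import random

link_graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A"],
    "D": ["B"],
}

def random_walk(graph, start, steps=1000, seed=0):
    rng = random.Random(seed)
    visits = {page: 0 for page in graph}
    current = start
    for _ in range(steps):
        visits[current] += 1
        neighbours = graph.get(current) or list(graph)   # restart if we hit a dead end
        current = rng.choice(neighbours)
    return visits   # visit frequencies approximate how reachable each page is

print(random_walk(link_graph, "A"))
```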
CRAWLING POLICIES

- Selection Policy: states which pages to download.
- Re-visit Policy: states when to check for changes to the pages.
- Politeness Policy: states how to avoid overloading Web sites.
- Parallelization Policy: states how to coordinate distributed Web crawlers.
SELECTION POLICY

- Search engines cover only a fraction of the Internet, so downloading the most relevant pages is essential; this makes a good selection policy very important.
- Common selection policies (the first two are sketched below):
  - Restricting followed links
  - Path-ascending crawling
  - Focused crawling
  - Crawling the Deep Web
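The sketch below illustrates two of the selection policies named above: restricting followed links to URLs that are likely to be HTML pages, and path-ascending crawling, which also visits every ancestor path of a discovered URL. The extension list is an illustrative assumption.

```python
# Two simple selection-policy helpers (illustrative sketch).
from urllib.parse import urlparse

SKIP_EXTENSIONS = (".jpg", ".png", ".gif", ".css", ".js", ".pdf", ".zip")

def should_follow(url: str) -> bool:
    """Restrict followed links: drop URLs that are unlikely to be HTML pages."""
    path = urlparse(url).path.lower()
    return not path.endswith(SKIP_EXTENSIONS)

def ascend_paths(url: str):
    """Path-ascending crawling: also visit every ancestor path of the URL."""
    parts = urlparse(url)
    segments = [s for s in parts.path.split("/") if s]
    for i in range(len(segments), 0, -1):
        prefix = "/".join(segments[:i - 1])
        yield f"{parts.scheme}://{parts.netloc}/" + prefix + ("/" if i > 1 else "")

print(should_follow("http://example.com/logo.png"))             # False
print(list(ascend_paths("http://example.com/a/b/page.html")))
# ['http://example.com/a/b/', 'http://example.com/a/', 'http://example.com/']
```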
RE-VISIT POLICY

- The Web is dynamic and crawling takes a long time, so pages change while the crawl is in progress.
- Cost factors play an important role in crawling; freshness and age are the most commonly used cost functions.
- The objective of the crawler is a high average freshness and a low average age of web pages.
- Two re-visit policies: the uniform policy and the proportional policy.
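The sketch below states the two cost functions mentioned above in their commonly used form: freshness is 1 while the local copy matches the live page and 0 otherwise, and age is the time elapsed since the live page last changed while the copy is outdated. The function and parameter names are illustrative.

```python
# Common definitions of the freshness and age cost functions (illustrative names).
def freshness(is_up_to_date: bool) -> int:
    """Freshness of page p at time t: 1 if the local copy equals the live page, else 0."""
    return 1 if is_up_to_date else 0

def age(t: float, last_modified: float, is_up_to_date: bool) -> float:
    """Age of page p at time t: 0 while the copy is current, otherwise the time
    elapsed since the live page was last modified."""
    return 0.0 if is_up_to_date else max(0.0, t - last_modified)

# The uniform policy re-visits all pages at the same rate; the proportional
# policy re-visits a page more often the more frequently it changes.
```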
POLITENESS POLICY

- Crawlers can have a crippling impact on the overall performance of a site.
- The costs of using Web crawlers include:
  - Network resources
  - Server overload
  - Server/router crashes
  - Network and server disruption
- A partial solution to these problems is the robots exclusion protocol.
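A minimal sketch of these politeness measures is shown below: it checks robots.txt through the standard library's robots-exclusion parser and pauses between requests to the same host. The user-agent string and the one-second delay are illustrative assumptions.

```python
# Politeness sketch: robots exclusion protocol plus a per-host crawl delay.
import time
import urllib.robotparser
from urllib.parse import urlparse

USER_AGENT = "seminar-crawler"    # hypothetical user-agent name
CRAWL_DELAY = 1.0                 # seconds between requests to one host

robots_cache = {}
last_hit = {}

def allowed(url: str) -> bool:
    parts = urlparse(url)
    host = parts.scheme + "://" + parts.netloc
    if host not in robots_cache:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(host + "/robots.txt")
        try:
            rp.read()
        except Exception:
            pass                  # unreachable robots.txt; a real crawler should be conservative here
        robots_cache[host] = rp
    return robots_cache[host].can_fetch(USER_AGENT, url)

def polite_wait(url: str) -> None:
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_hit[host] = time.time()
```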
PARALLELIZATION POLICY

- A parallel crawler runs multiple crawling processes in parallel. The goals are:
  - To maximize the download rate
  - To minimize the overhead from parallelization
  - To avoid repeated downloads of the same page
- The crawling system therefore requires a policy for assigning the new URLs discovered during the crawling process, since the same URL can be found by different crawling processes.
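One simple way to meet the "no repeated downloads" goal is a shared set of already-claimed URLs protected by a lock, as in the thread-based sketch below. The worker count, queue, and sentinel shutdown scheme are illustrative assumptions rather than the report's design.

```python
# Parallel download workers sharing a "seen" set (illustrative sketch).
import threading
from queue import Queue

NUM_WORKERS = 4
frontier = Queue()
seen = set()
seen_lock = threading.Lock()

def claim(url: str) -> bool:
    """Only the first worker to ask about a URL is allowed to download it."""
    with seen_lock:
        if url in seen:
            return False
        seen.add(url)
        return True

def worker():
    while True:
        url = frontier.get()
        if url is None:              # sentinel value: shut this worker down
            break
        if claim(url):
            pass                     # download and parse the page here

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for url in ["http://a.example/", "http://b.example/", "http://a.example/"]:
    frontier.put(url)                # the duplicate URL is claimed only once
for _ in threads:
    frontier.put(None)               # one sentinel per worker
for t in threads:
    t.join()
```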
DISTRIBUTED WEB CRAWLING

- A distributed computing technique whereby search engines employ many computers to index the Internet via web crawling.
- The idea is to spread the required computation and bandwidth across many computers and networks.
- Types of distributed web crawling:
  1. Dynamic assignment
  2. Static assignment
DYNAMIC ASSIGNMENT

- With dynamic assignment, a central server assigns new URLs to the different crawlers dynamically. This allows the central server to balance the load of each crawler.
- Configurations of crawling architectures with dynamic assignment:
  - A small crawler configuration, in which there is a central DNS resolver and central queues per Web site, and distributed downloaders.
  - A large crawler configuration, in which the DNS resolver and the queues are also distributed.
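A toy sketch of the central-server idea is given below: the dispatcher hands each newly discovered URL to whichever crawler currently has the least work queued, which is one simple way to balance load. The class and crawler names are hypothetical.

```python
# Toy central dispatcher for dynamic assignment (illustrative sketch).
import heapq

class CentralDispatcher:
    def __init__(self, crawler_ids):
        # heap of (queued_urls, crawler_id): the least-loaded crawler pops first
        self.load = [(0, cid) for cid in crawler_ids]
        heapq.heapify(self.load)

    def assign(self, url):
        queued, crawler_id = heapq.heappop(self.load)
        heapq.heappush(self.load, (queued + 1, crawler_id))
        return crawler_id          # in a real system: send the URL to this crawler

dispatcher = CentralDispatcher(["c0", "c1", "c2"])
for url in ["http://a.example/", "http://b.example/", "http://c.example/x"]:
    print(url, "->", dispatcher.assign(url))
```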
STATIC ASSIGNMENT

- Here a fixed rule is stated from the beginning of the crawl that defines how to assign new URLs to the crawlers.
- A hashing function can be used to transform URLs into a number that corresponds to the index of the corresponding crawling process.
- To reduce the overhead of exchanging URLs between crawling processes when links switch from one website to another, the exchange should be done in batches.
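The sketch below illustrates this rule: a hash of the URL's host selects the owning crawler, and URLs belonging to other crawlers are buffered and exchanged in batches rather than one by one. Hashing the host (so a whole site stays with one crawler), the crawler count, and the batch size are illustrative assumptions.

```python
# Static assignment by hashing, with batched URL exchange (illustrative sketch).
import hashlib
from collections import defaultdict
from urllib.parse import urlparse

NUM_CRAWLERS = 4
BATCH_SIZE = 100

def owner(url: str) -> int:
    """Map the URL's host to the index of the crawling process that owns it."""
    host = urlparse(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_CRAWLERS

outgoing = defaultdict(list)     # URLs discovered that belong to other crawlers

def flush(target: int):
    batch, outgoing[target] = outgoing[target], []
    # in a real system: send `batch` to crawler `target` over the network

def route(url: str, my_id: int):
    target = owner(url)
    if target == my_id:
        return url               # keep it on the local frontier
    outgoing[target].append(url)
    if len(outgoing[target]) >= BATCH_SIZE:
        flush(target)            # exchange in batches to cut communication overhead
    return None
```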
FOCUSED CRAWLERS

- Focused crawling was first introduced by Chakrabarti.
- A focused crawler ideally would like to download only web pages that are relevant to a particular topic and avoid downloading all others.
- It assumes that some labeled examples of relevant and non-relevant pages are available.
- A focused crawler predicts the probability that a link leads to a relevant page before actually downloading the page. A possible predictor is the anchor text of links (sketched below).
- In another approach, the relevance of a page is determined after downloading its content. Relevant pages are sent for content indexing and their contained URLs are added to the crawl frontier; pages that fall below a relevance threshold are discarded.
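A toy version of the anchor-text predictor is sketched below: links whose anchor text shares words with a set of topic keywords score higher, and only links above a threshold enter the frontier. The keyword list, scoring rule, and threshold are illustrative assumptions, not Chakrabarti's method.

```python
# Toy anchor-text relevance predictor for a focused crawler (illustrative sketch).
TOPIC_KEYWORDS = {"crawler", "spider", "search", "index"}
THRESHOLD = 0.25

def relevance(anchor_text: str) -> float:
    """Fraction of anchor-text words that belong to the topic vocabulary."""
    words = set(anchor_text.lower().split())
    if not words:
        return 0.0
    return len(words & TOPIC_KEYWORDS) / len(words)

links = {
    "http://example.com/crawler-tutorial": "web crawler tutorial",
    "http://example.com/recipes": "best pasta recipes",
}
frontier = [url for url, anchor in links.items() if relevance(anchor) >= THRESHOLD]
print(frontier)   # only the crawler-related link survives
```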
EXAMPLES OF WEB CRAWLERS

- Yahoo! Slurp: Yahoo Search's crawler.
- Msnbot: Microsoft's Bing web crawler.
- FAST Crawler: distributed crawler used by Fast Search & Transfer.
- Googlebot: Google's web crawler.
- WebCrawler: used to build the first publicly available full-text index of a subset of the Web.
- World Wide Web Worm: used to build a simple index of document titles and URLs.
- Web Fountain: distributed, modular crawler written in C++.
CONCLUSION

- Web crawlers are an important component of search engines.
- High-performance web crawling processes are the basic components of various Web services.
- It is not a trivial matter to set up such systems:
  1. The data manipulated by these crawlers cover a wide area.
  2. It is crucial to preserve a good balance between random-access memory and disk accesses.
