
WEB CRAWLER

A SEMINAR REPORT

Submitted by

AANCHAL GARG

In partial fulfillment of the requirements for the award of the degree


Of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE & ENGINEERING

At

Apeejay College of Engineering


SOHNA
ON

17TH FEBRUARY 2010


ABSTRACT

Today's search engines rely on specialized agents known as Web crawlers
(download robots) that crawl large volumes of Web content online. This content
is then analyzed, indexed and made available to users. Crawlers interact with
thousands of Web servers over periods extending from a few weeks to several
years. A crawling process of this kind must therefore meet certain criteria,
such as robustness, extensibility and maintainability. Such crawlers visit
several thousand pages every second, include a high-performance fault manager,
may be platform independent or platform dependent, and are able to adapt
transparently to a wide range of configurations without incurring additional
hardware expenditure. In this report I provide details of the various crawling
strategies, the crawling policies and the web crawling process, including the
crawler architecture. Finally, I discuss some further aspects of crawling,
namely distributed and focused crawling.


TABLE OF CONTENTS

CHAPTER NO.   TITLE

ABSTRACT
ACKNOWLEDGMENT

1. WEB CRAWLER
   1.1 Introduction
   1.2 Prerequisites of a Crawling System
   1.3 General Crawling Strategies
   1.4 Crawling Policies
       1.4.1 Selection Policy
       1.4.2 Re-Visit Policy
       1.4.3 Politeness Policy
       1.4.4 Parallelization Policy

2. WEB CRAWLING PROCESS
   2.1 Web Crawler Architecture
   2.2 URL Normalization
   2.3 Crawler Identification
   2.4 Examples of Web Crawlers

3. SOME MORE ABOUT CRAWLING
   3.1 Distributed Web Crawling
       3.1.1 Types
   3.2 Focused Crawlers
       3.2.1 Strategies

4. REFERENCES

ACKNOWLEDGMENT

With great pleasure and pride, I take this opportunity to express my gratitude
and thanks to my respected guide and teacher, Mr. Lalit Goel, who has been a
continuous source of inspiration and without whose help the completion of this
report would have been impossible.

AANCHAL GARG
061002
