
INT. J. NEW. INN., 2012, 1(1), 105-110

ISSN: 2277-4459

A REVIEW OF WEB-CRAWLER AND P2P OVERLAY NETWORKS


Praveen Kumar1, Arpit Kumar2
1 Department of Computer Science & Engineering, Govt. Polytechnic, Lisana (Rewari), Distt. Rewari - 123401, Haryana, India. E-mail: praveensatija@gmail.com
2 Department of Computer Science & Engineering, D.A.V. College of Engineering & Technology, Kanina, Mohindergarh - 123027, Haryana, India. E-mail: arpitrewari@gmail.com

ABSTRACT
Web crawlers are the heart of search engines. They continuously crawl the web to discover newly added pages as well as pages that have been removed. Due to the growing and dynamic nature of the web, it has become a challenge to traverse all the URLs found in web documents and to handle them efficiently. In this paper, we review the challenges and issues faced in using a single type of crawler. We also explore the concept of exploiting network proximity in peer-to-peer overlay networks [11]. Peer-to-peer overlay networks offer a novel platform for a variety of scalable and decentralized distributed applications. These systems provide efficient and fault-tolerant routing, object location and load balancing within a self-organizing overlay network.

KEYWORDS: Crawler; URI; P2P; HTML; HTTP

1. INTRODUCTION
The World Wide Web [1] is a global, read-write information space. Text documents, images, multimedia and many other items of information, referred to as resources, are identified by short, unique, global identifiers called Uniform Resource Identifiers (URIs) so that each can be found, accessed and cross-referenced in the simplest possible way. The web is a client-server architecture that allows a user to initiate a search by providing a keyword and some optional additional information to a search engine, which in turn collects and returns the required web pages from the Internet. Web crawlers are software programs that traverse the World Wide Web information space by following the hypertext links extracted from hypertext documents. Since a crawler identifies a document by its URL, it picks up a seed URL and downloads the corresponding robots.txt file, which contains download permissions and information about the files that the crawler should exclude. On the basis of the host protocol, it then downloads the document, whether from the same machine or from one on the other side of the world. When a user accesses a web page through its URL, the document is transferred to the client machine using the Hyper Text Transfer Protocol (HTTP). The browser interprets the document and makes it available to the user, who follows the links in the presented page to access other pages. A crawler often has to download hundreds of millions of pages in a short period of time and has to constantly monitor and refresh the downloaded pages. In addition, the crawler should avoid putting too much pressure on the visited web sites and on its own local network, because these are intrinsically shared resources.

Web crawlers utilize the graph structure of the web to move from one page to another. A crawler stores a web page and then extracts the URLs appearing in that page. The same process is repeated for all web pages whose URLs have been extracted from earlier pages. For this, a queue data structure is used: all encountered URLs are put in the queue, and the process is repeated until the queue is empty or the crawler decides to stop. The key purpose of designing web crawlers is to retrieve web pages and add them to a local repository.
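The queue-driven traversal just described can be sketched in a few lines of code. The following is a minimal Python sketch using only the standard library; the seed URL, user-agent string, page limit and the restriction of the crawl to the seed host are illustrative assumptions of ours, not details specified in this paper.

```python
# Minimal sketch of a queue-based (breadth-first) crawler, assuming an
# illustrative seed URL, user agent and page limit (not from the paper).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50, user_agent="ReviewCrawler"):
    """Dequeue a URL, fetch it over HTTP, store it in the local repository,
    extract its links, and enqueue any URLs not seen before."""
    # Honour the robots.txt of the seed host before fetching anything.
    robots = RobotFileParser()
    robots.set_url(urljoin(seed_url, "/robots.txt"))
    robots.read()
    seed_host = urlparse(seed_url).netloc  # stay on one host so robots.txt applies

    queue = deque([seed_url])   # frontier of URLs still to visit
    seen = {seed_url}           # avoids enqueuing the same page twice
    repository = {}             # local repository: URL -> HTML text

    while queue and len(repository) < max_pages:
        url = queue.popleft()
        if not robots.can_fetch(user_agent, url):
            continue            # excluded by the site's robots.txt
        try:
            with urlopen(url, timeout=10) as response:
                page = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue            # skip unreachable or broken pages
        repository[url] = page

        extractor = LinkExtractor()
        extractor.feed(page)
        for link in extractor.links:
            absolute = urljoin(url, link)
            parts = urlparse(absolute)
            if parts.scheme in ("http", "https") and parts.netloc == seed_host \
                    and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return repository


if __name__ == "__main__":
    pages = crawl("https://example.com/")
    print(f"Crawled {len(pages)} pages")
```

A production crawler would additionally spread requests over time per host, refresh stale pages, and check robots.txt separately for every host it visits; the sketch only illustrates the seed-URL, robots.txt and queue mechanics described above.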

