
A Scalable, Distributed Web-Crawler*

Ankit Jain, Abhishek Singh, Ling Liu


Technical Report GIT-CC-03-08
College of Computing
Atlanta, Georgia
{ankit,abhi,lingliu}@cc.gatech.edu

* The manuscript is still under progress.
In this paper we present the design and implementation of a scalable, distributed web crawler. The
motivation for designing such a system is to effectively distribute crawling tasks to different machines in a
peer-to-peer distributed network. Such an architecture leads to scalability and helps tame the exponential
growth of the crawl space in the World Wide Web. Through experiments on a prototype implementation
of the system we show its scalability and efficiency.

1. Introduction

A web crawler forms an integral part of any search engine. The basic task of a crawler is to
fetch pages, parse them to obtain more URLs, and then fetch these URLs to obtain even more
URLs. In the process the crawler can also log the fetched pages or perform several other
operations on them, according to the requirements of the search engine. Most of these auxiliary
tasks are orthogonal to the design of the crawler itself. The explosive growth of the web has
rendered the simple task of crawling the web non-trivial. With this rapid increase in the
search space, crawling the web is becoming more difficult day by day. But all is not lost:
newer computational models are being introduced to make resource-intensive tasks more
manageable, and the price of computing is decreasing monotonically. It has now become very
economical to use several cheap computation units in a distributed fashion to achieve high
throughput. The challenge in using such a distributed model is to distribute the computation
tasks efficiently while avoiding the overheads of synchronization and consistency maintenance.
Scalability is also an important requirement for such a model to be usable. In this project, the
architecture of a scalable, distributed web crawler has been proposed and implemented. It is
designed to make use of cheap resources and tries to remove some of the bottlenecks of present
crawlers in a novel way. For the sake of simplicity and focus, we only worked on the crawling
part of the crawler, logging only the URLs; other functions can easily be integrated into the design.

Section 2 describes the salient features of our design. Section 3 gives an overview of the
proposed architecture. Section 4 goes into the details of a crawler entity in our architecture.
In Section 5 we explain the probabilistic hybrid search model. In Section 6 we briefly describe
the implementation of our system. Section 7 discusses the experimental results and
their interpretation. In later sections we present our conclusions and describe the learning
experience gained during this project.

2. Salient features of the design

Our major objectives while designing the crawler were:


• Increased resource utilization (by multithreaded programming to increase concurrency)
• Effective distribution of crawling tasks with no central bottleneck
• Easy portability
• Limiting the request load for all the web servers
• Configurability of the crawling tasks



Besides catering to these capabilities, our design also includes a probabilistic hybrid search
model. This is done using a probabilistic hybrid of the stack and queue ADTs (Abstract Data
Types) for maintaining the pending URL lists. Details of the probabilistic hybrid model are
presented in Section 5. The crawler is a peer-to-peer distributed crawler with no central
entity. By using a distributed crawling model we overcome bottlenecks such as:
• Network throughput
• Processing capability
• Database capability
• Storage capability
A database capability bottleneck is avoided by dividing the URL space into disjoint subsets, each
of which is handled by a separate crawler. Each crawler parses and logs only the URLs that
lie in its URL subset, and forwards the rest of the URLs to the corresponding crawler entities.
Each crawler has prior knowledge of a lookup table relating each URL subset to the
[IP:PORT] combination identifying the corresponding crawler entity, as illustrated in the sketch below.
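
As a rough illustration, the following minimal Java sketch (not taken from our implementation) shows how a crawler entity could map a URL to the peer responsible for it. The class name, the table layout and the use of String.hashCode are assumptions made for the example; the design only requires that a hash of the URL's domain name index a hash-to-[IP:PORT] table.

import java.net.InetSocketAddress;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Map;

// Sketch of the URL-space partition: a domain-name hash selects the subset,
// and the subset selects the [IP:PORT] of the owning crawler entity.
public class UrlSpacePartition {
    private final Map<Integer, InetSocketAddress> subsetToPeer; // subset id -> [IP:PORT]
    private final int numSubsets;                               // e.g. 4 crawler entities

    public UrlSpacePartition(Map<Integer, InetSocketAddress> subsetToPeer) {
        this.subsetToPeer = subsetToPeer;
        this.numSubsets = subsetToPeer.size();
    }

    // Hash the domain name and reduce it modulo the number of subsets
    // (the design uses the last few bits of the domain-name hash).
    public int subsetOf(String url) throws MalformedURLException {
        String host = new URL(url).getHost();
        return Math.floorMod(host.hashCode(), numSubsets);
    }

    // Returns null when the URL belongs to the local entity.
    public InetSocketAddress peerFor(String url, int localSubset) throws MalformedURLException {
        int subset = subsetOf(url);
        return subset == localSubset ? null : subsetToPeer.get(subset);
    }
}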

3. Distributed crawler

The crawler system consists of a number of crawler entities, which run on distributed sites
and interact in a peer-to-peer fashion. Each crawler entity has knowledge of its URL
subset, as well as of the mapping from URL subsets to the network addresses of the corresponding
peer crawler entities. Whenever a crawler entity encounters a URL from a different URL subset,
it forwards the URL to the appropriate peer crawler entity based on this lookup. Each crawler
entity maintains its own database, which stores only the URLs from the URL subset assigned to
that particular entity. The databases are disjoint and can be combined offline when the
crawling task is complete. A minimal sketch of this forwarding step is shown below.
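
The forwarding step itself can be kept very simple. The sketch below assumes the UDP transport mentioned in Section 6; the class and method names are illustrative and not those of the actual Sender and Receiver classes.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Sketch of peer-to-peer URL exchange: a URL that belongs to another entity's
// subset is sent to that entity as a small UDP packet.
public class UrlForwarder {
    private final DatagramSocket socket;

    public UrlForwarder() throws Exception {
        this.socket = new DatagramSocket();
    }

    // Send a URL to the peer entity that owns its subset.
    public void forward(String url, InetSocketAddress peer) throws Exception {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        socket.send(new DatagramPacket(data, data.length, peer));
    }

    // Blocking receive, as used by the receiver thread of each entity.
    public static String receiveOne(DatagramSocket receiverSocket) throws Exception {
        byte[] buf = new byte[2048];
        DatagramPacket packet = new DatagramPacket(buf, buf.length);
        receiverSocket.receive(packet);
        return new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
    }
}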



4. Crawler Entity

Each crawler entity consists of several crawler threads, a URL handling thread, a URL
packet dispatcher thread and a URL packet receiver thread. The URL set assigned to each
crawler entity is further divided into subsets, one for each crawler thread. Each crawler
thread has its own pending URL list. A thread picks an element from its pending URL list,
generates an HTTP fetch request, gets the page, parses the page to extract any URLs in it and
finally puts them in the job pending queue of the URL handling thread.

During initialization the URL handling thread reads the hash to [IP:PORT] mapping. It also has
a job queue. This thread gets a URL from the job queue and checks whether the URL belongs to
the URL set corresponding to the crawler entity. It does so based on the last few bits of the
hash of the domain name in the URL, in conjunction with the hash to [IP:PORT] mapping. If the
URL belongs to another entity, the handling thread puts the URL on the dispatcher queue and gets
a new URL from its job queue. If the URL belongs to its own set, it first checks the URL-seen
cache; if that test fails it queries the URL database to check whether the URL has been seen, and
then puts the URL into the URL database. It then puts the URL into the pending URL list of one
of the crawler threads.

URLs are assigned to a crawler thread based on domain names. Each domain name is serviced by
only one thread; hence only one connection is maintained with any given server. This makes sure
that the crawler does not overload a slow server. A different hash is used for distributing jobs
among the crawler threads than for determining the URL subset. The objective is to isolate the
two operations so that there is no correlation between the entity a URL is assigned to and the
thread it is assigned to within that entity, thus balancing the load evenly across the threads.
The decision to divide the URL space on the basis of domain names was based on the observation
that many pages on the web tend to link to pages under the same domain name. Hence, if all URLs
with a particular domain name lie in the same URL subset, these URLs need not be forwarded to
other crawler entities. This scheme therefore provides an effective strategy to divide the crawl
task between the different peer-to-peer nodes of this distributed system. We validate this
argument in our experiments described in Section 7.

The URL dispatcher thread communicates URLs to their corresponding crawler entities. The URL
receiver thread collects the URLs received from other crawler entities, i.e. communicated via
the dispatcher threads of those entities, and puts them on the job queue of the URL handling
thread. A sketch of the handling thread's decision loop is given below.
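
The following sketch condenses the handling thread's decision loop described above. The collection types stand in for the synchronized job queues, the URL-seen cache and the MySQL-backed URL database of the real system, and the salted second hash is only one possible way to obtain a hash independent of the subset hash.

import java.net.URL;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Sketch of one iteration of the URL handling thread's loop.
public class HandlerLoopSketch {
    public static void handle(String url,
                              int localSubset, int numEntities,
                              Queue<String> dispatcherQueue,        // URLs bound for peer entities
                              Set<String> urlSeenCache,             // stand-in for the URL-seen cache
                              Set<String> urlDatabase,              // stand-in for the URL database
                              List<Queue<String>> threadPendingLists) throws Exception {
        String host = new URL(url).getHost();

        // Hash #1: the domain-name hash (reduced modulo the number of entities)
        // decides which crawler entity owns the URL.
        if (Math.floorMod(host.hashCode(), numEntities) != localSubset) {
            dispatcherQueue.add(url);          // forward to the owning peer
            return;
        }

        // Local URL: consult the URL-seen cache first, then the database.
        if (urlSeenCache.contains(url) || !urlDatabase.add(url)) {
            return;                            // already seen, drop it
        }
        urlSeenCache.add(url);

        // Hash #2: an independent (here, salted) hash assigns all URLs of a domain
        // to one crawler thread, so only one connection is open to a given server.
        int thread = Math.floorMod((host + "#thread").hashCode(), threadPendingLists.size());
        threadPendingLists.get(thread).add(url);
    }
}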

5. Probabilistic Search Model

We use a search model that can be configured to behave as DFS, BFS or a hybrid of both. It
can be configured to behave as DFS a given percentage of the times, and behave as BFS the
rest of the times. We use a probabilistic hybrid of stack and queue abstract data types to
store the list of pending URL’s. DFS can be modeled by using a Stack to store the URL
pending list. In a stack last in first out order is maintained for element lists. In a stack we
push elements and pop elements from the same end of a list. Similarly BFS can be modeled
by using a queue to store the URL pending list, which maintains the First In First Out order
for list elements This is achieved by pushing the elements into one end of a list and popping
elements from the other. In short if we pop and push elements from the same end we get
DFS and if we pop and push from different ends we get BFS. We use above fact to obtain a
hybrid of DFS and BFS. We push elements from one end of the list, and pop elements from
the same end of list with probability ‘p’ and from the other end of this list with probability 1-
p. Now if p =1, then the system will behave as DFS, if p = 0, the systems will behave as BFS.
For p anywhere between 0 and 1, the system will behave as DFS p*100% times and BFS rest
of the times. Each time we need to pop an element we decide with probability p, whether to
get the element from the top of the list or the bottom.
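
A minimal sketch of this hybrid pending list, built on a double-ended queue, is given below. It is not the synchronized JobQueue class of our implementation; it only illustrates the probabilistic choice of the pop end.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

// Probabilistic hybrid of a stack and a queue: push always at the head,
// pop from the head with probability p (DFS) or from the tail otherwise (BFS).
public class HybridPendingList<T> {
    private final Deque<T> deque = new ArrayDeque<>();
    private final Random random = new Random();
    private final double p; // probability of popping from the push end

    public HybridPendingList(double p) {
        this.p = p;
    }

    public void push(T element) {
        deque.addFirst(element);
    }

    public T pop() {
        if (deque.isEmpty()) {
            return null;
        }
        // p = 1 gives pure DFS behavior, p = 0 gives pure BFS behavior.
        return random.nextDouble() < p ? deque.pollFirst() : deque.pollLast();
    }
}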



By varying the value of p the search characteristics change; this affects the cache-hit ratio
and the coverage of the search. We intended to find the optimum value of p, which yields the
highest cache-hit rates for both the DNS cache and the URL-seen cache. Such a study could
lead to a significant improvement in crawler performance. We have implemented this hybrid
structure, but due to time constraints we could not carry out this study within this project.

6. The Implementation

The system was implemented on the Java platform for portability reasons. MySQL was used for
the URL database. Even though Java is less efficient than languages that compile to native
machine code, and none of the team members were proficient with it, we selected Java for this
prototype. The reasons behind this decision were to keep the software architecture modular, to
make the system portable, and to manage the complexity of such a system. In retrospect this
turned out to be a good decision, as we might not have been able to complete this project in
time had we implemented it in another language such as C. The comprehensive libraries provided
with Java allowed us to concentrate our efforts on the design of the system and its software
architecture.

A Java class was written for each of the various components of the system (i.e. the different
kinds of threads, the database, synchronized job queues, caches, etc.). First we wrote generic
classes for the infrastructure components of the system, such as the synchronized job queues
and the caches. The LRUCache class implements an approximate LRU cache based on a hash table
with overlapping buckets. The JobQueue class implements a generic synchronized job queue with
an option for the probabilistic hybrid of the stack and queue ADTs. The main Crawler class
performs the initialization by reading the configuration files, spawning the various threads
accordingly and initializing the various job queues; it then behaves as the handler thread. A
class named CrawlerThread performs the operation of a crawler thread. This thread simply gets a
URL from its job queue and passes it to the URLlist class. The URLlist class then spawns a new
thread that fetches the page, parses it for URL links and returns the list of these URLs back
to the CrawlerThread.

In Java the URL fetch operation is not guaranteed to return, and in the case of a malicious web
server the whole thread could hang waiting for the operation to complete. This is the reason
why the URLlist class spawns a new thread every time to fetch a URL. The thread is given a
certain time-out; hence, if the URL fetch operation is not completed in time the thread is
abandoned after the time-out and normal operation resumes (a minimal sketch of this guard is
given at the end of this section). Spawning a new thread to fetch each page does put extra
overhead on the operation, but it is essential for the robustness of the system. The Sender and
Receiver classes implement the sender and receiver threads respectively. The Receiver class
opens a UDP socket at a pre-determined port and waits for packets. The Sender class transmits
the URLs via UDP packets to the appropriate remote nodes. Besides the classes that form the
system architecture described above, we added a Probe thread and a Measurement class. The
relevant classes report the appropriate measurements to the Measurement class, and the Probe
thread messages the Measurement class to output the measurements at configurable periodic time
intervals.
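
The time-out guard around the page fetch can be sketched as follows. This is an illustration of the idea rather than the URLlist class itself; the page is read through a plain URL stream and the caller waits at most a fixed time before abandoning the worker thread.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a time-out-guarded page fetch: the fetch runs in its own thread
// and the caller gives up after timeoutMillis if the server never responds.
public class TimedFetch {
    public static String fetch(String url, long timeoutMillis) {
        AtomicReference<String> result = new AtomicReference<>(null);
        Thread worker = new Thread(() -> {
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) {
                    page.append(line).append('\n');
                }
                result.set(page.toString());
            } catch (Exception e) {
                // A failed or malformed fetch simply yields no page.
            }
        });
        worker.setDaemon(true);      // do not keep the JVM alive for a hung fetch
        worker.start();
        try {
            worker.join(timeoutMillis);  // wait, but only up to the time-out
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result.get();         // null if the fetch did not finish in time
    }
}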

7. Evaluation and Results

We performed experiments to evaluate the performance and scalability of the system. Our
experimental setup consisted of four Sun Ultra-30 machines. One crawler entity ran on each
of these machines, and each entity was configured with 12 crawler threads. During the design
we decided to store all the queues in memory, as the cost of memory is low and many cheap
computer systems come equipped with 2 GB of RAM. Our program should never require more memory
than that: 2 GB of RAM can accommodate about 20 million URLs in the various queues of each
entity, and we do not expect the queue size of any particular node to exceed 20 million when
the URL space is divided over several nodes. Unfortunately, we could not arrange such machines
for our experiments. Instead we ran our experiments on machines with only 128 MB of RAM, with
even less memory available to our process.

In our experiment [Figure 2] we faced problems due to the unavailability of the required memory
space. The nodes failed after memory overflow; the arrows in the graph of Figure 2 depict node
failures. The first of the systems went down after about 12 minutes due to memory overflow. By
this time the system had crawled about 54,465 pages, giving a throughput of 75.65 documents per
second. The second node went down after about twenty minutes; the throughput at this time was
62.37 pages per second. The third and fourth nodes went down at about 57 minutes, with a
throughput of about 31.4 pages per second. This result, although not straightforward to
interpret due to the failure of the nodes, is still very promising. At about 74 documents per
second, one billion pages can be crawled in less than six months. Of course, tests on machines
with the required amount of memory need to be performed to corroborate this throughput.



In Figure 3 we show the queue sizes and the number of pages crawled for one of the 4 systems in
the above experiment. As seen in the graph, the number of pages crawled grows fairly linearly,
indicating an almost constant throughput throughout the run. This graph also justifies our
decision to keep one handler thread per crawler entity: except for a few temporary bursts, the
handler thread queue length stays fairly low. Thus it can be inferred that one handler thread
is enough to quickly execute its functions even for multiple crawler threads. The worker queue
length identifies the culprit for the memory requirement: it increases at a rate much higher
than the rate at which pages are crawled. To study the scalability of the system we compute the
scalability factor of the system with 4 nodes:

Scalability Factor = (Throughput with 4 nodes working together) / (Throughput with 4 nodes working independently)

We calculated the scalability factor after the first ten minutes of execution of the crawler and
found its value to be 97.9%. This figure shows extremely good scalability, as the system exhibits
only about a 2% overhead for distributing the task, i.e. the distribution of the task was fairly
effective. We also measured the number of URLs that needed to be forwarded to other peer nodes.
For this purpose we introduce the distribution factor:

Distribution Factor = (Number of local pages found) / (Number of pages found)

Here local pages are pages that belong to the same URL subset as their parents. These local
pages are not forwarded to other peer nodes and hence do not generate network traffic. Needless
to say, the higher the distribution factor the better, as a high value indicates an effective
distribution of the crawl space. If the web were a random hyperlink structure, each discovered
page would fall into its parent's subset with probability 1/4, so the expected value of the
distribution factor would be 25% in our case of 4 nodes. In our experiments we found the
distribution factor to be 65% (averaged over the more than 100,000 pages the system crawled).
This again validates our claim that dividing the URLs into subsets based on domain names and
then assigning a URL subset to each node is an effective distribution of the crawl task.
Our next experiment aims to explore the level of concurrency attainable within one crawler
entity of the system. In this experiment we use only one node and measure the performance of
the system at the end of 5 minutes while varying the number of crawler threads.



The graph in Figure 4, although not very smooth, provides a clear indication of increased
throughput as the number of crawler threads increases, validating our claim of increased
resource utilization. Beyond about 48 threads the throughput starts to decrease because of the
synchronization overheads of the system. The graph suggests that around 32 to 48 crawler
threads per crawler entity may provide optimum performance. In these experiments a single node
achieved a throughput of 32 documents per second, again a very promising figure in terms of
system performance.

8. Contributions of this project

The biggest contribution of this project is the concept of distributing crawl tasks based on
disjoint subsets of the URL crawl space. We also presented a scalable, multi-threaded, peer-to-peer
distributed architecture for a web crawler based on this concept. Another interesting
contribution of the project is the proposed probabilistic hybrid of depth-first and
breadth-first traversal, although we were unable to study its advantages or disadvantages
during this project. This traversal strategy can be used to achieve a hybrid of the two
traditional strategies without any extra book-keeping and is very easy to implement. We also
implemented a complete web crawler that demonstrates all of the above concepts.

9. The Learning experience

The foundations of this project were laid in discussions on web crawlers and the challenges
that lie in their design. Since the web space is growing exponentially, our proposed solution
had to be scalable; it should be capable of making good use of a cluster of computers rather
than depending on a single large-capacity machine. Discussions about the DFS and BFS navigation
strategies for efficient crawling of the web prompted us to experiment with a probabilistic
navigation strategy. Several papers referred to in class, especially [2], also gave us insights
into the design and implementation of such a system.

The design and implementation of this project was a very fruitful hands-on experience, and it
turned out to be a very good design exercise in which we had to deal with real-world system
issues. The project was initiated by a task-distribution idea, but to demonstrate the usefulness
of such a concept we had to design a whole system that exploited this idea in its architecture.
While designing the architecture we faced the challenges associated with Internet applications,
along with the related decisions and trade-offs. Thus this project covered designing an Internet
application from scratch: from design principles to designing a system architecture, and then
the implementation and evaluation of the system. We implemented the whole project on our own,
using the Java libraries, which was in itself very useful. Due to the nature of Internet
applications such as this one, it is always important to emphasize efficient implementation as
well as portability of the system. This system also included components from various domains:
we implemented a multi-threaded architecture, synchronized job queues, LRU caches, crawlers,
other networking components, database query components, etc. Even though we had studied these
components before, implementing them in this project gave us insights into their implementation
issues. Besides implementing these components, we also gained experience in integrating them to
make the whole system work. During the evaluation we dealt with designing and executing
experiments to validate our claims, which also provided us insights into the proper
interpretation of experimental results and the logical derivation of conclusions. Throughout
this project we experienced the fact mentioned in class that "developing a WebCrawler is easy,
but developing an efficient WebCrawler is very difficult".



10. Future extensions
Future extensions of the project include implementing the DNS cache in the crawler thread and
studying the effect of the hybrid traversal strategy on the various cache-hit rates. A number
of issues need to be dealt with to make this system usable in the real world. The crawler needs
to conform to the robots exclusion protocol. We also need to handle partial failure: although
at present the failure of one node will not stop the other components, it would be desirable
for another node to take over the task of the node that failed. Dynamic reconfiguration and
dynamic load-balancing would also be desirable.

11. Related work

The Google [1] web crawler is written in Python, is single-threaded, and uses asynchronous I/O
to fetch data from several concurrent connections. The crawler transmits downloaded pages to a
single Store Server process, which compresses the data and stores it in a repository. Another
famous web crawler is Mercator [2], a multithreaded web crawler written in Java. Although
Mercator is not distributed, it divides the URL space in a manner similar to our design to
guarantee that only one thread contacts a given server. We do not deal with storing web pages
or with indexing in this project. Our architecture is both distributed and multithreaded; in
this way we increase concurrency on a single machine as well as across the entire system of
several computers. We also have a distributed database with no central bottleneck, and we make
use of a probabilistic search model for crawling web pages. This combination of features,
improved resource utilization and scalability distinguishes our work from related previous
work.

12. Conclusion

All in all, the performance results of the crawler are very promising. We achieved a throughput
of 75 documents per second. This is an encouraging result, as at even 31.7 pages per second one
billion documents can be crawled in one year. We have also validated our claims of scalability
and improved resource utilization with the experimental results. Although the results are
encouraging, more tests need to be conducted to determine whether such a system can be really
useful in real-world situations.

References

[1] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine.
In Proceedings of the 7th International World Wide Web Conference, pages 107-117, April 1998.
[2] Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. Compaq Systems
Research Center.
[3] Class notes.

