
A Web Crawler for Automated Location of Genomic Sequences

By

Steven Mayocchi

Department of Computer Systems and Electrical Engineering, University of Queensland

Submitted for the Degree of

Bachelor of Engineering (Honours)


in the division of

Computer Systems
on the

19th October 2001.

6 Carolyn Place
FERNY GROVE QLD 4055
Tel. (07) 33513670

19 October 2001

Professor Simon Kaplan
Head of School of Information Technology and Electrical Engineering
University of Queensland
ST LUCIA QLD 4072

Dear Professor Kaplan,

In accordance with the requirements of the degree of Bachelor of Engineering (Honours) in the division of Computer Systems Engineering, I present the following Thesis entitled A Web Crawler for Automated Location of Genomic Resources. This work was completed under the supervision of Dr Mark Ragan.

I declare that the work submitted is my own, except as acknowledged in the text, and has not been previously submitted for a degree at the University of Queensland or any other institution.

Yours Sincerely,

Steven Mayocchi.

Acknowledgments
I would like to acknowledge my supervisor, Dr Mark Ragan, for providing me with guidance throughout this project.

I would also like to thank Ms Thoa Nguyen, Ms Deanne Schloss, Mrs Sally Mayocchi and Mr Gordon Mayocchi for proofreading this document.

Finally, I would like to thank my family and friends for all their support throughout the last year.


Abstract
The purpose of this project is to develop a software package that will locate, select and download files containing Genomic Resources from the Internet and store them locally. The Genomic Resources to be obtained are genetic sequence files, which are available for download on the Internet. The software is also intended to ensure that the local copy of the files is kept up to date. The final software package should also be fully automated.

A software package has been written in Perl to allow users to perform all the operations discussed above. The package has been named the Sequence Locator, or SeqLoc for short. The software implements a Web Crawler to perform the search operation on the Internet and uses various other modules from the Perl libraries to download the files and maintain the local listing of sequence files. The software developed is not fully automated, due to limitations on the amount of information that can be collected about a particular file when examining it at a remote web site.

It has been concluded that the SeqLoc software package successfully performs the majority of the operations it is required to perform. The remaining problems are that it is not fully automated and that it has not yet been exhaustively tested.


Table of Contents

Acknowledgements
Abstract
List of Figures

1. Introduction
   1.1 Purpose of Project
   1.2 Sequence Locator
   1.3 Remainder of Thesis

2. Background
   2.1 Genomic Resources
   2.2 Web Crawlers
   2.3 Robots Exclusion Protocol
   2.4 Guidelines for Creating Web Robots
   2.5 A Brief Introduction to the Internet
       2.5.1 Hyper Text Markup Language (HTML)
       2.5.2 Hypertext Transfer Protocol (HTTP)

3. Specifications of the Project
   3.1 Objectives
   3.2 Programming Language Requirements
   3.3 Modules of the Project
       3.3.1 File Search
       3.3.2 File Selection
       3.3.3 File Updating
       3.3.4 File Download
       3.3.5 Graphical User Interface

4. Software Implementation
   4.1 Language Selection
   4.2 File Search Module
       4.2.1 The Basic Algorithm
       4.2.2 Understanding the Basic Algorithm
       4.2.3 Problems with the Basic Version
       4.2.4 The Improved Algorithm
       4.2.5 Testing the Improved Algorithm
   4.3 File Selection Module
   4.4 File Updating
       4.4.1 Initial Design
       4.4.2 Problems with the Initial Design
       4.4.3 Improved Design
   4.5 File Download
   4.6 Graphical User Interface
       4.6.1 Functions
       4.6.2 Options

5. Software Evaluation
   5.1 Software Evaluation Techniques
       5.1.1 Local Interfacing Test 1
       5.1.2 Local Interfacing Test 2
       5.1.3 Local Interfacing Test 3
       5.1.4 Local Interfacing Test 4
       5.1.5 Local Interfacing Test 5
       5.1.6 Remote HTTP Test
       5.1.7 Remote FTP Test
   5.2 Results
       5.2.1 Local Interfacing Test 1
       5.2.2 Local Interfacing Test 2
       5.2.3 Local Interfacing Test 3
       5.2.4 Local Interfacing Test 4
       5.2.5 Local Interfacing Test 5
       5.2.6 Remote HTTP Test
       5.2.7 Remote FTP Test
   5.3 Discussion of Software
       5.3.1 Discussion of Test Results
       5.3.2 File Location Module
       5.3.3 File Selection Module
       5.3.4 File Updating Module
       5.3.5 File Download Module
       5.3.6 Graphical User Interface
   5.4 Summary

6. Review and Conclusion
   6.1 Review of Project
   6.2 Future Directions
       6.2.1 File Search Module
       6.2.2 File Selection Module
       6.2.3 File Updating Module
       6.2.4 File Download Module
       6.2.5 Graphical User Interface
       6.2.6 Other Applications
   6.3 Conclusion

Appendices
   A. References
   B. Code Listing for SeqLoc.pl
   C. Code Listing for Crawl.pm

List of Figures

Figure 2.1: HTTP Request Types
Figure 4.1: Properties of Perl and Java
Figure 4.2: Pseudo-code for Basic Web Crawler
Figure 4.3: Pseudo-code for Improved Web Crawler
Figure 4.4: Basic Table of Fields for Records
Figure 4.5: Improved Table of Fields for Records
Figure 4.6: Main Screen of GUI
Figure 4.7: File Selection Screen of GUI
Figure 4.8: Directory Selection Screen of GUI
Figure 4.9: Options Menu Commands

Chapter 1: Introduction
1.1 Purpose of Project

The purpose of this project is to develop a software package that will locate and then download Genomic Resources to a local server from the Internet. These Genomic Resources will then be available for further research and processing.

The Genomic Resources that are to be obtained correspond to the genetic sequence information of various species. An example of a very well known sequencing project is the Human Genome Project where the majority of the genetic information for our species has now been determined. As time goes by more and more species are being sequenced and these sequences are then becoming available for download on the Internet.

This software is required because various research projects at the University of Queensland need as much sequence information as possible in order to get the best results. Obtaining the available sequences locally is the logical first step for this research. Having a human perform the essentially repetitive task of downloading all the available sequence files, and then ensuring that the local versions of these files are up to date, would be difficult now and almost impossible in the future as the number of available files grows. For this reason a software solution is considered the best option, as it allows a significant amount of automation of the process of locating and downloading the sequence files.

As already stated, the purpose of this project is to develop a software package that will locate and then download the sequence files available on the Internet. The sequences that will be obtained are to be used in research and should therefore be as accurate and up to date as possible. The sequences available on the Internet can be expected to change in small ways over time as more information is gathered. Therefore the software will also be designed to keep a record of all the files that have been downloaded, and it will ensure that the local version is as up to date as the version that is available for download.

1.2 Sequence Locator

The software package that has been developed has been named the Sequence Locator, or SeqLoc for short. The package uses a graphical user interface that allows the user to first select a set of URLs where the search will begin. The search then occurs and the software locates all the files, selecting those that are of interest based on their file extensions. Once the files have been located the user is required to select where the files should be saved, and the software then saves them to this location. Once a file has been assigned a place in the local file system, that file is downloaded. On any subsequent run, the software checks whether any of the local copies of files are out of date and, if so, downloads them again. In this way the software is able to first locate and then download any sequence files and also ensure that the current sequence files are kept up to date.

1.3 Remainder of Thesis

The next chapter in this document is entitled Background and discusses the background theory required for an understanding of the working of this project.

Chapter 3 is entitled Specification of the Project and it discusses the specifications that were determined at the beginning of this project. These specifications are for the ideal product that would be a result of this project.

The following chapter is entitled Software Implementation and it discusses the procedure of implementing the Sequence Locator software package.

Next there is the chapter Software Evaluation where the software is evaluated as to how well it fulfils the specifications of this project.

The final chapter is entitled Review and Conclusion. In this chapter the entire project is reviewed, future directions for this and other projects are discussed and finally conclusions about the project are reached.

Chapter 2: Background
2.1 Genomic Resources

The Genomic Resources that are of interest in this project are files that are available for download on the Internet and contain the genetic information of species that have had their genome sequenced. The genetic information contained in the cells of every living thing is encoded in the form of Deoxyribonucleic Acid, more commonly known as DNA. DNA is built from four different bases: A, C, T and G. These bases combine in long strands of DNA and encode the genes that are expressed in every species. Current technology makes it possible to extract this DNA from cells and then to determine the sequence of the bases that make up the long strands of DNA for a particular species. The resulting sequences are generally recorded as large text files, in which the characters correspond to the bases that have been determined through experiment.

Although the method of determining the DNA sequences is not important for this project it is important that the sequence data is available as large text files. These large text files are what the software that is being developed is intended to locate and download.

2.2 Web Crawlers

A Web Crawler is a program that traverses the Internet automatically. It does this by retrieving a web page and then recursively searching all the pages that are linked. This process is repeated and if uncontrolled would result in the Web Crawler being capable of travelling across the entire Internet. Other names that Web Crawlers are referred to by are Web Robots, Web Spiders and Web Wanderers.

The most common usage for Web Crawlers is the collection of information from a large number of sites around the World Wide Web. The information that is collected can then be collated in a database. A particularly common example of the use of a Web Crawler is a search engine, where the database generated contains information about the content of web pages across the entire Internet [5].

2.3 Robots Exclusion Protocol

In general the use of Web Crawlers is benign and has no significant effect on a server. However, if a Web Crawler were to make a large number of requests to a server over a short period of time (known as rapid fire) it could overload that server's capacity. It is for this reason that a robots exclusion protocol has been designed [5]. This protocol provides a method whereby a server can prevent a Web Crawler or Robot from accessing that site: the server administrator adds the name of the Crawler to be excluded to a file named robots.txt that is available on the server. A robot should check this file whenever it accesses a particular server to ensure that it is allowed to work with the data available on that server. In developing a Web Crawler it is important to design it to follow this protocol, and also to design it so that it does not make rapid-fire requests.
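As an illustration, a robots.txt file placed at the root of a server might contain entries of the following form (the robot name and the directories shown here are hypothetical):

# http://www.example.org/robots.txt
# Exclude the robot named "SeqLoc" from the /data/ directory,
# and exclude all robots from /private/.
User-agent: SeqLoc
Disallow: /data/

User-agent: *
Disallow: /private/

A compliant Crawler fetches this file before requesting any other resource from the server and skips any path it is not permitted to visit.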

2.4 Guidelines for Creating Web Robots

In addition to following the robots exclusion protocol, the robot should fulfil the following requirements [2]:

- Separate requests to a particular server by at least one minute.
- Ensure that whenever the Crawler visits a site, the site can determine the Crawler's name and the contact details of the Crawler's controller.
- Do not run the Crawler unsupervised, in case it starts to cause problems at any of the sites it visits.
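In Perl, which is the language eventually chosen for this project (see Chapter 4), the LWP::RobotUA module handles both the robots exclusion protocol and the request delay. The sketch below shows a minimal set-up; the agent name and contact address are placeholders.

use strict;
use warnings;
use LWP::RobotUA;

# Identify the crawler and its controller so visited sites can tell who is calling.
# Both values here are illustrative placeholders.
my $ua = LWP::RobotUA->new(
    agent => 'SeqLoc/1.0',
    from  => 'crawler-admin@example.edu.au',
);

# LWP::RobotUA measures its delay in minutes; one minute satisfies the guideline above.
$ua->delay(1);

# Requests made through $ua now respect robots.txt and are rate limited automatically.
my $response = $ua->get('http://www.example.org/index.html');
print $response->status_line, "\n";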

2.5 A Brief Introduction to the Internet

In this section it is not the intention to explain the entire Internet as that is well and truly beyond the scope of this project. However some knowledge of the way the Internet works is required to understand the basics of the project.

2.5.1 Hyper Text Markup Language (HTML)

The pages available on the World Wide Web are written in Hypertext Markup Language (HTML). This language provides a simple method of adding fonts, graphics and pointers to other WWW sites. HTML works by enclosing a region of text in commands that specify how that region should be formatted. As an example, encoding the title of a particular web page just requires the following line:

<TITLE> Title of the web page is here </TITLE>

So encoding the title just involves surrounding it with the correct commands. Similarly for the remainder of the web page, each of the relevant sections is enclosed in particular formatting commands. Of particular interest in this project is how a link to another page is encoded. This is done using the following command:

<A HREF="http://www.example.com/index.htm"> Actual text to display </A>

Here the address of the page to go to is given directly after the HREF attribute. The line also includes the text that is displayed by the browser when it renders this page. The address of the page is generally referred to as a link or an anchor; its more technical name is a Uniform Resource Locator (URL).

2.5.2 Hypertext Transfer Protocol (HTTP)

The Hypertext Transfer Protocol is the standard protocol for transfers of data around the World Wide Web. It essentially involves a request to a particular site that is then followed by a response. The formatting of the requests is specified in the protocol but is not important for this project. It is worth noting, however, that not all of the resources encountered when browsing the Internet are served over HTTP; some sites make their data available through other protocols such as FTP, and the software must be able to deal with these as well. Some of the requests that a user is able to make using HTTP are listed in Figure 2.1.

Method   Description
GET      Get the entire content of a page.
HEAD     Get simple information about a page.
PUT      Request to add a file to a particular site.
POST     Add data to a particular web page.

Figure 2.1: Some simple HTTP Request types.

Although the majority of the requests here are quite simple, the HEAD request deserves more discussion. This request obtains some simple information about a page, such as the document size and the time of the most recent update to that page. This information is also obtained when using the GET request, but the GET request returns the entire page and not just the header information. As such, the HEAD request provides a useful method for obtaining information about the current status of a page without downloading it.
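As a minimal sketch of how this is used later in the project, the Perl LWP::UserAgent module can issue a HEAD request and expose the size and modification-time headers; the URL below is a placeholder.

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# A HEAD request returns only the response headers, not the body of the document.
my $res = $ua->head('http://www.example.org/genomes/sequence.zip');

if ($res->is_success) {
    # Content-Length gives the size in bytes, Last-Modified the time of the last update.
    print 'Size:          ', ($res->header('Content-Length') || 'unknown'), "\n";
    print 'Last modified: ', ($res->header('Last-Modified')  || 'unknown'), "\n";
}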

Chapter 3: Specifications of the Project


In this chapter the required specifications of the software to be developed are derived from the objectives for the project. The process used in development of these specifications is to first state the objectives and then separate the objectives into each of the modules that make up the project.

3.1 Objectives

The objective of this project is to develop a Web Crawler that locates sequence files from a given set of web sites. The Crawler must follow the robots exclusion protocol and must also follow the guidelines for the development of robots [2]. Once the Crawler has located the files that are of interest they need to be downloaded to the local file system. The download should occur so that the sequence files for a particular species can be found in that species' directory. It is therefore necessary to determine which species a particular sequence file corresponds to when saving it. The entire process needs to be automated. A Graphical User Interface that allows simple control of the parameters for the software also needs to be developed.

3.2 Programming Language Requirements

In programming the project it is necessary that the programming language to be used is capable of interfacing with the World Wide Web. It must also be capable of implementing a Web Crawler and doing the complex search operations on web pages that the Web Crawler requires. The next criterion is how system dependent the language is. The program to be developed should be capable of being run on any operating system and therefore the programming language should be system independent. Finally the level of support that is available for the language is a consideration whenever developing a new project as having little help available makes development difficult.

3.3 Modules of the Project

The project has been separated into the following modules based on the objectives denoted earlier:

3.3.1 File Search

The file search module implements the actual search for files and is the part of the project that does the majority of the Crawling. Its job is to locate all files that are linked on the set of sites to be searched. The set of sites to be searched will be user-defined and the Crawler should not search outside these web sites. The Web Crawler must follow the robots exclusion protocol and must also follow the suggested guidelines for the development of Web Robots discussed earlier in this report [2].

3.3.2 File Selection

Once the files in the set of web sites have been located, files that correspond to sequence data should be selected. This selection should be optimised so that the number of irrelevant files selected is minimised. The selection procedure should be as automated as possible.

3.3.3 File Updating

In addition to locating and downloading new sequence files, the software is also intended to ensure that the files that it downloads are kept up to date. To do this it is necessary to check that each of the files that have previously been downloaded is the most up-to-date version. This can be done by going through each of the files that have previously been downloaded and checking the version that is available for download. If the version on the web is more up to date than the version that has been saved locally, then the file will be selected to be downloaded again.

3.3.4 File Download

The selected files should then be downloaded to the local file system. The local file system will need to be organised in such a way that there is a directory available for each species. The files that correspond to a particular species can then be added to that directory. Any files that have been downloaded in a compressed format should be processed into their uncompressed form. As was the case for the file selection the file download should also be automated as much as possible.

3.3.5 Graphical User Interface

The system to be developed will require a graphical user interface that enables control of the various modules. This interface should allow control of each of the modules described previously and also set up any parameters that are required for the search operation.


Chapter 4: Software Implementation


In this chapter the process of development of the SeqLoc software package is explained. This process involved a number of stages, and these stages are followed through the chapter until the final product that has been developed for this thesis project is reached.

4.1 Language Selection

The first decision in the development of this project was which programming language to develop the software with. Searches of the Internet revealed that previously written Web Crawlers have been implemented in a variety of languages; however, Java and Perl were the most prevalent [1]. For this reason further examination of Java and Perl seemed the logical next step.

Searching the Internet found a variety of information about programming in Java. The first search performed located an article that had source code for a Web Crawler written in Java; this article explained the workings of the web crawler quite well [6]. Another attractive feature of Java is that it has a module that allows interfacing with the Internet. Java is also extremely system independent; it has been specifically designed for system independence.

A large amount of information was available on the Internet about Perl. Searching for a Web Crawler written in Perl located a good article with sample code for a Crawler written in Perl [3]. This code also had a decent explanation of what the crawler was doing. In addition to this, Perl is a language designed around the processing of text. This means that Perl is a very good language for processing web pages, as these pages are just text files. Perl is also designed to be system independent, although it is not as independent as Java, as it does have some modules that are only usable on certain operating systems. The downside for Perl is that it is not strict on the type of instructions it will process. This means that logically incorrect instructions can sneak into code and not be located until significantly later. Therefore debugging of Perl software is more difficult than for a stricter language such as Java. Perl also has a variety of modules that allow interfacing with the Internet, including a module that implements the robots exclusion protocol.

Property                  Java              Perl
Sample Crawler Code       Good              Good
System Independent        Extremely Good    Good
Text Processing           Okay              Extremely Good
Debugging                 Good              Bad
Support                   Extremely Good    Extremely Good
Web Interface             Good              Extremely Good
Programmer's Knowledge    Some              None

Figure 4.1: Summary of the Properties of Perl and Java

Selection of the language to be used for this project was somewhat difficult, as both of the languages considered have extremely good characteristics. However, the fact that Perl has significantly better text processing facilities weighed heavily in its favour, as this project is essentially based around text processing. Perl also has good support for interfacing with the Internet, and the modules that provide this interface already follow the robots exclusion protocol. This is a significant advantage to working with Perl, as it allows more focus on implementing the actual product rather than on developing the interface to the Internet. In addition to this, it seemed a valuable opportunity to gain some experience in a new programming language. For all of these reasons it was decided to use Perl as the basis for this project.

4.2 File Search Module

In this section the design and implementation of the module that locates files in the web pages of interest is discussed. The purpose of this module is to locate all of the links to files that are contained within the set of pages supplied to the module. The procedure for development of this module involved first obtaining a sample piece of code that performs a simple Crawler-based search and then modifying that search in order to perform the operations needed in the search module for the SeqLoc software package.

4.2.1 The Basic Algorithm

In development of a new piece of software it is always useful to make sure that someone else hasn't already solved the problem for you. In this case a search for a web crawler written in Perl located some very basic source code that implemented a web crawler [3]. This source code provided a valuable starting point for development of the web crawler used here. The algorithm that this crawler was based on can be seen in the pseudo-code in Figure 4.2.

Get a source page and add it to a queue;
Set the current page as the first new page available in the queue;
While (current page is not an empty string) {
    Obtain the contents of the current page;
    If (contents of current page contain search string) {
        Print current page URL;
    }
    Extract the links from the current page;
    For (each of the links) {
        Convert the current link to a fully specified URL;
        If (the current link is not in the queue) {
            Add link to queue;
        }
        Else {
            Increment number of times current link has been referenced;
        }
    }
    Increment the count of the times the current page has been visited;
    If (any new pages in queue?) {
        Set the current page as the first new page found;
    }
    Else {
        Set the current page as an empty string;
    }
}

Figure 4.2: Pseudo-code for Web Crawler source.


4.2.2 Understanding the Basic Algorithm

As the pseudo-code from the previous section gives the basis for the search module, it is worth examining each of the stages in more detail. This is done by going through what would occur in the first iteration of the software.

The first thing that occurs is that a queue is set up. The queue that is built by the crawler is used as a large listing of the web pages that the crawler has visited and has yet to visit. This is done by building the queue as a table that contains each of the URLs of the pages and also an integer that corresponds to the number of times that page has been visited. Therefore when the queue is first built it will only contain the URL of the source page and also the value 0 indicating that this page has not yet been visited. Once the queue has been built the current page to be operated on is set to the source page that is set by the user. The software will then enter the main loop.

Because the current page has been set to the source page entered by the user the first iteration of the main loop will always occur. However on later iterations of the main loop it will be possible that the software will have finished searching. If it has finished this is indicated by the current page being set to an empty string.

Once in the main loop it is necessary to obtain the contents of the current page. To do this the crawler must be able to communicate with the Internet. In the source code this is done using a direct connection that does not use any of Perl's modules for Internet interfacing. Once the page has been obtained, the content of the page is checked for the search string of interest. If that search string is found then the address of the current page is printed to the screen.

The next stage in the procedure is to find all the links that are on the page. The background section explained that when examining an HTML page it is possible to obtain certain parts of the page by examining the HTML anchors that surround a particular region of the page (refer to section 2.5.1). In the case of searching for the links contained in the page, these links are always surrounded by certain HTML tags. Therefore these tags can be used as the basis of a search for the links in the page.


This search is simplified by Perl's extensive text search capabilities. These links are then processed to ensure that they are fully specified with a protocol type and hostname. This is done using a subroutine that adds these to the page address if they are not already present. Once the entire address of the current link is known, it is checked whether that address is already in the queue. If so, the number of times the particular link of interest has been found is incremented. Otherwise the link is added to the queue to be searched in a later iteration of the main loop.
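In Perl, the two steps of pulling the links out of a page and converting them to fully specified URLs can be performed with the HTML::LinkExtor and URI modules. The following sketch is illustrative only; the page address is a placeholder and the final SeqLoc code is listed in the appendices.

use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use URI;

my $page = 'http://www.example.org/genomes/index.html';
my $ua   = LWP::UserAgent->new;
my $res  = $ua->get($page);

# Collect the HREF attribute of every anchor tag found in the page.
my @links;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @links, $attr{href} if $tag eq 'a' && defined $attr{href};
});
$extor->parse($res->decoded_content);

# Convert relative links into fully specified URLs, as the crawler queue requires.
my @absolute = map { URI->new_abs($_, $res->base)->as_string } @links;
print "$_\n" for @absolute;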

The important operations have now been completed for the current page, and it is then necessary to perform some housekeeping so that the next iteration of the loop runs smoothly. It is necessary to ensure that the current page is not searched through again; to do this, the number of times the current page has been visited is incremented. Next the queue is examined and, if any pages that have not yet been searched are found, the current page is set to the first page that has not been searched. If no new pages are found then the search is complete and the current page is set to an empty string so that the main loop will not run again. This completes the current iteration of the main loop.

As can be seen from this explanation the search algorithm provides a relatively simple method of locating a text string within a variety of web pages. In the next section problems with this algorithm are discussed.

4.2.3 Problems with the Basic Version

The source code that was discussed in the previous section provides a good basis for development of a web crawler; however, there are a number of problems with it in relation to this project. Firstly, it is not actually locating files; instead it is looking for a particular text string in pages. As we wish to search for files, we need to replace the current search operation with a new search that locates links to files. Secondly, it does not make use of the robots exclusion protocol or use Perl's Internet interface modules. Also, the crawler to be designed is only intended to search within a limited subset of servers. Therefore it is necessary to remove the capability to swap over to different servers that are not a part of the search.


4.2.4 The Improved Algorithm

The search algorithm as it currently stands does not locate the files that we are interested in. Files can be located by examining the links found in the pages and determining whether a link points to a file. Links in a web page can be to another page or to a file that can be downloaded. Therefore a method of determining whether a particular link is to a file instead of to another web page is required. The method that has been selected is to check the extension of the file.

Usually a file name will consist of some sort of name, then a period character (i.e. '.') and then some sort of extension. The extension will generally be three characters long and in most cases tells the user something about what type of file it is. As an example, the file thesis.zip is a file that has been compressed with some sort of zip compression software.

It is likely that the user will have a set of extensions that they are interested in searching for. It is also quite possible that there would be a set of extensions that are definitely not of interest and that the crawler should not try to search through. Therefore in searching for extensions there should be two sets: the first being those to include and the second being those to exclude. Any file found with an extension of interest can be stored in an array for later processing instead of being printed to the screen.

The final check for adding a link to the queue should be to check if that link is to another HTML based page. If not then it should not be added to the search. This can be determined using a Perl module that determines the content type of a particular file based on its filename.
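A sketch of how a link might be classified is shown below. The include and exclude sets are illustrative examples (in SeqLoc they are supplied by the user through the GUI), and the HTML check is assumed to use a module such as LWP::MediaTypes, which guesses a content type from a file name.

use strict;
use warnings;
use LWP::MediaTypes qw(guess_media_type);

# Example include/exclude sets; the real sets are configured by the user.
my %include = map { $_ => 1 } qw(zip gz tar fasta);
my %exclude = map { $_ => 1 } qw(gif jpg css);

sub classify_link {
    my ($url) = @_;

    # Take everything after the final period as the extension, if there is one.
    my ($ext) = $url =~ /\.([^.\/]+)$/;
    $ext = defined $ext ? lc $ext : '';

    return 'download' if $include{$ext};   # store in the results array
    return 'ignore'   if $exclude{$ext};   # never follow or store

    # Otherwise only follow the link if it appears to be another HTML page.
    return guess_media_type($url) eq 'text/html' ? 'crawl' : 'ignore';
}

print classify_link('http://host/data/ecoli.fasta'), "\n";   # download
print classify_link('http://host/data/subpage.html'), "\n";  # crawl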

The algorithm needs to be modified so it only searches within the limited set of servers that are defined by the user before the search begins. To implement this, those servers are added to the queue when the software begins running. Later, when searching through the links that have been found within a page, the only links that should be added to the queue are those that are within the domain of the current page. To add a specific set of pages to the queue before running the main loop, the addresses of those pages can be stored in a text file on the local file system and obtained when the software starts running.
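A minimal sketch of reading the start pages and restricting the crawl to their hosts is given below; the file name start_urls.txt is an assumption used for illustration.

use strict;
use warnings;
use URI;

# Read the user-defined start pages, one URL per line, from a local text file.
open my $fh, '<', 'start_urls.txt' or die "Cannot open start_urls.txt: $!";
chomp(my @seeds = <$fh>);
close $fh;
@seeds = grep { /\S/ } @seeds;    # skip any blank lines

# Remember the hosts the crawler is allowed to stay within.
my %allowed_host = map { URI->new($_)->host => 1 } @seeds;

# Links are only queued if they point back into one of those hosts.
sub in_search_domain {
    my ($url) = @_;
    return $allowed_host{ URI->new($url)->host } ? 1 : 0;
}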

Another addition that has been made to the search system is that it now makes use of Perl's web interface module. This means that an interface must be set up when the software begins running. An additional benefit of using this module is that it supports a variety of protocols, which means there is now support for accessing sites that make use of the FTP protocol. However, the search software relies on the data it receives being in HTML format, and an FTP directory listing is not in that format unless it is specifically requested to be. Therefore an additional line must be added to the request so that the data is returned in HTML format.
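One way this can be expressed with the LWP modules is to state in the request that the client accepts text/html, which asks the FTP handler to format a directory listing as an HTML page. The exact behaviour depends on the LWP version, so the fragment below should be read as a sketch; the FTP address is a placeholder.

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET => 'ftp://ftp.example.org/pub/genomes/');

# Ask for the directory listing as HTML so the crawler can parse it with the
# same link-extraction code that is used for ordinary web pages.
$req->header(Accept => 'text/html');

my $res = $ua->request($req);
print $res->decoded_content if $res->is_success;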

Adding all the changes that have been discussed results in the pseudo-code that is shown in Figure 4.3.

Set up interface to the Web;
Get the source pages and add them to the queue;
Get extensions of interest;
Get extensions to be excluded;
Set the current page as the first new page available in the queue;
While (current page is not an empty string) {
    Obtain the contents of the current page;
    Extract the links from the current page;
    For (each of the links) {
        Convert the current link to a fully specified URL;
        If (current link has one of the extensions of interest) {
            Add link to results array;
        }
        Else if (the current link does not have an excluded extension) {
            If (the current link is to a HTML page) {
                If (current link is not in queue) {
                    Add link to queue;
                }
                Else {
                    Increment number of references to link;
                }
            }
        }
    }
    Increment the count of the times the current page has been visited;
    If (new pages in queue?) {
        Set the current page as the first new page found;
    }
    Else {
        Set the current page as an empty string;
    }
}
Print the results obtained;

Figure 4.3: This figure shows the changes to the algorithm used for the web crawler.


4.2.5 Testing the Improved Algorithm

The improved search algorithm that has been developed was implemented in Perl. Testing has shown that this algorithm will successfully locate files that are contained within the pages to be searched. It successfully locates and records all the links to files that have the extensions that are defined as being of interest. Due to this success this algorithm is used as the search module in the SeqLoc software package.

4.3 File Selection Module

The purpose of the file selection module is quite obviously to select files. The selection is required, as only files of interest should be downloaded. In this case that corresponds to files that contain sequence data. There is already some simple file selection occurring during the running of the Search module. The search results in only files that have certain extensions being found. A further selection procedure would allow the software to determine the files that should be downloaded out of the set of files that it has obtained.

The problem with file selection is that the decision must be made before the file has been downloaded. Therefore the decision about whether to download a particular file must be made based on the name of the file alone. There are two parts to the file name: the extension and the name of the file itself. The extension is already being used to select the files of interest, so no further selection based on extensions needs to be added.

Selection based on the file name would be possible if there was a standardised method of naming sequence files. However the data that is being obtained is from a variety of sources and there is no standardised method. For this reason the best solution seems to be to allow the user to examine the results of the search and to select the files that they choose to download. This selection is a part of the GUI that is discussed in a later section.


4.4 File Updating

It is necessary to determine whether any of the files that have previously been downloaded by this software have been changed to a more up to date version on the Internet. If the files that are available for download are more up to date, then the software should download them again. In order to do this a system for keeping records of the files that have previously been downloaded is necessary. In this section the development of that system is discussed.

4.4.1 Initial Design

The initial design was to keep a large table of each of the files that the Sequence Locator found that it wished to download. This table could be stored on the local file system and obtained by the software each time it started running. The table would store all the relevant values for a particular file. The headings for the initial table are shown in figure 4.4.
URL for File          File Size          Date file was last changed

Figure 4.4: Initial design of the table of downloaded files.

Use of this initial design of the table provides all the information required to check whether a particular file has been updated. The location where the file is stored is recorded, along with the file size and the date of the last update to the file. The last two fields can be obtained by making a HEAD request for the page (as discussed in section 2.5.2). This request returns the file size in bytes and the time of the last update to the file.

4.4.2 Problems with the Initial Design

The major problem with the initial design is that it does not record enough information about the files that were found in previous searches. Ideally, every file that has been located should be recorded. This is necessary because the same file can be located at various different locations. These would appear different to the software, as it differentiates between files based on their entire address. These mirrors of the original version of the file should not be downloaded. In addition to this, files can be found that are not of interest. The system should keep a record of all of the different files and not need to process them each time that it runs, as would be the case with the initial system.

A minor problem with the initial design was the method being used to check whether a file had changed. Using the date field from the header was considered to pose a risk, in that the server where the file is hosted might be regenerating its pages daily. In that case only the date field in the header would have changed, even though the content of the file had not.

4.4.3 Improved Design

The improved design of the table implements two additional fields to allow recording information about the level of interest of a particular file. The two fields are named type and download. The type field is used to indicate whether a particular link to a file is a primary source for download of that file. If it is a primary source then the type field is set to primary otherwise it is set to mirror. If the type is set to mirror it indicates that a particular record is a copy of another source of a file. The download field is used to indicate whether a particular file should be downloaded. Only files that are primary sources for a file should ever be downloaded. The download field can be set to never or when changed. This allows control of whether a primary source of a file is downloaded or not. If a file is to be downloaded when changed, then that file will be downloaded if the software determines that it has been changed.

The decision about when a file has been changed has been updated. It is now made based upon whether the file's size has changed and does not take into consideration the date field available in the file header. Therefore the date field that was being used in the table has been removed.

In addition to all the other fields a number of informational fields were added to allow the software to record relevant information about the files that are downloaded. Information that is needed is the directory that a particular file should be saved to.

Setting this directory is discussed in the next section on file download but for now it should be noted that this field is included in the table. Another informational field is a field that keeps a record of the specific name of the file that the current record refers to. In general most files are referred to in this software by their complete web address. The file name field keeps a record of the file name, split off from the entire address. This is kept for ease of comparison when determining whether two files are mirrors of one another.

URL                                             File name   Size   Directory   Type      Download
http://sample.source.page/                                  0                  source    never
http://sample.source.page/good.zip              good.zip    87     C:\good\    primary   when changed
http://sample.source.page/directory/good.zip    good.zip    87                 mirror    never

Figure 4.5: Improved table of fields. This figure also shows a number of example records.

The improved design results in a table that records all the desired values successfully. This table was implemented as an additional class in Perl. The records were saved to a text file that was opened each time the software ran. This table was updated when the software closed. This method allows continuity between runs of the software.

Use of the improved design to select files that have changed firstly involves finding the entries in the set of records that are primary sources of a file. Also those files must have the setting 'when changed' in their download field. The header for these files is obtained and the file size is checked. If it has changed then the file is selected for download.
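A sketch of this check, using the fields of Figure 4.5 and the HEAD request introduced in section 2.5.2, is shown below. The hash keys and the example record are illustrative; the real records are read from the text file described above.

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Each record mirrors the fields of Figure 4.5 (the keys here are illustrative).
my @records = (
    { url  => 'http://sample.source.page/good.zip', file => 'good.zip',
      size => 87, dir => 'C:\\good\\', type => 'primary', download => 'when changed' },
);

my @to_download;
for my $rec (@records) {
    next unless $rec->{type} eq 'primary' && $rec->{download} eq 'when changed';

    # A HEAD request is enough to learn the current size of the remote file.
    my $res = $ua->head($rec->{url});
    next unless $res->is_success;

    my $remote_size = $res->header('Content-Length');
    if (defined $remote_size && $remote_size != $rec->{size}) {
        push @to_download, $rec;       # the remote copy has changed
        $rec->{size} = $remote_size;   # remember the new size for the next run
    }
}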

The improved design for selection of files that need to be updated has been implemented in the final SeqLoc software package and is working successfully.


4.5 File Download

The next stage is to download the files that have been selected for download. As these files should previously have been selected this procedure should be relatively simple. The only difficulty in this section is how to choose what directory to save a new file to. Older files that are being updated should already have their directory set. Therefore it is only necessary to determine the path for newer files. The intention is that the directory structure to be developed should have one directory for each species. Then files that correspond to that species can be downloaded to that directory. Therefore the software to be developed needs to be able to create directories and also to save a particular file to a particular directory.

As was discussed for the file selection module (refer to section 4.3), it is practically impossible for the software to determine much information about a particular file, due to there being no standardised format for the naming of sequence files. This means that the user is required to determine which species a particular file corresponds to. What the user must do is select the file that has been chosen for download and then associate it with a particular directory. This is done in the GUI that is discussed in the next section. Another part of the interface is the requirement for the software to be able to add directories to the current file system, for the case where data has been located for a species that does not yet have a directory.

Once the directories for download have been determined the software is required to download the various selected files to those directories. The download procedure makes use of Perls web interface modules. Those modules provide a simple method for the download of files.
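As a minimal sketch, the LWP::Simple module can fetch a selected file straight into a species directory; the URL and directory below are placeholders standing in for the values chosen by the user.

use strict;
use warnings;
use File::Path  qw(mkpath);
use LWP::Simple qw(getstore is_success);

# Both values are placeholders; in SeqLoc they come from the user's selections.
my $url  = 'http://sample.source.page/good.zip';
my $dir  = 'sequences/homo_sapiens';
my $file = "$dir/good.zip";

mkpath($dir) unless -d $dir;          # create the species directory if needed
my $status = getstore($url, $file);   # fetch the remote file straight to disk
print is_success($status) ? "Saved $file\n" : "Download failed ($status)\n";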

The last function that this module should be capable of is converting compressed files to their uncompressed format. External decompression software will be required to implement this. Because the software being developed is intended to be system independent, the decompression software to be used cannot be hardcoded into the software. It must be an option for the user to determine what is used for decompression, as different decompression software will be required for different operating systems. The three forms of decompression that the software will support are zip, tar and gz. The GUI will be required to allow the user to specify where the software to uncompress each of these formats is located.
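A sketch of how the external tools might be invoked is shown below. The tool paths are examples only; in SeqLoc they are options set by the user, and they will differ between operating systems.

use strict;
use warnings;

# Paths to the external tools are user options in the GUI; these values are examples.
my %decompressor = (
    zip => '/usr/bin/unzip',
    gz  => '/bin/gunzip',
    tar => '/bin/tar',
);

sub decompress {
    my ($file) = @_;
    my ($ext) = $file =~ /\.([^.]+)$/;
    return warn "Cannot determine format of $file\n" unless defined $ext;

    if    ($ext eq 'zip') { system($decompressor{zip}, '-o',  $file) }
    elsif ($ext eq 'gz')  { system($decompressor{gz},  $file)        }
    elsif ($ext eq 'tar') { system($decompressor{tar}, '-xf', $file) }
    else                  { warn "No decompressor configured for .$ext files\n" }
}

decompress('sequences/homo_sapiens/good.zip');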

4.6 Graphical User Interface

The Graphical User Interface (GUI) is required so that the user may control the software as it progresses through the processing of the various modules. It should allow the user to perform search and download operations and also to control the variables that are needed when implementing a search. The GUI was coded using Tk, a toolkit that is provided as an additional module to Perl (Perl/Tk) and was the obvious choice for this reason. A screenshot of the GUI is shown in Figure 4.6.

Figure 4.6: This picture is a screenshot of the basic GUI. This is the appearance of the interface when it is first opened.

The operations of the GUI can be split into two sections: functions and options. The next part of this report examines these sections.


4.6.1 Functions

When the Functions menu is selected, the user is given the option of starting a file search, selecting files for download, associating files for download with directories and finally downloading the files. As doing a search and downloading files could each take a long time, these operations do not have to occur at the same time. In fact the software could, for example, perform a file search five times without ever downloading a file. However, attempting to do more than one download without doing a search in between would be useless, as no files would be selected for download. It is expected that generally, when running this software, each of the functions will be run in the order in which they are displayed. The separation of control is implemented to allow the user to control the operation of the software as closely as possible. Figures 4.7 and 4.8 show the interface for selecting files for download and for associating files with particular directories. The file search and file download options do not have an interface, as the actions that they perform do not require user interaction.

Figure 4.7: This screenshot shows the file selection screen. This screen illustrates files that have been located as being new in the file search. These files can then be selected for download if the user wishes.


Figure 4.8: This screenshot shows the directory selection screen. In this screen the user is able to set up which directory a particular file should be saved to. If the directory is not selected then the file will just be saved to a temporary directory.
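As an indication of how little code such an interface requires, the fragment below builds a minimal Perl/Tk window with one button per function; the handlers are stubs and the window title and labels are illustrative.

use strict;
use warnings;
use Tk;

my $mw = MainWindow->new(-title => 'Sequence Locator');

# One button per function, in the order the functions are normally run.
$mw->Button(-text => 'Search for Files',        -command => \&do_search)->pack(-fill => 'x');
$mw->Button(-text => 'Select Files',            -command => \&do_select)->pack(-fill => 'x');
$mw->Button(-text => 'Choose Directories',      -command => \&do_dirs)->pack(-fill => 'x');
$mw->Button(-text => 'Download Selected Files', -command => \&do_download)->pack(-fill => 'x');

# Stub handlers; in SeqLoc these call into the corresponding modules.
sub do_search   { print "search\n" }
sub do_select   { print "select\n" }
sub do_dirs     { print "directories\n" }
sub do_download { print "download\n" }

MainLoop;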

4.6.2 Options

The purpose of the options section is to allow as much control over the variables and functionality of the software as possible. The options section allows the user to set such things as the base links for the search and the extensions that are of interest in the search. The options section is split into each of the modules for the software so that the user can look at each section individually. A listing of the options that are available for change is shown in figure 4.9.

Option            Function Performed
Search Delay      Set the delay that will occur between requests to a server (minimum 1 minute).
Edit Links        Edit the set of links that are used as a base for the search operation.
Edit Extensions   Edit the extensions that are used to select files in the search; also allows extensions to be ignored.
Maximum Size      Set the maximum size a web page may be for the software to search through it.
Decompression     Select which types of decompression are allowed and set the path to the software that will perform each decompression.
Base Path         Set the base path for the software to run from.

Figure 4.9: This figure displays the various settings controllable under the options menu.

Chapter 5: Software Evaluation


In this chapter the process that has been used to evaluate the software is explained and then the results of this evaluation are discussed. The software is examined as to how well it meets the specifications that were discussed in Chapter 3 of this document.

5.1 Software Evaluation Techniques


The methods that have been used to evaluate SeqLoc are discussed in this section. This is done by setting up a sequence of progressively more complicated tests. The tests begin on a local server and then move to a more remote location. The site used on the local server is a web page that is intended specifically for testing of this project. The remote testing will test both an HTTP based page and an FTP based page. In all of these tests the GUI is used to control the software and is being tested as well.

5.1.1 Local Interfacing Test 1

The first test makes use of a basic web page that has two links to two simple files. The software is required to find both of these files and to record and then download them. On a second run of the software it should determine that it already has the most current version of both of the files.

5.1.2 Local Interfacing Test 2

In this test the same web page is used as for Test 1; however, one of the files is now changed slightly. The software is tested to see whether it can determine that one of the files has been changed and whether it will then download the new version.


5.1.3 Local Interfacing Test 3

In this test the structure of the page is changed so that it links to both a local page and an external page. The software is tested to ensure that it only follows the local link. From the local page the software is then able to download another file.

5.1.4 Local Interfacing Test 4

In this test the software is given a large variety of file extensions to select from. Only a limited subset of these file extensions is defined to be of interest. The software is tested to ensure that it only selects files with the extensions of interest.

5.1.5 Local Interfacing Test 5

In this test the software is required to process a variety of compressed files into their decompressed format. The basic web page is used again; it contains three files, each compressed with one of the three supported formats.

5.1.6 Remote HTTP Test

Having tested all the basic functions, the software is now tested on an external web site that makes use of the HTTP protocol. The purpose of this test is to ensure that all files are located and also that no unexpected situation occurs to stall the software.

5.1.7 Remote FTP Test

The final test is to run the software on a remote page that is based on the FTP protocol. To pass this test the software must locate all files and must be able to handle all the situations that occur.


5.2 Results

In this section the results of the tests that were discussed in the previous section are displayed. The tests that were discussed were used as a diagnostic tool during the development and in the majority of cases the software required some debugging before it managed to pass them. However the results discussed here are for the finished version.

5.2.1 Local Interfacing Test 1

The software successfully locates both of the files that are listed on the page. Examination of the records in text files also indicates that it has made records of these files successfully and that the data that is stored is accurate.

5.2.2 Local Interfacing Test 2

The software successfully determines that the version of the file that it currently has is out of date and downloads the new version. It also updates the records to indicate the update status of that file.

5.2.3 Local Interfacing Test 3

The software successfully manages to differentiate between the two links that are available to it. It follows only the local link and successfully manages to obtain the file that is in that link.

5.2.4 Local Interfacing Test 4

The software manages to differentiate the files and only select the files that have the extensions that are of interest.


5.2.5 Local Interfacing Test 5

The software manages to decompress the various file formats successfully. However making use of external decompression software does lead to some messages being printed to screen depending on the software. For example unzip will print the name and version of unzip being used and will also display a success message if it successfully decompresses the file it is working on.

5.2.6 Remote HTTP Test

In this test the software begins to have some slight problems in time management. Interfacing to an external web site takes a significant amount of time and there is a significant amount of data to be searched through. To make this test reasonable on a dial-up connection the scope of this test was not as great as it could be. However the software passes the test and manages to locate and download the files that are of interest.

5.2.7 Remote FTP Test

Again in this test the software has time problems and again the scope of the test is limited due to this. However the software does successfully locate and download files that are selected to be of interest.

5.3 Discussion of Software

In this section the performance of the software is first judged based on its success with the various tests that were devised. Following this each of the modules of the final software are examined as to how well they meet the criteria that were defined in Chapter 3.


5.3.1 Discussion of Test Results

The software that has been developed has successfully managed to pass the tests that were designed for it. However, in passing these tests a number of limitations have been identified. The major limitation is that the software takes a significant amount of time to run when accessing an external site. This limitation is partly due to the fact that there is a delay of at least one minute between accesses to any web site. This delay is due to the recommended procedure for developing Web Crawlers that was discussed in Chapter 2. The time limitation is also most likely due to the fact that all testing occurred on a dial-up connection. The final implementation of this software is intended for use on a server with much faster access to the Internet. In that situation the only real delays that would be expected would be those that have been coded into the system to ensure it does not make rapid-fire requests to a particular site. Further testing of the software at such a location would be the logical next step in the testing procedure.

A further limitation of the testing procedure is that it has only covered a limited subset of sites and would not cover the wide variety of possibilities that could occur across the Internet. This could mean that when the software is tested on another site, further problems could arise. In fact it is quite likely that the software would encounter some sort of problem, given the wide variety of possible sites on the Internet.

In the context of this project it is considered that testing of the software has indicated that it will successfully perform the wide variety of tasks that it has been assigned.

5.3.2 File Location Module

In Chapter 3 it was stated that the purpose of this module was to locate all DNA sequence files in a limited subset of web sites. It was intended that a Web Crawler should perform this search and that the Crawler was required to follow the robots exclusion protocol and also the guidelines for the development of web robots. It was also necessary for the search to be automated.


The file location module successfully manages to fulfil all of these objectives. It makes use of the robots exclusion protocol through its use of a Perl module that implements that protocol. It follows the guidelines for web robots in that it limits the rate at which it accesses web sites, with a delay of at least one minute between accesses. As demonstrated by its successful passing of the various tests devised, it also manages to find all the files that are available on a particular web site. As the files it locates include all the sequence files, the objective of locating the sequence files is also fulfilled.

5.3.3 File Selection Module

The purpose of the file selection module is to take all the files that the web crawler has located and then to select the files that should be downloaded. This selection should occur in such a way that only sequence files are downloaded. The selection procedure needs to be automated.

The selection module specification was later changed because it was impossible to decide whether a particular file was of interest based on file names alone. For this reason the user was required to control the selection stage of the software. This stage was implemented as a part of the graphical interface where the user could select files that should be downloaded. This change to the specification meant that the selection module does not fulfil its original requirements very well. It is not automated and the only selection that really occurs is based on the extensions of the file names of the files that are located.

5.3.4 File Updating Module

According to the specifications the purpose of this module is to check all the files that have previously been located and to ensure that the local version that has been downloaded is the most up to date version. As with the majority of the other modules this procedure must be automated.

The file update module fulfils these specifications. The software keeps a record of all the files it has located and checks through this list to see if any of them need to be updated. When an update is required, the software adds the file to the list of files for download.
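The change check can be done cheaply with an HTTP HEAD request, comparing the size reported by the server against the size stored in the local record, which is essentially what the check_size routine in Appendix B does. The sketch below is a simplified, standalone version; the URL and the stored size are assumed values rather than data from the software.

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua  = LWP::UserAgent->new;
    my $url = 'http://example.org/genomes/sequence.gbk';

    # a HEAD request returns the headers (including Content-Length) without the body
    my $res = $ua->request(HTTP::Request->new(HEAD => $url));
    my $remote_size = $res->content_length;

    my $stored_size = 123456;   # assumed value read from the local record
    if (defined $remote_size && $remote_size != $stored_size) {
        print "file has changed, queue it for download\n";
    }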

5.3.5 File Download Module

The purpose of the file download module is first to determine which species a particular file corresponds to and then to download the file into that species' directory. This procedure should be automated.

As implemented, the file download module requires the user to select the directory a particular file should be saved to. If a file has not been assigned a directory it is saved to the software's base directory. This does not fulfil the automation specification, but it does perform the required function.
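Saving a file to the chosen directory is handled by passing a destination path to the user agent's request method, as in the sketch below. The URL and path are placeholders; in the real software the path is built from the record's directory and filename fields.

    use LWP::UserAgent;
    use HTTP::Request;

    my $ua  = LWP::UserAgent->new;
    my $url = 'http://example.org/genomes/sequence.gbk';

    # giving request() a second argument writes the response body straight to that file
    my $save_path = '/data/sequences/e_coli/sequence.gbk';
    my $res = $ua->request(HTTP::Request->new(GET => $url), $save_path);
    if ($res->is_success) {
        print "saved to $save_path\n";
    } else {
        print "download failed: ", $res->status_line, "\n";
    }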

Because of the inconsistent way files are named at the external sites being accessed, the module does not quite reach the level of automation described in the specifications. Ideally the software would run through the entire procedure without any user input. As it stands, it removes a significant amount of repetitive work in locating and downloading files, but the user is still required to choose the directory each file should be saved to.

5.3.6 Graphical User Interface

The original intention for the interface was to allow the user to control the various options for the software, such as the links the search is based on and the extensions it looks for. A number of additional tasks were added to the interface as it became apparent they would be required for this project: beyond the base specification, the interface also allows the user to control the file selection and file download procedures. The interface has fulfilled all of the required specifications.
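The interface is built with Perl/Tk, following the pattern used in Appendix B. The sketch below shows the general structure of a menu-driven Tk window; the labels and callbacks are illustrative placeholders rather than the actual SeqLoc menus.

    use Tk;

    my $mw = MainWindow->new;
    $mw->title("Sequence Locator");

    # a menubutton whose entries invoke callbacks in the rest of the program
    my $function = $mw->Menubutton(
        -text      => 'Function',
        -tearoff   => 0,
        -menuitems => [
            [ 'command' => 'File Search', -command => sub { print "search selected\n" } ],
            [ 'command' => 'Exit',        -command => sub { $mw->destroy } ],
        ],
    )->pack(-side => 'left');

    MainLoop;   # hand control to the Tk event loop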


5.4 Summary

In this chapter the tests used to assess the final software were listed and their results presented. The discussion of these results showed that, although most of the modules fulfilled their specifications, the file selection and file download modules did not meet the automation requirement; they do, however, fulfil the other functionality requirements.


Chapter 6: Review and Conclusion


In this chapter the project as a whole is reviewed. Suggestions are then made about future directions for this project and about other projects that would be logical spin-offs from it. Finally, conclusions are drawn about the overall success of the project.

6.1 Review of Project

The major purpose of this project has been to develop a software package that is capable of locating, downloading and maintaining a set of files that contain sequence information. The sequence information would then be available for further work to be performed at the University of Queensland.

Development began with an examination of the project's background, which was then used to develop the specifications the project should fulfil. Once the specifications were written, work could begin on implementing the software.

Implementing the software raised a variety of problems, some of which required the specifications to be altered. The major alteration was that the final product would not be completely automated but would instead require the user to control a number of the stages. This loss of automation meant that one of the design objectives, freeing the user from any need to control the software, was not reached. However, the control the user must exercise takes little time and is very small compared with the time it would take to locate and download each file individually.

Once the software had been implemented it was tested. The tests were intended to check that the software fulfilled all of the functionality required of it, and the software passed all of them. However, the testing did not have as wide a scope as would be needed to claim that the software can process any site on the Internet.

6.2 Future Directions

There are several areas in which this project could be improved. As the software is split into modules, possible improvements are discussed module by module in this section.

6.2.1 File Search Module

Future work on this module would focus first on improving the search algorithm to minimise the number of requests it makes to the Internet. Since the software imposes a minimum delay of one minute between requests, reducing the number of requests directly reduces the total time spent waiting.

Other improvements would involve making fuller use of the records the software keeps. At present the search uses only the set of source pages from which it begins. Keeping records of pages that have already been searched, and checking whether those pages have changed before re-crawling them, would reduce the time searches take.
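One way such a change check could work is with a conditional HTTP request, asking the server to return the page only if it has been modified since the last visit. The sketch below is an illustration of this idea and is not part of the current software; the URL and timestamp are placeholders.

    use LWP::UserAgent;
    use HTTP::Request;
    use HTTP::Date qw(time2str);

    my $ua        = LWP::UserAgent->new;
    my $url       = 'http://example.org/genomes/index.html';
    my $last_seen = time() - 7 * 24 * 60 * 60;   # assumed time of the previous crawl

    my $req = HTTP::Request->new(GET => $url);
    $req->header('If-Modified-Since' => time2str($last_seen));

    my $res = $ua->request($req);
    if ($res->code == 304) {
        print "page unchanged since last crawl, skip it\n";
    } else {
        # page is new or changed, parse it for links as usual
    }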

6.2.2 File Selection Module

Improvements to this module would centre on a better technique for selecting files. The present technique selects on file extensions alone; an obvious improvement would be to select on the full file name as well. Because file naming is complex and not standardised, a neural network is a promising approach: the user's past selections could be used as training data for a network that selects files automatically. The success of such a project would depend on the amount of data available for training the network.
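A first step towards such a classifier, whatever learning method is eventually used, would be to log each user decision as a labelled example. The sketch below shows one simple way this could be done; the log file name, subroutine and record fields are hypothetical and introduced only for illustration.

    # append one labelled example per user decision: file name, and whether it was selected
    open my $log, '>>', 'selection_training.txt' or die "cannot open training log: $!";

    sub record_decision {
        my ($filename, $selected) = @_;   # $selected is 1 (downloaded) or 0 (skipped)
        print $log join("\t", $filename, $selected), "\n";
    }

    record_decision('NC_000913.gbk', 1);
    record_decision('index.html',    0);
    close $log;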

6.2.3 File Updating Module

This module works well in its current form, so there are no major improvements to be made. The only concern is to ensure that it makes as few requests to the Internet as possible; beyond that, little change is needed.

6.2.4 File Download Module

The weakness of this module is that the user must select the directory each file is saved to. As with the file selection module, a neural network would be a useful addition here: it could be trained to determine, from the name of a file, which species that file corresponds to.

6.2.5 Graphical User Interface

The graphical user interface is sufficient in its current form. The only changes it will require are additions as further modules are added to the base software.

6.2.6 Other applications

The software that has been developed is not limited to downloading sequence files. None of the selection or processing parts of the software rely on the files being sequence files, so it could be applied in any area where a large number of files must be downloaded from a small number of web sites, for example downloading music files.


6.3 Conclusion

The thesis has met the majority of the objectives set for it. The SeqLoc software has been developed, and it locates and downloads sequence files from the Internet. The software is not fully automated, but it is automated to the greatest extent practical, and it still removes the majority of the work involved in locating and downloading sequence files.

The software that has been developed will provide the University of Queensland with a good system for building a local copy of the sequence data available on the World Wide Web.


Appendix A

References:

1. Google Search: web crawler open source; http://www.google.com/search?q=web+crawler+open+source; 16 October 2001.

2. Koster, Martijn; Guidelines for Robot Writers; http://www.robotstxt.org/wc/guidelines.html; 16 October 2001.

3. Thomas, Mike; LJ40: A Web Crawler in Perl; http://www2.linuxjournal.com/lj-issues/issue40/2200.html; 16 October 2001.

4. Tanenbaum, Andrew; Computer Networks; Third Edition; Prentice Hall; 1996.

5. The Web Robots Page; http://www.robotstxt.org/wc/robots.html; 16 October 2001.

6. Vanhelsuwe, Laurence; Automating Web Exploration; http://www.javaworld.com/jw-11-1996/jw-11-webcrawler.html; 16 October 2001.


Appendix B

Code Listing for SeqLoc.pl


The code listed in this appendix is also available for download on the Internet at

http://student.uq.edu.au/~s336454/

#!/usr/bin/perl
#
# File:          Seqloc.pl
#
# Topic:         Software written for the thesis project A WebCrawler for
#                automated location of Genomic Sequences
#
# Method:        Makes use of variables loaded from file to locate and
#                maintain a set of files in the local file system. Also
#                provides a graphical interface for the user to control
#                the software.
#
# Written by:    Steven Mayocchi, Jan 2001 - Oct 2001.
# Student No.:   33364540
# Last modified: 16/10/01

require LWP::UserAgent;
require LWP::RobotUA;
use Tk;
use Crawl;
use LWP::Simple;
use LWP::MediaTypes qw(guess_media_type);
use Mail::Sendmail;

#set up global variables
$optionname = 'var.txt';   #options store
$linkname   = 'link.txt';  #records store
$abouttext  = "Sequence Locator\n\n Version 1.0.0 \n\n written by Steven Mayocchi";
$curr_url   = "";
@records      = ();   #all previous records
@include_ext  = ();   #extensions to include
@exclude_ext  = ();   #extensions to exclude
$newfiles = 0;
$oldfiles = 0;


#Start off by getting the options from disk #open the options file &open_file($optionname,'<',OPTION); #grab the variables from the file line by line # (ignore the first line, it is a comment) chomp($emailaddress = <OPTION>); chomp($emailaddress = <OPTION>); chomp($notification = <OPTION>); chomp($timebetween = <OPTION>); chomp($starttime = <OPTION>); chomp($finishtime = <OPTION>); chomp($zip = <OPTION>); chomp($zippath = <OPTION>); chomp($tar = <OPTION>); chomp($tarpath = <OPTION>); chomp($gz = <OPTION>); chomp($gzpath = <OPTION>); chomp($debug = <OPTION>); chomp($maxsize = <OPTION>); chomp($basepath = <OPTION>); #finally grab all the extensions of interest from the file chomp($temp = <OPTION>); while ($temp ne "") { push (@include_ext, $temp); chomp ($temp = <OPTION>); } chomp ($temp = <OPTION>); while ($temp ne "") { push (@exclude_ext, $temp); chomp ($temp = <OPTION>); } #Next get all the records from file #first open the file &open_file($linkname,'<',LINKS); #use the method implemented in crawl.pm to obtain the #records $finished=0; while ($finished == 0) { $temp = Crawl::getfromfile(*LINKS{IO}); if ($temp->url() eq "") { #last record has been obtained $finished = 1; } else { push (@records,$temp); } } #close the files close OPTION; close LINKS; #set up the user agent that connects to the internet if ($debug == 0) { #not debugging, use robot UA #set up robot user agent 40

$ua = new LWP::RobotUA 'seqloc/0.1', $emailaddress; #setup delay between requests to a particular server ua->delay($timebetween); # be very nice, go slowly } else { # debugging in progress, set up normal user agent $ua = new LWP::UserAgent; } #set up a separate user agent for header requests, as these requests #are small there will not be a delay between them. $headua = new LWP::UserAgent; #set the base path chdir $basepath; #now start setting up the GUI #set up the main window my $mw = MainWindow->new; $mw->title("Sequence Locator"); #create a frame in the upper part of the window my $f = $mw->Frame(-relief => 'ridge', -borderwidth => 2); $f->pack(-side => 'top', -anchor => 'n', -expand => 1, -fill => 'x'); #create the menus $functionmenu = $f->Menubutton(-text => 'Function', -tearoff => 0); $optionmenu = $f->Menubutton(-text => 'Options', -tearoff => 0); $helpmenu = $f->Menubutton(-text => 'Help', -tearoff => 0, -menuitems => [[ 'command' => "About", -command => \&do_about]]); #pack the menus into the upper frame $functionmenu->pack(-side => 'left'); $optionmenu->pack(-side =>'left'); $helpmenu->pack(-side => 'right'); #create the submenu's for use under the main menu options $searchmenu = $functionmenu->menu->Menu(-tearoff => 0); $searchmenu->AddItems(["command" => "File Search", -command => \&do_file_search); $downloadmenu = $functionmenu->menu->Menu(-tearoff => 0); $downloadmenu->AddItems( ["command" => "Files to Download", -command => \&do_show_download], ["command" => "Path for Download", -command => \&do_download_path], ["command" => "Download now", -command => \&do_download]); $functionmenu->cascade(-label => "Search"); $functionmenu->cascade(-label => "Download"); $functionmenu->entryconfigure("Search", -menu => $searchmenu); $functionmenu->entryconfigure("Download", -menu => $downloadmenu); $functionmenu->AddItems( ["command" => "Exit", -command => \&do_exit]);


$searchoptmenu = $optionmenu->menu->Menu(-tearoff => 0); $searchoptmenu->AddItems( ["command" => "Search Delay", -command => \&do_time_between], ["command" => "Edit Links", -command => \&do_add_links], ["command" => "Edit Extensions", -command => \&do_edit_extensions], ["command" => "Maximum Size", -command => \&do_max_size]); $downloadoptmenu = $optionmenu->menu->Menu(-tearoff => 0); $downloadoptmenu->AddItems( ["command" => "Postprocessing", -command => \&do_postprocessing_select], ["command" => "Base Path", -command => \&do_path]);

$optionmenu->cascade(-label => "Search"); $optionmenu->cascade(-label => "Download"); $optionmenu->entryconfigure("Search", -menu => $searchoptmenu); $optionmenu->entryconfigure("Download", -menu => $downloadoptmenu); $optionmenu->AddItems( ["command" => "Autonotification", -command => \&do_emailaddress]);

#add a picture to the bottom half of the main window $seqloc = $mw->Photo(-file => "dna4.gif"); $mw->Label(-image => $seqloc)->pack; #begin the mainloop, this displays the main window MainLoop; #the code here is executed after the main window is closed. #it will save relevant files and update data where necessary &open_file($optionname,'>',OPTION); #print the options to the file line by line print OPTION "\# do not modify the variables in this file directly!!!\n"; print OPTION "$emailaddress\n"; print OPTION "$notification\n"; print OPTION "$timebetween\n"; print OPTION "$starttime\n"; print OPTION "$finishtime\n"; print OPTION "$zip\n"; print OPTION "$zippath\n"; print OPTION "$tar\n"; print OPTION "$tarpath\n"; print OPTION "$gz\n"; print OPTION "$gzpath\n"; print OPTION "$debug\n"; print OPTION "$maxsize\n"; print OPTION "$basepath\n"; foreach $ext (@include_ext) { print OPTION "$ext\n"; } #add a space of one line print OPTION "\n";


foreach $ext (@exclude_ext) { print OPTION "$ext\n"; } #finished rewriting the options files close OPTION;

if ($notification == 1) { # problems with email working here #&send_email(); } #save all the records to file &open_file($linkname,'>',LINK); foreach $rec (@records) { $rec->print(*LINK); }

#the remainder of the software is the various sub routines #called by the graphical interface routines and also #other support routines sub do_path { if (! Exists ($pathwindow)) { $pathwindow = $mw->Toplevel(); $pathwindow->title("Set Path"); $pathwindow->Label(-text => "Complete path to the base directory for save of data")->pack( -side => 'top', -pady => 5); $pathentry = $pathwindow->Entry(-text => \$basepath)->pack(-side => 'top'); $pathentry->bind("<Return>", sub {$basepath = $pathentry->get(); $pathwindow->withdraw();}); $pathwindow->Button(-text => "OK", -command => sub { $basepath = $pathentry->get(); $pathwindow->withdraw(); })->pack( -side => 'bottom', -pady => 5, -ipadx => 5); } else { $pathwindow->deiconify(); $pathwindow->raise(); } } sub do_exit { #destroys the main window but will still do any code after the main loop #this is the best way of exiting from the software $mw->destroy(); }

sub do_about { #displays the about window under the help menu if (! Exists ($aboutwindow)) { $aboutwindow = $mw->Toplevel(); $aboutwindow->title("About"); $aboutwindow->Label(-textvariable => \$abouttext)->pack; } else { 43

$aboutwindow->deiconify(); $aboutwindow->raise(); } } sub do_emailaddress { #obtains the users emailaddress if (! Exists ($emailwindow)) { $emailwindow = $mw->Toplevel(); $emailwindow->title("Autonotification Emails"); my $f1 = $emailwindow->Frame(-relief => 'ridge', -borderwidth => 2); $f1->pack(-side => 'bottom', -anchor => 's', -expand => 1, -fill => 'x'); my $f2 = $emailwindow->Frame(-relief => 'ridge', -borderwidth =>2); $f2->pack(-side => 'top',-anchor => 'n', -expand => 1, -fill => 'x'); $f1->Label(-text => "Email address to send autonotification emails to:" )->pack( -side => 'top', -pady => 5); $emailentry = $f1->Entry(-text => \$emailaddress)->pack( -side => 'top', -fill => 'x'); $emailentry->bind("<Return>", sub { $emailaddress = $emailentry->get(); $emailwindow->withdraw(); }); $f1->Button(-text => "OK", -command => sub { $emailaddress = $emailentry->get(); $emailwindow->withdraw(); })->pack( -side => 'bottom', -pady => 5, -padx => 5); $f2->Checkbutton(-text => "Send Autonotification Emails", -variable => \$notification)->pack; } else { $emailwindow->deiconify(); $emailwindow->raise(); } } sub do_max_size { #graphical interface to determine maximum allowable # size for web pages to be searched if (! Exists ($sizewindow)) { $sizewindow = $mw->Toplevel(); $sizewindow->title("Maximum Size"); $sizewindow->Label(-text => "Maximum size (bytes) for web pages to search through:")->pack( -side => 'top', -pady => 5); $sizeentry = $sizewindow->Entry(-text => \$maxsize)->pack( -side => 'top', -fill => 'x'); $sizeentry->bind("<Return>", sub { $maxsize = $sizeentry->get(); $sizewindow->withdraw(); }); $sizewindow->Button( -text => "OK", -command => sub { $maxsize = $sizeentry->get(); $sizewindow->withdraw(); })->pack( -side => 'bottom', -pady => 5, -ipadx => 5); } else { 44

$sizewindow->deiconify(); $sizewindow->raise(); } }

sub do_time_between { #graphical interface to obtain the time in minutes between #requests to a server if (! Exists ($timebwindow)) { $timebwindow = $mw->Toplevel(); $timebwindow->title("Time Between"); $timebwindow->Label(-text => "Time between requests to a server (in minutes):" )->pack( -side => 'top', -pady => 5); $timebentry = $timebwindow->Entry(-text => \$timebetween)->pack(-side => 'top'); $timebentry->bind("<Return>", sub { $timebetween = $timebentry->get(); $timebwindow->withdraw(); }); $timebwindow->Button( -text => "OK", -command => sub { $timebetween = $timebentry->get(); $timebwindow->withdraw(); })->pack( -side => 'bottom', -pady => 5, -ipadx => 5); } else { $timebwindow->deiconify(); $timebwindow->raise(); } } sub do_download_times { #this subroutine is not used in the present version #it implements a graphical interface to allow the user #to set times between which the software is able to run if (! Exists ($downtwindow)) { $downtwindow = $mw->Toplevel(); $downtwindow->title("Times for download (0-24 hrs)"); $downtwindow->Label(-text => "Times between which downloads should occur:" )->pack(-side => 'top', -pady => 5); $downtentrys = $downtwindow->Entry(-text => \$starttime )->pack(-side => 'left', -pady => 5, -padx => 5); $downtentryf = $downtwindow->Entry(-text => \$finishtime )->pack(-side => 'right', -pady => 5, -padx => 5); $downtwindow->Button(-text => "OK", -command => sub {$starttime = $downtentrys->get(); $finishtime = $downtentryf->get(); $downtwindow->withdraw();} )->pack(-side => 'bottom', -pady => 5, -ipadx => 5); 45

} else { $downtwindow->deiconify(); $downtwindow->raise(); } } sub do_postprocessing_select { #set up decompression of files if (! Exists ($postwindow)) { $postwindow = $mw->Toplevel(); $postwindow->title("Postprocessing Selection"); $postwindow->Label(-text => "Path to unzip program" )->pack(-side => 'top', -pady => 5); $postwindow->Entry(-textvariable => \$zippath)->pack(

-side => 'top', -pady => 5, -padx => 5);

-side => 'top', -pady => 5); $postwindow->Label(-text => "Path to untar program" )->pack(-side => 'top', -pady => 5); $postwindow->Entry(-textvariable => \$tarpath)->pack(-side => 'top', -pady => 5, -padx => 5); $postwindow->Checkbutton(-text => "tar", -variable => \$tar)->pack( -side => 'top', -pady => 5); $postwindow->Label(-text => "Path to ungz" )->pack(-side => 'top', -pady => 5); $postwindow->Entry(-textvariable => \$gzpath)->pack(-side => 'top', -pady => 5, -padx => 5); $postwindow->Checkbutton(-text => "gz", -variable => \$gz)->pack( -side => 'top', -pady => 5); } else { $postwindow->deiconify(); $postwindow->raise(); } } sub do_edit_extensions { #modify extensions to include and exclude if (! Exists ($extwindow)) { my $extwindow = $mw->Toplevel(); $extwindow->title("Edit Extensions"); my $f1 = $extwindow->Frame(-relief => 'ridge', -borderwidth => 2); $f1->pack(-side => 'top', -anchor => 'n', -expand => 1, -fill => 'x'); my $f2 = $extwindow->Frame(-relief => 'ridge', -borderwidth =>2); $f2->pack(-side => 'bottom',-anchor => 's', -expand => 1, -fill => 'x'); $f1->Label(-text => 'Extensions to search for')->pack(-side => 'top'); $f2->Label(-text => 'Extensions to exclude from search')->pack(-side => 'top'); my $subf1 = $f1->Frame(-relief => 'ridge', -borderwidth => 2); 46

$postwindow->Checkbutton(-text => "zip", -variable => \$zip)->pack(

$subf1->pack(-side => 'left', -anchor => 'w', -expand => 1, -fill => 'y'); my $subf2 = $f1->Frame(-relief => 'ridge', -borderwidth =>2); $subf2->pack(-side => 'right',-anchor => 'e', -expand => 1, -fill => 'y'); my $subf3 = $f2->Frame(-relief => 'ridge', -borderwidth => 2); $subf3->pack(-side => 'left', -anchor => 'w', -expand => 1, -fill => 'y'); my $subf4 = $f2->Frame(-relief => 'ridge', -borderwidth =>2); $subf4->pack(-side => 'right',-anchor => 'e', -expand => 1, -fill => 'y'); my $addextlb = $subf1->Scrolled("Listbox",-scrollbars => 'se', -selectmode => 'multiple' )->pack( -padx => 5, -pady => 5); $addextlb->insert('end',@include_ext); my $addextentry = $subf2->Entry()->pack(-side => 'top', -padx => 5, -pady => 5); $addextentry->bind("<Return>", sub { push (@include_ext, $addextentry->get()); $addextlb->insert('end', $addextentry->get()); }); $subf2->Button(-text => "Add", -command => sub { my $found = 0; foreach (@include_ext) { if ($_ eq $addextentry->get()) { $found = 1; } } if ($found == 0) { push (@include_ext, $addextentry->get()); $addextlb->insert('end', $addextentry->get()); } })->pack(-side => 'top', -padx => 5, -pady => 5);

$subf2->Button(-text => "Remove", -command => sub { my @selected = $addextlb->curselection(); while ($selected[0] ne "") { my $sel = pop(@selected); splice (@include_ext,$sel,1); $addextlb->delete($sel); } })->pack(-side=> 'bottom', -padx => 5, -pady => 5); my $remextlb = $subf3->Scrolled("Listbox",-scrollbars => 'se', -selectmode => 'multiple' )->pack( -padx => 5, -pady => 5); $remextlb->insert('end',@exclude_ext); my $remextentry = $subf4->Entry()->pack(-side => 'top', -padx => 5, -pady => 5); $remextentry->bind("<Return>", sub {

push (@exclude_ext, $remextentry->get()); $remextlb->insert('end', $remextentry->get());

}); $subf4->Button( -text => "Add", -command => sub {my $found = 0; 47

foreach (@exclude_ext) { if ($_ eq $remextentry->get()) { $found = 1; } } if ($found == 0) { push (@exclude_ext, $remextentry->get()); $remextlb->insert('end', $remextentry->get()); } })->pack(-side => 'top', -padx => 5, -pady => 5); $subf4->Button(-text => "Remove", -command => sub { my @selected = $remextlb->curselection(); while ($selected[0] ne "") { my $sel = pop(@selected); splice (@exclude_ext,$sel,1); $remextlb->delete($sel); } })->pack(-side=> 'bottom', -padx => 5, -pady => 5); } else { $extwindow->deiconify(); $extwindow->raise(); } } sub do_download { #this routine occurs when the download operation starts #it goes through all of the records and downloads all the records #with the download field set to now #make the main window disappear from the screen $mw->withdraw(); #progressively download all the records my $rec; foreach $rec (@records) { if ($rec->download() eq 'now') { #build a http request $req = new HTTP::Request 'GET' => $rec->url(); #send request print $rec->directory()."\\".$rec->filename(); print "attempting to download \n"; $res = $ua->request($req,$rec->directory()."\\".$rec->filename()); $rec->download('when changed'); my ($ext) = $rec->filename() =~ /.*\.([^.]*)$/; #perform decompression if it is allowed if (($zip)&($ext ='zip')) { system("$zippath "."$rec->directory"."\\"."$rec->filename()"); } if (($tar)&($ext ='tar')) { 48

system("$tarpath "."$rec->directory"."\\"."$rec->filename()"); } if (($gz)&($ext ='gz')) { system("$gzpath "."$rec->directory"."\\"."$rec->filename()"); } } } $mw->deiconify(); $mw->raise(); }

sub do_download_path { #set up each of the files for download with a directory to be saved too if (! Exists ($downwindow)) { my $downwindow = $mw->Toplevel(); $downwindow->title("Download Locations"); my $f1 = $downwindow->Frame(-relief => 'ridge', -borderwidth => 2); $f1->pack(-side => 'left', -anchor => 'w', -expand => 1, -fill => 'y'); my $f2 = $downwindow->Frame(-relief => 'ridge', -borderwidth =>2); $f2->pack(-side => 'right',-anchor => 'e', -expand => 1, -fill => 'y'); my $f4 = $downwindow->Frame(-relief => 'ridge', -borderwidth => 2); $f4->pack(-side => 'bottom', -anchor => 's', -expand => 1, -fill => 'x'); my $f3 = $downwindow->Frame(-relief => 'ridge', -borderwidth => 2); $f3->pack(-side => 'bottom', -anchor => 's', -expand => 1, -fill => 'x'); $f1->Label(-text => 'Files to be downloaded')->pack(-side => 'top'); $f2->Label(-text => 'Directories to save files to')->pack(-side => 'top'); $f3->Label(-text => 'View which files')->pack(-side => 'top'); -scrollbars => 'se', -selectmode => 'single', -exportselection => 0 )->pack( -padx => 5, -pady => 5); my $dirlb = $f2->Scrolled("Listbox", -scrollbars => 'se', -selectmode => 'single', -exportselection => 0 )->pack(-padx => 5, -pady => 5); opendir DIR, $basepath; my @directories = grep -d "$basepath/$_",readdir DIR; closedir DIR; #next line possibly system dependant?(removes '.' and '..') splice(@directories,0,2); $dirlb->insert('end',@directories); my $rec; foreach $rec (@records) { if (($rec->type() eq 'new')&($rec->download() eq 'now')) { $filelb->insert('end',$rec->url()); } } my $view = 'new'; $f3->Radiobutton(-text => 'New', -value => 'new', -variable => \$view, -command => sub { $filelb->delete(0,'end'); 49 my $filelb = $f1->Scrolled("Listbox",

foreach $rec (@records) { if (($rec->type() eq 'new')& ($rec->download() eq 'now')) {$filelb->insert('end',$rec->url());} } })->pack(-side => 'top'); $f3->Radiobutton(-text => 'Old', -value => 'old', -variable => \$view, -command => sub { $filelb->delete(0,'end'); foreach $rec (@records) { if (($rec->type() eq 'primary')& ($rec->download() eq 'now')) {$filelb->insert('end',$rec->url());} } })->pack(-side => 'top'); $f3->Radiobutton(-text => 'All', -value => 'all', -variable => \$view, -command => sub { $filelb->delete(0,'end'); foreach $rec (@records) { if ($rec->download() eq 'now') { $filelb->insert('end', $rec->url()); } } })->pack(-side => 'top'); $f4->Button(-text => "Link", -command => sub { my $sel1 = $filelb->curselection(); my $sel2 = $dirlb->curselection(); my $rec; foreach $rec (@records) { if (($sel1 ne "")&($sel2 ne "")) { if ($rec->url() eq $filelb->get($sel1)) {$rec>directory($basepath. $dirlb->get($sel2)); $rec->type('primary'); $filelb->delete($sel1); last; } } } })->pack(-side => 'top', -padx => 5, -pady => 5); } else { $downwindow->deiconify(); $downwindow->raise(); } } sub do_add_links { #modify the base links used in the search if (! Exists ($linkwindow)) { my $linkwindow = $mw->Toplevel(); 50

$linkwindow->title("Edit Links"); my $f1 = $linkwindow->Frame(-relief => 'ridge', -borderwidth => 2); $f1->pack(-side => 'left', -anchor => 'w', -expand => 1, -fill => 'y'); my $f2 = $linkwindow->Frame(-relief => 'ridge', -borderwidth =>2); $f2->pack(-side => 'right',-anchor => 'e', -expand => 1, -fill => 'y'); my $linklb = $f1->Scrolled("Listbox", -scrollbars => 'e', -selectmode => 'multiple' )->pack(-padx => 5, -pady => 5); my $record; foreach $record (@records) { if ($record->type() eq 'source') { $linklb->insert('end',$record->url()); } } my $linkentry = $f2->Entry()->pack(-side => 'top', -padx => 5, -pady => 5); my $found = 0; $linkentry->bind("<Return>", sub { foreach $record (@records) { if ($record->url() eq $linkentry->get()) {$found = 1;} } if ($found == 0) { $record = Crawl::new(); $record->url(linkentry->get()); $record->type('source'); $record->download('never'); push (@records, $record); $linklb->insert('end', $linkentry->get()); } }); $f2->Button(-text => "Add", -command => sub { foreach $record (@records) { if ($record->url() eq $linkentry->get()) {$found = 1;} } if ($found == 0) { $record = Crawl::new(); $record->url($linkentry->get()); $record->type('source'); $record->download('never'); push (@records, $record); $linklb->insert('end', $linkentry->get()); }

} )->pack(-side => 'top', -padx => 5, -pady => 5); $f2->Button(-text => "Remove", -command => sub { my @selected = $linklb->curselection(); while ($selected[0] ne "") { my $sel = pop(@selected); splice (@records,$sel,1); $linklb->delete($sel); } })->pack(-side=> 'bottom', -padx => 5, 51

-pady => 5); } else { $linkwindow->deiconify(); $linkwindow->raise(); } } sub do_file_search { #this function implements the initial crawl for files $mw->withdraw(); #first develop a queue of base urls for the search my @searchqueue = (); my $rec=Crawl::new(); foreach $rec (@records) { if ($rec->type() eq 'source') { push (@searchqueue,$rec->url()); } } my @files = &locate_from_list(\@searchqueue,\@include_ext); #have generated list of files, now process them into the records my $file; foreach $file (@files) { #check if the file that has been found is already in the records $found = 0; foreach $rec (@records) { if ($file eq $rec->url()) { $found = 1; last; } } if ($found == 0) { #a file has been found that is not in the records #add a new entry to the records $newfiles = $newfiles + 1; my $new = Crawl::new(); $new->url($file); #work out the file name ($temp) = $file =~ m|.*/([^/]*)$|; $new->filename($temp); $new->type('new'); $new->size(check_size($new->url())); $new->download('never'); #check through records and see if the new record is a mirror of another my $rec2 = Crawl::new(); foreach $rec2 (@records) { if ($rec2->type() eq 'primary') { #only compare to primary sources of files if ($rec2->filename() eq $temp) { #found the same filename, possible mirror? if ($rec2->size() == $new->size) { #files have same size, record as a mirror $new->type('mirror'); } else { #just a likely mirror $new->type('mirror?'); } } } 52

} push (@records, $new); } }

$mw->deiconify(); $mw->raise(); } sub do_show_download { #this is a graphical interface that displays the files that have #been found in the search if (! Exists ($window)) { my $window = $mw->Toplevel(); $window->title("Select URL\'s for download"); my $f1 = $window->Frame(-relief => 'ridge', -borderwidth => 2); $f1->pack(-side => 'left', -anchor => 'w', -expand => 1, -fill => 'y'); my $f2 = $window->Frame(-relief => 'ridge', -borderwidth =>2); $f2->pack(-side => 'right',-anchor => 'e', -expand => 1, -fill => 'y'); my $listbox1 = $f1->Scrolled("Listbox", -scrollbars => 'se', -selectmode => 'multiple' )->pack( -side => 'bottom', -padx => 5, -pady => 5); $f1->Label(-text => "URL\'s Found in Search")->pack(-side => 'top'); my $listbox2 = $f2->Scrolled("Listbox", -scrollbars => 'se', -selectmode => 'multiple' )->pack( -side => 'bottom', -padx => 5, -pady => 5); $f2->Label(-text => "URL\'s To be downloaded")->pack(-side => 'top'); #find any records that are new or that have been changed and add them to the display my $record; my $temp; foreach $record (@records) { if ($record->type() eq 'new') { $listbox1->insert('end',$record->url()); } elsif ($record->download() eq 'when changed') { #check if the size has changed $temp = check_size($rec->url()); if (( $temp != $record->size())&($record->url() ne "")) { $oldfiles = $oldfiles +1; #size has changed, add to download list $record->size($temp); $record->download('now'); } } }

#find all the records that have been changed and that need to be downloaded and #display them. my $subaddref = sub { my @selected = $listbox1->curselection(); my $rec; while ($selected[0] ne "") { #get the entry for manipulation 53

my $sel = pop(@selected); #add it to the other listbox $listbox2->insert('end', $listbox1->get($sel)); foreach $rec (@records) { if ($rec->url() eq $listbox1->get($sel)) { $rec->download('now'); #this is the last iteration of the loop last; } } #remove it from the listbox $listbox1->delete($sel); } }; my $subremref = sub { my @selected = $listbox2->curselection(); my $rec; while ($selected[0] ne "") { #get the entry for manipulation my $sel = pop(@selected); #add it to the other listbox $listbox1->insert('end', $listbox2->get($sel)); foreach $rec (@records) { if ($rec->url() eq $listbox2->get($sel)) { $rec->download('never'); #this is the last iteration of the loop last; } } #remove it from the listbox $listbox2->delete($sel); } }; my $button1 = $window->Button(-text => "Add", -command => $subaddref)->pack(-side => 'top', -padx => 5, -pady => 5); my $button2 = $window->Button(-text => "Remove", -command => $subremref)->pack(-side => 'bottom', -padx => 5, -pady => 5); } else { $window->deiconify(); $window->raise(); } } sub open_file { #this subroutine is just a simple way of opening a file local ($filename,$parameters,$handle) = @_; open $handle, "$parameters$filename" or die "Cannot open $filename : $!" } sub locate_from_list { #this subroutine implements the actual web crawler local ($firstref,$secondref,$thirdref) = @_; #local (@urls,@include_ext) = @_; 54

local (%queue,@result,$count,$newURL,$protocol,$rest); local ($req,$res,$pagetext,$server_host,$port,$document); local (@anchors,$anchor,$new_host,$new_port,$new_document); local ($success,$new_protocol,$new_rest); %queue = (); #print "URL's\n"; foreach $url (@$firstref) { $queue{$url}=0; } #obtain a new url from the queue $thisURL = &find_new(%queue); $count = 0; @result = (); #loop until there are no more urls to check while ($thisURL ne "") { #get the protocol from the URL ($protocol,$rest) = $thisURL =~ /^([^:\/]*):(.*)$/; # check that the protocol is http or ftp, only continue #if that is the case if (($protocol eq "http")|($protocol eq "ftp")) { #split out the hostname port and document ($server_host,$port,$document) = $rest =~ m|^//([^:/]*):*([0-9]*)/*([^:]*)$|; #get the contents of the page and #build a http request $req = new HTTP::Request 'GET' => $thisURL; #make sure that the data we receive is in a format #we can process (ie. text/html) #this is important with ftp $req->header(Accept=> "text/html"); #send request for the webpage $res = $ua->request($req); #check the outcome if ($res->is_success) { $pagetext = $res->content; #remove CR/LF characters $pagetext =~tr/\r\n//d; #remove html comments $pagetext =~ s|<~--[^>]*-->||g; #locate all the links on the page (@anchors) = $pagetext =~ m|<A[^>]*HREF\s*=\s*"([^">]*)"|gi; foreach $anchor (@anchors) { #make sure the anchor we have found is fully specified $newURL = &fqURL($thisURL,$anchor); if ($queue{$newURL} > 0) { #increment count of url we have already looked #at, count shows how many references have been #made to that variable $queue{$newURL}++; }else { #look at the extensions on the file and see if they #are of interest or they should be ignored. ($new_protocol,$new_rest) = $newURL =~ /^([^:\/]*):(.*)$/; ($new_host,$new_port,$new_document) = $new_rest =~ m|^//([^:/]*):*([0-9]*)/*([^:]*)$|; if ($new_host eq $server_host) { #check if our document has one of the 55

#invalid extensions $success = 0; foreach $extension (@$thirdref) { $success = 1 if $new_document =~ /.*\.$extension$/i; } if ($success == 0) { #current document does not #have one of the excluded #extensions, check if it is a file #of interest $success = 0; foreach $extension (@$secondref) { $success = 1 if $new_document =~ /.*\.$extension$/i; } #now if success = 1 then we have a file #of interest if ($success == 1) { push(@result,$newURL); } else { #possibly another url to be checked #if it is not a link to a file of interest #check it's type firstly check based on #the url name if (guess_media_type($newURL) == "text/html") { #check that it is small if (check_size($newURL) < $maxsize ) {$queue{$newURL} =0;} } } } } } } } else { print "Error: " . $res->status_line . "\n"; } } #record that this url has been visited and then set #the url for the next iteration of the main loop $queue{$thisURL}++; $thisURL = &find_new(%queue); } return @result; }

sub check_size {
    #obtains the header for the file and returns the size
    local ($url) = @_;
    local ($req, $res);
    $req = HTTP::Request->new(HEAD => $url);
    $res = $headua->request($req);
    return $res->content_length;
}

sub fqURL {
    #this subroutine takes a partial url and turns it into a fully specified URL
    local ($thisURL, $anchor) = @_;
    local ($has_proto, $has_lead_slash, $currprot, $currhost, $newURL);
    # strip anything following a '#' as it is just a reference to a position in a page
    $anchor =~ s|#.*$||;
    # examine the anchor to see what parts of the url have been specified
    $has_proto = 0;
    $has_lead_slash = 0;
    $has_proto = 1 if ($anchor =~ m|^[^/:]+:|);
    $has_lead_slash = 1 if ($anchor =~ m|^/|);
    if ($has_proto == 1) {
        #protocol is specified, therefore assume the address is fully specified
        $newURL = $anchor;
    } elsif ($has_lead_slash == 1) {
        #document has a leading slash, so it only needs the protocol and host prepended
        ($currprot, $currhost) = $thisURL =~ m|^([^:/]*):/+([^:/]*)|;
        $newURL = $currprot . "://" . $currhost . $anchor;
    } else {
        #anchor must be a relative pathname, append it to the current url's directory
        ($newURL) = $thisURL =~ m|^(.*)/[^/]*$|;
        $newURL .= "/" if (! ($newURL =~ m|/$|));
        $newURL .= $anchor;
    }
    return $newURL;
}

#find an unchecked url in the url queue
sub find_new {
    #this subroutine locates any URL's left in the queue that have not been searched
    #declare some variables local only to this subroutine
    local (%queue) = @_;
    local ($key, $value);
    while (($key, $value) = each (%queue)) {
        return $key if ($value == 0);
    }
    return "";
}

sub send_email {
    #This subroutine should send an email when the software completes,
    #however it is presently bugged
    %mail = (
        To      => 's336454@student.uq.edu.au',
        From    => 'steve_m25@hotmail.com',
        Message => "Test"
    );
    sendmail(%mail) or die $Mail::Sendmail::error;
    print "OK. Log says:\n", $Mail::Sendmail::log;
}


Appendix C

Code Listing for Crawl.pm


The code listed in this appendix is also available for download on the Internet at
http://student.uq.edu.au/~s336454/

package Crawl; #implements a new record sub new { my $self = {}; $self->{URL} = undef; $self->{SIZE} = 0; $self->{FILENAME} = undef; $self->{DIRECTORY} = undef; $self->{TYPE} = undef; $self->{DOWNLOAD} = 'never'; bless($self); return $self; } #method to set/get the url sub url { my $self = shift; if (@_) { $self->{URL} = shift } return $self->{URL}; } #method to set/get the size sub size { my $self = shift; if (@_) { $self->{SIZE} = shift } return $self->{SIZE}; } #method to set/get the filename sub filename { my $self = shift; if (@_) { $self->{FILENAME} = shift } return $self->{FILENAME}; } #method to set/get the directory sub directory { my $self = shift; if (@_) { $self->{DIRECTORY} = shift } return $self->{DIRECTORY};


} #method to set/get the type sub type { my $self = shift; if (@_) { $self->{TYPE} = shift } return $self->{TYPE}; } #method to set/get the download type sub download { my $self = shift; if (@_) { $self->{DOWNLOAD} = shift } return $self->{DOWNLOAD}; } #method to print a record sub print { my $self = shift; if (@_) { my $file = shift; print $file $self->{URL},"\n"; print $file $self->{SIZE},"\n"; print $file $self->{FILENAME},"\n"; print $file $self->{DIRECTORY},"\n"; print $file $self->{TYPE},"\n"; print $file $self->{DOWNLOAD},"\n"; } else { print $self->{URL},"\n"; print $self->{SIZE},"\n"; print $self->{FILENAME},"\n"; print $self->{DIRECTORY},"\n"; print $self->{TYPE},"\n"; print $self->{DOWNLOAD},"\n"; } return 1; } #method to obtain a record from file sub getfromfile { my $self = Crawl->new(); local *FH = shift; chomp(my $temp =<FH>); ($self->{URL}) =$temp; chomp($temp = <FH>); ($self->{SIZE}) = $temp; chomp($temp = <FH>); ($self->{FILENAME}) = $temp; chomp($temp = <FH>); ($self->{DIRECTORY}) = $temp; chomp($temp = <FH>); ($self->{TYPE}) = $temp; chomp($temp = <FH>); ($self->{DOWNLOAD}) = $temp; return $self; } 1; # so the require or use succeeds

