
The 9th International Conference on Computer Supported Cooperative Work in Design Proceedings

Web Robot Detection Techniques Based on Statistics of Their Requested URL Resources

Weigang Guo 1,2, Shiguang Ju 2, Yi Gu 2

1 Information Center, Foshan University, Foshan, Guangdong, 528000, P.R. China
wgguo@fosu.edu.cn
2 School of Computer Science, Jiangsu University, Jiangsu, 212013, P.R. China
jushig@ujs.edu.cn

Abstract

Following the wide use of search engines, the impact Web robots have on Web sites should not be ignored. After analyzing the navigational patterns of Web robots from Web logs, two new algorithms are proposed. One is based on classification and statistics of requested URL resources: it classifies the URL resources into eight types and counts the number of sessions of each client and the number of visiting records with the same type. The other is based on Web page member lists: it constructs one member list for every Web page and one show table for every visitor. The experiment shows that the two new algorithms can detect unknown robots and unfriendly robots that do not obey the Standard for Robot Exclusion.

Keywords: Search Engine, Web Robot Detection, Content Classification, Webpage Member List, Web Log

1. Introduction

A Web robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced. Web robots are often used as resource discovery and retrieval tools for Web search engines such as Google, Lycos, etc.

But the robots' automatic visits to Web sites also cause many problems. First, considering business secrets, many E-commerce Web sites do not want unauthorized robots to retrieve information from their sites. Second, many E-commerce Web sites need to analyze their visitors' browsing behavior, but such analysis can be severely distorted by the presence of Web robots [1]. Third, many government Web sites also do not want their information collected and indexed by robots, for various reasons. Fourth, poorly-designed robots often consume lots of network and server resources, affecting the visits of normal customers. So, it is necessary for Web site managers to detect Web robots among all the visitors, and take proper measures to redirect the Web robots or stop responding to HTTP requests coming from the unauthorized robots.

The commonly used detection method is to set up a database of known robots [2], and compare the IP address and User-Agent fields of the HTTP request message against the known robots. But this method can detect only the well-known robots. There are three simple techniques to detect unknown robots from Web logs:

(1) According to the SRE (Standard for Robot Exclusion) [3], whenever a robot visits a Web site, it should first request a file called robots.txt. So, by examining the user sessions generated from Web logs, new robots that follow the SRE can be found. However, the standard is voluntary, and many robots do not obey it.

(2) Most robots do not assign any value to the "referrer" field in their HTTP request messages, so in the Web log the "referrer" field is empty (="-"). If a user session contains a large number of requests with empty referrer fields, the visitor is a "suspicious" robot. But, as Web browsers can sometimes generate HTTP messages with empty referrer values, this method is not reliable either.

(3) When checking the validity of the hyperlink structure, most robots use the HEAD request method to reduce the burden on Web servers. Therefore, one can examine user sessions with lots of HEAD requests to discover potential robots. Similarly, as Web browsers can also sometimes generate HEAD requests, this method is not reliable either.

To solve the problem, Pang-Ning Tan and Vipin Kumar [1] adopted the C4.5 decision tree algorithm to classify robot visits and human visits based on the characteristics of the robots' access patterns. Their method can effectively detect unknown robots, but it is somewhat complicated. In this paper, after analyzing the access patterns of Web robots, we propose two new and simple algorithms to detect Web robots based on the statistics of their requested URL resources.


2. The differences between robot visits and human visits

There are great differences between robot visits and human visits.

(1) When a person inputs a URL address in the browser (e.g. Microsoft Internet Explorer), the browser sends the HTTP request to the target server. According to the HTTP protocol [4], after the server has received the request it checks whether it has the document specified by the URL. If it does, it sends out that document; otherwise, it returns an error message. The browser then parses the document it receives. If it is a single object, such as a picture, the browser shows it directly. If it is an HTML document, the browser analyzes the embedded and linked objects in the document (such as image files, animation files, script files, cascading style sheet files and frames, etc.), and then continuously and automatically sends HTTP requests to the server until all the embedded objects have been requested. On the server side, the server sends out all the requested documents in order after receiving the client's requests. When the browser has received all the embedded objects, it "assembles" them and generates a complete Web page from the human point of view. So, one request by a person may generate several records in the Web server logs, and all the embedded objects show up in the log. And there is no obvious characteristic in the file types of the URL fields of these records, because the embedded objects of a Web page are arbitrary.

(2) The robot is different. Usually, after getting a URL (assuming it is an HTML document) from the URL list that is waiting to be visited, the robot sends an HTTP request to the target server. The robot also analyzes the embedded objects and hyperlinks within the received HTML document after receiving the server's response, and then adds the embedded hyperlinks to the waiting list according to its visiting rules. Here, the treatment of embedded objects (such as image files, animation files, script files, cascading style sheet files and frames, etc.) may differ. Some search engine robots add the URLs of these objects to the waiting list as well, while others discard them directly, or modify the links of the objects in the HTML document rather than requesting them immediately. But one thing is the same: the robots do not send the requests for the embedded objects to the server at once. Therefore, one request of a robot visit leaves only one record in the server logs, which exactly represents the request of the robot. And the file type of the requested document is usually the Web page type (.htm, .asp, .php, etc.). So, in a session, all the file types of the URLs requested by a robot are of the webpage type. If the purpose of a robot is to collect image files or music files, as it has already requested and analyzed the Web pages in previous sessions, the file types of the requested URLs of the current session are all jpg, gif, png (an image search engine robot) or mp3, rm (a music search engine robot).

3. The algorithm based on classification and statistics of requested URL resources

The algorithm classifies the URL resources into eight types and counts the number of sessions of each client and the number of visiting records with the same type.

3.1 The classification of URL resources

The contents of Web sites can be classified into eight major types (see Table 1). According to the classification, each record of the Web logs can be assigned a value of a certain type. It is worth pointing out that, as many robots retrieve non-text documents (e.g. PDF files), we can regard "document" and "webpage" as the same type.

Table 1. The classification of the content of a Web site

Type Name    Description (file types)
webpage      Web pages (htm, html, asp, pl, php)
document     Non-text documents (doc, ppt, pdf, ps)
script       Script files (js, css, vbs)
image        Images (jpg, gif, png, bmp)
music        Music (mid, mp3, wma, rm)
video        Video and animation files (swf, avi, mpeg)
download     Compressed files (zip, rar, tgz, exe)
others       any other file types
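As an illustration, this classification step reduces to a small extension lookup. The following Python sketch mirrors Table 1; the function name and the treatment of extensionless (directory) requests are our own assumptions, not part of the paper:

    import os
    from urllib.parse import urlparse

    # Extension lists transcribed from Table 1.
    TYPE_MAP = {
        "webpage":  {".htm", ".html", ".asp", ".pl", ".php"},
        "document": {".doc", ".ppt", ".pdf", ".ps"},
        "script":   {".js", ".css", ".vbs"},
        "image":    {".jpg", ".gif", ".png", ".bmp"},
        "music":    {".mid", ".mp3", ".wma", ".rm"},
        "video":    {".swf", ".avi", ".mpeg"},
        "download": {".zip", ".rar", ".tgz", ".exe"},
    }

    def classify_url(url: str) -> str:
        """Map a requested URL to one of the eight content types of Table 1."""
        path = urlparse(url).path
        ext = os.path.splitext(path)[1].lower()
        if not ext:
            return "webpage"  # assumption: directory requests serve a default page
        for type_name, extensions in TYPE_MAP.items():
            if ext in extensions:
                return type_name
        return "others"

As remarked above, "document" can simply be mapped to "webpage" when the two types are merged.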
3.2 The detection algorithm

Step 1: Data preprocessing. Each record of the log files is processed as:

R=<IP, agent, time, url, t>

where IP is the address of the client, agent is the user agent field of the HTTP message, url is the requested URL resource, time is the request time, and t is the type of the requested content, computed from the url field according to Table 1. All the records form the user visiting record set.

Step 2: Generating the user session set S. First, sort the Web logs by IP field, agent field and time field as the first, second and third keys, and then treat the records with the same IP and agent fields as one visitor's visiting records. If the time interval between two visits is more than a fixed time length T, the two visits are regarded as belonging to two different sessions. Usually, T is between 15 and 30 minutes.

The user session set S is represented as:

S=<IP, agent, m, t1, t2, ..., tm>

where m is the total number of records and t1, t2, ..., tm are the types of the requested content of each record in the session.


Step 3: Computing the number of each type of requested content in every session, and finding the content type with the largest count. The user session set is now represented as:

S=<IP, agent, m, t, n>

where t is one of the eight types defined in Table 1 and n is the number of records of type t.

Step 4: Sifting out all the sessions with m = n, and forming the robot candidate set C, which is represented as:

C=<IP, agent, m, t>

Step 5: Merging the sessions and forming the merging-session set M. In set C, the total number of sessions and visiting records of every visitor with the same type are calculated. Because search engine robots usually use several hosts within one class C IP address range to retrieve information, the visitors coming from the same class C IP address and having the same user agent field can be regarded as one visitor. Set M is represented as:

M=<IP, agent, Snumber, Rnumber, t>

where Snumber represents the total number of sessions of type t generated by the visitors coming from the same class C IP address and having the same user agent field, and Rnumber represents the corresponding total number of visiting records. Considering that there are lots of occasional visitors, the items in set M whose Snumber and Rnumber both equal 1 are deleted.

Step 6: Checking Snumber and Rnumber; if they exceed a certain threshold, the visitor can be regarded as a robot. The threshold value can differ from Web site to Web site. For example, the threshold of Snumber can be set to 2 while that of Rnumber can be set to 5.
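As a compact illustration of Steps 2-6, the following Python sketch runs the whole pipeline over pre-parsed records. The record layout, the helper classify_url from the sketch in Section 3.1, and the concrete thresholds are illustrative assumptions, not the paper's code:

    from collections import Counter, defaultdict
    from datetime import timedelta

    T = timedelta(minutes=20)   # session gap; Step 2 suggests 15-30 minutes
    S_MIN, R_MIN = 2, 5         # example thresholds from Step 6

    def split_sessions(records):
        """Step 2: split one visitor's time-sorted records into sessions.
        Each record is a (time, url, content_type) tuple with datetime times."""
        session = []
        for rec in records:
            if session and rec[0] - session[-1][0] > T:
                yield session
                session = []
            session.append(rec)
        if session:
            yield session

    def detect_robots(log):
        """Steps 3-6. `log` maps (ip, agent) -> time-sorted record list."""
        merged = defaultdict(lambda: [0, 0])          # Step 5 accumulator
        for (ip, agent), records in log.items():
            for session in split_sessions(records):
                counts = Counter(rec[2] for rec in session)
                t, n = counts.most_common(1)[0]       # Step 3: dominant type
                if n == len(session):                 # Step 4: keep m == n only
                    net = ip.rsplit(".", 1)[0]        # Step 5: class C prefix
                    merged[(net, agent, t)][0] += 1             # Snumber
                    merged[(net, agent, t)][1] += len(session)  # Rnumber
        return [key for key, (snum, rnum) in merged.items()
                if (snum, rnum) != (1, 1)                     # drop occasional visitors
                and (snum >= S_MIN or rnum >= R_MIN)]         # Step 6 thresholds

A visitor is reported once per content type; merging by the /24 network prefix stands in for the "same class C IP address" rule of Step 5.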
4. The algorithm based on Web page member lists

The algorithm first constructs a member list for every Web page, and then generates a ShowTable for every visitor's requested URLs. Robots can be detected from the ShowTable.

4.1 The construction of the Web page member list

Definition 1: A Web page is a collection of information, consisting of one or more Web resources, intended to be rendered simultaneously, and identified by a single URL. More specifically, a Web page consists of a Web resource with zero, one, or more embedded Web resources intended to be rendered as a single unit, and referred to by the URL of the one Web resource which is not embedded. [5]

Definition 2: The member list of a Web page is defined by a three-tuple:

t=<webpage, memberset, n>

where webpage is the URL of the Web page, memberset is the aggregate of all the URLs of the embedded objects and is represented as {m1, m2, ..., mn}, mi (i=1, ..., n) is the URL of an embedded object, and n is the number of members.

The embedded objects mi include: 1) multimedia files (images, sounds, animations, etc.) defined by the SRC attribute of the IMG, BGSOUND, EMBED and OBJECT tags of HTML; 2) frames defined by the SRC attribute of the FRAME and IFRAME tags; 3) cascading style sheet files linked by the HREF attribute of the LINK tag; 4) script files linked by the SRC attribute of the SCRIPT tag; 5) Java applet class files linked by the CODE attribute of the APPLET tag.

For example, if a Web page index.htm consists of three frames a.htm, b.htm and c.htm, every frame has one picture named t1.jpg, t2.jpg and t3.jpg respectively, and there is a cascading style sheet file style.css in a.htm, then, supposing all the files are in the root directory of the Web site, the member list of index.htm is:

<index.htm, {a.htm, b.htm, c.htm, t1.jpg, t2.jpg, t3.jpg, style.css}, 7>

The number of members of a multimedia file is 0.

We developed an HTML analyzer to analyze the HTML tags and their attributes and generate the Web page member list for every file in the Web site. All the member lists constitute the member list set of the Web site.
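The paper does not list the analyzer itself; a minimal sketch of such an extractor, built on Python's standard html.parser and covering the five kinds of members enumerated above, might look like this (all names are ours):

    from html.parser import HTMLParser

    # (tag, attribute) pairs whose values are embedded members (Section 4.1)
    MEMBER_ATTRS = {
        ("img", "src"), ("bgsound", "src"), ("embed", "src"), ("object", "src"),
        ("frame", "src"), ("iframe", "src"), ("script", "src"),
        ("applet", "code"),
    }

    class MemberListBuilder(HTMLParser):
        """Collect the URLs of a page's embedded objects (its member set)."""

        def __init__(self):
            super().__init__()
            self.members = set()

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            for t, a in MEMBER_ATTRS:
                if tag == t and attrs.get(a):
                    self.members.add(attrs[a])
            # a <link href=...> pulls in a cascading style sheet member
            if tag == "link" and attrs.get("href"):
                self.members.add(attrs["href"])

    def member_list(url, html_text):
        """Return the three-tuple <webpage, memberset, n> of Definition 2."""
        builder = MemberListBuilder()
        builder.feed(html_text)
        return url, builder.members, len(builder.members)

Note that the index.htm example above counts the frames' own pictures and style sheet transitively; a per-file extractor like this one needs a second pass to fold frame members into the parent page's list.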


4.2 The detection algorithm

Step 1: Data preprocessing. Sort the Web logs by IP field, agent field and time field as the first, second and third keys, then treat the records with the same IP and agent fields as one visitor's visiting records and assign them a label uid. Each record of the log files is processed as:

R=<uid, url, time>

where uid is the label of each distinct visitor, url is the requested URL resource, and time is the request time. The visiting record set of the user is represented as:

S=<uid, {(url1, time1), ..., (urlk, timek)}>

where k is the total number of visiting records.

Step 2: Constructing a ShowTable (see Table 2), which records the actual attendance of the members of the visited Web pages for every visitor. In the ShowTable, url is the URL of the Web page, NumberOfMember is the number of members of the Web page and is obtained from the Web site's member list set, and ShowNumber is the total number of members that appear in the visitor's visiting record set.

Table 2. ShowTable: records the actual attendance of the members of the visited Web pages for every visitor

url           NumberOfMember  ShowNumber
Index.htm     7               0
Computer.htm  0               0
Camera.htm    3               0
Apple.jpg     0               0
...           ...             ...

The algorithm for computing ShowNumber is as follows:
Player requests a MP3 file, lots o f records (sometimes
for each r ∈ S do
    if the URL type of r is a multimedia file (images, sounds)
    then ShowNumber := 0;
    else
        for each member of r do
            judge whether this member appears in its close succeeding sequence;
            if it does appear
            then { ShowNumber := ShowNumber + 1;
                   delete this member's record }
            next member;
        end for
    next r;
end for
Here, the close succeeding sequence of r means all the visiting records behind r within a certain time interval in the visiting record set S. The interval can be from 0 to 30 seconds. The reason is that, if the visitor is a person, the browser usually requests the embedded objects within 0-5 seconds, and the request intervals will not exceed 30 seconds; otherwise the visitor becomes impatient and gives up or exits. For robot visits, according to their retrieving strategy, if a robot requests the embedded objects at all, the time interval is usually larger than 30 seconds.
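Rendered as runnable Python, the ShowNumber computation might read as follows. The record layout matches Step 1, member_lists is the output of the extractor sketched in Section 4.1, and the 30-second window follows the discussion above; the function names are our own:

    from datetime import timedelta

    WINDOW = timedelta(seconds=30)   # length of the close succeeding sequence

    def show_numbers(records, member_lists):
        """Compute the ShowTable of one visitor (uid).
        records: time-sorted (url, time) pairs; member_lists: url -> member set."""
        table = {}
        for i, (url, time) in enumerate(records):
            members = member_lists.get(url)
            if not members:                  # multimedia or unknown file: no members
                table.setdefault(url, 0)
                continue
            pending = set(members)
            shown = 0
            for next_url, next_time in records[i + 1:]:
                if next_time - time > WINDOW:
                    break                    # past the close succeeding sequence
                if next_url in pending:
                    shown += 1
                    pending.discard(next_url)   # "delete this member's record"
            table[url] = table.get(url, 0) + shown
        return table

    def looks_like_robot(table):
        """Step 3 below: all ShowNumber fields equal to 0 suggests a robot."""
        return all(n == 0 for n in table.values())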
Step 3: To judge whether a uid is a robot, the simple method is to check the ShowNumber fields of its ShowTable. If all the ShowNumber fields of its visited URLs equal 0, we can conclude that the uid is a robot.

5. Experiments

Our experiments were performed on the Foshan University server logs (http://202.192.168.245) collected from January 21st to February 7th, 2004 (these days fall in the winter holidays and the Chinese Spring Festival, so only a few people visited the university Web site during this period).

First, we adopt the algorithm described in Section 3 to detect the robots, and then adopt the algorithm described in Section 4 to verify the results.

The Web log of January 21st, 2004 contains a total of 50740 records, of which 24 records come from different IP addresses and agents that request robots.txt. Using the algorithm proposed in Section 3 with the time interval T=20 minutes, 7432 sessions are created. There are 6742 sessions in the robot candidate set C, of which 424 are of the "webpage" type, 128 of the "image" type, 6165 of the "music" type and 7 of the "animation" type. We did not process the "music" sessions because the requests of different clients generate completely different server log records. For example, when Microsoft Windows Media Player requests an MP3 file, lots of records (sometimes more than 30) may be produced in the server log, and the agent fields may differ. So, only a total of 559 "webpage", "image" and "animation" type sessions in the robot candidate set C are used for further processing.

The final result is that there are 253 clients (their IP address and agent combinations are different) in the merging-session set M. The average session number of the 253 clients is 2.2, the maximal session number is 96 and the minimal is 1. The average number of visiting records (requested resources) is 5.1, the maximal is 249 and the minimal is 1.

5.1 The wholeness of detection

In the original logs, there are 24 different clients (with different IP addresses and agents) that have requested robots.txt, among which 20 appear in the merging-session set M. By checking the original log files, it is discovered that the other 4, which do not appear in the merging-session set M, requested nothing but the robots.txt file. As they are filtered out in Step 5, it is reasonable that they do not appear in the merging-session set M. So the wholeness of detection of well-behaved robots (those that request robots.txt) is 100%.

5.2 The accurateness of detection

Which of the 253 clients in the merging-session set M are real robots? The criteria are to check their Snumber and Rnumber. We can choose appropriate Snumber and Rnumber values to determine whether a client is a robot, based on the following assumptions: 1) when a robot visits a Web site, it usually divides its retrieving task into several sub-tasks and may produce somewhat more sessions than a human user; 2) it usually requests somewhat more content than a human user. We got 28 clients filtered from the merging-session set M by setting Snumber>=2 or Rnumber>=5 (approximately the average values of set M). Of the 28 clients, 20 requested robots.txt, while the other 8 did not. The eight clients are shown in Table 3.


Table 3. The robots found in the experiment that do not request robots.txt

IP address      Agent                                                         Type     Snumber  Rnumber
210.72.21.199   HTML-GET-APP                                                  webpage  20       20
216.88.158.142  Mozilla/4.0+compatible+ZyBorg/1.0+(wn.zyborg@looksmart.net;  webpage  11       26
                +http://www.WISEnutbot.com)
66.196.72.103   Mozilla/5.0+(Slurp/cat;+slurp@inktomi.com;                    webpage  46       50
                +http://www.inktomi.com/slurp.html)
66.196.90.125   Mozilla/5.0+(Slurp/cat;+slurp@inktomi.com;                    webpage  3        6
                +http://www.inktomi.com/slurp.html)
202.96.63.3     User-Agent:+Mozilla/4.0+(compatible;+MSIE+5.5;                webpage  18       707
                +Windows+NT+5.0)
219.133.39.15   -                                                             image    15       249
205.188.209.37  Mozilla/4.0+(compatible;+MSIE+6.0;+AOL+9.0;                   image    3        10
                +Windows+NT+5.1)
66.237.60.91    Openfind+data+gatherer,+Openbot/3.0+(robot-response@          image    96       126
                openfind.com.tw;+http://www.openfind.com.tw/robot.html)

Whether the remaining clients of the 253 in the merging-session set M can be regarded as robots will be determined according to their later visits. If the number of their visit sessions and visit records exceeds a certain value, they can be detected. This is confirmed by our experiments using the following days' server logs.

When adopting the algorithm described in Section 4, we found that all the ShowNumber values in the ShowTables of these robots' requested URLs are 0.

6. Conclusions

Our detection algorithms are simple, but they are effective, have a high accurateness, and need only a few records to detect whether a visitor is a robot. The weakness of the two algorithms is that if the Web pages of a site contain only plain text and no images or sounds, the algorithms may regard a human visitor as a robot. And if a person uses a very simple browser, or sets his browser not to display images and not to play sounds, the algorithms may mistake a human visit for a robot visit. For future work, we would like to take the hyperlinks of the Web pages into account to detect search engine robots more effectively.

References

[1] Pang-Ning Tan, Vipin Kumar. "Discovery of Web Robot Sessions Based on Their Navigational Patterns". Data Mining and Knowledge Discovery, 2002, 6(1): 9-35.
[2] The Web Robots Database. http://www.robotstxt.org/wc/active.html.
[3] Robots Exclusion. http://www.robotstxt.org/wc/exclusion.html.
[4] Hypertext Transfer Protocol - HTTP/1.1. http://www.w3.org.
[5] Web Characterization Terminology & Definitions Sheet. http://www.w3.org/1999/05/WCA-terms/01/.
