Coimbatore, India Coimbatore, India
sujatha.padmakumar@rediffmail.com. mpunitha_srcw@yahoo.co.in
Abstract: Web Usage Mining (WUM) is the automatic discovery of user access patterns from web servers. Organizations collect large volumes of data in their daily operations, generated automatically by web servers and collected in server access logs. Analysis of these logs can also provide information on how to restructure a website to serve its users effectively. This paper presents how to mine the secondary data (web logs) derived from the users' interaction with the web pages during a certain period of Web sessions. First, an ant-based clustering algorithm is applied to pre-processed log files to extract frequent patterns, which are then displayed in an interpretable format; second, a decision tree method is used to find and predict the user's navigation behavior. Two types of approaches are used: the offline phase is based on ant-based clustering and the online phase is based on decision trees. The experimental results indicate that the approach can improve the quality of clustering for user navigation patterns in web usage mining systems. These results can be used for predicting a user's next request in large web sites.

Keywords - Web usage mining, web mining, web log files, classification and navigation pattern

I. INTRODUCTION

The term web mining was coined by Etzioni in 1996 to signify the use of data mining techniques to automatically discover web documents and services, uncover general patterns on the web, and observe user behavior (viewing, bookmarking and browsing history). Web mining is the process of finding out what users are looking for on the internet. Some users might be looking at only textual data, whereas others might be interested in multimedia data. Web mining is classified into three categories: web content mining, web structure mining and web usage mining.

Web usage mining focuses on techniques that could predict user behavior while the user interacts with the web. As mentioned before, the mined data in this category are the secondary data on the web that result from this interaction. These data range very widely, but generally they are classified into usage data that reside in the web client, the proxy server and the web servers. The aim of understanding the navigation preferences of visitors is to enhance the quality of electronic commerce (e-commerce) services, to personalize Web portals, or to improve the Web structure and Web server performance. Web usage mining has three stages: the first stage is preprocessing, the next stage is pattern discovery and the last stage is pattern analysis.
Fig 1: General Architecture for Web Usage Mining

II. WEB USAGE MINING ARCHITECTURE

A. Preprocessing

Pre-processing "consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery". This step can be broken into at least four sub-steps: Data Cleaning, User Identification, Session Identification and Formatting. In the data cleaning step, unneeded data are deleted from the raw data in the web log files. At least two log file formats exist: Common Log File format (CLF) and Extended Log File format (see [16] for more details). Our university log file consists of these fields: Date, Time, client IP address,

Pattern Analysis is the final stage of WUM (Web Usage Mining), which involves the validation and interpretation of the mined patterns.

Validation: to eliminate the irrelevant rules or patterns and to extract the interesting rules or patterns from the output of the pattern discovery process.

Interpretation: the output of mining algorithms is mainly in mathematical form and not suitable for direct human interpretation.

III. RELATED WORK

Identifying Web browsing strategies is a crucial step in Website design and evaluation, and requires approaches that provide information on both the extent of any particular type of user behavior and the motivations for such behavior [9]. Pattern discovery from web data is the key component of web mining and it converges algorithms and
undirected graph based on connectivity between Referrer and URI pages was presented, along with a preprocessing method to process raw web log files and a formula for assigning weights to the edges of the undirected graph. Ant-based clustering, due to its flexibility and self-organization, has been applied in a variety of areas, from problems arising in e-commerce to circuit design, and from text mining to web mining (Jianbin et al., 2000). The various works proposed in this area, with particular emphasis on web usage mining, clustering and classification, were reviewed in this section. The present work is another attempt to propose a hybrid system that uses clustering and classification methods to discover users' navigation patterns and analyze them from the server's

Figure 2: Offline & Online phase

A. Offline phase of the architecture

This phase consists of two major modules: Data Pretreatment and Navigation Pattern Mining. The phase starts with the primary Web-Log Preprocessing (data pretreatment) to extract user navigation sessions from the dataset, followed by a clustering algorithm to mine navigational patterns.

B. Online phase of the architecture

During the online phase, when a new request arrives at the server, the URL requested and the session to which the user belongs are identified, the underlying knowledge base is updated, and a list of suggestions is appended to the requested page [6].

C. Prediction Engine
proposed ant-based data clustering algorithm (shown in Figure 3), which resembles the ant behavior described in [4].

Figure 3: Ant based algorithm

E. Decision Trees

Decision trees are a powerful way of knowledge representation. The models produced by decision trees are represented in the form of a tree structure. A leaf node indicates the

Input: training samples, represented by discrete attributes; the set of candidate attributes, attribute-list.
Output: set of classes
Method:
1. Create a node N;
2. If samples are all of the same class C, then return N as a leaf node labeled with the class C;
3. If attribute-list is empty, then return N as a leaf node labeled with the most common class in samples (majority voting);
4. Select test-attribute, the attribute among attribute-list with the highest information gain ratio;
5. Label node N with test-attribute;
6. For each known value ai of test-attribute:
7.    Grow a branch from node N for the condition test-attribute = ai;
8.    Let si be the set of samples in samples for which test-attribute = ai;
9.    If si is empty then
10.      Attach a leaf labeled with the most common class in samples;
11.   Else attach the node returned by generate-decision-tree

V. EXPERIMENTAL EVALUATION

In order to test the effectiveness of the proposed system, a server web log data file was obtained. The system was tested with data collected over 90 days; for ease of discussion, the data used in the experiments presented here are from one day, that is, ... section 3, the preprocessing is conducted in four steps, namely (i) Cleaning, (ii) User Identification, (iii) Session Identification and (iv) Formatting.
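The four preprocessing steps can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the log line layout (date, time, client IP, URI, status), the list of ignored file suffixes, the 30-minute session timeout and all function names are assumptions made for the example, and users are identified by IP address alone, which is a simplification.

```python
from datetime import datetime, timedelta

# Hypothetical log lines: "date time client_ip uri status" (assumed layout).
RAW_LOG = [
    "2011-03-01 10:00:05 122.178.146.123 /index.html 200",
    "2011-03-01 10:00:06 122.178.146.123 /logo.gif 200",
    "2011-03-01 10:02:40 122.178.146.123 /courses.html 200",
    "2011-03-01 10:03:10 10.0.0.7 /index.html 200",
    "2011-03-01 10:04:00 10.0.0.7 /missing.html 404",
    "2011-03-01 11:15:00 122.178.146.123 /index.html 200",
]

IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed noise
SESSION_TIMEOUT = timedelta(minutes=30)                      # assumed threshold


def parse(line):
    """Turn one log line into (timestamp, ip, uri, status)."""
    date, time, ip, uri, status = line.split()
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return stamp, ip, uri, int(status)


def clean(records):
    """Step (i): drop failed requests and embedded resources."""
    return [r for r in records
            if r[3] == 200 and not r[2].lower().endswith(IGNORED_SUFFIXES)]


def identify_users(records):
    """Step (ii): group time-ordered requests by client IP."""
    users = {}
    for stamp, ip, uri, _ in sorted(records):
        users.setdefault(ip, []).append((stamp, uri))
    return users


def identify_sessions(users):
    """Step (iii): start a new session when the gap exceeds the timeout."""
    sessions = []
    for ip, visits in users.items():
        current = [visits[0]]
        for prev, cur in zip(visits, visits[1:]):
            if cur[0] - prev[0] > SESSION_TIMEOUT:
                sessions.append((ip, current))
                current = []
            current.append(cur)
        sessions.append((ip, current))
    return sessions


def format_sessions(sessions):
    """Step (iv): map URIs to integer page ids and emit (ip, [ids]) rows."""
    page_ids = {}
    rows = []
    for ip, visits in sessions:
        ids = [page_ids.setdefault(uri, len(page_ids) + 1) for _, uri in visits]
        rows.append((ip, ids))
    return rows, page_ids


records = clean([parse(line) for line in RAW_LOG])
rows, page_ids = format_sessions(identify_sessions(identify_users(records)))
for ip, ids in rows:
    print(ip, ids)
```

Each output row pairs a client with the ordered page ids of one session, which is the kind of input a clustering step could consume. A production system would likely combine IP address with user agent or cookies for user identification.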
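As an illustration of the decision tree procedure listed under Decision Trees, the sketch below selects the test attribute by information gain ratio and grows the tree recursively. It is a simplified reading of that listing, not the authors' code: the session records, the attribute names (`prev`, `cur`, `next`) and the helper functions are hypothetical. Branches are grown only for attribute values observed in the samples, so the empty-subset case of steps 9 and 10 is handled by a majority-class default at prediction time.

```python
import math
from collections import Counter

# Hypothetical training samples: previous page, current page, next page.
SESSIONS = [
    {"prev": "P1", "cur": "P4", "next": "P11"},
    {"prev": "P1", "cur": "P4", "next": "P11"},
    {"prev": "P1", "cur": "P8", "next": "P17"},
    {"prev": "P6", "cur": "P8", "next": "P17"},
    {"prev": "P6", "cur": "P3", "next": "P11"},
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(samples, attribute, target):
    """Information gain of `attribute` divided by its split information."""
    base = entropy([s[target] for s in samples])
    partitions = {}
    for s in samples:
        partitions.setdefault(s[attribute], []).append(s[target])
    total = len(samples)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    split_info = -sum(len(p) / total * math.log2(len(p) / total)
                      for p in partitions.values())
    if split_info == 0:  # a single attribute value cannot split the samples
        return 0.0
    return (base - remainder) / split_info


def majority(samples, target):
    """Most common class among the samples (majority voting)."""
    return Counter(s[target] for s in samples).most_common(1)[0][0]


def build_tree(samples, attributes, target):
    classes = {s[target] for s in samples}
    if len(classes) == 1:                      # step 2: all one class
        return classes.pop()
    if not attributes:                         # step 3: no attributes left
        return majority(samples, target)
    # step 4: attribute with the highest gain ratio
    best = max(attributes, key=lambda a: gain_ratio(samples, a, target))
    node = {"attribute": best, "branches": {},
            "default": majority(samples, target)}
    remaining = [a for a in attributes if a != best]
    for value in {s[best] for s in samples}:   # steps 6 to 8
        subset = [s for s in samples if s[best] == value]
        node["branches"][value] = build_tree(subset, remaining, target)
    return node


def predict(tree, sample):
    """Follow branches until a leaf; unseen values fall back to the default."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(sample[tree["attribute"]], tree["default"])
    return tree


tree = build_tree(SESSIONS, ["prev", "cur"], "next")
print(tree["attribute"], predict(tree, {"prev": "P1", "cur": "P8"}))
```

On this toy data the current page separates the classes perfectly, so it is chosen as the root test attribute.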
122.178.146.123 1 4 11 {1, 4, 11,
6 15 4 15}

Figure 6: Extracted navigation patterns

NP number   Navigational pattern
1           (P1, P15, P3, P8, P17)
2           (P1, P6, P3, P11, P15, P17, P23)
3           (P1, P2, P8, P6, P17)
4           (P1, P4, P9, P11, P23)
5           (P1, P8, P13, P17)
6           (P1, P4, P11, P15)

Figure 7: Navigation patterns generated by the clustering algorithm

This paper presented a new method to extract navigational patterns from web logs. The work focused on grouping the frequently accessed patterns of interested users. It assists web site designers in improving the performance of the web site by giving preference to the patterns navigated by regular, interested users. After the clustering is completed, alignment processing is applied to the extracted sequences in each cluster to extract a representative for each cluster. A classification algorithm is used in the online phase to predict the user's future request.
[4] Deneubourg, J.L., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C. and Chretien, L. (1990) The Dynamics of Collective Sorting: Robot-Like Ants and Ant-Like Robots, From Animals to Animats, Proc. of the 1st Int. Conf. on Simulation of Adaptive Behaviour, Pp. 356-363.
[5] Dixit, D. and Gadge, J. (2010) A New Approach for Clustering of Navigation Patterns of Online Users, International Journal of Engineering Science and Technology, Vol. 2, No. 6, Pp. 1670-1676.
[6] Handl, J. and Meyer, B. (2002) Improved ant-based clustering and sorting in a document retrieval interface, Proceedings of the Seventh International Conference on Parallel Problem Solving from Nature, Vol. 2439 of LNCS, Springer-Verlag, Berlin, Germany, Pp. 913-923.
[7] Jalali, M., Mustapha, N., Mamat, A. and Sulaiman, M.N.B. (2008a) A new clustering approach based on graph partitioning for navigation patterns mining, 19th International Conference on Pattern Recognition, Pp. 1-4.
[8] Jalali, M., Mustapha, N., Mamat, A. and Sulaiman, N.B. (2008b) Web user navigation pattern mining approach based on graph partitioning algorithm, Journal of Theoretical and Applied Information Technology, Pp. 1125-1131.
[9] Jalali, M., Mustapha, N., Sulaiman, N.B. and Mamat, A. (2008c) A web usage mining approach based on LCS algorithm in online predicting recommendation systems, 12th International Conference Information Visualization, IEEE Computer Society, Pp. 302-307.
[10] Jespersen, S.E., Thorhauge, J. and Bach, T. (2002) A Hybrid Approach to Web Usage Mining, Data Warehousing and Knowledge Discovery, LNCS 2454, Y. Kambayashi, W. Winiwarter, M. Arikawa (Eds.), Pp. 73-82.