A Survey of Preprocessing Method For Web Usage Mining Process

International Journal of Computer Trends and Technology (IJCTT) volume 9 number 2 Mar 2014
ISSN: 2231-2803 http://www.ijcttjournal.org Page62

Harmit Kaur 1, Hardeep Singh 2
1 (Department of CSE, Lovely Professional University, India)
2 (Department of ECE, Lovely Professional University, India)

Abstract
The number of web applications is growing rapidly,
and so is the number of their users. As the number of
users grows, the size of the log files grows as well.
The information stored in log files cannot be used
directly for analysis, so the log files must be
preprocessed to improve the quality of the web usage
mining process. Preprocessing of log data improves
the performance of the other two steps, pattern
discovery and pattern analysis. Preprocessing
involves data cleaning, user identification, session
identification, and path completion. This paper
surveys different preprocessing techniques and
identifies the better techniques for improving
performance.

Keywords: web usage mining, web server log,
preprocessing of log file

1. Introduction

Data mining is the process of finding useful
information in large databases. It is a knowledge
discovery process that uses different techniques to
extract knowledge from a database. Web data mining
is an application of data mining: it is the process of
extracting information from the web. Web data mining
is categorized into three types: web content mining,
web structure mining, and web usage mining [1]. Web
content mining is the extraction of content from web
sources; there are many different web sources from
which a user can get information. Web content mining
is divided into two types, text mining and multimedia
mining. Web structure mining is the process of
discovering the link structure of the web. Many tools
are available for retrieving information from web
pages, but these tools ignore the valuable
information contained in web links. Web usage mining
is the process of extracting user behavior by
applying data mining techniques. It deals with the
information used to understand the behavior of users
who interact with a web site. This information is
used to improve the structure of the website, improve
its performance, and provide fast and reliable access
to users. Web usage mining is divided into three
phases: preprocessing of log data, pattern discovery,
and pattern analysis. Preprocessing of log files is a
complex task, but it improves the quality of the other
two steps, pattern discovery and pattern analysis.
This paper is organized as follows. Section 2 reviews
the literature on web usage mining. Sections 3 and 4
describe the sources and the attributes of log files.
Section 5 explains the formats of log files. Section 6
describes the web usage mining process. Section 7
describes applications of web usage mining. The
conclusion is given in Section 8.

2. Literature review

Paper [1] describes the preprocessing techniques used
in data cleaning, data filtering, path completion,
user identification, session identification, and web
session clustering. The authors describe the
different sources of log files, log file formats,
preprocessing techniques, the algorithms applied,
and the data that supports the preprocessing phase.
They survey the techniques used in the preprocessing
phase.
In paper [2], web log data preprocessing is divided
into four steps: log consolidation, data cleaning,
user identification, and transaction identification.
Log consolidation is the first step, in which the
logs from different servers are combined in one place
for data cleaning. The next step, data cleaning, is
divided into two parts: the first is page element
cleaning, in which files with the extensions .gif,
.jpeg, and .jpg are removed, and the second is the
cleaning of other information, such as files with the
extensions .css, .xsl, .xsd, and .dll.
User identification relies on the IP address, the
agent file, and the referrer page. Transaction
identification breaks a large transaction into
several smaller ones or combines small transactions
into a larger one. It is performed using the
reference length and maximal forward reference
methods.
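As an illustration of transaction identification, the
following sketch splits one user's page sequence into
maximal forward references: a transaction ends
whenever the user moves back to a page that is already
on the current path. The function name and the sample
page names are hypothetical, and this is a minimal
sketch of the general idea, not necessarily the exact
procedure used in [2].

```python
def maximal_forward_references(pages):
    """Split a visit sequence into maximal forward references."""
    transactions = []
    path = []               # pages on the current forward path
    moving_forward = False  # True if the previous step was a forward step
    for page in pages:
        if page in path:
            # Backward reference: emit the path once, then back up to `page`.
            if moving_forward:
                transactions.append(list(path))
            path = path[: path.index(page) + 1]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        transactions.append(list(path))
    return transactions


# Hypothetical visit sequence: the user goes A -> B -> C, backs up to B,
# visits D, backs up to A, and finally visits E.
example = ["A", "B", "C", "B", "D", "A", "E"]
result = maximal_forward_references(example)
print(result)
```

Each emitted list is one transaction, so a long visit
is broken into several smaller forward paths.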
In paper [3], Theint Aye explains the web log cleaning
process for mining log data. The author gives an
overview of the web usage mining process and of the
preprocessing phase, and explains two algorithms, one
for data cleaning and the other for field extraction.
After preprocessing, the data is converted into a
structured form, and mining algorithms are then
applied to extract information from it.
In the data cleaning algorithm, records whose URL
refers to image, multimedia, and similar files with
the extensions .jpg, .css, and .gif are removed, but
some irrelevant data cannot be removed by this
algorithm. The second algorithm is field extraction:
a table is created, and the data is converted into a
structured format and stored in the table.
The author of [4] explains data preprocessing of log
files. Preprocessing starts with data fusion and
cleaning, in which data from different sources is
combined and irrelevant entries are removed. User and
session identification are then performed. For
session identification, two methods are given: the
first is time oriented and the second is structure
oriented. Finally, path completion is performed using
graphs.
Paper [6] introduces two preprocessing algorithms, the
first for data cleaning and the second for data
reduction. In the data cleaning algorithm, records
with the extensions .jpg, .gif, and .css are removed,
but records with irrelevant status codes are not, so
an improved algorithm could also remove records by
status code. The data reduction algorithm identifies
the sessions and removes the incomplete session
entries.

3. Sources of log files
There are many sources of log files for the
preprocessing phase. Some of them are described below.
3.1 Client log file
Client log files are the most authentic and accurate
source for depicting user behavior. They store
information about the client and the clicks made by
the client. The information takes the form of a
one-to-many relationship: a client log stores the
websites visited by a particular user.
3.2 Proxy log file
These files are more complex because they store
information in the form of a many-to-many
relationship: one user can visit many sites, and many
users can visit one site. A proxy server is an
intermediary between the client browser and the web
server. It forwards requests from multiple clients to
multiple web servers.
3.3 Server log file
Server log files store information in the form of a
many-to-one relationship: many users can visit one
website. User behavior can be captured from web
server log files, which are accurate and reliable for
the web usage mining process. The information is
stored in different logs; the common server logs are
the access log, error log, referrer log, and agent
log.
The access log records all the clicks, hits, and
accesses made by users. User behavior and user
interests are captured in the access log and mined
from it.
The error log records all errors of the website. For
example, when a user opens a page by clicking a link
and the page is not found, the server returns the
message "error 404 file not found". These logs help
improve the quality of the website by optimizing its
links.
The referrer log records information about the
referrer, the page from which the user jumps to a new
page by following a hyperlink.
The agent log contains information about the browser
and operating system. This information helps in user
identification, and the website designer can also use
it to make the website more compatible with the most
used browsers and operating systems.

4. Attributes of log file
A log file stores information in different
attributes. Some of the important attributes are [1]:
Client IP
The client IP is the IP address of the client machine
from which the user browses the website.
Date and time
The date on which the user made the access is stored
in the format YYYY-MM-DD. The time is stored in the
format HH:MM:SS.
Server client status
The status code returned by the server, such as 200 or
404.
User agent
User agent records the information about the browser
type, version and operating system that is used by the
client at the time of accessing the website.
Referrer
The referrer is the previous page from which the
client jumps to the new web page or website. It links
the current page or website to the previous one.
Server client bytes
Number of bytes sent by the server to the client.
Client server bytes
Number of bytes sent by the client to the server.

5. Formats of log files
Many log formats are available for capturing the
behavior and activities of users on a website. Log
files are text files stored with the .txt extension
in ASCII format. They are used to monitor user
behavior and to gather feedback from the client side.
The common formats for storing log records are the
common log format, the Microsoft IIS log format, and
the extended log format [2].
5.1 Common log format
The common log format stores user activities in a
fixed set of attributes, such as the IP address, date
and time, the request line, the HTTP status code, and
the number of bytes sent. It is a standardized
format [5].
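To make the fixed layout concrete, the sketch below
parses a single line, assuming the widely used layout
host ident authuser [date] "request" status bytes.
The pattern, the function name, and the sample line
are illustrative assumptions, not taken from [5]; the
date parsing also assumes English month abbreviations.

```python
import re
from datetime import datetime

# One named group per field of the assumed common log format layout.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means that no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326')
parsed = parse_clf_line(sample)
print(parsed["host"], parsed["status"], parsed["size"])
```

Lines that do not follow the layout return None, so
malformed entries can be counted or discarded during
data cleaning.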

5.2 Microsoft IIS log format
The Microsoft IIS log format is a customizable format.
It has some additional attributes compared with the
common log format, so it stores extra information
about user behavior and records more data about user
accesses [2].

5.3 Extended log file format
The extended log format is customizable: it is more
flexible and can be tailored to user requirements.
Additional attributes such as the HTTP cookie, HTTP
version, and HTTP user agent are used to capture user
behavior.
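Because the set of attributes is configurable,
extended log files usually declare their columns in a
"#Fields:" directive line. The sketch below reads
that directive and maps each entry to a dictionary.
The function name, the chosen fields, and the sample
entries are hypothetical, and the sketch assumes
whitespace-separated values without embedded spaces.

```python
def parse_extended_log(lines):
    """Parse extended-format lines using the most recent #Fields directive."""
    fields = []
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line: only "#Fields:" changes how entries are read.
            if line.startswith("#Fields:"):
                fields = line[len("#Fields:"):].split()
            continue
        values = line.split()
        # Keep only entries that match the declared number of fields.
        if fields and len(values) == len(fields):
            records.append(dict(zip(fields, values)))
    return records


sample_log = [
    "#Version: 1.0",
    "#Fields: date time c-ip cs-method cs-uri-stem sc-status cs(User-Agent)",
    "2014-03-01 10:15:02 192.0.2.10 GET /index.html 200 Mozilla/5.0",
    "2014-03-01 10:15:09 192.0.2.10 GET /logo.gif 200 Mozilla/5.0",
]
records = parse_extended_log(sample_log)
print(records[0]["c-ip"], records[0]["cs-uri-stem"])
```

Reading the column names from the file itself keeps
the parser independent of how a particular server was
configured.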
6. Web usage mining process

The web usage mining process involves three steps:
preprocessing of the log file, pattern discovery, and
pattern analysis [7]. Its outcome is used to improve
the structure of the web site and for
personalization. Preprocessing is an important phase
because log files cannot be used directly for
analysis. The preprocessing phase is reported to take
about 80% of the time of the whole process.
6.1 Preprocessing of web log data
Preprocessing is an essential step of the web usage
mining process. A web log file is the input to the
preprocessing phase. The quality of the web usage
mining process depends not only on the source of the
log data but also on preprocessing, which determines
the quality of the other two steps, pattern discovery
and pattern analysis. The preprocessing phase
involves data cleaning, user identification, session
identification, and path completion.
6.1.1 Data cleaning
Data cleaning is the first step of the preprocessing
phase. It is the process of removing irrelevant
records from log files. Records for files with the
extensions .gif, .jpeg, and .jpg are removed [6]. All
entries that are not required for the analysis are
removed in this phase: records for sound and graphic
files are removed, and records for other files such
as .css and .xsl are removed as well.
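The extension-based filter described above can be
sketched as follows. The record layout (a dictionary
with a "url" key), the function names, and the sample
records are assumptions made for illustration; the
extension list is the one named in this section.

```python
import os
from urllib.parse import urlparse

# Extensions named in this section as irrelevant for usage analysis.
IRRELEVANT_EXTENSIONS = {".gif", ".jpeg", ".jpg", ".css", ".xsl"}


def is_relevant(record):
    """Return True if the requested URL is not an irrelevant file type."""
    path = urlparse(record["url"]).path  # drops any query string
    extension = os.path.splitext(path)[1].lower()
    return extension not in IRRELEVANT_EXTENSIONS


def clean_log(records):
    """Keep only the records that pass the extension filter."""
    return [record for record in records if is_relevant(record)]


raw_records = [
    {"url": "/index.html"},
    {"url": "/img/logo.GIF"},
    {"url": "/style.css?v=2"},
    {"url": "/products/list"},
]
cleaned = clean_log(raw_records)
print([record["url"] for record in cleaned])
```

The comparison is case-insensitive and ignores query
strings, so "/img/logo.GIF" and "/style.css?v=2" are
both treated as irrelevant.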

Fig. 1 Web log preprocessing


6.1.2 User identification
User identification is an important phase in the
preprocessing of log files. It is the process of
identifying the users who access the website: given
the n records in a server log file, user
identification assigns a user to each record. Users
are identified based on the client IP address,
registered user accounts, and other methods [3].
The most common method uses the IP address. A
different IP address indicates a different user. If
the IP address is the same but the user agent differs,
that is, the operating system or browser type is
different, the record is attributed to a different
user. If the operating system and browser type are
also the same, the referrer page is checked: if it is
null, the record is attributed to a new user;
otherwise it may belong to the same user. This method
is applicable only with static IP addresses. User
identification is a difficult and complex task, and
proxy servers prevent it from giving accurate
results. The result of the web usage mining process
depends on user identification.
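The heuristic above can be sketched as a single pass
over the log. The record keys ("ip", "agent",
"referrer"), the function name, and the sample values
are hypothetical. The sketch follows a literal reading
of the rules, so every request with an empty referrer
starts a new user, which may over-count users in
practice.

```python
def identify_users(records):
    """Assign a user id to each record using IP, user agent and referrer."""
    latest_user = {}  # (ip, agent) -> most recent user id for that pair
    next_id = 0
    assignments = []
    for record in records:
        key = (record["ip"], record["agent"])
        if key not in latest_user or not record["referrer"]:
            # New IP, new agent on a known IP, or an empty referrer: new user.
            next_id += 1
            latest_user[key] = next_id
        assignments.append(latest_user[key])
    return assignments


log = [
    {"ip": "192.0.2.1", "agent": "Firefox/Linux", "referrer": ""},
    {"ip": "192.0.2.1", "agent": "Firefox/Linux", "referrer": "/index.html"},
    {"ip": "192.0.2.1", "agent": "Chrome/Windows", "referrer": "/index.html"},
    {"ip": "192.0.2.2", "agent": "Firefox/Linux", "referrer": "/index.html"},
    {"ip": "192.0.2.1", "agent": "Firefox/Linux", "referrer": ""},
]
user_ids = identify_users(log)
print(user_ids)
```

In the sample, the second record stays with the first
user because the IP address and agent match and the
referrer is not empty, while the last record starts a
new user because its referrer is empty.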
6.1.3 Session identification
A session is the sequence of pages viewed by a single
user during one visit. Session identification is the
process of identifying the activity of each user in
the log file during a period of time. There are two
approaches, time oriented and structure oriented [5].
The time oriented approach depends on the time of
each request in the server log file and has two
variants: in the first, the time gap between the
first and the last request of a session is <= t; in
the second, the time gap between a request and the
next request is <= t.
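Both time oriented variants can be combined in one
pass over a user's sorted request times, as in the
sketch below. The function name and the thresholds of
1800 and 600 seconds are assumptions chosen for
illustration; the text leaves the value of t open.

```python
def split_sessions(timestamps, max_duration=1800, max_gap=600):
    """Split one user's sorted request times (in seconds) into sessions."""
    sessions = []
    current = []
    for ts in timestamps:
        too_long = bool(current) and ts - current[0] > max_duration
        too_idle = bool(current) and ts - current[-1] > max_gap
        if too_long or too_idle:
            # Either limit is exceeded: close the session and start a new one.
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions


# A gap of 900 seconds between 500 and 1400 splits the first sequence.
by_gap = split_sessions([0, 120, 500, 1400, 1500, 1900, 2000])
# All gaps are 500 seconds, but the total duration limit splits at 2000.
by_duration = split_sessions([0, 500, 1000, 1500, 2000])
print(by_gap)
print(by_duration)
```

Passing a very large value for one of the two limits
reduces the function to the other variant alone.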
Structure oriented session identification is based on
the referrer page. A session is identified by
following the pages that the user accessed through
links.
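One possible reading of the structure oriented
approach is sketched below: a request joins the most
recent session that already contains its referrer,
and otherwise starts a new session. The function
name, the (url, referrer) input layout, and the
sample pages are assumptions, not the specific method
of [5].

```python
def referrer_sessions(requests):
    """Group one user's (url, referrer) pairs into sessions via referrers."""
    sessions = []
    for url, referrer in requests:
        # Search the most recent sessions first for one holding the referrer.
        for session in reversed(sessions):
            if referrer and referrer in session:
                session.append(url)
                break
        else:
            # No session contains the referrer (or it is empty): new session.
            sessions.append([url])
    return sessions


clicks = [
    ("/home", ""),
    ("/products", "/home"),
    ("/cart", "/products"),
    ("/blog", ""),
    ("/blog/post1", "/blog"),
    ("/checkout", "/cart"),
]
grouped = referrer_sessions(clicks)
print(grouped)
```

Unlike the time oriented approach, this grouping does
not depend on timestamps, so interleaved visits can
still be separated when their referrers differ.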
6.1.4 Path completion
Path completion is the process of completing the
access path of a user by means of the URL and the
referrer page. The completed access path gives the
full structure of the pages and links accessed by the
user. Graphs are used to represent the access paths
and the path completion process: each node represents
a web page or a website, and each edge represents a
link between pages or websites.
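A minimal sketch of path completion is given below,
assuming that a request whose referrer is not the
previous page means the user stepped back through
cached pages that never reached the server log. The
function name, the (url, referrer) input layout, and
the page names are hypothetical.

```python
def complete_path(requests):
    """Insert pages presumably revisited from cache via the back button."""
    path = []   # completed access path
    stack = []  # pages on the current forward branch (browsing history)
    for url, referrer in requests:
        if referrer in stack:
            # Pop back to the referrer, recording each page passed on the way.
            while stack[-1] != referrer:
                stack.pop()
                path.append(stack[-1])
        stack.append(url)
        path.append(url)
    return path


# The log shows D requested from A right after C, so C -> B -> A is missing.
logged = [("A", ""), ("B", "A"), ("C", "B"), ("D", "A")]
completed = complete_path(logged)
# Consecutive pages in the completed path form the edges of the access graph.
edges = list(zip(completed, completed[1:]))
print(completed)
print(edges)
```

The edge list is one simple way to obtain the graph
representation, with pages as nodes and traversed
links as edges.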
6.2 Pattern discovery
In pattern discovery, knowledge is discovered by
classifying and clustering user activities.
Classification is a technique that assigns data items
to predefined classes. The data items in a particular
class have the same properties [1].
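As a small illustration of clustering user
activities, the sketch below groups sessions whose
page sets overlap strongly. The greedy scheme, the
Jaccard similarity measure, and the threshold of 0.5
are choices made for this example only; they are not
prescribed by [1].

```python
def jaccard(a, b):
    """Jaccard similarity of the page sets of two sessions."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sessions(sessions, threshold=0.5):
    """Greedily add each session to the first sufficiently similar cluster."""
    clusters = []  # the first member of each cluster acts as representative
    for session in sessions:
        for cluster in clusters:
            if jaccard(session, cluster[0]) >= threshold:
                cluster.append(session)
                break
        else:
            clusters.append([session])
    return clusters


sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/blog", "/blog/post1"],
    ["/blog", "/about"],
]
clusters = cluster_sessions(sessions)
print(len(clusters))
```

Here the two shopping sessions end up together, while
the two blog sessions share only one page out of
three and stay in separate clusters.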
6.3 Pattern analysis
Pattern analysis is the final step of the web usage
mining process. It is the analysis of the information
obtained from pattern discovery, and it can be done
using OLAP tools.
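No particular OLAP tool is named here, so the
following is only a tool-independent sketch of the
kind of roll-up such tools perform: page views are
counted along a chosen set of dimensions. The
function name, the dimension names, and the sample
records are hypothetical.

```python
from collections import Counter


def roll_up(records, dimensions):
    """Count page views grouped by the chosen dimensions."""
    return Counter(tuple(record[d] for d in dimensions) for record in records)


views = [
    {"page": "/home", "day": "2014-03-01", "user": 1},
    {"page": "/home", "day": "2014-03-01", "user": 2},
    {"page": "/cart", "day": "2014-03-01", "user": 1},
    {"page": "/home", "day": "2014-03-02", "user": 1},
]
by_page_day = roll_up(views, ["page", "day"])  # finer level of detail
by_page = roll_up(views, ["page"])             # rolled up over all days
print(by_page_day[("/home", "2014-03-01")], by_page[("/home",)])
```

Dropping a dimension from the list corresponds to
rolling up to a coarser level, and adding one
corresponds to drilling down.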

7. Applications of web usage mining
Web usage mining has many applications. Some
important applications are:
Personalization
Web site evaluation
System improvement

Personalization
Personalization is an important application of web
usage mining: when a user interacts with a website,
the website presents information according to the
user's requirements. Personalization is among the
most widely studied research areas in web usage
mining. Adaptive web sites change their organization
and presentation according to the preferences of the
users accessing them. Web agent based systems are
used for web personalization. Amazon.com uses a
similar technique for web personalization.

Web site evaluation
Web site evaluation determines the modifications
needed in the contents and link structure of a
website. The technique is to model user navigation
patterns and compare them with the site designer's
expected patterns.

8. Conclusion
In the web usage mining process, preprocessing of web
log data is a necessary step. Preprocessing improves
the performance of the other two steps, pattern
discovery and pattern analysis. Log files store their
records with the .txt extension. The server log file
is the source of log data for the web usage mining
process. There are many preprocessing techniques, and
some older techniques can be combined with newer ones
to improve the performance of the web usage mining
process.

9. References

[1] Dafa-Alla, Mirghani A. Eltahir and Anour F.A.
(2013), Extracting Knowledge from Web Server Logs
Using Web Usage Mining, 2013 International Conference
on Computing, Electrical and Electronic Engineering
(ICCEEE).
[2] Tasawar Hussain, Dr. Asghar and Dr. Masood (2010),
Preprocessing Techniques in Web Log Mining.
[3] Ma Shu Yue, Liu Wen Cai, Wang Shuo (2010), The
Study on the Preprocessing in Web Log Mining.
[4] Theint Theint Aye, University of Computer Studies,
Mandalay (2011), Web Log Cleaning for Mining of Web
Usage Patterns.
[5] Marathe Dagadu Mitharam (2012), Preprocessing in
Web Usage Mining.
[6] Navin Kumar Tyagi, A.K. Solanki & Sanjay Tya
(2010), An Algorithmic Approach for Preprocessing in
Web Usage Mining, International Journal of
Information Technology and Knowledge Management,
July-December 2010, Volume 2, No. 2, pp. 279-283.
[7] Jia Li (2013), Research of Analysis of User
Behavior Based on Web Log, 2013 International
Conference on Computational and Information Sciences.
[8] Sitaramulu, K. Sudheer Reddy M. Kantha Reddy
V.,(2013) An effective Data Preprocessing method
for Web
