
Web Usage Mining and Pattern Discovery: A Survey Paper

By Naresh Barsagade CSE 8331


December 8, 2003

1. Introduction

Web technology is not evolving in comfortable and incremental steps; it is turbulent, erratic, and often rather uncomfortable. It is estimated that the Internet, arguably the most important part of the new technological environment, has expanded by about 2000% and that it is doubling in size every six to ten months. In recent years, advances in computer and web technologies and the decrease in their cost have expanded the means available to collect and store data. As an immediate consequence, the amount of information (meaningful data) stored has been increasing at a very fast pace. Traditional information analysis techniques are useful for creating informative reports from data and for confirming predefined hypotheses about the data. However, the huge volumes of data being collected create new challenges for such techniques as organizations look for ways to make use of the stored information to gain an edge over competitors. It is reasonable to believe that data collected over an extended period contains hidden knowledge about the business, or patterns characterizing customer profiles and behavior.

With the rapid growth of the World Wide Web, the study of knowledge discovery on the web, and of modeling and predicting users' accesses to a web site, has become very important [GO2003]. From the administration, business, and application points of view, knowledge obtained from Web usage patterns can be directly applied to efficiently manage activities related to e-Business, e-CRM, e-Services, e-Education, e-Newspapers, e-Government, Digital Libraries, and so on [AR2003]. The Web is becoming a necessity for businesses and organizations because their clients demand it. Since web technology largely feeds on ideas and knowledge rather than depending on fixed assets, it gave birth to new companies such as Yahoo, Google, Netscape, e-Bay, e-Trade,


Expedia, Amazon, and so on. With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web has become an important research area [JTP2002]. With the explosive growth of information sources available on the World Wide Web, it has become necessary for organizations to discover usage patterns and analyze the discovered patterns to gain an edge over competitors. Jespersen et al. [JTB2002] proposed a hybrid approach for analyzing visitor click-stream sequences; a combination of the hypertext probabilistic grammar and click fact table approaches is used to mine Web logs, and it can also be used for general sequence mining tasks. Mobasher et al. [MCS1999] proposed a web personalization system, which consists of offline tasks related to the mining of usage data and an online process of automatic Web page customization based on the knowledge discovered. LOGSOM, a system proposed by Smith et al. [SN2003] that utilizes Kohonen's self-organizing map (SOM) to organize web pages into a two-dimensional map, relies solely on the users' navigation behavior rather than on the content of the web pages. LumberJack, proposed by Chi et al. [CRHL2002], builds user profiles by combining clustering of user sessions with traditional statistical traffic analysis using the k-means algorithm. Joshi et al. [JJYK1999] used a relational online analytical processing approach for creating a Web log warehouse from access logs and mined logs. A comprehensive overview of web usage mining research is found in [SCDT2000, CMS1997, CMS1999, RWC2000]. Web mining can be divided into three areas, namely web content mining, web structure mining, and web usage mining [SCDT2000]. Web content mining focuses on the discovery of information stored on the Internet. Web structure mining focuses on improving the


structural design of a website. Web usage mining, the main topic of this paper, focuses on knowledge discovery from the usage of individual web sites.

The figures below [NN2003] show average web usage in the United States and across the global digital media universe.

United States: Average Web Usage (Month of September 2003, Panel Type: Home)

Metric                                    September      August         % Change
Number of Sessions per Month              22             22             1.65
Number of Unique Domains Visited          55             54             0.89
Page Views per Month                      901            899            0.3
Page Views per Surfing Session            41             41             0
Time Spent per Month                      11:59:20       11:50:30       1.24
Time Spent During Surfing Session         0:32:29        0:32:37        -0.4
Duration of a Page Viewed                 0:00:48        0:00:47        0.94
Active Internet Universe                  252,672,070    253,054,814    -0.15
Current Internet Universe Estimate        419,054,724    416,339,888    0.65

Global Internet Usage: Average Usage (Month of October 2003, Panel Type: Home)

Sessions/Visits Per Person                71
Domains Visited Per Person                103
PC Time Per Person                        80:46:37
Duration of a Web Page Viewed             0:01:00
Active Digital Media Universe             47,003,165
Current Digital Media Universe Estimate   51,012,930

The remainder of the paper is organized as follows: Section 2 covers applications of web usage mining; Section 3 presents basic web mining terminology, a taxonomy of web mining, the architecture of web usage mining, and an explanation of the individual components of that architecture; Section 4 summarizes the paper and identifies several future research directions; and Section 5 contains the bibliography.

2. Applications of Web Usage Mining


Each of these applications can benefit from patterns that are ranked by subjective interestingness. Web usage mining is used in the following areas.

Web usage mining offers users the ability to analyze massive volumes of clickstream or click-flow data, integrate the data seamlessly with transaction and demographic data from offline sources, and apply sophisticated analytics for web personalization, e-CRM, and other interactive marketing programs. Personalization for a user can be achieved by keeping track of previously accessed pages. These pages can be used to identify the typical browsing behavior of a user and subsequently to predict desired pages. By determining frequent access behavior for users, needed links can be identified to improve the overall performance of future accesses. Information concerning frequently accessed pages can be used for caching. In addition to modifications to the linkage structure, identifying common access behaviors can be used to improve the actual design of Web pages and to make other modifications to the site.

Web usage patterns can be used to gather business intelligence to improve customer attraction, customer retention, sales, marketing and advertising, and cross sales.

Mining of web usage patterns can help in the study of how browsers are used and of the user's interaction with a browser interface.

Usage characterization can also look into navigational strategy when browsing a particular site.


Web usage mining focuses on techniques that could predict user behavior while the user interacts with the Web.

Web usage mining helps in improving the attractiveness of a Web site, in terms of content and structure.

Performance and other service-quality attributes are crucial to user satisfaction, and high-quality performance of a web application is expected.

Mining of Web usage patterns provides a key to understanding Web traffic behavior, which can be used to inform policies on web caching, network transmission, load balancing, or data distribution.

Web usage and data mining are also useful for detecting intrusion, fraud, and attempted break-ins to the system.

Web usage mining can be used in e-Learning, e-Business, e-Commerce, e-CRM, e-Services, e-Education, e-Newspapers, e-Government, and Digital Libraries.

Web usage mining can be used in Customer Relationship Management, Manufacturing and Planning, Telecommunications, and Financial Planning.

Web usage mining can be used in Physical Sciences, Social Sciences, Engineering, Medicine, and Biotechnology.

Web usage mining can be used in Counter Terrorism and Fraud Detection, and in the detection of unusual accesses to secure data.

Web usage mining can be used in the determination of common behaviors or traits of users who perform certain actions, such as purchasing merchandise.


Web usage mining can be used in usability studies to determine the interface quality.

Web usage mining can be used in network traffic analysis for determining equipment requirements and data distribution in order to efficiently handle site traffic.

3. Web Usage Mining and Pattern Discovery

Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications [CMS1997]. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. A high-level Web usage mining process is presented in Figure 1 [SCDT2000]. Cooley et al. [CMS1997] propose that the web mining process can be divided into two main parts. The first part includes the domain-dependent processes of transforming the Web data into a suitable transaction form; this includes the preprocessing, transaction identification, and data integration components. The second part applies data mining and pattern matching techniques such as association rules and sequential patterns. In the absence of cookies or dynamically embedded session IDs in the URIs, the combination of the IP address and agent fields can be used as a first-pass estimate of unique users; this estimate can be refined using the referrer field, as described in [CMS1999]. Some authors have proposed global architectures to handle the web usage mining process. Cooley et al. [CTS1999] proposed a site information filter, named WebSIFT, that establishes a framework for web usage mining, as shown in Figure 2.


The WebSIFT system divides the Web usage mining process into three main parts, as shown in Figure 1. For a particular Web site, the three server logs (access, referrer, and agent, often combined into a single log), the HTML files, template files, script files, or databases that make up the site content, and any optional data such as registration data or remote agent logs provide the information needed to construct the different information abstractions.

The preprocessing phase uses the input data to construct a server session file based on the methods and heuristics discussed in [CMS1999]. In order to preprocess a server log, the log must first be cleaned, which consists of removing unsuccessful requests, parsing relevant CGI name/value pairs, and rolling up file accesses into page views. Once the log is converted into a list of page views, users must be identified; in the absence of cookies or dynamically embedded session IDs in the URIs, the combination of the IP address and agent fields serves as a first-pass estimate of unique users. The preprocessing phase also allows the option of converting the server sessions into episodes prior to performing knowledge discovery.

Figure 2: A General Architecture for Web Usage Mining

In this case, episodes are either all of the page views in a server session that the user spent a significant amount of time viewing, or all of the navigation page views leading up to each content page view. The details of how a cutoff time is determined for classifying a page view as content or navigation are also contained in [CMS1999]. The click-stream or click-flow for each user is divided into sessions based on a simple thirty-minute timeout. The notion of what makes discovered knowledge interesting has been addressed in [PT1998]. A survey of methods that have been used to characterize the interestingness of discovered patterns is given in [HH1999]. The four dimensions used by [HH1999] to classify interestingness measures are pattern-form, representation, scope, and class. Pattern-form defines what type of patterns a measure is applicable to, such as association rules or classification rules. The representation dimension defines the nature of the framework, such as probabilistic or logical. Scope is a binary dimension that indicates whether the measure applies to a single pattern or to the entire discovered


set. The final dimension, class, is also binary and can be labeled as subjective or objective. Preprocessing for the content and structure of a site involves assembling each page view for parsing and/or analysis. Page views are accessed through HTTP requests issued by a site crawler to assemble the components of the page view; this handles both static and dynamic content. In addition to being used to derive the site topology, the site files are used to classify the pages of a site. Both the site topology and the page classification can then be fed into the information filter. The knowledge discovery phase uses existing data mining techniques to generate rules and patterns. Included in this phase is the generation of general usage statistics, such as the number of hits per page, the pages most frequently accessed, the most common starting page, and the average time spent on each page. WebSIFT performs the mining in distinct stages. The first stage is preprocessing, in which user sessions are inferred from log data. The second stage searches for patterns in the data by making use of standard data mining techniques, such as association rules or sequential pattern mining. In the third stage, an information filter based on domain knowledge and the web site structure is applied to the mined patterns in search of interesting ones. Links between pages and the similarity between the contents of pages provide evidence that the pages are related. This information is used to identify interesting patterns; for example, itemsets that contain pages not directly connected are declared interesting. In Mobasher et al. [MCS1999], the authors propose to group the itemsets obtained in the mining stage into clusters of URL references. These clusters are aimed at real-time web page personalization. A hypergraph is inferred from the mined itemsets, where the nodes correspond to pages and the hyperedges connect the pages in an


itemset. The weight of a hyperedge is given by the confidence of the rules involved. The graph is subsequently partitioned into clusters, and an active user session is matched against these clusters. For each URL in the matching clusters a recommendation score is computed, and the recommendation set is composed of all the URLs whose recommendation score is above a specified threshold. In Büchner et al. [BBAMH1999], a new approach, in the form of a process, is proposed to find marketing intelligence from Internet data. An n-dimensional web log data cube is created to store the collected data, and domain knowledge is incorporated into the data cube in order to reduce the pattern search space. They proposed an algorithm to extract navigation patterns from the data cube. The patterns conform to pre-specified navigation templates, whose use enables the analyst to express knowledge about the domain and to guide the mining process. This model does not store the log data in compact form, which can be a major drawback when handling very large daily log files. Information on how customers are using a Web site is critical for marketers of electronic commerce businesses. Büchner and Mulvenna [BM1998] presented a knowledge discovery process for discovering marketing intelligence from Web data. They define a Web log data hypercube that consolidates Web usage data along with marketing data for electronic commerce applications. Four distinct steps are identified in the customer relationship life cycle that can be supported by their knowledge discovery techniques: customer attraction, customer retention, cross sales, and customer departure. Masseglia et al. [MPC1999] proposed an integrated tool for mining access patterns and association rules from log files. The techniques implemented pay particular attention to the handling of time constraints, such as the minimum and maximum time gap between adjacent requests in a pattern. The system provides a real-time generator of


dynamic links, which aims at automatically modifying the hypertext organization whenever a user's navigation matches a previously mined rule. Fundamental methods of data cleaning and preparation have been well studied by Srivastava et al. [SCDT2000]. The main techniques traditionally used for modeling usage patterns in a Web site are collaborative filtering (CF), clustering of pages or user sessions, association rule generation, sequential pattern generation, and Markov models. The prediction step is the real-time processing of the model, which considers the active user session and makes recommendations based on the discovered patterns. The time spent on a page is a good measure of the user's interest in that page, providing an implicit rating for it [GO2003]: if a user is interested in the content of a page, she will likely spend more time there compared to the other pages in her session. The authors of [GO2003] presented a new model that uses both the sequence of visited pages and the time spent on those pages, which reflects the structural information of a user session and handles two-dimensional information. Data preprocessing consists of data filtering, user identification, session/transaction identification, and topology extraction. Data filtering removes noise, i.e., unsuccessful requests, automatically downloaded graphics, or requests from robots, to obtain more compact training data. Heuristic rules based on, for example, IP addresses or cookies are commonly used to identify users. Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.
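
The cleaning and sessionization steps just described can be illustrated with a short sketch. The Python fragment below is a minimal illustration, not part of any of the surveyed systems: the sample records, field layout, and noise suffixes are hypothetical, and the thirty-minute timeout follows the heuristic mentioned earlier for splitting a user's click-stream into sessions.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)            # session timeout discussed in the text
NOISE_SUFFIXES = (".gif", ".jpg", ".css")  # automatically downloaded objects

# Hypothetical, already-parsed log records: (ip, agent, timestamp, url, status)
records = [
    ("1.2.3.4", "Mozilla", datetime(2003, 9, 1, 10, 0, 0), "/index.html", 200),
    ("1.2.3.4", "Mozilla", datetime(2003, 9, 1, 10, 5, 0), "/logo.gif", 200),
    ("1.2.3.4", "Mozilla", datetime(2003, 9, 1, 10, 10, 0), "/products.html", 200),
    ("1.2.3.4", "Mozilla", datetime(2003, 9, 1, 11, 30, 0), "/index.html", 200),
    ("5.6.7.8", "Opera",   datetime(2003, 9, 1, 10, 2, 0), "/missing.html", 404),
]

def clean(records):
    """Data filtering: drop unsuccessful requests and embedded graphics."""
    return [r for r in records
            if r[4] == 200 and not r[3].endswith(NOISE_SUFFIXES)]

def sessionize(records, timeout=TIMEOUT):
    """User identification by (IP, agent); a new session starts after a long gap."""
    sessions = {}
    last_seen = {}
    for ip, agent, ts, url, _ in sorted(records, key=lambda r: r[2]):
        user = (ip, agent)
        if user not in last_seen or ts - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])   # open a new session
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return sessions

for user, user_sessions in sessionize(clean(records)).items():
    print(user, user_sessions)
```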

Usage Preprocessing: The usage data consists of records of page accesses, with fields such as the client IP address, the page reference, and the date and time of the access [SCDT2000]. Typically, the usage data comes from an Extended Common Log Format (ECLF) server log [RWC2000].

Content Preprocessing: Content preprocessing consists of converting the text, images, scripts, and multimedia data into forms that are useful for the web usage mining process. Often this consists of performing content mining such as classification or clustering. In the context of web usage mining, the content of Web sites can be used to filter the input to the pattern discovery algorithms [SCDT2000].

Structure Preprocessing: Web structure mining analyzes the link structure of the web in order to identify relevant documents [SCDT2000]. The structure of a site is created by the hypertext links between page views. Intra-page structure information includes the arrangement of the various HTML or XML tags within a given page; the principal kind of inter-page structure information is the hyperlinks connecting one page to another. The Google search engine [GOO] makes use of the web link structure in determining the relevance of a page: while keyword similarity analysis ensures high precision, a link-based probability measure ensures high quality of the returned pages.

The information provided by the data sources listed above can be used to construct a data model consisting of several data abstractions, notably users, page views, click-streams, server sessions, and episodes [RWC2000]. A page view is defined as all of the files that contribute to the client-side presentation seen as the result of a single mouse click of a user. A click-stream is the sequence of page views accessed by a user. A server session is the click-stream for a single visit of a user to a Web site. Finally, an episode is a subset of page views from a server session. Data can


be collected at the server level, client level, or proxy level, or obtained from an organization's database. Each type of data collection differs not only in terms of the location of the data source, but also in the kinds of data available, the segment of the population from which the data was collected, and its method of implementation. The usage data collected at the different sources (server level, client level, and proxy level) represents the navigation patterns of different segments of the overall Web traffic [SCDT2000].

Server-level Collection : A Web server log records the browsing behavior of site
visitors [SCDT2000]. The data recorded in server logs reflects the concurrent and interleaved accesses of a Web site by multiple users. These log files can be stored in various formats, such as the Common Log Format (CLF) or the Extended Common Log Format (ECLF). ECLF contains the client IP address, user ID, time/date, request, status, bytes, referrer, and agent. Tracking individual users is not an easy task due to the stateless connection model of the HTTP protocol. In order to handle this problem, Web servers can also store other kinds of usage information, such as cookies, in separate logs or appended to the CLF or ECLF logs. Cookies are tokens generated by the Web server for individual client browsers in order to automatically track site visitors. Packet sniffing technology (also referred to as network monitors) is an alternative method for collecting usage data; packet sniffers monitor network traffic coming to a Web server and extract usage data directly from TCP/IP packets. Besides usage data, the server side also provides access to the site files, e.g., content data, structure information, local databases, and Web page meta-information such as the size of a file and its last modification time.
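
To make the log formats above concrete, the sketch below splits one combined (ECLF-style) log line into the fields listed: client IP, user, timestamp, request, status, bytes, referrer, and agent. It assumes the common Apache-style combined layout and a well-formed line; the sample line is hypothetical, and a production parser would need to handle malformed entries.

```python
import re

# Apache-style combined (ECLF-like) log line; the sample line is hypothetical.
LINE = ('192.168.0.7 - alice [08/Dec/2003:10:15:32 -0600] '
        '"GET /products/music.html HTTP/1.0" 200 2326 '
        '"http://example.com/index.html" "Mozilla/4.08"')

ECLF = re.compile(
    r'(?P<ip>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"')

m = ECLF.match(LINE)
if m:
    fields = m.groupdict()
    # The request field carries method, URI, and protocol.
    method, uri, protocol = fields["request"].split()
    print(fields["ip"], fields["time"], uri, fields["status"],
          fields["referrer"], fields["agent"])
```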


Client-level Collection: Client-side collection can be implemented by using a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities [SCDT2000].

Proxy-level Collection: The Internet Service Provider (ISP) machine that users connect to through a modem is a common form of proxy server. A web proxy acts as an intermediary between client browsers and Web servers. Proxy-level caching can be used to reduce the loading time of a Web page experienced by users as well as the network traffic load at the server and client sides.

Pattern Discovery : Pattern discovery uses methods and algorithms developed from
several fields, such as statistics, data mining, machine learning, and pattern recognition [SCDT2000]. Zaiane et al. [ZXH1998] proposed the use of On-Line Analytical Processing (OLAP) technology in web usage mining; OLAP and the data cube structure offer a highly interactive and powerful data retrieval and analysis environment. The knowledge that can be discovered is represented in the form of rules, tables, charts, graphs, and other visual presentation forms for characterizing, comparing, predicting, or classifying data from the web access log. Visualization can also be used in web usage mining, since it presents the data in a way that users can understand more easily.
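
The data cube idea can be sketched with ordinary dictionaries: each usage fact is aggregated along several dimensions so that simple roll-up queries become lookups. This toy illustration only conveys the flavor of the OLAP approach cited above; it is not the system of Zaiane et al., and the dimensions and records are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical usage facts: (page, day, agent)
facts = [
    ("/index.html", "2003-09-01", "Mozilla"),
    ("/index.html", "2003-09-01", "Opera"),
    ("/products.html", "2003-09-02", "Mozilla"),
    ("/index.html", "2003-09-02", "Mozilla"),
]

# A tiny "cube": hit counts for every combination of dimension values,
# with None standing for "all values" along that dimension.
cube = Counter()
for page, day, agent in facts:
    for p, d, a in product((page, None), (day, None), (agent, None)):
        cube[(p, d, a)] += 1

# Roll-up queries: total hits per page, and hits per page per day.
print(cube[("/index.html", None, None)])          # 3
print(cube[("/index.html", "2003-09-01", None)])  # 2
```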

Statistical Analysis : Statistical techniques are the most common method to extract
knowledge about visitors to a web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time, and the length of a navigational path. For example, e-Trade developed a German-language website for Germany and scrapped it because German users were visiting the English site rather than the German one. Many web traffic analysis tools produce a periodic report containing statistical information


such as the most frequently accessed pages, the average viewing time of a page, or the average length of a path through a site. This type of knowledge can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions. Many commercial tools are available for this kind of statistical analysis.
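
A minimal sketch of this kind of descriptive analysis over a session file is shown below: it computes page-view frequency and the mean and median viewing time per page. The session records are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical session file entries: (page, viewing time in seconds)
page_views = [
    ("/index.html", 12), ("/products.html", 95), ("/index.html", 8),
    ("/checkout.html", 40), ("/products.html", 61), ("/index.html", 15),
]

# Group viewing times by page.
times = defaultdict(list)
for page, secs in page_views:
    times[page].append(secs)

# Report pages ordered by view frequency with simple descriptive statistics.
for page, secs in sorted(times.items(), key=lambda kv: -len(kv[1])):
    print(f"{page}: {len(secs)} views, "
          f"mean {mean(secs):.1f}s, median {median(secs)}s")
```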

Association Rules: Association rule generation can be used to relate pages that are
most often referenced together in a single server session [SCDT2000]. In the context of web usage mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. Association rule mining has been well studied in data mining, especially for market basket transaction analysis, and many association rule algorithms, such as Apriori and Partition, have been used [MHD2003]. Aside from being applicable to e-Commerce, business intelligence, and marketing applications, association rules can help web designers restructure their web sites. Results on the usefulness of such rules in supermarket transactions or in web applications have not been widely reported. Constraints can also be placed on the mining process, and the extracted rules can be pruned. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site. In electronic CRM, an existing customer can be retained by dynamically creating web offers based on associations with threshold support and/or confidence values [BM1998].
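
As a hedged illustration of the idea (not the Apriori or Partition algorithms themselves), the sketch below counts the support of page pairs across sessions and keeps rules whose support and confidence exceed chosen thresholds; the sessions and thresholds are hypothetical.

```python
from itertools import combinations
from collections import Counter

# Hypothetical server sessions: sets of pages accessed together.
sessions = [
    {"/index.html", "/products.html", "/checkout.html"},
    {"/index.html", "/products.html"},
    {"/index.html", "/about.html"},
    {"/products.html", "/checkout.html"},
]

MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6   # assumed thresholds

item_count = Counter(p for s in sessions for p in s)
pair_count = Counter(frozenset(c) for s in sessions
                     for c in combinations(sorted(s), 2))

n = len(sessions)
for pair, count in pair_count.items():
    if count / n < MIN_SUPPORT:
        continue                          # pair is not frequent enough
    a, b = tuple(pair)
    for x, y in ((a, b), (b, a)):         # check both rule directions
        confidence = count / item_count[x]
        if confidence >= MIN_CONFIDENCE:
            print(f"{x} => {y}  support={count / n:.2f} "
                  f"confidence={confidence:.2f}")
```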

Clustering: Clustering is a technique to group together a set of items having similar


characteristics [SCDT2000]. Clustering can be performed on either the users or the page views. Clustering analysis in web usage mining aims to find clusters of users, pages, or sessions from the web log file, where each cluster represents a group of objects with a common interest or characteristic. User clustering is designed to find user groups


that have common interests based on their behaviors, and it is critical for user community construction. Page clustering is the process of clustering pages according to the users' accesses to them. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in e-Commerce applications or to provide personalized web content to the users. On the other hand, clustering of pages will discover groups of pages having related content; this information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs. The intuition is that if the probability of visiting one page, given that another page has been visited, is high, then perhaps the two pages can be grouped into one cluster. For session clustering, all the sessions are processed to find interesting session clusters; each session cluster may correspond to one interesting topic within the web site. Mobasher et al. [MCS1999] generated recommendations from URL clusters to build an adaptive web site by using Association Rule Hypergraph Partitioning (ARHP). Abraham and Ramos [AR2003] proposed an ant-clustering algorithm to discover web usage patterns and a linear genetic programming approach to analyze visitor trends. They proposed a hybrid framework, which uses an ant colony optimization algorithm to cluster Web usage patterns: the raw data from the log files are cleaned and preprocessed, the ACLUSTER algorithm is used to identify the usage patterns, and the resulting clusters are fed to a linear genetic programming model to analyze the usage trends. WebCANVAS (Web Clustering Analysis and VisuAlization of Sequences) [CHMSW2003] presents a methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories


traversed by users. In their approach, site users are first partitioned into clusters such that users with similar navigation paths through the site are placed into the same cluster. The clustering approach they employed is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. Another feature of their model-based clustering is that learning time scales linearly with sample size; in contrast, agglomerative distance-based methods scale quadratically with sample size. The purpose of knowledge discovery from user profiles is to find clusters of similar interests among the users [SZAS1997]. If the site is well designed, there will be a strong correlation between the similarity of navigation paths and the similarity of user interests; therefore, clustering of the former can be used to cluster the latter. The definition of similarity is application dependent. The authors provide an overview of a powerful path clustering method called path mining, which is suitable for knowledge discovery in databases with a partial ordering in their data. In this method, first a general path feature space is characterized; then a similarity measure among the paths over the feature space is introduced; finally, this similarity measure is used for clustering. They implemented the path mining algorithm to cluster the navigation paths detected by the profiler. The algorithm finds a scalar number as the similarity among paths, and these similarity numbers can be fed to standard data mining algorithms to cluster the user interests.
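
Session clustering can be illustrated with a very small sketch: sessions are compared by the overlap of the pages they contain (Jaccard similarity) and grouped with a simple one-pass leader algorithm. This is only an illustration of grouping sessions by page overlap; it is not the ant-colony, model-based, or path mining methods discussed above, and the similarity threshold and sessions are assumptions.

```python
# Hypothetical sessions as sets of visited pages.
sessions = [
    {"/sports/", "/sports/soccer.html"},
    {"/sports/", "/sports/tennis.html"},
    {"/news/", "/news/world.html"},
    {"/news/", "/news/local.html", "/index.html"},
]

SIM_THRESHOLD = 0.25  # assumed cutoff for "similar enough"

def jaccard(a, b):
    """Overlap of two page sets relative to their union."""
    return len(a & b) / len(a | b)

clusters = []  # each cluster: {"leader": first session seen, "members": [...]}
for s in sessions:
    for c in clusters:
        if jaccard(s, c["leader"]) >= SIM_THRESHOLD:
            c["members"].append(s)
            break
    else:
        clusters.append({"leader": s, "members": [s]})

for i, c in enumerate(clusters):
    print(f"cluster {i}: {c['members']}")
```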

Classification : Classification is the task of mapping a data item into one of several
predefined classes [SCDT2000]. In Internet marketing, a customer can be classified as a non-customer, a one-time visitor, or a regular visitor based on their browsing patterns, and rules can be discovered for attracting customers by displaying special offers [BM1998].


In the web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires the extraction and selection of features that best describe the properties of a given class or category. Classification can be done using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the west coast. Classification algorithms such as C4.5, CART, Bayes, and RIPPER can be used to predict whether a page is of interest to the user.
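
A minimal naive Bayes sketch over hypothetical labeled sessions is shown below; it estimates, from the pages a visitor requested, whether the visitor is likely to be a customer. The labels, pages, and smoothing constant are illustrative assumptions, not drawn from the cited work.

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical training data: (set of pages in a session, class label)
training = [
    ({"/products/music", "/checkout"}, "customer"),
    ({"/products/music", "/cart", "/checkout"}, "customer"),
    ({"/index", "/about"}, "visitor"),
    ({"/index", "/products/music"}, "visitor"),
]

class_count = Counter(label for _, label in training)
page_count = defaultdict(Counter)          # page frequency per class
for pages, label in training:
    page_count[label].update(pages)

vocabulary = {p for pages, _ in training for p in pages}

def classify(pages, alpha=1.0):
    """Pick the class with the highest log posterior under naive Bayes."""
    best, best_score = None, float("-inf")
    for label, n in class_count.items():
        score = log(n / len(training))     # class prior
        total = sum(page_count[label].values()) + alpha * len(vocabulary)
        for p in pages:
            score += log((page_count[label][p] + alpha) / total)
        if score > best_score:
            best, best_score = label, score
    return best

print(classify({"/products/music", "/cart"}))   # expected: "customer"
```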

Sequential Patterns : The technique of sequential pattern discovery attempts to find


inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes [SCDT2000]. MiDAS (Mining Internet Data for Associative Sequences), a new algorithm for discovering sequential patterns from web log files, has been proposed to provide behavioral marketing intelligence for e-commerce scenarios [BBAMH1999]. MiDAS contains three phases: (1) the a priori phase is the input data preparation, which consists of data reduction and data type substitution; (2) the discovery phase discovers the sequences of hits and generates the pattern tree; (3) the a posteriori phase filters out all sequences that do not fulfill the criteria laid down in the specified navigation templates and topology network, and pruning is also done in this phase. Using this approach, Web marketers can predict future visit patterns, which is helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, and similarity analysis.


Oyanagi et al. [OKN2002] explore issues in sequence mining methods for WWW access logs. The Apriori algorithm is well known as a typical algorithm for sequential pattern mining; however, it suffers from inherent difficulties in finding long sequential patterns and in finding interesting patterns among a huge number of results. They propose a new method for finding sequence patterns by matrix clustering. This method decomposes a sequence into a set of sequence elements, each of which corresponds to an ordered pair of items; matrix clustering is then applied to extract a cluster of similar sequences, and the resulting sequence elements are composed into a graph. The Web Utilization Miner (WUM) [SS1998] uses an efficient data structure called an aggregated tree to store the user sessions, and it also provides a query language to extract interesting patterns from the aggregated session data. WUM employs an innovative technique for the discovery of navigation patterns over an aggregated materialized view of the web log: after performing the classical preparation steps (i.e., user and session identification), the user sessions are merged into an aggregated tree, a tree constructed by merging trails with the same prefix. WUM provides a query language called MINT that lets users specify queries concerning the content, structure, and statistics of navigation patterns; MINT supports the specification of criteria of a statistical, structural, and textual nature. The WEBMINER tool [CMS1997] provides a query language on top of external mining software for association rules and sequential patterns.
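
The sequence-element idea mentioned above can be sketched briefly: each session is decomposed into ordered page pairs and the most frequent pairs are reported. This conveys only the flavor of sequential pattern counting; it is not MiDAS, WUM, or the matrix clustering method, and the sessions are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions as time-ordered lists of page views.
sessions = [
    ["/index", "/products", "/cart", "/checkout"],
    ["/index", "/products", "/checkout"],
    ["/index", "/about", "/products"],
]

# Sequence elements: ordered pairs (a precedes b somewhere in the session).
pair_support = Counter()
for session in sessions:
    seen = set()
    for a, b in combinations(session, 2):   # combinations preserve list order
        if (a, b) not in seen:              # count each pair once per session
            seen.add((a, b))
            pair_support[(a, b)] += 1

for (a, b), count in pair_support.most_common(5):
    print(f"{a} -> {b}: appears in {count} of {len(sessions)} sessions")
```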

Dependency modeling : Dependency modeling is another useful pattern discovery


task in web mining [SCDT2000]. The goal here is to develop a model capable of representing significant dependencies among the various variables in the web domain. As an example, one may be interested in building a model representing the different stages


a visitor undergoes while shopping in an online store, based on the actions chosen (i.e., from a casual visitor to a serious potential buyer). There are several probabilistic learning techniques that can be employed to model the browsing behavior of users, including Hidden Markov Models and Bayesian belief networks. Modeling Web usage patterns not only provides a theoretical framework for analyzing the behavior of users but is also potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or to improve the navigational convenience of users. Borges and Levene [BL1999] proposed a formal data mining model, Hypertext Probabilistic Grammars (HPG), to capture users' web navigation patterns. User sessions are represented by an HPG whose higher-probability strings correspond to the navigation trails preferred by the users. A Hypertext Probabilistic Grammar is a Markov model, which assumes that the probability of a link being chosen depends on the contents of the page being viewed rather than on all the previous history of the session [LL1999]. Note that this assumption can be weakened by making use of the N-gram concept or dynamic Markov chain techniques. There are situations in which a Markov assumption is realistic, such as, for example, an online newspaper where a user chooses which article to read in the sports section independently of the contents of the front page. However, there are also cases in which such an assumption is not very realistic, such as, for example, an online tutorial providing a sequence of pages explaining how to perform a given task.
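
The first-order Markov assumption discussed above can be illustrated with a short sketch that estimates transition probabilities between pages from session data. This is a generic Markov chain over page views, not the HPG model of Borges and Levene; the sessions are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical sessions (time-ordered page views).
sessions = [
    ["/index", "/sports", "/sports/soccer"],
    ["/index", "/sports", "/sports/tennis"],
    ["/index", "/news", "/news/world"],
]

# Count observed page-to-page transitions.
transitions = defaultdict(Counter)
for session in sessions:
    for current_page, next_page in zip(session, session[1:]):
        transitions[current_page][next_page] += 1

def next_page_distribution(page):
    """Estimate P(next | current) from the observed transition counts."""
    counts = transitions[page]
    total = sum(counts.values())
    return {nxt: c / total for nxt, c in counts.items()}

print(next_page_distribution("/index"))   # /sports ~0.67, /news ~0.33
print(next_page_distribution("/sports"))  # soccer and tennis, 0.5 each
```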

Deviation/Outlier Detection: This category contains techniques aimed at detecting unusual changes in the data relative to the expected values. Such techniques are useful, for example, in fraud detection, where the inconsistent use of credit cards can identify


situations where a card is stolen. The inconsistent use of a credit card could be noticed, for instance, if there were transactions performed in different geographic locations within a given time window.
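
A minimal sketch of deviation detection on usage data is given below: daily request counts are compared against their mean, and days more than two standard deviations away are flagged. The counts and the two-sigma cutoff are illustrative assumptions.

```python
from statistics import mean, stdev

# Hypothetical requests-per-day counts for one Web site.
daily_requests = [980, 1010, 995, 1020, 1005, 990, 4800, 1000]

mu = mean(daily_requests)
sigma = stdev(daily_requests)

# Flag days whose count deviates strongly from the expected value.
for day, count in enumerate(daily_requests):
    z = (count - mu) / sigma
    if abs(z) > 2:
        print(f"day {day}: {count} requests looks unusual (z = {z:.1f})")
```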

Pattern analysis: Pattern analysis is the last step in the overall Web Usage mining
process as described in Figure 2. The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Another method is to load usage data into a data cube in order to perform OLAP operations. Visualization techniques, such as graphing patterns or assigning colors to different values, can often highlight overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.
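
As a small sketch of this filtering step, discovered rules can be screened against support and confidence thresholds and against a content-type constraint, in the spirit of the knowledge-query mechanisms mentioned above. The rule records, page types, and thresholds are hypothetical.

```python
# Hypothetical discovered rules from the pattern discovery phase.
rules = [
    {"pages": ("/index", "/products"), "support": 0.42, "confidence": 0.80},
    {"pages": ("/index", "/logo-history"), "support": 0.03, "confidence": 0.55},
    {"pages": ("/products", "/checkout"), "support": 0.21, "confidence": 0.91},
]

# Hypothetical page classification, e.g. from content/structure preprocessing.
page_type = {"/index": "navigation", "/products": "content",
             "/checkout": "content", "/logo-history": "content"}

def interesting(rule, min_support=0.1, min_confidence=0.7):
    """Keep rules that are frequent, confident, and end on a content page."""
    last_page = rule["pages"][-1]
    return (rule["support"] >= min_support
            and rule["confidence"] >= min_confidence
            and page_type.get(last_page) == "content")

for rule in filter(interesting, rules):
    print(rule)
```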

4. Summary and Future Research Directions

This paper has attempted to provide an up-to-date survey of the rapidly growing area of Web usage mining, an area driven by the demands of current technology. A general overview of Web usage mining is presented in the introduction section. Web usage mining is used in many areas such as e-Business, e-CRM, e-Services, e-Education, e-Newspapers, e-Government, Digital Libraries, advertising, marketing, bioinformatics, and so on. The major classes of recommendation services are based on the discovery of navigational patterns of users. The main techniques for pattern discovery are sequential patterns, association rules, classification, clustering, and path analysis. Web usage mining's basic


components, the taxonomy of web mining, the architecture of web usage mining, the individual components of that architecture, and the detailed research in this area by researchers such as Jaideep Srivastava, Bamshad Mobasher, Robert Cooley, Cyrus Shahabi, Ming-Syan Chen, and A.G. Büchner are described in Section 3. With the growth of Web-based applications, specifically e-commerce, there is significant interest in analyzing Web usage data. As the web mining area is growing fast, there is a lot of demand for web usage mining, and there is a need to develop a common framework, analogous to J2EE and .NET. The Cross-Industry Standard Process for Data Mining (CRISP-DM) project has developed an industry- and tool-neutral data mining process model [CRISP-DM]. A similar process model or framework needs to be developed for web usage mining in order to create interest among new researchers, business strategists, and developers. We need a systematic web-site design methodology to create new web pages, or modify existing web pages, such that different users' navigation patterns can be better mapped to answers to a set of specific questions. There is a need to develop tools that incorporate statistical methods, visualization, and human factors to help better understand the mined knowledge. Since the output of knowledge mining algorithms is often not in a form suitable for direct human consumption, there is a need to develop techniques and tools for helping an analyst better assimilate it. One of the open issues in data mining in general, and Web mining in particular, is the creation of intelligent tools that can assist in the interpretation of mined knowledge. Clearly, these tools need to have specific knowledge about the particular problem domain to do any more than filtering based on statistical attributes of the discovered rules or patterns.


More research needs to be done in e-Commerce, bioinformatics, computer security, Web intelligence, intelligent learning, Database systems, Finance, Marketing, Healthcare, and Telecommunications by using Web usage mining.

5. Bibliography

[AR2003]. Ajit Abraham, Vitorino Ramos, Web Usage Mining Using Artificial Ant Colony Clustering and Linear Genetic Programming, in CEC03 - Congress on Evolutionary Computation, IEEE Press, Canberra, Australia, 8-12 Dec. 2003.

[BBAMH1999]. A.G. Büchner, M. Baumgarten, S.S. Anand, M.D. Mulvenna, J.G. Hughes, Navigation Pattern Discovery from Internet Data, in WEBKDD, San Diego, CA, 1999.

[BL1999]. José Borges, Mark Levene, Data Mining of User Navigation Patterns, WEBKDD, 1999.

[BM1998]. A.G. Büchner, M.D. Mulvenna, Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining, ACM SIGMOD Record, Vol. 27, No. 4, pp. 54-61, 1998.

[CHMSW2003]. I.V. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Model-based Clustering and Visualization of Navigation Patterns on a Web Site, Journal of Data Mining and Knowledge Discovery, 7(4), 2003 (extended version of ACM SIGKDD 2000 conference paper).

[CMS1997]. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web (A Survey Paper), in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.

[CMS1999]. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, Knowledge and Information Systems, 1(1), 1999.

[CRHL2002]. E.H. Chi, A. Rosien, and J. Heer, LumberJack: Intelligent Discovery and Analysis of Web User Traffic Composition, in Proceedings of the ACM SIGKDD Workshop on Web Mining for Usage Patterns and User Profiles, Canada, ACM Press, 2002.

[CRISP-DM]. http://www.crisp-dm.org

[CTS1999]. Robert Cooley, Pang-Ning Tan, Jaideep Srivastava, WebSIFT: The Web Site Information Filter System, in Proceedings of the Web Usage Analysis and User Profiling Workshop, August 1999.

[GO2003]. Sule Gunduz, M. Tamer Ozsu, A Web Page Prediction Model Based on Click-Stream Tree Representation of User Behavior, The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003.

[GOO]. Google Search Engine, http://www.google.com

[HH1999]. Robert J. Hilderman and Howard J. Hamilton, Knowledge Discovery and Interestingness Measures: A Survey, Technical Report, University of Regina, 1999.

[JJYK1999]. K.P. Joshi, A. Joshi, Y. Yesha, R. Krishnapuram, Warehousing and Mining Web Logs, in Proceedings of the 2nd ACM CIKM Workshop on Web Information and Data Management, pp. 63-68, 1999.

[JTB2002]. S.E. Jespersen, J. Thorhauge, and T.B. Pedersen, A Hybrid Approach to Web Usage Mining, Data Warehousing and Knowledge Discovery (DaWaK02), LNCS 2454, Springer Verlag, Germany, pp. 73-82, 2002.

[JTP2002]. Soren E. Jespersen, Jesper Thorhauge, Torben Bach Pederson, A Hybrid Approach to Web Usage Mining, Technical Report 02-5002, Department of Computer Science, Aalborg University, July 2002.

[LL1999]. M. Levene and G. Loizou, Computing the Entropy of User Navigation in the Web, Department of Computer Science, University College London, 1999.

[MCS1999]. Bamshad Mobasher, Robert Cooley, Jaideep Srivastava, Creating Adaptive Web Sites Through Usage-Based Clustering of URLs, in Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99), November 1999.

[MHD2003]. Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.

[MPC1999]. F. Masseglia, P. Poncelet, and R. Cicchetti, WebTool: An Integrated Framework for Data Mining, in Proceedings of the Ninth International Conference on Database and Expert Systems Applications (DEXA99), Florence, Italy, August 1999.

[NN2003]. http://www.nielsen-netratings.com

[OKN2002]. Shigeru Oyanagi, Kazuto Kubota, Akihiko Nakase, Mining WWW Access Sequence by Matrix Clustering, SIGKDD Explorations, Volume 4, Issue 2, page 125, 2002.

[RWC2000]. Robert W. Cooley, Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data, Ph.D. Thesis, May 2000.

[SCDT2000]. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, Vol. 1, Issue 2, 2000.

[SN2003]. K.A. Smith and A. Ng, Web Page Clustering Using a Self-Organizing Map of User Navigation Patterns, Decision Support Systems, Volume 35, Issue 2 (May 2003), Special Issue: Web Data Mining, pp. 245-256.

[SS1998]. Myra Spiliopoulou and Lukas C. Faulstich, WUM: A Web Utilization Miner, in International Workshop on the Web and Databases (WebDB98), Valencia, Spain, March 1998.

[SZAS1997]. Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi, and Vishal Shah, Knowledge Discovery from Users Web-page Navigation, IEEE RIDE, 1997.

[ZA1997]. A. Zarkesh and J. Adibi, Pathmining: Knowledge Discovery in Partially Ordered Databases, submitted to KDD-1997.

[ZXH1998]. O.R. Zaiane, M. Xin, and J. Han, Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs, in Proceedings of the Advances in Digital Libraries Conference (ADL'98), Santa Barbara, CA, April 1998.

