L T P J C
3 0 0 4 4
Dr. S M SATAPATHY
Associate Professor,
School of Computer Science and Engineering,
VIT University, Vellore, TN, India – 632 014.
Module – 6
3. Data Modelling
Web Usage Mining
Extraction of information from data generated
through Web page visits and transactions…
data stored in server access logs, referrer logs,
agent logs, and client-side cookies
user characteristics and usage profiles
metadata, such as page attributes, content attributes,
and usage data
Clickstream data
Clickstream analysis
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Web Usage Mining
Web usage mining applications
Determine the lifetime value of clients
Design cross-marketing strategies across products.
Evaluate promotional campaigns
Target electronic ads and coupons at user groups
based on user access patterns
Predict user behavior based on previously learned
rules and users' profiles
Present dynamic information to users based on their
interests and profiles…
Web Mining Success Stories
[Figure: Web mining success stories at the intersection of Web Analytics, Voice of Customer, and Customer Experience Management]
Introduction
Web usage mining: automatic discovery of
patterns in clickstreams and associated data
collected or generated as a result of user
interactions with one or more Web sites.
Goal: analyze the behavioral patterns and
profiles of users interacting with a Web site.
The discovered patterns are usually
represented as collections of pages, objects,
or resources that are frequently accessed by
groups of users with common interests.
Introduction
Data in Web Usage Mining:
Web server logs
Site contents
Data about the visitors, gathered from external channels
Further application data
Not all of these data are always available; when they are, they must be integrated.
A large part of Web usage mining is about processing usage/clickstream data.
After that, various data mining algorithms can be applied.
Bing Liu 8
Introduction
Each log entry (depending on the log format) may
contain fields identifying
the time and date of the request,
The IP address of the client,
the resource requested,
possible parameters used in invoking a Web application,
status of the request,
HTTP method used,
the user agent (browser and operating system type and version),
the referring Web resource, and,
if available, client-side cookies which uniquely identify a repeat
visitor.
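A minimal sketch of splitting one such log entry into the fields listed above. The field names and their order are assumptions inferred from the sample entries on the next slide; real log formats vary.

```python
# Hypothetical field order, matching the sample W3C-extended-style entries
# shown on the following slide (an assumption, not a universal format).
FIELDS = ["date", "time", "client_ip", "user", "method", "uri",
          "query", "status", "bytes", "protocol", "host",
          "user_agent", "referrer"]

def parse_log_line(line):
    """Split a whitespace-delimited log entry into a field dictionary."""
    parts = line.split()
    return dict(zip(FIELDS, parts))

entry = parse_log_line(
    "2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - "
    "200 9221 HTTP/1.1 maya.cs.depaul.edu "
    "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1) "
    "http://dataminingresources.blogspot.com/"
)
# entry["client_ip"] → "1.2.3.4", entry["status"] → "200"
```

Whitespace splitting works here only because the user agent encodes its spaces as "+"; a robust parser would read the log's field directive instead.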
Web server logs
1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
http://dataminingresources.blogspot.com/
2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
http://maya.cs.depaul.edu/~classes/cs589/papers.html
3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200
318814 HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey
4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/
5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/announce.html
6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/announce.html
Web usage mining process
The overall Web usage mining process can be divided into three
inter-dependent stages:
data collection and pre-processing,
pattern discovery, and
pattern analysis.
Web usage mining process
In the pattern discovery stage, statistical, database, and machine
learning operations are performed to obtain hidden patterns
reflecting the typical behaviour of users, as well as summary
statistics on Web resources, sessions, and users.
Data preparation
Sources and Types of Data
The primary data sources used in Web usage mining are the server
log files, which include Web server access logs and application
server logs.
Additional data sources that are also essential for both data
preparation and pattern discovery include the site files and meta-
data, operational databases, application templates, and domain
knowledge.
The data obtained through various sources can be categorized into
four primary groups:
Usage Data
Content Data
Structure Data
User Data
Pre-processing of web usage data
Data Fusion
In large-scale Web sites, it is typical that the content served to users
comes from multiple Web or application servers.
In some cases, multiple servers with redundant content are used to
reduce the load on any particular server.
Data fusion refers to the merging of log files from several Web and
application servers.
This may require global synchronization across these servers.
In the absence of shared embedded session ids, heuristic methods
based on the “referrer” field in server logs along with various
sessionization and user identification methods can be used to
perform the merging.
This step is essential in “inter-site” Web usage mining, where the analysis of user behaviour is performed over the log files of multiple related Web sites.
Data cleaning
Data cleaning:
remove irrelevant references and fields in server logs
remove references due to spider navigation
remove erroneous references
add missing references due to caching (done after sessionization)
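The first three cleaning steps above can be sketched as simple filters over parsed log entries. The dictionary keys, suffix list, and bot markers are illustrative assumptions, not a standard.

```python
# Hypothetical field names for a parsed log entry (assumed for illustration).
STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
BOT_MARKERS = ("bot", "crawler", "spider", "slurp")

def clean(entries):
    """Drop static-resource, erroneous, and spider-generated references."""
    kept = []
    for e in entries:
        if e["uri"].lower().endswith(STATIC_SUFFIXES):
            continue                # irrelevant references (images, styles)
        if e["status"] != "200":
            continue                # erroneous references
        if any(m in e["user_agent"].lower() for m in BOT_MARKERS):
            continue                # spider navigation
        kept.append(e)
    return kept

log = [
    {"uri": "/index.html", "status": "200", "user_agent": "Mozilla/4.0"},
    {"uri": "/header.gif", "status": "200", "user_agent": "Mozilla/4.0"},
    {"uri": "/robots.txt", "status": "200", "user_agent": "Googlebot/2.1"},
    {"uri": "/old.html",   "status": "404", "user_agent": "Mozilla/4.0"},
]
cleaned = clean(log)   # only /index.html survives
```

The fourth step, adding references lost to caching, is path completion and is done after sessionization, as noted above.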
Identify sessions (sessionization)
Sessionization strategies
Sessionization heuristics
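The most common time-oriented heuristic starts a new session whenever the gap between consecutive requests from the same user exceeds a timeout. The 30-minute default and the (timestamp, uri) tuple format are assumptions for illustration.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)   # conventional default (an assumption)

def sessionize(requests):
    """requests: list of (timestamp, uri) for one user, sorted by time."""
    sessions, current = [], []
    last_time = None
    for ts, uri in requests:
        # a long silence ends the current session and starts a new one
        if last_time is not None and ts - last_time > TIMEOUT:
            sessions.append(current)
            current = []
        current.append(uri)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

t = datetime(2006, 2, 1, 0, 0)
reqs = [(t, "/a"), (t + timedelta(minutes=5), "/b"),
        (t + timedelta(hours=2), "/c")]
# → two sessions: ["/a", "/b"] and ["/c"]
```

Other heuristics bound total session duration or require each request's referrer to appear earlier in the same session.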
Sessionization example
User identification
User identification: an example
Pageview
Path completion
Client- or proxy-side caching can often result
in missing access references to those pages
or objects that have been cached.
For instance,
if a user returns to a page A during the same
session, the second access to A will likely result in
viewing the previously downloaded version of A
that was cached on the client-side, and therefore,
no request is made to the server.
This results in the second reference to A not being
recorded on the server logs.
Missing references due to caching
Path completion
Path completion is the problem of inferring missing user
references due to caching.
Effective path completion requires extensive
knowledge of the link structure within the site.
Referrer information in server logs can also
be used in disambiguating the inferred paths.
The problem gets much more complicated in
frame-based sites.
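A simplified sketch of path completion using a known site link graph. When a requested page is not linked from the last recorded page, we backtrack through earlier pages in the session (assumed to have been re-viewed from cache) until one that links to it is found, and re-insert them. The site structure and page names are hypothetical.

```python
LINKS = {  # hypothetical site link graph: page -> pages it links to
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"E"},
    "D": set(),
    "E": set(),
}

def complete_path(session, links):
    """Re-insert the cached pages a user must have revisited."""
    completed = [session[0]]
    for page in session[1:]:
        i = len(completed) - 1
        # walk back to an already-visited page that links to `page`
        while i >= 0 and page not in links.get(completed[i], set()):
            i -= 1
        if 0 <= i < len(completed) - 1:
            # the backtracked pages were served from cache, hence unlogged
            completed.extend(reversed(completed[i:-1]))
        completed.append(page)
    return completed

# recorded: A, B, D, C — the user must have backed out of D and B,
# since only A links to C:
# complete_path(["A", "B", "D", "C"], LINKS) → ["A", "B", "D", "B", "A", "C"]
```

When several earlier pages link to the new page, the referrer field of the request is what disambiguates which backtrack actually occurred, as noted above.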
Integrating with e-commerce events
E-commerce events are either product-oriented or visit-oriented.
They are used to track and analyze the conversion of
browsers to buyers.
The major difficulty with e-commerce events is defining
and implementing them for a given site; in contrast to
clickstream data, however, obtaining reliable
preprocessed data is not a problem.
Another major challenge is the successful
integration with clickstream data.
Product-Oriented Events
Product View
Occurs every time a product is displayed on a
page view
Typical Types: Image, Link, Text
Product Click-through
Occurs every time a user “clicks” on a product to
get more information
Integration with page content
Integration with link structure
E-commerce data analysis
Session analysis
Session analysis: aggregate reports
OLAP
Data mining
Data mining (cont.)
Some usage mining applications
Data Mining Purpose: System Improvement
Personalization application
Standard approaches
Suggest: Online Recommender System
Data Modeling for Web Usage Mining
Summary
The Apriori Algorithm: Key Concepts
k-itemset: an itemset containing k items.
Scan D for the count of each candidate; then compare each
candidate's support count with the minimum support count.

C1                       L1
Itemset   Sup. Count     Itemset   Sup. Count
{I1}      6              {I1}      6
{I2}      7              {I2}      7
{I3}      6              {I3}      6
{I4}      2              {I4}      2
{I5}      2              {I5}      2
In the first iteration of the algorithm, each item is a member of the set
of candidate 1-itemsets C1, along with its support count.
The set of frequent 1-itemsets, L1, consists of the candidate
1-itemsets satisfying minimum support.
Step 2: Generating candidate and frequent 2-itemsets
with min. support = 2
The candidate {I2, I3, I5} is pruned since its subset {I3, I5} is not
frequent.
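The join-and-prune step can be sketched as follows. The frequent 2-itemsets L2 below are reconstructed from the standard Apriori textbook example with min. support = 2 (an assumption; only the pruning of {I2, I3, I5} is stated on the slide).

```python
from itertools import combinations

def apriori_gen(frequent_prev):
    """Join step + prune step; frequent_prev is a set of frozensets."""
    k = len(next(iter(frequent_prev))) + 1
    candidates = set()
    # join: union pairs of frequent (k-1)-itemsets that yield k items
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k:
                candidates.add(union)
    # prune: a candidate survives only if every (k-1)-subset is frequent
    return {c for c in candidates
            if all(frozenset(s) in frequent_prev
                   for s in combinations(c, k - 1))}

# L2 assumed from the standard example (min. support = 2)
L2 = {frozenset(p) for p in
      [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
       ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
C3 = apriori_gen(L2)
# {I2, I3, I5} is pruned because its subset {I3, I5} is not in L2;
# only {I1, I2, I3} and {I1, I2, I5} remain in C3
```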
Back to the example:
Let l = {I1, I2, I5}.
R4: I1 → I2 ∧ I5
Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 → I1 ∧ I5
Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 → I1 ∧ I2
Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.
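The confidence arithmetic above can be reproduced directly from the support counts. The 70% minimum-confidence threshold is an assumption chosen to be consistent with which rules the example accepts and rejects.

```python
# support counts taken from the example
support_count = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

def confidence(antecedent, itemset):
    """conf(A -> itemset \\ A) = sc(itemset) / sc(A)."""
    return support_count[itemset] / support_count[antecedent]

l = frozenset({"I1", "I2", "I5"})
MIN_CONF = 0.70   # assumed threshold, consistent with the example
for item in sorted(l):
    conf = confidence(frozenset({item}), l)
    status = "selected" if conf >= MIN_CONF else "rejected"
    print(f"{item} -> rest: {conf:.0%} ({status})")
# prints 33% (rejected), 29% (rejected), 100% (selected)
```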
Summary
Mining Frequent Patterns Without Candidate
Generation
Compress a large database into a compact, Frequent-
Pattern tree (FP-tree) structure
Highly condensed, but complete for frequent pattern mining
Avoid costly database scans
Develop an efficient, FP-tree-based frequent pattern mining method
A divide-and-conquer methodology:
Compress DB into FP-tree, retain itemset associations
Divide the new DB into a set of conditional DBs – each
associated with one frequent item
Mine each such database separately
Avoid candidate generation
FP-Growth Method : An Example
[Figure: an FP-tree that registers compressed frequent-pattern
information. The header table lists each item with its support count
and node-link (I2:7, I1:6, I3:6, I4:2, I5:2); the tree is rooted at null.]
Mining the FP-Tree by Creating Conditional (sub)
pattern bases
1. Start from each frequent length-1 pattern (as an initial
suffix pattern).
2. Construct its conditional pattern base, which consists of
the set of prefix paths in the FP-Tree co-occurring with
the suffix pattern.
3. Then, construct its conditional FP-Tree & perform
mining on this tree.
4. The pattern growth is achieved by concatenation of the
suffix pattern with the frequent patterns generated from
a conditional FP-Tree.
5. The union of all frequent patterns (generated by step
4) gives the required frequent itemset.
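Step 2 above can be sketched as follows. For simplicity the FP-tree is represented as a list of (root-to-leaf path, leaf count) branches rather than a linked node structure with node-links; both the representation and the branch list (reconstructed from the standard example) are assumptions for illustration.

```python
from collections import Counter

def conditional_pattern_base(branches, item):
    """Collect the prefix paths (with counts) co-occurring with `item`."""
    base = Counter()
    for path, count in branches:
        if item in path:
            prefix = tuple(path[:path.index(item)])
            if prefix:
                base[prefix] += count
    return dict(base)

# root-to-leaf branches of the example FP-tree (item order I2, I1, I3, ...)
branches = [
    (["I2", "I1", "I5"], 1),
    (["I2", "I1", "I3", "I5"], 1),
    (["I2", "I1", "I3"], 1),
    (["I2", "I1", "I4"], 1),
    (["I2", "I3"], 2),
    (["I2", "I4"], 1),
    (["I1", "I3"], 2),
]
cpb = conditional_pattern_base(branches, "I3")
# → {("I2", "I1"): 2, ("I2",): 2, ("I1",): 2},
#   i.e. the conditional pattern base {(I2 I1: 2), (I2: 2), (I1: 2)}
```

The conditional FP-tree for the suffix is then built from these weighted prefixes, and mining recurses on it (steps 3 and 4).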
FP-Tree Example Continued
Item   Conditional Pattern Base          Conditional FP-tree       Frequent Patterns Generated
I3     {(I2 I1: 2), (I2: 2), (I1: 2)}    <I2: 4, I1: 2>, <I1: 2>   {I2 I3: 4}, {I1 I3: 4}, {I2 I1 I3: 2}
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies

1. BIRCH – the definition
Hierarchical clustering
4. BIRCH concepts and terminology
Clustering Feature
CF Tree
• a height balanced tree with two parameters:
- branching factor B
- threshold T
CF Tree
• A leaf node contains at most L entries,
each of them of the form [CFi], where i = 1, 2, …, L .
• It also has two pointers, prev and next,
which are used to chain all leaf nodes together
for efficient scans.
• A leaf node also represents a cluster
made up of all the subclusters represented by its entries.
• But all entries in a leaf node must satisfy
a threshold requirement, with respect to a threshold value T:
the diameter (or radius) has to be less than T.
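A minimal sketch of the Clustering Feature (CF) each entry stores: CF = (N, LS, SS), i.e. the number of points, their linear sum, and their sum of squared norms. CFs are additive under insertion, and the cluster radius can be computed from the CF alone, which is exactly what the threshold test on T uses. The class layout and two-dimensional default are illustrative assumptions.

```python
import math

class CF:
    """Clustering Feature: (N, LS, SS) summary of a subcluster."""
    def __init__(self, dim=2):
        self.n = 0
        self.ls = [0.0] * dim     # linear sum of the points
        self.ss = 0.0             # sum of squared norms of the points

    def add_point(self, x):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def radius(self):
        # R = sqrt(SS/N - ||LS/N||^2): RMS distance of points from centroid
        centroid_sq = sum((v / self.n) ** 2 for v in self.ls)
        return math.sqrt(max(self.ss / self.n - centroid_sq, 0.0))

cf = CF()
for p in [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0)]:
    cf.add_point(p)
# a leaf entry would absorb a new point only if the resulting
# radius (or diameter) stays below the threshold T
```

Because a CF never stores the points themselves, merging two subclusters is just component-wise addition of their CF vectors.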
5.1. Phase 1
• Starts with an initial threshold, scans the data, and inserts points
into the tree.
• If it runs out of memory before it finishes scanning the data,
it increases the threshold value and rebuilds a new, smaller CF tree
by re-inserting the leaf entries of the old tree, then resumes
scanning the data from the point at which it was interrupted.
5.3. Phase 3
• Phase 3:
– It uses a global or semi-global algorithm to cluster all leaf
entries.
– An adapted agglomerative hierarchical clustering algorithm is
applied directly to the subclusters
represented by their CF vectors.
5. BIRCH algorithm
Pros
Cons