
CSE3024 WEB MINING

L T P J C
3 0 0 4 4

Dr. S M SATAPATHY
Associate Professor,
School of Computer Science and Engineering,
VIT University, Vellore, TN, India – 632 014.
Module – 6

WEB USAGE MINING

1. Clickstream analysis

2. Log File, Data Collection and Pre-processing

3. Data Modelling

4. Modelling Web User Interest

5. Finding User Access Patterns
Web Usage Mining
 Extraction of information from data generated
through Web page visits and transactions…
 data stored in server access logs, referrer logs,
agent logs, and client-side cookies
 user characteristics and usage profiles
 metadata, such as page attributes, content attributes,
and usage data
 Clickstream data
 Clickstream analysis

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Web Usage Mining
 Web usage mining applications
 Determine the lifetime value of clients
 Design cross-marketing strategies across products.
 Evaluate promotional campaigns
 Target electronic ads and coupons at user groups
based on user access patterns
 Predict user behavior based on previously learned
rules and users' profiles
 Present dynamic information to users based on their
interests and profiles…

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Web Usage Mining
(clickstream analysis)

Process flow (from the diagram): a user/customer visits the Web site, which generates Weblogs.
 Pre-process data: collecting, merging, cleaning, and structuring (identify users, identify sessions, identify page views, identify visits).
 Extract knowledge: usage patterns, user profiles, page profiles, visit profiles, and customer value.
 Feedback: how to better the data, how to improve the Web site, how to increase the customer value.

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Web Mining Success Stories

 Amazon.com, Ask.com, Scholastic.com, …


 Website Optimization Ecosystem (diagram): customer interaction on the Web → analysis of interactions → knowledge about the holistic view of the customer, supported by Web analytics, voice of customer, and customer experience management.

Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
Introduction
 Web usage mining: automatic discovery of
patterns in clickstreams and associated data
collected or generated as a result of user
interactions with one or more Web sites.
 Goal: analyze the behavioral patterns and
profiles of users interacting with a Web site.
 The discovered patterns are usually
represented as collections of pages, objects,
or resources that are frequently accessed by
groups of users with common interests.
Introduction
 Data in Web Usage Mining:
 Web server logs
 Site contents
 Data about the visitors, gathered from external channels
 Further application data
 Not all these data are always available.
 When they are, they must be integrated.
 A large part of Web usage mining is about
processing usage/clickstream data.
 After that, various data mining algorithms can be applied.

Source: Bing Liu
Introduction
 Each log entry (depending on the log format) may
contain fields identifying
 the time and date of the request,
the IP address of the client,
 the resource requested,
 possible parameters used in invoking a Web application,
 status of the request,
 HTTP method used,
 the user agent (browser and operating system type and version),
 the referring Web resource, and,
 if available, client-side cookies which uniquely identify a repeat
visitor.

Web server logs
1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
http://dataminingresources.blogspot.com/
2 2006-02-01 00:08:46 1.2.3.4 - GET /classes/cs589/papers/cms-tai.pdf - 200 4096
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727)
http://maya.cs.depaul.edu/~classes/cs589/papers.html
3 2006-02-01 08:01:28 2.3.4.5 - GET /classes/ds575/papers/hyperlink.pdf - 200
318814 HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1)
http://www.google.com/search?hl=en&lr=&q=hyperlink+analysis+for+the+web+survey
4 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/announce.html - 200 3794
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/
5 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/styles2.css - 200 1636
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/announce.html
6 2006-02-02 19:34:45 3.4.5.6 - GET /classes/cs480/header.gif - 200 6027
HTTP/1.1 maya.cs.depaul.edu
Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1)
http://maya.cs.depaul.edu/~classes/cs480/announce.html

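The fields above can be pulled apart with very little code. Below is a minimal Python sketch that splits one entry of the sample log into named fields; the field names and the whitespace-delimited layout are assumptions inferred from the sample and will differ for other log formats.

FIELDS = ["date", "time", "ip", "user", "method", "uri", "query",
          "status", "bytes", "protocol", "host", "agent", "referrer"]

def parse_line(line):
    parts = line.split()
    if parts and parts[0].isdigit():          # drop the leading sequence number, if present
        parts = parts[1:]
    entry = dict(zip(FIELDS, parts))
    entry["status"] = int(entry["status"])    # convert the numeric fields
    entry["bytes"] = int(entry["bytes"])
    return entry

line = ("1 2006-02-01 00:08:43 1.2.3.4 - GET /classes/cs589/papers.html - 200 9221 "
        "HTTP/1.1 maya.cs.depaul.edu "
        "Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+SV1;+.NET+CLR+2.0.50727) "
        "http://dataminingresources.blogspot.com/")
entry = parse_line(line)
print(entry["uri"], entry["status"], entry["referrer"])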
Web usage mining process
 The overall Web usage mining process can be divided into three
inter-dependent stages:
 data collection and pre-processing,
 pattern discovery, and
 pattern analysis.

 In the pre-processing stage, the clickstream data is cleaned and partitioned into a set of user transactions representing the activities of each user during different visits to the site.

 Other sources of knowledge, such as the site content or structure, as well as semantic domain knowledge from site ontologies (such as product catalogs or concept hierarchies), may also be used in pre-processing or to enhance user transaction data.

Web usage mining process
 In the pattern discovery stage, statistical, database, and machine
learning operations are performed to obtain hidden patterns
reflecting the typical behaviour of users, as well as summary
statistics on Web resources, sessions, and users.

 In the final stage of the process, the discovered patterns and statistics are further processed and filtered, possibly resulting in aggregate user models that can be used as input to applications such as recommendation engines, visualization tools, and Web analytics and report generation tools.

Data preparation

Sources and Types of Data
 The primary data sources used in Web usage mining are the server
log files, which include Web server access logs and application
server logs.
 Additional data sources that are also essential for both data
preparation and pattern discovery include the site files and meta-
data, operational databases, application templates, and domain
knowledge.
 The data obtained through various sources can be categorized into
four primary groups:
 Usage Data
 Content Data
 Structure Data
 User Data

Pre-processing of web usage data

Data Fusion
 In large-scale Web sites, it is typical that the content served to users
comes from multiple Web or application servers.
 In some cases, multiple servers with redundant content are used to
reduce the load on any particular server.
 Data fusion refers to the merging of log files from several Web and
application servers.
 This may require global synchronization across these servers.
 In the absence of shared embedded session ids, heuristic methods based on the "referrer" field in server logs, along with various sessionization and user identification methods, can be used to perform the merging.
 This step is essential in "inter-site" Web usage mining, where the analysis of user behaviour is performed over the log files of multiple related Web sites.

Data cleaning

 Data cleaning
 remove irrelevant references and fields in server
logs
 remove references due to spider navigation
 remove erroneous references
 add missing references due to caching (done after
sessionization)

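To make these cleaning rules concrete, here is a minimal Python sketch that applies the first three of them to already-parsed log entries. The extension list, bot patterns, status filter, and dictionary field names are illustrative assumptions, not a fixed specification.

import re

IRRELEVANT = re.compile(r"\.(gif|jpe?g|png|css|js|ico)$", re.IGNORECASE)   # illustrative extensions
BOTS = re.compile(r"bot|crawler|spider|slurp", re.IGNORECASE)              # illustrative bot patterns

def clean(entries):
    kept = []
    for e in entries:                         # each entry: dict with 'uri', 'status', 'agent'
        if IRRELEVANT.search(e["uri"]):       # drop image/style/script requests
            continue
        if BOTS.search(e.get("agent", "")):   # drop spider navigation
            continue
        if not (200 <= e["status"] < 400):    # drop erroneous references
            continue
        kept.append(e)
    return kept

sample = [
    {"uri": "/classes/cs480/announce.html", "status": 200, "agent": "Mozilla/4.0"},
    {"uri": "/classes/cs480/header.gif",    "status": 200, "agent": "Mozilla/4.0"},
    {"uri": "/classes/cs480/old.html",      "status": 404, "agent": "Googlebot/2.1"},
]
print([e["uri"] for e in clean(sample)])      # only announce.html survives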
Identify sessions (sessionization)

 In Web usage analysis, these data are the sessions of the site visitors: the activities performed by a user from the moment she enters the site until the moment she leaves it.
 Difficult to obtain reliable usage data due to
proxy servers and anonymizers, dynamic IP
addresses, missing references due to
caching, and the inability of servers to
distinguish among different visits.

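As a concrete illustration, the following minimal Python sketch applies one common time-oriented heuristic (an assumption here: a new session starts when the gap between a user's consecutive requests exceeds 30 minutes). The hit tuples are hypothetical data, and real sessionization must also cope with the proxy, caching, and dynamic-IP problems noted above.

from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)               # assumed inactivity threshold

def sessionize(hits):
    """hits: iterable of (user, timestamp, uri); returns user -> list of sessions."""
    sessions, last_seen = {}, {}
    for user, ts, uri in sorted(hits, key=lambda h: h[1]):
        if user not in sessions or ts - last_seen[user] > TIMEOUT:
            sessions.setdefault(user, []).append([])   # start a new session
        sessions[user][-1].append(uri)
        last_seen[user] = ts
    return sessions

hits = [                                       # hypothetical hits for one visitor
    ("1.2.3.4", datetime(2006, 2, 1, 0, 8, 43), "/classes/cs589/papers.html"),
    ("1.2.3.4", datetime(2006, 2, 1, 0, 8, 46), "/classes/cs589/papers/cms-tai.pdf"),
    ("1.2.3.4", datetime(2006, 2, 1, 8, 1, 28), "/classes/ds575/papers/hyperlink.pdf"),
]
print(sessionize(hits))                        # the 8-hour gap starts a second session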
Sessionization strategies

Sessionization heuristics

Sessionization example

User identification

User identification: an example
Pageview

 A pageview is an aggregate representation of a collection of Web objects contributing to the display on a user's browser resulting from a single user action (such as a click-through).
 Conceptually, each pageview can be viewed
as a collection of Web objects or resources
representing a specific “user event,” e.g.,
reading an article, viewing a product page, or
adding a product to the shopping cart.

Path completion
 Client- or proxy-side caching can often result
in missing access references to those pages
or objects that have been cached.
 For instance,
 if a user returns to a page A during the same
session, the second access to A will likely result in
viewing the previously downloaded version of A
that was cached on the client-side, and therefore,
no request is made to the server.
 This results in the second reference to A not being
recorded on the server logs.
Missing references due to caching

Path completion
 Path completion addresses the problem of inferring user references that are missing due to caching.
 Effective path completion requires extensive knowledge of the link structure within the site.
 Referrer information in server logs can also be used in disambiguating the inferred paths.
 The problem becomes much more complicated in frame-based sites.

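The sketch below illustrates one simple referrer-based heuristic for path completion (an assumption, not the full method, which would also consult the site's link structure): when a request's referrer is not the previous page in the reconstructed path but has been visited earlier, the intermediate pages are assumed to have been revisited from the cache and are re-inserted.

def complete_path(session):
    """session: list of (page, referrer) pairs in time order; returns the inferred path."""
    path = []
    for page, referrer in session:
        if path and referrer is not None and referrer != path[-1] and referrer in path:
            i = len(path) - 1
            back = []
            while path[i] != referrer:        # backtrack through cached pages
                i -= 1
                back.append(path[i])
            path.extend(back)                 # re-inserted (cached) page views
        path.append(page)
    return path

session = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]   # D was requested from A
print(complete_path(session))                 # ['A', 'B', 'C', 'B', 'A', 'D']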
Integrating with e-commerce events
 Events are either product-oriented or visit-oriented.
 They are used to track and analyze the conversion of browsers to buyers.
 The major difficulty for e-commerce events is defining and implementing the events for a site; however, in contrast to clickstream data, getting reliable preprocessed data is not a problem.
 Another major challenge is the successful integration with clickstream data.

Product-Oriented Events

 Product View
 Occurs every time a product is displayed on a
page view
 Typical Types: Image, Link, Text
 Product Click-through
 Occurs every time a user “clicks” on a product to
get more information

Product-Oriented Events

 Shopping Cart Changes


 Shopping Cart Add or Remove
 Shopping Cart Change - quantity or other feature
(e.g. size) is changed
 Product Buy or Bid
 Separate buy event occurs for each product in the
shopping cart
 Auction sites can track bid events in addition to
the product purchases

Integration with page content

Integration with link structure

E-commerce data analysis
Session analysis

 The simplest form of analysis: examine individual or groups of server sessions and e-commerce data.
 Advantages:
 Gain insight into typical customer behaviors.
 Trace specific problems with the site.
 Drawbacks:
 LOTS of data.
 Difficult to generalize.

Session analysis: aggregate reports

OLAP

Data mining

Some usage mining applications

Data Mining Purpose: System Improvement

Personalization application

Standard approaches

Suggest: Online Recommender System

Data Modeling for Web Usage Mining
Agenda

 The Apriori Algorithm (mining single-dimensional Boolean association rules)

 Frequent-Pattern Growth (FP-Growth) Method

 Summary
The Apriori Algorithm: Key Concepts
 k-itemset: an itemset containing k items.

 Support or Frequency: the number of transactions that contain a particular itemset.

 Frequent Itemset: an itemset that satisfies minimum support (denoted by Lk for a frequent k-itemset).

 Apriori Property: all non-empty subsets of a frequent itemset must be frequent.

 Join Operation: Ck, the set of candidate k-itemsets, is generated by joining Lk-1 with itself (L1: frequent 1-itemsets, Lk: frequent k-itemsets).

 Prune Operation: Lk, the set of frequent k-itemsets, is extracted from Ck by pruning it, i.e., getting rid of all the non-frequent k-itemsets in Ck.

 Iterative level-wise approach: k-itemsets are used to explore (k+1)-itemsets.
The Apriori Algorithm finds frequent k-itemsets.
How is the Apriori Property used in the Algorithm?

 Mining single-dimensional Boolean association rules is a 2-step process:
 Using the Apriori Property, find the frequent itemsets:
 each iteration generates Ck (candidate k-itemsets derived from Lk-1) and Lk (frequent k-itemsets).
 Use the frequent k-itemsets to generate association rules.
Finding frequent itemsets using the Apriori
Algorithm: Example

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

 Consider a database D consisting of 9 transactions.
 Each transaction is represented by an itemset.
 Suppose the minimum support required is 2 (2 out of 9 = 2/9 = 22%).
 Say the minimum confidence required is 70%.
 We first have to find the frequent itemsets using the Apriori Algorithm.
 Then, association rules will be generated using min. support and min. confidence.
Step 1: Generating candidate and frequent 1-itemsets with min. support = 2

Scan D for the count of each candidate, then compare each candidate's support count with the minimum support count:

C1 (which here equals L1, since every item is frequent):
Itemset   Sup. Count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

 In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1, along with its support count.
 The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.
Step 2: Generating candidate and frequent 2-itemsets with min. support = 2

Generate C2 candidates from L1 x L1, scan D for the count of each candidate, and compare the candidate support counts with the minimum support count:

C2 candidates: {I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}

C2 support counts: {I1,I2}: 4, {I1,I3}: 4, {I1,I4}: 1, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2, {I3,I4}: 0, {I3,I5}: 1, {I4,I5}: 0

L2: {I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2

Note: we haven't used the Apriori Property to prune C2 yet.
Step 3: Generating candidate and frequent 3-itemsets with min. support = 2

Generate C3 candidates from L2, scan D for the count of each remaining candidate, and compare the candidate support counts with the minimum support count:

C3 after the join step: {I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}. The last four candidates contain non-frequent 2-itemset subsets and are removed by the prune step.

C3 support counts after scanning D: {I1,I2,I3}: 2, {I1,I2,I5}: 2

L3: {I1,I2,I3}: 2, {I1,I2,I5}: 2

 The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori Property.
 When the Join step is complete, the Prune step is used to reduce the size of C3. The Prune step helps to avoid heavy computation due to large Ck. A minimal sketch of these two steps follows.
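A minimal Python sketch of the Join and Prune steps, using the L2 from this example. Function and variable names are illustrative; the join merges two 2-itemsets that agree on their first item, following the standard Lk-1 join Lk-1 formulation.

from itertools import combinations

# L2 from the worked example above.
L2 = {frozenset(p) for p in
      [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
       ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}

def join(prev_frequent, k):
    """Join step: merge two (k-1)-itemsets that agree on their first k-2 items."""
    prev = sorted(sorted(s) for s in prev_frequent)
    out = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            if prev[i][:k - 2] == prev[j][:k - 2]:
                out.add(frozenset(prev[i]) | frozenset(prev[j]))
    return out

def prune(candidates, prev_frequent, k):
    """Prune step (Apriori property): every (k-1)-subset must be frequent."""
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}

C3_joined = join(L2, 3)
C3 = prune(C3_joined, L2, 3)
print(sorted(map(sorted, C3_joined)))   # six candidates, e.g. ['I2', 'I3', 'I4'] included
print(sorted(map(sorted, C3)))          # [['I1', 'I2', 'I3'], ['I1', 'I2', 'I5']]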
Step 4: Generating frequent 4-itemset

 L3 join L3 gives C4 = {{I1, I2, I3, I5}}.

 This itemset is pruned because its subset {I2, I3, I5} is not frequent.

 Thus C4 = φ, and the algorithm terminates, having found all of the frequent itemsets.

 This completes our Apriori Algorithm. What's next?

 These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).
Step 5: Generating Association Rules from frequent
k-itemsets
 Procedure:
 For each frequent itemset l, generate all nonempty subsets of l.
 For every nonempty subset s of l, output the rule "s → (l - s)" if support_count(l) / support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold (70% in our case).

 Back to the example:
 Let's take l = {I1, I2, I5}.
 The nonempty subsets of l are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
Step 5: Generating Association Rules from frequent
k-itemsets [Cont.]

 The resulting association rules are:

 R1: I1 ^ I2 → I5
 Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%
 R1 is Rejected.
 R2: I1 ^ I5 → I2
 Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%
 R2 is Selected.
 R3: I2 ^ I5 → I1
 Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%
 R3 is Selected.
Step 5: Generating Association Rules from Frequent
Itemsets [Cont.]

 R4: I1 → I2 ^ I5
 Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%
 R4 is Rejected.
 R5: I2 → I1 ^ I5
 Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%
 R5 is Rejected.
 R6: I5 → I1 ^ I2
 Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%
 R6 is Selected.

We have found three strong association rules.


Apriori Algorithm
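A minimal, self-contained Python sketch of the level-wise Apriori search and the rule-generation step, run on the 9-transaction example above. The candidate generation uses a simplified combinations-plus-prune formulation rather than the classic prefix join, and because rules are enumerated for every frequent itemset (not only {I1, I2, I5}), a few additional strong rules such as I4 → I2 also appear.

from itertools import combinations

# The 9-transaction example database.
DB = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
MIN_SUP = 2      # absolute support count
MIN_CONF = 0.7   # 70 %

def support_counts(candidates):
    """Count how many transactions of DB contain each candidate itemset."""
    return {c: sum(1 for t in DB if c <= t) for c in candidates}

def candidates(prev_frequent, k):
    """Simplified candidate generation: all k-combinations of frequent items,
    pruned by the Apriori property (every (k-1)-subset must be frequent)."""
    items = sorted({i for s in prev_frequent for i in s})
    return {frozenset(c) for c in combinations(items, k)
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}

# Level-wise search for the frequent itemsets L1, L2, ...
supports = {}
counts = support_counts({frozenset([i]) for t in DB for i in t})
level = {c for c, n in counts.items() if n >= MIN_SUP}
while level:
    supports.update({c: counts[c] for c in level})
    k = len(next(iter(level))) + 1
    counts = support_counts(candidates(level, k))
    level = {c for c, n in counts.items() if n >= MIN_SUP}

print(sorted((sorted(c), n) for c, n in supports.items()))

# Rule generation: s -> (l - s) is strong if sup(l) / sup(s) >= MIN_CONF.
for l in (f for f in supports if len(f) >= 2):
    for r in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), r)):
            conf = supports[l] / supports[s]
            if conf >= MIN_CONF:
                print(sorted(s), "->", sorted(l - s), f"(conf = {conf:.0%})")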
Agenda

 The Apriori Algorithm (mining single-dimensional Boolean association rules)

 Frequent-Pattern Growth (FP-Growth) Method

 Summary
Mining Frequent Patterns Without Candidate
Generation
 Compress a large database into a compact, Frequent-
Pattern tree (FP-tree) structure
 Highly condensed, but complete for frequent pattern mining
 Avoid costly database scans
 Develop an efficient, FP-tree-based frequent pattern mining method
 A divide-and-conquer methodology:
 Compress DB into FP-tree, retain itemset associations
 Divide the new DB into a set of conditional DBs – each
associated with one frequent item
 Mine each such database separately
 Avoid candidate generation
FP-Growth Method : An Example

TID    List of Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

 Consider the previous example of a database D, consisting of 9 transactions.
 Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 = 22%).
 The first scan of the database is the same as in Apriori: it derives the set of 1-itemsets and their support counts.
 The set of frequent items is sorted in order of descending support count.
 The resulting set is denoted as L = {I2:7, I1:6, I3:6, I4:2, I5:2}.
FP-Growth Method: Construction of FP-Tree

 First, create the root of the tree, labeled with “null”.


 Scan the database D a second time (the first scan created the 1-itemsets and L); this second scan generates the complete tree.
 The items in each transaction are processed in L order (i.e. sorted
order).
 A branch is created for each transaction with items having their
support count separated by colon.
 Whenever the same node is encountered in another transaction, we
just increment the support count of the common node or Prefix.
 To facilitate tree traversal, an item header table is built so that each
item points to its occurrences in the tree via a chain of node-links.
 Now the problem of mining frequent patterns in the database is transformed into that of mining the FP-tree.
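The construction just described can be sketched in a few lines of Python. This is a minimal illustration (class and function names are my own, and ties in support are broken lexicographically so that the item order matches L = {I2, I1, I3, I4, I5}).

class FPNode:
    """FP-tree node: item label, count, parent link, children, and node-link."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}        # item -> FPNode
        self.link = None          # next node carrying the same item (header chain)

def build_fp_tree(transactions, min_sup=2):
    # First scan: support counts; keep frequent items in descending-count order.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = {i: c for i, c in sorted(counts.items(), key=lambda x: (-x[1], x[0]))
             if c >= min_sup}

    root, header = FPNode(None, None), {i: None for i in order}
    # Second scan: insert each transaction with its items sorted in L order.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order),
                           key=lambda i: (-order[i], i)):
            if item in node.children:
                node.children[item].count += 1          # shared prefix: bump count
            else:
                child = FPNode(item, node)
                node.children[item] = child
                cur = header[item]                       # append to node-link chain
                if cur is None:
                    header[item] = child
                else:
                    while cur.link:
                        cur = cur.link
                    cur.link = child
            node = node.children[item]
    return root, header, order

DB = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, header, order = build_fp_tree(DB)
print(order)                                             # {'I2': 7, 'I1': 6, 'I3': 6, 'I4': 2, 'I5': 2}
print({i: n.count for i, n in root.children.items()})    # {'I2': 7, 'I1': 2}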
FP-Growth Method: Construction of FP-Tree

Header table (item : support count, with node-links into the tree):
I2: 7, I1: 6, I3: 6, I4: 2, I5: 2

FP-tree (root labelled null):
null
 +- I2:7
 |   +- I1:4
 |   |   +- I5:1
 |   |   +- I4:1
 |   |   +- I3:2
 |   |       +- I5:1
 |   +- I4:1
 |   +- I3:2
 +- I1:2
     +- I3:2

An FP-tree that registers compressed, frequent-pattern information.
Mining the FP-Tree by Creating Conditional (sub)
pattern bases
1. Start from each frequent length-1 pattern (as an initial
suffix pattern).
2. Construct its conditional pattern base which consists of
the set of prefix paths in the FP-Tree co-occurring with
suffix pattern.
3. Then, construct its conditional FP-Tree & perform
mining on this tree.
4. The pattern growth is achieved by concatenation of the
suffix pattern with the frequent patterns generated from
a conditional FP-Tree.
5. The union of all frequent patterns (generated by step
4) gives the required frequent itemset.
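Continuing the FP-tree sketch above (this fragment assumes the FPNode class, build_fp_tree, and the header table from that sketch), the conditional pattern base of an item can be collected by walking its node-link chain and recording each node's prefix path together with the node's count.

def prefix_paths(item, header):
    """Conditional pattern base of `item`: each entry is (prefix path, count)."""
    base, node = [], header[item]
    while node:                                   # follow the node-link chain
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
        node = node.link
    return base

print(prefix_paths("I5", header))   # [(['I2', 'I1'], 1), (['I2', 'I1', 'I3'], 1)]
print(prefix_paths("I3", header))   # [(['I2'], 2), (['I1'], 2), (['I2', 'I1'], 2)]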
FP-Tree Example Continued

Item  Conditional pattern base         Conditional FP-tree       Frequent patterns generated
I5    {(I2 I1: 1), (I2 I1 I3: 1)}      <I2: 2, I1: 2>            I2 I5: 2, I1 I5: 2, I2 I1 I5: 2
I4    {(I2 I1: 1), (I2: 1)}            <I2: 2>                   I2 I4: 2
I3    {(I2 I1: 2), (I2: 2), (I1: 2)}   <I2: 4, I1: 2>, <I1: 2>   I2 I3: 4, I1 I3: 4, I2 I1 I3: 2
I1    {(I2: 4)}                        <I2: 4>                   I2 I1: 4

Mining the FP-tree by creating conditional (sub-)pattern bases, following the steps mentioned above:
 Let's start with I5. I5 is involved in 2 branches, namely {I2 I1 I5: 1} and {I2 I1 I3 I5: 1}.
 Therefore, considering I5 as the suffix, its 2 corresponding prefix paths are {I2 I1: 1} and {I2 I1 I3: 1}, which form its conditional pattern base.
FP-Tree Example Continued

 Out of these, only I1 and I2 are selected for the conditional FP-tree, because I3 does not satisfy the minimum support count.
For I1, support count in the conditional pattern base = 1 + 1 = 2.
For I2, support count in the conditional pattern base = 1 + 1 = 2.
For I3, support count in the conditional pattern base = 1.
Thus the support count for I3 is less than the required min_sup, which is 2 here.
 Now, we have a conditional FP-Tree with us.
 All frequent pattern corresponding to suffix I5 are generated by
considering all possible combinations of I5 and conditional FP-Tree.
 The same procedure is applied to suffixes I4, I3 and I1.
 Note: I2 is not taken into consideration for suffix because it doesn’t
have any prefix at all.
Why Is Frequent-Pattern Growth Fast?

 Performance study shows


 FP-growth is an order of magnitude faster than Apriori
 Reasoning
 No candidate generation, no candidate test
 Use compact data structure
 Eliminate repeated database scans
 Basic operation is counting and FP-tree building
1. BIRCH – the definition

• Balanced
• Iterative
• Reducing and
• Clustering using
• Hierarchies
1. BIRCH – the definition

 An unsupervised data mining algorithm used to


perform hierarchical clustering over particularly large
data-sets.
2. Data Clustering – problems

• Data-set too large to fit in main memory.

• I/O operations cost the most (seek times on disk are orders of magnitude higher than RAM access times).

• BIRCH offers I/O cost linear in the size of the dataset.


2. Data Clustering – other solutions

• Probability-based clustering algorithms


(COBWEB and CLASSIT)

• Distance-based clustering algorithms


(KMEANS, KMEDOIDS and CLARANS)
3. BIRCH advantages

• It is local in that each clustering decision is made without


scanning all data points and currently existing clusters.

• It exploits the observation that data space is not usually


uniformly occupied and not every data point is equally important.

• It makes full use of available memory to derive the finest


possible sub-clusters while minimizing I/O costs.

• It is also an incremental method that does not require the


whole dataset in advance.
4. BIRCH concepts and terminology

Hierarchical clustering
4. BIRCH concepts and terminology

Clustering Feature

• The BIRCH algorithm builds a clustering feature tree (CF


tree) while scanning the data set.

• Each entry in the CF tree represents a cluster of objects and is


characterized by a triple (N, LS, SS).
4. BIRCH concepts and terminology

Clustering Feature

• Given N d-dimensional data points in a cluster, Xi (i = 1, 2, 3, …, N), the CF vector of the cluster is defined as a triple CF = (N, LS, SS):
- N: the number of data points in the cluster
- LS: the linear sum of the N data points
- SS: the square sum of the N data points
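A minimal Python/NumPy sketch of the CF triple and two of its useful properties, additivity and the derived centroid/radius. Here SS is taken as the scalar sum of squared norms; some formulations keep a per-dimension vector instead, and the function names are illustrative.

import numpy as np

def cf(points):
    """Clustering Feature of a set of d-dimensional points: (N, LS, SS)."""
    X = np.asarray(points, dtype=float)
    return len(X), X.sum(axis=0), float((X ** 2).sum())

def merge(a, b):
    """CF additivity: two subclusters can be merged without revisiting their points."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]

def centroid_and_radius(triple):
    N, LS, SS = triple
    centroid = LS / N
    radius = np.sqrt(max(SS / N - centroid @ centroid, 0.0))  # avg. distance from centroid
    return centroid, radius

c1 = cf([[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]])
c2 = cf([[4.0, 5.0]])
print(centroid_and_radius(merge(c1, c2)))     # centroid [2.5, 3.5], radius ~1.58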
4. BIRCH concepts and terminology

CF Tree
• a height-balanced tree with two parameters:
- branching factor B
- threshold T

• Each non-leaf node contains at most B entries of the


form [CFi, childi], where childi is a pointer to its i-th child
node and CFi is the CF of the subcluster represented by this child.

• So, a non-leaf node represents a cluster made up of all the


subclusters represented by its entries.
4. BIRCH concepts and terminology

CF Tree
• A leaf node contains at most L entries,
each of them of the form [CFi], where i = 1, 2, …, L .
• It also has two pointers, prev and next,
which are used to chain all leaf nodes together
for efficient scans.
• A leaf node also represents a cluster
made up of all the subclusters represented by its entries.
• But all entries in a leaf node must satisfy
a threshold requirement, with respect to a threshold value T:
the diameter (or radius) has to be less than T.
4. BIRCH concepts and terminology

CF Tree

• The tree size is a function of T (the larger T is, the smaller the tree is).
• We require a node to fit in a page of size P.
• B and L are determined by P (P can be varied for performance tuning).
• The tree is a very compact representation of the dataset, because each entry in a leaf node is not a single data point but a subcluster.
4. BIRCH concepts and terminology

CF Tree

• The leaves contain the actual clusters.

• The size of any cluster in a leaf is not larger than T.


5. BIRCH algorithm

• Phase 1: Scan all data and build an initial in-memory CF tree, using the given amount of memory and recycling space on disk.

• Phase 2: Condense the tree into a desirable size by building a smaller CF tree.

• Phase 3: Global clustering.

• Phase 4: Cluster refining – this is optional, and requires more passes over the data to refine the results.
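For experimentation, scikit-learn ships a Birch estimator whose threshold and branching_factor parameters correspond to T and B above. The snippet below is a minimal usage sketch on synthetic data (assuming scikit-learn and NumPy are installed), with n_clusters driving the global-clustering step of Phase 3.

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# three synthetic 2-D blobs standing in for a large dataset
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
               for c in ((0, 0), (3, 3), (0, 4))])

# threshold ~ T, branching_factor ~ B; n_clusters drives the global (Phase 3) step
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(len(model.subcluster_centers_), "leaf subclusters ->", len(set(labels)), "final clusters")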
5. BIRCH algorithm

5.1. Phase 1

• Starts with initial threshold, scans the data and inserts points
into the tree.
• If it runs out of memory before it finishes scanning the data,
it increases the threshold value and
rebuilds a new, smaller CF tree,
by re-inserting the leaf entries from the older tree and
then resuming the scanning of the data from the point at which
it was interrupted.

• A good initial threshold is important but hard to figure out.

• Outlier removal (when rebuilding the tree).
5. BIRCH algorithm

5.2. Phase 2 (optional)

• Preparation for Phase 3.

• Potentially, there is a gap between the size of Phase 1


results and the input range of Phase 3.

• It scans the leaf entries in the initial CF tree to rebuild a smaller CF tree, while removing more outliers and grouping crowded subclusters into larger ones.
5. BIRCH algorithm

5.3. Phase 3

• Problems after Phase 1:


– Input order affects results.
– Splitting triggered by node size.

• Phase 3:
– It uses a global or semi-global algorithm to cluster all leaf
entries.
– An adapted agglomerative hierarchical clustering algorithm is applied directly to the subclusters, represented by their CF vectors.
5. BIRCH algorithm

5.4. Phase 4 (optional)

• Additional passes over the data to correct inaccuracies and refine the clusters further.
• It uses the centroids of the clusters produced by Phase 3 as seeds, and redistributes the data points to their closest seeds to obtain a set of new clusters.
• Converges to a minimum (no matter how many times it is repeated).
• Option of discarding outliers.
6. Conclusion

Pros

 BIRCH performs faster than existing algorithms (CLARANS and KMEANS) on large datasets.

 Scans whole data only once.

 Handles outliers better.

 Superior to other algorithms in stability and scalability.


6. Conclusion

Cons

• Since each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to what a user may consider a natural cluster.

• Moreover, if the clusters are not spherical in shape,


it doesn’t perform well because it uses the notion of radius
or diameter to control the boundary of a cluster.
Thank You for Your Attention !
