
International Journal of Computer Engineering and Technology (IJCET)
ISSN 0976-6367 (Print), ISSN 0976-6375 (Online)
Volume 3, Issue 3, October-December (2012), pp. 377-383
IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2012): 3.9580 (Calculated by GISI) www.jifactor.com

A FRAMEWORK FOR CLUSTERING TIME-EVOLVING DATA USING SLIDING WINDOW TECHNIQUE

Y. Swapna (1), S. Ravi Sankar (2)
(1) Faculty, CSE Department, National Institute of Technology, Goa, India, spr@nitgoa.ac.in
(2) Faculty, CSE Department, National Institute of Technology, Goa, India, srs@nitgoa.ac.in

ABSTRACT

Clustering is the process of dividing a dataset into mutually exclusive groups such that the members of each group are as "close" as possible to one another and different groups are as "far" as possible from one another. Sampling represents a large data set by a smaller random sample and is used to improve the efficiency of clustering; however, when sampling is applied, the points that are not sampled are left without labels after the normal clustering process. This problem has been solved for the numerical domain, whereas clustering time-evolving data in the categorical domain remains a challenging issue. In this paper, a sliding window of a specified size is used to form subsets of data from the dataset, i.e., to collect data from the database and transfer them to the clustering module. A drifting-concept detection (DCD) algorithm is proposed that counts the outliers, i.e., the points that cannot be assigned to any cluster, and compares the distribution of clusters and outliers between the last clustering result and the current temporal clustering result. The experimental evaluation shows that performing DCD is faster than clustering the entire data set once, and that DCD provides high-quality clustering results with correctly detected drifting concepts.

Keywords: clustering, sampling, categorical domain, labels, sliding window, drifting concept detection.

I. INTRODUCTION

Our present information-age society thrives and evolves on knowledge. Knowledge is derived from information gleaned from a wide variety of reservoirs of data (databases). Clustering is an important technique for exploratory data analysis and has been the focus of substantial research in several domains for decades. Clusters are connected regions of a multidimensional space with a relatively high density of points, separated from other such


regions by a region containing a low density of points. Clustering is useful for classification, statistical pattern recognition, machine learning, and information retrieval; it can reveal the structure of high-dimensional data spaces and may expose interesting outliers, which makes it applicable in a wide range of applications. Cluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense; it helps us gain insight into the data distribution.

In real-world domains, the concept of interest may depend on some hidden context that is not given explicitly in the form of predictive features, which becomes a problem as these concepts drift with time. A suitable example is the buying preferences of customers, which may change with time depending on their needs, climatic conditions, discounts, etc. Since the concepts behind the data evolve with time, the underlying clusters may also change significantly with time. Concept drift not only decreases the quality of clusters but also disregards the expectations of users, who usually require recent clustering results.

Many works have explored the problem of clustering time-evolving data in the numerical domain. Categorical attributes also prevalently occur in real data with drifting concepts; for example, Web logs that record the browsing history of users, stock market details, and buying records of customers often evolve with time. Previous works on clustering categorical data focus on clustering the entire data set and do not take drifting concepts into consideration. Consequently, the problem of clustering time-evolving data in the categorical domain remains a challenging issue.

The objective of this paper is to propose a framework for performing clustering on categorical time-evolving data. The goal is to use a generalized clustering framework that utilizes existing clustering algorithms and detects whether there is a drifting concept in the incoming data, instead of designing a specific clustering algorithm. The sliding window technique is adopted to detect the drifting concepts.

II. RELATED WORK
Many different numerical clustering algorithms that consider time-evolving data, as well as traditional categorical clustering algorithms, have been proposed [1]. An effective and efficient method called CluStream for clustering large evolving data streams was proposed in [5]. This method tries to cluster the whole stream at one time rather than viewing the stream as a process changing over time. A density-based method called DenStream was proposed in [2] for discovering clusters in an evolving data stream. Evolutionary clustering algorithms were proposed in [5] and [3]. They adopt the same approach of performing data clustering over time while trying to optimize two potentially conflicting criteria: first, without a drifting concept, the previous and the present clusters must be similar; and second, with a drifting concept, the clustering should reflect the data that arrived at that time step. In [6], a generic framework for this problem extended the k-means and agglomerative hierarchical clustering algorithms according to the problem domain. In [5], a measure of temporal smoothness is integrated into the overall measure of clustering quality; as a result, the method produces stable and consistent clustering results that are less sensitive to short-term noise while remaining adaptive to long-term cluster drifts. These previously proposed methods concentrate on the problem of clustering time-evolving data in the numerical domain. In [4], the problem of clustering categorical data is discussed, where clustering is performed on customer transaction data in a market database.


In [6] and [4], frameworks to perform clustering on categorical time-evolving data have been proposed. In particular, the rough membership function in rough set theory represents a concept that induces a fuzzy set. Several extension works based on k-modes are presented for different objectives, such as fuzzy k-modes [6] and initial-points refinement [2]. These categorical algorithms focus on clustering the entire data set and do not consider time-evolving trends.

III. THE PROPOSED APPROACH

We propose a generalized clustering framework that utilizes existing clustering algorithms and detects whether there is a drifting concept in the incoming data. In order to detect drifting concepts at different sliding windows, we propose the algorithm DCD, which compares the cluster distributions between the last clustering result and the current temporal clustering result. The input is a collection of data extracted from the database to be clustered, and the data are time-evolving categorical data (not necessarily arriving in a sequential manner). We used a synthetic data generator [5] to generate data sets with different numbers of data points and attributes. The number of data points varies from 10,000 to 100,000, and the dimensionality is in the range of 10-50. In all synthetic data sets, each dimension possesses 20 attribute values.

A sliding window of a specified size is used to form subsets of data from the dataset, i.e., to collect data from the database and transfer them to the clustering module. In this paper, a practical categorical cluster representative, named Node Importance Representative (abbreviated as NIR), is utilized. It represents clusters by measuring the importance of each attribute value (node) in the clusters. The Drifting Concept Detection (DCD) algorithm (Fig. 2) is used to detect the difference in cluster distribution between the current data subset and the last clustering result. In order to perform a proper evaluation, we label the data points with clusters; those that do not belong to any cluster are called outliers. Reclustering is performed if the difference between the clustering results is large enough.

A data point p_j is assigned to the cluster c_k, 1 <= k <= l, that attains the maximal resemblance. The resemblance for a given data point p_j and the NIR table of a cluster c_k is defined by the following equation:

R(p_j, c_k) = sum over the nodes I_r of p_j of w(c_k, I_r)    (1)

where w(c_k, I_r) is one entry in the NIR table of cluster c_k. As shown in equation (1), the resemblance can be obtained directly by summing up the node importances in the NIR table of cluster c_k. The resemblance will be larger if the data point contains nodes that are more important in one cluster than in another, and that cluster is considered to attain the maximal resemblance. If the resemblance values with respect to every cluster are small, the data point is treated as an outlier. Therefore, a threshold λ_i is set in each cluster to identify outliers. The decision function is defined as follows:

Label(p_j) = c_i, if max R(p_j, c_i) > λ_i, where 1 <= i <= l; outlier, otherwise.
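As a minimal Python sketch of equation (1) and the labeling rule above (not the authors' implementation: the importance of a node is approximated here by its within-cluster frequency, which is an assumption about the NIR weighting):

```python
from collections import Counter

def build_nir(cluster_points):
    """Approximate an NIR table for one cluster: the importance of a node
    [A_a = v] is taken here as the fraction of cluster points that hold
    value v on attribute a (the paper's exact weighting may differ)."""
    n = len(cluster_points)
    counts = Counter()
    for point in cluster_points:
        for a, v in enumerate(point):
            counts[(a, v)] += 1
    return {node: c / n for node, c in counts.items()}

def resemblance(point, nir):
    """Equation (1): sum the importances of the point's nodes in the NIR table."""
    return sum(nir.get((a, v), 0.0) for a, v in enumerate(point))

def label(point, nirs, thresholds):
    """Assign the point to the cluster of maximal resemblance if that value
    exceeds the cluster's outlier threshold; otherwise mark it an outlier."""
    best = max(range(len(nirs)), key=lambda i: resemblance(point, nirs[i]))
    if resemblance(point, nirs[best]) > thresholds[best]:
        return best
    return "outlier"

# Tiny example in the spirit of Fig. 1 (illustrative values, not the paper's).
c1 = [("C", "W", "D"), ("C", "W", "N"), ("C", "W", "D")]
c2 = [("I", "W", "M"), ("I", "T", "H")]
nirs = [build_nir(c1), build_nir(c2)]
print(label(("C", "W", "D"), nirs, [0.5, 0.5]))  # fits cluster 0
print(label(("B", "E", "F"), nirs, [0.5, 0.5]))  # no node matches -> outlier
```

A point whose attribute values are frequent in one cluster accumulates a large resemblance there; a point matching no node in any cluster scores zero everywhere and falls below every threshold.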

As shown in Fig. 1, the data points in the second sliding window S2 are labeled with the thresholds λ1 = λ2 = 0.5. The first data point in S2, p6 = (B, E, F), is decomposed into three nodes, i.e., {[A1 = B]}, {[A2 = E]}, {[A3 = F]}. The resemblance of p6 in cluster c1 is zero, and in cluster c2 it is also zero; since the maximal resemblance is not larger than the threshold, p6 is considered an outlier. For the next data point, the resemblance in c1 is 0.037 and in c2 it is 1.537 (0.5 + 0.037 + 1). The maximal resemblance is therefore attained in c2, and since this value is larger than the threshold λ2 = 0.5, the point is labeled with cluster c2.

Fig. 1: The temporal clustering result (the data points p1-p15 over the attributes A1-A3, grouped into the sliding windows S1-S3, together with the labeled clusters and the outliers)

Algorithm used:

DriftingConceptDetecting(C_temp, S_t)
  out = 0
  while there is a next tuple in S_t do
    read in data point p_j from S_t
    divide p_j into its nodes I_1 to I_m
    for all clusters c_temp_i in C_temp do
      calculate the resemblance R(p_j, c_temp_i)
    end for
    find the cluster c_temp_m with maximal resemblance
    if R(p_j, c_temp_m) > λ_m then
      p_j is assigned to c_temp_m
    else
      out = out + 1
    end if
  end while
  Outlier_t = out   {data labeling on the current sliding window S_t is done}
  Numdiffclusters = 0
  for all clusters c_temp_i in C_temp do
    if the cluster variation of c_temp_i > ε (the cluster variation threshold) then
      Numdiffclusters = Numdiffclusters + 1
    end if
  end for
  if Outlier_t exceeds the outlier threshold or Numdiffclusters exceeds the cluster difference threshold then
    {concept drifts} dump out C_temp and call initial clustering on S_t
  else
    {concept does not drift} add the temporal clustering result into C_temp and update the NIR
  end if

Since we measure the similarity between a data point and a cluster as R(p_j, c_i), the cluster with the maximal resemblance is the most appropriate cluster for that data point. If even that maximal resemblance is smaller than the threshold λ of that cluster, the data point is seen as an outlier.

In order to observe the relationship between different clustering results, cluster relationship analysis is used to analyze and show the changes between different clustering results. It measures the similarity of clusters between the clustering results at different time stamps and links the similar clusters.

Fig. 2: The similarity table between clustering results (cosine measures between the clusters of two clustering results; the surviving entries include 0.012, 0.182, 0.567, 1, and 0)

For example, the cosine measure between one pair of clusters evaluates to 0.567; since this is larger than the cosine measures obtained with the other clusters, that pair of clusters is linked as the most similar.
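The drift decision at the end of the algorithm can be sketched compactly in Python (a simplified illustration: the function name, the per-cluster variation values, and passing the thresholds as plain parameters are assumptions, not the paper's exact formulation):

```python
def concept_drifts(num_outliers, n_points, variations,
                   epsilon, theta_outlier, theta_cluster):
    """DCD decision: declare a drifting concept when the outlier ratio in
    the current window, or the fraction of clusters whose distribution
    changed by more than epsilon, exceeds its threshold."""
    outlier_ratio = num_outliers / n_points
    num_diff = sum(1 for v in variations if v > epsilon)
    return outlier_ratio > theta_outlier or num_diff / len(variations) > theta_cluster

# Window of 100 points: 40 outliers, and 2 of the 3 clusters changed noticeably,
# so reclustering on the current window would be triggered.
print(concept_drifts(40, 100, [0.8, 0.6, 0.05],
                     epsilon=0.1, theta_outlier=0.3, theta_cluster=0.5))
```

When the function returns true, the last clustering result is dumped and initial clustering is rerun on the current window; otherwise the temporal result is merged and the NIR is updated.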



Table 1: Symbols used in the algorithm

A_a        — the a-th attribute in the data set
C[t1, t2]  — the clustering result from t1 to t2
C_t        — the clustering result on sliding window t
C'_t       — the temporal clustering result on sliding window t
c_j        — the j-th cluster in C
NIR(c_j)   — the node-importance vector of c_j
I_r        — the r-th node in a data point
|I_r|      — the number of occurrences of I_r
k          — the number of clusters in C
|c_j|      — the number of data points in c_j
N          — the size of the sliding window
S_t        — the sliding window t
t          — the timestamp index of the sliding window
w(c_j, I_r) — the importance of node I_r in cluster c_j
λ          — the outlier threshold
ε          — the cluster variation threshold
θ          — the cluster difference threshold
CM(v_a, v_b) — the cosine measure between the cluster vectors v_a and v_b
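The cosine measure CM between cluster vectors can be sketched generically, treating each cluster as a sparse vector of node-importance weights (an assumed representation; the dictionaries below are illustrative, not taken from the paper):

```python
import math

def cosine_measure(vec_a, vec_b):
    """CM(a, b): cosine of the angle between two cluster vectors, where each
    vector maps a node [A_a = v] to its importance weight in the cluster."""
    nodes = set(vec_a) | set(vec_b)
    dot = sum(vec_a.get(n, 0.0) * vec_b.get(n, 0.0) for n in nodes)
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

identical = {("A1", "C"): 1.0, ("A2", "W"): 1.0}
disjoint = {("A1", "B"): 1.0, ("A2", "E"): 1.0}
print(cosine_measure(identical, identical))  # 1.0 -> clusters are linked
print(cosine_measure(identical, disjoint))   # 0.0 -> unrelated clusters
```

Cluster relationship analysis would compute this measure between every cluster of the previous result and every cluster of the current one, linking the pairs with the largest values.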

IV. RESULTS

The following table shows that, in terms of precision and recall, DCD is effective at detecting drifting concepts (sliding window size N = 1,000):

Setting   Drifting   Precision   Recall
D1        35.6       0.557       0.873
D2        39.2       0.825       0.992
D3        46         0.816       0.98
D4        44.5       0.443       0.97

Fig. 3: The precision and recall of the DCD

We change the clustering pairs to obtain data sets with drifting concepts and then test the detection accuracy of algorithm DCD on those data sets. The outlier threshold is set to 0.1, the cluster variation threshold is set to 0.1, and the cluster difference threshold is set to 0.5. The number of clusters k, which is the required parameter of the initial clustering step and the reclustering step, is set to the maximum number of clusters in each setting, e.g., k = 10 in D1 and k = 20 in D3. In addition, each synthetic data set is generated by randomly combining 50 clustering results for that setting, and the precision and recall shown in Fig. 3 are the averages of 20 experiments. The precision and recall exceed 80 percent when the size of the sliding window is larger than 2,000. They are a little lower when the size of the sliding window is set to 1,000 because the drifting concepts often cross two windows; we count only one window as a correct hit, and the other window is considered a miss. However, the detection recall is highest when the size of the sliding window is set to 1,000, since drifting concepts are unlikely to be missed when the data set is partitioned in such fine detail. When two example bank datasets synthesized by settings D1 and D2 are used and the clustering results are evaluated on each sliding window, the framework generates a new clustering result whenever a drifting concept is detected and thus responds quickly to the trend of the evolving dataset.

V. CONCLUSION

In this paper we have proposed a framework to perform clustering on categorical time-evolving data. In order to detect the drifting concepts at different sliding windows, we proposed the algorithm DCD to compare the cluster distributions between the last clustering result and the current temporal clustering result. If the results are quite different, the last clustering result is dumped, and reclustering is performed on the current data in the sliding window. In order to observe the relationship between different clustering results, cluster relationship analysis is used to analyze and show the changes between different clustering results. The experimental evaluation shows that performing DCD is faster than clustering the entire data set once, and that DCD provides high-quality clustering results with correctly detected drifting concepts. The results therefore demonstrate that our framework is practical for detecting drifting concepts in time-evolving categorical data.

VI. REFERENCES

[1] D. Barbara, Y. Li, and J. Couto, "COOLCAT: An Entropy-Based Algorithm for Categorical Clustering," Proc. ACM Int'l Conf. Information and Knowledge Management (CIKM), 2002.
[2] F. Cao, M. Ester, W. Qian, and A. Zhou, "Density-Based Clustering over an Evolving Data Stream with Noise," Proc. Sixth SIAM Int'l Conf. Data Mining (SDM), 2006.
[3] H.-L. Chen, K.-T. Chuang, and M.-S. Chen, "Labeling Unclustered Categorical Data into Clusters Based on the Important Attribute Values," Proc. Fifth IEEE Int'l Conf. Data Mining (ICDM), 2005.
[4] O. Nasraoui, M. Soliman, E. Saka, A. Badia, and R. Germain, "A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 2, pp. 202-215, Feb. 2008.
[5] H.-L. Chen, M.-S. Chen, and S.-C. Lin, "Catching the Trend: A Framework for Clustering Concept-Drifting Categorical Data," IEEE Trans. Knowledge and Data Eng., vol. 21, no. 5, May 2009.
[6] Z. Huang and M.K. Ng, "A Fuzzy k-Modes Algorithm for Clustering Categorical Data," IEEE Trans. Fuzzy Systems, 1999.
