
International Journal of Computer Information Systems, Vol. 3, No. 5, 2011

Cluster-Based and Distance-Based Approach for Outlier Detection in Data Set


Ms. S. D. Pachgade
Student, Computer Science & Engineering, Sipnas College of Engg. & Tech., Amravati, Maharashtra, India. sdpachgade@gmail.com

Abstract: Outlier detection is currently a very active area of research in the data mining community. In this paper we propose a combined clustering-based and distance-based approach to capture outliers. We apply the k-means or k-median algorithm, which partitions the data set into a number of chunks or clusters, each containing a set of data points. Once the clusters are formed, the centroid of each cluster is calculated. Points lying near the centroid of a cluster are not probable outlier candidates, so we prune such points from each cluster. Next, a distance-based technique is used to find the distance from the centroid to each candidate outlier. If this distance is greater than some threshold, the point is declared an outlier; otherwise it is treated as a real object. The proposed approach combines the two techniques to find outliers in the data set efficiently. This hybrid approach has a lower computational cost: the proposed algorithm prunes the safe cells and so saves a large number of extra calculations. This paper gives a brief introduction to our project.

Keywords: Outlier, Cluster-based, Distance-based

Ms. S. S. Dhande
Asst. Prof., Computer Science & Engineering, Sipnas College of Engg. & Tech., Amravati, Maharashtra, India. dhande_123@rediffmail.com

I. INTRODUCTION

Anomaly detection has recently become a vital and active research problem in many fields, and it is involved in numerous applications. Most of the existing methods are based on a distance measure. Because of the dynamic nature of incoming data, declaring a point an outlier too early can often lead to a wrong decision. Earlier research on outlier detection is suitable for disk-resident data sets, where the entire data set is available in advance and algorithms can make more than a single pass over it. Outlier detection is a more challenging task when the data is continuously updated and flowing.

Finding outliers in a collection of patterns is a well-known problem in the data mining field. An outlier is a pattern which is dissimilar with respect to the rest of the patterns in the data set. Depending upon the application domain, outliers are of particular interest. In some cases the presence of outliers adversely affects the conclusions drawn from the analysis, and hence they need to be eliminated beforehand. There are varied reasons for outlier generation in the first place. For example, outliers may be generated due to measurement impairments, rare normal events exhibiting entirely different characteristics, deliberate actions, etc. Detecting outliers may lead to the discovery of truly unexpected behavior and help avoid wrong conclusions.

An outlier is a data point that does not conform to the normal points characterizing the data set. Detecting outliers has important applications in data cleaning as well as in the mining of abnormal points for fraud detection, stock market analysis, intrusion detection, marketing, and network sensors. Finding anomalous points among the data points is the basic idea behind finding an outlier. Distance-based techniques use a distance function to relate each pair of objects in the data set. Distance-based definitions, which are computationally efficient [7, 11], represent a useful tool for data analysis [8].

In this work, we introduce a clustering method that identifies candidate outliers, that is, the points which may be outliers (temporary outliers). Next we apply a distance-based method to all candidate outliers to decide whether each point is an outlier or not.

II. OBJECTIVES OF STUDY

The main objective is to find the outliers in a data set using a cluster-based method and a distance-based method. The problem of finding outliers in data has broad applications in areas as diverse as data cleaning, fraud detection, network monitoring, invasive species monitoring, etc. While dozens of techniques have been proposed to solve this problem for static data collections, very simple distance-based outlier detection methods are known to be competitive with or superior to more complex methods [9]. So, we first apply the cluster-based method to reduce the size of the data and then apply the distance-based method to detect the outliers.

First we need a clustering algorithm. We will try different logical approaches to detect outlier data values, such as clustering algorithms like K-means or K-median, which can produce good clustering results and at the same time offer good scalability. Finally, a distance-based technique is used to find the distance from the centroid to each candidate outlier, so that the two approaches together have a lower computational cost. In brief, to avoid pairwise distance calculations, to detect outliers better even if the data set evolves, to free the user from having to provide sensitive parameters, and to mine the data set even with limited memory resources, we propose a clustering-based method, because clustering methods have good space and time complexity [5, 6].

November Issue

Page 48 of 59

ISSN 2229 5208

III. RELATED WORK

Outlier detection (deviation detection, exception mining, novelty detection, etc.) is an important problem that has attracted wide interest and numerous solutions. These solutions can be broadly classified into several major ideas:

Model-Based [2]: An explicit model of the domain is built (e.g., a model of the heart, or of an oil refinery), and objects that do not fit the model are flagged. Disadvantage: model-based methods require the building of a model, which is often an expensive and difficult enterprise requiring the input of a domain expert.

Connectedness [12]: In domains where objects are linked (social networks, biological networks), objects with few links are considered potential anomalies. Disadvantage: connectedness approaches are only defined for data sets with linkage information.

Density-Based [3]: Objects in low-density regions of space are flagged. Disadvantage: density-based models require careful setting of several parameters and have quadratic time complexity. They may also rule out outliers that lie close to non-outlier patterns of low density.

Distance-Based [1]: Given any distance measure, objects whose distances to their nearest neighbors exceed a specific threshold are considered potential anomalies. In contrast to the above, distance-based methods are much more flexible and robust. They are defined for any data type for which we have a distance measure and do not require a detailed understanding of the application domain.
Cluster-based approach [4]: Clustering-based techniques involve a clustering step which partitions the data into groups that contain similar objects. The assumed behavior of outliers is that they either do not belong to any cluster, or belong to very small clusters, or are forced to belong to a cluster where they are very different from the other members. Clustering-based outlier detection techniques have been developed which make use of the fact that outliers do not belong to any cluster, since they are very few and different from the normal instances.

K-Nearest Neighbor Based Approach [13]: K-nearest neighbor based schemes analyse each object with respect to its local neighborhood. The basic idea behind such schemes is that an outlier will have a neighborhood where it stands out, while a normal object will have a neighborhood where all its neighbors are very much like it. The obvious strength of these techniques is that they can work in an unsupervised mode, i.e., they do not assume the availability of class labels.

IV. PROPOSED WORK

1. System architecture

Figure 1: System Architecture

Input Data Set: A data set is an ordered sequence of objects X1, ..., Xn. Such data arises in applications such as fraud detection, network flow monitoring, telecommunications and data management, where data arrival is continuous.

Cluster Based Approach: Clustering is a popular technique used to group similar data points or objects into groups or clusters. Clustering is an important tool for outlier analysis. This technique relies on the key assumption that normal objects belong to large and dense clusters, while outliers form very small clusters.

Distance Based Approach: The distance-based approach was first proposed by Knorr and Ng. These techniques are highly dependent on the parameters provided by the users. Given any distance measure, objects whose distances to their nearest neighbour exceed a specific threshold are considered potential anomalies.

Outlier Detection: Outlier detection is an extremely important task in a wide variety of application domains. It is the task of finding objects that are dissimilar or inconsistent with respect to the remaining data. Outliers are often considered as error or noise and are removed once detected. Examples include skewed data values resulting from measurement error, or erroneous values resulting from data entry mistakes.

2. Architecture of proposed work

It can be divided into 4 steps:
1. Partition the data set into a number of chunks, each containing a set of data points.
2. Over each chunk, apply the clustering method to identify the candidate outliers and the safe region.
3. Apply the distance-based outlier detection algorithm over the clusters with respect to the centroid of each cluster.
4. Give each candidate outlier a chance to survive into the next chunk, carry it forward for an appropriate number of chunks, and then declare it a real outlier or an inlier.
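The chunk-wise bookkeeping in the steps above can be sketched as follows. This is a minimal illustration in Python (the paper plans a MATLAB implementation and gives no code), so every name here is an assumption: `process_chunks`, the `patience` parameter (how many consecutive chunks a point must stay flagged before it is declared a real outlier), and the stand-in detector `far_from_mean`, which merely takes the place of the clustering and distance phases described later.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Point = Tuple[float, ...]


def far_from_mean(points: Sequence[Point], radius: float) -> List[Point]:
    """Hypothetical stand-in detector: flag points farther than `radius`
    (Euclidean distance) from the mean of the working set."""
    if not points:
        return []
    dim = len(points[0])
    mean = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    return [
        p for p in points
        if sum((p[i] - mean[i]) ** 2 for i in range(dim)) ** 0.5 > radius
    ]


def process_chunks(
    chunks: Iterable[Sequence[Point]],
    find_candidates: Callable[[Sequence[Point]], List[Point]],
    patience: int = 2,
) -> Tuple[List[Point], List[Point]]:
    """Return (real outliers, candidates still undecided at the end).

    Assumption of this sketch: a point becomes a real outlier once it has
    been flagged in `patience` consecutive chunks; a carried candidate that
    is not flagged again is treated as an inlier and dropped.
    Identical points share one counter because points are used as dict keys.
    """
    strikes: Dict[Point, int] = {}   # candidate -> consecutive times flagged
    outliers: List[Point] = []
    for chunk in chunks:
        # Carried candidates are re-examined together with the new chunk.
        working = list(strikes) + list(chunk)
        flagged = set(find_candidates(working))
        next_strikes: Dict[Point, int] = {}
        for p in sorted(flagged):
            count = strikes.get(p, 0) + 1
            if count >= patience:
                outliers.append(p)        # survived long enough: real outlier
            else:
                next_strikes[p] = count   # give it another chance
        strikes = next_strikes            # unflagged points are pruned as safe
    return outliers, list(strikes)
```

For example, calling `process_chunks(chunks, lambda pts: far_from_mean(pts, 3.0))` on a sequence of chunks keeps memory bounded by one chunk plus the carried candidates, which is the point of pruning the safe region after each chunk.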


Figure 2: Architecture of Proposed work

It is assumed that the number of outliers in any data set is extremely small compared to the normal data. So it is highly inefficient to apply traditional outlier detection algorithms over the entire data set; for a large data set this can become very expensive and can often lead to wrong decisions in finding the most outstanding outliers. The proposed method only declares such points as candidate outliers and compares them with the next incoming chunk of the data set to make sure that they are real outliers. In the first phase, the cluster-based method is applied: safe data is pruned out, while the candidate cells are kept for further processing. Safe regions are pruned from the current chunk so that space is freed for storing the next chunks. In the second phase, the distance-based strategy is applied over the candidates to identify the outliers efficiently, while the rest of the data is discarded. The proposed algorithm prunes the safe cells and so saves a large number of extra calculations.

3. Research Design

We will use MATLAB tools for implementing our algorithms.

Data File Format: Data is available in .data, .xls (Excel), .txt or .csv file format. This data file is taken as input to find the outliers.

Source of Data collection: Data is collected from the UCI machine learning repository, which provides various types of data sets. These data sets can be used for clustering, classification and regression. It is a repository of databases, domain theories and data generators used by the machine learning community for the empirical analysis of machine learning algorithms.
1. Forest fire: The aim is to predict the burned area of forest fires.
2. Wine quality model: The aim is to model wine quality based on physicochemical tests.
3. Car evaluation model.

4. Techniques Used

4.1) Cluster-based approach: The cluster-based approach is used here to reduce the size of the data set, i.e., it acts as a data reduction step. First, the cluster-based technique is used to form clusters of the data set. Once the clusters are formed, the centroid of each cluster is calculated. The data within a certain radius of the centroid is removed as real data. After removing the real data, the remaining data are the candidate outliers. Candidate outliers are temporary outliers [10]. Figure 3 shows the cluster-based approach.

Figure 3: Cluster-based Approach

Clustering algorithm (K-means) steps: K is the number of clusters, and we assume an initial centroid (centre) for each of these clusters. We can take any random objects as the initial centroids, or the first K objects in sequence can serve as the initial centroids. Clustering is simply the grouping of the data.

Generating clusters: The K-means algorithm repeats the following three steps, iterating until stable (that is, no object moves to a different group):
1. Determine the centroid coordinates.
2. Determine the distance of each object to the centroids.
3. Group the objects based on minimum distance.
In this way we can cluster the entire data set into a number of clusters and calculate the centroid of each cluster.

Find candidate cells: Remove the data within a certain radius of the centroid as real data. After removing the real data, the rest of the data are the candidate outliers.

4.2) Distance-based approach:

Next, a distance-based technique is used to find the distance from the centroid to each candidate outlier. In this work, the candidate outliers are taken from the cluster-based approach, and the distance from the centroid of each cluster to its candidate outliers is calculated. If this distance is greater than some threshold, the point is finally declared an outlier; otherwise it is treated as real data.
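The two phases described in Section 4 (K-means clustering with pruning around each centroid, then thresholding the centroid distance of the remaining candidates) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: it is written in Python rather than the MATLAB tools named in the paper, it uses only the standard library, and the names `k_means`, `detect_outliers`, `prune_radius` and `threshold` are hypothetical. The first K objects are taken as the initial centroids, which is one of the two initialisations the text allows.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points: Sequence[Point], k: int,
            max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Plain K-means; the first k objects serve as the initial centroids."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centroids = [tuple(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Distance of each object to every centroid; group by the minimum.
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:   # stable: no object moved to another group
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:            # keep the old centroid if a cluster empties
                centroids[j] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, labels


def detect_outliers(points: Sequence[Point], k: int,
                    prune_radius: float, threshold: float) -> List[Point]:
    """Phase 1 prunes points near their centroid; phase 2 thresholds the rest."""
    centroids, labels = k_means(points, k)
    # Phase 1: points within prune_radius of their centroid are real data.
    candidates = [
        (p, centroids[lab]) for p, lab in zip(points, labels)
        if euclidean(p, centroids[lab]) > prune_radius
    ]
    # Phase 2: a candidate is an outlier if its centroid distance exceeds
    # the threshold; otherwise it is kept as real data.
    return [p for p, c in candidates if euclidean(p, c) > threshold]
```

With seven 2-D points such as `[(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.9, 5.2), (9.0, 9.0)]`, `k=2`, `prune_radius=2.0` and `threshold=3.0`, only `(9.0, 9.0)` survives both phases. Note that in this simplified form both phases measure the same centroid distance, so whenever `threshold >= prune_radius` the second test subsumes the first; the phases are kept separate only to mirror the description in the text, where pruning frees memory before the next chunk arrives.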


In this approach, the distance of each point from its neighbours is calculated over the n points of the original data set. If the neighbouring points are relatively close, then the point is considered normal. If the neighbouring points are far away, then the point is considered an outlier. Euclidean distance is used as the distance function. Figure 4 shows the distance-based approach.

Figure 4: Distance-based Approach

Distance-based Algorithm steps:
1. The centroid of each cluster is calculated.
2. The distance of each point (candidate outlier) from the centroid of its cluster is calculated.
3. If Distance > Threshold, the point is finally declared an outlier; otherwise it is treated as real data.

V. IMPLICATIONS OF WORK

An efficient method for outlier detection is proposed in this paper. First we identify the points which are not probable outlier candidates by using the radius of each cluster, and we remove those points from the data set. Because the size of the data set is reduced, the computation time is reduced considerably. The proposed algorithm prunes the safe cells and so saves a large number of extra calculations. Next we measure the distance of each candidate outlier from the centroid of its cluster. The proposed approach combines the two techniques to find outliers in the data set efficiently. This hybrid approach has a lower computational cost, that is, lower time and space complexity, than the purely distance-based approach. We pay more attention to the points detected as outliers and give them a chance of survival in the next incoming chunk of data, rather than declaring them outliers by observing only the current chunk. We keep only the most suitable candidate outliers. Our approach tries to reduce time and space complexity by using a cluster-based method such as k-means or k-median and then applying a distance-based method to find the outliers. We will also try to make the method applicable to varying data sets.

VI. CONCLUSION

This paper aims to detect outliers, that is, to find objects that are dissimilar or inconsistent with respect to the remaining data. The clustering algorithm is performed first, and small clusters are then determined and considered as candidate outliers. The remaining outliers are then determined based on distance measures. In our view, the clustering method acts as a data reduction step, and the distance-based method then calculates the distance of each candidate from its centroid.

REFERENCES

[1] F. Angiulli and F. Fassetti, "Detecting Distance-based Outliers in Streams of Data," In Proceedings of CIKM'07, pages 811-820, November 6-10, 2007.
[2] F. J. Anscombe and I. Guttman, "Rejection of Outliers," Technometrics, vol. 2, pages 123-147, May 1960.
[3] M. Breunig, H.-P. Kriegel, R. Ng, and J. Sander, "OPTICS-OF: Identifying Local Outliers," In Proceedings of PKDD'99, pages 262-270, September 15-18, 1999.
[4] Parneeta Dhaliwal, MPS Bhatia and Priti Bansal, "A Cluster-based Approach for Outlier Detection in Dynamic Data Streams (KORM: k-median OutlieR Miner)," Journal of Computing, Volume 2, Issue 2, February 2010, ISSN: 2151-9617, pages 74-80.
[5] Manzoor Elahi, Kun Li, Wasif Nisar, Xinjie Lv, Hongan Wang, "Efficient Clustering-Based Outlier Detection Algorithm for Dynamic Data Stream," In Proc. of Fifth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2008), ISBN: 978-0-7695-3305-6/08, pages 298-304.
[6] Manzoor Elahi, Xinjie Lv, Wasif Nisar, Imran Ali Khan, Ying Qiao, Hongan Wang, "DB-Outlier Detection Algorithm using Divide and Conquer approach over Dynamic DataStream," In International Conference on Computer Science and Software Engineering, ISBN: 978-0-7695-3336-0/08, pages 438-443.
[7] E. M. Knorr and R. T. Ng, "Algorithms for mining distance based outliers in large datasets," In Proc. 24th Int. Conf. Very Large Data Bases, VLDB, pages 392-403, 1998.
[8] E. M. Knorr and R. T. Ng, "Finding intensional knowledge of distance-based outliers," In VLDB '99: Proceedings of the 25th International Conference on Very Large Data Bases, pages 211-222, 1999.
[9] Vit Niennattrakul, Eamonn Keogh, Chotirat Ann Ratanamahatana, "Data Editing Techniques to Allow the Application of Distance-Based Outlier Detection to Streams," IEEE International Conference on Data Mining (ICDM) 2010, ISBN: 978-1-4244-9131-5, pages 947-952.
[10] Rajendra Pamula, Jatindra Kumar Deka, Sukumar Nandi, "An Outlier Detection Method based on Clustering," Second International Conference on Emerging Applications of Information Technology, ISBN: 978-0-7695-4329-1/11, pages 253-256.
[11] S. Ramaswamy, R. Rastogi, and K. Shim, "Efficient algorithms for mining outliers from large data sets," pages 427-438, 2000.
[12] J. Tang, Z. Chen, A. W.-C. Fu and D. W.-L. Cheung, "Enhancing Effectiveness of Outlier Detections for Low Density Patterns," In Proceedings of PAKDD'02, pages 535-548, May 6-8, 2002.
[13] Peng Yang, Biao Huang, "KNN Based Outlier Detection Algorithm in Large Dataset," International Workshop on Education Technology and Training, ISBN: 978-0-7695-3563-0, pages 611-613, 2008.



AUTHORS PROFILE

Shraddha D. Pachgade is currently pursuing an M.E. (Computer Science & Engg.) at S.G.B. Amravati University. She received her B.E. (Computer Science & Engg.) from S.G.B. Amravati University. She has published one paper at national and international level.

Shital S. Dhande is currently working as an assistant professor at Sipna College of Engineering, Amravati. She received her B.E. (Computer Science & Engg.) and M.E. (Computer Science & Engg.) from S.G.B. Amravati University. She is currently pursuing a Ph.D. in the Faculty of Engineering & Technology, in the area of OO database management systems, at S.G.B. Amravati University. She has published many papers at national as well as international level. She is a member of ISTE, IE and IETE, India.
