Minimizing Spurious Patterns Using Association Rule Mining

Ruchi Goel, M.Tech (CSE), Department of Computer Science, Jamia Hamdard University, New Delhi, India
Dr. Parul Agarwal, Assistant Professor, Department of Computer Science, Jamia Hamdard University, New Delhi, India

ABSTRACT
Most clustering algorithms extract patterns that are of little interest. Such patterns consist of data items that usually belong to widely different support levels. Data items at widely different support levels have only a weak association between them, so the resulting patterns are of little interest. The reason behind this problem is that the existing algorithms have no knowledge of the co-occurrence relationships between data items. They cannot even incorporate such knowledge, because it would conflict with the goal of the algorithm. I propose a solution to this problem: extracting highly correlated, interesting patterns, known as maximal intensive patterns, using a confidence measure. In this framework, the data mining operation is performed not directly on the data set but on the highly correlated intensive patterns. This strategy also minimizes the effect of cross-support patterns. A minimum threshold value is used to regulate the intensive patterns.

Keywords: Asymmetric data set, Co-occurrence relation, Intensive patterns, Minimum threshold, Spurious patterns.

I. INTRODUCTION
Data sets normally consist of asymmetric data items. For example, a departmental store may carry a wide range of commodities of the same price whose significance nevertheless varies from one commodity to another: some belong to the same support level while others belong to different support levels. Conventional clustering algorithms are ineffective at clustering such asymmetric data sets.
At a low threshold value, conventional algorithms produce a large number of spurious patterns consisting of weakly correlated data items. This calls for a measure that performs well even at low support values and removes spurious patterns, leaving only intensive patterns. For example, a shopping mall may stock a large range of commodities whose prices vary significantly from one commodity to the next, while some commodities share the same price level. In other words, the commodities in a shopping mall span a wide range of support levels, with only a few at the same level. On such data sets, conventional clustering algorithms are not effective for mining associated patterns. Most clustering algorithms defined so far rely purely on a support-based pruning strategy, and this strategy proves ineffective on highly asymmetric data sets for the following two reasons:

1. If the minimum threshold is very low, the number of spurious patterns among the extracted patterns increases. Such spurious patterns contain data items belonging to widely different support levels; they are called cross-support patterns, and the data items they contain are weakly correlated with most of the other items in the pattern. For example, {chips, shampoo} is a cross-support pattern when chips has high support while shampoo has quite low support compared to chips. Such data items are weakly correlated, so the patterns containing them are considered spurious. Besides this, a low minimum threshold also increases the computational and memory requirements substantially.

2. If, on the other hand, the minimum threshold is very high, many interesting patterns whose support falls below the threshold may be missed, for example {chips, cold drinks}.
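The weak correlation behind a cross-support pattern such as {chips, shampoo} can be made concrete with a support-ratio check, a test the paper later formalizes as the cross-support property. The sketch below is minimal, and the numeric support values are hypothetical, chosen only to echo the example:

```python
def is_cross_support(items, supp, tc):
    """A pattern is cross-support w.r.t. threshold tc if the ratio of
    its smallest to largest individual item support falls below tc.
    Only single-item supports are needed, not the pattern's support."""
    s = sorted(supp[x] for x in items)
    return s[0] / s[-1] < tc

# Hypothetical single-item supports for the example:
item_supp = {"chips": 0.5, "shampoo": 0.05}
```

With these assumed values, `is_cross_support(["chips", "shampoo"], item_supp, 0.6)` is true (the ratio is 0.1), so the pattern is flagged as spurious without ever computing its joint support.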
II. OBJECTIVE
I am going to reduce spurious patterns in an asymmetric data set using association rules. Minimizing such spurious patterns yields optimized patterns on which decision-making can be performed. So far, clustering algorithms cannot discover co-occurrence relationships among activities performed by a specific group or object, or between data items. They simply use their own notions and reduce the size of the data by removing items that provide little power to classify instances.

International Journal of Computer Trends and Technology (IJCTT), Volume 10, Number 4, Apr 2014, ISSN: 2231-2803, http://www.ijcttjournal.org
So decision-making on clusters of asymmetric data items becomes tedious. My solution is to extract useful patterns at a low support value and remove spurious patterns during mining. This generates patterns whose data items have a strong co-occurrence relationship.

III. APPROACH FOR MINIMIZING SPURIOUS PATTERNS
Clustering is the process of assigning objects to clusters such that the objects or data items within a cluster have maximum similarity to each other, thus providing patterns from which knowledge can be discovered for further decision-making. If some data items move from one cluster to another because of the clustering algorithm being used, gathering knowledge from such clusters becomes very tedious. Moreover, it is much easier to gather knowledge from well-understood patterns than to interpret the data items directly. Intensive clustering addresses this problem: patterns are preserved so that the data items belonging to a particular pattern always belong to a particular cluster. Intensive patterns [3] contain data items that have a high similarity or co-occurrence relation with each other. A high co-occurrence relation means that the presence of any data item in an intensive pattern strongly implies the presence of every other data item in the same pattern.

The intensive confidence (i-confidence) of an itemset D = {d1, d2, ..., dm}, denoted i-conf(D), is a measure that reflects the overall co-occurrence relation among the items of the itemset. It is defined as

i-conf(D) = min{conf(d1 -> d2, ..., dm), conf(d2 -> d1, d3, ..., dm), ..., conf(dm -> d1, d2, ..., dm-1)},

where conf is the conventional definition of association-rule confidence.
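This definition can be sketched directly in Python. The sketch below is minimal; the representation of supports as a dict keyed by frozensets is my assumption, not part of the paper, and the numeric supports are those of the worked example that follows:

```python
def i_conf(itemset, supp):
    """i-conf(D) = min over d in D of conf(d -> D \\ {d}).

    Since conf(d -> rest) = supp(D) / supp({d}), the minimum is
    attained at the item with the largest individual support.
    `supp` maps frozensets of items to their support values."""
    s_full = supp[frozenset(itemset)]
    return min(s_full / supp[frozenset([d])] for d in itemset)

supports = {
    frozenset(["desktop"]): 0.1,
    frozenset(["printer"]): 0.1,
    frozenset(["antivirus"]): 0.06,
    frozenset(["desktop", "printer", "antivirus"]): 0.06,
}
```

With these supports, `i_conf(["desktop", "printer", "antivirus"], supports)` evaluates to min{0.6, 0.6, 1} = 0.6, matching the computation worked out below by hand.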
The scope of i-confidence can be understood with the following example. Consider an itemset D = {desktop, printer, antivirus}. Assume supp({desktop}) = 0.1, supp({printer}) = 0.1, supp({antivirus}) = 0.06, and supp({desktop, printer, antivirus}) = 0.06, where supp is the support of an itemset. Then

conf(desktop -> printer, antivirus) = supp({desktop, printer, antivirus}) / supp({desktop}) = 0.6
conf(printer -> desktop, antivirus) = supp({desktop, printer, antivirus}) / supp({printer}) = 0.6
conf(antivirus -> desktop, printer) = supp({desktop, printer, antivirus}) / supp({antivirus}) = 1

Hence i-conf(D) = min{0.6, 0.6, 1} = 0.6.

A candidate pattern D is an intensive pattern if and only if i-conf(D) >= Tc, where Tc is the minimum threshold confidence provided by the user. Further, if any intensive pattern has subsets that are themselves in the set of intensive patterns, those subset patterns are removed from the set. The reason for this is the all-confidence property [2].

Properties of the I-confidence Measure
The i-confidence measure has four important properties: the anti-monotone property, the cross-support property, the strong co-occurrence relation property, and the all-confidence property.

1. Anti-Monotone
The i-confidence measure possesses the anti-monotone property. This property states that if the i-confidence of a pattern P is greater than the threshold value Tc, then the i-confidence of every subset of P is also greater than Tc. How does the i-confidence measure use this property?
This can be explained with the following example. Suppose supp({desktop}) = 0.3, supp({printer}) = 0.6, supp({desktop, printer}) = 0.3, and the minimum i-confidence threshold is 0.6. Then the i-confidence of the candidate pattern {desktop, printer} is

supp({desktop, printer}) / max{supp({desktop}), supp({printer})} = 0.3/0.6 = 0.5,

which is less than the minimum i-confidence of 0.6. Thus the candidate pattern {desktop, printer} is not an intensive pattern. Moreover, all candidate patterns having {desktop, printer} as a subset are pruned; for example, {desktop, printer, TV} is not an intensive pattern. Note that the pruning here is done on the basis of the i-confidence threshold: if the threshold is reduced to 0.45, then {desktop, printer} becomes an intensive pattern.
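The pruning step above can be sketched as follows. This is a sketch under assumed representations (supports as a dict keyed by frozensets); processing candidates smallest-first means a failed pattern prunes all of its supersets without their i-confidence, or even their support, ever being computed:

```python
def i_conf(c, supp):
    # all-confidence form: supp(D) / max individual item support
    return supp[frozenset(c)] / max(supp[frozenset([x])] for x in c)

def prune_candidates(candidates, supp, tc):
    """Keep candidates whose i-confidence meets tc. By the
    anti-monotone property, every superset of a failed candidate
    is discarded immediately, without any support lookup."""
    failed, kept = [], []
    for c in sorted(map(frozenset, candidates), key=len):
        if any(f <= c for f in failed):
            failed.append(c)  # superset of an already-failed pattern
            continue
        (kept if i_conf(c, supp) >= tc else failed).append(c)
    return kept
```

With the supports of the example (desktop 0.3, printer 0.6, {desktop, printer} 0.3), a threshold of 0.6 rejects {desktop, printer}, and {desktop, printer, TV} is then pruned without its support ever being looked up; at a threshold of 0.45 the pair survives.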
2. Strong Co-occurrence Relation
The i-confidence measure also possesses the property of a strong co-occurrence relation among the items of a data set. It ensures that all the data items contained in a pattern have a strong co-occurrence relation, that is, a strong association, with each other. This can be understood with the following consideration: suppose the i-confidence of an itemset D is 90%. Then if any data item of D occurs in a transaction, there is a 90% chance that the remaining data items of D also occur in the same transaction.

3. Cross-Support Patterns
I-confidence helps minimize cross-support patterns, which are in fact the spurious patterns. It is always difficult to choose the right threshold value for mining a large collection of data. If the threshold is set very high, many interesting patterns may be missed. Conversely, if the threshold is set very low, finding the interesting associated patterns is also hard, for two reasons: first, the computational and memory requirements of existing analysis algorithms increase considerably, and second, the number of extracted patterns grows substantially. I-confidence helps eliminate patterns consisting of uninteresting data items, and it involves no extra computational cost, since it depends only on the support values of the individual data items and their combinations. This can be stated as follows. Suppose Tc is the given threshold value and P is a pattern such that P = {p1, p2, ..., pn}.
P is called a cross-support pattern with respect to Tc if it contains two data items, say p1 and p2, such that supp({p1})/supp({p2}) < Tc, where 0 < Tc < 1.

4. All-Confidence
Omiecinski proposed the concept of all-confidence [2] as an alternative to support. All-confidence represents the minimum confidence among all the association rules that can be extracted from an itemset, and it possesses the desirable anti-monotone property. The all-confidence measure for an itemset P = {p1, p2, ..., pm} is given by

all-conf(P) = min{conf(A -> B) | A, B subset of P, A union B = P, A intersect B = empty}
            = supp({p1, p2, ..., pm}) / max_{1<=k<=m} supp({pk}).

IV. ALGORITHM FOR MINIMIZING SPURIOUS PATTERNS

Input:
  I: item set stored in the database, containing the list of transactions with their items and corresponding supports
  Min_threshold: minimum threshold value of i-confidence (provided by the user)

Variables:
  Intensive: intensive pattern set
  Max_Intensive: maximal intensive pattern set
  Intensive_Pattern_Evaluation(): function for evaluating the intensive pattern set
  Max_Intensive_Pattern_Evaluation(): function for evaluating the maximal intensive pattern set

Method I: Extracting Maximal Intensive Patterns

Intensive = Intensive_Pattern_Evaluation(I, Min_threshold)
{
  1. Access the support value of each element of I.
  2. Create candidate patterns from items belonging to different support levels.
  3. Prune candidate patterns on the basis of the anti-monotone property.
  4. Prune candidate patterns on the basis of the cross-support property.
  5. Return the intensive patterns (Intensive) with i-confidence >= Min_threshold.
}

Max_Intensive = Max_Intensive_Pattern_Evaluation(Intensive)
{
  1. Find an intensive pattern X such that X is a subset of Y, with both X and Y in Intensive.
  2. Intensive = Intensive - X.
  3. Repeat steps 1-2 until there is no X in Intensive that is a subset of some Y in Intensive.
  4. Set Max_Intensive = Intensive.
}

V. RESULT AND DISCUSSION

Performance comparison of maximal intensive patterns and hyperclique patterns

To compare the performance of maximal intensive patterns and hyperclique patterns, both algorithms were run on the same data set of transactions with the same minimum threshold value. The result is that the number of interesting patterns generated as maximal intensive patterns is smaller than the number generated by the hyperclique pattern algorithm [4]. The reason is that the maximal intensive pattern algorithm uses the anti-monotone and all-confidence properties together with the strong co-occurrence and cross-support properties. Consider Table 1, which lists various transactions with the items involved in each.

Transaction_Id  Items
 1   Bread, Butter
 2   Butter
 3   Coffee, Butter, Bread
 4   Coffee, Milk
 5   Bread, Butter, Milk
 6   Coffee
 7   Bread, Cookie
 8   Coffee, Pickle
 9   Bread, Sugar
10   Ketchup, Juice, Coffee, Egg
11   Bread, Juice, Pickle
12   Milk
13   Milk, Coffee, Sugar
14   Cookie, Chocolate
15   Chocolate, Milk
16   Biscuit, Milk
17   Bread, Biscuit, Milk
18   Milk, Coffee, Sugar
19   Cookie, Chocolate
20   Milk
Table 1: List of transactions with the items involved in each transaction.

Table 2 shows the interesting patterns generated as hyperclique patterns and as maximal intensive patterns. The minimum threshold confidence taken for each measure is 0.02. Sixteen interesting patterns are generated as hyperclique patterns, while eleven are generated as maximal intensive patterns. Note that patterns 6 and 13 among the hyperclique patterns are identical: when generating hyperclique patterns, the only requirement is that the h-confidence of a pattern exceed the minimum threshold confidence, regardless of duplicate candidate patterns. Among the maximal intensive patterns, on the other hand, no pattern has any of its subsets present in the list; this is due to the all-confidence property.

No.  Hyperclique Patterns             Maximal Intensive Patterns
 1   {ketchup, juice, coffee, egg}    {ketchup, juice, coffee, egg}
 2   {bread, juice, pickle}           {bread, juice, pickle}
 3   {chocolate, milk}                {chocolate, milk}
 4   {bread, biscuit, milk}           {bread, biscuit, milk}
 5   {milk, coffee, sugar}            {milk, coffee, sugar}
 6   {cookie, chocolate}              {cookie, chocolate}
 7   {coffee, butter, bread}          {coffee, butter, bread}
 8   {bread, butter, milk}            {bread, butter, milk}
 9   {bread, cookie}                  {bread, cookie}
10   {coffee, pickle}                 {coffee, pickle}
11   {milk, bread, sugar}             {milk, bread, sugar}
12   {bread, butter}
13   {cookie, chocolate}
14   {biscuit, milk}
15   {coffee, milk}
16   {milk, coffee, sugar}

Table 2: Interesting patterns generated as hyperclique patterns and as maximal intensive patterns.
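The subset-removal step that distinguishes the two columns of Table 2 can be checked directly. The sketch below transcribes the hyperclique column from the table; the helper name `maximal` is mine:

```python
def maximal(patterns):
    """Deduplicate, then keep only patterns that are not strictly
    contained in another pattern of the set (the second phase of
    Method I: repeated removal of subset patterns)."""
    ps = {frozenset(p) for p in patterns}
    return {p for p in ps if not any(p < q for q in ps)}

hyperclique = [
    {"ketchup", "juice", "coffee", "egg"}, {"bread", "juice", "pickle"},
    {"chocolate", "milk"}, {"bread", "biscuit", "milk"},
    {"milk", "coffee", "sugar"}, {"cookie", "chocolate"},
    {"coffee", "butter", "bread"}, {"bread", "butter", "milk"},
    {"bread", "cookie"}, {"coffee", "pickle"}, {"milk", "bread", "sugar"},
    {"bread", "butter"}, {"cookie", "chocolate"}, {"biscuit", "milk"},
    {"coffee", "milk"}, {"milk", "coffee", "sugar"},
]
```

Applying `maximal` to the 16 hyperclique patterns removes the two duplicates and the three subsets {bread, butter}, {biscuit, milk}, and {coffee, milk}, leaving exactly the 11 maximal intensive patterns of Table 2.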
VI. CONCLUSION
I conclude that this algorithm is able to reduce the spurious patterns and, on the basis of the properties described above, generate maximal intensive patterns having a high co-occurrence relation among their items, given the threshold value supplied by the user. On these intensive patterns, the clustering process becomes far more efficient than with the existing mining algorithms.

REFERENCES
[1] R. A. Fisher, "The Use of Multiple Measurements in Taxonomic Problems," Annals of Eugenics, vol. 7, 1936, pp. 179-188.
[2] E. R. Omiecinski, "Alternative interest measures for mining associations in databases," IEEE Transactions on Knowledge and Data Engineering, 15(1):57-69, Jan/Feb 2003.
[3] S. Z. A. Shah, "Preceding Clustering by Pattern Preservation," VSRD-IJCSIT, vol. 2 (8), 2012.
[4] H. Xiong, P. Tan, and V. Kumar, "Mining hyperclique patterns with confidence pruning," Technical Report 03-006, Computer Science, Univ. of Minnesota, Jan 2003.