
Association Rule Mining from Given Database Transactions

1. Introduction
This project has not been implemented on any government campus before. We propose it for finding association rules in a given database, that is, for discovering relations between items in the database. For example, an agricultural database may have attributes such as: a) soil type, b) crop type, c) fertilizer used, d) location, e) crop production (quantity). We can then look for relations between soil type and crop type, soil type and fertilizer used, soil type and crop production, crop type and fertilizer used, crop type and location, fertilizer used and location, fertilizer used and crop production, and so on. By relating attribute values we can measure how strong the relation between them is. This project falls under the category of Data Mining, in which we perform Association Rule Mining. To understand this clearly, please have a look at the following subparts.

1.1 Data Mining

Originally, "data mining" was a statistician's term for overusing data to draw invalid inferences; today it denotes the discovery of useful summaries of data. Data Mining is a process that discovers knowledge or hidden patterns in large databases and is known as one of the core processes of Knowledge Discovery in Databases (KDD). It is the process that results in the discovery of new patterns in large data sets, drawing on methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. Its principle is picking out relevant information from data, and it is commonly used by business intelligence organizations and financial analysts to extract useful information from large data sets or databases. The goal of the technique is to find accurate patterns that were previously unknown to us; the overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.

Organizations such as retail stores, hospitals, banks, and insurance companies currently use mining techniques. Data Mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. It enables these companies to determine relationships among "internal" factors such as price, product positioning, or staff skills, and "external" factors such as economic indicators, competition, and customer demographics.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). Data mining involves six common classes of tasks:

o Anomaly detection (outlier/change/deviation detection): the identification of unusual data records that might be interesting, or data errors that require further investigation.
o Association rule learning (dependency modeling): searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
o Clustering: the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
o Classification: the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
o Regression: attempts to find a function which models the data with the least error.
o Summarization: provides a more compact representation of the data set, including visualization and report generation.

Data mining brings many benefits to businesses, society, governments, and individuals. However, privacy, security, and misuse of information are serious problems if they are not addressed and resolved properly.

1.2 Association Rule Mining

Association Rule Mining is a data mining function that discovers the probability of co-occurrence of items in a collection. It is one of the most important and well-researched techniques of data mining and was first introduced by Rakesh Agrawal. It aims to extract interesting correlations, frequent patterns, associations, or causal structures among sets of items in transaction databases or other data repositories.

The formal statement of the association rule mining problem was first given by Agrawal. Let I = {I1, I2, ..., Im} be a set of m distinct attributes (items) and let D = {T1, T2, ..., TN} be a database of transactions, where each transaction T ⊆ I. Given two item-sets X and Y such that X ⊆ T and Y ⊆ T, an association rule X ⇒ Y holds, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. X is called the antecedent and Y the consequent; the rule means X implies Y.

There are two important basic measures for association rules, and they are the two basic parameters of Association Rule Mining (ARM): support (s) and confidence (c). Since the database is large and users are concerned only with frequently purchased items, thresholds of support and confidence are usually predefined by users to drop rules that are not interesting or useful. The two thresholds are called minimal support and minimal confidence, respectively. Support(s) of an association rule is defined as the percentage/fraction of transactions in D that contain X ∪ Y. Support(s) is calculated by the following formula:

Support(X ⇒ Y) = (number of transactions in D containing X ∪ Y) / (total number of transactions in D)        Eq no.(1)

Confidence of an association rule is defined as the percentage/fraction of the transactions in D containing X that also contain Y. The IF component of a rule is called the antecedent and the THEN component is called the consequent. Confidence is calculated by dividing the probability of the items occurring together by the probability of occurrence of the antecedent, and it is a measure of the strength of an association rule. Confidence(c) is calculated by the following formula:

Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X)        Eq no.(2)

The goal of association rule mining is to find all association rules in a given database that satisfy the predefined thresholds supmin and confmin, i.e. the universal set S of all valid association rules. The problem is usually decomposed into two sub-problems. The first is to find those item-sets whose occurrence (support) exceeds the predefined threshold supmin in the database; these item-sets are called frequent or large item-sets. The second is to generate association rules from those large item-sets under the constraint of confmin.
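As a concrete illustration of Eq no.(1) and Eq no.(2), here is a minimal Java sketch that computes support and confidence for one candidate rule over an in-memory transaction list and applies the supmin/confmin thresholds. The class name, method names, and the toy agricultural values are our own inventions for illustration, not part of the proposed system.

import java.util.List;
import java.util.Set;

// Minimal sketch: computes Eq no.(1) and Eq no.(2) for one candidate
// rule X => Y over a small in-memory transaction list.
public class RuleMeasures {

    // Support: fraction of transactions in d containing the whole item-set.
    static double support(List<Set<String>> d, Set<String> itemSet) {
        long hits = d.stream().filter(t -> t.containsAll(itemSet)).count();
        return (double) hits / d.size();
    }

    // Confidence(X => Y) = Support(X ∪ Y) / Support(X).
    static double confidence(List<Set<String>> d, Set<String> x, Set<String> y) {
        Set<String> union = new java.util.HashSet<>(x);
        union.addAll(y);
        return support(d, union) / support(d, x);
    }

    public static void main(String[] args) {
        // Toy agricultural transactions (invented values for illustration).
        List<Set<String>> d = List.of(
                Set.of("black-soil", "cotton", "urea"),
                Set.of("black-soil", "cotton", "dap"),
                Set.of("red-soil", "millet", "urea"),
                Set.of("black-soil", "wheat", "urea"));

        double s = support(d, Set.of("black-soil", "cotton"));
        double c = confidence(d, Set.of("black-soil"), Set.of("cotton"));

        // Keep the rule only if it clears the predefined thresholds.
        double supMin = 0.4, confMin = 0.6;
        if (s >= supMin && c >= confMin)
            System.out.printf("black-soil => cotton (s=%.2f, c=%.2f)%n", s, c);
    }
}

For this toy data the rule black-soil ⇒ cotton has support 0.50 and confidence 0.67, so it survives both thresholds and would be reported.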

2. Algorithm Used: Apriori Algorithm


The Apriori algorithm generates the candidate item-sets to be counted in a pass by using only the item-sets found large in the previous pass, without considering the transactions in the database. The basic intuition is that any subset of a large item-set must be large. Therefore, the candidate item-sets having k items can be generated by joining large item-sets having k-1 items, and deleting those that contain any subset that is not large. This procedure results in the generation of a much smaller number of candidate item-sets. Algorithm 1 below is the Apriori algorithm. The first pass of the algorithm simply counts item occurrences to determine the large 1-item-sets. A subsequent pass, say pass k, consists of two phases. First, the large item-sets Lk-1 found in the (k-1)th pass are used to generate the candidate item-sets Ck, using the apriori-gen function described in Section 2.1. Next, the database is scanned and the support of the candidates in Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck that are contained in a given transaction t. Section 2.2 describes the subset function used for this purpose.
Algorithm Apriori
    L1 = {large 1-item-sets};
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);      // New candidates
        forall transactions t ∈ D do begin
            Ct = subset(Ck, t);      // Candidates contained in t
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count ≥ minsup};
    end
    Answer = ∪k Lk;

Algorithm 1: Apriori Algorithm
2.1 Apriori Candidate Generation

The apriori-gen function takes as argument Lk-1, the set of all large (k-1)-item-sets. It returns a superset of the set of all large k-item-sets. The function works as follows. First, in the join step, we join Lk-1 with Lk-1. Next, in the prune step, we delete all item-sets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1.
Algorithm Apriori-gen(Lk-1)
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1;
    forall item-sets c ∈ Ck do
        forall (k-1)-subsets s of c do
            if (s ∉ Lk-1) then
                delete c from Ck;

Algorithm 2: Apriori-gen Function

2.2 Subset Function

Candidate item-sets Ck are stored in a hash-tree. A node of the hash-tree contains either a list of item-sets (a leaf node) or a hash table (an interior node). In an interior node, each bucket of the hash table points to another node. The root of the hash-tree is defined to be at depth 1, and an interior node at depth d points to nodes at depth d+1. Item-sets are stored in the leaves. When we add an item-set c, we start from the root and go down the tree until we reach a leaf. At an interior node at depth d, we decide which branch to follow by applying a hash function to the dth item of the item-set. All nodes are initially created as leaf nodes; when the number of item-sets in a leaf node exceeds a specified threshold, the leaf node is converted to an interior node.

Starting from the root node, the subset function finds all the candidates contained in a transaction t as follows. If we are at a leaf, we find which of the item-sets in the leaf are contained in t and add references to them to the answer set. If we are at an interior node that we reached by hashing the item i, we hash on each item that comes after i in t and recursively apply this procedure to the node in the corresponding bucket. For the root node, we hash on every item in t.
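To make Algorithms 1 and 2 concrete, the following is a minimal in-memory Java sketch; the class and method names (AprioriSketch, aprioriGen, run) and the toy basket data are our own. For clarity it tests candidate containment with a direct subset check rather than the hash-tree described above, which changes only the running time, not the result.

import java.util.*;

// Minimal in-memory sketch of Algorithm 1 (main loop) and Algorithm 2
// (candidate generation). Item-sets are kept as sorted lists so the
// join step can compare prefixes.
public class AprioriSketch {

    // apriori-gen: join step + prune step (Algorithm 2).
    static Set<List<String>> aprioriGen(Set<List<String>> prevL, int k) {
        Set<List<String>> ck = new HashSet<>();
        for (List<String> p : prevL)
            for (List<String> q : prevL) {
                // Join: first k-2 items equal, last item of p before last of q.
                if (!p.subList(0, k - 2).equals(q.subList(0, k - 2))) continue;
                if (p.get(k - 2).compareTo(q.get(k - 2)) >= 0) continue;
                List<String> c = new ArrayList<>(p);
                c.add(q.get(k - 2));
                // Prune: drop c if some (k-1)-subset of c is not large.
                boolean allLarge = true;
                for (int i = 0; i < c.size() && allLarge; i++) {
                    List<String> s = new ArrayList<>(c);
                    s.remove(i);
                    if (!prevL.contains(s)) allLarge = false;
                }
                if (allLarge) ck.add(c);
            }
        return ck;
    }

    // Algorithm 1: returns the union of all large item-sets Lk.
    static List<List<String>> run(List<Set<String>> d, int minSup) {
        // Pass 1: count single items to get L1.
        Map<List<String>, Integer> count = new HashMap<>();
        for (Set<String> t : d)
            for (String item : t)
                count.merge(List.of(item), 1, Integer::sum);
        Set<List<String>> lk = keep(count, minSup);

        List<List<String>> answer = new ArrayList<>();
        for (int k = 2; !lk.isEmpty(); k++) {
            answer.addAll(lk);
            Set<List<String>> ck = aprioriGen(lk, k);   // new candidates
            Map<List<String>, Integer> cnt = new HashMap<>();
            for (Set<String> t : d)                     // one scan of the database
                for (List<String> c : ck)
                    if (t.containsAll(c))               // c is contained in t
                        cnt.merge(c, 1, Integer::sum);
            lk = keep(cnt, minSup);                     // Lk = {c in Ck | count >= minsup}
        }
        return answer;
    }

    // Keep only the item-sets whose support count reaches minSup.
    static Set<List<String>> keep(Map<List<String>, Integer> cnt, int minSup) {
        Set<List<String>> large = new HashSet<>();
        for (Map.Entry<List<String>, Integer> e : cnt.entrySet())
            if (e.getValue() >= minSup) large.add(e.getKey());
        return large;
    }

    public static void main(String[] args) {
        List<Set<String>> d = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter", "milk"),
                Set.of("butter", "milk"),
                Set.of("bread", "butter"));
        System.out.println(run(d, 2)); // item-sets occurring in >= 2 transactions
    }
}

On this toy basket data the sketch finds six large item-sets (the three single items and the three pairs); the candidate triple {bread, butter, milk} survives the prune step but occurs in only one transaction, so it is dropped in the counting phase.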

3. Tools Required for Proposed Work


Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data preprocessing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. Weka is named after a flightless bird found in New Zealand and is open source software issued under the GNU General Public License. It has 3 modes of operation:
1. GUI
2. Command Line
3. Java API
The main features of Weka include:
o 49 data preprocessing tools
o 76 classification/regression algorithms
o 8 clustering algorithms
o 3 algorithms for finding association rules
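As an illustration of the Java API mode, here is a minimal sketch of running Weka's Apriori associator from Java code. The file crops.arff is a hypothetical ARFF dataset of nominal attributes (soil type, crop type, fertilizer used, and so on), and the threshold values are only illustrative; Weka's Apriori expects nominal attributes, which suits such categorical fields.

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Minimal sketch: run Weka's Apriori associator on a hypothetical
// nominal dataset "crops.arff" and print the discovered rules.
public class WekaAssociationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("crops.arff");

        Apriori apriori = new Apriori();
        apriori.setLowerBoundMinSupport(0.1); // minimal support (fraction)
        apriori.setMinMetric(0.9);            // minimal confidence
        apriori.setNumRules(10);              // report the 10 best rules

        apriori.buildAssociations(data);
        System.out.println(apriori);          // prints the discovered rules
    }
}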

