
2014 International Conference on Intelligent Computing Applications

Diabetic Prognosis through Data Mining Methods and Techniques

Sankaranarayanan. S
Associate Professor of Computer Science
Government Arts College (Autonomous)
Kumbakonam, Tamilnadu, India
profsankaranarayanan@yahoo.in

Dr. Pramananda Perumal. T
Principal
Government Arts College
Uthiramerur, Tamilnadu, India
pramanandaperumal@yahoo.com

Abstract: Data mining nowadays plays an important role in the prediction of diseases in the healthcare industry. Data mining is the process of selecting, exploring, and modeling large amounts of data to discover unknown patterns or relationships useful to the data analyst. Medical data mining has emerged as a tool with great potential for exploring hidden patterns in the data sets of the medical domain. These patterns can be utilized for fast and better clinical decision making for preventive and suggestive medicine. However, raw medical data are widely distributed, heterogeneous in nature, and too voluminous for ordinary processing. Data mining and statistics can collectively work better towards discovering hidden patterns and structures in data. In this paper, two major data mining techniques, viz. FP-Growth and Apriori, have been applied to a diabetes dataset, and association rules are generated by both of these algorithms.

Keywords: Association rules (AR), FP-tree, T-tree, Classification, Pima Indian Diabetes Data (PIDD)

I. INTRODUCTION
The data mining process can be extremely useful for medical practitioners in extracting hidden medical knowledge; without data mining techniques, traditional pattern matching and mapping strategies could not be as effective and precise in prognosis or diagnosis. This work aims at correlating the various diabetes input parameters for efficient classification of the diabetes dataset and, from there, at mining useful patterns. Knowledge discovery and data mining have found numerous applications in the business and scientific domains, and valuable knowledge can be discovered from the application of data mining techniques in healthcare systems too [1]. Data preprocessing and transformation are required before one can apply data mining to clinical data; knowledge discovery and data mining is then the core step, which results in the discovery of hidden but useful knowledge from massive databases [1].

II. DATA MINING PREDICTION MODELS

An invariable characteristic of any prediction model is that it uses some data mining classification algorithm with two folds, a prediction model and an evaluation method, as shown in Fig. 2.1. In the first fold, it uses the training dataset to screen the attributes and build the classification predictive model. In the second fold, it uses the testing dataset to find the classification efficiency. The classification algorithm indicates whether the patient's disease can be predicted with a high or low level of confidence, and finds out whether the patient is suffering from the disease or not.

A decision tree takes the form of a tree structure. It is very simple, efficient and easy to implement, and is presented in [2]. A naïve Bayesian classifier uses Bayes' theorem as a probabilistic statistical classifier. The major advantage of the naïve Bayesian classifier lies in its simplicity, and it is efficient in handling datasets containing many attributes.

Figure 2.1 Data Mining Classification Techniques
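As a minimal sketch of this two-fold workflow (not drawn from the paper itself), the code below trains a naïve Bayesian classifier on a training split of the PIDD and measures classification efficiency on a testing split; the file name, the 70/30 split ratio and the use of scikit-learn are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Hypothetical local copy of the PIDD (headerless CSV, 9 columns).
data = pd.read_csv("pima-indians-diabetes.csv", header=None)
X, y = data.iloc[:, :8], data.iloc[:, 8]        # 8 inputs; the 9th attribute is the class

# Fold 1: build the predictive model on the training dataset.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = GaussianNB().fit(X_train, y_train)      # naive Bayesian classifier

# Fold 2: measure classification efficiency on the testing dataset.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```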

Clustering refers to the grouping of similar records [3]. It is used as a preprocessing stage before the data are fed into a classification model. The values should be normalized before clustering, to avoid the domination of high-valued attributes over low-valued attributes. A Neural Network (NN) is a collection of neurons interconnected across two or more network layers, implemented for the prediction of various diseases in [2, 4]. It is made up of three layers: an input layer, a hidden layer and an output layer. It uses a linear transfer function in the input layer and a non-linear transfer function in the output layer. In the first stage it sets the transfer function and the network parameters, then calculates the output of every neuron in the hidden layer and estimates the hidden-layer output.

An adaptive neuro-fuzzy inference system combines neural networks and fuzzy systems by converting the inputs from the numerical domain to the fuzzy domain. Fuzzy and genetic algorithms use a fuzzy logic framework and significantly improve the performance of patient disease diagnosis, minimizing cost and maximizing accuracy.

Prediction models are the core data mining methods used mainly in the healthcare and engineering fields, and the techniques used are shown in Figure 2.2. Performance evaluation is done by comparing the various models used and measuring their accuracy; the proposed model is then compared with existing models and validated to show how it improves on them.

Figure 2.2 Steps in prediction models
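To make the normalization point above concrete, here is a small hedged sketch: min-max scaling brings each attribute into [0, 1] before clustering, so a wide-range attribute such as OGTT glucose cannot dominate a narrow-range one such as the diabetes pedigree function. The sample values and the cluster count are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Two attributes with very different ranges: OGTT glucose and pedigree function.
X = np.array([[148.0, 0.627], [85.0, 0.351], [183.0, 0.672], [89.0, 0.167]])

X_scaled = MinMaxScaler().fit_transform(X)   # every attribute rescaled to [0, 1]
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X_scaled)
print(labels)                                # cluster assignment per record
```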

III. METHODOLOGY

The methodology imbibed in this paper is a survey of a select set of data mining methods for predicting the immediate or later incidence of the life-threatening disease Diabetes Mellitus. Over recent years several papers have been published on the problem of analyzing DM data. In 1994, for example, the AAAI spring symposium on artificial intelligence in medicine published a BG home-monitoring data set as a challenge for applying machine learning and artificial intelligence methods to describe and predict BG values. Another well-known data set is available at the University of California, Irvine data repository: the so-called Pima Indians Diabetes database, a collection of medical diagnostic reports of 768 samples. There have been many studies applying data mining techniques to the PIDD.

A. Dataset Description
The goals of data mining can be classified into two tasks: description and prediction. While the purpose of description is to extract understandable patterns and associations from data, the goal of prediction is to forecast one or more variables of interest. The main difference between the two tasks is therefore the presence of a response variable in the case of prediction problems. If the response variable is continuous, such as blood glucose (BG), the prediction task is said to be a regression problem, whereas if the response variable is categorical, such as hypertension (taking the values yes or no), the task is said to be a classification problem.

B. The Pima-Indian Diabetic Database
The Pima Indians are genetically predisposed to diabetes, and it was noted that their diabetic rate was 19 times that of a typical town in Minnesota. The National Institute of Diabetes and Digestive and Kidney Diseases of the NIH originally owned the Pima Indian Diabetes Database (PIDD) [5]. The database has n = 768 patients, each with 9 attribute variables. Out of the nine conditional attributes, six come from physical examination and the rest from chemical examination. Table 1 shows the properties of these data. Of these 9 attributes, eight are inputs and the last one is the output [5]. The goal is to use the first 8 variables to predict the value of the 9th.

Table 1 Pima-Indian Diabetes data set attributes
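A sketch of loading the PIDD as described above: the UCI copy is a headerless CSV, and the descriptive column names below are assumptions (the file itself carries no attribute labels).

```python
import pandas as pd

# Eight conditional attributes plus the binary class, per the description above.
columns = ["pregnancies", "ogtt_glucose", "diastolic_bp", "triceps_skinfold",
           "serum_insulin", "bmi", "diabetes_pedigree", "age", "class"]
pidd = pd.read_csv("pima-indians-diabetes.data", header=None, names=columns)
assert pidd.shape == (768, 9)   # n = 768 patients, 9 attribute variables
```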

C. Data Discretization
Many real-world data mining tasks involve continuous attributes. Data discretization is defined as the process of converting continuous data attribute values into a finite set of intervals and associating with each interval some specific data value. There are no restrictions on the discrete values associated with a given data interval, except that these values must induce some ordering on the discretized attribute domain. Data discretization significantly improves the quality of discovered knowledge and also reduces the running time of various data mining tasks such as association rule discovery, classification and prediction [6]. Good discretization can lead to new and more accurate knowledge; bad discretization, on the other hand, leads to unnecessary loss of information or, in some cases, to false information with disastrous consequences. There is a wide variety of discretization methods, starting with the naive methods often referred to as unsupervised, such as equal-width and equal-frequency binning, and extending to supervised methods such as Minimum Description Length (MDL) and discretization algorithms based on Pearson's X² or Wilks' G² statistics [6, 7].
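As a hedged sketch of the two naive unsupervised methods named above, equal-width binning can be done with pandas.cut and equal-frequency binning with pandas.qcut; the bin count and sample values are illustrative assumptions.

```python
import pandas as pd

bg = pd.Series([85, 89, 103, 105, 113, 119, 120, 127, 148, 183])  # sample OGTT values

equal_width = pd.cut(bg, bins=4)    # 4 intervals of equal length
equal_freq = pd.qcut(bg, q=4)       # 4 intervals holding roughly equal record counts
print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())
```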
IV. ALGORITHMS
A. Association Rule Mining

Association rule mining techniques are used to identify relationships among a set of items in a database [8]. These relationships are not based on inherent properties of the data themselves, as with functional dependencies, but rather on the co-occurrence of the data items. Association rules are most appropriate when we search for completely new rules [8]. In this context, the association rule mining technique may generate the probable causes of a particular disease, such as diabetes, in the form of association rules which can be used for fast and better clinical decision-making.

Let I = {i1, i2, ..., im} be a set of literals, called items, and let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Given the set of transactions D, the problem is to find association rules that have support and confidence greater than the user-specified minimum support and minimum confidence [8].

An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. The rule X ⇒ Y has support s in the transaction set D if s% of the transactions in D contain X ∪ Y [9, 10].

One may be interested in generating all rules from D that satisfy given constraints on support and confidence, which are measures of the interestingness of a rule. A high level of support indicates that the rule is frequent enough for the organization to be interested in it; a high level of confidence shows that the rule is true often enough to justify a decision based on it [11].

Thus, for a rule X ⇒ Y:

Support(X ⇒ Y) = (number of transactions containing X ∪ Y) / |D|

Confidence(X ⇒ Y) = Support(X ⇒ Y) / Support(X)
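These two measures translate directly into code. The sketch below computes them over a toy transaction set D; the item names are illustrative, not attributes from the paper's experiments.

```python
D = [{"high_bmi", "high_ogtt", "diabetic"},
     {"high_bmi", "diabetic"},
     {"high_ogtt"},
     {"high_bmi", "high_ogtt", "diabetic"}]

def support(itemset):
    # fraction of transactions in D containing every item of the itemset
    return sum(itemset <= t for t in D) / len(D)

def confidence(X, Y):
    return support(X | Y) / support(X)

X, Y = {"high_bmi", "high_ogtt"}, {"diabetic"}
print(support(X | Y))      # 0.5: X and Y appear together in 2 of 4 transactions
print(confidence(X, Y))    # 1.0: every transaction containing X also contains Y
```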
B. Apriori

This algorithm consists of two parts [11, 12]. The first part finds the frequent itemsets; the second part identifies the rules. For finding frequent itemsets, the following steps are followed:

Step 1: Scan all transactions and find all frequent items that have support above s%. Let this set of frequent items be L1.

Step 2: Build potential sets of k items from Lk-1 by using pairs of itemsets in Lk-1 such that each pair has the first k-2 items in common. The k-2 common items and the one remaining item from each of the two itemsets are combined to form a k-itemset. The set of such potentially frequent k-itemsets is the candidate set Ck. (For k=2, we build the potential frequent pairs by pairing each item of the frequent itemset L1 with every other item in L1; the set so generated is the candidate set C2.)

Step 3: Scan all transactions and find all k-itemsets in Ck that are frequent; the frequent set so obtained is Lk.

The first pass of the Apriori algorithm simply counts item occurrences to determine the large 1-itemsets. A subsequent pass, say pass k, consists of two phases. First, the large itemsets Lk-1 found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen function. Next, the database is scanned and the support of the candidates in Ck is counted. For fast counting, we need to efficiently determine the candidates in Ck that are contained in a given transaction t [11, 12].

For finding rules, the following straightforward algorithm is used: take a frequent itemset, say l, and find each non-empty subset a. For every such subset a, output a rule of the form a ⇒ (l − a) if support(l) / support(a) satisfies the minimum confidence.
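A compact Python sketch of the procedure above, under illustrative transactions and thresholds: the first pass yields L1, each later pass joins pairs of (k-1)-itemsets sharing k-2 items to form the candidate set Ck, and the straightforward subset algorithm then emits the rules. (The candidate-pruning refinements of apriori-gen are omitted.)

```python
from itertools import combinations

D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]

def support(itemset):
    return sum(itemset <= t for t in D) / len(D)

def apriori(min_support):
    # L[0] holds the frequent 1-itemsets; each later level is built from the previous one.
    L = [{frozenset([i]) for t in D for i in t if support(frozenset([i])) >= min_support}]
    while L[-1]:
        # Join step: unions of two (k-1)-itemsets that share all but one item.
        Ck = {a | b for a in L[-1] for b in L[-1] if len(a | b) == len(a) + 1}
        L.append({c for c in Ck if support(c) >= min_support})
    return [itemset for level in L for itemset in level]

def rules(l, min_conf):
    # a => (l - a) for every non-empty proper subset a of a frequent itemset l.
    return [(set(a), set(l - frozenset(a)))
            for r in range(1, len(l))
            for a in combinations(l, r)
            if support(l) / support(frozenset(a)) >= min_conf]

for itemset in apriori(0.4):
    for antecedent, consequent in rules(itemset, 0.6):
        print(antecedent, "=>", consequent)
```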
C. Frequent Pattern Growth

FP-Growth is a two-step approach which allows frequent itemsets to be discovered without candidate itemset generation:

Step 1: Build a compact data structure called the FP-tree, using two passes over the dataset.

Step 2: Extract frequent itemsets directly from the FP-tree.

The FP-tree is constructed using two passes over the dataset:

Pass 1: compresses the large database into a compact frequent pattern tree (FP-tree) structure.

Pass 2: develops efficient, FP-tree-based frequent pattern mining.
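A minimal sketch of Step 1 as described above: pass 1 counts item supports, and pass 2 inserts each transaction with its frequent items sorted by descending support, so common prefixes share tree branches. Mining the tree (Step 2) is omitted for brevity, and the structure is a simplified reading of [14], not the authors' implementation.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 1, {}

def build_fp_tree(D, min_support_count):
    counts = Counter(i for t in D for i in t)            # pass 1: item supports
    frequent = {i for i, c in counts.items() if c >= min_support_count}
    root = Node(None, None)
    for t in D:                                          # pass 2: insert transactions
        node = root
        for item in sorted((i for i in t if i in frequent),
                           key=lambda i: (-counts[i], i)):
            if item in node.children:
                node.children[item].count += 1           # shared prefix: bump count
            else:
                node.children[item] = Node(item, node)   # new branch
            node = node.children[item]
    return root

root = build_fp_tree([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}], min_support_count=2)
print({i: n.count for i, n in root.children.items()})    # e.g. {'b': 3}
```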

The major difference between FP-growth and the Apriori algorithm discussed above is that FP-growth does not generate the candidate itemsets and then test them [13, 14].
V. DATA STRUCTURE FOR ASSOCIATION MINING

Basically, association rule mining works in two steps:

(1) generating itemsets that pass a minimum support threshold, and
(2) generating rules that pass a minimum confidence threshold.

Association Rule Mining (ARM) obtains a set of rules which indicate that the consequent of a rule is likely to apply if the antecedent applies [10]. To generate such rules, the first step is to determine the support for the sets of items (I) that may be present in the data set, i.e., the frequency with which each combination of items occurs. After eliminating those I for which the support fails to meet a given minimum support threshold, the remaining large I can be used to produce ARs of the form A ⇒ B, where A and B are disjoint subsets of a large I. The ARs generated are usually pruned according to some notion of confidence in each AR. To achieve this pruning, however, it is always necessary to first identify the large I contained in the input data, which in turn requires an effective storage structure. One of the efficient data storage mechanisms for itemset storage is the T-tree [15].
using data mining with minimum number of attributes applied to
A. Total Support Tree (T-tree)
A T-tree is a set enumeration tree structure used to store frequent itemset information. The differences between the T-tree and other set enumeration tree structures are:

1. An array is used to define the levels in each sub-branch of the tree, which permits indexing at all levels and in turn offers computational advantages.

2. To make indexing possible at all levels, the tree is built in reverse: each branch is founded on the last element of the frequent sets to be stored.
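The sketch below is one simplified reading of this structure (not the authors' implementation from [15]): each level of a sub-branch is a plain array indexed by item id, and branches hang off the last item of the itemset, so any itemset's support is reachable by direct array indexing.

```python
class TTreeNode:
    def __init__(self, num_items):
        self.support = 0
        self.children = [None] * num_items     # one array slot per item: direct indexing

def add_itemset(root, itemset, num_items, count=1):
    node = root
    for item in sorted(itemset, reverse=True): # built in reverse: branch on last element
        if node.children[item] is None:
            node.children[item] = TTreeNode(num_items)
        node = node.children[item]
    node.support += count

root = TTreeNode(num_items=4)
add_itemset(root, {0, 2}, num_items=4)
print(root.children[2].children[0].support)    # support of {0, 2} via array lookups: 1
```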
The most significant overhead when considering ARM data structures is that the number of possible combinations represented by the item-columns in the input data scales exponentially with the size of the record. A partial solution is to store only those combinations that actually appear in the data set. The implementation of this structure can be optimized by storing the levels of the tree in the form of arrays, thus reducing the number of links needed and providing direct indexing. For the latter purpose, it is more convenient to build a reverse version of the tree, referred to as a T-tree, the Total support tree. In the following Table 2, frequent itemsets and rules were produced using the two approaches under different parameter settings.

Table 2: Number of rules per approach

                                  FP-Growth                 Apriori
Parameter value                   Itemsets   Rules          Itemsets   Rules
Support (4%), Confidence (80%)      121        3              121        3
Support (4%), Confidence (70%)      121       14              121       14
Support (4%), Confidence (60%)      121       23              121       23
Support (4%), Confidence (50%)      121       37              121       37
Support (5%), Confidence (80%)       68    No rules            68    No rules
VI. DISCUSSION AND CONCLUSION

In healthcare, data mining is becoming increasingly essential. Data mining came into existence in the mid-1990s and is widely used in the biomedical, healthcare and engineering fields. Using data mining technologies, we can predict diseases earlier. This paper provides an idea about diabetes mellitus, a life-threatening disease, and about its diagnosis using data mining with a minimum number of attributes applied to classification algorithms. With the help of data mining algorithms, the computation cost decreases and the classification performance increases. This study has included two algorithms of the association rule mining technique, viz. Apriori and FP-Growth. In FP-Growth a novel data structure, the frequent pattern tree (FP-tree), is implemented for storing compressed, crucial information about frequent patterns. There are several advantages of FP-growth over the Apriori approach:

1) It constructs a highly compact FP-tree, which is usually substantially smaller than the original database and thus saves the costly database scans in the subsequent mining processes.
2) It applies a pattern-growth method which avoids costly candidate generation and testing by successively concatenating frequent 1-itemsets found in conditional FP-trees.
3) It applies a partitioning-based divide-and-conquer method which dramatically reduces the size of the subsequent conditional pattern bases and conditional FP-trees.

It is observed that both techniques generate the same number of frequent sets and, as a consequence, the same number of rules for the same known dataset under the same constraints. These rules provide valuable knowledge, in the form of induction rules:

1. IF (OGTT=127) THEN (number of times pregnant=3)
2. IF (Diastolic Blood Pressure=75) AND (BMI=30) THEN (Diabetes pedigree function=0.5) [unit of BMI: weight in kg/(height in m)^2]
3. IF (number of times pregnant=3) AND (Age=35) THEN (Diabetes pedigree function=0.33)
4. IF (Diastolic Blood Pressure=50) THEN Not Diabetic
5. IF (number of times pregnant=6) AND (BMI=34) THEN Not Diabetic
6. IF (Triceps Skinfold Thickness=22) AND (Diabetes pedigree function=0.5) THEN Not Diabetic
7. IF (number of times pregnant=5) AND (Diabetes pedigree function=0.66) THEN Not Diabetic
8. IF (number of times pregnant=7) AND (Diabetes pedigree function=0.66) THEN Not Diabetic
9. IF (BMI=35) AND (Diabetes pedigree function=0.66) THEN Not Diabetic
10. IF (OGTT=103) THEN Diabetic
11. IF (OGTT=105) THEN Diabetic
12. IF (OGTT=119) THEN Diabetic
13. IF (OGTT=120) THEN Diabetic
14. IF (Diastolic Blood Pressure=63) THEN Diabetic
15. IF (number of times pregnant=2) AND (Diastolic Blood Pressure=75) THEN Diabetic
16. IF (number of times pregnant=3) AND (Triceps Skinfold Thickness=17) THEN Diabetic
17. IF (number of times pregnant=2) AND (BMI=30) THEN Diabetic
18. IF (Diastolic Blood Pressure=75) AND (BMI=30) THEN Diabetic
19. IF (number of times pregnant=3) AND (BMI=33) THEN Diabetic
20. IF (OGTT=113) AND (Diabetes pedigree function=0.33) THEN Diabetic
21. IF (OGTT=120) AND (Diabetes pedigree function=0.33) THEN Diabetic
22. IF (BMI=32) AND (Diabetes pedigree function=0.33) THEN Diabetic
23. IF (OGTT=119) AND (Diabetes pedigree function=0.5) THEN Diabetic
24. IF (Diastolic Blood Pressure=75) AND (Diabetes pedigree function=0.5) THEN Diabetic
25. IF (Diastolic Blood Pressure=75) AND (BMI=30) THEN (Diabetes pedigree function=0.5) AND Diabetic

These rules have the potential to improve the expert system and to enable better clinical decision making. In a thickly populated country with scarce resources, such as India, public awareness can also be achieved through the dissemination of the above knowledge.

REFERENCES

[1] Harleen Kaur and Siri Krishan Wasan, "Empirical Study on Applications of Data Mining Techniques in Healthcare", Journal of Computer Science, 2006.
[2] Duen-Yian Yeh, Ching-Hsue Cheng and Yen-Wen Chen, "A predictive model for cerebrovascular disease using data mining", Expert Systems with Applications, pp. 8970-8977, 2011.
[3] Shantakumar B. Patil and Y. S. Kumaraswamy, "Predictive data mining for medical diagnosis of heart disease", 2011.
[4] D. Shanthi, G. Sahoo and N. Saravanan, "Designing an Artificial Neural Network Model for the Prediction of Thrombo-embolic Stroke", IJBB, Volume 3, pp. 10-18, 2008.
[5] UCI Machine Learning Repository, Pima Indians Diabetes Data Set, http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes
[6] Marc Boullé, "Khiops: A Statistical Discretization Method of Continuous Attributes", Machine Learning, 2004.
[7] Ruoming Jin, Yuri Breitbart and Chibuike Muoh, "Data Discretization Unification", Seventh IEEE International Conference on Data Mining (ICDM), pp. 183-192, 2007.
[8] M. H. Margahny and A. A. Mitwaly, "Fast Algorithms for Mining Association Rules", AIML 05 Conference, 19-21 December 2005.
[9] Milan Zorman et al., "Mining Diabetes Database with Decision Trees and Association Rules", 2002.
[10] Carlos Ordonez, "Comparing Association Rules and Decision Trees for Disease Prediction", HIKM '06, November 11, 2006.
[11] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", in Proc. of the 20th International Conference on Very Large Data Bases (VLDB), September 1994.
[12] R. Agrawal et al., "Mining Association Rules between Sets of Items in Large Databases", in Proc. of the ACM SIGMOD Conference on Management of Data, 1993.
[13] Christian Borgelt, "An Implementation of the FP-growth Algorithm", Workshop on Frequent Pattern Mining Implementations (co-located with the Conference on Knowledge Discovery in Data Mining), 2005.
[14] Jiawei Han, Jian Pei and Yiwen Yin, "Mining Frequent Patterns without Candidate Generation", in Proc. of the 2000 ACM SIGMOD International Conference on Management of Data.
[15] Frans Coenen, Paul Leng and Shakil Ahmed, "Data Structure for Association Rule Mining: T-Trees and P-Trees", IEEE Transactions on Knowledge and Data Engineering, Vol. 16, June 2004.

