www.seipub.org/ie
*1jlliu@isu.edu.tw; 2arthur@fotech.edu.tw
Abstract
When a disease is misdiagnosed, a false negative can have a much larger impact than a false positive on both the doctor and the patient. Recently, many researchers have devoted themselves to the study of data mining for disease diagnosis. Current research shows that work emphasizing the adoption of various clustering or classification techniques in cardiac disease diagnosis fails to lower the risk of misjudging a case as a false negative. Therefore, this research combines a multi-objective genetic algorithm with the k-means clustering technique in data mining to analyze data from patients suffering from cardiac disease, aiming for higher prediction accuracy in the diagnosis and a lower risk of misjudging a case as a false negative, an error that carries a high cost. The combined algorithm is named the Multi-objective Genetic k-means Clustering Algorithm, termed MOGKCA in this work. The objectives of the multi-objective optimization in cardiac disease analysis are: (1) to minimize the inaccuracy of classification (or maximize the prediction accuracy of classification), and (2) to minimize the number of false negatives (or minimize the cost of classification). The data used in this research are from the UCI cardiac disease data set Heart-statlog. In the single-objective optimization case, the proposed MOGKCA effectively raises the classification rate. In the multi-objective analysis of the two cases of data sets, MOGKCA yields the best Pareto front, which can be provided to help set the best point of analysis.
Keywords
K-means Clustering; Cardiac Disease Diagnosis; Multi-objective Genetic k-means Clustering Algorithm; False Negativity
Introduction
                 Prediction
Actuality    Negative    Positive
False        tn          fp
True         fn          tp
                 Prediction
Actuality    Negative    Positive
False        0           1
True         5           0
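As a rough illustration of how the cost matrix above turns predictions into a total cost, the following Python sketch counts tn, fp, fn and tp from two label lists and weights them with the matrix entries. The function names and the boolean encoding (True = disease present, or predicted positive) are assumptions made for this example and are not taken from the original study.

```python
# Cost matrix from the table above, keyed by (actuality, prediction).
# Encoding assumption: True = disease present (actuality) or predicted positive.
COST = {
    (False, False): 0,  # tn: healthy case predicted negative
    (False, True): 1,   # fp: healthy case predicted positive
    (True, False): 5,   # fn: diseased case predicted negative
    (True, True): 0,    # tp: diseased case predicted positive
}


def confusion_counts(actual, predicted):
    """Count tn, fp, fn and tp for two equally long lists of booleans."""
    counts = {"tn": 0, "fp": 0, "fn": 0, "tp": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif a and not p:
            counts["fn"] += 1
        elif p:
            counts["fp"] += 1
        else:
            counts["tn"] += 1
    return counts


def total_cost(actual, predicted):
    """Sum the cost-matrix entry of every (actuality, prediction) pair."""
    return sum(COST[(bool(a), bool(p))] for a, p in zip(actual, predicted))


actual = [True, True, False, False]
predicted = [False, True, True, False]
print(confusion_counts(actual, predicted))  # one case in each cell
print(total_cost(actual, predicted))        # 5 (one fn) + 1 (one fp) = 6
```

With this matrix a single missed diseased case costs as much as five false alarms, which is why the second objective targets false negativity directly.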
$$\sum_{i=1}^{k} \left\| x_i - x_c \right\| \tag{1}$$
$$\sum_{i=1}^{k} w_i \left\| x_i - x_c \right\| \tag{2}$$
(a) Choose k.
(b) Select k initial center points.
(c) Use a similarity measure to assign each data point to its nearest center point.
(d) Use the data points assigned to each group to recalculate the new center points.
(e) Check whether the center points changed in step (d). If so, repeat steps (c) and (d) until the center points no longer change. At that point the k-means algorithm has converged.
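The steps above can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function name `weighted_kmeans`, the random choice of initial centers, and the weighted Euclidean form sqrt(sum_i w_i (x_i - c_i)^2) for the similarity measure are all assumptions made for illustration.

```python
import numpy as np


def weighted_kmeans(X, k, w=None, max_iter=100, seed=0):
    """k-means with an (assumed) attribute-weighted Euclidean distance."""
    X = np.asarray(X, dtype=float)
    # Unit weights reduce the distance to the ordinary Euclidean distance.
    w = np.ones(X.shape[1]) if w is None else np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    # (a), (b): k is given; pick k distinct data points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # (c): assign every point to its nearest center.
        diff = X[:, None, :] - centers[None, :, :]
        dist = np.sqrt((w * diff ** 2).sum(axis=2))
        labels = dist.argmin(axis=1)
        # (d): recompute each center as the mean of its assigned points
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # (e): stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


X = [[0, 0], [0, 1], [1, 0], [100, 100], [100, 101], [101, 100]]
labels, centers = weighted_kmeans(X, k=2)
print(labels)
```

Setting an attribute weight to zero removes that attribute from the distance, which is the lever a genetic search over the weight vector w can use.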
Multi-objective Genetic Algorithm
The MOGA used in this study (Liu et al., 2010) includes the following steps: (a) place each individual solution in the solution group according to the definition of dominance and perform non-dominated sorting on each individual (Deb et al., 2002), (b) calculate the crowding distance of each solution and sort the solutions, (c) apply a binary tournament selection operator, (d) apply an extended intermediate crossover, and (e) apply a non-uniform mutation operator (Deb et al.; Liu et al.). These steps heighten the effectiveness of the MOGA. The optimization of multiple objectives can be formulated as the minimization of a function f(x) with M objectives, subject to p constraints denoted by the function g(x), as follows:
$$\text{Minimize } f(x) = \{ f_1, f_2, \ldots, f_M \}^{T} \quad \text{subject to } g_j(x) \le 0, \; j = 1, \ldots, p \tag{3}$$
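Steps (b) to (e) of the MOGA described above can be sketched in Python as follows. This is a hedged illustration rather than the operators of the original study: the function names, the crossover range parameter `d=0.25`, the mutation parameters `rate` and `b`, and the assumption that every gene (attribute weight) lies in [0, 1] are all choices made for this example.

```python
import math
import random


def crowding_distance(front):
    """Step (b): crowding distance of each objective tuple in one front."""
    n = len(front)
    dist = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        # Boundary solutions are always kept.
        dist[order[0]] = dist[order[-1]] = math.inf
        if hi == lo:
            continue
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (hi - lo)
    return dist


def binary_tournament(ranks, dists, rng):
    """Step (c): lower front rank wins; ties go to the larger distance."""
    a, b = rng.sample(range(len(ranks)), 2)
    if ranks[a] != ranks[b]:
        return a if ranks[a] < ranks[b] else b
    return a if dists[a] >= dists[b] else b


def extended_intermediate_crossover(p1, p2, rng, d=0.25, lo=0.0, hi=1.0):
    """Step (d): child gene = x + alpha * (y - x), alpha in [-d, 1 + d]."""
    child = []
    for x, y in zip(p1, p2):
        alpha = rng.uniform(-d, 1.0 + d)
        child.append(min(hi, max(lo, x + alpha * (y - x))))
    return child


def nonuniform_mutation(w, gen, max_gen, rng, rate=0.1, b=2.0, lo=0.0, hi=1.0):
    """Step (e): perturbation that shrinks to zero as gen approaches max_gen."""
    out = []
    for x in w:
        if rng.random() < rate:
            shrink = 1.0 - rng.random() ** ((1.0 - gen / max_gen) ** b)
            if rng.random() < 0.5:
                x = x + (hi - x) * shrink
            else:
                x = x - (x - lo) * shrink
        out.append(x)
    return out


rng = random.Random(0)
child = extended_intermediate_crossover([0.1, 0.9], [0.4, 0.2], rng)
mutant = nonuniform_mutation(child, gen=10, max_gen=100, rng=rng)
print(crowding_distance([(1, 5), (2, 3), (4, 1)]))  # [inf, 2.0, inf]
```

The non-uniform mutation explores widely in early generations and fine-tunes later, since its step size decays with the generation counter.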
$$\forall i \in \{1, 2, \ldots, M\}, \quad f_i(x_a) \le f_i(x_b) \tag{4}$$
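The dominance condition in (4) can be checked with a few lines of Python. The sketch below assumes the usual completion of the definition (Deb et al., 2002): x_a dominates x_b when it is no worse in every objective and strictly better in at least one. The objective pairs are (number of misclassified cases, number of false negatives), read off the results reported in this paper for the 270-case data set; the function names are hypothetical.

```python
def dominates(fa, fb):
    """True if objective vector fa dominates fb (all objectives minimized)."""
    no_worse = all(a <= b for a, b in zip(fa, fb))
    strictly_better = any(a < b for a, b in zip(fa, fb))
    return no_worse and strictly_better


def pareto_front(points):
    """Names of the points that no other point dominates."""
    return {
        name
        for name, f in points.items()
        if not any(dominates(g, f) for other, g in points.items() if other != name)
    }


# (misclassified cases, false negatives) out of 270 cases.
points = {
    "k-means": (13 + 31, 31),
    "MOGKCA single-objective": (10 + 17, 17),
    "solution p": (47, 9),   # accuracy 82.59%, 9 false negatives
    "solution q": (26, 17),  # accuracy 90.37%, 17 false negatives
}
print(sorted(pareto_front(points)))  # ['solution p', 'solution q']
```

Neither p nor q dominates the other: p has fewer false negatives and q has fewer misclassifications, which is exactly the trade-off a Pareto front exposes.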
$$f_1(w) = \min \sum_{i=1} \left[ (C_{pred})_i \ne (C_{actual})_i \right], \qquad f_2(w) = \min \sum_{i=1} fn_i \tag{5}$$
Our research combines the k-means algorithm with the multi-objective genetic algorithm and names the result the Multi-objective Genetic k-means Clustering Algorithm (MOGKCA).
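To make the link between the clustering output and the two objectives concrete, the following Python sketch evaluates f1 (number of misclassified cases) and f2 (number of false negatives) for one clustering result. How the study maps clusters to classes is not stated in the surviving text, so the majority-vote mapping used here is an assumption, as are the function names and the label strings.

```python
from collections import Counter


def clusters_to_classes(cluster_ids, actual):
    """Label every cluster with the most common actual class among its members
    (assumed mapping; ties go to the class seen first)."""
    mapping = {}
    for c in set(cluster_ids):
        members = [a for cid, a in zip(cluster_ids, actual) if cid == c]
        mapping[c] = Counter(members).most_common(1)[0][0]
    return [mapping[c] for c in cluster_ids]


def objectives(cluster_ids, actual, positive="sick"):
    """Return (f1, f2) = (misclassified cases, false negatives)."""
    predicted = clusters_to_classes(cluster_ids, actual)
    f1 = sum(p != a for p, a in zip(predicted, actual))
    f2 = sum(a == positive and p != positive for p, a in zip(predicted, actual))
    return f1, f2


cluster_ids = [0, 0, 0, 1, 1, 1]
actual = ["healthy", "healthy", "sick", "sick", "sick", "healthy"]
print(objectives(cluster_ids, actual))  # (2, 1)
```

In a full MOGKCA run, a candidate weight vector w would first drive the weighted k-means clustering, and the resulting pair (f1, f2) would then be handed to the genetic algorithm as that candidate's fitness.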
Discussion and Result
The cardiac disease data set Heart-statlog comes from UCI (Blake and Merz, 1998). Heart-statlog has 13 attributes and one class attribute, and contains 270 records in total. The class attribute divides the records into two classes (120 cases with cardiac disease, 150 cases without). Our results are presented in two parts: (1) the best prediction under single-objective optimization and (2) the best result under multi-objective optimization. In single objective
(i) Accuracy: $accuracy = \dfrac{tn + tp}{tn + fn + fp + tp}$
(ii) Sensitivity: $sensitivity\,(P) = \dfrac{tp}{fn + tp}$
(iii) Specificity: $specificity\,(R) = \dfrac{tn}{tn + fp}$
(iv) F-measure: $F\text{-}Measure = \dfrac{2 \times P \times R}{P + R}$
(v) Cost: $cost = fp + fn \times 5$
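The five measures (i) to (v) can be computed directly from the four confusion-matrix counts. The Python sketch below follows the definitions as given in this paper, including the F-measure built from sensitivity and specificity; the function name and the `fn_cost` parameter are assumptions for illustration. The example counts are the k-means confusion matrix reported in Table 4.

```python
def metrics(tn, fp, fn, tp, fn_cost=5):
    """Measures (i)-(v) from confusion-matrix counts."""
    accuracy = (tn + tp) / (tn + fn + fp + tp)
    sensitivity = tp / (fn + tp)   # P in the paper's notation
    specificity = tn / (tn + fp)   # R in the paper's notation
    f_measure = 2 * sensitivity * specificity / (sensitivity + specificity)
    cost = fp + fn * fn_cost       # a false negative costs five false positives
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
        "cost": cost,
    }


# k-means counts from Table 4: healthy -> (137, 13), sick -> (31, 89).
m = metrics(tn=137, fp=13, fn=31, tp=89)
for name, value in m.items():
    print(name, value if name == "cost" else f"{100 * value:.2f}%")
```

For these counts the sketch gives an accuracy of 83.70%, a sensitivity of 74.17% and a cost of 168, matching the k-means figures reported in the comparison of algorithms.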
No    Attributes                                         Domain
1     Age (age)                                          continuous
2     Sex (sex)                                          male/female
3     Chest (cp)                                         1, 2, 3, 4
4     Resting blood pressure (trestbps)                  continuous
5     Serum cholesterol (chol)                           continuous
6     Fasting blood sugar (fbs)                          0, 1
7     Resting electrocardiographic results (restecg)     0, 1, 2
8     Maximum heart rate achieved (thalach)              continuous
9     Exercise induced angina (exang)                    0, 1
10    Old peak                                           continuous
11    Slope (slope)                                      1, 2, 3
12    Number of major vessels (ca)                       0, 1, 2, 3
13    Thal (thal)                                        3, 6, 7
14    Class (class)                                      healthy or sick
See 5.
TABLE 4 COMPARISON OF CONFUSION MATRIX FOR SINGLE OBJECTIVE OPTIMIZATION
                          prediction
reality    algorithm    healthy    sick
healthy    k-means      137        13
healthy    MOGKCA       140        10
sick       k-means      31         89
sick       MOGKCA       17         103
parameter      Naive Bayes    See5      k-means    MOGKCA
accuracy       86.29%         90.00%    83.70%     90.00%
sensitivity    83.33%         87.50%    74.17%     85.83%
specificity    88.67%         92.00%    91.33%     93.33%
F-measure      85.92%         89.69%    81.86%     89.43%
cost           117            87        168        95
                                  prediction
reality    solution point     healthy    sick
healthy    solution at p      112        38
healthy    solution at q      141        9
sick       solution at p      9          111
sick       solution at q      17         103
item                  solution at p    solution at q
average               83.70%           89.46%
standard deviation    1.45818%         0.54275%
parameter                   solution at p    solution at q
accuracy                    82.59%           90.37%
sensitivity                 92.50%           85.83%
specificity                 74.67%           94.00%
F-measure                   82.63%           89.73%
cost                        83               94
No. of false negativity     9                17
attributes        weights at p    weights at q
w1 (age)          0.10590         0.67609*
w2 (sex)          0.39751         0.49020
w3 (cp)           0.83539*        0.80064*
w4 (trestbps)     0.15497         0.32295
w5 (chol)         0.10438         0.12337
w6 (fbs)          0.11371         0.05776
w7 (restecg)      0.41222         0.29321
w8 (thalach)      0.21988         0.06001
w9 (exang)        0.00696         0.00087
w10 (oldpeak)     0.61670*        0.46671
w11 (slope)       0.42300         0.46982
w12 (ca)          0.79464*        0.53327*
w13 (thal)        0.23767         0.19470
                                  prediction
reality    solution point     healthy    sick
healthy    solution at p      95         30
healthy    solution at q      117        8
sick       solution at p      8          83
sick       solution at q      14         77
item                  solution at p    solution at q
average               84.90%           87.59%
standard deviation    3.27495%         1.76655%
parameter                   solution at p    solution at q
accuracy                    82.41%           89.81%
sensitivity                 91.21%           84.62%
specificity                 76.00%           93.60%
F-measure                   82.91%           88.88%
cost                        70               78
No. of false negativity     8                14
Conclusion
This study proposed the Multi-objective Genetic k-means Clustering Algorithm to analyze patient cases in a cardiac disease database, heightening the diagnostic accuracy while lowering the cost due to false negativity. We used the UCI cardiac disease data set Heart-statlog. In the single-objective optimization case, the proposed MOGKCA effectively raises the classification rate. In the multi-objective analysis of the two cases of data sets, MOGKCA yields the best Pareto front, which can be provided to help set the best point of analysis.
Base Station Placement Using an Enhanced Multiobjective Genetic Algorithm. International Journal of
ACKNOWLEDGMENT
Anbarasi,
Anupriya,
E.,
and
Iyengar,
Selection
Using
Genetic
L.,
2010, 19-42.
Palaniappan, S., and Awang, R. Intelligent Heart Disease
Prediction System Using Data Mining Techniques.
International Journal of Computer Science and Network
M.,
Feature
Friedman,
J.,
Olshen,
R.,
and
Stone,
Trees, Wadsworth
Machine
2011
International
Journal
of
artificial
&