
Letter Paper, Int. J. on Recent Trends in Engineering and Technology, Vol. 7, No. 1, July 2012

An Assessment of Neuro-Statistical Classification Algorithms for the Diagnosis of a Multivariate Heart Disease Dataset
G. NaliniPriya1, A. Kannan2 and P. AnandhaKumar3
1 Research Scholar, Department of IST, Anna University, Chennai. E-mail: nalini.anbu@gmail.com
2 Professor, Department of IST, Anna University, Chennai
3 Associate Professor, Department of IT, Anna University, Chennai
Abstract- A major challenge faced by healthcare organizations is the provision of quality services at affordable costs. Quality service implies diagnosing patients correctly and administering treatments that are effective. Poor clinical decisions can lead to disastrous consequences and are therefore unacceptable. Hospitals must also minimize the cost of clinical tests. They can achieve these results by employing appropriate computer-based information and decision support systems. The scope of this paper is to study the performance of some widely used classification algorithms on complex multivariate data sets. The performance of the algorithms was evaluated with the coronary artery disease (CAD) data set taken from the University of California, Irvine (UCI) repository. The algorithms evaluated are k-means, Fuzzy c-means (FCM), Self-Organizing Maps (SOM) and Adaptive Resonance Theory (ART). The performance is measured in terms of metrics such as Specificity, Sensitivity and Accuracy.
Keywords: CAD, Heart Disease, Clustering, Classification, Multivariate Data Clustering, k-means, Fuzzy c-means (FCM), ART, SOM, ANN

I. INTRODUCTION
With the rapidly growing aging population, the increased burden of chronic diseases, and rising healthcare costs, there is an urgent need for the development, implementation, and deployment of new models of healthcare services in everyday medical practice. In this scenario, home monitoring (HM) and data mining (DM) play an important role. DM is the computer-assisted process of digging through and analyzing a large quantity of data in order to extract meaningful knowledge and to identify phenomena faster and better than human experts [1]. As regards HM, although a wide literature describes technical solutions, the evidence of cost-effectiveness is limited and only a few studies compare HM with other models of disease management programmes (DMPs).

II. MOTIVATION OF THIS RESEARCH
Clinical decisions are often made based on doctors' intuition and experience rather than on the knowledge-rich data hidden in the database. This practice leads to unwanted biases, errors and excessive medical costs, which affect the quality of service provided to patients. There are many ways in which a medical misdiagnosis can present itself. Whether a doctor or the hospital staff is at fault, a misdiagnosis of a serious illness can have very extreme and harmful effects. The National Patient Safety Foundation cites that 42% of medical patients feel they have experienced a medical error or missed diagnosis. Patient safety is sometimes negligently given the back seat to other concerns, such as the cost of medical tests, drugs, and operations. Medical misdiagnoses are a serious risk to the healthcare profession; if they continue, people will fear going to the hospital for treatment. Misdiagnosis can be reduced by informing the public and by filing claims and suits against the medical practitioners at fault. Motivated by the need for such an expert system, in this paper we analyze some of the existing classification techniques for their suitability to efficiently diagnose heart disease.

III. CORONARY ARTERY DISEASE (CAD)
Heart disease, which is usually called coronary artery disease [2][3][4] (CAD), is a broad term that can refer to any condition that affects the heart [8]. CAD is a chronic disease in which the coronary arteries gradually harden and narrow. It is the most common form of cardiovascular disease [8][9] and the major cause of heart attacks in all countries. Moreover, cardiovascular disease is the leading killer when compared with other diseases. While many people with heart disease [10][15] have symptoms such as chest pain and fatigue, as many as 50% have no symptoms until a heart attack occurs [9]. The data generally used for diagnosing CAD is multivariate in nature. Having so many factors

to analyze when diagnosing heart disease, physicians generally make decisions by evaluating the current test results of the patients. The previous decisions made on other patients with the same condition are also examined by the physicians. These complex procedures are not easy when one considers the number of factors that the physician has to evaluate. So, diagnosing the heart disease [12][13] of a patient requires experienced and highly skilled physicians. Recent advances in the field of artificial intelligence and data mining have led to the emergence of expert systems for medical applications. Moreover, in the last few decades computational tools have been designed to improve the experience and abilities of physicians in making decisions about their patients.

IV. CLUSTERING AND CLASSIFICATION METHODS
Clustering is one of the most useful tasks in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects [1][2] can be treated collectively as one group in many applications. Clustering is a form of learning by observation rather than learning by examples. Cluster analysis is an important human activity in which we indulge from childhood, when we learn to distinguish between animals and plants by continuously improving subconscious clustering schemes. It has been widely used in numerous applications, including pattern recognition, data analysis, image processing and market research.

A. Classical Methods
I) K-Means Clustering
The k-means algorithm (MacQueen 1967, Anderberg 1973) is built upon four basic operations: (1) selection of the initial k means for k clusters, (2) calculation of the dissimilarity between an object and the mean of a cluster, (3) allocation of an object to the cluster whose mean is nearest to the object, and (4) re-calculation of the mean of a cluster from the objects allocated to it in such a way that the intra-cluster dissimilarity is minimized. Except for the first operation, the other three operations are performed repeatedly until the algorithm converges.
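The following is a minimal MATLAB sketch of these four operations, assuming squared Euclidean distance as the dissimilarity measure and randomly chosen records as the initial means; the function name and variables (kmeans_sketch, X, k, maxIter) are illustrative and not taken from the paper.

function [idx, means] = kmeans_sketch(X, k, maxIter)
% Minimal k-means sketch: X is an N-by-d data matrix, k the number of clusters.
    N = size(X, 1);
    % (1) select the initial k means (here: k randomly chosen records)
    perm = randperm(N);
    means = X(perm(1:k), :);
    idx = zeros(N, 1);
    for iter = 1:maxIter
        oldIdx = idx;
        % (2) + (3) compute the dissimilarity to each mean and allocate each
        % object to the cluster whose mean is nearest
        for q = 1:N
            d = sum((means - repmat(X(q, :), k, 1)).^2, 2);  % squared Euclidean distances
            [dmin, c] = min(d);
            idx(q) = c;
        end
        % (4) re-calculate each cluster mean from the objects allocated to it
        for c = 1:k
            members = X(idx == c, :);
            if ~isempty(members)
                means(c, :) = mean(members, 1);
            end
        end
        if isequal(idx, oldIdx)   % stop when allocations no longer change
            break;
        end
    end
end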
II) Fuzzy C-Means Clustering
Fuzzy c-means (FCM) is a data clustering technique wherein each data point belongs to a cluster to a degree specified by a membership grade. It provides a method for grouping data points that populate some multidimensional space into a specific number of different clusters. The FCM algorithm starts with an initial guess for the cluster centers, which are intended to mark the mean location of each cluster. This initial guess is most likely incorrect. Additionally, FCM assigns every data point a membership grade for each cluster. By iteratively updating the cluster centers and the membership grades for each data point, FCM moves the cluster centers to the right location within the data set. This iteration is based on minimizing an objective function that represents the distance from any given data point to a cluster center, weighted by that data point's membership grade.
The fuzzy c-means (FCM) algorithm was introduced by J. C. Bezdek [14]. The idea of FCM is to use the weights that minimize the total weighted mean-square error:

J(w_{qk}, z^{(k)}) = \sum_{q=1}^{Q} \sum_{k=1}^{K} w_{qk} \, \| x^{(q)} - z^{(k)} \|^2    (1)
\sum_{k=1}^{K} w_{qk} = 1 \ \text{for each } q, \qquad w_{qk} = \frac{(1/D_{qk}^2)^{1/(p-1)}}{\sum_{k=1}^{K} (1/D_{qk}^2)^{1/(p-1)}}, \quad p > 1    (2)

where D_{qk} = \| x^{(q)} - z^{(k)} \| is the distance from feature vector x^{(q)} to cluster center z^{(k)}. FCM allows each feature vector to belong to every cluster with a fuzzy truth value (between 0 and 1), which is computed using Equation (2). The algorithm assigns a feature vector to a cluster according to the maximum weight of the feature vector over all clusters.
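A minimal MATLAB sketch of the alternating FCM updates implied by Equation (2) is shown below; the random initial memberships, the weighted-mean centre update with weights raised to the fuzzifier p (the standard FCM form), and the fixed iteration count are illustrative assumptions rather than details taken from the paper.

function [W, Z] = fcm_sketch(X, K, p, maxIter)
% Minimal fuzzy c-means sketch. X: Q-by-d data, K: number of clusters, p > 1: fuzzifier.
    [Q, d] = size(X);
    W = rand(Q, K);                       % initial membership weights
    W = W ./ repmat(sum(W, 2), 1, K);     % normalise so each row sums to 1
    Z = zeros(K, d);
    for iter = 1:maxIter
        % update cluster centres as weighted means of the data
        Wp = W .^ p;
        for k = 1:K
            Z(k, :) = (Wp(:, k)' * X) / sum(Wp(:, k));
        end
        % update memberships using Equation (2): w_qk ~ (1/D_qk^2)^(1/(p-1))
        for q = 1:Q
            D2 = sum((Z - repmat(X(q, :), K, 1)).^2, 2)';   % squared distances D_qk^2
            u = (1 ./ max(D2, eps)) .^ (1 / (p - 1));
            W(q, :) = u / sum(u);
        end
    end
end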

B. Unsupervised Learning Methods
In unsupervised learning we are given some data x, and the cost function to be minimized can be any function of the data x and the network's output f. The cost function depends on the task (what we are trying to model) and our a priori assumptions (the implicit properties of our model, its parameters and the observed variables). As a trivial example, consider the model f(x) = a, where a is a constant, and the cost C = (E[x] - f(x))^2. Minimizing this cost gives a value of a equal to the mean of the data. The cost function can be much more complicated. Its form depends on the application: in compression, for example, it could be related to the mutual information between x and y; in statistical modeling, it could be related to the posterior probability of the model given the data. (Note that in both of these examples those quantities would be maximized rather than minimized.) Tasks that fall within the paradigm of unsupervised learning are in general estimation problems; the applications include clustering, the estimation of statistical distributions, compression and filtering.

I) Self-Organizing Neural Networks
The self-organizing map (SOM) is a subtype of artificial neural network [7]. It is trained using unsupervised learning to produce a low-dimensional representation of the training samples while preserving the topological properties of the input space. This makes the SOM suitable for visualizing low-dimensional views of high-dimensional data. The self-organizing map is a single-layer feed-forward network whose output neurons are arranged in a low-dimensional (usually 2D or 3D) grid. Each input is connected to all output neurons. Attached to every neuron there is a weight vector with the same dimensionality as the input vectors. The weights of the neurons are initialized either to small random values or sampled evenly from the subspace spanned by the two largest principal component eigenvectors. The latter alternative speeds up the training significantly because the initial weights already give a good approximation of the SOM weights.
The training utilizes competitive learning. When a training sample is given to the network, its Euclidean distance to all weight vectors is computed. The neuron whose weight vector is most similar to the input is called the Best Matching Unit (BMU). The weights of the BMU and of the neurons close to it in the SOM lattice are adjusted towards the input vector. The magnitude of the change decreases with time and is smaller for neurons physically far away from the BMU. The update formula for a neuron with weight vector Wv(t) is

Wv(t + 1) = Wv(t) + Θ(v, t) α(t) (D(t) - Wv(t)),

where α(t) is a monotonically decreasing learning coefficient and D(t) is the input vector. The neighborhood function Θ(v, t) depends on the lattice distance between the BMU and neuron v. In the simplest form it is one for all neurons close enough to the BMU and zero for the others, but a Gaussian function is a common choice, too. Regardless of the functional form, the neighborhood function shrinks with time [1]. At the beginning, when the neighborhood is broad, the self-organization takes place on the global scale. When the neighborhood has shrunk to just a couple of neurons, the weights converge to local estimates. This process is repeated for each input vector, over and over, for a (usually large) number of cycles. The network winds up associating output nodes with groups or patterns in the input data set. If these patterns can be named, the names can be attached to the associated nodes in the trained net.
Like most artificial neural networks [5][6][7], the SOM has two modes of operation. During the training process a map is built and the neural network organizes itself using a competitive process. The network must be given a large number of input vectors representing, as much as possible, the kind of vectors that are expected during the second phase (if any); otherwise, all input vectors must be presented several times. During the mapping process a new input vector may quickly be given a location on the map; it is automatically classified or categorized. There is one single winning neuron: the neuron whose weight vector lies closest to the input vector. (This can simply be determined by calculating the Euclidean distance between the input vector and the weight vectors.)
Stepping through the SOM algorithm:
1. Randomize the map's node weight vectors.
2. Grab an input vector.
3. Traverse each node in the map, using the Euclidean distance formula to find the similarity between the input vector and the node's weight vector.
4. Track the node that produces the smallest distance (this node is called the Best Matching Unit, or BMU).
5. Update the nodes in the neighborhood of the BMU by pulling them closer to the input vector: Wv(t + 1) = Wv(t) + Θ(v, t) α(t) (D(t) - Wv(t)).
There are two ways to interpret a SOM. Because in the training phase the weights of the whole neighborhood are moved in the same direction, similar items tend to excite adjacent neurons. Therefore, the SOM forms a semantic map where similar samples are mapped close together and dissimilar ones apart. The other way to perceive the neuronal weights is to think of them as pointers to the input space. They form a discrete approximation of the distribution of training samples.
More neurons point to regions with a high concentration of training samples and fewer to regions where the samples are scarce. With MATLAB's Neural Network Toolbox we can create and use a SOM in a few simple steps. The architecture of this SOM is shown in Figure 1 below.

Figure 1. The SOM Network
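The following is a minimal MATLAB sketch of the competitive training loop described above (BMU search plus the neighborhood update Wv(t+1) = Wv(t) + Θ(v,t) α(t) (D(t) - Wv(t))); the grid size, the exponentially decaying learning rate, the Gaussian neighborhood width and the function name som_sketch are illustrative assumptions rather than the exact settings used in the paper.

function W = som_sketch(X, gridRows, gridCols, nEpochs)
% Minimal SOM sketch. X: N-by-d data; the map is a gridRows-by-gridCols lattice.
    [N, d] = size(X);
    M = gridRows * gridCols;
    W = rand(M, d);                              % small random initial weights
    [gx, gy] = meshgrid(1:gridCols, 1:gridRows); % lattice coordinates of each neuron
    pos = [gx(:), gy(:)];
    for t = 1:nEpochs
        alpha = 0.5 * exp(-t / nEpochs);                          % decreasing learning coefficient
        sigma = max(gridRows, gridCols) / 2 * exp(-t / nEpochs);  % shrinking neighborhood width
        for n = 1:N
            x = X(n, :);
            % find the Best Matching Unit (smallest Euclidean distance)
            dist2 = sum((W - repmat(x, M, 1)).^2, 2);
            [bestDist, bmu] = min(dist2);
            % Gaussian neighborhood on the lattice, centred at the BMU
            latDist2 = sum((pos - repmat(pos(bmu, :), M, 1)).^2, 2);
            theta = exp(-latDist2 / (2 * sigma^2));
            % pull every neuron towards the input, weighted by theta and alpha
            W = W + repmat(alpha * theta, 1, d) .* (repmat(x, M, 1) - W);
        end
    end
end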

II) The ART
The basic ART system is an unsupervised learning model. It typically consists of a comparison field and a recognition field composed of neurons, a vigilance parameter, and a reset module. The vigilance parameter has considerable influence on the system: higher vigilance produces highly detailed memories (many fine-grained categories), while lower vigilance results in more general memories (fewer, more general categories). The comparison field takes an input vector (a one-dimensional array of values) and transfers it to its best match in the recognition field. Its best match is the single neuron whose set of weights (weight vector) most closely matches the input vector.

Figure 2. The ART System

Each recognition field neuron outputs a negative signal (proportional to that neuron's quality of match to the input vector) to each of the other recognition field neurons and inhibits their output accordingly. In this way the recognition field exhibits lateral inhibition, allowing each neuron in it to represent a category to which input vectors are classified. After the input vector is classified, the reset module compares the strength of the recognition match to the vigilance parameter. If the vigilance threshold is met, training commences. Otherwise, if the match level does not meet the vigilance parameter, the firing recognition neuron is inhibited until a new input vector is applied; training commences only upon completion of a search procedure. In the search procedure, recognition neurons are disabled one by one by the reset function until the vigilance parameter is satisfied by a recognition match. If no committed recognition neuron's match meets the vigilance threshold, then an uncommitted neuron is committed and adjusted towards matching the input vector. In Figure 2, F1 and F2 represent the two layers of nodes in the subsystem. Nodes on each layer are fully interconnected to the nodes on the other layer. A plus sign indicates an excitatory connection; a minus sign indicates an inhibitory connection.
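Below is a highly simplified MATLAB sketch, in the spirit of ART1 for binary inputs, of the vigilance-based search and commit procedure described above; the intersection-based match score, the fast-learning weight update, and the function name art_sketch are illustrative assumptions and omit most of the real ART dynamics.

function [labels, W] = art_sketch(X, rho)
% Simplified ART-style clustering for binary row vectors in X; rho is the vigilance.
    [N, d] = size(X);
    W = [];                     % each row is the prototype of a committed category
    labels = zeros(N, 1);
    for n = 1:N
        x = X(n, :);
        committed = size(W, 1);
        tried = false(committed, 1);
        assigned = 0;
        while ~assigned
            % choose the best remaining (not yet reset) category by overlap with the input
            j = 0; best = -1;
            for c = 1:committed
                if ~tried(c)
                    s = sum(min(W(c, :), x));      % size of the intersection |x AND w_c|
                    if s > best
                        best = s; j = c;
                    end
                end
            end
            if j == 0
                W = [W; x];                        % no usable category left: commit a new one
                labels(n) = size(W, 1);
                assigned = 1;
            elseif best / max(sum(x), eps) >= rho
                W(j, :) = min(W(j, :), x);         % vigilance met: fast-learning update
                labels(n) = j;
                assigned = 1;
            else
                tried(j) = true;                   % reset: inhibit this neuron and keep searching
            end
        end
    end
end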

V. IMPLEMENTATION AND EVALUATION
To evaluate the algorithms under consideration, a suitable and standard multivariate data set is needed. A suitable UCI data set called cleveland.data [11][12][15], concerning heart disease diagnosis [13], is used for the evaluation. This data was originally provided by the Cleveland Clinic Foundation. A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable, and each row corresponds to a given member of the data set in question; it lists values for each of the variables, such as the height and weight of an object or values of random numbers. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows. The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values will normally all be of the same kind. However, there may also be missing values, which need to be indicated in some way.
A total of 500 records with 15 medical attributes (factors) were obtained from the heart disease [9][10] database. The records were split equally into two datasets: a training dataset (455 records) and a testing dataset (454 records). To avoid bias, the records for each set were selected randomly. The attribute Diagnosis was identified as the predictable attribute, with value 1 for patients with heart disease and value 0 for patients with no heart disease. The attribute PatientID was used as the key; the rest are input attributes. It is assumed that problems such as missing data, inconsistent data, and duplicate data have all been resolved.

A. Metrics Considered for Evaluation
I. Sensitivity
Sensitivity measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition):
Sensitivity = Number of True Positives / (Number of True Positives + Number of False Negatives)
II. Specificity
Specificity measures the proportion of negatives which are correctly identified (e.g., the percentage of healthy people who are correctly identified as not having the condition):
Specificity = Number of True Negatives / (Number of True Negatives + Number of False Positives)
III. Accuracy
Accuracy of a measurement system is the degree of closeness of measurements of a quantity to its actual (true) value:
Accuracy = (Number of True Positives + True Negatives) / (Number of True Positives + False Positives + False Negatives + True Negatives)
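A short MATLAB sketch of how these three metrics can be computed from vectors of actual and predicted class labels (1 = heart disease, 0 = no heart disease) is given below; the function and variable names are illustrative.

function [sensitivity, specificity, accuracy] = eval_metrics(actual, predicted)
% actual and predicted are vectors of 0/1 class labels (1 = disease present).
    TP = sum(predicted == 1 & actual == 1);   % true positives
    TN = sum(predicted == 0 & actual == 0);   % true negatives
    FP = sum(predicted == 1 & actual == 0);   % false positives
    FN = sum(predicted == 0 & actual == 1);   % false negatives
    sensitivity = TP / (TP + FN);
    specificity = TN / (TN + FP);
    accuracy = (TP + TN) / (TP + TN + FP + FN);
end

The same function can be applied to the test-set labels produced by each of the four algorithms to obtain the values reported in the tables that follow.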
VI. THE RESULTS OF THE EVALUATION
We have successfully implemented all four algorithms under MATLAB 2009 and repeated the experiments with different sets of parameters to obtain optimum performance for all four methods. The following tables show the performance of the algorithms. The clustering and classification with a particular algorithm was carried out several times; the three most frequently occurring best results were tabulated, and their average value is used to evaluate the performance.
I) Performance in Terms of Sensitivity
TABLE I. SENSITIVITY

The ART network was not capable of classifying the data at all. We tried different training parameters; in all cases we arrived at almost the same poor result, so the sensitivity in the case of ART is very poor. This means that almost all the records were classified as negative. In terms of sensitivity, FCM performed best.

Figure 3. The performance in terms of Sensitivity

II) Performance in Terms of Specificity

TABLE II. SPECIFICITY

Figure 4. The performance in terms of Specificity

The specificity in the case of ART seems to be good, but it is not likely to be genuinely good: given its poor sensitivity, almost all the records were classified as negative. In terms of specificity, k-means performed best, while SOM was very low compared with all the other methods.
III) Performance in Terms of Accuracy
SOM is the poorest performing algorithm among all. After several repeated analyses of the clustering and classification algorithms for classifying [14][17] the CAD data, we came to the following conclusions. In the case of the unsupervised machine learning algorithms SOM and ART, as well as the classical k-means and FCM, the classification performance depends very much on the initial guess that is generally made while configuring these methods. For example, in the case of k-means and FCM, the result depends on the initially guessed centroids, leading to somewhat random results. Even though the results were random, the accuracy was not considerably good in almost all the cases. Accuracy is the important collective measure that directly shows the overall classification performance of the algorithms. In terms of accuracy, k-means, FCM and SOM are the only classification algorithms which produced acceptably good results; the performance of ART is not comparable with the other algorithms.

VII. CONCLUSION AND FURTHER ENHANCEMENTS
In this paper, we evaluated four widely used clustering and classification algorithms for the diagnosis of coronary artery disease. The results clearly show that the complex nature of the data set restricts these algorithms from achieving better accuracy. When testing the same algorithms with a synthetic data set following a normal distribution, they provided ideal performance and better accuracy. As far as we evaluated, the only problem identified in obtaining improved accuracy is the selection of training samples, because the performance of these learning methods depends very much on the training and testing samples, and the achieved accuracy is also rather random with respect to those samples. Our future work will address supervised learning methods, which we hope will solve these problems by improving the training process and making the algorithms provide consistent performance in terms of accuracy. This evaluation can incorporate other medical attributes besides the 15 listed. It can also incorporate other data mining techniques, e.g., Time Series, Clustering and Association Rules.

ACKNOWLEDGEMENT
The authors are thankful to Prof. R. Alagesan MD, DM (Card), Professor and Head, Department of Cardiology (R), Institute of Cardiology, MMC, Chennai-600 003, Tamil Nadu, for his valuable suggestions.

REFERENCES

[1] P. A. Bath, Data mining in health and medical information, Annu. Rev. Inform. Sci. Technol., vol. 38, pp. 331-369, 2004.
[2] R. Gaikwad and J. Warren, The role of home-based information and communications interventions in chronic disease management: A systematic literature review, Health Inform. J., vol. 15, no. 2, pp. 122-146, 2009.
[3] J. Gonseth, P. Guallar-Castillon, J. R. Banegas, and F. Rodriguez-Artalejo, The effectiveness of disease management programmes in reducing hospital re-admission in older patients with heart failure: A systematic review and meta-analysis of published reports, Eur. Heart J., vol. 25, no. 18, pp. 1570-1595, 2004.
[4] S. Koch, Home telehealth: Current state and future trends, Int. J. Med. Inform., vol. 75, no. 8, pp. 565-576, 2006.
[5] G. NaliniPriya and P. AnandhaKumar, Neural Network Based Efficient Knowledge Discovery in Hospital Databases Using RFID Technology, Proceedings of the International IEEE Conference TENCON, November 2008.
[6] G. NaliniPriya and P. AnandhaKumar, Efficient Knowledge Discovery in Smart Environments: A Trend Analysis, Proceedings of the National Conference Tech 2011, February 2011.
[7] G. NaliniPriya, A. Kannan, and P. AnandhaKumar, Dynamic Context Adaptation for Diagnosing the Heart Disease in Healthcare Environment Using Optimized Rough Set Approach, International Journal on Soft Computing (IJSC), Vol. 3, No. 2, pp. 23-33, May 2012.

[8] L. Pecchia, U. Bracale, and M. Bracale, Health technology assessment of home monitoring for the continuity of care of patients suffering from heart prediction, Inform. Assoc., vol. 14, no. 3, pp. 269-277, 2007.
[9] C. S. Pattichis, C. N. Schizas, et al., Introduction to the special section on computational intelligence in medical systems, IEEE Trans. Inform. Technol. Biomed., vol. 13, no. 5, pp. 667-672, Sep. 2009.
[10] S. G. Mougiakakou, I. K. Valavanis, et al., DIAGNOSIS: A telematics-enabled system for medical image archiving, management, and diagnosis assistance, IEEE Trans. Instrum. Meas., vol. 58, no. 7, pp. 2113-2120, Jul. 2009.
[11] A. Martinez, E. Everss, et al., A systematic review of the literature on home monitoring for patients with heart failure, J. Telemed. Telecare, vol. 12, no. 5, pp. 234-241, 2006.
[12] Turker Ince, Serkan Kiranyaz, Jenni Pulkkinen, and Moncef Gabbouj, Evaluation of global and local training techniques over feed-forward neural network architecture spaces for computer-aided medical diagnosis, Elsevier Expert Systems with Applications, Volume 37, Issue 12, December 2010.
[13] Resul Das, Ibrahim Turkoglu, and Abdulkadir Sengur, Effective diagnosis of heart disease through neural networks ensembles, Elsevier Expert Systems with Applications, Volume 36, Number 4, May 2009.
[14] N. A. Setiawan, P. A. Venkatachalam, and Ahmad Fadzil M. H., Rule Selection for Coronary Artery Disease Diagnosis Based on Rough Set, International Journal of Recent Trends in Engineering, Vol. 2, No. 5, November 2009, Academy Publisher, ISSN: 1797-9617.
[15] Sumit Bhatia, Praveen Prakash, and G. N. Pillai, SVM Based Decision Support System for Heart Disease Classification with Integer-Coded Genetic Algorithm to Select Critical Features, Proceedings of the World Congress on Engineering and Computer Science, WCECS 2008, ISBN: 978-988-98671-0-2.
[16] Robert Detrano, M.D., Ph.D., The Cleveland Data, V.A. Medical Center, Long Beach and Cleveland Clinic Foundation.
[17] Noor Akhmad Setiawan, P. A. Venkatachalam, and Ahmad Fadzil M. Hani, Diagnosis of Coronary Artery Disease Using Artificial Intelligence Based Decision Support System, Proceedings of the International Conference on Man-Machine Systems (ICoMMS), Malaysia, October 2009.


