Abstract—Data mining has recently emerged as an important field that helps extract useful knowledge from large amounts of unstructured and apparently useless data. In health organizations, data mining has high potential for uncovering unknown patterns in datasets and for disease prediction. Very little work has been done on cardiovascular patients in Pakistan. In this study, using the classification approach of machine learning, we propose a framework to classify unstructured data of cardiac patients of the Armed Forces Institute of Cardiology (AFIC), Pakistan, into four important classes. The focus of this study is to structure the unstructured medical data/reports manually, as no structured database was available for the data under study. Multinomial Logistic Regression (LR) is used to perform multi-class classification, and 10-fold cross-validation is used to validate the classification models in order to analyze the results and the performance of the Logistic Regression models. The performance measures used are precision, F-measure, sensitivity, specificity, classification error, area under the curve and accuracy. This study will provide a road map for future research in the field of Bioinformatics in Pakistan.

Keywords—bioinformatics; classification techniques; heart disease in Pakistan; heart disease prediction; multinomial classification; logistic regression

I. INTRODUCTION
Cardiac disease is one of the most serious and deadly health problems. It affects not only older people but also, severely, the younger generation. Significant risk factors for cardiac disorders include excessive cholesterol, high blood pressure, hypertension, smoking and sometimes family history. Various precautionary interventions are available for individuals, involving either prescription or a change in daily routine. Raw data exists in the form of patients' histories and complex reports. All these resources are key to extracting meaningful results for better medical diagnosis. This data can be processed and analyzed to extract valuable information that helps practitioners in decision-making and cost saving. Medicine and Bioinformatics technology has flourished in the past few years and has been applied in the healthcare industry to areas such as treatment effectiveness, customer relationship management, healthcare information management, and fraud and abuse. However, the most significant applications in data processing perhaps involve predictive modeling [1].

Data mining plays a substantial part in applications across domains such as business corporations, e-commerce, the healthcare industry, science and engineering. In the healthcare industry it is used primarily to predict disease. Various diseases such as diabetes, hepatitis and cancer are diagnosed using data mining or machine learning techniques [2]. According to Benko and Wilson, healthcare organizations where data mining is practiced perform better in meeting their long-term needs. Benko and Wilson argue [3]: "In healthcare, data mining is becoming increasingly popular, if not increasingly essential."

To assist medical practitioners, intelligent information systems, knowledge-based systems and prediction systems are being developed. Healthcare organizations hold voluminous patient data in the form of medical reports, patients' histories, electronic test results, etc. [4]. In its present unstructured form, this data is complex, noisy, high-dimensional and discrete [5]. A great deal of useful knowledge is buried in those records. The question, however, is how to mine and transform those unstructured and complex reports into practically useful information that could help doctors draw well-informed medical conclusions.

A. Research Motivation
The main motivation of this research is to propose a framework that extracts hidden, valuable information from unstructured patient records in the form of medical reports and classifies the data into important classes, or patients' impressions, that could assist healthcare experts in making intelligent decisions and in predictive analysis.

B. Objectives and Contribution
This study aims at predictive analysis/classification of cardiac patients by proposing a classification framework for unstructured data. We propose to use a classification
133 | P a g e
www.ijacsa.thesai.org
(IJACSA) International Journal of Advanced Computer Science and Applications,
Vol. 7, No. 2, 2016
framework that emphasizes pre-processing of the unstructured data of a healthcare organization in the form of patients' reports. Hence, the contributions of this research study are as follows:

1) We present a manual approach that extracts attributes (such as age, sex, blood pressure, LVEF value, BMI, defected areas, etc.) from patients' reports in order to classify the patient's condition. This study aims to provide a framework that uses supervised learning techniques of data mining.
2) We use a multinomial class label (namely, fair, moderate, risk and critical).
3) A classification framework is proposed that uses supervised learning techniques to classify unstructured patient data into the four classes mentioned above.
4) We use a well-known machine learning technique, Logistic Regression, which is widely used in the prediction of heart disease.
5) A comparative performance evaluation is presented, based on performance measures that are explained later in Section 3.

II. LITERATURE REVIEW
Cardiac syndrome is a serious and potentially fatal syndrome [6]. However, science and Bioinformatics have developed considerably, and treatment for this disease is possible and available to almost every person. The increase in the number of deaths caused by cardiac disease around the world has focused the attention of medical practitioners and researchers on this serious issue. There is a comprehensive literature on this topic. Various useful medical applications and decision support systems have been developed to aid medical practitioners in better treatment of their patients. These systems predict the likelihood of patients getting heart disease, a heart attack, etc. Data mining has played an important role in this field. Here, patients' historical data is used to develop such systems, in which artificial intelligence and machine learning techniques are widely used. The ongoing research in this field has achieved much success and opened the door to further improvements.

A detailed survey of the literature shows that SVM, Logistic Regression, Neural Networks and Naïve Bayes are the most widely used algorithms for heart disease prediction. To predict the survival of cardiac patients, three prediction models were built on 1000 cases of heart disease patients. Using a binary categorical variable (1 for survival and 0 for non-survival), a 10-fold cross-validation procedure was performed on SVM, ANN and Decision Trees. This gave a less biased prediction, and SVM achieved the highest classification accuracy [6].

Classification techniques are used to identify and prevent cardiovascular disease. The authors of [7] present a comparison of Artificial Neural Network, Decision Tree, RIPPER and SVM techniques, compared on the basis of accuracy measures. The results of this study show that the Support Vector Machine model is the best, giving 84% accuracy. RIPPER, a classification algorithm based on association rules with reduced error pruning, gives 81% accuracy, whereas the Decision Tree gives the lowest accuracy and sensitivity among the classification models compared.

The literature reports various comparative studies of data mining classification algorithms for predicting heart disease. One such comparative analysis was carried out using the Cleveland cardiovascular disease dataset from the UCI repository, with 13 attributes and 303 instances [8]. On this dataset three classification models were developed, namely Sequential Minimal Optimization (SMO), Logistic Function and Multilayer Perceptron Function. The accuracy of these classification techniques was determined through the kappa statistic, ROC, true positive rate and F-measure. All these accuracy measures show that logistic regression gives better results than the other techniques. The error rates calculated show that the Logistic Function algorithm performs much better than the other two classification algorithms. The true positive rate and ROC area reach their maximum with the Logistic Function algorithm. The kappa statistic and F-measure are likewise better for the Logistic Function than for SMO and Multilayer Perceptron.

Using physiological measuring devices such as Point-of-Care (PoC) devices, mobile gateways and a monitoring server, a remote cardiac monitoring system was designed to provide preventive care services to cardiac patients [9]. By calculating the information gain of features, highly related feature subsets were selected and an SVM classifier was applied to them. The proposed prediction algorithm gives 87.5% accuracy. F. Imran Kurt et al. use a real data set from the VA Medical Center in Long Beach, California, and compare the performance of logistic regression, decision tree and neural networks for predicting coronary artery disease [15]. Lift charts and error rates were used to compare the performance of these classification models. Neural Networks yield the best results in predicting coronary artery disease, giving the highest accuracy when classifying the data. Logistic Regression was found to be the second most accurate classifier, whereas decision trees show the lowest accuracy and the highest error rate.

A prototype intelligent heart disease prediction system (IHDPS) [10] was developed using Naïve Bayes, Decision Trees and Neural Network. This system is capable of mining unknown patterns and associations related to cardiac disease. IHDPS assists medical practitioners in making intelligent decisions, as it can respond to simple as well as complex "what if" queries. Further, it supports effective and inexpensive treatment, enhances visualization and develops better understanding. IHDPS used the CRISP-DM methodology to build three models (Naïve Bayes, Decision Trees and Neural Network), and the Data Mining Extensions (DMX) language is used to create, train, predict and access model content. Classification Matrix and Lift Chart methods are used to check which model gives the highest percentage of correct predictions. In that study, five mining goals were set and assessed with respect to the three trained models. These are:

• Predict those patients who are likely to get heart disease based on their medical profile.
• Find out the important influences and relationships between medical inputs and medical attributes related to the predictable state.
• Find out heart patient characteristics.
• Define attribute values that discriminate nodes favoring and disfavoring the predictable conditions.

Naïve Bayes appears most effective, answering four of the five goals and identifying all important medical predictors; Decision Trees answered three, and its results are easier to interpret; Neural Network answered two, and its correlations between attributes are hard to interpret. The intelligent heart disease prediction system is constructed using 15 features on categorical sample data of 909 patients. The authors suggested that the system could be extended with more features and with techniques such as association rules, clustering and time series.
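The classification matrix used to compare the IHDPS models, and the per-class measures named in the abstract (precision, sensitivity, specificity, F-measure, accuracy), can be illustrated with a short sketch. The function names and the toy predictions below are hypothetical and chosen only for illustration; the one-vs-rest treatment of each class is an assumption, since this excerpt does not say how the measures were computed for the four classes.

```python
CLASSES = ["fair", "moderate", "risk", "critical"]  # class labels named in the paper


def confusion_matrix(actual, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def per_class_measures(matrix, k):
    """One-vs-rest measures for class k (assumed convention, see lead-in)."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                 # actual k, predicted as another class
    fp = sum(row[k] for row in matrix) - tp  # other classes predicted as k
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


def accuracy(matrix):
    """Share of all cases that lie on the diagonal of the matrix."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


# Hypothetical labels and predictions, for illustration only.
actual = ["fair", "fair", "moderate", "moderate",
          "risk", "risk", "critical", "critical"]
predicted = ["fair", "moderate", "moderate", "moderate",
             "risk", "critical", "critical", "critical"]

matrix = confusion_matrix(actual, predicted, CLASSES)
critical = per_class_measures(matrix, CLASSES.index("critical"))
print("accuracy:", accuracy(matrix))
print("critical:", critical)
```

Reporting the measures per class, rather than accuracy alone, shows whether a model confuses neighbouring classes such as risk and critical.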
TABLE I. COMPARATIVE ANALYSIS OF CLASSIFICATION TECHNIQUES USED IN LITERATURE
Ref # | Data Set (Real/Artificial) | Classification Type (Categorical/Multinomial) | Tools | Techniques (ANN, LR, DT, NB, SVM, Other) | Results
7 | Artificial | Binomial categorical | WEKA | | SVM 84% accuracy
8 | Artificial | Binomial categorical | WEKA | | LR is considered to be best
The literature discussed above reveals that artificial neural networks (ANN) and decision trees are the most widely used classification algorithms for categorical data, and that logistic regression is also widely used for prediction. There are several intelligent heart disease prediction systems, which use different approaches and propose various models implementing Naïve Bayes and ANN.

In countries like China, India and Malaysia, and in some European countries, much work has been done in medical data mining. However, little research has been done in Pakistan on building a framework specifically for cardiac data mining based on real data obtained from a renowned cardiac hospital. In addition, there is a need for a framework that unifies the data mining tasks, from data preparation to data visualization and the discovery of knowledge. In this work, the analysis is based on the results of machine learning techniques such as clustering, correlation and logistic regression, for better and more complete visualization of results.
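The method named in the abstract, multinomial logistic regression validated with 10-fold cross-validation over four classes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data is synthetic (two centred numeric features per patient), the model is fitted by plain batch gradient descent, and the function names, learning rate and epoch count are hypothetical choices.

```python
import math
import random

CLASSES = ["fair", "moderate", "risk", "critical"]  # class labels named in the paper


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(rows, labels, n_classes, epochs=200, lr=0.1):
    """Fit multinomial logistic regression by batch gradient descent."""
    n_features = len(rows[0])
    # One weight vector per class; the last entry is the bias term.
    weights = [[0.0] * (n_features + 1) for _ in range(n_classes)]
    for _ in range(epochs):
        grads = [[0.0] * (n_features + 1) for _ in range(n_classes)]
        for x, y in zip(rows, labels):
            xb = x + [1.0]
            probs = softmax([sum(w * v for w, v in zip(wk, xb)) for wk in weights])
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == y else 0.0)
                for j, v in enumerate(xb):
                    grads[k][j] += err * v
        for k in range(n_classes):
            for j in range(n_features + 1):
                weights[k][j] -= lr * grads[k][j] / len(rows)
    return weights


def predict(weights, x):
    """Return the index of the class with the highest score."""
    xb = x + [1.0]
    scores = [sum(w * v for w, v in zip(wk, xb)) for wk in weights]
    return scores.index(max(scores))


def cross_validate(rows, labels, n_classes, folds=10, seed=0):
    """Return the accuracy of each fold under k-fold cross-validation."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    accuracies = []
    for f in range(folds):
        test_idx = order[f::folds]           # every folds-th shuffled case
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        w = train([rows[i] for i in train_idx],
                  [labels[i] for i in train_idx], n_classes)
        hits = sum(predict(w, rows[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies


def make_synthetic(n_per_class=30, seed=1):
    """Hypothetical stand-in data: one Gaussian cluster per class."""
    rng = random.Random(seed)
    centres = [(-1.5, -1.5), (1.5, -1.5), (-1.5, 1.5), (1.5, 1.5)]
    rows, labels = [], []
    for k, (cx, cy) in enumerate(centres):
        for _ in range(n_per_class):
            rows.append([rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)])
            labels.append(k)
    return rows, labels


rows, labels = make_synthetic()
fold_accuracies = cross_validate(rows, labels, len(CLASSES))
mean_accuracy = sum(fold_accuracies) / len(fold_accuracies)
print(f"mean 10-fold accuracy: {mean_accuracy:.2f}")
```

Each case is held out exactly once, so the mean over the ten folds estimates how the model would perform on unseen patients. In practice a library implementation would normally replace the hand-written training loop.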
Fig. 3. Useful Attributes Selected through Attribute Selection Technique
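This excerpt does not state which attribute selection technique produced the attributes in Fig. 3. As one possibility, the sketch below ranks categorical attributes by information gain, the criterion mentioned in the literature review for [9]. The attribute names, records and labels are hypothetical toy values, not data from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy after splitting on one categorical attribute."""
    total = len(labels)
    remainder = 0.0
    for value in set(values):
        subset = [lab for v, lab in zip(values, labels) if v == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical toy records; attribute names and values are illustrative only.
records = [
    {"smoker": "yes", "bp": "high", "sex": "m", "label": "critical"},
    {"smoker": "yes", "bp": "high", "sex": "f", "label": "critical"},
    {"smoker": "no", "bp": "high", "sex": "m", "label": "risk"},
    {"smoker": "no", "bp": "high", "sex": "f", "label": "risk"},
    {"smoker": "yes", "bp": "normal", "sex": "m", "label": "moderate"},
    {"smoker": "yes", "bp": "normal", "sex": "f", "label": "moderate"},
    {"smoker": "no", "bp": "normal", "sex": "m", "label": "fair"},
    {"smoker": "no", "bp": "normal", "sex": "f", "label": "fair"},
]
attributes = ("smoker", "bp", "sex")

labels = [r["label"] for r in records]
gains = {a: information_gain([r[a] for r in records], labels) for a in attributes}
ranking = sorted(gains, key=gains.get, reverse=True)  # most informative first

for name in ranking:
    print(f"{name}: {gains[name]:.2f} bits")
```

Attributes with a gain near zero, such as `sex` in this toy data, carry little information about the class and are candidates for removal before classification.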
The study on the whole gives new directions in the field of biomedical research in Pakistan.

The classification and predictive analysis of such data is of utmost importance nowadays. Future work of this research can include the following:

• These results can be improved by improving the classification/prediction model. In the future, the results could be improved by applying a "bootstrapping" technique that would balance the data and thus give better results.
• Other important classification models such as SVM and Artificial Neural Networks could be used to achieve high accuracy.
• The main future concern could be to design an inference engine for cardiac data that would assist practitioners in making better decisions.
• The results of the study could be further improved by investigating other algorithms and by improving the data pre-processing techniques.
• Text mining techniques can be applied to mine the huge volume of unstructured data in hospitals.
• The same data set can be used to explore the causes, solutions and precautionary measures for specific types of complaints, diseases and problems occurring in specific types of patients.

ACKNOWLEDGMENT
We would like to thank and acknowledge the support and guidance of our supervisors. We are also thankful to the Armed Forces Institute of Cardiology (AFIC), Pakistan, for providing us the data set of heart patients and helping us to understand the research problem.

REFERENCES
[1] Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of Healthcare Information Management, 19(2), 65.
[2] Parpinelli, R. S., Lopes, H. S., & Freitas, A. A. (2001, July). An ant colony based system for data mining: applications to medical data. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001) (pp. 791-797).
[3] Benko, A., & Wilson, B. (2003). Online decision support gives plans an edge. Managed Healthcare Executive, 13(5), 20.
[4] AbuKhousa, E., & Campbell, P. (2012, March). Predictive data mining to support clinical decisions: An overview of heart disease prediction systems. In Innovations in Information Technology (IIT), 2012 International Conference on (pp. 267-272). IEEE.
[5] Hartigan, J. A., & Wong, M. A. (1979). Algorithm AS 136: A k-means clustering algorithm. Applied Statistics, 100-108.
[6] Xing, Y., Wang, J., Zhao, Z., & Gao, Y. (2007, November). Combination data mining methods with new medical data to predicting outcome of coronary heart disease. In Convergence Information Technology, 2007. International Conference on (pp. 868-872). IEEE.
[7] Milan Kumari, Sunila Godara, "Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction", IJCST Vol. 2, Issue 2, June
[8] Vijayarani, S., and S. Sudha, "Comparative Analysis of Classification Function Techniques for Heart Disease Prediction", International Journal of Innovative Research in Computer and Communication Engineering, Vol. 1, Issue 3, May 2013.
[9] Kwon, K., Hwang, H., Kang, H., Woo, K. G., & Shim, K. (2013, January). A remote cardiac monitoring system for preventive care. In Consumer Electronics (ICCE), 2013 IEEE International Conference on (pp. 197-200). IEEE.
[10] SH, M. I., & Sanap, S. A. (2013). Intelligent Heart Disease Prediction System Using Data Mining Techniques. International J. of Healthcare & Biomedical Research, 1(3), 94-101.
[11] Avci, E., & Turkoglu, I. (2009). An intelligent diagnosis system based on principle component analysis and ANFIS for the heart valve diseases. Expert Systems with Applications, 36(2), 2873-2878.
[12] Rajeswari, K., Vaithiyanathan, V., & Amirtharaj, P. (2011). Prediction of Risk Score for Heart Disease in India Using Machine Intelligence. In 2011 International Conference on Information and Network Technology, IACSIT Press, Singapore, IPCSIT (Vol. 4).
[13] Parthiban, L., & Subramanian, R. (2008). Intelligent heart disease prediction system using CANFIS and genetic algorithm. International Journal of Biological, Biomedical and Medical Sciences, 3(3).
[14] Dangare, C. S., & Apte, D. S. S. (2012). A data mining approach for prediction of heart disease using neural networks. International Journal of Computer Engineering & Technology (IJCET), 3(3), 30-40.
[15] Kurt, I., Ture, M., & Kurum, A. T. (2008). Comparing performances of logistic regression, classification and regression tree, and neural networks for predicting coronary artery disease. Expert Systems with Applications, 34(1), 366-374.
[16] Dodani, S., Mistry, R., Khwaja, A., Farooqi, M., Qureshi, R., & Kazmi, K. (2004). Prevalence and awareness of risk factors and behaviours of coronary heart disease in an urban population of Karachi, the largest city of Pakistan: a community survey. Journal of Public Health, 26(3), 245-24
[17] Han, J., Kamber, M., & Pei, J. (2006). Data mining: concepts and techniques. Morgan Kaufmann.
[18] Deekshatulu, B. L., & Chandra, P. (2013). Classification of Heart Disease using Artificial Neural Network and Feature Subset Selection. Global Journal of Computer Science and Technology, 13(3).
[19] Ramaswami, M., & Bhaskaran, R. (2009). A study on feature selection techniques in educational data mining. arXiv preprint arXiv:0912.3924.
[20] http://publib.boulder.ibm.com/infocenter/db2luw/v8/index.jsp?topic=/com.ibm.db2.udb.doc/admin/c0006909.htm.
[21] Fatima, M.; Basharat, I.; Khan, S. A.; Anjum, A. R., "Biomedical (cardiac) data mining: Extraction of significant patterns for predicting heart condition," Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on, pp. 1-7, 21-24 May 2014.
[22] http://www.upa.pdx.edu/IOA/newsom/pa551/lectur21.htm (Retrieved on 23 March 2014).
[23] Fürnkranz, J. (2002). Round robin classification. The Journal of Machine Learning Research, 2, 721-747.
[24] http://www.ncsu.edu/labwrite/Experimental%20Design/accuracyprecision.htm (Retrieved on 2nd April, 2014).