Effect of Data Size On Feature Set Using Classification in Health Domain

IDL - International Digital Library Of
Technology & Research

Volume 1, Issue 5, May 2017 Available at: www.dbpublications.org
International e-Journal For Technology And Research-2017
Uttham H 1*, Gowramma 2
1 PG-Student, 2 Associate Professor,
Dept. Computer Science & Engineering, D.B.I.T, Bangalore, Karnataka, India.
1* utthamhmanju@gmail.com, 2 gowramma@gmail.com
ABSTRACT:
In the health domain, a major critical issue is the prediction of disease at an early stage. Prediction of disease is based mainly on the experience of the physician, so many machine learning approaches have been applied to disease prediction. Existing approaches concentrate on either prediction or feature selection. The aim of this paper is to present the effect of data size and feature set on the prediction of disease in the health domain using Naive Bayes. It shows how each attribute, or combination of attributes, behaves on different sizes of the dataset.
Keywords: Machine Learning, Classification, Naive Bayes, feature selection.
1. INTRODUCTION
In the health domain, diagnosing disease is a very challenging task. Early prediction can be made on the basis of lab tests: using the lab test report, the physician decides whether the patient has the disease or not. Prediction by a physician, however, depends mainly on experience. A physician with more experience may predict well, while a physician with less experience may predict wrongly. To overcome this problem, machine learning offers many approaches, such as KNN, SVM and ANN, to predict correctly. Machine learning is a branch of science that allows a machine to make decisions. To make decisions, a machine has to learn by itself or from experience. There are three types of learning: supervised learning, unsupervised learning and reinforcement learning.

The aim of this study is to find the effect on performance of different feature sets, using WEKA, on different sizes of the Pima Indian Diabetes Dataset. A critical challenge in medical science is to reach the correct diagnosis. For a correct diagnosis, many tests are generally done, and all of these test procedures are said to be necessary in order to reach the ultimate diagnosis. On the other hand, too many tests could complicate the main diagnosis process and make the results difficult to obtain. This kind of difficulty could be resolved with the aid of machine learning, which can be used directly to obtain the result with the aid of several classification techniques.

Machine learning covers such a broad range of processes that it is difficult to define precisely. A dictionary definition includes phrases such as "to gain knowledge, or understanding of, or skill in, by study, instruction, or experience" and "modification of a behavioural tendency by experience". Zoologists and psychologists study learning in animals and humans [1]. The extraction of important information from a large pile of data, and of its correlations, is often the advantage of using machine learning. Humans are constantly discovering new knowledge about tasks. There is a constant stream of new events in the world, and continuing redesign of Artificial Intelligence systems to conform to new knowledge is impractical, but machine learning methods might be able to track much of it [1]. A substantial amount of research has been done with machine learning algorithms such as Bayes network, Multilayer Perceptron, decision trees with pruning like J48graft and C4.5, single conjunctive rule learners like FLR and JRip, Fuzzy Inference System and Adaptive Neuro-Fuzzy Inference System.

2. RELATED WORK
A good number of studies have been reported in the literature on the diagnosis of different diseases. Sapna and Tamilarasi [2]
proposed a technique based on diabetic neuropathy. This nerve disorder is caused by diabetes mellitus. Long-term diabetic patients can develop diabetic neuropathies very easily. There is a fifty percent (50%) probability of having such diseases, which affect many nerves of the body. For example, the nerves of the body wall and limbs (called somatic nerves) could be affected. On the other hand, the nerves of internal organs like the heart, stomach, etc. are known as autonomic nerves. In that paper, the risk factors and symptoms of diabetic neuropathy are used to form the fuzzy relation equation. The fuzzy relation equation is linked with the concept of composition of binary relations; that is, they used a Multilayer Perceptron NN with a Fuzzy Inference System.

Carnimeo and Giaquinto [6] proposed automatic detection of diabetic symptoms in retinal images by using a multilevel perceptron neural network. The network is trained using algorithms for evaluating the optimal global threshold, which can minimize pixel classification errors. System performance is evaluated by means of an adequate index that provides a percentage measure in the detection of suspect eye regions based on a neuro-fuzzy subsystem.

Radha and Rajagopalan [4] introduced an application of fuzzy logic to the diagnosis of diabetes. It describes the fuzzy sets and linguistic variables that contribute to the diagnosis of disease, particularly diabetes. Fuzzy logic is a computational paradigm that provides a mathematical tool for dealing with uncertainty. The paper also presents a computer-based fuzzy logic system with maximum and minimum relationships and membership values consisting of the components specifying the fuzzy set framework. Data from forty patients were collected to strengthen this relationship.

Ensan, Yaghmaee and Bagheri [7] proposed a fuzzy clustering technique (FACT) which determines the appropriate number of clusters based on the pattern essence. Different experiments for algorithm evaluation were performed, which showed better performance compared to the widely used K-means clustering algorithm. Data
was taken from the UCI Machine Learning Repository [3].

3. DATA SET DESCRIPTION
The characteristics of the data set used in this research are summarized in Table 1. The detailed description of the data set, which contains 768 instances, is available at the UCI repository [3].

Dataset                      Pima Indian diabetes
Number of examples           768
Input attributes             8
Output classes               2
Total number of attributes   9
Missing attributes status    No
Noisy attribute status       No

Table 1. Characteristics of data sets

Sl number   Attributes
0           Number of times pregnant
1           Plasma glucose concentration at 2 hours in an oral glucose tolerance test
2           Diastolic blood pressure (mm Hg)
3           Triceps skin fold thickness (mm)
4           2-hour serum insulin (mu U/ml)
5           Body mass index (weight in kg/(height in m)^2)
6           Diabetes pedigree function
7           Age (years)
8           Class variable (0 or 1)

4. METHODOLOGY
In this paper, we use a machine learning technique, Naive Bayes classification, to classify the diabetes data.

4.1. Naive Bayes
The Naive Bayes [5] classifier provides a simple approach, with clear semantics, to representing and learning probabilistic knowledge. It is termed naive because it relies on two important simplifying assumptions: it assumes that the predictive attributes are conditionally independent given the class, and it assumes that no hidden or latent attributes influence the prediction process.

The Naive Bayes classifier is a simple supervised probabilistic classifier based on Bayes' theorem:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)} \tag{1}$$

$$P(c \mid x) \propto P(x_1 \mid c)\,P(x_2 \mid c)\cdots P(x_n \mid c)\,P(c) \tag{2}$$

where x = (x_1, ..., x_n) are the predictors in the chosen feature set, P(c|x) is the posterior probability of the class (high-risk or low-risk) given the predictors, calculated as in (2) up to the normalizing constant P(x), P(c) is the prior probability of the class, P(x|c) is the likelihood, which is the probability of the predictors given the class, and P(x) is the prior probability of the predictors.

5. PERFORMANCE METRICS
We measure the performance of the classifier with respect to different performance metrics: precision, recall and F-measure. Each is calculated with respect to a particular class.

Precision (p) measures correctness. It is defined as

$$p = \frac{\text{correctly classified positives}}{\text{total predicted as positive}}$$

Recall (r) measures completeness. It is defined as

$$r = \frac{\text{correctly classified positives}}{\text{total positives}}$$

F-measure (F) is the harmonic mean of precision and recall. It is defined as

$$F = \frac{2\,r\,p}{r + p}$$

6. EXPERIMENTAL WORK
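As a concrete illustration of equations (1) and (2) in Section 4.1, the following Python sketch fits and applies a Naive Bayes classifier. It is not the WEKA implementation used in the experiment. It assumes Gaussian likelihoods for the numeric attributes, the function names are hypothetical, and the two-attribute toy data is made up for the example.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate P(c) and a per-attribute mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, var + 1e-9))  # small constant avoids zero variance
        model[label] = (prior, stats)
    return model


def posterior(model, x):
    """Return P(c | x) for every class c, following equations (1) and (2)."""
    log_joint = {}
    for label, (prior, stats) in model.items():
        total = math.log(prior)  # log P(c)
        for value, (mean, var) in zip(x, stats):
            # log of the Gaussian density used as P(x_i | c) (sketch assumption)
            total += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        log_joint[label] = total
    # Dividing by the sum of the joint terms plays the role of P(x) in (1).
    top = max(log_joint.values())
    unnormalized = {c: math.exp(v - top) for c, v in log_joint.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}


def predict(model, x):
    """Pick the class with the largest posterior probability."""
    post = posterior(model, x)
    return max(post, key=post.get)


# Made-up toy data: two attributes, class 0 or 1.
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(X, y)
```

Working in log space keeps the product in (2) from underflowing when many attributes are multiplied together; the normalization at the end restores probabilities that sum to one.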


This experiment was done with the help of open source tools in a Windows environment using the Eclipse software. We used the Java code and libraries available in WEKA. The experiment follows the procedure below.

We divide our data set into training sets and testing sets to apply supervised learning. We use the Naive Bayes classifier to explore our data set, primarily because previous work has shown that this algorithm presents a good trade-off between simplicity and accuracy. Patients are classified into one of two classes: (i) diabetic or (ii) non-diabetic. We use 10-fold cross validation in training and then apply the model to our testing set.

Consider 100% of the data, that is, all instances. Then, for each possible subset of features (for example, if there are 8 attributes then there are 2^8 - 1 = 255 possible subsets), apply 10-fold cross validation for building the model and note down the precision, recall and F-score values. Repeat the experiment for 90%, 80%, 70%, 60% and 50% of the data. By conducting this experiment we learn how each feature, or combination of features, acts on different sizes of data.

7. RESULT ANALYSIS AND DISCUSSION
In this paper, we examine the effect of data size on the feature set using the Naive Bayes classifier.

For each attribute set, for example if there are 8 attributes, 2^8 - 1 = 256 - 1 = 255 subsets are possible. A graph is generated for each subset, which shows the performance of that subset on different data sizes.

[Figure: effect of features (0,1,2,4) on different data sizes]

[Figure: effect of attribute subset (2,4,5,6) on different data sizes]
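The procedure of Section 6 can be sketched in Python as an enumeration of the runs to perform, together with the three metrics recorded for each run. This is an illustration under stated assumptions, not the Java/WEKA code used in the experiment: the function names are hypothetical, the instance count for each percentage is rounded down, and the folds are plain contiguous blocks, whereas the paper does not say how the reduced data sets or the folds were drawn.

```python
from itertools import combinations

N_ATTRIBUTES = 8                      # input attributes 0..7
N_INSTANCES = 768                     # size of the full data set
PERCENTAGES = [100, 90, 80, 70, 60, 50]


def feature_subsets(n):
    """Yield every non-empty subset of the attribute indices 0..n-1."""
    for size in range(1, n + 1):
        yield from combinations(range(n), size)


def kfold_indices(n, k=10):
    """Yield (train, test) index lists for k contiguous folds (assumption)."""
    bounds = [i * n // k for i in range(k + 1)]
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        train = list(range(0, bounds[i])) + list(range(bounds[i + 1], n))
        yield train, test


def precision_recall_f(actual, predicted, positive=1):
    """Compute precision, recall and F-measure for one class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == positive and p == positive)
    predicted_pos = sum(1 for p in predicted if p == positive)
    actual_pos = sum(1 for a in actual if a == positive)
    p = tp / predicted_pos if predicted_pos else 0.0
    r = tp / actual_pos if actual_pos else 0.0
    f = 2 * r * p / (r + p) if (r + p) else 0.0
    return p, r, f


subsets = list(feature_subsets(N_ATTRIBUTES))
# One run per (data size, feature subset) pair; sizes are rounded down.
runs = [(pct, N_INSTANCES * pct // 100, subset)
        for pct in PERCENTAGES
        for subset in subsets]
```

Each entry of `runs` would then be evaluated by training and testing a classifier on the folds from `kfold_indices` and passing the pooled predictions to `precision_recall_f`. With 255 subsets and six data sizes this gives 1530 runs.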

8. CONCLUSION AND FUTURE WORK
The objective of this study is to evaluate the effect of data size on the feature set and to investigate the performance of the Naive Bayes algorithm using WEKA. The experiment shows how each attribute, or combination of attributes, affects prediction performance on different data sizes, that is, for each possible subset of attributes.

As future work, the same experiment can be conducted on different data sets, for example a heart attack dataset and a diabetes dataset. From these experiments, the attributes that commonly affect prediction can be combined. Different classification algorithms can also be used.

9. REFERENCES
[1] N. J. Nilsson, Introduction to Machine Learning, 2010. http://ai.stanford.edu/~nilsson/mlbook.html
[2] M. S. Sapna and D. A. Tamilarasi, Fuzzy Relational Equation in Preventing Neuropathy Diabetic, International Journal of Recent Trends in Engineering, Vol. 2, No. 4, 2009, p. 126.
[3] UCI Machine Learning Repository. http://www.ics.uci.edu/mlearn/MLRepository.html
[4] R. Radha and S. P. Rajagopalan, Fuzzy Logic Approach for Diagnosis of Diabetes, Information Technology Journal, Vol. 6, No. 1, pp. 96-102. doi:10.3923/itj.2007.96.102
[5] G. H. John and P. Langley, Estimating Continuous Distributions in Bayesian Classifiers, Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, San Francisco, 1995, pp. 338-345.
[6] L. Carnimeo and A. Giaquinto, An Intelligent System for Improving Detection of Diabetic Symptoms in Retinal Images, IEEE International Conference on Information Technology in Biomedicine, Ioannina, 26-28 October 2006.
[7] F. Ensan, M. H. Yaghmaee and E. Bagheri, Fact: A New Fuzzy Adaptive Clustering Technique, The 11th IEEE Symposium on Computers and Communications, Sardinia, 26-29 June 2006, pp. 442-447. doi:10.1109/ISCC.2006.73
