Data Mining Models and Evaluation Techniques
Shubham Pachori
12BCE055
Seminar
Submitted in partial fulfillment of the requirements
For the degree of
Bachelor of Technology in
Computer Science and Engineering
CERTIFICATE
This is to certify that the seminar entitled Data Mining Models and Evaluation Techniques submitted
by Shubham Pachori (12BCE055), towards the partial fulfillment of the requirements for the degree
of Bachelor of Technology in Computer Science and Engineering of Nirma University, Ahmedabad,
is the record of work carried out by him under my supervision and guidance. In my opinion, the
submitted work has reached a level required for being accepted for examination. The results embodied
in this seminar, to the best of my knowledge, haven't been submitted to any other university or
institution for the award of any degree or diploma.
Prof. K.P.Agarwal
Associate Professor,
CSE Department,
Institute Of Technology,
Nirma University, Ahmedabad.
Acknowledgements
I am profoundly grateful to Prof. K P AGARWAL for his expert guidance
throughout the project. His continuous encouragement has fetched us golden
results. His elixir of knowledge in the field has made this project achieve its zenith
and credibility.
I would like to express my deepest appreciation towards Prof. SANJAY GARG,
Head of the Department of Computer Engineering, and Prof. ANUJA NAIR, whose
invaluable guidance supported us in completing this project.
At last, I must express my sincere heartfelt gratitude to all the staff members of
the Computer Engineering Department who helped me directly or indirectly during
this course of work.
SHUBHAM PACHORI
12BCE055
Abstract
Databases are rich with hidden information that can be used for intelligent decision making. Classification and prediction are two forms of data analysis that
can be used to extract models describing important data classes or to predict future
data trends. Such analysis can help provide us with a better understanding of the
data at large. Classification models predict categorical (discrete, unordered) labels.
For example, we can build a classification model to categorize bank loan
applications as either safe or risky.
As predictions always have an implicit cost involved, it is important to evaluate a
classifier's generalization performance in order to determine whether to employ the
classifier (for example, when learning the effectiveness of medical treatments from
limited-size data, it is important to estimate the accuracy of the classifier) and
to optimize the classifier (for example, when post-pruning decision trees we must
evaluate the accuracy of the decision tree at each pruning step).
This seminar report gives an in-depth explanation of classifier models (viz.
naive Bayesian and decision trees) and how these classifier models are evaluated
for the accuracy of their predictions. The later part of the report also deals with how
to improve the accuracy of these classifier models, and it includes an exploratory
study comparing the various model evaluation techniques, carried out in Weka (a
GUI-based data mining tool) on representative data sets.
Contents
Certificate
Acknowledgements
Abstract
1 Introduction
2 Classification Using Decision Tree
3 Probabilistic Learning - Naive Bayesian Classification
4 Model Evaluation Techniques
References
Chapter 1
Introduction
The term Knowledge Discovery in Databases, or KDD for short, refers to the broad
process of finding knowledge in data, and emphasizes the high-level application of
particular data mining methods. It is of interest to researchers in machine learning,
pattern recognition, databases, statistics, artificial intelligence, knowledge acquisition
for expert systems, and data visualization. The unifying goal of the KDD process
is to extract knowledge from data in the context of large databases. It does this by
using data mining methods (algorithms) to extract (identify) what is deemed knowledge,
according to the specifications of measures and thresholds, using a database
along with any required preprocessing, subsampling, and transformations of that
database.
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps:
1. Developing an understanding of the application domain.
2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.
3. Data cleaning and preprocessing:
(a) Removal of noise or outliers.
(b) Strategies for handling missing data fields.
(c) Accounting for time sequence information and known changes.
Chapter 2
Classification Using Decision Tree
This chapter introduces one of the most widely used learning methods, which applies
a strategy of dividing data into smaller and smaller portions to identify
patterns that can be used for prediction. The knowledge is then presented in the
form of logical structures that can be understood without any statistical knowledge.
This aspect makes these models particularly useful for business strategy and process
improvement.
1. Understanding Decision Trees
2. Divide and Conquer
3. Unique Identifiers
4. The C5.0 Decision Tree Algorithm
5. Choosing the Best Split
6. Pruning the Decision Tree
2.1 Understanding Decision Trees
As we might intuit from the name itself, decision tree learners build a model in
the form of a tree structure. The model itself comprises a series of logical decisions,
similar to a flowchart, with decision nodes that indicate a decision to be made on
an attribute. These split into branches that indicate the decision's choices. The tree
is terminated by leaf nodes (also known as terminal nodes) that denote the result of
following a combination of decisions.
Data that is to be classified begins at the root node, where it is passed through
the various decisions in the tree according to the values of its features. The path
that the data takes funnels each record into a leaf node, which assigns it a predicted
class.
2.2 Divide and Conquer
Decision trees are built using a heuristic called recursive partitioning. This approach is generally known as divide and conquer because it uses the feature values
to split the data into smaller and smaller subsets of similar classes. Beginning at the
root node, which represents the entire dataset, the algorithm chooses a feature that is
the most predictive of the target class. The examples are then partitioned into groups
of distinct values of this feature; this decision forms the first set of tree branches.
The algorithm continues to divide-and-conquer the nodes, choosing the best candidate feature each time until a stopping criterion is reached. This might occur at a
node if:
1. All (or nearly all) of the examples at the node have the same class
2. There are no remaining features to distinguish among examples
3. The tree has grown to a predefined size limit
To illustrate the tree-building process, let's consider a simple example. Imagine
that we are working for a Hollywood film studio, and our desk is piled high with
screenplays. Rather than read each one cover-to-cover, we decide to develop a
decision tree algorithm to predict whether a potential movie would fall into one of
three categories: mainstream hit, critic's choice, or box office bust. To gather data for
the model, we turn to the studio archives to examine the previous ten years of movie
releases. After reviewing the data for 30 different movie scripts, a pattern emerges.
There seems to be a relationship between the film's proposed shooting budget, the
number of A-list celebrities lined up for starring roles, and the category of success.
A scatter plot of this data might look something like figure 2.1 (Reference [2]).
To build a simple decision tree using this data, we can apply a divide-and-conquer
strategy. Let's first split on the feature indicating the number of celebrities, partitioning
the movies into groups with and without a low number of A-list stars (fig 2.2,
Reference [2]).
Next, among the group of movies with a larger number of celebrities, we can
make another split between movies with and without a high budget (fig 2.3). At this
point we have partitioned the data into three groups. The group at the top-left corner
of the diagram is composed entirely of critically-acclaimed films. This group is
distinguished by a high number of celebrities and a relatively low budget. At the
top-right corner, the majority of movies are box office hits, with high budgets and a
large number of celebrities. The final group, which has little star power but budgets
ranging from small to large, contains the flops.
If we wanted, we could continue to divide the data by splitting it based on increasingly
specific ranges of budget and celebrity counts until each of the incorrectly classified
values resides in its own, perhaps tiny, partition. Since the data can continue
to be split until there are no distinguishing features within a partition, a decision tree
can be prone to overfitting the training data with overly specific decisions.
We'll avoid this by stopping the algorithm here, since more than 80 percent of the
examples in each group are from a single class.
Our model for predicting the future success of movies can be represented in a simple
tree, as shown in fig 2.4 (Reference [2]). To evaluate a script, we follow the branches
through each decision until its success or failure has been predicted. In no time, we
would be able to classify the backlog of scripts and get back to more important work
such as writing an awards acceptance speech. Since real-world data contains more than
two features, decision trees quickly become far more complex than this, with many
more nodes, branches, and leaves. In the next section we will throw some light on a
popular algorithm for building decision tree models automatically.
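The divide-and-conquer procedure described above can be sketched in a few lines of Python. This is an illustrative sketch, not the C5.0 algorithm: the movie data, the split points, and the helper names are hypothetical values modeled on the walkthrough, and the stopping rule is the same "more than 80 percent of one class" criterion used above.

```python
from collections import Counter

# Hypothetical toy data modeled on the movie example:
# (celebrity_count, budget_in_millions, outcome). Values are illustrative only.
movies = [
    (9, 20, "critical success"), (8, 25, "critical success"),
    (7, 15, "critical success"), (8, 90, "mainstream hit"),
    (9, 80, "mainstream hit"), (7, 70, "mainstream hit"),
    (2, 10, "box office bust"), (1, 60, "box office bust"),
    (3, 85, "box office bust"), (2, 40, "box office bust"),
]

def majority_share(rows):
    """Return the most common class in a partition and its fraction of the rows."""
    counts = Counter(label for *_, label in rows)
    label, n = counts.most_common(1)[0]
    return label, n / len(rows)

def grow(rows, splits, depth=0):
    """Recursively divide and conquer; stop once a partition is > 80% one class."""
    label, share = majority_share(rows)
    if share > 0.8 or not splits:
        print("  " * depth + f"leaf -> {label} ({share:.0%} pure)")
        return
    (feature_idx, threshold), rest = splits[0], splits[1:]
    left = [r for r in rows if r[feature_idx] <= threshold]
    right = [r for r in rows if r[feature_idx] > threshold]
    print("  " * depth + f"split on feature {feature_idx} <= {threshold}")
    grow(left, rest, depth + 1)
    grow(right, rest, depth + 1)

# First split on celebrity count, then on budget, as in the walkthrough.
grow(movies, splits=[(0, 4), (1, 50)])
```

Running the sketch first separates the low-star flops, then splits the remaining movies by budget, mirroring figures 2.2 and 2.3.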
2.3 The C5.0 Decision Tree Algorithm
There are numerous implementations of decision trees, but one of the most well
known is the C5.0 algorithm. This algorithm was developed by computer scientist J.
Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an
improvement over his ID3 (Iterative Dichotomiser 3) algorithm.
Strengths of the C5.0 Algorithm
1. An all-purpose classifier that does well on most problems
2. A highly automatic learning process that can handle numeric or nominal features, as well as missing data
3. Uses only the most important features
4. Can be used on data with relatively few training examples or a very large number
5. Results in a model that can be interpreted without a mathematical background (for relatively small trees)
6. More efficient than other complex models
Weaknesses of the C5.0 Algorithm
1. Decision tree models are often biased toward splits on features having a large number of levels
2. It is easy to overfit or underfit the model
3. Can have trouble modeling some relationships due to reliance on axis-parallel splits
4. Small changes in training data can result in large changes to decision logic
5. Large trees can be difficult to interpret and the decisions they make may seem counterintuitive
2.4 Choosing the Best Split
The first challenge that a decision tree will face is to identify which feature to
split upon. In the previous example, we looked for feature values that split the data
in such a way that partitions contained examples primarily of a single class. If the
segments of data contain only a single class, they are considered pure. There are
many different measurements of purity for identifying splitting criteria; C5.0 uses
entropy for measuring purity. The entropy of a sample of data indicates how mixed
the class values are; the minimum value of 0 indicates that the sample is completely
homogeneous, while 1 indicates the maximum amount of disorder (for a two-class
case). The definition of entropy is specified by:
Entropy(S) = Σ_{i=1}^{c} −p_i log2(p_i)    (2.1)
In the entropy formula, for a given segment of data (S), the term c refers to the
number of different class levels, and p_i refers to the proportion of values falling into
class level i. For example, suppose we have a partition of data with two classes: red
(60 percent) and white (40 percent). We can calculate the entropy as:

Entropy(S) = −0.60 log2(0.60) − 0.40 log2(0.40) = 0.9709506    (2.2)
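Equation (2.1) and the worked example (2.2) can be checked directly in Python. This is a minimal sketch; the helper name `entropy` is our own choice, not a name from any particular library.

```python
import math

def entropy(proportions):
    """Shannon entropy of a class distribution, as in equation (2.1)."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

# Two-class partition: 60% red, 40% white -- the example in equation (2.2).
print(entropy([0.60, 0.40]))  # ≈ 0.9709506

# A completely homogeneous sample has the minimum entropy of 0.
print(entropy([1.0]))
```

A 50/50 split gives the maximum two-class entropy of 1, matching the description of 0 as complete homogeneity and 1 as maximum disorder.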
Given this measure of purity, the algorithm must still decide which feature to split
upon. For this, the algorithm uses entropy to calculate the change in homogeneity
resulting from a split on each possible feature. The calculation is known as information
gain. The information gain for a feature F is calculated as the difference
between the entropy in the segment before the split (S1) and the entropy of the
partitions resulting from the split (S2):

InfoGain(F) = Entropy(S1) − Entropy(S2)    (2.3)
The one complication is that after a split, the data is divided into more than one
partition. Therefore, the function to calculate Entropy(S2) needs to consider the
total entropy across all of the partitions. It does this by weighing each partition's
entropy by the proportion of records falling into that partition. This can be stated in
a formula as:

Entropy(S) = Σ_{i=1}^{n} w_i × Entropy(P_i)    (2.4)
In simple terms, the total entropy resulting from a split is the sum of entropy
of each of the n partitions weighted by the proportion of examples falling in that
partition wi . The higher the information gain, the better a feature is at creating
homogeneous groups after a split on that feature. If the information gain is zero,
there is no reduction in entropy for splitting on this feature. On the other hand, the
maximum information gain is equal to the entropy prior to the split. This would
imply the entropy after the split is zero, which means that the decision results in
completely homogeneous groups.
The previous formulae assume nominal features, but decision trees use information
gain for splitting on numeric features as well. A common practice is testing
various splits that divide the values into groups greater than or less than a threshold.
This reduces the numeric feature to a two-level categorical feature, and information
gain can be calculated easily. The numeric threshold yielding the largest information
gain is chosen for the split.
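Equations (2.3) and (2.4), together with the numeric-threshold search just described, can be sketched as follows. The feature values and labels are hypothetical; the sketch simply tries each candidate threshold and keeps the one with the largest gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (equation 2.1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels, threshold):
    """InfoGain = Entropy(S1) - weighted entropy of the two partitions (eq. 2.3/2.4)."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    w_left, w_right = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))

# Hypothetical numeric feature: try each value as a threshold, keep the best.
values = [10, 20, 30, 80, 90, 95]
labels = ["bust", "bust", "bust", "hit", "hit", "hit"]
best = max(values[:-1], key=lambda t: info_gain(values, labels, t))
print(best)  # the threshold 30 separates the classes perfectly
```

A gain equal to the pre-split entropy (here 1.0) means the split yields completely homogeneous partitions, exactly as the text describes for the maximum information gain.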
2.5 Pruning the Decision Tree

One solution to the overfitting problem described earlier is to stop the tree from growing once it reaches a
certain number of decisions, or if the decision nodes contain only a small number of
examples. This is called early stopping, or pre-pruning, the decision tree. As the tree
avoids doing needless work, this is an appealing strategy. However, one downside is
that there is no way to know whether the tree will miss subtle, but important patterns
that it would have learned had it grown to a larger size.
An alternative, called post-pruning, involves growing a tree that is too large, then
using pruning criteria based on the error rates at the nodes to reduce the size of the
tree to a more appropriate level. This is often a more effective approach than
pre-pruning, because it is quite difficult to determine the optimal depth of a decision tree
without growing it first. Pruning the tree later on allows the algorithm to be certain
that all important data structures were discovered.
One of the benefits of the C5.0 algorithm is that it is opinionated about pruning: it
takes care of many of the decisions automatically, using fairly reasonable defaults.
Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the
training data. Later, nodes and branches that have little effect on the classification
errors are removed. In some cases, entire branches are moved further up the tree
or replaced by simpler decisions. These processes of grafting branches are known
as subtree raising and subtree replacement, respectively. Balancing overfitting and
underfitting a decision tree is a bit of an art, but if model accuracy is vital it may
be worth investing some time with various pruning options to see if it improves
performance on the test data. As you will soon see, one of the strengths of the C5.0
algorithm is that it is very easy to adjust the training options.
Chapter 3
Probabilistic Learning - Naive Bayesian
Classification
When a meteorologist provides a weather forecast, precipitation is typically predicted
using terms such as "a 70 percent chance of rain." These forecasts are known
as probability of precipitation reports. Have you ever considered how they are
calculated? It is a puzzling question, because in reality, it will either rain or it will
not. This chapter covers a machine learning algorithm called naive Bayes, which
also uses principles of probability for classification. Just as meteorologists forecast
weather, naive Bayes uses data about prior events to estimate the probability of
future events. For instance, a common application of naive Bayes uses the frequency
of words in past junk email messages to identify new junk mail.
3.1 Basic Concepts of Bayesian Methods
The basic statistical ideas necessary to understand the naive Bayes algorithm have
been around for centuries. The technique descends from the work of the 18th-century
mathematician Thomas Bayes, who developed foundational mathematical principles
(now known as Bayesian methods) for describing the probability of events,
and how probabilities should be revised in light of additional information. Classifiers
based on Bayesian methods utilize training data to calculate an observed probability
of each class based on feature values. When the classifier is later used on
unlabeled data, it uses the observed probabilities to predict the most likely class for
the new features. It's a simple idea, but it results in a method that often has results
on par with more sophisticated algorithms. In fact, Bayesian classifiers have been
used for:
1. Text classification, such as junk email (spam) filtering, author identification,
or topic categorization
3.2 Bayes' Theorem

Bayes' theorem provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X):

P(H|X) = P(X|H) P(H) / P(X)    (3.1)

3.3 The Naive Bayes Algorithm
The naive Bayes (NB) algorithm describes a simple application using Bayes theorem for classification. Although it is not the only machine learning method utilizing
Bayesian methods, it is the most common, particularly for text classification where
it has become the de facto standard. Strengths and weaknesses of this algorithm are
as follows
The algorithm's naive assumption is that features are independent given the class. In
reality, the words in a message body are not independent from one another, since the
appearance of some words is a very good indication that other words are also likely
to appear. A message with the word Viagra is probably likely to also contain the
words prescription or drugs. However, in most cases
when these assumptions are violated, naive Bayes still performs fairly well. This
is true even in extreme circumstances where strong dependencies are found among
the features. Due to the algorithms versatility and accuracy across many types of
conditions, naive Bayes is often a strong first candidate for classification learning
tasks.
3.4

Given a tuple X, the naive Bayesian classifier predicts the class Ci having the highest
posterior probability conditioned on X; that is, it predicts class Ci if and only if

P(Ci|X) > P(Cj|X)  for 1 ≤ j ≤ m, j ≠ i    (3.2)

Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is
called the maximum posteriori hypothesis. By Bayes' theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)    (3.3)
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If
the class prior probabilities are not known, then it is commonly assumed that
the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm), and we would
therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci). Note
that the class prior probabilities may be estimated by P(Ci) = |Ci,D| / |D|, where
|Ci,D| is the number of training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally
expensive to compute P(X|Ci). To reduce computation, the naive assumption of
class-conditional independence is made, which allows P(X|Ci) to be computed as
the product of the individual attribute probabilities:

P(X|Ci) = Π_{k=1}^{n} P(xk|Ci)    (3.4)
5. In order to predict the class label of X, P(X|Ci )P(Ci ) is evaluated for each class
Ci . The classifier predicts that the class label of tuple X is the class Ci if and
only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj)  for 1 ≤ j ≤ m, j ≠ i    (3.5)
In other words, the predicted class label is the class Ci for which P(X|Ci )P(Ci )
is the maximum.
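The classification rule in equations (3.4) and (3.5) can be sketched with plain counting. This is a minimal illustration under stated assumptions: the tiny training set is invented, and no smoothing is applied to zero counts (a real implementation would typically add Laplace smoothing).

```python
from collections import Counter, defaultdict

# Hypothetical training set: (weather, traffic) features with a class label.
train = [
    (("rain", "heavy"), "late"), (("rain", "light"), "late"),
    (("sun", "heavy"), "late"), (("sun", "light"), "on-time"),
    (("sun", "light"), "on-time"), (("rain", "light"), "on-time"),
]

def fit(rows):
    """Estimate P(Ci) and P(xk|Ci) by counting occurrences (no smoothing)."""
    prior = Counter(label for _, label in rows)
    cond = defaultdict(Counter)
    for features, label in rows:
        for k, value in enumerate(features):
            cond[(label, k)][value] += 1
    return prior, cond, len(rows)

def predict(x, prior, cond, n):
    """Pick the Ci maximizing P(Ci) * prod_k P(xk|Ci), as in eq. (3.4)/(3.5)."""
    def score(label):
        p = prior[label] / n
        for k, value in enumerate(x):
            p *= cond[(label, k)][value] / prior[label]
        return p
    return max(prior, key=score)

prior, cond, n = fit(train)
print(predict(("rain", "heavy"), prior, cond, n))  # -> late
```

Because P(X) is constant across classes, it is dropped from the score, exactly as noted in step 3 above.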
Chapter 4
Model Evaluation Techniques
Now that we have explored in depth the two most widely used classifier models, the
question we face is how accurately these classifiers can predict future trends based
on the data used to build them; for instance, how accurately a company's customer
recommender system can predict the future purchasing behavior of a customer based
on the previously recorded sales data of its customers.
Given the significant role these classifiers play, their accuracy becomes of prime
importance to companies, especially those in e-commerce. Thus model evaluation
techniques are employed to evaluate the accuracy of the predictions made by a
classifier model. As different classifier models have varying strengths and weaknesses,
it is necessary to use tests that reveal distinctions among the learners when measuring
how a model will perform on future data. The succeeding sections in this chapter
will primarily focus on the following points:
1. Why predictive accuracy is not sufficient to measure performance, and what
the alternative measures are
2. Methods to ensure that the performance measures reasonably reflect a model's
ability to predict or forecast unseen data
4.1
Prediction Accuracy
The prediction accuracy of a classifier model is defined as the proportion of correct
predictions out of the total number of predictions. This number indicates the percentage
of cases in which the learner is right or wrong. For instance, suppose a classifier
correctly identified whether or not 99,990 out of 100,000 newborn babies are carriers
of a treatable but potentially fatal genetic defect. This would imply an accuracy of
99.99 percent and an error rate of only 0.01 percent.
4.2 The Confusion Matrix
The confusion matrix is a useful tool for analysing how well our classifier can recognize
tuples of different classes. TP and TN tell us when the classifier is getting things
right, while FP and FN tell us when the classifier is getting things wrong. Given m
classes, a confusion matrix is a matrix of at least m by m size. An entry CMi,j in
the first m rows and m columns indicates the number of tuples of class i that were
labeled by the classifier as class j. For a classifier to have good accuracy, ideally
most of the tuples would be represented along the diagonal of the confusion matrix,
from entry CM1,1 to entry CMm,m, with the rest of the entries being zero or close
to zero. That is, ideally, FP and FN are around zero.
Accuracy: The accuracy of a classifier on a given test set is the percentage of test
tuples that are correctly classified by the classifier:

accuracy = (TP + TN) / (P + N)    (4.1)
Error Rate: The error rate, or misclassification rate, of a classifier M is simply
1 − accuracy(M), where accuracy(M) is the accuracy of M:

error rate = (FP + FN) / (P + N)    (4.2)
If we use the training set instead of the test set to estimate the error rate of a model, this
quantity is known as the resubstitution error. This is an optimistic estimate of the
true error rate, because the model is not tested on any samples that it has not already
seen.
The Class Imbalance Problem: This arises in datasets where the main class of interest is
rare; that is, the data set distribution reflects a significant majority of the negative
class and a minority positive class. For example, in fraud detection applications the
class of interest, the fraudulent class, is rare or less frequently occurring in comparison
to the negative, non-fraudulent class. In medical data there may be a rare class, such
as cancer. Suppose that we have trained a classifier to classify medical data tuples,
where the class label attribute is cancer and the possible class values are yes
and no. An accuracy rate of, say, 97% may make the classifier seem quite accurate,
but what if only, say, 3% of the training tuples are actually cancer? Clearly an accuracy
rate of 97% may not be acceptable: the classifier could be correctly labeling
only the non-cancer tuples, for instance, and misclassifying all the cancer tuples.
Instead we need other measures, which assess how well the classifier can recognize
the positive tuples and how well it can recognize the negative tuples.
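The class imbalance problem can be made concrete with a few lines of arithmetic. The counts below are hypothetical, chosen to mirror the rare-disease example: a classifier that simply labels everything negative still scores near-perfect accuracy while catching no positive case at all.

```python
# Hypothetical screening of 100,000 tuples of which only 10 are positive.
# A classifier that predicts "negative" for everyone produces these counts:
TP, FN = 0, 10          # it misses every positive case
TN, FP = 99_990, 0      # and gets every negative case right

accuracy = (TP + TN) / (TP + TN + FP + FN)
sensitivity = TP / (TP + FN)   # how many positives were actually caught

print(f"accuracy    = {accuracy:.4%}")    # looks excellent
print(f"sensitivity = {sensitivity:.0%}")  # reveals the classifier is useless here
```

This is why the measures introduced next, sensitivity and specificity, are reported alongside accuracy on imbalanced data.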
Sensitivity and Specificity: Classification often involves a balance between being overly conservative and overly aggressive in decision making. For example, an
e-mail filter could guarantee to eliminate every spam message by aggressively eliminating nearly every ham message at the same time. On the other hand, a guarantee
that no ham messages will be inadvertently filtered might allow an unacceptable
amount of spam to pass through the filter. This tradeoff is captured by a pair of
measures: sensitivity and specificity.
The sensitivity of a model (also called the true positive rate) measures the proportion
of positive examples that were correctly classified. Therefore, as shown in
the following formula, it is calculated as the number of true positives divided by the
total number of positives in the data: those correctly classified (the true positives) as
well as those incorrectly classified (the false negatives).

sensitivity = TP / (TP + FN)    (4.3)
The specificity of a model (also called the true negative rate) measures the proportion
of negative examples that were correctly classified. As with sensitivity, this
is computed as the number of true negatives divided by the total number of negatives:
the true negatives plus the false positives.

specificity = TN / (TN + FP)    (4.4)
Precision and recall: Closely related to sensitivity and specificity are two other
performance measures, related to compromises made in classification: precision and
recall. Used primarily in the context of information retrieval, these statistics are
intended to provide an indication of how interesting and relevant a model's results
are, or whether the predictions are diluted by meaningless noise.
The precision (also known as the positive predictive value) is defined as the proportion
of positive examples that are truly positive; in other words, when a model
predicts the positive class, how often is it correct? A precise model will only predict
the positive class in cases very likely to be positive. It will be very trustworthy.
Consider what would happen if the model was very imprecise. Over time, the
results would be less likely to be trusted. In the context of information retrieval,
this would be similar to a search engine such as Google returning unrelated results.
Eventually users would switch to a competitor such as Bing. In the case of the SMS
spam filter, high precision means that the model is able to carefully target only the
spam while ignoring the ham.
precision = TP / (TP + FP)    (4.5)
On the other hand, recall is a measure of how complete the results are. As shown
in the following formula, this is defined as the number of true positives over the
total number of positives. We may recognize that this is the same as sensitivity, only
the interpretation differs. A model with high recall captures a large portion of the
positive examples, meaning that it has wide breadth. For example, a search engine
with high recall returns a large number of documents pertinent to the search query.
Similarly, the SMS spam filter has high recall if the majority of spam messages are
correctly identified.
recall = TP / (TP + FN)    (4.6)
The F-Measure: A measure of model performance that combines precision and
recall into a single number is known as the F-measure (also sometimes called the
F1 score or the F-score). The F-measure combines precision and recall using the
harmonic mean. The harmonic mean is used rather than the more common arithmetic
mean since both precision and recall are expressed as proportions between zero and
one. The following is the formula for the F-measure:

F-measure = (2 × precision × recall) / (recall + precision)    (4.7)

A more general form, the F_β measure (where β controls the relative weight given to recall versus precision), is:

F_β = ((1 + β²) × precision × recall) / (β² × precision + recall)    (4.8)
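Equations (4.1) through (4.7) can be collected into one small function. The confusion-matrix counts used in the usage line are hypothetical values for an SMS spam filter; the function name is our own.

```python
def classification_metrics(TP, FP, TN, FN):
    """Measures from equations (4.1)-(4.7) for a two-class confusion matrix."""
    P, N = TP + FN, TN + FP          # actual positives and actual negatives
    accuracy = (TP + TN) / (P + N)
    error_rate = (FP + FN) / (P + N)
    sensitivity = TP / (TP + FN)     # true positive rate; identical to recall
    specificity = TN / (TN + FP)     # true negative rate
    precision = TP / (TP + FP)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return dict(accuracy=accuracy, error_rate=error_rate,
                sensitivity=sensitivity, specificity=specificity,
                precision=precision, recall=sensitivity, f_measure=f_measure)

# Hypothetical spam-filter counts: 40 spam caught, 10 missed, 10 ham flagged.
m = classification_metrics(TP=40, FP=10, TN=940, FN=10)
print(m)
```

Note that recall and sensitivity come out identical, as the text observes: only the interpretation differs.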
In addition to accuracy-based measures, classifiers can also be compared with respect to the following additional aspects:
1. Speed: This refers to the computational costs involved in generating and using
the given classifier.
2. Robustness: This is the ability of the classifier to make correct predictions
given noisy data or data with missing values. Robustness is typically assessed
with a series of synthetic data sets representing increasing degrees of noise and
missing values.
3. Scalability: This refers to the ability to construct the classifier efficiently given
large amounts of data. Scalability is typically assessed with a series of data sets
of increasing size.
4. Interpretability: This refers to the level of understanding and insight that is
provided by the classifier or predictor. Interpretability is subjective and therefore
more difficult to assess.
4.3
We can use the following methods to estimate the evaluation metrics explained in depth in the preceding sections:
a. Training data
b. Independent test data
c. Hold-out method
d. k-fold cross-validation method
e. Leave-one-out method
f. Bootstrap method
g. Comparing Two Models
4.3.1 Training Data and Independent Test Data

The accuracy/error estimates on the training data are not good indicators of
performance on future data, because new data will probably not be exactly the same
as the training data. The accuracy/error estimates on the training data measure the
degree of the classifier's over-fitting. Fig 4.2 depicts the use of the training set.
Estimation with independent test data (figure 4.3) is used when we have plenty of
data and there is a natural way of forming training and test data. For example,
Quinlan in 1987 reported experiments in a medical domain for which the classifiers
were trained on data from 1985 and tested on data from 1986.
4.3.2 Holdout Method
k-Fold Cross-Validation
In k-fold cross-validation (fig 4.6), the initial data are randomly partitioned into k
mutually exclusive subsets or folds, D1, D2, ..., Dk, each of approximately equal size.
Training and testing is performed k times. In iteration i, partition Di is reserved as
the test set, and the remaining partitions are collectively used to train the model.
That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set in
order to obtain a first model, which is tested on D1; the second iteration is trained on
subsets D1, D3, ..., Dk and tested on D2; and so on. Unlike the holdout and random
subsampling methods above, here, each sample is used the same number of times for
training and once for testing. For classification, the accuracy estimate is the overall
number of correct classifications from the k iterations, divided by the total number
of tuples in the initial data. For prediction, the error estimate can be computed as the
total loss from the k iterations, divided by the total number of initial tuples.
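The partitioning and accuracy computation described above can be sketched in Python. This is a minimal illustration; the helper names and the pluggable `train_fn`/`predict_fn` interface are hypothetical, not from any particular toolkit:

```python
import random

def k_fold_indices(n, k, seed=42):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, labels, train_fn, predict_fn, k=10):
    """In iteration i, fold i is the test set and the remaining folds train the model.
    Returns overall correct classifications divided by total number of tuples."""
    folds = k_fold_indices(len(data), k)
    correct = 0
    for i in range(k):
        test_idx = set(folds[i])
        train = [(data[j], labels[j]) for j in range(len(data)) if j not in test_idx]
        model = train_fn(train)
        correct += sum(1 for j in folds[i] if predict_fn(model, data[j]) == labels[j])
    return correct / len(data)
```

Each sample is used exactly once for testing and k − 1 times for training, matching the description above.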
Leave-One-Out Cross-Validation
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial tuples; that is, only one sample is left out at a time for the test set. Some features of leave-one-out CV are:
1. It makes the best use of the data.
2. It involves no random sub-sampling.
CSE Department, Institute Of Technology,Nirma University, Ahmedabad
4.3.4
Bootstrap
Cross-validation uses sampling without replacement: the same instance, once selected, cannot be selected again for a particular training/test set. The bootstrap, by contrast, uses sampling with replacement to form the training set:
1. Sample a dataset of n instances n times with replacement to form a new dataset
of n instances.
2. Use this data as the training set.
3. Use the instances from the original dataset that don't occur in the new training set for testing.
4. A particular instance has a probability of 1 − 1/n of not being picked in a single draw. Thus its probability of ending up in the test data (as n tends to infinity) is:

(1 − 1/n)^n ≈ e^(−1) = 0.368 (4.9)

5. This means the training data will contain approximately 63.2% of the instances and the test data will contain approximately 36.8% of the instances.
6. The error estimate on the test data will be very pessimistic because the classifier
is trained on just 63% of the instances.
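A quick simulation (illustrative Python; the function name is our own) confirms the approximately 63.2% / 36.8% split for a large n:

```python
import random

def bootstrap_split(n, seed=0):
    """Sample n instances with replacement for training; unused ones form the test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]   # indices drawn with replacement
    test = set(range(n)) - set(train)              # instances never picked
    return train, sorted(test)

train, test = bootstrap_split(10000)
print(len(set(train)) / 10000)   # ≈ 0.632 distinct instances in the training set
print(len(test) / 10000)         # ≈ 0.368 instances left for testing
```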
7. To compensate, the overall error estimate combines the error on the test data with the resubstitution error on the training data (the .632 bootstrap):

err(M) = 0.632 × err(test set) + 0.368 × err(training set) (4.10)

8. The training error gets less weight than the error on the test data.
4.3.5
Comparing Two Models
Suppose that we have generated two models, M1 and M2 (for either classification or
prediction), from our data. We have performed 10-fold cross-validation to obtain a
mean error rate for each. How can we determine which model is best? It may seem intuitive to select the model with the lowest error rate; however, the mean error rates are just estimates of error on the true population of future data cases, and there can be considerable variance between error rates within any given 10-fold cross-validation experiment. Although the mean error rates obtained for M1 and M2 may appear different, that difference may not be statistically significant; it may be attributable to chance alone. The following points explain in detail how to judge whether their difference is statistically significant.
1. Assume that we have two classifiers, M1 and M2, and we would like to know which one is better for a classification problem.
2. We test the classifiers on n test data sets D1, D2, ..., Dn and we receive error rate estimates e11, e12, ..., e1n for classifier M1 and error rate estimates e21, e22, ..., e2n for classifier M2.
3. Using these rate estimates we can compute the mean error rate e1 for classifier M1 and the mean error rate e2 for classifier M2.
4. These mean error rates are just estimates of error on the true population of future data cases.
5. We note that the error rate estimates e11, e12, ..., e1n for classifier M1 and e21, e22, ..., e2n for classifier M2 are paired, since both are measured on the same test sets. Thus, we consider the differences d1, d2, ..., dn where dj = |e1j − e2j|.
6. The differences d1, d2, ..., dn are instantiations of n random variables with mean μD and standard deviation σD.
7. We need to establish confidence intervals for μD in order to decide whether the difference in the generalization performance of the classifiers M1 and M2 is statistically significant or not.
8. From the differences we compute the sample mean,

d̄ = (1/n) Σi=1..n di (4.11)

9. T-statistic:

T = (d̄ − μD) / (sd / √n) (4.12)

11. If d̄ and sd are the mean and standard deviation of the normally distributed differences of n random pairs of errors, a (1 − α)·100% confidence interval for μD = μ1 − μ2 is:

d̄ − t(α/2) · sd/√n < μD < d̄ + t(α/2) · sd/√n (4.13)

where t(α/2) is the t-value with v = n − 1 degrees of freedom, leaving an area of α/2 to the right.
12. If t > t(α/2) or t < −t(α/2), then t lies in the rejection region, within the tails of the distribution. This means that we can reject the null hypothesis that the means of M1 and M2 are the same and conclude that there is a statistically significant difference between the two models. Otherwise, if we cannot reject the null hypothesis, we conclude that any difference between M1 and M2 can be attributed to chance.
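The procedure can be sketched in Python using equation (4.12). The error rates below are made-up 10-fold results, and 2.262 is the standard two-sided critical t-value for α = 0.05 with v = 9 degrees of freedom:

```python
import math

def paired_t_statistic(errs1, errs2):
    """t statistic for paired per-fold error rates (null hypothesis: mean difference is 0)."""
    n = len(errs1)
    d = [a - b for a, b in zip(errs1, errs2)]                     # per-fold differences
    d_bar = sum(d) / n                                            # mean difference
    s_d = math.sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))   # sample std deviation
    return d_bar / (s_d / math.sqrt(n))                           # t with n-1 degrees of freedom

# Hypothetical 10-fold error rates for models M1 and M2
e1 = [0.20, 0.22, 0.19, 0.21, 0.25, 0.20, 0.18, 0.23, 0.21, 0.20]
e2 = [0.24, 0.26, 0.23, 0.25, 0.27, 0.25, 0.22, 0.26, 0.25, 0.24]
t = paired_t_statistic(e1, e2)
crit = 2.262   # two-sided critical t-value, alpha = 0.05, v = 9
print(abs(t) > crit)   # True: the difference is statistically significant
```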
4.4
ROC Curves
The ROC curve (Receiver Operating Characteristic) is commonly used to examine the trade-off between detecting true positives and avoiding false positives. As you might suspect from the name, ROC curves were developed by engineers in the field of communications around the time of World War II; receivers of radar and radio signals needed a method to discriminate between true signals and false alarms. The same technique is useful today for visualizing the efficacy of machine learning models.
The characteristics of a typical ROC diagram are depicted in figure 4.8 (Reference [2]). Curves are defined on a plot with the proportion of true positives on the vertical axis and the proportion of false positives on the horizontal axis. Because these values are equivalent to sensitivity and (1 − specificity), respectively, the diagram is also known as a sensitivity/specificity plot:
The points comprising ROC curves indicate the true positive rate at varying false positive thresholds. To create the curves, a classifier's predictions are sorted by the model's estimated probability of the positive class, with the largest values first. Beginning at the origin, each prediction's impact on the true positive rate and false positive rate results in a curve tracing vertically (for a correct prediction) or horizontally (for an incorrect prediction).
To illustrate this concept, three hypothetical classifiers are contrasted in the previous plot. First, the diagonal line from the bottom-left to the top-right corner of
the diagram represents a classifier with no predictive value. This type of classifier
detects true positives and false positives at exactly the same rate, implying that the
classifier cannot discriminate between the two. This is the baseline by which other
classifiers may be judged; ROC curves falling close to this line indicate models that
are not very useful. Similarly, the perfect classifier has a curve that passes through
the point at 100 percent true positive rate and 0 percent false positive rate. It is
able to correctly identify all of the true positives before it incorrectly classifies any
negative result. Most real-world classifiers are similar to the test classifier; they fall
somewhere in the zone between perfect and useless.
The closer the curve is to the perfect classifier, the better it is at identifying positive values. This can be measured using a statistic known as the area under the ROC
curve (abbreviated AUC). The AUC, as you might expect, treats the ROC diagram
as a two-dimensional square and measures the total area under the ROC curve. AUC
ranges from 0.5 (for a classifier with no predictive value), to 1.0 (for a perfect classifier). A convention for interpreting AUC scores uses a system similar to academic
letter grades:
1. 0.9 to 1.0 = A (outstanding)
2. 0.8 to 0.9 = B (excellent/good)
3. 0.7 to 0.8 = C (acceptable/fair)
4. 0.6 to 0.7 = D (poor)
5. 0.5 to 0.6 = F (no discrimination)
As with most scales similar to this, the levels may work better for some tasks
than others; the categorization is somewhat subjective.
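The curve tracing and AUC computation can be sketched as follows (minimal Python with hypothetical scores; tied scores are ignored for simplicity):

```python
def roc_points(scores, labels):
    """Trace ROC points: sort by descending score, step up for positives, right for negatives."""
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1      # correct prediction: move vertically
        else:
            fp += 1      # incorrect prediction: move horizontally
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]   # hypothetical positive-class probabilities
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))           # ≈ 0.889: grade B on the scale above
```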
4.5
Ensemble Methods
Motivation
1. Ensemble models improve accuracy and robustness over single-model methods
2. Applications:
(a) distributed computing
(b) privacy-preserving applications
(c) large-scale data with reusable models
(d) multiple sources of data
3. Efficiency: a complex problem can be decomposed into multiple sub-problems
that are easier to understand and solve (divide-and-conquer approach)
4.5.1
4.5.2
1. Learn to Combine
2. Learn By Consensus
4.5.3
Learn To Combine
Pros
1. Gets useful feedback from the labeled data.
2. Can potentially improve accuracy.
Cons
1. Need to keep the labeled data to train the ensemble
2. May overfit the labeled data.
3. Cannot work when no labels are available
4.5.4
Learn By Consensus
Pros
1. Do not need labeled data.
2. Can improve the generalization performance.
Cons
1. No feedbacks from the labeled data.
2. Require the assumption that consensus is better.
4.5.5
Bagging
Given a set, D, of d tuples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set, Di, of d tuples is sampled with replacement from the original set of tuples, D. The term bagging stands for bootstrap aggregation; each training set is a bootstrap sample. Because sampling with replacement is used, some of the original tuples of D may not be included in Di, whereas others may occur more than once. A classifier model, Mi, is learned for each training set, Di. To classify an unknown tuple, X, each classifier, Mi, returns its class prediction, which counts as one vote. The bagged classifier, M, counts the votes and assigns the class with the most votes to X. Bagging can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple. The bagged classifier often has significantly greater accuracy than a single classifier derived from D, the original training data. It will not be considerably worse and is more robust to the effects of noisy data. The increased accuracy occurs because the composite model reduces
the variance of the individual classifiers. For prediction, it was theoretically proven
that a bagged predictor will always have improved accuracy over a single predictor
derived from D.
Algorithm: The bagging algorithm creates an ensemble of models (classifiers or predictors) for a learning scheme where each model gives an equally weighted prediction.
Input: D, a set of d training tuples; k, the number of models in the ensemble; a learning scheme (e.g., decision tree algorithm, back-propagation, etc.)
Output: A composite model, M.
Method:
(1) for i = 1 to k do // create k models:
(2) create bootstrap sample, Di, by sampling D with replacement;
(3) use Di to derive a model, Mi;
(4) end for
To use the composite model on a tuple, X:
(1) if classification then
(2) let each of the k models classify X and return the majority vote;
(3) if prediction then
(4) let each of the k models predict a value for X and return the average predicted value;
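The algorithm above can be rendered as a Python sketch; `train_fn` stands in for any learning scheme, and the names are our own:

```python
import random
from collections import Counter

def bagging_train(D, k, train_fn, seed=0):
    """Learn k models, each on a bootstrap sample of the training tuples D."""
    rng = random.Random(seed)
    models = []
    for _ in range(k):
        Di = [rng.choice(D) for _ in range(len(D))]   # sample d tuples with replacement
        models.append(train_fn(Di))
    return models

def bagging_classify(models, x):
    """Each model casts one vote; return the class with the most votes."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

For example, with a trivial base learner that always predicts the majority class of its bootstrap sample, `bagging_train(D, 11, train_fn)` yields eleven voters whose majority vote classifies new tuples.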
4.5.6
Boosting
Principles
1. Boosting converts a set of weak learners into a strong learner.
2. It is an iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records.
3. Initially, all N records are assigned equal weights; unlike bagging, weights may change at the end of a boosting round.
4. Records that are wrongly classified will have their weights increased.
5. Records that are classified correctly will have their weights decreased.
6. Equal weights are assigned to each training tuple (1/d for round 1).
7. After a classifier Mi is learned, the weights are adjusted to allow the subsequent classifier Mi+1 to pay more attention to tuples that were misclassified by Mi.
The error rate of model Mi is the sum of the weights of the misclassified tuples:

error(Mi) = Σj wj × err(Xj) (4.14)

where err(Xj) is 1 if tuple Xj is misclassified and 0 otherwise. If a tuple in round i is correctly classified, its weight is multiplied by

error(Mi) / (1 − error(Mi)) (4.15)

Once the weights of all correctly classified tuples are updated, the weights of all tuples (including the misclassified ones) are normalized by multiplying them by the factor

(sum of old weights) / (sum of new weights) (4.16)

The weight of classifier Mi's vote is

log((1 − error(Mi)) / error(Mi))
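A single round of these weight updates can be sketched in Python (illustrative; the base classifier itself is abstracted into the `misclassified` flags):

```python
import math

def boosting_round(weights, misclassified):
    """One boosting round: update and normalize tuple weights given which tuples Mi got wrong.
    Returns (new_weights, classifier_vote_weight)."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)   # eq. 4.14
    factor = error / (1 - error)
    # Correctly classified tuples get their weight multiplied by error/(1-error)   (eq. 4.15)
    updated = [w * factor if not wrong else w for w, wrong in zip(weights, misclassified)]
    # Normalize by (sum of old weights)/(sum of new weights)   (eq. 4.16)
    scale = sum(weights) / sum(updated)
    new_weights = [w * scale for w in updated]
    vote = math.log((1 - error) / error)   # weight of Mi's vote
    return new_weights, vote

d = 4
weights = [1 / d] * d                      # round 1: equal weights 1/d
wrong = [False, False, False, True]        # suppose Mi misclassifies the last tuple
weights, vote = boosting_round(weights, wrong)
print(weights)   # the misclassified tuple now carries relatively more weight
```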
Chapter 5
Conclusion and Future Scope
5.1
Comparative Study
To practically explore the theoretical aspects of the data mining models and the techniques to evaluate them, we conducted a small-scale exploratory study in the data mining tool Weka, developed by the University of Waikato, New Zealand. The following tables summarize the results of our exploratory study.
[Tables: per-class Weka results (TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area) and confusion matrices for the J48 and Naive Bayesian classifiers on three datasets: a 286-instance dataset with classes no recurrence event / recurrence event, Diabetes.arff (768 instances; classes tested_negative / tested_positive), and Iris.arff (150 instances; classes Iris-setosa / Iris-versicolor / Iris-virginica), evaluated with cross-validation (default seed and seed 20) and the hold-out method (66% percent split).]
5.2
Conclusion
From the exploratory tests carried out on the datasets in Weka, we can conclude some of the following theoretical aspects, which we explored in depth in the various sections:
1. Evaluating the classifier only on the training set fetches highly optimistic results, and thus they are biased.
2. Increasing the k value increases the credibility of the results, and the best results are obtained when k = 10.
3. Repeating the k-fold cross-validation fetches more credible results, and the best results are obtained when it is repeated 10 times.
4. The hold-out method, when repeated iteratively, fetches more accurate results, and the best results are obtained when it is repeated 10 times.
5. Naive Bayesian and decision tree induction (J48) work excellently well with datasets which have more nominal data in comparison to numeric data.
5.3
Future Scope
References
[1] Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, Jian Pei
[2] Machine Learning with R, by Brett Lantz
[3] An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani
[4] Statistics, by David Freedman, Robert Pisani
[5] Inferential Statistics: Course Track, Udacity
[6] Descriptive Statistics: Course Track, Udacity
[7] Data Mining with Weka: Course Track, University of Waikato, New Zealand