
Q.No 1.

a> Ans:
Decision Tree Induction:-
Advantages:
It is simple and hence easy to understand and interpret.
It doesn't make any prior assumptions about the data.
It is relatively robust to noisy data.
It can be applied to categorical and nominal data.
Disadvantages:
It is limited to one output attribute.
The desired output must be categorical.
Trees for numerical outputs can be built, but they tend to be complex.
Without proper pruning or limits on tree growth, they tend to over-fit the training data (a small sketch follows this list).
Biased toward shorter trees: the induction assumes that the simplest decision tree that correctly classifies the
training examples is the best one.
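
As a small illustration of the pruning point above, the hedged sketch below uses scikit-learn's DecisionTreeClassifier on a synthetic data set (the library calls, the data set and the max_depth value are illustrative assumptions, not part of the original answer); an unrestricted tree fits the training data almost perfectly but tends to generalize worse than a depth-limited one:

# Sketch: limiting tree depth (a simple form of pruning) to reduce over-fitting.
# Assumes scikit-learn is available; the data set is synthetic and only illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                  # grown until pure leaves
pruned_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)   # growth limited

print("full tree   train/test:", full_tree.score(X_train, y_train), full_tree.score(X_test, y_test))
print("pruned tree train/test:", pruned_tree.score(X_train, y_train), pruned_tree.score(X_test, y_test))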

Q.No 1.b> Ans:
Artificial Neural Networks:-
Advantages:
A neural network is a parallel system, so its computation can be distributed and performed quickly.
When an element of the network fails, the network can continue to operate because of its parallel nature.
It can be applied to a wide range of problems without requiring a problem-specific model.
It can learn from examples and does not need to be reprogrammed.
It is flexible: it can generalize from the training data by adjusting its parameters.
Disadvantages:
It requires a large and diverse set of training data for effective learning.
It requires high processing time for larger neural networks.
The VC dimension of a neural network is unclear, which matters when we want to evaluate how good the learned solution might be.
A trained network is hard to extend: if we want to add data later, it cannot simply be added to an existing network, and the network has to be retrained on the whole data set.

Q.No 1.c> Ans:
Support Vector Machines:-
Advantages:
A principled approach to classification, regression or novelty detection tasks.
Versatile: different kernel functions can be specified for the decision function, including custom kernels.
Effective in high dimensional spaces.
The hypothesis has an explicit dependence on the data (via the support vectors), so the model can be readily interpreted.
Learning involves optimization of a convex function (no false minima, unlike a neural
network).

Disadvantages:
An appropriate kernel function has to be selected, which is not always obvious.
Computational and storage complexity of training: quadratic in the size of the training set.
It is directly applicable to binary (two-class) classification only, so algorithms that reduce the
multi-class task to several binary problems have to be applied for multi-class classification.

Q.No 1.d> Ans:
K- Nearest Neighbor:-
Advantages:
Simple to use. It can estimate complex target concepts locally and differently for each
new instance to be classified.
Robust to noisy training data, especially if the neighbours' votes are weighted by the inverse square of their distance.
Effective if the training data is large.
KNN is particularly well suited for multi-modal classes as well as applications in which an
object can have many class labels.
The NN algorithm learns very quickly.

Disadvantages:
The NN algorithm has large storage requirements because it has to store all the data.
It is slow during instance classification because all the training instances have to be
visited.
The accuracy degrades with increase of noise in the training data and irrelevant
attributes.
Need to determine the value of parameter K, the number of nearest neighbors.
It is not obvious which attributes produce the best results: should all attributes be used, or only certain ones?
In distance-based learning it is not clear which type of distance measure to use.
Biased by value of K.

Q.No 1.e> Ans:
Naïve Bayes Classifier:
Advantages:
Fast to train (single scan). Fast to classify, easy to implement.
Requires a small amount of training data to estimate the parameters.
Not sensitive to irrelevant features.
Handles real and discrete data.
Handles streaming data well.
Disadvantages:
Assumption: class conditional independence, therefore loss of accuracy.
Practically, dependencies exist among variables.
Dependencies among these cannot be modeled by the Naïve Bayesian Classifier.



Q.No .2> Ans:
The Perceptron is a simple form of Neural Network, which consists of a single neuron with adjustable
synaptic weights and a bias. For this discussion it is sufficient to study a single-layer perceptron with just
one neuron. For a perceptron with n input variables, the decision boundary it draws is a hyperplane over the
(n-dimensional) input space, and it classifies input patterns into two classes.

Fig1: Single layer perceptron (Ref: Internet)

We will try to implement the XOR gate with a single-layer perceptron. The truth table of the XOR
gate is as below:
Inputs to Perceptron    Output of Perceptron
X1    X2                Out
0     0                 0
0     1                 1
1     0                 1
1     1                 0

With weights w1 and w2 and threshold θ, each training pattern must satisfy:
Out = sgn(w1*x1 + w2*x2 - θ)
So the training data lead to four inequalities:
w1*(0) + w2*(0) - θ < 0    =>    θ > 0
w1*(0) + w2*(1) - θ >= 0   =>    w2 >= θ
w1*(1) + w2*(0) - θ >= 0   =>    w1 >= θ
w1*(1) + w2*(1) - θ < 0    =>    w1 + w2 < θ, which is impossible since w1, w2 >= θ > 0
Clearly the second and third inequalities are incompatible with the fourth, so the fourth inequality
cannot be satisfied and there is no solution. Let us plot the truth table of the XOR gate.

Fig 1.1: Plot of the 2-input XOR logic gate truth table.

The XOR function is not linearly separable: it is impossible to separate the classes (outputs)
with only one line. Hence a Perceptron cannot learn the XOR function. To learn it we need
more complex networks, e.g. networks that combine together many simple networks, or that use different
activation/thresholding/transfer functions.
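
A hedged sketch of this failure is given below (a toy implementation of the perceptron learning rule written only for illustration; the learning rate and epoch count are arbitrary assumptions). Training never reaches zero error on the four XOR patterns:

# Sketch: a single-layer perceptron trained on XOR never reaches zero error,
# because no line w1*x1 + w2*x2 = theta separates the two classes.
# Illustrative toy implementation, not a reference algorithm.
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                      # XOR truth table

w = np.zeros(2)
theta = 0.0
rate = 0.1

for epoch in range(100):                        # perceptron learning rule
    errors = 0
    for xi, target in zip(X, y):
        out = 1 if np.dot(w, xi) - theta >= 0 else 0
        update = rate * (target - out)
        w += update * xi
        theta -= update
        errors += int(update != 0)
    if errors == 0:
        break

print("weights:", w, "threshold:", theta, "errors in last epoch:", errors)
# errors never reaches 0: at best 3 of the 4 patterns are classified correctly.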

Q.No .3> Ans:
In a Support Vector Machine we want the maximum-margin hyperplane, i.e. the hyperplane whose margin to
the nearest training samples of the two classes is as large as possible.

Fig2: Support Vector Machine (Ref: Internet)
To get the maximum margin we minimize
||w||^2 / 2
subject to:
for yi = +1,  w.xi + b >= 1
for yi = -1,  w.xi + b <= -1
The above optimization problem can be solved by using Lagrange multipliers αi:
Minimize  Lp(w, b, αi) = ||w||^2 / 2  -  Σ_{i=1}^{n} αi (yi[w.xi + b] - 1)

where setting the derivatives to zero gives
∂Lp/∂w = 0  =>  w = Σ_{i=1}^{n} αi yi xi      and      ∂Lp/∂b = 0  =>  Σ_{i=1}^{n} αi yi = 0

Substituting these back gives the dual problem:
Maximize  L(α) = Σ_{i=1}^{n} αi  -  (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj <xi, xj>

Solving this requires quadratic programming; if N is large, solving the N-dimensional quadratic problem
becomes computationally expensive. From the αi we can find the support vectors, which are the points that
define the hyperplane. If the data are not in a linearly separable form, we transform them into a space where
they are, so that we can keep a simple hypothesis H instead of a complex hypothesis h. The kernel trick handles
the transformation needed to extend the linear analysis to the non-linear case.
For the non-linear transformation, the idea is to go to a high-dimensional Z-space hypothesis set:
Maximize  L(α) = Σ_{i=1}^{n} αi  -  (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj <zi, zj>

Constraints: αi >= 0 for i = 1, 2, ..., n  and  Σ_{i=1}^{n} αi yi = 0

Consider a non-linear mapping function Φ : I = R^2 -> Z = R^3 (taking the dimension of Z as 3) from the 2-dimensional
input space I into the 3-dimensional feature space Z, which is defined in the following way:
Φ(x) = (x1^2, √2 x1x2, x2^2) ---------------------------------------(a)
Taking the equation for a separating hyperplane,
w.Φ(x) = w1 x1^2 + w2 √2 x1x2 + w3 x2^2 = 0------------------------(b)
Consider the Z-space hypothesis function H(z) = sign(w.z + b).
From the dual solution we have w = Σ_{i=1}^{n} αi yi zi, and b is obtained from any support vector via yi(w.zi + b) = 1.
The explicit coordinates in Z and even the mapping function Φ become unnecessary when we
define a function K(xi, xj) = Φ(xi).Φ(xj), the so-called kernel function, which directly calculates the
value of the dot product of the mapped data points in some feature space. The following
example of a kernel function K demonstrates the calculation of the dot product in the feature
space using K(x, z) = (x.z)^2, inducing the mapping function Φ(x) = (x1^2, √2 x1x2, x2^2) from (a).
Let x = (x1, x2) and z = (z1, z2). Then:
K(x, z) = (x.z)^2
        = (x1z1 + x2z2)^2
        = x1^2 z1^2 + 2 x1x2 z1z2 + x2^2 z2^2
        = (x1^2, √2 x1x2, x2^2) . (z1^2, √2 z1z2, z2^2)
        = Φ(x) . Φ(z)
The advantage of such a kernel function is that the complexity of the optimization problem
remains dependent only on the dimensionality of the input space and not on that of the feature space.
Therefore, it is possible to operate in a theoretical feature space of infinite dimension.
We can solve the dual Lagrangian of our optimization problem using the kernel function K:
Maximize  L(α) = Σ_{i=1}^{n} αi  -  (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} αi αj yi yj K(xi, xj)

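The kernel identity K(x, z) = (x.z)^2 = Φ(x).Φ(z) derived above can be checked numerically with a short sketch (the particular vectors x and z are arbitrary illustrative choices):

# Sketch: verify that K(x, z) = (x . z)^2 equals phi(x) . phi(z)
# for phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), as in equation (a).
import numpy as np

def phi(v):
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

def K(x, z):
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(K(x, z), np.dot(phi(x), phi(z)))   # both print 1.0
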
Q.No .4> Ans:
In the K-Nearest Neighbor classifier, a new query instance is classified based on the majority category among
its k nearest neighbors. For each query instance, the algorithm calculates the distance between that
instance and all the training instances and chooses the K examples with the smallest distances.
Classification is then done by a majority vote over these K training examples. So even when a few nearby
training examples have been misclassified, the dependence on the majority class means the query will still be
classified correctly. The larger we choose the value of K, the more robust the classifier becomes to such
noise, although it also needs more training data. The robustness of K-Nearest Neighbor does not depend only
on the value of K, but also on the type of model we are using and on the distance computation method we use.

Fig3: Xq is the new instance used for classification in the K-Nearest Neighbor classifier (Ref: Internet).
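As an illustration of the voting scheme described above, here is a minimal hedged sketch of a K-Nearest Neighbor classifier (Euclidean distance and the tiny data set are assumptions made only for illustration):

# Sketch: a minimal k-nearest-neighbour classifier with majority voting.
import numpy as np
from collections import Counter

def knn_classify(query, X_train, y_train, k=3):
    distances = np.linalg.norm(X_train - query, axis=1)   # distance to every training instance
    nearest = np.argsort(distances)[:k]                   # indices of the k closest instances
    votes = Counter(y_train[nearest])                      # majority vote among the k labels
    return votes.most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])
y_train = np.array([0, 0, 1, 1, 1])
print(knn_classify(np.array([2.8, 3.1]), X_train, y_train, k=3))   # -> 1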
Q.No .5> Ans
Let us consider that the probability model for the classifier is a conditional model P(Vj | a1, a2, ..., aN) over a
dependent variable Vj, which is the binary (+1, -1) outcome, conditional on the feature vector a1, a2, ..., aN.
Using Bayes' theorem,
P(Vj | a1, a2, ..., aN) = P(a1, a2, ..., aN | Vj) P(Vj) / P(a1, a2, ..., aN)
Since the denominator does not depend on Vj, the most probable class is
v_MAP = argmax_{Vj} P(a1, a2, ..., aN | Vj) P(Vj)
Using the chain rule of probability,
v_MAP = argmax_{Vj} P(Vj) [P(a1 | Vj) * P(a2 | Vj, a1) * P(a3 | Vj, a1, a2) * ... * P(aN | Vj, a1, a2, ..., aN-1)]
Estimating these full conditional probabilities is impractical: some feature combinations may never occur in the
training data, making their estimated probability zero and hence the whole product zero, and modelling all the
dependencies would require a very large number of parameters and complex computation. Thus we assume that the
features are conditionally independent of each other given the class. This assumption leads to
v_NB = argmax_{Vj} P(Vj) * P(a1 | Vj) * P(a2 | Vj) * P(a3 | Vj) * ... * P(aN | Vj)
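A small worked sketch of this final formula follows; the priors and conditional probability tables are invented purely to illustrate the computation:

# Sketch: v_NB = argmax_Vj P(Vj) * prod_i P(ai | Vj), with made-up probability tables.
# All numbers here are hypothetical and only illustrate the computation.
priors = {"+1": 0.6, "-1": 0.4}
likelihoods = {
    "+1": {"a1": 0.7, "a2": 0.5},   # P(a1|+1), P(a2|+1)
    "-1": {"a1": 0.3, "a2": 0.8},   # P(a1|-1), P(a2|-1)
}

def naive_bayes(features):
    scores = {}
    for vj, prior in priors.items():
        score = prior
        for a in features:
            score *= likelihoods[vj][a]      # conditional independence assumption
        scores[vj] = score
    return max(scores, key=scores.get), scores

print(naive_bayes(["a1", "a2"]))   # '+1' wins: 0.6*0.7*0.5 = 0.21 > 0.4*0.3*0.8 = 0.096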

Q.No .6> Ans

Fig4: Neural Network with back propagation (Ref: Internet)
An Artificial Neural Network is an algorithm which runs on a computer and approximates the
desired output, much as the human brain does, for a given set of inputs. The network
consists of a number of neurons which behave like brain neurons: each has an activation
function and produces an output on its outgoing edges using the inputs on its incoming edges and that
activation function. Each input to the neural network is fed to all neurons of the first layer. The inputs
are weighted by weights Wij and shifted by a bias specific to each neuron. To
compute the output of a single neuron we first find the weighted sum of the inputs to that
neuron, add the bias to the sum, and feed it to the activation function of that neuron. The output of the
activation function (A) is defined as the output of that particular neuron.
Output: Zk = A( Σ_i Wik * Ii + bias_k ), where Ii is the input carried on the ith incoming edge of the kth
neuron and Wik is the weight of that edge. However, we do not know in advance the weights
that make each neuron produce the desired output. We need to find the weights of the
edges and the biases of the neurons in the network which minimize the squared error on the training
data. The sum of squared errors is computed as:
E = Σ_k (Zk - Ok)^2, where Ok is the desired output of the neural
network for the set of inputs Xk.
The weights and biases are initially assigned random continuous real values. The algorithm
trains the neural network incrementally, i.e. each instance is considered independently, and after
considering a training instance the weights are updated before moving on to the next one. Consider a
back-propagation learning instance (Xk, Ok). We calculate the error of the computed output Zk in comparison
to the desired output Ok, called the blame. We compute a blame value for each neuron; its purpose is to
indicate how the weight and bias of that neuron should be adjusted. The blame for each input and hidden
layer neuron is computed from the blame of the output neurons it feeds, using:
Σ_k Wk * Ek, where Ek is the blame of an output neuron and Wk is the
weight of the edge that connects this neuron to that output neuron. This process of sending back the error
calculated at the output towards the inputs is called back propagation. Hence, the blame from the
output moves backwards to the inputs to recalculate the weight and bias of each neuron. The weight
is modified as
Wij = Wij + r * Ej * A'(Ij) * Zi for the jth neuron.
Where r is the learning rate, Ej is the blame of the jth neuron, A' is the derivative of the activation function, Ij
is the net input fed to the jth neuron, and Zi is the output calculated in the forward pass for the ith neuron that feeds neuron j.
Similarly, bias_j = bias_j + r * Ej.
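The procedure described above can be sketched in a few lines of Python/NumPy (a hedged toy example: the sigmoid activation, network size, learning rate and epoch count are illustrative assumptions, not the exact setup discussed above):

# Sketch: one hidden layer trained with back-propagation on XOR.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)            # desired outputs Ok

W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)               # input -> hidden weights and biases
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)               # hidden -> output weights and biases
r = 0.5                                                     # learning rate

for epoch in range(10000):
    H = sigmoid(X @ W1 + b1)                                # forward pass: hidden layer outputs
    Z = sigmoid(H @ W2 + b2)                                # forward pass: network outputs Zk
    blame_out = (T - Z) * Z * (1 - Z)                       # blame of the output neuron
    blame_hid = (blame_out @ W2.T) * H * (1 - H)            # blame propagated back to the hidden layer
    W2 += r * H.T @ blame_out
    b2 += r * blame_out.sum(axis=0)
    W1 += r * X.T @ blame_hid
    b1 += r * blame_hid.sum(axis=0)

print(np.round(Z.ravel(), 2))                               # typically close to [0, 1, 1, 0] after training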
Q.No .7> Ans
Over-fitting means that a hypothesis h has a larger true error over unseen instances than
another hypothesis h', even though its training error is lower than that of h'. Over-fitting may
be caused by noisy data, by too little training data, and, in the case of a neural network, by a large number
of hidden neurons. Over-fitting can be avoided by means of cross-validation, namely partitioning
the data into a training and a validation set and picking the hypothesis that has the lowest
error on the validation data.
Let Eout(h) be the out-of-sample (true) error and Ein(h) the in-sample (training) error.
Then, Eout(h) = Ein(h) + overfit penalty.
Consider a sample point (x, y); the pointwise error is e[h(x), y], for example the squared error (h(x) - y)^2.
Expectation of the error: E( e[h(x), y] ) = Eout(h)
Variance of the error:    Var( e[h(x), y] ) = σ^2
For validation, let us consider a set of points instead of only a single point. Let
(x1, y1), (x2, y2), ..., (xk, yk) be the validation set.
Now compute the validation error:  Eval(h) = (1/k) * Σ_{i=1}^{k} e[h(xi), yi]
Expectation of the error:  E{ Eval(h) } = (1/k) * Σ_{i=1}^{k} E( e[h(xi), yi] ) = Eout(h)
Variance:                  Var( Eval(h) ) = (1/k^2) * Σ_{i=1}^{k} Var( e[h(xi), yi] ) = σ^2 / k
Eval(h) = Eout(h) ± O(1/√k)
Thus the validation error is an unbiased estimate of Eout(h), and its variance is reduced by a factor of
k. If k is small the estimate is unreliable and overfitting may go undetected, hence we prefer a larger value
of k; however, a larger k leaves less data for training, so increasing k does not always guarantee to combat
overfitting. Hence we should select a proper value of k; a rule of thumb is to choose k = N/5.
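
A hedged numerical sketch of this behaviour (the hypothesis, target function and noise level below are invented for illustration) shows that the validation error stays centred on Eout(h) while its spread shrinks as k grows:

# Sketch: the validation error Eval(h) estimates Eout(h), and its spread shrinks roughly as 1/sqrt(k).
import numpy as np

rng = np.random.default_rng(1)
h = lambda x: 2.0 * x                                        # some fixed hypothesis
f = lambda x: 2.0 * x + rng.normal(0, 0.5, size=x.shape)     # noisy "true" target

def eval_error(k):
    x = rng.uniform(-1, 1, size=k)
    return np.mean((h(x) - f(x)) ** 2)                       # Eval(h) on a validation set of size k

for k in (10, 100, 1000):
    estimates = [eval_error(k) for _ in range(200)]
    print(k, round(np.mean(estimates), 3), round(np.std(estimates), 3))
# The mean stays near Eout(h) (= 0.25 here) while the standard deviation falls as k grows.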

Q.No .8> Ans
10-fold cross-validation is a technique for estimating the performance of a predictive model.
The original sample is randomly partitioned into 10 equal subsamples. One of these subsamples
is used for the validation test of the model and the remaining 9 are used as
training data. The cross-validation process is repeated ten times, with each subsample used
exactly once for the validation test. The ten results can be averaged for the prediction. The advantage of
10-fold cross-validation is that all the data are used at least once for training as well as for
testing the algorithm.
The hold-out method is also called 2-fold cross-validation: the data points are divided into two
sub-samples, so that both sets are of equal size. With 10-fold cross-validation we get 10 different
train/test splits, hence this estimator has lower variance in comparison to hold-out. The 10-fold
cross-validation method is also effective for models with smaller data sets, as is common in the medical field.

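A minimal sketch of 10-fold cross-validation, assuming scikit-learn's KFold and cross_val_score API and an arbitrary base classifier (both are illustrative choices, not prescribed by the answer above):

# Sketch of 10-fold cross-validation with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
cv = KFold(n_splits=10, shuffle=True, random_state=0)                 # 10 equal subsamples
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean(), scores.std())                                    # average of the ten results
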
Q.No 9> Ans:
Bagging (Bootstrap aggregation) is a technique that can be used with many classification and
regression methods to reduce the variance associated with the prediction and thereby improve
the prediction process. Bagging utilizes several subsets of training samples, obtained mainly by
re-sampling, in order to stabilize the classifier constructed from those subsets. Bagging changes the
distribution of the training sets stochastically. It employs the simplest way of combining predictions
of the same type, realized either by voting or by averaging. Bagging
typically brings an improvement when applied with an over-fitted base model, but it has a high
dependency on the actual training data. If we apply bagging to a robust (stable) algorithm such
as K-Nearest Neighbor, it does not yield much improvement in the prediction, in contrast to the
case of noisy data. Bagging works efficiently when the learning algorithm is unstable:
if small changes to the training set cause large changes in the learned classifier.
Boosting also utilizes randomly sampled subsets of training samples, but in re-sampling it
weighs heavily the samples that were not correctly classified at the current stage.
Instances that are wrongly classified have their weights increased, while correctly classified
instances have their weights decreased. Boosting algorithms are considered stronger
than bagging on noise-free data. They can improve both the accuracy and the confidence of the
prediction. Boosting works well if the base classifiers are not too complex and their error does not become
too large too quickly, i.e. a certain robustness of the base classifier is demanded for the desired effect.
Boosting seems to be susceptible to noise.
We might not want to use an ensemble classifier at all because of the corresponding increase
in computational demands and implementation complexity. The other reason is that the
implementation of the ensemble classifier does not guarantee that it will perform better than
a single classifier.
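As an illustrative, hedged sketch, bagging and boosting can be compared on the same unstable base learner using scikit-learn's BaggingClassifier and AdaBoostClassifier as stand-ins for the generic techniques described above (the data set and parameters are assumptions made for the example):

# Sketch: bagging and boosting built on a decision tree base learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.05, random_state=0)
base = DecisionTreeClassifier(max_depth=3, random_state=0)

# "estimator=" is the keyword in scikit-learn >= 1.2 (older versions call it "base_estimator").
bagging = BaggingClassifier(estimator=base, n_estimators=50, random_state=0)   # re-sampled subsets, voting
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)                 # re-weights misclassified instances

for name, model in (("single tree", base), ("bagging", bagging), ("boosting", boosting)):
    print(name, cross_val_score(model, X, y, cv=5).mean())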
Q.No 10> Ans:
Random Forest is a generic principle of classifier combination that uses L tree-structured base
classifiers {h(X, Θn), n = 1, 2, ..., L}, where X denotes the input data and {Θn} is a family of independent
and identically distributed random vectors. Every decision tree is built by randomly selecting
data from the available data. For example, a Random Forest can be built by randomly sampling a feature
subset for each Decision Tree (as in Random Subspaces), and/or by
randomly sampling a training data subset for each Decision Tree (the concept of Bagging).
It is unexcelled in accuracy among current algorithms. It runs efficiently on large databases. It
has an effective method for estimating missing data and maintains accuracy when a large
proportion of the data are missing.
The Random Forest learning algorithm is resistant to over-fitting because:
The samples used to train the individual trees are "bootstrapped".
There is a multitude of random trees using random features, so the individual trees are strong but not
strongly correlated with each other.

However, there may be over-fitting when only a few trees are used in a Random Forest.
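A short hedged sketch with scikit-learn's RandomForestClassifier illustrates the two sources of randomness described above (the data set and hyper-parameter values are assumptions for illustration only):

# Sketch: a random forest; each tree sees a bootstrap sample of the data
# and a random feature subset at every split, matching the description above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=25, n_informative=10, random_state=0)
forest = RandomForestClassifier(n_estimators=200,      # L tree-structured base classifiers
                                max_features="sqrt",   # random feature subset per split
                                bootstrap=True,        # random sampling of the training data
                                random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())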
