PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.06
Aim: Implementation of Naïve Bayes Classifier
Prerequisites:
C/C++/Java Programming
Learning Outcomes:
Concepts of Bayesian theorem and Classification
Theory:
Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm.
Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X). This can be derived from Bayes' theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is the same for all classes, only the numerator

P(X|Ci) P(Ci)

needs to be maximized.
Example
Class:
C1: buys_computer = yes
C2: buys_computer = no
Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Training data:

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31...40  high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31...40  low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31...40  medium   no       excellent      yes
31...40  high     yes      fair           yes
>40      medium   no       excellent      no
Solution:
P(Ci):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no)  = 5/14 = 0.357

Compute P(X|Ci) for each class:
P(age <= 30 | buys_computer = yes)        = 2/9 = 0.222
P(age <= 30 | buys_computer = no)         = 3/5 = 0.600
P(income = medium | buys_computer = yes)  = 4/9 = 0.444
P(income = medium | buys_computer = no)   = 2/5 = 0.400
P(student = yes | buys_computer = yes)    = 6/9 = 0.667
P(student = yes | buys_computer = no)     = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no)  = 2/5 = 0.400

P(X | buys_computer = yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
P(X | buys_computer = no)  = 0.600 * 0.400 * 0.200 * 0.400 = 0.019

P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 * 0.643 = 0.028
P(X | buys_computer = no)  P(buys_computer = no)  = 0.019 * 0.357 = 0.007

Therefore, X belongs to class C1 (buys_computer = yes).
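The worked example above can be sketched in Java, with the counts hard-coded from the training table. The class and method names here are illustrative only, not part of the lab code:

```java
// A minimal sketch of the buys_computer example: the Naive Bayes score
// for a class is the prior multiplied by the conditional probabilities.
public class NaiveBayesExample {

    // score = P(Ci) * P(x1|Ci) * P(x2|Ci) * ... for the sample X
    static double score(double prior, double... conditionals) {
        double p = prior;
        for (double c : conditionals) {
            p *= c;
        }
        return p;
    }

    public static void main(String[] args) {
        // Counts read off the 14-tuple training table: 9 "yes", 5 "no".
        // X = (age<=30, income=medium, student=yes, credit_rating=fair)
        double scoreYes = score(9.0 / 14, 2.0 / 9, 4.0 / 9, 6.0 / 9, 6.0 / 9);
        double scoreNo  = score(5.0 / 14, 3.0 / 5, 2.0 / 5, 1.0 / 5, 2.0 / 5);
        System.out.printf("P(X|yes)P(yes) = %.4f%n", scoreYes); // ~0.0282
        System.out.printf("P(X|no)P(no)   = %.4f%n", scoreNo);  // ~0.0069
        System.out.println(scoreYes > scoreNo
                ? "buys_computer = yes" : "buys_computer = no");
    }
}
```

Running it reproduces the hand computation: the "yes" class wins, so X is classified as buys_computer = yes.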
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical slot. The soft copy must be uploaded on Blackboard, or emailed to the concerned lab in-charge faculty at the end of the practical in case there is no Blackboard access available.)
package bayesian;

import java.sql.*;
import java.util.Scanner;

/**
 * Naive Bayes classifier: predicts whether a person is Short, Medium or
 * Tall from gender and height, using counts queried from a
 * Bayesian_table(Gender, Height, Output) table over the JDBC-ODBC bridge.
 *
 * @author mpstme.student
 */
public class Bayesian {

    public static void main(String[] args) {
        String range[] = {"Short", "Medium", "Tall"};
        // Height buckets: (0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2], (2,+inf)
        double bounds[] = {0, 1.6, 1.7, 1.8, 1.9, 2, Integer.MAX_VALUE};
        double male[] = new double[3];               // P(Gender=M | class)
        double female[] = new double[3];             // P(Gender=F | class)
        double height_probab[][] = new double[6][3]; // P(height bucket | class)
        double total_range[] = new double[3];        // #tuples per class
        double probab_range[] = new double[3];       // prior P(class)
        double prob_height[] = new double[3];        // P(entered height's bucket | class)
        double likelihood[] = new double[3];
        double tot_likelihood_range = 0;
        double p_range_t[] = new double[3];          // posterior P(class | evidence)
        Scanner sc = new Scanner(System.in);

        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            Connection con = DriverManager.getConnection("jdbc:odbc:Classification");
            Statement st = con.createStatement();
            ResultSet rs;

            // Count tuples of each gender per class, then normalize.
            for (int i = 0; i < range.length; i++) {
                rs = st.executeQuery("Select count(*) from Bayesian_table where Gender='M' and Output='" + range[i] + "'");
                while (rs.next()) male[i] = rs.getInt(1);
                rs = st.executeQuery("Select count(*) from Bayesian_table where Gender='F' and Output='" + range[i] + "'");
                while (rs.next()) female[i] = rs.getInt(1);
            }
            for (int i = 0; i < 3; i++) {
                double total = male[i] + female[i];
                male[i] /= total;
                female[i] /= total;
            }

            // Count tuples in each height bucket per class, then normalize
            // and compute the class priors.
            for (int i = 0; i < range.length; i++) {
                for (int j = 0; j < 6; j++) {
                    rs = st.executeQuery("Select count(*) from Bayesian_table where Height>" + bounds[j]
                            + " and Height<=" + bounds[j + 1] + " and Output='" + range[i] + "'");
                    while (rs.next()) height_probab[j][i] = rs.getInt(1);
                }
            }
            double grand_total = 0;
            for (int i = 0; i < 3; i++) {
                double total = 0;
                for (int j = 0; j < 6; j++) total += height_probab[j][i];
                total_range[i] = total;
                grand_total += total;
                for (int j = 0; j < 6; j++) height_probab[j][i] /= total;
            }
            for (int i = 0; i < 3; i++) probab_range[i] = total_range[i] / grand_total;

            System.out.println("Enter Details of the person");
            System.out.println("Enter Name");
            String name = sc.next();
            System.out.println("Enter Gender M/F");
            String gender = sc.next();
            System.out.println("Enter Height");
            double height = sc.nextDouble();

            // Locate the bucket the entered height falls into.
            int bucket = 5;
            for (int j = 0; j < 6; j++) {
                if (height > bounds[j] && height <= bounds[j + 1]) {
                    bucket = j;
                    break;
                }
            }
            for (int i = 0; i < 3; i++) prob_height[i] = height_probab[bucket][i];

            // Likelihood of each class: P(gender|class) * P(height|class) * P(class).
            for (int i = 0; i < 3; i++) {
                double p_gender = gender.equals("M") ? male[i] : female[i];
                likelihood[i] = p_gender * prob_height[i] * probab_range[i];
                tot_likelihood_range += likelihood[i];
            }

            // Normalize to posteriors and report the most probable class.
            int index = 0;
            for (int i = 0; i < 3; i++) {
                p_range_t[i] = likelihood[i] / tot_likelihood_range;
                if (p_range_t[i] > p_range_t[index]) index = i;
            }
            System.out.println(name + " is categorised as " + range[index]);

            con.close();
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println("Exception: " + e.getMessage());
        }
    }
}
(Paste your program input and output in the following format. If there is an error, paste the specific error in the output part. In case of an error, with due permission of the faculty, an extension can be given to submit the error-free code with output in due course of time. Students will be graded accordingly.)
run:
Enter Details of the person
Enter Name
Adam
Enter Gender M/F
M
Enter Height
1.95
Adam is categorised as Tall
BUILD SUCCESSFUL (total time: 13 seconds)
B.4 Conclusion:
(Students must write the conclusions based on their learning)
Hence, we have studied and implemented the Naïve Bayes classification algorithm using SQL Server and JDBC to connect from Java.
Data Cleaning - Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing each missing value with the most commonly occurring value for that attribute.
Relevance Analysis - The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
Normalization - The data is transformed using normalization, which involves scaling all values of a given attribute so that they fall within a small specified range.
Data Transformation - The data is generalized to higher-level concepts using concept hierarchies, and normalized by scaling the values.
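The normalization step described above can be sketched with min-max scaling; the class and method names here are illustrative:

```java
// Min-max normalization: scale all values of an attribute into [0, 1],
// i.e. v' = (v - min) / (max - min).
public class MinMaxNormalize {

    static double[] normalize(double[] values) {
        double min = values[0], max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - min) / (max - min); // falls in [0, 1]
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. scaling the Height attribute from Part B
        double[] heights = {1.5, 1.7, 1.95, 2.0};
        for (double v : normalize(heights)) {
            System.out.println(v);
        }
    }
}
```

After scaling, the smallest value maps to 0 and the largest to 1, so attributes measured on very different scales become comparable.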
Q3. Summarize all approaches used for classification with their advantages and limitations.
1) Naïve Bayes Classifier
Naive Bayes classifiers are highly scalable, requiring a number of
parameters linear in the number of variables (features/predictors) in a
learning problem. Maximum-likelihood training can be done by evaluating a
closed-form expression, which takes linear time, rather than by expensive
iterative approximation as used for many other types of classifiers.
Advantages:
- Combines prior knowledge and observed data
- Probabilistic hypotheses: Outputs probability distribution over all classes.
Limitations:
- Independence assumption may not hold for some attributes.
- If you have no occurrences of a class label and a certain attribute value
together (e.g. class="nice", shape="sphere") then the frequency-based
probability estimate will be zero.
- If the Naïve Bayes classification algorithm is used with only a small data set, precision and recall will remain very low.
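The zero-frequency limitation above is commonly handled with Laplace (add-one) smoothing; a minimal sketch, with illustrative names and example counts:

```java
// Laplace (add-one) smoothing: avoids a zero probability estimate when a
// class/value combination never occurs in the training data, by adding 1
// to every count.
public class LaplaceSmoothing {

    // count:      occurrences of (attribute value, class) in training data
    // classTotal: number of training tuples in the class
    // numValues:  number of distinct values the attribute can take
    static double smoothed(int count, int classTotal, int numValues) {
        return (count + 1.0) / (classTotal + numValues);
    }

    public static void main(String[] args) {
        // e.g. shape="sphere" never seen with class="nice":
        // 0 of 8 tuples, attribute has 3 possible shapes
        System.out.println(smoothed(0, 8, 3)); // 1/11 instead of 0
    }
}
```

The estimate is no longer zero, so a single unseen combination cannot wipe out the whole product P(X|Ci).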
2) K-nearest neighbor
k-NN is a type of instance-based learning, or lazy learning, where the function
is only approximated locally and all computation is deferred until classification.
The k-NN algorithm is among the simplest of all machine learning algorithms.
Advantages:
Simplicity, effectiveness, intuitiveness and competitive classification
performance in many domains are the advantages. It is Robust to noisy
training data and is effective if the training data is large.
Limitations:
- In distance-based learning, it is not clear which type of distance to use, and which attributes to use, to produce the best results.
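A minimal 1-D k-NN sketch, assuming the same Short/Medium/Tall height setting as Part B (class, method, and data here are all illustrative):

```java
import java.util.*;

// k-nearest neighbours on a single attribute: classify a query height by
// majority vote among the k closest training points.
public class KnnSketch {

    static String classify(double[] heights, String[] labels, double query, int k) {
        Integer[] idx = new Integer[heights.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training point indices by distance to the query.
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(heights[i] - query)));
        // Majority vote among the k nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            votes.merge(labels[idx[i]], 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[] h = {1.5, 1.55, 1.7, 1.72, 1.9, 1.95};
        String[] l = {"Short", "Short", "Medium", "Medium", "Tall", "Tall"};
        System.out.println(classify(h, l, 1.93, 3)); // Tall
    }
}
```

Note that all computation happens at query time (lazy learning): there is no training phase beyond storing the data.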
3) Decision Tree
Advantages:
- Able to generate understandable rules.
- Performs classification without requiring much computation.
- Able to handle both continuous and categorical values.
- Provides a clear indication of which fields are most important for prediction or classification.
Limitations:
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree.
- Decision-tree learners can create over-complex trees that do not generalise well from the training data.
- There are concepts that are hard to learn because decision trees do not
express them easily, such as XOR, parity or multiplexer problems. In such
cases, the decision tree becomes prohibitively large.