PART A
(PART A: TO BE REFERRED BY STUDENTS)
Experiment No.06
Aim: Implementation of Naïve Bayes Classifier
Prerequisites:
C/C++/Java Programming
Learning Outcomes:
Concepts of Bayesian theorem and Classification
Theory:
Bayesian Theorem
Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

P(H|X) = P(X|H) P(H) / P(X)
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn).
Suppose there are m classes C1, C2, ..., Cm.
Classification is to derive the maximum posteriori, i.e., the maximal P(Ci|X). This can be derived from Bayes' theorem:

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

Since P(X) is the same for all classes, only the numerator

P(X|Ci) P(Ci)

needs to be maximized.
Example
Class:
C1: buys_computer = yes
C2: buys_computer = no
Data sample:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Training data:

age      income   student  credit_rating  buys_computer
<=30     high     no       fair           no
<=30     high     no       excellent      no
31...40  high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31...40  low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31...40  medium   no       excellent      yes
31...40  high     yes      fair           yes
>40      medium   no       excellent      no
Solution:
P(Ci):
P(buys_computer = yes) = 9/14 = 0.643
P(buys_computer = no)  = 5/14 = 0.357

Compute P(X|Ci) for each class:
P(age <= 30 | buys_computer = yes)        = 2/9 = 0.222
P(age <= 30 | buys_computer = no)         = 3/5 = 0.600
P(income = medium | buys_computer = yes)  = 4/9 = 0.444
P(income = medium | buys_computer = no)   = 2/5 = 0.400
P(student = yes | buys_computer = yes)    = 6/9 = 0.667
P(student = yes | buys_computer = no)     = 1/5 = 0.200
P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
P(credit_rating = fair | buys_computer = no)  = 2/5 = 0.400

P(X | buys_computer = yes) = 0.222 * 0.444 * 0.667 * 0.667 = 0.044
P(X | buys_computer = no)  = 0.600 * 0.400 * 0.200 * 0.400 = 0.019

P(X | buys_computer = yes) P(buys_computer = yes) = 0.044 * 0.643 = 0.028
P(X | buys_computer = no)  P(buys_computer = no)  = 0.019 * 0.357 = 0.007

Therefore, X belongs to class C1 (buys_computer = yes).
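The worked example above can be sketched in Java, with the counts hard-coded from the training table. The class and method names here are illustrative only, not part of the lab code:

```java
// A minimal sketch of the buys_computer example: the Naive Bayes score
// for a class is the prior multiplied by the conditional probabilities.
public class NaiveBayesExample {

    // score = P(Ci) * P(x1|Ci) * P(x2|Ci) * ... for the sample X
    static double score(double prior, double... conditionals) {
        double p = prior;
        for (double c : conditionals) {
            p *= c;
        }
        return p;
    }

    public static void main(String[] args) {
        // Counts read off the 14-tuple training table: 9 "yes", 5 "no".
        // X = (age<=30, income=medium, student=yes, credit_rating=fair)
        double scoreYes = score(9.0 / 14, 2.0 / 9, 4.0 / 9, 6.0 / 9, 6.0 / 9);
        double scoreNo  = score(5.0 / 14, 3.0 / 5, 2.0 / 5, 1.0 / 5, 2.0 / 5);
        System.out.printf("P(X|yes)P(yes) = %.4f%n", scoreYes); // ~0.0282
        System.out.printf("P(X|no)P(no)   = %.4f%n", scoreNo);  // ~0.0069
        System.out.println(scoreYes > scoreNo
                ? "buys_computer = yes" : "buys_computer = no");
    }
}
```

Running it reproduces the hand computation: the "yes" class wins, so X is classified as buys_computer = yes.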
PART B
(PART B: TO BE COMPLETED BY STUDENTS)
(Students must submit the soft copy as per the following segments within two hours of the practical slot. The soft copy must be uploaded on Blackboard, or emailed to the concerned lab in-charge faculty at the end of the practical in case there is no Blackboard access available.)
package bayesian;

import java.sql.*;
import java.util.Scanner;

/**
 * Naive Bayes classifier: predicts whether a person is Short, Medium or
 * Tall from gender and height, using counts queried from a
 * Bayesian_table(Gender, Height, Output) table over the JDBC-ODBC bridge.
 *
 * @author mpstme.student
 */
public class Bayesian {

    public static void main(String[] args) {
        String range[] = {"Short", "Medium", "Tall"};
        // Height buckets: (0,1.6], (1.6,1.7], (1.7,1.8], (1.8,1.9], (1.9,2], (2,+inf)
        double bounds[] = {0, 1.6, 1.7, 1.8, 1.9, 2, Integer.MAX_VALUE};
        double male[] = new double[3];               // P(Gender=M | class)
        double female[] = new double[3];             // P(Gender=F | class)
        double height_probab[][] = new double[6][3]; // P(height bucket | class)
        double total_range[] = new double[3];        // #tuples per class
        double probab_range[] = new double[3];       // prior P(class)
        double prob_height[] = new double[3];        // P(entered height's bucket | class)
        double likelihood[] = new double[3];
        double tot_likelihood_range = 0;
        double p_range_t[] = new double[3];          // posterior P(class | evidence)
        Scanner sc = new Scanner(System.in);

        try {
            Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
            Connection con = DriverManager.getConnection("jdbc:odbc:Classification");
            Statement st = con.createStatement();
            ResultSet rs;

            // Count tuples of each gender per class, then normalize.
            for (int i = 0; i < range.length; i++) {
                rs = st.executeQuery("Select count(*) from Bayesian_table where Gender='M' and Output='" + range[i] + "'");
                while (rs.next()) male[i] = rs.getInt(1);
                rs = st.executeQuery("Select count(*) from Bayesian_table where Gender='F' and Output='" + range[i] + "'");
                while (rs.next()) female[i] = rs.getInt(1);
            }
            for (int i = 0; i < 3; i++) {
                double total = male[i] + female[i];
                male[i] /= total;
                female[i] /= total;
            }

            // Count tuples in each height bucket per class, then normalize
            // and compute the class priors.
            for (int i = 0; i < range.length; i++) {
                for (int j = 0; j < 6; j++) {
                    rs = st.executeQuery("Select count(*) from Bayesian_table where Height>" + bounds[j]
                            + " and Height<=" + bounds[j + 1] + " and Output='" + range[i] + "'");
                    while (rs.next()) height_probab[j][i] = rs.getInt(1);
                }
            }
            double grand_total = 0;
            for (int i = 0; i < 3; i++) {
                double total = 0;
                for (int j = 0; j < 6; j++) total += height_probab[j][i];
                total_range[i] = total;
                grand_total += total;
                for (int j = 0; j < 6; j++) height_probab[j][i] /= total;
            }
            for (int i = 0; i < 3; i++) probab_range[i] = total_range[i] / grand_total;

            System.out.println("Enter Details of the person");
            System.out.println("Enter Name");
            String name = sc.next();
            System.out.println("Enter Gender M/F");
            String gender = sc.next();
            System.out.println("Enter Height");
            double height = sc.nextDouble();

            // Locate the bucket the entered height falls into.
            int bucket = 5;
            for (int j = 0; j < 6; j++) {
                if (height > bounds[j] && height <= bounds[j + 1]) {
                    bucket = j;
                    break;
                }
            }
            for (int i = 0; i < 3; i++) prob_height[i] = height_probab[bucket][i];

            // Likelihood of each class: P(gender|class) * P(height|class) * P(class).
            for (int i = 0; i < 3; i++) {
                double p_gender = gender.equals("M") ? male[i] : female[i];
                likelihood[i] = p_gender * prob_height[i] * probab_range[i];
                tot_likelihood_range += likelihood[i];
            }

            // Normalize to posteriors and report the most probable class.
            int index = 0;
            for (int i = 0; i < 3; i++) {
                p_range_t[i] = likelihood[i] / tot_likelihood_range;
                if (p_range_t[i] > p_range_t[index]) index = i;
            }
            System.out.println(name + " is categorised as " + range[index]);

            con.close();
        } catch (Exception e) {
            e.printStackTrace();
            System.err.println("Exception: " + e.getMessage());
        }
    }
}
(Paste your program input and output in the following format. If there is an error, paste the specific error in the output part. In case of an error, with due permission of the faculty, an extension can be given to submit the error-free code with output in due course of time. Students will be graded accordingly.)
run:
Enter Details of the person
Enter Name
Adam
Enter Gender M/F
M
Enter Height
1.95
Adam is categorised as Tall
BUILD SUCCESSFUL (total time: 13 seconds)
B.4 Conclusion:
(Students must write the conclusions based on their learning)
Hence, we have studied and implemented the Naïve Bayes classification algorithm using SQL Server and JDBC to connect from Java.
Data Cleaning - Data cleaning involves removing noise and treating missing values. Noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing each missing value with the most commonly occurring value for that attribute.
Relevance Analysis - The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.
Normalization - The data is transformed using normalization, which involves scaling all values of a given attribute so that they fall within a small specified range.
Data Transformation - The data is generalized to higher-level concepts using concept hierarchies, and normalized by scaling the values.
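The normalization step described above can be sketched with min-max scaling; the class and method names here are illustrative:

```java
// Min-max normalization: scale all values of an attribute into [0, 1],
// i.e. v' = (v - min) / (max - min).
public class MinMaxNormalize {

    static double[] normalize(double[] values) {
        double min = values[0], max = values[0];
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (values[i] - min) / (max - min); // falls in [0, 1]
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. scaling the Height attribute from Part B
        double[] heights = {1.5, 1.7, 1.95, 2.0};
        for (double v : normalize(heights)) {
            System.out.println(v);
        }
    }
}
```

After scaling, the smallest value maps to 0 and the largest to 1, so attributes measured on very different scales become comparable.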
Q3. Summarize all approaches used for classification with their advantages and limitations.
1) Naïve Bayes Classifier
Naive Bayes classifiers are highly scalable, requiring a number of
parameters linear in the number of variables (features/predictors) in a
learning problem. Maximum-likelihood training can be done by evaluating a
closed-form expression, which takes linear time, rather than by expensive
iterative approximation as used for many other types of classifiers.
Advantages:
- Combines prior knowledge and observed data
- Probabilistic hypotheses: Outputs probability distribution over all classes.
Limitations:
- Independence assumption may not hold for some attributes.
- If you have no occurrences of a class label and a certain attribute value
together (e.g. class="nice", shape="sphere") then the frequency-based
probability estimate will be zero.
- If the Naïve Bayes classification algorithm is used with only a small data set, precision and recall will remain very low.
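The zero-frequency limitation above is commonly handled with Laplace (add-one) smoothing; a minimal sketch, with illustrative names and example counts:

```java
// Laplace (add-one) smoothing: avoids a zero probability estimate when a
// class/value combination never occurs in the training data, by adding 1
// to every count.
public class LaplaceSmoothing {

    // count:      occurrences of (attribute value, class) in training data
    // classTotal: number of training tuples in the class
    // numValues:  number of distinct values the attribute can take
    static double smoothed(int count, int classTotal, int numValues) {
        return (count + 1.0) / (classTotal + numValues);
    }

    public static void main(String[] args) {
        // e.g. shape="sphere" never seen with class="nice":
        // 0 of 8 tuples, attribute has 3 possible shapes
        System.out.println(smoothed(0, 8, 3)); // 1/11 instead of 0
    }
}
```

The estimate is no longer zero, so a single unseen combination cannot wipe out the whole product P(X|Ci).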
2) K-nearest neighbor
k-NN is a type of instance-based learning, or lazy learning, where the function
is only approximated locally and all computation is deferred until classification.
The k-NN algorithm is among the simplest of all machine learning algorithms.
Advantages:
Simplicity, effectiveness, intuitiveness and competitive classification
performance in many domains are the advantages. It is Robust to noisy
training data and is effective if the training data is large.
Limitations:
- In distance-based learning, it is not clear which type of distance to use, and which attributes to use, to produce the best results.
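A minimal 1-D k-NN sketch, assuming the same Short/Medium/Tall height setting as Part B (class, method, and data here are all illustrative):

```java
import java.util.*;

// k-nearest neighbours on a single attribute: classify a query height by
// majority vote among the k closest training points.
public class KnnSketch {

    static String classify(double[] heights, String[] labels, double query, int k) {
        Integer[] idx = new Integer[heights.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training point indices by distance to the query.
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(heights[i] - query)));
        // Majority vote among the k nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            votes.merge(labels[idx[i]], 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[] h = {1.5, 1.55, 1.7, 1.72, 1.9, 1.95};
        String[] l = {"Short", "Short", "Medium", "Medium", "Tall", "Tall"};
        System.out.println(classify(h, l, 1.93, 3)); // Tall
    }
}
```

Note that all computation happens at query time (lazy learning): there is no training phase beyond storing the data.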
3) Decision Tree
Advantages:
- Able to generate understandable rules.
- Performs classification without requiring much computation.
- Able to handle both continuous and categorical values.
- Provides a clear indication of which fields are most important for prediction or classification.
Limitations:
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality, even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree.
- Decision-tree learners can create over-complex trees that do not generalise well from the training data.
- There are concepts that are hard to learn because decision trees do not
express them easily, such as XOR, parity or multiplexer problems. In such
cases, the decision tree becomes prohibitively large.