Strategy For Complete Discriminant Analysis

SW388R7
Strategy for Complete Discriminant Analysis

Data Analysis &
Computers II
Slide 1
Assumption of normality, linearity, and homogeneity
Outliers
Multicollinearity
Validation
Sample problem
Steps in solving problems

Assumptions of normality, linearity, and homogeneity of variance

• The ability of discriminant analysis to extract discriminant functions
that are capable of producing accurate classifications is enhanced
when the assumptions of normality, linearity, and homogeneity of
variance are satisfied.
• We will use the script for testing normality, and test substituting
the log, square root, or inverse transformation when one of them induces
normality in a variable that fails to satisfy the criteria for normality.
• We can compare the accuracy rate of a model using transformed
variables to that of a model without them to evaluate whether the
improvement gained by transforming variables is sufficient to justify
the interpretational burden of explaining the transformations.
Assumption of linearity in discriminant analysis

• Since the dependent variable is non-metric in discriminant analysis,
there is no linear relationship between the dependent variable and
an independent variable.
• In discriminant analysis, the assumption of linearity applies to the
relationships between pairs of independent variables. To identify
violations of linearity, each metric independent variable would have to
be tested against all others.
• Since non-linearity only reduces the power to detect relationships, the
general advice is to attend to it only when we know that a variable in
our analysis has consistently demonstrated non-linear relationships with
other independent variables.
• We will not test for linearity in our problems.

Assumption of homogeneity of variance

• The assumption of homogeneity of variance is particularly important in
the classification stage of discriminant analysis.
• If one of the groups defined by the dependent variable has greater
dispersion than the others, cases will tend to be overclassified into it.
• Homogeneity of variance is tested with Box's M test, which tests the
null hypothesis that the group variance-covariance matrices are equal.
If we fail to reject this null hypothesis and conclude that the
variances are equal, we use the SPSS default of a pooled
covariance matrix in classification.
• If we reject the null hypothesis and conclude that the variances are
heterogeneous, we substitute separate covariance matrices in the
classification, and evaluate whether or not our classification accuracy
is improved.
Detecting outliers in discriminant analysis - 1

• In the classification phase of discriminant analysis, each case will be
predicted to be a member of one of the groups defined by the
dependent variable.
• The assignment is based on proximity, i.e. the case will be assigned to
the group it is closest to in multidimensional space.
• Just as we use z-scores to measure the location of a case in a
distribution with a given mean and standard deviation, we can use
Mahalanobis distance as a measure of the location of a case relative to
the centroid and covariance matrix for the cases in a group.
The centroid and covariance matrix are the
multivariate equivalents of a mean and standard deviation.
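
To make the z-score analogy concrete, the following is a minimal Python sketch of the squared Mahalanobis distance for the two-variable case. The function name `squared_mahalanobis_2d` and the example numbers are hypothetical, chosen only for illustration; SPSS computes this quantity itself in the Casewise Statistics output.

```python
def squared_mahalanobis_2d(point, centroid, cov):
    """Squared Mahalanobis distance of a case from a group centroid
    for two variables. cov is the 2x2 covariance matrix written as
    ((s11, s12), (s21, s22)). Hypothetical helper for illustration."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det <= 0:
        raise ValueError("covariance matrix must be positive definite")
    dx = point[0] - centroid[0]
    dy = point[1] - centroid[1]
    # d' * inverse(cov) * d, written out for the 2x2 case
    return (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det


# With uncorrelated variables the result is the sum of squared z-scores:
# a case 2 units from the mean on a variable with variance 4 (sd = 2)
# has z = 1, so its squared distance is 1.
d2 = squared_mahalanobis_2d((2.0, 0.0), (0.0, 0.0), ((4.0, 0.0), (0.0, 1.0)))
print(d2)
```

When the variables are correlated, the off-diagonal covariance terms adjust the distance, which is what distinguishes Mahalanobis distance from a plain Euclidean distance on standardized scores.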
• According to the SPSS Base 10.0 Applications Guide, page 259, "cases
with large values of Mahalanobis distance from their group mean can
be identified as outliers."
• In the Casewise Statistics output, SPSS provides us with the Squared
Mahalanobis Distance to the Centroid for each of the groups defined
by the dependent variable.
• If a case has a large Squared Mahalanobis Distance to the Centroid of
the group it is most likely to belong to, it is an outlier.
• If we calculate the critical value that identifies a "large" value for
Mahalanobis D² distance, we can scan the Casewise Statistics table to
identify outliers.
• When we identified multivariate outliers, we used the SPSS function
CDF.CHISQ to calculate the probability of obtaining a D² of a certain
size, given the number of independent variables in the analysis.
• SPSS has a parallel function, IDF.CHISQ, that computes the size of D²
needed to reach a specific probability, given the number of
independent variables in the analysis.
• Since we are dealing with the classification phase of discriminant
analysis, we use the number of independent variables included in
computing the discriminant scores for cases.
• For simultaneous discriminant analysis, in which all independent
variables are entered at the same time, we use the total number of
independent variables in the calculations for the critical value for D².
• For stepwise discriminant analysis, in which variables are entered by
statistical criteria, we use the number of variables satisfying the
statistical criteria in the calculations for the critical value for D².
• We will identify outliers as cases whose probability of being in
the group that they are most likely to belong to is 0.01 or less.
Since the IDF.CHISQ function is based on cumulative
probabilities from the left tail of the distribution through the
critical value, we will use 1.00 - 0.01 = 0.99 as the probability
in the IDF.CHISQ function.
• For simultaneous discriminant analysis with 4 independent
variables, the compute command for the critical value of D² is:
COMPUTE critval = IDF.CHISQ(0.99, 4).
• For stepwise discriminant analysis in which 2 of four
independent variables are entered, the compute command for the critical
value of D² is:
COMPUTE critval = IDF.CHISQ(0.99, 2).
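
The critical values produced by the IDF.CHISQ commands above can be cross-checked outside SPSS. The sketch below is a minimal Python version that uses only the standard library; `chi2_cdf` and `chi2_critical_value` are hypothetical helper names (not SPSS functions), and the inverse is found by simple bisection rather than by whatever method SPSS uses internally.

```python
import math


def chi2_cdf(x, df):
    """Cumulative probability from the left tail up to x for a
    chi-square distribution with df degrees of freedom, using the
    series expansion of the regularized lower incomplete gamma."""
    if x <= 0:
        return 0.0
    a = df / 2.0
    z = x / 2.0
    term = 1.0
    total = 1.0
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))


def chi2_critical_value(prob, df):
    """Value of D-squared whose left-tail cumulative probability is prob
    (the role played by IDF.CHISQ), located by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Simultaneous entry with 4 independent variables, then stepwise with 2 entered.
print(round(chi2_critical_value(0.99, 4), 3))  # 13.277
print(round(chi2_critical_value(0.99, 2), 3))  # 9.21
```

Cases whose squared Mahalanobis distance to their most likely group exceeds the printed value for the relevant number of variables would be flagged as outliers under the 0.01 criterion.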
Multicollinearity

• Multicollinearity has the same effect in discriminant analysis
that it does in multiple regression, i.e. the importance of an
independent variable will be undervalued because it has a very
strong relationship to another independent variable or
combination of independent variables.
• As in multiple regression, multicollinearity in discriminant
analysis is identified by examining tolerance values.
• While tolerance is routinely included in the output for the
stepwise method for including variables, it is not included for
simultaneous entry of variables. If a tolerance problem occurs
in a simultaneous entry problem, SPSS will include a table titled
"Variables Failing Tolerance Test."
• We should not attempt to interpret an analysis with a
multicollinearity problem until we have resolved the problem
by removing or combining the problematic variable.
Validation

• The primary criteria for a successful discriminant analysis are:
   • the existence of sufficient statistically significant
   discriminant functions to distinguish among the groups
   defined by the dependent variable, and
   • an accuracy rate that substantially improves on the accuracy
   rate obtainable by chance alone.
• We are also concerned with the role of the individual
independent variables, but we can expect to see greater
variation in this aspect of validation. If we can verify
the number of discriminant functions and the accuracy rate, we
will only add a caution to the findings when the role of the
variables is not the same among split-sample models.
Overall strategy for solving problems

1. Run a baseline discriminant analysis using the method for including
variables implied by the problem statement to find the baseline
cross-validated accuracy rate for the model.
2. Test for useful transformations to improve normality.
3. Substitute transformed variables and check for outliers.
4. If the cross-validated accuracy rate from the discriminant analysis using
transformed variables and omitting outliers is at least 2% better than the
baseline cross-validated accuracy rate, select it for interpretation;
otherwise select the baseline model.
5. If the Box's M statistic is statistically significant, we violate the
assumption of homogeneity of variance and re-run the analysis using
separate covariance matrices for classification. If the accuracy rate
increases by more than 2%, we interpret this model; otherwise we return
to the model using pooled covariance.
6. If the cross-validated accuracy rate is at least 25% higher than the
proportional by chance accuracy rate, interpret the selected
discriminant model:
   • Number of functions and importance of predictors
   • Role of individual variables on functions distinguishing among groups
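
The two accuracy comparisons in this strategy can be written as simple decision rules. The following Python sketch is illustrative only: the function names are hypothetical, accuracy rates are expressed in percent, the 2% gain is read as two percentage points, and "25% higher than chance" is read as 1.25 times the chance rate. Both readings are assumptions about how the thresholds are applied.

```python
def choose_model(baseline_acc, candidate_acc, min_gain=2.0):
    """Keep the candidate model (transformed variables, outliers omitted,
    or separate covariance matrices) only if its cross-validated accuracy
    beats the baseline by at least min_gain percentage points."""
    if candidate_acc - baseline_acc >= min_gain:
        return "candidate"
    return "baseline"


def is_interpretable(cross_validated_acc, chance_acc, margin=0.25):
    """True when the cross-validated accuracy rate is at least 25% higher
    than the proportional by chance accuracy rate."""
    return cross_validated_acc >= (1.0 + margin) * chance_acc


# Hypothetical accuracy rates, in percent, for illustration only.
print(choose_model(50.0, 53.0))      # candidate
print(choose_model(50.0, 51.0))      # baseline
print(is_interpretable(50.0, 35.0))  # True, since 1.25 * 35.0 = 43.75
```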
Problem 1

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01
for evaluating assumptions. Use a level of significance of 0.05 for evaluating the statistical
relationship.
From the list of variables "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful
predictors for distinguishing among groups based on responses to "opinion about spending on
welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey
respondents who thought we spend too much money on welfare from survey respondents who
thought we spend about the right amount of money on welfare who, in turn, are differentiated
from survey respondents who thought we spend too little money on welfare.
The most important predictor of groups based on responses to opinion about spending on welfare
was number of hours worked in the past week. The second most important predictor of groups
based on responses to opinion about spending on welfare was self-employment. The third most
important predictor of groups based on responses to opinion about spending on welfare was
highest year of school completed.
Survey respondents who thought we spend about the right amount of money on welfare worked
fewer hours in the past week than survey respondents who thought we spend too much or little
money on welfare. Survey respondents who thought we spend about the right amount of money
on welfare had completed more years of school than survey respondents who thought we spend
too much or little money on welfare. Survey respondents who thought we spend too much money
on welfare were more likely to be self-employed than survey respondents who thought we spend
too little money on welfare.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Dissecting problem 1 - 1

The problem may give us different levels of significance for the analysis.
In this problem, we are told to use 0.05 as alpha for the discriminant
analysis, but 0.01 for testing assumptions.

The variables listed first in the problem statement are the independent
variables (IVs): "number of hours worked in the past week" [hrs1],
"self-employment" [wrkslf], "highest year of school completed" [educ],
and "income" [rincom98].

The variable used to define groups is the dependent variable (DV):
"opinion about spending on welfare" [natfare].

When a problem asks us to identify the best or most useful predictors
from a list of independent variables, we do stepwise discriminant analysis.

The problem identifies three groups for the dependent variable:
• survey respondents who thought we spend too much money on welfare
• survey respondents who thought we spend about the right amount of
money on welfare
• survey respondents who thought we spend too little money on welfare.
To distinguish among three groups, the analysis will be required to find
two statistically significant discriminant functions.

In a stepwise analysis, we only interpret the independent variables that
are entered in the stepwise analysis.

The importance of individual predictors is based on order of entry in
the analysis.

The specific relationships listed in the problem indicate how the
independent variable relates to groups of the dependent variable, e.g.,
the mean for hours worked in the past week will be lower for respondents
who think we spend the right amount of money versus respondents who think
we spend too much or too little.

In order for a stepwise analysis to be true, we must have enough
statistically significant functions to distinguish among the groups, the
order of entry must be correct, and each significant relationship must be
interpreted correctly.

LEVEL OF MEASUREMENT - 1

Discriminant analysis requires that the dependent variable be non-metric
and the independent variables be metric or dichotomous. "Opinion about
spending on welfare" [natfare] is an ordinal level variable, which
satisfies the level of measurement requirement.

It contains three categories: survey respondents who thought we spend too
much money on welfare, survey respondents who thought we spend about the
right amount of money on welfare, and survey respondents who thought we
spend too little money on welfare.

LEVEL OF MEASUREMENT - 2

"Number of hours worked in the past week" [hrs1] and "highest year of
school completed" [educ] are interval level variables, which satisfy the
level of measurement requirements for discriminant analysis.

"Income" [rincom98] is an ordinal level variable. If we follow the
convention of treating ordinal level variables as metric variables, the
level of measurement requirement for discriminant analysis is satisfied.
Since some data analysts do not agree with this convention, a note of
caution should be included in our interpretation.

"Self-employment" [wrkslf] is a dichotomous or dummy-coded nominal
variable which may be included in discriminant analysis.

The baseline discriminant analysis

We begin our analysis by running a stepwise discriminant analysis with
natfare as the dependent variable and hrs1, wrkslf, educ, and rincom98
as the independent variables.
Select the Classify | Discriminant… command from the Analyze menu.

Selecting the dependent variable

First, highlight the dependent variable natfare in the list of variables.
Second, click on the right arrow button to move the dependent variable to
the Grouping Variable text box.

Defining the group values

When SPSS moves the dependent variable to the Grouping Variable text box,
it puts two question marks in parentheses after the variable name. This is
a reminder that we have to enter the numbers that represent the groups we
want to include in the analysis.
To specify the group numbers, click on the Define Range… button.

Completing the range of group values

The value labels for natfare show three categories:
1 = TOO LITTLE
2 = ABOUT RIGHT
3 = TOO MUCH
The range of values that we need to enter goes from 1 as the minimum to
3 as the maximum.
First, type 1 in the Minimum text box.
Second, type 3 in the Maximum text box.
Third, click on the Continue button to close the dialog box.
Note: if we enter the wrong range of group numbers, e.g., 1 to 2 instead
of 1 to 3, SPSS will only include groups 1 and 2 in the analysis.

Specifying the method for including variables

SPSS provides us with two methods for including variables: entering all of
the independent variables at one time, and a stepwise method that selects
variables using a statistical test to determine the order in which
variables are included.
Since the problem asks about the best subset of predictors and their
importance, we mark the option button to Use stepwise method.

Requesting statistics for the output

Click on the Statistics… button to select the statistics we will need for
the analysis.

Specifying statistical output

First, mark the Means checkbox on the Descriptives panel. We will use the
group means in our interpretation.
Second, mark the Univariate ANOVAs checkbox on the Descriptives panel.
Perusing these tests suggests which variables might be useful
discriminators.
Third, mark the Box's M checkbox. Box's M statistic evaluates conformity
to the assumption of homogeneity of group variances.
Fourth, click on the Continue button to close the dialog box.

Specifying details for the stepwise method

Click on the Method… button to specify the statistical criteria to use
for including variables.

Details for the stepwise method

First, mark the Mahalanobis distance option button on the Method panel.
Second, mark the Summary of steps checkbox to produce a summary table
when a new variable is added.
Third, click on the option button Use probability of F so that we can
incorporate the level of significance specified in the problem.
Fourth, type the level of significance in the Entry text box. The Removal
value is twice as large as the entry value.
Fifth, click on the Continue button to close the dialog box.

Specifying details for classification

Click on the Classify… button to specify details for the classification
phase of the analysis.

Details for classification - 1

First, mark the option button to Compute from group sizes on the Prior
Probabilities panel. This incorporates the size of the groups defined by
the dependent variable into the classification of cases using the
discriminant functions.
Second, mark the Casewise results checkbox on the Display panel to include
classification details for each case in the output.
Third, mark the Summary table checkbox to include summary tables comparing
actual and predicted classification.
Fourth, mark the Leave-one-out classification checkbox to request SPSS to
include a cross-validated classification in the output. This option
produces a less biased estimate of classification accuracy by sequentially
holding each case out of the calculations for the discriminant functions,
and using the derived functions to classify the case held out.
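
The leave-one-out procedure described above can be sketched in a few lines of Python. This is an illustration of the hold-one-out logic only: the classifier used here, `nearest_centroid`, is a simple stand-in based on Euclidean distance to group centroids and is not the discriminant-function classification that SPSS performs. All names and the toy data are hypothetical.

```python
def nearest_centroid(training, case):
    """Stand-in classifier: assign the case to the group whose centroid
    (mean of each variable over the training cases) is closest."""
    sums, counts = {}, {}
    for features, group in training:
        acc = sums.setdefault(group, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[group] = counts.get(group, 0) + 1
    best_group, best_dist = None, float("inf")
    for group, acc in sums.items():
        dist = sum((v - s / counts[group]) ** 2 for v, s in zip(case, acc))
        if dist < best_dist:
            best_group, best_dist = group, dist
    return best_group


def leave_one_out_accuracy(cases, classify=nearest_centroid):
    """Hold each case out in turn, fit on the remaining cases, classify
    the held-out case, and return the proportion classified correctly."""
    hits = 0
    for i, (features, group) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]
        if classify(training, features) == group:
            hits += 1
    return hits / len(cases)


# Toy data: two well-separated groups on two variables.
toy = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((0.9, 1.1), "a"),
       ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")]
print(leave_one_out_accuracy(toy))
```

Because each case is classified by a model that never saw it, the resulting accuracy rate is less optimistic than the rate obtained by classifying the same cases used to derive the functions.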
Fifth, accept the default of the Within-groups option button on the Use
Covariance Matrix panel. The covariance matrices are the measure of the
dispersion in the groups defined by the dependent variable. If we fail the
homogeneity of group variances test (Box's M), our option is to use
Separate-groups covariance in classification.
Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a
visual plot of the relationship between functions and groups defined by
the dependent variable.
Seventh, click on the Continue button to close the dialog box.

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant
analysis.

Initial sample size - 1

Analysis Case Processing Summary

Unweighted Cases                                             N   Percent
Valid                                                      138      51.1
Excluded  Missing or out-of-range group codes                7       2.6
          At least one missing discriminating variable     115      42.6
          Both missing or out-of-range group codes
          and at least one missing discriminating variable  10       3.7
          Total                                            132      48.9
Total                                                      270     100.0

The initial sample size before excluding outliers and influential cases
is 138 cases. With 4 independent variables, the ratio of cases to
variables is 34.5 to 1, satisfying both the minimum ratio of 5 cases
for each independent variable and the preferred
ratio of 20 to 1.
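
The ratio check above is simple arithmetic, shown here as a short Python sketch. The function name is hypothetical; the thresholds of 5 and 20 cases per independent variable are the ones stated on this slide.

```python
def cases_per_variable(valid_cases, n_independent_variables):
    """Ratio of valid cases to independent variables."""
    return valid_cases / n_independent_variables


ratio = cases_per_variable(138, 4)
meets_minimum = ratio >= 5      # minimum of 5 cases per independent variable
meets_preferred = ratio >= 20   # preferred ratio of 20 to 1
print(ratio, meets_minimum, meets_preferred)  # 34.5 True True
```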
Initial sample size - 2

Prior Probabilities for Groups

                      Cases Used in Analysis
WELFARE    Prior     Unweighted    Weighted
1           .406             56      56.000
2           .362             50      50.000
3           .232             32      32.000
Total      1.000            138     138.000

In addition to the requirement for the ratio of cases to independent
variables, discriminant analysis requires that there be a minimum
number of cases in the smallest group defined by the dependent variable.
The number of cases in the smallest group must be larger than the number
of independent variables, and the group should preferably contain 20 or
more cases.

The number of cases in the smallest group in this problem is 32, which is
larger than the number of independent variables (4), satisfying
the minimum requirement. In addition, the number of cases in the
smallest group satisfies the preferred minimum of 20 cases.

If the sample size does not initially satisfy the minimum requirements,
discriminant
analysis is not appropriate.
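
The smallest-group requirement can be checked the same way. In this minimal Python sketch the function name and the returned dictionary keys are hypothetical; the group sizes are the unweighted counts from the Prior Probabilities for Groups table.

```python
def smallest_group_check(group_sizes, n_independent_variables, preferred=20):
    """Compare the smallest group to the minimum requirement (more cases
    than independent variables) and the preferred minimum of 20 cases."""
    smallest = min(group_sizes.values())
    return {
        "smallest": smallest,
        "meets_minimum": smallest > n_independent_variables,
        "meets_preferred": smallest >= preferred,
    }


# Unweighted cases per natfare group, with 4 independent variables.
result = smallest_group_check({1: 56, 2: 50, 3: 32}, 4)
print(result)
```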
Classification accuracy before transformations or removing outliers

Classification Results (b, c)

                                            Predicted Group Membership
                         WELFARE               1       2       3     Total
Original         Count   1                    43      15       6        64
                         2                    26      30       6        62
                         3                    17      10       9        36
                         Ungrouped cases       3       3       2         8
                 %       1                  67.2    23.4     9.4     100.0
                         2                  41.9    48.4     9.7     100.0
                         3                  47.2    27.8    25.0     100.0
                         Ungrouped cases    37.5    37.5    25.0     100.0
Cross-validated  Count   1                    43      15       6        64
(a)                      2                    26      30       6        62
                         3                    17      11       8        36
                 %       1                  67.2    23.4     9.4     100.0
                         2                  41.9    48.4     9.7     100.0
                         3                  47.2    30.6    22.2     100.0

Prior to any transformations of variables to satisfy the assumptions of
discriminant analysis or removal of outliers, the cross-validated accuracy
rate was 50.0%.

This accuracy rate is the benchmark that we will use to evaluate the
utility of transformations and the elimination of outliers.

a. Cross validation is done only for those cases in the analysis. In cross validation, each case
is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.
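
The two accuracy rates in the footnotes can be reproduced from the counts in the Classification Results table: the correctly classified cases lie on the diagonal. The Python sketch below does this with hypothetical function names. It also computes a proportional by chance accuracy rate as the sum of squared prior probabilities; that formula is an assumption added here for illustration and is not stated in this part of the slides.

```python
def accuracy_rate(confusion):
    """Proportion of cases on the diagonal of a square table of
    actual (rows) by predicted (columns) group counts."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def proportional_chance(priors):
    """Assumed formula: sum of squared prior probabilities."""
    return sum(p * p for p in priors)


# Counts for groups 1 to 3 from the Classification Results table.
original = [[43, 15, 6], [26, 30, 6], [17, 10, 9]]
cross_validated = [[43, 15, 6], [26, 30, 6], [17, 11, 8]]

print(round(accuracy_rate(original) * 100, 1))         # 50.6
print(round(accuracy_rate(cross_validated) * 100, 1))  # 50.0

# Priors from the Prior Probabilities for Groups table.
chance = proportional_chance([0.406, 0.362, 0.232])
print(accuracy_rate(cross_validated) >= 1.25 * chance)  # True
```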
ASSUMPTION OF NORMALITY

There are three metric independent variables to test for normality:
• "number of hours worked in the past week" [hrs1],
• "highest year of school completed" [educ], and
• "income" [rincom98].
To test for normality, run the script:
NormalityAssumptionAndTransformations.SBS
First, move the three independent variables to the list box of variables
to test.
Second, click on the OK button to produce the output.

Normality of independent variable: highest year of school completed

Descriptives: HIGHEST YEAR OF SCHOOL COMPLETED

                                                 Statistic   Std. Error
Mean                                                 13.12         .179
95% Confidence Interval for Mean   Lower Bound       12.77
                                   Upper Bound       13.47
5% Trimmed Mean                                      13.14
Median                                               13.00
Variance                                             8.583
Std. Deviation                                       2.930
Minimum                                                  2
Maximum                                                 20
Range                                                   18
Interquartile Range                                   3.00
Skewness                                             -.137         .149
Kurtosis                                             1.246         .296

The independent variable "highest year of school completed" [educ] does
not satisfy the criteria for a normal distribution.
The skewness (-0.137) fell between -1.0 and +1.0, but the kurtosis
(1.246) fell outside the
range from -1.0 to +1.0.
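
The criterion applied here is that both skewness and kurtosis must fall between -1.0 and +1.0. As a minimal Python sketch (the function name is hypothetical), using the skewness and kurtosis values reported in the Descriptives tables of this deck:

```python
def satisfies_normality(skewness, kurtosis, bound=1.0):
    """True when both skewness and kurtosis lie between -bound and +bound."""
    return -bound <= skewness <= bound and -bound <= kurtosis <= bound


print(satisfies_normality(-0.137, 1.246))   # educ: False (kurtosis too high)
print(satisfies_normality(-0.324, 0.935))   # hrs1: True
print(satisfies_normality(-0.686, -0.253))  # rincom98: True
```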
Normality of independent variable: highest year of school completed

Neither the logarithmic, the square root, nor the inverse transformation
normalizes the variable.
A caution should be added to the findings.

Normality of independent variable: number of hours worked in the past week

Descriptives: NUMBER OF HOURS WORKED LAST WEEK

                                                 Statistic   Std. Error
Mean                                                 40.99         .958
95% Confidence Interval for Mean   Lower Bound       39.10
                                   Upper Bound       42.88
5% Trimmed Mean                                      41.21
Median                                               40.00
Variance                                           161.491
Std. Deviation                                      12.708
Minimum                                                  4
Maximum                                                 80
Range                                                   76
Skewness                                             -.324         .183
Kurtosis                                              .935         .364

The variable "number of hours worked in the past week" [hrs1] satisfies
the criteria for a normal distribution. The skewness (-0.324) and
kurtosis (0.935) were both between -1.0 and +1.0.

SW388R7
income
Computers II
Slide 42
Descriptives: RESPONDENTS INCOME
                                                 Statistic   Std. Error
Mean                                             13.35       .419
95% Confidence Interval for Mean, Lower Bound    12.52
95% Confidence Interval for Mean, Upper Bound    14.18
5% Trimmed Mean                                  13.54
Median                                           15.00
Variance                                         29.535
Std. Deviation                                   5.435
Minimum                                          1
Maximum                                          23
Range                                            22
Skewness                                         -.686       .187
Kurtosis                                         -.253       .373
The variable "income" [rincom98] satisfies the criteria for a normal distribution. The skewness (-0.686) and kurtosis (-0.253) were both between -1.0 and +1.0.
Since we do not have any transformations to test, we can use the baseline model to examine outliers.
If we had transformations to test, we would run the discriminant analysis again, substituting the transformed variables before examining outliers.
SW388R7
OUTLIERS
Data Analysis &
Computers II
Slide 43
The classification output for individual cases can be used to detect outliers. In this context, an outlier is a case that is distant from the centroid of the group to which it has the highest probability of belonging.
Distance from the centroid of a group is measured by Mahalanobis Distance.
To identify outliers, we scan the column looking for cases with a Mahalanobis D² distance greater than a critical value.
SW388R7
Using SPSS to calculate the critical value for Mahalanobis D²
Data Analysis &
Computers II
Slide 44
The critical value for Mahalanobis D² is that value that would achieve a specified level of statistical significance given the number of variables that were included in its calculation.
Specifically, we will use an SPSS function to give us the critical value for a probability of 0.01 with the degrees of freedom equal to the number of variables used to compute D².
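For readers who want to check the critical value outside SPSS, the following sketch computes the same chi-square quantile using only the Python standard library. It is an illustration under stated assumptions, not how SPSS implements IDF.CHISQ: the function names are hypothetical, the cumulative distribution is evaluated with the standard series for the incomplete gamma function, and the inverse is found by bisection.

```python
import math


def chi_square_cdf(x: float, df: int) -> float:
    """P(X <= x) for a chi-square variable, via the incomplete-gamma series."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    while term > total * 1e-15:
        term *= half_x / (a + k)
        total += term
        k += 1
    return total * math.exp(a * math.log(half_x) - half_x - math.lgamma(a))


def chi_square_critical_value(p: float, df: int) -> float:
    """Value whose cumulative probability is p, found by bisection."""
    low, high = 0.0, 1.0
    while chi_square_cdf(high, df) < p:
        high *= 2.0
    for _ in range(200):
        mid = (low + high) / 2.0
        if chi_square_cdf(mid, df) < p:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


# Cumulative probability 0.99 (significance 0.01) and 3 variables in the functions.
critval = chi_square_critical_value(0.99, 3)
print(round(critval, 3))  # 11.345
```

The result agrees with the critical value of 11.345 used later in this problem to identify outliers.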
SW388R7
The number of variables used to compute Mahalanobis D²
Data Analysis &
Computers II
Slide 45
Variables Entered/Removed (a,b,c,d)
Step  Entered                            Min. D Squared  Between Groups  Exact F  df1  df2      Sig.
1     NUMBER OF HOURS WORKED LAST WEEK   .023            1 and 3         .475     1    135.000  .492
2     R SELF-EMP OR WORKS FOR SOMEBODY   .251            1 and 2         3.289    2    134.000  .040
3     HIGHEST YEAR OF SCHOOL COMPLETED   .364            1 and 3         2.433    3    133.000  .068
At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 8.
b. Maximum significance of F to enter is .05.
c. Minimum significance of F to remove is .10.
d. F level, tolerance, or VIN insufficient for further computation.
In a direct entry discriminant analysis that includes all variables simultaneously, the number of variables used to compute the values of D² is equal to the number of independent variables included in the analysis.
In stepwise discriminant analysis, the number of variables used to compute the values of D² is equal to the number of independent variables selected for inclusion by the statistical procedure.
In this problem, 3 out of the 4 independent variables were used in the discriminant functions.
SW388R7
Computing the critical value for Mahalanobis D²
Data Analysis &
Computers II
Slide 46
First, we open the window to compute a new variable by selecting the Compute… command from the Transform menu.
SW388R7
Selecting the SPSS function

Data Analysis &
Computers II
Slide 47
First, we enter the name for the variable we want to create in the Target Variable textbox: critval, for critical value.
Second, we scroll down the list of SPSS functions to highlight the one we need: IDF.CHISQ(p, df)
Third, we click on the up arrow button to move the function to the Numeric Expression textbox.
SW388R7
Completing the function arguments

Data Analysis &
Computers II
Slide 48
First, the first argument to the IDF.CHISQ function, p, is replaced by the cumulative probability associated with the critical value, 0.99.
Second, the number of independent variables in the discriminant functions, 3, is used as the df, or degrees of freedom.
Third, click on the OK button to compute the variable.
SW388R7
The critical value for Mahalanobis D²

Data Analysis &
Computers II
Slide 49
The critical value is calculated as a new variable in the SPSS data editor. Even though we only need it calculated a single time, the Compute command creates a value for every case.
Now that we have the critical value, we can compare it to the values in the table of Casewise Statistics.
SW388R7
Skipping ungrouped cases

Data Analysis &
Computers II
Slide 50
Case 50 has a D² of 16.603, which is its distance from the centroid of its predicted group 3. However, the actual group for the case was "ungrouped", meaning it was missing data for the dependent variable. This case is not counted as an outlier because it is already omitted from the calculations for the discriminant functions.
SW388R7
Identifying outliers
Data Analysis &
Computers II
Slide 51
Case Number 176 has a D² of 11.553, which is its distance from the centroid of its predicted group 2, and which is larger than the critical value for D² of 11.345. This case is an outlier and should be omitted in our test for the impact of outliers on the analysis.
To omit it from the analysis, we will have to find its case id number and eliminate that. We cannot use case numbers to eliminate outliers, because omitting one case changes the case number for all of the other cases after it, and we are likely to exclude the wrong case.
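The scan of the Casewise Statistics table can be sketched as a simple filter. The two rows below are the cases discussed on these slides; the field names are hypothetical, and a real table would have one row for every case.

```python
CRITICAL_VALUE = 11.345  # critical value for D² with 3 variables at p = 0.01

# Rows taken from the walkthrough; "ungrouped" marks cases with no
# value on the dependent variable.
casewise = [
    {"case_number": 50, "ungrouped": True, "predicted_group": 3, "d_squared": 16.603},
    {"case_number": 176, "ungrouped": False, "predicted_group": 2, "d_squared": 11.553},
]

# A case is an outlier only if it took part in the analysis (it is grouped)
# and its D² to the predicted group exceeds the critical value.
outliers = [
    row["case_number"]
    for row in casewise
    if not row["ungrouped"] and row["d_squared"] > CRITICAL_VALUE
]
print(outliers)  # [176]
```

Case 50 has the larger D² but is skipped because it is ungrouped, which mirrors the reasoning on the previous slide.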
SW388R7
The caseid of the outlier

Data Analysis &
Computers II
Slide 52
To omit the outlier, we scroll down the data editor to case 176 and note its caseid value, "20001785."
In this data set, caseids are string or text data, and we represent their values in quotation marks.
SW388R7
Omitting the outliers

Data Analysis &
Computers II
Slide 53
To omit outliers, we select into the analysis the cases that are not outliers.
First, select the Select Cases… command from the Data menu.
SW388R7
Specifying the condition to omit outliers

Data Analysis &
Computers II
Slide 54
First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.
Second, click on the If… button to specify the criteria for inclusion in the analysis.
SW388R7
The formula for omitting outliers

Data Analysis &
Computers II
Slide 55
To eliminate the outliers, we request that the cases that are not outliers be included in the analysis. Using this formula, we are selecting cases that do not have a caseid of "20001785".
In the formula, the symbol ~= stands for "not equal to".
If we had more than one outlier, the formula would be expanded to:
caseid~="20001785" and caseid~="20005967" and caseid~="20006102" …
After typing in the formula, click on the Continue button to close the dialog box.
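The same selection logic, keeping every case whose caseid is not in the set of outliers, can be sketched in Python. Only the caseid "20001785" comes from this problem; the other caseids and the variable names are hypothetical placeholders.

```python
# caseids are text values, so they are written as strings.
outlier_caseids = {"20001785"}

# Hypothetical data rows; only the second caseid appears in this problem.
cases = [
    {"caseid": "20000001"},
    {"caseid": "20001785"},
    {"caseid": "20000002"},
]

# Equivalent in spirit to the SPSS condition caseid ~= "20001785".
selected = [case for case in cases if case["caseid"] not in outlier_caseids]
print([case["caseid"] for case in selected])
```

Filtering on the stable identifier rather than on row position avoids the problem noted earlier, where removing one case shifts the case numbers of all the cases after it.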
SW388R7
Completing the request for the selection

Data Analysis &
Computers II
Slide 56
To complete the
request, we click on
the OK button.
SW388R7
The omitted outlier

Data Analysis &
Computers II
Slide 57
SPSS identifies the excluded cases by drawing a slash mark through the case number.
SW388R7
Classification accuracy without the outlier

Data Analysis &
Computers II
Slide 58

                          WELFARE          1      2      3      Total
Original         Count    1                43     15     6      64
                          2                26     29     6      61
                          3                17     10     9      36
                          Ungrouped cases  3      3      2      8
                 %        1                67.2   23.4   9.4    100.0
                          2                42.6   47.5   9.8    100.0
                          3                47.2   27.8   25.0   100.0
                          Ungrouped cases  37.5   37.5   25.0   100.0
Cross-validated  Count    1                43     15     6      64
                          2                26     29     6      61
                          3                17     11     8      36
                 %        1                67.2   23.4   9.4    100.0
                          2                42.6   47.5   9.8    100.0
                          3                47.2   30.6   22.2   100.0
Prior to any transformations of variables to satisfy the assumptions of normality and the removal of outliers, the cross-validated classification accuracy rate was 50.0%.
After substituting transformed variables and removing outliers, the cross-validated classification accuracy rate was 49.7%.
SW388R7
SELECTION OF MODEL FOR INTERPRETATION

Data Analysis &
Computers II
Slide 59

                          WELFARE          1      2      3      Total
Original         Count    1                43     15     6      64
                          2                26     29     6      61
                          3                17     10     9      36
                 %        1                67.2   23.4   9.4    100.0
                          2                42.6   47.5   9.8    100.0
                          3                47.2   27.8   25.0   100.0
                          Ungrouped cases  37.5   37.5   25.0   100.0
Cross-validated  Count    1                43     15     6      64
                          2                26     29     6      61
                          3                17     11     8      36
                 %        1                67.2   23.4   9.4    100.0
                          2                42.6   47.5   9.8    100.0
                          3                47.2   30.6   22.2   100.0
Since the discriminant analysis using transformations and omitting outliers was less accurate in classifying cases than the discriminant analysis with all cases and no transformations, the discriminant analysis with all cases and no transformations was interpreted.
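The model selection rule can be written as a small function. This is a sketch, with a hypothetical function name; it uses the criterion stated in the decision steps of this presentation, that the revised model must be more accurate than the baseline by 2% or more, and it assumes that "2%" means two percentage points.

```python
def pick_model(baseline_accuracy: float, revised_accuracy: float,
               required_gain: float = 2.0) -> str:
    """Choose which discriminant analysis to interpret.

    Accuracies are cross-validated rates in percent. The revised model
    (transformations and/or outliers removed) is chosen only when it beats
    the baseline by at least required_gain percentage points.
    """
    if revised_accuracy - baseline_accuracy >= required_gain:
        return "revised"
    return "baseline"


# Cross-validated accuracy rates reported for this problem.
choice = pick_model(baseline_accuracy=50.0, revised_accuracy=49.7)
print(choice)  # baseline
```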
SW388R7
SAMPLE SIZE - 1
Data Analysis &
Computers II
Slide 60
Analysis Case Processing Summary
Unweighted Cases                                                       N     Percent
Valid                                                                  138   51.1
Excluded  Missing or out-of-range group codes                          7     2.6
          At least one missing discriminating variable                 115   42.6
          Both missing or out-of-range group codes and
          at least one missing discriminating variable                 10    3.7
          Total                                                        132   48.9
Total                                                                  270   100.0
The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 138 valid cases and 4 independent variables.
The ratio of cases to independent variables is 34.5 to 1, which satisfies the minimum requirement. In addition, the ratio of 34.5 to 1 satisfies the preferred ratio of 20 to 1.
SW388R7
SAMPLE SIZE - 2
Data Analysis &
Computers II
Slide 61
Cases Used in Analysis
WELFARE         Prior   Unweighted   Weighted
1 TOO LITTLE    .409    56           56.000
2 ABOUT RIGHT   .358    49           49.000
3 TOO MUCH      .234    32           32.000
Total           1.000   137          137.000
In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and preferably contain 20 or more cases.
The number of cases in the smallest group in this problem is 32, which is larger than the number of independent variables (4), satisfying the minimum requirement. In addition, the number of cases in the smallest group satisfies the preferred minimum of 20 cases.
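Both sample size requirements reduce to simple arithmetic, sketched below with the figures from this problem. The function and key names are hypothetical.

```python
def sample_size_check(valid_cases: int, independent_variables: int,
                      smallest_group: int) -> dict:
    """Apply the minimum and preferred sample size rules for discriminant analysis."""
    ratio = valid_cases / independent_variables
    return {
        "ratio": ratio,
        "meets_minimum_ratio": ratio >= 5,        # at least 5 to 1
        "meets_preferred_ratio": ratio >= 20,     # preferably 20 to 1
        "meets_minimum_group": smallest_group > independent_variables,
        "meets_preferred_group": smallest_group >= 20,
    }


# 138 valid cases, 4 independent variables, 32 cases in the smallest group.
check = sample_size_check(valid_cases=138, independent_variables=4, smallest_group=32)
print(check["ratio"])  # 34.5
```

If either preferred rule failed while the minimum rule held, the answer would carry a caution rather than being rejected.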
SW388R7
NUMBER OF DISCRIMINANT FUNCTIONS - 1

Data Analysis &
Computers II
Slide 62
The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables.
In this analysis there were 3 groups defined by opinion about spending on welfare and 4 independent variables, so the maximum possible number of discriminant functions was 2.
SW388R7
NUMBER OF DISCRIMINANT FUNCTIONS - 2

Data Analysis &
Computers II
Slide 63
In the table of Wilks' Lambda, which tested functions for statistical significance, the stepwise analysis identified 2 discriminant functions that were statistically significant. The Wilks' lambda statistic for the test of functions 1 through 2 (chi-square=21.853) had a probability of 0.001, which was less than or equal to the level of significance of 0.05.
After removing function 1, the Wilks' lambda statistic for the test of function 2 (chi-square=7.074) had a probability of 0.029, which was less than or equal to the level of significance of 0.05. The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 2 discriminant functions.
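The reasoning on the two slides about the number of discriminant functions can be sketched as follows. The function names are hypothetical; the probabilities are the ones reported for the Wilks' lambda tests in this problem.

```python
def max_discriminant_functions(groups: int, independent_variables: int) -> int:
    """Smaller of (number of groups - 1) and the number of independent variables."""
    return min(groups - 1, independent_variables)


def significant_functions(probabilities, alpha: float = 0.05) -> int:
    """Count functions whose Wilks' lambda test has probability <= alpha.

    Tests are sequential, so counting stops at the first non-significant one.
    """
    count = 0
    for p in probabilities:
        if p > alpha:
            break
        count += 1
    return count


possible = max_discriminant_functions(groups=3, independent_variables=4)
significant = significant_functions([0.001, 0.029])
print(possible, significant)  # 2 2
```

Because the number of significant functions equals the maximum possible number, a solution using 2 functions is supported.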
SW388R7
MULTICOLLINEARITY
Data Analysis &
Computers II
Slide 64
Multicollinearity occurs when one independent variable is so strongly correlated with one or more other variables that its relationship to the dependent variable is likely to be misinterpreted. Its potential unique contribution to explaining the dependent variable is minimized by its strong relationship to other independent variables. Multicollinearity is indicated when the tolerance value for an independent variable is less than 0.10.
The tolerance values for all of the independent variables are larger than 0.10. Multicollinearity is not a problem in this discriminant analysis.
SW388R7
Independent variables and group membership: relationship of functions to groups
Data Analysis &
Computers II
Slide 65
In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.
Functions at Group Centroids
WELFARE   Function 1   Function 2
1         -.220        .235
2         .446         -.031
3         -.311        -.362
Unstandardized canonical discriminant functions evaluated at group means
Function 1 separates survey respondents who thought we spend about the right amount of money on welfare (the positive value of 0.446) from survey respondents who thought we spend too much (negative value of -0.311) or too little money (negative value of -0.220) on welfare.
Function 2 separates survey respondents who thought we spend too little money on welfare (positive value of 0.235) from survey respondents who thought we spend too much money (negative value of -0.362) on welfare. We ignore the second group (-0.031) in this comparison because it was distinguished from the other two groups by function 1.
SW388R7
which predictors to interpret
Computers II
Slide 66
Variables Entered/Removed (a,b,c,d)
Step  Entered                            Min. D Squared  Between Groups  Exact F  df1  df2      Sig.
1     NUMBER OF HOURS WORKED LAST WEEK   .023            1 and 3         .475     1    135.000  .492
2     R SELF-EMP OR WORKS FOR SOMEBODY   .251            1 and 2         3.289    2    134.000  .040
3     HIGHEST YEAR OF SCHOOL COMPLETED   .364            1 and 3         2.433    3    133.000  .068
At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 8.
b. Maximum significance of F to enter is .05.
c. Minimum significance of F to remove is .10.
When we use the stepwise method of variable inclusion, we limit our interpretation of independent variable predictors to those listed as statistically significant in the table of Variables Entered/Removed.
We will interpret the impact on membership in groups defined by the dependent variable by the independent variables:
•number of hours worked in the past week
•self-employment
•highest year of school completed
Had we used simultaneous entry of all variables, we would not have imposed this limitation.
SW388R7
predictor loadings on functions
Computers II
Slide 67
We do not interpret loadings in the structure matrix unless they are 0.30 or higher.
Structure Matrix
                                     Function 1   Function 2
HIGHEST YEAR OF SCHOOL COMPLETED     .687*        .136
NUMBER OF HOURS WORKED LAST WEEK     -.582*       .345
R SELF-EMP OR WORKS FOR SOMEBODY     .223         .889*
RESPONDENTS INCOME (a)               .101         .292*
Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions
Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and any discriminant function
a. This variable not used in the analysis.
Based on the structure matrix, the predictor variables strongly associated with discriminant function 1, which distinguished between survey respondents who thought we spend about the right amount of money on welfare and survey respondents who thought we spend too much or too little money on welfare, were number of hours worked in the past week (r=-0.582) and highest year of school completed (r=0.687).
Based on the structure matrix, the predictor variable strongly associated with discriminant function 2, which distinguished between survey respondents who thought we spend too little money on welfare and survey respondents who thought we spend too much money on welfare, was self-employment (r=0.889).
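The way predictors are assigned to functions from the structure matrix can be sketched as below. The function name and data layout are hypothetical. The sketch assumes the reading used on this slide: each variable is associated with the function on which it has its largest absolute loading (the asterisk in the SPSS table), it is interpreted only if that loading is at least 0.30 in absolute value, and variables not used in the analysis are skipped.

```python
MIN_LOADING = 0.30

# variable: ((loading on function 1, loading on function 2), used in analysis)
structure_matrix = {
    "educ": ((0.687, 0.136), True),
    "hrs1": ((-0.582, 0.345), True),
    "wrkslf": ((0.223, 0.889), True),
    "rincom98": ((0.101, 0.292), False),  # footnote a: not used in the analysis
}


def predictors_by_function(matrix, min_loading=MIN_LOADING):
    """Map each function number to the predictors interpreted for it."""
    result = {}
    for name, (loadings, used) in matrix.items():
        if not used:
            continue
        # Index of the function with the largest absolute loading.
        best = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
        if abs(loadings[best]) >= min_loading:
            result.setdefault(best + 1, []).append(name)
    return result


assignment = predictors_by_function(structure_matrix)
print(assignment)  # {1: ['educ', 'hrs1'], 2: ['wrkslf']}
```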
SW388R7
predictors associated with first function - 1
Computers II
Slide 68
Group Statistics
WELFARE         Variable                           Mean    Std. Deviation   Valid N (listwise): Unweighted   Weighted
1 TOO LITTLE    NUMBER OF HOURS WORKED LAST WEEK   43.96   13.240           56                               56.000
                HIGHEST YEAR OF SCHOOL COMPLETED   13.73   2.401            56                               56.000
                R SELF-EMP OR WORKS FOR SOMEBODY   1.93    .260             56                               56.000
                RESPONDENTS INCOME                 13.70   5.034            56                               56.000
2 ABOUT RIGHT   NUMBER OF HOURS WORKED LAST WEEK   37.90   13.235           50                               50.000
                HIGHEST YEAR OF SCHOOL COMPLETED   14.78   2.558            50                               50.000
                R SELF-EMP OR WORKS FOR SOMEBODY   1.90    .303             50                               50.000
                RESPONDENTS INCOME                 14.00   5.503            50                               50.000
3 TOO MUCH      NUMBER OF HOURS WORKED LAST WEEK   42.03   10.456           32                               32.000
                HIGHEST YEAR OF SCHOOL COMPLETED   13.38   2.524            32                               32.000
                R SELF-EMP OR WORKS FOR SOMEBODY   1.75    .440             32                               32.000
                RESPONDENTS INCOME                 14.75   5.304            32                               32.000
Total           NUMBER OF HOURS WORKED LAST WEEK   41.32   12.846           138                              138.000
                HIGHEST YEAR OF SCHOOL COMPLETED   14.03   2.537            138                              138.000
The average number of hours worked in the past week for survey respondents who thought we spend about the right amount of money on welfare (mean=37.90) was lower than the average number of hours worked in the past week for survey respondents who thought we spend too little money on welfare (mean=43.96) and survey respondents who thought we spend too much money on welfare (mean=42.03).
This supports the relationship that "survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too little or too much money on welfare."
SW388R7
predictors associated with first function - 2
Computers II
Slide 69
Group Statistics: HIGHEST YEAR OF SCHOOL COMPLETED
WELFARE         Mean    Std. Deviation   Valid N (listwise): Unweighted   Weighted
1 TOO LITTLE    13.73   2.401            56                               56.000
2 ABOUT RIGHT   14.78   2.558            50                               50.000
3 TOO MUCH      13.38   2.524            32                               32.000
Total           14.03   2.537            138                              138.000
The average highest year of school completed for survey respondents who thought we spend about the right amount of money on welfare (mean=14.78) was higher than the average highest year of school completed for survey respondents who thought we spend too little money on welfare (mean=13.73) and survey respondents who thought we spend too much money on welfare (mean=13.38).
This supports the relationship that "survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too little or too much money on welfare."
SW388R7
predictors associated with second function
Computers II
Slide 70
Group Statistics: R SELF-EMP OR WORKS FOR SOMEBODY
WELFARE         Mean   Std. Deviation   Valid N (listwise): Unweighted   Weighted
1 TOO LITTLE    1.93   .260             56                               56.000
2 ABOUT RIGHT   1.90   .303             50                               50.000
3 TOO MUCH      1.75   .440             32                               32.000
Since self-employment is a dichotomous variable, the mean is not directly interpretable. Its interpretation must take into account the coding by which 1 corresponds to self-employed and 2 corresponds to working for someone else. The lower mean for survey respondents who thought we spend too much money on welfare (mean=1.75), when compared to the mean for survey respondents who thought we spend too little money on welfare (mean=1.93), implies that the group contained more survey respondents who were self-employed and fewer survey respondents who were working for someone else.
This supports the relationship that "survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare."
SW388R7
ASSUMPTION OF EQUAL DISPERSION FOR DEPENDENT VARIABLE GROUPS
Data Analysis &
Computers II
Slide 71
The assumption of equal dispersion for groups defined by the dependent variable only affects the classification phase of discriminant analysis, and so is not evaluated until we are determining the final accuracy rate of the model.
Box's M test evaluates the homogeneity of dispersion matrices across the subgroups of the dependent variable. The null hypothesis is that the dispersion matrices are homogeneous. If the analysis fails this test, we request the use of separate group dispersion matrices in the classification phase of the discriminant analysis to see if this improves our accuracy rate.
SW388R7
Computers II
Slide 72
In this analysis, Box's M statistic had a value of 19.386 with a probability of 0.096. Since the probability for Box's M is greater than the level of significance for testing assumptions (0.01), the null hypothesis is not rejected and the assumption of equal dispersion is satisfied.
We use the pooled or within-groups covariance matrix for classification.
SW388R7
Computers II
Slide 73
Had we rejected the null hypothesis and concluded that dispersion was not equal across groups, we would have run the analysis again, specifying separate-groups covariance matrices for classification.
If classification using separate covariance matrices were more accurate by 2% or more, we would report classification accuracy based on this model rather than the one that uses within-groups covariance.
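The decision about which covariance matrix to use for classification can be summarized in a short sketch. The function name is hypothetical, accuracies are in percent, and "2% or more" is assumed to mean two percentage points.

```python
def choose_covariance(box_m_probability: float, pooled_accuracy: float,
                      separate_accuracy=None, alpha: float = 0.01,
                      required_gain: float = 2.0) -> str:
    """Pick the covariance matrix used in the classification phase."""
    # Box's M not significant: equal dispersion is assumed, keep pooled matrix.
    if box_m_probability > alpha:
        return "within-groups"
    # Box's M significant: switch only if separate matrices classify
    # at least required_gain percentage points better.
    if separate_accuracy is not None and separate_accuracy - pooled_accuracy >= required_gain:
        return "separate-groups"
    return "within-groups"


# Box's M probability of 0.096 and cross-validated accuracy of 50.0% from this problem.
covariance = choose_covariance(box_m_probability=0.096, pooled_accuracy=50.0)
print(covariance)  # within-groups
```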
SW388R7
CLASSIFICATION USING THE DISCRIMINANT MODEL: by chance accuracy rate
Data Analysis &
Computers II
Slide 74
The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be 25% or more higher than the proportional by chance accuracy rate.
The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.406² + 0.362² + 0.232² = 0.350).
Cases Used in Analysis

WELFARE Prior Unweighted Weighted
1 TOO LITTLE .406 56 56.000
2 ABOUT RIGHT .362 50 50.000
3 TOO MUCH .232 32 32.000
Total 1.000 138 138.000
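The by chance calculation can be reproduced directly from the group sizes in the table above. This is a minimal sketch with hypothetical variable names; it works from the unweighted counts rather than the rounded prior probabilities, which gives the same result to three decimal places.

```python
# Unweighted group sizes from the Cases Used in Analysis table.
group_sizes = {"TOO LITTLE": 56, "ABOUT RIGHT": 50, "TOO MUCH": 32}
total_cases = sum(group_sizes.values())  # 138

# Proportional by chance accuracy: sum of squared group proportions.
by_chance = sum((n / total_cases) ** 2 for n in group_sizes.values())

# The criterion is 25% higher than the by chance rate.
criterion = 1.25 * by_chance

print(round(by_chance, 3))        # 0.35
print(round(criterion * 100, 1))  # 43.7
```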
SW388R7
CLASSIFICATION USING THE DISCRIMINANT MODEL: criteria for classification accuracy
Data Analysis &
Computers II
Slide 75
                            WELFARE          1 TOO LITTLE   2 ABOUT RIGHT   3 TOO MUCH   Total
Original           Count    1 TOO LITTLE     43             15              6            64
                            2 ABOUT RIGHT    26             30              6            62
                            3 TOO MUCH       17             10              9            36
                   %        1 TOO LITTLE     67.2           23.4            9.4          100.0
                            2 ABOUT RIGHT    41.9           48.4            9.7          100.0
                            3 TOO MUCH       47.2           27.8            25.0         100.0
                            Ungrouped cases  37.5           37.5            25.0         100.0
Cross-validated a  Count    1 TOO LITTLE     43             15              6            64
                            2 ABOUT RIGHT    26             30              6            62
                            3 TOO MUCH       17             11              8            36
                   %        1 TOO LITTLE     67.2           23.4            9.4          100.0
                            2 ABOUT RIGHT    41.9           48.4            9.7          100.0
                            3 TOO MUCH       47.2           30.6            22.2         100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
The cross-validated accuracy rate computed by SPSS was 50.0%, which was greater than or equal to the proportional by chance accuracy criterion of 43.7% (1.25 x 35.0% = 43.7%). The criterion for classification accuracy is satisfied.
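The accuracy rates that SPSS reports can be recovered from the counts in the classification table: the correctly classified cases are those on the diagonal, where the actual group (row) matches the predicted group (column). The sketch below uses the counts from this table; the function name is hypothetical, and the 43.7% criterion is the figure stated on this slide.

```python
def accuracy_rate(table) -> float:
    """Percent of cases on the diagonal of a square classification table."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return 100.0 * correct / total


# Rows are actual groups 1-3, columns are predicted groups 1-3.
original = [[43, 15, 6], [26, 30, 6], [17, 10, 9]]
cross_validated = [[43, 15, 6], [26, 30, 6], [17, 11, 8]]

BY_CHANCE_CRITERION = 43.7  # percent, as stated on this slide

original_rate = accuracy_rate(original)
cross_validated_rate = accuracy_rate(cross_validated)
print(round(original_rate, 1), round(cross_validated_rate, 1))  # 50.6 50.0
print(cross_validated_rate >= BY_CHANCE_CRITERION)              # True
```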
SW388R7
Answering the problem question - 1

Data Analysis &
Computers II
Slide 76
about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.
The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.
Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or too little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or too little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.
The stepwise discriminant analysis included the three variables identified as the most useful predictors.
SW388R7

Data Analysis &
Computers II
Slide 77
We found two statistically significant discriminant functions, making it possible to distinguish among the three groups defined by the dependent variable.
Moreover, the cross-validated classification accuracy surpassed the by chance accuracy criteria, supporting the utility of the model.
SW388R7

Data Analysis &
Computers II
Slide 78
The order of importance matched the order of entry in the table of "Variables Entered/Removed."
SW388R7

Data Analysis &
Computers II
Slide 79
The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.
Survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too much or too little money on welfare. Survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too much or too little money on welfare. Survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare.
We verified that each statement about the relationship between predictors and groups was correct.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
The answer to the question is true with caution. A caution is added because of the inclusion of ordinal level variables. A caution is added because of a violation of discriminant analysis assumptions.
SW388R7
Steps in discriminant analysis: level of measurement and initial sample size
Data Analysis &
Computers II
Slide 80
The following is a guide to the decision process for answering problems about the complete discriminant analysis:
Dependent non-metric? Independent variables metric or dichotomous?
  No: Inappropriate application of a statistic
  Yes: go to the next check
Ratio of cases to independent variables at least 5 to 1?
  No: Inappropriate application of a statistic
  Yes: go to the next check
Number of cases in smallest group greater than number of independent variables?
  No: Inappropriate application of a statistic
  Yes: go to the next step
SW388R7
running the baseline model
Computers II
Slide 81
Run baseline discriminant analysis, using method for including variables identified in the research question. Record cross-validated classification accuracy for evaluation of transformations and removal of outliers.
Metric IV's normally distributed?
  No: Try:
    1. Logarithmic transformation
    2. Square root transformation
    3. Inverse transformation
    If unsuccessful, add caution for violation of discriminant analysis assumptions
  Yes: go to the next check
Mahalanobis D² to most likely group greater than critical value?
  Yes: Remove outliers from data set
  No: go to the next step
SW388R7
picking discriminant model for interpretation
Computers II
Slide 82
Were transformed variables No

substituted, or outliers and
influential cases omitted?
Yes
Evaluate impact of transformations and

removal of outliers by running discriminant
again, using method for including variables
identified in the research question.
Cross-validated accuracy
for second discriminant
analysis greater than
accuracy of baseline by 2%
Yes or more?
No
Pick discriminant analysis with Pick baseline discriminant

transformations and omitting analysis for interpretation
outliers for interpretation
SW388R7
usable discriminant model
Computers II
Slide 83
Sufficient statistically significant functions to distinguish DV groups?
  No: False
  Yes: go to the next check
Tolerance for all IV's greater than 0.10, indicating no multicollinearity?
  No: False
  Yes: go to the next step
SW388R7
assumption of equal dispersion
Computers II
Slide 84
Probability of Box's M test less than or equal to level of significance for assumptions?
  No: Pick discriminant analysis using within-groups covariance for interpretation
  Yes: Re-run discriminant analysis, using separate-groups covariance matrices for classification
Accuracy rate at least 2% higher using separate-groups covariance matrices?
  No: Pick discriminant analysis using within-groups covariance for interpretation
  Yes: Pick discriminant analysis using separate-groups covariance for interpretation
SW388R7
relationships between IV's and DV
Computers II
Slide 85
Stepwise method of entry used to include independent variables?
  No: go to the check on relationships
  Yes: Entry order of variables interpreted correctly?
    No: False
    Yes: go to the check on relationships
Relationships between individual IVs and DV groups interpreted correctly?
  No: False
  Yes: go to the next step
SW388R7
classification accuracy
Computers II
Slide 86
Cross-validated accuracy is 25% higher than proportional by chance accuracy rate?
  No: False
  Yes: go to the next step
SW388R7
adding cautions to solution
Computers II
Slide 87
Satisfies preferred ratio of cases to IV's of 20 to 1?
  No: True with caution
  Yes: go to the next check
Satisfies preferred DV group minimum size of 20 cases?
  No: True with caution
  Yes: go to the next check
DV is non-metric level and IVs are interval level or dichotomous (not ordinal)?
  No: True with caution
  Yes: True


 The primary criteria for a successful discriminant analysis are:

discriminant functions to distinguish among the groups

rate obtainable by chance alone.

Overall strategy for solving problems

1. Run a baseline discriminant analysis using the method for including

The problem may give us different levels

In this problem, we are told to use 0.05

The variables listed first in the problem

The problem identifies three groups for the dependent variable:

To distinguish among three groups, the analysis will be required to find

In order for a stepwise analysis to be

It contains three categories: survey respondents who

The baseline discriminant analysis

We begin our analysis by

Select the Classify |

Selecting the dependent variable

First, highlight the

Second, click on the right

Defining the group values

When SPSS moves the dependent variable to the

First, to specify the

Completing the range of group values

The value labels for natfare show

Note: if we enter the wrong range of group

Specifying the method for including variables

SPSS provides us with two methods for including

Since the problem states the