SW388R7

Strategy for Complete Discriminant Analysis


Data Analysis &
Computers II

Assumption of normality, linearity, and homogeneity

Outliers

Multicollinearity

Validation

Sample problem

Steps in solving problems


Assumptions of normality, linearity, and homogeneity of variance

• The ability of discriminant analysis to extract discriminant functions
that are capable of producing accurate classifications is enhanced
when the assumptions of normality, linearity, and homogeneity of
variance are satisfied.

• We will use the script to test for normality and to try substituting a
log, square root, or inverse transformation when it induces normality in
a variable that fails to satisfy the criteria for normality.

• We can compare the accuracy rate of a model using transformed
variables to that of a model that does not, to evaluate whether the
improvement gained from transformed variables is sufficient to justify
the interpretational burden of explaining transformations.

Assumption of linearity in discriminant analysis

• Since the dependent variable is non-metric in discriminant analysis,
there is not a linear relationship between the dependent variable and
an independent variable.

• In discriminant analysis, the assumption of linearity applies to the
relationships between pairs of independent variables. To identify
violations of linearity, each metric independent variable would have to
be tested against all others.

• Since non-linearity only reduces the power to detect relationships, the
general advice is to attend to it only when we know that a variable in
our analysis has consistently demonstrated non-linear relationships with
other independent variables.

• We will not test for linearity in our problems.


Assumption of homogeneity of variance

• The assumption of homogeneity of variance is particularly important in
the classification stage of discriminant analysis.

• If one of the groups defined by the dependent variable has greater
dispersion than the others, cases will tend to be overclassified in it.

• Homogeneity of variance is tested with Box's M test, which tests the
null hypothesis that the group variance-covariance matrices are equal.
If we fail to reject this null hypothesis and conclude that the
variances are equal, we use the SPSS default of using a pooled
covariance matrix in classification.

• If we reject the null hypothesis and conclude that the variances are
heterogeneous, we substitute separate covariance matrices in the
classification, and evaluate whether or not our classification accuracy
is improved.

Detecting outliers in discriminant analysis - 1

• In the classification phase of discriminant analysis, each case will be
predicted to be a member of one of the groups defined by the
dependent variable.

• The assignment is based on proximity, i.e. the case will be assigned to
the group it is closest to in multidimensional space.

• Just as we use z-scores to measure the location of a case in a
distribution with a given mean and standard deviation, we can use
Mahalanobis distance as a measure of the location of a case relative to
the centroid and covariance matrix for the cases in the distribution for
a group of cases. The centroid and covariance matrix are the
multivariate equivalents of a mean and standard deviation.
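
To make the analogy with z-scores concrete, here is a minimal Python sketch of the squared Mahalanobis distance for two variables. The centroid, covariance matrix, and case values are hypothetical numbers chosen for illustration; they are not taken from GSS2000.sav, and the function name is our own.

```python
def squared_mahalanobis_2d(case, centroid, cov):
    """Squared Mahalanobis distance of a two-variable case from a centroid.

    cov is a 2x2 covariance matrix given as [[a, b], [c, d]].
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance matrix must have a positive determinant")
    dx = case[0] - centroid[0]
    dy = case[1] - centroid[1]
    # diff' * inverse(cov) * diff, with the 2x2 inverse written out by hand.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


# Hypothetical group: centroid (40, 13), variances 100 and 9, covariance 10.
d2 = squared_mahalanobis_2d((70, 12), (40, 13), [[100, 10], [10, 9]])
print(d2)  # 11.0

# With uncorrelated variables, D² is the sum of the squared z-scores.
print(squared_mahalanobis_2d((2, 3), (0, 0), [[4, 0], [0, 9]]))  # 2.0
```

The second call shows the link to z-scores: each variable is one standard deviation from its mean, so D² = 1² + 1² = 2.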

Detecting outliers in discriminant analysis - 2

• According to the SPSS Base 10.0 Applications Guide, page 259, "cases
with large values of Mahalanobis distance from their group mean can
be identified as outliers."

• In the Casewise Statistics output, SPSS provides us with the Squared
Mahalanobis Distance to the Centroid for each of the groups defined
by the dependent variable.

• If a case has a large Squared Mahalanobis Distance to the Centroid of
the group it is most likely to belong to, it is an outlier.

Detecting outliers in discriminant analysis - 3

• If we calculate the critical value that identifies a "large" value for
Mahalanobis D² distance, we can scan the Casewise Statistics table to
identify outliers.

• When we identified multivariate outliers, we used the SPSS function
CDF.CHISQ to calculate the probability of obtaining a D² of a certain
size, given the number of independent variables in the analysis.

• SPSS has a parallel function, IDF.CHISQ, that computes the size of D²
needed to reach a specific probability, given the number of
independent variables in the analysis.

Detecting outliers in discriminant analysis - 4

• Since we are dealing with the classification phase of discriminant
analysis, we use the number of independent variables included in
computing the discriminant scores for cases.

• For simultaneous discriminant analysis, in which all independent
variables are entered at the same time, we use the total number of
independent variables in the calculations for the critical value for D².

• For stepwise discriminant analysis, in which variables are entered by
statistical criteria, we use the number of variables satisfying the
statistical criteria in the calculations for the critical value for D².

Detecting outliers in discriminant analysis - 5

• We will identify outliers as cases whose probability of being in
the group that they are most likely to belong to is 0.01 or less.
Since the IDF.CHISQ function is based on cumulative
probabilities from the left tail of the distribution through the
critical value, we will use 1.00 - 0.01 = 0.99 as the probability
in the IDF.CHISQ function.

• For simultaneous discriminant analysis with 4 independent
variables, the compute command for the critical value of D² is:
COMPUTE critval = IDF.CHISQ(0.99, 4).

• For stepwise discriminant analysis in which 2 of the 4
independent variables are entered, the compute command for the
critical value of D² is:
COMPUTE critval = IDF.CHISQ(0.99, 2).
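
The two critical values can be checked outside SPSS. The following is a rough Python sketch that uses only the standard library; the function names chi2_cdf and chi2_idf are our own stand-ins for CDF.CHISQ and IDF.CHISQ, and the inverse is found by simple bisection rather than by whatever method SPSS uses internally.

```python
import math


def chi2_cdf(x, df):
    """Left-tail probability P(X <= x) for a chi-square variable."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    # Series expansion of the regularized lower incomplete gamma function.
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_idf(p, df, tol=1e-10):
    """Value x such that chi2_cdf(x, df) is approximately p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:  # widen the bracket until it contains p
        hi *= 2.0
    while hi - lo > tol:  # bisection on the monotone CDF
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


print(round(chi2_idf(0.99, 4), 3))  # 13.277 (4 variables entered)
print(round(chi2_idf(0.99, 2), 3))  # 9.21 (2 variables entered)
```

A case whose D² to its most likely group exceeds 13.277 (four variables) or 9.210 (two variables) would be flagged as an outlier under the 0.01 criterion.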

Multicollinearity

• Multicollinearity has the same effect in discriminant analysis
that it does in multiple regression, i.e. the importance of an
independent variable will be undervalued because it has a very
strong relationship to another independent variable or
combination of independent variables.

• As in multiple regression, multicollinearity in discriminant
analysis is identified by examining tolerance values.

• While tolerance is routinely included in the output for the
stepwise method for including variables, it is not included for
simultaneous entry of variables. If a tolerance problem occurs
in a simultaneous entry problem, SPSS will include a table titled
"Variables Failing Tolerance Test."

• We should not attempt to interpret an analysis with a
multicollinearity problem until we have resolved the problem
by removing or combining the problematic variables.
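
As a reminder of what a tolerance value measures, here is a small Python sketch for the two-predictor case, where tolerance in the usual regression sense is 1 minus the squared correlation between the predictors. The data are hypothetical, and the tolerance SPSS reports in a discriminant analysis is computed within the model from all entered variables, so this is only an illustration of the idea.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def tolerance_two_predictors(x, y):
    """Tolerance of either predictor when only two predictors are present."""
    r = pearson_r(x, y)
    return 1 - r * r


# Hypothetical predictor values.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(tolerance_two_predictors(x, y), 3))  # 0.4
```

A tolerance near 0 means the predictor is almost fully explained by the other predictor, which is the situation the slide warns against interpreting.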

Validation

• The primary criteria for a successful discriminant analysis are:
   • the existence of sufficient statistically significant
   discriminant functions to distinguish among the groups
   defined by the dependent variable, and
   • an accuracy rate that substantially improves on the accuracy
   rate obtainable by chance alone.

• We are also concerned with the role of the individual
independent variables, but we can expect to see greater
variation in this aspect of validation. If we can verify
the number of discriminant functions and the accuracy rate, we
will only require a caution if the role of the variables is not the
same among split-sample models.

Overall strategy for solving problems

1. Run a baseline discriminant analysis using the method for including
variables implied by the problem statement to find the baseline
cross-validated accuracy rate for the model.
2. Test for useful transformations to improve normality.
3. Substitute transformed variables and check for outliers.
4. If the cross-validated accuracy rate from the discriminant analysis
using transformed variables and omitting outliers is at least 2% better
than the baseline cross-validated accuracy rate, select it for
interpretation; otherwise select the baseline model.
5. If the Box's M statistic is statistically significant, we violate the
assumption of homogeneity of variance and re-run the analysis using
separate covariance matrices for classification. If the accuracy rate
increases by more than 2%, we interpret this model; otherwise we return
to the model using pooled covariance.
6. If the cross-validated accuracy rate is 25% or more higher than the
proportional by chance accuracy rate, interpret the selected
discriminant model:
   • Number of functions and importance of predictors
   • Role of individual variables on functions distinguishing among groups
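
The decision rules in steps 4 to 6 can be sketched in Python as below. This rests on two assumptions that the slide does not spell out: the 2% gain is read as percentage points, and "25% or more higher" is read as at least 1.25 times the by chance rate. The function names and the example rates are hypothetical.

```python
def choose_model(baseline_rate, candidate_rate, min_gain=2.0, strict=False):
    """Pick a model by comparing cross-validated accuracy rates (percent).

    strict=False mirrors "at least 2% better" (step 4);
    strict=True mirrors "increases by more than 2%" (step 5).
    """
    gain = candidate_rate - baseline_rate
    improved = gain > min_gain if strict else gain >= min_gain
    return "candidate" if improved else "baseline"


def beats_chance(cross_validated_rate, by_chance_rate):
    """Step 6: accuracy must be at least 25% higher than the chance rate."""
    return cross_validated_rate >= 1.25 * by_chance_rate


# Hypothetical accuracy rates, in percent.
print(choose_model(50.0, 52.0))               # candidate
print(choose_model(50.0, 52.0, strict=True))  # baseline
print(beats_chance(50.0, 35.0))               # True (1.25 * 35.0 = 43.75)
```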

Problem 1

In the dataset GSS2000.sav, is the following statement true, false, or an incorrect application of
a statistic? Assume that there is no problem with missing data. Use a level of significance of 0.01
for evaluating assumptions. Use a level of significance of 0.05 for evaluating the statistical
relationship.
From the list of variables "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful
predictors for distinguishing among groups based on responses to "opinion about spending on
welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey
respondents who thought we spend too much money on welfare from survey respondents who
thought we spend about the right amount of money on welfare who, in turn, are differentiated
from survey respondents who thought we spend too little money on welfare.
The most important predictor of groups based on responses to opinion about spending on welfare
was number of hours worked in the past week. The second most important predictor of groups
based on responses to opinion about spending on welfare was self-employment. The third most
important predictor of groups based on responses to opinion about spending on welfare was
highest year of school completed.
Survey respondents who thought we spend about the right amount of money on welfare worked
fewer hours in the past week than survey respondents who thought we spend too much or little
money on welfare. Survey respondents who thought we spend about the right amount of money
on welfare had completed more years of school than survey respondents who thought we spend
too much or little money on welfare. Survey respondents who thought we spend too much money
on welfare were more likely to be self-employed than survey respondents who thought we spend
too little money on welfare.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Dissecting problem 1 - 1

The problem may give us different levels of significance for the analysis.

In this problem, we are told to use 0.05 as alpha for the discriminant
analysis, but 0.01 for testing assumptions.


Dissecting problem 1 - 2

The variables listed first in the problem statement are the independent
variables (IVs): "number of hours worked in the past week" [hrs1],
"self-employment" [wrkslf], "highest year of school completed" [educ],
and "income" [rincom98].

The variable used to define groups is the dependent variable (DV):
"opinion about spending on welfare" [natfare].

When a problem asks us to identify the best or most useful predictors
from a list of independent variables, we do stepwise discriminant
analysis.

Dissecting problem 1 - 3

The problem identifies three groups for the dependent variable:
• survey respondents who thought we spend too much money on welfare
• survey respondents who thought we spend about the right amount of
money on welfare
• survey respondents who thought we spend too little money on welfare.

To distinguish among three groups, the analysis will be required to find
two statistically significant discriminant functions.


Dissecting problem 1 - 4

In a stepwise analysis, we only interpret the independent variables that
are entered in the stepwise analysis.

The importance of individual predictors is based on order of entry in the
analysis.

Dissecting problem 1 - 5

The specific relationships listed in the problem indicate how the
independent variable relates to groups of the dependent variable, e.g.,
the mean for hours worked in the past week will be lower for respondents
who think we spend the right amount of money versus respondents who
think we spend too much or too little.

In order for a stepwise analysis to be true, we must have enough
statistically significant functions to distinguish among the groups, the
order of entry must be correct, and each significant relationship must
be interpreted correctly.

LEVEL OF MEASUREMENT - 1

Discriminant analysis requires that the dependent variable be non-metric
and the independent variables be metric or dichotomous. "Opinion about
spending on welfare" [natfare] is an ordinal level variable, which
satisfies the level of measurement requirement.

It contains three categories: survey respondents who thought we spend
too much money on welfare, survey respondents who thought we spend
about the right amount of money on welfare, and survey respondents who
thought we spend too little money on welfare.

LEVEL OF MEASUREMENT - 2

"Number of hours worked in the past week" [hrs1] and "highest year of
school completed" [educ] are interval level variables, which satisfy the
level of measurement requirements for discriminant analysis.

"Income" [rincom98] is an ordinal level variable. If we follow the
convention of treating ordinal level variables as metric variables, the
level of measurement requirement for discriminant analysis is satisfied.
Since some data analysts do not agree with this convention, a note of
caution should be included in our interpretation.

"Self-employment" [wrkslf] is a dichotomous or dummy-coded nominal
variable which may be included in discriminant analysis.

The baseline discriminant analysis

We begin our analysis by running a stepwise discriminant analysis with
natfare as the dependent variable and hrs1, wrkslf, educ, and rincom98
as the independent variables.

Select the Classify | Discriminant… command from the Analyze menu.

Selecting the dependent variable

First, highlight the dependent variable natfare in the list of variables.

Second, click on the right arrow button to move the dependent variable
to the Grouping Variable text box.

Defining the group values

When SPSS moves the dependent variable to the Grouping Variable text
box, it puts two question marks in parentheses after the variable name.
This is a reminder that we have to enter the numbers that represent the
groups we want to include in the analysis.

First, to specify the group numbers, click on the Define Range… button.

Completing the range of group values

The value labels for natfare show three categories:
1 = TOO LITTLE
2 = ABOUT RIGHT
3 = TOO MUCH
The range of values that we need to enter goes from 1 as the minimum
to 3 as the maximum.

First, type in 1 in the Minimum text box.

Second, type in 3 in the Maximum text box.

Third, click on the Continue button to close the dialog box.

Note: if we enter the wrong range of group numbers, e.g., 1 to 2 instead
of 1 to 3, SPSS will only include groups 1 and 2 in the analysis.

Specifying the method for including variables

SPSS provides us with two methods for including variables: entering all
of the independent variables at one time, or a stepwise method that
selects variables using a statistical test to determine the order in
which variables are included.

Since the problem asks about the most useful subset of predictors, we
mark the option button to Use stepwise method.

Requesting statistics for the output

Click on the Statistics… button to select the statistics we will need
for the analysis.

Specifying statistical output

First, mark the Means checkbox on the Descriptives panel. We will use
the group means in our interpretation.

Second, mark the Univariate ANOVAs checkbox on the Descriptives panel.
Perusing these tests suggests which variables might be useful
discriminators.

Third, mark the Box's M checkbox. Box's M statistic evaluates conformity
to the assumption of homogeneity of group variances.

Fourth, click on the Continue button to close the dialog box.

Specifying details for the stepwise method

Click on the Method… button to specify the specific statistical criteria
to use for including variables.

Details for the stepwise method

First, mark the Mahalanobis distance option button on the Method panel.

Second, mark the Summary of steps checkbox to produce a summary table
when a new variable is added.

Third, click on the option button Use probability of F so that we can
incorporate the level of significance specified in the problem.

Fourth, type the level of significance in the Entry text box. The
Removal value is twice as large as the entry value.

Fifth, click on the Continue button to close the dialog box.

Specifying details for classification

Click on the Classify… button to specify details for the classification
phase of the analysis.

Details for classification - 1

First, mark the option button to Compute from group sizes on the Prior
Probabilities panel. This incorporates the size of the groups defined
by the dependent variable into the classification of cases using the
discriminant functions.

Second, mark the Casewise results checkbox on the Display panel to
include classification details for each case in the output.

Third, mark the Summary table checkbox to include summary tables
comparing actual and predicted classification.

Details for classification - 2

Fourth, mark the Leave-one-out classification checkbox to request SPSS
to include a cross-validated classification in the output. This option
produces a less biased estimate of classification accuracy by
sequentially holding each case out of the calculations for the
discriminant functions, and using the derived functions to classify the
case held out.

Details for classification - 3

Fifth, accept the default of the Within-groups option button on the Use
Covariance Matrix panel. The covariance matrices are the measure of the
dispersion in the groups defined by the dependent variable. If we fail
the homogeneity of group variances test (Box's M), our option is to use
Separate-groups covariance in classification.

Sixth, mark the Combined-groups checkbox on the Plots panel to obtain a
visual plot of the relationship between functions and groups defined by
the dependent variable.

Seventh, click on the Continue button to close the dialog box.
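
The leave-one-out idea requested in the fourth step can be illustrated with a short Python sketch. To keep it self-contained, it swaps in a nearest-group-mean rule on a single variable as a stand-in for the discriminant functions, and the data are hypothetical; only the hold-one-case-out loop corresponds to what the slide describes.

```python
def nearest_group(value, groups):
    """Stand-in classifier: assign value to the group with the closest mean."""
    means = {label: sum(v) / len(v) for label, v in groups.items()}
    return min(means, key=lambda label: abs(value - means[label]))


def leave_one_out_accuracy(groups):
    """Classify each case with a rule derived from all the other cases."""
    correct = total = 0
    for label, values in groups.items():
        for i, held_out in enumerate(values):
            # Rebuild the training data without the held-out case.
            training = {g: list(v) for g, v in groups.items()}
            del training[label][i]
            if nearest_group(held_out, training) == label:
                correct += 1
            total += 1
    return correct / total


# Hypothetical scores for two groups; 6 and 5 sit close to the other group.
groups = {"A": [1, 2, 3, 6], "B": [5, 8, 9]}
print(round(leave_one_out_accuracy(groups), 3))  # 0.714 (5 of 7 correct)
```

Because each case is classified by a rule it did not help to build, the resulting accuracy rate is less optimistic than one computed on the same cases used to derive the rule.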

Completing the discriminant analysis request

Click on the OK button to request the output for the discriminant
analysis.

Initial sample size - 1

Analysis Case Processing Summary

Unweighted Cases                                          N    Percent
Valid                                                   138       51.1
Excluded   Missing or out-of-range group codes            7        2.6
           At least one missing discriminating
           variable                                     115       42.6
           Both missing or out-of-range group codes
           and at least one missing discriminating
           variable                                      10        3.7
           Total                                        132       48.9
Total                                                   270      100.0

The initial sample size before excluding outliers and influential cases
is 138 cases. With 4 independent variables, the ratio of cases to
variables is 34.5 to 1, satisfying both the minimum ratio of 5 cases
for each independent variable and the preferred ratio of 20 to 1.

Initial sample size - 2

Slide 36

Prior Probabilities for Groups

                    Cases Used in Analysis
WELFARE    Prior    Unweighted    Weighted
1           .406            56      56.000
2           .362            50      50.000
3           .232            32      32.000
Total      1.000           138     138.000

In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and preferably contain 20 or more cases.

The number of cases in the smallest group in this problem is 32, which is larger than the number of independent variables (4), satisfying the minimum requirement. In addition, the number of cases in the smallest group satisfies the preferred minimum of 20 cases.

If the sample size did not initially satisfy the minimum requirements, discriminant analysis is not appropriate.
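
The two sample size rules above can be sketched as a short check. This is only an illustration in Python rather than SPSS; the function name `sample_size_checks` and the dictionary keys are made up for this example, while the group sizes (56, 50, 32) and the number of independent variables (4) come from the output above.

```python
def sample_size_checks(group_sizes, n_predictors):
    """Apply the sample size rules of thumb stated on the slides."""
    n_valid = sum(group_sizes)
    ratio = n_valid / n_predictors
    smallest = min(group_sizes)
    return {
        "ratio": ratio,
        "meets_minimum_ratio": ratio >= 5,      # at least 5 cases per variable
        "meets_preferred_ratio": ratio >= 20,   # preferred 20 cases per variable
        "smallest_group": smallest,
        "smallest_exceeds_predictors": smallest > n_predictors,
        "smallest_meets_preferred": smallest >= 20,
    }


# Group sizes and predictor count taken from the SPSS output above.
checks = sample_size_checks([56, 50, 32], 4)
for name, value in checks.items():
    print(name, value)
```

If any of the minimum checks came back False, the slides' advice is that discriminant analysis would not be appropriate for the data.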

Classification accuracy before transformations or removing outliers

Slide 37

Classification Results (b, c)

                                              Predicted Group Membership
                         WELFARE                 1       2       3     Total
Original             Count  1                   43      15       6        64
                            2                   26      30       6        62
                            3                   17      10       9        36
                            Ungrouped cases      3       3       2         8
                     %      1                 67.2    23.4     9.4     100.0
                            2                 41.9    48.4     9.7     100.0
                            3                 47.2    27.8    25.0     100.0
                            Ungrouped cases   37.5    37.5    25.0     100.0
Cross-validated (a)  Count  1                   43      15       6        64
                            2                   26      30       6        62
                            3                   17      11       8        36
                     %      1                 67.2    23.4     9.4     100.0
                            2                 41.9    48.4     9.7     100.0
                            3                 47.2    30.6    22.2     100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.

Prior to any transformations of variables to satisfy the assumptions of discriminant analysis or removal of outliers, the cross-validated accuracy rate was 50.0%.

This accuracy rate is the benchmark that we will use to evaluate the utility of transformations and the elimination of outliers.
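
The accuracy rates in footnotes b and c can be reproduced from the counts in the table: the correctly classified cases sit on the diagonal. The sketch below is an illustration in Python, not SPSS output; `accuracy_rate` is a hypothetical helper name, and the counts are copied from the Classification Results table (ungrouped cases omitted).

```python
def accuracy_rate(confusion):
    """Share of cases on the diagonal of an actual-by-predicted count table."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Rows are actual groups 1-3, columns are predicted groups 1-3.
original = [[43, 15, 6], [26, 30, 6], [17, 10, 9]]
cross_validated = [[43, 15, 6], [26, 30, 6], [17, 11, 8]]

original_pct = round(accuracy_rate(original) * 100, 1)
cross_validated_pct = round(accuracy_rate(cross_validated) * 100, 1)
print(original_pct, cross_validated_pct)
```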

ASSUMPTION OF NORMALITY

Slide 38

There are three metric independent variables to test for normality:
•"number of hours worked in the past week" [hrs1],
•"highest year of school completed" [educ], and
•"income" [rincom98].

To test for normality, run the script: NormalityAssumptionAndTransformations.SBS

First, move the three independent variables to the list box of variables to test.

Second, click on the OK button to produce the output.

Normality of independent variable: highest year of school completed

Slide 39

Descriptives: HIGHEST YEAR OF SCHOOL COMPLETED

                                                Statistic   Std. Error
Mean                                                13.12         .179
95% Confidence Interval for Mean   Lower Bound      12.77
                                   Upper Bound      13.47
5% Trimmed Mean                                     13.14
Median                                              13.00
Variance                                            8.583
Std. Deviation                                      2.930
Minimum                                                 2
Maximum                                                20
Range                                                  18
Interquartile Range                                  3.00
Skewness                                            -.137         .149
Kurtosis                                            1.246         .296

The independent variable "highest year of school completed" [educ] does not satisfy the criteria for a normal distribution.

The skewness (-0.137) fell between -1.0 and +1.0, but the kurtosis (1.246) fell outside the range from -1.0 to +1.0.

Normality of independent variable: highest year of school completed

Slide 40

Neither the logarithmic, the square root, nor the inverse transformation normalizes the variable.

A caution should be added to the findings.

Normality of independent variable: number of hours worked in the past week

Slide 41

Descriptives: NUMBER OF HOURS WORKED LAST WEEK

                                                Statistic   Std. Error
Mean                                                40.99         .958
95% Confidence Interval for Mean   Lower Bound      39.10
                                   Upper Bound      42.88
5% Trimmed Mean                                     41.21
Median                                              40.00
Variance                                          161.491
Std. Deviation                                     12.708
Minimum                                                 4
Maximum                                                80
Range                                                  76
Interquartile Range                                 10.00
Skewness                                            -.324         .183
Kurtosis                                             .935         .364

The variable "number of hours worked in the past week" [hrs1] satisfies the criteria for a normal distribution. The skewness (-0.324) and kurtosis (0.935) were both between -1.0 and +1.0.

Normality of independent variable: income

Slide 42

Descriptives: RESPONDENTS INCOME

                                                Statistic   Std. Error
Mean                                                13.35         .419
95% Confidence Interval for Mean   Lower Bound      12.52
                                   Upper Bound      14.18
5% Trimmed Mean                                     13.54
Median                                              15.00
Variance                                           29.535
Std. Deviation                                      5.435
Minimum                                                 1
Maximum                                                23
Range                                                  22
Interquartile Range                                  8.00
Skewness                                            -.686         .187
Kurtosis                                            -.253         .373

The variable "income" [rincom98] satisfies the criteria for a normal distribution. The skewness (-0.686) and kurtosis (-0.253) were both between -1.0 and +1.0.

Since we do not have any transformations to test, we can use the baseline model to examine outliers.

If we had transformations to test, we would run the discriminant analysis again, substituting the transformed variables before examining outliers.
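
The normality rule applied to the three variables (skewness and kurtosis both between -1.0 and +1.0) can be sketched as follows. This is an illustration in Python, not the SPSS script named above; `satisfies_normality` is a hypothetical helper, and the statistics are copied from the three Descriptives tables.

```python
def satisfies_normality(skewness, kurtosis, limit=1.0):
    """Rule of thumb from the slides: both statistics within -limit..+limit."""
    return -limit <= skewness <= limit and -limit <= kurtosis <= limit


# (skewness, kurtosis) pairs taken from the Descriptives tables.
descriptives = {
    "educ": (-0.137, 1.246),
    "hrs1": (-0.324, 0.935),
    "rincom98": (-0.686, -0.253),
}

results = {name: satisfies_normality(s, k) for name, (s, k) in descriptives.items()}
print(results)
```

Only [educ] fails the rule, because its kurtosis is above +1.0.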

OUTLIERS

Slide 43

The classification output for individual cases can be used to detect outliers. In this context, an outlier is a case that is distant from the centroid of the group to which it has the highest probability of belonging.

Distance from the centroid of a group is measured by Mahalanobis Distance.

To identify outliers, we scan the column looking for cases with a Mahalanobis D² distance greater than a critical value.

Using SPSS to calculate the critical value for Mahalanobis D²

Slide 44

The critical value for Mahalanobis D² is the value that would achieve a specified level of statistical significance given the number of variables that were included in its calculation.

Specifically, we will use an SPSS function to give us the critical value for a probability of 0.01 with the degrees of freedom equal to the number of variables used to compute D².

The number of variables used to compute Mahalanobis D²

Slide 45

Variables Entered/Removed (a, b, c, d)

                                            Min. D Squared
                                                                   Exact F
Step  Entered                               Statistic  Between    Statistic  df1  df2      Sig.
                                                       Groups
1     NUMBER OF HOURS WORKED LAST WEEK      .023       1 and 3     .475      1    135.000  .492
2     R SELF-EMP OR WORKS FOR SOMEBODY      .251       1 and 2    3.289      2    134.000  .040
3     HIGHEST YEAR OF SCHOOL COMPLETED      .364       1 and 3    2.433      3    133.000  .068

At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 8.
b. Maximum significance of F to enter is .05.
c. Minimum significance of F to remove is .10.
d. F level, tolerance, or VIN insufficient for further computation.

In a direct entry discriminant analysis that includes all variables simultaneously, the number of variables used to compute the values of D² is equal to the number of independent variables included in the analysis.

In stepwise discriminant analysis, the number of variables used to compute the values of D² is equal to the number of independent variables selected for inclusion by the statistical procedure.

In this problem, 3 out of the 4 independent variables were used in the discriminant functions.

Computing the critical value for Mahalanobis D²

Slide 46

First, we open the window to compute a new variable by selecting the Compute… command from the Transform menu.

Selecting the SPSS function

Slide 47

First, we enter the acronym for the variable we want to create in the Target Variable textbox: critval, for critical value.

Second, we scroll down the list of SPSS functions to highlight the one we need: IDF.CHISQ(p, df)

Third, we click on the up arrow button to move the function to the Numeric Expression textbox.

Completing the function arguments

Slide 48

First, the first argument to the IDF.CHISQ function, p, is replaced by the cumulative probability associated with the critical value, 0.99.

Second, the number of independent variables in the discriminant functions, 3, is used as the df, or degrees of freedom.

Third, click on the OK button to compute the variable.
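
For readers who want to check the critical value outside SPSS, the inverse chi-square computation can be sketched with the Python standard library alone. This is an assumption-laden illustration, not how SPSS implements IDF.CHISQ: `chi2_cdf` and `chi2_critical_value` are hypothetical helper names, the CDF uses the series expansion of the regularized lower incomplete gamma function, and the inverse is found by simple bisection.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized lower
    incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-15 * abs(total):
        term *= z / (a + n)
        total += term
        n += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def chi2_critical_value(p, df):
    """Value whose cumulative probability is p, found by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < p:   # widen the bracket until it contains p
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Cumulative probability 0.99 and 3 degrees of freedom, as on the slide.
critval = chi2_critical_value(0.99, 3)
print(round(critval, 3))
```

The result agrees with the critical value of 11.345 used later to identify outliers.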

The critical value for Mahalanobis D²

Slide 49

The critical value is calculated as a new variable in the SPSS data editor. Even though we only need it calculated a single time, the Compute command creates a value for every case.

Now that we have the critical value, we can compare it to the values in the table of Casewise Statistics.

Skipping ungrouped cases

Slide 50

Case 50 has a D² of 16.603, which is its distance from the centroid of its predicted group 3. However, the actual group for the case was "ungrouped," meaning it was missing data for the dependent variable. This case is not counted as an outlier because it is already omitted from the calculations for the discriminant functions.

Identifying outliers

Slide 51

Case Number 176 has a D² of 11.553, which is its distance from the centroid of its predicted group 2, and which is larger than the critical value for D² of 11.345. This case is an outlier and should be omitted in our test for the impact of outliers on the analysis.

To omit it from the analysis, we will have to find its case id number and eliminate that. We cannot use case numbers to eliminate outliers, because omitting one case changes the case number for all of the other cases after it, and we are likely to exclude the wrong case.
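
The scan of the Casewise Statistics table can be sketched as a filter: keep grouped cases whose D² exceeds the critical value and skip ungrouped cases. This is an illustration in Python under stated assumptions. The record layout and the name `find_outliers` are hypothetical, case 12 is invented, and the actual group of case 176 is not given in the text, so the value used here is a placeholder.

```python
CRITICAL_VALUE = 11.345  # critical D² from the text (p = 0.01, df = 3)


def find_outliers(cases, critical_value=CRITICAL_VALUE):
    """Case numbers of grouped cases whose D² exceeds the critical value."""
    return [
        c["case_number"]
        for c in cases
        if c["actual_group"] is not None and c["d_squared"] > critical_value
    ]


cases = [
    # Hypothetical ordinary case, invented for illustration.
    {"case_number": 12, "actual_group": 1, "d_squared": 3.210},
    # Case 50 from the text: ungrouped, so it is skipped.
    {"case_number": 50, "actual_group": None, "d_squared": 16.603},
    # Case 176 from the text; its actual group is a placeholder here.
    {"case_number": 176, "actual_group": 2, "d_squared": 11.553},
]

outliers = find_outliers(cases)
print(outliers)
```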

The caseid of the outlier

Slide 52

To omit the outlier, we scroll down the data editor to case 176 and note its caseid value, "20001785."

In this data set, caseids are string or text data, and we represent their values in quotation marks.

Omitting the outliers

Slide 53

To omit outliers, we select into the analysis the cases that are not outliers.

First, select the Select Cases… command from the Data menu.

Specifying the condition to omit outliers

Slide 54

First, mark the If condition is satisfied option button to indicate that we will enter a specific condition for including cases.

Second, click on the If… button to specify the criteria for inclusion in the analysis.

The formula for omitting outliers

Slide 55

To eliminate the outliers, we request that the cases that are not outliers be included in the analysis. Using this formula, we are selecting cases that do not have a caseid of "20001785".

In the formula, the symbol ~= stands for "not equal to".

If we had more than one outlier, the formula would be expanded to:
caseid~="20001785" and caseid~="20005967" and caseid~="20006102" …

After typing in the formula, click on the Continue button to close the dialog box.
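
The selection condition keeps every case whose caseid is not in the list of outliers. The same idea can be sketched in Python; this is an illustration only, not SPSS syntax. The name `select_non_outliers` and the record layout are hypothetical, and the caseids other than "20001785" are invented neighbors.

```python
def select_non_outliers(records, outlier_ids):
    """Keep records whose caseid is not one of the outlier caseids."""
    excluded = set(outlier_ids)
    return [r for r in records if r["caseid"] not in excluded]


records = [
    {"caseid": "20001784"},  # hypothetical neighboring case
    {"caseid": "20001785"},  # the outlier identified in the text
    {"caseid": "20001786"},  # hypothetical neighboring case
]

kept = select_non_outliers(records, ["20001785"])
print([r["caseid"] for r in kept])
```

Filtering on the caseid string rather than on the row position avoids the problem noted earlier, where removing one case shifts the case numbers of all later cases.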

Completing the request for the selection

Slide 56

To complete the request, we click on the OK button.

The omitted outlier

Slide 57

SPSS identifies the excluded cases by drawing a slash mark through the case number.

Classification accuracy without the outlier

Slide 58

Classification Results (b, c)

                                              Predicted Group Membership
                         WELFARE                 1       2       3     Total
Original             Count  1                   43      15       6        64
                            2                   26      29       6        61
                            3                   17      10       9        36
                            Ungrouped cases      3       3       2         8
                     %      1                 67.2    23.4     9.4     100.0
                            2                 42.6    47.5     9.8     100.0
                            3                 47.2    27.8    25.0     100.0
                            Ungrouped cases   37.5    37.5    25.0     100.0
Cross-validated (a)  Count  1                   43      15       6        64
                            2                   26      29       6        61
                            3                   17      11       8        36
                     %      1                 67.2    23.4     9.4     100.0
                            2                 42.6    47.5     9.8     100.0
                            3                 47.2    30.6    22.2     100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 50.3% of original grouped cases correctly classified.
c. 49.7% of cross-validated grouped cases correctly classified.

Prior to any transformations of variables to satisfy the assumptions of normality and the removal of outliers, the cross-validated classification accuracy rate was 50.0%.

After substituting transformed variables and removing outliers, the cross-validated classification accuracy rate was 49.7%.

SELECTION OF MODEL FOR INTERPRETATION

Slide 59

Classification Results (b, c)

                                              Predicted Group Membership
                         WELFARE                 1       2       3     Total
Original             Count  1                   43      15       6        64
                            2                   26      29       6        61
                            3                   17      10       9        36
                            Ungrouped cases      3       3       2         8
                     %      1                 67.2    23.4     9.4     100.0
                            2                 42.6    47.5     9.8     100.0
                            3                 47.2    27.8    25.0     100.0
                            Ungrouped cases   37.5    37.5    25.0     100.0
Cross-validated (a)  Count  1                   43      15       6        64
                            2                   26      29       6        61
                            3                   17      11       8        36
                     %      1                 67.2    23.4     9.4     100.0
                            2                 42.6    47.5     9.8     100.0
                            3                 47.2    30.6    22.2     100.0

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 50.3% of original grouped cases correctly classified.
c. 49.7% of cross-validated grouped cases correctly classified.

Since the discriminant analysis using transformations and omitting outliers was less accurate in classifying cases than the discriminant analysis with all cases and no transformations, the discriminant analysis with all cases and no transformations was interpreted.

SAMPLE SIZE - 1

Slide 60

Analysis Case Processing Summary

Unweighted Cases                                             N    Percent
Valid                                                      138       51.1
Excluded  Missing or out-of-range group codes                7        2.6
          At least one missing discriminating variable     115       42.6
          Both missing or out-of-range group codes and
          at least one missing discriminating variable      10        3.7
          Total                                            132       48.9
Total                                                      270      100.0

The minimum ratio of valid cases to independent variables for discriminant analysis is 5 to 1, with a preferred ratio of 20 to 1. In this analysis, there are 138 valid cases and 4 independent variables.

The ratio of cases to independent variables is 34.5 to 1, which satisfies the minimum requirement. In addition, the ratio of 34.5 to 1 satisfies the preferred ratio of 20 to 1.

SAMPLE SIZE - 2

Slide 61

Prior Probabilities for Groups

                           Cases Used in Analysis
WELFARE           Prior    Unweighted    Weighted
1 TOO LITTLE       .409            56      56.000
2 ABOUT RIGHT      .358            49      49.000
3 TOO MUCH         .234            32      32.000
Total             1.000           137     137.000

In addition to the requirement for the ratio of cases to independent variables, discriminant analysis requires that there be a minimum number of cases in the smallest group defined by the dependent variable. The number of cases in the smallest group must be larger than the number of independent variables, and preferably contain 20 or more cases.

The number of cases in the smallest group in this problem is 32, which is larger than the number of independent variables (4), satisfying the minimum requirement. In addition, the number of cases in the smallest group satisfies the preferred minimum of 20 cases.

NUMBER OF DISCRIMINANT FUNCTIONS - 1

Slide 62

The maximum possible number of discriminant functions is the smaller of one less than the number of groups defined by the dependent variable and the number of independent variables.

In this analysis there were 3 groups defined by opinion about spending on welfare and 4 independent variables, so the maximum possible number of discriminant functions was 2.
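
The rule above reduces to taking the smaller of two numbers. A minimal sketch in Python, with the hypothetical helper name `max_discriminant_functions`:

```python
def max_discriminant_functions(n_groups, n_predictors):
    """Smaller of (groups - 1) and the number of independent variables."""
    return min(n_groups - 1, n_predictors)


# 3 groups and 4 independent variables, as in this analysis.
max_functions = max_discriminant_functions(3, 4)
print(max_functions)
```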

NUMBER OF DISCRIMINANT FUNCTIONS - 2

Slide 63

In the table of Wilks' Lambda, which tested functions for statistical significance, the stepwise analysis identified 2 discriminant functions that were statistically significant. The Wilks' lambda statistic for the test of functions 1 through 2 (chi-square=21.853) had a probability of 0.001, which was less than or equal to the level of significance of 0.05.

After removing function 1, the Wilks' lambda statistic for the test of function 2 (chi-square=7.074) had a probability of 0.029, which was less than or equal to the level of significance of 0.05. The significance of the maximum possible number of discriminant functions supports the interpretation of a solution using 2 discriminant functions.

MULTICOLLINEARITY

Slide 64

Multicollinearity occurs when one independent variable is so strongly correlated with one or more other variables that its relationship to the dependent variable is likely to be misinterpreted. Its potential unique contribution to explaining the dependent variable is minimized by its strong relationship to other independent variables. Multicollinearity is indicated when the tolerance value for an independent variable is less than 0.10.

The tolerance values for all of the independent variables are larger than 0.10. Multicollinearity is not a problem in this discriminant analysis.

Independent variables and group membership: relationship of functions to groups

Slide 65

In order to specify the role that each independent variable plays in predicting group membership on the dependent variable, we must link together the relationship between the discriminant functions and the groups defined by the dependent variable, the role of the significant independent variables in the discriminant functions, and the differences in group means for each of the variables.

Functions at Group Centroids

             Function
WELFARE        1        2
1           -.220     .235
2            .446    -.031
3           -.311    -.362

Unstandardized canonical discriminant functions evaluated at group means

Function 1 separates survey respondents who thought we spend about the right amount of money on welfare (the positive value of 0.446) from survey respondents who thought we spend too much (negative value of -0.311) or too little money (negative value of -0.220) on welfare.

Function 2 separates survey respondents who thought we spend too little money on welfare (positive value of 0.235) from survey respondents who thought we spend too much money (negative value of -0.362) on welfare. We ignore the second group (-0.031) in this comparison because it was distinguished from the other two groups by function 1.

Independent variables and group membership: which predictors to interpret

Slide 66

Variables Entered/Removed (a, b, c, d)

                                            Min. D Squared
                                                                   Exact F
Step  Entered                               Statistic  Between    Statistic  df1  df2      Sig.
                                                       Groups
1     NUMBER OF HOURS WORKED LAST WEEK      .023       1 and 3     .475      1    135.000  .492
2     R SELF-EMP OR WORKS FOR SOMEBODY      .251       1 and 2    3.289      2    134.000  .040
3     HIGHEST YEAR OF SCHOOL COMPLETED      .364       1 and 3    2.433      3    133.000  .068

At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 8.
b. Maximum significance of F to enter is .05.
c.

When we use the stepwise method of variable inclusion, we limit our interpretation of independent variable predictors to those listed as statistically significant in the table of Variables Entered/Removed.

We will interpret the impact on membership in groups defined by the dependent variable by the independent variables:
•number of hours worked in the past week
•self-employment
•highest year of school completed

Had we used simultaneous entry of all variables, we would not have imposed this limitation.

Independent variables and group membership: predictor loadings on functions

Slide 67

We do not interpret loadings in the structure matrix unless they are 0.30 or higher.

Structure Matrix

                                          Function
                                          1         2
HIGHEST YEAR OF SCHOOL COMPLETED        .687*     .136
NUMBER OF HOURS WORKED LAST WEEK       -.582*     .345
R SELF-EMP OR WORKS FOR SOMEBODY        .223      .889*
RESPONDENTS INCOME (a)                  .101      .292*

Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions.
Variables ordered by absolute size of correlation within function.
*. Largest absolute correlation between each variable and any discriminant function.
a. This variable not used in the analysis.

Based on the structure matrix, the predictor variables strongly associated with discriminant function 1, which distinguished between survey respondents who thought we spend about the right amount of money on welfare and survey respondents who thought we spend too much or too little money on welfare, were number of hours worked in the past week (r=-0.582) and highest year of school completed (r=0.687).

Based on the structure matrix, the predictor variable strongly associated with discriminant function 2, which distinguished between survey respondents who thought we spend too little money on welfare and survey respondents who thought we spend too much money on welfare, was self-employment (r=0.889).

Independent variables and group membership: predictors associated with first function - 1

Slide 68

Group Statistics

                                                                             Valid N (listwise)
WELFARE         Variable                             Mean    Std. Deviation  Unweighted  Weighted
1 TOO LITTLE    NUMBER OF HOURS WORKED LAST WEEK     43.96   13.240          56           56.000
                HIGHEST YEAR OF SCHOOL COMPLETED     13.73    2.401          56           56.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.93     .260          56           56.000
                RESPONDENTS INCOME                   13.70    5.034          56           56.000
2 ABOUT RIGHT   NUMBER OF HOURS WORKED LAST WEEK     37.90   13.235          50           50.000
                HIGHEST YEAR OF SCHOOL COMPLETED     14.78    2.558          50           50.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.90     .303          50           50.000
                RESPONDENTS INCOME                   14.00    5.503          50           50.000
3 TOO MUCH      NUMBER OF HOURS WORKED LAST WEEK     42.03   10.456          32           32.000
                HIGHEST YEAR OF SCHOOL COMPLETED     13.38    2.524          32           32.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.75     .440          32           32.000
                RESPONDENTS INCOME                   14.75    5.304          32           32.000
Total           NUMBER OF HOURS WORKED LAST WEEK     41.32   12.846          138         138.000
                HIGHEST YEAR OF SCHOOL COMPLETED     14.03    2.537          138         138.000
                R SELF-EMP OR WORKS

The average number of hours worked in the past week for survey respondents who thought we spend about the right amount of money on welfare (mean=37.90) was lower than the average number of hours worked in the past week for survey respondents who thought we spend too little money on welfare (mean=43.96) and survey respondents who thought we spend too much money on welfare (mean=42.03).

This supports the relationship that "survey respondents who thought we spend about the right amount of money on welfare worked fewer hours in the past week than survey respondents who thought we spend too little or too much money on welfare."

Independent variables and group membership: predictors associated with first function - 2

Slide 69

Group Statistics

                                                                             Valid N (listwise)
WELFARE         Variable                             Mean    Std. Deviation  Unweighted  Weighted
1 TOO LITTLE    NUMBER OF HOURS WORKED LAST WEEK     43.96   13.240          56           56.000
                HIGHEST YEAR OF SCHOOL COMPLETED     13.73    2.401          56           56.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.93     .260          56           56.000
                RESPONDENTS INCOME                   13.70    5.034          56           56.000
2 ABOUT RIGHT   NUMBER OF HOURS WORKED LAST WEEK     37.90   13.235          50           50.000
                HIGHEST YEAR OF SCHOOL COMPLETED     14.78    2.558          50           50.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.90     .303          50           50.000
                RESPONDENTS INCOME                   14.00    5.503          50           50.000
3 TOO MUCH      NUMBER OF HOURS WORKED LAST WEEK     42.03   10.456          32           32.000
                HIGHEST YEAR OF SCHOOL COMPLETED     13.38    2.524          32           32.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.75     .440          32           32.000
                RESPONDENTS INCOME                   14.75    5.304          32           32.000
Total           NUMBER OF HOURS WORKED LAST WEEK     41.32   12.846          138         138.000
                HIGHEST YEAR OF SCHOOL COMPLETED     14.03    2.537          138         138.000
                R SELF-EMP OR WORKS

The average highest year of school completed for survey respondents who thought we spend about the right amount of money on welfare (mean=14.78) was higher than the average highest year of school completed for survey respondents who thought we spend too little money on welfare (mean=13.73) and survey respondents who thought we spend too much money on welfare (mean=13.38).

This supports the relationship that "survey respondents who thought we spend about the right amount of money on welfare had completed more years of school than survey respondents who thought we spend too little or too much money on welfare."

Independent variables and group membership: predictors associated with second function

Slide 70

Group Statistics

                                                                             Valid N (listwise)
WELFARE         Variable                             Mean    Std. Deviation  Unweighted  Weighted
1 TOO LITTLE    NUMBER OF HOURS WORKED LAST WEEK     43.96   13.240          56           56.000
                HIGHEST YEAR OF SCHOOL COMPLETED     13.73    2.401          56           56.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.93     .260          56           56.000
                RESPONDENTS INCOME                   13.70    5.034          56           56.000
2 ABOUT RIGHT   NUMBER OF HOURS WORKED LAST WEEK     37.90   13.235          50           50.000
                HIGHEST YEAR OF SCHOOL COMPLETED     14.78    2.558          50           50.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.90     .303          50           50.000
                RESPONDENTS INCOME                   14.00    5.503          50           50.000
3 TOO MUCH      NUMBER OF HOURS WORKED LAST WEEK     42.03   10.456          32           32.000
                HIGHEST YEAR OF SCHOOL COMPLETED     13.38    2.524          32           32.000
                R SELF-EMP OR WORKS FOR SOMEBODY      1.75     .440          32           32.000
                RESPONDENTS INCOME                   14.75    5.304          32           32.000
Total           NUMBER OF HOURS WORKED LAST WEEK     41.32   12.846          138         138.000
                HIGHEST YEAR OF SCHOOL COMPLETED     14.03    2.537          138         138.000
                R SELF-EMP OR WORKS

Since self-employment is a dichotomous variable, the mean is not directly interpretable. Its interpretation must take into account the coding by which 1 corresponds to self-employed and 2 corresponds to working for someone else. The lower mean for survey respondents who thought we spend too much money on welfare (mean=1.75), when compared to the mean for survey respondents who thought we spend too little money on welfare (mean=1.93), implies that the group contained more survey respondents who were self-employed and fewer survey respondents who were working for someone else.

This supports the relationship that "survey respondents who thought we spend too much money on welfare were more likely to be self-employed than survey respondents who thought we spend too little money on welfare."

ASSUMPTION OF EQUAL DISPERSION FOR DEPENDENT VARIABLE GROUPS

Slide 71

The assumption of equal dispersion for groups defined by the dependent variable only affects the classification phase of discriminant analysis, and so is not evaluated until we are determining the final accuracy rate of the model.

Box's M test evaluates the homogeneity of dispersion matrices across the subgroups of the dependent variable. The null hypothesis is that the dispersion matrices are homogeneous. If the analysis fails this test, we request the use of separate group dispersion matrices in the classification phase of the discriminant analysis to see if this improves our accuracy rate.

ASSUMPTION OF EQUAL DISPERSION FOR DEPENDENT VARIABLE GROUPS

Slide 72

In this analysis, Box's M statistic had a value of 19.386 with a probability of 0.096. Since the probability for Box's M is greater than the level of significance for testing assumptions (0.01), the null hypothesis is not rejected and the assumption of equal dispersion is satisfied.

We use the pooled or within-groups covariance matrix for classification.

ASSUMPTION OF EQUAL DISPERSION FOR DEPENDENT VARIABLE GROUPS

Slide 73

Had we rejected the null hypothesis and concluded that dispersion was not equal across groups, we would have run the analysis again, specifying separate-groups covariance matrices for classification.

If classification using separate covariance matrices were more accurate by 2% or more, we would report classification accuracy based on this model rather than the one that uses within-groups covariance.

CLASSIFICATION USING THE DISCRIMINANT MODEL: by chance accuracy rate

Slide 74

The independent variables could be characterized as useful predictors of membership in the groups defined by the dependent variable if the cross-validated classification accuracy rate was significantly higher than the accuracy attainable by chance alone. Operationally, the cross-validated classification accuracy rate should be at least 25% higher than the proportional by chance accuracy rate.

The proportional by chance accuracy rate was computed by squaring and summing the proportion of cases in each group from the table of prior probabilities for groups (0.406² + 0.362² + 0.232² = 0.350).

Prior Probabilities for Groups

                           Cases Used in Analysis
WELFARE           Prior    Unweighted    Weighted
1 TOO LITTLE       .406            56      56.000
2 ABOUT RIGHT      .362            50      50.000
3 TOO MUCH         .232            32      32.000
Total             1.000           138     138.000

CLASSIFICATION USING THE DISCRIMINANT MODEL: criteria for classification accuracy

Slide 75

Classification Results (b, c)

                                                    Predicted Group Membership
                                                  1 TOO     2 ABOUT   3 TOO
                         WELFARE                  LITTLE    RIGHT     MUCH     Total
Original             Count  1 TOO LITTLE            43        15        6         64
                            2 ABOUT RIGHT           26        30        6         62
                            3 TOO MUCH              17        10        9         36
                            Ungrouped cases          3         3        2          8
                     %      1 TOO LITTLE          67.2      23.4      9.4      100.0
                            2 ABOUT RIGHT         41.9      48.4      9.7      100.0
                            3 TOO MUCH            47.2      27.8     25.0      100.0
                            Ungrouped cases       37.5      37.5     25.0      100.0
Cross-validated (a)  Count  1 TOO LITTLE            43        15        6         64
                            2 ABOUT RIGHT           26        30        6         62
                            3 TOO MUCH              17        11        8         36
                     %      1 TOO LITTLE          67.2      23.4      9.4      100.0
                            2 ABOUT RIGHT         41.9      48.4      9.7      100.0
                            3 TOO MUCH            47.2      30.6     22.2      100.0

The cross-validated accuracy rate computed by SPSS was 50.0%, which was greater than or equal to the proportional by chance accuracy criterion of 43.7% (1.25 x 35.0% = 43.7%). The criterion for classification accuracy is satisfied.

a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 50.6% of original grouped cases correctly classified.
c. 50.0% of cross-validated grouped cases correctly classified.
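
The by chance accuracy comparison can be worked through in a few lines. This is a sketch in Python rather than SPSS; the helper name `proportional_by_chance` is hypothetical, the prior probabilities and the 50.0% cross-validated rate come from the output above, and the 1.25 multiplier is the 25% improvement rule stated on the slides.

```python
def proportional_by_chance(priors):
    """Sum of squared group proportions."""
    return sum(p ** 2 for p in priors)


priors = [0.406, 0.362, 0.232]       # from Prior Probabilities for Groups
chance = proportional_by_chance(priors)
criterion = 1.25 * chance            # 25% improvement over chance
cross_validated = 0.500              # cross-validated accuracy from SPSS

print(round(chance, 3), round(criterion, 3), cross_validated >= criterion)
```

Using the unrounded chance rate gives a criterion of about 0.437, which is why the slide reports 43.7% rather than 43.75%.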

Answering the problem question - 1

Slide 76

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The stepwise discriminant analysis included the three variables identified as the most useful predictors.

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important predictor of
groups based on responses to opinion about spending on welfare was self-employment. The third
most important predictor of groups based on responses to opinion about spending on welfare
was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare worked
fewer hours in the past week than survey respondents who thought we spend too much or little
money on welfare. Survey respondents who thought we spend about the right amount of money
on welfare had completed more years of school than survey respondents who thought we spend
too much or little money on welfare. Survey respondents who thought we spend too much
money on welfare were more likely to be self-employed than survey respondents who thought
we spend too little money on welfare.

Answering the problem question - 2

Slide 77

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful predictors for distinguishing among groups based on responses to "opinion about spending on welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment" [wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey respondents who thought we spend too much money on welfare from survey respondents who thought we spend about the right amount of money on welfare who, in turn, are differentiated from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on welfare was number of hours worked in the past week. The second most important predictor of groups based on responses to opinion about spending on welfare was self-employment. The third most important predictor of groups based on responses to opinion about spending on welfare was highest year of school completed.

We found two statistically significant discriminant functions, making it possible to distinguish among the three groups defined by the dependent variable. Moreover, the cross-validated classification accuracy surpassed the by chance accuracy criteria, supporting the utility of the model.

Survey respondents who thought we spend about the right amount of money on welfare worked
fewer hours in the past week than survey respondents who thought we spend too much or little
money on welfare. Survey respondents who thought we spend about the right amount of money
on welfare had completed more years of school than survey respondents who thought we
spend too much or little money on welfare. Survey respondents who thought we spend too
much money on welfare were more likely to be self-employed than survey respondents who
thought we spend too little money on welfare.

Answering the problem question - 3



Slide 78

From the list of variables "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], "highest year of school completed" [educ], and "income" [rincom98], the most useful
predictors for distinguishing among groups based on responses to "opinion about spending on
welfare" [natfare] are "number of hours worked in the past week" [hrs1], "self-employment"
[wrkslf], and "highest year of school completed" [educ]. These predictors differentiate survey
respondents who thought we spend too much money on welfare from survey respondents who
thought we spend about the right amount of money on welfare who, in turn, are differentiated
from survey respondents who thought we spend too little money on welfare.

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important
predictor of groups based on responses to opinion about spending on welfare was
self-employment. The third most important predictor of groups based on responses to opinion
about spending on welfare was highest year of school completed.

The order of importance matched the order of entry in the table of "Variables Entered/Removed."

Survey respondents who thought we spend about the right amount of money on welfare worked
fewer hours in the past week than survey respondents who thought we spend too much or little
money on welfare. Survey respondents who thought we spend about the right amount of money
on welfare had completed more years of school than survey respondents who thought we
spend too much or little money on welfare. Survey respondents who thought we spend too
much money on welfare were more likely to be self-employed than survey respondents who
thought we spend too little money on welfare.

Answering the problem question - 4



Slide 79

The most important predictor of groups based on responses to opinion about spending on
welfare was number of hours worked in the past week. The second most important predictor of
groups based on responses to opinion about spending on welfare was self-employment. The
third most important predictor of groups based on responses to opinion about spending on
welfare was highest year of school completed.

Survey respondents who thought we spend about the right amount of money on welfare
worked fewer hours in the past week than survey respondents who thought we spend too
much or little money on welfare. Survey respondents who thought we spend about the right
amount of money on welfare had completed more years of school than survey respondents
who thought we spend too much or little money on welfare. Survey respondents who
thought we spend too much money on welfare were more likely to be self-employed than
survey respondents who thought we spend too little money on welfare.

We verified that each statement about the relationship between predictors and groups was
correct.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

The answer to the question is true with caution. A caution is added because of the inclusion
of ordinal level variables. A caution is added because of a violation of discriminant analysis
assumptions.
Steps in discriminant analysis:
level of measurement and initial sample size

Slide 80

The following is a guide to the decision process for answering problems about the complete
discriminant analysis:

Dependent non-metric? Independent variables metric or dichotomous?
  No: Inappropriate application of a statistic
  Yes: continue

Ratio of cases to independent variables at least 5 to 1?
  No: Inappropriate application of a statistic
  Yes: continue

Number of cases in smallest group greater than number of independent variables?
  No: Inappropriate application of a statistic
  Yes: continue
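
The two sample-size screens above can be sketched in a few lines of Python. This is only an illustration: the function name `check_sample_size`, its parameters, and the group counts are hypothetical and do not come from the course materials.

```python
def check_sample_size(group_sizes, n_predictors, min_ratio=5):
    """Apply the two minimum sample-size screens from the flowchart.

    group_sizes: number of valid cases in each dependent-variable group.
    n_predictors: number of independent variables in the analysis.
    """
    total_cases = sum(group_sizes)
    # Screen 1: ratio of cases to independent variables is at least 5 to 1.
    ratio_ok = total_cases >= min_ratio * n_predictors
    # Screen 2: the smallest group has more cases than there are predictors.
    smallest_ok = min(group_sizes) > n_predictors
    return ratio_ok and smallest_ok


# Hypothetical counts for three groups and four predictors.
print(check_sample_size([50, 60, 28], 4))
print(check_sample_size([50, 60, 3], 4))
```

A False result corresponds to the "Inappropriate application of a statistic" branch.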
Steps in discriminant analysis:
running the baseline model

Slide 81

Run baseline discriminant analysis, using method for including variables identified in the
research question. Record cross-validated classification accuracy for evaluation of
transformations and removal of outliers.

Metric IVs normally distributed?
  No: Try:
    1. Logarithmic transformation
    2. Square root transformation
    3. Inverse transformation
    If unsuccessful, add caution for violation of discriminant analysis assumptions
  Yes: continue

Mahalanobis D² to most likely group greater than critical value?
  Yes: Remove outliers from data set
  No: continue
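
As a rough illustration of the "Try" branch, the sketch below applies the three transformations to a variable and reports skewness before and after. It is not the course's normality script: the helper names and the sample data are hypothetical, skewness is used only as one plausible indicator, and the normality criteria themselves are not restated in this section. All values are assumed to be strictly positive.

```python
import math


def skewness(values):
    """Population skewness (third standardized moment) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def candidate_transformations(values):
    """Return the three transformations tried in the flowchart.

    Assumes every value is strictly positive; otherwise the logarithm and
    inverse are undefined and the data would need shifting first.
    """
    return {
        "log": [math.log10(v) for v in values],
        "square_root": [math.sqrt(v) for v in values],
        "inverse": [1.0 / v for v in values],
    }


hours = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]  # hypothetical right-skewed data
print("original", round(skewness(hours), 2))
for name, transformed in candidate_transformations(hours).items():
    print(name, round(skewness(transformed), 2))
```

A transformed variable would be substituted only if it satisfies the normality criteria that the original variable failed.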
Steps in discriminant analysis:
picking discriminant model for interpretation

Slide 82

Were transformed variables substituted, or outliers and influential cases omitted?
  No: Pick baseline discriminant analysis for interpretation
  Yes: continue

Evaluate impact of transformations and removal of outliers by running discriminant again,
using method for including variables identified in the research question.

Cross-validated accuracy for second discriminant analysis greater than accuracy of baseline
by 2% or more?
  Yes: Pick discriminant analysis with transformations and omitting outliers for interpretation
  No: Pick baseline discriminant analysis for interpretation
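
The comparison above reduces to a single threshold test. A minimal sketch follows, assuming that "2% or more" means a gain of at least 2 percentage points in cross-validated accuracy; the function name `pick_model` and its arguments are hypothetical.

```python
def pick_model(baseline_accuracy, revised_accuracy, min_gain=2.0):
    """Choose which analysis to interpret, following the 2% rule.

    Accuracies are cross-validated percentages (for example 45.0 for 45%).
    Assumption: "2% or more" means at least 2 percentage points.
    """
    if revised_accuracy - baseline_accuracy >= min_gain:
        # Transformations and/or outlier removal improved accuracy enough.
        return "revised"
    return "baseline"


print(pick_model(45.0, 48.0))
print(pick_model(45.0, 46.0))
```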
Steps in discriminant analysis:
usable discriminant model

Slide 83

Sufficient statistically significant functions to distinguish DV groups?
  No: False
  Yes: continue

Tolerance for all IVs greater than 0.10, indicating no multicollinearity?
  No: False
  Yes: continue
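
For reference, tolerance is conventionally defined as the share of an independent variable's variance that is not explained by the other independent variables. The notation below is an assumption for illustration and is not taken from the slides:

```latex
\text{Tolerance}_j = 1 - R_j^2
```

Here $R_j^2$ is the squared multiple correlation obtained when independent variable $j$ is predicted from the other independent variables. Under this definition, a tolerance of 0.10 or less means the other predictors account for 90% or more of that variable's variance.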
Steps in discriminant analysis:
assumption of equal dispersion

Slide 84

Probability of Box's M test less than or equal to level of significance for assumptions?
  No: Pick discriminant analysis using within-groups covariance for interpretation
  Yes: Re-run discriminant analysis, using separate-groups covariance matrices for classification

Accuracy rate at least 2% higher using separate-groups covariance matrices?
  No: Pick discriminant analysis using within-groups covariance for interpretation
  Yes: Pick discriminant analysis using separate-groups covariance for interpretation
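
The equal-dispersion decision combines the Box's M probability with an accuracy comparison. The sketch below is one way to express it, assuming that "2% higher" means at least 2 percentage points; the function name, argument names, and example values (including the 0.01 significance level) are hypothetical.

```python
def pick_covariance_model(box_m_p, alpha, pooled_accuracy, separate_accuracy,
                          min_gain=2.0):
    """Decide which classification to interpret after Box's M test.

    box_m_p: probability (significance) reported for Box's M.
    alpha: level of significance used for testing assumptions.
    pooled_accuracy, separate_accuracy: accuracy rates in percent, using
        within-groups and separate-groups covariance matrices respectively.
    Assumption: "2% higher" means a gain of at least 2 percentage points.
    """
    if box_m_p > alpha:
        # Equal dispersion is not rejected, so no re-run is needed.
        return "within-groups"
    if separate_accuracy - pooled_accuracy >= min_gain:
        return "separate-groups"
    return "within-groups"


print(pick_covariance_model(0.001, 0.01, 45.0, 48.0))
print(pick_covariance_model(0.200, 0.01, 45.0, 50.0))
```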
Steps in discriminant analysis:
relationships between IVs and DV

Slide 85

Stepwise method of entry used to include independent variables?
  No: skip the entry-order check and go to the relationships check
  Yes: continue

Entry order of variables interpreted correctly?
  No: False
  Yes: continue

Relationships between individual IVs and DV groups interpreted correctly?
  No: False
  Yes: continue
Steps in discriminant analysis:
classification accuracy

Slide 86

Cross-validated accuracy is 25% higher than proportional by chance accuracy rate?
  No: False
  Yes: continue
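
The accuracy criterion can be worked through numerically. The sketch below assumes the usual definition of the proportional by chance accuracy rate, the sum of the squared proportions of cases in each group, which this section does not spell out. The function names and group counts are hypothetical, and accuracies are expressed as proportions rather than percentages.

```python
def proportional_by_chance_accuracy(group_sizes):
    """Sum of squared group proportions (assumed definition)."""
    total = sum(group_sizes)
    return sum((n / total) ** 2 for n in group_sizes)


def meets_accuracy_criterion(cross_validated_accuracy, group_sizes, margin=0.25):
    """True if accuracy is at least 25% higher than the by chance rate."""
    criterion = (1 + margin) * proportional_by_chance_accuracy(group_sizes)
    return cross_validated_accuracy >= criterion


sizes = [25, 25, 50]  # hypothetical group counts
print(proportional_by_chance_accuracy(sizes))
print(meets_accuracy_criterion(0.50, sizes))
```

With these counts the by chance rate is 0.375, so the criterion is 1.25 × 0.375 = 0.46875.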
Steps in discriminant analysis:
adding cautions to solution

Slide 87

Satisfies preferred ratio of cases to IVs of 20 to 1?
  No: True with caution
  Yes: continue

Satisfies preferred DV group minimum size of 20 cases?
  No: True with caution
  Yes: continue

DV is non-metric level and IVs are interval level or dichotomous (not ordinal)?
  No: True with caution
  Yes: True
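
The three caution checks on this slide can be collected into one helper. This is a sketch under stated assumptions: the function name, parameters, and caution wording are hypothetical, and it covers only the checks listed on this slide, not cautions added earlier for violated assumptions.

```python
def answer_with_cautions(n_cases, n_predictors, smallest_group,
                         has_ordinal_variable):
    """Return the final answer and the list of cautions that apply."""
    cautions = []
    # Preferred ratio of cases to independent variables is 20 to 1.
    if n_cases < 20 * n_predictors:
        cautions.append("ratio of cases to IVs below 20 to 1")
    # Preferred minimum size for every dependent-variable group is 20 cases.
    if smallest_group < 20:
        cautions.append("smallest DV group has fewer than 20 cases")
    # Ordinal variables are tolerated, but only with a caution.
    if has_ordinal_variable:
        cautions.append("ordinal-level variable included")
    answer = "True with caution" if cautions else "True"
    return answer, cautions


# Hypothetical values: 138 cases, 4 predictors, smallest group of 28.
print(answer_with_cautions(138, 4, 28, True))
```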
