
SPSS Independent Samples T Test

The SPSS independent samples t-test is a procedure for testing whether two populations have equal means on a single metric variable. The two populations are identified in the sample by a dichotomous variable. The two groups of cases are considered "independent samples" because no case belongs to both groups simultaneously; that is, the samples don't overlap.

SPSS Independent Samples T-Test Example

A marketer wants to know whether women spend the same amount of money on clothes as men. She asks 30 male and 30 female respondents how many euros they spend on clothing each month, resulting in clothing_expenses.sav. Do these data contradict the null hypothesis that men and women spend equal amounts of money on clothing?

1. Quick Data Check

Respondents were able to fill in any number of euros. Before moving on to the actual t-test, we
first need to get a basic idea of what the data look like. We'll take a quick look at the histogram
for the amounts spent. We'll obtain this by running FREQUENCIES as shown in the syntax
below.
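A minimal sketch of that syntax, assuming the expense variable in clothing_expenses.sav is named amount_spent (the name used in the t-test syntax later in this section):

*Quick data check: histogram and range of monthly clothing expenses.

FREQUENCIES VARIABLES=amount_spent
 /FORMAT=NOTABLE
 /STATISTICS=MINIMUM MAXIMUM
 /HISTOGRAM.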

Result

These values look plausible. The maximum monthly amount spent on clothing (around €525,-) is not unlikely for one or two respondents; the vast majority spend under €100,-. Also, note that N = 60, which tells us that there are no missing values.

2. Assumptions Independent Samples T-Test

If we just run our test at this point, SPSS will immediately provide us with relevant test statistics
and a p-value. However, such results can only be taken seriously insofar as the independent t-test
assumptions have been met. These are:

1. independent and identically distributed variables (or, less precisely, "independent observations");
2. the dependent variable is normally distributed in both populations;
3. homoscedasticity: the variances of the two populations are equal.

Assumption 1 is mostly theoretical. Violation of assumption 2 hardly affects test results for reasonable sample sizes (say n > 30). If this doesn't hold, consider a Mann-Whitney test instead of, or in addition to, the t-test.
If assumption 3 is violated, the test results need to be corrected. For the independent samples t-test, the SPSS output contains both the uncorrected and the corrected results by default.
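If you do want that Mann-Whitney alternative, a minimal sketch of the syntax, using the gender and amount_spent variables from this example, is:

*Mann-Whitney U test as a nonparametric alternative to the independent samples t-test.

NPAR TESTS
 /M-W=amount_spent BY gender(0 1).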

3. Run SPSS Independent Samples T-Test

We'll navigate to Analyze > Compare Means > Independent-Samples T Test.


Now we specify our test variable and our grouping variable. Next, we define our groups as 0 and 1, since these are the values used for female and male respondents. Finally, we Paste and run the syntax.

*Run independent samples t-test.

T-TEST GROUPS=gender(0 1)
/MISSING=ANALYSIS
/VARIABLES=amount_spent
/CRITERIA=CI(.95).

4. SPSS Independent Samples T-Test Output

From the first table, showing some basic descriptives, we see that 30 female and 30 male
respondents are included in the test.
Female respondents spent an average of €136,- on clothing each month. For male
respondents this is only €88,-. The difference is roughly €48,-.

SPSS Independent Samples T-Test Output

As shown in the screenshot, the t-test results are reported twice. The first line ("equal
variances assumed") assumes that the aforementioned assumption of equal variances has been
met.

If this assumption doesn't hold, the t-test results need to be corrected. These corrected results
are presented in the second line ("equal variances not assumed").

Whether the assumption of equal variances holds is evaluated using Levene's test for the equality of variances. As a rule of thumb, if its p-value ("Sig.") > .05, use the first line of t-test results. Conversely, if Sig. < .05, we reject the null hypothesis of equal variances and thus use the second line of t-test results.

The difference between the amount spent by men and women is around €48,- as we'd already
seen previously.

The probability of finding this or a larger absolute difference between the two means, if men and women really do spend equal amounts, is about 14%. Since this is a fair chance, we do not reject the null hypothesis that men and women spend equal amounts of money on clothing.
Note that the p-value is two-tailed. This means that the 14% chance consists of a 7% chance of
finding a mean difference smaller than € -48,- and another 7% chance for a difference larger than
€ 48,-.

5. Reporting the Independent Samples T-Test Results

When reporting the results of an independent samples t-test, we usually present a table with the
sample sizes, means and standard deviations. Regarding the significance test, we'll state that “on
average, men did not spend a different amount than women; t(58) = 1.5, p = .14.”

Independent t-test for two samples

Introduction

The independent t-test, also called the two sample t-test, independent-samples t-test or Student's
t-test, is an inferential statistical test that determines whether there is a statistically significant
difference between the means in two unrelated groups.

Null and alternative hypotheses for the independent t-test

The null hypothesis for the independent t-test is that the population means from the two
unrelated groups are equal:

H0: µ1 = µ2

In most cases, we are looking to see if we can show that we can reject the null hypothesis and
accept the alternative hypothesis, which is that the population means are not equal:

HA: µ1 ≠ µ2

To do this, we need to set a significance level (also called alpha) against which we compare the p-value in order to decide whether or not to reject the null hypothesis. Most commonly, this value is set at 0.05.

What do you need to run an independent t-test?

In order to run an independent t-test, you need the following:

• One independent, categorical variable that has two levels/groups.
• One continuous dependent variable.

Unrelated groups

Unrelated groups, also called unpaired groups or independent groups, are groups in which the
cases (e.g., participants) in each group are different. Often we are investigating differences in
individuals, which means that when comparing two groups, an individual in one group cannot
also be a member of the other group and vice versa. An example would be gender: an individual would have to be classified as either male or female, not both.

Assumption of normality of the dependent variable

The independent t-test requires that the dependent variable is approximately normally distributed
within each group.

Note: Technically, it is the residuals that need to be normally distributed, but for an independent
t-test, both will give you the same result.

You can test for this using a number of different tests, but the Shapiro-Wilk test of normality or a graphical method, such as a Q-Q Plot, are very common. You can run these tests using SPSS Statistics, the procedure for which can be found in our Testing for Normality guide. However, the t-test is described as a robust test with respect to the assumption of normality. This means that some deviation away from normality does not have a large influence on Type I error rates. The exception to this is when the ratio of the largest to smallest group size is greater than 1.5.
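As a rough sketch of what such a check looks like in syntax, using the clothing-expenses variables from the earlier example (substitute your own dependent and grouping variables):

*Shapiro-Wilk tests, descriptives, boxplots and normal Q-Q plots per group.

EXAMINE VARIABLES=amount_spent BY gender
 /PLOT BOXPLOT NPPLOT
 /STATISTICS DESCRIPTIVES.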

What to do when you violate the normality assumption

If you find that the data for one or both of your groups is not approximately normally distributed and group sizes differ greatly, you have two options: (1) transform your data so that the data
becomes normally distributed (to do this in SPSS Statistics see our guide on Transforming Data),
or (2) run the Mann-Whitney U test which is a non-parametric test that does not require the
assumption of normality (to run this test in SPSS Statistics see our guide on the Mann-Whitney
U Test).
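A minimal sketch of option (1), a log transformation via COMPUTE; the new variable name log_amount is a placeholder, and the transformation assumes all values are positive:

*Create a log-transformed copy of the dependent variable, then re-check normality or rerun the t-test on it.

COMPUTE log_amount = LN(amount_spent).
EXECUTE.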

Assumption of homogeneity of variance

The independent t-test assumes the variances of the two groups you are measuring are equal in
the population. If your variances are unequal, this can affect the Type I error rate. The
assumption of homogeneity of variance can be tested using Levene's Test of Equality of
Variances, which is produced in SPSS Statistics when running the independent t-test procedure.

If you have run Levene's Test of Equality of Variances in SPSS Statistics, you will get a result
similar to that below:

This test for homogeneity of variance provides an F-statistic and a significance value (p-value).
We are primarily concerned with the significance value – if it is greater than 0.05 (i.e., p > .05),
our group variances can be treated as equal. However, if p < 0.05, we have unequal variances
and we have violated the assumption of homogeneity of variances.

Overcoming a violation of the assumption of homogeneity of variance

If the Levene's Test for Equality of Variances is statistically significant, which indicates that the
group variances are unequal in the population, you can correct for this violation by not using the
pooled estimate for the error term for the t-statistic, but instead using an adjustment to the
degrees of freedom using the Welch-Satterthwaite method. In reality, you will probably never
have heard of these adjustments because SPSS Statistics hides this information and simply labels
the two options as "Equal variances assumed" and "Equal variances not assumed" without
explicitly stating the underlying tests used. However, you can see the evidence of these tests as
below:

From the result of Levene's Test for Equality of Variances, we can reject the null hypothesis that
there is no difference in the variances between the groups and accept the alternative hypothesis
that there is a statistically significant difference in the variances between groups. The effect of
not being able to assume equal variances is evident in the final column of the above figure where
we see a reduction in the value of the t-statistic and a large reduction in the degrees of freedom
(df). This has the effect of increasing the p-value above the critical significance level of 0.05. In this case, we therefore fail to reject the null hypothesis and conclude that there is no statistically significant difference between the means. This would not have been our conclusion had we not tested for homogeneity of variances.
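For reference, the adjusted degrees of freedom on the "equal variances not assumed" line follow the standard Welch-Satterthwaite approximation, with sample variances s1², s2² and group sizes n1, n2:

$$ df \approx \frac{\left( s_1^2/n_1 + s_2^2/n_2 \right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} $$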

Reporting the result of an independent t-test

When reporting the result of an independent t-test, you need to include the t-statistic value, the
degrees of freedom (df) and the significance value of the test (p-value). The format of the test result is: t(df) = t-statistic, p = significance value. Therefore, for the example above, you could
report the result as t(7.001) = 2.233, p = 0.061.

Fully reporting your results

In order to provide enough information for readers to fully understand the results when you have
run an independent t-test, you should include the result of normality tests, Levene's Equality of
Variances test, the two group means and standard deviations, the actual t-test result and the
direction of the difference (if any). In addition, you might also wish to include the difference
between the groups along with a 95% confidence interval. For example:

• General

Inspection of Q-Q Plots revealed that cholesterol concentration was normally distributed for both
groups and that there was homogeneity of variance as assessed by Levene's Test for Equality of
Variances. Therefore, an independent t-test was run on the data with a 95% confidence interval
(CI) for the mean difference. It was found that after the two interventions, cholesterol
concentrations in the dietary group (6.15 ± 0.52 mmol/L) were significantly higher than the
exercise group (5.80 ± 0.38 mmol/L) (t(38) = 2.470, p = 0.018) with a difference of 0.35 (95%
CI, 0.06 to 0.64) mmol/L.

Independent t-test using SPSS Statistics

Introduction

The independent-samples t-test (or independent t-test, for short) compares the means between
two unrelated groups on the same continuous, dependent variable. For example, you could use an
independent t-test to understand whether first year graduate salaries differed based on gender
(i.e., your dependent variable would be "first year graduate salaries" and your independent
variable would be "gender", which has two groups: "male" and "female"). Alternately, you could
use an independent t-test to understand whether there is a difference in test anxiety based on
educational level (i.e., your dependent variable would be "test anxiety" and your independent
variable would be "educational level", which has two groups: "undergraduates" and
"postgraduates").

This "quick start" guide shows you how to carry out an independent t-test using SPSS Statistics,
as well as interpret and report the results from this test. However, before we introduce you to this
procedure, you need to understand the different assumptions that your data must meet in order
for an independent t-test to give you a valid result. We discuss these assumptions next.

Assumptions

When you choose to analyse your data using an independent t-test, part of the process involves
checking to make sure that the data you want to analyse can actually be analysed using an
independent t-test. You need to do this because it is only appropriate to use an independent t-test if your data "passes" six assumptions that are required for an independent t-test to give you a
valid result. In practice, checking for these six assumptions just adds a little bit more time to your
analysis, requiring you to click a few more buttons in SPSS Statistics when performing your
analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these six assumptions, do not be surprised if, when analysing your
own data using SPSS Statistics, one or more of these assumptions is violated (i.e., is not met).
This is not uncommon when working with real-world data rather than textbook examples, which
often only show you how to carry out an independent t-test when everything goes well!
However, don't worry. Even when your data fails certain assumptions, there is often a solution to
overcome this. First, let's take a look at these six assumptions:

• Assumption #1: Your dependent variable should be measured on a continuous scale (i.e., it is measured at the interval or ratio level). Examples of variables that meet this
criterion include revision time (measured in hours), intelligence (measured using IQ
score), exam performance (measured from 0 to 100), weight (measured in kg), and so
forth. You can learn more about continuous variables in our article: Types of Variable.
• Assumption #2: Your independent variable should consist of two categorical,
independent groups. Example independent variables that meet this criterion include
gender (2 groups: male or female), employment status (2 groups: employed or
unemployed), smoker (2 groups: yes or no), and so forth.
• Assumption #3: You should have independence of observations, which means that
there is no relationship between the observations in each group or between the groups
themselves. For example, there must be different participants in each group with no
participant being in more than one group. This is more of a study design issue than
something you can test for, but it is an important assumption of the independent t-test. If
your study fails this assumption, you will need to use another statistical test instead of the
independent t-test (e.g., a paired-samples t-test). If you are unsure whether your study
meets this assumption, you can use our Statistical Test Selector, which is part of our
enhanced content.
• Assumption #4: There should be no significant outliers. Outliers are simply single data
points within your data that do not follow the usual pattern (e.g., in a study of 100
students' IQ scores, where the mean score was 108 with only a small variation between
students, one student had a score of 156, which is very unusual, and may even put her in
the top 1% of IQ scores globally). The problem with outliers is that they can have a
negative effect on the independent t-test, reducing the validity of your results.
Fortunately, when using SPSS Statistics to run an independent t-test on your data, you
can easily detect possible outliers. In our enhanced independent t-test guide, we: (a) show
you how to detect outliers using SPSS Statistics; and (b) discuss some of the options you
have in order to deal with outliers.
• Assumption #5: Your dependent variable should be approximately normally
distributed for each group of the independent variable. We talk about the independent
t-test only requiring approximately normal data because it is quite "robust" to violations
of normality, meaning that this assumption can be a little violated and still provide valid
results. You can test for normality using the Shapiro-Wilk test of normality, which is
easily tested for using SPSS Statistics. In addition to showing you how to do this in our enhanced independent t-test guide, we also explain what you can do if your data fails this
assumption (i.e., if it fails it more than a little bit).
• Assumption #6: There needs to be homogeneity of variances. You can test this
assumption in SPSS Statistics using Levene’s test for homogeneity of variances. In our
enhanced independent t-test guide, we (a) show you how to perform Levene’s test for
homogeneity of variances in SPSS Statistics, (b) explain some of the things you will need
to consider when interpreting your data, and (c) present possible ways to continue with
your analysis if your data fails to meet this assumption.

You can check assumptions #4, #5 and #6 using SPSS Statistics. Before doing this, you should
make sure that your data meets assumptions #1, #2 and #3, although you don't need SPSS
Statistics to do this. When moving on to assumptions #4, #5 and #6, we suggest testing them in
this order because it represents an order where, if a violation to the assumption is not correctable,
you will no longer be able to use an independent t-test (although you may be able to run another
statistical test on your data instead). Just remember that if you do not run the statistical tests on
these assumptions correctly, the results you get when running an independent t-test might not be
valid. This is why we dedicate a number of sections of our enhanced independent t-test guide to
help you get this right. You can find out about our enhanced independent t-test guide here, or
more generally, our enhanced content as a whole here.

In the section, Test Procedure in SPSS Statistics, we illustrate the SPSS Statistics procedure
required to perform an independent t-test assuming that no assumptions have been violated. First,
we set out the example we use to explain the independent t-test procedure in SPSS Statistics.
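For orientation, the syntax that SPSS Statistics pastes for this procedure generally looks like the sketch below; the variable names salary and gender and the group codes 1 and 2 are hypothetical stand-ins for the graduate-salary example mentioned in the introduction:

T-TEST GROUPS=gender(1 2)
 /MISSING=ANALYSIS
 /VARIABLES=salary
 /CRITERIA=CI(.95).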

Multiple Regression Analysis using SPSS Statistics

Introduction

Multiple regression is an extension of simple linear regression. It is used when we want to


predict the value of a variable based on the value of two or more other variables. The variable we
want to predict is called the dependent variable (or sometimes, the outcome, target or criterion
variable). The variables we are using to predict the value of the dependent variable are called the
independent variables (or sometimes, the predictor, explanatory or regressor variables).

For example, you could use multiple regression to understand whether exam performance can be
predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, you
could use multiple regression to understand whether daily cigarette consumption can be
predicted based on smoking duration, age when started smoking, smoker type, income and
gender.

Multiple regression also allows you to determine the overall fit (variance explained) of the model
and the relative contribution of each of the predictors to the total variance explained. For
example, you might want to know how much of the variation in exam performance can be
explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the
"relative contribution" of each independent variable in explaining the variance.

This "quick start" guide shows you how to carry out multiple regression using SPSS Statistics, as
well as interpret and report the results from this test. However, before we introduce you to this
procedure, you need to understand the different assumptions that your data must meet in order
for multiple regression to give you a valid result. We discuss these assumptions next.

Assumptions

When you choose to analyse your data using multiple regression, part of the process involves
checking to make sure that the data you want to analyse can actually be analysed using multiple
regression. You need to do this because it is only appropriate to use multiple regression if your
data "passes" eight assumptions that are required for multiple regression to give you a valid
result. In practice, checking for these eight assumptions just adds a little bit more time to your
analysis, requiring you to click a few more buttons in SPSS Statistics when performing your
analysis, as well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these eight assumptions, do not be surprised if, when analysing your
own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This
is not uncommon when working with real-world data rather than textbook examples, which often
only show you how to carry out multiple regression when everything goes well! However, don’t
worry. Even when your data fails certain assumptions, there is often a solution to overcome this.
First, let's take a look at these eight assumptions:

• Assumption #1: Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this
criterion include revision time (measured in hours), intelligence (measured using IQ
score), exam performance (measured from 0 to 100), weight (measured in kg), and so
forth. You can learn more about interval and ratio variables in our article: Types of
Variable. If your dependent variable was measured on an ordinal scale, you will need to
carry out ordinal regression rather than multiple regression. Examples of ordinal
variables include Likert items (e.g., a 7-point scale from "strongly agree" through to
"strongly disagree"), amongst other ways of ranking categories (e.g., a 3-point scale
explaining how much a customer liked a product, ranging from "Not very much" to "Yes,
a lot"). You can access our SPSS Statistics guide on ordinal regression here.
• Assumption #2: You have two or more independent variables, which can be either
continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal
variable). For examples of continuous and ordinal variables, see the bullet above.
Examples of nominal variables include gender (e.g., 2 groups: male and female),
ethnicity (e.g., 3 groups: Caucasian, African American and Hispanic), physical activity
level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups:
surgeon, doctor, nurse, dentist, therapist), and so forth. Again, you can learn more about
variables in our article: Types of Variable. If one of your independent variables is
dichotomous and considered a moderating variable, you might need to run a
Dichotomous moderator analysis.
• Assumption #3: You should have independence of observations (i.e., independence of
residuals), which you can easily check using the Durbin-Watson statistic, which is a
simple test to run using SPSS Statistics. We explain how to interpret the result of the Durbin-Watson statistic, as well as showing you the SPSS Statistics procedure required,
in our enhanced multiple regression guide.
• Assumption #4: There needs to be a linear relationship between (a) the dependent
variable and each of your independent variables, and (b) the dependent variable and the
independent variables collectively. Whilst there are a number of ways to check for these
linear relationships, we suggest creating scatterplots and partial regression plots using
SPSS Statistics, and then visually inspecting these scatterplots and partial regression plots
to check for linearity. If the relationship displayed in your scatterplots and partial
regression plots are not linear, you will have to either run a non-linear regression analysis
or "transform" your data, which you can do using SPSS Statistics. In our enhanced
multiple regression guide, we show you how to: (a) create scatterplots and partial
regression plots to check for linearity when carrying out multiple regression using SPSS
Statistics; (b) interpret different scatterplot and partial regression plot results; and (c)
transform your data using SPSS Statistics if you do not have linear relationships between
your variables.
• Assumption #5: Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move along the line. We explain
more about what this means and how to assess the homoscedasticity of your data in our
enhanced multiple regression guide. When you analyse your own data, you will need to
plot the studentized residuals against the unstandardized predicted values. In our
enhanced multiple regression guide, we explain: (a) how to test for homoscedasticity
using SPSS Statistics; (b) some of the things you will need to consider when interpreting
your data; and (c) possible ways to continue with your analysis if your data fails to meet
this assumption.
• Assumption #6: Your data must not show multicollinearity, which occurs when you
have two or more independent variables that are highly correlated with each other. This
leads to problems with understanding which independent variable contributes to the
variance explained in the dependent variable, as well as technical issues in calculating a
multiple regression model. Therefore, in our enhanced multiple regression guide, we
show you: (a) how to use SPSS Statistics to detect for multicollinearity through an
inspection of correlation coefficients and Tolerance/VIF values; and (b) how to interpret
these correlation coefficients and Tolerance/VIF values so that you can determine
whether your data meets or violates this assumption.
• Assumption #7: There should be no significant outliers, high leverage points or highly
influential points. Outliers, leverage and influential points are different terms used to
represent observations in your data set that are in some way unusual when you wish to
perform a multiple regression analysis. These different classifications of unusual points
reflect the different impact they have on the regression line. An observation can be
classified as more than one type of unusual point. However, all these points can have a
very negative effect on the regression equation that is used to predict the value of the
dependent variable based on the independent variables. This can change the output that
SPSS Statistics produces and reduce the predictive accuracy of your results as well as the
statistical significance. Fortunately, when using SPSS Statistics to run multiple regression
on your data, you can detect possible outliers, high leverage points and highly influential
points. In our enhanced multiple regression guide, we: (a) show you how to detect
outliers using "casewise diagnostics" and "studentized deleted residuals", which you can do using SPSS Statistics, and discuss some of the options you have in order to deal with
outliers; (b) check for leverage points using SPSS Statistics and discuss what you should
do if you have any; and (c) check for influential points in SPSS Statistics using a measure
of influence known as Cook's Distance, before presenting some practical approaches in
SPSS Statistics to deal with any influential points you might have.
• Assumption #8: Finally, you need to check that the residuals (errors) are
approximately normally distributed (we explain these terms in our enhanced multiple
regression guide). Two common methods to check this assumption include using: (a) a
histogram (with a superimposed normal curve) and a Normal P-P Plot; or (b) a Normal
Q-Q Plot of the studentized residuals. Again, in our enhanced multiple regression guide,
we: (a) show you how to check this assumption using SPSS Statistics, whether you use a
histogram (with superimposed normal curve) and Normal P-P Plot, or Normal Q-Q Plot;
(b) explain how to interpret these diagrams; and (c) provide a possible solution if your
data fails to meet this assumption.

You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and
#2 should be checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8. Just
remember that if you do not run the statistical tests on these assumptions correctly, the results
you get when running multiple regression might not be valid. This is why we dedicate a number
of sections of our enhanced multiple regression guide to help you get this right.

In the section, Procedure, we illustrate the SPSS Statistics procedure to perform a multiple
regression assuming that no assumptions have been violated. First, we introduce the example that
is used in this guide.
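As a rough sketch of how several of the checks above map onto the REGRESSION command, with hypothetical variable names based on the exam-performance example in the introduction:

*Multiple regression with common diagnostics: Durbin-Watson, residual histogram and normal probability plot, collinearity statistics, partial regression plots, and casewise outlier listing.

REGRESSION
 /DEPENDENT exam_score
 /METHOD=ENTER revision_time test_anxiety lecture_attendance gender
 /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
 /RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
 /SCATTERPLOT=(*ZRESID ,*ZPRED)
 /PARTIALPLOT ALL
 /CASEWISE PLOT(ZRESID) OUTLIERS(3).

Here COLLIN and TOL produce the Tolerance/VIF values mentioned under assumption #6, DURBIN gives the Durbin-Watson statistic for assumption #3, and plotting *ZRESID against *ZPRED is one common way to eyeball homoscedasticity.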

Linear Regression Analysis using SPSS Statistics

Introduction

Linear regression is the next step up after correlation. It is used when we want to predict the
value of a variable based on the value of another variable. The variable we want to predict is
called the dependent variable (or sometimes, the outcome variable). The variable we are using to
predict the other variable's value is called the independent variable (or sometimes, the predictor
variable). For example, you could use linear regression to understand whether exam performance
can be predicted based on revision time; whether cigarette consumption can be predicted based
on smoking duration; and so forth. If you have two or more independent variables, rather than
just one, you need to use multiple regression.

This "quick start" guide shows you how to carry out linear regression using SPSS Statistics, as
well as interpret and report the results from this test. However, before we introduce you to this
procedure, you need to understand the different assumptions that your data must meet in order
for linear regression to give you a valid result. We discuss these assumptions next.

Assumptions

When you choose to analyse your data using linear regression, part of the process involves
checking to make sure that the data you want to analyse can actually be analysed using linear
regression. You need to do this because it is only appropriate to use linear regression if your data
"passes" six assumptions that are required for linear regression to give you a valid result. In
practice, checking for these six assumptions just adds a little bit more time to your analysis,
requiring you to click a few more buttons in SPSS Statistics when performing your analysis, as
well as think a little bit more about your data, but it is not a difficult task.

Before we introduce you to these six assumptions, do not be surprised if, when analysing your
own data using SPSS Statistics, one or more of these assumptions is violated (i.e., not met). This
is not uncommon when working with real-world data rather than textbook examples, which often
only show you how to carry out linear regression when everything goes well! However, don’t
worry. Even when your data fails certain assumptions, there is often a solution to overcome this.
First, let’s take a look at these six assumptions:

• Assumption #1: Your two variables should be measured at the continuous level (i.e.,
they are either interval or ratio variables). Examples of continuous variables include
revision time (measured in hours), intelligence (measured using IQ score), exam
performance (measured from 0 to 100), weight (measured in kg), and so forth. You can
learn more about interval and ratio variables in our article: Types of Variable.
• Assumption #2: There needs to be a linear relationship between the two variables.
Whilst there are a number of ways to check whether a linear relationship exists between
your two variables, we suggest creating a scatterplot using SPSS Statistics where you can
plot the dependent variable against your independent variable and then visually inspect
the scatterplot to check for linearity. Your scatterplot may look something like one of the
following:

If the relationship displayed in your scatterplot is not linear, you will have to either run a
non-linear regression analysis, perform a polynomial regression or "transform" your data,
which you can do using SPSS Statistics. In our enhanced guides, we show you how to: (a) create a scatterplot to check for linearity when carrying out linear regression using
SPSS Statistics; (b) interpret different scatterplot results; and (c) transform your data
using SPSS Statistics if there is not a linear relationship between your two variables.
• Assumption #3: There should be no significant outliers. An outlier is an observed data
point that has a dependent variable value that is very different to the value predicted by
the regression equation. As such, an outlier will be a point on a scatterplot that is
(vertically) far away from the regression line indicating that it has a large residual, as
highlighted below:

The problem with outliers is that they can have a negative effect on the regression
analysis (e.g., reduce the fit of the regression equation) that is used to predict the value of
the dependent (outcome) variable based on the independent (predictor) variable. This will
change the output that SPSS Statistics produces and reduce the predictive accuracy of
your results. Fortunately, when using SPSS Statistics to run a linear regression on your
data, you can easily include criteria to help you detect possible outliers. In our enhanced
linear regression guide, we: (a) show you how to detect outliers using "casewise
diagnostics", which is a simple process when using SPSS Statistics; and (b) discuss some
of the options you have in order to deal with outliers.
• Assumption #4: You should have independence of observations, which you can easily
check using the Durbin-Watson statistic, which is a simple test to run using SPSS
Statistics. We explain how to interpret the result of the Durbin-Watson statistic in our
enhanced linear regression guide.
• Assumption #5: Your data needs to show homoscedasticity, which is where the
variances along the line of best fit remain similar as you move along the line. Whilst we
explain more about what this means and how to assess the homoscedasticity of your data
in our enhanced linear regression guide, take a look at the three scatterplots below, which
provide three simple examples: two of data that fail the assumption (called
heteroscedasticity) and one of data that meets this assumption (called homoscedasticity):

Whilst these help to illustrate the differences in data that meets or violates the assumption
of homoscedasticity, real-world data can be a lot more messy and illustrate different
patterns of heteroscedasticity. Therefore, in our enhanced linear regression guide, we
explain: (a) some of the things you will need to consider when interpreting your data; and
(b) possible ways to continue with your analysis if your data fails to meet this
assumption.
• Assumption #6: Finally, you need to check that the residuals (errors) of the regression
line are approximately normally distributed (we explain these terms in our enhanced
linear regression guide). Two common methods to check this assumption include using
either a histogram (with a superimposed normal curve) or a Normal P-P Plot. Again, in
our enhanced linear regression guide, we: (a) show you how to check this assumption
using SPSS Statistics, whether you use a histogram (with superimposed normal curve) or
Normal P-P Plot; (b) explain how to interpret these diagrams; and (c) provide a possible
solution if your data fails to meet this assumption.

You can check assumptions #2, #3, #4, #5 and #6 using SPSS Statistics. Assumption #2 should be checked first, before moving onto assumptions #3, #4, #5 and #6. We suggest testing the
assumptions in this order because assumptions #3, #4, #5 and #6 require you to run the linear
regression procedure in SPSS Statistics first, so it is easier to deal with these after checking
assumption #2. Just remember that if you do not run the statistical tests on these assumptions
correctly, the results you get when running a linear regression might not be valid. This is why we
dedicate a number of sections of our enhanced linear regression guide to help you get this right.
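A comparable sketch for the simple linear regression case, again with hypothetical variable names taken from the revision-time example:

*Simple linear regression with Durbin-Watson, residual plots and casewise outlier listing.

REGRESSION
 /DEPENDENT exam_score
 /METHOD=ENTER revision_time
 /RESIDUALS DURBIN HISTOGRAM(ZRESID) NORMPROB(ZRESID)
 /SCATTERPLOT=(*ZRESID ,*ZPRED)
 /CASEWISE PLOT(ZRESID) OUTLIERS(3).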

SPSS Web Books
Regression with SPSS
Chapter 2 - Regression Diagnostics

Chapter Outline
2.0 Regression Diagnostics
2.1 Unusual and Influential data
2.2 Tests on Normality of Residuals
2.3 Tests on Nonconstant Error of Variance
2.4 Tests on Multicollinearity
2.5 Tests on Nonlinearity
2.6 Model Specification
2.7 Issues of Independence
2.8 Summary
2.9 For more information

2.0 Regression Diagnostics

In our last chapter, we learned how to do ordinary linear regression with SPSS, concluding with
methods for examining the distribution of variables to check for non-normally distributed
variables as a first look at checking assumptions in regression. Without verifying that your data
have met the regression assumptions, your results may be misleading. This chapter will explore
how you can use SPSS to test whether your data meet the assumptions of linear regression. In
particular, we will consider the following assumptions.

• Linearity - the relationships between the predictors and the outcome variable should be linear
• Normality - the errors should be normally distributed; technically, normality is necessary only for the t-tests to be valid, while estimation of the coefficients only requires that the errors be identically and independently distributed
• Homogeneity of variance (homoscedasticity) - the error variance should be constant
• Independence - the errors associated with one observation are not correlated with the errors of any other observation
• Model specification - the model should be properly specified (including all relevant variables, and excluding irrelevant variables)

Additionally, there are issues that can arise during the analysis that, while strictly speaking are
not assumptions of regression, are none the less, of great concern to regression analysts.

• Influence - individual observations that exert undue influence on the coefficients
• Collinearity - predictors that are highly collinear, i.e. linearly related, can cause problems in estimating the regression coefficients.

Many graphical methods and numerical tests have been developed over the years for regression
diagnostics and SPSS makes many of these methods easy to access and use. In this chapter, we will explore these methods and show how to verify regression assumptions and detect potential
problems using SPSS.

2.1 Unusual and Influential data

A single observation that is substantially different from all other observations can make a large
difference in the results of your regression analysis. If a single observation (or small group of
observations) substantially changes your results, you would want to know about this and
investigate further. There are three ways that an observation can be unusual.

Outliers: In linear regression, an outlier is an observation with large residual. In other words, it
is an observation whose dependent-variable value is unusual given its values on the predictor
variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other
problem.

Leverage: An observation with an extreme value on a predictor variable is called a point with
high leverage. Leverage is a measure of how far an observation deviates from the mean of that
variable. These leverage points can have an unusually large effect on the estimate of regression
coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of the coefficients. Influence can be thought of as the product of leverage and outlierness.

How can we identify these three types of observations? Let's look at an example dataset called
crime. This dataset appears in Statistical Methods for Social Sciences, Third Edition by Alan
Agresti and Barbara Finlay (Prentice Hall, 1997). The variables are state id (sid), state name
(state), violent crimes per 100,000 people (crime), murders per 1,000,000 (murder), the
percent of the population living in metropolitan areas (pctmetro), the percent of the population
that is white (pctwhite), percent of population with a high school education or above (pcths),
percent of population living under poverty line (poverty), and percent of population that are
single parents (single). Below we read in the file and do some descriptive statistics on these
variables. You can click crime.sav to access this file, or see the Regression with SPSS page to
download all of the data files used in this book.

get file = "c:\spssreg\crime.sav".

descriptives
/var=crime murder pctmetro pctwhite pcths poverty single.

Descriptive Statistics

                      N   Minimum   Maximum      Mean   Std. Deviation
CRIME                51        82      2922    612.84         441.100
MURDER               51      1.60     78.50    8.7275        10.71758
PCTMETRO             51     24.00    100.00   67.3902        21.95713
PCTWHITE             51     31.80     98.50   84.1157        13.25839
PCTHS                51     64.30     86.60   76.2235         5.59209
POVERTY              51      8.00     26.40   14.2588         4.58424
SINGLE               51      8.40     22.10   11.3255         2.12149
Valid N (listwise)   51

Let's say that we want to predict crime by pctmetro, poverty, and single . That is to say, we
want to build a linear regression model between the response variable crime and the independent
variables pctmetro, poverty and single. We will first look at the scatter plots of crime against
each of the predictor variables before the regression analysis so we will have some ideas about
potential problems. We can create a scatterplot matrix of these variables as shown below.

graph
/scatterplot(matrix)=crime murder pctmetro pctwhite pcths poverty single .

The graphs of crime with other variables show some potential problems. In every plot, we see a
data point that is far away from the rest of the data points. Let's make individual graphs of crime
with pctmetro and poverty and single so we can get a better view of these scatterplots. We will
use BY state(name) to plot the state name instead of a point.

GRAPH /SCATTERPLOT(BIVAR)=pctmetro WITH crime BY state(name) .

GRAPH /SCATTERPLOT(BIVAR)=poverty WITH crime BY state(name) .

GRAPH /SCATTERPLOT(BIVAR)=single WITH crime BY state(name) .

All the scatter plots suggest that the observation for state = "dc" is a point that requires extra
attention since it stands out away from all of the other points. We will keep it in mind when we
do our regression analysis.

Now let's try the regression command predicting crime from pctmetro poverty and single. We
will go step-by-step to identify all the potentially unusual or influential points afterwards.

regression
/dependent crime
/method=enter pctmetro poverty single.

Variables Entered/Removed(b)

Model Variables Entered Variables Removed Method

1 SINGLE, PCTMETRO, POVERTY(a) . Enter

a All requested variables entered.

b Dependent Variable: CRIME

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .916(a) .840 .830 182.068

a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY

b Dependent Variable: CRIME

ANOVA(b)

Model           Sum of Squares   df   Mean Square        F     Sig.
1  Regression      8170480.211    3   2723493.404   82.160  .000(a)
   Residual        1557994.534   47     33148.820
   Total           9728474.745   50

a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY

b Dependent Variable: CRIME

Coefficients(a)

Model           Unstandardized Coefficients   Standardized Coefficients         t    Sig.
                B            Std. Error       Beta
1  (Constant)   -1666.436    147.852                                      -11.271   .000
   PCTMETRO         7.829      1.255          .390                          6.240   .000
   POVERTY         17.680      6.941          .184                          2.547   .014
   SINGLE         132.408     15.503          .637                          8.541   .000

a Dependent Variable: CRIME

Let's examine the standardized residuals as a first means for identifying outliers. Below we use
the /residuals=histogram subcommand to request a histogram for the standardized
residuals. As you see, we get the standard output that we got above, as well as a table with
information about the smallest and largest residuals, and a histogram of the standardized
residuals. The histogram indicates a couple of extreme residuals worthy of investigation.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram.
Variables Entered/Removed(b)

Model Variables Entered Variables Removed Method

1 SINGLE, PCTMETRO, POVERTY(a) . Enter

a All requested variables entered.

b Dependent Variable: CRIME

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .916(a) .840 .830 182.068

a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY

b Dependent Variable: CRIME

ANOVA(b)

Model           Sum of Squares   df   Mean Square        F     Sig.
1  Regression      8170480.211    3   2723493.404   82.160  .000(a)
   Residual        1557994.534   47     33148.820
   Total           9728474.745   50

a Predictors: (Constant), SINGLE, PCTMETRO, POVERTY

b Dependent Variable: CRIME

Coefficients(a)

Model           Unstandardized Coefficients   Standardized Coefficients         t    Sig.
                B            Std. Error       Beta
1  (Constant)   -1666.436    147.852                                      -11.271   .000
   PCTMETRO         7.829      1.255          .390                          6.240   .000
   POVERTY         17.680      6.941          .184                          2.547   .014
   SINGLE         132.408     15.503          .637                          8.541   .000

a Dependent Variable: CRIME

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value -30.51 2509.43 612.84 404.240 51

Residual -523.01 426.11 .00 176.522 51

Std. Predicted Value -1.592 4.692 .000 1.000 51

Std. Residual -2.873 2.340 .000 .970 51

a Dependent Variable: CRIME

Let's now request the same kind of information, except for the studentized deleted residual. The
studentized deleted residual is the residual that would be obtained if the regression was re-run
omitting that observation from the analysis. This is useful because some points are so influential
that when they are included in the analysis they can pull the regression line close to that
observation making it appear as though it is not an outlier -- however when the observation is
deleted it then becomes more obvious how outlying it is. To save space, below we show just the
output related to the residual analysis.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid).
Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value -30.51 2509.43 612.84 404.240 51

Std. Predicted Value -1.592 4.692 .000 1.000 51

Standard Error of Predicted Value 25.788 133.343 47.561 18.563 51

Adjusted Predicted Value -39.26 2032.11 605.66 369.075 51

Residual -523.01 426.11 .00 176.522 51

Std. Residual -2.873 2.340 .000 .970 51

Stud. Residual -3.194 3.328 .015 1.072 51

Deleted Residual -646.50 889.89 7.18 223.668 51

Stud. Deleted Residual -3.571 3.766 .018 1.133 51

Mahal. Distance .023 25.839 2.941 4.014 51

Cook's Distance .000 3.203 .089 .454 51

Centered Leverage Value .000 .517 .059 .080 51

a Dependent Variable: CRIME

The histogram shows some possible outliers. We can use the outliers(sdresid) and id(state)
options to request the 10 most extreme values for the studentized deleted residual to be displayed
labeled by the state from which the observation originated. Below we show the output generated
by this option, omitting all of the rest of the output to save space. You can see that "dc" has the
largest value (3.766) followed by "ms" (-3.571) and "fl" (2.620).

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid) id(state) outliers(sdresid).
Outlier Statistics(a)

Stud. Deleted Residual
      Case Number   STATE   Statistic
 1    51            dc       3.766
 2    25            ms      -3.571
 3     9            fl       2.620
 4    18            la      -1.839
 5    39            ri      -1.686
 6    12            ia       1.590
 7    47            wa      -1.304
 8    13            id       1.293
 9    14            il       1.152
10    35            oh      -1.148

a Dependent Variable: CRIME

We can use the /casewise subcommand below to request a display of all observations where the
sdresid exceeds 2. To save space, we show just the new output generated by the /casewise
subcommand. This shows us that Florida, Mississippi and Washington DC have sdresid values
exceeding 2.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid) id(state) outliers(sdresid)
/casewise=plot(sdresid) outliers(2) .
Casewise Diagnostics(a)

Case Number STATE Stud. Deleted Residual CRIME Predicted Value Residual

9 fl 2.620 1206 779.89 426.11

25 ms -3.571 434 957.01 -523.01

51 dc 3.766 2922 2509.43 412.57

a Dependent Variable: CRIME

Now let's look at the leverage values to identify observations that will have potential great
influence on regression coefficient estimates. We can include lever with the histogram( ) and
the outliers( ) options to get more information about observations with high leverage. We show
just the new output generated by these additional subcommands below. Generally, a point with
leverage greater than (2k+2)/n should be carefully examined. Here k is the number of predictors
and n is the number of observations, so a value exceeding (2*3+2)/51 = .1568 would be worthy
of further investigation. As you see, there are 4 observations that have leverage values higher
than .1568.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid lever)
/casewise=plot(sdresid) outliers(2).
Outlier Statistics(a)

Stud. Deleted Residual
      Case Number   STATE   Statistic
 1    51            dc       3.766
 2    25            ms      -3.571
 3     9            fl       2.620
 4    18            la      -1.839
 5    39            ri      -1.686
 6    12            ia       1.590
 7    47            wa      -1.304
 8    13            id       1.293
 9    14            il       1.152
10    35            oh      -1.148

Centered Leverage Value
      Case Number   STATE   Statistic
 1    51            dc        .517
 2     1            ak        .241
 3    25            ms        .171
 4    49            wv        .161
 5    18            la        .146
 6    46            vt        .117
 7     9            fl        .083
 8    26            mt        .080
 9    31            nj        .075
10    17            ky        .072

a Dependent Variable: CRIME

As we have seen, DC is an observation that both has a large residual and large leverage. Such
points are potentially the most influential. We can make a plot that shows the leverage by the
residual and look for observations that are high in leverage and have a high residual. We can do
this using the /scatterplot subcommand as shown below. This is a quick way of checking
potential influential observations and outliers at the same time. Both types of points are of great
concern for us. As we see, "dc" is both a high residual and high leverage point, and "ms" has an
extremely negative residual but does not have such a high leverage.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever)
/casewise=plot(sdresid) outliers(2)
/scatterplot(*lever, *sdresid).

Now let's move on to overall measures of influence, specifically let's look at Cook's D, which
combines information on the residual and leverage. The lowest value that Cook's D can assume
is zero, and the higher the Cook's D is, the more influential the point is. The conventional cut-off
point is 4/n, or in this case 4/51 or .078. Below we add the cook keyword to the outliers( ) option and also to the /casewise subcommand, and we see that for the 3 outliers flagged in the "Casewise Diagnostics" table, the value of Cook's D exceeds this cutoff. And, in the "Outlier
Statistics" table, we see that "dc", "ms", "fl" and "la" are the 4 states that exceed this cutoff, all
others falling below this threshold.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid).
Casewise Diagnostics(a)

Case Number STATE Stud. Deleted Residual CRIME Cook's Distance DFFIT

9 fl 2.620 1206 .174 48.507

25 ms -3.571 434 .602 -123.490

51 dc 3.766 2922 3.203 477.319

a Dependent Variable: CRIME

Outlier Statistics(a)

Stud. Deleted Residual
      Case Number   STATE   Statistic
 1    51            dc       3.766
 2    25            ms      -3.571
 3     9            fl       2.620
 4    18            la      -1.839
 5    39            ri      -1.686
 6    12            ia       1.590
 7    47            wa      -1.304
 8    13            id       1.293
 9    14            il       1.152
10    35            oh      -1.148

Cook's Distance
      Case Number   STATE   Statistic   Sig. F
 1    51            dc       3.203       .021
 2    25            ms        .602       .663
 3     9            fl        .174       .951
 4    18            la        .159       .958
 5    39            ri        .041       .997
 6    12            ia        .041       .997
 7    13            id        .037       .997
 8    20            md        .020       .999
 9     6            co        .018       .999
10    49            wv        .016       .999

Centered Leverage Value
      Case Number   STATE   Statistic
 1    51            dc        .517
 2     1            ak        .241
 3    25            ms        .171
 4    49            wv        .161
 5    18            la        .146
 6    46            vt        .117
 7     9            fl        .083
 8    26            mt        .080
 9    31            nj        .075
10    17            ky        .072

a Dependent Variable: CRIME

Cook's D can be thought of as a general measure of influence. You can also consider more
specific measures of influence that assess how each coefficient is changed by including the
observation. Imagine that you compute the regression coefficients for the regression model with
a particular case excluded, then recompute the model with the case included, and you observe the
change in the regression coefficients due to including that case in the model. This measure is
called DFBETA and a DFBETA value can be computed for each observation for each predictor.
As shown below, we use the /save sdbeta(sdfb) subcommand to save the DFBETA values for
each of the predictors. This saves 4 variables into the current data file, sdfb1, sdfb2, sdfb3 and
sdfb4, corresponding to the DFBETA for the Intercept and for pctmetro, poverty and for
single, respectively. We could replace sdfb with anything we like, and the variables created
would start with the prefix that we provide.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid)
/save sdbeta(sdfb).

The /save sdbeta(sdfb) subcommand does not produce any new output, but we can see the
variables it created for the first 10 cases using the list command below. For example, by
including the case for "ak" in the regression analysis (as compared to excluding this case), the
coefficient for pctmetro would decrease by -.106 standard errors. Likewise, by including the
case for "ak" the coefficient for poverty decreases by -.131 standard errors, and the coefficient
for single increases by .145 standard errors (as compared to a model excluding "ak"). Since the
inclusion of an observation could either contribute to an increase or decrease in a regression
coefficient, DFBETAs can be either positive or negative. A DFBETA value in excess
of 2/sqrt(n) merits further investigation. In this example, we would be concerned about absolute
values in excess of 2/sqrt(51) or .28.

list
/variables state sdfb1 sdfb2 sdfb3
/cases from 1 to 10.
STATE     SDFB1      SDFB2      SDFB3

ak      -.10618    -.13134     .14518
al       .01243     .05529    -.02751
ar      -.06875     .17535    -.10526
az      -.09476    -.03088     .00124
ca       .01264     .00880    -.00364
co      -.03705     .19393    -.13846
ct      -.12016     .07446     .03017
de       .00558    -.01143     .00519
fl       .64175     .59593    -.56060
ga       .03171     .06426    -.09120

Number of cases read: 10 Number of cases listed: 10
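If you want a quick flag for cases that exceed this rule of thumb, a minimal sketch (assuming the four saved variables are named sdfb1 through sdfb4 as described above) is:

*Flag cases where any standardized DFBETA exceeds 2/sqrt(51) = .28 in absolute value.
compute dfbeta_flag = (abs(sdfb1) > .28 or abs(sdfb2) > .28 or
    abs(sdfb3) > .28 or abs(sdfb4) > .28).
execute.
temporary.
select if dfbeta_flag = 1.
list /variables state sdfb1 sdfb2 sdfb3 sdfb4.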

We can plot all three DFBETA values for the 3 coefficients against the state id in one graph
shown below to help us see potentially troublesome observations. We changed the variable labels for sdfb1, sdfb2 and sdfb3 so they would be shorter and more clearly labeled in the graph. We can see that the DFBETA for single for "dc" is about 3, indicating that by including
"dc" in the regression model, the coefficient for single is 3 standard errors larger than it would
have been if "dc" had been omitted. This is yet another bit of evidence that the observation for
"dc" is very problematic.

VARIABLE LABELS sdfb1 "Sdfbeta pctmetro"
 /sdfb2 "Sdfbeta poverty"
 /sdfb3 "Sdfbeta single" .

GRAPH
/SCATTERPLOT(OVERLAY)=sid sid sid WITH sdfb1 sdfb2 sdfb3 (PAIR) BY
state(name)
/MISSING=LISTWISE .

The following table summarizes the general rules of thumb we use for the measures we have
discussed for identifying observations worthy of further investigation (where k is the number of
predictors and n is the number of observations).

Measure        Cut-off value

leverage       > (2k+2)/n
abs(rstu)      > 2
Cook's D       > 4/n
abs(DFBETA)    > 2/sqrt(n)
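These cutoffs can also be checked in a single pass by saving the relevant diagnostic variables and flagging cases; a minimal sketch for the crime model follows (the rootnames in parentheses are our own choices, not SPSS defaults):

regression
 /dependent crime
 /method=enter pctmetro poverty single
 /save sdresid(sdres) lever(lev) cook(cookd).

*Flag cases exceeding any rule of thumb above (k = 3 predictors, n = 51 observations).
compute unusual = (abs(sdres) > 2 or lev > (2*3+2)/51 or cookd > 4/51).
execute.
temporary.
select if unusual = 1.
list /variables state sdres lev cookd.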

We have shown a few examples of the variables that you can refer to in the /residuals ,
/casewise, /scatterplot and /save sdbeta( ) subcommands. Here is a list of all of the variables
that can be used on these subcommands; however, not all variables can be used on each
subcommand.

PRED       Unstandardized predicted values.
RESID      Unstandardized residuals.
DRESID     Deleted residuals.
ADJPRED    Adjusted predicted values.
ZPRED      Standardized predicted values.
ZRESID     Standardized residuals.
SRESID     Studentized residuals.
SDRESID    Studentized deleted residuals.
SEPRED     Standard errors of the predicted values.
MAHAL      Mahalanobis distances.
COOK       Cook's distances.
LEVER      Centered leverage values.
DFBETA     Change in the regression coefficient that results from the deletion of the ith case.
           A DFBETA value is computed for each case for each regression coefficient generated
           by a model.
SDBETA     Standardized DFBETA. An SDBETA value is computed for each case for each
           regression coefficient generated by a model.
DFFIT      Change in the predicted value when the ith case is deleted.
SDFIT      Standardized DFFIT.
COVRATIO   Ratio of the determinant of the covariance matrix with the ith case deleted to the
           determinant of the covariance matrix with all cases included.
MCIN       Lower and upper bounds for the prediction interval of the mean predicted response.
           A lower bound LMCIN and an upper bound UMCIN are generated. The default
           confidence interval is 95%. The confidence interval can be reset with the CIN
           subcommand. (See Dillon & Goldstein.)
ICIN       Lower and upper bounds for the prediction interval for a single observation. A lower
           bound LICIN and an upper bound UICIN are generated. The default confidence
           interval is 95%. The confidence interval can be reset with the CIN subcommand.
           (See Dillon & Goldstein.)
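
As a further hedged illustration (assuming, as the list above indicates, that these keywords are
accepted on the /save subcommand), we could also save some of the other measures under
variable names of our own choosing; dffit1, sdfit1 and covr1 below are made-up names.

* Hedged example: save DFFIT, standardized DFFIT and COVRATIO for each case.
regression
/dependent crime
/method=enter pctmetro poverty single
/save dffit(dffit1) sdfit(sdfit1) covratio(covr1).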

In addition to the numerical measures we have shown above, there are also several graphs that
can be used to search for unusual and influential observations. The partial-regression plot is
very useful in identifying influential points. Below we add the /partialplot subcommand to
produce partial-regression plots for all of the predictors. In the third plot below, for example,
you can see the partial-regression plot of crime by single after both crime and single have been
adjusted for all the other predictors in the model. The line plotted has the same slope as the
coefficient for single. This plot shows how the observation for DC influences that coefficient:
you can see how the regression line is tugged upwards trying to fit through the extreme value of
DC. Alaska and West Virginia may also exert substantial leverage on the coefficient for single.
These plots are useful for seeing how a single point may be influencing the regression line,
while taking the other variables in the model into account.

Note that the regression line is not automatically produced in the graph. We double clicked on
the graph, chose "Chart", then "Options", and then chose "Fit Line Total" to add a regression
line to each of the graphs below.

regression
/dependent crime
/method=enter pctmetro poverty single
/residuals=histogram(sdresid lever) id(state) outliers(sdresid, lever, cook)
/casewise=plot(sdresid) outliers(2) cook dffit
/scatterplot(*lever, *sdresid)
/partialplot.

DC has appeared as an outlier as well as an influential point in every analysis. Since DC is not
actually a state, we can justify omitting it from the analysis by saying that we really wish to
analyze just the states. First, let's repeat our analysis including DC below.

regression
/dependent crime
/method=enter pctmetro poverty single.

<some output omitted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      -1666.436     147.852                           -11.271     .000
     PCTMETRO        7.829         1.255             .390            6.240       .000
     POVERTY         17.680        6.941             .184            2.547       .014
     SINGLE          132.408       15.503            .637            8.541       .000

a Dependent Variable: CRIME

Now, let's run the analysis omitting DC by using the filter command to exclude "dc" from the
analysis. As we expect, deleting DC makes a large change in the coefficient for single, which
drops from 132.4 to 89.4. After deleting DC, we would repeat the process we have illustrated in
this section to search for any other outlying and influential observations.

compute filtvar = (state NE "dc").


filter by filtvar.
regression
/dependent crime
/method=enter pctmetro poverty single .

<some output omitted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      -1197.538     180.487                           -6.635      .000
     PCTMETRO        7.712         1.109             .565            6.953       .000
     POVERTY         18.283        6.136             .265            2.980       .005
     SINGLE          89.401        17.836            .446            5.012       .000

a Dependent Variable: CRIME
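
A side note on the filter (a hedged aside, not part of the original analysis): filter by only hides
the filtered-out case from subsequent procedures, it does not delete it, so the full 51-case file
can be restored whenever needed.

* Turn the filter off again once the states-only analyses are done.
filter off.
execute.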

Summary

In this section, we explored a number of methods for identifying outliers and influential points.
In a typical analysis, you would probably use only some of these methods. Generally speaking,
there are two types of methods for assessing outliers: statistics such as residuals, leverage, and
Cook's D that assess the overall impact of an observation on the regression results, and statistics
such as DFBETA that assess the specific impact of an observation on the individual regression
coefficients. In our example, we found that DC was a point of major concern: we performed the
regression with it and without it, and the two regression equations were very different. We can
justify removing it from our analysis by reasoning that our model is meant to predict the crime
rate for states, not for metropolitan areas.

2.2 Tests for Normality of Residuals

One of the assumptions of linear regression analysis is that the residuals are normally distributed.
This assumption needs to hold for the p-values of the t-tests to be valid. Let's use the elemapi2
data file we saw in Chapter 1 for these analyses. Let's predict academic performance (api00)
from the percent of students receiving free meals (meals), the percent of English language
learners (ell), and the percent of teachers with emergency credentials (emer). We then use the
/save subcommand to save the residuals.

get file="c:\spssreg\elemapi2.sav".
regression
/dependent api00

41
/method=enter meals ell emer
/save resid(apires).
Variables Entered/Removed(b)

Model Variables Entered Variables Removed Method

1 EMER, ELL, MEALS(a) . Enter

a All requested variables entered.

b Dependent Variable: API00

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .914(a) .836 .835 57.820

a Predictors: (Constant), EMER, ELL, MEALS

b Dependent Variable: API00

ANOVA(b)

Model Sum of Squares df Mean Square F Sig.

Regression 6749782.747 3 2249927.582 672.995 .000(a)

1 Residual 1323889.251 396 3343.155

Total 8073671.997 399

a Predictors: (Constant), EMER, ELL, MEALS

b Dependent Variable: API00

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      886.703       6.260                             141.651     .000
     MEALS           -3.159        .150              -.709           -21.098     .000
     ELL             -.910         .185              -.159           -4.928      .000
     EMER            -1.573        .293              -.130           -5.368      .000

a Dependent Variable: API00

Casewise Diagnostics(a)

Case Number Std. Residual API00

93 3.087 604

226 -3.208 386

a Dependent Variable: API00

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value 425.52 884.88 647.62 130.064 400

Residual -185.47 178.48 .00 57.602 400

Std. Predicted Value -1.708 1.824 .000 1.000 400

Std. Residual -3.208 3.087 .000 .996 400

a Dependent Variable: API00

We now use the examine command to look at the normality of these residuals. All of the results
from the examine command suggest that the residuals are normally distributed -- the skewness
and kurtosis are near 0, the "tests of normality" are not significant, the histogram looks normal,
and the Q-Q plot looks normal. Based on these results, the residuals from this regression appear
to conform to the assumption of being normally distributed.

examine
variables=apires
/plot boxplot stemleaf histogram npplot.
Case Processing Summary

Cases

Valid Missing Total

N Percent N Percent N Percent

APIRES 400 100.0% 0 .0% 400 100.0%

Descriptives

                                              Statistic      Std. Error

APIRES   Mean                                 .0000000       2.88011205
         95% Confidence       Lower Bound     -5.6620909
         Interval for Mean    Upper Bound     5.6620909
         5% Trimmed Mean                      -.7827765
         Median                               -3.6572906
         Variance                             3318.018
         Std. Deviation                       57.60224104
         Minimum                              -185.47331
         Maximum                              178.48224
         Range                                363.95555
         Interquartile Range                  76.5523053
         Skewness                             .171           .122
         Kurtosis                             .135           .243

Tests of Normality

Kolmogorov-Smirnov(a) Shapiro-Wilk

Statistic df Sig. Statistic df Sig.

APIRES .033 400 .200(*) .996 400 .510

* This is a lower bound of the true significance.

a Lilliefors Significance Correction
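
As a quick supplementary check (hedged arithmetic based on the Descriptives table above):
dividing the skewness and kurtosis by their standard errors gives roughly .171/.122 = 1.4 and
.135/.243 = 0.6. Values within about plus or minus 2 are commonly taken as consistent with
normality, which agrees with the non-significant normality tests here.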

Unstandardized Residual Stem-and-Leaf Plot

Frequency Stem & Leaf

1.00 Extremes (=<-185)


2.00 -1 . 4
3.00 -1 . 2&
7.00 -1 . 000
15.00 -0 . 8888899
35.00 -0 . 66666666667777777
37.00 -0 . 444444444555555555
49.00 -0 . 222222222222223333333333
61.00 -0 . 000000000000000011111111111111
48.00 0 . 000000111111111111111111
49.00 0 . 222222222222233333333333
28.00 0 . 4444445555555
31.00 0 . 666666666677777
16.00 0 . 88888899
9.00 1 . 0011
3.00 1 . 2&
1.00 1. &
5.00 Extremes (>=152)

Stem width: 100.0000
Each leaf: 2 case(s)

& denotes fractional leaves.

2.3 Heteroscedasticity

Another assumption of ordinary least squares regression is that the variance of the residuals is
homogeneous across levels of the predicted values, also known as homoscedasticity. If the model
is well-fitted, there should be no pattern to the residuals plotted against the fitted values. If the
variance of the residuals is non-constant then the residual variance is said to be "heteroscedastic."
Below we illustrate graphical methods for detecting heteroscedasticity. A commonly used
approach is to plot the residuals against the fitted (predicted) values. Below we use the
/scatterplot subcommand to plot *zresid (standardized residuals) by *pred (the predicted values).
We see that the spread of the data points narrows somewhat towards the right end of the plot,
an indication of mild heteroscedasticity.

regression
/dependent api00
/method=enter meals ell emer
/scatterplot(*zresid *pred).

Let's run a model where we include just enroll as a predictor and show the residual vs. predicted
plot. As you can see, this plot shows serious heteroscedasticity. The variability of the residuals
when the predicted value is around 700 is much larger than when the predicted value is 600 or
when the predicted value is 500.

regression
/dependent api00
/method=enter enroll
/scatterplot(*zresid *pred).

As we saw in Chapter 1, the variable enroll was skewed considerably to the right, and we found
that by taking a log transformation, the transformed variable was more normally distributed.
Below we transform enroll, run the regression and show the residual versus fitted plot. The
distribution of the residuals is much improved. Certainly, this is not a perfect distribution of
residuals, but it is much better than the distribution with the untransformed variable.

compute lenroll = ln(enroll).


regression
/dependent api00
/method=enter lenroll
/scatterplot(*zresid *pred).
Variables Entered/Removed(b)

Model Variables Entered Variables Removed Method

1 LENROLL(a) . Enter

a All requested variables entered.

b Dependent Variable: API00

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .275(a) .075 .073 136.946

a Predictors: (Constant), LENROLL

b Dependent Variable: API00

ANOVA(b)

Model Sum of Squares df Mean Square F Sig.

Regression 609460.408 1 609460.408 32.497 .000(a)

1 Residual 7464211.589 398 18754.300

Total 8073671.997 399

a Predictors: (Constant), LENROLL

b Dependent Variable: API00

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      1170.429      91.966                            12.727      .000
     LENROLL         -86.000       15.086            -.275           -5.701      .000

a Dependent Variable: API00

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value 537.57 751.82 647.62 39.083 400

Residual -288.65 295.47 .00 136.775 400

Std. Predicted Value -2.816 2.666 .000 1.000 400

Std. Residual -2.108 2.158 .000 .999 400

a Dependent Variable: API00

Finally, let's revisit the model we used at the start of this section, predicting api00 from meals,
ell and emer. Using this model, the distribution of the residuals looked very nice and even
across the fitted values. What if we add enroll to this model? Will this automatically ruin the
distribution of the residuals? Let's add it and see.

regression
/dependent api00
/method=enter meals ell emer enroll
/scatterplot(*zresid *pred).
Variables Entered/Removed(b)

Model Variables Entered Variables Removed Method

1 ENROLL, MEALS, EMER, ELL(a) . Enter

a All requested variables entered.

b Dependent Variable: API00

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .915(a) .838 .836 57.552

a Predictors: (Constant), ENROLL, MEALS, EMER, ELL

b Dependent Variable: API00

ANOVA(b)

Model Sum of Squares df Mean Square F Sig.

Regression 6765344.050 4 1691336.012 510.635 .000(a)

1 Residual 1308327.948 395 3312.223

Total 8073671.997 399

a Predictors: (Constant), ENROLL, MEALS, EMER, ELL

b Dependent Variable: API00

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      899.147       8.472                             106.128     .000
     MEALS           -3.222        .152              -.723           -21.223     .000
     ELL             -.768         .195              -.134           -3.934      .000
     EMER            -1.418        .300              -.117           -4.721      .000
     ENROLL          -3.126E-02    .014              -.050           -2.168      .031

a Dependent Variable: API00

Casewise Diagnostics(a)

Case Number Std. Residual API00

93 3.004 604

226 -3.311 386

a Dependent Variable: API00

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value 430.82 888.08 647.62 130.214 400

Residual -190.56 172.86 .00 57.263 400

Std. Predicted Value -1.665 1.847 .000 1.000 400

Std. Residual -3.311 3.004 .000 .995 400

a Dependent Variable: API00

As you can see, the distribution of the residuals looks fine, even after we added the variable
enroll. When we had just the variable enroll in the model, we did a log transformation to
improve the distribution of the residuals, but when enroll was part of a model with other
variables, the residuals looked good so no transformation was needed. This illustrates how the
distribution of the residuals, not the distribution of the predictor, was the guiding factor in
determining whether a transformation was needed.

2.4 Collinearity

When there is a perfect linear relationship among the predictors, the estimates for a regression
model cannot be uniquely computed. The term collinearity implies that two variables are near
perfect linear combinations of one another. When more than two variables are involved it is often
called multicollinearity, although the two terms are often used interchangeably.

The primary concern is that as the degree of multicollinearity increases, the regression model
estimates of the coefficients become unstable and the standard errors for the coefficients can get
wildly inflated. In this section, we will explore some SPSS commands that help to detect
multicollinearity.

We can use the /statistics=defaults tol subcommand to request "tolerance" and "VIF" values for
each predictor as a check for multicollinearity. The tolerance is the proportion of a predictor's
variance that is not accounted for by the other predictors, so very small values indicate that a
predictor is largely redundant; values less than .10 may merit further investigation. The VIF,
which stands for variance inflation factor, is 1 / tolerance, and as a rule of thumb a variable
whose VIF value is greater than 10 may merit further investigation. Let's first look at the
regression we did in the last section, the model predicting api00 from meals, ell and emer, using
the /statistics=defaults tol subcommand. As you can see, the tolerance and VIF values are all
quite acceptable.

regression
/statistics=defaults tol
/dependent api00
/method=enter meals ell emer .
<some output deleted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized                               Collinearity Statistics
Model                B             Std. Error        Beta            t           Sig.           Tolerance     VIF
1    (Constant)      886.703       6.260                             141.651     .000
     MEALS           -3.159        .150              -.709           -21.098     .000           .367          2.725
     ELL             -.910         .185              -.159           -4.928      .000           .398          2.511
     EMER            -1.573        .293              -.130           -5.368      .000           .707          1.415

a Dependent Variable: API00
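
As a quick check on the VIF = 1 / tolerance relationship (hedged arithmetic using the MEALS
row above): 1/.367 is about 2.72, which agrees with the reported VIF of 2.725 up to rounding.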

Now let's consider another example where the "tolerance" and "VIF" values are more worrisome.
In the regression analysis below, we use acs_k3 avg_ed grad_sch col_grad and some_col as
predictors of api00. As you see, the "tolerance" values for avg_ed grad_sch and col_grad are
below .10, and avg_ed is about 0.02, indicating that only about 2% of the variance in avg_ed is
not predictable given the other predictors in the model. All of these variables measure education
of the parents and the very low "tolerance" values indicate that these variables contain redundant
information. For example, after you know grad_sch and col_grad, you probably can predict
avg_ed very well. In this example, multicollinearity arises because we have put in too many
variables that measure the same thing, parent education.

We also include the collin option, which produces the "Collinearity Diagnostics" table below.
The very low eigenvalues for the last dimensions (with 5 predictors plus the constant there are
6 dimensions) are another indication of problems with multicollinearity. Likewise, the very high
"Condition Index" values for those dimensions similarly indicate problems with
multicollinearity among these predictors.

regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 avg_ed grad_sch col_grad some_col.
<some output deleted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized                               Collinearity Statistics
Model                B             Std. Error        Beta            t           Sig.           Tolerance     VIF
1    (Constant)      -82.609       81.846                            -1.009      .313
     ACS_K3          11.457        3.275              .107           3.498       .001           .972          1.029
     AVG_ED          227.264       37.220             1.220          6.106       .000           .023          43.570
     GRAD_SCH        -2.091        1.352              -.180          -1.546      .123           .067          14.865
     COL_GRAD        -2.968        1.018              -.339          -2.916      .004           .068          14.779
     SOME_COL        -.760         .811               -.057          -.938       .349           .246          4.065

a Dependent Variable: API00

Collinearity Diagnostics(a)

                                     Condition    Variance Proportions
Model   Dimension    Eigenvalue      Index        (Constant)   ACS_K3   AVG_ED   GRAD_SCH   COL_GRAD   SOME_COL
1       1            5.013           1.000        .00          .00      .00      .00        .00        .00
        2            .589            2.918        .00          .00      .00      .05        .00        .01
        3            .253            4.455        .00          .00      .00      .03        .07        .02
        4            .142            5.940        .00          .01      .00      .00        .00        .23
        5            .0028           42.036       .22          .86      .14      .10        .15        .09
        6            .00115          65.887       .77          .13      .86      .81        .77        .66

a Dependent Variable: API00
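
For reference, the condition index for a dimension is the square root of the largest eigenvalue
divided by that dimension's eigenvalue. As a hedged arithmetic check using the values above,
sqrt(5.013/.0028) is about 42.3, approximately the condition index of 42.036 reported for
dimension 5 (the small difference is due to rounding of the eigenvalue).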

Let's omit one of the parent education variables, avg_ed. Note that the VIF values in the
analysis below appear much better. Also, note how the standard errors are reduced for the parent
education variables, grad_sch and col_grad. This is because the high degree of collinearity
caused the standard errors to be inflated. With the multicollinearity eliminated, the coefficient
for grad_sch, which had been non-significant, is now significant.

regression
/statistics=defaults tol collin
/dependent api00
/method=enter acs_k3 grad_sch col_grad some_col.
<some output omitted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized                               Collinearity Statistics
Model                B             Std. Error        Beta            t           Sig.           Tolerance     VIF
1    (Constant)      283.745       70.325                            4.035       .000
     ACS_K3          11.713        3.665              .113           3.196       .002           .977          1.024
     GRAD_SCH        5.635         .458               .482           12.298      .000           .792          1.262
     COL_GRAD        2.480         .340               .288           7.303       .000           .783          1.278
     SOME_COL        2.158         .444               .173           4.862       .000           .967          1.034

a Dependent Variable: API00

Collinearity Diagnostics(a)

                                     Condition    Variance Proportions
Model   Dimension    Eigenvalue      Index        (Constant)   ACS_K3   GRAD_SCH   COL_GRAD   SOME_COL
1       1            3.970           1.000        .00          .00      .02        .02        .01
        2            .599            2.575        .00          .00      .60        .03        .04
        3            .255            3.945        .00          .00      .37        .94        .03
        4            .174            4.778        .00          .00      .00        .00        .92
        5            .00249          39.925       .99          .99      .01        .01        .00

a Dependent Variable: API00

2.5 Tests on Nonlinearity

When we do linear regression, we assume that the relationship between the response variable and
the predictors is linear. If this assumption is violated, the linear regression will try to fit a straight
line to data that do not follow a straight line. Checking the linearity assumption in the case of
simple regression is straightforward, since we only have one predictor. All we have to do is make
a scatterplot of the response variable against the predictor to see whether nonlinearity is present,
such as a curved band or a big wave-shaped pattern. For example, let us use a data file called
nations.sav that has data about a number of nations around the world. Let's look at the
relationship between GNP per capita (gnpcap) and births (birth). If we look at the scatterplot
between gnpcap and birth below, we can see that the relationship between these two variables is
quite non-linear. We added a regression line to the chart by double clicking on it and choosing
"Chart", then "Options", then "Fit Line Total", and you can see how poorly the line fits these
data. Also, if we look at the plot of residuals by predicted values, we see that the residuals are
not homoscedastic, due to the non-linearity in the relationship between gnpcap and birth.

get file = "c:\sppsreg\nations.sav".

regression
/dependent birth
/method=enter gnpcap
/scatterplot(*zresid *pred)
/scat(birth gnpcap) .
Variables Entered/Removed(b)

Model Variables Entered Variables Removed Method

1 GNPCAP(a) . Enter

a All requested variables entered.

b Dependent Variable: BIRTH

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .626(a) .392 .387 10.679

a Predictors: (Constant), GNPCAP

b Dependent Variable: BIRTH

ANOVA(b)

Model Sum of Squares df Mean Square F Sig.

Regression 7873.995 1 7873.995 69.047 .000(a)

1 Residual 12202.152 107 114.039

Total 20076.147 108

a Predictors: (Constant), GNPCAP

b Dependent Variable: BIRTH

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      38.924        1.261                             30.856      .000
     GNPCAP          -1.921E-03    .000              -.626           -8.309      .000

a Dependent Variable: BIRTH

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value 1.90 38.71 32.79 8.539 109

Residual -23.18 28.10 .00 10.629 109

Std. Predicted Value -3.618 .694 .000 1.000 109

Std. Residual -2.170 2.632 .000 .995 109

a Dependent Variable: BIRTH

We modified the above scatterplot, changing the fit line from linear regression to "lowess", by
choosing "Chart", then "Options", then "Fit Options", and selecting "Lowess" with the default
smoothing parameters. As you can see, the lowess smoothed curve fits substantially better than
the linear regression line, further suggesting that the relationship between
gnpcap and birth is not linear.
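
If you prefer to produce the basic scatterplot from syntax rather than through the menus (the
lowess line itself would still be added in the chart editor as described above), a hedged sketch is:

* Scatterplot of birth against gnpcap; add the lowess fit line via the chart editor.
GRAPH
/SCATTERPLOT(BIVAR)=gnpcap WITH birth
/MISSING=LISTWISE.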

We can see that the gnpcap scores are quite skewed, with most values near 0 and a handful of
values of 10,000 and higher. This suggests that some transformation of the variable may be
necessary. One commonly used transformation is a log transformation, so let's try that. As you
see, the scatterplot between the log-transformed variable (lgnpcap) and birth looks much better,
with the regression line going through the heart of the data. Also, the plot of the residuals by
predicted values looks much more reasonable.

compute lgnpcap = ln(gnpcap).


regression
/dependent birth
/method=enter lgnpcap
/scatterplot(*zresid *pred) /scat(birth lgnpcap)
/save resid(bres2).

Variables Entered/Removed(b)

Model Variables Entered Variables Removed Method

1 LGNPCAP(a) . Enter

a All requested variables entered.

b Dependent Variable: BIRTH

Model Summary(b)

Model R R Square Adjusted R Square Std. Error of the Estimate

1 .756(a) .571 .567 8.969

a Predictors: (Constant), LGNPCAP

b Dependent Variable: BIRTH

ANOVA(b)

Model Sum of Squares df Mean Square F Sig.

Regression 11469.248 1 11469.248 142.584 .000(a)

1 Residual 8606.899 107 80.438

Total 20076.147 108

a Predictors: (Constant), LGNPCAP

b Dependent Variable: BIRTH

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      84.277        4.397                             19.168      .000
     LGNPCAP         -7.238        .606              -.756           -11.941     .000

a Dependent Variable: BIRTH

Residuals Statistics(a)

Minimum Maximum Mean Std. Deviation N

Predicted Value 12.86 50.25 32.79 10.305 109

Residual -24.75 24.98 .00 8.927 109

Std. Predicted Value -1.934 1.695 .000 1.000 109

Std. Residual -2.760 2.786 .000 .995 109

a Dependent Variable: BIRTH

This section has shown how you can use scatterplots to diagnose problems of non-linearity, both
by looking at scatterplots of the predictor and outcome variable and by examining the residuals
by predicted values. These examples have focused on simple regression; similar techniques are
useful in multiple regression, although there it is more informative to examine partial-regression
plots rather than simple scatterplots between each predictor and the outcome variable.

2.6 Model Specification

A model specification error can occur when one or more relevant variables are omitted from the
model or one or more irrelevant variables are included in the model. If relevant variables are
omitted from the model, the common variance they share with included variables may be
wrongly attributed to those variables, and the error term can be inflated. On the other hand, if
irrelevant variables are included in the model, the common variance they share with included
variables may be wrongly attributed to them. Model specification errors can substantially affect
the estimate of regression coefficients.

Consider the model below. This regression suggests that as class size increases the academic
performance increases, with p=0.053. Before we publish results saying that increased class size
is associated with higher academic performance, let's check the model specification.

regression
/dependent api00
/method=enter acs_k3 full
/save pred(apipred).

<some output deleted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      32.213        84.075                            .383        .702
     ACS_K3          8.356         4.303              .080           1.942       .053
     FULL            5.390         .396               .564           13.598      .000

a Dependent Variable: API00

SPSS does not have a tool that directly tests for specification errors, but you can check for
omitted variables using the procedure below. As you noticed above, when we ran the regression
we saved the predicted value, calling it apipred. If we use the predicted value and the predicted
value squared as predictors of the dependent variable, apipred should be significant since it is
the predicted value, but apipred squared should not be: if our model is specified correctly, the
squared predictions should have little explanatory power above and beyond the predicted value
itself. Below we compute apipred2 as the square of apipred, include apipred and apipred2 as
predictors in our regression model, and hope to find that apipred2 is not significant.

compute apipred2 = apipred**2.


regression
/dependent api00
/method=enter apipred apipred2.

<some output omitted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      858.873       283.460                           3.030       .003
     APIPRED         -1.869        .937              -1.088          -1.994      .047
     APIPRED2        2.344E-03     .001              1.674           3.070       .002

a Dependent Variable: API00

The above results show that apipred2 is significant, suggesting that we may have omitted
important variables in our regression. We therefore should consider whether we should add any
other variables to our model. Let's try adding the variable meals to the above model. We see that
meals is a significant predictor, and we save the predicted value calling it preda for inclusion in
the next analysis for testing to see whether we have any additional important omitted variables.

regression
/dependent api00
/method=enter acs_k3 full meals
/save pred(preda).
<some output omitted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      771.658       48.861                            15.793      .000
     ACS_K3          -.717         2.239              -.007          -.320       .749
     FULL            1.327         .239               .139           5.556       .000
     MEALS           -3.686        .112               -.828          -32.978     .000

a Dependent Variable: API00

We now create preda2 which is the square of preda, and include both of these as predictors in
our model.
compute preda2 = preda**2.
regression
/dependent api00
/method=enter preda preda2.

<some output omitted to save space>

Coefficients(a)

                     Unstandardized Coefficients     Standardized
Model                B             Std. Error        Beta            t           Sig.
1    (Constant)      -136.510      95.059                            -1.436      .152
     PREDA           1.424         .293               1.293          4.869       .000
     PREDA2          -3.172E-04    .000               -.386          -1.455      .146

a Dependent Variable: API00

We now see that preda2 is not significant, so this test does not suggest there are any other
important omitted variables. Note that after including meals and full, the coefficient for class
size is no longer significant. While acs_k3 has a positive relationship with api00 when only full
is included in the model, once we also include (and hence control for) meals, acs_k3 is no longer
significantly related to api00 and its relationship with api00 is no longer positive.

2.7 Issues of Independence

The statement of this assumption is that the errors associated with one observation are not
correlated with the errors of any other observation. Violation of this assumption can occur in a
variety of situations. Consider the case of collecting data from students in eight different
elementary schools. It is likely that the students within each school will tend to be more like one
another than students from different schools; that is, their errors are not independent.

Another way in which the assumption of independence can be broken is when data are collected
on the same variables over time. Let's say that we collect truancy data every semester for 12
years. In this situation it is likely that the errors for observations between adjacent semesters will
be more highly correlated than for observations more separated in time -- this is known as
autocorrelation. When you have data that can be considered to be time-series you can use the
Durbin-Watson statistic to test for correlated residuals.

We don't have any time-series data, so we will use the elemapi2 dataset and pretend that snum
indicates the time at which the data were collected. We will sort the data on snum to order the
data according to our fake time variable and then we can run the regression analysis with the
durbin option to request the Durbin-Watson test. The Durbin-Watson statistic ranges from 0 to 4,
with values near 2 indicating little or no autocorrelation among adjacent residuals. The observed
value in our example, 1.351, is below 2, but since our data are not truly time-series we would
not make much of this.

sort cases by snum .


regression
/dependent api00
/method=enter enroll
/residuals = durbin .

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .318   .101       .099                135.026                      1.351

a Predictors: (Constant), ENROLL

b Dependent Variable: API00

2.8 Summary

This chapter has covered a variety of topics in assessing the assumptions of regression using
SPSS, and the consequences of violating these assumptions. As we have seen, it is not sufficient
to simply run a regression analysis; it is important to verify that the assumptions have been met.
If this verification stage is omitted and your data do not meet the assumptions of linear
regression, your results could be misleading and your interpretation of them could be in doubt.
If you do not thoroughly check your data for problems, another researcher could analyze your
data, uncover such problems, and present an improved analysis that contradicts your results and
undermines your conclusions.
