
http://eweb.furman.edu/~lpace/SPSS_Tutorials/
http://people.stern.nyu.edu/wgreene/Econometrics/Notes.htm

Lesson 1: SPSS Windows and Files


Objectives
1. Launch SPSS for Windows.
2. Examine SPSS windows and file types.

Overview
In a typical SPSS session, you are likely to work with two or more SPSS windows and to save the contents of one or more windows to separate files. The window containing your data is the SPSS Data Editor. If you plan to use the data file again, you may click on File, Save from within the Data Editor and give the file a descriptive name. SPSS will supply the .sav extension, indicating that the saved information is in the form of a data file. An SPSS data file includes both the data records and their structure. The window containing the results of the SPSS procedures you have performed is the SPSS Viewer. You may find it convenient to save this as an output file. It is okay to use the same name you used for your data because SPSS will supply the .spo extension to indicate that the saved file is an output file. As you run various procedures, you may also choose to show the SPSS syntax for these commands in a syntax window, and save the syntax in a separate .sps file. It is possible to run SPSS commands directly from syntax, though in this series of tutorials we will focus our attention on SPSS data and output files and use the point-and-click method to enter the necessary commands.

Launching SPSS
SPSS for Windows is launched from the Windows desktop. There are several ways to access the program, and the one you use will be based on the way your particular computer is configured. There may be an SPSS for Windows shortcut on the desktop or in your Start menu. Or you may have to click Start, All Programs to find the SPSS for Windows folder. In that folder, you will find the SPSS for Windows program icon. Once you have located it, click on the SPSS for Windows icon with the left mouse button to launch SPSS. When you start the program, you will be given a blank dataset and a set of options for running the SPSS tutorial, typing in data, running an existing query, creating a new query, or opening an existing data source (see Figure 1-1). For now, just click on Cancel to reveal the blank dataset in the Data Editor screen.

Figure 1-1 SPSS opening screen

The SPSS Data Editor


Examine the SPSS Data Editor's Data View shown in Figure 1-2 below. You will learn in Lesson 2 how to create an effective data structure within the Variable View and how to enter and manipulate data using the Data Editor. As indicated above, if you click File, Save while in the Data Editor view, you can save the data along with their structure as a separate file with the .sav extension. The Data Editor provides the Data View as shown below, and also a separate Variable View. You can switch between these views by clicking on the tabs at the bottom of the worksheet-like interface.

Figure 1-2 SPSS Data Editor (Data View)

The SPSS Viewer


The SPSS Viewer is opened automatically to show the output when you run SPSS commands. Assume for example that you wanted to find the average age of 20 students in a class. We will examine the commands needed to calculate descriptive statistics in Lesson 3, but for now, simply examine the SPSS Viewer window (see Figure 1-3). When you click File, Save in this view, you can save the output to a file with the .spo extension.

Figure 1-3 SPSS Viewer

Syntax Editor Window


Finally, you can view and save SPSS syntax commands from the Syntax Editor window. When you are selecting commands, you will see a Paste button. Clicking that button pastes the syntax for the commands you have chosen into the Syntax Editor. For example, the syntax to calculate the mean age shown above is shown in Figure 1-4:

Figure 1-4 SPSS Syntax Editor
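
As a preview of what such syntax looks like, a descriptive-statistics command for the age variable might read as follows. This is only a sketch; the exact command and subcommands that SPSS pastes depend on the procedure and options you chose.

  DESCRIPTIVES VARIABLES=Age
    /STATISTICS=MEAN STDDEV.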

Though we will not address SPSS syntax except in passing in these tutorials, you should note that you can run commands directly from the Syntax Editor and save your syntax (.sps) files for future reference. Unlike earlier versions of SPSS, version 15 (the version illustrated in these tutorials) automatically presents in the SPSS Viewer the syntax version of the commands you give when you point and click in the Data Editor or the SPSS Viewer (examine Figure 1-3 for an example). Now that you know the kinds of windows and files involved in an SPSS session, you are ready to learn how to enter, structure, and manipulate data. Those are the subjects of Lesson 2.

Lesson 2: Entering and Working with Data


Objectives
1. Create a data file and data structure.
2. Compute a new variable.
3. Select cases.
4. Sort cases.
5. Split a file.

Overview
Data can be entered directly into the SPSS Data Editor or imported from a variety of file types. It is always important to check data entries carefully and ensure that the data are accurate. In this lesson you will learn how to build an SPSS data file from scratch, how to calculate a new variable, how to select and sort cases, and how to split a file into separate layers.

Creating a Data File

A common first step in working with SPSS is to create or open a data file. We will assume in this lesson that you will type data directly into the SPSS Data Editor to create a new data file. You should realize that you can also read data from many other programs, or copy and paste data from worksheets and tables to create new data files. Launch SPSS. You will be given various options, as we discussed in Lesson 1. Select Type in Data or Cancel. You should now see a screen similar to the following, which is a blank dataset in the Data View of the SPSS Data Editor (see Figure 2-1):

Figure 2-1 SPSS Data Editor - Data View

Key Point: One Row Per Participant, One Column per Variable
It is important to note that each row in the SPSS data table should be assigned to a single participant, subject, or case, and that no case's data should appear on different rows. When there are multiple measures for a case, each measure should appear in a separate column (called a "variable" by SPSS). If you use a coding variable to indicate which group or condition was assigned to a case, that variable should also appear in a separate column. So if you were looking at the scores for five quizzes for each of 20 students, the data for each student would occupy a single row (line) in the data table, and the score for each quiz would occupy a separate column. Although SPSS automatically numbers the rows of the data table, it is a very good habit to provide a separate participant (or subject) number column so that records can be easily sorted, filtered, or selected. Best practice also requires setting up the data structure for the data. For this purpose, we will switch to the Variable View of the Data Editor by clicking on the Variable View tab at the bottom of the Data Editor window. See Figure 2-2.

Figure 2-2 SPSS Data Editor - Variable View

Example Data
Let us establish the data structure for our example of five quizzes and 20 students. We will assume that we also know the age and the sex of each student. Although we could enter "F" for female and "M" for male, most statistical procedures are easier to perform if a number is used to code such categorical variables. Let us assign the number "1" to females and the number "0" to males. The hypothetical data are shown below:
Student  Sex  Age  Quiz1  Quiz2  Quiz3  Quiz4  Quiz5
1        0    18   83     87     81     80     69
2        0    19   76     89     61     85     75
3        0    17   85     86     65     64     81
4        0    20   92     73     76     88     64
5        1    23   82     75     96     87     78
6        1    18   88     73     76     91     81
7        0    21   89     71     61     70     75
8        1    20   89     70     87     76     88
9        1    23   92     85     95     89     62
10       1    21   86     83     77     64     63
11       1    23   90     71     91     86     87
12       0    18   84     71     67     62     70
13       0    21   83     80     89     60     60
14       0    17   79     77     82     63     74
15       0    19   89     80     64     94     78
16       1    20   76     85     65     92     82
17       1    19   92     76     76     74     91
18       1    22   75     90     78     70     76
19       1    22   87     87     63     73     64
20       0    20   75     74     63     91     87

Specifying the Data Structure


Switch to the Variable View by clicking on the Variable View tab (see Figure 2-2 above). The numbers at the left of the window now refer to variables rather than participants. Note that you can specify the variable Name, the Type of variable, the variable Width (in total characters or digits), the number of Decimals, a descriptive Label, labels for different Values, how to deal with Missing Values, the display Column width, how to Align the variable in the display, and whether the Measure is nominal, ordinal, or scale (interval and ratio). In many cases you can simply accept the defaults by leaving the entries blank. But you will definitely want to enter a variable Name and Label, and also specify Value labels for the levels of categorical or grouping variables such as sex or the levels of an independent variable. The variable names should be short and should not contain spaces or special characters other than perhaps underscores. Variable labels, on the other hand, can be longer and can contain spaces and special characters. Let us specify the structure of our dataset by naming the variables as follows. We will also provide information concerning the width, number of decimals, and type of measure, along with a descriptive label:
1. Student
2. Sex
3. Age
4. Quiz1
5. Quiz2
6. Quiz3
7. Quiz4
8. Quiz5

No decimals appear in our raw data, so we will set the number of decimals to zero. After we enter the desired information, the completed data structure might appear as follows:

Figure 2-3 SPSS data structure (Variable View)

Notice that we provided value labels for Sex, so we won't confuse our 1's and 0's later. To do this, click on Values in the Sex variable row and enter the appropriate labels for males and females (see Figure 2-4).

Figure 2-4 Adding value labels

After entering the value and label for one sex, click on Add and then repeat the process for the other sex. Click on Add after entering this information and then click OK.

Entering the Data


Now return to the data view (click on the Data View tab), and type in the data. If you prefer, you may retrieve a copy of the data file by clicking here. Save the data file with a name that will help you remember it. In this case, we used lesson_2.sav as the file name. Remember that SPSS will provide the .sav extension for a data file. The data should appear as follows:

Figure 2-5 Completed data entry

Computing a New Variable


Now we will compute a new variable by averaging the five quiz scores for each student. When we compute this new variable, it will be added to our variable list, and a new column will be created for it. Let us call the new variable Quiz_Avg and use SPSS's built-in function called MEAN to compute it. Select Transform, then Compute. The Compute Variable dialog box appears. You may type in the new variable name, specify the type and provide a label, and enter the formula for computing the new variable. In this case, we will use the formula:

Quiz_Avg = MEAN(Quiz1, Quiz2, Quiz3, Quiz4, Quiz5)

You can enter the formula by selecting MEAN from the Functions window and then clicking on the variable names, or you can simply type in the formula, separating the variable names by commas. The initial Compute Variable dialog box with the target variable named Quiz_Avg and the MEAN function selected is below. The question marks indicate that you must supply expressions for the computation.

Figure 2-6 Compute Variable screen

The appropriate formula is as follows:

Figure 2-7 Completed expression

When you click OK, the new variable appears in both the data and variable views (see below). As discussed earlier, you can change the number of decimals (numerical variables default to two decimals) and add a descriptive label for the new variable.

Figure 2-8 New variable appears in Data View

Figure 2-9 New variable appears in Variable View
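
The same computation can be expressed in SPSS syntax. A minimal sketch, assuming the variable names used above:

  COMPUTE Quiz_Avg = MEAN(Quiz1, Quiz2, Quiz3, Quiz4, Quiz5).
  EXECUTE.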

Selecting Cases
You may want to select only certain cases, such as the data for females or for individuals with ages lower than 20 years. SPSS allows you to select cases either by filtering (which keeps all the cases but limits further analyses to the selected cases) or by removing the cases that do not meet your criteria. Usually, you will want to filter cases, but sometimes, you may want to create separate files for additional analyses by deleting records that do not match your selection criteria. We will select records for females and filter those records so that the records for males remain but will be excluded from analyses until we select them again. From either the variable view or the data view, click on Data, then click on Select Cases. The resulting dialog box allows you to select the desired cases for further analysis, or to re-select all cases if data were previously filtered. Let us choose "If condition is satisfied," and specify that we want to select only records for which the sex of the participant is female. See the dialog box in the following figure.

Figure 2-10 Select Cases dialog

Click the "If..." button and enter the condition for selection. In this case we will enter the expression Sex = 1. You can type this in directly, or you can point and click to the entries in the dialog box

Figure 2-11 Select Cases expression

Click Continue, then Click OK, and then examine the data view (see Figure 2-12). Records for males will now have a diagonal line through the row number label, indicating that though still present, these records are excluded from further analyses.

Figure 2-12 Selected and filtered data

Also notice that a new variable called Filter_$ has been automatically added to your data file. If you return to the Data menu and select all the cases again, you can use this filter variable to select females instead of having to re-enter the selection formula. If you do not want to keep this new variable, you can right-click on its column label and select Clear.

Figure 2-13 Filter variable added by SPSS
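
For reference, the syntax that SPSS pastes for this kind of filtered selection looks roughly like the following. This is an abbreviated sketch; the pasted version also adds a label and display format for filter_$.

  USE ALL.
  COMPUTE filter_$ = (Sex = 1).
  FILTER BY filter_$.
  EXECUTE.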

Sorting Cases
Next you will learn to sort cases. Let's return to the Data, Select Cases menu and choose "Select all cases" in order to restore the records for males. We can sort on one or more variables. For example, we may want to sort the records in our dataset by age and sex. Select Data, Sort Cases:

Figure 2-14 Sort Cases option

Move Sex and Age to the "Sort by" window (see Figure 2-15) and then click OK.

Figure 2-15 Sort Cases dialog

Return to the Data View and confirm that the data are sorted by sex and by age within sex (see Figure 2-16).

Figure 2-16 Cases sorted by Sex and Age
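
In syntax form, the same sort is a single command (a sketch, using ascending order for both variables):

  SORT CASES BY Sex (A) Age (A).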

Splitting a File
The last subject we will cover in this tutorial is splitting a file. Instead of filtering cases, splitting a file creates separate "layers" for the grouping variables. For example, instead of selecting only one sex at a time, you may want to run several analyses separately for males and females. One convenient way to accomplish that is to split the file so that every procedure you run will be automatically conducted and reported for the two groups separately. To split a file, select Data, Split File. The data in a group need to be consecutive cases in the dataset, so the records must be sorted by groups. However, if your data are not already sorted, SPSS can do that for you at the same time the file is split (see Figure 2-17).

Figure 2-17 Split File menu
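
A syntax sketch of the split-file operation, including the sort and the command that turns the split off again when you are finished:

  SORT CASES BY Sex.
  SPLIT FILE LAYERED BY Sex.
  * Run the analyses of interest here.
  SPLIT FILE OFF.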

Now, when you run a command, such as a table command to summarize average quiz scores, the command will be performed for each group separately and those results will be reported in the same output (see Figure 2-18).

Figure 2-18 Split file results in separate analysis for each group

Lesson 3: Descriptive Statistics and Graphs


Objectives

1. Compute descriptive statistics.
2. Compare means for different groups.
3. Display frequency distributions and histograms.
4. Display boxplots.

Overview
In this lesson, you will learn how to produce various descriptive statistics, simple frequency distribution tables, and frequency histograms. You will also learn how to explore your data and create boxplots.

Example
Let us return to our example of 20 students and five quizzes. We would like to calculate the average score (mean) and standard deviation for each quiz. We will also look at the mean scores for men and women on each quiz. Open the SPSS data file you saved in Lesson 2, or click here for lesson_3.sav. Remember that we previously calculated the average quiz score for each person and included that as a new variable in our data file. To calculate the means and standard deviations for age, all quizzes, and the average quiz score, select Analyze, then Descriptive Statistics, and then Descriptives as shown in the following screenshot (see Figure 3-1).

Figure 3-1 Accessing the Descriptives Procedure

Move the desired variables into the variables window (see Figure 3-2) and then click OK.

Figure 3-2 Move the desired variables into the variables window.

In the resulting dialog box, make sure you check (at a minimum) the boxes in front of Mean and Std. deviation:

Figure 3-3 Descriptives options

The resulting output table showing the means and standard deviations of the variables is opened in the SPSS Viewer (see Figure 3-4).

Figure 3-4 Output from Descriptives Procedure
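
The equivalent pasted syntax would be roughly as follows (a sketch, assuming the variable names from Lesson 2):

  DESCRIPTIVES VARIABLES=Age Quiz1 Quiz2 Quiz3 Quiz4 Quiz5 Quiz_Avg
    /STATISTICS=MEAN STDDEV.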

Exploring Means for Different Groups


When you have two or more groups, you may want to examine the means for each group as well as the overall mean. The SPSS Compare Means procedure provides this functionality and much more, including various hypothesis tests. Assume that you want to compare the means of men and women on age, the five quizzes, and the average quiz score. Select Analyze, Compare Means, Means (see Figure 3-5):

Figure 3-5 Selecting Means Procedure

In the resulting dialog box, move the variables you are interested in summarizing into the Dependent List. At this point, do not worry whether your variables are actual "dependent variables" or not. Move Sex to the Independent List (see Figure 3-6). Click on Options to see the many summary statistics available. In the current case, make sure that Mean, Number of Cases, and Standard Deviation are selected.

Figure 3-6 Means dialog box

When you click OK, the report table appears in the SPSS Viewer with the separate means for the two sexes along with the overall data, as shown in the following figure.

Figure 3-7 Report from Means procedure
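
In syntax, the Means procedure for this comparison would look something like the following sketch:

  MEANS TABLES=Age Quiz1 Quiz2 Quiz3 Quiz4 Quiz5 Quiz_Avg BY Sex
    /CELLS MEAN COUNT STDDEV.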

As this lesson makes clear, there are several ways to produce summary statistics such as means and standard deviations in SPSS. From Lesson 2 you may recall that splitting the file would allow you to calculate the descriptive statistics separately for males and females. The way to find the procedure that works best in a given situation is to try different ones, and always to explore the options presented in the SPSS menus and dialog boxes. The extensive SPSS help files and tutorials are also very useful.

Frequency Distributions and Histograms


SPSS provides several different ways to explore, summarize, and present data in graphic form. For many procedures, graphs and plots are available as output options. SPSS also has an extensive interactive chart gallery and a chart builder that can be accessed through the Graphs menu. We will look at only a few of these features, and the interested reader is encouraged to explore the many additional charting and graphing features of SPSS. One very useful feature of the Frequencies procedure in SPSS is that it can produce simple frequency tables and histograms. You may optionally choose to have the normal curve superimposed on the histogram for a visual check as to how the data are distributed. Let us examine the distribution of ages of our 20 hypothetical students. Select Analyze, Descriptive Statistics, Frequencies (see Figure 3-8).

Figure 3-8 Selecting Frequencies procedure

In the Frequencies dialog, move Age to the variables window, and then click on Charts. Select Histograms and check the box in front of With normal curve (see Figure 3-9).

Figure 3-9 Frequencies: Charts dialog

Click Continue and OK. In the resulting output, SPSS displays the simple frequency table for age and the frequency histogram with the normal curve (see Figures 3-10 and 3-11).

Figure 3-10 Simple frequency table

Figure 3-11 Frequency histogram with normal curve
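
The corresponding Frequencies syntax is short (a sketch):

  FREQUENCIES VARIABLES=Age
    /HISTOGRAM NORMAL
    /ORDER=ANALYSIS.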

Exploratory Data Analysis


In addition to the standard descriptive statistics and frequency distributions and graphs, SPSS also provides many graphical and semi-graphical techniques collectively referred to as exploratory data analysis (EDA). EDA is useful for describing the characteristics of a dataset, identifying outliers, and providing summary descriptions. Some of the most widely used EDA techniques are boxplots and stem-and-leaf displays. You can access these techniques through the commands found under Analyze, Descriptive Statistics, Explore. As with the Compare Means procedure, groups can be separated if desired. For example, a side-by-side boxplot comparing the average quiz grades of men and women is shown in Figure 3-12.

Figure 3-12 Boxplots
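
A sketch of the Explore syntax that produces side-by-side boxplots of the average quiz score by sex:

  EXAMINE VARIABLES=Quiz_Avg BY Sex
    /PLOT=BOXPLOT.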

Lesson 4: Independent-Samples t Test


Objectives

1. Conduct an independent-samples t test.
2. Interpret the output of the t test.

Overview
The independent-samples or between-groups t test is used to examine the effects of one independent variable on one dependent variable and is restricted to comparisons of two conditions or groups (two levels of the independent variable). In this lesson, we will describe how to analyze the results of a between-groups design. Lesson 5 covers the paired-samples or within-subjects t test. The reader should note that SPSS incorrectly labels this test a "T test" rather than a t test, but is inconsistent in that labeling, as some of the SPSS output also refers to t-test results. A between-groups design is one in which participants have been randomly assigned to the two levels of the independent variable. In this design, each participant is assigned to only one group, and consequently, the two groups are independent of one another. For example, suppose that you are interested in studying the effects of caffeine consumption on task performance. If you randomly assign some participants to the caffeine group and other participants to the no-caffeine group, then you are using a between-groups design. In a within-subjects design, by contrast, all participants would be tested once with caffeine and once without caffeine.

An Example: Parental Involvement Experiment


Assume that you studied the effects of parental involvement (independent variable) on students' grades (dependent variable). Half of the students in a third grade class were randomly assigned to the parental involvement group. The teacher contacted the parents of these children throughout the year and told them about the educational objectives of the class. Further, the teacher gave the parents specific methods for encouraging their children's educational activities. The other half of the students in the class were assigned to the no-parental involvement group. The scores on the first test were tabulated for all of the children, and these are presented below:
Student  Involve  Test1
1        1        78.6
2        1        64.9
3        1        100.0
4        1        83.7
5        1        94.0
6        1        78.2
7        1        76.9
8        1        82.0
9        0        81.0
10       0        69.5
11       0        73.8
12       0        66.7
13       0        54.8
14       0        69.3
15       0        73.5
16       0        79.4

Creating Your Data File: Key Point


When creating a data file for an independent-samples t test in SPSS, you must create a separate column for the grouping variable that shows to which condition or group a particular participant belongs. In this case, that is the parental involvement condition, so you should create a numeric code that allows SPSS to identify the parental involvement condition for that particular score. If this concept is difficult to grasp, you may want to revisit Lesson 2, in which a grouping variable is created for male and female students. So, your SPSS data file should contain three variables--one for student number, one for the parental involvement condition (using, for example, a code of "1" for involvement and "0" for no involvement), and one for the score on Test 1. When creating the data file, it is a good idea to create a variable Label for each variable and Value labels for the grouping variable(s). These labels make it easier to interpret the output of your statistical procedures. The Variable View of the data file might look similar to the one below.

Figure 4-1 Variable View

The data view of the file should look like the following:

Figure 4-2 Data View


Note that in this particular case the two groups are separated in the data file, with the first half of the data corresponding to the parental involvement condition and the second half corresponding to the no-involvement condition. Although this makes for an orderly data table, such ordering is NOT required in SPSS for the independent-samples t test. When performing the test, whether or not the data are sorted by the independent variable, you must specify which condition a participant is in by use of a grouping variable as indicated above.

Performing the t test for the Parental Involvement Experiment


You should enter the data as described above. Or you may access the SPSS data file for the parental involvement experiment by clicking here. To perform the t test, complete the following steps in order. Click on Analyze, then Compare Means, then Independent Samples T Test.

Figure 4-3 Select Analyze, Compare Means, Independent-Samples T Test

Now, move the dependent variable (in this case, labeled "Score on Test 1 [Test 1] ") into the Test Variable window. Then move your independent variable (in this case, "Parental Involvement [Involve]") into the Grouping Variable window. Remember that Grouping Variable stands for the levels of the independent variable.

Figure 4-4 Independent-Samples T Test dialog box

You will notice that there are question marks in the parentheses following your independent variable in the Grouping Variable field. This is because you need to define the particular groups that you want to compare. To do so, click on Define Groups, and indicate the numeric value that represents each group. In this case, you will want to put a "0" in the field labeled Group 1 and a "1" in the field labeled Group 2. Once you have done this, click on Continue. Now click on OK to run the t test. You may also want to click on Paste in order to save the SPSS syntax of what you have done (see Figure 4-5) in case you want to run the same kind of test from SPSS syntax later.

Figure 4-5 Syntax for the independent-samples t test
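
For readers who cannot view the figure, the pasted syntax looks roughly like the following. This is a sketch; the exact subcommands and confidence-interval criterion depend on your settings.

  T-TEST GROUPS=Involve(0 1)
    /MISSING=ANALYSIS
    /VARIABLES=Test1
    /CRITERIA=CI(.95).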

Output from the t test Procedure


As you can see below, the output from an independent-samples t test procedure is relatively straightforward.

Figure 4-6 Independent-samples t test output

Interpreting the Output


In the SPSS output, the first table lists the number of participants (N), mean, standard deviation, and standard error of the mean for both of your groups. Notice that the value labels are printed as well as the variable labels for your variables, making it easier to interpret the output.

The second table (see Figure 4-6) presents you with an F test (Levene's test for equality of variances) that evaluates the basic assumption of the t test that the variances of the two groups are approximately equal (homogeneity of variance or homoscedasticity). If the F value reported here is very high and the significance level is very low (usually lower than .05 or .01), then the assumption of homogeneity of variance has been violated. In that case, you should use the t test in the lower half of the table, whereas if you have not violated the homogeneity assumption, you should use the t test in the upper half of the table. The t-test formula for unequal variances makes an adjustment to the degrees of freedom, so this value is often fractional, as seen above. In this particular case, you can see that we have not violated the homogeneity assumption, and we should report the value of t as 2.356, degrees of freedom of 14, and the significance level of .034. Thus, our data show that parental involvement has a significant effect on grades, t(14) = 2.356, p = .034.

Lesson 5: Paired-Samples t Test


Objectives
1. Conduct a paired-samples t test.
2. Interpret the output of the paired-samples t test.

Overview
The paired-samples or dependent t test is used for within-subjects or matched-pairs designs in which observations in the groups are linked. The linkage could be based on repeated measures, natural pairings such as mothers and daughters, or pairings created by the experimenter. In any of these cases, the analysis is the same. The dependency between the two observations is taken into account, and each set of observations serves as its own control, making this a generally more powerful test than the independent-samples t test. Because of the dependency, the degrees of freedom for the paired-samples t test are based on the number of pairs rather than the number of observations.

Example
Imagine that you conducted an experiment to test the effects of the presence of others (independent variable) on problem-solving performance (dependent variable). Assume further that you used a within-subjects design; that is, each participant was tested alone and in the presence of others on different days using comparable tasks. Higher scores indicate better problem-solving performance. The data appear below:
Participant  Alone  Others
1            12     10
2            8      6
3            4      5
4            6      5
5            12     10
6            6      5
7            11     7
8            5      3
9            7      6
10           12     7
11           9      8
12           5      2

The following figure shows the variable view of the structure of the dataset:

Figure 5-1 Dataset variable view

Entering Data for a Within-Subjects Design: Key Point


When you enter data for a within-subjects design, there must be a separate column for each condition. This tells SPSS that the two data points are linked for a given participant. Unlike the independent-samples t test where a grouping variable is required, there is no additional grouping variable in the paired-samples t test. The properly configured data are shown in the following screenshot of the SPSS Data Editor Data View:

Figure 5-2 Dataset data view

Performing the Paired-Samples t test Step-by-Step


The SPSS data file for this example can be found here. After you have entered or opened the dataset, you should follow these steps in order. Click on Analyze, Compare Means, and then Paired-Samples T test.

Figure 5-3 Select Paired-Samples T Test

In the resulting dialog box, click on the label for Alone and then press <Shift> and click on the label for Others. Click on the arrow to move this pair of variables to the Paired Variables window.

Figure 5-4 Identify paired variables
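
The equivalent pasted syntax is a single command (a sketch, assuming the variable names Alone and Others):

  T-TEST PAIRS=Alone WITH Others (PAIRED)
    /CRITERIA=CI(.95)
    /MISSING=ANALYSIS.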

Interpreting the Paired-Samples t Test Output


Click OK and the following output appears in the SPSS Output Viewer Window (see Figure 5-5). Note that the correlation between the two observations is reported along with its p level, and that the value of t, the degrees of freedom (df), and the p level of the calculated t are reported as well.

Figure 5-5 Paired-Samples T Test output

Lesson 6: One-Way ANOVA


Objectives
1. Conduct a one-way ANOVA.
2. Perform post hoc comparisons among means.
3. Interpret the ANOVA and post hoc comparison output.

Overview
The one-way ANOVA compares the means of three or more independent groups. Each group represents a different level of a single independent variable. It is useful, at least conceptually, to think of the one-way ANOVA as an extension of the independent-samples t test. The null hypothesis in the ANOVA is that the several populations being sampled all have the same mean. Because the variance is based on deviations from the mean, the "analysis of variance" can be used to test hypotheses about means. The test statistic in the ANOVA is an F ratio, which is a ratio of two variances. When an ANOVA leads to the conclusion that the sample means differ by more than a chance level, it is usually instructive to perform post hoc (or a posteriori) analyses to determine which of the sample means are different. It is also helpful to determine and report effect size when performing ANOVA.

Example Problem
In a class of 30 students, ten students each were randomly assigned to three different methods of memorizing word lists. In the first method, the student was instructed to repeat the word silently when it was presented. In the second method, the student was instructed to spell the word backward and visualize the backward word and to pronounce it silently. The third method required the student to associate each word with a strong memory. Each student saw the same 10 words flashed on a computer screen for five seconds each. The list was repeated in random order until each word had been presented a total of five times. A week later, students were asked to write down as many of the words as they could recall. For each of the three groups, the number of correctly-recalled words is shown in the following table:
Method1  Method2  Method3
1        4        7
2        4        4
0        0        9
0        6        8
4        6        6
3        6        9
1        6        6
0        6        4
3        4        5
3        4        6

Entering the Data in SPSS


Recall our previous lessons on data entry. These 30 scores represent 30 different individuals, and each participant's data should take up one line of the data file. The group membership should be coded as a separate variable. The correctly entered data would take the following form (see Figure 6-1). Note that although we used 1, 2, and 3 to code group membership, we could just as easily have used 0, 1, and 2.

Figure 6-1 Data for one-way ANOVA

Conducting the One-Way ANOVA


To perform the one-way ANOVA in SPSS, click on Analyze, Compare Means, One-Way ANOVA (see Figure 6-2).

Figure 6-2 Select Analyze, Compare Means, One-Way ANOVA

In the resulting dialog box, move Recall to the Dependent List and Method to the Factor field. Select Post Hoc and then check the box in front of Tukey for the Tukey HSD test (see Figure 6-3), which is one of the most frequently used post hoc procedures. Note also the many other post hoc comparison tests available.

Figure 6-3 One-Way ANOVA dialog with Tukey HSD test selected
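
The corresponding syntax is brief (a sketch, assuming the variable names Recall and Method):

  ONEWAY Recall BY Method
    /MISSING ANALYSIS
    /POSTHOC=TUKEY ALPHA(0.05).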

The ANOVA summary table and the post hoc test results appear in the SPSS Viewer (see Figure 6-4). Note that the overall (omnibus) F ratio is significant, indicating that the means differ by a larger amount than would be expected by chance alone if the null hypothesis were true. The post hoc test results indicate that the mean for Method 1 is significantly lower than the means for Methods 2 and 3, but that the means for Methods 2 and 3 are not significantly different.

Figure 6-4 ANOVA summary table and post hoc test results

As an aid to understanding the post hoc test results, SPSS also provides a table of homogeneous subsets (see Figure 6-5). Note that it is not strictly necessary that the sample sizes be equal in the one-way ANOVA, and when they are unequal, the Tukey HSD procedure uses the harmonic mean of the sample sizes for post hoc comparisons.

Figure 6-5 Table of homogeneous subsets

Missing from the ANOVA results table is any reference to effect size. A common effect size index is eta squared, which is the between-groups sum of squares divided by the total sum of squares. As such, this index represents the proportion of variance that can be attributed to between-group differences or treatment effects. An alternative method of performing the one-way ANOVA provides the effect-size index, but not the post hoc comparisons discussed earlier. To perform this alternative analysis, select Analyze, Compare Means, Means (see Figure 6-6). Move Recall to the Dependent List and Method to the Independent List. Under Options, select Anova Table and eta.

Figure 6-6 ANOVA procedure and effect size index available from Means procedure
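
A syntax sketch of this alternative Means-based analysis, which adds the ANOVA table and eta statistics:

  MEANS TABLES=Recall BY Method
    /CELLS MEAN COUNT STDDEV
    /STATISTICS ANOVA.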

The ANOVA summary table from the Means procedure appears in Figure 6-7 below. Eta squared is directly interpretable as an effect size index: 58 percent of the variance in recall can be explained by the method used for remembering the word list.

Figure 6-7 ANOVA table and effect size from Means procedure

Lesson 7: Repeated-Measures ANOVA


Objectives
1. Conduct the repeated-measures ANOVA.
2. Interpret the output.
3. Construct a profile plot.

Overview
The repeated-measures or within-subjects ANOVA is used when there are multiple measures for each participant. It is conceptually useful to think of the repeated-measures ANOVA as an extension of the paired-samples t test. Each set of observations for a subject or case serves as its own control, so this test is quite powerful. In the repeated-measures ANOVA, the test of interest is the within-subjects effect of the treatments or repeated measures. The procedure for performing a repeated-measures ANOVA in SPSS is found in the Analyze, General Linear Model menu.

Example Data
Assume that a statistics professor is interested in the effects of taking a statistics course on performance on an algebra test. She administers a 20-item college algebra test to ten randomly selected statistics students at the beginning of the term, at the end of the term, and six months after the course is finished. The hypothetical test results are as follows.

Student  Before  After  SixMo
1        13      15     17
2        8       8      7
3        12      15     14
4        12      17     16
5        19      20     20
6        10      15     14
7        10      13     15
8        8       12     11
9        14      15     13
10       11      16     9

Coding Considerations
Data coding considerations in the repeated-measures ANOVA are similar to those in the paired-samples t test. Each participant or subject takes up a single row in the data file, and each observation requires a separate column. The properly coded SPSS data file with the data entered correctly should appear as follows (see Figure 7-1). You may also retrieve a copy of the data file if you like.

Figure 7-1 SPSS data file coded for repeated-measures ANOVA

Performing the Repeated-Measures ANOVA


To perform the repeated-measures ANOVA in SPSS, click on Analyze, then General Linear Model, and then Repeated Measures. See Figure 7-2.

Figure 7-2 Select Analyze, General Linear Model, Repeated Measures

In the resulting Repeated Measures dialog, you must specify the number of factors and the number of levels for each factor. In this case, the single factor is the time the algebra test was taken, and there are three levels: at the beginning of the course, immediately after the course, and six months after the course. You can accept the default label of factor1, or change it to a more descriptive one. We will use "Time" as the label for our factor, and specify that there are three levels (see Figure 7-3).

Figure 7-3 Specifying factor and levels

After naming the factor and specifying the number of levels, you must add the factor and then define it. Click on Add and then click on Define. See Figure 7-4.

Figure 7-4 Specifying within-subjects variable levels

Now you can enter the levels one at a time by clicking on a variable name and then clicking on the right arrow adjacent to the Within-Subjects Variables field. Or you can click on Before in the left pane of the Repeated Measures dialog, then hold down <Shift> and click on SixMo to select all three levels at the same time, and then click on the right arrow to move all three levels to the window in one step (see Figure 7-5).

Figure 7-5 Within-subjects variables appropriately entered

Clicking on Options allows you to specify the calculation of descriptive statistics, effect size, and contrasts among the means. If you like, you can also click on Plots to include a line graph of the algebra test mean scores for the three administrations. Figure 7-6 is a screen shot of the Profile Plots dialog. You should click on Time, then Horizontal Axis, and then click on Add. Click Continue to return to the Repeated Measures dialog.

Figure 7-6 Profile Plots dialog

Now click on Options and specify descriptive statistics, effect size, and contrasts (see Figure 7-7). You must move Time to the Display Means window as well as specify a confidence level adjustment for the main effects contrasts. A Bonferroni correction will adjust the alpha level in the post hoc comparisons, while the default LSD (Fisher's least significant difference test) will not adjust the alpha level. We will select the more conservative Bonferroni correction.

Figure 7-7 Specifying descriptive statistics, effect size, and mean contrasts

Click on Continue, then OK to run the repeated-measures ANOVA. The SPSS output provides several tests. When there are multiple dependent variables, the multivariate test is used to determine whether there is an overall within-subjects effect for the combined dependent variables. As there is only one within-subjects factor, we can ignore this test in the present case. Sphericity is an assumption that the variances of the differences between the pairs of measures are equal. The nonsignificant test of sphericity indicates that this assumption is not violated in the present case, and adjustments to the degrees of freedom (and thus to the p level) are not required. The test of interest is the Test of Within-Subjects Effects. We can assume sphericity and report the F ratio as 8.149 with 2 and 18 degrees of freedom and the p level as .003 (see Figure 7-8). Partial eta-squared has an interpretation similar to that of eta-squared in the one-way ANOVA, and is directly interpretable as an effect-size index: about 48 percent of the within-subjects variation in algebra test performance can be explained by knowledge of when the test was administered.

Figure 7-8 Test of within-subjects effects

Additional insight is provided by the Bonferroni-corrected pairwise comparisons, which indicate that the means for Before and After are significantly different, while none of the other comparisons are significant. The profile plot assists in the visualization of these contrasts. See Figures 7-9 and 7-10. These results indicate an immediate but unsustained improvement in algebra test performance for students taking a statistics course.

Figure 7-9 Bonferroni-corrected pairwise comparisons

Figure 7-10 Profile plot
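
For completeness, the entire repeated-measures analysis can also be run from the Syntax Editor. The following is a sketch, assuming the variable names Before, After, and SixMo and the factor name Time used in this lesson; the exact pasted subcommands depend on the options you selected.

  GLM Before After SixMo
    /WSFACTOR=Time 3 Polynomial
    /METHOD=SSTYPE(3)
    /PLOT=PROFILE(Time)
    /EMMEANS=TABLES(Time) COMPARE ADJ(BONFERRONI)
    /PRINT=DESCRIPTIVE ETASQ
    /WSDESIGN=Time.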


Lesson 9: ANOVA for Mixed Factorial Designs


Objectives
1. Conduct a mixed-factorial ANOVA.
2. Test between-groups and within-subjects effects.
3. Construct a profile plot.

Overview
A mixed factorial design involves two or more independent variables, of which at least one is a within-subjects (repeated measures) factor and at least one is a between-groups factor. In the simplest case, there will be one between-groups factor and one within-subjects factor. The between-groups factor would need to be coded in a single column as with the independent-samples t test or the one-way ANOVA, while the repeated measures variable would comprise as many columns as there are measures, as in the paired-samples t test or the repeated-measures ANOVA.

Example Data
As an example, assume that you conducted an experiment in which you were interested in the extent to which visual distraction affects younger and older people's learning and remembering. To do this, you obtained a group of younger adults and a separate group of older adults and had them learn under three conditions (eyes closed, eyes open looking at a blank field, eyes open looking at a distracting field of pictures). This is a 2 (age) x 3 (distraction condition) mixed factorial design. The scores on the data sheet below represent the number of words recalled out of ten under each distraction condition.
Age      Closed Eyes  Simple Distraction  Complex Distraction
Younger  8            5                   3
Younger  7            6                   6
Younger  8            7                   6
Younger  7            5                   4
Older    6            5                   2
Older    5            5                   4
Older    5            4                   3
Older    6            3                   2

Building the SPSS Data File


Note that there are eight separate participants, so the data file will require eight rows. There will be a column for the participants' age, which is the between-groups variable, and three columns for the repeated measures, which are the distraction conditions. As always it is helpful to include a column for participant (or case) number. The data appropriately entered in SPSS should look something like the following (see Figure 9-1). You may optionally download a copy of the data file.

Figure 9-1 SPSS data structure for mixed factorial design

Performing the Mixed Factorial Anova


To conduct this analysis, you will use the repeated measures procedure. The initial steps are identical to those in the within-subjects ANOVA. You must first specify repeated measures to identify the within-subjects variable(s), and then specify the between-groups factor(s). Select Analyze, then General Linear Model, then Repeated Measures (see Figure 9-2).

Figure 9-2 Preparing for the Mixed Factorial Analysis

Next, you must define the within-subjects factor(s). This process should be repeated for each factor on which there are repeated measures. In our present case, there is only one within-subject variable, the distraction condition. SPSS will give the within-subjects variables the names factor1, factor2, and so on, but you can provide more descriptive names if you like. In the Repeated Measures dialog box, type in the label distraction and the number of levels, 3. See Figure 9-3. If you like, you can give this measure (the three distraction levels) a new name by clicking in the Measure Name field. If you choose to name this factor, the name must be unique and may not conflict with any other variable names. If you do not name the measure, the SPSS name for the measure will default to MEASURE_1. In the present case we will leave the measure name blank and accept the default label.

Figure 9-3 Specifying the within-subjects factor.

We will now specify the within-subjects and between-groups variables. Click on Add and then Define to specify which variable in the dataset is associated with each level of the within-subjects factor (see Figure 9-4).

Figure 9-4 Defining the within-subjects variable

Move the Closed, Simple, and Complex variables to levels 1, 2, and 3, respectively, and then move Age to the Between-Subjects Factor(s) window (see Figure 9-5). You can optionally specify one or more covariates for analysis of covariance.

Figure 9-5 The complete design specification for the mixed factorial ANOVA

To display a plot of the cell means, click on Plots, and then move Age to the Horizontal axis, and distraction to Separate Lines. Next click on Add to specify the plot (see Figure 9-6) and then click Continue.

Figure 9-6 Specifying plot

We will use the Options menu to specify the display of marginal and cell means, to compare main effects, to display descriptive statistics, and to display measures of effect size. We will select the Bonferroni interval adjustment to control the level of Type I error. See Figure 9-7.

Figure 9-7 Repeated measures options

Select Continue to close the options dialog and then OK to run the ANOVA. The resulting SPSS output is rather daunting, but you should focus on the between and within-subjects tests. The test of sphericity is not significant, indicating that this assumption has not been violated. Therefore you should use the F ratio and degrees of freedom associated with the sphericity assumption (see Figure 9-8). Specifically you will want to determine whether there is a main effect for age, an effect for distraction condition, and a possible interaction of the two. The tables of interest from the SPSS Viewer are shown in Figures 9-8 and 9-9.

Figure 9-8 Partial SPSS output

The test of within-subjects effects indicates that there is a significant effect of the distraction condition on word memorization. The lack of an interaction between distraction and age indicates that this effect is consistent for both younger and older participants. The test of between-subjects effects (see Figure 9-9) indicates there is a significant effect of the age condition on word memory.

Figure 9-9 Test of between-subjects effects

The remainder of the output assists in the interpretation of the main effects of the within-subjects (distraction condition) and between-subjects (age condition) factors. Of particular interest is the profile plot, which clearly displays the main effects and the absence of an interaction (see Figure 9-10). As discussed above, SPSS calls the within-subjects variable MEASURE_1 in the plot.

Figure 9-10 Profile plot
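
As with the one-way repeated-measures analysis, the whole mixed design can be run from syntax. The following is a sketch using the variable and factor names from this lesson; the subcommands SPSS actually pastes depend on the options chosen.

  GLM Closed Simple Complex BY Age
    /WSFACTOR=distraction 3 Polynomial
    /METHOD=SSTYPE(3)
    /PLOT=PROFILE(Age*distraction)
    /EMMEANS=TABLES(Age) COMPARE ADJ(BONFERRONI)
    /EMMEANS=TABLES(distraction) COMPARE ADJ(BONFERRONI)
    /PRINT=DESCRIPTIVE ETASQ
    /WSDESIGN=distraction
    /DESIGN=Age.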

Lesson 10: Correlation and Scatterplots


Objectives
1. Calculate correlation coefficients.
2. Test the significance of correlation coefficients.
3. Construct a scatterplot.
4. Edit features of the scatterplot.

Overview
In correlational research, there is no experimental manipulation. Rather, we measure variables in their natural state. Instead of independent and dependent variables, it is useful to think of predictors and criteria. In bivariate (two-variable) correlation, we are assessing the degree of linear relationship between a predictor, X, and a criterion, Y. In multiple regression, we are assessing the degree of relationship between a linear combination of two or more predictors, X1, X2, ..., Xk, and a criterion, Y. We will address correlation in the bivariate case in Lesson 10, linear regression in the bivariate case in Lesson 11, and multiple regression and correlation in Lesson 12. The Pearson product moment correlation coefficient summarizes and quantifies the relationship between two variables in a single number. This number can range from -1, representing a perfect negative or inverse relationship, to 0, representing no relationship or complete independence, to +1, representing a perfect positive or direct relationship. When we calculate a correlation coefficient from sample data, we will need to determine whether the obtained correlation is significantly different from zero. We will also want to produce a scatterplot or scatter diagram to examine the nature of the relationship. Sometimes the correlation is low not because of a lack of relationship, but because of a lack of linear relationship. In such cases, examining the scatterplot will assist in determining whether a relationship may be nonlinear.

Example Data
Suppose that you have collected questionnaire responses concerning dormitory conditions from 10 college freshmen. (Normally you would like to have a larger sample, but the small sample in this case is useful for illustration.) The questionnaire contains five questions assessing the students' level of satisfaction with noise, furniture, study area, safety, and privacy. These items are answered on a 5-point Likert-type scale (very dissatisfied to very satisfied), coded 1 to 5. Assume that you have also assessed the students' family income level, and you would like to test the hypothesis that satisfaction with the college living environment is related to wealth (family income). The data sheet for this study is shown below.
Student  Income  Noise  Furniture  Study_Area  Safety  Privacy
1        39      5      5          4           5       5
2        59      3      3          5           5       4
3        75      2      1          2           2       2
4        45      5      3          4           4       5
5        95      1      2          2           1       2
6        115     1      1          1           1       1
7        67      3      2          4           3       3
8        48      4      4          5           4       4
9        140     2      2          1           1       1
10       55

Entering the Data in SPSS


The data correctly entered in SPSS would look like the following (see Figure 10-1). Remember not only to enter the data, but to add appropriate labels in the Variable View to improve the readability of the output. If you prefer, you can download a copy of the data file.

Figure 10-1 Data entered in SPSS

Calculating and Testing Correlation Coefficients


To calculate and test the significance of correlation coefficients, select Analyze, Correlate, Bivariate (see Figure 10-2).

Figure 10-2 The bivariate correlation procedure

Move the desired variables to the Variables window, as shown in Figure 10-3.

Figure 10-3 Move desired variables to the Variables window

Under the Options menu, let us select means and standard deviations and then click Continue. The output contains a table of descriptive statistics (see Figure 10-4) and a table of correlations and related significance tests (see Figure 10-5).

Figure 10-4 Descriptive statistics

Figure 10-5 Correlation matrix

Note that SPSS flags significant correlations with asterisks. The correlation matrix is symmetrical, so the above-diagonal entries are the same as the below-diagonal entries. In our survey results we note strong negative correlations between family income and the various survey items and strong positive correlations among the various items.
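
In syntax form, the correlation matrix and descriptive statistics above can be requested as follows (a sketch, assuming the variable names from the data file):

  CORRELATIONS
    /VARIABLES=Income Noise Furniture Study_Area Safety Privacy
    /PRINT=TWOTAIL NOSIG
    /STATISTICS DESCRIPTIVES
    /MISSING=PAIRWISE.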

Constructing a Scatterplot
For purposes of illustration, let us produce a scatterplot of the relationship between satisfaction with noise level in the dormitory and family income. We see from the correlation matrix that this is a significant negative correlation. As family income increases, satisfaction with the dormitory noise level decreases. To build the scatterplot, select Graphs, Interactive, Scatterplot (see Figure 10-6). Please note that there are several different ways to construct the scatterplot in SPSS, and that we are illustrating only one here.

Figure 10-6 Constructing a scatterplot

In the resulting dialog, enter Family Income on the X-axis and Noise on the Y-axis (see Figure 10-7).

Figure 10-7 Specifying variables for the scatterplot

The resulting scatterplot (see Figure 10-8) shows the relationship between family income and satisfaction with dormitory noise.

Figure 10-8 Scatterplot
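
A comparable (non-interactive) scatterplot can also be requested with the legacy GRAPH command, one of the several charting routes mentioned above (a sketch):

  GRAPH
    /SCATTERPLOT(BIVAR)=Income WITH Noise.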

In the SPSS Viewer it is possible to edit a chart object by double-clicking on it. In addition to many other options, you can change the labeling and scaling of axes, add trend lines and other elements to the scatterplot, and change the marker types. The edited chart appears in Figure 10-9. If you like, you can save this particular combination as a chart template to use it again in the future.

Figure 10-9 Edited scatterplot

Lesson 11: Linear Regression


Objectives
1. Determine the regression equation.
2. Compute predicted Y values.
3. Compute and interpret residuals.

Overview
Closely related to correlation is the topic of linear regression. As you learned in Lesson 10, the correlation coefficient is an index of linear relationship. If the correlation coefficient is significant, that is an indication that a linear equation can be used to model the relationship between the predictor X and the criterion Y. In this lesson you will learn how to determine the equation of the line of best fit between the predictor and the criterion, how to compute predicted values based on that linear equation, and how to calculate and interpret residuals.

Example Problem and Data

This spring term you are in a large introductory psychology class. You observe an apparent relationship between the outside temperature and the number of people who skip class on a given day. More people seem to be absent when the weather is warmer, and more seem to be present when it is cooler outside. You randomly select 10 class periods and record the outside temperature reading 10 minutes before class time and then count the number of students in attendance that day. If you determine that there is a significant linear relationship, you would like to impress your professor by predicting how many people will be present on a given day, based on the outside temperature. The data you collect are the following:
Temp  Attendance
50    87
77    60
67    73
53    86
75    59
70    65
83    65
85    62
80    58
64    89

Entering the Data in SPSS


These pairs of data must be entered as separate variables. The data file may look something like the following (see Figure 11-1):

Figure 11-1 Data in SPSS

If you prefer, you can download a copy of the data. As you learned in Lesson 10, you should first determine whether there is a significant correlation between temperature and attendance. Running the Correlation procedure (see Lesson 10 for details), you find that the correlation is -.87, and is significant at the .01 level (see Figure 11-2).

Figure 11-2 Significant correlation

A scatterplot is helpful in visualizing the relationship (see Figure 11-3). Clearly, there is a negative relationship between attendance and temperature.

Figure 11-3 Scatterplot

Linear Regression
The correlation and scatterplot indicate a strong, though by no means perfect, relationship between the two variables. Let us now turn our attention to regression. We will "regress" the attendance (Y) on the temperature (X). In linear regression, we are seeking the equation of a straight line that best fits the observations. The usefulness of such a line may not be immediately apparent, but if we can model the relationship by a straight line, we can use that line to predict a value of Y for any value of X, even those that have not yet been observed. For example, looking at the scatterplot in Figure 11-3, what attendance would you predict for a temperature of 60 degrees? The regression line can answer that question. This line will have an intercept term and a slope coefficient and will be of the general form

Y-hat = a + bX

The intercept and slope (regression) coefficient are derived in such a way that the sums of the squared deviations of the actual data points from the line are minimized. This is called "ordinary least squares" estimation or OLS. Note that the predicted value of Y (read "Y-hat") is a linear combination of two constants, the intercept term and the slope term, and the value of X, so that the only thing that varies is the value of X. Therefore, the correlation between the predicted Ys and the observed Ys will be the same as the correlation between the observed Ys and the observed Xs. If we subtract the predicted value of Y from the observed value of Y, the difference is called a "residual." A residual represents the part of the Y variable that cannot be explained by the X variable. Visually, the distance between the observed data points and the line of best fit represents the residual. SPSS's Regression procedure allows us to determine the equation of the line of best fit, to calculate predicted values of Y, and to calculate and interpret residuals. Optionally, you can save the predicted values of Y and the residuals as either standard scores or raw-score equivalents.
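
For reference, the least-squares estimates referred to above can be written out explicitly. This is a standard result, stated here in plain notation rather than SPSS syntax, with Xbar and Ybar denoting the sample means:

  b = sum[(X - Xbar)(Y - Ybar)] / sum[(X - Xbar)^2]
  a = Ybar - b * Xbar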

Running the Regression Procedure


Open the data file in SPSS. Select Analyze, Regression, and then Linear (see Figure 11-4).

Figure 11-4 Performing the Regression procedure

The Regression procedure outputs a value called "Multiple R," which will always range from 0 to 1. In the bivariate case, Multiple R is the absolute value of the Pearson r, and is thus .87. The square of r or of Multiple R is .752, and represents the amount of shared variance between Y and X. When we run the regression tool, we can optionally ask for either standardized or unstandardized (raw-score) predicted values of Y and residuals to be calculated and saved as new variables (see Figure 11-5).

Figure 11-5 Save options in the Regression procedure

Click OK to run the Regression procedure. The output is shown in Figure 11-6. In the ANOVA table summarizing the regression, the omnibus F test tests the hypothesis that the population Multiple R is zero. We can safely reject that null hypothesis. Notice that dividing the regression sum of squares, which is based on the predicted values of Y, by the total sum of squares, which is based on the observed values of Y, produces the same value as R Square. The value of R Square thus represents the proportion of variance in the criterion that can be explained by the predictor. The residual sum of squares represents the variance in the criterion that remains unexplained.

Figure 11-6 Regression procedure output
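The whole analysis, including the Save options, corresponds roughly to the syntax below, which you could paste into a syntax window instead of clicking through the dialogs. The variable names temp and attend are assumptions; the /SAVE subcommand writes the unstandardized and standardized predicted values and residuals back to the data file as new variables.

* Bivariate regression of attendance on temperature (variable names assumed).
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT attend
  /METHOD=ENTER temp
  /SAVE PRED RESID ZPRED ZRESID.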

In Figure 11-7 you can see that the residuals and predicted values are now saved as new variables in the SPSS data file.

Figure 11-7 Saving predicted values and residuals

The regression equation for predicting attendance from the outside temperature is 133.556 - .897 x Temp. So for a temperature of 60 degrees, you would predict the attendance to be 80 students (see Figure 11-8 in which this is illustrated graphically). Note that this process of using a linear equation to predict attendance from the temperature has some obvious practical limits. You would never predict attendance higher than 100 percent, for example, and there may be a point at which the temperature becomes so hot as to be unbearable, and the attendance could begin to rise simply because the classroom is air-conditioned.

Figure 11-8 Linear trend line and regression equation

To impress your professor, assume that the outside temperature on a class day is 72 degrees. Substituting 72 for X in the regression equation, you predict that there will be 69 students in attendance that day.
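You can check this arithmetic directly in SPSS by applying the fitted equation with a COMPUTE statement. The coefficients below are taken from the output shown in Figure 11-8, and the variable name temp is again an assumption; the result should match the predicted values saved by the Regression procedure.

* Hand-apply the fitted equation to each case and compare with the saved predicted values.
COMPUTE pred_attend = 133.556 - 0.897 * temp.
EXECUTE.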

Examining Residuals
A residual is the difference between the observed and predicted values for the criterion variable (Hair, Black, Babin, Anderson, & Tatham, 2006). Bivariate linear regression and multiple linear regression make four key assumptions about these residuals:

1. The phenomenon (i.e., the regression model being considered) is linear, so that the relationship between X and Y is linear.
2. The residuals have equal variances at all levels of the predicted values of Y.
3. The residuals are independent. This is another way of saying that the successive observations of the dependent variable are uncorrelated.
4. The residuals are normally distributed with a mean of zero.

Thus it can be very instructive to examine the residuals when you perform a regression analysis. It is helpful to examine a histogram of the standardized residuals (see Figure 11-9), which can be created from the Plots menu. The normal curve can be superimposed for visual reference.

Figure 11-9 Histogram of standardized residuals

These residuals appear to be approximately normally distributed. Another useful plot is the normal p-p plot produced as an option in the Plots menu. This plot compares the cumulative probabilities of the observed residuals to the cumulative probabilities that would be expected if the residuals were normally distributed. Significant departures from a straight line would indicate nonnormality in the data (see Figure 11-10). In this case the residuals once again appear to be fairly normally distributed.

Figure 11-10 Normal p-p plot of observed and expected cumulative probabilities of residuals
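These plots can also be requested in syntax by adding a /RESIDUALS subcommand to the REGRESSION command sketched earlier (variable names still assumed):

* Histogram and normal p-p plot of the standardized residuals.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT attend
  /METHOD=ENTER temp
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID).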

When there are significant departures from normality, homoscedasticity, and linearity, data transformations or the introduction of polynomial terms such as quadratic or cubic values of the original independent or dependent variables can often be of help (Edwards, 1976).

References
Edwards, A. L. (1976). An introduction to linear regression and correlation. San Francisco: Freeman.
Hair, J. F., Black, W. C., Babin, B. J., Anderson, R. E., & Tatham, R. L. (2006). Multivariate data analysis (6th ed.). Upper Saddle River, NJ: Pearson Prentice Hall.

Lesson 12: Multiple Correlation and Regression


Objectives

1. Perform and interpret a multiple regression analysis.
2. Test the significance of the regression and the regression coefficients.
3. Examine residuals for diagnostic purposes.

Overview
Multiple regression involves one continuous criterion (dependent) variable and two or more predictors (independent variables). The equation for a line of best fit is derived in such a way as to minimize the sums of the squared deviations from the line. Although there are multiple predictors, there is only one predicted Y value, and the correlation between the observed and predicted Y values is called Multiple R. The value of Multiple R will range from zero to one. In the case of bivariate correlation, a regression analysis will yield a value of Multiple R that is the absolute value of the Pearson product moment correlation coefficient between X and Y, as discussed in Lesson 11. The multiple linear regression equation will take the following general form:

Ŷ = b0 + b1X1 + b2X2 + ... + bkXk
Instead of using a to represent the Y intercept, it is common practice in multiple regression to call the intercept term b0. The significance of Multiple R, and thus of the entire regression, must be tested. As well, the significance of the individual regression coefficients must be examined to verify that a particular independent variable is adding significantly to the prediction. As in simple linear regression, residual plots are helpful in diagnosing the degree to which the linearity, normality, and homoscedasticity assumptions have been met. Various data transformations can be attempted to accommodate situations of curvilinearity, non-normality, and heteroscedasticity. In multiple regression we must also consider the potential impact of multicollinearity, which is the degree of linear relationship among the predictors. When there is a high degree of collinearity in the predictors, the regression equation will tend to be distorted, and may lead to inappropriate conclusions regarding which predictors are statistically significant (Lind, Marchal, & Wathen, 2006). For this reason, we will ask for collinearity diagnostics when we run our regression. As a rule of thumb, if the variance inflation factor (VIF) for a given predictor is very high or if the absolute value of the correlation between two predictors is greater than .70, one or more of the predictors should be dropped from the analysis, and the regression equation should be recomputed. Multiple regression is in actuality a general family of techniques, and the mathematical and statistical underpinnings of multiple regression make it an extremely powerful and flexible tool. By using group membership or treatment level qualitative coding variables as predictors, one can easily use multiple regression in place of t tests and analyses of variance. In this tutorial we will concentrate on the simplest kind of multiple regression, a forced or simultaneous regression in which all predictor variables are entered into the regression equation at one time. Other approaches include stepwise regression, in which variables are entered according to their predictive ability, and hierarchical regression, in which variables are entered according to theory or hypothesis. We will examine hierarchical regression more closely in Lesson 14 on analysis of covariance.

Example Data
The following data (see Figure 12-1) represent statistics course grades, GRE Quantitative scores, and cumulative GPAs for 32 graduate students at a large public university in the southern U.S. (source: data collected by the webmaster). You may click here to retrieve a copy of the entire dataset.

Figure 12-1 Statistics course grades, GREQ, and GPA (partial data)

Preparing for the Regression Analysis

We will determine whether quantitative ability (GREQ) and cumulative GPA can be used to predict performance in the statistics course. A very useful first step is to calculate the zero-order correlations among the predictors and the criterion. We will use the Correlate procedure for that purpose. Select Analyze, Correlate, Bivariate (see Figure 12-2).

Figure 12-2 Calculate intercorrelations as preparation for regression analysis

In the Options menu of the resulting dialog box, you can request descriptive statistics if you like. The resulting intercorrelation matrix reveals that GREQ and GPA are both significantly related to the course grade, but are not significantly related to each other. Thus our initial impression is that collinearity will not be a problem (see Figure 12-3).

Figure 12-3 Descriptive statistics and intercorrelations
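In syntax, the same preparatory step might look like the sketch below. The variable names grade, greq, and gpa are assumptions based on Figure 12-1 and should be adjusted to match the dataset as you entered it.

* Descriptive statistics and zero-order correlations among the criterion and predictors.
CORRELATIONS
  /VARIABLES=grade greq gpa
  /PRINT=TWOTAIL NOSIG
  /STATISTICS DESCRIPTIVES.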

Conducting the Regression Analysis


To conduct the regression analysis, select Analyze, Regression, Linear (see Figure 12-4).

Figure 12-4 Selecting the Linear Regression procedure

In the Linear Regression dialog box, move Grade to the Dependent variable field and GPA and GREQ to the Independent(s) list, as shown in Figure 12-5.

Figure 12-5 Linear Regression dialog box

Click on the Statistics button and check the box in front of collinearity diagnostics (see Figure 12-6).

Figure 12-6 Requesting collinearity diagnostics

Select Continue and then click on Plots to request standardized residual plots and also to request scatter diagrams. You should request a histogram and normal distribution plot of the standardized residuals. You can also plot the standardized residuals against the standardized predicted values to check the assumption of homoscedasticity (see Figure 12-7).

Click OK to run the regression analysis. The results are excerpted in Figure 12-8.

Figure 12-8 Regression procedure output (excerpt)
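The point-and-click specification above corresponds roughly to the following syntax (again assuming the variables are named grade, greq, and gpa). The COLLIN and TOL keywords request the collinearity diagnostics, and the /RESIDUALS and /SCATTERPLOT subcommands request the residual plots described above.

* Simultaneous (forced-entry) regression with collinearity diagnostics and residual plots.
REGRESSION
  /STATISTICS COEFF R ANOVA COLLIN TOL
  /DEPENDENT grade
  /METHOD=ENTER greq gpa
  /RESIDUALS HISTOGRAM(ZRESID) NORMPROB(ZRESID)
  /SCATTERPLOT=(*ZRESID, *ZPRED).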

Interpreting the Regression Output


The significant overall regression indicates that a linear combination of GREQ and GPA predicts grades in the statistics course. The value of R-Square is .513, and indicates that about 51 percent of the variation in grades is accounted for by knowledge of GPA and GREQ. The significant t values for the regression coefficients for GREQ and GPA show that each variable contributes significantly to the prediction. Examining the unstandardized regression coefficients is not very instructive, because these are based on raw scores and their values are influenced by the units of measurement of the predictors. Thus, the raw-score regression coefficient for GREQ is much smaller than that for GPA because the two variables use different scales. On the other hand, the standardized coefficients are quite interpretable, because each shows the relative contribution to the prediction of the given variable with the other variable held constant. These are technically standardized partial regression coefficients. In the present case, we can conclude that GREQ has more predictive value than GPA, though both are significant. The collinearity diagnostics indicate a low degree of overlap between the predictors (as we predicted). If the two predictor variables were orthogonal (uncorrelated), the variance inflation factor (VIF) for each would be 1. Thus we conclude that there is not a problem with collinearity in this case. The histogram of the standardized residuals shows that the departure from normality is not too severe (see Figure 12-9).

Figure 12-9 Histogram of standardized residuals

The normal p-p plot indicates some departure from normality and may suggest a curvilinear relationship between the predictors and the criterion (see Figure 12-10).

Figure 12-10 Normal p-p plot

The plot of standardized predicted values against the standardized residuals indicates a large degree of heteroscedasticity (see Figure 12-11). This is mostly the result of a single outlier, case 11 (Participant 118), whose GREQ and grade scores are significantly lower than those of the remainder of the group. Eliminating that case and recomputing the regression increases Multiple R slightly and also reduces the heteroscedasticity.

Figure 12-11 Plot of predicted values against residuals

Lesson 13: Chi-Square Tests


Objectives
1. Perform and interpret a chi-square test of goodness of fit.
2. Perform and interpret a chi-square test of independence.

Overview
Chi-square tests are used to compare observed frequencies to the frequencies expected under some hypothesis. Tests for one categorical variable are generally called goodness-of-fit tests. In this case, there is a one-way table of observed frequencies of the levels of some categorical variable. The null hypothesis might state that the expected frequencies are equally distributed or that they are unequal on the basis of some theoretical or postulated distribution. Tests for two categorical variables are usually called tests of independence or association. In this case, there will be a two-way contingency table with one categorical variable occupying the rows of the table and the other categorical variable occupying the columns. In this analysis, the expected frequencies are commonly derived on the basis of the assumption of independence. That is, if there were no association between the row and column variables, then a cell entry would be expected to be the product of the cell's row and column marginal totals divided by the overall sample size. In both tests, the chi-square test statistic is calculated as the sum of the squared differences between the observed and expected frequencies divided by the expected frequencies, according to the following simple formula:

χ² = Σ [(O − E)² / E]

where O represents the observed frequency in a given cell of the table and E represents the corresponding expected frequency under the null hypothesis. We will illustrate both the goodness-of-fit test and the test of independence using the same dataset. You will find the goodness-of-fit test for equal or unequal expected frequencies as an option under Nonparametric Tests in the Analyze menu. For the chi-square test of independence, you will use the Crosstabs procedure under the Descriptive Statistics menu in SPSS. The Crosstabs procedure can make use of numeric or text entries, while the Nonparametric Tests procedure requires numeric entries. For that reason, you will need to recode any text entries into numerical values for goodness-of-fit tests.

Example Data
Assume that you are interested in the effects of peer mentoring on student academic success in a competitive private liberal arts college. A group of 30 students is randomly selected during their freshman orientation. These students are assigned to a team of seniors who have been trained as tutors in various academic subjects, listening skills, and teambuilding skills. The 30 selected students meet in small group sessions with their peer tutors once each week during their entire freshman year, are encouraged to work with their small group for study sessions, and are encouraged to schedule private sessions with their peer mentors whenever they desire. You identify an additional 30 students at orientation as a control group. The control group members receive no formal peer mentoring. You determine that there are no significant differences between the high school grades and SAT scores of the two groups. At the end of four years, you compare the two groups on academic retention and academic performance. You code mentoring as 1 = present and 0 = absent to identify the two groups. Because GPAs differ by academic major, you generate a binary code for grades. If the student's cumulative GPA is at the median or higher for his or her academic major, you assign a 1. Students whose grades are below the median for their major receive a zero. If the student is no longer enrolled (i.e., has transferred, dropped out, or flunked out), you code a zero for retention. If he or she is still enrolled, but has not yet graduated after four years, you code a 1. If he or she has graduated, you code a 2. You collect the following (hypothetical) data:

Properly entered in SPSS, the data should look like the following (see Figure 13-1). For your convenience, you may also download a copy of the dataset.

Figure 13-1 Dataset in SPSS (partial data)

Conducting a Goodness-of-Fit Test


To determine whether the three retention outcomes are equally distributed, you can perform a goodness-of-fit test. Because there are three possible outcomes (no longer enrolled, currently enrolled, and graduated) and sixty total students, you would expect each outcome to be observed in 1/3 of the cases if there were no differences in the frequencies of these outcomes. Thus the null hypothesis would be that 20 students would not be enrolled, 20 would be currently enrolled, and 20 would have graduated after four years. To test this hypothesis, you must use the Nonparametric Tests procedure. To conduct the test, select Analyze, Nonparametric Tests, Chi-Square as shown in Figure 13-2.

Figure 13-2 Selecting chi-square test for goodness of fit

In the resulting dialog box, move Retention to the Test Variable List and accept the default for equal expected frequencies. SPSS counts and tabulates the observed frequencies and performs the chi-square test (see Figure 13-3). The degrees of freedom for the goodness-of-fit test are the number of categories minus one. The significant chi-square shows that the frequencies are not equally distributed, χ²(2, N = 60) = 6.10, p = .047.

Figure 13-3 Chi-square test of goodness of fit
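The equivalent syntax is a single NPAR TESTS command; the variable name retention is an assumption based on the coding scheme described above.

* Chi-square goodness-of-fit test with equal expected frequencies for each category.
NPAR TESTS
  /CHISQUARE=retention
  /EXPECTED=EQUAL.

If you wanted to test unequal hypothesized proportions instead, you would replace EQUAL with a list of expected values given in category order.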

Conducting a Chi-Square Test of Independence


If mentoring is not related to retention, you would expect mentored and non-mentored students to have the same outcomes, so that any observed differences in frequencies would be due to chance. That would mean that you would expect half of the students in each outcome group to come from the mentored students, and the other half to come from the non-mentored students. To test the hypothesis that there is an association (or non-independence) between mentoring and retention, you will conduct a chi-square test as part of the cross-tabulation procedure. To conduct the test, select Analyze, Descriptive Statistics, Crosstabs (see Figure 13-4).

Figure 13-4 Preparing for the chi-square test of independence

In the Crosstabs dialog, move one variable to the row field and the other variable to the column field. I typically place the variable with more levels in the row field to keep the output tables narrower (see Figure 13-5), though the results of the test would be identical if you were to reverse the row and column variables.

Figure 13-5 Establishing row and column variables

Clustered bar charts are an excellent way to compare the frequencies visually, so we will select that option (see Figure 13-5). Under the Statistics option, select chi-square and Phi and Cramer's V (measures of effect size for chi-square tests). You can also click on the Cells button to display both observed and expected cell frequencies. The Format menu allows you to specify whether the rows are arranged in ascending (the default) or descending order. Click OK to run the Crosstabs procedure and conduct the chi-square test.

Figure 13-6 Partial output from Crosstabs procedure
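If pasted as syntax, the choices described above would look roughly like this; the variable names mentoring and retention are assumptions and should match the names in your data file.

* Chi-square test of independence with effect sizes, expected counts, and a clustered bar chart.
CROSSTABS
  /TABLES=retention BY mentoring
  /STATISTICS=CHISQ PHI
  /CELLS=COUNT EXPECTED
  /BARCHART.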

For the test of independence, the degrees of freedom are the number of rows minus one multiplied by the number of columns minus one, or in this case 2 x 1 = 2. The Pearson Chi-Square is significant, indicating that mentoring had an effect on retention, χ²(2, N = 60) = 14.58, p < .001. The value of Cramer's V is .493, indicating a large effect size (Gravetter & Wallnau, 2005). The clustered bar chart provides an excellent visual representation of the chi-square test results (see Figure 13-7).

Figure 13-7 Clustered bar chart

Going Further
For additional practice, you can use the Nonparametric Tests and Crosstabs procedures to determine whether grades differ between mentored and non-mentored students and whether there is an association between grades and retention outcomes.

References
Gravetter, F. J., & Wallnau, L. B. (2005). Essentials of statistics for the behavioral sciences (5th ed.). Belmont, CA: Thomson/Wadsworth.

Lesson 14: Analysis of Covariance


Objectives

1. Perform and interpret an analysis of covariance using the General Linear Model.
2. Perform and interpret an analysis of covariance using hierarchical regression.

Analysis of covariance (ANCOVA) is a blending of regression and analysis of variance (Roscoe, 1975). It is possible to perform ANCOVA using the General Linear Model procedure in SPSS. An entirely equivalent analysis is also possible using hierarchical regression, so the choice is left to the user and his or her preferences. We will illustrate both procedures in this tutorial. We will use the simplest of cases: a single covariate, two treatments, and a single variate (dependent variable). ANCOVA is statistically equivalent to matching experimental groups with respect to the variable or variables being controlled (or covaried). As you recall from correlation and regression, if two variables are correlated, one can be used to predict the other. If there is a covariate (X) that correlates with the dependent variable (Y), then dependent variable scores can be predicted in part by the covariate, and if the groups also differ on the covariate, the differences observed between the groups cannot be attributed solely to the experimental treatment(s). ANCOVA provides a mechanism for assessing the differences in dependent variable scores after statistically controlling for the covariate. There are two obvious advantages to this approach: (1) any variable that influences the variation in the dependent variable can be statistically controlled, and (2) this control can reduce the amount of error variance in the analysis.

Example Data
Assume that you are comparing performance in a statistics class taught by two different methods. Students in one class are instructed in the classroom, while students in the second class take their class online. Both classes are taught by the same instructor, and use the same textbook, exams, and assignments. At the beginning of the term all students take a test of quantitative ability (pretest), and at the end, their score on the final exam is recorded (posttest). Because the two classes are intact, it is not possible to achieve experimental control, so this is a quasi-experimental design. Assume that you would like to compare the scores for the two groups on the final score while controlling for initial quantitative ability. The hypothetical data are as follows:

Before the ANCOVA


You may retrieve the SPSS dataset if you like. As a precursor to the ANCOVA, let us perform a between-groups t test to examine overall differences between the two groups on the final exam. This test was the subject of Lesson 4, so the details will not be repeated here. The result of the t test is shown in Figure 14-1. Of course, as you know, if there were multiple groups you would perform an ANOVA rather than a t test. In this case, we conclude that the second method led to improved test scores, but we must rule out the possibility that this difference is attributable to differences in the quantitative ability of the two groups. As you know by now, you could just as easily have compared the means using the Compare Means or One-Way ANOVA procedures, and the square root of the F ratio obtained would be the value of t.

Figure 14-1 t Test Results

As a second precursor to the ANCOVA, let us determine the degree of correlation between quantitative ability and exam scores. As correlation is the subject of Lesson 10, the details are omitted here, and only the results are shown in Figure 14-2.

Figure 14-2 Correlation between pretest and posttest scores
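Both preliminary checks can be run from a syntax window as well. The sketch below assumes the grouping variable is named method (coded 1 and 2 for the two classes) and that the scores are named pretest and posttest; adjust the names and group codes to match the downloaded data file.

* Independent-samples t test comparing final exam scores for the two methods.
T-TEST GROUPS=method(1 2)
  /VARIABLES=posttest.

* Correlation between quantitative ability (pretest) and the final exam (posttest).
CORRELATIONS
  /VARIABLES=pretest posttest
  /PRINT=TWOTAIL NOSIG.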

Knowing that there is a statistically significant correlation between pretest and posttest scores, we would like to exercise statistical control by holding the effects of the pretest scores constant. The resulting ANCOVA will verify whether there are any differences in the posttest scores of the two groups after controlling for differences in ability.

Performing the ANCOVA in GLM


To perform the ANCOVA via the General Linear Model menu, select Analyze, General Linear Model, Univariate (see Figure 14-3).

Figure 14-3 ANCOVA via the GLM procedure

In the resulting dialog box, move Posttest to the Dependent Variable field, Method to the Fixed Factor(s) field, and Pretest to the Covariate(s) field. See Figure 14-4.

Figure 14-4 Univariate dialog box

Under Options you may want to choose descriptive statistics and effect size indexes, as well as plots of estimated marginal means for Method. As there are just two groups, main effect comparisons are not appropriate. Examine Figure 14-5.

Figure 14-5 Univariate options for ANCOVA

Click Continue. If you like, you can click on Plots to add profile plots for the estimated marginal means of the posttest scores of the two groups after adjusting for pretest scores. Click on OK to run the analysis. The results are shown in Figure 14-6, and indicate that after controlling for initial quantitative ability, the posttest scores of the two groups differ significantly, F(1, 27) = 16.64, p < .001, partial eta-squared = .381.

Figure 14-6 ANCOVA results
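The GLM specification above can also be pasted as a UNIANOVA command. The variable names posttest, method, and pretest are assumptions that mirror the dialog box entries; the EMMEANS subcommand produces the adjusted (estimated marginal) means plotted in Figure 14-7.

* One-way ANCOVA: posttest by method, controlling for pretest.
UNIANOVA posttest BY method WITH pretest
  /EMMEANS=TABLES(method) WITH(pretest=MEAN)
  /PRINT=DESCRIPTIVE ETASQ
  /DESIGN=pretest method.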

The profile plot makes it clear that the online class had higher exam scores after controlling for initial quantitative ability (see Figure 14-7).

Figure 14-7 Profile plot

Performing an ANCOVA Using Hierarchical Regression


To perform the same ANCOVA using hierarchical regression, enter the posttest as the criterion. Then enter the covariate (pretest) as one independent variable block and group membership (method) as a second block. Examine the change in R Square as the two models are compared, and the significance of the change. The F value produced by this analysis is identical to that produced via the GLM approach. Select Analyze, Regression, Linear (see Figure 14-8).

Figure 14-8 ANCOVA via hierarchical regression

Now enter Posttest as the Dependent Variable and Pretest as an Independent variable (see Figure 14-9).

Figure 14-9 Linear regression dialog box

Click on the Next button and enter Method as an Independent variable, as shown in Figure 14-10.

Figure 14-10 Entering second block

Click on Statistics, and check the box in front of R squared change (see Figure 14-11).

Figure 14-11 Specify R squared change

Click Continue then OK to run the hierarchical regression. Note in the partial output shown in Figure 14-12 that the value of F for the R Square Change with pretest held constant is identical to that calculated earlier.

Figure 14-12 Hierarchical regression yields results identical to GLM
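The hierarchical model comparison can be written in syntax as two ENTER blocks with the CHANGE statistic requested (variable names assumed as before). The R Square Change line for the second block reproduces the F test for Method after Pretest has been entered.

* Hierarchical regression: covariate entered first, then the grouping variable.
REGRESSION
  /STATISTICS COEFF R ANOVA CHANGE
  /DEPENDENT posttest
  /METHOD=ENTER pretest
  /METHOD=ENTER method.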

References
Roscoe, J. T. (1975). Fundamental research statistics for the behavioral sciences (2nd ed.). New York: Holt, Rinehart and Winston.
