The intent of this exercise is to introduce you to the SPSS environment and the most
common applications of interest in the context of this course. The dataset NPTS1990.sav
provided along with this document should be used for this exercise.
Contents
1. Components of the SPSS environment
2. Reading in Data
3. Exploratory Analyses
   a. Frequency Distributions
   b. Descriptive Statistics
   c. Cross Tabulations
4. Creating Variables
   a. New Variables
   b. Recoding
5. Linear Regression Model
6. Analyses on Subsets of Data
Acknowledgements: This tutorial was prepared by Prof. Siva Srinivasan of the University
of Florida.
1. Components of the SPSS Environment
The SPSS environment comprises three major components: (1) the Data Editor,
(2) the Syntax File, and (3) the Output Viewer.
The Data Editor is the primary window of the SPSS program. The data are displayed in
this window in the format of a typical spreadsheet. There are two “views” of this window:
In the Data View the data values are displayed. Each column typically represents
a variable. Each row of data represents a case (i.e., values of all variables for a
particular household or person for our travel modeling applications).
In the Variable View, details of the variables are listed. Each row represents
details for one variable (hence there are as many rows in the Variable View as
there are columns in Data View). Some of the useful variable attributes include
variable labels (a lengthy meaningful description of the variable), format
(numeric, character, number of decimal places, etc.), and value labels (see Section
4.2 for more on value labels; this is important).
The processing of data in SPSS can be performed using the menu items & dialog boxes
(i.e., the Graphical User Interface or the GUI) or by directly providing the appropriate
commands in a Syntax File. SPSS has its own scripting language and the command
syntaxes are provided in the Help files. Further, it is also possible to generate the syntax
for any analysis using the GUI and “paste” it to the syntax file.
The use of syntax files is highly recommended for the following reasons:
1. It helps you maintain a log of all the processing that you have done on the
data. You can also add comments to the syntax file, so it doubles as
documentation of the data processing.
2. In case you lose your results, you can re-create them by simply running the
syntax file (instead of working through the GUI all over again). So make sure
that you save your syntax file.
3. It makes data processing faster in the long run. For example, if you have just
run a model and now want to run it again after changing a few variables, you
can simply copy-paste the syntax from the first run and change only the
relevant variables instead of re-specifying everything through the GUI.
The results of SPSS analysis are displayed on a separate window called the Output
Viewer. These can be directly saved (as .spo files). Alternatively, specific results from
the output file can be copied to commonly used applications such as MS Word, Excel,
and PowerPoint (simply right-click on the result to copy).
2. Reading in Data
SPSS is capable of handling input data in various formats. In this course, you will be
provided all data in the SPSS format (.sav files).
In the syntax file, add a comment before the command indicating that you are opening the
required file. All lines of code which represent comments should begin with “/* ”. It is
preferable to have a blank line between comments and commands.
Now highlight the command, right click and select RUN CURRENT to run this
command.
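As a rough sketch, the comment and the command for opening the file might look like the following. The folder C:\data is only an assumed location; replace it with wherever you saved NPTS1990.sav. The comment follows the “/* ” convention described above and is separated from the command by a blank line.

```
/* Open the NPTS 1990 data file (the folder below is an assumed location).

GET FILE='C:\data\NPTS1990.sav'.
```

Every SPSS command ends with a period; a missing period is a common reason for a command failing to run.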
The file is opened in the Data Editor window and you should see the following Data and
Variable Views.
Save this syntax file and keep it open throughout this familiarization exercise. As you
do more analysis, you will be pasting all syntax to this file. Save the file
periodically.
The data file comprises a sample of 2000 households drawn from the 1990 US National
Personal Transportation Survey (NPTS). The following variables are included:
3. Exploratory Analysis
Frequencies are a good way to learn about categorical and integer data when the range of
data values is not very large. In this exercise we will generate the frequency distributions
for two variables in the data file.
In the Data Editor window (or in the Syntax File window), click on
ANALYZE->DESCRIPTIVE STATISTICS->FREQUENCIES... A new “Frequencies” dialog box
opens up. Select the two variables of interest (ntrip and numcars) by highlighting each
variable in the list and clicking on the “>” button.
Once the two variables are selected, click on the PASTE button. The syntax for running
the frequency distributions on the two variables is added to the syntax file already open.
Add comments as appropriate (see figure). Highlight the command, right
click, and select RUN CURRENT. The frequencies are displayed in an Output Viewer
window. (Note: You can also click on the OK button without clicking on the PASTE button to
run the frequency analysis, but then the syntax is not saved. Using the syntax file for
data analysis and processing is the recommended practice.)
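The pasted syntax should look roughly like the sketch below. The exact subcommands that SPSS pastes can differ slightly between versions, and the comment text is only an example.

```
/* Frequency distributions of trips and car ownership.

FREQUENCIES VARIABLES=ntrip numcars
  /ORDER=ANALYSIS.
```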
To copy the results to an EXCEL document, simply right click on the result (the
frequency table in this case) and select COPY. Open an EXCEL document and paste.
Why do you think there are so few households that make only one trip during the day?
(Answer: People generally come back home on the same day, making at least 2 trips)
In the Data Editor window (or in the Syntax File window), click on
ANALYZE->DESCRIPTIVE STATISTICS->DESCRIPTIVES... A new “Descriptives” dialog
box opens up. Select the two variables of interest (ntrip and income) by highlighting each
variable in the list and clicking on the “>” button. You can use the OPTIONS
button to specify the statistics of interest. Mean, standard deviation, minimum, and
maximum are provided by default, and these are adequate for our purposes.
Once the two variables are selected, click PASTE. The syntax for generating the
descriptive statistics for the two variables is added to the syntax file already open. Add
comments as appropriate. Highlight the command, right click, and select RUN
CURRENT. The results are displayed on an Output Viewer window.
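With the default statistics, the pasted command should be close to this sketch (the comment text is illustrative):

```
/* Descriptive statistics for trips and income.

DESCRIPTIVES VARIABLES=ntrip income
  /STATISTICS=MEAN STDDEV MIN MAX.
```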
Cross Tabulations are a useful tool to explore internal consistency of data in the file. For
example, if we have data on both total number of people and number of children in the
household, we would expect that number of people >= number of children for each
household. This can be explored by cross tabulating number of people against number of
children.
Alternatively, Cross Tabulations are also useful as a simple bivariate-analysis tool. That
is, we can explore whether there is a systematic relationship between two variables. In
this exercise, we will examine whether the size of a household is related to the
automobile holdings of the household.
In the Data Editor window (or in the Syntax File window), click on
ANALYZE->DESCRIPTIVE STATISTICS->CROSSTABS... A new “Crosstabs” dialog box
opens up. Select the variable numcars for the “Rows” and the variable hhsize for the
“Columns” (again, highlight the variable of interest in the list and click on the
appropriate “>” button).
Once the two variables are selected, click PASTE. The syntax for cross tabulating hhsize
(in columns) against number of cars (in rows) is added to the Syntax File already open.
Add comments as appropriate. Highlight the command, right click, and select RUN
CURRENT. The results are displayed on an Output Viewer window.
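A sketch of the pasted cross tabulation command follows. The row variable comes before BY and the column variable after it; the subcommands shown are the usual defaults and may vary slightly by SPSS version.

```
/* Cross tabulation of number of cars (rows) by household size (columns).

CROSSTABS
  /TABLES=numcars BY hhsize
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT.
```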
The results are interpreted as follows: There are 87 households in the sample with one
person and zero cars, 190 households with 2 persons and one car, and so on.
We see that there are 61 two-person households with three cars and 32 five-person
households with three cars. Does this mean that two-person households are more likely
than five-person households to own three cars?
Running this new syntax gives the following output. In this case, the results are COLUMN
percentages, i.e., 20.3% of one-person households own no cars, 28.9% of two-person
households have one car, and so on.
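One way to obtain column percentages, sketched here under the assumption that the earlier CROSSTABS command is simply copied and edited, is to add the COLUMN keyword to the CELLS subcommand:

```
/* Same cross tabulation, now with column percentages.

CROSSTABS
  /TABLES=numcars BY hhsize
  /FORMAT=AVALUE TABLES
  /CELLS=COUNT COLUMN.
```

Because hhsize is the column variable, each column of percentages sums to 100% within a household-size group, which is what allows groups of different sizes to be compared.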
Now look at the numbers for two-person and five-person households with three cars.
What do you conclude?
What can you conclude about the auto ownership levels of 10 person households?
What broad conclusions would you draw about the “impact” of household size on car
ownership?
Which of the two cross tabulations you have developed is necessary for making these
conclusions?
4. Creating Variables
First we will look at creating new variables (adding columns). We will do this by directly
typing in the command.
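To illustrate the form of such a command, here is a minimal sketch. The variable nocar is hypothetical and is not necessarily the variable created in the original exercise; it is an indicator equal to 1 for households that own no cars and 0 otherwise.

```
/* Create a hypothetical 0/1 indicator for households without a car.

COMPUTE nocar = (numcars = 0).
EXECUTE.
```

COMPUTE only defines the transformation; the EXECUTE command forces SPSS to read the data so that the new column appears in the Data View immediately.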
Note that the above can also be accomplished using the GUI. Click on
TRANSFORM->COMPUTE VARIABLES and provide the necessary inputs in the dialog box that pops
up. Click PASTE to get the above syntax pasted on to the syntax file.
Run the above command. A new data column gets appended to the file (in the Data
View). In the Variable View, an additional row gets added. Since this is an integer
variable, you can set the number of decimal places for this variable to 0 using the
Variable View.
As an example, we will recode the continuous income variable into the following 3
categories (arbitrarily chosen for demonstration purposes): low income (less than 30K),
medium-income (30-50K), and high income (higher than 50 K).
In the Data Editor window (or in the Syntax File window), click on
TRANSFORM->RECODE INTO DIFFERENT VARIABLES. The “Recode into Different Variables”
dialog box opens up. Select income as the variable to be recoded. Enter inccats as the
name of the output variable and provide a label for this variable (income in categories).
Click on CHANGE.
Now click on the OLD AND NEW VALUES button to define the transformation.
Check “Range: Lowest through _______” and enter the value 30000 in the box.
Enter 1 under New Values and click ADD.
Now check “Range ______through _______” and enter the values 30000 and
50000 as the range in the appropriate boxes. Enter 2 under New Values and click
ADD.
Check “Range: _______through Highest” and enter the value 50000 in the box.
Enter 3 under New Values and click ADD.
Click the CONTINUE button.
You get back to the “Recode into Different Variables” window. Click PASTE.
The syntax for the recoding gets pasted on to the syntax file.
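Given the dialog entries above, the pasted syntax should resemble the following sketch (layout may differ by SPSS version):

```
/* Recode continuous income into three categories.

RECODE income (Lowest thru 30000=1) (30000 thru 50000=2) (50000 thru Highest=3) INTO inccats.
VARIABLE LABELS inccats 'income in categories'.
EXECUTE.
```

Note that the ranges overlap at 30000 and 50000. RECODE applies the first specification that matches, so with this ordering an income of exactly 30000 is coded 1 and an income of exactly 50000 is coded 2. If the boundaries should fall the other way, adjust the range limits accordingly.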
Now to provide more meaningful descriptions of the categories (1, 2, and 3) we have
created, enter the following in the Syntax File:
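A sketch of the value-label command is given below. The label wording is an assumption based on the category definitions stated earlier; any descriptive text can be used.

```
/* Attach descriptive labels to the three income categories.

VALUE LABELS inccats
  1 'Low income (less than 30K)'
  2 'Medium income (30-50K)'
  3 'High income (higher than 50K)'.
```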
Highlight the RECODE and VALUE LABELS commands and run them. The new variable
with the appropriate labels is created. Since this is an integer variable, you can set the
number of decimal places for this variable to 0 using the Variable View.
Run a frequency distribution on the newly created variable. You should see the following
distribution:
Run a cross tabulation of the continuous income on the categorical income variable to
see whether the variable has been correctly re-coded.
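Both checks can be run from the syntax file. The sketch below reuses the FREQUENCIES and CROSSTABS commands from Section 3, with income in the rows and inccats in the columns:

```
/* Check the recoded income variable.

FREQUENCIES VARIABLES=inccats
  /ORDER=ANALYSIS.

CROSSTABS
  /TABLES=income BY inccats
  /CELLS=COUNT.
```

In the cross tabulation, every row (income value) should have a nonzero count in exactly one column, and the income values within each column should lie inside that category's range.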
5. Linear Regression Model
NOTE: By default a constant is always added to the regression model. There is no need
to include a column of ones in the data file.
Run the command for regression from the syntax file. The results are displayed on the
Output Viewer.
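For reference, syntax pasted from the Linear Regression dialog generally has the form sketched below. The choice of ntrip as the dependent variable and of hhsize and numcars as the two explanatory variables is an assumption made for illustration; substitute the variables actually specified for the model.

```
/* Linear regression (variable choice here is assumed for illustration).

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT ntrip
  /METHOD=ENTER hhsize numcars.
```

The NOORIGIN subcommand is what keeps the constant in the model, which is why no column of ones is needed in the data file.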
Under the model summary, we have the $R^2$ and the adjusted $R^2$ values. The standard error
of estimate is the standard deviation of the error term (i.e., $s$).
Under the ANOVA, we have the values for SST (total sum of squares), SSE (residual
sum of squares), and SSR (regression sum of squares). The value under the column “df”
for the row Total is N-1, where N = sample size = 2000. The value under the
column “df” for the row Regression is the number of explanatory variables
(K = 2).
Note that (1) $SST = SSE + SSR$, (2) $R^2 = SSR/SST$, and (3) $s^2 = SSE/(N-K-1)$ [N =
sample size = 2000, K = number of explanatory variables = 2]
Under the Coefficients, we have the estimates of the model coefficients/parameters, the
standard errors, and the t statistics. Important Note: Although we call the parameters
“betas” in class, SPSS provides these under the column “B”. Do NOT use the values
provided in the column “Beta” by SPSS. The estimates of the model parameters are
$\beta_0 = 0.232$; $\beta_1 = 2.184$; $\beta_2 = 0.826$. Note also that the t values are
B / Std. Error(B).
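The relationships among the reported quantities can be collected in one place. The adjusted $R^2$ formula is the standard definition and is added here for completeness; the degrees of freedom are worked out for N = 2000 and K = 2.

```latex
\begin{aligned}
SST &= SSE + SSR \\
R^2 &= \frac{SSR}{SST} = 1 - \frac{SSE}{SST} \\
\bar{R}^2 &= 1 - (1 - R^2)\,\frac{N-1}{N-K-1} \\
s^2 &= \frac{SSE}{N-K-1} \\
t_k &= \frac{B_k}{\mathrm{Std.\ Error}(B_k)} \\[4pt]
\text{df}_{\text{Total}} &= N - 1 = 1999, \quad
\text{df}_{\text{Regression}} = K = 2, \quad
\text{df}_{\text{Residual}} = N - K - 1 = 1997
\end{aligned}
```

These identities offer a quick consistency check on the output: the Regression and Residual rows of the ANOVA table should add up to the Total row in both the sum of squares and the df columns.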
6. Analyses on Subsets of Data
This section of the exercise focuses on performing analysis on a subset of the data file
rather than the whole, without having to physically split the file. For example, one might
be interested in estimating different models for different subgroups of the population
(this is called market segmentation). In this exercise, we are going to estimate a model
specifically for the non-low-income households (i.e., income >= 30K).
In the Data Editor window, click on DATA->SELECT CASES… The Select Cases
window opens up. Check “If condition is satisfied” and click on the IF button. A new
window, “Select Cases: If”, opens. Enter the selection criterion (inccats >= 2) and click
CONTINUE.
You will be returned to the previous window. Make sure that the option “Filtered” is
chosen for “Unselected cases are” and click PASTE. Syntax for selecting the
appropriate subset of data for further analysis is generated and pasted on to the syntax
file. Once this syntax is run (don’t run it just yet), all further analysis will be done on the
data subset although the data file continues to physically have all the records.
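The syntax that SPSS pastes for a filtered selection typically looks like the sketch below. The helper variable filter_$ is created by SPSS itself; label text may vary slightly by version.

```
/* Select the medium- and high-income households (filter, do not delete).

USE ALL.
COMPUTE filter_$=(inccats >= 2).
VARIABLE LABELS filter_$ 'inccats >= 2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
```

While the filter is on, cases with filter_$ = 0 remain in the file but are excluded from every analysis that follows.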
Since we want to estimate the same specification as before for the regression model,
simply copy-paste the code for running the regression model.
Once the model is estimated, we want to restore the dataset to its original status.
In the Data Editor window, click on DATA->SELECT CASES… The Select Cases
window opens up. Check “All cases” and click PASTE. Syntax for selecting the entire
data for further analysis is generated and pasted on to the syntax file.
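The pasted syntax for restoring the full dataset should be close to this sketch:

```
/* Restore the full dataset.

FILTER OFF.
USE ALL.
EXECUTE.
```

Keeping this block directly after the subset analysis in the syntax file makes it less likely that the filter is left on by accident.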
Now, highlight the entire command syntax (selecting only the subset, regression model,
and selecting all the data again) and run.
You will see that this model was estimated using only the 1312 households with income
>= 30K. [As already discussed, the value under the column “df” for the row Total would
be N-1, where N = sample size. Further, from the frequency distribution results on the
inccats variable, we know that there are 1312 households in the middle/high income
categories.]