

What Statistics Books Try To Teach You But Don't

Joe King

University of Washington

1 Introduction to Statistics
1.1 Variables
1.1.1 Types of Variables
1.1.2 Sample vs. Population
1.2 Terminology
1.3 Hypothesis Testing
1.3.1 Assumptions
1.3.2 Type I & II Error
1.3.3 What does Rejecting Mean?
1.4 Writing in APA Style
1.5 Final Thoughts

2 Description of A Single Variable
2.1 Where's the Middle?
2.2 Variation
2.3 Skew and Kurtosis
2.4 Testing for Normality
2.5 Data
2.6 Final Thoughts

3 Correlations and Mean Testing
3.1 Covariance
3.2 Pearson's Correlation
3.3 R Squared
3.4 Point Biserial Correlation
3.5 Spurious Relationships
3.6 Final Thoughts

4 Means Testing
4.1 Assumptions
4.2 T-Test
4.2.1 Independent Samples
4.2.2 Dependent Samples
4.2.3 Effect Size
4.3 Analysis of Variance

5 Regression: The Basics
5.1 Foundational Concepts
5.2 Final Thoughts
5.3 Bibliographic Note

6 Linear Regression
6.1 Basics of Linear Regression
6.1.1 Sums of Squares
6.2 Model
6.2.1 Simple Linear Regression
6.2.2 Multiple Linear Regression
6.3 Interpretation of Parameter Estimates
6.3.1 Continuous
Transformation of Continuous Variables
Natural Log of Variables
6.3.2 Categorical
Nominal Variables
Ordinal Variables
6.4 Model Comparisons
6.5 Assumptions
6.6 Diagnostics
6.6.1 Residuals
Normality of Residuals
Tests
Plots
6.7 Final Thoughts

7 Logistic Regression
7.1 The Basics
7.2 Regression Modeling Binomial Outcomes
7.2.1 Estimation
7.2.2 Regression for Binary Outcomes
Logit
Probit
Logit or Probit?
7.2.3 Model Selection
7.3 Further Reading
7.4 Conclusions


Chapter 1
Introduction to Statistics
Statistics is scary to most students, but it does not have to be. The trick is to build up your
knowledge base one step at a time so that you have the building blocks necessary to
understand the more advanced statistics. This mini-book will go from a very simple
understanding of variables and statistics to more complex analyses for describing data. It
will give several formulas to calculate parameters, yet rarely will you have to calculate
these on paper or type the numbers into a spreadsheet equation.
This first chapter looks at some of the basic principles of statistics, the concepts that
will be necessary to understand statistical inference. These may seem simple, and many of
you may already be familiar with some of them, but it is best to start any work of
statistics with the basic principles as a strong foundation.
1.1 Variables
First we start with the basics. What is a variable? Essentially a variable is a construct we
observe. There are two kinds of variables: manifest (or observed) variables and latent
variables. Manifest variables are ones we measure directly. Latent variables are ones we
cannot measure directly; we infer them from manifest variables (socio-economic status is a
classic example). We can model manifest variables as they are, or we can use them to
construct more complex latent variables. For example, we may measure parents' education and
parents' income and combine those into the construct of socio-economic status.

Types of Variables

There are four primary categories of manifest variables: nominal, ordinal, interval, and
ratio. The first two are categorical variables. Nominal variables are strictly categorical
and have no discernible hierarchy or order to them; examples include race, religion, or
state. Ordinal variables are also categorical, but their categories have a natural order.
Likert scales (strongly disagree, disagree, neutral, agree, strongly agree) are one of the
most common examples of an ordinal variable. Other examples include class status
(freshman, sophomore, junior, senior) and level of education obtained (high school,
bachelor's, master's, etc.).
Definition 1.1 (Nominal). A nominal variable is a categorical variable with no natural order
to its categories. Race is a common example.
The continuous variables are interval and ratio. These are not limited to a set number of
categories but can take any value between two values. An example of a continuous variable
is exam scores; your score may take any value from 0% to 100%. An interval scale has no
absolute zero, so while we can compare the distance between two values, we cannot make
judgements about their ratio. Temperature is a good example: the zero points of Celsius
and Fahrenheit are arbitrary and do not mean there is no temperature. We cannot say 30
degrees Fahrenheit is twice as warm as 15 degrees Fahrenheit. A ratio scale is still
continuous but has an absolute zero, so we can make judgements about ratios. I can say a
student who got an 80% on the exam did twice as well as the student who got a 40%.


Sample vs. Population

One of the primary interests in statistics is to generalize from our sample to a population.
A population doesn't always have to be the population of a state or nation, as we usually
think of the word. Let's say, for example, the head of UW Medicine came to me and asked
me to do a workplace climate survey of all the nursing staff at UW Medical Center. While
there are a lot of nurses there, I could conceivably give my survey to each and every one of
them. This would mean I would not have a problem of generalizability, because I would know
the attitudes of my entire population.
Unfortunately statistics is rarely this clean, and you will usually not have access to an
entire population. Therefore I must collect data that is representative of the population I
want to study; this will be a sample. The distinction is important to note because different
notation is used for samples versus populations. For example, $\bar{x}$ is generally a sample
mean while $\mu$ is used as the population mean. Rarely will you be able to know the
population mean, which is why the distinction matters. Many books on statistics put the
notation at the beginning of the book, yet I feel this is not a good idea. I will introduce
notation as it becomes relevant, and specifically discuss it when it's necessary. Do not be
alarmed if you find yourself coming back to chapters to remember notation; it happens to
everyone, and committing this to memory is a truly lifelong affair.
1.2 Terminology
There is also the matter of terminology. It is discussed before the primary methods for
doing statistics because the terminology can get confusing. Unfortunately statistics tends
to change its terminology and have multiple words for the same concept, which differ
between journals, disciplines, and coursework.
One area where this is most true is when talking about types of variables. Above we
classified variables by how they are measured, but how they fit into our research question
is a different matter. Basic statistics books still talk about variables as independent or
dependent variables. Although these terms have fallen out of favor in a lot of disciplines,
especially the methodology literature, they still bear weight, so they will be discussed.
We will talk about which variables are independent and dependent when we get to the models
we run, but in general the dependent variable is the one we are interested in knowing
about. In short, we want to know how our independent variables influence our dependent
variable(s).
While the terms independent and dependent variable are widely used, there are other names
for both. Which one is used depends in most cases on field of study, personal convention,
the venue you are publishing in, etc. The dependent variable is the one with the least
confusion and is generally called the outcome variable. This seems justified given it's the
outcome we are studying. The independent variable is where there is less consistency in
terminology. In many cases it's called the regressor, predictor, or covariate. I prefer the
second term, and don't like the third. The first one seems too tied to regression modelling
and not as general as predictor. Covariate has different meanings with different tests, so
in my opinion it can be confusing. Predictor can also be confusing, because some people may
conflate it with causation, which would be a very wrong assumption to make. I will usually
use the term independent variable or predictor, for lack of better terms and because these
are the more common ones you will see in the literature.
1.3 Hypothesis Testing
The basis from which we start our research is the null hypothesis. This simply says there is
no relationship between the variables we are studying. When we reject the null
hypothesis, we are saying we accept the alternative hypothesis, which says the null
hypothesis is not true and there is a significant relationship between the variable(s) we
are studying.

Assumptions

There are many types of assumptions that we must make in our analysis in order for our
coefficients to be unbiased.

Type I & II Error

So we have a hypothesis associated with a research question. This mini-book will look at
ways to explore hypotheses and how we can either support or not support them. First we
must cover a few basics of hypothesis testing. We need some basis for determining whether
what we are testing is true or not, yet we also don't want to make hasty judgements about
whether our hypothesis is correct. Any such judgement carries the risk of error, and there
are two primary errors in this context. Type I error is where we reject the null
hypothesis when it is correct. Type II error is when we do not reject the null hypothesis
when it is wrong. While we attempt to avoid both types of errors, the latter is more
acceptable than the former. This is because we do not want to make a hasty decision and
discuss an important relationship between variables when none exists. If we say there is no
relationship when in fact there is one, this is a more conservative mistake that hopefully
future research will correct.
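
A small simulation can make the Type I error rate concrete. The sketch below is only an illustration and is not from the original text: it assumes a two-sided z-test with a known standard deviation of 1, a significance level of 0.05, and made-up sample sizes, and the function name `type_i_error_rate` is hypothetical. Both groups are drawn from the same population, so the null hypothesis is true and every rejection is a Type I error.

```python
import random
from statistics import NormalDist, mean


def type_i_error_rate(trials=2000, n=30, alpha=0.05, seed=1):
    """Share of trials that reject a true null hypothesis (two-sided z-test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for the chosen alpha
    rejections = 0
    for _ in range(trials):
        # Both groups come from the same N(0, 1) population: the null is true.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        # Standard error of the difference, with known sd = 1 in both groups.
        se = (1 / n + 1 / n) ** 0.5
        z = (mean(a) - mean(b)) / se
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


rate = type_i_error_rate()
print(f"Rejected a true null hypothesis in {rate:.1%} of trials")
```

Under these assumptions the share of false rejections should land close to the chosen significance level, which is what the significance level is meant to control.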

What does Rejecting Mean?

When we try to reject the null hypothesis, we must first set our significance level, which
is generally 0.05. This is done by convention, and it is currently debated whether it's
still of practical use given computing technology today. When we reject the null hypothesis,
all we are saying is that, if the null hypothesis were true, the chance of finding a result
as large or larger is less than the significance level. This does not mean that your result
has any major practical effect. Rejecting the null hypothesis may be important, but not
rejecting the null hypothesis can be important too. For example, if there was a school
where lower income groups and higher income groups were performing significantly
differently on exams 5 years ago, and I came in and tested again and found no statistically
significant difference, I would find that to be highly important. It would mean there was a
change in the test scores and there is now some relative parity.
The next concern is practical significance. My result may be statistically significant, but
there may not be any real reason to think it's going to make a difference if implemented in
policy or clinical settings. This is where other measures come into play, like effect
sizes, which will be discussed later. One should also note that a large sample size can
make even a very small effect statistically significant, and a small sample size can mask a
real one.
All of these must be considerations. One should not take a black and white approach to
answering research questions. A result is not just significant or not.
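
The point about sample size can be seen with a few lines of arithmetic. This is a rough sketch under stated assumptions, not the author's analysis: it uses a two-sided z-test for a difference between two group means, and the half-point difference, the standard deviation of 10, and the helper name `two_sided_p` are all made up for illustration.

```python
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided p-value of a z-test for a difference between two group means."""
    se = sd * (2 / n_per_group) ** 0.5  # standard error of the difference
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: a half-point difference on a test with sd = 10.
for n in (100, 1_000, 10_000, 100_000):
    p = two_sided_p(mean_diff=0.5, sd=10, n_per_group=n)
    print(f"n per group = {n:>7}: p = {p:.4f}")
```

The difference between the groups never changes; only the sample size does, and with it the p-value. That is why a significant result still needs an effect size next to it.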


1.4 Writing in APA Style

One thing to be cautious about is how to write up your results and present them in a
manner which is both ethical and concise. This includes graphics, tables, and paragraphs.
These should make the main points of what you want to say while not misrepresenting
your results. If you are going to be doing a lot of writing for publication you should pick
up a copy of the APA Manual (American Psychological Association, 2009).
1.5 Final Thoughts
A lot was discussed in this first part. These concepts will be revisited in later sections
as we begin to implement them. Many books and articles have been written which expand on
these concepts further. I ask that you constantly keep an open mind as researchers and
realize statistics can never tell us truth; it can only hint at it, or point us in the right
direction, and the process of scientific inquiry never ends.

Chapter 2
Description of A Single Variable
When we have variables, we want to understand the nature of these variables. Our first
job is to describe our data, before we can start to do any test. There are two things we
want to know about our data: where the center of the mass of the data is, and how far from
that center our data is distributed. The middle is calculated by the measures of central
tendency (discussed momentarily); how far the data lie from the middle tells us how much
variability there is in our data. This is also called uncertainty or the dispersion
parameter. These concepts are more generally known as the location and scale parameters.
Location is the middle of the distribution: where on the real number line does the middle
lie? Scale is how far away from the middle our data go. These concepts are common among all
statistical distributions, although for now our focus is on the normal distribution. This
is also known as the Gaussian distribution and is widely used in statistics for its
satisfying mathematical properties and because it allows us to run many types of analyses.
2.1 Where's the Middle?
The best way to describe data is to use a measure of central tendency, that is, the
middle of a set of values. These measures include the mean, median, and mode.
The equation to find the mean is in 2.1. The equation has some notation which requires
some discussion, as you will see it in a lot of formulas. The $\sum$ is the summation
sign, which tells us to sum everything to its right. The $i = 1$ below the summation sign
simply means start at the first value in the variable, and the $N$ at the top means go all
the way to the end (or the number of responses seen in that variable).


$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{2.1}$$


If we return to our vector $x = [1, 2, 3, 4, 5]$ we get 2.2

$$\bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3 \tag{2.2}$$


Our mean is influenced by all the numbers equally, so our example variable
$y = [1, 1, 2, 3, 4, 5]$ gives a different mean by formula 2.3.
$$\bar{y} = \frac{1 + 1 + 2 + 3 + 4 + 5}{6} = \frac{16}{6} \approx 2.67 \tag{2.3}$$




The addition of the extra one pulled our mean down. As we will see, single values can
change our mean dramatically, especially when the number of values we have is low.
We represent the mean in several ways: the Greek letter $\mu$ represents the population
mean, while the mean of a sample is denoted with a flat bar on top, so we would say
$\bar{x} = 3$. Finally, the mean is also known as the expected value, so we can write it as
$E(x) = 3$.
For categorical data there are two useful measures. The first is the median, which is
simply the middle number of an ordered set (so the categories must at least be ordinal).
For a set of values as in 2.4
$$1,\ 2,\ \underbrace{3}_{\text{Median} = 3},\ 4,\ 5 \tag{2.4}$$

Now if there is an even number of values we take the mean of the two middle values, as in 2.5
$$1,\ 1,\ \underbrace{2,\ 3}_{\text{Median} = 2.5},\ 4,\ 5 \tag{2.5}$$

The mode is simply the most common number in a set, so in the last example 1 is the mode
since it occurs twice while the others occur once. You may get bi-modal data, where there
are two numbers that occur most often, or even more than two.
These last two measures of the middle of a distribution are mostly of interest for
categorical data. The mode is rarely useful in interval or ratio data, although the median
can be of help there. The mean is the most relevant for continuous data and is the one that
will be used a lot in statistics. The mean is more commonly referred to as the average; it
is computed by taking the sum of all of the values and dividing by the number of values.
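
The three measures can be checked quickly in code. This is a minimal sketch using Python's standard `statistics` module (my choice of tool, not one the text prescribes), applied to the two example sets from this section, [1, 2, 3, 4, 5] and [1, 1, 2, 3, 4, 5].

```python
from statistics import mean, median, multimode

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 3, 4, 5]

mean_x = mean(x)        # sum of the values divided by how many there are
mean_y = mean(y)        # the extra 1 pulls the mean down
median_x = median(x)    # middle value of an odd-length set
median_y = median(y)    # mean of the two middle values of an even-length set
modes_y = multimode(y)  # most common value(s); several entries means multi-modal

print(f"mean(x) = {mean_x}, mean(y) = {mean_y:.2f}")
print(f"median(x) = {median_x}, median(y) = {median_y}")
print(f"mode(s) of y = {modes_y}")
```

`multimode` is used rather than `mode` so that bi-modal data would show up as a list with two entries instead of a single value.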
2.2 Variation
We now know how to get the mean, but much of the time we also want to know how much
variation is in our data. When we talk about variation we are talking about why we get
different values in the data set. So going on our previous example of [1,2,3,4,5], we want
to know why we got these values and not all 3s, or 4s. A more practical example is why
one student scores a 40 on an exam, another 80, another 90, another 50, etc. This
measure of variation is called variance. It is also called the dispersion parameter in the
statistics literature, and the word dispersion will be used in discussion of other models.
To get the variance for the normal distribution, first find the difference between each
value and the sample mean. Then square those differences, sum them, and divide by the
number of observations, as seen below in taking the variance of x. (Many textbooks and
software packages divide by N - 1 rather than N when working with a sample.) Taking the
square root of the variance gives the standard deviation. Formula 2.6 shows the equation
for the variance.

$$\mathrm{Var}(x) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N} \tag{2.6}$$


Formula 2.7 below shows how we take the formula above and use our previous variable x to
calculate the variance.

$$\begin{aligned}
\mathrm{Var}(x) &= \frac{(1-3)^2 + (2-3)^2 + (3-3)^2 + (4-3)^2 + (5-3)^2}{5} \\
&= \frac{(-2)^2 + (-1)^2 + 0^2 + 1^2 + 2^2}{5} \\
&= \frac{4 + 1 + 0 + 1 + 4}{5} \\
&= \frac{10}{5} = 2
\end{aligned} \tag{2.7}$$
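
The same calculation can be sketched in code. This is an illustration using Python's standard `statistics` module, which is my choice rather than the text's: `pvariance` and `pstdev` divide by N, matching the worked example, while `variance` divides by N - 1, the convention many software packages use for samples.

```python
from statistics import pstdev, pvariance, variance

x = [1, 2, 3, 4, 5]
x_bar = sum(x) / len(x)

# By hand: squared deviations from the mean, summed, divided by N.
var_by_hand = sum((value - x_bar) ** 2 for value in x) / len(x)

var_n = pvariance(x)         # library version that also divides by N
sd_n = pstdev(x)             # square root of that variance
var_n_minus_1 = variance(x)  # divides by N - 1, so it is a little larger

print(f"by hand: {var_by_hand}, pvariance: {var_n}, sd: {sd_n:.3f}")
print(f"dividing by N - 1 instead: {var_n_minus_1}")
```

With only five values the two conventions differ noticeably; with large samples the difference becomes negligible.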

A plot of the normal distribution with lines marking 1, 2, and 3 standard deviations from
the mean is shown in 2.1.

Figure 2.1: Normal Distribution (bands marked: 1 Standard Deviation, 68.2%; 2 Standard
Deviations, 95.4%; 3 Standard Deviations, 99.7%)

Now is when we start getting into the discussion of distributions. Specifically, here we
will talk about the normal distribution. The standard deviation is one property of the
normal distribution. It is a great way to understand how data is spread out and gives us
an idea of how close to the mean our sample is. The rule for the normal distribution is
that about 68% of the population will be within one standard deviation of the mean, about
95% will be within two standard deviations, and about 99.7% will be within three standard
deviations. This is shown in Figure 1, which has a mean of 20, and a standard deviation of
Another form of variation that is good to see is the interquartile range. This shows the
middle 50% of the data: it goes from the 25th percentile up to the 75th percentile. One
good graphing technique for this is a box and whisker plot. This is shown in 2.2. The line
in the middle is the middle of the distribution (the median). The box is the interquartile
range, and the horizontal lines are two standard deviations out. The dots outside those are
outliers (data points more than two standard deviations from the mean).
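
Both ideas in this section can be reproduced numerically. The sketch below is an illustration under assumptions of my own: it uses Python's standard `statistics` module, a standard normal distribution for the standard deviation rule, and a made-up list of exam scores for the interquartile range. Note that software packages differ slightly in how they interpolate quartiles; `method="inclusive"` is one common choice.

```python
from statistics import NormalDist, quantiles

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution within k standard deviations of the mean.
shares = {}
for k in (1, 2, 3):
    shares[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} SD: {shares[k]:.1%}")

# Interquartile range of some made-up exam scores.
scores = [40, 50, 55, 60, 65, 70, 80, 90, 95]
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
iqr = q3 - q1
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}")
```

The box in a box and whisker plot spans Q1 to Q3, so its height is the interquartile range computed here.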
Figure 2.2: Box and Whisker Plot

2.3 Skew and Kurtosis
Two other concepts which help us evaluate a single normal variable are skew and kurtosis.
These are not talked about as much, but they are still important. Skew is when more of the
sample lies on one side of the mean than on the other. Negative skew is where the peak of
the curve is to the right of the mean (the tail going to the left). Positive skew is where
the peak of the distribution is to the left and the tail is going to the right.
Kurtosis is how flat or peaked a distribution looks. A distribution which has a more peaked
shape is called leptokurtic, and a shape that is flatter is called platykurtic. Although
skewness and kurtosis can make a distribution violate normality, they do not always.
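
Skew and kurtosis can be put into numbers as well. The text gives no formulas for them, so the sketch below is an assumption on my part: it uses the common moment-based definitions (the third and fourth standardized moments, dividing by N), and the function names and data are made up. Statistical packages often apply small-sample corrections, so their output may differ slightly.

```python
def skewness(values):
    """Third standardized moment: negative = tail to the left, positive = tail to the right."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return sum(((v - m) / sd) ** 3 for v in values) / n


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, so a normal distribution scores about 0."""
    n = len(values)
    m = sum(values) / n
    sd = (sum((v - m) ** 2 for v in values) / n) ** 0.5
    return sum(((v - m) / sd) ** 4 for v in values) / n - 3


symmetric = [1, 2, 3, 4, 5]        # evenly spread around its mean
right_tail = [1, 1, 2, 2, 3, 10]   # one large value drags a tail to the right

print(f"skew, symmetric data:  {skewness(symmetric):.3f}")
print(f"skew, right-tail data: {skewness(right_tail):.3f}")
print(f"excess kurtosis, symmetric data: {excess_kurtosis(symmetric):.3f}")
```

Under this convention a positive excess kurtosis corresponds to a leptokurtic shape and a negative one to a platykurtic shape.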
2.4 Testing for Normality
Can we test for normality? Well we can, and should. One way is to use descriptive
statistics and to look at a histogram. Below you can see a histogram of the frequencies of
a normally distributed variable. We can overlay a normal curve on it and see if the data
look normal. This is not a test per se, but we can get a good idea of what our data look
like. This is shown in 2.3.







Figure 2.3: A Histogram of the normal distribution above with the normal curve overlaid
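
The histogram-with-overlay idea can also be expressed as a table of numbers. This is a rough sketch, not a formal test and not the author's procedure: it fits a normal curve using the data's own mean and standard deviation, then compares the share of observations in each histogram bin with the share the curve predicts. The scores, the bin edges, and the function name `observed_vs_expected` are made up for illustration.

```python
from statistics import NormalDist, mean, pstdev


def observed_vs_expected(values, edges):
    """Compare the share of data in each bin with the share under a fitted normal curve."""
    fitted = NormalDist(mean(values), pstdev(values))
    rows = []
    for low, high in zip(edges, edges[1:]):
        observed = sum(low <= v < high for v in values) / len(values)
        expected = fitted.cdf(high) - fitted.cdf(low)
        rows.append((low, high, observed, expected))
    return rows


# Made-up exam scores, binned in steps of 10.
scores = [42, 48, 51, 55, 57, 60, 61, 63, 64, 66, 68, 70, 72, 75, 79, 85]
rows = observed_vs_expected(scores, range(40, 100, 10))
for low, high, observed, expected in rows:
    print(f"{low}-{high}: observed {observed:.2f}, normal curve {expected:.2f}")
```

Large gaps between the two columns are the numeric version of bars that stick out above or fall short of the overlaid curve.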
We could also examine a P-P plot. This is a plot with a line at a 45 degree angle going
from the bottom left to the upper right of the plot. The closer the points are to the line,
the closer to normality the distribution is. This is also the same principle behind a Q-Q plot (Q meaning



2.5 Data
I will try to give examples of data analysis and its interpretation. One good data set is on
cars released in 1993 (Lock, 1993); names of the variables and more info on the data set
can be found in Appendix ??.
2.6 Final Thoughts
The many concepts discussed here are necessary for a basic understanding of statistics.
Do not feel you have to have this entire chapter memorized, though; you may need to come
back to the concepts here from time to time. Do not focus on memorizing formulas either;
focus on what the formulas tell you about the concept. With today's computing power your
concern will be understanding what the output is telling you and how to connect that to
your research question. While it is good to know how numbers are calculated, the point is
to understand how to use them in your test.



Chapter 3
Correlations and Mean Testing
In the first part of this book we just looked at describing variables. Now we look at how
they are related and want to test the strength of those relationships. This is a difficult
task, and it will take time to master not only the concepts but their implementation.
Course homework is actually the easiest way to do statistics: you are given a research
question, told what to run, and asked to report your results. In real analysis you will
have to decide for yourself what test to run that best fits your data and your research
question. While I will provide some equations, it's best to look at them just to see what
they are doing and what they mean; it's less important to memorize them. This first part
will look at basic correlations and testing of means (t-tests and ANOVA).
Much of statistics is correlational research. It is research where we look at how one
variable changes when another changes, yet causal inferences are not assessed. It is very
tempting to use the word cause or to imply some directionality in your research, but you
need to refrain from it unless you have a lot of evidence to justify it, as the ethical
standards for determining causality are high. If you wish to learn more about causality see
(J. Pearl, 2009; Judea Pearl, 2009).
3.1 Covariance
Before discussing correlations we have to discuss the idea of a covariance. One of the most
basic ways to associate variables is by getting a covariance matrix. A matrix is like a
spreadsheet, each cell having a value in it. The diagonal going from upper left to lower
right holds the variance of each variable (as it will be the same variable on the top row
as on the left column). The other values will be the covariances between pairs of
variables. The idea of covariance is similar to variance, except we want to know how one
variable varies with another. So if one changes in one direction, will another variable
change in the same direction? Do note though that we are only talking about continuous
variables here (for the most part interval and ratio scales are treated the same and the
distinction is rarely made in statistical testing, so when I mention continuous it may be
either interval or ratio without compromising my analysis). The formula for covariance is
in 3.1.

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N} \tag{3.1}$$


As one can see, it takes the deviations from the means, multiplies them together, sums
them, and then divides by the sample size. This gives a good measure of the relationship
between the two variables. While this concept is necessary and a bedrock of many
statistical tools, it's not very intuitive. It is not standardized in any way that allows us
to quickly understand the relationship, and this is what leads us into correlations.
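
The covariance calculation can be written out directly. This is a minimal sketch that follows the description above (products of deviations from the means, summed and divided by the sample size); the function name `covariance` and the variables `y` and `z` are made up for illustration.

```python
def covariance(x, y):
    """Mean product of deviations from the two means, dividing by N."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / n


x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]   # made-up variable that rises as x rises
z = [10, 8, 6, 4, 2]   # made-up variable that falls as x rises

cov_xx = covariance(x, x)  # a variable's covariance with itself is its variance
cov_xy = covariance(x, y)  # positive: the two move in the same direction
cov_xz = covariance(x, z)  # negative: the two move in opposite directions

print(cov_xx, cov_xy, cov_xz)
```

The sign is easy to read, but the size depends on the units of both variables, which is the lack of standardization the text points to.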
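3.2 Pearson's Correlation
A correlation is essentially a standardized covariance. We take the covariance and divide it
by the product of the two standard deviations, as in 3.2:

$$r_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}} \tag{3.2}$$
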
If we dissect this formula it's not as scary as it looks. The top of the equation is simply
the covariance. The bottom is the variance of x and the variance of y multiplied by each
other (the divisions by N on the top and bottom cancel, so they are not shown). Taking the
square root simply converts that to standard deviations. This puts the correlation
coefficient into the metric of -1 to 1. A correlation of 0 means no linear association
whatsoever. A correlation of 1 is a perfect correlation. So let's say we are looking at the
association of temperatures between two cities: if their correlation were 1 and city A's
temperature went up by one standard deviation, city B's would also go up by one standard
deviation (remember a correlation is standardized, so it speaks in standard deviations
rather than raw degrees). If the correlation is -1, it's a perfect inverse correlation, so
if the temperature of city A goes up one standard deviation, city B will go DOWN one
standard deviation. In social science the correlations are never this clean, or clear to
understand. Since the metrics can differ between variables, one must be careful about when
you do a correlation and how you interpret it. Also remember a correlation is
non-directional: with a correlation of .5, a rise of one standard deviation in city A goes
with an expected rise of half a standard deviation in city B, and equally a rise of one
standard deviation in city B goes with an expected rise of half a standard deviation in
city A.
Pearson's correlations are reported with an r and then the coefficient, followed by the
significance level. For example, r = 0.5, p < .05 if significant.
3.3 R Squared
When we get a Pearson's correlation coefficient we can take the square of that value, and
that is what's called the proportion of variance explained. So if we get a correlation of .5,
then the square of that is .25, so we can say that 25% of the variation in one variable is
accounted for by the other variable. Of course as the correlation gets stronger, so will the
amount of variance explained.
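
Pearson's r and its square can be computed in a few lines. The sketch below is an illustration with made-up city temperatures and a hypothetical function name `pearson_r`; it follows the usual formula (sum of products of deviations divided by the square root of the product of the two sums of squared deviations).

```python
def pearson_r(x, y):
    """Pearson correlation: standardized covariance, always between -1 and 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_x = sum((a - x_bar) ** 2 for a in x)
    ss_y = sum((b - y_bar) ** 2 for b in y)
    return numerator / (ss_x * ss_y) ** 0.5


city_a = [10, 12, 15, 11, 18, 20]   # made-up temperatures
city_b = [30, 34, 40, 32, 46, 50]   # exactly 2 * city_a + 10
city_c = [20, 18, 15, 19, 12, 10]   # exactly 30 - city_a
city_d = [12, 11, 16, 13, 15, 19]   # related to city_a, but not perfectly

for name, other in (("B", city_b), ("C", city_c), ("D", city_d)):
    r = pearson_r(city_a, other)
    print(f"A vs {name}: r = {r:.3f}, r squared = {r ** 2:.3f}")
```

City B rises two degrees for every degree in city A and still correlates perfectly, which shows that r describes how tightly the points follow a line, not how many raw units one variable moves per unit of the other.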
3.4 Point Biserial Correlation
One special case where a categorical variable can be used with the continuous Pearson's r
is the point-biserial correlation. If you have a binary variable, you can calculate its
correlation with another variable as long as that other variable is continuous. This
is similar to a t-test, which we will examine later. The test looks at whether or not there
is a significant difference between the two groups of the dichotomous variable. When we ask
whether it's significant or not, we want to determine whether or not the difference
is due to random chance. We already know there is going to be random variability in any
sample we take, but we want to know if the difference between the two groups is due to this
randomness or whether there is a genuine difference between the groups.
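
In practice the point-biserial correlation is just Pearson's r with the binary variable coded as 0 and 1. The sketch below illustrates that under assumptions of my own: the group coding, the scores, and the function name `pearson_r` are all made up.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    numerator = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    ss_x = sum((a - x_bar) ** 2 for a in x)
    ss_y = sum((b - y_bar) ** 2 for b in y)
    return numerator / (ss_x * ss_y) ** 0.5


# Binary group membership coded 0/1, with a continuous score for each person.
group = [0, 0, 0, 0, 1, 1, 1, 1]
score = [50, 55, 60, 65, 70, 75, 80, 85]

r_pb = pearson_r(group, score)
print(f"point-biserial r = {r_pb:.3f}")
```

A positive value here means the group coded 1 scores higher on average; swapping the 0 and 1 labels flips the sign but not the size.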
3.5 Spurious Relationships
So let's say we get a Pearson's r = .5, so what now? Can we say there is a direct
relationship between the variables? No, because we don't know if the relationship is direct
or not. There are many examples of spurious relationships. For example, if I look at the
rate of illness students report to the health center at their university and the relative
timing of exams, I would most likely find a good (probably moderate) correlation. Now
before any student starts using this statement as a reason to cancel tests, there is no
reason to believe your exams are causing you to get sick! Well what is it then? Something
we DIDN'T measure: stress! Stress weakens the immune system, and stress is higher during
periods of examinations, so you are more likely to get ill. If we just looked at
correlations we would only be looking at the surface, so take the results but use them with
caution, as they may not be telling the whole story.
3.6 Final Thoughts
This may seem like a short chapter given the heavy use of correlations, but much of the
basics of this chapter will be used in future statistical analysis. One of the primary
points to take from this is that a correlation is not in any way measuring causality, and
this point cannot be stressed enough. Correlations are a good way of looking at
associations, but that's all. They are still a good way to help us explore data and work
towards more advanced statistical models which can help us support or not support our
hypotheses. While correlations can be used, use them with caution.



Chapter 4
Means Testing
This chapter goes a bit more into exploring the differences between groups. So if we have a
nominal or ordinal variable, and we want to see if its categories are statistically
different on a continuous variable, there are several tests we can do. We already looked at
the point-biserial correlation, which is one test. This chapter examines the t-test, which
is a test that gives a bit more detail, and Analysis of Variance (ANOVA), which is for when
the number of groups is greater than 2 (the letter denoting the number of groups is
generally k, as n denotes sample size, so ANOVA is for $k > 2$, or $k \geq 3$). Here we
will want to know whether the difference in the means is statistically significant.
4.1 Assumptions
The first assumption we will make is that the continuous variables we are measuring are
normally distributed, and we learned to test that earlier. Another assumption we must
make is called homogeneity of variance. This means the variance is the same for both
groups (it doesn't have to be exactly the same but similar; again it will be somewhat
different due to randomness, but the question is whether the variances are different enough
to be statistically different). If this assumption is untenable we will have to correct the
degrees of freedom, which will influence whether our t-statistic is significant or not.
This can be shown in the two figures below. 4.1 shows a difference in the means (means of
10 and 20) but with the same variance of 4.

Figure 4.1: Same Variance (plot label: Mean Difference)

4.2 has the same means, but one variance is 4 and the other is 16 (a standard deviation of 4).
4.2 T-Test
The t-test is similar to the point-biserial correlation in that we want to know whether two
groups are statistically different.

Figure 4.2: Different Variances (plot label: Mean Difference)

So we will look at the first equation, in which the numerator is the difference between the
means. The denominator is the standard error of that difference, built from each group's
variance and sample size. The variance of the sample is denoted $s^2$, and $n$ is the
sample size for that group. This is shown in 4.1
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \tag{4.1}$$

The degrees of freedom are given by 4.2.

$$df = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}} \tag{4.2}$$

The above equations assume unequal sample sizes and variances. The equations simplify
if you have the same variance or the same sample size in each group, although this generally
occurs only in experimental settings where sample size and other parameters can be more
strictly controlled.
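To see equations 4.1 and 4.2 at work, here is a minimal R sketch. The two groups are simulated (the names g1 and g2 and their means and standard deviations are made up for illustration), so the numbers are not from any real study. It computes the t statistic and the corrected degrees of freedom by hand and compares them with what t.test() reports when equal variances are not assumed.

```r
# Unequal-variance (Welch) t-test by hand, following equations 4.1 and 4.2.
# The data are simulated for illustration only.
set.seed(1)
g1 <- rnorm(40, mean = 52, sd = 10)  # hypothetical group 1
g2 <- rnorm(55, mean = 50, sd = 14)  # hypothetical group 2, larger variance

v1 <- var(g1) / length(g1)           # s1^2 / n1
v2 <- var(g2) / length(g2)           # s2^2 / n2

# Equation 4.1: difference in means over its standard error
t_stat <- (mean(g1) - mean(g2)) / sqrt(v1 + v2)

# Equation 4.2: corrected degrees of freedom
df <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

# t.test() does not assume equal variances unless var.equal = TRUE
welch <- t.test(g1, g2)
c(by_hand_t = t_stat, t_test_t = unname(welch$statistic))
c(by_hand_df = df, t_test_df = unname(welch$parameter))
```

The by-hand values should agree with the ones t.test() prints, and the degrees of freedom will usually not be a whole number.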
In the end we want to see if there is a statistical difference between groups. Data from
the National Educational Longitudinal Study of 1988 (baseline year) show how this works.
If we look at the difference in science scores by gender, a t-test finds a significant
mean difference. The means for each gender are in Table 4.1.

Mean      SD
52.1055   10.42897
51.1838   10.03476

Table 4.1: Means and Standard Deviations of Male and Female Test Scores
Our analysis shows t(10963) = 4.712, p < .05. However, the test of whether the variances
are the same is significant, F = 13.2, p < .05, so we have to use the results with equal
variances not assumed. This changes our results to t(10687.3) = 4.701, p < .05. The main
difference is that our degrees of freedom dropped, and our t-statistic dropped slightly
with them.
This time it didn't matter: our sample size was so large that both values were significant,
but in some tests this may not be the case. If the test assuming equal variances rejects
the null hypothesis but the test assuming unequal variances does not, even if Levene's test
is not significant, you should be cautious about how you write it up.

Independent Samples

The above example was an independent samples t-test. This means the participants are
independent of each other and so their responses will be too.

Dependent Samples

This is a slightly different version of the t-test where you still have two means but the
samples are not independent of each other. A classic example of this is a pre-test,
post-test design. Another is longitudinal data, where a measure is collected one year and
the same test is given again at a later date.
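A pre-test, post-test design can be sketched in R as follows. The scores are simulated and the variable names pre and post are my own, so treat this as an illustration rather than a real analysis. It also shows that the dependent samples t-test is the same as a one-sample t-test on the differences.

```r
# Dependent (paired) samples t-test on simulated pre/post scores.
set.seed(8)
pre  <- rnorm(25, mean = 50, sd = 10)          # hypothetical pre-test scores
post <- pre + rnorm(25, mean = 3, sd = 5)      # same people, measured again

paired <- t.test(post, pre, paired = TRUE)     # dependent samples t-test
one_sample <- t.test(post - pre)               # t-test on the differences

paired
c(paired_t = unname(paired$statistic), difference_t = unname(one_sample$statistic))
```

Because each person serves as their own comparison, the degrees of freedom are the number of pairs minus one.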

Effect Size

The effect size r is used in this part. The equation for this is in 4.3:

\[ r = \sqrt{\frac{t^2}{t^2 + df}} \tag{4.3} \]
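As a quick check of the effect size formula, we can plug in the t statistic and degrees of freedom reported above for the science scores (t = 4.701, df = 10687.3). The function name effect_size_r is my own and is not part of any package.

```r
# Effect size r from a t statistic and its degrees of freedom.
effect_size_r <- function(t, df) sqrt(t^2 / (t^2 + df))

# Values reported above for the test with equal variances not assumed
r <- effect_size_r(t = 4.701, df = 10687.3)
round(r, 3)
```

The result is about 0.045, so even though the difference is statistically significant, the effect itself is small. With a sample this large, very small differences can reach significance.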


4.3 Analysis of Variance

Analysis of Variance (ANOVA) is used when you have more than two groups.
Here we will look at what happens when we have race and standardized test scores. The
problem we will encounter is determining which groups are significantly different. ANOVA
adds some steps to the analysis. The equations are quite complex, so we will just go
through the analysis steps. First you see whether any of the means are statistically
different. This is called an omnibus test and follows the F distribution (the t
distribution resembles the normal but has fatter tails, which allows for more outliers,
while the F distribution takes only non-negative values and is skewed to the right; this is
of little consequence to the applied analysis). We get an F statistic for both Levene's
test and the omnibus test.
In this analysis we get five group means. These means are in Table 4.2:
Race Group                 Mean    SD
Asian, Pacific Islander    56.83   10.69
Hispanic                   46.72   8.53
Black, Not Hispanic        45.44   8.29
White, Not Hispanic        52.91   10.03
American Indian, Alaskan   45.91   8.13
Table 4.2: Means and Standard Deviations of Race Groups Test Scores
Table 4.3 shows the mean differences. After we reject the null hypothesis of the omnibus
test, we need to see which groups differ significantly. We do this with post-hoc tests.
For simplicity I have put the results in a matrix where the numbers inside are the
differences between the groups. Those with (*) beside them are statistically significant.
This is not how SPSS presents it, because it reports the comparisons in rows, but the
matrix is easily made.
There are many post-hoc tests one can do. The ones done below are Tukey and
Games-Howell, and both reject the same mean difference groups. There are a lot more
post-hoc tests, but these two do different things. Tukey adjusts for different sample
sizes; Games-Howell corrects for heterogeneity of variance. If you do a few types of
post-hoc tests and the result is the same, this gives credence to your hypothesis. If not,
you should go back to see whether there is a real difference or re-examine your assumptions.
Race Groups   Asian-PI   Hispanic   Black      White
Hispanic
Black         11.3907*   1.2815*
White         3.9193*    -6.1899*   -7.4714*
AI-Alaskan    10.9178*              -0.4729    6.9985*
Note: PI-Pacific Islander; AI-American Indian
Table 4.3: Mean Differences Among Race Groups
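The omnibus test and a post-hoc test can be sketched in R like this. The four groups and their scores are simulated (the labels A to D are placeholders, not the race groups above). Only Tukey's test is shown because it ships with base R; Games-Howell needs an add-on package, so I leave it out here.

```r
# One-way ANOVA followed by Tukey post-hoc comparisons on simulated data.
set.seed(2)
scores <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), each = 30)),  # hypothetical groups
  score = c(rnorm(30, 57, 10), rnorm(30, 47, 9),
            rnorm(30, 45, 8),  rnorm(30, 53, 10))
)

fit <- aov(score ~ group, data = scores)
summary(fit)            # omnibus F test: is any group mean different?

posthoc <- TukeyHSD(fit)
posthoc$group           # pairwise mean differences with adjusted p-values
```

Each row of the post-hoc output is one pair of groups, which is the row layout mentioned above; rearranging those differences into a matrix gives a table like Table 4.3.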

Chapter 5
Regression: The Basics
Regression techniques make up a major portion of social science statistical inference.
Regression models are also called linear models (this will be generalized later, but for
now we will stick with linear models) as we try to fit a line to our data. These methods
allow us to create models to predict certain variables of interest. This section will be
quite deep, since regression requires a lot of concepts to consider, but as in past
portions of this book, we will take it one step at a time, starting out with basic
principles and moving to more advanced ones. The principle of regression is that we have a
set of variables (known as predictors, or independent variables) that we want to use to
predict an outcome (known as the dependent variable, a term that has fallen out of favor in
more advanced statistics classes and works). Then we have a slope for each independent
variable, which tells us the relationship between that predictor and the outcome.
If you find yourself not understanding something, come back to the more fundamental
portions of regression and it will sink in. This method is so diverse that people spend
careers learning and using it, so it's not expected that you pick it up in one quarter; we
are just laying the foundations for its use.
5.1 Foundational Concepts
So how do we try to predict an outcome? It comes back to the concept of variance.
Remember early on in this book we looked at variance as simply variation in a variable.
There are different values for different cases (i.e. different scores on a test for
different students). Regression allows us to use a set of predictors to explain the
variation in our outcome.
Now we will look at the equations themselves and the notation that we will use. The basic
equation of a regression model (or linear model) is 6.11.

\[ y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \epsilon \tag{6.11} \]

This basic equation may look scary but it is not. There are some basic parts to the
equation which will be relevant to the future understanding of these models. So let us go
left to right. The y is our outcome variable, the variable whose behavior we want to
predict. The β₀ is the intercept of the model (where the regression line crosses the y-axis
on a coordinate plane). The βⱼxⱼ term is actually two components together. The xⱼ are the
predictor variables, and the βⱼ are the slopes for each predictor. Each slope tells us the
relationship between that predictor and the outcome variable. The summation sign is there,
yet unlike other times it has been used, at the top is the letter p instead of n. This is
because p stands for the number of predictors, and we are not summing over the number of
cases. The ε is the error term, which takes into account the variability in the model the
predictors don't explain.



5.2 Final Thoughts

This brief chapter introduces regression as a concept, or more generally linear modeling. I
don't say linear regression (which is the next chapter) as that is just one form of
regression. Many more types of regression will be covered in future chapters. There are
many books on regression, and at the end of each chapter I will note very good ones. One
extraordinary one is Gelman and Hill (2007), which I refer to a lot in creating this
chapter.
5.3 Bibliographic Note
Many books have been written on regression. I have used many as inspiration and
references for this work, although much of the information is freely available online. On
top of Gelman and Hill (2007) for doing regression, there are Everitt, Hothorn, and Group
(2010), Chatterjee and Hadi (2006), the free book Faraway (2002), and other excellent
books that are available for purchase (Faraway, 2004; Faraway, 2005). More theory-based
books are Venables and Ripley (2002), Andersen and Skovgaard (2010), Bingham and
Fry (2010), Rencher and Schaalje (2008), and Sheather (2009). As you can tell, most of
these books use R, which is my preferred statistical package. Some books are focused on
SPSS and do a good job at that, one notable one being Field (2009); more advanced but still
very good are Tabachnick and Fidell (2006) and Stevens (2009). Stevens (2009) would not
make a good textbook but is an excellent reference, including SPSS and SAS instructions and
syntax for almost all multivariate applications in the social sciences, and is a necessary
reference for any social scientist.

Chapter 6
Linear Regression
Let's focus for a while on one type of regression, linear regression. This requires us to
have an outcome variable that is continuous and normally distributed. When we have a
continuous, normally distributed outcome, we can use least squares to calculate the
parameter estimates. Other forms of regression use maximum likelihood, which will be
discussed in later chapters, although for a normally distributed outcome the least squares
estimates are also the maximum likelihood estimates.
6.1 Basics of Linear Regression
The first regression technique we will learn, and the most common one used, is where our
outcome is continuous in nature (interval or ratio, it does not matter). Linear
regression uses an analytic technique called least squares. We will see how this works
graphically and then how the equations give us the numbers for our analysis.
What linear regression does is look at the plot of x and y and try to fit a straight line
that is closest to all of these points. Figure 6.1 shows how this is done. I randomly
drew values for both x and y, and the line is the regression line that is the best fit for
the data. As the plot shows, the line doesn't fit perfectly; it's just the best fitting
line. The differences between the actual data and the line are termed residuals, as they
are what is not being captured in the model. The better the line fits and the less residual
there is, the better the predictor will predict the outcome.

Figure 6.1: Simple Regression Plot




Sums of Squares

When discussing the sums of squares we get two equations. One is the sums of squares
for the model (or regression), SSR, in 6.1. This is the difference between our predicted
values and the mean. This is how well our model is fitting. We want this number to be as
high as possible.

\[ SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \tag{6.1} \]

The second is the sums of squares residual (or error), SSE. This is the difference between
the predicted and actual values of the outcome. This we want to be as low as possible, and
it is shown in 6.2.

\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \tag{6.2} \]

The total sums of squares can be found by summing the SSR and SSE or by 6.3.

\[ SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{6.3} \]

Table 6.1 shows how this can be arranged. We commonly report sums of squares and
degrees of freedom along with the F statistic; the mean squares are less important but will
be shown for the purposes of the examples in this book.

Source             Sums of Squares   DF          Mean Square             F Ratio
Regression         SSR               p           MSR = SSR/p             F = MSR/MSE
Residual (Error)   SSE               n - p - 1   MSE = SSE/(n - p - 1)
Total              SST               n - 1

Table 6.1: ANOVA Table
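The pieces of the ANOVA table can be computed by hand in R. This is a sketch on simulated data with one predictor (so p = 1); the variable names and the true slope and intercept are made up for illustration.

```r
# Sums of squares and the F ratio by hand for a simple regression.
set.seed(3)
x <- rnorm(50)
y <- 2 + 1.5 * x + rnorm(50)          # hypothetical data

fit   <- lm(y ~ x)
y_hat <- fitted(fit)                  # predicted values

ssr <- sum((y_hat - mean(y))^2)       # regression (model) sums of squares
sse <- sum((y - y_hat)^2)             # residual (error) sums of squares
sst <- sum((y - mean(y))^2)           # total sums of squares

n <- length(y)
p <- 1                                # number of predictors
f_ratio <- (ssr / p) / (sse / (n - p - 1))

c(SSR = ssr, SSE = sse, SST = sst, F = f_ratio)
anova(fit)                            # R's own ANOVA table for comparison
```

SSR and SSE should add up to SST, and the F ratio should match the one R reports.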

6.2 Model
First let's look at the simplest model. If we had one predictor it would be a simple linear
regression, 6.4. As shown, β₀ is the intercept parameter for the model, also called the
y-intercept; it is where on the coordinate plane the regression line crosses the y-axis
when x = 0. The β₁ is the parameter estimate for the predictor beside it, the x₁. This
shows the magnitude and direction of the relationship to the outcome variable. Finally ε is
the residual; this is how much the data deviate from the regression line. This is also
called the error term; it's the difference between the predicted values of the outcome and
the actual values.

\[ y = \beta_0 + \beta_1 x_1 + \epsilon \tag{6.4} \]

With more than one predictor we have multiple linear regression, which looks like 6.5. Note
the subscript p stands for the number of predictors, so there will be a β and an x for each
independent variable.

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon \tag{6.5} \]


Simple Linear Regression

If we have the raw data, we can find the estimates by hand. While in the era of very high
speed computers it is rare that you will have to compute these statistics manually, we
should still look at the equations to see how we derive the slopes. Equation 6.6 shows how
to calculate the beta coefficient for a simple linear regression. We square values so we
get an approximation of the distance from the best fitting line. If we just added the
deviations up, some would be below the line and some above, giving us negative and
positive values respectively, so they would add to zero (as is one of the assumptions of
the error term). Squaring removes this issue.

\[ \beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \tag{6.6} \]

Equation 6.7 shows how the intercept parameter is calculated in a simple linear regression.
This is where the regression line crosses the y-axis when x = 0.

\[ \beta_0 = \bar{y} - \beta_1 \bar{x} \tag{6.7} \]

Finally we come to our residuals. When we plug values of x into the equation, we get
the fitted values, also known as predicted values, since they are predicted by the
regression equation. They are signified by ŷ. When we subtract the predicted value from the
actual outcome value, we get the residual. This shows how far our actual values fall from
the line, and it gives us an idea of which values are furthest from the regression line.

\[ \epsilon = y - \hat{y} \]
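Here is a small R sketch of the hand calculation, using simulated data (the variable names and the numbers used to generate the data are made up). It computes the slope, the intercept, the fitted values, and the residuals directly, then compares the estimates with lm().

```r
# Slope, intercept, fitted values and residuals by hand.
set.seed(4)
x <- runif(30, 15, 45)
y <- 40 - 1 * x + rnorm(30, sd = 5)   # hypothetical data

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

fitted_vals <- b0 + b1 * x            # predicted values
resid_vals  <- y - fitted_vals        # actual minus predicted

c(intercept = b0, slope = b1)
coef(lm(y ~ x))                       # least squares estimates from R
sum(resid_vals)                       # adds to zero, up to rounding error
```

The last line shows why we square the deviations: left unsquared, the residuals cancel each other out.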


We can also find how much of the variability in our outcome is being explained by our
predictors. When we run this model we will get a Pearson's correlation coefficient (r). We
can still square this number (as we did in correlation) and get the amount of variance
explained. This can be computed in several ways; see 6.10.

\[ r^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{6.10} \]


We do need to adjust our r squared value to account for the complexity of the model.
Whenever we add a predictor, the variance explained never goes down. The question is
whether the predictor is truly explaining variance for theoretical reasons or is just
picking up random variation. The adjusted r squared should be comparable to the
non-adjusted value; if they are substantially different, you should look at your model more
closely. The adjusted r squared can be particularly sensitive to sample size, so smaller
samples will show larger differences between the two values. It's also best to report both
if they differ by a non-trivial amount.

\[ \text{Adjusted } r^2 = 1 - (1 - r^2)\,\frac{n - 1}{n - p - 1} = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)} \]
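We can check the adjustment with the values R reports for the two cars models in this chapter: 93 cars, with r² = 0.3535 for one predictor and r² = 0.4081 for two. The function name adjusted_r2 is my own.

```r
# Adjusted r squared from r squared, sample size n and number of predictors p.
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj1 <- adjusted_r2(r2 = 0.3535, n = 93, p = 1)  # one predictor
adj2 <- adjusted_r2(r2 = 0.4081, n = 93, p = 2)  # two predictors
round(c(adj1, adj2), 4)
```

These come out to about 0.3464 and 0.3949, matching the adjusted values in the R output. The penalty grows as predictors are added relative to the sample size.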


We can look at an example with data. Let's look at our cars example and see if we can
predict the price of a vehicle based on its miles per gallon (MPG) of fuel used while
driving in the city.
> mod1 <- lm(Price ~ MPG.city); summary(mod1)

Call:
lm(formula = Price ~ MPG.city)

Residuals:
    Min      1Q  Median      3Q     Max
-10.437  -4.871  -2.152   1.961  38.951

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  42.3661     3.3399  12.685  < 2e-16 ***
MPG.city     -1.0219     0.1449  -7.054 3.31e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.809 on 91 degrees of freedom
Multiple R-squared:  0.3535,    Adjusted R-squared:  0.3464
F-statistic: 49.76 on 1 and 91 DF,  p-value: 3.308e-10
We find that it is a significant predictor of price. Our first test, the F test, is similar
to ANOVA. Here we reject the null hypothesis, F(1, 91) = 49.76, p < .001. We then look
at the significance of our individual predictor. It is significant; here we report two
statistics, the parameter estimate (β) and the t-test associated with it. Miles per
gallon in the city is significant with β = -1.0219, t(91) = -7.054, p < .001. The first
interesting thing is that there is an inverse relationship: as one variable increases, the
other decreases. Here we can say that for every one mile per gallon increase in the city,
there's a drop in price of about $1,000.¹ We can also look at the r² value to see how well
the model is fitting. The r² = 0.3535 and the adjusted r² = 0.3464. While the adjusted
value is slightly lower it's not a major issue, so we can trust this value.

¹ I say $1,000 and not one dollar as this is the unit of measurement. Be sure when
interpreting data that you use the unit of measurement unless the data are transformed
(which will be discussed later).

Multiple Linear Regression

Multiple linear regression is similar to simple regression except we place more than one
predictor in the equation. This is how most models in social science are run, since we
expect more than one variable to be related to our outcome.

\[ y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \epsilon \]

Let's go back to the data and add to our model above not only miles per gallon in the city
but also fuel tank capacity.

> mod3 <- lm(Price ~ MPG.city + Fuel.tank.capacity); summary(mod3)

Call:
lm(formula = Price ~ MPG.city + Fuel.tank.capacity)

Residuals:
    Min      1Q
-18.526  -4.055

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                              0.868  0.38763
MPG.city            -0.4608     0.2395  -1.924  0.05747 .
Fuel.tank.capacity   1.1825              2.881  0.00495 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.514 on 90 degrees of freedom
Multiple R-squared:  0.4081,    Adjusted R-squared:  0.395
F-statistic: 31.03 on 2 and 90 DF,  p-value: 5.635e-11
We reject the null hypothesis with F(2, 90) = 31.03, p < .001. We have r² = .408
and adjusted r² = 0.395. So this model is fitting well and we can explain around 40% of the
variance with these two predictors. Interestingly, miles per gallon fails to remain
significant in the model, β = -0.4608, t(90) = -1.924, p = 0.057. This is one of those
times where significance is close, and most people who hold rigidly to the alpha of .05
would say this isn't important. I don't hold such views. While this predictor seems less
important than in the last model, it's still worth mentioning as a possible predictor, but
in the presence of fuel tank capacity it has less predictive power.
Fuel tank capacity is strongly related to price, β = 1.1825, t(90) = 2.881, p < .01. The
relationship here is positive, so the more fuel tank capacity, the higher the price. We
could speculate that larger vehicles, with larger capacity, will be more expensive. We
have seen consistently that miles per gallon in the city is inversely related to price, and
this may also have to do with size. Larger vehicles may be less fuel efficient but more
expensive; smaller cars may be more fuel efficient and yet cheaper. I am not an expert on
vehicle pricing, so we will just trust the data from this small sample.
6.3 Interpretation of Parameter Estimates


When a variable is continuous, interpretation is generally straightforward. We
interpret the coefficients to mean that a one unit increase in the predictor will mean a
change in y by the amount of β. Say you have the model y = β₀ + 2x + ε. Here the 2 is the
parameter estimate (β), so we say for each unit increase in x, y will increase by 2 units.
When saying the word unit we are referring to the original measurements of the individual
variables. So if x is income in thousands of dollars, and y is test scores, then each one
thousand dollar increase in income (x) will mean a 2 point greater score on the exam.
This changes if we transform our variables. If we standardize our x values, we would say
for each standard deviation increase in x, y increases by two units. If we standardized y
and x, we would say a one standard deviation increase in x would mean a two standard
deviation increase in y.
If we log our outcome, then we would say that a one thousand dollar increase in income
would mean a 2 log unit increase in y. One thing to note is that when statisticians (or
almost all scientists) say log, they mean the natural log. To transform this back to the
original units, you take the exponential function, so e^y if you had taken the log of the
outcome (reasons for this will be discussed in testing assumptions). If we take the log of
y and x, then we can talk about percents, so a one percent increase in x means roughly a 2
percent increase in y. To get back to original units, exponentiation is still necessary.
If we look at our models above, in the simple linear regression model of just MPG in the
city, for each increase of one MPG in the city, the price goes down by 1.0219 thousand
dollars. This is because the coefficient is negative, so the relationship is inverse. In
our multiple regression model we see that for each gallon increase in fuel tank capacity
the price increases by 1.1825 thousand dollars. This is because the coefficient is positive.

Transformation of Continuous Variables

Sometimes it's necessary to transform our variables. This can be done to make
interpretation easier, to make it more relevant to our research question, or to allow our
model to meet assumptions.

Natural Log of Variables

Here we will explore what happens when we take the log of continuous variables.

> mod2 <- lm(log(Price) ~ MPG.city); summary(mod2)

Call:
lm(formula = log(Price) ~ MPG.city)

Residuals:
     Min       1Q   Median
-0.58391 -0.19678 -0.04151

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.15282    0.13741  30.223  < 2e-16 ***
MPG.city    -0.0576     0.00596  -9.657 1.33e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3213 on 91 degrees of freedom
Multiple R-squared:  0.5061,    Adjusted R-squared:  0.5007
F-statistic: 93.26 on 1 and 91 DF,  p-value: 1.33e-15

Here we have taken the natural logarithm of our outcome variable. This will be shown
later to be advantageous when looking at our assumptions and violations of them. It can
also make model interpretation different and sometimes easier. Instead of the original
units, the outcome is now in log units, so for each one MPG increase the log of price
decreases by 0.0576, which corresponds to a decrease in price of roughly 5.6 percent
(since e^(-0.0576) ≈ 0.944). The coefficient is negative, so the relationship is still
inverse. Notice the percent of variance explained increased dramatically, from 35% to 50%;
this is due to the transformation.
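Turning a coefficient from a logged outcome into a percent change takes one line of R. The slope used here is the MPG.city estimate from the model above; the variable names are my own.

```r
# Percent change in price per one-unit increase in MPG when the outcome is logged.
b <- -0.0576                         # slope for MPG.city in the log(Price) model
pct_change <- 100 * (exp(b) - 1)     # back-transform to a percent change
round(pct_change, 1)
```

This gives about -5.6, so each extra mile per gallon goes with a price that is roughly 5.6 percent lower. For small coefficients, 100 times the coefficient is a close shortcut.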

> mod3<-lm(log(Price)~log(MPG.city));summary(mod3)
Call:
lm(formula = log(Price) ~ log(MPG.city))

Residuals:
     Min       1Q   Median
-0.61991 -0.21337 -0.03462

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)                                 <2e-16 ***
log(MPG.city)   -1.512     0.1421  -10.64   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3052 on 91 degrees of freedom
Multiple R-squared:  0.5543,    Adjusted R-squared:  0.5494
F-statistic: 113.2 on 1 and 91 DF,  p-value: < 2.2e-16



This model looks at what happens when we take the natural log of both the outcome and
the predictor. This is also interpreted differently, as now both are in percents.
So for each one percent increase in MPG in the city, the price decreases by about 1.512
percent. The model estimates have also changed due to our transformation.


When our predictors are categorical, we need to be careful how they are modeled. They
cannot simply be added as numerical values or words. This would cause estimates to be
wrong, as the model will assume the predictor is a continuous variable.

Nominal Variables

For nominal variables we must recode the levels of the factor. One way to do this is
dummy coding. This is where each dummy variable codes one level of the factor as 1, with
the other levels as 0. If we denote the number of levels as k, then the total number of
dummy variables we can model for a factor variable is k - 1. For example, say we are coding
sporting events: football, basketball, soccer, and baseball. The total number of dummy
variables we can have is 3. The coding is easily done automatically in statistics programs,
or you can recode your own variables. Our sports example would look like Table 6.2.
Factor Levels   Dummy 1   Dummy 2   Dummy 3
Football        1         0         0
Basketball      0         1         0
Soccer          0         0         1
Baseball        0         0         0

Table 6.2: How Nominal Variables are Recoded in Regression Models using Dummy Coding
As you can see, the baseball level of our sports variable has all zeros. This is the
baseline group, to which the other groups are compared. This is good when there is a
natural baseline group (like treatment vs. control in medical studies). Ours does not have
a natural baseline, so we can do another type of coding called contrast coding.
Factor Levels   Dummy 1   Dummy 2   Dummy 3

Table 6.3: How Nominal Variables are Recoded in Regression Models using Contrast Coding
As you can see, the codes sum to 0 in each column. Of course in real data sets the groups
may not be the same size; the different levels (or groups) may have different numbers of
cases. So if there were 25 participants who played football and only 23 baseball players,
finding contrasts that sum to zero will be more difficult.
Luckily many software programs allow for this type of coding automatically.
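In R both kinds of coding can be produced automatically. The sketch below uses the four sports with one made-up case per sport. Note that contr.sum is just one common sum-to-zero scheme, so it may not match the exact contrasts in Table 6.3.

```r
# Dummy coding and one form of contrast coding for a four-level factor.
# Baseball is listed first among the levels so it becomes the baseline.
sport <- factor(c("football", "basketball", "soccer", "baseball"),
                levels = c("baseball", "football", "basketball", "soccer"))

# Dummy (treatment) coding: k - 1 = 3 indicator columns plus the intercept
dummy <- model.matrix(~ sport, contrasts.arg = list(sport = "contr.treatment"))
dummy

# Sum-to-zero coding: with equal group sizes each coded column adds to 0
sum_coded <- model.matrix(~ sport, contrasts.arg = list(sport = "contr.sum"))
sum_coded
```

In the dummy-coded matrix the baseball row is zero in every indicator column, which is what makes it the baseline.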



If we only had these variables as our predictors, this would be equivalent to an Analysis
of Variance, and the intercept would be the mean of the baseline group. This is not so if
more predictors are added, as this would then be an Analysis of Covariance.

Ordinal Variables

For ordinal variables, we can generally allow them to be in the model as one variable and
not require dummy coding. This is because our assumption of linearity is relatively
tenable, as we expect the categories to be naturally ordered and increasing. The
interpretation is that as you go up one category, the value of y will change by the
amount of the parameter estimate (your beta coefficient for that variable).
6.4 Model Comparisons
In many cases of research we want to know how much we add to the fit of a
model when we add or take away one or more predictors. When we do model comparisons,
we must ensure the models are nested. This means we add or take away predictor(s), but
otherwise still measure the same things. For example, in the above models we used
MPG and fuel capacity. We will want to know how much adding fuel capacity to the model
adds to model fit, or how adding MPG to a model with fuel capacity already in it
compares. We cannot directly compare a simple regression with only fuel capacity
and another model with only MPG.
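A nested comparison can be sketched in R with anova(). The data are simulated stand-ins (mpg, tank and price are made-up variables, not the cars data), and the smaller model uses a subset of the larger model's predictors, which is what makes them nested.

```r
# Comparing two nested regression models with an F test.
set.seed(5)
n     <- 93
mpg   <- runif(n, 15, 45)                          # hypothetical predictor
tank  <- 25 - 0.3 * mpg + rnorm(n, sd = 2)         # hypothetical predictor
price <- 10 - 0.4 * mpg + 1.2 * tank + rnorm(n, sd = 6)

small <- lm(price ~ mpg)           # smaller model
full  <- lm(price ~ mpg + tank)    # same model plus one predictor

cmp <- anova(small, full)          # does the added predictor improve fit?
cmp
```

The second row of the output shows the drop in residual sums of squares from adding the predictor and the F test of whether that drop is larger than chance would give.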
6.5 Assumptions
The assumptions for regression depend on the nature of the regression being used. For
continuous outcomes, the assumptions are that the errors are homoscedastic and normally
distributed, that the outcome is linearly related to the predictors, and that the samples
are independent of one another. We will look at the assumptions of linear regression and
how to test them. Then we will discuss corrections for violations of them.
6.6 Diagnostics
We need to make sure our model meets our assumptions, and we need to see if we can
correct for times when our assumptions are violated.


So first we need to look at our residuals. Remember residuals are the predicted y values
subtracted from the actual y values. For this exercise, I will use the cars data I used
above, as it is a good data set to discuss regression on. For the purposes of looking at
our assumptions, let us stick with simple regression where we have price of vehicles as our
outcome and miles per gallon in the city as our predictor. Here I will just provide R
commands and code along with discussion of them.

Normality of Residuals

First let's look at our assumption of normality. We assume our errors are normally
distributed with mean 0 and some unknown variance. We can test this via my
preferred test, the Shapiro-Wilk test, which is good for sample sizes from 3 to 5000
(Shapiro and Wilk, 1965).

Tests

Let's look at the above models and see if our normality assumption is
met. First we test mod1, which uses the variables in their original form.

> shapiro.test(residuals (mod1))

Shapiro-Wilk normality test
data: residuals(mod1)
W = 0.8414, p-value = 1.434e-08

As you can see the results aren't pretty. We reject the null hypothesis for the test,
W = 0.8414, p < .05, which means there's enough evidence to say that the sample deviates
from the theoretical normal distribution the test was expecting. In this test the null
hypothesis is that the sample does conform to a normal distribution, so unlike most
testing, we do not want to reject the null.

> shapiro.test(residuals (mod2))

Shapiro-Wilk normality test
data: residuals(mod2)
W = 0.9675, p-value = 0.02022

Fitting a second model with the log of the outcome helps some, but we still can't say our
assumption is tenable, W = 0.9675, p < .05.

> shapiro.test(residuals (mod3))

Shapiro-Wilk normality test
data: residuals(mod3)
W = 0.9779, p-value = 0.1154

This time we cannot reject the null hypothesis, W = 0.9779, p > .05, so taking the log of
both our outcome and predictor allows us to approximate the normal distribution, or at the
very least we can say there isn't enough evidence that our distribution is significantly
different from the theoretical (or expected) normal distribution.
While logging variables has its advantages, logging both as we did in the last example
is not common. The non-normality here can be attributed to our sample size of only 93
participants. Be cautious when doing transformations like this.


Plots

Now let's look at plots. Two plots are important: one is a QQ plot, and
the other is a histogram. A histogram allows us to look at the frequency of values, and the
QQ plot plots our residuals against what we would expect from a theoretical normal
distribution. In the QQ plot the line represents where we want our residuals to be;
residuals on the line match the theoretical normal distribution.
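Plots like these can be drawn with a few lines of R. The sketch below uses simulated data with a right-skewed outcome (the variables and generating values are made up, not the cars data) and studentized residuals, with the test statistic printed alongside for comparison.

```r
# Histogram and QQ plot of studentized residuals, plus the Shapiro-Wilk test.
set.seed(6)
x <- runif(93, 15, 45)
y <- exp(4 - 0.05 * x + rnorm(93, sd = 0.3))   # hypothetical right-skewed outcome

fit <- lm(y ~ x)
res <- rstudent(fit)                           # studentized residuals

shapiro.test(res)                              # formal test of normality

op <- par(mfrow = c(1, 2))                     # two plots side by side
hist(res, main = "Distribution of Residuals", xlab = "Studentized residuals")
qqnorm(res, main = "Normal QQ Plot")
qqline(res)                                    # reference line for a normal distribution
par(op)                                        # restore plotting settings
```

Points that bend away from the reference line at the ends of the QQ plot are the same skew you see as a long tail in the histogram.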
Figure 6.2: Histogram of Studentized Residuals for Model 1 (panels: Distribution of Residuals; Normal QQ Plot, x-axis Theoretical Quantiles)

The first set of plots shows us what we expected from our statistics above. Our residuals
don't conform to a normal distribution: we can see heavy right skew in the residuals, and
the QQ plot is very non-normal at the extremes.



Figure 6.3: Histogram of Studentized Residuals for Model 2 (panels: Distribution of Residuals; Normal QQ Plot, x-axis Theoretical Quantiles)

As we saw in our statistics, taking the log of our outcome made it better, but still not
quite enough to make our assumption of normality tenable. We are still seeing too much
right skew in our distribution.
Figure 6.4: Histogram of Studentized Residuals for Model 3 (panels: Distribution of Residuals; Normal QQ Plot, x-axis Theoretical Quantiles)

This looks much better! Our distribution is looking much more normal. Our QQ plot still
shows some deviation at the top and bottom, but our Shapiro-Wilk test gives us no
evidence against normality, so the assumption is tenable and this is OK.



6.7 Final Thoughts

Linear regression is used very widely in statistics, most notably because of the pleasing
mathematical properties of the normal distribution. Its ease of interpretation and wide
implementation in software packages add to its appeal. One should be cautious about
its use, though, and make sure the outcome is normally distributed.



Chapter 7
Logistic Regression
So now we begin to discuss the idea that our outcome is not continuous. Logistic regression
deals with the idea that our outcome is binary, that is, it can only take on one of two
values (almost universally 0 and 1). This has many applications: graduate or not graduate,
contract an illness or not, get a job or not, etc. This does pose problems for
interpretation at times, because it's not as easy to study.
7.1 The Basics
So we have to model events that take on values of 0 or 1. The problem with linear
regression in this setting is that it requires us to use a straight line. This can't be
done since our values are bounded. This means we must go to a different distribution than
the normal distribution.
7.2 Regression Modeling Binomial Outcomes
Contingency tables are useful when we have one categorical covariate. Contingency tables
are not possible when we have a continuous predictor or multiple predictors. Even if there
is one variable of interest in relationship to the outcome, researchers still try to
control for the effects of other covariates. This leads to the use of a regression model to
test the relationship between a binary outcome and one or several predictors.


The basic regression model taught in introductory statistics classes is linear regression.
This has a continuous outcome, and the model is typically estimated using least squares,
which was discussed in 6.1. With a binomial outcome, we cannot use this estimation
technique. The binomial model will estimate proportions, which are bounded from 0 to 1. A
least squares model may give estimates outside these bounds. Therefore we turn to
maximum likelihood and a class of models known as Generalized Linear Models (GLM)¹.
$$\underbrace{\beta_0 + \beta_p x_p}_{\text{Systematic Component}} \tag{7.1}$$

The random component is the outcome variable; it is called the random component because
we want to know why there is variation in this variable. The systematic component is the
linear combination of our covariates and the parameter estimates. When our outcome is
continuous, we do not have to worry about establishing a linear relationship, as we assume
it exists if the covariates are related to the outcome. When we have a categorical outcome,
we cannot have this linear relationship directly, so GLMs provide a link function, which
transforms the expected outcome so that it can be modeled as a linear function of the
covariates.
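
To make the idea of a link function concrete, here is a minimal sketch in R. It relies only on the family objects that ship with base R; the probabilities in `p` are arbitrary illustrative values, not data from this book.

```r
# Family objects in base R carry the link function and its inverse.
logit_fam  <- binomial(link = "logit")
probit_fam <- binomial(link = "probit")

p <- c(0.05, 0.25, 0.50, 0.75, 0.95)   # illustrative probabilities, bounded in (0, 1)

eta_logit  <- logit_fam$linkfun(p)     # log(p / (1 - p)): unbounded log odds
eta_probit <- probit_fam$linkfun(p)    # qnorm(p): unbounded z scores

# The inverse link maps any value of the linear predictor back into (0, 1).
back_logit  <- logit_fam$linkinv(eta_logit)
back_probit <- probit_fam$linkinv(eta_probit)

print(round(cbind(p, eta_logit, eta_probit), 3))
```

The bounded probabilities become unbounded values on the link scale, which is what lets a linear combination of covariates be used on the right-hand side.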

For SPSS users, do not confuse this with General Linear Model which performs ANOVA, ANCOVA and
Some authors use $\alpha$ to denote the intercept term, although most still use $\beta_0$, which will be used here.





Regression for Binary Outcomes

Two of the most common link functions are the logit and the probit. These allow us to
look at a linear relationship between our outcome and our covariates. In figure 7.1, you
can see there is not a lot of difference between the logit and probit fits; the difference is
in the interpretation of the coefficients (discussed below). The green line shows how a
traditional regression line is not an appropriate fit, because the line goes outside the
range of the data (the blue dots). The logit and probit fits model the probability of a
success, so they stay within that range. Logit and probit models will be very similar in the
substantive conclusions made. While we do not have a true $R^2$ coefficient, there is a
pseudo $R^2$ created by Nagelkerke (1992) which gives a general sense of how much
variation is being explained by the predictors.




Figure 7.1: Logit, Probit and OLS regression lines; data simulated from R
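
The comparison in figure 7.1 can be sketched with simulated data. This is not the author's simulation: the seed, sample size, predictor range and true slope below are all assumptions chosen for illustration.

```r
set.seed(1)                                        # arbitrary seed, for reproducibility
n <- 200
x <- runif(n, -4, 4)                               # hypothetical continuous predictor
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))   # simulated 0/1 outcome

fit_ols    <- lm(y ~ x)                                    # straight-line (OLS) fit
fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

# Compare fitted values over a grid that stays inside the simulated x range.
grid <- data.frame(x = seq(-4, 4, by = 0.5))
fits <- data.frame(
  x      = grid$x,
  ols    = predict(fit_ols, newdata = grid),
  logit  = predict(fit_logit, newdata = grid, type = "response"),
  probit = predict(fit_probit, newdata = grid, type = "response")
)
print(round(fits, 3))
range(fits$ols)   # the straight line is not constrained to [0, 1]
```

In this setup the OLS column runs below 0 and above 1 at the ends of the grid, while the logit and probit columns stay between 0 and 1 and track each other closely.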




The most common model in education is the logit model, also known as logistic regression.
There are two equations we can solve. Equation 7.2 gives the log odds of a positive
response (a success).

$$\text{logit}[\pi(x)] = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_p x_p \tag{7.2}$$
The probability of a positive response is calculated from equation 7.3.
$$\pi(x) = \frac{e^{\beta_0 + \beta_p x_p}}{1 + e^{\beta_0 + \beta_p x_p}} \tag{7.3}$$


Fitted values (either log odds or probabilities) are usually what statistical programs
report, and these use only the covariate values from the sample. A researcher can also
plug in covariate values for hypothetical participants to obtain a probability for those
values. One caution is to ensure the values you plug in are within the range of the data
(i.e. if your sample ages are 18-24, don't solve the equation for a 26 year old), since the
model was fitted with data that did not include that age.
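
The two equations, and the caution about staying within the range of the data, can be sketched as follows. The data are simulated: the variable names `age` and `grad`, the seed and the coefficients used to generate the outcome are all hypothetical.

```r
set.seed(2)                                          # arbitrary seed
age  <- sample(18:24, size = 150, replace = TRUE)    # hypothetical sample ages, 18-24
grad <- rbinom(150, size = 1, prob = plogis(-6 + 0.3 * age))  # simulated 0/1 outcome
fit  <- glm(grad ~ age, family = binomial(link = "logit"))

new_person <- data.frame(age = 21)                   # hypothetical participant

# Check that the requested value lies inside the observed range before predicting.
in_range <- new_person$age >= min(age) & new_person$age <= max(age)

log_odds <- predict(fit, newdata = new_person, type = "link")      # log odds scale
prob     <- predict(fit, newdata = new_person, type = "response")  # probability scale

# The same quantities by hand from the estimated coefficients.
b <- coef(fit)
eta_by_hand  <- b[["(Intercept)"]] + b[["age"]] * 21
prob_by_hand <- exp(eta_by_hand) / (1 + exp(eta_by_hand))

c(log_odds = unname(log_odds), prob = unname(prob), in_range = in_range)
```

`predict()` with `type = "link"` returns the log odds and `type = "response"` returns the probability; the hand calculation gives the same numbers.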


The probit function is similar, except that it assumes an underlying latent normal
distribution; it maps probabilities, which are bounded between 0 and 1, onto z scores, as
shown in 7.4. Agresti (2007, p. 72) uses the example of a probability of 0.05, whose probit
is -1.645, that is, 1.645 standard deviations below the mean.
$$\text{probit}[\pi(x)] = \Phi^{-1}[\pi(x)] = \beta_0 + \beta_p x_p \tag{7.4}$$
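
The probit transformation is available in base R through `qnorm()` (probability to z score) and `pnorm()` (z score to probability). A short sketch, in which the coefficients `b0` and `b1` and the covariate value `x` are made up purely for illustration:

```r
z      <- qnorm(0.05)   # probit of a probability of 0.05 (about -1.645)
p_back <- pnorm(z)      # the standard normal CDF maps the z score back to 0.05

# Hypothetical probit coefficients and covariate value, for illustration only.
b0 <- -1
b1 <- 0.5
x  <- 2
prob <- pnorm(b0 + b1 * x)   # probability implied by the linear predictor

c(z = z, p_back = p_back, prob = prob)
```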


Logit or Probit?

As can be seen in figure 7.1, the model fit for logistic and probit regression is very
similar, and this is usually true. It is also possible to rescale the coefficients to convert
from logit to probit or vice versa. Amemiya (1981) showed that dividing a logit coefficient
by about 1.6 gives the approximate probit coefficient. Andrew Gelman (2006) ran
simulations and found scaling factors between 1.6 and 1.8 to work well, which corresponds
to Agresti (2007), who mentions the scaling being between 1.6 and 1.8.
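
The scaling can be checked informally by fitting both models to the same simulated data and taking the ratio of the slopes. This is only a sketch: the seed, sample size and true coefficients are assumptions, and the exact ratio will vary from data set to data set.

```r
set.seed(3)                                   # arbitrary seed
n <- 5000
x <- rnorm(n)                                 # hypothetical predictor
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.0 * x))   # simulated 0/1 outcome

fit_logit  <- glm(y ~ x, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, family = binomial(link = "probit"))

# Ratio of the logit slope to the probit slope for the same data.
ratio <- coef(fit_logit)[["x"]] / coef(fit_probit)[["x"]]
ratio

# Dividing the logit slope by the ratio recovers the probit slope.
coef(fit_logit)[["x"]] / ratio
```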


This is an example from the car data set we have been using.
> mod1 <- glm(Origin ~ MPG.city, family = binomial(link = logit)); summary(mod1)

Call:
glm(formula = Origin ~ MPG.city, family = binomial(link = logit))

Deviance Residuals:

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.44532    1.01996  -2.397   0.0165 *
MPG.city                                  0.0183 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 128.83  on 92  degrees of freedom
Residual deviance: 122.09  on 91  degrees of freedom
AIC: 126.09

Number of Fisher Scoring iterations: 4

> mod2 <- glm(Origin ~ Fuel.tank.capacity, family = binomial(link = logit)); summary(mod2)

Call:
glm(formula = Origin ~ Fuel.tank.capacity, family = binomial(link = logit))

Deviance Residuals:
-1.3709  -1.1582  -0.9544

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
Fuel.tank.capacity -0.07649    0.06512  -1.175

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 128.83  on 92  degrees of freedom
Residual deviance: 127.42  on 91  degrees of freedom
AIC: 131.42

Number of Fisher Scoring iterations: 4

> mod3 <- glm(Origin ~ MPG.city + Fuel.tank.capacity, family = binomial(link = logit)); summary(mod3)

Call:
glm(formula = Origin ~ MPG.city + Fuel.tank.capacity, family = binomial(link = logit))

Deviance Residuals:
-1.7426  -1.0539  -0.7408

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)                    4.14605  -2.242   0.0249 *
MPG.city                                         0.0110 *
Fuel.tank.capacity  0.23209                      0.0794 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 128.83  on 92  degrees of freedom
Residual deviance: 118.74  on 90  degrees of freedom
AIC: 124.74

Number of Fisher Scoring iterations: 4
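
Logit coefficients are on the log odds scale, so a common way to read them is to exponentiate them into odds ratios. The sketch below uses the Fuel.tank.capacity estimate and standard error reported for the second model above; the Wald-type 95% interval is a standard construction added here for illustration, not something taken from the output.

```r
# Estimate and standard error for Fuel.tank.capacity as reported for mod2
est <- -0.07649
se  <-  0.06512

z_value    <- est / se     # reproduces the reported z value
odds_ratio <- exp(est)     # multiplicative change in the odds per one-unit increase

# Wald-type 95% confidence interval, computed on the log odds scale and then exponentiated
ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

round(c(z = z_value, odds_ratio = odds_ratio, lower = ci[1], upper = ci[2]), 3)
```

An odds ratio below 1 means the odds of the modeled outcome fall as the predictor rises. Here the interval contains 1, which agrees with the small z value reported for this coefficient.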


Model Selection

Researchers tend to fit multiple models to try to find the best fitting model consistent
with their theoretical framework. There are several ways to evaluate models to determine
which fits best. Sequential model building is a technique frequently used to look at the
addition of predictors to a regression model, and the same framework is used with other
regression models as well. In linear regression, nested models are compared with an F test
(since the test statistic follows an F distribution under the null hypothesis); models fit by
maximum likelihood use the likelihood ratio test, whose statistic follows a chi-squared
distribution. Shmueli (2010) examines the differences between building a model to explain
the relationship of predictors to an outcome and building a model to predict an outcome
from future data sources. The article also discusses information criteria such as the AIC
and BIC, which are used to compare model fit.
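
As a worked example, the likelihood ratio test for adding Fuel.tank.capacity to the MPG.city model can be computed directly from the residual deviances reported in the model summaries above. This sketch uses only those reported numbers; the variable names are hypothetical.

```r
# Residual deviances and degrees of freedom as reported in the model summaries
dev_mod1 <- 122.09; df_mod1 <- 91   # Origin ~ MPG.city
dev_mod3 <- 118.74; df_mod3 <- 90   # Origin ~ MPG.city + Fuel.tank.capacity

# Likelihood ratio statistic: the drop in deviance between the nested models
lr_stat <- dev_mod1 - dev_mod3
lr_df   <- df_mod1 - df_mod3
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# For ungrouped binary data, AIC = residual deviance + 2 * (number of parameters)
aic_mod1 <- dev_mod1 + 2 * 2
aic_mod3 <- dev_mod3 + 2 * 3

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value,
  aic_mod1 = aic_mod1, aic_mod3 = aic_mod3)
```

The statistic is 3.35 on 1 degree of freedom, which gives a p-value between 0.05 and 0.10, and the AIC values match those printed in the summaries. With the fitted model objects at hand, `anova(mod1, mod3, test = "Chisq")` performs the same comparison.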
7.3 Further Reading
This chapter borrows heavily from Alan Agresti (2007), who is well known and respected
for his work in categorical data analysis. Two books which cover many statistical models
yet still do a good job with logistic regression are Tabachnick & Fidell (2006) and Stevens
(2009). The first works well as a textbook; Stevens is a dense book, but it has both SPSS
syntax and SAS code and is a must-have reference. Gelman and Hill (2007) is rapidly
becoming a classic book in statistical inference, yet its computation is focused on R, which
hasn't hit mainstream academia much, though they do have some supplemental material
at the end of the book for other programs. For those who have an interest in R, another
great book is Faraway (2005). Andy Field (2009) has a classic book called Discovering
Statistics Using SPSS, which blends SPSS and statistical concepts very nicely and is good
at explaining difficult statistical concepts. For students who wish to explore categorical
data analysis conceptually there are a few good books. I recommend Agresti (2002); this
is a different book from his 2007 book, with a focus on theory yet still a lot of great
examples of application. Long's (1997) book explores maximum likelihood methods
focusing on categorical outcomes, combining the conceptual and mathematical ideas of
maximum likelihood. McCullagh and Nelder (1989) is a classic and a seminal work on
generalized linear models (the citation here is to their well known second edition).
7.4 Conclusions
This chapter looked at logistic regression in an introductory manner. There is more to
analyzing binomial outcomes, and reading some of the works above can deepen that
understanding. This is especially important for researchers whose outcomes will be
binomial. These principles will also act as a starting point for learning about other
categorical outcomes, such as nominal outcomes with more than two categories or ordinal
outcomes (often measured with Likert scales).

Agresti, A. (2002). Categorical Data Analysis. Hoboken, NJ: Wiley-Interscience.
Agresti, A. (2007, March). An Introduction to Categorical Data Analysis. Hoboken, NJ:
Wiley-Blackwell. doi:10.1002/0470114754
Amemiya, T. (1981). Qualitative response models: A survey. Journal of Economic
Literature, 19 (4), 1483–1536. doi:10.2298/EKA0772055N
American Psychological Association. (2009). Publication Manual of the American
Psychological Association, Sixth Edition. American Psychological Association.
Andersen, P. K. & Skovgaard, L. T. (2010). Regression with Linear Predictors. Statistics
for Biology and Health. New York, NY: Springer New York.
Bingham, N. H. & Fry, J. M. (2010). Regression Linear Models in Statistics. Springer
Undergraduate Mathematics Series. London: Springer London.
Chatterjee, S. & Hadi, A. S. (2006). Regression analysis by example (4 ed). Hoboken, NJ:
Everitt, B. S. & Hothorn, T. (2010). A Handbook of Statistical Analyses Using
R, Second Edition. Boca Raton, FL: Chapman and Hall/CRC.
Faraway, J. J. (2002). Practical Regression and ANOVA using R.
Faraway, J. J. (2004). Linear Models with R (Chapman & Hall/CRC Texts in Statistical
Science). Boca Raton, FL: Chapman and Hall/CRC.
Faraway, J. J. (2005). Extending the Linear Model with R: Generalized Linear, Mixed
Effects and Nonparametric Regression Models (Chapman & Hall/CRC Texts in
Statistical Science). Boca Raton, FL: Chapman and Hall/CRC.
Field, A. (2009). Discovering Statistics Using SPSS (Introducing Statistical Methods).
Thousand Oaks, CA: Sage Publications Ltd.
Gelman, A. & Hill, J. (2007). Data Analysis Using Regression and
Multilevel/Hierarchical Models. New York: Cambridge University Press.
Gelman, A. (2006). Take logit coefficients and divide by approximately 1.6 to get
probit coefficients. Retrieved from
http://www.andrewgelman.com/2006/06/take_logit_coef/
Lock, R. (1993). 1993 new car data. Journal of Statistics Education, 1 (1). Retrieved from
Long, J. S. (1997). Regression Models for Categorical and Limited Dependent Variables.
Thousand Oaks, CA: SAGE Publications.
McCullagh, P. & Nelder, J. A. (1989). Generalized Linear Models, Second Edition
(Chapman & Hall/CRC Monographs on Statistics & Applied Probability). Boca
Raton, FL: Chapman and Hall/CRC.



Nagelkerke, N. J. D. (1992). Maximum likelihood estimation of functional relationships.
Springer-Verlag New York.
Pearl, J. (2009). Causal inference in statistics: An overview. Statistics Surveys, 3,
Pearl, J. (2009). Causality: Models, Reasoning and Inference. Cambridge
University Press.
Rencher, A. & Schaalje, B. (2008). Linear Models in Statistics (2nd ed.).
Shapiro, S. S. & Wilk, M. B. (1965, December). An analysis of variance test for normality
(complete samples). Biometrika, 52 (3-4), 591–611. doi:10.1093/biomet/52.3-4.591
Sheather, S. J. (2009). A modern approach to regression with R. New York, NY:
Springer Verlag. Retrieved from
Shmueli, G. (2010, August). To Explain or to Predict? Statistical Science, 25 (3), 289–310.
Stevens, J. P. (2009). Applied Multivariate Statistics for the Social Sciences, Fifth Edition.
New York, NY: Routledge Academic.
Tabachnick, B. G. & Fidell, L. S. (2006). Using Multivariate Statistics (5th Ed.). Upper
Saddle River, NJ: Allyn & Bacon.
Venables, W. N. & Ripley, B. D. (2002). Modern applied statistics with S (4th Ed.).
New York, NY: Springer.
