Charles J. Geyer
December 8, 2003
1 Bernoulli Regression
We start by comparing the Bernoulli regression model

Yi ∼ Bernoulli(pi)                                   (1)

with the familiar linear regression model

Yi ∼ Normal(µi, σ²)                                  (2)

and

µ = Xβ                                               (3)
The analogy between (1) and (2) should be clear. Both assume the data are
independent, but not identically distributed. The responses Yi have distribu-
tions in the same family, but not the same parameter values. So all we need
to finish the specification of a regression-like model for Bernoulli is an equation
that takes the place of (3).
1.1 A Dumb Idea (Identity Link)
We could use (3) with the Bernoulli model, although we have to change the
symbol for the parameter from µ to p
p = Xβ.
This means, for example, in the “simple” linear regression model (with one
constant and one non-constant predictor xi )
pi = α + βxi . (4)
The MLE's are found by writing down the log likelihood, differentiating it with respect to the parameters α and β, and setting the derivatives equal to zero; the resulting equations have no closed form solution. Fortunately, even for this dumb idea, R knows how to do the problem.
Example 1.1 (Bernoulli Regression, Identity Link).
We use the data in the file ex12.8.1.dat in this directory, which is read by the following

> X <- read.table("ex12.8.1.dat", header = TRUE)
> attach(X)
and has three variables x, y, and z. For now we will just use the first two.
The response y is Bernoulli. We will do a Bernoulli regression using the
model assumptions described above. The following code does the regression
and prints out a summary.
> out.quasi <- glm(y ~ x, family = quasi(variance = "mu(1-mu)"),
+ start = c(0.5, 0))
> summary(out.quasi, dispersion = 1)
Call:
glm(formula = y ~ x, family = quasi(variance = "mu(1-mu)"), start = c(0.5,
0))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.552 -1.038 -0.678 1.119 1.827
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.354863 0.187324 -1.894 0.0582 .
x 0.016016 0.003589 4.462 8.1e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We have to apologize for the rather esoteric syntax, which results from our
choice of introducing Bernoulli regression via this rather dumb example.
As usual, our main interest is in the table labeled Coefficients:, which says the estimated regression coefficients (the MLE's) are α̂ = −0.3549 and β̂ = 0.0160. This table also gives standard errors, test statistics (“z values”)
and P -values for the two-tailed test of whether the true value of the coefficient
is zero.
The scatter plot with regression line for this regression is somewhat unusual
looking. It is produced by the code
> plot(x, y)
> curve(predict(out.quasi, data.frame(x = x)), add = TRUE)
and is shown in Figure 1. The response values are, of course, being Bernoulli,
either zero or one, which makes the scatter plot almost impossible to interpret
(it is clear that there are more ones for high x values than for low, but it’s
impossible to see much else, much less to visualize the correct regression line).
Figure 1: Scatter plot and regression line for Example 1.1 (Bernoulli regression with an identity link function).

That finishes our discussion of the example. So why is it “dumb”? One reason is that nothing keeps the parameters in the required range. The pi, being probabilities, must be between zero and one. The right hand side of (4), being a linear function, may take any value between −∞ and +∞. For the data set used in the example, it just happened that the MLE's wound up in (0, 1) without constraining them to do so. In general that won't happen. What then? R, being semi-sensible, will just crash (produce error messages rather than estimates).
There are various ad-hoc ways one could think to patch up this problem.
One could, for example, truncate the linear function at zero and one. But that
makes a nondifferentiable log likelihood and ruins the asymptotic theory. The
only simple solution is to realize that linearity is no longer simple, give up linearity, and instead write

η = Xβ                                               (5)

and

pi = h(ηi)                                           (6)
where h is a smooth invertible function that maps R into (0, 1) so the pi are
always in the required range. We now stop for some important terminology.
• The vector η in (5) is called the linear predictor.
• The function h is called the inverse link function and its inverse g = h⁻¹ is called the link function.
The most widely used (though not the only) link function for Bernoulli re-
gression is the logit link defined by
g(p) = logit(p) = log( p / (1 − p) )                                 (7a)

h(η) = g⁻¹(η) = e^η / (e^η + 1) = 1 / (1 + e^(−η))                   (7b)
Equation (7a) defines the so-called logit function, and, of course, equation (7b)
defines the inverse logit function.
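As a quick sanity check, the logit and inverse logit are easy to code directly in R (the names logit and invlogit below are our own choices; R also has these functions built in as qlogis and plogis).

> logit <- function(p) log(p / (1 - p))            # equation (7a)
> invlogit <- function(eta) 1 / (1 + exp(-eta))    # equation (7b)
> all.equal(logit(0.3), qlogis(0.3))               # qlogis is the built-in logit
[1] TRUE
> all.equal(invlogit(1.7), plogis(1.7))            # plogis is the built-in inverse logit
[1] TRUE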
For generality, we will not at first use the explicit form of the link function, writing the log likelihood

l(β) = Σ_{i=1}^{n} [ yi log(pi) + (1 − yi) log(1 − pi) ]
where we are implicitly using (5) and (6) as part of the definition. Then
∂l(β)/∂βj = Σ_{i=1}^{n} [ yi/pi − (1 − yi)/(1 − pi) ] (∂pi/∂ηi)(∂ηi/∂βj)
where the two partial derivatives on the right arise from the chain rule and are
explicitly
∂pi/∂ηi = h′(ηi)

∂ηi/∂βj = xij
where xij denotes the i, j element of the design matrix X (the value of the j-th
predictor for the i-th individual). Putting everything together
∂l(β)/∂βj = Σ_{i=1}^{n} [ yi/pi − (1 − yi)/(1 − pi) ] h′(ηi) xij
These equations also do not have a closed form solution, but are easily solved numerically by R.
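Just to illustrate what “solved numerically” means, here is a minimal sketch that maximizes the Bernoulli log likelihood for the logit link directly with optim, using the data x and y from Example 1.1 (the function name mlogl and the starting value c(0, 0) are arbitrary choices of ours); in practice one simply uses glm as in the next example.

> mlogl <- function(beta) {
+     eta <- beta[1] + beta[2] * x    # linear predictor, equation (5)
+     # minus the log likelihood; algebraically the same as
+     # - sum(y * log(p) + (1 - y) * log(1 - p)) with p = 1 / (1 + exp(-eta)),
+     # but written to avoid taking log(0)
+     sum(log(1 + exp(eta)) - y * eta)
+ }
> nout <- optim(c(0, 0), mlogl)
> nout$par    # should agree (to optimization accuracy) with the glm coefficients below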
Example 1.2 (Bernoulli Regression, Logit Link).
We use the same data in Example 1.1. The R commands for logistic regression
are
> out.logit <- glm(y ~ x, family = binomial)
> summary(out.logit)
Call:
glm(formula = y ~ x, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5237 -1.0192 -0.7082 1.1341 1.7665
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.56633 1.15954 -3.076 0.00210 **
x 0.06607 0.02259 2.925 0.00345 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AIC: 131.33
Note that the syntax is a lot cleaner for this (logit link) than for the “dumb” way
(identity link). The regression function for this “logistic regression” is shown in
Figure 2, which appears later, after we have done another example.
Example 1.3 (Bernoulli Regression, Probit Link).
Another commonly used link function for Bernoulli regression is the probit, the inverse of the standard normal distribution function. Fitting the same data with this link (and saving the result as out.probit, which we use below for plotting) gives

Call:
glm(formula = y ~ x, family = binomial(link = "probit"))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5263 -1.0223 -0.7032 1.1324 1.7760
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.20894 0.68649 -3.218 0.00129 **
x 0.04098 0.01340 3.057 0.00223 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that there is a huge difference in the regression coefficients for our three
examples, but this should be no surprise because the coefficients for the three
regressions are not comparable. Because the regressions involve different link
functions, the meaning of the regression coefficients are not the same. Com-
paring them is like comparing apples and oranges, as the saying goes. Thus
Bernoulli regression in particular and generalized linear models in general give
us yet another reason why regression coefficients are meaningless. Note that
Figure 2 shows that the estimated regression functions E(Y | X) are almost
identical for the logit and probit regressions despite the regression coefficients
being wildly different. Even the linear regression function used in our first ex-
ample is not so different, at least in the middle of the range of the data, from
the other two.
Regression functions (response predictions) have a direct probabilistic
interpretation E(Y | X).
Regression coefficients don’t.
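For instance, the fitted mean values from the logit and probit fits can be compared directly even though their coefficients cannot. A small sketch (the x values 40, 55, and 70 are arbitrary points in the range of the data):

> xnew <- data.frame(x = c(40, 55, 70))
> cbind(logit = predict(out.logit, xnew, type = "response"),
+       probit = predict(out.probit, xnew, type = "response"))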
The regression functions E(Y | X) for all three of our Bernoulli regression examples, including this one, are shown in Figure 2, which was made by the following code
> plot(x, y)
> curve(predict(out.logit, data.frame(x = x), type = "response"),
+ add = TRUE, lty = 1)
> curve(predict(out.probit, data.frame(x = x), type = "response"),
+ add = TRUE, lty = 2)
> curve(predict(out.quasi, data.frame(x = x)), add = TRUE, lty = 3)
The type = "response" argument says we want the predicted mean values h(η) = g⁻¹(η), the default being the linear predictor values η. This argument is not needed in the last case because with an identity link there is no difference.
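As a quick check (a sketch, using the logit fit), response-scale predictions are just linear-predictor-scale predictions run through the inverse link, which for the logit link is R's built-in plogis:

> eta.hat <- predict(out.logit)                      # linear predictor scale (the default)
> mu.hat <- predict(out.logit, type = "response")    # mean value scale
> max(abs(mu.hat - plogis(eta.hat)))                 # should be essentially zero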
Figure 2: Scatter plot and regression functions for Examples 1.1, 1.2, and 1.3. Solid line: regression function for logistic regression (logit link). Dashed line: regression function for probit regression (probit link). Dotted line: regression function for no-name regression (identity link).
2 Generalized Linear Models

A generalized linear model (GLM) assumes the responses Yi are independent with densities of the exponential dispersion family form

f(yi | θi, φ) = exp( [yi θi − b(θi)] / (φ/wi) − c(yi, φ) )            (8)

where θi is called the canonical parameter and the dispersion parameter φ is the same for all Yi. The weight wi is a known positive constant, not a parameter. Also φ > 0 is assumed (φ < 0 would just change the sign of some equations with only trivial effect). The function b is a smooth function but otherwise arbitrary. Given b the function c is determined by the requirement that f integrate to one (like any other probability density).
The log likelihood is thus
l(β) = Σ_{i=1}^{n} [ (yi θi − b(θi)) / (φ/wi) − c(yi, φ) ]            (9)
Before we proceed to the likelihood equations, let us first look at what the
identities derived from differentiating under the integral sign
Eθ{ln′(θ)} = 0                                                        (10)

and

Eθ{ln″(θ)} = −varθ{ln′(θ)}                                            (11)

and their multiparameter analogs

Eθ{∇ln(θ)} = 0                                                        (12)

and

Eθ{∇²ln(θ)} = −varθ{∇ln(θ)}                                           (13)
tell us about this model. Note that these identities are exact, not asymptotic,
and so can be applied to sample size one and to any parameterization. So let
us differentiate one term of (9) with respect to its θ parameter
l(θ, φ) = (yθ − b(θ)) / (φ/w) − c(y, φ)

∂l(θ, φ)/∂θ = (y − b′(θ)) / (φ/w)

∂²l(θ, φ)/∂θ² = −b″(θ) / (φ/w)
Applied to this particular situation, the identities from differentiating under the
integral sign are
Eθ,φ{ ∂l(θ, φ)/∂θ } = 0

varθ,φ{ ∂l(θ, φ)/∂θ } = −Eθ,φ{ ∂²l(θ, φ)/∂θ² }

or

Eθ,φ{ (Y − b′(θ)) / (φ/w) } = 0

varθ,φ{ (Y − b′(θ)) / (φ/w) } = b″(θ) / (φ/w)
From which we obtain

Eθ,φ(Y) = b′(θ)

varθ,φ(Y) = (φ/w) b″(θ)

so b′ maps the canonical parameter θ to the mean value parameter µ = b′(θ), and b″ determines the variance.

As an example (Example 2.1), consider binomial response data

Yi ∼ Binomial(mi, pi)

where mi is the number of Bernoulli variables with predictor vector xi. The
density for Yi is
f(yi | pi) = (mi choose yi) pi^yi (1 − pi)^(mi − yi)

and we try to match this up with the GLM form. So first we write the density as an exponential

f(yi | pi) = exp[ yi log(pi) + (mi − yi) log(1 − pi) + log(mi choose yi) ]

           = exp[ yi log( pi / (1 − pi) ) + mi log(1 − pi) + log(mi choose yi) ]

           = exp[ mi ( ȳi θi − b(θi) ) + log(mi choose yi) ]

where

ȳi = yi / mi
θi = logit(pi)
b(θi) = −log(1 − pi)
So we see that the binomial distribution has the GLM form with canonical parameter θi = logit(pi), and that (with the normalized response ȳi) the weight is wi = mi and the dispersion is φ = 1. Recall that the link function g connects the mean value parameter and the linear predictor,

ηi = g(µi).
If, as in logistic regression, we take the linear predictor to be the canonical parameter, that determines the link function, because ηi = θi implies g⁻¹(θ) = b′(θ). In general, as is the case in probit regression, the link function g and the function b′ that connects the canonical and mean value parameters are unrelated.
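To make this concrete (filling in the algebra ourselves), in the binomial case θ = logit(p) gives 1 − p = 1/(1 + e^θ), so that

b(θ) = −log(1 − p) = log(1 + e^θ)     and     b′(θ) = e^θ / (1 + e^θ),

which is exactly the inverse logit function (7b). Thus taking the linear predictor to be the canonical parameter for the binomial family gives the logit link, which is why the logit is called the canonical link for Bernoulli and binomial regression.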
It is traditional in GLM theory to make primary use of the mean value
parameter and not use the canonical parameter (unless it happens to be the
same as the linear predictor). For that reason we want to write the variance as
a function of µ rather than θ
varθi,φ(Yi) = (φ/wi) V(µi)                                            (15)

where

V(µ) = b″(θ)   when   µ = b′(θ)
This definition of the function V makes sense because the function b0 is an
invertible mapping between mean value and canonical parameters. The function
V is called the variance function even though it is only proportional to the
variance, the complete variance being φV (µ)/w.
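As an aside, R's family objects carry the variance function around as their $variance component, so V(µ) can be evaluated directly; a small sketch:

> binomial()$variance(0.25)    # mu * (1 - mu) = 0.1875
> poisson()$variance(3)        # mu, so 3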
The mean value parameter is related to the linear predictor by µi = h(ηi), where h = g⁻¹ is the inverse link function. And we finally arrive at the likelihood equations expressed in terms of the mean value parameter and the linear predictor
∂l(β)/∂βj = (1/φ) Σ_{i=1}^{n} [ (yi − µi) / V(µi) ] wi h′(ηi) xij
These are the equations the computer sets equal to zero and solves to find
the regression coefficients. Note that the dispersion parameter φ appears only
multiplicatively. So it cancels when the partial derivatives are set equal to
zero. Thus the regression coefficients can be estimated without estimating the
dispersion (just as in linear regression).
Also as in linear regression, the dispersion parameter is not estimated by
maximum likelihood but by the method of moments. By (15)
E[ wi (Yi − µi)² / V(µi) ] = wi var(Yi) / V(µi) = φ
Thus

(1/n) Σ_{i=1}^{n} wi (yi − µ̂i)² / V(µ̂i)

would seem to be an approximately unbiased estimate of φ. Actually it is not, because µ̂ is not µ, and

φ̂ = [1/(n − p)] Σ_{i=1}^{n} wi (yi − µ̂i)² / V(µ̂i)

is used instead, where p is the number of regression coefficients (the same degrees of freedom adjustment as in linear regression).
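In R this is just the Pearson chi-square statistic divided by the residual degrees of freedom. A sketch, assuming a fitted glm object called, say, gout (a name of our choosing):

> # squared Pearson residuals are w_i (y_i - mu.hat_i)^2 / V(mu.hat_i)
> phi.hat <- sum(residuals(gout, type = "pearson")^2) / df.residual(gout)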
Differentiating the log likelihood a second time with respect to the regression coefficients produces two sums, one containing the factors yi − b′(θi) and second derivatives of the θi, the other containing the factors b″(θi) and products of first derivatives of the θi. This is rather a mess, but the expectation of the first sum is zero (the yi − b′(θi) have mean zero, as derived above). Thus the j, k term of the expected Fisher information is, using ∂θi/∂βj = h′(ηi)xij / V(µi) and b″ = V,
−E[ ∂²l(β) / ∂βj ∂βk ] = Σ_{i=1}^{n} [ b″(θi) / (φ/wi) ] (∂θi/∂βj)(∂θi/∂βk)

                       = Σ_{i=1}^{n} [ V(µi) / (φ/wi) ] [ h′(ηi) xij / V(µi) ] [ h′(ηi) xik / V(µi) ]

                       = (1/φ) Σ_{i=1}^{n} [ wi h′(ηi)² / V(µi) ] xij xik
Then, writing D for the diagonal matrix whose i-th diagonal entry is wi h′(ηi)² / [φ V(µi)],

I(β) = X′DX
is the expected Fisher information matrix. From this standard errors for the
parameter estimates, confidence intervals, test statistics, and so forth can be
derived using the usual likelihood theory. Fortunately, we do not have to do all
of this by hand. R knows all the formulas and computes them for us.
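As an illustration (a sketch, not something one needs to do in practice), the standard errors for the logit-link fit out.logit of Example 1.2 can be reproduced from this formula: there φ = 1, wi = 1, and for the logit link h′(η) = µ(1 − µ) = V(µ), so the diagonal entries of D are just µ̂i(1 − µ̂i).

> Xmat <- model.matrix(out.logit)    # design matrix
> mu <- fitted(out.logit)            # fitted mean values mu.hat
> D <- diag(mu * (1 - mu))           # w h'(eta)^2 / (phi V(mu)) = mu (1 - mu) here
> info <- t(Xmat) %*% D %*% Xmat     # estimated expected Fisher information X'DX
> sqrt(diag(solve(info)))            # should match the Std. Error column of summary(out.logit)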
3 Poisson Regression
The Poisson model is also a GLM. We assume responses
Yi ∼ Poisson(µi )
and a connection between the linear predictor and regression coefficients, as always, of the form (5). We only need to identify the link and variance functions to get going. It turns out that the canonical link function is the log function (left as an exercise for the reader). The Poisson distribution has the relation

var(Y) = E(Y) = µ
connecting the mean, variance, and mean value parameter. Thus the variance
function is V (µ) = µ, the dispersion parameter is known (φ = 1), and the weight
is also unity (w = 1).
Example 3.1 (Poisson Regression).
The data set in the file ex12.10.1.dat is read by
> X <- read.table("ex12.10.1.dat", header = TRUE)
> names(X)
[1] "hour" "count"
> attach(X)
It simulates the hourly counts from a not necessarily homogeneous Poisson process. The variables are hour and count, the first counting hours sequentially
throughout a 14-day period (running from 1 to 14 × 24 = 336) and the second
giving the count for that hour.
The idea of the regression is to get a handle on the mean as a function of
time if it is not constant. Many time series have a daily cycle. If we pool the
counts for the same hour of the day over the 14 days of the series, we see a
clear pattern in the histogram. The R hist function won’t do this, but we can
construct the histogram ourselves (Figure 3) using barplot with the commands
> hourofday <- (hour - 1)%%24 + 1
> foo <- split(count, hourofday)
> foo <- sapply(foo, sum)
> barplot(foo, space = 0, xlab = "hour of the day", ylab = "total count",
+ names = as.character(1:24), col = 0)
Figure 3: Histogram of the total count in each hour of the day for the data for Example 3.1.
In contrast, if we pool the counts for each day of the week, the histogram
is fairly even (not shown). Thus it seems to make sense to model the mean
function as being periodic with period 24 hours, and the obvious way to do that
is to use trigonometric functions. Let us do a bunch of fits
> w <- hour/24 * 2 * pi
> out1 <- glm(count ~ I(sin(w)) + I(cos(w)), family = poisson)
> summary(out1)
Call:
glm(formula = count ~ I(sin(w)) + I(cos(w)), family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.78327 -1.18758 -0.05076 0.86991 3.42492
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.73272 0.02310 75.02 < 2e-16 ***
I(sin(w)) -0.10067 0.03237 -3.11 0.00187 **
I(cos(w)) -0.21360 0.03251 -6.57 5.04e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The fit with the frequency 2 terms added, which we will call out2, has the following summary.

Call:
glm(formula = count ~ I(sin(w)) + I(cos(w)) + I(sin(2 * w)) +
    I(cos(2 * w)), family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.20425 -0.74314 -0.09048 0.61291 3.26622
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.65917 0.02494 66.516 < 2e-16 ***
I(sin(w)) -0.13916 0.03128 -4.448 8.66e-06 ***
I(cos(w)) -0.28510 0.03661 -7.787 6.86e-15 ***
I(sin(2 * w)) -0.42974 0.03385 -12.696 < 2e-16 ***
I(cos(2 * w)) -0.30846 0.03346 -9.219 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Adding the frequency 3 terms I(sin(3 * w)) and I(cos(3 * w)) gives a third fit, out3, with the following summary.

Deviance Residuals:
Min 1Q Median 3Q Max
-3.21035 -0.78122 -0.04986 0.48562 3.18471
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.655430 0.025152 65.818 < 2e-16 ***
I(sin(w)) -0.151196 0.032532 -4.648 3.36e-06 ***
I(cos(w)) -0.301336 0.038250 -7.878 3.32e-15 ***
I(sin(2 * w)) -0.439789 0.034464 -12.761 < 2e-16 ***
I(cos(2 * w)) -0.312843 0.033922 -9.222 < 2e-16 ***
I(sin(3 * w)) -0.063440 0.033805 -1.877 0.0606 .
I(cos(3 * w)) 0.004311 0.033632 0.128 0.8980
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The three fits can be compared using the anova function, which looks at the change in deviance (the “analysis of deviance”). The change in deviance is another name for the likelihood ratio test statistic (twice the log likelihood difference between big and small models), which has an asymptotic chi-square distribution by standard likelihood theory.
> anova(out1, out2, out3, test = "Chisq")
The approximate P -value for the likelihood ratio test comparing models 1 and
2 is P ≈ 0, which clearly indicates that model 1 should be rejected. The
approximate P -value for the likelihood ratio test comparing models 2 and 3 is
P = 0.17, which fairly clearly indicates that model 2 should be accepted and
that model 3 is unnecessary. P = 0.17 indicates exceedingly weak evidence
favoring the larger model. Thus we choose model 2.
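Behind the scenes, anova is just computing drops in deviance, which we can do by hand as a sketch (here comparing models 2 and 3):

> dev.drop <- deviance(out2) - deviance(out3)          # likelihood ratio test statistic
> df.drop <- df.residual(out2) - df.residual(out3)     # difference in number of parameters
> pchisq(dev.drop, df.drop, lower.tail = FALSE)        # should reproduce P = 0.17 above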
The following code
> plot(hourofday, count, xlab = "hour of the day")
> curve(predict(out2, data.frame(w = x/24 * 2 * pi), type = "response"),
+ add = TRUE)
draws the scatter plot and estimated regression function for model 2 (Figure 4).
I hope all readers are impressed by how magically statistics works in this
example. A glance at Figure 4 shows
• Poisson regression is obviously doing more or less the right thing,
• there is no way one could put in a sensible regression function without
using theoretical statistics. The situation is just too complicated.
4 Overdispersion
So far we have seen only models with unit dispersion parameter (φ = 1). This
section gives an example with φ ≠ 1 so we can see the point of the dispersion parameter.
The reason φ = 1 for binomial regression is that the mean value parameter
p = µ determines the variance mp(1 − p) = mµ(1 − µ). Thus the variance
function is
V (µ) = µ(1 − µ) (17)
and the weights are wi = mi, the sample size for each binomial variable (this was worked out in detail in Example 2.1).

Figure 4: Scatter plot and regression curve for Example 3.1 (Poisson regression with log link function). The regression function is trigonometric on the scale of the linear predictor with terms up to the frequency 2 per day.
But what if the model is wrong? Here is another model. Suppose
Yi | Wi ∼ Binomial(mi , Wi )
where the Wi are IID random variables with mean µ and variance τ². Then by the usual rules for conditional probability (iterated expectation and iterated variance)

E(Yi) = E{E(Yi | Wi)} = E(mi Wi) = mi µ

and

var(Yi) = E{var(Yi | Wi)} + var{E(Yi | Wi)}
        = E{mi Wi (1 − Wi)} + var(mi Wi)
        = mi [µ(1 − µ) − τ²] + mi² τ²
        = mi µ(1 − µ) + mi (mi − 1) τ².

This is clearly larger than the formula mi µ(1 − µ) one would have for the binomial model (whenever mi > 1 and τ² > 0), so the variance is always larger than one would have under the binomial model.
So we know that if our response variables Yi are the sum of a random mixture
of Bernoullis rather than IID Bernoullis, we will have overdispersion. But how
to model the overdispersion? The GLM model offers a simple solution. Allow
for general φ so we have, defining Ȳi = Yi/mi,

E(Ȳi) = µi

var(Ȳi) = (φ/mi) µi (1 − µi) = (φ/mi) V(µi)
where V is the usual binomial variance function (17).
Example 4.1 (Overdispersed Binomial Regression).
The data set in the file ex12.11.1.dat is read by
> X <- read.table("ex12.11.1.dat", header = TRUE)
> names(X)
> attach(X)
> y <- cbind(succ, fail)
> out.binom <- glm(y ~ x, family = binomial)
> summary(out.binom)
Call:
glm(formula = y ~ x, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.73992 -1.10103 0.02212 1.06517 2.24202
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.92156 0.35321 -5.44 5.32e-08 ***
x 0.07436 0.01229 6.05 1.45e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Refitting the same model with family = quasibinomial, which uses the same logit link but estimates the dispersion parameter φ, gives

Deviance Residuals:
Min 1Q Median 3Q Max
-2.73992 -1.10103 0.02212 1.06517 2.24202
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.92156 0.41713 -4.607 3.03e-05 ***
x 0.07436 0.01451 5.123 5.30e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
AIC: NA
Thus we have fit both the binomial model (logit link and φ = 1) and the “quasi-binomial” model (logit link again, but with φ estimated by the method of moments estimator as explained in the text). Both models have exactly the same maximum likelihood regression coefficients, but because the dispersions differ, the standard errors, z-values, and P-values differ.
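One way to see exactly what changes: the two fits share the same coefficient estimates and the same unscaled covariance matrix, so the quasi-binomial standard errors are just the binomial ones inflated by the square root of the estimated dispersion. A sketch, with out.qbin standing in for whatever name was given to the quasi-binomial fit (our choice of name):

> phi.hat <- summary(out.qbin)$dispersion
> sqrt(phi.hat) * summary(out.binom)$coefficients[, "Std. Error"]    # quasi-binomial Std. Errors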
Your humble author finds this a bit unsatisfactory. If the data are really
overdispersed, then the standard errors and so forth from the latter output are
the right ones to use. But since the dispersion was not estimated by maximum
likelihood, there is no likelihood ratio test for comparing the two models. Nor
could your author find any other test in a brief examination of the literature.
Apparently, if one is worried about overdispersion, one should use the model
that allows for it. And if not, not. But that’s not the way we operate in the
rest of statistics. I suppose I need to find out more about overdispersion (this
was first written three years ago and I still haven’t investigated this further).