The response variable is a binomial count, that is, a count of binary outcomes. If
$X_1, \dots, X_n$ are independent Bernoulli($p$) variables, then
$Y = \sum_{i=1}^{n} X_i \sim \text{Binomial}(n, p)$.
Example 1 Case Study 21.1: Island size and bird extinctions: On each island we count the
number of species that went extinct out of all the species on the island. What is the relationship
between the area of an island and the probability of extinction of birds present on the island?
Example 2 Case study 21.2: Moth coloration and natural selection: At each distance from
Liverpool we count the number of moths from each morph that were taken by predators. What is
the relationship between the distance from Liverpool, where trees are dark from industrial soot,
and the probability of predation on the light and dark morphs of the moth Carbonaria?
Y = the number of successes in m binomial trials. For example, how many species went
extinct in the 10 year period of the study on each island?
$Y_i \sim \text{Binomial}(m_i, \pi_i)$, where $i$ indexes the islands, $m_i$ is the number of species at
risk on island $i$, and $\pi_i$ is the probability of extinction on island $i$.
$X_1, \dots, X_p$: explanatory variables. In the extinction example, X is the area of the island.
Y/m = the binomial proportion. Note that the sample size in the bird extinction study is
the number of islands, not the number of species.
• $\text{logit}(\pi) = \eta = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$
• As before: $\pi = \dfrac{e^{\eta}}{1 + e^{\eta}}$
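The logit and its inverse undo one another, which a few lines of MATLAB can confirm. This is only a sketch: the helper names logitfn and invlogit are hypothetical, and the coefficient values are illustrative rather than estimates from either case study.

```matlab
% Hypothetical helpers (names are not from the notes).
logitfn  = @(p)   log(p ./ (1 - p));            % probability -> log odds
invlogit = @(eta) exp(eta) ./ (1 + exp(eta));   % log odds -> probability

% Illustrative coefficients only; not fitted to any data set.
beta0 = -1.0;
beta1 = 0.5;
X1    = [0; 1; 2; 4];

eta   = beta0 + beta1 * X1;   % linear predictor
probs = invlogit(eta);        % implied probabilities, all strictly in (0, 1)

% Round trip: logitfn(invlogit(eta)) recovers eta up to rounding error.
roundTripError = max(abs(logitfn(probs) - eta));
```

Whatever value the linear predictor takes, the implied probability stays between 0 and 1, which is the reason for modeling on the logit scale.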
Continuous versus Counted Proportions:
Not all proportions are appropriate to model with logistic regression. Continuous proportions,
such as fat calories/total calories, are usually modeled using normal theory. The only proportions
that are appropriate in this context are those that arise as an integer count of a certain outcome
out of the total number of trials or outcomes.
Variance
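$\mu(Y_i \mid X_{1i}, \dots, X_{pi}) = m_i \pi_i$
$SD(Y_i \mid X_{1i}, \dots, X_{pi}) = \sqrt{m_i \pi_i (1 - \pi_i)}$
For the binomial proportion $Y_i / m_i$, the mean is $\pi_i$ and the standard deviation is $\sqrt{\pi_i (1 - \pi_i) / m_i}$.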
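For a count $Y_i \sim \text{Binomial}(m_i, \pi_i)$ the mean is $m_i \pi_i$ and the standard deviation is $\sqrt{m_i \pi_i (1 - \pi_i)}$, so the variance is tied to the mean rather than being a free parameter. The sketch below uses hypothetical values of m and π to show how the spread changes with the number of trials.

```matlab
% Hypothetical numbers of trials and a hypothetical common probability.
m      = [10; 40; 160];
piTrue = 0.25;

meanY  = m * piTrue;                        % m_i * pi_i
sdY    = sqrt(m * piTrue * (1 - piTrue));   % sqrt(m_i * pi_i * (1 - pi_i))
sdProp = sdY ./ m;                          % SD of the proportion Y_i / m_i

% Quadrupling m doubles the SD of the count and halves the SD of the proportion.
ratioCount = sdY(2) / sdY(1);
ratioProp  = sdProp(1) / sdProp(2);
```

Observations with different $m_i$ therefore carry different amounts of information, which is why the residuals in these notes are standardized by the estimated SD.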
Model Assessment
Estimated versus observed: One way to assess the appropriateness of the model and the efficacy
of the estimation routine is to plot the estimated probability, $\hat{\pi}_i$, against the observed
response proportion, $Y_i / m_i$. Additionally, plots of the observed logits versus one or more of the
explanatory variables are useful for visual examination, just as ordinary scatterplots are in
linear regression. See Display 21.2.
Residual analysis: As in the binary response case, we have two widely used residuals for
binomial counts models.
Residuals
There are two standard ways to define a residual in logistic regression.
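1. Pearson residual: $Pres_i = \dfrac{Y_i - m_i \hat{\pi}_i}{\sqrt{m_i \hat{\pi}_i (1 - \hat{\pi}_i)}}$
2. Deviance residual: $Dres_i = \text{sign}(Y_i - m_i \hat{\pi}_i) \sqrt{2 \left\{ Y_i \log\left( \dfrac{Y_i}{m_i \hat{\pi}_i} \right) + (m_i - Y_i) \log\left( \dfrac{m_i - Y_i}{m_i - m_i \hat{\pi}_i} \right) \right\}}$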
The Pearson residual is more easily understood, but the deviance residual directly gives the
contribution of each point to the lack of fit of the model.
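Both residuals can be computed directly from the counts and the fitted probabilities. The sketch below uses made-up counts and made-up fitted probabilities, for illustration only. The helper xlogy is hypothetical and follows the usual convention that a·log(a/b) is 0 when a = 0, which matters when an observation has no successes or no failures.

```matlab
% Made-up counts and fitted probabilities, for illustration only.
Y     = [0; 3; 7; 10];              % observed successes
m     = [10; 10; 10; 10];           % trials
piHat = [0.05; 0.30; 0.60; 0.90];   % fitted probabilities (assumed given)

fitted = m .* piHat;                % fitted counts, m_i * piHat_i

% Pearson residual: (observed - fitted) / estimated SD of the count.
pres = (Y - fitted) ./ sqrt(m .* piHat .* (1 - piHat));

% Hypothetical helper: a .* log(a ./ b), evaluating to 0 when a == 0.
xlogy = @(a, b) a .* log(max(a, realmin) ./ b);

% Deviance residual: signed square root of each term of the deviance.
devContrib = 2 * (xlogy(Y, fitted) + xlogy(m - Y, m - fitted));
dres = sign(Y - fitted) .* sqrt(max(devContrib, 0));   % guard against tiny negatives

D2 = sum(dres .^ 2);   % deviance: sum of squared deviance residuals
X2 = sum(pres .^ 2);   % Pearson chi-square statistic
```

The two residuals always share a sign, and they tend to be close when the fitted counts are not near 0 or $m_i$.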
Since the data are grouped, the residuals in a binomial counts logistic regression (either Pearson
or deviance) are more useful than in the binary response regression.
The residuals should be plotted against the fitted probabilities $\hat{\pi}_i$ and examined for
outliers or remaining patterns.
• The quantity –2 ln(Maximized likelihood) is also called the deviance of a model since
larger values indicate greater deviation from the assumed model. Comparing two nested
models by the difference in deviances is a drop-in-deviance test.
• The difference between the values of –2ln(Maximized likelihood function) for a full and
reduced model has approximately a chi-square distribution if the null hypothesis that the
extra parameters are all 0 is true. The d.f. is the difference in the number of parameters
for the two models.
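A drop-in-deviance test can be sketched with glmfit from the Statistics Toolbox, whose second output is the deviance of the fit. The data below are made up for illustration. The deviance that glmfit reports for a binomial model is measured relative to the saturated model, so it differs from –2 ln(Maximized likelihood) by a constant that cancels when two models fitted to the same data are compared.

```matlab
% Made-up grouped data, for illustration only.
x = [1; 2; 3; 4; 5; 6];          % explanatory variable
m = [20; 20; 20; 20; 20; 20];    % trials
Y = [3; 5; 8; 11; 14; 17];       % successes

% Full model: intercept + slope (glmfit adds the intercept itself).
[bFull, devFull] = glmfit(x, [Y m], 'binomial');

% Reduced model: intercept only, supplied as an explicit column of ones.
[bRed, devRed] = glmfit(ones(size(x)), [Y m], 'binomial', 'constant', 'off');

dropInDev = devRed - devFull;             % drop in deviance
df        = numel(bFull) - numel(bRed);   % number of extra parameters in the full model
pValue    = 1 - chi2cdf(dropInDev, df);   % upper-tail chi-square probability
```

A small p-value is evidence that the extra parameters (here, the slope) are not all 0.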
Model Selection
Both AIC and BIC can be used as model selection criteria. As with linear regression models,
they are only relative measures of fit, not absolute measures of fit.
AIC = Deviance + 2p
where p is the number of parameters in the model. Stepwise model selection methods are
available in SPSS using likelihood ratio tests or Wald’s test. The LR methods are preferred.
Other software programs, like S-Plus, have stepwise procedures using AIC or BIC.
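Comparing candidate models by AIC is simple arithmetic once the deviances are in hand. In the sketch below the deviances, parameter counts, and n are hypothetical numbers chosen for illustration. BIC is computed with its usual definition, Deviance + p·log(n), which is not stated in these notes.

```matlab
% Hypothetical deviances and parameter counts for three candidate models.
dev = [45.2; 12.9; 12.1];
p   = [1; 2; 3];
n   = 18;                    % hypothetical number of binomial observations

aic = dev + 2 * p;           % AIC = Deviance + 2p
bic = dev + log(n) * p;      % BIC = Deviance + p*log(n) (usual definition, assumed)

[~, bestAIC] = min(aic);     % index of the model with the smallest AIC
[~, bestBIC] = min(bic);     % index of the model with the smallest BIC
```

Because both criteria are relative, only differences between models fitted to the same data are meaningful.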
Goodness-of-fit Tests
Since we have multiple counts per cell, there is a goodness-of-fit test similar to the lack-of-fit
test in linear regression. We can compare the model in which the log odds (logit) is linear in the
parameters to the model in which each cell has its own separate probability. So we are comparing
the logistic regression model with p parameters to the model with n different parameters, where p
is the number of regression coefficients in the logistic regression model (including the intercept)
and n is the number of cells (treatment combinations). That is, we are testing the null hypothesis
that the logistic regression model is adequate against the alternative that a separate probability
is needed for each cell.
The test statistic is
$$D^2 = \sum_i D_i^2 = \sum_i 2 \left\{ Y_i \log\left( \frac{Y_i}{m_i \hat{\pi}_i} \right) + (m_i - Y_i) \log\left( \frac{m_i - Y_i}{m_i - m_i \hat{\pi}_i} \right) \right\}$$
Both the denominators, $m_i \hat{\pi}_i$ and $m_i - m_i \hat{\pi}_i$, need to be large for the distribution of the test
statistic to be approximately $\chi^2_{n-p}$. The p-value for the test is then $\Pr\{ \chi^2_{n-p} \ge D^2 \}$.
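The goodness-of-fit test can be sketched as follows, again with made-up data. For a binomial response given as [successes trials], the deviance returned by glmfit is the statistic $D^2$ above. Here p counts every fitted coefficient, including the intercept, so that n − p matches the residual degrees of freedom that glmfit reports in stats.dfe.

```matlab
% Made-up grouped data, for illustration only.
x = [1; 2; 3; 4; 5; 6];            % explanatory variable
m = [40; 40; 40; 40; 40; 40];      % trials
Y = [6; 10; 16; 22; 28; 34];       % successes

[b, D2, stats] = glmfit(x, [Y m], 'binomial');   % D2 is the deviance

n  = numel(Y);                 % number of binomial observations (cells)
p  = numel(b);                 % fitted coefficients, including the intercept
df = n - p;                    % same value as stats.dfe

pValue = 1 - chi2cdf(D2, df);  % Pr{chi-square(df) >= D2}

% Check the size of the fitted counts before trusting the approximation.
piHat = glmval(b, x, 'logit');
smallestFittedCount = min([m .* piHat; m .* (1 - piHat)]);
```

A large p-value gives no evidence of lack of fit. If smallestFittedCount is small, the chi-square approximation should be treated with caution.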
Wald Test and Confidence Intervals for Single Coefficients.
The Wald test works in the same way as in the binary response case. The normal approximation
used by this test is adequate so long as n is moderately large and mπ is greater than 5.
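A Wald test and confidence interval for a single coefficient need only the estimate and its standard error, which glmfit returns in the stats structure (fields se, t, and p). The sketch below uses made-up data for illustration and recomputes the z-statistics by hand so the steps are visible.

```matlab
% Made-up grouped data, for illustration only.
x = [1; 2; 3; 4; 5; 6];          % explanatory variable
m = [20; 20; 20; 20; 20; 20];    % trials
Y = [3; 5; 8; 11; 14; 17];       % successes

[b, dev, stats] = glmfit(x, [Y m], 'binomial');

z     = b ./ stats.se;                  % Wald z-statistics (intercept, slope)
pWald = 2 * (1 - normcdf(abs(z)));      % two-sided p-values

zcrit = norminv(0.975);                 % about 1.96 for a 95% interval
ci    = [b - zcrit * stats.se, b + zcrit * stats.se];   % rows: intercept, slope

% Interval for the multiplicative change in odds per unit increase in x.
oddsRatioCI = exp(ci(2, :));
```

The hand-computed z should agree with stats.t. Exponentiating the interval for the slope puts it on the odds scale, which is often easier to interpret.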
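Below is the MATLAB code for fitting the bird extinction model.
% Columns of case2101: 1 = island area, 2 = species at risk, 3 = species extinct
extinct = case2101(:,3);
atrisk = case2101(:,2);
area = case2101(:,1);
% Binomial logistic regression; the response is given as [successes trials]
[b,dev,stats] = glmfit(area,[extinct atrisk],'binomial');
% Fitted extinction probabilities on a grid of areas (column vector, one row per point)
x = (1:10:180)';
y = glmval(b,x,'logit');
% Observed proportions (x markers) with the fitted curve (red line)
plot(area,extinct./atrisk,'x',x,y,'r-')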