
Generalized Estimating Equations (GEE)

Søren Højsgaard and Ulrich Halekoh


Unit of Statistics and Decision Analysis
Faculty of Agricultural Sciences, University of Aarhus
April 2, 2007
Contents
1 Introduction
2 Specifications needed for GEEs
3 Properties of GEEs
  3.1 Estimation of a contrast of $\beta$
  3.2 Comparison of models
4 Exploring different working correlations
5 Comparison of the parameter estimates
6 When do GEEs work?
7 The fatacid data reanalyzed
8 Comparison of the GLMM and the GEE approach
9 Deriving and solving GEEs*
  9.1 The GEE
  9.2 A motivation of the GEE
  9.3 Newton iteration
  9.4 Estimation of the covariance of $\hat\beta$
    9.4.1 Empirical (sandwich) estimate
    9.4.2 Model based estimate
1 Introduction
We discuss the approach of generalized estimating equations (GEE) to the analysis of correlated data for GLMs. This approach can be regarded as a generalization of the quasi-likelihood approach.
To fit the model one does not specify the distribution of the data, but makes assumptions about
1. the mean-covariate relation,
2. the variance function,
3. the correlation.
This is in contrast to the approach by GLMMs, which make distributional assumptions for the observations and the random effects.
Example 1.1 To the data of the respiratory disease we fitted the random intercept model, whose model formulation can be written as

$$y_{iv} \mid s_i \sim \mathrm{bin}(\pi_{iv}, 1)$$

Mean-covariate relation:

$$\eta_{iv} = \mathrm{logit}(\pi_{iv}) = \beta_1 + \beta_2\,\mathrm{baseline}_i + \beta_{3;\mathrm{center}(i)} + \beta_{4;\mathrm{sex}(i)} + \beta_{5;\mathrm{treat}(i)} + \beta_6\,\mathrm{age}_i + \beta_7\,(\mathrm{age}_i/10)^2 + s_i \quad (1)$$

$$s_i \sim N(0, \sigma_s^2) \text{ independently distributed}$$

Instead of this model-based representation of the correlation one could simply require that the measurements on the same patient are correlated. A corresponding model formulation would be
$$E(y_{iv}) = \mu_{iv} = \pi_{iv}$$

Mean-covariate relation:

$$\eta_{iv} = \mathrm{logit}(\pi_{iv}) = \beta_1 + \beta_2\,\mathrm{baseline}_i + \beta_{3;\mathrm{center}(i)} + \beta_{4;\mathrm{sex}(i)} + \beta_{5;\mathrm{treat}(i)} + \beta_6\,\mathrm{age}_i + \beta_7\,(\mathrm{age}_i/10)^2 \quad (2)$$

Variance function $V(\cdot)$ of the binomial distribution:

$$V(\pi_{iv}) = \pi_{iv}(1 - \pi_{iv})$$

Correlation between observations from the same patient:

$$\mathrm{Corr}(y_{iv}, y_{jw}) = \begin{cases} \rho_{vw} & \text{for } i = j \\ 0 & \text{for } i \neq j \end{cases}$$
2 Specifications needed for GEEs
The second way of dealing with dependent GLM-type observations is through what has become known as generalized estimating equations (GEEs). For simplicity we shall use the term GEE models.
GEEs can be regarded as an extension of quasi-likelihood models for independent measurements.
Recall that for quasi-likelihood models we specify 1) how the mean of the responses depends on the explanatory variables (the link function) and 2) how the variance depends on the mean (the variance function).
The setting is as follows: on each of $i = 1, \ldots, N$ subjects, $n_i$ measurements $y_i = (y_{i1}, \ldots, y_{in_i})$ are made. Measurements on different subjects are assumed to be independent; measurements on the same subject are allowed to be correlated.
The model specification of a GEE model involves three elements:

Systematic part: Relate the expectation $E(y_{it}) = \mu_{it}$ to the linear predictor via the link function

$$h(\mu_{it}) = \eta_{it} = x_{it}^\top \beta$$

Random part: Specify how the variance $\mathrm{Var}(y_{it})$ is related to the mean $E(y_{it})$ by specifying a variance function $V(\mu_{it})$ such that

$$\mathrm{Var}(y_{it}) = \phi V(\mu_{it}).$$

The correlation part: In addition to these GLM steps we need to impose a correlation structure for observations on the same subject. This is done by specifying a working correlation matrix.

Thus specification of a GEE model involves the same steps as the specification of a GLM plus the additional specification of a working correlation structure; a minimal call illustrating all three elements is sketched below.
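A minimal geeglm call showing where each of the three elements enters might look as follows. This is a sketch only: the respiratory data, the dataRep package and the variable names are taken from the example fitted later in Section 8.

library(geepack)
data(respiratory, package = "dataRep")
respiratory <- transform(respiratory, center = factor(center),
    treat = factor(treat), sex = factor(sex), age2 = (age/10)^2)
gee.exc <- geeglm(outcome ~ baseline + center + sex + treat + age + age2,
    family = binomial,        # link function and variance function
    id = id,                  # defines the clusters (patients)
    corstr = "exchangeable",  # working correlation structure
    data = respiratory)
summary(gee.exc)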
3 Properties of GEEs
With these specifications, one can derive a system of estimating equations by which an estimate $\hat\beta$ and an estimate $\widehat{\mathrm{Cov}}(\hat\beta)$ of the covariance matrix of $\hat\beta$ can be obtained. See Section 9 for more details.
One estimator of the covariance matrix $\mathrm{Cov}(\hat\beta)$ is the empirical or sandwich covariance matrix $\widehat{\mathrm{Cov}}(\hat\beta)_e$ (see Section 9.4.1). It is assumed that one has correctly specified the mean or systematic part of the model for the observations. Then the estimate $\hat\beta$ is asymptotically normally distributed with mean $\beta$ and covariance matrix $\widehat{\mathrm{Cov}}(\hat\beta)_e$:

$$\hat\beta \stackrel{A}{\sim} N(\beta, \widehat{\mathrm{Cov}}(\hat\beta)_e). \quad (3)$$

This result holds asymptotically even if
1. the variance function $V(\cdot)$ is misspecified,
2. the working correlation $R(\alpha)$ is misspecified, i.e. is not the true correlation matrix of $y_i$.
These properties of $\hat\beta$ are often called the robustness of the GEE estimator. Here robustness means that the asymptotic normality will hold even under misspecification of the variance function and the correlation matrix.
Note that we make fewer assumptions than if we specify a full statistical model. This extra flexibility comes at a price:
- The estimate $\hat\beta$ may not be the best possible.
- Hypothesis testing is based on Wald tests, since there is no distribution and hence no likelihood.
3.1 Estimation of a contrast of $\beta$
Assume that one wants to estimate a weighted sum of the entries in $\beta$:

$$\gamma = \sum_{m=1}^{p} \lambda_m \beta_m = \lambda^\top \beta$$

Then the distributional result in Eq. (3) says that

$$\hat\gamma \stackrel{A}{\sim} N(\gamma, \sigma_\gamma^2)$$

with $\sigma_\gamma^2 = \lambda^\top \widehat{\mathrm{Cov}}(\hat\beta)_e\, \lambda$.
Example 3.1 We use these results to predict the probability of a positive response based on the model. For any visit $v = 1, 2, 3, 4$, the logit for a response with baseline response 1, center=2, sex=F, treat=P and age=30 is

$$\gamma = \beta_1 + \beta_2 \cdot 1 + \beta_{3;2} \cdot 1 + \beta_{4;M} \cdot 0 + \beta_{5;P} \cdot 1 + \beta_6 \cdot 30 + \beta_7 \cdot 30^2/100$$

and $\pi = \exp(\gamma)/(1 + \exp(\gamma))$. Using the esticon function the estimate and a 95% confidence interval are given as 0.75 (0.57, 0.87).
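The esticon function (from the doBy package) essentially computes the linear combination $\lambda^\top \hat\beta$ and its standard error from the estimated covariance matrix. A sketch of this computation, assuming a geeglm fit gee.exc as in the sketch of Section 2, the coefficient order (Intercept), baseline, center2, sexM, treatP, age, age2, and that vcov() returns the empirical covariance for a geeglm fit:

lambda <- c(1, 1, 1, 0, 1, 30, (30/10)^2)  # baseline=1, center=2, sex=F,
                                           # treat=P, age=30
eta.hat <- sum(lambda * coef(gee.exc))     # estimated logit gamma-hat
se.eta <- sqrt(drop(t(lambda) %*% vcov(gee.exc) %*% lambda))
plogis(eta.hat + c(0, -1.96, 1.96) * se.eta)  # estimate and 95% CI for pi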
3.2 Comparison of models
Because the fitting of a model by GEE is not based on a likelihood, there is no likelihood ratio test available. Instead, one compares two nested models using again the distributional result (3). Let $M_0$ be a sub-model of $M_1$, where $M_1$ has the parameter vector $\beta$. $M_0$ is derived from $M_1$ by setting some parameters in $\beta$ equal to 0. This can generally be written as a matrix equation with a matrix $C$ such that one has model $M_0$ if the equation

$$C\beta = 0$$

is fulfilled. The test of model $M_0$ against the larger model $M_1$ is obtained from the Wald statistic

$$(C\hat\beta)^\top \big(C\, \widehat{\mathrm{Cov}}(\hat\beta)_e\, C^\top\big)^{-1} (C\hat\beta) \stackrel{A}{\sim} \chi^2_{p_1 - p_0}$$

where $p_1$ and $p_0$ are the numbers of parameters of models $M_1$ and $M_0$.
Example 3.2 We want to test the hypothesis that age is not a necessary effect in model $M_1$ (Eq. (1)) fitted with an AR(1) working correlation. We fit the smaller model

$$M_0: \quad \eta_{iv} = \beta_1 + \beta_2\,\mathrm{baseline}_i + \beta_{3;\mathrm{center}(i)} + \beta_{4;\mathrm{sex}(i)} + \beta_{5;\mathrm{treat}(i)}$$

The result of the Wald test is shown in Table 1 (it was calculated in R using the anova function, which has a method for models fitted by GEE).

Table 1: Wald test on dropping age from model $M_1$ (Eq. (1))

      Df    X2  P(>|Chi|)
  1    2 10.70     0.0047
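The test in Table 1 could be reproduced along these lines; a sketch, again assuming the respiratory data and variable names used in Section 8:

M1 <- geeglm(outcome ~ baseline + center + sex + treat + age + age2,
    family = binomial, id = id, corstr = "ar1", data = respiratory)
M0 <- update(M1, . ~ . - age - age2)  # drop both age terms
anova(M1, M0)                         # 2 df Wald test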
We do not have to specify the matrix $C$ explicitly to perform the test. In the present example it is

$$C = \begin{pmatrix} 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

corresponding to $\beta = (\beta_1, \beta_2, \beta_{3;\mathrm{center}=2}, \beta_{4;\mathrm{sex}=M}, \beta_{5;\mathrm{treat}=P}, \beta_6, \beta_7)$ if the reference levels for the factors are chosen as center=1, sex=F and treat=A.
4 Exploring different working correlations
Recall that from the Pearson residuals of a simple GLM fit to the data we obtained the following correlation matrix (Table 2), suggesting the correlations between observations from the same patient:
Table 2: Correlation matrix based on Pearson residuals.
visit1 visit2 visit3 visit4
visit1 1.00 0.35 0.24 0.30
visit2 0.35 1.00 0.34 0.28
visit3 0.24 0.34 1.00 0.36
visit4 0.30 0.28 0.36 1.00
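Table 2 can be computed along the following lines: fit an ordinary GLM, extract the Pearson residuals, arrange them with one column per visit, and correlate. A sketch, assuming the data contain one row per patient and visit, ordered by patient:

glm.fit <- glm(outcome ~ baseline + center + sex + treat + age + age2,
    family = binomial, data = respiratory)
r <- residuals(glm.fit, type = "pearson")
R <- matrix(r, ncol = 4, byrow = TRUE)  # one row per patient, columns = visits
round(cor(R), 2)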
In geeglm the following four working correlation matrices can be defined.
Example 4.1 The autoregressive AR(1) working correlation matrix has the form

$$R(\alpha) = \begin{pmatrix} 1 & \alpha & \alpha^2 & \alpha^3 \\ \alpha & 1 & \alpha & \alpha^2 \\ \alpha^2 & \alpha & 1 & \alpha \\ \alpha^3 & \alpha^2 & \alpha & 1 \end{pmatrix}$$

The estimated parameter for model $M_1$ is $\hat\alpha = 0.42$. This suggests that the autoregressive correlation matrix does not fit the empirical matrix in Table 2 closely.
Example 4.2 The exchangeable working correlation matrix is given by

$$R(\alpha) = \begin{pmatrix} 1 & \alpha & \alpha & \alpha \\ \alpha & 1 & \alpha & \alpha \\ \alpha & \alpha & 1 & \alpha \\ \alpha & \alpha & \alpha & 1 \end{pmatrix}$$

(Exchangeable means that one can swap the order of any two measurements without changing the correlation structure.)
Table 2 actually suggests that the correlation is about the same no matter how far apart the measurements are in time, i.e. that the exchangeable matrix could be appropriate. The estimated value of $\alpha$ for model $M_1$ is $\hat\alpha = 0.31$.
Example 4.3 The unstructured working correlation matrix is

$$R(\alpha) = \begin{pmatrix} 1 & \alpha_{12} & \alpha_{13} & \alpha_{14} \\ \alpha_{12} & 1 & \alpha_{23} & \alpha_{24} \\ \alpha_{13} & \alpha_{23} & 1 & \alpha_{34} \\ \alpha_{14} & \alpha_{24} & \alpha_{34} & 1 \end{pmatrix}$$

The unstructured working correlation matrix should be used only with great care: if there are $n$ measurements per subject, there are $n(n-1)/2$ parameters to estimate. This number becomes very large even for moderate $n$. In practice this means that the correlation parameters can be poorly estimated or that the statistical program fails to produce a result.
Example 4.4 The independence working correlation matrix is particularly simple:

$$R = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$

With the independence working correlation matrix, the actual dependence between the observations is incorporated in the model only through the empirical correlation matrix of $y_i$ (which informally can be thought of as Table 2).
Note that $R$ does not depend on any unknown parameters.
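In geeglm the working correlation structure is selected through the corstr argument. A sketch fitting model $M_1$ with each of the four structures (data as in Section 8):

form <- outcome ~ baseline + center + sex + treat + age + age2
gee.ind <- geeglm(form, family = binomial, id = id,
    corstr = "independence", data = respiratory)
gee.exc <- geeglm(form, family = binomial, id = id,
    corstr = "exchangeable", data = respiratory)
gee.ar1 <- geeglm(form, family = binomial, id = id,
    corstr = "ar1", data = respiratory)
gee.un <- geeglm(form, family = binomial, id = id,
    corstr = "unstructured", data = respiratory)
summary(gee.ar1)  # prints the estimated correlation parameter(s) alpha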
5 Comparison of the parameter estimates
So far, five models have been considered: 1) a GLM in which observations are assumed independent, and GEE models with 2) AR(1), 3) exchangeable, 4) unstructured and 5) independence working correlations. Table 3 shows the parameter estimates and standard errors.
The following general results are illustrated by this table:
The independence and exchangeable working correlation structures produce exactly the same parameter estimates as one obtains if a GLM is fitted to the data. This is generally true if
- all subjects have the same number of observations $n_i = n$;
- the levels of the covariates do not vary within a subject. In the example the values of all covariates baseline, center, sex, treat and age for a subject are the same for all visits.
The standard errors produced by the GEEs are all of comparable size (they are practically identical). The standard errors of the GEEs are about 50% larger than the GLM standard errors.
Table 3: Comparison of parameter estimates.

             Est.GLM  SE.GLM  Est.Ind  SE.Ind  Est.Exc  SE.Exc
(Intercept)     3.87    0.96     3.87    1.31     3.87    1.31
baseline        1.89    0.25     1.89    0.38     1.89    0.38
center2         0.51    0.25     0.51    0.38     0.51    0.38
sexM            0.45    0.32     0.45    0.48     0.45    0.48
treatP          1.32    0.24     1.32    0.38     1.32    0.38
age             0.21    0.05     0.21    0.06     0.21    0.06
age^2/100       0.26    0.06     0.26    0.08     0.26    0.08

             Est.GLM  SE.GLM  Est.AR1  SE.AR1  Est.Un  SE.Un
(Intercept)     3.87    0.96     3.48    1.29    3.86   1.30
baseline        1.89    0.25     1.90    0.37    1.92   0.37
center2         0.51    0.25     0.59    0.37    0.50   0.38
sexM            0.45    0.32     0.42    0.48    0.44   0.48
treatP          1.32    0.24     1.25    0.37    1.32   0.38
age             0.21    0.05     0.19    0.06    0.21   0.06
age^2/100       0.26    0.06     0.24    0.08    0.26   0.08
In the absence of important prior knowledge, the independence working correlation is often a good choice. It should be noted that using the independence working correlation is not the same as assuming the observations to be independent. Details are given in Section 9.4.
6 When do GEEs work?
GEE works best
- if the number of observations per subject is small and the number of subjects is large;
- if, in longitudinal studies (e.g. growth curves), the measurements are taken at the same times for all subjects.
7 The fatacid data reanalyzed
In the module on quasi-likelihood we analyzed data about the amount of the fish oil fatty acid x14 in the tissue of pigs. In Fig. 1 the amounts of x14 measured at a weight of 60 kg (sample=1) and 100 kg (sample=2) are plotted for each pig.
[Figure panels omitted: response profiles of pigs, one panel per dose of x14 (0, 4.4, 6.2, 9.3); x-axis: sample (1, 2), y-axis: x14.]
Figure 1: x14 measurements for all pigs at the two sample occasions.
The four different plot panels refer to the four groups of pigs that were fed different doses of x14. It is obvious that the pigs have individually different levels of x14 and that the two repeated measurements per pig are correlated. We fit the model $M_b$ used in the lecture on quasi-likelihood:

$$M_b: \quad \log(\mu_{ds}) = \alpha + \beta_s + \gamma_d$$

where $d$ indexes the doses of x14 and $s = 1, 2$ the two samples. First we fit the quasi-likelihood model, assuming the variance function of the Poisson distribution and independence of all observations. Secondly, we fit a GEE model using the variance function of the Poisson distribution and an exchangeable working correlation. The parameter estimates of both fits are given in Table 4; a sketch of the two fits follows the table.
Table 4: Parameter estimates of fitting $M_b$ by quasi-likelihood (assuming independence of the observations) and by the GEE approach using an exchangeable working correlation. In both cases we use a Poisson variance function.

                Est.QUASI  SE.QUASI  Est.EX  SE.EX
$\alpha$            1.057     0.151   1.050  0.163
$\beta_2$           0.824     0.062   0.846  0.040
$\gamma_{4.4}$      1.801     0.161   1.821  0.178
$\gamma_{6.2}$      2.133     0.158   2.125  0.175
$\gamma_{9.3}$      2.374     0.156   2.388  0.180
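The two fits of Table 4 might be produced along these lines. This is a sketch: the data set and its variable names (x14, sample, dose, pig) are assumptions about the fatacid data from the quasi-likelihood module.

fatacid <- transform(fatacid, sample = factor(sample), dose = factor(dose))
m.quasi <- glm(x14 ~ sample + dose, family = quasipoisson, data = fatacid)
m.gee <- geeglm(x14 ~ sample + dose, family = poisson, id = pig,
    corstr = "exchangeable", data = fatacid)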
The standard error of the difference between the two samples, $\beta_2$, is lower for the GEE fit than for the fit by quasi-likelihood. First observe that both levels of the treatment sample are observed on the same pig, which is the experimental unit. The GEE approach derives the estimate of the difference between the two samples mainly from the differences within each pig, and for that comparison it therefore removes the variability between pigs. Because the different doses are given to different pigs, such a reduction in the standard errors is not observed for the $\gamma$-parameter estimates.
8 Comparison of the GLMM and the GEE approach
We have seen that GEE is, to a lesser degree than GLMM, a model-based approach to analysing correlated data.
GLMMs model the data on the individual level whereas GEEs work on the population level.
For observations other than normally distributed variables (with the identity link), this different focus of the models has an impact on the values and interpretation of the model parameters.
Example 8.1 For the respiratory data we estimate the parameters via GLMM
and GEE.
library(lme4)
library(geepack)
data(respiratory, package = "dataRep")
respiratory <- transform(respiratory, center = factor(center),
    treat = factor(treat), sex = factor(sex), age2 = (age/10)^2)
## random intercept GLMM
M.lmer <- lmer(outcome ~ baseline + center + sex + treat +
    age + age2 + (1 | id), data = respiratory, family = binomial)
## GEE fit; corstr defaults to "independence"
M.gee <- geeglm(outcome ~ baseline + center + sex +
    treat + age + age2, id = id, data = respiratory,
    family = binomial)
Table 5: Comparison of estimates: GLMM versus GEE

             random intercept (RI)   GEE  ratio GEE/RI
(Intercept)                   5.58  3.87          0.69
baseline                      2.12  1.89          0.89
center2                       0.68  0.51          0.75
sexM                          0.60  0.45          0.75
treatP                        1.88  1.32          0.70
age                           0.29  0.21          0.71
age^2/100                     0.36  0.26          0.71
The coefficients from the GEE fit are a factor between 0.7 and 0.9 smaller than those from the random intercept model. It can be proven that under the random intercept model for logistic regression with $u_i \sim N(0, \sigma^2)$ the factor is about

$$\beta_k^{\mathrm{GEE}} \approx \frac{1}{\sqrt{f^2 \sigma^2 + 1}}\, \beta_k \quad (4)$$

with $f = 16\sqrt{3}/(15\pi) \approx 0.588$. Hence the coefficient from the GEE model is always smaller in absolute value than the coefficient from the random intercept model.
In the example the estimated variance $\hat\sigma^2$ is

sigma2 <- VarCorr(M.lmer)$id[1]

$\hat\sigma^2 = 1.45$, and hence the shrinkage factor $1/\sqrt{f^2 \sigma^2 + 1}$ according to Eq. (4)

f <- 16 * sqrt(3)/(15 * pi)
sh <- 1/sqrt(f^2 * sigma2 + 1)

would be 0.816.
One says that the coefficients of GEE are shrunk towards 0.
The log-odds ratio expressed by a $\beta_k$ refers, for the GLMM, to the risk of an individual patient; the GEE estimate yields the risk for a population of patients and is a kind of averaged risk.
Example 8.2 We consider observations simulated according to the random intercept model

$$\mathrm{logit}(\pi_{i,j}) = 0.3 + u_i + 1.5\, x_j \quad (5)$$
$$y_{i,j} \sim \mathrm{bin}(\pi_{i,j}, 1) \quad (6)$$

where $i = 1, \ldots, 50$, $j = 1, \ldots, 20$, and $x_j$ is a vector of 20 numbers equally spaced between -5 and 5. The variance of $u_i$ was chosen to be 4 (i.e. a standard deviation of sd.u=2). We therefore have 50 individuals on each of which 20 measurements are made.
In Fig. 2 the probabilities $\pi_{i,x}$ estimated according to the above GLMM are plotted for all individuals. The thick line is the probability obtained by averaging across all logits $\mathrm{logit}(\pi_{i,x})$. This average line is in close agreement with the line obtained from estimating the parameters by GEE (see Fig. 2, right panel).
[Figure panels omitted; both panels plot probability against x.]
Figure 2: Left: probabilities for each individual estimated by GLMM and their average. Right: the average probabilities estimated by GLMM and the probabilities estimated by GEE.
In the sequel we explain how the simulation was performed.
The simulation of the data
sim.dat <- function(n.id, n.rep, sd.u) {
    set.seed(99)
    u.id <- rnorm(n.id, sd = sd.u)              # random intercepts u_i
    x <- seq(-5, 5, l = n.rep)                  # covariate values
    y <- p <- matrix(NA, n.rep, n.id)
    for (id in c(1:n.id)) {
        p[, id] <- plogis(0.3 + u.id[id] + 1.5 * x)  # Eq. (5)
        y[, id] <- rbinom(n.rep, 1, p[, id])         # Eq. (6)
    }
    y <- c(y)
    id <- rep(1:n.id, each = n.rep)
    x <- rep(x, n.id)
    dat <- data.frame(y = y, x = x, id = id)
    list(dat = dat, x = x[1:n.rep], p = p)
}
simu <- sim.dat(n.id = 50, n.rep = 20, sd.u = 2)
The random intercept model is fitted with lmer, and a GEE fit is also produced using the independence working assumption.
library(lme4)
library(geepack)
m.lmer <- lmer(y ~ x + (1 | id), data = simu$dat, family = binomial)
m.gee <- geeglm(y ~ x, id = id, data = simu$dat, family = binomial)
First the fitted probabilities are plotted (Fig. 2, left panel):
x <- simu$x
p <- simu$p
matplot(x, p, type = "l", col = 1, lty = 2, ylab = "probability")
lines(apply(p, 1, mean) ~ x, lwd = 4)
The thin lines are the probabilities fitted to each individual; the thick line is the average of these probabilities. The estimates from the GEE and the random intercept model:
tab <- data.frame(fixef(m.lmer), coef(m.gee), ratio = coef(m.gee)/fixef(m.lmer))
names(tab) <- c("random intercept (RI)", "GEE", "ratio GEE/RI")
print(round(tab, 4))

            random intercept (RI)     GEE ratio GEE/RI
(Intercept)               -0.3394 -0.2365       0.6969
x                          1.3872  0.9648       0.6955
The theoretical factor of Eq. (4)

f <- 16 * sqrt(3)/(15 * pi)
sh <- 1/sqrt(f^2 * 2^2 + 1)  # shrinkage factor with sigma^2 = 4

would be 0.65.
Finally, we plot the average probabilities $\bar\pi_x$ (p.average) and the probabilities based on the GEE estimation, p.gee.
x <- simu$x
p <- simu$p
p.average <- apply(p, 1, mean)
coef.gee <- coef(m.gee)
p.gee <- plogis(coef.gee[1] + coef.gee[2] * x)
plot(p.average ~ x, type = "l", ylab = "probability")
lines(p.gee ~ x, type = "l", lty = 2)
legend(-4, 1, legend = c("averaged probabilities from GLMM",
    "GEE estimate"), lty = c(1, 2), cex = 0.6)
9 Deriving and solving GEEs*
This section is somewhat technical. To understand the material, knowledge of
matrix algebra is needed.
9.1 The GEE
Let $y_i = (y_{i1}, \ldots, y_{i,n_i})$ be the vector of $n_i$ observations on the $i$th subject. There are $i = 1, \ldots, N$ subjects. The GEE to solve for estimating the $p$-vector $\beta$ is given by

$$s(\beta) = \sum_{i=1}^{N} \frac{\partial \mu_i}{\partial \beta}\, V_i^{-1} (y_i - \mu_i(\beta)) = 0. \quad (7)$$

In (7),

$$V_i = A_i^{1/2} R(\alpha) A_i^{1/2}$$

where $A_i$ is a diagonal matrix with the variance functions $V(\mu_{it})$ on the diagonal and $R(\alpha)$ is the correlation matrix.
Note that since $\mu_{it}$ depends on the unknown parameter $\beta$, then so does $V_i$. Hence we should really write $V_i(\beta)$ to be precise.
The correlation matrix is generally unknown, so one specifies a working correlation matrix, e.g. one with an autoregressive structure in a repeated measurements problem.
Since $h(\mu_{it}) = x_{it}^\top \beta = \sum_{k=1}^{p} x_{itk} \beta_k$, the derivative $\frac{\partial \mu_i}{\partial \beta}$ is a matrix with $p$ rows and $n_i$ columns whose $(k,t)$-th entry is given by

$$\Big[\frac{\partial \mu_i}{\partial \beta}\Big]_{kt} = \frac{x_{itk}}{h'(\mu_{it})}.$$
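For example, for the logit link $h(\mu) = \log\{\mu/(1-\mu)\}$ one has $h'(\mu) = 1/\{\mu(1-\mu)\}$, so the entry becomes

$$\Big[\frac{\partial \mu_i}{\partial \beta}\Big]_{kt} = x_{itk}\, \mu_{it}(1 - \mu_{it}).$$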
9.2 A motivation of the GEE
One way of motivating the estimating equation (7) is to consider the weighted sum of squares

$$Q(\beta) = \sum_{i=1}^{N} (y_i - \mu_i(\beta))^\top V_i^{-1} (y_i - \mu_i(\beta))$$

where the weights are given by $V_i^{-1}$.
To minimize $Q(\beta)$ we take the derivative $\frac{\partial}{\partial \beta} Q(\beta)$, which (up to a factor of $-2$, treating $V_i$ as fixed) is $s(\beta)$, and equate it to 0 to get the estimate $\hat\beta$. But this is exactly (7).
Typically $R(\alpha)$ is estimated from data iteratively, by using the current estimate of $\beta$ to calculate a function of the Pearson residuals

$$r_{P;it} = \frac{y_{it} - \mu_{it}(\beta)}{\sqrt{V(\mu_{it}(\beta))}}. \quad (8)$$
The dispersion parameter $\phi$ is often estimated as

$$\hat\phi = \frac{1}{\big(\sum_{i=1}^{N} n_i\big) - p} \sum_{i=1}^{N} \sum_{t=1}^{n_i} r_{P;it}^2. \quad (9)$$
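In R the estimate (9) amounts to a few lines; a sketch for a fitted model object fit with p coefficients:

phi.hat <- function(fit) {
    r <- residuals(fit, type = "pearson")  # r_P;it for all i, t
    sum(r^2) / (length(r) - length(coef(fit)))
}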
9.3 Newton iteration
The fitting algorithm becomes an instance of the Newton algorithm for solving a system of equations:
1. Compute an initial estimate of $\beta$ from a GLM (i.e. by assuming independence).
2. Compute an estimate $\hat R(\alpha)$ of the working correlation on the basis of the current Pearson residuals and the current estimate of $\beta$.
3. Compute an estimate of the variance as $\hat V_i = A_i^{1/2} \hat R(\alpha) A_i^{1/2}$.
4. Compute an updated estimate of $\beta$ based on the Newton step

$$\beta := \beta + \Big[\sum_i \frac{\partial \mu_i}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta}^{\!\top}\Big]^{-1} \Big[\sum_i \frac{\partial \mu_i}{\partial \beta} V_i^{-1} (y_i - \mu_i(\beta))\Big]$$

Iterate through 2-4 until convergence. Note that $\phi$ need not be estimated until the last iteration. A bare-bones sketch of this loop is given below.
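The following sketch writes out steps 1-4 for a logistic GEE with an exchangeable working correlation and balanced data (the same number n of observations per subject). It is only meant to illustrate the algorithm; geepack's implementation differs in details such as the exact moment estimator for alpha and the handling of phi.

gee.newton <- function(y, X, id, tol = 1e-8) {
    beta <- coef(glm(y ~ X - 1, family = binomial))  # step 1: GLM start
    n <- table(id)[1]                                # balanced: n_i = n
    repeat {
        mu <- drop(plogis(X %*% beta))
        r <- (y - mu) / sqrt(mu * (1 - mu))          # Pearson residuals (8)
        ## step 2: moment estimate of alpha from residual cross-products
        alpha <- mean(sapply(split(r, id), function(ri)
            (sum(ri)^2 - sum(ri^2)) / (n * (n - 1))))
        R <- matrix(alpha, n, n); diag(R) <- 1       # exchangeable R(alpha)
        Rinv <- solve(R)
        ## steps 3 and 4: accumulate the Newton step over subjects
        H <- matrix(0, ncol(X), ncol(X)); g <- numeric(ncol(X))
        for (i in unique(id)) {
            rows <- which(id == i)
            Ai <- mu[rows] * (1 - mu[rows])          # variance function values
            D <- t(X[rows, , drop = FALSE]) %*% diag(Ai)   # d mu_i / d beta
            Vinv <- diag(1/sqrt(Ai)) %*% Rinv %*% diag(1/sqrt(Ai))
            H <- H + D %*% Vinv %*% t(D)
            g <- g + D %*% Vinv %*% (y[rows] - mu[rows])
        }
        step <- solve(H, g)
        beta <- beta + step
        if (sum(step^2) < tol) break
    }
    drop(beta)
}

For the simulated data of Example 8.2, gee.newton(simu$dat$y, cbind(1, simu$dat$x), simu$dat$id) should agree closely with the corresponding geeglm fit using corstr = "exchangeable".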
The GEE estimate $\hat\beta$ of $\beta$ is often very similar to the estimate obtained if the observations were treated as being independent.
9.4 Estimation of the covariance of $\hat\beta$
There are two classical ways of estimating the covariance $\mathrm{Cov}(\hat\beta)$. To describe these, let

$$I_0 = \sum_i \frac{\partial \mu_i}{\partial \beta} V_i^{-1} \frac{\partial \mu_i}{\partial \beta}^{\!\top}$$
9.4.1 Empirical (sandwich) estimate
The empirical estimate is

$$\widehat{\mathrm{Cov}}(\hat\beta)_e = I_0^{-1} I_1 I_0^{-1}$$

where

$$I_1 = \sum_i \frac{\partial \mu_i}{\partial \beta} V_i^{-1}\, \mathrm{Cov}(y_i)\, V_i^{-1} \frac{\partial \mu_i}{\partial \beta}^{\!\top}$$

Here $\widehat{\mathrm{Cov}}(\hat\beta)_e$ is a consistent estimate of $\mathrm{Cov}(\hat\beta)$ even if the working correlation is misspecified, i.e. if $\mathrm{Cov}(y_i) \neq V_i$.
In practice, $\mathrm{Cov}(y_i)$ is replaced by $(y_i - \mu_i(\hat\beta))(y_i - \mu_i(\hat\beta))^\top$.
The empirical estimate is used by geeglm.
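Continuing the gee.newton sketch above, the sandwich estimate can be accumulated at the final $\hat\beta$ in the same per-subject loop (same assumptions: logistic model, exchangeable R, balanced data):

sandwich.cov <- function(beta, y, X, id, R) {
    p <- ncol(X)
    I0 <- I1 <- matrix(0, p, p)
    mu <- drop(plogis(X %*% beta))
    Rinv <- solve(R)
    for (i in unique(id)) {
        rows <- which(id == i)
        Ai <- mu[rows] * (1 - mu[rows])
        D <- t(X[rows, , drop = FALSE]) %*% diag(Ai)   # d mu_i / d beta
        Vinv <- diag(1/sqrt(Ai)) %*% Rinv %*% diag(1/sqrt(Ai))
        e <- y[rows] - mu[rows]                        # y_i - mu_i(beta-hat)
        I0 <- I0 + D %*% Vinv %*% t(D)
        I1 <- I1 + D %*% Vinv %*% e %*% t(e) %*% Vinv %*% t(D)
    }
    solve(I0) %*% I1 %*% solve(I0)                     # I0^{-1} I1 I0^{-1}
}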
9.4.2 Model based estimate
The model based estimate is

$$\widehat{\mathrm{Cov}}(\hat\beta)_m = I_0^{-1}$$

This is the GEE version of the matrix used when estimating $\mathrm{Cov}(\hat\beta)$ in a GLM. Here $\widehat{\mathrm{Cov}}(\hat\beta)_m$ consistently estimates $\mathrm{Cov}(\hat\beta)$ if
1. the mean model and
2. the working correlation
are correct.