Table of contents
1 Introduction
2 Description of Data
3 Methods
4 Survival Analysis
5 Literature
Statistical Consulting SS 08
1 Introduction
In survival analysis, breast cancer survival is usually predicted using clinical variables and gene
expressions as independent factors. This paper aims to identify new gene expressions as prognostic
markers for breast cancer survival. On behalf of Prof. Dan Cacsire Castillo-Tong and Prof. Zeilinger, 6
gene expressions (DDR2, EMP1, EMP2, EMP3, PMP22 and MKI67) were chosen as candidate
markers and examined for their association with disease-free survival (DFS), overall survival (OS)
and tumor-specific survival (TS) of the patients. A candidate marker that proves satisfactory in our
analyses could be used to develop a new score to improve prognosis for breast cancer.
2 Description of Data
250 breast cancer patients from the Department of Obstetrics and Gynaecology, Medical University of
Vienna, were included in this study. Dates of diagnosis range from 03.03.1987 to 30.11.2001. As
clinical variables, histologic type (HISTOTYPE), tumor size (pT), degree of spread to lymph nodes
(pN), tumor grade (G) and age (AGE) were chosen. In addition to the 6 gene expressions described
in the introduction, the gene expression of the estrogen receptor (ER) was quantified in the tumor
tissues of the patients. The analyses were performed with the open-source statistical software
package R (R Development Core Team, 2008) and the commercial software package SAS® (SAS
Institute Inc., 2008).
Simple plausibility checks were performed on the data. Patient 7763 was excluded from the data
set because the results of all gene expressions were missing. Patient 7363 was removed because the
date of recurrence was unknown. The analyses are therefore based on 248 patients.
To check the distribution of the gene expressions graphically, histograms were drawn. The data
were transformed by taking the logarithm to base 2 to obtain an approximately symmetric,
near-normal distribution.
[Figure: histograms (y-axis: Density) of the log2-transformed gene expression values, one panel per
gene expression, plus one panel with the untransformed MKI67 values.]
To show the importance of the logarithmic transformation, the untransformed values of the gene
expression MKI67 are also plotted. Afterwards, the gene expressions were dichotomized at the median
(values lower than or equal to the median vs. values greater than the median).
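The two preprocessing steps described above (log2 transformation, then a split at the median) can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the report; the function names and the expression values are hypothetical.

```python
import math
from statistics import median

def log2_transform(values):
    """Return the base-2 logarithm of each (strictly positive) expression value."""
    return [math.log2(v) for v in values]

def dichotomize_at_median(values):
    """Return (cut, groups); groups[i] is 1 if values[i] > median, else 0."""
    cut = median(values)
    return cut, [1 if v > cut else 0 for v in values]

# Hypothetical raw expression values (illustrative only, not study data).
raw = [0.05, 0.12, 0.25, 0.5, 1.0, 2.0, 4.0]
logged = log2_transform(raw)
cut, high = dichotomize_at_median(logged)
print(cut, high)
```

Values equal to the median fall into the lower group, which matches the "≤ median" vs. "> median" grouping used throughout the report.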
Histologic type
Invasive ductal carcinoma (IDC) 182
IDC and ILC 6
Invasive lobular carcinoma (ILC) 40
Medullary 5
Mucinous 6
Unknown 9
Total 248
For the analysis of survival times, categories had to be pooled because of the low number of cases in
some subgroups. Concerning histologic type, the category IDC and ILC was classified as IDC,
because the more serious diagnosis is the relevant one for survival. Medullary, Mucinous and Unknown
were combined into a new category, Others and Unknown.
Tumor size
Mic 1
pT I 64
pT II 127
pT III 23
pT IV 14
Unknown 19
Total 248
The patient in category Mic was assigned to pT I. For the analysis, pT III and pT IV were pooled
and compared with the groups pT I and pT II.
Nodal status
pN0 95
pN1 123
Unknown 30
Total 248
Differentiation grade
G1 34
G2 122
G3 71
Unknown 21
Total 248
Recurrence of Disease
Recurrence of disease 109
No evidence of disease 139
Total 248
Survival
Alive at last observation 152
Dead at last observation 96
  Death as a result of disease 71
  Death of other cause 16
  Death of unknown cause 9
Total 248
For the analysis, patients were divided into those aged 50 years or younger and those older than 50
years, since menopause usually starts around this age. At the time of diagnosis, 31% of all patients
were in the younger group and 69% were older than 50 years.
Correlations
Gene expression values were grouped into values lower than or equal to the median and values greater
than the median and then compared between the groups defined by the histopathologic data using the
χ² test. PMP22 appears to be strongly associated with differentiation grade (G) and nodal status (pN).
Gene expression of EMP2 appears to be strongly associated with G, pT and pN. A strong association
was also found between ER and G. Furthermore, no significant difference in the level of expression
of PMP22 was found between patients aged 50 years or younger and patients older than 50 years.
Additionally, correlations between gene expressions were estimated by Spearman’s nonparametric
correlation coefficient. In general, the gene expressions are remarkably correlated with each other.
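For a dichotomized gene expression and a two-level clinical variable, the χ² test reduces to a test on a 2x2 table. The sketch below shows the computation in plain Python. It assumes Pearson's statistic without continuity correction (the report does not say whether a correction was applied), and the counts are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-square variable with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = expression <= median / > median, columns = pN0 / pN1.
stat, p = chi_square_2x2(30, 20, 20, 30)
print(round(stat, 3), round(p, 4))
```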
As is typical in clinical and epidemiological studies, survival times are censored because of a type I
time restriction (Lagakos, 1979): the study continues until a prespecified time point (cut-off
point). The date of the event of interest is known precisely only for those subjects who experience the
event before the cut-off point. For the remaining subjects, it is only known that the time to the event
is greater than the observation time. This is referred to as “administrative censoring”, and the
incomplete data are called “right censored”. Besides the time restriction, incomplete data can also
arise from patients who are lost to follow-up or drop out of the study.
Disease-free survival (DFS) was defined as the time elapsed from the date of diagnosis to the date of
recurrence of disease (event) or, in case of no recurrence, to the date of the last gynecological
examination (censored).
Overall survival (OS) was defined as the time elapsed from the date of diagnosis to the date of death
(event) or, in case of no death, to the date of last observation (censored).
Tumor-specific survival (TS) follows the definition of OS, except that patients who died of causes
unrelated to breast cancer were also treated as censored.
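The endpoint definitions above translate into a pair (time, event indicator) per patient. The following sketch illustrates this under stated assumptions: the function name, its arguments and the dates are hypothetical, time is expressed in years, and a death that does not count as an event (as in the TS analysis) is assumed to be censored at the date of death.

```python
from datetime import date

def outcome(start, event_date, last_seen, counts_as_event=True):
    """Return (time_in_years, event) for one patient and one endpoint.

    event is 1 if the event was observed and counts for this endpoint, else 0 (censored).
    """
    if event_date is not None and counts_as_event:
        end, event = event_date, 1
    elif event_date is not None:
        # Assumption: e.g. death from another cause in the TS analysis, censored at death.
        end, event = event_date, 0
    else:
        end, event = last_seen, 0
    return (end - start).days / 365.25, event

# Hypothetical patient who died of a cause unrelated to breast cancer.
diagnosis, death, last_seen = date(1990, 1, 1), date(1995, 1, 1), date(1995, 1, 1)
os_time, os_event = outcome(diagnosis, death, last_seen)
ts_time, ts_event = outcome(diagnosis, death, last_seen, counts_as_event=False)
print(os_time, os_event, ts_event)
```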
Survival probabilities were estimated using the method of Kaplan and Meier (1958).
The analyses show that the probability of recurrence within 5 years is about 36.1%, the probability
of death about 24.6% and the probability of death on account of the tumor about 19%. Over 10 and 15
years, the probabilities of recurrence and death increase steadily: at an approximately constant rate
for disease-free survival and overall survival, and at a diminishing rate for tumor-specific survival.
[Figure: Kaplan-Meier curves for disease-free survival, overall survival and tumor-specific survival
(three panels); x-axis: time in years (0 to 14), y-axis: cumulative survival.]
The median survival time is read off the survival curve for DFS, OS and TS.
It estimates the time beyond which 50% of the patients in the population under study are expected
to survive. As is evident in the graphs above, the survival curve for tumor-specific survival does not
fall below 50%. The last tumor-related death in the study occurs after 20.6 years, at a survival rate
of 0.547.
Follow-up distribution
Quantile   Time in years   Number at risk
25%        12.5            34
[Figure: follow-up distribution curve; x-axis: years (0 to 20), y-axis: 0.0 to 1.0.]
3 Methods
The Kaplan-Meier estimator estimates the survival function from the survival data. It can be used to
measure the survival probability for a certain amount of time after biopsy. The value of the survival
function between successive distinct observed event times is assumed to be constant. For simplicity,
explanations are restricted to the case where the event of interest is death.
Let $S(t)$ be the probability that an individual from a given population will have a lifetime
exceeding $t$. For a sample of size $N$ from this population, let the observed times until death of
the $N$ sample members be
$$t_1 \le t_2 \le t_3 \le \dots \le t_N$$
Let $T$ be the random variable that measures the time of death and let $F(t)$ be its cumulative
distribution function. Then the survival function is given by:
$$S(t) = P(T > t) = 1 - F(t)$$
The Kaplan-Meier estimator is the nonparametric maximum likelihood estimate of the survival
function $S(t)$. It is of the form
$$\hat{S}(t) = \prod_{t_i \le t} \frac{n_i - d_i}{n_i}$$
where $n_i$ is the number "at risk" just prior to time $t_i$, and $d_i$ is the number of deaths at time
$t_i$. With censoring, $n_i$ is the number of survivors less the number of losses (censored cases).
Only those surviving cases that are still being observed are "at risk" of an (observed) death.
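The product-limit formula can be made concrete with a short sketch. The code below is a plain-Python illustration of the estimator on hypothetical data, not the survfit implementation used for the report. It follows the usual convention that subjects censored at an event time still count as at risk at that time. The helper median_survival returns None when the curve never reaches 50%.

```python
def kaplan_meier(times, events):
    """Return a list of (t_i, n_i, d_i, S(t_i)), one entry per distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = censored = 0
        # Collect all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1] == 1:
                deaths += 1
            else:
                censored += 1
            i += 1
        if deaths > 0:
            surv *= (n_at_risk - deaths) / n_at_risk
            curve.append((t, n_at_risk, deaths, surv))
        n_at_risk -= deaths + censored
    return curve

def median_survival(curve):
    """First event time at which S(t) is 0.5 or lower, or None if that never happens."""
    for t, _, _, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical data: event = 1 (death observed) or 0 (censored).
times = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for row in curve:
    print(row)
print(median_survival(curve))
```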
Cox regression models are a sub-class of survival models in statistics. They are used in this paper
for the analysis of censored survival data to identify differences in survival due to prognostic
factors. The basic model assumes that the hazard function for failure time $T$ for an individual $i$
with covariate vector $x_i' = (x_{1i}, x_{2i}, \dots, x_{ki}, \dots, x_{Ki})$ is
$$h_i(t) = h_0(t) \exp(\beta' x_i) \quad \text{for } i = 1, \dots, N$$
The covariates are assumed to be constant in time and to have independent effects on the hazard rate.
The first part, $h_0(t)$, is a function of time and is assumed to be the same for all subjects. Its
form is not specified by the Cox model. The second part depends on the individual covariate vector,
where $\beta$ is the unknown effect parameter, which has to be estimated. Cox (1972) introduced a
method for estimating $\beta$, and hence the hazard ratio, without having to involve $h_0(t)$ by using
partial likelihoods. Although $h_0(t)$ can take any form, the hazard ratio between 2 individuals $i$
and $j$ can be calculated independently of $h_0(t)$:
$$\frac{h_i(t)}{h_j(t)} = \frac{h_0(t) \exp(\beta' x_i)}{h_0(t) \exp(\beta' x_j)} = \exp(\beta' (x_i - x_j))$$
The formula underlines the proportional hazards assumption, which means that the failure rates of any
two individuals are proportional, i.e. their ratio does not depend on time. Although the risk to
die can vary over time, the risk ratio between two individuals is constant over the whole range of
follow-up. $h_0(t)$ can be interpreted as the hazard function of a subject with all covariates equal
to 0.
A further crucial assumption follows from the exponential function linking the independent
covariates to the hazard. It leads to a multiplicative effect of a covariate on the hazard or, on the
scale of the logarithm of the hazard function, to an additive effect in the form of a constant
distance over time. This assumption will later be relaxed by using interaction terms.
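That the hazard ratio does not depend on the baseline hazard can be checked numerically. The sketch below evaluates a Cox-type hazard under two arbitrary baseline functions and compares the ratio for two covariate vectors. All values (coefficients, covariates, baselines) are hypothetical and chosen only for illustration.

```python
import math

def hazard(t, x, beta, baseline):
    """Cox-type hazard: baseline(t) * exp(beta' x)."""
    return baseline(t) * math.exp(sum(b * xi for b, xi in zip(beta, x)))

def hazard_ratio(x_i, x_j, beta):
    """Hazard ratio exp(beta' (x_i - x_j)); no baseline hazard is needed."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, x_i, x_j)))

beta = [0.75, 0.47]              # hypothetical coefficients
x_i, x_j = [1, 2], [0, 2]        # two individuals differing only in the first covariate
baselines = [lambda t: 0.1, lambda t: 0.02 * t]   # two arbitrary baseline hazards

hr = hazard_ratio(x_i, x_j, beta)
for h0 in baselines:
    for t in (1.0, 5.0, 10.0):
        ratio = hazard(t, x_i, beta, h0) / hazard(t, x_j, beta, h0)
        print(t, round(ratio, 6), round(hr, 6))
```

The printed ratio is the same for every time point and for both baselines, which is the proportional hazards property in miniature.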
4 Survival Analysis
The association of gene expression groups with survival times was assessed by estimating survival
curves with the method of Kaplan and Meier (1958), which were compared by the log-rank test of
Mantel and Haenszel (1959) and quantified by estimating relative risks (crude hazard ratios) from
univariate Cox regression analyses (Cox, 1972), which are closely related to log-rank tests. In order
to evaluate gene expressions as independent prognostic factors for DFS, OS and TS, multivariable Cox
regression analyses were used additionally.
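The log-rank comparison of two survival curves can be sketched in a few lines. The code below is an illustrative plain-Python version of the two-group test with the usual hypergeometric variance, not the routine used for the report; the data are hypothetical, and the function assumes that the variance is positive (at least one event time with both groups at risk).

```python
import math

def logrank_test(times, events, groups):
    """Two-group log-rank test; groups holds 0/1 labels.

    Returns (chi_square, p_value) with 1 degree of freedom.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n = sum(1 for u in times if u >= t)                              # at risk, both groups
        n1 = sum(1 for u, g in zip(times, groups) if u >= t and g == 1)  # at risk, group 1
        d = sum(1 for u, e in zip(times, events) if u == t and e == 1)   # events at t
        d1 = sum(1 for u, e, g in zip(times, events, groups)
                 if u == t and e == 1 and g == 1)                        # events at t, group 1
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical data: group 1 fails early, group 0 fails late.
chi2, p = logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 0, 0])
print(round(chi2, 4), round(p, 4))
```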
Estimates of the survival curve for censored data using the Kaplan-Meier method and the predicted
survivor function for a Cox proportional hazards model were computed with the function survfit in the
R package “survival” (Therneau et al., 2008). Cox proportional hazards regression models were fitted
with the function coxph from the same package. The Efron approximation (1977) is used for
calculating the parameter estimates instead of the typical Breslow method (1974), “as it is much more
accurate when dealing with tied death times, and it is as efficient computationally” (Therneau et al.,
2008).
Kaplan-Meier Curves are plotted for all gene expressions using disease-free survival.
[Figure: Kaplan-Meier curves for disease-free survival by gene expression level (≤ median vs.
> median), six panels: ddr2, emp1, emp2, emp3, pmp22, mki67; x-axis: time in years (0 to 14),
y-axis: cumulative survival (0.0 to 1.0).]
Comment: Plots were cut off at 180 months, i.e. 15 years. Gene expression levels higher than the
median are shown by a solid line, levels lower than or equal to the median by a dashed line.
Kaplan-Meier survival curves for disease-free survival show only small differences in survival times
between low and high gene expression levels. These differences are not statistically significant in
univariate Cox regressions (95% confidence intervals), as shown in the next chapter.
Kaplan-Meier curves for disease-free survival are plotted for the clinical variables (except G) and
for ER.
[Figure: Kaplan-Meier curves for disease-free survival, four panels: pN (pN0 vs. pN1), ER (≤ median
vs. > median), age (≤ 50 years vs. > 50 years), pT (pT1, pT2, pT3); x-axis: time in years (0 to 14),
y-axis: cumulative survival.]
Comment: Plots were cut off at a time level of 180 months, which are 15 years.
Differences in survival times between the groups seem to be quite large for pN and AGE.
One has to keep in mind that ordinary K-M curves reflect an unadjusted analysis, whereas multivariable
Cox regression provides an estimate of each effect adjusted for the other prognostic covariates.
Crude hazard ratios (relative risks) are obtained by exponentiating the estimated coefficients of the
univariate Cox regressions. A hazard ratio estimate of 1 means that the compared groups do not differ
in terms of survival, whereas a value lower than 1, for instance, indicates a lower risk for the group
with higher than median gene expression.
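The step from a Cox coefficient to a hazard ratio with its confidence interval can be sketched as follows. The interval is computed on the coefficient scale and then exponentiated (a Wald-type interval); the coefficient and standard error below are hypothetical and not taken from the fitted models of this report.

```python
import math

def hazard_ratio_ci(coef, se, z=1.959964):
    """Return (HR, lower, upper) for a Cox coefficient and its standard error.

    z = 1.959964 is the standard normal quantile for a 95% interval.
    """
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

# Hypothetical estimate: coefficient 0.75 with standard error 0.27.
hr, lower, upper = hazard_ratio_ci(0.75, 0.27)
print(round(hr, 2), round(lower, 2), round(upper, 2))
```

A coefficient of 0 gives a hazard ratio of exactly 1, and an interval that excludes 1 corresponds to a significant effect at the 5% level.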
** p<0.01
* p<0.05
HR Hazard Ratio
adj. HR HR in the multivariable Cox regression, adjusted for the clinical variables and ER
CI Confidence Intervals (95%)
A marker is only important if it adds information to the survival prediction. Therefore, we
adjust our analyses for established markers, which can be clinical variables as well as gene
expressions. It is known that variables are sometimes only significant if adjusted for other effects.
Indeed, multivariable Cox regressions reveal a weakly significant, independent effect of DDR2 on
disease-free survival time, adjusting for nodal status, differentiation grade, tumor size, age and ER
gene expression. More impressive is the impact of the gene expression PMP22 on all 3 survival
outcomes. Patients with higher PMP22 expression had shorter disease-free survival, overall survival
and tumor-specific survival times than patients with lower PMP22 expression. The other gene
expressions did not show a significant effect on any survival time in the multivariable case. Given
these results, we concentrated our analyses on the gene expression PMP22.
Disease-free survival
                        Crude HR   CI          Adj. HR (1)   CI
PMP22 expression        1.0        (0.7-1.5)   1.7*          (1.1-3.0)
Nodal Status            2.7**      (1.7-4.3)   2.5**         (1.5-4.0)
Differentiation Grade   1.1        (0.8-1.4)   1.1           (0.7-1.5)
Tumor Size              1.3        (1.0-1.8)   1.3           (1.0-1.8)
Age>50                  -          -           -             -
ER expression           0.6*       (0.4-0.9)   -             -
(1) stratified by ER≤median and ER>median
Overall survival
                        Crude HR   CI          Adj. HR       CI
PMP22 expression        1.2        (0.8-1.8)   2.1**         (1.2-3.6)
Nodal Status            2.4**      (1.5-3.9)   2.2**         (1.3-3.7)
Differentiation Grade   0.8        (0.6-1.1)   0.9           (0.6-1.3)
Tumor Size              1.6**      (1.1-2.1)   1.6**         (1.1-2.3)
Age>50                  0.8        (0.6-1.3)   0.9           (0.6-1.5)
ER expression           0.6*       (0.4-1.0)   0.4**         (0.3-0.7)
** p<0.01
* p<0.05
HR Hazard Ratio
adj. HR HR in the multivariable Cox regression, adjusted for the clinical variables and ER
CI Confidence Intervals (95%)
The prognostic value of PMP22 for all 3 survival outcomes is shown in the tables above, together with
histological data and age of the patients. In the univariate Cox regression, the level of PMP22 is
not associated with any survival outcome, whereas in the multivariable Cox regression patients with a
higher expression level of PMP22 had a significantly poorer disease-free survival than those with a
lower expression level (p=0.025). Patients with a higher than median expression level of PMP22 had a
1.7-fold (95% confidence interval, 1.1-3.0) higher risk of relapse than those with a lower than
median expression level. Similar results were obtained for overall survival (p=0.006) and, even more
markedly, for tumor-specific survival (p<0.001). Patients with a higher than median PMP22 expression
level had a 3.2-fold (95% confidence interval, 1.7-6.0) higher risk of dying on account of the tumor.
An even larger independent effect on breast cancer survival was confirmed for nodal status, adjusting
for PMP22, tumor size, differentiation grade and age of the patient at diagnosis. Patients with
negative nodal status tended to experience much better survival than those with nodal involvement
(DFS: p<0.001, OS: p=0.003, TS: p<0.001).
A larger tumor size was associated with poorer overall survival (p=0.007) and tumor-specific survival
(p=0.018). A negative but not statistically significant impact of tumor size on disease-free survival
was also found.
Older patients tended to have better overall and tumor-specific survival, but this finding is not
statistically significant in the adjusted case (p=0.670 and p=0.058, respectively). A higher than
median gene expression of ER is associated with better overall survival and better tumor-specific
survival. The same holds for its effect on disease-free survival in the univariate case (p=0.021). As
we show in the next chapter, in the multivariable (adjusted) case the effect of ER appeared to vary
with time, as revealed by a correlation of the Schoenfeld residuals with time (p=0.003). Therefore, we
stratified the multivariable Cox regression by ER. The remaining histological variable,
differentiation grade, did not show a significant prognostic value in either the univariate or the
multivariable case.
Differences between crude and adjusted hazard ratios could be due to correlations between the
examined variables. In this case, these differences are quite large.
To determine whether a fitted Cox regression model describes the data adequately, one has to check its
fundamental assumptions: (1) the proportional hazards assumption, (2) the multiplicative effect of
covariates on the hazard and (3) linearity in the relationship between the log hazard and the
covariates. Extensions of the model that relax these assumptions are now presented. Assumption (1)
can be relaxed by stratification, assumption (2) by using interaction terms between the covariates and
assumption (3) by integrating natural splines.
The proportional hazards assumption is crucial for Cox regression and means that the ratio between
the hazards of 2 patient groups remains constant over the complete follow-up period. This implies that
in Cox regression analysis a single relative risk is computed, which should apply to all recurrence or
death times. One way to formally detect violations of the proportional hazards assumption is to
test the significance of an interaction of a covariate with time. A different approach is to test
the slope of partial residuals, as proposed by Schoenfeld (1980, cited after Marubini/Valsecchi, 1995,
p. 244). This approach has the advantage that one does not have to specify the form of the
interaction term. By partitioning both the time axis and the space of the covariate values, mutually
exclusive categories of failure times with associated covariates are formed. The method compares the
number of events observed with the number expected under the Cox model in each of the “cells”
produced by this partition.
The function cox.zph in the R package “survival” tests the proportional hazards assumption for a
Cox regression model by using scaled Schoenfeld residuals. These are calculated according to the
method of Grambsch and Therneau (1993), because they reflect the log hazard ratio function better
than ordinary Schoenfeld residuals and are, furthermore, on the scale of the regression coefficients.
Residuals are weighted by Grambsch and Therneau's "average variance" method: each residual is scaled
by premultiplying it by a time-dependent variance matrix to obtain estimates of time-varying
coefficients.
Plots are produced by the cox.zph function. The time-dependent coefficient Beta(t) shows how the
effect of each covariate varies with time; the p-value for Beta(t) tests whether the slope of the
scaled residuals against time differs from zero.
[Figure: Disease-free survival. Plots of Beta(t) against time, one panel per covariate.]
“The solid line is a smoothing-spline fit to the plot, with the broken lines representing a ± 2-standard-error
band around the fit. Systematic departures from a horizontal line are indicative of non-proportional hazards.”
(Fox, 2002, p. 13)
[Figure: Overall survival. Plots of Beta(t) against time, one panel per covariate, including Beta(t)
for G and Beta(t) for pN.]
[Figure: plots of Beta(t) against time for a further outcome (title not recoverable, presumably
tumor-specific survival), one panel per covariate, including Beta(t) for G and Beta(t) for pN.]
The assumption of proportional hazards appears to be supported for nearly all covariates and all
survival outcomes. There is strong evidence of non-proportional hazards only for ER in the
disease-free survival analyses.
To correct for non-proportional hazards, a stratified Cox model is used. Stratification can be used if
non-proportional hazards are detected for a variable and the variable is not of interest in itself.
The model is extended by stratifying the data into subgroups, each identified by a level of the
factor, and applying the model
$$h_m(t, x) = h_{0m}(t) \exp(\beta' x)$$
where the suffix $m$ indicates the stratum ($m = 1, \dots, M$). This model assumes that individuals
within the $m$-th stratum who have different covariates still have proportional hazards, but
individuals in different strata are permitted to experience non-proportional hazards, because each
stratum has a different baseline hazard function.
The analyses show that stratifying by ER does not change the coefficients substantially. The hazard
ratio of PMP22 fell from 1.8 to 1.7. It may be that the time-dependent effect of ER was not too large.
If covariates are introduced into a Cox model without an interaction term, they are assumed to act
independently and multiplicatively on the hazard. Introducing an interaction term relaxes this
assumption. Because ignoring relevant interaction terms would lead to a misspecification of the
model, one has to test for interactions. First, all interaction terms between PMP22 and the clinical
variables and ER are analysed for all survival times.
p-values of the interaction terms
               DFS    OS     TS
PMP22 x pN     0.82   0.62   0.78
PMP22 x G      0.20   0.55   0.28
PMP22 x pT     0.24   0.03   0.13
PMP22 x age    -      0.29   0.65
PMP22 x ER     0.55   0.70   0.77
Only one interaction term between PMP22 and the other variables appears to be significant: PMP22
with tumor size in the overall survival analysis. This interaction is analyzed further by drawing
Kaplan-Meier curves for PMP22 within the categories of pT. Keep in mind that these K-M curves are not
adjusted for the other histological variables and ER.
[Figure: Kaplan-Meier curves for overall survival by PMP22 expression (≤ median vs. > median), four
panels: one per tumor size category (pT=1, pT=2, pT=3) and, on the lower right, all patients without
regard to tumor size; x-axis: time in years (0 to 14), y-axis: cumulative survival.]
Within the categories of tumor size, differences between the survival curves due to PMP22 become
apparent, at least for pT=2 and pT=3. This contrasts with the univariate case (graph on the lower
right), in which there seems to be no significant difference.
The analyses are extended to the adjusted case by integrating the interaction term into the
multivariable Cox regression for overall survival.
without Interaction
coef exp(coef) lower CL upper CL p-value
PMP22 0.75 2.13 1.24 3.63 0.01
pT 0.47 1.60 1.14 2.26 0.01
ER -0.82 0.44 0.27 0.74 <0.01
Age -0.11 0.90 0.55 1.46 0.67
G -0.11 0.90 0.63 1.29 0.57
pN 0.79 2.21 1.32 3.68 <0.01
with interaction
coef exp(coef) lower CL upper CL p-value
PMP22 -0.90 0.41 0.08 1.95 0.26
pT -0.07 0.93 0.51 1.70 0.82
ER -0.80 0.45 0.27 0.75 <0.01
Age -0.21 0.81 0.50 1.33 0.41
G -0.08 0.93 0.64 1.34 0.68
pN 0.84 2.31 1.38 3.86 <0.01
PMP22:pT 0.80 2.23 1.07 4.64 0.03
After inclusion of the interaction term PMP22:pT, the coefficient for PMP22 became insignificant. To
calculate the impact of PMP22 on overall survival, one now has to consider the interaction term too.
For instance, to calculate the effect of PMP22 for pT=1, one adds the interaction coefficient
multiplied by the value of pT (0.80 × 1) to the coefficient of PMP22 (-0.90), which results in -0.10.
Exponentiating this result yields the (adjusted) hazard ratio.
Overall survival
coef HR lower CL upper CL p-value
pT=1 -0.10 0.90 0.4 2.3 0.830
pT=2 0.70 2.02 1.2 3.5 0.011
pT=3 1.51 4.51 1.8 11.1 0.001
HR … Hazard Ratio
CL … confidence limit (95%)
Because the interaction effect between PMP22 and pT is positive, increasing size of tumors leads to an
increasing interaction term and therefore to an increasing impact of PMP22 on overall survival. For
the case of pT=1, no effect of PMP22 can be detected. For pT=2 and pT=3, PMP22 has an increasing
influence on overall survival.
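The arithmetic behind the pT-specific effects can be reproduced from the rounded coefficients reported in the table "with interaction" (PMP22: -0.90, PMP22:pT: 0.80). The sketch below is only an illustration of that calculation; because it starts from rounded values, its hazard ratios differ slightly from the reported ones (for example 2.01 instead of 2.02 for pT=2).

```python
import math

# Rounded coefficients as reported for the overall survival model with interaction.
COEF_PMP22 = -0.90
COEF_INTERACTION = 0.80   # PMP22:pT

def pmp22_effect(pt):
    """Return (log hazard ratio, hazard ratio) of PMP22 at a given tumor size level pt."""
    coef = COEF_PMP22 + COEF_INTERACTION * pt
    return coef, math.exp(coef)

for pt in (1, 2, 3):
    coef, hr = pmp22_effect(pt)
    print(f"pT={pt}: coef={coef:.2f}, HR={hr:.2f}")
```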
Confidence intervals were calculated by using Cox regressions in which the interaction term for the
analyzed level of pT was eliminated by shifting pT by that level. That is, in order to obtain the
confidence interval for the effect of PMP22 in the case of pT=1, a Cox regression was fitted in which
1 was subtracted from pT, so that the interaction term vanishes for pT=1. The estimated coefficient
of PMP22 then also encompasses the effect of the interaction term and represents the whole effect of
PMP22 at that level. This is shown in the following, according to the method of Figueiras et al.
(1998).
This Cox regression yields the impact of PMP22 on overall survival for pT=1.
Here the effect of PMP22 is clearly not significant. However, evaluating the effect at the other
levels of pT yielded a significant effect of PMP22 for pT=2 (p-value=0.011, 95% CI 1.2-3.5) and for
pT=3 (p-value=0.001, 95% CI 1.8-11.1). For the latter there could also be a positive correlation with
time, on account of significant Schoenfeld residuals (p-value<0.001). This would mean that the impact
of PMP22 for pT=3 is larger at later observation times. The survival curves differ for pT=2 and
pT=3, and a positive correlation with time can be seen for pT=3.
TS       Correlation coefficient (rho)   p-value
PMP22    0.37                            <0.001
ER       0.05                            0.624
Age      -0.05                           0.673
G        0.12                            0.252
pN       <0.01                           1.000
pT       -0.26                           0.019
[Figure: Schoenfeld residual plotted against time (y-axis from -10 to 10).]
[Figure: Kaplan-Meier curves for the 3 survival outcomes (one row of two panels per outcome). Left
panels: PMP22 expression combined with nodal status (≤ median, pN0; > median, pN0; ≤ median, pN1;
> median, pN1). Right panels: PMP22 expression alone (≤ median vs. > median). x-axis: time in years
(0 to 14), y-axis: cumulative survival.]
The graphical presentation of the confounding between PMP22 and pN gives results comparable to the
estimated Cox regressions. The 3 plots on the right side, which illustrate the crude effect of PMP22
on the 3 outcomes, show that survival time does not seem to be associated with the level of PMP22,
which is in line with the non-significant hazard ratios from the univariate Cox regressions. The plots
on the left side show how accounting for the correlation between PMP22 and pN reveals differences
between survival times. Patients with negative nodal status again tend to experience better survival
than those with nodal involvement; however, PMP22 seems to add further explanation for differences in
survival times. A higher than median expression of PMP22 has a clearly negative effect on survival
times, both for pN=0 and for pN=1. In the multivariable Cox regression we adjusted the effect of PMP22
not only for pN, but also for differentiation grade, tumor size, age and ER expression, and again
obtained a significant hazard ratio for all 3 survival outcomes.
“Nonlinearity – that is, an incorrectly specified functional form in the parametric part of the model – is
a potential problem in Cox regression as it is in linear and generalized linear models.” (Fox, 2002, p.
15) When the linear dependence of the log-hazard on the covariate is not believed to hold over its
entire range, one may extend the predictor to include a squared term to detect a possible departure
from the linear relationship. A more sophisticated approach to this problem consists in using a spline
function to model the relationship between log-hazard and predictors (Harrell et al., 1988; Durrleman
and Simon, 1989; cited after Marubini/Valsecchi, 1995, p. 195).
By using the function rcs from the R-package “Design” (Harrell, 2008) a linear tail-restricted cubic
spline function (natural spline) for PMP22 is integrated into the model.
           DFS                OS                 TS
           coef    p-value    coef    p-value    coef    p-value
PMP22      0.98    0.319      1.04    0.371      0.90    0.515
PMP22'     -0.97   0.717      -0.15   0.961      1.23    0.734
PMP22''    0.80    0.938      -3.69   0.752      -8.33   0.545
ER         -0.07   0.368      -0.17   0.094      -0.23   0.081
Age        -       -          0.01    0.176      -0.01   0.187
G          -0.02   0.912      -0.17   0.336      -0.02   0.914
pN         0.94    <0.001     0.76    0.004      1.23    <0.001
pT         0.12    0.449      0.34    0.050      0.29    0.137
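The terms PMP22, PMP22' and PMP22'' are the columns of a restricted cubic spline basis. The sketch below shows one common construction of such a basis (truncated cubic terms constrained to be linear beyond the outer knots, scaled by the squared knot range). It is an illustration in plain Python, not the rcs code itself; the knot locations are hypothetical, and the use of four knots is an assumption inferred from the three spline terms in the table.

```python
def rcs_basis(x, knots):
    """Restricted cubic spline basis for a scalar x: [x, x', x'', ...].

    With k knots the basis has k - 1 terms; the fitted function is linear
    below the first and above the last knot.
    """
    def cube(u):
        return max(u, 0.0) ** 3

    t_last, t_prev = knots[-1], knots[-2]
    norm = (knots[-1] - knots[0]) ** 2   # scaling for numerical convenience
    terms = [x]
    for t_j in knots[:-2]:
        value = (cube(x - t_j)
                 - cube(x - t_prev) * (t_last - t_j) / (t_last - t_prev)
                 + cube(x - t_last) * (t_prev - t_j) / (t_last - t_prev))
        terms.append(value / norm)
    return terms

# Hypothetical knots on the log2 expression scale.
knots = [-3.0, -1.5, -0.5, 1.0]
print(rcs_basis(0.0, knots))
```

Testing the joint significance of the nonlinear terms (here PMP22' and PMP22'') then amounts to testing linearity of the effect.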
[Figure: estimated log relative hazard (y-axis: log Relative Hazard), presumably as a function of
PMP22 expression (x-axis from -1.5 to 0.5), two panels.]
For each survival time, one can test whether the model with the cubic spline function yields a
significantly higher log-likelihood (LL) than the model with a linear effect of PMP22. The joint
contribution of the nonlinear spline coefficients to the likelihood is evaluated by applying the
likelihood ratio (LR) test, which gives the statistic:
$$Q_{LR} = 2\,(LL_{\text{spline}} - LL_{\text{linear}})$$
The statistic $Q_{LR}$ is asymptotically distributed as chi-square with two degrees of freedom.
The analyses show that for all survival times the assumption of linearity concerning the effect of the
gene expression PMP22 cannot be rejected at a significance level of 5% when cubic splines are used as
the alternative.
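The likelihood ratio test for the two nonlinear spline terms can be sketched as follows. The log-likelihood values are hypothetical placeholders, not results from the report. For two degrees of freedom the chi-square tail probability has the closed form exp(-Q/2), so no statistics library is needed.

```python
import math

def likelihood_ratio_test_2df(loglik_linear, loglik_spline):
    """Return (Q, p_value) for nested models differing by two parameters."""
    q = max(2.0 * (loglik_spline - loglik_linear), 0.0)
    # Survival function of a chi-square variable with 2 df: P(X > q) = exp(-q / 2).
    return q, math.exp(-q / 2.0)

# Hypothetical log partial likelihoods of the linear and the spline model.
q, p = likelihood_ratio_test_2df(-520.4, -519.9)
print(round(q, 3), round(p, 4))
```

With these placeholder values the p-value is far above 0.05, so linearity would not be rejected.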
Conclusion
Six candidate markers were evaluated for their ability to predict the survival of breast cancer
patients. For simplicity, the expression values of these markers were dichotomized (above median vs.
at or below median). The markers did not prove satisfactory in univariate analyses. However, in
multivariable Cox regressions, statistically significant associations were found between the gene
expression of PMP22 and all of the analyzed survival times, i.e. disease-free survival, overall
survival and tumor-specific survival. Further analyses showed that the significant effect of PMP22 in
multivariable Cox regressions seemed to be due to confounding by pN and, at least in the overall
survival case, by pT.
5 Literature
Breslow, NE. (1974) Covariance analysis of censored survival data. Biometrics, 30: 89-99.
Cox, DR (1972) Regression models and life tables. J R Stat Soc B 34: 187-220.
Efron, B. (1977) The efficiency of Cox's likelihood function for censored data. J. Amer. Statist. Assoc.
72: 557-565.
Kaplan EL and Meier P (1958) Nonparametric estimation for incomplete observations. J Am Stat
Assoc 53: 457-481.
Lagakos SW (1979) General right censoring and its impact on the analysis of survival data.
Biometrics, 35: 139-156.
Lam, P (2007) coxph: Cox Proportional Hazards Regression for Duration Dependent Variables, in
Kosuke Imai, Gary King and Olivia Lau, “Zelig: Everyone’s Statistical Software”
<http://gking.harvard.edu/zelig>
Mantel, N. and Haenszel, W. (1959) Statistical Aspects of the Analysis of Data from Retrospective
Studies of Disease. Journal of the National Cancer Institute, 22: 719-748.
Marubini E, Valsecchi MG (1995) Analysing survival data from clinical trials and observational
studies. Wiley.
R Development Core Team (2008). R: A language and environment for statistical computing. R
Foundation for Statistical Computing,Vienna. < http://www.R-project.org>
SAS Institute Inc. (2008) SAS for Windows, Version 9.2 SAS Institute Inc., Cary, NC, USA.
Schemper M, Smith TL (1996) A note on quantifying follow-up in studies of failure time. Control Clin
Trials 17: 343-346.
Therneau T M, Grambsch P M (2000) Modeling Survival Data: Extending the Cox Model, Springer.
Therneau T, ported to R by Lumley T (2008) survival: Survival analysis, including penalised
likelihood. R package version 2.34-1.