
Data Mining
(c) Dr. Orth, FH Ffm, Summer 2011
Contents
Data
Types and Scales of Data and the Role of Data in an Analysis
Applications and Methods
Contingency Tables
Linear modeling
Example: one x variable, one y variable
Example: two x variables, one y variable
General Solution of the Least Squares Problem in Linear Modeling
Example (continued): two x variables, one y variable
Changes in content or structure of the X matrix
Linear Modeling using R (another example)
Understanding the output of R's lm( )-function
Interpreting the model in the light of the application
Improving Interpretation by Centring
Centering and Scaling
Adding a Square Term to a model
Using stepwise regression to stabilize a model
Transforming the y variable before fitting (relative errors => log)
Regression diagnostics and model improvement
Blood Cell Count example (8 x-variables, one y-variable)
Estimating Confidence intervals and prognosis intervals for linear models
General linear modeling (GLM)
Fitting a piecewise constant function
Fitting Splines
Repeated Measures, Longitudinal Analysis, Mixed Effects Modeling
Estimating parameters using Maximum-Likelihood (ML)
Classification using logistic regression
Data Analysis for Categorical x-data
Example: two categorical x and one continuous y-variable
Analysis of variance
Using Linear Modeling to treat categorical x-data
Classification: categorical y-variables
Iris-Example: one categorical y-variable (3 categories) and four variables
A Global Method of Classification
A Local Method of Classification
Other global Classification Methods
Cox-PH for reliability and life-time analysis
Literature

Data
Types and Scales of Data and the Role of Data in an Analysis

Data can come in different "shapes" and "sizes". Data can be text: text can be atomic or text
can be structured, text can allow an ordering or simply be there ... Data can be numbers:
numbers can be discrete or continuous, continuous numbers can live on an interval, on a
semi-axis or on the whole real number line. Depending on the type of data, different Data
Mining methods will be used to extract information from them.

Types of data are:

continuous:
  - interval
  - semi-interval (ratio-scale)
  - unbounded

discrete:
  - multilevel quantitative
  - ordinal (i.e. ordered, rank scale)
  - qualitative (nominal, categorical)

There is a hierarchy in these scales: the nominal scale is the lowest level, then comes the ordinal
scale, and the continuous scale is the highest. This is because the nominal scale just reflects that
there is a difference between objects of observation (they belong to different classes) but it
does not reflect whether an object A is better or worse than an object B; the ordinal scale does
allow this but in turn does not reflect the notion of distance, i.e. of how much better object A
may be than object B. This is finally reflected by the continuous scales, where there is also a
notion of distance. There are different types of continuous scales, depending on whether they
are bounded or unbounded and whether they have a physical dimension, like length or weight,
or whether they are dimensionless, like ratios, angles and so on.

In some application contexts it is useful to introduce a multilevel quantitative scale, which is
theoretically a continuous variable, but which can in practice only be set to discrete values.
The idea is that the numbers that describe these settings do reflect how big the difference
between two objects under observation is, but not all settings (on the real number line or in an
interval) can be realized. The quantitative multilevel scale is especially important for
exogenous variables (i.e. x-variables, see below) when influencing factors can be set.

Of course on the computer, continuous scales can only be approximated by the available data
formats double or long double, but in any case mathematical operations like +, -, *, / are
possible. The highest scale would be one that is dimensionless (like ratio scales), because for
these all mathematical functions are available (e.g. log( ), tan( ), etc.), i.e. they can be
transformed. In statistics usually only monotonic functions are used for transformations, because
other functions could destroy information. More often than not continuous data are
scaled before they are analyzed (an example of a monotonic transformation). This will be
discussed later.

Data are also differentiated by their purpose or role in an analysis. The most common way to
differentiate data is to divide into x-data and y-data. The idea is that x-data are easily
available but not so important and y-data are hard to get but important. So it becomes
interesting to model (unknown) y-data using known x-data.

x-data: - available at low cost
        - available early (in time for making changes)
        - available in big amounts
        - BUT: essentially irrelevant for customer satisfaction or social benefit

y-data: - important
        - BUT: expensive
        - AND: available too late to change the world (or the part of it that is of current concern)

In the context of x- and y-data the terms exogenous and endogenous are sometimes used, exo- and
endo- being Greek for outside and inside respectively, thus indicating that x-data send
out something that y-data catch. In mathematics x-data are usually called independent
variables, y-data are dependent variables, and ideally y can be expressed as a function of x,
i.e. y = f(x). In the context of data mining and data analysis the word "function" is avoided and
replaced by the finer-sounding word "model".

The distinction into y-data and x-data is not always necessary, but it plays an important role in
many data-analysis or data-mining applications.

Applications and Methods

Some typical applications of data analysis are:


- dragnet investigations (Rasterfahndung), i.e. computer-aided search for wanted persons
whereby the data of a large number of people are checked against existing data in a
database,
- appraisal of creditworthiness, credit assessment, solvency checks (Bonitätsprüfung),
- customer relation management (CRM),
- controlling, quality assurance,
- Spam-filtering,
- risk analysis,
- clinical trials, process optimization,
- survival testing, robustness and longevity analysis,
- sensor calibration, pharmaceutical development,
and so on.

Methods used include descriptive statistics, distribution testing and fitting, simple statistical
tests, correlation and regression analysis, classification and clustering, analysis of variance
(ANOVA) and covariance, linear modeling (LM), generalized linear modeling (GLM) and
mixed effects modeling (LME), survival analysis, time series analysis, longitudinal analysis
and in general multivariate data analysis.

In this lecture emphasis will be placed on four groups of methods used to treat problems
according to the type or scale of the available data on the x- and on the y-side:

                    y continuous             y categorical
x continuous        linear modeling          classification (e.g. logistic regression)
x categorical       analysis of variance     contingency tables

As was stated in the last paragraph, not all data analytical problems concern x- and y-data, so
obviously not all data analytical methods fall into one of the four above categories.
Nonetheless awareness of the four categories, and of methods that can be classified accordingly,
is very useful.

Contingency Tables

The word contingency comes from the Latin word contingere, which means touching together
or happening together unexpectedly (in time). The word also comes from the Greek
endechomena, which means something that is possible. In philosophy it
describes a state of affairs that is neither necessarily true nor false, neither necessary nor
impossible. So contingency describes the finiteness of an existence that can be as it is, could
also be different, or could maybe not exist at all.

In statistics contingency describes the connection between two nominal characteristics. In the
simplest case there are two such nominal characteristics, and each of these two characteristics
has 2 possible settings.

Example: Characteristic 1: class attendance; settings: regularly, rarely
         Characteristic 2: exam success; settings: passed, failed
(Isn't this a mean example....?)

The data are collected in a so-called contingency table (4 fields table)

pass fail
regu 7 0
rare 1 5

The contingency table can be used to investigate if the two events regular class attendance and
exam success are touching together and happening unexpectedly together in time or

whether we are in a state that is neither necessarily true nor false, neither necessary nor
impossible.

Before we proceed, let us give names to all the players:

X1 \ X2     0       1      sum
   0      n1,1    n1,2    n1,.
   1      n2,1    n2,2    n2,.
  sum     n.,1    n.,2    n

Notice, that in the sum column on the right, there is a dot in the index on the right, and in the
row on the bottom, there is a dot in the index on the left. The dot replaces the index over
which the sum has been formed. These sums are called marginals or marginal sums.

So in the example, n1,.= 7 and n.,1 = 8.

An exact test of independence consists in calculating the probability of getting a given
contingency table assuming the marginals and hypothesizing independence. The hypothesis is
rejected if this probability is too low. Note that given the marginals and n1,1 the whole table is
uniquely determined. So it suffices to calculate P(X1,1 = n1,1 | marginals), where X1,1 is the
random variable "entry in field 1,1". This is a question of combinatorics:

P(X1,1 = n1,1 | n.,1, n.,2, n1,., n2,.) = C(n1,., n1,1) * C(n2,., n2,1) / C(n, n.,1)
                                       = n.,1! n.,2! n1,.! n2,.! / (n! n1,1! n1,2! n2,1! n2,2!)

(here C(n, k) denotes the binomial coefficient "n choose k").

In the example above this would mean:

P(X1,1 = 7 | n.,1 = 8, n.,2 = 5, n1,. = 7, n2,. = 6)
    = C(7,7) * C(6,1) / C(13,8)
    = 8! 5! 7! 6! / (13! 7! 0! 1! 5!)
    = (1*2*3*4*5*6) / (9*10*11*12*13)
    = 2/429, which is very small

So in this case we reject the Null Hypothesis of independence and accept the alternative
Hypothesis, namely that success in exams is highly dependent on regular attendance in the
exercise sessions (what a happy surprise!)

Obviously, given the marginal sums n1,. and n.,1, n1,1 must be less than or equal to the smaller of
these two and at the same time greater than or equal to 0, and the probabilities of all these possible
tables sum up to 1. So in order to establish significance, one should not only calculate
P(X1,1 = n1,1 | n.,1, n.,2, n1,., n2,.) for the n1,1 in question, but also for all numbers greater than
n1,1 (in the example this is not possible, since n1,1 is already as big as n1,.). Then one can establish
α-levels and make statements about significance.

When numbers get larger, exact tests like the one above (R has implemented Fisher's exact
test, which is very similar to what was done above) become very time consuming even for
a computer, and approximations are used. As for many other tests, the χ²-distribution can be
used. The test statistic can be calculated from the data using the formula:

    χ² = n (n1,1 n2,2 - n2,1 n1,2)² / (n.,1 n.,2 n1,. n2,.)

This statistic works if n > 60 and should be tested against the χ²-distribution with 1 degree of
freedom. It was originally suggested by Pearson.

For sample sizes between 20 and 60 one should either use a correction suggested by Yates:

    χ²corr = n (|n1,1 n2,2 - n2,1 n1,2| - n/2)² / (n.,1 n.,2 n1,. n2,.)

which again can be tested against the χ²-distribution with 1 degree of freedom, or one can use
Fisher's exact test directly. In R the call is fisher.test( ).
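A minimal sketch of how this looks in R for the attendance/exam table above (fisher.test( ) for the exact test, chisq.test( ) for the χ²-approximation):

# contingency table from the attendance / exam-success example
tab <- matrix(c(7, 0, 1, 5), nrow = 2, byrow = TRUE,
              dimnames = list(c("regular", "rare"), c("pass", "fail")))
fisher.test(tab)                   # exact test, appropriate for these small counts
chisq.test(tab)                    # Pearson chi-square with Yates correction (R's default for 2x2);
                                   # R will warn that the approximation may be inaccurate here
chisq.test(tab, correct = FALSE)   # Pearson chi-square without the Yates correction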

Linear modeling

Linear Modeling is used for modeling x and y data when both x and y are of continuous scale;
in fact, in principle x and y are in R^M and R^K respectively.

The idea of y depending on x is formulated in mathematics as y = f(x). In data analysis y and
x are not abstract, but concrete vectors of numbers, so rather than writing y = f(x), we write
yi = f(xi), i = 1, ..., N, where N is the number of observations, for each of which an x-value (resp.
x-vector in R^M) and a y-value (resp. y-vector in R^K) are available. Since real life is more difficult
than mathematics (who would have thought that, after so many years of torture ...) we need to
allow for some error, so we write yi = f(xi) + ei, where we postulate that ei is "nicely
behaved", for example normally distributed and independent with expectation 0 and constant
variance σ².

Now if f in the above discussion is too complex, it is hard to impossible to find, particularly if
there are many x- and y-variables (i.e. if x and y are vectors). So the idea is to restrict f in
such a way, that it belongs to certain classes of functions. The easiest type of functions are
linear functions, next easiest are polynomials, splines, step-functions and the like.

When talking about linearity it is always important to be aware of what exactly is meant by
linear, or more precisely, what depends linearly on what. A model can be such that y is linear in
x, i.e. y is proportional to x; if b is the proportionality constant, which can be interpreted
geometrically as a gradient, then y = xb. Linear Modeling does more than modeling y that is linear
in x! This is very important! Linear Modeling is concerned with modeling dependencies of
different kinds, where models contain unknown coefficients (like the b in what we just
discussed), and y depends linearly upon these coefficients!

So the linear model y = xb, the affine model y = a + xb, the square model y = a + xb + x²c,
as well as all polynomial models, step functions and regression splines belong to the class of

linear models, because they are all linear in the coefficients, the coefficients that are to be
estimated by data analysis.
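As a quick sketch (with made-up x and y vectors, not data from this text), all of these model classes are fitted with the same lm( ) machinery, precisely because they are linear in the coefficients:

x <- seq(0, 10, by = 0.5)
y <- 2 + 0.5 * x - 0.03 * x^2 + rnorm(length(x), sd = 0.2)   # made-up data
lm(y ~ x)                        # affine model        y = a + xb
lm(y ~ x + I(x^2))               # square model        y = a + xb + x^2 c
lm(y ~ poly(x, 3, raw = TRUE))   # cubic polynomial
lm(y ~ cut(x, breaks = 4))       # step function (piecewise constant in x)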

In linear modeling there may be one or many y-variables and one or many x-variables, and the
relation between them is supposed to be of the type: Y = X B + E,
where Y is a (N*K) matrix of y-values, K variables (columns) and N observations (rows),
where X is a (N*M) matrix of x-values, M variables (columns) and N observations (rows),
where B is a (M*K) matrix of coefficients, linking the x and the y variables,
and E is a (N*K) matrix of errors in the measurements of the y-values.

Example: one x variable, one y variable

Let's start with the easiest situation of one-dimensional x and y (K=1, M=1, and N=10). Then y
can be represented as a function of x: y = b0 + b1x, where in the case of the numbers at right,
b0 = 5.6 and b1 = 1.42.

Once the coefficients are known, two things can be done:
(A) Model-check: By comparing the values on the line, i.e. the values ypred = 5.6 + 1.42x, with the
values y, the quality of the model can be determined.
(B) Prediction: For new x not in the original set of x for which y-values are available, again
calculate ypred = 5.6 + 1.42x, and use this as a model prediction.
In the example the model is quite good, because the y are close to the ypred, usually denoted by
ŷ or, if an editor for formulae is not available, by ^y. Predictions can be seen as points on the
interpolated line. The deviations, y - ŷ, are called residuals, abbreviated by res.

The question of course is, how to get the coefficients? We'll see that later!

Example: two x variables, one y variable

A second example is the case where there are two x variables and one y variable, so there's x1
and x2 and y (M=3 (why not 2?), K=1, and N=4).

x1   x2    y    ^y    res
-1   -1    3    2.5   0.5
 1   -1    5    5.5  -0.5
-1    1    7    7.5  -0.5
 1    1   11   10.5   0.5

The linear model (in this special case linear also in the x-variables!) has the form y = b0 +
b1x1 + b2x2 + ε, ε ~ N(0, σ). Using x and y data as in the table above right, the coefficients
would be: b0 = 6.5, b1 = 1.5 and b2 = 2.5. This in turn would yield the model predictions
in the column ^y at top extreme right. Coefficients can be interpreted as height and
gradients. So 6.5 is the mean height of the plane, 1.5 is the gradient (steepness) when going
from left to right, 2.5 is the gradient when going from front to back. Using the predictions, ^y,
one can again calculate residuals as differences to the measured y and then estimate σ (= 0.5,
depending on the denominator used) as the standard deviation of these residuals. This is part
of the model check as in (A) in the first example.

An alternative set of coefficients could be b0 = 6, b1 = 1 and b2 = 2 with predictions:

x1 x2 y ^y res
-1 -1 3 3 0
1 -1 5 5 0
-1 1 7 7 0
1 1 11 9 2

Which set of coefficients is better and why? Well, look at the residuals, defined as the
difference between y and ^y: In the case at the top, the residuals' absolute values sum up to 2,
just as in the case at the bottom! So there's no difference here. But in the case at the top the
residuals are more homogeneous (i.e. more evenly distributed): their absolute values are all the same.

In fact, due to this, their sum of squares is smaller:

    Σ_{i=1..4} (yi - ŷi)² = 1  for the first model  and  Σ_{i=1..4} (yi - ŷi)² = 4  for the second.

So in the sense of least squares the first model is better. It is in actual fact the best, as can
be verified by checking out different sets of coefficients.

General Solution of the Least Squares Problem in Linear Modeling

Now using the matrix calculus as introduced above, Y would be a 4*1 matrix, X would be a
4*3 matrix and the coefficients, B, would be represented by a 3*1 matrix, to get the equation
Y = XB:

        | 3  |             | 1  -1  -1 |   | b0 |
    Y = | 5  | ,    X B =  | 1   1  -1 | * | b1 |
        | 7  |             | 1  -1   1 |   | b2 |
        | 11 |             | 1   1   1 |

and Σ_{i=1..4} (yi - ŷi)² = ||Y - Ŷ||² = (Y - Ŷ)^T (Y - Ŷ).

To minimize this, minimize (Y - XB)^T (Y - XB) = (Y^T - B^T X^T)(Y - XB) by differentiating
w.r.t. B using the product rule and setting the result to 0:

    X^T (Y - XB) + (Y^T - B^T X^T) X = 0.

Since the two summands are transposes of each other, the sum can only be zero if each summand
is zero, so X^T (Y - XB) = 0 and hence:

X^T (Y - XB) = 0   =>   X^T X B = X^T Y   =>   B = (X^T X)^(-1) X^T Y.

Notice that only due to the fact that the model is linear in B, the least squares functional is
quadratic in B, its derivative is again linear in B, and equating it to zero means solving a
linear set of equations, which can be done by mere matrix inversion. So linearity in B is
important. Otherwise finding the correct B would involve solving a non-linear set of equations,
which is a lot more difficult. Solving this would involve using classical iterative methods
like the Newton-Raphson algorithm or more modern computer-based methods like Simulated
Annealing to reach a solution.

Example (continued): two x variables, one y variable

For the example above, the matrix calculations would be:

    X:
    Nr   b0   b1   b2
     1    1   -1   -1
     2    1    1   -1
     3    1   -1    1
     4    1    1    1

so that X^T X = 4*I (a diagonal 3x3 matrix) and X^T Y = (26, 6, 10)^T.

Hence the coefficients are: b0 = 6.5 = 26 / 4, b1 = 1.5 = 6 / 4 and b2 = 2.5 = 10 / 4, as was
proposed at the outset. (Remark: X' means the same thing as X^T.)

Note that in the case above the X'X matrix is diagonal. This is because the columns of X are
orthogonal (i.e. the scalar products of different columns are 0). The inversion of X'X is then
just done by inverting the diagonal elements.
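As a sketch, the matrix solution for this small example can be reproduced in R by hand and compared with lm( ) (the object names X and Y are chosen here for illustration):

X <- cbind(b0 = 1, b1 = c(-1, 1, -1, 1), b2 = c(-1, -1, 1, 1))   # the design matrix above
Y <- c(3, 5, 7, 11)
t(X) %*% X                     # X'X = 4 * identity matrix (orthogonal columns)
t(X) %*% Y                     # X'Y = (26, 6, 10)'
solve(t(X) %*% X, t(X) %*% Y)  # B = (X'X)^-1 X'Y = (6.5, 1.5, 2.5)'
lm(Y ~ X - 1)                  # the same coefficients via lm(); -1 drops the automatic intercept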

Changes in content or structure of the X matrix

If there are small perturbations of the X-matrix, X'X is no longer diagonal, (X'X)-1 is
somewhat more difficult to calculate, but if the perturbations are not too big, this poses no real
problem for the computer. However, if perturbations are too large, problems may arise with
the inversion of X'X.

Hence the coefficients are: b0 = 6.44 = 26 / 4 0.2*6.6, b1 = 1.56 = -0.02*26 + 0.28*6.6 +


0.02*10 and b2 = 2.56 = 0.02*6.6 + 10 / 4. So this is no major change even though (X'X)-1
only has two zero entries left and looks complicated.

Adding observations (rows in X) leaves the structure of X'X unchanged, but tends to make the
numbers in X'X larger, hence those in (X'X)^(-1) smaller. Adding coefficients to the model adds
columns to X and makes X'X larger. In this case care has to be taken that there are
enough observations and that X'X does not degenerate but stays invertible.

Linear Modeling using R (another example)

To illustrate the possibilities of linear modeling we shall work with a small data set, in which
we try to predict the body size of persons from their shoe size.

(Imagine you're late for your basketball training session, your friends are all on the field in
the gymnasium and already dribbling. You enter the changing room, shoes everywhere,
and you want to know if the people there are taller or smaller than you, because basketball is
much more fun when everyone else is smaller than you ....)

    shoesize  bodysize
 1      38       153
 2      38       161
 3      39       167
 4      39       169
 5      40       173
 6      40       176
 7      41       182
 8      41       181
 9      42       188
10      42       189

[Figure: scatter plot of bodysize (Körpergröße) against shoesize (Schuhgröße)]

Abbreviating shoe size by shs and body size by bs, a linear model (linear in coefficients and
variables) would be:

ybs = b0 + bshs·xshs + ε, where ε ~ NID(0, σ)   (NID = normally and independently distributed
with homogeneous (i.e. constant) variance)

We want to find the coefficients b0 and bshs, because this will allow us to predict body size
from shoe size in the future. Ideally we also want an estimate for σ in order to judge the
quality of the predictions made: the bigger σ is, the less confidence we will have in our
predictions! (This is part of model diagnostics.)

In R the command sequence would be:

shoesize<-c(38,38,39,39,40,40,41,41,42,42)
bodysize<-c(153,161,167,169,173,176,182,181,188,189)
bspl<-data.frame(shoesize,bodysize)
fm<-lm(bodysize ~ shoesize, data=bspl) # fits the model
summary(fm) # outputs the model:

Call:
lm(formula = bodysize ~ shoesize, data = bspl)
Residuals:
Min 1Q Median 3Q Max
-5.6000 -0.8125 0.1250 1.7625 2.7500
Coefficients:

Estimate Std. Error t value Pr(>|t|)
(Intercept) -132.1000 22.9163 -5.764 0.000422 ***
shoesize 7.6500 0.5725 13.361 9.42e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.561 on 8 degrees of freedom


Multiple R-Squared: 0.9571, Adjusted R-squared: 0.9517
F-statistic: 178.5 on 1 and 8 DF, p-value: 9.416e-07

Understanding the output of R's lm( )-function

Residuals: at each x, there's a yobs (the dot in the graph) and a ypred (the position on the line
at the x-value in question); the difference, yobs - ypred, is the so-called residual.

> fm$residuals # or just: fm$re or residuals(fm)


1 2 3 4 5 6 7 8 9 10
-5.60 2.40 0.75 2.75 -0.90 2.10 0.45 -0.55 -1.20 -0.20

Min, 1Q, Median, 3Q and Max are respectively Min, first quartile, median, third quartile and
max of the residuals.

Coefficients: there are two coefficients, b0 also called the intercept and bx the
gradient to describe the regression line, given by ypred = b0 + bxx. The function call to
lm(formula = y ~ x, data = dummy)
uses the method "ordinary least squares" to fit the model (linear function) to the data, i.e.
the expression
    SSres = Σ (yi - (b0 + bxxi))²   is minimized (SS means Sum of Squares).
It is possible in R to utilize so-called "weighted least squares", meaning that different
observations are given different weights. To do this call
    lm(formula = y ~ x, data = dummy, weight=w^2)
so that
    SWSres = Σ wi² (yi - (b0 + bxxi))²   is minimized.
To be able to use weighted least squares the weights have to be known in advance.
Typically wi = 1/si where si is an estimate of the standard deviation of the measurement(s)
at the point i. In the shoesize/bodysize example above, no reasonable estimate for si can be
given. Hence no weights are used in that example.

Estimate, Std. Error, t value Pr(>|t|): see below!

Residual Standard error (RSE): The residuals have a mean value that's zero (because
otherwise a model with a better SSres could easily be constructed by just adding this mean
value to the intercept coefficient). But the standard deviation of the residuals is neither zero
nor one. The residual standard error is almost the standard deviation of the residuals,
s_residual, but not quite:
    s_residual := {1/N · Σ (yi - (b0 + bxxi))²}^(1/2) = √(SSres / N)
whereas
    RSE := {1/DFres · Σ (yi - (b0 + bxxi))²}^(1/2) = √(SSres / DFres)
A sharp eye is needed to spot the difference: it's in the denominator!

s_residual² is sometimes called the ML-estimator (maximum-likelihood estimator) of the variance,
whereas RSE² is the REML-estimator (restricted-ML estimator) of the variance.
(Maximum-likelihood estimation will be treated in a later chapter.)

Degrees of Freedom (DFres): Each observation has the freedom to be what it is, hence
each such value adds a degree of freedom to the system, and each coefficient in the
model takes one such degree of freedom away from the residuals, so
    DFres = #observed values - #coefficients = 10 - 2 = 8 (in our shoesize/bodysize example).
The idea is: if you fit a model with two coefficients to two values, there are no residuals,
because the line just satisfies the system of linear equations yi = b0 + bxxi for i = 1, 2.
Hence SSres = Σ (yi - (b0 + bxxi))² = 0.
The sum will become nonzero if a third observation is added (and the system is no longer
solvable), and it grows with every additional observation to be fitted. So in fact it's the
number of these additional observation values that should count toward DFres, which
should in turn be used to standardize the sum of squares.

R² (multiple R-squared): This is the so-called regression measure or fit measure,
equivalently defined as
(A) R² := [correl(y, ypred)]²              (squared correlation coefficient between observations and predictions)
(B) R² := 1 - SSres / SStot-corrected      (1 minus the relative unexplained variation)
(C) R² := SSregression / SStot-corrected   (relative explained variation)
where SStot-corrected := Σ (yi - ymean)²       (corrected sum of squares, i.e. variation)
and SSregression := Σ (yi,pred - ymean)²       (explained sum of squares, i.e. variation)

R²adj is the so-called adjusted R², which accounts for degrees of freedom:
R²adj = 1 - MSres / MStot-corrected, where MSres = SSres / DFres = RSE²
and MStot-corrected = SStotal-corrected / DFtotal-corrected = var = s².   (MS means Mean of Squares)
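These quantities can be verified by hand from the fitted model of the shoesize/bodysize example (a small sketch, assuming the objects fm and bodysize from the session above are still available):

SSres <- sum(resid(fm)^2)
SStot <- sum((bodysize - mean(bodysize))^2)       # corrected total sum of squares
DFres <- df.residual(fm)                          # 8
RSE   <- sqrt(SSres / DFres)                      # 2.561
R2    <- 1 - SSres / SStot                        # 0.9571
R2adj <- 1 - (SSres / DFres) / (SStot / (length(bodysize) - 1))   # 0.9517
c(RSE = RSE, R2 = R2, R2adj = R2adj)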

F-statistic: This is the tool for testing if the regression coefficients are statistically
significant. An F-value is always the quotient of two variances (i.e. two MS-values), in
this case MSregression / MSres, where MSregression is the explained variance:
    MSregression = SSregression / DFregression = Σ ((b0 + bxxi) - ymean)² / 1.
Why divided by 1, and why write it down? Again this concerns degrees of freedom, but
this time not residual degrees of freedom, DFresiduals, but DFregression. In this case only the
coefficient bx contributes to the variation of the regression model, i.e. the prediction values
ypred,i = b0 + bxxi. R can be made to calculate MSregression using the command:
    sum((fm$fitted.values - mean(fm$fitted.values))^2),
where fm$fitted.values can be substituted simply by fm$f.

In the shoesize example above, the value of the F-statistic is 178.5 on 1 and 8
degrees of freedom, because DFregression is 1 and DFresidual is 8. Using R this can be verified
using the formula:
    sum((fm$f-mean(fm$f))^2)/(sum(fm$re^2)/8)

When checking the theoretical F-distribution (after R.A. Fisher) on 1 and 8 degrees of
freedom, the corresponding p-value would be 9.416e-07. So a value of 178.5 would
have a probability of ~10^-6 if x and y had no connection and varied completely
randomly. This in turn can be checked using:
    pf(sum((fm$f-mean(fm$f))^2)/(sum(fm$re^2)/8),1,8,lower.tail=FALSE)
This means that it is very, very, very improbable to have an F-statistic of 178.5 on 1
and 8 DF assuming the null hypothesis that the y-data are completely random. So this
null hypothesis must be rejected at 90%, at 95%, at 99%, even at 99.9%.

Signif.-codes: In American literature on statistics it is common to use *'s to indicate the
level of confidence. *** means a 99.9% confidence level, as is the case for this model
(equivalently for the coefficient bx). Similarly b0 is also ***-significant, because its p-value
is 0.000422 < 1 - 99.9% = 0.001 (for the test see the next point, using the t-distribution).

Estimate, Std. Error, t value, Pr(>|t|): Coefficients can be tested to be significant. This is
done by comparing their value (Estimate) to their Std. Error.
The formula for the estimate is b = (X^T X)^(-1) X^T y (in the unweighted case), where b is the
coefficient vector consisting of b0 and bx, y is the vector of y-values, and X is a matrix
with 2 columns, one of which is filled with 1s, and the other is the vector of x-values. X^T
is the transpose of X (i.e. rows and columns exchanged), so X^T has two rows (the first of
which is just 1s) consisting of 10 values.
The Std. Errors are calculated from the square roots of the diagonal of (X^T X)^(-1)
(which is a 2 by 2 matrix) multiplied by RSE (see above). Since RSE is 2.561 and since
the two Std. Errors are 22.9163 and 0.5725, the square roots of the diagonal elements must have
been 8.948 and 0.2235 respectively (their values are not displayed by R).
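The whole coefficient table can be reproduced by hand along these lines (a sketch, again assuming the objects shoesize, bodysize and fm from the session above):

X    <- cbind(1, shoesize)                            # design matrix: column of 1s and the x-values
b    <- drop(solve(t(X) %*% X, t(X) %*% bodysize))    # estimates: -132.10 and 7.65
RSE  <- sqrt(sum(resid(fm)^2) / df.residual(fm))      # 2.561
se   <- RSE * sqrt(diag(solve(t(X) %*% X)))           # std. errors: 22.9163 and 0.5725
tval <- b / se
pval <- 2 * pt(abs(tval), df = df.residual(fm), lower.tail = FALSE)
cbind(Estimate = b, Std.Error = se, t.value = tval, p.value = pval)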

Interpreting the model in the light of the application

Judging the quality and the limits of a model, i.e. understanding the statistical output in the last
section, is only one part of statistical analysis. The other part, particularly when doing Data
Mining, is understanding the model. To do this it is necessary to understand the meaning of the
coefficients. One part is understanding what the physical unit of the coefficient is, given
the units for x and y; the other part is understanding the value of the coefficient. Neither is
always easy. In fact coefficients are often difficult to interpret directly in the context of the
application.

In the example above, one coefficient is -132.1, the other is 7.65. What do these mean and
what are the associated physical units? It doesn't make much sense to say: body size -132.1
cm! But this is the would-be size of an (obviously non-existing) person with shoe size 0.
Imagine extrapolating the line on the right toward 0 on the x-axis; that's why it's called
"intercept".

The other coefficient is just the gradient of the line. So its physical unit is not cm, but cm per
unit of shoesize, cm/sz-unit, and its value can be read off the graph of the fitted function:
increasing shoe size by 1 unit increases body size by 7.65 cm. This means going from shoesize
38 to shoesize 42 increases bodysize by 4 sz-units times 7.65 cm/sz-unit = 30.6 cm (from
roughly 158 to 189 cm).

To get the plot:


> plot(shoesize,bodysize) #these are the measured points
> abline(coef(fm),col="red") #this is the function
> points(shoesize,fitted(fm),col="red") #predicted points on the
#line (hardly visible)

Improving Interpretation by Centring

Centring means that the interval for x is shifted in such a way that it is centred around 0. In the
example the x-values vary from 38 to 42, hence xcentre = 40, and x - xcentre varies from -2 to 2,
an interval with centre at 0.

[Figure: bodysize (Körpergröße) vs shoesize (Schuhgröße) with the fitted regression line]

The data in the example have been designed, i.e. chosen at the will of the investigator. In such a
case it is recommended to use x - xcentre instead of x for fitting the model. When the x-data are
collected ad hoc, the mean value is usually used for centring, i.e. x - xmean is used instead of just
x.

In R centring can be performed by changing the x-variables after the ~ [please use I( ) for
technical reasons].
> fm_s=lm(bodysize ~ I(shoesize-40), data=bspl)
> fm_s

Call:
lm(formula = bodysize ~ I(shoesize - 40), data = bspl)

Coefficients:
(Intercept) I(shoesize - 40)
173.90 7.65

Centring has improved interpretability of the result. Why? Because now the intercept, here
173.9, is where the regression line crosses x - xcentre = 0, and this can be interpreted as a value
for the y-values at the centre of the investigated x-region, not somewhere far away at a non-
existent shoesize 0! So 173.9 cm corresponds to the size of a person with a shoe size of 40
(because for x = 40 the centred x-variable is 0).

Centering and Scaling

Scaling means that x-variables are dilated (stretched or shrunk) after centring, depending on
the size of the interval through which x is varied, in order to make coefficients of different
model terms comparable. After scaling x-variables and fitting the model for these scaled (and
centred) variables, all coefficients will be of the same physical unit as the y variable, and
hence all be comparable, even if the original x-variables had different units.

There are two ways commonly used to scale data after centering:
(A) min-max-scaling (sometimes called orthogonal scaling)
(B) unit-variance-scaling.

Min-max scaling should be used when there is a well-defined, typically predefined,
maximum and minimum of the data and when the data are well structured, as for example in a
designed experiment. In the case of the shoesize/bodysize example this can be said to be the case
for shoesize, because there are two data points each for shoesizes ranging from 38 to 42.

In min-max-scaling the formula is:   x_sc = (x - x_centre) / (x_max - x_centre)

In the case of the shoesize/bodysize example this would mean fitting:

> fm_sc=lm(bodysize ~ I((shoesize-40)/(42-40)), data=bspl)


> fm_sc
Call:
lm(formula = bodysize ~ I((shoesize - 40)/(42 - 40)), data = bspl)
Coefficients:
(Intercept) I((shoesize - 40)/(42 - 40))
173.9 15.3

On first encounter scaling does not seem to have improved the situation at all. The value of
the intercept has not changed compared to just centering. Only the value of the coefficient of
shoesize term has changed. It is now twice as big. So what?

With just centering (or with no preparation at all) the coefficient could be interpreted as
measuring the average change in size, when shoesize changed by 1. This change was 7.65 cm.
Now the coefficient measures the change that occurs, when you change from the centre-value
of 40 to the maximum value of 42, respectively to the minimum value of 38. This has two
advantages that may often be useful in application:

It is easy to read off the lowest possible and the highest possible prediction values from the
coefficients: the highest is coeff_intercept + coeff_shoesize, and the lowest is
coeff_intercept - coeff_shoesize; hence in the example the lowest predicted value would be
158.6 = 173.9 - 15.3, the highest predicted value would be 189.2 = 173.9 + 15.3.

It would be easier to compare coefficients (see remark above): If we had had a model that
considered other influencing factors (x-variables) that may have an effect on body size, like
possibly sex, age, thumbsize and so on, and all such factors were centred and scaled, they
would all have become dimensionless by scaling and centring and all coefficients would have
taken on the dimension of the response, in our case of body size, which is cm. (Before scaling
the unit would have been cm/unit of shoesize, or cm/year of age, or cm/mm of thumbsize.)

The second popular scaling method is unit-variance scaling (UV-scaling). Unit-variance
scaling is usually used when the data are not designed or when they are unstructured (this is
what is meant by "ad hoc"), i.e. when min and max are not very reliable (often min or max
are just outliers!).

In UV-scaling the formula is:   x_UV = (x - x_mean) / s,   where x_mean and s are the mean and standard
deviation of the data. The scaling is called unit-variance because after the scaling has been done,
the variance of the variable will just be 1, as of course will be the standard deviation, which is
just the square root of the variance.
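A short sketch of UV-scaling for the shoesize example (using the same I( ) mechanism as above; scale( ) does the centring and UV-scaling in one step):

fm_uv  <- lm(bodysize ~ I((shoesize - mean(shoesize)) / sd(shoesize)), data = bspl)
fm_uv2 <- lm(bodysize ~ scale(shoesize), data = bspl)   # equivalent, using scale()
coef(fm_uv)   # intercept = mean body size; slope = change in cm per one standard deviation of shoesize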

Adding a Square Term to a model

In the case of the bodysize/shoesize example a model that is linear in the x-variable shoesize
seemed quite adequate. Looking at the plot of the regression function and the observed values,
there was good correlation (measured by the dimensionless R²-measure) and small deviation
(measured by RSE in the unit cm). In many real-life Data Mining problems linear models are
not adequate. There are different ways in which models can be improved.

One idea that might improve the model is using a model of the form:
    ybs = b0 + bshs·xshs + bshs*shs·xshs² + ε, where ε ~ NID(0, σ)
This model will allow for some curvature instead of just fitting a straight line.
Using R this can be done by:
fmsq<-lm(bodysize ~ shoesize + I(shoesize^2), data=bspl)

The I(..) around the shoesize^2 term is necessary because the ^ operator has a special meaning
inside model formulas; wrapped in I( ) it is treated as the ordinary arithmetic squaring when
lm( ) does the fitting. The result is the following.
Note that there is a bad surprise in the result we get (see below)!!!
> summary(fmsq)
Call:
lm(formula = bodysize ~ shoesize + I(shoesize^2), data = bspl)

Residuals:
     Min       1Q   Median       3Q      Max
-4.52857 -0.84643  0.06429  0.98929  3.47143

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)    -988.1714   761.0281  -1.298    0.235
shoesize         50.5071    38.0865   1.326    0.226
I(shoesize^2)    -0.5357     0.4760  -1.125    0.298

Residual standard error: 2.519 on 7 degrees of freedom
Multiple R-Squared: 0.9637, Adjusted R-squared: 0.9533
F-statistic: 92.87 on 2 and 7 DF, p-value: 9.13e-06

[Figure: fitted(fmsq) plotted against shoesize (Schuhgröße)]

Well, what's the bad surprise?

For one thing, the RSE is smaller and the R²-measure is bigger. These are both signs that the
model has a better fit than the linear model above. That's not a bad surprise.

So, what's the bad surprise?

In any case the graph is nice too:

> fitted(fmsq)
     1      2      3      4      5      6      7      8      9     10
157.52 157.52 166.78 166.78 174.97 174.97 182.08 182.08 188.12 188.12

> plot(shoesize, fitted(fmsq))
> points(shoesize, bodysize)
> points(shoesize, fitted(fmsq), col="red")
> lines(shoesize, fitted(fmsq), col="red")
> abline(coef(fm), col="green")

So what's the bad surprise?


The bad surprise is that none of the three coefficients are statistically significant!! Not the
intercept, not the linear term, and not the square term either! And none of them can be
interpreted!!

Centring improves the situation but does not cure it:


> fm_s_sq<-lm(bodysize ~ I(shoesize-40) + I((shoesize-40)^2),data=bspl)
> fm_s_sq
Call:
lm(formula = bodysize ~ I(shoesize - 40) + I((shoesize - 40)^2), data =
bspl)
Coefficients:
(Intercept) I(shoesize - 40) I((shoesize - 40)^2)
174.9714 7.6500 -0.5357

This situation can further be improved by centering and scaling! This will be illustrated using
the min-max scaling technique already discussed above:
> fm_sc_sq=lm(bodysize ~ I((shoesize-40)/(42-40)) + I((shoesize-40)^2/4),
data=bspl)
> fm_sc_sq

Call:
lm(formula = bodysize ~ I((shoesize - 40)/(42 - 40)) + I((shoesize -
40)^2/4), data = bspl)

Coefficients:
(Intercept) I((shoesize - 40)/(42 - 40))
174.971 15.300
I((shoesize - 40)^2/4)
-2.143

> summary(fm_sc_sq)

Call:
lm(formula = bodysize ~ I((shoesize - 40)/(42 - 40)) + I((shoesize -
    40)^2/4), data = bspl)

Residuals:
Min 1Q Median 3Q Max
-4.52857 -0.84643 0.06429 0.98929 3.47143

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 174.971 1.241 140.955 2.39e-13 ***
I((shoesize - 40)/(42 - 40)) 15.300 1.126 13.582 2.76e-06 ***
I((shoesize - 40)^2/4) -2.143 1.904 -1.125 0.298
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.519 on 7 degrees of freedom


Multiple R-Squared: 0.9637, Adjusted R-squared: 0.9533
F-statistic: 92.87 on 2 and 7 DF, p-value: 9.13e-06

When checking significance of the coefficients, we see that now the intercept and the linear
coefficient are both statistically significant. Only the squared coefficient is not significant.
This is a considerable improvement, considering that without centring and scaling none of the
three coefficients is significant, and in particular the coefficients cannot be interpreted /
understood.

With centring and scaling interpretation is easy. Intercept and linear coefficient can be
interpreted as in the linear model (see above), and the interpretation of the square term is just
as easy: Due to centring and scaling the extreme values for x are -1 and 1, so x² is maximally
+1, and if the coefficient of the square term is bxx, then bxx is the maximal correction of the
linear model due to taking up the square term in the model. In this example this correction
is so small, namely -2.143 cm, that statistical analysis (the t-test) classifies it as insignificant.
Just check RSE, which is still 2.519.

Using stepwise regression to stabilize a model

Stepwise regression is a procedure, that chooses model coefficients out of those given in the
formula-specification in the lm()-function call, in order to make a parsimonious model (i.e.
with as few coefficients as possible).

The function step( ) uses the so-called Akaike Information Criterion (AIC) to decide which
terms to eliminate from the model. AIC is somewhat difficult to understand, it can be
calculate using AIC = 2loglik + 2p, where loglik = i=1,,N log (Pr true estimate (yi)). loglik (log-
likelihood which will be explained in the section on Maximum Likelihood) is a negative
number; its absolute value gets smaller when the fit gets better. Since adding a model term
would add 2 to AIC (makes AIC worse) loglik has to improve by at least 1 to compensate this
increase.
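As a side note (a sketch): the AIC numbers printed by step( ) can be reproduced with extractAIC( ), which for lm-models uses n·log(RSS/n) + 2p, i.e. a log-likelihood only up to an additive constant, so its values differ from those of the AIC( ) function:

extractAIC(fmsq)   # edf = 3, AIC = 20.91  (model with the square term)
extractAIC(fm)     # edf = 2, AIC = 20.57  (model without the square term)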

Below is the result of applying step( ) to fmsq (see section above). The output should be
interpreted as follows (see below):
Start: AIC= 20.91: AIC was originally 20.91; since p=3, loglik is -7.455 = -(20.91 - 6)/2.
- I(schuhgröße^2): Taking away the square term gives an AIC of 20.573; since p is now 2,
loglik is -8.286 = -(20.573 - 4)/2.
<none>: leaving the model as it is, i.e. taking away nothing, obviously doesn't change the AIC
and doesn't change loglik.
- schuhgröße: taking away the linear term increases AIC to 21.151, so since again p=2, loglik
is -8.575 = -(21.151 - 4)/2.

So the model körpergröße ~ schuhgröße is best, as it has the lowest AIC and hence the
highest loglik of all models with p=2, and this model is investigated further. Can leaving out
more terms improve the model again?
Step: AIC= 20.57: This is the AIC we continue with.
<none>: leaving the model as it is, i.e. taking away nothing, obviously doesn't change the AIC
and doesn't change loglik.
- schuhgröße: taking away the linear term increases AIC to 50.06, and since now p=1 (just
the intercept) loglik is now -24.03 = -(50.06 - 2)/2. For this model both AIC and loglik are very,
very bad, so the linear model is best. For this model, coefficients are
given. They correspond to the coefficients in the sections above.

> step(fmsq)
Start:  AIC= 20.91
 körpergröße ~ schuhgröße + I(schuhgröße^2)

                  Df Sum of Sq    RSS    AIC
- I(schuhgröße^2)  1     8.036 52.450 20.573
<none>                          44.414 20.910
- schuhgröße       1    11.158 55.572 21.151

Step:  AIC= 20.57
 körpergröße ~ schuhgröße

             Df Sum of Sq     RSS   AIC
<none>                      52.45 20.57
- schuhgröße  1   1170.45 1222.90 50.06

Call:
lm(formula = körpergröße ~ schuhgröße, data = bspl)

Coefficients:
(Intercept)   schuhgröße
    -132.10         7.65

In the example just discussed, the x-variables were not scaled and centred. Scaling and
centring does not change the values of loglik or AIC, so for choosing the best model using the
step( )-function it does not matter whether scaling and centring has been done. It does
matter when interpreting coefficients, though, as has been discussed in the sections above.

Transforming the y variable before fitting (relative errors => log)

It is possible to fit a model of the type
    ybs = exp(b0 + bshs·xshs + ε), where ε ~ NID(0, σ)
        = exp(b0) · exp(bshs·xshs) · exp(ε).
This is what is meant by multiplicative or relative errors. Note that for small ε, exp(ε) ≈ (1 +
ε), so that the error in y is exp(b0) · exp(bshs·xshs) · ε, which is a relative error in the sense that this
error is distributed according to NID(0, σ·E(ybs)), where E(ybs) is the expectation value of ybs
(i.e. the deterministic part of ybs).
To do this in R, first transform y using log( ), and then fit log(y) using lm( ). For prediction the
log has to be inverted. The logarithm is used when errors are relative, but any other invertible
transformation can be used, like y^(1/2), neglog(y) = -log(1-y), logit(y) = log(y) - log(1-y),
etc.

Please note that R does not follow the usual naming convention for logarithms: normally
ln( ) denotes the inverse of exp( ), and log( ) is log10( ), i.e. the inverse of 10^x; in R
however log( ) is the inverse of exp( ), and log10( ) should be used to get the usual log10( ).

Constructing the model for log(y):


> fmlog<-lm(log(bodysize) ~ shoesize, data=bspl)
> abline(coef(fmlog))
> fitted(fmlog)
1 2 3 4 5 6 7 8 9 10
5.0676 5.0676 5.1120 5.1120 5.1564 5.1564 5.2007 5.2007 5.2451 5.2451

The fitted.values have to be retransformed for them to be interpretable. This is done by


exponentiation:
> exp(fitted(fmlog))
1 2 3 4 5 6 7 8 9 10
158.80 158.80 166.01 166.01 173.54 173.54 181.41 181.41 189.64 189.64
To get the plot at right, use the commands:

> plot(shoesize, bodysize)
> points(shoesize, bodysize)
> points(shoesize, exp(fitted(fmlog)), col="red")
> lines(shoesize, exp(fitted(fmlog)), col="red")
> abline(coef(fm), col="green")

[Figure: bodysize (Körpergröße) vs shoesize (Schuhgröße) with the exponential model (red) and the linear model (green)]

The red line is the exponential model, the green line is the linear model. Note the slight
curvature in the exponential model.

Also note that the curvature goes the other way than for the quadratic model used
above. This is an indication that the transformation should not be used, because
it makes the model worse than the linear model.
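For predictions on the original scale the back-transformation is applied to the output of predict( ); a minimal sketch for a new (hypothetical) shoe size of 43:

new <- data.frame(shoesize = 43)
exp(predict(fmlog, newdata = new))   # prediction of the log-model, back-transformed to cm
predict(fm, newdata = new)           # for comparison: prediction of the plain linear model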

Regression diagnostics and model improvement

After fitting a model and after investigating general model properties, like coefficients, R²-value
and RSE (residual standard error), it is absolutely necessary to check the residuals to get an
idea how the model can possibly still be improved.

There are several characteristic plots which give precise information on how to judge and
improve a model. These techniques work best for designed data but can also be used, albeit
with care, for ad-hoc data.

Diagnostic plots
Name                What to look for?                  What to do?
Res vs variable     curvature                          take up a square term
Obs vs Fit          strong / lonely outlier            make sure the obs value was correct
Res vs Fitted       trumpet form                       transform the y variable
N-plot / qq-plot    outlier top right / bottom left    don't trust the model here
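Most of these plots can be produced directly from a fitted lm-object; a small sketch using the shoesize/bodysize model fm from above:

par(mfrow = c(2, 2))
plot(fm)                   # R's standard diagnostics: residuals vs fitted, qq-plot, scale-location, leverage
par(mfrow = c(1, 1))

plot(fitted(fm), resid(fm)); abline(h = 0)   # residuals vs fitted values by hand
qqnorm(resid(fm)); qqline(resid(fm))         # normal probability (qq) plot of the residuals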

Residuals vs variable:

[Example plot from a designed experiment (MLR, response: Gesamtausbeute / total yield): residuals over an x-variable, showing the curvature of a missing square term]

Observed vs Fitted values:

[Example plot: observed vs predicted values]

Residuals vs Fitted values:

[Example plot: residuals vs predicted values; model summary shown with the plots: N=20, R²=0.823, R² adj.=0.627, DF=9, Q²=0.111, RSD=5889073]

Normal probability plot / quantile-quantile plot:

One further diagnostic tool is the so called normal probability plot, also called qq-plot
(quantile-quantile-plot), depending on which way the axes are chosen. The idea in this plot is
to plot the theoretical position of the i-th value of n normally (or otherwise) distributed values
against the true positions of the measured values in this case the residuals.

The idea is: one wants to test if some data are normally distributed; it works the same way for
other distributions (e.g. Weibull):

Step 1: compare to the normal density
=> hard to see if it is good

Step 2: calculate ranks: (a) mean rank: i / (N+1) or (i - 0.5) / N (for an expected normal
distribution), or (b) median rank: (i - 0.3) / (N + 1 - 0.6) (for an expected Weibull distribution),
and compare to the cumulative distribution
=> still not easy to see

Step 3: rescale the y-axis:
now values and ranks should lie on a straight line.
Outliers are at the bottom lying far to the left or at the top lying far to the right.

If the shape is not reasonably straight, then the type of the distribution was wrong.

When analysing residuals it is very difficult to draw the correct conclusion from this kind of plot.

R uses the qq-plot for regression diagnostics in which x- and y-axes are exchanged, i.e.
theoretical values are on the x-axis and outliers are on the left very far down or on the right
very high up in the plot.
The R-command is:
> qqnorm(resid(mCD3p), main="Residuals Rankit Plot")

(Note: qqnorm does not do exactly the same procedure as described above for the normal
probability plot.)

[Figure: "Residuals Rankit Plot": Sample Quantiles of the residuals against Theoretical Quantiles]

Blood Cell Count example (8 x-variables, one y-variable)

To exemplify model diagnostics and to illustrate some of the possibilities we shall investigate
the following example:

age/[months]  Leukos  Lymphos  CD14p  CD3p  CD4p  CD56p  CD8p  CD19p
      2         9320  3464.24    829  2144  1728    151   328    937
      2         9090  6033.94    852  3651  2663    326   918   1746
      3         9030  6438.39    737  4414  3374    265   865   1481
      3         8500  6108.10    471  5052  2804    184  2112    748
      4        10800  7005.96    564  4627  3359    238  1142   1929
      5        11600  8270.80    503  6260  4748    296  1283   1610
      6         5900  5382.57    231  3132  2243    181   758   1873
      6         6900  5301.27    351  3381  2263     99  1067   1743
      7        12700  8327.39    627  5543  3772    153  1599   2482
      7         8160  4994.74    332  3949  2856     52   927    859
      8        10500  7206.15    402  5222  3198    490  1806   1159
      9         9980  6005.96    578  3907  2304    554  1395   1436
      9        10200  7734.66    359  5379  3767    308  1410   1758

The data consist of cell-count data for different leukocyte subtypes of healthy children. The
first column is age in months. The counts are given in number of cells per µl of blood. They
have been measured using flow cytometry, an analytic method in which cells are marked and
coloured according to cluster molecules that they present on their surface. Once marked they can
be counted using powerful microscopes and software.

Leukocyte is another name for white blood cells. Subtypes are Lymphocytes, Monocytes
and Granulocytes (the latter have been excluded from this study). The abbreviation CD
means Cluster Designation and is a state-of-the-art way to classify cells in the immune system.
In fact CD14+ are monocytes. Lymphocytes have three subtypes: CD3+ are T-cells (cells that
have been trained in the thymus); CD19+ are B-cells (they grow up in the bone marrow and are
not trained outside); CD56+ are natural killer cells. T-cells are again differentiated into
CD3+CD4+, so-called T-helper cells, and
CD3+CD8+, so-called cytotoxic cells.
The idea is to find out in which way cell counts depend on age. In fact this can be seen just by
plotting one (or all) of the cell counts over age. This has been done in the plot at right for
CD3+ cells.

So, having imported the data into ss08_0423:

> attach(ss08_0423)
> plot(age,CD3p)

[Figure: scatter plot of CD3p against age (alter)]

There is an obvious dependence, but it is just as obviously not linear. In applications where there is
just one x variable and one y variable, a simple scatter plot like the one shown here can illustrate
that the dependence is more complicated. In higher dimensions it's better to use residual plots
after fitting a linear model.

Fitting the model and plotting the line using
> mod<-lm(CD3p ~ age, data=ss08_0423)
> abline(mod$coef)
gives the line as above.

Residuals, i.e. the difference between measured values and fitted.values, can be plotted using
> plot(age,mod$resid)
> abline(0,0)

[Figure: residuals (mod$resid) against age (alter)]

This gives the plot at right. Obviously the residuals are not homogeneous. This means the model
has to be refined, because using least squares regression, as is done in the lm( )-function,
requires homogeneous error variance, and inhomogeneous residuals are a very strong indication
that this assumption is violated!

Another good way of analysing residuals is to plot residuals against the fitted values using the
command:
> plot(mod$f,mod$res)

[Figure: residuals (mod$res) against fitted values (mod$f)]

The plot at right suggests that when CD3+ values are small, residuals are small too.

This is a strong indication that smaller values are more precise than larger ones and that this
should be considered in the regression. There are two ways of doing this (a sketch of both
follows after the list):

1 - using weighted regression: when calling lm(..) set the parameter weights:
    lm(formula, data, subset, weights, na.action,
       method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
       singular.ok = TRUE, contrasts = NULL, offset, ...)

2 - transform the marker values using a contracting transformation (like log(...) or root(...))
before modelling them. Taking logs is a good way to treat an error structure that is
multiplicative and not additive (i.e. relative errors are homogeneous).
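A minimal sketch of both options (the weights in option 1 are a hypothetical choice, inverse squared fitted values, which roughly corresponds to relative errors; the log10 in option 2 is an assumption made here to match the scale of the plots below):

# option 1: weighted regression
w     <- 1 / fitted(mod)^2                              # hypothetical weights
mod_w <- lm(CD3p ~ age, data = ss08_0423, weights = w)

# option 2: contracting transformation of the response before fitting
mod_log <- lm(log10(CD3p + 1) ~ age, data = ss08_0423)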

Take the log of the values before importing them into R. In fact I suggest taking log(y + 1), so
that the occasional y-value of 0 does not pose a problem [log(0) is not defined].

For log(CD3p+1) vs age we get the following scatter plot:

[Figure: log(CD3p+1) against age (alter)]

This is still not a linear relation! But we know from immunology that "immunological age" is not
linear in time. It would be a good idea to find a transformation that transforms age into
"immunological age". This should be done with data from outside of this data set, in order not to
"sacrifice information".

In the following considerations we transformed age by taking square roots. This is not quite
satisfying, because above, say, 20 years of age the immunological status does not change much
any more.

After taking square roots and modeling, the relation looks like this, e.g. for CD3p:

> mCD3p<-lm(CD3p ~ age,data=ss08_0423)
> abline(mCD3p$coef)

[Figure: transformed CD3p against transformed age (alter), with the fitted regression line]

A further possibility to investigate the model quality is to plot the histogram of the residuals:

> hist(mCD3p$residuals)

The histogram is still skew! Below is the corresponding plot of residuals against the fitted
values.

Can we support the normal probability assumptions?

[Figure: Histogram of mCD3p$residuals]
[Figure: plot of the residuals (mCD3p$res) against the fitted values (mCD3p$f)]

The commands
> plot(fitted(mCD3p),resid(mCD3p))
> plot(mCD3p$f,mCD3p$res)
are identical.
Put the line in using
> abline(0,0)

There is a slight problem with the three values at the bottom right. So it seems rather to be a
problem with the three values than with the model. But one is never 100% sure!

Estimating Confidence intervals and prognosis intervals for linear
models

Confidence intervals vs prognosis intervals


When estimating prediction error the first question is whether one is interested in estimation
error or prognosis error. Estimation error is the expected deviation of the model prediction
from the true response value at a given prediction point in the experimental domain. So, if
the model is valid, estimation error would go to zero on repeating the design until eternity,
determining the model coefficients from this infinity of results, and making the prediction
with this model: the (say) 95% confidence interval would degenerate to a point, the true
response value.

Prognosis error for a new observation at the prediction point cannot degenerate to a single
point just by repeating the design: every new observation has its intrinsic right to err, and
the expected size of this error cannot converge to zero by merely repeating experiments. The
situation is well known from quality control: it is one thing to ask how a mean value deviates
(the answer being +/- s/sqrt(n)) and another to ask how a new measurement deviates, given the
information from the design (the answer here being +/- s*sqrt(1 + 1/n)).
These are standard errors; in order to get the corresponding confidence and prognosis
intervals, just multiply by the corresponding t-value, i.e. ~2 for a 95% and ~3 for a 99.5%
confidence level.

Simple Bootstrap for estimating standard deviation and percentile intervals for
characteristics like mean value, median, standard deviation, correlation coefficient
The famous Baron von Münchhausen succeeded in escaping from a swamp by pulling himself up by
his own hair (or bootstraps, depending on who tells the story). In German versions of the
story it is "by his own hair", in most English versions it is "by his bootstraps". In any
case, for Bradley Efron, who in 1979 invented the bootstrap method for estimating statistical
error and for determining confidence intervals, Baron v. M. was the inspiration at least for
the name of the method (see [ET], p. 5)!

The bootstrap consists of taking samples from the sample in order to induce variation.
Imagine you had a sample of n observed values. Then calculate your favourite characteristic,
e.g. a median or a correlation coefficient. In order to calculate its standard deviation or
its confidence interval, construct a virtual n-faced die, where you have marked the n faces
with the n observed values (or pairs of values). Now use the die to get R = 1000 samples of
size n (you need to throw n*1000 times). For each such sample calculate the characteristic of
interest. Sort the 1000 characteristic values and pick out the percentiles of interest. For a
95% percentile interval pick the 25th and the 975th values.

R-code for the simple bootstrap

#Data import from Clipboard (column name must be z)
boot_ex <- read.delim("clipboard",dec=",")
z<-boot_ex$z
hist(z)
mean(z)

#preparing for the bootstrap
n <- 1000                     #number of bootstrap samples
charv <- array(0,dim=n)       #vector for the bootstrap characteristics

#Simple bootstrap, draws randomly from the values
for(i in 1:n){
z_star <- sample(z,size=length(z), replace=T)  #bootstrap sample
charv[i] <- mean(z_star)                       #bootstrap characteristic
}
#Quantiles, quartiles, histogram
sort(charv)[c(975,500,25)]    #97.5%, 50% and 2.5% percentiles
sort(charv)[c(750,500,250)]   #quartiles and median
hist(sort(charv))

Classical method for estimating prediction error and prognosis error

In linear modelling error propagation is a little more complicated than for mean value
estimation, but assuming a valid model, i.e. without bias, and independent and close to normal
experimental errors, both estimation and prognosis error can be estimated. Model predictions
are just weighted means, i.e. linear combinations of observed response values, which can be
expressed by a simple matrix equation of the form
y^ = Hy.
The entries of the H-matrix are weights that can be calculated from the design and the model
equation using H = X(X^T X)^-1 X^T, where X is the extended design matrix. In statistics
jargon H itself is called the Hat-Matrix, because it puts the hat on the y: y^ = Hy, i.e. it
allows estimating predictions directly from the observations. The diagonal elements, h_i, of
the Hat-Matrix determine the influence of an observation, y_i, on its own prediction, y^_i.
This so-called leverage is typically high for a corner point in a design and usually very
small for centre points. Estimation error at a new (extended) prediction point, x+, is then
given by +/- RSD*sqrt( x+ (X^T X)^-1 x+^T ), where RSD is the residual standard deviation,
i.e. the linear modelling equivalent of the standard deviation, s. Prognosis error at the new
(extended) point will then be +/- RSD*sqrt( 1 + x+ (X^T X)^-1 x+^T ). As in mean value
estimation the corresponding confidence and prognosis intervals are obtained by multiplying by
the correct t-value, i.e. ~2 for a 95% and ~3 for a 99.5% confidence level (assuming a
sufficient number of residual degrees of freedom).
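A short sketch of how these quantities can be obtained in R, assuming some fitted model mod
with predictors x1 and x2 (all names here are placeholders); predict() returns the confidence
and prognosis (prediction) intervals directly:

# Sketch, assuming a fitted linear model mod <- lm(y ~ x1 + x2, data = dat)
X <- model.matrix(mod)
H <- X %*% solve(t(X) %*% X) %*% t(X)    # hat matrix: y^ = H y
h <- diag(H)                             # leverages h_i (same as hatvalues(mod))

new_point <- data.frame(x1 = 0, x2 = 0)  # hypothetical prediction point x+
predict(mod, new_point, interval = "confidence", level = 0.95)  # estimation error
predict(mod, new_point, interval = "prediction", level = 0.95)  # prognosis error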

Monte Carlo Method for confidence intervals and prognosis intervals

The Monte-Carlo method is based on random sampling (i.e. throwing the die, as in the famous
casino in Monaco) from a pre-supposed statistical distribution for the experimental error
(usually the normal distribution). As in the classical method described above, calculating
confidence and prognosis intervals is based on error propagation in order to estimate the
dispersion parameter of the distribution, and requires the assumptions of an unbiased model
and well behaved (independent and normally distributed) experimental errors. Instead of then
using t-values, sample from the distribution, i.e. take say R = 1000, R = 10 000 or even
R = 100 000 random numbers that pertain to the distribution and calculate quartile intervals
(a = 25%; 1-a = 75%) or percentile intervals (a = 2.5%; 1-a = 97.5%, resp. a = 0.25%;
1-a = 99.75%) by sorting and picking out entries no. a*R and (1-a)*R; e.g. for quartiles,
i.e. a = 25% and R = 10 000, use entries no. 2500 and 7500.
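A minimal sketch of the Monte Carlo idea for the simple mean-value case (the measurements and
the normal assumption are illustrative only; for a linear model the dispersion would come from
the error-propagation formula above):

# Sketch for a single mean value; measurements are made up for illustration
x  <- c(9.8, 10.1, 10.0, 9.7, 10.4)
m  <- mean(x)
se <- sd(x) / sqrt(length(x))            # dispersion of the mean
R  <- 10000
sim <- rnorm(R, mean = m, sd = se)       # R random draws from the assumed distribution
sort(sim)[c(0.25*R, 0.50*R, 0.75*R)]     # quartile interval and median
sort(sim)[c(0.025*R, 0.50*R, 0.975*R)]   # 95% percentile interval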

Two-level Bootstrap method for estimating prognosis intervals


The 2-level bootstrap is well described in [DH], i.e. with practical examples and software
code for the open source software R. Error propagation is not used at all. So assumptions on
the validity of the model and the distribution of the experimental errors play a less
important role than in the classical method and the Monte Carlo method. As in the Monte Carlo
method random sampling, i.e. throwing a die, is done, albeit not from a statistical
distribution whose shape (usually normal) must be assumed, but directly from the residuals,
to be precise: from deleted residuals out of cross validation. Residual: e_i = y_i - y^_i;
deleted residual: r_i = e_i / (1 - h_i).

Now draw samples of size n (= the number of runs in the design) with replacement, call them
r*_r, and do this say R = 1000 times, so r = 1, ..., R. (Please note, there are many, many
more such samples that can possibly be drawn, namely (2n-1)!/(n!(n-1)!) of them.) In order to
simulate estimation error, determine R = 1000 new models from the right-hand sides
y*_r = y^ + r*_r, i.e. calculate R = 1000 sets of coefficients, b*_r, and, at a new prediction
point x+, R = 1000 new model estimates, y+*_r. Sort these and, as in the Monte-Carlo method,
pick out entries no. a*R and (1-a)*R, in order to get percentile intervals for the model
estimators.

To get prognosis intervals for a single new prediction, sample a second time from the deleted
residuals; M = 1 or 2 times for each of the R = 1000 runs above would be sufficient. Let these
results be called e*_rm, r = 1, ..., R, m = 1, ..., M. Now simulate prognosis error using
d*_rm = y+*_r - (y^+ + e*_rm),
and simulate the new observation response values using
y^*_rm = y^+ - d*_rm = 2 y^+ - y+*_r + e*_rm.
Sort these and pick out the entries no. a*R*M and (1-a)*R*M.

R code for this is:


#Data import from Clipboard (design with three columns "zt", "temp", "pre"
#as factors and a fourth column "z" as response)
boot_ex <- read.delim("clipboard",dec=",")
attach(boot_ex)                  #... for easy access to the columns
head(boot_ex)                    #should show headers and the first six rows

#Doing the regression - linear model plus one interaction
X <- matrix(c(1,1,1,1,1,1,1,1,1,1,zt,temp,pre,zt*temp),ncol=5)
I <- solve(t(X)%*%X)             #information matrix (X^T X)^-1
g <- diag(I)                     #diagonal of the I-matrix
b <- I%*%t(X)%*%z                #model coefficients b
H <- X%*%I%*%t(X)                #hat matrix for calculating predictions
h <- diag(H)                     #diagonal of the H-matrix (leverages)
p <- H%*%z                       #predictions p
e <- z - p                       #residuals
r <- e/sqrt(1-h)                 #(studentised) deleted residuals r

#Preparing the bootstrap
r <- r-mean(r)                   #centring the studentised residuals
n <- 1000                        #number of outer bootstrap samples
m <- 2                           #number of inner bootstrap samples
delta <- array(0,dim=c(n,m))     #matrix for the bootstrap results

x_plus <- matrix(c(1, 0, 0,0,0)) #new prediction point
p_plus<-t(x_plus)%*%b            #prediction at the prediction point

#The 2-level bootstrap draws randomly from the studentised residuals
for(i in 1:n){                                #outer level for coefficients
z_star <- p + sample(r,replace=T)             #draw residuals
b_star <- I%*%t(X)%*%z_star                   #bootstrap coefficients
p_star <- t(x_plus)%*%b_star                  #prediction with the bootstrap model
epsil <- sample(r,size=m,replace=T)           #inner level for prediction error
delta[i,] <- p_star-(p_plus+epsil)            #matrix of simulated errors
}
#Quantiles, quartiles, histogram
p_plus-sort(delta[,])[c(m*975,m*500,m*25)]
p_plus-sort(delta[,])[c(m*750,m*500,m*250)]
simulated_new_measurements<-p_plus-sort(delta[,])
hist(simulated_new_measurements)

#cleaning up
detach(boot_ex)

The paradox of the skew confidence intervals

The (1-a)-confidence interval (say a = 5%) of a distribution parameter estimated from a
characteristic of a sample of observed values, e.g. of the median, m, from the empirical
median, t50, is defined as [m_low, m_high], where m_low is that parameter value for which the
probability of achieving an empirical estimate greater than t50 is equal to a/2; m_high is
defined analogously.

[Figure: a symmetric density, both tail areas equal to 2.5%]

For a symmetric distribution this confidence interval is symmetric about the empirical value.

[Figure: a left-skew density, both tail areas again equal to 2.5%]

For a left-skew distribution obviously the confidence interval of a position parameter will be
right-skew (and vice versa), as can be seen above.

But using a transformation to symmetrize the distribution (forward transform, compute the
interval, then backward transform the interval limits), the distribution of the transformed
values is symmetric, hence the confidence interval for the transformed entity is symmetric,
and the backward-transformed interval will be left-skew!

Where is the catch?

General linear modeling (GLM)

As we have seen, when the words "linear modeling" are used, what is meant is not necessarily
y = b0 + b1x1. Of course this is a linear model; it is linear in the coefficients b and in the
variables x. "Linear modeling" refers to models that must be linear in the b's but not
necessarily in the x's. In fact it is the strength of linear modeling that it allows
non-linear terms in x!

The general form of the linear model is then


g(y) = b1*h1(x) + b2*h2(x) + b3*h3(x) + ... + bm*hm(x) + ε, where ε ~ NID(0, σ)

y may be a vector of responses, g( ) is a so-called link function (like log() in the example
of the last lecture) that must be invertible, x is the vector of x-values, x1, ..., xm, the
hk(x) are functions of this x, and the bk are the coefficients that are to be estimated from
the collected data.

Examples for hm could be:


(1) linear model in the variables: h1(x) = x1, h2(x) = x2, ..., hm(x) = xm. Usually a term
h0(x) = 1 is added, with a corresponding coefficient b0, the intercept. This is in fact the
model we used in the last lecture:
g(y) = y = b0 + b1x1 + ... + bmxm + ε, where ε ~ NID(0, σ)
(2) model that's polynomial in x:
h1(x) = x1, h2(x) = x2, h12(x) = x1*x2, h11(x) = x1^2, h22(x) = x2^2, ...
(3) indicator functions: example in one dimension: I(a, b](x) = 1 for x in (a, b], 0 for x not
in (a, b]. Typically when indicator functions are used, the sets will be disjoint and will
cover the complete domain of the x; then b1*h1(x) + b2*h2(x) + b3*h3(x) + ... + bm*hm(x)
represents a piecewise constant function,
(4) products of (2) and (3), i.e. piecewise polynomials,
(5) spline-functions (to be explained below).

Models of the types (1) and (2) have been considered above, at least for x in one dimension
(but using 2 dimensions does not introduce any new concepts yet).

Fitting a piecewise constant function

Coming back to the example of shoe-size: We fit a piecewise constant function using R.
Define indicator functions for the intervals I1 = I(37.5, 39.5], I2 = I(39.5, 41.5] and
I3 = I(41.5, 43.5]:
> I1<-function(x) {if(x>37.5 &&x <=39.5) 1 else 0}
> I2<-function(x) {if(x>39.5 &&x <=41.5) 1 else 0}
> I3<-function(x) {if(x>41.5 &&x <=43.5) 1 else 0}

If the data frame is not already loaded, load it:


> schuhgre<-c(38,38,39,39,40,40,41,41,42,42)
> krpergre<-c(153,161,167,169,173,176,182,181,188,189)

> bspl<-data.frame(schuhgre,krpergre)
> bspl
schuhgre krpergre
1 38 153
2 38 161
3 39 167
4 39 169
5 40 173
6 40 176
7 41 182
8 41 181
9 42 188
10 42 189

Trying to fit a linear model directly using Ij(x) will not work in R, because the functions Ij
have only been defined for numbers and not for vectors.

So
> fm_pc<-lm(krpergre ~ 0 + I(I1(schuhgre)) + I(I2(schuhgre)) +
I(I3(schuhgre)), data=bspl)
yields the following error:
Fehler in model.frame(formula, rownames, variables, varnames, extras,
extranames, : Variablenlängen sind unterschiedlich (gefunden für
'I(I1(schuhgre))')
[i.e. "variable lengths differ (found for 'I(I1(schuhgre))')"]

One way of fixing this, is to construct columns for the three Ij(x) using the apply( )-function in
R to construct the vectors I1(x), I2(x) and I3(x):
> apply(as.matrix(bspl$schuhgre),1,I1) # MARGIN=1 means "apply over rows"
[1] 1 1 1 1 0 0 0 0 0 0
> apply(as.matrix(bspl$schuhgre),1,I2)
[1] 0 0 0 0 1 1 1 1 0 0
> apply(as.matrix(bspl$schuhgre),1,I3)
[1] 0 0 0 0 0 0 0 0 1 1

> fm_pc<-lm(krpergre ~ 0 + I(apply(as.matrix(schuhgre),1,I1)) +


+ I(apply(as.matrix(schuhgre),1,I2)) +
+ I(apply(as.matrix(schuhgre),1,I3)),
+ data=bspl)
> fm_pc

Call:
lm(formula = krpergre ~ 0 + I(apply(as.matrix(schuhgre), 1, I1)) +
I(apply(as.matrix(schuhgre), 1, I2)) + I(apply(as.matrix(schuhgre),
1, I3)), data = bspl)

Coefficients:
I(apply(as.matrix(schuhgre), 1, I1)) 162.5
I(apply(as.matrix(schuhgre), 1, I2)) 178.0
I(apply(as.matrix(schuhgre), 1, I3)) 188.5

> summary(fm_pc)

Call:
lm(formula = krpergre ~ 0 + I(apply(as.matrix(schuhgre),
1, I1)) + I(apply(as.matrix(schuhgre), 1, I2)) +
I(apply(as.matrix(schuhgre),
1, I3)), data = bspl)

Residuals:
Min 1Q Median 3Q Max
-9.500e+00 -1.875e+00 -6.939e-16 3.750e+00 6.500e+00

Coefficients:
                                      Estimate Std. Error t value Pr(>|t|)
I(apply(as.matrix(schuhgre), 1, I1))   162.500      2.735   59.41 1.01e-10 ***
I(apply(as.matrix(schuhgre), 1, I2))   178.000      2.735   65.07 5.32e-11 ***
I(apply(as.matrix(schuhgre), 1, I3))   188.500      3.868   48.73 4.01e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.471 on 7 degrees of freedom
Multiple R-Squared: 0.9993, Adjusted R-squared: 0.999
F-statistic: 3379 on 3 and 7 DF, p-value: 2.008e-11

[Plot: krpergre (about 155-190) vs. schuhgre (38-42), with the fitted values in red and the
regression line in green]

To get the graph, use the following commands:
>plot(schuhgre,krpergre)
>points(schuhgre,krpergre)
>points(schuhgre,fitted(fm_pc),col="red")
>lines(formula=fitted(fm_pc) ~ schuhgre,col="red")
>abline(coef(fm),col="green")

Actually the red line should really be piecewise constant, but the graphics options to do this
are not so easily found.

At least the jumps should be vertical at the points 39.5 and 41.5, because that's where the
indicator functions have their jumps. But fm_pc only knows the values of the function at the
points in the vector schuhgre. That's why the jump is where it is.
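One possible way around this, as a sketch, is stepfun() from base R, fed with the three fitted
levels and the knots 39.5 and 41.5 (assuming the schuhgre/krpergre vectors from above are
still in the workspace):

# Sketch: a genuinely piecewise constant red line via stepfun()
sf <- stepfun(c(39.5, 41.5), c(162.5, 178.0, 188.5))
plot(schuhgre, krpergre)
plot(sf, add = TRUE, do.points = FALSE, col = "red")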

Fitting Splines

Just as piecewise constant functions can be fit to the data, we can fit
(a) piecewise linear functions,
(b) piecewise linear functions that are continuous (i.e. linear splines),
(c) piecewise square functions,
(d) piecewise square functions that have continuous derivative (i.e. quadratic splines)
(e) piecewise cubic functions that have continuous second derivative (i.e. cubic splines).

This can be achieved by using an appropriate set of functions hm(x) in the general formula
above.

To illustrate the different cases, we continue the schuhgre/krpergre example, and continue to
use the points 37.5, 39.5 and 41.5 as knots, where the discontinuities in the functions resp.
their derivatives may be.

In case (a) we would use


h1(x) = I(37.5, 39.5], h2(x) = xI(37.5, 39.5], (these 2 represent a linear model in (37.5, 39.5] )
h3(x) = I(39.5, 41.5] , h4(x) = xI(39.5, 41.5], (these 2 represent a linear model in (39.5, 41.5] )
h5(x) = I(41.5, 43.5] and h6(x) = xI(41.5, 43.5] (these 2 represent a linear model in (41.5, 43.5] )

However, the data are too sparse to fit this model, since there is only one measurement point
in the third interval (41.5, 43.5], so that at most we could fit h5(x) here, but not h6(x), (which
in fact amounts to calculating the mean of the two values at schuhgre 42, i.e. 188.5).

In case (b)
h0(x) = 1, h1(x) = x, h2(x) = (x - 39.5) I(39.5, inf], h3(x) = (x - 41.5) I(41.5, inf]

Since all of these functions are continuous (the jumps in the indicator functions are zeroed
out by the (x - x_knot) factor in front), the fitted model will also be continuous.

Case (c) can in fact not be reasonably fitted with the data at hand, because between each pair
of knots there are always only two observation points, so pieces of square functions cannot be
fitted. The model would have too many degrees of freedom; the functions hk(x) would be:
h1(x) = I(37.5, 39.5], h2(x) = x I(37.5, 39.5], h3(x) = x^2 I(37.5, 39.5],
h4(x) = I(39.5, 41.5], h5(x) = x I(39.5, 41.5], h6(x) = x^2 I(39.5, 41.5],
h7(x) = I(41.5, 43.5], h8(x) = x I(41.5, 43.5] and h9(x) = x^2 I(41.5, 43.5]

Case (d) however is possible, because the condition of continuity of the first derivative
decreases the model's degrees of freedom from 9 to 5: the jumps of the function and of its
derivative must be 0 at the 2 knot points. The easiest way to satisfy the condition is to fit
the following set of hk(x):

h0(x) = 1, h1(x) = x, h2(x) = x^2, h3(x) = (x - 39.5)^2 I(39.5, inf],
h4(x) = (x - 41.5)^2 I(41.5, inf]
Since all of these functions are continuous and have a continuous first derivative, the fitted
model will also be continuous and have a continuous first derivative.

As in case (c), in case (e) the data are too sparse to fit a cubic spline, but the basis of
hk(x)'s to be used would be:
h0(x) = 1, h1(x) = x, h2(x) = x^2, h3(x) = x^3, h4(x) = (x - 39.5)^3 I(39.5, inf],
h5(x) = (x - 41.5)^3 I(41.5, inf]

A good exercise to do would be to


(a) construct the hk(x) for each of the (workable) cases for the schuhgre/krpergre
example,
(b) fit the corresponding models
(c) check the statistics for significance of the coefficients
(d) understand why/whether these statistics are correct
(e) plot the fitted functions and compare them to each other and the functions fitted before.
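As a starting point for these exercises, here is a sketch for case (b), the linear spline,
using the truncated-power basis from the text; h2 and h3 are helper functions introduced here
for illustration, not part of the original code:

# Sketch for case (b): linear spline via the truncated-power basis
h2 <- function(x) pmax(x - 39.5, 0)      # (x - 39.5) * I(39.5, inf](x)
h3 <- function(x) pmax(x - 41.5, 0)      # (x - 41.5) * I(41.5, inf](x)
fm_ls <- lm(krpergre ~ schuhgre + I(h2(schuhgre)) + I(h3(schuhgre)), data = bspl)
summary(fm_ls)
lines(bspl$schuhgre, fitted(fm_ls), col = "blue")   # add the fitted spline to the plot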

R-Software would not be very good, if one had to fit splines using the lm( )-function. In fact
there is a function to fit the very popular cubic splines:
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL,
cv = FALSE, all.knots = FALSE, nknots = NULL,
keep.data = TRUE, df.offset = 0, penalty = 1,
control.spar = list())

Check the R-Help-system for details. The parameter list is much more complicated than the
user may have anticipated, because the functionality of smooth.spline() is much greater
than the possibilities within lm(). Details are beyond the scope of this introductory course.
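A minimal usage sketch with the shoe-size data from above (with only 5 distinct x-values this
is just to show the call, not a serious spline fit):

# Sketch: cubic smoothing spline for the bspl data
ss <- smooth.spline(bspl$schuhgre, bspl$krpergre)
plot(bspl$schuhgre, bspl$krpergre)
lines(ss, col = "blue")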

Repeated Measures, Longitudinal Analysis, Mixed Effects


Modeling

The most common violation of IID are repeated measures. Not always, but usually, this means
that time turns up as a factor, and measurements on the same subject (or patient) are repeated
time after time.

[Data excerpt: columns Pat-ID, time.in.weeks, CD3p, group; e.g. patient 1 (group H) and
patients 26 and 27 (group T), each measured over many weeks]

Errors in repeated measures are usually not independent (!); they are correlated, and the
closer in time the measurements are made, the greater the correlations. N-I-D and even I-I-D
are both violated, since the random errors in the measurements are correlated. Modelling this
is much more delicate than just estimating correlation coefficients.

When measures are repeated, this is often done to understand different sources of variation:
between-group variation: typically a fixed effect,
between-individual variation: usually a random effect,
within-individual biological variation over time: also a random effect,
random error from measurement to measurement: independent error.

In a fixed effects model (i.e. the models we've used until now), all of the gradients (i.e.
coefficients) are assumed to be deterministic; random errors are assumed to occur only in the
individual experiments, and these errors ε1, ε2, ε3, ε4, ... are independent!

In a mixed effects model, some of the gradients (i.e. coefficients) are assumed to be random
variables themselves (!) This makes things more complicated and requires more runs, but it can
handle correlated, non-independent random errors ε1, ε2, ε3, ε4, ...!

For example, we can assume the expectation value of the measurement for a patient at a given
time point is itself random (we usually assume expectations to be deterministic).

To be precise: We assume i is the index for the patient and j is the index for the week, and
y_ij = b_0i + b_1i*week_j + ε_ij,   ε_ij ~ NID(0, σ_ind)
where
b_0i = b_00 + b_02*g_i + u_0i,   u_0i ~ NID(0, σ_pat)
b_1i = b_10 + b_12*g_i
hence
y_ij = b_00 + b_02*g_i + b_10*week_j + b_12*g_i*week_j + u_0i + ε_ij

The first term is the mean, the second considers the influence of the group to which the
patient belongs, the third that of the week, and the fourth is the interaction between the
two. These are the fixed effects. u_0i is the random deviation of a patient from the mean, and
u_1i*week_j (used in step 4 below) is the random deviation of the influence of the week from
the mean.

This can be modelled using the lme(...)-function in the nlme-library (see step 4 below).

Step 1: Look at the mean values for group "H" and group "T" at each tiw:

ht<-read.delim("clipboard", sep="\t", dec=",", head=TRUE); #import


head(ht); attach(ht) #show and attach
tapply(log(CD3p+1)[group=="H"],tiw[group=="H"],mean,na.rm=TRUE)
#mean for each tiw
tapply(log(CD3p+1)[group=="T"],tiw[group=="T"],mean,na.rm=TRUE)
#mean for each tiw:

Step 2: Make a grid plot of this with lines drawn in:

library(lattice) #load graph-libr.


trellis.device(color=F) #open window
xyplot(log(CD3p+1)~group | tiw, data=ht, #plot multi-graph
panel=function(x,y){
panel.xyplot(x,y)
panel.lmline(x,y,lty=1)
}
)

Please notice, that there are 6


graphs in each row, just as there
were 6 weeks in the output of step
1. Here week 1 is at bottom left,
so week 14 is in the 3rd row from
the bottom at left.

It seems group-T patients recover


a little better initially, until week
4, then group-H and group-T

patients are at the same level, whereas at week 14 and thereafter group-H patients continue to
improve whereas group-T patients seem to stagnate.

Step 3: Calculate gradients at each tiw

library(nlme) #load library


comp<-lmList(log(CD3p+1)~group|factor(tiw), data=ht,na.action=na.exclude)
summary(comp) #influence of group
#at each tiw
groupT
Estimate Std. Error t value Pr(>|t|)
1 1.252191213 1.8826189 0.66513260 0.50615859
2 0.560060878 0.5920563 0.94595888 0.34445698
3 0.336013809 0.5215717 0.64423320 0.51961005
4 0.283858055 0.5050534 0.56203578 0.57425006
5 -0.007158556 0.4961136 -0.01442927 0.98849113
6 -0.112170562 0.4961136 -0.22609852 0.82118291
7 0.107026170 0.4921287 0.21747596 0.82789324
8 0.079559939 0.4921287 0.16166489 0.87161083

Residual standard error: 1.537152 on 795 degrees of freedom

Gradients at each tiw are just differences between group-H and group-T. These results are the
same as above.

t- and p-values are slightly different to t-test and linear modelling above. This is because
standard error is calculated from all residuals and not just from the residuals at the week
concerned.

This analysis does not yet use a mixed effects model, it just gives a list of linear models.

Step 4: Fit the linear mixed effects model

The mixed effects model has the form:

y_ij = b_00 + b_02*g_i + b_10*week_j + b_12*g_i*week_j + u_0i + u_1i*week_j + ε_ij

which translates into the command below. Notice that in the software R, when including the
interaction term factor(tiw)*factor(group), the linear terms factor(tiw) and factor(group) are
automatically included, as well as the constant (intercept). Notice also that the random part
of the model is particularly simple, just the constant. The grouping variable is Pat.ID. All
58 weeks are used in a first analysis.

Before fitting the model the contrasts used should be set to sum to zero for each variable, so
that they are orthogonal to the intercept. This is most easily done using the options-command.

options(contrasts=c('contr.sum','contr.poly'))
htlme<-lme(log(CD3p+1)~factor(tiw)*factor(group), random=~1|
Pat.ID,data=ht[tiw>=1&tiw<=58,],na.action=na.exclude)
summary(htlme)$tTable

Value Std,Error DF t-value p-value
(Intercept) 4,7249 0,1996 756 23,6773 3,74E-093
factor(tiw)1 -3,2200 0,6095 756 -5,2826 1,67E-007
factor(tiw)2 -3,5075 0,1960 756 -17,8977 5,37E-060
factor(tiw)3 -2,8140 0,1736 756 -16,2087 6,18E-051
factor(tiw)4 -1,9226 0,1684 756 -11,4194 5,65E-028
factor(tiw)5 -1,0486 0,1655 756 -6,3351 4,07E-010
factor(tiw)6 -0,6234 0,1655 756 -3,7664 1,79E-004
factor(tiw)7 -0,4079 0,1642 756 -2,4839 0,0132
factor(tiw)8 -0,2553 0,1642 756 -1,5547 0,1204
factor(tiw)9 -0,1176 0,1642 756 -0,7160 0,4742
factor(tiw)10 -0,0593 0,1642 756 -0,3612 0,7181
factor(tiw)11 -0,1117 0,1642 756 -0,6799 0,4968
factor(tiw)12 -0,2108 0,1642 756 -1,2837 0,1996
factor(tiw)13 0,0014 0,1654 756 0,0084 0,9933
.
.
.

factor(tiw)26 1,4612 0,3192 756 4,5775 5,50E-006


factor(tiw)27 1,3041 0,3280 756 3,9761 7,68E-005
factor(group)1 0,1536 0,1996 39 0,7699 0,446

fac(tiw)1:fac(gr)1 -0,1961 0,6095 756 -0,3217 0,748


fac(tiw)2:fac(gr)1 -0,5286 0,1960 756 -2,6971 0,007
fac(tiw)3:fac(gr)1 -0,2696 0,1736 756 -1,5527 0,121
fac(tiw)4:fac(gr)1 -0,2373 0,1684 756 -1,4097 0,159
fac(tiw)5:fac(gr)1 -0,1291 0,1655 756 -0,7800 0,436
fac(tiw)6:fac(gr)1 -0,0766 0,1655 756 -0,4628 0,644
fac(tiw)7:fac(gr)1 -0,2072 0,1642 756 -1,2613 0,208
fac(tiw)8:fac(gr)1 -0,1934 0,1642 756 -1,1777 0,239
fac(tiw)9:fac(gr)1 -0,1691 0,1642 756 -1,0297 0,303
.
.
.
summary(htlme)
Linear mixed-effects model fit by REML
Data: ht[tiw >= 1 & tiw <= 58, ]
AIC BIC logLik
2715.304 2986.648 -1299.652

Random effects:
Formula: ~1 | Pat.ID
(Intercept) Residual
StdDev: 1.204801 1.011532

Fixed effects: log(CD3p + 1) ~ factor(tiw) * factor(group)


...see above

Interpretation:
Residual standard deviation is 1.01153, standard deviation of the patient mean is 1.204801.
Since patient standard deviation is relatively high, it is definitely not allowed to treat
measurement errors at different time points as independent.

No effort has been made to model time dependency numerically. Both group-T/H and tiw
were used as (qualitative) factors.

It may be interesting to model the time dependence using tiw as a numerical variable, with
suitable scaling; maybe then the change in the gradient of the group-effect will be
significant. But this is beyond the scope of this course.

Estimating parameters using Maximum-Likelihood (ML)

There are several ways to estimate distribution parameters in statistics when data are given
that presumably correspond to the distribution. One of the most important methods (if not the
most important) is that of maximizing likelihood. It is more general than least-squares
regression, which is used in linear modeling. In fact least-squares regression is a special
case of maximum likelihood estimation, and we shall prove this later in this section.

How does maximum likelihood work?


The idea is to determine that value of a distribution parameter that maximizes the probability
or the probability density (when the distribution is continuous) of a set of data that has
been collected and that corresponds to the distribution. Thus, given a distribution and some
measurements, the likelihood, L(θ), of a parameter value, θ, is defined as either the
probability, Pr(measurements | θ), or the probability density, dP(measurements | θ), of the
measurements collected, assuming that the parameter value θ were true. In a way the logic is
turned around. Once the likelihood, L(θ), is established, that parameter value is chosen for
which the likelihood is maximal.

This shall be illustrated using exponential time decay and a single measurement.

Assume a survival function is given by S(t) = e^(-λt) and λ is to be estimated from a failure
observed at time point t = τ. What's the best λ? The cumulative failure probability is given
by F(t) = 1 - S(t) = 1 - e^(-λt). Of course the probability of a failure being exactly at
t = τ is exactly 0, because F(t) = 1 - e^(-λt) is continuous. So we shall use the probability
density for a failure, f(t) = λ e^(-λt). Now, given the failure at t = τ, we turn the logic
around and say: the likelihood of a parameter value λ, L(λ), is the probability density at
t = τ, assuming a parameter value of λ. So L(λ) = λ e^(-λτ).

[Plot: "Exp-decay for different values of lambda", f(t) = λ e^(-λt) against t, with a vertical
line at t = 2]

In the plot at right, the function f(t) = λ e^(-λt) has been plotted against t for values of λ
ranging from 0.2 (blue) through 0.5 (pink), 1 (black), 2 (red) and 5 (green). It can be seen
that as λ increases, the curves start at higher values at t = 0 but decay much more quickly.
A vertical line has been drawn at t = 2. At this value the pink curve is highest; it
corresponds to λ = 0.5 = 1/2.

Indeed, the maximum of L(λ) = λ e^(-λτ) can be found by differentiating w.r.t. λ and then
equating to 0:
dL/dλ = e^(-λτ) - λτ e^(-λτ) = 0 implies λ = 1/τ, which confirms the value read off in the
graph.
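The same maximum can also be found numerically, as a small sketch, using R's optimize() on the
likelihood for the failure at t = 2:

# Sketch: numerical maximum of L(lambda) for the single failure at t = tau = 2
tau <- 2
L <- function(lambda) lambda * exp(-lambda * tau)
optimize(L, interval = c(0.001, 10), maximum = TRUE)$maximum   # approx. 0.5 = 1/tau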

Often it is easier to maximize the logarithm of the likelihood, the log-likelihood,
abbreviated by l(λ) = log L(λ). Particularly when the exponential function plays a major role
in the definition of the distributions involved, but also when several independent
measurements have been made, the log-likelihood is easier to use. Of course, since the
logarithm is a monotonic function, the log-likelihood will be maximal at the same values of
the parameters as the likelihood.

In the example above using the log-likelihood would mean:

l(λ) = log λ - λτ, hence dl/dλ = 1/λ - τ, which is zero for λ = 1/τ, as above.

In all following calculations and examples, log-likelihood will be used, not likelihood itself.

The situation becomes a little more interesting when there are several measurements. The
likelihood of a parameter is again defined as the probability resp. probability density of the
measured values. When measurements are independent, the probability of several events
occurring is the product of the probabilities; the same is true for probability densities. So
likelihoods become products and log-likelihoods become sums (which are a lot easier to
differentiate!).

As an example we use exponential decay again, and assume that there are two independent
failure time points t1 and t2. Then the likelihood is
L(λ) = dP(failure at t1 and failure at t2 | λ) = λ e^(-λ t1) * λ e^(-λ t2)
and the log-likelihood is
l(λ) = log[ dP(failure at t1 and failure at t2 | λ) ] = log λ - λ t1 + log λ - λ t2,
and hence
dl/dλ = 2/λ - t1 - t2, which is zero if λ = 2/(t1 + t2).

In problems concerning failure and survival, it is unusual to use the probability densities for
likelihood calculations, it's better and usual to use hazards, which are in fact conditional
failure probabilities on condition that no failure has occurred up to the considered time point.
This will be explained some more in the section on Cox proportional hazards model.

The maximum likelihood estimation method can also be used to estimate μ and σ of a normal
distribution, as well as for estimating coefficients in a linear model of the type
y_obs = b0 + b1x1 + b2x2 + ε, where ε ~ NID(0, σ) (NID = normally and independently distr.),
as discussed before.

Assume there is one measurement x1, and x1 ~ NID(μ, σ). What μ would have the highest
likelihood? Well
L(μ, σ) = 1/(sqrt(2π) σ) * e^( -(x1 - μ)^2 / (2σ^2) ),
hence
l(μ, σ) = -(x1 - μ)^2 / (2σ^2) - log σ - (1/2) log(2π).
From this formula it can readily be seen that the maximum-likelihood estimator for μ must be
the same as the least squares estimator (note the minus sign). Taking the derivative yields
dl/dμ = (x1 - μ)/σ^2, which is zero for μ_ML = x1. A best estimate for σ can obviously not be
found from just one measurement: dl/dσ = -1/σ + (x1 - μ)^2/σ^3, which cannot be zero if the
numerator in the second fraction is zero, which it is, since μ_ML = x1, as was just shown.
Otherwise the best estimate for σ would be σ = |x1 - μ|.

When there are two different measurements, x1 and x2, both ~ NID(μ, σ), then
L(μ, σ) = 1/(2π σ^2) * e^( -(x1 - μ)^2 / (2σ^2) ) * e^( -(x2 - μ)^2 / (2σ^2) ),
l(μ, σ) = -log(2π) - 2 log σ - (x1 - μ)^2 / (2σ^2) - (x2 - μ)^2 / (2σ^2),
dl/dμ = (x1 - μ)/σ^2 + (x2 - μ)/σ^2, which is zero for μ_ML = (x1 + x2)/2,
dl/dσ = -2/σ + [ (x1 - μ)^2 + (x2 - μ)^2 ]/σ^3, which is zero for
σ^2_ML = [ (x1 - μ)^2 + (x2 - μ)^2 ]/2.

Having seen these calculations for one and for two measurements, it is easy to generalize to N
measurements, x1, ..., xN. In fact the formulae for both parameters generalize easily:
μ_ML = (1/N) Σ_{i=1..N} x_i and σ^2_ML = (1/N) Σ_{i=1..N} (x_i - μ)^2. Here the two parameters
have been estimated independently of each other. When, as is common practice, μ is estimated
first, then in estimating σ the x_i only have N-1 degrees of freedom. Hence an
(N-1)-dimensional normal distribution must be used. This would reduce the numerator in the
fraction N/σ to N-1, and hence, in solving the equation for σ, there would only be N-1 in the
denominator. This variant of ML-estimation is called Restricted Maximum Likelihood estimation,
or REML for short.
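A small numerical sketch of the difference (the sample values below are made up for
illustration):

# Sketch: ML vs. REML-type variance estimate
x <- c(4.1, 5.3, 6.0, 5.2, 4.8)
N <- length(x)
mu_ml    <- mean(x)
var_ml   <- sum((x - mu_ml)^2) / N         # ML estimate: divide by N
var_reml <- sum((x - mu_ml)^2) / (N - 1)   # REML-type estimate: divide by N-1
c(var_ml, var_reml, var(x))                # var(x) uses the N-1 denominator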

Classification using logistic regression

At the beginning of the course, we divided variables according to their scale and according to
their use. As regards use there are two types, exogenous variables (i.e. those that send out
effects) and endogenous variables (i.e. those that take in these effects). We called the
former X-variables and the latter Y-variables. As regards scale we considered several
different scale types for variables, namely continuous scales, ordered scales and discrete
scales (where in fact in the last case talking of a "scale" is maybe a slight misnomer
(Fehlbenennung), because the word scale is inevitably associated with numbers, and discrete
scales really just differentiate between categories or classes, which just exist side by
side).

In a data analysis situation, depending on availability of X and Y-data, depending on the scale
of these data, and of course depending on the questions to be answered, different statistical
methods or modeling techniques shall be applied. In particular, when inferences about Y-data
are to be drawn from available X-data, modeling techniques to be used depend on the scales
of the data involved.

Until now we have discussed the following methods:

contingency tables for the setting of discrete data on both the X and the Y side; anova, which
we discussed a little in the first semester; and GLM (generalized linear modeling), which
contains simple techniques for transforming discrete data into continuous data, so one can use
linear modeling to do the analysis.

In fact, when there are two categories1, just use x = -1 for the first and x = +1 for the
other; when there are more than 2 classes, DON'T just use one x variable, because this would
introduce an ordering(!); use what are called dummy variables, which define contrasts between
categories, or use indicator variables x = 0/1, which indicate whether an object is not/is in
a given category (a small sketch of both codings is given below).
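Sketch of the two codings in R via model.matrix() (the factor values are hypothetical):

# Contrast (dummy) coding vs. pure indicator coding for a 3-category factor
f <- factor(c("A", "B", "C", "B", "A"))    # hypothetical categories
model.matrix(~ f)        # intercept plus contrast columns for B and C
model.matrix(~ 0 + f)    # one 0/1 indicator column per category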

Until now, in the Regression-setting, i.e. when we did linear modeling, the Y's were
continuous. When the Y-variable is discrete, we talk of a classification problem.

In a classification setting, the X-matrix usually contains continuous data.

The simplest classification situation is that of classifying into two classes, namely class 0
and class 1, for example female and male, based on some measurement, for example body size
(this may be interesting in a discotheque, where they want to admit approximately equal
numbers of guys and girls, and they don't want to employ a guard, but use automated
measurements of size to do the classification and count).

So to do the modeling, a model assumption must be made. In what's called logistic regression
the model assumption is that the ratio of the probabilities of belonging to one or the other
class is given by e^(βx), where x is the measurement (it may be a vector) and β is a
coefficient (or vector of coefficients). This ratio is also called the odds of belonging to
one class as opposed to the other (the word is often used in horse-betting etc.). This means
that the log of the odds, also called the logit for short, is a linear function of the x:
logit = log(odds) = log[ P(Y=1)/P(Y=0) ] = βx, equivalently P(Y=1)/P(Y=0) = e^(βx).
Since P(Y=0) + P(Y=1) = 1 it must follow that:
P(Y=1 | X=x) = e^(βx)/(1 + e^(βx)) =: p(x, β) and P(Y=0 | X=x) = 1/(1 + e^(βx)) = 1 - p(x, β).

1 Notice we use the word categories for known settings of a discrete X variable and the word classes for the
unknown settings of a discrete Y-variable, that are to be predicted, in the hope of reducing confusion.

Now given measurements x_i and class memberships y_i of a total of N individuals,
i = 1, ..., N, the log-likelihood of a parameter vector β, given independence of the
classifications, will be given by
l(β) = Σ_{y_i=1} [ β x_i - log(1 + e^(β x_i)) ] + Σ_{y_i=0} [ - log(1 + e^(β x_i)) ].
To maximize this, take derivatives and set them to zero:
dl/dβ = Σ_{y_i=1} [ x_i - x_i e^(β x_i)/(1 + e^(β x_i)) ] - Σ_{y_i=0} x_i e^(β x_i)/(1 + e^(β x_i))
      = Σ_{i=1..N} x_i ( y_i - p(x_i, β) ) = 0 (!)

Since this is a nonlinear function, it is advisable to look for the solution of the equation
by using a Newton-Raphson iteration:
β_new = β_old - H(β_old)^-1 dl(β_old), where H(β_old) is the derivative of (dl)^T evaluated at
β_old, i.e. the Hessian matrix (second derivative) of l, which in turn is given by:
H(β) = - Σ_{i=1..N} x_i x_i^T p(x_i, β) (1 - p(x_i, β)), because
d/dβ [ x_i e^(β x_i)/(1 + e^(β x_i)) ]
  = [ x_i x_i^T e^(β x_i) (1 + e^(β x_i)) - x_i x_i^T e^(β x_i) e^(β x_i) ] / (1 + e^(β x_i))^2
  = x_i x_i^T e^(β x_i) / (1 + e^(β x_i))^2
  = x_i x_i^T p(x_i, β) (1 - p(x_i, β)).

The Newton-Raphson iteration becomes

β_new = β_old + [ Σ_{i=1..N} x_i x_i^T p(x_i, β_old)(1 - p(x_i, β_old)) ]^-1 Σ_{i=1..N} x_i ( y_i - p(x_i, β_old) ),

which can be rewritten as
β_new = β_old + (X^T W X)^-1 X^T (y - p) = (X^T W X)^-1 X^T W [ X β_old + W^-1 (y - p) ],
where
W = diag( p_1(1-p_1), ..., p_N(1-p_N) )
is an N*N diagonal matrix of weights. Setting z = X β_old + W^-1 (y - p), the iteration becomes
β_new = (X^T W X)^-1 X^T W z, which can be solved by using the lm(...) function in R.
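A sketch of this in R: glm() with family = binomial performs such an iteratively reweighted
least-squares scheme internally, and the hand-rolled loop below mirrors the update formula
above (the body-size data are made up for illustration):

# Sketch: glm() vs. a hand-rolled IRLS loop
set.seed(1)
x <- c(rnorm(50, 165, 7), rnorm(50, 178, 7))   # measurements for the two classes
y <- rep(c(0, 1), each = 50)                   # class labels
coef(glm(y ~ x, family = binomial))            # the standard way in R

X    <- cbind(1, x)                            # extended design matrix
beta <- c(0, 0)                                # starting value
for (it in 1:15) {
  p    <- as.vector(1 / (1 + exp(-X %*% beta)))   # p(x_i, beta)
  W    <- p * (1 - p)                             # diagonal of the weight matrix
  z    <- as.vector(X %*% beta + (y - p) / W)     # working response z
  beta <- coef(lm(z ~ 0 + X, weights = W))        # weighted least-squares step
}
beta                                           # agrees with coef(glm(...)) above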

Data Analysis for Categorical x-data

Categorical x-data (and continuous y-data) are best analysed using either ANOVA (analysis of
variance) or GLM (generalized linear modeling). Both techniques will be illustrated in this
section, using a new data set that is within the R example pool.

Example: two categorical x and one continuous y-variable

1. Importing the data for anova (check R-intro sample session):


> filepath <- system.file("data", "morley.tab" , package="datasets")
> mm <- read.table(filepath)
> mm

Expt Run Speed


001 1 1 850
002 1 2 740

003 1 3 900
004 1 4 1070
005 1 5 930
006 1 6 850
007 1 7 950
008 1 8 980
009 1 9 980
010 1 10 880
011 1 11 1000
012 1 12 980
013 1 13 930
014 1 14 650
015 1 15 760
016 1 16 810
017 1 17 1000
018 1 18 1000
019 1 19 960
020 1 20 960
021 2 1 960
022 2 2 940
023 2 3 960
024 2 4 940
025 2 5 880
.
.
.

2. Convert the columns Expt and Run to factors in order to be able to use analysis of
variance:
> mm$Expt <- factor(mm$Expt)
> mm$Run <- factor(mm$Run)

3. Box-Plot:

[Boxplot: "Speed of Light Data", Speed (about 700-1000) by Exp.-No (1-5)]

The boxplot or box-whisker plot contains the following information:

* smallest non-outlier observation (lower "whisker")
* lower (first) quartile (Q1, x0.25)
* median (second quartile) (Med, x0.5)
* upper (third) quartile (Q3, x0.75)
* largest non-outlier observation (upper "whisker")
* interquartile range, IQR = Q3 - Q1 (box)
* "mild" outlier, between 1.5*(IQR) and 3*(IQR) below Q1 or above Q3 (dots in picture)
* "extreme" outlier, more than 3*(IQR) below Q1 or above Q3 (none in picture)

The vertical lines (the "whiskers") extend to at most 1.5 times the box width (the interquartile
range) from either or both ends of the box. They must end at an observed value, thus
connecting all the values outside the box that are not more than 1.5 times the box width away
from the box.

Three times the box width marks the boundary between "mild" and "extreme" outliers. In a
typical boxplot, "mild" and "extreme" outliers are differentiated by open dots and stars
respectively.

There are variants of the boxplot. Which variant is used depends on the software.

Analysis of variance
>fm <- aov(Speed ~ Run + Expt, data=mm)
>summary(fm)
Df Sum Sq Mean Sq F value Pr(>F)
Run 19 113344 5965 1.1053 0.363209
Expt 4 94514 23629 4.3781 0.003071 **
Residuals 76 410166 5397
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This is an ANOVA or AOV table (analysis of variance).

There are two factors: Expt (at 5 levels, hence 4 DF) and Run (at 20 levels, hence 19 DF).
In all there are 100 runs, 20 per experiment and 5 experiments.

Let x_i,j denote the result at run j of experiment i.

For each experiment a mean is calculated: x_i,mean = 1/20 Σ_{j=1..20} x_i,j
For each run a mean is calculated: x_mean,j = 1/5 Σ_{i=1..5} x_i,j
The grand mean is calculated: x_mean = 1/100 Σ_{i=1..5, j=1..20} x_i,j

SS_Run is the sum of squared deviations of the x_mean,j from x_mean. In the above data this
sum is 113344. Since there are 20 such numbers, there are 19 degrees of freedom, which are
used to calculate the mean of the squared deviations, which is then 5965.

SS_Expt is the sum of squared deviations of the x_i,mean from x_mean. In the above data this
sum is 94514. Since there are 5 such numbers, there are 4 degrees of freedom, which are used
to compute the mean of squares, which is then 23629.

If there were no additional variation in the data, one should be able to calculate
x_i,j = x_mean + expt-correction + run-correction
      = x_mean + (x_i,mean - x_mean) + (x_mean,j - x_mean) = x_i,mean + x_mean,j - x_mean

But there is additional variation! So SS_Residuals is the sum of the squares of these
deviations, 100 deviations in all, but not all independent; there are 76 residual degrees of
freedom (DF_residual = 100 - 1 - 19 - 4 = 76), so MS_residual = 5397.

Time to calculate the F-statistics by comparing the variance induced by the two factors with
the random variance:
F_Expt = MS_Expt / MS_Residuals = 4.3781, corresponding p-value = 0.307%
F_Run  = MS_Run  / MS_Residuals = 1.1053, corresponding p-value = 36.32%

A quotient of 4.3781 (or above) is improbable if there are only random effects, whereas a
quotient of 1.1053 (or above) is highly probable.

So the influence of the factor experiment on the data is significant at 0.5% but not at 0.1%,
hence **. The factor Run is not significant.
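A quick sketch recomputing these numbers in R from the mean squares in the AOV table:

# Sketch: F statistics and p-values recomputed from the mean squares above
F_Expt <- 23629 / 5397
F_Run  <-  5965 / 5397
pf(F_Expt, df1 = 4,  df2 = 76, lower.tail = FALSE)   # approx. 0.003
pf(F_Run,  df1 = 19, df2 = 76, lower.tail = FALSE)   # approx. 0.36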

Using Linear Modeling to treat categorical x-data

Instead of using aov() to evaluate the Michelson-Morley experiment, lm() can be used as well.
The command is:
> fml <- lm(Speed ~ Run + Expt, data=mm)
> summary(fml)
> fml <- lm(Speed ~ Run + Expt, data=mm)
> summary(fml)

Call:
lm(formula = Speed ~ Run + Expt, data = mm)

Residuals:
Min 1Q Median 3Q Max
-206.60 -37.35 4.90 44.28 132.90

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.506e+02 3.599e+01 26.413 < 2e-16 ***
Run2 -5.200e+01 4.646e+01 -1.119 0.266588
Run3 -2.800e+01 4.646e+01 -0.603 0.548545
Run4 6.000e+00 4.646e+01 0.129 0.897591
Run5 -7.600e+01 4.646e+01 -1.636 0.106032
Run6 -1.040e+02 4.646e+01 -2.238 0.028125 *
Run7 -1.000e+02 4.646e+01 -2.152 0.034551 *
Run8 -4.000e+01 4.646e+01 -0.861 0.391996
Run9 -1.000e+01 4.646e+01 -0.215 0.830167
Run10 -3.800e+01 4.646e+01 -0.818 0.415992
Run11 4.000e+00 4.646e+01 0.086 0.931621
Run12 -1.738e-13 4.646e+01 -3.74e-15 1.000000
Run13 -3.600e+01 4.646e+01 -0.775 0.440851
Run14 -9.400e+01 4.646e+01 -2.023 0.046576 *
Run15 -6.000e+01 4.646e+01 -1.291 0.200492
Run16 -6.600e+01 4.646e+01 -1.420 0.159552
Run17 -6.000e+00 4.646e+01 -0.129 0.897591
Run18 -3.800e+01 4.646e+01 -0.818 0.415992
Run19 -5.000e+01 4.646e+01 -1.076 0.285271
Run20 -4.400e+01 4.646e+01 -0.947 0.346641
Expt2 -5.300e+01 2.323e+01 -2.281 0.025325 *

Expt3 -6.400e+01 2.323e+01 -2.755 0.007343 **
Expt4 -8.850e+01 2.323e+01 -3.810 0.000281 ***
Expt5 -7.750e+01 2.323e+01 -3.336 0.001317 **

Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1

Residual standard error: 73.46 on 76 degrees of freedom


Multiple R-Squared: 0.3363, Adjusted R-squared: 0.1355
F-statistic: 1.675 on 23 and 76 DF, p-value: 0.04956

The coefficient table can be interpreted in the following way:


The grand mean is 950.6.
The average of Run 2 taken over the 5 experiments is 52.00 below this grand mean.
Same for Runs 3 to 20. So in fact the coefficients are the "corrections" that were used in the
interpretation of the Anova-results (above).

What happened to Run 1? Well, since all "corrections" must sum up to 0 (otherwise the grand
mean would not be the grand mean), Run 1's correction must be minus the sum of all the others.

Same is true for the Expt-levels.

Classification: categorical y-variables

X-matrix as usual (can either be categorical or continuous)

Until now, in the Regression-setting, the Y's were continuous.


In Classification the idea is that Y is categorical, i.e. it just identifies classes.

Iris-Example: one categorical y-variable (3 categories) and four variables

Classical Classification Problem in Statistics: the iris dataset. (It's in the datasets
package and comes in 2 versions: iris is a data frame with 5 columns, the last of which is the
name of the species; iris3 is a 3-d 50*4*3 matrix, the last dimension is for the species.)
The following properties of 150 different iris flowers were measured:
Sepal.Length,
Sepal.Width,
Petal.Length,
Petal.Width

[Scatter plot: Petal.Length vs. Sepal.Length, the three species shown with different symbols]

The idea is to use these variables to decide whether a flower is one of the three species:
setosa, virginica or versicolor.

In the plot the little dots are setosa, the triangles are the versicolor and the crosses (plusses)
are virginica.

attach(iris)
plot(Sepal.Length,Petal.Length,pch=as.numeric(Species))

There are two different paradigms:
(1) classification by supervised learning
(2) clustering by unsupervised learning

[Scatter plot: Petal.Length vs. Sepal.Length, all points shown with the same symbol]

The picture on the last page corresponds to supervised learning: different classes are
indicated by the symbols. The picture at right is unsupervised learning: here all symbols are
the same.

There are different ways of doing each.

Essentially there are two different strategic


approaches to doing clustering and classification:

(a) global methods


(b) local methods

A Global Method of Classification

The easiest global method is Linear Regression of an indicator matrix. Here the idea is to
code the class membership by 0 and 1. So in case of 3 classes, we'd have 3 y-variables, each
of which is 1 if the observation (in our case iris) is in the corresponding class. These three y-
variables are called the indicator matrix.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species S(set) S(ver) S(vir)
1 5,1 3,5 1,4 0,2 setosa 1 0 0
2 4,9 3 1,4 0,2 setosa 1 0 0
3 4,7 3,2 1,3 0,2 setosa 1 0 0
4 4,6 3,1 1,5 0,2 setosa 1 0 0
51 7 3,2 4,7 1,4 versicolor 0 1 0
52 6,4 3,2 4,5 1,5 versicolor 0 1 0
53 6,9 3,1 4,9 1,5 versicolor 0 1 0
54 5,5 2,3 4 1,3 versicolor 0 1 0
101 6,3 3,3 6 2,5 virginica 0 0 1
102 5,8 2,7 5,1 1,9 virginica 0 0 1
103 7,1 3 5,9 2,1 virginica 0 0 1
104 6,3 2,9 5,6 1,8 virginica 0 0 1

Now make models for these y-variables. Once the models are available from fitting to a
training set of data, new observations can be classified by calculating predictions for all of the
indicator y-variables, and then by classifying the new observation into that class, i, which has
the highest yi,pred value.
> lmset<-lm(S.set. ~ Sepal.Length + Sepal.Width + Petal.Length +
Petal.Width, data=iris)
> lmver<-lm(S.ver. ~ Sepal.Length + Sepal.Width + Petal.Length +
Petal.Width, data=iris)
> lmvir<-lm(S.vir. ~ Sepal.Length + Sepal.Width + Petal.Length +
Petal.Width, data=iris)

Using the above models, classification is done by choosing the iris type that has the highest
pred-value. The correct-classification rates are:

total      0.85
setosa     1
versicolor 0.68
virginica  0.86

Some of the predictions are below:

S(set)-pred S(ver)-pred S(vir)-pred Summe Klasse correct?


1 0,98 0,12 -0,10 1 setosa 1
2 0,84 0,35 -0,20 1 setosa 1
3 0,90 0,24 -0,15 1 setosa 1
4 0,83 0,34 -0,16 1 setosa 1
51 0,22 0,36 0,42 1 virginica 0
52 0,22 0,27 0,51 1 virginica 0
53 0,14 0,40 0,46 1 virginica 0
54 0,07 0,68 0,25 1 versicolor 1
101 -0,16 0,07 1,09 1 virginica 1
102 -0,10 0,44 0,65 1 virginica 1
103 -0,13 0,36 0,77 1 virginica 1
104 -0,12 0,50 0,62 1 virginica 1

It is not too hard to prove that if y1 + y2 + y3 = 1, then after fitting a linear model the
same is true for the predictions, i.e. y1,pred + y2,pred + y3,pred = 1. (This can be verified
in the column Summe above and can be proved by using y_i,pred = X(X^T X)^-1 X^T y_i and the
fact that this is a linear operation.)
> plot(lmset$f,S.set.)
> plot(lmver$f,S.ver.)

[Plots: S.set. vs. the fitted values of lmset, and S.ver. vs. the fitted values of lmver, each
with a vertical line at the decision boundary]

One problem with the linear regression of an indicator matrix is that observations that are a
long way away from the separating (or decision) boundary, y_i,pred = 1/2 (vertical line in the
plots above), have a strong leverage (Hebelwirkung) on the model and hence on the position of
the separating (or decision) line.

A second problem with Linear Regression of an indicator matrix is what's called masking.
This is in fact what seems to have happened in the plot above right, where versicolor was
modelled and predictions would be difficult, because setosa are on one side, virginica on the
other. The correct-classification rate for versicolor in any case is the lowest of the three. It is
only 0,68 compared to 0,86 for virginica and an impeccable 1 for setosa.

A Local Method of Classification

The easiest local method is k-nearest-neighbours. For each vector x in the input space define
a neighbourhood N_k(x), given by the k closest points in the training set (observation (row)
vectors in the X-matrix). Then calculate y_pred(x) = (1/k) Σ_{x(i) in N_k(x)} y_i. Depending
on the number k used, nearest-neighbour classification can yield very complex and very
particular separation lines. In particular, if k = 1, then all observations in the training
set will be classified correctly.

Problems with k-nearest-neighbours are:


(1) Tendency to overfit the data, i.e. to make good fit (meaning good classification of the
training set) but bad prediction (meaning bad classification for validation and prediction sets).
(2) For higher dimensional problems the "nearest neighbours" can easily be very, very, very
far away from a point that is to be classified. This (i.e. the sheer size of high dimensional
spaces) is called the "dimension trap".

library(class)    #the knn( ) function is in the class package
data(iris3)
train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3])
test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
knn(train, test, cl, k = 4, prob=TRUE)
attributes(.Last.value)

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition.
Springer.

Other global Classification Methods

Now nearest neighbours and linear regression on indicators are extreme methods. Good
in-between methods exist:

LDA (linear discriminant analysis) and, related to this, QDA (quadratic discriminant
analysis). Also there is logistic regression, which was treated in the section on maximum
likelihood.

More modern methods are so-called kernel methods, which are beyond the scope of this lecture
at this stage. One of the methods related to this class of methods is called support vector
machines (SVM).

Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA)

Let x be a p-vector of measurements (in the iris example p = 4, and the 4 measurements are
Sepal.Length, Sepal.Width, Petal.Length, Petal.Width),
let there be K classes (in the iris example K = 3 and the classes are setosa, versicolor and
virginica),
let p_k be the prior probability (i.e. with no information from any measurement) of being in
class k, and
let f_k(x), the probability function within class k, be a p-dimensional Gauss-function with
mean μ_k and covariance matrix Σ. Σ need not be a diagonal matrix, but it is assumed that Σ is
the same for all classes (if Σ_k is allowed to depend on the class k, then the same ideas can
be used and lead to QDA, quadratic discriminant analysis. This is advantageous from the point
of view of flexibility, but involves estimating many more parameters).

Then the linear discriminant functions are defined by:

δ_k(x) = x^T Σ^-1 μ_k - (1/2) μ_k^T Σ^-1 μ_k + log(p_k)

For a measurement x, the class is assigned that has maximal δ_k(x). This can be deduced from
the formula for the p-dimensional Gauss-function with covariance Σ. It means that the decision
boundaries between two classes k and l are along lines of the form δ_k(x) = δ_l(x). These are
linear.
Of course in applications the true parameters p_k, μ_k and Σ are not known, so they are
estimated from the data:
p_k: N_k/N, where N_k is the number of class-k observations and N the number of all
observations,
μ_k: 1/N_k Σ_{class of x_i is k} x_i,
Σ: 1/(N-K) Σ_{k=1..K} Σ_{class of x_i is k} (x_i - μ_k)(x_i - μ_k)^T.

For QDA the discriminant function is:

δ_k(x) = -(1/2) log|Σ_k| - (1/2) (x - μ_k)^T Σ_k^-1 (x - μ_k) + log(p_k)
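In R both methods are available in the MASS package; a minimal sketch on the iris data
(resubstitution error only, no validation set):

# Sketch: LDA and QDA on the iris data with the MASS package
library(MASS)
ld <- lda(Species ~ ., data = iris)          # pooled covariance matrix Sigma
table(iris$Species, predict(ld)$class)       # resubstitution confusion matrix
qd <- qda(Species ~ ., data = iris)          # class-specific covariance matrices Sigma_k
table(iris$Species, predict(qd)$class)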

CART (classification and regression trees), respectively RPART (recursive partitioning and
regression trees)
> library(rpart)   #rpart( ) is in the rpart package
> plot(rpart(Species ~ .,data=iris),margin=0.1,branch=0.5)
> text(rpart(Species ~ .,data=iris))

[Tree plot: first split at Petal.Length < 2.45 (left branch: setosa); second split at
Petal.Width < 1.75 (left: versicolor, right: virginica). Next to it: the corresponding
partition of the Petal.Width vs. Petal.Length plane]

Cox-PH for reliability and life-time analysis

In the first semester we had seen that time-dependent processes could be modeled using
exponential decay models or Weibull models.

[Figure panels: hazard h(t) = h0(t) = const.; availability R(t) = e^(-λt), or survival S(t);
probability of failure F(t) = 1 - e^(-λt); probability density f(t) = λ e^(-λt) for failure at
time t, with a shaded area = P(a part will fail in this interval)]

An interesting generalization, that allows studying the dependency on other covariates
(x-variables), is the Cox proportional hazards model, which David Cox introduced in 1972.

Hazard in general is the risk of failure at a given time point, so h(t) = -S'(t)/S(t).

In exponential decay h(t) = λ is constant; in the Weibull model h(t) = (β/η)(t/η)^(β-1).
The special case β = 1 is again exponential decay, with λ = 1/η.

The Cox proportional hazards model is a so-called semi-parametric model, that leaves open the
way the hazard function depends on time (so there is no parameter here), and in addition
allows for the hazard to depend on other covariates in a simple way (described by the
parameter vector b):
h(t) = h0(t) exp(bx)
where x is a vector of covariates and b is a vector of coefficients, that will be estimated
from the data. The covariates, x, are assumed to be centred!

The interesting thing about this model is that h0(t), so-called base line hazard and the
coefficients b can be estimated independently of each other! So it is usual to estimate
coefficients first and base line hazard after.

To estimate the coefficients a variant of maximum likelihood estimation is used. In fact a new
notion was introduced by Cox, namely partial likelihood.

His idea was, given a set of data consisting of subjects S_j and failure times t_j, to
maximize the probability P(S1=i1, ..., Sn=in), where (i_j) is the index sequence of the
time-ordered failure events as they actually occurred, amongst the probabilities of all
possible permutations, by varying b.

This can be done by first calculating
P(Sj = i | all data before j) = hi(tj) / Σk hk(tj),
where the sum goes over all subjects that are still at risk, meaning that they have neither had a
failure nor been taken out of the test. Since numerator and denominator both contain the
base line hazard h0(tj) as a factor, it can simply be cancelled out (this is the trick that makes
coefficient estimation independent of the base line hazard) and leaves one with the quotient
P(Sj = i | all data before j) = exp(b·xi) / Σk exp(b·xk),
where again the sum goes over all subjects that are still at risk at time tj.

Using the chain rule for conditional probabilities, P(A and B) = P(A | B)·P(B), applied at all
failure time points for all n subjects, one has
P(S1=i1, ..., Sn=in) = Πj=1..n [ exp(b·xj) / Σ(k at risk at time tj) exp(b·xk) ]   (Cox partial
likelihood), where xj denotes the covariates of the subject failing at time tj. This partial
likelihood can be maximized by varying the coefficients b.

As an example, we take 4 subjects, S1, S2, S3, S4, that fail at time points 2, 3, 4 and 5.
A covariate x may take the values −1 and +1; for simplicity we assume −1 occurs twice and +1
occurs twice.

In case x1 and x2 are the −1's (i.e. the ordering of failures is − − + +),

P−−++ = [exp(−b)/(2exp(−b)+2exp(b))] · [exp(−b)/(exp(−b)+2exp(b))] · [exp(b)/(2exp(b))] · [exp(b)/exp(b)]
      = 1 / [ (2exp(−b)+2exp(b))·(exp(−b)+2exp(b))·2exp(b)·exp(b) ],
in the other cases:
P−+−+ = 1 / [ (2exp(−b)+2exp(b))·(exp(−b)+2exp(b))·(exp(−b)+exp(b))·exp(b) ],
P−++− = 1 / [ (2exp(−b)+2exp(b))·(exp(−b)+2exp(b))·(exp(−b)+exp(b))·exp(−b) ],
P+−−+ = 1 / [ (2exp(−b)+2exp(b))·(2exp(−b)+exp(b))·(exp(−b)+exp(b))·exp(b) ],
P+−+− = 1 / [ (2exp(−b)+2exp(b))·(2exp(−b)+exp(b))·(exp(−b)+exp(b))·exp(−b) ],
P++−− = 1 / [ (2exp(−b)+2exp(b))·(2exp(−b)+exp(b))·2exp(−b)·exp(−b) ].
[Figure: the partial likelihood of each of the six orderings (legend: − − + +, − + − +, − + + −, + − − +, + − + −, + + − −) plotted against b over the range −2,5 … 2,5.]
The best coefficients in each case are −∞ for − − + +; −0,47 for − + − +; −0,2 for + − − +;
+∞ for + + − −; 0,47 for + − + −; 0,2 for − + + −.
The infinity cases are degenerate; they occur when the correlation between failure order and the
covariate values is very high (here: all subjects with one covariate value fail before all subjects
with the other). This is usually an indication that the hazard in the presence of the covariates is
not proportional to a baseline hazard, i.e. Cox-PH should not be used.
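The value −0,47 quoted for the ordering − + − + can be checked numerically (a sketch added here, not part of the original script): write down the log partial likelihood for that ordering and maximize it with optimize(); the result agrees with the coefficient that coxph() reports below.

> # covariate values in the order in which the subjects fail: -1, +1, -1, +1
> logPL <- function(b) {
+   x <- c(-1, 1, -1, 1)
+   sum(sapply(1:4, function(j) {
+     risk <- j:4                       # subjects still at risk at the j-th failure
+     b * x[j] - log(sum(exp(b * x[risk])))
+   }))
+ }
> optimize(logPL, interval = c(-5, 5), maximum = TRUE)$maximum   # approximately -0.47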

For interpretation purposes it is important to notice that the coefficients depend only on the
ordering of the failures and the observed values of the covariates, not at all on the time points
of failure themselves! The time points determine the base line hazard function, which can be
obtained by setting all x's to zero (i.e. to the centre) and fitting to the time points (for example
with a Weibull fit or one of the non-parametric methods offered in R).

In R, Cox proportional hazard regression can be found in the survival package. There is also a
penalized version, which is better suited for collinear x-data, in the penalized package. But we
shall learn about this later (third semester, MVDA, multivariate data analysis).

The response in coxph( ) is a survival object, which is built from the time and event columns of
the data frame using the function Surv( ). [Careful: capital S at the beginning.]
> coxtest<-read.delim("clipboard",dec=",")
> coxtest
day event x
1 2 1 -1
2 4 1 -1
3 3 1 1
4 5 1 1
> library(survival)
> cox.mod <- coxph(Surv(day, event) ~ x, data = coxtest)
> summary(cox.mod)
Call:
coxph(formula = Surv(day, event) ~ x, data = coxtest)

n= 4

coef exp(coef) se(coef) z Pr(>|z|)


x -0.4703 0.6248 0.6201 -0.758 0.448

exp(coef) exp(-coef) lower .95 upper .95


x 0.6248 1.600 0.1853 2.107

Rsquare= 0.143 (max possible= 0.796 )


Likelihood ratio test= 0.62 on 1 df, p=0.4325
Wald test = 0.58 on 1 df, p=0.4482
Score (logrank) test = 0.62 on 1 df, p=0.4328
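For interpretation (a remark added here, not part of the original output): exp(coef) = exp(−0.4703) ≈ 0.62 is the estimated hazard ratio per unit increase in x, i.e. a subject with x = +1 is estimated to have about 0.62 times the hazard of a subject with x = 0. It can also be extracted directly:

> exp(coef(cox.mod))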
> basehaz(cox.mod)
     hazard time
1 0.2246892    2
2 0.5755534    3
3 1.0249318    4
4 2.6254170    5
> plot(basehaz(cox.mod))
> plot(survfit(cox.mod))

[Figure: plot of basehaz(cox.mod), time (2 … 5) against the cumulative baseline hazard (about 0.22 … 2.63).]
The hazard function shown is the (cumulative) baseline hazard; note that the hazard itself is
estimated as a discrete distribution with point masses at the event times, so its integral is a
piecewise constant (step) function.

The survival function S(t), which satisfies the equation h(t) = −S'(t)/S(t), is also shown for the
baseline situation (i.e. for x = 0), with 95% confidence intervals.

[Figure: baseline survival curve from plot(survfit(cox.mod)), falling from 1.0 towards 0.0 over
t = 0 … 5, with 95% confidence bands.]

To get the survival function for different values of the covariates, construct a new data frame
and use the command:

> plot(survfit(cox.mod, newdata = data.frame(x = c(-1, 1))))

These curves are shown without confidence intervals.

[Figure: survival curves for x = −1 and x = +1 over t = 0 … 5.]
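Since it was remarked above that Cox-PH should not be used when the hazards are not proportional, it is worth mentioning that the survival package also provides cox.zph( ) to check this assumption on a fitted model (a sketch added here, not part of the original script; with only four observations the test is of course not meaningful):

> cox.zph(cox.mod)        # tests proportional hazards via scaled Schoenfeld residuals
> plot(cox.zph(cox.mod))  # residuals against time; a clear trend indicates non-proportionality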

