Michael Nussbaum
May 3, 2004
Department of Mathematics,
Malott Hall, Cornell University,
Ithaca NY 14853-4201,
e-mail nussbaum@math.cornell.edu,
http://www.math.cornell.edu/~nussbaum
CONTENTS
0.2 Preface
0.3 References
1 Introduction
1.1 Hypothesis testing
1.2 What is statistics?
1.3 Confidence intervals
1.3.1 The Law of Large Numbers
1.3.2 Confidence statements with the Chebyshev inequality
4 Unbiased estimators
4.1 The Cramér-Rao information bound
4.2 Countably infinite sample space
4.3 The continuous case
10 Regression
10.1 Regression towards the mean
10.2 Bivariate regression models
10.3 The general linear model
10.3.1 Special cases of the linear model
10.4 Least squares and maximum likelihood estimation
10.5 The Gauss-Markov Theorem
13 Exercises
13.1 Problem set H1
13.2 Problem set H2
13.3 Problem set H3
13.4 Problem set H4
13.5 Problem set H5
13.6 Problem set H6
13.7 Problem set H7
13.8 Problem set E1
13.9 Problem set H8
13.10 Problem set H9
13.11 Problem set H10
13.12 Problem set E2
14 Appendix: tools from probability, real analysis and linear algebra
14.1 The Cauchy-Schwarz inequality
14.2 The Lebesgue Dominated Convergence Theorem
0.2 Preface
Spring. 4 credits. Prerequisite: MATH 471 and knowledge of linear algebra such as taught in
MATH 221. Some knowledge of multivariate calculus is helpful but not necessary.
Classical and recently developed statistical procedures are discussed in a framework that emphasizes
the basic principles of statistical inference and the rationale underlying the choice of these procedures
in various settings. These settings include problems of estimation, hypothesis testing, and large
sample theory. (The Cornell Courses of Study 2000-2001.)
This course is a sequel to the introductory probability course MATH 471. These notes will be used
as a basis for the course, in combination with a textbook (to be found among the references
given below).
0.3 References
[BD] Bickel, P. and Doksum, K., Mathematical Statistics: Basic Ideas and Selected Topics, Vol. 1 (2nd Edition), Prentice Hall, 2001.
[CB] Casella, G. and Berger, R., Statistical Inference, Duxbury Press, 1990.
[D] Durrett, R., The Essentials of Probability, Duxbury Press, 1994.
[DE] Devore, J., Probability and Statistics for Engineering and the Sciences, Duxbury - Brooks/Cole, 2000.
[FPP] Freedman, D., Pisani, R. and Purves, R., Statistics (3rd Edition), 1997.
[HC] Hogg, R. V. and Craig, A. T., Introduction to Mathematical Statistics (5th Edition), Prentice-Hall, 1995.
[HT] Hogg, R. V. and Tanis, E. A., Probability and Statistical Inference (6th Edition), Prentice-Hall, 2001.
[LM] Larsen, R. and Marx, M., An Introduction to Mathematical Statistics and Its Applications, Prentice Hall, 2001.
[M] Moore, D., The Basic Practice of Statistics (2nd Edition), W. H. Freeman and Co., 2000.
[R] Rice, J., Mathematical Statistics and Data Analysis, Duxbury Press, 1995.
[ROU] Roussas, G., A Course in Mathematical Statistics (2nd Edition), Academic Press, 1997.
[RS] Rohatgi, V. and Ehsanes Saleh, A. K., An Introduction to Probability and Statistics, John Wiley, 2001.
[SH] Shao, Jun, Mathematical Statistics, Springer Verlag, 1998.
[TD] Tamhane, A. and Dunlop, D., Statistics and Data Analysis: From Elementary to Intermediate, Prentice Hall, 2000.
Chapter 1
INTRODUCTION
To test whether the null hypothesis is true, we spin the roulette wheel n times and let Xi = 1 if
red comes up on the i-th trial and 0 otherwise, so that X̄n is the fraction of times red comes up in
the first n trials. The test is specified by giving a critical region Cn, so that we reject H0 (that is,
decide H0 is incorrect) when X̄n ∈ Cn. One possible choice in this case is

Cn = { x : |x − 18/38| > 2 √( (18/38)(20/38) ) / √n }.

This choice is motivated by the fact that if H0 is true then, using the central limit theorem (ξ
denotes a standard normal variable),

P( X̄n ∈ Cn ) = P( |X̄n − μ| / (σ/√n) ≥ 2 ) ≈ P( |ξ| ≥ 2 ) ≈ 0.05.   (1.1)

Rejecting H0 when it is true is called a type I error. In this test we have set the type I error
to be 5%.
The basis for the approximation is the central limit theorem. Indeed, the results Xi,
i = 1, . . . , n of the n trials are independent random variables all having the same distribution (or
probability law). This distribution is a Bernoulli law, where Xi takes only the values 0 and 1:

P(Xi = 0) = 1 − p,  P(Xi = 1) = p.
The expectation and variance of X1 are

μ = EX1 = p,  σ² = Var(X1) = p(1 − p).

The central limit theorem now states that

(X̄n − μ) / (σ/√n) →L N(0, 1),   (1.2)

where N(0, 1) is the standard normal distribution and →L signifies convergence in law (in distribution). Thus as n → ∞,

P( |X̄n − μ| / (σ/√n) ≥ 2 ) → P( |ξ| ≥ 2 ).

So in fact our reasoning is based on a large sample approximation for n → ∞. The relation
P(|ξ| ≥ 2) ≈ 0.05 is then taken from a tabulation of the standard normal law N(0, 1).
Now √( (18/38)(20/38) ) = 0.4993 ≈ 1/2, so to simplify the arithmetic the test can be formulated as

reject H0 if |X̄n − 18/38| > 1/√n,

or, in terms of the total number of reds Sn = X1 + . . . + Xn,

reject H0 if |Sn − 18n/38| > √n.
Suppose now that we spin the wheel n = 3800 times and get red 1868 times. Is the wheel biased?
We expect 18n/38 = 1800 reds, so the excess number of reds is |Sn − 1800| = 68. Given the large
number of trials, this might not seem like a large excess. However √3800 ≈ 61.6 and 68 > 61.6,
so we reject H0 and conclude: if H0 were correct, then we would see an observation this far from the
mean less than 5% of the time.
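The arithmetic of this example can be checked directly; the following sketch (variable names are our own) reproduces the decision from the counts quoted in the text:

```python
import math

# Roulette example: test H0 (wheel is fair, red has probability 18/38)
# with the simplified rule: reject H0 if |Sn - 18n/38| > sqrt(n).
n, s_n = 3800, 1868            # number of spins, number of reds observed
expected = 18 * n / 38         # expected number of reds under H0
excess = abs(s_n - expected)   # |Sn - 18n/38|
critical = math.sqrt(n)        # critical value sqrt(n)

reject_h0 = excess > critical
print(f"expected={expected:.0f}, excess={excess:.0f}, "
      f"critical={critical:.1f}, reject={reject_h0}")
```

With these numbers the excess 68 exceeds √3800 ≈ 61.6, so H0 is rejected, as in the text.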
Testing academic performance ([D] chap. 5.4, p. 244). Do married college students with
children do less well because they have less time to study, or do they do better because they are
more serious? The average GPA at the university is 2.48, so we might set up the following
hypothesis test concerning μ, the mean grade point average of married students with children:

H0 : μ = 2.48 null hypothesis
H1 : μ ≠ 2.48 alternative hypothesis.

Suppose that to test this hypothesis we have records X1, . . . , Xn of n = 25 married college students
with children. Their average GPA is X̄n = 2.35 and the sample standard deviation is σ̂n = 0.5. Recall
that the standard deviation of a sample X1, . . . , Xn with sample mean X̄n is

σ̂n = √( (n − 1)^{-1} ∑_{i=1}^n (Xi − X̄n)² ).
Using (1.1) from the last example, we see that to have a test with a type I error of 5% we should

reject H0 if |X̄n − 2.48| > 2σ/√n.
The basis is again the central limit theorem: no particular assumption is made about the distribution
of the Xi; they are just independent and identically distributed random variables (i.i.d. r.v.'s)
with a finite variance σ² (standard deviation σ). We again have the CLT (1.2), and we can take
μ = 2.48 to test our hypothesis. But contrary to the previous example, the value of σ is then still
undetermined (in the previous example both μ and σ are given by p). Thus σ is unknown, but we
can estimate it by the sample standard deviation σ̂n. Taking n = 25 we see that

2σ̂n/√n = 2(0.5)/√25 = 0.2 > 0.13 = |X̄n − 2.48|,
so we are not 95% certain that μ ≠ 2.48. Note the inconclusiveness of the outcome: the result
of the test is the negative statement "we are not 95% certain that H0 is not true", but not
that there is particularly strong evidence for H0. That nonsymmetric role of the two hypotheses is
specific to the setup of statistical testing; this will be discussed in detail later.
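The GPA computation can be sketched in code as well (a sketch using the sample statistics from the text; variable names are our own):

```python
import math

# GPA example: test H0: mu = 2.48 with n = 25, sample mean 2.35, sample sd 0.5.
n, xbar, s = 25, 2.35, 0.5
threshold = 2 * s / math.sqrt(n)   # reject H0 if |xbar - 2.48| exceeds this
observed = abs(xbar - 2.48)

print(f"threshold={threshold:.2f}, observed={observed:.2f}, "
      f"reject={observed > threshold}")
```

Since 0.13 < 0.2, the test does not reject H0, matching the inconclusive outcome above.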
Testing the difference of two means ([D] p. 245). Suppose we have independent random samples
of size n1 and n2 from two populations with unknown means μ1, μ2 and variances σ1², σ2², and we
want to test

H0 : μ1 = μ2 null hypothesis
H1 : μ1 ≠ μ2 alternative hypothesis.
Indeed these are just reformulations of (1.2), using properties of the normal law:

L(ξ) = N(0, 1) if and only if L( (σ/√n) ξ + μ ) = N(μ, σ²/n),

and L(ξ) = L(−ξ) = N(0, 1). Here we are using the standard notation L(ξ) for the probability law of
ξ (law of ξ, distribution of ξ). We have assumed that the two samples are independent, so if H0
is correct,

X̄1 − X̄2 ≈ N( 0, σ1²/n1 + σ2²/n2 ).

Based on the last result, if we want a test with a type I error of 5% then we should

reject H0 if |X̄1 − X̄2| > 2 √( σ1²/n1 + σ2²/n2 ).
For a concrete example we consider a study of passive smoking reported in the New England Journal
of Medicine (cf. [D] p. 246). A measurement of the size S of lung airways called FEF 25-75%
was taken for 200 female nonsmokers who were in a smoky environment and for 200 who were not.
In the first group the average value of S was 2.72 with a standard deviation of 0.71, while in the
second group the average was 3.17 with a standard deviation of 0.74 (larger values are better).
To see that there is a significant difference between the averages we note that

2 √( σ1²/n1 + σ2²/n2 ) = 2 √( (0.71)²/200 + (0.74)²/200 ) = 0.14503,

while |X̄1 − X̄2| = 0.45. With these data H0 is rejected, based on reasoning similar to the
first example: if H0 were true, then what we are seeing would be very improbable, i.e. would have
probability not higher than 0.05. But again the reasoning is based on a normal approximation,
i.e. a belief that a sample size of 200 is large enough.
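A quick check of the passive-smoking computation (a sketch; the sample statistics are those quoted above, variable names are our own):

```python
import math

# Passive smoking example: two-sample test of H0: mu1 = mu2 at the 5% level.
n1 = n2 = 200
xbar1, s1 = 2.72, 0.71   # smoky environment
xbar2, s2 = 3.17, 0.74   # non-smoky environment

critical = 2 * math.sqrt(s1**2 / n1 + s2**2 / n2)
diff = abs(xbar1 - xbar2)
print(f"critical={critical:.5f}, diff={diff:.2f}, reject={diff > critical}")
```

The observed difference 0.45 is far above the critical value 0.14503, so H0 is rejected.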
Function: noun
Date: 1594
1 : the act or process of inferring: as a : the act of passing from one proposition, statement, or
judgment considered as true to another whose truth is believed to follow from that of the former b
: the act of passing from statistical sample data to generalizations (as of the value of population
parameters) usually with calculated degrees of certainty.
as an estimate of p, so p̂n = X̄n.
1.3.1 The Law of Large Numbers
We have for X̄n = p̂n and any ε > 0

P( |p̂n − p| > ε ) → 0 as n → ∞.   (1.3)

In words, for any small fixed number ε, the probability that p̂n is outside the interval (p − ε, p + ε)
can be made arbitrarily small by selecting n sufficiently large. If the institute samples enough voters,
it can believe that its estimate p̂n is close enough to the true value. In statistics, estimates which
converge to the true value in the above probability sense are called consistent estimates (recall
that the convergence type (1.3) is called convergence in probability). As a basic requirement, the
institute needs a good estimate p̂n to base its prediction upon. The LLN tells the institute that it
actually pays to get more opinions.
Suppose the institute has sampled a large number of voters, and the estimate p̂n turns out to be larger
than 1/2, but near it. It is natural to proceed with caution in this case, as the reputation of the institute
depends on the reliability of its published results. Results which are deemed unreliable will not be
published. A controversy might arise within the institute:
Researcher a: We spent a large amount of money on this poll and we have a really large n. So let
us go ahead and publish the result that A will win.
Researcher b: This result is too close to the critical value. I do not claim that B will win, but I
am in favor of not publishing a prediction.
Clearly a sound and rational criterion is needed for a decision whether to publish or not. A method
for this should be fixed in advance.
1.3.2 Confidence statements with the Chebyshev inequality
Recall Chebyshev's inequality ([D] chap. 5.1, p. 222). If Y is a random variable with finite variance
Var(Y) and y > 0, then

P( |Y − EY| ≥ y ) ≤ Var(Y)/y².

Applied to Y = p̂n, with Var(p̂n) = Var(X1)/n, this yields

P( |p̂n − p| > ε ) ≤ Var(X1)/(nε²) = p(1 − p)/(nε²).   (1.4)

We wish to make this probability at most α, for an α given in advance (e.g. α = 0.05 or α = 0.01).
Now p(1 − p) ≤ 1/4, so

P( |p̂n − p| > ε ) ≤ 1/(4nε²) ≤ α   (1.5)

provided we select ε = 1/√(4nα).
The Chebyshev inequality thus allows us to quantitatively assess the accuracy ε when the sample
size is given (or, alternatively, to determine the necessary sample size to attain a given desired
level of accuracy ε). The convergence in probability (or LLN) (1.3) is just a qualitative statement
on p̂n; it is also derived from the Chebyshev inequality (cf. the proof of the LLN).
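Both uses of the bound, accuracy for a given sample size and sample size for a given accuracy, can be sketched as follows (the helper names and the example values n = 1000, α = 0.05, ε = 0.02 are our own illustrations):

```python
import math

# Chebyshev bound for the sample proportion:
# P(|p_hat - p| > eps) <= 1/(4 n eps^2) = alpha  when  eps = 1/sqrt(4 n alpha).
def chebyshev_eps(n: int, alpha: float) -> float:
    """Accuracy eps guaranteed with confidence 1 - alpha for sample size n."""
    return 1.0 / math.sqrt(4 * n * alpha)

def required_n(eps: float, alpha: float) -> int:
    """Sample size needed for accuracy eps with confidence 1 - alpha."""
    return math.ceil(1.0 / (4 * alpha * eps**2))

# e.g. 1000 sampled voters, 95% confidence:
eps = chebyshev_eps(1000, 0.05)
print(f"eps = {eps:.4f}")

# conversely: accuracy of 2 percentage points at 95% confidence
print(required_n(0.02, 0.05))
```

The bound is distribution-free, which is why the required sample sizes are conservative compared with normal-approximation arguments.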
Another way of phrasing (1.5) would be: the probability that the interval (p̂n − ε, p̂n + ε) covers p
is more than 95%, or

P( (p̂n − ε, p̂n + ε) ∋ p ) ≥ 1 − α.   (1.6)

Such statements are called confidence statements, and the interval (p̂n − ε, p̂n + ε) is a confidence
interval. Note that p̂n is a random variable, so the interval is in fact a random interval. Therefore
the element sign is written in the reverse form ∋, to stress the fact that in (1.6) the interval is random,
not p (p is merely unknown).
To be even more precise, we note that the probability law depends on p, so we should properly write
Pp (as is usually done in statistics, where the probabilities depend on an unknown parameter). So
we have

Pp( (p̂n − ε, p̂n + ε) ∋ p ) ≥ 1 − α.   (1.7)

When α is a preset value and (p̂n − ε, p̂n + ε) is known to fulfill (1.7), the common practical point
of view is: we believe that our true unknown p is within distance ε of p̂n. When p̂n happens to be
more than ε away from 1/2 (and e.g. larger), then the opinion poll institute has enough evidence;
this immediately implies "we believe that our true unknown p is greater than 1/2", and they can
go ahead and publish the result. They know that if the true p is actually less than 1/2, then the
event they have observed (p̂n more than ε away from p) has probability at most α.
Note that the choice of α is to some extent arbitrary, but common values are α = 0.05 (95% confidence) and
α = 0.01 (99% confidence).
The reasoning "when I observe a fact (an outcome of a random experiment) and I know that under
a certain hypothesis this fact would have less than 5% probability, then I reject this hypothesis" is
very common; it is the basis of statistical testing theory. In our case of confidence intervals, the
fact (event) would be "1/2 is not within (p̂n − ε, p̂n + ε)" and the hypothesis would be p = 1/2.
When we reject p = 1/2 because of p̂n ≥ 1/2 + ε, then we can also reject all values p < 1/2.
But this type of decision rule (rational decision making, testing) cannot give reasonable certainty
in all cases. When 1/2 is in the 95% confidence interval, the institute would be well advised to be
cautious, and not publish the result. It just means "unfortunately, I did not observe a fact which
would be very improbable under the hypothesis, so there is not enough evidence against the
hypothesis; nothing can really be ruled out."
In summary: the prediction is suggested by p̂n; the confidence interval is a rational way of deciding
whether to publish or not.
Note that, contrary to the above testing examples, the confidence interval did not involve any
large sample approximation. However such arguments (normal approximation, estimating Var(X1)
by p̂n(1 − p̂n)) can alternatively be used here.
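How conservative the Chebyshev interval is can be seen in a small simulation (our own illustration, with hypothetical values p = 0.45, n = 500): the guaranteed coverage is 1 − α, while the empirical coverage is much higher.

```python
import math
import random

# Empirical coverage of the Chebyshev confidence interval
# (p_hat - eps, p_hat + eps) with eps = 1/sqrt(4 n alpha).
random.seed(1)
p, n, alpha = 0.45, 500, 0.05
eps = 1.0 / math.sqrt(4 * n * alpha)

trials = 1000
covered = 0
for _ in range(trials):
    # one opinion poll: n independent Bernoulli(p) responses
    p_hat = sum(random.random() < p for _ in range(n)) / n
    if p_hat - eps < p < p_hat + eps:
        covered += 1

coverage = covered / trials
print(f"eps={eps:.3f}, empirical coverage={coverage:.3f}")
```

The Chebyshev bound guarantees coverage at least 0.95 here; in the simulation the interval covers p in essentially every trial.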
Chapter 2
ESTIMATION IN PARAMETRIC MODELS
X1 = 1 if defective
X1 = 0 otherwise.
That is, X1 takes value 1 with probability p and value 0 with probability 1 − p. Such a random
variable is called a Bernoulli random variable and the corresponding probability distribution
is the Bernoulli distribution, or Bernoulli law, written B(1, p). The sample space of X1 is the set
{0, 1}.
When we take a sample of n transistors, this should be a simple random sample, which means
the following:
a) each individual is equally likely to be included in the sample
b) results for different individuals are independent of one another.
In mathematical language, a simple random sample of size n yields a set X1, . . . , Xn of independent,
identically distributed random variables (i.i.d. random variables). They are identically distributed,
in the above example, because they all follow the Bernoulli law B(1, p) (they are a random selection
from the population which has population proportion p). The X1, . . . , Xn are independent as
random variables because of property b) of a simple random sample.
Denote by X = (X1, . . . , Xn) the totality of observations, or the vector of observations. This is now a
random variable with values in the n-dimensional Euclidean space R^n. (We also call this a random
vector, or a random variable with values in R^n. Some probability texts assume random variables
to take values in R only; the higher dimensional objects are then called random elements or random
vectors.) The sample space of X is now the set of all sequences of length n which consist of 0's
and 1's, written also {0, 1}^n. In general, we denote by X the sample space of an observed random
variable X.
Notation Let X be a random variable with values in a space X. We write L(X) for the probability
distribution (or the law) of X.
Recall that the probability distribution (or the law) of X is given by the totality of the values
X can take, together with the associated probabilities. That definition is valid for discrete random
variables (the totality of values is finite or countable); for continuous random variables the probability
density function defines the distribution. When X is real valued, either discrete or continuous,
the law of X can also be described by the distribution function

F(x) = P(X ≤ x).
In the above example, each individual Xi is Bernoulli: L(Xi) = B(1, p), but the law of X =
(X1, . . . , Xn) is not Bernoulli: it is the law of n i.i.d. random variables having Bernoulli law
B(1, p). (In probability theory, such a law is called the product law, written B(1, p)^n.) Note that
in our statistical problem above, p is not known, so we have a whole set of laws for X: all the laws
B(1, p)^n where p ∈ [0, 1].
The parametric estimation problem. Let X be an observed random variable with values in X,
and let L(X) be the probability distribution (or the law) of X. Assume that L(X) is unknown, but
known to be from a certain set of laws:

L(X) ∈ {P_θ ; θ ∈ Θ}.

Here θ is an index (a parameter) of the law, and Θ is called the parameter space (the set of admitted
θ). The problem is to estimate θ based on a realization of X. The set {P_θ ; θ ∈ Θ} is
also called a parametric family of laws.
In the sequel we assume that Θ is a subset of the real line R and X = (X1, . . . , Xn), where X1, . . . , Xn
are independent random variables. Here n is called the sample size. In most of this section we confine
ourselves to the case that X is a finite set (with the exception of some examples). In the above
example, θ takes the role of the population proportion p; since the population proportion
is known to be in [0, 1], the parameter space would be Θ = [0, 1].
Definition 2.1.1 (i) A statistic T is an arbitrary function of the observed random variable X.
(ii) As an estimator T of θ we admit any mapping with values in Θ,

T : X → Θ.

In this case, for any realization x, the statistic T(x) gives the estimated value of θ.
Note that T = T(X) is also a random variable. Statistical terminology is such that an "estimate"
is a realized value of that random variable, i.e. T(x) above (the estimated value of θ), whereas
"estimator" denotes the whole function T (also called estimating function). Sometimes the
words estimate and estimator are used synonymously.
Thus an estimator is a special kind of statistic. Other instances of statistics are those used for
building tests or confidence intervals.
Notation Since the distribution of X depends on an unknown parameter, we stress this dependence
and write

P_θ(X ∈ B),  E_θ h(X)  etc.

for probabilities, expectations etc. which are computed under the assumption that θ is the
true parameter of X.
For later reference we write the model of i.i.d. Bernoulli random variables in the following form.
Model M1 A random vector X = (X1, . . . , Xn) is observed, with values in X = {0, 1}^n; the
distribution of X is the joint law B(1, p)^n of n independent and identically distributed
Bernoulli random variables Xi, each having law B(1, p), where p ∈ [0, 1].
This fits into the above parametric estimation problem as follows: we have to set θ = p, Θ =
[0, 1] and P_θ = B(1, p)^n.
Remark 2.1.2 We set p = θ and write Pp(·) for probabilities depending on the unknown p. Thus
for a particular value x = (x1, . . . , xn) ∈ {0, 1}^n we have

Pp(X = x) = ∏_{i=1}^n Pp(Xi = xi) = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi}.
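The product formula of Remark 2.1.2 can be checked numerically; the following sketch (function name ours) compares the factor-by-factor product with the closed form p^z (1 − p)^{n−z}, where z = ∑ xi:

```python
from math import prod

# Probability of a particular binary sequence x under the product law B(1,p)^n,
# computed factor by factor as in Remark 2.1.2.
def bernoulli_likelihood(x, p):
    return prod(p**xi * (1 - p)**(1 - xi) for xi in x)

x = [1, 0, 1, 1, 0]   # an illustrative realization
p = 0.3
z = sum(x)            # number of observed 1's
closed_form = p**z * (1 - p)**(len(x) - z)

print(bernoulli_likelihood(x, p), closed_form)
```

Both expressions agree, since each factor contributes either p (for xi = 1) or 1 − p (for xi = 0).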
The problem is to estimate the parameter p from the data X1, . . . , Xn. (Note that now we have used the
word "data" for the random variables, not for the realizations; but this also corresponds to usage
in statistics.) As estimators, all mappings from X into [0, 1] are admitted, for instance the relative
frequency of observed 1's:

T(X) = n^{-1} ∑_{j=1}^n Xj = X̄n = p̂n,

i.e. the sample mean X̄n, which coincides with the sample proportion p̂n from (2.1). In the sequel
we shall identify certain requirements which good estimators have to fulfill, and we shall develop
criteria to quantify the performance of estimators.
Let us give some other examples of parametric estimation problems. In each case, we observe a
random vector X = (X1, . . . , Xn) consisting of independent and identically distributed random
variables Xi. For the Xi we shall specify a parametric family of laws {Q_θ, θ ∈ Θ}; this then defines
the parametric family of laws {P_θ, θ ∈ Θ} for X, i.e. the specification L(X) ∈ {P_θ, θ ∈ Θ}.
It suffices to give the family of laws of X1; then L(X1) ∈ {Q_θ, θ ∈ Θ} determines the family
{P_θ, θ ∈ Θ}.
(i) Poisson family: {Po(λ), λ > 0}. Here θ = λ and Θ = (0, ∞).
(ii) Normal location family: {N(μ, σ²), μ ∈ R}, where σ² is fixed (known). Here θ = μ, Θ = R,
and the expectation parameter μ of the normal law describes its location on the real line.
(iii) Uniform family: {U(0, θ), θ > 0}, where U(0, θ) is the uniform law with endpoints 0 and θ,
having density

p_θ(x) = θ^{-1} for 0 ≤ x ≤ θ,  p_θ(x) = 0 otherwise.
Choosing the best estimator. In the above example involving transistors, the sample proportion
p̂n suggested itself as a reasonable estimator of the population proportion p. However it is not a priori
clear that this is the best; we may also consider other functions of the observations X = (X1, . . . , Xn).
First of all, we have to define what it means that an estimator is "good". A quantitative comparison
of estimators is made possible by the approach of statistical decision theory. We choose a loss
function L(t, θ) which measures the loss (inaccuracy) if the unknown parameter θ is estimated
by a value t. We stress that t must be chosen as an estimate depending on the data, so the
criterion becomes more complicated (randomness intervenes) and the choice of the loss function is
just a first step. The loss is assumed to be nonnegative, i.e. the minimal possible loss is zero.
Natural choices, in case Θ ⊆ R, are the distance of t and θ (estimation error)

L(t, θ) = |t − θ|

and its square, the quadratic loss L(t, θ) = (t − θ)². The risk of an estimator T is then defined as
the expected loss

R(T, θ) = E_θ L(T(X), θ).

The risk R(T, θ) as a function of θ is called the risk function of the estimator T.
Note that in our present model (θ = p) we do not have to worry about existence of the expected
value (since the law B(1, p)^n has finite support). (For later generalizations, note that since L is
nonnegative, if its expectation is not finite then it must be +∞, which may then also be regarded
as the value of the risk.)
Thus the random nature of the loss is successfully dealt with by taking the expectation, but for
judging an estimator T(X) the fact remains that θ is unknown. Thus we still have a whole risk
function θ ↦ R(T, θ) as a criterion for performance. But it is desirable to express the quality of T
by just one number and then try to minimize it. There is a further choice involved in the method
of obtaining such a number; in the sequel we shall discuss several approaches.
Suppose that, rather than reducing the problem in this fashion, we try to minimize the whole risk
function simultaneously, i.e. try to find an estimator T* such that

R(T*, θ) = min_T R(T, θ) for all θ ∈ Θ.

Such an estimator would be called a uniformly best estimator. In general such an estimator will
not exist: for each θ0 ∈ Θ, consider the estimator

T_{θ0}(x) = θ0.

This estimator ignores the data and always selects θ0; then R(T_{θ0}, θ0) = 0, i.e. this estimator is
very good if θ0 is the true parameter. Thus if T* were uniformly best, it would have to compete with
every T_{θ0}, i.e. fulfill

R(T*, θ) = 0 for all θ ∈ Θ,

i.e. T* would always achieve 0 risk. If the risk adequately expresses a distance to the parameter, this
means that a sure decision is possible, which is not realistic for statistical problems and possible
only if the problem itself is degenerate or trivial. Thus (we argued informally) uniformly best
estimators do not exist in general.
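The argument can be illustrated numerically under quadratic loss, where R(T_{θ0}, p) = (θ0 − p)² and, by a standard computation, R(X̄n, p) = p(1 − p)/n in Model M1 (the example values below are our own):

```python
# Risk functions under quadratic loss in the Bernoulli model:
# the constant estimator T0 = p0 has risk (p0 - p)^2, the sample mean has
# risk p(1 - p)/n.  Neither dominates the other for all p.
def risk_constant(p: float, p0: float) -> float:
    return (p0 - p)**2

def risk_mean(p: float, n: int) -> float:
    return p * (1 - p) / n

n, p0 = 25, 0.5
# at p = p0 the constant estimator is unbeatable:
print(risk_constant(0.5, p0), risk_mean(0.5, n))   # 0.0 vs 0.01
# but away from p0 it is much worse:
print(risk_constant(0.1, p0), risk_mean(0.1, n))   # ~0.16 vs ~0.0036
```

This is exactly why a uniformly best estimator cannot exist: beating every constant estimator at its own parameter value would force zero risk everywhere.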
There are two ways out of this dilemma:
• reduce the problem to minimizing one characteristic, as argued above (i.e. the maximal risk
or an average risk over Θ);
• restrict the class of estimators so as to rule out unreasonable competitors like T_{θ0}, which
are likely to be very bad for most θ. Within a restricted class of estimators, a uniformly
best one may very well exist.
Consistency of estimators. At the least, an estimator should converge towards the true (unknown)
parameter to be estimated when the sample size increases. To emphasize the dependence of the
data vector on the sample size n, we write the statistical model as

L(X^{(n)}) ∈ {P_{θ,n} ; θ ∈ Θ}.

Recall that convergence in probability for a sequence of random variables is denoted by the symbol
→P.

Definition 2.1.4 A sequence Tn = Tn(X^{(n)}) of estimators (each based on a sample of size n) for
the parameter θ is called consistent if for all θ ∈ Θ

P_{θ,n}( |Tn(X^{(n)}) − θ| > ε ) → 0 as n → ∞, for all ε > 0,

or, in other notation,

Tn = Tn(X^{(n)}) →P θ as n → ∞, if L(X^{(n)}) = P_{θ,n}.
In Model M1, the sample mean defines a consistent sequence of estimators of the parameter p:

Tn(X1, . . . , Xn) = X̄n.

Proof. Since X1, . . . , Xn are i.i.d. Bernoulli B(1, p) random variables, we have EX1 = p, and the
result immediately follows from the law of large numbers.
The consistency requirement restricts the class of estimators to reasonable ones; consistency can be
seen as a minimal requirement. But there are still many mappings of the observation space into
the parameter space which define consistent sequences of estimators.
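A small simulation (our own sketch, with an arbitrary value p = 0.3 and a fixed seed) illustrates the consistency of X̄n in Model M1:

```python
import random

# Consistency sketch: the sample mean of i.i.d. Bernoulli(p) variables
# approaches p as the sample size grows (law of large numbers).
random.seed(0)
p = 0.3

def sample_mean(n: int) -> float:
    # simulate n Bernoulli(p) trials and return their average
    return sum(random.random() < p for _ in range(n)) / n

for n in (10, 100, 10000):
    print(n, abs(sample_mean(n) - p))
```

The absolute error typically shrinks like 1/√n, which foreshadows the rate discussion at the end of this chapter.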
(the latter equality being true in the case of quadratic loss). This quantity B(T) is called the integrated
risk or mixed risk of the estimator T. A Bayes estimator TB is then defined by the property of
minimizing the integrated risk:

B(TB) = inf_T B(T).   (2.6)
In our example involving the parameter p above, p takes continuous values in the interval (0, 1), but
to make the connection to the Bayes formula above, consider a model where the parameter takes
only finitely many values (Θ = {θ1, . . . , θk}). Consider the case that P is the joint distribution of
(X, U), where X is the data and U is a random variable which takes the k possible values of the
parameter (U ∈ Θ). For the events {X ∈ A} and Bi = {U = θi} we get

P(X ∈ A) = ∑_{i=1}^k P(X ∈ A | U = θi) P(U = θi).

Here gi = P(U = θi) can be construed as a prior probability that the parameter takes the value
θi, and the conditional probabilities P(X ∈ A | U = θi) can be construed as a
family of probability measures depending on θ. In other words, in the Bayesian approach
B(T) = ∑_{i=1}^k E_{pi}(T − pi)² gi.
The expectations E_{pi}(T − pi)² for a given pi can then be interpreted as conditional expectations
given U = θi, and the integrated risk B(T) above then is an unconditional expectation, namely
B(T) = E(T − p)², the expected squared loss (T − p)² with respect to the joint distribution
of the observations X and the random parameter p.
In the philosophical foundations of statistics, or in theories of how statistical decisions should be
made in the real world, the Bayesian approach has developed into an important separate school of
thought; those who believe that prior distributions on Θ should always be applied (and are always
available) are sometimes called Bayesians, and Bayesian statistics is the corresponding part of
Mathematical Statistics.
In Model M1, consider the following family of prior densities for the Bernoulli parameter p: for
α, β > 0,

g_{α,β}(p) = p^{α−1} (1 − p)^{β−1} / B(α, β),  p ∈ [0, 1],

where

B(α, β) = Γ(α)Γ(β) / Γ(α + β).   (2.7)

The corresponding distributions are called the Beta distributions; here B(α, β) stands for the
beta function defined by (2.7). Recall that

Γ(α) = ∫_0^∞ x^{α−1} exp(−x) dx.
Thus we consider a whole family (α, β > 0) of possible prior distributions for p, allowing a wide range
of choices for prior belief. The plot below shows three different densities from this family; let us
mention that the uniform density on [0, 1] is also in this class (for α = β = 1). We will discuss the
Beta family in more detail later, establishing also that B(α, β) is the correct normalization factor.
It will also become clear that Bayesian methods are very useful for proving non-Bayesian optimality
properties of estimators.
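The Beta densities are easy to evaluate through the gamma function; the sketch below (function names ours) also checks the normalization claim for one pair (α, β) by a midpoint Riemann sum:

```python
import math

# Beta prior density g_{a,b}(p) = p^(a-1) (1-p)^(b-1) / B(a,b), with the
# beta function expressed through math.gamma as in (2.7).
def beta_fn(a: float, b: float) -> float:
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_density(p: float, a: float, b: float) -> float:
    return p**(a - 1) * (1 - p)**(b - 1) / beta_fn(a, b)

# check the normalization for (a, b) = (2, 3) by a midpoint Riemann sum on [0, 1]
m = 100_000
integral = sum(beta_density((i + 0.5) / m, 2.0, 3.0) for i in range(m)) / m
print(f"integral ≈ {integral:.6f}")   # should be close to 1
```

For (α, β) = (2, 3) the density is the polynomial 12 p(1 − p)², so the numerical integral is essentially exact.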
Proposition 2.2.1 In Model M1, let g be an arbitrary prior density for the parameter p ∈ [0, 1].
The corresponding Bayes estimator of p is

TB(x) = ( ∫_0^1 p^{z(x)+1} (1 − p)^{n−z(x)} g(p) dp ) / ( ∫_0^1 p^{z(x)} (1 − p)^{n−z(x)} g(p) dp ).   (2.8)
Remark. (i) Note that the Bayes estimator depends on the sample x only via the statistic z(x),
or equivalently, via the sample mean X̄n(x) = n^{-1} z(x).
(ii) The function

g_x(p) = p^{z(x)} (1 − p)^{n−z(x)} g(p) / ∫_0^1 p^{z(x)} (1 − p)^{n−z(x)} g(p) dp

is a probability density on [0, 1], depending on the observed x. We see from (2.8) that TB(x) is the
expectation for that density:

TB(x) = ∫_0^1 p g_x(p) dp.
Proof. For any estimator T, with X = {0, 1}^n,

B(T) = ∑_{x∈X} ∫_0^1 (T(x) − p)² p^{z(x)} (1 − p)^{n−z(x)} g(p) dp.

To minimize B(T), the value T(x) should be chosen optimally; one could try to do this for every
term in the sum, given x. Thus we look for the minimum of

b_x(t) = ∫_0^1 (t − p)² p^{z(x)} (1 − p)^{n−z(x)} g(p) dp.
b_x(t) = c0 − 2c1 t + c2 t²

where

c2 = ∫_0^1 p^{z(x)} (1 − p)^{n−z(x)} g(p) dp,

c1 = ∫_0^1 p^{z(x)+1} (1 − p)^{n−z(x)} g(p) dp,

c0 = ∫_0^1 p^{z(x)+2} (1 − p)^{n−z(x)} g(p) dp.
Since

0 ≤ p^{z(x)} (1 − p)^{n−z(x)} ≤ 1,

all these integrals are finite, and c2 > 0. It follows that the unique minimum t0 of b_x(t) can be
obtained by setting the derivative to 0: the solution of −2c1 + 2c2 t = 0 is t0(x) = c1/c2, which is
exactly the ratio in (2.8), so that TB(x) = t0(x).
An estimator T is called admissible if, for any estimator S,

R(S, θ) ≤ R(T, θ) for all θ ∈ Θ   (2.9)

implies

R(S, θ) = R(T, θ) for all θ ∈ Θ.   (2.10)

Thus admissibility means that there can be no estimator S which is uniformly at least as good
((2.9) holds) and strictly better for at least one θ0: such an S would satisfy R(S, θ0) < R(T, θ0),
contradicting (2.10). Non-admissibility of T means that T can be improved by another estimator S.
Proposition 2.3.2 Suppose that in Model M1 the prior density g is such that g(p) > 0 for all
p ∈ [0, 1], with the exception of at most a finite number of points p. Then the Bayes estimator TB for this
prior density is admissible, for quadratic loss.
Proof. Suppose that TB is not admissible. Then there is an estimator S and a p0 ∈ [0, 1] with

R(S, p) ≤ R(TB, p) for all p ∈ [0, 1], and R(S, p0) < R(TB, p0).   (2.11)

Note that R(S, p) is continuous in p (continuous on p ∈ (0, 1), and right resp. left continuous at
the endpoints 0 and 1):

R(S, p) = ∑_{x∈X} (S(x) − p)² p^{z(x)} (1 − p)^{n−z(x)}

for z(x) = ∑_{i=1}^n xi, x ∈ X. Thus (2.11) implies that there must be a whole neighborhood of p0
within which S is strictly better: for some ε > 0,

R(S, p) < R(TB, p) for all p ∈ [0, 1] with |p − p0| < ε.   (2.12)
It follows that

    B(S) = \int_0^1 R(S, p) g(p)\, dp < \int_0^1 R(T_B, p) g(p)\, dp = B(T_B).
This contradicts the fact that TB is a Bayes estimator; thus TB must be admissible.
(Technical remark: for \alpha < 1, the function p^{\alpha-1} tends to \infty as p \to 0, so the second integral in
(2.14) is improper [a limit of \int_t^1 for t \searrow 0]; similarly in case of \beta < 1. Hence, strictly speaking,
one should argue in terms of \int_t^1 first and then take a limit. Note that these limits exist since the
function p^{\alpha-1} is integrable on (0, 1) for \alpha > 0:

    \int_t^1 p^{\alpha-1} (1-p)^{\beta-1}\, dp \le \int_t^1 p^{\alpha-1}\, dp = \left[ \alpha^{-1} p^{\alpha} \right]_t^1 = \alpha^{-1} (1 - t^{\alpha}). )
(for the last equality, we reversed the roles of \alpha and \beta in (2.15)). Setting now \tilde\alpha = \alpha + z(x),
\tilde\beta = \beta + n - z(x), we obtain from (2.13)

    T_B(x) = \frac{\int_0^1 p^{\tilde\alpha} (1-p)^{\tilde\beta - 1}\, dp}{\int_0^1 p^{\tilde\alpha - 1} (1-p)^{\tilde\beta - 1}\, dp} = \left( 1 + \frac{\tilde\beta}{\tilde\alpha} \right)^{-1} = \frac{\tilde\alpha}{\tilde\alpha + \tilde\beta} = \frac{\alpha + z(x)}{\alpha + \beta + n},

thus (:= means defining equality)

    T_{\alpha,\beta}(X) := T_B(X) = \frac{\bar X_n + \alpha/n}{1 + \alpha/n + \beta/n}.
We already know from (2.8) that the Bayes estimator is a function of the sample mean; the formula
above makes it explicit.
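A small numerical check may help here (a sketch, not part of the original text; the sample values n = 10, z = 7 and the prior parameters are arbitrary): evaluating the two integrals in (2.8) for a Beta prior on a grid reproduces the closed form (\alpha + z(x))/(\alpha + \beta + n).

```python
import numpy as np

def bayes_estimate_numeric(z, n, alpha, beta, grid=200_000):
    # Ratio of the two integrals in (2.8) for a Beta(alpha, beta) prior.
    # The normalizing constant B(alpha, beta) cancels between numerator
    # and denominator, so the unnormalized prior weight suffices.
    p = (np.arange(grid) + 0.5) / grid          # midpoints of a grid on (0, 1)
    w = p ** (z + alpha - 1) * (1 - p) ** (n - z + beta - 1)
    return float(np.sum(p * w) / np.sum(w))

# hypothetical data: n = 10 Bernoulli trials with z = 7 successes
n, z, alpha, beta = 10, 7, 2.0, 3.0
num = bayes_estimate_numeric(z, n, alpha, beta)
cf = (alpha + z) / (alpha + beta + n)           # closed form derived above
```

The two values agree up to grid error, illustrating that (2.8) and the Beta closed form describe the same estimator.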
Let us discuss some limiting cases.
Limiting case A) Sample size n is fixed, \alpha \to 0, \beta \to 0. In the limit \bar X_n is obtained. However,
the family of densities g_{\alpha,\beta} does not converge to a density for \alpha \to 0, \beta \to 0, since the function

    g(p) = p^{\alpha - 1} (1-p)^{\beta - 1}

is not integrable on (0, 1) in the limit. This means that \bar X_n is not a Bayes estimator for any of the densities g_{\alpha,\beta}.
Limiting case B) \alpha and \beta are fixed, sample size n \to \infty. In this case we have for p \in (0, 1) and any \gamma < 1

    n^{\gamma} \left( T_{\alpha,\beta}(X) - \bar X_n \right) \to_P 0, \quad n \to \infty,   (2.16)

and since \bar X_n \to_P p, it is obvious that the above quantity is o_p(n^{-\gamma}). At the same time, \bar X_n converges
in probability to p, but more slowly: we have

    n^{\gamma} (\bar X_n - p) \to_P 0,   (2.17)

which holds for 0 < \gamma < 1/2 but not for \gamma = 1/2. Rather, for \gamma = 1/2 we have by the central
limit theorem

    n^{1/2} (\bar X_n - p) \to_d N(0, p(1-p)),

so that (\bar X_n - p) is not o_p(n^{-1/2}).
The interpretation of (2.16), (2.17) is that as n \to \infty, the influence of the a priori information
diminishes. The Bayes estimators all become close to the sample mean (and to each other) at rate
n^{-\gamma}, \gamma < 1, whereas they converge to the true p only at the slower rate n^{-\gamma}, \gamma < 1/2.
(for the last equality, note that E(Y - EY + a)^2 = E(Y - EY)^2 + a^2 for nonrandom a). Set
q := 1 - p and note that n\bar X_n has the binomial distribution B(n, p), so that

    E_p \left( n\bar X_n - np \right)^2 = \mathrm{Var}(n\bar X_n) = npq.

The above reasoning is valid for all p \in [0, 1]; it means that the risk R(T_{\alpha,\beta}, p) for this special
choice of \alpha, \beta does not depend upon p. In addition we know that T_{\alpha,\beta} is admissible. This implies
that T_{\alpha,\beta} is a minimax estimator; let us define that important notion.
Note that the maximum in (2.20) exists since R(T, p) is continuous in p for every estimator. The
minimax approach is similar to the Bayes approach, in that one characteristic of the risk function
R(T, p), p \in [0, 1], is selected as performance criterion for an estimator. In this case the
characteristic is the worst-case risk.
Theorem 2.5.2 In model M1, the Bayes estimator T_{\alpha,\beta} for \alpha = \beta = n^{1/2}/2 is a minimax
estimator for quadratic loss.
Proof. Let T be an arbitrary estimator of p. Since T_{\alpha,\beta} is admissible according to Proposition
2.3.2, there must be a p_0 \in [0, 1] such that

    R(T, p_0) \ge R(T_{\alpha,\beta}, p_0) = m_n.   (2.21)

If there were no such p_0, then we would have

    R(T, p) < R(T_{\alpha,\beta}, p)

for all p, which contradicts admissibility of T_{\alpha,\beta}. Now (2.21) implies

    M(T) \ge m_n = M(T_{\alpha,\beta}).
It is clearly visible that the problem region for the sample mean is the area around p = 1/2; the
minimax estimator is better in the center, at the expense of the tail regions, and thus achieves a
smaller overall risk.
It is instructive to look at the form of the minimax estimator T_M itself. The function

    f_M(x) = \frac{x + n^{-1/2}/2}{1 + n^{-1/2}}

is plotted below, also for n = 30.
obtain f_M(1) = 1 - f_M(0). This means that the values of the sample mean are moved towards
the value 1/2, which is the one where the risk would be maximal. This can be understood as a kind
of prudent behaviour of the minimax estimator, which tends to be closer to the value 1/2 since it
is more damaging to make an error if this value were true. We can also write

    \bar X_n = \frac{1}{2} + (\bar X_n - 1/2),

    T_M = \frac{1}{2} + \frac{\bar X_n - 1/2}{1 + n^{-1/2}},

where it is seen that the minimax estimator takes the distance of \bar X_n to 1/2 and shrinks it by a
factor (1 + n^{-1/2})^{-1} which is < 1.
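The risk comparison behind this picture can be reproduced exactly (a sketch, not part of the original text; the grid of p values and n = 30 are arbitrary choices), since in model M1 the quadratic risk of any estimator of the form T(\bar X_n) is a finite sum over the binomial distribution of z(x).

```python
from math import comb

def risk(estimator, p, n):
    # Exact quadratic risk E_p (T - p)^2 in model M1: a finite sum over
    # the binomial distribution of z(x) = n * (sample mean).
    return sum(
        comb(n, z) * p ** z * (1 - p) ** (n - z) * (estimator(z / n) - p) ** 2
        for z in range(n + 1)
    )

n = 30
t_minimax = lambda xbar: (xbar + n ** -0.5 / 2) / (1 + n ** -0.5)
t_mean = lambda xbar: xbar

ps = [k / 20 for k in range(21)]
risks_minimax = [risk(t_minimax, p, n) for p in ps]
risks_mean = [risk(t_mean, p, n) for p in ps]
# constant risk of the minimax estimator, obtained from the
# bias-variance decomposition: 1 / (4 (sqrt(n) + 1)^2)
const = 1 / (4 * (n ** 0.5 + 1) ** 2)
```

The computed risk of T_M is constant in p, while the sample mean has smaller risk near the endpoints and larger risk around p = 1/2, matching the description in the text.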
Chapter 3
MAXIMUM LIKELIHOOD ESTIMATORS
In the general statistical model where X is an observed random variable with values in X, X is a
countable set (i.e. X is a discrete r.v.) and \{P_\theta; \theta \in \Theta\} is a family of probability laws on X:

    L(X) \in \{P_\theta; \theta \in \Theta\},

consider the probability function for given \theta, i.e. P_\theta(X = x). For each x \in X, the function

    L_x(\theta) = P_\theta(X = x)

is called the likelihood function of \theta given x. The name reflects the heuristic principle that when
observations are realized, i.e. X took the value x, the most likely parameter values \theta are those
where P_\theta(X = x) is maximal. This does not mean that L_x(\theta) gives a probability distribution on
the parameter space \Theta; the likelihood principle has its own independent rationale, on a purely
heuristic basis.
Under special conditions however the likelihood function can be interpreted as a probability func-
tion. Consider the case that \Theta = \{\theta_1, \ldots, \theta_k\} is a finite set, and consider a prior distribution on \Theta
which is uniform:

    P(U = \theta_i) = k^{-1}, \quad i = 1, \ldots, k.

Understanding P_\theta(X = x) as a conditional distribution given \theta, i.e. setting (as always in the
Bayesian approach)

    P(X = x \mid U = \theta) = P_\theta(X = x),

we immediately obtain a posterior distribution of \theta, i.e. the conditional distribution of U given
X = x:

    P(U = \theta \mid X = x) = \frac{P(U = \theta, X = x)}{P(X = x)}   (3.1)
    = \frac{P_\theta(X = x) P(U = \theta)}{P(X = x)} = \frac{P_\theta(X = x)}{k\, P(X = x)}
    = L_x(\theta) \left( k\, P(X = x) \right)^{-1}.   (3.2)

Here the factor (k\, P(X = x))^{-1} does not depend on \theta, so that in this case, the posterior probability
function of \theta is proportional to the likelihood function L_x(\theta). Recall that the marginal probability
function of X is

    P(X = x) = \sum_{i=1}^k P_{\theta_i}(X = x)\, P(U = \theta_i).
Thus in this special case, the likelihood principle can be derived from the Bayesian approach, for a
noninformative prior distribution (i.e. the uniform distribution on \Theta). However in cases where
there is no natural uniform distribution on \Theta, such as for \Theta = R or \Theta = Z_+ (the nonnegative
integers), such reasoning is not straightforward. (A limiting argument for a sequence of prior
distributions is often possible.)
A maximum likelihood estimator (MLE) of \theta is an estimator T(x) = T_{ML}(x) such that

    L_x(T_{ML}(x)) = \max_{\theta \in \Theta} L_x(\theta),

i.e. for each given x, the estimator is a value of \theta which maximizes the likelihood.
In the case of model M1, the probability function for given p = \theta is

    P_p(X = x) = p^{z(x)} (1-p)^{n-z(x)}

for x \in X. In this case the parameter space is \Theta = [0, 1], on which there is a natural uniform
distribution (the Beta density for \alpha = \beta = 1). It can be shown that also in this case, something
analogous to (3.2) holds, i.e. the likelihood function is proportional to the density of the posterior
distribution. Without proving this statement, let us directly compute the maximum likelihood
estimate.
Assume first that x is such that z(x) = \sum_{i=1}^n x_i \in (0, n). Then the likelihood function has L_x(0) =
L_x(1) = 0 and is positive and continuously differentiable on the open interval p \in (0, 1). Thus also
the logarithmic likelihood function

    l_x(p) = z(x) \log p + (n - z(x)) \log(1 - p)

is continuously differentiable, and since log is a monotone function, local extrema of the likelihood
function in (0, 1) coincide with local extrema of the log-likelihood in p \in (0, 1). We have

    l_x'(p) = \frac{z(x)}{p} - \frac{n - z(x)}{1 - p}.

We look for zeros of this function in p \in (0, 1); the local extrema are among these. We obtain
Proposition 3.0.3 In model M1, the sample mean \bar X_n is the unique maximum likelihood estimator
(MLE) of the parameter p \in [0, 1]:

    T_{ML}(X) = \bar X_n.
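A quick numerical illustration of the proposition (a sketch, not part of the original text; the sample is made up): the log-likelihood l_x(p) = z(x) log p + (n - z(x)) log(1 - p), evaluated on a fine grid of p values, is maximized at the sample mean.

```python
import math

x = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # hypothetical Bernoulli sample
n, z = len(x), sum(x)

def loglik(p):
    # log-likelihood in model M1: z log p + (n - z) log(1 - p)
    return z * math.log(p) + (n - z) * math.log(1 - p)

grid = [k / 1000 for k in range(1, 1000)]   # p in (0, 1)
p_hat = max(grid, key=loglik)               # grid maximizer; here z/n = 0.7
```

By strict concavity of the log-likelihood, the grid maximizer coincides with the sample mean whenever z/n lies on the grid.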
We now turn to the continuous likelihood principle. Let X be a random variable with values
in R^k such that L(X) \in \{P_\theta; \theta \in \Theta\}, and each law P_\theta has a density p_\theta(x) on R^k. For each x \in R^k,
the function

    L_x(\theta) = p_\theta(x)

is called the likelihood function of \theta given x. A maximum likelihood estimator (MLE) of \theta is an
estimator T(x) = T_{ML}(x) such that L_x(T_{ML}(x)) = \max_{\theta \in \Theta} L_x(\theta), i.e. for each given x, the
estimator is a value of \theta which maximizes the likelihood.
Let us consider an example in which all densities are Gaussian.
Model M2 Observed are n independent and identically distributed random variables X_1, \ldots, X_n, each having
law N(\mu, \sigma^2), where \sigma^2 > 0 is given (known) and \mu \in R is unknown.
This is also called the Gaussian location model (or Gaussian shift model). Consider the case of
sample size n = 1. We can represent X_i as

    X_i = \mu + \xi_i

where the \xi_i are i.i.d. centered normal: L(\xi_i) = N(0, \sigma^2). The parameter is \theta = \mu and the parameter space
is \Theta = R.
Proposition 3.0.4 In the Gaussian location model M2, the sample mean \bar X_n is the unique maxi-
mum likelihood estimator of the expectation parameter \mu \in R:

    T_{ML}(X) = \bar X_n.
Proof. We have

    L_x(\mu) = \prod_{i=1}^n \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)
    = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right).

Note that

    \frac{d}{d\mu} l_x(\mu) = \frac{1}{2\sigma^2} \sum_{i=1}^n 2(x_i - \mu) = \frac{1}{\sigma^2} \left( \sum_{i=1}^n x_i - n\mu \right) = 0
if and only if

    \mu = \frac{\sum_{i=1}^n x_i}{n} = \bar x_n.

Thus l_x(\mu), which is continuously differentiable on R, has a unique local extremum at \mu = \bar x_n. Since
l_x(\mu) \to -\infty for |\mu| \to \infty, this must be a maximum.
We can now ask what happens if the variance \sigma^2 is also unknown. In that connection we introduce
the Gaussian location-scale model.
Model M3 Observed are n independent and identically distributed random variables X_1, \ldots, X_n, each having
law N(\mu, \sigma^2), where \mu \in R and \sigma^2 > 0 are both unknown.
In addition to the sample mean \bar X_n = n^{-1} \sum_{i=1}^n X_i, consider the statistic

    S_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X_n)^2.

This statistic is called the sample variance. The empirical second (central) moment (e.s.m.) is

    \tilde S_n^2 = \frac{n-1}{n} S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X_n)^2.

At first glance, it would seem more natural to call the expression \tilde S_n^2 the sample variance, since it
is the sample analog of the variance of the random variable X_1:

    \mathrm{Var}(X_1) = E(X_1 - EX_1)^2.

However it is customary in statistics to call S_n^2 the sample variance; the reason is unbiasedness
for \sigma^2, which will be discussed later. Note that we need a sample size n \ge 2 for S_n^2 and \tilde S_n^2 to be
nonzero.
Proposition 3.0.5 In the Gaussian location-scale model M3, for a sample size n \ge 2, if \tilde S_n^2 >
0 then the sample mean and e.s.m. (\bar X_n, \tilde S_n^2) are the unique maximum likelihood estimators of the
parameter \theta = (\mu, \sigma^2) \in \Theta = R \times (0, \infty):

    T_{ML}(X) = (\bar X_n, \tilde S_n^2).
Proof. We write \bar x_n, \tilde s_n^2 for sample mean and e.s.m. when x_1, \ldots, x_n are realized data. Note that

    \tilde s_n^2 = n^{-1} \sum_{i=1}^n x_i^2 - (\bar x_n)^2.

Hence

    \sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n x_i^2 - 2 n \mu \bar x_n + n \mu^2
    = \sum_{i=1}^n x_i^2 - n \bar x_n^2 + n (\bar x_n - \mu)^2 = n \tilde s_n^2 + n (\bar x_n - \mu)^2.
where

    \phi(x) = \frac{1}{(2\pi)^{1/2}} \exp\left( -\frac{x^2}{2} \right)

is the standard normal density. We see that the first factor is a normal density in \bar x_n and the second
factor does not depend on \mu. To find MLEs of \mu and \sigma^2 we first maximize for fixed \sigma^2 over all
possible \mu \in R. The first factor is the likelihood function in model M2 for a sample size n = 1,
variance n^{-1}\sigma^2 and an observed value x_1 = \bar x_n. This gives the MLE \mu = \bar x_n. We can insert this
value into (3.3); we now have to maximize

    L_x(\bar x_n, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{\tilde s_n^2}{2 \sigma^2 n^{-1}} \right)

over \sigma^2 > 0. For notational convenience, we set \theta = \sigma^2; equivalently, one may minimize

    l_x(\theta) = -\log L_x(\bar x_n, \theta) = \frac{n}{2} \log(2\pi\theta) + \frac{n \tilde s_n^2}{2\theta}.

Note that if \tilde s_n^2 > 0, for \theta \to 0 we have l_x(\theta) \to \infty and for \theta \to \infty also l_x(\theta) \to \infty, so that a
minimum exists and is a zero of the derivative of l_x. The event \tilde S_n^2 > 0 has probability 1 since
otherwise x_i = \bar x_n, i = 1, \ldots, n, i.e. all x_i are equal, which clearly has probability 0 for independent
continuous random variables X_i. We obtain

    l_x'(\theta) = \frac{n}{2\theta} - \frac{n \tilde s_n^2}{2\theta^2} = 0,
    \theta = \tilde s_n^2
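As a numerical sanity check of the proposition (a sketch, not part of the original text; the data and the perturbation grid are arbitrary choices), the negative log-likelihood of model M3 is smallest at (\bar x_n, \tilde s_n^2) among nearby parameter values:

```python
import math

x = [1.2, -0.4, 0.8, 2.1, 0.3, -1.0, 1.7, 0.5]   # hypothetical sample
n = len(x)
xbar = sum(x) / n
s2 = sum((xi - xbar) ** 2 for xi in x) / n        # e.s.m., the MLE of sigma^2

def nll(mu, theta):
    # negative log-likelihood: (n/2) log(2 pi theta) + sum (x_i - mu)^2 / (2 theta)
    return 0.5 * n * math.log(2 * math.pi * theta) + sum(
        (xi - mu) ** 2 for xi in x
    ) / (2 * theta)

best = nll(xbar, s2)
perturbed = [
    nll(xbar + dm, s2 * ds)
    for dm in (-0.5, -0.1, 0.0, 0.1, 0.5)
    for ds in (0.5, 0.9, 1.0, 1.1, 2.0)
    if (dm, ds) != (0.0, 1.0)
]
```

Since (\bar x_n, \tilde s_n^2) is the unique global minimizer, every perturbed value is strictly larger.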
for given \mu \in R, \lambda > 0. We shall assume \lambda = 1 here; clearly the above is a density for any \mu since

    \int_0^\infty \exp(-x)\, dx = 1.

Clearly DE(\mu, \lambda) has finite moments of any order, and by symmetry reasons \mu is the expectation.
We introduce the double exponential location model.
Model M4 Observed are n independent and identically distributed random variables X_1, \ldots, X_n, each having
law DE(\mu, 1), where \mu \in R is unknown.
We will show that the MLE in this case is the sample median. For any vector (x_1, \ldots, x_n), let
x_{(1)} \le x_{(2)} \le \ldots \le x_{(n)} be the vector of ordered values; this is always uniquely defined. For
n i.i.d. random variables X = (X_1, \ldots, X_n), define the order statistics to be the components of
the vector (X_{(1)}, \ldots, X_{(n)}). Recall that a statistic was any function of the data; thus for given i,
the i-th order statistic is a well defined data function. In particular X_{(n)} = \max_{i=1,\ldots,n} X_i and
X_{(1)} = \min_{i=1,\ldots,n} X_i. Define the sample median as

    \mathrm{med}(X) = \begin{cases} X_{((n+1)/2)} & \text{if } n \text{ is odd} \\ \frac{1}{2}\left( X_{(n/2)} + X_{(n/2+1)} \right) & \text{if } n \text{ is even.} \end{cases}

In other words, the sample median is the central order statistic if n is odd, and the center of
the two central order statistics if n is even.
Proposition 3.0.6 In the double exponential location model M4, the sample median \mathrm{med}(X) is a
maximum likelihood estimator of the location parameter \theta = \mu \in R: T_{ML}(X) = \mathrm{med}(X).
Proof. The negative log-likelihood is

    l_x(\mu) = n \log 2 + \sum_{i=1}^n |x_i - \mu|.

For given x this is a piecewise linear continuous function in \mu; for \mu > x_{(n)} it is a linear function
in \mu tending to \infty as \mu \to \infty, and for \mu < x_{(1)} it is also a linear function in \mu tending to \infty as
\mu \to -\infty. Thus the minimum must be attained in the range of the sample, i.e. in the interval
[x_{(1)}, x_{(n)}]. Assume that x_{(1)} < x_{(2)} < \ldots < x_{(n)}, i.e. no two values of the sample are equal (there are
no ties). That event has probability one since the X_i are continuous random variables. Inside an
interval (x_{(i)}, x_{(i+1)}), 1 \le i \le n - 1, the derivative of l_x(\mu) is

    l_x'(\mu) = \sum_{j=1}^i 1 + \sum_{j=i+1}^n (-1) = i - (n - i) = 2i - n.

Consider first the case of even n. Then l_x'(\mu) is negative for i < n/2, positive for i > n/2 and 0
for i = n/2. Thus the minimum in \mu is attained by any value \mu \in [x_{(n/2)}, x_{(n/2+1)}], in particular
by the center of that interval, which is the sample median. If n is odd then l_x'(\mu) is negative for
i < n/2 and positive for i > n/2. Since the function l_x(\mu) is continuous, the minimum is attained
at the beginning point of the first interval where l_x'(\mu) is positive, which is X_{((n+1)/2)}.
Since the sample median minimizes \sum_{i=1}^n |X_i - \mu|, it may be called a least absolute deviation
estimator. In contrast, the sample mean minimizes the sum of squares \sum_{i=1}^n (X_i - \mu)^2, i.e. it is a least
squares estimator. Note that the sample median remains the same when X_{(1)} \to -\infty, whereas \bar X_n
also tends to -\infty in that case. For that reason, the sample median is often applied independently
of a maximum likelihood justification, which holds when the data are double exponential.
In analogy to the sample median, the population median is defined as a "half point" of the
distribution of a random variable X representing the population. A value m is a median of a r.v.
X if simultaneously P(X \ge m) \ge 1/2 and P(X \le m) \ge 1/2 hold. For continuous distributions, we
always have P(X = m) = 0, and therefore m is a solution of P(X > m) = 1 - P(X < m) = 1/2,
which may not be unique.
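The least absolute deviation characterization can be checked numerically (a sketch, not part of the original text; the simulated sample and the candidate grid are arbitrary): on a grid of candidate values of \mu, the criterion \sum_i |x_i - \mu| is minimized at the sample median.

```python
import random

random.seed(0)
x = [random.uniform(-5, 5) for _ in range(11)]   # n odd: the median is an order statistic

def lad(mu):
    # criterion minimized by the double exponential MLE:
    # the sum of absolute deviations
    return sum(abs(xi - mu) for xi in x)

med = sorted(x)[len(x) // 2]                     # sample median for odd n
candidates = [med + d / 100 for d in range(-300, 301)]
mu_hat = min(candidates, key=lad)
```

For odd n the criterion is strictly increasing on both sides of the median, so the grid minimizer is the median itself.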
Chapter 4
UNBIASED ESTIMATORS
Consider again the general parametric estimation problem: let X be a random variable with values
in X and L(X) be the distribution (or the law) of X. Assume that L(X) is known up to a parameter \theta from
a parameter space \Theta \subseteq R^k:

    L(X) \in \{P_\theta; \theta \in \Theta\}.

The problem is to estimate a real valued function g(\theta) based on a realization of X. Up to now we
primarily considered the case where \Theta is an interval in R and g(\theta) = \theta (i.e. we are interested in
estimation of \theta itself), but the Gaussian location-scale model M3 was an instance of a model with
a two-dimensional parameter \theta = (\mu, \sigma^2). Here we might set g(\theta) = \mu.
Consider an estimator T of g(\theta) such that E_\theta T exists. In our i.i.d. Bernoulli model M1, that is
true for every estimator.
In model M1 we have

    E_p \bar X_n = p \quad \text{for all } p \in [0, 1],

i.e. the sample mean is an unbiased estimator for the parameter p. Similarly, in the Gaussian
location and location-scale models, the sample mean is unbiased for \mu = E X_1.
Unbiasedness is sometimes considered as a value in itself, i.e. a desirable property for an estimator,
independently of risk optimality. Recall that the risk function (for quadratic loss) for estimation
of a real valued \theta was defined as

    R(T, \theta) = E_\theta (T(X) - \theta)^2
    = E_\theta \left( T(X) - E_\theta T(X) \right)^2 + \left( E_\theta T(X) - \theta \right)^2.   (4.2)

The last line is called the bias-variance decomposition of the quadratic risk of T; it holds for
any T with finite risk at \theta (or equivalently with \mathrm{Var}_\theta T(X) < \infty); T need not be unbiased. If T
is unbiased then the second term in (4.2), i.e. the squared bias, vanishes; thus for unbiased T the
quadratic risk is the variance.
We saw that in model M1 the estimator

    T_M = \frac{\bar X_n + n^{-1/2}/2}{1 + n^{-1/2}}

is minimax with respect to quadratic risk R(T, \theta), and it is clearly biased. The unbiased estimator
\bar X_n performs less well in the minimax sense. It is thus a matter of choice how to judge the
performance: one may strictly adhere to the quadratic risk as a criterion, leaving the bias problem
aside, or impose unbiasedness as an a priori requirement for good estimators.
If the latter point of view is taken, then within the restricted class of unbiased estimators it is often
possible to find uniformly best elements (optimal for all values of \theta), as we shall see now.
Model Mf The sample space X for the observed random variable X is finite, and L(X) \in
\{P_\theta, \theta \in \Theta\}, where \Theta is an open (possibly infinite) interval in R.
The case \Theta = R is included. Note that model M1 is a special case, if the open interval \Theta = (0, 1)
is taken as parameter space. With this model we associate a family of probability functions p_\theta(x),
x \in X, \theta \in \Theta. Let T be an unbiased estimator of the parameter:

    E_\theta T(X) = \theta, \quad \theta \in \Theta.   (4.3)

Since X is finite, the expectation always exists for all \theta \in \Theta. Let us add a differentiability
assumption on the dependence of p_\theta(x) on \theta,
where p_\theta'(x) is the derivative with respect to \theta. Differentiating (4.3) then gives

    \sum_{x \in X} T(x)\, p_\theta'(x) = 1.   (4.4)

The first derivative of a probability function has an interesting
property. We can also differentiate the equality

    \sum_{x \in X} p_\theta(x) = 1,

obtaining

    \sum_{x \in X} p_\theta'(x) = 0.   (4.5)
This means that in (4.4) we can replace T(x) by T(x) + c, where c is any constant; indeed (4.5)
implies that the right side of (4.4) is still 1. Choosing c = -\theta, we obtain

    \sum_{x \in X} (T(x) - \theta)\, p_\theta'(x) = 1.   (4.6)

Note that if p_\theta(x) = 0 at \theta = \theta_0 then p_{\theta_0}'(x) must also be 0: since p_\theta(x) is nonnegative and
differentiable, it has a local minimum at \theta_0 and hence p_{\theta_0}'(x) = 0. This fact and (4.6) imply

    1 = \sum_{x \in X} (T(x) - \theta)\, l_\theta(x)\, p_\theta(x) = E_\theta (T(X) - \theta)\, l_\theta(X),   (4.7)

where l_\theta(x) = p_\theta'(x)/p_\theta(x) is the score function.
Let us now apply the Cauchy-Schwarz inequality: for any two random variables Z_1, Z_2 on a
common probability space

    |E Z_1 Z_2|^2 \le E Z_1^2 \cdot E Z_2^2,

provided the right side exists, i.e. E Z_i^2 < \infty, i = 1, 2. Squaring both sides of (4.7), we obtain

    1 \le E_\theta (T(X) - \theta)^2 \cdot E_\theta\, l_\theta^2(X).

The quantity

    I_F(\theta) = E_\theta\, l_\theta^2(X)   (4.8)

is called the Fisher information of the parametric family p_\theta, \theta \in \Theta, of probability functions, at the
point \theta.
The form (4.10) involving the logarithm of p_\theta is convenient in many cases for computation of the
Fisher information.
Note also that for the score function we have, as a consequence of (4.5),

    E_\theta\, l_\theta(X) = 0.   (4.11)

The Cramer-Rao bound (4.9) gives a benchmark against which to measure the performance of any
unbiased estimator. An unbiased estimator attaining the Cramer-Rao bound at \theta is called a best
unbiased estimator (or uniformly best if that is true for all \theta \in \Theta; "uniformly" is usually omitted).
Another terminology is uniformly minimum variance unbiased estimator (UMVUE).
Example 4.1.2 Consider the case of Model M1 for sample size n = 1, i.e. we observe one Bernoulli
r.v. with law B(1, p), p \in (0, 1). The probability function is, for x \in \{0, 1\},

    q_p(x) = (1-p)^{1-x} p^x.

We check again (4.11) in this case from E_p(X - p) = 0. The Fisher information is thus

    I_F(p) = \frac{1}{p(1-p)}.

Thus p(1-p) is the Cramer-Rao bound. It follows that X is a best unbiased estimator for all
p \in (0, 1), i.e. uniformly minimum variance unbiased.
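The computation in this example is small enough to verify mechanically (a sketch, not part of the original text): the score of q_p is l_p(x) = (x - p)/(p(1 - p)), and its second moment over x in {0, 1} equals 1/(p(1 - p)).

```python
def fisher_info_bernoulli(p):
    # I_F(p) = E_p l_p(X)^2, a finite expectation over x in {0, 1};
    # the score of q_p(x) = (1 - p)^(1 - x) p^x is l_p(x) = (x - p) / (p(1 - p))
    score = lambda x: (x - p) / (p * (1 - p))
    return (1 - p) * score(0) ** 2 + p * score(1) ** 2

info = {p: fisher_info_bernoulli(p) for p in (0.1, 0.3, 0.5, 0.9)}
```

Since Var_p X = p(1 - p) = 1/I_F(p), the estimator X attains the Cramer-Rao bound, as stated above.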
The next theorem specializes the situation to the case of independent and identically distributed
data.
Theorem 4.1.3 Suppose that X = (X_1, \ldots, X_n) are independent and identically distributed, where
the statistical model for X_1 is a model of type Mf, with probability function q_\theta, satisfying assumption
D1. Assume that the Fisher information for q_\theta satisfies I_F(\theta) > 0, \theta \in \Theta. Then for every unbiased
estimator T of \theta

    \mathrm{Var}_\theta T(X) \ge n^{-1} (I_F(\theta))^{-1}, \quad \theta \in \Theta,   (4.12)

where I_F(\theta) is the Fisher information for q_\theta.
Proof. The model for the vector X = (X_1, \ldots, X_n) is also of type Mf, and the probability function
is p_\theta(x) = \prod_{i=1}^n q_\theta(x_i). It suffices to find the Fisher information of p_\theta, call it I_{F,n}(\theta). We have

    I_{F,n}(\theta) = E_\theta \left( \frac{d}{d\theta} \log p_\theta(X) \right)^2 = E_\theta \left( \sum_{i=1}^n \frac{d}{d\theta} \log q_\theta(X_i) \right)^2.
Expanding the square, we note for the mixed terms (i \ne j), when l_\theta is the score function for q_\theta,
that E_\theta\, l_\theta(X_i) l_\theta(X_j) = E_\theta\, l_\theta(X_i) \cdot E_\theta\, l_\theta(X_j) = 0 by independence and (4.11); hence

    I_{F,n}(\theta) = n\, I_F(\theta).   (4.13)

Now Theorem 4.1.1 is valid for the whole model for X = (X_1, \ldots, X_n) and implies (4.12).
Interpreting the Fisher information. To draw an analogy to concepts from physics or me-
chanics, let t be a time parameter and suppose x(t) = (x_i(t))_{i=1,2,3} describes a movement in
three dimensional space R^3 (a path in space, as a function of time). The velocity v(t_0) at time t_0
is the length of the gradient, i.e.

    v(t_0) = |x'(t_0)| = \left( \sum_{i=1}^3 (x_i'(t_0))^2 \right)^{1/2}.   (4.14)
Model Md The sample space X for the observed random variable X is countable, and L(X) \in
\{P_\theta, \theta \in \Theta\}, where \Theta is an open (possibly infinite) interval in R.
The proof of the Cramer-Rao bound is very similar here, only we need to be sure that differentiation
is possible under the infinite sum sign. If X = \{x_1, x_2, \ldots\} then we need to differentiate both sides
of the unbiasedness equation

    \sum_{i=1}^\infty T(x_i)\, p_\theta(x_i) = \theta.   (4.17)

If \theta and \theta' = \theta + \delta are two close values then we would take the ratio of differences on both sides,

    \sum_{i=1}^\infty T(x_i) \left( p_{\theta+\delta}(x_i) - p_\theta(x_i) \right) \delta^{-1} = 1,   (4.18)

and compute the derivative by letting \delta \to 0. Both sides are differentiable in \theta; here, for every x_i,
the difference quotient \delta^{-1}(p_{\theta+\delta}(x_i) - p_\theta(x_i)) tends to p_\theta'(x_i).
Condition D2 (i) The probability function p_\theta(x) is positive and differentiable in \theta, at every x \in X
and every \theta \in \Theta.
(ii) For every \theta \in \Theta, there exist an \epsilon = \epsilon_\theta > 0 and a function b_\theta(x) satisfying E_\theta\, b_\theta^2(X) < \infty
and

    \left| \frac{p_\theta(x) - p_{\theta+\delta}(x)}{\delta\, p_\theta(x)} \right| \le b_\theta(x) \quad \text{for all } |\delta| \le \epsilon \text{ and all } x \in X.
Using that condition, we find for the left hand side of (4.18)

    \sum_{i=1}^\infty T(x_i) \left( p_{\theta+\delta}(x_i) - p_\theta(x_i) \right) \delta^{-1} = \sum_{i=1}^\infty T(x_i)\, \frac{p_{\theta+\delta}(x_i) - p_\theta(x_i)}{\delta\, p_\theta(x_i)}\, p_\theta(x_i)
    = \sum_{i=1}^\infty r_\delta(x_i)\, p_\theta(x_i),

where r_\delta(x) := T(x)\, \delta^{-1}(p_{\theta+\delta}(x) - p_\theta(x)) / p_\theta(x). This function
converges to a limit function r_0(x) = T(x) l_\theta(x) pointwise (i.e. for every x \in X). We would like to
show

    \sum_{i=1}^\infty r_\delta(x_i)\, p_\theta(x_i) \to \sum_{i=1}^\infty r_0(x_i)\, p_\theta(x_i) = E_\theta\, T(X) l_\theta(X),   (4.19)

since in that case we could infer that E_\theta\, T(X) l_\theta(X) = 1 (differentiate both sides of (4.17) under
the sum sign). By a result from real analysis, the Lebesgue dominated convergence theorem
(see Appendix), it suffices to show that there is a function r(x) \ge 0 and a \delta_0 > 0 such that
|r_\delta(x)| \le r(x) for |\delta| \le \delta_0 and \sum_{i=1}^\infty r(x_i) p_\theta(x_i) < \infty; such an r exists
according to condition D2 and the finiteness of E_\theta\, T^2(X), which is natural to assume for the
estimator T(x). Thus we have established

    E_\theta\, T(X) l_\theta(X) = 1.
Theorem 4.2.1 (Cramer-Rao bound, discrete case) In model Md, assume that the Fisher
information exists and is positive:

    0 < I_F(\theta) = E_\theta \left( \frac{p_\theta'(X)}{p_\theta(X)} \right)^2 < \infty

for all \theta \in \Theta, and also that condition D2 holds. Then for every unbiased estimator T of \theta with
finite variance

    \mathrm{Var}_\theta T(X) \ge (I_F(\theta))^{-1}, \quad \theta \in \Theta.   (4.20)
Remark 4.2.2 Clearly an analog of Theorem 4.1.3 holds; we do not state it explicitly. It is evident
that (4.13) still holds. One would have to verify that it suffices to impose Condition D2 only on
the law of X_1, but we omit this argument.
Example. Let X have Poisson law Po(\theta), where \theta > 0 is unknown. Let us check condition D2
(here \Theta = (0, \infty)). We have for x_k = k, k = 0, 1, \ldots

    p_\theta(x_k) = \frac{\theta^k}{k!} \exp(-\theta).

This is continuously differentiable in \theta, for every k \ge 0, and

    p_\theta'(x_k) = \left( k\theta^{k-1} - \theta^k \right) \frac{1}{k!} \exp(-\theta) = \theta^{k-1} (k - \theta)\, \frac{1}{k!} \exp(-\theta).
By the mean value theorem, for every \delta and every k \ge 0 there exists \tilde\theta(k) (lying in the interval
between \theta and \theta + \delta) such that

    \frac{p_{\theta+\delta}(x_k) - p_\theta(x_k)}{\delta} = p_{\tilde\theta(k)}'(x_k).

Let us denote by C_0, C_1, C_2 etc. constants which do not depend on k (but may depend on \theta and
\epsilon). We have

    \frac{k}{\tilde\theta(k)} + 1 \le C_0 (k + 1),

    b_\theta^2(k) \le C_1 k^2 C_2^{2k} \le C_1 \exp(2k)\, C_2^{2k} = C_3 \exp(C_4 k).

Thus for E_\theta\, b_\theta^2(X) < \infty it suffices to show that the Poisson law has finite exponential moments of
all orders, i.e. for all c > 0 and all \theta > 0

    E_\theta \exp(cX) < \infty.
The essential work for the Cramer-Rao bound was already done in the previous subsection; indeed
this time we need differentiation under the integral sign, which is analogous to the infinite series
case. The reasoning is very similar. An estimator T is called unbiased if E_\theta |T| < \infty and E_\theta T = \theta
for all \theta \in \Theta. We again start with the unbiasedness relation

    \int T(x)\, p_\theta(x)\, dx = \theta,

and we need to differentiate the left side under the integral sign. Let us assume that our parameter
\theta is one dimensional, as in the previous subsections; multivariate versions of the Cramer-Rao bound
can be derived but need not interest us here.
Condition D3 (i) The parameter space \Theta is an open (possibly infinite) interval in R. There is a
subset S \subseteq R^k such that for all \theta \in \Theta, we have p_\theta(x) > 0 for x \in S, p_\theta(x) = 0 for x \notin S.
(ii) The density p_\theta(x) is positive and differentiable in \theta, at every x \in S and every \theta \in \Theta.
(iii) For every \theta \in \Theta, there exist an \epsilon = \epsilon_\theta > 0 and a function b_\theta(x) satisfying E_\theta\, b_\theta^2(X) < \infty
and

    \left| \frac{p_\theta(x) - p_{\theta+\delta}(x)}{\delta\, p_\theta(x)} \right| \le b_\theta(x) \quad \text{for all } |\delta| \le \epsilon \text{ and all } x \in S.
Theorem 4.3.1 (Cramer-Rao bound, continuous case) In Model Mc, assume that the
smoothness condition D3 holds, and that the Fisher information exists and is positive: 0 < I_F(\theta) <
\infty for all \theta \in \Theta. Then for every unbiased estimator T of \theta with finite variance

    \mathrm{Var}_\theta T(X) \ge (I_F(\theta))^{-1}, \quad \theta \in \Theta.
Example 4.3.2 Consider the Gaussian location model M2 for sample size n = 1, i.e. we observe
a Gaussian r.v. with law N(\mu, \sigma^2) where \sigma^2 > 0 is known, \Theta = R. Let us verify condition D3. For
ease of notation assume \sigma^2 = 1. If \phi is the standard Gaussian density then p_\mu(x) = \phi(x - \mu), and
Proposition 4.3.3 In the Gaussian location model M2, the Fisher information w.r.t. the expec-
tation parameter \mu \in R is n/\sigma^2, and the sample mean \bar X_n is a UMVUE of \mu.
Proof. We have

    \mathrm{Var}_\mu \bar X_n = \sigma^2 / n,

so we need only find the Fisher information. Condition D3 is easily verified, analogously to the
case n = 1: in (3.4) we found an expression for the joint density

    \prod_{i=1}^n p_\mu(x_i) = \frac{1}{n^{-1/2}\sigma}\, \phi\!\left( \frac{\bar x_n - \mu}{n^{-1/2}\sigma} \right) \cdot \frac{1}{n^{1/2} (2\pi\sigma^2)^{(n-1)/2}} \exp\!\left( -\frac{\tilde s_n^2}{2 n^{-1} \sigma^2} \right),   (4.22)

where \tilde s_n^2 = n^{-1} \sum_{i=1}^n (x_i - \bar x_n)^2 is the empirical second central moment. The second factor does not depend on \mu
(\sigma^2 is fixed), and therefore cancels in the term (p_\mu(x) - p_{\mu+\delta}(x))/p_\mu(x) in condition D3. The first
factor is the density (as a function of \bar x_n) of the law N(\mu, n^{-1}\sigma^2); thus, reasoning as in the above
example (4.3.2), we finally have to show that

    E_0 \exp\left( 2 |\bar x_n|\, \epsilon\, n \sigma^{-2} \right) < \infty,
which follows as above. To find the Fisher information, we can use the factorization (4.22) again.
Indeed

    I_{F,n}(\mu) = E_\mu \left( \frac{d}{d\mu} \log \prod_{i=1}^n p_\mu(x_i) \right)^2
    = E_\mu \left( \frac{d}{d\mu} \log \left[ \frac{1}{n^{-1/2}\sigma}\, \phi\!\left( \frac{\bar x_n - \mu}{n^{-1/2}\sigma} \right) \right] \right)^2,

so here the Fisher information is the same as if we observed only \bar x_n with law N(\mu, n^{-1}\sigma^2), i.e. in
a Gaussian location model with variance \tilde\sigma^2 = n^{-1}\sigma^2. The score function in such a model is

    l_\mu(x) = \frac{d}{d\mu} \log \left[ \frac{1}{\tilde\sigma}\, \phi\!\left( \frac{x - \mu}{\tilde\sigma} \right) \right]
    = \frac{d}{d\mu} \left( -\frac{(x - \mu)^2}{2\tilde\sigma^2} \right) = \tilde\sigma^{-2} (x - \mu),

and

    I_F(\mu) = \tilde\sigma^{-4}\, E_\mu (x - \mu)^2 = \tilde\sigma^{-2} = n \sigma^{-2}.
We will now discuss an example where the Cramer-Rao bound does not hold; cf. Casella and
Berger [CB], p. 312, Example 7.3.5. Suppose that X_1, \ldots, X_n are i.i.d. with uniform density on
[0, \theta]. That means the density is

    p_\theta(x) = \frac{1}{\theta}, \quad 0 \le x \le \theta.

We might try to formally apply the Cramer-Rao bound, neglecting the differentiability condition
for a moment. For one observation, the score function is

    l_\theta(x) = \frac{d}{d\theta} \log p_\theta(x) = \begin{cases} -\theta^{-1}, & 0 < x < \theta \\ 0, & x > \theta. \end{cases}
hence (formally)

    I_F(\theta) = E_\theta\, l_\theta^2(x) = \frac{1}{\theta^2},

which suggests that for any unbiased estimator based on n observations

    \mathrm{Var}_\theta T(X) \ge \frac{\theta^2}{n}.   (4.23)
However an estimator can be found which is better. Consider the maximum, i.e. the largest order
statistic

    X_{[n]} = \max_{i=1,\ldots,n} X_i.

Its distribution function is (t/\theta)^n on [0, \theta], hence its density is

    q_\theta(t) = \frac{d}{dt} (t/\theta)^n = n t^{n-1} / \theta^n, \quad 0 \le t \le \theta,

and q_\theta(t) = 0 for t > \theta. For the expectation of X_{[n]} we get

    E_\theta X_{[n]} = n \theta^{-n} \int_0^\theta y\, y^{n-1}\, dy = n \theta^{-n} \left[ \frac{1}{n+1}\, y^{n+1} \right]_0^\theta = \frac{n}{n+1}\, \theta.
Remark 4.3.4 Assume that \theta = 1, i.e. we have the uniform distribution on [0, 1]. Then

    E \max_{i=1,\ldots,n} X_i = \frac{n}{n+1} = 1 - \frac{1}{n+1},

which can be interpreted as follows: when n points are randomly thrown into [0, 1] (independently,
with uniform law) then the largest of these values tends to be 1/(n+1) away from the right boundary.
The same is true by symmetry for the smallest value and the left boundary.
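Remark 4.3.4 is easy to confirm by simulation (a sketch, not part of the original text; the seed, n = 5 and the replication count are arbitrary):

```python
import random

random.seed(1)
n, reps = 5, 200_000
acc = 0.0
for _ in range(reps):
    # largest of n independent uniforms on [0, 1]
    acc += max(random.random() for _ in range(n))
mean_max = acc / reps   # should be close to n/(n+1) = 5/6
```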
The estimator X_{[n]} is thus biased, but the bias can easily be corrected: the estimator

    T_c(X) = \frac{n+1}{n}\, X_{[n]}

(which moves X_{[n]} up towards the interval boundary) is unbiased. To find its variance we note

    E_\theta X_{[n]}^2 = n \theta^{-n} \int_0^\theta y^2\, y^{n-1}\, dy = n \theta^{-n} \left[ \frac{1}{n+2}\, y^{n+2} \right]_0^\theta = \frac{n}{n+2}\, \theta^2,
hence

    \mathrm{Var}_\theta(T_c(X)) = \left( \frac{n+1}{n} \right)^2 \mathrm{Var}_\theta X_{[n]}
    = \theta^2 \left( \frac{n+1}{n} \right)^2 \left( \frac{n}{n+2} - \left( \frac{n}{n+1} \right)^2 \right)
    = \theta^2 \left( \frac{(n+1)^2}{n(n+2)} - 1 \right)
    = \frac{\theta^2}{n(n+2)},   (4.24)

which is smaller than the bound \theta^2/n suggested by (4.23). In fact the variance (4.24) decreases
much faster with n than the bound (4.23), by an additional factor 1/(n+2).
Theorem 4.3.1 is in fact not applicable since its conditions are violated (the support depends on \theta,
and the density is not differentiable everywhere). That suggests that the statistical model where \theta
describes the support [0, \theta] of the density is more informative than the regular cases where
the Cramer-Rao bound applies. Indeed if X_{[n]} = y then all values \theta < y can be excluded with
certainty, and certainty is a lot of information in a statistical sense.
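A simulation comparing (4.24) with the formal bound (4.23) (a sketch, not part of the original text; the seed, \theta = 1, n = 10 and the replication count are arbitrary):

```python
import random

random.seed(2)
theta, n, reps = 1.0, 10, 200_000
vals = []
for _ in range(reps):
    x_max = max(random.uniform(0, theta) for _ in range(n))
    vals.append((n + 1) / n * x_max)        # the bias-corrected estimator T_c

mean_tc = sum(vals) / reps
var_tc = sum((v - mean_tc) ** 2 for v in vals) / reps
exact = theta ** 2 / (n * (n + 2))          # (4.24): theta^2 / (n(n+2))
cr_formal = theta ** 2 / n                  # the formally suggested bound (4.23)
```

The simulated variance sits far below the formally suggested Cramer-Rao level, illustrating why the bound cannot hold in this irregular model.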
Chapter 5
CONDITIONAL AND POSTERIOR DISTRIBUTIONS
Such a function defines a joint distribution Q of X and U: for any subset A of X and any interval
B in \Theta,

    Q(X \in A, U \in B) = Q(A \times B) = \sum_{x \in A} \int_B q(x, u)\, du.

It is then possible to extend the class of sets where the probabilities Q are defined: consider sets
C \subseteq X \times \Theta of the form

    C = \bigcup_{x \in A} \{x\} \times B_x.

Let S_0 be the collection of all such sets C as above; the function Q on sets C \in S_0 fulfills all the
axioms of a probability.
Suppose now that P_\theta, \theta \in \Theta, is a parametric family of distributions on X and g(\theta) is a density on
\Theta. Setting

    q(x, \theta) = P_\theta(x)\, g(\theta),
    \sum_{x \in X} \int_\Theta P_\theta(x)\, g(\theta)\, d\theta = \int_\Theta \left( \sum_{x \in X} P_\theta(x) \right) g(\theta)\, d\theta = \int_\Theta g(\theta)\, d\theta = 1
since g is a density. The joint distribution of X and U thus defined gives rise to marginal and
conditional probabilities, for x \in X:

    P(X = x) = P(X = x, U \in \Theta) = \int_\Theta P_\theta(x)\, g(\theta)\, d\theta,   (5.3)

    P(U \in B \mid X = x) = \frac{P(U \in B, X = x)}{P(X = x)}   (5.4)
    = \frac{\int_B P_\theta(x)\, g(\theta)\, d\theta}{\int_\Theta P_\theta(x)\, g(\theta)\, d\theta}.
If P_\theta(x) > 0 for all x and \theta (which is the case in M1 if \Theta = (0, 1)) then also P(X = x) > 0, and
the conditional probability (5.4) is well defined. Then it is immediate (and shown in probability
courses) that the function

    Q_x(B) = P(U \in B \mid X = x)

fulfills all the axioms of a probability on \Theta; it defines the conditional law of U given X = x. In the
statistical context this is called the posterior distribution of \theta given X = x.
It is obvious that this distribution has a density on \Theta = (a, b): for B = (a, t] we have

    Q_x((a, t]) = \int_a^t P_\theta(x)\, g(\theta) \left( \int_a^b P_u(x)\, g(u)\, du \right)^{-1} d\theta,

so the density is

    q_x(\theta) = P_\theta(x)\, g(\theta) \left( \int_a^b P_u(x)\, g(u)\, du \right)^{-1}.   (5.5)

This density is called the conditional density, or in the context of Bayesian statistics, the
posterior density of \theta given X = x. The formula (5.5) is very simple: for given prior density
g(\theta), adjoin the probability function P_\theta(x) (when X = x is observed) and renormalize P_\theta(x) g(\theta)
such that it integrates to one w.r.t. \theta (i.e. becomes a density).
Remark 5.1.1 The formula (5.5) suggests an analog for the case that X is a continuous r.v. as
well, with density p_\theta(x) say. In this case the formula (5.4) cannot be used to define a conditional
distribution since all events \{X = x\} have probability 0 for the laws P_\theta and thus also for the
marginal law of X. For continuous r.v.s X the conditional (posterior) density q_x(\theta) is directly
defined by replacing the probability P_\theta(x) in (5.5) by the density p_\theta(x); cf. [D], sect. 3.8 and our
discussion to follow in later sections.
Exercise. To prepare for the purely continuous case, let us see what happens when we reverse the
roles of \theta and X in (5.5), i.e. we take the marginal probability function for X given by (5.3) and
combine it with the conditional density for \theta given by (5.4). Consider the expression q_x(\theta) P(X = x)
and, analogously to (5.5), divide it by its sum over all possible values of x (x \in X). Call the result
q_\theta(x). Show that for any \theta with g(\theta) > 0, the relation

    q_\theta(x) = P_\theta(x), \quad x \in X,

holds. This result justifies calling P_\theta(x) a conditional probability function under U = \theta:

    P_\theta(x) = P(X = x \mid U = \theta),

even though U is continuous and the event U = \theta has probability 0.
Remark 5.1.2 Consider the uniform prior density on \Theta: g(\theta) = (b - a)^{-1}. Then

    q_x(\theta) = L_x(\theta) \left( \int_a^b L_x(u)\, du \right)^{-1},

i.e. the posterior density is proportional, as a function of \theta, to the likelihood function L_x(\theta) (the
normalizing factor does not depend on \theta).
    P_p(x) g(p) = p^{z(x)} (1 − p)^{n−z(x)} · p^{α−1} (1 − p)^{β−1} / B(α, β)
                = p^{z(x)+α−1} (1 − p)^{n−z(x)+β−1} / B(α, β).

The posterior density is proportional to the Beta density g_{α+z(x), β+n−z(x)}, and hence must coincide with this density:

    q_x(p) = g_{α+z(x), β+n−z(x)}(p).

We see that if the prior density is in the Beta class, then the posterior density is also in this class, for any observed data x. Such a family is called a conjugate family of prior distributions.
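The conjugate update above can be sketched numerically; a minimal illustration (the function name is ours, not from the text):

```python
# Conjugate Beta-Binomial update: combining a Beta(alpha, beta) prior with
# z(x) successes in n Bernoulli trials yields a Beta(alpha + z, beta + n - z)
# posterior, exactly as derived above.

def beta_binomial_posterior(alpha, beta, z, n):
    """Return the parameters of the posterior Beta distribution."""
    return alpha + z, beta + (n - z)

# A Beta(2, 2) prior and 7 successes in 10 trials give a Beta(9, 5) posterior.
print(beta_binomial_posterior(2.0, 2.0, z=7, n=10))
```

The update touches only the two Beta parameters, which is precisely what makes the family conjugate.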
The next subsection presents a more technical discussion of the Beta family. Using the formula given below for the expectation, we find the expected value of the posterior distribution as

    E(U | X = x) = (α + z(x)) / (α + z(x) + β + n − z(x))     (5.6)
                 = (α + z(x)) / (α + β + n) = (X̄_n + α/n) / (1 + α/n + β/n) = T_{α,β}(X),
i.e. it coincides with the Bayes estimator for the prior density g_{α,β}, already found in section 2.4. This is no coincidence, as the next proposition shows. Notationally, the separate symbol U for θ as a random variable is only needed for expressions like P(X = x | U = θ); we suppress U and set U = θ wherever possible. Recall that E(U | X = x) is a general notation for conditional expectation ([D] sec. 4.6), i.e. the expectation of a conditional law. In our statistical context E(θ | X = x) is called the posterior expectation.
Proposition 5.2.1 In the statistical model M_f (the sample space X for X is finite and Θ is an interval in R), let Θ be a finite interval and g be a prior density on Θ. Then, for a quadratic loss function, a Bayes estimator T_B of θ is given by the posterior expectation

    T_B(x) = E(θ | X = x),  x ∈ X.
Proof. Note that for a r.v. U = θ taking values in a finite interval, the expectation and all higher moments always exist, so the expectation exists for both the prior and the posterior distribution. The integrated risk for any estimator is (it was defined previously for the special case of Model M₁)

    B(T) = ∫ R(T, θ) g(θ) dθ = ∫ E_θ(T − θ)² g(θ) dθ.     (5.7)

Let q(θ) be an arbitrary density on Θ, E_q(·) be expectation under q and a be a constant (a does not depend on θ). Then we claim

    E_q(a − θ)² ≥ Var_q θ,     (5.8)

with equality if and only if a = E_q θ. (Note that E_q(a − θ)² is always finite in our model.) Indeed

    E_q(a − θ)² = E_q(a − E_q θ − (θ − E_q θ))² = (a − E_q θ)² + Var_q θ

in view of E_q(θ − E_q θ) = 0, which proves (5.8) and our claim. Now apply this result to the expression in round brackets under the sum sign in (5.7) and obtain that for any given x, T(x) = E_{q_x} θ = E(θ | X = x) = T_B(x) is the unique minimizer. Hence

    B(T) ≥ ∫ E_θ(T_B(X) − θ)² g(θ) dθ = E(T_B(X) − θ)² = B(T_B).
The Beta function can be expressed in terms of the Gamma function Γ:

    B(α, β) = Γ(α)Γ(β) / Γ(α + β),     (5.9)

and Γ satisfies

    Γ(α + 1) = αΓ(α).     (5.10)

It was already argued that for α, β > 0, the function x ↦ x^{α−1}(1 − x)^{β−1} is integrable on [0, 1] (cf. relation (2.15)). Relation (5.9) is proved below. The moments are (k an integer)
    m_k := ∫_0^1 x^k g_{α,β}(x) dx
         = ( Γ(α + β) / (Γ(α)Γ(β)) ) ∫_0^1 x^{k+α−1} (1 − x)^{β−1} dx
         = ( Γ(α + β) / (Γ(α)Γ(β)) ) · Γ(k + α)Γ(β) / Γ(k + α + β).

Repeated application of (5.10) yields

    m_k = ( (k − 1 + α)(k − 2 + α) ⋯ α ) / ( (k − 1 + α + β)(k − 2 + α + β) ⋯ (α + β) ).
Thus α = β implies EU = 1/2; in particular the prior distribution for which the Bayes estimator is minimax (α = β = n^{1/2}/2, cf. Theorem 2.5.2) has expected value 1/2.
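The product formula for m_k can be checked numerically; this sketch implements the displayed recursion (function name ours):

```python
# Moments of the Beta(alpha, beta) law via the product formula above:
# m_k = prod_{j=0}^{k-1} (alpha + j) / (alpha + beta + j).

def beta_moment(alpha, beta, k):
    m = 1.0
    for j in range(k):
        m *= (alpha + j) / (alpha + beta + j)
    return m

# alpha = beta gives EU = m_1 = 1/2, as noted in the text.
print(beta_moment(3.0, 3.0, 1))
```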
Consider the change of variables x = rt, y = r(1 − t) for 0 < r < ∞, 0 ≤ t ≤ 1. (Indeed every (x, y) with x > 0, y > 0 can be uniquely represented in this form: r = x + y, t = x/(x + y).) The Jacobian matrix is

    ( ∂(rt)/∂t        ∂(rt)/∂r       )     (  r    t     )
    ( ∂(r(1−t))/∂t    ∂(r(1−t))/∂r   )  =  ( −r    1 − t )

with determinant (1 − t)r + rt = r. The new density in the variables t, r is

    g̃_{α,β}(t, r) = (1/(Γ(α)Γ(β))) (rt)^{α−1} (r(1 − t))^{β−1} r exp(−r)
                  = (1/(Γ(α)Γ(β))) r^{α+β−1} exp(−r) t^{α−1} (1 − t)^{β−1}.

When we integrate over r, the result is the marginal density of t (call this f_{α,β}(t)) and hence must integrate to one. We find from the definition of the Gamma function

    f_{α,β}(t) = ∫_0^∞ g̃_{α,β}(t, r) dr = ( Γ(α + β) / (Γ(α)Γ(β)) ) t^{α−1} (1 − t)^{β−1}.

This is the density of the Beta law; since this density integrates to one, we have proved (5.12).
Note that formula (5.15) is analogous to the conditional probability function in the discrete case: if X and Y can take only finitely many values x, y and

    p(x, y) = P(Y = y, X = x),   p_X(x) = P(X = x)

are the joint and marginal probability functions, then, if p_X(x) > 0, by the classical definition of conditional probability

    p(Y = y | X = x) = P(Y = y, X = x) / P(X = x) = p(x, y) / p_X(x),

which looks exactly like (5.15), but all the terms involved are probability functions, not densities.
To state a connection between conditional densities and independence, we need a slightly more precise definition. First note that densities are not unique; they can be modified on certain points or subsets without affecting the corresponding probability measure.
Definition 5.4.1 Let Z be a random variable with values in R^k and density q on R^k. A version of q is a density q̃ on R^k such that for all sets A ⊆ R^k for which the k-dimensional volume (measure) is defined,

    ∫_A q(z) dz = ∫_A q̃(z) dz.
Figure 1: Conditional density of Y given X = x. The altitude lines symbolize the joint density of X and Y; on the dark strip, in the limit as h tends to 0, this gives the conditional density of Y given X = x.
In particular, if A = A₁ ∪ A₂ and A₂ has volume 0, then the density q can be arbitrarily modified on A₂. On the real line, A₂ might consist of a countable number of points; on R^k, A₂ might consist of hyperplanes, smooth surfaces etc.
Definition 5.4.2 Suppose that Z = (Z₁, …, Z_k) is a continuous random variable with values in R^k and with joint density p(z) = p(z₁, …, z_k). Set X = Z₁, Y = (Z₂, …, Z_k) and p(x, y) = p(z). A version of the conditional density of Y given X = x is any function p(y|x) with the properties:
(i) For all x, p(y|x) is a density in y, i.e.

    p(y|x) ≥ 0,   ∫ p(y|x) dy = 1.

(ii) There is a version of p(x, y) such that for p_X(x) = ∫ p(x, y) dy we have

    p(x, y) = p(y|x) p_X(x).     (5.16)
Lemma 5.4.3 A conditional density as defined above exists.
Proof. If x is such that p_X(x) > 0 then we can set

    p(y|x) = p(x, y) / p_X(x).     (5.17)

Clearly this is a density in y, since

    ∫ p(x, y)/p_X(x) dy = p_X(x)/p_X(x) = 1.
Now let A = {x : p_X(x) = 0}. Setting

    A₀ = {(x, y) : x ∈ A},

we obtain

    P((X, Y) ∈ A₀) = P(X ∈ A) = 0.

Now modify p(x, y) on A₀ to obtain another version p̃, namely set p̃(x, y) = 0 for (x, y) ∈ A₀. For this version p̃(x, y) we have: p_X(x) = 0 implies p̃(x, y) = 0, and hence for such x, p(y|x) can be chosen as an arbitrary density to fulfill (5.16).
Proposition 5.4.4 (X, Y) with joint density p(x, y) are independent if and only if there is a version of p(y|x) which does not depend on x.
Proof. If X and Y are independent then p(x, y) = p_X(x)p_Y(y). Thus p_Y(y) = p(y|x) is such a version. Conversely, if p(y|x) is such a version, write p(y|·) for its common value; then

    p_Y(y) = ∫ p(x, y) dx = ∫ p(y|x) p_X(x) dx = p(y|·) ∫ p_X(x) dx = p(y|·),

hence

    p(x, y) = p_Y(y) p_X(x),

which implies that (X, Y) are independent.
In case that all occurring joint and marginal densities are positive, there is really no need to consider
modifications; the conditional densities can just be taken as (5.17).
Let now f(Y) be any function of the random variable Y. The conditional expectation of f(Y) given X = x is

    E(f(Y) | X = x) = ∫ f(y) p(y|x) dy,

i.e. the expectation with respect to the conditional distribution of Y given X = x, given by the conditional density p(y|x). Note that this conditional expectation depends on x, i.e. it is a function of x, the realization of the random variable X. Sometimes it is useful to "keep in mind" the original random nature of this realization, i.e. to consider the conditional expectation as a function of the random variable X. The common notation for this random variable is

    E(f(Y) | X).
One then has

    E( E(f(Y) | X) ) = E f(Y),

i.e. expectation of the conditional expectation yields the expectation. These notions are the same as in the case of discrete random variables; only in the continuous case did we have to pay attention to the fact that P(X = x) = 0.
Proposition 5.5.1 Suppose that in model M_c a density π(θ) on Θ is given such that π(θ) > 0, θ ∈ Θ (and that p_θ(x) is jointly measurable as a function of (x, θ)). Then the function

    p(x, θ) = p_θ(x) π(θ)     (5.18)

is a density on R^k × Θ. When this is construed as a joint density of (X, θ), then p_θ(x) is a version of the conditional density p(x|θ).
Proof. We have p(x, θ) ≥ 0 and ∫∫ p_θ(x) π(θ) dx dθ = ∫ π(θ) dθ = 1, thus p(x, θ) is a density. Then π(θ) is the marginal density of θ, derived from the joint density: indeed

    ∫ p(x, θ) dx = ∫ p_θ(x) π(θ) dx = π(θ).

Then we immediately see from (5.18) that p_θ(x) is a version of the conditional density p(x|θ).
This justifies the notation that for a parametric family of densities, one writes interchangeably p_θ(x) or p(x|θ). Then p(θ|x) is again called the posterior density. If Θ is an interval in R then the conditional expectation

    E(θ | X = x) = ∫ θ p(θ|x) dθ,
if it exists, is again called the posterior expectation. Let us discuss the case of the Gaussian location model, first for sample size n = 1. We can represent X = X₁ as

    X = θ + ε

where L(ε) = N(0, σ²), and the prior is L(θ) = N(m, τ²). Define

    γ = τ² / (σ² + τ²),     (5.19)
    ρ² = σ²τ² / (σ² + τ²).     (5.20)

Completing the square in the exponent, we obtain

    p(θ|x) = ρ^{-1} φ( ρ^{-1} (θ − m − γ(x − m)) ),

where φ is the standard normal density.
Proposition 5.5.2 In the Gaussian location model M₂ for sample size n = 1 and a normal prior distribution L(θ) = N(m, τ²), the posterior distribution is

    L(θ | X = x) = N( m + γ(x − m), ρ² )

where γ and ρ are defined by (5.19), (5.20). The normal family N(m, τ²), m ∈ R, τ² > 0 is a conjugate family of prior distributions.
Let us interpret this result. The posterior distribution of θ has expected value m + γ(x − m); note that 0 < γ < 1. The prior distribution of θ had expectation m, so the posterior expectation of θ intuitively represents a compromise between the prior belief and the empirical evidence about θ, i.e. X = (θ + ε) = x. Indeed

    E(θ | X = x) = m + γ(x − m) = m(1 − γ) + γx,

so E(θ | X = x) is a convex (linear) combination of the data x and the prior mean m (it always lies between these two points). In other words, the data x are shrunk towards m when x − m is multiplied by γ. A similar shrinkage effect was observed for the Bayes estimator in the Bernoulli model M₁ when the prior mean was 1/2 (recall that the minimax estimator there was Bayes for a Beta prior with α = β, which has mean 1/2).
Moreover

    ρ² = σ²τ² / (σ² + τ²) < min(σ², τ²).

The posterior variance ρ² is thus smaller than both the prior variance τ² and the variance of the data given θ (i.e. σ²). It is seen that the information in the data and in the prior distribution is added up to give a smaller variability (a posteriori) than is contained in either of the two sources. In fact

    1/ρ² = 1/σ² + 1/τ².

The inverse variance of a distribution can be seen as a measure of concentration, sometimes called precision. Then the precision of the posterior is the sum of the precisions of the data and of the prior distribution.
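The precision identity gives a compact way to compute the posterior; a minimal sketch (function and variable names are ours):

```python
# Posterior of theta in the Gaussian location model with n = 1:
# prior N(m, tau2), one observation x with noise variance sigma2.
# Precisions add: 1/rho2 = 1/sigma2 + 1/tau2.

def gaussian_posterior(m, tau2, x, sigma2):
    rho2 = 1.0 / (1.0 / sigma2 + 1.0 / tau2)   # posterior variance
    gamma = tau2 / (sigma2 + tau2)             # shrinkage factor
    mean = m + gamma * (x - m)                 # posterior mean
    return mean, rho2

print(gaussian_posterior(0.0, 1.0, 2.0, 1.0))
```

With m = 0, τ² = σ² = 1 and x = 2 the posterior mean is 1, halfway between prior mean and data, and the posterior variance 1/2 is smaller than both σ² and τ².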
The posterior expectation again can be shown to be a Bayes estimator for quadratic loss. Before
establishing this, let us discuss two limiting cases.
Case 1 The prior variance τ² is large: τ² → ∞. Then

    γ = τ² / (σ² + τ²) → 1,     ρ² → σ².     (5.21)

A large prior variance means that the prior density is more and more spread out, i.e. more diffuse or noninformative. γ → 1 then means that the prior information counts less and less in comparison to the data, and the posterior expectation of θ tends to x. This means that in the limit the only evidence we have about θ is the realized value X = x.
Case 2 The prior variance τ² is small: τ² → 0. Then

    γ = τ² / (σ² + τ²) → 0,     ρ² → 0.     (5.22)

A small prior variance means that the prior density is more and more concentrated around m. Then (5.22) means that the belief that θ is near m becomes overwhelmingly strong, and forces the posterior distribution to concentrate around m as well.
Case 3 The variance σ² of the data (given θ) is large: σ² → ∞. Then

    γ = τ² / (σ² + τ²) → 0,     ρ² → τ².

The posterior density tends to the prior density, since the quality of the information X = x becomes inferior (large variance σ²).
Case 4 The variance σ² of the data (given θ) is small: σ² → 0. We expect the prior distribution to matter less and less, since the data are more and more reliable. Indeed

    γ = τ² / (σ² + τ²) → 1,     ρ² → 0,

which is similar to (5.21) for Case 1.
The quantity

    r = τ² / σ²

is often called the signal-to-noise ratio. Recall that in the Gaussian location model (Model M_{d,1}) for n = 1 the data are

    X = θ + ε

where L(ε) = N(0, σ²) and L(θ) = N(0, τ²). The parameter θ can be seen as a signal which is observed with noise ε. For m = 0 we have

    τ² = Eθ²,   σ² = Eε²,

so that τ², σ² represent the average (squared) size of the signal and the noise. The parameters of the posterior density can be expressed in terms of the signal-to-noise ratio r and σ²:

    γ = τ² / (σ² + τ²) = r / (1 + r),     ρ² = σ²τ² / (σ² + τ²) = σ² r / (1 + r),

and the discussion of the limiting cases 1-4 above could have been in terms of r.
Remark 5.5.3 The source of randomness of the parameter θ may not only be prior belief, as argued until now, but may be entirely natural and part of the model, along with the randomness of the noise ε. This randomness assumption even dominates in the statistical literature related to communication and signal transmission (assumption of a random signal). The problem of estimation of θ then becomes the problem of prediction. Predicting θ still means to find an estimator T(X) depending on the data, but for assessing the quality of a predictor, the randomness of θ is always taken into account. The risk of a predictor is the expected loss

    E L(T(X), θ)

w.r.t. the joint distribution of X and θ. This coincides with the mixed risk of an estimator in Bayesian statistics when a prior distribution on θ is used to build the joint distribution of (X, θ). Thus an optimal predictor is the same as a Bayes estimator, when the loss functions L coincide.
The binary channel: Θ = {0, 1}, P_θ = B(1, p_θ) where p_θ ∈ (0, 1), θ = 0, 1. The signals are either 0 or 1, and given θ = 1 the channel produces the correct signal with probability p₁ and the distorted signal 0 with probability 1 − p₁; analogously for θ = 0. It is natural to assume here p₁ > 1/2, p₀ < 1/2, lest the channel be entirely distorted (give the wrong signal with higher probability than the correct one). A natural prior μ would be the uniform here (μ(0) = μ(1) = 1/2), since in most data streams it is probably not sensible to assume that 0 is more frequent than 1.
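The posterior probability of the sent bit follows from Bayes' rule; a sketch under the uniform prior (function name ours):

```python
# Posterior P(theta = 1 | X = x) for the simple binary channel:
# given theta, the channel outputs 1 with probability p_theta.

def posterior_one(x, p1, p0, prior1=0.5):
    like1 = p1 if x == 1 else 1.0 - p1       # P(X = x | theta = 1)
    like0 = p0 if x == 1 else 1.0 - p0       # P(X = x | theta = 0)
    num = like1 * prior1
    return num / (num + like0 * (1.0 - prior1))

# A fairly clean channel: p1 = 0.9, p0 = 0.1.
print(posterior_one(1, 0.9, 0.1))
```

For this symmetric channel the posterior after observing 1 equals p₁ itself, which is why such channels are easy to decode by majority rules.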
The n-fold product of the binary channel: Θ = {0, 1}ⁿ, P_θ = ⊗_{i=1}^n B(1, p_{θ(i)}), where θ(i) is the i-th component of a sequence of 0s and 1s of length n and ⊗_{i=1}^n signifies the n-fold product of laws. The numbers p_θ ∈ (0, 1), θ = 0, 1 are as above in the simple binary channel. Here a signal is a sequence of length n, and the channel transmits this sequence such that each component is sent independently through a simple binary channel. The difference to the previous case is that the signal is a sequence of length n, not just 0 or 1; thus n = 8 gives card(Θ) = 2⁸ = 256, and this suffices to encode and send all the ASCII signs. Here it is natural to assume a non-uniform distribution μ on the alphabet Θ, since ASCII signs (e.g. letters) do not all occur with equal probability.
The Gaussian channel: let Θ ⊆ R and P_θ = N(θ, σ²) where σ² is fixed. Here the signal at the output has the form

    X = θ + ε     (5.23)

where L(ε) = N(0, σ²), i.e. the channel transmits the signal θ in additive Gaussian noise. Here the real numbers θ are codes agreed upon for any other signal, such as sequences of 0s and 1s as above. Thus the channel (5.23) itself coincides with the Gaussian location model. However, since there are usually only a finite number of signals possible, in information theory one considers prior distributions μ for the signal which are concentrated on finite sets Θ ⊂ R.
After this digression, our next task is Bayesian inference in the Gaussian location model for general sample size n. Our prior distribution for θ will again be N(m, τ²). We have

    X_i = θ + ε_i,  i = 1, …, n,

and, by the representation found in (3.4) in the proof of Proposition 3.0.5 (with φ the standard normal density and s_n² the sample variance), the joint density of the observations factorizes into a normal density in x̄_n and a factor not depending on θ. If we denote by φ_{μ,σ}(x) the density of the normal law N(μ, σ²), then

    p_θ(x) = φ_{θ, n^{-1/2}σ}(x̄_n) · p(s_n²),

where the second factor (call it p(s_n²) for now) does not depend on θ. If π(θ) is the prior density, then p(s_n²) cancels in the posterior:

    p(θ|x) = φ_{θ, n^{-1/2}σ}(x̄_n) π(θ) ( ∫ φ_{u, n^{-1/2}σ}(x̄_n) π(u) du )^{-1}.

This coincides with the posterior distribution for one observation (n = 1) at value X₁ = x̄_n and variance n^{-1}σ². In other words, the posterior distribution given X = x may be computed as if we observed only the sample mean X̄_n, taking into account that its law is N(θ, n^{-1}σ²). Thus the posterior density depends on the vector x only via the function x̄_n of x.
Proposition 5.5.4 In the Gaussian location model M₂ for general sample size n and a normal prior distribution L(θ) = N(m, τ²), the posterior distribution is

    L(θ | X = x) = N( m + γ(x̄_n − m), ρ² )     (5.24)

where

    γ = τ² / (n^{-1}σ² + τ²),     ρ² = n^{-1}σ²τ² / (n^{-1}σ² + τ²).     (5.25)

The normal family N(m, τ²), m ∈ R, τ² > 0 is a conjugate family of prior distributions.
Corollary 5.5.5 In Model M₂ for general sample size n, consider the statistic X̄_n and regard it as the data in a derived model:

    L(X̄_n | θ) = N(θ, n^{-1}σ²)

(Gaussian location for sample size n = 1 and variance n^{-1}σ²). In this model, the normal prior L(θ) = N(m, τ²) leads to the same posterior distribution (5.24) for θ.
Remark 5.5.6 (Sufficient statistics) Properties of statistics (data functions) T(X) like these suggest that T(X) may contain all the relevant information in the sample, i.e. that it suffices to take the statistic T(X) and perform all inference about the parameter θ in the parametric model for T(X) derived from the original family {P_θ, θ ∈ Θ} for X:

    {Q_θ, θ ∈ Θ} = {L(T(X)|θ), θ ∈ Θ},

which is a family of distributions on the space T where T(X) takes its values. This idea is called the sufficiency principle, and T would be a sufficient statistic. At this point we do not rigorously define this concept, noting only that it is of fundamental importance in the theory of statistical inference.
The expectation of the posterior distribution in Proposition 5.5.4 is m + γ(x̄_n − m). It is natural to call this the conditional expectation of θ given X = x, or the posterior expectation, written E(θ|X = x). (We have not shown so far that it is unique; more accurately we should call it a version of the conditional expectation.) Clearly for n → ∞ we have γ → 1 and E(θ|X = x) will be close to the sample mean. Moreover, the above corollary shows that in a discussion of limiting cases as in Case 1 - Case 4 above, all statements carry over when X (the one observation for n = 1) is replaced by the sample mean. In addition, the noise variance σ² in this discussion is now replaced by n^{-1}σ². Thus e.g. Case 4 can be taken to mean that as the sample size n increases, the prior distribution matters less and less (indeed we have more and more information in the sample). The analog of the signal-to-noise ratio would be

    r = nτ²/σ² → ∞  as n → ∞.

Again we see that intuitively a large sample size n amounts to small noise.
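The effect of the sample size on the posterior can be sketched directly, replacing σ² by σ²/n in the one-observation formulas (function name ours):

```python
# Posterior for general sample size n, computed from the sample mean alone
# (cf. Corollary 5.5.5): the noise variance is sigma2 / n.

def gaussian_posterior_n(m, tau2, xbar, sigma2, n):
    noise = sigma2 / n                     # variance of the sample mean
    gamma = tau2 / (noise + tau2)          # shrinkage factor
    mean = m + gamma * (xbar - m)
    rho2 = noise * tau2 / (noise + tau2)
    return mean, gamma, rho2

# As n grows, gamma -> 1 and the posterior mean approaches xbar.
for n in (1, 10, 100):
    print(n, gaussian_posterior_n(0.0, 1.0, 1.0, 1.0, n))
```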
The integrated risk of an estimator T is again

    B(T) = E L(T(X), θ),

where the last expectation is taken w.r.t. the joint distribution of X and θ. A Bayes estimator is an estimator T_B which minimizes B(T):

    B(T_B) = inf_T B(T).

It is also called a best predictor, depending on the context (cf. Remark 5.5.3).
In the statistical model M_c (the sample space for X is R^k and Θ is an interval in R), let π be a prior density on Θ such that the posterior expectation E(θ|X = x) exists. Then, for a quadratic loss function, a Bayes estimator T_B of θ is given by the posterior expectation

    T_B(x) = E(θ|X = x),  x ∈ X.
Proof. In Proposition 5.2.1 the finite second moment was automatically ensured by the condition that Θ was a finite interval; otherwise the proof is entirely analogous. Under the present condition, E(θ|X = x) exists. We have

    B(T) = ∫_Θ ∫_{R^k} (T(x) − θ)² p(x|θ) π(θ) dx dθ
         = ∫_{R^k} ( ∫_Θ (T(x) − θ)² p(θ|x) dθ ) p_X(x) dx
         = ∫_{R^k} E( (T(X) − θ)² | X = x ) p_X(x) dx.

Invoking relation (5.8) again, we find for any x and T_B(x) = E(θ|X = x)

    ∫ (T(x) − θ)² p(θ|x) dθ ≥ ∫ (T_B(x) − θ)² p(θ|x) dθ.

This holds true even if the left side is infinite. The right side is the variance of p(θ|x) and is finite under the assumptions. Hence

    B(T) ≥ E(T_B(X) − θ)² = B(T_B).
Corollary 5.6.2 In Model M₂ for general sample size n and a normal prior distribution L(θ) = N(m, τ²), the Bayes estimator for quadratic loss is

    T_B(X) = m + γ(X̄_n − m)

with γ from (5.25), and the posterior variance is

    Var(θ | X = x) = ρ²

with ρ² from (5.25).
Consider the Gaussian location model with a mean zero Gaussian prior L(θ) = N(0, τ²). According to Corollary 5.6.2 we have T_B(X) = γX̄_n and hence

    E_θ T_B(X) = γ E_θ X̄_n = τ²θ / (n^{-1}σ² + τ²),
    Var_θ(T_B(X)) = γ² Var_θ X̄_n = ( τ² / (n^{-1}σ² + τ²) )² n^{-1}σ²,

hence

    R(T_B, θ) = E_θ(T_B(X) − θ)² = Var_θ T_B(X) + (E_θ T_B(X) − θ)²
              = ( n^{-1}σ²τ⁴ + n^{-2}σ⁴θ² ) / (n^{-1}σ² + τ²)²
              = ( τ² / (n^{-1}σ² + τ²) )² n^{-1}σ² ( 1 + n^{-1}σ²θ²/τ⁴ ).
A) Since

    τ² / (n^{-1}σ² + τ²) < 1,

for small values of θ we have

    R(T_B, θ) < R(X̄_n, θ) = n^{-1}σ²,

i.e. for small values of θ the estimator T_B is better.
B) For |θ| → ∞ we have R(T_B, θ) → ∞, whereas R(X̄_n, θ) remains constant. For large values of θ the estimator X̄_n is better.
The sample mean recommends itself as particularly prudent, in the sense that it takes into account possibly large values of θ, whereas the Bayes estimator is more optimistic, in the sense that it is geared towards smaller values of θ.
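The comparison A)/B) can be made concrete by evaluating both risk functions (function names ours; the Bayes risk is written as variance plus squared bias, equivalent to the displayed formula):

```python
# Risk of the Bayes estimator T_B = gamma * Xbar_n versus the sample mean:
# R(T_B, theta) = gamma^2 * (sigma2/n) + (1 - gamma)^2 * theta^2,
# R(Xbar_n, theta) = sigma2 / n (constant in theta).

def risk_bayes(theta, sigma2, tau2, n):
    noise = sigma2 / n
    gamma = tau2 / (noise + tau2)
    return gamma ** 2 * noise + (1.0 - gamma) ** 2 * theta ** 2

def risk_mean(sigma2, n):
    return sigma2 / n

# For small theta the Bayes estimator wins; for large theta the mean wins.
print(risk_bayes(0.0, 1.0, 1.0, 1), risk_mean(1.0, 1))
print(risk_bayes(10.0, 1.0, 1.0, 1), risk_mean(1.0, 1))
```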
Figure: Risks of the Bayes estimators for τ = 1 and τ = 2, and risk of the sample mean (dotted), for n^{-1}σ² = 1.
Recall Definition 2.5.1 of a minimax estimator; that concept is the same in the general case: for any estimator T set

    M(T) = sup_{θ ∈ Θ} R(T, θ).     (5.28)

Here, more generally, sup is written in place of the maximum, since it is not claimed that the supremum is attained. Conceptually, the criterion is again the worst case risk. An estimator T_M is called minimax if

    M(T_M) = min_T M(T).
Theorem 5.7.1 In the Gaussian location model for Θ = R (Model M₂), the sample mean X̄_n is a minimax estimator for quadratic loss.
It is essential for this argument that the parameter space is the whole real line. It turns out (see exercise below) that e.g. for Θ = [−K, K] we can find an estimator T_{B,τ} which is uniformly strictly better than X̄_n, so that X̄_n is no longer minimax.
Let us consider the case Θ = [−K, K] in more detail; here we have an a priori restriction |θ| ≤ K. It is a more complicated problem to find a minimax estimator here. A common approach is to simplify the problem again, by looking for minimax estimators within a restricted class of estimators. For simplicity consider the case of the Gaussian location model for n = 1, σ² = 1. A linear estimator is any estimator

    T(X) = aX + b

where a, b are real numbers. The appropriate worst case risk is again (5.28), for the present parameter space Θ. A minimax linear estimator is a linear estimator T_{LM} such that

    M(T_{LM}) = min { M(T) : T linear }.
Exercise 5.7.2 Consider the Gaussian location model with restricted parameter space Θ = [−K, K], where K > 0, sample size n = 1 and σ² = 1. (i) Find the minimax linear estimator T_{LM}. (ii) Show that T_{LM} is strictly better than the sample mean X̄_n = X everywhere on Θ = [−K, K] (this implies that X is not admissible). (iii) Show that T_{LM} is Bayesian in the unrestricted model Θ = R for a certain prior distribution N(0, τ²), and find the τ².
Chapter 6
THE MULTIVARIATE NORMAL DISTRIBUTION
Recall the Gaussian location model (Model M₂), for sample size n = 1. We can represent X as

    X = θ + ε

where ε is centered normal: L(ε) = N(0, σ²). In a Bayesian statistical approach, assume that θ becomes random as well: L(θ) = N(0, τ²), in such a way that it is independent of the noise ε. Thus θ and ε are independent normal r.v.s. For Bayesian statistical inference, we computed the joint distribution of X and θ; this is well defined as the distribution of the r.v. (θ + ε, θ). The joint density was

    p(x, θ) = (2πστ)^{-1} exp( −(x − θ)²/(2σ²) − θ²/(2τ²) ).     (6.1)

Here we started with a pair of independent Gaussian r.v.s (θ, ε) and obtained a pair (X, θ). We saw that the marginal of X is N(0, σ² + τ²); the marginal of θ is N(0, τ²). We have a joint distribution of (X, θ) in which both marginals are Gaussian, but (X, θ) are not independent: indeed (6.1) is not the product of its marginals. Note that we took a linear transformation of (θ, ε):

    X = 1·θ + 1·ε,
    θ = 1·θ + 0·ε.

We could have written θ = τx₁ and ε = σx₂ for independent standard normals x₁, x₂; then the linear transformation is

    X = τx₁ + σx₂,     (6.2)
    θ = τx₁ + 0·x₂.
Let us consider a more general situation. We have independent standard normals x₁, x₂, and consider

    y₁ = m₁₁x₁ + m₁₂x₂,
    y₂ = m₂₁x₁ + m₂₂x₂,

where the linear transformation is nonsingular: m₁₁m₂₂ − m₂₁m₁₂ ≠ 0. Nonsingularity is true for (6.2): it means just στ > 0. Define a matrix

    M = ( m₁₁  m₁₂ )
        ( m₂₁  m₂₂ )

and vectors

    x = (x₁, x₂)^T,   y = (y₁, y₂)^T.
Then nonsingularity means |M| ≠ 0 (where |M| is the determinant of M). Let us find the joint distribution of y₁, y₂, i.e. the law of the vector y. For this we immediately proceed to the k-dimensional case, i.e. consider i.i.d. standard normals x₁, …, x_k, and let

    x = (x₁, …, x_k)^T,   y = Mx,

where M is a nonsingular k × k matrix. Set Σ = MM^T; this is a nonsingular symmetric matrix. Recall the following matrix rules:

    (M^{-1})^T M^{-1} = (MM^T)^{-1} = Σ^{-1},   |Σ| = |M| |M^T| = |M|²,   |M^{-1}| = |M|^{-1}.
We thus obtain

    x^T x = (Mx)^T Σ^{-1} (Mx),

and hence

    P(y ∈ A) = (2π)^{-k/2} ∫_{Mx ∈ A} exp( −(Mx)^T Σ^{-1} (Mx)/2 ) dx.

This suggests a multivariate change of variables: setting y = Mx, we have to set x = M^{-1}y and (formally)

    dx = |M^{-1}| dy = |M|^{-1} dy.
The resulting density,

    (2π)^{-k/2} |Σ|^{-1/2} exp( −y^T Σ^{-1} y / 2 )

for some Σ = MM^T, |Σ| ≠ 0 (y = (y₁, …, y_k)^T), is called the density of the k-variate normal distribution N_k(0, Σ).
Clearly all matrices Σ = MM^T with |M| ≠ 0 are possible here. Let us describe the class of possible matrices Σ differently.
Lemma 6.1.5 Any matrix Σ = MM^T with |M| ≠ 0 is positive definite: for any x ∈ R^k which is not identically zero,

    x^T Σ x > 0.

Proof. Setting z = M^T x we have x^T Σ x = x^T MM^T x = z^T z ≥ 0, and the inequality is strict, since z^T z = 0 would mean z = 0 and thus x = (M^T)^{-1} z = 0, which was excluded by assumption.
Recall the following basic fact from linear algebra.
Proposition 6.1.6 (Spectral decomposition) Any positive definite k × k matrix Σ can be written

    Σ = C^T Λ C

where Λ is a diagonal matrix with positive diagonal elements λ_i > 0, i = 1, …, k (called eigenvalues or spectral values) and C is an orthogonal k × k matrix:

    C^T C = CC^T = I_k

(I_k is the unit k × k matrix). If all the λ_i, i = 1, …, k are different, Λ is unique and C is unique up to sign changes of its row vectors.
Lemma 6.1.7 Every positive definite k × k matrix Σ can be written as MM^T where M is a nonsingular k × k matrix.
Proof. It is easy to take a square root Λ^{1/2} of a diagonal matrix Λ: let Λ^{1/2} be the diagonal matrix with diagonal elements λ_i^{1/2}; then Λ^{1/2}Λ^{1/2} = Λ. Now take M = C^T Λ^{1/2} where C, Λ are from the spectral decomposition:

    MM^T = C^T Λ^{1/2} Λ^{1/2} C = C^T Λ C = Σ,

and M is nonsingular:

    |M| = |C^T Λ^{1/2}| = |C^T| ∏_{i=1}^k λ_i^{1/2} ≠ 0,

since for any orthogonal matrix C one has |C^T| = ±1, because

    |C^T|² = |C^T C| = |I_k| = 1.

As a result, the k-variate normal distribution N_k(0, Σ) is defined for any positive definite matrix Σ.
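Lemma 6.1.7 produces the square root M = C^T Λ^{1/2}, but any factorization Σ = MM^T serves equally well. As a sketch for the 2 × 2 case, here is the Cholesky factor, an alternative square root computed by hand with the stdlib only (function name ours):

```python
import math

# A 2x2 positive definite Sigma = [[s11, s12], [s12, s22]] factored as
# Sigma = M M^T with M lower triangular (the Cholesky factor); this is an
# alternative to the spectral construction M = C^T Lambda^{1/2}.

def sqrt2x2(s11, s12, s22):
    m11 = math.sqrt(s11)
    m21 = s12 / m11
    m22 = math.sqrt(s22 - m21 * m21)   # positive definiteness makes this real
    return ((m11, 0.0), (m21, m22))

M = sqrt2x2(4.0, 2.0, 3.0)
# Reconstruct the entries of M M^T to check the factorization.
print(M[0][0] ** 2, M[0][0] * M[1][0], M[1][0] ** 2 + M[1][1] ** 2)
```

Such a factor M is exactly what is needed to sample N₂(0, Σ) as y = Mx from i.i.d. standard normals x.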
Recall that for any random vector y in R^k, the covariance matrix is defined by

    (Cov(y))_{i,j} = Cov(y_i, y_j) = E(y_i − Ey_i)(y_j − Ey_j),

if the expectations exist. (Here (A)_{i,j} is the (i, j) entry of a matrix A.) This existence is guaranteed by the condition Σ_{i=1}^k E y_i² < ∞ (via the Cauchy-Schwarz inequality).
Lemma 6.1.8 The law N_k(0, Σ) has expectation 0 and covariance matrix Σ.
Proof. Write y = Mx where x is a vector of i.i.d. standard normals and Σ = MM^T. Since

    E x_r x_s = 1 if r = s, and 0 otherwise,

we obtain

    Cov(y_i, y_j) = Σ_{r=1}^k m_{ir} m_{jr} = (MM^T)_{i,j} = (Σ)_{i,j}.
Definition 6.1.9 Let x be a random vector with distribution N_k(0, Σ) where Σ is a positive definite matrix, and let μ be a (nonrandom) vector from R^k. The distribution of the random vector

    y = x + μ

is called the k-variate normal law N_k(μ, Σ).
Lemma 6.1.10 (i) The law N_k(μ, Σ) has expectation μ and covariance matrix Σ.
(ii) The density is

    φ_{μ,Σ}(y) := (2π)^{-k/2} |Σ|^{-1/2} exp( −(y − μ)^T Σ^{-1} (y − μ)/2 ).
Lemma 6.1.11 L(x) = N_k(0, I_k) if and only if x is a vector of i.i.d. standard normals.
Lemma 6.1.12 Let L(x) = N_k(μ, Σ) where Σ is positive definite, and let A be a (nonrandom) l × k matrix with rank l (this implies l ≤ k). Then

    L(Ax) = N_l(Aμ, AΣA^T).

Proof. Subtracting the expectation, we can assume μ = 0. Let also x = Mξ where L(ξ) = N_k(0, I_k) and Σ = MM^T. Then Ax = AMξ, so for l = k the claim is immediate. In general, for l ≤ k, consider also the l × l matrix AΣA^T; note that it is positive definite:

    a^T AΣA^T a = (A^T a)^T Σ (A^T a) > 0

for every nonzero l-vector a, since A^T a is then a nonzero k-vector (A has rank l). Let

    AΣA^T = C^T Λ C

be its spectral decomposition, and set D = Λ^{-1/2} C. Then

    Ax = AMξ,
    DAx = DAMξ.
Suppose first that l = k. Then DAx is multivariate normal with expectation 0 and covariance matrix

    DAΣA^T D^T = Λ^{-1/2} C (C^T Λ C) C^T Λ^{-1/2} = I_l.

Hence

    L(DAx) = N_l(0, I_l),

i.e. DAx is a vector of standard normals.
If l < k then we find a (k − l) × k matrix F such that the rows of DAM and of F together form an orthonormal basis of R^k. For this it suffices to select the rows of F as k − l orthonormal vectors which form a basis of the subspace of R^k which is orthogonal to the subspace spanned by the rows of DAM. Then for the k × k matrix

    F₀ = ( DAM )
         (  F  )

we have

    F₀ F₀^T = ( I_l   0       )  =  I_k,
              ( 0     I_{k−l} )

i.e. F₀ is orthogonal. Hence F₀ξ is multivariate normal N_k(0, I_k), i.e. a vector of independent standard normals. Since DAMξ consists of the first l elements of F₀ξ, it is a vector of l standard normals. We have shown again that DAx is a vector of standard normals.
Note that D^{-1} = C^T Λ^{1/2}, since

    D^{-1} D = C^T Λ^{1/2} Λ^{-1/2} C = C^T C = I_l.
Two random vectors x, y with dimensions k, l respectively are said to have a joint normal distribution if the vector

    z = ( x )
        ( y )

has a (k + l)-variate normal distribution. The covariance matrix Cov(x, y) of x, y is defined by

    (Cov(x, y))_{i,j} = Cov(x_i, y_j);

it is a k × l matrix. We then have a block structure for the joint covariance matrix Cov(z):

    Cov(z) = ( Cov(x)      Cov(x, y) )
             ( Cov(y, x)   Cov(y)    ).
Theorem 6.1.13 Two random vectors x, y with joint normal distribution are independent if and only if they are uncorrelated, i.e. if

    Cov(x, y) = 0     (6.4)

(where 0 stands for the null matrix).
Proof. Independence here means that the joint density is the product of its marginals. Assume that both x, y are centered, i.e. have zero expectation (otherwise the expectation can be subtracted, with (6.4) still true). Write Σ = Cov(z), Σ₁₁ = Cov(x), Σ₂₂ = Cov(y), Σ₁₂ = Cov(x, y), Σ₂₁ = Σ₁₂^T. Then (6.4) means that

    Σ = ( Σ₁₁   0   )
        ( 0     Σ₂₂ ),

where 0 represents null matrices of appropriate dimensions. A matrix of such a structure is called block diagonal. It is easy to see that

    Σ^{-1} = ( Σ₁₁^{-1}   0        )
             ( 0          Σ₂₂^{-1} )

and |Σ| = |Σ₁₁| |Σ₂₂|, so that the joint density φ_{0,Σ}(z) factorizes: the marginal of x is φ_{0,Σ₁₁}(x) and the marginal of y is φ_{0,Σ₂₂}(y). Thus φ_{0,Σ}(z) is the product of its marginals.
Conversely, assume that φ_{0,Σ}(z) is the product of its marginals, i.e. x, y are independent. This implies that for any real-valued functions f(x), g(y) of the two vectors we have E f(x)g(y) = E f(x) E g(y). Let e_{i(k)} be the i-th unit vector in R^k, so that e_{i(k)}^T x = x_i; then for the covariance matrix we obtain

    Cov(x, y)_{i,j} = E x_i y_j = E( e_{i(k)}^T x · e_{j(l)}^T y ) = E e_{i(k)}^T x · E e_{j(l)}^T y = 0.
Recall the Gaussian location-scale model which was already introduced (see page 32):
Model M₃ Observed are n independent and identically distributed random variables X₁, …, X_n, each having law N(μ, σ²), where μ ∈ R and σ² > 0 are both unknown.
A 1 − α confidence interval is given by a pair of estimators μ̂₋, μ̂₊ such that

    P( [μ̂₋, μ̂₊] ∋ μ ) ≥ 1 − α,     (7.1)

meaning that the probability that the interval [μ̂₋, μ̂₊] covers μ is at least 1 − α (e.g. more than 95% for α = 0.05). Note that both μ̂₋, μ̂₊ are random variables (functions of the data), so the interval is in fact a random interval. Therefore the element sign is written in the reverse form ∋ to stress the fact that in (7.1) the interval is random, not μ (μ is merely unknown).
When σ² is known, it is easy to build a confidence interval based on the sample mean: since

    L(X̄_n) = N(μ, n^{-1}σ²),

it follows that

    L( (X̄_n − μ) n^{1/2}/σ ) = N(0, 1).     (7.2)
Definition 7.1.1 Suppose that P is a continuous law on R with density p such that p(x) > 0 for x > 0 and P((0, ∞)) ≥ 1/2. For every α ∈ (0, 1/2), the uniquely defined number z_α > 0 fulfilling

    ∫_{z_α}^∞ p(x) dx = α

is called the upper α-quantile of P.
The word upper is usually omitted for the standard normal distribution, since it is symmetric around 0. In our case it follows immediately from (7.2) that

    P( |X̄_n − μ| n^{1/2}/σ ≤ z_{α/2} ) = 1 − α,

hence

    [μ̂₋, μ̂₊] = [ X̄_n − z_{α/2} σ/√n, X̄_n + z_{α/2} σ/√n ]     (7.3)

is a 1 − α confidence interval for μ.
Obviously we have to know the variance σ² for this confidence interval, so the procedure breaks down for the location-scale model. In Proposition 3.0.5, page 32 we already encountered the sample variance:

    S_n² = n^{-1} Σ_{i=1}^n (X_i − X̄_n)² = n^{-1} Σ_{i=1}^n X_i² − (X̄_n)².
This appears to be a reasonable estimate to substitute for the unknown σ², for various reasons. First, S_n² is the variance of the empirical distribution P_n: when x₁, …, x_n are observed, the empirical distribution is a discrete law which assigns probability n^{-1} to each point x_i. From the point of view that x₁, …, x_n should be identified with the random variables X₁, …, X_n, this is a random probability distribution, with distribution function

    F_n(t) = n^{-1} Σ_{i=1}^n 1_{(−∞, t]}(X_i).
This is the empirical distribution function (e.d.f.). Obviously if Z is a random variable with law P̂ₙ then (assuming x₁, …, xₙ fixed)

    x̄ₙ = n⁻¹ Σ_{i=1}^n xᵢ = EZ,

    sₙ² = n⁻¹ Σ_{i=1}^n xᵢ² − (x̄ₙ)² = Var(Z).

Analogously to the sample mean, we write Sₙ² for sₙ² when this is construed as a random variable.
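The e.d.f. and the two empirical moments above are straightforward to compute; the following sketch (an illustration added here, not from the original text) does so for a fixed sample x₁, …, xₙ:

```python
def edf(xs):
    """Empirical distribution function t -> n^{-1} #{i : x_i <= t}."""
    n = len(xs)
    return lambda t: sum(x <= t for x in xs) / n

def empirical_moments(xs):
    """Mean and variance of a r.v. Z with law P_n (mass 1/n at each x_i)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum(x * x for x in xs) / n - mean ** 2  # n^{-1} sum x_i^2 - (xbar)^2
    return mean, var
```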
For the expectation of Sₙ² we obtain (writing Xᵢ = μ + σξᵢ, where the ξᵢ are i.i.d. standard normals)

    E Sₙ² = E n⁻¹ Σ_{i=1}^n (Xᵢ − X̄ₙ)² = E n⁻¹ Σ_{i=1}^n ((Xᵢ − μ) − (X̄ₙ − μ))²
          = E n⁻¹ Σ_{i=1}^n σ² (ξᵢ − ξ̄ₙ)² = σ² E (n⁻¹ Σ_{i=1}^n ξᵢ² − (ξ̄ₙ)²) = σ² (1 − n⁻¹),

since E ξᵢ² = 1 and E (ξ̄ₙ)² = n⁻¹. Thus Sₙ² systematically underestimates σ²; an unbiased estimate is obtained with the normalization (n − 1)⁻¹, and in this chapter we write Sₙ² = (n − 1)⁻¹ Σ_{i=1}^n (Xᵢ − X̄ₙ)² for this bias-corrected sample variance. Substituting Xᵢ = μ + σξᵢ, the statistic T(X) = (X̄ₙ − μ) n^{1/2} / Sₙ can then be written as

    T(X) = σ ξ̄ₙ n^{1/2} / ((n − 1)⁻¹ Σ_{i=1}^n σ² (ξᵢ − ξ̄ₙ)²)^{1/2} = ξ̄ₙ n^{1/2} / ((n − 1)⁻¹ Σ_{i=1}^n (ξᵢ − ξ̄ₙ)²)^{1/2}.
We see that the distribution of T does not depend on the parameters μ and σ²; it depends only on the sample size n (i.e. the number of i.i.d. standard normals ξᵢ involved).

Definition 7.2.1 The law of

    T = T(X) = (X̄ₙ − μ) n^{1/2} / Sₙ

is called the t-distribution with n − 1 degrees of freedom (denoted t_{n−1}). The statistic T is called the t-statistic.
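The fact that the law of T depends only on n can be checked by simulation. The sketch below (an added illustration; it uses the bias-corrected normalization (n − 1)⁻¹ in Sₙ²) estimates a tail probability of T under two quite different parameter pairs (μ, σ) and finds them essentially equal:

```python
import math
import random

def t_stat(xs, mu):
    """t-statistic (X_n - mu) n^{1/2} / S_n, with the unbiased S_n^2."""
    n = len(xs)
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)
    return (xbar - mu) * math.sqrt(n) / math.sqrt(s2)

def tail_freq(mu, sigma, n=5, thresh=2.0, reps=8000, seed=1):
    """Monte Carlo estimate of P(|T| > thresh) under N(mu, sigma^2) sampling."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        count += abs(t_stat(xs, mu)) > thresh
    return count / reps
```

Both estimates agree up to Monte Carlo error, and both exceed the normal tail probability 0.046, reflecting the heavier tails of the t-distribution.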
The t-distribution is symmetric around 0: P(T ≤ −x) = P(T ≥ x) for all x.
Theorem 7.2.3 In the Gaussian location-scale model, (i) Sₙ² is independent of X̄ₙ, and (ii) there are i.i.d. standard normal ξ₁, …, ξ_{n−1} such that

    Sₙ² = (σ² / (n − 1)) Σ_{i=1}^{n−1} ξᵢ².
Proof. Let X be the vector X = (X₁, …, Xₙ)ᵀ; then X has a multivariate normal distribution with covariance matrix σ²Iₙ. To describe the expectation vector, let 1ₙ = (1, …, 1)ᵀ be the vector in ℝⁿ consisting of 1s. Then

    L(X) = Nₙ(μ1ₙ, σ²Iₙ).
Let B be an orthogonal n × n matrix whose last row is

    bₙᵀ = n^{−1/2} 1ₙᵀ,

and set Y = BX. Then

    L(Y) = Nₙ(μB1ₙ, σ²Iₙ)

and the components Y₁, …, Yₙ are independent. (Indeed when the covariance matrix of a multivariate normal is of the form σ²Iₙ, or more generally a diagonal matrix, then the joint density of the Yᵢ is the product of its marginals). Moreover

    Yₙ = bₙᵀ X = n^{−1/2} 1ₙᵀ X = n^{1/2} X̄ₙ,    (7.4)

    E Yⱼ = E bⱼᵀ X = μ bⱼᵀ 1ₙ = 0,  j = 1, …, n − 1.
It follows that Y₁, …, Y_{n−1} are independent N(0, σ²) random variables and are independent of X̄ₙ. Now

    Σ_{i=1}^n Yᵢ² = ‖Y‖² = Xᵀ Bᵀ B X = Xᵀ X = Σ_{i=1}^n Xᵢ²,

so that

    Σ_{i=1}^{n−1} Yᵢ² = Σ_{i=1}^n Xᵢ² − n X̄ₙ² = (n − 1) Sₙ².

This shows that Sₙ² is a function of Y₁, …, Y_{n−1} and hence independent of X̄ₙ, which establishes (i). Dividing both sides in the last display by n − 1 and setting ξᵢ = Yᵢ/σ establishes (ii).
Definition 7.2.4 Let X₁, …, Xₙ be independent N(0, 1). The distribution of the statistic

    χ² = χ²(X) = Σ_{i=1}^n Xᵢ²

is called the chi-square distribution with n degrees of freedom (denoted χ²ₙ).
To find the form of the t-distribution, we need to find the density of the ratio of a normal and the square root of an independent χ²-variable. We begin with deriving the density fₙ of the χ²ₙ-distribution, which will turn out to be

    fₙ(x) = (2^{n/2} Γ(n/2))⁻¹ x^{n/2−1} exp(−x/2), x ≥ 0.

The following lemma immediately follows from the above definition: if Y₁, Y₂ are independent with L(Y₁) = χ²ₙ and L(Y₂) = χ²₁, then L(Y₁ + Y₂) = χ²_{n+1}.
Chi-square and t-distributions 83
Our proof will proceed by induction. Start with χ²₁: let X₁ be N(0, 1); then

    P(X₁² ≤ t) = P(−t^{1/2} ≤ X₁ ≤ t^{1/2}) = 2 ∫₀^{t^{1/2}} (2π)^{−1/2} exp(−z²/2) dz.

A change of variable x = z², dz = (1/2) x^{−1/2} dx gives

    P(X₁² ≤ t) = ∫₀^t (2πx)^{−1/2} exp(−x/2) dx

and we obtain

    f₁(x) = (2^{1/2} π^{1/2})⁻¹ x^{1/2−1} exp(−x/2), x ≥ 0.
Now

    Γ(1/2) = π^{1/2}    (7.5)

follows from the fact that f₁ integrates to one and the definition of the gamma function. We have thus obtained the density of χ²₁ as claimed.
For the induction step, we assume that L(Y₁) = χ²ₙ and L(Y₂) = χ²₁. By the previous lemma, f_{n+1} is the convolution of fₙ and f₁: taking the densities to be zero for negative arguments, we obtain

    f_{n+1}(x) = ∫₀^∞ fₙ(y) f₁(x − y) dy.

(Indeed the convolution of densities is the operation applied to the densities of two independent r.v.s for obtaining the density of the sum). Hence

    f_{n+1}(x) = ∫₀^x (2^{n/2} Γ(n/2))⁻¹ y^{n/2−1} exp(−y/2) · (2^{1/2} Γ(1/2))⁻¹ (x − y)^{−1/2} exp(−(x − y)/2) dy
               = (2^{(n+1)/2} Γ(n/2) Γ(1/2))⁻¹ exp(−x/2) ∫₀^x y^{n/2−1} (x − y)^{−1/2} dy.
The next theorem characterizes the t-distribution.
(i) The law tₙ is the law of the ratio

    n^{1/2} Z₁ / Z₂^{1/2},

where Z₁, Z₂ are independent r.v.s with standard normal and χ²ₙ-distribution, respectively.
(ii) The density of the law tₙ is

    fₙ(x) = Γ((n + 1)/2) / ((πn)^{1/2} Γ(n/2)) · (1 + x²/n)^{−(n+1)/2}, x ∈ ℝ.
Proof. (i) follows immediately from Definition 7.2.1 and Theorem 7.2.3: for a r.v. T with t_{n−1}-distribution we obtain, if the ξᵢ are defined as in the proof of Theorem 7.2.3,

    T = (X̄ₙ − μ) n^{1/2} / Sₙ = σ Z₁ / ((n − 1)⁻¹ Σ_{i=1}^{n−1} σ² ξᵢ²)^{1/2}
      = (n − 1)^{1/2} Z₁ / (Σ_{i=1}^{n−1} ξᵢ²)^{1/2} = (n − 1)^{1/2} Z₁ / Z₂^{1/2},

where Z₁ = (X̄ₙ − μ) n^{1/2}/σ is standard normal and Z₂ = Σ_{i=1}^{n−1} ξᵢ² has a χ²_{n−1}-distribution and is independent of Z₁.
To prove (ii), we note that the law tₙ is symmetric, so we can proceed via the law of the squared variable n Z₁²/Z₂. Here L(Z₁²) = χ²₁, and the joint density of Z₁², Z₂ is

    g(t, u) = (2^{1/2} Γ(1/2))⁻¹ t^{1/2−1} exp(−t/2) · (2^{n/2} Γ(n/2))⁻¹ u^{n/2−1} exp(−u/2), t, u ≥ 0.

Consequently

    P(Z₁²/Z₂ ≤ x) = ∫∫_{t/u ≤ x, u ≥ 0, t ≥ 0} (2^{(n+1)/2} Γ(1/2) Γ(n/2))⁻¹ u^{n/2−1} t^{−1/2} exp(−(t + u)/2) dt du.

We substitute z = t + u, v = t/u. Here the terms depending on z, together with appropriate constants (when we also divide and multiply by Γ((n + 1)/2)), form the density of χ²_{n+1}, so that the expression becomes, after integrating out z,

    ∫_{0 ≤ v ≤ x} Γ((n + 1)/2) / (Γ(1/2) Γ(n/2)) · v^{−1/2} (v + 1)^{−(n+1)/2} dv = P(Z₁²/Z₂ ≤ x).
The t-density with 3 degrees of freedom against the standard normal (dotted)
Clearly the t-distribution has heavier tails, which means that the quantiles (now called t_{α/2} in place of z_{α/2}) are farther out and the confidence interval is wider. A narrower confidence interval, for the same level 1 − α, is preferable. Thus the absence of knowledge of the variance σ² is reflected in less sharp confidence statements.
Theorem 7.2.8 Let t_{α/2} be the upper α/2-quantile of the t-distribution with n − 1 degrees of freedom. Then in Model M3, the interval

    [T₋, T₊] = [X̄ₙ − Sₙ t_{α/2}/√n, X̄ₙ + Sₙ t_{α/2}/√n]

is a 1 − α confidence interval for μ.
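For a numerical check (an added sketch; the quantile routine below integrates the tₙ density derived above numerically and bisects, rather than consulting tables), one can compute t_{α/2} and the resulting interval:

```python
import math

def t_density(x, n):
    """Density of t_n: Gamma((n+1)/2) / (sqrt(pi n) Gamma(n/2)) * (1 + x^2/n)^{-(n+1)/2}."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(math.pi * n) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def t_upper_quantile(alpha, n, hi=100.0, steps=4000):
    """Upper alpha-quantile of t_n via trapezoid integration of the tail and bisection."""
    def tail(z):
        h = (hi - z) / steps
        s = 0.5 * (t_density(z, n) + t_density(hi, n))
        s += sum(t_density(z + i * h, n) for i in range(1, steps))
        return s * h
    lo_z, hi_z = 0.0, hi
    for _ in range(40):
        mid = (lo_z + hi_z) / 2
        if tail(mid) > alpha:
            lo_z = mid   # tail still too heavy: quantile is to the right
        else:
            hi_z = mid
    return 0.5 * (lo_z + hi_z)

def t_interval(xs, alpha=0.05):
    """The interval of Theorem 7.2.8, with the unbiased sample variance."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    q = t_upper_quantile(alpha / 2, n - 1)
    half = s * q / math.sqrt(n)
    return xbar - half, xbar + half
```

For n − 1 = 10 degrees of freedom and α = 0.05 this reproduces the tabulated value t_{α/2} ≈ 2.228 quoted later in the text.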
The t-distribution was found by Gosset ("The probable error of a mean", Biometrika 6, 1908), who wrote under the pseudonym Student. The distribution is frequently called Student's t. A notion of studentization has been derived from it: in the Gaussian location model M2, the statistic

    Z = (X̄ₙ − μ₀) n^{1/2} / σ

is sometimes called the Z-statistic. It is standard normal when μ = μ₀ and can therefore be used to test the hypothesis μ = μ₀ (for the theory of testing hypotheses cf. later sections). In the location-scale model, σ is not known, and by substituting Sₙ one forms the t-statistic

    T = (X̄ₙ − μ₀) n^{1/2} / Sₙ.

The procedure of substituting the unknown variance σ² by its estimate Sₙ² is called studentization.
Some asymptotics 87
Remark 7.2.9 Consider the absolute moment of order r (an integer) of the tₙ-distribution:

    E|T|^r = mᵣ = Cₙ ∫₀^∞ x^r (1 + x²/n)^{−(n+1)/2} dx.

Since Z₂ has the same law as Σ_{i=1}^n ξᵢ² for some i.i.d. standard normals ξᵢ, we have by the law of large numbers

    n⁻¹ Σ_{i=1}^n ξᵢ² →_P E ξ₁² = 1.

This suggests that the law of n^{1/2} Z₁ / Z₂^{1/2}, i.e. the law tₙ, should become close to the law of Z₁ as n → ∞. Let us formally prove that statement. We begin with a recap of some probability notions.
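The law-of-large-numbers statement above is easy to observe numerically (a small added sketch; the simulation size is arbitrary):

```python
import random

def chi2_average(n, seed=0):
    """n^{-1} sum_{i=1}^n xi_i^2 for i.i.d. standard normals; tends to 1 by the LLN."""
    rng = random.Random(seed)
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
```

The deviation from 1 shrinks at the rate n^{−1/2}, since Var(ξ₁²) = 2.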
Definition 7.3.1 A sequence of distribution functions Fₙ converges in distribution to a distribution function F if

    Fₙ(t) → F(t) as n → ∞

for every point of continuity t of F. A point of continuity is obviously a point where F is continuous. Any distribution function F has left and right side limits at every point t, so continuity means that these limits coincide in t. When F is continuous then in the above statement t must run through all t ∈ ℝ. For instance, the distribution function of every N(μ, σ²) is continuous.
It is also said that a sequence of r.v.s Yₙ converges in distribution (or in law), written

    Yₙ ⇒d F,

when the d.f.s of Yₙ converge in distribution to F. One also writes Yₙ →_L Y for a r.v. Y having that distribution function F (or also Fₙ →_L F, Fₙ →_D F).
Example 7.3.2 Convergence in probability: if F is the d.f. of the random variable Y = 0 (which is always 0!) then

    F(t) = 1_{[0,∞)}(t).

Example 7.3.3 Let L(Yₙ) = B(n, λ/n) and L(Y) = Po(λ); then for all events A, P(Yₙ ∈ A) → P(Y ∈ A), and in particular L(Yₙ) converges in distribution to Po(λ).

Example 7.3.4 (Central limit theorem) Let Y₁, …, Yₙ be i.i.d. with distribution function F, let Ȳₙ = n⁻¹ Σ_{i=1}^n Yᵢ be the average (or sample mean) and σ² = Var(Y₁). Then for fixed F and n → ∞

    n^{1/2} (Ȳₙ − E Y₁) ⇒d N(0, σ²).    (7.6)

Lemma 7.3.5 (Slutsky) Suppose that Xₙ ⇒d P₀ and Yₙ →_P 0. Then

    L(Xₙ + Yₙ) ⇒d P₀.
Lemma 7.3.6 Suppose that Xₙ ⇒d F₀ and Yₙ →_P 0. Then Xₙ Yₙ →_P 0.

Proof. Let ε > 0 and δ > 0 be arbitrary and given. Suppose |Xₙ Yₙ| ≥ ε. Then, for every T > 0, either |Xₙ| > T, or if that is not the case, then |Yₙ| ≥ ε/T. Hence

    P(|Xₙ Yₙ| ≥ ε) ≤ P(|Xₙ| > T) + P(|Yₙ| ≥ ε/T).    (7.9)

Now for every T > 0

    P(|Xₙ| > T) = 1 − P(Xₙ ≤ T) + P(Xₙ < −T)
                ≤ 1 − Fₙ(T) + Fₙ(−T).

Since Fₙ converges to F₀ at both points T, −T (we may choose T so that these are continuity points of F₀), we find m₁ = m₁(T) (depending on T) such that for all n ≥ m₁

    P(|Xₙ| > T) ≤ 1 − F₀(T) + F₀(−T) + δ.

Select now T large enough such that

    1 − F₀(T) ≤ δ, F₀(−T) ≤ δ.

Then for all n ≥ m₁(T)

    P(|Xₙ| > T) ≤ 3δ.

On the other hand, once T is fixed, in view of convergence in probability to 0 of |Yₙ|, one can find m₂ such that for all n ≥ m₂

    P(|Yₙ| ≥ ε/T) ≤ δ.

In view of (7.9) we have for all n ≥ m = max(m₁, m₂)

    P(|Xₙ Yₙ| ≥ ε) ≤ 4δ.

Since δ > 0 was arbitrary, the result is proved.
We need an auxiliary result which despite its simplicity is still frequently cited with a name attached to it.

Lemma 7.3.7 Suppose that Xₙ →_P x₀ and that f is a function defined and continuous in a neighborhood of x₀. Then f(Xₙ) →_P f(x₀).

Proof. Consider an arbitrary ε > 0. Select δ > 0 small enough such that (x₀ − δ, x₀ + δ) is in the abovementioned neighborhood of x₀ and also fulfilling the condition that |z − x₀| ≤ δ implies |f(z) − f(x₀)| ≤ ε (by continuity of f such a δ can be found). Then the event |f(Xₙ) − f(x₀)| > ε implies |Xₙ − x₀| > δ and hence

    P(|f(Xₙ) − f(x₀)| > ε) ≤ P(|Xₙ − x₀| > δ).

Since the latter probability tends to 0 as n → ∞, we also have

    P(|f(Xₙ) − f(x₀)| > ε) → 0 as n → ∞

and since ε was arbitrary, the result is proved.
Theorem 7.3.8 The t-distribution with n degrees of freedom converges (in distribution) to the standard normal law as n → ∞.

Proof. Let

    Xₙ = Z₁ / (n⁻¹ Z₂)^{1/2}

where Z₁, Z₂ are independent r.v.s with standard normal and χ²ₙ-distribution, respectively. We know already

    n⁻¹ Z₂ →_P 1.

By Slutsky's theorem, for the r.v. Y_{1,n}

    Y_{1,n} := (n⁻¹ Z₂)^{−1/2} →_P 1,

since the function g(x) = x^{−1/2} is continuous at 1 and defined for x > 0. (Indeed we need consider only those x, since n⁻¹ Z₂ > 0 with probability one). Now

    Xₙ = Z₁ Y_{1,n} = Z₁ + Z₁ (Y_{1,n} − 1).

Now Y_{1,n} − 1 →_P 0, hence by Lemma 7.3.6 Z₁ (Y_{1,n} − 1) →_P 0. Moreover Z₁ is a constant sequence with law N(0, 1), which certainly converges in law to N(0, 1). Then by Lemma 7.3.5, Xₙ ⇒d N(0, 1).
We can translate this limiting statement about the t-distribution into a confidence statement.

Theorem 7.3.9 Let z_{α/2} be the upper α/2-quantile of N(0, 1). Then in the Gaussian location-scale model Mc,2, the interval

    [T₋, T₊] = [X̄ₙ − Sₙ z_{α/2}/√n, X̄ₙ + Sₙ z_{α/2}/√n]

is an asymptotic confidence interval for μ:

    lim_{n→∞} P_{μ,σ²}([T₋, T₊] ∋ μ) ≥ 1 − α.

Here the same quantiles as in the exact interval (7.3) are used, but Sₙ replaces the unknown σ. In summary: if σ² is unknown, one has the choice between an exact confidence interval (which keeps level 1 − α) based on the t-distribution, or an asymptotic interval (which keeps the confidence level only approximately) based on the normal law. The normal interval would be shorter in general: consider e.g. degrees of freedom 10 and α = 0.05; then for the t-distribution we have t_{α/2} = 2.228, whereas the normal quantile is z_{α/2} = 1.96 (cf. the tables of the normal and t-distributions on pp. 608/609 of [CB]).
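The price of the asymptotic interval can be seen in simulation: with the normal quantile 1.96 the interval of Theorem 7.3.9 undercovers for small n and approaches the nominal level 0.95 as n grows. (An added sketch; the parameter choices are arbitrary.)

```python
import math
import random

def asymptotic_ci(xs, z_half=1.96):
    """Interval of Theorem 7.3.9: X_n +/- S_n z_{alpha/2} / sqrt(n), unbiased S_n^2."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    half = s * z_half / math.sqrt(n)
    return xbar - half, xbar + half

def coverage(n, mu=0.0, sigma=1.0, reps=4000, seed=3):
    """Monte Carlo coverage probability of the asymptotic interval at sample size n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = asymptotic_ci(xs)
        hits += (lo <= mu <= hi)
    return hits / reps
```

For n = 5 the true coverage is only about P(|T₄| ≤ 1.96) ≈ 0.88, while for n = 100 it is already close to 0.95.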
Note that in subsection 1.3.2 we discussed a nonasymptotic confidence interval for the Bernoulli parameter p based on the Chebyshev inequality, and mentioned that alternatively a large sample approximation based on the CLT could also have been used. In this section we developed the tools for this.
Chapter 8
TESTING STATISTICAL HYPOTHESES
8.1 Introduction
Consider again the basic statistical model where X is an observed random variable with values in a sample space X and the law L(X) is known up to a parameter θ from a parameter space Θ: L(X) ∈ {P_θ; θ ∈ Θ}. This time we do not restrict the nature of Θ; this may be a general set, possibly even a set of laws (in this case θ is identified with L(X)). Suppose the parameter set is divided into two subsets: Θ = Θ₀ ∪ Θ₁ where Θ₀ ∩ Θ₁ = ∅. The problem is to decide on the basis of an observation of X whether the unknown θ belongs to Θ₀ or to Θ₁. Thus two hypotheses are formulated:

    H : θ ∈ Θ₀, the hypothesis,
    K : θ ∈ Θ₁, the alternative.
Of course both of these are hypotheses, but in testing theory they are treated in a nonsymmetric
way (to be explained). In view of this nonsymmetry, one of them is called the hypothesis and
the other the alternative. It is traditional to write them as above, with letters H for the first (the
hypothesis) and K for the second (the alternative).
Example 8.1.1 Assume that a new drug has been developed, which is supposed to have a higher
probability p of success when applied to an average patient. The new drug will be introduced only
if a high degree of certainty can be obtained that it is better. Suppose p0 is the known probability
of success of the old drug. Clinical trials are performed to test the hypothesis that p > p0 . For
the new drug, n patients are tested independently, and success of the drug is measured (we assume that only success or failure of the treatment can be seen in each case). Let the j-th experiment (patient) for the new drug be Xⱼ; assume that the Xⱼ are independent B(1, p). Thus the observations are X = (X₁, …, Xₙ), where the Xⱼ are i.i.d. Bernoulli r.v.s, θ = p and Θ = (0, 1). The hypotheses are Θ₀ = (0, p₀] and Θ₁ = (p₀, 1).
The motivation for a nonsymmetric treatment of the hypotheses is evident in this example: if the
statistical evidence is inconclusive, one would always stay with the old drug. There can be no
question of treating H and K the same way. Recall that in section 5.5 we briefly discussed the problem of estimating a signal θ ∈ {0, 1} (binary channel, Gaussian channel), where basically both values 0 and 1 are treated the same way, e.g. we use a Bayesian decision rule for prior probabilities 1/2. In contrast, here one will decide θ = 1, i.e. decide in favor of the new drug, only if there is reasonable statistical certainty.
Formally, a test is usually defined as a statistic which is the indicator function of a set S^c ⊆ X, such that

    φ(X) = 1_{S^c}(X),

where a value φ(X) = 1 is understood as a decision that θ ∈ Θ₀ is rejected (and 0 that it is accepted). In the above example, a reasonable test would be given by a rejection region

    S^c = {x : n⁻¹ Σ_{j=1}^n xⱼ > c}

for realizations x = (x₁, …, xₙ), where c is some number fixed in advance. One would decide θ ∈ Θ₁ if the number of successes in the sample is large enough.
Tests and estimators have in common that they are statistics which are also decision rules. But the nature of the decisions is different, which is reflected in the loss functions. The natural loss function for testing problems is: the loss is 0 if the decision is correct, and 1 otherwise. Thus if d ∈ {0, 1} is one of the two possible decisions then the loss function is

    L(d, θ) = d if θ ∈ Θ₀,
    L(d, θ) = 1 − d if θ ∈ Θ₁.

As for estimation, the risk of a decision rule at parameter value θ is the expected loss when θ is the true parameter. The decision rule is φ(X) = 1_{S^c}(X); its risk is

    R(φ, θ) = E_θ L(φ(X), θ) = E_θ φ(X) = P_θ(S^c) if θ ∈ Θ₀,
    R(φ, θ) = 1 − E_θ φ(X) = P_θ(S) if θ ∈ Θ₁.

Thus the risk coincides with the probability of error in each case: for θ ∈ Θ₀, an erroneous decision is made when X ∈ S^c; when θ ∈ Θ₁ is true, then the error in the decision occurs when X ∈ S.
Since both probabilities of error are functions of E_θ(1 − φ(X)) = P_θ(S), one can equivalently work with the operation characteristic (OC):

    OC(φ, θ) = P_θ(S) = 1 − R(φ, θ) if θ ∈ Θ₀,
    OC(φ, θ) = P_θ(S) = R(φ, θ) if θ ∈ Θ₁.

A test with zero risk would require a set S such that P_θ(S) = 1 if θ ∈ Θ₀ and P_θ(S) = 0 if θ ∈ Θ₁. This is possible only in degenerate cases: if such an S ⊆ X exists then the families {P_θ; θ ∈ Θ₀} and {P_θ; θ ∈ Θ₁} are said to be orthogonal. In this case sure decisions are possible according to whether X ∈ S or not, and one is led outside the realm of statistics.
Example 8.1.3 Let U(a, b) be the uniform law on the interval (a, b), Θ = {0, 1} and P₀ = U(0, 1), P₁ = U(1, 2). Here S = (0, 1) gives zero error probabilities.
In typical statistical situations, the probabilities P_θ(S) depend continuously on the parameter, and the sets Θ₀, Θ₁ border each other. In this case the risk R(φ, θ) cannot be near 0 on the common border.
Consider for instance the family N(μ, 1), μ ∈ ℝ, with hypotheses H : μ ≤ 0, K : μ > 0, one observation X, and acceptance regions of the form

    Sₐ = {x : x ≤ a}

for some a. The OC of any such test φₐ = 1_{Sₐᶜ} is (if ξ is a standard normal r.v. such that X = μ + ξ)

    OC(φₐ, μ) = P_μ(X ≤ a) = Φ(a − μ).

For the Bernoulli example, the CLT yields an analogous normal approximation to the OC of the test with rejection region {X̄ₙ > c}, where ≈ means approximate equality. For a visualization of this approximation to the OC, see the plot below.
These examples show that it is not generally possible to keep error probabilities uniformly small
under both hypotheses.
OC of the critical region {x > 2} for testing H : μ ≤ 0 vs. K : μ > 0 for the family N(μ, 1), μ ∈ ℝ

Approximation to the OC for the critical region {X̄ₙ > c} for testing H : p ≤ p₀ vs. K : p > p₀ in the i.i.d. Bernoulli model (Model Md,1), with p₀ = 0.6, c = 0.8, n = 15
Thus significance level α means that the probability of an error of the first kind is uniformly less than α. The power is the probability of not making an error of the second kind, for a given θ in the alternative. In terms of the risk, φ has level α if R(φ, θ) ≤ α for all θ ∈ Θ₀, and the power is

    β_φ(θ) = 1 − R(φ, θ), θ ∈ Θ₁.

Tests and confidence sets 97

In terms of the operation characteristic,

    β_φ(θ) = 1 − OC(φ, θ), θ ∈ Θ₁.
In Example 8.1.1, it is particularly apparent why the error of the first kind is very sensitive, so that its probability should be kept under a given small α. When actually the old drug is better (θ ∈ Θ₀), but we decide erroneously that the new one is better, it is a very painful error indeed, with potentially grave consequences. We want a decision procedure which limits the probability of such a catastrophic misjudgment. But given this restriction, opportunities for switching to a better drug should not be missed, i.e. when the new drug is actually better, then the decision should be able to detect it with as high a probability as possible.

For drug testing, procedures like this (a significance level should be kept for some small α) are required by law in every developed country. In general statistical practice, common values are α = 0.05 and α = 0.01.
If one of the hypothesis sets Θ₀, Θ₁ consists of only one element, the respective hypothesis (or alternative) is called simple, otherwise it is called composite. An important special case of testing theory is the one where both hypothesis and alternative are simple, i.e. Θ₀ = {θ₀}, Θ₁ = {θ₁} (which means the whole parameter space consists of the two elements θ₀, θ₁); in this case the Neyman-Pearson fundamental lemma applies (see below).

A test where Θ₀ is simple is called a significance test. The question to be decided there is only whether the hypothesis H : θ = θ₀ can be rejected with significance level α; the alternatives are usually not specified.
Recall that a confidence set for θ with confidence level 1 − α is a random set A(X) ⊆ Θ such that

    P_θ(A(X) ∋ θ) ≥ 1 − α, θ ∈ Θ.

Confidence intervals A(X) = [T₋(X), T₊(X)] were treated in detail in section 7.1 for the Gaussian location-scale model and in the introductory section 1.3. Confidence sets are also called domain estimators of the parameter θ (the estimators which pick a value of θ rather than a covering set are called point estimators).
There is a close connection between confidence sets and significance tests.

1. Suppose a confidence set A(X) for level 1 − α is given. Let θ₀ ∈ Θ be arbitrary and consider a simple hypothesis H : θ = θ₀ (vs. alternative K : θ ≠ θ₀). Construct a test φ_{θ₀} by

    φ_{θ₀}(X) = 1 − 1_{A(X)}(θ₀),

where 1_{A(X)} is the indicator of the confidence set, as a function of θ₀. In other words, H : θ = θ₀ is rejected if θ₀ is outside the confidence set. Then

    P_{θ₀}(φ_{θ₀} = 1) = 1 − P_{θ₀}(A(X) ∋ θ₀) ≤ α,

so φ_{θ₀} is a significance test of level α.
2. We saw that a confidence set generates a family of significance tests, one for each θ₀ ∈ Θ. Assume now conversely that such a family φ_θ, θ ∈ Θ is given, and that they all observe level α. Define

    A(X) = {θ : φ_θ(X) = 0}.

Then A(X) is a confidence set for θ with confidence level 1 − α.
For a more general setting, let γ(θ) be a function of the parameter (with values in an arbitrary set Γ). A confidence set for γ(θ) is defined by

    P_θ(A(X) ∋ γ(θ)) ≥ 1 − α, θ ∈ Θ.

For instance θ might have two components: θ = (θ₁, θ₂), and γ(θ) = θ₁. Then the above family of tests should be indexed by γ₀ ∈ Γ, and φ_{γ₀} has level α for a hypothesis H : γ(θ) = γ₀. This hypothesis is composite if γ is not one-to-one (then φ_{γ₀} cannot be called a significance test).
As an example, consider the Gaussian location-scale model for unknown σ² (Model Mc,2). Here θ = (μ, σ²) ∈ ℝ × (0, ∞). Consider the quantity

    T_μ(X) = (X̄ₙ − μ) n^{1/2} / Sₙ

used to build a confidence interval

    [T₋, T₊] = [X̄ₙ − Sₙ t_{α/2}/√n, X̄ₙ + Sₙ t_{α/2}/√n]

for the unknown expected value μ. The level 1 − α was kept for all unknown σ² > 0, i.e. we have a confidence interval for the parameter function γ(θ) = μ.
As we remarked, T_μ(X) depends on the parameter μ and therefore is not a statistic. Such a function of both the observations and the parameter, used to build a confidence interval, is called a pivotal quantity. The knowledge of the law of the pivotal quantity under the respective parameter is the basis for a confidence interval. When looking at the significance test derived from T_μ(X), for a hypothesis H : μ = μ₀, we find that the test is

    φ_{μ₀}(X) = 1 if |T_{μ₀}(X)| > t_{α/2}    (8.1)

where t_{α/2} is the upper α/2-quantile of the t-distribution with n − 1 degrees of freedom. From this point of view, when μ = μ₀ is a known hypothesis, T_{μ₀}(X) does not depend on an unknown parameter, and is thus a statistic. In this example, it is the t-statistic for testing H : μ = μ₀, and the test (8.1) is the two-sided t-test. The basic result implied by Theorem 7.2.8 about this test is the following.
Theorem 8.2.1 In the Gaussian location-scale model Mc,2, for sample size n, consider the hypothesis H : μ = μ₀. Let t_{α/2} be the upper α/2-quantile of the t-distribution with n − 1 degrees of freedom. Then the two-sided t-test (8.1) has level α for any unknown σ² > 0.
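A simulation sketch of Theorem 8.2.1 (an added illustration; it hard-codes the quantile 2.228 for n = 11, i.e. 10 degrees of freedom, the value quoted earlier in the text): the rejection rate under H stays near α = 0.05 for quite different values of the nuisance parameter σ.

```python
import math
import random

T_QUANT_10DF = 2.228  # upper 2.5% point of the t-distribution with 10 degrees of freedom

def two_sided_t_test(xs, mu0, t_half=T_QUANT_10DF):
    """Test (8.1): reject H: mu = mu0 iff |T_{mu0}(X)| exceeds the t-quantile."""
    n = len(xs)
    xbar = sum(xs) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    t = (xbar - mu0) * math.sqrt(n) / s
    return 1 if abs(t) > t_half else 0

def rejection_rate(mu, sigma, mu0, n=11, reps=5000, seed=7):
    """Monte Carlo rejection probability of the two-sided t-test under N(mu, sigma^2)."""
    rng = random.Random(seed)
    return sum(two_sided_t_test([rng.gauss(mu, sigma) for _ in range(n)], mu0)
               for _ in range(reps)) / reps
```

Under the hypothesis the rate is about 0.05 regardless of σ; under alternatives it grows, illustrating nontrivial power.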
Suppose now that A′(X) is another confidence set of level 1 − α such that

    A′(X) ⊆ A(X),

i.e. A′(X) is contained in A(X) (in the case of intervals, A′(X) would be shorter or of equal length). For the respective families φ_{θ₀}, φ′_{θ₀}, θ₀ ∈ Θ of α-significance tests this means

    {φ_{θ₀} = 1} ⊆ {φ′_{θ₀} = 1},

hence

    P_θ(φ_{θ₀} = 1) ≤ P_θ(φ′_{θ₀} = 1) for all θ.

At θ ≠ θ₀ these are precisely the respective powers of the two tests φ_{θ₀}, φ′_{θ₀}. (At θ = θ₀ the relation implies that for A(X) to keep level 1 − α, it is sufficient that A′(X) keeps this level). Thus φ′_{θ₀} has uniformly better (or at least as good) power for all θ ≠ θ₀.

It was mentioned that shorter confidence intervals are desirable (given a confidence level), since they enable sharper decision making. Translating this into a power relation for tests, we have made the idea more transparent. The assumed inclusion A′(X) ⊆ A(X) implies a larger critical region for φ′_{θ₀}: {φ_{θ₀} = 1} ⊆ {φ′_{θ₀} = 1}. This does not describe all situations in which φ′_{θ₀} might have better power (and thus A′(X) is better in some sense); we shall not further investigate the power of confidence intervals here but will concentrate on tests instead.
However, asymptotic confidence intervals should be discussed briefly. The statement of Theorem 7.3.9 can be translated immediately into the language of test theory. When the law of the observed r.v. X depends on n, we write Xₙ and L(Xₙ) ∈ {P_{θ,n}; θ ∈ Θ} for the family of laws. Here the observation space Xₙ might also depend on n, as is the case for n i.i.d. observed variables.

Definition 8.2.2 (i) A sequence of tests φₙ = φₙ(Xₙ) for testing H : θ ∈ Θ₀ vs. K : θ ∈ Θ₁ has asymptotic level α if

    limsup_{n→∞} P_{θ,n}(φₙ = 1) ≤ α for all θ ∈ Θ₀.

By Theorem 7.3.9, the interval

    [T₋, T₊] = [X̄ₙ − Sₙ z_{α/2}/√n, X̄ₙ + Sₙ z_{α/2}/√n]

fulfills

    liminf_{n→∞} P_{μ,σ²}([T₋, T₊] ∋ μ) ≥ 1 − α,

where z_{α/2} is a normal quantile. Thus if φ_{μ₀} is the derived test for H : μ = μ₀ then

    limsup_{n→∞} P_{μ₀,σ²}(φ_{μ₀} = 1) = 1 − liminf_{n→∞} P_{μ₀,σ²}([T₋, T₊] ∋ μ₀) ≤ α,

i.e. the derived tests have asymptotic level α for H : μ = μ₀ against the alternative Θ₁(μ₀) = {(μ, σ²) : μ ≠ μ₀}.
Above, the first plot, in the Gaussian location model with σ² = 1, gives the OC of a test of H : μ ≤ 0 vs. K : μ > 0 of type

    φ(X) = 1 if X̄ₙ > cₙ, 0 otherwise,

where cₙ is selected such that φ is an α-test, for α = 0.05 and sample sizes n = 1, 2, 4 respectively. This is the same situation as in the first plot on p. 96, only α is selected as one of the common values (in the other figure we just took c₁ = 2 and did not care about the resulting α), and three sample sizes are plotted. This is a one-sided Gauss test; we have not yet discussed the respective theory, but it can be observed that these tests keep level α on the whole composite hypothesis H : μ ≤ 0. Moreover, the behaviour of consistency is visible (for larger n, the power increases).

The second plot concerns the simple hypothesis H : μ = 0 in the same model (σ² = 1, α = 0.05, n = 1, 2, 4) and the two-sided Gauss test (8.2) derived from the confidence interval (7.3). The middle OC-line for n = 2 is dotted.
A test φ′ of level α is called uniformly most powerful (UMP) if for every level-α test φ we have

    E_θ φ ≤ E_θ φ′, for all θ ∈ Θ₁.

Typically it is not possible to find such a UMP test; some tests do better at particular points in the alternative, at the expense of the power at other points. An example is given in the following plot.

Power of one-sided and two-sided Gauss test for H : μ = 0 vs. K : μ ≠ 0
Model Mc The observed random variable X = (X₁, …, X_k) is continuous with values in ℝ^k and L(X) ∈ {P_θ, θ ∈ Θ}. Each law P_θ is described by a joint density p_θ(x) = p_θ(x₁, …, x_k), and Θ ⊆ ℝ^d. (Earlier we required that Θ be an open set, but this is omitted now).
The Neyman-Pearson Fundamental Lemma 103
Definition 8.3.2 Assume Model Mc, and that Θ = {θ₀, θ₁} consists of only two elements. A test φ for the hypotheses

    H : θ = θ₀
    K : θ = θ₁

is called a Neyman-Pearson test of level α if

    φ(x) = 1 if p_{θ₁}(x) > c·p_{θ₀}(x), 0 otherwise,

where the value c is chosen such that

    P_{θ₀}(φ(X) = 1) = α.    (8.3)
We should first show that a Neyman-Pearson test exists. Of course we can take any c and build a test according to the above rule. This rule seems plausible: given x, each of the two densities can be regarded as a likelihood. We might say that θ₁ is more likely if the ratio of likelihoods p_{θ₁}(x)/p_{θ₀}(x) is sufficiently large. We recognize an application of the likelihood principle (recall that this consists in regarding the density as a function of θ when x is already observed, and assigning corresponding likelihoods to each θ).

The question is only whether c can be chosen such that (8.3) holds. Define the likelihood ratio L(x) = p_{θ₁}(x)/p_{θ₀}(x) (setting, for now, L(x) = 0 if p_{θ₀}(x) = 0); the random variable L(X) then has distribution function under P_{θ₀}

    F_L(t) = P_{θ₀}(L(X) ≤ t).
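For two simple Gaussian hypotheses the likelihood ratio and the Neyman-Pearson rule can be written down explicitly. The sketch below (an added illustration; the concrete parameter values are arbitrary) also estimates 1 − F_L(c), i.e. the size of the test, by simulation:

```python
import math
import random

def likelihood_ratio(xs, mu0, mu1, sigma=1.0):
    """L(x) = p_{theta1}(x) / p_{theta0}(x) for i.i.d. N(mu, sigma^2) observations."""
    q0 = sum((x - mu0) ** 2 for x in xs)
    q1 = sum((x - mu1) ** 2 for x in xs)
    return math.exp((q0 - q1) / (2.0 * sigma ** 2))

def np_test(xs, mu0, mu1, c):
    """Neyman-Pearson rule: reject H: theta = theta0 iff L(x) > c."""
    return 1 if likelihood_ratio(xs, mu0, mu1) > c else 0

def size_of_test(c, mu0=0.0, mu1=1.0, n=5, reps=4000, seed=11):
    """Monte Carlo estimate of P_{theta0}(reject), i.e. 1 - F_L(c)."""
    rng = random.Random(seed)
    return sum(np_test([rng.gauss(mu0, 1.0) for _ in range(n)], mu0, mu1, c)
               for _ in range(reps)) / reps
```

As c increases the size decreases continuously toward 0, which is exactly the behaviour used below to solve (8.3) under Assumption L.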
Recall that distribution functions F are monotone increasing, right continuous, and 0 at −∞, 1 at +∞. To ensure (8.3), we make

Assumption L The distribution function F_L(t) is continuous.

In this case

    P_{θ₀}(L(X) = t) = 0

for all t; now for c > 0 consider the probability P_{θ₀}(φ(X) = 1). Since

    P_{θ₀}(p_{θ₀}(X) = 0) = 0,

we have

    P_{θ₀}(φ(X) = 1) = P_{θ₀}(L(X) > c) = 1 − F_L(c),

which is continuous and monotone in c, with limit 0 for c → ∞. Furthermore, F_L(t) = 0 for all t < 0 since L(X) is nonnegative, and in view of continuity of F_L (Assumption L) we have

    1 − F_L(c) → 1 as c → 0.

Hence for every α ∈ (0, 1) there is a c with 1 − F_L(c) = α, i.e. (8.3) can be fulfilled.
In Example 8.3.3 (the Gaussian case), the event L(X) = t translates into an event of the form X̄ₙ = t₀, where t₀ is a well defined number if θ₁ ≠ θ₀. However for a normal X̄ₙ this probability is 0 for any t₀. Moreover L(X) = 0 cannot happen, since exp(z) > 0 for all z, so that Assumption L is fulfilled if θ₁ ≠ θ₀.
Example 8.3.4 Let U(a, b) be the uniform law on the interval [a, b] and P_{θ₀} = U(0, 1), P_{θ₁} = U(0, a) where 0 < a < 1. Then p_{θ₁}(x) = a⁻¹ for x ∈ [0, a], 0 otherwise, and

    L(x) = p_{θ₁}(x)/p_{θ₀}(x) = a⁻¹ if x ∈ [0, a], 0 otherwise.

Under P_{θ₀} we then have

    P_{θ₀}(L(X) = a⁻¹) = a,
    P_{θ₀}(L(X) = 0) = 1 − a.
In the latter example we cannot guarantee the existence of a Neyman-Pearson test. We will remedy this situation later; let us first prove the optimality of Neyman-Pearson tests (abbreviated N-P tests henceforth). We do not claim that there is only one N-P test for a given level α (the c may not be unique, and also one can use different versions of the densities, e.g. modifying them in some points, etc.).

Theorem 8.3.5 Let φ_{NP} = 1_{S_{NP}^c} be a Neyman-Pearson test of level α with acceptance region S_{NP}, and let φ = 1_{S^c} be any other test of level α with acceptance region S. Then

    P_{θ₁}(S) ≥ P_{θ₁}(S_{NP}),

i.e. the N-P test has the smaller probability of an error of the second kind.

Proof. Write

    S = (S ∩ S_{NP}) ∪ A,
    S_{NP} = (S ∩ S_{NP}) ∪ A′,

where A = S \ S_{NP} and A′ = S_{NP} \ S. Since both tests keep level α, and the N-P test has size exactly α,

    P_{θ₀}(S) ≥ 1 − α = P_{θ₀}(S_{NP}),

which implies

    P_{θ₀}(A) ≥ P_{θ₀}(A′).    (8.5)

Since A ⊆ S_{NP}^c, we have for any x ∈ A that p_{θ₁}(x) > c·p_{θ₀}(x), hence

    P_{θ₀}(A) = ∫_A p_{θ₀}(x) dx ≤ c⁻¹ ∫_A p_{θ₁}(x) dx = c⁻¹ P_{θ₁}(A).    (8.6)

Similarly, for any x ∈ A′ ⊆ S_{NP} we have p_{θ₁}(x) ≤ c·p_{θ₀}(x), hence with (8.5) and (8.6)

    P_{θ₁}(A′) ≤ c·P_{θ₀}(A′) ≤ c·P_{θ₀}(A) ≤ P_{θ₁}(A),

and therefore P_{θ₁}(S_{NP}) ≤ P_{θ₁}(S).
We now return to the question of existence. Recall that

    P_{θ₀}(φ(X) = 1) = 1 − F_L(c)

and that F_L(c) is continuous from the right. When Assumption L is not fulfilled, the following situation may occur: there is a c₀ such that for all c < c₀, F_L(c) < 1 − α, and at c₀ we have F_L(c₀) > 1 − α. In other words, the function 1 − F_L(c) jumps in such a way that α is not attained. In order to deal with this situation, let us generalize the notion of a test function.
Definition 8.3.6 A randomized test (based on the data X) is any statistic φ such that 0 ≤ φ(X) ≤ 1.

When the value of φ is between 0 and 1, the interpretation is that the decision between hypothesis H and alternative K is taken randomly, such that φ is the probability of deciding K. Thus, given the data X = x, a Bernoulli random variable Z is generated with law (conditional on x) L(Z) = B(1, φ(x)), and the decision is Z. The former nonrandomized test functions are special cases: when φ(x) = 1 or φ(x) = 0, the Bernoulli r.v. Z is degenerate and takes the corresponding value 1 or 0 with probability one. For a randomized test we have, for θ = θ₀ or θ = θ₁, and writing P_θ(Z = 1) for the unconditional probability in Z (when X is random),

    P_θ(Z = 1) = E_θ φ(X),

so that both the errors of first and second kind are a function of the expected value of φ(X) under the respective hypothesis.
This method of introducing artificial randomness into the decision process should be regarded with common sense reservations from a practical point of view. However, randomized tests provide a completion of the theory and therefore a better understanding of the basic problems of statistics. For instance, inclusion of randomized tests allows one to state that there is always a level α test: take φ(X) = α, independently of the data. But the power of that trivial test is also α, i.e. not very good.
In the above situation, when there is no c such that F_L(c) = 1 − α, consider the left limit of F_L at c₀:

    F_{L,−}(c₀) := lim_{c↗c₀} F_L(c) = P_{θ₀}(L(X) < c₀)

(this always exists for monotone functions); the height of the jump of F_L at c₀ is the probability that the r.v. L(X) takes the value c₀:

    F_L(c₀) − F_{L,−}(c₀) = P_{θ₀}(L(X) = c₀).

Define

    γ = (α − P_{θ₀}(L(X) > c₀)) / P_{θ₀}(L(X) = c₀);

then

    α = P_{θ₀}(L(X) > c₀) + γ·P_{θ₀}(L(X) = c₀).    (8.9)

Moreover, since F_{L,−}(c₀) is a limit of values which are all < 1 − α,

    F_{L,−}(c₀) ≤ 1 − α,

hence

    0 < γ = (F_L(c₀) − (1 − α)) / (F_L(c₀) − F_{L,−}(c₀)) ≤ 1.

That allows us to construe γ as the value of a randomized test φ, which is taken if the event L(X) = c₀ occurs. We can then define Neyman-Pearson tests for any statistical model, provided the likelihood ratio L(X) is defined as a random variable. Then the distribution function F_L is defined, and we may construct a level α test as above.
A randomized test φ, based on the likelihood ratio L(x) = p_{θ₁}(x)/p_{θ₀}(x) (where now L(x) := ∞ if p_{θ₀}(x) = 0), is called a randomized Neyman-Pearson test of level α if there exist c ∈ [0, ∞) and γ ∈ [0, 1] such that

    φ(x) = 1 if L(x) > c,  γ if L(x) = c,  0 if L(x) < c,

and such that

    P_{θ₀}(L(X) > c) + γ·P_{θ₀}(L(X) = c) = α.    (8.10)

Note we modified the definition of L(x): formerly we took L(x) = 0 if p_{θ₀}(x) = 0. But under H such values of x do not occur anyway (or occur with probability 0), so that F_L as considered before remains the same and a level α is attained. The modification ensures that, if we decide on the basis of L (reject if L is large enough), then p_{θ₀}(x) = 0 implies that the decision is always 1.
Randomized Neyman-Pearson tests of any level α ∈ (0, 1) exist, and they are optimal among all randomized tests of level α.

Proof. We have shown existence above: L(X) is a well defined r.v. if X has law P_{θ₀} (it takes the value ∞ only with probability 0); it has distribution function F_L. If c₀ exists such that F_L(c₀) = 1 − α then take c = c₀ and γ = 0. Otherwise, find c and γ as above, fulfilling (8.9). Only the properties of distribution functions were used for establishing (8.9), i.e. (8.10). Let Z be the randomizing random variable, which has conditional law L(Z) = B(1, φ(x)) given X = x. Then the probability under H that H is rejected is

    P_{θ₀}(Z = 1) = E_{θ₀} φ(X)    (8.11)
                  = P_{θ₀}(L(X) > c) + γ·P_{θ₀}(L(X) = c) = α,

i.e. the test has indeed size α. The optimality proof is analogous to Theorem 8.3.5. We assume that the p_θ(x) are densities; the case of probability functions requires only changes in notation.
According to (8.11), let φ be a N-P test with given c and γ, and let ψ be any other randomized test of level α. Set S_> = {x : p_{θ₁}(x) > c·p_{θ₀}(x)} and S_= = {x : p_{θ₁}(x) = c·p_{θ₀}(x)}. Then

    E_{θ₁} ψ(X) = ∫ ψ(x)(p_{θ₁}(x) − c·p_{θ₀}(x)) dx + c ∫ ψ(x) p_{θ₀}(x) dx.

The second term on the right is bounded from above by cα, since ψ is a level α test. For the first term, since p_{θ₁}(x) − c·p_{θ₀}(x) > 0 on S_> and ψ(x) ≤ 1, while p_{θ₁}(x) − c·p_{θ₀}(x) ≤ 0 outside S_>, we obtain an upper bound by substituting 1 for ψ on S_> and 0 outside. Hence

    E_{θ₁} ψ(X) ≤ ∫_{S_>} (p_{θ₁}(x) − c·p_{θ₀}(x)) dx + cα
               = ∫ φ(x)(p_{θ₁}(x) − c·p_{θ₀}(x)) dx + c ∫ φ(x) p_{θ₀}(x) dx
               = ∫_{S_>} φ(x) p_{θ₁}(x) dx + c ∫_{S_=} φ(x) p_{θ₀}(x) dx
               = ∫_{S_> ∪ S_=} φ(x) p_{θ₁}(x) dx = E_{θ₁} φ(X).
In some cases the Neyman-Pearson lemma allows the construction of UMP tests for composite hypotheses. Consider the Gaussian location model (Model Mc,1), for sample size n, and the hypotheses H : μ ≤ μ₀, K : μ > μ₀. Consider the one-sided Gauss test φ_{μ₀}: for the test statistic

    Z_{μ₀}(X) = (X̄ₙ − μ₀) n^{1/2} / σ

the test is defined by

    φ_{μ₀}(X) = 1 if Z_{μ₀}(X) > z_α    (8.12)

(0 otherwise), where z_α is the upper α-quantile of N(0, 1). As was argued in Example 8.3.3, condition L is fulfilled here; hence for Neyman-Pearson tests of two simple hypotheses within this model, randomization is not needed. We have composite hypotheses now, but the following can be shown.

Proposition 8.3.9 In the Gaussian location model (Model Mc,1), for sample size n and the test problem H : μ ≤ μ₀ vs. K : μ > μ₀, for any 0 < α < 1 the one-sided Gauss test (8.12) is a UMP α-test.
Likelihood ratio tests 109
Proof. Consider now any point μ₁ > μ₀. We claim that for the simple hypotheses H : μ = μ₀, K : μ = μ₁ the test φ_{μ₀} is a Neyman-Pearson test of level α. Indeed, when p_{μ₀}, p_{μ₁} are the densities, then for x = (x₁, …, xₙ) (ϕ being the standard normal density)

    L(x) = ∏_{i=1}^n ϕ((xᵢ − μ₁)/σ) / ϕ((xᵢ − μ₀)/σ) = exp(−(Σ_{i=1}^n (xᵢ − μ₁)² − Σ_{i=1}^n (xᵢ − μ₀)²) / (2σ²))
         = exp(n x̄ₙ (μ₁ − μ₀)/σ²) · exp(−(nμ₁² − nμ₀²)/(2σ²)).

Since μ₁ > μ₀, L(x) is a monotone increasing function of x̄ₙ, and L(x) > c is equivalent to x̄ₙ > c′ for some c′. In turn, x̄ₙ is a monotone function of Z_{μ₀}(x) = (x̄ₙ − μ₀) n^{1/2}/σ, thus L(x) > c is equivalent to Z_{μ₀}(x) > c′′. We find c′′ from the level condition:

    P_{μ₀}(Z_{μ₀}(X) > c′′) = α,

which gives c′′ = z_α, so that the Neyman-Pearson test coincides with the one-sided Gauss test φ_{μ₀}, whatever the value of μ₁ > μ₀.
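The power function of the one-sided Gauss test has a closed form: writing Z_{μ₀} = Z + (μ − μ₀)√n/σ with Z standard normal gives β(μ) = 1 − Φ(z_α − (μ − μ₀)√n/σ). A sketch of this computation (added here, using the error function from the standard library):

```python
import math

def phi_cdf(z):
    """Standard normal distribution function Phi via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gauss_test_power(mu, mu0, sigma, n, z_alpha=1.645):
    """Power of the one-sided Gauss test (8.12):
    P_mu(Z_{mu0} > z_alpha) = 1 - Phi(z_alpha - (mu - mu0) sqrt(n) / sigma)."""
    shift = (mu - mu0) * math.sqrt(n) / sigma
    return 1.0 - phi_cdf(z_alpha - shift)
```

At μ = μ₀ the power equals α (here 0.05, with z_{0.05} ≈ 1.645); it increases both in μ − μ₀ and in n, which is the consistency behaviour visible in the earlier OC plots.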
K : 1
is called a likelihood ratio test (LR test) if there exist c [0, ) such that
1 if L(x) > c
(x) = { .
0 if L(x) < c
Note that for L(x) = c we made no requirement; any value (x) in [0, 1] is possible, so that the
test is possibly a randomized one. Neyman-Pearson tests are a special case for simple hypotheses.
One interpretation is the following. Suppose that the suprema over both hypotheses are attained,
so that for certain θ̂_i(x) ∈ Θ_i, i = 0, 1 we have

    sup_{θ∈Θ_i} p_θ(x) = p_{θ̂_i(x)}(x),  i = 0, 1.

Then the θ̂_i(x) are maximum likelihood estimators (MLE) of θ under the assumptions θ ∈ Θ_i, and the
LR test can be interpreted as a Neyman-Pearson test for simple hypotheses H : θ = θ̂_0(x) vs.
K : θ = θ̂_1(x). Of course this is pure heuristics and none of the Neyman-Pearson optimality theory
applies, since the hypotheses have been formed on the basis of the data.
Consider the Gaussian location-scale model and recall the form of the t-statistic for given μ_0:

    T_{μ_0}(X) = (X̄_n − μ_0) n^{1/2} / Ŝ_n,

where Ŝ_n² = (n−1)^{−1} Σ_{i=1}^n (X_i − X̄_n)². The two sided t-test was already defined (cp. (8.1)); it rejects when |T_{μ_0}(X)| is too large. The one
sided t-test is the test which rejects when T_{μ_0}(X) itself is too large (in analogy to the one sided Gauss
test for known σ²).
Proposition 8.4.2 Consider the Gaussian location-scale model (Model Mc,2), for sample size n.
(i) For hypotheses H : μ ≤ μ_0 vs. K : μ > μ_0, the one sided t-test is a LR test.
(ii) For hypotheses H : μ = μ_0 vs. K : μ ≠ μ_0, the two sided t-test is a LR test.
Proof. In relation (3.4) in the proof of Proposition 3.0.5, we obtained the following form of the
density p_{μ,σ²} of the data x = (x_1, …, x_n):

    p_{μ,σ²}(x) = (2πσ²)^{−n/2} exp( −( S_n² + (x̄_n − μ)² ) / (2σ²n^{−1}) ),   (8.13)

    S_n² = n^{−1} Σ_{i=1}^n (x_i − x̄_n)².   (8.14)
Consider first the two sided case (ii). To find MLEs of μ and σ² under μ ≠ μ_0, we first maximize
for fixed σ² over all possible μ ∈ R. This gives an unrestricted MLE μ̂ = x̄_n, and since x̄_n = μ_0
with probability 0, we obtain that μ̂_1 = x̄_n is the MLE of μ with probability 1 under K. We now
have to maximize

    l_x(σ²) = (2πσ²)^{−n/2} exp( −S_n² / (2σ²n^{−1}) )

over σ² > 0. For notational convenience, we set τ = σ²; equivalently, one may minimize (up to an additive constant)

    l̃_x(τ) = −log l_x(τ) = (n/2) log τ + nS_n² / (2τ).
Note that if S_n² > 0, for τ → 0 we have l̃_x(τ) → ∞ and for τ → ∞ also l̃_x(τ) → ∞, so that a
minimum exists and is a zero of the derivative of l̃_x. The event S_n² > 0 has probability 1 since
otherwise x_i = x̄_n, i = 1, …, n, i.e. all x_i are equal, which clearly has probability 0 for independent
continuous x_i. We obtain

    l̃′_x(τ) = n/(2τ) − nS_n²/(2τ²) = 0,
    τ̂ = σ̂_1² = S_n²,

so the maximized likelihood under K is

    max_{μ≠μ_0, σ²>0} p_{μ,σ²}(x) = (2πσ̂_1²)^{−n/2} exp(−n/2).

Under H we have μ = μ_0; the same argument, with S_n² replaced by S_n² + (x̄_n − μ_0)², gives
σ̂_0² = S_n² + (x̄_n − μ_0)² and

    max_{μ=μ_0, σ²>0} p_{μ,σ²}(x) = (2πσ̂_0²)^{−n/2} exp(−n/2),

and the likelihood ratio L is

    L(x) = max_{μ≠μ_0, σ²>0} p_{μ,σ²}(x) / max_{μ=μ_0, σ²>0} p_{μ,σ²}(x)
         = ( σ̂_0² / σ̂_1² )^{n/2} = ( ( S_n² + (x̄_n − μ_0)² ) / S_n² )^{n/2}.
Note that

    T_{μ_0}(X) = n^{1/2}(X̄_n − μ_0)/Ŝ_n = (n−1)^{1/2}(X̄_n − μ_0)/S_n,

hence

    L(X) = ( 1 + T_{μ_0}²(X)/(n−1) )^{n/2}.

Thus L(x) is a strictly monotone increasing function of |T_{μ_0}(X)|, which proves (ii).
Consider now claim (i). For hypotheses H : μ ≤ μ_0 vs. K : μ > μ_0, the one sided t-test rejects
when the t-statistic

    T_{μ_0}(X) = (X̄_n − μ_0) n^{1/2} / Ŝ_n

is too large (with a proper choice of critical value, such that an α-test results). It is easy to see
that the rejection region T_{μ_0}(X) > z_α, where z_α is the upper α-quantile of the t_{n−1}-distribution,
112 Testing Statistical Hypotheses
leads to an α-test for H (exercise). To show equivalence to the LR test, note that when maximizing
p_{μ,σ²}(x) over the alternative, the supremum is not attained (μ > μ_0 is an open interval). However
the supremum is the same as the maximum over μ ≥ μ_0, which is attained by certain maximum
likelihood estimators μ̂_1, σ̂_1². (We will find these, and also the MLEs μ̂_0, σ̂_0² under H.)
The density p_{μ,σ²} of the data x = (x_1, …, x_n) is again (8.13), (8.14). To find MLEs of μ and σ²
under μ > μ_0, we first maximize for fixed σ² over all possible μ. When x̄_n > μ_0 the solution is
μ̂ = x̄_n. When x̄_n ≤ μ_0, the problem is to minimize (x̄_n − μ)² under the condition μ > μ_0. This
minimum is not attained (μ can be selected arbitrarily close to μ_0, such that still μ > μ_0, which
makes (x̄_n − μ)² arbitrarily close to (x̄_n − μ_0)², never attaining this value). However

    inf_{μ>μ_0} (x̄_n − μ)² = (x̄_n − μ_0)² = min_{μ≥μ_0} (x̄_n − μ)².

Thus the MLE of μ under μ ≥ μ_0 is μ̂_1 = max(x̄_n, μ_0). This is not the MLE under K, but gives
the supremal value of the likelihood under K for given σ². To continue, we have to maximize in
σ². Now

    (x̄_n − μ̂_1)² = (min(0, x̄_n − μ_0))²,

and defining

    S_{n,1}² := S_n² + (x̄_n − μ̂_1)²

we obtain

    sup_{μ>μ_0, σ²>0} p_{μ,σ²}(x) = sup_{σ²>0} (2πσ²)^{−n/2} exp( −S_{n,1}² / (2σ²n^{−1}) ).

The maximization in σ² is now analogous to the argument above given for part (ii). The maximizing
value is σ̂_1² = S_{n,1}², and the maximized likelihood (which is also the supremal likelihood under K) is

    sup_{μ>μ_0, σ²>0} p_{μ,σ²}(x) = (2πσ̂_1²)^{−n/2} exp(−n/2).
Now under the hypothesis, since μ ≤ μ_0 is a closed interval, the MLEs can straightforwardly be
found. An analogous argument to the one above gives

    μ̂_0 = min(x̄_n, μ_0),
    min_{μ≤μ_0} (x̄_n − μ)² = (x̄_n − μ̂_0)² = (max(0, x̄_n − μ_0))²,
    σ̂_0² = S_{n,0}²,  where S_{n,0}² := S_n² + (x̄_n − μ̂_0)²,
    sup_{μ≤μ_0, σ²>0} p_{μ,σ²}(x) = (2πσ̂_0²)^{−n/2} exp(−n/2).
Suppose first that the t-statistic T_{μ_0}(X) has values ≤ 0; this is equivalent to x̄_n ≤ μ_0. In this case
μ̂_0 = x̄_n, μ̂_1 = μ_0, hence

    L(x) = ( S_n² / ( S_n² + (x̄_n − μ_0)² ) )^{n/2} = ( 1 / ( 1 + (x̄_n − μ_0)²/S_n² ) )^{n/2}
         = ( 1 + T_{μ_0}²(X)/(n−1) )^{−n/2}.

Thus for nonpositive values of T_{μ_0}(X), the likelihood ratio L(x) is a monotone decreasing function
of the absolute value of T_{μ_0}(X), which means it is monotone increasing in T_{μ_0}(X), on values
T_{μ_0}(X) ≤ 0.
Consider now nonnegative values of T_{μ_0}(X): T_{μ_0}(X) ≥ 0. Then x̄_n ≥ μ_0, hence μ̂_0 = μ_0, μ̂_1 = x̄_n
and

    L(x) = ( σ̂_0² / σ̂_1² )^{n/2} = ( ( S_n² + (x̄_n − μ_0)² ) / S_n² )^{n/2}
         = ( 1 + T_{μ_0}²(X)/(n−1) )^{n/2}.

Thus for values T_{μ_0}(X) ≥ 0, the likelihood ratio L(x) is monotone increasing in T_{μ_0}(X).
The two areas of values of T_{μ_0}(X) we considered do overlap (in T_{μ_0}(X) = 0), and we showed that
L(x) is a monotone increasing function of T_{μ_0}(X) on both of these. Hence L(x) is a monotone
increasing function of T_{μ_0}(X).
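The closed forms for σ̂_0², σ̂_1² and the identity L(X) = (1 + T²/(n−1))^{n/2} in the case x̄_n > μ_0 can be verified numerically. A small Python sketch (the data vector is an arbitrary illustrative sample, not from the text):

```python
import math

def lr_and_t(x, mu0):
    """Case xbar > mu0: likelihood ratio from the MLEs of sigma^2 under H and K,
    compared with the closed form (1 + T^2/(n-1))^(n/2)."""
    n = len(x)
    xbar = sum(x) / n
    sn2 = sum((xi - xbar) ** 2 for xi in x) / n          # S_n^2 as in (8.14)
    t = (n - 1) ** 0.5 * (xbar - mu0) / math.sqrt(sn2)   # t-statistic
    sigma1_sq = sn2                                      # sigma-hat_1^2 (under K, xbar > mu0)
    sigma0_sq = sn2 + (xbar - mu0) ** 2                  # sigma-hat_0^2 (under H: mu <= mu0)
    lr = (sigma0_sq / sigma1_sq) ** (n / 2)
    lr_from_t = (1 + t ** 2 / (n - 1)) ** (n / 2)
    return lr, lr_from_t

# illustrative sample with sample mean above mu0 = 0
x = [1.2, 0.8, 2.5, -0.3, 1.7, 0.9, 2.1, 0.4, 1.0, 1.6]
lr, lr_from_t = lr_and_t(x, mu0=0.0)
```

Both expressions agree up to floating point error, as the algebra above predicts.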
Chapter 9
CHI-SQUARE TESTS
9.1 Introduction
Consider the following problem related to Mendelian heredity. Two characteristics of pea plants
(phenotypes) are observed: form, which may be smooth or wrinkled, and color, which may be yellow
or green. Thus there are 4 combinations of form and color (4 combined phenotypes). Mendelian
theory predicts certain frequencies of these in the total population of pea plants; call them M_1, …, M_4
(here M_1 is the frequency of smooth/yellow etc.); assume these are normed as Σ_{j=1}^4 M_j = 1. We
observe a sample of n pea plants; let the observed frequency be Z_j for each phenotype (j = 1, …, 4);
then Σ_{j=1}^4 Z_j = n. We wish to find out whether these observations support the Mendelian hypothesis
that (M_1, …, M_4) are the frequencies of phenotypes in the total population.
Model Md,2 The observed random vector Z = (Z_1, …, Z_k) has a multinomial distribution M_k(n, p)
with unknown probability vector p = (p_1, …, p_k) (Σ_{j=1}^k p_j = 1, p_j ≥ 0, j = 1, …, k).
Recall the basic facts about the multinomial law. Consider a random k-vector Y of form
(0, …, 0, 1, 0, …, 0) where exactly one component is 1 and the others are 0. The probability that
the 1 is at position j is p_j; thus Y can describe an individual falling into one of k categories
(phenotypes in the above example). This Y is said to have law M_k(1, p). If Y_1, …, Y_n are i.i.d. with
law M_k(1, p) then Z = Σ_{i=1}^n Y_i has the law M_k(n, p). (The Y_i may be called counting vectors.)
The probability function is

    P( Z = (z_1, …, z_k) ) = ( n! / ∏_{j=1}^k z_j! ) ∏_{j=1}^k p_j^{z_j}   (9.1)

where Σ_{j=1}^k z_j = n, z_j ≥ 0 integer. Since the j-th component of Y_1 is Bernoulli B(1, p_j), the j-th
component Z_j of Z has binomial law B(n, p_j). The Z_j are not independent; in fact Σ_{j=1}^k Z_j = n.
For k = 2 all the information is in Z_1 since Z_2 = n − Z_1; thus for k = 2 observing a multinomial
M_2(n, (p_1, p_2)) is equivalent to observing a binomial B(n, p_1).
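The probability function (9.1) is easy to code; a short Python sketch follows, with two sanity checks (the value n = 6 and the probability vectors are illustrative):

```python
import math
from itertools import product

def multinomial_pmf(z, p):
    """Probability function (9.1) of M_k(n, p) at z = (z_1, ..., z_k)."""
    n = sum(z)
    coef = math.factorial(n)
    for zj in z:
        coef //= math.factorial(zj)   # multinomial coefficient n!/(z_1! ... z_k!)
    prob = float(coef)
    for zj, pj in zip(z, p):
        prob *= pj ** zj
    return prob

n = 6
# for k = 2, M_2(n, (p, 1-p)) matches the binomial B(n, p) in the first component
p = 0.3
for z1 in range(n + 1):
    b = math.comb(n, z1) * p ** z1 * (1 - p) ** (n - z1)
    assert abs(multinomial_pmf((z1, n - z1), (p, 1 - p)) - b) < 1e-12

# the probabilities (9.1) sum to one over all z with z_1 + ... + z_k = n
total = sum(multinomial_pmf(z, (0.2, 0.3, 0.5))
            for z in product(range(n + 1), repeat=3) if sum(z) == n)
```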
In Model Md,2 consider the hypotheses

    H : p = p_0
    K : p ≠ p_0.

The test we wish to find thus is a significance test. Recall the basic rationale of hypothesis testing:
what we wish to statistically ascertain at level α is K; if K is accepted then it can be claimed that
the deviation from the null hypothesis H is statistically significant, and there can be reasonable
confidence in the truth of K. On the contrary, when H is accepted, no statistical significance
claim can be attached to this result. (The hypothesis H is often called the null hypothesis, even if it
is not of the form θ = 0.) When formulating a test problem, the statement for which
reasonable statistical certainty is desired is taken as K.
Let us find the likelihood ratio test for this problem. Setting θ = p, θ_0 = p_0 = (p_{0,1}, …, p_{0,k}),
denoting p_p(z) the probability function (9.1) of M_k(n, p) and

    Θ = { p : p_j ≥ 0, Σ_{j=1}^k p_j = 1 },  Θ_1 = Θ \ {θ_0},

the LR statistic is L(z) = sup_{Θ_1} p_p(z) / p_{p_0}(z).
Consider the numerator; let us first maximize over p ∈ Θ (this will be justified below). If some of
the z_j are 0, we can set the corresponding p_j = 0 (making the other p_j larger; we set 0^0 = 1). We
now maximize over the p_j such that z_j > 0. Taking a logarithm, we have to maximize

    Σ_{j:z_j>0} z_j log p_j.

Since log x ≤ x − 1, we have

    Σ_{j:z_j>0} z_j log( p_j / (n^{−1}z_j) ) ≤ Σ_{j:z_j>0} z_j ( p_j/(n^{−1}z_j) − 1 ) = n Σ_{j:z_j>0} p_j − n ≤ 0,

so that

    Σ_{j:z_j>0} z_j log p_j ≤ Σ_{j:z_j>0} z_j log( n^{−1}z_j ),

and for p_j = n^{−1}z_j equality is attained. This is the unique maximizer since log x = x − 1 only for
x = 1. We proved
Proposition 9.1.1 In the multinomial Model Md,2, with L(Z) = M_k(n, p), with no restriction on
the parameter p the maximum likelihood estimator p̂ is

    p̂(Z) = n^{−1} Z.

The interpretation is that p̂ is a vector valued sample mean of the counting vectors Y_1, …, Y_n. In
this sense, we have a generalization of the result for binomial observations (Proposition 3.0.3).
Recall that for the LR statistic we have to find the supremum over Θ_1 = Θ \ {p_0}, i.e. only one
point p_0 is taken out. Since the target function p ↦ ∏_{j=1}^k p_j^{z_j} is continuous (with 0^0 = 1) on Θ,
we have

    L(z) = sup_{Θ_1} p_p(z) / p_{p_0}(z) = sup_Θ p_p(z) / p_{p_0}(z)
         = max_Θ p_p(z) / p_{p_0}(z) = ∏_{j=1}^k ( n^{−1}z_j / p_{0,j} )^{z_j}.
Since the logarithm is a monotone function, the acceptance region S̄ (complement of the critical /
rejection region) can also be written

    S̄ = { z : log( L(z)^{−1} ) = Σ_{j=1}^k z_j log( n p_{0,j} / z_j ) ≥ c }.
Even the logarithm is a relatively involved function of the data, so it is difficult to find its distribu-
tion under H and to determine the critical value c from that. We will use a Taylor approximation
of the logarithm to simplify it. The basis is the observation that the estimator p̂(Z) is consistent,
i.e. converges in probability to the true probability vector p:

    p̂(Z) = n^{−1} Z →_p p.

Under the hypothesis, this true vector is p_0, so all values n^{−1}z_j / p_{0,j} converge to one. Note the
Taylor expansion

    log(1 + x) = x − x²/2 + o(x²) as x → 0,

where o(x²) is a term which is of smaller order than x² (such that o(x²)/x² → 0). Thus, assuming
that each term p_{0,j}/(n^{−1}z_j) − 1 is small, we obtain
    log( L^{−1}(z) ) = Σ_{j=1}^k z_j log( 1 + ( p_{0,j}/(n^{−1}z_j) − 1 ) )
                     ≈ Σ_{j=1}^k z_j ( p_{0,j}/(n^{−1}z_j) − 1 ) − (1/2) Σ_{j=1}^k z_j ( p_{0,j}/(n^{−1}z_j) − 1 )².
Here the first term on the right vanishes, since the p0,j sum to one and the zj sum to n. We obtain
    log( L^{−1}(z) ) ≈ −(1/2) Σ_{j=1}^k z_j ( p_{0,j}/(n^{−1}z_j) − 1 )².
We need not make the approximation more rigorous, if we do not insist on using the likelihood
ratio test. In fact we will use the LR principle only to find a reasonable test (which should then be
shown to have asymptotic level α). In this spirit, we proceed with another approximation n^{−1}z_j ≈
p_{0,j} in the denominator to obtain

    (1/2) Σ_{j=1}^k z_j ( p_{0,j}/(n^{−1}z_j) − 1 )² = (1/2) Σ_{j=1}^k z_j ( ( p_{0,j} − n^{−1}z_j ) / (n^{−1}z_j) )²
    ≈ (1/2) Σ_{j=1}^k n ( p_{0,j} − n^{−1}z_j )² / p_{0,j}.
Definition 9.1.2 In the multinomial Model Md,2, with L(Z) = M_k(n, p), the χ²-statistic relative
to a given parameter vector p_0 is

    χ²(Z) = Σ_{j=1}^k n ( p_{0,j} − n^{−1}Z_j )² / p_{0,j}.
The name is derived from the asymptotic distribution of this statistic, which we will establish below
(the statistic itself does not have a χ²-distribution). The hypothesis H : p = p_0 will be rejected if χ²(Z)
is too large; as shown above, that idea was obtained from the likelihood ratio principle.
But χ²(Z) also has an interpretation of its own, as a measure of deviation from the hypoth-
esis. Indeed the n^{−1}Z_j are consistent estimators of the true parameters p_j, so the sum of squares
Σ_{j=1}^k ( p_{0,j} − p̂_j )² can be seen as a measure of departure from H. In the chi-square statistic, we
have a weighted sum of squares with weights p_{0,j}^{−1}.
We know that since each Z_j has a marginal binomial law, for each j we have a convergence in
distribution

    n^{1/2}( n^{−1}Z_j − p_{0,j} ) →_L N( 0, p_{0,j}(1 − p_{0,j}) ),   (9.2)

i.e. a limiting normal law by the CLT under H. The χ² distribution is a sum of squares of
independent normals. However the Z_j are not independent in the multinomial law; so we need
more than the CLT for each Z_j: in fact a multivariate CLT for the joint law of (Z_1, …, Z_k) is
required.
if the expectations exist. (Here (A)_{i,j} is the (i, j) entry of a matrix A.) This existence is guaranteed
by condition (9.3), as a consequence of the Cauchy-Schwarz inequality:

    |Cov(Y_j, Y_l)|² ≤ Var(Y_j) Var(Y_l) ≤ E Y_j² E Y_l² ≤ ( E‖Y‖² )².

Note that for expectations of vectors and matrices, the following convention holds: the expectation
of a vector (matrix) is the vector (matrix) of expectations. This means

    EY = (EY_j)_{j=1,…,k}.
Our starting point for the multivariate CLT is the observation that if t is a nonrandom k-vector,
then the r.v.s t^⊤Y_i, i = 1, …, n are real-valued i.i.d. r.v.s with finite second moment. Indeed,
since for any vectors t, x we have

    (t^⊤x)² = t^⊤ x x^⊤ t,

we obtain

    Var(t^⊤Y) = E( t^⊤Y − E t^⊤Y )² = E( t^⊤(Y − EY) )²
              = E t^⊤ (Y − EY)(Y − EY)^⊤ t
              = t^⊤ E(Y − EY)(Y − EY)^⊤ t = t^⊤ Cov(Y) t < ∞.

Hence, by the univariate CLT, if

    σ_t² = t^⊤ Cov(Y) t

is not 0, then

    n^{1/2}( n^{−1} Σ_{i=1}^n t^⊤Y_i − E t^⊤Y ) →_L N(0, σ_t²).   (9.4)
Define the sample mean of the random vectors Y_i by

    Ȳ_n = n^{−1} Σ_{i=1}^n Y_i.

Recall that L(Z) = N_k(0, Σ) implies that L(t^⊤Z) = N(0, t^⊤Σt) for every t ≠ 0
(Lemma 6.1.12; actually the converse is also true, cf. Proposition 9.2.3 below). In the sequel, for
the multivariate CLT we will impose the condition that Cov(Y) is nonsingular. This means that

    t^⊤ Cov(Y) t > 0 for every t ≠ 0.
For an interpretation of this condition, note that if it is violated then there exists a t ≠ 0 such that
t^⊤Cov(Y)t = 0 (this number is the variance of a random variable and thus cannot be negative).
Then the r.v. t^⊤(Y − EY) is 0 with probability 1. Define the hyperplane (linear subspace) in R^k

    H = { x ∈ R^k : t^⊤x = 0 };

then Y − EY ∈ H with probability one. The condition that Cov(Y) is nonsingular thus excludes this
case of a degenerate random vector which is actually concentrated on a (shifted) linear subspace of R^k.
However a multivariate CLT is still possible if the vector Y is linearly transformed (to a space of
lower dimension).
Say that a sequence of random k-vectors Ȳ_n converges in distribution to a law Q_0, written Ȳ_n →_L Q_0, if for every t ≠ 0

    t^⊤ Ȳ_n →_L Q_{0,t} as n → ∞,  where Q_{0,t} = L(t^⊤Y_0)

for a random vector Y_0 with L(Y_0) = Q_0.
Note that it is not excluded here that the limit law has a singular covariance matrix (t^⊤Y_0 might
be 0 with probability one for certain t, or even for all t). However for the multivariate CLT we
will exclude this case by assumption, since we did not systematically treat the multivariate normal
N_k(0, Σ) with singular Σ.
With this definition, it is not immediately clear that the limit law Q_0 is unique. It is desirable
to have this uniqueness; otherwise there could be two different limit laws Q_0, Q̃_0 such that (for
L(Y_0) = Q_0, L(Ỹ_0) = Q̃_0)

    L(t^⊤Y_0) = L(t^⊤Ỹ_0) for all t ≠ 0.

That this is not possible will follow from Proposition 9.2.3 below.
Theorem 9.2.2 (Multivariate CLT) Let Y_1, Y_2, … be i.i.d. random k-vectors with E‖Y_1‖² < ∞. Let

    Ȳ_n = n^{−1} Σ_{i=1}^n Y_i

and assume that the covariance matrix Σ = Cov(Y_1) is nonsingular. Then for n → ∞

    n^{1/2}( Ȳ_n − EY_1 ) →_L N_k(0, Σ).
Proposition 9.2.3 Let Q, Q̃ be the distributions of two random vectors Y, Ỹ with values in
R^k. Then

    L(Y) = L(Ỹ) if and only if L(t^⊤Y) = L(t^⊤Ỹ) for all t ≠ 0.
Application to multinomials 121
Proof. The complete argument is beyond the scope of this course; let us discuss some elements
(cp. also the arguments for the proof of the univariate CLT in [D]). Suppose that for all t ∈ R^k, the
expression

    M_Y(t) = E exp(t^⊤Y)

is finite (i.e. the expectation is finite). In that case, M_Y is called the moment generating function
(m.g.f.) of L(Y). Analogously to the one dimensional case, it can be shown that the m.g.f.
determines L(Y) uniquely (that is the key argument). Thus if L(t^⊤Y) = L(t^⊤Ỹ) then their
univariate m.g.f.s coincide:

    E exp(u t^⊤Y) = E exp(u t^⊤Ỹ)

and conversely, if that is the case for all u and t then M_Y = M_Ỹ, hence L(Y) = L(Ỹ).
Existence of the m.g.f. is a strong additional assumption on a distribution. The proof in the general
case (without any conditions on the laws L(Y), L(Ỹ)) is based on the so called characteristic
function of a random vector

    φ_Y(t) = E exp(i t^⊤Y),

where the complex-valued expression
(For regular sets A, convergence in distribution entails P(Y_n ∈ A) → Q_0(A); cp. Proposition 9.2.4.)
Consider now again the counting vector Y with law M_k(1, p). Since the j-th component Y_j is Bernoulli B(1, p_j), we have E Y_j = E Y_j² = p_j, while for j ≠ l

    E Y_j Y_l = P(Y_j = 1, Y_l = 1) = 0

(the random vector Y has a 1 in exactly one position). We can now write down the covariance
matrix:

    Cov(Y_j, Y_l) = E Y_j Y_l − E Y_j E Y_l = { p_j − p_j p_l, j = l;  −p_j p_l, j ≠ l }.

Introduce transformed variables Ỹ_j = (Y_j − p_j)/p_j^{1/2}; then

    Cov(Ỹ_j, Ỹ_l) = p_j^{−1/2} p_l^{−1/2} Cov(Y_j, Y_l) = { 1 − p_j^{1/2} p_l^{1/2}, j = l;  −p_j^{1/2} p_l^{1/2}, j ≠ l }.

Let Λ be the diagonal matrix with diagonal elements p_j^{−1/2}. Then, for the vectors

    Ỹ = Λ(Y − p),  p^{1/2} = (p_1^{1/2}, …, p_k^{1/2})^⊤,

we can write the covariance matrix compactly as Cov(Ỹ) = I_k − p^{1/2}(p^{1/2})^⊤.
Note that e_k := p^{1/2} is a unit vector. Choose an orthogonal k×k matrix G with rows e_1^⊤, …, e_k^⊤
whose last row is e_k^⊤, and let G_0 be the (k−1)×k matrix consisting of the first k−1 rows of G.

Lemma 9.3.1 Let L(Z) = M_k(n, p), let Λ be the diagonal matrix with diagonal elements p_j^{−1/2}
and let the (k−1)×k-matrix F be defined by

    F = G_0 Λ.

Then

    Σ_{j=1}^k p_j^{−1}( Z_j − n p_j )² = ‖F(Z − np)‖²,   (9.10)
    Cov(F Z) = n I_{k−1}.   (9.11)
Proof. We have

    e_k^⊤ Ỹ_i = (p^{1/2})^⊤ Λ(Y_i − p) = Σ_{j=1}^k (Y_{ij} − p_j) = 1 − 1 = 0,  i = 1, …, n.

Hence

    e_k^⊤ Λ(Z − np) = 0.

This implies

    Σ_{j=1}^k p_j^{−1}( Z_j − np_j )² = ‖Λ(Z − np)‖² = ‖G Λ(Z − np)‖²
    = Σ_{j=1}^k ( e_j^⊤ Λ(Z − np) )² = Σ_{j=1}^{k−1} ( e_j^⊤ Λ(Z − np) )² = ‖F(Z − np)‖²,

thus the first claim (9.10) is proved. For the second claim, we note that in view of (9.6) and the
additivity of covariance matrices for independent vectors

    Cov(FZ) = n Cov(FY_1) = n G_0 ( I_k − p^{1/2}(p^{1/2})^⊤ ) G_0^⊤ = n G_0 G_0^⊤ = n I_{k−1}

(computation rules for covariance matrices can be obtained from the rules for the multivariate
normal, cp. Lemma 6.1.12:

    Cov(AX) = A Cov(X) A^⊤).
The following is the Central Limit Theorem for a multinomial random variable, which generalizes
the de Moivre-Laplace CLT for binomials (sums of i.i.d. Bernoulli r.v.s, cp. (1.2) and (9.2)). Since
the components of the multinomial are dependent, we need to multiply with the (k−1)×k-matrix F
first; otherwise we would get a multivariate normal limiting distribution with singular covariance
matrix.
Proposition 9.3.2 Let L(Z) = M_k(n, p). Then for the (k−1)×k-matrix F defined above we
have

    n^{1/2} F( n^{−1}Z − p ) →_L N_{k−1}(0, I_{k−1}) as n → ∞.

Proof. We have n^{1/2}F(n^{−1}Z − p) = n^{1/2}( n^{−1} Σ_{i=1}^n F Y_i − F p );
the F Y_i are again i.i.d. vectors with expectation F p and with unit covariance matrix, according to
(9.11) for n = 1. For the multivariate CLT, the second moment condition is fulfilled trivially since
the vector Y_1 takes only k possible values. The multivariate CLT (Theorem 9.2.2) yields the result.
The next result justifies the name of the χ²-statistic, by establishing an asymptotic distribution.

Theorem 9.3.3 Let L(Z) = M_k(n, p). Then for the χ²-statistic

    χ²(Z) = Σ_{j=1}^k n ( n^{−1}Z_j − p_j )² / p_j

we have

    χ²(Z) →_L χ²_{k−1} as n → ∞.

Proof. We can write

    χ²(Z) = n^{−1} Σ_{j=1}^k p_j^{−1}( Z_j − np_j )²
          = n^{−1} ‖F(Z − np)‖² = ‖ n^{1/2} F( n^{−1}Z − p ) ‖².

The above Proposition 9.3.2 implies that the expression inside ‖·‖² is asymptotically multivariate
(k−1)-standard normal. Denote this expression V_n; it has to be shown that

    P( ‖V_n‖² ≤ t ) → F(t)

where F is the distribution function of the law χ²_{k−1}, at every continuity point t of F. Since this
law has a density, F is continuous, so it has to be shown for every t (it suffices for t ≥ 0). The set
Chi-square tests for goodness of fit 125
    { x : ‖x‖² ≤ t }

is a ball in R^{k−1} and hence regular in the sense of Proposition 9.2.4. Thus if ξ is a
random vector with law N_{k−1}(0, I_{k−1}) then

    P( ‖V_n‖² ≤ t ) → P( ‖ξ‖² ≤ t ) = F(t).
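Theorem 9.3.3 can be illustrated by simulation. The Python sketch below (sampling scheme, cell number k = 4, n = 300 and replication count are all illustrative choices) draws multinomial vectors under H and checks that χ²(Z) exceeds 7.82, the upper 0.05-quantile of χ²_3 quoted below in Example 9.4.3, in roughly 5% of replications.

```python
import random

def sample_multinomial(rng, n, p):
    """Draw Z ~ M_k(n, p) as the sum of n i.i.d. counting vectors Y_i."""
    cum, s = [], 0.0
    for pj in p:
        s += pj
        cum.append(s)
    z = [0] * len(p)
    for _ in range(n):
        u = rng.random()
        for j, cj in enumerate(cum):
            if u <= cj:
                z[j] += 1
                break
    return z

def chi_square_stat(z, p0):
    """Chi-square statistic relative to p0 (Definition 9.1.2)."""
    n = sum(z)
    return sum((n * pj - zj) ** 2 / (n * pj) for zj, pj in zip(z, p0))

rng = random.Random(7)
p0 = [0.25, 0.25, 0.25, 0.25]    # k = 4 cells, hence k - 1 = 3 degrees of freedom
reps, n = 3000, 300
exceed = sum(chi_square_stat(sample_multinomial(rng, n, p0), p0) > 7.82
             for _ in range(reps))
frac = exceed / reps             # should be close to 0.05 under H
```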
Theorem 9.4.1 Consider Model Md,2: the observed random k-vector Z has law L(Z) = M_k(n, p)
where p is unknown. Consider the hypotheses

    H : p = p_0
    K : p ≠ p_0.

Let z_α be the upper α-quantile of the distribution χ²_{k−1}. Then the test φ(Z) defined by

    φ(Z) = 1 if χ²(Z) > z_α,  φ(Z) = 0 otherwise,   (9.12)

is an asymptotic α-test as n → ∞, where

    χ²(Z) = Σ_{j=1}^k ( n p_{0,j} − Z_j )² / ( n p_{0,j} ).   (9.13)

The form (9.13) of the χ²-statistic is easy to memorize: take observed frequency minus expected
frequency, square it, divide by expected frequency, and sum over components (components are
also called cells).
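In code the recipe reads directly off (9.13). A minimal Python sketch (the counts z below are hypothetical data for a fair six-sided die with n = 120, so the expected frequency is 20 per cell; 11.07 is the upper 0.05-quantile of χ²_5 quoted in Example 9.4.4 below):

```python
def chi_square_stat(observed, p0):
    """Chi-square statistic (9.13): sum over cells of
    (expected - observed)^2 / expected, with expected = n * p0_j."""
    n = sum(observed)
    return sum((n * pj - zj) ** 2 / (n * pj) for zj, pj in zip(observed, p0))

z = [18, 23, 19, 25, 16, 19]          # hypothetical die counts, n = 120
stat = chi_square_stat(z, [1 / 6] * 6)
# level-0.05 test of H: p = (1/6, ..., 1/6): reject when stat > 11.07
reject = stat > 11.07
```

Here stat = 2.8, well below 11.07, so this (hypothetical) die would not be declared unfair.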
Remark 9.4.2 On quantiles. In the literature (especially in tables) it is more common to use
lower quantiles, i.e. values q_α such that for a given random variable X

    P(X ≤ q_α) ≥ α,  P(X ≥ q_α) ≥ 1 − α.

For α = 1/2 one obtains a median (i.e. a theoretical median of the random variable X; note
that quantiles q_α need not be unique). When X has a continuous strictly monotone distribution
function F (at least in a neighborhood of q_α) then q_α is the unique value with

    F(q_α) = α.
Lower quantiles are often written in the form χ²_{k;α} if F corresponds to χ²_k. Thus for the upper
α-quantile z_α used above we have

    z_α = χ²_{k−1;1−α}.
Example 9.4.3 Consider again the heredity example (beginning of Section 9.1). Suppose the
values of (M_1, …, M_4) = (p_{0,1}, …, p_{0,4}) predicted by the theory are

    (p_{0,1}, …, p_{0,4}) = (1/100²) ( 91², 9·91, 9·91, 9² ) = (0.8281, 0.0819, 0.0819, 0.0081)

(we do not claim that these are the correct values corresponding to Mendelian theory). Suppose
we have 1000 observations with observed frequency vector
At significance level α = 0.05, we find z_α = χ²_{3;0.95} = 7.82. The hypothesis is not rejected.
Example 9.4.4 Suppose we have a die and want to test whether it is fair, i.e. all six outcomes
are equally probable. For n independent trials, the frequency vector for outcomes (1, …, 6) is

    Z = (Z_1, …, Z_6).

The quantile at α = 0.05 is z_α = χ²_{5;0.95} = 11.07, so the hypothesis of a uniform distribution cannot
be rejected.
Exercise. Suppose you are intent on proving that the number generator is bad, and run the above
simulation program 20 times. You claim that the random number generator is bad when the test
rejects at least once. Are you still doing an α-test? (Assume that n above is large enough so
that the level of one test is practically α.)
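For intuition on the exercise: if the 20 runs are independent and each test has level exactly α = 0.05, the chance of at least one rejection under H is a one-line computation.

```python
alpha = 0.05
m = 20                                  # number of independent repetitions
p_at_least_one = 1 - (1 - alpha) ** m   # P(at least one rejection under H)
```

This evaluates to about 0.64, far above 0.05, which is the point of the exercise.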
The χ²-test can also be used to test hypotheses that the data follow a specific distribution, not
necessarily multinomial. Suppose observations are i.i.d. real valued X_1, …, X_n, with distribution
Q. Suppose Q_0 is a specific distribution, and consider hypotheses

    H : Q = Q_0
    K : Q ≠ Q_0.

This is transformed into multinomial hypotheses by selecting a partition of the real line into subsets
or cells A_1, …, A_k:

    ⋃_{j=1}^k A_j = R,  A_i ∩ A_j = ∅, j ≠ i.

The A_j are often called cells or bins; they are usually intervals. For a real r.v. X, define an
indicator vector

    Y(X) = (Y_1, …, Y_k),  Y_j = 1_{A_j}(X), j = 1, …, k,   (9.14)

i.e. Y(X) indicates into which of the k cells the r.v. X falls. Then obviously Y(X) has a
multinomial distribution:

    L(Y(X)) = M_k(1, p(Q)),  p(Q) = ( Q(A_1), …, Q(A_k) ).
This p(Q) is the vector of cell probabilities, corresponding to the given partition. Thus, in the
above problem, the vectors Y_i = Y(X_i) are multinomial; they are sometimes called binned data.
Then

    Z := Σ_{i=1}^n Y(X_i)   (9.15)

is multinomial M_k(n, p) with the above value of p. When Q takes the value Q_0 then also the vector
of cell probabilities takes the value p_0 = p(Q_0).

Corollary 9.4.5 Suppose observations are i.i.d. real valued X_1, …, X_n, with distribution Q. Con-
sider hypotheses

    H : Q = Q_0
    K : Q ≠ Q_0.

For a partition of the real line into nonintersecting cells A_1, …, A_k, define the vector of cell fre-
quencies Z by (9.15). Then the χ²-test φ(Z) defined by (9.12), based on Z and p_0 = p(Q_0), is an
asymptotic α-test as n → ∞.
This χ²-test has a very wide range of applicability; it is not specified whether Q_0 is discrete or
continuous. Every distribution Q_0 gives rise to a specific multinomial distribution M_k(n, p(Q_0))
which is then tested. For instance, a random number generator for standard normal variables can
be tested in this way. On the real line, at least one of the cells A_j contains an unbounded interval.
However there is a certain arbitrariness involved in the choice of the cells A_1, …, A_k. In fact
partitioning the data into groups amounts to a coarsening of the hypothesis: there are certainly
distributions Q ≠ Q_0 which have the same cell probabilities, i.e. p(Q_0) = p(Q). These cannot
be told apart from Q_0 by this method. If one chooses a large number of groups k, the number of
observations in each cell may be small, so that the approximation based on the CLT appears less
credible.
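Corollary 9.4.5 is simple to carry out in code. The Python sketch below (cells, sample size and seed are all illustrative; Q_0 is taken to be the uniform distribution on [0, 1), and the data are simulated under H) bins a sample into k = 4 equal cells and applies the χ²-test with the quantile χ²_{3;0.95} = 7.82:

```python
import random

def cell_frequencies(xs, edges):
    """Bin real data into cells A_j = [edges[j-1], edges[j]); returns Z as in (9.15)."""
    counts = [0] * (len(edges) - 1)
    for x in xs:
        for j in range(len(counts)):
            if edges[j] <= x < edges[j + 1]:
                counts[j] += 1
                break
    return counts

def chi_square_stat(z, p0):
    """Chi-square statistic (9.13)."""
    n = sum(z)
    return sum((n * pj - zj) ** 2 / (n * pj) for zj, pj in zip(z, p0))

rng = random.Random(42)
xs = [rng.random() for _ in range(2000)]   # i.i.d. sample; here the true Q equals Q0
edges = [0.0, 0.25, 0.5, 0.75, 1.0]        # k = 4 cells
p0 = [0.25] * 4                            # cell probabilities p(Q0)
z = cell_frequencies(xs, edges)
stat = chi_square_stat(z, p0)
reject = stat > 7.82                       # asymptotic level-0.05 test
```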
Thus we have encountered the first nonparametric hypothesis, in the form of the alternative Q ≠ Q_0. In
this sense, the χ²-test for goodness of fit in Corollary 9.4.5 is a nonparametric test; in a narrower
sense this term is used for tests which have level α on a nonparametric hypothesis. (However this
χ²-test actually tests the hypothesis on the cell probabilities p(Q) = p(Q_0), with asymptotic level
α, and the set of all Q fulfilling this hypothesis is also nonparametric.)
Define

    S_P = { p ∈ R^k : p_j ≥ 0, j = 1, …, k, 1^⊤p = 1 },

which is called the probability simplex in R^k. It is the set of all k-dimensional probability vectors.
(Here 1 = (1, …, 1)^⊤ ∈ R^k.) Instead of fixing p_0 as before, we now have only linear restrictions on
p: if H^⊥ is the orthogonal complement of a linear subspace H and h_1, …, h_{k−d−1} an orthonormal basis of H^⊥, then p ∈ H is described by

    h_j^⊤ p = 0, j = 1, …, k − d − 1,   (9.17)

and the hypothesis set is

    H_0 = S_P ∩ H.   (9.18)

Figure: The probability simplex for k = 3 intersected with the linear space spanned by 1 (dimension 1). The intersection is the single point k^{−1}·1 (dimension d = 0).
Figure: The probability simplex intersected with a linear subspace H (dimension 2). The intersection is H_0 (dimension d = 1).
The multinomial data vector n^{−1}Z also takes values in S_P, which means intuitively that there are
k − 1 degrees of freedom. Our parameter vector p varies in H_0 with dimension d, which means
that there are d free parameters under the hypothesis which must be estimated. We now claim
that the corresponding χ²-statistic has a limiting χ²-distribution with degrees of freedom

    dim(S_P) − dim(H_0) = (k − 1) − d = k − d − 1.
Let us discuss what we mean by estimated parameters. A guiding principle is still the likelihood
ratio principle: consider the LR statistic

    L(z) = sup_{Θ_1} p_p(z) / sup_{Θ_0} p_p(z)   (9.19)

where Θ_0, Θ_1 are the parameter spaces under H, K respectively. In the case Θ_0 = {p_0} this led
us to the χ²-statistic relative to p_0

    χ²(Z) = Σ_{j=1}^k n ( p_{0,j} − n^{−1}Z_j )² / p_{0,j}.

Since now in (9.19) we also have to maximize over the hypothesis, we should expect that in place
of p_{0,j} we now obtain estimated values under the hypothesis: p̂ = p̂(Z), the maximum
likelihood estimator under H.
Write o_p(n^{−r}) for a random vector such that n^r ‖o_p(n^{−r})‖ →_p 0, r ≥ 0.
Lemma 9.5.1 The MLE p̂ in a multinomial model {M_k(n, p), p ∈ H_0} for a parameter space H_0
given by (9.18) fulfills

    F( p̂ − p ) = Π F( n^{−1}Z − p ) + o_p(n^{−1/2})   (9.20)

where F is the (k−1)×k-matrix defined in Lemma 9.3.1, p is the true parameter vector, and Π
is a (k−1)×(k−1) projection matrix of rank d.

Comment. The result means that the (k−1)-vector F(p̂ − p) almost lies in the d-dimensional
linear subspace of R^{k−1} associated to the projection Π. This is related to the fact that both p, p̂
are in the d-dimensional manifold H_0.
Proof. We present only a sketch, suppressing some technical arguments. Let p denote the true
value and p̃ the vector over which one maximizes (and ultimately p̂ the maximizing value). Consider
the log-likelihood; up to an additive term which does not depend on p̃ it is

    Σ_{j=1}^k Z_j log p̃_j.

A preliminary argument (which we cannot give here) shows that p̂ is consistent, i.e. p̂ →_p p. Also
n^{−1}Z →_p p, so that p̂_j/(n^{−1}Z_j) →_p 1. A Taylor expansion of the logarithm yields

    −2 Σ_{j=1}^k Z_j log( 1 + ( p̃_j/(n^{−1}Z_j) − 1 ) )
    = −2 Σ_{j=1}^k Z_j ( p̃_j/(n^{−1}Z_j) − 1 ) + Σ_{j=1}^k Z_j ( p̃_j/(n^{−1}Z_j) − 1 )² + o_p(1)
Tests with estimated parameters 131
Here the first sum vanishes (the p̃_j sum to one and the Z_j sum to n), and up to further o_p(1)
terms the expression equals

    Σ_{j=1}^k n ( n^{−1}Z_j − p̃_j )² / p̃_j + o_p(1).

The above expression is the one which p̃ = p̂ minimizes. Similarly to (9.10) we now have

    Σ_{j=1}^k p_j^{−1}( n^{−1}Z_j − p̃_j )² = ‖ F( n^{−1}Z − p̃ ) ‖²

since with the choice of the vector e_k = Λp = p^{1/2} and Λ as in Lemma 9.3.1 we have

    e_k^⊤ Λ( n^{−1}Z − p̃ ) = Σ_{j=1}^k ( n^{−1}Z_j − p̃_j ) = 1 − 1 = 0.
Furthermore, denoting

    q̃ = F( p̃ − p ),  Z̃ = F( n^{−1}Z − p ),   (9.21)

we obtain

    ‖ F( n^{−1}Z − p̃ ) ‖² = ‖ Z̃ − q̃ ‖²   (9.22)

and q̂ = F( p̂ − p ) minimizes

    n ‖ Z̃ − q̃ ‖² + o_p(1).   (9.23)

Let us disregard the requirement that all components of p̃ must be nonnegative; it can be shown
that since n^{−1}Z ∈ S_P, this requirement is fulfilled automatically for the minimizer p̂ if n is large
enough. With this agreement, the vector q̃ varies in the set

    H_1 = { F(x − p), x ∈ S̃ ∩ H }   (9.24)

where

    S̃ = { x : 1^⊤x = 1 }.

This set H_1 can be described as follows. The set S̃ is an affine subspace of R^k; for the given p ∈ S_P
it can be represented as

    S̃ = { p + z : 1^⊤z = 0 }.

It follows that H_1 is a d-dimensional linear subspace of R^{k−1}, and minimizing (9.23) over q̃ ∈ H_1 gives

    q̂ = Π Z̃,

i.e. q̂ is the projection of Z̃ onto H_1 (with Π the corresponding projection matrix). This is already (9.20) up to the size o_p(n^{−1/2}) of the error
term, for which a more detailed argument based on (9.23) is necessary.
Consider now the χ²-statistic relative to H, with (maximum likelihood) estimated parameter p̂ =
p̂(Z). We obtain

    χ²(Z) = Σ_{j=1}^k n ( p̂_j(Z) − n^{−1}Z_j )² / p̂_j(Z).

To find the asymptotic distribution, we substitute the denominator by its probability limit p_j (the
true parameter):

    χ²(Z) = Σ_{j=1}^k n ( p̂_j(Z) − n^{−1}Z_j )² / p_j + o_p(1)
          = n ‖ F( n^{−1}Z − p̂ ) ‖² + o_p(1)
          = n ‖ Z̃ − q̂ ‖² + o_p(1) = n ‖ (I_{k−1} − Π) Z̃ + o_p(n^{−1/2}) ‖² + o_p(1)

according to the approximation of the Lemma above. Now the matrix Ξ = I_{k−1} − Π is also a
projection matrix, namely onto the orthogonal complement of H_1, of rank k − 1 − d. It can be
represented as

    Ξ = C C^⊤

where C is a (k−1)×(k−1−d) matrix with orthonormal columns (such that C^⊤C = I_{k−1−d}). Thus

    χ²(Z) = ‖ n^{1/2} C^⊤ Z̃ ‖² + o_p(1).

Since

    n^{1/2} Z̃ = n^{1/2} F( n^{−1}Z − p ) →_L N_{k−1}(0, I_{k−1})

it follows that

    n^{1/2} C^⊤ Z̃ →_L N_{k−1−d}(0, I_{k−1−d}).

This implies

    χ²(Z) →_L χ²_{k−d−1}.

We see that if d parameters must be estimated under H, then the degrees of freedom in the limiting
χ² law are k − d − 1. We argued for a hypothesis H : p ∈ H_0 where H_0 is a d-dimensional affine
manifold in R^k described in (9.18) (0 ≤ d < k − 1).
Theorem 9.5.2 Consider Model Md,2: the observed random k-vector Z has law L(Z) = M_k(n, p)
where p is unknown. Let H_0 be a d-dimensional set of probability vectors of form (9.18). Consider
the hypotheses

    H : p ∈ H_0
    K : p ∉ H_0.

Let χ²_{k−d−1;1−α} be the lower (1−α)-quantile of the distribution χ²_{k−d−1}. Then the test φ(Z) defined by

    φ(Z) = 1 if χ²(Z) > χ²_{k−d−1;1−α},  φ(Z) = 0 otherwise,

is an asymptotic α-test as n → ∞, where p̂ = p̂(Z) is the MLE under H and

    χ²(Z) = Σ_{j=1}^k ( n p̂_j − Z_j )² / ( n p̂_j ).   (9.26)
Suppose more generally that the hypothesis is given by a smooth parametric family of probability
vectors

    P = { p_θ, θ ∈ Θ }

with Θ ⊆ R^d. Under some smoothness conditions, and assuming that the mapping θ ↦ p_θ is
one-to-one, the set P can be regarded as a smooth subset of the probability simplex S_P. We
can assume that in every point p_θ ∈ P there is a tangent set of form H_0 (or tangent affine subspace
H_1) which has the same dimension d. In this sense, locally we are back in the previous case of
Theorem 9.5.2. Here "locally" means that if the MLE θ̂ of θ is consistent, it will point us to a
vicinity of the true parameter θ, i.e. of the true underlying probability vector p_θ, and in this vicinity
we can substitute P by its tangent space H_0 at p_θ. This is the outline of the proof that the
χ²-statistic with estimated parameters

    χ²(Z) = Σ_{j=1}^k ( n p_j(θ̂) − Z_j )² / ( n p_j(θ̂) ),   (9.27)

where p_θ = (p_1(θ), …, p_k(θ)), still has a limiting χ²-distribution with k − d − 1 degrees of freedom.
The essential condition is that d parameters θ are to be estimated, and θ ↦ p_θ is smooth and
one-to-one.
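A concrete instance of (9.27), as a Python sketch: take k = 3 cells whose probabilities follow a binomial family B(2, θ), p(θ) = ((1−θ)², 2θ(1−θ), θ²), so d = 1 and the limit law is χ² with k − d − 1 = 1 degree of freedom. (The family, the MLE formula θ̂ = (Z_2 + 2Z_3)/(2n), and the counts below are illustrative, not from the text.)

```python
def chi_square_estimated(z):
    """Chi-square statistic (9.27) for the hypothesis that the k = 3 cell
    probabilities follow p(theta) = ((1-theta)^2, 2 theta (1-theta), theta^2)."""
    n = sum(z)
    theta = (z[1] + 2 * z[2]) / (2 * n)   # MLE of theta from the cell counts
    p = [(1 - theta) ** 2, 2 * theta * (1 - theta), theta ** 2]
    return sum((n * pj - zj) ** 2 / (n * pj) for zj, pj in zip(z, p))

# hypothetical counts with n = 100; here theta-hat = 0.4 and the fitted
# expected counts (36, 48, 16) match the data exactly, so the statistic is 0
stat = chi_square_estimated([36, 48, 16])
```

For data that do not fit the family exactly, stat would be compared with a quantile of χ²_1.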
In conjunction with binned data, this procedure can be used to test that the data are in a spe-
cific class of distributions, not necessarily multinomial. Suppose observations are i.i.d. real valued
X_1, …, X_n, with distribution Q. Suppose Q = {Q_θ, θ ∈ Θ} is a specific class of distributions, and
consider hypotheses

    H : Q ∈ Q
    K : Q ∉ Q.

This is transformed into multinomial hypotheses by selecting a partition of the real line into subsets
or cells A_1, …, A_k, as discussed above. We obtain a vector of cell probabilities p(Q) = (Q(A_1), …, Q(A_k)).
The observed vector of cell frequencies Z is multinomial M_k(n, p) with the above value of p. When
Q takes values inside the family Q, then p(Q) also takes values inside a family P defined by

    P := { p(Q_θ), θ ∈ Θ }.
These probabilities give the joint distribution of (X_1, X_2), with marginal distributions

    P(X_1 = j) = Σ_{l=1}^s p_{jl} =: q_{1,j},  P(X_2 = l) = Σ_{j=1}^r p_{jl} =: q_{2,l}.

We are interested in the problem whether X_1, X_2 are independent, i.e. whether the joint distribution
is the product of its marginals:

    p_{jl} = q_{1,j} q_{2,l},  j = 1, …, r, l = 1, …, s.   (9.28)

Suppose that there are n i.i.d. observations X_1, …, X_n, all having the distribution of X = (X_1, X_2).
This can easily be transformed into a hypothesis about a multinomial distribution. Call the pairs
(j, l) cells; it is not important that they are pairs of natural numbers, these can just be symbols
for certain categories. Thus there are rs cells; they can be written as an r×s-matrix. Define a
counting variable Y_i associated to observation X_i = (X_{1i}, X_{2i}): Y_i is an r×s-matrix such that

    Y_i = (Y_{i,jl})_{j=1,…,r; l=1,…,s},
    Y_{i,jl} = 1 if (X_{1i}, X_{2i}) = (j, l), 0 otherwise.

These Y_i can be identified with vectors of dimension k = rs; when they are looked upon as vectors,
they have a multinomial distribution

    L(Y_i) = M_k(1, p)
where p is the r × s matrix of cell probabilities p_jl. We can also define counting vectors for each
variable X_1, X_2 separately:

Y_{1,i} = (Y_{1,i,j})_{j=1,...,r},    Y_{2,i} = (Y_{2,i,l})_{l=1,...,s},

Y_{1,i,j} = { 1 if X_{1i} = j,      Y_{2,i,l} = { 1 if X_{2i} = l,
            { 0 otherwise,                      { 0 otherwise.

Then the counting matrix Y_i is obtained as the outer product

Y_i = Y_{1,i} Y_{2,i}^⊤.

Again we have n of these observed counting matrices, and the matrix (or vector) of observed cell
frequencies is

Z = Σ_{i=1}^n Y_i.

Let us stress again that we identify an r × s matrix with an rs-vector here. We write M_{r×s}(n, p)
for the multinomial distribution of an r × s matrix with corresponding matrix of probabilities p.
(This can be identified with M_{rs}(n, p) when p is construed as a vector.) We can also define cell
frequencies for the variables X_1, X_2 separately:

Z_1 = Σ_{i=1}^n Y_{1,i},    Z_2 = Σ_{i=1}^n Y_{2,i}.     (9.29)
The hypothesis of independence of X_1, X_2 translates into a hypothesis about the shape of the
probability matrix p: according to (9.28), for

q_1 = (q_{1,j})_{j=1,...,r},    q_2 = (q_{2,l})_{l=1,...,s},

we have

H : p = q_1 q_2^⊤
K : p is not of this form.

The hypothesis H can be written in the form H : p ∈ P_I where P_I is a parametric family of
probability vectors (the lower index I in P_I stands for independence):

P_I = { q_1 q_2^⊤ : q_1 ∈ S_{P,r}, q_2 ∈ S_{P,s} }

where S_{P,r} is the probability simplex in R^r. Indeed, in the case of independence we have just the
two marginals, which are two probability vectors in R^r, R^s respectively. These marginal probability
vectors have r − 1 and s − 1 independent parameters respectively (the respective first r − 1 and
s − 1 components). Thus P_I can be smoothly parametrized by an (r + s − 2)-dimensional parameter
θ ∈ Θ, where Θ is a subset of R^{r+s−2} (but we do not make this explicit). Thus the hypotheses are

H : p ∈ P_I
K : p ∉ P_I.

Define marginal cell frequencies

Z_{j·} = Σ_{l=1}^s Z_{jl},    Z_{·l} = Σ_{j=1}^r Z_{jl}.
Proposition 9.6.1 When Z is a multinomial r × s matrix with law L(Z) = M_{r×s}(n, p), r, s ≥ 2,
the maximum likelihood estimator p̂ under the hypothesis of independence p ∈ P_I is

p̂ = q̂_1 q̂_2^⊤ = (q̂_{1,j} q̂_{2,l})_{j=1,...,r, l=1,...,s}

where

q̂_1 = n^{−1} Z_1,    q̂_2 = n^{−1} Z_2

and Z_1, Z_2 are the vectors of marginal cell frequencies.

Proof. The probability function of Z is

P(Z = z) = C(n, z) ∏_{j=1}^r ∏_{l=1}^s p_jl^{z_jl}

where C(n, z) is a factor which does not depend on the parameters. Independence means p_jl =
q_{1,j} q_{2,l}, so the likelihood function is

l(q_1, q_2) = C(n, z) ∏_{j=1}^r ∏_{l=1}^s (q_{1,j} q_{2,l})^{z_jl}
            = C(n, z) ( ∏_{j=1}^r ∏_{l=1}^s q_{1,j}^{z_jl} ) ( ∏_{j=1}^r ∏_{l=1}^s q_{2,l}^{z_jl} )
            = C(n, z) ∏_{j=1}^r q_{1,j}^{z_{j·}} ∏_{l=1}^s q_{2,l}^{z_{·l}}.

Now the factor ∏_{l=1}^s q_{2,l}^{z_{·l}} is the likelihood (up to a factor) for the multinomial vector Z_2 defined
in (9.29) with parameter q_2, and ∏_{j=1}^r q_{1,j}^{z_{j·}} is proportional to the likelihood for Z_1. Thus
maximizing over q_1, q_2 amounts to maximizing the product of two multinomial likelihoods, each in
its own parameter q_1, q_2. The maximizer of each likelihood is the unrestricted MLE in a multinomial
model for Z_1 or Z_2, thus according to Proposition 9.1.1

q̂_1 = n^{−1} Z_1,    q̂_2 = n^{−1} Z_2.
For n i.i.d. data (X_{1i}, X_{2i}), this gives rise to observed cell frequencies Z_jl which can be arranged
in an r × s table, bordered by the marginal counts Z_{j·}, Z_{·l}; such an array is called a contingency
table. It serves as a symbolic aid in computing the χ²-statistic (9.30). The χ²-test for independence is
also called the χ²-test in a contingency table.

Exercise. Test your random number generator for independence in consecutive pairs. If N_1, N_2, ... is
the sequence generated, then take pairs X_1 = (N_1, N_2), X_2 = (N_3, N_4), ..., and test independence
of the first from the second component. Note: if they are not independent, then presumably the
pairs X_i are also not independent, so the alternatives which one might formulate are different from
the above. Still, the contingency table provides an asymptotic α-test.
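The whole recipe can be sketched in a few lines of plain Python (the 2 × 2 table below is invented for illustration): form the marginal frequencies Z_1, Z_2, the fitted expected counts n q̂_{1,j} q̂_{2,l} from Proposition 9.6.1, and the χ²-statistic; since d = r + s − 2 parameters are estimated, the limiting law under H has rs − (r + s − 2) − 1 = (r − 1)(s − 1) degrees of freedom.

```python
def independence_chisq(table):
    """Chi-square statistic for independence in an r x s contingency
    table; expected counts are n * q1_hat[j] * q2_hat[l] as in
    Proposition 9.6.1.  Returns (statistic, degrees of freedom)."""
    r, s = len(table), len(table[0])
    n = sum(sum(row) for row in table)
    z1 = [sum(row) for row in table]                             # Z_{j.}
    z2 = [sum(table[j][l] for j in range(r)) for l in range(s)]  # Z_{.l}
    stat = 0.0
    for j in range(r):
        for l in range(s):
            expected = z1[j] * z2[l] / n    # n * (z1[j]/n) * (z2[l]/n)
            stat += (table[j][l] - expected) ** 2 / expected
    return stat, (r - 1) * (s - 1)

stat, df = independence_chisq([[20, 30], [30, 20]])   # invented counts
```

The statistic is then compared with a χ²_{(r−1)(s−1)} quantile.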
Chapter 10

REGRESSION

10.1 Regression towards the mean

In Galton's model for the body heights of fathers (X) and sons (Y), both are assumed normal with
the same mean a, and

Var(X) =: σ²_X = Var(Y) =: σ²_Y,
Cov(X, Y) =: σ_XY > 0.

(Footnotes: for the term "regression" see http://www.m-w.com. Sources differ as to what Galton
actually observed; the textbook has fathers and sons (p. 555), while the Encyclopedia of Statistical
Sciences quotes Galton about seeds.)

Thus

L( (X, Y)^⊤ ) = N_2(a·1, Σ),    Σ = [ σ²_X  σ_XY ; σ_XY  σ²_Y ].
The average body height of sons, given the height of the father, is described by the conditional
expectation E(Y | X = x). To find it, we state a basic result on the conditional distribution L(Y | X = x)
in a bivariate normal. Some special cases appeared already (Proposition 5.5.2).

Proposition 10.1.1 Suppose that X and Y have a joint bivariate normal distribution with expectation
vector μ = (μ_X, μ_Y)^⊤ and positive definite covariance matrix Σ:

L( (X, Y)^⊤ ) = N_2(μ, Σ),    Σ = [ σ²_X  σ_XY ; σ_XY  σ²_Y ].     (10.1)

Then

L(Y | X = x) = N( μ_Y + β(x − μ_X), σ²_{Y|X} )     (10.2)

where

β = σ_XY / σ²_X,
σ²_{Y|X} = σ²_Y − σ²_XY / σ²_X.
Proof. Recall the form of the joint density p of X and Y (Lemma 6.1.10): for z = (x, y)^⊤

p(z) = p(x, y) = (2π)^{−1} |Σ|^{−1/2} exp( −(1/2) (z − μ)^⊤ Σ^{−1} (z − μ) ),

|Σ| = σ²_X σ²_Y − σ²_XY,

Σ^{−1} = [ σ²_X  σ_XY ; σ_XY  σ²_Y ]^{−1} = |Σ|^{−1} [ σ²_Y  −σ_XY ; −σ_XY  σ²_X ].

Note that

x² σ²_Y − 2xy σ_XY + y² σ²_X = x² σ²_Y + σ²_X ( y − x σ_XY σ_X^{−2} )² − x² σ²_XY σ_X^{−2}
                             = σ²_X (y − βx)² + x² σ²_{Y|X},

|Σ| / σ²_X = σ²_Y − σ²_XY σ_X^{−2} = σ²_{Y|X}.

Applying this with x − μ_X and y − μ_Y in place of x and y, the quadratic form in the exponent splits
into a term in x alone and the term (y − μ_Y − β(x − μ_X))² / σ²_{Y|X}, which identifies the conditional
density of Y given X = x as that of N(μ_Y + β(x − μ_X), σ²_{Y|X}); this proves (10.2).
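The formulas of Proposition 10.1.1 are easy to evaluate; the following sketch (the covariance values are invented) computes β, σ²_{Y|X} and the intercept of the regression line from a given mean vector and covariance matrix:

```python
def conditional_params(mu_x, mu_y, var_x, var_y, cov_xy):
    """Parameters of L(Y | X = x) in a bivariate normal:
    slope beta = cov/var_x, residual variance var_y - cov^2/var_x,
    and the intercept, so that E(Y | X = x) = intercept + beta*x."""
    beta = cov_xy / var_x
    var_y_given_x = var_y - cov_xy ** 2 / var_x
    intercept = mu_y - beta * mu_x
    return beta, var_y_given_x, intercept

# e.g. sigma_X^2 = 4, sigma_Y^2 = 9, sigma_XY = 3 (made-up values)
beta, v, c = conditional_params(0.0, 1.0, 4.0, 9.0, 3.0)
```

Note that the conditional variance σ²_{Y|X} never exceeds σ²_Y: conditioning on X can only reduce the variance.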
Definition 10.1.3 Let X and Y have a joint bivariate normal distribution (10.1).
(i) The quantity

β = σ_XY / σ²_X

is called the regression coefficient for the regression of Y on X.
(ii) The linear function

y* = E(Y | X = x) = μ_Y + β(x − μ_X)     (10.3)

is called the regression function or regression line (for Y on X).

For the father/son height model of Galton, we assumed that μ_Y = μ_X = a, furthermore σ_Y = σ_X, and
positive correlation: σ_XY > 0. This implies

0 < β < 1.

Here equality β = 1 is not possible: since Σ is positive definite, it cannot be singular, hence |Σ| =
σ²_X σ²_Y − σ²_XY > 0, so that

|σ_XY| < σ_X σ_Y.
Thus

E(Y | X = x) = a + β(x − a) = (1 − β)a + βx,

which means that E(Y | X = x) is a convex combination of x and a; the average height of sons,
given the height x of the fathers, is always pulled toward the mean height a. It is less than x if x > a
and is greater than x if x < a.

It is interesting to note that an analogous phenomenon is observed for the relationship of sons
with given height to their fathers. Indeed, reversing the roles of X and Y, we obtain the second
regression line

x* = E(X | Y = y) = μ_X + β′(y − μ_Y),    β′ = σ_XY / σ²_Y.     (10.4)

Under the assumption σ²_Y = σ²_X we have β′ = β, hence 0 < β′ < 1, so that the fathers of tall sons
tend to be shorter, etc. In (10.4), x* is given as a function of y; when we put it in the same form as the
first regression line, with y a function of x, we obtain

y = μ_Y + (1/β′)(x − μ_X)

and it turns out that the other regression line also goes through the point (μ_X, μ_Y), but has a different
slope 1/β′. This slope is higher (1/β′ > 1) if σ²_Y = σ²_X. The linear function (10.4) is said to pertain
to the regression of X on Y.
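A small numeric check of the two regression lines (the covariance values are invented, with σ²_X = σ²_Y as in Galton's setting): both lines pass through (μ_X, μ_Y), and the product of the two regression coefficients is the squared correlation, β·β′ = σ²_XY / (σ²_X σ²_Y) = ρ²:

```python
var_x, var_y, cov_xy = 4.0, 4.0, 2.0    # made-up values, var_x == var_y
beta = cov_xy / var_x                    # regression of Y on X
beta_prime = cov_xy / var_y              # regression of X on Y
rho = cov_xy / (var_x * var_y) ** 0.5    # correlation coefficient
product = beta * beta_prime              # equals rho**2
```

With σ²_X = σ²_Y both coefficients equal ρ, and the second line, rewritten with y as a function of x, indeed has the steeper slope 1/β′ > 1.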
Back in the general bivariate normal N_2(μ, Σ), consider the random variable

ε = Y − E(Y | X = x) = Y − μ_Y − β(x − μ_X);

we know from (10.2) that it has conditional law N(0, σ²_{Y|X}), given x. It is often called the residual
(random variable). With the conditional law of ε, we can form the joint law of ε and X; since
the conditional law does not depend on x (Corollary 10.1.2 (ii)), it turns out that ε and X are
independent. Thus define

ε = Y − E(Y | X) = Y − μ_Y − β(X − μ_X).

Recall that E(Y | X) denotes the conditional expectation as a random variable, i.e. E(Y | X = x)
when X is understood as random.

As a consequence, we can write any bivariate normal distribution as

Y = μ_Y + βξ + ε     (10.5)
X = μ_X + ξ     (10.6)

where ξ, ε are independent normal with laws N(0, σ²_X), N(0, σ²_{Y|X}) respectively. If μ_X = 0 we can
write

Y = μ_Y + βX + ε.

Assume that we wish to obtain a representation (10.5), (10.6) in terms of standard normals. Define
ε_0 = σ_{Y|X}^{−1} ε, ξ_0 = σ_X^{−1} ξ; then for the matrix

M = [ σ_X  0 ; β σ_X  σ_{Y|X} ] = [ σ_X  0 ; σ_XY σ_X^{−1}  (σ²_Y − σ²_XY σ_X^{−2})^{1/2} ]

we have (X, Y)^⊤ = μ + M (ξ_0, ε_0)^⊤ and

Σ = M M^⊤

according to the rules for the multivariate normal. The above relation can indeed be verified, and
represents a decomposition of Σ into a product of a lower triangular matrix with its transpose.
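The triangular decomposition Σ = M M^⊤ can be checked numerically (the values for σ²_X, σ²_Y, σ_XY below are invented but positive definite):

```python
import math

var_x, var_y, cov_xy = 4.0, 9.0, 3.0     # invented, positive definite
sd_x = math.sqrt(var_x)
beta = cov_xy / var_x
sd_y_given_x = math.sqrt(var_y - cov_xy ** 2 / var_x)

# lower triangular factor M = [[sd_x, 0], [beta*sd_x, sd_y_given_x]]
M = [[sd_x, 0.0], [beta * sd_x, sd_y_given_x]]
# rebuild Sigma = M @ M^T and compare with [[var_x, cov], [cov, var_y]]
Sigma = [[sum(M[i][k] * M[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
```

This is the 2 × 2 case of the Cholesky decomposition of a covariance matrix.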
Next we define a regression function for possibly nonnormal distributions.

Definition 10.1.1 Let X, Y have a continuous or discrete joint distribution, where the first moment
of Y exists: E|Y| < ∞. The regression function (for Y on X) is defined as

r(x) := E(Y | X = x).

In the normal case, we have seen that r is a linear function, determined by the regression coefficient β
and the two means. In the general case, one should verify that E(Y | X = x) exists, and clarify the
problem of uniqueness. Let us do that in the continuous case.

Proposition 10.1.1 Let X, Y have a continuous joint distribution, where E|Y| < ∞.
(i) There is a version of the conditional density p(y|x) such that for this density, E(Y | X = x)
exists for all x.
(ii) For all versions of p(y|x), the law of the random variable E(Y | X) = r(X) is the same; thus
L(E(Y | X)) is unique.

Proof. Recall Definition 5.4.2: any version of the conditional density p(y|x) fulfills p(x, y) =
p(y|x) p_X(x). Let A be the set of x for which ∫ |y| p(y|x) dy is infinite; by the usual kind
of reasoning with integrals, we must have P(X ∈ A) = 0. For these x we can modify our
version of p(y|x), e.g. take it as the standard normal density (as in the proof of Lemma 5.4.3), so
that ∫ |y| p(y|x) dy is also finite for these x. Thus we found a version p(y|x) such that

E(Y | X = x) = ∫ y p(y|x) dy     (10.9)

exists for all x. Note that two densities for a probability law must coincide except on a set of
probability 0. This holds for p(x, y), and it implies that two versions of the function ∫ y p(x, y) dy
must coincide for all x outside a set of probability 0 (in terms of X). But the versions of the densities
p_X(x) also coincide except on a set of probability 0 in terms of X; and then (10.10) implies such a
property for r(x). This means that L(r(X)) is uniquely determined.

Note that strictly speaking, we are not entitled to speak of the regression function r(x) above,
as it is not unique. However the law of the random variable E(Y | X) = r(X) is unique.
The next statement recalls a best prediction property of the conditional expectation. In the
framework of discrete distributions, this was already discussed in Remark 2.4, in connection with
the properties of the conditional expectation E(Y |X) as a random variable. For the normal case,
this was exercise H5.2.
Proposition 10.1.2 Let X, Y have a continuous or discrete joint distribution, where the second
moment of Y exists: E|Y|² < ∞. Then the regression function has the property that

E(Y − r(X))² = min_{f ∈ M_X} E(Y − f(X))²,

where M_X is the set of all (measurable) functions of X such that E(f(X))² < ∞.

Proof. This resembles other calculations with conditional expectations (cf. Remark 2.4, p. 9).
A little more work is needed now to ensure that all expectations exist. We concentrate on the
continuous case. First note that under the conditions, E(Y − f(X))² is finite for every f ∈ M_X.
We claim that also r ∈ M_X. Indeed the Cauchy-Schwarz inequality gives

|r(x)|² = ( ∫ y p(y|x) dy )² ≤ ∫ p(y|x) dy · ∫ y² p(y|x) dy = ∫ y² p(y|x) dy.

The last integral can be shown to be finite for all x when E|Y|² < ∞, similarly to Lemma 10.1.1
above (possibly with a modification of p(y|x)). Thus

E(r(X))² ≤ ∫ p_X(x) ( ∫ y² p(y|x) dy ) dx = ∫∫ y² p(x, y) dx dy = EY² < ∞,

which proves r ∈ M_X. This implies that all expectations in the following reasoning are finite. We
have for any f ∈ M_X, since the cross term vanishes by conditioning on X,

E(Y − f(X))² = E(Y − r(X))² + E(r(X) − f(X))².
The regression function is thus a characteristic of the joint distribution of X and Y. In general
r(x) := E(Y | X = x) is a nonlinear regression function; it is linear if the joint distribution is
normal.

In the general case, it is no longer true that the residual ε = Y − r(X) and X are independent;
it can only be shown that they are uncorrelated (Exercise). However one can build a bivariate
distribution of X, Y from independent ε (with zero mean) and X:

Y = r(X) + ε.     (10.11)

Assume that E|r(X)| < ∞, so that Y has finite expectation and hence E(Y | X) exists. It then
follows that

E(Y | X) = E(r(X) | X) + E(ε | X) = r(X) + Eε = r(X),

so r is the regression function for (X, Y).
Y_i = r(x_i) + ε_i,    i = 1, ..., n.

In a linear regression model, the regression function is linear:

r(x) = α + βx.

In a nonparametric regression model the functions are also nonlinear in x, but the term nonlinear
regression is reserved for families r_θ, θ ∈ Θ, indexed by a finite dimensional parameter θ. A typical
example for nonlinear regression is polynomial regression

r(x) = Σ_{j=0}^k β_j x^j

or more generally

r_θ(x) = Σ_{j=0}^k β_j φ_j(x)

for given functions φ_j, or e.g.

r_θ(x) = sin(θx).
where the ε_i are i.i.d. with Eε_1 = 0, Eε²_1 < ∞. This is a direct generalization of (10.12). In matrix
form the model reads

Y = Xβ + ε.

Notation and terminology. Now X is an n × k matrix, whereas in the previous section X was a
random variable (the regressor variable), and generalizing this, in the first paragraph of this section
X = (X_1, ..., X_k) was a random vector of regressor variables. This reflects the situation that the
matrix X may arise from independent realizations x_i of the random vector X, in conjunction with
a conditional point of view (the x_i are considered nonrandom, and form the rows of the matrix X).
Therefore we keep the symbol X for the matrix above; X may be called the regression matrix. The
columns of X (γ_1, ..., γ_k, say) may be called nonrandom regressor variables; they correspond
to the random regressors X_1, ..., X_k.

In the normal linear model, we have L(ε) = N_n(0, σ²I_n), and the components ε_i are i.i.d. normal
with mean 0 and variance σ². Hence L(Y) = N_n(Xβ, σ²I_n). In the general case, the ε_i are only
uncorrelated; however, when modelling real world phenomena by a linear model, random variables
which are uncorrelated but not independent will not often occur.
Example. Let U be uniformly distributed on [0, 1] and set

Z = cos(2πU),    Y = sin(2πU).

Then Z² + Y² = 1 and the pair (Z, Y) takes values on the unit circle, which implies that Z, Y are
not independent (they do not have a joint density on R² which is the product of its marginals);
nevertheless Z and Y are uncorrelated.
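A quick numerical check of this example, approximating the covariance Cov(Z, Y) = E[ZY] − EZ·EY by a midpoint Riemann sum over u ∈ [0, 1]:

```python
import math

N = 100_000
ez = ey = ezy = 0.0
for i in range(N):                   # midpoint rule on [0, 1]
    u = (i + 0.5) / N
    z = math.cos(2 * math.pi * u)
    y = math.sin(2 * math.pi * u)
    ez += z / N
    ey += y / N
    ezy += z * y / N
cov = ezy - ez * ey                  # Cov(Z, Y), approximately 0
```

The covariance comes out numerically zero (since ZY = sin(4πU)/2 integrates to 0), while (Z, Y) is deterministic given U and stays on the unit circle, so the pair is far from independent.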
Identifiability.
Let us explain the assumption rank(X) = k. Let x_i^⊤, i = 1, ..., n, be the rows of the n × k matrix
X and γ_j, j = 1, ..., k, the columns:

X = [ x_1^⊤ ; ... ; x_n^⊤ ] = ( γ_1, ..., γ_k ).

If P = {P_θ, θ ∈ Θ} is a statistical model, i.e. the set of possible (assumed) distributions of an observed
random variable Z with values in Z, then nonidentifiability means that there exist two parameters
θ_1 ≠ θ_2 which lead to the same law of Z. These cannot be distinguished in any statistical sense: for
the hypotheses θ = θ_1 vs. θ = θ_2 the trivial randomized test φ(Z) = α is a most powerful α-test. A
parameter is just an index or name for a probability law; identifiability means that no law in P
has two names.

Identifiability thus is a basic condition for a statistical model, if inference on the parameter is
desired. If θ is nonidentifiable in P then it is advisable to reparametrize, i.e. give the laws other
names which are identifiable.
Assume for a moment that the assumption rank(X) = k is not part of the definition of the linear
model (Definition 10.3.1(i)).

Lemma 10.3.4 In the normal linear model, β is identifiable if and only if rank(X) = k.

Proof. If rank(X) = k, then Xβ_1 = Xβ_2 implies β_1 = β_2; since

EY = Xβ

determines Xβ from the law of Y, the parameter β is identifiable.
Conversely, assume identifiability; recall L(Y) = N_n(Xβ, σ²I_n). If rank(X) < k then (10.14) is
violated, i.e. γ_1, ..., γ_k are linearly dependent. Hence there is β ≠ 0 such that Xβ = 0. Since also
X·0 = 0, the parameters β and 0 lead to the same distribution L(Y) = N_n(0, σ²I_n), hence β is not
identifiable. This contradicts the assumption, hence rank(X) = k.

In the linear model without the normality assumption, another parameter is present, namely the
distribution of the random noise vector ε. When this law L(ε) is assumed unknown, except for
the assumptions Eε = 0, Cov(ε) = σ²I_n, then the parameter is θ = (β, L(ε)) and L(Y) = P_θ is
indexed by θ. In this situation, we call β = β(θ) identifiable if P_{θ_1} = P_{θ_2} implies β(θ_1) = β(θ_2).
It is easy to see (exercise) that also in this model, the condition rank(X) = k is necessary and
sufficient for identifiability of β.
10.3.1 Special cases of the linear model

1. Bivariate linear regression. We have

Y_i = α + βx_i + ε_i,    i = 1, ..., n.

Here k = 2, the rows of X are x_i^⊤ = (1, x_i), the parameter is (α, β)^⊤, and identifiability holds as
soon as not all x_i are equal.

2. Normal location-scale model. Here

Y_i = μ + ε_i,    i = 1, ..., n.
3. Polynomial regression. Note that we obtained as a special case of the linear model one which was
earlier classified as a nonlinear regression model (in a wide sense), since the functions

r(x) = Σ_{j=1}^k β_j φ_j(x)

are nonlinear in x (polynomials). However they are linear in the parameter β = (β_1, ..., β_k)^⊤,
and for purposes of estimating β this can be treated as a linear model.

Lemma 10.3.5 In the linear model arising from polynomial regression, the parameter β is identifiable
if and only if among the design points x_i, i = 1, ..., n, there are at least k different points.

Proof. Identifiability means linear independence of the vectors γ_j in (10.15). This in turn means
that for any coefficients β_1, ..., β_k, the relation

r(x_i) = Σ_{j=1}^k β_j φ_j(x_i) = 0,    i = 1, ..., n     (10.16)

implies β_1 = ... = β_k = 0.
Let β_1, ..., β_k be the coefficients of r_0; then (10.16) holds for r = r_0, hence the vectors γ_j are
linearly dependent.
The result that some of the design points x_i may be the same opens up the possibility of a design
with replication: for a fixed number m ≥ k, take m different points x_j, j = 1, ..., m, and repeated
measurements at these points, l times say, so that n = ml. The entire design may then be written
with a double index:

x_jk = x_j,    j = 1, ..., m,  k = 1, ..., l,

and each of the x_j appears l times. As a result, using double index notation again, we obtain
observations

Y_jk = r(x_j) + ε_jk,    j = 1, ..., m,  k = 1, ..., l     (10.17)

which suggests taking averages Ȳ_j = l^{−1} Σ_{k=1}^l Y_jk to obtain a simplified model, with more accurate
data Ȳ_j. The choice of such a replicated design may be advantageous.

Comment on notation: we now use k also as a running index, even though above k denoted the
number of functions φ_j involved, i.e. the dimension of the regression parameter β. In the sequel,
we will use d for the dimension of β. The reason is that the use of k in expressions such as Y_jk is
traditional, in connection with regression and replicated designs.
4. Analysis of variance. Consider a model

Y_jk = μ_j + ε_jk,    j = 1, ..., m,  k = 1, ..., l     (10.18)

where the ε_jk are independent noise variables. Here again a replication structure is present, similarly
to (10.17), but no particular form is assumed for the function r. Thus, if r is an arbitrary function in
(10.17), we might as well write μ_j = r(x_j) and assume μ_j unrestricted. The case m = 2 (for normal
ε_jk) was already encountered in the two sample problem. Suppose there are two treatments,
j = 1, 2, respective observations Y_jk, j = 1, 2, and one wishes to test whether the treatments have
an effect: H : μ_1 = μ_2 vs. K : μ_1 ≠ μ_2. To explain the name "analysis of variance", let us find the
likelihood ratio test. In HW 7.1 we found a certain t-test for this problem; cf. also HW 6.1; this
will turn out to be the LR test.

Define the two sample means and variances

ȳ_i = l^{−1} Σ_{k=1}^l y_ik,    S²_il = l^{−1} Σ_{k=1}^l (y_ik − ȳ_i)²,    i = 1, 2.

For the pooled sample mean ȳ and pooled sample variance S²_n (cf. (10.19), (10.20)) we obtain

S²_n = (1/2)(S²_1l + S²_2l) + (1/2) Σ_{j=1,2} (ȳ_j − ȳ)².     (10.21)

Note that the second term has the form of a sample variance, for a sample of size 2 with observations
ȳ_1, ȳ_2. The first term can be seen as the "variability within groups" and the second
term as the "variability between groups".
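The decomposition (10.21) is easy to verify numerically (the two groups below are invented data with l = 2, using the divisor-l variances S²_il defined above):

```python
def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    """Sample variance with divisor l, as in S_il^2 above."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

group1, group2 = [1.0, 3.0], [5.0, 7.0]        # invented observations
pooled = group1 + group2
within = 0.5 * (var(group1) + var(group2))     # variability within groups
gbar = mean(pooled)
between = 0.5 * sum((mean(g) - gbar) ** 2 for g in (group1, group2))
total = var(pooled)                            # pooled sample variance S_n^2
# (10.21): total == within + between
```

Here the within-group variability is small and the between-group variability large, which is exactly the configuration in which the LR test below rejects.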
Theorem 10.3.6 In the Gaussian two sample problem (10.18), L(ε_jk) = N(0, σ²), σ² > 0 unknown,
for hypotheses

H : μ_1 = μ_2
K : μ_1 ≠ μ_2

the LR statistic is

L(y_1, y_2) = ( 1 + (1/2) Σ_{j=1,2} (ȳ_j − ȳ)² / ( (1/2)(S²_1l + S²_2l) ) )^{n/2}.

The LR test is equivalent to a t-test which rejects when |T| is too large, where

T = l^{1/2} (ȳ_1 − ȳ_2) / ( S̃²_1l + S̃²_2l )^{1/2},

S̃²_il = (l/(l − 1)) S²_il,    i = 1, 2,

and L(T) = t_{2l−2} under H.
Comment. This form of the likelihood ratio statistic explains the name analysis of variance.
The pooled sample variance is decomposed according to (10.21), and the LR test rejects if the
variability between groups is too large, compared to the variability within groups.
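The t-statistic of Theorem 10.3.6 in plain Python (the data reuse the invented two-group example; note that S̃² uses the divisor l − 1):

```python
def two_sample_t(y1, y2):
    """T = l^{1/2} (ybar_1 - ybar_2) / (S~_1l^2 + S~_2l^2)^{1/2},
    where S~_il^2 is the sample variance with divisor l - 1; under H
    (and normality) T has a t distribution with 2l - 2 degrees of
    freedom.  Assumes equal group sizes l = len(y1) = len(y2) >= 2."""
    l = len(y1)
    m1, m2 = sum(y1) / l, sum(y2) / l
    s1 = sum((y - m1) ** 2 for y in y1) / (l - 1)
    s2 = sum((y - m2) ** 2 for y in y2) / (l - 1)
    return l ** 0.5 * (m1 - m2) / (s1 + s2) ** 0.5

t = two_sample_t([1.0, 3.0], [5.0, 7.0])   # invented data, l = 2
```

For these data |T| = 2√2 ≈ 2.83, to be compared with a t_{2l−2} = t_2 quantile.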
Proof. The argument is very similar to the proof of Proposition 8.4.2 about the LR tests for the
one sample Gaussian location-scale model; it can be considered the two sample analog. Since
we have two independent samples with different expectations μ_i and the same variance, the joint
density p_{μ_1,μ_2,σ²} of all the data y_1 = (y_11, ..., y_1l), y_2 = (y_21, ..., y_2l) is, under the alternative,

p_{μ_1,μ_2,σ²}(y_1, y_2) = ∏_{i=1,2} (2πσ²)^{−l/2} exp( − (S²_il + (ȳ_i − μ_i)²) / (2σ²l^{−1}) ).     (10.22)

Maximizing the likelihood under the alternative we obtain estimates μ̂_i = ȳ_i and, for n = 2l,

max_{μ_1≠μ_2, σ²>0} p_{μ_1,μ_2,σ²}(y_1, y_2) = max_{σ²>0} (2πσ²)^{−l} exp( − (S²_1l + S²_2l) / (2σ²l^{−1}) )
    = max_{σ²>0} (2πσ²)^{−n/2} exp( − ((S²_1l + S²_2l)/2) / (2σ²n^{−1}) )
    = (2πσ̂²)^{−n/2} exp(−n/2)

where

σ̂² = (1/2)(S²_1l + S²_2l).

Under the hypothesis μ_1 = μ_2 = μ the data y_1k, y_2k are identically distributed, as N(μ, σ²). Thus
we can treat the two samples as one pooled sample of size n, and form the pooled mean and variance
estimates (10.19), (10.20). To maximize the likelihood in μ, σ², we just refer to the results in the
one sample case (Proposition 8.4.2) to obtain

max_{μ, σ²>0} p_{μ,μ,σ²}(y_1, y_2) = (2πσ̂²_0)^{−n/2} exp(−n/2)

where σ̂²_0 = S²_n is the pooled sample variance. Thus the likelihood ratio is

L(y_1, y_2) = max_{μ_1≠μ_2, σ²>0} p_{μ_1,μ_2,σ²}(y_1, y_2) / max_{μ, σ²>0} p_{μ,μ,σ²}(y_1, y_2)
            = ( σ̂²_0 / σ̂² )^{n/2} = ( S²_n / ( (S²_1l + S²_2l)/2 ) )^{n/2}.

With (10.21), this gives the stated form of the LR statistic. Moreover, under H,

L( (l/2)^{1/2} (ȳ_1 − ȳ_2) ) = N(0, σ²),
L( l(S²_1l + S²_2l)/σ² ) = χ²_{2(l−1)} = χ²_{2l−2},

and the two statistics are independent, which yields L(T) = t_{2l−2}.
The ANOVA (analysis of variance) model (10.18) is a special case of the linear model. Write

Y = Xμ + ε,

where μ = (μ_1, ..., μ_m)^⊤, 1_l is the l-vector consisting of 1s, and X is the ml × m matrix where
the column vectors 1_l are arranged in a diagonal fashion (with 0s elsewhere). The hypothesis of
equal treatment means can then be written as

H : μ ∈ Lin(1_m)

where Lin(1_m) is the linear subspace of R^m spanned by the column vector 1_m. That is a special
case of a linear hypothesis about μ (i.e. μ is in some specified subspace of R^m).
Consider now the general linear model

Y = Xβ + ε,     (10.25)
Eε = 0,    Cov(ε) = σ²I_n,     (10.26)

and the problem of estimating β, with known or unknown σ². If X is an n × k matrix and rank(X) =
k (identifiability), then the expectation of the random vector Y is in the linear subspace of R^n
spanned by the matrix X (or by the k columns γ_j of X):

Lin(X) = Lin({γ_1, ..., γ_k}) = { Xβ : β ∈ R^k } = { z = Σ_{j=1}^k β_j γ_j, β_j arbitrary }.

Thus

EY ∈ Lin(X),    Cov(Y) = σ²I_n.

Indeed no distributional assumptions about ε are made in the general linear model, so the above is
all the information one has. For obtaining an estimator of β, one could try to apply the principle
of maximum likelihood. However since the distribution of ε (Q, say) is unspecified, it would have
to be considered a parameter, along with β. Then Q and β specify the distribution of Y, and
hence a likelihood for any realization Y = y. But maximizing it in Q does not lead to satisfactory
results, since the class for Q (arbitrary distributions on R^n) is too large. Thus one has to look
for a different principle to get an estimator of β. The distribution Q of the noise is not of primary
interest in most cases; it can be considered a nuisance parameter. The regression parameter β is
the parameter of interest, since it describes the dependence of Y on X; possibly σ² is also of interest.

Of course one could assume normality, and then find maximum likelihood estimators. However
another principle can be invoked, without a normality assumption.
Definition 10.4.1 In the general linear model (10.25), (10.26), with observed vector Y ∈ R^n,
β ∈ R^k and rank(X) = k, a least squares estimator (LSE) of β is a function β̂ = β̂(Y) of
the observations such that

‖Y − Xβ̂‖² = min_{β ∈ R^k} ‖Y − Xβ‖².

The name "least squares" derives from the fact that the squared Euclidean norm of any vector
z ∈ R^n is the sum of squares of the components z_i:

‖z‖² = Σ_{i=1}^n z_i².

The model equations can be written

Y_i = x_i^⊤ β + ε_i,    i = 1, ..., n,

where the x_i^⊤ are the rows of X (the x_i are k-vectors). Thus another way of describing the least
squares estimator is to say that for given Y it is a minimizer of

Ls_Y(β) = Σ_{i=1}^n (Y_i − x_i^⊤ β)².

This expression, depending on the observations Y, to be minimized in β, is also called the least
squares criterion.
Exercise. Show that if β is the true parameter in (10.25), (10.26), then β provides a best approximation
to the data Y in an average sense:

E‖Y − Xβ‖² = min_{b ∈ R^k} E‖Y − Xb‖².

This minimization property can serve as a justification for the least squares criterion. If we could
compute E‖Y − Xb‖² for any b, we would take the minimizer and obtain the true β. However, to
find the expectation involved we already have to know β, so we take just ‖Y − Xb‖² and minimize
it. In this sense, β̂ can be considered an empirical analog of β.

Theorem 10.4.2 Consider the general linear model (10.25), (10.26), in the case k < n, with
observed vector Y ∈ R^n, β ∈ R^k and rank(X) = k, with σ² either known or unknown (σ² > 0).
(i) The LSE β̂ of β is uniquely determined and given by

β̂ = (X^⊤X)^{−1} X^⊤ Y.     (10.27)

(ii) Under a normality assumption L(ε) = N_n(0, σ²I_n), with probability one the LSE coincides
with the maximum likelihood estimator (MLE) for β, both in cases of known σ² and unknown σ²
(when σ² > 0).
Proof. Note that rank(X) = k implies that (X^⊤X)^{−1} exists. The key argument is that the matrix

Π_X = X(X^⊤X)^{−1}X^⊤

represents the linear projection operator in R^n onto the linear subspace Lin(X). Indeed, note that
Π_X is a projection matrix, i.e. idempotent (Π_X Π_X = Π_X) and symmetric: Π_X^⊤ = Π_X. To see
these two properties, note

Π_X Π_X = X(X^⊤X)^{−1}X^⊤X(X^⊤X)^{−1}X^⊤ = X(X^⊤X)^{−1}X^⊤ = Π_X,

and, using the matrix rule (AB)^⊤ = B^⊤A^⊤ (which implies (H^{−1})^⊤ = (H^⊤)^{−1} for symmetric
nonsingular H),

Π_X^⊤ = ( X(X^⊤X)^{−1}X^⊤ )^⊤ = X ( (X^⊤X)^{−1} )^⊤ X^⊤ = X ( (X^⊤X)^⊤ )^{−1} X^⊤
      = X(X^⊤X)^{−1}X^⊤ = Π_X.

Hence Π_X is a projection matrix. It has rank k, and it leaves the space Lin(X) invariant: if
z ∈ Lin(X) then z = Xa for some a ∈ R^k, and

Π_X z = Π_X Xa = X(X^⊤X)^{−1}X^⊤Xa = Xa = z.

Moreover, consider the orthogonal complement of Lin(X), i.e. Lin(X)^⊥. Any z ∈ Lin(X)^⊥ is
mapped into 0 by the linear map Π_X: since X^⊤z = 0, we have

Π_X z = X(X^⊤X)^{−1}X^⊤z = 0.

These facts establish that indeed Π_X is a matrix which represents the projection onto Lin(X). (As
a consequence, I_n − Π_X is the projection operator onto Lin(X)^⊥.)

It is well known that the projection operator has a minimizing property: for any y ∈ R^n, Π_X y gives
the element of Lin(X) closest to y (in the sense of ‖·‖). Indeed for any z ∈ Lin(X)

y − Π_X y = (I_n − Π_X) y ∈ Lin(X)^⊥

since I_n − Π_X projects onto Lin(X)^⊥. It follows that, when we compute (10.28) via ‖z‖² = z^⊤z,
since the two vectors y − Π_X y and Π_X y − z are orthogonal, we get a sum of squared norms:

‖y − z‖² = ‖y − Π_X y‖² + ‖Π_X y − z‖² ≥ ‖y − Π_X y‖²,

with equality only for z = Π_X y. Since Π_X y = Xβ̂ with β̂ = (X^⊤X)^{−1}X^⊤y, and the representation
z = Xβ of elements of Lin(X) is unique (rank(X) = k), the minimizer is unique.
Part (i) is proved. For (ii), write down the likelihood function: since L(Y) = N_n(Xβ, σ²I_n), it is

L_Y(β, σ²) = (2πσ²)^{−n/2} exp( − ‖Y − Xβ‖² / (2σ²) ).

For known σ², it is obvious that maximizing L_Y in β is equivalent to minimizing the least squares
criterion

Ls_Y(β) = ‖Y − Xβ‖².

For unknown σ², restricted only by σ² > 0, minimize L_Y(β, σ²) first in β, for given σ². The solution
is again given by (10.29). We now have to maximize

L_Y(β̂, σ²) = (2πσ²)^{−n/2} exp( − n^{−1}Ls_Y(β̂) / (2σ²n^{−1}) )

in σ². This procedure was already carried out in the proof of Proposition 3.0.5 (insert now
n^{−1}Ls_Y(β̂) for S²_n). Provided that Ls_Y(β̂) > 0, the solution is

σ̂² = n^{−1} Ls_Y(β̂).

Thus (ii) is proved if we show that Ls_Y(β̂) > 0 happens with probability one. Indeed Ls_Y(β̂) = 0
means that Y ∈ Lin(X). Under normality, with covariance matrix σ²I_n, it is obvious that Y ∈ Lin(X)
happens with probability 0 if k < n is fulfilled, since any proper linear subspace of R^n has probability
0 (indeed for any nonrandom vector z ∈ R^n, z ≠ 0, the event z^⊤Y = 0 has probability 0, since
z^⊤Y is normally distributed with variance σ²‖z‖² > 0).

It is easy to see that if k = n then the LSE of β is still given by (10.27) and coincides with the
MLE under normality if σ² is known, but if σ² is unknown then Ls_Y(β̂) = 0 and the MLE of σ²
under normality does not exist (or should be taken as 0, with the likelihood function taking value
∞). The assumption k < n is realistic, since k represents the number of independent regressor
variables, which in most cases can be expected to be less than n.
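A hedged sketch of (10.27) in plain Python (Gaussian elimination applied to the normal equations X^⊤Xβ = X^⊤Y; the design matrix and data below are invented for the example):

```python
def solve(a, b):
    """Solve the k x k linear system a x = b by Gauss-Jordan elimination
    with partial pivoting (a is assumed nonsingular)."""
    k = len(a)
    a = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def lse(X, Y):
    """Least squares estimator (X^T X)^{-1} X^T Y for an n x k design X
    of full column rank."""
    n, k = len(X), len(X[0])
    xtx = [[sum(X[i][p] * X[i][q] for i in range(n)) for q in range(k)]
           for p in range(k)]
    xty = [sum(X[i][p] * Y[i] for i in range(n)) for p in range(k)]
    return solve(xtx, xty)

# invented design and data with an exact fit Y = X @ (1, 2)
beta_hat = lse([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0])
```

Numerically, solving the normal equations directly is only advisable for small, well-conditioned problems; production code would use a QR-based solver.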
Exercise. Consider the special cases of the linear model discussed in the previous subsection.

1. Bivariate linear regression:

Y_i = α + βx_i + ε_i,    i = 1, ..., n.

Here k = 2, the rows of X are x_i^⊤ = (1, x_i), the parameter is (α, β)^⊤, and identifiability holds as
soon as not all x_i are equal. Show that the LSEs of α, β are

α̂_n = Ȳ_n − β̂_n x̄_n,    β̂_n = Σ_{i=1}^n (Y_i − Ȳ_n)(x_i − x̄_n) / Σ_{i=1}^n (x_i − x̄_n)².     (10.30)
Remark 10.4.3 Note that the formula for β̂_n is analogous to the regression coefficient in a bivariate
normal distribution for (X, Y) according to Definition 10.1.3:

β = σ_XY / σ²_X.

Therefore β̂_n is also called the empirical regression coefficient (and β the theoretical coefficient).
The regression function for (X, Y) was found as

y* = E(Y | X = x) = μ_Y + β(x − μ_X) = α + βx

with α = μ_Y − βμ_X. Since bivariate regression is very important for applications (it is programmed
in scientific pocket calculators), we summarize again what was done there.

Fitting a straight line to data (bivariate linear regression). Given pairs (Y_i, x_i), i = 1, ..., n,
find the straight line y = α + βx which best fits the data in the sense that the least squares
criterion

Σ_{i=1}^n (Y_i − α − βx_i)²

is minimal. The solutions α̂_n, β̂_n are given by (10.30). The fitted straight line y = α̂_n + β̂_n x
passes through the point (x̄_n, Ȳ_n) with slope β̂_n.
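The formulas (10.30) in code (the data are invented and lie exactly on a line, so the fit is exact):

```python
def fit_line(x, y):
    """Bivariate least squares fit: returns (alpha_hat, beta_hat)
    computed from the closed formulas (10.30)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    beta = (sum((yi - ybar) * (xi - xbar) for xi, yi in zip(x, y))
            / sum((xi - xbar) ** 2 for xi in x))
    alpha = ybar - beta * xbar
    return alpha, beta

alpha, beta = fit_line([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# the fitted line alpha + beta*x passes through (xbar, ybar) = (2, 4)
```

The division requires not all x_i equal, exactly the identifiability condition of the bivariate linear model.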
2. Normal location-scale model:

Y_i = μ + ε_i,    i = 1, ..., n.

This can be obtained from 1. above when β = 0 is assumed, and x_i = 1. Show that the LSE of μ
is the sample mean:

μ̂_n = Ȳ_n.     (10.31)

3. ANOVA: Here

Y_jk = μ_j + ε_jk,    j = 1, ..., m,  k = 1, ..., l.

Show that the LSE of μ_j is the group mean:

μ̂_j = Ȳ_j = l^{−1} Σ_{k=1}^l Y_jk,    j = 1, ..., m.
Consider again the linear model (LM)

Y = Xβ + ε,     (10.32)
Eε = 0,    Cov(ε) = σ²I_n.     (10.33)

Recall that the normal linear model is obtained when we add the assumption L(ε) = N(0, σ²I_n).
This might be called NLM (possibly NLM_1 or NLM_2).

In (LM) we are now interested in optimality properties of the least squares estimator of β,

β̂ = (X^⊤X)^{−1}X^⊤Y.

Above in (10.31) it was remarked that the sample mean Ȳ_n is a special case of β̂, for X = 1 (the
n-vector consisting of 1s). In Section 5.7 it was shown that Ȳ_n is a minimax estimator under
normality, for known σ² (i.e. in the Gaussian location model). In Proposition 4.3.3 it was shown
that in the Gaussian location model, the sample mean is also a uniformly minimum variance
unbiased estimator (UMVUE). Both optimality properties depend on the normality assumption:
e.g. for the second optimality, within the class of unbiased estimators, it was shown that the
quadratic risk of Ȳ_n attains the Cramer-Rao bound, which is the inverse Fisher information. The
Fisher information I_F is a function of the density (or probability function) of the data: recall the
general form of I_F for a density p_θ depending on θ,

I_F(θ) = E_θ ( (∂/∂θ) log p_θ(Y) )².

In the linear model, the distribution of Y is left unspecified. We might ask whether the sample mean,
or more generally β̂, still has any optimality properties. A natural choice for the risk is
E‖β̂ − β‖², i.e. expected loss for the squared Euclidean distance.

Note that β̂ is unbiased:

E β̂ = E (X^⊤X)^{−1}X^⊤Y = (X^⊤X)^{−1}X^⊤ EY = (X^⊤X)^{−1}X^⊤Xβ = β.

Without normality, it cannot be shown that β̂ is best unbiased. But when we further restrict the
class of estimators admitted for comparison, to linear estimators, β̂ turns out to be optimal. Recall
that a similar device was already employed for the sample mean and minimaxity: in Exercise 5.7.2
a minimax linear estimator was proposed, i.e. an estimator which is minimax within the class of
linear ones.
An estimator β̃ of β is called a best linear unbiased estimator (BLUE) if it is linear, i.e. of the form

β̃ = AY

for a nonrandom k × n matrix A, if it is unbiased,

E β̃ = β,     (10.34)

and if

E‖β̃ − β‖² ≤ E‖b − β‖²     (10.35)

for all linear unbiased estimators b. Relations (10.34), (10.35) are assumed to hold for all values of the
unknown parameters (β ∈ R^k, and L(ε) as specified).

Theorem 10.5.2 (Gauss, Markov) In the linear model (LM), the least squares estimator β̂ is the
unique BLUE.
Proof. Consider a linear unbiased estimator b = AY. The unbiasedness condition implies
= E b = EAY =AX
n
X
tr [B] = Bii .
i=1
In other words, the trace is the sum of diagonal elements. Note that for n×k matrices B, C and
for n-vectors z, y we have

    tr[BC^T] = Σ_{i=1}^n Σ_{j=1}^k B_ij C_ij = tr[C^T B],           (10.39)

    tr[B^T B] = Σ_{i=1}^n Σ_{j=1}^k B_ij² = tr[BB^T],               (10.40)

    tr[zy^T] = Σ_{i=1}^n z_i y_i = y^T z,    tr[zz^T] = z^T z = ||z||².
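The trace identities (10.39), (10.40) are easy to verify numerically. A minimal sketch (assuming NumPy is available; the matrices are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 3))   # an n x k matrix, n = 5, k = 3
C = rng.normal(size=(5, 3))
z = rng.normal(size=5)
y = rng.normal(size=5)

# (10.39): tr[B C^T] = sum_{ij} B_ij C_ij = tr[C^T B]
lhs = np.trace(B @ C.T)
assert np.isclose(lhs, np.sum(B * C))
assert np.isclose(lhs, np.trace(C.T @ B))

# (10.40): tr[B^T B] = sum_{ij} B_ij^2 = tr[B B^T]
assert np.isclose(np.trace(B.T @ B), np.sum(B ** 2))

# tr[z y^T] = y^T z and tr[z z^T] = ||z||^2
assert np.isclose(np.trace(np.outer(z, y)), y @ z)
assert np.isclose(np.trace(np.outer(z, z)), z @ z)
```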
From (10.39) we see that C ↦ tr[BC] is a linear operation on matrices, so that when C is a random
matrix, the expectation can be taken inside the trace. Hence for a linear unbiased b = AY (so that
b − β = Aε),

    E_β ||b − β||² = E ||Aε||² = E tr[Aεε^T A^T] = tr[A (Eεε^T) A^T] = σ² tr[AA^T],

and by (10.40)

    tr[A^T A] = tr[AA^T] = Σ_{i,j} A_ij² = ||vec(A)||²,

i.e. the problem is to minimize the length of the vector vec(A) under a set of affine linear restrictions
AX = I_k. Consider first the special case k = 1; then a = A^T and X are vectors of dimension n,
and the set {a : a^T X = 1} is an affine hyperplane in R^n. To minimize the norm of a within this
set means taking a perpendicular to the hyperplane, i.e. having the same direction as X. This
gives a = X(X^T X)^{-1} as the minimizer.
This argument is generalized to k ≥ 1 as follows. Let Π_X = X(X^T X)^{-1} X^T be the projection
operator onto Lin(X) in the space R^n. We have

    tr[A^T A] = tr[AA^T] = tr[A(I_n − Π_X + Π_X)A^T]
              = tr[A(I_n − Π_X)A^T] + tr[AΠ_X A^T]
              = tr[A(I_n − Π_X)A^T] + tr[(X^T X)^{-1}]

in view of AX = I_k. Here the term tr[A(I_n − Π_X)A^T] is nonnegative, since for C = A(I_n − Π_X)
we have (recall that I_n − Π_X is idempotent, since it is a projection)

    tr[A(I_n − Π_X)A^T] = tr[CC^T] ≥ 0,

since tr[CC^T] is a sum of squares (10.40). Thus

    E_β ||b − β||² ≥ σ² tr[(X^T X)^{-1}].

This lower bound is attained for A = (X^T X)^{-1} X^T:

    (I_n − Π_X)A^T = (I_n − Π_X)X(X^T X)^{-1} = 0.                  (10.41)
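The Gauss-Markov bound can also be illustrated numerically. The sketch below (assuming NumPy; the design matrix is random and purely illustrative) builds the least squares weight matrix A0 = (X^T X)^{-1} X^T, perturbs it to another linear unbiased estimator, and checks that the risk factor tr[AA^T] can only increase:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 3
X = rng.normal(size=(n, k))

# least squares weight matrix A0 = (X^T X)^{-1} X^T
A0 = np.linalg.solve(X.T @ X, X.T)
P = X @ A0                              # projection Pi_X onto Lin(X)

# any A0 + M(I - Pi_X) is again linear unbiased: (A0 + M(I - P)) X = I_k
M = rng.normal(size=(k, n))
A1 = A0 + M @ (np.eye(n) - P)
assert np.allclose(A1 @ X, np.eye(k))

# risk comparison: E||b - beta||^2 = sigma^2 tr[A A^T]; the LSE is minimal
assert np.trace(A0 @ A0.T) <= np.trace(A1 @ A1.T) + 1e-12
assert np.isclose(np.trace(A0 @ A0.T), np.trace(np.linalg.inv(X.T @ X)))
```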
1. Bivariate linear regression. Here

    Y_i = α + βx_i + ε_i,   i = 1, . . . , n.

Here k = 2, the rows of X are x_i^T = (1, x_i), and β = (α, β)^T. A linear hypothesis could be

    H : β = 0,

meaning that the nonrandom regressor variable x_i has no influence upon Y_i. If the x_i are i.i.d.
realizations of a normal random variable X, L(X) = N(μ_X, σ_X²), then the regression coefficient
is (see Definition 10.1.3)

    β = σ_XY / σ_X².

The hypothesis then means that σ_XY = 0, i.e. X and Y are independent random variables, or
equivalently X and Y are uncorrelated. Indeed the correlation coefficient is

    ρ = σ_XY / (σ_X σ_Y)

and it is related to β by

    β = ρ σ_Y / σ_X

(in Galton's case of fathers and sons, where σ_X² = σ_Y², they are actually equal).
Using the linear algebra formalism of (NLM), the linear subspace S would be the subspace of R²
spanned by the vector (1, 0)^T, i.e.

    S = Lin((1, 0)^T).
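The relation β = ρ σ_Y / σ_X can be checked by simulation. A minimal sketch (assuming NumPy; the correlation value 0.5 and the sample are hypothetical), with equal variances σ_X = σ_Y as in Galton's data:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
rho = 0.5
# bivariate normal (X, Y) with sigma_X = sigma_Y = 1, correlation rho
X = rng.normal(size=n)
Y = rho * X + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

sigma_XY = np.cov(X, Y)[0, 1]
beta = sigma_XY / np.var(X, ddof=1)     # regression coefficient sigma_XY / sigma_X^2
r = np.corrcoef(X, Y)[0, 1]             # correlation coefficient

# beta = rho * sigma_Y / sigma_X; with equal variances, beta is close to rho
assert abs(beta - rho) < 0.02
# the identity beta = r * s_Y / s_X holds exactly for the empirical quantities
assert abs(beta - r * np.std(Y, ddof=1) / np.std(X, ddof=1)) < 1e-8
```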
2. Normal location-scale model. Here

    Y_i = μ + ε_i,   i = 1, . . . , n,

k = 1, X = (1, . . . , 1)^T, the ε_i are i.i.d. N(0, σ²), σ² > 0. The only possible linear hypothesis is

    H : μ = 0.
3. Polynomial regression. Here the functions φ_j are φ_j(x) = x^{j−1} and β = (β_1, . . . , β_k)^T. A
linear hypothesis could be

    H : β_k = 0,

meaning that the regression function

    r(x) = Σ_{j=1}^k β_j φ_j(x)

is actually a polynomial of degree k − 2, and not of degree k − 1 as postulated in the model (LM). A
special case is 1. above (the bivariate linear regression) for k = 2. The hypothesis means that the
mean function r(x) is a polynomial of lesser degree, or that the model is less complex. Of course,
as always with testing problems, the statistically significant result would be a rejection; for e.g.
k = 3 this means that it is not possible to describe the mean function of the Y_i by a straight line,
and a higher degree polynomial is needed.
4. Analysis of variance (ANOVA). Here

    Y_jk = μ_j + ε_jk,   j = 1, . . . , m,  k = 1, . . . , l,        (11.1)

where the ε_jk are independent noise variables. The index j is associated with m treatments or
groups, and one might wish to test whether the treatments have any effect:

    H : μ_1 = . . . = μ_m,

or in other words, whether the groups differ in the characteristic Y_jk measured. The linear
subspace S of R^m would be

    S = Lin(1)

where 1 = (1, . . . , 1)^T ∈ R^m, and dim(S) = 1 (note that the m here corresponds to k in (10.32),
(10.33)).
5. General case. Recall that x_j were the columns of the matrix X, so that model (LM) can be
written

    Y = Σ_{j=1}^k β_j x_j + ε,

where the j-th column x_j may have arisen from n independent realizations of the j-th component
of a random vector X, or as designed nonrandom values. In either case, x_j may be construed as
one of k regressor variables which influence the regressand Y. Such a variable x_j is also called an
explanatory variable, or one of k independent variables, when Y is the dependent variable. (Note
that throughout math and statistics software, a vector of values is frequently called a variable.)
Thus a hypothesis H : β_j = 0 postulates that x_j is without influence on Y and can be dispensed
with. Clearly the polynomial regression above is a special case for x_j = (φ_j(x_1), . . . , φ_j(x_n))^T
(designed nonrandom values). Here again, what is sought is the statistical certainty in the case
of rejection, when it can be claimed that x_j actually does influence the measured quantity Y.
In the normal linear model (NLM), to find a test statistic for the problem

    H : β ∈ S,
    K : β ∉ S,

we could again apply the likelihood ratio principle. That was already done in a special case of
ANOVA, namely the two sample problem (cf. Theorem 10.3.6). We repeat some of that reasoning,
with general notation, assuming first σ² unknown. The density of Y is (as a function of y ∈ R^n)
    p_{β,σ²}(y) = (2πσ²)^{−n/2} exp(−||y − Xβ||² / (2σ²)).

The LR statistic is

    L(y) = max_{β ∉ S, σ² > 0} p_{β,σ²}(y) / max_{β ∈ S, σ² > 0} p_{β,σ²}(y).

To find the numerator, we first maximize over β ∉ S and then over σ² > 0. Maximizing first over
β ∈ R^k, we obtain the LSE of β:

    β̂ = (X^T X)^{−1} X^T Y.
Key idea for linear hypotheses in linear models. To find the maximized likelihood under the
hypothesis, we note that β ∈ S implies that Xβ varies in a linear subspace of Lin(X). Indeed if
S is a k×s-matrix such that S = Lin(S), then every β ∈ S can be represented as β = Sb for some
b ∈ R^s, thus

    Xβ = XSb,

where rank(XS) = s, and we see that

    {Xβ : β ∈ S} = Lin(XS) =: S_0.

We can now apply the results for least squares (or ML) estimation of β ∈ R^k to estimation of
b ∈ R^s. We could immediately write down a LSE for b and a derived one for β ∈ S, but we skip
this and proceed directly to the MLE of σ² under the hypothesis, analogously to (11.2):

    σ̂_0² = n^{−1} ||(I_n − Π_{S_0}) y||².
For a further transformation, note that the matrix Π_X − Π_{S_0} is again a projection matrix,
namely onto the orthogonal complement of S_0 in X. Denote this orthogonal complement as
X ⊖ S_0 (the space of all x ∈ X which are orthogonal to S_0). We then have a decomposition

    R^n = X^⊥ ⊕ (X ⊖ S_0) ⊕ S_0,                                    (11.5)

where ⊕ denotes the orthogonal sum, X^⊥ = R^n ⊖ X, and all three spaces are orthogonal. We
have a corresponding decomposition of I_n (which is the projection onto R^n)

    I_n = (I_n − Π_X) + (Π_X − Π_{S_0}) + Π_{S_0},

and all three matrices on the right are projection matrices, orthogonal to one another. (Exercise:
show that Π_X − Π_{S_0} is the projection operator onto X ⊖ S_0). As a consequence
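The decomposition of I_n into three mutually orthogonal projections can be verified numerically. A sketch, assuming NumPy and a random design X with hypothesis matrix S (both purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, s = 10, 4, 2
X = rng.normal(size=(n, k))
S = rng.normal(size=(k, s))

def proj(A):
    # orthogonal projection onto the column space of A
    return A @ np.linalg.solve(A.T @ A, A.T)

P_X, P_S0 = proj(X), proj(X @ S)        # Pi_X and Pi_{S_0}, S_0 = Lin(XS)
parts = [np.eye(n) - P_X, P_X - P_S0, P_S0]

assert np.allclose(sum(parts), np.eye(n))        # decomposition of I_n
for P in parts:                                   # each is a projection
    assert np.allclose(P, P.T) and np.allclose(P @ P, P)
for i in range(3):                                # mutually orthogonal
    for j in range(i + 1, 3):
        assert np.allclose(parts[i] @ parts[j], 0)
```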
Note that the form obtained for the two sample problem in Theorem 10.3.6 is a special case; there
we could further argue that the LR test is equivalent to a certain t-test.

Definition 11.1.2 In the normal linear model (NLM), consider a linear hypothesis given by a
linear subspace S ⊂ R^k, dim(S) = s < k:

    H : β ∈ S,
    K : β ∉ S.

The F-statistic for this problem is

    F(Y) = [ ||(Π_X − Π_{S_0})Y||² / (k − s) ] / [ ||(I_n − Π_X)Y||² / (n − k) ].

Proposition 11.1.3 In the normal linear model (NLM), under a hypothesis H : β ∈ S, the per-
taining F-statistic has an F-distribution with k − s, n − k degrees of freedom.
Proof. Under H we have

    EY = Xβ ∈ S_0,

hence

    (Π_X − Π_{S_0})Y = (Π_X − Π_{S_0})(Xβ + ε) = (Π_X − Π_{S_0})ε.

But

    EY = Xβ ∈ X

is already the model assumption (LM); it holds under H in particular. Thus

    (I_n − Π_X)Y = (I_n − Π_X)ε.

Choose an orthonormal basis e_1, . . . , e_n of R^n such that

    Lin(e_1, . . . , e_s) = S_0,    Lin(e_{s+1}, . . . , e_k) = X ⊖ S_0,
    Lin(e_{k+1}, . . . , e_n) = X^⊥,

i.e. the basis conforms to the decomposition (11.5). Then the three projection operators are given
by

    Π_{S_0} y = Σ_{i=1}^s (e_i^T y) e_i,   etc.,

hence

    ||(I_n − Π_X)ε||² = Σ_{i=k+1}^n (e_i^T ε)²,    ||(Π_X − Π_{S_0})ε||² = Σ_{i=s+1}^k (e_i^T ε)².
Theorem 11.1.4 In the normal linear model (NLM), consider a linear hypothesis given by a linear
subspace S ⊂ R^k, dim(S) = s < k:

    H : β ∈ S,
    K : β ∉ S.

Let F(Y) be the pertaining F-statistic and F_{k−s,n−k;1−α} the lower 1 − α-quantile of the
distribution F_{k−s,n−k}. The F-test which rejects when

    F(Y) > F_{k−s,n−k;1−α}

is an α-test.
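The level statement of Theorem 11.1.4 can be checked by simulation. A sketch (assuming NumPy and SciPy; the design X and hypothesis matrix S are random and purely illustrative) draws data under H and verifies that the empirical rejection rate is close to α:

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(4)
n, k, s = 20, 3, 1
X = rng.normal(size=(n, k))
S = rng.normal(size=(k, s))

def proj(A):
    return A @ np.linalg.solve(A.T @ A, A.T)

P_X, P_S0 = proj(X), proj(X @ S)
alpha, reps = 0.05, 20_000
crit = f.ppf(1 - alpha, k - s, n - k)   # F_{k-s, n-k; 1-alpha}

beta = S @ rng.normal(size=s)           # a true beta in S: hypothesis holds
E = rng.normal(size=(reps, n))          # sigma = 1 noise
Ys = X @ beta + E                       # reps independent data vectors

num = np.einsum('ri,ij,rj->r', Ys, P_X - P_S0, Ys) / (k - s)
den = np.einsum('ri,ij,rj->r', Ys, np.eye(n) - P_X, Ys) / (n - k)
rate = np.mean(num / den > crit)
assert abs(rate - alpha) < 0.01         # empirical size close to alpha
```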
where 1_l is the l-vector consisting of 1's, X is the n×m matrix where the vectors 1_{l_j} are
arranged in a diagonal fashion (with 0's elsewhere),

    Y = (Y_11, . . . , Y_{1l_1}, Y_21, . . . , Y_{ml_m})^T,          (11.11)

and the n-vector ε is formed analogously. Then

    Y = Xμ + ε,

rank(X) = m, and the hypothesis H : μ_1 = . . . = μ_m can be expressed as

    H : μ ∈ S = Lin(1_m)

where Lin(1_m) is a one dimensional linear space (dim(S) = s = 1). Thus

    S_0 = {Xμ : μ ∈ S} = Lin(1_n),

which can also be seen by noting that if all μ_j are equal to one value μ, then Y_jk = μ + ε_jk.
To find the value ||(Π_X − Π_{S_0})Y||², note that

    Π_{S_0} = 1_n (1_n^T 1_n)^{−1} 1_n^T = n^{−1} 1_n 1_n^T,

    Π_{S_0} Y = 1_n n^{−1} 1_n^T Y = 1_n Ȳ.
Furthermore let

    Ȳ_j = l_j^{−1} Σ_{k=1}^{l_j} Y_jk

be the mean within the j-th group. We claim that

    Π_X Y = X(X^T X)^{−1} X^T Y = X (Ȳ_1, . . . , Ȳ_m)^T.           (11.12)

Indeed, X^T Y gives the m-vector of the sums Σ_{k=1}^{l_j} Y_jk for each group, and X^T X is an
m×m diagonal matrix with the l_j as diagonal elements. But we can also write

    Π_{S_0} Y = 1_n Ȳ = X (Ȳ, . . . , Ȳ)^T,

hence

    Π_X Y − Π_{S_0} Y = X (Ȳ_1 − Ȳ, . . . , Ȳ_m − Ȳ)^T,

    ||(Π_X − Π_{S_0})Y||² = Σ_{j=1}^m l_j (Ȳ_j − Ȳ)².
The terms

    D_w = Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk − Ȳ_j)²,    D_b = Σ_{j=1}^m l_j (Ȳ_j − Ȳ)²

can be called the sum of squares within groups and the sum of squares between groups,
respectively. Recall that we introduced this terminology essentially already in the two sample
problem (just before Theorem 10.3.6; there we divided by n and called this variability). Consider
also the quantity

    D_0 = Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk − Ȳ)² = ||(I_n − Π_{S_0})Y||².

This is the total sum of squares; then n^{−1} D_0 = S_n² is the total sample variance. We have a
decomposition

    D_0 = D_w + D_b                                                  (11.14)

as an immediate consequence of the identity I_n − Π_{S_0} = (I_n − Π_X) + (Π_X − Π_{S_0}),
where the two projections on the right are orthogonal.
The decomposition (11.14) of the total sample variance generalizes (10.21); it gives the name to
the test procedure "analysis of variance". The hypothesis H of equality of means is rejected when
the between groups sum of squares D_b is too large, compared to the within groups sum of squares
D_w.
Note that the quantities

    d_w = (n − m)^{−1} D_w,    d_b = (m − 1)^{−1} D_b

are both unbiased estimates of σ² under the hypothesis (cf. the proof of Proposition 11.1.3); they
can be called the respective mean sums of squares. The F-statistic (11.13) involves these quantities
d_w, d_b, which differ only by a factor from the sums of squares D_w, D_b.
A common way of visualizing all the quantities involved is the ANOVA table:
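The quantities D_w, D_b, D_0 and the F-statistic F = d_b/d_w can be computed directly and compared against a library routine. A sketch assuming NumPy and SciPy (the group data are simulated and purely illustrative):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(5)
# m = 3 hypothetical groups with sizes l_j = 5, 7, 6
groups = [rng.normal(10, 2, size=l) for l in (5, 7, 6)]
n, m = sum(len(g) for g in groups), len(groups)

grand = np.concatenate(groups).mean()
Dw = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within groups
Db = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # between groups
D0 = ((np.concatenate(groups) - grand) ** 2).sum()          # total

assert np.isclose(D0, Dw + Db)                              # (11.14)

F = (Db / (m - 1)) / (Dw / (n - m))                         # F = d_b / d_w
F_ref, _ = f_oneway(*groups)                                # SciPy's one-way ANOVA
assert np.isclose(F, F_ref)
```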
    Y_jk = μ + α_j + ε_jk,   k = 1, . . . , l_j,  j = 1, . . . , m,

    Y_jk = Ȳ + α̂_j + ε̂_jk,   k = 1, . . . , l_j,  j = 1, . . . , m,  (11.15)

    α̂_j = Ȳ_j − Ȳ,    ε̂_jk = Y_jk − Ȳ_j,

where the ε̂_jk are called residuals and the α̂_j can be interpreted as estimates of the factor effects
α_j. The relation (11.15) can be written as a decomposition of the data vector Y:

    Y = Π_{S_0} Y + (Π_X − Π_{S_0})Y + (I_n − Π_X)Y.
This indicates how the two way layout can be treated. The model is

    Y_ijk = μ_ij + ε_ijk,   k = 1, . . . , l,  i = 1, . . . , q,  j = 1, . . . , m,

where the ε_ijk are i.i.d. normal variables with variance σ². For simplicity we assume that all
groups (i, j) have an equal number of observations l. The index i is called the first factor and j is
called the second factor. Again this is a normal linear model, but the matrix X has an involved
form. It is somewhat laborious to work out all the projections and derived sums of squares; we
therefore forgo the projection approach and use the more elementary "multiple dot" notation.
Note that the projection approach is still needed for a rigorous proof that the test statistics below
have an F-distribution. The array of nonrandom mean values μ_ij can be written
    μ_ij = μ + α_i + β_j + γ_ij,                                     (11.16)

with the side conditions

    Σ_{i=1}^q α_i = Σ_{j=1}^m β_j = 0,                               (11.17)

where (a dot denoting averaging over the corresponding index)

    μ = μ_·· = q^{−1} Σ_{i=1}^q μ_i· = m^{−1} Σ_{j=1}^m μ_·j   is the global effect,
    α_i = μ_i· − μ    is the main effect of value i of the first factor,
    β_j = μ_·j − μ    is the main effect of value j of the second factor,
    γ_ij = μ_ij − μ_i· − μ_·j + μ   is the interaction of value i of the first factor
                                    and value j of the second factor

(note that relations (11.16), (11.17) immediately follow from the definitions above of the quantities
involved). It follows also that

    Σ_{i=1}^q γ_ij = q μ_·j − qμ − q(μ_·j − μ) = 0   for all j = 1, . . . , m,

    Σ_{j=1}^m γ_ij = 0   for all i = 1, . . . , q.
Assume now that there is no interaction between the factors 1 and 2, i.e. γ_ij = 0 for all i, j. In
this case from (11.16) we obtain

    Y_ijk = μ + α_i + β_j + ε_ijk,   k = 1, . . . , l,  i = 1, . . . , q,  j = 1, . . . , m,

where

    ε̂_ijk = Y_ijk − Ȳ_i·· − Ȳ_·j· + Ȳ_···

are the residuals. The two-way ANOVA table is
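The multiple-dot estimates of the effects satisfy the same side conditions (11.17) as the population quantities, and the cell means decompose exactly. A numerical sketch (assuming NumPy; the data array is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
q, m, l = 3, 4, 2
Y = rng.normal(size=(q, m, l))          # Y_ijk, equal cell sizes l

Ybar = Y.mean()                         # Y_... (global mean)
Yi = Y.mean(axis=(1, 2))                # Y_i..
Yj = Y.mean(axis=(0, 2))                # Y_.j.
Yij = Y.mean(axis=2)                    # Y_ij.

a = Yi - Ybar                           # estimated main effects alpha_i
b = Yj - Ybar                           # estimated main effects beta_j
g = Yij - Yi[:, None] - Yj[None, :] + Ybar  # estimated interactions gamma_ij

# side conditions (11.17) hold for the estimates as well
assert np.isclose(a.sum(), 0) and np.isclose(b.sum(), 0)
assert np.allclose(g.sum(axis=0), 0) and np.allclose(g.sum(axis=1), 0)

# cell means decompose exactly: Y_ij. = Ybar + a_i + b_j + g_ij
assert np.allclose(Yij, Ybar + a[:, None] + b[None, :] + g)
```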
Here the F-value in the row for the first factor is for testing H_1.

Remark 11.3.1 We briefly outline the associated projection arguments needed for the proof of the
F-distributions. Let Y be the n-vector of the data Y_ijk, e.g. arranged in a lexicographic fashion.
Note that S_10, S_01 and S_00 are mutually orthogonal in R^n: e.g. for any Z_1 ∈ S_10 and
Z_2 ∈ S_01 we have Z_1^T Z_2 = 0, etc. Consider the linear space X spanned by these three
subspaces, i.e. the set of all linear combinations Σ_{r=1}^3 λ_r Z_r where Z_1 ∈ S_10, Z_2 ∈ S_01,
Z_3 ∈ S_00. This can be represented as

    X = { Z : Z_ijk = μ + α_i + β_j  for some μ, α_i, β_j
              with Σ_{i=1}^q α_i = 0, Σ_{j=1}^m β_j = 0 }.

The basic (normal) linear model (under the assumption of no interaction) can be expressed as

    L(Y) = N_n(EY, σ² I_n)

where EY ∈ X. To write it in the form Y = Xβ + ε we would need a matrix X which spans the
space X; we can avoid this here. The two way ANOVA decomposition (11.18) can be written as
above; d_w is an unbiased estimator of σ² and D_w/σ² has the law χ²_{n−m−q+1}. The two linear
hypotheses are

    H_1 : α_1 = . . . = α_q = 0   or equivalently  EY ∈ S_01 ⊕ S_00,
    H_2 : β_1 = . . . = β_m = 0   or equivalently  EY ∈ S_10 ⊕ S_00.

Under H_1, we have Π_{S_10} EY = 0, and hence the corresponding sum of squares D_b1 is
independent of d_w and such that D_b1/σ² has the law χ²_{q−1}. Thus d_b1/d_w has the law
F_{q−1, n−m−q+1}.
The textbook in chap. 11.3 treats the case where l = 1 and the β_j are random variables (RCB,
randomized complete block design). This model is very similar to the two way layout treated here.
The theory of ANOVA with its associated design problems has many further ramifications.
Chapter 12
SOME NONPARAMETRIC TESTS
where Z_i = X_i − Y_i and

    sgn(x) =  1   if x > 0,
              0   if x = 0,
             −1   if x < 0.
It is plausible to reject H if the number of positive Z_i is too large compared to the number of
negative Z_i. This is the (one sided) sign test for H. Define

    S_0 = Σ_{i=1}^n 1_{(0,∞)}(Z_i);

then S = 2S_0 − n. Moreover, if p = P(Z > 0) then the statistic S_0 has a binomial distribution
B(n, p). Under H we have

    P(Z < 0) = P(Z > 0),

so that p = 1/2 and the distribution of S_0 under H is B(n, 1/2). Since S is a strictly monotone
function of S_0, the sign test can also be based on S_0 and the binomial distribution under H
(rejection when S_0 is too large). Note that since B(n, 1/2) is a discrete distribution, in order to
achieve exact size α under the hypothesis, for any given α one has to use a randomized test in
general (alternatively, for a nonrandomized α-test the size may be less than α).
This is a prototypical nonparametric test; the hypothesis H contains all distributions Q for Z which
have median 0, and this is a large nonparametric class of laws. The statistic S is distribution-free
under H, since its law is derived from B(n, 1/2), which does not depend on Q in H.
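The sign test amounts to a binomial tail computation. A self-contained sketch (standard library only; the paired differences are hypothetical, and discarding zeros is a common convention, not something stated above):

```python
import math

def sign_test_pvalue(z):
    """One-sided sign test p-value for H: median = 0 vs K: median > 0.
    Zeros are discarded (a common convention); S0 ~ B(n, 1/2) under H."""
    z = [v for v in z if v != 0]
    n = len(z)
    s0 = sum(v > 0 for v in z)
    # P(B(n, 1/2) >= s0), computed exactly
    return sum(math.comb(n, j) for j in range(s0, n + 1)) / 2 ** n

# hypothetical paired differences Z_i = X_i - Y_i: 8 positive, 2 negative
z = [1.2, 0.7, -0.3, 2.1, 0.4, 1.0, -0.8, 0.9, 1.5, 0.2]
p = sign_test_pvalue(z)
# p = (C(10,8) + C(10,9) + C(10,10)) / 2^10 = 56/1024, about 0.0547
```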
    S_i = sgn(Z_i) R_i.

Under the hypothesis of symmetry, about one half of the S_i would be negative, so that the sum
W = Σ_{i=1}^n S_i would be close to 0. Thus it seems plausible to reject H if W is too large (one
sided test) or if |W| is too large (two sided test).
Lemma 12.2.1 Under the hypothesis H of symmetry, W has the same distribution as

    V_n = Σ_{i=1}^n M_i · i,                                         (12.1)

where M_1, . . . , M_n are i.i.d. with P(M_i = 1) = P(M_i = −1) = 1/2.

Proof. If Z has a symmetric distribution then sgn(Z) and |Z| are independent:
the conditional distribution of |Z| given sgn(Z) does not depend on sgn(Z), which means
independence. Thus W has the same law as

    Ṽ_n = Σ_{i=1}^n M_i R_i,

where the M_i are independent of the original sample of the Z_i. Moreover, a random permutation
of the indices (independent of the M_i) does not change the law of the vector (M_1, . . . , M_n), and
(R_1, . . . , R_n) is such a random permutation of (1, . . . , n), so that L(V_n) = L(Ṽ_n).
Note that EM_i = 0, so that EW = 0 under H. The variance of M_i is

    Var(M_1) = EM_1² = (1/2)(−1)² + (1/2)(1)² = 1,

so that

    Var(W) = Var(V_n) = Σ_{i=1}^n i² = n(n + 1)(2n + 1)/6 =: v_n.

In terms of B_i := (M_i + 1)/2 we can write V_n = 2 Σ_{i=1}^n i B_i − n(n + 1)/2, where the B_i
are i.i.d. Bernoulli B(1, 1/2). This distribution has been tabulated in the past (one sided critical
values, without randomization, for selected values of α and n = 1, . . . , 20), and it can easily be
included in statistical software today. Note that the two sided critical values for W can
be obtained from the fact that W has a symmetric distribution around 0.
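The null mean and the variance v_n can be checked by simulating V_n from (12.1). A sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 10, 200_000
M = rng.choice([-1, 1], size=(reps, n))     # i.i.d. signs, P(+1) = 1/2
V = (M * np.arange(1, n + 1)).sum(axis=1)   # V_n = sum_i M_i * i

vn = n * (n + 1) * (2 * n + 1) / 6          # = 385 for n = 10
assert abs(V.mean()) < 1.0                  # E V_n = 0
assert abs(V.var() / vn - 1) < 0.02         # Var V_n = n(n+1)(2n+1)/6
# the null law is symmetric about 0
assert abs((V > 0).mean() - (V < 0).mean()) < 0.01
```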
Further justification. Recall that when the Z_i are i.i.d. normal N(μ, σ²) with unknown σ², the
hypothesis of symmetry around 0 reduces to H : μ = 0. For this the t-test is available, based on
T = n^{1/2} Z̄_n / S_n, or when σ² is known we could even use a Z-test based on the statistic
Z' = n^{1/2} Z̄_n / σ and its normal distribution. The one sided Z-test was a UMP test against
alternatives K : μ > 0, and the two sided Z- and t-tests can also be shown to have an optimality
property (UMP unbiased tests). When the assumption of normality of Q = L(Z_i) is not justified,
for testing symmetry we could still try to use the t-test, provided the second moment of Q exists.
In that case we would obtain an asymptotic α-test for the hypothesis of symmetry, but this breaks
down if the class of admitted distributions is too large, i.e. the second moment of Q may not exist.
For instance, we may have Q(·) = Q_0(· − θ) where Q_0 is a Cauchy distribution (which is
symmetric about 0). Symmetry of Q then means θ = 0, and T is not an appropriate test statistic
(it does not provide an asymptotic α-test). In contrast, the Wilcoxon signed rank statistic W is
distribution-free under the hypothesis of symmetry according to Lemma 12.2.1, i.e. its law does
not depend on Q as long as Q is symmetric.
Nonparametric alternatives. A random variable Z_1 with distribution function F_1 is stochas-
tically larger than Z_2 with distribution function F_2 if

    F_1(x) ≤ F_2(x)   for all x, with strict inequality for at least one x.

If, for distribution functions, we define the symbol F_1 ≺ F_2 to mean F_1 ≤ F_2 and
F_1(x_0) < F_2(x_0) for at least one x_0, then this is equivalent to

    F_1 ≺ F_2.
(For tabulated critical values cf. e.g. the table given in Rohatgi and Saleh, Introduction to Prob.
and Statistics, 2nd Ed.)

Invariance considerations. Let z be a point in R^n (thought of as representing a realization of
Z = (Z_1, . . . , Z_n)) and let τ̄ be a transformation τ̄(z) = (τ(z_1), . . . , τ(z_n)), where τ is a
continuous, strictly monotone increasing and odd real valued function on R (odd means
τ(−x) = −τ(x)). If Q = L(Z) is a symmetric law then the law of τ̄(Z) is also symmetric:
Exercise H1.2. Let p_i, i = 1, . . . , k be a finite subset of (0, 1) (where k ≥ 2), and consider a
statistical model M'_1 with the same data X and parameter p as M_1, but where p is now restricted
to Θ = {p_1, . . . , p_k}. Assume that each parameter value p_i is assigned a prior probability
q_i > 0, where Σ_{i=1}^k q_i = 1. For any estimator T, define the mixed risk

    B(T) = Σ_{i=1}^k R(T, p_i) q_i.
    P_p(X_1 = k) = (1 − p)^{k−1} p,   k = 1, 2, . . .

i) In the statistical model where p ∈ (0, 1] is unknown, find the MLE of p (with proof).
ii) Compute (or find in a textbook) the expectation of X_1 under p.
Exercise H2.4 Let X_1, . . . , X_n be independent and identically distributed such that X_1 has the
uniform law on the interval [0, θ] for some θ > 0 (i.e. X_1 has the uniform density p_θ(x) =
θ^{−1} 1_{[0,θ]}(x). Here 1_A(x) is the indicator function of a set A: 1_A(x) = 1 if x ∈ A, 1_A(x) = 0
otherwise). In the statistical model where θ > 0 is unknown, find the MLE of θ (with proof).
Hint: it can be assumed that not all X_i are 0, since this event has probability 0 under any θ > 0.

Exercise H2.5 Let X_1, . . . , X_n be independent and identically distributed such that X_1 has
the exponential law Exp(μ) on [0, ∞) with parameter μ (i.e. X_1 has density
p_μ(x) = μ^{−1} exp(−xμ^{−1}) 1_{[0,∞)}(x)).
i) In the statistical model where μ > 0 is unknown, find the MLE of μ (with proof).
ii) Recall the expectation of X_1 under μ.
Exercise H3.2. Let X be an observed (integer-valued) random variable with binomial law B(n, θ),
where θ ∈ Θ = (0, 1) is unknown.
(i) Compute the Fisher information I_F(θ) at each θ ∈ Θ.
(ii) Clearly condition D1 for the validity of the Cramer-Rao bound is fulfilled (sample space is
finite, prob. function differentiable). Find a minimum variance unbiased estimator (UMVUE) of θ
for this model.

Exercise H3.3. Let X_1, . . . , X_n be independent and identically distributed with geometric law
Geom(θ^{−1}), i.e.

    P_θ(X_1 = k) = (1 − θ^{−1})^{k−1} θ^{−1},

where θ ∈ Θ = (1, ∞) is unknown. (Note that in Exercise H2.3 the family Geom(p), p ∈ (0, 1] was
considered. Here we just took θ = p^{−1} as parameter and also excluded the value p = 1.)
(i) Compute the Fisher information I_F(θ) at each θ ∈ Θ.
(ii) Assume that condition D2 for the validity of the Cramer-Rao bound is fulfilled. Find a
minimum variance unbiased estimator (UMVUE) of θ for this model.
Hint: Here the variance of X_1 is important; compute it or look it up in a book.
Exercise H4.2. Consider the Gaussian scale model: observations are X_1, . . . , X_n, independent
and identically distributed such that X_1 has the normal law N(0, σ²) with variance σ², where σ²
is unknown and varies in the set Θ = (0, ∞).
(i) Compute the Fisher information I_F(σ²) for one observation X_1.
(ii) Assume that condition D3 for the validity of the Cramer-Rao bound is fulfilled. Show that for
n observations in the Gaussian scale model, the sample variance

    S² = n^{−1} Σ_{i=1}^n X_i²
Exercise H4.3. In the handout, sec. 8.3 a family of Gamma densities was introduced as

    f_α(x) = Γ(α)^{−1} x^{α−1} exp(−x),

    f_{α,μ}(x) = Γ(α)^{−1} μ^{−α} x^{α−1} exp(−xμ^{−1}),

    EU = αμ.

(iii) In the above Poisson model, with prior Γ(α, μ), find the posterior expectation E(λ|X) of λ
and discuss its relation to the sample mean X̄_n for large n (α, μ fixed).
Remark: Note that if Proposition 8.1 carries over to the case of countable sample space, then
E(λ|X) is a Bayes estimator for quadratic risk.
For the case that in addition σ² is unknown, find a statistic which has a t-distribution if μ_1 = μ_2
(this would then be called a studentized statistic), and find the degrees of freedom.

    H : μ_1 = μ_2.
Exercise H7.2. Let z_{α/2,n} be the upper α/2-quantile of the t-distribution with n degrees of
freedom and z_{α/2} the respective quantile for the standard normal distribution. Use the tables
at the end of the textbook or a computer program to find
a) z_{α/2,n} for n = 5, n = 20 and z_{α/2} for the value α = 0.05,
b) the same for α = 0.01.
Exercise H7.3. In the introductory subsection 1.3.2 "Confidence statements with the Chebyshev
inequality" (early in the handout, p. 10) we constructed a confidence interval for the parameter p
for n i.i.d. Bernoulli observations X_1, . . . , X_n (model M_{d,1}) of form [X̄_n − ε, X̄_n + ε], where
X̄_n is the sample mean and ε = ε_{n,α} = (2(nα)^{1/2})^{−1}. Using the Chebyshev inequality,
it was shown that this has coverage probability at least 1 − α.
(i) Use the central limit theorem and the upper bound p(1 − p) ≤ 1/4 to construct an asymptotic
1 − α confidence interval of form [X̄_n − ε̃_{n,α}, X̄_n + ε̃_{n,α}] for p, involving a quantile of the
standard normal distribution N(0, 1).
(ii) Show that the ratio of the lengths of the two confidence intervals, i.e. ε̃_{n,α}/ε_{n,α}, does
not depend on n, and find its numerical values for α = 0.05 and α = 0.01, using the table of
N(0, 1) on p. 608 of the textbook.
(iii) Use the property of the standard normal distribution function Φ(x)

    1 − Φ(x) ≤ x^{−1} exp(−x²/2)

Comment: The proof of consistency of the two-sided Gauss test, which is illustrated in the figure
on p. 95, is similar but more direct, since σ² is known there.
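The computation behind the two half-widths in H7.3 can be sketched as follows (assuming SciPy for the normal quantile; this is only an illustration of the formulas, not a full solution):

```python
import math
from scipy.stats import norm

def half_widths(n, alpha):
    """Half-widths of the two 1-alpha confidence intervals for p."""
    eps_cheb = 1 / (2 * math.sqrt(n * alpha))               # Chebyshev
    eps_clt = norm.ppf(1 - alpha / 2) / (2 * math.sqrt(n))  # CLT + p(1-p) <= 1/4
    return eps_cheb, eps_clt

for alpha in (0.05, 0.01):
    c5, g5 = half_widths(100, alpha)
    c6, g6 = half_widths(400, alpha)
    # the ratio does not depend on n ...
    assert math.isclose(g5 / c5, g6 / c6)
    # ... and equals z_{alpha/2} * sqrt(alpha) < 1: the CLT interval is shorter
    assert math.isclose(g5 / c5, norm.ppf(1 - alpha / 2) * math.sqrt(alpha))
    assert g5 / c5 < 1
```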
Exercise E1.1. (10%) Let X_1, . . . , X_n be independent identically distributed with unknown
distribution Q, where it is known only that

    Var(X_1) ≤ K

for some known positive K. Then μ = EX_1 also exists. Consider hypotheses

    H : μ = μ_0,
    K : μ ≠ μ_0.

Find an α-test (an exact α-test, i.e. the level α is observed for every n, not just asymptotically as
n → ∞). (Hint: Chebyshev inequality, p. 10, or [D], p. 222).
Exercise E1.2. Let X_1, . . . , X_n be independent Poisson Po(λ). Consider some λ_0, λ_1 such
that 0 < λ_0 < λ_1.
(i) (5%) Consider simple hypotheses

    H : λ = λ_0,
    K : λ = λ_1.

Find a most powerful α-test.
Note: the distribution of any proposed test statistic can be expected to be discrete, so that a
randomized test might be most powerful. For the solution, this aspect can be ignored; just indicate
the statistic, its distribution under H and the type of rejection region (such as "reject when T is
too large").
(ii) (10%) Consider composite hypotheses

    H : λ = λ_0,
    K : λ > λ_0.

Find a uniformly most powerful (UMP) α-test.
Hint: take a solution of (i) which does not depend on λ_1.
(iii) (10%) Consider composite hypotheses

    H : λ ≤ λ_0,
    K : λ > λ_0.

Find a uniformly most powerful (UMP) α-test.
Hint: take a solution of (ii) and show that it preserves the level α on H : λ ≤ λ_0. Properties of
the Poisson distribution are useful.
Exercise E1.3. (20%) Consider the Gaussian location-scale model (Model M_{c,2}), for sample
size n, i.e. observations are i.i.d. X_1, . . . , X_n with distribution N(μ, σ²), where μ ∈ R and
σ² > 0 are unknown. For a certain σ_0² > 0, consider hypotheses H : σ² ≤ σ_0² vs. K : σ² > σ_0².
Find an α-test with rejection region of form (c, ∞) (i.e. a one-sided test), where c is a quantile of
a χ²-distribution. (Note: it is not asked to find the LR test; but the test should have level α. This
includes an argument that the level is observed on all parameters in the hypothesis H : σ² ≤ σ_0².)
Hint: A good estimator of σ² might be a starting point.
Exercise E1.4 (Two sample problem, F-test for variances). Let X_1, . . . , X_n be independent
N(μ_1, σ_1²) and Y_1, . . . , Y_n be independent N(μ_2, σ_2²), also independent of X_1, . . . , X_n
(n > 1), where

Exercise E1.5 (F-test for equality of variances). Consider the two sample problem of exercise
E1.4, but hypotheses H : σ_1² = σ_2² vs. K : σ_1² ≠ σ_2².
i) (5%) Find the likelihood ratio test and show that it is equivalent to a test which rejects if the
F-statistic (13.1) is outside a certain interval of form [c^{−1}, c].
ii) (5%) Show that the c of i) can be chosen as the upper α/2 quantile of the distribution F_{r,r}
for a certain r > 0.
Use these data to test whether the use of carbolic acid is associated with patient mortality.
Exercise H8.2. Let X_1, . . . , X_n be independent N(μ_1, 1) and Y_1, . . . , Y_n be independent
N(μ_2, 1), also independent of X_1, . . . , X_n, and consider hypotheses

    H : μ_1 = μ_2 = 0,
    K : (μ_1, μ_2) ≠ (0, 0).

Find the likelihood ratio test and show that it is equivalent to a test based on a statistic which
has a certain χ²-distribution under H (thus the critical value can be taken as a quantile of this
χ²-distribution).
Exercise H8.3. Let X_1, . . . , X_n be independent Poisson Po(λ_1) and Y_1, . . . , Y_n be
independent Po(λ_2), also independent of X_1, . . . , X_n. Let λ = (λ_1, λ_2) be the parameter
vector, λ_0 a particular value for this vector (with positive components), and consider hypotheses

    H : λ = λ_0,
    K : λ ≠ λ_0.

Find an asymptotic α-test. Hint: find a statistic similar to the χ²-statistic in the multinomial case
(Definition 9.1.2, p. 111 handout) and its asymptotic distribution under H.
Exercise H8.4 (Adapted from exercise 8.60, p. 399 textbook) Let Z = (Z_1, . . . , Z_k) have a
multinomial law M_k(n, p) with unknown p = (p_1, p_2, . . . , p_k), where k > 2. Consider
hypotheses on the first two components

    H : p_1 = p_2,
    K : p_1 ≠ p_2.

A test that is often used, called McNemar's Test, rejects H if

    (Z_1 − Z_2)² / (Z_1 + Z_2) > χ²_{1;1−α}.                         (13.2)
    Y = Xβ + ε,
    Eε = 0,   Cov(ε) = σ² I_n,

where X is an n×k-matrix, rank(X) = k. Show that β̂ provides a best approximation to the data
Y in an average sense:

    E ||Y − Xβ̂||² ≤ min_{β ∈ R^k} E ||Y − Xβ||².
    Y_i = α + βx_i + ε_i,   i = 1, . . . , n,

where not all x_i are equal, α, β are real valued and the ε_i are uncorrelated with variance σ².
(i) Show that the LSEs of α, β are

    α̂_n = Ȳ_n − β̂_n x̄_n,    β̂_n = Σ_{i=1}^n (Y_i − Ȳ_n)(x_i − x̄_n) / Σ_{i=1}^n (x_i − x̄_n)².   (13.3)

(ii) Find the distribution of β̂_n when the ε_i are independent normal: L(ε_i) = N(0, σ²).
Hint: for (ii), a possibility is to find the projection matrix Π_1 projecting onto the space Lin(1),
where 1 is the n-vector consisting of 1's, and use
where

    Σ_XY = Σ_YX^T = E XY.

Comment. Compare this with the form of β in the case k = 1 (i.e. β = σ_X^{−2} σ_XY as in Def.
10.1.3, p. 135 handout), and with the form of the LSE β̂ in the linear model (Theorem 10.4.2, p.
149): β̂ = (X^T X)^{−1} X^T Y.
Hint. Consider independent X*, ε where X* has the same law as X and L(ε) = N(0, σ²), for
some σ² > 0 (X* is a random k-vector, ε a random variable), and define

    Y* = X*^T β + ε,

with β as above. Find a value of the variance σ² such that (X*, Y*) have the same joint
distribution as (X, Y). This solves the problem, since then
Exercise H10.1. (15 %) A company registered 100 cases within a year where some employee was
missing exactly one day at work. These were distributed among the days of the week as follows:
Day M T W Th F
No. 22 19 16 18 25
Test the hypothesis that these one day absences are uniformly distributed among the days of the
week, at level α = 0.05.
(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining
distribution], resulting decision. )
Exercise H10.2 (15%) The personnel manager of a bank wants to find out whether the chance
to successfully pass a job interview depends on the sex of the applicant. For 35 randomly selected
applicants, 21 of which were male, the results of the interview were evaluated. It turned out that
exactly 16 applicants passed the interview, 5 of which were female. Use a χ²-test in a contingency
table to test whether interview result and sex are independent, at level α = 0.05.
(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining
distribution], resulting decision. )
Comment: most sources would recommend Fisher's exact test here, but this was not treated and
the χ²-test is also applicable.
Exercise H10.3. (20%) Suppose 18 wheat fields have been divided into m = 3 groups, with
l_1 = 5, l_2 = 7 and l_3 = 6 members. There are three kinds of fertilizer, and group j of fields
is treated with fertilizer j, j = 1, 2, 3. The yield results for all fields are given in the following
table.

    1:  781 655 611 789 596
    2:  545 786 976 663 790 568 720
    3:  696 660 639 467 650 380

Assume that these values are realizations of independent random variables Y_jk with distribution
N(μ_j, σ²), k = 1, . . . , l_j, j = 1, 2, 3, where the mean values μ_j correspond to fertilizer j. Test
the hypothesis that the three group means coincide, at level α = 0.05.
(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining
distribution], resulting decision. )
Exercise H10.4. (25%) Consider a normal linear model of type NLM2 (cf. section 10.5, p. 152
handout), for dimension k = 1, i.e. observations are

    Y = Xβ + ε,

where X is a nonrandom n×1-matrix (i.e. an n-vector in this case), β is an unknown real valued
parameter and

    L(ε) = N_n(0, σ² I_n),

where σ² > 0 is unknown. Assume also that rank(X) = 1 (identifiability condition; i.e. X ≠ 0).
Consider some value β_0 and hypotheses

    H : β ≤ β_0,
    K : β > β_0.

Let β̂ be the LSE of β and define the statistic

where Π_X is the projection matrix onto the linear space Lin(X). Show that T'(Y) can be used
as a test statistic to construct an α-test, and indicate the distribution and the quantile used to
find the rejection region.
Comment: Note that this is not a linear hypothesis on β.
Hint: in the case that all elements of X are 1 we obtain the Gaussian location-scale model (mean
value β), for which the present hypothesis testing problem was discussed extensively.
Exercise H10.5. (25 %) Suppose that the random vector Z = (X, Y ) has a bivariate normal
distribution with EX = EY = 0 and covariance matrix
where = EXY / X Y is the correlation coecient between X and Y . Suppose that Zi = (Xi , Yi ),
i = 1, . . . , n are i.i.d. observed random 2-vectors each having the distribution of Z. Define the
empirical covariance matrix
n
X n
X
2 1
SX = n Xi2 , SY2 1
=n Yi2 ,
i=1 i=1
n
X
SXY = n1 Xi Yi
i=1
(note that we do not use centered data Xi Xn etc. here for empirical variances / covariances
since EX = EY = 0 is known) and the empirical correlation coecient
SXY
=
SX SY
q
(where SX = SX 2 etc.).
Consider hypotheses
H :=0
K : 6= 0.
Define the statistic
T0 (Z) = n 1p (13.6)
12
(where Z represents all the data Z_i, i = 1, . . . , n). Show that T_0(Z) can be used as a test statistic
to construct an α-test, and indicate the distribution and the quantile used to find the rejection
region.
Hint: Note that this is closely related to the previous exercise H10.4. In H10.4, set θ_0 = 0,
consider the two sided problem H : θ = 0 vs. K : θ ≠ 0 (then H is a linear hypothesis) and look
for similarities of the statistics (13.5) and (13.6). The distribution of (13.5) was found in H10.4,
but now the X_i are random. How does this affect the distribution of the test statistic?
Further comment: when it is not assumed that EX = EY = 0, the definition of ρ̂ has to be
modified in an obvious way, by using centered data X_i − X̄_n, Y_i − Ȳ_n, and √(n − 2) appears in place
of √(n − 1). In this form the test is found in the literature.
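A small simulation sketch of the uncentered correlation statistic (13.6); the sample size, true correlation and level below are made up.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 40
# Bivariate normal sample with mean zero and correlation 0.5 (placeholder values)
Z = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=n)
X, Y = Z[:, 0], Z[:, 1]

# Uncentered empirical second moments (EX = EY = 0 is assumed known)
SX2, SY2, SXY = (X ** 2).mean(), (Y ** 2).mean(), (X * Y).mean()
rho_hat = SXY / np.sqrt(SX2 * SY2)

T0 = np.sqrt(n - 1) * rho_hat / np.sqrt(1 - rho_hat ** 2)
crit = stats.t.ppf(0.975, n - 1)   # two-sided test at alpha = 0.05
print(f"rho_hat = {rho_hat:.3f}, T0 = {T0:.3f}, reject H iff |T0| > {crit:.3f}")
```

The quantile used here (t with n − 1 degrees of freedom) is the answer the exercise asks you to justify, so treat it as a conjecture to be verified rather than a given.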
Exercise E2.1. (25 %) A course in economics was taught to two groups of students, one in a
classroom situation and the other on TV. There were 24 students in each group. These students
were first paired according to cumulative grade-point averages and background in economics, and
then assigned to the courses by a flip of a coin (this was repeated 24 times). At the end of the course
each class was given the same final examination. Use the Wilcoxon signed rank test (level α = 0.05,
normal approximation to the test statistic) to test that the two methods of teaching are equally
effective against a two-sided alternative. The differences in final scores for each pair of students,
the TV student's score having been subtracted from the corresponding classroom student's score,
were as follows:
14 4 6 2 1 18
6 12 8 4 13 7
2 6 21 7 2 11
3 14 2 17 4 5
Hint: treatment of ties. Let Z_1, . . . , Z_n be the data. If some of the |Z_i| have the same value
(i.e. ties occur), then they are assigned values R_i which are the averages of the ranks. Example:
4 values |Z_{i_1}|, . . . , |Z_{i_4}| have the same absolute value c and only two of the other |Z_i| are smaller.
The |Z_{i_1}|, . . . , |Z_{i_4}| would then occupy ranks 3, 4, 5, 6. Since they are tied, they are all assigned the
average rank (3 + 4 + 5 + 6)/4 = 4.5. The next rank assigned is then 7 (or higher if there is another
tie).
Remark: For the two-sided version of the Wilcoxon signed rank test, as described in section 12.2
handout, the last sentence on p. 175 should read as: "The corresponding asymptotic α-test of H
then rejects if |W_n| > z_{α/2} v_n^{1/2}, where z_{α/2} is the upper α/2-quantile of N(0, 1)."
As a starting point, to limit the bookkeeping effort in the solution, the following table gives the
ordered absolute values of the data |Z_i|, starring those that were originally negative. Below each
entry (every other row of the table) is the rank R_i of |Z_i| (in the notation of the handout), where
ties are treated as indicated:
|Z_i|:  1*   2    2    2    2    3
R_i:    1    3.5  3.5  3.5  3.5  6
|Z_i|:  4    4    4    5    6    6
R_i:    8    8    8    10   12   12
|Z_i|:  6    7    7    8    11   12
R_i:    12   14.5 14.5 16   17   18
|Z_i|:  13   14   14   17   18   21
R_i:    19   20.5 20.5 22   23   24
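The tie-handling rule above is exactly what `scipy.stats.rankdata` implements by default. A sketch of the whole two-sided test follows; the data vector below is hypothetical (the signs of the original differences are not fully reproduced above), zeros are assumed absent, and v_n = Σ R_i² is used as the variance of W_n under H, which should be checked against the handout's definition.

```python
import numpy as np
from scipy.stats import norm, rankdata

def wilcoxon_signed_rank(z, alpha=0.05):
    """Two-sided Wilcoxon signed rank test with average ranks for ties,
    using the normal approximation W_n ~ N(0, v_n) under H."""
    z = np.asarray(z, dtype=float)
    r = rankdata(np.abs(z))            # average ranks for tied |Z_i|
    W = np.sum(np.sign(z) * r)         # signed rank statistic W_n
    v = np.sum(r ** 2)                 # Var(W_n) under H: signs are fair +/-1 coins
    reject = abs(W) > norm.ppf(1 - alpha / 2) * np.sqrt(v)
    return W, v, reject

# Hypothetical score differences (placeholder values, not the exercise data)
z = [14, -4, 6, 2, 1, 18, 6, -12, 8, 4, 13, -7]
print(wilcoxon_signed_rank(z))
```

On the tiny example from the hint, `rankdata` assigns the four tied values the average rank 4.5 and continues with rank 7, matching the rule described above.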
Exercise E2.2. Consider the one-way layout ANOVA treated in handout sec. 11.2, relation
(11.7):
Y_jk = μ_j + ε_jk,   k = 1, . . . , l,   j = 1, . . . , m, (13.7)
where ε_jk are i.i.d. normal N(0, σ²) noise variables and μ_j are unknown parameters, and there is
an equal number of observations l > 1 for each factor j. The total number of observations is n = ml.
In this case the F test given by the statistic (11.13) (handout) can be considered an average t test.
(i) (25 %) Let i and j be two different factor indices from {1, . . . , m}. Show that a t test of
H : μ_i = μ_j
K : μ_i ≠ μ_j
can be based on the statistic
T_ij(Y) = (Ȳ_i − Ȳ_j) / (2 d_w / l)^{1/2}
where
d_w = (n − m)⁻¹ Σ_{j=1}^m Σ_{k=1}^l (Y_jk − Ȳ_j)²
is the mean sum of squares within groups. (More precisely, show that T_ij has a t-distribution under
H and find the degrees of freedom.)
(ii) (25 %) Show that
(m(m − 1))⁻¹ Σ_{j=1}^m Σ_{i=1}^m T_ij²(Y) = F(Y)
where F(Y) is the F statistic (11.13) (handout). This relation shows that F(Y) is an average of
the (nonzero) T_ij²(Y).
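The identity in (ii) is easy to check numerically before proving it. A sketch with simulated data (group count, group size and means are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, l = 4, 7                                            # m groups of l observations, n = m*l
Y = rng.normal(size=(m, l)) + rng.normal(size=(m, 1))  # arbitrary group means

group_means = Y.mean(axis=1)
grand_mean = Y.mean()                                  # equals mean of group means (balanced)
n = m * l

# Within-groups mean sum of squares d_w
d_w = ((Y - group_means[:, None]) ** 2).sum() / (n - m)

# F statistic (11.13): between-groups mean square over d_w
F = (l * ((group_means - grand_mean) ** 2).sum() / (m - 1)) / d_w

# Average of the pairwise squared t statistics T_ij^2
T2_sum = sum((group_means[i] - group_means[j]) ** 2 / (2 * d_w / l)
             for i in range(m) for j in range(m))
print(F, T2_sum / (m * (m - 1)))   # the two numbers agree
```

The underlying algebraic fact is Σ_{i,j} (a_i − a_j)² = 2m Σ_j (a_j − ā)², applied to the group means.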
Exercise E2.3. (25 %) Consider again the model (13.7) of exercise E2.2, with the same assumptions,
but in an asymptotic framework where the number l of observations in each group tends to
infinity (m stays fixed). Consider the F-statistic
F(Y) = [(m − 1)⁻¹ Σ_{j=1}^m l (Ȳ_j − Ȳ)²] / [(n − m)⁻¹ Σ_{j=1}^m Σ_{k=1}^l (Y_jk − Ȳ_j)²].
Find the limiting distribution, as l → ∞, of the statistic (m − 1)F(Y) under the hypothesis of
equality of means H : μ_1 = . . . = μ_m. (It follows that this distribution can be used to obtain an
asymptotic α-test of H.)
Hint: Limiting distributions of other test statistics in the handout have been obtained e.g. in
Theorem 7.3.8 and Theorem 9.3.3.
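A simulation sketch for exploring this limit empirically (the group size and repetition count below are arbitrary): it tabulates (m − 1)F(Y) under H for a large l, so the empirical quantiles can be compared with those of candidate limit laws.

```python
import numpy as np

rng = np.random.default_rng(3)

def f_stat(Y, l, m):
    """One-way ANOVA F statistic for an m x l data array Y (balanced design)."""
    gm = Y.mean(axis=1)
    between = l * ((gm - Y.mean()) ** 2).sum() / (m - 1)
    within = ((Y - gm[:, None]) ** 2).sum() / (m * l - m)
    return between / within

# Simulate (m-1) F(Y) under H (all group means equal) for a large group size l
m, l, reps = 3, 500, 4000
sample = np.array([(m - 1) * f_stat(rng.normal(size=(m, l)), l, m)
                   for _ in range(reps)])

# Empirical median and upper 0.05 quantile, to match against the limit law
print(np.quantile(sample, [0.5, 0.95]))
```

Comparing these empirical quantiles with the quantiles of the distribution suggested by the cited theorems is a quick sanity check on the answer.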
Chapter 14
APPENDIX: TOOLS FROM PROBABILITY, REAL ANALYSIS AND LINEAR ALGEBRA
thus
|Y_1 Y_2| ≤ (1/2)(ε Y_1² + ε⁻¹ Y_2²).
This proves that EY_1 Y_2 exists and
|EY_1 Y_2| ≤ (1/2)(ε EY_1² + ε⁻¹ EY_2²).
If both EY_2², EY_1² > 0, then for ε = (EY_2²)^{1/2} / (EY_1²)^{1/2} we obtain the assertion. If one of them
is 0 (EY_2² = 0, say), then by taking ε > 0 arbitrarily small, we obtain |EY_1 Y_2| = 0.
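The two bounds can be checked numerically on empirical averages, where they hold exactly for any sample; a sketch (the distributions of Y_1, Y_2 below are arbitrary square-integrable choices):

```python
import numpy as np

rng = np.random.default_rng(4)
Y1 = rng.normal(size=200000)
Y2 = rng.normal(size=200000) ** 2      # any pair with finite second moments will do

lhs = abs(np.mean(Y1 * Y2))
cs = np.sqrt(np.mean(Y1 ** 2) * np.mean(Y2 ** 2))   # Cauchy-Schwarz bound

# The elementary bound holds for every eps > 0 ...
for eps in (0.1, 1.0, 10.0):
    assert lhs <= 0.5 * (eps * np.mean(Y1 ** 2) + np.mean(Y2 ** 2) / eps)

# ... and the minimizing eps = (EY2^2 / EY1^2)^{1/2} recovers the CS bound exactly
eps_star = np.sqrt(np.mean(Y2 ** 2) / np.mean(Y1 ** 2))
bound = 0.5 * (eps_star * np.mean(Y1 ** 2) + np.mean(Y2 ** 2) / eps_star)
print(lhs <= cs, np.isclose(bound, cs))   # prints: True True
```

This mirrors the proof: minimizing the right-hand side over ε turns the elementary arithmetic-mean bound into the Cauchy-Schwarz inequality.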
Theorem 14.2.1 (Lebesgue) Let X be a random variable taking values in a sample space 𝒳 and
let r_n(x), n = 1, 2, . . . be a sequence of functions on 𝒳 such that r_n(x) → r_0(x) for all x ∈ 𝒳.
Assume furthermore that there exists a function r(x) ≥ 0 such that |r_n(x)| ≤ r(x) for all x ∈ 𝒳
and all n (domination property), and E r(X) < ∞. Then
E r_n(X) → E r_0(X), as n → ∞.