
A First Course in Mathematical Statistics

MATH 472 Handout, Spring 04

Michael Nussbaum

May 3, 2004


Department of Mathematics,
Malott Hall, Cornell University,
Ithaca NY 14853-4201,
e-mail nussbaum@math.cornell.edu,
http://www.math.cornell.edu/~nussbaum
CONTENTS

0.2 Preface
0.3 References

1 Introduction
1.1 Hypothesis testing
1.2 What is statistics ?
1.3 Confidence intervals
1.3.1 The Law of Large Numbers
1.3.2 Confidence statements with the Chebyshev inequality

2 Estimation in parametric models
2.1 Basic concepts
2.2 Bayes estimators
2.3 Admissible estimators
2.4 Bayes estimators for Beta densities
2.5 Minimax estimators

3 Maximum likelihood estimators

4 Unbiased estimators
4.1 The Cramer-Rao information bound
4.2 Countably infinite sample space
4.3 The continuous case

5 Conditional and posterior distributions
5.1 The mixed discrete / continuous model
5.2 Bayesian inference
5.3 The Beta densities
5.4 Conditional densities in continuous models
5.5 Bayesian inference in the Gaussian location model
5.6 Bayes estimators (continuous case)
5.7 Minimax estimation of Gaussian location

6 The multivariate normal distribution

7 The Gaussian location-scale model
7.1 Confidence intervals
7.2 Chi-square and t-distributions
7.3 Some asymptotics

8 Testing Statistical Hypotheses
8.1 Introduction
8.2 Tests and confidence sets
8.3 The Neyman-Pearson Fundamental Lemma
8.4 Likelihood ratio tests

9 Chi-square tests
9.1 Introduction
9.2 The multivariate central limit theorem
9.3 Application to multinomials
9.4 Chi-square tests for goodness of fit
9.5 Tests with estimated parameters
9.6 Chi-square tests for independence

10 Regression
10.1 Regression towards the mean
10.2 Bivariate regression models
10.3 The general linear model
10.3.1 Special cases of the linear model
10.4 Least squares and maximum likelihood estimation
10.5 The Gauss-Markov Theorem

11 Linear hypotheses and the analysis of variance
11.1 Testing linear hypotheses
11.2 One-way layout ANOVA
11.3 Two-way layout ANOVA

12 Some nonparametric tests
12.1 The sign test
12.2 The Wilcoxon signed rank test

13 Exercises
13.1 Problem set H1
13.2 Problem set H2
13.3 Problem set H3
13.4 Problem set H4
13.5 Problem set H5
13.6 Problem set H6
13.7 Problem set H7
13.8 Problem set E1
13.9 Problem set H8
13.10 Problem set H9
13.11 Problem set H10
13.12 Problem set E2

14 Appendix: tools from probability, real analysis and linear algebra
14.1 The Cauchy-Schwarz inequality
14.2 The Lebesgue Dominated Convergence Theorem

0.2 Preface
Spring. 4 credits. Prerequisite: MATH 471 and knowledge of linear algebra such as taught in
MATH 221. Some knowledge of multivariate calculus helpful but not necessary.
Classical and recently developed statistical procedures are discussed in a framework that emphasizes
the basic principles of statistical inference and the rationale underlying the choice of these procedures
in various settings. These settings include problems of estimation, hypothesis testing, large sample
theory. (The Cornell Courses of Study 2000-2001).
This course is a sequel to the introductory probability course MATH471. These notes will be used
as a basis for the course in combination with a textbook (to be found among the references
given below).

0.3 References
[BD] Bickel, P. and Doksum, K., Mathematical Statistics: Basic Ideas and Selected Topics, Vol. 1 (2nd Edition), Prentice Hall, 2001.
[CB] Casella, G. and Berger, R., Statistical Inference, Duxbury Press, 1990.
[D] Durrett, R., The Essentials of Probability, Duxbury Press, 1994.
[DE] Devore, J., Probability and Statistics for Engineering and the Sciences, Duxbury - Brooks/Cole, 2000.
[FPP] Freedman, D., Pisani, R., and Purves, R., Statistics (3rd Edition), 1997.
[HC] Hogg, R. V. and Craig, A. T., Introduction to Mathematical Statistics (5th Edition), Prentice Hall, 1995.
[HT] Hogg, R. V. and Tanis, E. A., Probability and Statistical Inference (6th Edition), Prentice Hall, 2001.
[LM] Larsen, R. and Marx, M., An Introduction to Mathematical Statistics and Its Applications, Prentice Hall, 2001.
[M] Moore, D., The Basic Practice of Statistics (2nd Edition), W. H. Freeman and Co., 2000.
[R] Rice, J., Mathematical Statistics and Data Analysis, Duxbury Press, 1995.
[ROU] Roussas, G., A Course in Mathematical Statistics (2nd Edition), Academic Press, 1997.
[RS] Rohatgi, V. and Ehsanes Saleh, A. K., An Introduction to Probability and Statistics, John Wiley, 2001.
[SH] Shao, Jun, Mathematical Statistics, Springer Verlag, 1998.
[TD] Tamhane, A. and Dunlop, D., Statistics and Data Analysis: From Elementary to Intermediate, Prentice Hall, 2000.
Chapter 1
INTRODUCTION

1.1 Hypothesis testing


This course is a sequel to the introductory probability course Math471, the basis of which has been
the book The Essentials of Probability by R. Durrett (quoted as [D] henceforth). Some statistical
topics are already introduced there. We start by discussing some sections and examples (essentially
reproducing, sometimes extending the text of [D]).
Testing biasedness of a roulette wheel ([D] chap. 5.4 p. 244). Suppose we run a casino and we
wonder if our roulette wheel is biased. A roulette wheel has 18 outcomes that are red, 18 that are
black and 2 that are green. If we bet $1 on red then we win $1 with probability p = 18/38 = 0.4736
and lose $1 with probability 20/38 (see [D] p. 81 on gambling theory for roulette). To phrase the
biasedness question in statistical terms, let p be the probability that red comes up and introduce
two hypotheses:

H0 : p = 18/38   null hypothesis

H1 : p ≠ 18/38   alternative hypothesis

To test whether the null hypothesis is true, we spin the roulette wheel n times and let Xi = 1 if red comes up on the ith trial and 0 otherwise, so that X̄n is the fraction of times red comes up in the first n trials. The test is specified by giving a critical region Cn so that we reject H0 (that is, decide H0 is incorrect) when X̄n ∈ Cn. One possible choice in this case is

Cn = { x : |x − 18/38| > 2 √((18/38)(20/38)) / √n }.

This choice is motivated by the fact that if H0 is true then, using the central limit theorem (χ is a standard normal variable),

P(X̄n ∈ Cn) = P( |X̄n − 18/38| / (σ/√n) ≥ 2 ) ≈ P(|χ| ≥ 2) ≈ 0.05,   (1.1)

where σ = √((18/38)(20/38)). Rejecting H0 when it is true is called a type I error. In this test we have set the type I error to be 5%.
The basis for the approximation is the central limit theorem. Indeed the results Xi ,
i = 1, . . . , n of the n trials are independent random variables all having the same distribution (or
probability law). This distribution is a Bernoulli law, where Xi takes only values 0 and 1:

P(Xi = 0) = 1 − p,   P(Xi = 1) = p.

If μ is the expectation of this law then

μ = EX1 = p.

If σ² is the variance then

σ² = EX1² − (EX1)² = p − p² = p(1 − p).

The central limit theorem (CLT) gives

(X̄n − μ) / (σ/√n) ⇒ N(0, 1)   (1.2)

where N(0, 1) is the standard normal distribution and ⇒ signifies convergence in law (in distribution). Thus as n → ∞

P( |X̄n − μ| / (σ/√n) ≥ 2 ) → P(|χ| ≥ 2).

So in fact our reasoning is based on a large sample approximation for n → ∞. The relation P(|χ| ≥ 2) ≈ 0.05 is then taken from a tabulation of the standard normal law N(0, 1).
Now √((18/38)(20/38)) = 0.4993 ≈ 1/2, so to simplify the arithmetic the test can be formulated as

reject H0 if |X̄n − 18/38| > 1/√n,

or in terms of the total number of reds Sn = X1 + . . . + Xn :

reject H0 if |Sn − 18n/38| > √n.
Suppose now that we spin the wheel n = 3800 times and get red 1868 times. Is the wheel biased? We expect 18n/38 = 1800 reds, so the excess number of reds is |Sn − 1800| = 68. Given the large number of trials, this might not seem like a large excess. However √3800 = 61.6 and 68 > 61.6, so we reject H0, reasoning that if H0 were correct then we would see an observation this far from the mean less than 5% of the time.
Testing academic performance ([D] chap. 5.4 p. 244). Do married college students with children do less well because they have less time to study, or do they do better because they are more serious? The average GPA at the university is 2.48, so we might set up the following hypothesis test concerning μ, the mean grade point average of married students with children:

H0 : μ = 2.48   null hypothesis

H1 : μ ≠ 2.48   alternative hypothesis.

Suppose that to test this hypothesis we have records X1 , . . . , Xn of 25 married college students with children. Their average GPA is X̄n = 2.35 and the sample standard deviation is σ̂n = 0.5. Recall that the standard deviation of a sample X1 , . . . , Xn with sample mean X̄n is

σ̂n = √( (1/(n − 1)) ∑_{i=1}^n (Xi − X̄n)² ).

Using (1.1) from the last example we see that to have a test with a type I error of 5% we should reject H0 if

|X̄n − 2.48| > 2σ/√n.
The basis is again the central limit theorem: no particular assumption is made about the distribution of the Xi ; they are just independent and identically distributed random variables (i.i.d. r.v.'s) with a finite variance σ² (standard deviation σ). We again have the CLT (1.2) and we can take μ = 2.48 to test our hypothesis. But contrary to the previous example, the value of σ is then still undetermined (in the previous example both μ and σ are given by p). Thus σ is unknown, but we can estimate it by the sample standard deviation σ̂n . Taking n = 25 we see that

2(0.5)/√25 = 0.2 > 0.13 = |X̄n − 2.48|,
so we are not 95% certain that μ ≠ 2.48. Note the inconclusiveness of the outcome: the result of the test is the negative statement "we are not 95% certain that H0 is not true", but not that there is particularly strong evidence for H0 . That nonsymmetric role of the two hypotheses is specific to the setup of statistical testing; this will be discussed in detail later.
Testing the difference of two means ([D] p. 245). Suppose we have independent random samples of size n1 and n2 from two populations with unknown means μ1 , μ2 and variances σ1², σ2², and we want to test

H0 : μ1 = μ2   null hypothesis
H1 : μ1 ≠ μ2   alternative hypothesis.

Now the CLT implies that

X̄1 is approximately N(μ1 , σ1²/n1),   X̄2 is approximately N(μ2 , σ2²/n2).

Indeed these are just reformulations of (1.2) using properties of the normal law:

L(χ) = N(0, 1) if and only if L( μ + (σ/√n) χ ) = N(μ, σ²/n),

and L(−χ) = L(χ) = N(0, 1). Here we are using the standard notation L(χ) for the probability law of χ (law of χ, distribution of χ). We have assumed that the two samples are independent, so if H0 is correct,

X̄1 − X̄2 is approximately N(0, σ1²/n1 + σ2²/n2).
Based on the last result, if we want a test with a type I error of 5% then we should

reject H0 if |X̄1 − X̄2| > 2 √(σ1²/n1 + σ2²/n2).

For a concrete example we consider a study of passive smoking reported in the New England Journal
of Medicine. (cf. [D] p.246). A measurement of the size S of lung airways called FEF 25-75%
was taken for 200 female nonsmokers who were in a smoky environment and for 200 who were not.

In the first group the average value of S was 2.72 with a standard deviation of 0.71, while in the second group the average was 3.17 with a standard deviation of 0.74 (larger values are better). To see that there is a significant difference between the averages we note that

2 √(σ1²/n1 + σ2²/n2) = 2 √((0.71)²/200 + (0.74)²/200) = 0.14503,

while |X̄1 − X̄2| = 0.45. With these data, H0 is rejected, based on reasoning similar to the first example: if H0 were true, then what we are seeing would be very improbable, i.e. would have probability not higher than 0.05. But again the reasoning is based on a normal approximation, i.e. the belief that a sample size of 200 is large enough.
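The computation for this two-sample comparison can be reproduced directly (a minimal sketch; the numbers are those quoted above):

```python
import math

n1 = n2 = 200
mean1, sd1 = 2.72, 0.71   # group in smoky environment
mean2, sd2 = 3.17, 0.74   # control group

threshold = 2 * math.sqrt(sd1**2 / n1 + sd2**2 / n2)   # ~0.145
diff = abs(mean1 - mean2)                              # 0.45

print(f"threshold = {threshold:.5f}, |difference| = {diff:.2f}")
print("reject H0" if diff > threshold else "do not reject H0")
```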

1.2 What is statistics ?


The Merriam-Webster Dictionary says:
Main Entry: statistics
Pronunciation: st&-tis-tiks
Function: noun plural but singular or plural in construction
Etymology: German Statistik study of political facts and figures, from New Latin statisticus of
politics, from Latin status state
Date: 1770
1 : a branch of mathematics dealing with the collection, analysis, interpretation, and presentation
of masses of numerical data
2 : a collection of quantitative data
In the ENCYCLOPÆDIA BRITANNICA we find
Statistics: the science of collecting, analyzing, presenting, and interpreting data. Governmental
needs for census data as well as information about a variety of economic activities provided much of the
early impetus for the field of statistics. Currently the need to turn the large amounts of data available
in many applied fields into useful information has stimulated both theoretical and practical developments
in statistics. Data are the facts and figures that are collected, analyzed, and summarized for presentation
and interpretation. Data may be classified as either quantitative or qualitative. Quantitative data measure
either how much or how many of something, and qualitative data provide labels, or names, for categories
of like items. .......... Sample survey methods are used to collect data from observational studies, and
experimental design methods are used to collect data from experimental studies. The area of descriptive
statistics is concerned primarily with methods of presenting and interpreting data using graphs, tables, and
numerical summaries. Whenever statisticians use data from a samplei.e., a subset of the populationto
make statements about a population, they are performing statistical inference. Estimation and hypothesis
testing are procedures used to make statistical inferences. Fields such as health care, biology, chemistry,
physics, education, engineering, business, and economics make extensive use of statistical inference. Methods
of probability were developed initially for the analysis of gambling games. Probability plays a key role in
statistical inference; it is used to provide measures of the quality and precision of the inferences. Many of the
methods of statistical inference are described in this article. Some of these methods are used primarily for
single-variable studies, while others, such as regression and correlation analysis, are used to make inferences
about relationships among two or more variables.
The subject of this course is statistical inference. Let us again quote Merriam-Webster:
Main Entry: inference
Pronunciation: in-f(&-)r&n(t)s, -f&rn(t)s

Function: noun
Date: 1594
1 : the act or process of inferring: as a : the act of passing from one proposition, statement, or
judgment considered as true to another whose truth is believed to follow from that of the former; b
: the act of passing from statistical sample data to generalizations (as of the value of population
parameters) usually with calculated degrees of certainty.

1.3 Confidence intervals


Suppose a country votes for president, and there are two candidates, A and B. An opinion poll
institute wants to predict the outcome, by sampling a limited number of voters. We assume that
10 days ahead of the election, all voters have formed an opinion, no one intends to abstain, and
all voters are willing to answer the opinion poll if asked (these assumptions are not realistic, but are made here in order to explain the principle). The proportion intending to vote for A is p, where
0 < p < 1, so if a voter is picked at random and asked, the probability that he favors A is p. The
proportion p is unknown; if p > 1/2 then A wins the election.
The institute samples n voters, and assigns value 1 to a variable Xi if the vote intention is A, 0
otherwise (i = 1, . . . , n.). The institute selects the sample in a random fashion, throughout the
voter population, so that Xi can be assumed to be independent Bernoulli B(1, p) random variables.
(Again, in practice the choice is not entirely random, but follows some elaborate scheme in order to capture different parts of the population; we disregard this aspect.) Recall that p is unknown; an estimate of p is required to form the basis of a prediction (< 1/2 or > 1/2 ?). In the theory of statistical inference a common notation for estimates based on a sample of size n is p̂n . Suppose that the institute decides to use the sample mean
that the institute decides to use the sample mean
n
X
1
Xn = n Xi
i=1

as an estimate of p, so pn = Xn .
1.3.1 The Law of Large Numbers
We have for X̄n = p̂n and any ε > 0

P(|p̂n − p| > ε) → 0 as n → ∞.   (1.3)

In words, for any small fixed number ε the probability that p̂n is outside the interval (p − ε, p + ε) can be made arbitrarily small by selecting n sufficiently large. If the institute samples enough voters, it can believe that its estimate p̂n is close enough to the true value. In statistics, estimates which converge to the true value in the above probability sense are called consistent estimates (recall that the convergence type (1.3) is called convergence in probability). As a basic requirement the institute needs a good estimate p̂n to base its prediction upon. The LLN tells the institute that it actually pays to get more opinions.
Suppose the institute has sampled a large number of voters, and the estimate p̂n turns out to be > 1/2, but close to it. It is natural to proceed with caution in this case, as the reputation of the institute depends on the reliability of its published results. Results which are deemed unreliable will not be published. A controversy might arise within the institute:
Researcher a: "We spent a large amount of money on this poll and we have a really large n. So let us go ahead and publish the result that A will win."

Researcher b: "This result is too close to the critical value. I do not claim that B will win, but I am in favor of not publishing a prediction."
Clearly a sound and rational criterion is needed for a decision whether to publish or not. A method
for this should be fixed in advance.
1.3.2 Confidence statements with the Chebyshev inequality
Recall Chebyshev's inequality ([D] chap. 5.1, p. 222). If Y is a random variable with finite variance Var(Y) and y > 0, then

P(|Y − EY| ≥ y) ≤ Var(Y)/y².

Applying this for Y = X̄n = p̂n , we obtain

P(|p̂n − p| > ε) ≤ Var(X1)/(nε²) = p(1 − p)/(nε²)   (1.4)

for any ε > 0. Suppose we wish to guarantee that

P(|p̂n − p| > ε) ≤ α   (1.5)

for an α given in advance (e.g. α = 0.05 or α = 0.01). Now p(1 − p) ≤ 1/4, so

P(|p̂n − p| > ε) ≤ 1/(4nε²) ≤ α

provided we select ε = 1/√(4nα).
The Chebyshev inequality thus allows us to quantitatively assess the accuracy when the sample size is given (or alternatively, to determine the necessary sample size to attain a given desired level of accuracy ε). The convergence in probability (or LLN) (1.3) is just a qualitative statement about p̂n ; it is actually also derived from the Chebyshev inequality (cf. the proof of the LLN).
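For example, the guaranteed accuracy ε = 1/√(4nα), or conversely the required sample size, can be computed directly (a minimal sketch; the chosen n, α and ε are illustrative):

```python
import math

def accuracy(n, alpha):
    """Accuracy eps with P(|p_hat - p| > eps) <= alpha, via Chebyshev."""
    return 1 / math.sqrt(4 * n * alpha)

def sample_size(eps, alpha):
    """Smallest n guaranteeing accuracy eps at level alpha, via Chebyshev."""
    return math.ceil(1 / (4 * alpha * eps**2))

print(accuracy(n=2000, alpha=0.05))       # 0.05
print(sample_size(eps=0.03, alpha=0.05))  # 5556
```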
Another way of phrasing (1.5) would be: the probability that the interval (p̂n − ε, p̂n + ε) covers p is at least 95%, or

P((p̂n − ε, p̂n + ε) ∋ p) ≥ 1 − α.   (1.6)

Such statements are called confidence statements, and the interval (p̂n − ε, p̂n + ε) is a confidence interval. Note that p̂n is a random variable, so the interval is in fact a random interval. Therefore the element sign is written in reverse form ∋ to stress the fact that in (1.6) the interval is random, not p (p is merely unknown).
To be even more precise, we note that the probability law depends on p, so we should properly write
Pp (as is done usually in statistics, where the probabilities depend on an unknown parameter). So
we have
Pp((p̂n − ε, p̂n + ε) ∋ p) ≥ 1 − α.   (1.7)

When α is a preset value, and (p̂n − ε, p̂n + ε) is known to fulfill (1.7), the common practical point of view is: we believe that our true unknown p is within distance ε of p̂n . When p̂n happens to be more than ε away from 1/2 (and, e.g., larger), then the opinion poll institute has enough evidence; this immediately implies "we believe that our true unknown p is greater than 1/2", and they can go ahead and publish the result. They know that if the true p is actually less than 1/2, then the outcome they see (p̂n ≥ 1/2 + ε) has less than 0.05 probability:

Pp(p̂n ≥ 1/2 + ε) = 1 − Pp(p̂n < 1/2 + ε)
                 ≤ 1 − Pp(p̂n < p + ε)
                 ≤ 1 − Pp(p̂n − ε < p < p̂n + ε)
                 = 1 − Pp((p̂n − ε, p̂n + ε) ∋ p)
                 ≤ α.

Note that α is to some extent arbitrary, but common values are α = 0.05 (95% confidence) and α = 0.01 (99% confidence).
The reasoning "When I observe a fact (an outcome of a random experiment) and I know that under a certain hypothesis this fact would have less than 5% probability, then I reject this hypothesis" is very common; it is the basis of statistical testing theory. In our case of confidence intervals, the fact (event) would be "1/2 is not within (p̂n − ε, p̂n + ε)" and the hypothesis would be p = 1/2. When we reject p = 1/2 because of p̂n ≥ 1/2 + ε then we can also reject all values p < 1/2. But this type of decision rule (rational decision making, testing) cannot give reasonable certainty in all cases. When 1/2 is in the 95% confidence interval the institute would be well advised to be cautious, and not publish the result. It just means "unfortunately, I did not observe a fact which would be very improbable under the hypothesis, so there is not enough evidence against the hypothesis; nothing can really be ruled out."
In summary: the prediction is suggested by p̂n ; the confidence interval is a rational way of deciding whether to publish or not.
Note that, contrary to the above testing examples, the confidence interval did not involve any large sample approximation. However such arguments (normal approximation, estimate Var(X1) by p̂n(1 − p̂n)) can alternatively be used here.
Chapter 2
ESTIMATION IN PARAMETRIC MODELS

2.1 Basic concepts


After hypothesis testing and confidence intervals, let us introduce the third major branch of statistical inference: parameter estimation.
Recall that a random variable is a number depending on the outcome of an experiment. When the space of outcomes is Ω, then a random variable X is a function on Ω with values in a space X , written X(ω). The ω is often omitted.
We also need the concept of a realization of a random variable. This is a particular value x ∈ X which the random variable has taken, i.e. when ω has taken a specific value. Conceptually, by random variable we mean the whole function ω ↦ X(ω), whereas the realization is just a value x ∈ X (the data in a statistical context).
Population and sample.
Suppose a large shipment of transistors is to be inspected for defective ones. One would like to know the proportion of defective transistors in the shipment; assume it is p where 0 ≤ p ≤ 1 (p is unknown). A sample of n transistors is taken from the shipment and the proportion of defective ones is calculated. This is called the sample proportion:

p̂ = #{defectives in sample} / n.   (2.1)
Here the shipment is called the population in the statistical problem and p is called a parameter
of the population. When we randomly select one individual (transistor) from the population, this
transistor is defective with probability p. We may define a random variable X1 in the following
way:

X1 = 1 if defective
X1 = 0 otherwise.

That is, X1 takes value 1 with probability p and value 0 with probability 1 − p. Such a random variable is called a Bernoulli random variable and the corresponding probability distribution is the Bernoulli distribution, or Bernoulli law, written B(1, p). The sample space of X1 is the set
{0, 1}.
When we take a sample of n transistors, this should be a simple random sample, which means
the following:
a) each individual is equally likely to be included in the sample
b) results for different individuals are independent of one another.
In mathematical language, a simple random sample of size n yields a set X1 , . . . , Xn of independent, identically distributed random variables (i.i.d. random variables). They are identically distributed,

in the above example, because they all follow the Bernoulli law B(1, p) (they are a random selection from the population which has population proportion p). The X1 , . . . , Xn are independent as
random variables because of property b) of a simple random sample.
Denote by X = (X1 , . . . , Xn ) the totality of observations, or the vector of observations. This is now a random variable with values in the n-dimensional Euclidean space R^n. (We also call this a random vector, or a random variable with values in R^n. Some probability texts assume random variables to take values in R only; the higher dimensional objects are called random elements or random vectors.) The sample space of X is now the set of all sequences of length n which consist of 0's and 1's, written also {0, 1}^n. In general, we denote by X the sample space of an observed random variable X.

Notation Let X be a random variable with values in a space X . We write L(X) for the probability
distribution (or the law) of X.

Recall that the probability distribution (or the law) of X is given by the totality of the values X can take, together with the associated probabilities. That definition is true for discrete random variables (the totality of values is finite or countable); for continuous random variables the probability density function defines the distribution. When X is real valued, either discrete or continuous, the law of X can also be described by the distribution function

F(x) = P(X ≤ x).

In the above example, each individual Xi is Bernoulli: L(Xi) = B(1, p), but the law of X = (X1 , . . . , Xn ) is not Bernoulli: it is the law of n i.i.d. random variables having Bernoulli law B(1, p). (In probability theory, such a law is called the product law, written B(1, p)^n.) Note that in our statistical problem above, p is not known, so we have a whole set of laws for X: all the laws B(1, p)^n where p ∈ [0, 1].
The parametric estimation problem. Let X be an observed random variable with values in X and L(X) be the probability distribution (or the law) of X. Assume that L(X) is unknown, but known to be from a certain set of laws:

L(X) ∈ {Pθ ; θ ∈ Θ} .

Here θ is an index (a parameter) of the law and Θ is called the parameter space (the set of admitted θ). The problem is to estimate θ based on a realization of X. The set {Pθ ; θ ∈ Θ} is also called a parametric family of laws.
In the sequel we assume Θ is a subset of the real line R and X = (X1 , . . . , Xn ) where X1 , . . . , Xn are independent random variables. Here n is called the sample size. In most of this section we confine ourselves to the case that X is a finite set (with the exception of some examples). In the above example, the θ above takes the role of the population proportion p; since the population proportion is known to be in [0, 1], the parameter space Θ would be [0, 1].

Definition 2.1.1 (i) A statistic T is an arbitrary function of the observed random variable X.
(ii) As an estimator T of θ we admit any mapping with values in Θ:

T : X → Θ.

In this case, for any realization x the statistic T(x) gives the estimated value of θ.

Note that T = T(X) is also a random variable. Statistical terminology is such that an "estimate" is a realized value of that random variable, i.e. T(x) above (the estimated value of θ), whereas "estimator" denotes the whole function T (also called estimating function). Sometimes the words estimate and estimator are used synonymously.
Thus an estimator is a special kind of statistic. Other instances of statistics are those used for
building tests or confidence intervals.

Notation Since the distribution of X depends on an unknown parameter, we stress this dependence and write

Pθ(X ∈ B), Eθ h(X) etc.

for probabilities, expectations etc. which are computed under the assumption that θ is the true parameter of X.

For later reference we write the model of i.i.d. Bernoulli random variables in the following form.

Model M1 A random vector X = (X1 , . . . , Xn ) is observed, with values in X = {0, 1}^n; the distribution of X is the joint law B(1, p)^n of n independent and identically distributed Bernoulli random variables Xi each having law B(1, p), where p ∈ [0, 1].

This fits into the above parametric estimation problem as follows: we have to set θ = p, Θ = [0, 1] and Pθ = B(1, p)^n.

Remark 2.1.2 We set p = θ and write Pp(·) for probabilities depending on the unknown p. Thus for a particular value x = (x1 , . . . , xn ) ∈ {0, 1}^n we have

Pp(X = x) = ∏_{i=1}^n Pp(Xi = xi) = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi}.

Note that the above probability can be written

p^{∑_{i=1}^n xi} (1 − p)^{n − ∑_{i=1}^n xi} = p^z (1 − p)^{n−z},

where z denotes the value z = ∑_{i=1}^n xi . Thus the probability Pp(X = x) depends only on the number of realized 1's among the x1 , . . . , xn , or the number of successes in n Bernoulli trials (the number of times the original event has occurred).

The problem is to estimate the parameter p from the data X1 , . . . , Xn . (Note that now we used the word "data" for the random variables, not for the realizations; but this also corresponds to usage in statistics). As estimators all mappings from X into [0, 1] are admitted, for instance the relative frequency of observed 1's:

T(X) = n⁻¹ ∑_{j=1}^n Xj = X̄n = p̂n ,

i.e. the sample mean X̄n , which coincides with the sample proportion p̂n from (2.1). In the sequel we shall identify certain requirements which good estimators have to fulfill, and we shall develop criteria to quantify the performance of estimators.
Let us give some other examples of parametric estimation problems. In each case, we observe a random vector X = (X1 , . . . , Xn ) consisting of independent and identically distributed random variables Xi . For the Xi we shall specify a parametric family of laws {Qθ , θ ∈ Θ}; then this defines the parametric family of laws {Pθ , θ ∈ Θ} for X, i.e. the specification L(X) ∈ {Pθ , θ ∈ Θ}.
It suffices to give the family of laws of X1 ; then L(X1) ∈ {Qθ , θ ∈ Θ} determines the family {Pθ , θ ∈ Θ}.
(i) Poisson family: {Po(λ), λ > 0}. Here Θ = (0, ∞).
(ii) Normal location family: {N(μ, σ²), μ ∈ R} where σ² is fixed (known). Here θ = μ, Θ = R, and the expectation parameter μ of the normal law describes its location on the real line.
(iii) Uniform family: {U(0, θ), θ > 0} where U(0, θ) is the uniform law with endpoints 0 and θ, having density

pθ(x) = { θ⁻¹ for 0 ≤ x ≤ θ
        { 0 otherwise.
Choosing the best estimator. In the above example involving transistors the sample proportion p̂ suggested itself as a reasonable estimator of the population proportion p. However, it is not a priori clear that this is the best; we may also consider other functions of the observations X = (X1 , . . . , Xn ). First of all, we have to define what it means that an estimator is good. A quantitative comparison of estimators is made possible by the approach of statistical decision theory. We choose a loss function L(t, θ) which measures the loss (inaccuracy) if the unknown parameter θ is estimated by a value t. We stress that t must be chosen as an estimate depending on the data, so the criterion becomes more complicated (randomness intervenes) and the choice of the loss function is just a first step. The loss is assumed to be nonnegative, i.e. the minimal possible loss is zero.
Natural choices, in case Θ ⊆ R, are the distance of t and θ (estimation error)

L(t, θ) = |t − θ|

or the quadratic estimation error

L(t, θ) = (t − θ)².   (2.2)

Another possible choice is

L(t, θ) = 1_{[ε,∞)}(|t − θ|)   (2.3)

for some ε > 0, which means that emphasis is put on the distance being less than ε, not on its actual value.
As has been said above, the value of the loss becomes random when we insert for t an estimate based on the data. The loss then is a random variable L(T(X), θ) where T(X) is the estimator. To judge the performance of an estimator, it is natural to proceed to the average or expected loss for given θ.

Definition 2.1.3 The risk of an estimator T at parameter value θ is

R(T, θ) = Eθ L(T(X), θ).

The risk R(T, θ) as a function of θ is called the risk function of the estimator T.

Note that in our present model (θ = p) we do not have to worry about existence of the expected value (since the law B(1, p)^n has finite support). (For later generalizations, note that since L is nonnegative, if its expectation is not finite then it must be +∞, which may then also be regarded as the value of the risk.)

Thus the random nature of the loss is successfully dealt with by taking the expectation, but for judging an estimator T(X), the fact remains that θ is unknown. Thus we still have a whole risk function θ ↦ R(T, θ) as a criterion for performance. But it is desirable to express the quality of T by just one number and then try to minimize it. There is a further choice involved for the method to obtain such a number; in the sequel we shall discuss several approaches.
Suppose that, rather than reducing the problem in this fashion, we try to minimize the whole risk function simultaneously, i.e. try to find an estimator T* such that

R(T*, θ) = min_T R(T, θ) for all θ ∈ Θ.

Such an estimator would be called a uniformly best estimator. In general such an estimator will not exist: for each θ0 ∈ Θ consider an estimator

T_{θ0}(x) = θ0 .

This estimator ignores the data and always selects θ0 ; then R(T_{θ0}, θ0) = 0, i.e. this estimator is very good if θ0 is true. Thus if T* were uniformly best it would have to compete with T_{θ0}, i.e. fulfill

R(T*, θ) = 0 for all θ ∈ Θ,

i.e. T* always achieves 0 risk. If the risk adequately expresses a distance to the parameter this means that a sure decision is possible, which is not realistic for statistical problems, and possible only if the problem itself is degenerate or trivial. Thus (we argued informally) uniformly best estimators do not exist in general.
There are two ways out of this dilemma:

- reduce the problem to minimizing one characteristic, as argued above (i.e. the maximal risk or an average risk over Θ)
- restrict the class of estimators such as to rule out unreasonable competitors like T_{θ0} which are likely to be very bad for most θ. Within a restricted class of estimators, a uniformly best one may very well exist.

Consistency of estimators. At the least, an estimator should converge towards the true (unknown) parameter to be estimated when the sample size increases. To emphasize the dependence of the data vector on the sample size n we write the statistical model

L(X^(n)) ∈ {Pθ,n ; θ ∈ Θ} .

Recall that convergence in probability for a sequence of random variables is denoted by the symbol →P.

Definition 2.1.4 A sequence Tn = Tn(X^(n)) of estimators (each based on a sample of size n) for the parameter θ is called consistent if for all θ ∈ Θ

Pθ,n( |Tn(X^(n)) − θ| > ε ) → 0 as n → ∞, for all ε > 0,

or in other notation

Tn = Tn(X^(n)) →P θ as n → ∞, if L(X^(n)) = Pθ,n .

Proposition 2.1.5 In Model M1 the estimator

Tn(X1 , . . . , Xn) = X̄n

is consistent for the probability of success p.

Proof. Since X1 , . . . , Xn are i.i.d. Bernoulli B(1, p) random variables, we have EX1 = p and the
result immediately follows from the law of large numbers.
The consistency requirement restricts the class of estimators to reasonable ones; consistency can be
seen as a minimal requirement. But there are still many mappings of the observation space into
the parameter space which define consistent sequences of estimators.
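A small simulation makes this consistency concrete: the frequency of the event |X̄n − p| > ε shrinks as n grows (a minimal sketch; the values of p, ε and the sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
p, eps, reps = 0.52, 0.02, 20_000

for n in (100, 1_000, 10_000):
    # empirical frequency of the event |X_bar_n - p| > eps
    means = rng.binomial(n, p, size=reps) / n
    print(n, np.mean(np.abs(means - p) > eps))
```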

2.2 Bayes estimators


In the Bayesian approach, one assumes that an a priori distribution is given on the parameter space, reflecting one's prior belief about the parameter θ. In the case of Model M1 , assume that this prior distribution is given in the form of a density g on the interval [0, 1]. For an estimator T, this allows us to reduce the risk function R(T, p), p ∈ [0, 1], to just one number, by integration:

B(T) = ∫_0^1 R(T, p) g(p) dp   (2.4)
     = ∫_0^1 Ep(T − p)² g(p) dp   (2.5)

(the latter equality being true in the case of quadratic loss). This quantity B(T) is called the integrated risk or mixed risk of the estimator T. A Bayes estimator TB then is defined by the property of minimizing the integrated risk:

B(TB) = inf_T B(T)   (2.6)

and the Bayes risk is the minimal integrated risk, i.e. (2.6).

The name "Bayesian" derives from the analogy with the Bayes formula: if Bi , i = 1, . . . , k, is a partition of the sample space then

P(A) = ∑_{i=1}^k P(A|Bi) P(Bi).

In our example involving the parameter p above, p takes continuous values in the interval (0, 1), but to make the connection to the Bayes formula above, consider a model where the parameter θ takes only finitely many values (Θ = (θ1 , . . . , θk)). Consider the case that P is the joint distribution of (X, U) where X is the data and U is a random variable which takes the k possible values of the parameter (U ∈ Θ). For A = {X ∈ A} and Bi = {U = θi} we get

P(X ∈ A) = ∑_{i=1}^k P(X ∈ A | U = θi) P(U = θi).

Here gi = P(U = θi) can be construed as a prior probability that the parameter takes the value θi , and the conditional probabilities P(X ∈ A | U = θi) can be construed as a family of probability measures depending on θ. In other words, in the Bayesian approach

a family of probability measures {Pθ ; θ ∈ Θ} is understood as a conditional distribution given U = θ. The marginal distribution of θ awaits specification as a prior distribution based on belief or experience. In our example (2.5), the probabilities gi = P(U = θi), i = 1, . . . , k are replaced by a probability density g(p), p ∈ (0, 1). It is also possible of course to admit only finitely many values of p: pi , i = 1, . . . , k with prior probabilities gi ; then (2.5) would read

B(T) = ∑_{i=1}^k Epi(T − pi)² gi .

The expectations Epi(T − pi)² for a given pi can then be interpreted as conditional expectations given U = pi , and the integrated risk B(T) above then is an unconditional expectation, namely B(T) = E(T − p)², the expected squared loss (T − p)² with respect to the joint distribution of observations X and random parameter p.
In the philosophical foundations of statistics, or in theories of how statistical decisions should be
made in the real world, the Bayesian approach has developed into an important separate school of
thought; those who believe that prior distributions on Θ should always be applied (and are always
available) are sometimes called Bayesians, and Bayesian statistics is the corresponding part of
Mathematical Statistics.
In Model M1 , consider the following family of prior densities for the Bernoulli parameter p: for α, β > 0

g_{α,β}(p) = p^{α−1} (1 − p)^{β−1} / B(α, β),   p ∈ [0, 1],

where

B(α, β) = Γ(α)Γ(β) / Γ(α + β).   (2.7)

The corresponding distributions are called the Beta distributions; here B(α, β) stands for the beta function defined by (2.7). Recall that

Γ(α) = ∫_0^∞ x^{α−1} exp(−x) dx.

Thus we consider a whole family (α, β > 0) of possible prior distributions for p, allowing a wide range of choices for prior belief. The plot below shows three different densities from this family; let us mention that the uniform density on [0, 1] is also in this class (for α = β = 1). We will discuss the Beta family in more detail later, establishing also that B(α, β) is the correct normalization factor. It will also become clear that Bayesian methods are very useful to prove non-Bayesian optimality properties of estimators.

Beta densities for (α, β) = (2, 6), (2, 2), (6, 2)

Proposition 2.2.1 In Model M1 , let g be an arbitrary prior density for the parameter p ∈ [0, 1]. The corresponding Bayes estimator of p is

TB(x) = ∫_0^1 p^{z(x)+1} (1 − p)^{n−z(x)} g(p) dp / ∫_0^1 p^{z(x)} (1 − p)^{n−z(x)} g(p) dp   (2.8)

for x = (x1 , . . . , xn ) ∈ {0, 1}^n and

z(x) = ∑_{i=1}^n xi .

Remark. (i) Note that the Bayes estimator depends on the sample x only via the statistic z(x), or equivalently, via the sample mean X̄n(x) = n⁻¹ z(x).
(ii) The function

gx(p) = p^{z(x)} (1 − p)^{n−z(x)} g(p) / ∫_0^1 p^{z(x)} (1 − p)^{n−z(x)} g(p) dp

is a probability density on [0, 1], depending on the observed x. We see from (2.8) that TB(x) is the expectation for that density:

TB(x) = ∫_0^1 p gx(p) dp.
Proof. For any estimator T and X = {0, 1}^n

B(T) = ∑_{x∈X} ∫_0^1 (T(x) − p)² p^{z(x)} (1 − p)^{n−z(x)} g(p) dp.

To minimize B(T), the value T(x) should be chosen optimally; one could try to do this for every term in the sum, given x. Thus we look for the minimum of

bx(t) = ∫_0^1 (t − p)² p^{z(x)} (1 − p)^{n−z(x)} g(p) dp.

This can be written as a polynomial of degree 2 in t:

bx(t) = c0 − 2c1 t + c2 t²

where

c2 = ∫_0^1 p^{z(x)} (1 − p)^{n−z(x)} g(p) dp,
c1 = ∫_0^1 p^{z(x)+1} (1 − p)^{n−z(x)} g(p) dp,
c0 = ∫_0^1 p^{z(x)+2} (1 − p)^{n−z(x)} g(p) dp.

Since

0 ≤ p^{z(x)} (1 − p)^{n−z(x)} ≤ 1,

all these integrals are finite, and c2 > 0. It follows that the unique minimum t0 of bx(t) can be obtained by setting the derivative to 0. The solution of

bx'(t) = 2tc2 − 2c1 = 0

is t0 = c1/c2 ; we have 0 ≤ c1 ≤ c2 since p ≤ 1. This implies that 0 ≤ t0 ≤ 1, and

TB(x) = t0(x)

is of the form (2.8).
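Formula (2.8) can be evaluated numerically for any prior density g; a minimal sketch using numerical integration (the uniform prior g ≡ 1 and the data summary z = 7, n = 10 are illustrative):

```python
from scipy.integrate import quad

def bayes_estimate(z, n, prior=lambda p: 1.0):
    """Posterior mean (2.8) for z successes in n Bernoulli trials under prior density g."""
    num = quad(lambda p: p**(z + 1) * (1 - p)**(n - z) * prior(p), 0, 1)[0]
    den = quad(lambda p: p**z * (1 - p)**(n - z) * prior(p), 0, 1)[0]
    return num / den

print(bayes_estimate(z=7, n=10))   # uniform prior: (7 + 1)/(10 + 2) = 2/3
```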

2.3 Admissible estimators


Bayes estimators also enjoy an optimality property in terms of the original risk function p ↦ R(TB , p), even before a weighted average over the parameter p is taken.

Definition 2.3.1 An estimator T of a parameter θ is called admissible, if for every estimator S of θ, the relation

R(S, θ) ≤ R(T, θ) for all θ ∈ Θ   (2.9)

implies

R(S, θ) = R(T, θ) for all θ ∈ Θ.   (2.10)

Thus admissibility means that there can be no estimator S which is uniformly at least as good ((2.9) holds), and strictly better for one θ0 : then R(S, θ0) < R(T, θ0) contradicts (2.10).
Non-admissibility of T means that T can be improved by another estimator S.

Proposition 2.3.2 Suppose that in Model M1 , the prior density g is such that g(p) > 0 for all
p ∈ [0, 1] with the exception of a finite number of points p. Then the Bayes estimator TB for this
prior density is admissible, for quadratic loss.

Proof. Suppose that TB is not admissible. Then there is an estimator S and a p0 ∈ [0, 1] with

R(S, p) ≤ R(TB , p) for all p ∈ [0, 1],
and R(S, p0) < R(TB , p0).   (2.11)

Note that R(S, p) is continuous in p (continuous for p ∈ (0, 1), and right resp. left continuous at the endpoints 0 and 1):

R(S, p) = ∑_{x∈X} (S(x) − p)² p^{z(x)} (1 − p)^{n−z(x)}

for z(x) = ∑_{i=1}^n xi , x ∈ X . Thus (2.11) implies that there must be a whole neighborhood of p0 within which S is better: for some ε > 0

R(S, p) < R(TB , p),   p ∈ [p0 − ε, p0 + ε] ∩ [0, 1].

It follows that

B(S) = ∫_0^1 R(S, p) g(p) dp < ∫_0^1 R(TB , p) g(p) dp = B(TB).

This contradicts the fact that TB is a Bayes estimator; thus TB must be admissible.

2.4 Bayes estimators for Beta densities


Let us specify (2.8) to the case of the Beta densities g_{α,β}(p) introduced above. We have

TB(x) = ∫_0^1 p^{z(x)+1} (1 − p)^{n−z(x)} g_{α,β}(p) dp / ∫_0^1 p^{z(x)} (1 − p)^{n−z(x)} g_{α,β}(p) dp   (2.12)
      = ∫_0^1 p^{α+z(x)} (1 − p)^{β+n−z(x)−1} dp / ∫_0^1 p^{α+z(x)−1} (1 − p)^{β+n−z(x)−1} dp.   (2.13)

With a partial integration we obtain for γ, δ > 0

∫_0^1 p^γ (1 − p)^{δ−1} dp = [ −δ⁻¹ p^γ (1 − p)^δ ]_0^1 + (γ/δ) ∫_0^1 p^{γ−1} (1 − p)^δ dp   (2.14)
                          = (γ/δ) ∫_0^1 p^{γ−1} (1 − p)^δ dp.   (2.15)

(Technical remark: for γ < 1, the function p^{γ−1} tends to ∞ as p → 0, so the second integral in (2.14) is improper [a limit of ∫_t^1 for t ↘ 0], similarly in case of δ < 1. Hence, strictly speaking, one should argue in terms of ∫_t^1 first and then take a limit. Note that these limits exist since the function p^{γ−1} is integrable on (0, 1) for γ > 0:

∫_t^1 p^{γ−1} (1 − p)^{δ−1} dp ≤ ∫_t^1 p^{γ−1} dp = [ γ⁻¹ p^γ ]_t^1 = γ⁻¹ (1 − t^γ). )

Relation (2.15) implies

∫_0^1 p^{γ−1} (1 − p)^{δ−1} dp = ∫_0^1 (p + 1 − p) p^{γ−1} (1 − p)^{δ−1} dp
                              = ∫_0^1 p^γ (1 − p)^{δ−1} dp + ∫_0^1 p^{γ−1} (1 − p)^δ dp
                              = (1 + δ/γ) ∫_0^1 p^γ (1 − p)^{δ−1} dp

(for the last equality, we reversed the roles of γ and δ in (2.15)). Setting now γ = α + z(x), δ = β + n − z(x), we obtain from (2.13)

TB(x) = ∫_0^1 p^γ (1 − p)^{δ−1} dp / ∫_0^1 p^{γ−1} (1 − p)^{δ−1} dp = (1 + δ/γ)⁻¹
      = γ/(γ + δ) = (α + z(x))/(α + β + n),

thus (:= means defining equality)

T_{α,β}(X) := TB(X) = (X̄n + α/n) / (1 + α/n + β/n).

We already know that the Bayes estimator is a function of the sample mean; here we have made formula (2.8) explicit.
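A quick numerical cross-check of this closed form against (2.13) (a minimal sketch; the chosen α, β, z, n are illustrative):

```python
from scipy.integrate import quad

alpha, beta, z, n = 2.0, 6.0, 7, 10

closed_form = (alpha + z) / (alpha + beta + n)
num = quad(lambda p: p**(alpha + z) * (1 - p)**(beta + n - z - 1), 0, 1)[0]
den = quad(lambda p: p**(alpha + z - 1) * (1 - p)**(beta + n - z - 1), 0, 1)[0]

print(closed_form, num / den)   # both equal 0.5 here
```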
Let us discuss some limiting cases.
Limiting case A) Sample size n is fixed, α → 0, β → 0. In the limit X̄n is obtained. However, the family of densities g_{α,β} does not converge to a density for α → 0, β → 0, since the function

g*(p) = p⁻¹ (1 − p)⁻¹

is not integrable. Indeed

∫_t^{1/2} p⁻¹ (1 − p)⁻¹ dp ≥ ∫_t^{1/2} p⁻¹ dp = log(1/2) − log t → ∞ for t → 0.

This means that X̄n is not a Bayes estimator for one of the densities g_{α,β}.
Limiting case B) α and β are fixed, sample size n → ∞. In this case we have for p ∈ (0, 1)

T_{α,β}(X) = X̄n + op(n^{−γ}) for all 0 ≤ γ < 1.   (2.16)

Definition 2.4.1 A sequence of r.v. Zn is called op(n^{−γ}) if

n^γ Zn →p 0, n → ∞.

Relation (2.16) thus means that

n^γ ( T_{α,β}(X) − X̄n ) →p 0.

To see (2.16), note that

T_{α,β}(X) − X̄n = (X̄n + α/n) / (1 + α/n + β/n) − X̄n = ( α/n − X̄n(α/n + β/n) ) / (1 + α/n + β/n)

and since X̄n →P p, it is obvious that the above quantity is op(n^{−γ}). At the same time, X̄n converges in probability to p, but slower: we have

X̄n = p + op(n^{−γ}) for all 0 ≤ γ < 1/2.   (2.17)

Indeed, Varp(X̄n) = p(1 − p)/n, hence

Varp( n^γ (X̄n − p) ) = n^{2γ} Varp(X̄n − p) = n^{2γ} Varp(X̄n) = n^{2γ−1} p(1 − p)

which tends to 0 for 0 ≤ γ < 1/2 but not for γ = 1/2. Rather, for γ = 1/2 we have by the central limit theorem

n^{1/2} (X̄n − p) →d N(0, p(1 − p)),

so that (X̄n − p) is not op(n^{−1/2}).
The interpretation of (2.16), (2.17) is that as n → ∞, the influence of the a priori information diminishes. The Bayes estimators all become close to the sample mean (and to each other) at rate n^{−γ}, γ < 1, whereas they converge to the true p only at rate n^{−γ}, γ < 1/2.

2.5 Minimax estimators


Let us consider the case α = β = n^{1/2}/2. For the risk we obtain

R(T_{α,β}, p) = Ep(T_{α,β} − p)² = (n + α + β)^{−2} Ep( nX̄n + α − (n + α + β)p )²   (2.18)
             = (n + n^{1/2})^{−2} Ep{ (nX̄n − np)² + (α(1 − p) − βp)² }   (2.19)

(for the last equality, note that E(Y − EY + a)² = E(Y − EY)² + a² for nonrandom a). Set q := 1 − p and note that nX̄n has the binomial distribution B(n, p), so that

Ep(nX̄n − np)² = Var(nX̄n) = npq.

Hence (2.19) equals

R(T_{α,β}, p) = ( n / (4(n + n^{1/2})²) ) (4pq + (p − q)²) = ( n / (4(n + n^{1/2})²) ) (p + q)²
             = 1 / (4n(1 + n^{−1/2})²) =: mn .

The above reasoning is valid for all p [0, 1]; it means that the risk R (T, , p) for this special
choice of , does not depend upon p. In addition we know that T, is admissible. This implies
that T, is a minimax estimator ; let us define that important notion.

Definition 2.5.1 In model M1 , for any estimator T set

M(T) = max_{0≤p≤1} R(T, p).   (2.20)

An estimator TM is called minimax if

M(TM) = min_T M(T) = min_T max_{0≤p≤1} R(T, p).

Note that the maximum in (2.20) exists since R(T, p) is continuous in p for every estimator. The minimax approach is similar to the Bayes approach, in that one characteristic of the risk function R(T, p), p ∈ [0, 1], is selected as a performance criterion for an estimator. In this case the characteristic is the worst case risk.
Theorem 2.5.2 In model M1 , the Bayes estimator T_{α,β} for α = β = n^{1/2}/2 is a minimax estimator for quadratic loss.
Proof. Let T be an arbitrary estimator of p. Since T_{α,β} is admissible according to Proposition 2.3.2, there must be a p0 ∈ [0, 1] such that

R(T, p0) ≥ R(T_{α,β}, p0) = mn .   (2.21)

If there were no such p0 , then we would have

R(T, p) < R(T_{α,β}, p)

for all p, which contradicts admissibility of T_{α,β}. Now (2.21) implies

M(T) ≥ mn = M(T_{α,β}).
The estimator T_{α,β} = TM has the form

TM = (X̄n + α/n) / (1 + α/n + β/n) = (X̄n + n^{−1/2}/2) / (1 + n^{−1/2}).   (2.22)
Let us compare the risk functions of the minimax estimator TM and of the sample mean. We have

R(X̄n , p) = Varp(X̄n) = p(1 − p)/n,
R(TM , p) = mn = 1 / (4n(1 + n^{−1/2})²).

For n = 30 the two risk functions are plotted below.

Risk of sample mean (line) and minimax risk (dots), n = 30
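The two risk curves shown in the plot can also be tabulated directly (a minimal sketch for n = 30):

```python
import numpy as np

n = 30
grid = np.linspace(0, 1, 11)

risk_mean = grid * (1 - grid) / n               # risk of the sample mean
risk_minimax = 1 / (4 * n * (1 + n**-0.5)**2)   # constant minimax risk m_n

for p, r in zip(grid, risk_mean):
    print(f"p={p:.1f}  sample mean: {r:.5f}  minimax: {risk_minimax:.5f}")
```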



It is clearly visible that the problem region for the sample mean is the area around p = 1/2; the minimax estimator is better in the center, at the expense of the tail regions, and thus achieves a smaller overall risk.
It is instructive to look at the form of the minimax estimator TM itself. The function

fM(x) = (x + n^{−1/2}/2) / (1 + n^{−1/2})

is plotted below, also for n = 30.

The minimax estimator as a function of the sample mean

This function has values fM(1/2) = 1/2 and fM(0) = (n^{−1/2}/2) / (1 + n^{−1/2}), and it is linear; by symmetry we obtain fM(1) = 1 − fM(0). This means that the values of the sample mean are moved towards the value 1/2, which is the one where the risk would be maximal. This can be understood as a kind of prudent behaviour of the minimax estimator, which tends to be closer to the value 1/2 since it is more damaging to make an error if this value were true. We can also write

X̄n = 1/2 + (X̄n − 1/2),
TM = 1/2 + (X̄n − 1/2) / (1 + n^{−1/2}),

where it is seen that the minimax estimator takes the distance of X̄n to 1/2 and shrinks it by a factor (1 + n^{−1/2})⁻¹ which is < 1.
Chapter 3
MAXIMUM LIKELIHOOD ESTIMATORS

In the general statistical model where X is an observed random variable with values in X , X is a countable set (i.e. X is a discrete r.v.) and {Pθ ; θ ∈ Θ} is a family of probability laws on X :

L(X) ∈ {Pθ ; θ ∈ Θ} ,

consider the probability function for given θ, i.e. Pθ(X = x). For each x ∈ X , the function

Lx(θ) = Pθ(X = x)

is called the likelihood function of θ given x. The name reflects the heuristic principle that when observations are realized, i.e. X took the value x, the most likely parameter values of θ are those where Pθ(X = x) is maximal. This does not mean that Lx(θ) gives a probability distribution on the parameter space; the likelihood principle has its own independent rationale, on a purely heuristic basis.
Under special conditions however the likelihood function can be interpreted as a probability function. Consider the case that Θ = {θ1 , . . . , θk} is a finite set, and consider a prior distribution on Θ which is uniform:

P(U = θi) = k⁻¹, i = 1, . . . , k.

Understanding Pθ(X = x) as a conditional distribution given θ, i.e. setting (as always in the Bayesian approach)

P(X = x | U = θ) = Pθ(X = x),

we immediately obtain a posterior distribution of θ, i.e. the conditional distribution of U given X = x:

P(U = θ | X = x) = P(U = θ, X = x) / P(X = x)   (3.1)
                 = Pθ(X = x) P(U = θ) / P(X = x) = Pθ(X = x) / (k P(X = x))
                 = Lx(θ) · (k P(X = x))⁻¹.   (3.2)

Here the factor (k P(X = x))⁻¹ does not depend on θ, so that in this case the posterior probability function of θ is proportional to the likelihood function Lx(θ). Recall that the marginal probability function of X is

P(X = x) = ∑_{i=1}^k Pθi(X = x) P(U = θi).

Thus in this special case, the likelihood principle can be derived from the Bayesian approach, for a "noninformative" prior distribution (i.e. the uniform distribution on Θ). However in cases where there is no natural uniform distribution on Θ, such as for Θ = R or Θ = Z+ (the nonnegative integers), such a reasoning is not straightforward. (A limiting argument for a sequence of prior distributions is often possible.)
A maximum likelihood estimator (MLE) of θ is an estimator T(x) = TML(x) such that

Lx(TML(x)) = max_{θ∈Θ} Lx(θ),

i.e. for each given x, the estimator is a value of θ which maximizes the likelihood.
In the case of model M1 , the probability function for given p = θ is

Pp(X = x) = p^{z(x)} (1 − p)^{n−z(x)} = Lx(p)

for x ∈ X . In this case the parameter space is Θ = [0, 1], on which there is a natural uniform distribution (the beta density for α = β = 1). It can be shown that also in this case, something analogous to (3.2) holds, i.e. the likelihood function is proportional to the density of the posterior distribution. Without proving this statement, let us directly compute the maximum likelihood estimate.
Assume first that x is such that z(x) = ∑_{i=1}^n xi ∈ (0, n). Then the likelihood function has Lx(0) = Lx(1) = 0 and is positive and continuously differentiable on the open interval p ∈ (0, 1). Thus also the logarithmic likelihood function

lx(p) = log Lx(p)

is continuously differentiable, and since log is a monotone function, local extrema of the likelihood function in (0, 1) coincide with local extrema of the log-likelihood in p ∈ (0, 1). We have

lx(p) = z(x) log p + (n − z(x)) log(1 − p),
lx'(p) = z(x) p⁻¹ − (n − z(x)) (1 − p)⁻¹.
We look for zeros of this function in p ∈ (0, 1); the local extrema are among these. We obtain

z(x) p⁻¹ = (n − z(x)) (1 − p)⁻¹,
z(x)(1 − p) = (n − z(x)) p,
np = z(x).

Thus if z(x) = ∑_{i=1}^n xi ∈ (0, n) then p = n⁻¹ z(x) = x̄n gives the unique local extremum of lx(p) on (0, 1), which must be a maximum of Lx(p).
If either z(x) = 0 or z(x) = n then

Lx(p) = (1 − p)^n or Lx(p) = p^n,

such that p = 0 or p = 1, respectively, are maximum likelihood estimates. We have established

Proposition 3.0.3 In model M1 , the sample mean X̄n is the unique maximum likelihood estimator (MLE) of the parameter p ∈ [0, 1]:

TML(X) = X̄n .
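As a numerical sanity check, maximizing the Bernoulli log-likelihood lx(p) directly recovers the sample mean (a minimal sketch; the 0/1 data vector is illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 0, 1, 1, 1, 0, 1, 1, 0])   # illustrative 0/1 data
z, n = x.sum(), len(x)

# negative log-likelihood -l_x(p) = -(z log p + (n - z) log(1 - p))
neg_loglik = lambda p: -(z * np.log(p) + (n - z) * np.log(1 - p))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())   # both approximately 0.6
```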

We now turn to the continuous likelihood principle. Let X be a random variable with values in R^k such that L(X) ∈ {Pθ ; θ ∈ Θ}, and each law Pθ has a density pθ(x) on R^k. For each x ∈ R^k, the function

Lx(θ) = pθ(x)

is called the likelihood function of θ given x. A maximum likelihood estimator (MLE) of θ is an estimator T(x) = TML(x) such that

Lx(TML(x)) = max_{θ∈Θ} Lx(θ),

i.e. for each given x, the estimator is a value of θ which maximizes the likelihood.
Let us consider an example in which all densities are Gaussian.

Model M2 Observed are n independent and identically distributed random variables X1 , . . . , Xn , each having law N(μ, σ²), where σ² > 0 is given (known) and μ ∈ R is unknown.

This is also called the Gaussian location model (or Gaussian shift model). Consider the case of sample size n = 1. We can represent Xi as

Xi = μ + ξi

where the ξi are i.i.d. centered normal: L(ξi) = N(0, σ²). The parameter is μ and the parameter space is R.

Proposition 3.0.4 In the Gaussian location model M2 , the sample mean X̄n is the unique maximum likelihood estimator of the expectation parameter μ ∈ R:

TML(X) = X̄n .

Proof. We have

L_x(μ) = ∏_{i=1}^n (2πσ²)^{-1/2} exp( −(x_i − μ)²/(2σ²) )
       = (2πσ²)^{-n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² ),

thus the logarithmic likelihood function is

l_x(μ) = log L_x(μ) = −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² − (n/2) log(2πσ²).

Note that

(d/dμ) l_x(μ) = (1/(2σ²)) Σ_{i=1}^n 2(x_i − μ) = (1/σ²) ( Σ_{i=1}^n x_i − nμ ) = 0

if and only if

μ = Σ_{i=1}^n x_i / n = x̄_n.

Thus l_x(μ), which is continuously differentiable on R, has a unique local extremum at μ = x̄_n. Since
l_x(μ) → −∞ for μ → ±∞, this must be a maximum. □
We can now ask what happens if the variance σ² is also unknown. In that connection we introduce
the Gaussian location-scale model.

Model M3 Observed are n independent and identically distributed random variables X1, …, Xn, each having
law N(μ, σ²), where μ ∈ R and σ² > 0 are both unknown.
In addition to the sample mean X̄_n = n^{-1} Σ_{i=1}^n X_i, consider the statistic

S_n² = (n − 1)^{-1} Σ_{i=1}^n (X_i − X̄_n)².

This statistic is called the sample variance. The empirical second (central) moment (e.s.m.) is

S̃_n² = ((n − 1)/n) S_n² = n^{-1} Σ_{i=1}^n (X_i − X̄_n)².

At first glance it would seem more natural to call the expression S̃_n² the sample variance, since it
is the sample analog of the variance of the random variable X1:

σ² = Var(X1) = E(X1 − EX1)².

However it is customary in statistics to call S_n² the sample variance; the reason is unbiasedness
for σ², which will be discussed later. Note that we need a sample size n ≥ 2 for S_n² and S̃_n² to be
nonzero.

Proposition 3.0.5 In the Gaussian location-scale model M3, for a sample size n ≥ 2, if S̃_n² >
0 then sample mean and e.s.m. (X̄_n, S̃_n²) are the unique maximum likelihood estimators of the
parameter θ = (μ, σ²) ∈ Θ = R × (0, ∞):

T_ML(X) = (X̄_n, S̃_n²).

The event S̃_n² > 0 has probability one for all θ ∈ Θ.

Proof. We write x̄_n, s_n² for sample mean and e.s.m. when x1, …, xn are realized data. Note that

s_n² = n^{-1} Σ_{i=1}^n x_i² − (x̄_n)².

Hence

Σ_{i=1}^n (x_i − μ)² = Σ_{i=1}^n x_i² − 2nμx̄_n + nμ²
                     = Σ_{i=1}^n x_i² − n x̄_n² + n(x̄_n − μ)² = n s_n² + n(x̄_n − μ)².

Thus the likelihood function L_x(θ) is

L_x(θ) = (2πσ²)^{-n/2} exp( −(1/(2σ²)) Σ_{i=1}^n (x_i − μ)² )
       = (2πσ²)^{-n/2} exp( −(s_n² + (x̄_n − μ)²) / (2σ²n^{-1}) )                                        (3.3)
       = (σn^{-1/2})^{-1} φ( (x̄_n − μ)/(σn^{-1/2}) ) · n^{-1/2} (2πσ²)^{-(n−1)/2} exp( −s_n²/(2σ²n^{-1}) ),   (3.4)

where

φ(x) = (2π)^{-1/2} exp(−x²/2)

is the standard normal density. We see that the first factor is a normal density in x̄_n and the second
factor does not depend on μ. To find MLEs of μ and σ² we first maximize, for fixed σ², over all
possible μ ∈ R. The first factor is the likelihood function in model M2 for a sample size n = 1,
variance n^{-1}σ² and an observed value x1 = x̄_n. This gives the MLE μ̂ = x̄_n. We can insert this
value into (3.3); we now have to maximize

L_x(x̄_n, σ²) = (2πσ²)^{-n/2} exp( −s_n²/(2σ²n^{-1}) )

over σ² > 0. For notational convenience we set θ = σ²; equivalently, one may minimize the negative
log-likelihood, which up to an additive constant (not depending on θ) equals

l_x(θ) = (n/2) log θ + n s_n²/(2θ).

Note that if s_n² > 0, for θ → 0 we have l_x(θ) → ∞ and for θ → ∞ also l_x(θ) → ∞, so that a
minimum exists and is a zero of the derivative of l_x. The event S̃_n² > 0 has probability 1, since
otherwise x_i = x̄_n, i = 1, …, n, i.e. all x_i are equal, which clearly has probability 0 for independent
continuous random variables X_i. We obtain

l_x'(θ) = n/(2θ) − n s_n²/(2θ²) = 0,
θ = s_n²

as the unique zero, so σ̂² = s_n² is the MLE of σ². □
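As a quick numerical sanity check (a sketch only; the use of scipy.optimize.minimize, the log-parametrization of σ and the variable names are our own choices), a generic optimizer applied to the negative log-likelihood recovers (x̄_n, s̃_n²):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)   # simulated N(mu, sigma^2) data

def neg_log_lik(params):
    mu, log_sigma = params                     # parametrize sigma by its log to keep it positive
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma2_hat = res.x[0], np.exp(2 * res.x[1])
print(mu_hat, x.mean())                        # MLE of mu = sample mean
print(sigma2_hat, x.var(ddof=0))               # MLE of sigma^2 = e.s.m. (1/n normalization)
```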


For our next example, let us introduce the double exponential distribution DE(μ, σ). It has density

f_{μ,σ}(x) = (2σ)^{-1} exp( −|x − μ| σ^{-1} )

for given μ ∈ R, σ > 0. We shall assume σ = 1 here; clearly the above is a density for any μ since

∫_0^∞ exp(−x) dx = 1.

Clearly DE(μ, σ) has finite moments of any order, and by symmetry μ is the expectation.
We introduce the double exponential location model.

Model M4 Observed are n independent and identically distributed random variables X1, …, Xn, each having
law DE(μ, 1), where μ ∈ R is unknown.

We will show that the MLE in this case is the sample median. For any vector (x1, …, xn), let
x_(1) ≤ x_(2) ≤ … ≤ x_(n) be the vector of ordered values; this is always uniquely defined. For
n i.i.d. random variables X = (X1, …, Xn), define the order statistics to be the components of
the vector (X_(1), …, X_(n)). Recall that a statistic is any function of the data; thus for given i,
the i-th order statistic is a well defined data function. In particular X_(n) = max_{i=1,…,n} X_i and
X_(1) = min_{i=1,…,n} X_i. Define the sample median as

med(X) = X_((n+1)/2)                       if n is odd,
         (1/2)(X_(n/2) + X_(n/2+1))        if n is even.

In other words, the sample median is the central order statistic if n is odd, and the midpoint of
the two central order statistics if n is even.

Proposition 3.0.6 In the double exponential location model M4, the sample median med(X) is a
maximum likelihood estimator of the location parameter μ ∈ Θ = R: T_ML(X) = med(X).

Note that uniqueness is not claimed here.


Proof. The likelihood function is

L_x(μ) = ∏_{i=1}^n 2^{-1} exp( −|x_i − μ| ) = 2^{-n} exp( −Σ_{i=1}^n |x_i − μ| )

and maximizing it is equivalent to minimizing −log L_x(μ), or minimizing

l_x(μ) = Σ_{i=1}^n |x_i − μ|.

For given x this is a piecewise linear continuous function in μ; for μ > x_(n) it is a linear function
in μ tending to ∞ for μ → ∞, and for μ < x_(1) it is also a linear function in μ tending to ∞ for
μ → −∞. Thus the minimum must be attained in the range of the sample, i.e. in the interval
[x_(1), x_(n)]. Assume that x_(1) < x_(2) < … < x_(n), i.e. no two values of the sample are equal (there are
no ties). That event has probability one since the X_i are continuous random variables. Inside an
interval (x_(i), x_(i+1)), 1 ≤ i ≤ n − 1, the derivative of l_x(μ) is

l_x'(μ) = Σ_{j=1}^i 1 + Σ_{j=i+1}^n (−1) = i − (n − i) = 2i − n.

Consider first the case of even n. Then l_x'(μ) is negative for i < n/2, positive for i > n/2 and 0
for i = n/2. Thus the minimum in μ is attained by any value μ ∈ [x_(n/2), x_(n/2+1)], in particular
by the center of that interval, which is the sample median. If n is odd then l_x'(μ) is negative for
i < n/2 and positive for i > n/2. Since the function l_x(μ) is continuous, the minimum is attained
at the left endpoint of the first interval where l_x'(μ) is positive, which is X_((n+1)/2). □
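A short numerical illustration (a sketch; sample size and grid resolution are arbitrary choices of ours) that the sample median minimizes μ ↦ Σ|x_i − μ| while the sample mean minimizes the sum of squares:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.laplace(loc=1.0, scale=1.0, size=101)               # double exponential DE(mu, 1) sample

grid = np.linspace(x.min(), x.max(), 20001)
abs_loss = np.abs(x[:, None] - grid[None, :]).sum(axis=0)   # sum_i |x_i - mu| on a grid of mu values
sq_loss = ((x[:, None] - grid[None, :]) ** 2).sum(axis=0)   # sum_i (x_i - mu)^2

print(grid[np.argmin(abs_loss)], np.median(x))              # least absolute deviations ~ sample median
print(grid[np.argmin(sq_loss)], x.mean())                   # least squares ~ sample mean
```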

Since the sample median minimizes Σ_{i=1}^n |X_i − μ|, it may be called a least absolute deviation
estimator. In contrast, the sample mean minimizes the sum of squares Σ_{i=1}^n (X_i − μ)², i.e. it is a least
squares estimator. Note that the sample median remains the same when X_(1) → −∞, whereas X̄_n
tends to −∞ in that case. For that reason, the sample median is often applied independently
of a maximum likelihood justification, which holds when the data are double exponential.
In analogy to the sample median, the population median is defined as a "half point" of the
distribution of a random variable X representing the population. A value m is a median of a r.v.
X if simultaneously P(X ≥ m) ≥ 1/2 and P(X ≤ m) ≥ 1/2 hold. For continuous distributions, we
always have P(X = m) = 0, and therefore m is a solution of P(X > m) = 1 − P(X < m) = 1/2,
which may not be unique.
Chapter 4
UNBIASED ESTIMATORS

Consider again the general parametric estimation problem: let X be a random variable with values
in X and L(X) be the distribution (or the law) of X. Assume that L(X) is known up to a parameter θ from
a parameter space Θ ⊆ R^k:

L(X) ∈ {P_θ; θ ∈ Θ}.

The problem is to estimate a real valued function g(θ) based on a realization of X. Up to now we
primarily considered the case where Θ is an interval in R and g(θ) = θ (i.e. we are interested in
estimation of θ itself), but the Gaussian location-scale model M3 was an instance of a model with
a two-dimensional parameter θ = (μ, σ²). Here we might set g(θ) = μ.
Consider an estimator T of g(θ) such that E_θ T exists. In our i.i.d. Bernoulli model M1, that is
true for every estimator.

Definition 4.0.7 (i) The quantity

E_θ T − g(θ)

is called the bias of the estimator T.
(ii) If

E_θ T = g(θ) for all θ ∈ Θ

then the estimator T is called unbiased for g(θ).

In model M1 we have

E_p X̄_n = p for all p ∈ [0, 1],

i.e. the sample mean is an unbiased estimator for the parameter p. Similarly, in the Gaussian
location and location-scale models, the sample mean is unbiased for μ = E X1.
Unbiasedness is sometimes considered as a value in itself, i.e. a desirable property for an estimator,
independently of risk optimality. Recall that the risk function (for quadratic loss) for estimation
of a real valued θ was defined as

R(T, θ) = E_θ (T(X) − θ)².

Suppose T is an estimator with R(T, θ) < ∞. Then

R(T, θ) = E_θ (T(X) − E_θ T + (E_θ T − θ))²                       (4.1)
        = E_θ (T(X) − E_θ T)² + (E_θ T − θ)²
        = Var_θ T(X) + (E_θ T − θ)².                               (4.2)

The last line is called the bias-variance decomposition of the quadratic risk of T; it holds for
any T with finite risk at θ (or equivalently with Var_θ T(X) < ∞); T need not be unbiased. If T
is unbiased then the second term in (4.2), i.e. the squared bias, vanishes; thus for unbiased T the
quadratic risk is the variance.
We saw that in model M1 the estimator

T_M = (X̄_n + n^{-1/2}/2) / (1 + n^{-1/2})

is minimax with respect to quadratic risk R(T, θ), and it is clearly biased. The unbiased estimator
X̄_n performs less well in the minimax sense. It is thus a matter of choice how to judge the
performance: one may strictly adhere to the quadratic risk as a criterion, leaving the bias problem
aside, or impose unbiasedness as an a priori requirement for good estimators.
If the latter point of view is taken, then within the restricted class of unbiased estimators it is often
possible to find uniformly best elements (optimal for all values of θ), as we shall see now.

4.1 The Cramer-Rao information bound


An information bound, relative to a statistical decision problem, is a bound for the best possible
performance of a decision procedure, either within all possible procedures or within a certain
restricted class of methods. We shall discuss a bound which relates to the class of unbiased
estimators.
Let us formalize a set of assumptions which was already made several times.

Model Mf The sample space X for the observed random variable X is finite, and L(X) ∈
{P_θ, θ ∈ Θ}, where Θ is an open (possibly infinite) interval in R.

The case Θ = R is included. Note that model M1 is a special case, if the open interval Θ = (0, 1)
is taken as parameter space. With this model we associate a family of probability functions p_θ(x),
x ∈ X, θ ∈ Θ. Let T be an unbiased estimator of the parameter:

E_θ T(X) = θ, θ ∈ Θ.                                               (4.3)

Since X is finite, the expectation always exists for all θ ∈ Θ. Let us add a differentiability
assumption on the dependence of p_θ(x) on θ.

Assumption D1 For every x ∈ X, the probability function p_θ(x) is differentiable in θ, at every
point θ ∈ Θ.

Now (4.3) can be written

Σ_{x∈X} T(x) p_θ(x) = θ

and both sides are differentiable in θ. Taking the derivative we obtain

Σ_{x∈X} T(x) p_θ'(x) = 1,                                          (4.4)

where p_θ' is the derivative with respect to θ. The first derivative of a probability function has an interesting
property. We can also differentiate the equality

Σ_{x∈X} p_θ(x) = 1,

valid for any probability function p_θ(x). Differentiation gives

Σ_{x∈X} p_θ'(x) = 0.                                               (4.5)

This means that in (4.4) we can replace T(x) by T(x) + c where c is any constant; indeed (4.5)
implies that the right side of (4.4) is still 1. Choosing c = −θ, we obtain

Σ_{x∈X} (T(x) − θ) p_θ'(x) = 1.                                    (4.6)

Define now the score function

l_θ(x) = p_θ'(x)/p_θ(x)   if p_θ(x) > 0,
         0                if p_θ(x) = 0.

Note that if p_θ(x) = 0 at θ = θ₀ then p_{θ₀}'(x) must also be 0: since p_θ(x) is nonnegative and
differentiable, it has a local minimum at θ₀ and hence p_{θ₀}'(x) = 0. This fact and (4.6) imply

1 = Σ_{x∈X} (T(x) − θ) l_θ(x) p_θ(x) = E_θ (T(X) − θ) l_θ(X).       (4.7)

Let us apply now the Cauchy-Schwarz inequality: for any two random variables Z1, Z2 on a
common probability space

|E Z1 Z2|² ≤ E Z1² · E Z2²,

provided the left side exists, i.e. E Z_i² < ∞, i = 1, 2. Squaring both sides of (4.7), we obtain

1 ≤ E_θ (T(X) − θ)² · E_θ l_θ²(X).

The quantity

I_F(θ) = E_θ l_θ²(X)                                               (4.8)

is called the Fisher information of the parametric family p_θ, θ ∈ Θ of probability functions, at
point θ.

Theorem 4.1.1 (Cramer-Rao bound) In model Mf, under smoothness assumption D1, assume
that I_F(θ) > 0, θ ∈ Θ. Then for every unbiased estimator T of θ

Var_θ T(X) ≥ (I_F(θ))^{-1}, θ ∈ Θ,                                 (4.9)

where I_F(θ) is the Fisher information (4.8).

Note that the score function l_θ(x) can also be written

l_θ(x) = (d/dθ) log p_θ(x)   if p_θ(x) > 0,
         0                   if p_θ(x) = 0,

so that the Fisher information is

I_F(θ) = E_θ l_θ²(X) = E_θ ( (d/dθ) log p_θ(X) )²                  (4.10)
       = Σ_{x: p_θ(x)>0} (p_θ'(x))² / p_θ(x).

The form (4.10) involving the logarithm of p_θ is convenient in many cases for computation of the
Fisher information.
Note also that for the score function we have, as a consequence of (4.5),

E_θ l_θ(X) = 0.                                                    (4.11)

The Cramer-Rao bound (4.9) gives a benchmark against which to measure the performance of any
unbiased estimator. An unbiased estimator attaining the Cramer-Rao bound at θ is called a best
unbiased estimator (or uniformly best if that is true for all θ ∈ Θ; uniformly is usually omitted).
Another terminology is uniformly minimum variance unbiased estimator (UMVUE).

Example 4.1.2 Consider the case of Model M1 for sample size n = 1, i.e. we observe one Bernoulli
r.v. with law B(1, p), p ∈ (0, 1). The probability function is, for x ∈ {0, 1},

q_p(x) = (1 − p)^{1−x} p^x.

For the score function we obtain

l_p(x) = (d/dp) log( (1 − p)^{1−x} p^x ) = (1 − x)(d/dp) log(1 − p) + x (d/dp) log p
       = −(1 − x)/(1 − p) + x/p
       = (x(1 − p) − (1 − x)p) / (p(1 − p)) = (x − p) / (p(1 − p)).

We check again (4.11) in this case from E_p(X − p) = 0. The Fisher information is thus

I_F(p) = E_p l_p²(X) = (p(1 − p))^{-2} Var_p X = (p(1 − p))^{-1}.

Thus p(1 − p) is the Cramer-Rao bound. It follows that X is a best unbiased estimator for all
p ∈ (0, 1), i.e. uniformly minimum variance unbiased.
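A minimal numerical sketch of this example (assuming NumPy; the chosen p is arbitrary): compute the Fisher information directly from the definition E_p l_p²(X) and check that Var_p X attains the bound.

```python
import numpy as np

p = 0.3
x = np.array([0, 1])
q = np.array([1 - p, p])                      # probability function of B(1, p)

score = (x - p) / (p * (1 - p))               # l_p(x) = (x - p) / (p(1 - p))
fisher = np.sum(score**2 * q)                 # I_F(p) = E_p l_p^2(X)

print(fisher, 1 / (p * (1 - p)))              # both equal 1 / (p(1-p))
print(np.sum((x - p)**2 * q), p * (1 - p))    # Var_p X attains the bound (I_F(p))^{-1}
```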

The next theorem specializes the situation to the case of independent and identically distributed
data.

Theorem 4.1.3 Suppose that X = (X1, …, Xn) are independent and identically distributed, where
the statistical model for X1 is a model of type Mf, with probability function q_θ, satisfying assumption
D1. Assume that the Fisher information for q_θ satisfies I_F(θ) > 0, θ ∈ Θ. Then for every unbiased
estimator T of θ

Var_θ T(X) ≥ n^{-1} (I_F(θ))^{-1}, θ ∈ Θ,                          (4.12)

where I_F(θ) is the Fisher information for q_θ.

Proof. The model for the vector X = (X1, …, Xn) is also of type Mf, and the probability function
is p_θ(x) = ∏_{i=1}^n q_θ(x_i). It suffices to find the Fisher information of p_θ, call it I_{F,n}(θ). We have

I_{F,n}(θ) = E_θ ( (d/dθ) log p_θ(X) )² = E_θ ( Σ_{i=1}^n (d/dθ) log q_θ(X_i) )².

Expanding the square, we note for the mixed terms (i ≠ j), when l_θ is the score function for q_θ,

E_θ l_θ(X_i) l_θ(X_j) = E_θ l_θ(X_i) · E_θ l_θ(X_j) = 0

in view of the independence of X_i and X_j and (4.11). Hence

I_{F,n}(θ) = Σ_{i=1}^n E_θ l_θ²(X_i) = n · I_F(θ).                 (4.13)

Now Theorem 4.1.1 is valid for the whole model for X = (X1, …, Xn) and implies (4.12). □

Interpreting the Fisher information. To draw an analogy to concepts from physics or mechanics,
let t be a time parameter and suppose x(t) = (x_i(t))_{i=1,2,3} describes a movement in
three dimensional space R³ (a path in space, as a function of time). The velocity v(t₀) at time t₀
is the length of the gradient, i.e.

v(t₀) = |x'(t₀)| = ( Σ_{i=1}^3 (x_i'(t₀))² )^{1/2}.                (4.14)

If the sample space X has k elements and all p_θ(x_i) > 0,

I_F(θ) = Σ_{i=1}^k ( (d/dθ) p_θ(x_i) )² / p_θ(x_i)                 (4.15)
       = 4 Σ_{i=1}^k ( (d/dθ) p_θ^{1/2}(x_i) )².                   (4.16)

Thus I_F(θ) is similar to a squared velocity if we identify θ with time, i.e. we consider the curve
described in R^k by the vector (p_θ^{1/2}(x_1), …, p_θ^{1/2}(x_k)) when θ varies in an interval Θ. Note that
the vector q_θ := (p_θ^{1/2}(x_1), …, p_θ^{1/2}(x_k)) has length one (‖q_θ‖² = 1) since the p_θ(x_i) sum up to one.
Thus the family (q_θ, θ ∈ Θ) describes a parametric curve on the surface of the unit sphere in R^k.
The Fisher information can thus be interpreted as a sensitivity of the location q_θ ∈ R^k with
respect to small changes in the parameter, at position θ.
Note that in Theorem 4.1.1 we assumed that I_F(θ) > 0. Consider the case where p_θ(x) does
not depend on θ; formally then p_θ(·) is differentiable, and l_θ² = 0, I_F(θ) = 0. Then Theorem
4.1.1 has to be interpreted as meaning that unbiased estimators T do not exist (indeed there
is insufficient information in the family to enable E_θ T = θ; the left side does not depend on θ).
The relation I_F(θ) = 0 may also occur in other families or at particular points θ; then unbiased
estimators do not exist.

4.2 Countably infinite sample space


Let us now consider the case where the sample space X is still discrete but only countable (not
necessarily finite), e.g. the case where it consists of the nonnegative integers X = Z₊.

Model Md The sample space X for the observed random variable X is countable, and L(X) ∈
{P_θ, θ ∈ Θ}, where Θ is an open (possibly infinite) interval in R.

The proof of the Cramer-Rao bound is very similar here, only we need to be sure that differentiation
is possible under the infinite sum sign. If X = {x1, x2, …} then we need to differentiate both sides
of the unbiasedness equation

Σ_{i=1}^∞ T(x_i) p_θ(x_i) = θ.                                     (4.17)

If θ and θ' = θ + δ are two close values then we take the difference quotient on both sides,

Σ_{i=1}^∞ T(x_i) (p_{θ+δ}(x_i) − p_θ(x_i)) δ^{-1} = 1,             (4.18)

and compute the derivative by letting δ → 0. Both sides are differentiable in θ. Here for every x_i

(p_{θ+δ}(x_i) − p_θ(x_i)) δ^{-1} → p_θ'(x_i),

hence if p_θ(x_i) > 0 then

(p_{θ+δ}(x_i) − p_θ(x_i)) / (δ p_θ(x_i)) → p_θ'(x_i)/p_θ(x_i) = l_θ(x_i) as δ → 0.

Condition D2 (i) The probability function p_θ(x) is positive and differentiable in θ, at every x ∈ X
and every θ ∈ Θ.
(ii) For every θ ∈ Θ, there exist an ε = ε_θ > 0 and a function b_θ(x) satisfying E_θ b_θ²(X) < ∞
and

|p_θ(x) − p_{θ+δ}(x)| / (|δ| p_θ(x)) ≤ b_θ(x)   for all |δ| ≤ ε and all x ∈ X.

Using that condition, we rewrite the sum in (4.18) as

Σ_{i=1}^∞ T(x_i) (p_{θ+δ}(x_i) − p_θ(x_i)) δ^{-1} = Σ_{i=1}^∞ T(x_i) [ (p_{θ+δ}(x_i) − p_θ(x_i)) / (δ p_θ(x_i)) ] p_θ(x_i)
                                                  = Σ_{i=1}^∞ r_δ(x_i) p_θ(x_i).

The sequence of functions, as δ → 0,

r_δ(x) = T(x) (p_{θ+δ}(x) − p_θ(x)) / (δ p_θ(x))

converges to a limit function r_0(x) = T(x) l_θ(x) pointwise (i.e. for every x ∈ X). We would like to
show

Σ_{i=1}^∞ r_δ(x_i) p_θ(x_i) → Σ_{i=1}^∞ r_0(x_i) p_θ(x_i) = E_θ T(X) l_θ(X)      (4.19)

since in that case we could infer that E_θ T(X) l_θ(X) = 1 (differentiate both sides of (4.17) under
the sum sign). By a result from real analysis, the Lebesgue dominated convergence theorem
(see Appendix), it suffices to show that there is a function r(x) ≥ 0 and a δ₀ > 0 such that

|r_δ(x)| ≤ r(x) for all |δ| ≤ δ₀, and E_θ r(X) < ∞.



To establish that, we use

|r_δ(x)| = |T(x)| · |p_{θ+δ}(x) − p_θ(x)| / (|δ| p_θ(x)) ≤ |T(x)| b_θ(x) =: r(x)

and the Cauchy-Schwarz inequality

E_θ r(X) ≤ (E_θ T²(X))^{1/2} (E_θ b_θ²(X))^{1/2} < ∞

according to condition D2 and the finiteness of E_θ T²(X), which is natural to assume for the
estimator T(X). Thus we have established

E_θ T(X) l_θ(X) = 1.

The same technique allows us to differentiate the relation

Σ_{i=1}^∞ p_θ(x_i) = 1

under the series sign, i.e. to obtain

Σ_{i=1}^∞ p_θ'(x_i) = E_θ l_θ(X) = 0.

Theorem 4.2.1 (Cramer-Rao bound, discrete case) In model Md, assume that the Fisher
information exists and is positive:

0 < I_F(θ) = E_θ ( p_θ'(X)/p_θ(X) )² < ∞

for all θ ∈ Θ, and also that condition D2 holds. Then for every unbiased estimator T of θ with
finite variance

Var_θ T(X) ≥ (I_F(θ))^{-1}, θ ∈ Θ.                                 (4.20)

Remark 4.2.2 Clearly an analog of Theorem 4.1.3 holds; we do not state it explicitly. It is evident
that (4.13) still holds. One would have to verify that it suffices to impose Condition D2 only on
the law of X1, but we omit this argument.

Example. Let X have Poisson law Po(λ), where λ > 0 is unknown. Let us check condition D2
(here Θ = (0, ∞)). We have for x_k = k, k = 0, 1, …

p_λ(x_k) = (λ^k / k!) exp(−λ).

This is continuously differentiable in λ, for every k ≥ 0, and

p_λ'(x_k) = (k λ^{k−1}/k! − λ^k/k!) exp(−λ) = (k λ^{-1} − 1) (λ^k/k!) exp(−λ).

By the mean value theorem, for every δ and every k ≥ 0 there exists λ*(k) (lying in the interval
between λ and λ + δ) such that

p_{λ+δ}(k) − p_λ(k) = p_{λ*(k)}'(k) δ,

so that for |δ| ≤ ε, ε > 0 sufficiently small,

| (p_λ(k) − p_{λ+δ}(k)) / (δ p_λ(k)) | = | p_{λ*(k)}'(k) / p_λ(k) |
   = | k/λ*(k) − 1 | (λ*(k)/λ)^k exp(λ − λ*(k))
   ≤ ( k/(λ − ε) + 1 ) (1 + ε/λ)^k exp(ε) =: b_λ(k).

Let us denote by C0, C1, C2, etc. constants which do not depend on k (but may depend on λ and
ε). We have

k/(λ − ε) + 1 ≤ C0 (k + 1) ≤ C1 exp(k),    (1 + ε/λ)^{2k} ≤ exp(C2 k),

and hence

b_λ²(k) ≤ C3 exp(C4 k).

Thus for E_λ b_λ²(X) < ∞ it suffices to show that the Poisson law has finite exponential moments of
all orders, i.e. for all c > 0 and all λ > 0

E_λ exp(cX) < ∞.

To see this, note

E_λ exp(cX) = Σ_{k=0}^∞ exp(ck) (λ^k/k!) exp(−λ) = exp(−λ) Σ_{k=0}^∞ (λ e^c)^k / k!
            = exp(λ e^c − λ).
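As a complement (a sketch; the truncation point of the infinite sum is an arbitrary choice of ours), one can compute the Fisher information of the Poisson family numerically from the score (d/dλ) log p_λ(k) = k/λ − 1 and compare with the known value I_F(λ) = 1/λ, which is attained by X for n = 1 since Var(X) = λ = (I_F(λ))^{-1}:

```python
import numpy as np
from scipy.stats import poisson

lam = 2.5
k = np.arange(0, 200)                      # truncate the infinite sum at K = 200
pmf = poisson.pmf(k, lam)

score = k / lam - 1                        # (d/dlam) log p_lam(k)
fisher = np.sum(score**2 * pmf)            # I_F(lam) = E_lam l_lam^2(X)

print(fisher, 1 / lam)                     # Fisher information equals 1/lambda
print(np.sum((k - lam)**2 * pmf), lam)     # Var(X) = lambda attains (I_F(lam))^{-1}
```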


4.3 The continuous case


Model Mc The observed random variable X = (X1, …, Xk) is continuous with values in R^k and
L(X) ∈ {P_θ, θ ∈ Θ}. Each law P_θ is described by a joint density p_θ(x) = p_θ(x1, …, xk), and Θ
is an open subset of R^d.

The essential work for the Cramer-Rao bound was already done in the previous subsection; indeed
this time we need differentiation under the integral sign, which is analogous to the infinite series
case. The reasoning is very similar. An estimator T is called unbiased if E_θ |T| < ∞ and E_θ T = θ
for all θ ∈ Θ. We again start with the unbiasedness relation

∫ T(x) p_θ(x) dx = θ

and we need to differentiate the left side under the integral sign. Let us assume that our parameter
is one dimensional, as in the previous subsections; multivariate versions of the Cramer-Rao bound
can be derived but need not interest us here.

Condition D3 (i) The parameter space Θ is an open (possibly infinite) interval in R. There is a
subset S ⊆ R^k such that for all θ ∈ Θ, we have p_θ(x) > 0 for x ∈ S, p_θ(x) = 0 for x ∉ S.
(ii) The density p_θ(x) is positive and differentiable in θ, at every x ∈ S and every θ ∈ Θ.
(iii) For every θ ∈ Θ, there exist an ε = ε_θ > 0 and a function b_θ(x) satisfying E_θ b_θ²(X) < ∞
and

|p_θ(x) − p_{θ+δ}(x)| / (|δ| p_θ(x)) ≤ b_θ(x)   for all |δ| ≤ ε and all x ∈ S.

We can define a score function l_θ(x),

l_θ(x) = (d/dθ) log p_θ(x)   if p_θ(x) > 0,
         0                   if p_θ(x) = 0,

and the Fisher information

I_F(θ) = E_θ l_θ²(X) = ∫_S (p_θ'(x))² / p_θ(x) dx.

Theorem 4.3.1 (Cramer-Rao bound, continuous case) In Model Mc, assume that the
smoothness condition D3 holds, and that the Fisher information exists and is positive: 0 < I_F(θ) <
∞ for all θ ∈ Θ. Then for every unbiased estimator T of θ with finite variance

Var_θ T(X) ≥ (I_F(θ))^{-1}, θ ∈ Θ.                                 (4.21)

The proof is exactly as in the preceding countably infinite (discrete) case, but the infinite series
Σ_{i=1}^∞ T(x_i) p_θ(x_i) has to be substituted by an integral ∫ T(x) p_θ(x) dx. Remark 4.2.2 on the i.i.d.
case can be made here analogously.

Example 4.3.2 Consider the Gaussian location model M2 for sample size n = 1, i.e. we observe
a Gaussian r.v. with law N(μ, σ²) where σ² > 0 is known, Θ = R. Let us verify condition D3. For
ease of notation assume σ² = 1. If φ is the standard Gaussian density then p_μ(x) = φ(x − μ), and
by a transformation y = x − μ it suffices to verify the condition at the parameter point μ = 0. Now the
density

p_μ(x) = φ(x − μ)

is continuously differentiable in μ with derivative

p_μ'(x) = −φ'(x − μ).

Since

φ(x) = (2π)^{-1/2} exp(−x²/2)

we have

φ'(x) = −x φ(x).

Now for |δ| ≤ ε

|φ(x − δ) − φ(x)| = | ∫_0^δ φ'(x − u) du | ≤ |δ| sup_{|u|≤ε} |φ'(x − u)|.

Also note that

sup_{|u|≤ε} |φ'(x − u)| ≤ (|x| + ε) sup_{|u|≤ε} φ(x − u),

φ(x − u) = (2π)^{-1/2} exp(−(x − u)²/2) = φ(x) exp(xu − u²/2),

so that for |δ| ≤ ε, ε > 0 sufficiently small and μ = 0,

|p_0(x) − p_δ(x)| / (|δ| p_0(x)) ≤ sup_{|u|≤ε} |φ'(x − u)| / φ(x)
   ≤ (|x| + ε) sup_{|u|≤ε} exp(xu − u²/2)
   ≤ (|x| + ε) exp(ε|x|).

Since |x| ≤ C1 exp(ε|x|) for an appropriate constant C1 (depending on ε), we have

|p_0(x) − p_δ(x)| / (|δ| p_0(x)) ≤ C2 exp(2ε|x|) =: b_0(x).

Now we need to show that

E_0 b_0²(X) = C2² E_0 exp(4ε|X|) < ∞.

The right side clearly is finite under L(X) = N(0, 1). Indeed for all t > 0 we have

∫ exp(t|x|) (2π)^{-1/2} exp(−x²/2) dx ≤ 2 ∫ exp(tx) (2π)^{-1/2} exp(−x²/2) dx
   = 2 ∫ (2π)^{-1/2} exp(−(x − t)²/2) exp(t²/2) dx = 2 exp(t²/2).

Proposition 4.3.3 In the Gaussian location model M2, the Fisher information w.r.t. the expectation
parameter μ ∈ R is n/σ², and the sample mean X̄_n is a UMVUE of μ.

Proof. We have

Var_μ X̄_n = σ²/n,

so we need only find the Fisher information. Condition D3 is easily verified, analogously to the
case n = 1: in (3.4) we found an expression for the joint density

∏_{i=1}^n p_μ(x_i) = (σn^{-1/2})^{-1} φ( (x̄_n − μ)/(σn^{-1/2}) ) · n^{-1/2} (2πσ²)^{-(n−1)/2} exp( −s_n²/(2σ²n^{-1}) ),   (4.22)

where s_n² = n^{-1} Σ_{i=1}^n (x_i − x̄_n)² is the e.s.m. The second factor does not depend on μ
(σ² is fixed), and therefore cancels in the term (p_μ(x) − p_{μ+δ}(x))/p_μ(x) in condition D3. The first
factor is the density (as a function of x̄_n) of the law N(μ, n^{-1}σ²); thus, reasoning as in Example
4.3.2 above, we finally have to show that

E_0 exp( 2ε n σ^{-2} |x̄_n| ) < ∞,

which follows as above. To find the Fisher information, we can use the factorization (4.22) again.
Indeed

I_{F,n}(μ) = E_μ ( (d/dμ) log ∏_{i=1}^n p_μ(X_i) )²
           = E_μ ( (d/dμ) log [ (σn^{-1/2})^{-1} φ( (X̄_n − μ)/(σn^{-1/2}) ) ] )²,

so here the Fisher information is the same as if we observed only X̄_n with law N(μ, n^{-1}σ²), i.e. in
a Gaussian location model with variance σ̃² = n^{-1}σ². The score function in such a model is

l_μ(x) = (d/dμ) log [ σ̃^{-1} φ((x − μ)/σ̃) ]
       = (d/dμ) ( −(x − μ)²/(2σ̃²) ) = σ̃^{-2}(x − μ),

and

I_F(μ) = σ̃^{-4} E_μ(X − μ)² = σ̃^{-2} = n σ^{-2}. □
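A quick Monte Carlo check (a sketch; sample size, seed and number of replications are arbitrary) that the variance of X̄_n in the Gaussian location model matches the Cramer-Rao bound (n I_F(μ))^{-1}·n = σ²/n:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 1.0, 2.0, 25, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1)                    # X-bar_n for each replication

print(xbar.var())                              # Monte Carlo variance of the sample mean
print(sigma**2 / n)                            # Cramer-Rao bound sigma^2 / n
```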

We will now discuss an example where the Cramer-Rao bound does not hold; cf. Casella and
Berger [CB], p. 312, Example 7.3.5. Suppose that X1, …, Xn are i.i.d. with uniform density on
[0, θ]. That means the density is

p_θ(x) = 1/θ, 0 ≤ x ≤ θ.

We might try to formally apply the Cramer-Rao bound, neglecting the differentiability condition
for a moment. For one observation, the score function is

l_θ(x) = (d/dθ) log p_θ(x) = −1/θ,   0 < x < θ,
         0,                          x > θ,

hence (formally)

I_F(θ) = E_θ l_θ²(X) = 1/θ²,

which suggests that for any unbiased estimator based on n observations

Var_θ T(X) ≥ θ²/n.                                                 (4.23)

However an estimator can be found which is better. Consider the maximum, i.e. the largest order
statistic

X_[n] = max_{i=1,…,n} X_i.

The density of X_[n] (q_θ, say) can be found as follows: for t ≤ θ

P_θ(X_[n] ≤ t) = P_θ(X1 ≤ t, …, Xn ≤ t) = ∏_{i=1}^n P_θ(X_i ≤ t) = (t/θ)^n,

and P_θ(X_[n] ≤ t) = 1 for t > θ, hence

q_θ(t) = (d/dt)(t/θ)^n = n t^{n−1}/θ^n, 0 ≤ t ≤ θ,

and q_θ(t) = 0 for t > θ. For the expectation of X_[n] we get

E_θ X_[n] = (n/θ^n) ∫_0^θ y · y^{n−1} dy = (n/θ^n) · θ^{n+1}/(n + 1)
          = (n/(n + 1)) θ.
Remark 4.3.4 Assume that θ = 1, i.e. we have the uniform distribution on [0, 1]. Then

E max_{i=1,…,n} X_i = n/(n + 1), i.e. E(1 − max_{i=1,…,n} X_i) = 1/(n + 1),

which can be interpreted as follows: when n points are randomly thrown into [0, 1] (independently,
with uniform law) then the largest of these values tends to be 1/(n + 1) away from the right boundary.
The same is true by symmetry for the smallest value and the left boundary.

The estimator X_[n] is thus biased, but the bias can easily be corrected: the estimator

T_c(X) = ((n + 1)/n) X_[n]

(which moves X_[n] up towards the interval boundary) is unbiased. To find its variance we note

E_θ X_[n]² = (n/θ^n) ∫_0^θ y² y^{n−1} dy = (n/θ^n) · θ^{n+2}/(n + 2) = (n/(n + 2)) θ²,

Var_θ(T_c(X)) = ((n + 1)/n)² Var_θ X_[n]
             = θ² ((n + 1)/n)² ( n/(n + 2) − (n/(n + 1))² )
             = θ² ( (n + 1)²/(n(n + 2)) − 1 )
             = θ²/(n(n + 2)),                                      (4.24)

which is smaller than the bound θ²/n suggested by (4.23). In fact the variance (4.24) decreases
much faster with n than the bound (4.23), by an additional factor 1/(n + 2).
Theorem 4.3.1 is in fact not applicable since the conditions are violated (the support depends on θ,
and the density is not differentiable everywhere). That suggests that the statistical model where θ
describes the support of the density [0, θ] is more informative than the regular cases where
the Cramer-Rao bound applies. Indeed if X_[n] = y then all values θ < y can be excluded with
certainty, and certainty is a lot of information in a statistical sense.
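A small simulation (a sketch; the constants are arbitrary) illustrating that the bias-corrected maximum beats the formal "bound" θ²/n:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 3.0, 20, 200_000

x = rng.uniform(0, theta, size=(reps, n))
t_c = (n + 1) / n * x.max(axis=1)             # bias-corrected maximum

print(t_c.mean(), theta)                      # unbiasedness check
print(t_c.var(), theta**2 / (n * (n + 2)))    # variance matches theta^2 / (n(n+2))
print(theta**2 / n)                           # formal "Cramer-Rao" value, much larger
```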
Chapter 5
CONDITIONAL AND POSTERIOR DISTRIBUTIONS

5.1 The mixed discrete / continuous model


In section 2.2 we introduced the integrated risk with respect to a prior density g; it was mentioned
already that in the Bayesian approach, the parametric family P_θ, θ ∈ Θ for the observations X is
understood as a conditional distribution given θ. In model M1, the data X were discrete and the
assumed prior distribution for p was continuous (the Beta densities). It is clear that this should
give rise to a joint distribution of observation and parameter. Most basic probability courses
treat joint distributions of two r.v.'s (X, U) only for the case that the joint law is either discrete
or continuous (hence X and U are of the same type). Let us fill in some technical details here for
this mixed case.
When X and U are both discrete or both continuous then a joint distribution is given by a joint
probability function q(x, u), or a joint density q(x, u) respectively. In our mixed case where Θ is
an interval in R and our sample space X for X is finite, we can expect that a joint distribution is
given by a q(x, u) which is of mixed type: q(x, u) ≥ 0 for all x ∈ X, u ∈ Θ and

Σ_{x∈X} ∫_Θ q(x, u) du = 1.                                        (5.1)

Such a function defines a joint distribution Q of X and U: for any subset A of X and any interval
B in Θ

Q(X ∈ A, U ∈ B) = Q(A × B) = Σ_{x∈A} ∫_B q(x, u) du.

It is then possible to extend the class of sets where the probabilities Q are defined: consider sets
C ⊆ X × Θ of the form

C = ∪_{x∈A} {x} × B_x

where A is an arbitrary subset of X and for each x ∈ A, B_x is a finite union of intervals in Θ.
Define

Q(C) = Σ_{x∈A} ∫_{B_x} q(x, u) du.                                 (5.2)

Let S0 be the collection of all such sets C as above; the function Q on sets C ∈ S0 fulfills all the
axioms of a probability.
Suppose now that P_θ, θ ∈ Θ is a parametric family of distributions on X and g(θ) is a density on
Θ. Setting

q(x, θ) = P_θ(x) g(θ),

we have an object q as described above: it is nonnegative and (5.1) is fulfilled:

Σ_{x∈X} ∫_Θ P_θ(x) g(θ) dθ = ∫_Θ ( Σ_{x∈X} P_θ(x) ) g(θ) dθ = ∫_Θ g(θ) dθ = 1

since g is a density. The joint distribution of X and U thus defined gives rise to marginal and
conditional probabilities, for x ∈ X:

P(X = x) = P(X = x, U ∈ Θ) = ∫_Θ P_θ(x) g(θ) dθ,                   (5.3)

P(U ∈ B | X = x) = P(U ∈ B, X = x) / P(X = x)                      (5.4)
                 = ∫_B P_θ(x) g(θ) dθ / ∫_Θ P_θ(x) g(θ) dθ.

If P_θ(x) > 0 for all x and θ (which is the case in M1 if Θ = (0, 1)) then also P(X = x) > 0, and
the conditional probability (5.4) is well defined. Then it is immediate (and shown in probability
courses) that the function

Q_x(B) = P(U ∈ B | X = x)

fulfills all the axioms of a probability on Θ; it defines the conditional law of U given X = x. In the
statistical context this is called the posterior distribution of θ given X = x.
It is obvious that this distribution has a density on Θ = (a, b): for B = (a, t] we have

Q_x((a, t]) = ∫_a^t P_θ(x) g(θ) ( ∫_Θ P_u(x) g(u) du )^{-1} dθ,

which identifies the density of Q_x as

q_x(θ) = P_θ(x) g(θ) ( ∫_Θ P_u(x) g(u) du )^{-1}.                  (5.5)

This density is called the conditional density, or in the context of Bayesian statistics, the
posterior density of θ given X = x. The formula (5.5) is very simple: for given prior density
g(θ), adjoin the probability function P_θ(x) (when X = x is observed) and renormalize P_θ(x) g(θ)
such that it integrates to one w.r.t. θ (i.e. becomes a density).

Remark 5.1.1 The formula (5.5) suggests an analog for the case that X is a continuous r.v. as
well, with density p_θ(x) say. In this case the formula (5.4) cannot be used to define a conditional
distribution since all events {X = x} have probability 0 for the laws P_θ and thus also for the
marginal law of X. For continuous r.v.'s X, the conditional (posterior) density q_x(θ) is directly
defined by replacing the probability P_θ(x) in (5.5) by the density p_θ(x); cf. [D], sect. 3.8 and our
discussion to follow in later sections.

Exercise. To prepare for the purely continuous case, let us see what happens when we reverse the
roles of θ and X in (5.5), i.e. we take the marginal probability function for X given by (5.3) and
combine it with the conditional density for θ given by (5.4). Consider the expression q_x(θ) P(X = x)
and, analogously to (5.5), divide it by its sum over all possible values of x (x ∈ X). Call the result
q_θ(x). Show that for any θ with g(θ) > 0, the relation

q_θ(x) = P_θ(x), x ∈ X

holds. This result justifies calling P_θ(x) a conditional probability function under U = θ:

P_θ(x) = P(X = x | U = θ),

even though U is continuous and the event U = θ has probability 0.

Remark 5.1.2 Consider the uniform prior density on Θ: g(θ) = (b − a)^{-1}. Then

q_x(θ) = L_x(θ) ( ∫_Θ L_x(u) du )^{-1},

i.e. the posterior density is proportional, as a function of θ, to the likelihood function L_x(θ) (the
normalizing factor does not depend on θ).

5.2 Bayesian inference


Computation of a posterior distribution and of a Bayes estimator (which is associated, as we shall
see) can be subsumed under the term Bayesian inference. Let us compute the posterior density
in our case of model M1 and prior densities of the Beta family. We have

P_p(x) g(p) = p^{z(x)} (1 − p)^{n−z(x)} · p^{α−1}(1 − p)^{β−1} / B(α, β)
            = p^{z(x)+α−1} (1 − p)^{n−z(x)+β−1} / B(α, β).

The posterior density is proportional to the Beta density g_{α+z(x), β+n−z(x)}, and hence must coincide
with this density:

q_x(p) = g_{α+z(x), β+n−z(x)}(p).

We see that if the prior density is in the Beta class, then the posterior density is also in this class,
for any observed data x. Such a family is called a conjugate family of prior distributions.
The next subsection presents a more technical discussion of the Beta family. Using the formula
given below for the expectation, we find the expected value of the posterior distribution as

E(U | X = x) = (α + z(x)) / (α + z(x) + β + n − z(x))              (5.6)
             = (α + z(x)) / (α + β + n) = (X̄_n + α/n) / (1 + α/n + β/n) = T_{α,β}(X),

i.e. it coincides with the Bayes estimator for the prior density g_{α,β}, already found in section 2.4.
This is no coincidence, as the next proposition shows. Notationally, the separate symbol U for θ
as a random variable is only needed for expressions like P(X = x | U = θ); we suppress U and set
U = θ wherever possible. Recall that E(U | X = x) is a general notation for conditional expectation
([D] sec. 4.6), i.e. the expectation of a conditional law. In our statistical context E(θ | X = x) is
called the posterior expectation.
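A short sketch of this conjugate update (Python with SciPy; the prior parameters and simulated data are illustrative choices of ours): the posterior is Beta(α + z, β + n − z), and its mean is the Bayes estimator.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(5)
alpha, beta_prior = 2.0, 2.0                 # Beta prior parameters (illustrative choice)
n, p_true = 40, 0.35
x = rng.binomial(1, p_true, size=n)
z = x.sum()

# Conjugate update: posterior is Beta(alpha + z, beta + n - z)
post = beta(alpha + z, beta_prior + n - z)

post_mean = post.mean()                      # Bayes estimator under quadratic loss
formula = (alpha + z) / (alpha + beta_prior + n)
print(post_mean, formula)                    # both give the same value
```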

Proposition 5.2.1 In the statistical model Mf (the sample space X for X is finite and Θ is an
open interval in R), let Θ be a finite interval and g be a prior density on Θ. Then, for a quadratic
loss function, a Bayes estimator T_B of θ is given by the posterior expectation

T_B(x) = E(θ | X = x), x ∈ X.

Proof. Note that for a r.v. U = θ taking values in a finite interval, the expectation and all higher
moments always exist, so both for prior and posterior distributions the expectation exists. The
integrated risk for any estimator is (it was defined previously for the special case of Model M1)

B(T) = ∫_Θ R(T, θ) g(θ) dθ = ∫_Θ E_θ (T − θ)² g(θ) dθ.

Obviously this can be written

B(T) = ∫_Θ Σ_x (T(x) − θ)² P_θ(x) g(θ) dθ = Σ_x ∫_Θ (T(x) − θ)² P_θ(x) g(θ) dθ.

From (5.5) we have

P_θ(x) g(θ) = q_x(θ) P(X = x),

where P(X = x) is the marginal probability function of X. Hence

B(T) = Σ_x ( ∫_Θ (T(x) − θ)² q_x(θ) dθ ) P(X = x).                 (5.7)

Let q(θ) be an arbitrary density on Θ, E_q θ be the expectation under q and a be a constant (a does
not depend on θ). Then we claim

E_q(a − θ)² ≥ E_q(E_q θ − θ)² = Var_q θ                            (5.8)

with equality if and only if a = E_q θ. (Note that E_q(a − θ)² is always finite in our model.) Indeed

E_q(a − θ)² = E_q( a − E_q θ − (θ − E_q θ) )²
            = (a − E_q θ)² + Var_q θ

in view of E_q(θ − E_q θ) = 0, which proves (5.8) and our claim. Now apply this result to the
expression in round brackets under the sum sign in (5.7) and obtain that for any given x, T(x) =
E_{q_x} θ = E(θ | X = x) = T_B(x) is the unique minimizer. Hence

B(T) ≥ ∫_Θ E_θ (T_B(X) − θ)² g(θ) dθ = E(T_B(X) − θ)²,

where the last expectation refers to the joint distribution of X and θ. □

In the last display we wrote T_B(x) = E(θ | X = x) as a random variable when the conditioning x is
seen as random. The common notation for this random variable is E(θ | X).

5.3 The Beta densities


Let us discuss more carefully the Beta class of densities which were used as prior distributions in
model M1. Consider the following family of prior densities for the Bernoulli parameter p: for
α, β > 0

g_{α,β}(x) = x^{α−1}(1 − x)^{β−1} / B(α, β), x ∈ [0, 1],

where B(α, β) is the Beta function

B(α, β) = Γ(α)Γ(β) / Γ(α + β).                                     (5.9)

Recall that the Gamma function is defined (for α > 0) as

Γ(α) = ∫_0^∞ x^{α−1} exp(−x) dx

and satisfies

Γ(α + 1) = α Γ(α).                                                 (5.10)

It was already argued that for α, β > 0, the function x ↦ x^{α−1}(1 − x)^{β−1} is integrable on [0, 1] (cf.
relation (2.15)). Relation (5.9) is proved below. The moments are (k integer)

m_k := ∫_0^1 x^k g_{α,β}(x) dx
     = ( Γ(α + β) / (Γ(α)Γ(β)) ) ∫_0^1 x^{k+α−1}(1 − x)^{β−1} dx
     = ( Γ(α + β) / (Γ(α)Γ(β)) ) · Γ(k + α)Γ(β) / Γ(k + α + β).

Invoking (5.10), we find

m_k = (k − 1 + α)(k − 2 + α) ⋯ α / ( (k − 1 + α + β)(k − 2 + α + β) ⋯ (α + β) ),

especially (for a r.v. U with Beta density g_{α,β})

EU = m_1 = α / (α + β),                                            (5.11)
EU² = m_2 = α(α + 1) / ( (1 + α + β)(α + β) ),
Var(U) = EU² − (EU)² = ( α(α + 1)(α + β) − α²(1 + α + β) ) / ( (1 + α + β)(α + β)² )
       = αβ / ( (1 + α + β)(α + β)² ).

Thus α = β implies EU = 1/2; in particular the prior distribution for which the Bayes estimator
is minimax (α = β = n^{1/2}/2, cf. Theorem 2.5.2) has expected value 1/2.
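A numerical cross-check (a sketch; the values of α, β are arbitrary) of the mean and variance formulas against SciPy's Beta distribution:

```python
from scipy.stats import beta

a, b = 3.0, 5.0
dist = beta(a, b)

mean_formula = a / (a + b)
var_formula = a * b / ((1 + a + b) * (a + b) ** 2)

print(dist.mean(), mean_formula)     # alpha / (alpha + beta)
print(dist.var(), var_formula)       # alpha*beta / ((1+alpha+beta)(alpha+beta)^2)
```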

We now show that

B(α, β) = ∫_0^1 u^{α−1}(1 − u)^{β−1} du = Γ(α)Γ(β) / Γ(α + β).     (5.12)

Consider the Gamma distribution for parameter α: it has density, for x > 0,

f_α(x) = x^{α−1} exp(−x) / Γ(α).

Consider independent r.v.'s X, Y with Gamma laws for parameters α, β, and their joint density

g_{α,β}(x, y) = f_α(x) f_β(y) = x^{α−1} y^{β−1} exp(−(x + y)) / (Γ(α)Γ(β)).

Consider the change of variables x = rt, y = r(1 − t) for 0 < r < ∞, 0 ≤ t ≤ 1. (Indeed every
(x, y) with x > 0, y > 0 can be uniquely represented in this form: r = x + y, t = x/(x + y).) The
Jacobian matrix is

( ∂x/∂t  ∂x/∂r )   (  r    t     )
( ∂y/∂t  ∂y/∂r ) = ( −r    1 − t )

with determinant r(1 − t) + rt = r. The new density in variables t, r is

g̃_{α,β}(t, r) = (rt)^{α−1} (r(1 − t))^{β−1} r exp(−r) / (Γ(α)Γ(β))
             = r^{α+β−1} exp(−r) t^{α−1}(1 − t)^{β−1} / (Γ(α)Γ(β)).

When we integrate over r, the result is the marginal density of t (call this f_{α,β}(t)), and hence it must
integrate to one. We find from the definition of the Gamma function

f_{α,β}(t) = ∫_0^∞ g̃_{α,β}(t, r) dr = ( Γ(α + β)/(Γ(α)Γ(β)) ) t^{α−1}(1 − t)^{β−1}.

This is the density of the Beta law; since this density integrates to one, we have proved (5.12).

5.4 Conditional densities in continuous models


Consider a two dimensional random variable (X, Y) with joint density p(x, y). Consider an interval
B_h = (x − h/2, x + h/2) and assume P(X ∈ B_h) > 0. The conditional probability P(Y ∈ A | X ∈ B_h)
is well defined:

P(Y ∈ A | X ∈ B_h) = P(Y ∈ A, X ∈ B_h) / P(X ∈ B_h).               (5.13)

It is clear that P(A | X ∈ B_h), considered as a function of A, is a probability distribution, i.e. the
conditional distribution of Y given X ∈ B_h. This distribution has a density: for A = {Y ≤ y} we
obtain

P(Y ≤ y | X ∈ B_h) = ∫_{−∞}^y ∫_{B_h} p(x, v) dx dv · ( P(X ∈ B_h) )^{-1},   (5.14)

which shows that the density is

p(y | X ∈ B_h) = ∫_{B_h} p(x, y) dx · ( P(X ∈ B_h) )^{-1}.

When X is continuous, it is desirable to define conditional distributions under an event {X = x}.
Indeed, when X is realized, conditioning on an interval B_h which contains x does not correspond to
the actual information we have. We might be tempted to condition on smaller and smaller intervals,
all containing the realized value x, and each of these conditions would reflect more accurately the
information we have, yet none of them would correspond to the actual information X = x.
It is natural to try to pass to a limit here as h → 0. Let p_X(x) be the marginal density of X:

p_X(x) = ∫ p(x, y) dy.

Without trying to be rigorous in this limit argument, for B_h = (x − h/2, x + h/2), h → 0,

∫_{B_h} p(x', y) dx' ≈ h p(x, y),
P(X ∈ B_h) = ∫_{B_h} p_X(x') dx' ≈ h p_X(x).

Based on this heuristic, we introduce the conditional density given X = x by definition: if
p_X(x) > 0 then

p(y | x) = p(x, y) / p_X(x).                                       (5.15)

The figure is meant to illustrate the idea of p(y | x) as giving the relative weight given to
different y's by the joint density, once the other argument x is fixed.
The expression p(y | x) is defined only in case p_X(x) > 0. In that case however it clearly is a density,
since it is nonnegative and, as a function of y, it integrates to one. Then, for such x for which
p_X(x) > 0, we define the conditional distribution of Y given X = x: for any A

P(Y ∈ A | X = x) = ∫_A p(y | x) dy.

Note that formula (5.15) is analogous to the conditional probability function in the discrete case:
if X and Y can take only finitely many values x, y and

p(x, y) = P(Y = y, X = x), p_X(x) = P(X = x)

are the joint and marginal probability functions, then if p_X(x) > 0, by the classical definition of
conditional probability

P(Y = y | X = x) = P(Y = y, X = x) / P(X = x) = p(x, y) / p_X(x),

which looks exactly like (5.15), but all the terms involved are probability functions, not densities.
To state a connection between conditional densities and independence, we need a slightly more
precise definition. First note that densities are not unique; they can be modified at certain points
or on subsets without affecting the corresponding probability measure.

Definition 5.4.1 Let Z be a random variable with values in R^k and density q on R^k. A version of
q is a density q̃ on R^k such that for all sets A ⊆ R^k for which the k-dimensional volume (measure)
is defined,

∫_A q(z) dz = ∫_A q̃(z) dz.

[Figure 1: Conditional density of Y given X = x. The altitude lines symbolize the joint density of X and Y; on the dark strip, in the limit as h tends to 0, this gives the conditional density of Y given X = x.]

In particular, if A = A1 ∪ A2 and A2 has volume 0, then the density q can be arbitrarily modified on
A2. On the real line, A2 might consist of a countable number of points; in R^k, A2 might consist
of hyperplanes, smooth surfaces etc.

Definition 5.4.2 Suppose that Z = (Z1, …, Zk) is a continuous random variable with values in
R^k and with joint density p(z) = p(z1, …, zk). Set X = Z1, Y = (Z2, …, Zk) and p(x, y) = p(z).
A version of the conditional density of Y given X = x is any function p(y | x) with the properties:
(i) For all x, p(y | x) is a density in y, i.e.

p(y | x) ≥ 0, ∫ p(y | x) dy = 1.

(ii) There is a version of p(x, y) such that for p_X(x) = ∫ p(x, y) dy we have

p(x, y) = p(y | x) p_X(x).                                         (5.16)

Lemma 5.4.3 A conditional density as defined above exists.

Proof. If x is such that p_X(x) > 0 then we can set

p(y | x) = p(x, y) / p_X(x).                                       (5.17)

Clearly this is a density in y since

∫ ( p(x, y) / p_X(x) ) dy = p_X(x) / p_X(x) = 1.

Let A = {x ∈ R : p_X(x) = 0}. Clearly P(X ∈ A) = 0, and hence for

A0 = {(x, y) : x ∈ A}

we obtain

P((X, Y) ∈ A0) = P(X ∈ A) = 0.

Now modify p(x, y) on A0 to obtain another version p̃, namely set

p̃(x, y) = 0 for (x, y) ∈ A0.

Clearly p̃ is a version of p: for any event B

∫_B p(z) dz = P(Z ∈ B) = P(Z ∈ B \ A0)
            = ∫_{B\A0} p(z) dz = ∫_{B\A0} p̃(z) dz = ∫_B p̃(z) dz.

For this version p̃(x, y) we have: p_X(x) = 0 implies p̃(x, y) = 0, and hence for such x, p(y | x) can
be chosen as an arbitrary density to fulfill (5.16). □

Proposition 5.4.4 (X, Y) with joint density p(x, y) are independent if and only if there is a version
of p(y | x) which does not depend on x.

Proof. If X and Y are independent then p(x, y) = p_X(x) p_Y(y). Thus p_Y(y) = p(y | x) is such a
version. Conversely, if p(y | ·) is such a version (not depending on x) then

p_Y(y) = ∫ p(x, y) dx = ∫ p(y | x) p_X(x) dx = p(y | ·) ∫ p_X(x) dx = p(y | ·),

hence

p(x, y) = p_Y(y) p_X(x),

which implies that (X, Y) are independent. □
In case all occurring joint and marginal densities are positive, there is really no need to consider
modifications; the conditional densities can just be taken as (5.17).
Let now f(Y) be any function of the random variable Y. The conditional expectation of f(Y)
given X = x is

E(f(Y) | X = x) = ∫ f(y) p(y | x) dy,

i.e. the expectation with respect to the conditional distribution of Y given X = x, given by the
conditional density p(y | x). Note that this conditional expectation depends on x, i.e. it is a function
of x, the realization of the random variable X. Sometimes it is useful to keep in mind the original
random nature of this realization, i.e. to consider the conditional expectation as a function of the
random variable X. The common notation for this random variable is

E(f(Y) | X);

it is a random variable which is a function of X. For the expectation we then have

E E(f(Y) | X) = E ∫ f(y) p(y | X) dy = ∫ ( ∫ f(y) p(y | x) dy ) p_X(x) dx
             = ∫∫ f(y) p(x, y) d(x, y) = ∫ f(y) p_Y(y) dy = E f(Y),

i.e. the expectation of the conditional expectation yields the expectation. These notions are the same in
the case of discrete random variables; only in the continuous case we had to pay attention to the
fact that P(X = x) = 0.

5.5 Bayesian inference in the Gaussian location model


Suppose that we observe a continuous random variable X = (X1, …, Xn) with unknown density
p_θ(x), x ∈ R^n, θ ∈ Θ, where Θ is an open subset of R^d (this was called model Mc before).
In a Bayesian statistical approach, we assume that θ becomes random as well. Suppose we have a
prior density π(θ) on the parameter space Θ and wish to build a joint density of X and θ from
this.
We need some regularity conditions on functions and sets. A set A ⊆ R^k is called measurable if its
k-dimensional volume is defined (the volume may be 0 or ∞). One also defines measurable functions
on R^k; we do not give the definition here but remark only that measurability is necessary for
integrals ∫ f(x) dx to be defined (thus densities must be measurable). All continuous functions are
measurable, and also functions which are continuous except on a set of volume 0. In this course,
we need not dwell on these technicalities (measure theory); we assume that all occurring sets and
functions are measurable.

Proposition 5.5.1 Suppose that in model Mc a density π(θ) on Θ is given such that π(θ) > 0,
θ ∈ Θ (and that p_θ(x) is jointly measurable as a function of (x, θ)). Then the function

p(x, θ) = p_θ(x) π(θ)                                              (5.18)

is a density on R^n × Θ. When this is construed as a joint density of (X, θ), then p_θ(x) is a version
of the conditional density p(x | θ).

Proof. The function p(x, θ) is nonnegative; we have

∫_Θ ∫_{R^n} p(x, θ) dx dθ = ∫_Θ ( ∫_{R^n} p_θ(x) dx ) π(θ) dθ = ∫_Θ π(θ) dθ = 1,

thus p(x, θ) is a density. Then π(θ) is the marginal density of θ, derived from the joint density:
indeed

∫_{R^n} p(x, θ) dx = ∫_{R^n} p_θ(x) π(θ) dx = π(θ).

Then we immediately see from (5.18) that p_θ(x) is a version of the conditional density p(x | θ). □
This justifies the notation that for a parametric family of densities, one writes interchangeably
p_θ(x) or p(x | θ). Then p(θ | x) is again called the posterior density. If Θ is an interval in R then the
conditional expectation

E(θ | X = x) = ∫_Θ θ p(θ | x) dθ,

if it exists, is again called the posterior expectation. Let us discuss the case of the Gaussian location
model, first for sample size n = 1. We can represent X = X1 as

X = θ + ξ

where ξ is centered normal: L(ξ) = N(0, σ²). The parameter is θ and the parameter space is Θ = R.
Assume that θ becomes random as well: L(θ) = N(m, δ²), where m and δ² are known (these are
sometimes called hyperparameters). For Bayesian statistical inference, we wish to compute the
conditional distribution of θ given X, i.e. the posterior distribution of the parameter.
The joint density of X and θ is

p(x, θ) = (2πσδ)^{-1} exp( −(x − θ)²/(2σ²) ) exp( −(θ − m)²/(2δ²) )
        = (2πσδ)^{-1} exp( −(x − θ)²/(2σ²) − (θ − m)²/(2δ²) ).

To find the marginal density of X, we define ξ = X − θ and compute the joint distribution of ξ and
θ. First find the conditional law L(ξ | θ), which is the law of X − θ when θ is fixed. This is N(0, σ²),
and since it does not depend on θ, ξ and θ are independent in their joint law (Proposition 5.4.4).
In other words, we have

X = θ + ξ

where θ, ξ are independent with laws N(m, δ²), N(0, σ²) respectively. By the properties of the normal
distribution, we conclude that X has marginal law L(X) = N(m, σ² + δ²), with density (φ is the
standard Gaussian density)

p_X(x) = (σ² + δ²)^{-1/2} φ( (x − m)/(σ² + δ²)^{1/2} )
       = (2π)^{-1/2} (σ² + δ²)^{-1/2} exp( −(x − m)²/(2(σ² + δ²)) ),

and the posterior density is

p(θ | x) = p(x, θ)/p_X(x)
         = ( (σ² + δ²)^{1/2}/((2π)^{1/2} σδ) ) exp( −(x − θ)²/(2σ²) − (θ − m)²/(2δ²) + (x − m)²/(2(σ² + δ²)) ).

We conjecture that this is a Gaussian density (if X and θ were independent then p(θ | x) would be
the prior density, which certainly is Gaussian). We shall establish that p(θ | x) is indeed Gaussian
with variance

τ² := σ²δ²/(σ² + δ²).                                              (5.19)

Note that for θ̃ = θ − m, x̃ = x − m,

(x − θ)²/(2σ²) + (θ − m)²/(2δ²) − (x − m)²/(2(σ² + δ²)) = (1/(2τ²)) (θ̃ − γx̃)²,

where we set

γ = δ²/(σ² + δ²);                                                  (5.20)

since (σ² + δ²)^{1/2}/((2π)^{1/2}σδ) = (2πτ²)^{-1/2}, we obtain

p(θ | x) = (1/τ) φ( (θ − m − γ(x − m))/τ ).

Proposition 5.5.2 In the Gaussian location model M2 for sample size n = 1 and a normal prior
distribution L(θ) = N(m, δ²), the posterior distribution is

L(θ | X = x) = N(m + γ(x − m), τ²)

where γ and τ² are defined by (5.19), (5.20). The normal family N(m, δ²), m ∈ R, δ² > 0 is a
conjugate family of prior distributions.

Let us interpret this result. The posterior distribution of θ has expected value m + γ(x − m); note
that 0 < γ < 1. The prior distribution of θ had expectation m, so the posterior expectation of θ
intuitively represents a compromise between the prior belief and the empirical evidence about θ,
i.e. X = (θ + ξ) = x. Indeed

E(θ | X = x) = m + γ(x − m) = m(1 − γ) + γ x,

so E(θ | X = x) is a convex (linear) combination of the data x and the prior mean m (it always lies between
these two points). In other words, the data x are shrunken towards m when x − m is multiplied
by γ. A similar shrinkage effect was observed for the Bayes estimator in the Bernoulli model M1
when the prior mean was 1/2 (recall that the minimax estimator there was Bayes for a Beta prior
with α = β, which has mean 1/2).
Moreover

τ² = σ²δ²/(σ² + δ²) < min(σ², δ²).

The posterior variance τ² is thus smaller than both the prior variance and the variance of the data
given θ (i.e. σ²). It is seen that the information in the data and in the prior distribution is added
up to give a smaller variability (a posteriori) than is contained in either of the two sources. In fact

1/τ² = 1/σ² + 1/δ².

The inverse variance of a distribution can be seen as a measure of concentration, sometimes called
precision. Then the precision of the posterior is the sum of the precisions of the data and of the prior
distribution.
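A minimal numerical sketch of these formulas for one observation (the variable names and values are ours): posterior mean as a shrinkage of x towards m, and posterior precision as the sum of precisions.

```python
sigma2 = 1.0        # variance of the observation given theta
delta2 = 4.0        # prior variance
m = 0.0             # prior mean
x = 2.5             # observed value

gamma = delta2 / (sigma2 + delta2)            # shrinkage factor
tau2 = sigma2 * delta2 / (sigma2 + delta2)    # posterior variance

post_mean = m + gamma * (x - m)               # = (1 - gamma) m + gamma x
print(post_mean, tau2)
print(1 / tau2, 1 / sigma2 + 1 / delta2)      # posterior precision = sum of precisions
```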
The posterior expectation again can be shown to be a Bayes estimator for quadratic loss. Before
establishing this, let us discuss two limiting cases.

Case 1 Variance δ² of the prior density is large: δ² → ∞. In this case

γ = δ²/(σ² + δ²) → 1, τ² → σ².                                     (5.21)

A large prior variance means that the prior density is more and more spread out, i.e. more
diffuse or noninformative. γ → 1 then means that the prior information counts less and
less in comparison to the data, and the posterior expectation of θ tends to x. This means
that in the limit the only evidence we have about θ is the realized value X = x.

Case 2 Variance δ² of the prior density is small: δ² → 0. In this case

γ = δ²/(σ² + δ²) → 0, τ² → 0.                                      (5.22)

A small prior variance means that the prior density is more and more concentrated around
m. Then (5.22) means that the belief that θ is near m becomes overwhelmingly strong,
and forces the posterior distribution to concentrate around m as well.

Case 3 Variance σ² of the data (given θ) is large: σ² → ∞. In this case

γ = δ²/(σ² + δ²) → 0, τ² → δ².

The posterior density tends to the prior density, since the quality of the information X = x
becomes inferior (large variance σ²).

Case 4 Variance σ² of the data (given θ) is small: σ² → 0. We expect the prior distribution to
matter less and less, since the data are more and more reliable. Indeed

γ = δ²/(σ² + δ²) → 1, τ² → 0,

which is similar to (5.21) for Case 1.

In the case of prior mean m = 0, the quantity

r = δ²/σ²

is often called the signal-to-noise ratio. Recall that in the Gaussian location model (Model M2)
for n = 1 the data are

X = θ + ξ

where L(θ) = N(0, δ²) and L(ξ) = N(0, σ²). The parameter θ can be seen as a signal which is
observed with noise ξ. For m = 0 we have

δ² = Eθ², σ² = Eξ²,

so that δ², σ² represent the average size of the (squared) signal and noise. The parameters
of the posterior density can be expressed in terms of the signal-to-noise ratio r and σ²:

γ = δ²/(σ² + δ²) = r/(1 + r), τ² = σ²δ²/(σ² + δ²) = σ² r/(1 + r),

and the discussion of the limiting cases 1-4 above could have been in terms of r.

Remark 5.5.3 The source of randomness of the parameter θ may not only be prior belief, as
argued until now, but may be entirely natural and part of the model, along with the randomness
of the noise ξ. This randomness assumption even dominates in the statistical literature related to
communication and signal transmission (assumption of a random signal). The problem of estimation
of θ then becomes the problem of prediction. Predicting θ still means to find an estimator
T(X) depending on the data, but for assessing the quality of a predictor, the randomness of θ is
always taken into account. The risk of a predictor is the expected loss

E L(T(X), θ)

with respect to the joint distribution of X and θ. This coincides with the mixed risk of an estimator

B(T) = E( E_θ L(T(X), θ) )

in Bayesian statistics when a prior distribution on θ is used to build the joint distribution of
(X, θ). Thus an optimal predictor is the same as a Bayes estimator, when the loss functions L
coincide.

In statistical communication theory (or information theory), a family of probability distributions
{P_θ, θ ∈ Θ} on some sample space X is often called a noisy (communication) channel,
the parameter θ is the signal at the input and the observation X is called the signal at the
output. It is assumed that the signal is intrinsically random, e.g. the finite set Θ = {θ1, …, θk}
may represent letters of an alphabet which occur at different frequencies (or probabilities) π(θ).
Thus π(θ) may be given naturally, by the frequency of certain letters in a given language. Typical
examples of noisy channels are:

• the binary channel: Θ = {0, 1}, P_θ = B(1, p_θ) where p_θ ∈ (0, 1), θ = 0, 1. The signals are
either 0 or 1, and given θ = 1 the channel produces the correct signal 1 with probability
p1 and the distorted signal 0 with probability 1 − p1; analogously for θ = 0. It is natural to
assume here p1 > 1/2, p0 < 1/2, lest the channel be entirely distorted (give the wrong
signal with higher probability than the correct one). A natural prior π would be the uniform one
here (π(0) = π(1) = 1/2), since in most data streams it is probably not sensible to assume that
0 is more frequent than 1.

• the n-fold product of the binary channel: Θ = {0, 1}^n, P_θ = ⊗_{i=1}^n B(1, p_{θ(i)}) where
θ(i) is the i-th component of a sequence θ of 0's and 1's of length n and ⊗_{i=1}^n signifies the
n-fold product of laws. The numbers p_θ ∈ (0, 1), θ = 0, 1 are as above in the simple binary
channel. Here a signal is a sequence of length n, and the channel transmits this sequence
such that each component is sent independently through a simple binary channel. The difference
to the previous case is that the signal is a sequence of length n, not just 0 or 1; thus n = 8
gives card(Θ) = 2⁸ = 256 and this suffices to encode and send all the ASCII signs. Here it
is natural to assume a non-uniform distribution π on the alphabet Θ since ASCII signs (e.g.
letters) do not all occur with equal probability.

• the Gaussian channel. Let Θ ⊆ R and P_θ = N(θ, σ²) where σ² is fixed. Here the signal
at the output has the form

X = θ + ξ                                                          (5.23)

where L(ξ) = N(0, σ²), i.e. the channel transmits the signal in additive Gaussian noise.
Here the real numbers θ are codes agreed upon for any other signal, such as sequences
of 0's and 1's as above. Thus the channel (5.23) itself coincides with the Gaussian location
model. However, since there are usually only a finite number of signals possible, in information
theory one considers prior distributions π for the signal which are concentrated on finite
sets Θ ⊆ R.

After this digression, our next task is Bayesian inference in the Gaussian location model for general
sample size n. Our prior distribution for θ will again be N(m, δ²). We have

X_i = θ + ξ_i

where the ξ_i are N(0, σ²) and independent. Consider the density of X given θ:

p_θ(x) = p(x | θ) = p(x1, …, xn | θ)
       = ∏_{i=1}^n (2πσ²)^{-1/2} exp( −(x_i − θ)²/(2σ²) )
       = (σn^{-1/2})^{-1} φ( (x̄_n − θ)/(σn^{-1/2}) ) · n^{-1/2} (2πσ²)^{-(n−1)/2} exp( −s_n²/(2σ²n^{-1}) ),

where the last line is the representation found in (3.4) in the proof of Proposition 3.0.5, with φ the
standard normal density and s_n² the e.s.m. The first factor is a normal density in x̄_n and
the second factor (call it p̃(s_n²) for now) does not depend on θ. If we denote by φ_{θ,σ}(x) the density of
the normal law N(θ, σ²), then

p_θ(x) = φ_{θ,σn^{-1/2}}(x̄_n) · p̃(s_n²).

If π(θ) is the prior density then

p(θ | x) = p_θ(x)π(θ) / ∫ p_u(x)π(u)du = φ_{θ,σn^{-1/2}}(x̄_n)π(θ) / ∫ φ_{u,σn^{-1/2}}(x̄_n)π(u)du.

This coincides with the posterior distribution for one observation (n = 1) at the value X1 = x̄_n and
variance n^{-1}σ². In other words, the posterior distribution given X = x may be computed as if
we observed only the sample mean X̄_n, taking into account that its law is N(θ, n^{-1}σ²). Thus the
posterior density depends on the vector x only via the function x̄_n of x.

Proposition 5.5.4 In the Gaussian location model M2 for general sample size n and a normal
prior distribution L(θ) = N(m, δ²), the posterior distribution is

L(θ | X = x) = N(m + γ(x̄_n − m), τ²)                               (5.24)

where γ and τ² are defined by

γ = δ²/(n^{-1}σ² + δ²), τ² = n^{-1}σ²δ²/(n^{-1}σ² + δ²).            (5.25)

The normal family N(m, δ²), m ∈ R, δ² > 0 is a conjugate family of prior distributions.

We again emphasize the special role of the sample mean.

Corollary 5.5.5 In Model M2 for general sample size n, consider the statistic X̄_n and regard it
as the data in a derived model:

L(X̄_n | θ) = N(θ, n^{-1}σ²)

(Gaussian location for sample size n = 1 and variance n^{-1}σ²). In this model, the normal prior
L(θ) = N(m, δ²) leads to the same posterior distribution (5.24) for θ.
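A numerical sketch of Proposition 5.5.4 / Corollary 5.5.5 (names and numbers ours): the posterior is computed from X̄_n alone, treated as one observation with variance σ²/n.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma2, delta2, m = 1.0, 2.0, 0.0
n, theta_true = 30, 1.2
x = rng.normal(theta_true, np.sqrt(sigma2), size=n)
xbar = x.mean()

# Treat xbar as a single observation with variance sigma^2 / n  (Corollary 5.5.5)
s2 = sigma2 / n
gamma = delta2 / (s2 + delta2)
tau2 = s2 * delta2 / (s2 + delta2)

post_mean = m + gamma * (xbar - m)
print(post_mean, tau2)      # posterior N(m + gamma (xbar - m), tau^2); close to xbar for large n
```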
Remark 5.5.6 Sufficient statistics. Properties of statistics (data functions) T(X) like these
suggest that T(X) may contain all the relevant information in the sample, i.e. that it suffices to
take the statistic T(X) and perform all inference about the parameter θ in the parametric model
for T(X) derived from the original family {P_θ, θ ∈ Θ} for X:
    {Q_θ, θ ∈ Θ} = {L(T(X)|θ), θ ∈ Θ},
which is a family of distributions on the space T where T(X) takes its values. This idea is called
the sufficiency principle, and T would be a sufficient statistic. At this point we do not rigorously
define this concept, noting only that it is of fundamental importance in the theory of statistical
inference.

The expectation of the posterior distribution in Proposition 5.5.4 is m + γ(x̄_n - m). It is natural
to call this the conditional expectation of θ given X = x, or the posterior expectation, written
E(θ|X = x). (We have not shown so far that it is unique; more accurately we should call it a
version of the conditional expectation.) Clearly for n → ∞ we have γ → 1 and E(θ|X = x) will be
close to the sample mean. Moreover, the above corollary shows that in a discussion of limiting cases
as in Case 1 - Case 4 above, all statements carry over when X (the one observation for n = 1) is
replaced by the sample mean. In addition, the noise variance in this discussion is now replaced by
n^{-1}σ². Thus e.g. Case 4 can be taken to mean that as the sample size n increases, the prior distribution
matters less and less (indeed we have more and more information in the sample). The analog of
the signal-to-noise ratio would be

    r = n τ²/σ² → ∞  as n → ∞.

Again we see that intuitively large sample size n amounts to small noise.
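To make the formulas of Proposition 5.5.4 concrete, here is a small numerical sketch (Python with numpy; this is not part of the original notes, and the numbers m, τ², σ², θ and the simulated data are chosen only for illustration). It computes γ, the posterior mean m + γ(x̄_n - m) and the posterior variance δ², and shows how the posterior concentrates as n grows.

    import numpy as np

    def gaussian_posterior(x, sigma2, m, tau2):
        # posterior of theta for X_i = theta + eps_i, eps_i ~ N(0, sigma2),
        # prior theta ~ N(m, tau2); cf. (5.24), (5.25)
        n = len(x)
        xbar = np.mean(x)
        gamma = tau2 / (sigma2 / n + tau2)
        post_mean = m + gamma * (xbar - m)
        post_var = (sigma2 / n) * tau2 / (sigma2 / n + tau2)   # delta^2
        return post_mean, post_var

    rng = np.random.default_rng(0)
    theta_true, sigma2, m, tau2 = 1.0, 4.0, 0.0, 1.0
    for n in (1, 10, 100):
        x = theta_true + np.sqrt(sigma2) * rng.standard_normal(n)
        print(n, gaussian_posterior(x, sigma2, m, tau2))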

5.6 Bayes estimators (continuous case)


The following discussion of Bayes estimators as posterior expectations is analogous to the case of
finite sample space (Proposition 5.2.1). Consider again the general continuous statistical model
Mc where a density π(θ) on Θ is given such that π(θ) > 0, θ ∈ Θ. Consider the mixed quadratic
risk of an estimator T

    B(T) = ∫ R(T, θ) π(θ) dθ = ∫ E_θ (T(X) - θ)² π(θ) dθ = E (T(X) - θ)²

where the last expectation is taken with respect to the joint distribution of X and θ. A Bayes estimator
is an estimator T_B which minimizes B(T):

    B(T_B) = inf_T B(T).

It is also called a best predictor, depending on the context (cf. Remark 5.5.3).
In the statistical model Mf (the sample space X for X is finite and Θ is a finite interval in R),
let g be a prior density on Θ. Then, for a quadratic loss function, a Bayes
estimator T_B of θ is given by the posterior expectation

    T_B(x) = E(θ|X = x),  x ∈ X.
Proposition 5.6.1 In model Mc assume a prior density π(θ) on Θ such that π(θ) > 0, θ ∈ Θ.
Suppose that for all realizations x of X there is a version of the posterior density with finite second
moment (E(θ²|X = x) < ∞). Then the posterior expectation T(x) = E(θ|X = x) is a Bayes
estimator for quadratic loss.

Proof. In Proposition 5.2.1 the finite second moment was automatically ensured by the con-
dition that Θ was a finite interval; otherwise the proof is entirely analogous. Under the condition,
E(θ|X = x) exists. We have

    B(T) = ∫_Θ ∫_{R^k} (T(x) - θ)² p(x|θ) π(θ) dx dθ
         = ∫_{R^k} ( ∫_Θ (T(x) - θ)² p(θ|x) dθ ) p_X(x) dx
         = ∫_{R^k} E( (T(X) - θ)² | X = x ) p_X(x) dx.

Invoking relation (5.8) again, we find for any x and T_B(x) = E(θ|X = x)

    ∫ (T(x) - θ)² p(θ|x) dθ ≥ ∫ (T_B(x) - θ)² p(θ|x) dθ.

This holds true even if the left side is infinite. The right side is the variance of p(θ|x) and is finite
under the assumptions. Hence

    B(T) ≥ E (T_B(X) - θ)² = B(T_B).

Corollary 5.6.2 In Model M2 for general sample size n and a normal prior distribution L(θ) =
N(m, τ²), the Bayes estimator for quadratic loss is

    T_B(X) = m + γ(X̄_n - m)

where γ is given by (5.25). The Bayes risk is

    B(T_B(X)) = E (T_B(X) - θ)² = δ² = n^{-1}σ² τ² / (n^{-1}σ² + τ²).   (5.26)

Proof. The second statement follows from the fact that if Var(θ|X) denotes the variance of the
posterior density p(θ|X) then, from the proof of Proposition 5.6.1,

    E (T_B(X) - θ)² = E( Var(θ|X) ),

and Proposition 5.5.4, which gives

    Var(θ|X = x) = δ².

Thus Var(θ|X = x) does not depend on x and E(Var(θ|X)) = δ².


Note that speaking of the Bayes estimator is justified here: indeed, when we start with the normal
conditional density for our data X_1, . . . , X_n and a normal prior, then there is always a version of the
posterior density which is normal. As long as we do not arbitrarily modify these normal densities
at certain points (which is theoretically allowed), we obtain a unique posterior expectation for all
data points x.
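As a sanity check on (5.26), the following sketch (again Python/numpy; the values of n, σ², m, τ² and the number of Monte Carlo repetitions are arbitrary) draws θ from the prior, the sample mean from N(θ, σ²/n), and compares the Monte Carlo average of (T_B(X) - θ)² with δ².

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma2, m, tau2, reps = 5, 2.0, 0.5, 1.5, 200_000
    gamma = tau2 / (sigma2 / n + tau2)
    delta2 = (sigma2 / n) * tau2 / (sigma2 / n + tau2)

    theta = m + np.sqrt(tau2) * rng.standard_normal(reps)           # theta ~ N(m, tau2)
    xbar = theta + np.sqrt(sigma2 / n) * rng.standard_normal(reps)  # Xbar | theta ~ N(theta, sigma2/n)
    T_B = m + gamma * (xbar - m)                                    # Bayes estimator
    print(np.mean((T_B - theta) ** 2), delta2)   # the two numbers should nearly agree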
5.7 Minimax estimation of Gaussian location


The Bayes estimators also have interesting properties with regard to the original risk function
R(T, θ), without integration over θ. One such statement (admissibility) was proved in Propo-
sition 2.3.2. It is easy to find the risk R(T, θ) of the Bayes estimator: the usual bias-variance
decomposition gives

    E_θ (T_B(X) - θ)² = E_θ (T_B(X) - E_θ T_B(X))² + (E_θ T_B(X) - θ)².

Consider the Gaussian location model with a mean zero Gaussian prior L(θ) = N(0, τ²). According
to Corollary 5.6.2 we have T_B(X) = γ X̄_n, and hence

    E_θ T_B(X) = γ E_θ X̄_n = τ²θ / (n^{-1}σ² + τ²),
    E_θ T_B(X) - θ = - n^{-1}σ² θ / (n^{-1}σ² + τ²),
    Var_θ(T_B(X)) = γ² Var_θ(X̄_n) = ( τ² / (n^{-1}σ² + τ²) )² n^{-1}σ²,

hence

    R(T_B, θ) = E_θ (T_B(X) - θ)² = ( n^{-1}σ² τ⁴ + (n^{-1}σ²)² θ² ) / (n^{-1}σ² + τ²)²
              = n^{-1}σ² ( τ² / (n^{-1}σ² + τ²) )² ( 1 + n^{-1}σ² θ²/τ⁴ ).

The sample mean is an unbiased estimator, hence

    R(X̄_n, θ) = E_θ (X̄_n - θ)² = Var_θ(X̄_n) = n^{-1}σ².                (5.27)

Comparing the two risks, we note the following facts.

A) Since
    τ² / (n^{-1}σ² + τ²) < 1,
for small values of |θ| we have
    R(T_B, θ) < R(X̄_n, θ),
i.e. for small values of |θ| the estimator T_B is better.

B) For |θ| → ∞, R(T_B, θ) → ∞ whereas R(X̄_n, θ) remains constant. For large values of |θ| the
estimator X̄_n is better.

C) As τ → ∞, the risk of T_B(X) approaches the risk of X̄_n, at any given value of θ.

The sample mean recommends itself as particularly prudent, in the sense that it takes into account
possibly large values of |θ|, whereas the Bayes estimator is more optimistic in the sense that it is
geared towards smaller values of |θ|.
Figure: risks of the Bayes estimators for τ = 1 and τ = 2, and risk of the sample mean (dotted), for
n^{-1}σ² = 1.
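The curves in this figure can be reproduced with a few lines of code (a sketch; the grid of θ-values is arbitrary, while n^{-1}σ² = 1 and τ ∈ {1, 2} mirror the caption).

    import numpy as np

    def risk_bayes(theta, noise_var, tau2):
        # R(T_B, theta) = (noise_var*tau2^2 + noise_var^2*theta^2) / (noise_var + tau2)^2
        return (noise_var * tau2**2 + noise_var**2 * theta**2) / (noise_var + tau2) ** 2

    theta = np.linspace(-4, 4, 9)
    noise_var = 1.0                      # n^{-1} sigma^2
    for tau2 in (1.0, 4.0):              # tau = 1 and tau = 2
        print(tau2, np.round(risk_bayes(theta, noise_var, tau2), 3))
    print("sample mean:", noise_var)     # R(Xbar_n, theta) = n^{-1} sigma^2, constant in theta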

Recall Definition 2.5.1 of a minimax estimator; that concept is the same in the general case: for
any estimator T set

    M(T) = sup_{θ∈Θ} R(T, θ).                                         (5.28)

Here, more generally, sup is written in place of the maximum, since it is not claimed that the
supremum is attained. Conceptually, the criterion is again the worst case risk. An estimator T_M
is called minimax if
    M(T_M) = min_T M(T).

Theorem 5.7.1 In the Gaussian location model for Θ = R (Model M2), the sample mean X̄_n
is a minimax estimator for quadratic loss.

Proof. Suppose that for an estimator T_0 we have

    M(T_0) < M(X̄_n).                                                  (5.29)

Above in (5.27) it was shown that R(X̄_n, θ) = n^{-1}σ² does not depend on θ, hence M(X̄_n) = n^{-1}σ².
It follows that there is ε > 0 such that
    R(T_0, θ) ≤ n^{-1}σ² - ε  for all θ ∈ R.
Take a prior distribution N(0, τ²) for θ; then for the mixed (integrated) risk B(T_0) of the estimator
T_0
    B(T_0) = E R(T_0, θ) ≤ n^{-1}σ² - ε.
If T_{B,τ} is the Bayes estimator for the prior N(0, τ²) given by Corollary 5.6.2 then
    B(T_{B,τ}) = δ_τ² = n^{-1}σ² τ² / (n^{-1}σ² + τ²) → n^{-1}σ²  as τ → ∞.
Hence for τ large enough we have

    B(T_0) < B(T_{B,τ}),

which contradicts the fact that T_{B,τ} is the Bayes estimator. Hence there can be no T_0 with (5.29).

It is essential for this argument that the parameter space is the whole real line. It turns out (see
the exercise below) that for e.g. Θ = [-K, K] we can find an estimator T_{B,τ} which is uniformly strictly
better than X̄_n, so that X̄_n is no longer minimax.
Let us consider the case Θ = [-K, K] in more detail; here we have an a priori restriction |θ| ≤ K. It
is a more complicated problem to find a minimax estimator here. A common approach is to simplify
the problem again, by looking for minimax estimators within a restricted class of estimators. For
simplicity consider the case of the Gaussian location model for n = 1, σ² = 1. A linear estimator is any
estimator
T (X) = aX + b
where a, b are real numbers. The appropriate worst case risk is again (5.28), for the present
parameter space Θ. A minimax linear estimator is a linear estimator T_LM such that

    M(T_LM) = sup_{θ∈Θ} R(T_LM, θ) = min_{T linear} M(T).

Exercise 5.7.2 Consider the Gaussian location model with restricted parameter space Θ = [-K, K],
where K > 0, sample size n = 1 and σ² = 1. (i) Find the minimax linear estimator T_LM. (ii)
Show that T_LM is strictly better than the sample mean X̄_n = X everywhere on Θ = [-K, K] (this
implies that X is not admissible). (iii) Show that T_LM is Bayesian in the unrestricted model Θ = R
for a certain prior distribution N(0, τ²), and find this τ².
Chapter 6
THE MULTIVARIATE NORMAL DISTRIBUTION

Recall the Gaussian location model (Model M2), for sample size n = 1. We can represent X as

    X = θ + ε

where ε is centered normal: L(ε) = N(0, σ²). In a Bayesian statistical approach, assume that θ
becomes random as well: L(θ) = N(0, τ²), in such a way that it is independent of the noise ε.
Thus θ and ε are independent normal r.v.s. For Bayesian statistical inference, we computed the
joint distribution of X and θ; this is well defined as the distribution of the r.v. (θ + ε, θ). The
joint density was

    p(x, θ) = (2πστ)^{-1} exp( -(x - θ)²/(2σ²) - θ²/(2τ²) ).          (6.1)

Here we started with a pair of independent Gaussian r.v.s (ε, θ) and obtained a pair (X, θ). We saw
that the marginal of X is N(0, σ² + τ²); the marginal of θ is N(0, τ²). We have a joint distribution
of (X, θ) in which both marginals are Gaussian, but (X, θ) are not independent: indeed (6.1) is
not the product of its marginals. Note that we took a linear transformation of (ε, θ):

    X = 1·ε + 1·θ,
    θ = 0·ε + 1·θ.

We could have written ε = σ x_1 and θ = τ x_2 for independent standard normals x_1, x_2; then the
linear transformation is

    X = σ x_1 + τ x_2,                                                 (6.2)
    θ = 0·x_1 + τ x_2.

Let us consider a more general situation. We have independent standard normals x_1, x_2, and
consider

    y_1 = m_11 x_1 + m_12 x_2,
    y_2 = m_21 x_1 + m_22 x_2,

where the linear transformation is nonsingular: m_11 m_22 - m_21 m_12 ≠ 0. Nonsingularity is true for
(6.2): it means just στ > 0. Define a matrix

    M = ( m_11  m_12 )
        ( m_21  m_22 )
and vectors

    x = ( x_1 ),   y = ( y_1 ).
        ( x_2 )        ( y_2 )

Then nonsingularity means |M| ≠ 0 (where |M| is the determinant of M). Let us find the joint
distribution of y_1, y_2, i.e. the law of the vector y. For this we immediately proceed to the k-
dimensional case, i.e. consider i.i.d. standard normal x_1, . . . , x_k, and let

    x = (x_1, . . . , x_k)^T,

M be a nonsingular k × k matrix and y = Mx.
Let A be a measurable set in R^k, i.e. a set which has a well defined volume (finite or infinite).
Then (φ is the standard normal density)
    P(y ∈ A) = P(Mx ∈ A) = ∫_{Mx∈A} ∏_{i=1}^k φ(x_i) dx_1 . . . dx_k
             = (2π)^{-k/2} ∫_{Mx∈A} exp( -(1/2) ∑_{i=1}^k x_i² ) dx_1 . . . dx_k
             = (2π)^{-k/2} ∫_{Mx∈A} exp( -(1/2) x^T x ) dx,

where x^T is the transpose of x, x^T x is the inner product (scalar product) and x^T x = ∑_{i=1}^k x_i². Now
let M^{-1} be the inverse of M and write

    x = M^{-1}Mx,    x^T x = (Mx)^T (M^{-1})^T M^{-1} Mx.

Set Σ = MM^T; this is a nonsingular symmetric matrix. Recall the following matrix rules:

    (M^{-1})^T M^{-1} = (MM^T)^{-1} = Σ^{-1},    |Σ| = |M| |M^T| = |M|²,    |M^{-1}| = |M|^{-1}.

We thus obtain

    x^T x = (Mx)^T Σ^{-1} Mx,
    P(y ∈ A) = (2π)^{-k/2} ∫_{Mx∈A} exp( -(1/2) (Mx)^T Σ^{-1} Mx ) dx.

This suggests a multivariate change of variables: setting y = Mx, we have to set x = M^{-1}y and
(formally)

    dx = |M^{-1}| dy = |M|^{-1} dy.

Writing |M| = |Σ|^{1/2}, we obtain

    P(y ∈ A) = (2π)^{-k/2} |Σ|^{-1/2} ∫_{y∈A} exp( -(1/2) y^T Σ^{-1} y ) dy.
Definition 6.1.3 The distribution in R^k given by the joint density of y_1, . . . , y_k

    p(y) = (2π)^{-k/2} |Σ|^{-1/2} exp( -(1/2) y^T Σ^{-1} y )          (6.3)

for some Σ = MM^T, |Σ| ≠ 0 (y = (y_1, . . . , y_k)^T), is called the k-variate normal distribution
N_k(0, Σ).

Theorem 6.1.4 Let x_1, . . . , x_k be i.i.d. standard normal random variables, M be a nonsingular
k × k matrix and
    x = (x_1, . . . , x_k)^T,    y = Mx.
Then y has the distribution N_k(0, Σ), where Σ = MM^T.
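Theorem 6.1.4 is also the standard recipe for simulating a multivariate normal vector: generate i.i.d. standard normals and multiply by a matrix M with MM^T = Σ. A brief sketch (Python/numpy, not part of the original notes; the particular M is arbitrary):

    import numpy as np

    rng = np.random.default_rng(2)
    M = np.array([[2.0, 0.0],
                  [1.0, 1.0]])           # any nonsingular M
    Sigma = M @ M.T                      # covariance of y = Mx
    x = rng.standard_normal((100_000, 2))
    y = x @ M.T                          # each row is one draw of y = Mx
    print(Sigma)
    print(np.cov(y, rowvar=False))       # empirical covariance, close to Sigma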

Clearly all matrices Σ = MM^T with |M| ≠ 0 are possible here. Let us describe the class of possible
matrices Σ differently.

Lemma 6.1.5 Any matrix Σ = MM^T with |M| ≠ 0 is positive definite: for any x ∈ R^k which is
not identically zero,
    x^T Σ x > 0.

Proof. For any x ≠ 0 (0 is the null vector) and z = M^T x

    x^T Σ x = x^T MM^T x = (M^T x)^T (M^T x) = z^T z > 0,

since z^T z = 0 would mean z = 0 and thus x = (M^T)^{-1} z = 0, which was excluded by assumption.
Recall the following basic fact from linear algebra.

Proposition 6.1.6 (Spectral decomposition) Any positive definite k × k-matrix Σ can be written

    Σ = C^T Λ C

where Λ is a diagonal k × k-matrix

    Λ = diag(λ_1, . . . , λ_k)

with positive diagonal elements λ_i > 0, i = 1, . . . , k (called eigenvalues or spectral values) and
C is an orthogonal k × k-matrix:

    C^T C = CC^T = I_k

(I_k is the unit k × k-matrix). If all the λ_i, i = 1, . . . , k are different, Λ is unique and C is unique
up to sign changes of its row vectors.

Lemma 6.1.7 Every positive definite k × k-matrix Σ can be written as MM^T where M is a
nonsingular k × k-matrix.

Proof. It is easy to take a square root Λ^{1/2} of a diagonal matrix Λ: let Λ^{1/2} be the diagonal
matrix with diagonal elements λ_i^{1/2}; then Λ^{1/2} Λ^{1/2} = Λ. Now take M = C^T Λ^{1/2} where C, Λ are
from the spectral decomposition:

    MM^T = C^T Λ^{1/2} Λ^{1/2} C = C^T Λ C = Σ,

and M is nonsingular:

    |M| = |C^T| |Λ^{1/2}| = |C^T| ∏_{i=1}^k λ_i^{1/2} ≠ 0,

and for any orthogonal matrix C one has |C^T| = ±1 since

    |C^T|² = |C^T C| = |I_k| = 1.

As a result, the k-variate normal distribution N_k(0, Σ) is defined for any positive definite matrix
Σ.
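The construction M = C^T Λ^{1/2} in the proof of Lemma 6.1.7 can be carried out numerically with an eigendecomposition; a sketch (the matrix Σ below is just an example of a positive definite matrix):

    import numpy as np

    Sigma = np.array([[2.0, 0.5],
                      [0.5, 1.0]])
    lam, CT = np.linalg.eigh(Sigma)      # Sigma = CT @ diag(lam) @ CT.T, columns of CT orthonormal
    M = CT @ np.diag(np.sqrt(lam))       # one choice of M with M @ M.T = Sigma
    print(np.allclose(M @ M.T, Sigma))   # True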
Recall that for any random vector y in R^k, the covariance matrix is defined by

    (Cov(y))_{i,j} = Cov(y_i, y_j) = E(y_i - Ey_i)(y_j - Ey_j)

if the expectations exist. (Here (A)_{i,j} is the (i,j) entry of a matrix A.) This existence is guaranteed
by the condition ∑_{i=1}^k E y_i² < ∞ (via the Cauchy-Schwarz inequality).

Lemma 6.1.8 The law N_k(0, Σ) has expectation 0 and covariance matrix Σ.

Proof. In the present case, we have y = M x, hence


k
X
yi = mir xr ,
r=1

which immediately shows Eyi = 0. Furthermore,


k ! k !
X X
Cov(yi , yj ) = Eyi yj = E mir xr mjs xs
r=1 s=1

k
X
= E mir mjs xr xs .
r,s=1

Since
1 if r = s
Exr xs = {
0 otherwise
we obtain
k
X
Cov(yi , yj ) = mir mjr = M M > = ()i,j .
i,j
r=1

For the matrix one commonly writes


()i,j = ij , ()i,i = ii = 2i .
Definition 6.1.9 Let x be a random vector with distribution N_k(0, Σ) where Σ is a positive definite
matrix, and let μ be a (nonrandom) vector from R^k. The distribution of the random vector

    y = x + μ

is called the k-variate normal distribution N_k(μ, Σ).

The following result is obvious.

Lemma 6.1.10 (i) The law N_k(μ, Σ) has expectation μ and covariance matrix Σ.
(ii) The density is

    φ_{μ,Σ}(y) := (2π)^{-k/2} |Σ|^{-1/2} exp( -(1/2) (y - μ)^T Σ^{-1} (y - μ) ).

Lemma 6.1.11 L(x) = N_k(0, I_k) if and only if x is a vector of i.i.d. standard normals.

Proof. In this case the joint density is

    φ_{μ,Σ}(x) = φ_{0,I_k}(x) = ∏_{i=1}^k φ(x_i),

which proves that the components x_i are i.i.d. standard normal.

Lemma 6.1.12 Let L(x) = N_k(μ, Σ) where Σ is positive definite, and let A be a (nonrandom)
l × k matrix with rank l (this implies l ≤ k). Then

    L(Ax) = N_l(Aμ, AΣA^T).

Proof. It suces to show

L(Ax A) = L(A(x )) =Nl (0, AA> ),

so we can assume = 0. Let also x =M where L() = Nk (0, Ik ) and = M M > . Then
Ax =AM so for l = k the claim is immediate. In general, for l k, consider also the l l-
matrix AA> ; note that it is positive definite:

a> AA> a > 0

for every nonzero l-vector a since A> a is then a nonzero k-vector (A has rank l). Let

AA> = C > C

be a spectral decomposition. Define D = 1/2 C; then

Ax = AM ,
DAx = DAM .
Suppose first that l = k. Then DAx is multivariate normal with expectation 0 and covariance
matrix

DAM (DAM )> = DAM M > A> D>


= DAA>D> = DC >
CD
>

= 1/2 C C > C C > 1/2


= 1/2 1/2 = Il .

Hence
L(DAx) = Nl (0, Il ),
i.e. DAx is a vector of standard normals.
If l < k then we find a (k l) k matrix F such that

DAM F > = 0, F F > = Ikl .

For this it suces to select the rows of F as k l orthonormal vectors which are a basis of the
subspace of Rk which is orthogonal to the subspace spanned by the rows of DAM . Then for the
k k matrix
DAM
F0 =
F
we have
Il 0
F0 F0> = = Ik ,
0 Ikl
i.e. F0 is orthogonal. Hence F0 is multivariate normal Nk (0, Ik ), i.e. a vector of independent
standard normals. Since DAM consists of the first l elements of F0 , it is a vector of l standard
normals. We have shown again that DAx is a vector of standard normals.
Note that D1 = C > 1/2 , since

DC > 1/2 = 1/2 CC > 1/2 = 1/2 1/2 = Il .


Hence for Ax =D1 DAx

L(Ax) = Nl (0, D1 (D1 )> ) =


= Nl (0, C > 1/2 1/2 C) = Nl (0, AA> ).

Two random vectors x, y with dimensions k, l respectively, are said to have a joint normal distri-
bution if the vector

    z = ( x )
        ( y )

has a (k + l)-variate normal distribution. The covariance matrix Cov(x, y) of x, y is defined by

    (Cov(x, y))_{i,j} = Cov(x_i, y_j),    i = 1, . . . , k,  j = 1, . . . , l;

it is a k × l-matrix. We then have a block structure for the joint covariance matrix Cov(z):

    Cov(z) = ( Cov(x)     Cov(x, y) )
             ( Cov(y, x)  Cov(y)    ).
Theorem 6.1.13 Two random vectors x, y with joint normal distribution are independent if and
only if they are uncorrelated, i.e. if
Cov(x,y) = 0 (6.4)
(where 0 stands for the null matrix).

Proof. Independence here means that the joint density is the product of its marginals. Assume
that both x,y, are centered, i.e. have zero expectation (otherwise expectation can be subtracted,
with (6.4) still true). Write = Cov(z), 11 = Cov(x), 22 = Cov(y), 12 = Cov(x,y), 21 = >12 .
Then (6.4) means that
11 0
= .
0 22
where 0 represents null matrices of appropriate dimensions. A matrix of such a structure is called
block diagonal. It is easy to see that
1
1 11 0
= .
0 1
22

Indeed, both inverses 1 1


11 , 22 exist, since for the determinants

|| = |11 | |22 | (6.5)


and the matrix multiplication rules show
1
11 11 0
1 = = Ik+l .
0 1
22 22

For the joint density of x, y (density of z> = (x> , y> )) we obtain


p(x, y) = 0, (z) =

1 1 > 1 > 1
= exp (x 11 x + y 22 y)
(2)(k+l)/2 |11 |1/2 |22 |1/2 2
= 0,11 (x) 0,22 (y).
This immediately implies that the marginal of x is
Z
p(x) = p(x, y)dy =0,11 (x)

and the marginal of y is 0,22 (y). Thus 0, (z) is the product of its marginals.
Conversely, assume that 0, (z) is the product of its marginals, i.e. x, y are independent. This
implies that for any real valued functions f (x), g(y) of the two vectors we have Ef (x)g(y) =
Ef (x)Eg(y). Let ei(k) be the i-th unit vector in Rk , so e>
i(k) x = xi ; then for the covariance matrix
we obtain

Cov(x,y)i,j = Exi yj = E e> i(k) xe>
j(l) y
= Ee> >
i(k) xEej(l) y =0

since both x, y are centered.


Exercise. Prove (6.5) for a block diagonal Σ, using the spectral decomposition for each of the
blocks Σ_11, Σ_22 and the rule |AB| = |A||B|.
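A quick numerical check of (6.5) is easy to do (a sketch, not a proof; the two blocks below are arbitrary positive definite matrices):

    import numpy as np

    S11 = np.array([[2.0, 0.3], [0.3, 1.0]])
    S22 = np.array([[1.5, -0.2], [-0.2, 0.8]])
    Sigma = np.block([[S11, np.zeros((2, 2))],
                      [np.zeros((2, 2)), S22]])
    print(np.linalg.det(Sigma), np.linalg.det(S11) * np.linalg.det(S22))  # equal up to rounding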
Chapter 7
THE GAUSSIAN LOCATION-SCALE MODEL

Recall the Gaussian location-scale model which was already introduced (see page 32):

Model M3 Observed are n independent and identically distributed random variables X_1, . . . , X_n, each having
law N(μ, σ²), where μ ∈ R and σ² > 0 are both unknown.

The parameter in this model is two dimensional: θ = (μ, σ²) ∈ Θ = R × (0, ∞).

7.1 Confidence intervals


To illustrate the consequences of an unknown variance σ², let us look at the problem of constructing
a confidence interval, beginning with the location model, Model M2. Suppose the confidence
level 1 - α is given (e.g. α = 0.05 or α = 0.01), and try to construct a random interval [μ̂_-, μ̂_+] with

    P_μ( [μ̂_-, μ̂_+] ∋ μ ) ≥ 1 - α,                                    (7.1)

meaning that the probability that the interval [μ̂_-, μ̂_+] covers μ is at least 1 - α (e.g. 95%). Note that both
μ̂_-, μ̂_+ are random variables (functions of the data), so the interval is in fact a random interval.
Therefore the element sign is written in reverse form ∋ to stress the fact that in (7.1) the interval
is random, not μ (μ is merely unknown).
When σ² is known it is easy to build a confidence interval based on the sample mean: since

    L(X̄_n) = N(μ, n^{-1}σ²),

it follows that

    L( (X̄_n - μ) n^{1/2}/σ ) = N(0, 1).                                (7.2)

Definition 7.1.1 Suppose that P is a continuous law on R with density p such that p(x) > 0 for
x > 0 and P((0, ∞)) ≥ 1/2. For every α ∈ (0, 1/2), the uniquely defined number z_α > 0 fulfilling

    ∫_{z_α}^∞ p(x) dx = α

is called the (upper) α-quantile of the distribution P.

The word upper is usually omitted for the standard normal distribution, since it is symmetric
around 0. In our case it follows immediately from (7.2) that

    P( |X̄_n - μ| n^{1/2}/σ ≤ z_{α/2} ) = 1 - α,
hence

    [μ̂_-, μ̂_+] = [ X̄_n - z_{α/2} σ/√n,  X̄_n + z_{α/2} σ/√n ]         (7.3)

is a 1 - α confidence interval for μ.
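For known σ², the interval (7.3) is straightforward to compute; a sketch in Python (scipy.stats supplies the normal quantile; μ, σ, n, α and the simulated data are invented for illustration):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(3)
    mu_true, sigma, n, alpha = 1.0, 2.0, 25, 0.05
    x = mu_true + sigma * rng.standard_normal(n)
    z = norm.ppf(1 - alpha / 2)                    # upper alpha/2-quantile
    half = z * sigma / np.sqrt(n)
    print(x.mean() - half, x.mean() + half)        # realized 95% confidence interval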
Obviously we have to know the variance σ² for this confidence interval, so the procedure breaks
down for the location-scale model. In Proposition 3.0.5, page 32, we already encountered the sample
variance:

    S_n² = n^{-1} ∑_{i=1}^n (X_i - X̄_n)² = n^{-1} ∑_{i=1}^n X_i² - (X̄_n)².

This appears to be a reasonable estimate to substitute for the unknown σ², for various reasons. First,
S_n² is the variance of the empirical distribution P_n: when x_1, . . . , x_n are observed, the empirical
distribution is a discrete law which assigns probability n^{-1} to each point x_i. From the point of
view that x_1, . . . , x_n should be identified with the random variables X_1, . . . , X_n, this is a random
probability distribution, with distribution function

    F_n(t) = n^{-1} ∑_{i=1}^n 1_{(-∞,t]}(X_i).

This is the empirical distribution function (e.d.f.). Obviously if Z is a random variable with
law P_n then (assuming x_1, . . . , x_n fixed)

    x̄_n = n^{-1} ∑_{i=1}^n x_i = EZ,
    s_n² = n^{-1} ∑_{i=1}^n x_i² - (x̄_n)² = Var(Z).

Analogously to the sample mean, we write S_n² for s_n² when this is construed as a random variable.
For the expectation of S_n² we obtain (when the ξ_i are i.i.d. standard normals and X_i = μ + σξ_i)

    E S_n² = E n^{-1} ∑_{i=1}^n (X_i - X̄_n)² = E n^{-1} ∑_{i=1}^n ( (X_i - μ) - (X̄_n - μ) )²
           = E n^{-1} ∑_{i=1}^n σ²(ξ_i - ξ̄_n)² = σ² E( n^{-1} ∑_{i=1}^n ξ_i² - (ξ̄_n)² ).

Since E ξ_i² = 1 and L(ξ̄_n) = N(0, n^{-1}), we have E(ξ̄_n)² = n^{-1}, and thus

    E S_n² = σ² (1 - n^{-1}) = (n-1)/n · σ².

Thus S_n² is not unbiased, but an unbiased estimator is

    S̃_n² = n/(n-1) · S_n² = 1/(n-1) ∑_{i=1}^n (X_i - X̄_n)².
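A short simulation confirming E S_n² = σ²(n-1)/n and the unbiasedness of the corrected version (a sketch; the sample size, σ² and number of repetitions are chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(4)
    n, sigma2, reps = 5, 3.0, 200_000
    x = np.sqrt(sigma2) * rng.standard_normal((reps, n))
    S2 = x.var(axis=1, ddof=0)               # S_n^2 (divide by n)
    S2_corr = x.var(axis=1, ddof=1)          # corrected version (divide by n-1)
    print(S2.mean(), sigma2 * (n - 1) / n)   # biased: both approximately 2.4
    print(S2_corr.mean(), sigma2)            # unbiased: both approximately 3.0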

Suppose we plug in S̃_n² for σ² in the confidence interval (7.3). We cannot expect this to be a
1 - α confidence interval, since (7.3) is based on the standard normal law of (X̄_n - μ) n^{1/2}/σ via (7.2),
and the quantity

    T = T(X) = (X̄_n - μ) n^{1/2} / S̃_n

cannot be expected to have a normal law. (Here S̃_n = (S̃_n²)^{1/2}.) However we can hope to identify this
distribution, and then base a confidence interval upon it (by taking a quantile).
Note that T is not a statistic, since it depends on the unknown parameter μ. However, reverting
again to the representation X_i = μ + σξ_i, we obtain

    T(X) = σ ξ̄_n n^{1/2} / ( σ² (n-1)^{-1} ∑_{i=1}^n (ξ_i - ξ̄_n)² )^{1/2}
         = ξ̄_n n^{1/2} / ( (n-1)^{-1} ∑_{i=1}^n (ξ_i - ξ̄_n)² )^{1/2}.

We see that the distribution of T does not depend on the parameters μ and σ²; it depends only
on the sample size n (i.e. the number of i.i.d. standard normals ξ_i involved).

7.2 Chi-square and t-distributions


Definition 7.2.1 Let X_1, . . . , X_n be independent N(0, 1). The distribution of the statistic

    T = T(X) = X̄_n n^{1/2} / S̃_n

is called the t-distribution with n - 1 degrees of freedom (denoted t_{n-1}). The statistic T is called
the t-statistic.

Lemma 7.2.2 The t-distribution is symmetric around 0, i.e. for x ≥ 0

    P(T ≤ -x) = P(T ≥ x).

Proof. This follows immediately from the fact that L(X_i) = N(0, 1) = L(-X_i).


This suggests that a confidence interval based on the t-distribution can be built in the same fashion
as for the standard normal, i.e. taking an upper quantile.
It remains to find the actual form of the t-distribution, in order to compute its quantiles. The
following result prepares this derivation.

Theorem 7.2.3 Let X_1, . . . , X_n be n i.i.d. r.v.s each having law N(μ, σ²).
(i) The sample mean X̄_n and the sample variance S_n² are independent random variables.
(ii) The bias corrected sample variance S̃_n² = nS_n²/(n - 1) can be represented as

    S̃_n² = σ²/(n-1) · ∑_{i=1}^{n-1} ξ_i²

where ξ_1, . . . , ξ_{n-1} are i.i.d. standard normals, independent of X̄_n.

Proof. Let X be the vector X = (X_1, . . . , X_n)^T; then X has a multivariate normal distribution
with covariance matrix σ² I_n. To describe the expectation vector, let 1_n = (1, . . . , 1)^T be the vector
in R^n consisting of 1s. Then

    L(X) = N_n(μ 1_n, σ² I_n).
Consider the linear subspace of R^n of dimension n - 1 which is orthogonal to 1_n. Let b_1, . . . , b_{n-1}
be an orthonormal basis of this subspace and b_n = n^{-1/2} 1_n. Then b_n also has norm 1 (‖b_n‖ = 1),
the whole set b_1, . . . , b_n is an orthonormal basis of R^n, and the n × n-matrix

    B = ( b_1^T )
        ( b_2^T )
        ( ...   )
        ( b_n^T )

is orthogonal. Define Y = BX; then

    L(Y) = N_n(μ B1_n, σ² I_n)

and the components Y_1, . . . , Y_n are independent. (Indeed when the covariance matrix of a multi-
variate normal is of the form σ² I_n, or more generally a diagonal matrix, then the joint density of the Y_i is
the product of its marginals.) Moreover

    Y_n = b_n^T X = n^{-1/2} 1_n^T X = n^{1/2} X̄_n,                    (7.4)
    EY_j = E b_j^T X = μ b_j^T 1_n = 0,   j = 1, . . . , n - 1.

It follows that Y_1, . . . , Y_{n-1} are independent N(0, σ²) random variables and are independent of X̄_n.
Now

    ∑_{i=1}^n Y_i² = ‖Y‖² = X^T B^T B X = X^T X = ∑_{i=1}^n X_i².

On the other hand, in view of (7.4)

    ∑_{i=1}^n Y_i² = ∑_{i=1}^{n-1} Y_i² + n X̄_n²,

so that

    ∑_{i=1}^{n-1} Y_i² = ∑_{i=1}^n X_i² - n X̄_n² = n S_n².

This shows that S_n² is a function of Y_1, . . . , Y_{n-1} and hence independent of X̄_n, which establishes
(i). Dividing both sides in the last display by n - 1 and setting ξ_i = Y_i/σ establishes (ii).

Definition 7.2.4 Let X_1, . . . , X_n be independent N(0, 1). The distribution of the statistic

    χ² = χ²(X) = ∑_{i=1}^n X_i²

is called the chi-square distribution with n degrees of freedom, denoted χ²_n.

To find the form of the t-distribution, we need to find the density of the ratio of a normal variable and
the square root of an independent χ²-variable. We begin by deriving the density of the χ²-
distribution. The following lemma immediately follows from the above definition.
Figure 1: The densities of χ²_n for n = 5, 10, 15, . . . , 30

Lemma 7.2.5 Let Y_1, Y_2 be independent r.v.s with laws

    L(Y_1) = χ²_k,    L(Y_2) = χ²_l,    k, l ≥ 1.

Then
    L(Y_1 + Y_2) = χ²_{k+l}.

Proposition 7.2.6 The density of the law χ²_n is

    f_n(x) = 1/( 2^{n/2} Γ(n/2) ) · x^{n/2-1} exp(-x/2),   x ≥ 0.

Comment. In Section 5.3 a family of Gamma densities was introduced as

    f_λ(x) = 1/Γ(λ) · x^{λ-1} exp(-x)

for λ > 0. Define more generally, for some β > 0,

    f_{λ,β}(x) = 1/( Γ(λ) β^λ ) · x^{λ-1} exp(-x β^{-1}).

The corresponding law is called the Γ(λ, β)-distribution. The chi-square distribution χ²_n thus
coincides with the distribution Γ(n/2, 2). The figure shows the χ²_n-densities for 6 values of n
(n = 5, 10, . . . , 30). From the definition it is clear that χ²_n has expectation n.

Proof of the Proposition. Recall that for λ > 0

    Γ(λ) = ∫_0^∞ x^{λ-1} exp(-x) dx,
    Γ(λ + 1) = λ Γ(λ).
Our proof will proceed by induction. Start with 21 : let X1 be N (0, 1); then
Z t1/2 2
2 1/2 1/2 1 z
P X1 t = P t X1 t =2 1/2
exp dz
0 (2) 2
A change of variable x = z 2 , dz = (1/2x1/2 )dx gives
Z t x
2 1
P X1 t = 1/2
exp dx
0 (2x) 2
and we obtain
1
f1 (x) = x1/21 exp(x/2), x 0.
21/2 1/2
Now
(1/2) = 1/2 (7.5)
follows from the fact that f1 integrates to one and the definition of the gamma-function. We
obtained the density of 21 as claimed.
For the induction step, we assume that L(Y1 ) = 2n and L(Y2 ) = 21 . By the previous lemma, fn+1
is the convolution of fn and f1 : assuming the densities zero for negative argument, we obtain
Z
fn+1 (x) = fn (y)f1 (x y)dy
0

(Indeed the convolution of densities is the operation applied to densities of two independent r.v.s
for obtaining the density of the sum). Hence
Z x
1 1
fn+1 (x) = n/2
yn/21 exp(y/2) 1/2 (x y)1/2 exp((x y)/2)dy
0 2 (n/2) 2 (1/2)
Z x
1
= exp(x/2) y n/21 (x y)1/2 dy
2(n+1)/2 (n/2)(1/2) 0

To compute the integral, we note (with a change of variables u = y/x)


Z x Z x n/21
n/21 1/2 n/21 1/2 y y 1/2 1
y (x y) dy = x x x 1 dy
0 0 x x x
Z 1
= x (n+1)/21
un/21 (1 u)1/2 du
0
(n+1)/21
= x B (n/2, 1/2)
where B (n/2, 1/2) is the Beta integral:
Z 1
()()
B(, ) = u1 (1 u)1 du =
0 ( + )
(cf. Section 5.3 on the Beta densities). Thus, collecting results, we get
1 (n/2)(1/2)
fn+1 (x) = exp(x/2)x(n+1)/21
2(n+1)/2 (n/2)(1/2) ((n + 1)/2)
1
= x(n+1)/21 exp(x/2)
2(n+1)/2 ((n + 1)/2)
which is the form of fn+1 (x) claimed.
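The density of Proposition 7.2.6 can be checked against a Monte Carlo histogram of ∑ X_i² (a sketch; the gamma function comes from scipy.special, everything else is plain numpy, and the bin grid is arbitrary):

    import numpy as np
    from scipy.special import gamma as Gamma

    def chi2_density(x, n):
        # f_n(x) = x^{n/2-1} e^{-x/2} / (2^{n/2} Gamma(n/2)), x >= 0
        return x ** (n / 2 - 1) * np.exp(-x / 2) / (2 ** (n / 2) * Gamma(n / 2))

    rng = np.random.default_rng(5)
    n = 5
    samples = (rng.standard_normal((200_000, n)) ** 2).sum(axis=1)
    hist, edges = np.histogram(samples, bins=50, range=(0, 20), density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    print(np.max(np.abs(hist - chi2_density(mid, n))))   # small discrepancy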
Proposition 7.2.7 (i) The law t_n is the distribution of

    n^{1/2} Z_1 / Z_2^{1/2}

where Z_1, Z_2 are independent r.v.s with standard normal and χ²_n-distribution, respectively.
(ii) The density of the law t_n is

    f_n(x) = Γ((n + 1)/2) / ( (nπ)^{1/2} Γ(n/2) ) · ( 1 + x²/n )^{-(n+1)/2},   x ∈ R.

Proof. (i) follows immediately from Definition 7.2.1 and Theorem 7.2.3: for a r..v T with tn1 -
distribution we obtain, if i are defined as in the proof of Theorem 7.2.3

Xn n1/2 n
T = = 1/2
Sn Pn1
1
n1 i=1 2i
(n 1)1/2 n (n 1)1/2 Z1
= P 1/2 = 1/2
.
n1 2
Z2
i=1 i

To prove (ii), we note that the law tn is symmetric, so we can proceed via the law of the squared
variable nZ12 /Z1 . Here L(Z12 ) = 21 , and the joint density of Z12 , Z2 is
1 1
g(t, u) = t1/21 exp(t/2) un/21 exp(u/2), t, u 0.
21/2 (1/2) 2n/2 (n/2)

Consequently
2 Z
Z1 1
P x = (n+1)/2
un/21 t1/2 exp((t + u)/2) dt du
Z2 t/ux,u0,t0 2 (1/2)(n/2)

Substitute t by v = t/u, then dt = udv and the above integral is


Z
1
(n+1)/2 (1/2)(n/2)
un/21 (vu)1/2 exp((vu + u)/2) udv du
0vx,u0 2
Z
1
= (n+1)/2
u(n+1)/21 v 1/2 exp(((v + 1)u)/2) dv du
0vx,u0 2 (1/2)(n/2)

Another substitution of u by z = (v + 1)u makes this


Z Z
1 1 1
(n+1)/2 (n+1)/21
z (n+1)/21 v1/2 exp(z/2) dz dv.
0vx z0 2 (1/2)(n/2) (v + 1) v+1

Here the terms depending on z together with appropriate constants (when we also divide and
multiply by ((n+1)/2)) form the density of 2n+1 , so that the expression becomes, after integrating
out z, 2
Z
((n + 1)/2) v1/2 Z1
(n+1)/2
dv = P x .
0vx (1/2)(n/2) (v + 1) Z2
A final change of variables s = (nv)1/2 , dv = n1 2sds gives


! Z (n+1)/2
n1/2 |Z1 | ((n + 1)/2) s2
P 0 1/2
a = 2 1/2 (n/2)
1+ ds
Z2 0sa (n) n
!
n1/2 Z1
= 2P 0 1/2
a .
Z2

Now dierentiating w.r. to a gives the form of the density (ii).


Let us visualize the forms of the normal and the t-distribution.

The t-density with 3 degrees of freedom against the standard normal (dotted)
Clearly the t-distribution has heavier tails, which means that the quantiles (now called t_{α/2} in place
of z_{α/2}) are farther out and the confidence interval is wider. A narrower confidence interval, for the
same level α, is preferable. Thus the absence of knowledge of the variance σ² is reflected in
less sharp confidence statements.

Theorem 7.2.8 Let t_{α/2} be the upper α/2-quantile of the t-distribution with n - 1 degrees of
freedom. Then in Model M3, the interval

    [μ̂_-, μ̂_+] = [ X̄_n - S̃_n t_{α/2}/√n,  X̄_n + S̃_n t_{α/2}/√n ]

is a 1 - α confidence interval for the unknown expected value μ.
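A sketch of the t-interval of Theorem 7.2.8 in code (scipy's t.ppf supplies the quantile; the data are simulated only to have something to plug in, and the true μ and σ are of course unknown to the procedure):

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(6)
    x = 1.0 + 2.0 * rng.standard_normal(12)       # n = 12 observations
    n, alpha = len(x), 0.05
    s_corr = x.std(ddof=1)                        # square root of the corrected sample variance
    half = t.ppf(1 - alpha / 2, df=n - 1) * s_corr / np.sqrt(n)
    print(x.mean() - half, x.mean() + half)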

The t-distribution was found by Gosset (The probable error of a mean, Biometrika 6, 1908),
who wrote under the pseudonym Student. The distribution is frequently called Student's t.
A notion of studentization has been derived from it: in the Gaussian location model M2, the
statistic

    Z = X̄_n n^{1/2} / σ

is sometimes called the Z-statistic. It is standard normal when μ = 0 and can therefore be used
to test the hypothesis μ = 0 (for the theory of testing hypotheses cf. later sections). In the
location-scale model, σ is not known, and by substituting S̃_n one forms the t-statistic

    T = X̄_n n^{1/2} / S̃_n.

The procedure of substituting the unknown variance σ² by its estimate S̃_n² is called studentization.

Remark 7.2.9 Consider the absolute moment of order r (integer) of the t_n-distribution:

    E|T|^r = m_r = C_n ∫_0^∞ x^r ( 1 + x²/n )^{-(n+1)/2} dx

where C_n is the appropriate normalizing constant. For x → ∞, the integrand is of order x^r x^{-(n+1)} =
x^{r-n-1}, so this integral is finite only for r < n. For r = n, since ∫_1^∞ x^{-1} dx = ∞, the r-th moment
does not exist. This illustrates the fact that the t-distribution has heavy tails compared to the
normal distribution; indeed E|Z|^r < ∞ for all r if Z is standard normal.

7.3 Some asymptotics


We saw in the plot of the t-distribution that already for 3 degrees of freedom it appears close to
the normal. Write

    n^{1/2} Z_1 / Z_2^{1/2} = Z_1 / ( n^{-1} Z_2 )^{1/2}

where Z_1, Z_2 are independent r.v.s with standard normal and χ²_n-distribution, respectively. Since

    n^{-1} Z_2 = n^{-1} ∑_{i=1}^n ξ_i²

for some i.i.d. standard normals ξ_i, we have by the law of large numbers

    n^{-1} ∑_{i=1}^n ξ_i² →_P E ξ_1² = 1.

This suggests that the law of n^{1/2} Z_1/Z_2^{1/2}, i.e. the law t_n, should become close to the law of Z_1 as
n → ∞. Let us formally prove that statement. We begin with a recap of some probability notions.

Definition 7.3.1 Let F_n, n = 1, 2, . . . be a sequence of distribution functions. The F_n are said to
converge in distribution to a limit F (a distribution function) if

    F_n(t) → F(t)  as  n → ∞

for every point of continuity t of F.

A point of continuity is obviously a point where F is continuous. Any distribution function F has
left and right side limits at every point t, so continuity means that these limits coincide at t. When F is
continuous, then in the above statement t must run through all t ∈ R. For instance, the distribution
function of every N(μ, σ²) is continuous.
It is also said that a sequence of r.v.s Y_n converges in distribution (or in law), written

    Y_n ⇒_d F,

when the d.f.s of the Y_n converge in distribution to F. One also writes Y_n →_L Y for a r.v. Y having that
distribution function F (or also F_n →_L F, F_n →_D F).

Example 7.3.2 Convergence in probability: if F is the d.f. of the random variable Y = 0 (which
is always 0 !) then

    F(t) = 1_{[0,∞)}(t),

i.e. F jumps only at 0. Now Y_n ⇒_d Y is equivalent to Y_n →_P 0 (Exercise).

Example 7.3.3 Let L(Y_n) = B(n, λn^{-1}) and L(Y) = Po(λ); then for all events A

    sup_A |P(Y_n ∈ A) - P(Y ∈ A)| → 0,

which implies Y_n ⇒_d Y (take A = (-∞, t]).

Example 7.3.4 (Central Limit Theorem). Let Y_1, . . . , Y_n be independent identically distributed
r.v.s with distribution function F and finite second moment EY_1² < ∞. Let

    Ȳ_n = n^{-1} ∑_{i=1}^n Y_i

be the average (or sample mean) and σ² = Var(Y_1). Then for fixed F and n → ∞

    n^{1/2} ( Ȳ_n - EY_1 ) ⇒_d N(0, σ²).                              (7.6)

The normal law N(0, σ²) is continuous, so in the CLT the relation ⇒_d means convergence of the
d.f. of the standardized sum n^{1/2}( Ȳ_n - EY_1 ) to the normal d.f. at every t ∈ R.

Lemma 7.3.5 Suppose X_n is a sequence of r.v.s which converges in distribution to a continuous
limit law P_0:

    L(X_n) ⇒_d P_0,

and let Y_n be a sequence of r.v.s which converges in probability to 0:

    Y_n →_P 0.

Then
    L(X_n + Y_n) ⇒_d P_0.
Note that no independence assumptions were made.


Proof. Let Fn be the distribution function of Xn and F0 be the respective d.f. of the law P0 .
Convergence in distribution means that
P (Xn t) = Fn (t) F0 (t)
for every continuity point of the limit d.f. F0 . We assumed that P0 is a continuous law, so it means
convergence for every t. Now for > 0
P (Xn + Yn t) = P ({Xn + Yn t} ({|Yn | } {|Yn | > }))
= P ({Xn + Yn t} {|Yn | }) + P ({Xn + Yn t} {|Yn | > }).
The first term on the right is
P ({Xn t Yn } {|Yn | }) P ({Xn t + } {|Yn | })
P (Xn t + ).
The other term is
P ({Xn + Yn t} {|Yn | > }) P (|Yn | > ) 0 as n .
On the other hand we have
P (Xn t + ) F0 (t + ) as n .
Hence for every > 0 we can find m1 such that for all n m1
P (Xn + Yn t) F0 (t + ) + 2. (7.7)
Now take the same > 0; we have
P (Xn t ) =

P ({Xn t } {|Yn | }) + P ({Xn t } {|Yn | > })


P ({Xn + Yn t} {|Yn | }) + P (|Yn | > )
P (Xn + Yn t) + P (|Yn | > ).
Consequently
P (Xn + Yn t) P (Xn t ) P (|Yn | > ).
Using again the two limits for the probabilities on the right, for every > 0 we can find m2 such
that for all n m2
P (Xn + Yn t) F0 (t ) 2. (7.8)
Taking m = max(m1 , m2 ) and collecting (7.7), (7.8), we obtain for n m
F0 (t ) 2 P (Xn + Yn t) F0 (t + ) + 2.
Since F0 is continuous at t, and was arbitrary, we can select such that
F0 (t + ) F0 (t) + ,
F0 (t ) F0 (t)
so that for n large enough
F0 (t) 3 P (Xn + Yn t) F0 (t) + 3
and since was also arbitrary, the result follows.
Lemma 7.3.6 Under the assumptions of Lemma 7.3.5, we have

    X_n Y_n →_P 0.

Proof. Let > 0; and > 0 be arbitrary and given. Suppose |Xn Yn | . Then, for every T > 0,
either {|Xn | > T }, or if that is not the case, then |Yn | /T . Hence .
P (|Xn Yn | ) P (|Xn | T ) + P (|Yn | /T ) . (7.9)
Now for every T > 0
P (|Xn | > T ) = 1 Fn (Xn T ) + Fn (Xn < T )
1 Fn (Xn T ) + Fn (Xn T ) .
Since Fn converges to F0 at both points T , T , we find m1 = m1 (T ) (depending on T ) such that
for all n m1
P (|Xn | T ) 1 F0 (T ) + F0 (T ) +
Select now T large enough such that
1 F0 (T ) , F0 (T ) .
Then for all n m1 (T )
P (|Xn | T ) 3.
On the other hand, once T is fixed, in view of convergence in probability to 0 of |Yn |, one can find
m2 such that for all n m2
P (|Yn | /T ) .
In view of (7.9) we have for all n m = max(m1 , m2 )
P (|Xn Yn | ) 4.
Since > 0 was arbitrary, the result is proved.
We need an auxiliary result which, despite its simplicity, is still frequently cited with a name attached
to it.

Lemma 7.3.7 (Slutsky's theorem). Suppose a sequence of random variables X_n converges in
probability to a number x_0 (X_n →_P x_0 as n → ∞). Suppose f is a real valued function defined in
a neighborhood of x_0 and continuous there. Then

    f(X_n) →_P f(x_0),   n → ∞.

Proof. Consider an arbitrary > 0. Select > 0 small enough such that (x0 , x0 + ) is in
the abovementioned neighborhood of x0 and also fulfilling the condition that |z x0 | implies
|f (z) f (x0 )| (by continuity of f such a can be found). Then the event |f (Xn ) f (x0 )| >
implies |Xn x0 | > and hence
P (|f (Xn ) f (x0 )| > ) P (|Xn x0 | > ) .
Since the latter probability tends to 0 as n , we also have
P (|f (Xn ) f (x0 )| > ) 0 as n
and since was arbitrary, the result is proved.
Theorem 7.3.8 The t-distribution with n degrees of freedom converges (in distribution) to the
standard normal law as n → ∞.

Proof. Let

    X_n = Z_1 / ( n^{-1} Z_2 )^{1/2}

where Z_1, Z_2 are independent r.v.s with standard normal and χ²_n-distribution, respectively. We
know already that
    n^{-1} Z_2 →_P 1.
By Slutsky's theorem, for the r.v. Y_{1,n}

    Y_{1,n} := ( n^{-1} Z_2 )^{-1/2} →_P 1,

since the function g(x) = x^{-1/2} is continuous at 1 and defined for x > 0. (Indeed we need consider
only those x, since n^{-1} Z_2 > 0 with probability one.) Now

    X_n = Z_1 Y_{1,n} = Z_1 + Z_1 (Y_{1,n} - 1).

Now Y_{1,n} - 1 →_P 0, hence by Lemma 7.3.6 Z_1(Y_{1,n} - 1) →_P 0. Moreover Z_1 is a constant sequence with
law N(0, 1), which certainly converges in law to N(0, 1). Then by Lemma 7.3.5, X_n ⇒_d N(0, 1).

We can translate this limiting statement about the t-distribution into a confidence statement.

Theorem 7.3.9 Let z_{α/2} be the upper α/2-quantile of N(0, 1). Then in the Gaussian location-scale
model Mc,2, the interval

    [μ̂_-, μ̂_+] = [ X̄_n - S̃_n z_{α/2}/√n,  X̄_n + S̃_n z_{α/2}/√n ]

is an asymptotic 1 - α confidence interval for the unknown expected value μ:

    liminf_{n→∞} P_{μ,σ²}( [μ̂_-, μ̂_+] ∋ μ ) ≥ 1 - α.

Here the same quantiles as in the exact interval (7.3) are used, but S̃_n replaces the unknown σ. In
summary: if σ² is unknown, one has the choice between an exact confidence interval (which keeps
level 1 - α) based on the t-distribution, or an asymptotic interval (which keeps the confidence level
only approximately) based on the normal law. The normal interval would be shorter in general:
consider e.g. 10 degrees of freedom and α = 0.05; then for the t-distribution we have t_{α/2} = 2.228,
whereas the normal quantile is z_{α/2} = 1.96 (cf. the tables of the normal and t-distributions on pp.
608/609 of [CB]).
Note that in subsection 1.3.2 we discussed a nonasymptotic confidence interval for the Bernoulli
parameter p based on the Chebyshev inequality, and mentioned that alternatively, a large sample
approximation based on the CLT could also have been used. In this section we developed the tools
for this.
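The quantile comparison quoted above (2.228 vs. 1.96 for 10 degrees of freedom and α = 0.05), and the way the t-quantile approaches the normal one as the degrees of freedom grow (Theorem 7.3.8), can be reproduced directly (a sketch; the list of degrees of freedom is arbitrary):

    from scipy.stats import norm, t

    alpha = 0.05
    for df in (3, 10, 30, 100):
        print(df, round(t.ppf(1 - alpha / 2, df), 3), round(norm.ppf(1 - alpha / 2), 3))
    # the t-quantile decreases towards the normal quantile 1.960 as df grows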
Chapter 8
TESTING STATISTICAL HYPOTHESES

8.1 Introduction
Consider again the basic statistical model where X is an observed random variable with values in
X and the law L(X) is known up to a parameter θ from a parameter space Θ: L(X) ∈ {P_θ; θ ∈ Θ}.
This time we do not restrict the nature of Θ; this may be a general set, possibly even a set of laws
(in this case θ is identified with L(X)). Suppose the parameter set is divided into two subsets:
Θ = Θ_0 ∪ Θ_1 where Θ_0 ∩ Θ_1 = ∅. The problem is to decide on the basis of an observation of X
whether the unknown θ belongs to Θ_0 or to Θ_1. Thus two hypotheses are formulated:

    H : θ ∈ Θ_0,  the hypothesis,
    K : θ ∈ Θ_1,  the alternative.

Of course both of these are hypotheses, but in testing theory they are treated in a nonsymmetric
way (to be explained). In view of this nonsymmetry, one of them is called the hypothesis and
the other the alternative. It is traditional to write them as above, with the letter H for the first (the
hypothesis) and K for the second (the alternative).

Example 8.1.1 Assume that a new drug has been developed, which is supposed to have a higher
probability p of success when applied to an average patient. The new drug will be introduced only
if a high degree of certainty can be obtained that it is better. Suppose p_0 is the known probability
of success of the old drug. Clinical trials are performed to test the hypothesis that p > p_0. For
the new drug, n patients are tested independently, and success of the drug is measured (we assume
that only success or failure of the treatment can be seen in each case). Let the j-th experiment
(patient) for the new drug be X_j; assume that the X_j are independent B(1, p). Thus observations
are X = (X_1, . . . , X_n), the X_j are i.i.d. Bernoulli r.v.s, θ = p and Θ = (0, 1). The hypotheses
are Θ_0 = (0, p_0] and Θ_1 = (p_0, 1).

The motivation for a nonsymmetric treatment of the hypotheses is evident in this example: if the
statistical evidence is inconclusive, one would always stay with the old drug. There can be no
question of treating H and K the same way. Recall that in section 5.5 we briefly discussed the problem
of estimating a signal θ ∈ {0, 1} (binary channel, Gaussian channel), where basically both values 0
and 1 are treated the same way, e.g. we use a Bayesian decision rule for prior probabilities 1/2. In
contrast, here one will decide for Θ_1, i.e. decide in favor of the new drug, only if there is reasonable
statistical certainty.

Definition 8.1.2 A test is a decision rule characterized by an acceptance region S ⊆ X, i.e. a
(measurable) subset of the sample space, such that

    X ∈ S means that θ ∈ Θ_0 is accepted,
    X ∈ S^c means that θ ∈ Θ_0 is rejected (thus θ ∈ Θ_1 is accepted).

The complement S^c is called the critical region (rejection region).

Formally, a test is usually defined as a statistic φ which is the indicator function of a set S^c, such
that
    φ(X) = 1_{S^c}(X),
where a value φ(X) = 1 is understood as a decision that θ ∈ Θ_0 is rejected (and 0 that it is
accepted).
In the above example, a reasonable test would be given by a rejection region

    S^c = { x : n^{-1} ∑_{j=1}^n x_j > c }

for realizations x = (x_1, . . . , x_n), where c is some number fixed in advance. One would decide
θ ∈ Θ_1 if the proportion of successes in the sample is large enough.
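A short simulation of this test (Python/numpy, not part of the original notes) estimates its two error probabilities; the values p_0 = 0.6, c = 0.8, n = 15 anticipate the continuation of the example below, and the alternative value p = 0.9 is invented for illustration.

    import numpy as np

    rng = np.random.default_rng(7)
    n, c, reps = 15, 0.8, 100_000

    def rejection_rate(p):
        # proportion of simulated samples whose success frequency exceeds c
        x = rng.binomial(1, p, size=(reps, n))
        return (x.mean(axis=1) > c).mean()

    print(rejection_rate(0.6))       # error of the first kind at p = p_0 = 0.6 (small)
    print(1 - rejection_rate(0.9))   # error of the second kind at p = 0.9 in the alternative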
Tests and estimators have in common that they are statistics which are also decision rules. But the
nature of the decisions is different, which is reflected in the loss functions. The natural common
loss function for testing problems is the 0-1 loss: the loss is 0 if the decision is correct, and 1 otherwise. Thus
if d ∈ {0, 1} is one of the two possible decisions then the loss function is

    L(d, θ) = d        if θ ∈ Θ_0,
              1 - d    if θ ∈ Θ_1.

As for estimation, the risk of a decision rule at parameter value θ is the expected loss when θ is
the true parameter. The decision rule is φ(X) = 1_{S^c}(X); its risk is

    R(φ, θ) = E_θ L(φ(X), θ) = E_θ φ(X) = P_θ(S^c)       if θ ∈ Θ_0,
              1 - E_θ φ(X) = P_θ(S)                      if θ ∈ Θ_1.

Thus the risk coincides with the probability of error in each case: for θ ∈ Θ_0, an erroneous
decision is made when X ∈ S^c; when θ ∈ Θ_1 is true, then the error in the decision occurs when
X ∈ S.
Since both probabilities of error are functions of E_θ(1 - φ(X)) = P_θ(S), one can equivalently work
with the operation characteristic (OC):

    OC(φ, θ) = P_θ(S) = 1 - R(φ, θ)    if θ ∈ Θ_0,
               R(φ, θ)                 if θ ∈ Θ_1.

A test with zero risk would require a set S such that P_θ(S) = 1 if θ ∈ Θ_0, P_θ(S) = 0 if θ ∈ Θ_1.
This is possible only in degenerate cases: if such an S ⊆ X exists then the families {P_θ; θ ∈ Θ_0}
and {P_θ; θ ∈ Θ_1} are said to be orthogonal. In this case sure decisions are possible according
to whether X ∈ S or not, and one is led outside the realm of statistics.

Example 8.1.3 Let U(a, b) be the uniform law on the interval (a, b), Θ = {0, 1} and P_0 = U(0, 1),
P_1 = U(1, 2). Here S = (0, 1) gives zero error probabilities.
In typical statistical situations, the probability P_θ(S) depends continuously on the parameter,
and the sets Θ_0, Θ_1 border each other. In this case the risk R(φ, θ) cannot be near 0 on the
common border.

Example 8.1.4 Let Θ = R, P_θ = N(θ, σ²) (σ² fixed), Θ_0 = (-∞, 0], Θ_1 = (0, ∞) (Gaussian
location model, Model Mc,1, for sample size n = 1). Reasonable acceptance regions are

    S_a = {x : x ≤ a}

for some a. The OC of any such test φ_a = 1_{S_a^c} is (if ξ is a standard normal r.v. such that
X = θ + σξ)

    OC(φ_a, θ) = P_θ(S_a) = P_θ(X ≤ a) = P(ξ ≤ (a - θ)/σ) = Φ((a - θ)/σ),

where Φ is the standard normal distribution function.



Example 8.1.1 (continued). Suppose p_0 = 0.6 and use a critical region {X̄_n > c} for c = 0.8. We
assume n is not small and use the De Moivre-Laplace central limit theorem for an approximation
to the OC. We have

    OC(φ, p) = P_p(X̄_n ≤ c) = P_p( n^{1/2}(X̄_n - p)/(p(1 - p))^{1/2} ≤ n^{1/2}(c - p)/(p(1 - p))^{1/2} )
             ≈ Φ( n^{1/2}(c - p)/(p(1 - p))^{1/2} ),

where ≈ means approximate equality. For a visualization of this approximation to the OC, see the
plot below.
These examples show that it is not generally possible to keep error probabilities uniformly small
under both hypotheses.

Definition 8.1.5 Suppose that L(X) ∈ {P_θ; θ ∈ Θ} and φ is a test for H : θ ∈ Θ_0 vs. K : θ ∈ Θ_1
with acceptance region S.
(i) An error of the first kind is: φ takes value 1 when θ ∈ Θ_0 is true.
An error of the second kind is: φ takes value 0 when θ ∈ Θ_1 is true.
(ii) φ has significance level α (or level α) if

    P_θ(φ(X) = 1) ≤ α  for all θ ∈ Θ_0.

(iii) For θ ∈ Θ_1, the probability

    β(θ) = P_θ(φ(X) = 1)

is called the power of the test φ at θ.


Figure: OC of the critical region {x > 2} for testing H : θ ≤ 0 vs. K : θ > 0 for the family N(θ, 1), θ ∈ R.

Figure: approximation to the OC for the critical region {X̄_n > c} for testing H : p ≤ p_0 vs. K : p > p_0 for the
i.i.d. Bernoulli model (Model Md,1), with p_0 = 0.6, c = 0.8, n = 15.

Thus significance level α means that the probability of an error of the first kind is uniformly less than
α. The power is the probability of not making an error of the second kind for a given θ in the
alternative.
In terms of the risk, φ has level α if R(φ, θ) ≤ α for all θ ∈ Θ_0, and the power is

    β(θ) = 1 - R(φ, θ),   θ ∈ Θ_1.
In terms of the OC, φ has level α if OC(φ, θ) ≥ 1 - α for θ ∈ Θ_0, and the power is

    β(θ) = 1 - OC(φ, θ),   θ ∈ Θ_1.

In example 8.1.1, it is particularly apparent why the error of the first kind is very sensitive, so
that its probability should be kept under a given small α. When actually the old drug is better
(θ ∈ Θ_0), but we decide erroneously that the new one is better, it is a very painful error indeed, with
potentially grave consequences. We wish a decision procedure which limits the probability of such
a catastrophic misjudgment. But given this restriction, opportunities for switching to a better drug
should not be missed, i.e. when the new drug is actually better, then the decision procedure should be able
to detect it with as high a probability as possible.
For drug testing, procedures like this (significance level should be kept for some small α) are
required by law in every developed country. In general statistical practice, common values are
α = 0.05 and α = 0.01.
If one of the hypothesis sets Θ_0, Θ_1 consists of only one element, the respective hypothesis (or
alternative) is called simple, otherwise it is called composite. An important special case of testing
theory is the one where both hypothesis and alternative are simple, i.e. Θ_0 = {θ_0}, Θ_1 = {θ_1}
(which means the whole parameter space consists of the two elements θ_0, θ_1); in this case the
Neyman-Pearson fundamental lemma applies (see below).
A test where Θ_0 is simple is called a significance test. The question to be decided there is only
whether the hypothesis H : θ = θ_0 can be rejected with significance level α; the alternatives are
usually not specified.

8.2 Tests and confidence sets


A confidence set is a random subset A(X) of the parameter space which covers the true parameter
with probability at least 1 - α:

    P_θ(A(X) ∋ θ) ≥ 1 - α,   θ ∈ Θ.

Confidence intervals A(X) = [μ̂_-(X), μ̂_+(X)] were treated in detail in section 7.1 for the Gaussian
location-scale model and in the introductory section 1.3. Confidence sets are also called domain
estimators of the parameter (the estimators which pick one value of θ rather than a covering set
are called point estimators).
There is a close connection between confidence sets and significance tests.

1. Suppose a confidence set A(X) for level 1 - α is given. Let θ_0 ∈ Θ be arbitrary and consider a
simple hypothesis H : θ = θ_0 (vs. alternative K : θ ≠ θ_0). Construct a test φ_{θ_0} by

    φ_{θ_0}(X) = 1 - 1_{A(X)}(θ_0),

where 1_{A(X)} is the indicator of the confidence set, as a function of θ_0. In other words, H :
θ = θ_0 is rejected if θ_0 is outside the confidence set. Then

    P_{θ_0}(φ_{θ_0} = 1) = 1 - P_{θ_0}(A(X) ∋ θ_0) ≤ α,

hence φ_{θ_0} is an α-significance test for H : θ = θ_0.

2. We saw that a confidence set generates a family of significance tests, one for each θ_0. Assume
now conversely that such a family φ_θ, θ ∈ Θ, is given, and that they all observe level α. Define

    A(X) = {θ : φ_θ(X) = 0}.

Then

    P_θ(A(X) ∋ θ) = P_θ(φ_θ(X) = 0) = 1 - P_θ(φ_θ(X) = 1) ≥ 1 - α,

i.e. we have found a 1 - α confidence set A(X).

For a more general setting, let γ(θ) be a function of the parameter (with values in an arbitrary set
Γ). A confidence set for γ(θ) is defined by

    P_θ(A(X) ∋ γ(θ)) ≥ 1 - α,   θ ∈ Θ.

For instance θ might have two components, θ = (θ_1, θ_2), and γ(θ) = θ_1. Then the above family
of tests should be indexed by γ_0 ∈ Γ, and φ_{γ_0} has level α for the hypothesis H : γ(θ) = γ_0. This
hypothesis is composite if γ is not one-to-one (then φ_{γ_0} cannot be called a significance test).
As an example, consider the Gaussian location-scale model with unknown σ² (Model Mc,2). Here
θ = (μ, σ²) ∈ R × (0, ∞). Consider the quantity

    T_μ = T_μ(X) = (X̄_n - μ) n^{1/2} / S̃_n

used to build a confidence interval

    [μ̂_-, μ̂_+] = [ X̄_n - S̃_n t_{α/2}/√n,  X̄_n + S̃_n t_{α/2}/√n ]

for the unknown expected value μ. The level 1 - α was kept for all unknown σ² > 0, i.e. we have a
confidence interval for the parameter function γ(θ) = μ.
As we remarked, T_μ(X) depends on the parameter μ and therefore is not a statistic. Such a function
of both the observations and the parameter used to build a confidence interval is called a pivotal
quantity. The knowledge of the law of the pivotal quantity under the respective parameter is the
basis for a confidence interval. When looking at the significance test derived from T_μ(X), for a
hypothesis H : μ = μ_0, we find that the test is

    φ_{μ_0}(X) = 1  if  |T_{μ_0}(X)| > t_{α/2},                        (8.1)

where t_{α/2} is the upper α/2-quantile of the t-distribution with n - 1 degrees of freedom. From this
point of view, when μ = μ_0 is a known hypothesis, T_{μ_0}(X) does not depend on an unknown
parameter, and is thus a statistic. In this example, it is the t-statistic for testing H : μ = μ_0, and
the test (8.1) is the two sided t-test. The basic result implied by Theorem 7.2.8 about this test
is the following.

Theorem 8.2.1 In the Gaussian location-scale model Mc,2, for sample size n, consider the
hypothesis H : μ = μ_0. Let t_{α/2} be the upper α/2-quantile of the t-distribution with n - 1 degrees
of freedom. Then the two sided t-test (8.1) has level α for any unknown σ² > 0.
Analogously, when σ² is known, the test

    φ_{μ_0}(X) = 1  if  |Z_{μ_0}(X)| > z_{α/2},                        (8.2)

where z_{α/2} is the upper α/2-quantile of N(0, 1) and

    Z_μ = Z_μ(X) = (X̄_n - μ) n^{1/2} / σ

is the Z-statistic, is called the two sided Gauss test for H : μ = μ_0. The t-statistic T_{μ_0}(X)
can then be construed as a studentized Z-statistic Z_{μ_0}(X).
Let us investigate the relationship between the power of tests and confidence intervals. Suppose
that we have two 1 - α confidence sets A′(X), A(X) for θ itself, where

    A′(X) ⊆ A(X),

i.e. A′(X) is contained in A(X) (in the case of intervals, A′(X) would be shorter or of equal length).
For the respective families φ′_{θ_0}, φ_{θ_0}, θ_0 ∈ Θ, of α-significance tests this means

    φ_{θ_0}(X) = 1 implies φ′_{θ_0}(X) = 1,

hence

    P_θ( φ′_{θ_0} = 1 ) ≥ P_θ( φ_{θ_0} = 1 )  for all θ ∈ Θ.

At θ ≠ θ_0 these are precisely the respective powers of the two tests φ′_{θ_0}, φ_{θ_0}. (At θ = θ_0 the
relation implies that for A(X) to keep level 1 - α, it is sufficient that A′(X) keeps this level.) Thus
φ′_{θ_0} has uniformly better (or at least as good) power for all θ ≠ θ_0.
It was mentioned that shorter confidence intervals are desirable (given a confidence level), since
they enable sharper decision making. Translating this into a power relation for tests, we have
made the idea more transparent. The assumed inclusion A′(X) ⊆ A(X) implies a larger critical
region for φ′_{θ_0}: {φ_{θ_0} = 1} ⊆ {φ′_{θ_0} = 1}. This does not describe all situations in which φ′_{θ_0} might
have better power (and thus A′(X) is better in some sense); we shall not further investigate the
power of confidence intervals here but will concentrate on tests instead.
However, asymptotic confidence intervals should be discussed briefly. The statement of Theorem
7.3.9 can be translated immediately into the language of test theory. When the law of the observed
r.v. X depends on n, we write X_n and L(X_n) ∈ {P_{θ,n}; θ ∈ Θ} for the family of laws. Here the
observation space X_n might also depend on n, as is the case for n i.i.d. observed variables.

Definition 8.2.2 (i) A sequence of tests φ_n = φ_n(X_n) for testing H : θ ∈ Θ_0 vs. K : θ ∈ Θ_1 has
asymptotic level α if

    limsup_{n→∞} P_{θ,n}(φ_n(X_n) = 1) ≤ α  for all θ ∈ Θ_0.

(ii) The sequence φ_n is consistent if it has asymptotic power one, i.e.

    lim_{n→∞} P_{θ,n}(φ_n(X_n) = 1) = 1  for all θ ∈ Θ_1.
We saw in Theorem 7.3.9 that in Model Mc,2, the interval

    [μ̂_-, μ̂_+] = [ X̄_n - S̃_n z_{α/2}/√n,  X̄_n + S̃_n z_{α/2}/√n ]

is an asymptotic 1 - α confidence interval for the unknown expected value μ:

    liminf_{n→∞} P_{μ,σ²}( [μ̂_-, μ̂_+] ∋ μ ) ≥ 1 - α,

where z_{α/2} is a normal quantile. Thus if φ*_{μ_0} is the derived test for H : μ = μ_0 then

    limsup_{n→∞} P_{μ_0,σ²}( φ*_{μ_0} = 1 ) = 1 - liminf_{n→∞} P_{μ_0,σ²}( [μ̂_-, μ̂_+] ∋ μ_0 ) ≤ α,

so that every φ*_{μ_0} is an asymptotic α-test.


Exercise. Consider the test φ*_{μ_0} (i.e. the test based on the t-statistic as in (8.1), where the quantile
z_{α/2} of the standard normal is used in place of t_{α/2}). Show that in the Gaussian location-scale
model Mc,2, for sample size n, this test is consistent as n → ∞ on the pertaining alternative

    Θ_1(μ_0) = { (μ, σ²) : μ ≠ μ_0 }.

Figure: consistency of one sided Gauss tests (OC-plot).

Figure: consistency of two sided Gauss tests (OC-plot).
Above, the first plot, in the Gaussian location model with σ² = 1, gives the OC of a test of H : μ ≤ 0
vs K : μ > 0 of the type

    φ(X) = 1 if X̄_n > c_n, and 0 otherwise,

where c_n is selected such that φ is an α-test, for α = 0.05 and sample sizes n = 1, 2, 4 respectively.
This is the same situation as in the first OC-plot above, only α is selected as one of the common values
(in the other figure we just took c_1 = 2 and did not care about the resulting α), and three sample
sizes are plotted. This is a one sided Gauss test; we have not yet discussed the respective theory,
but it can be observed that these tests keep level α on the whole composite hypothesis H : μ ≤ 0.
Moreover, the behaviour of consistency is visible (for larger n, the power increases).
The second plot concerns the simple hypothesis H : μ = 0 in the same model (σ² = 1, α = 0.05,
n = 1, 2, 4) and the two sided Gauss test (8.2) derived from the confidence interval (7.3). The
middle OC-line, for n = 2, is dotted.

8.3 The Neyman-Pearson Fundamental Lemma


We saw that in testing theory, a basic goal is to maximize the power of tests, while keeping
significance level α.

Definition 8.3.1 Suppose that L(X) ∈ {P_θ; θ ∈ Θ}. An α-test φ* = φ*(X) for testing H : θ ∈ Θ_0
vs. K : θ ∈ Θ_1 is uniformly most powerful (UMP) if for every other test φ which is an α-test
for the problem:

    sup_{θ∈Θ_0} E_θ φ ≤ α,

we have

    E_θ φ ≤ E_θ φ*  for all θ ∈ Θ_1.
Typically it is not possible to find such a UMP test; some tests do better at particular points in the
alternative, at the expense of the power at other points. An example is given in the following
plot.

Figure: power of the one sided and two sided Gauss tests for H : μ = 0 vs. K : μ ≠ 0.

In the Gaussian location model with σ² = 1 and n = 1, for hypotheses H : μ = 0 vs K : μ ≠ 0 we
look at the OC of two tests:

    one sided Gauss test:  φ_1(X) = 1 if X̄_n > c_1, and 0 otherwise,
    two sided Gauss test:  φ_2(X) = 1 if |X̄_n| > c_2, and 0 otherwise,

where c_1, c_2 are selected such that both φ_1, φ_2 are α-tests for α = 0.05. This is the same situation as in
the two plots above, where we have now put the two tests in one picture and omitted
the larger sample sizes n = 2, 4. (Here of course X̄_n = X_1 if n = 1, but the same curves appear if
only Var(X̄_n) = 1, i.e. σ²/n = 1.)
We see that φ_1 is better than φ_2 for alternatives μ > 0, but it is much worse for alternatives μ < 0.
If we really are interested also in alternatives μ < 0 (i.e. we wish to detect these, and not just be
content with a statement μ ≥ 0) we should apply the two sided test; the one sided φ_1 is totally
implausible for the two sided alternative μ ≠ 0. However, even though it is implausible, φ_1 has
better power for μ > 0.
In one special situation it is possible to find a UMP test, namely when both hypothesis and
alternative are simple. In this case, we have only one point in the alternative, and maximizing the
power turns out to be possible. Recall the continuous statistical model, first defined in section 4.3:

Model Mc The observed random variable $X = (X_1, \ldots, X_k)$ is continuous with values in $\mathbb{R}^k$ and
$\mathcal{L}(X) \in \{P_\theta,\, \theta \in \Theta\}$. Each law $P_\theta$ is described by a joint density $p_\theta(x) = p_\theta(x_1, \ldots, x_k)$,
and $\Theta \subseteq \mathbb{R}^d$.
(Earlier we required that $\Theta$ be an open set, but this is omitted now.)

Definition 8.3.2 Assume Model Mc , and that $\Theta = \{\theta_0, \theta_1\}$ consists of only two elements. A test $\varphi$
for the hypotheses
\[
H : \theta = \theta_0, \qquad K : \theta = \theta_1
\]
is called a Neyman-Pearson test of level $\alpha$ if
\[
\varphi(X) = \begin{cases} 1 & \text{if } p_{\theta_1}(x) > c\, p_{\theta_0}(x) \\ 0 & \text{otherwise} \end{cases}
\]
where the value $c$ is chosen such that
\[
P_{\theta_0}(\varphi(X) = 1) = \alpha. \tag{8.3}
\]

We should first show that a Neyman-Pearson test exists. Of course we can take any $c$ and build
a test according to the above rule. This rule seems plausible: given $x$, each of the two densities
can be regarded as a likelihood. We might say that $\theta_1$ is more likely if the ratio of likelihoods
$p_{\theta_1}(x)/p_{\theta_0}(x)$ is sufficiently large. We recognize an application of the likelihood principle (recall
that this consists in regarding the density as a function of $\theta$ when $x$ is already observed, and
assigning corresponding likelihoods to each $\theta$).
The question is only whether $c$ can be chosen such that (8.3) holds. Define a random variable
\[
L = L(X) = \begin{cases} p_{\theta_1}(X)/p_{\theta_0}(X) & \text{if } p_{\theta_0}(X) > 0 \\ 0 & \text{otherwise} \end{cases}
\]
where $X$ has distribution $P_{\theta_0}$, and let $F_L$ be its distribution function
\[
F_L(t) = P_{\theta_0}(L(X) \le t).
\]
Recall that distribution functions $F$ are monotone increasing, right continuous and tend to 0 at $-\infty$, 1 at
$\infty$. To ensure (8.3), we make
Assumption L. The distribution function $F_L(t)$ is continuous.
In this case
\[
P_{\theta_0}(L(X) = t) = 0
\]
for all $t$, and for $c > 0$ consider the probability $P_{\theta_0}(\varphi(X) = 1)$. Since
\[
P_{\theta_0}(p_{\theta_0}(X) = 0) = 0,
\]
we have
\begin{align*}
P_{\theta_0}(\varphi(X) = 1) &= P_{\theta_0}(\{\varphi(X) = 1\} \cap \{p_{\theta_0}(X) > 0\}) \\
&= P_{\theta_0}(\{L(X) > c\} \cap \{p_{\theta_0}(X) > 0\}) \\
&= P_{\theta_0}(L(X) > c) = 1 - P_{\theta_0}(L(X) \le c) \\
&= 1 - P_{\theta_0}(L(X) < c) = 1 - F_L(c),
\end{align*}
which is continuous and monotone in $c$, with limit 0 for $c \to \infty$. Furthermore, $F_L(t) = 0$ for all $t < 0$
since $L(X)$ is nonnegative, and in view of continuity of $F_L$ (Assumption L) we have
\[
1 - F_L(c) \to 1 \text{ as } c \to 0.
\]
Thus a value $c$ with (8.3) exists for every $\alpha \in (0, 1)$.



Example 8.3.3 Let $P_\theta = N(\theta, 1)$, $\theta \in \mathbb{R}$ (Gaussian location). Here
\begin{align*}
L(x) = \frac{p_{\theta_1}(x)}{p_{\theta_0}(x)} &= \exp\left(-\frac{(x - \theta_1)^2 - (x - \theta_0)^2}{2}\right) \\
&= \exp\left(x(\theta_1 - \theta_0)\right) \exp\left(-\frac{\theta_1^2 - \theta_0^2}{2}\right).
\end{align*}
Now for any $t > 0$
\begin{align*}
P_{\theta_0}(L(X) = t) &= P_{\theta_0}\left(X(\theta_1 - \theta_0) = \log t + \frac{\theta_1^2 - \theta_0^2}{2}\right) \\
&= P_{\theta_0}(X = t_0(\theta_1, \theta_0, t))
\end{align*}
where $t_0$ is a well defined number if $\theta_1 \neq \theta_0$. However for a normal $X$ this probability is 0 for any
$t_0$. Moreover $L(X) = 0$ cannot happen since $\exp(z) > 0$ for all $z$, so that Assumption L is fulfilled
if $\theta_1 \neq \theta_0$.
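As a numerical illustration of this example, the following Python sketch exploits the monotonicity of $L(x)$ in $x$ (for $\theta_1 > \theta_0$) to express the Neyman-Pearson test as a threshold on $X$ itself; the particular values of $\theta_0$, $\theta_1$ and $\alpha$ are chosen arbitrarily.

import numpy as np
from scipy.stats import norm

# H: theta = theta0 vs K: theta = theta1 with X ~ N(theta, 1), theta1 > theta0.
# L(x) = exp(x*(theta1 - theta0)) * exp(-(theta1^2 - theta0^2)/2) is increasing in x,
# so the N-P test of level alpha rejects exactly when X > theta0 + z_alpha.
theta0, theta1, alpha = 0.0, 1.5, 0.05
x_crit = theta0 + norm.ppf(1 - alpha)

size = 1 - norm.cdf(x_crit, loc=theta0)    # equals alpha by construction
power = 1 - norm.cdf(x_crit, loc=theta1)   # power of the most powerful alpha-test
print(x_crit, size, power)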

Example 8.3.4 Let $U(a, b)$ be the uniform law on the interval $[a, b]$ and $P_{\theta_0} = U(0, 1)$, $P_{\theta_1} =
U(0, a)$ where $0 < a < 1$. Then $p_{\theta_1}(x) = a^{-1}$ for $x \in [0, a]$, 0 otherwise, and
\[
L(x) = p_{\theta_1}(x) = \begin{cases} a^{-1} & \text{if } x \in [0, a] \\ 0 & \text{otherwise.} \end{cases}
\]
Thus if $X$ has law $P_{\theta_0}$ then $L$ takes only the values $a^{-1}$ and 0, and
\[
P_{\theta_0}(L(X) = a^{-1}) = a, \qquad P_{\theta_0}(L(X) = 0) = 1 - a.
\]
Thus $F_L$ is not continuous; it jumps at 0 and at $a^{-1}$, and is constant at other points.

In the latter example we cannot guarantee the existence of a Neyman-Pearson test. We will remedy
this situation later; let us first prove the optimality of Neyman-Pearson tests (abbreviated N-P tests
henceforth). We do not claim that there is only one N-P test for a given level $\alpha$ (the $c$ may not be
unique, and also one can use different versions of the densities, e.g. modifying them in some points
etc.).

Theorem 8.3.5 (Neyman-Pearson fundamental lemma). In Model Mc , for $\Theta =
\{\theta_0, \theta_1\}$, under Assumption L, any N-P test $\varphi$ of level $\alpha$ ($0 < \alpha < 1$) is a most powerful $\alpha$-test for
the hypotheses $H : \theta = \theta_0$ vs. $K : \theta = \theta_1$.

Proof. Let $\varphi$ be a N-P test and $S_{NP}$ its acceptance region
\[
S_{NP} = \{x : p_{\theta_1}(x) \le c\, p_{\theta_0}(x)\}.
\]
We can assume $c > 0$ here since
\[
1 - \alpha = P_{\theta_0}(L(X) \le c)
\]
and for $c = 0$, $\alpha < 1$ this would mean that $P_{\theta_0}(L(X) = 0) > 0$, which we excluded by Assumption
L. Let $\varphi^*$ be any $\alpha$-test and $S^*$ its acceptance region. We have to show
\[
P_{\theta_1}(S^*) \ge P_{\theta_1}(S_{NP}).
\]

Define $A = S^* \setminus S_{NP}$, $A' = S_{NP} \setminus S^*$; then
\[
S^* = (S^* \cap S_{NP}) \cup A, \qquad S_{NP} = (S^* \cap S_{NP}) \cup A',
\]
and it suffices to show
\[
P_{\theta_1}(A) \ge P_{\theta_1}(A'). \tag{8.4}
\]
Now since $\varphi^*$ is an $\alpha$-test, we have
\[
P_{\theta_0}(S^*) \ge 1 - \alpha = P_{\theta_0}(S_{NP}),
\]
which implies
\[
P_{\theta_0}(A) \ge P_{\theta_0}(A'). \tag{8.5}
\]
Since $A \subseteq S_{NP}^c$, we have for any $x \in A$ that $p_{\theta_1}(x) > c\, p_{\theta_0}(x)$, hence
\[
P_{\theta_0}(A) = \int_A p_{\theta_0}(x)dx \le c^{-1} \int_A p_{\theta_1}(x)dx = c^{-1} P_{\theta_1}(A), \tag{8.6}
\]
and since $A' \subseteq S_{NP}$, we have
\[
P_{\theta_0}(A') \ge c^{-1} \int_{A'} p_{\theta_1}(x)dx = c^{-1} P_{\theta_1}(A'). \tag{8.7}
\]
Relations (8.7), (8.6), (8.5) imply (8.4).


Let us consider the case where Assumption L is not fulfilled. For any test, the quantity P0 ((X) = 1)
is called the size of the test. We saw in the above proof that the assumption that the Neyman-
Pearson test has size exactly was essential. Recall that

P0 ((X) = 1) = 1 FL (c)

and that FL (c) is continuous from the right. When Assumption L is not fulfilled, the following
situation may occur: there is a c0 such that for all c < c0 , FL (c) < 1 , and at c0 we have
FL (c0 ) > 1 . In other words, the function 1 FL (c) jumps in such a way that is not attained.
In order to deal with this situation, let us generalize the notion of a test function.

Definition 8.3.6 A randomized test (based on the data X) is any statistic such that 0
(X) 1.

When the value of is between 0 and 1, the interpretation is that the decision between hypothesis
H and alternative K is taken randomly, such that is the probability of deciding K. Thus,
given the data X = x, a Bernoulli random variable Z is generated with law (conditional on x)
L(Z) = B(1, (x)), and the decision is Z. The former nonrandomized test functions are special
cases: when (x) = 1 or (x) = 0, the Bernoulli r.v. Z is degenerate and takes the corresponding
value 1 or 0 with probability one. For a randomized test, we have for = 0 or = 1 , and writing
P (Z = ) for the unconditional probability in Z (when X is random)

P (Z = 1) = E P (Z = 1|X = x) = E (X), (8.8)


P (Z = 0) = 1 E (X)

so that both the errors of first and second kind are a function of the expected value of $\varphi(X)$ under
the respective hypothesis.
This method of introducing artificial randomness into the decision process should be regarded with
common sense reservations from a practical point of view. However randomized tests provide
a completion of the theory and therefore a better understanding of the basic problems of statistics.
For instance, the inclusion of randomized tests allows us to state that there is always a level $\alpha$ test: take
$\varphi(X) = \alpha$, independently of the data. But the power of that trivial test is also $\alpha$, i.e. not very
good.
In the above situation, when there is no $c$ such that $F_L(c) = 1 - \alpha$, consider the left limit of $F_L$ at
$c_0$:
\[
F_{L,-}(c_0) := \lim_{c \nearrow c_0} F_L(c) = P_{\theta_0}(L(X) < c_0)
\]
(this always exists for monotone functions); then the height of the jump of $F_L$ at $c_0$ is the probability
that the r.v. $L(X)$ takes the value $c_0$:
\begin{align*}
P_{\theta_0}(L(X) = c_0) &= F_L(c_0) - F_{L,-}(c_0) \\
&= P_{\theta_0}(L(X) \le c_0) - P_{\theta_0}(L(X) < c_0) > 0.
\end{align*}
Define
\[
\gamma = \frac{\alpha - P_{\theta_0}(L(X) > c_0)}{P_{\theta_0}(L(X) = c_0)},
\]
then
\[
\alpha = P_{\theta_0}(L(X) > c_0) + \gamma\, P_{\theta_0}(L(X) = c_0). \tag{8.9}
\]
Moreover, since $F_{L,-}(c_0)$ is a limit of values which are all $< 1 - \alpha$,
\[
F_{L,-}(c_0) \le 1 - \alpha
\]
and by assumption $F_L(c_0) > 1 - \alpha$, hence
\[
0 < \gamma = \frac{F_L(c_0) - (1 - \alpha)}{F_L(c_0) - F_{L,-}(c_0)} \le 1.
\]
That allows us to construe $\gamma$ as the value of a randomized test $\varphi$, which is taken if the event
$L(X) = c_0$ occurs. We can then define Neyman-Pearson tests for any statistical model, provided
the likelihood ratio $L(X)$ is defined as a random variable. Then the distribution function $F_L$ is
defined, and we may construct a level $\alpha$ test as above.

Definition 8.3.7 Assume Model Md (discrete) or Model Mc (continuous), and that $\Theta = \{\theta_0, \theta_1\}$.
Let $p_\theta$ be either probability functions or densities and define the likelihood ratio as a function of the
data $x$
\[
L = L(x) = \begin{cases} p_{\theta_1}(x)/p_{\theta_0}(x) & \text{if } p_{\theta_0}(x) > 0 \\ +\infty & \text{otherwise.} \end{cases}
\]
A test $\varphi$ for the hypotheses
\[
H : \theta = \theta_0, \qquad K : \theta = \theta_1
\]

Figure 1 Construction of the randomized Neyman-Pearson test

is called a randomized Neyman-Pearson test of level $\alpha$ if there exist $c \in [0, \infty)$ and $\gamma \in [0, 1]$
such that
\[
\varphi(x) = \begin{cases} 1 & \text{if } L(x) > c \\ \gamma & \text{if } L(x) = c \\ 0 & \text{if } L(x) < c \end{cases}
\]
and such that
\[
P_{\theta_0}(L(X) > c) + \gamma\, P_{\theta_0}(L(X) = c) = \alpha. \tag{8.10}
\]

Note we modified the definition of $L(x)$: formerly we took $L(x) = 0$ if $p_{\theta_0}(x) = 0$. But under
H such values of $x$ do not occur anyway (or with probability 0), so that $F_L$ as considered before
remains the same and a level $\alpha$ is attained. The modification ensures that if we decide on the basis
of $L$ (reject if $L$ is large enough), then $p_{\theta_0}(x) = 0$ implies that the decision is always 1.
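To make the construction of $c$ and $\gamma$ concrete, here is a small Python sketch for a discrete example, two simple binomial hypotheses $B(n, p_0)$ vs. $B(n, p_1)$ with $p_1 > p_0$ (so that the likelihood ratio is increasing in the number of successes). The particular values $n = 10$, $p_0 = 0.5$, $p_1 = 0.7$, $\alpha = 0.05$ are chosen only for illustration.

import numpy as np
from scipy.stats import binom

def randomized_np_binomial(n, p0, p1, alpha):
    """Randomized N-P test for H: p = p0 vs K: p = p1 (p1 > p0), X ~ B(n, p).
    Since L(x) is increasing in x, the test rejects for X > x0 and randomizes at X = x0."""
    assert p1 > p0
    x0 = 0
    while 1 - binom.cdf(x0, n, p0) > alpha:   # P_{p0}(X > x0)
        x0 += 1
    tail = 1 - binom.cdf(x0, n, p0)           # P_{p0}(X > x0) <= alpha
    atom = binom.pmf(x0, n, p0)               # P_{p0}(X = x0)
    gamma = (alpha - tail) / atom
    return x0, gamma

n, p0, p1, alpha = 10, 0.5, 0.7, 0.05
x0, gamma = randomized_np_binomial(n, p0, p1, alpha)
# size = P(X > x0) + gamma * P(X = x0) equals alpha exactly, as in (8.10)
size = (1 - binom.cdf(x0, n, p0)) + gamma * binom.pmf(x0, n, p0)
print(x0, gamma, size)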

Theorem 8.3.8 (Neyman-Pearson fundamental lemma, general case). In Model Md (dis-
crete) or Model Mc (continuous), for $\Theta = \{\theta_0, \theta_1\}$ and any $\alpha$ ($0 < \alpha < 1$), a N-P test $\varphi$ of level $\alpha$
exists and is a most powerful $\alpha$-test (among all randomized tests) for the hypotheses $H : \theta = \theta_0$
vs. $K : \theta = \theta_1$.

Proof. We have shown existence above: $L(X)$ is a well defined r.v. if $X$ has law $P_{\theta_0}$ (it takes the value
$\infty$ only with probability 0); it has distribution function $F_L$. If $c_0$ exists such that $F_L(c_0) = 1 - \alpha$
then take $c = c_0$ and $\gamma = 0$. Otherwise, find $c$ and $\gamma$ as above, fulfilling (8.9). Only the properties
of distribution functions were used for establishing (8.9), i.e. (8.10). Let $Z$ be the randomizing
random variable, which has conditional law $\mathcal{L}(Z) = B(1, \varphi(x))$ given $X = x$. Then the probability
under H that H is rejected is
\begin{align}
P_{\theta_0}(Z = 1) &= E_{\theta_0} \varphi(X) \tag{8.11} \\
&= 1 \cdot P_{\theta_0}(L(X) > c) + \gamma\, P_{\theta_0}(L(X) = c) = \alpha, \notag
\end{align}

i.e. the test has indeed size $\alpha$. The optimality proof is analogous to Theorem 8.3.5. We assume
that the $p_\theta(x)$ are densities; the case of probability functions requires only changes in notation.
According to (8.11), let $\varphi$ be a N-P test with given $c$, $\gamma$, and
\[
S_> := \{x: L(x) > c\}, \qquad S_= := \{x: L(x) = c\}, \qquad S_< := \{x: L(x) < c\}.
\]
Then for any randomized $\alpha$-test $\varphi^*$
\begin{align*}
E_{\theta_1} \varphi^*(X) &= \int_{S_> \cup S_=} \varphi^*(x) p_{\theta_1}(x)dx + \int_{S_<} \varphi^*(x) p_{\theta_1}(x)dx \\
&\le \int_{S_> \cup S_=} \varphi^*(x) p_{\theta_1}(x)dx + c \int_{S_<} \varphi^*(x) p_{\theta_0}(x)dx \\
&= \int_{S_>} \varphi^*(x) \left(p_{\theta_1}(x) - c\, p_{\theta_0}(x)\right) dx + c \int \varphi^*(x) p_{\theta_0}(x)dx.
\end{align*}
The second term on the right is bounded from above by $c\alpha$, since $\varphi^*$ is a level $\alpha$ test. For the first
term, since $p_{\theta_1}(x) - c\, p_{\theta_0}(x) > 0$ on $S_>$ and $\varphi^*(x) \le 1$, we obtain an upper bound by substituting 1
for $\varphi^*$. Hence
\begin{align*}
E_{\theta_1} \varphi^*(X) &\le \int_{S_>} \left(p_{\theta_1}(x) - c\, p_{\theta_0}(x)\right) dx + c\alpha \\
&= \int_{S_>} \varphi(x) \left(p_{\theta_1}(x) - c\, p_{\theta_0}(x)\right) dx + c \int \varphi(x) p_{\theta_0}(x)dx \\
&= \int_{S_>} \varphi(x) p_{\theta_1}(x)dx + c \int_{S_=} \varphi(x) p_{\theta_0}(x)dx \\
&= \int_{S_> \cup S_=} \varphi(x) p_{\theta_1}(x)dx = E_{\theta_1} \varphi(X).
\end{align*}

In some cases the Neyman-Pearson lemma allows the construction of UMP tests for composite
hypotheses. Consider the Gaussian location model (Model $M_{c,1}$), for sample size $n$, and the hy-
potheses $H : \mu \le \mu_0$, $K : \mu > \mu_0$. Consider the one sided Gauss test $\varphi_{\mu_0}$: for the test
statistic
\[
Z(X) = \frac{(\bar{X}_n - \mu_0)\, n^{1/2}}{\sigma}
\]
the test is defined by
\[
\varphi_{\mu_0}(X) = 1 \text{ if } Z(X) > z_\alpha \tag{8.12}
\]
(0 otherwise), where $z_\alpha$ is the upper $\alpha$-quantile of $N(0, 1)$. As was argued in Example 8.3.3,
condition L is fulfilled here; hence for Neyman-Pearson tests of two simple hypotheses within this
model randomization is not needed. We have composite hypotheses now, but the following can be
shown.

Proposition 8.3.9 In the Gaussian location model (Model $M_{c,1}$), for sample size $n$, for the test
problem $H : \mu \le \mu_0$ vs. $K : \mu > \mu_0$, for any $0 < \alpha < 1$ the one sided Gauss test (8.12) is a UMP
$\alpha$-test.

Proof. Note first that $\varphi_{\mu_0}$ is an $\alpha$-test for $H : \mu \le \mu_0$. Indeed for any $\mu \le \mu_0$
\begin{align*}
P_\mu(Z(X) > z_\alpha) &= P_\mu\left(\frac{(\bar{X}_n - \mu)\, n^{1/2}}{\sigma} > z_\alpha + \frac{(\mu_0 - \mu)\, n^{1/2}}{\sigma}\right) \\
&\le P_\mu\left(\frac{(\bar{X}_n - \mu)\, n^{1/2}}{\sigma} > z_\alpha\right) = \alpha.
\end{align*}
Consider now any point $\mu_1 > \mu_0$. We claim that for the simple hypotheses $H : \mu = \mu_0$, $K : \mu = \mu_1$
the test $\varphi_{\mu_0}$ is a Neyman-Pearson test of level $\alpha$. Indeed, when $p_{\mu_0}, p_{\mu_1}$ are the densities then for
$x = (x_1, \ldots, x_n)$ ($\phi$ is the standard normal density)
\begin{align*}
L(x) = \prod_{i=1}^n \frac{\phi((x_i - \mu_1)/\sigma)}{\phi((x_i - \mu_0)/\sigma)}
&= \exp\left(-\frac{\sum_{i=1}^n (x_i - \mu_1)^2 - \sum_{i=1}^n (x_i - \mu_0)^2}{2\sigma^2}\right) \\
&= \exp\left(\frac{n \bar{x}_n (\mu_1 - \mu_0)}{\sigma^2}\right) \exp\left(-\frac{n\mu_1^2 - n\mu_0^2}{2\sigma^2}\right).
\end{align*}
Since $\mu_1 > \mu_0$, $L(x)$ is a monotone increasing function of $\bar{x}_n$, and $L(x) > c$ is equivalent to $\bar{x}_n > c'$ for some
$c'$. In turn, $\bar{x}_n$ is a monotone function of $Z(x) = (\bar{x}_n - \mu_0)\, n^{1/2}/\sigma$, thus $L(x) > c$ is equivalent to
$Z(x) > c''$. We find $c''$ from the level condition:
\[
P_{\mu_0}(Z(X) > c'') = \alpha,
\]
and since $Z(X)$ is standard normal under $\mu_0$, we find $c'' = z_\alpha$. Thus $\varphi_{\mu_0}$ is a Neyman-Pearson
$\alpha$-test for $H : \mu = \mu_0$ vs. $K : \mu = \mu_1$. Any test $\varphi$ of level $\alpha$ for the original composite hypotheses
is also a test of level $\alpha$ for $H : \mu = \mu_0$ vs. $K : \mu = \mu_1$, so that the fundamental lemma (8.3.5)
implies
\[
E_{\mu_1} \varphi \le E_{\mu_1} \varphi_{\mu_0}.
\]
Since $\mu_1 > \mu_0$ was arbitrary, the proposition follows.

8.4 Likelihood ratio tests


We saw that the Neyman-Pearson lemma is closely connected to the likelihood principle: the N-P
test rejects if $L(x) = p_{\theta_1}(x)/p_{\theta_0}(x)$ is too large at the observed $x$. The density (or probability
function) $p_\theta(x)$ is a likelihood of $\theta$ when $x$ is fixed (observed) and $\theta$ varies, and the decision is
taken as a function of the two likelihoods. That idea can be carried over to general composite
hypotheses.

Definition 8.4.1 Assume Model Md (discrete) or Model Mc (continuous), and that $\Theta = \Theta_0 \cup \Theta_1$
where $\Theta_0 \cap \Theta_1 = \emptyset$. Let $p_\theta$ be either probability functions or densities and define the likelihood ratio
as a function of the data $x$
\[
L = L(x) = \begin{cases} \dfrac{\sup_{\theta \in \Theta_1} p_\theta(x)}{\sup_{\theta \in \Theta_0} p_\theta(x)} & \text{if } \sup_{\theta \in \Theta_0} p_\theta(x) > 0 \\ +\infty & \text{otherwise.} \end{cases}
\]
A (possibly randomized) test $\varphi$ for the hypotheses
\[
H : \theta \in \Theta_0, \qquad K : \theta \in \Theta_1
\]
is called a likelihood ratio test (LR test) if there exists $c \in [0, \infty)$ such that
\[
\varphi(x) = \begin{cases} 1 & \text{if } L(x) > c \\ 0 & \text{if } L(x) < c. \end{cases}
\]

Note that for $L(x) = c$ we made no requirement; any value $\varphi(x)$ in $[0, 1]$ is possible, so that the
test is possibly a randomized one. Neyman-Pearson tests are a special case for simple hypotheses.
One interpretation is the following. Suppose that the suprema over both hypotheses are attained,
so that for certain $\hat{\theta}_i(x) \in \Theta_i$, $i = 0, 1$ we have
\[
\sup_{\theta \in \Theta_i} p_\theta(x) = \max_{\theta \in \Theta_i} p_\theta(x) = p_{\hat{\theta}_i(x)}(x), \quad i = 0, 1.
\]
Then the $\hat{\theta}_i(x)$ are maximum likelihood estimators (MLE) of $\theta$ under the assumptions $\theta \in \Theta_i$, and the
LR test can be interpreted as a Neyman-Pearson test for the simple hypotheses $H : \theta = \hat{\theta}_0(x)$ vs.
$K : \theta = \hat{\theta}_1(x)$. Of course this is pure heuristics and none of the Neyman-Pearson optimality theory
applies, since the hypotheses have been formed on the basis of the data.
Consider the Gaussian location-scale model and recall the form of the t-statistic for given $\mu_0$
\[
T_{\mu_0}(X) = \frac{(\bar{X}_n - \mu_0)\, n^{1/2}}{S_n}.
\]
The two sided t-test was already defined (cp. (8.1)); it rejects when $|T_{\mu_0}(X)|$ is too large. The one
sided t-test is the test which rejects when $T_{\mu_0}(X)$ is too large (in analogy to the one sided Gauss
test for known $\sigma^2$).

Proposition 8.4.2 Consider the Gaussian location-scale model (Model $M_{c,2}$), for sample size $n$.
(i) For hypotheses $H : \mu \le \mu_0$ vs. $K : \mu > \mu_0$, the one sided t-test is a LR test.
(ii) For hypotheses $H : \mu = \mu_0$ vs. $K : \mu \neq \mu_0$, the two sided t-test is a LR test.

Proof. In relation (3.4) in the proof of Proposition 3.0.5, we obtained the following form of the
density $p_{\mu,\sigma^2}$ of the data $x = (x_1, \ldots, x_n)$:
\[
p_{\mu,\sigma^2}(x) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{S_n^2 + (\bar{x}_n - \mu)^2}{2\sigma^2 n^{-1}}\right), \tag{8.13}
\]
\[
S_n^2 = n^{-1} \sum_{i=1}^n (x_i - \bar{x}_n)^2. \tag{8.14}
\]
Consider first the two sided case (ii). To find MLEs of $\mu$ and $\sigma^2$ under $\mu \neq \mu_0$, we first maximize
for fixed $\sigma^2$ over all possible $\mu \in \mathbb{R}$. This gives an unrestricted MLE $\hat{\mu} = \bar{x}_n$, and since $\bar{x}_n = \mu_0$
with probability 0, we obtain that $\hat{\mu}_1 = \bar{x}_n$ is the MLE of $\mu$ with probability 1 under K. We now
have to maximize
\[
l_x(\sigma^2) = \frac{1}{(\sigma^2)^{n/2}} \exp\left(-\frac{S_n^2}{2\sigma^2 n^{-1}}\right)
\]
over $\sigma^2 > 0$. For notational convenience, we set $v = \sigma^2$; equivalently, one may minimize
\[
\tilde{l}_x(v) = -\log l_x(v) = \frac{n}{2} \log v + \frac{n S_n^2}{2v}.
\]

Note that if $S_n^2 > 0$, for $v \to 0$ we have $\tilde{l}_x(v) \to \infty$ and for $v \to \infty$ also $\tilde{l}_x(v) \to \infty$, so that a
minimum exists and is a zero of the derivative of $\tilde{l}_x$. The event $S_n^2 > 0$ has probability 1, since
otherwise $x_i = \bar{x}_n$, $i = 1, \ldots, n$, i.e. all $x_i$ are equal, which clearly has probability 0 for independent
continuous $x_i$. We obtain
\[
\tilde{l}_x'(v) = \frac{n}{2v} - \frac{n S_n^2}{2v^2} = 0, \qquad v = S_n^2
\]
as the unique zero, so $\hat{\sigma}_1^2 = S_n^2$ is the MLE of $\sigma^2$ under K. Thus
\begin{align*}
\max_{\mu \neq \mu_0,\, \sigma^2 > 0} p_{\mu,\sigma^2}(x) &= \frac{1}{(2\pi\hat{\sigma}_1^2)^{n/2}} \exp\left(-\frac{S_n^2 + (\bar{x}_n - \hat{\mu}_1)^2}{2\hat{\sigma}_1^2 n^{-1}}\right) \\
&= \frac{1}{(\hat{\sigma}_1^2)^{n/2}} \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{n}{2}\right).
\end{align*}

Now under H the MLE of $\mu$ is $\hat{\mu}_0 = \mu_0$. Defining
\[
S_{0,n}^2 := S_n^2 + (\bar{x}_n - \mu_0)^2 = n^{-1} \sum_{i=1}^n (x_i - \mu_0)^2,
\]
we see that the MLE of $\sigma^2$ under H is $\hat{\sigma}_0^2 = S_{0,n}^2$. Hence
\[
\max_{\mu = \mu_0,\, \sigma^2 > 0} p_{\mu,\sigma^2}(x) = \frac{1}{(\hat{\sigma}_0^2)^{n/2}} \frac{1}{(2\pi)^{n/2}} \exp(-n/2)
\]
and the likelihood ratio $L$ is
\[
L(x) = \frac{\max_{\mu \neq \mu_0,\, \sigma^2 > 0} p_{\mu,\sigma^2}(x)}{\max_{\mu = \mu_0,\, \sigma^2 > 0} p_{\mu,\sigma^2}(x)} = \frac{(\hat{\sigma}_0^2)^{n/2}}{(\hat{\sigma}_1^2)^{n/2}}
= \left(\frac{S_n^2 + (\bar{x}_n - \mu_0)^2}{S_n^2}\right)^{n/2}.
\]

Note that
\[
T_{\mu_0}(X)^2 = \frac{n(\bar{X}_n - \mu_0)^2}{(n-1)^{-1}\sum_{i=1}^n (X_i - \bar{X}_n)^2} = (n-1)\,\frac{(\bar{X}_n - \mu_0)^2}{S_n^2}
\]
with $S_n^2$ from (8.14), hence
\[
L(X) = \left(1 + \frac{1}{n-1}\, T_{\mu_0}^2\right)^{n/2}.
\]
Thus $L(x)$ is a strictly monotone increasing function of $|T_{\mu_0}(X)|$, which proves (ii).
Consider now claim (i). For hypotheses $H : \mu \le \mu_0$ vs. $K : \mu > \mu_0$, the one sided t-test is the test
which rejects when the t-statistic
\[
T_{\mu_0}(X) = \frac{(\bar{X}_n - \mu_0)\, n^{1/2}}{S_n}
\]
is too large (with a proper choice of critical value, such that an $\alpha$-test results). It is easy to see
that the rejection region $T_{\mu_0}(X) > t_{n-1;\alpha}$, where $t_{n-1;\alpha}$ is the upper $\alpha$-quantile of the $t_{n-1}$-distribution,

leads to an $\alpha$-test for H (exercise). To show equivalence to the LR test, note that when maximizing
$p_{\mu,\sigma^2}(x)$ over the alternative, the supremum is not attained ($\mu > \mu_0$ is an open interval). However
the supremum is the same as the maximum over $\mu \ge \mu_0$, which is attained by certain maximum
likelihood estimators $\tilde{\mu}_1, \tilde{\sigma}_1^2$ (we will find these, and also the MLEs $\hat{\mu}_0, \hat{\sigma}_0^2$ under H).
The density $p_{\mu,\sigma^2}$ of the data $x = (x_1, \ldots, x_n)$ is again (8.13), (8.14). To find MLEs of $\mu$ and $\sigma^2$
under $\mu > \mu_0$, we first maximize for fixed $\sigma^2$ over all possible $\mu$. When $\bar{x}_n > \mu_0$ the solution is
$\mu = \bar{x}_n$. When $\bar{x}_n \le \mu_0$, the problem is to minimize $(\bar{x}_n - \mu)^2$ under the condition $\mu > \mu_0$. This
minimum is not attained ($\mu$ can be selected arbitrarily close to $\mu_0$, such that still $\mu > \mu_0$, which
makes $(\bar{x}_n - \mu)^2$ arbitrarily close to $(\bar{x}_n - \mu_0)^2$, never attaining this value). However
\[
\inf_{\mu > \mu_0} (\bar{x}_n - \mu)^2 = \min_{\mu \ge \mu_0} (\bar{x}_n - \mu)^2 = (\bar{x}_n - \mu_0)^2.
\]
Thus the MLE of $\mu$ under $\mu \ge \mu_0$ is $\tilde{\mu}_1 = \max(\bar{x}_n, \mu_0)$. This is not the MLE under K, but gives
the supremal value of the likelihood under K for given $\sigma^2$. To continue, we have to maximize in
$\sigma^2$. Now
\[
(\bar{x}_n - \tilde{\mu}_1)^2 = (\min(0, \bar{x}_n - \mu_0))^2
\]
and defining
\[
S_{n,1}^2 := S_n^2 + (\bar{x}_n - \tilde{\mu}_1)^2
\]
we obtain
\[
\sup_{\mu > \mu_0,\, \sigma^2 > 0} p_{\mu,\sigma^2}(x) = \sup_{\sigma^2 > 0} \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{S_{n,1}^2}{2\sigma^2 n^{-1}}\right).
\]
The maximization in $\sigma^2$ is now analogous to the argument above given for part (ii). The maximizing
value is $\tilde{\sigma}_1^2 = S_{n,1}^2$ and the maximized likelihood (which is also the supremal likelihood under K) is
\[
\sup_{\mu > \mu_0,\, \sigma^2 > 0} p_{\mu,\sigma^2}(x) = \frac{1}{(\tilde{\sigma}_1^2)^{n/2}} \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{n}{2}\right).
\]
Now under the hypothesis, since $\mu \le \mu_0$ is a closed interval, the MLEs can straightforwardly be
found. An analogous argument to the one above gives
\begin{align*}
\hat{\mu}_0 &= \min(\bar{x}_n, \mu_0), \\
\min_{\mu \le \mu_0} (\bar{x}_n - \mu)^2 &= (\bar{x}_n - \hat{\mu}_0)^2 = (\max(0, \bar{x}_n - \mu_0))^2, \\
\hat{\sigma}_0^2 &= S_{n,0}^2 \quad \text{where } S_{n,0}^2 := S_n^2 + (\bar{x}_n - \hat{\mu}_0)^2, \\
\sup_{\mu \le \mu_0,\, \sigma^2 > 0} p_{\mu,\sigma^2}(x) &= \frac{1}{(\hat{\sigma}_0^2)^{n/2}} \frac{1}{(2\pi)^{n/2}} \exp\left(-\frac{n}{2}\right).
\end{align*}
Thus the likelihood ratio is
\[
L(x) = \left(\frac{\hat{\sigma}_0^2}{\tilde{\sigma}_1^2}\right)^{n/2} = \left(\frac{S_n^2 + (\bar{x}_n - \hat{\mu}_0)^2}{S_n^2 + (\bar{x}_n - \tilde{\mu}_1)^2}\right)^{n/2}.
\]

Suppose first that the t-statistic $T_{\mu_0}(X)$ has values $\le 0$; this is equivalent to $\bar{x}_n \le \mu_0$. In this case
$\hat{\mu}_0 = \bar{x}_n$, $\tilde{\mu}_1 = \mu_0$, hence
\begin{align*}
L(x) = \left(\frac{S_n^2}{S_n^2 + (\bar{x}_n - \mu_0)^2}\right)^{n/2} &= \left(\frac{1}{1 + (\bar{x}_n - \mu_0)^2/S_n^2}\right)^{n/2} \\
&= \left(\frac{1}{1 + (T_{\mu_0}(X))^2/(n-1)}\right)^{n/2}.
\end{align*}
Thus for nonpositive values of $T_{\mu_0}(X)$, the likelihood ratio $L(x)$ is a monotone decreasing function
of the absolute value of $T_{\mu_0}(X)$, which means it is monotone increasing in $T_{\mu_0}(X)$, on values
$T_{\mu_0}(X) \le 0$.
Consider now nonnegative values of $T_{\mu_0}(X)$: $T_{\mu_0}(X) \ge 0$. Then $\bar{x}_n \ge \mu_0$, hence $\hat{\mu}_0 = \mu_0$, $\tilde{\mu}_1 = \bar{x}_n$
and
\begin{align*}
L(x) = \left(\frac{\hat{\sigma}_0^2}{\tilde{\sigma}_1^2}\right)^{n/2} &= \left(\frac{S_n^2 + (\bar{x}_n - \mu_0)^2}{S_n^2}\right)^{n/2} \\
&= \left(1 + (T_{\mu_0}(X))^2/(n-1)\right)^{n/2}.
\end{align*}
Thus for values $T_{\mu_0}(X) \ge 0$, the likelihood ratio $L(x)$ is monotone increasing in $T_{\mu_0}(X)$.
The two areas of values of $T_{\mu_0}(X)$ we considered do overlap (in $T_{\mu_0}(X) = 0$), and we showed that
$L(x)$ is a monotone increasing function of $T_{\mu_0}(X)$ on both of these. Hence $L(x)$ is a monotone
increasing function of $T_{\mu_0}(X)$.
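The algebraic identity behind part (ii), $L = (1 + T_{\mu_0}^2/(n-1))^{n/2}$, is easy to check numerically. The following Python sketch computes the likelihood ratio directly from the two restricted maximum likelihood variances and compares it with the formula; the simulated data and seed are arbitrary.

import numpy as np

def lr_and_t(x, mu0):
    """Two sided case: likelihood ratio L(x) and the value (1 + T^2/(n-1))^{n/2}."""
    n = len(x)
    xbar = x.mean()
    S2 = np.mean((x - xbar) ** 2)              # S_n^2 as in (8.14), divisor n
    S0_2 = S2 + (xbar - mu0) ** 2              # sigma-hat_0^2 under H
    L = (S0_2 / S2) ** (n / 2)
    T = np.sqrt(n) * (xbar - mu0) / np.sqrt(np.sum((x - xbar) ** 2) / (n - 1))
    return L, (1 + T ** 2 / (n - 1)) ** (n / 2)

rng = np.random.default_rng(1)
x = rng.normal(0.3, 2.0, size=12)
print(lr_and_t(x, 0.0))   # the two numbers agree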
Chapter 9
CHI-SQUARE TESTS

9.1 Introduction
Consider the following problem related to Mendelian heredity. Two characteristics of pea plants
(phenotypes) are observed: form, which may be smooth or wrinkled, and color, which may be yellow
or green. Thus there are 4 combinations of form and color (4 combined phenotypes). Mendelian the-
ory predicts certain frequencies of these in the total population of pea plants; call them $M_1, \ldots, M_4$
(here $M_1$ is the frequency of smooth/yellow etc.); assume these are normed as $\sum_{j=1}^4 M_j = 1$. We
observe a sample of $n$ pea plants; let the observed frequency be $Z_j$ for each phenotype ($j = 1, \ldots, 4$),
so that $\sum_{j=1}^4 Z_j = n$. We wish to find out whether these observations support the Mendelian hypoth-
esis that $(M_1, \ldots, M_4)$ are the frequencies of phenotypes in the total population.

Model Md,2 The observed random vector $Z = (Z_1, \ldots, Z_k)$ has a multinomial distribution $M_k(n, p)$
with unknown probability vector $p = (p_1, \ldots, p_k)$ ($\sum_{j=1}^k p_j = 1$, $p_j \ge 0$, $j = 1, \ldots, k$).

Recall the basic facts about the multinomial law. Consider a random $k$-vector $Y$ of form
$(0, \ldots, 0, 1, 0, \ldots, 0)$ where exactly one component is 1 and the others are 0. The probability that
the 1 is at position $j$ is $p_j$; thus $Y$ can describe an individual falling into one of $k$ categories
(phenotypes in the above example). This $Y$ is said to have law $M_k(1, p)$. If $Y_1, \ldots, Y_n$ are i.i.d. with
law $M_k(1, p)$ then $Z = \sum_{i=1}^n Y_i$ has the law $M_k(n, p)$. (The $Y_i$ may be called counting vectors.)
The probability function is
\[
P(Z = (z_1, \ldots, z_k)) = \frac{n!}{\prod_{j=1}^k z_j!} \prod_{j=1}^k p_j^{z_j} \tag{9.1}
\]
where $\sum_{j=1}^k z_j = n$, $z_j \ge 0$ integer. Since the $j$-th component of $Y_1$ is Bernoulli $B(1, p_j)$, the $j$-th
component $Z_j$ of $Z$ has binomial law $B(n, p_j)$. The $Z_j$ are not independent; in fact $\sum_{j=1}^k Z_j = n$.
For $k = 2$ all the information is in $Z_1$ since $Z_2 = n - Z_1$; thus for $k = 2$ observing a multinomial
$M_2(n, (p_1, p_2))$ is equivalent to observing a binomial $B(n, p_1)$.
In Model Md,2 consider the hypotheses
\[
H : p = p_0, \qquad K : p \neq p_0.
\]
The test we wish to find thus is a significance test. Recall the basic rationale of hypothesis testing:
what we wish to statistically ascertain at level $\alpha$ is K; if K is accepted then it can be claimed that
the deviation from the null hypothesis* H is statistically significant, and there can be reasonable
confidence in the truth of K. On the contrary, when H is accepted, no statistical significance

*The hypothesis H is often called the null hypothesis, even if it is not of the form $\theta = \theta_0$.

claim can be attached to this result. When formulating a test problem, the statement for which
reasonable statistical certainty is desired is taken as K.
Let us find the likelihood ratio test for this problem. Setting $\theta = p$, $\theta_0 = p_0 = (p_{0,1}, \ldots, p_{0,k})$,
denoting by $p_\theta(z)$ the probability function (9.1) of $M_k(n, p)$ and
\[
\Theta = \left\{p : p_j \ge 0,\; \sum_{j=1}^k p_j = 1\right\}, \qquad \Theta_1 = \Theta \setminus \{\theta_0\},
\]
we obtain the likelihood ratio statistic
\[
L(z) = \frac{\sup_{\theta \in \Theta_1} p_\theta(z)}{p_{\theta_0}(z)} = \frac{\sup_{p \in \Theta_1} \prod_{j=1}^k p_j^{z_j}}{\prod_{j=1}^k p_{0,j}^{z_j}}.
\]
Consider the numerator; let us first maximize over $p \in \Theta$ (this will be justified below). If some of
the $z_j$ are 0, we can set the corresponding $p_j = 0$ (making the other $p_j$ larger; we set $0^0 = 1$). We
now maximize over the $p_j$ such that $z_j > 0$. Taking a logarithm, we have to maximize
\[
\sum_{j: z_j > 0} z_j \log p_j
\]
over all $p \in \Theta$. Since $\log x \le x - 1$, we have
\[
\sum_{j: z_j > 0} z_j \log \frac{p_j}{n^{-1} z_j} \le \sum_{j: z_j > 0} z_j \left(\frac{p_j}{n^{-1} z_j} - 1\right) = n \sum_{j: z_j > 0} p_j - \sum_{j: z_j > 0} z_j = 0,
\]
so that
\[
\sum_{j: z_j > 0} z_j \log p_j \le \sum_{j: z_j > 0} z_j \log\left(n^{-1} z_j\right),
\]
and for $p_j = n^{-1} z_j$ equality is attained. This is the unique maximizer since $\log x = x - 1$ only for
$x = 1$. We proved

Proposition 9.1.1 In the multinomial Model Md,2 , with $\mathcal{L}(Z) = M_k(n, p)$ and with no restriction on
the parameter $p \in \Theta$, the maximum likelihood estimator $\hat{p}$ is
\[
\hat{p}(Z) = n^{-1} Z.
\]
The interpretation is that $\hat{p}$ is a vector valued sample mean of the counting vectors $Y_1, \ldots, Y_n$. In
this sense, we have a generalization of the result for binomial observations (Proposition 3.0.3).
Recall that for the LR statistic we have to find the supremum over $\Theta_1 = \Theta \setminus \{p_0\}$, i.e. only one
point $p_0$ is taken out. Since the target function $p \mapsto \prod_{j=1}^k p_j^{z_j}$ is continuous (with $0^0 = 1$) on $\Theta$,
we have
\[
L(z) = \frac{\sup_{\theta \in \Theta_1} p_\theta(z)}{p_{\theta_0}(z)} = \frac{\sup_{\theta \in \Theta} p_\theta(z)}{p_{\theta_0}(z)}
= \frac{\max_{\theta \in \Theta} p_\theta(z)}{p_{\theta_0}(z)} = \prod_{j=1}^k \left(\frac{n^{-1} z_j}{p_{0,j}}\right)^{z_j}.
\]

Since the logarithm is a monotone function, the acceptance region S (complement of the critical /
rejection region) can also be written
\[
S = \left\{z : \log\left((L(z))^{-1}\right) = \sum_{j=1}^k z_j \log \frac{n p_{0,j}}{z_j} \ge c\right\}.
\]
Even the logarithm is a relatively involved function of the data, so it is difficult to find its distribu-
tion under H and to determine the critical value $c$ from that. We will use a Taylor approximation
of the logarithm to simplify it. The basis is the observation that the estimator $\hat{p}(Z)$ is consistent,
i.e. converges in probability to the true probability vector $p$:
\[
\hat{p}(Z) = n^{-1} Z \to_p p.
\]
Under the hypothesis, this true vector is $p_0$, so all values $n^{-1} z_j / p_{0,j}$ converge to one. Note the
Taylor expansion
\[
\log(1 + x) = x - \frac{x^2}{2} + o(x^2) \text{ as } x \to 0,
\]
where $o(x^2)$ is a term which is of smaller order than $x^2$ (such that $o(x^2)/x^2 \to 0$). Thus, assuming
that each term $p_{0,j}/n^{-1} z_j - 1$ is small, we obtain
\begin{align*}
\log\left(L^{-1}(z)\right) &= \sum_{j=1}^k z_j \log\left(1 + \left(\frac{p_{0,j}}{n^{-1} z_j} - 1\right)\right) \\
&\approx \sum_{j=1}^k z_j \left(\frac{p_{0,j}}{n^{-1} z_j} - 1\right) - \frac{1}{2} \sum_{j=1}^k z_j \left(\frac{p_{0,j}}{n^{-1} z_j} - 1\right)^2.
\end{align*}
Here the first term on the right vanishes, since the $p_{0,j}$ sum to one and the $z_j$ sum to $n$. We obtain
\[
\log\left(L^{-1}(z)\right) \approx -\frac{1}{2} \sum_{j=1}^k z_j \left(\frac{p_{0,j}}{n^{-1} z_j} - 1\right)^2.
\]
We need not make the approximation more rigorous, if we do not insist on using the likelihood
ratio test. In fact we will use the LR principle only to find a reasonable test (which should be
shown to have asymptotic level $\alpha$). In this spirit, we proceed with another approximation $n^{-1} z_j \approx
p_{0,j}$ in the denominator to obtain
\[
\frac{1}{2} \sum_{j=1}^k z_j \left(\frac{p_{0,j}}{n^{-1} z_j} - 1\right)^2 = \frac{1}{2} \sum_{j=1}^k z_j \left(\frac{p_{0,j} - n^{-1} z_j}{n^{-1} z_j}\right)^2
\approx \frac{1}{2} \sum_{j=1}^k \frac{n \left(p_{0,j} - n^{-1} z_j\right)^2}{p_{0,j}}.
\]

Definition 9.1.2 In the multinomial Model Md,2 , with $\mathcal{L}(Z) = M_k(n, p)$, the $\chi^2$-statistic relative
to a given parameter vector $p_0$ is
\[
\chi^2(Z) = \sum_{j=1}^k \frac{n \left(p_{0,j} - n^{-1} Z_j\right)^2}{p_{0,j}}.
\]

The name is derived from the asymptotic distribution of this statistic, which we will establish below
(the statistic does not have a $\chi^2$-distribution). The hypothesis $H : p = p_0$ will be rejected if $\chi^2(Z)$
is too large; as shown above, that idea was obtained from the likelihood ratio principle.
But $\chi^2(Z)$ has an interpretation of its own, as a measure of deviation from the hypoth-
esis. Indeed the $\hat{p}_j = n^{-1} Z_j$ are consistent estimators of the true parameters $p_j$, so the sum of squares
$\sum_{j=1}^k (p_{0,j} - \hat{p}_j)^2$ can be seen as a measure of departure from H. In the chi-square statistic, we
have a weighted sum of squares with weights $p_{0,j}^{-1}$.
We know that since each $Z_j$ has a marginal binomial law, for each $j$ we have a convergence in
distribution
\[
n^{1/2} \left(n^{-1} Z_j - p_{0,j}\right) \to_{\mathcal{L}} N(0, p_{0,j}(1 - p_{0,j})), \tag{9.2}
\]
i.e. it has a limiting normal law by the CLT under H. The $\chi^2$ distribution is a sum of squares of
independent normals. However the $Z_j$ are not independent in the multinomial law; so we need
more than the CLT for each $Z_j$: in fact a multivariate CLT for the joint law of $(Z_1, \ldots, Z_k)$ is
required.

9.2 The multivariate central limit theorem


Recall the Central Limit Theorem (CLT) for i.i.d. real valued random variables (Example 7.3.4).
Recall also that since there are no further conditions on the law of $Y_1$, the CLT is valid for both
continuous and discrete $Y_i$. Here the symbol $\to_{\mathcal{L}}$ refers to convergence in distribution (or in law),
see Definition 7.3.1.
Suppose now that $Y_1, \ldots, Y_n$ are i.i.d. random vectors of dimension $k$; we also write $Y$ for a
random vector which has the distribution of $Y_1$. Assume that
\[
E\|Y\|^2 < \infty. \tag{9.3}
\]
Here $\|Y\|^2 = \sum_{j=1}^k Y_j^2$ is the squared Euclidean norm of the random vector $Y$. Recall that for any random
vector $Y$ in $\mathbb{R}^k$, the covariance matrix is defined by
\[
(\mathrm{Cov}(Y))_{j,l} = \mathrm{Cov}(Y_j, Y_l) = E(Y_j - EY_j)(Y_l - EY_l), \quad 1 \le j, l \le k,
\]
if the expectations exist. (Here $(A)_{i,j}$ is the $(i,j)$ entry of a matrix $A$.) This existence is guaranteed
by condition (9.3), as a consequence of the Cauchy-Schwarz inequality:
\[
|\mathrm{Cov}(Y_j, Y_l)|^2 \le \mathrm{Var}(Y_j)\mathrm{Var}(Y_l) \le EY_j^2\, EY_l^2 \le \left(E\|Y\|^2\right)^2.
\]
Note that for expectations of vectors and matrices, the following convention holds: the expectation
of a vector (matrix) is the vector (matrix) of expectations. This means
\[
EY = (EY_j)_{j=1,\ldots,k},
\]
and since the $k \times k$-matrix whose components are $Y_j Y_l$ can be written
\[
YY^\top = (Y_j Y_l)_{j,l=1,\ldots,k},
\]
j=1,...,k ,

the covariance matrix can be written
\[
\mathrm{Cov}(Y) = E(Y - EY)(Y - EY)^\top.
\]
Our starting point for the multivariate CLT is the observation that if $t$ is a nonrandom $k$-vector,
then the r.v.s $t^\top Y_i$, $i = 1, \ldots, n$ are real-valued i.i.d. r.v.s with finite second moment. Indeed,
since for any vectors $t, x$ we have
\[
\left(t^\top x\right)^2 = t^\top x x^\top t,
\]
we obtain
\begin{align*}
\mathrm{Var}(t^\top Y) = E\left(t^\top Y - E t^\top Y\right)^2 &= E\left(t^\top (Y - EY)\right)^2 \\
&= E\, t^\top (Y - EY)(Y - EY)^\top t \\
&= t^\top E(Y - EY)(Y - EY)^\top t = t^\top \mathrm{Cov}(Y)\, t < \infty,
\end{align*}
since $\mathrm{Cov}(Y)$ is a finite matrix. Thus according to the (univariate) CLT, if
\[
\sigma_t^2 = t^\top \mathrm{Cov}(Y)\, t
\]
is not 0, then
\[
n^{1/2} \left(n^{-1} \sum_{i=1}^n t^\top Y_i - E t^\top Y\right) \to_{\mathcal{L}} N(0, \sigma_t^2). \tag{9.4}
\]
Define the sample mean of the random vectors $Y_i$ by
\[
\bar{Y}_n = n^{-1} \sum_{i=1}^n Y_i;
\]
then (9.4) can be written
\[
n^{1/2}\, t^\top \left(\bar{Y}_n - E\bar{Y}_n\right) \to_{\mathcal{L}} N(0, t^\top \mathrm{Cov}(Y)\, t).
\]
This suggests a multivariate normal distribution $N_k(0, \Sigma)$ with $\Sigma = \mathrm{Cov}(Y)$ as the limit law for
the vector $n^{1/2}(\bar{Y}_n - E\bar{Y}_n)$. Indeed recall that for a nonsingular (positive definite) matrix $\Sigma$
\[
\mathcal{L}(Z) = N_k(0, \Sigma) \text{ implies that } \mathcal{L}(t^\top Z) = N(0, t^\top \Sigma t) \text{ for every } t \neq 0
\]
(Lemma 6.1.12; actually the converse is also true, cf. Proposition 9.2.3 below). In the sequel, for
the multivariate CLT we will impose the condition that $\mathrm{Cov}(Y)$ is nonsingular. This means that
$t^\top \mathrm{Cov}(Y)\, t > 0$ for every $t \neq 0$.
For an interpretation of this condition, note that if it is violated then there exists a $t \neq 0$ such that
$t^\top \mathrm{Cov}(Y)\, t = 0$ (this number is the variance of a random variable and thus it cannot be negative).
Then the r.v. $t^\top (Y - EY)$ is 0 with probability 1. Define the hyperplane (linear subspace) in $\mathbb{R}^k$
\[
H = \left\{x \in \mathbb{R}^k : t^\top x = 0\right\};
\]
then $Y - EY \in H$ with probability one. The condition that $\mathrm{Cov}(Y)$ is nonsingular thus excludes this
case of a degenerate random vector, which is actually concentrated on a translate of a linear subspace of $\mathbb{R}^k$.
However a multivariate CLT is still possible if the vector $Y$ is linearly transformed (to a space of
lower dimension).

Definition 9.2.1 Let $Q_n$, $n = 1, 2, \ldots$ be a sequence of distributions in $\mathbb{R}^k$. The $Q_n$ are said
to converge in distribution to a limit $Q_0$ if for random vectors $Y_n$ such that $\mathcal{L}(Y_n) = Q_n$,
$\mathcal{L}(Y_0) = Q_0$,
\[
t^\top Y_n \to_{\mathcal{L}} Q_{0,t} \text{ as } n \to \infty, \qquad Q_{0,t} = \mathcal{L}(t^\top Y_0),
\]
for every $t \in \mathbb{R}^k$, $t \neq 0$. In this case one writes
\[
Y_n \to_{\mathcal{L}} Q_0.
\]

Note that it is not excluded here that the limit law has a singular covariance matrix ($t^\top Y_0$ might
be 0 with probability one for certain $t$, or even for all $t$). However for the multivariate CLT we
will exclude this case by assumption, since we did not systematically treat the multivariate normal
$N_k(0, \Sigma)$ with singular $\Sigma$.
With this definition, it is not immediately clear that the limit law $Q_0$ is unique. It is desirable
to have this uniqueness; otherwise there could be two different limit laws $Q_0$, $Q_0^*$ such that (for
$\mathcal{L}(Y_0) = Q_0$, $\mathcal{L}(Y_0^*) = Q_0^*$)
\[
\mathcal{L}(t^\top Y_0) = \mathcal{L}(t^\top Y_0^*) \text{ for all } t \neq 0.
\]
That this is not possible will follow from Proposition 9.2.3 below.

Theorem 9.2.2 (Multivariate CLT) Let $Y_1, \ldots, Y_n$ be i.i.d. random vectors of dimension $k$,
each with distribution $Q$, fulfilling the condition
\[
E\|Y_1\|^2 < \infty. \tag{9.5}
\]
Let
\[
\bar{Y}_n = n^{-1} \sum_{i=1}^n Y_i
\]
and assume that the covariance matrix $\Sigma = \mathrm{Cov}(Y_1)$ is nonsingular. Then for fixed $Q$ and $n \to \infty$
\[
n^{1/2}\left(\bar{Y}_n - EY_1\right) \to_{\mathcal{L}} N_k(0, \Sigma).
\]

Proof. With our definition of convergence in distribution, it is an immediate consequence of
the univariate (one dimensional) CLT and the properties of the multivariate normal distribution
(following the pattern outlined above, where everything is reduced to the one dimensional case).

The uniqueness of the limiting law follows from the next statement. It means that we can in fact
use all properties of the multivariate normal when it is a limiting law (e.g. independence of
components when they are uncorrelated).

Proposition 9.2.3 Let $Q$, $Q^*$ be the distributions of two random vectors $Y$, $Y^*$ with values in
$\mathbb{R}^k$. Then
\[
\mathcal{L}(Y) = \mathcal{L}(Y^*) \text{ if and only if } \mathcal{L}(t^\top Y) = \mathcal{L}(t^\top Y^*) \text{ for all } t \neq 0.
\]

Proof. The complete argument is beyond the scope of this course; let us discuss some elements
(cp. also the arguments for the proof of the univariate CLT in [D]). Suppose that for all $t \in \mathbb{R}^k$, the
expression
\[
M_Y(t) = E \exp(t^\top Y)
\]
is finite (i.e. the expectation is finite). In that case, $M_Y$ is called the moment generating function
(m.g.f.) of $\mathcal{L}(Y)$. Analogously to the one dimensional case, it can be shown that the m.g.f.
determines $\mathcal{L}(Y)$ uniquely (that is the key argument). Thus if $\mathcal{L}(t^\top Y) = \mathcal{L}(t^\top Y^*)$ then their
univariate m.g.f.s coincide:
\[
E \exp(u\, t^\top Y) = E \exp(u\, t^\top Y^*),
\]
and conversely, if that is the case for all $u$ and $t$ then $M_Y = M_{Y^*}$, hence $\mathcal{L}(Y) = \mathcal{L}(Y^*)$.
Existence of the m.g.f. is a strong additional assumption on a distribution. The proof in the general
case (without any conditions on the laws $\mathcal{L}(Y)$, $\mathcal{L}(Y^*)$) is based on the so called characteristic
function of a random vector
\[
\psi_Y(t) = E \exp(i\, t^\top Y),
\]
where the complex-valued expression
\[
\exp(iz) = \cos(z) + i \sin(z)
\]
occurs (for $z = t^\top Y$). Since
\[
|\exp(iz)| = 1
\]
(absolute value for complex numbers), no special strong assumptions have to be made for the
existence of the characteristic function: it exists for any random vector and can be used in the
proof in much the same way as above. The essential part is again that $\psi_Y$ uniquely determines
$\mathcal{L}(Y)$.
Let $A$ be a subset of $\mathbb{R}^k$. A set is called regular if it has a volume and a boundary of zero volume
(the boundary is the intersection of $\bar{A}$ and $\overline{A^c}$, where $A^c$ is the complement and $\bar{A}$ is the closure
of $A$). Rectangles and balls are regular. The following statement is similar to Proposition 9.2.3, in
the sense that advanced mathematical tools are needed for its proof, and we only quote it here.

Proposition 9.2.4 Let $Y_n$ be a sequence of random vectors in $\mathbb{R}^k$ such that
\[
Y_n \to_{\mathcal{L}} Q_0 \text{ as } n \to \infty,
\]
where $Q_0$ is a continuous law in $\mathbb{R}^k$ ($Q_0$ has a density). Then
\[
P(Y_n \in A) \to Q_0(A)
\]
for all regular sets $A \subseteq \mathbb{R}^k$.

9.3 Application to multinomials


Let us apply the multivariate CLT to the multinomial random vector $Z$. Since the components are
linearly dependent (they sum to $n$), we cannot expect a nonsingular covariance matrix. Recall that
if $\mathcal{L}(Z) = M_k(n, p)$ then $Z = \sum_{i=1}^n Y_i$ where the $Y_i$ are independent $M_k(1, p)$. If $\mathcal{L}(Y) = M_k(1, p)$
then for $(Y_1, \ldots, Y_k)^\top = Y$
\[
EY_j = p_j,
\]

and for $j = l$, since $Y_j$ is binomial,
\[
EY_j Y_l = EY_j^2 = EY_j = p_j,
\]
while for $j \neq l$
\[
EY_j Y_l = P(Y_j = 1, Y_l = 1) = 0
\]
(the random vector $Y$ has a 1 in exactly one position). We can now write down the covariance
matrix:
\[
\mathrm{Cov}(Y_j, Y_l) = EY_j Y_l - EY_j EY_l = \begin{cases} p_j - p_j p_l, & j = l \\ -p_j p_l, & j \neq l. \end{cases}
\]
Introduce transformed variables $Y_j^* = (Y_j - p_j)/p_j^{1/2}$; then
\[
\mathrm{Cov}(Y_j^*, Y_l^*) = p_j^{-1/2} p_l^{-1/2}\, \mathrm{Cov}(Y_j, Y_l) = \begin{cases} 1 - p_j^{1/2} p_l^{1/2}, & j = l \\ -p_j^{1/2} p_l^{1/2}, & j \neq l. \end{cases}
\]
Let $\Lambda$ be the diagonal matrix with diagonal elements $p_j^{-1/2}$. Then, for the vectors
\[
Y^* = \Lambda(Y - p), \qquad p^* = \Lambda p = (p_1^{1/2}, \ldots, p_k^{1/2})^\top,
\]
we have $(Y_1^*, \ldots, Y_k^*)^\top = Y^*$ and
\[
\mathrm{Cov}(Y^*) = I_k - p^* (p^*)^\top. \tag{9.6}
\]
Let $e_1, \ldots, e_k$ be an orthonormal basis of vectors in $\mathbb{R}^k$ such that $e_k = p^*$. That is possible since $p^*$
has length 1:
\[
\|p^*\|^2 = \sum_{j=1}^k p_j = 1.
\]
Then $e_1, \ldots, e_{k-1}$ are all orthogonal to $p^*$. Let $G$, $G_0$ be matrices of dimension $k \times k$, $(k-1) \times k$:
\[
G = \begin{pmatrix} e_1^\top \\ \vdots \\ e_k^\top \end{pmatrix}, \qquad
G_0 = \begin{pmatrix} e_1^\top \\ \vdots \\ e_{k-1}^\top \end{pmatrix}. \tag{9.7}
\]
These matrices have orthonormal rows. Moreover we have
\begin{align}
e_k^\top Y^* = (p^*)^\top Y^* &= \sum_{j=1}^k p_j^{1/2}\, p_j^{-1/2}(Y_j - p_j) \tag{9.8} \\
&= \sum_{j=1}^k (Y_j - p_j) = 1 - 1 = 0. \tag{9.9}
\end{align}

1/2
Lemma 9.3.1 Let L(Z) = Mk (n, p), let be the diagonal matrix with diagonal elements pj
and let the (k 1) k-matrix F be defined by

F = G0
Application to multinomials 123

where G0 is defined by (9.7). Then


k
X
2
kF (Z np)k = p1
j (Zj npj )
2
(9.10)
j=1

and the random vector F Z has covariance matrix:

Cov(F Z) = nIk1 . (9.11)

Proof. Note that
\[
\Lambda(Z - np) = \sum_{i=1}^n Y_i^*,
\]
where the $Y_i$ are i.i.d. $M_k(1, p)$ and
\[
Y_i^* = \Lambda(Y_i - p).
\]
Above it was shown (9.9) that
\[
e_k^\top Y_i^* = (p^*)^\top Y_i^* = 0, \quad i = 1, \ldots, n.
\]
Hence
\[
e_k^\top \Lambda(Z - np) = 0.
\]
This implies
\begin{align*}
\sum_{j=1}^k p_j^{-1}(Z_j - np_j)^2 = \|\Lambda(Z - np)\|^2 &= \|G\Lambda(Z - np)\|^2 \\
&= \sum_{j=1}^k \left(e_j^\top \Lambda(Z - np)\right)^2 = \sum_{j=1}^{k-1} \left(e_j^\top \Lambda(Z - np)\right)^2 \\
&= \|G_0 \Lambda(Z - np)\|^2 = \|F(Z - np)\|^2,
\end{align*}
thus the first claim (9.10) is proved. For the second claim, we note that in view of (9.6) and the
additivity of covariance matrices for independent vectors
\begin{align*}
\mathrm{Cov}(FZ) = n\,\mathrm{Cov}(FY_1) = n\,\mathrm{Cov}(G_0 Y_1^*)
&= n\, G_0 \left(I_k - p^*(p^*)^\top\right) G_0^\top \\
&= n\, G_0 G_0^\top = n I_{k-1}
\end{align*}
(computation rules for covariance matrices can be obtained from the rules for the multivariate
normal, cp. Lemma 6.1.12: $\mathrm{Cov}(AX) = A\,\mathrm{Cov}(X) A^\top$).

The following is the Central Limit Theorem for a multinomial random variable, which generalizes
the de Moivre-Laplace CLT for binomials (sums of i.i.d. Bernoulli r.v.s, cp. (1.2) and (9.2)). Since
the components of the multinomial are dependent, we need to multiply with a $(k-1) \times k$-matrix $F$
first, otherwise we would get a multivariate normal limiting distribution with singular covariance
matrix.

Proposition 9.3.2 Let $\mathcal{L}(Z) = M_k(n, p)$. Then for the $(k-1) \times k$-matrix $F$ defined above we
have
\[
n^{1/2} F\left(n^{-1} Z - p\right) \to_{\mathcal{L}} N_{k-1}(0, I_{k-1}) \text{ as } n \to \infty.
\]

Proof. We can represent $Z$ as a sum of i.i.d. vectors
\[
Z = \sum_{i=1}^n Y_i,
\]
where each $Y_i$ is $M_k(1, p)$. Hence
\[
FZ = \sum_{i=1}^n FY_i;
\]
the $FY_i$ are again i.i.d. vectors with expectation $Fp$ and with unit covariance matrix, according to
(9.11) for $n = 1$. For the multivariate CLT, the second moment condition is fulfilled trivially since
the vector $Y_1$ takes only $k$ possible values. The multivariate CLT (Theorem 9.2.2) yields the result.

The next result justifies the name of the $\chi^2$-statistic, by establishing an asymptotic distribution.

Theorem 9.3.3 Let $\mathcal{L}(Z) = M_k(n, p)$. Then for the $\chi^2$-statistic
\[
\chi^2(Z) = \sum_{j=1}^k \frac{n\left(n^{-1} Z_j - p_j\right)^2}{p_j}
\]
we have
\[
\chi^2(Z) \to_{\mathcal{L}} \chi^2_{k-1} \text{ as } n \to \infty.
\]

Proof. We have according to (9.10)
\[
\chi^2(Z) = n^{-1} \sum_{j=1}^k p_j^{-1}(Z_j - np_j)^2
= n^{-1} \|F(Z - np)\|^2 = \left\|n^{1/2} F(n^{-1} Z - p)\right\|^2.
\]
The above Proposition 9.3.2 implies that the expression inside $\|\cdot\|^2$ is asymptotically multivariate
$(k-1)$-standard normal. Denote this expression
\[
V_n = n^{1/2} F(n^{-1} Z - p).
\]
For convergence in law to $\chi^2_{k-1}$ we have to show that
\[
P\left(\|V_n\|^2 \le t\right) \to F(t),
\]
where $F$ is the distribution function of the law $\chi^2_{k-1}$, at every continuity point $t$ of $F$. Since this
law has a density, $F$ is continuous, so it has to be shown for every $t$ (it suffices for $t \ge 0$). The set
$\{x : \|x\|^2 \le t\}$ is a ball in $\mathbb{R}^{k-1}$ and hence regular in the sense of Proposition 9.2.4. Thus if $\xi$ is a
random vector with law $N_{k-1}(0, I_{k-1})$ then
\[
P\left(\|V_n\|^2 \le t\right) \to P\left(\|\xi\|^2 \le t\right).
\]
By definition of the $\chi^2_{k-1}$ distribution, we have
\[
P\left(\|\xi\|^2 \le t\right) = F(t),
\]
so that the theorem follows from Proposition 9.2.4.
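Theorem 9.3.3 can be checked empirically. The following Python sketch simulates multinomial vectors under a fixed $p$ (the choice of $p$, $n$ and the number of replications is arbitrary) and compares the frequency of large values of $\chi^2(Z)$ with the corresponding $\chi^2_{k-1}$ tail probability.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])          # true p; the hypothesis holds
k, n, n_rep = len(p), 500, 10000

stats = np.empty(n_rep)
for r in range(n_rep):
    Z = rng.multinomial(n, p)
    stats[r] = np.sum((Z - n * p) ** 2 / (n * p))

q = chi2.ppf(0.95, df=k - 1)           # upper 0.05-quantile of chi^2_{k-1}
print(np.mean(stats > q))              # should be close to 0.05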

9.4 Chi-square tests for goodness of fit


In a goodness-of-fit test we test whether the distribution of the data is different from a given
specified distribution $Q_0$. More generally, this term is used also when we test whether the distri-
bution is outside a specified family (such as the normal family). In the first case, the hypothesis
H is simple (consists of $Q_0$) and the test may also be called a significance test, but the terminol-
ogy goodness-of-fit emphasizes that we specifically focus on the actual shape of the distribution.
Goodness of fit to whole families of distributions (taken as H) will be discussed in the next section.

Theorem 9.4.1 Consider Model Md,2 : the observed random $k$-vector $Z$ has law $\mathcal{L}(Z) = M_k(n, p)$
where $p$ is unknown. Consider the hypotheses
\[
H : p = p_0, \qquad K : p \neq p_0.
\]
Let $z_\alpha$ be the upper $\alpha$-quantile of the distribution $\chi^2_{k-1}$. The test $\varphi(Z)$ defined by
\[
\varphi(Z) = \begin{cases} 1 & \text{if } \chi^2(Z) > z_\alpha \\ 0 & \text{otherwise,} \end{cases} \tag{9.12}
\]
where
\[
\chi^2(Z) = \sum_{j=1}^k \frac{(np_{0,j} - Z_j)^2}{np_{0,j}} \tag{9.13}
\]
is the $\chi^2$-statistic relative to H, is an asymptotic $\alpha$-test, i.e. under $p = p_0$
\[
\limsup_{n\to\infty} P(\varphi(Z) = 1) \le \alpha.
\]

The form (9.13) of the $\chi^2$-statistic is easy to memorize: take observed frequency minus expected
frequency, square it, divide by the expected frequency, and sum over components (components are
also called cells).

Remark 9.4.2 On quantiles. In the literature (especially in tables) it is more common to use
lower quantiles, i.e. values $q_\gamma$ such that for a given random variable $X$
\[
P(X \le q_\gamma) \ge \gamma, \qquad P(X \ge q_\gamma) \ge 1 - \gamma.
\]
For $\gamma = 1/2$ one obtains a median (i.e. a theoretical median of the random variable $X$; note
that quantiles $q_\gamma$ need not be unique). When $X$ has a continuous strictly monotone distribution
function $F$ (at least in a neighborhood of $q_\gamma$) then $q_\gamma$ is the unique value with
\[
F(q_\gamma) = \gamma.
\]
Lower quantiles are often written in the form $\chi^2_{k;\gamma}$ if $F$ corresponds to $\chi^2_k$. Thus for the upper
quantile $z_\alpha$ used above we have
\[
z_\alpha = \chi^2_{k-1;1-\alpha}.
\]

Example 9.4.3 Consider again the heredity example (beginning of Section 9.1). Suppose the
values of $(M_1, \ldots, M_4) = (p_{0,1}, \ldots, p_{0,4})$ predicted by the theory are
\[
(p_{0,1}, \ldots, p_{0,4}) = \frac{1}{100^2}\left(91^2,\; 9 \cdot 91,\; 9 \cdot 91,\; 9^2\right)
= (0.8281,\; 0.0819,\; 0.0819,\; 0.0081)
\]
(we do not claim that these are the correct values corresponding to Mendelian theory). Suppose
we have 1000 observations with observed frequency vector
\[
Z = (822, 96, 75, 7)
\]
(these are hypothetical, freely invented data). We have
\begin{align*}
\chi^2(Z) &= \sum_{j=1}^4 \frac{(np_{0,j} - Z_j)^2}{np_{0,j}} \\
&= \frac{(828.1 - 822)^2}{828.1} + \frac{(81.9 - 96)^2}{81.9} + \frac{(81.9 - 75)^2}{81.9} + \frac{(8.1 - 7)^2}{8.1} \\
&= 3.2031.
\end{align*}
At significance level $\alpha = 0.05$, we find $z_\alpha = \chi^2_{3;0.95} = 7.82$. The hypothesis is not rejected.
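The numbers of this example are easy to reproduce; a short Python sketch (using scipy only for the $\chi^2$ quantile):

import numpy as np
from scipy.stats import chi2

Z = np.array([822, 96, 75, 7])
p0 = np.array([0.8281, 0.0819, 0.0819, 0.0081])
n = Z.sum()
expected = n * p0
stat = np.sum((expected - Z) ** 2 / expected)
crit = chi2.ppf(0.95, df=3)            # chi^2_{3;0.95}, about 7.81
print(stat, crit, stat > crit)         # 3.2031, 7.81, False: H is not rejected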

Example 9.4.4 Suppose we have a die and want to test whether it is fair, i.e. all six outcomes
are equally probable. For $n$ independent trials, the frequency vector for the outcomes $(1, \ldots, 6)$ is
\[
Z = (Z_1, \ldots, Z_6)
\]
and the expected frequency vector would be
\[
np_0 = \left(\frac{n}{6}, \ldots, \frac{n}{6}\right).
\]
This can easily be simulated on a computer; in fact one would then test whether the computer
die (i.e. a uniform random number taking values on the integers $\{1, \ldots, 6\}$) actually has the uniform
distribution. The random number generator of QBasic (an Ms-Dos Basic version) was tested in
this way, with $n = 10000$, and a result
\[
z = (1686, 1707, 1739, 1583, 1643, 1642).
\]
We have $n/6 = 1666.667$ and a value for the $\chi^2$-statistic
\[
\chi^2(z) = \sum_{j=1}^6 \frac{(n/6 - z_j)^2}{n/6} = 9.2408.
\]
The quantile at $\alpha = 0.05$ is $z_\alpha = \chi^2_{5;0.95} = 11.07$, so the hypothesis of a uniform distribution cannot
be rejected.
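A modern analogue of this experiment, sketched below in Python, simulates many fair "computer dice" and records how often the test rejects; under the hypothesis the rejection frequency should be close to $\alpha = 0.05$. The generator, the seed and the number of replications are arbitrary choices for the illustration and do not reproduce the QBasic experiment above.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(2)
n, n_rep, k = 10000, 2000, 6
crit = chi2.ppf(0.95, df=k - 1)        # about 11.07

rejections = 0
for _ in range(n_rep):
    rolls = rng.integers(1, 7, size=n)             # a fair computer die
    Z = np.bincount(rolls, minlength=7)[1:]        # counts of the faces 1..6
    stat = np.sum((Z - n / 6) ** 2 / (n / 6))
    rejections += stat > crit
print(rejections / n_rep)              # close to 0.05 when the generator is uniform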

Exercise. Suppose you are intent on proving that the number generator is bad, and run the above
simulation program 20 times. You claim that the random number generator is bad when the test
rejects at least once. Are you still doing an $\alpha$-test? (Assume that $n$ above is large enough so
that the level of one test is practically $\alpha$.)
The $\chi^2$-test can also be used to test the hypothesis that the data follow a specific distribution, not
necessarily multinomial. Suppose the observations are i.i.d. real valued $X_1, \ldots, X_n$, with distribution
$Q$. Suppose $Q_0$ is a specific distribution, and consider the hypotheses
\[
H : Q = Q_0, \qquad K : Q \neq Q_0.
\]
This is transformed into multinomial hypotheses by selecting a partition of the real line into subsets
or cells $A_1, \ldots, A_k$:
\[
\bigcup_{j=1}^k A_j = \mathbb{R}, \qquad A_i \cap A_j = \emptyset, \; j \neq i.
\]
The $A_j$ are often called cells or bins; they are usually intervals. For a real r.v. $X$, define an
indicator vector
\[
Y(X) = (Y_1, \ldots, Y_k), \qquad Y_j = 1_{A_j}(X), \quad j = 1, \ldots, k, \tag{9.14}
\]
i.e. $Y(X)$ indicates into which of the $k$ cells the r.v. $X$ falls. Then obviously $Y(X)$ has a
multinomial distribution:
\[
\mathcal{L}(Y(X)) = M_k(1, p), \qquad p = p(Q) = (Q(A_1), \ldots, Q(A_k)).
\]
This $p(Q)$ is the vector of cell probabilities, corresponding to the given partition. Thus, in the
above problem, the vectors $Y_i = Y(X_i)$ are multinomial; they are sometimes called binned data.
Then
\[
Z := \sum_{i=1}^n Y(X_i) \tag{9.15}
\]
is multinomial $M_k(n, p)$ with the above value of $p$. When $Q$ takes the value $Q_0$ then also the vector
of cell probabilities takes the value
\[
p_0 = p(Q_0) = (Q_0(A_1), \ldots, Q_0(A_k)).
\]
From the initial hypotheses H, K one obtains derived hypotheses
\[
H' : p(Q) = p_0, \qquad K' : p(Q) \neq p_0.
\]
From Theorem 9.4.1 it is clear that as $n \to \infty$, the $\chi^2$-test based on $Z$ for the multinomial hypothesis
$H'$ is again an asymptotic $\alpha$-test.

Corollary 9.4.5 Suppose the observations are i.i.d. real valued $X_1, \ldots, X_n$, with distribution $Q$. Con-
sider the hypotheses
\[
H : Q = Q_0, \qquad K : Q \neq Q_0.
\]
For a partition of the real line into nonintersecting cells $A_1, \ldots, A_k$, define the vector of cell fre-
quencies $Z$ by (9.15). Then the $\chi^2$-test $\varphi(Z)$ defined by (9.12) based on $Z$ is an asymptotic $\alpha$-test as
$n \to \infty$.

This $\chi^2$-test has a very wide range of applicability; it is not specified whether $Q_0$ is discrete or
continuous. Every distribution $Q_0$ gives rise to a specific multinomial distribution $M_k(n, p(Q_0))$,
which is then tested. For instance, a random number generator for standard normal variables can
be tested in this way. On the real line, at least one of the cells $A_j$ contains an unbounded interval.
However there is a certain arbitrariness involved in the choice of the cells $A_1, \ldots, A_k$. In fact
partitioning the data into groups amounts to a coarsening of the hypothesis: there are certainly
distributions $Q \neq Q_0$ which have the same cell probabilities, i.e. $p(Q_0) = p(Q)$. These cannot
be told apart from $Q_0$ by this method. If one chooses a large number of groups $k$, the number of
observations in each cell may be small, so that the approximation based on the CLT appears less
credible.

Remark 9.4.6 A family of probability distributions $\mathcal{P} = \{P_\theta,\, \theta \in \Theta\}$, indexed by $\theta$, is called
parametric if all $\theta$ are finite dimensional vectors ($\Theta \subseteq \mathbb{R}^k$ for some $k$), otherwise $\mathcal{P}$ is called
nonparametric. In hypothesis testing, any hypothesis corresponds to some $\mathcal{P}$, thus the terminology
is extended to hypotheses. Any simple hypothesis (consisting of one probability distribution $\mathcal{P} =
\{Q_0\}$) is parametric. In Corollary 9.4.5, the alternative $K : Q \neq Q_0$ is nonparametric: the set of
all distributions $Q = \mathcal{L}(X_1)$ cannot be parametrized by a finite dimensional vector (take e.g. only
all discrete distributions $\neq Q_0$ characterized by probabilities $q_1, \ldots, q_r$, $r$ arbitrarily large).

Thus we encountered the first nonparametric hypothesis, in the form of the alternative $Q \neq Q_0$. In
this sense, the $\chi^2$-test for goodness of fit in Corollary 9.4.5 is a nonparametric test; in a narrower
sense this term is used for tests which have level $\alpha$ on a nonparametric hypothesis. (However this
$\chi^2$-test actually tests the hypothesis on the cell probabilities $p(Q) = p(Q_0)$, with asymptotic level
$\alpha$, and the set of all $Q$ fulfilling this hypothesis is also nonparametric.)

9.5 Tests with estimated parameters


Back in the multinomial model, consider now the situation where the hypothesis is not $H : p = p_0$
but is also composite: let $\mathcal{H}$ be a $(d+1)$-dimensional linear subspace of $\mathbb{R}^k$ ($0 \le d \le k-1$) and assume
the hypothesis is $H : p \in \mathcal{H}$. An example would be the hypothesis $p_1 = p_2$ (when $k > 2$). Now $p$ is
already in a $(k-1)$-dimensional affine manifold:
\[
S_P = \left\{x : \mathbf{1}^\top x = 1, \; x_j \ge 0, \; j = 1, \ldots, k\right\}, \tag{9.16}
\]
which is called the probability simplex in $\mathbb{R}^k$. It is the set of all $k$-dimensional probability vectors.
(Here $\mathbf{1} = (1, \ldots, 1)^\top \in \mathbb{R}^k$.) Instead of fixing $p_0$ as before, we now have only linear restrictions on
$p$: if $\mathcal{H}^\perp$ is the orthogonal complement of $\mathcal{H}$ and $h_1, \ldots, h_{k-d-1}$ an orthonormal basis of it, then
\[
h_j^\top p = 0, \quad j = 1, \ldots, k-d-1. \tag{9.17}
\]
If there is a probability vector in $\mathcal{H}$, then $h_1, \ldots, h_{k-d-1}, \mathbf{1}$ must be linearly independent (otherwise
$\mathbf{1}$ would be a linear combination of the $h_j$, and we could not have $\mathbf{1}^\top p = 1$), and it follows that
\[
H_0 = S_P \cap \mathcal{H} \tag{9.18}
\]
has dimension $k - (k - d - 1 + 1) = d$. (The dimension of $H_0$ should be understood in the sense
that there are $d + 1$ points $x_0, \ldots, x_d$ in $H_0$ such that $x_j - x_0$, $j = 1, \ldots, d$ are linearly independent,
and no more such points. Similarly, $S_P$ has dimension $k - 1$.) When $d = 0$ then $H_0$ consists of only

one point $p_0$. The setting can be visualized as follows.

The probability simplex for $k = 3$ intersected with the linear space spanned by $\mathbf{1}$ (dimension 1).
The intersection is $k^{-1}\mathbf{1}$ (dimension $d = 0$).

The probability simplex intersected with a linear subspace $\mathcal{H}$ (dimension 2). The intersection is
$H_0$ (dimension $d = 1$).

The multinomial data vector $n^{-1} Z$ also takes values in $S_P$, which means intuitively that there are
$k - 1$ degrees of freedom. Our parameter vector $p$ varies in $H_0$ with dimension $d$, which means
that there are $d$ free parameters under the hypothesis which must be estimated. We now claim

that the corresponding $\chi^2$-statistic has a limiting $\chi^2$-distribution with degrees of freedom
\[
\dim(S_P) - \dim(H_0) = (k - 1) - d = k - d - 1.
\]
Let us discuss what we mean by estimated parameters. A guiding principle is still the likelihood
ratio principle: consider the LR statistic
\[
L(z) = \frac{\sup_{\theta \in \Theta_1} p_\theta(z)}{\sup_{\theta \in \Theta_0} p_\theta(z)}, \tag{9.19}
\]
where $\Theta_0, \Theta_1$ are the parameter spaces under H, K respectively. In the case $\Theta_0 = \{p_0\}$ this led
us to the $\chi^2$-statistic relative to $p_0$
\[
\chi^2(Z) = \sum_{j=1}^k \frac{n\left(p_{0,j} - n^{-1} Z_j\right)^2}{p_{0,j}}.
\]
Since now in (9.19) we also have to maximize over the hypothesis, we should expect that in place
of the $p_{0,j}$ we now obtain estimated values under the hypothesis: $\hat{p} = \hat{p}(Z)$, which are the maximum
likelihood estimators under H.
Write $o_p(n^{-r})$ for a random vector such that $n^{r}\|o_p(n^{-r})\| \to_p 0$, $r \ge 0$.

Lemma 9.5.1 The MLE $\hat{p}$ in a multinomial model $\{M_k(n, p),\; p \in H_0\}$, for a parameter space $H_0$
given by (9.18), fulfills
\[
F(\hat{p} - p) = \Pi F\left(n^{-1} Z - p\right) + o_p(n^{-1/2}), \tag{9.20}
\]
where $F$ is the $(k-1) \times k$-matrix defined in Lemma 9.3.1, $p$ is the true parameter vector, and $\Pi$
is a $(k-1) \times (k-1)$ projection matrix of rank $d$.

Comment. The result means that the $(k-1)$-vector $F(\hat{p} - p)$ almost lies in the $d$-dimensional
linear subspace of $\mathbb{R}^{k-1}$ associated to the projection $\Pi$. This is related to the fact that both $p$, $\hat{p}$
are in the $d$-dimensional manifold $H_0$.

Proof. We present only a sketch, suppressing some technical arguments. Let $p$ denote the true
value and $\tilde{p}$ the one over which one maximizes (and ultimately the maximizing value). Consider
the log-likelihood; up to an additive term which does not depend on $\tilde{p}$ it is
\[
\sum_{j=1}^k Z_j \log \tilde{p}_j.
\]
Maximizing this is the same as minimizing
\[
-2 \sum_{j=1}^k Z_j \log \frac{\tilde{p}_j}{n^{-1} Z_j}.
\]
A preliminary argument (which we cannot give here) shows that $\hat{p}$ is consistent, i.e. $\hat{p} \to_p p$. Also
$n^{-1} Z \to_p p$, so that $\tilde{p}_j / n^{-1} Z_j \to_p 1$ near the maximizing value. A Taylor expansion of the logarithm yields
\[
-2 \sum_{j=1}^k Z_j \log\left(1 + \left(\frac{\tilde{p}_j}{n^{-1} Z_j} - 1\right)\right)
= -2 \sum_{j=1}^k Z_j \left(\frac{\tilde{p}_j}{n^{-1} Z_j} - 1\right) + \sum_{j=1}^k Z_j \left(\frac{\tilde{p}_j}{n^{-1} Z_j} - 1\right)^2 + o_p(1).
\]

Here the first term vanishes (the $\tilde{p}_j$ sum to one and the $Z_j$ sum to $n$) and the remainder is
\[
\sum_{j=1}^k \frac{n\left(n^{-1} Z_j - \tilde{p}_j\right)^2}{n^{-1} Z_j} + o_p(1).
\]
In another approximation step, $n^{-1} Z_j$ is replaced by its limit $p_j$ to yield
\[
\sum_{j=1}^k \frac{n\left(n^{-1} Z_j - \tilde{p}_j\right)^2}{p_j} + o_p(1).
\]
The above expression is one which $\hat{p}$ minimizes. Similarly to (9.10) we now have
\[
\sum_{j=1}^k p_j^{-1}\left(n^{-1} Z_j - \tilde{p}_j\right)^2 = \left\|F\left(n^{-1} Z - \tilde{p}\right)\right\|^2,
\]
since with the choice of the vector $e_k$ and of $\Lambda$ as in Lemma 9.3.1 we have $e_k = \Lambda p$, hence
\[
e_k^\top \Lambda\left(n^{-1} Z - \tilde{p}\right) = \sum_{j=1}^k p_j\, p_j^{-1}\left(n^{-1} Z_j - \tilde{p}_j\right) = 1 - 1 = 0.
\]
Furthermore, denoting
\[
q = F(\tilde{p} - p), \qquad \hat{Z} = F\left(n^{-1} Z - p\right), \tag{9.21}
\]
we obtain
\[
\left\|F\left(n^{-1} Z - \tilde{p}\right)\right\|^2 = \|\hat{Z} - q\|^2 \tag{9.22}
\]
and $\hat{q} = F(\hat{p} - p)$ minimizes
\[
n\, \|\hat{Z} - q\|^2 + o_p(1). \tag{9.23}
\]
Let us disregard the requirement that all components of $\tilde{p}$ must be nonnegative; it can be shown
that since $n^{-1} Z \in S_P$, this requirement is fulfilled automatically for the minimizer $\hat{p}$ if $n$ is large
enough. With this agreement, the vector $q$ varies in the set
\[
H_1 = \{F(x - p),\; x \in S \cap \mathcal{H}\}, \tag{9.24}
\]
where
\[
S = \left\{x : \mathbf{1}^\top x = 1\right\}.
\]
This set $H_1$ can be described as follows. The set $S$ is an affine subspace of $\mathbb{R}^k$; for the given $p \in S_P$
it can be represented
\[
S = \left\{p + z : \mathbf{1}^\top z = 0\right\}.
\]
Then since $p \in \mathcal{H}$, we have
\begin{align*}
H_1^0 := \{(x - p),\; x \in S \cap \mathcal{H}\}
&= \left\{z : \mathbf{1}^\top z = 0,\; z \in \mathcal{H}\right\} \\
&= \left\{z : h_j^\top z = 0,\; j = 1, \ldots, k-d-1,\; \mathbf{1}^\top z = 0\right\},
\end{align*}

and since $\mathbf{1}$ and the $h_j$ are linearly independent, $H_1^0$ is a $d$-dimensional linear subspace of $\mathbb{R}^k$.
Then $H_1 = \{Fy,\; y \in H_1^0\}$ is a linear subspace of $\mathbb{R}^{k-1}$, and by the construction of the matrix $F$
the space $H_1$ has no smaller dimension than $H_1^0$, i.e. also dimension $d$. (Indeed, if the dimension
were less then there would be a nonzero element $z$ of $H_1^0$ orthogonal to all rows of $F$. Since these
are orthogonal to $p$ by construction, cp. (9.7), $z$ must be a multiple of $p$, which implies $\mathbf{1}^\top p = 0$,
in contradiction to $\mathbf{1}^\top p = 1$.) Let $\Pi$ be the projection matrix onto $H_1$ in $\mathbb{R}^{k-1}$. Since $\hat{q}$ minimizes
the distance to $\hat{Z}$ within the space $H_1$, we must have (approximately)
\[
\hat{q} = \Pi \hat{Z},
\]
i.e. $\hat{q}$ is the projection of $\hat{Z}$ onto $H_1$. This is already (9.20), up to the size $o_p(n^{-1/2})$ of the error
term, for which a more detailed argument based on (9.23) is necessary.

Consider now the $\chi^2$ statistic relative to H, with (maximum likelihood) estimated parameter $\hat{p} =
\hat{p}(Z)$. We obtain
\[
\chi^2(Z) = \sum_{j=1}^k \frac{n\left(\hat{p}_j(Z) - n^{-1} Z_j\right)^2}{\hat{p}_j(Z)}.
\]
To find the asymptotic distribution, we substitute the denominator by its probability limit $p_j$ (the
true parameter):
\begin{align*}
\chi^2(Z) &= \sum_{j=1}^k \frac{n\left(\hat{p}_j(Z) - n^{-1} Z_j\right)^2}{p_j} + o_p(1)
= n\left\|F\left(n^{-1} Z - \hat{p}\right)\right\|^2 + o_p(1) \\
&= n\,\|\hat{Z} - \hat{q}\|^2 + o_p(1) = \left\|(I_{k-1} - \Pi)\, n^{1/2} \hat{Z} + o_p(1)\right\|^2 + o_p(1)
\end{align*}
according to the approximation of the Lemma above. Now the matrix $\bar{\Pi} = I_{k-1} - \Pi$ is also a
projection matrix, namely onto the orthogonal complement of $H_1$, of rank $k - 1 - d$. It can be
represented
\[
\bar{\Pi} = C C^\top,
\]
where $C$ is a $(k-1) \times (k-1-d)$ matrix with orthonormal columns (such that $C^\top C = I_{k-1-d}$). Thus
\[
\chi^2(Z) = \left\|n^{1/2} C^\top \hat{Z} + o_p(1)\right\|^2 + o_p(1).
\]
Since
\[
n^{1/2} \hat{Z} = n^{1/2} F\left(n^{-1} Z - p\right) \to_{\mathcal{L}} N_{k-1}(0, I_{k-1}),
\]
it follows
\[
n^{1/2} C^\top \hat{Z} \to_{\mathcal{L}} N_{k-1-d}(0, I_{k-1-d}).
\]
This implies
\[
\chi^2(Z) \to_{\mathcal{L}} \chi^2_{k-d-1}.
\]
We see that if $d$ parameters must be estimated under H, then the degrees of freedom in the limiting
$\chi^2$ law are $k - d - 1$. We argued for a hypothesis $H : p \in H_0$ where $H_0$ is a $d$-dimensional affine
manifold in $\mathbb{R}^k$ described in (9.18) ($0 \le d < k - 1$).

Theorem 9.5.2 Consider Model Md,2 : the observed random $k$-vector $Z$ has law $\mathcal{L}(Z) = M_k(n, p)$
where $p$ is unknown. Let $H_0$ be a $d$-dimensional set of probability vectors of form (9.18). Consider
the hypotheses
\[
H : p \in H_0, \qquad K : p \notin H_0.
\]
Let $\chi^2_{k-d-1;1-\alpha}$ be the lower $1-\alpha$-quantile of the distribution $\chi^2_{k-d-1}$. The test $\varphi(Z)$ defined by
\[
\varphi(Z) = \begin{cases} 1 & \text{if } \chi^2(Z) > \chi^2_{k-d-1;1-\alpha} \\ 0 & \text{otherwise,} \end{cases} \tag{9.25}
\]
where
\[
\chi^2(Z) = \sum_{j=1}^k \frac{(n\hat{p}_j - Z_j)^2}{n\hat{p}_j} \tag{9.26}
\]
is the $\chi^2$-statistic and $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_k)$ is the MLE of $p$ relative to $H_0$, is an asymptotic $\alpha$-test.

Consider now a hypothesis of the form $H : p \in \mathcal{P}$ where $\mathcal{P}$ is a parametric family of probability vectors
\[
\mathcal{P} = \{p_\theta,\; \theta \in \Theta\}
\]
and $\Theta \subseteq \mathbb{R}^d$. Under some smoothness conditions, and assuming that the mapping $\theta \mapsto p_\theta$ is
one-to-one, the set $\mathcal{P}$ can be regarded as a smooth subset of the probability simplex $S_P$. We
can assume that in every point $p_\theta \in \mathcal{P}$ there is a tangent set of form $H_0$ (or tangent affine subspace
$H_1$) which has the same dimension $d$. In this sense, locally we are back in the previous case of
Theorem 9.5.2. Here "locally" means that if the MLE $\hat{\theta}$ of $\theta$ is consistent, it will point us to a
vicinity of the true parameter $\theta$, i.e. the true underlying probability vector $p_\theta$, and in this vicinity
we can substitute $\mathcal{P}$ by its tangent space $H_0$ at $p_\theta$. This is the outline of the proof that the
$\chi^2$-statistic with estimated parameters
\[
\chi^2(Z) = \sum_{j=1}^k \frac{\left(n p_j(\hat{\theta}) - Z_j\right)^2}{n p_j(\hat{\theta})}, \tag{9.27}
\]
where $p_{\hat{\theta}} = (p_1(\hat{\theta}), \ldots, p_k(\hat{\theta}))$, still has a limiting $\chi^2$-distribution with $k - d - 1$ degrees of freedom.
The essential condition is that $d$ parameters are to be estimated, and $\theta \mapsto p_\theta$ is smooth and
one-to-one.
In conjunction with binned data, this procedure can be used to test that the data are in a spe-
cific class of distributions, not necessarily multinomial. Suppose the observations are i.i.d. real valued
$X_1, \ldots, X_n$, with distribution $Q$. Suppose $\mathcal{Q} = \{Q_\theta,\; \theta \in \Theta\}$ is a specific class of distributions, and
consider the hypotheses
\[
H : Q \in \mathcal{Q}, \qquad K : Q \notin \mathcal{Q}.
\]
This is transformed into multinomial hypotheses by selecting a partition of the real line into subsets
or cells $A_1, \ldots, A_k$, as discussed above. We obtain a vector of cell probabilities
\[
p(Q) = (Q(A_1), \ldots, Q(A_k)).
\]



The observed vector of cell frequencies $Z$ is multinomial $M_k(n, p)$ with the above value of $p$. When
$Q$ takes values inside the family $\mathcal{Q}$, then $p(Q)$ also takes values inside a family $\mathcal{P}$ defined by
\[
\mathcal{P} := \{p(Q_\theta),\; \theta \in \Theta\}.
\]
From the initial hypotheses H, K one obtains derived hypotheses
\[
H' : p(Q) \in \mathcal{P}, \qquad K' : p(Q) \notin \mathcal{P}.
\]
Under smoothness conditions on $\mathcal{Q}$ it is clear that as $n \to \infty$, the $\chi^2$-test based on $Z$, with estimated
parameters relative to the hypothesis $H'$, is again an asymptotic $\alpha$-test. Here the degrees of freedom
in the limiting $\chi^2$-distribution is $k - d - 1$ if the family $\mathcal{Q}$ has $d$ parameters.
It should be stressed however that the estimator $\hat{\theta}$ should be the MLE based on the binned multino-
mial data $Z$, not on the original data $X_1, \ldots, X_n$. Thus, for a test of normality, strictly speaking,
one cannot use the sample mean and sample variance as estimators; one has to get the binned data
first and then estimate mean and variance from these multinomial data by maximum likelihood.

9.6 Chi-square tests for independence


Consider a bivariate random variable $X = (X_1, X_2)$ taking values in the finite set of pairs $(j, l)$,
$1 \le j \le r$, $1 \le l \le s$, where $r, s \ge 2$, with probabilities
\[
P((X_1, X_2) = (j, l)) = p_{jl}.
\]
These probabilities give the joint distribution of $(X_1, X_2)$, with marginal distributions
\[
P(X_1 = j) = \sum_{l=1}^s p_{jl} =: q_{1,j}, \qquad P(X_2 = l) = \sum_{j=1}^r p_{jl} =: q_{2,l}.
\]
We are interested in the problem whether $X_1, X_2$ are independent, i.e. whether the joint distribution
is the product of its marginals:
\[
p_{jl} = q_{1,j}\, q_{2,l}, \quad j = 1, \ldots, r, \; l = 1, \ldots, s. \tag{9.28}
\]
Suppose that there are $n$ i.i.d. observations $X_1, \ldots, X_n$, all having the distribution of $X$.
This can easily be transformed into a hypothesis about a multinomial distribution. Call the pairs
$(j, l)$ cells; it is not important that they are pairs of natural numbers; these can just be symbols
for certain categories. Thus there are $rs$ cells; they can be written as an $r \times s$-matrix. Define a
counting variable $Y_i$ associated to observation $X_i = (X_{1i}, X_{2i})$: $Y_i$ is an $r \times s$-matrix such that
\[
Y_i = (Y_{i,jl})_{j=1,\ldots,r}^{l=1,\ldots,s}, \qquad
Y_{i,jl} = \begin{cases} 1 & \text{if } (X_{1i}, X_{2i}) = (j, l) \\ 0 & \text{otherwise.} \end{cases}
\]
These $Y_i$ can be identified with vectors of dimension $k = rs$; when they are looked upon as vectors,
they have a multinomial distribution
\[
\mathcal{L}(Y_i) = M_k(1, p),
\]

where $p$ is the $r \times s$-matrix of cell probabilities $p_{jl}$. We can also define counting vectors for each variable $X_1, X_2$ separately:
\[
Y_{1,i} = (Y_{1,i,j})_{j=1,\ldots,r}, \qquad Y_{2,i} = (Y_{2,i,l})_{l=1,\ldots,s},
\]
\[
Y_{1,i,j} = \begin{cases} 1 & \text{if } X_{1i} = j \\ 0 & \text{otherwise,} \end{cases} \qquad
Y_{2,i,l} = \begin{cases} 1 & \text{if } X_{2i} = l \\ 0 & \text{otherwise.} \end{cases}
\]
Then the counting matrix $Y_i$ is obtained as
\[
Y_i = Y_{1,i} Y_{2,i}^\top.
\]
Again we have $n$ of these observed counting vectors, and the matrix (or vector) of observed cell frequencies is
\[
Z = \sum_{i=1}^n Y_i.
\]
Let us stress again that we identify an $r \times s$ matrix with an $rs$-vector here. We write $M_{r \times s}(n, p)$ for the multinomial distribution of an $r \times s$-matrix with corresponding matrix of probabilities $p$. (This can be identified with $M_{rs}(n, p)$ when $p$ is construed as a vector.) We can also define cell frequencies for the variables $X_1, X_2$ separately:
\[
Z_1 = \sum_{i=1}^n Y_{1,i}, \qquad Z_2 = \sum_{i=1}^n Y_{2,i}. \tag{9.29}
\]
The hypothesis of independence of $X_1, X_2$ translates into a hypothesis about the shape of the probability matrix $p$: according to (9.28), for
\[
q_1 = (q_{1,j})_{j=1,\ldots,r}, \qquad q_2 = (q_{2,l})_{l=1,\ldots,s},
\]
we have
\[
H: p = q_1 q_2^\top, \qquad K: p \text{ is not of this form.}
\]
The hypothesis $H$ can be written in the form $H: p \in \mathcal{P}_I$ where $\mathcal{P}_I$ is a parametric family of probability vectors (the lower index $I$ in $\mathcal{P}_I$ stands for "independence"):
\[
\mathcal{P}_I = \bigl\{ q_1 q_2^\top :\ q_1 \in S_{P,r},\ q_2 \in S_{P,s} \bigr\}
\]
where $S_{P,r}$ is the probability simplex in $\mathbb{R}^r$. Indeed, in the case of independence we have just the two marginals, which are two probability vectors in $\mathbb{R}^r$, $\mathbb{R}^s$ respectively. These marginal probability vectors have $r - 1$ and $s - 1$ independent parameters respectively (the respective $r - 1$, $s - 1$ first components). Thus $\mathcal{P}_I$ can be smoothly parametrized by an $(r + s - 2)$-dimensional parameter $\theta \in \Theta$, where $\Theta$ is a subset of $\mathbb{R}^{r+s-2}$ (but we do not make this explicit). Thus the hypotheses are
\[
H: p \in \mathcal{P}_I, \qquad K: p \notin \mathcal{P}_I.
\]
Define marginal cell frequencies
\[
Z_{j\cdot} = \sum_{l=1}^s Z_{jl}, \qquad Z_{\cdot l} = \sum_{j=1}^r Z_{jl}.
\]
These coincide with the components of the vectors $Z_1$, $Z_2$ defined in (9.29):
\[
Z_{j\cdot} = \#\{i : X_{1i} = j\}, \qquad Z_{\cdot l} = \#\{i : X_{2i} = l\}.
\]

Proposition 9.6.1 In the multinomial model $M_{d,2}$, when $Z$ is a multinomial $r \times s$ matrix with law $\mathcal{L}(Z) = M_{r \times s}(n, p)$, $r, s \ge 2$, the maximum likelihood estimator $\hat p$ under the hypothesis of independence $p \in \mathcal{P}_I$ is
\[
\hat p = \hat q_1 \hat q_2^\top = (\hat q_{1,j}\, \hat q_{2,l})_{j=1,\ldots,r}^{l=1,\ldots,s}
\]
where
\[
\hat q_1 = n^{-1} Z_1, \qquad \hat q_2 = n^{-1} Z_2
\]
and $Z_1$, $Z_2$ are the vectors of marginal cell frequencies
\[
Z_1 = (Z_{j\cdot})_{j=1,\ldots,r}, \qquad Z_2 = (Z_{\cdot l})_{l=1,\ldots,s}.
\]

Proof. The probability function for $Z$ is (for a matrix $z = (z_{jl})_{j=1,\ldots,r}^{l=1,\ldots,s}$)
\[
P(Z = z) = C(n, z) \prod_{j=1}^r \prod_{l=1}^s p_{jl}^{z_{jl}}
\]
where $C(n, z)$ is a factor which does not depend on the parameters. Independence means $p_{jl} = q_{1,j} q_{2,l}$, so the likelihood function is
\[
l(q_1, q_2) = C(n, z) \prod_{j=1}^r \prod_{l=1}^s (q_{1,j} q_{2,l})^{z_{jl}}
= C(n, z) \prod_{j=1}^r \prod_{l=1}^s q_{1,j}^{z_{jl}} \prod_{j=1}^r \prod_{l=1}^s q_{2,l}^{z_{jl}}
= C(n, z) \prod_{j=1}^r q_{1,j}^{z_{j\cdot}} \prod_{l=1}^s q_{2,l}^{z_{\cdot l}}.
\]
Now the factor $\prod_{l=1}^s q_{2,l}^{z_{\cdot l}}$ is the likelihood (up to a factor) for the multinomial vector $Z_2$ defined in (9.29) with parameter $q_2$, and $\prod_{j=1}^r q_{1,j}^{z_{j\cdot}}$ is proportional to the likelihood for $Z_1$. Thus maximizing over $q_1, q_2$ amounts to maximizing the product of two multinomial likelihoods, each in its own parameter $q_1$, $q_2$. The maximizer of each likelihood is the unrestricted MLE in a multinomial model for $Z_1$ or $Z_2$, thus according to Proposition 9.1.1
\[
\hat q_1 = n^{-1} Z_1, \qquad \hat q_2 = n^{-1} Z_2.
\]

This proves the result.


We can now write down the $\chi^2$-statistic with estimated parameters (estimated under the independence hypothesis $p \in \mathcal{P}_I$), according to (9.27):
\[
\chi^2(Z) = \sum_{j=1}^r \sum_{l=1}^s \frac{\bigl(Z_{jl} - n(n^{-1}Z_{j\cdot})(n^{-1}Z_{\cdot l})\bigr)^2}{n(n^{-1}Z_{j\cdot})(n^{-1}Z_{\cdot l})}
= \sum_{j=1}^r \sum_{l=1}^s \frac{(n Z_{jl} - Z_{j\cdot} Z_{\cdot l})^2}{n Z_{j\cdot} Z_{\cdot l}}. \tag{9.30}
\]

The dimension of $Z$ is $k = rs$, while the number of estimated parameters is $d = r - 1 + s - 1$, so according to the result in the previous section, as $n \to \infty$
\[
\chi^2(Z) \xrightarrow{\ \mathcal{L}\ } \chi^2_{k-d-1} = \chi^2_{rs-1-(r-1)-(s-1)} = \chi^2_{(r-1)(s-1)}.
\]
We thus obtain an asymptotic $\alpha$-test for independence.

We assumed initially that the two random variables $(X_1, X_2)$ take discrete values $(j, l)$, $1 \le j \le r$, $1 \le l \le s$. But obviously, when both take real values, one can use two partitions $A_{1,j}$, $j = 1, \ldots, r$ and $A_{2,l}$, $l = 1, \ldots, s$ and define cells
\[
B_{j,l} = A_{1,j} \times A_{2,l}.
\]
For $n$ i.i.d. data $(X_{1i}, X_{2i})$, this gives rise to observed cell frequencies
\[
Z_{jl} = \#\{i : (X_{1i}, X_{2i}) \in B_{j,l}\},
\]
which again form a multinomial matrix $Z$ as above. The hypothesis of independence of $X_1$ and $X_2$ gives rise to a derived hypothesis $p \in \mathcal{P}_I$ as above. Thus the $\chi^2$-statistic can again be used to test independence.
A contingency table is a matrix of the form
\[
\begin{array}{c|ccccc|c}
 & l=1 & \cdots & l & \cdots & l=s & \\
\hline
j=1 & Z_{11} & & \cdots & & Z_{1s} & Z_{1\cdot} \\
\vdots & \vdots & & & & \vdots & \vdots \\
j & Z_{j1} & & Z_{jl} & & Z_{js} & Z_{j\cdot} \\
\vdots & \vdots & & & & \vdots & \vdots \\
j=r & Z_{r1} & & \cdots & & Z_{rs} & Z_{r\cdot} \\
\hline
 & Z_{\cdot 1} & \cdots & Z_{\cdot l} & \cdots & Z_{\cdot s} &
\end{array}
\]
It serves as a symbolic aid in computing the $\chi^2$-statistic (9.30). The $\chi^2$-test for independence is also called the $\chi^2$-test in a contingency table.
Exercise. Test your random number generator for independence in consecutive pairs. If $N_1, N_2, \ldots$ is the sequence generated, then take pairs $X_1 = (N_1, N_2)$, $X_2 = (N_3, N_4), \ldots$, and test independence of the first from the second component. Note: if they are not independent then presumably the pairs $X_i$ are also not independent, so the alternatives which one might formulate are different from the above. Still the contingency table provides an asymptotic $\alpha$-test.
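
A sketch of this exercise in Python: the contingency table is built from consecutive pairs of generator output, and the statistic (9.30) is compared with the $\chi^2_{(r-1)(s-1)}$ quantile. The quartile cells and the level are illustrative choices, not prescribed by the text.

```python
# Sketch: chi-square test of independence in a contingency table, applied to
# consecutive pairs from a random number generator.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
N = rng.random(2000)                        # N_1, N_2, ... from the generator under test
X1, X2 = N[0::2], N[1::2]                   # pairs (N_1, N_2), (N_3, N_4), ...

cuts = np.array([0.25, 0.5, 0.75])          # partitions A_{1,j}, A_{2,l}: 4 cells each (illustrative)
r = s = len(cuts) + 1
Z = np.zeros((r, s))
np.add.at(Z, (np.searchsorted(cuts, X1), np.searchsorted(cuts, X2)), 1)   # cell frequencies Z_{jl}

n = Z.sum()
Zrow, Zcol = Z.sum(axis=1), Z.sum(axis=0)   # marginal cell frequencies Z_{j.}, Z_{.l}
# chi-square statistic (9.30) with estimated marginals
chi2_stat = np.sum((n * Z - np.outer(Zrow, Zcol)) ** 2 / (n * np.outer(Zrow, Zcol)))
crit = chi2.ppf(0.95, df=(r - 1) * (s - 1))
print(chi2_stat, crit, chi2_stat > crit)    # reject independence if statistic exceeds quantile
```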
Chapter 10
REGRESSION

10.1 Regression towards the mean


To introduce the term, we start with a quotation from the Merriam-Webster dictionary for that
entry.
Regression: a: a trend or shift toward a lower or less perfect state: as a : progressive decline of a
manifestation of disease b :... c : reversion to an earlier mental or behavioral level d : a functional
relationship between two or more correlated variables that is often empirically determined from
data and is used especially to predict values of one variable when given values of the others <the
regression of y on x is linear>; specifically : a function that yields the mean value of a random
variable under the condition that one or more independent variables have specified values.
Let us explain the origin of the usage d in mathematical statistics. Around 1886 the biometrist Francis Galton observed the size of pea plants and their offspring. He observed an effect which can equivalently be described by observing the body height of fathers and sons in the human species; since the latter is a popular example, we shall phrase his results in those terms. Predictably, he noticed that tall fathers tended to have tall sons and short fathers tended to have short sons. At the same time, he noticed that the sons of tall fathers tended to be shorter than their fathers, while the sons of short fathers tended to be less short. Thus the respective height of sons tended to be closer to the overall average body height of the total population. He called this effect "regression towards the mean".
This phenomenon can be confirmed in a probabilistic model, when we assume a joint normal distribution for the height of fathers $X$ and sons $Y$. Let $(X, Y)$ have a bivariate normal distribution with mean vector $(a, a)$, where $a > 0$. This $a$ is the average total population height; for simplicity assume it is the same for $X$ and $Y$, i.e. height does not increase from one generation to the next. It should also be assumed that $X$ and $Y$ have the same variance:
\[
\operatorname{Var}(X) =: \sigma^2_X = \operatorname{Var}(Y) =: \sigma^2_Y
\]
and that they are positively correlated:
\[
\operatorname{Cov}(X, Y) =: \sigma_{XY} > 0
\]
(recall the correlation between $X$ and $Y$ is
\[
\rho = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\operatorname{Var}(Y)}} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}).
\]

[Footnote: http://www.m-w.com]
[Footnote: Sources differ as to what he actually observed; the textbook has fathers and sons (p. 555), while the Encyclopedia of Statistical Sciences quotes Galton about seeds.]

Thus
\[
\mathcal{L}\binom{X}{Y} = N_2(a\mathbf{1}, \Sigma), \qquad
\Sigma = \begin{pmatrix} \sigma^2_X & \sigma_{XY} \\ \sigma_{XY} & \sigma^2_Y \end{pmatrix}.
\]
The average body height of sons, given the height of the father, is described by the conditional expectation $E(Y \mid X = x)$. To find it, we state a basic result on the conditional distribution $\mathcal{L}(Y \mid X = x)$ in a bivariate normal. Some special cases appeared already (Proposition 5.5.2).

Proposition 10.1.1 Suppose that $X$ and $Y$ have a joint bivariate normal distribution with expectation vector $\mu = (\mu_X, \mu_Y)^\top$ and positive definite covariance matrix $\Sigma$:
\[
\mathcal{L}\binom{X}{Y} = N_2(\mu, \Sigma), \qquad
\Sigma = \begin{pmatrix} \sigma^2_X & \sigma_{XY} \\ \sigma_{XY} & \sigma^2_Y \end{pmatrix}. \tag{10.1}
\]
Then
\[
\mathcal{L}(Y \mid X = x) = N\bigl(\mu_Y + \beta(x - \mu_X),\ \sigma^2_{Y|X}\bigr) \tag{10.2}
\]
where
\[
\beta = \frac{\sigma_{XY}}{\sigma^2_X}, \qquad
\sigma^2_{Y|X} = \sigma^2_Y - \frac{\sigma^2_{XY}}{\sigma^2_X}.
\]

Proof. Recall the form of the joint density $p$ of $X$ and $Y$ (Lemma 6.1.10): for $z = (x, y)^\top$
\[
p(z) = p(x, y) = \frac{1}{(2\pi)\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (z - \mu)^\top \Sigma^{-1} (z - \mu) \right).
\]
The marginal density of $x$ is
\[
p_1(x) = \frac{1}{(2\pi)^{1/2} \sigma_X} \exp\left( -\frac{(x - \mu_X)^2}{2\sigma^2_X} \right).
\]
The density of $\mathcal{L}(Y \mid X = x)$ (conditional density) is given by
\[
p(y|x) = \frac{p(x, y)}{p_1(x)},
\]
and the conditional expectation $E(Y \mid X = x)$ can be read off that density. Note that
\[
|\Sigma| = \sigma^2_X \sigma^2_Y - \sigma^2_{XY}, \qquad
\Sigma^{-1} = \begin{pmatrix} \sigma^2_X & \sigma_{XY} \\ \sigma_{XY} & \sigma^2_Y \end{pmatrix}^{-1}
= \frac{1}{|\Sigma|} \begin{pmatrix} \sigma^2_Y & -\sigma_{XY} \\ -\sigma_{XY} & \sigma^2_X \end{pmatrix}.
\]
For ease of notation write $\tilde x = x - \mu_X$, $\tilde y = y - \mu_Y$. This gives
\[
p(x, y) = \frac{1}{(2\pi)\,|\Sigma|^{1/2}} \exp\left( -\frac{1}{2|\Sigma|} \bigl( \tilde x^2 \sigma^2_Y - 2\tilde x \tilde y\, \sigma_{XY} + \tilde y^2 \sigma^2_X \bigr) \right),
\]
\[
p(y|x) = \frac{\sigma_X}{(2\pi)^{1/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2|\Sigma|} \bigl( \tilde x^2 \sigma^2_Y - 2\tilde x \tilde y\, \sigma_{XY} + \tilde y^2 \sigma^2_X \bigr) + \frac{\tilde x^2}{2\sigma^2_X} \right).
\]

Note that
\[
\tilde x^2 \sigma^2_Y - 2\tilde x \tilde y\, \sigma_{XY} + \tilde y^2 \sigma^2_X
= \tilde x^2 \sigma^2_Y + \sigma^2_X \left( \tilde y - \tilde x \frac{\sigma_{XY}}{\sigma^2_X} \right)^2 - \tilde x^2 \frac{\sigma^2_{XY}}{\sigma^2_X}
= \sigma^2_X (\tilde y - \beta \tilde x)^2 + \tilde x^2 \sigma^2_{Y|X},
\]
\[
\frac{|\Sigma|}{\sigma^2_X} = \sigma^2_Y - \frac{\sigma^2_{XY}}{\sigma^2_X} = \sigma^2_{Y|X}.
\]
The terms involving $\tilde x^2$ now cancel out, and we obtain
\[
p(y|x) = \frac{1}{(2\pi)^{1/2} \sigma_{Y|X}} \exp\left( -\frac{1}{2\sigma^2_{Y|X}} (\tilde y - \beta \tilde x)^2 \right).
\]
In view of
\[
\tilde y - \beta \tilde x = y - \mu_Y - \beta(x - \mu_X)
\]
this proves (10.2).

Corollary 10.1.2 If $X, Y$ have a bivariate normal distribution, then
(i) $E(Y \mid X = x)$ is a linear function of $x$;
(ii) the variance of $\mathcal{L}(Y \mid X = x)$ (conditional variance) does not depend on $x$.

Definition 10.1.3 Let $X$ and $Y$ have a joint bivariate normal distribution (10.1).
(i) The quantity
\[
\beta = \frac{\sigma_{XY}}{\sigma^2_X}
\]
is called the regression coefficient for the regression of $Y$ on $X$.
(ii) The linear function
\[
y = E(Y \mid X = x) = \mu_Y + \beta(x - \mu_X) \tag{10.3}
\]
is called the regression function or regression line (for $Y$ on $X$).

Thus $\beta$ is the slope of the regression line and $\mu_Y - \beta \mu_X$ is its intercept.


Note that the regression line always goes through the point $(\mu_X, \mu_Y)$ given by the two means. Indeed for $x = \mu_X$ in (10.3) we obtain $y = \mu_Y$.
Furthermore, the absolute value of $\beta$ is bounded:
\[
|\beta| = \frac{|\sigma_{XY}|}{\sigma^2_X} \le \frac{\sigma_X \sigma_Y}{\sigma^2_X} = \frac{\sigma_Y}{\sigma_X}
\]
as a consequence of the Cauchy-Schwarz inequality. For the conditional expectation we obtain
\[
E(Y \mid X = x) = \mu_Y + \beta(x - \mu_X).
\]
For the father/son height model of Galton, we assumed that $\mu_Y = \mu_X = a$, furthermore $\sigma_Y = \sigma_X$, and positive correlation: $\sigma_{XY} > 0$. This implies
\[
0 < \beta \le 1.
\]
Here equality is not possible in $\le$: since $\Sigma$ is positive definite, it cannot be singular, hence $|\Sigma| = \sigma^2_X \sigma^2_Y - \sigma^2_{XY} > 0$, so that
\[
|\sigma_{XY}| < \sigma_X \sigma_Y,
\]

which means that
\[
0 < \beta < 1.
\]
This gives the desired mathematical explanation of Galton's "regression toward the mean". We have
\[
E(Y \mid X = x) = a + \beta(x - a) = (1 - \beta)a + \beta x,
\]
which means that $E(Y \mid X = x)$ is a convex combination of $x$ and $a$; the average height of sons, given the height $x$ of fathers, is always pulled toward the mean height $a$. It is less than $x$ if $x > a$ and is greater than $x$ if $x < a$.
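
To make the pull toward the mean concrete, here is a tiny numerical sketch; the numbers (mean height $a$, variance, covariance) are invented for illustration and are not from the text.

```python
# Illustrative numbers: a = 69 in, sigma_X = sigma_Y = 2.5 in, sigma_XY = 3.125,
# hence beta = 0.5; E(Y | X = x) = (1 - beta) a + beta x is pulled toward a.
a, sigma2_X, sigma_XY = 69.0, 2.5 ** 2, 3.125
beta = sigma_XY / sigma2_X                        # regression coefficient, here 0.5
cond_mean = lambda x: (1 - beta) * a + beta * x   # E(Y | X = x)
print(cond_mean(75.0), cond_mean(63.0))           # 72.0 and 66.0: both closer to a = 69
```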
It is interesting to note that an analogous phenomenon is observed for the relationship of sons with given height to their fathers. Indeed, reversing the roles of $X$ and $Y$, we obtain the second regression line
\[
x = E(X \mid Y = y) = \mu_X + \beta'(y - \mu_Y), \qquad \beta' = \frac{\sigma_{XY}}{\sigma^2_Y}. \tag{10.4}
\]
Under the assumption $\sigma^2_Y = \sigma^2_X$ we have $\beta' = \beta$, hence $0 < \beta' < 1$, so that the fathers of tall sons tend to be shorter etc. In (10.4) $x$ is given as a function of $y$; when we put it in the same form as the first regression line, with $y$ a function of $x$, we obtain
\[
y = \mu_Y + \frac{1}{\beta'}(x - \mu_X)
\]
and it turns out that the other regression line also goes through the point $(\mu_X, \mu_Y)$, but has a different slope $1/\beta'$. This slope is higher ($1/\beta' > 1$) if $\sigma^2_Y = \sigma^2_X$. The linear function (10.4) is said to pertain to the regression of $X$ on $Y$.
Back in the general bivariate normal $N_2(\mu, \Sigma)$, consider the random variable
\[
\varepsilon = Y - E(Y \mid X = x) = Y - \mu_Y - \beta(x - \mu_X);
\]
we know from (10.2) that it has conditional law $N(0, \sigma^2_{Y|X})$, given $x$. It is often called the residual (random variable). With the conditional law of $\varepsilon$, we can form the joint law of $\varepsilon$ and $X$; since the conditional law does not depend on $x$ (Corollary 10.1.2 (ii)), it turns out that $\varepsilon$ and $X$ are independent.

Corollary 10.1.4 If $X, Y$ have a bivariate normal distribution, then the residual
\[
\varepsilon = Y - E(Y \mid X) = Y - \mu_Y - \beta(X - \mu_X)
\]
and $X$ are independent.

Recall that $E(Y \mid X)$ denotes the conditional expectation as a random variable, i.e. $E(Y \mid X = x)$ when $X$ is understood as random.
As a consequence, we can write any bivariate normal distribution as
\[
Y = \mu_Y + \beta\xi + \varepsilon, \tag{10.5}
\]
\[
X = \mu_X + \xi, \tag{10.6}
\]
where $\xi$, $\varepsilon$ are independent normal with laws $N(0, \sigma^2_X)$, $N(0, \sigma^2_{Y|X})$ respectively. If $\mu_X = 0$ we can write
\[
Y = \mu_Y + \beta X + \varepsilon.
\]
Assume that we wish to obtain a representation (10.5), (10.6) in terms of standard normals. Define $\varepsilon_0 = \sigma^{-1}_{Y|X}\varepsilon$, $\xi_0 = \sigma^{-1}_X \xi$; then for the matrix
\[
M = \begin{pmatrix} \sigma_X & 0 \\ \beta\sigma_X & \sigma_{Y|X} \end{pmatrix}
= \begin{pmatrix} \sigma_X & 0 \\ \sigma_{XY}\sigma^{-1}_X & \bigl(\sigma^2_Y - \sigma^2_{XY}\sigma^{-2}_X\bigr)^{1/2} \end{pmatrix}
\]
we have for $Z = (X, Y)^\top$, $U = (\xi_0, \varepsilon_0)^\top$
\[
Z = \mu + MU, \qquad \mathcal{L}(U) = N_2(0, I_2). \tag{10.7}
\]
Since $\Sigma$ is the covariance matrix of $Z$ we should have
\[
\Sigma = MM^\top
\]
according to the rules for the multivariate normal. The above relation can indeed be verified, and represents a decomposition of $\Sigma$ into a product of a lower triangular matrix with its transpose.
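
A small numerical sketch of the factorization $\Sigma = MM^\top$ and the representation (10.7); the entries of $\Sigma$ are arbitrary illustrative numbers, and the comparison with numpy's Cholesky routine is an observation about this particular lower-triangular factor, not a claim from the text.

```python
# Sketch: build M from Sigma as in (10.7), verify Sigma = M M^T, and simulate Z = mu + M U.
import numpy as np

sigma_X, sigma_Y, sigma_XY = 1.0, 2.0, 1.2          # illustrative, positive definite Sigma
Sigma = np.array([[sigma_X ** 2, sigma_XY],
                  [sigma_XY, sigma_Y ** 2]])
beta = sigma_XY / sigma_X ** 2
sigma_YgX = np.sqrt(sigma_Y ** 2 - sigma_XY ** 2 / sigma_X ** 2)
M = np.array([[sigma_X, 0.0],
              [beta * sigma_X, sigma_YgX]])          # lower triangular factor of Sigma
print(np.allclose(M @ M.T, Sigma))                   # Sigma = M M^T
print(np.allclose(np.linalg.cholesky(Sigma), M))     # it coincides with the Cholesky factor

rng = np.random.default_rng(2)
U = rng.standard_normal((2, 100000))                 # U = (xi_0, eps_0)^T, standard normal
mu = np.array([[0.0], [0.0]])
Z = mu + M @ U                                       # Z = (X, Y)^T as in (10.7)
print(np.cov(Z))                                     # empirical covariance close to Sigma
```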
Next we define a regression function for possibly nonnormal distributions.

Definition 10.1.1 Let $X, Y$ have a continuous or discrete joint distribution, where the first moment of $Y$ exists: $E|Y| < \infty$. The regression function (for $Y$ on $X$) is defined as
\[
r(x) := E(Y \mid X = x).
\]

In the normal case, we have seen that $r$ is a linear function, determined by the regression coefficient and the two means. In the general case, one should verify that $E(Y \mid X = x)$ exists, and clarify the problem of uniqueness. Let us do that in the continuous case.

Proposition 10.1.1 Let $X, Y$ have a continuous joint distribution, where $E|Y| < \infty$.
(i) There is a version of the conditional density $p(y|x)$ such that for this density, $E(Y \mid X = x)$ exists for all $x$.
(ii) For all versions of $p(y|x)$, the law of the random variable $E(Y \mid X) = r(X)$ is the same, thus $\mathcal{L}(E(Y \mid X))$ is unique.

Proof. Recall Definition 5.4.2: any version of the conditional density $p(y|x)$ fulfills
\[
p(x, y) = p(y|x)\, p_X(x). \tag{10.8}
\]
Now $E|Y| < \infty$ means that
\[
\infty > \int \int |y|\, p(x, y)\, dy\, dx = \int \int |y|\, p(y|x)\, p_X(x)\, dy\, dx
= \int p_X(x) \left( \int |y|\, p(y|x)\, dy \right) dx.
\]
Let $A$ be the set of $x$ such that $g(x) = \int |y|\, p(y|x)\, dy$ is infinite. For this $A$ we must have $\int_A p_X(x)\, dx = 0$, or else the whole expression above would be infinite. (This needs a few lines of reasoning with integrals.) Hence we must have $P(X \in A) = 0$. For these $x$ we can modify our version of $p(y|x)$, e.g. take it as the standard normal density (as in the proof of Lemma 5.4.3), so that $\int |y|\, p(y|x)\, dy$ is also finite for these $x$. Thus we found a version $p(y|x)$ such that
\[
E(Y \mid X = x) = \int y\, p(y|x)\, dy \tag{10.9}
\]
exists for all $x$. This proves (i).


For (ii), integrate (10.8) over $y$ to obtain
\[
\int y\, p(x, y)\, dy = r(x)\, p_X(x). \tag{10.10}
\]
Note that two densities for a probability law must coincide except on a set of probability 0. This holds for $p(x, y)$, and it implies that two versions of the function $\int y\, p(x, y)\, dy$ must coincide for almost all $x$ (in terms of $X$). But the versions of the densities $p_X(x)$ also coincide except on a set of probability 0 in terms of $X$; and then (10.10) implies such a property for $r(x)$. This means that $\mathcal{L}(r(X))$ is uniquely determined.
Note that strictly speaking, we are not entitled to speak of the regression function r(x) above,
as it is not unique. However the law of the random variable E(Y |X) = r(X) is unique.
The next statement recalls a best prediction property of the conditional expectation. In the
framework of discrete distributions, this was already discussed in Remark 2.4, in connection with
the properties of the conditional expectation E(Y |X) as a random variable. For the normal case,
this was exercise H5.2.

Proposition 10.1.2 Let $X, Y$ have a continuous or discrete joint distribution, where the second moment of $Y$ exists: $E|Y|^2 < \infty$. Then the regression function has the property that
\[
E(Y - r(X))^2 = \min_{f \in M_X} E(Y - f(X))^2
\]
where $M_X$ is the set of all (measurable) functions of $X$ such that $E(f(X))^2 < \infty$.

Proof. This resembles other calculations with conditional expectations (cf. Remark 2.4, p. 9). A little more work is needed now to ensure that all expectations exist. We concentrate on the continuous case. First note that under the conditions, $E(Y - f(X))^2$ is finite for every $f \in M_X$. We claim that also $r \in M_X$. Indeed the Cauchy-Schwarz inequality gives
\[
|r(x)|^2 = \left( \int y\, p(y|x)\, dy \right)^2 \le \int p(y|x)\, dy \int y^2 p(y|x)\, dy = \int y^2 p(y|x)\, dy.
\]
The last integral can be shown to be finite for all $x$ when $E|Y|^2 < \infty$, similarly to Lemma 10.1.1 above (possibly with a modification of $p(y|x)$). Thus
\[
E(r(X))^2 \le \int p_X(x) \left( \int y^2 p(y|x)\, dy \right) dx = \int \int y^2 p(x, y)\, dx\, dy = EY^2 < \infty,
\]
which proves $r \in M_X$. This implies that all expectations in the following reasoning are finite. We have for any $f \in M_X$
\[
E(Y - f(X))^2 = E\bigl((Y - r(X)) - (f(X) - r(X))\bigr)^2
= E(Y - r(X))^2 + 2E\bigl[(Y - r(X))(r(X) - f(X))\bigr] + E(f(X) - r(X))^2,
\]
and the middle term vanishes, when we use the formulae
\[
E(\cdot) = E\bigl(E(\cdot \mid X)\bigr), \qquad E(Y h(X) \mid X) = h(X)\, E(Y \mid X).
\]
Hence
\[
E(Y - f(X))^2 \ge E(Y - r(X))^2.
\]

The regression function is thus a characteristic of the joint distribution of $X$ and $Y$. In general $r(x) := E(Y \mid X = x)$ is a nonlinear regression function; it is linear if the joint distribution is normal.
In the general case, it is no longer true that the residual $\varepsilon = Y - r(X)$ and $X$ are independent; it can only be shown that they are uncorrelated (Exercise). However one can build a bivariate distribution of $X, Y$ from independent $\varepsilon$ (with zero mean) and $X$:
\[
Y = r(X) + \varepsilon. \tag{10.11}
\]
Assume that $E|r(X)| < \infty$, so that $Y$ has finite expectation and hence $E(Y \mid X)$ exists. It then follows that
\[
E(Y \mid X) = E(r(X) \mid X) + E(\varepsilon \mid X) = r(X) + E\varepsilon = r(X),
\]
so $r$ is the regression function for $(X, Y)$.

10.2 Bivariate regression models


For the joint distribution of $X, Y$ the relation (10.11) describes approximately the dependence of $Y$ on $X$, or a noisy causal relationship between $X$ and $Y$. Suppose one has i.i.d. observations $Z_i = (X_i, Y_i)^\top$, and one wishes to draw inferences on this dependence, i.e. on the regression function $r(x)$:
\[
Y_i = r(X_i) + \varepsilon_i
\]
where $\varepsilon_i$ are the corresponding residuals. Since the regression function $r(x) = E(Y \mid X = x)$ depends only on the conditional distribution of $Y$ given $X$, and $X$ is observed, it makes sense to take a conditional point of view, and assume that $X_i = x_i$ where the $x_i$ are nonrandom values. These can be taken to be the realized $X_i$. Thus
\[
Y_i = r(x_i) + \varepsilon_i
\]
where the $\varepsilon_i$ are still independent. (Exercise: show that the joint conditional density of all $\varepsilon_i = Y_i - E(Y_i \mid X_i)$ given all $X_i$ is the product of the individual conditional densities of the $\varepsilon_i$.)

Definition 10.2.1 Suppose that $x_i$, $i = 1, \ldots, n$ are nonrandom values, and $\varepsilon_i$, $i = 1, \ldots, n$ are i.i.d. random variables with zero expectation, $\mathcal{L}(\varepsilon) = Q$. Let $\mathcal{R}$ be a set of functions on $\mathbb{R}$ and $\mathcal{Q}$ be a set of probability laws on $\mathbb{R}$. A bivariate regression model is given by observed data
\[
Y_i = r(x_i) + \varepsilon_i,
\]
where it is assumed that $r \in \mathcal{R}$ and $Q \in \mathcal{Q}$.
(i) A linear regression model is obtained when $\mathcal{R}$ is assumed to be a set of linear functions
\[
r(x) = \alpha + \beta x
\]
(where linear restrictions on $\alpha$, $\beta$ may be present).
(ii) A normal linear regression model is obtained when in (i) $\mathcal{Q}$ is assumed to be a set of normal laws $N(0, \sigma^2)$ ($\sigma^2$ fixed or unknown).
(iii) A nonlinear regression model (wide sense) is obtained when for some parameter set $\Theta \subseteq \mathbb{R}^k$, $k \ge 1$,
\[
\mathcal{R} = \{r_\theta,\ \theta \in \Theta\}
\]
and all functions $r_\theta(x)$ are nonlinear in $x$.
(iv) A nonparametric regression model is obtained when $\mathcal{R}$ is a class of functions of $x$ which cannot be smoothly parametrized by some $\theta \in \Theta \subseteq \mathbb{R}^k$, e.g. a set of differentiable functions
\[
\mathcal{R} = \bigl\{ r : r' \text{ exists},\ |r'(x)| \le C \text{ for all } x \bigr\}.
\]

In a nonparametric regression model, the functions are also nonlinear in $x$, but the term nonlinear regression is reserved for families $r_\theta$, $\theta \in \Theta$, indexed by a finite dimensional $\theta$. A typical example for nonlinear regression is polynomial regression
\[
r_\theta(x) = \sum_{j=0}^k \theta_j x^j
\]
or more generally
\[
r_\theta(x) = \sum_{j=0}^k \theta_j \psi_j(x)
\]
where $\{\psi_j\}$ is a system of functions (e.g. trigonometric regression). However in these examples, the function values $r_\theta(x)$ depend linearly on the parameter $\theta = (\theta_j)_{j=0,\ldots,k}$. We shall see below that these can be treated similarly to the linear case. Therefore the name of nonlinear regression model in a narrow sense is reserved for models where $r_\theta(x)$ depends on $\theta$ nonlinearly, e.g.
\[
r_\theta(x) = \sin(\theta x).
\]

In the linear case, we have
\[
Y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n, \tag{10.12}
\]
and commonly it is assumed that the variance exists: $EY_1^2 < \infty$. The linear case (without normality assumption) is simply called a linear model.
Note that the normal location and location-scale models are special cases of the normal linear model, when the restriction $\beta = 0$ is assumed and $x_i = 1$, $i = 1, \ldots, n$. Here the $x_i$ do not resemble realizations of i.i.d. random variables; this possibility enlarges considerably the scope of the linear model. Indeed the $x_i$ can be chosen or designed, e.g. taken all as 1 as above, or as an equidistant grid (or mesh):
\[
x_i = i/n, \quad i = 1, \ldots, n.
\]
The set $\{x_i, i = 1, \ldots, n\}$ is also called a regression design. Especially the case of nonparametric regression with an equidistant grid resembles the problems of function interpolation from noisy data (also called smoothing) in numerical analysis.

10.3 The general linear model


Our starting point above was a bivariate normal distribution of variables $(X, Y)$, and the description of $E(Y \mid X)$ as the best predictor of $Y$ given $X$. The same reasoning is possible when we have several variables $X_1, \ldots, X_k$, i.e. a whole vector $X = (X_1, \ldots, X_k)^\top$, and we are interested in an approximate causal relationship between $X$ and $Y$. This of course allows modelling of much more complex relationships. The $X_1, \ldots, X_k$ are called regressor variables and $Y$ the regressand. Again $E(Y \mid X)$ can be shown to be a linear function of $X$, but we forgo this and proceed directly to setting up a linear regression model in several nonrandom regressors.
Let $x_1, \ldots, x_n$ be a set of nonrandom $k$-vectors. These might be independent realizations of a random vector $X$, but they might also be designed values. We assume that for some vector $\beta = (\beta_1, \ldots, \beta_k)^\top$, observations are
\[
Y_i = x_i^\top \beta + \varepsilon_i, \quad i = 1, \ldots, n
\]
where $\varepsilon_i$ are i.i.d. with $E\varepsilon_1 = 0$, $E\varepsilon_1^2 < \infty$. This is a direct generalization of (10.12).

Definition 10.3.1 Let $X$ be a nonrandom $n \times k$-matrix of rank $k$, and $\varepsilon$ be a random $n$-vector such that
\[
E\varepsilon = 0, \qquad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n.
\]
(i) A (general) linear model is given by an observed $n$-vector
\[
Y = X\beta + \varepsilon
\]
where $X$ is fixed (known), $\varepsilon$ is unobserved and $\beta \in \mathbb{R}^k$ is unknown, and $\sigma^2$ is either known or unknown ($\sigma^2 > 0$).
(ii) A normal linear model is obtained when $\varepsilon$ is assumed to have a normal distribution.

Notation and terminology. Now $X$ is an $n \times k$-matrix, whereas in the previous section $X$ was a random variable (the regressor variable), and generalizing this, in the first paragraph of this section $X = (X_1, \ldots, X_k)$ was a random vector of regressor variables. This reflects the situation that the matrix $X$ may arise from independent realizations $x_i$ of the random vector $X$, in conjunction with a conditional point of view (the $x_i$ are considered nonrandom, and form the rows of the matrix $X$). Therefore we keep the symbol $X$ for the matrix above; $X$ may be called the regression matrix. The columns of $X$ ($\xi_1, \ldots, \xi_k$, say) may be called nonrandom regressor variables; they correspond to the random regressors $X_1, \ldots, X_k$.
In the normal linear model, we have $\mathcal{L}(\varepsilon) = N_n(0, \sigma^2 I_n)$, and the components $\varepsilon_i$ are i.i.d. normal. Hence $\mathcal{L}(Y) = N_n(X\beta, \sigma^2 I_n)$. In the general case, the $\varepsilon_i$ are only uncorrelated; however, when modelling real world phenomena by a linear model, r.v.'s which are uncorrelated but not independent will not often occur.

Example 10.3.2 Let $U$ have the uniform distribution on $[0, 1]$ and
\[
Z = \cos(2\pi U), \qquad Y = \sin(2\pi U).
\]
Then $Z^2 + Y^2 = 1$ and the pair $(Z, Y)$ takes values on the unit circle, which implies that $Z, Y$ are not independent (they do not have a joint density on $\mathbb{R}^2$ which is the product of its marginals). But $Z, Y$ are uncorrelated:
\[
EZ = \int_0^1 \cos(2\pi u)\, du = 0, \qquad EY = 0,
\]
\[
\operatorname{Cov}(Z, Y) = EZY = \int_0^1 \cos(2\pi u) \sin(2\pi u)\, du = 0.
\]
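
A quick Monte Carlo check of Example 10.3.2; the sample size is an illustrative choice.

```python
# Z = cos(2*pi*U), Y = sin(2*pi*U) are uncorrelated, yet (Z, Y) lies on the unit
# circle and is therefore completely dependent.
import numpy as np

rng = np.random.default_rng(3)
U = rng.random(1_000_000)
Zv, Yv = np.cos(2 * np.pi * U), np.sin(2 * np.pi * U)
print(np.mean(Zv), np.mean(Yv), np.mean(Zv * Yv))   # all approximately 0
print(np.max(np.abs(Zv ** 2 + Yv ** 2 - 1)))        # Z^2 + Y^2 is identically 1
```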

Identifiability.
Let us explain the assumption $\operatorname{rank}(X) = k$. Let $x_i^\top$, $i = 1, \ldots, n$ be the rows of the $n \times k$-matrix $X$ and $\xi_j$, $j = 1, \ldots, k$ the columns:
\[
X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix} = (\xi_1, \ldots, \xi_k).
\]
Recall that $\operatorname{rank}(X) = k$ is equivalent to each of the two conditions:
\[
\operatorname{Lin}(\{x_1, \ldots, x_n\}) = \mathbb{R}^k, \tag{10.13}
\]
\[
\xi_1, \ldots, \xi_k \text{ are linearly independent } n\text{-vectors} \tag{10.14}
\]
(where $\operatorname{Lin}(\cdot)$ denotes the linear space spanned by a set of vectors, also called linear span or linear hull).

Definition 10.3.3 Let $\mathcal{P} = \{P_\theta;\ \theta \in \Theta\}$ be a family of probability laws on a general space $\mathcal{Z}$. The parameter $\theta$ is called identifiable in $\mathcal{P}$ if for all pairs $\theta_1, \theta_2 \in \Theta$, $\theta_1 \ne \theta_2$ implies $P_{\theta_1} \ne P_{\theta_2}$.

If $\mathcal{P}$ is a statistical model, i.e. is the set of possible (assumed) distributions of an observed random variable $Z$ with values in $\mathcal{Z}$, then nonidentifiability means that there exist two parameters $\theta_1 \ne \theta_2$ which lead to the same law of $Z$. These cannot be distinguished in any statistical sense: for the hypotheses $H: \theta = \theta_1$ vs. $K: \theta = \theta_2$ the trivial randomized test $\phi(Z) = \alpha$ is a most powerful $\alpha$-test. A parameter is just an index or name for a probability law; identifiability means that no law in $\mathcal{P}$ has two names.
Identifiability thus is a basic condition for a statistical model, if inference on the parameter is desired. If $\theta$ is nonidentifiable in $\mathcal{P}$ then it is advisable to reparametrize, i.e. give the laws other names which are identifiable.
Assume for a moment that the assumption $\operatorname{rank}(X) = k$ is not part of the definition of the linear model (Definition 10.3.1(i)).

Lemma 10.3.4 In the normal linear model, $\beta$ is identifiable if and only if $\operatorname{rank}(X) = k$.

Proof. Assume $\operatorname{rank}(X) = k$ and $\beta_1 \ne \beta_2$. Since
\[
EY = X\beta,
\]
it suffices to show that $X\beta_1 \ne X\beta_2$. If $X\beta_1 = X\beta_2$ then we would have $X(\beta_1 - \beta_2) = 0$ and $\beta_1 - \beta_2 \ne 0$, which contradicts (10.14).
Conversely, assume identifiability; recall $\mathcal{L}(Y) = N_n(X\beta, \sigma^2 I_n)$. If $\operatorname{rank}(X) < k$ then (10.14) is violated, i.e. $\xi_1, \ldots, \xi_k$ are linearly dependent. Hence there is $\beta \ne 0$ such that $X\beta = 0$. Since also $X0 = 0$, the parameters $\beta$ and $0$ lead to the same distribution $\mathcal{L}(Y) = N_n(0, \sigma^2 I_n)$, hence $\beta$ is not identifiable. This contradicts the assumption, hence $\operatorname{rank}(X) = k$.

In the linear model without the normality assumption, another parameter is present, namely the distribution of the random noise vector $\varepsilon$. When this law $\mathcal{L}(\varepsilon)$ is assumed unknown, except for the assumptions $E\varepsilon = 0$, $\operatorname{Cov}(\varepsilon) = \sigma^2 I_n$, then the parameter is $\theta = (\beta, \mathcal{L}(\varepsilon))$ and $\mathcal{L}(Y) = P_\theta$ is indexed by $\theta$. In this situation, we call $\beta = \beta(\theta)$ identifiable if $P_{\theta_1} = P_{\theta_2}$ implies $\beta(\theta_1) = \beta(\theta_2)$. It is easy to see (exercise) that also in this model, the condition $\operatorname{rank}(X) = k$ is necessary and sufficient for identifiability of $\beta$.
10.3.1 Special cases of the linear model
1. Bivariate linear regression. We have
\[
Y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n.
\]
Here $k = 2$, the rows of $X$ are $x_i^\top = (1, x_i)$, $\beta = (\alpha, \beta)^\top$, and identifiability holds as soon as not all $x_i$ are equal.
2. Normal location-scale model. Here
\[
Y_i = \mu + \varepsilon_i, \quad i = 1, \ldots, n,
\]
$k = 1$, $X = (1, \ldots, 1)^\top$, the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$, $\sigma^2 > 0$.


3. Polynomial regression. Here for some design points $x_i$, $i = 1, \ldots, n$
\[
Y_i = \sum_{j=1}^k \beta_j \psi_j(x_i) + \varepsilon_i, \quad i = 1, \ldots, n
\]
where $\psi_j(x) = x^{j-1}$. The matrix $X$ is made up of the columns
\[
\xi_j = (\psi_j(x_1), \ldots, \psi_j(x_n))^\top, \quad j = 1, \ldots, k. \tag{10.15}
\]
Note that we obtained as a special case of the linear model one which was earlier classified as a nonlinear regression model (in a wide sense), since the functions
\[
r(x) = \sum_{j=1}^k \beta_j \psi_j(x)
\]
are nonlinear in $x$ (polynomials). However they are linear in the parameter $\beta = (\beta_1, \ldots, \beta_k)^\top$, and for purposes of estimating $\beta$ this can be treated as a linear model.

Lemma 10.3.5 In the linear model arising from polynomial regression, the parameter $\beta$ is identifiable if and only if among the design points $x_i$, $i = 1, \ldots, n$, there are at least $k$ different points.

Proof. Identifiability means linear independence of the vectors $\xi_j$ in (10.15). This in turn means that for any coefficients $\beta_1, \ldots, \beta_k$, the relation
\[
r(x_i) = \sum_{j=1}^k \beta_j \psi_j(x_i) = 0, \quad i = 1, \ldots, n \tag{10.16}
\]
implies $\beta_j = 0$, $j = 1, \ldots, k$. Now $r(x)$ is a polynomial of degree $k - 1$; if not all $\beta_j$ vanish, then $r$ can have at most $k - 1$ different zeros. Thus if among the design points $x_i$, $i = 1, \ldots, n$, there are at least $k$ different points then $X$ has rank $k$, thus $\beta$ is identifiable. Conversely, assume that only $x_1, \ldots, x_{k-1}$ are different. Consider the polynomial $r_0$ of degree $k - 1$ having these points as zeros:
\[
r_0(t) = \prod_{j=1}^{k-1} (t - x_j).
\]
Let $\beta_1, \ldots, \beta_k$ be the coefficients of $r_0$; then (10.16) holds for $r = r_0$, hence the vectors $\xi_j$ are linearly dependent.
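
A sketch illustrating Lemma 10.3.5 numerically: the polynomial design matrix has full rank $k$ exactly when there are at least $k$ distinct design points. The design points below are arbitrary illustrative choices.

```python
# Rank of the polynomial regression design matrix (columns 1, x, ..., x^(k-1))
# for designs with many vs. too few distinct points.
import numpy as np

k = 3                                                   # polynomial of degree k - 1 = 2
def design_matrix(x, k):
    # columns psi_j(x_i) = x_i^(j-1), j = 1, ..., k
    return np.vander(np.asarray(x, dtype=float), N=k, increasing=True)

x_good = [0.0, 0.5, 1.0, 1.0, 2.0]                      # 4 distinct points >= k: identifiable
x_bad = [0.0, 0.0, 1.0, 1.0, 1.0]                       # only 2 distinct points < k: not identifiable
print(np.linalg.matrix_rank(design_matrix(x_good, k)))  # 3
print(np.linalg.matrix_rank(design_matrix(x_bad, k)))   # 2
```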
The result that some of the design points $x_i$ may be the same opens up the possibility of a design with replication: for a fixed number $m \ge k$, take $m$ different points $x_j$, $j = 1, \ldots, m$ and repeated measurements at these points, $l$ times say, so that $n = ml$. The entire design may then be written with a double index:
\[
x_{jk} = x_j, \quad j = 1, \ldots, m, \ k = 1, \ldots, l
\]
and each of the $x_j$ appears $l$ times. As a result, using double index notation again, we obtain observations
\[
Y_{jk} = r(x_j) + \varepsilon_{jk}, \quad j = 1, \ldots, m, \ k = 1, \ldots, l, \tag{10.17}
\]
which suggests taking averages $\bar Y_j = l^{-1} \sum_{k=1}^l Y_{jk}$ to obtain a simplified model, with more accurate data $\bar Y_j$. The choice of such a replicated design may be advantageous.
Comment on notation: we now use $k$ also as a running index, even though above $k$ denoted the number of functions $\psi_j$ involved, i.e. the dimension of the regression parameter $\beta$. In the sequel, we will use $d$ for the dimension of $\beta$. The reason is that use of $k$ in expressions such as $Y_{jk}$ is traditional, in connection with regression and replicated designs.
4. Analysis of variance. Consider a model
\[
Y_{jk} = \mu_j + \varepsilon_{jk}, \quad j = 1, \ldots, m, \ k = 1, \ldots, l \tag{10.18}
\]
where $\varepsilon_{jk}$ are independent noise variables. Here again a replication structure is present, similarly to (10.17), but no particular form is assumed for the function $r$. Thus, if $r$ is an arbitrary function in (10.17), we might as well write $\mu_j = r(x_j)$ and assume $\mu_j$ unrestricted. The case $m = 2$ (for normal $\varepsilon_{jk}$) was already encountered in the two sample problem. Suppose there are two treatments, $j = 1, 2$, respective observations $Y_{jk}$, $j = 1, 2$, and one wishes to test whether the treatments have an effect: $H: \mu_1 = \mu_2$ vs. $K: \mu_1 \ne \mu_2$. To explain the name "analysis of variance", let us find the likelihood ratio test. In HW 7.1 we found a certain t-test for this problem; cf. also HW 6.1; this will turn out to be the LR test.
Define the two sample means and variances
\[
\bar y_i = l^{-1} \sum_{k=1}^l y_{ik}, \qquad S^2_{il} = l^{-1} \sum_{k=1}^l (y_{ik} - \bar y_i)^2, \quad i = 1, 2,
\]
and also pooled estimates (where $n = 2l$)
\[
\bar y = n^{-1} (l\bar y_1 + l\bar y_2) = \frac{1}{2}(\bar y_1 + \bar y_2), \tag{10.19}
\]
\[
S_n^2 = \frac{1}{n} \left( \sum_{k=1}^l (y_{1k} - \bar y)^2 + \sum_{k=1}^l (y_{2k} - \bar y)^2 \right). \tag{10.20}
\]

Note that the pooled sample variance can be decomposed: since
\[
\sum_{k=1}^l (y_{1k} - \bar y)^2 = \sum_{k=1}^l (y_{1k} - \bar y_1)^2 + l(\bar y_1 - \bar y)^2,
\]
we obtain
\[
S_n^2 = \frac{1}{2}\bigl(S^2_{1l} + S^2_{2l}\bigr) + \frac{1}{2} \sum_{j=1,2} (\bar y_j - \bar y)^2. \tag{10.21}
\]
Note that the second term has the form of a sample variance, for a sample of size 2 with observations $\bar y_1, \bar y_2$. The first term can be seen as the "variability within groups" and the second term as the "variability between groups".
Theorem 10.3.6 In the Gaussian two sample problem (10.18), $\mathcal{L}(\varepsilon_{jk}) = N(0, \sigma^2)$, $\sigma^2 > 0$ unknown, for hypotheses
\[
H: \mu_1 = \mu_2, \qquad K: \mu_1 \ne \mu_2,
\]
the LR statistic is
\[
L(y_1, y_2) = \left( 1 + \frac{\frac{1}{2}\sum_{j=1,2} (\bar y_j - \bar y)^2}{\frac{1}{2}\bigl(S^2_{1l} + S^2_{2l}\bigr)} \right)^{n/2}.
\]
The LR test is equivalent to a t-test which rejects when $|T|$ is too large, where
\[
T = \frac{l^{1/2} (\bar y_1 - \bar y_2)}{\bigl(\bar S^2_{1l} + \bar S^2_{2l}\bigr)^{1/2}}, \qquad
\bar S^2_{il} = \frac{l}{l-1} S^2_{il}, \quad i = 1, 2,
\]
and $\mathcal{L}(T) = t_{2l-2}$ under $H$.
Comment. This form of the likelihood ratio statistic explains the name analysis of variance.
The pooled sample variance is decomposed according to (10.21), and the LR test rejects if the
variability between groups is too large, compared to the variability within groups.
Proof. The argument is very similar to the proof of Proposition 8.4.2 about the LR tests for the one sample Gaussian location-scale model; it can be considered the two sample analog. Since we have two independent samples with different expectations $\mu_i$ and the same variance, the joint density $p_{\mu_1, \mu_2, \sigma^2}$ of all the data $y_1 = (y_{11}, \ldots, y_{1l})$, $y_2 = (y_{21}, \ldots, y_{2l})$ is, under the alternative,
\[
p_{\mu_1, \mu_2, \sigma^2}(y_1, y_2) = \prod_{i=1,2} \frac{1}{(2\pi\sigma^2)^{l/2}} \exp\left( -\frac{S^2_{il} + (\bar y_i - \mu_i)^2}{2\sigma^2 l^{-1}} \right). \tag{10.22}
\]
Maximizing the likelihood under the alternative we obtain estimates $\hat\mu_i = \bar y_i$ and, for $n = 2l$,
\[
\max_{\mu_1 \ne \mu_2} p_{\mu_1, \mu_2, \sigma^2}(y_1, y_2)
= \frac{1}{(2\pi\sigma^2)^{l}} \exp\left( -\frac{S^2_{1l} + S^2_{2l}}{2\sigma^2 l^{-1}} \right)
= \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{(S^2_{1l} + S^2_{2l})/2}{2\sigma^2 n^{-1}} \right),
\]
and maximizing this over $\sigma^2 > 0$ gives
\[
\max_{\mu_1 \ne \mu_2,\ \sigma^2 > 0} p_{\mu_1, \mu_2, \sigma^2}(y_1, y_2)
= \frac{1}{(\hat\sigma^2)^{n/2}} \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{n}{2} \right)
\]
where
\[
\hat\sigma^2 = \frac{1}{2}\bigl(S^2_{1l} + S^2_{2l}\bigr).
\]
2
Under the hypothesis 1 = 2 = the data y1k , y2k are identically distributed, as N (, 2 ). Thus
we can treat the two samples as one pooled sample of size n, and form the pooled mean and variance
estimates (10.19), (10.20). To maximize the likelihood in , 2 , we just refer to the results in the
one sample case (Proposition 8.4.2) to obtain

1 1 1
max p,,2 (y1 , y2 ) = 2 n/2 exp 1 .
, 2 >0 ( 0 ) (2)n/2 2n
Thus the likelihood ratio is
\[
L(y_1, y_2) = \frac{\max_{\mu_1 \ne \mu_2,\ \sigma^2 > 0} p_{\mu_1, \mu_2, \sigma^2}(y_1, y_2)}{\max_{\mu,\ \sigma^2 > 0} p_{\mu, \mu, \sigma^2}(y_1, y_2)}
= \left( \frac{\hat\sigma_0^2}{\hat\sigma^2} \right)^{n/2}
= \left( \frac{S_n^2}{\bigl(S^2_{1l} + S^2_{2l}\bigr)/2} \right)^{n/2}.
\]
The LR test compares the sample variance $S_n^2$ under the hypothesis with the sample variance under the alternative, which can be taken to be $\bigl(S^2_{1l} + S^2_{2l}\bigr)/2$. Using (10.21), we obtain
\[
L(y_1, y_2) = \left( 1 + \frac{\sum_{j=1,2} (\bar y_j - \bar y)^2}{S^2_{1l} + S^2_{2l}} \right)^{n/2}.
\]

As a consequence of (10.19), we have
\[
\bar y_1 - \bar y = (\bar y_1 - \bar y_2)/2, \qquad \bar y_2 - \bar y = (\bar y_2 - \bar y_1)/2
\]
and hence
\[
L(y_1, y_2) = \left( 1 + \frac{1}{2(l-1)} T^2 \right)^{n/2}.
\]
The statistic $T$ has a t-distribution with $2l - 2$ degrees of freedom, since $\bar y_1$, $\bar y_2$, $S^2_{1l}$, $S^2_{2l}$ are all independent,
\[
T = \frac{(l/2)^{1/2} (\bar y_1 - \bar y_2)}{\bigl(\bigl(\bar S^2_{1l} + \bar S^2_{2l}\bigr)/2\bigr)^{1/2}},
\]
\[
\mathcal{L}\bigl( (l/2)^{1/2} (\bar y_1 - \bar y_2) \bigr) = N(0, \sigma^2), \qquad
\mathcal{L}\left( \frac{2(l-1)}{\sigma^2} \cdot \frac{\bar S^2_{1l} + \bar S^2_{2l}}{2} \right) = \chi^2_{2l-2}.
\]
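
As a usage sketch of Theorem 10.3.6, the following Python snippet computes the t-statistic of the theorem from two simulated samples and compares $|T|$ with the $t_{2l-2}$ quantile; the data, group means and level are illustrative choices.

```python
# Two-sample t-test of Theorem 10.3.6, computed from the within-group variances.
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(4)
l = 20
y1 = rng.normal(0.0, 1.0, size=l)                   # group j = 1 (illustrative)
y2 = rng.normal(0.5, 1.0, size=l)                   # group j = 2 (illustrative)

S1l2 = np.mean((y1 - y1.mean()) ** 2)               # S_{1l}^2, divisor l
S2l2 = np.mean((y2 - y2.mean()) ** 2)               # S_{2l}^2, divisor l
# T = sqrt(l) (ybar_1 - ybar_2) / sqrt(Sbar_{1l}^2 + Sbar_{2l}^2), Sbar^2 = l/(l-1) S^2
T = np.sqrt(l) * (y1.mean() - y2.mean()) / np.sqrt((l / (l - 1)) * (S1l2 + S2l2))
crit = t.ppf(0.975, df=2 * l - 2)                   # two-sided test at alpha = 0.05
print(T, crit, abs(T) > crit)
```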

The ANOVA (analysis of variance) model (10.18) is a special case of the linear model. Write
\[
\mu = (\mu_1, \ldots, \mu_m)^\top, \quad n = ml, \tag{10.23}
\]
\[
X = \begin{pmatrix} \mathbf{1}_l & & \\ & \ddots & \\ & & \mathbf{1}_l \end{pmatrix} \tag{10.24}
\]
where $\mathbf{1}_l$ is the $l$-vector consisting of 1's, and $X$ is the $ml \times m$ matrix where the column vectors $\mathbf{1}_l$ are arranged in a diagonal fashion (with 0's elsewhere),
\[
Y = (Y_{11}, \ldots, Y_{1l}, Y_{21}, \ldots, Y_{ml})^\top,
\]
and the $n$-vector $\varepsilon$ is formed analogously. Then
\[
Y = X\mu + \varepsilon,
\]
$\operatorname{rank}(X) = m$, and the hypothesis $H: \mu_1 = \ldots = \mu_m$ can be expressed as
\[
H: \mu \in \operatorname{Lin}(\mathbf{1}_m)
\]
where $\operatorname{Lin}(\mathbf{1}_m)$ is the linear subspace of $\mathbb{R}^m$ spanned by the column vector $\mathbf{1}_m$. That is a special case of a linear hypothesis about $\mu$ (i.e. $\mu$ is in some specified subspace of $\mathbb{R}^m$).

10.4 Least squares and maximum likelihood estimation


Consider again the general linear model
\[
Y = X\beta + \varepsilon, \tag{10.25}
\]
\[
E\varepsilon = 0, \qquad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n, \tag{10.26}
\]
and the problem of estimating $\beta$, with known or unknown $\sigma^2$. If $X$ is an $n \times k$-matrix and $\operatorname{rank}(X) = k$ (identifiability), then the expectation of the random vector $Y$ is in the linear subspace of $\mathbb{R}^n$ spanned by the matrix $X$ (or by the $k$ columns $\xi_j$ of $X$):
\[
\operatorname{Lin}(X) = \operatorname{Lin}(\{\xi_1, \ldots, \xi_k\}) = \bigl\{ X\beta,\ \beta \in \mathbb{R}^k \bigr\}
= \Bigl\{ z = \sum_{j=1}^k \beta_j \xi_j,\ \beta_j \text{ arbitrary} \Bigr\}.
\]
An equivalent way of writing the linear model is
\[
EY \in \operatorname{Lin}(X), \qquad \operatorname{Cov}(Y) = \sigma^2 I_n.
\]
Indeed no distributional assumptions about $\varepsilon$ are made in the general linear model, so the above is all the information one has. For obtaining an estimator of $\beta$, one could try to apply the principle of maximum likelihood. However since the distribution of $\varepsilon$ ($Q$, say) is unspecified, it would have to be considered a parameter, along with $\beta$. Then $Q$ and $\beta$ specify the distribution of $Y$, and hence a likelihood for any realization $Y = y$. But maximizing it in $Q$ does not lead to satisfactory results, since the class for $Q$ (arbitrary distributions on $\mathbb{R}^n$) is too large. Thus one has to look for a different principle to get an estimator of $\beta$. The distribution $Q$ of the noise is not of primary interest in most cases; it can be considered a nuisance parameter. The regression parameter $\beta$ is the parameter of interest, since it describes the dependence of $Y$ on $X$, and possibly also $\sigma^2$.
Of course one could assume normality, and then find maximum likelihood estimators. However another principle can be invoked, without a normality assumption.

Definition 10.4.1 In the general linear model (10.25), (10.26), with observed vector $Y \in \mathbb{R}^n$, $\beta \in \mathbb{R}^k$ and $\operatorname{rank}(X) = k$, a least squares estimator (LSE) of $\beta$ is a function $\hat\beta = \hat\beta(Y)$ of the observations such that
\[
\|Y - X\hat\beta\|^2 = \min_{\beta \in \mathbb{R}^k} \|Y - X\beta\|^2
\]
where $\|\cdot\|^2$ is the squared Euclidean norm in $\mathbb{R}^n$.

The name "least squares" derives from the fact that the squared Euclidean norm of any vector $z \in \mathbb{R}^n$ is the sum of squares of the components $z_i$:
\[
\|z\|^2 = \sum_{i=1}^n z_i^2.
\]
Recall that another form of writing the linear model was
\[
Y_i = x_i^\top \beta + \varepsilon_i, \quad i = 1, \ldots, n
\]
where $x_i^\top$ are the rows of $X$ (the $x_i$ are $k$-vectors). Thus another way of describing the least squares estimator is to say that for given $Y$ it is a minimizer of
\[
Ls_Y(\beta) = \sum_{i=1}^n \bigl(Y_i - x_i^\top \beta\bigr)^2.
\]
This expression, depending on the observations $Y$, to be minimized in $\beta$, is also called the least squares criterion.
Exercise. Show that if $\beta$ is the true parameter in (10.25), (10.26), then $\beta$ provides a best approximation to the data $Y$ in an average sense:
\[
E\|Y - X\beta\|^2 = \min_{\gamma \in \mathbb{R}^k} E\|Y - X\gamma\|^2.
\]
This minimization property can serve as a justification for the least squares criterion. If we could compute $E\|Y - X\gamma\|^2$ for any $\gamma$, we would take the minimizer and obtain the true $\beta$. However to find the expectation involved we already have to know $\beta$, so we take just $\|Y - X\gamma\|^2$ and minimize it. In this sense, $\hat\beta$ can be considered an empirical analog of $\beta$.

Theorem 10.4.2 Consider the general linear model (10.25), (10.26), in the case $k < n$, with observed vector $Y \in \mathbb{R}^n$, $\beta \in \mathbb{R}^k$ and $\operatorname{rank}(X) = k$, with $\sigma^2$ either known or unknown ($\sigma^2 > 0$).
(i) The LSE of $\beta$ is uniquely determined and given by
\[
\hat\beta = \bigl(X^\top X\bigr)^{-1} X^\top Y. \tag{10.27}
\]
(ii) Under a normality assumption $\mathcal{L}(\varepsilon) = N_n(0, \sigma^2 I_n)$, with probability one the LSE coincides with the maximum likelihood estimator (MLE) for $\beta$, both in cases of known $\sigma^2$ and unknown $\sigma^2$ (when $\sigma^2 > 0$).

Proof. Note that $\operatorname{rank}(X) = k$ implies that $\bigl(X^\top X\bigr)^{-1}$ exists. The key argument is that the matrix
\[
\Pi_X = X \bigl(X^\top X\bigr)^{-1} X^\top
\]
represents the linear projection operator in $\mathbb{R}^n$ onto the linear subspace $\operatorname{Lin}(X)$. Indeed, note that $\Pi_X$ is a projection matrix, i.e. idempotent ($\Pi_X \Pi_X = \Pi_X$) and symmetric: $\Pi_X^\top = \Pi_X$. To see these two properties, note
\[
\Pi_X \Pi_X = X \bigl(X^\top X\bigr)^{-1} X^\top X \bigl(X^\top X\bigr)^{-1} X^\top = X \bigl(X^\top X\bigr)^{-1} X^\top = \Pi_X,
\]
and using the matrix rule $(AB)^\top = B^\top A^\top$ (which implies $(H^{-1})^\top = (H^\top)^{-1}$ for symmetric nonsingular $H$)
\[
\Pi_X^\top = \Bigl( X \bigl(X^\top X\bigr)^{-1} X^\top \Bigr)^\top
= X \Bigl( \bigl(X^\top X\bigr)^{-1} \Bigr)^\top X^\top
= X \bigl(X^\top X\bigr)^{-1} X^\top = \Pi_X.
\]
Hence $\Pi_X$ is a projection matrix. It has rank $k$, and it leaves the space $\operatorname{Lin}(X)$ invariant: if $z \in \operatorname{Lin}(X)$ then $z = Xa$ for some $a \in \mathbb{R}^k$, and
\[
\Pi_X z = \Pi_X Xa = X \bigl(X^\top X\bigr)^{-1} X^\top Xa = Xa = z.
\]
Also for any $y$ the vector $\Pi_X y$ is in $\operatorname{Lin}(X)$:
\[
\Pi_X y = X \bigl(X^\top X\bigr)^{-1} X^\top y = X\tilde y, \quad \text{where } \tilde y = \bigl(X^\top X\bigr)^{-1} X^\top y.
\]
Moreover, consider the orthogonal complement of $\operatorname{Lin}(X)$, i.e. $\operatorname{Lin}(X)^\perp$. Any $z \in \operatorname{Lin}(X)^\perp$ is mapped into 0 by the linear map $\Pi_X$: since $X^\top z = 0$, we have
\[
\Pi_X z = X \bigl(X^\top X\bigr)^{-1} X^\top z = 0.
\]
These facts establish that indeed $\Pi_X$ is a matrix which represents the projection onto $\operatorname{Lin}(X)$. (As a consequence, $I_n - \Pi_X$ is the projection operator onto $\operatorname{Lin}(X)^\perp$.)
It is well known that the projection operator has a minimizing property: for any $y \in \mathbb{R}^n$, $\Pi_X y$ gives the element of $\operatorname{Lin}(X)$ closest to $y$ (in the sense of $\|\cdot\|$). Indeed for any $z \in \operatorname{Lin}(X)$
\[
\|y - z\|^2 = \|y - \Pi_X y + \Pi_X y - z\|^2. \tag{10.28}
\]
Note that $\Pi_X y - z \in \operatorname{Lin}(X)$, and
\[
y - \Pi_X y = (I_n - \Pi_X) y \in \operatorname{Lin}(X)^\perp
\]
since $I_n - \Pi_X$ projects onto $\operatorname{Lin}(X)^\perp$. It follows that, when we compute (10.28) via $\|z\|^2 = z^\top z$, since the two vectors are orthogonal, we get a sum of squared norms:
\[
\|y - z\|^2 = \|y - \Pi_X y\|^2 + \|\Pi_X y - z\|^2.
\]


The right side is minimized for $z = \Pi_X y$.
Apply this minimizing property of $\Pi_X$ to obtain
\[
\|Y - X\beta\|^2 \ge \|Y - \Pi_X Y\|^2
\]
where equality is achieved for
\[
X\beta = \Pi_X Y = X \bigl(X^\top X\bigr)^{-1} X^\top Y.
\]
Pre-multiply by $X^\top$ to obtain
\[
X^\top X \beta = X^\top Y
\]
which gives the unique solution
\[
\hat\beta = \bigl(X^\top X\bigr)^{-1} X^\top Y. \tag{10.29}
\]
Part (i) is proved. For (ii), write down the likelihood function: since $\mathcal{L}(Y) = N_n(X\beta, \sigma^2 I_n)$, it is
\[
L_Y(\beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \|Y - X\beta\|^2 \right).
\]
For known $\sigma^2$, it is obvious that maximizing $L_Y$ in $\beta$ is equivalent to minimizing the least squares criterion
\[
Ls_Y(\beta) = \|Y - X\beta\|^2.
\]
For unknown $\sigma^2$, restricted only by $\sigma^2 > 0$, minimize $L_Y(\beta, \sigma^2)$ first in $\beta$, for given $\sigma^2$. The solution is again given by (10.29). We now have to maximize
\[
L_Y(\hat\beta, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{n^{-1} Ls_Y(\hat\beta)}{2\sigma^2 n^{-1}} \right)
\]
in $\sigma^2$. This procedure was already carried out in the proof of Proposition 3.0.5 (insert now $n^{-1} Ls_Y(\hat\beta)$ for $S_n^2$). Provided that $Ls_Y(\hat\beta) > 0$, the solution is
\[
\hat\sigma^2 = n^{-1} Ls_Y(\hat\beta).
\]
Thus (ii) is proved if we show that $Ls_Y(\hat\beta) > 0$ happens with probability one. Indeed $Ls_Y(\hat\beta) = 0$ means that $Y \in \operatorname{Lin}(X)$. Under normality, with covariance matrix $\sigma^2 I_n$, it is obvious that $Y \in \operatorname{Lin}(X)$ happens with probability 0, if $k < n$ is fulfilled, since any proper linear subspace of $\mathbb{R}^n$ has probability 0 (indeed for any nonrandom vector $z \in \mathbb{R}^n$, $z \ne 0$, the event $z^\top Y = 0$ has probability 0, since $z^\top Y$ is normally distributed with variance $\sigma^2 \|z\|^2 > 0$).
It is easy to see that if $k = n$ then the LSE of $\beta$ is still given by (10.27) and coincides with the MLE under normality if $\sigma^2$ is known, but if $\sigma^2$ is unknown then $Ls_Y(\hat\beta) = 0$ and the MLE of $\sigma^2$ under normality does not exist (or should be taken as 0, with the likelihood function taking value $\infty$). The assumption $k < n$ is realistic, since $k$ represents the number of independent regressor variables in most cases, and can be expected to be less than $n$.
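
The following sketch illustrates Theorem 10.4.2 numerically: it computes the LSE via the normal equations, forms the projection matrix $\Pi_X$, and the variance estimate $\hat\sigma^2 = n^{-1} Ls_Y(\hat\beta)$. The design matrix, true parameter and noise level are arbitrary illustrative choices.

```python
# LSE beta_hat = (X^T X)^{-1} X^T Y, projection Pi_X onto Lin(X), and
# sigma_hat^2 = n^{-1} ||(I - Pi_X) Y||^2 in a simulated linear model.
import numpy as np

rng = np.random.default_rng(5)
n, k = 50, 3
X = rng.standard_normal((n, k))                     # design matrix of rank k (illustrative)
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(0.0, 0.7, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)        # solve the normal equations X^T X beta = X^T Y
Pi_X = X @ np.linalg.solve(X.T @ X, X.T)            # projection matrix X (X^T X)^{-1} X^T
sigma2_hat = np.sum(((np.eye(n) - Pi_X) @ Y) ** 2) / n

print(beta_hat)                                     # close to beta_true
print(np.allclose(Pi_X @ Pi_X, Pi_X))               # idempotent: Pi_X is a projection
print(sigma2_hat)
```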
Exercise. Consider the special cases of the linear model discussed in the previous subsection.
1. Bivariate linear regression:
\[
Y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n.
\]
Here $k = 2$, the rows of $X$ are $x_i^\top = (1, x_i)$, $\beta = (\alpha, \beta)^\top$, and identifiability holds as soon as not all $x_i$ are equal. Show that the LSEs of $\alpha$, $\beta$ are
\[
\hat\alpha_n = \bar Y_n - \hat\beta_n \bar x_n, \qquad
\hat\beta_n = \frac{\sum_{i=1}^n (Y_i - \bar Y_n)(x_i - \bar x_n)}{\sum_{i=1}^n (x_i - \bar x_n)^2}, \tag{10.30}
\]
where $\bar x_n$ is the mean of the nonrandom $x_i$.

Remark 10.4.3 Note that the formula for $\hat\beta_n$ is analogous to the regression coefficient in a bivariate normal distribution for $(X, Y)$ according to Definition 10.1.3:
\[
\beta = \frac{\sigma_{XY}}{\sigma^2_X}.
\]
Therefore $\hat\beta_n$ is also called the empirical regression coefficient (and $\beta$ the theoretical coefficient). The regression function for $(X, Y)$ was found as
\[
y = E(Y \mid X = x) = \mu_Y + \beta(x - \mu_X) = \alpha + \beta x,
\]
which shows that $\hat\alpha_n$ is also the analog of $\alpha = \mu_Y - \beta\mu_X$.

Since bivariate regression is very important for applications (it is programmed into scientific pocket calculators), we summarize again what was done there.

Fitting a straight line to data (bivariate linear regression). Given pairs $(Y_i, x_i)$, $i = 1, \ldots, n$, find the straight line $y = \alpha + \beta x$ which best fits the data in the sense that the least squares criterion
\[
\sum_{i=1}^n (Y_i - \alpha - \beta x_i)^2
\]
is minimal. The solutions $\hat\alpha_n$, $\hat\beta_n$ are given by (10.30). The fitted straight line $y = \hat\alpha_n + \hat\beta_n x$ passes through the point $(\bar x_n, \bar Y_n)$ with slope $\hat\beta_n$.
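
A sketch of the straight-line fit by the formulas (10.30); the design points, the true line and the noise level are illustrative.

```python
# Fit y = alpha + beta x by least squares using the explicit formulas (10.30).
import numpy as np

rng = np.random.default_rng(6)
n = 30
x = np.linspace(0.0, 1.0, n)                    # nonrandom design points (illustrative)
Y = 2.0 + 3.0 * x + rng.normal(0.0, 0.2, n)     # data from a true line with noise

xbar, Ybar = x.mean(), Y.mean()
beta_hat = np.sum((Y - Ybar) * (x - xbar)) / np.sum((x - xbar) ** 2)
alpha_hat = Ybar - beta_hat * xbar
print(alpha_hat, beta_hat)                      # fitted line passes through (xbar, Ybar)
```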

2. Location-scale model (normality not assumed). Here
\[
Y_i = \mu + \varepsilon_i, \quad i = 1, \ldots, n.
\]
This can be obtained from 1. above when $\beta = 0$ is assumed, and $x_i = 1$. Show that the LSE of $\mu$ is the sample mean:
\[
\hat\mu_n = \bar Y_n. \tag{10.31}
\]
3. ANOVA: Here
\[
Y_{jk} = \mu_j + \varepsilon_{jk}, \quad j = 1, \ldots, m, \ k = 1, \ldots, l.
\]
Show that the LSE of $\mu_j$ is the group mean:
\[
\hat\mu_j = \bar Y_j = l^{-1} \sum_{k=1}^l Y_{jk}, \quad j = 1, \ldots, m.
\]

10.5 The Gauss-Markov Theorem


Consider again the general linear model:

Model (LM) (General linear model). Observations are
\[
Y = X\beta + \varepsilon, \tag{10.32}
\]
\[
E\varepsilon = 0, \qquad \operatorname{Cov}(\varepsilon) = \sigma^2 I_n, \tag{10.33}
\]
where $Y$ is a random $n$-vector, $X$ is an $n \times k$-matrix and $\operatorname{rank}(X) = k$, $\beta \in \mathbb{R}^k$ is unknown, $\varepsilon$ is an unobservable random $n$-vector, and $\mathcal{L}(\varepsilon)$ satisfies (10.33) where $\sigma^2$ is known (Model (LM$_1$)) or unknown, restricted by $\sigma^2 > 0$ (Model (LM$_2$)).

Recall that the normal linear model is obtained when we add the assumption $\mathcal{L}(\varepsilon) = N(0, \sigma^2 I_n)$. This might be called (NLM) (possibly (NLM$_1$) or (NLM$_2$)).
In (LM) we are now interested in optimality properties of the least squares estimator of $\beta$,
\[
\hat\beta = \bigl(X^\top X\bigr)^{-1} X^\top Y.
\]
Above in (10.31) it was remarked that the sample mean $\bar Y_n$ is a special case of $\hat\beta$, for $X = \mathbf{1}$ (the $n$-vector consisting of 1's). In Section 5.7 it was shown that $\bar Y_n$ is a minimax estimator under normality, for known $\sigma^2$ (i.e. in the Gaussian location model). In Proposition 4.3.3 it was shown that in the Gaussian location model, the sample mean is also a uniformly minimum variance unbiased estimator (UMVUE). Both optimality properties depend on the normality assumption: e.g. for the second optimality, within the class of unbiased estimators, it was shown that the quadratic risk of $\bar Y_n$ attains the Cramer-Rao bound, which is the inverse Fisher information. The Fisher information $I_F$ is a function of the density (or probability function) of the data: recall the general form of $I_F$ for a density $p_\theta$ depending on $\theta$:
\[
I_F(\theta) = E_\theta \left( \frac{\partial}{\partial\theta} \log p_\theta(Y) \right)^2.
\]
In the linear model, the distribution of $Y$ is left unspecified. We might ask whether the sample mean, or more generally $\hat\beta$, still has any optimality properties. A natural choice for the risk is $E\|\hat\beta - \beta\|^2$, i.e. expected loss for the squared Euclidean distance.
Note that $\hat\beta$ is unbiased:
\[
E\hat\beta = E\bigl(X^\top X\bigr)^{-1} X^\top Y = \bigl(X^\top X\bigr)^{-1} X^\top EY
= \bigl(X^\top X\bigr)^{-1} X^\top X\beta = \beta.
\]
Without normality, it cannot be shown that $\hat\beta$ is best unbiased. But when we further restrict the class of estimators admitted for comparison, to linear estimators, $\hat\beta$ turns out to be optimal. Recall that a similar device was already employed for the sample mean and minimaxity: in Exercise 5.7.2 a minimax linear estimator was proposed, i.e. an estimator which is minimax within the class of linear ones.

Definition 10.5.1 Consider the general linear model (LM).
(i) A linear estimator of $\beta$ is any estimator $b$ of the form
\[
b = AY
\]
where $A$ is a (nonrandom) $k \times n$-matrix.
(ii) A linear estimator $\hat b$ is called a best linear unbiased estimator (BLUE) if $\hat b$ is unbiased:
\[
E\hat b = \beta \tag{10.34}
\]
and if
\[
E\|\hat b - \beta\|^2 \le E\|b - \beta\|^2 \tag{10.35}
\]
for all linear unbiased $b$. Relations (10.34), (10.35) are assumed to hold for all values of the unknown parameters ($\beta \in \mathbb{R}^k$, and $\mathcal{L}(\varepsilon)$ as specified).

Theorem 10.5.2 (Gauss, Markov) In the linear model (LM), the least squares estimator $\hat\beta$ is the unique BLUE.

Proof. Consider a linear unbiased estimator $b = AY$. The unbiasedness condition implies
\[
\beta = Eb = EAY = AX\beta
\]
for all $\beta \in \mathbb{R}^k$, hence
\[
AX = I_k. \tag{10.36}
\]
The loss is
\[
\|b - \beta\|^2 = \|AY - \beta\|^2 = \|AX\beta + A\varepsilon - \beta\|^2 \tag{10.37}
\]
\[
= \|A\varepsilon\|^2 = (A\varepsilon)^\top A\varepsilon = \varepsilon^\top A^\top A\varepsilon. \tag{10.38}
\]
For the risk we have to compute the expected loss.
For any $n \times n$-matrix $B = (B_{ij})_{i,j=1,\ldots,n}$, define the trace as
\[
\operatorname{tr}[B] = \sum_{i=1}^n B_{ii}.
\]
In other words, the trace is the sum of the diagonal elements. Note that for $n \times k$-matrices $B$, $C$ and for $n$-vectors $z$, $y$ we have
\[
\operatorname{tr}\bigl[BC^\top\bigr] = \sum_{i=1}^n \sum_{j=1}^k B_{ij} C_{ij} = \operatorname{tr}\bigl[C^\top B\bigr], \tag{10.39}
\]
\[
\operatorname{tr}\bigl[B^\top B\bigr] = \sum_{i=1}^n \sum_{j=1}^k B_{ij}^2 = \operatorname{tr}\bigl[BB^\top\bigr], \tag{10.40}
\]
\[
\operatorname{tr}\bigl[zy^\top\bigr] = \sum_{i=1}^n z_i y_i = y^\top z, \qquad \operatorname{tr}\bigl[zz^\top\bigr] = z^\top z = \|z\|^2.
\]

From (10.39) we see that $C \mapsto \operatorname{tr}[BC]$ is a linear operation on matrices, so that when $C$ is a random matrix, then for the expectation we have
\[
E\operatorname{tr}[BC] = \operatorname{tr}[B\, EC].
\]
Applying this to the quadratic loss (10.37), (10.38), we have
\[
E\|b - \beta\|^2 = E\varepsilon^\top A^\top A\varepsilon = E\operatorname{tr}\bigl[\varepsilon^\top A^\top A\varepsilon\bigr]
= E\operatorname{tr}\bigl[A^\top A\, \varepsilon\varepsilon^\top\bigr] = \operatorname{tr}\bigl[A^\top A\, E(\varepsilon\varepsilon^\top)\bigr]
= \operatorname{tr}\bigl[A^\top A\, \sigma^2 I_n\bigr] = \sigma^2 \operatorname{tr}\bigl[A^\top A\bigr].
\]
The problem thus is to minimize $\operatorname{tr}\bigl[A^\top A\bigr]$ under the unbiasedness restriction $AX = I_k$ from (10.36). From (10.40) we see that if $\operatorname{vec}(A)$ is the $kn$-vector formed of all elements of $A$, then
\[
\operatorname{tr}\bigl[A^\top A\bigr] = \sum_{i=1}^k \sum_{j=1}^n A_{ij}^2 = \|\operatorname{vec}(A)\|^2,
\]
i.e. the problem is to minimize the length of the vector $\operatorname{vec}(A)$ under a set of affine linear restrictions $AX = I_k$. Consider first the special case $k = 1$; then $a = A^\top$ and $X$ are vectors of dimension $n$ and the set $\{a : a^\top X = 1\}$ is an affine hyperplane in $\mathbb{R}^n$. To minimize the norm of $a$ within this set means taking $a$ perpendicular to the hyperplane, i.e. having the same direction as $X$. This gives $a = X(X^\top X)^{-1}$ as the minimizer.
This argument is generalized to $k \ge 1$ as follows. Let $\Pi_X = X\bigl(X^\top X\bigr)^{-1} X^\top$ be the projection operator onto $\operatorname{Lin}(X)$ in the space $\mathbb{R}^n$. We have
\[
\operatorname{tr}\bigl[A^\top A\bigr] = \operatorname{tr}\bigl[AA^\top\bigr] = \operatorname{tr}\bigl[A(I_n - \Pi_X + \Pi_X)A^\top\bigr]
= \operatorname{tr}\bigl[A(I_n - \Pi_X)A^\top\bigr] + \operatorname{tr}\bigl[A\Pi_X A^\top\bigr]
= \operatorname{tr}\bigl[A(I_n - \Pi_X)A^\top\bigr] + \operatorname{tr}\bigl[\bigl(X^\top X\bigr)^{-1}\bigr]
\]
in view of $AX = I_k$. Here the term $\operatorname{tr}\bigl[A(I_n - \Pi_X)A^\top\bigr]$ is nonnegative, since for $C = A(I_n - \Pi_X)$ we have (recall that $I_n - \Pi_X$ is idempotent, since it is a projection)
\[
\operatorname{tr}\bigl[A(I_n - \Pi_X)A^\top\bigr] = \operatorname{tr}\bigl[CC^\top\bigr] \ge 0
\]
since $\operatorname{tr}\bigl[CC^\top\bigr]$ is a sum of squares (10.40). Thus
\[
E\|b - \beta\|^2 \ge \sigma^2 \operatorname{tr}\bigl[\bigl(X^\top X\bigr)^{-1}\bigr].
\]
This lower bound is attained for $A = \bigl(X^\top X\bigr)^{-1} X^\top$:
\[
(I_n - \Pi_X)A^\top = (I_n - \Pi_X) X \bigl(X^\top X\bigr)^{-1} = 0 \tag{10.41}
\]

since $I_n - \Pi_X$ projects onto $\operatorname{Lin}(X)^\perp$. Thus $\hat\beta$ is a BLUE.
It remains to show uniqueness. This follows from the fact that (10.41) is necessary for attainment of the lower bound, and this implies
\[
A\Pi_X = A.
\]
The left side is
\[
A\Pi_X = AX\bigl(X^\top X\bigr)^{-1} X^\top = \bigl(X^\top X\bigr)^{-1} X^\top,
\]
and the result is proved.


The BLUE property of the Gauss-Markov Theorem is not a very strong optimality, since the class of estimators is quite restricted. On the other hand, it says that $\hat\beta$ is uniformly best within that class. Recall that the sample mean $\bar Y_n$ is a special case, so we obtained another optimality property of the sample mean. This also suggests more optimality properties of $\hat\beta$ under normality (minimaxity, best unbiased estimator); these indeed can be established, but we do not discuss this here.
Chapter 11
LINEAR HYPOTHESES AND THE ANALYSIS OF VARIANCE

11.1 Testing linear hypotheses


In the normal linear model (NLM),
\[
Y = X\beta + \varepsilon,
\]
recall that the parameter $\beta$ varies in the whole space $\mathbb{R}^k$. Consider a linear subspace of $\mathbb{R}^k$, $S$ say, of dimension $s < k$, and a hypothesis $H: \beta \in S$. Such a problem, i.e. testing a linear hypothesis, arises naturally in a number of situations.
1. Bivariate linear regression. We have
\[
Y_i = \alpha + \beta x_i + \varepsilon_i, \quad i = 1, \ldots, n.
\]
Here $k = 2$, the rows of $X$ are $x_i^\top = (1, x_i)$, $\beta = (\alpha, \beta)^\top$. A linear hypothesis could be
\[
H: \beta = 0,
\]
meaning that the nonrandom regressor variable $x_i$ has no influence upon $Y_i$. If the $x_i$ are i.i.d. realizations of a normal random variable $X$, $\mathcal{L}(X) = N(\mu_X, \sigma^2_X)$, then the regression coefficient is (see Definition 10.1.3)
\[
\beta = \frac{\sigma_{XY}}{\sigma^2_X}.
\]
The hypothesis then means that $\sigma_{XY} = 0$, i.e. $X$ and $Y$ are independent random variables, or equivalently $X$ and $Y$ are uncorrelated. Indeed the correlation coefficient is
\[
\rho = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}
\]
and it is related to $\beta$ by
\[
\rho = \beta \frac{\sigma_X}{\sigma_Y}
\]
(in Galton's case of fathers and sons where $\sigma^2_X = \sigma^2_Y$ they are actually equal).
Using the linear algebra formalism of (NLM), the linear subspace $S$ would be the subspace of $\mathbb{R}^2$ spanned by the vector $(1, 0)^\top$, i.e.
\[
S = \operatorname{Lin}((1, 0)^\top).
\]
2. Normal location-scale model. Here
\[
Y_i = \mu + \varepsilon_i, \quad i = 1, \ldots, n,
\]
$k = 1$, $X = (1, \ldots, 1)^\top$, the $\varepsilon_i$ are i.i.d. $N(0, \sigma^2)$, $\sigma^2 > 0$. The only possible linear hypothesis is
\[
H: \mu = 0.
\]

3. Polynomial regression. Here for some design points $x_i$, $i = 1, \ldots, n$
\[
Y_i = \sum_{j=1}^k \beta_j \psi_j(x_i) + \varepsilon_i, \quad i = 1, \ldots, n,
\]
the functions $\psi_j$ are $\psi_j(x) = x^{j-1}$ and $\beta = (\beta_1, \ldots, \beta_k)^\top$. A linear hypothesis could be
\[
H: \beta_k = 0,
\]
meaning that the regression function
\[
r(x) = \sum_{j=1}^k \beta_j \psi_j(x)
\]
is actually a polynomial of degree $k - 2$, and not of degree $k - 1$ as postulated in the model (LM). A special case is 1. above (the bivariate linear regression) for $k = 2$. The hypothesis means that the mean function $r(x)$ is a polynomial of lesser degree, or that the model is less complex. Of course, as always with testing problems, the statistically significant result would be a rejection; for e.g. $k = 3$ this means that it is not possible to describe the mean function of the $Y_i$ by a straight line, and a higher degree polynomial is needed.
4. Analysis of variance (ANOVA). Here
\[
Y_{jk} = \mu_j + \varepsilon_{jk}, \quad j = 1, \ldots, m, \ k = 1, \ldots, l \tag{11.1}
\]
where $\varepsilon_{jk}$ are independent noise variables. The index $j$ is associated with $m$ treatments or groups, and one might wish to test whether the treatments have any effect:
\[
H: \mu_1 = \ldots = \mu_m,
\]
or in other words, whether the groups differ in the characteristic $Y_{jk}$ measured. The linear subspace $S$ of $\mathbb{R}^m$ would be
\[
S = \operatorname{Lin}(\mathbf{1})
\]
where $\mathbf{1} = (1, \ldots, 1)^\top \in \mathbb{R}^m$, and $\dim(S) = 1$ (note that the $m$ here corresponds to $k$ in (10.32), (10.33)).
5. General case. Recall that $\xi_j$ were the columns of the matrix $X$, so that model (LM) can be written
\[
Y = \sum_{j=1}^k \beta_j \xi_j + \varepsilon
\]
where the $j$-th column may have arisen from $n$ independent realizations of the $j$-th component of a random vector $X$, or as designed nonrandom values. In either case, $\xi_j$ may be construed as one of $k$ regressor variables which influences the regressand $Y$. Such a variable $\xi_j$ is also called an explanatory variable, or one of $k$ independent variables, when $Y$ is the dependent variable. Thus a hypothesis $H: \beta_j = 0$ postulates that $\xi_j$ is without influence on $Y$, and can be dispensed with. Clearly the polynomial regression above is a special case for $\xi_j = (\psi_j(x_1), \ldots, \psi_j(x_n))^\top$ (designed nonrandom values). Here again, what is sought is the statistical certainty in the case of rejection, when it can be claimed that $\xi_j$ actually does influence the measured quantity $Y$.
In the normal linear model (NLM), to find a test statistic for the problem
\[
H: \beta \in S, \qquad K: \beta \notin S,
\]
[Footnote: Note that throughout math and statistics software, a vector of values is frequently called a "variable".]
we could again apply the likelihood ratio principle. That was already done in a special case of ANOVA, namely the two sample problem (cf. Theorem 10.3.6). We repeat some of that reasoning, with general notation, assuming first $\sigma^2$ unknown. The density of $Y$ is (as a function of $y \in \mathbb{R}^n$)
\[
p_{\beta, \sigma^2}(y) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \|y - X\beta\|^2 \right).
\]
The LR statistic is
\[
L(y) = \frac{\max_{\beta \notin S,\ \sigma^2 > 0} p_{\beta, \sigma^2}(y)}{\max_{\beta \in S,\ \sigma^2 > 0} p_{\beta, \sigma^2}(y)}.
\]
To find the numerator, we first maximize over $\beta \notin S$ and then over $\sigma^2 > 0$. Maximizing first over $\beta \in \mathbb{R}^k$, we obtain the LSE of $\beta$:
\[
\hat\beta = \bigl(X^\top X\bigr)^{-1} X^\top Y.
\]
We now claim that with probability one, $\hat\beta \notin S$, so that $\hat\beta$ is also the maximizer over $\beta \notin S$. Note that
\[
\mathcal{L}(\hat\beta) = N_k\bigl(\beta, \sigma^2 \bigl(X^\top X\bigr)^{-1}\bigr)
\]
and that the matrix $\bigl(X^\top X\bigr)^{-1}$ is nonsingular. It was already argued that a multivariate normal vector, with nonsingular covariance matrix, takes values in a given lower dimensional subspace with probability 0 (conclusion of the proof of Theorem 10.4.2). Thus $P(\hat\beta \notin S) = 1$ and almost surely
\[
\max_{\beta \notin S,\ \sigma^2 > 0} p_{\beta, \sigma^2}(y) = \max_{\sigma^2 > 0} p_{\hat\beta, \sigma^2}(y)
= \max_{\sigma^2 > 0} \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left( -\frac{1}{2\sigma^2} \|(I_n - \Pi_X)y\|^2 \right).
\]
This maximization problem has been solved already in the proof of Proposition 3.0.5 (insert now $n^{-1}\|(I_n - \Pi_X)y\|^2$ for $S_n^2$): the result is
\[
\max_{\beta \notin S,\ \sigma^2 > 0} p_{\beta, \sigma^2}(y) = \frac{1}{(\hat\sigma^2)^{n/2}} \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{n}{2} \right),
\]
\[
\hat\sigma^2 = n^{-1} \|(I_n - \Pi_X)y\|^2. \tag{11.2}
\]

Key idea for linear hypotheses in linear models. To find the maximized likelihood under the hypothesis, we note that $\beta \in S$ implies that $X\beta$ varies in a linear subspace of $\operatorname{Lin}(X)$. Indeed if $S$ is a $k \times s$-matrix such that $S = \operatorname{Lin}(S)$ (we use the same symbol for the matrix and the subspace spanned by its columns), then every $\beta \in S$ can be represented as $\beta = Sb$ for some $b \in \mathbb{R}^s$, thus
\[
X\beta = XSb
\]
where $\operatorname{rank}(XS) = s$, and we see that
\[
\{X\beta,\ \beta \in S\} = \operatorname{Lin}(XS).
\]
We can now apply the results for least squares (or ML) estimation of $\beta \in \mathbb{R}^k$ to estimation of $b \in \mathbb{R}^s$. We could immediately write down an LSE for $b$ and a derived one for $\beta \in S$, but we skip this and proceed directly to the MLE of $\sigma^2$ under the hypothesis, analogously to (11.2):
\[
\hat\sigma_0^2 = n^{-1} \|(I_n - \Pi_{XS})y\|^2
\]
where the projection $I_n - \Pi_X$ is substituted by $I_n - \Pi_{XS}$. Write the respective linear subspaces of $\mathbb{R}^n$ as
\[
\mathcal{X} := \operatorname{Lin}(X), \qquad \mathcal{S}_0 := \operatorname{Lin}(XS),
\]
and $\Pi_{\mathcal{X}} = \Pi_X$, $\Pi_{\mathcal{S}_0} = \Pi_{XS}$ for the associated projection operators in $\mathbb{R}^n$. It follows that the maximized likelihood under $H$ is
\[
\max_{\beta \in S,\ \sigma^2 > 0} p_{\beta, \sigma^2}(y) = \frac{1}{(\hat\sigma_0^2)^{n/2}} \frac{1}{(2\pi)^{n/2}} \exp\left( -\frac{n}{2} \right),
\]
\[
\hat\sigma_0^2 = n^{-1} \|(I_n - \Pi_{\mathcal{S}_0})y\|^2. \tag{11.3}
\]

Thus the LR statistic is
\[
L(y) = \left( \frac{\hat\sigma_0^2}{\hat\sigma^2} \right)^{n/2}
= \left( \frac{\|(I_n - \Pi_{\mathcal{S}_0})y\|^2}{\|(I_n - \Pi_{\mathcal{X}})y\|^2} \right)^{n/2}. \tag{11.4}
\]
For a further transformation, note that the matrix $\Pi_{\mathcal{X}} - \Pi_{\mathcal{S}_0}$ is again a projection matrix, namely onto the orthogonal complement of $\mathcal{S}_0$ in $\mathcal{X}$. Denote this orthogonal complement as $\mathcal{X} \ominus \mathcal{S}_0$ (the space of all $x \in \mathcal{X}$ which are orthogonal to $\mathcal{S}_0$). We then have a decomposition
\[
\mathbb{R}^n = \mathcal{X}^\perp \oplus (\mathcal{X} \ominus \mathcal{S}_0) \oplus \mathcal{S}_0 \tag{11.5}
\]
where $\oplus$ denotes the orthogonal sum, $\mathcal{X}^\perp = \mathbb{R}^n \ominus \mathcal{X}$, and all three spaces are orthogonal. We have a corresponding decomposition of $I_n$ (which is the projection onto $\mathbb{R}^n$)
\[
I_n = (I_n - \Pi_{\mathcal{X}}) + (\Pi_{\mathcal{X}} - \Pi_{\mathcal{S}_0}) + \Pi_{\mathcal{S}_0}
\]
and all three matrices on the right are projection matrices, orthogonal to one another. (Exercise: show that $\Pi_{\mathcal{X}} - \Pi_{\mathcal{S}_0}$ is the projection operator onto $\mathcal{X} \ominus \mathcal{S}_0$.) As a consequence
\[
\|(I_n - \Pi_{\mathcal{S}_0})y\|^2 = \|(I_n - \Pi_{\mathcal{X}} + \Pi_{\mathcal{X}} - \Pi_{\mathcal{S}_0})y\|^2
= \|(I_n - \Pi_{\mathcal{X}})y\|^2 + \|(\Pi_{\mathcal{X}} - \Pi_{\mathcal{S}_0})y\|^2,
\]
and we obtain from (11.4)
\[
L(y) = \left( 1 + \frac{\|(\Pi_{\mathcal{X}} - \Pi_{\mathcal{S}_0})y\|^2}{\|(I_n - \Pi_{\mathcal{X}})y\|^2} \right)^{n/2}.
\]
Note that the form obtained for the two sample problem in Theorem 10.3.6 is a special case; there we could further argue that the LR test is equivalent to a certain t-test.

Definition 11.1.1 Let Z1 , Z2 be independent r.v.s having 2 -distributions of k1 and k2 degrees


of freedom, respectively. The F-distribution with k1 , k2 degrees of freedom (denoted Fk1 ,k2 )
is the distribution of
k 1 Z1
Y = 11 .
k2 Z2
It is possible to write down the density of the F -distribution explicitly, with methods similar those
for the t-distribution (Proposition 7.2.7). Note that for k1 = 1 , Z1 is the square of a standard
normal, hence Y is the square of a t-distributed r. v. with k2 degrees of freedom.
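This relation between the F- and t-distributions can be checked numerically; the following snippet (not part of the original notes; it assumes the scipy library, and the degrees of freedom and level are arbitrary) compares the upper α-quantile of F_{1,k2} with the squared upper α/2-quantile of t_{k2}.

# Numerical check (illustration only): for k1 = 1, the upper alpha-quantile of F_{1,k2}
# equals the squared upper alpha/2-quantile of the t-distribution with k2 degrees of freedom.
from scipy import stats

k2, alpha = 12, 0.05
f_quantile = stats.f.ppf(1 - alpha, dfn=1, dfd=k2)   # upper alpha-quantile of F_{1,k2}
t_quantile = stats.t.ppf(1 - alpha / 2, df=k2)       # upper alpha/2-quantile of t_{k2}
print(f_quantile, t_quantile ** 2)                   # both approximately 4.747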

Definition 11.1.2 In the normal linear model (NLM), consider a linear hypothesis given by a
linear subspace S ⊂ R^k, dim(S) = s < k:
H : β ∈ S,
K : β ∉ S.
The F-statistic for this problem is

F(Y) = [ ‖(Π_X - Π_{S0})Y‖² / (k - s) ] / [ ‖(I_n - Π_X)Y‖² / (n - k) ]             (11.6)

where L_X = Lin(X), L_{S0} = Lin(XS) and S = Lin(S).

Proposition 11.1.3 In the normal linear model (NLM), under a hypothesis H : β ∈ S, the per-
taining F-statistic has an F-distribution with k - s, n - k degrees of freedom.

Proof. Note that under the hypothesis,

EY = Xβ ∈ L_{S0},

hence

(Π_X - Π_{S0})Y = (Π_X - Π_{S0})(Xβ + ε) = (Π_X - Π_{S0})ε.

But

EY = Xβ ∈ L_X

is already in the model assumption (LM); it holds under H in particular. Thus

(I_n - Π_X)Y = (I_n - Π_X)ε.

Let now e_1, ..., e_n be an orthonormal basis of R^n such that

Lin(e_1, ..., e_s) = L_{S0},   Lin(e_{s+1}, ..., e_k) = L_X ⊖ L_{S0},
Lin(e_{k+1}, ..., e_n) = L_X^⊥,

i.e. the basis conforms to the decomposition (11.5). Then the three projection operators are given
by

Π_{S0} y = Σ_{i=1}^s (e_i^⊤ y) e_i,   etc.,

hence

‖(I_n - Π_X)ε‖² = Σ_{i=k+1}^n (e_i^⊤ ε)²,   ‖(Π_X - Π_{S0})ε‖² = Σ_{i=s+1}^k (e_i^⊤ ε)².

Note that z_i = σ^{-1} e_i^⊤ ε, i = 1, ..., n are i.i.d. standard normals. For the F-statistic we obtain

F(Y) = [ Σ_{i=s+1}^k z_i² / (k - s) ] / [ Σ_{i=k+1}^n z_i² / (n - k) ].

This proves the result.

An immediate consequence is the following.

Theorem 11.1.4 In the normal linear model (NLM), consider a linear hypothesis given by a linear
subspace S ⊂ R^k, dim(S) = s < k:
H : β ∈ S,
K : β ∉ S.
Let F(Y) be the pertaining F-statistic and F_{k-s,n-k;1-α} the lower 1 - α-quantile of the distribution
F_{k-s,n-k}. The F-test which rejects when

F(Y) > F_{k-s,n-k;1-α}

is an α-test.
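The following sketch (not part of the original notes; it assumes numpy and scipy, and the design matrix, hypothesis matrix and data are arbitrary simulated choices) computes the F-statistic (11.6) directly from the projection matrices Π_X and Π_{S0} and compares it with the quantile of Theorem 11.1.4.

# Illustration only: F-test of a linear hypothesis beta in S = Lin(S) in the normal linear model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k, s, alpha = 50, 3, 1, 0.05
X = rng.standard_normal((n, k))              # design matrix of rank k
S = np.array([[1.0], [1.0], [1.0]])          # hypothesis subspace S = Lin(S), dim s = 1
beta = np.array([2.0, 2.0, 2.0])             # true beta lies in S, so H holds here
y = X @ beta + rng.standard_normal(n)

def proj(A):
    """Orthogonal projection matrix onto the column space of A."""
    return A @ np.linalg.inv(A.T @ A) @ A.T

P_X, P_S0 = proj(X), proj(X @ S)             # projections onto Lin(X) and Lin(XS)
F = (np.sum(((P_X - P_S0) @ y) ** 2) / (k - s)) / \
    (np.sum(((np.eye(n) - P_X) @ y) ** 2) / (n - k))
crit = stats.f.ppf(1 - alpha, dfn=k - s, dfd=n - k)
print(F, crit, F > crit)                     # reject H iff F > crit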

11.2 One-way layout ANOVA


Consider a model
Y_jk = μ_j + ε_jk,   k = 1, ..., l_j,  j = 1, ..., m,                               (11.7)

where the ε_jk are i.i.d. normal noise variables and the μ_j are unknown parameters. The model is
slightly more general than (10.18) since we admit different observation numbers l_j for each group j.
The total number of observations then is n = Σ_{j=1}^m l_j. Clearly this is again a special case of the
normal linear model (NLM). Assume that l_j > 1 for at least one j.
The groups j = 1, ..., m are also called factors or treatments. The question of interest is "do
the factors have an influence upon the measured quantity Y_jk?" This corresponds to a hypothesis
H : μ_1 = ... = μ_m. At first sight, the question appears similar to the one in bivariate regression,
where one asks whether a measured regressor quantity x_i has an influence upon the regressand Y_i
(the contrary is the linear hypothesis β = 0). This in turn is related to the question "are the r.v.s Y
and X correlated?". However the difference is that in ANOVA no numerical value x_i is attached to
the groups; the groups or factors are just different categories (they are qualitative in nature). Thus,
even if one is willing to assume that individuals are randomly selected from one of the groups, to
compute a correlation or regression does not make sense - it is not clear which values X_i should be
associated to the groups. An example is the drug testing problem, where one has two samples, one
for the old and one for the new drug (m = 2). For ANOVA with m groups, an example would be that j
represents different social groups and Y_jk the cholesterol level, or j might represent different regions
of space and Y_jk the size of cosmic background radiation.
Recall Theorem 10.3.6 where it was shown that the two sample problem (with H : μ_1 = μ_2) can be
treated by a t-test, and that it is a special case of ANOVA. The general ANOVA case for m ≥ 2
can be treated by an F-test; it suffices to formulate the test problem as a linear hypothesis in a
(normal) linear model and apply Theorem 11.1.4.
To write down the F-statistic,

F(Y) = [ ‖(Π_X - Π_{S0})Y‖² / (k - s) ] / [ ‖(I_n - Π_X)Y‖² / (n - k) ]             (11.8)

it suffices to identify the linear spaces and projection operators involved. We have

β = (μ_1, ..., μ_m)^⊤,   n = Σ_{j=1}^m l_j,                                          (11.9)

X = diag(1_{l_1}, ..., 1_{l_m})                                                      (11.10)

where 1_l is the l-vector consisting of 1s, and X is the n × m matrix where the vectors 1_{l_j} are arranged
in a diagonal fashion (with 0s elsewhere),

Y = (Y_{11}, ..., Y_{1l_1}, Y_{21}, ..., Y_{ml_m})^⊤,                                (11.11)

and the n-vector ε is formed analogously. Then

Y = Xβ + ε,

rank(X) = m, and the hypothesis H : μ_1 = ... = μ_m can be expressed as

H : β ∈ S = Lin(1_m)

where Lin(1_m) is a one dimensional linear space (dim(S) = s = 1). Thus

L_{S0} = {Xβ, β ∈ S} = Lin(1_n),

which can also be seen by noting that if all μ_j are equal to one value μ, then Y_jk = μ + ε_jk.
To find the value ‖(Π_X - Π_{S0})Y‖², note that

Π_{S0} = 1_n (1_n^⊤ 1_n)^{-1} 1_n^⊤ = n^{-1} 1_n 1_n^⊤,
Π_{S0} Y = 1_n n^{-1} 1_n^⊤ Y = 1_n Ȳ

where Ȳ is the overall sample mean:

Ȳ = n^{-1} Σ_{j=1}^m Σ_{k=1}^{l_j} Y_jk.

Furthermore let

Ȳ_j = l_j^{-1} Σ_{k=1}^{l_j} Y_jk

be the mean within the j-th group. We claim that

Π_X Y = X (X^⊤X)^{-1} X^⊤ Y = X (Ȳ_1, ..., Ȳ_m)^⊤.                                  (11.12)

Indeed, X^⊤Y gives the m-vector of the sums Σ_{k=1}^{l_j} Y_jk for each group, and X^⊤X is an m × m
diagonal matrix with the l_j as diagonal elements. But we can also write

Π_{S0} Y = 1_n Ȳ = X (Ȳ, ..., Ȳ)^⊤,

hence

(Π_X - Π_{S0})Y = X (Ȳ_1 - Ȳ, ..., Ȳ_m - Ȳ)^⊤,

‖(Π_X - Π_{S0})Y‖² = Σ_{j=1}^m l_j (Ȳ_j - Ȳ)².

Similarly we obtain from (11.12) and (11.11)

(I_n - Π_X)Y = (Y_{11} - Ȳ_1, ..., Y_{1l_1} - Ȳ_1, Y_{21} - Ȳ_2, ..., Y_{ml_m} - Ȳ_m)^⊤,

‖(I_n - Π_X)Y‖² = Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk - Ȳ_j)².

Thus the F-statistic (11.8) takes the form

F(Y) = [ (m - 1)^{-1} Σ_{j=1}^m l_j (Ȳ_j - Ȳ)² ] / [ (n - m)^{-1} Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk - Ȳ_j)² ].   (11.13)

The terms

D_w = Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk - Ȳ_j)²,   D_b = Σ_{j=1}^m l_j (Ȳ_j - Ȳ)²

can be called the sum of squares within groups and the sum of squares between groups,
respectively. Recall that we introduced this terminology essentially already in the two sample
problem (just before Theorem 10.3.6; there we divided by n and called this "variability"). Consider
also the quantity

D_0 = Σ_{j=1}^m Σ_{k=1}^{l_j} (Y_jk - Ȳ)² = ‖(I_n - Π_{S0})Y‖².

This is the total sum of squares; then n^{-1} D_0 = S_n² is the total sample variance. We have a
decomposition

D_0 = D_w + D_b                                                                      (11.14)

as an immediate consequence of the identity

‖(I_n - Π_{S0})Y‖² = ‖(I_n - Π_X)Y‖² + ‖(Π_X - Π_{S0})Y‖².

The decomposition (11.14) of the total sample variance S_n² generalizes (10.21); it gives the name to
the test procedure: analysis of variance. The hypothesis H of equality of means is rejected when
the between groups sum of squares D_b is too large, compared to the within groups sum of squares
D_w.
Note that the quantities

d_w = (n - m)^{-1} D_w,   d_b = (m - 1)^{-1} D_b

are both unbiased estimates of σ² under the hypothesis (cf. the proof of Proposition 11.1.3); they
can be called the respective mean sums of squares. The F-statistic (11.13) involves these quantities
d_w, d_b, which differ only by a factor from the sums of squares D_w, D_b.
A common way of visualizing all the quantities involved is the ANOVA table:

                    sum of squares   degrees of freedom   mean sum of squares        F-value
  between groups    D_b              m - 1                (m - 1)^{-1} D_b = d_b     F(Y) = d_b/d_w
  within groups     D_w              n - m                (n - m)^{-1} D_w = d_w
  total             D_0              n - 1
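A small numerical illustration (not part of the original notes; it assumes numpy and scipy, and the data are arbitrary) computes D_b, D_w and the F-statistic (11.13) for unequal group sizes and cross-checks the result against scipy.stats.f_oneway.

# Illustration only: one-way ANOVA F-statistic from the between/within sums of squares.
import numpy as np
from scipy import stats

groups = [np.array([5.1, 4.8, 5.6, 5.0]),
          np.array([4.2, 4.5, 3.9, 4.8, 4.1]),
          np.array([5.5, 5.9, 5.2])]                    # arbitrary data, m = 3 groups
m = len(groups)
n = sum(len(g) for g in groups)
grand_mean = np.concatenate(groups).mean()

D_b = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)   # between groups
D_w = sum(((g - g.mean()) ** 2).sum() for g in groups)             # within groups
F = (D_b / (m - 1)) / (D_w / (n - m))

F_scipy, p_value = stats.f_oneway(*groups)              # same F-statistic, plus p-value
print(F, F_scipy)
print(F > stats.f.ppf(0.95, dfn=m - 1, dfd=n - m))      # decision at level alpha = 0.05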

11.3 Two-way layout ANOVA


Return to the cholesterol / social group example and suppose we have data from various countries,
where it can be assumed that the general level of cholesterol varies from country to country (e.g.
via different nutritional habits). Nevertheless we are still interested in the same question, namely
whether the cholesterol level differs across social groups or not. This can be modeled by a two
way layout, where we allow for additional effects (the countries). It will become clear why the
simpler ANOVA in the previous subsection is called "one way layout".
In the previous subsection we could have written the nonrandom group means μ_j as

μ_j = μ + α_j,  j = 1, ..., m,   Σ_{j=1}^m α_j = 0.

Any vector (μ_1, ..., μ_m)^⊤ can be written uniquely in this form: set

μ = m^{-1} Σ_{j=1}^m μ_j,   α_j = μ_j - μ.

The one way layout can be written

Y_jk = μ + α_j + ε_jk,   k = 1, ..., l_j,  j = 1, ..., m,

and the hypothesis is

H : α_1 = ... = α_m = 0.

The α_j are called factor effects and μ is called the main effect. Then the Y_jk can be decomposed as

Y_jk = Ȳ + α̂_j + ε̂_jk,   k = 1, ..., l_j,  j = 1, ..., m,                           (11.15)
α̂_j = Ȳ_j - Ȳ,   ε̂_jk = Y_jk - Ȳ_j,

where the ε̂_jk are called residuals and the α̂_j can be interpreted as estimates of the factor effects
α_j. The relation (11.15) can be written as a decomposition of the data vector Y

Y = Π_{S0}Y + (Π_X - Π_{S0})Y + (I_n - Π_X)Y.

The F statistic (11.13) then takes the form

F(Y) = [ (m - 1)^{-1} Σ_{j=1}^m l_j α̂_j² ] / [ (n - m)^{-1} Σ_{j=1}^m Σ_{k=1}^{l_j} ε̂_jk² ].

This indicates how the two way layout can be treated. The model is

Y_ijk = μ_ij + ε_ijk,   k = 1, ..., l,  i = 1, ..., q,  j = 1, ..., m,

where the ε_ijk are i.i.d. normal variables with variance σ². For simplicity we assume that all groups
(i, j) have an equal number of observations l. The index i is called the first factor and j is called
the second factor. Again this is a normal linear model, but the matrix X has an involved form.
It is somewhat laborious to work out all the projections and derived sums of squares; we therefore
forgo the projection approach and use the more elementary "multiple dot" notation: a dot in place
of an index denotes the average over that index (e.g. μ_i· = m^{-1} Σ_{j=1}^m μ_ij). Note that
the projection approach is still needed for a rigorous proof that the test statistics below have an
F-distribution. The array of nonrandom mean values μ_ij can be written as

μ_ij = μ + α_i + β_j + γ_ij                                                          (11.16)

Σ_{i=1}^q α_i = Σ_{j=1}^m β_j = 0                                                    (11.17)

where

μ = μ·· = q^{-1} Σ_{i=1}^q μ_i· = m^{-1} Σ_{j=1}^m μ·j   is the global effect,
α_i = μ_i· - μ     is the main effect of value i of the first factor,
β_j = μ·j - μ      is the main effect of value j of the second factor,
γ_ij = μ_ij - μ_i· - μ·j + μ   is the interaction
                   of value i of the first factor and value j of the second factor.

(note that relations (11.16), (11.17) immediately follow from the definitions above of the quantities
involved). It follows also that

Σ_{i=1}^q γ_ij = qμ·j - qμ·· - q(μ·j - μ··) = 0   for all j = 1, ..., m,
Σ_{j=1}^m γ_ij = 0   for all i = 1, ..., q.

Assume now that there is no interaction between the factors 1 and 2, i.e. γ_ij = 0. In
this case from (11.16) we obtain

Y_ijk = μ + α_i + β_j + ε_ijk,   k = 1, ..., l,  i = 1, ..., q,  j = 1, ..., m.

Set n = mql. The two hypotheses of interest are

H_1 : α_1 = ... = α_q = 0,
H_2 : β_1 = ... = β_m = 0.

In our example, if factor 1 (indexed by i) is social group and factor 2 (indexed by j) is country,
then H_1 would be "cholesterol level does not depend on social group, even though it may vary
across countries" and H_2 would be "cholesterol level does not depend on country, even though it
may vary across social groups".
The analog of (11.14) is

D_0 = D_w + D_{b1} + D_{b2},                                                         (11.18)

D_{b1} = lm Σ_{i=1}^q (Ȳ_i·· - Ȳ)²,   D_{b2} = lq Σ_{j=1}^m (Ȳ·j· - Ȳ)²,
D_w = Σ_{i,j,k} ε̂_ijk²,   D_0 = Σ_{i,j,k} (Y_ijk - Ȳ)²,

where Ȳ_i·· and Ȳ·j· denote the averages of the Y_ijk over the remaining indices, Ȳ is the overall mean
and

ε̂_ijk = Y_ijk - Ȳ_i·· - Ȳ·j· + Ȳ

are the residuals. The two-way ANOVA table is

                            sum of squares   degrees of freedom   mean sum of squares          F-value
  1. factor, between gr.    D_{b1}           q - 1                D_{b1}/(q - 1) = d_{b1}      d_{b1}/d_w
  2. factor, between gr.    D_{b2}           m - 1                D_{b2}/(m - 1) = d_{b2}      d_{b2}/d_w
  residuals (within gr.)    D_w              n - m - q + 1        D_w/(n - m - q + 1) = d_w
  total                     D_0              n - 1

Here the F -value in the row for the first factor is for testing H1 .
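The following sketch (not part of the original notes; it assumes numpy and scipy, with simulated data in which both H_1 and H_2 hold) computes the two-way sums of squares D_{b1}, D_{b2}, D_w via the multiple dot notation and the two F-values of the table above.

# Illustration only: two-way layout (no interaction); Y[i, j, k] = k-th observation in cell (i, j).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
q, m, l = 3, 4, 5                              # q levels of factor 1, m of factor 2, l obs per cell
Y = 10.0 + rng.standard_normal((q, m, l))      # no true factor effects: H1 and H2 hold
n = q * m * l

Y_bar = Y.mean()
Y_i = Y.mean(axis=(1, 2))                      # factor-1 level means  (Y_i..)
Y_j = Y.mean(axis=(0, 2))                      # factor-2 level means  (Y_.j.)

D_b1 = l * m * np.sum((Y_i - Y_bar) ** 2)
D_b2 = l * q * np.sum((Y_j - Y_bar) ** 2)
resid = Y - Y_i[:, None, None] - Y_j[None, :, None] + Y_bar
D_w = np.sum(resid ** 2)

d_w = D_w / (n - m - q + 1)
F1 = (D_b1 / (q - 1)) / d_w                    # tests H1: alpha_1 = ... = alpha_q = 0
F2 = (D_b2 / (m - 1)) / d_w                    # tests H2: beta_1 = ... = beta_m = 0
print(F1, stats.f.ppf(0.95, q - 1, n - m - q + 1))
print(F2, stats.f.ppf(0.95, m - 1, n - m - q + 1))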

Remark 11.3.1 We briefly outline the associated projection arguments needed for the proof of the
F-distributions. Let Y be the n-vector of the data Y_ijk, e.g. arranged in a lexicographic fashion:

Y = (Y_{111}, Y_{112}, ..., Y_{11l}, Y_{121}, ..., Y_{qml})^⊤,

let S_00 be the subspace of R^n spanned by the vector 1_n,

S_10 = { Z : Z_ijk = α_i for some α_i with Σ_{i=1}^q α_i = 0 },
S_01 = { Z : Z_ijk = β_j for some β_j with Σ_{j=1}^m β_j = 0 }.

Note that S_10, S_01 and S_00 are mutually orthogonal in R^n: e.g. for any Z_1 ∈ S_10 and Z_2 ∈ S_01 we
have Z_1^⊤ Z_2 = 0, etc. Consider the linear space L_X spanned by these three subspaces, i.e. the set of
all linear combinations Σ_{r=1}^3 λ_r Z_r where Z_1 ∈ S_10, Z_2 ∈ S_01, Z_3 ∈ S_00. This can be represented as

L_X = { Z : Z_ijk = μ + α_i + β_j for some μ, α_i, β_j with Σ_{i=1}^q α_i = 0, Σ_{j=1}^m β_j = 0 },

or in short notation (using the orthogonal sum operation ⊕)

L_X = S_10 ⊕ S_00 ⊕ S_01.

To this corresponds a representation of projection operators:

Π_X = Π_{S00} + Π_{S10} + Π_{S01}

where Π_{S00} = Π_{1_n} = n^{-1} 1_n 1_n^⊤. The basic assumption of no interaction γ_ij = 0 in (11.16)
means that

EY ∈ L_X.

(If the assumption of no interaction is not made then we are only able to claim EY ∈ L_{X0} where

L_{X0} = { Z : Z_ijk = μ_ij for some μ_ij }.)

The basic (normal) linear model (under the assumption of no interaction) can be expressed as

L(Y) = N_n(EY, σ² I_n)   where EY ∈ L_X.

To write it in the form Y = Xβ + ε we would need a matrix X which spans the space L_X; we can avoid
this here. The two way ANOVA decomposition (11.18) can be written as

D_0 = ‖(I_n - Π_{S00})Y‖² = ‖(I_n - Π_X)Y‖² + ‖Π_{S10}Y‖² + ‖Π_{S01}Y‖²
    = D_w + D_{b1} + D_{b2}.

In this linear model, the expression

d_w = (n - m - q + 1)^{-1} D_w = (n - m - q + 1)^{-1} ‖(I_n - Π_X)Y‖²

is an unbiased estimator of σ² and D_w/σ² has a law χ²_{n-m-q+1}. The two linear hypotheses are

H_1 : α_1 = ... = α_q = 0,  or equivalently EY ∈ S_01 ⊕ S_00,
H_2 : β_1 = ... = β_m = 0,  or equivalently EY ∈ S_10 ⊕ S_00.

Under H_1, we have Π_{S10} EY = 0 and hence the expression

d_{b1} = (q - 1)^{-1} D_{b1} = (q - 1)^{-1} ‖Π_{S10}Y‖²

is independent of d_w and such that D_{b1}/σ² has a law χ²_{q-1}. Thus d_{b1}/d_w has law F_{q-1, n-m-q+1}.

The textbook in chap. 11.3 treats the case where l = 1 and the β_j are random variables (RCB,
randomized complete block design). This model is very similar to the two way layout treated here.
The theory of ANOVA with its associated design problems has many further ramifications.
Chapter 12
SOME NONPARAMETRIC TESTS

Recall Remark 9.4.6: a family of probability distributions P = {P_θ, θ ∈ Θ}, indexed by Θ, is
called parametric if all θ ∈ Θ are finite dimensional vectors (Θ ⊂ R^k for some k); otherwise
P is called nonparametric. In hypothesis testing, any hypothesis corresponds to some P, thus
the terminology is extended to hypotheses. Any simple hypothesis (consisting of one probability
distribution P = {Q_0}) is parametric. The understanding is that the family P is too large to be
indexed by some Θ ⊂ R^k; it should be indexed by some set of functions or other objects (e.g. the
associated distribution functions, or the densities if they exist, or the infinite series of their Fourier
coefficients, or the probability laws themselves). Typically some restrictions are then placed on the
functions involved, such as: the density is symmetric around 0, or is differentiable with derivative
bounded by a constant, etc.
In the χ²-goodness-of-fit test treated in Corollary 9.4.5, we already encountered a nonparametric
alternative K : Q ≠ Q_0. In a more narrow sense, a nonparametric test is one in which the hypothesis
H is a nonparametric set. It was also noted that the χ²-test for goodness of fit actually tests the
hypothesis on the cell probabilities p(Q) = p(Q_0), with asymptotic level α, and the set of all Q
fulfilling this hypothesis is also nonparametric.
A genuinely nonparametric test was the χ²-test for independence (χ²-test in a contingency table)
treated in Section 9.6: the hypothesis consists of all joint distributions of two r.v.s X_1, X_2 which
are the product of their marginals.

12.1 The sign test


Suppose that for a pair of (possibly dependent) random variables X, Y, one is interested in the
difference Z = X - Y, more precisely in the median of Z, i.e. med(Z). If X and Y represent
some measurements or "treatments", then med(Z) > 0 would mean that X is "better" in some
sense than Y. Assume that Z has a continuous distribution, so that P(Z = 0) = 0. If one
observes n independent pairs of r.v.s (X_i, Y_i), i = 1, ..., n, all having the distribution of (X, Y),
then corresponding hypotheses would be

H : med(Z) = 0,
K : med(Z) > 0.

A possible test statistic is

S = Σ_{i=1}^n sgn(Z_i)

where Z_i = X_i - Y_i and

sgn(x) = 1 if x > 0,   0 if x = 0,   -1 if x < 0.

It is plausible to reject H if the number of positive Z_i is too large compared to the number of
negative Z_i. This is the (one sided) sign test for H. Define

S_0 = Σ_{i=1}^n 1_{(0,∞)}(Z_i);

then S = 2S_0 - n. Moreover if p = P(Z > 0) then the statistic S_0 has a binomial distribution
B(n, p). Under H we have

P(Z < 0) = P(Z > 0),

so that p = 1/2 and the distribution of S_0 under H is B(n, 1/2). Since S is a strictly monotone
function of S_0, the sign test can also be based on S and the binomial distribution under H (rejection
when S is too large). Note that since B(n, 1/2) is a discrete distribution, in order to achieve
exact size α under the hypothesis, for any given α one has to use a randomized test in general
(alternatively, for a nonrandomized α-test the size may be less than α).
This is a prototypical nonparametric test; the hypothesis H contains all distributions Q for Z which
have median 0, and this is a large nonparametric class of laws. The statistic S is distribution-free
under H, since its law (that of 2S_0 - n with S_0 ∼ B(n, 1/2)) does not depend on Q in H.
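As a hedged illustration (not part of the original notes; it assumes numpy and scipy, and the differences Z_i below are arbitrary), the one-sided sign test can be carried out by computing the binomial tail probability of S_0 under B(n, 1/2).

# Illustration only: one-sided sign test based on S_0, the number of positive differences.
import numpy as np
from scipy import stats

Z = np.array([1.2, -0.4, 2.3, 0.7, -0.1, 1.8, 0.9, -0.6, 1.1, 0.5])   # Z_i = X_i - Y_i
n = len(Z)
S0 = int(np.sum(Z > 0))                       # number of positive Z_i
p_value = stats.binom.sf(S0 - 1, n, 0.5)      # P(S_0 >= observed value) under B(n, 1/2)
print(S0, p_value)                            # nonrandomized test: reject H if p_value <= alpha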

12.2 The Wilcoxon signed rank test


The hypothesis H : med(Z) = 0 in the sign test has the disadvantage that it includes distributions
for Z which are very skewed, in the sense e.g. that most of the positive values may be concentrated
around a large value (μ_+ > 0 say) whereas the probability mass on the negative half-axis may be
concentrated around a moderate negative value (μ_- < 0 say). In that case one would tend to say
that X is "better" than Y in some sense, even though the median of Z = X - Y is still 0. Thus
one formulates the hypothesis of symmetry

H : P(X - Y > c) = P(Y - X > c) for all c > 0

or equivalently

H : P(Z > c) = P(Z < -c) for all c > 0.

Assume that Z_i, i = 1, ..., n are i.i.d. having the distribution of Z = X - Y (a continuous
distribution). The rank of Z_i among Z_1, ..., Z_n, denoted by R_i, is defined to be the number of Z_j's
satisfying Z_j ≤ Z_i, j = 1, ..., n. The rank of |Z_i| among |Z_1|, ..., |Z_n| is similarly defined and
denoted by R̃_i. Because of the continuity assumption, we may assume that no two Z_i are equal and
that no Z_i is zero. Define signed ranks

S_i = sgn(Z_i) R̃_i

and the Wilcoxon signed rank statistic W_n

W_n = Σ_{i=1}^n S_i.

Under the hypothesis of symmetry, about one half of the S_i would be negative; thus W_n would be
close to 0. Thus it seems plausible to reject H if W_n is too large (one sided test) or if |W_n| is too
large (two sided test).

Lemma 12.2.1 Under the hypothesis H of symmetry, W_n has the same distribution as

V_n = Σ_{i=1}^n M_i · i                                                              (12.1)

where the M_i are independent Rademacher random variables, i.e.

P(M_i = 1) = P(M_i = -1) = 1/2

(or M_i = 2B_i - 1 where the B_i are Bernoulli B(1, 1/2)).

Proof. If Z has a symmetric distribution then sgn(Z) and |Z| are independent:

P(|Z| > c | sgn(Z) = 1) = P(|Z| > c | Z > 0) = 2P(Z > c)
   = 2P(Z < -c) = P(|Z| > c | Z < 0) = P(|Z| > c | sgn(Z) = -1),

thus the conditional distribution of |Z| given sgn(Z) does not depend on sgn(Z), which means
independence. Thus W_n has the same law as

Ṽ_n = Σ_{i=1}^n M_i R̃_i

where the M_i are Rademacher variables independent of the original sample of the Z_i. Moreover, since
(R̃_1, ..., R̃_n) is a permutation of (1, ..., n), and a random permutation (independent of the M_i) does
not change the law of the vector (M_1, ..., M_n), we obtain L(Ṽ_n) = L(V_n).

Note that EM_i = 0, so that EW_n = 0 under H. The variance of M_i is

Var(M_1) = EM_1² = (1/2)(1)² + (1/2)(-1)² = 1,

so that

Var(W_n) = Var(V_n) = Σ_{i=1}^n i² = (1/6) n(n + 1)(2n + 1) =: v_n

(the last formula can easily be proved by induction).

To obtain a test, we can now use a normal approximation for the law of W_n or find the quantiles
of its exact law. The normal approximation for the law of V_n seems plausible since the M_i are i.i.d.
zero mean and the weights i in (12.1) are deterministic. An appropriate central limit theorem (the
Lyapunov CLT) gives

W_n / v_n^{1/2} →_d N(0, 1).

The corresponding asymptotic α-test of H then rejects if W_n > z_α v_n^{1/2} where z_α is the upper
α-quantile of N(0, 1).
For the exact distribution, define

W_n^+ = Σ_{i : sgn(Z_i) = 1} R̃_i,   W_n^- = Σ_{i : sgn(Z_i) = -1} R̃_i.

The test can also be based on W_n^+ since

W_n^+ + W_n^- = Σ_{i=1}^n i = n(n + 1)/2,
W_n = W_n^+ - W_n^- = 2W_n^+ - n(n + 1)/2.

By Lemma 12.2.1, the law of W_n^+ coincides with that of

V_n^+ = Σ_{i=1}^n B_i · i

where the B_i are i.i.d. Bernoulli B(1, 1/2). This distribution has been tabulated in the past (one
sided critical values, without randomization, for selected values of α and n = 1, ..., 20; cf. e.g. the
table given in Rohatgi and Saleh, Introduction to Prob. and Statistics, 2nd Ed.) and this can
easily be included in statistical software today. Note that the two sided critical values for W_n can
be obtained from the fact that W_n has a symmetric distribution around 0.
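A short sketch (not part of the original notes; it assumes numpy and scipy, and uses arbitrary simulated data) computes W_n, its normal approximation, and W_n^+; the cross-check against scipy.stats.wilcoxon relies on the convention that for a one-sided alternative scipy reports W_n^+ as its statistic, which may depend on the scipy version.

# Illustration only: Wilcoxon signed rank statistic and its normal approximation under H.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
Z = rng.standard_normal(30) + 0.5                  # arbitrary data; true center of symmetry is 0.5
n = len(Z)
ranks_abs = stats.rankdata(np.abs(Z))              # ranks of |Z_i| (average ranks in case of ties)
W = np.sum(np.sign(Z) * ranks_abs)                 # W_n
v_n = n * (n + 1) * (2 * n + 1) / 6                # Var(W_n) under H
print(W / np.sqrt(v_n) > stats.norm.ppf(0.95))     # one-sided asymptotic 5%-test

W_plus = np.sum(ranks_abs[Z > 0])                  # W_n^+ = (W_n + n(n+1)/2)/2
print(W_plus, stats.wilcoxon(Z, alternative='greater'))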
Further justification. Recall that when the Z_i are i.i.d. normal N(μ, σ²) with unknown σ², the
hypothesis of symmetry around 0 reduces to H : μ = 0. For this the t-test is available,
based on T = n^{1/2} Z̄_n/S_n, or, when σ² is known, we could even use a Z-test based on the statistic
Z' = n^{1/2} Z̄_n/σ and its normal distribution. The one sided Z-test was a UMP test against alternatives
K : μ > 0, and the two sided Z- and t-tests can also be shown to have an optimality property
(UMP unbiased tests). When the assumption of normality of Q = L(Z_i) is not justified, for testing
symmetry we could still try to use the t-test: we have

n^{1/2} Z̄_n / S_n →_d N(0, 1)

if the second moment of Q exists. In that case we would obtain an asymptotic α-test for the
hypothesis of symmetry, but this breaks down if the class of admitted distributions is too large, i.e.
the second moment of Q may not exist. For instance, we may have Q(·) = Q_0(· - θ) where Q_0 is a
Cauchy distribution (which is symmetric about 0). Symmetry of Q means θ = 0, and T is not an
appropriate test statistic (it does not provide an asymptotic α-test). In contrast, the Wilcoxon signed
rank statistic W_n is distribution free under the hypothesis of symmetry, according to Lemma 12.2.1,
i.e. its law does not depend on Q as long as Q is symmetric.
Nonparametric alternatives. A random variable Z_1 with distribution function F_1 is stochas-
tically larger than Z_2 with distribution function F_2 if

P(Z_1 > x) ≥ P(Z_2 > x) for all x ∈ R,
and there exists x_0 : P(Z_1 > x_0) > P(Z_2 > x_0).

If, for distribution functions, we define the symbol F_1 ≺ F_2 to mean F_1 ≤ F_2 and F_1(x_0) < F_2(x_0)
for at least one x_0, then the above is equivalent to

F_1 ≺ F_2.

Invariance considerations. Let z be a point in R^n (thought of as representing a realization of
Z = (Z_1, ..., Z_n)) and let τ be a transformation τ(z) = (τ(z_1), ..., τ(z_n)) where τ is a continuous,
strictly monotone increasing and odd real valued function on R (odd means τ(-x) = -τ(x)).
If Q = L(Z) is a symmetric law then the law of τ(Z) is also symmetric:

P(τ(Z) > c) = P(Z > τ^{-1}(c)) = P(Z < -τ^{-1}(c))
            = P(Z < τ^{-1}(-c)) = P(τ(Z) < -c)   for all c > 0.

Note that the statistic

L(Z) = (sgn(Z_i), R̃_i)_{i=1,...,n}

is invariant under any such transformation τ applied to Z. It can be shown that L is maximal
invariant under the group of transformations of R^n given by all such τ, i.e. any other invariant map
is a function of L(Z). Thus the set of all signs and of all ranks R̃_i of the |Z_i| is a maximal invariant.
This provides a justification for use of the Wilcoxon signed rank statistic W_n.
Chapter 13
EXERCISES

13.1 Problem set H1


Exercise H1.1. Show that in model M_1, with parameter p ∈ Θ = [0, 1] and risk function R(T, p)
defined by

R(T, p) = E_p (T(X) - p)²                                                            (*)

there is no estimator T(X) such that

R(T, p) = 0 for all p ∈ Θ.

Exercise H1.2. Let p_i, i = 1, ..., k be a finite subset of (0, 1) (where k ≥ 2), and consider a
statistical model M_1^0 with the same data X and parameter p as M_1, but where p is now restricted
to Θ = {p_1, ..., p_k}. Assume that each parameter value p_i is assigned a prior probability
q_i > 0, where Σ_{i=1}^k q_i = 1. For any estimator T, define the mixed risk

B(T) = Σ_{i=1}^k R(T, p_i) q_i

where R(T, p) is again the quadratic risk (*).

a) Find the form of the Bayes estimator T_B (i.e. the minimizer of B(T)) and show that it is unique.
b) Show that T_B is admissible in the model M_1^0.

13.2 Problem set H2


Exercise H2.1. Let X_1, ..., X_n be independent and identically distributed with Poisson law Po(λ),
where λ ≥ 0 is unknown. (The Poisson law with parameter λ = 0 is defined as the one point
distribution at 0, where X_1 = 0 with probability one). Find the maximum likelihood estimator
(MLE) of λ (proof).

Exercise H2.2. Let X_1, ..., X_n be independent and identically distributed such that X_1 has the
uniform law on the set {1, ..., r} for some integer r ≥ 1 (i.e. P_r(X_1 = k) = 1/r, k = 1, ..., r). In
the statistical model where r ≥ 1 is unknown, find the MLE of r (proof).

Exercise H2.3. Let X_1, ..., X_n be independent and identically distributed such that X_1 has the
geometric law Geom(p), i.e.

P_p(X_1 = k) = (1 - p)^{k-1} p,   k = 1, 2, ...

i) In the statistical model where p ∈ (0, 1] is unknown, find the MLE of p (proof).
ii) Compute (or find in a textbook) the expectation of X_1 under p.

Exercise H2.4. Let X_1, ..., X_n be independent and identically distributed such that X_1 has the
uniform law on the interval [0, θ] for some θ > 0 (i.e. X_1 has the uniform density p_θ(x) =
θ^{-1} 1_{[0,θ]}(x). Here 1_A(x) is the indicator function of a set A: 1_A(x) = 1 if x ∈ A, 1_A(x) = 0
otherwise). In the statistical model where θ > 0 is unknown, find the MLE of θ (proof).
Hint: it can be assumed that not all X_i are 0, since this event has probability 0 under any θ > 0.

Exercise H2.5. Let X_1, ..., X_n be independent and identically distributed such that X_1 has the
exponential law Exp(λ) on [0, ∞) with parameter λ (i.e. X_1 has density p_λ(x) = λ^{-1} exp(-xλ^{-1}) 1_{[0,∞)}(x)).
i) In the statistical model where λ > 0 is unknown, find the MLE of λ (proof).
ii) Recall the expectation of X_1 under λ.

13.3 Problem set H3


Exercise H3.1. Let X_1, ..., X_n be independent and identically distributed with Poisson law Po(λ),
where λ ∈ Θ = (0, ∞) is unknown.
(i) Compute the Fisher information at each λ.
(ii) Assume that condition D2 for the validity of the Cramer-Rao bound is fulfilled (for n = 1 it
is shown in the handout). Find a minimum variance unbiased estimator (UMVUE) of λ for this
model.

Exercise H3.2. Let X be an observed (integer-valued) random variable with binomial law B(n, θ)
where θ ∈ Θ = (0, 1) is unknown.
(i) Compute the Fisher information at each θ.
(ii) Clearly condition D1 for the validity of the Cramer-Rao bound is fulfilled (sample space is
finite, prob. function differentiable). Find a minimum variance unbiased estimator (UMVUE) of θ
for this model.

Exercise H3.3. Let X_1, ..., X_n be independent and identically distributed with geometric law
Geom(θ^{-1}), i.e.

P_θ(X_1 = k) = (1 - θ^{-1})^{k-1} θ^{-1},

where θ ∈ Θ = (1, ∞) is unknown. (Note that in Exercise H2.3 the family Geom(p), p ∈ (0, 1] was
considered. Here we just took θ = p^{-1} as parameter and also excluded the value p = 1.)
(i) Compute the Fisher information at each θ.
(ii) Assume that condition D2 for the validity of the Cramer-Rao bound is fulfilled. Find a
minimum variance unbiased estimator (UMVUE) of θ for this model.
Hint: Here the variance of X_1 is important; compute it or look it up in a book.

13.4 Problem set H4


Exercise H4.1. Let X_1, ..., X_n be independent and identically distributed such that X_1 has the
exponential law Exp(λ) on [0, ∞) with parameter λ (i.e. X_1 has density p_λ(x) = λ^{-1} exp(-xλ^{-1}) 1_{[0,∞)}(x)),
where λ is unknown and varies in the set Θ = (0, ∞).
(i) Compute the Fisher information at each λ.
(ii) Assume that condition D3 for the validity of the Cramer-Rao bound is fulfilled. Find a minimum
variance unbiased estimator (UMVUE) of λ for this model.

Exercise H4.2. Consider the Gaussian scale model: observations are X_1, ..., X_n, independent
and identically distributed such that X_1 has the normal law N(0, σ²) with variance σ², where σ²
is unknown and varies in the set Θ = (0, ∞).
(i) Compute the Fisher information I_F(σ²) for one observation X_1.
(ii) Assume that condition D3 for the validity of the Cramer-Rao bound is fulfilled. Show that for
n observations in the Gaussian scale model, the sample variance

S² = n^{-1} Σ_{i=1}^n X_i²

is a uniformly best unbiased estimator.

Hints: (a) Note that σ² is treated as parameter, not σ; so it may be convenient to write θ for σ²
when taking derivatives.
(b) Note that

Var_{σ²} X_1² = 2σ⁴.

A short proof runs as follows. We have X_1 = σZ for standard normal Z, so it suffices to prove
Var Z² = 2. Now Var Z² = EZ⁴ - (EZ²)², so it suffices to prove EZ⁴ = 3. For the standard normal
density φ we have by partial integration, using φ'(x) = -xφ(x),

∫ x⁴ φ(x) dx = -∫ x³ φ'(x) dx = 3 ∫ x² φ(x) dx = 3.

Exercise H4.3. In the handout, sec. 8.3 a family of Gamma densities was introduced as

f_α(x) = (1/Γ(α)) x^{α-1} exp(-x)

for α > 0. Define more generally, for some λ > 0,

f_{α,λ}(x) = (1/(Γ(α) λ^α)) x^{α-1} exp(-x λ^{-1}).

The corresponding law is called the Γ(α, λ)-distribution.

(i) Let X_1, ..., X_n be independent and identically distributed with Poisson law Po(θ), where
θ ∈ Θ = (0, ∞) is unknown. Assume that the results on Bayesian inference in section 8 carry over
from finite to countable sample space. Show that the family {Γ(α, λ), α > 0, λ > 0} is a conjugate
family of prior distributions.
(ii) Show that for a r.v. U with law Γ(α, λ)

EU = αλ.

(iii) In the above Poisson model, with prior Γ(α, λ), find the posterior expectation E(θ|X) of θ
and discuss its relation to the sample mean X̄_n for large n (α, λ fixed).
Remark: Note that if Proposition 8.1 carries over to the case of countable sample space then
E(θ|X) is a Bayes estimator for quadratic risk.

13.5 Problem set H5


Exercise H5.1. Consider the Gaussian location model with restricted parameter space Θ =
[-K, K], where K > 0, sample size n = 1 and σ² = 1. A linear estimator is of the form T(X) = aX + b
where a, b are nonrandom (fixed) real numbers.
(i) Find the minimax linear estimator T_LM (note that all a, b are allowed).
(ii) Show that T_LM is strictly better than the sample mean X̄_n = X, everywhere on Θ = [-K, K]
(this implies that X is not admissible).
(iii) Show that T_LM is Bayesian in the unrestricted model Θ = R for a certain prior distribution
N(0, τ²), and find the τ².

Exercise H5.2. Consider the binary channel for information transmission, as a statistical model:
Θ = {0, 1}, P_θ = B(1, p_θ) where p_θ ∈ (0, 1), θ = 0, 1. Assume also symmetry: p_1 = 1 - p_0 and
1/2 < p_1. Note that for estimating θ, the quadratic loss coincides with the 0-1-loss: for t, θ ∈ Θ
we have

(t - θ)² = 0 if t = θ,   1 if t ≠ θ.

Consider estimators with values in Θ (note there exist only four possible estimators here (maps
{0, 1} → {0, 1}): T_1(x) = x, T_2(x) = 1 - x, T_3(x) = 0, T_4(x) = 1).
(i) Find the maximum likelihood estimator of θ.
(ii) For a prior distribution Q on Θ with q_0 = Q({0}), q_1 = Q({1}), and the above 0-1-loss, find
the Bayes estimator with values in Θ. Note: the posterior expectation cannot be used since it will
be between 0 and 1 in general.
(iii) Assume that q_1 tends to 1. Find the value z such that if q_1 > z then the Bayes estimator is T_4
(disregards the data and always takes 1 as estimated value).

Exercise H5.3. Consider the binary Gaussian channel: Θ = {0, 1}, P_θ = N(μ_θ, 1) where μ_0 < μ_1
are some fixed values (a restricted Gaussian location model for sample size n = 1). Note that
estimators with values in Θ are described by indicators of sets A ⊂ R such that T(x) = 1_A(x) (i.e.
A = {x : T(x) = 1}).
(i) as in H5.2.
(ii) as in H5.2.
(iii) Show that the Bayes estimator for a uniform prior (q_0 = q_1 = 1/2) is minimax.

13.6 Problem set H6


Exercise H6.1. Let X_1, ..., X_{n_1} be independent N(μ_1, σ²) and Y_1, ..., Y_{n_2} be independent
N(μ_2, σ²), also independent of X_1, ..., X_{n_1} (n_1, n_2 > 1). For each of the two samples, form the
sample means X̄, Ȳ and the bias corrected sample variances

S_(1)² = (n_1 - 1)^{-1} Σ_{i=1}^{n_1} (X_i - X̄)²,   S_(2)² = (n_2 - 1)^{-1} Σ_{i=1}^{n_2} (Y_i - Ȳ)².

Consider the statistic

Z = (X̄ - Ȳ) / ( σ (1/n_1 + 1/n_2)^{1/2} )

which is standard normal if μ_1 = μ_2. In a model where μ_1, μ_2 are unknown but σ² is known, it
obviously can be used to build a confidence interval for the difference μ_1 - μ_2.
For the case that in addition σ² is unknown, find a statistic which has a t-distribution if μ_1 = μ_2
(this would then be called a studentized statistic), and find the degrees of freedom.

13.7 Problem set H7


Exercise H7.1. Let X_1, ..., X_{n_1} be independent N(μ_1, σ²) and Y_1, ..., Y_{n_2} be independent
N(μ_2, σ²), also independent of X_1, ..., X_{n_1} (n_1, n_2 > 1). In a model where these r.v.s are observed,
and μ_1, μ_2 and σ² are all unknown, find an α-test for the hypothesis

H : μ_1 = μ_2.

Exercise H7.2. Let z_{α/2,n} be the upper α/2-quantile of the t-distribution with n degrees of
freedom and z_{α/2} the respective quantile for the standard normal distribution. Use the tables at
the end of the textbook or a computer program to find

a) z_{α/2,n} for n = 5, n = 20 and z_{α/2} for the value α = 0.05,
b) the same for α = 0.01.

Exercise H7.3. In the introductory subsection 1.3.2 "Confidence statements with the Chebyshev
inequality" (early in the handout, p. 10) we constructed a confidence interval for the parameter p
for n i.i.d. Bernoulli observations X_1, ..., X_n (model M_{d,1}) of the form [X̄_n - ε, X̄_n + ε] where X̄_n is
the sample mean and ε = ε_{n,α} = 2^{-1}(nα)^{-1/2}. Using the Chebyshev inequality, it was shown that
this has coverage probability at least 1 - α.
(i) Use the central limit theorem

n^{1/2}(X̄_n - p) →_d N(0, p(1 - p)) as n → ∞

and the upper bound p(1 - p) ≤ 1/4 to construct an asymptotic 1 - α confidence interval of the form
[X̄_n - ε̃_{n,α}, X̄_n + ε̃_{n,α}] for p, involving a quantile of the standard normal distribution N(0, 1).
(ii) Show that the ratio of the lengths of the two confidence intervals, i.e. ε̃_{n,α}/ε_{n,α}, does not
depend on n, and find its numerical values for α = 0.05 and α = 0.01, using the table of N(0, 1) on
p. 608 of the textbook.
(iii) Use the property of the standard normal distribution function Φ(x)

1 - Φ(x) ≤ x^{-1} exp(-x²/2)

([D] p. 108) to prove that ε̃_{n,α}/ε_{n,α} → 0 for α → 0 (n fixed).
Comment: the interval based on the normal approximation turns out to be shorter, and this effect
becomes more pronounced for smaller α.

Exercise H7.4. Consider the test based on the t-statistic as in (8.1) handout, but where the
quantile z_{α/2} of the standard normal is used in place of the t-quantile. On p. 94 handout it is
argued that this is an asymptotic α-test. Show that in the Gaussian location-scale model M_{c,2}, for
sample size n, this test is consistent as n → ∞ on the pertaining alternative

Θ_1(μ_0) = { (μ, σ²) : μ ≠ μ_0 }.

Comment: The proof of consistency of the two-sided Gauss test, which is illustrated in the figure on
p. 95, is similar but more direct since σ² is known there.

13.8 Problem set E1

Exercise E1.1. (10%) Let X_1, ..., X_n be independent identically distributed with unknown dis-
tribution Q, where it is known only that

Var(X_1) ≤ K

for some known positive K. Then also μ = EX_1 exists. Consider hypotheses

H : μ = μ_0,
K : μ ≠ μ_0.

Find an α-test (exact α-test, i.e. the level is observed for every n, not just asymptotically as n → ∞).
(Hint: Chebyshev inequality, p. 10 or [D], p. 222.)

Exercise E1.2. Let X_1, ..., X_n be independent Poisson Po(λ). Consider some λ_0, λ_1 such that
0 < λ_0 < λ_1.
(i) (5%) Consider simple hypotheses

H : λ = λ_0,
K : λ = λ_1.

Find a most powerful α-test.
Note: the distribution of any proposed test statistic can be expected to be discrete, so that a
randomized test might be most powerful. For the solution, this aspect can be ignored; just indicate
the statistic, its distribution under H and the type of rejection region (such as "reject when T is
too large").
(ii) (10%) Consider composite hypotheses

H : λ = λ_0,
K : λ > λ_0.

Find a uniformly most powerful (UMP) α-test.
Hint: take a solution of (i) which does not depend on λ_1.
(iii) (10%) Consider composite hypotheses

H : λ ≤ λ_0,
K : λ > λ_0.

Find a uniformly most powerful (UMP) α-test.
Hint: take a solution of (ii) and show that it preserves the level on H : λ ≤ λ_0. Properties of the
Poisson distribution are useful.

Exercise E1.3. (20%) Consider the Gaussian location-scale model (Model M_{c,2}), for sample size
n, i.e. observations are i.i.d. X_1, ..., X_n with distribution N(μ, σ²) where μ ∈ R and σ² > 0 are
unknown. For a certain σ_0² > 0, consider hypotheses H : σ² ≤ σ_0² vs. K : σ² > σ_0².
Find an α-test with rejection region of the form (c, ∞) (i.e. a one-sided test) where c is a quantile of
a χ²-distribution. (Note: it is not asked to find the LR test; but the test should have level α. This
includes an argument that the level is observed on all parameters in the hypothesis H : σ² ≤ σ_0².)
Hint: A good estimator of σ² might be a starting point.

Exercise E1.4 (Two sample problem, F-test for variances). Let X_1, ..., X_n be independent
N(μ_1, σ_1²) and Y_1, ..., Y_n be independent N(μ_2, σ_2²), also independent of X_1, ..., X_n (n > 1), where
μ_1, σ_1² and μ_2, σ_2² are all unknown. Define the statistics

F = F(X, Y) = S_X² / S_Y²,                                                           (13.1)

S_X² = n^{-1} Σ_{i=1}^n (X_i - X̄_n)²,   S_Y² = n^{-1} Σ_{i=1}^n (Y_i - Ȳ_n)²

(here (X, Y) symbolizes the total sample).

Define the F-distribution with k_1, k_2 degrees of freedom (denoted F_{k1,k2}) as the distribution
of (k_1^{-1} Z_1)/(k_2^{-1} Z_2) where the Z_i are independent r.v.s having χ²-distributions with k_1 and k_2
degrees of freedom, respectively.
i) (15%) Show that F(X, Y) has an F-distribution if σ_1² = σ_2², and find the degrees of freedom.
ii) (20%) For hypotheses H : σ_1² ≤ σ_2² vs. K : σ_1² > σ_2², find an α-test with rejection region of
the form (c, ∞) (i.e. a one-sided test) where c is a quantile of an F-distribution. (Note: it is not asked
to find the LR test; but the test should have level α. This includes an argument that the level is
observed on all parameters in the hypothesis H : σ_1² ≤ σ_2².)

Exercise E1.5 (F-test for equality of variances). Consider the two sample problem of exercise
E1.4, but hypotheses H : σ_1² = σ_2² vs. K : σ_1² ≠ σ_2².
i) (5%) Find the likelihood ratio test and show that it is equivalent to a test which rejects if the
F-statistic (13.1) is outside a certain interval of the form [c^{-1}, c].
ii) (5%) Show that the c of i) can be chosen as the upper α/2 quantile of the distribution F_{r,r} for
a certain r > 0.

13.9 Problem set H8


Exercise H8.1 (Exercise 8.59 e, p. 399 textbook). A famous medical experiment was conducted
by Joseph Lister in the late 1800s. Mortality associated with surgery was quite high and Lister
conjectured that the use of a disinfectant, carbolic acid, would help. Over a period of several years
Lister performed 75 amputations with and without using carbolic acid. The data are

                          Carbolic acid used?
                             Yes     No
  Patient lived?    Yes      34      19
                    No        6      16

Use these data to test whether the use of carbolic acid is associated with patient mortality.

Exercise H8.2. Let X_1, ..., X_n be independent N(μ_1, 1) and Y_1, ..., Y_n be independent N(μ_2, 1),
also independent of X_1, ..., X_n, and consider hypotheses

H : μ_1 = μ_2 = 0,
K : (μ_1, μ_2) ≠ (0, 0).

Find the likelihood ratio test and show that it is equivalent to a test based on a statistic which
has a certain χ²-distribution under H (thus the critical value can be taken as a quantile of this
χ²-distribution).

Exercise H8.3. Let X_1, ..., X_n be independent Poisson Po(λ_1) and Y_1, ..., Y_n be independent
Po(λ_2), also independent of X_1, ..., X_n. Let θ = (λ_1, λ_2) be the parameter vector, θ_0 a particular
value for this vector (with positive components), and consider hypotheses

H : θ = θ_0,
K : θ ≠ θ_0.

Find an asymptotic α-test. Hint: find a statistic similar to the χ²-statistic in the multinomial case
(Definition 9.1.2, p. 111 handout) and its asymptotic distribution under H.

Exercise H8.4 (Adapted from exercise 8.60, p. 399 textbook). Let X = (X_1, ..., X_k) have a
multinomial law M_k(n, p) with unknown p = (p_1, p_2, ..., p_k), where k > 2. Consider hypotheses
on the first two components

H : p_1 = p_2,
K : p_1 ≠ p_2.

A test that is often used, called McNemar's Test, rejects H if

(X_1 - X_2)² / (X_1 + X_2) > χ²_{1;1-α}                                              (13.2)

where χ²_{1;1-α} is the lower 1 - α quantile of the χ²_1-distribution. (The textbook writes an upper
α-quantile, called χ²_{1,α}, there; it coincides with the lower 1 - α quantile χ²_{1;1-α}.)
(i) Find the maximum likelihood estimator p̂ of the parameter p under the hypothesis.
(ii) Show that the appropriate χ²-statistic with estimated parameter p̂ (maximum likelihood esti-
mator under H as above), as defined in relation (9.26) on p. 127 handout, coincides with McNemar's
statistic (13.2) (exact equality, not approximate with an error term).
Comment: It follows that McNemar's test is the χ²-test for this problem and is an asymptotic
α-test, cf. Theorem 9.5.2, p. 127 handout.

13.10 Problem set H9

Exercise H9.1. Consider the general linear model

Y = Xβ + ε,   Eε = 0,   Cov(ε) = σ² I_n,

where X is an n × k-matrix, rank(X) = k. Show that the LSE β̂ provides a best approximation to
the data Y in an average sense:

E ‖Y - Xβ̂‖² = min E ‖Y - Xβ̃‖²,

the minimum being taken over estimators β̃ = β̃(Y) with values in R^k.

Exercise H9.2. Consider the bivariate linear regression model:

Y_i = α + βx_i + ε_i,   i = 1, ..., n,

where not all x_i are equal, α, β are real valued and the ε_i are uncorrelated with variance σ².
(i) Show that the LSE of α, β are

α̂_n = Ȳ_n - β̂_n x̄_n,   β̂_n = Σ_{i=1}^n (Y_i - Ȳ_n)(x_i - x̄_n) / Σ_{i=1}^n (x_i - x̄_n)²,      (13.3)

where x̄_n is the mean of the nonrandom x_i.
(ii) Find the distribution of β̂_n when the ε_i are independent normal: L(ε_i) = N(0, σ²).
Hint: for (ii), a possibility is to find the projection matrix Π_1 projecting onto the space Lin(1),
where 1 is the n-vector consisting of 1s, and use

(Y_1 - Ȳ_n, ..., Y_n - Ȳ_n)^⊤ = (I_n - Π_1)(Y_1, ..., Y_n)^⊤.

More elementary arguments are also possible.

Exercise H9.3. Consider the general linear model as in H9.1, but with the assumption Cov(ε) = Σ,
where Σ is a known positive definite (symmetric) n × n-matrix. Find the BLUE (best linear unbiased
estimator) of β, as defined in Def. 10.5.1.
Hint: Recall Lemma 6.1.7, p. 70 and the fact that Cov(Aε) = A Σ A^⊤ for any n × n-matrix A.

Exercise H9.4. Suppose that a r.v. Y and the random k-vector X have a joint normal distribution
N_{k+1}(0, Σ), where Σ is a positive definite (k + 1) × (k + 1)-matrix. Write Σ in partitioned form

Σ = [ Σ_XX   Σ_XY ]
    [ Σ_YX   σ_Y² ]

where

Σ_XY = Σ_YX^⊤ = E(XY)

is a k-vector, σ_Y² = EY² and Σ_XX = Cov(X). Show that

E(Y|X) = X^⊤ β,   where β = Σ_XX^{-1} Σ_XY.                                           (13.4)

Comment. Compare this with the form of β in the case k = 1 (i.e. β = σ_X^{-2} σ_XY as in Def.
10.1.3, p. 135 handout), and with the form of the LSE in the linear model (Theorem 10.4.2, p.
149): β̂ = (X^⊤X)^{-1} X^⊤ Y.
Hint. Consider independent X*, ε, where X* has the same law as X and L(ε) = N(0, τ²), for some
τ² > 0 (X* is a random k-vector, ε a random variable) and define

Y* = X*^⊤ β + ε

with β as above. Find a value of the variance τ² such that (X*, Y*) has the same joint distribution
as (X, Y). This solves the problem, since then

E(Y|X) = E(Y*|X*).
13.11 Problem set H10


A PRACTICE EXAM (2 1/2 hours time)

Exercise H10.1. (15%) A company registered 100 cases within a year where some employee was
missing exactly one day at work. These were distributed among the days of the week as follows:

  Day   M    T    W    Th   F
  No.   22   19   16   18   25

Test the hypothesis that these one day absences are uniformly distributed among the days of the
week, at level α = 0.05.
(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining
distribution], resulting decision.)

Exercise H10.2. (15%) The personnel manager of a bank wants to find out whether the chance
to successfully pass a job interview depends on the sex of the applicant. For 35 randomly selected
applicants, 21 of which were male, the results of the interview were evaluated. It turned out that
exactly 16 applicants passed the interview, 5 of which were female. Use a χ²-test in a contingency
table to test whether interview result and sex are independent, at level α = 0.05.
(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining
distribution], resulting decision.)
Comment: most sources would recommend Fisher's exact test here, but this was not treated and
the χ²-test is also applicable.

Exercise H10.3. (20%) Suppose 18 wheat fields have been divided into m = 3 groups, with
l_1 = 5, l_2 = 7 and l_3 = 6 members. There are three kinds of fertilizer, and group j of the fields
is treated with fertilizer j, j = 1, 2, 3. The yield results for all fields are given in the following
table.

  1   781  655  611  789  596
  2   545  786  976  663  790  568  720
  3   696  660  639  467  650  380

Assume that these values are realizations of independent random variables Y_jk with distribution
N(μ_j, σ²), k = 1, ..., l_j, j = 1, 2, 3, where the mean values μ_j correspond to fertilizer j. Test the
hypothesis that the three group means coincide, at level α = 0.05.
(Solution consists of: value of the test statistic, critical value for the test [quantile of the pertaining
distribution], resulting decision.)
Exercise H10.4. (25%) Consider a normal linear model of type NLM2 (cf. section 10.5, p. 152
handout), for dimension k = 1, i.e. observations are

Y = Xβ + ε,

where X is a nonrandom n × 1-matrix (i.e. an n-vector in this case), β is an unknown real valued
parameter and

L(ε) = N_n(0, σ² I_n)

where σ² > 0 is unknown. Assume also that rank(X) = 1 (identifiability condition; i.e. X ≠ 0).
Consider some value β_0 and hypotheses

H : β ≤ β_0,
K : β > β_0.

Let β̂ be the LSE of β and define the statistic

T_{β_0}(Y) = (n - 1)^{1/2} (β̂ - β_0)(X^⊤X)^{1/2} / ( Y^⊤(I_n - Π_X)Y )^{1/2},        (13.5)

where Π_X is the projection matrix onto the linear space Lin(X). Show that T_{β_0}(Y) can be used as
a test statistic to construct an α-test, and indicate the distribution and the quantile used to find
the rejection region.
Comment: Note that this is not a linear hypothesis on β.
Hint: in the case that all elements of X are 1 we obtain the Gaussian location-scale model (mean
value β), for which the present hypothesis testing problem was discussed extensively.

Exercise H10.5. (25%) Suppose that the random vector Z = (X, Y) has a bivariate normal
distribution with EX = EY = 0 and covariance matrix given by

Var(X) = σ_X²,   Var(Y) = σ_Y²,   Cov(X, Y) = ρ σ_X σ_Y,

where ρ = EXY / (σ_X σ_Y) is the correlation coefficient between X and Y. Suppose that Z_i = (X_i, Y_i),
i = 1, ..., n are i.i.d. observed random 2-vectors each having the distribution of Z. Define the
empirical covariance matrix by

S_X² = n^{-1} Σ_{i=1}^n X_i²,   S_Y² = n^{-1} Σ_{i=1}^n Y_i²,   S_XY = n^{-1} Σ_{i=1}^n X_i Y_i

(note that we do not use centered data X_i - X̄_n etc. here for empirical variances / covariances
since EX = EY = 0 is known) and the empirical correlation coefficient

ρ̂ = S_XY / (S_X S_Y)

(where S_X = (S_X²)^{1/2} etc.). Consider hypotheses

H : ρ = 0,
K : ρ ≠ 0.

Define the statistic

T_0(Z) = (n - 1)^{1/2} ρ̂ / (1 - ρ̂²)^{1/2}                                            (13.6)

(where Z represents all the data Z_i, i = 1, ..., n). Show that T_0(Z) can be used as a test statistic
to construct an α-test, and indicate the distribution and the quantile used to find the rejection
region.
Hint: Note that this is closely related to the previous exercise H10.4. In H10.4, set β_0 = 0,
consider the two sided problem H : β = 0 vs. K : β ≠ 0 (then H is a linear hypothesis) and look
for similarities of the statistics (13.5) and (13.6). The distribution of (13.5) was found in H10.4,
but now the X_i are random. How does this affect the distribution of the test statistic?
Further comment: when it is not assumed that EX = EY = 0, the definition of ρ̂ has to be
modified in an obvious way, by using centered data X_i - X̄_n, Y_i - Ȳ_n, and (n - 2)^{1/2} appears in
place of (n - 1)^{1/2}. In this form the test is found in the literature.

13.12 Problem set E2


ANOTHER PRACTICE EXAM (2 1/2 hours time)

Exercise E2.1. (25%) A course in economics was taught to two groups of students, one in a
classroom situation and the other on TV. There were 24 students in each group. These students
were first paired according to cumulative grade-point averages and background in economics, and
then assigned to the courses by a flip of a coin (this was repeated 24 times). At the end of the course
each class was given the same final examination. Use the Wilcoxon signed rank test (level α = 0.05,
normal approximation to the test statistic) to test that the two methods of teaching are equally
effective, against a two-sided alternative. The differences in final scores for each pair of students,
the TV student's score having been subtracted from the corresponding classroom student's score,
were as follows:

  14   4   6   2   1  18
   6  12   8   4  13   7
   2   6  21   7   2  11
   3  14   2  17   4   5

Hint: treatment of ties. Let Z_1, ..., Z_n be the data. If some |Z_i| have the same absolute values
(i.e. ties occur) then they are assigned values R̃_i which are the averages of the ranks. Example:
4 values |Z_{i_1}|, ..., |Z_{i_4}| have the same absolute value c and only two of the other |Z_i| are smaller.
The |Z_{i_1}|, ..., |Z_{i_4}| would then occupy ranks 3, 4, 5, 6. Since they are tied, they are all assigned the
average rank (3 + 4 + 5 + 6)/4 = 4.5. The next rank assigned is then 7 (or higher if there is another
tie).
Remark: For the two-sided version of the Wilcoxon signed rank test, as described in section 12.2
handout, the last sentence on p. 175 should read as "The corresponding asymptotic α-test of H
then rejects if |W_n| > z_{α/2} v_n^{1/2} where z_{α/2} is the upper α/2-quantile of N(0, 1)".
As a starting point, to limit the bookkeeping effort in the solution, the following table gives the
ordered absolute values of the data |Z_i|, starring those that were originally negative. Below each
entry (every other row of the table) is the rank R̃_i of |Z_i| (in the notation of the handout), where
ties are treated as indicated:

  1*    2     2     2     2     3
  1     3.5   3.5   3.5   3.5   6
  4     4     4     5     6     6
  8     8     8     10    12    12
  6     7     7     8     11    12
  12    14.5  14.5  16    17    18
  13    14    14    17    18    21
  19    20.5  20.5  22    23    24
Solution.
Exercise E2.2. Consider the one-way layout ANOVA treated in handout sec. 11.2, relation
(11.7):

Y_jk = μ_j + ε_jk,   k = 1, ..., l,  j = 1, ..., m,                                   (13.7)

where the ε_jk are i.i.d. normal N(0, σ²) noise variables and the μ_j are unknown parameters, and there
is an equal number of observations l > 1 for each factor j. The total number of observations is n = ml.
In this case the F test given by the statistic (11.13) (handout) can be considered an "average t test".
(i) (25%) Let i and j be two different factor indices from {1, ..., m}. Show that a t test of

H : μ_i = μ_j,
K : μ_i ≠ μ_j

can be based on the statistic

T_ij(Y) = (Ȳ_i - Ȳ_j) / (2 d_w / l)^{1/2}

where

d_w = (n - m)^{-1} Σ_{j=1}^m Σ_{k=1}^l (Y_jk - Ȳ_j)²

is the mean sum of squares within groups. (More precisely, show that T_ij has a t-distribution under
H and find the degrees of freedom.)
(ii) (25%) Show that

(m(m - 1))^{-1} Σ_{j=1}^m Σ_{i=1}^m T_ij²(Y) = F(Y)

where F(Y) is the F statistic (11.13) (handout). This relation shows that F(Y) is an average of
the (nonzero) T_ij²(Y).

Exercise E2.3. (25%) Consider again the model (13.7) of exercise E2.2, with the same assump-
tions, but in an asymptotic framework where the number l of observations in each group tends to
infinity (m stays fixed). Consider the F-statistic

F(Y) = [ (m - 1)^{-1} Σ_{j=1}^m l (Ȳ_j - Ȳ)² ] / [ (n - m)^{-1} Σ_{j=1}^m Σ_{k=1}^l (Y_jk - Ȳ_j)² ].

Find the limiting distribution, as l → ∞, of the statistic (m - 1)F(Y) under the hypothesis of
equality of means H : μ_1 = ... = μ_m. (It follows that this distribution can be used to obtain an
asymptotic α-test of H.)
Hint: Limiting distributions of other test statistics in the handout have been obtained e.g. in
Theorem 7.3.8 and Theorem 9.3.3.
Chapter 14
APPENDIX: TOOLS FROM PROBABILITY, REAL ANALYSIS AND LINEAR
ALGEBRA

14.1 The Cauchy-Schwarz inequality


Proposition 14.1.1 (Cauchy-Schwarz inequality). Suppose that for the random variables Y_i,
i = 1, 2, the second moments EY_i² exist. Then the expectation of Y_1 Y_2 exists, and

|EY_1Y_2| ≤ (EY_1²)^{1/2} (EY_2²)^{1/2}.

Proof. For any λ > 0

0 ≤ (λ^{1/2} Y_1 - λ^{-1/2} Y_2)² = λY_1² - 2Y_1Y_2 + λ^{-1}Y_2²,

thus

Y_1Y_2 ≤ (1/2)(λY_1² + λ^{-1}Y_2²);

applying the same bound to -Y_1 in place of Y_1 shows that also |Y_1Y_2| ≤ (1/2)(λY_1² + λ^{-1}Y_2²).
This proves that EY_1Y_2 exists and

|EY_1Y_2| ≤ (1/2)(λEY_1² + λ^{-1}EY_2²).

If both EY_2², EY_1² > 0 then for λ = (EY_2²)^{1/2} / (EY_1²)^{1/2} we obtain the assertion. If one of them
is 0 (EY_2² = 0, say), then by taking λ > 0 arbitrarily small, we obtain |EY_1Y_2| = 0.

14.2 The Lebesgue Dominated Convergence Theorem


This is a result from real analysis for measure theoretic integrals, which contain both sums and
integrals as special cases. We formulate here a special case relating to expectations of random
variables. For a statement and proof in full generality see Durrett [D], Appendix, (5.6), p. 468.

Theorem 14.2.1 (Lebesgue) Let X be a random variable taking values in a sample space 𝒳 and
let r_n(x), n = 1, 2, ... be a sequence of functions on 𝒳 such that r_n(x) → r_0(x) for all x ∈ 𝒳.
Assume furthermore that there exists a function r(x) ≥ 0 such that |r_n(x)| ≤ r(x) for all x ∈ 𝒳
and all n (domination property), and E r(X) < ∞. Then

E r_n(X) → E r_0(X),   as n → ∞.
