
Introduction to Monte Carlo Procedures: the Non-parametric and Parametric Bootstrap

1. Review of the Non-parametric Bootstrap


Given a data set, say {x1 , x2 , . . . , xn } and a statistic of interest, say θ, the
basic algorithm for the non-parametric bootstrap consists of the following:
1. Resample the data with equal probability and with replacement. That
is, each draw is made from the entire set of n data points, so that each
observation has probability 1/n of being sampled at every draw. For
example, for an original sample of size 5, one bootstrap sample might
be x∗ = {x4 , x1 , x3 , x2 , x2 }.
2. Calculate the statistic of interest, θ∗ = g(x∗ ), call the bth estimate θb∗ ,
and store the value in a vector.
3. Repeat (1) and (2) a large number of times.
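As an illustration, here is a minimal MATLAB sketch of steps (1)–(3), assuming the data are stored in a vector x and the statistic of interest is the median (the variable names are only illustrative):

n = length(x);
B = 1000;                          % number of bootstrap iterations
thetaboot = zeros(B,1);            % storage for the bootstrap statistics
for b = 1:B
    pick  = unidrnd(n, n, 1);      % n indices drawn with replacement from 1..n
    xstar = x(pick);               % the bootstrap sample
    thetaboot(b) = median(xstar);  % the bootstrap statistic
end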
The resulting vector of bootstrap statistics then provides an estimate of
the distribution of the statistic, by way of
1. Bootstrap estimate of the expected value:
   \hat{\theta} = \frac{1}{B} \sum_{b=1}^{B} \theta_b^{*}    (1.1)


2. Bootstrap quantiles: Let θ_[q] represent the qth quantile of the bootstrap
statistic. That is, take the vector of statistics produced by the boot-
strap procedure and rank them from smallest to largest. The ranked
values then correspond to the bootstrap estimates of the quantiles of
the distribution. For example, if the number of bootstrap iterations was
1000, then the 25th element of the ranked vector of bootstrap statistics
is the bootstrap estimate of the 0.025 quantile of the distribution of
the statistic.
3. The bootstrap estimate of the standard error
   \widehat{se}(\theta) = \sqrt{ \frac{1}{B-1} \sum_{b=1}^{B} \left( \theta_b^{*} - \bar{\theta} \right)^{2} }    (1.2)

   where \bar{\theta} is the bootstrap mean from (1.1).

4. A 100(1 − α)% confidence interval is \left( \theta^{*}_{[B \alpha/2]}, \; \theta^{*}_{[B(1-\alpha/2)]} \right).
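As a rough illustration of items 1–4, here is a minimal MATLAB sketch, assuming thetaboot is the B-by-1 vector of bootstrap statistics produced by the resampling sketch above:

B = length(thetaboot);
thetahat  = mean(thetaboot);       % bootstrap estimate of the expected value, (1.1)
sehat     = std(thetaboot);        % bootstrap standard error, (1.2) (std uses 1/(B-1))
sorttheta = sort(thetaboot);       % rank the bootstrap statistics
alpha = 0.05;
CI = [sorttheta(round(B*alpha/2)), sorttheta(round(B*(1-alpha/2)))];   % 95% percentile interval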

Hypothesis testing and parameter estimation can then be carried out
using the bootstrap estimates. The non-parametric bootstrap will be the
best approach to inference when everything we know about the distribution
comes from the sample. In the case where we know something about the
distribution before we look at the sample, parametric approaches will give
us better results.

1.1. Inference with the Non-parametric Bootstrap


Inference with the bootstrap is a direct extension of traditional inference.

Example 1 The table below shows the results of a small experiment in which
7 mice were randomly chosen from 16 to receive a new medical treatment,
while the remaining 9 were assigned to the non-treatment group. Investiga-
tors wanted to test whether the treatment prolonged life after surgery. The
table shows the survival times in days.

Group         Data                                     n    mean    SE
Treatment     94, 197, 16, 38, 99, 141, 23             7    86.86   25.24
Control       52, 104, 146, 10, 51, 30, 40, 27, 46     9    56.22   14.14
Difference                                                  30.63   28.93

Say we wish to test for treatment differences, and know that the median
is a better measure of the center of the distribution than the mean.

How do we make inferences using the bootstrap?

[Figure 1 here: histogram of the 1000 bootstrapped differences of treatment medians; x-axis: median difference, y-axis: frequency.]

Figure 1: Bootstrapped differences in the median lifetimes, in days, of mice
receiving two different post-surgery treatments.

The bootstrap 95% confidence interval was (−29, 101). What is our con-
clusion?

2. The Parametric Bootstrap


Sometimes we know the distribution of the sample, but we cannot derive the
distribution of the statistic of interest. Sometimes we can use asymptotic
approximations, but if our sample is small these may be grossly inaccurate.
Furthermore, there are cases where the non-parametric bootstrap will fail.

Can you think of one?

In the case where we know the distribution of the sample, but not of
the sample statistic, the parametric bootstrap often provides a powerful
approach.
The basic algorithm for the parametric bootstrap is as follows:

1. Simulate a random sample of the same size as your original sample of
interest, using sample estimates for the parameters.

2. Calculate the statistic of interest, θ∗, from the simulated sample. Save
the value in a vector.

3. Repeat (1) and (2) a large number of times.

The resulting vector then provides an estimate of the distribution of the
statistic, just as for the non-parametric case.

Example 2 Recall that the distribution of p̂ is approximately normal with
mean p and variance p(1 − p)/n. Suppose we wish to conduct inference on a
population proportion using the exact distribution of the underlying sample
from which we calculate p̂. We know that the underlying distribution of each
of our sample observations is Bernoulli with unknown parameter p. How
would we conduct the parametric bootstrap?
1. First calculate the sample estimate of p, which is \hat{p} = \frac{\sum_i X_i}{n}.

2. Then simulate a random sample of n Bernoulli(p̂) random variables
and calculate p̂ from the simulated sample.

3. Repeat (2) a large number of times.

4. Plot a histogram, compute quantiles and confidence intervals, etc.
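A minimal MATLAB sketch of these steps, assuming the 0/1 observations are stored in a vector x (the variable names are illustrative):

n    = length(x);
phat = sum(x)/n;                       % step 1: sample estimate of p

B        = 1000;
phatboot = zeros(B,1);
for b = 1:B
    xstar       = (rand(n,1) < phat);  % step 2: simulate n Bernoulli(phat) draws
    phatboot(b) = mean(xstar);         % bootstrap estimate of p
end                                    % step 3: repeated B times

hist(phatboot);                        % step 4: histogram, quantiles, etc.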

Below we consider one example where the non-parametric bootstrap fails
and the parametric bootstrap proves to be quite useful.

2.1. Distribution of the Sample Maximum


Let X1 , X2 , . . . , Xn be independent and identically distributed random vari-
ables whose probability density function (pdf) is given by f and whose
cumulative distribution function (cdf) is given by F .

That is, \Pr\{X_i \le x\} = F(x) = \int_{-\infty}^{x} f(t)\,dt for all x. Let Y_{[n]} = \max\{X_1, \ldots, X_n\};
in words, Y_{[n]} is the largest value in the sample, or the sample maximum.

What are the pdf and cdf of Y ?

G_n(y) = \Pr\{ Y_{[n]} \le y \}
       = \Pr\{ X_1 \le y, X_2 \le y, \ldots, X_n \le y \}
       = \Pr\{ X_1 \le y \} \Pr\{ X_2 \le y \} \cdots \Pr\{ X_n \le y \}    (by independence)
       = [F(y)]^{n}    (2.1)

The pdf, g_n, can be found by differentiating:

   g_n(y) = n [F(y)]^{n-1} f(y)    (2.2)
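For instance, in one of the few simple closed-form cases, if the X_i are Uniform(0, 1) then F(y) = y on [0, 1], so

   G_n(y) = y^{n}  and  g_n(y) = n y^{n-1},   0 ≤ y ≤ 1.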

However, for many random variables these distribution functions are
frightfully complicated. The normal distribution, for example, has no closed-form
expression for the distribution of the sample maximum. We want a better
way to use the information in the sample for our inference.

Why will the non-parametric bootstrap not work for the sample max?

Example 3 The following data are a random sample of largemouth bass
from a reservoir on the Savannah River Site, a former nuclear processing fa-
cility. The reservoir was used as a cooling pond for nuclear effluent through
the 1980s, receiving high levels of radioactive materials that now reside in
the sediments of the pond. It is of interest to know the probability that, if
163 bass are taken from the reservoir each year, the maximum tissue
concentration of radiocesium will exceed 30 picocuries per gram.

n      min     max      mean     SD
163    4.33    34.06    13.17    4.58

[Figure 2 here: histogram of radiocesium tissue concentration in bass from PAR Pond; x-axis: picocuries per gram, y-axis: frequency.]

Figure 2: An approximately normal data set of ¹³⁷Cs body burdens.

How do we conduct inference using the parametric bootstrap?

A parametric bootstrap was performed using the normal distribution for
the underlying distribution of the data. A histogram of the bootstrapped
maximums is shown below.

[Figure 3 here: histogram of the bootstrapped maximum radiocesium tissue concentrations in bass from PAR Pond; x-axis: picocuries per gram, y-axis: frequency.]

Figure 3: Bootstrapped Maximum Body Burdens

There were 17 observations among the bootstrapped maximums that were
above 30 picocuries per gram.

What is the bootstrap estimate of the probability that the maximum body
burden in a sample of size 163 will exceed 30 picocuries per gram?

2.2. Code for Non-parametric Bootstrap Two Sample
Inference
% Original data: post-surgery survival times (days)
treatment = [94,197,16,38,99,141,23];
control   = [52,104,146,10,51,30,40,27,46];

B = 1000;                                % number of bootstrap iterations
mediantreat   = zeros(B,1);
mediancontrol = zeros(B,1);
medianDiff    = zeros(B,1);

boottreat   = zeros(length(treatment),1);
bootcontrol = zeros(length(control),1);

for b = 1:B
    % Resample each group with replacement
    for j = 1:length(treatment)
        pick = unidrnd(length(treatment));   % random index from 1..n
        boottreat(j) = treatment(pick);
    end
    for k = 1:length(control)
        pick = unidrnd(length(control));
        bootcontrol(k) = control(pick);
    end
    % Bootstrap statistic: difference of the group medians
    mediantreat(b)   = median(boottreat);
    mediancontrol(b) = median(bootcontrol);
    medianDiff(b)    = mediantreat(b) - mediancontrol(b);
end

hist(medianDiff);
title('1000 bootstrapped Differences of Treatment Medians')
xlabel('median difference')
ylabel('frequency')

% 95% percentile confidence interval for the median difference
sortmedian = sort(medianDiff);
BSCI = [sortmedian(25),sortmedian(975)];

2.3. Code for Parametric Bootstrap of the Sample Maximum

% bass is the vector of the 163 radiocesium measurements (picocuries per gram)
hist(bass);
title('Radiocesium Tissue Concentrations in Bass from PAR Pond');
xlabel('picocuries per gram');
ylabel('frequency');

% Parameter estimates for the assumed normal distribution
mu    = mean(bass);
sigma = sqrt(var(bass));
B = 1000;
maxbass = zeros(B,1);

for b = 1:B
    % Simulate a normal sample of the same size as the data
    basspboot  = randn(length(bass),1)*sigma + mu;
    maxbass(b) = max(basspboot);     % bootstrap sample maximum
end

hist(maxbass);
title('Bootstrapped Maximum Radiocesium Tissue Concentrations in Bass from PAR Pond');
xlabel('picocuries per gram');
ylabel('frequency');

% Bootstrap estimate of P(maximum >= 30)
Count30 = zeros(B,1);
for j = 1:B
    if maxbass(j) >= 30, Count30(j) = 1;
    end
end
p30 = sum(Count30)/B;
