
American Journal of Mathematical and Management Sciences

ISSN: 0196-6324 (Print) 2325-8454 (Online) Journal homepage: http://www.tandfonline.com/loi/umms20

Properties of Bootstrap Samples

Michael R. Chernick & V. K. Murthy

To cite this article: Michael R. Chernick & V. K. Murthy (1985) Properties of Bootstrap Samples, American Journal of Mathematical and Management Sciences, 5:1-2, 161-170, DOI: 10.1080/01966324.1985.10737161

To link to this article: http://dx.doi.org/10.1080/01966324.1985.10737161

Published online: 14 Aug 2013.

AMERICAN JOURNAL OF MATHEMATICAL AND MANAGEMENT SCIENCES
Copyright © 1985 by American Sciences Press, Inc.

PROPERTIES OF BOOTSTRAP SAMPLES

Michael R. Chernick
California State University
Management Science Department
Fullerton, CA 92634

V. K. Murthy
Hughes Aircraft Company
P.O. Box 902
El Segundo, CA 90245

SYNOPTIC ABSTRACT

Efron's version of the "bootstrap" procedure was devised as a method for obtaining nonparametric estimates of standard deviations and biases of estimators. Important applications include error rates of classifiers, non-linear regression, econometric modeling, discriminant analysis and principal components. Applications of the bootstrap procedure usually require the generation of "bootstrap samples." This paper presents a connection between bootstrap samples and the classical occupancy problem. Small sample properties of a bootstrap sample are obtained based on the application of results on the occupancy problem. The implications of these results for the estimate of misclassification probability in discriminant analysis are addressed.

Key Words and Phrases: bootstrap, cross-validation, classical occupancy problem, discriminant analysis, misclassification probability, urn models.

1985, VOL. 5, NOS. 1 & 2, 161-170


0196-6324/85/010161-10 $5.00

1. INTRODUCTION

Given a sample X1, X2, ..., Xn of random vectors and a real-valued estimator θ(X1, X2, ..., Xn), Efron (1982)

introduces a method called the "bootstrap" to assess the distribution of θ. The bootstrap procedure can be used to estimate quantities such as the bias, the standard deviation and the quantiles of the distribution of θ. Connections between the bootstrap, cross-validation, the jackknife and the delta method are also given in Efron (1982). The term "bootstrap" appeared earlier in the control literature to describe a procedure for Kalman filter estimation with unknown noise covariance. Weiss (1970) gives a survey of procedures including this bootstrap approach.
Practical application of the method often requires generation of bootstrap samples: samples of size n drawn at random with replacement from the empirical distribution function Fn. Fn is the distribution which assigns probability 1/n to each Xi for i = 1, 2, ..., n. For each bootstrap sample, an estimator θ* is computed. Given k bootstrap samples, the empirical distribution of the k θ*'s can be used to estimate the standard deviation, the mean and quantiles for θ.

Sometimes bootstrap sampling is not necessary, as the bootstrap estimates can be derived analytically in simple cases. The variance of a population is one such example, as Efron has pointed out in some lectures.

Bootstrap samples only give a Monte Carlo approximation to the bootstrap estimate. The bootstrap estimate is the limit of this Monte Carlo estimate as the number of bootstrap samples tends to infinity. If we define Monte Carlo methods to include sampling from estimated distributions, the bootstrap may be viewed as a form of Monte Carlo.
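The Monte Carlo approximation just described is easy to sketch in code. In the following Python fragment the data values and the choice of the sample mean as the estimator θ are illustrative, not from the paper; k bootstrap samples are drawn and the spread of the resulting θ* values approximates the bootstrap standard-deviation estimate.

```python
import random
import statistics

def bootstrap_sd(data, estimator, k=1000, seed=0):
    """Monte Carlo approximation to the bootstrap standard-deviation
    estimate: draw k samples of size n with replacement from the
    empirical distribution Fn and take the sd of the k estimates."""
    rng = random.Random(seed)
    n = len(data)
    thetas = [estimator(rng.choices(data, k=n)) for _ in range(k)]
    return statistics.stdev(thetas)

# Illustrative data; theta is the sample mean.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
print(bootstrap_sd(data, statistics.mean))
```

As the number of bootstrap samples k grows, this Monte Carlo value converges to the bootstrap estimate itself.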

Although large sample properties of the bootstrap have been studied, little is known about its small sample behavior. In the case of the estimation of misclassification error, Efron (1983) gives small sample comparisons of variants of the bootstrap with cross-validation and the so-called ".632 estimator." Our work was primarily motivated by this application.

2. CONNECTION WITH THE OCCUPANCY PROBLEM

A bootstrap sample typically contains only a subset of the data. Observations may occur once, twice or more, or not at all. Represent each observation Xi as an urn and each bootstrap observation X* as a ball. A bootstrap sample can be viewed as the placement of n balls into n urns where each ball has probability 1/n of being placed in the ith urn for i = 1, 2, 3, ..., n. The number of balls in the ith urn is equivalent to the number of replications of Xi in a bootstrap sample. This shows that bootstrap sampling is a special case of the sampling considered in the classical occupancy problem. The number of observations not appearing in a bootstrap sample is equivalent to the number of empty urns.
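The urn representation can be made concrete with a short simulation; the function below is an illustrative sketch, not code from the paper. It draws one bootstrap sample of size n (n balls into n urns, each urn chosen with probability 1/n) and reports how many balls land in each urn, i.e. how many times each observation is replicated.

```python
import random
from collections import Counter

def occupancy_counts(n, seed=0):
    """One bootstrap sample viewed as an occupancy experiment:
    return, for each of the n urns, the number of balls it received."""
    rng = random.Random(seed)
    balls = [rng.randrange(n) for _ in range(n)]  # urn index of each ball
    counts = Counter(balls)
    return [counts.get(i, 0) for i in range(n)]

counts = occupancy_counts(10, seed=1)
empty = sum(c == 0 for c in counts)  # observations missing from the sample
print(counts, empty)
```

The number of zeros in the returned list is exactly the number of observations that do not appear in that bootstrap sample.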

3. RESULTS

Feller (1968), page 102, shows that the probability that exactly m urns are empty when r balls are placed at random in n urns is given by

C(n, m) Σ_{v=0}^{n-m} (-1)^v C(n-m, v) (1 - (m+v)/n)^r,   for m = 0, 1, ..., n,   (1)

where C(a, b) denotes the binomial coefficient.

Set r = n in (1) to obtain the probability that exactly m observations do not appear in a bootstrap sample. The mean and variance of the random variable X (the number of empty urns) are given by

E(X) = Σ_{j=1}^{n} (1 - 1/n)^r   (2)

and

Var(X) = Σ_{j=1}^{n} (1 - 1/n)^r [1 - (1 - 1/n)^r] + 2 Σ_{j<j'≤n} [(1 - 2/n)^r - (1 - 1/n)^{2r}].   (3)

Continuing the analogy, X is the number of observations that do not appear in the bootstrap sample. Equations (2) and (3) are special cases of Equations (3.10) and (3.11) of Johnson and Kotz (1977). Note that for these cases the summands do not depend on the indices of summation. Since r = n for the bootstrap, the expected number of observations not included in the bootstrap sample is

E(X) = n [(n-1)/n]^n   (4)

and

Var(X) = n [(n-1)/n]^n [1 - ((n-1)/n)^n] + n(n-1) [((n-2)/n)^n - ((n-1)/n)^{2n}].   (5)
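Equations (4) and (5) are easy to check numerically. The sketch below (illustrative code, with function names of our own choosing) compares the closed-form mean and variance of X with a direct simulation of bootstrap sampling.

```python
import random

def moments_empty(n):
    """E(X) and Var(X) from Equations (4) and (5), where X is the
    number of observations absent from a bootstrap sample of size n."""
    a = ((n - 1) / n) ** n
    mean = n * a
    var = n * a * (1 - a) + n * (n - 1) * (((n - 2) / n) ** n - a * a)
    return mean, var

def simulate_empty(n, reps=20000, seed=0):
    """Sample mean and variance of X over many bootstrap samples."""
    rng = random.Random(seed)
    xs = []
    for _ in range(reps):
        hit = set(rng.randrange(n) for _ in range(n))  # urns that got a ball
        xs.append(n - len(hit))                        # empty urns
    m = sum(xs) / reps
    v = sum((x - m) ** 2 for x in xs) / (reps - 1)
    return m, v

print(moments_empty(10))   # exact: roughly (3.487, 0.993)
print(simulate_empty(10))  # should agree closely
```

For n = 2 the formulas give E(X) = 0.5 and Var(X) = 0.25, matching the fact that X is 0 or 1 with probability 1/2 each.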

Let P_n = X/n. The expectation of P_n is E(P_n) = ((n-1)/n)^n, so lim_{n→∞} E(P_n) = e^{-1}. The variance of P_n is Var(X)/n^2. It is shown (Theorem 6.4, p. 331 of Johnson and Kotz (1977)) that (P_n - E(P_n))/√Var(P_n) has asymptotically a standard normal distribution. Consequently, for large n a typical bootstrap sample includes approximately the proportion 1 - e^{-1}, i.e. 63%, of the observations. In small samples, the percentage is higher on the average. For n = 2 it is 75%, and for n = 3 approximately 70%.
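These small-sample percentages follow directly from E(P_n) = ((n-1)/n)^n; the short computation below (an illustrative sketch) reproduces them.

```python
def included_fraction(n):
    """Expected fraction of distinct observations appearing in a
    bootstrap sample of size n: 1 - E(P_n) = 1 - ((n-1)/n)^n."""
    return 1 - ((n - 1) / n) ** n

for n in (2, 3, 10, 50):
    print(n, round(included_fraction(n), 4))
# Values: 0.75 for n=2, ~0.7037 for n=3, decreasing toward 1 - 1/e ~ 0.6321.
```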

Figure 1 shows the mean and standard deviations for P_n where n varies from 2 to 50. Note that the variance is much larger in small samples. For n = 2, the bootstrap sample either contains 50% or 100% of the observations, with probability 1/2 for each possibility.


Also of interest is the number of observations occurring at least v times. Explicit forms for these probability distributions are obtained as a special case of the sparse and crowded cell distributions of Sobel and Uppuluri (1974). Sobel and Uppuluri express these probabilities in terms of incomplete beta functions. In the case of crowded cells

E(C) = b I_p(v, n-v+1)   (6)

where b = n, C is the number of crowded cells (i.e. urns containing at least v balls), p = 1/n and I_p(v, n-v+1) is the probability given by

I_p(v, n-v+1) = [n!/((v-1)!(n-v)!)] ∫_0^p (1-x)^{n-v} x^{v-1} dx = Σ_{i=v}^{n} C(n, i) p^i (1-p)^{n-i}.   (7)

Since C is the number of urns that contain at least v balls, C also represents the number of observations that occur at least v times in a bootstrap sample.

Equation (6) is taken from Sobel and Uppuluri (1974), who also derive factorial moments and the variance of C in terms of the incomplete beta functions. Figure 2 gives the expected frequency of occurrence of k replications of an observation in a bootstrap sample of size n, where n ranges from 2 to 30 (with infinity included at the end) and k = 0, 1, ..., min(4, n).

These results were obtained from (7) by subtracting I_p(k+1, n-k) from I_p(k, n-k+1). The limiting expected frequency of repetitions is Poisson with rate parameter λ = 1.

Since the selection probability is 1/n for each urn, the bootstrap sample has a special multinomial distribution and the expected fraction of observations occurring exactly k times in a bootstrap sample equals the probability that an observation occurs exactly k times. That binomial probability, as noted by Efron (1983), is

Probability that an urn contains exactly k balls = C(n, k) (n-1)^{n-k} / n^n = P(k).   (8)

Note that Efron's result comes directly from consideration of the multinomial distribution and does not involve the occupancy relationship. Figure 2 and the Poisson limiting results are more easily obtained directly from (8). Note however that

E(C) = n I_{1/n}(v, n-v+1) = n Σ_{k=v}^{n} C(n, k) (1/n)^k (1 - 1/n)^{n-k} = n Σ_{k=v}^{n} P(k),   (9)

and

P(k) = I_{1/n}(k, n-k+1) - I_{1/n}(k+1, n-k).
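These relations can be verified numerically. The sketch below is illustrative: it implements I_p(v, n-v+1) through its binomial-tail form from (7), checks that P(k) equals the difference of incomplete beta functions, and compares P(1) with the Poisson(1) limit e^{-1}.

```python
from math import comb, exp

def P(k, n):
    """Equation (8): probability an observation occurs exactly k
    times in a bootstrap sample; Binomial(n, 1/n)."""
    return comb(n, k) * (n - 1) ** (n - k) / n ** n

def I(p, v, n):
    """Incomplete beta I_p(v, n-v+1) via its binomial-tail form:
    sum_{i=v}^{n} C(n, i) p^i (1-p)^(n-i)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(v, n + 1))

n = 20
# P(k) = I_{1/n}(k, n-k+1) - I_{1/n}(k+1, n-k), as stated in the text.
for k in range(1, 5):
    assert abs(P(k, n) - (I(1 / n, k, n) - I(1 / n, k + 1, n))) < 1e-12
# P(1) approaches the Poisson(1) value e^{-1} as n grows.
print(P(1, n), exp(-1))
```

The check succeeds exactly because the difference of the two binomial tails leaves only the i = k term, which is P(k).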

4. APPLICATION TO CLASSIFICATION PROBLEMS


In this section the classification of multivariate
observations into one of two populations (denoted population 1
and population 2) is considered. Efron (1983) calls the
average number of observations misclassified by a classifier
the apparent error rate. The true error rate is the
probability of incorrectly classifying a future observation.

Usually the apparent error rate underestimates the true error rate since these observations are used to construct the classification rule. The bootstrap procedure is used to estimate the bias of the apparent error rate. For each bootstrap sample, Efron estimates the bias using the sample analog to the following equation:

W(Fn) = E* [ Σ_{i=1}^{n} (1/n - P_i*) Q(y_i, η_i*) ]   (10)

where E* is the expectation under bootstrap sampling, W(Fn) is the bias for the empirical distribution Fn, P_i* is the proportion of replications of the ith observation in a bootstrap sample of size n, η_i* (a variable taking values 0 or 1) is the classification of the ith observation based on the application of the classifier to a bootstrap sample, and y_i is 0 if the ith observation is from population 1 and 1 if the ith observation is from population 2. The function Q is simply 0 if y_i equals η_i* and is 1 otherwise.
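A Monte Carlo sketch of this bias estimate follows. The code assumes the estimate averages Σ_i (1/n - P_i*) Q(y_i, η_i*) over bootstrap samples, which is consistent with the contribution signs described in this section; the threshold classifier, its `fit` interface, and the data are toy examples of our own, not from the paper.

```python
import random

def bootstrap_bias_estimate(xs, ys, fit, k=200, seed=0):
    """Average of sum_i (1/n - P_i*) * Q(y_i, eta_i*) over k bootstrap
    samples, where P_i* is the replication proportion of observation i
    and Q is 0-1 misclassification loss (assumed form of the estimate)."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        clf = fit([xs[i] for i in idx], [ys[i] for i in idx])
        p_star = [idx.count(i) / n for i in range(n)]       # P_i*
        total += sum((1 / n - p_star[i]) * (clf(xs[i]) != ys[i])
                     for i in range(n))
    return total / k

# Toy classifier: threshold at the midpoint of the two class means.
def fit(xs, ys):
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
    t = (m0 + m1) / 2
    return lambda x: int(x > t) if m1 > m0 else int(x < t)

xs = [0.1, 0.4, 0.5, 0.9, 1.4, 1.6, 2.0, 2.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(bootstrap_bias_estimate(xs, ys, fit))
```

Observations absent from a bootstrap sample (P_i* = 0) push the estimate up, those appearing once contribute nothing, and those appearing two or more times push it down, exactly as described below.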
For a given bootstrap sample, P_i* = 0 if X_i does not occur, P_i* = 1/n if X_i occurs once, and in general P_i* = k/n if X_i occurs k times. Consequently, if an observation does not occur in a bootstrap sample, it has a positive contribution to the estimate of bias; if the observation occurs once it has no contribution, and if it occurs two or more times it has a negative contribution. The probability that an observation, occurring k times in a bootstrap sample, is misclassified is called the kth repetition error rate by Efron (1983). These rates are estimated for the five sampling experiments as illustrated in Figure 3, p. 325, of Efron (1983). Efron observes in the five sampling experiments that the bootstrap and randomized bootstrap perform better than cross-validation but not as well as the .632 estimator. Figures 1 and 2 provide some insight

into these results. Alternative procedures for estimating the bias are suggested from the bootstrap sampling distribution. The current paper motivated Chernick, Murthy, and Nealy (1985) to compare these estimates in a more extensive Monte Carlo study with a new estimator, "the KC estimator", which modifies the bootstrap according to the repetition rates (i.e. each bootstrap sample is controlled to replicate observations k times for k = 1, 2, 3, 4, and 5 according to the proportions in Figure 2 at the asymptotic value (n = ∞)). Results show that the new estimate is comparable to the .632 estimator.

It is not coincidental that .632 is 1 - lim_{n→∞} E(P_n). The value .632 in Equation (6.8) on p. 322 of Efron (1983) is the limiting probability that a newly chosen observation from Fn is identical to a bootstrap observation. This is the same as the limiting probability that a particular observation is placed in the ith urn for some i, and hence equals one minus the limiting proportion of empty urns.

[Figure omitted: E(P_n) and √Var(P_n) plotted against sample size, n from 2 to 50, with the proportion axis running from 0 to 0.5.]

FIGURE 1: Expected Proportion of Empty Urns.

[Figure omitted: expected replication frequencies plotted against sample size, with labeled curves including k = 2 and k = 3 and the vertical axis running from 0 to 0.3.]

FIGURE 2: Replication Frequency as a Function of Sample Size.



ACKNOWLEDGMENTS

The authors would like to express their appreciation to an anonymous referee for many constructive suggestions which improved the exposition of the paper and also for finding errors in Equations (2) and (3) in an earlier version of the manuscript.

REFERENCES

Chernick, M. R., Murthy, V. K., and Nealy, C. D. (1985). Application of bootstrap and other resampling techniques: evaluation of classifier performance. Pattern Recognition Letters, North-Holland, Amsterdam, to appear.

Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics 38, Society for Industrial and Applied Mathematics, Philadelphia.

Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78, 316-331.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Volume 1, John Wiley & Sons, Inc., New York.

Johnson, N. L. and Kotz, S. (1977). Urn Models and Their Application. John Wiley & Sons, Inc., New York.

Sobel, M. and Uppuluri, V. R. R. (1974). Sparse and crowded cells and Dirichlet distributions. The Annals of Statistics, 2, 977-987.

Weiss, I. H. (1970). A survey of discrete Kalman-Bucy filtering with unknown noise covariances. AIAA Guidance, Control and Flight Mechanics Conference, AIAA Paper No. 70-955, American Institute of Aeronautics and Astronautics, New York.

Received 3/21/84; Revised 5/17/85.
