
American Journal of Mathematical and Management Sciences

ISSN: 0196-6324 (Print) 2325-8454 (Online) Journal homepage: http://www.tandfonline.com/loi/umms20

Properties of Bootstrap Samples

Michael R. Chernick & V. K. Murthy

To cite this article: Michael R. Chernick & V. K. Murthy (1985) Properties of Bootstrap Samples, American Journal of Mathematical and Management Sciences, 5:1-2, 161-170, DOI: 10.1080/01966324.1985.10737161

To link to this article: http://dx.doi.org/10.1080/01966324.1985.10737161

Published online: 14 Aug 2013.

AMERICAN JOURNAL OF MATHEMATICAL AND MANAGEMENT SCIENCES
Copyright © 1985 by American Sciences Press, Inc.

PROPERTIES OF BOOTSTRAP SAMPLES

Michael R. Chernick
California State University
Management Science Department
Fullerton, CA 92634

V. K. Murthy
Hughes Aircraft Company
P.O. Box 902
El Segundo, CA 90245

SYNOPTIC ABSTRACT

Efron's version of the "bootstrap" procedure was devised as a method for obtaining nonparametric estimates of standard deviations and biases of estimators. Important applications include error rates of classifiers, non-linear regression, econometric modeling, discriminant analysis and principal components. Applications of the bootstrap procedure usually require the generation of "bootstrap samples." This paper presents a connection between bootstrap samples and the classical occupancy problem. Small sample properties of a bootstrap sample are obtained based on the application of results on the occupancy problem. The implications of these results for the estimate of misclassification probability in discriminant analysis are addressed.

Key Words and Phrases: bootstrap, cross-validation, classical occupancy problem, discriminant analysis, misclassification probability, urn models.

1985, VOL. 5, NOS. 1 & 2, 161-170


0196-6324/85/010161-10 $5.00

1. INTRODUCTION

Given a sample X1, X2, ..., Xn of random vectors and a real-valued estimator θ(X1, X2, ..., Xn), Efron (1982)

introduces a method called the "bootstrap" to assess the distribution of θ. The bootstrap procedure can be used to estimate quantities such as the bias, the standard deviation and the quantiles of the distribution of θ. Connections between the bootstrap, cross-validation, the jackknife and the delta method are also given in Efron (1982). The term "bootstrap" appeared earlier in the control literature to describe a procedure for Kalman filter estimation with unknown noise covariance. Weiss (1970) gives a survey of procedures including this bootstrap approach.
Practical application of the method often requires generation of bootstrap samples: samples of size n drawn at random with replacement from the empirical distribution function Fn. Fn is the distribution which assigns probability 1/n to each Xi for i = 1, 2, ..., n. For each bootstrap sample, an estimator θ* is computed. Given k bootstrap samples, the empirical distribution of the k θ*'s can be used to estimate the standard deviation, the mean and quantiles for θ.

Sometimes bootstrap sampling is not necessary, as the bootstrap estimates can be derived analytically in simple cases. The variance of a population is one such example, as Efron has pointed out in some lectures.

Bootstrap samples only give a Monte Carlo approximation to the bootstrap estimate. The bootstrap estimate is the limit of this Monte Carlo estimate as the number of bootstrap samples tends to infinity. If we define Monte Carlo methods to include sampling from estimated distributions, the bootstrap may be viewed as a form of Monte Carlo.
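The Monte Carlo approximation just described is easy to sketch in code. In the following Python fragment the data values and the choice of the sample mean as the estimator θ are illustrative, not from the paper; k bootstrap samples are drawn and the spread of the resulting θ* values approximates the bootstrap standard-deviation estimate.

```python
import random
import statistics

def bootstrap_sd(data, estimator, k=1000, seed=0):
    """Monte Carlo approximation to the bootstrap standard-deviation
    estimate: draw k samples of size n with replacement from the
    empirical distribution Fn and take the sd of the k estimates."""
    rng = random.Random(seed)
    n = len(data)
    thetas = [estimator(rng.choices(data, k=n)) for _ in range(k)]
    return statistics.stdev(thetas)

# Illustrative data; theta is the sample mean.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
print(bootstrap_sd(data, statistics.mean))
```

As the number of bootstrap samples k grows, this Monte Carlo value converges to the bootstrap estimate itself.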

Although large sample properties of the bootstrap have been studied, little is known about its small sample behavior. In the case of the estimation of misclassification error, Efron (1983) gives small sample comparisons of variants of the bootstrap with cross-validation and the so-called ".632 estimator." Our work was primarily motivated by this application.

2. CONNECTION WITH THE OCCUPANCY PROBLEM

A bootstrap sample typically contains only a subset of the data. Observations may occur once, twice or more, or not at all. Represent each observation Xi as an urn and each bootstrap observation X* as a ball. A bootstrap sample can be viewed as the placement of n balls into n urns where each ball has probability 1/n of being placed in the ith urn for i = 1, 2, 3, ..., n. The number of balls in the ith urn is equivalent to the number of replications of Xi in a bootstrap sample. This shows that bootstrap sampling is a special case of the sampling considered in the classical occupancy problem. The number of observations not appearing in a bootstrap sample is equivalent to the number of empty urns.
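The urn representation can be made concrete with a short simulation; the function below is an illustrative sketch, not code from the paper. It draws one bootstrap sample of size n (n balls into n urns, each urn chosen with probability 1/n) and reports how many balls land in each urn, i.e. how many times each observation is replicated.

```python
import random
from collections import Counter

def occupancy_counts(n, seed=0):
    """One bootstrap sample viewed as an occupancy experiment:
    return, for each of the n urns, the number of balls it received."""
    rng = random.Random(seed)
    balls = [rng.randrange(n) for _ in range(n)]  # urn index of each ball
    counts = Counter(balls)
    return [counts.get(i, 0) for i in range(n)]

counts = occupancy_counts(10, seed=1)
empty = sum(c == 0 for c in counts)  # observations missing from the sample
print(counts, empty)
```

The number of zeros in the returned list is exactly the number of observations that do not appear in that bootstrap sample.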

3. RESULTS

Feller (1968), page 102, shows that the probability that exactly m urns are empty when r balls are placed at random in n urns is given by

C(n, m) Σ_{v=0}^{n-m} (-1)^v C(n-m, v) (1 - (m+v)/n)^r,   for m = 0, 1, ..., n,   (1)

where C(a, b) denotes the binomial coefficient.

Set r = n in (1) to obtain the probability that exactly m observations do not appear in a bootstrap sample. The mean and variance of the random variable X (the number of empty urns) are given by

E(X) = Σ_{j=1}^{n} (1 - 1/n)^r   (2)

and

Var(X) = Σ_{j=1}^{n} (1 - 1/n)^r [1 - (1 - 1/n)^r] + 2 Σ_{j<j'≤n} [(1 - 2/n)^r - (1 - 1/n)^{2r}].   (3)

Continuing the analogy, X is the number of observations that do not appear in the bootstrap sample. Equations (2) and (3) are special cases of Equations (3.10) and (3.11) of Johnson and Kotz (1977). Note that for these cases the summands do not depend on the indices of summation. Since r = n for the bootstrap, the expected number of observations not included in the bootstrap sample is

E(X) = n [(n-1)/n]^n   (4)

and

Var(X) = n [(n-1)/n]^n [1 - ((n-1)/n)^n] + n(n-1) [((n-2)/n)^n - ((n-1)/n)^{2n}].   (5)
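Equations (4) and (5) are easy to check numerically. The sketch below (illustrative code, with function names of our own choosing) compares the closed-form mean and variance of X with a direct simulation of bootstrap sampling.

```python
import random

def moments_empty(n):
    """E(X) and Var(X) from Equations (4) and (5), where X is the
    number of observations absent from a bootstrap sample of size n."""
    a = ((n - 1) / n) ** n
    mean = n * a
    var = n * a * (1 - a) + n * (n - 1) * (((n - 2) / n) ** n - a * a)
    return mean, var

def simulate_empty(n, reps=20000, seed=0):
    """Sample mean and variance of X over many bootstrap samples."""
    rng = random.Random(seed)
    xs = []
    for _ in range(reps):
        hit = set(rng.randrange(n) for _ in range(n))  # urns that got a ball
        xs.append(n - len(hit))                        # empty urns
    m = sum(xs) / reps
    v = sum((x - m) ** 2 for x in xs) / (reps - 1)
    return m, v

print(moments_empty(10))   # exact: roughly (3.487, 0.993)
print(simulate_empty(10))  # should agree closely
```

For n = 2 the formulas give E(X) = 0.5 and Var(X) = 0.25, matching the fact that X is 0 or 1 with probability 1/2 each.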

Let P_n = X/n. The expectation of P_n is E(P_n) = ((n-1)/n)^n, so lim_{n→∞} E(P_n) = e^{-1}. The variance of P_n is Var(X)/n^2. It is shown (Theorem 6.4, p. 331 of Johnson and Kotz (1977)) that (P_n - E(P_n))/√Var(P_n) has asymptotically a standard normal distribution. Consequently, for large n a typical bootstrap sample includes approximately the proportion 1 - e^{-1}, i.e. 63%, of the observations. In small samples, the percentage is higher on the average. For n = 2 it is 75%, and for n = 3 approximately 70%.
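These small-sample percentages follow directly from E(P_n) = ((n-1)/n)^n; the short computation below (an illustrative sketch) reproduces them.

```python
def included_fraction(n):
    """Expected fraction of distinct observations appearing in a
    bootstrap sample of size n: 1 - E(P_n) = 1 - ((n-1)/n)^n."""
    return 1 - ((n - 1) / n) ** n

for n in (2, 3, 10, 50):
    print(n, round(included_fraction(n), 4))
# Values: 0.75 for n=2, ~0.7037 for n=3, decreasing toward 1 - 1/e ~ 0.6321.
```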

Figure 1 shows the mean and standard deviations for P_n where n varies from 2 to 50. Note that the variance is much larger in small samples. For n = 2, the bootstrap sample either contains 50% or 100% of the observations, with probability 1/2 for each possibility.


Also of interest is the number of observations occurring at least v times. Explicit forms for these probability distributions are obtained as a special case of the sparse and crowded cell distributions of Sobel and Uppuluri (1974). Sobel and Uppuluri express these probabilities in terms of incomplete beta functions. In the case of crowded cells

E(C) = b I_p(v, n-v+1)   (6)

where b = n, C is the number of crowded cells (i.e. urns containing at least v balls), p = 1/n and I_p(v, n-v+1) is the probability given by

I_p(v, n-v+1) = [n!/((v-1)!(n-v)!)] ∫_0^p (1-x)^{n-v} x^{v-1} dx = Σ_{i=v}^{n} C(n, i) p^i (1-p)^{n-i}.   (7)

Since C is the number of urns that contain at least v balls, C also represents the number of observations that occur at least v times in a bootstrap sample.

Equation (6) is taken from Sobel and Uppuluri (1974), who also derive factorial moments and the variance of C in terms of the incomplete beta functions. Figure 2 gives the expected frequency of occurrence of k replications of an observation in a bootstrap sample of size n, where n ranges from 2 to 30 (with infinity included at the end) and k = 0, 1, ..., min(4, n).

These results were obtained from (7) by subtracting I_p(k+1, n-k) from I_p(k, n-k+1). The limiting expected frequency of repetitions is Poisson with rate parameter λ = 1.

Since the selection probability is 1/n for each urn, the bootstrap sample has a special multinomial distribution and the expected fraction of observations occurring exactly k times in a bootstrap sample equals the probability that an observation occurs exactly k times. That binomial probability, as noted by Efron (1983), is

Probability that an urn contains exactly k balls = C(n, k) (n-1)^{n-k} / n^n = P(k).   (8)

Note that Efron's result comes directly from consideration of the multinomial distribution and does not involve the occupancy relationship. Figure 2 and the Poisson limiting results are more easily obtained directly from (8). Note however that

E(C) = n I_{1/n}(v, n-v+1) = n Σ_{k=v}^{n} C(n, k) (1/n)^k (1 - 1/n)^{n-k} = n Σ_{k=v}^{n} P(k),   (9)

and

P(k) = I_{1/n}(k, n-k+1) - I_{1/n}(k+1, n-k).
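These relations can be verified numerically. The sketch below is illustrative: it implements I_p(v, n-v+1) through its binomial-tail form from (7), checks that P(k) equals the difference of incomplete beta functions, and compares P(1) with the Poisson(1) limit e^{-1}.

```python
from math import comb, exp

def P(k, n):
    """Equation (8): probability an observation occurs exactly k
    times in a bootstrap sample; Binomial(n, 1/n)."""
    return comb(n, k) * (n - 1) ** (n - k) / n ** n

def I(p, v, n):
    """Incomplete beta I_p(v, n-v+1) via its binomial-tail form:
    sum_{i=v}^{n} C(n, i) p^i (1-p)^(n-i)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(v, n + 1))

n = 20
# P(k) = I_{1/n}(k, n-k+1) - I_{1/n}(k+1, n-k), as stated in the text.
for k in range(1, 5):
    assert abs(P(k, n) - (I(1 / n, k, n) - I(1 / n, k + 1, n))) < 1e-12
# P(1) approaches the Poisson(1) value e^{-1} as n grows.
print(P(1, n), exp(-1))
```

The check succeeds exactly because the difference of the two binomial tails leaves only the i = k term, which is P(k).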

4. APPLICATION TO CLASSIFICATION PROBLEMS


In this section the classification of multivariate
observations into one of two populations (denoted population 1
and population 2) is considered. Efron (1983) calls the
average number of observations misclassified by a classifier
the apparent error rate. The true error rate is the
probability of incorrectly classifying a future observation.

Usually the apparent error rate underestimates the true error rate since these observations are used to construct the classification rule. The bootstrap procedure is used to estimate the bias of the apparent error rate. For each bootstrap sample, Efron estimates the bias using the sample analog to the following equation:

W(Fn) = E* [ Σ_{i=1}^{n} (1/n - P_i*) Q(y_i, η_i*) ]   (10)

where E* is the expectation under bootstrap sampling, W(Fn) is the bias for the empirical distribution Fn, P_i* is the proportion of replications of the ith observation in a bootstrap sample of size n, η_i* (a variable taking values 0 or 1) is the classification of the ith observation based on the application of the classifier to a bootstrap sample, and y_i is 0 if the ith observation is from population 1 and 1 if the ith observation is from population 2. The function Q is simply 0 if y_i equals η_i* and is 1 otherwise.
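A Monte Carlo sketch of this bias estimate follows. The code assumes the estimate averages Σ_i (1/n - P_i*) Q(y_i, η_i*) over bootstrap samples, which is consistent with the contribution signs described in this section; the threshold classifier, its `fit` interface, and the data are toy examples of our own, not from the paper.

```python
import random

def bootstrap_bias_estimate(xs, ys, fit, k=200, seed=0):
    """Average of sum_i (1/n - P_i*) * Q(y_i, eta_i*) over k bootstrap
    samples, where P_i* is the replication proportion of observation i
    and Q is 0-1 misclassification loss (assumed form of the estimate)."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        clf = fit([xs[i] for i in idx], [ys[i] for i in idx])
        p_star = [idx.count(i) / n for i in range(n)]       # P_i*
        total += sum((1 / n - p_star[i]) * (clf(xs[i]) != ys[i])
                     for i in range(n))
    return total / k

# Toy classifier: threshold at the midpoint of the two class means.
def fit(xs, ys):
    m0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
    m1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
    t = (m0 + m1) / 2
    return lambda x: int(x > t) if m1 > m0 else int(x < t)

xs = [0.1, 0.4, 0.5, 0.9, 1.4, 1.6, 2.0, 2.3]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
print(bootstrap_bias_estimate(xs, ys, fit))
```

Observations absent from a bootstrap sample (P_i* = 0) push the estimate up, those appearing once contribute nothing, and those appearing two or more times push it down, exactly as described below.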
For a given bootstrap sample, P_i* = 0 if X_i does not occur, P_i* = 1/n if X_i occurs once, and in general P_i* = k/n if X_i occurs k times. Consequently, if an observation does not occur in a bootstrap sample, it has a positive contribution to the estimate of bias; if the observation occurs once it has no contribution, and if it occurs two or more times it has a negative contribution. The probability that an observation, occurring k times in a bootstrap sample, is misclassified is called the kth repetition error rate by Efron (1983). These rates are estimated for the five sampling experiments as illustrated in Figure 3, p. 325, of Efron (1983). Efron observes in the five sampling experiments that the bootstrap and randomized bootstrap perform better than cross-validation but not as well as the .632 estimator. Figures 1 and 2 provide some insight

into these results. Alternative procedures for estimating the bias are suggested from the bootstrap sampling distribution. The current paper motivated Chernick, Murthy, and Nealy (1985) to compare these estimates in a more extensive Monte Carlo study with a new estimator, "the KC estimator", which modifies the bootstrap according to the repetition rates (i.e. each bootstrap sample is controlled to replicate observations k times for k = 1, 2, 3, 4, and 5 according to the proportions in Figure 2 at the asymptotic value (n = ∞)). Results show that the new estimate is comparable to the .632 estimator.

It is not coincidental that .632 is 1 - lim_{n→∞} E(P_n). The value .632 in Equation (6.8) on p. 322 of Efron (1983) is the limiting probability that a newly chosen observation from Fn is identical to a bootstrap observation. This is the same as the limiting probability that a particular observation is placed in the ith urn for some i, and hence equals one minus the limiting proportion of empty urns.

[Figure omitted: E(P_n) and √Var(P_n) plotted against sample size, n from 2 to 50, with the proportion axis running from 0 to 0.5.]

FIGURE 1: Expected Proportion of Empty Urns.

[Figure omitted: expected replication frequencies plotted against sample size, with labeled curves including k = 2 and k = 3 and the vertical axis running from 0 to 0.3.]

FIGURE 2: Replication Frequency as a Function of Sample Size.



ACKNOWLEDGMENTS

The authors would like to express their appreciation to an anonymous referee for many constructive suggestions which improved the exposition of the paper and also for finding errors in Equations (2) and (3) in an earlier version of the manuscript.

REFERENCES

Chernick, M. R., Murthy, V. K., and Nealy, C. D. (1985). Application of bootstrap and other resampling techniques: evaluation of classifier performance. Pattern Recognition Letters, North-Holland, Amsterdam, to appear.

Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. CBMS-NSF Regional Conference Series in Applied Mathematics 38, Society for Industrial and Applied Mathematics, Philadelphia.

Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association, 78, 316-331.

Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Volume 1, John Wiley & Sons, Inc., New York.

Johnson, N. L. and Kotz, S. (1977). Urn Models and Their Application. John Wiley & Sons, Inc., New York.

Sobel, M. and Uppuluri, V. R. R. (1974). Sparse and crowded cells and Dirichlet distributions. The Annals of Statistics, 2, 977-987.

Weiss, I. H. (1970). A survey of discrete Kalman-Bucy filtering with unknown noise covariances. AIAA Guidance, Control and Flight Mechanics Conference, AIAA Paper No. 70-955, American Institute of Aeronautics and Astronautics, New York.

Received 3/21/84; Revised 5/17/85.
