
Density Estimation

Density Estimation:
- Deals with the problem of estimating probability density functions (PDFs) based on some data sampled from the PDF.
- May use assumed forms of the distribution, parameterized in some way (parametric statistics); or
- May avoid making assumptions about the form of the PDF (non-parametric statistics).
We are concerned more here with the non-parametric case (see Roger Barlow's lectures for parametric statistics).
Frank Porter, SLUO Lectures on Statistics, 15-17 August 2006
Some References (I)
Richard A. Tapia & James R. Thompson, Nonparametric Density Estimation, Johns Hopkins University Press, Baltimore (1978).
David W. Scott, Multivariate Density Estimation, John Wiley & Sons, Inc., New York (1992).
Adrian W. Bowman and Adelchi Azzalini, Applied Smoothing Techniques for Data Analysis, Clarendon Press, Oxford (1997).
B. W. Silverman, Density Estimation for Statistics and Data Analysis, Monographs on Statistics and Applied Probability, Chapman and Hall (1986);
http://nedwww.ipac.caltech.edu/level5/March02/Silverman/Silver_contents.html
K. S. Cranmer, Kernel Estimation in High Energy Physics, Comp. Phys. Comm. 136, 198 (2001) [hep-ex/0011057v1];
http://arxiv.org/PS_cache/hep-ex/pdf/0011/0011057.pdf
Some References (II)
M. Pivk & F. R. Le Diberder, sPlot: a statistical tool to unfold data distributions, Nucl. Instr. Meth. A 555, 356 (2005).
R. Cahn, How sPlots are Best (2005),
http://babar-hn.slac.stanford.edu:5090/hn/aux/auxvol01/rncahn/rev_splots_best.pdf
BaBar Statistics Working Group, Recommendations for Display of Projections in Multi-Dimensional Analyses,
http://www.slac.stanford.edu/BFROOT/www/Physics/Analysis/Statistics/Documents/MDgraphRec.pdf
Additional specific references will be noted in the course of the lectures.
Preliminaries
We'll couch the discussion in terms of observations (a dataset) from some experiment. Our dataset consists of the values x_i, i = 1, 2, ..., n.
- Our dataset consists of repeated samplings from a (presumed unknown) probability distribution: IID (Independent, Identically Distributed). We'll note generalizations here and there.
- Order is not important; if we are discussing a time series, we could introduce ordered pairs {(x_i, t_i), i = 1, ..., n} and call it two-dimensional. [But beware the correlations then; probably not IID!]
- In general, our quantities can be multi-dimensional; no special notation will be used to distinguish one- from multi-variate cases. We'll discuss where issues enter with dimensionality.
Notation
At our convenience we may use E, ⟨·⟩, and an overbar all to mean expectation:

E(x) ≡ ⟨x⟩ ≡ x̄ ≡ ∫ x p(x) dx,

where p(x) is the probability density function (PDF) for x (or, more generally, p(x) dx ≡ μ(dx) is the probability measure).
Estimators are denoted with a hat: in these lectures we'll be concerned with estimators for the density function itself, hence p̂(x) is a random variable giving our estimate for p(x).
We will not be especially rigorous. For example, we won't make a notational distinction between the random variable and an instance.
Motivation
Why do we want to estimate densities?
Well, that is the whole point...
Harder question: Why non-parametric estimates?
- Comparison with models (which may be parametric)
- May be easier/better than parametric modeling for efficiency corrections and background subtraction
- Visualization
- Unfolding
- Comparing samples
R, A Toolkit, er, Language, You Might be Interested In...
The S Language: developed with
statistical analysis of data in mind.
> x <- rnorm(100,10,1)
> hist(x,xlim=range(5,15))
>
[Figure: "Histogram of x" produced by the code above; Frequency (0-20) vs x (6-14).]
Free, open source version is R, from the R Project. Downloads available for Linux/Mac OS X/Windows, e.g., at:
http://cran.cnr.berkeley.edu/
Commercial version is S-Plus, at http://www.insightful.com/
Empirical Probability Density Function
Place a delta function at each data point. The estimator (EPDF, for Empirical Probability Density Function) is

p̂(x) = (1/n) Σ_{i=1}^{n} δ(x − x_i).

[Figure: EPDF of a sample, shown as spikes at each data point, for x from 0 to 1000.]

Note that x could be multi-dimensional here.
This is the sampling density for the bootstrap (more later; also see Ilya Narsky's lectures).
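Since the EPDF puts probability 1/n on each observed value, sampling from it is just resampling the data with replacement. A minimal R sketch (the toy sample is my assumption, for illustration only):

# Sampling from the EPDF = resampling the data with replacement (bootstrap)
x <- rcauchy(100, location = 400, scale = 60)        # toy data (assumed)
boot <- sample(x, size = length(x), replace = TRUE)  # one bootstrap replica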
The Histogram
Perhaps our most ubiquitous density estimator is the histogram:

h(x) = Σ_{i=1}^{n} B(x − x̃_i; w),

where x̃_i is the center of the bin in which observation x_i lies, w is the bin width, and

B(x; w) = 1 if x ∈ (−w/2, w/2), 0 otherwise

(called the indicator function in probability).

[Figure: the bin function B(x − x̃_i; w), a box of unit height and width w centered at the bin center x̃_i containing x_i.]

This is written for uniform bin widths, but may be generalized to differing widths with appropriate relative normalization factors.
The estimator for the probability density function (PDF) is:

p̂(x) = h(x) / (nw).
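In R, hist() returns exactly this estimator: for each bin, the density component is counts/(n·w). A quick sketch (the toy sample is assumed):

# Histogram as density estimate: density[j] = counts[j] / (n * w_j)
x <- rnorm(1000)
h <- hist(x, plot = FALSE)
all.equal(h$density, h$counts / (length(x) * diff(h$breaks)))  # TRUE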
Histogram Example
[Figure: left, the EPDF of the sample (spikes on the x axis, 0-1000); right, a histogram of the same sample, Events / 10 MeV (0-6) vs m(p π) − m(p) − m(π) in MeV (0-900).]
Left: EPDF; Right: Histogram with w = 10 MeV.
[Actual sampling is 100 points from a Δ(1232) Breit-Wigner (Cauchy) on a second-order polynomial background. Background probability is 50%.]
Criticisms of Histogram as Density Estimator
- Discontinuous even if the PDF is continuous.
- Dependence on bin size and bin origin.
- Information from the location of a datum within a bin is ignored.
Kernel Estimation
Take the histogram, but replace the bin function B with something else:

p̂(x) = (1/n) Σ_{i=1}^{n} k(x − x_i; w),

where k(x; w) is the kernel function, normalized to unity:

∫_{−∞}^{∞} k(x; w) dx = 1.

Usually we are interested in kernels of the form

k(x − x_i; w) = (1/w) K((x − x_i)/w);

indeed, this may be used as the definition of a kernel. The kernel estimator for the PDF is then:

p̂(x) = (1/(nw)) Σ_{i=1}^{n} K((x − x_i)/w).

In this form the role of the parameter w as a smoothing parameter is clearer.
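A direct R transcription of this estimator, as a sketch (the helper name kde and the Gaussian choice K = dnorm are mine); R's built-in density() implements the same fixed-kernel form and serves as a cross-check:

# Fixed-kernel estimate: phat(x) = (1/(n*w)) * sum_i K((x - x_i)/w)
kde <- function(xgrid, data, w, K = dnorm) {
  sapply(xgrid, function(x) mean(K((x - data) / w)) / w)
}
x  <- rnorm(200)
xg <- seq(-4, 4, length.out = 201)
ph <- kde(xg, x, w = 0.5)   # Gaussian kernel, smoothing parameter w = 0.5
d  <- density(x, bw = 0.5)  # built-in: same estimator (Gaussian kernel)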
Multi-Variate Kernel Estimation
Explicit multi-variate case, d = 2 dimensions:

p̂(x, y) = (1/(n w_x w_y)) Σ_{i=1}^{n} K((x − x_i)/w_x) K((y − y_i)/w_y).

This is a product kernel form, with the same kernel in each dimension, except for possibly different smoothing parameters. It does not have correlations.
The kernels we have introduced are classified more explicitly as fixed kernels: the smoothing parameter is independent of x.
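A sketch of the product-kernel form in R, evaluated at a single point (the name kde2 and the Gaussian kernel are my choices):

# 2-D product-kernel estimate at the point (x, y)
kde2 <- function(x, y, xs, ys, wx, wy) {
  mean(dnorm((x - xs) / wx) * dnorm((y - ys) / wy)) / (wx * wy)
}
# e.g.: kde2(0, 0, rnorm(500), rnorm(500), wx = 0.4, wy = 0.4)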
Ideogram
A simple variant on the kernel idea is to permit the kernel to depend on additional knowledge in the data. Physicists call this an ideogram.
- Most common is the Gaussian ideogram, in which each data point is entered as a Gaussian of area one and standard deviation appropriate to that datum.
- This addresses a way that the IID assumption might be broken.
[Aside: Be careful to get your likelihood function right if you are incorporating variable resolution information in your fits; see, e.g., Punzi:
http://www.slac.stanford.edu/econf/C030908/papers/WELT002.pdf]
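A Gaussian ideogram is a one-line variation on the kernel estimator: each datum gets its own width. A sketch (the function name is mine; the per-point resolutions sigma are assumed known):

# Gaussian ideogram: one unit-area Gaussian per datum, width sigma_i
ideogram <- function(xgrid, x, sigma) {
  sapply(xgrid, function(u) mean(dnorm(u, mean = x, sd = sigma)))
}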
Sample Ideograms (I)
[Figure: PDG ideogram of charged-kaon mass measurements, m_K (MeV), from RPP 2006. WEIGHTED AVERAGE 493.664 ± 0.011 (error scaled by 2.5). Contributions to χ²: BACKENSTO... 73 (0.4); CHENG 75 K Pb 13-12 (0.8), 12-11 (3.6), 11-10 (0.5), 10-9 (0.1), 9-8 (1.1); BARKOV 79 (0.0); LUM 81 (0.2); GALL 88 K W 11-10 (2.2), K W 9-8 (0.4), K Pb 11-10 (0.2), K Pb 9-8 (22.6); DENISOV 91 (20.5). Total χ² = 52.6 (confidence level 0.001). PDG caption: values above of weighted average, error, and scale factor are based upon the data in this ideogram only. They are not necessarily the same as our 'best' values, obtained from a least-squares constrained fit utilizing measurements of other (related) quantities as additional information.]
Sample Ideograms (II)
[Figure 1 from the reference below: a histogram of magnetic field values (black), compared with a smoothed frequency distribution constructed using a Gaussian ideogram technique (red). Note the detailed comparison.]
(from J. S. Halekas et al., Magnetic Properties of Lunar Geologic Terranes: New Statistical Results, Lunar and Planetary Science XXXIII (2002), 1368.pdf)
Parametric vs non-Parametric Density Estimation (I)
The distinction is fuzzy:
- A histogram is non-parametric, in the sense that no assumption about the form of the sampling distribution is made.
- There is often an implicit assumption that the distribution is smooth on a scale smaller than the bin size. For example, we know something about the resolution of our apparatus.
- But the estimator of the parent distribution made with a histogram is parametric: the parameters are the populations (or frequencies) in each bin. The estimators for those parameters are the observed histogram populations. Even more parameters than a typical parametric fit!
Parametric vs non-Parametric Density Estimation (II)
The essence of the difference may be captured in the notions of local and non-local:
If a datum at x_i influences the density estimator at some other point x, this is non-local. A non-parametric estimator is one in which the influence of a point at x_i on the estimate at any x with d(x_i, x) > ε vanishes, asymptotically.
Notice that for a kernel estimator, the bigger the smoothing parameter w, the more non-local the estimator:

p̂(x) = (1/(nw)) Σ_{i=1}^{n} K((x − x_i)/w).

As we'll discuss, the optimal choice of smoothing parameter depends on n.
Optimization
We would like to make an optimal density estimate from our data. What does that mean?
- We need a criterion for "optimal".
- The choice of criterion is subjective; it depends on what you want to achieve.
We may compare the estimator for a quantity (here, the value of the density at x) with the true value:

ε(x) = f̂(x) − f(x).

[Figure: the true density f(x), the estimate f̂(x), and their pointwise difference ε(x).]
Mean Squared Error (I)
A common choice in parametric estimation is to minimize the sum of the squares. We may take this idea over here, and form the Mean Squared Error (MSE):

MSE[f̂(x)] ≡ ⟨(f̂(x) − f(x))²⟩ = Var[f̂(x)] + Bias²[f̂(x)],

where

Var[f̂(x)] ≡ E{(f̂(x) − E[f̂(x)])²},
Bias[f̂(x)] ≡ E[f̂(x)] − f(x).
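A quick Monte Carlo check of the decomposition in R; the setup (a Gaussian-kernel estimate of the standard normal density at x = 0) is my assumed example:

# Check MSE = Var + Bias^2 for fhat(0), with standard normal truth
set.seed(1)
n <- 100; w <- 0.5; x0 <- 0
fhat <- replicate(2000, { xs <- rnorm(n); mean(dnorm((x0 - xs) / w)) / w })
mse <- mean((fhat - dnorm(x0))^2)
vb2 <- mean((fhat - mean(fhat))^2) + (mean(fhat) - dnorm(x0))^2
c(mse, vb2)  # identical, by the same algebra as in the exercise below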
Mean Squared Error (II)
Since this isn't quite our familiar parameter estimation, let's take a little time to make sure it is understood:
Suppose p̂(x) is an estimator for the PDF p(x), based on data {x_i; i = 1, ..., n}, IID from p(x). Then

E[p̂(x)] = ∫⋯∫ p̂(x; {x_i}) Prob({x_i}) dⁿ({x_i})
         = ∫⋯∫ p̂(x; {x_i}) Π_{i=1}^{n} [p(x_i) dx_i].
Exercise: Proof of the formula for the MSE

MSE[f̂(x)] = ⟨(f̂(x) − f(x))²⟩
 = ∫⋯∫ [f̂(x; {x_i}) − f(x)]² Π_{i=1}^{n} [p(x_i) dx_i]
 = ∫⋯∫ [f̂(x; {x_i}) − E(f̂) + E(f̂) − f(x)]² Π_{i=1}^{n} [p(x_i) dx_i]
 = ∫⋯∫ { [f̂(x; {x_i}) − E(f̂)]² + [E(f̂) − f(x)]² + 2 [f̂(x; {x_i}) − E(f̂)] [E(f̂) − f(x)] } Π_{i=1}^{n} [p(x_i) dx_i]
 = Var[f̂(x)] + Bias²[f̂(x)] + 0,

where the cross term integrates to zero because E(f̂) − f(x) is a constant and f̂(x; {x_i}) − E(f̂) has zero expectation.
[In typical treatments of parametric statistics, we assume unbiased estimators, hence the Bias term is zero. That isn't a good assumption here.]
The Problem With Smoothing (I)
Thm. [Rosenblatt (1956)]: A uniform minimum variance unbiased estimator for p(x) does not exist.
Unbiased:
E[p̂(x)] = p(x).
Uniform minimum variance:
Var[p̂(x) | p(x)] ≤ Var[q̂(x) | p(x)], ∀ x,
for all p(x), where q̂(x) is any other estimator of p(x).
The Problem With Smoothing (II)
For example, suppose we have a kernel estimator:

p̂(x) = (1/n) Σ_{i=1}^{n} k(x − x_i; w).

Its expectation is:

E[p̂(x)] = (1/n) Σ_{i=1}^{n} ∫ k(x − x_i; w) p(x_i) dx_i = ∫ k(x − y; w) p(y) dy.

Unless k(x − y) = δ(x − y), p̂(x) will be biased for some p(x). But δ(x − y) has infinite variance.
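The bias is concrete for a Gaussian truth and a Gaussian kernel, where the convolution is available in closed form: smearing N(0, 1) with N(0, w²) gives N(0, 1 + w²). A sketch (the example is my choice):

# E[phat(x)] is the convolution of kernel and truth; at the peak x = 0:
w <- 0.5
dnorm(0, sd = sqrt(1 + w^2))  # E[phat(0)] ~ 0.357: biased low at the peak
dnorm(0)                      # true p(0)  ~ 0.399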
The Problem with Smoothing (III)
So the nice properties we strive for in parameter estimation (and sometimes achieve) are beyond reach.
Intuition: smoothing lowers peaks and fills in valleys.
[Figure: Frequency vs x. Red curve: PDF. Histogram: sampling from the PDF. Black curve: Gaussian kernel estimator for the PDF.]
Comment on Number of Bins in Histogram
Note: Sturges' rule, based on optimizing the MSE, was used in deciding how many bins, k, to make in the histogram:

k = 1 + log₂ n.

The argument behind this rule has been criticized (1995):
http://www-personal.buseco.monash.edu.au/hyndman/papers/sturges.pdf
Indeed, we see in our example that we would have selected more bins by hand; our histogram is over-smoothed. There are other rules for optimizing the number of bins. For example, Scott's rule for the bin width is:

w = 3.5 s n^{−1/3},

where s is the sample standard deviation.
[More later]
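Base R happens to ship both rules, so they are easy to compare (toy sample assumed):

# Bin-count / bin-width rules in base R
x <- rnorm(1000)
nclass.Sturges(x)                # Sturges: k = 1 + log2(n), rounded up
3.5 * sd(x) * length(x)^(-1/3)   # Scott's rule for the bin width w
nclass.scott(x)                  # bin count implied by Scott's width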
Dependence on Smoothing Parameter
Plot showing the effect of the choice of smoothing parameter:
[Figure: Frequency vs x; kernel estimates of the same sample for several choices of the smoothing parameter.]
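The effect is easy to reproduce in R by varying bw in density() (the sample and bandwidth values are illustrative):

# Same sample, three smoothing parameters
x <- rnorm(500)
plot(density(x, bw = 0.05))                # undersmoothed: noisy
lines(density(x, bw = 0.3), col = "blue")  # moderate
lines(density(x, bw = 1.5), col = "red")   # oversmoothed: peak flattened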
The Curse of Dimensionality
Roger Barlow gave a nice example of the impact of the Curse of Dimensionality in parametric statistics. It is a significant affliction in density estimation as well.
- Difficult to display and visualize as the number of dimensions increases.
- All the volume (of a bounded region) goes to the boundary (exponentially!) as the dimension increases, i.e., the data become sparse: a central subregion of half the linear size contains a volume fraction of 1/2, 1/4, 1/8, ..., (1/2)^d in d = 1, 2, 3, ... dimensions (a one-line illustration follows below).
- Tendency for exponentially growing computation requirements with dimension.
- Even worse than in parametric statistics.
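The volume fractions quoted in the list above, in one line of R:

# Fraction of a unit cube's volume in the central cube of half the side length
d <- 1:10
(1/2)^d   # 0.5, 0.25, 0.125, ..., ~0.001: nearly all volume near the boundary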
Summary
We have introduced:
- Basic notions in (non-parametric) density estimation
- Some simple variations on the theme
- A foundation towards optimization
- An idea of where and how things will fail
Next: further sophistication on these ideas, and introduction of other variations in approach and application.