
Rates of Uniform Consistency for k-NN Regression

Heinrich Jiang
arXiv:1707.06261v1 [stat.ML] 19 Jul 2017

Email: <heinrich.jiang@gmail.com>
Preprint.

Abstract

We derive high-probability finite-sample uniform rates of consistency for k-NN regression that are optimal up to logarithmic factors under mild assumptions. We moreover show that k-NN regression adapts to an unknown lower intrinsic dimension automatically. We then apply the k-NN regression rates to establish new results about estimating the level sets and global maxima of a function from noisy observations.

1 Introduction

The popular k-nearest neighbor (k-NN) regression is a simple yet powerful approach to nonparametric regression. The value of the function is taken to be the unweighted average of the observations at the k closest samples. Although this procedure has been known for a long time and has deep practical significance, there is still surprisingly much about its convergence properties yet to be understood.

We derive finite-sample, high-probability uniform bounds for k-NN regression under a standard additive model y = f(x) + ε, where f is an unknown function, ε is sub-Gaussian white noise, and y is the noisy observation. The samples {(x_i, y_i)}_{i=1}^n are drawn i.i.d. as follows: x_i is drawn according to an unknown density p_X, which shares the same support as f, and then the observation y_i is generated by the additive model based on x_i.

We then give simple procedures to estimate the level sets and global maxima of a function given noisy observations, and apply the k-NN regression bounds to establish new Hausdorff recovery guarantees for these structures. Each of these results is interesting in its own right.

The bulk of the work on k-NN regression convergence theory is on its properties under various risk measures or on asymptotic convergence. Notions of consistency involving risk measures such as mean squared error are considerably weaker than the sup-norm, as the latter imposes a uniform guarantee on the error |f_k(x) − f(x)|, where f_k is the k-NN regression estimate of the function f. Existing work studying f_k under the sup-norm has thus far been asymptotic. We give the first sup-norm finite-sample result, and it matches the minimax optimal rate up to logarithmic factors.

We then discuss the setting where the data lies on a lower-dimensional manifold and provide rates that depend only on the intrinsic dimension and are independent of the ambient dimension. This shows that k-NN regression is able to automatically escape the curse of dimensionality in this sense, without any preprocessing or modifications.

We then show the utility of our k-NN regression results in recovering certain structures of an arbitrary function. The motivation can be traced back to the rich theory of density-based clustering. There, one is given a finite sample from a probability density p, and the clusters can then be modeled based on certain structures in the underlying density p. Such structures include the level sets {x : p(x) ≥ λ} for some density level λ, or the local maxima of p. To estimate these, one typically uses a plug-in approach with a density estimator p̂ (e.g. for level sets, {x : p̂(x) ≥ λ}). It turns out that given uniform bounds on |p̂ − p|, we can estimate these structures with strong guarantees.
In this paper, instead of estimating these structures in a density, we estimate them for a general function f. This is possible because of our established finite-sample sup-norm bounds for nonparametric regression. There are, however, some key differences in our setting. In the density setting, one has access to i.i.d. samples drawn from the density. Here, we have an i.i.d. sample x drawn from some density p_X not necessarily related to f, and we then obtain a noisy observation of the value f(x). This can be viewed as a noisy observation of the feature of x. In other words, we estimate the structures based on the features of the data, while in the density setting there are no features and the structures are instead based on the dense regions of the dataset.

2 Related Works and Contributions

2.1 k-NN Regression Rates

The consistency properties of k-NN regression have been studied for a long time, and we highlight some of the work here. Biau et al. [2] give guarantees under the L2 risk. Devroye et al. [7] give consistency guarantees under the L1 risk. Stone [19] provides results under the Lp risk for p ≥ 1. All these notions of consistency are under some integrated risk, and thus are weaker than the sup-norm (i.e. L∞), which imposes a uniform guarantee.

A number of works such as Mack and Silverman [16], Cheng [4], Devroye [6], Lian et al. [14], and Kudraszow and Vieu [12] give strong uniform convergence rates. However, these results are asymptotic. Our bounds explore the finite-sample consistency properties of k-NN regression, which we will later demonstrate can establish strong results about k-NN based learning algorithms that were not possible with existing results. To the best of our knowledge, this is the first such finite-sample uniform consistency result for this procedure, and it matches the minimax rate up to logarithmic factors.

We then extend our results to the setting where the data lies on a lower-dimensional manifold. This is of practical interest because the curse of dimensionality forces nonparametric methods such as k-NN to require an exponential-in-dimension sample complexity; however, as a consolation, many of these methods can be shown to have sample complexity depending on the intrinsic dimension (e.g. doubling dimension, manifold dimension, covering number) and independent of the ambient dimension. In modern data applications where the dimension can be arbitrarily high, oftentimes the number of degrees of freedom remains much lower. It thus becomes important to understand these methods in this setting.

Kulkarni and Posner [13] give results for k-NN regression based on the covering numbers of the support of the distribution. Kpotufe [11] shows that k-NN regression actually adapts to the local intrinsic dimension, without any modifications to the procedure or data, in the L2 norm. In this paper, we show that the same holds in the sup-norm as well.

2.2 Level Set Estimation

Density level-set estimation has been extensively studied and has significant implications for density-based clustering. Some works include Tsybakov et al. [21], Singh et al. [18], and Jiang [9, 10]. It involves estimating L_p(λ) := {x : p(x) ≥ λ} given a finite i.i.d. sample X drawn from p, where λ is some known density level and p is the unknown density. L_p(λ) can be seen as the high-density region of the data, and thus its connected components can be used as the core-sets in clustering. It can be shown that given a density estimator p̂_n with guarantees on |p̂_n − p|_∞, then taking L̂_p(λ) := {x ∈ X : p̂_n(x) ≥ λ}, the Hausdorff distance between L_p(λ) and L̂_p(λ) can also be bounded.

In this paper, we extend this idea to functions f which are not necessarily densities, given noisy observations of f. We obtain results similar to those familiar in the density setting, made possible by our established bounds for estimating f. An advantage of this approach is that it can be applied to clustering where features are available and clusters are defined as regions of similar feature value rather than similar density. In density-based clustering, it is typical that one does not assume access to the features, and thus such procedures fail to readily take advantage of the features when performing clustering. A similar approach was taken by Willett and Nowak [22], who used nonparametric regression to estimate the level sets of a function; however, our consistency results are instead under the Hausdorff metric.

2.3 Global Maxima Estimation

We next give an interesting result for estimating the global maxima of a function. Given n i.i.d. samples from some distribution on the input space and noisy observations of f at the samples, we show a guarantee on the distance between the sample point with the highest k-NN regression value and the (unique) point which maximizes f. This gives us insight into how well a grid search or randomized search can estimate the maximum of a function.

This result can be compared to mode estimation in the density setting, where the objective is to find the point which maximizes the density function [20]. Dasgupta and Kpotufe [5] show that given n draws from a density, the sample point which maximizes the k-NN density estimator is close to the true maximizer of the density.

3 k-NN Regression

Throughout the paper, we assume a function f over R^D with compact support 𝒳, and that we have datapoints (x_1, y_1), ..., (x_n, y_n) drawn as follows. The x_i's are drawn i.i.d. from a density p_X with support 𝒳. Then y_i = f(x_i) + ε_{x_i}, where the ε_{x_i} are drawn i.i.d. according to a random variable ε.

Definition 1. f : 𝒳 → R, where 𝒳 ⊆ R^D is compact.
The first regularity assumption ensures that the support 𝒳 does not become arbitrarily thin anywhere. Otherwise, it becomes impossible to estimate the function in such areas from a random sample.

Assumption 1 (Support Regularity). There exist λ > 0 and r_0 > 0 such that Vol(𝒳 ∩ B(x, r)) ≥ λ · Vol(B(x, r)) for all x ∈ 𝒳 and 0 < r < r_0.

The next regularity assumption ensures that with a sufficiently large sample, we will obtain a good covering of the input space.

Assumption 2 (p_X bounded from below). p_{X,0} := inf_{x ∈ 𝒳} p_X(x) > 0.

Finally, we have a standard sub-Gaussian white noise assumption in our additive model.

Assumption 3 (Sub-Gaussian White Noise). ε satisfies E[ε] = 0 and ε is sub-Gaussian with parameter σ² (i.e. E[exp(λε)] ≤ exp(λ²σ²/2) for all λ ∈ R).

Then define k-NN regression as follows.

Definition 2 (k-NN). Let the k-NN radius of x ∈ 𝒳 be r_k(x) := inf{r : |B(x, r) ∩ X| ≥ k}, where B(x, r) := {x' ∈ 𝒳 : |x − x'| ≤ r}, and let the k-NN set of x ∈ 𝒳 be N_k(x) := B(x, r_k(x)) ∩ X. Then, for all x ∈ 𝒳, the k-NN regression function with respect to the samples is defined as

    f_k(x) := (1 / |N_k(x)|) Σ_{i=1}^n y_i · 1[x_i ∈ N_k(x)].
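For concreteness, the following is a minimal NumPy sketch of the estimator in Definition 2 (the function name, the brute-force distance computation, and the synthetic data are illustrative choices and not part of the paper; ties at the k-th distance are handled by including every sample within the k-NN radius, as in the definition of N_k(x)):

    import numpy as np

    def knn_regress(x_query, X, y, k):
        """Unweighted k-NN regression estimate f_k at a single query point.

        X : (n, D) array of sample locations; y : (n,) array of noisy observations.
        Following Definition 2, the k-NN set consists of every sample within the
        k-NN radius r_k(x), so ties at the k-th distance are all included.
        """
        dists = np.linalg.norm(X - x_query, axis=1)  # |x_i - x| for all samples
        r_k = np.sort(dists)[k - 1]                  # k-NN radius r_k(x)
        in_Nk = dists <= r_k                         # indicator of N_k(x)
        return y[in_Nk].mean()                       # unweighted average of responses

    # Tiny usage example on synthetic data (illustrative only):
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(500, 2))
    f = lambda x: np.sin(2 * np.pi * x[:, 0]) + x[:, 1]
    y = f(X) + 0.1 * rng.standard_normal(500)        # y = f(x) + noise
    print(knn_regress(np.array([0.5, 0.5]), X, y, k=25))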
Next, we define the following pointwise modulus of continuity, which will be used to express the bias for an arbitrary function in the later result.

Definition 3 (Modulus of continuity).

    u_f(x, r) := sup_{x' ∈ B(x, r)} |f(x) − f(x')|.

We now state our main result about k-NN regression. Informally, it says that under the mild assumptions described above, for k ≳ log n,

    |f_k(x) − f(x)| ≲ u_f(x, (k/n)^{1/D}) + √((log n)/k)

uniformly in x ∈ 𝒳 with high probability.

The first term corresponds to the bias. Using uniform VC-type concentration bounds, it can be shown that the k-NN radius is uniformly bounded by a distance of approximately (k/n)^{1/D}, and hence no point in the k-NN set is farther than that. The bias can then be expressed in terms of that distance and u_f.

The second term corresponds to the variance. The 1/√k factor is not surprising, since the noise terms are averaged over k observations, and the extra √(log n) factor comes from the cost of obtaining a uniform bound.

Definition 4. Let v_D be the volume of a D-dimensional unit ball.

Theorem 1 (k-NN Regression Rate). Suppose that Assumptions 1, 2, and 3 hold and that

    2^8 · D log²(4/δ) · log n ≤ k ≤ (1/2) λ v_D r_0^D n.

Then with probability at least 1 − δ, the following holds uniformly in x ∈ 𝒳:

    |f(x) − f_k(x)| ≤ u_f(x, (2k/(λ p_{X,0} v_D n))^{1/D}) + 2σ √((D log n + log(2/δ))/k).

Note that the above result is fairly general and makes no smoothness assumptions. In particular, f need not even be continuous. We can then apply this to the class of Hölder continuous functions to obtain the following result.

Corollary 1 (Rate for Hölder continuous functions). Suppose that Assumptions 1, 2, and 3 hold and that

    2^8 · D log²(4/δ) · log n ≤ k ≤ (1/2) λ v_D r_0^D n.

If f is Hölder continuous (i.e. |f(x) − f(x')| ≤ C_α |x − x'|^α), then the following holds:

    P( sup_{x ∈ 𝒳} |f(x) − f_k(x)| ≤ C_α (2k/(λ p_{X,0} v_D n))^{α/D} + 2σ √((D log n + log(2/δ))/k) ) ≥ 1 − δ.

Remark 1. Taking k = O(n^{2α/(2α+D)}) gives us a rate of

    sup_{x ∈ 𝒳} |f(x) − f_k(x)| ≲ Õ(n^{−α/(2α+D)}),

which is the minimax optimal rate for estimating a Hölder function, up to logarithmic factors.
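As an informal illustration of where the choice of k in Remark 1 comes from (a heuristic sketch, not part of the formal argument), ignore constants and logarithmic factors in Corollary 1: the bias term scales as (k/n)^{α/D} and the variance term as k^{−1/2}. Balancing the two,

    (k/n)^{α/D} ≍ k^{−1/2}  ⟹  k^{α/D + 1/2} ≍ n^{α/D}  ⟹  k ≍ n^{2α/(2α+D)},

and substituting back into either term gives a sup-norm error of order n^{−α/(2α+D)}, matching Remark 1 up to logarithmic factors. This choice of k also satisfies the k ≳ log n requirement for n large.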
Remark 2. It is understood that all our results also hold under the assumption that the x_i's are fixed and deterministic (e.g. on a grid), as long as there is a sufficient covering of the space.

4 Level Set Estimation

The level set is the region of the input space where the function value is greater than a fixed threshold.

Definition 5 (Level-Set).

    L_f(λ) := {x ∈ 𝒳 : f(x) ≥ λ}.

In order to be able to estimate the level sets with rate guarantees, we make the following regularity assumption. It states that for each maximal connected component of the level set, the change in the function around the boundary has a Lipschitz-type form with smoothness exponent β and curvature constants Č, Ĉ > 0 in some neighborhood of the boundary. This notion of regularity at the boundaries of the level sets is a standard one in density level-set estimation, e.g. Tsybakov et al. [21], Singh et al. [18].

Definition 6 (Level-Set Regularity). Let d(x, C) := inf_{x' ∈ C} |x − x'|, let ∂C be the boundary of C, and let C ⊕ r := {x : d(x, C) ≤ r}. A function f satisfies β-regularity at level λ if the following holds: there exist r_M, Č, Ĉ > 0 such that for each maximal connected subset C ⊆ L_f(λ), we have

    Č · d(x, ∂C)^β ≤ |λ − f(x)| ≤ Ĉ · d(x, ∂C)^β

for all x ∈ ∂C ⊕ r_M.

Remark 3. The upper bound on |λ − f(x)| ensures that f is sufficiently smooth so that k-NN regression will give us sufficiently accurate estimates near the boundaries. The lower bound on |λ − f(x)| ensures that the level set is salient enough to be detected.

To recover L_f(λ) based on the samples, we use the following estimator, where X = {x_1, ..., x_n}:

    L̂_f(λ) := {x ∈ X : f_k(x) ≥ λ}.

There are two simple but key differences from L_f(λ). The first is that since we don't have access to the true function f, we use the k-NN regression estimate f_k. Second, instead of taking x ∈ 𝒳, we restrict to the samples X. This is crucial, as it allows our estimator to be practical.

We provide consistency results under the Hausdorff metric, defined as follows.

Definition 7.

    d_H(X, Y) = inf{ε ≥ 0 : X ⊆ Y ⊕ ε, Y ⊆ X ⊕ ε}.

The next result shows that given an estimate of f that is uniformly bounded by ε, the level sets of f can be recovered at a rate of ε^{1/β}.

Theorem 2 ((Super)-Level Set Recovery). Let f be continuous and satisfy β-regularity at level λ. Suppose f_k satisfies

    sup_{x ∈ 𝒳} |f(x) − f_k(x)| ≤ ε,

where 0 < ε < (1/2) Č min{r_M, r_0}^β. If

    n ≥ 16 (2Ĉ)^{D/β} log(4/δ) D log n / (λ v_D p_{X,0} ε^{D/β}),

then with probability at least 1 − δ,

    d_H(L_f(λ), L̂_f(λ)) ≤ 2 (2ε/Č)^{1/β}.

Remark 4. Choosing k at the optimal setting k ≈ n^{2β/(2β+D)}, we have ε = Õ(n^{−β/(2β+D)}). It then follows that we recover the level sets at a Hausdorff rate of Õ(n^{−1/(2β+D)}). This can be compared to the lower bound O(n^{−1/(2β+D)}) established by Tsybakov et al. [21] for estimating the level sets of an unknown density.

Next, we recover the exact level sets, defined as follows.

Definition 8.

    L̄_f(λ) := {x ∈ 𝒳 : f(x) = λ}.

We use the following estimator:

    L̄̂_f(λ) := {x ∈ X : λ − 2ε < f_k(x) < λ + 2ε}.

The next result states that L̄_f(λ) can also be recovered with Hausdorff guarantees.

Theorem 3 ((Exact)-Level Set Recovery). Let f be continuous and satisfy β-regularity at level λ. Suppose f_k satisfies

    sup_{x ∈ 𝒳} |f(x) − f_k(x)| ≤ ε,

where 0 < ε < (1/3) Č min{r_M, r_0}^β. If

    n ≥ 16 (2Ĉ)^{D/β} log(4/δ) D log n / (λ v_D p_{X,0} ε^{D/β}),

then with probability at least 1 − δ,

    d_H(L̄_f(λ), L̄̂_f(λ)) ≤ (3ε/Č)^{1/β}.

It must be noted that the exact level set has a lower dimension than D, yet we can still recover it with strong Hausdorff guarantees.
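As a practical illustration of how these plug-in estimators can be computed from data, the sketch below evaluates the k-NN regression values at the sample points, forms the super-level and exact level set estimates, and measures the Hausdorff distance between finite point sets. The function names, the brute-force pairwise distances, and the choice of the tolerance eps are illustrative assumptions, not prescriptions from the paper:

    import numpy as np

    def knn_values(X, y, k):
        """k-NN regression estimates f_k(x_i) at every sample point (brute force)."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
        fk = np.empty(len(X))
        for i, d in enumerate(D):
            r_k = np.sort(d)[k - 1]          # k-NN radius at x_i
            fk[i] = y[d <= r_k].mean()       # unweighted k-NN average
        return fk

    def level_set_estimates(X, fk, lam, eps):
        """Plug-in estimates of the super-level set {f >= lam} and the exact
        level set {f = lam}, restricted to the samples as in Section 4."""
        super_hat = X[fk >= lam]
        exact_hat = X[(fk > lam - 2 * eps) & (fk < lam + 2 * eps)]
        return super_hat, exact_hat

    def hausdorff(A, B):
        """Hausdorff distance between two finite, non-empty point sets (Definition 7)."""
        D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
        return max(D.min(axis=1).max(), D.min(axis=0).max())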
5 Global Maxima Estimation

In this section, we give guarantees on estimating the global maxima of f.

Definition 9. x_0 is a maximum of f if f(x) < f(x_0) for all x ∈ B(x_0, r)\{x_0}, for some r > 0.

We then make the following assumption, which states that f has a unique maximum, at which it has a negative-definite Hessian.

Assumption 4. f has a unique maximum x_0 := argmax_{x ∈ 𝒳} f(x), and f has a negative-definite Hessian at x_0.
These assumptions lead to the following result, which states that f has quadratic smoothness and decay around x_0.

Lemma 1 (Dasgupta and Kpotufe [5]). Let f satisfy Assumption 4. Then there exist Č, Ĉ, r_M, λ > 0 such that the following holds:

    Č |x_0 − x|² ≤ f(x_0) − f(x) ≤ Ĉ |x_0 − x|²

for all x ∈ A_0, where A_0 is a connected component of {x : f(x) ≥ λ} and A_0 contains B(x_0, r_M).

We utilize the following estimator, which is the maximizer of f_k amongst the sample points X = {x_1, ..., x_n}:

    x̂ := argmax_{x ∈ X} f_k(x).

We next give a result on the accuracy of x̂ in estimating x_0.

Theorem 4. Suppose that f is continuous and that Assumptions 1, 2, 3, and 4 hold. Let k satisfy

    k ≥ 2^{10} D log²(4/δ) log n / min{1, Č² r_M^4 / σ²},
    k ≤ (1/2) λ v_D min{r_0^D, (Č r_M² / (32 Ĉ))^{D/2}} n.

Then the following holds with probability at least 1 − δ:

    |x̂ − x_0|² ≤ max{ (32σ/Č) √((D log n + log(2/δ))/k), (32Ĉ/Č) (2k/(λ p_{X,0} v_D n))^{2/D} }.

Remark 5. Taking k ≈ n^{4/(4+D)} optimizes the above expression, so that |x̂ − x_0| ≲ Õ(n^{−1/(4+D)}). This can be compared to the minimax rate for mode estimation, O(n^{−1/(4+D)}), established by Tsybakov [20].

Remark 6. An analogue for global minima also holds.
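A minimal sketch of this estimator is given below; it simply evaluates f_k at every sample point and returns the maximizing sample. The function name and the synthetic example are illustrative and not taken from the paper:

    import numpy as np

    def knn_argmax(X, y, k):
        """Return the sample point maximizing the k-NN regression estimate,
        i.e. the estimator x_hat = argmax_{x in X} f_k(x) from Section 5."""
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        fk = np.array([y[d <= np.sort(d)[k - 1]].mean() for d in D])
        return X[np.argmax(fk)], fk.max()

    # Illustrative use: noisy evaluations of a peaked function at random inputs.
    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(2000, 2))
    y = -np.sum(X**2, axis=1) + 0.05 * rng.standard_normal(len(X))
    x_hat, _ = knn_argmax(X, y, k=50)
    print(x_hat)  # should land close to the true maximizer (0, 0)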
6 Regression On Manifolds

In this section, we show that if the data has a lower intrinsic dimension, then k-NN automatically attains rates as if it were in the lower-dimensional space, independent of the ambient dimension.

We make the following regularity assumptions, which are standard among works in manifold learning, e.g. [8, 1].

Assumption 5. P is supported on M, where:
- M is a d-dimensional smooth compact Riemannian manifold without boundary embedded in a compact subset 𝒳 ⊆ R^D.
- The volume of M is bounded above by a constant.
- M has condition number 1/τ, which controls the curvature and prevents self-intersection.

Let p_X be the density of P with respect to the uniform measure on M.

We now give the manifold analogues of Theorem 1 and Corollary 1.

Theorem 5 (k-NN Regression Rate). Suppose that Assumptions 2, 3, and 5 hold and that

    k ≥ 2^8 D log²(4/δ) · log n,
    k ≤ (1/4) min{τ/(4d), 1/τ}^d p_{X,0} v_d n.

Then with probability at least 1 − δ, the following holds uniformly in x ∈ 𝒳:

    |f(x) − f_k(x)| ≤ u_f(x, (4k/(v_d n p_{X,0}))^{1/d}) + 2σ √((D log n + log(2/δ))/k).

Similar to the full-dimensional case, we can then apply this to the class of Hölder continuous functions.

Corollary 2 (Rate for Hölder continuous functions). Suppose that Assumptions 2, 3, and 5 hold and that

    k ≥ 2^8 D log²(4/δ) · log n,
    k ≤ (1/4) min{τ/(4d), 1/τ}^d p_{X,0} v_d n.

If f is Hölder continuous (i.e. |f(x) − f(x')| ≤ C_α |x − x'|^α), then the following holds:

    P( sup_{x ∈ 𝒳} |f(x) − f_k(x)| ≤ C_α (4k/(v_d n p_{X,0}))^{α/d} + 2σ √((D log n + log(2/δ))/k) ) ≥ 1 − δ.

Remark 7. Taking k = O(n^{2α/(2α+d)}) gives us a rate of Õ(n^{−α/(2α+d)}), which is more attractive than the full-dimensional version Õ(n^{−α/(2α+D)}) when the intrinsic dimension d is lower than the ambient dimension D.
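As a purely illustrative numerical comparison (the numbers are ours, not the paper's): with α = 1, intrinsic dimension d = 2, and ambient dimension D = 100, the manifold rate is n^{−1/4} while the full-dimensional rate is n^{−1/102}. Driving the sup-norm error below 0.1 then requires on the order of 10^4 samples in the first case but on the order of 10^{102} in the second, ignoring constants and logarithmic factors.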
7 Proofs

7.1 Supporting Results

Suppose that P is the distribution corresponding to p_X, and let P_n be the empirical distribution with respect to x_1, ..., x_n.
We need the following result, which gives guarantees on the masses of empirical balls with respect to the masses of the corresponding true balls.

Lemma 2 (Chaudhuri and Dasgupta [3]). Pick 0 < δ < 1. Assume that k ≥ D log n. Then with probability at least 1 − δ, for every ball B ⊆ R^D we have

    P(B) ≥ C_{δ,n} · (D log n)/n  ⟹  P_n(B) > 0,
    P(B) ≥ k/n + C_{δ,n} · √k/n  ⟹  P_n(B) ≥ k/n,
    P(B) ≤ k/n − C_{δ,n} · √k/n  ⟹  P_n(B) < k/n,

where C_{δ,n} := 16 log(2/δ) √(D log n).

7.2 Proof for k-NN Regression

The next result bounds r_k(x) uniformly in x ∈ 𝒳.

Lemma 3. The following holds with probability at least 1 − δ/2. If

    2^8 · D log²(4/δ) · log n ≤ k ≤ (1/2) λ v_D r_0^D n,

then

    sup_{x ∈ 𝒳} r_k(x) ≤ (2k/(λ v_D n p_{X,0}))^{1/D}.

Proof. Let r = (2k/(λ v_D n p_{X,0}))^{1/D}. We have

    P(B(x, r)) ≥ λ · inf_{x' ∈ B(x,r) ∩ 𝒳} p_X(x') · v_D r^D ≥ λ p_{X,0} v_D r^D = 2k/n.

By Lemma 2 and the condition on k, it follows that with probability 1 − δ/2, uniformly in x ∈ 𝒳, P_n(B(x, r)) ≥ k/n. Hence, r_k(x) < r and the result follows immediately.

The next result bounds the number of distinct k-NN sets over 𝒳.

Lemma 4. Let M be the number of distinct k-NN sets over 𝒳, that is, M := |{N_k(x) : x ∈ 𝒳}|. Then M ≤ D n^D.

Proof. First, let A be the partitioning of 𝒳 induced by the (n choose 2) hyperplanes defined as the perpendicular bisectors of each pair of points x_i, x_j with i ≠ j. Let us denote this set of hyperplanes as H. We have that if x, x' are in the same cell of A, then N_k(x) = N_k(x'): if not, then any path from x to x' must cross the perpendicular bisector of some pair of points in N_k(x) ∪ N_k(x'), which would be a contradiction. Thus, M ≤ |A|.

Now we will bound |A|. Since H is finite, choose vectors e_1, ..., e_D such that they form an orthogonal basis of R^D and none of these vectors is perpendicular to any H ∈ H. Let e_1, ..., e_D induce hyperplanes H_1, ..., H_D, respectively (i.e. H_i being the orthogonal complement of e_i). Without loss of generality, orient the space so that e_1 is the vertical direction (so that we can use descriptions such as "above" and "below"). For each region in A that is bounded below, associate the region with its lowest point. It then follows that there are at most (n choose D) of these regions, since each such lowest point is the intersection of D hyperplanes.

We next count the regions unbounded below. Place H_1 below the lowest point corresponding to the regions in A that were bounded below. Then the regions unbounded below are {A ∈ A : A ∩ H_1 ≠ ∅}. It thus remains to count A_1 := {A ∩ H_1 : A ∈ A, A ∩ H_1 ≠ ∅}.

We now orient the space so that e_2 corresponds to the vertical direction. Then we can repeat the same procedure, associating each region in A_1 that is bounded below with its lowest point. There are at most (n choose D−1) of these, since each is an intersection of D − 1 hyperplanes in H together with H_1; then, placing H_2 sufficiently low, the remaining regions correspond to A_2 := {A ∩ H_1 ∩ H_2 : A ∈ A, A ∩ H_1 ∩ H_2 ≠ ∅}.

Continuing this process, it follows that when we orient e_i to be the vertical direction, in order to count A_i := {A ∩ H_1 ∩ ··· ∩ H_i : A ∈ A, A ∩ H_1 ∩ ··· ∩ H_i ≠ ∅}, the number of regions in A_i bounded below is at most (n choose D−i), and the remaining ones correspond to A_{i+1}.

It thus follows that |A| ≤ Σ_{j=0}^D (n choose j) ≤ D n^D, as desired.
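The polynomial bound in Lemma 4 can be checked empirically on small instances. The sketch below (entirely illustrative; the dense random query grid is only a proxy for the continuum of query points) counts the distinct k-NN sets that actually occur and compares the count against D · n^D:

    import numpy as np

    rng = np.random.default_rng(0)
    n, D, k = 8, 2, 3
    X = rng.uniform(size=(n, D))            # the sample points
    queries = rng.uniform(size=(20000, D))  # dense proxy for all x in the support

    distinct_knn_sets = set()
    for q in queries:
        d = np.linalg.norm(X - q, axis=1)
        r_k = np.sort(d)[k - 1]             # k-NN radius at the query point
        distinct_knn_sets.add(frozenset(np.flatnonzero(d <= r_k)))

    print(len(distinct_knn_sets), "distinct k-NN sets; Lemma 4 bound:", D * n**D)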
Proof of Theorem 1. We have

    |f_k(x) − f(x)| = | (1/|N_k(x)|) Σ_{i=1}^n (ε_{x_i} + f(x_i) − f(x)) · 1[x_i ∈ N_k(x)] |
        ≤ | (1/|N_k(x)|) Σ_{i=1}^n (f(x_i) − f(x)) · 1[x_i ∈ N_k(x)] | + | (1/|N_k(x)|) Σ_{i=1}^n ε_{x_i} · 1[x_i ∈ N_k(x)] |
        ≤ u_f(x, r_k(x)) + | (1/|N_k(x)|) Σ_{i=1}^n ε_{x_i} · 1[x_i ∈ N_k(x)] |.

The first term can be viewed as the bias term and the second as the variance term.

By Lemma 3, we can bound the first term as follows, with probability at least 1 − δ/2, uniformly in x ∈ 𝒳:

    u_f(x, r_k(x)) ≤ u_f(x, (2k/(λ p_{X,0} v_D n))^{1/D}).

For the variance term, we have by Hoeffding's inequality that if

    A_x := (1/k) Σ_{i=1}^n ε_{x_i} · 1[x_i ∈ N_k(x)],

then

    P( |A_x| > σ√2 · t/√k ) ≤ exp(−t²).

Taking t = √(D log n + log(2D/δ)), we have

    P( |A_x| > σ√2 · t/√k ) ≤ δ/(2D n^D).    (1)

By Lemma 4 and a union bound, it follows that

    P( sup_{x ∈ 𝒳} |A_x| > σ√2 · t/√k ) ≤ δ/2.

Hence, with probability at least 1 − δ,

    |f(x) − f_k(x)| ≤ u_f(x, (2k/(λ p_{X,0} v_D n))^{1/D}) + 2σ √((D log n + log(2/δ))/k)

uniformly in x ∈ 𝒳.

In fact, it is easy to see that a simple modification to the proof yields the following.

Corollary 3 (k-NN Regression Upper and Lower Bounds). Let

    ũ_f(x, r) := sup_{x' ∈ B(x,r)} f(x') − f(x),
    ǔ_f(x, r) := sup_{x' ∈ B(x,r)} f(x) − f(x'),
    ε_var := 2σ √((D log n + log(2/δ))/k),
    ρ_k := (2k/(λ p_{X,0} v_D n))^{1/D}.

Suppose that Assumptions 1, 2, and 3 hold and that

    k ≥ 2^8 D log²(4/δ) log n.

Then with probability at least 1 − δ, the following holds uniformly in x ∈ 𝒳:

    f_k(x) ≤ f(x) + ũ_f(x, ρ_k) + ε_var,
    f_k(x) ≥ f(x) − ǔ_f(x, ρ_k) − ε_var.

7.3 Proofs for Level Set Estimation

Proof of Theorem 2. Let r := (2ε/Č)^{1/β}. We have

    sup_{x ∈ 𝒳\(L_f(λ) ⊕ r)} f_k(x) ≤ λ − Č r^β + ε < λ.

This shows that L̂_f(λ) ⊆ L_f(λ) ⊕ r ⊆ L_f(λ) ⊕ 2r. It remains to show the other direction, that

    L_f(λ) ⊆ L̂_f(λ) ⊕ 2r.

Define r̃ := (ε/(2Ĉ))^{1/β}. Since d_H(L_f(λ), L_f(λ + 2ε)) ≤ r and r̃ < r, it suffices to show that

    L_f(λ + 2ε) ⊆ L̂_f(λ) ⊕ r̃.

To do this, it is enough to show that for all x ∈ L_f(λ + 2ε) the following hold: (1)

    P_n(B(x, r̃)) > 0,

and (2) any x' ∈ B(x, r̃) ∩ X satisfies f_k(x') ≥ λ. We have

    P(B(x, r̃)) ≥ λ v_D r̃^D p_{X,0} ≥ 16 log(4/δ) D log n / n,

where the last inequality holds by the condition on n. Thus, by Lemma 2, P_n(B(x, r̃)) > 0. Finally, we have

    inf_{x' ∈ B(x, r̃)} f_k(x') ≥ λ + 2ε − Ĉ r̃^β − ε > λ,

as desired.

Proof of Theorem 3. Let r := (3ε/Č)^{1/β}. We have

    sup_{x ∈ (𝒳\(L̄_f(λ) ⊕ r)) \ L_f(λ)} f_k(x) ≤ sup_{x ∈ (𝒳\(L̄_f(λ) ⊕ r)) \ L_f(λ)} f(x) + ε ≤ λ − Č r^β + ε ≤ λ − 2ε.

Next,

    inf_{x ∈ (𝒳\(L̄_f(λ) ⊕ r)) ∩ L_f(λ)} f_k(x) ≥ inf_{x ∈ (𝒳\(L̄_f(λ) ⊕ r)) ∩ L_f(λ)} f(x) − ε ≥ λ + Č r^β − ε ≥ λ + 2ε.

It follows that L̄̂_f(λ) ⊆ L̄_f(λ) ⊕ r. It now remains to show that L̄_f(λ) ⊆ L̄̂_f(λ) ⊕ r. Define r̃ := (ε/(2Ĉ))^{1/β}. Since r̃ < r, it suffices to show that for every x ∈ L̄_f(λ): (1)

    P_n(B(x, r̃)) > 0,

and (2) every x' ∈ B(x, r̃) ∩ X satisfies λ − 2ε < f_k(x') < λ + 2ε.
We have

    P(B(x, r̃)) ≥ λ v_D r̃^D p_{X,0} ≥ 16 log(4/δ) D log n / n,

where the last inequality holds by the condition on n. Thus, by Lemma 2, P_n(B(x, r̃)) > 0. Finally, we have

    inf_{x' ∈ B(x, r̃)} f_k(x') ≥ λ − Ĉ r̃^β − ε > λ − 2ε,

and

    sup_{x' ∈ B(x, r̃)} f_k(x') ≤ λ + Ĉ r̃^β + ε < λ + 2ε,

as desired.

7.4 Proof of Global Maxima Estimation

Proof of Theorem 4. Define the following:

    ε_var := 2σ √((D log n + log(2/δ))/k),
    ρ_k := (2k/(λ p_{X,0} v_D n))^{1/D},
    r² := max{16 ε_var/Č, (2ρ_k/c)²},

where c² = Č/(8Ĉ). The goal is now to show that |x̂ − x_0| ≤ r. The proof mirrors that of Theorem 1 of Dasgupta and Kpotufe [5]. It suffices to show that

    sup_{x ∈ 𝒳\B(x_0, r)} f_k(x) < inf_{x ∈ B(x_0, r_n)} f_k(x),

where r_n = d(x_0, X). We have by Corollary 3:

    sup_{x ∈ 𝒳\B(x_0, r)} f_k(x)
        ≤ sup_{x ∈ 𝒳\B(x_0, r)} f(x) + ũ_f(x, ρ_k) + ε_var
        ≤ sup_{x ∈ 𝒳\B(x_0, r)} f(x) + ũ_f(x, r/2) + ε_var
        ≤ sup_{x ∈ 𝒳\B(x_0, r/2)} f(x) + ε_var
        ≤ f(x_0) − Č(r/2)² + ε_var.

On the other hand,

    inf_{x ∈ B(x_0, r_n)} f_k(x)
        ≥ inf_{x ∈ B(x_0, r_n)} f(x) − ǔ_f(x, ρ_k) − ε_var
        ≥ inf_{x ∈ B(x_0, cr/2)} f(x) − ǔ_f(x, cr/2) − ε_var
        ≥ inf_{x ∈ B(x_0, cr)} f(x) − ε_var
        ≥ f(x_0) − Ĉ(cr)² − ε_var.

The result now follows from our choice of r.

7.5 Proof of Regression on Manifolds

We need the following guarantee on the volume of the intersection of a Euclidean ball and M; this is required to get a handle on the true mass of the ball under P in later arguments. The proof can be found in [9].

Lemma 5 (Ball Volume). If 0 < r < min{τ/(4d), 1/τ} and x ∈ M, then

    1 − τ²r² ≤ vol_d(B(x, r) ∩ M) / (v_d r^d) ≤ 1 + 4d·r/τ,

where vol_d is the volume with respect to the uniform measure on M.

The next result is the manifold analogue of Lemma 3.

Lemma 6. Suppose that Assumptions 2, 3, and 5 hold. The following holds with probability at least 1 − δ/2. If

    k ≥ 2^8 D log²(4/δ) log n,
    k ≤ (1/4) min{τ/(4d), 1/τ}^d p_{X,0} v_d n,

then for all x ∈ M,

    r_k(x) ≤ (4k/(v_d n p_{X,0}))^{1/d}.

Proof. Let r = (4k/(v_d n p_{X,0}))^{1/d}. We have

    P(B(x, r)) ≥ inf_{x' ∈ B(x,r) ∩ M} p_X(x') · vol_d(B(x, r) ∩ M)
        ≥ p_{X,0} (1 − τ²r²) v_d r^d ≥ (1/2) p_{X,0} v_d r^d = 2k/n.

By Lemma 2 and the condition on k, it follows that with probability 1 − δ/2, uniformly in x ∈ 𝒳, P_n(B(x, r)) ≥ k/n. Hence, r_k(x) < r and the result follows immediately.

Theorem 5 now follows by replacing the usage of Lemma 3 with Lemma 6.

Acknowledgments

I am grateful to Fedor Petrov [17] and Calvin Lin [15] for insightful public posts on MathOverflow and Mathematics Stack Exchange, which played an important role in some of the proofs. I would also like to thank Melody Guan for proofreading earlier drafts.
References

[1] Sivaraman Balakrishnan, Srivatsan Narayanan, Alessandro Rinaldo, Aarti Singh, and Larry Wasserman. Cluster trees on manifolds. In Advances in Neural Information Processing Systems, pages 2679-2687, 2013.

[2] Gérard Biau, Frédéric Cérou, and Arnaud Guyader. Rates of convergence of the functional k-nearest neighbor estimate. IEEE Transactions on Information Theory, 56(4):2034-2040, 2010.

[3] Kamalika Chaudhuri and Sanjoy Dasgupta. Rates of convergence for the cluster tree. In Advances in Neural Information Processing Systems, pages 343-351, 2010.

[4] Philip E Cheng. Strong consistency of nearest neighbor regression function estimators. Journal of Multivariate Analysis, 15(1):63-72, 1984.

[5] Sanjoy Dasgupta and Samory Kpotufe. Optimal rates for k-NN density and mode estimation. In Advances in Neural Information Processing Systems, pages 2555-2563, 2014.

[6] Luc Devroye. The uniform convergence of nearest neighbor regression function estimators and their application in optimization. IEEE Transactions on Information Theory, 24(2):142-151, 1978.

[7] Luc Devroye, László Györfi, Adam Krzyzak, and Gábor Lugosi. On the strong universal consistency of nearest neighbor regression function estimates. The Annals of Statistics, pages 1371-1385, 1994.

[8] Christopher Genovese, Marco Perone-Pacifico, Isabella Verdinelli, and Larry Wasserman. Minimax manifold estimation. Journal of Machine Learning Research, 13(May):1263-1291, 2012.

[9] Heinrich Jiang. Density level set estimation on manifolds with DBSCAN. International Conference on Machine Learning (ICML), 2017.

[10] Heinrich Jiang. Uniform convergence rates for kernel density estimation. International Conference on Machine Learning (ICML), 2017.

[11] Samory Kpotufe. k-NN regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, pages 729-737, 2011.

[12] Nadia L Kudraszow and Philippe Vieu. Uniform consistency of kNN regressors for functional variables. Statistics & Probability Letters, 83(8):1863-1870, 2013.

[13] Sanjeev R Kulkarni and Steven E Posner. Rates of convergence of nearest neighbor estimation under arbitrary sampling. IEEE Transactions on Information Theory, 41(4):1028-1039, 1995.

[14] Heng Lian et al. Convergence of functional k-nearest neighbor regression estimate with functional responses. Electronic Journal of Statistics, 5:31-40, 2011.

[15] Calvin Lin. How many resulting regions if we partition R^m with n hyperplanes? Mathematics Stack Exchange. URL https://math.stackexchange.com/q/409642 (version: 2013-06-02).

[16] Yue-pok Mack and Bernard W Silverman. Weak and strong uniform consistency of kernel regression estimates. Probability Theory and Related Fields, 61(3):405-415, 1982.

[17] Fedor Petrov. Bounding number of k-nearest neighbor sets in R^d. MathOverflow. URL https://mathoverflow.net/q/272418 (version: 2017-06-17).

[18] Aarti Singh, Clayton Scott, Robert Nowak, et al. Adaptive Hausdorff estimation of density level sets. The Annals of Statistics, 37(5B):2760-2782, 2009.

[19] Charles J Stone. Consistent nonparametric regression. The Annals of Statistics, pages 595-620, 1977.

[20] Aleksandr Borisovich Tsybakov. Recursive estimation of the mode of a multivariate distribution. Problemy Peredachi Informatsii, 26(1):38-45, 1990.

[21] Alexandre B Tsybakov et al. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3):948-969, 1997.

[22] Rebecca M Willett and Robert D Nowak. Minimax optimal level-set estimation. IEEE Transactions on Image Processing, 16(12):2965-2979, 2007.
