
Nonparametrically Consistent Depth-based Classifiers

Davy Paindaveine and Germain Van Bever

Université Libre de Bruxelles, Brussels, Belgium
Abstract

We introduce a class of depth-based classification procedures that are of a nearest-neighbor nature. Depth, after symmetrization, indeed provides the center-outward ordering that is necessary and sufficient to define nearest neighbors. The resulting classifiers are affine-invariant and inherit the nonparametric validity from nearest-neighbor classifiers. In particular, we prove that the proposed depth-based classifiers are consistent under very mild conditions. We investigate their finite-sample performances through simulations and show that they outperform affine-invariant nearest-neighbor classifiers obtained through an obvious standardization construction. We illustrate the practical value of our classifiers on two real data examples. Finally, we briefly discuss the possible uses of our depth-based neighbors in other inference problems.

Keywords and phrases: Affine-invariance, Classification procedures, Nearest neighbors, Statistical depth functions, Symmetrization.

Davy Paindaveine is Professor of Statistics, Universite Libre de Bruxelles, ECARES and


Departement de Mathematique, Avenue F. D. Roosevelt, 50, CP 114/04, B-1050 Bruxelles, Belgium
(E-mail: dpaindav@ulb.ac.be). He is also member of ECORE, the association between CORE and
ECARES. Germain Van Bever is FNRS PhD candidate, Universite Libre de Bruxelles, ECARES
and Departement de Mathematique, Campus de la Plaine, Boulevard du Triomphe, CP 210, B-1050
Bruxelles, Belgium (E-mail: gvbever@ulb.ac.be). This work was supported by an A.R.C. contract
and a FNRS Aspirant contract, Communaute francaise de Belgique.
1
arXiv:1204.2996v1 [math.ST] 13 Apr 2012
1 INTRODUCTION
The main focus of this work is on the standard classification setup in which the observation, of the form $(X, Y)$, is a random vector taking values in $\mathbb{R}^d \times \{0, 1\}$. A classifier is a function $m : \mathbb{R}^d \to \{0, 1\}$ that associates with any value $x$ a predictor for the corresponding class $Y$. Denoting by $I_A$ the indicator function of the set $A$, the so-called Bayes classifier, defined through
$$m_{\mathrm{Bayes}}(x) = I\big[\eta(x) > 1/2\big], \quad \text{with } \eta(x) = P[Y = 1 \mid X = x], \qquad (1.1)$$
is optimal in the sense that it minimizes the probability of misclassification $P[m(X) \neq Y]$. Under absolute continuity assumptions, the Bayes rule rewrites
$$m_{\mathrm{Bayes}}(x) = I\Big[\frac{f_1(x)}{f_0(x)} > \frac{\pi_0}{\pi_1}\Big], \qquad (1.2)$$
where $\pi_j = P[Y = j]$ and $f_j$ denotes the pdf of $X$ conditional on $[Y = j]$. Of course, empirical classifiers $m^{(n)}$ are obtained from i.i.d. copies $(X_i, Y_i)$, $i = 1, \ldots, n$, of $(X, Y)$, and it is desirable that such classifiers are consistent, in the sense that, as $n \to \infty$, the probability of misclassification of $m^{(n)}$, conditional on $(X_i, Y_i)$, $i = 1, \ldots, n$, converges in probability to the probability of misclassification of the Bayes rule. If this convergence holds irrespective of the distribution of $(X, Y)$, the consistency is said to be universal.
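As a quick illustration of (1.1)-(1.2), the following minimal sketch (ours, not from the paper; the class-conditional densities and priors below are hypothetical choices) evaluates the Bayes rule when the two class densities are known univariate Gaussians.

```python
import numpy as np

def npdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical class-conditional densities and priors (illustration only).
pi0, pi1 = 0.5, 0.5
f0 = lambda x: npdf(x, 0.0, 1.0)   # density of X given Y = 0
f1 = lambda x: npdf(x, 1.0, 2.0)   # density of X given Y = 1

def m_bayes(x):
    """Bayes rule (1.2): classify into Population 1 iff f1(x)/f0(x) > pi0/pi1."""
    return int(f1(x) / f0(x) > pi0 / pi1)

print([m_bayes(x) for x in (-1.0, 0.5, 3.0)])
```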
Classically, parametric approaches assume that the conditional distribution of $X$ given $[Y = j]$ is multinormal with mean $\mu_j$ and covariance matrix $\Sigma_j$ ($j = 0, 1$). This gives rise to the so-called quadratic discriminant analysis (QDA), or to linear discriminant analysis (LDA) if it is further assumed that $\Sigma_0 = \Sigma_1$. It is standard to estimate the parameters $\mu_j$ and $\Sigma_j$ ($j = 0, 1$) by the corresponding sample means and empirical covariance matrices, but the use of more robust estimators was recommended in many works; see, e.g., Randles et al. (1978), He and Fung (2000), Dehon and Croux (2001), or Hartikainen and Oja (2006). Irrespective of the estimators used, however, these classifiers fail to be consistent away from the multinormal case.
Denoting by $d_\Sigma(x, \mu) = ((x - \mu)' \Sigma^{-1} (x - \mu))^{1/2}$ the Mahalanobis distance between $x$ and $\mu$ in the metric associated with the symmetric and positive definite matrix $\Sigma$, it is well known that the QDA classifier rewrites
$$m_{\mathrm{QDA}}(x) = I\big[d_{\Sigma_1}(x, \mu_1) < d_{\Sigma_0}(x, \mu_0) + C\big], \qquad (1.3)$$
where the constant $C$ depends on $\pi_0$, $\pi_1$, $\Sigma_0$, and $\Sigma_1$, hence classifies $x$ into Population 1 if it is sufficiently more central in Population 1 than in Population 0 (centrality, in elliptical setups, being therefore measured with respect to the geometry of the underlying equidensity contours). This suggests that statistical depth functions, that are mappings of the form $x \mapsto D(x, P)$ indicating how central $x$ is with respect to a probability measure $P$ (see Section 2.1 for a more precise definition), are appropriate tools to perform nonparametric classification. Indeed, denoting by $P_j$ the probability measure associated with Population $j$ ($j = 0, 1$), (1.3) makes it natural to consider classifiers of the form
$$m_D(x) = I\big[D(x, P_1) > D(x, P_0)\big],$$
based on some fixed statistical depth function $D$. This max-depth approach was first proposed in Liu et al. (1999) and was then investigated in Ghosh and Chaudhuri (2005b). Dutta and Ghosh (2012a,b) considered max-depth classifiers based on the projection depth and on (an affine-invariant version of) the $L_p$ depth, respectively. Hubert and Van der Veeken (2010) modified the max-depth approach based on projection depth to better cope with possibly skewed data.
Recently, Li et al. (2012) proposed the Depth vs Depth (DD) classifiers that extend the max-depth ones by constructing appropriate polynomial separating curves in the DD-plot, that is, in the scatter plot of the points $(D_0^{(n)}(X_i), D_1^{(n)}(X_i))$, $i = 1, \ldots, n$, where $D_j^{(n)}(X_i)$ refers to the depth of $X_i$ with respect to the data points coming from Population $j$. Those separating curves are chosen to minimize the empirical misclassification rate on the training sample and their polynomial degree $m$ is chosen through cross-validation. Lange et al. (2012) defined modified DD-classifiers that are computationally efficient and apply in higher dimensions (up to $d = 20$). Other depth-based classifiers were proposed in Jörnsten (2004), Ghosh and Chaudhuri (2005a), and Cui et al. (2008).
Being based on depth, these classifiers are clearly of a nonparametric nature. An important requirement in nonparametric classification, however, is that consistency holds as broadly as possible and, in particular, does not require structural distributional assumptions. In that respect, the depth-based classifiers available in the literature are not so satisfactory, since they are at best consistent under elliptical distributions only¹. This restricted-to-ellipticity consistency implies that, as far as consistency is concerned, the Mahalanobis depth is perfectly sufficient and is by no means inferior to the more nonparametric (Tukey (1975)) halfspace depth or (Liu (1990)) simplicial depth, despite the fact that it uninspiringly leads to LDA through the max-depth approach. Also, even this restricted consistency often requires estimating densities; see, e.g., Dutta and Ghosh (2012a,b). This is somewhat undesirable since density and depth are quite antinomic in spirit (a deepest point may very well be a point where the density vanishes). Actually, if densities are to be estimated in the procedure anyway, then it would be more natural to go for density estimation all the way, that is, to plug density estimators in (1.2).

¹The classifiers from Dutta and Ghosh (2012b) are an exception that slightly extends consistency to (a subset of) the class of $L_p$-elliptical distributions.

The poor consistency of the available depth-based classifiers actually follows from their global nature. Zakai and Ritov (2009) indeed proved that any universally consistent classifier needs to be of a local nature. In this paper, we therefore introduce local depth-based classifiers, that rely on nearest-neighbor ideas (kernel density techniques should be avoided, since, as mentioned above, depth and densities are somewhat incompatible). From their nearest-neighbor nature, they will inherit consistency under very mild conditions, while from their depth nature, they will inherit affine-invariance and robustness, two important features in multivariate statistics and in classification in particular. Identifying nearest neighbors through depth will be achieved via an original symmetrization construction. The corresponding depth-based neighborhoods are of a nonparametric nature and the good finite-sample behavior of the resulting classifiers most likely results from their data-driven adaptive nature.
The outline of the paper is as follows. In Section 2, we first recall the concept of statistical depth functions (Section 2.1) and then describe our symmetrization construction that allows us to define the depth-based neighbors to be used later for classification purposes (Section 2.2). In Section 3, we define the proposed depth-based nearest-neighbor classifiers and present some of their basic properties (Section 3.1) before providing consistency results (Section 3.2). In Section 4, Monte Carlo simulations are used to compare the finite-sample performances of our classifiers with those of their competitors. In Section 5, we show the practical value of the proposed classifiers on two real-data examples. We then discuss in Section 6 some further applications of our depth-based neighborhoods. Finally, the Appendix collects the technical proofs.
2 DEPTH-BASED NEIGHBORS
In this section, we review the concept of statistical depth functions and define the depth-based neighborhoods on which the proposed nearest-neighbor classifiers will be based.
2.1 Statistical depth functions
Statistical depth functions allow one to measure the centrality of any $x \in \mathbb{R}^d$ with respect to a probability measure $P$ over $\mathbb{R}^d$ (the larger the depth of $x$, the more central $x$ is with respect to $P$). Following Zuo and Serfling (2000a), we define a statistical depth function as a bounded mapping $D(\cdot, P)$ from $\mathbb{R}^d$ to $\mathbb{R}^+$ that satisfies the following four properties:

(P1) affine-invariance: for any $d \times d$ invertible matrix $A$, any $d$-vector $b$ and any distribution $P$ over $\mathbb{R}^d$, $D(Ax + b, P_{A,b}) = D(x, P)$, where $P_{A,b}$ is defined through $P_{A,b}[B] = P[A^{-1}(B - b)]$ for any $d$-dimensional Borel set $B$;

(P2) maximality at center: for any $P$ that is symmetric about $\theta$ (in the sense² that $P[\theta + B] = P[\theta - B]$ for any $d$-dimensional Borel set $B$), $D(\theta, P) = \sup_{x \in \mathbb{R}^d} D(x, P)$;

(P3) monotonicity relative to the deepest point: for any $P$ having deepest point $\theta$, $D(x, P) \leq D((1 - \lambda)\theta + \lambda x, P)$ for any $x \in \mathbb{R}^d$ and any $\lambda \in [0, 1]$;

(P4) vanishing at infinity: for any $P$, $D(x, P) \to 0$ as $\|x\| \to \infty$.

For any statistical depth function and any $\alpha > 0$, the set $R_\alpha(P) = \{x \in \mathbb{R}^d : D(x, P) \geq \alpha\}$ is called the depth region of order $\alpha$. These regions are nested and, clearly, inner regions collect points with larger depth. Below, it will often be convenient to rather index these regions by their probability content: for any $\beta \in [0, 1)$, we will denote by $R^\beta(P)$ the smallest $R_\alpha(P)$ that has $P$-probability larger than or equal to $\beta$. Throughout, subscripts and superscripts for depth regions are used for depth levels and probability contents, respectively.

²Zuo and Serfling (2000a) also considers more general symmetry concepts; however, we restrict in the sequel to central symmetry, which will be the right concept for our purposes.
Celebrated instances of statistical depth functions include

(i) the Tukey (1975) halfspace depth $D_H(x, P) = \inf_{u \in \mathcal{S}^{d-1}} P[u'(X - x) \geq 0]$, where $\mathcal{S}^{d-1} = \{u \in \mathbb{R}^d : \|u\| = 1\}$ is the unit sphere in $\mathbb{R}^d$;

(ii) the Liu (1990) simplicial depth $D_S(x, P) = P[x \in S(X_1, X_2, \ldots, X_{d+1})]$, where $S(x_1, x_2, \ldots, x_{d+1})$ denotes the closed simplex with vertices $x_1, x_2, \ldots, x_{d+1}$ and where $X_1, X_2, \ldots, X_{d+1}$ are i.i.d. $P$;

(iii) the Mahalanobis depth $D_M(x, P) = 1/(1 + d^2_{\Sigma(P)}(x, \mu(P)))$, for some affine-equivariant location and scatter functionals $\mu(P)$ and $\Sigma(P)$;

(iv) the projection depth $D_{\mathrm{Pr}}(x, P) = 1/(1 + \sup_{u \in \mathcal{S}^{d-1}} |u'x - \mu(P_{[u]})| / \sigma(P_{[u]}))$, where $P_{[u]}$ denotes the probability distribution of $u'X$ when $X \sim P$ and where $\mu(P)$ and $\sigma(P)$ are univariate location and scale functionals, respectively.

Other depth functions are the simplicial volume depth, the spatial depth, the $L_p$ depth, etc. Of course, not all such depths fulfill Properties (P1)-(P4) for any distribution $P$; see Zuo and Serfling (2000a). A further concept of depth, of a slightly different ($L_2$) nature, is the so-called zonoid depth; see Koshevoy and Mosler (1997).
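To fix ideas, here is a minimal numerical sketch (ours, not code from the paper) of two of the sample depths listed above: the Mahalanobis depth based on the sample mean and covariance, and a Monte Carlo approximation of the halfspace depth obtained by minimizing over finitely many random directions $u$; all function names are illustrative choices.

```python
import numpy as np

def mahalanobis_depth(x, X):
    """Sample Mahalanobis depth D_M(x, P^(n)) = 1 / (1 + d^2_Sigma(x, mu))."""
    mu = X.mean(axis=0)
    Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = (x - mu) @ Sigma_inv @ (x - mu)
    return 1.0 / (1.0 + d2)

def halfspace_depth(x, X, n_dir=500, rng=None):
    """Approximate Tukey halfspace depth: minimum over random unit directions u
    of the empirical probability P^(n)[u'(X - x) >= 0]."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    U = rng.standard_normal((n_dir, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    proj = (X - x) @ U.T                      # u'(X_i - x), shape (n, n_dir)
    return (proj >= 0).mean(axis=0).min()

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
x = np.array([0.2, -0.1])
print(mahalanobis_depth(x, X), halfspace_depth(x, X, rng=1))
```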
Of course, if $d$-variate observations $X_1, \ldots, X_n$ are available, then sample versions of the depths above are simply obtained by replacing $P$ with the corresponding empirical distribution $P^{(n)}$ (the sample simplicial depth then has a U-statistic structure). A crucial fact for our purposes is that a sample depth provides a center-outward ordering of the observations with respect to the corresponding deepest point $\hat\theta^{(n)}$: one may indeed order the $X_i$'s in such a way that
$$D(X_{(1)}, P^{(n)}) \geq D(X_{(2)}, P^{(n)}) \geq \ldots \geq D(X_{(n)}, P^{(n)}). \qquad (2.1)$$
Neglecting possible ties, this states that, in the depth sense, $X_{(1)}$ is the observation closest to $\hat\theta^{(n)}$, $X_{(2)}$ the second closest, ..., and $X_{(n)}$ the one farthest away from $\hat\theta^{(n)}$.
For most classical depths, there may be infinitely many deepest points, that form a convex region in $\mathbb{R}^d$. This will not be an issue in this work, since the symmetrization construction we will introduce, jointly with Properties (Q2)-(Q3) below, asymptotically guarantees unicity of the deepest point. For some particular depth functions, unicity may even hold for finite samples: for instance, in the case of halfspace depth, it follows from Rousseeuw and Struyf (2004) and results on the uniqueness of the symmetry center (Serfling (2006)) that, under the assumption that the parent distribution admits a density, symmetrization implies almost sure unicity of the deepest point.
2.2 Depth-based neighborhoods
A statistical depth function, through (2.1), can be used to define neighbors of the deepest point $\hat\theta^{(n)}$. Implementing a nearest-neighbor classifier, however, requires defining neighbors of any point $x \in \mathbb{R}^d$. Property (P2) provides the key to the construction of an $x$-outward ordering of the observations, hence to the definition of depth-based neighbors of $x$: symmetrization with respect to $x$.

More precisely, we propose to consider depth with respect to the empirical distribution $P_x^{(n)}$ associated with the sample obtained by adding to the original observations $X_1, X_2, \ldots, X_n$ their reflections $2x - X_1, \ldots, 2x - X_n$ with respect to $x$. Property (P2) implies that $x$ is the (at least asymptotically; see above) unique deepest point with respect to $P_x^{(n)}$. Consequently, this symmetrization construction, parallel to (2.1), leads to an ($x$-outward) ordering of the form
$$D(X_{x,(1)}, P_x^{(n)}) \geq D(X_{x,(2)}, P_x^{(n)}) \geq \ldots \geq D(X_{x,(n)}, P_x^{(n)}).$$
Note that the reflected observations are only used to define the ordering but are not ordered themselves. For any $k \in \{1, \ldots, n\}$, this allows one to identify (up to possible ties) the $k$ nearest neighbors $X_{x,(i)}$, $i = 1, \ldots, k$, of $x$. In the univariate case ($d = 1$), these $k$ neighbors coincide, irrespective of the statistical depth function $D$, with the $k$ data points minimizing the usual distances $|X_i - x|$, $i = 1, \ldots, n$.
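The symmetrization step translates directly into code. The sketch below is our own illustration (not the authors' implementation): it reflects the sample about $x$, computes depths of the original observations with respect to the augmented sample, and returns the $k$ deepest points, i.e. the $k$ depth-based nearest neighbors of $x$, ignoring ties; the Mahalanobis depth is used as one concrete, illustrative choice of $D$.

```python
import numpy as np

def mahalanobis_depth(y, Z):
    """Sample Mahalanobis depth of y with respect to the sample Z."""
    mu = Z.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Z, rowvar=False))
    return 1.0 / (1.0 + (y - mu) @ S_inv @ (y - mu))

def depth_based_neighbors(x, X, k, depth=mahalanobis_depth):
    """Indices of the k depth-based nearest neighbors of x (ties ignored).

    Depth is computed with respect to the x-symmetrized sample
    (X_1, ..., X_n, 2x - X_1, ..., 2x - X_n); the reflected points only
    serve to define the x-outward ordering."""
    X_sym = np.vstack([X, 2 * x - X])          # add reflections about x
    depths = np.array([depth(Xi, X_sym) for Xi in X])
    return np.argsort(-depths)[:k]             # deepest = closest to x

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
print(depth_based_neighbors(np.zeros(2), X, k=5))
```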
In the sequel, the corresponding depth-based neighborhoods, that is, the sample depth regions $R_{x,\alpha}^{(n)} = R_\alpha(P_x^{(n)})$, will play an important role. In accordance with the notation from the previous section, we will write $R_x^{(n),\beta}$ for the smallest depth region $R_{x,\alpha}^{(n)}$ that contains at least a proportion $\beta$ of the data points $X_1, X_2, \ldots, X_n$. For $\beta = k/n$, $R_x^{(n),\beta}$ is therefore the smallest depth-based neighborhood that contains $k$ of the $X_i$'s; ties may imply that the number of data points in this neighborhood, $K_x^{(n)}$ say, is strictly larger than $k$.
Note that a distance (or pseudo-distance) $(x, y) \mapsto d(x, y)$ that is symmetric in its arguments is not needed to identify nearest neighbors of $x$. For that purpose, a collection of distances $y \mapsto d_x(y)$ from a fixed point is indeed sufficient (in particular, it is irrelevant whether or not this distance satisfies the triangular inequality). In that sense, the (data-driven) symmetric distance associated with the Oja and Paindaveine (2005) lift-interdirections, that was recently used to build nearest-neighbor regression estimators in Biau et al. (2012), is unnecessarily strong. Also, only an ordering of the distances is needed to identify nearest neighbors. This ordering of distances from a fixed point $x$ is exactly what the depth-based $x$-outward ordering above is providing.
3 DEPTH-BASED kNN CLASSIFIERS
In this section, we first define the proposed depth-based classifiers and present some of their basic properties (Section 3.1). We then state the main result of this paper, related to their consistency properties (Section 3.2).
3.1 Definition and basic properties
The standard $k$-nearest-neighbor (kNN) procedure classifies the point $x$ into Population 1 iff there are more observations from Population 1 than from Population 0 in the smallest Euclidean ball centered at $x$ that contains $k$ data points. Depth-based kNN classifiers are naturally obtained by replacing these Euclidean neighborhoods with the depth-based neighborhoods introduced above, that is, the proposed kNN procedure classifies $x$ into Population 1 iff there are more observations from Population 1 than from Population 0 in the smallest depth-based neighborhood of $x$ that contains $k$ observations, i.e., in $R_x^{(n),\beta}$, $\beta = k/n$. In other words, the proposed depth-based classifier is defined as
$$m_D^{(n)}(x) = I\Big[\sum_{i=1}^n I[Y_i = 1]\, W_i^{(n)}(x) > \sum_{i=1}^n I[Y_i = 0]\, W_i^{(n)}(x)\Big], \qquad (3.1)$$
with $W_i^{(n)}(x) = \frac{1}{K_x^{(n)}}\, I[X_i \in R_x^{(n),\beta}]$, where $K_x^{(n)} = \sum_{j=1}^n I[X_j \in R_x^{(n),\beta}]$ still denotes the number of observations in the depth-based neighborhood $R_x^{(n),\beta}$. Since
$$m_D^{(n)}(x) = I\big[\eta_D^{(n)}(x) > 1/2\big], \quad \text{with } \eta_D^{(n)}(x) = \sum_{i=1}^n I[Y_i = 1]\, W_i^{(n)}(x), \qquad (3.2)$$
the proposed classifier is actually the one obtained by plugging, in (1.1), the depth-based estimator $\eta_D^{(n)}(x)$ of the conditional expectation $\eta(x)$. This will be used in the proof of Theorem 3.1 below. Note that in the univariate case ($d = 1$), $m_D^{(n)}$, irrespective of the statistical depth function $D$, reduces to the standard (Euclidean) kNN classifier.
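A minimal sketch of the resulting rule (3.1), under the simplifying assumption that exactly $k$ points fall in the neighborhood (so $K_x^{(n)} = k$ and ties are ignored); the depth routine is the illustrative one sketched in Section 2.2, not code from the paper.

```python
import numpy as np

def depth_knn_classify(x, X, Y, k, depth):
    """Depth-based kNN rule (3.1), ignoring ties (K_x^(n) taken as k):
    majority vote among the k deepest observations with respect to the
    x-symmetrized sample."""
    X_sym = np.vstack([X, 2 * x - X])                 # symmetrization about x
    depths = np.array([depth(Xi, X_sym) for Xi in X])
    nbrs = np.argsort(-depths)[:k]                    # depth-based neighbors of x
    eta_hat = Y[nbrs].mean()                          # estimator (3.2) of eta(x)
    return int(eta_hat > 0.5)

# Usage, reusing e.g. the mahalanobis_depth sketch from Section 2.1/2.2:
# y_pred = depth_knn_classify(x0, X_train, Y_train, k=20, depth=mahalanobis_depth)
```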
It directly follows from Property (P1) that the proposed classifier is affine-invariant, in the sense that the outcome of the classification will not be affected if $X_1, \ldots, X_n$ and $x$ are subject to a common (arbitrary) affine transformation. This clearly improves over the standard kNN procedure that, e.g., is sensitive to unit changes. Of course, one natural way to define an affine-invariant kNN classifier is to apply the original kNN procedure on the standardized data points $\hat\Sigma^{-1/2} X_i$, $i = 1, \ldots, n$, where $\hat\Sigma$ is an affine-equivariant estimator of shape, in the sense that
$$\hat\Sigma(AX_1 + b, \ldots, AX_n + b) \propto A\, \hat\Sigma(X_1, \ldots, X_n)\, A'$$
for any invertible $d \times d$ matrix $A$ and any $d$-vector $b$. A natural choice for $\hat\Sigma$ is the regular covariance matrix, but more robust choices, such as, e.g., the shape estimators from Tyler (1987), Dümbgen (1998), or Hettmansperger and Randles (2002), would allow one to get rid of any moment assumption. Here, we stress that, unlike our adaptive depth-based methodology, such a transformation approach leads to neighborhoods that do not exploit the geometry of the distribution in the vicinity of the point $x$ to be classified (these neighborhoods indeed all are ellipsoids with $x$-independent orientation and shape); as we show through simulations below, this results in significantly worse performances.
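For comparison, a sketch of this transformation approach (ours, not the authors' code): whiten the data with a square root of the inverse of an affine-equivariant shape estimate (here simply the sample covariance, one possible choice) and run Euclidean kNN. Note that the resulting neighborhoods are ellipsoids whose orientation and shape do not depend on $x$.

```python
import numpy as np

def knn_aff_classify(x, X, Y, k):
    """Affine-invariant kNN via standardization: Euclidean kNN on whitened data,
    which amounts to kNN in the Mahalanobis metric of the estimated shape."""
    Sigma = np.cov(X, rowvar=False)                 # one possible shape estimate
    L = np.linalg.cholesky(np.linalg.inv(Sigma))    # L L' = Sigma^{-1}
    Z, z = X @ L, x @ L                             # standardized data and point
    nbrs = np.argsort(np.linalg.norm(Z - z, axis=1))[:k]
    return int(Y[nbrs].mean() > 0.5)
```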
The main depth-based classifiers available, among which those relying on the max-depth approach of Liu et al. (1999) and Ghosh and Chaudhuri (2005b), as well as the more efficient ones from Li et al. (2012), suffer from the "outsider" problem³: if the point $x$ to be classified does not sit in the convex hull of any of the two populations, then most statistical depth functions will give $x$ zero depth with respect to each population, so that $x$ cannot be classified through depth. This is of course undesirable, all the more so that such a point $x$ may very well be easy to classify. To improve on this, Hoberg and Mosler (2006) proposed extending the original depth fields by using the Mahalanobis depth outside the supports of both populations, a solution that quite unnaturally requires combining two depth functions. Quite interestingly, our symmetrization construction implies that the depth-based kNN classifier (which involves one depth function only) does not suffer from the outsider problem; this is an important advantage over competing depth-based classifiers.

³The term outsider was recently introduced in Lange et al. (2012).
While our depth-based classifiers in (3.1) are perfectly well-defined and enjoy, as we will show in Section 3.2 below, excellent consistency properties, practitioners might find it quite arbitrary that a point $x$ such that $\sum_{i=1}^n I[Y_i = 1]\, W_i^{(n)}(x) = \sum_{i=1}^n I[Y_i = 0]\, W_i^{(n)}(x)$ is assigned to Population 0. Parallel to the standard kNN classifier, the classification may alternatively be based on the population of the next neighbor. Since ties are likely to occur when using depth, it is natural to rather base classification on the proportion of data points from each population in the next depth region. Of course, if the next depth region still leads to an ex-aequo, the outcome of the classification is to be determined on the subsequent depth regions, until a decision is reached (in the unlikely case that an ex-aequo occurs for all depth regions to be considered, classification should then be done by flipping a coin). This treatment of ties is used whenever real or simulated data are considered below.
Finally, practitioners have to choose some value for the smoothing parameter $k_n$. This may be done, e.g., through cross-validation (as we will do in the real data examples of Section 5). The value of $k_n$ is likely to have a strong impact on finite-sample performances, as confirmed in the simulations we conduct in Section 4.
3.2 Consistency results
As expected, the local (nearest-neighbor) nature of the proposed classifiers makes them consistent under very mild conditions. This, however, requires that the statistical depth function $D$ satisfies the following further properties:

(Q1) continuity: if $P$ is symmetric about $\theta$ and admits a density that is positive at $\theta$ and continuous in a neighborhood of $\theta$, then $x \mapsto D(x, P)$ is continuous in a neighborhood of $\theta$;

(Q2) unique maximization at the symmetry center: if $P$ is symmetric about $\theta$ and admits a density that is positive at $\theta$ and continuous in a neighborhood of $\theta$, then $D(\theta, P) > D(x, P)$ for all $x \neq \theta$;

(Q3) consistency: for any bounded $d$-dimensional Borel set $B$, $\sup_{x \in B} |D(x, P^{(n)}) - D(x, P)| = o(1)$ almost surely as $n \to \infty$, where $P^{(n)}$ denotes the empirical distribution associated with $n$ random vectors that are i.i.d. $P$.

Property (Q2) complements Property (P2) and, in view of Property (P3), only further requires that $\theta$ is a strict local maximizer of $x \mapsto D(x, P)$. Note that Properties (Q1)-(Q2) jointly ensure that the depth-based neighborhoods of $x$ from Section 2.2 collapse to the singleton $\{x\}$ when the depth level increases to its maximal value. Finally, since our goal is to prove that our classifier satisfies an asymptotic property (namely, consistency), it is not surprising that we need to control the asymptotic behavior of the sample depth itself (Property (Q3)). As shown by Theorem A.1 in the Appendix, Properties (Q1)-(Q3) are satisfied for many classical depth functions.
We can then state the main result of the paper.
Theorem 3.1 Let $D$ be a depth function satisfying (P2), (P3) and (Q1)-(Q3). Let $k_n$ be a sequence of positive integers such that $k_n \to \infty$ and $k_n = o(n)$ as $n \to \infty$. Assume that, for $j = 0, 1$, $X \mid [Y = j]$ admits a density $f_j$ whose collection of discontinuity points is closed and has Lebesgue measure zero. Then the depth-based $k_n$NN classifier $m_D^{(n)}$ in (3.1) is consistent in the sense that
$$P[m_D^{(n)}(X) \neq Y \mid \mathcal{T}_n] - P[m_{\mathrm{Bayes}}(X) \neq Y] = o_P(1) \quad \text{as } n \to \infty,$$
where $\mathcal{T}_n$ is the sigma-algebra associated with $(X_i, Y_i)$, $i = 1, \ldots, n$.
Classically, consistency results for classification are based on a famous theorem from Stone (1977); see, e.g., Theorem 6.3 in Devroye et al. (1996). However, it is an open question whether Condition (i) of this theorem holds or not for the proposed classifiers, at least for some particular statistical depth functions. A sufficient condition for Condition (i) is actually that there exists a partition of $\mathbb{R}^d$ into cones $C_1, \ldots, C_{\gamma_d}$ with vertex at the origin of $\mathbb{R}^d$ ($\gamma_d$ not depending on $n$) such that, for any $X_i$ and any $j$, there exist (with probability one) at most $k$ data points $X_\ell \in X_i + C_j$ that have $X_i$ among their $k$ depth-based nearest neighbors. Would this be established for some statistical depth function $D$, it would prove that the corresponding depth-based $k_n$NN classifier $m_D^{(n)}$ is universally consistent, in the sense that consistency holds without any assumption on the distribution of $(X, Y)$.

Now, it is clear from the proof of Stone's theorem that this Condition (i) may be dropped if one further assumes that $X$ admits a uniformly continuous density. This is however a high price to pay, and that is the reason why the proof of Theorem 3.1 rather relies on an argument recently used in Biau et al. (2012); see the Appendix.
4 SIMULATIONS
We performed simulations in order to evaluate the finite-sample performances of the proposed depth-based kNN classifiers. We considered six setups, focusing on bivariate $X_i$'s ($d = 2$) with equal a priori probabilities ($\pi_0 = \pi_1 = 1/2$), and involving the following densities $f_0$ and $f_1$:
Setup 1 (multinormality): $f_j$, $j = 0, 1$, is the pdf of the bivariate normal distribution with mean vector $\mu_j$ and covariance matrix $\Sigma_j$, where
$$\mu_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad \Sigma_0 = \begin{pmatrix} 1 & 1 \\ 1 & 4 \end{pmatrix}, \quad \Sigma_1 = 4\,\Sigma_0;$$

Setup 2 (bivariate Cauchy): $f_j$, $j = 0, 1$, is the pdf of the bivariate Cauchy distribution with location center $\mu_j$ and scatter matrix $\Sigma_j$, with the same values of $\mu_j$ and $\Sigma_j$ as in Setup 1;

Setup 3 (flat covariance structures): $f_j$, $j = 0, 1$, is the pdf of the bivariate normal distribution with mean vector $\mu_j$ and covariance matrix $\Sigma_j$, where
$$\mu_0 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_1 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad \Sigma_0 = \begin{pmatrix} 5^2 & 0 \\ 0 & 1 \end{pmatrix}, \quad \Sigma_1 = \Sigma_0;$$

Setup 4 (uniform distributions on half-moons): $f_0$ and $f_1$ are the densities of
$$\begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} U \\ V \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} X \\ Y \end{pmatrix} = \begin{pmatrix} 0.5 \\ 2 \end{pmatrix} + \begin{pmatrix} 1 & 0.5 \\ 0.5 & 1 \end{pmatrix} \begin{pmatrix} U \\ V \end{pmatrix},$$
respectively, where $U \sim \mathrm{Unif}(-1, 1)$ and $V \mid [U = u] \sim \mathrm{Unif}(1 - u^2, 2(1 - u^2))$;

Setup 5 (uniform distributions on rings): $f_0$ and $f_1$ are the uniform distributions on the concentric rings $\{x \in \mathbb{R}^2 : 1 \leq \|x\| \leq 2\}$ and $\{x \in \mathbb{R}^2 : 1.75 \leq \|x\| \leq 2.5\}$, respectively;

Setup 6 (bimodal populations): $f_j$, $j = 0, 1$, is the pdf of the multinormal mixture $\frac{1}{2}\,\mathcal{N}(\mu_j^{\mathrm{I}}, \Sigma_j^{\mathrm{I}}) + \frac{1}{2}\,\mathcal{N}(\mu_j^{\mathrm{II}}, \Sigma_j^{\mathrm{II}})$, where
$$\mu_0^{\mathrm{I}} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \mu_0^{\mathrm{II}} = \begin{pmatrix} 3 \\ 3 \end{pmatrix}, \quad \Sigma_0^{\mathrm{I}} = \begin{pmatrix} 1 & 1 \\ 1 & 4 \end{pmatrix}, \quad \Sigma_0^{\mathrm{II}} = 4\,\Sigma_0^{\mathrm{I}},$$
$$\mu_1^{\mathrm{I}} = \begin{pmatrix} 1.5 \\ 1.5 \end{pmatrix}, \quad \mu_1^{\mathrm{II}} = \begin{pmatrix} 4.5 \\ 4.5 \end{pmatrix}, \quad \Sigma_1^{\mathrm{I}} = \begin{pmatrix} 4 & 0 \\ 0 & 0.5 \end{pmatrix}, \quad \text{and} \quad \Sigma_1^{\mathrm{II}} = \begin{pmatrix} 0.75 & 0 \\ 0 & 5 \end{pmatrix}.$$
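For instance, Setup 1 can be simulated in a few lines of numpy; this sketch is ours (function names are hypothetical), with the parameters taken from the description above and the sample sizes matching those used in the simulations.

```python
import numpy as np

mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
Sigma0 = np.array([[1.0, 1.0], [1.0, 4.0]])
Sigma1 = 4.0 * Sigma0

def sample_setup1(n, rng):
    """Draw n observations from Setup 1 with equal priors pi_0 = pi_1 = 1/2."""
    Y = rng.integers(0, 2, size=n)
    X0 = rng.multivariate_normal(mu0, Sigma0, size=n)
    X1 = rng.multivariate_normal(mu1, Sigma1, size=n)
    X = np.where(Y[:, None] == 0, X0, X1)
    return X, Y

rng = np.random.default_rng(1)
X_train, Y_train = sample_setup1(200, rng)   # n_train = 200
X_test, Y_test = sample_setup1(100, rng)     # n_test = 100
```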
For each of these six setups, we generated 250 training and test samples of size $n = n_{\mathrm{train}} = 200$ and $n_{\mathrm{test}} = 100$, respectively, and evaluated the misclassification frequencies of the following classifiers:

1. the usual LDA and QDA classifiers (LDA/QDA);

2. the standard Euclidean kNN classifiers (kNN), with $\beta = k/n = 0.01$, 0.05, 0.10 and 0.40, and the corresponding Mahalanobis kNN classifiers (kNNaff) obtained by performing the Euclidean kNN classifiers on standardized data, where standardization is based on the regular covariance matrix estimate of the pooled training sample;

3. the proposed depth-based kNN classifiers (D-kNN) for each combination of the $k$ used in kNN/kNNaff and a statistical depth function (we focused on halfspace depth, simplicial depth, or Mahalanobis depth);
4. the depth vs depth (DD) classifiers from Li et al. (2012), for each combination of a polynomial curve of degree $m$ ($m = 1$, 2, or 3) and a statistical depth function (halfspace depth, simplicial depth, or Mahalanobis depth). Exact DD-classifiers (DD) as well as smoothed versions (DDsm) were actually implemented although, for computational reasons, only the smoothed version was considered for $m = 3$. Exact classifiers search for the best separating polynomial curve $(d, r(d))$ of order $m$ passing through the origin and $m$ DD-points $(D_0^{(n)}(X_i), D_1^{(n)}(X_i))$ (see the Introduction), in the sense that it minimizes the misclassification error
$$\sum_{i=1}^n \Big( I[Y_i = 1]\, I[d_i^{(n)} > 0] + I[Y_i = 0]\, I[-d_i^{(n)} > 0] \Big), \qquad (4.1)$$
with $d_i^{(n)} := r(D_0^{(n)}(X_i)) - D_1^{(n)}(X_i)$. Smoothed versions use derivative-based methods to find a polynomial minimizing (4.1), where the indicator $I[d > 0]$ is replaced by the logistic function $1/(1 + e^{-td})$ for a suitable $t$. As suggested in Li et al. (2012), the value $t = 100$ was chosen in these simulations. 100 randomly chosen polynomials were used as starting points for the minimization algorithm, the classifier using the resulting polynomial with minimal misclassification (note that this time-consuming scheme always results in better performances than the one adopted in Li et al. (2012), where only one minimization is performed, starting from the best random polynomial considered).
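A sketch of the smoothed criterion used by DDsm, as we read the description above (our own illustration, with hypothetical names): the count (4.1) with the indicator replaced by a logistic function of the signed distance $d_i$ to the polynomial curve $d \mapsto r(d)$.

```python
import numpy as np

def smoothed_dd_loss(coef, D0, D1, Y, t=100.0):
    """Smoothed version of (4.1) for a degree-m polynomial through the origin,
    r(d) = coef[0]*d + ... + coef[m-1]*d**m; t is the logistic slope."""
    powers = np.vstack([D0 ** (j + 1) for j in range(len(coef))])
    d = coef @ powers - D1                       # d_i = r(D0_i) - D1_i
    sigma = 1.0 / (1.0 + np.exp(-t * d))         # smooth surrogate of I[d > 0]
    return np.sum(Y * sigma + (1 - Y) * (1.0 - sigma))
```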
Since the DD classification procedure is a refinement of the max-depth procedures of Ghosh and Chaudhuri (2005b) that leads to better misclassification rates (see Li et al. (2012)), the original max-depth procedures were omitted in this study.

Boxplots of misclassification frequencies (in percentages) are reported in Figures 1 and 2. The main learnings from these simulations are the following:

• In most setups, the proposed depth-based kNN classifiers compete well with the Euclidean kNN classifiers and improve over the latter under the flat covariance structures in Setup 3. This may be attributed to the lack of affine-invariance of the Euclidean kNN classifiers, which leads us to discard this procedure and rather focus on its affine-invariant version (kNNaff). It is very interesting to note that the kNNaff classifiers in most cases are outperformed by the depth-based kNN classifiers. In other words, the natural way to make the standard kNN classifier affine-invariant results in a dramatic cost in terms of finite-sample performances. Incidentally, we point out that, in some setups, the choice of the smoothing parameter $k_n$ appears to have less impact on affine-invariant kNN procedures than on the original kNN procedures; see, e.g., Setup 3.

• The proposed depth-based kNN classifiers also compete well with DD-classifiers, both in elliptical and non-elliptical setups. Away from ellipticity (Setups 4 to 6), in particular, they perform at least as well as, and sometimes outperform (Setup 4), the DD-classifiers; a single exception is associated with the use of Mahalanobis depth in Setup 5, where the DD-classifiers based on $m = 2, 3$ perform better. Apparently, another advantage of depth-based kNN classifiers over DD-classifiers is that their finite-sample performances depend much less on the statistical depth function $D$ used.
[Figure 1 here: three columns of boxplot panels (Setups 1, 2, 3); rows correspond to LDA/QDA and kNN/kNNaff ($\beta$ = 1%, 5%, 10%, 40%), and to D-kNN ($\beta$ = 1%, 5%, 10%, 40%), DD ($m$ = 1, 2) and DDsm ($m$ = 1, 2, 3) based on halfspace, simplicial, and Mahalanobis depth; horizontal axes show misclassification frequencies from 0 to 50%.]
Figure 1: Boxplots of misclassification frequencies (in percentages), from 250 replications of Setups 1 to 3 described in Section 4, with training sample size $n = n_{\mathrm{train}} = 200$ and test sample size $n_{\mathrm{test}} = 100$, of the LDA/QDA classifiers, the Euclidean kNN classifiers (kNN) and their Mahalanobis (affine-invariant) counterparts (kNNaff), the proposed depth-based kNN classifiers (D-kNN), and some exact and smoothed versions of the DD-classifiers (DD and DDsm); see Section 4 for details.
[Figure 2 here: same layout as Figure 1, for Setups 4, 5, and 6.]
Figure 2: Boxplots of misclassification frequencies (in percentages), from 250 replications of Setups 4 to 6 described in Section 4, with training sample size $n = n_{\mathrm{train}} = 200$ and test sample size $n_{\mathrm{test}} = 100$, of the LDA/QDA classifiers, the Euclidean kNN classifiers (kNN) and their Mahalanobis (affine-invariant) counterparts (kNNaff), the proposed depth-based kNN classifiers (D-kNN), and some exact and smoothed versions of the DD-classifiers (DD and DDsm); see Section 4 for details.
5 REAL-DATA EXAMPLES
In this section, we investigate the performances of our depth-based kNN classifiers on two well-known benchmark datasets. The first example is taken from Ripley (1996) and can be found on the book's website (http://www.stats.ox.ac.uk/pub/PRNN). This data set involves well-specified training and test samples, and we therefore simply report the test set misclassification rates of the different classifiers included in the study. The second example, blood transfusion data, is available at http://archive.ics.uci.edu/ml/index.html. Unlike the first data set, no clear partition into a training sample and a test sample is provided here. As suggested in Li et al. (2012), we randomly performed such a partition 100 times (see the details below) and computed the average test set misclassification rates, together with standard deviations.

A brief description of each dataset is as follows:
• Synthetic data was introduced and studied in Ripley (1996). The dataset is made of observations from two populations, each of them being actually a mixture of two bivariate normal distributions differing only in location. As mentioned above, a partition into a training sample and a test sample is provided: the training and test samples contain 250 and 1000 observations, respectively, and both samples are divided equally between the two populations.

• Transfusion data contains the information on 748 blood donors selected from the blood donor database of the Blood Transfusion Service Center in Hsin-Chu City, Taiwan. It was studied in Yeh et al. (2009). The classification problem at hand is to know whether or not the donor gave blood in March 2007. In this dataset, prior probabilities are not equal: out of 748 donors, 178 gave blood in March 2007, while 570 did not. Following Li et al. (2012), one out of two linearly correlated variables was removed and three measurements were available for each donor: Recency (number of months since the last donation), Frequency (total number of donations) and Time (time since the first donation). The training set consists of 100 donors from the first class and 400 donors from the second, while the rest is assigned to the test sample (therefore containing 248 individuals).
Table 1 reports the exact (synthetic) or averaged (transfusion) misclassification rates of the following classifiers: the linear (LDA) and quadratic (QDA) discriminant rules, the standard kNN classifier (kNN) and its Mahalanobis affine-invariant version (kNNaff), the depth-based kNN classifiers using halfspace depth ($D_H$-kNN) and Mahalanobis depth ($D_M$-kNN), and the exact DD-classifiers for any combination of a polynomial order $m \in \{1, 2\}$ and a statistical depth function among the two considered for depth-based kNN classifiers, namely the halfspace depth (DD$_H$) and the Mahalanobis depth (DD$_M$). Smoothed DD-classifiers were excluded from this study, as their performances, which can only be worse than those of the exact versions, showed much sensitivity to the smoothing parameter $t$; see Section 4. For all nearest-neighbor classifiers, leave-one-out cross-validation was used to determine $k$.
The results from Table 1 indicate that depth-based kNN classifiers perform very well in both examples. For synthetic data, the halfspace depth-based kNN classifier (10.1%) is only dominated by the standard (Euclidean) kNN procedure (8.7%). The latter, however, has to be discarded as it is dependent on scale and shape changes; in line with this, note that the kNN classifier applied in Dutta and Ghosh (2012b) is actually the kNNaff classifier (11.7%), as classification in that paper is performed on standardized data. The Mahalanobis depth-based kNN classifier (14.4%) does not perform as well as its halfspace counterpart. For transfusion data, however, both depth-based kNN classifiers dominate their competitors.
                    Synthetic    Transfusion
    LDA             10.8         29.60 (0.9)
    QDA             10.2         29.21 (1.5)
    kNN              8.7         29.74 (2.0)
    kNNaff          11.7         30.11 (2.1)
    D_H-kNN         10.1         27.75 (1.6)
    D_M-kNN         14.4         27.36 (1.5)
    DD_H (m = 1)    13.4         28.26 (1.7)
    DD_H (m = 2)    12.9         28.33 (1.6)
    DD_M (m = 1)    17.5         31.44 (0.1)
    DD_M (m = 2)    12.0         31.54 (0.6)

Table 1: Misclassification rates (in %), on the two benchmark datasets considered in Section 5, of the linear (LDA) and quadratic (QDA) discriminant rules, the standard kNN classifier (kNN) and its Mahalanobis affine-invariant version (kNNaff), the depth-based kNN classifiers using halfspace depth ($D_H$-kNN) and Mahalanobis depth ($D_M$-kNN), and the exact DD-classifiers for any combination of a polynomial degree $m \in \{1, 2\}$ and a choice of halfspace depth (DD$_H$) or Mahalanobis depth (DD$_M$). For transfusion data, standard deviations over the 100 random partitions are reported in parentheses.
6 FINAL COMMENTS
The depth-based neighborhoods we introduced are of interest in other inference problems as well. As an illustration, consider the regression problem where the conditional mean function $x \mapsto m(x) = E[Y \mid X = x]$ is to be estimated on the basis of mutually independent copies $(X_i, Y_i)$, $i = 1, \ldots, n$, of a random vector $(X, Y)$ with values in $\mathbb{R}^d \times \mathbb{R}$, or the problem of estimating the common density $f$ of i.i.d. random $d$-vectors $X_i$, $i = 1, \ldots, n$. The classical $k_n$NN estimators for these problems are
$$\hat f^{(n)}(x) = \frac{k_n}{n\, \lambda_d(B_x^{\beta_n})} \quad \text{and} \quad \hat m^{(n)}(x) = \sum_{i=1}^n W_i^{(n)}(x)\, Y_i = \frac{1}{k_n} \sum_{i=1}^n I\big[X_i \in B_x^{\beta_n}\big]\, Y_i, \qquad (6.1)$$
where $\beta_n = k_n/n$, $B_x^{\beta}$ is the smallest Euclidean ball centered at $x$ that contains a proportion $\beta$ of the $X_i$'s, and $\lambda_d$ stands for the Lebesgue measure on $\mathbb{R}^d$. Our construction naturally leads to considering the depth-based $k_n$NN estimators $\hat f_D^{(n)}(x)$ and $\hat m_D^{(n)}(x)$ obtained by replacing in (6.1) the Euclidean neighborhoods $B_x^{\beta_n}$ with their depth-based counterparts $R_x^{(n),\beta_n}$ and $k_n = \sum_{i=1}^n I[X_i \in B_x^{\beta_n}]$ with $K_x^{(n)} = \sum_{i=1}^n I[X_i \in R_x^{(n),\beta_n}]$.
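As a sketch of the depth-based regression estimator suggested by (6.1) (our illustration, reusing the symmetrized Mahalanobis-depth neighborhood as one concrete choice and ignoring ties, so $K_x^{(n)}$ is taken as $k_n$), $\hat m_D^{(n)}(x)$ simply averages the responses of the observations falling in the depth-based neighborhood of $x$:

```python
import numpy as np

def depth_knn_regress(x, X, Y, k):
    """Depth-based k_nNN regression estimate of m(x) = E[Y | X = x]: average of
    Y_i over the k points with largest depth w.r.t. the x-symmetrized sample
    (Mahalanobis depth used here as one concrete, illustrative choice)."""
    X_sym = np.vstack([X, 2 * x - X])
    mu = X_sym.mean(axis=0)                               # equals x by symmetry
    S_inv = np.linalg.inv(np.cov(X_sym, rowvar=False))
    depths = 1.0 / (1.0 + np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu))
    nbrs = np.argsort(-depths)[:k]                        # depth-based neighbors
    return Y[nbrs].mean()

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 2))
Y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(300)
print(depth_knn_regress(np.array([0.2, -0.3]), X, Y, k=25))
```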
A thorough investigation of the properties of these depth-based procedures is of course beyond the scope of the present paper. It is, however, extremely likely that the excellent consistency properties obtained in the classification problem extend to these nonparametric regression and density estimation setups. Now, recent works in density estimation indicate that using non-spherical (actually, ellipsoidal) neighborhoods may lead to better finite-sample properties; see, e.g., Chacón (2009) or Chacón et al. (2011). In that respect, the depth-based kNN estimators above are very promising since they involve non-spherical (and, for most classical depths, even non-ellipsoidal) neighborhoods whose shape is determined by the local geometry of the sample. Note also that depth-based neighborhoods only require choosing a single scalar bandwidth parameter (namely, $k_n$), whereas general $d$-dimensional ellipsoidal neighborhoods impose selecting $d(d + 1)/2$ bandwidth parameters.
A APPENDIX
The main goal of this Appendix is to prove Theorem 3.1. We will need the following
lemmas.
Lemma A.1 Assume that the depth function $D$ satisfies (P2), (P3), (Q1), and (Q2). Let $P$ be a probability measure that is symmetric about $\theta$ and admits a density that is positive at $\theta$ and continuous in a neighborhood of $\theta$. Then, (i) for all $a > 0$, there exists $\alpha < \alpha_* = \max_{x \in \mathbb{R}^d} D(x, P)$ such that $R_\alpha(P) \subset B_\theta(a) := \{x \in \mathbb{R}^d : \|x - \theta\| \leq a\}$; (ii) for all $\alpha < \alpha_*$, there exists $\delta > 0$ such that $B_\theta(\delta) \subset R_\alpha(P)$.
Proof of Lemma A.1. (i) First note that the existence of $\alpha_*$ follows from Property (P2). Fix then $\epsilon > 0$ such that $x \mapsto D(x, P)$ is continuous over $B_\theta(\epsilon)$; existence of $\epsilon$ is guaranteed by Property (Q1). Continuity implies that $x \mapsto D(x, P)$ reaches a minimum in $B_\theta(\epsilon)$, and Property (Q2) entails that this minimal value, $\alpha_\epsilon$ say, is strictly smaller than $\alpha_*$. Using Property (Q1) again, we obtain that, for each $\alpha \in [\alpha_\epsilon, \alpha_*]$,
$$r_\alpha : \mathcal{S}^{d-1} \to \mathbb{R}^+, \quad u \mapsto \sup\{r \in \mathbb{R}^+ : \theta + r u \in R_\alpha(P)\}$$
is a continuous function that converges pointwise to $r_{\alpha_*}(u) \equiv 0$ as $\alpha \to \alpha_*$. Since $\mathcal{S}^{d-1}$ is compact, this convergence is actually uniform, i.e., $\sup_{u \in \mathcal{S}^{d-1}} |r_\alpha(u)| = o(1)$ as $\alpha \to \alpha_*$. Part (i) of the result follows.

(ii) Property (Q2) implies that, for any $\alpha \in [\alpha_\epsilon, \alpha_*)$, the mapping $r_\alpha$ takes values in $\mathbb{R}^+_0$. Therefore there exists $u_0(\alpha) \in \mathcal{S}^{d-1}$ such that $r_\alpha(u) \geq r_\alpha(u_0(\alpha)) =: \delta_\alpha > 0$ for all $u$. This implies that, for all $\alpha \in [\alpha_\epsilon, \alpha_*)$, we have $B_\theta(\delta_\alpha) \subset R_\alpha(P)$, which proves the result for these values of $\alpha$. Nestedness of the $R_\alpha(P)$'s, which follows from Property (P3), then establishes the result for an arbitrary $\alpha < \alpha_*$.
Lemma A.2 Assume that the depth function $D$ satisfies (P2), (P3), and (Q1)-(Q3). Let $P$ be a probability measure that is symmetric about $\theta$ and admits a density that is positive at $\theta$ and continuous in a neighborhood of $\theta$. Let $X_1, \ldots, X_n$ be i.i.d. $P$ and denote by $X_{\theta,(i)}$ the $i$th depth-based nearest neighbor of $\theta$. Let $K_\theta^{(n)}$ be the number of depth-based nearest neighbors in $R_\theta^{(n),\beta_n}(P^{(n)})$, where $\beta_n = k_n/n$ is based on a sequence $k_n$ that is as in Theorem 3.1 and $P^{(n)}$ stands for the empirical distribution of $X_1, \ldots, X_n$. Then, for any $a > 0$, there exists $n(a)$ such that
$$\sum_{i=1}^{K_\theta^{(n)}} I\big[\|X_{\theta,(i)} - \theta\| > a\big] = 0$$
almost surely for all $n \geq n(a)$.

Note that, while $X_{\theta,(i)}$ may not be properly defined (because of ties), the quantity $\sum_{i=1}^{K_\theta^{(n)}} I[\|X_{\theta,(i)} - \theta\| > a]$ always is.
Proof of Lemma A.2. Fix $a > 0$. By Lemma A.1, there exists $\alpha < \alpha_*$ such that $R_\alpha(P) \subset B_\theta(a)$. Fix then $\gamma$ and $\epsilon > 0$ such that $\alpha < \gamma - \epsilon < \gamma + \epsilon < \alpha_*$. Theorem 4.1 in Zuo and Serfling (2000b) and the fact that $P_\theta^{(n)} \Rightarrow P_\theta = P$ weakly as $n \to \infty$ (where $P_\theta^{(n)}$ and $P_\theta$ are the $\theta$-symmetrized versions of $P^{(n)}$ and $P$, respectively) then entail that there exists an integer $n_0$ such that
$$R_{\gamma+\epsilon}(P) \subset R_\gamma(P_\theta^{(n)}) \subset R_{\gamma-\epsilon}(P) \subset R_\alpha(P)$$
almost surely for all $n \geq n_0$. From Lemma A.1 again, there exists $\delta > 0$ such that $B_\theta(\delta) \subset R_{\gamma+\epsilon}(P)$. Hence, for any $n \geq n_0$, one has that
$$B_\theta(\delta) \subset R_\gamma(P_\theta^{(n)}) \subset B_\theta(a)$$
almost surely.

Putting $N_n = \sum_{i=1}^n I[X_i \in B_\theta(\delta)]$, the SLLN yields that $N_n/n \to P[X \in B_\theta(\delta)] = P[B_\theta(\delta)] > 0$ as $n \to \infty$, since $X \sim P$ admits a density that, from continuity, is positive over a neighborhood of $\theta$. Since $k_n = o(n)$ as $n \to \infty$, this implies that, for all $n \geq n_1\ (\geq n_0)$,
$$\sum_{i=1}^n I\big[X_i \in R_\gamma(P_\theta^{(n)})\big] \geq N_n \geq k_n$$
almost surely. It follows that, for such values of $n$,
$$R_\theta^{(n),\beta_n}(P^{(n)}) = R^{\beta_n}(P_\theta^{(n)}) \subset R_\gamma(P_\theta^{(n)}) \subset B_\theta(a)$$
almost surely, with $\beta_n = k_n/n$. Therefore, $\max_{i=1,\ldots,K_\theta^{(n)}} \|X_{\theta,(i)} - \theta\| \leq a$ almost surely for large $n$, which yields the result.
Lemma A.3 For a plug-in classification rule $m^{(n)}(x) = I[\eta^{(n)}(x) > 1/2]$ obtained from a regression estimator $\eta^{(n)}(x)$ of $\eta(x) = E[I[Y = 1] \mid X = x]$, one has that $P[m^{(n)}(X) \neq Y] - L_{\mathrm{opt}} \leq 2\big(E[(\eta^{(n)}(X) - \eta(X))^2]\big)^{1/2}$, where $L_{\mathrm{opt}} = P[m_{\mathrm{Bayes}}(X) \neq Y]$ is the probability of misclassification of the Bayes rule.
Proof of Lemma A.3. Corollary 6.1 in Devroye et al. (1996) states that
$$P[m^{(n)}(X) \neq Y \mid \mathcal{T}_n] - L_{\mathrm{opt}} \leq 2\, E\big[|\eta^{(n)}(X) - \eta(X)| \,\big|\, \mathcal{T}_n\big],$$
where $\mathcal{T}_n$ stands for the sigma-algebra associated with the training sample $(X_i, Y_i)$, $i = 1, \ldots, n$. Taking expectations in both sides of this inequality and applying Jensen's inequality readily yields the result.
Proof of Theorem 3.1. From Bayes' theorem, $X$ admits the density $x \mapsto f(x) = \pi_0 f_0(x) + \pi_1 f_1(x)$. Letting $\mathrm{Supp}^+(f) = \{x \in \mathbb{R}^d : f(x) > 0\}$ and writing $C(f_j)$ for the collection of continuity points of $f_j$, $j = 0, 1$, put $N = \mathrm{Supp}^+(f) \cap C(f_0) \cap C(f_1)$. Since, by assumption, $\mathbb{R}^d \setminus C(f_j)$ ($j = 0, 1$) has Lebesgue measure zero, we have that
$$P[X \in \mathbb{R}^d \setminus N] \leq P[X \in \mathbb{R}^d \setminus \mathrm{Supp}^+(f)] + \sum_{j \in \{0,1\}} P[X \in \mathbb{R}^d \setminus C(f_j)] = \int_{\mathbb{R}^d \setminus \mathrm{Supp}^+(f)} f(x)\, dx = 0,$$
so that $P[X \in N] = 1$. Note also that $x \mapsto \eta(x) = \pi_1 f_1(x)/(\pi_0 f_0(x) + \pi_1 f_1(x))$ is continuous over $N$.

Fix $x \in N$ and let $Y_{x,(i)} = Y_{j(x)}$, with $j(x)$ such that $X_{x,(i)} = X_{j(x)}$. With this notation, the estimator $\eta_D^{(n)}(x)$ from Section 3.1 rewrites
$$\eta_D^{(n)}(x) = \sum_{i=1}^n Y_i\, W_i^{(n)}(x) = \frac{1}{K_x^{(n)}} \sum_{i=1}^{K_x^{(n)}} Y_{x,(i)}.$$
Proceeding as in Biau et al. (2012), we therefore have that (writing for simplicity $\beta$ instead of $\beta_n$ in the rest of the proof)
$$T^{(n)}(x) := E\big[(\eta_D^{(n)}(x) - \eta(x))^2\big] \leq 2\, T_1^{(n)}(x) + 2\, T_2^{(n)}(x),$$
with
$$T_1^{(n)}(x) = E\bigg[\bigg(\frac{1}{K_x^{(n)}} \sum_{i=1}^{K_x^{(n)}} \big(Y_{x,(i)} - \eta(X_{x,(i)})\big)\bigg)^{\!2}\bigg] \quad \text{and} \quad T_2^{(n)}(x) = E\bigg[\bigg(\frac{1}{K_x^{(n)}} \sum_{i=1}^{K_x^{(n)}} \big(\eta(X_{x,(i)}) - \eta(x)\big)\bigg)^{\!2}\bigg].$$
Writing $\mathcal{T}_X^{(n)}$ for the sigma-algebra generated by $X_i$, $i = 1, \ldots, n$, note that, conditional on $\mathcal{T}_X^{(n)}$, the $Y_{x,(i)} - \eta(X_{x,(i)})$'s, $i = 1, \ldots, n$, are zero-mean mutually independent random variables. Consequently,
$$T_1^{(n)}(x) = E\bigg[\frac{1}{(K_x^{(n)})^2} \sum_{i,j=1}^{K_x^{(n)}} E\Big[\big(Y_{x,(i)} - \eta(X_{x,(i)})\big)\big(Y_{x,(j)} - \eta(X_{x,(j)})\big) \,\Big|\, \mathcal{T}_X^{(n)}\Big]\bigg] = E\bigg[\frac{1}{(K_x^{(n)})^2} \sum_{i=1}^{K_x^{(n)}} E\Big[\big(Y_{x,(i)} - \eta(X_{x,(i)})\big)^2 \,\Big|\, \mathcal{T}_X^{(n)}\Big]\bigg] \leq E\Big[\frac{4}{K_x^{(n)}}\Big] \leq \frac{4}{k_n} = o(1),$$
as $n \to \infty$, where we used the fact that $K_x^{(n)} \geq k_n$ almost surely. As for $T_2^{(n)}(x)$, the Cauchy-Schwarz inequality yields (for an arbitrary $a > 0$)
$$T_2^{(n)}(x) \leq E\bigg[\frac{1}{K_x^{(n)}} \sum_{i=1}^{K_x^{(n)}} \big(\eta(X_{x,(i)}) - \eta(x)\big)^2\bigg] = E\bigg[\frac{1}{K_x^{(n)}} \sum_{i=1}^{K_x^{(n)}} \big(\eta(X_{x,(i)}) - \eta(x)\big)^2\, I[\|X_{x,(i)} - x\| \leq a]\bigg] + E\bigg[\frac{1}{K_x^{(n)}} \sum_{i=1}^{K_x^{(n)}} \big(\eta(X_{x,(i)}) - \eta(x)\big)^2\, I[\|X_{x,(i)} - x\| > a]\bigg]$$
$$\leq \sup_{y \in B_x(a)} |\eta(y) - \eta(x)|^2 + 4\, E\bigg[\frac{1}{K_x^{(n)}} \sum_{i=1}^{K_x^{(n)}} I[\|X_{x,(i)} - x\| > a]\bigg] =: \tilde T_2(x; a) + \tilde T_2^{(n)}(x; a).$$
Continuity of $\eta$ at $x$ implies that, for any $\epsilon > 0$, one may choose $a = a(\epsilon) > 0$ so that $\tilde T_2(x; a(\epsilon)) < \epsilon$. Since Lemma A.2 readily yields that $\tilde T_2^{(n)}(x; a(\epsilon)) = 0$ for large $n$, we conclude that $T_2^{(n)}(x)$, hence also $T^{(n)}(x)$, is $o(1)$. The Lebesgue dominated convergence theorem then yields that $E[(\eta_D^{(n)}(X) - \eta(X))^2]$ is $o(1)$. Therefore, using the fact that $P[m_D^{(n)}(X) \neq Y \mid \mathcal{T}_n] \geq L_{\mathrm{opt}}$ almost surely and applying Lemma A.3, we obtain
$$E\Big[\big|P[m_D^{(n)}(X) \neq Y \mid \mathcal{T}_n] - L_{\mathrm{opt}}\big|\Big] = E\Big[P[m_D^{(n)}(X) \neq Y \mid \mathcal{T}_n] - L_{\mathrm{opt}}\Big] = P[m_D^{(n)}(X) \neq Y] - L_{\mathrm{opt}} \leq 2\big(E[(\eta_D^{(n)}(X) - \eta(X))^2]\big)^{1/2} = o(1),$$
as $n \to \infty$, which establishes the result.
Finally, we show that Properties (Q1)-(Q3) hold for several classical statistical
depth functions.
Theorem A.1 Properties (Q1)-(Q3) hold for (i) the halfspace depth and (ii) the simplicial depth. (iii) If the location and scatter functionals $\mu(P)$ and $\Sigma(P)$ are such that (a) $\mu(P) = \theta$ as soon as the probability measure $P$ is symmetric about $\theta$ and such that (b) the empirical versions $\mu(P^{(n)})$ and $\Sigma(P^{(n)})$ associated with an i.i.d. sample $X_1, \ldots, X_n$ from $P$ are strongly consistent for $\mu(P)$ and $\Sigma(P)$, then Properties (Q1)-(Q3) also hold for the Mahalanobis depth.
Proof of Theorem A.1. (i) The continuity of $D$ in Property (Q1) actually holds under the only assumption that $P$ admits a density with respect to the Lebesgue measure; see Proposition 4 in Rousseeuw and Ruts (1999). Property (Q2) is a consequence of Theorems 1 and 2 in Rousseeuw and Struyf (2004) and the fact that the angular symmetry center is unique for absolutely continuous distributions; see Serfling (2006). For halfspace depth, Property (Q3) follows from (6.2) and (6.6) in Donoho and Gasko (1992).

(ii) The continuity of $D$ in Property (Q1) actually holds under the only assumption that $P$ admits a density with respect to the Lebesgue measure; see Theorem 2 in Liu (1990). Remark C in Liu (1990) shows that, for an angularly symmetric probability measure (hence also for a centrally symmetric probability measure) admitting a density, the symmetry center is the unique point maximizing simplicial depth provided that the density remains positive in a neighborhood of the symmetry center; Property (Q2) trivially follows. Property (Q3) for simplicial depth is stated in Corollary 1 of Dümbgen (1992).

(iii) This is trivial.

Finally, note that Properties (Q1)-(Q3) also hold for projection depth under very mild assumptions on the univariate location and scale functionals used in the definition of projection depth; see Zuo (2003).
References

Biau, G., Devroye, L., Dujmović, V., and Krzyżak, A. (2012), "An Affine Invariant k-Nearest Neighbor Regression Estimate," arXiv:1201.0586v1, math.ST.

Chacón, J. E. (2009), "Data-driven Choice of the Smoothing Parametrization for Kernel Density Estimators," Canadian Journal of Statistics, 37, 249-265.

Chacón, J. E., Duong, T., and Wand, M. P. (2011), "Asymptotics for General Multivariate Kernel Density Derivative Estimators," Statistica Sinica, 21, 807-840.

Cui, X., Lin, L., and Yang, G. (2008), "An Extended Projection Data Depth and Its Applications to Discrimination," Communications in Statistics - Theory and Methods, 37, 2276-2290.

Dehon, C. and Croux, C. (2001), "Robust Linear Discriminant Analysis using S-Estimators," Canadian Journal of Statistics, 29, 473-492.

Devroye, L., Györfi, L., and Lugosi, G. (1996), A Probabilistic Theory of Pattern Recognition (Stochastic Modelling and Applied Probability), New York: Springer.

Donoho, D. L. and Gasko, M. (1992), "Breakdown Properties of Location Estimates based on Halfspace Depth and Projected Outlyingness," The Annals of Statistics, 20, 1803-1827.

Dümbgen, L. (1992), "Limit Theorems for the Simplicial Depth," Statistics & Probability Letters, 14, 119-128.

Dümbgen, L. (1998), "On Tyler's M-Functional of Scatter in High Dimension," Annals of the Institute of Statistical Mathematics, 50, 471-491.

Dutta, S. and Ghosh, A. K. (2012a), "On Robust Classification using Projection Depth," Annals of the Institute of Statistical Mathematics, 64, 657-676.

Dutta, S. and Ghosh, A. K. (2012b), "On Classification Based on L_p Depth with an Adaptive Choice of p," Submitted.

Ghosh, A. K. and Chaudhuri, P. (2005a), "On Data Depth and Distribution-free Discriminant Analysis using Separating Surfaces," Bernoulli, 11, 1-27.

Ghosh, A. K. and Chaudhuri, P. (2005b), "On Maximum Depth and Related Classifiers," Scandinavian Journal of Statistics, 32, 327-350.

Hartikainen and Oja, H. (2006), "On some Parametric, Nonparametric and Semiparametric Discrimination Rules," DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 72, 61-70.

He, X. and Fung, W. K. (2000), "High Breakdown Estimation for Multiple Populations with Applications to Discriminant Analysis," Journal of Multivariate Analysis, 72, 151-162.

Hettmansperger, T. P. and Randles, R. H. (2002), "A Practical Affine Equivariant Multivariate Median," Biometrika, 89, 851-860.

Hoberg, A. and Mosler, K. (2006), "Data Analysis and Classification with the Zonoid Depth," DIMACS Series in Discrete Mathematics and Theoretical Computer Science, 72, 49-59.

Hubert, M. and Van der Veeken, S. (2010), "Robust Classification for Skewed Data," Advances in Data Analysis and Classification, 4, 239-254.

Jörnsten, R. (2004), "Clustering and Classification Based on the L1 Data Depth," Journal of Multivariate Analysis, 90, 67-89.

Koshevoy, G. and Mosler, K. (1997), "Zonoid Trimming for Multivariate Distributions," The Annals of Statistics, 25, 1998-2017.

Lange, T., Mosler, K., and Mozharovskyi, P. (2012), "Fast Nonparametric Classification based on Data Depth," Discussion Papers in Statistics and Econometrics 01/2012, University of Cologne.

Li, J., Cuesta-Albertos, J., and Liu, R. Y. (2012), "DD-Classifier: Nonparametric Classification Procedures based on DD-Plots," Journal of the American Statistical Association, to appear.

Liu, R. Y. (1990), "On a Notion of Data Depth based on Random Simplices," The Annals of Statistics, 18, 405-414.

Liu, R. Y., Parelius, J. M., and Singh, K. (1999), "Multivariate Analysis by Data Depth: Descriptive Statistics, Graphics and Inference," The Annals of Statistics, 27, 783-840.

Oja, H. and Paindaveine, D. (2005), "Optimal Signed-Rank Tests based on Hyperplanes," Journal of Statistical Planning and Inference, 135, 300-323.

Randles, R. H., Broffitt, J. D., Ramberg, J. S., and Hogg, R. V. (1978), "Generalized Linear and Quadratic Discriminant Functions using Robust Estimates," Journal of the American Statistical Association, 73, 564-568.

Ripley, B. D. (1996), Pattern Recognition and Neural Networks, Cambridge: Cambridge University Press.

Rousseeuw, P. J. and Ruts, I. (1999), "The Depth Function of a Population Distribution," Metrika, 49, 213-244.

Rousseeuw, P. J. and Struyf, A. (2004), "Characterizing Angular Symmetry and Regression Symmetry," Journal of Statistical Planning and Inference, 122, 161-173.

Serfling, R. J. (2006), "Multivariate Symmetry and Asymmetry," Encyclopedia of Statistical Sciences, 8, 5338-5345.

Stone, C. J. (1977), "Consistent Nonparametric Regression," The Annals of Statistics, 5, 595-620.

Tukey, J. W. (1975), "Mathematics and the Picturing of Data," Proceedings of the International Congress of Mathematicians, 2, 523-531.

Tyler, D. E. (1987), "A Distribution-free M-Estimator of Multivariate Scatter," The Annals of Statistics, 15, 234-251.

Yeh, I. C., Yang, K. J., and Ting, T. M. (2009), "Knowledge Discovery on RFM Model using Bernoulli Sequence," Expert Systems with Applications, 36, 5866-5871.

Zakai, A. and Ritov, Y. (2009), "Consistency and Localizability," Journal of Machine Learning Research, 10, 827-856.

Zuo, Y. (2003), "Projection-based Depth Functions and Associated Medians," The Annals of Statistics, 31, 1460-1490.

Zuo, Y. and Serfling, R. (2000a), "General Notions of Statistical Depth Function," The Annals of Statistics, 28, 461-482.

Zuo, Y. and Serfling, R. (2000b), "Structural Properties and Convergence Results for Contours of Sample Statistical Depth Functions," The Annals of Statistics, 28, 483-499.