SDV: Smoothing and Data Visualization

Ronaldo Dias
Departamento de Estatística - IMECC
Universidade Estadual de Campinas
São Paulo, Brasil
E-mail address: dias@ime.unicamp.br
Preface

In recent years more and more data have been collected in order to extract information or to learn valuable characteristics about experiments, phenomena, observational facts, etc. This is what has been called learning from data. Due to their complexity, several datasets have been analyzed by nonparametric approaches. This field of Statistics imposes minimal assumptions in order to get useful information from data. In fact, nonparametric procedures usually let the data speak for themselves. This work is a brief introduction to a few of the most useful procedures in nonparametric estimation toward smoothing and data visualization. In particular, it describes the theory and the applications of nonparametric curve estimation (density and regression) problems, with emphasis on kernel, nearest neighbor, orthogonal series and smoothing splines methods. The text is designed for undergraduate students in the mathematical sciences, engineering and economics. It requires at least one semester of calculus, probability and mathematical statistics.
Contents

1 Introduction
2 Kernel estimation
    2.1 The Histogram
    2.4 Bandwidth Selection
        2.4.3 Least-Squares Cross-Validation
3 Kernel Nonparametric Regression Method
4 Spline Functions
    4.1 Acquiring the Taste
5
    5.1 Additive Models
6
    6.2 P-splines
Final Comments
List of Figures

2.2.1 Naive estimate constructed from Old faithful geyser data with h = 0.1
2.2.2 Kernel density estimate constructed from Old faithful geyser data with Gaussian kernel and h = 0.25
2.2.3 Bandwidth effect on kernel density estimates. The data set income was rescaled to have mean 1.
5.1.1 True, tensor product, gam non-adaptive and gam adaptive surfaces
6.1.1 Spline least square fittings for different values of K
6.1.3 Five thousand replicates of the affinity and the partial affinity for adaptive nonparametric regression using H-splines, with the true curve
6.3.8 One hundred estimates of the curve in 6.3.7 and a Bayesian confidence interval for the regression curve g(t) = exp(−t²/2) cos(4t), t ∈ [0, π]
Chapter 1
Introduction
Probably the most used procedure to describe a possible relationship among variables is the statistical technique known as regression analysis. It is always useful to begin the study of regression analysis by making use of simple models. For this, assume that we have collected observations from a continuous variable Y at n values of a predictor variable T. Let (t_j, y_j) be such that

    y_j = g(t_j) + ε_j,    j = 1, ..., n,    (1.0.1)

where the random variables ε_j are uncorrelated with mean zero and variance σ². Moreover, g(t_j) are the values obtained from some unknown function g computed at the points t_1, ..., t_n. In general, the function g is called the regression function or regression curve.

A parametric regression model assumes that the form of g is known up to a finite number of parameters. That is, we can write a parametric regression model as

    y_j = g(t_j, θ) + ε_j,    j = 1, ..., n,    (1.0.2)

where θ = (θ_1, ..., θ_p)^T ∈ R^p. Thus, determining the curve g from the data is equivalent to determining the vector of parameters θ. One may notice that if g has a linear form, i.e., g(t, θ) = Σ_{j=1}^p θ_j x_j(t), where {x_j(t)}_{j=1}^p are the explanatory variables, we have the familiar linear regression model.
Certainly, there are other methods of fitting curves to data. For nonparametric estimators, when the regression curve is twice differentiable the rate of convergence is usually n^{−4/5}. However, in the case where the parametric model is incorrectly specified, the rate n^{−1} cannot be achieved. In fact, the parametric estimator does not even converge to the true regression curve.
Chapter 2
Kernel estimation
Suppose we have n independent measurements {(t_i, y_i)}_{i=1}^n, with the regression equation described, in general, as in (1.0.1). Note that the regression curve g is the conditional expectation of the response variable Y given the predictor variable T, that is, g(t) = E[Y | T = t]. When we try to approximate the mean response function g, we concentrate on the average dependence of Y on T = t. This means that we try to estimate the conditional mean curve

    g(t) = E[Y | T = t] = ∫ y ( f_TY(t, y) / f_T(t) ) dy,    (2.0.1)

where f_TY(t, y) denotes the joint density of (T, Y) and f_T(t) the marginal density of T. In order to provide an estimate ĝ(t) of g, we need to obtain estimates of f_TY(t, y) and f_T(t). Consequently, density estimation methodologies will be described.
2.1 The Histogram
The histogram is one of the first, and one of the most common, methods of density estimation. It is important to bear in mind that the histogram is a smoothing
technique used to estimate the unknown density and hence it deserves some consideration.
Let us try to combine the data by counting how many data points fall into a small
interval of length h. This kind of interval is called a bin. Observe that the well known
dot plot of Box, Hunter and Hunter (1978) is a particular type of histogram where
h = 0.
Without loss of generality, consider a bin centered at 0, namely the interval [−h/2, h/2), and let F_X be the distribution function of X, assumed absolutely continuous with respect to Lebesgue measure on R. Consequently, the probability that an observation of X falls into the interval [−h/2, h/2) is given by

    P(X ∈ [−h/2, h/2)) = ∫_{−h/2}^{h/2} f_X(x) dx,

which can be estimated by the relative frequency

    (1/n) #{X_i ∈ [−h/2, h/2)}.

Now, applying the mean value theorem for continuous bounded functions, we obtain

    P(X ∈ [−h/2, h/2)) = ∫_{−h/2}^{h/2} f(x) dx = f(ξ) h,   for some ξ ∈ [−h/2, h/2).
Formally, suppose we observe random variables X_1, ..., X_n whose unknown common density is f. Let k be the number of bins, define C_j = [x_0 + (j−1)h, x_0 + jh), and let the indicator function of a set A be

    I(x ∈ A) = 1 if x ∈ A,  0 otherwise.

The histogram estimate is then

    f̂_h(x) = (1/(nh)) Σ_{j=1}^{k} n_j I(x ∈ C_j),

for all x, where n_j = #{X_i ∈ C_j} is the number of observations falling in the j-th bin. Here, note that the density estimate f̂_h depends upon the histogram bandwidth h. By varying h we obtain different shapes of f̂_h. For example, if one increases h, one is averaging over more data and the histogram appears smoother. When h → 0, the histogram becomes a very noisy representation of the data (a needle-plot, Härdle (1990)). In the opposite situation, when h → ∞, the histogram becomes overly smooth (box-shaped). Thus, h is the smoothing parameter of this type of density estimate, and the question of how to choose the histogram bandwidth h turns out to be an important one in representing the data via the histogram. For details on how to estimate h see Härdle (1990).
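The bin-counting construction above can be sketched in a few lines. The function below is my own illustration (names are not from the text): it locates the bin C_j containing x and returns n_j/(nh).

```python
import numpy as np

def histogram_density(x, data, x0, h):
    """Histogram estimate fhat_h(x) = n_j / (n h), where n_j counts the
    observations in the bin C_j = [x0 + (j-1)h, x0 + j h) that contains x."""
    data = np.asarray(data, dtype=float)
    j = np.floor((x - x0) / h)                    # index of the bin containing x
    n_j = np.sum(np.floor((data - x0) / h) == j)  # observations in that bin
    return n_j / (len(data) * h)
```

For instance, with the four observations 0.1, 0.2, 0.25, 0.7, origin x0 = 0 and h = 0.5, three points fall in the first bin, so f̂(0.15) = 3/(4 · 0.5) = 1.5.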
The motivation behind the histogram can be expanded quite naturally. For this, consider the weight function

    K(x) = 1/2 if |x| < 1,  0 otherwise,

and define the estimate

    f̂(x) = (1/(nh)) Σ_{i=1}^{n} K((x − X_i)/h).    (2.2.1)

We can see that f̂ extends the idea of the histogram. Notice that this estimate just places a box of width 2h and height (2nh)^{−1} on each observation and then sums them to obtain f̂. See Silverman (1986) for a discussion of this kind of estimator. It is not difficult to verify that f̂ is not a continuous function and has zero derivative everywhere except at the jump points X_i ± h. Besides having the undesirable character of nonsmoothness (Silverman (1986)), it could give a misleading impression to an untrained observer, since its somewhat ragged character might suggest several different bumps.
Figure 2.2.1 shows the nonsmooth character of the naive estimate. The data seem
to have two major modes. However, the naive estimator suggests several different
small bumps.
Figure 2.2.1: Naive estimate constructed from Old faithful geyser data with h = 0.1
To overcome some of these difficulties, assumptions have been introduced on the function K. That is, K must be a nonnegative kernel function that satisfies

    ∫ K(x) dx = 1.
Figure 2.2.2: Kernel density estimate constructed from Old faithful geyser data with
Gaussian kernel and h = 0.25
Note that an estimate based on the kernel function places bumps at the observations, and the shape of those bumps is determined by the kernel function K. The bandwidth h sets the width around each observation, and this bandwidth controls the degree of smoothness of a density estimate. It is possible to verify that as h → 0 the estimate becomes a sum of Dirac delta functions at the observations, while as h → ∞ it eliminates all the local roughness and possibly important details are missed.
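The estimate (2.2.1) with a Gaussian kernel can be sketched as follows (a minimal illustration; the function name is mine):

```python
import numpy as np

def kernel_density(x, data, h):
    """fhat(x) = (1/(n h)) sum_i K((x - X_i)/h) with the Gaussian kernel K."""
    data = np.asarray(data, dtype=float)
    u = (x - data) / h
    return np.exp(-0.5 * u ** 2).sum() / (len(data) * h * np.sqrt(2.0 * np.pi))
```

Unlike the naive (box-kernel) estimator, this function is smooth in x, since the Gaussian kernel is infinitely differentiable.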
The data for Figure 2.2.3, labelled income, were provided by Charles Kooperberg. This data set consists of 7125 random samples of yearly net income in the United Kingdom (Family Expenditure Survey, 1968-1983). The income data set is considerably large, so it is more of a challenge to computing resources, and it contains severe outliers. The peak at 0.24 is due to the UK old age pension, which caused many people to have nearly identical incomes. The width of the peak is about 0.02, compared to the range 11.5 of the data. The rise of the density to the left of the peak is very steep.

Figure 2.2.3: Bandwidth effect on kernel density estimates (h = R default, 0.12, 0.25, 0.5). The data set income was rescaled to have mean 1.
There is a vast literature on kernel density estimation (Silverman (1986)) studying its mathematical properties and proposing several algorithms to obtain estimates based on it. This method of density estimation became, apart from the histogram, the most commonly used estimator. However, it has drawbacks when the underlying density has long tails (Silverman (1986)). What causes this problem is the fact that the bandwidth is fixed for all observations, taking no account of any local characteristic of the data.

In order to solve this problem, several other kernel density estimation methods were proposed, such as the nearest neighbor and the variable kernel. A detailed discussion and illustration of these methods can be found in Silverman (1986).
The idea behind the nearest neighbor method is to adapt the amount of smoothing to local characteristics of the data. The degree of smoothing is then controlled by an integer k. Essentially, the nearest neighbor density estimator uses distances from x to the data points. For example, let d(x_1, x) be the distance of the data point x_1 from the point x, and for each x let d_k(x) denote the distance from x to its kth nearest neighbor among the data points x_1, ..., x_n.

The kth nearest neighbor density estimate is defined as

    f̂(x) = k / (2 n d_k(x)),

and its generalized (kernel) form is

    f̂(x) = (1/(n d_k(x))) Σ_{i=1}^{n} K((x − X_i)/d_k(x)).
Observe that the overall amount of smoothing is governed by the choice of k, but the bandwidth used at any particular point depends on the density of observations near that point. Again, we face the problem of discontinuity of f̂ at all the points where the function d_k(x) has a discontinuous derivative. The precise integrability and tail properties will depend on the exact form of the kernel.
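The basic kth nearest neighbor estimate is a one-liner once the ordered distances are available; the sketch below is my own illustration of f̂(x) = k/(2n d_k(x)):

```python
import numpy as np

def knn_density(x, data, k):
    """k-th nearest neighbour estimate fhat(x) = k / (2 n d_k(x))."""
    data = np.asarray(data, dtype=float)
    d = np.sort(np.abs(data - x))   # ordered distances from x
    return k / (2.0 * len(data) * d[k - 1])   # d[k-1] is d_k(x)
```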
Figure 2.2.4 shows the effect of the smoothing parameter k on the density estimate. Observe that as k decreases the density estimate becomes rougher; this effect is analogous to letting h approach zero in the kernel density estimator.
[Figure 2.2.4: nearest neighbor density estimates with k = 20, 30 and 40, together with the true density.]
Consider now the expectation of the kernel estimate:

    E[f̂(x)] = (1/(nh)) Σ_{i=1}^{n} E[K((x − X_i)/h)]
             = (1/h) E[K((x − X_1)/h)]
             = (1/h) ∫ K((x − u)/h) f(u) du
             = ∫ K(y) f(x + yh) dy.    (2.3.1)

As h → 0, ∫ K(y) f(x + yh) dy → f(x) ∫ K(y) dy = f(x); thus f̂ is an asymptotically unbiased estimator of f. To compute the bias of this estimator we have to assume that the underlying density is twice differentiable and satisfies the following conditions (Prakasa Rao (1983)):

Condition 1. sup_x K(x) ≤ M < ∞; |x| K(x) → 0 as |x| → ∞.

Condition 2. K(x) = K(−x), x ∈ (−∞, ∞), with ∫ x² K(x) dx < ∞.

Under these conditions, a Taylor expansion of f(x + yh) gives

    E[f̂(x)] − f(x) = (h²/2) f″(x) ∫ y² K(y) dy + o(h²).

We observe that, since we have assumed the kernel K to be symmetric around zero, ∫ y K(y) h f′(x) dy = 0, and the bias is quadratic in h (Parzen (1962)).
Using a similar approach we obtain

    Var[f̂(x)] = (1/(nh)) f(x) ‖K‖₂² + o(1/(nh)),   where ‖K‖₂² = ∫ K²(x) dx,

so that

    MSE_f[f̂(x)] = (1/(nh)) f(x) ‖K‖₂² + (h⁴/4) (f″(x))² (∫ y² K(y) dy)² + o(1/(nh)) + o(h⁴).

Minimizing the leading terms with respect to h yields the pointwise optimal bandwidth

    h* = [ f(x) ‖K‖₂² / ( (f″(x))² (∫ y² K(y) dy)² ) ]^{1/5} n^{−1/5}.    (2.4.1)
The problem with this approach is that h* depends on two unknown functions, f(·) and f″(·). An approach to overcome this problem uses a global measure, which can be defined as:
    IMSE[f̂] = ∫ MSE_f[f̂(x)] dx
             = (1/(nh)) ‖K‖₂² + (h⁴/4) (∫ y² K(y) dy)² ‖f″‖₂² + o(1/(nh)) + o(h⁴).    (2.4.2)

IMSE is the well-known integrated mean squared error of a density estimate. The optimal value of h under the IMSE is defined as

    h_opt = arg min_{h>0} IMSE[f̂].
It can be shown that

    h_opt = c₂^{−2/5} ( ∫ K²(x) dx )^{1/5} ( ‖f″‖₂² )^{−1/5} n^{−1/5},    (2.4.3)

where c₂ = ∫ x² K(x) dx.
A very natural way to get around the problem of not knowing f is to use a standard family of distributions to assign a value to the term ‖f″‖₂² in expression (2.4.3). For example, assume that the density f belongs to the Gaussian family with mean μ and variance σ²; then

    ∫ (f″(x))² dx = σ^{−5} ∫ (φ″(x))² dx = (3/8) π^{−1/2} σ^{−5} ≈ 0.212 σ^{−5},    (2.4.4)

where φ(x) is the standard normal density. If one uses a Gaussian kernel, then

    h_opt = (4π)^{−1/10} ((3/8) π^{−1/2})^{−1/5} σ n^{−1/5} = (4/3)^{1/5} σ n^{−1/5} ≈ 1.06 σ n^{−1/5}.    (2.4.5)

Hence, in practice a possible choice for h_opt is 1.06 σ̂ n^{−1/5}, where σ̂ is the sample standard deviation.
If we want to make this estimate more insensitive to outliers, we have to use a more robust estimate for the scale parameter of the distribution. Let R be the sample interquartile range; then one possible choice for h is

    h = 1.06 min(σ̂, R/(Φ^{−1}(3/4) − Φ^{−1}(1/4))) n^{−1/5}
      = 1.06 min(σ̂, R/1.349) n^{−1/5}.    (2.4.6)
Figure 2.4.5 exhibits how a robust estimate of the scale can help in choosing the
bandwidth. Note that by using R we have strong evidence that the underlying density has two modes.
Figure 2.4.5: Comparison of two bandwidths, σ̂ (the sample standard deviation) and R (the sample interquartile range), for the mixture 0.7 N(−2, 1) + 0.3 N(1, 1).
Consider the kernel density estimate f̂ and suppose we want to test, for a specific h, the hypothesis

    f̂(x) = f(x)   vs.   f̂(x) ≠ f(x),

for a fixed x. The likelihood ratio test would be based on the test statistic f(x)/f̂(x). For a good bandwidth this statistic should thus be close to 1. Alternatively, we would expect E[log(f(X)/f̂(X))] to be close to 0. This expectation is the Kullback-Leibler distance

    d_KL(f, f̂) = ∫ log( f(x)/f̂(x) ) f(x) dx.    (2.4.7)
Of course, we are not able to compute d_KL(f, f̂) from the data, since we do not know f. But from a theoretical point of view, we can investigate this distance for the choice of an appropriate bandwidth h. When d_KL(f, f̂) is close to 0, we have the best agreement with the hypothesis f̂ = f. Hence, we are looking for a bandwidth h which minimizes d_KL(f, f̂).

Suppose we are given a set of additional observations X̃_i, independent of the others. The likelihood of these observations is Π_i f(X̃_i). Substituting f̂ into the likelihood we have Π_i f̂(X̃_i), and the value of this statistic for different h would indicate which value of h is preferable, since the logarithm of this statistic is related to d_KL(f, f̂). Usually, we do not have additional observations. A way out of this dilemma is to base the estimate f̂ on the subset {X_j}_{j≠i} and to calculate the likelihood at X_i, with the leave-one-out estimate defined by
    f̂_{−i}(X_i) = (n − 1)^{−1} h^{−1} Σ_{j≠i} K( (X_i − X_j)/h ).

Hence,

    Π_{i=1}^{n} f̂_{−i}(X_i) = (n − 1)^{−n} h^{−n} Π_{i=1}^{n} Σ_{j≠i} K( (X_i − X_j)/h ),    (2.4.8)

and the maximum likelihood cross-validation score is

    CV_KL(h) = (1/n) Σ_{i=1}^{n} log f̂_{−i}(X_i)    (2.4.9)
             = (1/n) Σ_{i=1}^{n} log[ Σ_{j≠i} K((X_i − X_j)/h) ] − log[(n − 1) h].    (2.4.10)
Since we assumed that the X_i are i.i.d., the scores log f̂_{−i}(X_i) are identically distributed, and so

    E[CV_KL(h)] ≈ E[ ∫ log f̂(x) f(x) dx ] = −E[d_KL(f, f̂)] + ∫ log[f(x)] f(x) dx.    (2.4.11)

The second term on the right-hand side does not depend on h. Hence, we can expect that maximizing CV_KL approximates the optimal bandwidth that minimizes d_KL(f, f̂).
The maximum likelihood cross-validation has two shortcomings:

- When we have identical observations at one point, we may obtain an infinite value of CV_KL(h), and hence we cannot define an optimal bandwidth.
- Suppose we use a kernel function with finite support, e.g., the interval [−1, 1]. If an observation X_i is separated from the other observations by more than the bandwidth h, the likelihood f̂_{−i}(X_i) becomes 0. Hence the score function reaches the value −∞. Maximizing CV_KL(h) forces us to use a large bandwidth to prevent this degenerate case, which might lead to slight over-smoothing for the other observations.
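The score (2.4.9) is easy to compute directly; the sketch below (my own naming, Gaussian kernel assumed) evaluates it, and one would maximize it over a grid of candidate bandwidths:

```python
import numpy as np

def cv_kl(data, h):
    """Likelihood cross-validation score (1/n) sum_i log fhat_{-i}(X_i),
    using a Gaussian kernel and the leave-one-out estimate."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    score = 0.0
    for i in range(n):
        u = (x[i] - np.delete(x, i)) / h
        f_i = np.exp(-0.5 * u ** 2).sum() / ((n - 1) * h * np.sqrt(2.0 * np.pi))
        score += np.log(f_i)
    return score / n
```

With a Gaussian kernel the degenerate −∞ case cannot occur for distinct data, but very small bandwidths are still heavily penalized.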
2.4.3 Least-Squares Cross-Validation

Consider the integrated squared error of f̂_h,

    d_ISE(h) = ∫ (f̂_h − f)²(x) dx = ∫ f̂_h²(x) dx − 2 ∫ (f̂_h f)(x) dx + ∫ f²(x) dx.    (2.4.12)

The term ∫ f²(x) dx does not depend on h. For the middle term, observe that ∫ (f̂_h f)(x) dx = E[f̂_h(X)], where the expectation is understood to be over a new observation X independent of the sample; it can be estimated by the leave-one-out average

    (1/n) Σ_{i=1}^{n} f̂_{h,−i}(X_i).    (2.4.13)

This leads to the least-squares cross-validation criterion

    CV_LS(h) = ∫ f̂_h²(x) dx − (2/n) Σ_{i=1}^{n} f̂_{h,−i}(X_i),    (2.4.14)

for which

    E[CV_LS(h)] = IMSE[f̂_h] − ‖f‖₂².    (2.4.15)
An interesting question is how good the approximation of d_ISE by CV_LS is. To investigate this, define a sequence of bandwidths ĥ_n = ĥ(X_1, ..., X_n) to be asymptotically optimal if

    d_ISE(ĥ_n) / inf_{h>0} d_ISE(h) → 1,  a.s. as n → ∞.

It can be shown that if the density f is bounded then ĥ_LS is asymptotically optimal. Similarly to maximum likelihood cross-validation, one can find in Härdle (1990) an algorithm to compute the least-squares cross-validation.
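For the Gaussian kernel the term ∫ f̂_h² in (2.4.14) has a closed form, since the convolution of two Gaussians of standard deviation h is a Gaussian of standard deviation h√2; the sketch below (my own naming) exploits this:

```python
import numpy as np

def _phi(u, s):
    """Normal density with standard deviation s evaluated at u."""
    return np.exp(-0.5 * (u / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def cv_ls(data, h):
    """CV_LS(h) = int fhat_h^2 - (2/n) sum_i fhat_{h,-i}(X_i), Gaussian kernel.
    The integral term equals (1/n^2) sum_{i,j} phi(X_i - X_j; sqrt(2) h)."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    diff = x[:, None] - x[None, :]
    int_f2 = _phi(diff, np.sqrt(2.0) * h).sum() / n ** 2
    loo = (_phi(diff, h).sum(axis=1) - _phi(0.0, h)) / (n - 1)  # leave-one-out fits
    return int_f2 - 2.0 * loo.mean()
```

Minimizing cv_ls over a grid of h values gives the least-squares cross-validated bandwidth.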
Suppose now that the density f has support [0, 1]. The idea is to use the theory of the orthogonal series method and to reduce the estimation procedure to estimating the coefficients of the Fourier expansion of f. Define the sequence φ_ν(x) by

    φ_0(x) = 1,
    φ_{2r−1}(x) = √2 sin(2πrx),   φ_{2r}(x) = √2 cos(2πrx),   r = 1, 2, ....

It is well known that f can be represented as the Fourier series Σ_{i=0}^{∞} a_i φ_i, where, for each i ≥ 0,

    a_i = ∫ f(x) φ_i(x) dx.    (2.5.1)

Now, suppose that X is a random variable with density f. Then (2.5.1) can be written as

    a_i = E[φ_i(X)],

and so an unbiased estimator of a_i based on X_1, ..., X_n is

    â_i = (1/n) Σ_{j=1}^{n} φ_i(X_j).

Note that Σ_{i=0}^{∞} â_i φ_i converges to a sum of delta functions at the observations: writing

    ω(x) = (1/n) Σ_{i=1}^{n} δ(x − X_i),   we have   â_i = ∫_0^1 ω(x) φ_i(x) dx,    (2.5.2)

and hence the â_i are exactly the Fourier coefficients of the function ω. The easiest way to smooth is to truncate the expansion Σ â_i φ_i at some point. That is, choose K and define a density estimate f̂ by

    f̂(x) = Σ_{i=0}^{K} â_i φ_i(x).    (2.5.3)
Figure 2.5.6: Effect of the smoothing parameter K on the orthogonal series method
for density estimation
A more general approach would be to choose a sequence of weights λ_i such that λ_i → 0 as i → ∞, and set

    f̂(x) = Σ_{i=0}^{∞} λ_i â_i φ_i(x).

The rate at which the weights λ_i converge to zero determines the amount of smoothing. For densities supported on the whole real line, we can take the weight function a(x) = e^{−x²/2} and orthogonal functions φ(x) proportional to Hermite polynomials.
The data in Figure 2.5.6 were provided to me by Francisco Cribari-Neto and consist of the variation rate of the ICMS (imposto sobre circulação de mercadorias e serviços, a Brazilian tax on the circulation of goods and services) for the city of Brasília, D.F., from August 1994 to July 1999.
Chapter 3
Kernel Nonparametric Regression Method
Suppose we have i.i.d. observations {(X_i, Y_i)}_{i=1}^n and the nonparametric regression model given in equation (1.0.1). By equation (2.0.1), we know how to estimate the denominator using the kernel density estimation method. For the numerator, one can estimate the joint density using the multiplicative kernel

    f̂_{h1,h2}(x, y) = (1/n) Σ_{i=1}^{n} K_{h1}(x − X_i) K_{h2}(y − Y_i),

so that

    ∫ y f̂_{h1,h2}(x, y) dy = (1/n) Σ_{i=1}^{n} K_{h1}(x − X_i) Y_i.

Based on the methodology of kernel density estimation, Nadaraya (1964) and Watson (1964) suggested the following estimator ĝ_h for g:

    ĝ_h(x) = Σ_{i=1}^{n} K_h(x − X_i) Y_i / Σ_{j=1}^{n} K_h(x − X_j).    (3.0.1)
Now, consider the model (1.0.1) and let X_1, ..., X_n be i.i.d. random variables with density f_X such that X_i is independent of ε_i for all i = 1, ..., n. Assume the conditions given in Section 2.3 and suppose that f and g are twice continuously differentiable in a neighborhood of the point x. Then, if h → 0 and nh → ∞ as n → ∞, we have ĝ_h → g in probability. Moreover, suppose E[|ε_i|^{2+δ}] < ∞ and ∫ |K(x)|^{2+δ} dx < ∞ for some δ > 0; then

    √(nh) (ĝ_h − E[ĝ_h]) → N(0, (f_X(x))^{−1} σ² ∫ (K(x))² dx)

in distribution, where N(·, ·) stands for a Gaussian distribution (see details in Pagan and Ullah (1999)).
As an example, Figure 3.0.1 shows the effect of choosing h on the Nadaraya-Watson procedure. The data consist of the speed of cars and the distances taken to stop. It is important to notice that the data were recorded in the 1920s. (These datasets can be found in the software R.) The Nadaraya-Watson kernel method can be extended to the multivariate regression problem by considering a multidimensional kernel.
[Figure 3.0.1: Nadaraya-Watson estimates with h = 2 and h = 5 for the cars data (speed vs. stopping distance).]
One may notice that regression by kernels is based on local averaging of the observations Y_i in a fixed neighborhood of x. Instead of this fixed neighborhood, k-NN employs varying neighborhoods in the support of the X variable. That is,

    ĝ_k(x) = (1/n) Σ_{i=1}^{n} W_{ki}(x) Y_i,    (3.1.1)

where

    W_{ki}(x) = n/k if i ∈ J_x,  0 otherwise,    (3.1.2)

and J_x = {i : X_i is one of the k nearest observations to x}. One can show that

    E[ĝ_k(x)] − g(x) ≈ [ g″(x) f(x) + 2 g′(x) f′(x) ] (k/n)² / (24 (f(x))³)    (3.1.3)

and

    Var[ĝ_k(x)] ≈ σ²/k.    (3.1.4)
We observe that the bias is increasing and the variance is decreasing in the smoothing parameter k. To balance this trade-off, one should choose k ∝ n^{4/5}. For details, see Härdle (1990).
Figure 3.1.2 shows the effect of the parameter k on the regression curve estimates. Note that the curve estimate with k = 2 is less smooth than the curve estimate with k = 1. The data set consists of the revenue passenger miles flown by commercial airlines in the United States for each year from 1937 to 1960 and is available through R.
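Under (3.1.1)-(3.1.2), the estimate at a point is simply the mean of the k nearest responses; a small sketch (names are my own):

```python
import numpy as np

def knn_regression(x, X, Y, k):
    """k-NN regression estimate: the average of Y over the k nearest X values."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    nearest = np.argsort(np.abs(X - x))[:k]   # indices in J_x
    return Y[nearest].mean()
```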
Figure 3.1.2: Effect of the smoothing parameter k on the k-NN regression estimates.
Cleveland (1979) proposed the algorithm LOWESS, locally weighted scatter plot smoothing, as an outlier-resistant method based on local polynomial fits. The basic idea is to start with a local polynomial (k-NN type) least squares fit and then to use robust methods to obtain the final fit. Specifically, one first fits a polynomial regression in a neighborhood of x, that is, finds β ∈ R^{p+1} which minimizes

    Σ_{i=1}^{n} W_{ki}(x) ( y_i − Σ_{j=0}^{p} β_j x_i^j )²,    (3.2.1)

where the W_{ki} denote k-NN weights. Compute the residuals ε̂_i and the scale parameter σ̂ = median(|ε̂_i|). Define robustness weights δ_i = K(ε̂_i/(6σ̂)), where K(u) = (15/16)(1 − u²)² if |u| ≤ 1 and K(u) = 0 otherwise. Then fit a polynomial regression as in (3.2.1) but with weights (δ_i W_{ki}(x)). Cleveland suggests that p = 1 provides a good balance between computational ease and the need for flexibility to reproduce patterns in the data. In addition, the smoothing parameter can be determined by cross-validation as in (2.4.10). Note that when using the R functions lowess or loess, f acts as the smoothing parameter; its relation to the k-NN neighborhood size is given by

    k = nf,   f ∈ (0, 1).
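A much-simplified, single-point sketch of the robust iteration is given below. It is not Cleveland's full algorithm: the k-NN distance weights W_ki are replaced by a plain 0/1 neighborhood, only the bisquare robustness weights are iterated, and all names are mine.

```python
import numpy as np

def lowess_at(x0, X, Y, f=2/3, iters=3):
    """LOWESS-style fit at one point: local linear least squares over the
    k = ceil(n f) nearest neighbours, iterated with bisquare robustness weights."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    k = int(np.ceil(len(X) * f))
    idx = np.argsort(np.abs(X - x0))[:k]        # k-NN neighbourhood of x0
    xs, ys = X[idx], Y[idx]
    delta = np.ones(k)                          # robustness weights, start at 1
    for _ in range(iters):
        sw = np.sqrt(delta)                     # weighted LS via scaled rows
        A = np.column_stack([np.ones(k), xs]) * sw[:, None]
        beta, *_ = np.linalg.lstsq(A, ys * sw, rcond=None)
        resid = ys - (beta[0] + beta[1] * xs)
        s = np.median(np.abs(resid)) + 1e-12    # robust scale estimate
        u = resid / (6.0 * s)
        delta = np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u ** 2) ** 2, 0.0)
    return beta[0] + beta[1] * x0
```

On linear data contaminated by one gross outlier, the bisquare weights drive the outlier's influence to zero after a couple of iterations.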
[Figure: lowess fits to the cars data with f = 2/3 and f = 0.2 (speed vs. stopping distance).]
This section is developed considering historical results, beginning with Good and Gaskins (1971) and ending with the most recent result, given by Gu (1993).

The maximum likelihood (M.L.) method has been used as a standard statistical procedure in the case where the underlying density f is known except for a finite number of parameters. It is well known that the M.L. estimator has optimal properties (asymptotic unbiasedness and asymptotic normality) for estimating the unknown parameters. Thus, it would be interesting if such a standard technique could be applied in a more general scheme where there is no assumption on the form of the underlying density beyond f belonging to a pre-specified family of density functions.

Let X_1, ..., X_n be i.i.d. random variables with unknown density f. The likelihood function is given by

    L(f | X_1, ..., X_n) = Π_{i=1}^{n} f(X_i).
The problem with this approach can be described by the following example. Recall the kernel estimate

    f̂_h(x) = (1/(nh)) Σ_{i=1}^{n} K((x − X_i)/h),

and evaluate the likelihood at f̂_h. Since the term with j = i contributes K(0)/(nh) at each X_i, we have f̂_h(X_i) ≥ K(0)/(nh) (assuming K(0) > 0), and thus

    L(f̂_h | X_1, ..., X_n) ≥ ( K(0)/(nh) )^n.

Letting h → 0, we have L → ∞. That is, L(f | X_1, ..., X_n) does not have a finite maximum over the class of all densities. Hence, the likelihood function can be made as large as one wants simply by taking densities with the smoothing parameter approaching zero. Densities having this characteristic, e.g., with bandwidth h → 0, approximate delta functions, and the likelihood function ends up being driven by a sum of spikes (delta functions). Therefore, without putting constraints on the class of all densities, the maximum likelihood procedure cannot be used properly.
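The unboundedness is easy to observe numerically. This small check (my own illustration, Gaussian kernel assumed) evaluates the log-likelihood of the kernel estimate at shrinking bandwidths:

```python
import numpy as np

def log_lik_of_kde(data, h):
    """log L(fhat_h) = sum_i log fhat_h(X_i) for the Gaussian-kernel estimate."""
    x = np.asarray(data, dtype=float)
    n = len(x)
    u = (x[:, None] - x[None, :]) / h
    f = np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))
    return np.log(f).sum()

# the log-likelihood grows without bound as h shrinks toward 0
```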
One possible way to overcome the problem described above is to consider a penalized log-likelihood function. The idea is to introduce a penalty term in the log-likelihood such that this penalty quantifies the smoothness of g = log f. Let us take, for instance, the roughness functional J(g) = ∫ (g″(x))² dx, and define the penalized log-likelihood function by

    L_λ(g) = (1/n) Σ_{i=1}^{n} g(X_i) − λ J(g),    (3.3.1)

where λ is the smoothing parameter which controls two conflicting goals: the fidelity to the data, given by Σ_{i=1}^{n} g(X_i), and the smoothness, given by the penalty term J(g).
The pioneering work on the penalized log-likelihood method is due to Good and Gaskins (1971), who suggested a Bayesian scheme in which the penalized log-likelihood (using their notation) becomes

    ω = ω(f) = L(f) − Φ(f),

where L = Σ_{i=1}^{n} g(X_i) and Φ is the smoothness penalty. In order to simplify the notation, let ∫ h have the same meaning as ∫ h(x) dx. Now, consider the number of bumps in the density as the measure of roughness or smoothness. The first approach was to take the penalty term proportional to Fisher's information, that is,

    Φ(f) = α ∫ (f′)²/f.

Now, by setting f = γ², Φ(f) becomes 4α ∫ (γ′)², and we can replace f by γ in the penalized likelihood equation. In doing so the constraint f ≥ 0 is eliminated, and the other constraint, ∫ f = 1, turns out to be equivalent to ∫ γ² = 1, with γ ∈ L²(−∞, ∞).
Good and Gaskins (1971) verified that the penalty 4α ∫ (γ′)² yielded density curves having portions that looked too straight. This fact can be explained by noting that the curvature depends also on the second derivative. Thus ∫ (γ″)² should be included in the penalty term. The final roughness functional proposed was

    Φ(f) = 4α ∫ (γ′)² + β ∫ (γ″)²,

with α and β satisfying

    2ασ² + (3/4)β = σ⁴.    (3.3.2)
The basis for this constraint is the feeling that the class of normal distributions forms the smoothest class of distributions, the improper uniform distribution being a limiting form. Moreover, they pointed out that some justification for this feeling is that a normal distribution is the distribution of maximum entropy for a given mean and variance. The integral ∫ (γ′)², being proportional to Fisher's information, is also minimized for a given variance when f is normal (Good and Gaskins, 1971). They thought it was reasonable to let the normal family calibrate the penalty. For a sample x_1, ..., x_N from N(μ, σ²), for which ∫ (γ′)² = 1/(4σ²) and ∫ (γ″)² = 3/(16σ⁴), the penalized log-likelihood becomes

    ω = −(N/2) log(2πσ²) − (1/(2σ²)) Σ_{i=1}^{N} (x_i − μ)² − ασ^{−2} − (3/16)βσ^{−4}.

Setting μ̂ = x̄ and ∂ω/∂σ² = 0 gives

    −N + σ^{−2} Σ_{i=1}^{N} (x_i − x̄)² + 2ασ^{−2} + (3/4)βσ^{−4} = 0.    (3.3.3)

Under the constraint (3.3.2), 2ασ^{−2} + (3/4)βσ^{−4} = 1, and the equation reduces to σ̂² = Σ_{i=1}^{N} (x_i − x̄)²/(N − 1), the usual unbiased estimate of the variance.
An advantage of Silverman's approach is that, by using the logarithm of the density in the augmented penalized likelihood functional, any density estimate obtained will automatically be positive and integrate to one. Specifically, let (m_1, ..., m_k) be a sequence of natural numbers so that 1 ≤ Σ_{i=1}^{k} m_i ≤ m, where m > 0 is such that g^{(m−1)} exists and is continuous. Define a linear differential operator D by

    D(g) = ∂^{m_1+···+m_k} g / (∂x_1^{m_1} ··· ∂x_k^{m_k}),

and the inner product

    ⟨g_1, g_2⟩ = Σ_D ∫ D(g_1) D(g_2),

where the sum runs over such operators and the integral is taken over an open set Ω with respect to Lebesgue measure. Let S be the set of real functions g on Ω for which:

- the (m − 1)th derivatives of g exist everywhere and are piecewise differentiable,
- ⟨g, g⟩ < ∞,
- ∫ e^g < ∞.
Given the data X_1, ..., X_n, i.i.d. with common density f, and writing g = log f, ĝ is the solution, if it exists, of the optimization problem

    max_g { (1/n) Σ_{i=1}^{n} g(X_i) − (λ/2) ⟨g, g⟩ }   subject to   ∫ e^g = 1,

and the density estimate is f̂ = e^ĝ.

Silverman presented an important result which turns the constrained optimization problem into a relatively easy computational scheme. Define

    ψ_0(g) = −(1/n) Σ_{i=1}^{n} g(X_i) + (λ/2) ⟨g, g⟩

and

    ψ(g) = −(1/n) Σ_{i=1}^{n} g(X_i) + ∫ e^g + (λ/2) ⟨g, g⟩.

Silverman proved that the unconstrained minimum of ψ(g) is identical to the constrained minimum of ψ_0, if such a minimizer exists.
Over L²(−∞, ∞), any g can be written as g = Σ_{n=0}^{∞} a_n φ_n, the series converging in mean square, with Σ_{n=0}^{∞} |a_n|² < ∞ and {a_n} ⊂ R; that is, any g in L² can be arbitrarily well approximated by finite expansions in an orthonormal basis {φ_n}. A convenient choice on the whole real line is the Hermite basis, with φ_n(x) proportional to e^{−x²/2} H_n(x), where

    H_n(x) = (−1)^n e^{x²} (d^n/dx^n) e^{−x²}.
The log density estimator proposed by O'Sullivan (1988) is defined as the minimizer of

    −(1/n) Σ_{i=1}^{n} g(x_i) + ∫_a^b e^{g(s)} ds + λ ∫_a^b (g^{(m)}(s))² ds,    (3.3.4)

for fixed λ > 0 and data points x_1, ..., x_n. The minimization is over a class of absolutely continuous functions on [a, b] whose mth derivative is square integrable.
Computational advantages of this log density estimator, using approximations by cubic B-splines, are:

- It is a fully automatic procedure for selecting an appropriate value of the smoothing parameter λ, based on AIC-type criteria.
- The banded structures induced by B-splines lead to an algorithm whose computational cost is linear in the number of observations (data points).
- It provides approximate pointwise Bayesian confidence intervals for the estimator.

A disadvantage of O'Sullivan's work is that it does not provide any comparison of performance with other available techniques. We also see that the previous computational framework is one-dimensional, although Silverman's approach can be extended to higher dimensions.
Chapter 4
Spline Functions
4.1 Acquiring the Taste
Due to their simple structure and good approximation properties, polynomials are widely used in practice for approximating functions. For this purpose, one usually divides the interval [a, b] in the function's support into sufficiently small subintervals of the form [x_0, x_1], ..., [x_k, x_{k+1}] and then uses a low degree polynomial p_i for the approximation over each interval [x_i, x_{i+1}], i = 0, ..., k. This procedure produces a piecewise polynomial approximating function s(·):

    s(x) = p_i(x) on [x_i, x_{i+1}],  i = 0, ..., k.

In the general case, the polynomial pieces p_i(x) are constructed independently of each other and therefore do not constitute a continuous function s(x) on [a, b]. This is not desirable if the interest is in approximating a smooth function. Naturally, it is necessary to require the polynomial pieces p_i(x) to join smoothly at the knots x_1, ..., x_k and to have all derivatives, up to a certain order, coincide at the knots. As a result, we get a smooth piecewise polynomial function, called a spline function.
Definition 4.1.1 Let −∞ =: x_0 < x_1 < ... < x_k < x_{k+1} := +∞, where x_0 and x_{k+1} are set by definition. The function s(x) is called a spline function (or simply spline) of degree r with knots at {x_i}_{i=1}^k if, on each interval [x_i, x_{i+1}], i = 0, ..., k, s is a polynomial of degree at most r, and s has continuous derivatives up to order r − 1 on all of R.

The function

    (t − x)_+^r = (t − x)^r if t > x,  0 if t ≤ x,

is called the truncated power function of degree r with knot x. Every spline of degree r with knots x_1, ..., x_k can be written in the form

    s(t) = Σ_{i=0}^{r} θ_i t^i + Σ_{j=r+1}^{r+k} θ_j (t − x_{j−r})_+^r.
It would be interesting if we could have basis functions that make it easy to compute spline functions. It can be shown that B-splines form a basis of spline spaces
(Schumaker (1981)). Also, B-splines have an important computational property: they
are the splines with smallest possible support. In other words, B-splines are zero
on a large set. Furthermore, a stable evaluation of B-splines with the aid of a recurrence relation is possible.
Definition 4.1.3 Let τ = {x_j}_{j∈Z} be a nondecreasing sequence of knots. The j-th B-spline of order k for the knot sequence τ is defined by

B_j^k(t) = (x_{k+j} − x_j) [x_j, …, x_{k+j}] (· − t)_+^{k−1}  for all t ∈ R,

where [x_j, …, x_{k+j}] (· − t)_+^{k−1} is the kth divided difference of the function (x − t)_+^{k−1}
evaluated at the points x_j, …, x_{k+j}.
From Definition 4.1.3 we notice that B_j^k(t) = 0 for all t ∉ [x_j, x_{j+k}]. It follows
that only k B-splines have any particular interval [x_j, x_{j+1}] in their support. That
is, of all the B-splines of order k for the knot sequence τ, only the k B-splines
B_{j−k+1}^k, B_{j−k+2}^k, …, B_j^k might be nonzero on the interval [x_j, x_{j+1}] (see de Boor (1978)
for details). Moreover, B_j^k(t) > 0 for all t ∈ (x_j, x_{j+k}) and Σ_{j∈Z} B_j^k(t) = 1; that is, the
B-spline sequence B_j^k consists of nonnegative functions which sum up to 1 and provides a partition of unity. Thus, a spline function can be written as a linear combination
of B-splines,
s(t) = Σ_{j∈Z} α_j B_j^k(t).

The value of the function s at a point t is simply the value of Σ_{j∈Z} α_j B_j^k(t) at t,
which makes good sense since the latter sum has at most k nonzero terms.

[Figure here: B-spline basis functions on [0, 1].]
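The two properties just described, small support and partition of unity, can be checked numerically. The sketch below uses scipy's `BSpline`; the clamped knot layout is this example's choice, not the text's.

```python
import numpy as np
from scipy.interpolate import BSpline

degree = 3                              # cubic, order k = 4
a, b = 0.0, 1.0
inner = np.linspace(a, b, 6)
# clamped knot sequence: boundary knots repeated so the basis covers [a, b]
knots = np.concatenate([np.repeat(a, degree), inner, np.repeat(b, degree)])
K = len(knots) - degree - 1             # number of B-splines

x = np.linspace(a, b, 200, endpoint=False)
B = np.column_stack([
    BSpline(knots, np.eye(K)[j], degree)(x) for j in range(K)
])

# Partition of unity: at each x the (at most k) nonzero B-splines sum to 1.
row_sums = B.sum(axis=1)
# Small support: at each x, at most k = degree + 1 B-splines are nonzero.
nonzero_per_row = (B > 1e-12).sum(axis=1)
```

Plotting the columns of `B` reproduces the kind of picture the figure above shows.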
able to have a basis for NS^{2m} itself. To construct such a basis consisting of splines
with small supports we just need functions based on the usual B-splines. In particular, when m = 2, we will be constructing basis functions for the natural cubic spline
[Figure here: natural spline basis functions on [0, 1].]
with knots at {t_j}_{j=1}^K. Notice that S0 is a (K + 4)-dimensional linear space. Now, let
S be the subspace of functions in S0 that are linear on (−∞, t1] and on [tK, ∞). Thus, S has a basis of the form 1, B1, …, B_{K−1}, such that B1 is a linear
function with negative slope on (−∞, t1] and B2, …, B_{K−1} are constant functions on
the same interval. Similarly, B_{K−1} is a linear function with positive slope on [tK, ∞) and
B1, …, B_{K−2} are constant on the interval [tK, ∞).
Let Θ be the parametric space of dimension p = K − 1, such that for θ = (θ1, …, θp) ∈ Θ,

c(θ) = log( ∫ exp( Σ_{j=1}^{K−1} θ_j B_j(x) ) dx )

and

f(x; θ) = exp{ Σ_{j=1}^{K−1} θ_j B_j(x) − c(θ) }.
The p-parameter exponential family {f(·; θ), θ ∈ R^p} of positive, twice differentiable density functions on R is called the logspline family, and the corresponding log-likelihood function is given by

L(θ) = Σ_{i=1}^n log f(X_i; θ).

The log-likelihood function L(θ) is strictly concave and hence the maximum likelihood estimator θ̂ of θ is unique, if it exists. We refer to f̂ = f(·; θ̂) as the logspline
density estimate. Note that the estimation of θ makes the logspline procedure not essentially nonparametric. Thus, estimation of θ by Newton-Raphson, together with the small
number of basis functions necessary to estimate a density, makes the logspline algorithm extremely fast when compared with Gu's (1993) algorithm for smoothing
spline density estimation.
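The exponential-family structure above can be fitted directly by maximizing L(θ). The sketch below is an illustration only: it replaces the natural-spline basis B_j of the text with a plain polynomial basis on [0, 1], computes c(θ) on a grid, and uses a quasi-Newton optimizer in place of Newton-Raphson.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x_data = rng.beta(2.0, 5.0, size=500)       # synthetic sample on [0, 1]

def basis(x):
    # stand-in basis B_1, B_2, B_3 (hypothetical, not the text's splines)
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x ** 2, x ** 3])

grid = np.linspace(0.0, 1.0, 2001)
G = basis(grid)
dx = grid[1] - grid[0]

def trap(v):
    # trapezoid rule on the uniform grid
    return dx * (v.sum() - 0.5 * (v[0] + v[-1]))

def c_theta(theta):
    # c(theta) = log int exp(sum_j theta_j B_j(x)) dx, stabilized
    u = G @ theta
    m = u.max()
    return m + np.log(trap(np.exp(u - m)))

def negloglik(theta):
    # minus L(theta) = -sum_i log f(X_i; theta)
    return -(basis(x_data) @ theta).sum() + x_data.size * c_theta(theta)

theta_hat = minimize(negloglik, np.zeros(3), method="BFGS").x
f_hat = np.exp(G @ theta_hat - c_theta(theta_hat))   # estimated density
```

Because the log-likelihood is strictly concave in θ, the numerical optimum is the unique maximum likelihood estimate for this basis, and the estimated density integrates to one by construction of c(θ).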
In the Logspline approach the number of knots is the smoothing parameter. That
is, too many knots lead to a noisy estimate while too few knots give a very smooth
curve. Based on their experience of fitting logspline models, Kooperberg and Stone
provide a table with the number of knots based on the number of observations. No
indication was found that the number of knots takes into consideration the structure of
the data (number of modes, bumps, asymmetry, etc.). However, objective criteria
for the choice of the number of knots, Stepwise Knot Deletion and Stepwise Knot Addition,
are included in the logspline procedure.
For 1 ≤ j ≤ p, let B_j be a linear combination of the truncated power basis functions (x − t_k)_+^3
for the knot sequence t1, …, t_p, that is,

B_j(x) = β_j + β_{j0} x + Σ_k β_{jk} (x − t_k)_+^3.

Then

Σ_j θ_j B_j(x) = Σ_j θ_j β_j + Σ_j θ_j β_{j0} x + Σ_k T_k (x − t_k)_+^3,  with T_k = Σ_j β_{jk} θ_j,

and

SE(T̂_k) = ( β_k^T (I(θ̂))^{−1} β_k )^{1/2},  β_k = (β_{1k}, …, β_{pk})^T,

where I(θ) is the Fisher information matrix obtained from the log-likelihood function.
The knots t1 and tK are considered permanent knots, and t_k, 2 ≤ k ≤ K − 1, are
nonpermanent knots. Then at any deletion step (similarly for an addition step) one deletes the knot
which has the smallest value of |T̂_k|/SE(T̂_k). In this manner, we have a sequence
= 3. The choice was made, according to them, because this value makes the
probability that f̂ is bimodal, when f is Gamma(5), about 0.1. Figure 4.2.3 shows
an example of logspline density estimation for a mixture of two normal densities.
[Figure 4.2.3 here: true mixture-of-normals density and logspline estimate.]
We know that this transformation is not one-to-one, and Gu and Qiu (1993) proposed
side conditions on g such that g(x0) = 0, x0 ∈ X, or ∫_X g = 0. Given those conditions
one minimizes

−(1/n) Σ_{i=1}^n g(X_i) + log ∫ e^g + (λ/2) J(g)   (4.3.1)
in a Hilbert space H, where J is a roughness penalty and λ is the smoothing parameter. The space H is such that evaluation is continuous, so that the first term in
(4.3.1) (i.e., the one involving the empirical distribution) is well defined, and a quadratic J makes the numerical solution of the variational problem (4.3.1) easier. Since H is an infinite dimensional space, the
minimizer of (4.3.1) is, in general, not computable. Thus, Gu and Qiu (1993) propose
calculating the solution of the variational problem in a finite dimensional space, say, H_n.
Specifically, if one solves the variational problem (4.3.1) in H_n by a standard Newton-Raphson procedure, then by starting from a current iterate g̃, instead of calculating
the next iterate with a fixed λ, one may choose a λ that minimizes the loss function.
Figure 4.3.4 exhibits the performance of SSDE for Buffalo Snow data. (This data
set can be found in R.)
43
0.020
0.0
0.005
0.010
0.015
SSDE
Logspline(d)
Kernel
20
40
60
80
100
120
data
44
Chapter 5
The thin-plate spline on R^d
There are many applications where an unknown function g of one or more variables and a set of measurements are given such that
y_i = L_i g + ε_i,  i = 1, …, n,   (5.0.1)
(5.0.2)
The thin-plate smoothing spline is the solution to the following variational problem:
find g ∈ H to minimize

L(g) = (1/n) Σ_{i=1}^n (y_i − g(t_i))² + λ J_m^d(g),   (5.0.3)
where λ is the smoothing parameter which controls the trade-off between fidelity to
the data and smoothness with penalty term J_m^d. Note that when λ is large a premium is being placed on smoothness and functions with large mth derivatives are
penalized. In fact, λ → ∞ gives an mth order polynomial regression fit to the data.
Conversely, for small values of λ more emphasis is put on goodness-of-fit, and in the
limit case λ → 0 we have interpolation. In general, in smoothing spline nonparametric regression the penalty term J_m^d is given by
J_m^d(g) = Σ_{α1+···+αd = m} ( m! / (α1! ··· αd!) ) ∫ ··· ∫ ( ∂^m g / (∂x1^{α1} ··· ∂xd^{αd}) )² Π_j dx_j.
The condition 2m − d > 0 is necessary and sufficient in order to have bounded evaluation functionals in H, i.e., for H to be a reproducing kernel Hilbert space. Moreover,
the null space of the penalty term J_m^d is the M-dimensional space spanned by polynomials in d variables of total degree less than m. The solution of (5.0.3) has a representation of the form

g(t) = Σ_{i=1}^n c_i E_m(t, t_i) + Σ_{j=1}^M b_j φ_j(t),  i.e.,  g = Qc + Tb,   (5.0.4)
conditionally positive definite (see Wahba (1990)). Efforts have been made to reduce substantially the computational cost of solving the smoothing spline fitting by
introducing the concept of H-splines (Luo and Wahba (1997) and Dias (1999)), where
the number of basis functions and λ act as the smoothing parameters.
A major conceptual problem with spline smoothing is that it is defined implicitly
as the solution to a variational problem rather than as an explicit formula involving
the data values. This difficulty can be resolved, at least approximately, by considering
how the estimate behaves on large data sets. It can be shown from the quadratic
nature of (5.0.3) that g is linear in the observations yi , in the sense that there exists a
weight function H (s, t) such that
g_λ(s) = Σ_{i=1}^n y_i H(s, t_i).   (5.0.5)
It is possible to obtain the asymptotic form of the weight function, and hence an
approximate explicit form of the estimate. For the sake of simplicity consider d = 1,
m = 2 and suppose that the design points have local density f(t) with respect to
Lebesgue measure on R. Assume the following conditions (Silverman (1984)):
1. g ∈ H[a, b].
2. There exists an absolutely continuous distribution function F on [a, b] such that
F_n → F uniformly as n → ∞.
3. f = F′, with 0 < inf_{[a,b]} f ≤ sup_{[a,b]} f < ∞.
4. The density f has a bounded first derivative on [a, b].
5. With a(n) = sup_{[a,b]} |F_n − F|, the smoothing parameter λ depends on n in such a way
that λ → 0 and λ^{−1/4} a(n) → 0 as n → ∞.
In particular, one can assume that the design points are regularly distributed with
density f; that is, t_i = F^{−1}((i − 1/2)/n). Then sup |F_n − F| = (1/2)n^{−1}, so that
λn^4 → ∞ and λ → 0 suffice for (5) to hold. Thus, as n → ∞,
H(s, t) = (1/f(t)) (1/h(t)) K( (s − t)/h(t) ),

where

K(u) = (1/2) exp(−|u|/√2) sin(|u|/√2 + π/4),
and the bandwidth h(t) satisfies
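The kernel K(u) above can be checked numerically: it integrates to one and has a vanishing second moment, i.e. it behaves like a fourth-order kernel, which is why spline smoothing is comparable to higher-order kernel methods. The brute-force grid below is an illustration, not from the text.

```python
import numpy as np

u = np.linspace(-30.0, 30.0, 600001)
du = u[1] - u[0]
kappa = 0.5 * np.exp(-np.abs(u) / np.sqrt(2)) * np.sin(np.abs(u) / np.sqrt(2) + np.pi / 4)

total = kappa.sum() * du                      # ~ 1: the kernel integrates to one
second_moment = (u ** 2 * kappa).sum() * du   # ~ 0: vanishing second moment
```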
y_i = α + Σ_{j=1}^p g_j(t_j) + ε_i   (5.1.1)
where the t_j are the predictor variables and, as defined before in Section 5, the ε_i are
uncorrelated error measurements with E[ε_i] = 0 and Var[ε_i] = σ². The functions g_j are
unknown but assumed to be smooth functions lying in some metric space. Section
5 describes a general framework for defining and estimating general nonparametric
regression models which includes additive models as a special case. For this, suppose that T is the space of the vector predictor t and assume that H is a reproducing
kernel Hilbert space. Hence H has the decomposition
H = H0 + Σ_{k=1}^p H_k,   (5.1.2)
defined in Section 5. The space H0 is the space of functions that are not to be penalized
in the optimization. For example, recall equation (5.0.3) and let m = 2; then H0 is the
space of linear functions in t.
The estimate minimizes the penalized criterion

Σ_{i=1}^n { y_i − Σ_{k=0}^p g_k(t_i) }² + Σ_{k=1}^p λ_k ||g_k||²_{H_k}.   (5.1.3)

The minimizer has a representation of the form

g = Σ_{k=1}^p Q_k c + Tb,   (5.1.4)

where Q_k and T are given in equation (5.0.4) and the vectors c and b are found by
minimizing the finite dimensional penalized least squares criterion

|| y − Tb − Σ_{k=1}^p Q_k c ||² + Σ_{k=1}^p λ_k c^T Q_k c.   (5.1.5)
This general problem (5.1.4) can potentially be solved by a backfitting type algorithm
as in Hastie and Tibshirani (1990).
Algorithm 5.1.1
1. Initialize g_j = g_j^{(0)} for j = 0, …, p.
2. Cycle j = 0, …, p, j = 0, …, p, …:

g_j = S_j( y − Σ_{k≠j} g_k(t_k) ).
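Algorithm 5.1.1 can be sketched in a few lines. The example below is an illustration with choices of my own: two additive components, and a simple running mean over the sorted predictor standing in for the spline smoothers S_j of the text.

```python
import numpy as np

def running_mean(t, r, width=11):
    # crude scatterplot smoother: average r over the `width` nearest
    # observations in the ordering of t (a stand-in for S_j)
    order = np.argsort(t)
    padded = np.pad(r[order], width // 2, mode="edge")
    sm = np.convolve(padded, np.ones(width) / width, mode="valid")
    out = np.empty_like(sm)
    out[order] = sm
    return out

rng = np.random.default_rng(3)
n = 300
t1 = rng.uniform(-1.0, 1.0, n)
t2 = rng.uniform(-1.0, 1.0, n)
y = np.sin(np.pi * t1) + t2 ** 2 + rng.normal(0.0, 0.1, n)

alpha = y.mean()
g1 = np.zeros(n)
g2 = np.zeros(n)
for _ in range(20):                     # cycle j = 1, 2, 1, 2, ...
    g1 = running_mean(t1, y - alpha - g2)
    g1 -= g1.mean()                     # keep the components centered
    g2 = running_mean(t2, y - alpha - g1)
    g2 -= g2.mean()
```

Each pass smooths the partial residuals of one component against its own predictor; the cycle is repeated until the fitted components stabilize.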
Figure 5.1.1: True, tensor product, gam non-adaptive and gam adaptive surfaces
∫ (g″(u))² du. Let g_λ^{[k]} denote the minimizer of
the optimization problem with the kth data point left out. Then, following Wahba's
notation, the ordinary cross-validation function V0(λ) is defined as
V0(λ) = (1/n) Σ_{k=1}^n ( y_k − g_λ^{[k]}(t_k) )²,   (5.2.1)

and the leaving-out-one lemma gives

y_k − g_λ^{[k]}(t_k) = ( y_k − g_λ(t_k) ) / ( 1 − h_kk(λ) ),   (5.2.2)
where h_kk(λ) is the kth diagonal entry of H_λ. By substituting (5.2.2) into (5.2.1) we obtain a
simplified form of V0, that is,

V0(λ) = (1/n) Σ_{k=1}^n ( y_k − g_λ(t_k) )² / ( 1 − h_kk(λ) )².   (5.2.3)
The right-hand side of (5.2.3) is easier to compute than (5.2.1); however, the GCV is even
easier. Generalized cross-validation (GCV) is a method for choosing the smoothing
parameter λ which is based on leaving-one-out, but it has two advantages: it is easy to
compute, and it possesses some important theoretical properties that would be difficult
to prove for leaving-one-out, although, as pointed out by Wahba, in many cases the
GCV and leaving-one-out estimates will give similar answers. The GCV function is
defined by
V(λ) = ( (1/n) ||(I − H_λ) y||² ) / ( (1/n) tr(I − H_λ) )²,   (5.2.4)

which can be seen as a weighted version of the sum Σ_k (y_k − g_λ(t_k))²/(1 − h_kk(λ))² appearing in (5.2.3).
Consider the loss function

T(λ) = (1/n) Σ_{i=1}^n ( L_i g − L_i g_λ )²,   (5.2.5)

where L_i is the evaluation functional defined in Section 4.3; the GCV estimate of λ is
the minimizer of (5.2.4). Consider the expected value of T(λ),
E[T(λ)] = (1/n) Σ_{i=1}^n E[ ( L_i g − L_i g_λ )² ].   (5.2.6)
The GCV theorem (Wahba (1990)) says that if g is in a reproducing kernel Hilbert space
then there is a sequence of minimizers λ̂(n) of E[V(λ)] that comes close to achieving
the minimum possible value of the expected mean square error, E[T(λ)], using λ̂(n),
as n → ∞. That is, let the expectation inefficiency I_n be defined as

I_n = E[T(λ̂(n))] / E[T(λ*)],

where λ* is the minimizer of E[T(λ)]. Then, under mild conditions such as the ones
described and discussed by Golub, Heath and Wahba (1979) and Craven and Wahba
(1979), we have I_n → 1 as n → ∞.
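GCV in formula (5.2.4) applies to any linear smoother once its hat matrix is available. The sketch below uses a ridge penalty on second differences as a discrete stand-in for the smoothing spline penalty; data and the λ grid are this example's choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
t = np.linspace(0.0, 1.0, n)
y = np.sin(2.0 * np.pi * t) + rng.normal(0.0, 0.3, n)

D = np.diff(np.eye(n), n=2, axis=0)          # second-difference operator

def hat_matrix(lam):
    # H(lambda) of the penalized least squares smoother
    return np.linalg.solve(np.eye(n) + lam * (D.T @ D), np.eye(n))

def V(lam):
    # the GCV function (5.2.4)
    H = hat_matrix(lam)
    resid = y - H @ y
    return (resid @ resid / n) / (np.trace(np.eye(n) - H) / n) ** 2

lams = 10.0 ** np.arange(-4.0, 4.0)
vals = [V(l) for l in lams]
lam_hat = lams[int(np.argmin(vals))]         # GCV choice on the grid
```

In practice one minimizes V over a log-spaced grid exactly as above; this is also how `smooth.spline()` in R selects λ when GCV is requested.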
Figure 5.2.2 shows the scatter plot of the revenue passenger miles flown by commercial airlines in the United States for each year from 1937 to 1960. (These data can be
found in R.) The smoothing parameter was computed by the GCV method
through the R function smooth.spline().
Figure 5.2.2: Smoothing spline fitting with smoothing parameter obtained by GCV
method
Chapter 6
Regression splines, P-splines and
H-splines
6.1 Sequentially Adaptive H-splines
In regression splines, the idea is to approximate g by an element of a finite dimensional subspace
of W, a Sobolev space, spanned by basis functions B1, …, BK, K ≪ n. That is,

g ≈ gK = Σ_{j=1}^K c_j B_j,
where the parameter K controls the flexibility of the fit. A very common choice
for basis functions is the set of cubic B-splines (de Boor, 1978). The B-spline basis
functions provide a numerically superior computational scheme and have the main
feature that each B_j has compact support. In practice, this means that we obtain a stable
evaluation and that the resulting matrix with entries B_{i,j} = B_j(t_i), for j = 1, …, K and i =
1, …, n, is banded. Unfortunately, the main difficulty when working with regression
splines is to select the number and the positions of a sequence of breakpoints, called
knots, where the piecewise cubic polynomials are tied to enforce continuity and lower
order continuous derivatives (see Schumaker (1972) for details). Regression splines
are attractive because of their computational scheme, where standard linear model
techniques can be applied. But smoothness of the estimate cannot easily be varied
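A regression spline fit is an ordinary least squares problem once the B-spline design matrix is built. The sketch below illustrates this with choices of my own (data, interval and K); K plays the role of the smoothing parameter.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(0)
n = 200
t = np.sort(rng.uniform(0.0, 3.0, n))
y = np.sin(2.0 * t) * np.exp(-t / 2.0) + rng.normal(0.0, 0.1, n)

def design_matrix(t, K, a=0.0, b=3.0, degree=3):
    # clamped, equally spaced knot sequence giving K cubic B-splines on [a, b]
    inner = np.linspace(a, b, K - degree + 1)
    knots = np.concatenate([np.repeat(a, degree), inner, np.repeat(b, degree)])
    return np.column_stack([
        BSpline(knots, np.eye(K)[j], degree)(t) for j in range(K)
    ])

X = design_matrix(t, K=12)                      # banded n x K design matrix
c_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # standard linear model fit
fit = X @ c_hat
```

Refitting with K = 4 or K = 60 instead of 12 reproduces the under- and over-fitting behavior illustrated in the figure below.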
[Figure here: regression spline fits with K = 4, 12 and 60 basis functions, together with the true curve.]
A_λ(g) = Σ_{i=1}^n ( y_i − g(t_i) )² + λ ∫ (g″)².   (6.1.1)
of natural cubic splines (NCS) spanned by the basis functions {B_i}_{i=1}^K, and X is the n × K
design matrix with entries X_{ij} = B_j(t_i). Then

A_λ(c) = ||y − Xc||²_2 + λ c^T Ω c,

whose minimizer solves the linear system (X^T X + λΩ)c = X^T y. Note that the linear system now involves K × K matrices instead of the n × n matrices used in the case of smoothing
splines. Both K and λ control the trade-off between smoothness and fidelity to the
data. Figure 6.1.2 shows, for λ > 0, an example of the relationship between K and
λ. Note that when the number of basis functions increases, the smoothing parameter
decreases to a point and then it increases with K. That is, for large values of K, the
smoothing parameter becomes larger in order to enforce smoothness.
[Figure 6.1.2 here: smoothing parameter λ versus number of knots.]
where A(λ) = X(X^T X + λΩ)^{−1} X^T.
then t_g ≥ 0 and

t_g = g² / ∫ g²,

so that

∫ ( √t_f − √t_g )² = 2( 1 − ρ(f, g) ),

where

ρ(f, g) = ∫ √(t_f t_g) = ∫ |f g| / √( ∫ f² ∫ g² ).
Increasing the number of basis functions K by one, the procedure will stop when
g_{K,λ} ≈ g_{K+1,λ}, in the sense of the partial affinity,

ρ( g_{K,λ}, g_{K+1,λ} ) = ∫ | g_{K,λ} g_{K+1,λ} | / √( ∫ g²_{K,λ} ∫ g²_{K+1,λ} ) ≈ 1.
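The affinity above is straightforward to evaluate numerically, and it equals one exactly when the two curves are proportional, which is what makes it a usable stopping criterion. The test functions below are arbitrary choices for illustration.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 10001)
dx = x[1] - x[0]

def affinity(f, g):
    # rho(f, g) = int |f g| / sqrt(int f^2 int g^2), by Riemann sums
    num = np.abs(f * g).sum() * dx
    den = np.sqrt((f ** 2).sum() * dx * (g ** 2).sum() * dx)
    return num / den

f = np.sin(np.pi * x)
g = 2.0 * f                      # proportional to f: affinity equals 1
h = np.cos(np.pi * x)            # not proportional: affinity below 1
```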
58
0 98
0 94
0 96
a n y
0.990
0.995
0 92
0.985
0 88
0 90
0 86
0.970
0.975
0.980
partial affinity
1 00
1.000
10
20
30
knots
40
50
10
20
30
40
50
kno s
Figure 6 1 3 Five thousand replicates of the affinity and the partial affinity for adaptive nonparametric regression using H-splines with the true curve
One may notice that the affinity is a concave function of the number of basis functions (knots) and that the partial affinity approaches one quickly. Moreover, numerical
experiments have shown that the maximum of the affinity and the stabilization of
the partial affinity coincide. That means that increasing K arbitrarily not only increases the computational cost but also does not provide the best fitted curve (in the
[Figure 6.1.4 here: four panels with affinity density estimates for n = 20, 100, 200 and 500.]
Figure 6.1.4: Density estimates of the affinity based on five thousand replicates of the
curve y_i = x_i³ + ε_i with ε_i ∼ N(0, .5). Solid line is a density estimate using the beta model
and dotted line is a nonparametric density estimate.
Figure 6.1.4 shows the empirical affinity distribution (unimodal, skewed to
the left, with range between 0 and 1), a nonparametric density estimate using the kernel
method, and a parametric one using a beta model whose parameters were estimated
by the method of moments.
Figure 6.1.5: A comparison between smoothing splines (S-splines) and hybrid splines
(H-splines) methods.
Figure 6.1.5 shows that, in general, the H-splines method has performance similar to
smoothing splines. But, as mentioned before, the H-splines approach solves a linear
system of order K while smoothing splines must solve a linear system of
order n ≫ K.
6.2 P-splines
The basic idea of P-splines, proposed by Eilers and Marx (1996), is to use a considerable number of knots and to control the smoothness through a difference penalty
on coefficients of adjacent B-splines. For this, let us consider a simple regression model
y(x) = g(x) + ε, where ε is a random variable with a symmetric distribution with mean
zero and finite variance. Assume that the regression curve g can be well approximated
by a linear combination of, without loss of generality, cubic B-splines, denoted by
B(x) = B(x; 3). Specifically, given n data points (x_i, y_i) and a set of K B-splines B_j(·),
we take g(x_i) = Σ_{j=1}^K a_j B_j(x_i). Now, the penalized least squares problem becomes to
find a vector of coefficients a = (a1, …, aK) that minimizes:
PLS(a) = Σ_{i=1}^n { y_i − Σ_{j=1}^K a_j B_j(x_i) }² + λ ∫ { Σ_{j=1}^K a_j B″_j(x) }² dx.

For equally spaced knots the derivatives can be expressed through finite differences of the coefficients,

Σ_{j=1}^K a_j B″_j(x; 3) = h^{−2} Σ_{j=1}^K Δ² a_j B_j(x; 1),

where h is the distance between knots, Δa_j = a_j − a_{j−1} and Δ²a_j = Δ(Δa_j). The P-splines method penalizes a higher-order finite difference of the coefficients
of adjacent B-splines. That is, it minimizes
Σ_{i=1}^n { y_i − Σ_{j=1}^K a_j B_j(x_i) }² + λ Σ_{j=m+1}^K ( Δ^m a_j )².
Eilers and Marx (1996) show that the difference penalty is a good discrete approximation to the integrated square of the mth derivative, that with this penalty moments of the data are conserved, and that polynomial regression models occur as limits
for large values of λ.
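The whole P-spline fit reduces to one banded linear system. The sketch below is an illustration in the spirit of Eilers and Marx; data, K and λ are this example's choices.

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
n = 200
x = np.sort(rng.uniform(0.0, 10.0, n))
y = np.sin(x * 2.0 * np.pi / 10.0) + 0.5 + rng.normal(0.0, 0.3, n)

# a generous number K of equally spaced cubic B-splines
degree, K = 3, 30
inner = np.linspace(0.0, 10.0, K - degree + 1)
knots = np.concatenate([np.repeat(0.0, degree), inner, np.repeat(10.0, degree)])
B = np.column_stack([
    BSpline(knots, np.eye(K)[j], degree)(x) for j in range(K)
])

D = np.diff(np.eye(K), n=2, axis=0)       # Delta^2 acting on the coefficients
lam = 1.0
a_hat = np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ y)
fit = B @ a_hat
```

Increasing λ pushes the fit toward the polynomial limit mentioned above; decreasing it lets the many knots interpolate the noise.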
Figure 6.2.6 shows a comparison of smooth.spline and P-spline estimates on a simulated example.
[Figure 6.2.6 here: data simulated from y = sin(x·2π/10) + .5 + N(0, .71), with the true curve, the smooth.spline fit and the P-spline fit.]
6.3
Consider the model

y_i = g(t_i) + ε_i,  i = 1, …, n,

where the ε_i's are uncorrelated N(0, σ²) random variables. Moreover, assume that the parametric
form of the regression curve g is unknown. Then the likelihood of g given the observations y is

l_y(g) ∝ (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ||y − g||² }.   (6.3.1)

The Bayesian justification of penalized maximum likelihood is to place a prior density
proportional to exp{ −(λ/2) ∫ (g″)² } over the space of all smooth functions (see details
in Silverman and Green (1994) and Kimeldorf and Wahba (1970)). However, the infinite dimensional case involves a paradox alluded to by Wahba (1983). Silverman (1985)
proposed a finite dimensional Bayesian formulation to avoid the paradoxes and difficulties involved in the infinite dimensional case. For this, let g_{K,λ} = Σ_{i=1}^K θ_i B_i^x = X_K θ_K
with a knot sequence x placed at order statistics. A complete Bayesian approach
would assign prior distributions to the coefficients of the expansion, to the knot positions, to the number of knots and to σ². A Bayesian approach to hybrid splines
nonparametric regression assigns priors for g_K, K, λ and σ². Given a realization
of K the interior knots are placed at order statistics. This well known procedure in
nonparametric regression reduces the computational cost substantially and avoids
trying to solve the difficult problem of optimizing the knot positions. Any other procedure has to take into account the fact that changes in the knot positions might cause
a considerable change in the function g (see details in Wahba (1982) for ill-posed problems in spline nonparametric regression). Moreover, in theory the number of basis
functions (which is a linear function of the number of knots) can be as large as the
sample size, but then one has to solve a system of n equations instead of K. In an
attempt to keep the computational cost down one might want K as small as
possible, and hence over-smoothing may occur, while any K large enough keeps the
balance between over-smoothing and under-smoothing. Thus, with g = g_K, the penalized likelihood becomes
l_p(θ) ∝ (2πσ²)^{−n/2} exp{ −(1/(2σ²)) ||y − g_K||² − (λ/2) ∫ (g″_K)² },   (6.3.2)
where g_K = g_{K,λ} = X_K θ_K. The reason why we suppress the subindex λ in g_{K,λ} will
be explained later in this section. Note that maximization of the penalized likelihood
is equivalent to minimization of (5.0.3). For this proposed Bayesian setup we have a
prior for (g_K, K, λ, σ²) of the form

p(g_K, K, λ, σ²) = p(g_K | K, λ) p(K, λ) p(σ²),   (6.3.3)
with

p(g_K | K, λ) ∝ exp{ −(λ/2) ∫ (g″_K)² },

p(σ²) ∝ (σ²)^{−(u+1)} exp{ −v/σ² },

and a truncated Poisson prior on the number of basis functions,

p(K) = ( exp{−a} a^K / K! ) / ( 1 − exp{−a}(1 + q) ),   K = 1, …, K̄,   (6.3.4)

where q = Σ_{j=K̄+1}^∞ a^j / j!.
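The truncated Poisson prior on K can be tabulated directly. The values of a and K̄ below are illustrative choices only; the Poisson terms are built by a stable recursion rather than with factorials, to avoid overflow.

```python
import numpy as np

a, Kbar = 5.0, 30

# Poisson(a) pmf e^{-a} a^j / j! via the recursion p_j = p_{j-1} * a / j
pmf = [np.exp(-a)]
for j in range(1, 200):
    pmf.append(pmf[-1] * a / j)

tail = sum(pmf[Kbar + 1:])            # e^{-a} q, the truncated upper tail
norm = 1.0 - pmf[0] - tail            # 1 - e^{-a}(1 + q)
p_K = [pmf[k] / norm for k in range(1, Kbar + 1)]   # prior on K = 1, ..., Kbar
```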
In order to sample from the posterior π(·|y) we have to consider the varying
dimensionality of this problem. Hence one has to design move types between the subspaces Θ_K. However, assigning a prior to g_K, equivalently to the coefficients θ_K,
leads to a serious computational difficulty pointed out by Denison, Mallick and Smith
(1998), where a comparative study was developed. They suggest that using least squares
estimates for the vector θ_K leads to a non-significant deterioration in performance for
overall curve estimation. Similarly, given a realization of (K, λ, σ²), we solve the penalized least squares objective function (6.1.1) to obtain the estimates θ̂_K = θ̂_K(y) for
the vector θ_K, and consequently we have an estimate ĝ_K = X_K θ̂_K. Thus, there is no
prior assigned for this vector of parameters, and so we write g_K = g_{K,λ}. Having obtained
θ̂_K, we approximate the marginal posterior π(η|y) by the conditional posterior
π(η | y, θ̂_K) ∝ l_p(η | y, θ̂_K) p(η),   (6.3.5)

with

p(η) = p(K, λ, σ²) = p(λ|K) p(K) p(σ²).
Note that if one assigns independent normal distributions to the parameters θ_K it will
not be difficult to obtain the marginal posterior π(η|y), and approximations will not
be necessary; however, the results will be very similar.
To sample from the posterior, Dias and Gamerman (2002) used the reversible jump methodology (Green (1995)). This technique is beyond the level of this book and will not be explained here, but the interested reader
will find the details of the algorithm in Dias and Gamerman (2002).
ȳ(t_i) = (1/100) Σ_{j=1}^{100} y_j(t_i).
Figure 6.3.8 exhibits approximate Bayesian confidence intervals for the true regression
curve g, computed as follows. Let y(t_i) and ŷ(t_i) be a particular
model and its estimate provided by the proposed method, with i = 1, …, n, where n
is the sample size. For each i = 1, …, n the fitted vectors (ŷ1(t_i), ŷ2(t_i), …, ŷ100(t_i))^T
form random samples, and from those vectors the lower and upper quantiles were
computed in order to obtain the confidence intervals.
67
*
*
*
y
0
* *
*
*
*
*
*
*
*
*
*
*
*
*
*
* *
*
*
*
*
True
SS estimate
Bayesian estimate
*
*
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Figure 6.3.8: One hundred estimates of the curve (6.3.7) and a Bayesian confidence
interval for the regression curve g(t) = exp(−t²/2) cos(4t) with t ∈ [0, π].
Chapter 7
Final Comments
Bibliography
Bates, D. and Wahba, G. (1982). Computational Methods for Generalized Cross-Validation
with large data sets, Academic Press, London.
Box, G. E. P., Hunter, W. G. and Hunter, J. S. (1978). Statistics for Experiments: An
Introduction to Design, Data Analysis, and Model Building, John Wiley and Sons
(New York, Chichester).
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots, J. Amer. Statist. Assoc. 74(368): 829–836.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions, Numerische Mathematik 31: 377–403.
de Boor, C. (1978). A Practical Guide to Splines, Springer Verlag, New York.
Denison, D. G. T., Mallick, B. K. and Smith, A. F. M. (1998). Automatic Bayesian curve
fitting, Journal of the Royal Statistical Society B 60: 363–377.
Dias, R. (1994). Density estimation via h-splines, University of Wisconsin-Madison.
Ph.D. dissertation.
Dias, R. (1996). Sequential adaptive nonparametric regression via H-splines. Technical Report RP 43/96, University of Campinas, June 1996. Submitted.
Dias, R. (1998). Density estimation via hybrid splines, Journal of Statistical Computation
and Simulation 60: 277294.
Dias, R. (1999). Sequential adaptive non parametric regression via H-splines, Communications in Statistics: Computations and Simulations 28: 501–515.
Dias, R. and Gamerman, D. (2002). A Bayesian approach to hybrid splines non-parametric regression, Journal of Statistical Computation and Simulation 72(4): 285–297.
Eilers, P. H. C. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties, Statist. Sci. 11(2): 89–121. With comments and a rejoinder by the authors.
Golub, G. H., Heath, M. and Wahba, G. (1979). Generalized cross-validation as a
method for choosing a good ridge parameter, Technometrics 21(2): 215–223.
Good, I. J. and Gaskins, R. A. (1971). Nonparametric roughness penalties for probability densities, Biometrika 58: 255–277.
Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and
Bayesian model determination, Biometrika 82: 711–732.
Gu, C. (1993). Smoothing spline density estimation: A dimensionless automatic algorithm, J. of the Amer. Statl. Assn. 88: 495–504.
Gu, C. and Qiu, C. (1993). Smoothing spline density estimation: theory, Ann. of Statistics 21: 217–234.
Härdle, W. (1990). Smoothing Techniques With Implementation in S, Springer-Verlag
(Berlin, New York).
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized Additive Models, Chapman and
Hall.
Kimeldorf, G. S. and Wahba, G. (1970). A correspondence between Bayesian estimation on stochastic processes and smoothing by splines, The Annals of Mathematical Statistics 41: 495–502.
Kooperberg, C. and Stone, C. J. (1991). A study of logspline density estimation, Computational Statistics and Data Analysis 12: 327–347.
Luo, Z. and Wahba, G. (1997). Hybrid adaptive splines, Journal of the American Statistical Association 92: 107–116.
Nadaraya, E. A. (1964). On estimating regression, Theory of Probability and its Applications 10: 186–190.
O'Sullivan, F. (1988). Fast computation of fully automated log-density and log-hazard
estimators, SIAM J. on Scientific and Statl. Computing 9: 363–379.
Pagan, A. and Ullah, A. (1999). Nonparametric Econometrics, Cambridge University
Press, Cambridge.
Parzen, E. (1962). On estimation of a probability density function and mode, Ann. of
Mathematical Stat. 33: 1065–1076.
Prakasa-Rao, B. L. S. (1983). Nonparametric Functional Estimation, Academic Press
(Duluth, London).
Schumaker, L. L. (1972). Spline Functions and Approximation Theory, Birkhäuser.
Schumaker, L. L. (1981). Spline Functions: Basic Theory, Wiley-Interscience, New York.
Scott, D. W. (1992). Multivariate Density Estimation. Theory, Practice, and Visualization,
John Wiley and Sons (New York, Chichester).
Silverman, B. W. (1982). On the estimation of a probability density function by the
maximum penalized likelihood method, Ann. of Statistics 10: 795–810.
Silverman, B. W. (1984). Spline smoothing: The equivalent variable kernel method,
Ann. of Statistics 12: 898–916.
Silverman, B. W. (1985). Some aspects of the spline smoothing approach to nonparametric regression curve fitting, Journal of the Royal Statistical Society, Series B,
Methodological 47: 1–21.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis, Chapman
and Hall (London).