Statistical ML Overview
We assume the model Y = f(X) + U, where U is the error term, representing variation in Y not explained by X, with E[U] = 0.
Our goal is to find the f connecting X and Y.
Y = β0 + β1 X + U
is the simple linear regression model. Extending this to cases where X is a [T x p] matrix,
representing T observations of data-points (p-dimensional vectors), we have
Y = β0 + β1 X1 + ... + βp Xp + U
or in matrix notation
Y = Xβ + U
Here, we have p+1 parameters to be estimated, since β = (β0 , β1 , ..., βp )T , and T observations.
Without loss of generality, we can assume that the data is de-meaned and standardized, so
β0 = 0.
Under some conditions, the best linear projection of Y onto X is the Ordinary Least Squares
(OLS) solution
β̂OLS = (XT X)−1 XT Y
However, this is not a good option in some cases:
• The OLS solution is not valid in high dimensions (when p > T), since then XᵀX is singular
• When T is huge, matrix inversion is expensive
• The true relationship between X and Y may be nonlinear
As noted above, the OLS solution is invalid in high dimensions (p > T). Even when the data matrix X is square (p = T), the OLS solution is unique, but it performs very poorly: it interpolates the data exactly and generalizes badly.
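As a quick illustration (a minimal sketch, not from the notes; the data and dimensions are made up), the snippet below computes OLS via the normal equations and shows that XᵀX becomes singular once p > T:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, Y):
    """OLS via the normal equations: beta_hat = (X'X)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Well-posed case: T observations, p < T regressors.
T, p = 100, 5
X = rng.standard_normal((T, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ beta_true + rng.standard_normal(T)
print(ols(X, Y))                               # close to beta_true

# High-dimensional case: p > T, so X'X (p x p) has rank at most T < p.
T_hd, p_hd = 20, 50
X_hd = rng.standard_normal((T_hd, p_hd))
print(np.linalg.matrix_rank(X_hd.T @ X_hd))    # at most 20 -> singular, OLS fails
```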
Dimension Reduction
Regularization
One approach is penalized least squares,

β̂(λ) = argmin_{β∈B} [ Σ_{t=1}^T (Y_t − βᵀX_t)² + Σ_{j=1}^p p_λ(|β_j|; α, data) ]

where p_λ(|β_j|; α, data) is a non-negative penalty function indexed by λ. Here, λ determines the number of features that enter the model: when λ → ∞, no variables enter the model; when λ = 0, β̂(0) is the same as β̂_OLS.
Another approach to dimension reduction is the factor model

X_t = ΛF_t + V_t

where F_t are the unobserved factors, Λ is a matrix of unobserved factor loadings, and V_t
is a vector of idiosyncratic errors. There are many ways to estimate the factors, but one
widespread and simple solution is Principal Component Analysis (PCA), where the factors
are the first k Principal Components.
Our goal is to take the (de-meaned and standardized) dataset X of shape T x n and extract
as much information from X as possible, building another dataset Z of shape T x k where
k << n. The key insight of PCA is that “most of the variation that exists in your data
across many high dimensions is captured in fewer dimensions.”
The main idea of PCA is to look for the vector γ ∈ Rn such that the sample variance of Xγ
is maximized.
Then, the sample variance of Xγ is given by (1/T)γᵀXᵀXγ = γᵀΣ̂γ. We will restrict the space of γ to just unit vectors, ‖γ‖ = 1, since otherwise we could arbitrarily increase the sample variance by increasing the size of γ: we seek the direction of X exhibiting the greatest variance.
Maximizing γᵀΣ̂γ subject to ‖γ‖ = 1 (differentiating the Lagrangian γᵀΣ̂γ − λ(γᵀγ − 1)) yields the eigenvector of Σ̂ with the largest associated eigenvalue as the vector γ ∈ Rⁿ that maximizes the sample variance (the direction capturing the greatest variance of X).
We continue this process to find the next “principal component”, maximizing γ₂ᵀΣ̂γ₂ such that γ₂ is a unit vector orthogonal to the first principal component. Differentiating the Lagrangian with the new orthogonality condition yields the eigenvector of Σ̂ with the next largest associated eigenvalue. In general,
Σ̂γj = λj γj
By construction, the k columns of Z are uncorrelated, with sample variances λ₁ ≥ ... ≥ λ_k (in decreasing order). This means that the sample covariance matrix of Z is diag(λ₁, ..., λ_k).
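To make this concrete, here is a small numpy sketch (my own illustration, not from the notes; the shapes and variable names are placeholders) that extracts the first k principal components by eigendecomposition of Σ̂:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, k = 500, 10, 3

# De-meaned and standardized data matrix X (T x n).
X = rng.standard_normal((T, n))
X = (X - X.mean(axis=0)) / X.std(axis=0)

Sigma_hat = (X.T @ X) / T                 # sample covariance matrix (n x n)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)

# eigh returns eigenvalues in ascending order; flip to descending.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Gamma_k = eigvecs[:, :k]                  # first k eigenvectors (n x k)
Z = X @ Gamma_k                           # principal components / scores (T x k)

# Sample variance of each PC equals the corresponding eigenvalue.
print(Z.var(axis=0))
print(eigvals[:k])
```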
Choosing k
There is a tradeoff in choosing the number of principal components to use in the reduced
model: if k is too large, then we do not sufficiently reduce the dimension; on the other hand,
if k is too small, then we lose information from our dataset.
A natural quantity to examine is α_j = λ_j / (λ₁ + ... + λ_n) for j ∈ [n], giving us α_j, the proportion of variance attributed to the j-th PC. For example:
• Rule of thumb: stop when the next PC does not meet some threshold for explained
variance
• Informal cutoff: stop when the k PCs (with highest corresponding αs) explain some
amount of the variance (e.g. 80%)
• Eigenvalue-ratio criterion: choose k = argmax_{1≤j<n} λ_j / λ_{j+1}
PC Regression
We regress Y on the k principal components Γ_k rather than on X, so our PC hat matrix is given by Γ_k(Γ_kᵀΓ_k)⁻¹Γ_kᵀ, compared to the OLS hat matrix X(XᵀX)⁻¹Xᵀ (we have simply replaced X by Γ_k and run a now-feasible OLS regression on the reduced-dimension dataset).
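Below is a brief principal component regression sketch in this spirit (my own example, re-simulating data in the same shapes as the PCA snippet above): regress Y on the k scores and form fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, k = 500, 10, 3
X = rng.standard_normal((T, n))
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(T)

# Principal component scores (T x k), as in the PCA sketch above.
eigvals, eigvecs = np.linalg.eigh((X.T @ X) / T)
Gamma_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
Z = X @ Gamma_k

# PC regression: OLS of Y on the k scores.
beta_pc = np.linalg.solve(Z.T @ Z, Z.T @ Y)
Y_hat = Z @ beta_pc                        # the "PC hat matrix" applied to Y
print(beta_pc, np.mean((Y - Y_hat) ** 2))
```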
3 Ridge Regression
Another way to address the failure of OLS in high dimensions is the general shrinkage method, penalized least squares:
β̂(λ) = argmin_{β∈B} [ Σ_{t=1}^T (Y_t − βᵀX_t)² + Σ_{j=1}^p p_λ(|β_j|; α, data) ]
First, note that for a generic parameter of interest θ, θ̂ is an unbiased estimator for θ when

E[θ̂] = θ

We define the “mean squared error” of an estimator as

MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + Bias(θ̂)²
For regression problems, the OLS estimator is the Best Linear Unbiased Estimator, under
certain assumptions, via the Gauss-Markov theorem. However, this does not mean that it has the lowest MSE. The James-Stein estimator can be constructed for a regression problem
with a lower MSE than the OLS predictor, despite being biased. One popular estimator is
the Ridge estimator, which despite being biased, has a lower variance than the OLS estimator
and is valid in high dimensions.
The Ridge estimator is

β̂_Ridge(λ) = argmin_{β∈B} [ Σ_{t=1}^T (Y_t − βᵀX_t)² + λ Σ_{j=1}^p β_j² ] = (XᵀX + λI)⁻¹XᵀY,

which shrinks the OLS estimator towards zero for parameters that are deemed redundant. The Ridge optimization problem is strictly convex whenever λ > 0 (λ = 0 recovers OLS), regardless of the dataset X. However, Ridge is not a valid tool for variable selection and is an inconsistent estimator of β in general.
Ridge “shrinks” the OLS estimator towards zero, with the “shrinkage” strength determined by λ.
Ridge works well since it is essentially just OLS on an augmented (T + p) × p dataset (whose cross-product is always invertible):

X_λ = [ X ; √λ·I_p ]

i.e. the T × p data matrix X stacked on top of √λ times the p × p identity matrix,
with the response padded by p zeros,

Y_λ = [ Y ; 0_p ] = (Y₁, ..., Y_T, 0, ..., 0)ᵀ.
So even if XᵀX is singular, the augmented cross-product X_λᵀX_λ = XᵀX + λI_p is not, and β̂_Ridge = (X_λᵀX_λ)⁻¹X_λᵀY_λ is well defined.
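As a rough check of this augmented-data view (my own sketch with simulated data), the snippet below verifies numerically that OLS on (X_λ, Y_λ) coincides with the closed-form Ridge estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
T, p, lam = 30, 50, 2.0                     # high-dimensional: p > T

X = rng.standard_normal((T, p))
Y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(T)

# Closed-form Ridge estimator.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# OLS on the augmented dataset: X stacked on sqrt(lam)*I, Y padded with zeros.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
Y_aug = np.concatenate([Y, np.zeros(p)])
beta_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y_aug)

print(np.allclose(beta_ridge, beta_aug))    # True: the two solutions agree
```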
The Ridge estimator is not unbiased; however, it has a lower variance than the OLS estimator. To see how Ridge shrinks, consider the singular value decomposition (SVD)

X = UΣVᵀ

where Σ is a p × p diagonal matrix of the singular values, which are the square roots of the eigenvalues of XᵀX.
As a consequence of this,

Ŷ_Ridge = U diag( d₁²/(d₁² + λ), ..., d_p²/(d_p² + λ) ) UᵀY

and

Ŷ_OLS = UUᵀY
This shows that Ridge essentially integrates elements of PCA: directions of X that explain little of the variance (small singular values) are shrunk heavily towards 0, while directions with larger singular values d_j are penalized much less (when λ > 0). The squared singular values are the eigenvalues of XᵀX, whose eigenvectors are the principal component directions (eigenvectors of the covariance matrix).
Selection of λ
The selection of λ, the penalty parameter, influences the number of features that enter the
model. A large penalty shrinks all β_i more strongly towards zero, while λ = 0 simply recovers OLS (when the problem is not high-dimensional).
In general, there are no closed-form results for choosing the penalty parameter, but just
data-driven best practices. There are two common approaches:
• Cross-validation: “Split the sample into parts. Use one part to estimate the model and the other part to evaluate the estimates.”
• Information criteria: choose λ to minimize a criterion such as the AIC or BIC of the fitted model.
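As an illustration of cross-validated selection of λ (my own sketch; the grid, fold count, and data are arbitrary), the snippet below picks the Ridge penalty by K-fold cross-validation:

```python
import numpy as np

rng = np.random.default_rng(4)
T, p = 120, 15
X = rng.standard_normal((T, p))
Y = X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(T)

def ridge_fit(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def cv_error(X, Y, lam, n_folds=5):
    """Average validation MSE of Ridge(lam) over K folds."""
    folds = np.array_split(np.arange(T), n_folds)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(T), val_idx)
        beta = ridge_fit(X[train_idx], Y[train_idx], lam)
        errors.append(np.mean((Y[val_idx] - X[val_idx] @ beta) ** 2))
    return np.mean(errors)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_error(X, Y, lam))
print(best_lam)
```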
4 Lasso Regression
The Least Absolute Shrinkage and Selection Operator (Lasso), proposed by Tibshirani in 1996, is another approach to regularized regression. It is given by
β̂_Lasso(λ) = argmin_{β∈B} [ Σ_{t=1}^T (Y_t − βᵀX_t)² + λ Σ_{j=1}^p |β_j| ]
By construction, it can handle many more variables than dimensions (p >> T ) and can
sometimes even select the correct subset of relevant variables in the dataset. However, it is
a biased estimator and parameter inference is only possible in some restrictive cases.
It is schematically similar to Ridge, but now with the L1 norm of β as a penalty term, rather
than the L2 norm. Hence, Lasso is also known as L1 regularization and Ridge is known as
L2 regularization.
Lasso is a convex program, but it is not strictly convex when XᵀX is singular (the high-dimensional case). Thus, β̂_Lasso(λ) is not, in general, unique. However, the prediction Xβ̂_Lasso(λ) is unique. Tibshirani did, however, show that “if the entries of X are drawn from a continuous probability distribution, then the Lasso solution is unique almost surely.”
Lasso also produces a sparse model: the number of non-zero parameter estimates will be at
most min(T, p)
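A quick sparsity check (my own sketch using scikit-learn's Lasso with an arbitrary penalty; the simulated design is hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
T, p = 50, 200                                   # p >> T
X = rng.standard_normal((T, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]      # only 5 relevant variables
Y = X @ beta_true + 0.5 * rng.standard_normal(T)

model = Lasso(alpha=0.1).fit(X, Y)               # alpha plays the role of lambda
print(np.count_nonzero(model.coef_))             # sparse: far fewer than p, at most min(T, p)
print(np.flatnonzero(model.coef_)[:10])          # indices of selected variables
```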
Selection Consistency
A consistent estimator has estimated parameters converging to the true parameters as the
sample size increases.
β̂ (n) →p β0 as n → ∞
In the context of Lasso, an important notion of consistency is “model selection consistency”: if a variable has a nonzero coefficient in the true model β₀, then this type of consistency
mandates that the estimator also has a nonzero coefficient for said variable:
P[ {i : β̂_i ≠ 0} = {i : β_{0,i} ≠ 0} ] → 1 as n → ∞.
Thus, a key question, answered by Peng Zhao and Bin Yu in 2006 [5], is: when does Lasso exhibit model selection consistency? The answer is conditional on the data X meeting an “irrepresentable condition.” Formally, writing X₁ for the columns of the relevant variables and X₂ for the irrelevant ones,

‖(X₂ᵀX₁)(X₁ᵀX₁)⁻¹ sgn(β_{0(1)})‖_∞ < 1.
This means that the irrelevant variables cannot be too correlated with the relevant variables in X: the set of “important” variables chosen by Lasso (those with non-zero coefficients) must not be strongly correlated with the variables that are discarded. Thus, Lasso may have difficulty when variables are highly correlated (i.e. when multicollinearity is present).
Lasso Extensions
There are many relatives and extensions of Lasso, such as the Elastic Net, Adaptive Lasso, Group Lasso, Fused Lasso, etc.
Elastic Net
Elastic net combines elements of Ridge and Lasso. Naive elastic net is given by
" #
β̂(λ1 , λ2 ) = arg min |Y − Xβ|22 + λ1 |β|1 + λ2 |β|22
β∈B
It attenuates over-shrinkage compared to Lasso or Ridge. Elastic Net can select more than T variables without “saturating”, whereas Lasso cannot, and it is equivalent to a Lasso problem on augmented data.
Adaptive Lasso
β̂_adaLasso(λ) = argmin_{β∈B} [ (1/T) Σ_{t=1}^T (Y_t − βᵀX_t)² + λ Σ_{j=1}^p w_j|β_j| ]
where the weights w_j are adapted from an initial estimator: usually w_j = |β̃_j|^(−τ) for τ ∈ (0, 1], where β̃_j comes from e.g. a first run of Lasso. Under some conditions, the adaptive Lasso has the oracle property and provides consistent estimates of the non-zero parameters (those not set to zero in the initial run).
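One common way to implement the adaptive Lasso is to rescale the columns of X by the weights and run an ordinary Lasso; the sketch below (my own, with hypothetical data and arbitrary τ and penalties) follows that route:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
T, p, tau = 100, 40, 1.0
X = rng.standard_normal((T, p))
Y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(T)

# Step 1: initial Lasso fit gives beta_tilde.
beta_tilde = Lasso(alpha=0.05).fit(X, Y).coef_
w = 1.0 / (np.abs(beta_tilde) ** tau + 1e-8)     # weights w_j = |beta_tilde_j|^{-tau}

# Step 2: the weighted problem is equivalent to a plain Lasso on the
# rescaled design X_j / w_j, followed by undoing the scaling.
X_scaled = X / w
beta_star = Lasso(alpha=0.05).fit(X_scaled, Y).coef_
beta_adalasso = beta_star / w

print(np.flatnonzero(beta_adalasso))             # selected variables
```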
Group Lasso
The Group Lasso penalizes coefficients in pre-specified groups of variables, replacing the Lasso penalty with λ Σ_{j=1}^G √d_j ‖β_(j)‖₂, where the length of the j-th group is given by d_j. When d_i = 1 ∀i ∈ [G], Group Lasso is the same as Lasso.
5 Sieve Methods
Moving beyond linear specifications, suppose we want to estimate a general function f of X_t. Minimizing the sum of squared errors over an unrestricted function space H is intractable when H is infinite (any possible function is allowed), since there is no efficient technique to search over the function space. Instead, we work with a sieve space H_D and approximate f by a linear combination of basis functions,

f(X_t) ≈ Σ_{j=1}^{J_T} β_j h_j(X_t)

where h_j(·) is the j-th basis function of H_D. It can either be indexed by a vector of parameters (not fully known) or fully known. The number of basis functions J_T and the dimension of the space D depend on the sample size T of the dataset X. When there is no parameter
θ (i.e. the basis functions are fully known), then D = J_T; otherwise, D > J_T.
For example, if H_D is the space of polynomial functions of X_t up to order J_T, then the basis functions are fully known and D = J_T:

Pol(J_T) = { Σ_{j=1}^{J_T} β_j X^j, X ∈ [0, 1] : β_j ∈ R }
The motivation for this polynomial sieve space is the Stone–Weierstrass theorem, which states that any continuous function m(·) on a compact support X ⊂ R can be approximated arbitrarily well by a polynomial. Other common sieve spaces include trigonometric polynomials, logistic functions, ReLUs, etc.
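A minimal numpy sketch of a polynomial sieve fit (my own illustration; the target function, order J_T, and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
T, J_T = 200, 5

X = rng.uniform(0.0, 1.0, size=T)
m = lambda x: np.sin(2.0 * np.pi * x)            # unknown function to approximate
Y = m(X) + 0.1 * rng.standard_normal(T)

# Basis matrix with an intercept column plus X^1, ..., X^{J_T}.
B = np.column_stack([np.ones(T)] + [X ** j for j in range(1, J_T + 1)])

# Least squares on the sieve space.
beta, *_ = np.linalg.lstsq(B, Y, rcond=None)

x_grid = np.linspace(0.0, 1.0, 5)
B_grid = np.column_stack([np.ones(5)] + [x_grid ** j for j in range(1, J_T + 1)])
print(np.round(B_grid @ beta, 2))                # fitted values on the grid
print(np.round(m(x_grid), 2))                    # true function on the grid
```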
Now, consider “splines”: let J_T ∈ Z₊ and let ξ_i ∈ R for i ∈ {0, ..., J_T + 1} with 0 = ξ₀ < ξ₁ < ... < ξ_{J_T} < ξ_{J_T+1} = 1. In other words, we divide [0, 1] into J_T + 1 intervals I_j = [ξ_j, ξ_{j+1}) (with I_{J_T} = [ξ_{J_T}, ξ_{J_T+1}]). Now assume these “knots” ξ₀, ..., ξ_{J_T+1} have a “bounded mesh ratio”:

max_{j∈{0,...,J_T}} (ξ_{j+1} − ξ_j) / min_{j∈{0,...,J_T}} (ξ_{j+1} − ξ_j) ≤ c

for some c > 0.
However, the behavior of polynomial fits near the boundary of the dataset tends to be erratic, so extrapolation beyond the data bounds is unreliable. These issues are made worse with splines, since spline fits can behave wildly near the boundaries. “Natural cubic splines” are therefore splines with conditions imposed near the boundary knots: namely, the function must be linear beyond the boundary knots. This introduces bias near the boundary region, which is accepted in exchange for the dramatic decrease in variance obtained by restricting the spline's behavior.
The problem of knot selection is avoided by “smoothing splines,” which use a maximal set of knots. Here, the complexity of the spline fit is controlled by regularization.
6 Ensemble Methods
Boosting
Boosting is a greedy method for fitting additive models. Greedy algorithms make a locally optimal choice at each stage, with the aim of eventually reaching a global optimum. Specifically, boosting combines additive weak learners, each chosen greedily to best fit the residuals of the current boosting model at that stage, with the aim of driving those residuals down and thereby fitting the data well. It does this via gradient descent.
The additive model has the form

H_M(X_t) = Σ_{m=1}^M f_m(X_t)

where each f_m(X_t) is a weak learner: a predictor that performs rather poorly, but slightly better than chance. Combining these weak models should, hopefully, yield a learner that is strong overall. Our goal is to estimate the optimal additive model. The boosting algorithm proceeds as follows:
1. Initialize f̂^[0] (the T-dimensional initial predictor) with some value, usually a constant value. Let j = 0. Choose the set of weak learners we can utilize at every step.
2. Increase j by 1 and compute the residuals U^[j−1] = Y − f̂^[j−1] of the current model.
3. Choose the weak learner that best fits U^[j−1]. Denote by Û^[j−1] the fitted values of this weak learner.
4. Update
f̂^[j] = f̂^[j−1] + ν Û^[j−1]
where ν ∈ (0, 1] is the step size.
5. Repeat steps 2-4 until the maximum number of iterations is reached.
Gradient boosting is easy to adapt to different loss functions, supports variable selection, is useful for high-dimensional data, and is robust to multicollinearity. However, it sacrifices interpretability. Often, the weak learners in question are Classification and Regression Trees (CARTs).
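Here is a compact sketch of the boosting loop above (my own illustration), using depth-1 regression trees ("stumps") from scikit-learn as the weak learners; the step size, number of rounds, and data are arbitrary:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
T = 300
X = rng.uniform(-2.0, 2.0, size=(T, 3))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(T)

nu, n_rounds = 0.1, 200                 # step size and number of boosting rounds
f_hat = np.full(T, Y.mean())            # step 1: initialize with a constant
learners = []

for _ in range(n_rounds):
    U = Y - f_hat                                           # step 2: residuals
    stump = DecisionTreeRegressor(max_depth=1).fit(X, U)    # step 3: fit a weak learner
    learners.append(stump)
    f_hat = f_hat + nu * stump.predict(X)                   # step 4: shrunken update

print(np.mean((Y - f_hat) ** 2))        # in-sample MSE after boosting
```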
Bagging
Bagging is another ensemble method, introduced to reduce the variance of a predictor. It operates by bootstrap aggregating: we bootstrap (sample with replacement) from our dataset, fit a predictor on each bootstrap dataset, and then take the expectation of such a predictor (in practice, approximated via Monte Carlo). Formally, the bagged predictor is

f̂_bag(x) = (1/B) Σ_{b=1}^B f̂^{*b}(x)

where f̂^{*b} is the predictor fit on the b-th bootstrap sample. Bagging is most useful when the underlying statistic is unstable.
A statistic is stable at X if
θ̂T (X) = θ(X) + op (1)
as T → ∞ for some fixed value θ(X) (a stable limit). Informally, an unstable statistic is subject to large changes from small changes in the input data, whereas a stable statistic is relatively unaffected by such changes.
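A bare-bones Monte Carlo version of the bagging idea (my own sketch; the base learner, bag size B, and data are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(9)
T = 300
X = rng.uniform(-2.0, 2.0, size=(T, 3))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(T)

B = 50
predictions = np.zeros((B, T))
for b in range(B):
    idx = rng.integers(0, T, size=T)          # bootstrap: sample rows with replacement
    tree = DecisionTreeRegressor().fit(X[idx], Y[idx])   # unstable base predictor
    predictions[b] = tree.predict(X)

Y_bag = predictions.mean(axis=0)              # aggregate: average over bootstrap fits
print(np.mean((Y - Y_bag) ** 2))
```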
7 Neural Networks
Again, we seek a relationship between our response variable Y and input variables X. Assuming there exists some unknown mapping
Y = f (X) + U
we would like to estimate f to predict a new response Y ∗ from a new data point X ∗ .
A single-hidden-layer (shallow) feed-forward neural network models this mapping as

H(X; θ) = β₀ + Σ_{j=1}^{J_T} β_j S(γ_j′X + γ_{0,j})
where
• The parameters θ := (β₀, ..., β_{J_T}, γ₁′, ..., γ_{J_T}′, γ_{0,1}, ..., γ_{0,J_T})ᵀ are the “weights” that tune the model. Further, the γ_{0,j} and β₀ are “biases”.
So a neural network (at least of this architecture) is simply a method of transforming the
data via weights, biases, and an activation function.
Common choices for the activation function S(·) include:
• Logistic function: S(X) = 1 / (1 + e^(−X))
• Hyperbolic tangent (tanh): S(X) = (e^X − e^(−X)) / (e^X + e^(−X))
• Rectified Linear Unit: ReLU(X) = max(0, X)
In general, but not always, the activation function S is a squashing function. A function S : R → [a, b], a < b, is a squashing (or sigmoid) function if it is non-decreasing with lim_{X→∞} S(X) = b and lim_{X→−∞} S(X) = a. However, this is not always the case: the aforementioned ReLU function is not squashing.
The universal approximation theorem states that a “feed-forward NN with a single hidden
layer with “arbitrary” squashing functions can approximate any Borel-measurable function
from one finite dimensional space to another to any desired degree of accuracy, provided
sufficiently many (finite) hidden units are available.”
It is very intuitive to visualize neural networks graphically, as a set of connecting nodes and
edges:
a) The inputs are linearly combined (including a bias) when being fed into the hidden layer
γj′ X + γ0,j
b) There are nonlinear transformations (due to the activation function) in the hidden layer:
S(γj′ X + γ0,j )
c) The outputs of the activation function in the hidden layer are then linearly combined with the output weights and a final output bias to create an output β₀ + Σ_j β_j S(γ_j′X + γ_{0,j})
We can also write the single layer neural network as a matrix: we define
• X̃_t := [1, X_tᵀ]ᵀ, t ∈ [T]
• Γ := [γ̃₁, ..., γ̃_{J_T}], where γ̃_j := [γ_{0,j}, γ_jᵀ]ᵀ collects the bias and weights of hidden unit j
Finally, the output of the feed-forward NN is given by combining O(XΓ), the matrix of activation-function outputs on the transformed inputs (the hidden layer), with the output weight and bias vector β := [β₀, β₁, ..., β_{J_T}]ᵀ, to get

H(X; θ) = O(XΓ)β
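The forward pass is easy to write down directly; below is a small numpy sketch of a single-hidden-layer network with random (untrained) weights, my own illustration of the formula above:

```python
import numpy as np

rng = np.random.default_rng(10)
T, p, J_T = 100, 4, 8

X = rng.standard_normal((T, p))

# Hidden-layer weights gamma_j (p x J_T) and biases gamma_{0,j} (J_T,).
Gamma = rng.standard_normal((p, J_T))
gamma0 = rng.standard_normal(J_T)
# Output weights beta_j (J_T,) and output bias beta0.
beta = rng.standard_normal(J_T)
beta0 = 0.5

def S(z):
    """Logistic (squashing) activation."""
    return 1.0 / (1.0 + np.exp(-z))

hidden = S(X @ Gamma + gamma0)          # T x J_T matrix of hidden-unit outputs
H = beta0 + hidden @ beta               # network output for each observation
print(H.shape)                          # (100,)
```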
This neural network is trained (the parameter vector is optimized) via stochastic gradient descent. Beforehand, however, we must determine the architecture of the NN; in the case of a single hidden layer, this involves deciding the number of hidden units J_T. Too few hidden units will likely underfit the data, while too many will overfit it.
For deep (multi-layer) networks, let J_l be the number of hidden units in layer l ∈ [L]. Then, for each hidden layer l define Γ_l := [γ̃_{1,l}, ..., γ̃_{J_l,l}]. The output O_l of layer l is then the n × (J_l + 1) matrix

O_l(O_{l−1}(·)Γ_l) =
[ 1  S(γ̃_{1l}ᵀ O_1^{l−1}(·))  ...  S(γ̃_{J_l l}ᵀ O_1^{l−1}(·))
  1  S(γ̃_{1l}ᵀ O_2^{l−1}(·))  ...  S(γ̃_{J_l l}ᵀ O_2^{l−1}(·))
  ⋮              ⋮             ⋱               ⋮
  1  S(γ̃_{1l}ᵀ O_n^{l−1}(·))  ...  S(γ̃_{J_l l}ᵀ O_n^{l−1}(·)) ]
where O₀ is X (the input data). Thus, the output of the deep neural net is obtained by combining the final hidden layer's output with the output weight and bias vector,

H(X; θ) = O_L(O_{L−1}(·)Γ_L)β
Because of this architecture, deep neural nets have a huge number of parameters to train. They can be trained with stochastic gradient descent (with backpropagation, since the chain rule is needed to compute how changing parameters in earlier layers propagates to the output), but this should be done in conjunction with regularization methods. For deep NNs, a standard such method is dropout.
Dropout is randomly dropping neurons (as well as connections) from the NN during training.
For a shallow net, consider a p × 1 vector r of iid Bernoulli(q) random variables.
Then, instead of
H(X; θ) = β₀ + Σ_{j=1}^K β_j S(γ_jᵀX_t + γ_{0,j})

we will have

H(X; θ) = β₀ + Σ_{j=1}^K β_j S(γ_jᵀ[r ⊙ X_t] + γ_{0,j})
where ⊙ represents the Hadamard (element-wise) product. At prediction time, we use the final estimates of γ_j multiplied by q (to account for the effect of the dropout regularization).
Convolutional Neural Networks
In this section, I'll focus solely on the image-classification aspects of CNNs, although there is some application of them to forecasting time-series data.
CNNs consist of several key elements, including one or more convolutional layers in which an image kernel is applied, nonlinear transformations (as in shallow neural networks), pooling for dimension reduction, and other parameters such as the stride.
The key element here is the ‘convolution’, in which an image kernel, which acts like a filter, passes over the pixels of the image. The choice of kernel affects which features (such as contours or edges) are extracted from the image. The point of the convolution is to extract the relevant features from an image, which improves classification performance in later layers.
Another parameter for convolutional neural nets is ‘stride’. In the previous example, the
convolution passed over every pixel in the image by shifting over pixel-by-pixel. Instead,
adjusting the stride, or downsampling, can reduce the problem dimension by skipping over
pixels and moving by more than one pixel per step.
Because of border effects (how a convolution can only be applied when the filter passes over
all pixels, meaning the border pixels cannot have the transformation applied), we can also
include padding–or adding a border of 0 pixel values enveloping the image, so that the kernel
can be applied to edges as well.
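To make the convolution, stride, and padding concrete, here is a small numpy sketch (my own; the kernel and image are toy examples) of a 2-D convolution with a configurable stride and zero padding:

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Slide a kernel over a grayscale image with zero padding and a stride."""
    if pad > 0:
        image = np.pad(image, pad, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    H, W = image.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(11)
image = rng.random((8, 8))                         # toy 8x8 grayscale image
edge_kernel = np.array([[-1.0, 0.0, 1.0],          # simple vertical-edge filter
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

print(conv2d(image, edge_kernel).shape)            # (6, 6): no padding, stride 1
print(conv2d(image, edge_kernel, stride=2).shape)  # (3, 3): downsampled by the stride
print(conv2d(image, edge_kernel, pad=1).shape)     # (8, 8): zero padding keeps the size
```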
Non-grayscale images (grayscale images have only one value for the darkness of each pixel) can be represented as three grids, one for each RGB color. These grids can be appropriately downsized through the NN architecture, via convolutions and pooling, and fed into a fully-connected deep neural network, at which point a classification can be made.
Recurrent Neural Networks and LSTMs
Recurrent neural networks (RNNs) are designed for sequential data: the hidden state at one time step is fed back in as an input at the next. Traditional RNNs, however, suffer from the vanishing / exploding gradient problem, in which the gradient of the backpropagated cost function

Q_T(θ) = (1/T) Σ_{t=1}^T (y_t − ŷ_t)²

has ∂Q_T(θ)/∂θ → 0 or ∞. A solution to this issue is the LSTM, which has a specific architecture
with both a long-term memory pipeline and a short-term memory pipeline.
In the usual LSTM cell diagram, red circles represent a logistic activation function and blue circles a tanh activation. The top horizontal pipeline is the ‘cell’ state, which is the memory the LSTM stores to remember the past. The first vertical pipeline is the ‘forget’ gate, which tells the LSTM which information to drop from the cell state. The bottom horizontal pipeline includes the input gate, which determines which new information should be stored in the cell state. There are two outputs from the cell: the upper ‘long-term memory’ cell state, and the bottom ‘hidden state’, which represents ‘short-term memory’ and is used at the next step. At every step, the inputs to the cell are then the ‘long-term’ memory, the hidden ‘short-term’ memory, and an input vector. Mathematically, after initializing with c₀ = 0 and h₀ = 0, this is given by
ft = Logistic(Wf xt + Uf ht−1 + bf )
it = Logistic(Wi xt + Ui ht−1 + bi )
ot = Logistic(Wo xt + Uo ht−1 + bo )
pt = Tanh(Wc xt + Uc ht−1 + bc )
ct = (ft ⊙ ct−1 ) + (it ⊙ pt )
ht = ot ⊙ Tanh(ct )
yt = Wy ht + by
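A direct numpy transcription of these cell equations (my own sketch with random, untrained weight matrices and made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(12)
n_x, n_h = 3, 5                          # input and hidden-state dimensions

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def init(shape):
    return 0.1 * rng.standard_normal(shape)

# Random (untrained) parameters: W_* act on x_t, U_* on h_{t-1}, b_* are biases.
Wf, Uf, bf = init((n_h, n_x)), init((n_h, n_h)), np.zeros(n_h)   # forget gate
Wi, Ui, bi = init((n_h, n_x)), init((n_h, n_h)), np.zeros(n_h)   # input gate
Wo, Uo, bo = init((n_h, n_x)), init((n_h, n_h)), np.zeros(n_h)   # output gate
Wc, Uc, bc = init((n_h, n_x)), init((n_h, n_h)), np.zeros(n_h)   # candidate values
Wy, by = init((1, n_h)), np.zeros(1)                             # output layer

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM cell update, following the equations in the notes."""
    f_t = logistic(Wf @ x_t + Uf @ h_prev + bf)
    i_t = logistic(Wi @ x_t + Ui @ h_prev + bi)
    o_t = logistic(Wo @ x_t + Uo @ h_prev + bo)
    p_t = np.tanh(Wc @ x_t + Uc @ h_prev + bc)
    c_t = f_t * c_prev + i_t * p_t           # new cell ("long-term") state
    h_t = o_t * np.tanh(c_t)                 # new hidden ("short-term") state
    y_t = Wy @ h_t + by
    return h_t, c_t, y_t

# Run the cell over a short random input sequence, starting from c0 = h0 = 0.
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.standard_normal((10, n_x)):
    h, c, y = lstm_step(x_t, h, c)
print(y)
```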
8 Tree-based Methods
Classification and Regression Trees (CARTs) are a flexible, non-parametric predictive model
that operate by partitioning the input space X. They have numerous advantages, such as
interpretability, a hierarchical nature, and the ability to easily handle missing and categorical
variables (by treating them as their own category). However, they are highly unstable; this
can be fixed, though, by bagging or boosting multiple CARTs together (as mentioned before
in the ensemble methods section).
One well-known tree-based method is XGBoost (Extreme Gradient Boosting), which boosts regression trees and adds some clever refinements for greater accuracy (although this comes at the cost of interpretability).
Trees fit a local model f within each partition region. Typically, this f is very simple, such as the average of the output values of the data points in that region. The resulting predictor is

H(X_t) = Σ_{k=1}^K β_k I_k(X_t)
where I_k is a product of indicator functions showing whether a (multi-dimensional) data point belongs to region k: I_k(X_t) = 1 if X_t ∈ R_k and 0 otherwise.
One of the main ideas of regression trees is recursive partitioning: creating partitions until
the total sum of squares is below a certain pre-specified value or when the number of obser-
vations in each region reaches a minimum value.
Pruning is also an important concept for trees. Pruning is a form of regularization, with the degree of pruning controlled by a parameter α, much like the penalty parameter in regularized regression. For each subtree τ ⊆ τ₀, the original tree, consider the loss function
C_α(τ) = Σ_{k=1}^K Q_k(τ) + αK
where K is the number of regions (terminal nodes) the subtree partitions X into and Q_k(τ) is the within-region loss. The optimal α is typically chosen by cross-validation. Pruning reduces over-fitting by penalizing a greater number of terminal nodes, much like regularized regression penalizes an overfit model by shrinking the regression parameters. The main idea is to find, for a given α, the subtree minimizing the regularized loss function C_α(τ).
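scikit-learn exposes this kind of cost-complexity pruning through the `ccp_alpha` parameter of its tree estimators; a short sketch (my own, with toy data and arbitrary α values):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(13)
T = 400
X = rng.uniform(-2.0, 2.0, size=(T, 2))
Y = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.3 * rng.standard_normal(T)

for alpha in [0.0, 0.01, 0.1]:
    tree = DecisionTreeRegressor(ccp_alpha=alpha).fit(X, Y)
    # Larger alpha penalizes leaves more heavily, so the pruned tree is smaller.
    print(alpha, tree.get_n_leaves())
```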
There are many tools for showing the impact of each individual feature of X on the regression tree's predictions. One is the variable-importance measure, which for variable ℓ sums the improvements from all splits made on that variable,

I_ℓ²(τ) = Σ_{j=1}^N δ̂_j² 1{s_j = ℓ}

where N is the number of parent (internal) nodes, s_j is the index of the variable split on at node j, and δ̂_j² is the estimated improvement in fit (reduction in squared error) produced by that split.
Another useful method is using a Partial Dependence Plot, which shows the marginal effect
of 1 or 2 features on the predicted outcome of the regression tree (or more generally, for any
ML model). Isolating 1 feature (or 2, if you want a 3-D graph), first set X_t = (X_{1t}, X_{2t}ᵀ)ᵀ and then define

f₁(X_{1t}) = E_{X₂}[ f(X_{1t}, X_{2t}) ] = ∫ f(X_{1t}, X_{2t}) dP(X_{2t})
In other words, f₁ is the expected value over the distribution of X_{2t}: holding the feature of interest fixed, what is the average output over the range of all other features? We can then plot f₁ over the domain of X_{1t} to see what the output function looks like on average. Generally, this expectation is not computed analytically but approximated with a Monte Carlo approach:
f̂₁(x₁) = (1/T) Σ_{t=1}^T f̂(x₁, X_{2t})
This method does, however, assume that the variables for which the partial dependence is
computed are not correlated with other variables, which is often not realistic in practice.
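A Monte Carlo partial-dependence sketch along these lines (my own; it reuses a fitted scikit-learn tree as the generic model f̂, and the grid of evaluation points is arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(14)
T = 500
X = rng.uniform(-2.0, 2.0, size=(T, 3))
Y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.standard_normal(T)
model = DecisionTreeRegressor(max_depth=4).fit(X, Y)

def partial_dependence(model, X, feature, grid):
    """Average prediction with `feature` clamped to each grid value (Monte Carlo PDP)."""
    pd_values = []
    for x1 in grid:
        X_mod = X.copy()
        X_mod[:, feature] = x1                           # hold the feature of interest fixed...
        pd_values.append(model.predict(X_mod).mean())    # ...and average over the rest
    return np.array(pd_values)

grid = np.linspace(-2.0, 2.0, 9)
print(np.round(partial_dependence(model, X, feature=0, grid=grid), 2))
```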
Random Forests
Aside from gradient boosting (which fits regression trees to the residuals of the current model), another popular method is the “random forest”, which applies a bagging approach rather than a boosting approach. Bagging drastically reduces the instability of the estimate, but comes at a clear loss of interpretability, much like boosting.
The random forest algorithm is, for a bag size B and each b ∈ [B]:
1. Bootstrap (sample with replacement) a sample of size T from the original dataset
2. For each bootstrapped sample, construct a tree, τb , using a subset q out of the original
p variables as the potential split variables. The tree is grown until the minimum
number of observations in each leaf (terminal region) is reached. There is no pruning
(regularization) of these trees.
3. The final prediction is the average of the individual trees' predictions,
(1/B) Σ_{b=1}^B τ_b(X)
There is no theoretical answer for deciding between a boosted and bagged model–it is an
entirely empirical exercise.
9 Clustering
Now, we begin considering unsupervised learning models: models for X only, where there is
no known response variable Y in our dataset. Specifically, clustering is a collection of tools
to group and segment data into subgroups called clusters, such that data points in the same group are more similar to each other than to points outside the group. This can be used to see
if the data consists of a set of distinct subgroups.
A good clustering technique has clusters with i) high intra-class similarity and ii) low inter-
class similarity (for some later-defined metric of similarity/dissimilarity)
For data X with entries x_{il}, i ∈ [n], l ∈ [p], consider a dissimilarity measure along the l-th attribute, ρ_l(x_{il}, x_{jl}).
Then for any two data points (row vectors in X), we consider the overall dissimilarity measure
via a weighted sum of attribute dissimilarity measures
d(x_i, x_j) = Σ_{l=1}^p w_l ρ_l(x_{il}, x_{jl})

for weights w_l ≥ 0 with Σ_{l=1}^p w_l = 1.
Here, the dissimilarity measure ρ(·, ·) can be chosen depending on the data type of the data
points:
• For ordinal data, where there is some intrinsic order, we can convert the ordering of the discrete values into a numerical measure
• For categorical (unordered) data, a natural choice is ρ(x, y) = 1{x ≠ y}
From this, we can construct a symmetric Matrix of Dissimilarity, D[n×n] where Dij = d(xi , xj )
for i, j ∈ [n].
The “total dissimilarity” of X is then the sum over the dissimilarity matrix
T := Σ_{i=1}^n Σ_{j=1}^n d_{ij} = 1ᵀD1
This total dissimilarity can be decomposed further, if we have cluster assignments. For K
clusters and an assignment vector for the n data points a = [a₁, ..., a_n]ᵀ, where a_i ∈ [K] is the
assignment of unit i,
T = Σ_{i=1}^n Σ_{j=1}^n d_{ij} = Σ_{k=1}^K Σ_{a_i=k} [ Σ_{a_j=k} d_{ij} + Σ_{a_j≠k} d_{ij} ] =: W(a) + B(a)
where W(a) is the sum of the within-cluster dissimilarities and B(a) is the sum of the between-cluster dissimilarities. Notice that since T does not depend on a at all, the goal of minimizing W(a) is equivalent to maximizing B(a).
K-means Clustering
One of the oldest and most popular clustering methods is K-means clustering, known for its simplicity and its ability to work with any continuous data. Here, the dissimilarity measure for data points is the squared Euclidean distance
d_{ij} = d(x_i, x_j) = Σ_{l=1}^p (x_{il} − x_{jl})² = ‖x_i − x_j‖²
The goal of K-means is then to solve the optimization problem of finding the assignment vector a* ∈ A minimizing W(a):

a* = argmin_{a∈A} Σ_{k=1}^K N_k Σ_{a_i=k} ‖x_i − x̄_k‖²

In practice this is solved iteratively: given current cluster means x̄₁, ..., x̄_K, assign each point to the cluster with the nearest mean, then recompute the cluster means, repeating until convergence (when the within-cluster dissimilarity stops decreasing). Since the sum of squares decreases at every step, convergence is assured, but it may be to a suboptimal local minimum.
Further, the solution depends on the initial assignments and centroid means, which are typically randomly generated. Thus, in practice one often runs K-means many times and combines the results (similar to ensemble methods for reducing instability). Before using K-means, remember to standardize the variables, since scale matters when computing Euclidean distances across dimensions. Although K must be chosen a priori, we can use information criteria (e.g. AIC or BIC) to determine a suitable value, though they are not as useful here as in the regression case.
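A compact Lloyd's-algorithm sketch of K-means in numpy (my own illustration; K, the data, and the initialization scheme are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(15)

# Toy data: three Gaussian blobs in 2-D, standardized.
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
X = (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans(X, K, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random initial means
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        a = dists.argmin(axis=1)                               # assign to nearest mean
        new_centroids = np.array([X[a == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return a, centroids

a, centroids = kmeans(X, K=3)
print(np.bincount(a))          # cluster sizes
```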
Hierarchical Clustering
Hierarchical clustering methods produce a sequence of nested cluster memberships, so there is no need to choose the number of clusters or starting positions a priori. There are two types of hierarchical clustering: agglomerative and divisive.
Divisive approaches are top-down: they start with all points in the same cluster and repeatedly split a cluster into the two sub-clusters with the greatest dissimilarity, until every point is in its own cluster. Agglomerative approaches, on the other hand, are bottom-up: they start with each point in its own cluster and repeatedly merge the two least dissimilar clusters until only one cluster remains. We will focus on agglomerative clustering since it is simpler.
Notice that we now need a dissimilarity between clusters, not just between data points, called the “linkage”. There are different linkage methods between sets, and the choice determines the outcome of the agglomerative process. Some common examples include
• Single linkage: ρ_single(A, B) = min_{i∈A, j∈B} d(x_i, x_j)
• Complete linkage: ρ_complete(A, B) = max_{i∈A, j∈B} d(x_i, x_j)
• Average linkage: ρ_average(A, B) = (1/(n_A n_B)) Σ_{i∈A, j∈B} d(x_i, x_j)
• Centroid linkage: ρ_centroid(A, B) = d(x̄_A, x̄_B)
A visualization tool for hierarchical clustering is the “dendrogram”, which displays the hierarchical sequence of clustering assignments.
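scipy implements agglomerative clustering with these linkage choices; a short sketch (my own, with toy data and an arbitrary linkage method):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(16)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)), rng.normal(3, 0.3, size=(20, 2))])

# Agglomerative clustering with average linkage; 'single', 'complete',
# and 'centroid' are the other options discussed above.
Z = linkage(X, method="average")

# Cut the resulting dendrogram so that 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])       # sizes of the two clusters
```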
10 Classification
Now consider a binary response Y ∈ {0, 1}. A linear regression fit X_iᵀβ is not constrained to lie in [0, 1], so it cannot be read directly as a probability. A solution to this issue is to use a new function G : R → [0, 1] (much like a squashing function) to transform the linear regression fit onto the interval of interest:

P[Y_i = 1 | X_i] = G(X_iᵀβ)

Popular examples for G are the probit and logit functions. If the data is linearly separable, a CDF-type function separates the classes perfectly.
The Probit model simply takes G to be the CDF of a Gaussian distribution, Φ(·), meaning
P[Y_i = 1 | X_i] = Φ(X_iᵀβ)
and
Φ−1 [P[Yi = 1|Xi ]] = XiT β
is called the Probit link function, which maps the probability to the real line.
For the Logit model,

G(z) = exp(z) / (1 + exp(z))

so

P[Y_i = 1 | X_i] = exp(X_iᵀβ) / (1 + exp(X_iᵀβ))
The Logit model is easier to estimate and compute than the Probit and easier to generalize to multinomial models.
We estimate β using a maximum likelihood approach. Here we assume that {(X_i, Y_i)}_{i∈[n]} are independent and identically distributed and that p < n is fixed. Then the likelihood of a data point is given by

L(β | X_i, Y_i) = G(X_iᵀβ)^{Y_i} [1 − G(X_iᵀβ)]^{1−Y_i}

and β̂ maximizes the resulting log-likelihood.
There is no closed form for this estimator, and it must be solved numerically. We also have
the asymptotic distribution
√n(β̂ − β₀) → N( 0, { n⁻¹ Σ_{i=1}^n [ g(X_iᵀβ)² / ( G(X_iᵀβ)(1 − G(X_iᵀβ)) ) ] X_i X_iᵀ }⁻¹ )

where g(z) = dG(z)/dz.
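A minimal sketch of Logit estimation by maximizing the log-likelihood with plain gradient ascent (my own illustration; the learning rate, iteration count, and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(17)
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])  # include intercept
beta_true = np.array([-0.5, 1.0, -2.0])
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
Y = rng.binomial(1, prob)

def G(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of the (average) log-likelihood: (1/n) sum_i (Y_i - G(X_i'beta)) X_i.
beta = np.zeros(p)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (Y - G(X @ beta)) / n
    beta += lr * grad

print(np.round(beta, 2))       # should be roughly close to beta_true
```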
We can also consider “partial effects”: the sensitivity of the probability output to changes in one component of the X vector (somewhat like the Greeks for options, which measure the sensitivity of a price to different inputs). These partial effects are given by
∂p(X; β)/∂x_j = g(Xᵀβ) β_j
Notice that these partial effects depend on the data point X itself. However, the “relative” effect comparing the partial effects of two different inputs does not:

[∂p(X; β)/∂x_j] / [∂p(X; β)/∂x_h] = β_j / β_h
We can consider two summary statistics that encapsulate the whole dataset's influence: the partial effect at the average, g(X̄ᵀβ̂)β̂_j, and the average partial effect, (1/n) Σ_{i=1}^n g(X_iᵀβ̂)β̂_j. We can then use the delta method to find the asymptotic variance of these partial effects.
Once we have a ‘soft’ probability for the binary Y, we still need a cutoff point to classify Y. The choice of cutoff depends on the application (in some instances we may choose to be more conservative, in others more aggressive). We can evaluate the classifier with metrics based on true and false positives and negatives, e.g. Accuracy = (TP + TN)/(TP + FP + FN + TN).
In high dimensions, we find β through a regularization approach (e.g. Ridge, Lasso, Elastic Net, AdaLasso, etc.). A Ridge-penalized version, for example, is

β̂ = argmin_{β∈R^p} [ −log L(β | {(X_i, Y_i)}_{i∈[n]}) + λ Σ_j β_j² ]
These methods can be extended to multinomial models as well, such as the Multinomial Logit: for Y ∈ {0, ..., J},

P[Y_i = j | X_i] = exp(X_iᵀβ_j) / (1 + Σ_{k=1}^J exp(X_iᵀβ_k))
In many instances, the data is not linearly separable, as this method requires. Then, we can use a nonlinear model

P[Y_i = j | X_i] = G[f(X_i; θ)]

where f(X_i; θ) is a nonlinear function of X_i (for example, trees, neural networks, SVMs, etc.).
The first such method we discuss is the classification (decision) tree: it is very similar to the aforementioned regression tree, the main difference being the loss function used. In these trees, in a terminal node k (region R_k) with n_k observations, the class proportions are
p̂_{kj} = (1/n_k) Σ_{i∈R_k} 1{Y_i = j}
Another classification method is SVC, support vector classification. In its simplest form it is a linear classifier (the later-mentioned SVM is not). SVC splits data into two groups, labeled −1 and 1; it is very effective in high-dimensional cases. The classifier is
f (X; w, b) = sgn(wT X + b)
The data is linearly separable if ∃ θ := (w, b) such that ∀i ∈ {1, ..., n}
wᵀX_i + b > 0 if Y_i = 1
wᵀX_i + b < 0 if Y_i = −1
The question then is: how do we pick an optimal hyperplane if multiple exist? We do this by ‘maximizing the margin.’ First, recall that the distance between a point X and the plane defined by (w, b) is given by

d(X, (w, b)) = |wᵀX + b| / ‖w‖
Then define the positive and negative support vectors (∈ instead of = since there may be multiple) as
X₊(w) ∈ argmin{ wᵀX_i : Y_i = 1 }
X₋(w) ∈ argmax{ wᵀX_i : Y_i = −1 }
The margin we seek to maximize is the normalized distance between these two support vectors,

m(w) = (wᵀ/‖w‖)(X₊ − X₋) = 2/‖w‖

where the second equality follows after rescaling (w, b) so that

wᵀX_i + b ≥ +1 if Y_i = 1
wᵀX_i + b ≤ −1 if Y_i = −1

with equality at the support vectors. Maximizing the margin under these constraints is equivalent to

min_{w,b} ‖w‖²/2
such that Yi (wT Xi + b) ≥ 1 ∀i. If the solution exists and it is unique, it is called the optimal
hyperplane.
We can also introduce some slack if the data is not exactly linearly separable (e.g. there are some outliers). Then, for slack variables {ξ_i}_{i∈[n]} with ξ_i ≥ 0, we relax the constraints to Y_i(wᵀX_i + b) ≥ 1 − ξ_i ∀i. When the ξ_i are big enough that the constraints can always be met, this is called a soft margin.
Beyond SVC are Support Vector Machines (SVMs), which use the kernel trick (with an a priori chosen kernel) to map the data into a higher-dimensional space where it is linearly separable and an SVC can be applied. We can therefore view the SVM as a nonlinear classifier: it separates the data non-linearly in the original space by separating it with a hyperplane in the higher-dimensional space.
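A brief scikit-learn sketch contrasting a linear SVC with a kernelized SVM (my own example; the RBF kernel choice and the toy data are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(18)
n = 400
X = rng.standard_normal((n, 2))
Y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)    # not linearly separable

linear_svc = SVC(kernel="linear", C=1.0).fit(X, Y)        # max-margin linear classifier
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, Y)              # kernel trick: RBF feature map

print(linear_svc.score(X, Y))   # poor: no separating hyperplane in the original space
print(rbf_svm.score(X, Y))      # much better: separable after the kernel mapping
```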
References
[1] Caio Almeida and Marcelo Medeiros. FIN 580 Quantitative Data Analysis in Finance
Lecture Slides.
[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical
Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.,
2001.
[3] Andy Jones. LASSO and the irrepresentable condition. url: https://andrewcharlesjones.github.io/journal/irrepresentable.html.
[4] Peter Melchior. SML 505 Modern Statistics, Princeton University. url: sml505.pmelchior.net.
[5] Peng Zhao and Bin Yu. “On Model Selection Consistency of Lasso”. In: Journal of Machine Learning Research 7.90 (2006), pp. 2541–2563. url: http://jmlr.org/papers/v7/zhao06a.html.