Statistical ML Overview
We assume the model Y = f(X) + U, where U is the error term, representing variation in Y not explained by X, with E[U] = 0.
Our goal is to find the f connecting X and Y.
Y = β0 + β1 X + U
is the simple linear regression model. Extending this to cases where X is a [T x p] matrix,
representing T observations of data-points (p-dimensional vectors), we have
Y = β0 + β1 X1 + ... + βp Xp + U
or in matrix notation
Y = Xβ + U
Here, we have p+1 parameters to be estimated, since β = (β0 , β1 , ..., βp )T , and T observations.
Without loss of generality, we can assume that the data is de-meaned and standardized, so
β0 = 0.
Under some conditions, the best linear projection of Y onto X is the Ordinary Least Squares
(OLS) solution
β̂OLS = (XT X)−1 XT Y
However, this is not a good option in some cases:
• The OLS solution is not valid in high dimensions (when p > T), since then XᵀX is singular
• When T is huge, matrix inversion is expensive
• The true relationship between X and Y may be nonlinear
As noted above, the OLS solution is invalid in high dimensions (p > T). Even when the data matrix X is square (p = T), the OLS solution is unique, but it performs very poorly: it interpolates the data exactly and generalizes badly.
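As a quick illustration (a minimal sketch, not from the notes; the data and dimensions are made up), the snippet below computes OLS via the normal equations and shows that XᵀX becomes singular once p > T:

```python
import numpy as np

rng = np.random.default_rng(0)

def ols(X, Y):
    """OLS via the normal equations: beta_hat = (X'X)^{-1} X'Y."""
    return np.linalg.solve(X.T @ X, X.T @ Y)

# Well-posed case: T observations, p < T regressors.
T, p = 100, 5
X = rng.standard_normal((T, p))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
Y = X @ beta_true + rng.standard_normal(T)
print(ols(X, Y))                               # close to beta_true

# High-dimensional case: p > T, so X'X (p x p) has rank at most T < p.
T_hd, p_hd = 20, 50
X_hd = rng.standard_normal((T_hd, p_hd))
print(np.linalg.matrix_rank(X_hd.T @ X_hd))    # at most 20 -> singular, OLS fails
```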
Dimension Reduction
Regularization
One approach is penalized least squares,

β̂(λ) = argmin_{β∈B} [ Σ_{t=1}^T (Y_t − βᵀX_t)² + Σ_{j=1}^p p_λ(|β_j|; α, data) ]

where p_λ(|β_j|; α, data) is a non-negative penalty function indexed by λ. Here, λ determines the number of features that enter the model: when λ → ∞, no variables enter the model; when λ = 0, β̂(0) is the same as β̂_OLS.
Another approach to dimension reduction is the factor model

X_t = ΛF_t + V_t

where F_t are the unobserved factors, Λ is a matrix of unobserved factor loadings, and V_t
is a vector of idiosyncratic errors. There are many ways to estimate the factors, but one
widespread and simple solution is Principal Component Analysis (PCA), where the factors
are the first k Principal Components.
Our goal is to take the (de-meaned and standardized) dataset X of shape T x n and extract
as much information from X as possible, building another dataset Z of shape T x k where
k << n. The key insight of PCA is that “most of the variation that exists in your data
across many high dimensions is captured in fewer dimensions.”
The main idea of PCA is to look for the vector γ ∈ Rn such that the sample variance of Xγ
is maximized.
Then, the sample variance of Xγ is given by (1/T)γᵀXᵀXγ = γᵀΣ̂γ. We will restrict the space of γ to just unit vectors, ‖γ‖ = 1, since otherwise we could arbitrarily increase the sample variance by increasing the size of γ: we seek the direction of X exhibiting the greatest variance.
Maximizing γᵀΣ̂γ subject to ‖γ‖ = 1 (differentiating the Lagrangian γᵀΣ̂γ − λ(γᵀγ − 1)) yields the eigenvector of Σ̂ with the largest associated eigenvalue as the vector γ ∈ Rⁿ that maximizes the sample variance (the direction capturing the greatest variance of X).
We continue this process to find the next “principal component”, maximizing γ₂ᵀΣ̂γ₂ such that γ₂ is a unit vector orthogonal to the first principal component. Differentiating the Lagrangian with the new orthogonality condition yields the eigenvector of Σ̂ with the next largest associated eigenvalue. In general,
Σ̂γj = λj γj
By construction, the k columns of Z are uncorrelated, with sample variances λ₁ ≥ ... ≥ λ_k (in decreasing order). This means that the sample covariance matrix of Z is diag(λ₁, ..., λ_k).
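To make this concrete, here is a small numpy sketch (my own illustration, not from the notes; the shapes and variable names are placeholders) that extracts the first k principal components by eigendecomposition of Σ̂:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, k = 500, 10, 3

# De-meaned and standardized data matrix X (T x n).
X = rng.standard_normal((T, n))
X = (X - X.mean(axis=0)) / X.std(axis=0)

Sigma_hat = (X.T @ X) / T                 # sample covariance matrix (n x n)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)

# eigh returns eigenvalues in ascending order; flip to descending.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Gamma_k = eigvecs[:, :k]                  # first k eigenvectors (n x k)
Z = X @ Gamma_k                           # principal components / scores (T x k)

# Sample variance of each PC equals the corresponding eigenvalue.
print(Z.var(axis=0))
print(eigvals[:k])
```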
Choosing k
There is a tradeoff in choosing the number of principal components to use in the reduced
model: if k is too large, then we do not sufficiently reduce the dimension; on the other hand,
if k is too small, then we lose information from our dataset.
A natural quantity to examine is α_j = λ_j / (λ₁ + ... + λ_n) for j ∈ [n], giving us α_j, the proportion of variance attributed to the j-th PC. For example:
• Rule of thumb: stop when the next PC does not meet some threshold for explained
variance
• Informal cutoff: stop when the k PCs (with highest corresponding αs) explain some
amount of the variance (e.g. 80%)
• Eigenvalue-ratio criterion: choose k = argmax_{1≤j<n} λ_j / λ_{j+1}
PC Regression
We regress Y on the k principal components Γ_k rather than on X, so our PC hat matrix is given by Γ_k(Γ_kᵀΓ_k)⁻¹Γ_kᵀ, compared to the OLS hat matrix X(XᵀX)⁻¹Xᵀ (we have simply replaced X by Γ_k and run a now-feasible OLS regression on the reduced-dimension dataset).
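Below is a brief principal component regression sketch in this spirit (my own example, re-simulating data in the same shapes as the PCA snippet above): regress Y on the k scores and form fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, k = 500, 10, 3
X = rng.standard_normal((T, n))
X = (X - X.mean(axis=0)) / X.std(axis=0)
Y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(T)

# Principal component scores (T x k), as in the PCA sketch above.
eigvals, eigvecs = np.linalg.eigh((X.T @ X) / T)
Gamma_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
Z = X @ Gamma_k

# PC regression: OLS of Y on the k scores.
beta_pc = np.linalg.solve(Z.T @ Z, Z.T @ Y)
Y_hat = Z @ beta_pc                        # the "PC hat matrix" applied to Y
print(beta_pc, np.mean((Y - Y_hat) ** 2))
```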
3 Ridge Regression
Another way to address the failure of OLS in high dimensions is the general shrinkage method, penalized least squares:
β̂(λ) = argmin_{β∈B} [ Σ_{t=1}^T (Y_t − βᵀX_t)² + Σ_{j=1}^p p_λ(|β_j|; α, data) ]
First, note that for a generic parameter of interest θ, θ̂ is an unbiased estimator for θ when

E[θ̂] = θ

We define the “mean squared error” of an estimator as

MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + Bias(θ̂)²
For regression problems, the OLS estimator is the Best Linear Unbiased Estimator, under
certain assumptions, via the Gauss-Markov theorem. However, this does not mean that it has the lowest MSE. The James-Stein estimator can be constructed for a regression problem
with a lower MSE than the OLS predictor, despite being biased. One popular estimator is
the Ridge estimator, which despite being biased, has a lower variance than the OLS estimator
and is valid in high dimensions.
The Ridge estimator is

β̂_Ridge(λ) = argmin_{β∈B} [ Σ_{t=1}^T (Y_t − βᵀX_t)² + λ Σ_{j=1}^p β_j² ] = (XᵀX + λI)⁻¹XᵀY,

which shrinks the OLS estimator towards zero for parameters that are deemed redundant. The Ridge optimization problem is strictly convex whenever λ > 0 (λ = 0 recovers OLS), regardless of the dataset X. However, Ridge is not a valid tool for variable selection and is an inconsistent estimator of β in general.
Ridge “shrinks” the OLS estimator towards zero, with the “shrinkage” strength determined by λ.
Ridge works well since it is essentially just OLS on an augmented (T + p) × p dataset (whose cross-product is always invertible):

X_λ = [ X ; √λ·I_p ]

i.e. the T × p data matrix X stacked on top of √λ times the p × p identity matrix,
with the response padded by p zeros,

Y_λ = [ Y ; 0_p ] = (Y₁, ..., Y_T, 0, ..., 0)ᵀ.
So even if XᵀX is singular, the augmented cross-product X_λᵀX_λ = XᵀX + λI_p is not, and β̂_Ridge = (X_λᵀX_λ)⁻¹X_λᵀY_λ is well defined.
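As a rough check of this augmented-data view (my own sketch with simulated data), the snippet below verifies numerically that OLS on (X_λ, Y_λ) coincides with the closed-form Ridge estimator:

```python
import numpy as np

rng = np.random.default_rng(3)
T, p, lam = 30, 50, 2.0                     # high-dimensional: p > T

X = rng.standard_normal((T, p))
Y = X[:, :3] @ np.array([1.0, -1.0, 0.5]) + rng.standard_normal(T)

# Closed-form Ridge estimator.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# OLS on the augmented dataset: X stacked on sqrt(lam)*I, Y padded with zeros.
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(p)])
Y_aug = np.concatenate([Y, np.zeros(p)])
beta_aug = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ Y_aug)

print(np.allclose(beta_ridge, beta_aug))    # True: the two solutions agree
```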
The Ridge estimator is not unbiased; however, it has a lower variance than the OLS estimator. To see how Ridge shrinks, consider the singular value decomposition (SVD)

X = UΣVᵀ

where Σ is a p × p diagonal matrix of the singular values, which are the square roots of the eigenvalues of XᵀX.
As a consequence of this,

Ŷ_Ridge = U diag( d₁²/(d₁² + λ), ..., d_p²/(d_p² + λ) ) UᵀY

and

Ŷ_OLS = UUᵀY
This shows that Ridge essentially integrates elements of PCA: directions of X that explain little of the variance (small singular values) are shrunk heavily towards 0, while directions with larger singular values d_j are penalized much less (when λ > 0). The squared singular values are the eigenvalues of XᵀX, whose eigenvectors are the principal component directions (eigenvectors of the covariance matrix).
Selection of λ
The selection of λ, the penalty parameter, influences the number of features that enter the
model. A large penalty shrinks all β_i more strongly towards zero, while λ = 0 simply recovers OLS (when the problem is not high-dimensional).
In general, there are no closed-form results for choosing the penalty parameter, but just
data-driven best practices. There are two common approaches:
• Cross-validation: “Split the sample into parts. Use one part to estimate the model and the other part to evaluate the estimates.”
• Information criteria: choose λ to minimize a criterion such as the AIC or BIC of the fitted model.
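As an illustration of cross-validated selection of λ (my own sketch; the grid, fold count, and data are arbitrary), the snippet below picks the Ridge penalty by K-fold cross-validation:

```python
import numpy as np

rng = np.random.default_rng(4)
T, p = 120, 15
X = rng.standard_normal((T, p))
Y = X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(T)

def ridge_fit(X, Y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def cv_error(X, Y, lam, n_folds=5):
    """Average validation MSE of Ridge(lam) over K folds."""
    folds = np.array_split(np.arange(T), n_folds)
    errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(np.arange(T), val_idx)
        beta = ridge_fit(X[train_idx], Y[train_idx], lam)
        errors.append(np.mean((Y[val_idx] - X[val_idx] @ beta) ** 2))
    return np.mean(errors)

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
best_lam = min(grid, key=lambda lam: cv_error(X, Y, lam))
print(best_lam)
```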
4 Lasso Regression
The Least Absolute Shrinkage and Selection Operator (Lasso), proposed by Tibshirani in 1996, is another approach to regularized regression. It is given by
β̂_Lasso(λ) = argmin_{β∈B} [ Σ_{t=1}^T (Y_t − βᵀX_t)² + λ Σ_{j=1}^p |β_j| ]
By construction, it can handle many more variables than dimensions (p >> T ) and can
sometimes even select the correct subset of relevant variables in the dataset. However, it is
a biased estimator and parameter inference is only possible in some restrictive cases.
It is schematically similar to Ridge, but now with the L1 norm of β as a penalty term, rather
than the L2 norm. Hence, Lasso is also known as L1 regularization and Ridge is known as
L2 regularization.
Lasso is a convex program, but it is not strictly convex when XᵀX is singular (the high-dimensional case). Thus, β̂_Lasso(λ) is not, in general, unique. However, the prediction Xβ̂_Lasso(λ) is unique. Tibshirani did, however, show that “if the entries of X are drawn from a continuous probability distribution, then the Lasso solution is unique almost surely.”
Lasso also produces a sparse model: the number of non-zero parameter estimates will be at
most min(T, p)
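A quick sparsity check (my own sketch using scikit-learn's Lasso with an arbitrary penalty; the simulated design is hypothetical):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
T, p = 50, 200                                   # p >> T
X = rng.standard_normal((T, p))
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]      # only 5 relevant variables
Y = X @ beta_true + 0.5 * rng.standard_normal(T)

model = Lasso(alpha=0.1).fit(X, Y)               # alpha plays the role of lambda
print(np.count_nonzero(model.coef_))             # sparse: far fewer than p, at most min(T, p)
print(np.flatnonzero(model.coef_)[:10])          # indices of selected variables
```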
Selection Consistency
A consistent estimator has estimated parameters converging to the true parameters as the
sample size increases.
β̂ (n) →p β0 as n → ∞
In the context of Lasso, an important notion of consistency is “model selection consistency”: if a variable has a nonzero coefficient in the true model β₀, then this type of consistency
mandates that the estimator also has a nonzero coefficient for said variable:
P[ {i : β̂_i ≠ 0} = {i : β_{0,i} ≠ 0} ] → 1 as n → ∞.
Thus, a key question, answered by Peng Zhao and Bin Yu in 2006 [5], is: when does Lasso exhibit model selection consistency? The answer is conditional on the data X meeting an “irrepresentable condition.” Formally, writing X₁ for the columns of the relevant variables and X₂ for the irrelevant ones,

‖(X₂ᵀX₁)(X₁ᵀX₁)⁻¹ sgn(β_{0(1)})‖_∞ < 1.
This means that the irrelevant variables cannot be too correlated with the relevant variables in X: the set of “important” variables chosen by Lasso (those with non-zero coefficients) must not be strongly correlated with the variables that are discarded. Thus, Lasso may have difficulty when variables are highly correlated (i.e. when multicollinearity is present).
Lasso Extensions
There are many relatives and extensions of Lasso, such as the Elastic Net, Adaptive Lasso, Group Lasso, Fused Lasso, etc.
Elastic Net
Elastic net combines elements of Ridge and Lasso. Naive elastic net is given by
" #
β̂(λ1 , λ2 ) = arg min |Y − Xβ|22 + λ1 |β|1 + λ2 |β|22
β∈B
It attenuates over-shrinkage compared to Lasso or Ridge. Elastic Net can select more than T variables without “saturating”, whereas Lasso cannot, and it is equivalent to a Lasso problem on augmented data.
Adaptive Lasso
β̂_adaLasso(λ) = argmin_{β∈B} [ (1/T) Σ_{t=1}^T (Y_t − βᵀX_t)² + λ Σ_{j=1}^p w_j|β_j| ]
where the weights w_j are adapted from an initial estimator: usually w_j = |β̃_j|^(−τ) for τ ∈ (0, 1], where β̃_j comes from e.g. a first run of Lasso. Under some conditions, the adaptive Lasso has the oracle property and provides consistent estimates of the non-zero parameters (those not set to zero in the initial run).
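One common way to implement the adaptive Lasso is to rescale the columns of X by the weights and run an ordinary Lasso; the sketch below (my own, with hypothetical data and arbitrary τ and penalties) follows that route:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
T, p, tau = 100, 40, 1.0
X = rng.standard_normal((T, p))
Y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.standard_normal(T)

# Step 1: initial Lasso fit gives beta_tilde.
beta_tilde = Lasso(alpha=0.05).fit(X, Y).coef_
w = 1.0 / (np.abs(beta_tilde) ** tau + 1e-8)     # weights w_j = |beta_tilde_j|^{-tau}

# Step 2: the weighted problem is equivalent to a plain Lasso on the
# rescaled design X_j / w_j, followed by undoing the scaling.
X_scaled = X / w
beta_star = Lasso(alpha=0.05).fit(X_scaled, Y).coef_
beta_adalasso = beta_star / w

print(np.flatnonzero(beta_adalasso))             # selected variables
```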
Group Lasso
The Group Lasso penalizes coefficients in pre-specified groups of variables, replacing the Lasso penalty with λ Σ_{j=1}^G √d_j ‖β_(j)‖₂, where the length of the j-th group is given by d_j. When d_i = 1 ∀i ∈ [G], Group Lasso is the same as Lasso.
5 Sieve Methods
Moving beyond linear specifications, suppose we want to estimate a general function f of X_t. Minimizing the sum of squared errors over an unrestricted function space H is intractable when H is infinite (any possible function is allowed), since there is no efficient technique to search over the function space. Instead, we work with a sieve space H_D and approximate f by a linear combination of basis functions,

f(X_t) ≈ Σ_{j=1}^{J_T} β_j h_j(X_t)

where h_j(·) is the j-th basis function of H_D. It can either be indexed by a vector of parameters (not fully known) or fully known. The number of basis functions J_T and the dimension of the space D depend on the sample size T of the dataset X. When there is no parameter
θ (i.e. the basis functions are fully known), then D = J_T; otherwise, D > J_T.
For example, if H_D is the space of polynomial functions of X_t up to order J_T, then the basis functions are fully known and D = J_T:

Pol(J_T) = { Σ_{j=1}^{J_T} β_j X^j, X ∈ [0, 1] : β_j ∈ R }
The motivation for this polynomial sieve space is the Stone–Weierstrass theorem, which states that any continuous function m(·) on a compact support X ⊂ R can be approximated arbitrarily well by a polynomial. Other common sieve spaces include trigonometric polynomials, logistic functions, ReLUs, etc.
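A minimal numpy sketch of a polynomial sieve fit (my own illustration; the target function, order J_T, and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
T, J_T = 200, 5

X = rng.uniform(0.0, 1.0, size=T)
m = lambda x: np.sin(2.0 * np.pi * x)            # unknown function to approximate
Y = m(X) + 0.1 * rng.standard_normal(T)

# Basis matrix with an intercept column plus X^1, ..., X^{J_T}.
B = np.column_stack([np.ones(T)] + [X ** j for j in range(1, J_T + 1)])

# Least squares on the sieve space.
beta, *_ = np.linalg.lstsq(B, Y, rcond=None)

x_grid = np.linspace(0.0, 1.0, 5)
B_grid = np.column_stack([np.ones(5)] + [x_grid ** j for j in range(1, J_T + 1)])
print(np.round(B_grid @ beta, 2))                # fitted values on the grid
print(np.round(m(x_grid), 2))                    # true function on the grid
```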
Now, consider “splines”: let J_T ∈ Z₊ and let ξ_i ∈ R for i ∈ {0, ..., J_T + 1} with 0 = ξ₀ < ξ₁ < ... < ξ_{J_T} < ξ_{J_T+1} = 1. In other words, we divide [0, 1] into J_T + 1 intervals I_j = [ξ_j, ξ_{j+1}) (with I_{J_T} = [ξ_{J_T}, ξ_{J_T+1}]). Now assume these “knots” ξ₀, ..., ξ_{J_T+1} have a “bounded mesh ratio”:

max_{j∈{0,...,J_T}} (ξ_{j+1} − ξ_j) / min_{j∈{0,...,J_T}} (ξ_{j+1} − ξ_j) ≤ c

for some c > 0.
However, the behavior of polynomial fits near the boundary of the dataset tends to be erratic, so extrapolation beyond the data bounds is unreliable. These issues are made worse with splines, since spline fits can behave wildly near the boundaries. “Natural cubic splines” are therefore splines with conditions imposed near the boundary knots: namely, the function must be linear beyond the boundary knots. This introduces bias near the boundary region, which is accepted in exchange for the dramatic decrease in variance obtained by restricting the spline's behavior.
The problem of knot selection is avoided by “smoothing splines,” which use a maximal set of knots. Here, the complexity of the spline fit is controlled by regularization.
6 Ensemble Methods
Boosting
Boosting is a greedy method for fitting additive models. Greedy algorithms make a locally optimal choice at each stage, with the aim of eventually reaching a global optimum. Specifically, boosting combines additive weak learners, each chosen greedily to best fit the residuals of the current boosting model at that stage, with the aim of driving those residuals down and thereby fitting the data well. It does this via gradient descent.
The additive model has the form

H_M(X_t) = Σ_{m=1}^M f_m(X_t)

where each f_m(X_t) is a weak learner: a predictor that performs rather poorly, but slightly better than chance. Combining these weak models should, hopefully, yield a learner that is strong overall. Our goal is to estimate the optimal additive model. The boosting algorithm proceeds as follows:
1. Initialize f̂^[0] (the T-dimensional initial predictor) with some value, usually a constant value. Let j = 0. Choose the set of weak learners we can utilize at every step.
2. Increase j by 1 and compute the residuals U^[j−1] = Y − f̂^[j−1] of the current model.
3. Choose the weak learner that best fits U^[j−1]. Denote by Û^[j−1] the fitted values of this weak learner.
4. Update
f̂^[j] = f̂^[j−1] + ν Û^[j−1]
where ν ∈ (0, 1] is the step size.
5. Repeat steps 2-4 until the maximum number of iterations is reached.
Gradient boosting is easy to adapt to different loss functions, supports variable selection, is useful for high-dimensional data, and is robust to multicollinearity. However, it sacrifices interpretability. Often, the weak learners in question are Classification and Regression Trees (CARTs).
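Here is a compact sketch of the boosting loop above (my own illustration), using depth-1 regression trees ("stumps") from scikit-learn as the weak learners; the step size, number of rounds, and data are arbitrary:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
T = 300
X = rng.uniform(-2.0, 2.0, size=(T, 3))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(T)

nu, n_rounds = 0.1, 200                 # step size and number of boosting rounds
f_hat = np.full(T, Y.mean())            # step 1: initialize with a constant
learners = []

for _ in range(n_rounds):
    U = Y - f_hat                                           # step 2: residuals
    stump = DecisionTreeRegressor(max_depth=1).fit(X, U)    # step 3: fit a weak learner
    learners.append(stump)
    f_hat = f_hat + nu * stump.predict(X)                   # step 4: shrunken update

print(np.mean((Y - f_hat) ** 2))        # in-sample MSE after boosting
```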
Bagging
Bagging is another ensemble method, introduced to reduce the variance of a predictor. It operates by bootstrap aggregating: we bootstrap (sample with replacement) from our dataset, fit a predictor on each bootstrap dataset, and then take the expectation of such a predictor (in practice, approximated via Monte Carlo). Formally, the bagged predictor is

f̂_bag(x) = (1/B) Σ_{b=1}^B f̂^{*b}(x)

where f̂^{*b} is the predictor fit on the b-th bootstrap sample. Bagging is most useful when the underlying statistic is unstable.
A statistic is stable at X if
θ̂T (X) = θ(X) + op (1)
as T → ∞ for some fixed value θ(X) (a stable limit). Informally, an unstable statistic is subject to large changes from small changes in the input data, whereas a stable statistic is relatively unaffected by such changes.
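A bare-bones Monte Carlo version of the bagging idea (my own sketch; the base learner, bag size B, and data are placeholders):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(9)
T = 300
X = rng.uniform(-2.0, 2.0, size=(T, 3))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.standard_normal(T)

B = 50
predictions = np.zeros((B, T))
for b in range(B):
    idx = rng.integers(0, T, size=T)          # bootstrap: sample rows with replacement
    tree = DecisionTreeRegressor().fit(X[idx], Y[idx])   # unstable base predictor
    predictions[b] = tree.predict(X)

Y_bag = predictions.mean(axis=0)              # aggregate: average over bootstrap fits
print(np.mean((Y - Y_bag) ** 2))
```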
7 Neural Networks
Again, we seek a relationship between our response variable Y and input variables X. Assuming there exists some unknown mapping
Y = f (X) + U
we would like to estimate f to predict a new response Y ∗ from a new data point X ∗ .
A single-hidden-layer (shallow) feed-forward neural network models this mapping as

H(X; θ) = β₀ + Σ_{j=1}^{J_T} β_j S(γ_j′X + γ_{0,j})
where
• The parameters θ := (β₀, ..., β_{J_T}, γ₁′, ..., γ_{J_T}′, γ_{0,1}, ..., γ_{0,J_T})ᵀ are the “weights” that tune the model. Further, the γ_{0,j} and β₀ are “biases”.
So a neural network (at least of this architecture) is simply a method of transforming the
data via weights, biases, and an activation function.
Common choices for the activation function S(·) include:
• Logistic function: S(X) = 1 / (1 + e^(−X))
• Hyperbolic tangent (tanh): S(X) = (e^X − e^(−X)) / (e^X + e^(−X))
• Rectified Linear Unit: ReLU(X) = max(0, X)
In general, but not always, the activation function S is a squashing function. A function S : R → [a, b], a < b, is a squashing (or sigmoid) function if it is non-decreasing with lim_{X→∞} S(X) = b and lim_{X→−∞} S(X) = a. However, this is not always the case: the aforementioned ReLU function is not squashing.
The universal approximation theorem states that a “feed-forward NN with a single hidden
layer with “arbitrary” squashing functions can approximate any Borel-measurable function
from one finite dimensional space to another to any desired degree of accuracy, provided
sufficiently many (finite) hidden units are available.”
It is very intuitive to visualize neural networks graphically, as a set of connecting nodes and
edges:
a) The inputs are linearly combined (including a bias) when being fed into the hidden layer
γj′ X + γ0,j
b) There are nonlinear transformations (due to the activation function) in the hidden layer:
S(γj′ X + γ0,j )
c) The outputs of the activation function in the hidden layer are then linearly combined with the output weights and a final output bias to create an output β₀ + Σ_j β_j S(γ_j′X + γ_{0,j})
We can also write the single layer neural network as a matrix: we define
• X̃_t := [1, X_tᵀ]ᵀ, t ∈ [T]
• Γ := [γ̃₁, ..., γ̃_{J_T}], where γ̃_j := [γ_{0,j}, γ_jᵀ]ᵀ collects the bias and weights of hidden unit j
Finally, the output of the feed-forward NN is given by combining O(XΓ), the matrix of activation-function outputs on the transformed inputs (the hidden layer), with the output weight and bias vector β := [β₀, β₁, ..., β_{J_T}]ᵀ, to get

H(X; θ) = O(XΓ)β
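The forward pass is easy to write down directly; below is a small numpy sketch of a single-hidden-layer network with random (untrained) weights, my own illustration of the formula above:

```python
import numpy as np

rng = np.random.default_rng(10)
T, p, J_T = 100, 4, 8

X = rng.standard_normal((T, p))

# Hidden-layer weights gamma_j (p x J_T) and biases gamma_{0,j} (J_T,).
Gamma = rng.standard_normal((p, J_T))
gamma0 = rng.standard_normal(J_T)
# Output weights beta_j (J_T,) and output bias beta0.
beta = rng.standard_normal(J_T)
beta0 = 0.5

def S(z):
    """Logistic (squashing) activation."""
    return 1.0 / (1.0 + np.exp(-z))

hidden = S(X @ Gamma + gamma0)          # T x J_T matrix of hidden-unit outputs
H = beta0 + hidden @ beta               # network output for each observation
print(H.shape)                          # (100,)
```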
This neural network is trained (the parameter vector is optimized) via stochastic gradient descent. Beforehand, however, we must determine the architecture of the NN; in the case of a single hidden layer, this involves deciding the number of hidden units J_T. Too few hidden units will likely underfit the data, while too many will overfit it.
For deep (multi-layer) networks, let J_l be the number of hidden units in layer l ∈ [L]. Then, for each hidden layer l define Γ_l := [γ̃_{1,l}, ..., γ̃_{J_l,l}]. The output O_l of layer l is then the n × (J_l + 1) matrix

O_l(O_{l−1}(·)Γ_l) =
[ 1  S(γ̃_{1l}ᵀ O_1^{l−1}(·))  ...  S(γ̃_{J_l l}ᵀ O_1^{l−1}(·))
  1  S(γ̃_{1l}ᵀ O_2^{l−1}(·))  ...  S(γ̃_{J_l l}ᵀ O_2^{l−1}(·))
  ⋮              ⋮             ⋱               ⋮
  1  S(γ̃_{1l}ᵀ O_n^{l−1}(·))  ...  S(γ̃_{J_l l}ᵀ O_n^{l−1}(·)) ]
where O₀ is X (the input data). Thus, the output of the deep neural net is obtained by combining the final hidden layer's output with the output weight and bias vector,

H(X; θ) = O_L(O_{L−1}(·)Γ_L)β
Because of this architecture, deep neural nets have a huge number of parameters to train. They can be trained with stochastic gradient descent (with backpropagation, since the chain rule is needed to compute how changing parameters in earlier layers propagates to the output), but this should be done in conjunction with regularization methods. For deep NNs, a standard such method is dropout.
Dropout is randomly dropping neurons (as well as connections) from the NN during training.
For a shallow net, consider a p × 1 vector r of iid Bernoulli(q) random variables.
Then, instead of
H(X; θ) = β₀ + Σ_{j=1}^K β_j S(γ_jᵀX_t + γ_{0,j})

we will have

H(X; θ) = β₀ + Σ_{j=1}^K β_j S(γ_jᵀ[r ⊙ X_t] + γ_{0,j})
where ⊙ represents the Hadamard (element-wise) product. At prediction time, we use the final estimates of γ_j multiplied by q (to account for the effect of the dropout regularization).
Convolutional Neural Networks
In this section, I'll focus solely on the image-classification aspects of CNNs, although there is some application of them to forecasting time-series data.
CNNs consist of several key elements, including one or more convolutional layers in which an image kernel is applied, nonlinear transformations (as in shallow neural networks), pooling for dimension reduction, and other parameters such as the stride.
The key element here is the ‘convolution’, in which an image kernel, which acts like a filter, passes over the pixels of the image. The choice of kernel affects which features (such as contours or edges) are extracted from the image. The point of the convolution is to extract the relevant features from an image, which improves classification performance in later layers.
Another parameter for convolutional neural nets is ‘stride’. In the previous example, the
convolution passed over every pixel in the image by shifting over pixel-by-pixel. Instead,
adjusting the stride, or downsampling, can reduce the problem dimension by skipping over
pixels and moving by more than one pixel per step.
Because of border effects (how a convolution can only be applied when the filter passes over
all pixels, meaning the border pixels cannot have the transformation applied), we can also
include padding–or adding a border of 0 pixel values enveloping the image, so that the kernel
can be applied to edges as well.
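To make the convolution, stride, and padding concrete, here is a small numpy sketch (my own; the kernel and image are toy examples) of a 2-D convolution with a configurable stride and zero padding:

```python
import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    """Slide a kernel over a grayscale image with zero padding and a stride."""
    if pad > 0:
        image = np.pad(image, pad, mode="constant", constant_values=0)
    kh, kw = kernel.shape
    H, W = image.shape
    out_h = (H - kh) // stride + 1
    out_w = (W - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

rng = np.random.default_rng(11)
image = rng.random((8, 8))                         # toy 8x8 grayscale image
edge_kernel = np.array([[-1.0, 0.0, 1.0],          # simple vertical-edge filter
                        [-1.0, 0.0, 1.0],
                        [-1.0, 0.0, 1.0]])

print(conv2d(image, edge_kernel).shape)            # (6, 6): no padding, stride 1
print(conv2d(image, edge_kernel, stride=2).shape)  # (3, 3): downsampled by the stride
print(conv2d(image, edge_kernel, pad=1).shape)     # (8, 8): zero padding keeps the size
```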
Non-grayscale images (grayscale images have only one value for the darkness of each pixel) can be represented as three grids, one for each RGB color. These grids can be appropriately downsized through the NN architecture, via convolutions and pooling, and fed into a fully-connected deep neural network, at which point a classification can be made.
Recurrent Neural Networks and LSTMs
Recurrent neural networks (RNNs) are designed for sequential data: the hidden state at one time step is fed back in as an input at the next. Traditional RNNs, however, suffer from the vanishing / exploding gradient problem, in which the gradient of the backpropagated cost function

Q_T(θ) = (1/T) Σ_{t=1}^T (y_t − ŷ_t)²

has ∂Q_T(θ)/∂θ → 0 or ∞. A solution to this issue is the LSTM, which has a specific architecture
with both a long-term memory pipeline and a short-term memory pipeline.
In the usual LSTM cell diagram, red circles represent a logistic activation function and blue circles a tanh activation. The top horizontal pipeline is the ‘cell’ state, which is the memory the LSTM stores to remember the past. The first vertical pipeline is the ‘forget’ gate, which tells the LSTM which information to drop from the cell state. The bottom horizontal pipeline includes the input gate, which determines which new information should be stored in the cell state. There are two outputs from the cell: the upper ‘long-term memory’ cell state, and the bottom ‘hidden state’, which represents ‘short-term memory’ and is used at the next step. At every step, the inputs to the cell are then the ‘long-term’ memory, the hidden ‘short-term’ memory, and an input vector. Mathematically, after initializing with c₀ = 0 and h₀ = 0, this is given by
ft = Logistic(Wf xt + Uf ht−1 + bf )
it = Logistic(Wi xt + Ui ht−1 + bi )
ot = Logistic(Wo xt + Uo ht−1 + bo )
pt = Tanh(Wc xt + Uc ht−1 + bc )
ct = (ft ⊙ ct−1 ) + (it ⊙ pt )
ht = ot ⊙ Tanh(ct )
yt = Wy ht + by
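A direct numpy transcription of these cell equations (my own sketch with random, untrained weight matrices and made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(12)
n_x, n_h = 3, 5                          # input and hidden-state dimensions

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def init(shape):
    return 0.1 * rng.standard_normal(shape)

# Random (untrained) parameters: W_* act on x_t, U_* on h_{t-1}, b_* are biases.
Wf, Uf, bf = init((n_h, n_x)), init((n_h, n_h)), np.zeros(n_h)   # forget gate
Wi, Ui, bi = init((n_h, n_x)), init((n_h, n_h)), np.zeros(n_h)   # input gate
Wo, Uo, bo = init((n_h, n_x)), init((n_h, n_h)), np.zeros(n_h)   # output gate
Wc, Uc, bc = init((n_h, n_x)), init((n_h, n_h)), np.zeros(n_h)   # candidate values
Wy, by = init((1, n_h)), np.zeros(1)                             # output layer

def lstm_step(x_t, h_prev, c_prev):
    """One LSTM cell update, following the equations in the notes."""
    f_t = logistic(Wf @ x_t + Uf @ h_prev + bf)
    i_t = logistic(Wi @ x_t + Ui @ h_prev + bi)
    o_t = logistic(Wo @ x_t + Uo @ h_prev + bo)
    p_t = np.tanh(Wc @ x_t + Uc @ h_prev + bc)
    c_t = f_t * c_prev + i_t * p_t           # new cell ("long-term") state
    h_t = o_t * np.tanh(c_t)                 # new hidden ("short-term") state
    y_t = Wy @ h_t + by
    return h_t, c_t, y_t

# Run the cell over a short random input sequence, starting from c0 = h0 = 0.
h, c = np.zeros(n_h), np.zeros(n_h)
for x_t in rng.standard_normal((10, n_x)):
    h, c, y = lstm_step(x_t, h, c)
print(y)
```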
8 Tree-based Methods
Classification and Regression Trees (CARTs) are a flexible, non-parametric predictive model
that operate by partitioning the input space X. They have numerous advantages, such as
interpretability, a hierarchical nature, and the ability to easily handle missing and categorical
variables (by treating them as their own category). However, they are highly unstable; this
can be fixed, though, by bagging or boosting multiple CARTs together (as mentioned before
in the ensemble methods section).
One well-known tree-based method is XGBoost (Extreme Gradient Boosting), which boosts regression trees and adds some clever refinements for greater accuracy (although this comes at the cost of interpretability).
Trees fit a local model f within each partition region. Typically, this f is very simple, such as the average of the output values of the data points in that region. The resulting predictor is

H(X_t) = Σ_{k=1}^K β_k I_k(X_t)
where I_k is a product of indicator functions showing whether a (multi-dimensional) data point belongs to region k: I_k(X_t) = 1 if X_t ∈ R_k and 0 otherwise.
One of the main ideas of regression trees is recursive partitioning: creating partitions until
the total sum of squares is below a certain pre-specified value or when the number of obser-
vations in each region reaches a minimum value.
Pruning is also an important concept for trees. Pruning is a form of regularization, with the degree of pruning controlled by a parameter α, much like the penalty parameter in regularized regression. For each subtree τ ⊆ τ₀, the original tree, consider the loss function
C_α(τ) = Σ_{k=1}^K Q_k(τ) + αK
where K is the number of regions (terminal nodes) the subtree partitions X into and Q_k(τ) is the within-region loss. The optimal α is typically chosen by cross-validation. Pruning reduces over-fitting by penalizing a greater number of terminal nodes, much like regularized regression penalizes an overfit model by shrinking the regression parameters. The main idea is to find, for a given α, the subtree minimizing the regularized loss function C_α(τ).
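scikit-learn exposes this kind of cost-complexity pruning through the `ccp_alpha` parameter of its tree estimators; a short sketch (my own, with toy data and arbitrary α values):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(13)
T = 400
X = rng.uniform(-2.0, 2.0, size=(T, 2))
Y = np.where(X[:, 0] > 0, 1.0, -1.0) + 0.3 * rng.standard_normal(T)

for alpha in [0.0, 0.01, 0.1]:
    tree = DecisionTreeRegressor(ccp_alpha=alpha).fit(X, Y)
    # Larger alpha penalizes leaves more heavily, so the pruned tree is smaller.
    print(alpha, tree.get_n_leaves())
```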
There are many tools for showing the impact of each individual feature of X on the regression tree's predictions. One is the variable-importance measure, which for variable ℓ sums the improvements from all splits made on that variable,

I_ℓ²(τ) = Σ_{j=1}^N δ̂_j² 1{s_j = ℓ}

where N is the number of parent (internal) nodes, s_j is the index of the variable split on at node j, and δ̂_j² is the estimated improvement in fit (reduction in squared error) produced by that split.
Another useful method is using a Partial Dependence Plot, which shows the marginal effect
of 1 or 2 features on the predicted outcome of the regression tree (or more generally, for any
ML model). Isolating 1 feature (or 2, if you want a 3-D graph), first set X_t = (X_{1t}, X_{2t}ᵀ)ᵀ and then define

f₁(X_{1t}) = E_{X₂}[ f(X_{1t}, X_{2t}) ] = ∫ f(X_{1t}, X_{2t}) dP(X_{2t})
In other words, f₁ is the expected value over the distribution of X_{2t}: holding the feature of interest fixed, what is the average output over the range of all other features? We can then plot f₁ over the domain of X_{1t} to see what the output function looks like on average. Generally, this expectation is not computed analytically but approximated with a Monte Carlo approach:
f̂₁(x₁) = (1/T) Σ_{t=1}^T f̂(x₁, X_{2t})
This method does, however, assume that the variables for which the partial dependence is
computed are not correlated with other variables, which is often not realistic in practice.
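A Monte Carlo partial-dependence sketch along these lines (my own; it reuses a fitted scikit-learn tree as the generic model f̂, and the grid of evaluation points is arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(14)
T = 500
X = rng.uniform(-2.0, 2.0, size=(T, 3))
Y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.2 * rng.standard_normal(T)
model = DecisionTreeRegressor(max_depth=4).fit(X, Y)

def partial_dependence(model, X, feature, grid):
    """Average prediction with `feature` clamped to each grid value (Monte Carlo PDP)."""
    pd_values = []
    for x1 in grid:
        X_mod = X.copy()
        X_mod[:, feature] = x1                           # hold the feature of interest fixed...
        pd_values.append(model.predict(X_mod).mean())    # ...and average over the rest
    return np.array(pd_values)

grid = np.linspace(-2.0, 2.0, 9)
print(np.round(partial_dependence(model, X, feature=0, grid=grid), 2))
```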
Random Forests
Aside from gradient boosting (which fits regression trees to the residuals of the current model), another popular method is the “random forest”, which applies a bagging approach rather than a boosting approach. Bagging drastically reduces the instability of the estimate, but comes at a clear loss of interpretability, much like boosting.
The random forest algorithm is, for a bag size B and each b ∈ [B]:
1. Bootstrap (sample with replacement) a sample of size T from the original dataset
2. For each bootstrapped sample, construct a tree, τb , using a subset q out of the original
p variables as the potential split variables. The tree is grown until the minimum
number of observations in each leaf (terminal region) is reached. There is no pruning
(regularization) of these trees.
3. The final prediction is the average of the individual trees' predictions,
(1/B) Σ_{b=1}^B τ_b(X)
There is no theoretical answer for deciding between a boosted and bagged model–it is an
entirely empirical exercise.
9 Clustering
Now, we begin considering unsupervised learning models: models for X only, where there is
no known response variable Y in our dataset. Specifically, clustering is a collection of tools
to group and segment data into subgroups called clusters, such that data points in the same group are more similar to each other than to points outside the group. This can be used to see
if the data consists of a set of distinct subgroups.
A good clustering technique has clusters with i) high intra-class similarity and ii) low inter-
class similarity (for some later-defined metric of similarity/dissimilarity)
For data X with entries x_{il}, i ∈ [n], l ∈ [p], consider a dissimilarity measure along the l-th attribute, ρ_l(x_{il}, x_{jl}).
Then for any two data points (row vectors in X), we consider the overall dissimilarity measure
via a weighted sum of attribute dissimilarity measures
d(x_i, x_j) = Σ_{l=1}^p w_l ρ_l(x_{il}, x_{jl})

for weights w_l ≥ 0 with Σ_{l=1}^p w_l = 1.
Here, the dissimilarity measure ρ(·, ·) can be chosen depending on the data type of the data
points:
• For ordinal data, where there is some intrinsic order, we can convert the ordering of the discrete values into a numerical measure
• For categorical (unordered) data, a natural choice is ρ(x, y) = 1{x ≠ y}
From this, we can construct a symmetric Matrix of Dissimilarity, D[n×n] where Dij = d(xi , xj )
for i, j ∈ [n].
The “total dissimilarity” of X is then the sum over the dissimilarity matrix
T := Σ_{i=1}^n Σ_{j=1}^n d_{ij} = 1ᵀD1
This total dissimilarity can be decomposed further, if we have cluster assignments. For K
clusters and an assignment vector for the n data points a = [a₁, ..., a_n]ᵀ, where a_i ∈ [K] is the
assignment of unit i,
T = Σ_{i=1}^n Σ_{j=1}^n d_{ij} = Σ_{k=1}^K Σ_{a_i=k} [ Σ_{a_j=k} d_{ij} + Σ_{a_j≠k} d_{ij} ] =: W(a) + B(a)
where W(a) is the sum of the within-cluster dissimilarities and B(a) is the sum of the between-cluster dissimilarities. Notice that since T does not depend on a at all, the goal of minimizing W(a) is equivalent to maximizing B(a).
K-means Clustering
One of the oldest and most popular clustering methods is K-means clustering, known for its simplicity and its ability to work with any continuous data. Here, the dissimilarity measure for data points is the squared Euclidean distance
d_{ij} = d(x_i, x_j) = Σ_{l=1}^p (x_{il} − x_{jl})² = ‖x_i − x_j‖²
The goal of K-means is then to solve the optimization problem of finding the assignment vector a* ∈ A minimizing W(a):

a* = argmin_{a∈A} Σ_{k=1}^K N_k Σ_{a_i=k} ‖x_i − x̄_k‖²

In practice this is solved iteratively: given current cluster means x̄₁, ..., x̄_K, assign each point to the cluster with the nearest mean, then recompute the cluster means, repeating until convergence (when the within-cluster dissimilarity stops decreasing). Since the sum of squares decreases at every step, convergence is assured, but it may be to a suboptimal local minimum.
Further, the solution depends on the initial assignments and centroid means, which are typically randomly generated. Thus, in practice one often runs K-means many times and combines the results (similar to ensemble methods for reducing instability). Before using K-means, remember to standardize the variables, since scale matters when computing Euclidean distances across dimensions. Although K must be chosen a priori, we can use information criteria (e.g. AIC or BIC) to determine a suitable value, though they are not as useful here as in the regression case.
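A compact Lloyd's-algorithm sketch of K-means in numpy (my own illustration; K, the data, and the initialization scheme are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(15)

# Toy data: three Gaussian blobs in 2-D, standardized.
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in ([0, 0], [3, 3], [0, 3])])
X = (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans(X, K, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid updates."""
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # random initial means
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        a = dists.argmin(axis=1)                               # assign to nearest mean
        new_centroids = np.array([X[a == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return a, centroids

a, centroids = kmeans(X, K=3)
print(np.bincount(a))          # cluster sizes
```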
Hierarchical Clustering
Hierarchical clustering methods produce a sequence of nested cluster memberships, so there is no need to choose the number of clusters or starting positions a priori. There are two types of hierarchical clustering: agglomerative and divisive.
Divisive approaches are top-down: they start with all points in the same cluster and repeatedly split a cluster into the two sub-clusters with the greatest dissimilarity, until every point is in its own cluster. Agglomerative approaches, on the other hand, are bottom-up: they start with each point in its own cluster and repeatedly merge the two least dissimilar clusters until only one cluster remains. We will focus on agglomerative clustering since it is simpler.
Notice that we now need a dissimilarity between clusters, not just between data points, called the “linkage”. There are different linkage methods between sets, and the choice determines the outcome of the agglomerative process. Some common examples include
• Single linkage: ρ_single(A, B) = min_{i∈A, j∈B} d(x_i, x_j)
• Complete linkage: ρ_complete(A, B) = max_{i∈A, j∈B} d(x_i, x_j)
• Average linkage: ρ_average(A, B) = (1/(n_A n_B)) Σ_{i∈A, j∈B} d(x_i, x_j)
• Centroid linkage: ρ_centroid(A, B) = d(x̄_A, x̄_B)
A visualization tool for hierarchical clustering is the “dendrogram”, which displays the hierarchical sequence of clustering assignments.
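scipy implements agglomerative clustering with these linkage choices; a short sketch (my own, with toy data and an arbitrary linkage method):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(16)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)), rng.normal(3, 0.3, size=(20, 2))])

# Agglomerative clustering with average linkage; 'single', 'complete',
# and 'centroid' are the other options discussed above.
Z = linkage(X, method="average")

# Cut the resulting dendrogram so that 2 clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])       # sizes of the two clusters
```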
10 Classification
Now consider a binary response Y ∈ {0, 1}. A linear regression fit X_iᵀβ is not constrained to lie in [0, 1], so it cannot be read directly as a probability. A solution to this issue is to use a new function G : R → [0, 1] (much like a squashing function) to transform the linear regression fit onto the interval of interest:

P[Y_i = 1 | X_i] = G(X_iᵀβ)

Popular examples for G are the probit and logit functions. If the data is linearly separable, a CDF-type function separates the classes perfectly.
The Probit model simply takes G to be the CDF of a Gaussian distribution, Φ(·), meaning
P[Y_i = 1 | X_i] = Φ(X_iᵀβ)
and
Φ−1 [P[Yi = 1|Xi ]] = XiT β
is called the Probit link function, which maps the probability to the real line.
For the Logit model,

G(z) = exp(z) / (1 + exp(z))

so

P[Y_i = 1 | X_i] = exp(X_iᵀβ) / (1 + exp(X_iᵀβ))
The Logit model is easier to estimate and compute than the Probit and easier to generalize to multinomial models.
We estimate β using a maximum likelihood approach. Here we assume that {(X_i, Y_i)}_{i∈[n]} are independent and identically distributed and that p < n is fixed. Then the likelihood of a data point is given by

L(β | X_i, Y_i) = G(X_iᵀβ)^{Y_i} [1 − G(X_iᵀβ)]^{1−Y_i}

and β̂ maximizes the resulting log-likelihood.
There is no closed form for this estimator, and it must be solved numerically. We also have
the asymptotic distribution
√n(β̂ − β₀) → N( 0, { n⁻¹ Σ_{i=1}^n [ g(X_iᵀβ)² / ( G(X_iᵀβ)(1 − G(X_iᵀβ)) ) ] X_i X_iᵀ }⁻¹ )

where g(z) = dG(z)/dz.
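A minimal sketch of Logit estimation by maximizing the log-likelihood with plain gradient ascent (my own illustration; the learning rate, iteration count, and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(17)
n, p = 1000, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])  # include intercept
beta_true = np.array([-0.5, 1.0, -2.0])
prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
Y = rng.binomial(1, prob)

def G(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient of the (average) log-likelihood: (1/n) sum_i (Y_i - G(X_i'beta)) X_i.
beta = np.zeros(p)
lr = 0.1
for _ in range(2000):
    grad = X.T @ (Y - G(X @ beta)) / n
    beta += lr * grad

print(np.round(beta, 2))       # should be roughly close to beta_true
```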
We can also consider “partial effects”: the sensitivity of the probability output to changes in one component of the X vector (somewhat like the Greeks for options, which measure the sensitivity of a price to different inputs). These partial effects are given by
∂p(X; β)/∂x_j = g(Xᵀβ) β_j
Notice that these partial effects depend on the data point X itself. However, the “relative” effect comparing the partial effects of two different inputs does not:

[∂p(X; β)/∂x_j] / [∂p(X; β)/∂x_h] = β_j / β_h
We can consider two summary statistics that encapsulate the whole dataset's influence: the partial effect at the average, g(X̄ᵀβ̂)β̂_j, and the average partial effect, (1/n) Σ_{i=1}^n g(X_iᵀβ̂)β̂_j. We can then use the delta method to find the asymptotic variance of these partial effects.
Once we have a ‘soft’ probability for the binary Y, we still need a cutoff point to classify Y. The choice of cutoff depends on the application (in some instances we may choose to be more conservative, in others more aggressive). We can evaluate the classifier with metrics based on true and false positives and negatives, e.g. Accuracy = (TP + TN)/(TP + FP + FN + TN).
In high dimensions, we find β through a regularization approach (e.g. Ridge, Lasso, Elastic Net, AdaLasso, etc.). A Ridge-penalized version, for example, is

β̂ = argmin_{β∈R^p} [ −log L(β | {(X_i, Y_i)}_{i∈[n]}) + λ Σ_j β_j² ]
These methods can be extended to multinomial models as well, such as the Multinomial Logit: for Y ∈ {0, ..., J},

P[Y_i = j | X_i] = exp(X_iᵀβ_j) / (1 + Σ_{k=1}^J exp(X_iᵀβ_k))
In many instances, the data is not linearly separable, as this method requires. Then, we can use a nonlinear model

P[Y_i = j | X_i] = G[f(X_i; θ)]

where f(X_i; θ) is a nonlinear function of X_i (for example, trees, neural networks, SVMs, etc.).
The first such method we discuss is the classification (decision) tree: it is very similar to the aforementioned regression tree, the main difference being the loss function used. In these trees, in a terminal node k (region R_k) with n_k observations, the class proportions are
p̂_{kj} = (1/n_k) Σ_{i∈R_k} 1{Y_i = j}
Another classification method is SVC, support vector classification. In its simplest form it is a linear classifier (the later-mentioned SVM is not). SVC splits data into two groups, labeled −1 and 1; it is very effective in high-dimensional cases. The classifier is
f (X; w, b) = sgn(wT X + b)
The data is linearly separable if ∃ θ := (w, b) such that ∀i ∈ {1, ..., n}
wᵀX_i + b > 0 if Y_i = 1
wᵀX_i + b < 0 if Y_i = −1
The question then is: how do we pick an optimal hyperplane if multiple exist? We do this by ‘maximizing the margin.’ First, recall that the distance between a point X and the plane defined by (w, b) is given by

d(X, (w, b)) = |wᵀX + b| / ‖w‖
Then define the positive and negative support vectors (∈ instead of = since there may be multiple) as
X₊(w) ∈ argmin{ wᵀX_i : Y_i = 1 }
X₋(w) ∈ argmax{ wᵀX_i : Y_i = −1 }
The margin we seek to maximize is the normalized distance between these two support vectors,

m(w) = (wᵀ/‖w‖)(X₊ − X₋) = 2/‖w‖

where the second equality follows after rescaling (w, b) so that

wᵀX_i + b ≥ +1 if Y_i = 1
wᵀX_i + b ≤ −1 if Y_i = −1

with equality at the support vectors. Maximizing the margin under these constraints is equivalent to

min_{w,b} ‖w‖²/2
such that Yi (wT Xi + b) ≥ 1 ∀i. If the solution exists and it is unique, it is called the optimal
hyperplane.
We can also introduce some slack if the data is not exactly linearly separable (e.g. there are some outliers). Then, for slack variables {ξ_i}_{i∈[n]} with ξ_i ≥ 0, we relax the constraints to Y_i(wᵀX_i + b) ≥ 1 − ξ_i ∀i. When the ξ_i are big enough that the constraints can always be met, this is called a soft margin.
Beyond SVC are Support Vector Machines (SVMs), which use the kernel trick (with an a priori chosen kernel) to map the data into a higher-dimensional space where it is linearly separable and an SVC can be applied. We can therefore view the SVM as a nonlinear classifier: it separates the data non-linearly in the original space by separating it with a hyperplane in the higher-dimensional space.
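A brief scikit-learn sketch contrasting a linear SVC with a kernelized SVM (my own example; the RBF kernel choice and the toy data are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(18)
n = 400
X = rng.standard_normal((n, 2))
Y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)    # not linearly separable

linear_svc = SVC(kernel="linear", C=1.0).fit(X, Y)        # max-margin linear classifier
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, Y)              # kernel trick: RBF feature map

print(linear_svc.score(X, Y))   # poor: no separating hyperplane in the original space
print(rbf_svm.score(X, Y))      # much better: separable after the kernel mapping
```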
References
[1] Caio Almeida and Marcelo Medeiros. FIN 580 Quantitative Data Analysis in Finance
Lecture Slides.
[2] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical
Learning. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.,
2001.
[3] Andy Jones. LASSO and the irrepresentable condition. url: https://andrewcharlesjones.github.io/journal/irrepresentable.html.
[4] Peter Melchior. SML 505 Modern Statistics, Princeton University. url: sml505.pmelchior.net.
[5] Peng Zhao and Bin Yu. “On Model Selection Consistency of Lasso”. In: Journal of Machine Learning Research 7.90 (2006), pp. 2541–2563. url: http://jmlr.org/papers/v7/zhao06a.html.