Introduction To Optimization: CBMM Summer School Aug 12, 2018


INTRODUCTION TO

OPTIMIZATION
CBMM Summer School
Aug 12, 2018
What you will learn
§ What optimization is used for

§ Different optimization concepts

§ Commonly used terminology

§ In the notes: pointers to how to perform optimization in
  common programming languages (R, Python, Matlab)

2
Important terms
§ Likelihood
§ Maximum likelihood estimate
§ Cost function
§ Gradient
§ Gradient descent
§ Global / local minima
§ Convex / non-convex functions
§ Differentiable functions
§ Stochastic gradient descent
§ Regularization
§ Sparse coding
§ Momentum

3
Materials and notes

http://bit.do/IntroOptim

Find notes at:


http://cbmm.mit.edu/summer-school/2018/resources
4
Agenda
§ Likelihood and cost functions

§ Single variable optimization

§ Multi-variable optimization

§ Optimization for machine learning


– Stochastic gradient descent
– Regularization
– Sparse coding
– Momentum

5
LIKELIHOOD & COST
FUNCTIONS

6
What is the likelihood?
§ Likelihood: The probability of observing your
data given a particular model

L(θ | D) = P(D | θ)

where D = {d_1, d_2, ..., d_N} is a set of data, and θ is a model and
its associated parameterization

7
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:

BBWBWWBWWWBW
What is the probability of getting this sequence if the balls were
equally distributed – P(Black) = 0.5?

P(D | θ = 0.5) = 0.5 * 0.5 * (1 − 0.5) * ... = 0.5^5 * (1 − 0.5)^7 = 2.44 × 10^−4

What is the probability of getting this sequence if the urn was 75%
filled with black balls?

P(D | θ = 0.75) = 0.75 * 0.75 * (1 − 0.75) * ... = 0.75^5 * (1 − 0.75)^7 = 1.45 × 10^−5

8
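The two likelihood calculations above can be checked with a few lines of code. This is an illustrative sketch (not from the slides); the function name `likelihood` is made up for the example:

```python
def likelihood(theta, n_black=5, n_white=7):
    # probability of the observed draws BBWBWWBWWWBW given P(Black) = theta
    return theta ** n_black * (1 - theta) ** n_white

print(likelihood(0.5))   # ≈ 2.44e-4
print(likelihood(0.75))  # ≈ 1.45e-5
```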
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:

BBWBWWBWWWBW
We can abstract this function and view it for any (allowed) value of
the probability (θ):

L(θ | D) = θ^5 * (1 − θ)^7
What does this tell us
about the urn?

9
Maximum likelihood estimator
§ Maximum Likelihood Estimator: The value of the
parameters of a given model that maximizes the
likelihood function for a set of data

θ̂_mle = argmax_θ L(θ | D)

10
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:

BBWBWWBWWWBW
θ̂_mle = 5/12

How do we find this value more generally?

11
Cost functions
§ Cost function / loss function: A function that maps a
set of events into a number that represents the “cost”
of that event occurring

§ More general than likelihood

§ The “cost” of a likelihood is typically the negative log-likelihood:

C(θ , D) = − log(L(θ | D))

12
Likelihood -> Cost
C(θ, D) = −log(L(θ | D))

θ̂_mle = argmax_θ L(θ | D) = argmin_θ C(θ, D)
13
SINGLE-VARIABLE
OPTIMIZATION

14
Back to the urn problem…
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:

BBWBWWBWWWBW

How do we find this value more generally?

θ̂_mle = argmin_θ C(θ, D)

15
Grid search (brute force)

[Figure: cost evaluated on a grid of θ values; the lowest point is labeled “Here’s the minimum!”]


16
Grid search (brute force)
§ Pros:
– Guaranteed to get you close to the minimum (with fine
enough grid)
– Easy to implement

§ Cons:
– Inefficient
– Only really works for 1-2 parameters

17
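For the urn problem, grid search is only a few lines. This sketch (names invented for the example) minimizes the negative log-likelihood over a grid of allowed θ values:

```python
import math

def cost(theta):
    # negative log-likelihood for 5 black and 7 white draws
    return -(5 * math.log(theta) + 7 * math.log(1 - theta))

grid = [i / 1000 for i in range(1, 1000)]   # theta in (0, 1)
best = min(grid, key=cost)
print(best)  # ≈ 0.417, close to the true MLE of 5/12
```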
Gradient descent

[Figure: you, standing in fog on a mountainside, trying to find your way down to camp]

18
Gradient descent

[Figure: at any point on the mountainside, the local slope tells you which way is downhill]

19
Gradient descent
§ Gradient descent: An optimization technique
where parameters are updated proportionally
to the negative gradient to continually step
“downhill” in the cost function

θ_new = θ_old − γ ∇C(θ_old)

where γ is the step-size parameter

20
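As an illustrative sketch (not from the slides), gradient descent can be run on the urn problem's negative log-likelihood, whose gradient has the simple analytic form −5/θ + 7/(1 − θ); the function names here are made up for the example:

```python
def gradient(theta):
    # analytic gradient of C(theta) = -(5*log(theta) + 7*log(1 - theta))
    return -5 / theta + 7 / (1 - theta)

theta = 0.2      # starting guess
gamma = 0.001    # step-size parameter
for _ in range(5000):
    theta -= gamma * gradient(theta)
print(round(theta, 3))  # ≈ 0.417, the MLE 5/12
```

Each step moves θ against the gradient, so the cost continually decreases until θ settles at the minimum.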
Gradient descent

21
Gradient descent
But we had to calculate
the gradient!

θ̂_mle ≈ 5/12

22
Gradient descent
But what if we don’t know how to calculate the gradient?

… We approximate it!

23
Local vs. global minima

C(θ ) = θ * (θ +1)* (θ +1.5)* (θ − 2)


arbitrary.cost = function(theta) {
  return(theta * (theta + 1) * (theta + 1.5) * (theta - 2))
}

# central-difference approximation of the gradient
arbitrary.gradient = function(theta, epsilon = 0.001) {
  return((arbitrary.cost(theta + epsilon) - arbitrary.cost(theta - epsilon)) / (2 * epsilon))
}

theta = -2.5    # starting point
gamma = .0001   # step size
tau = .000001   # convergence tolerance

prior.cost = arbitrary.cost(theta)

running = TRUE
while (running) {
  grad = arbitrary.gradient(theta)
  theta = theta - gamma * grad
  new.cost = arbitrary.cost(theta)
  if (abs(new.cost - prior.cost) < tau) {
    running = FALSE
  } else {
    prior.cost = new.cost
  }
}
print(theta)

## [1] -1.29429
θ = -1.294
We have an answer - great, right?
24
Before we celebrate, let’s take a look at that function:

thetas = seq(-3, 3, by = .01)  # grid of theta values for plotting the cost
Local vs. global minima
§ Local minimum: a point x∗ is a local minimum
if it is the lowest value of the function within
some range centered on x∗

§ Global minimum: a point x∗ is a global
minimum if it is the lowest value of the function
across all allowable values of the parameter

25
Convex vs. non-convex functions
§ Convex function: A
function is convex if
every line segment
drawn between two
points on the graph of
the function lies on or
above the graph
§ Non-convex function:
A function that is not
convex

Any local minimum of a convex function is guaranteed to be a global minimum!

26
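The quartic from the earlier R example fails the convexity test: a chord drawn between two points can pass below the graph. A quick midpoint check (illustrative code, names invented):

```python
def cost(t):
    # the quartic from the earlier example: C(θ) = θ(θ+1)(θ+1.5)(θ-2)
    return t * (t + 1) * (t + 1.5) * (t - 2)

# convexity requires f(midpoint) <= average of the endpoint values, for every chord
a, b = -1.3, 1.3
mid_ok = cost((a + b) / 2) <= (cost(a) + cost(b)) / 2
print(mid_ok)  # False: this chord dips below the function, so C is non-convex
```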
Implementation

Language   Single-var
R          optimize
Python*    minimize_scalar
Matlab     fminbnd

*: Python optimization functions require the “scipy” package

The implementations of optimization use more advanced algorithms than
discussed here. See the notes for further details!

27
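As a sketch of how the Python entry in the table might be used, assuming the “scipy” package is installed, the urn problem's cost can be minimized directly (the `cost` function here is our own illustrative definition):

```python
import math
from scipy.optimize import minimize_scalar

def cost(theta):
    # negative log-likelihood for the urn data (5 black, 7 white)
    return -(5 * math.log(theta) + 7 * math.log(1 - theta))

res = minimize_scalar(cost, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # ≈ 5/12 ≈ 0.4167
```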
MULTI-VARIABLE
OPTIMIZATION

28
Lecture attendance problem

29
Lecture attendance problem

Working Hung-over

P(Skip | Work) = ? P(Skip | Hung-over) = ?

Skipping

30
Lecture attendance problem
Working Hung-over Skipping

✔ ✖ ✖

✔ ✔ ✖

✖ ✖ ✖

✖ ✔ ✔ 31
Lecture attendance problem

Attending:
        HungO   ~HungO
Work      42      17
~Work     57      26

Skipping:
        HungO   ~HungO
Work       0      10
~Work     18      30

32
Lecture attendance problem

33
Multi-dimensional gradients
Almost the same as single-dimensional, but as a vector to
represent both the magnitude and direction:

∇C(θ) = [ ∂C/∂θ_1, ∂C/∂θ_2, ..., ∂C/∂θ_n ]
34
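The central-difference trick from the single-variable R example extends to many dimensions by perturbing one coordinate at a time. This is an illustrative sketch with made-up names (`numeric_grad`, the toy cost `f`):

```python
def numeric_grad(f, theta, eps=1e-5):
    # central-difference approximation, one coordinate at a time
    g = []
    for i in range(len(theta)):
        hi, lo = theta[:], theta[:]
        hi[i] += eps
        lo[i] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

f = lambda t: (t[0] - 1) ** 2 + (t[1] + 2) ** 2   # a toy two-parameter cost
print(numeric_grad(f, [0.0, 0.0]))  # ≈ [-2.0, 4.0]
```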
Multi-dimensional gradients

35
Multi-dimensional gradient descent

36
Multi-dimensional gradient descent

θwork = 0.3
θhungover = 0.4

37
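Multi-variable gradient descent looks just like the single-variable version, with a vector update. This sketch uses a stand-in two-parameter quadratic cost (not the attendance model itself), so every name in it is illustrative:

```python
theta = [0.0, 0.0]
gamma = 0.1
for _ in range(200):
    grad = [2 * (theta[0] - 1), 2 * (theta[1] + 2)]   # gradient of the toy cost
    theta = [theta[i] - gamma * grad[i] for i in range(2)]
print([round(t, 3) for t in theta])  # converges to the minimum at (1, -2)
```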
Differentiable functions
§ Differentiable: A function is differentiable if its
gradient exists (and can be given by a function) at
every allowable value of its parameters

38
Implementation

Language   Single-var        Multi-var
R          optimize          optim
Python*    minimize_scalar   minimize
Matlab     fminbnd           fminsearch

*: Python optimization functions require the “scipy” package

The implementations of optimization use more advanced algorithms than
discussed here. See the notes for further details!

39
OPTIMIZATION FOR MACHINE
LEARNING

40
Optimization for machine learning
§ Stochastic gradient descent

§ Regularization

§ Sparse coding

§ Momentum

41
Stochastic gradient descent
§ What happens when the cost function is expensive?

>14 million images!

§ You want to fit your model to the entire dataset… but
computing the entire cost function thousands of times
would take forever!

42
Stochastic gradient descent

If your cost function is decomposable into costs from each
individual observation, you can approximate the gradient by
calculating the cost on a random subset of the data

43
Stochastic gradient descent
§ Stochastic gradient descent: Performing
gradient descent iteratively using subsets of
the full data set to calculate the cost function

44
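A minimal sketch of the idea, with a deliberately simple decomposable cost (mean squared error, whose minimizer is the sample mean); the data and all names are made up for the example:

```python
import random

random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(1000)]

# full cost: average of (theta - d)^2 over ALL data; here we only ever
# touch a random mini-batch per step
theta, gamma = 0.0, 0.05
for _ in range(2000):
    batch = random.sample(data, 10)          # random subset of the data
    grad = sum(2 * (theta - d) for d in batch) / len(batch)
    theta -= gamma * grad
print(theta)  # hovers near the sample mean (about 3)
```

Each step uses 10 points instead of 1000, so it is roughly 100× cheaper; the gradient is noisy, but on average it points the right way.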
Stochastic gradient descent

45
Regularization
Polynomial machine:

46
Regularization
Polynomial machine:

47
Regularization

48
Regularization
§ Regularization: Imposing an additional cost
that is related to the magnitude of the
parameters

C_reg(θ, D) = C(θ, D) + λ R(θ)

λ: strength of regularization
R: regularization term – e.g., L2: R(θ) = Σ_i θ_i²
49
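As an illustrative sketch (all names and data invented for the example), an L2 penalty pulls the minimizer toward zero; for this one-parameter cost the regularized minimum is mean(data) / (1 + λ), which a brute-force grid search recovers:

```python
def cost(theta, data, lam):
    # mean squared error plus an L2 penalty on the parameter
    mse = sum((theta - d) ** 2 for d in data) / len(data)
    return mse + lam * theta ** 2

data = [2.9, 3.0, 3.1]
grid = [i / 1000 for i in range(-5000, 5001)]
for lam in (0.0, 1.0):
    best = min(grid, key=lambda t: cost(t, data, lam))
    print(lam, best)  # minimizer is mean(data)/(1 + lam): 3.0, then 1.5
```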
Regularization

50
Regularization

51
Regularization

52
Sparse coding
§ This lets you recreate all natural images very well

§ But each image requires contributions from every basis image

§ Inefficient if:
– Cost of activation (neural spiking)
– Composition of features

Olshausen & Field (1996)
53
Sparse coding
§ Sparse coding: Imposing an additional cost
that is related to the magnitude of activated
units

Common penalties: L1 sparsity (Σ_i |a_i|) or a log-penalty (Σ_i log(1 + a_i²))

54
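An L1 penalty does more than shrink activations: it drives weak ones exactly to zero. This toy one-unit sketch (the function name and inputs are invented) minimizes (a − x)² + λ|a| by brute force:

```python
def sparse_activation(x, lam):
    # argmin over a of (a - x)^2 + lam * |a|, found on a fine grid
    grid = [i / 1000 for i in range(-3000, 3001)]
    return min(grid, key=lambda a: (a - x) ** 2 + lam * abs(a))

print(sparse_activation(0.2, 1.0))  # → 0.0: a weak input is switched off entirely
print(sparse_activation(2.0, 1.0))  # → 1.5: a strong input is kept, but shrunk
```

This thresholding behavior is why an L1-style cost yields sparse codes: most units stay silent for any given image.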
Sparse coding

Olshousen & Fields (1996)

55
Momentum
Can we predict college GPA from ACT scores?

http://www.calvin.edu/~stob/data/actgpanona.csv
56
Momentum

57
Momentum
Gradient descent

b0 = 0.807
b1 = 0.098

Takes 17,665
iterations!
Linear regression:
b0 = 1.113
b1 = 0.087

58
Momentum

59
Momentum
§ Momentum: Modifying gradient descent such
that your next step is a combination of the
gradient and the previous step

step_t = γ ∇C(θ_t)  +  η · step_{t−1}
        (gradient       (momentum
       contribution)   contribution)

θ_{t+1} = θ_t − step_t

60
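A minimal sketch of why momentum helps, on an invented elongated quadratic cost (not the GPA data); with μ = 0 the loop reduces to plain gradient descent:

```python
def steps_to_converge(mu):
    # minimize the elongated quadratic C(x, y) = x^2 + 25*y^2
    x, y = 5.0, 1.0
    vx = vy = 0.0
    gamma = 0.03
    for step in range(100000):
        vx = mu * vx + gamma * 2 * x      # momentum: reuse part of the last step
        vy = mu * vy + gamma * 50 * y
        x, y = x - vx, y - vy
        if x * x + 25 * y * y < 1e-12:
            return step
    return None

print(steps_to_converge(0.0), steps_to_converge(0.8))  # momentum needs fewer steps
```

Momentum accumulates speed along the shallow direction while the oscillations in the steep direction partly cancel, so it reaches the minimum in fewer iterations.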
Momentum
Gradient descent:
b0 = 0.807
b1 = 0.098
N = 17,665

Gradient descent + momentum:
b0 = 1.045
b1 = 0.090
N = 1,884

Linear regression:
b0 = 1.113
b1 = 0.087

61
SUMMARY

62
Important terms
§ Likelihood
§ Maximum likelihood estimate
§ Cost function
§ Gradient
§ Gradient descent
§ Global / local minima
§ Convex / non-convex functions
§ Differentiable functions
§ Stochastic gradient descent
§ Regularization
§ Sparse coding
§ Momentum

63
QUESTIONS?

64
