Introduction To Optimization: CBMM Summer School Aug 12, 2018


INTRODUCTION TO

OPTIMIZATION
CBMM Summer School
Aug 12, 2018
What you will learn
§ What optimization is used for

§ Different optimization concepts

§ Commonly used terminology

§ In the notes: pointers to how to perform optimization in
  common programming languages (R, Python, Matlab)

2
Important terms
§ Likelihood
§ Maximum likelihood estimate
§ Cost function
§ Gradient
§ Gradient descent
§ Global / local minima
§ Convex / non-convex functions
§ Differentiable functions
§ Stochastic gradient descent
§ Regularization
§ Sparse coding
§ Momentum

3
Materials and notes

http://bit.do/IntroOptim

Find notes at:


http://cbmm.mit.edu/summer-school/2018/resources
4
Agenda
§ Likelihood and cost functions

§ Single variable optimization

§ Multi-variable optimization

§ Optimization for machine learning


– Stochastic gradient descent
– Regularization
– Sparse coding
– Momentum

5
LIKELIHOOD & COST
FUNCTIONS

6
What is the likelihood?
§ Likelihood: The probability of observing your
data given a particular model

L(θ | D) = P(D | θ)

where D = {d_1, d_2, ..., d_N} is a set of data, and θ is a model and
its associated parameterization

7
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:

BBWBWWBWWWBW
What is the probability of getting this sequence if the balls were
equally distributed – P(Black) = 0.5?

P(D | θ = 0.5) = 0.5 * 0.5 * (1 − 0.5) * ... = 0.5^5 * (1 − 0.5)^7 = 2.44 × 10^−4

What is the probability of getting this sequence if the urn was 75%
filled with black balls?

P(D | θ = 0.75) = 0.75 * 0.75 * (1 − 0.75) * ... = 0.75^5 * (1 − 0.75)^7 = 1.45 × 10^−5

8
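The two likelihood calculations above can be checked with a few lines of code. This is an illustrative sketch (not from the slides); the function name `likelihood` is made up for the example:

```python
def likelihood(theta, n_black=5, n_white=7):
    # probability of the observed draws BBWBWWBWWWBW given P(Black) = theta
    return theta ** n_black * (1 - theta) ** n_white

print(likelihood(0.5))   # ≈ 2.44e-4
print(likelihood(0.75))  # ≈ 1.45e-5
```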
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:

BBWBWWBWWWBW
We can abstract this function and view it for any (allowed) value of
the probability (θ):

L(θ | D) = θ^5 * (1 − θ)^7
What does this tell us
about the urn?

9
Maximum likelihood estimator
§ Maximum Likelihood Estimator: The value of the
parameters of a given model that maximizes the
likelihood function for a set of data

θ̂_mle = argmax_θ L(θ | D)

10
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:

BBWBWWBWWWBW
θ̂_mle = 5/12

How do we find this value more generally?

11
Cost functions
§ Cost function / loss function: A function that maps a
set of events into a number that represents the “cost”
of that event occurring

§ More general than likelihood

§ The “cost” of a likelihood is typically the negative log-likelihood:

C(θ , D) = − log(L(θ | D))

12
Likelihood -> Cost
C(θ, D) = −log(L(θ | D))

θ̂_mle = argmax_θ L(θ | D) = argmin_θ C(θ, D)
13
SINGLE-VARIABLE
OPTIMIZATION

14
Back to the urn problem…
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:

BBWBWWBWWWBW

How do we find this value more generally?

θ̂_mle = argmin_θ C(θ, D)

15
Grid search (brute force)

[Figure: cost evaluated on a grid of θ values; the lowest point is labeled “Here’s the minimum!”]


16
Grid search (brute force)
§ Pros:
– Guaranteed to get you close to the minimum (with fine
enough grid)
– Easy to implement

§ Cons:
– Inefficient
– Only really works for 1-2 parameters

17
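For the urn problem, grid search is only a few lines. This sketch (names invented for the example) minimizes the negative log-likelihood over a grid of allowed θ values:

```python
import math

def cost(theta):
    # negative log-likelihood for 5 black and 7 white draws
    return -(5 * math.log(theta) + 7 * math.log(1 - theta))

grid = [i / 1000 for i in range(1, 1000)]   # theta in (0, 1)
best = min(grid, key=cost)
print(best)  # ≈ 0.417, close to the true MLE of 5/12
```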
Gradient descent

[Figure: you, standing in fog on a mountainside, trying to find your way down to camp]

18
Gradient descent

[Figure: at any point on the mountainside, the local slope tells you which way is downhill]

19
Gradient descent
§ Gradient descent: An optimization technique
where parameters are updated proportionally
to the negative gradient to continually step
“downhill” in the cost function

θ_new = θ_old − γ ∇C(θ_old)

where γ is the step-size parameter

20
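As an illustrative sketch (not from the slides), gradient descent can be run on the urn problem's negative log-likelihood, whose gradient has the simple analytic form −5/θ + 7/(1 − θ); the function names here are made up for the example:

```python
def gradient(theta):
    # analytic gradient of C(theta) = -(5*log(theta) + 7*log(1 - theta))
    return -5 / theta + 7 / (1 - theta)

theta = 0.2      # starting guess
gamma = 0.001    # step-size parameter
for _ in range(5000):
    theta -= gamma * gradient(theta)
print(round(theta, 3))  # ≈ 0.417, the MLE 5/12
```

Each step moves θ against the gradient, so the cost continually decreases until θ settles at the minimum.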
Gradient descent

21
Gradient descent
But we had to calculate
the gradient!

θ̂_mle ≈ 5/12

22
Gradient descent
But what if we don’t know how to calculate the gradient?

… We approximate it!

23
Local vs. global minima

C(θ ) = θ * (θ +1)* (θ +1.5)* (θ − 2)


arbitrary.cost = function(theta) {
  return(theta * (theta + 1) * (theta + 1.5) * (theta - 2))
}

# central-difference approximation of the gradient
arbitrary.gradient = function(theta, epsilon = 0.001) {
  return((arbitrary.cost(theta + epsilon) - arbitrary.cost(theta - epsilon)) / (2 * epsilon))
}

theta = -2.5    # starting point
gamma = .0001   # step size
tau = .000001   # convergence tolerance

prior.cost = arbitrary.cost(theta)

running = TRUE
while (running) {
  grad = arbitrary.gradient(theta)
  theta = theta - gamma * grad
  new.cost = arbitrary.cost(theta)
  if (abs(new.cost - prior.cost) < tau) {
    running = FALSE
  } else {
    prior.cost = new.cost
  }
}
print(theta)

## [1] -1.29429
θ = -1.294
We have an answer - great, right?
24
Before we celebrate, let’s take a look at that function:

thetas = seq(-3, 3, by = .01)  # grid of theta values for plotting the cost
Local vs. global minima
§ Local minimum: a point x∗ is a local minimum
if it is the lowest value of the function within
some range centered on x∗

§ Global minimum: a point x∗ is a global
minimum if it is the lowest value of the function
across all allowable values of the parameter

25
Convex vs. non-convex functions
§ Convex function: A
function is convex if
every line segment
drawn between two
points on the graph of
the function lies on or
above the graph
§ Non-convex function:
A function that is not
convex

Any local minimum of a convex function is guaranteed to be a global minimum!

26
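The quartic from the earlier R example fails the convexity test: a chord drawn between two points can pass below the graph. A quick midpoint check (illustrative code, names invented):

```python
def cost(t):
    # the quartic from the earlier example: C(θ) = θ(θ+1)(θ+1.5)(θ-2)
    return t * (t + 1) * (t + 1.5) * (t - 2)

# convexity requires f(midpoint) <= average of the endpoint values, for every chord
a, b = -1.3, 1.3
mid_ok = cost((a + b) / 2) <= (cost(a) + cost(b)) / 2
print(mid_ok)  # False: this chord dips below the function, so C is non-convex
```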
Implementation

Language   Single-var
R          optimize
Python*    minimize_scalar
Matlab     fminbnd

*: Python optimization functions require the “scipy” package

The implementations of optimization use more advanced algorithms than
discussed here. See the notes for further details!

27
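As a sketch of how the Python entry in the table might be used, assuming the “scipy” package is installed, the urn problem's cost can be minimized directly (the `cost` function here is our own illustrative definition):

```python
import math
from scipy.optimize import minimize_scalar

def cost(theta):
    # negative log-likelihood for the urn data (5 black, 7 white)
    return -(5 * math.log(theta) + 7 * math.log(1 - theta))

res = minimize_scalar(cost, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # ≈ 5/12 ≈ 0.4167
```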
MULTI-VARIABLE
OPTIMIZATION

28
Lecture attendance problem

29
Lecture attendance problem

Working Hung-over

P(Skip | Work) = ? P(Skip | Hung-over) = ?

Skipping

30
Lecture attendance problem
Working Hung-over Skipping

✔ ✖ ✖

✔ ✔ ✖

✖ ✖ ✖

✖ ✔ ✔ 31
Lecture attendance problem

Attending:
        HungO   ~HungO
Work      42      17
~Work     57      26

Skipping:
        HungO   ~HungO
Work       0      10
~Work     18      30

32
Lecture attendance problem

33
Multi-dimensional gradients
Almost the same as single-dimensional, but as a vector to
represent both the magnitude and direction:

∇C(θ) = [ ∂C/∂θ_1, ∂C/∂θ_2, ..., ∂C/∂θ_n ]
34
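The central-difference trick from the single-variable R example extends to many dimensions by perturbing one coordinate at a time. This is an illustrative sketch with made-up names (`numeric_grad`, the toy cost `f`):

```python
def numeric_grad(f, theta, eps=1e-5):
    # central-difference approximation, one coordinate at a time
    g = []
    for i in range(len(theta)):
        hi, lo = theta[:], theta[:]
        hi[i] += eps
        lo[i] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

f = lambda t: (t[0] - 1) ** 2 + (t[1] + 2) ** 2   # a toy two-parameter cost
print(numeric_grad(f, [0.0, 0.0]))  # ≈ [-2.0, 4.0]
```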
Multi-dimensional gradients

35
Multi-dimensional gradient descent

36
Multi-dimensional gradient descent

θwork = 0.3
θhungover = 0.4

37
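Multi-variable gradient descent looks just like the single-variable version, with a vector update. This sketch uses a stand-in two-parameter quadratic cost (not the attendance model itself), so every name in it is illustrative:

```python
theta = [0.0, 0.0]
gamma = 0.1
for _ in range(200):
    grad = [2 * (theta[0] - 1), 2 * (theta[1] + 2)]   # gradient of the toy cost
    theta = [theta[i] - gamma * grad[i] for i in range(2)]
print([round(t, 3) for t in theta])  # converges to the minimum at (1, -2)
```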
Differentiable functions
§ Differentiable: A function is differentiable if its
gradient exists (and can be given by a function) at
every allowable value of its parameters

38
Implementation

Language   Single-var        Multi-var
R          optimize          optim
Python*    minimize_scalar   minimize
Matlab     fminbnd           fminsearch

*: Python optimization functions require the “scipy” package

The implementations of optimization use more advanced algorithms than
discussed here. See the notes for further details!

39
OPTIMIZATION FOR MACHINE
LEARNING

40
Optimization for machine learning
§ Stochastic gradient descent

§ Regularization

§ Sparse coding

§ Momentum

41
Stochastic gradient descent
§ What happens when the cost function is expensive?

>14 million images!

§ You want to fit your model to the entire dataset… but
computing the entire cost function thousands of times
would take forever!

42
Stochastic gradient descent

If your cost function is decomposable into costs from each
individual observation, you can approximate the gradient by
calculating the cost on a random subset of the data

43
Stochastic gradient descent
§ Stochastic gradient descent: Performing
gradient descent iteratively using subsets of
the full data set to calculate the cost function

44
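A minimal sketch of the idea, with a deliberately simple decomposable cost (mean squared error, whose minimizer is the sample mean); the data and all names are made up for the example:

```python
import random

random.seed(0)
data = [random.gauss(3.0, 1.0) for _ in range(1000)]

# full cost: average of (theta - d)^2 over ALL data; here we only ever
# touch a random mini-batch per step
theta, gamma = 0.0, 0.05
for _ in range(2000):
    batch = random.sample(data, 10)          # random subset of the data
    grad = sum(2 * (theta - d) for d in batch) / len(batch)
    theta -= gamma * grad
print(theta)  # hovers near the sample mean (about 3)
```

Each step uses 10 points instead of 1000, so it is roughly 100× cheaper; the gradient is noisy, but on average it points the right way.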
Stochastic gradient descent

45
Regularization
Polynomial machine:

46
Regularization
Polynomial machine:

47
Regularization

48
Regularization
§ Regularization: Imposing an additional cost
that is related to the magnitude of the
parameters

C_reg(θ, D) = C(θ, D) + λ R(θ)

λ: strength of regularization
R: regularization term – e.g., L2: R(θ) = Σ_i θ_i²
49
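As an illustrative sketch (all names and data invented for the example), an L2 penalty pulls the minimizer toward zero; for this one-parameter cost the regularized minimum is mean(data) / (1 + λ), which a brute-force grid search recovers:

```python
def cost(theta, data, lam):
    # mean squared error plus an L2 penalty on the parameter
    mse = sum((theta - d) ** 2 for d in data) / len(data)
    return mse + lam * theta ** 2

data = [2.9, 3.0, 3.1]
grid = [i / 1000 for i in range(-5000, 5001)]
for lam in (0.0, 1.0):
    best = min(grid, key=lambda t: cost(t, data, lam))
    print(lam, best)  # minimizer is mean(data)/(1 + lam): 3.0, then 1.5
```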
Regularization

50
Regularization

51
Regularization

52
Sparse coding
§ This lets you recreate all natural images very well

§ But each image requires contributions from every basis image

§ Inefficient if:
– Cost of activation (neural spiking)
– Composition of features

Olshausen & Field (1996)
53
Sparse coding
§ Sparse coding: Imposing an additional cost
that is related to the magnitude of activated
units

Common penalties: L1 sparsity (Σ_i |a_i|) or a log-penalty (Σ_i log(1 + a_i²))

54
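An L1 penalty does more than shrink activations: it drives weak ones exactly to zero. This toy one-unit sketch (the function name and inputs are invented) minimizes (a − x)² + λ|a| by brute force:

```python
def sparse_activation(x, lam):
    # argmin over a of (a - x)^2 + lam * |a|, found on a fine grid
    grid = [i / 1000 for i in range(-3000, 3001)]
    return min(grid, key=lambda a: (a - x) ** 2 + lam * abs(a))

print(sparse_activation(0.2, 1.0))  # → 0.0: a weak input is switched off entirely
print(sparse_activation(2.0, 1.0))  # → 1.5: a strong input is kept, but shrunk
```

This thresholding behavior is why an L1-style cost yields sparse codes: most units stay silent for any given image.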
Sparse coding

Olshousen & Fields (1996)

55
Momentum
Can we predict college GPA from ACT scores?

http://www.calvin.edu/~stob/data/actgpanona.csv
56
Momentum

57
Momentum
Gradient descent

b0 = 0.807
b1 = 0.098

Takes 17,665
iterations!
Linear regression:
b0 = 1.113
b1 = 0.087

58
Momentum

59
Momentum
§ Momentum: Modifying gradient descent such
that your next step is a combination of the
gradient and the previous step

step_t = γ ∇C(θ_t)  +  η · step_{t−1}
        (gradient       (momentum
       contribution)   contribution)

θ_{t+1} = θ_t − step_t

60
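A minimal sketch of why momentum helps, on an invented elongated quadratic cost (not the GPA data); with μ = 0 the loop reduces to plain gradient descent:

```python
def steps_to_converge(mu):
    # minimize the elongated quadratic C(x, y) = x^2 + 25*y^2
    x, y = 5.0, 1.0
    vx = vy = 0.0
    gamma = 0.03
    for step in range(100000):
        vx = mu * vx + gamma * 2 * x      # momentum: reuse part of the last step
        vy = mu * vy + gamma * 50 * y
        x, y = x - vx, y - vy
        if x * x + 25 * y * y < 1e-12:
            return step
    return None

print(steps_to_converge(0.0), steps_to_converge(0.8))  # momentum needs fewer steps
```

Momentum accumulates speed along the shallow direction while the oscillations in the steep direction partly cancel, so it reaches the minimum in fewer iterations.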
Momentum
Gradient descent:
b0 = 0.807
b1 = 0.098
N = 17,665

Gradient descent + momentum:
b0 = 1.045
b1 = 0.090
N = 1,884

Linear regression:
b0 = 1.113
b1 = 0.087

61
SUMMARY

62
Important terms
§ Likelihood
§ Maximum likelihood estimate
§ Cost function
§ Gradient
§ Gradient descent
§ Global / local minima
§ Convex / non-convex functions
§ Differentiable functions
§ Stochastic gradient descent
§ Regularization
§ Sparse coding
§ Momentum

63
QUESTIONS?

64
