OPTIMIZATION
CBMM Summer School
Aug 12, 2018
What you will learn
§ What optimization is used for
Important terms
§ Likelihood
§ Maximum likelihood estimate
§ Cost function
§ Gradient
§ Gradient descent
§ Global / local minima
§ Convex / non-convex functions
§ Differentiable functions
§ Stochastic gradient descent
§ Regularization
§ Sparse coding
§ Momentum
Materials and notes
http://bit.do/IntroOptim
§ Multi-variable optimization
LIKELIHOOD & COST FUNCTIONS
What is the likelihood?
§ Likelihood: The probability of observing your
data given a particular model
L(θ | D) = P(D | θ)
D = {d_1, d_2, ..., d_N}: a set of data
θ: a model and its associated parameterization
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:
BBWBWWBWWWBW
What is the probability of getting this sequence if the balls were
equally distributed – P(Black) = 0.5?
P(D | θ = 0.5) = 0.5 * 0.5 * (1 − 0.5) * ... = 0.5^5 * (1 − 0.5)^7 = 2.44 × 10^-4
What is the probability of getting this sequence if the urn was 75%
filled with black balls?
P(D | θ = 0.75) = 0.75 * 0.75 * (1 − 0.75) * ... = 0.75^5 * (1 − 0.75)^7 = 1.45 × 10^-5
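These numbers are easy to check directly (a quick R sketch, added here for illustration):

# likelihood of the observed sequence (5 black, 7 white) for two values of theta
theta = c(0.5, 0.75)
theta^5 * (1 - theta)^7
## [1] 2.441406e-04 1.448393e-05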
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:
BBWBWWBWWWBW
We can abstract this function and view it for any (allowed) value of the probability (θ):
L(θ | D) = θ^5 * (1 − θ)^7
What does this tell us about the urn?
Maximum likelihood estimator
§ Maximum Likelihood Estimator: The value of the
parameters of a given model that maximizes the
likelihood function for a set of data
Example: Balls in urns
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:
BBWBWWBWWWBW
θ_MLE = 5/12 ≈ 0.417
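The same answer can be found numerically (a sketch using R's optimize, which also appears in the implementation table below):

lik = function(theta) theta^5 * (1 - theta)^7
optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum   # ≈ 0.4167, i.e. 5/12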
Cost functions
§ Cost function / loss function: A function that maps a
set of events into a number that represents the “cost”
of that event occurring
Likelihood -> Cost
C(θ , D) = − log(L(θ | D))
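Applied to the urn example, this cost works out to (worked here for concreteness):

C(θ, D) = −log(θ^5 * (1 − θ)^7) = −(5·log(θ) + 7·log(1 − θ))

so the θ that minimizes the cost is the same θ that maximizes the likelihood.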
Back to the urn problem…
Suppose we have an urn with white and black balls, but we don’t
know the proportion. We pull some with replacement and get:
BBWBWWBWWWBW
§ Cons (of simply evaluating the cost across many candidate values of θ):
– Inefficient
– Only really works for 1-2 parameters
Gradient descent
[Figure: you are standing in fog, trying to find your way back to camp]
Gradient descent
[Figure: arrows at two points on the hillside, each labeled "This way is downhill"]
Gradient descent
§ Gradient descent: An optimization technique
where parameters are updated proportionally
to the negative gradient to continually step
“downhill” in the cost function
Update rule: θ_new = θ_old − γ · ∇C(θ_old), where γ is the step-size parameter
Gradient descent
[Figure: gradient descent on the urn cost function converges to θ ≈ 5/12]
But we had to calculate the gradient!
Gradient descent
But what if we don’t know how to calculate the gradient?
… We approximate it!
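One common approach (a minimal sketch; 'cost' stands for any single-parameter cost function) is a finite-difference approximation:

approx.gradient = function(cost, theta, eps = 1e-6) {
  (cost(theta + eps) - cost(theta - eps)) / (2 * eps)   # central difference
}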
Local vs. global minima

theta = -2.5                          # starting guess
gamma = .0001                         # step-size parameter
tau = .000001                         # stop when the cost changes by less than tau
prior.cost = arbitrary.cost(theta)    # arbitrary.cost / arbitrary.gradient are the
running = TRUE                        # example cost function and its gradient
while(running) {
  grad = arbitrary.gradient(theta)
  theta = theta - gamma * grad        # step "downhill"
  new.cost = arbitrary.cost(theta)
  if(abs(new.cost - prior.cost) < tau) {
    running = FALSE
  } else {
    prior.cost = new.cost
  }
}
print(theta)
## [1] -1.29429

θ = -1.294
We have an answer - great, right?

Before we celebrate, let's take a look at that function:
thetas = seq(-3, 3, by = .01)
[Figure: the cost function over these thetas, with the Global Minimum and the Local Minimum (≈ -1.294) where gradient descent stopped both marked]
Local vs. global minima
§ Local minimum: a point x∗ is a local minimum
if it is the lowest value of the function within
some range centered on x∗
Convex vs. non-convex functions
§ Convex function: A function is convex if every line segment drawn between two points on the function lies on or above the graph
§ Non-convex function: A function that is not convex
Implementation
Language   Single-variable
R          optimize
Python*    minimize_scalar (scipy.optimize)
Matlab     fminbnd
MULTI-VARIABLE OPTIMIZATION
Lecture attendance problem
Working   Hung-over   Skipping
✔         ✖           ✖
✔         ✔           ✖
✖         ✖           ✖
✖         ✔           ✔
Lecture attendance problem
Attending:
         HungO   ~HungO
Work     42      17
~Work    57      26

Skipping:
         HungO   ~HungO
Work     0       10
~Work    18      30
Multi-dimensional gradients
Almost the same as single-dimensional, but as a vector to represent both the magnitude and direction:
∇C(θ) = [ ∂C/∂θ_1, ∂C/∂θ_2, …, ∂C/∂θ_N ]
Multi-dimensional gradient descent
θwork = 0.3
θhungover = 0.4
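In code the update is unchanged except that θ and the gradient are vectors (a sketch mirroring the earlier single-variable loop; attendance.cost and attendance.gradient are placeholders for the slide's cost function and its gradient):

theta = c(work = 0.5, hungover = 0.5)   # starting guess (placeholder values)
gamma = .0001
tau = .000001
prior.cost = attendance.cost(theta)
running = TRUE
while(running) {
  grad = attendance.gradient(theta)     # now a vector of partial derivatives
  theta = theta - gamma * grad          # element-wise update, still "downhill"
  new.cost = attendance.cost(theta)
  if(abs(new.cost - prior.cost) < tau) {
    running = FALSE
  } else {
    prior.cost = new.cost
  }
}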
Differentiable functions
§ Differentiable: A function is differentiable if there
exists a function that provides the gradient at
every allowable value of its parameters
Implementation
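For multi-variable problems, R's general-purpose optimizer is optim (a minimal sketch; my.cost is a placeholder cost function that takes a vector of parameters):

result = optim(par = c(0, 0), fn = my.cost)   # Nelder-Mead by default; no gradient needed
result$par                                    # parameter values at the minimum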
OPTIMIZATION FOR MACHINE LEARNING
Optimization for machine learning
§ Stochastic gradient descent
§ Regularization
§ Sparse coding
§ Momentum
Stochastic gradient descent
§ What happens when the cost function is expensive?
Stochastic gradient descent
§ Stochastic gradient descent: Performing
gradient descent iteratively using subsets of
the full data set to calculate the cost function
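A minimal sketch of mini-batch stochastic gradient descent (full.data and cost.gradient are placeholders; each update uses only one random batch of rows):

theta = 0
gamma = .01
batch.size = 32
for(epoch in 1:100) {
  batches = split(sample(nrow(full.data)), ceiling(seq_len(nrow(full.data)) / batch.size))
  for(batch in batches) {
    grad = cost.gradient(theta, full.data[batch, ])   # gradient from a subset only
    theta = theta - gamma * grad
  }
}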
Regularization
Polynomial machine: a model of the form ŷ = θ_0 + θ_1·x + θ_2·x^2 + …
Regularization
§ Regularization: Imposing an additional cost
that is related to the magnitude of the
parameters
For example, C_reg(θ, D) = C(θ, D) + λ · Σ_j θ_j^2, where λ sets the strength of regularization
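As a sketch (assuming a squared-magnitude penalty; x and y are placeholder data), a regularized cost for the polynomial machine might look like:

reg.cost = function(theta, x, y, lambda) {
  X = outer(x, 0:(length(theta) - 1), "^")          # polynomial design matrix
  sum((y - X %*% theta)^2) + lambda * sum(theta^2)  # fit error + penalty on magnitudes
}
# e.g. optim(rep(0, 4), reg.cost, x = x, y = y, lambda = 1); larger lambda pulls
# the coefficients more strongly toward zero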
Sparse coding
§ This lets you recreate all natural images very well
§ Inefficient if:
– Cost of activation (neural spiking)
– Composition of features
[Figure: Olshausen & Field (1996)]
Sparse coding
§ Sparse coding: Imposing an additional cost
that is related to the magnitude of activated
units
Common forms: L1 sparsity: λ · Σ_i |a_i|    Log-penalty: λ · Σ_i log(1 + a_i^2)
Momentum
Can we predict college GPA from ACT scores?
http://www.calvin.edu/~stob/data/actgpanona.csv
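For reference, the ordinary linear-regression fit can be obtained directly (a sketch; the CSV's column names are assumed here to be ACT and GPA):

grades = read.csv("http://www.calvin.edu/~stob/data/actgpanona.csv")
coef(lm(GPA ~ ACT, data = grades))   # intercept b0 and slope b1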
Momentum
Gradient descent:   b0 = 0.807, b1 = 0.098  (takes 17,665 iterations!)
Linear regression:  b0 = 1.113, b1 = 0.087
?
Momentum
§ Momentum: Modifying gradient descent such
that your next step is a combination of the
gradient and the previous step
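In code, a minimal sketch of the update (theta, gamma, n.iter, and cost.gradient are placeholders as before; mu controls how much of the previous step carries over, with mu = 0 recovering plain gradient descent):

v = 0
mu = 0.9
for(i in 1:n.iter) {
  v = mu * v - gamma * cost.gradient(theta)   # blend previous step with new gradient
  theta = theta + v
}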
Momentum
                Gradient descent   Gradient descent + momentum   Linear regression
b0              0.807              1.045                         1.113
b1              0.098              0.090                         0.087
Iterations (N)  17,665             1,884
SUMMARY
Important terms
§ Likelihood
§ Maximum likelihood estimate
§ Cost function
§ Gradient
§ Gradient descent
§ Global / local minima
§ Convex / non-convex functions
§ Differentiable functions
§ Stochastic gradient descent
§ Regularization
§ Sparse coding
§ Momentum
QUESTIONS?