CV Lec4
Optimization
• We saw that a setting of the parameters 𝑊 that produced predictions for
examples 𝑥𝑖 consistent with their ground truth labels 𝑦𝑖 would also have a very
low loss 𝐿
• We are now going to introduce the third and last key component: optimization.
• Optimization is the process of finding the set of parameters 𝑊 that minimize the
loss function.
Loss function visualization
• The loss functions we’ll look at in this class are usually defined over very high-dimensional spaces
(e.g. in CIFAR-10 a linear classifier weight matrix is of size [10 x 3073] for a total of 30,730
parameters), making them difficult to visualize.
Strategy #1: A first (very bad) idea: Random Search
• Since it is so simple to check how good a given set of parameters W is, the first
(very bad) idea that may come to mind is to simply try out many different random
weights and keep track of what works best.
• In other words, our approach will be to start with a random W and then
iteratively refine it, making it slightly better each time.
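Below is a minimal NumPy sketch of this random-search idea, assuming a CIFAR-10-sized weight matrix; loss_fn, X_train, and y_train are placeholder names for whatever loss function and training data are in use.

```python
import numpy as np

def random_search(loss_fn, X_train, y_train, num_trials=1000):
    """Try many random weight matrices and keep the one with the lowest loss."""
    best_loss, best_W = float('inf'), None
    for _ in range(num_trials):
        W = np.random.randn(10, 3073) * 0.0001   # random CIFAR-10-sized weights (bias folded in)
        loss = loss_fn(W, X_train, y_train)       # assumed to return a scalar loss
        if loss < best_loss:
            best_loss, best_W = loss, W
    return best_W, best_loss
```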
Strategy #2: Random Local Search
• The first strategy you may think of is to try to extend one foot in a random direction and then take a step only if it leads downhill.
• Each such random try requires evaluating the loss just to see whether it helps, and this problem only gets worse, since modern Neural Networks can easily have tens of millions of parameters. Clearly, this strategy is not scalable and we need something better.
• A better idea is to follow the gradient. The gradient is a generalization of the slope for functions that take not a single number but a vector of numbers as input.
• The slope in any direction is the dot product of that direction with the gradient (see the sketch after this list).
• Once you derive the analytic expression for the gradient, it is straightforward to implement it and use it to perform the gradient update.
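As a quick toy check of the dot-product claim above (an illustration, not from the slides): for f(w) = sum(w²) the analytic gradient is 2w, and the numerically estimated slope along any unit direction agrees with the dot product of that direction with the gradient.

```python
import numpy as np

def f(w):
    return np.sum(w ** 2)          # example function; its gradient is 2*w

w = np.array([1.0, -2.0, 3.0])
grad = 2 * w                       # analytic gradient at w

d = np.random.randn(3)
d /= np.linalg.norm(d)             # a random unit direction

h = 1e-5
slope_numeric = (f(w + h * d) - f(w - h * d)) / (2 * h)   # finite-difference slope along d
slope_dot = grad.dot(d)                                    # dot product with the gradient

print(slope_numeric, slope_dot)    # the two values agree closely
```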
Gradient Descent
• Initialize the weights
• Evaluate the gradient of the loss with respect to each weight
• Update the weights by taking a step (scaled by the step size) in the negative direction of the gradient
• Algorithm:
    Initialize w = 0
    For t = 1, 2, ..., T:
        w ← w − α · ∂TrainLoss(w, b)/∂w
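A minimal sketch of this loop in NumPy; train_loss_grad is a placeholder for a function returning the analytic gradient of the training loss at the current weights.

```python
import numpy as np

def gradient_descent(train_loss_grad, dim, alpha=0.01, T=1000):
    w = np.zeros(dim)               # initialize w = 0
    for t in range(T):
        grad = train_loss_grad(w)   # ∂TrainLoss/∂w at the current w
        w = w - alpha * grad        # step in the negative gradient direction
    return w
```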
Step size (Learning Rate)
• The gradient tells us the direction in which the function has the steepest rate of
increase, but it does not tell us how far along this direction we should step.
• As we will see later in the course, choosing the step size (also called the learning
rate) will become one of the most important (and most headache-inducing)
hyperparameter settings in training a neural network.
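A toy illustration of this (not from the slides): running gradient descent on f(w) = w², whose gradient is 2w, with different step sizes shows slow progress for a small α, fast convergence for a moderate α, and divergence when α is too large.

```python
def minimize_quadratic(alpha, steps=20, w=1.0):
    """Gradient descent on f(w) = w**2 (gradient 2*w) for a fixed number of steps."""
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

print(minimize_quadratic(0.01))   # ~0.67: step too small, barely moved toward 0
print(minimize_quadratic(0.1))    # ~0.01: reasonable step, close to the minimum
print(minimize_quadratic(1.1))    # ~38:   step too large, the iterates diverge
```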
Stochastic Gradient Descent (SGD)
• In large-scale applications, the training data can contain on the order of millions of examples. Hence, it seems wasteful to compute the full loss function over the entire training set in order to perform only a single parameter update.
• Instead, compute the gradient over small batches of the training data and use each batch to perform a parameter update. This is commonly referred to as minibatch gradient descent (and, loosely, as SGD).
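A minimal minibatch SGD sketch; loss_grad, X_train, and y_train are placeholder names, and the batch size of 256 is only an example value.

```python
import numpy as np

def minibatch_sgd(loss_grad, X_train, y_train, w, alpha=0.01, batch_size=256, num_iters=1000):
    n = X_train.shape[0]
    for _ in range(num_iters):
        idx = np.random.choice(n, batch_size, replace=False)   # sample a random minibatch
        grad = loss_grad(w, X_train[idx], y_train[idx])         # gradient on the batch only
        w = w - alpha * grad                                     # parameter update
    return w
```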
Some common optimization techniques
• SGD with Momentum
• RMSprop
• Adam (Adaptive Moment Estimation)
•…
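For instance, SGD with Momentum keeps a running velocity built from past gradients; one common form of the update (with momentum coefficient mu, typically around 0.9) is sketched below. This is only an illustration of the update rule, not a full training loop.

```python
import numpy as np

def momentum_step(w, v, grad, alpha=0.01, mu=0.9):
    """One SGD-with-momentum update: the velocity v accumulates a decaying
    sum of past gradients, and the weights move along the velocity."""
    v = mu * v - alpha * grad
    w = w + v
    return w, v
```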
To sum up …
• The loss function represents an optimization landscape where we aim to reach the
bottom.
• Numerical gradients are simple but approximate and computationally expensive. Analytic
gradients are exact but require mathematical derivation, making them more error-prone.
• Setting the step size (learning rate) for parameter updates is crucial.
• The Gradient Descent algorithm was introduced as an iterative process that computes
gradients and updates parameters to minimize the loss.
Linear classifier for CIFAR10 image classification
• When using linear classification with the CIFAR-10 dataset, the goal is to classify
images into one of the 10 classes (airplane, automobile, bird, cat, deer, dog, frog,
horse, ship, truck) using a linear model.
• The model learns a single 10x3072 weight matrix (one row of 3072 weights per class) plus a 10-dimensional bias vector; folding the bias into the weights gives the 10x3073 matrix mentioned earlier.
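A minimal shape-level sketch of the resulting score function for a single CIFAR-10 image (the weights here are random placeholders, not trained values):

```python
import numpy as np

W = np.random.randn(10, 3072) * 0.0001   # one row of 3072 weights per class
b = np.zeros(10)                          # one bias per class
x = np.random.rand(3072)                  # a flattened 32x32x3 image

scores = W.dot(x) + b                     # 10 class scores
predicted_class = int(np.argmax(scores))  # index of the highest-scoring class
```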
Example trained weights of a linear classifier on CIFAR-10
• The core takeaway from this is that the ability to compute the
gradient of a loss function with respect to its weights (and have some
intuitive understanding of it) is the most important skill needed to
design, train and understand neural networks.
Regularization
• When training machine learning models, especially complex ones like deep neural
networks, the model may fit the training data too well.
• This can lead to overfitting, where the model learns not only the underlying
patterns but also the noise and specific quirks of the training data.
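A common way to combat this is to add a regularization penalty to the loss, for example an L2 penalty on the weights. The sketch below assumes names data_loss, data_grad, and lambda_reg; the 0.5 factor is only a convention that simplifies the gradient.

```python
import numpy as np

def l2_regularized_loss(data_loss, W, lambda_reg=1e-3):
    """Total loss = data loss + L2 penalty that discourages large weights."""
    return data_loss + 0.5 * lambda_reg * np.sum(W * W)

def l2_regularized_grad(data_grad, W, lambda_reg=1e-3):
    """Gradient of the total loss: the L2 penalty contributes lambda_reg * W."""
    return data_grad + lambda_reg * W
```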
http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
Linear classifier for image classification
• Taking the raw pixels and feeding them directly to the linear classifier is not such a great idea, given the high dimensionality of the problem.
• What was common before the dominance of neural networks and deep learning was a
two-stage approach.
• First, you would take the image and compute different feature representations (e.g. SIFT, SURF, HOG, Bag of Words, ...).
• Then those features, rather than the entire image, are given to the classifier, as sketched below.
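A minimal sketch of this two-stage pipeline, assuming scikit-image (for HOG features) and scikit-learn (for a linear classifier) are available; the images and labels arrays are random placeholders standing in for real grayscale CIFAR-10 data.

```python
import numpy as np
from skimage.feature import hog     # stage 1: hand-crafted feature extractor
from sklearn.svm import LinearSVC   # stage 2: linear classifier

def extract_hog_features(images):
    """Compute a HOG feature vector for each grayscale image."""
    return np.array([hog(img, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
                     for img in images])

# Placeholder data: 100 grayscale 32x32 images with random class labels.
images = np.random.rand(100, 32, 32)
labels = np.random.randint(0, 10, size=100)

features = extract_hog_features(images)          # features, not raw pixels
classifier = LinearSVC().fit(features, labels)   # linear classifier on the features
```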