CV Lec4


Image Classification

Optimization
Parametric Approach: Linear Classifier
Optimization
• We saw that a setting of the parameters 𝑊 that produced predictions for
examples 𝑥𝑖 consistent with their ground truth labels 𝑦𝑖 would also have a very
low loss 𝐿

• We are now going to introduce the third and last key component: optimization.

• Optimization is the process of finding the set of parameters 𝑊 that minimize the
loss function.
Loss function visualization
• The loss functions we’ll look at in this class are usually defined over very high-dimensional spaces
(e.g. in CIFAR-10 a linear classifier weight matrix is of size [10 x 3073] for a total of 30,730
parameters), making them difficult to visualize.
Strategy #1: A first very bad idea solution:
Random search
• Since it is so simple to check how good a given set of parameters W is, the first
(very bad) idea that may come to mind is to simply try out many different random
weights and keep track of what works best.

• Instead of relying on pure randomness, we need to define an optimization
algorithm that allows us to iteratively improve W and b.
Core idea: iterative refinement.
• Of course, it turns out that we can do much better. The core idea is that finding
the best set of weights W is a very difficult or even impossible problem
(especially once W contains weights for entire complex neural networks)

• but the problem of refining a specific set of weights W to be slightly better is
significantly less difficult.

• In other words, our approach will be to start with a random W and then
iteratively refine it, making it slightly better each time.
Strategy #2: Random Local Search
• The first strategy you may think of is to try to extend one foot in a random direction and
then take a step only if it leads downhill.

• Concretely, we will start out with a random 𝑊, generate random perturbations 𝛿𝑊 to it,
and if the loss at the perturbed 𝑊 + 𝛿𝑊 is lower, we will perform an update.
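
A minimal sketch of this strategy in code. The toy data and the hinge-style loss L below are stand-ins chosen only so the sketch runs on its own; the shapes follow the CIFAR-10 setup above (10 classes, 3073 inputs with the bias trick), not the course's exact code.

import numpy as np

# Stand-in data: in practice X_train would hold the CIFAR-10 images as rows.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 3073))
y_train = rng.integers(0, 10, size=100)

def L(X, y, W):
    # Illustrative multiclass hinge-style loss, averaged over the examples.
    scores = X @ W.T                                   # (N, 10) class scores
    correct = scores[np.arange(len(y)), y][:, None]
    margins = np.maximum(0.0, scores - correct + 1.0)
    margins[np.arange(len(y)), y] = 0.0
    return margins.sum() / len(y)

W = rng.standard_normal((10, 3073)) * 0.001            # start from a random W
bestloss = L(X_train, y_train, W)
step_size = 0.0001

for i in range(1000):
    W_try = W + rng.standard_normal((10, 3073)) * step_size   # random perturbation dW
    loss_try = L(X_train, y_train, W_try)
    if loss_try < bestloss:                            # take the step only if it goes downhill
        W, bestloss = W_try, loss_try
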
Strategy #3: Following the Gradient
• It turns out that there is no need to randomly search for a good direction: we can
compute the best direction along which we should change our weight vector that is
mathematically guaranteed to be the direction of the steepest descent

• In one-dimensional functions, the slope is the instantaneous rate of change of the
function at any point you might be interested in.

• The gradient is a generalization of slope for functions that don’t take a single number but
a vector of numbers.

• Additionally, the gradient is just a vector of slopes (more commonly referred to as
derivatives), one for each dimension in the input space.
Computing the gradient
• The mathematical expression for the derivative of a 1-D function with respect to its
input is:

  df(x)/dx = lim (h → 0) [ f(x + h) - f(x) ] / h

• In multiple dimensions, the gradient is the vector of partial derivatives along
each dimension.

• The slope in any direction is the dot product of the direction with the gradient

• The direction of steepest descent is the negative gradient
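
As a sketch, the derivative definition above can be turned directly into code: nudge each dimension by a small h and measure how the function changes. The helper name and the choice of h are illustrative.

import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # Forward-difference approximation of the gradient of f at x.
    fx = f(x)                         # evaluate the function once at the original point
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h               # nudge a single dimension
        grad[ix] = (f(x) - fx) / h    # slope along that dimension
        x[ix] = old                   # restore the original value
        it.iternext()
    return grad

# Example: for f(x) = sum(x^2) the exact gradient is 2x.
x = np.array([1.0, -2.0, 3.0])
print(eval_numerical_gradient(lambda v: np.sum(v ** 2), x))   # approximately [ 2. -4.  6.]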


Computing the gradient analytically with Calculus
• You may have noticed that evaluating the numerical gradient has complexity linear in the
number of parameters. In our example we had 30,730 parameters in total and therefore
had to perform 30,731 evaluations of the loss function to evaluate the gradient and to
perform only a single parameter update.

• This problem only gets worse, since modern Neural Networks can easily have tens of
millions of parameters. Clearly, this strategy is not scalable and we need something
better.

• The loss is just a function of 𝑊, so we can use calculus to compute an analytic gradient.

• Once you derive the expression for the gradient, it is straightforward to implement the
expression and use it to perform the gradient update.
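
As a toy illustration (not the CIFAR-10 loss), the gradient of a least-squares loss can be derived once by hand and then checked against a numerical estimate:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))
y = rng.standard_normal(20)
w = rng.standard_normal(5)

def loss(w):
    return np.mean((X @ w - y) ** 2)           # L(w) = (1/N) * sum_i (x_i . w - y_i)^2

def analytic_grad(w):
    return 2.0 / len(y) * X.T @ (X @ w - y)    # dL/dw, derived with calculus

# Gradient check: the analytic expression should match a numerical estimate.
h = 1e-6
numeric = np.array([(loss(w + h * np.eye(5)[i]) - loss(w - h * np.eye(5)[i])) / (2 * h)
                    for i in range(5)])
print(np.max(np.abs(numeric - analytic_grad(w))))  # should be very small
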
Gradient Descent
• Initialize the weights
• Evaluate the gradient of the loss with respect to each weight
• Update the weights by taking a step (scaled by the step size) in the negative direction of the gradient

• Algorithm:

Initialize w = 0
For t = 1, 2, ..., T:
    w ← w − α · (∂/∂w) TrainLoss(w, b)
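
In code, the loop above might look like the following sketch on a toy least-squares problem; the learning rate α and the number of iterations are illustrative.

import numpy as np

rng = np.random.default_rng(2)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
X = rng.standard_normal((100, 5))                       # toy training data
y = X @ true_w + 0.1 * rng.standard_normal(100)

w = np.zeros(5)                                         # initialize w = 0
alpha = 0.05                                            # step size (learning rate)

for t in range(200):                                    # for t = 1, ..., T
    grad = 2.0 / len(y) * X.T @ (X @ w - y)             # gradient of the training loss at w
    w = w - alpha * grad                                # step in the negative gradient direction

print(w)                                                # approaches true_w
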
Step size (Learning Rate)
• The gradient tells us the direction in which the function has the steepest rate of
increase, but it does not tell us how far along this direction we should step.

• As we will see later in the course, choosing the step size (also called the learning
rate) will become one of the most important (and most headache-inducing)
hyperparameter settings in training a neural network.
Stochastic Gradient Descent (SGD)
• In large-scale applications, the training data can have on the order of millions of
examples. Hence, it seems wasteful to compute the full loss function over the
entire training set in order to perform only a single parameter update.

• Stochastic Gradient Descent is a variant of the gradient descent optimization


algorithm that updates the model parameters based on the gradient of the loss
function with respect to a single training example at each iteration, rather than
computing the gradient using the entire dataset
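
A sketch of this per-example update on a toy least-squares problem: each step uses the gradient of the loss on a single training example only (the learning rate and number of epochs are illustrative).

import numpy as np

rng = np.random.default_rng(3)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
X = rng.standard_normal((1000, 5))
y = X @ true_w + 0.1 * rng.standard_normal(1000)

w = np.zeros(5)
alpha = 0.01

for epoch in range(5):
    for i in rng.permutation(len(y)):          # visit the examples in random order
        xi, yi = X[i], y[i]
        grad_i = 2.0 * (xi @ w - yi) * xi      # gradient of (x_i . w - y_i)^2 alone
        w = w - alpha * grad_i                 # one parameter update per example
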
Mini-batch gradient descent
• Mini-batch Gradient Descent is a variant of Gradient Descent that strikes a balance
between Batch Gradient Descent and Stochastic Gradient Descent (SGD). It splits the
training dataset into small, randomly selected batches and computes the gradient of the
loss function using only that mini-batch. It is widely used in training large models, such as
neural networks.

• Compute the gradient over batches of the training data. This batch is then used to
perform a parameter update. This is commonly referred to as minibatch gradient descent
(and in practice it is often simply called SGD).
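
A sketch of the minibatch version on the same kind of toy problem: sample a small random batch each iteration and update on its gradient. The batch size and learning rate are illustrative; a batch size of 1 recovers the pure per-example SGD above.

import numpy as np

rng = np.random.default_rng(4)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
X = rng.standard_normal((1000, 5))
y = X @ true_w + 0.1 * rng.standard_normal(1000)

w = np.zeros(5)
alpha, batch_size = 0.05, 32

for t in range(500):
    idx = rng.choice(len(y), size=batch_size, replace=False)   # random minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)             # gradient on the batch only
    w = w - alpha * grad
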
Some common optimization techniques
• SGD with Momentum
• RMSprop
• Adam (Adaptive Moment Estimation)
•…
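
As one example from this list, SGD with momentum keeps a running "velocity" of past gradients and moves along it instead of the raw gradient. A sketch on the same toy problem, with typical but illustrative coefficients:

import numpy as np

rng = np.random.default_rng(5)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
X = rng.standard_normal((1000, 5))
y = X @ true_w + 0.1 * rng.standard_normal(1000)

w = np.zeros(5)
v = np.zeros_like(w)                        # velocity: decaying sum of past gradients
mu, lr, batch_size = 0.9, 0.01, 32          # momentum coefficient and learning rate

for t in range(500):
    idx = rng.choice(len(y), size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    dw = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)
    v = mu * v - lr * dw                    # decay the old velocity, add the new gradient step
    w = w + v                               # move along the velocity, not the raw gradient
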
To sum up …
• The loss function represents an optimization landscape where we aim to reach the
bottom.

• Iterative refinement is employed to optimize the loss function by gradually adjusting
weights until the loss is minimized.

• The gradient of a function indicates the steepest ascent direction.

• Numerical gradients are simple but approximate and computationally expensive. Analytic
gradients are exact but require mathematical derivation, making them more error-prone.

• Setting the step size (learning rate) for parameter updates is crucial.

• The Gradient Descent algorithm was introduced as an iterative process that computes
gradients and updates parameters to minimize the loss.
Linear classifier for CIFAR10 image classification
• When using linear classification with the CIFAR-10 dataset, the goal is to classify
images into one of the 10 classes (airplane, automobile, bird, cat, deer, dog, frog,
horse, ship, truck) using a linear model.

• The model learns a single 10x3072 weight matrix (10x3073 with the bias folded in), i.e. one weight vector per class.
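
In code, the shapes look like this (random values stand in for learned weights and for a real image):

import numpy as np

rng = np.random.default_rng(6)
W = rng.standard_normal((10, 3072)) * 0.001   # one 3072-dimensional weight vector per class
b = np.zeros(10)                              # one bias per class

x = rng.random(3072)                          # a flattened 32x32x3 CIFAR-10 image (toy values)
scores = W @ x + b                            # 10 class scores
print(scores.argmax())                        # index of the predicted class (0..9)
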
Example trained weights of
linear classifier on CIFAR-10
• The core takeaway from this is that the ability to compute the
gradient of a loss function with respect to its weights (and have some
intuitive understanding of it) is the most important skill needed to
design, train and understand neural networks.
Regularization
• When training machine learning models, especially complex ones like deep neural
networks, the model may fit the training data too well.

• This can lead to overfitting, where the model learns not only the underlying
patterns but also the noise and specific quirks of the training data.

• As a result, the model performs poorly on new, unseen data.

• Regularization helps by penalizing overly complex models, encouraging the model
to find simpler patterns that generalize better.
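
One common concrete choice is L2 regularization, sketched below: a penalty equal to the regularization strength times the sum of squared weights is added on top of the data loss (the numbers here are placeholders, not values from the slides).

import numpy as np

rng = np.random.default_rng(7)
W = rng.standard_normal((10, 3072)) * 0.001

data_loss = 1.7                   # placeholder for the average loss over the training data
reg = 1e-3                        # regularization strength (a hyperparameter)

reg_loss = reg * np.sum(W * W)    # L2 penalty: grows when the weights become large
loss = data_loss + reg_loss       # the full objective that is minimized
print(loss)
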
Regularization
• The regularization strength is a hyperparameter.

Regularization intuition: toy example training data
(slides with figures only; the toy-example and regularization figures are omitted here)
Recap
Web demo for a linear classifier

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
Linear classifier for image classification
• Taking the raw pixels and feeding them to the linear classifier is not such a great idea
given the high dimensionality of the problem.

• What was common before the dominance of neural networks and deep learning was a
two-stage approach.

• First, you would take the image and compute different feature representations,
such as SIFT, SURF, HOG, or bag of words.

• Then those features, rather than the entire image, are given to the classifier.
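
A sketch of such a two-stage pipeline using HOG features from scikit-image (the feature parameters and the toy image are illustrative; any of the descriptors above could take HOG's place):

import numpy as np
from skimage.feature import hog

rng = np.random.default_rng(8)
image = rng.random((32, 32))                           # stand-in for a grayscale CIFAR-10 image

# Stage 1: compute a feature representation instead of using the raw pixels.
features = hog(image, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Stage 2: feed the (much lower-dimensional) feature vector to the linear classifier.
W = rng.standard_normal((10, features.shape[0])) * 0.001
b = np.zeros(10)
scores = W @ features + b
print(features.shape, scores.argmax())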
