Machine Vision hw6
TEWODROS ASMAMAW
Loss Functions
In most learning networks, error is calculated as the difference between the actual output y and
the predicted output ŷ. The function used to compute this error is known as the loss function,
also called the cost function.
Loss functions are used to determine the error (aka “the loss”) between the output of our
algorithms and the given target value.
At its core, a loss function is incredibly simple: It’s a method of evaluating how well your
algorithm models your dataset. If your predictions are totally off, your loss function will output
a higher number. If they’re pretty good, it’ll output a lower number. As you change pieces of
your algorithm to try and improve your model, your loss function will tell you if you’re getting
anywhere.
Keras provides a number of built-in loss functions, including:
1. mean_squared_error
2. mean_absolute_error
3. mean_absolute_percentage_error
4. mean_squared_logarithmic_error
5. squared_hinge
6. hinge
7. categorical_hinge
8. logcosh
9. categorical_crossentropy
10. sparse_categorical_crossentropy
11. binary_crossentropy
12. kullback_leibler_divergence
13. poisson
mean_squared_error
Mean squared error (MSE) is one of the most common loss functions and the workhorse of basic
regression; it is widely used in linear regression as the performance measure, is easy to
understand and implement, and generally works pretty well. To calculate MSE, you take the
difference between your predictions and the ground truth, square it, and average it out across
the whole dataset:
MSE = (1/n) · Σᵢ (y(i) − ŷ(i))²
where y(i) is the actual expected output and ŷ(i) is the model’s prediction.
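As a quick illustration, here is a minimal NumPy sketch of MSE (the function name and the
example values are mine, not from any particular library):

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of the squared differences
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

# Targets [3.0, -0.5, 2.0] vs. predictions [2.5, 0.0, 2.0]
print(mse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # (0.25 + 0.25 + 0.0) / 3 ≈ 0.1667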
Likelihood loss
The likelihood function is also relatively simple and is commonly used in classification
problems. The function takes the predicted probabilities for the input examples and multiplies
them together. And although the output isn’t exactly human-interpretable, it’s useful for comparing
models.
For example, consider a model that outputs probabilities of [0.4, 0.6, 0.9, 0.1] for the ground
truth labels of [0, 1, 1, 0]. The likelihood loss would be computed as (0.6) * (0.6) * (0.9) * (0.9)
= 0.2916. Since the model outputs probabilities for TRUE (or 1) only, when the ground truth
label is 0 we take (1-p) as the probability. In other words, we multiply the model’s outputted
probabilities together for the actual outcomes.
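A short sketch of this computation (function and variable names are illustrative), reproducing
the worked example above:

import numpy as np

def likelihood(y_true, p_pred):
    # p_pred holds P(label == 1); use 1 - p when the true label is 0,
    # then multiply the per-example probabilities of the actual outcomes
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    per_example = np.where(y_true == 1, p_pred, 1.0 - p_pred)
    return np.prod(per_example)

print(likelihood([0, 1, 1, 0], [0.4, 0.6, 0.9, 0.1]))  # 0.6*0.6*0.9*0.9 = 0.2916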
Log loss (binary cross-entropy)
This is actually exactly the same as the regular likelihood function, but with logarithms added in:
Loss = −(1/n) · Σᵢ [ y(i)·log(p(i)) + (1 − y(i))·log(1 − p(i)) ]
You can see that when the actual class is 1, the second half of the function disappears, and when
the actual class is 0, the first half drops. That way, we just end up summing the log of the
predicted probability for the ground truth class of each example.
The cool thing about the log loss function is that it has a kick: it penalizes heavily for
being very confident and very wrong. When the true label is 1, the loss skyrockets as the
predicted probability for the true class approaches 0.
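In code, binary cross-entropy can be sketched as follows; the clipping constant is a common
numerical safeguard, not part of the definition:

import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy, averaged over examples
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Confident and wrong is punished hard:
print(log_loss([1], [0.9]))   # ≈ 0.105
print(log_loss([1], [0.01]))  # ≈ 4.605, the loss "skyrockets"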
mean_absolute_error
In statistics, mean absolute error (MAE) is a measure of the difference between two continuous
variables:
MAE = (1/n) · Σᵢ |yi − xi|
where yi is the actual value and xi is the predicted value.
mean_absolute_percentage_error
The mean absolute percentage error (MAPE), also known as mean absolute percentage
deviation (MAPD), is a measure of the prediction accuracy of a forecasting method in statistics,
for example in trend estimation; it is also used as a loss function for regression problems in
machine learning:
MAPE = (100/n) · Σᵢ |yi − xi| / |yi|
mean_squared_logarithmic_error
Mean squared logarithmic error (MSLE) applies the squared-error idea to log(1 + y) rather than
to y itself, so it measures relative rather than absolute differences and is less dominated by
large target values.
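The three losses above can be sketched in NumPy as follows (names are mine; library
implementations differ in details such as clipping):

import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error
    return np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)))

def mape(y_true, y_pred):
    # Mean absolute percentage error (assumes no zero targets)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def msle(y_true, y_pred):
    # Mean squared logarithmic error (assumes non-negative values)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)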
Optimizers
We know that loss functions are used to understand how well or poorly our model performs on the
data provided to it. A loss function essentially aggregates the differences between the predicted
and actual values over the given training samples. To train a neural network to minimize its loss
and perform better, we need to tweak the weights and parameters associated with the model. This
is where optimizers play a crucial role: optimizers tie the loss function and model parameters
together by updating the model, i.e. the weights and biases of each node, based on the output of
the loss function.
In simpler terms, optimizers shape and mold your model into its most accurate possible form
by futzing with the weights. The loss function is the guide to the terrain, telling the optimizer
when it’s moving in the right or wrong direction.
Types of optimizers
Gradient Descent
Gradient descent is the simplest optimizer, and the template the others build on. It repeats
three steps:
1. Calculate what a small change in each individual weight would do to the loss function
2. Adjust each individual weight based on its gradient (i.e. take a small step in the
determined direction)
3. Keep doing steps #1 and #2 until the loss function gets as low as possible
The tricky part of this algorithm (and optimizers in general) is understanding gradients,
which represent what a small change in a weight or parameter would do to the loss
function.
Role of Gradient:
Gradient refers to the slope of the equation in general. Gradients are partial derivatives and can
be considered as the small change reflected in the loss function with respect to the small change
in weights or parameters of the function. This slight change tells us what to do next to reduce
the output of the loss function: reduce this weight by 0.02, increase that parameter by 0.005,
and so on, thereby making our model more accurate.
Learning rate is the size of the steps our algorithm takes to reach the global minimum. Taking
very large steps may jump over the global minimum, so the model never reaches the optimal value
of the loss function. On the other hand, taking very small steps will take forever to converge.
The effective step size also depends on the gradient value.
The gradient descent update rule is ϴj := ϴj − α · ∂J(ϴ)/∂ϴj. Here α is the learning rate, J is
the cost function, and ϴj is the parameter to be updated; the partial derivative of J with
respect to ϴj gives us the gradient. Note that, as we get closer to the global minimum, the
slope of the curve becomes less and less steep, which gives us a smaller derivative and in turn
reduces the step size automatically.
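As a tiny worked example of one update step (all numbers made up):

theta = 0.50   # current parameter value
grad = 0.20    # ∂J/∂ϴ at theta, assumed for illustration
alpha = 0.1    # learning rate

theta = theta - alpha * grad   # ϴ := ϴ − α·∂J/∂ϴ
print(theta)   # 0.48: the parameter moves against the gradient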
Instead of calculating the gradients for all of your training examples on every pass of gradient
descent, it’s sometimes more efficient to only use a subset of the training examples each
time. Stochastic Gradient Descent (SGD) does exactly this, using either mini-batches of examples
or a single random example on each pass, as in the sketch below.
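A minimal mini-batch SGD loop for linear regression on synthetic data (all names and
hyperparameter values here are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.8 * x + 32 + rng.normal(0, 0.1, size=200)   # noisy line, like C -> F

m, b, alpha, batch = 0.0, 0.0, 0.1, 32
for step in range(500):
    idx = rng.choice(len(x), size=batch, replace=False)  # random mini-batch
    xb, yb = x[idx], y[idx]
    err = (m * xb + b) - yb
    # Gradients of MSE computed on the mini-batch only
    m -= alpha * 2 * np.mean(err * xb)
    b -= alpha * 2 * np.mean(err)

print(m, b)  # approaches ~1.8 and ~32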
Momentum
Momentum is like a ball rolling downhill. The ball will gain momentum as it rolls down the hill.
Momentum helps accelerate gradient descent (GD) when we have surfaces that curve more steeply
in one direction than in another.
For updating the weights it takes the gradient of the current step as well as the gradient of the
previous time steps. This helps us move faster towards convergence.
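One common formulation keeps a velocity vₜ = γ·vₜ₋₁ + α·∇J(ϴ) and then steps ϴ ← ϴ − vₜ. A
minimal loop on a toy one-dimensional objective (γ = 0.9 and α = 0.1 are typical assumed values;
J(ϴ) = ϴ² is just for illustration):

theta, velocity = 5.0, 0.0
gamma, alpha = 0.9, 0.1          # momentum coefficient and learning rate

for _ in range(200):
    grad = 2 * theta             # gradient of J(theta) = theta**2
    velocity = gamma * velocity + alpha * grad   # v accumulates past gradients
    theta -= velocity            # step by the velocity, not the raw gradient
print(theta)  # converges toward the minimum at 0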
Nesterov accelerated gradient (NAG)
Nesterov accelerated gradient is like a ball rolling down the hill, but one that knows exactly
when to slow down before the gradient of the hill increases again. We calculate the gradient not
with respect to the current step but with respect to the future step: we evaluate the gradient at
the looked-ahead position and update the weights based on it. NAG is like going down the hill
while being able to look ahead, which lets us optimize the descent faster. It works slightly
better than standard momentum.
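A sketch of the look-ahead update on the same kind of toy objective (the γ-velocity formulation
below is one common convention, not the only one):

theta, velocity = 5.0, 0.0
gamma, alpha = 0.9, 0.1

def grad(t):                 # gradient of J(theta) = theta**2, for illustration
    return 2 * t

for _ in range(200):
    lookahead = theta - gamma * velocity     # peek where momentum is taking us
    velocity = gamma * velocity + alpha * grad(lookahead)  # gradient at look-ahead
    theta -= velocity
print(theta)  # converges to 0, with less overshoot than plain momentum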
Advantages of NAG
It has an understanding of the future gradient. Thus, it can decrease momentum when the gradient
value is small or the slope is shallow, and increase momentum when the slope is steep.
Disadvantages of NAG
NAG is not adaptive with respect to the parameter importance. Thus, all parameters are updated
in a similar manner.
Adagrad
Adagrad adapts the learning rate specifically to individual features; that means that some of the
weights in your dataset will have different learning rates than others. This works really well for
sparse datasets, where many features are zero or rarely observed. Adagrad has a major issue though:
The adaptive learning rate tends to get really small over time. Some other optimizers below
seek to eliminate this problem.
AdaGrad keeps, for each parameter, the sum of the squares of its past gradients, Gₜ, and scales
the step accordingly:
ϴₜ₊₁ = ϴₜ − (α / √(Gₜ + ∈)) · ∇L(ϴₜ)
As you can see, AdaGrad uses the sum of the squares of past gradients to calculate the learning
rate for each parameter (∇L(ϴ) is the gradient, i.e. the partial derivative of the cost function
with respect to ϴ).
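A minimal per-parameter AdaGrad loop on a toy quadratic (α and the iteration count are assumed
illustrative values):

import numpy as np

theta = np.array([5.0, -3.0])
g_accum = np.zeros_like(theta)   # per-parameter sum of squared gradients
alpha, eps = 0.5, 1e-8

for _ in range(500):
    grad = 2 * theta                                  # gradient of J = sum(theta**2)
    g_accum += grad ** 2                              # G only ever grows
    theta -= alpha * grad / np.sqrt(g_accum + eps)    # per-parameter step size
print(theta)  # moves toward [0, 0]; the effective learning rate only ever shrinks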
Advantages of AdaGrad
Works great for datasets which have missing samples or are sparse in nature.
Disadvantages of AdaGrad
Learning can become very slow: according to the formula above, the sum of past squared gradients
keeps growing, so the division makes the effective learning rate shrink over time, and the pace
of learning decreases with it.
RMSprop
RMSprop is a variant of Adagrad. Instead of letting all of the past squared gradients accumulate,
it only accumulates gradients over a fixed, exponentially decaying window. In this it resembles
Adadelta (below), another optimizer that seeks to solve some of the issues that Adagrad leaves open.
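A sketch of the RMSprop update on a toy quadratic (ρ = 0.9 is the usual assumed decay for the
squared-gradient average; α is illustrative):

import numpy as np

theta = np.array([5.0, -3.0])
sq_avg = np.zeros_like(theta)        # decaying average of squared gradients
alpha, rho, eps = 0.05, 0.9, 1e-8    # rho sets the size of the "window"

for _ in range(500):
    grad = 2 * theta                                # gradient of J = sum(theta**2)
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2   # E[g**2] over a decaying window
    theta -= alpha * grad / (np.sqrt(sq_avg) + eps)
print(theta)  # approaches [0, 0]; the decaying window keeps the step from vanishing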
Advantages of RMS-Prop
AdaGrad decreases the learning rate with each time step, but RMS-Prop can adapt to an increase
or a decrease in the learning rate with each epoch.
Disadvantages of RMS-Prop
The base learning rate still has to be chosen manually, and a poor choice can still stall or
destabilize training.
Adadelta
• It does this by restricting the window of accumulated past gradients to some fixed size w:
the running average at time t then depends only on the previous average and the current
gradient.
• In Adadelta we do not need to set a default learning rate, since the step is the ratio of
the running average of past parameter updates to the running average of past gradients.
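A sketch of this update on a toy quadratic (ρ and ∈ are typical assumed values; note there is no
α anywhere):

import numpy as np

theta = np.array([5.0, -3.0])
sq_grad = np.zeros_like(theta)    # running average of squared gradients
sq_delta = np.zeros_like(theta)   # running average of squared parameter updates
rho, eps = 0.95, 1e-6

for _ in range(2000):
    grad = 2 * theta                                 # gradient of J = sum(theta**2)
    sq_grad = rho * sq_grad + (1 - rho) * grad ** 2
    delta = -np.sqrt(sq_delta + eps) / np.sqrt(sq_grad + eps) * grad  # ratio sets the step
    sq_delta = rho * sq_delta + (1 - rho) * delta ** 2
    theta += delta
print(theta)  # heads toward [0, 0] without a hand-tuned learning rate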
Adam
Adam stands for adaptive moment estimation and is another way of using past gradients to shape
the current update. Adam also utilizes the concept of momentum by adding fractions of previous
gradients to the current one. This optimizer has become pretty widespread and is practically the
default choice for training neural nets.
It’s easy to get lost in the complexity of some of these new optimizers. Just remember that they
all have the same goal: Minimizing our loss function. Even the most complex ways of doing
that are simple at their core. Like RMSProp, Adam adapts the learning rate using an exponential
moving average of the squared gradients (the second moment), but it additionally keeps an
exponential moving average of the gradients themselves (the first moment), which acts like
momentum. Thus, Adam can be considered a combination of momentum and RMSProp, with the
per-parameter adaptivity that originated in AdaGrad.
The Adam update keeps the two moving averages
mₜ = β1·mₜ₋₁ + (1 − β1)·gₜ and vₜ = β2·vₜ₋₁ + (1 − β2)·gₜ²,
corrects their startup bias (m̂ₜ = mₜ/(1 − β1ᵗ), v̂ₜ = vₜ/(1 − β2ᵗ)), and then steps
ϴ ← ϴ − α·m̂ₜ/(√v̂ₜ + ∈). In the above formulas, α is the learning rate, β1 (usually ~0.9) is the
exponential decay rate with respect to the first moments, β2 (usually ~0.999) is the exponential
decay rate with respect to the second moments, and ∈ is just a small value to avoid division by zero.
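The same update as a minimal loop on a toy quadratic (α and the iteration count are illustrative
choices):

import numpy as np

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)   # first-moment (mean) estimate
v = np.zeros_like(theta)   # second-moment (uncentered variance) estimate
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 301):
    grad = 2 * theta                         # gradient of J = sum(theta**2)
    m = beta1 * m + (1 - beta1) * grad       # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2  # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction for the cold start
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
print(theta)  # converges toward [0, 0]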
Advantages of Adam
Adam optimizer is well suited for large datasets and is computationally efficient.
Disadvantages of Adam
There are few disadvantages. Adam tends to converge faster, but other algorithms like stochastic
gradient descent often generalize better. The best choice therefore depends on the type of data
being provided and on the speed/generalization trade-off.
Nadam
Nadam combines Adam with Nesterov momentum; it is employed for noisy gradients or for gradients
with high curvatures.
Python code
• Loss function — A way of measuring how far off predictions are from the desired
outcome. (The measured difference is called the "loss".)
• Optimizer function — A way of adjusting internal values in order to reduce the loss.
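All of the experiments below assume a minimal setup along these lines (the single-unit model and
the Celsius/Fahrenheit training data follow the standard Keras conversion example; the exact
values here are illustrative):

import numpy as np
import tensorflow as tf

# Training data: Celsius inputs and Fahrenheit targets (F = 1.8*C + 32)
celsius = np.array([-40, -10, 0, 8, 15, 22, 38], dtype=float)
fahrenheit = 1.8 * celsius + 32

# A single dense unit with one input: internally it computes y = m*x + b
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])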
model.compile(loss='mean_squared_error',
optimizer=tf.keras.optimizers.Adam(0.1))
During training, the optimizer function is used to calculate adjustments to the model's internal
variables. The goal is to adjust the internal variables until the model (which is really a math
function) mirrors the actual equation for converting Celsius to Fahrenheit.
We'll use Matplotlib to visualize this (you could use another tool). As you can see, our model
improves very quickly at first, and then has a steady, slow improvement until it is very near
"perfect" towards the end.
model.compile(loss='mean_absolute_error',
optimizer=tf.keras.optimizers.Adadelta(0.1))
The correct answer is 100×1.8+32=212, so with this configuration our model is not doing well.
model.compile(loss='mean_absolute_percentage_error',
optimizer=tf.keras.optimizers.Adam(0.1))
The correct answer is 100×1.8+32=212, so our model is doing really well.
The first variable is close to ~1.8 and the second to ~32. These values (1.8 and 32) are the
actual variables in the real conversion formula.
This is really close to the values in the conversion formula. For a single neuron with a single
input and a single output, the internal math looks the same as the equation for a line,
y = mx + b, which has the same form as the conversion equation, f = 1.8c + 32. Since the form is
the same, the variables should converge on the standard values of 1.8 and 32, which is exactly
what happened.
model.compile(loss='mean_squared_logarithmic_error',
optimizer=tf.keras.optimizers.Adam(0.1))
The correct answer is 100×1.8+32=212, so our model is not doing that well.
Here the first variable is not close to ~1.8 and the second is not close to ~32, even though
these values (1.8 and 32) are the actual coefficients in the real conversion formula.
model.compile(loss='mean_squared_logarithmic_error',
optimizer=tf.keras.optimizers.RMSprop(0.1))
With MSLE and RMSprop, the recovered variables are again really close to the values in the
conversion formula: since the single neuron computes y = mx + b, they converge on roughly 1.8
and 32.
model.compile(loss='mean_absolute_error',
optimizer=tf.keras.optimizers.RMSprop(0.1))
The same holds for MAE with RMSprop: the variables converge close to the standard values of 1.8
and 32, which is exactly what happened here as well.