Machine Vision hw6
TEWODROS ASMAMAW
Loss Functions
In most learning networks, error is calculated as the difference between the actual output y and
the predicted output ŷ. The function used to compute this error is known as the loss function,
also called the cost function.
Loss functions are used to determine the error (aka “the loss”) between the output of our
algorithms and the given target value.
At its core, a loss function is incredibly simple: It’s a method of evaluating how well your
algorithm models your dataset. If your predictions are totally off, your loss function will output
a higher number. If they’re pretty good, it’ll output a lower number. As you change pieces of
your algorithm to try and improve your model, your loss function will tell you if you’re getting
anywhere.
Keras provides a number of built-in loss functions, including:
1. mean_squared_error
2. mean_absolute_error
3. mean_absolute_percentage_error
4. mean_squared_logarithmic_error
5. squared_hinge
6. hinge
7. categorical_hinge
8. logcosh
9. categorical_crossentropy
10. sparse_categorical_crossentropy
11. binary_crossentropy
12. kullback_leibler_divergence
13. poisson
mean_squared_error
Mean squared error (MSE) is one of the most common loss functions and the workhorse of basic
regression; it is widely used in linear regression as the performance measure, is easy to
understand and implement, and generally works pretty well. To calculate MSE, you take the
difference between your predictions and the ground truth, square it, and average it out across
the whole dataset:
MSE = (1/n) · Σᵢ (y(i) − ŷ(i))²
where y(i) is the actual expected output and ŷ(i) is the model’s prediction.
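As a quick illustration, here is a minimal NumPy sketch of MSE (the function name and the
example values are mine, not from any particular library):

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error: average of the squared differences
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

# Targets [3.0, -0.5, 2.0] vs. predictions [2.5, 0.0, 2.0]
print(mse([3.0, -0.5, 2.0], [2.5, 0.0, 2.0]))  # (0.25 + 0.25 + 0.0) / 3 ≈ 0.1667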
Likelihood loss
The likelihood function is also relatively simple and is commonly used in classification
problems. The function takes the predicted probabilities for the input examples and multiplies
them together. And although the output isn’t exactly human-interpretable, it’s useful for comparing
models.
For example, consider a model that outputs probabilities of [0.4, 0.6, 0.9, 0.1] for the ground
truth labels of [0, 1, 1, 0]. The likelihood loss would be computed as (0.6) * (0.6) * (0.9) * (0.9)
= 0.2916. Since the model outputs probabilities for TRUE (or 1) only, when the ground truth
label is 0 we take (1-p) as the probability. In other words, we multiply the model’s outputted
probabilities together for the actual outcomes.
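A short sketch of this computation (function and variable names are illustrative), reproducing
the worked example above:

import numpy as np

def likelihood(y_true, p_pred):
    # p_pred holds P(label == 1); use 1 - p when the true label is 0,
    # then multiply the per-example probabilities of the actual outcomes
    y_true, p_pred = np.asarray(y_true), np.asarray(p_pred)
    per_example = np.where(y_true == 1, p_pred, 1.0 - p_pred)
    return np.prod(per_example)

print(likelihood([0, 1, 1, 0], [0.4, 0.6, 0.9, 0.1]))  # 0.6*0.6*0.9*0.9 = 0.2916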
Log loss (binary cross-entropy)
This is actually exactly the same as the regular likelihood function, but with logarithms added in:
Loss = −(1/n) · Σᵢ [ y(i)·log(p(i)) + (1 − y(i))·log(1 − p(i)) ]
You can see that when the actual class is 1, the second half of the function disappears, and when
the actual class is 0, the first half drops. That way, we just end up summing the log of the
predicted probability for the ground truth class of each example.
The cool thing about the log loss function is that it has a kick: it penalizes heavily for
being very confident and very wrong. When the true label is 1, the loss skyrockets as the
predicted probability for the true class approaches 0.
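In code, binary cross-entropy can be sketched as follows; the clipping constant is a common
numerical safeguard, not part of the definition:

import numpy as np

def log_loss(y_true, p_pred, eps=1e-12):
    # Binary cross-entropy, averaged over examples
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p))

# Confident and wrong is punished hard:
print(log_loss([1], [0.9]))   # ≈ 0.105
print(log_loss([1], [0.01]))  # ≈ 4.605, the loss "skyrockets"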
mean_absolute_error
In statistics, mean absolute error (MAE) is a measure of the difference between two continuous
variables:
MAE = (1/n) · Σᵢ |yi − xi|
where yi is the actual value and xi is the predicted value.
mean_absolute_percentage_error
The mean absolute percentage error (MAPE), also known as mean absolute percentage
deviation (MAPD), is a measure of the prediction accuracy of a forecasting method in statistics,
for example in trend estimation; it is also used as a loss function for regression problems in
machine learning:
MAPE = (100/n) · Σᵢ |yi − xi| / |yi|
mean_squared_logarithmic_error
Mean squared logarithmic error (MSLE) applies the squared-error idea to log(1 + y) rather than
to y itself, so it measures relative rather than absolute differences and is less dominated by
large target values.
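The three losses above can be sketched in NumPy as follows (names are mine; library
implementations differ in details such as clipping):

import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error
    return np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)))

def mape(y_true, y_pred):
    # Mean absolute percentage error (assumes no zero targets)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def msle(y_true, y_pred):
    # Mean squared logarithmic error (assumes non-negative values)
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)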
Optimizers
We know that loss functions are used to understand how well or poorly our model performs on the
data provided to it. A loss function essentially aggregates the differences between the predicted
and actual values over the given training samples. To train a neural network to minimize its loss
and perform better, we need to tweak the weights and parameters associated with the model. This
is where optimizers play a crucial role: optimizers tie the loss function and model parameters
together by updating the model, i.e. the weights and biases of each node, based on the output of
the loss function.
In simpler terms, optimizers shape and mold your model into its most accurate possible form
by futzing with the weights. The loss function is the guide to the terrain, telling the optimizer
when it’s moving in the right or wrong direction.
Types of optimizers
Gradient Descent
Gradient descent is the simplest optimizer, and the template the others build on. It repeats
three steps:
1. Calculate what a small change in each individual weight would do to the loss function
2. Adjust each individual weight based on its gradient (i.e. take a small step in the
determined direction)
3. Keep doing steps #1 and #2 until the loss function gets as low as possible
The tricky part of this algorithm (and optimizers in general) is understanding gradients,
which represent what a small change in a weight or parameter would do to the loss
function.
Role of Gradient:
Gradient refers to the slope of the equation in general. Gradients are partial derivatives and can
be considered as the small change reflected in the loss function with respect to the small change
in weights or parameters of the function. This slight change tells us what to do next to reduce
the output of the loss function: reduce this weight by 0.02, increase that parameter by 0.005,
and so on, thereby making our model more accurate.
Learning rate is the size of the steps our algorithm takes to reach the global minimum. Taking
very large steps may jump over the global minimum, so the model never reaches the optimal value
of the loss function. On the other hand, taking very small steps will take forever to converge.
The effective step size also depends on the gradient value.
The gradient descent update rule is ϴj := ϴj − α · ∂J(ϴ)/∂ϴj. Here α is the learning rate, J is
the cost function, and ϴj is the parameter to be updated; the partial derivative of J with
respect to ϴj gives us the gradient. Note that, as we get closer to the global minimum, the
slope of the curve becomes less and less steep, which gives us a smaller derivative and in turn
reduces the step size automatically.
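As a tiny worked example of one update step (all numbers made up):

theta = 0.50   # current parameter value
grad = 0.20    # ∂J/∂ϴ at theta, assumed for illustration
alpha = 0.1    # learning rate

theta = theta - alpha * grad   # ϴ := ϴ − α·∂J/∂ϴ
print(theta)   # 0.48: the parameter moves against the gradient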
Instead of calculating the gradients for all of your training examples on every pass of gradient
descent, it’s sometimes more efficient to only use a subset of the training examples each
time. Stochastic Gradient Descent (SGD) does exactly this, using either mini-batches of examples
or a single random example on each pass, as in the sketch below.
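A minimal mini-batch SGD loop for linear regression on synthetic data (all names and
hyperparameter values here are illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 1.8 * x + 32 + rng.normal(0, 0.1, size=200)   # noisy line, like C -> F

m, b, alpha, batch = 0.0, 0.0, 0.1, 32
for step in range(500):
    idx = rng.choice(len(x), size=batch, replace=False)  # random mini-batch
    xb, yb = x[idx], y[idx]
    err = (m * xb + b) - yb
    # Gradients of MSE computed on the mini-batch only
    m -= alpha * 2 * np.mean(err * xb)
    b -= alpha * 2 * np.mean(err)

print(m, b)  # approaches ~1.8 and ~32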
Momentum
Momentum is like a ball rolling downhill. The ball will gain momentum as it rolls down the hill.
Momentum helps accelerate gradient descent (GD) when we have surfaces that curve more steeply
in one direction than in another.
For updating the weights it takes the gradient of the current step as well as the gradient of the
previous time steps. This helps us move faster towards convergence.
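One common formulation keeps a velocity vₜ = γ·vₜ₋₁ + α·∇J(ϴ) and then steps ϴ ← ϴ − vₜ. A
minimal loop on a toy one-dimensional objective (γ = 0.9 and α = 0.1 are typical assumed values;
J(ϴ) = ϴ² is just for illustration):

theta, velocity = 5.0, 0.0
gamma, alpha = 0.9, 0.1          # momentum coefficient and learning rate

for _ in range(200):
    grad = 2 * theta             # gradient of J(theta) = theta**2
    velocity = gamma * velocity + alpha * grad   # v accumulates past gradients
    theta -= velocity            # step by the velocity, not the raw gradient
print(theta)  # converges toward the minimum at 0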
Nesterov accelerated gradient (NAG)
Nesterov accelerated gradient is like a ball rolling down the hill, but one that knows exactly
when to slow down before the gradient of the hill increases again. We calculate the gradient not
with respect to the current step but with respect to the future step: we evaluate the gradient at
the looked-ahead position and update the weights based on it. NAG is like going down the hill
while being able to look ahead, which lets us optimize the descent faster. It works slightly
better than standard momentum.
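A sketch of the look-ahead update on the same kind of toy objective (the γ-velocity formulation
below is one common convention, not the only one):

theta, velocity = 5.0, 0.0
gamma, alpha = 0.9, 0.1

def grad(t):                 # gradient of J(theta) = theta**2, for illustration
    return 2 * t

for _ in range(200):
    lookahead = theta - gamma * velocity     # peek where momentum is taking us
    velocity = gamma * velocity + alpha * grad(lookahead)  # gradient at look-ahead
    theta -= velocity
print(theta)  # converges to 0, with less overshoot than plain momentum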
Advantages of NAG
It has an understanding of the future gradient. Thus, it can decrease momentum when the gradient
value is small or the slope is shallow, and increase momentum when the slope is steep.
Disadvantages of NAG
NAG is not adaptive with respect to the parameter importance. Thus, all parameters are updated
in a similar manner.
Adagrad
Adagrad adapts the learning rate specifically to individual features; that means that some of the
weights in your dataset will have different learning rates than others. This works really well for
sparse datasets, where many features are zero or rarely observed. Adagrad has a major issue though:
The adaptive learning rate tends to get really small over time. Some other optimizers below
seek to eliminate this problem.
AdaGrad keeps, for each parameter, the sum of the squares of its past gradients, Gₜ, and scales
the step accordingly:
ϴₜ₊₁ = ϴₜ − (α / √(Gₜ + ∈)) · ∇L(ϴₜ)
As you can see, AdaGrad uses the sum of the squares of past gradients to calculate the learning
rate for each parameter (∇L(ϴ) is the gradient, i.e. the partial derivative of the cost function
with respect to ϴ).
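A minimal per-parameter AdaGrad loop on a toy quadratic (α and the iteration count are assumed
illustrative values):

import numpy as np

theta = np.array([5.0, -3.0])
g_accum = np.zeros_like(theta)   # per-parameter sum of squared gradients
alpha, eps = 0.5, 1e-8

for _ in range(500):
    grad = 2 * theta                                  # gradient of J = sum(theta**2)
    g_accum += grad ** 2                              # G only ever grows
    theta -= alpha * grad / np.sqrt(g_accum + eps)    # per-parameter step size
print(theta)  # moves toward [0, 0]; the effective learning rate only ever shrinks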
Advantages of AdaGrad
Works great for datasets which have missing samples or are sparse in nature.
Disadvantages of AdaGrad
Learning can become very slow: according to the formula above, the sum of past squared gradients
keeps growing, so the division makes the effective learning rate shrink over time, and the pace
of learning decreases with it.
RMSprop
RMSprop is a variant of Adagrad. Instead of letting all of the past squared gradients accumulate,
it only accumulates gradients over a fixed, exponentially decaying window. In this it resembles
Adadelta (below), another optimizer that seeks to solve some of the issues that Adagrad leaves open.
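A sketch of the RMSprop update on a toy quadratic (ρ = 0.9 is the usual assumed decay for the
squared-gradient average; α is illustrative):

import numpy as np

theta = np.array([5.0, -3.0])
sq_avg = np.zeros_like(theta)        # decaying average of squared gradients
alpha, rho, eps = 0.05, 0.9, 1e-8    # rho sets the size of the "window"

for _ in range(500):
    grad = 2 * theta                                # gradient of J = sum(theta**2)
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2   # E[g**2] over a decaying window
    theta -= alpha * grad / (np.sqrt(sq_avg) + eps)
print(theta)  # approaches [0, 0]; the decaying window keeps the step from vanishing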
Advantages of RMS-Prop
AdaGrad decreases the learning rate with each time step, but RMS-Prop can adapt to an increase
or a decrease in the learning rate with each epoch.
Disadvantages of RMS-Prop
The base learning rate still has to be chosen manually, and a poor choice can still stall or
destabilize training.
Adadelta
• It does this by restricting the window of accumulated past gradients to some fixed size w:
the running average at time t then depends only on the previous average and the current
gradient.
• In Adadelta we do not need to set a default learning rate, since the step is the ratio of
the running average of past parameter updates to the running average of past gradients.
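A sketch of this update on a toy quadratic (ρ and ∈ are typical assumed values; note there is no
α anywhere):

import numpy as np

theta = np.array([5.0, -3.0])
sq_grad = np.zeros_like(theta)    # running average of squared gradients
sq_delta = np.zeros_like(theta)   # running average of squared parameter updates
rho, eps = 0.95, 1e-6

for _ in range(2000):
    grad = 2 * theta                                 # gradient of J = sum(theta**2)
    sq_grad = rho * sq_grad + (1 - rho) * grad ** 2
    delta = -np.sqrt(sq_delta + eps) / np.sqrt(sq_grad + eps) * grad  # ratio sets the step
    sq_delta = rho * sq_delta + (1 - rho) * delta ** 2
    theta += delta
print(theta)  # heads toward [0, 0] without a hand-tuned learning rate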
Adam
Adam stands for adaptive moment estimation and is another way of using past gradients to shape
the current update. Adam also utilizes the concept of momentum by adding fractions of previous
gradients to the current one. This optimizer has become pretty widespread and is practically the
default choice for training neural nets.
It’s easy to get lost in the complexity of some of these new optimizers. Just remember that they
all have the same goal: Minimizing our loss function. Even the most complex ways of doing
that are simple at their core. Like RMSProp, Adam adapts the learning rate using an exponential
moving average of the squared gradients (the second moment), but it additionally keeps an
exponential moving average of the gradients themselves (the first moment), which acts like
momentum. Thus, Adam can be considered a combination of momentum and RMSProp, with the
per-parameter adaptivity that originated in AdaGrad.
The Adam update keeps the two moving averages
mₜ = β1·mₜ₋₁ + (1 − β1)·gₜ and vₜ = β2·vₜ₋₁ + (1 − β2)·gₜ²,
corrects their startup bias (m̂ₜ = mₜ/(1 − β1ᵗ), v̂ₜ = vₜ/(1 − β2ᵗ)), and then steps
ϴ ← ϴ − α·m̂ₜ/(√v̂ₜ + ∈). In the above formulas, α is the learning rate, β1 (usually ~0.9) is the
exponential decay rate with respect to the first moments, β2 (usually ~0.999) is the exponential
decay rate with respect to the second moments, and ∈ is just a small value to avoid division by zero.
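The same update as a minimal loop on a toy quadratic (α and the iteration count are illustrative
choices):

import numpy as np

theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)   # first-moment (mean) estimate
v = np.zeros_like(theta)   # second-moment (uncentered variance) estimate
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 301):
    grad = 2 * theta                         # gradient of J = sum(theta**2)
    m = beta1 * m + (1 - beta1) * grad       # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2  # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction for the cold start
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
print(theta)  # converges toward [0, 0]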
Advantages of Adam
Adam optimizer is well suited for large datasets and is computationally efficient.
Disadvantages of Adam
There are few disadvantages. Adam tends to converge faster, but other algorithms like stochastic
gradient descent often generalize better. The best choice therefore depends on the type of data
being provided and on the speed/generalization trade-off.
Nadam
Nadam combines Adam with Nesterov momentum; it is employed for noisy gradients or for gradients
with high curvatures.
Python code
• Loss function — A way of measuring how far off predictions are from the desired
outcome. (The measured difference is called the "loss".)
• Optimizer function — A way of adjusting internal values in order to reduce the loss.
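All of the experiments below assume a minimal setup along these lines (the single-unit model and
the Celsius/Fahrenheit training data follow the standard Keras conversion example; the exact
values here are illustrative):

import numpy as np
import tensorflow as tf

# Training data: Celsius inputs and Fahrenheit targets (F = 1.8*C + 32)
celsius = np.array([-40, -10, 0, 8, 15, 22, 38], dtype=float)
fahrenheit = 1.8 * celsius + 32

# A single dense unit with one input: internally it computes y = m*x + b
model = tf.keras.Sequential([tf.keras.layers.Dense(units=1, input_shape=[1])])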
model.compile(loss='mean_squared_error',
optimizer=tf.keras.optimizers.Adam(0.1))
During training, the optimizer function is used to calculate adjustments to the model's internal
variables. The goal is to adjust the internal variables until the model (which is really a math
function) mirrors the actual equation for converting Celsius to Fahrenheit.
We'll use Matplotlib to visualize this (you could use another tool). As you can see, our model
improves very quickly at first, and then has a steady, slow improvement until it is very near
"perfect" towards the end.
model.compile(loss='mean_absolute_error',
optimizer=tf.keras.optimizers.Adadelta(0.1))
The correct answer is 100×1.8+32=212, so with this configuration our model is not doing well.
model.compile(loss='mean_absolute_percentage_error',
optimizer=tf.keras.optimizers.Adam(0.1))
The correct answer is 100×1.8+32=212, so our model is doing really well.
The first variable is close to ~1.8 and the second to ~32. These values (1.8 and 32) are the
actual variables in the real conversion formula.
This is really close to the values in the conversion formula. For a single neuron with a single
input and a single output, the internal math looks the same as the equation for a line,
y = mx + b, which has the same form as the conversion equation, f = 1.8c + 32. Since the form is
the same, the variables should converge on the standard values of 1.8 and 32, which is exactly
what happened.
model.compile(loss='mean_squared_logarithmic_error',
optimizer=tf.keras.optimizers.Adam(0.1))
The correct answer is 100×1.8+32=212, so our model is not doing that well.
Here the first variable is not close to ~1.8 and the second is not close to ~32, even though
these values (1.8 and 32) are the actual coefficients in the real conversion formula.
model.compile(loss='mean_squared_logarithmic_error',
optimizer=tf.keras.optimizers.RMSprop(0.1))
With MSLE and RMSprop, the recovered variables are again really close to the values in the
conversion formula: since the single neuron computes y = mx + b, they converge on roughly 1.8
and 32.
model.compile(loss='mean_absolute_error',
optimizer=tf.keras.optimizers.RMSprop(0.1))
The same holds for MAE with RMSprop: the variables converge close to the standard values of 1.8
and 32, which is exactly what happened here as well.