Unit 3 Deep Learning
Activation Function
In artificial neural networks, each neuron forms a weighted sum of its
inputs and passes the resulting scalar value through a function referred to
as an activation function.
Let’s consider the simple neural network model without any hidden layers.
Here is the output:
Y = ∑ (wi * xi) + bias
So, if the inputs are x1, x2, x3, ..., xn and the weights are w1, w2, w3, ..., wn,
this output can range from -infinity to +infinity. It is therefore necessary to
bound the output to get the desired prediction or generalized results.
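As a quick illustration, here is a minimal Python sketch (using NumPy, with made-up example values) of how a single neuron computes this weighted sum:

import numpy as np

# Example inputs and weights (illustrative values only)
x = np.array([0.5, -1.2, 3.0])    # inputs x1, x2, x3
w = np.array([0.8, 0.1, -0.4])    # weights w1, w2, w3
bias = 0.2

# Weighted sum: Y = sum(wi * xi) + bias
Y = np.dot(w, x) + bias
print(Y)    # unbounded value; an activation function is used to bound it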
The activation function compares the input value to a threshold value. If the
input value is greater than the threshold value, the neuron is activated. If the
input value is less than the threshold value, the neuron is deactivated and its
output is not passed on to the next (hidden) layer.
Mathematically, the binary step activation function can be represented as:
f(x) = 1 if x >= threshold, and f(x) = 0 if x < threshold.
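For concreteness, a minimal Python sketch of this binary step function (assuming a threshold of 0) might look like:

def binary_step(x, threshold=0.0):
    # Neuron fires (outputs 1) only when the input reaches the threshold
    return 1 if x >= threshold else 0

print(binary_step(2.5))   # 1 (activated)
print(binary_step(-0.7))  # 0 (deactivated)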
These activation functions are mainly divided on the basis of their ranges and curves.
The remainder of this article will outline the major non-linear activation
functions used in neural networks.
The activation that almost always works better than the sigmoid function is the
tanh function, also known as the hyperbolic tangent function. It is actually a
mathematically scaled and shifted version of the sigmoid function. The two are
similar and can be derived from each other.
Equation: - tanh(x) = (e^x - e^-x) / (e^x + e^-x)
Value Range: - -1 to +1
Nature: - non-linear
Uses: - Usually used in the hidden layers of a neural network, as its values lie
between -1 and 1, so the mean of the hidden layer's outputs comes out to be 0 or
very close to it. This helps in centring the data by bringing the mean close to 0,
which makes learning for the next layer much easier.
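A minimal Python sketch (using NumPy, with illustrative inputs) showing that tanh outputs stay in (-1, 1) and tend to be centred near 0:

import numpy as np

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])   # example pre-activations
y = np.tanh(x)                              # tanh(x) = (e^x - e^-x) / (e^x + e^-x)
print(y)          # values lie strictly between -1 and 1
print(y.mean())   # mean is close to 0 for inputs centred around 0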
As shown in the figure, ReLU is half-rectified (from the bottom): f(z) is zero when
z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
Now how does ReLU transform its input? It uses this simple formula:
f(x)=max(0,x)
The ReLU function and its derivative are both monotonic. The function returns 0 if it
receives any negative input, but for any positive value x it returns that value
back. Thus, it gives an output that ranges from 0 to infinity.
Now let us give some inputs to the ReLU activation function and see how it
transforms them and then we will plot them also.
def ReLU(x):
    # Return the input unchanged if it is positive, otherwise return 0
    if x > 0:
        return x
    else:
        return 0
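Continuing the example, here is a short sketch (assuming NumPy and matplotlib are available) that feeds some sample inputs through this ReLU and plots the result:

import numpy as np
import matplotlib.pyplot as plt

inputs = np.arange(-10, 11)              # sample inputs from -10 to 10
outputs = [ReLU(x) for x in inputs]      # negative inputs map to 0
print(list(zip(inputs, outputs)))

plt.plot(inputs, outputs)                # half-rectified shape of ReLU
plt.xlabel("input")
plt.ylabel("ReLU(input)")
plt.show()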
Range: [0, infinity)
But the issue is that all negative values become zero immediately, which
decreases the ability of the model to fit or train on the data properly.
That is, any negative input given to the ReLU activation function is turned
into zero immediately, which in turn affects the resulting graph by not
mapping the negative values appropriately.
ReLU stands for Rectified Linear Unit. It is the most widely used activation
function, chiefly implemented in the hidden layers of neural networks.
Equation :- A(x) = max(0,x). It gives an output x if x is positive and 0
otherwise.
Value Range :- [0, inf)
Nature :- non-linear, which means we can easily backpropagate the
errors and have multiple layers of neurons being activated by the ReLU
function.
Uses :- ReLU is less computationally expensive than tanh and sigmoid
because it involves simpler mathematical operations. At any given time only
a few neurons are activated, making the network sparse and therefore
efficient and easy to compute.
In simple words, ReLU learns much faster than the sigmoid and tanh
functions.
What is L1 regularization?
L1 regularization, also known as Lasso regularization, is a machine-
learning strategy that inhibits overfitting by introducing a penalty term
into the model's loss function based on the absolute values of the model's
parameters.
L1 regularization seeks to reduce some model parameters toward zero in
order to lower the number of non-zero parameters in the model.
L1 regularization is particularly useful when working with high-
dimensional data since it enables one to choose a subset of the most
important attributes.
This lessens the risk of overfitting and makes the model easier to
understand. The size of the penalty term is controlled by the
hyperparameter lambda, which regulates the strength of the L1
regularization.
As lambda rises, more parameters are driven to zero and the
regularization becomes stronger.
L1 regularization, also called Lasso regression (Least Absolute Shrinkage
and Selection Operator), adds the “absolute value of magnitude” of each
coefficient as a penalty term to the loss function.
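As a minimal sketch (assuming a simple linear model, with illustrative data and NumPy), the L1-regularized loss, i.e. the mean squared error plus lambda times the sum of absolute weights, can be computed like this:

import numpy as np

def l1_loss(X, y, w, lam):
    # Mean squared error plus lambda times the sum of absolute weights
    preds = X @ w
    mse = np.mean((y - preds) ** 2)
    return mse + lam * np.sum(np.abs(w))

# Illustrative values only
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25])
print(l1_loss(X, y, w, lam=0.1))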
What is L2 regularization?
L2 regularization, also known as Ridge regularization, is a machine
learning technique that avoids overfitting by introducing a penalty term
into the model's loss function based on the squares of the model's
parameters.
The goal of L2 regularization is to keep the model's parameters small
and prevent them from becoming too large.
To achieve L2 regularization, a term that is proportional to the
squares of the model's parameters is added to the loss function.
This term acts as a constraint on the parameters' size, preventing them
from growing out of control.
The size of the penalty term is again controlled by the hyperparameter
lambda, which sets the regularization's intensity. The greater the lambda,
the smaller the parameters and the stronger the regularization.
Ridge regression adds the “squared magnitude” of each coefficient as a penalty
term to the loss function:
Loss = ∑ (yi - ŷi)^2 + lambda * ∑ wj^2
Here the penalty term lambda * ∑ wj^2 represents the L2
regularization element.
Here, if lambda is zero then we get back ordinary least squares.
However, if lambda is very large then it will add too much weight and
lead to under-fitting. That said, it is important how lambda is
chosen. This technique works very well to avoid the over-fitting issue.
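As with L1, here is a minimal sketch (illustrative data and parameter values, using NumPy) of the L2-regularized (ridge) loss:

import numpy as np

def l2_loss(X, y, w, lam):
    # Mean squared error plus lambda times the sum of squared weights
    preds = X @ w
    mse = np.mean((y - preds) ** 2)
    return mse + lam * np.sum(w ** 2)

# Illustrative values only
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25])
print(l2_loss(X, y, w, lam=0.1))
print(l2_loss(X, y, w, lam=0.0))   # lambda = 0 recovers ordinary least squares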