AI & ML Unit 5 Notes
Neuron:
• A neuron is a cell in the brain whose principal function is the collection, processing
and dissemination of electric signals.
Neural Networks:
Perceptron:
• A network with all the inputs connected directly to the outputs is called a Single
layer neural network or a perceptron.
• It is the basic processing element
• It has inputs that may come from the environment or may be the outputs of other
perceptrons.
• Perceptron model is also treated as one of the best and simplest types of artificial
neural networks.
• Input Nodes: This is the primary component of the perceptron, which accepts the initial data into the system.
• Weight: It represents the strength of the connection between units. The weight is directly proportional to the strength of the associated input neuron in deciding the output.
• Activation Function: These are the final important components that help to
determine whether the neuron will fire or not.
Types of activation function:
(i) Sign function
(ii) Step function
(iii) Sigmoid function
• The output of the perceptron can be written as a dot product: Y = Wᵀx
• Each perceptron is a local function of its inputs and synaptic weights.
Sigmoid function:
Perceptron Function:
• The perceptron function f(x) is obtained by multiplying the input x with the learned weight coefficient w and adding the bias b.
• It can be expressed as f(x) = 1 if w·x + b > 0; otherwise f(x) = 0.
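• A minimal sketch of this decision rule in NumPy; the weight vector, bias, and input values below are illustrative, not learned:

```python
import numpy as np

def perceptron(x, w, b):
    """Perceptron decision rule: f(x) = 1 if w.x + b > 0, else 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative (not learned) weights and bias
w = np.array([0.5, -0.6])
b = 0.1
print(perceptron(np.array([1.0, 0.2]), w, b))  # 1, since 0.5 - 0.12 + 0.1 > 0
```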
Characteristics of perceptron:
Single-Layer Perceptron:
Multi-Layer Perceptron:
• In the multi-layer perceptron diagram, we can see that there are three inputs and
thus three input nodes and the hidden layer has three nodes.
• The output layer gives two outputs, therefore there are two output nodes.
• Every node in the multi-layer perception uses a sigmoid activation function. The
sigmoid activation function takes real values as input and converts them to numbers
between 0 and 1 using the sigmoid formula σ(x) = 1/(1 + exp(-x)).
• The multi-layer perceptron is trained using the Backpropagation algorithm, which
executes in two stages as follows:
✓ Forward Stage: Activation functions start from the input layer in the forward
stage and terminate on the output layer.
✓ Backward Stage: In the backward stage, weight and bias values are modified
as per the model's requirement.
• The neural network has neurons that work in correspondence with weight, bias,
and their respective activation function. In a neural network, we would update the
weights and biases of the neurons on the basis of the error at the output. This
process is known as backpropagation.
• Two Types of Backpropagation Networks are:
1. Static Back-propagation
2. Recurrent Backpropagation
Static Back-propagation: It is one kind of backpropagation network which
produces a mapping of static input for static output.
Recurrent Back-propagation: the activations are fed forward until a fixed value is
achieved, after which the error is computed and propagated backward.
• x_j, j = 0, …, d are the inputs and z_h, h = 1, …, H are the hidden units, where H is
the dimensionality of this hidden space. z_0 is the bias of the hidden layer. y_i, i = 1,
…, K are the output units. w_hj are the weights in the first layer, and v_ih are the
weights in the second layer.
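• A minimal sketch of one forward pass through this two-layer network with sigmoid hidden and output units; the layer sizes, random weights, and input values are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
d, H, K = 3, 3, 2                     # inputs, hidden units, outputs (illustrative)
W = rng.normal(size=(H, d + 1))       # first-layer weights w_hj (bias column included)
V = rng.normal(size=(K, H + 1))       # second-layer weights v_ih (bias column included)

x = np.array([0.2, -0.4, 0.7])
x_aug = np.append(1.0, x)             # x_0 = 1 acts as the bias input
z = sigmoid(W @ x_aug)                # hidden units z_h
z_aug = np.append(1.0, z)             # z_0 = 1 is the hidden-layer bias
y = sigmoid(V @ z_aug)                # output units y_i
print(y)
```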
Advantages:
✓ It can be used to solve complex non-linear problems.
✓ It handles large amounts of input data well.
✓ It makes quick predictions after training.
✓ It works well with both small and large input data.
Disadvantages:
✓ Time consuming
✓ Depends on quality of training
Activation Function:
• In an artificial neural network, the function which takes the incoming signals as
input and produces the output signal is known as the activation function.
• The activation functions are:
✓ ReLU Function
✓ Sigmoid Function
✓ Linear Function
✓ Tanh Function
✓ Softmax Function
ReLU Function:
Sigmoid Function:
Linear Function:
Tanh Function:
• The activation that almost always works better than the sigmoid function is the Tanh
function, also known as the hyperbolic tangent function.
• Equation: f(x) = tanh(x) = 2/(1 + e^(-2x)) − 1
• Value Range: -1 to +1
• Nature: Non-linear
• Uses: Usually used in hidden layers of a neural network as it’s value lies between
-1 to 1.
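• A small NumPy check that the formula above matches the built-in tanh and stays within (-1, 1); the sample points are illustrative:

```python
import numpy as np

x = np.linspace(-3, 3, 7)
manual = 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0
print(np.allclose(manual, np.tanh(x)))  # True: same function
print(manual.min(), manual.max())       # values stay within (-1, 1)
```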
Softmax Function
• A generalization of the sigmoid function, the softmax function comes in handy when
dealing with multiclass classification problems.
• Used frequently when managing several classes.
• The softmax function divides each (exponentiated) output by the sum of the outputs,
squeezing the output for each category to a value between 0 and 1 so that the outputs sum to 1.
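• A minimal softmax sketch: exponentiate, then divide by the sum so the outputs lie between 0 and 1 and sum to 1. The logits are illustrative; subtracting the maximum is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtracting max(z) avoids overflow, result unchanged
    return e / e.sum()          # divide by the sum of the outputs

print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1
```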
Network Training:
Training Set:
• Training set is a set of pairs of input patterns with corresponding desired output
patterns.
• Each pair represents how the network is supposed to respond to a particular input.
• The network is trained to respond correctly to each input pattern from the training
set.
Test Set:
• The test set is the dataset on which the trained model is evaluated; it consists of
examples the model has not seen during training.
Step 4: Calculate the forward pass (what would be the output with the current weights)
Step 6: Adjust the weights (using the learning rate increment or decrement) according to
the backward pass (backward gradient propagation)
• Batch Gradient Descent involves calculations over the full training set at
each step as a result of which it is very slow on very large training data.
• In SGD, only one training example is used to compute the gradient and
update the parameters at each iteration.
• SGD is generally noisier than typical Gradient Descent; because of the randomness in
its descent, it usually takes a higher number of iterations to reach the minima.
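• A minimal sketch of the SGD update on a toy linear-regression problem: one randomly chosen training example per step, whereas Batch Gradient Descent would sum the gradient over the full training set. The data, learning rate, and number of steps are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)

w, lr = np.zeros(2), 0.01
for step in range(1000):
    i = rng.integers(len(X))      # pick ONE training example (this is the SGD part)
    err = X[i] @ w - y[i]
    w -= lr * err * X[i]          # noisy gradient step using that single example
print(w)                          # approximately [2, -1]
```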
Advantages:
• Speed: SGD is faster than other variants of Gradient Descent.
• Memory Efficiency: It is memory-efficient and can handle large datasets that
cannot fit into memory.
Disadvantages:
• Noisy updates: The updates in SGD are noisy and have a high variance
• Slow Convergence: SGD may require more iterations to converge to the
minimum
• Less accurate
Error Backpropagation:
2. Input is modeled using real weights W. The weights are usually randomly
selected.
3. Calculate the output for every neuron from the input layer, to the hidden layers,
to the output layer.
4. Calculate the error in the outputs: Error = Actual Output – Desired Output
5. Travel back from the output layer to the hidden layer to adjust the weights such
that the error is decreased.
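• A minimal sketch of steps 2–5 for a one-hidden-layer network with sigmoid units and squared error; the layer sizes, toy data, learning rate, and omission of bias terms are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
T = (X.sum(axis=1, keepdims=True) > 0).astype(float)   # toy desired outputs

W = rng.normal(size=(3, 4)) * 0.1   # input -> hidden weights, randomly selected
V = rng.normal(size=(4, 1)) * 0.1   # hidden -> output weights, randomly selected
lr = 0.5

for epoch in range(200):
    Z = sigmoid(X @ W)              # step 3: forward pass through the hidden layer
    Y = sigmoid(Z @ V)              # step 3: forward pass through the output layer
    err = Y - T                     # step 4: actual output minus desired output
    dV = Z.T @ (err * Y * (1 - Y))  # step 5: gradient at the output layer
    dW = X.T @ ((err * Y * (1 - Y)) @ V.T * Z * (1 - Z))  # step 5: gradient at the hidden layer
    V -= lr * dV / len(X)           # adjust weights so the error decreases
    W -= lr * dW / len(X)
print(np.mean(np.abs(Y - T)))       # mean absolute error after training
```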
• Static Back-Propagation
• Recurrent Back-Propagation
Static Back-Propagation
Recurrent Back-Propagation:
• In Recurrent Back-propagation, the activations are fed forward until a fixed value is
achieved, after which the error is computed and propagated backward.
Advantages:
• It does not have any parameters to tune except for the number of inputs.
• It is a standard process that usually works well.
Disadvantages:
Unit Saturation:
• The vanishing gradient problem is an issue that sometimes arises when training
machine learning algorithms through gradient descent.
• This most often occurs in neural networks that have several neuronal layers such
as in a deep learning system, but also occurs in recurrent neural networks.
• The key point is that the calculated partial derivatives used to compute the gradient
become smaller and smaller as one goes deeper into the network.
• Since the gradients control how much the network learns during training, if the
gradients are very small or zero, then little to no training can take place, leading to
poor predictive performance.
The Problem:
• As more layers using certain activation functions are added to neural networks,
the gradients of the loss function approach zero, making the network hard to
train.
Why:
• Certain activation functions, like the sigmoid function, squish a large input
space into a small output space between 0 and 1.
• For a shallow network with only a few layers that use these activations, this isn't
a big problem. However, when more layers are used, it can cause the gradient
to be too small for training to work effectively.
• Gradients of neural networks are found using backpropagation. Simply put,
backpropagation finds the derivatives of the network by moving layer by layer
from the final layer to the initial one.
• By the chain rule, the derivatives of each layer are multiplied down the network.
• However, when n hidden layers use an activation like the sigmoid function, n
small derivatives are multiplied together.
• Thus, the gradient decreases exponentially as we propagate down to the initial
layers.
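• A small numerical illustration of this chain-rule effect: the sigmoid derivative is at most 0.25, so multiplying n such derivatives shrinks the gradient exponentially. The depth of 20 layers and the input of 0 are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)             # maximum value is 0.25, reached at x = 0

n_layers = 20
grad = np.prod([sigmoid_deriv(0.0) for _ in range(n_layers)])
print(grad)                        # 0.25**20 ~ 9.1e-13: the gradient has vanished
```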
Solution:
• The simplest solution is to use other activation functions, such as ReLU, which
doesn't cause a small derivative.
• The residual connection directly adds the value at the beginning of the block,
x, to the end of the block (F(x) + x).
• This residual connection doesn't go through activation functions that
"squash" the derivatives, resulting in a higher overall derivative of the block.
• Finally, batch normalization layers can also resolve the issue.
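• A minimal sketch of the residual connection described above: the block output is F(x) + x, so the identity path carries a derivative of 1 even when F's gradient is tiny. Using a single sigmoid layer as F, with random weights, is purely illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1

def residual_block(x):
    F = sigmoid(x @ W)            # the "squashing" transformation F(x)
    return F + x                  # skip connection adds x back at the end of the block

x = rng.normal(size=(1, 4))
print(residual_block(x))
```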
ReLU:
• Leaky ReLU: In this variant, the output is also linear (with a small slope) on the negative side.
Advantages:
Disadvantages:
• Since the derivative is zero for a ≤ 0, there is no further training if, for a hidden unit,
the weighted sum somehow becomes negative.
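• A minimal sketch of ReLU and Leaky ReLU; the negative-side slope of 0.01 is a common illustrative choice that keeps a small nonzero derivative, so units with negative weighted sums can still learn:

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)               # derivative is 0 for a <= 0

def leaky_relu(a, alpha=0.01):
    return np.where(a > 0, a, alpha * a)    # small linear slope on the negative side

a = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(a), leaky_relu(a))
```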
Hyperparameter Tuning:
GridSearchCV:
RandomizedSearchCV:
Batch Normalization:
Normalization:
Batch Normalization:
• Batch normalization is a process to make neural networks faster and more stable
through adding extra layers in a deep neural network.
• The new layer performs the standardizing and normalizing operations on the input
of a layer coming from a previous layer.
• A typical neural network is trained using a collected set of input data called a batch.
• A similar case can also be made for the hidden units, and this is the idea behind
batch normalization.
• For each batch or minibatch, for each hidden unit j we calculate the mean m_j and
standard deviation s_j of its values, and we first z-normalize: ẑ_j = (z_j − m_j) / s_j.
• We can then map these to have an arbitrary mean and scale using learnable parameters
γ_j and β_j, that is, z̃_j = γ_j ẑ_j + β_j, and then we apply the activation function.
• First, m_j and s_j are calculated anew for each batch, and we see immediately that
batch normalization is not meaningful with online learning or very small
minibatches.
• Second, γ_j and β_j are parameters that are initialized and updated (after each batch
or minibatch) using gradient descent, just like the connection weights. So they
require extra memory and computation.
• An internal covariate shift occurs when there is a change in the input distribution
to our network.
• When the input distribution changes, hidden layers try to learn to adapt to the
new distribution. This slows down the training process.
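• A minimal sketch of the batch-normalization computation described above: z-normalize each hidden unit over the minibatch using m_j and s_j, then rescale with γ_j and β_j. The minibatch values are illustrative, γ and β are initialized to 1 and 0 (in practice they are updated by gradient descent), and the small eps is a standard numerical-stability addition:

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    m = H.mean(axis=0)                 # per-unit batch mean m_j
    s = H.std(axis=0)                  # per-unit batch standard deviation s_j
    H_hat = (H - m) / (s + eps)        # z-normalize
    return gamma * H_hat + beta        # map to learnable mean/scale

rng = np.random.default_rng(0)
H = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # minibatch of hidden-unit values
gamma, beta = np.ones(4), np.zeros(4)
print(batch_norm(H, gamma, beta).mean(axis=0))     # ~0 per unit after normalization
```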
Advantages:
Overfitting:
• Overfitting means that the model is a good fit on the training data but a poor fit on
new, unseen (test) data.
• Overfitting is also a result of the model being too complex
• In other words, in such a scenario, the model has low bias and high variance
and is too complex. This is called overfitting.
• Hints
• Weight Decay
• Ridge Regression (or) L2 Regularization
• Lasso Regression (or) L1 Regularization.
• Dropout
Hints:
• Hints are properties of the target function that are known to us independent
of the training examples.
• The identity of the object does not change when it is translated, rotated, or
scaled.
• These are hints that can be incorporated into the learning process to make
learning easier.
Weight Decay:
Ridge regression:
• The Ridge regression technique is used to analyze models where the
variables may exhibit multicollinearity.
• It shrinks the coefficients of insignificant independent variables, though it does not
remove them completely. This type of regularization uses the L₂ norm.
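• A minimal sketch of the L₂ (ridge) penalty: the regularized solution adds λ‖w‖² to the squared-error objective, shrinking coefficients toward zero without removing them. The toy data, λ value, and use of the standard closed-form ridge solution are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + 0.1 * rng.normal(size=50)

lam = 1.0                                   # regularization strength (illustrative)
d = X.shape[1]
# Closed-form ridge solution: w = (X^T X + lam * I)^-1 X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(w_ridge)                              # coefficients shrunk toward zero, not set to zero
```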
Lasso regression:
Dropout:
Bias:
• Neural network bias can be defined as the constant which is added to the product
of features and weights
• It is used to offset the result.
• It helps the models to shift the activation function towards the positive or
negative side.
• "The process of receiving an input to produce some kind of output to make some
kind of prediction is known as Feed Forward."
• Feed Forward neural network is the core of many other important neural
networks such as convolution neural network.
Artificial Neuron: