AI & ML Unit 5 Notes

Unit – V

Neuron:

• A neuron is a cell in the brain whose principal function is the collection, processing
and dissemination of electric signals.

Neural Networks:

• The brain’s information-processing capacity is thought to emerge primarily from


networks of such neurons.
• For this reason, some of the earliest AI work aimed to create artificial neural
networks.

Perceptron:

• A network with all the inputs connected directly to the outputs is called a single-layer neural network or a perceptron.
• It is the basic processing element.
• It has inputs that may come from the environment or may be the outputs of other perceptrons.
• The perceptron model is also treated as one of the best and simplest types of artificial neural networks.

• Input Nodes: This is the primary component of the perceptron, which accepts the initial data into the system.
• Weight: It represents the strength of the connection between units. The weight is directly proportional to the strength of the associated input neuron in deciding the output.
• Activation Function: These are the final important components that help to
determine whether the neuron will fire or not.
Types of activation function:
(i) Sign function
(ii) Step function
(iii) Sigmoid function
• The output of the perceptron can be written as a dot product: Y = W^T x.
• Each perceptron is a local function of its inputs and synaptic weights.

Sigmoid function:

• It is a function whose graph is 'S'-shaped.
• Equation: A = 1/(1 + e^(-x))
• Nature: Non-linear
• Value range: 0 to 1

Perceptron Function:

• The perceptron function f(x) is obtained by multiplying the input x with the learned weight coefficients w, adding the bias b, and thresholding the result.
• It can be expressed as f(x) = 1 if w·x + b > 0; otherwise f(x) = 0.
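A minimal Python sketch of this decision rule (the weights, bias and input values below are illustrative, not taken from the notes):

import numpy as np

def perceptron_output(x, w, b):
    # Fires (returns 1) when the weighted sum w.x + b is positive, otherwise returns 0
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical example: two inputs, weights 0.5 and -0.4, bias 0.1
print(perceptron_output(np.array([1.0, 0.3]), np.array([0.5, -0.4]), 0.1))   # -> 1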

Characteristics of perceptron:

• It is a machine learning algorithm for supervised learning of binary classifiers.


• The weight co-efficient is automatically learned.
• Initially the weights are multiplied with input features, and the decision is made
whether the neuron is fired or not.

Single-Layer Perceptron:

• This is one of the simplest types of artificial neural network.

• A single-layer perceptron model consists of a feed-forward network and includes a threshold transfer function inside the model.
• The main objective of the single-layer perceptron model is to classify linearly separable objects with binary outcomes.
• A single-layer perceptron algorithm does not use previously recorded data, so it begins with randomly allocated weight parameters.

Multi-Layer Perceptron:

• The multi-layer perceptron is also known as MLP.

• It consists of fully connected dense layers, which transform any input dimension to the desired dimension.
• A multi-layer perceptron is a neural network that has multiple layers.
• To create a neural network we combine neurons together so that the outputs of some neurons are inputs of other neurons.

• In the multi-layer perceptron diagram, we can see that there are three inputs and
thus three input nodes and the hidden layer has three nodes.
• The output layer gives two outputs, therefore there are two output nodes.
• Every node in the multi-layer perceptron uses a sigmoid activation function. The sigmoid activation function takes real values as input and converts them to numbers between 0 and 1 using the sigmoid formula σ(x) = 1/(1 + exp(-x)).
• The multi-layer perceptron model is commonly trained with the backpropagation algorithm, which executes in two stages as follows:
✓ Forward Stage: Activation functions start from the input layer in the forward
stage and terminate on the output layer.
✓ Backward Stage: In the backward stage, weight and bias values are modified
as per the model's requirement.
• The neural network has neurons that work in correspondence with weight, bias,
and their respective activation function. In a neural network, we would update the
weights and biases of the neurons on the basis of the error at the output. This
process is known as backpropagation.
• Two Types of Backpropagation Networks are:
1. Static Back-propagation
2. Recurrent Backpropagation
Static Back-propagation: It is one kind of backpropagation network which
produces a mapping of static input for static output.
Recurrent Back-propagation: the activations are fed forward until a fixed value is achieved.
• x_j, j = 0, …, d are the inputs and z_h, h = 1, …, H are the hidden units, where H is the dimensionality of this hidden space. z_0 is the bias of the hidden layer. y_i, i = 1, …, K are the output units. w_hj are the weights in the first layer, and v_ih are the weights in the second layer.
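A minimal NumPy sketch of this forward pass using the notation above (the layer sizes, random weights and input values are illustrative):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

d, H, K = 3, 3, 2                               # inputs, hidden units, outputs (illustrative sizes)
rng = np.random.default_rng(0)
W = rng.normal(size=(H, d + 1))                 # w_hj: first-layer weights (j = 0 is the bias input x0 = 1)
V = rng.normal(size=(K, H + 1))                 # v_ih: second-layer weights (h = 0 is the hidden bias z0 = 1)

x = np.array([1.0, 0.5, -0.2, 0.8])             # input vector with x0 = 1 prepended
z = np.concatenate(([1.0], sigmoid(W @ x)))     # hidden units z_h, with z0 = 1
y = sigmoid(V @ z)                              # output units y_i
print(y)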
Advantages:
✓ It can be used to solve complex non-linear problems.
✓ It handles large amounts of input data well.
✓ It makes quick predictions after training.
✓ It works well with both small and large input data.

Disadvantages:

✓ Training is time consuming.
✓ Performance depends on the quality of the training data.

Activation Function:

• In an artificial neural network, the function which takes the incoming signals as
input and produces the output signal is known as the activation function.
• The activation functions are:
✓ ReLU Function
✓ Sigmoid Function
✓ Linear Function
✓ Tanh Function
✓ Softmax Function

ReLU Function:

• It stands for Rectified Linear Unit.


• It is the most widely used activation function.
• Chiefly implemented in hidden layers of neural network.
• Equation: A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
• Value Range: [0, inf)
• Nature: Non-linear
• Uses: ReLU is less computationally expensive than tanh and sigmoid.
• It learns much faster than the sigmoid and tanh functions.

Sigmoid Function:

• It is a function whose graph is 'S'-shaped.
• Equation: A = 1/(1 + e^(-x))
• Nature: Non-linear
• Value range: 0 to 1
• Uses: Usually used in output layer of binary classification.

Linear Function:

• The linear function has the equation of a straight line, i.e. y = x.

• No matter how many layers we have, if all of them are linear in nature, the final activation of the last layer is just a linear function of the input of the first layer.
• Range: -inf to +inf
• Uses: The linear activation function is used in just one place, i.e. the output layer.

Tanh Function:

• The activation that almost always works better than the sigmoid function is the tanh function, also known as the hyperbolic tangent function.
• Equation: f(x) = tanh(x) = 2/(1 + e^(-2x)) − 1
• Value Range: -1 to +1
• Nature: Non-linear
• Uses: Usually used in hidden layers of a neural network, as its values lie between -1 and 1.

Softmax Function

• It can be seen as a generalization of the sigmoid function; the softmax function comes in handy when dealing with multiclass classification problems.
• Used frequently when managing several classes.
• The softmax function divides each output by the sum of all outputs, squeezing the output for each category between 0 and 1 (and making them sum to 1).
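A minimal NumPy sketch of these activation functions (the input values are illustrative):

import numpy as np

def relu(x):
    return np.maximum(0, x)                  # A(x) = max(0, x), range [0, inf)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))          # 1 / (1 + e^-x), range (0, 1)

def linear(x):
    return x                                 # y = x, range (-inf, inf)

def tanh(x):
    return np.tanh(x)                        # range (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))                # subtract the max for numerical stability
    return e / e.sum()                       # outputs lie in (0, 1) and sum to 1

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), sigmoid(x), tanh(x), softmax(x))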

Network Training:

Training Set:

• Training set is a set of pairs of input patterns with corresponding desired output
patterns.
• Each pair represents how the network is supposed to respond to a particular input.
• The network is trained to respond correctly to each input pattern from the training
set.

Test Set:

• The test set is the dataset used to evaluate the trained model; it contains input-output pairs that were not used during training.

Steps to train a neural model:

Step 1: First an ANN will require a random weight initialization.

Step 2: Split the dataset in batches (batch size)

Step 3: Send the batches 1 by 1 to the GPU

Step 4: Calculate the forward pass (what would be the output with the current weights)

Step 5: Compare the calculated output to the expected output (loss)

Step 6: Adjust the weights (using the learning rate increment or decrement) according to
the backward pass (backward gradient propagation)

Step 7: Go back to step 2
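A minimal sketch of these steps using PyTorch (the notes do not name a framework; the model, synthetic data, batch size and learning rate are illustrative assumptions):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X, y = torch.randn(256, 4), torch.randint(0, 2, (256,)).float()        # synthetic data
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # Step 2: split into batches

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))     # Step 1: random weight initialization
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(5):                       # Step 7: go back and repeat
    for xb, yb in loader:                    # Step 3: take the batches one by one (move to GPU here if available)
        out = model(xb).squeeze(1)           # Step 4: forward pass with the current weights
        loss = loss_fn(out, yb)              # Step 5: compare calculated output to expected output (loss)
        optimizer.zero_grad()
        loss.backward()                      # Step 6: backward gradient propagation
        optimizer.step()                     #         adjust the weights using the learning rate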

Gradient Descent Optimization:

• Gradient Descent is a generic optimization algorithm capable of finding optimal


solutions to a wide range of problems.
• The general idea is to tweak parameters iteratively in order to minimize the cost
function.
• An important parameter of Gradient Descent (GD) is the size of the steps, determined by the learning rate hyperparameter.
Types of Gradient Descent:
✓ Batch Gradient Descent
✓ Stochastic Gradient Descent
✓ Mini-batch Gradient Descent

Batch Gradient Descent:

• Batch Gradient Descent involves calculations over the full training set at
each step as a result of which it is very slow on very large training data.

Stochastic Gradient Descent:

• In SGD, only one training example is used to compute the gradient and
update the parameters at each iteration.

Mini-batch Gradient Descent:

• In mini-batch gradient descent, a small batch of training examples is used to


compute the gradient and update the parameters at each iteration.

Stochastic Gradient Descent:

• In Stochastic Gradient Descent, a few samples are selected randomly instead of


the whole data set for each iteration.
• In Gradient Descent, there is a term called “batch” which denotes the total number
of samples from a dataset that is used for calculating the gradient for each
iteration.
• In SGD, only one training example, i.e., a batch size of one, is used to compute the gradient and update the parameters at each iteration; this addresses the cost of computing the gradient over the whole dataset at every step.
• SGD Algorithm: at each iteration, a random training example is chosen, the gradient of the loss on that example is computed, and the parameters are updated (a minimal sketch follows this list).

• SGD is generally noisier than typical Gradient Descent and usually takes a higher number of iterations to reach the minimum, because of the randomness of its descent.
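A minimal NumPy sketch of the SGD update for a linear model with squared error (the synthetic data and learning rate are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=100)   # synthetic targets

w, lr = np.zeros(2), 0.01
for epoch in range(20):
    for i in rng.permutation(len(X)):        # visit the samples in random order
        grad = (X[i] @ w - y[i]) * X[i]      # gradient of the squared error on one sample
        w -= lr * grad                       # update using a single example (batch size of one)
print(w)                                     # approaches [2, -1]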

Advantages:
• Speed: SGD is faster than other variants of Gradient Descent.
• Memory Efficiency: It is memory-efficient and can handle large datasets that
cannot fit into memory.

Disadvantages:

• Noisy updates: The updates in SGD are noisy and have a high variance
• Slow Convergence: SGD may require more iterations to converge to the
minimum
• Less accurate
Error Backpropagation:

• Backpropagation is one of the important concepts of a neural network.

• The backpropagation algorithm calculates the gradient of the error function with respect to the weights for a single training example.
• The main feature of backpropagation is the iterative, recursive and efficient method through which it calculates the updated weights to improve the network until it is able to perform the task for which it is being trained.
• The Back propagation algorithm in neural network computes the gradient of the
loss function for a single weight by the chain rule.
• It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used.

1. Inputs X, arrive through the preconnected path

2. Input is modeled using real weights W. The weights are usually randomly
selected.

3. Calculate the output for every neuron from the input layer, to the hidden layers,
to the output layer.

4. Calculate the error in the outputs: Error = Actual Output − Desired Output

5. Travel back from the output layer to the hidden layer to adjust the weights such
that the error is decreased.

6. Keep repeating the process until the desired output is achieved.
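A minimal NumPy sketch of these steps for a single training example and one hidden layer (the sizes, random weights, input and learning rate are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))   # Step 2: randomly selected weights
x, target = np.array([0.5, -0.2]), np.array([1.0])          # Step 1: input and desired output

for _ in range(100):                         # Step 6: keep repeating the process
    z = sigmoid(W1 @ x)                      # Step 3: output of the hidden layer
    y = sigmoid(W2 @ z)                      #         output of the output layer
    error = y - target                       # Step 4: error = actual output - desired output
    delta_out = error * y * (1 - y)          # Step 5: travel back, chain rule at the output layer
    delta_hid = (W2.T @ delta_out) * z * (1 - z)             # chain rule at the hidden layer
    W2 -= 0.5 * np.outer(delta_out, z)       # adjust the weights so that the error decreases
    W1 -= 0.5 * np.outer(delta_hid, x)
print(y)                                     # approaches the desired output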


Types of Backpropagation Networks:

• Static Back-Propagation
• Recurrent Back-Propagation

Static Back-Propagation

• It is one kind of backpropagation network which produces a mapping of a


static input for static output. It is useful to solve static classification.

Recurrent Back-Propagation:

• In recurrent backpropagation, the activations are fed forward until a fixed value is achieved.

Advantages:

• It does not have any parameters to tune except for the number of inputs.
• It is a standard process that usually works well.

Disadvantages:

• Backpropagation needs a very large amount of time for training.


• Backpropagation uses a matrix-based approach rather than a mini-batch approach.

Unit Saturation:

• The vanishing gradient problem is an issue that sometimes arises when training
machine learning algorithms through gradient descent.
• This most often occurs in neural networks that have several neuronal layers such
as in a deep learning system, but also occurs in recurrent neural networks.
• The key point is that the partial derivatives used to compute the gradient become progressively smaller as one goes deeper into the network.
• Since the gradients control how much the network learns during training, if the gradients are very small or zero, then little to no training can take place, leading to poor predictive performance.
The Problem:

• As more layers using certain activation functions are added to a neural network, the gradients of the loss function approach zero, making the network hard to train.

Why:

• Certain activation functions, like the sigmoid function, squash a large input space into a small output range between 0 and 1.

The sigmoid function and its derivative:

Why it’s significant:

• For a shallow network with only a few layers that use these activations, this isn't a big problem. However, when more layers are used, it can cause the gradient to be too small for training to work effectively.
• Gradients of neural networks are found using backpropagation. Simply put,
backpropagation finds the derivatives of the network by moving layer by layer
from the final layer to the initial one.
• By the chain rule, the derivatives of each layer are multiplied down the network.
• However, when n hidden layers use an activation like the sigmoid function, n small derivatives are multiplied together.
• Thus, the gradient decreases exponentially as we propagate down to the initial
layers.
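A minimal numerical sketch of this shrinkage: the sigmoid derivative is at most 0.25, so even in the best case n such factors multiply to about 0.25^n (the depths below are illustrative):

import numpy as np

def sigmoid_derivative(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1 - s)                       # at most 0.25, reached at a = 0

for n in (2, 5, 10, 20):                     # number of hidden layers
    # best case: every layer sits at the steepest point of the sigmoid
    print(n, sigmoid_derivative(0.0) ** n)   # 0.0625, ~9.8e-4, ~9.5e-7, ~9.1e-13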

Solution:

• The simplest solution is to use other activation functions, such as ReLU, which
doesn't cause a small derivative.
• Another solution is the residual connection, which directly adds the value at the beginning of the block, x, to the end of the block (F(x) + x).
• This residual connection doesn't go through activation functions that
"squashes" the derivatives, resulting in a higher overall derivative of the block.
• Finally, batch normalization layers can also resolve the issue.

ReLU:

• It stands for Rectified Linear Unit.


• It is the most widely used activation function.
• Chiefly implemented in hidden layers of neural network.

• Equation: A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.


• Value Range: [0, inf)
• Nature: Non-linear
• Uses: ReLu is less computationally expensive than tanh and sigmoid.
• It learns much faster than sigmoid and Tanh function.
• An activation function for hidden units that has become popular recently with deep networks is the rectified linear unit (ReLU), which is defined as ReLU(a) = max(0, a).

• Leaky ReLU: a variant in which the output is also linear (with a small slope) on the negative side, e.g. f(a) = a if a > 0 and f(a) = αa otherwise, for some small α.

Advantages:

• Sparse representations lead to faster training.

• It does not saturate.

Disadvantages:

• The derivative is zero for a ≤ 0, so there is no further training if, for a hidden unit, the weighted sum somehow becomes negative.

Hyperparameter Tuning:

• A Machine Learning model is defined as a mathematical model with a number of


parameters that need to be learned from the data.
• By training a model with existing data, we are able to fit the model parameters.
• However, there is another kind of parameter, known as Hyperparameters, that
cannot be directly learned from the regular training process.
• They are usually fixed before the actual training process begins.
• Some examples of model hyperparameters include:
1. The penalty in Logistic Regression Classifier i.e. L1 or L2 regularization
2. The learning rate for training a neural network.
3. The C and sigma hyperparameters for support vector machines.
4. The k in k-nearest neighbors
• The two best strategies for Hyperparameter tuning are: 1. GridSearchCV 2.
RandomizedSearchCV

GridSearchCV:

• In GridSearchCV approach, the machine learning model is evaluated for a range of


hyperparameter values.
• This approach is called GridSearchCV, because it searches for the best set of
hyperparameters from a grid of hyperparameters values.
• For example, suppose we want to set two hyperparameters, C and Alpha, of the Logistic Regression Classifier model, with different sets of values.
• For C = [0.1, 0.2, 0.3, 0.4, 0.5] and Alpha = [0.1, 0.2, 0.3, 0.4], every combination is evaluated; for the combination C = 0.3 and Alpha = 0.2, the performance score comes out to be 0.726 (the highest), therefore it is selected.

The following code sketch illustrates how to use GridSearchCV (the Iris dataset is used here purely for illustration):


# Necessary imports
import numpy as np
from sklearn.datasets import load_iris          # illustrative dataset
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Creating the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

# Fitting the grid search with 5-fold cross-validation
X, y = load_iris(return_X_y=True)
logreg_cv = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
logreg_cv.fit(X, y)

# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))

Output: Tuned Logistic Regression Parameters: {'C': 3.7275937203149381} Best score is 0.7708333333333334
Drawback:
GridSearchCV will go through all possible combinations of hyperparameters, which makes grid search computationally very expensive.

RandomizedSearchCV:

• RandomizedSearchCV solves the drawbacks of GridSearchCV, as it goes through


only a fixed number of hyperparameter settings.
• This approach reduces unnecessary computation.
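A minimal sketch of RandomizedSearchCV, mirroring the GridSearchCV example above (the dataset, parameter values and n_iter are illustrative assumptions):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)                        # illustrative dataset
param_dist = {'C': np.logspace(-5, 8, 15)}

# Only n_iter = 5 settings are sampled instead of trying every grid point
logreg_cv = RandomizedSearchCV(LogisticRegression(max_iter=1000), param_dist,
                               n_iter=5, cv=5, random_state=0)
logreg_cv.fit(X, y)
print(logreg_cv.best_params_, logreg_cv.best_score_)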

Batch Normalization:

Normalization:

• Normalization is a data pre-processing tool used to bring the numerical data to a


common scale without distorting its shape.

Batch Normalization:

• Batch normalization is a process to make neural networks faster and more stable
through adding extra layers in a deep neural network.
• The new layer performs the standardizing and normalizing operations on the input
of a layer coming from a previous layer.
• A typical neural network is trained using a collected set of input data called a batch.
• Just as the inputs can be normalized, a similar case can also be made for the hidden units, and this is the idea behind batch normalization.
• For each batch or minibatch, and for each hidden unit j, we calculate the mean m_j and standard deviation s_j of its values, and we first z-normalize: ẑ_j = (z_j − m_j) / s_j.
• We can then map these to have arbitrary mean γ_j and scale β_j, that is, z̃_j = γ_j + β_j ẑ_j, and then we apply the activation function.

• First, mj and sj are calculated anew for each batch, and we see immediately that
batch normalization is not meaningful with online learning or very small
minibatches.
• Second, γj and βj are parameters that are initialized and updated (after each batch
or minibatch) using gradient descent, just like the connection weights. So they
require extra memory and computation.
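A minimal NumPy sketch of these two steps for one minibatch (the minibatch values and the initial γ_j, β_j are illustrative; a small epsilon is added for numerical stability):

import numpy as np

Z = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))   # minibatch values of 4 hidden units

m_j = Z.mean(axis=0)                       # mean of each hidden unit over the batch
s_j = Z.std(axis=0)                        # standard deviation of each hidden unit
Z_hat = (Z - m_j) / (s_j + 1e-8)           # z-normalize

gamma = np.zeros(4)                        # gamma_j: new mean, updated by gradient descent like a weight
beta = np.ones(4)                          # beta_j: new scale, updated by gradient descent like a weight
Z_tilde = gamma + beta * Z_hat             # mapped values; the activation function is then applied to these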

Why Batch normalization?

• An internal covariate shift occurs when there is a change in the input distribution
to our network.
• When the input distribution changes, hidden layers try to learn to adapt to the
new distribution. This slows down the training process.

Advantages:

✓ Speed Up the Training


✓ Handles internal covariate shift
✓ The model is less sensitive to hyperparameter tuning.
Regularization:

• Regularization is one of the most important concepts of machine learning.


• It is a technique to prevent the model from overfitting by adding extra information
to it.
• Regularization helps choose a simple model rather than a complex one.
• Generalization error is "a measure of how accurately an algorithm can predict
outcome values for previously unseen data."
• Regularization refers to the modifications that can be made to a learning algorithm to help reduce this generalization error.

Overfitting:

• Overfitting means that the model is a good fit on the training data but generalizes poorly to unseen data.
• Overfitting is also a result of the model being too complex.
• In other words, in such a scenario the model has low bias and high variance and is too complex. This is called overfitting.

Commonly used regularization techniques:

• Hints
• Weight Decay
• Ridge Regression (or) L2 Regularization
• Lasso Regression (or) L1 Regularization.
• Dropout

Hints:

• Hints are properties of the target function that are known to us independent
of the training examples.

• For example, in image recognition, the identity of an object does not change when it is translated, rotated, or scaled.
• These are hints that can be incorporated into the learning process to make
learning easier.

Weight Decay:

• Incentivize the network to use smaller weights by adding a penalty to the


loss function.
• The idea in weight decay is to add a small constant background force that always pulls a weight toward zero.

Ridge regression:

• The Ridge regression technique is used to analyze models in which the independent variables may have multicollinearity.
• It shrinks the coefficients of insignificant independent variables but does not remove them completely. This type of regularization uses the L2 norm.

Lasso regression:

• Least Absolute Shrinkage and Selection Operator (LASSO) regression penalizes the coefficients to the extent that some of them become exactly zero.
• It thus eliminates insignificant independent variables. This regularization technique uses the L1 norm.
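A minimal sketch contrasting the two penalties when added to a squared-error loss (λ, the weights and the data arguments are illustrative assumptions):

import numpy as np

def ridge_loss(w, X, y, lam):
    # L2 (ridge / weight decay): penalizes squared magnitudes, pulling all weights toward zero
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def lasso_loss(w, X, y, lam):
    # L1 (lasso): penalizes absolute values, which can drive insignificant weights exactly to zero
    return np.mean((X @ w - y) ** 2) + lam * np.sum(np.abs(w))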

Dropout:

• "Dropout" in machine learning refers to the process of randomly ignoring


certain nodes in a layer during training.
• In a typical illustration of dropout, the network on the left shows all units activated, while on the right some units have been dropped out of the model: the values of their weights and biases are not considered during training.
• Dropout is used as a regularization technique - it prevents overfitting by
ensuring that no units are codependent.
• In dropout, we have a hyperparameter p, and we drop the input or hidden
unit with probability p, that is, set its output to zero, or keep it with probability 1 –
p.
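A minimal NumPy sketch of dropout during training (the layer values and p are illustrative; the rescaling by 1/(1 − p) is the common "inverted dropout" variant, an assumption not stated in the notes):

import numpy as np

def dropout(h, p, rng):
    # Drop each unit with probability p (set its output to zero) and keep it with probability 1 - p;
    # the kept units are rescaled by 1 / (1 - p) so the expected activation is unchanged.
    mask = rng.random(h.shape) >= p
    return h * mask / (1.0 - p)

rng = np.random.default_rng(0)
h = rng.normal(size=8)                       # outputs of a hidden layer during training
print(dropout(h, p=0.5, rng=rng))            # at test time all units are kept (no dropout)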

Difference between Shallow and Deep Neural Network:


Difference between Stochastic Gradient Descent and Gradient Descent:

Difference between Data Mining and Machine Learning:

Bias:

• Neural network bias can be defined as the constant which is added to the product
of features and weights
• It is used to offset the result.
• It helps the model shift the activation function towards the positive or negative side.

Feed Forward neural network:

• "The process of receiving an input to produce some kind of output to make some
kind of prediction is known as Feed Forward."
• The feed-forward neural network is the core of many other important neural networks, such as the convolutional neural network.
Artificial Neuron:

• An artificial neuron is a connection point in an artificial neural network


• Artificial neural networks, like the human body's biological neural network, have a layered architecture, and each network node is a connection point that processes inputs and passes its output on to other nodes.
