Ad3451 Ml Unit 4 Notes

Uploaded by

nancy.louise


RAJALAKSHMI INSTITUTE OF TECHNOLOGY

UNIT IV NEURAL NETWORKS 9

Multilayer perceptron, activation functions, network training – gradient descent optimization –
stochastic gradient descent, error backpropagation, from shallow networks to deep networks – unit
saturation (aka the vanishing gradient problem) – ReLU, hyperparameter tuning, batch normalization,
regularization, dropout

4.1 Multi-layer Perceptron

A multi-layer perceptron (MLP) is a neural network architecture built from multiple layers of
perceptrons: an input layer, one or more hidden layers, and an output layer.

[Figure: multi-layer perceptron, showing the input layer and the output layer]

MLP networks are used for supervised learning. The typical learning algorithm for MLP networks
is the backpropagation algorithm.

A multilayer perceptron is a feedforward artificial neural network that generates a set of outputs
from a set of inputs. It is characterized by several layers of nodes connected as a directed graph
between the input and output layers. An MLP is trained with backpropagation, and an MLP with
several hidden layers is regarded as a deep learning method.

4.2 Activation Functions in Neural Networks

Elements of a Neural Network

Input Layer: This layer accepts the input features. It brings information from the outside world into
the network. No computation is performed at this layer; its nodes just pass the information (features)
on to the hidden layer.
Hidden Layer: Nodes of this layer are not exposed to the outer world; they are part of the abstraction
provided by any neural network. The hidden layer performs the computation on the features
entered through the input layer and transfers the result to the output layer.
Output Layer: This layer brings the information learned by the network to the outer world.
What is an activation function and why use one?
A neuron calculates the weighted sum of its inputs and adds a bias to it. The activation function is
applied to this value and decides whether, and how strongly, the neuron is activated. The purpose of
the activation function is to introduce non-linearity into the output of a neuron.
Explanation: A neural network has neurons that work with weights, biases, and their respective
activation functions. During training, we update the weights and biases of the neurons on the basis
of the error at the output. This process is known as back-propagation. Differentiable activation
functions make back-propagation possible, since their gradients are supplied along with the error to
update the weights and biases.
Why do we need a non-linear activation function?
A neural network without an activation function is essentially just a linear regression model. The
activation function applies a non-linear transformation to the input, making the network capable of
learning and performing more complex tasks.
Mathematical proof
Suppose we have a neural network like this:

[Figure: network with two inputs i1 and i2, two hidden neurons h1 and h2 connected through the weights w1 to w4, and one output neuron "out"]

Elements of the diagram are as follows:

Hidden layer i.e. layer 1:
z(1) = W(1)X + b(1)
a(1) = z(1)
Here,
• z(1) is the vectorized output of layer 1
• W(1) is the vectorized weights assigned to the neurons of the hidden layer, i.e. w1, w2, w3 and w4
• X is the vectorized input features, i.e. i1 and i2
• b(1) is the vectorized bias assigned to the neurons in the hidden layer, i.e. b1 and b2
• a(1) is the vectorized activation of layer 1; here it is simply a linear (identity) function of z(1)
(Note: We are not considering an activation function here)

Layer 2 i.e. output layer :-

Note : Input for layer 2 is output from layer 1
z(2) = W(2)a(1) + b(2)
a(2) = z(2)
Calculation at Output layer
z(2) = (W(2) * [W(1)X + b(1)]) + b(2)
z(2) = [W(2) * W(1)] * X + [W(2)*b(1) + b(2)]
Let,
[W(2) * W(1)] = W
[W(2)*b(1) + b(2)] = b
Final output : z(2) = W*X + b
which is again a linear function
So even after adding a hidden layer, the output is still a linear function of the input. We can
conclude that no matter how many hidden layers we attach to a neural network, the whole network
behaves like a single linear layer, because the composition of two linear functions is itself a linear
function. A network built only from linear functions therefore cannot learn non-linear relationships.
A non-linear activation function lets it learn them from the error. Hence we need an activation function.
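The collapse of two linear layers into one can be checked numerically. The following is a minimal sketch in plain Python; the weight, bias, and input values are illustrative assumptions, not values from these notes, and the helper names (`matvec`, `matmul`, `add`) are hypothetical.

```python
# Two stacked linear layers with no activation behave like one linear layer.
# All numeric values are illustrative assumptions.

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

W1 = [[0.1, 0.2], [0.3, 0.4]]   # hidden-layer weights W(1)
b1 = [0.5, -0.5]                # hidden-layer bias b(1)
W2 = [[1.0, -2.0]]              # output-layer weights W(2)
b2 = [0.25]                     # output-layer bias b(2)
X = [3.0, -1.0]                 # input features

# Layer by layer: z(2) = W(2) * (W(1) X + b(1)) + b(2)
a1 = add(matvec(W1, X), b1)
z2_layered = add(matvec(W2, a1), b2)

# Collapsed form: W = W(2) W(1), b = W(2) b(1) + b(2), z(2) = W X + b
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
z2_collapsed = add(matvec(W, X), b)

print(z2_layered, z2_collapsed)  # the two results agree up to rounding
```

Both computations give the same output, which is why stacking linear layers adds no expressive power.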
Variants of Activation Function
Linear Function
• Equation: The linear function has the equation of a straight line, i.e. y = x
• No matter how many layers we have, if all are linear in nature, the final activation of the
last layer is just a linear function of the input of the first layer.
• Range: -inf to +inf
• Uses: The linear activation function is used in just one place, i.e. the output layer.
• Issues: The derivative of a linear function is a constant that no longer depends on the
input "x", so the gradient carries no information about the input and the function cannot
introduce any non-linear behavior into our algorithm.
For example: Calculating the price of a house is a regression problem. A house price may take any
large or small value, so we can apply a linear activation at the output layer. Even in this case the
neural network must have a non-linear function at its hidden layers.
Sigmoid Function

[Figure: plot of the sigmoid function for x from -10 to 10, with the output rising from 0 to 1]

• It is a function which is plotted as an ‘S’-shaped graph.
• Equation: $A = \frac{1}{1 + e^{-x}}$
• Nature: Non-linear. Notice that for x values between -2 and 2 the curve is very steep. This
means that in this region small changes in x bring about large changes in the value of y.
• Value Range: 0 to 1
• Uses: Usually used in the output layer of a binary classifier, where the result is either 0 or 1.
Since the value of the sigmoid function lies between 0 and 1 only, the result can be predicted
as 1 if the value is greater than 0.5 and 0 otherwise.
Tanh Function

[Figure: plot of f(x) = 2/(1 + e^(-2x)) - 1, with the output between -1 and 1]

• The activation that almost always works better than the sigmoid function is the tanh function,
also known as the hyperbolic tangent function. It is a mathematically shifted and scaled version
of the sigmoid function. Both are similar and can be derived from each other.
• Equation: $f(x) = \tanh(x) = \frac{2}{1 + e^{-2x}} - 1$
• Value Range: -1 to +1
• Nature: non-linear
• Uses: Usually used in the hidden layers of a neural network. Since its values lie between -1
and 1, the mean of the hidden layer's output comes out to be 0 or very close to it, which helps
in centering the data. This makes learning for the next layer much easier.
RELU Function

[Figure: plot of the ReLU function for x from -10 to 10]

• It stands for rectified linear unit. It is the most widely used activation function, chiefly
implemented in the hidden layers of a neural network.
• Equation: A(x) = max(0, x). It gives an output x if x is positive and 0 otherwise.
• Value Range: [0, inf)
• Nature: non-linear, which means we can easily backpropagate the errors and have
multiple layers of neurons activated by the ReLU function.
• Uses: ReLU is less computationally expensive than tanh and sigmoid because it involves
simpler mathematical operations. At any time only some of the neurons are activated, which
makes the network sparse and efficient to compute.
In simple words, ReLU typically learns much faster than the sigmoid and tanh functions.
Softmax Function

[Figure: plot of softmax outputs against the inputs]

The softmax function is a generalization of the sigmoid function that is handy when we are trying to
handle multi-class classification problems.
• Nature: non-linear
• Uses: Usually used when trying to handle multiple classes. The softmax function is commonly
found in the output layer of image classification problems. It squeezes the output for each
class between 0 and 1 by exponentiating each output and dividing by the sum of the
exponentiated outputs, so the results add up to 1.
• Output: The softmax function is ideally used in the output layer of the classifier, where we
are trying to obtain the probabilities that define the class of each input.
Rules of thumb for choosing an activation function:
• If you really don’t know what activation function to use, then simply use ReLU, as it is a
general activation function for hidden layers and is used in most cases these days.
• If your output is for binary classification, then the sigmoid function is a very natural choice
for the output layer.
• If your output is for multi-class classification, then softmax is very useful for predicting the
probabilities of each class.
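The four non-linear activations described above can be written directly from their equations. This is a minimal sketch in plain Python; the function names and the sample inputs are assumptions chosen for illustration, and the subtraction of the maximum inside `softmax` is a common numerical-stability step that is not part of the notes.

```python
import math

def sigmoid(x):
    """A = 1 / (1 + e^(-x)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """f(x) = 2 / (1 + e^(-2x)) - 1; output lies in (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

def relu(x):
    """A(x) = max(0, x)."""
    return max(0.0, x)

def softmax(xs):
    """Exponentiate each score and divide by the sum of the exponentials."""
    m = max(xs)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative class scores for a three-class problem (assumed values).
probs = softmax([2.0, 1.0, 0.1])
print(sigmoid(0.0), tanh(1.0), relu(-3.0), probs)
```

The relation between tanh and sigmoid mentioned above shows up as `tanh(x) == 2 * sigmoid(2 * x) - 1`, up to floating-point rounding.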

4.3. Network Training

❖ Training: It is the process in which the network is taught to change its weights
and biases.
❖ Learning: It is the internal process of training where the artificial neural system learns
to update/adapt the weights and biases.

The different training/learning procedures available in ANN are:

➢ Supervised learning
➢ Unsupervised learning
➢ Reinforced learning
➢ Hebbian learning
➢ Gradient descent learning
➢ Competitive learning
➢ Stochastic learning

1.4.1. Requirements of Learning Laws:

• The learning law should lead to convergence of the weights
• The learning or training time for capturing the information from the training pairs should be short
• Learning should use local information
• The learning process should be able to capture the complex non-linear mapping between the
input and output pairs
• Learning should be able to capture as many patterns as possible
• The storage capacity for the pattern information gathered during learning should be high for
the given network

[Figure: classification of neural network learning algorithms. Supervised learning (error based) divides into stochastic and error correction (gradient descent) methods; gradient descent covers least mean square and back propagation.]

Figure 3: Different Training methods of ANN

Supervised learning:

Every input pattern that is used to train the network is associated with an output pattern, which is
the target or the desired pattern.

A teacher is assumed to be present during the training process: the network’s computed output is
compared with the correct expected output to determine the error. The error can then be used to
change the network parameters, which results in an improvement in performance.
Unsupervised learning:

In this learning method the target output is not presented to the network. It is as if there is no
teacher to present the desired patterns, and hence the system learns on its own by discovering
and adapting to structural features in the input patterns.
Reinforced learning:

In this method a teacher, though available, does not present the expected answer but only
indicates whether the computed output is correct or incorrect. The information provided helps
the network in the learning process.
Hebbian learning:

This rule was proposed by Hebb and is based on correlative weight adjustment. This is the
oldest learning mechanism inspired by biology. In this, the input-output pattern pairs $(x_i, y_i)$ are
associated by the weight matrix W, known as the correlation matrix.
It is computed as
$$W = \sum_{i=1}^{n} x_i y_i^T \qquad \text{eq (1)}$$

Here $y_i^T$ is the transpose of the associated output vector $y_i$. Numerous variants of the
rule have been proposed.
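Equation (1) is a sum of outer products, one per pattern pair. A minimal sketch in plain Python follows; the two bipolar pattern pairs and the function names `outer` and `hebbian_matrix` are assumptions made for illustration only.

```python
def outer(x, y):
    """Outer product x * y^T of two vectors, as a list of rows."""
    return [[xi * yj for yj in y] for xi in x]

def hebbian_matrix(pairs):
    """Correlation matrix W = sum over i of x_i * y_i^T (eq. 1)."""
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    W = [[0.0] * n_out for _ in range(n_in)]
    for x, y in pairs:
        product = outer(x, y)
        for r in range(n_in):
            for c in range(n_out):
                W[r][c] += product[r][c]
    return W

# Two assumed input-output pattern pairs (x_i, y_i).
pairs = [([1, -1, 1], [1, -1]),
         ([-1, 1, 1], [-1, 1])]
W = hebbian_matrix(pairs)
print(W)
```

Each entry of W grows when the corresponding input and output components agree in sign across the patterns, which is the correlative weight adjustment the rule describes.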
Gradient descent learning:

This is based on the minimization of the error E, defined in terms of the weights and the activation
function of the network. The activation function employed by the network must be differentiable,
as the weight update depends on the gradient of the error E.
Thus if $\Delta w_{ij}$ is the weight update of the link connecting the $i$th and $j$th neurons of two
neighbouring layers, then $\Delta w_{ij}$ is defined as
$$\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}} \qquad \text{eq (2)}$$

where $\eta$ is the learning rate parameter and $\frac{\partial E}{\partial w_{ij}}$ is the error gradient with
reference to the weight $w_{ij}$. The negative sign moves each weight against the gradient, in the
direction that reduces E.

4.4 Gradient Descent:

❖ Gradient Descent is a popular optimization technique used by learning algorithms in
Machine Learning and Deep Learning.
❖ A gradient is the slope of a function.
❖ It measures the degree of change of a variable in response to the changes
of another variable.
❖ Mathematically, Gradient Descent is an iterative algorithm that uses the partial
derivatives of the cost function with respect to each of its parameters. When the cost
function is convex, it converges to the global minimum for a suitable learning rate.
❖ The greater the gradient, the steeper the slope. Starting from an initial
value, Gradient Descent is run iteratively to find the values of the parameters
that give the minimum possible value of the given cost function.
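The iterative procedure in the points above can be sketched on a one-parameter convex cost. The cost function J(θ) = (θ − 3)², the starting value, the learning rate, and the step count are all assumptions chosen for illustration.

```python
def cost(theta):
    """Assumed convex cost function with its minimum at theta = 3."""
    return (theta - 3.0) ** 2

def gradient(theta):
    """Derivative of the cost with respect to theta."""
    return 2.0 * (theta - 3.0)

theta = 0.0          # initial value
learning_rate = 0.1  # assumed step size
history = [cost(theta)]

for _ in range(100):
    # Move against the gradient: a steeper slope gives a larger step.
    theta -= learning_rate * gradient(theta)
    history.append(cost(theta))

print(theta, history[-1])
```

With this learning rate the distance to the minimum shrinks by a constant factor on each step, so the recorded cost never increases.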
Types of Gradient Descent:
Typically, there are three types of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
Stochastic Gradient Descent (SGD):

❖ The word ‘stochastic’ means a system or a process that is linked with a random
probability.
❖ Hence, in Stochastic Gradient Descent, a few samples (in the strictest form, a single
sample) are selected randomly instead of the whole data set for each iteration.
❖ In Gradient Descent, there is a term called “batch” which denotes the total number
of samples from a dataset that are used for calculating the gradient in each iteration.
❖ In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is
taken to be the whole dataset.
❖ Using the whole dataset is really useful for getting to the minima in a less noisy
and less random manner, but a problem arises when the dataset gets big.
❖ Suppose you have a million samples in your dataset. If you use a typical
Gradient Descent optimization technique, you will have to use all one million
samples to complete one iteration, and this has to be done for every iteration
until the minima is reached. Hence, it becomes computationally very expensive
to perform.
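The difference between batch and stochastic gradient descent comes down to how many samples feed each gradient estimate. The sketch below fits a line to a small synthetic dataset; the data (y = 2x + 1 without noise), the learning rate, the step count, and the function names are assumptions for illustration, and a fixed random seed keeps the run repeatable.

```python
import random

rng = random.Random(0)  # fixed seed so the sampled batches are repeatable

# Assumed synthetic dataset: 100 points on the line y = 2x + 1.
data = [(i / 100, 2.0 * (i / 100) + 1.0) for i in range(100)]

def gradient(w, b, batch):
    """Gradient of the mean squared error over the given batch."""
    n = len(batch)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b

def train(batch_size, steps=3000, learning_rate=0.1):
    w, b = 0.0, 0.0
    for _ in range(steps):
        if batch_size >= len(data):
            batch = data                          # batch gradient descent
        else:
            batch = rng.sample(data, batch_size)  # random subset of samples
        grad_w, grad_b = gradient(w, b, batch)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w_batch, b_batch = train(batch_size=len(data))  # uses all 100 samples per step
w_sgd, b_sgd = train(batch_size=1)              # uses 1 random sample per step
print(w_batch, b_batch, w_sgd, b_sgd)
```

Each stochastic step touches one sample instead of one hundred, which is the cost saving that matters when the dataset holds a million samples.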

4.5 Backpropagation

❖ A backpropagation network consists of an input layer of neurons, an output layer, and at
least one hidden layer.
❖ The neurons compute a weighted sum of their inputs, which is then used as the input to
the activation function, commonly the sigmoid activation function.
❖ It makes use of supervised learning to teach the network.
❖ It repeatedly updates the weights of the network until the network produces the desired
output.
❖ The following factors are responsible for the training and performance of the network:

o Random (initial) values of the weights.
o The number of training cycles.
o The number of hidden neurons.
o The training set.
o Teaching parameter values such as learning rate and momentum.

Working of Backpropagation

Consider the diagram given below.

[Figure: backpropagation network with an input layer, hidden layer(s) and an output layer joined by weights w; the error is propagated backward from the output layer]

1. The inputs X arrive through the preconnected paths.
2. The inputs are modeled using weights W, which are initially selected at random.
3. The output of every neuron is then calculated, passing from the input layer to the
hidden layer and then to the output layer.
4. The errors in the outputs are evaluated: Error = Actual Output - Desired Output
5. The errors are sent back from the output layer to the hidden layer to adjust
the weights so that the error is reduced.
6. The whole process is repeated until the desired result is achieved.

Need of Backpropagation
o It is fast, simple, and easy to implement.
o Apart from the number of inputs, it has no other parameters to tune.
o It does not require any prior knowledge about the network, so it is flexible.
o It is a standard method that generally works well.

What is a Feed Forward Network?

A feedforward neural network is an artificial neural network where the nodes never
form a cycle. This kind of neural network has an input layer, hidden layers, and an
output layer. It is the first and simplest type of artificial neural network.

Types of Backpropagation Networks

Two Types of Backpropagation Networks are:

• Static Back-propagation
• Recurrent Backpropagation

Static back-propagation:
It is a kind of backpropagation network which produces a mapping of a static input
to a static output. It is useful for solving static classification problems like optical
character recognition.

Recurrent Backpropagation:
In recurrent backpropagation, the activations are fed forward until a fixed value is achieved.
After that, the error is computed and propagated backward.

The main difference between these two methods is that the mapping is rapid in
static back-propagation, while it is non-static in recurrent backpropagation.

Best practice Backpropagation

Backpropagation in a neural network can be explained with the help of the “shoe lace”
analogy.

Too little tension =
• Not enough constraint and very loose

Too much tension =
• Too much constraint (overtraining)
• Taking too much time (relatively slow process)
• Higher likelihood of breaking

Pulling one lace more than the other =
• Discomfort (bias)

Disadvantages of using Backpropagation

• The actual performance of backpropagation on a specific problem is dependent
on the input data.
• The backpropagation algorithm can be quite sensitive to noisy data.
• You need to use the matrix-based approach for backpropagation instead of mini-batch.

Backpropagation Process in Deep Neural Network

Backpropagation is one of the important concepts of a neural network. Our task is to
classify our data as well as possible. For this, we have to update the weights and biases, but
how can we do that in a deep neural network? In the linear regression model, we use
gradient descent to optimize the parameters. Similarly, here we also use the gradient descent
algorithm, with the gradients computed by backpropagation.

For a single training example, the backpropagation algorithm calculates the gradient of
the error function with respect to the weights of the network. Backpropagation algorithms
are a set of methods used to efficiently train artificial neural networks following a gradient
descent approach that exploits the chain rule.

The main feature of backpropagation is the iterative, recursive and efficient method
through which it calculates the updated weights, improving the network until it is able
to perform the task for which it is being trained. Backpropagation requires the derivatives
of the activation functions to be known at network design time.

Now, how is the error function used in backpropagation, and how does backpropagation work?
Let us start with an example and work through it mathematically to understand exactly how
the weights are updated using backpropagation.

[Figure: the example network, with inputs x1 and x2, hidden neurons H1 and H2, outputs y1 and y2, weights w1 to w8, and biases b1 and b2]

Input values

X1=0.05
X2=0.10

Initial weight

W1=0.15 w5=0.40
W2=0.20 w6=0.45
W3=0.25 w7=0.50
W4=0.30 w8=0.55

Bias Values

b1=0.35 b2=0.60

Target Values

T1=0.01
T2=0.99

Now, we first calculate the values of H1 and H2 by a forward pass.

Forward Pass

To find the value of H1, we first multiply the input values by the weights and add the bias:

H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775

To calculate the final result of H1, we apply the sigmoid function:

$$H1_{final} = \frac{1}{1 + e^{-H1}} = \frac{1}{1 + e^{-0.3775}} = 0.593269992$$

We calculate the value of H2 in the same way as H1:

H2=x1×w3+x2×w4+b1
H2=0.05×0.25+0.10×0.30+0.35
H2=0.3925

To calculate the final result of H2, we apply the sigmoid function:

$$H2_{final} = \frac{1}{1 + e^{-H2}} = \frac{1}{1 + e^{-0.3925}} = 0.596884378$$

Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.

To find the value of y1, we multiply its inputs, i.e. the outcomes of H1 and H2, by the
weights and add the bias:

y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597

To calculate the final result of y1, we apply the sigmoid function:

$$y1_{final} = \frac{1}{1 + e^{-y1}} = \frac{1}{1 + e^{-1.10590597}} = 0.75136507$$

We calculate the value of y2 in the same way as y1:

y2=H1×w7+H2×w8+b2
y2=0.593269992×0.50+0.596884378×0.55+0.60
y2=1.2249214

To calculate the final result of y2, we apply the sigmoid function:

$$y2_{final} = \frac{1}{1 + e^{-y2}} = \frac{1}{1 + e^{-1.2249214}} = 0.772928465$$

Our target values are 0.01 and 0.99. Our y1 and y2 values do not match our target
values T1 and T2.
Now, we will find the total error, which is the sum over the outputs of half the squared
difference between the target and the output. The total error is calculated as

$$E_{total} = \sum \frac{1}{2}(target - output)^2$$

So, the total error is

$$E_{total} = \frac{1}{2}(0.01 - 0.75136507)^2 + \frac{1}{2}(0.99 - 0.772928465)^2$$
$$= 0.274811084 + 0.0235600257$$
$$E_{total} = 0.29837111$$
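The forward pass and the total error can be reproduced in a few lines. This is a minimal sketch in plain Python using the inputs, weights, biases, and targets listed in the example; the variable names are assumptions made for readability.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values taken from the worked example.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# Hidden layer: weighted sum plus bias, then sigmoid.
h1 = x1 * w1 + x2 * w2 + b1
h2 = x1 * w3 + x2 * w4 + b1
h1_final, h2_final = sigmoid(h1), sigmoid(h2)

# Output layer: same computation on the hidden outputs.
y1 = h1_final * w5 + h2_final * w6 + b2
y2 = h1_final * w7 + h2_final * w8 + b2
y1_final, y2_final = sigmoid(y1), sigmoid(y2)

# Total error: sum of half squared differences.
e_total = 0.5 * (t1 - y1_final) ** 2 + 0.5 * (t2 - y2_final) ** 2

print(h1_final, h2_final, y1_final, y2_final, e_total)
```

Running it gives the same H1, H2, y1, y2 and total error as the hand calculation, up to rounding in the last digit.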

Now, we will backpropagate this error to update the weights using a backward pass.

Backward pass at the output layer

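To update a weight, we calculate the error corresponding to that weight with the help of
the total error. The error on weight w is calculated by differentiating the total error with
respect to w.

Since we work backward, we first consider the last-layer weight w5:

$$Error_{w5} = \frac{\partial E_{total}}{\partial w5} \qquad (1)$$

$$E_{total} = \frac{1}{2}(T1 - y1_{final})^2 + \frac{1}{2}(T2 - y2_{final})^2 \qquad (2)$$

From equation (2), it is clear that we cannot partially differentiate it with respect to w5
directly, because w5 does not appear in it. We split equation (1) into multiple terms using
the chain rule so that we can differentiate it with respect to w5:

$$\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial y1_{final}} \times \frac{\partial y1_{final}}{\partial y1} \times \frac{\partial y1}{\partial w5} \qquad (3)$$

Now, we calculate each term one by one to differentiate Etotal with respect to w5:

$$\frac{\partial E_{total}}{\partial y1_{final}} = 2 \times \frac{1}{2} \times (T1 - y1_{final})^{2-1} \times (-1) + 0$$
$$= -(T1 - y1_{final})$$
$$= -(0.01 - 0.75136507)$$
$$\frac{\partial E_{total}}{\partial y1_{final}} = 0.74136507 \qquad (4)$$

$$y1_{final} = \frac{1}{1 + e^{-y1}} \qquad (5)$$

$$\frac{\partial y1_{final}}{\partial y1} = \frac{\partial}{\partial y1}\left(\frac{1}{1 + e^{-y1}}\right) = \frac{e^{-y1}}{(1 + e^{-y1})^2} = e^{-y1} \times (y1_{final})^2 \qquad (6)$$

From equation (5):

$$e^{-y1} = \frac{1 - y1_{final}}{y1_{final}} \qquad (7)$$

Putting the value of $e^{-y1}$ in equation (6):

$$\frac{\partial y1_{final}}{\partial y1} = \frac{1 - y1_{final}}{y1_{final}} \times (y1_{final})^2$$
$$= y1_{final} \times (1 - y1_{final})$$
$$= 0.75136507 \times (1 - 0.75136507)$$
$$\frac{\partial y1_{final}}{\partial y1} = 0.186815602 \qquad (8)$$

$$y1 = H1_{final} \times w5 + H2_{final} \times w6 + b2 \qquad (9)$$

$$\frac{\partial y1}{\partial w5} = \frac{\partial (H1_{final} \times w5 + H2_{final} \times w6 + b2)}{\partial w5} = H1_{final}$$
$$\frac{\partial y1}{\partial w5} = 0.593269992 \qquad (10)$$

So, we put the values of $\frac{\partial E_{total}}{\partial y1_{final}}$, $\frac{\partial y1_{final}}{\partial y1}$ and $\frac{\partial y1}{\partial w5}$ in equation (3) to find the final
result.

$$\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial y1_{final}} \times \frac{\partial y1_{final}}{\partial y1} \times \frac{\partial y1}{\partial w5}$$
$$= 0.74136507 \times 0.186815602 \times 0.593269992$$
$$Error_{w5} = \frac{\partial E_{total}}{\partial w5} = 0.0821670407 \qquad (11)$$

Now, we will calculate the updated weight w5new with the help of the following formula

$$w5_{new} = w5 - \eta \times \frac{\partial E_{total}}{\partial w5} \qquad \text{Here, } \eta = \text{learning rate} = 0.5$$
$$= 0.4 - 0.5 \times 0.0821670407$$
$$w5_{new} = 0.35891648 \qquad (12)$$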

In the same way, we calculate w6new, w7new, and w8new, which gives us the following
values:

w5new=0.35891648
w6new=0.408666186
w7new=0.511301270
w8new=0.561370121
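The output-layer updates follow one pattern: the error term of an output neuron, multiplied by the hidden output that feeds the weight. A minimal sketch in plain Python, using the values of the worked example; the list layout and the name `delta` are assumptions made for compactness.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward values from the worked example.
h = [sigmoid(0.05 * 0.15 + 0.10 * 0.20 + 0.35),   # H1_final
     sigmoid(0.05 * 0.25 + 0.10 * 0.30 + 0.35)]   # H2_final
w_out = [[0.40, 0.45],   # w5, w6 feed y1
         [0.50, 0.55]]   # w7, w8 feed y2
b2 = 0.60
targets = [0.01, 0.99]
eta = 0.5  # learning rate

outputs = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b2) for row in w_out]

new_w_out = []
for row, out, target in zip(w_out, outputs, targets):
    # dE/dy_final * dy_final/dy, i.e. equations (4) and (8) combined.
    delta = -(target - out) * out * (1.0 - out)
    # dy/dw is the hidden output feeding the weight, as in equation (10).
    new_w_out.append([w - eta * delta * hi for w, hi in zip(row, h)])

print(new_w_out)
```

The first row of the result holds w5new and w6new, the second row w7new and w8new.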

Backward pass at Hidden layer

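Now, we will backpropagate to our hidden layer and update the weights w1, w2, w3, and
w4 as we have done with the weights w5, w6, w7, and w8.

We will calculate the error at w1 as

$$Error_{w1} = \frac{\partial E_{total}}{\partial w1}$$

From equation (2), it is clear that we cannot partially differentiate it with respect to w1
directly, because w1 does not appear in it. We split equation (1) into multiple terms so that
we can differentiate it with respect to w1:

$$\frac{\partial E_{total}}{\partial w1} = \frac{\partial E_{total}}{\partial H1_{final}} \times \frac{\partial H1_{final}}{\partial H1} \times \frac{\partial H1}{\partial w1} \qquad (13)$$

Now, we calculate each term one by one to differentiate Etotal with respect to w1:

$$\frac{\partial E_{total}}{\partial H1_{final}} = \frac{\partial \left(\frac{1}{2}(T1 - y1_{final})^2 + \frac{1}{2}(T2 - y2_{final})^2\right)}{\partial H1_{final}} \qquad (14)$$

We again split this, because there is no H1final term in Etotal. Writing
$E_1 = \frac{1}{2}(T1 - y1_{final})^2$ and $E_2 = \frac{1}{2}(T2 - y2_{final})^2$:

$$\frac{\partial E_{total}}{\partial H1_{final}} = \frac{\partial E_1}{\partial H1_{final}} + \frac{\partial E_2}{\partial H1_{final}} \qquad (15)$$

$\frac{\partial E_1}{\partial H1_{final}}$ and $\frac{\partial E_2}{\partial H1_{final}}$ are split again, because there is no H1final term in E1 and E2:

$$\frac{\partial E_1}{\partial H1_{final}} = \frac{\partial E_1}{\partial y1} \times \frac{\partial y1}{\partial H1_{final}} \qquad (16)$$

$$\frac{\partial E_2}{\partial H1_{final}} = \frac{\partial E_2}{\partial y2} \times \frac{\partial y2}{\partial H1_{final}} \qquad (17)$$

We again split both $\frac{\partial E_1}{\partial y1}$ and $\frac{\partial E_2}{\partial y2}$, because there is no y1 and y2 term in E1 and E2:

$$\frac{\partial E_1}{\partial y1} = \frac{\partial E_1}{\partial y1_{final}} \times \frac{\partial y1_{final}}{\partial y1} \qquad (18)$$

$$\frac{\partial E_2}{\partial y2} = \frac{\partial E_2}{\partial y2_{final}} \times \frac{\partial y2_{final}}{\partial y2} \qquad (19)$$

Now, we find the values of $\frac{\partial E_1}{\partial y1}$ and $\frac{\partial E_2}{\partial y2}$ by putting values in equations (18) and (19).

From equation (18)

$$\frac{\partial E_1}{\partial y1} = \frac{\partial \left(\frac{1}{2}(T1 - y1_{final})^2\right)}{\partial y1_{final}} \times \frac{\partial y1_{final}}{\partial y1}$$
$$= 2 \times \frac{1}{2} \times (T1 - y1_{final}) \times (-1) \times \frac{\partial y1_{final}}{\partial y1}$$

Using $\frac{\partial y1_{final}}{\partial y1} = 0.186815602$ from equation (8):

$$= 2 \times \frac{1}{2} \times (0.01 - 0.75136507) \times (-1) \times 0.186815602$$
$$\frac{\partial E_1}{\partial y1} = 0.138498562 \qquad (20)$$

From equation (19)

$$\frac{\partial E_2}{\partial y2} = \frac{\partial \left(\frac{1}{2}(T2 - y2_{final})^2\right)}{\partial y2_{final}} \times \frac{\partial y2_{final}}{\partial y2}$$
$$= 2 \times \frac{1}{2} \times (T2 - y2_{final}) \times (-1) \times \frac{\partial y2_{final}}{\partial y2} \qquad (21)$$

$$y2_{final} = \frac{1}{1 + e^{-y2}} \qquad (22)$$

$$\frac{\partial y2_{final}}{\partial y2} = \frac{\partial}{\partial y2}\left(\frac{1}{1 + e^{-y2}}\right) = \frac{e^{-y2}}{(1 + e^{-y2})^2} = e^{-y2} \times (y2_{final})^2 \qquad (23)$$

From equation (22):

$$e^{-y2} = \frac{1 - y2_{final}}{y2_{final}} \qquad (24)$$

Putting the value of $e^{-y2}$ in equation (23)

$$\frac{\partial y2_{final}}{\partial y2} = \frac{1 - y2_{final}}{y2_{final}} \times (y2_{final})^2$$
$$= y2_{final} \times (1 - y2_{final})$$
$$= 0.772928465 \times (1 - 0.772928465)$$
$$\frac{\partial y2_{final}}{\partial y2} = 0.175510053 \qquad (25)$$

From equation (21)

$$\frac{\partial E_2}{\partial y2} = 2 \times \frac{1}{2} \times (0.99 - 0.772928465) \times (-1) \times 0.175510053$$
$$\frac{\partial E_2}{\partial y2} = -0.0380982366126414 \qquad (26)$$

Now from equations (16) and (17)

$$\frac{\partial E_1}{\partial H1_{final}} = \frac{\partial E_1}{\partial y1} \times \frac{\partial y1}{\partial H1_{final}}$$
$$= 0.138498562 \times \frac{\partial (H1_{final} \times w5 + H2_{final} \times w6 + b2)}{\partial H1_{final}}$$
$$= 0.138498562 \times w5$$
$$= 0.138498562 \times 0.40$$
$$\frac{\partial E_1}{\partial H1_{final}} = 0.0553994248 \qquad (27)$$

$$\frac{\partial E_2}{\partial H1_{final}} = \frac{\partial E_2}{\partial y2} \times \frac{\partial y2}{\partial H1_{final}}$$
$$= -0.0380982366126414 \times \frac{\partial (H1_{final} \times w7 + H2_{final} \times w8 + b2)}{\partial H1_{final}}$$
$$= -0.0380982366126414 \times w7$$
$$= -0.0380982366126414 \times 0.50$$
$$\frac{\partial E_2}{\partial H1_{final}} = -0.0190491183063207 \qquad (28)$$

Put the values of $\frac{\partial E_1}{\partial H1_{final}}$ and $\frac{\partial E_2}{\partial H1_{final}}$ in equation (15):

$$\frac{\partial E_{total}}{\partial H1_{final}} = 0.0553994248 + (-0.0190491183063207)$$
$$\frac{\partial E_{total}}{\partial H1_{final}} = 0.0363503064936793 \qquad (29)$$

We have $\frac{\partial E_{total}}{\partial H1_{final}}$; we still need to figure out $\frac{\partial H1_{final}}{\partial H1}$ and $\frac{\partial H1}{\partial w1}$:

$$\frac{\partial H1_{final}}{\partial H1} = \frac{\partial}{\partial H1}\left(\frac{1}{1 + e^{-H1}}\right) = \frac{e^{-H1}}{(1 + e^{-H1})^2} = e^{-H1} \times (H1_{final})^2 \qquad (30)$$

$$H1_{final} = \frac{1}{1 + e^{-H1}}$$

$$e^{-H1} = \frac{1 - H1_{final}}{H1_{final}} \qquad (31)$$

Putting the value of $e^{-H1}$ in equation (30)

$$\frac{\partial H1_{final}}{\partial H1} = \frac{1 - H1_{final}}{H1_{final}} \times (H1_{final})^2$$
$$= H1_{final} \times (1 - H1_{final})$$
$$= 0.593269992 \times (1 - 0.593269992)$$
$$\frac{\partial H1_{final}}{\partial H1} = 0.2413007085923199$$

We calculate the partial derivative of the total net input to H1 with respect to w1 the same
way as we did for the output neuron:

$$H1 = x1 \times w1 + x2 \times w2 + b1 \qquad (32)$$

$$\frac{\partial H1}{\partial w1} = \frac{\partial (x1 \times w1 + x2 \times w2 + b1 \times 1)}{\partial w1} = x1$$
$$\frac{\partial H1}{\partial w1} = 0.05 \qquad (33)$$

So, we put the values of $\frac{\partial E_{total}}{\partial H1_{final}}$, $\frac{\partial H1_{final}}{\partial H1}$ and $\frac{\partial H1}{\partial w1}$ in equation (13) to find the final result.

$$\frac{\partial E_{total}}{\partial w1} = \frac{\partial E_{total}}{\partial H1_{final}} \times \frac{\partial H1_{final}}{\partial H1} \times \frac{\partial H1}{\partial w1}$$
$$= 0.0363503064936793 \times 0.2413007085923199 \times 0.05$$
$$Error_{w1} = \frac{\partial E_{total}}{\partial w1} = 0.000438568 \qquad (34)$$

Now, we will calculate the updated weight w1new with the help of the following formula

$$w1_{new} = w1 - \eta \times \frac{\partial E_{total}}{\partial w1} \qquad \text{Here, } \eta = \text{learning rate} = 0.5$$
$$= 0.15 - 0.5 \times 0.000438568$$
$$w1_{new} = 0.149780716 \qquad (35)$$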

In the same way, we calculate w2new, w3new, and w4new, which gives us the following
values:

w1new=0.149780716
w2new=0.19956143
w3new=0.24975114
w4new=0.29950229

We have updated all the weights. We found an error of 0.298371109 on the network when
we fed forward the inputs 0.05 and 0.1. After the first round of backpropagation, the total
error is down to 0.291027924. After repeating this process 10,000 times, the total error is
down to 0.0000351085. At this point, when we feed forward 0.05 and 0.1, the output neurons
generate 0.015912196 and 0.984065734, i.e. values close to our targets of 0.01 and 0.99.
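The whole procedure, one forward pass followed by one backward pass and repeated many times, can be sketched as a short training loop. This plain-Python sketch uses the values of the worked example; the function names are assumptions, and it mirrors two choices of the hand calculation: the hidden-layer gradients use the output weights from before the update, and the biases b1 and b2 are left unchanged.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

X = [0.05, 0.10]     # inputs x1, x2
T = [0.01, 0.99]     # targets T1, T2
B1, B2 = 0.35, 0.60  # biases, not updated in the worked example
ETA = 0.5            # learning rate

def forward(w_hid, w_out):
    h = [sigmoid(sum(w * x for w, x in zip(row, X)) + B1) for row in w_hid]
    y = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + B2) for row in w_out]
    return h, y

def total_error(y):
    return sum(0.5 * (t - o) ** 2 for t, o in zip(T, y))

def backprop_step(w_hid, w_out):
    h, y = forward(w_hid, w_out)
    # Output error terms: dE/dy_final * dy_final/dy.
    d_out = [-(t - o) * o * (1.0 - o) for t, o in zip(T, y)]
    # Hidden error terms: sum over outputs (old output weights), times dH_final/dH.
    d_hid = [sum(d_out[k] * w_out[k][j] for k in range(len(w_out))) * h[j] * (1.0 - h[j])
             for j in range(len(h))]
    new_out = [[w - ETA * d_out[k] * h[j] for j, w in enumerate(row)]
               for k, row in enumerate(w_out)]
    new_hid = [[w - ETA * d_hid[j] * X[i] for i, w in enumerate(row)]
               for j, row in enumerate(w_hid)]
    return new_hid, new_out

w_hid = [[0.15, 0.20], [0.25, 0.30]]  # [w1, w2] feed H1, [w3, w4] feed H2
w_out = [[0.40, 0.45], [0.50, 0.55]]  # [w5, w6] feed y1, [w7, w8] feed y2

error_before = total_error(forward(w_hid, w_out)[1])
w_hid, w_out = backprop_step(w_hid, w_out)
w1_after_one = w_hid[0][0]
error_after_one = total_error(forward(w_hid, w_out)[1])

for _ in range(9999):  # 10,000 rounds in total
    w_hid, w_out = backprop_step(w_hid, w_out)

final_outputs = forward(w_hid, w_out)[1]
error_final = total_error(final_outputs)
print(error_before, error_after_one, error_final, final_outputs)
```

The error falls slightly after the first round and by several orders of magnitude after 10,000 rounds, with both outputs moving close to their targets.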

2.5.1 Difference Between a Shallow Net & Deep Learning Net:

| Sl.No | Shallow nets | Deep learning nets |
|---|---|---|
| 1 | One hidden layer (or a very small number of hidden layers) | Deep nets have many hidden layers, with more neurons in each layer |
| 2 | Takes input only as vectors | DL can take raw data like images and text as inputs |
| 3 | Shallow nets need more parameters to achieve a better fit | DL can fit functions better with fewer parameters than a shallow network |
| 4 | Shallow networks with one hidden layer (same number of neurons as DL) cannot place complex functions over the input space | DL can compactly express highly complex functions over the input space |
| 5 | The number of units in a shallow network grows exponentially with task complexity | DL does not need to increase its size (neurons) for complex problems |
| 6 | A shallow network is more difficult to train with our current algorithms (e.g. it has issues of local minima) | Training in DL is easy and there is no issue of local minima in DL |

4.6 The Vanishing Gradient Problem

The Problem, Its Causes, Its Significance, and Its Solutions

The problem:
As more layers using certain activation functions are added to a neural network, the
gradients of the loss function approach zero, making the network hard to train.
Why:

Certain activation functions, like the sigmoid function, squash a large input space into a
small output space between 0 and 1. Therefore, a large change in the input of the sigmoid
function causes only a small change in the output. Hence, the derivative becomes small.

Image 1: The sigmoid function and its derivative

As an example, Image 1 shows the sigmoid function and its derivative. Note how, when the
inputs of the sigmoid function become larger or smaller (when |x| becomes bigger), the
derivative becomes close to zero.

Why it’s significant:

For a shallow network with only a few layers that use these activations, this isn’t a big
problem. However, when more layers are used, it can cause the gradient to be too small
for training to work effectively.

Gradients of neural networks are found using backpropagation. Simply put,
backpropagation finds the derivatives of the network by moving layer by layer from the
final layer to the initial one. By the chain rule, the derivatives of each layer are multiplied
down the network (from the final layer to the initial) to compute the derivatives of the
initial layers.
However, when n hidden layers use an activation like the sigmoid function, n small
derivatives are multiplied together. Thus, the gradient decreases exponentially as we
propagate down to the initial layers.

A small gradient means that the weights and biases of the initial layers will not be
updated effectively with each training session. Since these initial layers are often crucial
to recognizing the core elements of the input data, this can lead to overall inaccuracy of
the whole network.
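The exponential decay can be made concrete with the sigmoid derivative, which never exceeds 0.25. The sketch below multiplies that derivative across n layers; it is a simplification that assumes every layer sits at the same pre-activation value and ignores the weight factors that also enter the chain rule, and the function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0, where it equals 0.25.
max_derivative = sigmoid_derivative(0.0)

def gradient_factor(n_layers, pre_activation=0.0):
    """Product of n identical sigmoid derivatives (weights ignored)."""
    return sigmoid_derivative(pre_activation) ** n_layers

# Best case for each depth: every layer at the peak of the derivative.
factors = {n: gradient_factor(n) for n in (1, 5, 10, 20)}
for n, factor in factors.items():
    print(n, factor)

# For a saturated unit the derivative is far smaller than 0.25.
print(sigmoid_derivative(10.0))
```

Even in the best case the factor shrinks by at least four times per layer, so after 20 sigmoid layers almost nothing of the gradient reaches the first layer.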

Solutions:

The simplest solution is to use other activation functions, such as ReLU, whose derivative
is 1 for every positive input and therefore does not shrink the gradient.

Residual networks are another solution, as they provide residual connections straight to
earlier layers. As seen in Image 2, the residual connection directly adds the value at the
beginning of the block, x, to the end of the block (F(x)+x). This residual connection
doesn’t go through activation functions that “squash” the derivatives, resulting in a
higher overall derivative of the block.

[Figure: residual block. The input x passes through two weight layers with a ReLU between them to give F(x); an identity connection adds x, and a final ReLU is applied to F(x) + x]
Image 2: A residual block

Finally, batch normalization layers can also resolve the issue. As stated before, the
problem arises when a large input space is mapped to a small one, causing the
derivatives to disappear. In Image 1, this is most clearly seen when |x| is big. Batch
normalization reduces this problem by normalizing the input so that |x| doesn’t reach
the outer edges of the sigmoid function. As seen in Image 3, it normalizes the input so
that most of it falls in the green region, where the derivative isn’t too small.

Image 3: Sigmoid function with restricted inputs
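The effect of normalization on the sigmoid derivative can be sketched on a small batch. This plain-Python example normalizes a batch to zero mean and unit variance; it leaves out the learnable scale and shift that a full batch normalization layer also applies, and the batch values and the small constant `eps` are assumptions for illustration.

```python
import math

def sigmoid_derivative(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def batch_normalize(values, eps=1e-5):
    """Shift and scale a batch to zero mean and (almost) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(variance + eps) for v in values]

# Assumed pre-activations that sit far out on the flat part of the sigmoid.
raw = [6.0, 8.0, 10.0, 12.0]
normalized = batch_normalize(raw)

mean_grad_raw = sum(sigmoid_derivative(v) for v in raw) / len(raw)
mean_grad_normalized = sum(sigmoid_derivative(v) for v in normalized) / len(normalized)

print(normalized)
print(mean_grad_raw, mean_grad_normalized)
```

After normalization the inputs sit near zero, where the sigmoid derivative is largest, so the average derivative over the batch is much higher than for the raw inputs.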

4.7 Hyperparameters in Machine Learning

Hyperparameters in Machine learning are those parameters that are explicitly

defined by the user to control the learning process. These hyperparameters are used
to improve the learning of the model, and their values are set before starting the learning
process of the model.

❖ Here the prefix "hyper" suggests that the parameters are top-level parameters that
are used in controlling the learning process.
❖ The value of the Hyperparameter is selected and set by the machine learning
engineer before the learning algorithm begins training the model.
❖ Hence, these are external to the model, and their values are not learned from
the data during the training process.

Some examples of Hyperparameters in Machine Learning

o The k in kNN or K-Nearest Neighbour algorithm
o Learning rate for training a neural network
o Train-test split ratio
o Batch Size
o Number of Epochs
o Branches in Decision Tree
o Number of clusters in Clustering Algorithm

Model Parameters:

Model parameters are configuration variables that are internal to the model, and the
model learns them on its own. Examples include the weights or coefficients of the
independent variables in a linear regression model or an SVM, the weights and biases
of a neural network, and the cluster centroids in clustering. Some key points for model
parameters are as follows:
o They are used by the model for making predictions.
o They are learned by the model from the data itself.
o These are usually not set manually.
o These are part of the model and are key to a machine learning algorithm.

Model Hyperparameters:

Hyperparameters are those parameters that are explicitly defined by the user to control
the learning process. Some key points for model hyperparameters are as follows:

o These are usually defined manually by the machine learning engineer.

o One cannot know the exact best value of a hyperparameter for a given problem in
advance. The best value can be determined either by a rule of thumb or by trial and error.
o Some examples of hyperparameters are the learning rate for training a neural
network and K in the KNN algorithm.

Categories of Hyperparameters

Broadly hyperparameters can be divided into two categories, which are given below:

1. Hyperparameter for Optimization

2. Hyperparameter for Specific Models

Hyperparameter for Optimization

The process of selecting the best hyperparameters to use is known as hyperparameter
tuning, and the tuning process is also known as hyperparameter optimization.
Optimization parameters are used for optimizing the model.

[Figure: hyperparameter tuning yields the best hyperparameters, which are then used in
model training to learn the model parameters.]

Some of the popular optimization parameters are given below:

o Learning Rate: The learning rate is the hyperparameter in optimization
algorithms that controls how much the model changes in response to the
estimated error each time the model's weights are updated. It is one of
the most important hyperparameters when building a neural network. Selecting a
good learning rate is a challenging task: if the learning rate is too small,
training becomes slow; on the other hand, if the learning rate is too large, the
updates may overshoot the minimum and the model may fail to converge.
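The trade-off can be seen on a toy problem. The sketch below runs plain gradient descent on the one-dimensional objective J(w) = w², whose gradient is 2w; the objective, the starting point and the three learning rates are illustrative choices, not values taken from these notes.

```python
def gradient_descent(lr, w0=1.0, steps=20):
    """Minimize J(w) = w**2 (gradient 2*w) with a fixed learning rate."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w   # w <- w - lr * dJ/dw
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr)}")
```

With the very small rate the weight has barely moved from its starting value, with the moderate rate it is close to the minimum at 0, and with the large rate every step overshoots so the weight grows in magnitude instead of shrinking.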

o Batch Size: To speed up the learning process, the training set is divided into
subsets known as batches; the batch size is the number of training samples in
each batch.

o Number of Epochs: An epoch is one complete pass of the training data through
the machine learning model. Training is an iterative process that usually runs
for more than one epoch, and the right number of epochs varies from model to
model. To determine it, the validation error is taken into account: the number
of epochs is increased as long as the validation error keeps decreasing. If the
validation error shows no improvement over several consecutive epochs, this
indicates that training should stop.
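Both ideas can be sketched in a few lines. The helper names `make_batches` and `epochs_to_train`, the `patience` parameter and the validation errors below are hypothetical and serve only as an illustration of splitting data into batches and of stopping once the validation error no longer improves.

```python
def make_batches(samples, batch_size):
    """Split the training samples into consecutive batches of at most batch_size items."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

def epochs_to_train(val_errors, patience=2):
    """Return the 1-based epoch with the best validation error.

    Scanning stops once the error has failed to improve for
    `patience` consecutive epochs.
    """
    best_error = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, err in enumerate(val_errors, start=1):
        if err < best_error:
            best_error, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

print(make_batches(list(range(10)), batch_size=4))

# Hypothetical validation error recorded after each epoch.
val_errors = [0.90, 0.60, 0.45, 0.40, 0.41, 0.43, 0.39]
print(epochs_to_train(val_errors))
```

In this made-up run the error stops improving after epoch 4, so training is halted two epochs later and epoch 4 is reported; the later improvement at epoch 7 is never reached, which is the usual price of a small patience value.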

Hyperparameter for Specific Models

Hyperparameters that are involved in the structure of the model are known as
hyperparameters for specific models. These are given below:

o Number of Hidden Units: Hidden units are the components that make up the
layers of processors between the input and output units in a neural network.

It is important to specify the number of hidden units for the neural network. A common
rule of thumb is that it should lie between the size of the input layer and the size of the
output layer; a more specific heuristic sets the number of hidden units to 2/3 of the size
of the input layer plus the size of the output layer. These are starting points only, and
the value is normally tuned.

For complex functions more hidden units are needed, but the number should not be so
large that the model overfits.

o Number of Layers: A neural network is made up of vertically arranged
components, which are called layers. There are mainly input layers, hidden
layers, and output layers. A 3-layered neural network often performs better
than a 2-layered network, and for a convolutional neural network a greater
number of layers often makes a better model, although deeper networks are
harder to train.
4.8 Batch Normalization:

❖ It is a method of adaptive reparameterization, motivated by the difficulty
of training very deep models. In deep networks, the weights of every
layer are updated during training.
❖ So the output of a layer will no longer be on the same scale as the input
(even though the input is normalized).
❖ Normalization is a data pre-processing tool used to bring numerical
data to a common scale without distorting its shape.
❖ When we input data to a machine learning or deep learning algorithm, we
tend to change the values to a balanced scale so that the model can
generalize appropriately. (Normalization is used to bring the input into
a balanced scale/range.)

Let's understand this through an example, we have a deep neural network as shown in the following image.


Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the pre-processing stage.
When the input passes through the first layer, it is transformed: a sigmoid function is applied over the dot
product of X and the weight matrix W.

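[Figure: a network with inputs x1, x2, x3, x4 and two hidden layers; the inputs are normalized before the first layer.]

\[ h_1 = \sigma(W_1 X) \]
\[ h_2 = \sigma(W_2 h_1) = \sigma(W_2 \, \sigma(W_1 X)) \]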

Image Source: https://www.analyticsvidhya.com/blog/2021/03/introduction-to-batch-normalization/
❖ Even though the input X was normalized, the output is no longer on
the same scale.
❖ As the data passes through multiple layers of the network, with
(sigmoidal) activation functions applied multiple times, an
internal covariate shift arises in the data.
❖ This motivates us to move towards Batch Normalization.
❖ Normalization is the process of altering the input data to have a mean
of zero and a standard deviation of one.
Procedure to do Batch Normalization:

(1) Consider the batch input from layer h. For this layer we first
calculate the mean of the hidden activations, and then the
standard deviation of the hidden activations.
(2) Now we normalize the hidden activations using these mean and
standard deviation values. To do this, we subtract the mean from
each input and divide the result by the sum of the standard
deviation and the smoothing term (ε), which prevents division by zero.
(3) As the final stage, the re-scaling and offsetting of the input is
performed. Here two components of the BN algorithm are used,
γ (gamma) and β (beta). These parameters re-scale (γ) and
shift (β) the vector containing the values from the previous
operations.
These two parameters are learnable. During the training of the
neural network, the optimal values of γ and β are obtained and
used, so each batch is normalized appropriately.
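The three steps can be sketched for a single hidden unit as follows. This is a minimal illustration, not a full layer: the function name `batch_norm`, the default values of `gamma`, `beta` and `eps`, and the sample activations are assumptions. It divides by the standard deviation plus ε as described in step (2); many libraries instead add ε to the variance inside the square root, which gives almost the same result.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one hidden unit's activations over a batch, then scale and shift."""
    n = len(values)
    # Step 1: mean and standard deviation of the hidden activations.
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Step 2: subtract the mean and divide by (standard deviation + eps).
    normalized = [(v - mean) / (std + eps) for v in values]
    # Step 3: re-scale with gamma and shift with beta (learned during training).
    return [gamma * v + beta for v in normalized]

# Hypothetical activations of one hidden unit for a batch of four samples.
h = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(h)
print(out)
```

With the default γ = 1 and β = 0 the output has a mean of zero and a standard deviation very close to one; during training γ and β would be updated by gradient descent like any other weights.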
4.9 Regularization
Definition: “any modification we make to a learning algorithm that is intended
to reduce its generalization error but not its training error.”
❖ In the context of deep learning, most regularization strategies
are based on regularizing estimators.
❖ Regularization of an estimator works by trading increased bias
for reduced variance. An effective regularizer is one that makes a
profitable trade, reducing variance significantly while not overly
increasing the bias.
❖ Many regularization approaches are based on limiting the capacity
of models, such as neural networks, linear regression, or logistic
regression, by adding a parameter norm penalty Ω(θ) to the
objective function J. We denote the regularized objective function
by J˜
J˜(θ; X, y) = J(θ; X, y) + αΩ(θ)
where α ∈ [0, ∞) is a hyperparameter that weights the relative contribution
of the norm penalty term, Ω, relative to the standard objective function J.
Setting α to 0 results in no regularization. Larger values of α correspond to
more regularization.
For neural networks, we typically choose a parameter norm penalty Ω that penalizes
only the weights of the affine transformation at each layer and leaves the biases
unregularized.
L2 Regularization
One of the simplest and most common kinds of parameter norm penalty is the L2
parameter norm penalty, commonly known as weight decay. This regularization
strategy drives the weights closer to the origin by adding a regularization term

\[ \Omega(\theta) = \frac{1}{2} \lVert w \rVert_2^2 \]

to the objective function. L2 regularization is also known as ridge regression or
Tikhonov regularization.

To simplify, we assume no bias parameter, so θ is just w. Such a model has the
following total objective function:

\[ \tilde{J}(w; X, y) = \frac{\alpha}{2} w^\top w + J(w; X, y), \]

with the corresponding parameter gradient

\[ \nabla_w \tilde{J}(w; X, y) = \alpha w + \nabla_w J(w; X, y). \]

To take a single gradient step to update the weights, we perform this update,
where ε is the learning rate:

\[ w \leftarrow w - \epsilon \left( \alpha w + \nabla_w J(w; X, y) \right). \]

Written another way, the update is

\[ w \leftarrow (1 - \epsilon \alpha) w - \epsilon \nabla_w J(w; X, y). \]
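As a quick check of the algebra, the two forms of the update can be compared numerically. In this sketch the weights, gradient, learning rate `lr` (ε) and penalty strength `alpha` (α) are arbitrary illustrative values, and the function names are hypothetical.

```python
def step_penalty_form(w, grad, lr, alpha):
    """w <- w - lr * (alpha * w + grad): gradient step on the regularized objective."""
    return [wi - lr * (alpha * wi + gi) for wi, gi in zip(w, grad)]

def step_decay_form(w, grad, lr, alpha):
    """w <- (1 - lr * alpha) * w - lr * grad: shrink the weights, then take the usual step."""
    return [(1.0 - lr * alpha) * wi - lr * gi for wi, gi in zip(w, grad)]

w = [1.0, -2.0]
grad = [0.5, 0.1]        # hypothetical gradient of the unregularized loss J
lr, alpha = 0.1, 0.01

print(step_penalty_form(w, grad, lr, alpha))
print(step_decay_form(w, grad, lr, alpha))
```

Both functions return the same weights up to floating-point rounding. When the gradient of J is zero, the second form makes the shrinking explicit: each weight is simply multiplied by (1 − εα).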

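We can see that the addition of the weight decay term has modified the learning
rule to multiplicatively shrink the weight vector by a constant factor on each
step, just before performing the usual gradient update. This describes what
happens in a single step. To see what happens over the entire course of training,
we make a quadratic approximation of the objective function in the neighbourhood
of w*, the value of the weights that minimizes the unregularized cost J. The
approximation Ĵ is given by

\[ \hat{J}(w) = J(w^*) + \frac{1}{2} (w - w^*)^\top H (w - w^*), \]

where H is the Hessian matrix of J with respect to w evaluated at w*.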

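The minimum of Ĵ occurs where its gradient ∇wĴ(w) = H(w − w*) is equal to 0. To
study the effect of weight decay, we add the weight decay gradient αw and solve for
the minimum of the regularized version of Ĵ, denoted w̃:

\[ \alpha \tilde{w} + H(\tilde{w} - w^*) = 0 \]
\[ (H + \alpha I) \tilde{w} = H w^* \]
\[ \tilde{w} = (H + \alpha I)^{-1} H w^* \]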

As α approaches 0, the regularized solution w̃ approaches w*. But what happens as α grows?
Because H is real and symmetric, we can decompose it into a diagonal matrix Λ and an
orthonormal basis of eigenvectors, Q, such that H = QΛQᵀ. Applying the decomposition to the
above equation, we obtain

\[ \tilde{w} = (Q \Lambda Q^\top + \alpha I)^{-1} Q \Lambda Q^\top w^* \]
\[ = \left[ Q (\Lambda + \alpha I) Q^\top \right]^{-1} Q \Lambda Q^\top w^* \]
\[ = Q (\Lambda + \alpha I)^{-1} \Lambda Q^\top w^*. \]
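The last expression says that, along the i-th eigenvector of H, the component of w* is rescaled by the factor λi / (λi + α). The sketch below illustrates this for the simplest case, assuming H is already diagonal (so Q is the identity); the eigenvalues, w* and α are hypothetical values chosen to show one flat and one steep direction.

```python
def regularized_minimum(eigenvalues, w_star, alpha):
    """w~_i = lambda_i / (lambda_i + alpha) * w*_i, valid when H is diagonal (Q = I)."""
    return [lam / (lam + alpha) * w for lam, w in zip(eigenvalues, w_star)]

eigenvalues = [0.01, 100.0]   # hypothetical curvatures: a flat and a steep direction
w_star = [1.0, 1.0]           # hypothetical unregularized minimum
alpha = 1.0

w_tilde = regularized_minimum(eigenvalues, w_star, alpha)
print(w_tilde)
```

In the flat direction (λ much smaller than α) the weight is shrunk almost to zero, while in the steep direction (λ much larger than α) it is left nearly unchanged.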


Figure 2: Weight updation effect

The solid ellipses represent contours of equal value of the unregularized
objective. The dotted circles represent contours of equal value of the L2
regularizer. At the point w̃, these competing objectives reach an equilibrium. In
the first dimension, the eigenvalue of the Hessian of J is small. The objective
function does not increase much when moving horizontally away from w*.
Because the objective function does not express a strong preference along this
direction, the regularizer has a strong effect on this axis. The regularizer pulls w1
close to zero. In the second dimension, the objective function is very sensitive to
movements away from w*. The corresponding eigenvalue is large, indicating
high curvature. As a result, weight decay affects the position of w2 relatively little.

L1 Regularization

While L2 weight decay is the most common form of weight decay, there are other
ways to penalize the size of the model parameters. Another option is to use L1 regularization.
➢ L1 regularization on the model parameter w is defined as the sum of
absolute values of the individual parameters:

\[ \Omega(\theta) = \lVert w \rVert_1 = \sum_i |w_i|. \]

L1 weight decay controls the strength of the regularization by scaling the
penalty Ω using a positive hyperparameter α. Thus, the regularized objective
function J̃(w; X, y) is given by

\[ \tilde{J}(w; X, y) = \alpha \lVert w \rVert_1 + J(w; X, y), \]

with the corresponding gradient

\[ \nabla_w \tilde{J}(w; X, y) = \alpha \, \mathrm{sign}(w) + \nabla_w J(w; X, y), \qquad \text{(Eq-1)} \]

where sign(w) is the sign of w applied element-wise.

By inspecting Equation 1, we can see immediately that the effect of L1
regularization is quite different from that of L2 regularization. Specifically,
we can see that the regularization contribution to the gradient no longer
scales linearly with each wi; instead it is a constant factor with a sign equal to
sign(wi).
Under the simplifying assumption that the Hessian H is diagonal, the quadratic approximation of the L1
regularized objective function decomposes into a sum over the parameters:

\[ \hat{J}(w; X, y) = J(w^*; X, y) + \sum_i \left[ \frac{1}{2} H_{i,i} (w_i - w_i^*)^2 + \alpha |w_i| \right]. \]

The problem of minimizing this approximate cost function has an analytical solution (for each dimension i) with the following form:

\[ w_i = \mathrm{sign}(w_i^*) \max\left\{ |w_i^*| - \frac{\alpha}{H_{i,i}}, \, 0 \right\}. \]

Consider the situation where w_i* > 0 for all i. There are two possible outcomes:

1. The case where w_i* ≤ α/H_{i,i}. Here the optimal value of w_i under the regularized
objective is simply w_i = 0. This occurs because the contribution of J(w; X, y)
to the regularized objective J̃(w; X, y) is overwhelmed, in direction i, by
the L1 regularization, which pushes the value of w_i to zero.

2. The case where w_i* > α/H_{i,i}. In this case, the regularization does not move the
optimal value of w_i to zero but instead just shifts it in that direction by a
distance equal to α/H_{i,i}.
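The analytical solution and its two cases can be tried out directly. In the sketch below the unregularized optimum `w_star`, the diagonal Hessian entries `h_diag` and the value of α are hypothetical; the function implements w_i = sign(w_i*) max{|w_i*| − α/H_{i,i}, 0} element by element.

```python
def sign(x):
    # Returns 1, 0 or -1.
    return (x > 0) - (x < 0)

def l1_solution(w_star, h_diag, alpha):
    """Minimizer of the quadratic approximation with an L1 penalty (diagonal Hessian)."""
    return [sign(w) * max(abs(w) - alpha / h, 0.0) for w, h in zip(w_star, h_diag)]

w_star = [0.3, 2.0, -1.5]   # hypothetical unregularized optimum
h_diag = [1.0, 1.0, 2.0]    # hypothetical diagonal Hessian entries H_ii > 0

print(l1_solution(w_star, h_diag, alpha=0.5))
```

The first weight falls under case 1 (|w_i*| ≤ α/H_{i,i}) and is set exactly to zero, which is the source of sparsity; the other two fall under case 2 and are shifted towards zero by α/H_{i,i}.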
Difference between L1 & L2 Parameter Regularization

S.No | L1 Regularization | L2 Regularization
1 | Penalizes the sum of the absolute values of the weights. | Penalizes the sum of the squared weights.
2 | It has a sparse solution. | It has a non-sparse solution.
3 | It gives multiple solutions. | It has only one solution.
4 | Built-in feature selection. | No feature selection.
5 | Robust to outliers. | Not robust to outliers.
6 | It generates simple and interpretable models. | It gives more accurate predictions when the output variable is a function of all the input variables.
7 | Unable to learn complex data patterns. | Able to learn complex data patterns.
8 | Computationally inefficient over non-sparse conditions. | Computationally efficient because of having analytical solutions.

Difference between Normalization and Standardization

Normalization | Standardization
Uses the minimum and maximum values of a feature for scaling. | Uses the mean and standard deviation of a feature for scaling.
It is helpful when features are of different scales. | It is helpful when a variable should have a mean of 0 and a standard deviation of 1.
Scales values to the range [0, 1] or [-1, 1]. | Scaled values are not restricted to a specific range.
It is affected by outliers. | It is comparatively less affected by outliers.
Scikit-Learn provides a transformer called MinMaxScaler for normalization. | Scikit-Learn provides a transformer called StandardScaler for standardization.
It is also called scaling normalization. | It is known as Z-score normalization.
It is useful when the feature distribution is unknown. | It is useful when the feature distribution is normal.
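The two scalings in the table can be written out by hand as a small illustration. The sketch below uses plain Python rather than the Scikit-Learn transformers, assumes the values are not all equal (otherwise both would divide by zero), and uses a made-up feature column.

```python
import math

def min_max_scale(values):
    """Normalization: map the values linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Standardization: subtract the mean and divide by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

data = [10.0, 20.0, 30.0, 40.0, 50.0]   # hypothetical feature column
print(min_max_scale(data))
print(z_score(data))
```

The normalized values lie between 0 and 1, whereas the standardized values are centred on 0 with unit standard deviation and are not confined to a fixed range.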
4.10 Dropout in Neural Networks
A Neural Network (NN) is based on a collection of connected units or nodes called
artificial neurons, which loosely model the neurons in a biological brain. Since such a
network is created artificially in machines, we refer to that as Artificial Neural Networks
(ANN).
Problem: When a fully-connected layer has a large number of neurons, co-adaptation is
more likely to happen. Co-adaptation refers to when multiple neurons in a layer extract
the same, or very similar, hidden features from the input data. This can happen when
the connection weights for two different neurons are nearly identical.

This poses two different problems to our model:
• Wastage of machine’s resources when computing the same output.
• If many neurons are extracting the same features, it adds more significance to
those features for our model. This leads to overfitting if the duplicate extracted
features are specific to only the training set.
Solution to the problem: As the title suggests, we use dropout while training the NN to
minimize co-adaptation. In dropout, we randomly shut down some fraction of a layer’s
neurons at each training step by zeroing out the neuron values. The fraction of neurons
to be zeroed out is known as the dropout rate, p. The remaining neurons have their
values multiplied by 1/(1 − p) so that the overall sum of the neuron values remains the
same in expectation.
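[Figure: a layer of 6 units shown at two training steps; at each step two randomly chosen
units are zeroed and the remaining four are scaled by ×1.5.]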

The two images represent dropout applied to a layer of 6 units, shown at multiple
training steps. The dropout rate is 1/3, and the remaining 4 neurons at each training
step have their values scaled by ×1.5. In this way, we train a random sample of the
neurons at each step rather than the whole network at once. This reduces
co-adaptation and helps the neurons learn the hidden features better.
Why does dropout work?
• With dropout, every iteration trains a smaller, randomly thinned version of
the full neural network, which has a regularizing effect.
• Dropout helps in shrinking the squared norm of the weights, which tends to
reduce overfitting.
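A training-time dropout step for the 6-unit example can be sketched as follows. This is an illustration under stated assumptions: the function name `dropout` is hypothetical, and it zeroes an exact fraction of the units so that it matches the example (rate 1/3, four survivors scaled by 1.5), whereas common implementations drop each unit independently with probability equal to the rate.

```python
import random

def dropout(values, rate, rng):
    """Zero round(rate * n) randomly chosen units and scale the rest by 1 / (1 - rate)."""
    n = len(values)
    n_drop = round(rate * n)
    dropped = set(rng.sample(range(n), n_drop))   # indices of the units to shut down
    scale = 1.0 / (1.0 - rate)
    return [0.0 if i in dropped else v * scale for i, v in enumerate(values)]

rng = random.Random(0)          # seeded only to make the run repeatable
layer = [1.0] * 6               # hypothetical activations of a 6-unit layer
for step in range(2):
    print(f"step {step}:", dropout(layer, rate=1 / 3, rng=rng))
```

Each call picks a different pair of units to drop, and because the survivors are scaled by 1.5 the sum of the activations stays the same. At test time dropout is switched off and all units are used without scaling.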
