ANN Lecture 1: Artificial Neural Networks
Dr. Tehseen Zia
What are Artificial Neural Networks?
• Computational models inspired by the human brain:
• Algorithms that try to mimic the brain.
• Massively parallel, distributed systems made up of simple processing units (neurons)
• Synaptic connection strengths among neurons are used to store the acquired knowledge
• Knowledge is acquired by the network from its environment through a learning process
Properties
• Inputs are flexible
• any real values
• Typically take vectors
• Target function may be discrete-valued, real-valued, or vectors of discrete or real values
• Outputs are real numbers between 0 and 1
• Resistant to errors in the training data
• Long training time
• The function produced can be difficult for humans to interpret
When to consider neural networks
• Input is high-dimensional, discrete or real-valued (e.g., raw sensor values)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of the result is not important
Examples:
• Image classification
• Language model
• Speech phoneme recognition
• Financial prediction
History
• Early Beginnings (1940s - 1950s):
• The concept of artificial neurons and neural networks was first introduced in the 1940s by Warren
McCulloch and Walter Pitts, who proposed a mathematical model of a simplified neuron.
• In 1958, Frank Rosenblatt developed the Perceptron, a single-layer neural network designed for
binary classification tasks.
• Limitations and the Perceptron Controversy (1960s):
• Despite initial excitement, the Perceptron had limitations and could only solve linearly separable
problems.
• A famous study by Marvin Minsky and Seymour Papert in 1969 highlighted the limitations of single-
layer perceptrons, leading to a period of skepticism about neural networks.
• Gradient-Based Learning (1980s):
• In the 1980s, researchers like David Rumelhart, Geoffrey Hinton, and James McClelland contributed to the development of parallel distributed processing models, which laid the groundwork for modern neural networks.
• They popularized gradient-based learning for training neural networks.
• They demonstrated the power of multi-layer neural networks and introduced the backpropagation algorithm for training them.
• Convolutional Neural Networks (CNNs) (1980s - 1990s):
• Yann LeCun and others developed Convolutional Neural Networks (CNNs) in the late 1980s, particularly for
image recognition tasks.
History
• Recurrent Neural Networks (RNNs) (1980s - 1990s):
• RNNs, designed for sequence data, were developed during this period. They
found applications in natural language processing and speech recognition.
• AI Winter (1990s):
• Research in ANNs faced challenges and setbacks, leading to a period known
as the "AI winter" where funding and interest in artificial intelligence waned.
• Resurgence of Deep Learning (2000s - Present):
• The 2000s saw a resurgence of interest in ANNs, driven by more powerful
computing hardware, larger datasets, and advances in training algorithms.
• The term "deep learning" gained popularity in the 2010s, describing neural
networks with multiple hidden layers.
• Deep Learning Boom (2010s - Present):
• Deep learning achieved remarkable breakthroughs in computer vision, natural language processing, and speech recognition, leading to advancements in applications like image recognition, machine translation, autonomous vehicles, and games.
Why Artificial Neural Networks?
Why ANN?
• Hand-engineered features are time-consuming, brittle, and not scalable in practice
• Can we learn the underlying features directly from data?
Why Now?
• Neural networks date back decades, so why the resurgence?
• 1952: Gradient descent
• 1958: Perceptron (learnable weights)
• 1986: Backpropagation (multi-layer perceptron)
• Big Data: large datasets, easier collection and storage
• Hardware: Graphics Processing Units (GPUs), massive parallelization
• Software: improved techniques, new models, toolkits
The Perceptron
Inputs → Weights → Sum → Output
[Diagram: inputs $x_1, x_2, \ldots, x_m$ with weights $w_1, w_2, \ldots, w_m$ feed a summation node whose output is $\hat{y}$]
Linear combination of inputs:
$$\Sigma = \sum_{i=1}^{m} x_i w_i$$
Output:
$$\hat{y} = g(\Sigma), \quad \text{where } g(\Sigma) = \begin{cases} 1 & \text{if } \Sigma \ge 0 \\ -1 & \text{otherwise} \end{cases}$$
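As a small illustration (my own sketch, not from the slides), this perceptron rule translates directly into Python; the function name and the use of NumPy are my choices:

```python
import numpy as np

def perceptron(x, w):
    """Perceptron with a threshold activation: returns 1 if the
    weighted sum of the inputs is >= 0, and -1 otherwise."""
    s = np.dot(x, w)            # weighted sum: sum_i x_i * w_i
    return 1 if s >= 0 else -1

# Example with two inputs and weights w1 = 3, w2 = -2
# (the same weights used in the example slides that follow).
print(perceptron(np.array([2.0, 3.0]), np.array([3.0, -2.0])))   # sum = 0  -> 1
print(perceptron(np.array([-1.0, 2.0]), np.array([3.0, -2.0])))  # sum = -7 -> -1
```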
The Perceptron
Inputs → Weights → Sum → Output
Example: with
$$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad W = \begin{bmatrix} 3 \\ -2 \end{bmatrix},$$
the weighted sum is
$$\Sigma = 3x_1 - 2x_2.$$
[Plot: the decision boundary $3x_1 - 2x_2 = 0$ drawn in the $(x_1, x_2)$ plane through the points $(-2,-3)$, $(0,0)$, and $(2,3)$]
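As a quick check of the plotted points (worked arithmetic added here, not on the slide):
$$3(2) - 2(3) = 0, \qquad 3(0) - 2(0) = 0, \qquad 3(-2) - 2(-3) = 0,$$
so all three points satisfy $\Sigma = 0$ and lie exactly on the decision boundary; inputs on either side of this line are assigned $\hat{y} = 1$ or $\hat{y} = -1$.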
Implementing AND with Perceptron
Logical AND: the inputs $(0,0)$, $(0,1)$, and $(1,0)$ should map to 0, and $(1,1)$ should map to 1.
A perceptron with weights $w_1 = 1$, $w_2 = 1$ and a threshold of 2 computes
$$\Sigma = w_1 x_1 + w_2 x_2 - 2,$$
so the decision boundary is the line $x_1 + x_2 = 2$ in the $(x_1, x_2)$ plane: only $(1,1)$ gives $\Sigma \ge 0$ and output $\hat{y} = 1$; the other three inputs give $\Sigma < 0$ and output $\hat{y} = 0$. See the sketch below.
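A minimal sketch in Python (my own code, using the threshold-2 reading of the slide) that checks the AND behaviour on all four binary inputs:

```python
def and_perceptron(x1, x2):
    """AND gate as a perceptron: weights w1 = w2 = 1 and a threshold of 2
    (equivalently, a bias of -2 with the 'output 1 if sum >= 0' rule)."""
    s = 1 * x1 + 1 * x2 - 2
    return 1 if s >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", and_perceptron(x1, x2))
# Only the input (1, 1) produces 1.
```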
Implementing OR with Perceptron
Logical OR: the input $(0,0)$ should map to 0, and $(0,1)$, $(1,0)$, and $(1,1)$ should map to 1.
A perceptron with weights $w_1 = 1$, $w_2 = 1$ and a threshold of 1 computes
$$\Sigma = w_1 x_1 + w_2 x_2 - 1,$$
so the decision boundary is the line $x_1 + x_2 = 1$: $(0,0)$ gives $\Sigma < 0$ and output $\hat{y} = 0$, while the other three inputs give $\Sigma \ge 0$ and output $\hat{y} = 1$. See the sketch below.
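The same check for OR (again my own sketch, with the threshold-1 weights above):

```python
def or_perceptron(x1, x2):
    """OR gate as a perceptron: weights w1 = w2 = 1 and a threshold of 1
    (equivalently, a bias of -1 with the 'output 1 if sum >= 0' rule)."""
    s = 1 * x1 + 1 * x2 - 1
    return 1 if s >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", or_perceptron(x1, x2))
# Only the input (0, 0) produces 0.
```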
Non-Linearly Separable Problems
What if we want to distinguish red versus green points?
A single perceptron cannot, because XOR is not linearly separable.
Can Multiple Perceptrons Solve Non-Linearly Separable Problems?
[Figure: the XOR-style data with two linear decision boundaries, one for Perceptron #1 and one for Perceptron #2]
Decision rule (see the sketch below):
• if $\Sigma$ of P1 < 0 → black
• else if $\Sigma$ of P2 > 0 → black
• else → white
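The slides do not give concrete weights for the two perceptrons, so the sketch below invents hypothetical ones (P1 with boundary $x_1 + x_2 = 0.5$, P2 with boundary $x_1 + x_2 = 1.5$) purely to show how the decision rule separates XOR-style data:

```python
def p1(x1, x2):
    # Hypothetical Perceptron #1: boundary x1 + x2 = 0.5
    return x1 + x2 - 0.5

def p2(x1, x2):
    # Hypothetical Perceptron #2: boundary x1 + x2 = 1.5
    return x1 + x2 - 1.5

def classify(x1, x2):
    """Decision rule from the slide: black if P1's sum is negative or
    P2's sum is positive, white for the band in between."""
    if p1(x1, x2) < 0:
        return "black"
    elif p2(x1, x2) > 0:
        return "black"
    else:
        return "white"

for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, "->", classify(*point))
# (0,0) and (1,1) fall in one class, (0,1) and (1,0) in the other,
# which no single linear boundary can achieve.
```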
Multi-perceptron Architecture
[Diagram: inputs $x_1$ and $x_2$ connect to Perceptron #1 and Perceptron #2 through weights $w_{11}$, $w_{12}$, $w_{21}$, $w_{22}$; the two sums feed Perceptron #3 through a second pair of weights $w_{11}$, $w_{21}$, which produces $\hat{y}$]
Multi-perceptron Mathematically
• Perceptron 1: $z_1 = w_{11} x_1 + w_{21} x_2$
• Perceptron 2: $z_2 = w_{12} x_1 + w_{22} x_2$
• Perceptron 3: $\hat{y} = w_{11} z_1 + w_{21} z_2$ (the second-layer weights)
Substituting $z_1$ and $z_2$ into $\hat{y}$ gives another weighted sum of $x_1$ and $x_2$: each perceptron computes a linear function, and a sum of linear functions is a linear function. See the sketch below.
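A small numerical sketch of this point (my own code, with arbitrary weights): stacking two linear layers with no activation is exactly equivalent to a single linear layer with combined weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights for the two hidden perceptrons (layer 1)
# and the output perceptron (layer 2); the values are arbitrary.
W1 = rng.normal(size=(2, 2))   # maps [x1, x2] -> [z1, z2]
w2 = rng.normal(size=(2,))     # maps [z1, z2] -> y_hat

x = np.array([0.7, -1.3])

# Two stacked linear perceptrons (no activation function)
y_stacked = w2 @ (W1 @ x)

# A single equivalent linear perceptron with combined weights
w_combined = w2 @ W1
y_single = w_combined @ x

print(np.isclose(y_stacked, y_single))  # True: stacking linear layers stays linear
```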
Multi-perceptron Architecture
Non-linearity
[Diagram: the same architecture, but each perceptron's sum now passes through a non-linear activation before being passed on; Perceptron #1 and Perceptron #2 apply it to their sums, and Perceptron #3 applies it before producing $\hat{y}$]
The Perceptron
Inputs → Weights → Sum → Non-linearity → Output
[Diagram: inputs $x_1, \ldots, x_m$ with weights $w_1, \ldots, w_m$ feed a summation node, followed by a non-linear activation that produces $\hat{y}$]
Linear combination of inputs:
$$z = \sum_{i=1}^{m} x_i w_i$$
Output:
$$\hat{y} = g(z), \quad \text{where } g \text{ is a non-linear activation function}$$
[Plot: the activation $g(z)$ versus $z$]
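The slides leave the specific activation unspecified; a common choice (my assumption here) is the sigmoid, which maps any real $z$ into (0, 1), matching the earlier note that outputs are real numbers between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_forward(x, w, g=sigmoid):
    """Perceptron with a non-linearity: y_hat = g(sum_i x_i * w_i)."""
    z = np.dot(x, w)
    return g(z)

print(perceptron_forward(np.array([2.0, 3.0]), np.array([3.0, -2.0])))   # g(0)  = 0.5
print(perceptron_forward(np.array([-1.0, 2.0]), np.array([3.0, -2.0])))  # g(-7) ≈ 0.0009
```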
Importance of Activation Function
[Diagram: inputs $x_1, x_2, \ldots, x_m$ feed two units with pre-activations $z_1$ and $z_2$; their outputs are $y_1 = g(z_1)$ and $y_2 = g(z_2)$]
Single Layer Neural Network
Input → Hidden → Output
[Diagram: inputs $x_1, x_2, \ldots, x_m$ connect through weights $W^{(1)}$ to hidden units $z_1, z_2, z_3, z_4$ with activations $g(z_1), \ldots, g(z_4)$; the hidden activations connect through weights $W^{(2)}$ to the outputs $\hat{y}_1$ and $\hat{y}_2$]
$$\hat{y}_i = g\!\left(\sum_{j=1}^{d_1} g(z_j)\, w^{(2)}_{j,i}\right), \qquad z_j = \sum_{k=1}^{m} x_k\, w^{(1)}_{k,j},$$
where $d_1$ is the number of hidden units. A forward-pass sketch follows.
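A sketch of this forward pass in Python (my own code; the sigmoid activation, the layer sizes, and the random weights are assumptions made only to show the shapes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def single_layer_forward(x, W1, W2, g=sigmoid):
    """Forward pass of a single-hidden-layer network:
    z_j = sum_k W1[k, j] * x_k, then y_hat_i = g(sum_j g(z_j) * W2[j, i])."""
    z = x @ W1          # hidden pre-activations, shape (d1,)
    h = g(z)            # hidden activations g(z_j)
    return g(h @ W2)    # outputs y_hat, one value per output unit

# Hypothetical sizes/weights: m = 2 inputs, d1 = 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))
W2 = rng.normal(size=(3, 2))
print(single_layer_forward(np.array([4.0, 5.0]), W1, W2))
```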
Example Problem
Will I pass this class?
[Plot: past outcomes shown as Pass and Fail points in a two-dimensional feature space]
Feeding the input $x = [4,\ 5]$ through the single-layer network gives Predicted $\hat{y}_1 = 0.1$.
Single Layer Neural Network
[Same network with input $x = [4,\ 5]$]
Predicted = 0.1, Actual = 1
Quantifying Loss
The loss of our network measures the cost incurred from incorrect predictions.
[Same network with input $x = [4,\ 5]$: Predicted = 0.1, Actual = 1]
$$\mathcal{L}\big(\underbrace{f(x^{(i)}; W)}_{\text{Predicted}},\ \underbrace{y^{(i)}}_{\text{Actual}}\big)$$
Empirical Loss
The empirical loss measures the total loss over the entire dataset.
x = [4, 5]   f(x) = 0.1 (✗)   y = 1
x = [2, 1]   f(x) = 0.8 (✗)   y = 0
x = [5, 8]   f(x) = 0.6 (✓)   y = 1
$$J(W) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}\big(f(x^{(i)}; W),\ y^{(i)}\big)$$
See the code sketch below.
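A short sketch of the empirical loss over the table above (my own code; the slide does not name the per-example loss $\mathcal{L}$, so binary cross-entropy is assumed here, and squared error would plug in the same way):

```python
import numpy as np

def empirical_loss(preds, targets, per_example_loss):
    """J(W) = (1/n) * sum_i L(f(x_i; W), y_i), for any per-example loss L."""
    n = len(preds)
    return sum(per_example_loss(p, y) for p, y in zip(preds, targets)) / n

# Predictions f(x) and labels y from the slide's table.
preds = np.array([0.1, 0.8, 0.6])
targets = np.array([1, 0, 1])

# Assumed per-example loss: binary cross-entropy.
bce = lambda p, y: -(y * np.log(p) + (1 - y) * np.log(1 - p))
print(empirical_loss(preds, targets, bce))
```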
Mean Squared Error Loss
The mean squared error can be used with regression models that output continuous real numbers.
x = [4, 5]   f(x) = 40 (✗)   y = 87
x = [2, 1]   f(x) = 85 (✗)   y = 65
x = [5, 8]   f(x) = 97 (✓)   y = 95
$$J(W) = \frac{1}{n}\sum_{i=1}^{n} \big(y^{(i)} - f(x^{(i)}; W)\big)^2$$
See the code sketch below.
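The same formula applied to the regression table above (my own sketch):

```python
import numpy as np

def mse_loss(preds, targets):
    """J(W) = (1/n) * sum_i (y_i - f(x_i; W))**2."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return np.mean((targets - preds) ** 2)

# Predictions f(x) and targets y from the slide's regression table.
print(mse_loss([40, 85, 97], [87, 65, 95]))  # (47**2 + (-20)**2 + (-2)**2) / 3
```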
Training Neural Networks
Loss Optimization
We want to find the network weights that achieve the lowest loss:
$$W^{*} = \arg\min_{W} \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}\big(f(x^{(i)}; W),\ y^{(i)}\big)$$
$$W^{*} = \arg\min_{W} J(W)$$
where $W = \{W^{(1)}, W^{(2)}, \ldots\}$.
Loss Optimization
$$W^{*} = \arg\min_{W} J(W)$$
[Plot: the loss surface $J(w_1, w_2)$ over the two weights of a single perceptron]
• Randomly pick an initial $(w_1, w_2)$
• Compute the gradient, $\frac{\partial J(W)}{\partial W}$
• Take a small step in the opposite direction of the gradient
• Repeat until reaching a (local) minimum
A code sketch of this loop follows.
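A minimal gradient-descent loop in Python (my own sketch; the toy loss surface J and the learning rate are assumptions, since the slides only show the procedure on a generic surface):

```python
import numpy as np

def J(w):
    """A toy loss surface J(w1, w2) with a single minimum at (3, -1)."""
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def grad_J(w):
    """Analytic gradient dJ/dW of the toy loss."""
    return np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])

rng = np.random.default_rng(0)
w = rng.normal(size=2)        # randomly pick an initial (w1, w2)
lr = 0.1                      # learning rate (step size)

for step in range(100):
    w = w - lr * grad_J(w)    # step in the opposite direction of the gradient

print(w, J(w))                # w approaches (3, -1), where J is minimal
```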
Gradient Descent
The Perceptron: Forward Propagation
Inputs → Weights → Sum → Non-Linearity → Output
[Diagram: inputs $x_1, x_2, \ldots, x_m$ with weights $w_1, w_2, \ldots, w_m$ feed a summation node, followed by a non-linear activation that produces $\hat{y}$]
Linear combination of inputs, then the non-linearity:
$$\hat{y} = g(z), \qquad z = X^{\top} W = \sum_{i=1}^{m} x_i w_i,$$
where
$$X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}, \qquad W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}$$
[Plot: the activation $g(z)$ versus $z$]
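A vectorized version of this forward pass (my own sketch; the sigmoid is assumed for g since the slides leave it unspecified):

```python
import numpy as np

def forward(X, W, g=lambda z: 1.0 / (1.0 + np.exp(-z))):
    """Forward propagation for one perceptron: y_hat = g(X^T W)."""
    z = X.T @ W            # dot product of the input and weight column vectors
    return g(z)

X = np.array([[-1.0], [2.0]])   # column vector of inputs [x1; x2]
W = np.array([[3.0], [-2.0]])   # column vector of weights [w1; w2]
print(forward(X, W))            # g(-7) ≈ 0.0009
```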
Importance of Activation Function
• The purpose of an activation function is to introduce non-linearity into the network.
The Perceptron: Example
[Diagram: inputs $x_1$ and $x_2$ with weights $w_1 = 3$ and $w_2 = -2$ feed the sum and non-linearity that produce $\hat{y}$]
With
$$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \qquad W = \begin{bmatrix} 3 \\ -2 \end{bmatrix},$$
the pre-activation is $z = 3x_1 - 2x_2$ and $\hat{y} = g(z)$.
The set of inputs with $z = 0$, i.e. $3x_1 - 2x_2 = 0$, is a line in 2D: the decision boundary in the $(x_1, x_2)$ plane, drawn through $(0,0)$ and $(2,3)$.
[Plot: the boundary line with the test input $(-1, 2)$ marked on its negative side]
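Working the test input through numerically (the activation is not specified on the slide; a sigmoid is assumed here for concreteness):
$$z = 3(-1) - 2(2) = -7, \qquad \hat{y} = g(-7) = \frac{1}{1 + e^{7}} \approx 0.001$$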