MODULE 5

Perceptron - Part 2

1 / 27
Outline

• Network of Perceptrons
• Sigmoid Neuron
• Learning setup

2 / 27
Network of Perceptrons
How to implement any boolean function using a network of perceptrons ...

• For this discussion, we will assume True = +1 and False = -1
• We consider 2 inputs and 4 perceptrons
• Each input is connected to all 4 perceptrons with specific weights
• The bias (w0) of each perceptron is -2 (i.e., each perceptron will fire only if the weighted sum of its inputs is ≥ 2)
• Each of these perceptrons is connected to an output perceptron by weights (which need to be learned)
• The output of this output perceptron (y) is the output of the network
3 / 27
Terminology

• This network contains 3 layers
• The layer containing the inputs (𝑥1, 𝑥2) is called the input layer
• The middle layer containing the 4 perceptrons is called the hidden layer
• The final layer containing one output neuron is called the output layer
• The outputs of the 4 perceptrons in the hidden layer are denoted by ℎ1, ℎ2, ℎ3, ℎ4
• The red and blue edges are called layer 1 weights; 𝑤1, 𝑤2, 𝑤3, 𝑤4 are called layer 2 weights

4 / 27
• We claim that this network can be used to implement any boolean function (linearly separable or not)!
• In other words, we can find 𝑤1, 𝑤2, 𝑤3, 𝑤4 such that the truth table of any boolean function can be represented by this network
• Astonishing claim! Well, not really, if you understand what is going on
• Each perceptron in the middle layer fires only for a specific input (and no two perceptrons fire for the same input), as the sketch below illustrates:
  the first perceptron fires for {−1, −1}
  the second perceptron fires for {−1, 1}
  the third perceptron fires for {1, −1}
  the fourth perceptron fires for {1, 1}
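To make this firing pattern concrete, here is a minimal sketch (not from the slides; the layer-1 weight values of ±1 matching each input pattern are an assumption consistent with the description above), showing that with bias w0 = -2 each hidden perceptron fires for exactly one of the four input combinations.

```python
# Minimal sketch: assumed layer-1 weights of ±1 matching each input pattern,
# bias w0 = -2, so a perceptron fires only when its weighted sum is >= 2.
import itertools

hidden_weights = [(-1, -1), (-1, 1), (1, -1), (1, 1)]   # one weight pair per hidden perceptron
bias = -2

def fires(w, x):
    """Return 1 if the perceptron with weights w fires on input x."""
    return int(w[0] * x[0] + w[1] * x[1] + bias >= 0)

for x in itertools.product([-1, 1], repeat=2):
    h = [fires(w, x) for w in hidden_weights]
    print(x, h)   # exactly one 1 per row: each hidden perceptron fires for one input
```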
5 / 27
Let us see why this network works by taking an example of the XOR function

• Let 𝑤0 be the bias of the output neuron (i.e., it will fire if 𝑤1ℎ1 + 𝑤2ℎ2 + 𝑤3ℎ3 + 𝑤4ℎ4 ≥ 𝑤0)
• This results in the following four conditions to implement XOR:
  𝑤1 < 𝑤0,  𝑤2 ≥ 𝑤0,  𝑤3 ≥ 𝑤0,  𝑤4 < 𝑤0
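One concrete assignment that satisfies these conditions is sketched below (the specific values 𝑤0 = 1, 𝑤1 = 𝑤4 = 0, 𝑤2 = 𝑤3 = 1 are an illustrative choice, not taken from the slide); together with the one-hot hidden layer from before, the network reproduces XOR.

```python
# Sketch with assumed weights w0 = 1, (w1, w2, w3, w4) = (0, 1, 1, 0),
# which satisfy w1 < w0, w2 >= w0, w3 >= w0, w4 < w0.
patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
w0 = 1
w = [0, 1, 1, 0]

def xor_net(x1, x2):
    h = [int(p == (x1, x2)) for p in patterns]                 # one-hot hidden layer
    return int(sum(wi * hi for wi, hi in zip(w, h)) >= w0)     # output perceptron

for x1, x2 in patterns:
    print((x1, x2), xor_net(x1, x2))   # 0, 1, 1, 0 -> the XOR truth table
```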

6 / 27
Let us see why this network works by taking an example of the XOR function
• Unlike before, there are no contradictions now and the system of inequalities can be satisfied
• Essentially, each 𝑤ᵢ is now responsible for one of the 4 possible inputs and can be adjusted to get the desired output for that input

7 / 27
• It should be clear that the same network can be used to represent the remaining 15 boolean functions as well
• Each boolean function will result in a different set of non-contradicting inequalities which can be satisfied by appropriately setting 𝑤1, 𝑤2, 𝑤3, 𝑤4 (see the sketch below)
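A small sketch of this claim (the weight-setting rule below, 𝑤ᵢ = 1 wherever the target output is 1 and 0 otherwise, with 𝑤0 = 1, is one convenient choice and an assumption of this sketch, not the only possibility):

```python
# Sketch: with the one-hot hidden layer, every one of the 16 two-input boolean
# functions can be represented by choosing w1..w4 (here: 1 where the target is 1).
import itertools

patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def net_output(weights, w0, x):
    h = [int(p == x) for p in patterns]                          # one-hot hidden layer
    return int(sum(wi * hi for wi, hi in zip(weights, h)) >= w0)

for truth_table in itertools.product([0, 1], repeat=4):          # all 16 boolean functions
    weights = list(truth_table)                                   # w_i = target output for input i
    outputs = tuple(net_output(weights, 1, x) for x in patterns)
    assert outputs == truth_table
print("all 16 two-input boolean functions are representable")
```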

8 / 27
• What if we have more than 2 inputs? (say, 3 inputs)
• Again, each of the 8 perceptrons will fire only for one of the 8 input combinations
• Each of the 8 weights in the second layer is responsible for one of the 8 input combinations and can be adjusted to produce the desired output for that input

9 / 27
What if we have n inputs?

• Answer: a network with 2ⁿ perceptrons in the hidden layer
• Theorem: Any boolean function of n inputs can be represented exactly by a network of perceptrons containing 1 hidden layer with 2ⁿ perceptrons and one output layer containing 1 perceptron
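The construction behind the theorem can be sketched as follows (an illustration only, not the formal proof from the slides; it assumes each hidden perceptron's weights equal one input pattern with bias −n, and that the layer-2 weights are read off the truth table):

```python
# Sketch: build a 1-hidden-layer network with 2^n perceptrons for any
# n-input boolean function given as a truth table over {-1, +1}^n inputs.
import itertools

def build_network(truth_table, n):
    patterns = list(itertools.product([-1, 1], repeat=n))
    layer2 = [1 if truth_table[p] else 0 for p in patterns]    # w_i read off the truth table

    def predict(x):
        # hidden perceptron i fires iff <pattern_i, x> >= n, i.e. x equals pattern_i
        h = [int(sum(pi * xi for pi, xi in zip(p, x)) >= n) for p in patterns]
        return int(sum(w * hi for w, hi in zip(layer2, h)) >= 1)   # output threshold w0 = 1
    return predict

# Example: 3-input parity (not linearly separable) is represented exactly
n = 3
parity = {p: sum(v == 1 for v in p) % 2 for p in itertools.product([-1, 1], repeat=n)}
net = build_network(parity, n)
assert all(net(p) == parity[p] for p in parity)
```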

10 / 27
• Again, why do we care about boolean functions? How does this help us with our original problem, which was to predict whether we like a movie or not? Let us see!

11 / 27
• We are given this data about our past movie experience
• For each movie, we are given the values of the various factors (𝑥1, 𝑥2, ..., 𝑥𝑛) that we base our decision on, and we are also given the value of y (like/dislike)
• 𝑝ᵢ's are the points for which the output was 1 and 𝑛ⱼ's are the points for which it was 0
• The data may or may not be linearly separable
• The proof that we just saw tells us that it is possible to have a network of perceptrons and learn the weights in this network such that for any given 𝑝ᵢ or 𝑛ⱼ the output of the network will be the same as 𝑦ᵢ or 𝑦ⱼ (i.e., we can separate the positive and the negative points)

12 / 27
What are MLPs?

• Networks of the form that we just saw (containing an input layer, an output layer and one or more hidden layers) are called Multilayer Perceptrons (MLPs, in short)

13 / 27
What we have discussed so far

Comparison between the MP Neuron Model and the Perceptron Model

MP Neuron Model                                     Perceptron Model
Works well on linearly separable data               Works well on linearly separable data
Accepts only boolean inputs                         Accepts any real-valued inputs
Inputs are not weighted (less flexible)             Inputs can be individually weighted (more flexible)
The threshold can be adjusted to fit the dataset    The threshold can be adjusted to fit the dataset

Note: A perceptron will fire if the weighted sum of its inputs is greater than the threshold (−𝑤0)
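The contrast can also be summarized in code; this is only a sketch, and the thresholds and weights below are illustrative values, not part of the comparison table.

```python
# Sketch: MP neuron (unweighted boolean inputs, fixed threshold) vs.
# perceptron (weighted real-valued inputs plus a learnable bias w0).

def mp_neuron(x, threshold):
    return int(sum(x) >= threshold)            # x is a list of 0/1 inputs

def perceptron(x, w, w0):
    return int(sum(wi * xi for wi, xi in zip(w, x)) + w0 >= 0)

print(mp_neuron([1, 0, 1], threshold=2))                 # 1: two of three inputs are on
print(perceptron([0.7, 0.2], w=[1.0, 0.5], w0=-0.5))     # 1: weighted sum 0.8 >= threshold 0.5
```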

14 / 27
Sigmoid Neuron

15 / 27
• What about arbitrary functions of the form y = f(x), where x ∈ Rⁿ (instead of {0, 1}ⁿ) and y ∈ R (instead of {0, 1})?
• Can we have a network which can (approximately) represent such functions?
• So far we have worked with boolean functions, but now we want to move on to arbitrary functions y = f(x), where x = [𝑥1, 𝑥2, ..., 𝑥𝑛] ∈ Rⁿ and y ∈ R

16 / 27
• The thresholding logic used by a perceptron is very harsh!
• For example, let us return to our problem of deciding whether we will like or dislike a movie
• Consider that we base our decision only on one input (𝑥1 = criticsRating, which lies between 0 and 1)
• If the threshold is 0.5 (𝑤0 = −0.5) and 𝑤1 = 1, then what would be the decision for a movie with criticsRating = 0.51? (like)
• What about a movie with criticsRating = 0.49? (dislike)
• It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49 (Figure 1)
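A quick sketch of this "harsh" behaviour with the values above (𝑤0 = −0.5, 𝑤1 = 1):

```python
# Sketch: a perceptron's hard threshold flips the decision for a tiny change in rating.
def perceptron_decision(critics_rating, w1=1.0, w0=-0.5):
    return "like" if w1 * critics_rating + w0 >= 0 else "dislike"

print(perceptron_decision(0.51))   # like
print(perceptron_decision(0.49))   # dislike -- a sharp jump for a 0.02 difference
```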

17 / 27
• This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose
• It is a characteristic of the perceptron function itself, which behaves like a step function
• There will always be this sudden change in the decision (from 0 to 1) when Σⁿᵢ₌₁ 𝑤ᵢ𝑥ᵢ crosses the threshold (−𝑤0)
• For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1 (Figure 2)

18 / 27
• Introducing sigmoid neurons, where the output function is much smoother than the step function
• Here is one form of the sigmoid function, called the logistic function:

      y = 1 / (1 + e^−(𝑤0 + Σⁿᵢ₌₁ 𝑤ᵢ𝑥ᵢ))

• We no longer see a sharp transition around the threshold −𝑤0
• Also, the output y is no longer binary but a real value between 0 and 1, which can be interpreted as a probability
• Instead of a like/dislike decision we get the probability of liking the movie (Figure 3)
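For comparison, here is a minimal sketch of the same one-input movie example with a sigmoid neuron (same assumed values 𝑤0 = −0.5, 𝑤1 = 1 as before):

```python
# Sketch: the logistic output changes smoothly across the old threshold.
import math

def sigmoid_neuron(critics_rating, w1=1.0, w0=-0.5):
    return 1 / (1 + math.exp(-(w0 + w1 * critics_rating)))

print(round(sigmoid_neuron(0.51), 3))   # ~0.502 -- slightly more likely to like
print(round(sigmoid_neuron(0.49), 3))   # ~0.498 -- slightly less likely, no sharp jump
```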

19 / 27
Figure 4

20 / 27
Figure 5

21 / 27
A typical Supervised Machine
Learning Setup

22 / 27
• Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
• Before we see such an algorithm, we will revisit the concept of error

Figure 6

23 / 27
• Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable
• What does "cannot deal with" mean?
• What would happen if we use a perceptron model to classify this data?
• We would probably end up with a line like this (Figure 7)
• This line doesn't seem to be too bad
• Sure, it misclassifies 3 blue points and 3 red points, but we could live with this error in most real world applications
• From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error

24 / 27
This brings us to a typical machine learning setup, which has the following components...

Data: {(𝑥ᵢ, 𝑦ᵢ)}ⁿᵢ₌₁

Model: Our approximation of the relation between x and y. For example,

      ŷ = 1 / (1 + e^−(𝑤ᵀ𝑥))
or
      ŷ = 𝑤ᵀ𝑥
or
      ŷ = 𝑥ᵀ𝑊𝑥

or just about any function

Parameters: In all the above cases, w is a parameter which needs to be learned from the data

Learning algorithm: An algorithm for learning the parameters (w) of the model (for example, the perceptron learning algorithm, gradient descent, etc.)

Objective/Loss/Error function: To guide the learning algorithm - the learning algorithm should aim to minimize the loss function
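The three example model families above can be written compactly as follows (a sketch only; the numerical values are placeholders, not data from the lecture):

```python
# Sketch of the example models: logistic, linear, and quadratic forms.
import numpy as np

def logistic_model(w, x):
    return 1 / (1 + np.exp(-(w @ x)))     # y_hat = 1 / (1 + e^(-w^T x))

def linear_model(w, x):
    return w @ x                          # y_hat = w^T x

def quadratic_model(W, x):
    return x @ W @ x                      # y_hat = x^T W x

x = np.array([0.8, 0.3])
w = np.array([1.0, -0.5])
print(logistic_model(w, x), linear_model(w, x), quadratic_model(np.eye(2), x))
```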
25 / 27
As an illustration, consider our movie example

Data: {(𝑥ᵢ = movie, 𝑦ᵢ = like/dislike)}ⁿᵢ₌₁

Model: Our approximation of the relation between x and y (the probability of liking a movie):

      ŷ = 1 / (1 + e^−(𝑤ᵀ𝑥))

Parameter: w

Learning algorithm: Gradient descent [we will see soon]

Objective/Loss/Error function: One possibility is

      L(w) = Σⁿᵢ₌₁ (ŷᵢ − 𝑦ᵢ)²

The learning algorithm should aim to find a w which minimizes the above function (the squared error between y and ŷ).
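Putting the pieces together, here is a minimal sketch of this setup trained with gradient descent (the toy data, learning rate, and number of steps are assumptions for illustration; the gradient descent algorithm itself is covered later):

```python
# Sketch: minimize L(w) = sum_i (y_hat_i - y_i)^2 for the logistic model by gradient descent.
import numpy as np

X = np.array([[0.9, 0.8], [0.8, 0.6], [0.2, 0.3], [0.1, 0.4]])   # toy movie features
y = np.array([1.0, 1.0, 0.0, 0.0])                               # like / dislike labels
Xb = np.hstack([X, np.ones((len(X), 1))])                        # absorb the bias into w
w = np.zeros(Xb.shape[1])
lr = 1.0

def predict(w, Xb):
    return 1 / (1 + np.exp(-(Xb @ w)))                           # y_hat = sigmoid(w^T x)

for _ in range(5000):
    p = predict(w, Xb)
    grad = Xb.T @ (2 * (p - y) * p * (1 - p))                    # dL/dw for the squared error
    w -= lr * grad

print(np.round(predict(w, Xb), 2))   # predictions move toward the 0/1 labels
```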

26 / 27
Summary

• Network of Perceptrons
• Sigmoid Neuron
• Learning setup

27 / 27
