MODULE 5
Outline
• Network of Perceptrons
• Sigmoid Neuron
• Learning setup
Network of Perceptrons
How to implement any boolean function using a network of perceptrons ...
• We claim that this network can be used to implement any boolean function (linearly separable or not)!
• In other words, we can find w₁, w₂, w₃, w₄ such that the truth table of any boolean function can be represented by this network
• Astonishing claim! Well, not really, if you understand what is going on
• Each perceptron in the middle layer fires only for a specific input (and no two perceptrons fire for the same input)
  the first perceptron fires for {−1, −1}
  the second perceptron fires for {−1, 1}
  the third perceptron fires for {1, −1}
  the fourth perceptron fires for {1, 1}
Let us see why this network works by taking an example of the XOR function
• Unlike before, there are no contradictions now and the system of inequalities can be satisfied
• Essentially each wᵢ is now responsible for one of the 4 possible inputs and can be adjusted to get the desired output for that input (a small code sketch of this construction follows)
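To make the construction concrete, here is a minimal Python sketch (not from the original slides) of the 2-input network described above: four hidden perceptrons over bipolar inputs {−1, 1}, each firing for exactly one input pattern, and an output perceptron whose weights w₁, ..., w₄ are chosen to produce XOR. The specific bias values (−2 for the hidden units, −1 for the output) are just one workable choice.

```python
def perceptron(x, w, b):
    """Fire (output 1) iff w . x + b >= 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

# One hidden perceptron per input pattern (bipolar inputs).
# The unit for pattern (a, b) uses weights (a, b) and bias -2, so
# a*x1 + b*x2 - 2 >= 0 holds only when (x1, x2) == (a, b).
patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

def hidden_layer(x):
    return [perceptron(x, p, -2) for p in patterns]

# Output perceptron: weight +1 for the patterns where XOR is 1, -1 otherwise.
# With bias -1 it fires exactly when the active hidden unit carries weight +1.
w_out = [-1, 1, 1, -1]

def xor_network(x):
    return perceptron(hidden_layer(x), w_out, -1)

for x in patterns:
    print(x, hidden_layer(x), "->", xor_network(x))
# (-1, -1) -> 0, (-1, 1) -> 1, (1, -1) -> 1, (1, 1) -> 0
```

Replacing w_out with the +1/−1 pattern of any other 2-input truth table gives that function, which is exactly the claim made above.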
• It should be clear that the same network can be used to represent the remaining 15 boolean functions also
• Each boolean function will result in a different set of non-contradicting inequalities which can be satisfied by appropriately setting w₁, w₂, w₃, w₄
• What if we have more than 2 inputs?
• Again, each of the 8 perceptrons will fire only for one of the 8 inputs
• Each of the 8 weights in the second layer is responsible for one of the 8 inputs and can be adjusted to produce the desired output for that input
What if we have n inputs?
• Answer: 2ⁿ perceptrons
• Theorem: Any boolean function of n inputs can be represented exactly by a network of perceptrons containing 1 hidden layer with 2ⁿ perceptrons and one output layer containing 1 perceptron
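As an illustration of the theorem, here is a hedged Python sketch of the generic construction: one hidden perceptron per input pattern (2ⁿ of them) and output weights read off the truth table of f. The helper names (build_network, parity) are ours, not from the slides.

```python
from itertools import product

def perceptron(x, w, b):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0

def build_network(f, n):
    """Represent a boolean function f over {-1, 1}^n exactly,
    using 2^n hidden perceptrons and one output perceptron."""
    patterns = list(product([-1, 1], repeat=n))
    # Hidden unit for pattern p fires only when x == p (dot product equals n).
    hidden = [(p, -n) for p in patterns]
    # Output weight +1 where f is 1, -1 where f is 0.
    w_out = [1 if f(p) else -1 for p in patterns]

    def network(x):
        h = [perceptron(x, w, b) for w, b in hidden]
        return perceptron(h, w_out, -1)

    return network

# Example: 3-input parity (not linearly separable) is represented exactly.
parity = lambda x: x.count(1) % 2
net = build_network(parity, 3)
assert all(net(x) == parity(x) for x in product([-1, 1], repeat=3))
```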
• Again, why do we care about boolean functions? How does this help us with our original problem, which was to predict whether we like a movie or not? Let us see!
• We are given this data about our past movie experience
• For each movie, we are given the values of the various factors (x₁, x₂, ..., xₙ) that we base our decision on, and we are also given the value of y (like/dislike); the pᵢ's are the points for which the output was 1 and the nⱼ's are the points for which it was 0
• The data may or may not be linearly separable
• The proof that we just saw tells us that it is possible to have a network of perceptrons and learn the weights in this network such that for any given pᵢ or nⱼ the output of the network will be the same as yᵢ or yⱼ (i.e., we can separate the positive and the negative points)
What are MLPs?
• Networks of the form that we just saw (containing an input layer, an output layer, and one or more hidden layers) are called Multilayer Perceptrons (MLPs, for short)
So far discussed
Sigmoid Neuron
• What about arbitrary functions of the form y = f(x) where x ∈ ℝⁿ (instead of {0, 1}ⁿ) and y ∈ ℝ (instead of {0, 1})?
• Can we have a network which can (approximately) represent such functions?
• The thresholding logic used by a perceptron is very harsh!
• For example, let us return to our problem of deciding whether we will like or dislike a movie
• Consider that we base our decision only on one input (x₁ = criticsRating, which lies between 0 and 1)
• If the threshold is 0.5 (w₀ = −0.5) and w₁ = 1, then what would be the decision for a movie with criticsRating = 0.51? (like)
• What about a movie with criticsRating = 0.49? (dislike)
• It seems harsh that we would like a movie with rating 0.51 but not one with a rating of 0.49 (see the sketch below)
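A minimal numerical sketch of this harshness, assuming the single-input setup above (w₀ = −0.5, w₁ = 1):

```python
def perceptron_decision(rating, w0=-0.5, w1=1.0):
    """Like (1) iff w1*rating + w0 >= 0, i.e. iff rating >= 0.5."""
    return 1 if w1 * rating + w0 >= 0 else 0

print(perceptron_decision(0.51))  # 1 (like)
print(perceptron_decision(0.49))  # 0 (dislike) -- a tiny change in input flips the decision
```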
• This behavior is not a characteristic of the specific problem we chose or the specific weight and threshold that we chose
• It is a characteristic of the perceptron function itself, which behaves like a step function
• There will always be this sudden change in the decision (from 0 to 1) when ∑ᵢ₌₁ⁿ wᵢxᵢ crosses the threshold (−w₀)
• For most real world applications we would expect a smoother decision function which gradually changes from 0 to 1
• Introducing sigmoid neurons, where the output function is much smoother than the step function
• Here is one form of the sigmoid function, called the logistic function:
  y = 1 / (1 + e^(−(w₀ + ∑ᵢ₌₁ⁿ wᵢxᵢ)))
• We no longer see a sharp transition around the threshold −w₀
• Also, the output y is no longer binary but a real value between 0 and 1 which can be interpreted as a probability
• Instead of a like/dislike decision we get the probability of liking the movie (a small sketch follows below)
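A brief sketch of the logistic neuron on the same one-input movie example (w₀ = −0.5, w₁ = 1); the ratings used are just illustrative:

```python
import math

def sigmoid_neuron(x, w, w0):
    """y = 1 / (1 + exp(-(w0 + sum_i w_i * x_i))) -- smooth, real-valued in (0, 1)."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

# Same one-input movie example: x1 = criticsRating, w0 = -0.5, w1 = 1
print(sigmoid_neuron([0.49], [1.0], -0.5))  # ~0.498 -- "probably dislike, but barely"
print(sigmoid_neuron([0.51], [1.0], -0.5))  # ~0.502 -- "probably like, but barely"
```

Unlike the step function, nearby inputs now get nearby outputs, which is the smoother behavior asked for above.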
A typical Supervised Machine Learning Setup
• Well, just as we had an algorithm for learning the weights of a perceptron, we also need a way of learning the weights of a sigmoid neuron
• Before we see such an algorithm we will revisit the concept of error
• Earlier we mentioned that a single perceptron cannot deal with this data because it is not linearly separable
• What does "cannot deal with" mean?
• What would happen if we use a perceptron model to classify this data?
• We would probably end up with a line like this ...
• This line doesn't seem to be too bad
• Sure, it misclassifies 3 blue points and 3 red points, but we could live with this error in most real world applications
• From now on, we will accept that it is hard to drive the error to 0 in most cases and will instead aim to reach the minimum possible error
This brings us to a typical machine learning setup, which has the following components ...
• Data: {(xᵢ, yᵢ)}ᵢ₌₁ⁿ
• Model: Our approximation of the relation between x and y. For example,
  ŷ = 1 / (1 + e^(−wᵀx))
  or
  ŷ = wᵀx
  or
  ŷ = xᵀWx
• Loss function: L(w) = ∑ᵢ₌₁ⁿ (ŷᵢ − yᵢ)²
• The learning algorithm should aim to find a w which minimizes the above function (the squared error between y and ŷ)
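Putting the components together, here is a brief Python sketch (the data values and weights are made up for illustration) of the sigmoid model and the squared-error loss; a learning algorithm would then search for the w that makes loss(w, data) as small as possible:

```python
import math

def model(x, w):
    """Sigmoid model: y_hat = 1 / (1 + exp(-(w^T x)))."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def loss(w, data):
    """L(w) = sum_i (y_hat_i - y_i)^2 over the training data."""
    return sum((model(x, w) - y) ** 2 for x, y in data)

# Toy data: (x, y) pairs with x in R^2 and y in {0, 1} (values made up for illustration).
data = [([0.9, 0.2], 1), ([0.1, 0.7], 0), ([0.8, 0.8], 1)]
print(loss([0.0, 0.0], data))   # loss at an arbitrary starting w
print(loss([2.0, -1.0], data))  # a different w happens to give a lower loss here
```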
Summary
• Network of Perceptrons
• Sigmoid Neuron
• Learning setup