Peer Instruction: Might Not Just Be One Correct Answer
CS 478 - Perceptrons 5
Expanded Neuron
Perceptron Learning Algorithm
Perceptron Node – Threshold Logic Unit

[Diagram: inputs x1 … xn with weights w1 … wn feed a threshold unit θ with output z]

z = 1 if Σ_{i=1}^{n} xi wi ≥ θ
z = 0 if Σ_{i=1}^{n} xi wi < θ
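The unit above can be written directly in Python. This is a minimal sketch (the function name is mine), using the two-input numbers that appear on a later slide:

```python
def tlu(x, w, theta):
    """Threshold logic unit: z = 1 if sum(x_i * w_i) >= theta, else 0."""
    net = sum(xi * wi for xi, wi in zip(x, w))
    return 1 if net >= theta else 0

print(tlu([.8, .3], [.4, -.2], .1))  # net = .32 - .06 = .26 >= .1, so prints 1
```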
Perceptron Node – Threshold Logic Unit

[Diagram: inputs x1 … xn with weights w1 … wn feed a threshold unit θ with output z]
Perceptron Learning Algorithm

[Diagram: two-input perceptron with weights w1 = .4, w2 = −.2 and threshold θ = .1]

Training set (x1, x2 -> t):
.8  .3  -> 1
.4  .1  -> 0

z = 1 if Σ_{i=1}^{n} xi wi ≥ θ, else z = 0
First Training Instance

Inputs (.8, .3) with weights (.4, −.2): net = .8(.4) + .3(−.2) = .32 − .06 = .26 ≥ θ = .1, so z = 1.
Target t = 1, so there is no error and no weight change.
Second Training Instance

Inputs (.4, .1) with weights (.4, −.2): net = .4(.4) + .1(−.2) = .16 − .02 = .14 ≥ θ = .1, so z = 1.
Target t = 0, so there is an error; apply the update Δwi = (t − z) · c · xi.
Perceptron Rule Learning

Δwi = c(t − z) xi

where wi is the weight from input i to the perceptron node, c is the learning rate, t is the target for the current instance, z is the current output, and xi is the i-th input.

Least perturbation principle:
– Only change weights if there is an error
– Use a small c rather than changing the weights enough to make the current pattern correct all at once
– Scale the change by xi

Create a perceptron node with n inputs.
Iteratively present patterns from the training set and apply the perceptron rule.
Each iteration through the training set is an epoch.
Continue training until the total training set error ceases to improve.
Perceptron Convergence Theorem: guaranteed to find a solution in finite time if a solution exists.
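The procedure above can be sketched as a short training loop. This is an illustrative implementation, not code from the course; the AND data, learning rate, and explicit threshold are my choices:

```python
def train_perceptron(patterns, targets, c=0.1, theta=0.0, max_epochs=100):
    """Train with the perceptron rule: dw_i = c * (t - z) * x_i."""
    w = [0.0] * len(patterns[0])  # illustrative zero initialization
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(patterns, targets):
            net = sum(wi * xi for wi, xi in zip(w, x))
            z = 1 if net >= theta else 0
            if z != t:  # least perturbation: only change weights on an error
                errors += 1
                w = [wi + c * (t - z) * xi for wi, xi in zip(w, x)]
        if errors == 0:  # an epoch with no errors means we have converged
            break
    return w

# AND is linearly separable, so the rule is guaranteed to converge
w = train_perceptron([(0, 0), (0, 1), (1, 0), (1, 1)], [0, 0, 0, 1],
                     c=0.1, theta=0.5)
```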
Augmented Pattern Vectors

1 0 1 -> 0
1 0 0 -> 1

Augmented version:
1 0 1 1 -> 0
1 0 0 1 -> 1

Treat the threshold like any other weight; no special case.
Call it a bias, since it biases the output up or down.
Since we start with random weights anyway, we can ignore the −θ notation and just think of the bias as an extra available weight. (Note the author uses a −1 input.)
Always use a bias weight.
Perceptron Rule Example

Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0).
Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t − z) xi

Training set:
0 0 1 -> 0
1 1 1 -> 1
1 0 1 -> 1
0 1 1 -> 0
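One way to check this example is to run the rule in code; here the bias is folded in as a fourth input fixed at 1 (an augmented pattern, per the earlier slide):

```python
patterns = [(0, 0, 1), (1, 1, 1), (1, 0, 1), (0, 1, 1)]
targets = [0, 1, 1, 0]
c = 1.0
w = [0.0, 0.0, 0.0, 0.0]  # three input weights plus a bias weight, all 0

for _ in range(10):
    changed = False
    for x, t in zip(patterns, targets):
        x = x + (1,)  # augment: bias input fixed at 1
        net = sum(wi * xi for wi, xi in zip(w, x))
        z = 1 if net > 0 else 0  # outputs 1 if net > 0, else 0
        if z != t:
            changed = True
            w = [wi + c * (t - z) * xi for wi, xi in zip(w, x)]
    if not changed:  # a clean epoch: training is done
        break

print(w)  # settles at [1.0, 0.0, 0.0, 0.0]: the output is simply x1
```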
**Challenge Question** - Perceptron

Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0).
Assume a learning rate c of 1 and initial weights all 0: Δwi = c(t − z) xi

Training set:
0 0 1 -> 0
1 1 1 -> 1
1 0 1 -> 1
0 1 1 -> 0
Perceptron Homework

Assume a 3-input perceptron plus bias (it outputs 1 if net > 0, else 0).
Assume a learning rate c of 1 and initial weights all 1: Δwi = c(t − z) xi
Show the weights after each pattern for just one epoch.

Training set:
1 0 1 -> 0
1 1 0 -> 0
1 0 1 -> 1
0 1 1 -> 1
Training Sets and Noise
0 0 1 0 1 1 0 0 1 1 0 -> 0 1 1 0
i.e. P(error) = .05
If there is no bias weight, the hyperplane must go through the origin.
Linear Separability
Linear Separability and Generalization
Limited Functionality of Hyperplane
How to Handle Multi-Class Output

This is an issue with any learning model which only supports binary classification (perceptron, SVM, etc.).

One-vs-rest: create 1 perceptron for each output class, where the training set considers all other classes to be negative examples.
– Run all perceptrons on novel data and set the output to the class of the perceptron which outputs high
– If there is a tie, choose the perceptron with the highest net value

One-vs-one: create 1 perceptron for each pair of output classes, where the training set only contains examples from the 2 classes.
– Run all perceptrons on novel data and set the output to be the class with the most wins (votes) from the perceptrons
– In case of a tie, use the net values to decide
– The number of models grows quadratically with the number of output classes
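A one-vs-rest sketch in Python. Everything here is hypothetical scaffolding: `train_binary` stands in for any trainer that returns a weight vector (bias folded in as the last weight), and prediction picks the class with the highest net value, which also handles ties:

```python
def one_vs_rest_train(X, y, classes, train_binary):
    # one perceptron per class; all other classes become negative examples
    return {c: train_binary(X, [1 if label == c else 0 for label in y])
            for c in classes}

def one_vs_rest_predict(models, x):
    # highest net value wins; this resolves ties when several perceptrons
    # output high (or none do)
    def net(w):
        return sum(wi * xi for wi, xi in zip(w, x + [1]))  # +1 is the bias input
    return max(models, key=lambda c: net(models[c]))

# pretend weights for two classes, just to exercise the decision rule
models = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0]}
print(one_vs_rest_predict(models, [2.0, 1.0]))  # class "a" has the higher net
```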
UC Irvine Machine Learning Repository
Iris Data Set
4.8,3.0,1.4,0.3, Iris-setosa
5.1,3.8,1.6,0.2, Iris-setosa
4.6,3.2,1.4,0.2, Iris-setosa
5.3,3.7,1.5,0.2, Iris-setosa
5.0,3.3,1.4,0.2, Iris-setosa
7.0,3.2,4.7,1.4, Iris-versicolor
6.4,3.2,4.5,1.5, Iris-versicolor
6.9,3.1,4.9,1.5, Iris-versicolor
5.5,2.3,4.0,1.3, Iris-versicolor
6.5,2.8,4.6,1.5, Iris-versicolor
6.0,2.2,5.0,1.5, Iris-virginica
6.9,3.2,5.7,2.3, Iris-virginica
5.6,2.8,4.9,2.0, Iris-virginica
7.7,2.8,6.7,2.0, Iris-virginica
6.3,2.7,4.9,1.8, Iris-virginica
Objective Functions: Accuracy/Error

How do we judge the quality of a particular model (e.g. a perceptron with a particular setting of weights)?
Consider how accurate the model is on the data set:
– Classification accuracy = # correct / total instances
– Classification error = # misclassified / total instances (= 1 − accuracy)
Usually we minimize a loss function (aka cost or error).
For real-valued outputs and/or targets:
– Pattern error = target − output: errors could cancel each other out
– Σ |ti − zi| (L1 loss)
Mean Squared Error
**Challenge Question** - Error

Given the following data set, what is the L1 (Σ|ti − zi|), SSE/L2 (Σ(ti − zi)²), MSE, and RMSE error for the entire data set?

x    y    Output  Target
2   -3    1       1
0    1    0       1
.5   .6   .8      .2

L1 = ?
SSE = ?
MSE = ?
RMSE = ?

A. .4, 1, 1, 1
B. 1.6, 2.36, 1, 1
C. .4, .64, .21, .46
D. 1.6, 1.36, .68, .82
E. None of the above
**Challenge Question** - Error

Given the following data set, what is the L1 (Σ|ti − zi|), SSE/L2 (Σ(ti − zi)²), MSE, and RMSE error for the entire data set?

x    y    Output  Target
2   -3    1       1
0    1    0       1
.5   .6   .8      .2

L1 = 1.6
SSE = 1.36
MSE = 1.36/3 = .45
RMSE = .45^.5 = .67

A. .4, 1, 1, 1
B. 1.6, 2.36, 1, 1
C. .4, .64, .21, .46
D. 1.6, 1.36, .68, .82
E. None of the above
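The answer key's arithmetic can be verified directly in Python (targets and outputs taken from the table above):

```python
import math

outputs = [1, 0, .8]
targets = [1, 1, .2]

l1 = sum(abs(t - z) for t, z in zip(targets, outputs))     # L1 loss
sse = sum((t - z) ** 2 for t, z in zip(targets, outputs))  # SSE / L2
mse = sse / len(targets)   # mean over the 3 instances
rmse = math.sqrt(mse)

print(round(l1, 2), round(sse, 2))    # 1.6 1.36
print(round(mse, 2), round(rmse, 2))  # 0.45 0.67
```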
SSE Homework
Given the following data set, what is the L1, SSE (L2),
MSE, and RMSE error of Output1, Output2, and the entire
data set? Fill in cells that have an x.
Gradient Descent Learning: Minimize (or Maximize) the Objective Function

[Figure: error landscape, plotting SSE (sum squared error, Σ (ti − zi)²) against weight values, with a minimum at 0]
Deriving a Gradient Descent Learning Algorithm

The goal is to decrease overall error (or another objective function) each time a weight is changed.
Total sum squared error is one possible objective function: E = Σ (ti − zi)²
Seek a weight-changing algorithm such that ∂E/∂wij is negative.
If such a formula can be found, then we have a gradient descent learning algorithm.
The delta rule is a variant of the perceptron rule which gives a gradient descent learning algorithm with perceptron nodes.
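Spelled out as a standard derivation (the slide states the goal but not the algebra): with net_p = Σi wi x_{p,i} and the error taken against net, as the delta rule uses,

```latex
E = \sum_p (t_p - \mathrm{net}_p)^2
\qquad\Longrightarrow\qquad
\frac{\partial E}{\partial w_i}
  = -2 \sum_p (t_p - \mathrm{net}_p)\, x_{p,i}
```

so a step Δwi = c (t − net) xi on each pattern moves each weight opposite the gradient, decreasing E.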
Delta Rule Algorithm

The delta rule uses (target − net), the error before the net value goes through the threshold, in the learning rule to decide the weight update: Δwi = c (t − net) xi
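A minimal stochastic sketch of the update, with illustrative one-dimensional data (the target here is exactly z = 0.5x, so the rule can drive the error to zero):

```python
def delta_rule_epoch(patterns, targets, w, c=0.1):
    """One stochastic epoch of the delta rule: dw_i = c * (t - net) * x_i.

    The error uses the raw net value, before any thresholding, so the
    weights keep adjusting even when the thresholded output is correct."""
    for x, t in zip(patterns, targets):
        net = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + c * (t - net) * xi for wi, xi in zip(w, x)]
    return w

w = [0.0]
for _ in range(200):
    w = delta_rule_epoch([(1.0,), (2.0,)], [0.5, 1.0], w, c=0.1)
print(w)  # converges toward [0.5]
```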
Batch vs Stochastic Update

To get the true gradient with the delta rule, we need to sum errors over the entire training set and only update the weights at the end of each epoch.
Batch (gradient) vs stochastic (on-line, incremental):
– With the stochastic delta rule algorithm, you update after every pattern, just like with the perceptron algorithm (even though that means each change may not be exactly along the true gradient)
– Stochastic is more efficient and best to use in almost all cases, though not everyone has figured that out yet
We'll talk about this a little more when we get to Backpropagation.
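For contrast, a batch sketch that accumulates the changes over the whole training set and applies them once per epoch, so each step follows the true gradient of the total SSE (the one-dimensional data and learning rate are illustrative choices):

```python
def delta_rule_batch_epoch(patterns, targets, w, c=0.1):
    """One batch epoch: accumulate dw_i = c * (t - net) * x_i over all
    patterns, then apply the summed change once at the end of the epoch."""
    dw = [0.0] * len(w)
    for x, t in zip(patterns, targets):
        net = sum(wi * xi for wi, xi in zip(w, x))
        dw = [dwi + c * (t - net) * xi for dwi, xi in zip(dw, x)]
    return [wi + dwi for wi, dwi in zip(w, dw)]

# illustrative data: the target is exactly z = 0.5 * x
w = [0.0]
for _ in range(200):
    w = delta_rule_batch_epoch([(1.0,), (2.0,)], [0.5, 1.0], w, c=0.1)
print(w)  # converges toward [0.5]
```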
Perceptron Rule vs Delta Rule

The perceptron rule (target − thresholded output) is guaranteed to converge to a separating hyperplane if the problem is linearly separable. Otherwise it may not converge and could get stuck in a cycle.
The single-layer delta rule is guaranteed to have only one global minimum. Thus it will converge to the best SSE solution whether the problem is linearly separable or not.
– It could have a higher misclassification rate than with the perceptron rule and a less intuitive decision surface; we will discuss this later with regression
Stopping criteria: for these models, stop when no longer making progress
– When you have gone a few epochs with no significant improvement/change between epochs (including oscillations)
Exclusive Or

[Figure: XOR plotted in the (x1, x2) plane: output 1 at (0,1) and (1,0), output 0 at (0,0) and (1,1); no single hyperplane separates the two classes]
Linearly Separable Boolean Functions

d = # of dimensions
P = 2^d = # of patterns
2^P = 2^(2^d) = # of Boolean functions

d    Total Functions    Linearly Separable Functions
0    2                  2
1    4                  4
2    16                 14
3    256                104
4    65,536             1,882
5    4.3 × 10^9         94,572
6    1.8 × 10^19        1.5 × 10^7
7    3.4 × 10^38        8.4 × 10^9
Linearly Separable Functions

LS(P, d) = 2 Σ_{i=0}^{d} (P−1)! / ((P−1−i)! i!)   for P > d
LS(P, d) = 2^P                                    for P ≤ d

lim_{d→∞} (# of LS functions) = ∞
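The count formula is easy to check in code. One caveat, as I understand it: LS(P, d) counts separable dichotomies of P points in general position, so for d ≥ 3 it is only an upper bound on the table's Boolean column, since the 2^d vertices of the hypercube are not in general position:

```python
from math import comb

def ls(P, d):
    """LS(P, d) = 2 * sum_{i=0}^{d} C(P-1, i) for P > d, else 2**P."""
    if P <= d:
        return 2 ** P
    return 2 * sum(comb(P - 1, i) for i in range(d + 1))

print(ls(4, 2))  # 14: matches the d = 2 row of the table
print(ls(8, 3))  # 128: an upper bound, while only 104 Boolean functions
                 # of 3 variables are actually linearly separable
```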
Linear Models which are Non-Linear in the Input Space

So far we have used f(x, w) = sign(Σ_{i=1}^{n} wi xi)
Quadric Machine

[Figure: one-dimensional data plotted along feature f1 from −3 to 3]
A perceptron with just feature f1 cannot separate the data.
Could we add a transformed feature to our perceptron?
Simple Quadric Example

[Figures: the data along f1, and the same data replotted in the (f1, f2) plane]
A perceptron with just feature f1 cannot separate the data.
Could we add another feature to our perceptron, f2 = f1²?
Note: you could also think of this as just using feature f1 but now allowing a quadric surface to separate the data.
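A quick numeric check of the idea, with hypothetical data chosen to match the picture (outer points one class, inner points the other): no threshold on f1 separates the classes, but a threshold on f2 = f1² does:

```python
f1 = [-3, -2, -1, 0, 1, 2, 3]   # hypothetical one-dimensional data
labels = [1, 1, 0, 0, 0, 1, 1]  # outer points class 1, inner points class 0

# try every threshold on f1, in both orientations
candidates = [x - 0.5 for x in f1] + [f1[-1] + 0.5]
separable_on_f1 = any(
    all((x > t) == bool(y) for x, y in zip(f1, labels)) or
    all((x < t) == bool(y) for x, y in zip(f1, labels))
    for t in candidates
)

# the transformed feature f2 = f1**2 is separable with a single threshold
f2 = [x ** 2 for x in f1]
separable_on_f2 = all((v > 2.5) == bool(y) for v, y in zip(f2, labels))

print(separable_on_f1, separable_on_f2)  # False True
```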
Quadric Machine Homework

Assume a 2-input perceptron expanded to be a quadric perceptron (it outputs 1 if net > 0, else 0). Note that with binary inputs of −1 and 1, x² and y² would always be 1 and thus add no information and are not needed (they would just act like two more bias weights).
Assume a learning rate c of .4 and initial weights all 0: Δwi = c(t − z) xi
Show the weights after each pattern for one epoch with the following non-linearly separable training set (XOR).
Has it learned to solve the problem after just one epoch?
Which of the quadric features are actually needed to solve this training set?

x    y    Target
-1   -1   0
-1    1   1
 1   -1   1
 1    1   0