
Neural Networks and Deep Learning

A brain-inspired way to learn from patterns

COURSE: CS60045

Pallab Dasgupta
Professor,
Dept. of Computer Sc & Engg

INDIAN INSTITUTE OF TECHNOLOGY KHARAGPUR
Brain-inspired computing

• Simple units
• The power is in the network

Preliminaries

• Deciding the capacity of the model:
   • Under-fitting, if the capacity is too weak
   • Over-fitting, if the capacity is unnecessarily large

• A neural network offers a generic model which provides:
   • Structural variants, so as to scale the capacity up or down
   • Various types of activation functions, which enable the modeling of various types of functions

Neural Networks
A neural network consists of a set of nodes (neurons/units) connected by links.

• Each link has a numeric weight $W_{j,i}$. Each unit also has a bias weight $W_{0,i}$ on a link from a fixed input $a_0 = -1$.

Each unit has:
• a set of input links from other units,
• a set of output links to other units,
• a current activation level, and
• an activation function to compute the activation level in the next time step.

[Figure: a single unit i with input links carrying activations a_j, an input function computing in_i, and an activation function g producing the output activation a_i]

$in_i = \sum_{j=0}^{n} W_{j,i}\, a_j \qquad\qquad a_i = g(in_i) = g\!\left(\sum_{j=0}^{n} W_{j,i}\, a_j\right)$
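As a concrete illustration of these two formulas, here is a minimal Python sketch of a single unit's computation. The choice of a sigmoid for g and all numeric values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def g(x):
    """Activation function; a sigmoid is assumed here purely for illustration."""
    return 1.0 / (1.0 + np.exp(-x))

a = np.array([-1.0, 0.5, 0.9])   # incoming activations; a[0] = -1 is the fixed bias input
W = np.array([0.2, -0.4, 0.7])   # weights W_{j,i} into unit i; W[0] is the bias weight

in_i = W @ a                     # input function: in_i = sum_j W_{j,i} a_j
a_i = g(in_i)                    # output activation: a_i = g(in_i)
print(in_i, a_i)
```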

Perceptron
[Figure: a perceptron with inputs x_0 = −1, x_1, x_2, weights W_0, W_1, W_2, an input function computing in, and a threshold activation function g producing the output activation a = g(in)]

$in = \sum_{j=0}^{2} W_j\, x_j \qquad\qquad a = \begin{cases} 0 & \text{if } in \le 0 \\ 1 & \text{if } in > 0 \end{cases}$

Studying a perceptron helps us to understand the limitations in capacity and the corresponding inability to model certain types of functions.

Perceptron
Linear function:

$in = x_1 W_1 + x_2 W_2 - W_0 \qquad\qquad a = \begin{cases} 0 & \text{if } in \le 0 \\ 1 & \text{if } in > 0 \end{cases}$
AND: W1 = 1, W2 = 1, W0 = 1
in = x1 + x2 − 1

OR: W1 = 2, W2 = 2, W0 = 1
in = 2x1 + 2x2 − 1

What about XOR?
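To make the AND and OR examples concrete, the sketch below (illustrative Python, not part of the slides) implements the threshold unit with the weights given above, and then brute-forces a small weight grid to show that no single threshold unit reproduces XOR.

```python
import numpy as np

def perceptron(x1, x2, W1, W2, W0):
    """Threshold unit from the slide: a = 1 if x1*W1 + x2*W2 - W0 > 0, else 0."""
    return 1 if x1 * W1 + x2 * W2 - W0 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = perceptron(x1, x2, W1=1, W2=1, W0=1)   # AND weights from the slide
        or_out = perceptron(x1, x2, W1=2, W2=2, W0=1)    # OR weights from the slide
        print(f"x = ({x1}, {x2})   AND = {and_out}   OR = {or_out}")

# XOR is not linearly separable, so a brute-force search over a small weight
# grid finds no single threshold unit that reproduces it.
xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
grid = np.arange(-2.0, 2.5, 0.5)
found = any(
    all(perceptron(x1, x2, W1, W2, W0) == y for (x1, x2), y in xor_table.items())
    for W1 in grid for W2 in grid for W0 in grid
)
print("Single perceptron computing XOR found:", found)   # prints False
```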

Multiple Layers Increase the Capacity

The black and white dots are not linearly separable; that is, no linear function of the following form separates them:

$in = x_1 W_1 + x_2 W_2 - W_0$

With two layers, it is possible to model the XOR function, as the sketch below illustrates.
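A hand-wired two-layer network for XOR, written as plain Python for illustration. The particular hidden-unit weights are assumptions (they reuse the AND and OR weights from the earlier slide); the slides do not prescribe a specific construction.

```python
def step(x):
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    """Two-layer threshold network: XOR(x1, x2) = OR(x1, x2) AND NOT AND(x1, x2)."""
    h1 = step(2 * x1 + 2 * x2 - 1)   # hidden unit computing OR
    h2 = step(x1 + x2 - 1)           # hidden unit computing AND
    return step(h1 - h2 - 0.5)       # output unit fires only when h1 = 1 and h2 = 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), xor_net(x1, x2))   # reproduces the XOR truth table
```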

Supervised Learning by back-propagating errors
The basic idea:
• We compute the output error as: Error = golden output (y) − output of network (a)
• The training error function computed over all training data is:
  $E = \frac{1}{2}\sum_i (y_i - a_i)^2$
• We wish to find values of W_j such that E is minimum over the training data.
• For this purpose we may iteratively do the following:
   • Present a training sample to the network
   • Compute the error for this output
   • Distribute the error in proportion to the contribution of the nodes and readjust the weights accordingly

[Figure: a training input is fed to the neural network; its output is compared against the golden output, and the resulting error is used to adjust the weights]

Learning in Single Layered Networks
Idea: Optimize the weights so as to minimize the error function:

$E = \frac{1}{2} Err^2 = \frac{1}{2}\left(y - g\!\left(\sum_{j=0}^{n} W_j\, x_j\right)\right)^2$

We can use gradient descent to reduce the squared error by calculating the partial derivative of E with respect to each weight:

$\frac{\partial E}{\partial W_j} = Err \times \frac{\partial Err}{\partial W_j} = Err \times \frac{\partial}{\partial W_j}\left(y - g\!\left(\sum_{j=0}^{n} W_j\, x_j\right)\right) = -Err \times g'(in) \times x_j$

Weight update rule:
$W_j \leftarrow W_j + \alpha \times Err \times g'(in) \times x_j$
where α is the learning rate.

We purposefully eliminate a fraction of the error through the weight adjustment rule, but not the whole of it. Why?
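A minimal sketch of this update rule in Python. The sigmoid choice for g, the learning rate, and the toy OR training set are all illustrative assumptions; the slide leaves them unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):                        # assumed activation: sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

# Toy training data: learn OR with a single unit; column 0 is the fixed bias input -1.
X = np.array([[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 1.0])

W = rng.normal(scale=0.1, size=3)
alpha = 0.5                                       # learning rate

for epoch in range(2000):
    for x_row, target in zip(X, y):
        in_ = W @ x_row                           # in = sum_j W_j x_j
        err = target - g(in_)                     # Err = y - g(in)
        W += alpha * err * g_prime(in_) * x_row   # W_j <- W_j + α · Err · g'(in) · x_j

print(np.round(g(X @ W)))                         # close to [0, 1, 1, 1]
```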
Multi-Layer Feed-Forward Network
Weight updation rule at the output layer (same form as in the single-layer case, with the hidden activation a_j in place of the input x_j):

$W_{j,i} \leftarrow W_{j,i} + \alpha \times Err_i \times g'(in_i) \times a_j$

[Figure: output units O_i connected to hidden units a_j by weights W_{j,i}; hidden units connected to input units I_k by weights W_{k,j}]

In multilayer networks, the hidden layers also contribute to the error at the output.
• So the important question is: how do we revise the weights into the hidden layers?

Back-Propagation Learning

• To update the connections between the input units


and the hidden units, we need to define a quantity
analogous to the error term for output nodes
• The propagation rule for the ∆ values is
the following:
• We do an error back-propagation, defining error as 𝚫𝚫𝒋𝒋 = 𝒈𝒈𝒈(𝒊𝒊𝒊𝒊𝒋𝒋 ) ∑𝒊𝒊 𝑾𝑾𝒋𝒋,𝒊𝒊 𝚫𝚫𝒊𝒊
𝚫𝚫𝒊𝒊 = 𝑬𝑬𝑬𝑬𝑬𝑬𝒊𝒊 × 𝒈𝒈𝒈(𝒊𝒊𝒊𝒊𝒊𝒊 )
• The idea is that a hidden node j is responsible for
• The update rule for the hidden layers is:
some fraction of the error in each of the output
𝑾𝑾𝒌𝒌,𝒋𝒋 ← 𝑾𝑾𝒌𝒌,𝒋𝒋 + 𝜶𝜶 × 𝒂𝒂𝒌𝒌 × 𝚫𝚫𝒋𝒋
nodes to which it connects

• Thus the ∆i values are divided according to the


strength of the connection between the hidden
node and the output node and are propagated
back to provide the ∆j values for the hidden layer.
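Putting the two Δ rules together, the following Python sketch trains a tiny network on XOR with per-sample gradient descent. The sigmoid activation, the network size (two hidden units), the learning rate, and the use of explicit bias inputs a_0 = −1 are all illustrative assumptions rather than choices made in the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):                        # assumed activation: sigmoid
    return 1.0 / (1.0 + np.exp(-x))

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)

def with_bias(v):
    """Prepend the fixed bias input a_0 = -1, as on the earlier slides."""
    return np.concatenate(([-1.0], v))

# Tiny network trained on XOR: 2 inputs (+ bias) -> 2 hidden units (+ bias) -> 1 output.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

W_kj = rng.normal(scale=1.0, size=(3, 2))   # input-to-hidden weights (row 0: bias weights)
W_ji = rng.normal(scale=1.0, size=(3, 1))   # hidden-to-output weights (row 0: bias weight)
alpha = 0.5

def forward(x):
    a_k = with_bias(x)
    in_j = a_k @ W_kj                        # hidden pre-activations
    a_j = with_bias(g(in_j))
    in_i = a_j @ W_ji                        # output pre-activations
    return a_k, in_j, a_j, in_i, g(in_i)

for epoch in range(20000):
    for x, y in zip(X, Y):
        a_k, in_j, a_j, in_i, a_i = forward(x)
        delta_i = (y - a_i) * g_prime(in_i)             # Δ_i = Err_i · g'(in_i)
        delta_j = g_prime(in_j) * (W_ji[1:] @ delta_i)  # Δ_j = g'(in_j) Σ_i W_{j,i} Δ_i
        W_ji += alpha * np.outer(a_j, delta_i)          # W_{j,i} <- W_{j,i} + α · a_j · Δ_i
        W_kj += alpha * np.outer(a_k, delta_j)          # W_{k,j} <- W_{k,j} + α · a_k · Δ_j

print([round(forward(x)[-1].item()) for x in X])        # typically [0, 1, 1, 0]
```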

The mathematics behind the updation rule
The squared error on a single example is defined as:
$E = \frac{1}{2}\sum_i (y_i - a_i)^2$
where the sum is over the nodes in the output layer. To obtain the gradient with respect to a specific weight W_{j,i} in the output layer, we need only expand out the activation a_i, as all other terms in the summation are unaffected by W_{j,i}:

$\frac{\partial E}{\partial W_{j,i}} = -(y_i - a_i)\,\frac{\partial a_i}{\partial W_{j,i}} = -(y_i - a_i)\,\frac{\partial g(in_i)}{\partial W_{j,i}} = -(y_i - a_i)\, g'(in_i)\,\frac{\partial in_i}{\partial W_{j,i}}$

$= -(y_i - a_i)\, g'(in_i)\,\frac{\partial}{\partial W_{j,i}}\left(\sum_j W_{j,i}\, a_j\right) = -(y_i - a_i)\, g'(in_i)\, a_j = -a_j\, \Delta_i$

Hence the update rule at the output layer:
$W_{j,i} \leftarrow W_{j,i} + \alpha \times a_j \times \Delta_i$

[Figure: output unit a_i receives input from hidden unit a_j over the weight W_{j,i}]
The mathematics contd.

For a weight W_{k,j} into the hidden layer, the gradient involves all output nodes:

$\frac{\partial E}{\partial W_{k,j}} = -\sum_i (y_i - a_i)\,\frac{\partial a_i}{\partial W_{k,j}} = -\sum_i (y_i - a_i)\, g'(in_i)\,\frac{\partial in_i}{\partial W_{k,j}}$

$= -\sum_i \Delta_i\,\frac{\partial}{\partial W_{k,j}}\left(\sum_j W_{j,i}\, a_j\right) = -\sum_i \Delta_i\, W_{j,i}\,\frac{\partial a_j}{\partial W_{k,j}} = -\sum_i \Delta_i\, W_{j,i}\,\frac{\partial g(in_j)}{\partial W_{k,j}}$

$= -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\,\frac{\partial in_j}{\partial W_{k,j}} = -\sum_i \Delta_i\, W_{j,i}\, g'(in_j)\,\frac{\partial}{\partial W_{k,j}}\left(\sum_k W_{k,j}\, a_k\right)$

$= -\left(\sum_i \Delta_i\, W_{j,i}\right) g'(in_j)\, a_k = -a_k\, \Delta_j$

Hence the update rule for the weights into the hidden layer:
$W_{k,j} \leftarrow W_{k,j} + \alpha \times a_k \times \Delta_j$

[Figure: input unit a_k connects to hidden unit a_j via W_{k,j}, which in turn connects to output unit a_i via W_{j,i}]
Problems with this Learning

• The weight updation rules define a single step of


gradient descent
• Gradient descent may reach a local minima
• The minimum training error reached at the end
of training is not the best

• The final network is not explainable. We do not know


what the network has learned.
• For a single layer network, the error can be
explained in terms of the inputs and the weights
• In a multi-layer network, the hidden layers do
not make any sense to the end user

Convolutional and Recurrent Neural Networks

• Convolution is useful for learning artifacts that have a small locality of reference
• Recurrence is useful for learning sequences

The Convolution Operation

Suppose we are tracking the location of a spaceship with a laser sensor.

• Our laser sensor produces a single output x(t), the position of the spaceship at time t.
• Suppose that our laser sensor is somewhat noisy, and we therefore wish to take the average of multiple measurements.
• More recent measurements should get more weight, so we need a weighting function w(a), which returns the weight of a measurement taken a time units in the past:

$s(t) = \int x(a)\, w(t - a)\, da = (x * w)(t)$

This operation is called convolution. The first argument, x( ), is called the input, and the
second argument, w( ), is called the kernel.

Discrete Convolution

If we assume that x and w are defined only on integer t, we can define the discrete convolution:

$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$

Convolution can also be defined over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we may want to use a two-dimensional kernel K:

$s(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n)\, K(i - m, j - n)$

Convolution is commutative, that is, we can also write (by replacing m by i − m and n by j − n):

$s(i, j) = (K * I)(i, j) = \sum_m \sum_n I(i - m, j - n)\, K(m, n)$
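A from-scratch sketch of the two-dimensional discrete convolution above, restricted to the "valid" region where the kernel fits entirely inside the image. The toy image and kernel are arbitrary illustrative values; real systems use optimized library routines rather than these explicit loops.

```python
import numpy as np

def conv2d(I, K):
    """s(i, j) = sum_m sum_n I(m, n) K(i - m, j - n), computed over the 'valid' region."""
    kh, kw = K.shape
    K_flipped = K[::-1, ::-1]                  # flipping the kernel turns cross-correlation into convolution
    out_h = I.shape[0] - kh + 1
    out_w = I.shape[1] - kw + 1
    S = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K_flipped)
    return S

I = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])                    # toy 2x2 kernel
print(conv2d(I, K))                            # 4x4 output feature map
```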

Convolution Networks help us to learn image filters

Machine learning can be used to learn these filters.
• The weights of a convolutional network are learned.
• What does the network look like?

If the kernel width is small, the network will be sparse

Convolution and Pooling

A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs.

[Figure: a set of three learned filters; two differently written 5s activate different filters, but the output of the pooling unit is the same in both cases, hence both 5s are recognized]
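A small Python sketch of max pooling (one common choice of summary statistic; the slide does not fix which statistic is used). The two toy feature maps are illustrative: their strong responses sit at slightly different positions, yet the pooled outputs coincide, which is the invariance the slide alludes to.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Replace each size x size window with its maximum (a summary statistic of nearby outputs)."""
    h, w = feature_map.shape
    out = np.zeros((1 + (h - size) // stride, 1 + (w - size) // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            window = feature_map[i * stride:i * stride + size,
                                 j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

a = np.array([[0., 9., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 0., 7.],
              [0., 0., 0., 0.]])
b = np.array([[9., 0., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 0., 0.],
              [0., 0., 7., 0.]])
print(max_pool(a))
print(max_pool(b))   # identical pooled outputs despite the shifted activations
```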

Sequence Modeling: Recurrent and Recursive Networks

• Recurrent Neural Networks (RNNs) are a family of neural networks for processing sequential data
• Recurrent networks can scale to much longer sequences than would be practical for networks without
sequence-based specialization
• Most recurrent networks can also process sequences of variable length
• The key idea behind RNNs is parameter sharing
• For example, in a dynamical system, the parameters of the transfer function do not change with time
• Therefore we can use the same part of the neural network over and over again

Unfolding Computation
Consider a dynamical system:
$s^{(t)} = f\!\left(s^{(t-1)}; \theta\right)$
where s^(t) is the state at time t and θ is the set of parameters of f.

• The state after a finite number of steps can be obtained by applying the definition recursively. For example, after 3 steps:
  $s^{(3)} = f\!\left(s^{(2)}; \theta\right) = f\!\left(f\!\left(s^{(1)}; \theta\right); \theta\right)$

• For a dynamical system driven by an external input signal x^(t):
  $s^{(t)} = f\!\left(s^{(t-1)}, x^{(t)}; \theta\right)$

Unfolding computation and Recurrent Network

[Figure: a recurrent cell with state h, input x, and transition function f, unfolded in time into a chain h^(t−1) → h^(t) → h^(t+1) driven by inputs x^(t−1), x^(t), x^(t+1), with the same f at every step]

$h^{(t)} = f\!\left(h^{(t-1)}, x^{(t)}; \theta\right)$

• Regardless of the sequence length, the learned model always has the same input size, because it is specified
in terms of transition from one state to another state, rather than specified in terms of a variable-length history
of states
• It is possible to use the same transition function f with the same parameters at each step
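A minimal sketch of this idea in Python. The specific transition function f(h, x) = tanh(W h + U x + b) and the dimensions are illustrative assumptions; the point is only that the same parameters θ = (W, U, b) are reused at every time step, so the model size is independent of the sequence length.

```python
import numpy as np

rng = np.random.default_rng(0)

state_size, input_size = 4, 3
W = rng.normal(scale=0.5, size=(state_size, state_size))   # part of θ
U = rng.normal(scale=0.5, size=(state_size, input_size))   # part of θ
b = np.zeros(state_size)                                   # part of θ

def f(h_prev, x_t):
    """Assumed transition function h(t) = f(h(t-1), x(t); θ) = tanh(W h + U x + b)."""
    return np.tanh(W @ h_prev + U @ x_t + b)

xs = rng.normal(size=(7, input_size))   # a length-7 input sequence
h = np.zeros(state_size)                # initial state h(0)
for x_t in xs:                          # unfolding: the same f is applied at every step
    h = f(h, x_t)
print(h)                                # the final state summarizes the whole sequence
```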

Useful topologies of RNNs

• RNNs that produce an output at each time step and have recurrent connections between hidden units

• RNNs that produce an output at each time step and have recurrent connections only from the output at one
time step to the hidden units at the next time step

• RNNs with recurrent connections between hidden units, that read an entire sequence and then produce a
single output

RNN with hidden-hidden feedback

An RNN with hidden-hidden feedback is universal: any function computable by a Turing machine can be computed by such an RNN of finite size (the weights are allowed to have infinite precision).

(Figure from Deep Learning, Goodfellow, Bengio and Courville.)
RNN with output-hidden feedback

Less powerful than the hidden-hidden feedback model.
Advantage: each time step can be trained in isolation (why?).

(Figure from Deep Learning, Goodfellow, Bengio and Courville.)
RNN with output only at the end

Can be used to summarize a sequence and produce a fixed-size representation to be used as an input for further processing.

(Figure from Deep Learning, Goodfellow, Bengio and Courville.)
Boltzmann Machines
A Boltzmann machine is a network of units with an energy defined for the overall network. Its units produce binary outputs. The global energy E is:

$E = -\left(\sum_{i<j} w_{ij}\, s_i\, s_j + \sum_i \theta_i\, s_i\right)$

where:
• w_ij is the connection strength between unit j and unit i,
• s_i ∈ {0, 1} is the state of unit i,
• θ_i is the bias of unit i in the global energy function (−θ_i is the activation threshold for the unit).

The energy gap for unit i (the difference in global energy between s_i = 0 and s_i = 1) is:

$\Delta E_i = \sum_{j>i} w_{ij}\, s_j + \sum_{j<i} w_{ji}\, s_j + \theta_i$

From this we obtain (the scalar T is called the temperature):

$p_{i=\mathrm{ON}} = \frac{1}{1 + \exp\!\left(-\dfrac{\Delta E_i}{T}\right)}$
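A small stochastic-update sketch of these formulas. The network size, weights, biases, and temperature are illustrative assumptions; the update visits each unit in turn and switches it ON with probability p_{i=ON} as given above.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 5
w = rng.normal(scale=0.5, size=(n, n))
w = np.triu(w, 1) + np.triu(w, 1).T           # symmetric connection strengths, zero diagonal
theta = rng.normal(scale=0.1, size=n)         # biases θ_i
s = rng.integers(0, 2, size=n).astype(float)  # binary states s_i in {0, 1}
T = 1.0                                       # temperature

def energy(s):
    # E = -( sum_{i<j} w_ij s_i s_j + sum_i θ_i s_i ); the 0.5 compensates for double counting
    return -(0.5 * s @ w @ s + theta @ s)

def update(s):
    """Visit each unit and turn it ON with probability 1 / (1 + exp(-ΔE_i / T))."""
    for i in range(n):
        delta_E = w[i] @ s + theta[i]         # energy gap for unit i (w_ii = 0, so s_i does not contribute)
        p_on = 1.0 / (1.0 + np.exp(-delta_E / T))
        s[i] = 1.0 if rng.random() < p_on else 0.0
    return s

for step in range(100):
    s = update(s)
print(s, energy(s))
```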

[Figure omitted; source: DARPA]
The ML problem in regression
What is the function f(·)?

Solution: this is where the different ML methods come in.
• Linear model: $f(x) = w^T x$
• Linear basis functions: $f(x) = w^T \phi(x)$, as sketched below
   • where $\phi(x) = [\phi_0(x)\ \ \phi_1(x)\ \dots\ \phi_L(x)]^T$ and $\phi_l(x)$ is a basis function.
   • Choices for the basis functions:
      • Powers of x: $\phi_l(x) = x^l$
      • Gaussian / Sigmoidal / Fourier / …
      • Neural networks
      • …
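A least-squares sketch of fitting f(x) = wᵀφ(x) with the polynomial basis φ_l(x) = x^l. The toy data, the noise level, and the degree L = 3 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = np.sin(3 * x) + 0.1 * rng.normal(size=x.size)        # toy noisy target values

L = 3                                                     # highest power used in the basis
Phi = np.stack([x ** l for l in range(L + 1)], axis=1)    # design matrix [φ_0(x) ... φ_L(x)]
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)               # least-squares weights

f_hat = Phi @ w                                           # fitted values f(x) = w^T φ(x)
print("weights:", np.round(w, 3))
print("mean squared error:", np.mean((y - f_hat) ** 2))
```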

Classification
Given a training data set with:
• Input values: $x_n = [x_1\ x_2\ \dots\ x_M]^T$ for $n = 1 \dots N$.
• Output class labels, for example:
   • 0/1 or −1/+1 for binary classification problems
   • 1 … K for multi-class classification problems
   • the 1-of-K coding scheme (illustrated in the sketch below):
     $y = [0 \dots 0\ 1\ 0 \dots 0]^T$
     where, if x_n belongs to class k, then the k-th bit is 1 and all others are 0.

Objective: Predict the output class for new, unknown inputs $\hat{x}_m$.
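A tiny helper illustrating the 1-of-K coding scheme. The function name one_of_k and the example labels are hypothetical; the labels are assumed to be integers 1 … K, matching the slide's convention.

```python
import numpy as np

def one_of_k(labels, K):
    """1-of-K coding: row n is all zeros except a 1 at the position of class k."""
    Y = np.zeros((len(labels), K))
    for n, k in enumerate(labels):
        Y[n, k - 1] = 1.0          # labels are assumed to run from 1 to K
    return Y

labels = [2, 1, 3, 2]              # illustrative class labels for N = 4 samples, K = 3 classes
print(one_of_k(labels, K=3))
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```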

Classification strategies

[Figure: linear discriminants (2-class classifiers) compared with a K-class discriminant]

Combining 2-class classifiers to obtain multi-class classifiers is a bad idea!
