0% found this document useful (0 votes)
2 views

Chapter02-Introduction-to-DeepLearning

Chapter 02 introduces deep learning concepts, including linear regression, classification, and the workings of neural networks. It discusses various learning algorithms such as supervised, unsupervised, and reinforcement learning, and explains the importance of cost functions and optimization techniques like gradient descent. The chapter also covers the perceptron model and the significance of automated feature extraction in deep learning compared to classical machine learning.

Uploaded by

aslfgy1979
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

Chapter02-Introduction-to-DeepLearning

Chapter 02 introduces deep learning concepts, including linear regression, classification, and the workings of neural networks. It discusses various learning algorithms such as supervised, unsupervised, and reinforcement learning, and explains the importance of cost functions and optimization techniques like gradient descent. The chapter also covers the perceptron model and the significance of automated feature extraction in deep learning compared to classical machine learning.

Uploaded by

aslfgy1979
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 84

Chapter 02

Introduction to
deep learning
Woo Youn Kim
Professor of Chemistry &
Graduate School of Data Science, KAIST
CEO, HITS
Contents
• Linear regression

• Linear classification

• Concept of deep learning

• Multi-layer perceptron & Non-linearity

• Forward propagation for prediction

• Backward propagation for training


LAIDD: Chapter 3 (linear regression and classification) of the Lecture series on “QSAR(Quantitative Structure
Activity Relationship)” (struct 2
Statistical modeling

hypothesis

observations

• Statistical modeling aims to obtain a hypothesis with training samples (observations) {𝐗, 𝐘}, where
we consider that samples are independent and identically distributed (i.i.d.) random variables
sampled from the joint distribution 𝑝(𝐗, 𝐘).

• Statistical inference is a procedure of inferring a new output 𝒚∗ of a new input 𝒙∗ by using the
obtained hypothesis. 3
Learning algorithms

• Reinforcement learning: to learn a decision (policy) from experience by interacting with environments
• Supervised learning: to learn a hypothesis that best explains the input-label relationship
• Unsupervised learning: to learn features, which are also called as representations or latent variables
• Semi-supervised learning: obtaining a hypothesis by using both labeled and unlabeled data
• Self-supervised learning:
4
Supervised learning
Solubility
Toxicity
Drug efficacy
.
.
.

Input data Output data


Feature extraction (L)
Structure (X) Property (Y)

Y = f(X); f = neural networks

• Regression: predicting values


• Classification: predicting active or inactive
5
Linear regression

6
Regression
Let us start with a talk about a few examples of supervised learning problems.

X Y
hypothesis
Molecular structure Energy
Molecular structure Solubility
Molecular structure Biological activity

observations

𝑋
When the target variable that we’re trying to predict is continuous, we apply a
regression method to it.

7
Regression (linear hypothesis)

𝑊: parameters or weights
𝑏 : bias

The figure illustrates how linear regression works by fitting different lines
(hypotheses) to a given dataset. It visualizes the equation H(x)=Wx+b, where
W is the weight (or slope), and b is the bias (intercept). The goal is to find the
best-fitting line that minimizes the error between the predicted values and the
actual data points. ​
8
Cost function
How fit the line to our (training) data numerically?

9
Cost minimization

10
Simplified hypothesis

To simplify the calculation, we ignore the bias term in the hypothesis and thus in the cost function.

11
What cost looks like?
• 𝑾 = 𝟏, 𝒄𝒐𝒔𝒕 𝑾 = 𝟎
1 2 2
( 1∗1−1 + 1∗2−2 + 1 ∗ 3 − 3 2)
3

• 𝑾 = 𝟎, 𝒄𝒐𝒔𝒕 𝑾 = 𝟒. 𝟔𝟕
1 2 2
( 0∗1−1 + 0∗2−2 + 0 ∗ 3 − 3 2)
3
...

This graph shows the cost function Cost(W) with respect


to different values of W, which represents the weights in
a linear regression model.

12
Gradient descent algorithm
• Minimize the cost function using gradient descent algorithm

• Start with an initial guess


- Start at 0,0 (or any other value)
- Keep changing W and b a little bit to try and reduce cost(W, b)

• Each time you change the parameters, you select the gradient
which reduces cost(W, b) the most possible

a (= DW) : updating rate (or learning rate)


- the larger, the faster
- but may cause ill-convergence

• Repeat until you converge to a local minimum (i.e., the gradient = 0)

13
Least Mean Squares (LMS)

The resulting equation is called the LMS update rule (LMS stands for “least mean squares”)

14
Convex function

Note that the optimization problem here has only


one global, and no other local optima.

Thus, gradient descent always converges to the


the global minimum unless the learning rate α is
too large.
cost(W, b)

15
Gradient descent algorithm
• Batch gradient descent
Repeat until convergence {

scan through the entire training set before taking a single step:
computationally expensive: suppose you have million samples!

• Stochastic gradient descent


(also incremental gradient descent)
Loop {
for i=1 to m, {

}
}
start making progress right away, and continue to make progress with each example it looks at.
much faster convergence than the batch gradient descent.
16
Gaussian noise
Suppose that the target variables and the inputs are related via the equation.

where 𝜖 𝑖
is an error term that captures either unmodeled effects or random noise.

If 𝜖 𝑖 is independently and identically distributed according to a Gaussian distribution (also


called a Normal distribution) with mean zero and some variance 𝜎 2 ,

Thus,

The notation “𝑝 𝑦 𝑖 |𝑥 𝑖 ; 𝜃 ” indicates that this is the distribution of 𝑦 𝑖


given 𝑥 𝑖
and parameterized by 𝜃.

17
Likelihood
Given a data set, 𝑋 𝑥 𝑖
, and a ML model 𝜃, what is the distribution of target, Y 𝑦 𝑖
?

The probability of the data is given by p(Y|X; 𝜃), which is viewed a function of Y for a fixed value of 𝜃.

It can be instead called the likelihood function.

What is a reasonable way of choosing our best guess of the parameters 𝜃 or training the machine
learning model?

18
Maximum likelihood
The principal of maximum likelihood says that we should choose 𝜃 so as to make
the data as highly probable as possible.

That is, we should choose 𝜃 to maximize 𝐿 𝜃 .

19
Maximum likelihood
Instead of maximizing 𝐿 𝜃 , we can also maximize any strictly increasing function of 𝐿 𝜃 .

For example, maximize the log likelihood ℓ 𝜃 :

Note that the final 𝜽 has no


dependence on 𝝈.

Maximizing ℓ 𝜃 gives the same answer as minimizing

which is the original LMS cost function.

20
Classification

21
Classification

Classification problem: the values y we now want to predict take on only a small number of discrete
values.

Binary classification problem: y can take on only two values, 0 and 1.

0 or - : the negative class,

1 or + : the positive class,

22
Decision boundary

For 2D, the dimension of θ will be one.

Thus, the decision boundary will be a


linear line.

23
Problem of linear regression
Let’s try to predict y given z (= θTx) using the linear regression algorithm

It performs very poorly!

Simply it doesn’t make sense for hθ(x) to take values larger than 1 or smaller than 0
when we know that y ∈ {0, 1}.

24
Logistic function
change hθ(x) as

logistic function or the sigmoid function.

✓ It becomes 0.5 as a data point is on the decision


boundary (z = 0).
✓ It goes to 1 as z → ∞ and to 0 as z → −∞.

✓ It is always bounded between 0 and 1.

25
Logistic function

A useful property of the derivative of the sigmoid function,

26
Maximum likelihood

Assuming that

More compactly

27
Maximum likelihood
For m training examples, the likelihood of the parameters

Maximizing via the gradient ascent.


The corresponding log likelihood

for the maximization

28
Cost function
The log likelihood ➔ one has to maximize it

Cost function = -log likelihood ➔ one has to minimize it

𝑦 (𝑖) : true value in the training data ➔ true probablity of being in class 1
ℎ 𝑥 𝑖 : predicted value by the classifier ➔ predicted probability of being in class 1

1 − 𝑦 (𝑖) : true probablity of being in class 2


1 − ℎ 𝑥 𝑖 : predicted probability of being in class 2

29
Cost function

Two important properties:


1. Non-negative because 0 ≤ 𝑝𝑟𝑜𝑏 ≤ 1

2. For all training data (m), if ℎ 𝑥 𝑖


is close to 𝑦 (𝑖) , the cost goes to zero.

if 𝑦 (𝑖) = 0, ℎ 𝑥 𝑖
→0: 1st term = 0 and ln1=0 in the 2nd term; therefore Cost ➔0
if 𝑦 (𝑖) = 1, ℎ 𝑥 𝑖 →1: ln1=0 in the 1st term and 2nd term =0; therefore Cost ➔0

Indeed, it can play a role as a cost function.

30
Maximum likelihood

Let us take a derivative to derive the stochastic gradient ascent rule:

The stochastic gradient ascent rule

31
Maximum likelihood
The stochastic gradient ascent rule

Difference between the true and predicted values

The derivative becomes zero as the difference becomes zero.

In other words, the classifier perfectly predicts the class of input data!

32
Cross entropy

Gibbs entropy formula

VS
Entropy in information theory
It is also called the cross entropy

33
Softmax classification

34
Multinomial classification
We need 3 decision boundaries!

For 2D, 3 lines:

35
Softmax function
The softmax function If k = 2 like the binary classification (y = 0 or 1),

Prob. for k = 1

Prob. for k = 2

➔Probability of having the k-th label. The result is the same with the logistic function
for binary classification!!

Generalization of the logistic function for multinomial problems

36
Softmax classification
The hypothesis function
The softmax regression
Dog Cat Rabbit

Probability of being in each class

One has to determine the parameters {𝜃} with a given training set ➔ cost function

37
One-hot encoding
1
Dog = 0
0
0
Cat = 1
0
0
Rabbit = 0
1

• Multiple classes can be represented with a vector having elements whose number is the same
with the number of classes.
• This is called one-hot encoding.
• In this example, we have three classes and thus a vector with 3 elements.
• We take the maximum value and let it the unit value, resulting in only a single non-zero
element (in this example, the first element)
• Therefore, the input data is labeled the class 1.

38
Cost function
As the generalization of the logistic regression, we use the cross entropy as a cost function.

element-wise product
where

The element-wise product gives this result.

39
Concept of deep learning

40
Concept of machine learning

“easy-for-a-human, difficult-for-a-machine” tasks,

often referred to as pattern recognition.

• This image explains the concept of machine learning, specifically focusing on tasks that are
“easy for a human, difficult for a machine.” These tasks typically involve pattern recognition,
such as distinguishing between cats and dogs from labeled photos.
• On the right side of the image, there are labeled pictures of a cat and a dog. This visual
representation demonstrates how machines learn to recognize patterns, like identifying
different animals, by being trained with labeled data (photos with categories).
41
Why deep learning?
Classical ML (or shallow learning)

hand-crafted features by experts

Deep learning

automated feature extraction by AI


42
Neuron & synapse
Electrical and chemical signal transmission

Digital signal: on-off

threshold

Input Output
Analog signal transmission

Logistic (sigmoid) function 43


Neural network
Hebbian principle (1940s)
a neuro-scientific theory claiming that an increase in synaptic efficacy arises from a presynaptic cell’s
repeated and persistent stimulation of a postsynaptic cell. “Cells that fire together wire together.”

Complexity of human brain


• 1011 neurons
• each neuron has ~7,000 synaptic connections
• 3 years old: 1015 synapses
• adult: 1014~5X1014 synapses

➔ complex functions

• One of the key functions of a neural network is its ability to learn.

• A neural network is a complex adaptive system; it can change its internal structure based on the
information flowing through it by adjusting the amplitude of signals (weight).
44
Perception
“easy-for-a-human, difficult-for-a-machine” tasks, often referred to as pattern recognition.

Unique signal pattern

45
Perceptron
Invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical Laboratory, a perceptron is the
simplest neural network possible: a computational model of a single neuron.

46
Principle of perceptron

0 if 𝑓 𝒘 ∙ 𝒙 < threshold
output = ቊ
1 if 𝑓 𝒘 ∙ 𝒙 ≥ threshold

inputs processor single output 𝒘 ∙ 𝒙 = ෍ 𝑤𝑗 𝑥𝑗


𝑗

Step 1: receive inputs Step 2: weight inputs Step 3: sum inputs Step 4: generate output
Input 0: x1 = 12 x1 × w1 = 12 × 0.5 = 6 Output = sign(sum)
Input 1: x2 = 4 x2 × w2 = 4 × -1 = -4 Sum = 6 + -4 = 2 = sign(2) = +1

47
Bias

0 if 𝑓 𝒘 ∙ 𝒙 < threshold 0 if 𝑓 𝒘 ∙ 𝒙 + 𝑏 < 0


output = ቊ output = ቊ
1 if 𝑓 𝒘 ∙ 𝒙 ≥ threshold 1 if 𝑓 𝒘 ∙ 𝒙 + 𝑏 ≥ 0

move the threshold inside the perceptron


and then make it a learnable parameter

a measure of how easy it is to get the perceptron to output 1.

or a measure of how easy it is to get the perceptron to fire.

48
Linear classification

By varying the weights and the bias, we can get different models of decision-making.

Learning ➔ determining appropriate weights and bias


to get an optimal output with a given training and test data set 49
Logic gate
✓ AND Gate

x1 x2 y
0 0 0
1 0 0
0 1 0
1 1 1

0 if 𝒘 ∙ 𝒙 + 𝑏 < 0
output = ቊ
1 if 𝒘 ∙ 𝒙 + 𝑏 ≥ 0

50
Logic gate
✓ Logic gate - OR Gate

x1 x2 y
0 0 0
1 0 1
0 1 1
1 1 1

0 if 𝒘 ∙ 𝒙 + 𝑏 < 0 the same slope


output = ቊ
1 if 𝒘 ∙ 𝒙 + 𝑏 ≥ 0 but a different bias value

51
Logic gate
✓ Logic gate- XOR Gate

x1 x2 y
0 0 0
1 0 1
0 1 1
1 1 0
x2
XOR Gate

0 if 𝒘 ∙ 𝒙 + 𝑏 < 0
output = ቊ
1 if 𝒘 ∙ 𝒙 + 𝑏 ≥ 0
x1

Non-linear expression with w and x 52


Drawback of a single perceptron
The single perceptron approach has one major drawback; it can only learn linearly separable
functions. Take XOR, a relatively simple function, and notice that it can’t be classified by a linear
separator.

But what if we make the model with more perceptrons? If one perceptron can solve OR and one
perceptron can solve NOT AND, then two perceptrons combined can solve XOR.

➔ Many perceptrons to go beyond linearity


53
Multilayer perceptron
& non-linearity

54
Multilayer perceptron (MLP)
MLP, a.k.a a forward-feed network

𝒙 = {𝑥1, 𝑥2}
𝒙 𝒘1 𝒘2

ℎ𝑗 = 𝑓(𝒘𝑗1 ∙ 𝒙 + 𝑏1)
𝒚

𝑦 = 𝑓(𝒘2 ∙ 𝒉 + 𝑏2)

𝑏1 𝑏2

Marvin Minsky 1969, founder of MIT AI lab


no way to train the neural network ➔ the first winter of AI 55
Nonlinearity

The nonlinear activation functions can


be expanded as
𝑢2
𝜎 𝑢 ≈ 𝜎0 + 𝜎1 ⋅ 𝑢 + 𝜎2 ⋅
2

The nonlinearity comes from both the activation


functions and many perceptrons

56
Universal approximation

57
Universal Approximation Theorem
• Why does neural networks work so well?
• The answer is, MLPs can approximate any non-linear function which is called the universal approximation
theorem. https://en.wikipedia.org/wiki/Universal_approximation_theorem

58
Universal approximation
One hidden-layer approximation of nonlinear functions: x2, sinx, |x|, and H(x)

Good bases for spanning a nonlinear function space 59


Supervised learning
Solubility
Toxicity
Drug efficacy
.
.
.

Input data Output data


Feature extraction (L)
Structure (X) Property (Y)

Y = f(X); f = neural networks

• Regression: predicting values


• Classification: predicting active or inactive

60
Why deeper?
Theorem. For any given multivariate polynomial and any tolerance ε > 0, there exists a neural
network of fixed finite size N (independent of ε) that approximates the polynomial to accuracy
better than ε. Furthermore, N is bounded by the complexity of the polynomial, scaling as the
number of multiplications required times a factor that is typically slightly larger than 4.
https://arxiv.org/pdf/1608.08225v4.pdf

61
Why deeper?
• ILSVRC (ImageNet Large Scale Visual Recognition Challenge) Winners

62
Forward propagation
for prediction

63
Prediction
Feed-forward MLP (or DNN)

x y

64
Example: forward propagation

65
Example: forward propagation

66
Example: forward propagation

67
Example: forward propagation

68
Back propagation
for training

69
Loss functions
Loss function = Cost function = Objective function = Error function

Regression
1 2
L2 loss function (or MSE or LMS) = ෍ 𝑦𝑖 − ℎ(𝑥𝑖 )
2𝑚
𝑖

Classification

Cross entropy = − ෍ 𝑦𝑖 ln ℎ 𝑥𝑖
𝑖

For binary classification, 𝐿 = −(𝑦𝑖 ln ℎ 𝑥𝑖 + (1 − 𝑦𝑖)ln(1 − ℎ 𝑥𝑖 )

70
How to train?

x y

(𝑖+1) (𝑖)
ℎ𝑖 = 𝑓(σ𝑗 𝒘𝑖𝑗ℎ𝑗 + 𝑏 (𝑖) )
One has to determine {w, b} so as to minimize the error, E(ypred, ytrue)

Minsky (1969): no one on earth had found a viable way to train MLPs good enough
to learn such simple functions (Winter of NN1 until 1986) 71
Stochastic gradient descent

Back propagation

1974, 1982 by Paul Werbos,


1986 by Hinton, rediscovery
72
Back propagation

73
Back propagation

74
Back propagation

75
Back propagation

76
Back propagation

77
Back propagation

78
Back propagation

79
Vanishing gradient

s(x) tanh(x)
Successive multiplication of the derivatives of the activation function ➔ zero update value
80
Vanishing gradient

Winter of neural network 2, CIFAR


Canadian Institute For Advanced Research

The vanishing gradient problem occurs during the training of deep neural networks,
particularly in very deep networks or when using activation functions like the sigmoid
or tanh. It happens when the gradients, which are used to update the weights of the
network during backpropagation, become extremely small as they propagate through
the layers. As a result, the early layers in the network receive negligible updates,
making it difficult for the network to learn effectively, slowing or even halting training.
This problem can be mitigated by using activation functions like ReLU and its variants
or by employing techniques like batch normalization.

81
Various cost functions
ReLU = Rectified Linear Unit
Variants of ReLU to improve its problem that has no
learning effect if the input has a negative value (zero
derivative): a.k.a. dying ReLU
• Leaky ReLU: very large negative output
• ELU: softer than Leaky ReLU (good for noise
handling) but more computational costs

Nair and Hinton, ICML (2010)


Go deeper

Maxout: piecewise linear function


(each perceptron has different values)

82
New terms
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Hypothesis
• Least Mean Square (LMS)
• Maximum Likelihood Estimate (MLE)
• Regression
• Gradient descent
• Classification
• Decision boundary
• Logistic or sigmoid function
• Cost function
• Softmax classification
• Cross entropy
• One-hot encoding
83
New terms
• Neuron & synapse
• Perceptron
• Activation function: sigmoid, tanh, ReLU, etc
• Multilayer perceptron (MLP) = Deep neural network (DNN) = Artificial
neural network (ANN)
• Nonlinearity
• Universal approximation
• Loss function = Cost function = Error function
• Feed forward
• Back propagation
84

You might also like

pFad - Phonifier reborn

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.


Alternative Proxies:

Alternative Proxy

pFad Proxy

pFad v3 Proxy

pFad v4 Proxy