Chapter02-Introduction-to-DeepLearning
Introduction to deep learning
Woo Youn Kim
Professor of Chemistry &
Graduate School of Data Science, KAIST
CEO, HITS
Contents
• Linear regression
• Linear classification
• Statistical modeling aims to obtain a hypothesis from training samples (observations) {𝐗, 𝐘}, where the samples are assumed to be independent and identically distributed (i.i.d.) random variables drawn from the joint distribution 𝑝(𝐗, 𝐘).
• Statistical inference is the procedure of inferring the output 𝒚∗ for a new input 𝒙∗ by using the obtained hypothesis.
Learning algorithms
• Reinforcement learning: to learn a decision (policy) from experience by interacting with environments
• Supervised learning: to learn a hypothesis that best explains the input-label relationship
• Unsupervised learning: to learn features, also called representations or latent variables
• Semi-supervised learning: obtaining a hypothesis by using both labeled and unlabeled data
• Self-supervised learning: to learn representations from unlabeled data by deriving the supervisory signal from the data itself
Supervised learning
Example prediction targets: solubility, toxicity, drug efficacy, …
Regression
Let us start with a few examples of supervised learning problems.
hypothesis: X → Y (learned from observations)
Molecular structure → Energy
Molecular structure → Solubility
Molecular structure → Biological activity
When the target variable that we are trying to predict is continuous, we apply a regression method.
Regression (linear hypothesis)
𝑊: parameters or weights
𝑏 : bias
The figure illustrates how linear regression works by fitting different lines
(hypotheses) to a given dataset. It visualizes the equation H(x)=Wx+b, where
W is the weight (or slope), and b is the bias (intercept). The goal is to find the
best-fitting line that minimizes the error between the predicted values and the
actual data points.
Cost function
How do we fit the line to our (training) data numerically?
Cost minimization
Simplified hypothesis
To simplify the calculation, we ignore the bias term in the hypothesis and thus in the cost function.
What does the cost look like?
For the training set {(1, 1), (2, 2), (3, 3)} and the simplified hypothesis $H(x) = Wx$:
• $W = 1$: $\mathrm{cost}(W) = \frac{1}{3}\left((1\cdot 1-1)^2 + (1\cdot 2-2)^2 + (1\cdot 3-3)^2\right) = 0$
• $W = 0$: $\mathrm{cost}(W) = \frac{1}{3}\left((0\cdot 1-1)^2 + (0\cdot 2-2)^2 + (0\cdot 3-3)^2\right) = 4.67$
...
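As a quick check, a minimal NumPy sketch (assuming the toy dataset above) that reproduces these two cost values and traces out the convex bowl:

```python
import numpy as np

# Toy dataset from the slide: y = x for x = 1, 2, 3.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])

def cost(W):
    """Mean squared error of the simplified hypothesis H(x) = W*x."""
    return np.mean((W * x - y) ** 2)

print(cost(1.0))  # 0.0
print(cost(0.0))  # 4.666... = 14/3
for W in np.linspace(-1, 3, 9):
    print(f"W = {W:+.1f}  cost = {cost(W):.3f}")  # convex bowl, minimum at W = 1
```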
Gradient descent algorithm
• Minimize the cost function using the gradient descent algorithm
• At each step, update the parameters along the negative gradient, the direction that decreases cost(W, b) the fastest
Least Mean Squares (LMS)
Applying gradient descent to the squared-error cost of a single training example gives
$\theta_j := \theta_j + \alpha\,(y^{(i)} - h_\theta(x^{(i)}))\,x_j^{(i)}$
The resulting equation is called the LMS update rule (LMS stands for "least mean squares")
Convex function
Gradient descent algorithm
• Batch gradient descent
Repeat until convergence {
    update the parameters using the gradient accumulated over the entire training set
}
➔ scans through the entire training set before taking a single step
➔ computationally expensive: suppose you have a million samples!
• Stochastic gradient descent
Repeat {
    update the parameters using the gradient of one example at a time
}
➔ starts making progress right away and continues to make progress with each example it looks at
➔ often much faster convergence than batch gradient descent
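A sketch contrasting the two loops on the toy data from before (learning rate and epoch count are illustrative assumptions, not values from the slides):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
alpha = 0.05

# Batch gradient descent: one update per full pass over the data.
W = 0.0
for epoch in range(50):
    grad = np.mean(2 * (W * x - y) * x)   # uses ALL samples
    W -= alpha * grad
print("batch:", W)

# Stochastic gradient descent: one update per sample.
W = 0.0
for epoch in range(50):
    for xi, yi in zip(x, y):              # one sample at a time
        W += alpha * (yi - W * xi) * xi   # LMS update, improves immediately
print("sgd:  ", W)
```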
Gaussian noise
Suppose that the target variables and the inputs are related via the equation
$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$,
where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise. Assuming the $\epsilon^{(i)}$ are i.i.d. Gaussian, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$, we thus have
$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$
Likelihood
Given a data set $X = \{x^{(i)}\}$ and an ML model with parameters $\theta$, what is the distribution of the targets $Y = \{y^{(i)}\}$?
The probability of the data is given by $p(Y \mid X; \theta)$, which is viewed as a function of $Y$ for a fixed value of $\theta$; viewed instead as a function of $\theta$, it is called the likelihood $L(\theta)$.
What is a reasonable way of choosing our best guess of the parameters $\theta$, i.e., of training the machine learning model?
Maximum likelihood
The principle of maximum likelihood says that we should choose $\theta$ so as to make the data as probable as possible: $\theta^* = \arg\max_\theta L(\theta)$
Maximum likelihood
Instead of maximizing $L(\theta)$, we can also maximize any strictly increasing function of $L(\theta)$, such as the log likelihood $\ell(\theta) = \ln L(\theta)$.
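Carrying the logarithm through the Gaussian noise model above makes the connection to least squares explicit (the standard derivation, in the same notation):

$\ell(\theta) = \ln \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) = m \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} (y^{(i)} - \theta^T x^{(i)})^2$

Hence maximizing $\ell(\theta)$ is exactly minimizing the least-squares cost: under Gaussian noise, the LMS solution is the maximum-likelihood estimate.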
Classification
Classification
Classification problem: the values y we now want to predict take on only a small number of discrete
values.
Decision boundary
Problem of linear regression
Let's try to predict y given z (= $\theta^T x$) using the linear regression algorithm.
It simply doesn't make sense for $h_\theta(x)$ to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}.
Logistic function
we change $h_\theta(x)$ to
$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$,
where $g(z) = \frac{1}{1 + e^{-z}}$ is the logistic (sigmoid) function
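A minimal sketch of the logistic hypothesis in NumPy (the parameter and input values are placeholders):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)); squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.0])   # placeholder parameters
x = np.array([2.0, 1.0])        # one input sample
h = sigmoid(theta @ x)          # h_theta(x) = g(theta^T x)
print(h)                        # always in (0, 1), unlike linear regression
```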
Logistic function
Maximum likelihood
Assuming that
$P(y = 1 \mid x; \theta) = h_\theta(x)$ and $P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$
More compactly,
$p(y \mid x; \theta) = h_\theta(x)^{y}\,(1 - h_\theta(x))^{1-y}$
Maximum likelihood
For m training examples, the likelihood of the parameters is
$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}}\,(1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$
Cost function
The log likelihood $\ell(\theta) = \sum_i y^{(i)} \ln h_\theta(x^{(i)}) + (1 - y^{(i)}) \ln(1 - h_\theta(x^{(i)}))$ ➔ one has to maximize it (equivalently, minimize $-\ell(\theta)$ as the cost)
$y^{(i)}$: true value in the training data ➔ true probability of being in class 1
$h_\theta(x^{(i)})$: predicted value by the classifier ➔ predicted probability of being in class 1
Cost function
if $y^{(i)} = 0$ and $h_\theta(x^{(i)}) \to 0$: the 1st term = 0 and $\ln 1 = 0$ in the 2nd term; therefore Cost ➔ 0
if $y^{(i)} = 1$ and $h_\theta(x^{(i)}) \to 1$: $\ln 1 = 0$ in the 1st term and the 2nd term = 0; therefore Cost ➔ 0
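A small numeric check of these limits (a sketch; the clipping constant eps is added only to avoid log(0)):

```python
import numpy as np

def cost(y, h, eps=1e-12):
    """Per-example logistic cost: -(y*ln(h) + (1-y)*ln(1-h))."""
    h = np.clip(h, eps, 1 - eps)
    return -(y * np.log(h) + (1 - y) * np.log(1 - h))

print(cost(0, 0.001))  # ~0: confident, correct prediction
print(cost(1, 0.999))  # ~0: confident, correct prediction
print(cost(1, 0.001))  # large: a confident but WRONG prediction is punished heavily
```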
Maximum likelihood
Maximum likelihood
The stochastic gradient ascent rule: $\theta_j := \theta_j + \alpha\,(y^{(i)} - h_\theta(x^{(i)}))\,x_j^{(i)}$
The likelihood is maximal when $h_\theta(x^{(i)}) = y^{(i)}$ for every sample; in other words, the classifier perfectly predicts the class of the input data!
Cross entropy
vs. entropy in information theory, $H(p) = -\sum_k p_k \ln p_k$
The logistic cost has the same form, with the predicted distribution inside the logarithm; it is therefore also called the cross entropy.
Softmax classification
Multinomial classification
We need 3 decision boundaries!
Softmax function
The softmax function If k = 2 like the binary classification (y = 0 or 1),
Prob. for k = 1
Prob. for k = 2
➔Probability of having the k-th label. The result is the same with the logistic function
for binary classification!!
36
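A sketch verifying numerically that the K = 2 softmax collapses to the logistic function (the scores z are arbitrary):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([2.0, 0.5])           # arbitrary scores for K = 2 classes
p = softmax(z)
print(p[0], sigmoid(z[0] - z[1]))  # identical: softmax with K = 2 == logistic
```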
Softmax classification
The hypothesis function: softmax regression over the class scores for, e.g., the classes {Dog, Cat, Rabbit}
One has to determine the parameters {𝜃} with a given training set ➔ cost function
One-hot encoding
Dog = (1, 0, 0)
Cat = (0, 1, 0)
Rabbit = (0, 0, 1)
• Multiple classes can be represented with a vector whose number of elements equals the number of classes.
• This is called one-hot encoding.
• In this example, we have three classes and thus a vector with 3 elements.
• We take the maximum value and set it to one, resulting in only a single non-zero element (in this example, the first element).
• Therefore, the input data is labeled as class 1.
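A minimal sketch of one-hot encoding for the three classes above:

```python
import numpy as np

classes = ["Dog", "Cat", "Rabbit"]

def one_hot(label):
    """Return a vector with a single 1 at the position of the class label."""
    v = np.zeros(len(classes))
    v[classes.index(label)] = 1.0
    return v

print(one_hot("Dog"))     # [1. 0. 0.]
print(one_hot("Cat"))     # [0. 1. 0.]
print(one_hot("Rabbit"))  # [0. 0. 1.]
```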
Cost function
As the generalization of logistic regression, we use the cross entropy as the cost function:
$\mathrm{Cost} = \frac{1}{m}\sum_i D(\hat{y}^{(i)}, y^{(i)})$, where $D(\hat{y}, y) = -\sum_k y_k \ln \hat{y}_k$
(the element-wise product of the one-hot label $y$ with $\ln \hat{y}$, summed over the classes)
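A sketch of the cross-entropy cost $D(\hat{y}, y)$ with a one-hot label (the predicted probabilities are made-up numbers):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])   # one-hot label: "Dog"
y_pred = np.array([0.7, 0.2, 0.1])   # made-up softmax output

def cross_entropy(y_pred, y_true, eps=1e-12):
    """D(y_pred, y_true) = -sum_k y_k * ln(yhat_k); only the true class contributes."""
    return -np.sum(y_true * np.log(y_pred + eps))

print(cross_entropy(y_pred, y_true))                     # -ln(0.7) ~ 0.357
print(cross_entropy(np.array([0.1, 0.8, 0.1]), y_true))  # -ln(0.1) ~ 2.303: wrong class punished
```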
Concept of deep learning
Concept of machine learning
• This image explains the concept of machine learning, specifically focusing on tasks that are
“easy for a human, difficult for a machine.” These tasks typically involve pattern recognition,
such as distinguishing between cats and dogs from labeled photos.
• On the right side of the image, there are labeled pictures of a cat and a dog. This visual
representation demonstrates how machines learn to recognize patterns, like identifying
different animals, by being trained with labeled data (photos with categories).
Why deep learning?
[Figure: classical ML (or shallow learning) vs. deep learning pipelines]
[Figure: a biological neuron — inputs are summed and compared with a threshold to produce an output; analog signal transmission between neurons ➔ complex functions]
• A neural network is a complex adaptive system; it can change its internal structure based on the information flowing through it by adjusting the amplitudes of signals (weights).
Perception
“Easy-for-a-human, difficult-for-a-machine” tasks are often referred to as pattern recognition.
Perceptron
Invented in 1957 by Frank Rosenblatt at the Cornell Aeronautical Laboratory, a perceptron is the
simplest neural network possible: a computational model of a single neuron.
Principle of perceptron
$\text{output} = \begin{cases} 0 & \text{if } f(\boldsymbol{w} \cdot \boldsymbol{x}) < \text{threshold} \\ 1 & \text{if } f(\boldsymbol{w} \cdot \boldsymbol{x}) \ge \text{threshold} \end{cases}$
Step 1: receive inputs — x₁ = 12, x₂ = 4
Step 2: weight inputs — x₁ × w₁ = 12 × 0.5 = 6; x₂ × w₂ = 4 × (−1) = −4
Step 3: sum inputs — Sum = 6 + (−4) = 2
Step 4: generate output — Output = sign(Sum) = sign(2) = +1
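The same four steps as a sketch in code, using the worked numbers from the slide:

```python
def perceptron(inputs, weights):
    # Step 2: weight the inputs; Step 3: sum them.
    total = sum(x * w for x, w in zip(inputs, weights))
    # Step 4: generate the output with the sign activation.
    return 1 if total >= 0 else -1

# Step 1: receive inputs x1 = 12, x2 = 4 with weights 0.5 and -1.
print(perceptron([12, 4], [0.5, -1]))  # sum = 6 + (-4) = 2 -> sign(2) = +1
```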
Bias
Linear classification
By varying the weights and the bias, we can get different models of decision-making.
x₁  x₂ | y
0   0  | 0
1   0  | 0
0   1  | 0
1   1  | 1
$\text{output} = \begin{cases} 0 & \text{if } \boldsymbol{w} \cdot \boldsymbol{x} + b < 0 \\ 1 & \text{if } \boldsymbol{w} \cdot \boldsymbol{x} + b \ge 0 \end{cases}$
(this truth table is the AND gate)
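A sketch of a single perceptron realizing this truth table; the weights w = (1, 1) and bias b = −1.5 are one valid choice, not values from the slides:

```python
def step(z):
    return 1 if z >= 0 else 0

def and_gate(x1, x2, w=(1.0, 1.0), b=-1.5):
    # Fires only when both inputs are 1: 1 + 1 - 1.5 >= 0.
    return step(w[0] * x1 + w[1] * x2 + b)

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, "->", and_gate(x1, x2))  # outputs 0, 0, 0, 1
```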
Logic gate
✓ Logic gate - OR Gate
x₁  x₂ | y
0   0  | 0
1   0  | 1
0   1  | 1
1   1  | 1
Logic gate
✓ Logic gate - XOR Gate
x₁  x₂ | y
0   0  | 0
1   0  | 1
0   1  | 1
1   1  | 0
$\text{output} = \begin{cases} 0 & \text{if } \boldsymbol{w} \cdot \boldsymbol{x} + b < 0 \\ 1 & \text{if } \boldsymbol{w} \cdot \boldsymbol{x} + b \ge 0 \end{cases}$
[Figure: the XOR points in the x₁–x₂ plane — no single line separates the 1s from the 0s]
But what if we build the model with more perceptrons? If one perceptron can solve OR and another can solve NAND (NOT AND), then combining them (their outputs fed into an AND) solves XOR, as sketched below.
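A sketch of that construction, with one standard choice of weights and biases (OR and NAND in the first layer, AND in the second):

```python
def step(z):
    return 1 if z >= 0 else 0

def neuron(x, w, b):
    return step(sum(wi * xi for wi, xi in zip(w, x)) + b)

def xor(x1, x2):
    h_or   = neuron((x1, x2), (1.0, 1.0),  -0.5)     # OR
    h_nand = neuron((x1, x2), (-1.0, -1.0), 1.5)     # NAND
    return neuron((h_or, h_nand), (1.0, 1.0), -1.5)  # AND of the two

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, "->", xor(x1, x2))  # outputs 0, 1, 1, 0
```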
Multilayer perceptron (MLP)
MLP, a.k.a. a feed-forward network
$\boldsymbol{x} = \{x_1, x_2\}$
hidden layer: $h_j = f(\boldsymbol{w}^{(1)}_j \cdot \boldsymbol{x} + b^{(1)})$
output layer: $y = f(\boldsymbol{w}^{(2)} \cdot \boldsymbol{h} + b^{(2)})$
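The same two equations as a vectorized NumPy sketch; the sigmoid activation, layer sizes, and random weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: 1.0 / (1.0 + np.exp(-z))           # activation (sigmoid assumed)

x  = np.array([0.5, -1.2])                       # input x = {x1, x2}
W1 = rng.normal(size=(3, 2)); b1 = np.zeros(3)   # 3 hidden units
W2 = rng.normal(size=(1, 3)); b2 = np.zeros(1)   # 1 output unit

h = f(W1 @ x + b1)   # h_j = f(w_j^(1) . x + b^(1))
y = f(W2 @ h + b2)   # y   = f(w^(2) . h + b^(2))
print(h, y)
```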
Universal approximation
Universal Approximation Theorem
• Why do neural networks work so well?
• The answer: an MLP with a single hidden layer can approximate any continuous non-linear function to arbitrary accuracy, which is called the universal approximation theorem. https://en.wikipedia.org/wiki/Universal_approximation_theorem
Universal approximation
One-hidden-layer approximation of nonlinear functions: $x^2$, $\sin x$, $|x|$, and the step function $H(x)$
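For one of these targets the representation is in fact exact: with ReLU units (an assumption; the slide's figure may use a different activation), |x| is a one-hidden-layer network with two neurons:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def abs_net(x):
    # Hidden layer: two ReLU units with input weights +1 and -1, no bias.
    h = relu(np.array([x, -x]))
    # Output layer: sum the two hidden activations.
    return h.sum()           # relu(x) + relu(-x) == |x|

for x in [-2.0, -0.5, 0.0, 1.5]:
    print(x, abs_net(x))     # matches abs(x) exactly
```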
Why deeper?
Theorem. For any given multivariate polynomial and any tolerance ε > 0, there exists a neural
network of fixed finite size N (independent of ε) that approximates the polynomial to accuracy
better than ε. Furthermore, N is bounded by the complexity of the polynomial, scaling as the
number of multiplications required times a factor that is typically slightly larger than 4.
https://arxiv.org/pdf/1608.08225v4.pdf
Why deeper?
• ILSVRC (ImageNet Large Scale Visual Recognition Challenge) Winners
Forward propagation
for prediction
Prediction
Feed-forward MLP (or DNN)
x → y
Example: forward propagation
[A worked numerical example of forward propagation, carried through the network step by step]
Back propagation
for training
Loss functions
Loss function = Cost function = Objective function = Error function
Regression
L2 loss function (or MSE or LMS) $= \frac{1}{2m}\sum_i (y_i - h(x_i))^2$
Classification
Cross entropy $= -\sum_i y_i \ln h(x_i)$
How to train?
x → y
$h_i^{(l+1)} = f\left(\sum_j w_{ij}^{(l)} h_j^{(l)} + b^{(l)}\right)$
One has to determine {w, b} so as to minimize the error E(y_pred, y_true)
Minsky (1969): no one on earth had found a viable way to train MLPs well enough to learn even such simple functions (the first winter of neural networks, until 1986)
Stochastic gradient descent
Back propagation
Back propagation
[A step-by-step derivation of backpropagation: the error at the output is propagated backward through the layers via the chain rule]
Vanishing gradient
[Figure: the sigmoid s(x) and tanh(x) with their derivatives, whose maxima are 0.25 and 1, respectively]
Successive multiplication of the derivatives of the activation function ➔ the update value shrinks toward zero
Vanishing gradient
The vanishing gradient problem occurs during the training of deep neural networks,
particularly in very deep networks or when using activation functions like the sigmoid
or tanh. It happens when the gradients, which are used to update the weights of the
network during backpropagation, become extremely small as they propagate through
the layers. As a result, the early layers in the network receive negligible updates,
making it difficult for the network to learn effectively, slowing or even halting training.
This problem can be mitigated by using activation functions like ReLU and its variants
or by employing techniques like batch normalization.
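A sketch of the mechanism: the sigmoid derivative is at most 0.25, so one such factor per layer (weights near 1 assumed for simplicity) shrinks the backpropagated signal geometrically:

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
dsig = lambda z: sig(z) * (1.0 - sig(z))   # derivative, maximum 0.25 at z = 0

grad = 1.0
for layer in range(30):                    # 30 layers, weights assumed ~1
    grad *= dsig(0.0)                      # best case: each factor is 0.25
print(grad)                                # 0.25**30 ~ 8.7e-19: vanished
```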
Various activation functions
ReLU = Rectified Linear Unit: $f(x) = \max(0, x)$
Variants of ReLU address its problem of having no learning effect when the input is negative (zero derivative), a.k.a. dying ReLU:
• Leaky ReLU: keeps a small non-zero slope for negative inputs, so its output can still grow very large and negative
• ELU: softer than Leaky ReLU for negative inputs (good for noise handling) but at more computational cost
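Sketches of the three activations discussed; the α values are common defaults, not values from the slides:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                 # zero gradient for x < 0 (dying ReLU)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)     # small slope keeps negatives learning

def elu(x, alpha=1.0):
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1))  # smooth, saturates at -alpha

x = np.array([-5.0, -1.0, 0.0, 2.0])
print(relu(x), leaky_relu(x), elu(x), sep="\n")
```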
New terms
• Supervised learning
• Unsupervised learning
• Reinforcement learning
• Hypothesis
• Least Mean Square (LMS)
• Maximum Likelihood Estimate (MLE)
• Regression
• Gradient descent
• Classification
• Decision boundary
• Logistic or sigmoid function
• Cost function
• Softmax classification
• Cross entropy
• One-hot encoding
New terms
• Neuron & synapse
• Perceptron
• Activation function: sigmoid, tanh, ReLU, etc.
• Multilayer perceptron (MLP) = Deep neural network (DNN) = Artificial
neural network (ANN)
• Nonlinearity
• Universal approximation
• Loss function = Cost function = Error function
• Feed forward
• Back propagation