Deeplearning - Ai Deeplearning - Ai

Copyright Notice
These slides are distributed under the Creative Commons License.
DeepLearning.AI makes these slides available for educational purposes. You may not use or distribute
these slides for commercial purposes. You may make copies of these slides and use or distribute them for
educational purposes as long as you cite DeepLearning.AI as the source of the slides.
For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode

Unsupervised learning,
recommender systems
and reinforcement learning
Welcome!
Beyond Supervised Learning
• Unsupervised Learning
• Clustering
• Anomaly detection
• Recommender Systems
• Reinforcement Learning
Andrew Ng
Clustering
What is clustering?
Supervised learning
𝑥!
𝑥"
Training set: ?
Andrew Ng
Unsupervised learning
Clustering
𝑥!
𝑥"
Training set: 𝑥 (") , 𝑥 ($) , 𝑥 (%) , … , 𝑥 (&)
Andrew Ng
Applications of clustering
• Growing skills
• Develop career
• Stay updated with AI,
Grouping similar news understand how it affects
your field of work Market segmentation
Image credit: NASA/JPL-Calte

ch/E. Churchwell (Univ. of Wisconsin, Madison)
DNA analysis Astronomical data analysis
Andrew Ng
Clustering
K-means intuition
Step 1: 𝑥 (") , 𝑥 ($) , 𝑥 (%) , … , 𝑥 (%')
Assign
each point
to its
closest
centroid
Andrew Ng
Step 2: 𝑥 (") , 𝑥 ($) , 𝑥 (%) , … , 𝑥 (%')
Recompute
the
centroids
Andrew Ng
Step 1: 𝑥 (") , 𝑥 ($) , 𝑥 (%) , … , 𝑥 (%')
Assign
each point
to its
closest
centroid
Andrew Ng
Step 1: 𝑥 (") , 𝑥 ($) , 𝑥 (%) , … , 𝑥 (%')
Assign
each point
to its
closest
centroid
Andrew Ng
Step 2: 𝑥 (") , 𝑥 ($) , 𝑥 (%) , … , 𝑥 (%')
Recompute
the
centroids
Andrew Ng
Step 1: 𝑥 (") , 𝑥 ($) , 𝑥 (%) , … , 𝑥 (%')
Assign
each point
to its
closest
centroid
Andrew Ng
Step 2: 𝑥 (") , 𝑥 ($) , 𝑥 (%) , … , 𝑥 (%')
Recompute
the
centroids
Andrew Ng
Clustering
K-means algorithm
K-means algorithm
Randomly initialize cluster centroids
x2
x1
Andrew Ng
K-means algorithm
Repeat {
# Assign points to cluster centroids
for i = 1 to m
c(i):= index (from 1 to K ) of cluster x2
centroid closest to x(i)
x1
Andrew Ng
K-means algorithm
Repeat {
for i = 1 to m
# Move cluster centroids
x1
for k = 1 to K
μk := average (mean) of points assigned to cluster k
}
Andrew Ng
K-means algorithm
Randomly initialize 𝑘 cluster centroids, 𝜇!, 𝜇!, ... 𝜇"
Repeat {
for i = 1 to m
x1
for k = 1 to K
μk := average (mean) of points assigned to cluster k
}
Andrew Ng
K-means for clusters that are not well separated
T-shirt sizing
Weight
Height
Andrew Ng
Clustering
Optimization objective
K-means optimization objective
𝑐 (") = index of cluster (1, 2, … , 𝐾) to which example 𝑥 (") is currently
assigned
𝜇$ = cluster centroid 𝑘
𝜇% (") = cluster centroid of cluster to which example 𝑥 (") has been
assigned
Cost function
&
$ &
1 ,
𝐽 𝑐 ,…,𝑐 , 𝜇$ , . . , 𝜇% = . 𝑥 (') − 𝜇+ (&)
𝑚
𝑚𝑖𝑛 '($
$ &
𝐽(𝑐 ,…,𝑐 , 𝜇$ , . . , 𝜇% )
𝑐 $ ,…,𝑐 &
𝜇$ , … , 𝜇%
Andrew Ng
Andrew Ng
Cost function for K-means "
! "
1 )
𝐽 𝑐 ,…,𝑐 , 𝜇! , . . , 𝜇# = * 𝑥 ($) − 𝜇( (-)
𝑚
$%!
Repeat {
for 𝑖 = 1 to 𝑚:
𝑐 (") = index of cluster
x2
centroid closest to 𝑥 (")
for 𝑖 = 1 to 𝐾:
𝜇$ = average of points in cluster 1 2 3 4 5 6 7 8 9
} x1
Andrew Ng
Moving the centroid
x2
1 2 3 4 5 6 7 8 9 11 12 13
x1
Andrew Ng
Clustering
Initializing K-means
K-means algorithm
Step 0: Randomly initialize 𝐾 cluster centroids 𝜇' , 𝜇' ,…, 𝜇$
Repeat {
Step 1: Assign points to cluster centroids
Step 2: Move cluster centroids
}
Andrew Ng
Random initialization
Choose 𝐾 < 𝑚
Randomly pick 𝐾 training

examples.
Set 𝜇$ , 𝜇$ ,…, 𝜇. equal to these

𝐾 examples.
Andrew Ng
𝐽 𝑐 " ,…,𝑐 1 , 𝜇", . . , 𝜇2
Andrew Ng
Random initialization
For 𝑖 = 1 to 100 {
Randomly initialize K-means.
Run K-means. Get 𝑐 $ , … , 𝑐 & , 𝜇$ , 𝜇$ ,…, 𝜇.
Compute 𝐽(𝑐 $ , … , 𝑐 & , 𝜇$ , 𝜇$ ,…, 𝜇. )
}
Pick set of clusters that gave lowest cost 𝐽
Andrew Ng
Clustering
Choosing the Number of Clusters

What is the right value of K?
Andrew Ng
Choosing the value of K
Elbow method
Cost function
Cost function
1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8
(no. of clusters) (no. of clusters)
Andrew Ng
Choosing the value of K
Often, you want to get clusters for some later (downstream) purpose.
Evaluate K-means based on how well it performs on that later purpose.
XL
T-shirt sizing L T-shirt sizing
L
M M
Weight
Weight
S XS
𝑘=3 𝑘=5
Height Height
Andrew Ng
Anomaly Detection
Finding unusual events

Anomaly detection example
Aircraft engine features: Dataset: 𝑥 (') , 𝑥 (+) , … , 𝑥 (,)
𝑥!= heat generated
𝑥%= vibration intensity
… New engine: 𝑥()*(
𝑥+ (vibration)
𝑥' (heat)
Andrew Ng
Density estimation
Dataset: 𝑥 (') , 𝑥 (+) , … , 𝑥 (,)
Model 𝑝(𝑥)
Is 𝑥&'(& anomalous?
(vibration
𝑝(𝑥()*( ) ≥ 𝜀 𝑝(𝑥()*( ) < 𝜀

)
𝑥+
𝑥' (heat)
Andrew Ng
Fraud detection:
• 𝑥 (*) = features of user 𝑖’s activities
• Model 𝑝(𝑥) from data.
• Identify unusual users by checking which have 𝑝 𝑥 < 𝜀
Manufacturing: Monitoring computers in a data center:

𝑥 (*) = features of product 𝑖 𝑥 (") = features of machine 𝑖
• 𝑥$ = memory use,
• 𝑥% = number of disk accesses/sec,
• 𝑥& = CPU load,
• 𝑥' = CPU load/network traffic.
Andrew Ng
Anomaly Detection
Gaussian (Normal) Distribution

Gaussian (Normal) distribution
Say 𝑥 is a number.
Probability of 𝑥 is determined by a Gaussian with mean 𝜇, variance 𝜎 % .
1 7(8 7 9)0
𝑝(𝑥) = 𝑒 !:0
2𝜋𝜎
𝑥
Andrew Ng
Gaussian distribution example
Andrew Ng
Parameter estimation
Dataset: 𝑥 (') , 𝑥 (+) , … , 𝑥 (,)
Andrew Ng
Anomaly Detection
Algorithm
Density estimation 𝑥"
𝑥, 𝑥!
Training set: x (') , x (+) , … , x (,)
Each example x (") has 𝑛 features x=
...
𝑥$
𝑥A
𝑝(x) = 𝑝(𝑥' ; 𝜇' , 𝜎'+ ) ∗ 𝑝(𝑥+ ; 𝜇+ , 𝜎++ ) ∗ 𝑝(𝑥- ; 𝜇- , 𝜎-+ ) ∗ ⋯ ∗ 𝑝(𝑥. ; 𝜇. , 𝜎.+)
𝑝 𝑥" = high temp = 1/10

5
𝑝 𝑥$ = high vibra = 1/20
= 3 𝑝(𝑥4 ; 𝜇4 , 𝜎4, )
𝑝 𝑥" , 𝑥" = 𝑝 𝑥" ∗ 𝑝(𝑥$ )
4($
Andrew Ng
Anomaly detection algorithm
1. Choose 𝑛 features 𝑥" that you think might be indicative of anomalous examples.
2. Fit parameters 𝜇$ , … , 𝜇+ , 𝜎$% , … , 𝜎+%
Vectorized formula
* *
1 (") 1 *
𝜇$
(")
𝜇( = 3 𝑥( 𝜎(% = 3(𝑥( −𝜇( )% 1 𝜇%
𝑚 𝑚 µ= 3 x (") µ=
")$ ")$ 𝑚 …
")$ 𝜇+
3. Given new example 𝑥, compute 𝑝(𝑥):
+ +
1 (𝑥( − 𝜇( )%
𝑝 𝑥 = :𝑝 𝑥( ; 𝜇( , 𝜎(% =: 𝑒𝑥𝑝(− )
2𝜋𝜎( 2𝜎(%
()$ ()$
Anomaly if 𝑝 𝑥 < 𝜀
Andrew Ng
"
𝑥()*(
𝑥%
𝑥$ 𝑥%
$
𝑥()*( 𝜇" = 5, 𝜎" = 2 𝜇$ = 3, 𝜎$ = 1
𝑥$ 𝑝(𝑥" ; 𝜇" , 𝜎"$ ) 𝑝(𝑥$ ; 𝜇$ , 𝜎$$ )
𝜀 = 0.02
(')
𝑝 𝑥()*( = 0.0426
𝑥% (+)
𝑥$ 𝑝 𝑥()*( = 0.0021
Andrew Ng
Anomaly Detection
Developing and evaluating an

anomaly detection system
The importance of real-number evaluation
When developing a learning algorithm (choosing features, etc.),
making decisions is much easier if we have a way of evaluating our learning algorithm.
Assume we have some labeled data, of anomalous and non-anomalous examples.
Training set: 𝑥 ($) , 𝑥 (%) , … , 𝑥 (*) (assume normal examples/not anomalous)
$ $ * *!"
Cross validation set: 𝑥,- , 𝑦,- , . . . , 𝑥,- !" , 𝑦,-
$ $ * *
Test set: 𝑥./0. , 𝑦./0. , . . . , 𝑥./0.#$%# , 𝑦./0.#$%#
Andrew Ng
Aircraft engines monitoring example
10000 good (normal) engines
20 flawed engines (anomalous)
Training set: 6000 good engines
CV: 2000 good engines (𝑦 = 0) 10 anomalous (𝑦 = 1)
Test: 2000 good engines (𝑦 = 0), 10 anomalous (𝑦 = 1)
Alternative: No test set

Training set: 6000 good engines
CV: 4000 good engines (𝑦 = 0), 20 anomalous (𝑦 = 1)
Andrew Ng
Algorithm evaluation
Fit model 𝑝(𝑥) on training set 𝑥 (') , 𝑥 (+) , … , 𝑥 (,)
On a cross validation/test example 𝑥, predict
1 𝑖𝑓 𝑝 𝑥 < 𝜀 (𝑎𝑛𝑜𝑚𝑎𝑙𝑦)
𝑦= 8
0 𝑖𝑓 𝑝 𝑥 ≥ 𝜀 (𝑛𝑜𝑟𝑚𝑎𝑙)
Possible evaluation metrics:
- True positive, false positive, false negative, true negative
- Precision/Recall
- F1-score
Can also use cross validation set to choose parameter 𝜀
Andrew Ng
Anomaly Detection
Anomaly detection
vs. supervised learning
Anomaly detection vs. Supervised learning
Very small number (0 to 20) of Large number of positive
positive examples 𝑦 = 1; and negative examples.
large number of negative examples (𝑦 = 0);
Model 𝑝 𝑥 with just negative examples.
Use positive examples for cv and test sets
Enough positive examples
Many different “types” of anomalies. for algorithm to get a sense of
Hard for any algorithm to learn what positive examples are like.
(from just positive examples) Future positive examples likely to be
what the anomalies look like. similar to ones in training set.
Future anomalies may look
nothing like any of the anomalous
examples seen so far.
Andrew Ng
Anomaly detection vs. Supervised learning
• Fraud detection • Email spam classification
• Manufacturing: • Manufacturing:
Finding new previously unseen defects. Finding known, previously seen defects
(e.g. aircraft engines) (e.g. scratches on smartphones, 𝑦 = 1)
• Monitoring machines in a data center • Weather prediction (sunny/rainy/etc.)

• Security applications
• Diseases classification
Andrew Ng
Anomaly Detection
Choosing what features to use

Non-gaussian features
𝑥" log(𝑥" )
𝑝(𝑥$ ; 𝜇$ , 𝜎$, ) 𝑥$ log(𝑥$ + 1)
"+
𝑥% 𝑥% = 𝑥% $
plt.hist(x) "+
𝑥, 𝑥, %
𝑥
np.log(x)
𝑥 𝑥
Andrew Ng
Error analysis for anomaly detection
Want 𝑝(𝑥) large for normal examples 𝑥.
𝑝(𝑥) small for anomalous examples 𝑥.
Most common problem:

𝑝(𝑥) is comparable for normal and anomalous examples.
(𝑝(𝑥) is large for both)
𝑥%
𝑥! 𝑥!
Andrew Ng
Monitoring computers in a data center
Choose features that might take on
unusually large or small values
in the event of an anomaly.
𝑥' = memory use of computer
𝑥+ = number of disk accesses/sec
𝑥- = CPU load
𝑥/ = network traffic
𝑥0 = CPU load (CPU load)2

𝑥1 =
network traffic network traffic
Deciding feature choice based on 𝑝 𝑥
Large for normal examples;
Becomes small for anomaly in the cross validation set
Andrew Ng
Copyright Notice

Recommender
Systems
Recommender System
Making recommendations
Predicting movie ratings Ratings
User rates movies using one to five stars
Movie Alice(1) Bob(2) Carol(3) Dave(4)
Love at last 5 5 0 0
Romance forever 5 ? ? 0
Cute puppies of love ? 4 0 ? nu = no. of users
Nonstop car chases 0 0 5 4 nm = no. of movies
Swords vs. karate 0 0 5 ? r(i,j)=1 if user j has

rated movie i
𝑛! = 4 𝑟(1,1) = 1 y(i,j) = rating given by
user j to movie i
𝑛" = 5 𝑟(3,1) = 0 𝑦 ($,&) = 4 (defined only if r(i,j)=1)
Andrew Ng
Collaborative Filtering
Using per-item features

What if we have features of the movies? 𝑛! = 4
𝑛" = 5
x1
(romance)
x2
(action)
𝑛=2
Love at last 5 5 0 0 0.9 0 0.9
5 ? ? 0
𝑥 ($) =
Romance forever 1.0 0.01 0
Cute puppies of love ? 4 0 ? 0.99 0
(()
0.99
Nonstop car chases 0 0 5 4 0.1 1.0 𝑥 =
0
Swords vs. karate 0 0 5 ? 0 0.9
(1) (1) just linear
For user 1: Predict rating for movie i as: w · x(i) + b regression
w(1) $ x(3) + b(1) = 4.95
! ".'
w(1) = "
𝑏 ($) = 0 x(3) = "
For user j: Predict user j’s rating for movie i as w(j) $ x(i) + b(j)
Andrew Ng
Cost function
Notation:
r(i,j) = 1 if user j has rated movie i (0 otherwise)
y(i,j) = rating given by user j on movie i (if defined)
w(j), b(j) = parameters for user j
x(i) = feature vector for movie i
For user j and movie i, predict rating: w(j) ! x(i) + b(j)

m(j) = no. of movies rated by user j
To learn w(j), b(j)
)
1 ()) (*) ()) (*,)) / λ ($) *
min J 𝑤 ( ,𝑏 ( = 2𝑚()) + 𝑤 $𝑥 +𝑏 −𝑦 + + 𝑤
(")
! " (") 2𝑚($) &'( &
*:,(*,)).$
Andrew Ng
Cost function
To learn parameters 𝑤 (() , 𝑏 (() for user j :
)
( ( 1 * λ ($) *
J 𝑤 ,𝑏 = + 𝑤 ($) $ 𝑥 (+) + 𝑏 ($) − 𝑦 (+,$) + + 𝑤&
2 2
+:-(+,$) &'(
To learn parameters 𝑤 ()) , 𝑏 ()) , 𝑤 (&) , 𝑏 (&) , ⋯ 𝑤 *! ,𝑏 *! for all users :

)$ )$ )
1 λ *
𝑤 ()) , …,𝑤 *! ($) (+) ($) (+,$) *
($)
+ + + 𝑤&
J = 2+ + 𝑤 $𝑥 +𝑏 −𝑦 2
𝑏 ()) , … , 𝑏 *! $'( +:- +,$ '( $'( &'(
𝑓(𝑥)
Andrew Ng
Collaborative filtering algorithm

Problem motivation
Movie Alice (1) Bob (2) Carol (3) Dave (4) x1 x2
(romance) (action)
Love at last 5 5 0 0 0.9 0
Romance forever 5 ? ? 0 1.0 0.01
Cute puppies of love ? 4 0 ? 0.99 0
Nonstop car chases 0 0 5 4 0.1 1.0
Swords vs. karate 0 0 5 ? 0 0.9
Problem motivation
Movie Alice (1) Bob (2) Carol (3) Dave (4) x1 x2
(romance) (action)
Love at last 5 5 0 0 ? ?
Romance forever 5 ? ? 0 ? ? 𝑥 (")
Cute puppies of love ? 4 0 ? ? ? 𝑥 ($)
Nonstop car chases 0 0 5 4 ? ?

Swords vs. karate 0 0 5 ? ? ?
5 5 0 0 using 𝒘(𝒋) ! 𝒙(𝒊) + 𝒃(𝒋)

𝑤 (#)% , 𝑤 (&)% , 𝑤 (')% , 𝑤 (()% 1
0 0 5 5 𝑤 (#) ! 𝑥 (#) ≈ 5 → 𝑥 (#) =
0
𝑤 (&) ! 𝑥 (#) ≈ 5
𝑏 (#) = 0 , 𝑏 (&) = 0 , 𝑏 (') = 0 , 𝑏 (() = 0 𝑤 (') ! 𝑥 (#) ≈ 0
𝑤 (() ! 𝑥 (#) ≈ 0
Cost function
Given 𝑤 (") , 𝑏 (") , 𝑤 ($) , 𝑏 ($) , ⋯ , 𝑤 %$ ,𝑏 %$
&
to learn 𝑥 : %
1 & λ + 𝑥 (&) $
J 𝑥 + = 5 𝑤 (() 6 𝑥 (+) + 𝑏 (() − 𝑦 (+,() + ,
2 2 ,'"
(:- +,( .)
To learn 𝑥 (") , 𝑥 ($) , ⋯ , 𝑥 %%

:
%% %% %
1 $ λ (&) $
J 𝑥 ()) , 𝑥 & ,…,𝑥 *+ = + + 𝑤 (() $ 𝑥 (&) + 𝑏 (() − 𝑦 (&,() + + + 𝑥,
2 2
&'" (:* &,( '" &'" ,'"
Andrew Ng
j=1 j=2 j=3
Collaborative filtering Alice Bob Carol
𝑖=1 Movie1 5 5 ?
Cost function to learn 𝑤 (") , 𝑏 (") , ⋯𝑤 %$ ,𝑏 %$ : 𝑖=2 Movie2 ? 2 3
1' 1' 1
1 & λ (0) &
min 3 3 𝑤 (0) ! 𝑥 (2) +𝑏 (0) − 𝑦 (2,0) + 3 3 𝑤5
,(%),.(%), ⋯, , &' ,. &' 2 2
0%# 2:4 2,0 %# 0%# 5%#
"
Cost function to learn 𝑥 , ⋯ , 𝑥 (%%) :
1( 1( 1
1 & λ (2) &
min (& 3 3 𝑤 (0) ! 𝑥 (2) +𝑏 (0) − 𝑦 (2,0) + 3 3 𝑥5
6 , ⋯, 6 ()
(%) 2 2
2%# 0:4 2,0 %# 2%# 5%#
Put them together:

1' 1 1( 1
min & λ & λ &
,(%), …, , &' 1 3 𝑤 (0) ! 𝑥 (2) + 𝑏 (0) − 𝑦 (2,0) + 3 3 𝑤5
0
+ 3 3 𝑥5(2)
.(%), …, . &' 𝐽 𝑤, 𝑏, 𝑥 = 2 2
2 2,0 :4 2,0 %# 0%# 5%# 2%# 5%#
6(%), …, 6 &(
Andrew Ng
Gradient Descent
Linear regression (course 1)
repeat {
< (() (()
𝑤+ = 𝑤+ − 𝛼 <= 𝐽 𝑤, 𝑏 𝑤+ = 𝑤+ −𝛼 /
> (?) 𝐽 𝑤, 𝑏, 𝑥
/ /3>
𝑏=𝑏− 𝛼 /0 𝐽 𝑤, 𝑏
𝑏 (() = 𝑏 (() − 𝛼 / ? 𝐽(𝑤, 𝑏, 𝑥)
/0
(+) (+) /
𝑥1 = 𝑥1 − 𝛼 (>) J(w,b,x)
/2@
}
w, b, x
Andrew Ng
Binary labels:
favs,
likes and clicks
Binary labels

Love at last 1 1 0 0
Romance forever 1 ? ? 0
Cute puppies of love ? 1 0 ?
Nonstop car chases 0 0 1 1
Swords vs. karate 0 0 1 ?
Andrew Ng
Example applications
1. Did user j purchase an item after being shown?
2. Did user j fav/like an item?
3. Did user j spend at least 30sec with an item?
4. Did user j click on an item?
Meaning of ratings:
1 - engaged after being shown item
0 - did not engage after being shown item
? - item not yet shown
Andrew Ng
From regression to binary classification
Previously:
Predict 𝒚 𝒊,𝒋 as 𝒘(𝒋) 6 𝒙(𝒊) + 𝒃(𝒋)
For binary labels:
Predict that the probability of 𝒚 𝒊,𝒋 =𝟏
is given by g 𝒘(𝒋) 6 𝒙(𝒊) + 𝒃(𝒋)
9
where g 𝑧 = 9:; "#
Andrew Ng
Cost function for binary application
Previous cost function:
1( 1 1' 1
1 & λ (2) & λ &
3 𝑤 (0) ! 𝑥 (2) + 𝑏 (0) − 𝑦 (2,0) + 3 3 𝑥5 + 3 3 𝑤5
0
2 2,0 :4 2,0 %#
2 2
2%# 5%# 0%# 5%#
( + (
Loss for binary labels 𝑦 (+,() : 𝑓(3,0,2) 𝑥 = 𝑔(𝑤 6𝑥 +𝑏 )
Loss for
𝐿 𝑓(,,.,6) 𝑥 , 𝑦 2,0 =−𝑦 2,0 log 𝑓(,,.,6) 𝑥 − 1−𝑦 2,0 log 1 − 𝑓(,,.,6) 𝑥 single
example
𝐽 𝑤, 𝑏, 𝑥 = + 𝐿 𝑓(-,.,/) 𝑥 , 𝑦 &,( cost for all examples

(&,():* &,( '"
𝑔(𝑤 (0) ! 𝑥 (2) + 𝑏 (0) )
Andrew Ng
Recommender Systems
implementation
Mean normalization
Users who have not rated any movies
Movie Alice(1) Bob (2) Carol (3) Dave (4) Eve (5)
5 5 0 0 ?
Love at last 5 5 0 0 ? 5 ? ? 0 ?
? 4 0 ? ?
Romance forever 5 ? ? 0 ?
0 0 5 4 ?
Cute puppies of love ? 4 0 ? ?
0 0 5 0 ?
Nonstop car chases 0 0 5 4 ?
Swords vs. karate 0 0 5 ? ?
𝐦𝐢𝐧 1' 1 1( 1
𝒘(𝟏), ….𝒘 𝒏𝒖 1 & λ 0 & λ (2) &
𝒃(𝟏), ….𝒃 𝒏𝒖 3 𝑤 (0) ! 𝑥 (2) + 𝑏 (0) − 𝑦 (2,0) + 3 3 𝑤5 + 3 3 𝑥5
2 2 2
𝒙(𝟏), ….𝒙 𝒏𝒎 2,0 :4 2,0 %# 0%# 5%# 2%# 5%#
Mean Normalization
5 5 0 0 ? 2.5 2.5 2.5 −2.5 −2.5 ?

5 ? ? 0 ? 2.5 2.5 ? ? −2.5 ?
? 4 0 ? ? 𝜇= 2 ? 2 −2 ? ?
0 0 5 4 ? 2.25 −2.25 −2.25 2.75 1.75 ?
0 0 5 0 ? 1.25 −1.25 −1.25 3.75 −1.25 ?
For user j, on movie i predict:

+ 𝜇+
User 5 (Eve):
+𝜇
Recommender Systems
implementational detail
TensorFlow implementation
Derivatives in ML
Gradient descent algorithm 3
Repeat until convergence
Learning rate 2
Derivative
1
0
-0.5 0 0.5 1 1.5 2 2.5
Andrew Ng
y
Custom Training Loop

f(x)
𝑱 = (𝒘𝒙 − 𝟏)𝟐
w = tf.Variable(3.0)
x = 1.0
Tf.variables are the parameters we want
y = 1.0 # target value to optimize
alpha = 0.01 Auto Diff
iterations = 30 Auto Grad
for iter in range(iterations):
Fix b = 0 for this example # Use TensorFlow’s Gradient tape to record the steps
# used to compute the cost J, to enable auto differentiation.
with tf.GradientTape() as tape:
fwb = w*x f(x)
costJ = (fwb - y)**2
𝝏
# Use the gradient tape to calculate the gradients 𝑱(𝒘)
# of the cost with respect to the parameter w. 𝝏𝒘
[dJdw] = tape.gradient( costJ, [w] )
# Run one step of gradient descent by updating

# the value of w to reduce the cost.
w.assign_add(-alpha * dJdw)
tf.variables require special function to
modify
Andrew Ng
Implementation in TensorFlow
# Instantiate an optimizer.
Gradient descent algorithm optimizer = keras.optimizers.Adam(learning_rate=1e-1)
iterations = 200
Repeat until convergence for iter in range(iterations):
# Use TensorFlow’s GradientTape
# to record the operations used to compute the cost
with tf.GradientTape() as tape:
# Compute the cost (forward pass is included in cost)

cost_value = cofiCostFuncV(X, W, b, Ynorm, R,
num_users, num_movies, lambda)
𝑛E 𝑛F
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to
the loss
grads = tape.gradient( cost_value, [X,W,b] )
# Run one step of gradient descent by updating

# the value of the variables to minimize the loss.
optimizer.apply_gradients( zip(grads, [X,W,b]) )
Dataset credit: Harper and Konstan. 2015. The MovieLens Datasets: History and Context
Andrew Ng
Finding related items

Finding related items
The features x(i) of item i are quite hard to interpret.
To find other items related to it,
find item k with 𝒙(𝒌) similar to x(i)
*
i.e. with smallest (1) (+) &
distance 5 𝑥8 − 𝑥8 𝒙(𝒌) x(i)
8.)
𝟐
𝒙(𝒌) − 𝒙(𝒊)
Andrew Ng
Limitations of Collaborative Filtering
Cold start problem. How to
• rank new items that few users have rated?

• show something reasonable to new users who have rated few
items?
Use side information about items or users:
• Item: Genre, movie stars, studio, ….

• User: Demographics (age, gender, location), expressed
preferences, …
Andrew Ng
Content-based Filtering
Collaborative filtering
vs
Content-based filtering
Collaborative filtering vs Content-based filtering
Collaborative filtering:
Recommend items to you based on rating of users who
gave similar ratings as you
Content-based filtering:
Recommend items to you based on features of user and item
to find good match
if user j has rated item i
rating given by user j on item i (if defined)
Andrew Ng
Examples of user and item features
User features:
• Age
• Gender
(𝐣)
• Country 𝐱 𝐮 𝐟𝐨𝐫 𝐮𝐬𝐞𝐫 𝐣
• Movies watched
• Average rating per genre Vector size
• … could be
different
Movie features:
• Year
• Genre/Genres (𝐢)
𝐱 𝐦 𝐟𝐨𝐫 𝐦𝐨𝐯𝐢𝐞 𝐢
• Reviews
• Average rating
• …
Andrew Ng
Content-based filtering: Learning to match
Predict rating of user j on movie i as
computed
computed (𝒊)
(𝒋) from 𝒙𝒎
from 𝒙𝒖
Andrew Ng
Deep learning for

content-based filtering
Neural network architecture
User network Movie network
⋮
xu ⋮ vu xm ⋮ vm
⋮ ⋮ ⋮
64 32 128 32
128 256
Prediction :
𝒈 𝒗𝒖 $ 𝒗𝒎 𝒕𝒐 𝒑𝒓𝒆𝒅𝒊𝒄𝒕 𝒕𝒉𝒆 𝒑𝒓𝒐𝒃𝒂𝒃𝒊𝒍𝒊𝒕𝒚 that 𝒚(𝒊,𝒋) 𝒊𝒔 𝟏
Andrew Ng
Neural network architecture
⋮
xu ⋮
⋮
vu
Prediction
vm
⋮ ⋮
xm ⋮
&
Cost 𝐽= 5 𝑣! (() 6 𝑣" (+) − 𝑦 (+,() + NN regularization term
function +,( :-(+,().)
Andrew Ng
Learned user and item vectors:
(𝒋) (𝒋)
𝒗𝒖 is a vector of length 32 that describes user j with features 𝒙𝒖
(𝒊) (𝒊)
𝒗𝒎 is a vector of length 32 that describes movie i with features 𝒙𝒎
To find movies similar to movie i:
Note: This can be pre-computed ahead of time
Andrew Ng
Advanced implementation
Recommending from
a large catalogue
How to efficiently find recommendation from
a large set of items?
• Movies 1000+ xu ⋮
⋮ ⋮
• Ads 1m+ vu
• Songs 10m+ predictions
• Products 10m+ vm
xm ⋮ ⋮ ⋮
Andrew Ng
Two steps: Retrieval & Ranking
Retrieval:
• Generate large list of plausible item candidates
e.g.
1) For each of the last 10 movies watched by the user,
find 10 most similar movies
2) For most viewed 3 genres, find the top 10 movies

3) Top 20 movies in the country
• Combine retrieved items into list, removing duplicates

and items already watched/purchased
Andrew Ng
Two steps: Retrieval & ranking
Ranking:
⋮
• Take list retrieved xu ⋮
⋮
vu
and rank using
learned model predictions
vm
⋮ ⋮
xm ⋮
• Display ranked items to user
Andrew Ng
Retrieval step
• Retrieving more items results in better performance,

but slower recommendations.
• To analyse/optimize the trade-off, carry out offline experiments
to see if retrieving additional items results in more relevant
recommendations (i.e., 𝑝 𝑦 &,( = 1 of items displayed to user
are higher).
Andrew Ng
Advanced implementation
Ethical use of
recommender systems
What is the goal of the recommender system?
Recommend:
• Movies most likely to be rated 5 stars by user

• Products most likely to be purchased
• Ads most likely to be clicked on
• Products generating the largest profit
• Video leading to maximum watch time
Andrew Ng
Ethical considerations with recommender systems
Travel industry Payday loans
Good travel experience Squeeze customers

to more users more
Bid higher Bid higher

for ads More for ads More profit
profitable
Amelioration: Do not accept ads from exploitative businesses
Andrew Ng
Other problematic cases:
• Maximizing user engagement (e.g. watch time) has led to large

social media/video sharing sites to amplify conspiracy theories and
hate/toxicity
Amelioration : Filter out problematic content such as hate speech,

fraud, scams and violent content
• Can a ranking system maximize your profit rather than users’

welfare be presented in a transparent way?
Amelioration : Be transparent with users
Andrew Ng
TensorFlow Implementation
user_NN = tf.keras.models.Sequential([
tf.keras.layers.Dense(256, activation='relu'),
tf.keras.layers.Dense(32)
])
item_NN = tf.keras.models.Sequential([
tf.keras.layers.Dense(32 )
])
# create the user input and point to the base network

input_user = tf.keras.layers.Input(shape=(num_user_features))
vu = user_NN(input_user)
vu = tf.linalg.l2_normalize(vu, axis=1)
# create the item input and point to the base network

input_item = tf.keras.layers.Input(shape=(num_item_features))
vm = item_NN(input_item)
vm = tf.linalg.l2_normalize(vm, axis=1) vu
# measure the similarity of the two vector outputs
output = tf.keras.layers.Dot(axes=1)([vu, vm]) Prediction
vm
# specify the inputs and output of the model
model = Model([input_user, input_item], output)
# Specify the cost function
cost_fn = tf.keras.losses.MeanSquaredError()
Copyright Notice

Reinforcement Learning
Introduction
What is Reinforcement
Learning?
Autonomous Helicopter
GPS
Accelerometers
Compass
Computer
How to fly it?
Andrew Ng
[Thanks to Pieter Abbeel, Adam Coates and Morgan Quigley] For more videos: http://heli.stanford.edu.
Andrew Ng
Reinforcement Learning
position of helicopter how to move control sticks
state ! action "
# $
reward function
positive reward : helicopter flying well
negative reward : helicopter flying poorly
Andrew Ng
Robotic Dog Example
[Thanks to Zico Kolter]
Andrew Ng
Applications
• Controlling robots
• Factory optimization
• Financial (stock) trading
• Playing games (including video games)
Andrew Ng
Reinforcement
Learning formalism
Mars rover example

Mars Rover Example
terminal state terminal state
state
left right
state
[Credit: Jagriti Agrawal, Emma Brunskill]
Andrew Ng
Reinforcement
Learning formalism
The Return in
reinforcement learning
Return
100 0 0 0 0 40
state 1 2 3 4 5 6
Return
Return (until terminal state)
Discount Factor
Return
Andrew Ng
Example of Return
return
! = 0.5
100 0 0 0 0 40 reward
1 2 3 4 5 6
The return depends on the actions you take.
100 0 0 0 0 40
1 2 3 4 5 6
100 0 0 0 0 40
1 2 3 4 5 6
Andrew Ng
Reinforcement
Learning formalism
Making decisions: Policies

in reinforcement learning
Policy
100 40
policy
state action 100 40
100 40
100 40
A policy is a function 5 ! = " mapping from states to actions, that tells you
what action a to take in a given state s.
Andrew Ng
The goal of reinforcement learning
100 40
Find a policy 5 that tells you what action (a = 5(s)) to take in every state (s) so as to
maximize the return.
Andrew Ng
Reinforcement
Learning formalism
Review of key concepts

Mars rover Helicopter Chess
states 6 states position of helicopter pieces on board
how to move
actions possible move
control stick
rewards
discount factor 7
return 8! + 78" + 7 " 8# + ⋯ 8! + 78" + 7 " 8# + ⋯ 8! + 78" + 7 " 8# + ⋯
policy 5 100 40 Find 5 ! = " Find 5 ! = "
Andrew Ng
Markov Decision Process (MDP)
Agent
Agent
&
state s
reward R action a
Environment /
World
Andrew Ng
State-action value
function
State-action value function

definition
State action value function (Q-function)
= Return if you
• start in state !.
• take action " (once).
• then behave optimally after that.
100 50 25 12.5 20 40 return

action
100 0 0 0 0 40 reward
100 0 0 0 0 40
1 2 3 4 5 6
Andrew Ng
Picking actions
100 50 25 12.5 20 40 return
action
100 0 0 0 0 40 reward
100 100 50 12.5 25 6.25 12.5 10 6.25 20 40 40
100 0 0 0 0 40
1 2 3 4 5 6
! ", $ = Return if you
• start in state %.
• take action & (once).
The best possible return from state ! is max < !, " .

$
'∗
The best possible action in state ! is the action " that gives max < !, " . Optimal < function
$
Andrew Ng
State-action value
function
State-action value function

example
Jupyter Notebook
Andrew Ng
State-action value
function
Bellman Equation
Bellman Equation
' (, * = Return if you
1 2 3 4 5 6
! : current state #(!) = reward of current state

" : current action
! ! : state you get to after taking action "
"! : action that you take in state !′
Andrew Ng
Bellman Equation
! ", $ = & " + ( max !(" ( , $( )
? '
100 100 50 12.5 25 6.25 12.5 10 6.25 20 40 40
100 0 0 0 0 40
1 2 3 4 5 6
Andrew Ng
Explanation of Bellman Equation
' (, * = Return if you
The best possible return from state ! is max * (! , " )

"
, ,
! ", $ = & " + ( max
!
!(" ,$ )
+
Reward you get Return from behaving optimally
right away starting from state # ! .
Andrew Ng
Explanation of Bellman Equation
! ", $ = & " + ( max !(" ( , $( )
? '
100 100 50 12.5 25 6.25 12.5 10 6.25 20 40 40
100 0 0 0 0 40
1 2 3 4 5 6
Andrew Ng
State-action value
function
Random (stochastic)
environment (Optional)
Stochastic Environment
1 2 3 4 5 6
Andrew Ng
Expected Return
100 0 0 0 0 40
1 2 3 4 5 6
Expected Return = Average( #% + -#& + - & #' + - ' #( + ⋯ )
= E[+! + -+" + - " +# + - # +$ + ⋯ ]
Andrew Ng
Expected Return
Goal of Reinforcement Learning:
Choose a policy / ! = " that will tell us what action " to take in state ! so as
to maximize the expected return.
, ,
Bellman ! ", $ = & " + ( .[max
!
! " ,$ ]
+
Equation:
Andrew Ng
Jupyter Notebook
Andrew Ng
Continuous State
Spaces
Example of continuous
state applications
Discrete vs Continuous State
Discrete State:
1 2 3 4 5 6
Continuous State:
0 6 km
Andrew Ng
Andrew Ng
Continuous State
Spaces
Lunar Lander
Lunar Lander
Andrew Ng
Lunar Lander
actions:
do nothing
left thruster
main thruster
right thruster
Andrew Ng
Reward Function
• Getting to landing pad: 100 – 140
• Additional reward for moving toward/away from pad.
• Crash: -100
• Soft landing: +100
• Leg grounded: +10
• Fire main engine: -0.3
• Fire side thruster: -0.03
Andrew Ng
Lunar Lander Problem
Learn a policy ! that, given #
$
#̇
$̇
!=
&
&̇
'
(
picks action " = ! $ so as to maximize the return.
) = 0.985
Andrew Ng
Continuous State
Spaces
Learning the state-value

function
Deep Reinforcement
5
Learning
6
5̇
6̇
! 8
x= 8̇
9 #(!, ")
" :
1
0 '
0
0
12 inputs 64 units 64 units 1 unit
In a state $, use neural network to compute
%($, nothing), %($, left), %($, main), %($, right)
Pick the action a that maximizes %($, ")
Andrew Ng
Bellman Equation
& =' # + + max

!
/(#′ , %$ )
%
! "
! = # ,% &
(")
# (") , %(") , '(# (") ), # $
5 &
Andrew Ng
Learning Algorithm
Initialize neural network randomly as guess of %($, ").
Repeat {
Take actions in the lunar lander. Get ($, ", =($), $ ! ).
Store 10,000 most recent ($, ", =($), $ ! ) tuples.
Train neural network:

Create training set of 10,000 examples using
x = ($, ") and y = =($) + A max

!
%($ ! , "! ).
"
Train %#$% such that %#$% $, " ≈ 6.
Set % = %#$% .
[Mnih et al., 2015, Human-level control through deep reinforcement learning]
Andrew Ng
Continuous State
Spaces
Algorithm refinement:
Improved neural network
architecture
Deep Reinforcement
5
Learning
6
5̇
6̇
! 8
8̇
x= 9 #(!, ")
" :
1
0
0
0
12 inputs 64 units 64 units 1 unit
In a state $, use neural network to compute
%($, nothing), %($, left), %($, main), %($, right)
Pick the action a that maximizes %($, ")
Andrew Ng
Deep Reinforcement Learning
#
$
& %($, nothing)
#̇ %($, left)
! = $̇ %($, main)
%($, right)
&̇
'
8 inputs ( 64 units 64 units 4 units
In a state $, input $ to neural network.

Pick the action a that maximizes % $, " . = $ + A max
!
%($ ! , "! )
"
Andrew Ng
Continuous State
Spaces
!-greedy policy
Learning Algorithm
Repeat {
Train model:
x = ($, ") and y = =($) + A max

!
%($ ! , "! ).
"
Train %#$% such that %#$% $, " ≈ 6. D%.' 5 ≈ 6
Set % = %#$% .
Andrew Ng
How to choose actions while still learning?
In some state s
Option 1:
Pick the action " that maximizes %($, ").
Option 2:
With probability 0.95, pick the action a that maximizes %($, ").
With probability 0.05, pick an action " randomly.
(E = 0.05)
Andrew Ng
Continuous State
Spaces
Mini-batch and soft update
(optional)
How to choose actions while still learning?
$
1 ! ! (
! & 0 1, 3 = 7 8%,' # − $
26
!"#
2104 400
1416 232
1534 315
852 178 repeat { +
… … ' 1 ( ( -
# =#−& #, I
( 2H * D%,' 5 − 6
3210 870 '# ()*
+
' 1 ( ( -
#, I
* = * − & ( 2H * D%,' 5 − 6
'* ()*
}
Andrew Ng
V
V
Mini-batch
V
V price in $1000’s
V
500
! &
400
2104 400 300
1416 232
1534 315 200
batch
852 178
100
… …
3210 870 0
0 1000 2000 3000
size in feet2
Andrew Ng
Mini-batch
! & 0 w, 3
2104 400
1416 232
1534 315 K- K-
852 178 # bedrooms
… …
3210 870
K* K*
size in feet2
Andrew Ng
Learning Algorithm
Repeat {
Train model:
x = ($, ") and y = =($) + A max

!
%($ ! , "! ).
"
Train %#$% such that %#$% $, " ≈ 6.
Set % = %#$% .
Andrew Ng
Soft Update
Set ! = !;<= .
Andrew Ng
Continuous State
Spaces
The state of
reinforcement learning
Limitations of Reinforcement Learning
• Much easier to get to work in a simulation than a real robot!
• Far fewer applications than supervised and unsupervised
learning.
• But … exciting research direction with potential for future
applications.
Andrew Ng
Conclusion
Summary and
Thank you
Courses
• Supervised Machine Learning: Regression and Classification
Linear regression, logistic regression, gradient descent
• Advanced Learning Algorithms
Neural networks, decision trees, advice for ML
• Unsupervised Learning, Recommenders, Reinforcement Learning
Clustering, anomaly detection, collaborative filtering, content-
based filtering, reinforcement learning
Andrew Ng
Andrew Ng

Deeplearning - Ai Deeplearning - Ai

Uploaded by

Copyright:

Available Formats

Deeplearning - Ai Deeplearning - Ai

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Deeplearning - Ai Deeplearning - Ai

Uploaded by

Copyright:

Available Formats

Copyright Notice

These slides are distributed under the Creative Commons License.

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode

Image credit: NASA/JPL-Calte

DNA analysis Astronomical data analysis

Step 0: Randomly initialize 𝐾 cluster centroids 𝜇' , 𝜇' ,…, 𝜇$

Randomly pick 𝐾 training

Set 𝜇$ , 𝜇$ ,…, 𝜇. equal to these

Choosing the Number of Clusters

Finding unusual events

𝑝(𝑥()*( ) ≥ 𝜀 𝑝(𝑥()*( ) < 𝜀

Manufacturing: Monitoring computers in a data center:

Gaussian (Normal) Distribution

𝑝 𝑥" = high temp = 1/10

Developing and evaluating an

Assume we have some labeled data, of anomalous and non-anomalous examples.

Training set: 𝑥 ($) , 𝑥 (%) , … , 𝑥 (*) (assume normal examples/not anomalous)

Training set: 6000 good engines

CV: 2000 good engines (𝑦 = 0) 10 anomalous (𝑦 = 1)

Test: 2000 good engines (𝑦 = 0), 10 anomalous (𝑦 = 1)

Alternative: No test set

• Monitoring machines in a data center • Weather prediction (sunny/rainy/etc.)

Choosing what features to use

Most common problem:

𝑥0 = CPU load (CPU load)2

For the rest of the details of the license, see https://creativecommons.org/licenses/by-sa/2.0/legalcode

Movie Alice(1) Bob(2) Carol(3) Dave(4)

Swords vs. karate 0 0 5 ? r(i,j)=1 if user j has

Using per-item features

For user j and movie i, predict rating: w(j) ! x(i) + b(j)

To learn parameters 𝑤 ()) , 𝑏 ()) , 𝑤 (&) , 𝑏 (&) , ⋯ 𝑤 *! ,𝑏 *! for all users :

Collaborative filtering algorithm

Cute puppies of love ? 4 0 ? ? ? 𝑥 ($)

Nonstop car chases 0 0 5 4 ? ?

5 5 0 0 using 𝒘(𝒋) ! 𝒙(𝒊) + 𝒃(𝒋)

Given 𝑤 (") , 𝑏 (") , 𝑤 ($) , 𝑏 ($) , ⋯ , 𝑤 %$ ,𝑏 %$

To learn 𝑥 (") , 𝑥 ($) , ⋯ , 𝑥 %%

Collaborative filtering Alice Bob Carol

Cost function to learn 𝑤 (") , 𝑏 (") , ⋯𝑤 %$ ,𝑏 %$ : 𝑖=2 Movie2 ? 2 3

Put them together:

Movie Alice(1) Bob(2) Carol(3) Dave(4)

𝐽 𝑤, 𝑏, 𝑥 = + 𝐿 𝑓(-,.,/) 𝑥 , 𝑦 &,( cost for all examples

5 5 0 0 ? 2.5 2.5 2.5 −2.5 −2.5 ?

For user j, on movie i predict:

Custom Training Loop

# Run one step of gradient descent by updating

# Compute the cost (forward pass is included in cost)

# Run one step of gradient descent by updating

Finding related items

• rank new items that few users have rated?

Use side information about items or users:

• Item: Genre, movie stars, studio, ….

Deep learning for

To find movies similar to movie i:

Note: This can be pre-computed ahead of time

2) For most viewed 3 genres, find the top 10 movies

• Combine retrieved items into list, removing duplicates

• Display ranked items to user

• Retrieving more items results in better performance,

𝑝(𝑥()( ) ≥ 𝜀 𝑝(𝑥()( ) < 𝜀

To learn parameters 𝑤 ()) , 𝑏 ()) , 𝑤 (&) , 𝑏 (&) , ⋯ 𝑤 ! ,𝑏 ! for all users :