Week 9 Lecture Notes
ML:Anomaly Detection
Problem Motivation
Just like in other learning problems, we are given a dataset $x^{(1)}, x^{(2)}, \dots, x^{(m)}$.
We are then given a new example, $x_{\text{test}}$, and we want to know whether this new example is abnormal/anomalous.
We define a "model" p(x) that tells us the probability the example is not anomalous. We also use a threshold ϵ (epsilon) as a dividing line so we can say which examples are anomalous and which are not.
For example, $x^{(i)}$ = features of user i's activities.
If our anomaly detector is flagging too many anomalous examples, then we need to decrease our threshold ϵ.
Gaussian Distribution
The Gaussian Distribution is a familiar bell-shaped curve that can be described by a function $\mathcal{N}(\mu, \sigma^2)$.
Let $x \in \mathbb{R}$. If the probability distribution of x is Gaussian with mean μ and variance σ², then:

$$x \sim \mathcal{N}(\mu, \sigma^2)$$
Mu, or μ, describes the center of the curve, called the mean. The width of the curve is described by sigma,
or σ, called the standard deviation.
$$p(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$
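To make this concrete, here is a minimal Python/NumPy sketch of evaluating this density; the function name and the sample μ and σ² values are just illustrative, not part of the course material.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return (1.0 / np.sqrt(2 * np.pi * sigma2)) * np.exp(-((x - mu) ** 2) / (2 * sigma2))

# Example: a feature with mean 5 and variance 4
print(gaussian_pdf(5.0, mu=5.0, sigma2=4.0))   # value at the peak, roughly 0.199
print(gaussian_pdf(11.0, mu=5.0, sigma2=4.0))  # far from the mean, much smaller
```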
We can estimate the parameter μ from a given dataset by simply taking the average of all the examples:
$$\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$$
We can estimate the other parameter, σ², with our familiar squared error formula:
$$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$
Algorithm
Given a training set of examples $\{x^{(1)}, \dots, x^{(m)}\}$ where each example is a vector, $x \in \mathbb{R}^n$.
$$p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$$
In statistics, this is called an "independence assumption" on the values of the features inside training
example x.
$$= \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$
The algorithm
Calculate $\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$
Calculate $\sigma_j^2 = \frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$
$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)$$
Anomaly if p(x) < ϵ.
A vectorized version of the calculation for μ is $\mu = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}$. You can vectorize $\sigma^2$ similarly.
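Putting the whole algorithm together, a minimal NumPy sketch might look like the following; the function names, the random toy data, and the ϵ value are assumptions made purely for illustration, not the course's implementation.

```python
import numpy as np

def estimate_gaussian(X):
    """Fit mu_j and sigma_j^2 per feature; X is an (m, n) matrix of examples."""
    mu = X.mean(axis=0)                       # vectorized mu = (1/m) * sum of x^(i)
    sigma2 = ((X - mu) ** 2).mean(axis=0)     # per-feature variance
    return mu, sigma2

def p(X, mu, sigma2):
    """p(x) as the product of per-feature Gaussian densities, for each row of X."""
    densities = (1.0 / np.sqrt(2 * np.pi * sigma2)) * \
                np.exp(-((X - mu) ** 2) / (2 * sigma2))
    return densities.prod(axis=1)

# Toy usage (random data and epsilon are placeholders)
X_train = np.random.randn(1000, 2)            # mostly "normal" examples
mu, sigma2 = estimate_gaussian(X_train)
x_test = np.array([[4.0, -4.0]])              # an unusual point
epsilon = 1e-3
print(p(x_test, mu, sigma2) < epsilon)        # True -> flag as anomalous
```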
To evaluate the algorithm, we start with some labeled data in which anomalous examples are marked y = 1 and non-anomalous examples are marked y = 0. Among that data, take a large proportion of good, non-anomalous data for the training set on which to train p(x).
Then, take a smaller proportion of mixed anomalous and non-anomalous examples (you will usually have
many more non-anomalous examples) for your cross-validation and test sets.
For example, we may have a set where 0.2% of the data is anomalous. We take 60% of those examples, all of which are good (y = 0), for the training set. We then take 20% of the examples for the cross-validation set (with half of the anomalous examples) and the remaining 20% for the test set (with the other half of the anomalous examples).
In other words, we split the data 60/20/20 training/CV/test and then split the anomalous examples 50/50
between the CV and test sets.
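A rough NumPy sketch of that split; the toy labels y, the index bookkeeping, and all names here are assumptions made for illustration, not the course's code.

```python
import numpy as np

# Suppose X holds all examples and y marks anomalies (1) vs. good data (0).
X = np.random.randn(10000, 2)
y = np.zeros(10000); y[:20] = 1           # ~0.2% anomalous, purely illustrative

good = np.random.permutation(np.where(y == 0)[0])
anom = np.random.permutation(np.where(y == 1)[0])

n_good, n_anom = len(good), len(anom)
train_idx = good[:int(0.6 * n_good)]       # 60% good examples -> training set
cv_idx = np.concatenate([good[int(0.6 * n_good):int(0.8 * n_good)], anom[:n_anom // 2]])
test_idx = np.concatenate([good[int(0.8 * n_good):], anom[n_anom // 2:]])

X_train = X[train_idx]                     # train p(x) on non-anomalous data only
X_cv, y_cv = X[cv_idx], y[cv_idx]          # use these to choose epsilon (e.g. via F1 score)
X_test, y_test = X[test_idx], y[test_idx]
```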
Algorithm evaluation:
Precision/recall
F1 score
Use anomaly detection when:
We have a very small number of positive examples (y = 1; 0-20 examples is common) and a large number of negative (y = 0) examples.
We have many different "types" of anomalies and it is hard for any algorithm to learn from positive examples
what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've
seen so far.
Use supervised learning when:
We have a large number of both positive and negative examples. In other words, the training set is more evenly divided into classes.
We have enough positive examples for the algorithm to get a sense of what new positive examples look like. The future positive examples are likely similar to the ones in the training set.
We can check that our features are Gaussian by plotting a histogram of our data and checking for the bell-shaped curve.
Some transforms we can try on an example feature x that does not have the bell-shaped curve are:
log(x)
log(x+1)
√x
x^(1/3)
We can play with each of these to try and achieve the Gaussian shape in our data.
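For instance, one might eyeball a histogram before and after a transform with a quick sketch like this; matplotlib and the skewed toy feature are assumptions for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.exponential(scale=2.0, size=5000)   # a skewed, non-Gaussian feature

fig, axes = plt.subplots(1, 2)
axes[0].hist(x, bins=50)                 # heavily right-skewed
axes[1].hist(np.log(x + 1), bins=50)     # log(x + 1) looks much closer to a bell curve
plt.show()
```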
There is an error analysis procedure for anomaly detection that is very similar to the one in supervised
learning.
Our goal is for p(x) to be large for normal examples and small for anomalous examples.
One common problem is when p(x) is similar for both types of examples. In this case, you need to examine in detail the anomalous examples that receive a high probability and try to figure out new features that will better distinguish the data.
In general, choose features that might take on unusually large or small values in the event of an anomaly.
Instead of modeling $p(x_1), p(x_2), \dots$ separately, we will model p(x) all in one go. Our parameters will be: $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{R}^{n \times n}$.
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$
The important effect is that we can model oblong Gaussian contours, allowing us to better fit data that might not fit into the normal circular contours.
Varying Σ changes the shape, width, and orientation of the contours. Changing μ will move the center of the
distribution.
Check also:
The original model for p(x) corresponds to a multivariate Gaussian where the contours of p(x; μ, Σ) are
axis-aligned.
The multivariate Gaussian model can automatically capture correlations between different features of x.
However, the original model maintains some advantages: it is computationally cheaper (no matrix to invert, which is costly for a large number of features), and it performs well even with a small training set size (in the multivariate Gaussian model, the number of training examples m should be greater than the number of features n for Σ to be invertible).
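A minimal sketch of the multivariate density in NumPy, fitting Σ as the sample covariance of the training data; the function and variable names are assumptions, not part of the course material.

```python
import numpy as np

def multivariate_gaussian(X, mu, Sigma):
    """Evaluate p(x; mu, Sigma) for every row of X."""
    n = mu.shape[0]
    diff = X - mu
    norm = 1.0 / ((2 * np.pi) ** (n / 2) * np.linalg.det(Sigma) ** 0.5)
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), computed row by row
    quad = np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    return norm * np.exp(-0.5 * quad)

X = np.random.randn(500, 2)                   # toy training data
mu = X.mean(axis=0)
Sigma = (X - mu).T @ (X - mu) / X.shape[0]    # sample covariance captures correlations
print(multivariate_gaussian(X[:3], mu, Sigma))
```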
ML:Recommender Systems
Problem Formulation
Recommendation is currently a very popular application of machine learning.
Say we are trying to recommend movies to customers. We can use the following definitions:
$n_u$ = number of users
$n_m$ = number of movies
$r(i, j)$ = 1 if user j has rated movie i (0 otherwise)
$y^{(i,j)}$ = rating given by user j to movie i (defined only if r(i, j) = 1)
One approach is that we could do linear regression for every single user. For each user j, learn a parameter $\theta^{(j)} \in \mathbb{R}^3$. Predict user j as rating movie i with $(\theta^{(j)})^T x^{(i)}$ stars.

$\theta^{(j)}$ = parameter vector for user j
$x^{(i)}$ = feature vector for movie i
$m^{(j)}$ = number of movies rated by user j
$$\min_{\theta^{(j)}} \frac{1}{2} \sum_{i: r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left(\theta_k^{(j)}\right)^2$$

This is our familiar linear regression. The base of the first summation is choosing all i such that r(i, j) = 1.
To learn the parameters for all users, we sum this cost over every user:

$$\min_{\theta^{(1)}, \dots, \theta^{(n_u)}} \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i: r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left(\theta_k^{(j)}\right)^2$$
We can apply our linear regression gradient descent update using the above cost function. The only real difference is that we eliminate the constant $\frac{1}{m}$.
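As a sketch of that per-user cost in NumPy: the variable names, and the assumption that the first column of X is the intercept feature x₀ = 1, are mine for illustration, not the course's code.

```python
import numpy as np

def user_cost(theta_j, X, y_j, r_j, lam):
    """Regularized cost for a single user j.
    X:   (n_m, n + 1) movie feature matrix, first column assumed to be x_0 = 1
    y_j: (n_m,) user j's ratings
    r_j: (n_m,) 1 where user j rated the movie, else 0
    """
    err = (X @ theta_j - y_j) * r_j                  # keep only movies with r(i, j) = 1
    reg = (lam / 2.0) * np.sum(theta_j[1:] ** 2)     # the sum over k = 1..n skips theta_0
    return 0.5 * np.sum(err ** 2) + reg
```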
Collaborative Filtering
It can be very difficult to find features such as "amount of romance" or "amount of action" in a movie. To figure this out, we can use feature finders.
We can let the users tell us how much they like the di erent genres, providing their parameter vector
immediately for us.
To infer the features from given parameters, we use the squared error function with regularization over all
the users:
$$\min_{x^{(1)}, \dots, x^{(n_m)}} \frac{1}{2} \sum_{i=1}^{n_m} \sum_{j: r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left(x_k^{(i)}\right)^2$$
You can also randomly guess the values for theta, use them to infer the features, then use those features to get better values for theta, and repeat. You will actually converge to a good set of features.
Instead of alternating, we can minimize over the features and the parameters simultaneously:

$$J(x, \theta) = \frac{1}{2} \sum_{(i,j): r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \left(x_k^{(i)}\right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left(\theta_k^{(j)}\right)^2$$
It looks very complicated, but we've only combined the cost function for theta and the cost function for x. Because the algorithm can learn them itself, the bias units where $x_0 = 1$ have been removed; therefore $x \in \mathbb{R}^n$ and $\theta \in \mathbb{R}^n$.
1. Initialize $x^{(1)}, \dots, x^{(n_m)}, \theta^{(1)}, \dots, \theta^{(n_u)}$ to small random values. This serves to break symmetry and ensures that the algorithm learns features $x^{(1)}, \dots, x^{(n_m)}$ that are different from each other.
2. Minimize J(x, θ) using gradient descent. For example, the update for each $\theta_k^{(j)}$ is:

$$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\left(\sum_{i: r(i,j)=1} \left((\theta^{(j)})^T x^{(i)} - y^{(i,j)}\right) x_k^{(i)} + \lambda \theta_k^{(j)}\right)$$

The update for each $x_k^{(i)}$ is analogous.
3. For a user with parameters θ and a movie with (learned) features x, predict a star rating of $\theta^T x$.
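A compact NumPy sketch of these three steps; the function name, shapes, fixed learning rate, and iteration count are all assumptions made for illustration (an advanced optimizer could replace the plain gradient loop).

```python
import numpy as np

def collab_filter(Y, R, n_features, lam=1.0, alpha=0.005, iters=500):
    """Y: (n_m, n_u) ratings, R: (n_m, n_u) indicator r(i, j). Returns learned X, Theta."""
    n_m, n_u = Y.shape
    X = np.random.randn(n_m, n_features) * 0.01      # step 1: small random initialization
    Theta = np.random.randn(n_u, n_features) * 0.01

    for _ in range(iters):                            # step 2: gradient descent on J(x, theta)
        err = (X @ Theta.T - Y) * R                   # zero out unrated entries
        X_grad = err @ Theta + lam * X
        Theta_grad = err.T @ X + lam * Theta
        X -= alpha * X_grad
        Theta -= alpha * Theta_grad
    return X, Theta

# Step 3: predicted ratings for every (movie, user) pair
# predictions = X @ Theta.T
```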
Predicting how similar two movies i and j are can be done using the distance between their respective feature vectors x. Specifically, we are looking for a small value of $\|x^{(i)} - x^{(j)}\|$.
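For instance, a small sketch (assuming X holds the learned feature matrix with one row per movie; the function name is made up):

```python
import numpy as np

def most_similar(X, i, k=5):
    """Return indices of the k movies whose feature vectors are closest to movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)   # ||x^(i) - x^(j)|| for every movie j
    dists[i] = np.inf                          # exclude the movie itself
    return np.argsort(dists)[:k]
```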
For a brand-new user who has not rated any movies, the regularization term drives that user's θ to zero, so every predicted rating would be zero. We rectify this problem by normalizing the data relative to the mean. First, we use a matrix Y to store the data from previous ratings, where the ith row of Y is the ratings for the ith movie and the jth column corresponds to the ratings for the jth user.
$$\mu = [\mu_1, \mu_2, \dots, \mu_{n_m}]$$

such that

$$\mu_i = \frac{\sum_{j: r(i,j)=1} Y_{i,j}}{\sum_{j} r(i,j)}$$
This is effectively the mean of the previous ratings for the ith movie, counting only the users who have actually rated it. We can now normalize the data by subtracting μ, the vector of mean ratings, from the actual ratings for each user (each column in matrix Y):
$$Y = \begin{bmatrix} 5 & 5 & 0 & 0 \\ 4 & ? & ? & 0 \\ 0 & 0 & 5 & 4 \\ 0 & 0 & 5 & 0 \end{bmatrix}, \qquad \mu = \begin{bmatrix} 2.5 \\ 2 \\ 2.25 \\ 1.25 \end{bmatrix}$$
Now we must slightly modify the linear regression prediction to include the mean normalization term:
$$(\theta^{(j)})^T x^{(i)} + \mu_i$$
Now, for a new user, the initial predicted values will be equal to the μ term instead of simply being
initialized to zero, which is more accurate.
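A short NumPy sketch of this mean normalization, using the small Y from the example above; the R indicator matrix and the variable names are assumptions made for illustration.

```python
import numpy as np

Y = np.array([[5, 5, 0, 0],
              [4, 0, 0, 0],
              [0, 0, 5, 4],
              [0, 0, 5, 0]], dtype=float)
R = np.array([[1, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 1, 1, 1],
              [1, 1, 1, 1]])                  # the two '?' entries in row 2 are unrated

mu = (Y * R).sum(axis=1) / R.sum(axis=1)      # per-movie mean over rated entries only
Y_norm = (Y - mu[:, None]) * R                # subtract the mean only where r(i, j) = 1

# Prediction for user j on movie i adds the mean back:
# rating = Theta[j] @ X[i] + mu[i]
print(mu)                                     # [2.5, 2.0, 2.25, 1.25]
```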