Week 3: Statistical NLP
Sameer Maskey
Topics for Today
Text Clustering
Gaussian Mixture Models
K-Means
Expectation Maximization
Hierarchical Clustering
Course Initial Survey
[Class survey results: bar chart of Yes/No response percentages by category: NLP, SLP, ML, NLP for ML, Adv ML, NLP-ML Pace, Math, Matlab, Matlab Tutorial, Excited for Project, Industry Mentors, Larger Audience]
Perceptron Algorithm
Error is either 1, 0, or -1
Naïve Bayes Classifier for Text
Estimate the class prior $P(Y)$ and the class-conditional likelihoods $P(X \mid Y = y_1)$, $P(X \mid Y = y_2)$, then classify:

$$\hat{Y} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$$
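A minimal sketch of this classifier in Python (function names and the add-one smoothing choice are my own, not from the slides):

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Count words per class to
    estimate P(X_i|Y) and class frequencies to estimate P(Y)."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in docs:
        class_counts[label] += 1
        for w in tokens:
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify_nb(tokens, class_counts, word_counts, vocab):
    """Return argmax_yk of log P(Y=yk) + sum_i log P(X_i|Y=yk)."""
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for y in class_counts:
        score = math.log(class_counts[y] / n_docs)        # log prior
        total = sum(word_counts[y].values())
        for w in tokens:
            # add-one (Laplace) smoothed class-conditional likelihood
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best
```

For example, `classify_nb("yankees won the game".split(), *train_nb(training_docs))` would return the most probable class for the new text, given some hypothetical `training_docs`.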
Data without Labels
Previously our data came with corresponding human labels or scores; now we have only the text:
"is writing a paper" - ?
"has flu" - ?
"is happy, yankees won!" - ?
Document Clustering
Classification vs. Clustering
Clusters for Classification
Document Clustering
Baseball Docs vs. Hockey Docs: which cluster does a new document belong to?
Document Clustering Application
Even though we do not know the human labels, automatically induced clusters can be useful, e.g., news clusters.
How to Cluster Documents with No Labeled Data?
Treat cluster IDs or class labels as hidden variables
Maximize the likelihood of the unlabeled data
We cannot simply count for MLE, as we do not know which point belongs to which class
Use an iterative algorithm such as K-Means or EM

Hidden variables? What do we mean by this?
Hidden vs. Observed Variables
Assuming our observed data is in $\mathbb{R}^2$
K-Means Clustering
Distortion Measure
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|x_n - \mu_k\|^2$$

where $r_{nk} \in \{0, 1\}$ indicates whether point $x_n$ is assigned to cluster $k$, and $\mu_k$ is the centroid of cluster $k$. We want to minimize $J$.
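As a quick illustration, computing $J$ for a given hard assignment takes one line of NumPy (function and argument names are my own):

```python
import numpy as np

def distortion(X, mu, assign):
    """J = sum_n sum_k r_nk ||x_n - mu_k||^2; the one-hot r_nk just
    selects each point's assigned centroid mu[assign[n]]."""
    return ((X - mu[assign]) ** 2).sum()
```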
Estimating Parameters
Minimize $J$ with respect to $r_{nk}$ (Step 1)
Keep $\mu_k$ fixed; assign each point to its nearest centroid: $r_{nk} = 1$ if $k = \arg\min_j \|x_n - \mu_j\|^2$, and $0$ otherwise.
Minimize $J$ with respect to $\mu_k$ (Step 2)
Keep $r_{nk}$ fixed; setting $\partial J / \partial \mu_k = 0$ gives $\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$: each centroid is the mean of the points assigned to it.
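A minimal NumPy sketch that alternates these two steps until the centroids stop moving (illustrative only; initialization scheme and names are my own choices):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """X: (N, D) data matrix. Returns centroids and hard assignments."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # init at K data points
    for _ in range(n_iters):
        # Step 1: fix mu, choose r_nk -- assign each point to nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        assign = dists.argmin(axis=1)
        # Step 2: fix r_nk, choose mu -- mean of the points in each cluster.
        new_mu = np.array([X[assign == k].mean(axis=0) if (assign == k).any()
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: J can no longer decrease
            break
        mu = new_mu
    return mu, assign
```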
Document Clustering with K-means
Represent each document as a term-frequency (or TF-IDF) vector and run K-means on those vectors.
K-Means Example
Hard Assignment to Clusters
K-means makes hard assignments: each point belongs to exactly one cluster ($r_{nk} \in \{0, 1\}$). Mixture models relax this to soft, probabilistic assignments.
Gaussian Mixture Models (GMMs)
Mixtures of 2 Gaussians
$$p(x) = \pi \, \mathcal{N}(x \mid \mu_1, \Sigma_1) + (1 - \pi) \, \mathcal{N}(x \mid \mu_2, \Sigma_2)$$
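In one dimension this density is easy to evaluate directly; a small sketch with made-up parameter values:

```python
import numpy as np
from scipy.stats import norm

pi_, mu1, s1, mu2, s2 = 0.4, -2.0, 1.0, 3.0, 1.5   # illustrative values

def mixture_pdf(x):
    """p(x) = pi N(x|mu1, s1) + (1 - pi) N(x|mu2, s2)."""
    return pi_ * norm.pdf(x, mu1, s1) + (1 - pi_) * norm.pdf(x, mu2, s2)

print(mixture_pdf(np.linspace(-6.0, 8.0, 5)))      # density at a few points
```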
Mixture Models
Mixture Model Classifier
Given a new data point, compute the posterior probability of each class:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

$$p(y = 1 \mid x) \propto \mathcal{N}(x \mid \mu_1, \Sigma_1)\, p(y = 1)$$
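A small sketch of this Bayes-rule computation for two classes with Gaussian class-conditionals (all parameter values here are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

prior = np.array([0.5, 0.5])                  # p(y) for y = 0, 1
mus = np.array([-1.0, 2.0])                   # class-conditional means
sigmas = np.array([1.0, 0.5])                 # class-conditional std devs

def posterior(x):
    """p(y|x) = p(x|y) p(y) / p(x)."""
    joint = norm.pdf(x, mus, sigmas) * prior  # p(x|y) p(y)
    return joint / joint.sum()                # divide by p(x) = sum_y p(x,y)

print(posterior(0.3))                         # posterior over the two classes
```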
Cluster ID/Class Label as Hidden Variables
Marginalize over the hidden variable $z$:

$$p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x \mid z)$$

For a mixture of Gaussians with mixing coefficients $\pi_k$, component means $\mu_k$, and covariances $\Sigma_k$:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$$

Log-likelihood for the two-component case:

$$\ell = \sum_{n=1}^{N} \log \sum_{y=0}^{1} \mathcal{N}(x_n, y \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \pi_0 \, \mathcal{N}(x_n \mid \mu_0, \Sigma_0) + \pi_1 \, \mathcal{N}(x_n \mid \mu_1, \Sigma_1) \right)$$
Log-likelihood for Mixture of Gaussians
$$\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
Explaining Expectation Maximization
Use the points assigned to each Gaussian to recompute its $\mu$ and $\Sigma$, but weight the updates with the soft labels (the Maximization step).
Expectation Maximization
An expectation-maximization (EM) algorithm is used in statistics to find maximum likelihood estimates of parameters in probabilistic models that depend on unobserved (hidden) variables.

E-Step (compute responsibilities):

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
Estimating Parameters
M-Step:

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T$$

$$\pi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$

Iterate until convergence of the log-likelihood:

$$\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
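Putting the E-step and M-step together, here is a sketch of EM for a 1-D Gaussian mixture (initialization scheme and names are my own; not the slides' code):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iters=50, seed=0):
    """x: (N,) data. Alternate E-step (responsibilities gamma) and
    M-step (responsibility-weighted updates of pi, mu, sigma)."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)     # init means at data points
    sigma = np.full(K, x.std())
    for _ in range(n_iters):
        # E-step: gamma[n, k] proportional to pi_k N(x_n | mu_k, sigma_k).
        dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k])
                         for k in range(K)], axis=1)           # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted mean, std, and mixing coefficient per component.
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
        pi = Nk / len(x)
    loglik = np.log(dens.sum(axis=1)).sum()       # from the final E-step
    return pi, mu, sigma, loglik
```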
EM Iterations
[Figure: EM iterations on sample data, from [1]]
Clustering Documents with EM
Treat the class label of a naïve Bayes text model as hidden: the E-step computes soft class posteriors for each document, and the M-step re-estimates word probabilities from the soft counts.
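A sketch of this idea as EM on a mixture of multinomials over word counts (the function is hypothetical; the smoothing and initialization are my own choices):

```python
import numpy as np

def em_doc_clusters(counts, K, n_iters=50, seed=0):
    """counts: (N_docs, V) term-frequency matrix. Soft-cluster documents
    with a mixture of multinomials (naive Bayes with hidden labels)."""
    rng = np.random.default_rng(seed)
    N, V = counts.shape
    pi = np.full(K, 1.0 / K)                      # cluster priors
    theta = rng.dirichlet(np.ones(V), size=K)     # word dist. per cluster
    for _ in range(n_iters):
        # E-step: soft class posteriors, computed in log space for stability.
        log_post = counts @ np.log(theta).T + np.log(pi)       # (N, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate from soft counts, with add-one smoothing.
        pi = gamma.sum(axis=0) / N
        theta = gamma.T @ counts + 1.0                          # (K, V)
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, gamma
```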
Clustering Algorithms
Hierarchical Clustering
Types of Hierarchical Clustering
Agglomerative (bottom-up):
Assign each data point to its own cluster
Iteratively merge the most similar sub-clusters
Eventually, all data points are part of one cluster

Divisive (top-down):
Assign all data points to the same cluster
Iteratively split clusters
Eventually, each data point forms its own cluster

One advantage: we do not need to choose K, the number of clusters, before we begin clustering
Hierarchical Clustering Algorithm
Step 1: Assign each data point to its own cluster
Step 2: Compute the similarity between every pair of clusters
Step 3: Merge the two most similar clusters, leaving one fewer cluster
Repeat Steps 2-3 until all points belong to a single cluster
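A naive $O(N^3)$ sketch of these steps that records the merge history (the linkage options anticipate the next few slides; names are my own):

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Step 1: one cluster per point; then repeat Steps 2-3 until one
    cluster remains. linkage: 'single' (closest pair) or 'complete'."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise dists
    merges = []
    while len(clusters) > 1:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):          # Step 2: compare all pairs
            for b in range(a + 1, len(clusters)):
                pair = [D[i, j] for i in clusters[a] for j in clusters[b]]
                d = min(pair) if linkage == "single" else max(pair)
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best                          # Step 3: merge most similar
        merges.append((clusters[a], clusters[b], d))
        clusters[a] += clusters[b]
        del clusters[b]
    return merges
```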
Hierarchical Clustering Demo
Single Linkage
Cluster similarity is the distance between the closest pair of points, one from each cluster.
Complete Linkage
Cluster similarity is the distance between the farthest pair of points, one from each cluster.
Average Group Linkage
Cluster similarity is the average distance over all pairs of points, one from each cluster.
Hierarchical Clustering for Documents
Summary
Text clustering without labels: K-means makes hard assignments, Gaussian mixture models trained with EM make soft assignments, and hierarchical clustering builds a cluster tree without fixing K in advance.
References