Week 3: Statistical NLP
Sameer Maskey
Topics for Today
Text Clustering
Gaussian Mixture Models
K-Means
Expectation Maximization
Hierarchical Clustering
Course Initial Survey
[Class survey results: bar chart of Yes/No response percentages by category: NLP, SLP, ML, NLP for ML, Adv ML, NLP-ML Pace, Math, Matlab, Matlab Tutorial, Excited for Project, Industry Mentors, Larger Audience]
Perceptron Algorithm
Error is either 1, 0, or -1
Naïve Bayes Classifier for Text
Estimate the class prior $P(Y)$ and the class-conditional likelihoods $P(X \mid Y = y_1)$, $P(X \mid Y = y_2)$, then classify:

$$\hat{Y} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i \mid Y = y_k)$$
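A minimal sketch of this classifier in Python (function names and the add-one smoothing choice are my own, not from the slides):

```python
from collections import Counter, defaultdict
import math

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Count words per class to
    estimate P(X_i|Y) and class frequencies to estimate P(Y)."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for tokens, label in docs:
        class_counts[label] += 1
        for w in tokens:
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def classify_nb(tokens, class_counts, word_counts, vocab):
    """Return argmax_yk of log P(Y=yk) + sum_i log P(X_i|Y=yk)."""
    n_docs = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for y in class_counts:
        score = math.log(class_counts[y] / n_docs)        # log prior
        total = sum(word_counts[y].values())
        for w in tokens:
            # add-one (Laplace) smoothed class-conditional likelihood
            score += math.log((word_counts[y][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = y, score
    return best
```

For example, `classify_nb("yankees won the game".split(), *train_nb(training_docs))` would return the most probable class for the new text, given some hypothetical `training_docs`.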
Data without Labels
Previously our data came with corresponding human labels or scores; now we have only the text:
"is writing a paper" - ?
"has flu" - ?
"is happy, yankees won!" - ?
Document Clustering
Classification vs. Clustering
Clusters for Classification
Document Clustering
Baseball Docs vs. Hockey Docs: which cluster does a new document belong to?
Document Clustering Application
Even though we do not know the human labels, automatically induced clusters can be useful, e.g., news clusters.
How to Cluster Documents with No Labeled Data?
Treat cluster IDs or class labels as hidden variables
Maximize the likelihood of the unlabeled data
We cannot simply count for MLE, as we do not know which point belongs to which class
Use an iterative algorithm such as K-Means or EM

Hidden variables? What do we mean by this?
Hidden vs. Observed Variables
Assuming our observed data is in $\mathbb{R}^2$
K-Means Clustering
Distortion Measure
$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|x_n - \mu_k\|^2$$

where $r_{nk} \in \{0, 1\}$ indicates whether point $x_n$ is assigned to cluster $k$, and $\mu_k$ is the centroid of cluster $k$. We want to minimize $J$.
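As a quick illustration, computing $J$ for a given hard assignment takes one line of NumPy (function and argument names are my own):

```python
import numpy as np

def distortion(X, mu, assign):
    """J = sum_n sum_k r_nk ||x_n - mu_k||^2; the one-hot r_nk just
    selects each point's assigned centroid mu[assign[n]]."""
    return ((X - mu[assign]) ** 2).sum()
```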
Estimating Parameters
Minimize $J$ with respect to $r_{nk}$ (Step 1)
Keep $\mu_k$ fixed; assign each point to its nearest centroid: $r_{nk} = 1$ if $k = \arg\min_j \|x_n - \mu_j\|^2$, and $0$ otherwise.
Minimize $J$ with respect to $\mu_k$ (Step 2)
Keep $r_{nk}$ fixed; setting $\partial J / \partial \mu_k = 0$ gives $\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}$: each centroid is the mean of the points assigned to it.
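A minimal NumPy sketch that alternates these two steps until the centroids stop moving (illustrative only; initialization scheme and names are my own choices):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """X: (N, D) data matrix. Returns centroids and hard assignments."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # init at K data points
    for _ in range(n_iters):
        # Step 1: fix mu, choose r_nk -- assign each point to nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        assign = dists.argmin(axis=1)
        # Step 2: fix r_nk, choose mu -- mean of the points in each cluster.
        new_mu = np.array([X[assign == k].mean(axis=0) if (assign == k).any()
                           else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: J can no longer decrease
            break
        mu = new_mu
    return mu, assign
```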
Document Clustering with K-means
Represent each document as a term-frequency (or TF-IDF) vector and run K-means on those vectors.
K-Means Example
Hard Assignment to Clusters
K-means makes hard assignments: each point belongs to exactly one cluster ($r_{nk} \in \{0, 1\}$). Mixture models relax this to soft, probabilistic assignments.
Gaussian Mixture Models (GMMs)
Mixtures of 2 Gaussians
$$p(x) = \pi \, \mathcal{N}(x \mid \mu_1, \Sigma_1) + (1 - \pi) \, \mathcal{N}(x \mid \mu_2, \Sigma_2)$$
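In one dimension this density is easy to evaluate directly; a small sketch with made-up parameter values:

```python
import numpy as np
from scipy.stats import norm

pi_, mu1, s1, mu2, s2 = 0.4, -2.0, 1.0, 3.0, 1.5   # illustrative values

def mixture_pdf(x):
    """p(x) = pi N(x|mu1, s1) + (1 - pi) N(x|mu2, s2)."""
    return pi_ * norm.pdf(x, mu1, s1) + (1 - pi_) * norm.pdf(x, mu2, s2)

print(mixture_pdf(np.linspace(-6.0, 8.0, 5)))      # density at a few points
```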
Mixture Models
Mixture Model Classifier
Given a new data point, compute the posterior probability of each class:

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)}$$

$$p(y = 1 \mid x) \propto \mathcal{N}(x \mid \mu_1, \Sigma_1)\, p(y = 1)$$
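A small sketch of this Bayes-rule computation for two classes with Gaussian class-conditionals (all parameter values here are invented for illustration):

```python
import numpy as np
from scipy.stats import norm

prior = np.array([0.5, 0.5])                  # p(y) for y = 0, 1
mus = np.array([-1.0, 2.0])                   # class-conditional means
sigmas = np.array([1.0, 0.5])                 # class-conditional std devs

def posterior(x):
    """p(y|x) = p(x|y) p(y) / p(x)."""
    joint = norm.pdf(x, mus, sigmas) * prior  # p(x|y) p(y)
    return joint / joint.sum()                # divide by p(x) = sum_y p(x,y)

print(posterior(0.3))                         # posterior over the two classes
```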
Cluster ID/Class Label as Hidden Variables
Marginalize over the hidden variable $z$:

$$p(x) = \sum_z p(x, z) = \sum_z p(z)\, p(x \mid z)$$

For a mixture of Gaussians with mixing coefficients $\pi_k$, component means $\mu_k$, and covariances $\Sigma_k$:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \frac{1}{(2\pi)^{D/2} |\Sigma_k|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$$

Log-likelihood for the two-component case:

$$\ell = \sum_{n=1}^{N} \log \sum_{y=0}^{1} \mathcal{N}(x_n, y \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \pi_0 \, \mathcal{N}(x_n \mid \mu_0, \Sigma_0) + \pi_1 \, \mathcal{N}(x_n \mid \mu_1, \Sigma_1) \right)$$
Log-likelihood for Mixture of Gaussians
$$\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
Explaining Expectation Maximization
Use the points assigned to each Gaussian to recompute its $\mu$ and $\Sigma$, but weight the updates with the soft labels (the Maximization step).
Expectation Maximization
An expectation-maximization (EM) algorithm is used in statistics to find maximum likelihood estimates of parameters in probabilistic models that depend on unobserved (hidden) variables.

E-Step (compute responsibilities):

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
Estimating Parameters
M-Step:

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$

$$\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T$$

$$\pi_k = \frac{N_k}{N}, \qquad N_k = \sum_{n=1}^{N} \gamma(z_{nk})$$

Iterate until convergence of the log-likelihood:

$$\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$$
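Putting the E-step and M-step together, here is a sketch of EM for a 1-D Gaussian mixture (initialization scheme and names are my own; not the slides' code):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K, n_iters=50, seed=0):
    """x: (N,) data. Alternate E-step (responsibilities gamma) and
    M-step (responsibility-weighted updates of pi, mu, sigma)."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)     # init means at data points
    sigma = np.full(K, x.std())
    for _ in range(n_iters):
        # E-step: gamma[n, k] proportional to pi_k N(x_n | mu_k, sigma_k).
        dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k])
                         for k in range(K)], axis=1)           # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted mean, std, and mixing coefficient per component.
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
        pi = Nk / len(x)
    loglik = np.log(dens.sum(axis=1)).sum()       # from the final E-step
    return pi, mu, sigma, loglik
```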
EM Iterations
[Figure: EM iterations on sample data, from [1]]
Clustering Documents with EM
Treat the class label of a naïve Bayes text model as hidden: the E-step computes soft class posteriors for each document, and the M-step re-estimates word probabilities from the soft counts.
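A sketch of this idea as EM on a mixture of multinomials over word counts (the function is hypothetical; the smoothing and initialization are my own choices):

```python
import numpy as np

def em_doc_clusters(counts, K, n_iters=50, seed=0):
    """counts: (N_docs, V) term-frequency matrix. Soft-cluster documents
    with a mixture of multinomials (naive Bayes with hidden labels)."""
    rng = np.random.default_rng(seed)
    N, V = counts.shape
    pi = np.full(K, 1.0 / K)                      # cluster priors
    theta = rng.dirichlet(np.ones(V), size=K)     # word dist. per cluster
    for _ in range(n_iters):
        # E-step: soft class posteriors, computed in log space for stability.
        log_post = counts @ np.log(theta).T + np.log(pi)       # (N, K)
        log_post -= log_post.max(axis=1, keepdims=True)
        gamma = np.exp(log_post)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate from soft counts, with add-one smoothing.
        pi = gamma.sum(axis=0) / N
        theta = gamma.T @ counts + 1.0                          # (K, V)
        theta /= theta.sum(axis=1, keepdims=True)
    return pi, theta, gamma
```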
Clustering Algorithms
Hierarchical Clustering
Types of Hierarchical Clustering
Agglomerative (bottom-up):
Assign each data point to its own cluster
Iteratively merge the most similar sub-clusters
Eventually, all data points are part of one cluster

Divisive (top-down):
Assign all data points to the same cluster
Iteratively split clusters
Eventually, each data point forms its own cluster

One advantage: we do not need to choose K, the number of clusters, before we begin clustering
Hierarchical Clustering Algorithm
Step 1: Assign each data point to its own cluster
Step 2: Compute the similarity between every pair of clusters
Step 3: Merge the two most similar clusters, leaving one fewer cluster
Repeat Steps 2-3 until all points belong to a single cluster
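A naive $O(N^3)$ sketch of these steps that records the merge history (the linkage options anticipate the next few slides; names are my own):

```python
import numpy as np

def agglomerative(X, linkage="single"):
    """Step 1: one cluster per point; then repeat Steps 2-3 until one
    cluster remains. linkage: 'single' (closest pair) or 'complete'."""
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise dists
    merges = []
    while len(clusters) > 1:
        best = (0, 1, np.inf)
        for a in range(len(clusters)):          # Step 2: compare all pairs
            for b in range(a + 1, len(clusters)):
                pair = [D[i, j] for i in clusters[a] for j in clusters[b]]
                d = min(pair) if linkage == "single" else max(pair)
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best                          # Step 3: merge most similar
        merges.append((clusters[a], clusters[b], d))
        clusters[a] += clusters[b]
        del clusters[b]
    return merges
```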
Hierarchical Clustering Demo
Single Linkage
Cluster similarity is the distance between the closest pair of points, one from each cluster.
Complete Linkage
Cluster similarity is the distance between the farthest pair of points, one from each cluster.
Average Group Linkage
Cluster similarity is the average distance over all pairs of points, one from each cluster.
Hierarchical Clustering for Documents
Summary
Text clustering without labels: K-means makes hard assignments, Gaussian mixture models trained with EM make soft assignments, and hierarchical clustering builds a cluster tree without fixing K in advance.
References