Chapter 6
Feature Selection in Documents
• In text classification, documents are represented in a high-dimensional space in which each dimension corresponds to a term.
• Many dimensions correspond to rare words.
• Rare words can mislead the classifier; such features are called noisy features.
• Eliminating noisy features from the representation increases both the efficiency and the effectiveness of text classification.
• One popular feature selection scheme is based on
mutual information.
Document Matrix: Example
• Consider the following documents:
Document/Term   Data   Mining   Cough
Document1         4       4       0
Document2         3       3       0
Document3         0       0       2
Document4         0       0       1
(Figure: documents D1 and D2 and a query Q drawn as vectors in a 3-D term space with axes T1, T2, T3; for example, D2 = 3T1 + 7T2 + T3.)
• Is D1 or D2 more similar to Q?
• How do we measure the degree of similarity? Distance? Angle?
• Should similarity be based on the Euclidean distance or on the cosine of the angle between the vectors?
Euclidean Distance?
• Euclidean distance is sensitive to vector length: concatenating D1 with itself gives 2D1 = 4T1 + 6T2 + 10T3, whose term distribution is identical to D1 yet whose Euclidean distance to Q is generally different.
(Figure: D1, 2D1, and Q in the term space T1, T2, T3.)
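A minimal numpy sketch of this point, assuming D1 = 2T1 + 3T2 + 5T3 (so that 2·D1 is the vector above) and a purely hypothetical query vector Q: repeating a document changes its Euclidean distance to Q but not the angle between them.

```python
import numpy as np

d1 = np.array([2.0, 3.0, 5.0])   # assumed D1 = 2T1 + 3T2 + 5T3 (so 2*D1 = 4T1 + 6T2 + 10T3)
d1_doubled = 2 * d1              # same term distribution, document simply repeated
q = np.array([0.0, 0.0, 2.0])    # hypothetical query vector, for illustration only

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean(q, d1), euclidean(q, d1_doubled))  # distances differ
print(cosine(q, d1), cosine(q, d1_doubled))        # cosines are identical
```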
Length Normalization
• A vector can be (length-) normalized by dividing each of its
components by its length – for this we use the L2 norm:
$\|x\|_2 = \sqrt{\sum_i x_i^2}$
Cosine(query,document)
The cosine similarity of a query $q$ and a document $d$ is the dot product of their unit vectors:
$$\cos(\vec q, \vec d\,) = \frac{\vec q \cdot \vec d}{\|\vec q\|\,\|\vec d\|} = \frac{\vec q}{\|\vec q\|} \cdot \frac{\vec d}{\|\vec d\|} = \frac{\sum_{i=1}^{V} q_i d_i}{\sqrt{\sum_{i=1}^{V} q_i^2}\;\sqrt{\sum_{i=1}^{V} d_i^2}}$$
where $q_i$ ($d_i$) is the weight of term $i$ in the query (document) and $V$ is the vocabulary size.
For length-normalized $q$ and $d$ this reduces to
$$\cos(\vec q, \vec d\,) = \vec q \cdot \vec d = \sum_{i=1}^{V} q_i d_i .$$
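A small numpy sketch of the two forms of the formula, reusing D2 = 3T1 + 7T2 + T3 from the earlier example and the same hypothetical query: after L2 normalization the cosine is just a dot product.

```python
import numpy as np

def l2_normalize(v):
    """Divide each component by the vector's L2 length."""
    return v / np.linalg.norm(v)

q  = np.array([0.0, 0.0, 2.0])   # hypothetical query vector
d2 = np.array([3.0, 7.0, 1.0])   # D2 = 3T1 + 7T2 + T3

# General form: dot product divided by the product of the lengths
cos_general = (q @ d2) / (np.linalg.norm(q) * np.linalg.norm(d2))

# Length-normalized form: plain dot product of unit vectors
cos_unit = l2_normalize(q) @ l2_normalize(d2)

assert np.isclose(cos_general, cos_unit)
print(cos_general)
```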
Mutual Information (MI)
• Let c be the class of a document and t be a term.
• We compute the MI of t and c.
• MI tells us how much information t contains about c and vice versa:
$$I(X; Y) = \sum_{x} \sum_{y} P(x, y)\,\log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
where, for feature selection, $X$ indicates whether $t$ occurs in a document and $Y$ indicates whether the document belongs to $c$.
Mutual Information from Reuters Data
• A 2 × 2 contingency table of document counts: one dimension records whether the term export is present or absent, the other whether the document belongs to the poultry class or not.
• MI(export, poultry) is obtained by plugging these four counts into the formula above.
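A sketch of the computation with hypothetical cell counts (the actual Reuters numbers are in Manning et al., 2009): the four cells of the contingency table are plugged into the MI formula above, with probabilities estimated as relative frequencies.

```python
import numpy as np

# Hypothetical 2x2 contingency table of document counts (NOT the real Reuters figures)
N11 = 50       # export present, class = poultry
N10 = 27000    # export present, class = non-poultry
N01 = 140      # export absent,  class = poultry
N00 = 770000   # export absent,  class = non-poultry
N = N11 + N10 + N01 + N00

mi = 0.0
for n_tc, n_t, n_c in [
    (N11, N11 + N10, N11 + N01),   # term present, class poultry
    (N10, N11 + N10, N10 + N00),   # term present, class non-poultry
    (N01, N01 + N00, N11 + N01),   # term absent,  class poultry
    (N00, N01 + N00, N10 + N00),   # term absent,  class non-poultry
]:
    if n_tc > 0:
        mi += (n_tc / N) * np.log2(N * n_tc / (n_t * n_c))

print(f"MI(export, poultry) ~ {mi:.6f} bits")
```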
MI Ranking on Reuters Data
Manning, Raghavan, Schutze, An Introduction to Information Retrieval, Cambridge Univ Press, 2009.
Top 15 Features based on MI
30 Features in SKLEARN
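The slide above refers to selecting features with scikit-learn. A hedged sketch of how that is typically done: `mutual_info_classif` scores each term against the class labels and `SelectKBest` keeps the top-scoring ones (the tiny corpus and k = 5 here are purely illustrative; the slides use Reuters data and 30 features).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Tiny illustrative corpus (labels: 1 = data-mining documents, 0 = medical documents)
docs = [
    "data mining and machine learning for text data",
    "mining large data sets with clustering",
    "cough fever and cold symptoms in patients",
    "treating a persistent cough with medication",
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()
X = vec.fit_transform(docs)              # document-term count matrix

# Score every term by its mutual information with the class; keep the top 5
selector = SelectKBest(mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, labels)

kept_terms = vec.get_feature_names_out()[selector.get_support()]
print(kept_terms)
```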
Random Projections: Feature Extraction
• Let X be the n × D matrix of patterns and R be a D × d matrix with randomly chosen entries; then XR gives a d-dimensional representation of the patterns (see the sketch below).
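A hedged sketch of the idea using scikit-learn's GaussianRandomProjection (the data, dimensions, and seed below are illustrative): multiplying the patterns by a random matrix maps them from D to d dimensions while approximately preserving pairwise distances.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))            # 100 patterns in D = 1000 dimensions

# Project to d = 50 dimensions with a random Gaussian matrix
rp = GaussianRandomProjection(n_components=50, random_state=0)
X_low = rp.fit_transform(X)                 # shape (100, 50)

# Pairwise distances are roughly preserved (up to some distortion)
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_low[0] - X_low[1]))
```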
Eigenvalues and Eigenvectors: Examples
• $A = \begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix}$: $\lambda_1 = 1$, $x_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$; $\lambda_2 = 2$, $x_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$
• $A = \begin{pmatrix} 1 & 1 \\ 0 & 2 \end{pmatrix}$: $\lambda_1 = 1$, $x_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$; $\lambda_2 = 2$, $x_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$
• $A = \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$: find the eigenvalues and eigenvectors.
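A quick numpy check of these examples; note that the last matrix has complex eigenvalues ±i, anticipating the remark below that the eigenvalues of a real matrix need not be real.

```python
import numpy as np

examples = [
    np.array([[1.0, 0.0], [0.0, 2.0]]),    # eigenvalues 1, 2
    np.array([[1.0, 1.0], [0.0, 2.0]]),    # eigenvalues 1, 2
    np.array([[0.0, 1.0], [-1.0, 0.0]]),   # eigenvalues +i, -i (complex)
]

for A in examples:
    vals, vecs = np.linalg.eig(A)
    print(vals)   # eigenvalues
    print(vecs)   # eigenvectors as columns, normalized to unit length
```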
Characteristic Polynomial
• The equation $Ax = \lambda x$ is equivalent to $(A - \lambda I)x = 0$; it has a non-zero solution only if $(A - \lambda I)^{-1}$ does not exist, i.e., if $A - \lambda I$ is singular.
• So $\det(A - \lambda I) = 0$ is the characteristic equation; it has $n$ roots (the $\lambda$'s).
• The eigenvalues may be neither real nor distinct for a real matrix.
• $A = \begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix}$: $\det\!\left(\begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix} - \lambda \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}\right) = 0$ gives $\lambda_1 = 4$ with $x_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}$, and $\lambda_2 = 2$ with $x_2 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
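The same example checked through the characteristic polynomial: np.poly returns the coefficients of det(A − λI), and its roots are the eigenvalues.

```python
import numpy as np

A = np.array([[3.0, -1.0], [-1.0, 3.0]])

coeffs = np.poly(A)         # characteristic polynomial: lambda^2 - 6*lambda + 8
print(coeffs)               # [ 1. -6.  8.]
print(np.roots(coeffs))     # [4. 2.]  -> lambda1 = 4, lambda2 = 2

vals, vecs = np.linalg.eig(A)
print(vecs)                 # columns proportional to (1, -1) and (1, 1)
```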
Matrix-Vector Multiplication
$$S = \begin{pmatrix} 3 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 0 \end{pmatrix}$$ has eigenvalues 3, 2, 0 with corresponding eigenvectors
$$v_1 = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad v_2 = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad v_3 = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$
Let $S = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$; $\lambda_1 = 1$, $\lambda_2 = 3$.
The eigenvectors $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ and $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ form $U = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$.
Inverting, we have $U^{-1} = \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$.
Then $S = U \Lambda U^{-1} = \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/2 & -1/2 \\ 1/2 & 1/2 \end{pmatrix}$.
Diagonal Decomposition: Example
Because $S$ is symmetric, its eigenvectors can be chosen orthonormal, so $S = Q \Lambda Q^{T}$ with $Q$ orthogonal ($Q^{-1} = Q^{T}$):
$$S = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & -1/\sqrt{2} \\ 1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}$$
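A numpy sketch verifying both factorizations of S = [[2, 1], [1, 2]]: the general form S = U Λ U⁻¹, and, because S is symmetric, the orthogonal form S = Q Λ Qᵀ with Q⁻¹ = Qᵀ.

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])

# General diagonal decomposition: S = U diag(lambdas) U^{-1}
U = np.array([[1.0, 1.0], [-1.0, 1.0]])   # eigenvectors (1,-1) and (1,1) as columns
Lam = np.diag([1.0, 3.0])
assert np.allclose(U @ Lam @ np.linalg.inv(U), S)

# Symmetric case: scale the columns to unit length, then Q^{-1} = Q^T
Q = U / np.sqrt(2.0)
assert np.allclose(Q @ Lam @ Q.T, S)
assert np.allclose(np.linalg.inv(Q), Q.T)
print("both decompositions reproduce S")
```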
Principal Components
• Most popular in
• Signal Processing: Eigenspeech and Eigenfaces
• Machine Learning: Feature Extraction
• Each new feature is a linear combination of the given features
• Let $X_1, X_2, \ldots, X_n$ be $n$ $D$-dimensional patterns.
• Any such $D$-dimensional vector $X$ can be represented as a linear combination of $D$ orthonormal basis vectors, $X = \sum_{i=1}^{D} a_i e_i$, and its $d$-dimensional version ($d < D$) keeps only the first $d$ terms, $\hat{X} = \sum_{i=1}^{d} a_i e_i$.
• Because the $e_i$ are orthonormal, the coefficients are obtained as $a_i = e_i^{T} X$.
Eigenvectors of the Covariance Matrix
The covariance matrix is $\Sigma = \frac{1}{n} \sum_{k=1}^{n} (X_k - \mu)(X_k - \mu)^{T}$, where $\mu$ is the sample mean.
Choosing the basis vectors $e_i$ to be the eigenvectors of $\Sigma$, ordered by decreasing eigenvalue $\lambda_i$, minimizes the approximation error.
So, $\text{Error} = \sum_{i=d+1}^{D} \lambda_i$.
• So, for this example the sample covariance matrix has diagonal entries $24/6 = 4$, and its eigenvectors lie along the directions $\langle 1, 1 \rangle$ and $\langle 1, -1 \rangle$.
PCs as Extracted Features
• The two eigenvector directions are shown below. They are
orthogonal as the covariance matrix is symmetric.
(Figure: based on 8 PCs.)
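A minimal PCA sketch on synthetic 2-D data (the data and sizes are illustrative): the eigenvectors of the sample covariance matrix give the principal directions, projection onto the top d of them gives the extracted features, and the discarded eigenvalues add up to the expected squared reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[4, 3], [3, 4]], size=200)  # n=200, D=2

Xc = X - X.mean(axis=0)                  # center the patterns
C = np.cov(Xc, rowvar=False)             # sample covariance matrix (D x D)

vals, vecs = np.linalg.eigh(C)           # symmetric matrix: real eigenvalues, orthogonal vectors
order = np.argsort(vals)[::-1]           # sort eigenvalues in decreasing order
vals, vecs = vals[order], vecs[:, order]

d = 1
Z = Xc @ vecs[:, :d]                     # project onto the top-d principal components
print(vals)                              # eigenvalues (variances along the PCs)
print(vals[d:].sum())                    # expected squared reconstruction error
```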
Singular Value Decomposition
For an $m \times n$ matrix $A$ ($m$ patterns and $n$ features) of rank $r$ there exists a factorization (Singular Value Decomposition, SVD) as follows:
$$A = U \Sigma V^{T}$$
where $U$ is $m \times m$ (its columns are eigenvectors of $A A^{T}$), $\Sigma$ is $m \times n$ (diagonal, holding the singular values), and $V$ is $n \times n$ (its columns are eigenvectors of $A^{T} A$).
Example: $A = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 1 & 0 \end{pmatrix}$, so $m = 3$, $n = 2$.
$$\mathrm{SVD}(A) = \begin{pmatrix} 0 & 2/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & -1/\sqrt{6} & 1/\sqrt{3} \\ 1/\sqrt{2} & 1/\sqrt{6} & -1/\sqrt{3} \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & \sqrt{3} \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}$$
Typically, the singular values are arranged in decreasing order.
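A numpy check of this example. np.linalg.svd returns the singular values in decreasing order (√3, 1), whereas the factorization above lists them as (1, √3); the two differ only by a permutation of the corresponding columns.

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])

U, s, Vt = np.linalg.svd(A)     # full SVD: U is 3x3, Vt is 2x2
print(s)                        # [1.732..., 1.0], i.e. sqrt(3) and 1 in decreasing order

# Rebuild A = U Sigma V^T to confirm the factorization
Sigma = np.zeros_like(A)
Sigma[:2, :2] = np.diag(s)
assert np.allclose(U @ Sigma @ Vt, A)
```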
Error in Low-Rank Approximation
$$\min_{X:\ \mathrm{rank}(X) = k} \|A - X\|_{F} = \|A - A_k\|_{F} = \sqrt{\sum_{i=k+1}^{r} \sigma_i^{2}}$$
where $A_k$ is obtained by keeping only the $k$ largest singular values of $A$ (in the 2-norm the error is simply $\sigma_{k+1}$).
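A sketch of this bound on the same example matrix: the best rank-k approximation A_k keeps the k largest singular values, and its Frobenius error equals the square root of the sum of the squared discarded singular values.

```python
import numpy as np

A = np.array([[1.0, -1.0], [0.0, 1.0], [1.0, 0.0]])
U, s, Vt = np.linalg.svd(A)

k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-1 approximation

frob_error = np.linalg.norm(A - A_k, "fro")
assert np.isclose(frob_error, np.sqrt(np.sum(s[k:] ** 2)))
print(frob_error)                             # here equals sigma_2 = 1, since r = 2
```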
Inverted index (term → <docID, term frequency> postings):
several <1,1>
student <6,1>
teach <1,2> <3,1> <4,1> <5,2>
town <2,1> <3,1>
Boolean Model
• A document is represented as a set of keywords.
• Queries are Boolean expressions of keywords,
connected by AND, OR, and NOT.
• Output: Document is relevant or not (Classification).
No partial matches or ranking.
• For example, the query old AND college matches docs 2 and 3:
2. In the big old college in the big old town
3. The college in the town likes the good old teacher
• Bag of Words Paradigm!
• Order is not important, frequency is!
• Learning is exhibited using Matching
Document Matrix
Term/Doc DOC-1 DOC-2 DOC-3 DOC-4 DOC-5 DOC-6
big 0 1 0 0 0 0
college 0 1 1 0 0 0
course 1 0 0 0 0 0
evening 0 0 0 0 1 0
fail 0 0 0 1 0 0
good 1 0 1 0 1 0
lecture 0 0 0 0 0 1
like 0 0 1 0 0 1
note 0 0 0 0 0 1
old 1 1 1 1 1 0
several 1 0 0 0 0 0
student 0 0 0 0 0 1
teach 1 0 1 1 1 0
town 0 1 1 0 0 0
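A small sketch of Boolean retrieval over this matrix: each term maps to the set of documents containing it, and AND / OR / NOT become set intersection, union, and difference, so the query old AND college returns documents 2 and 3.

```python
# Inverted index: term -> set of documents containing it (read off the matrix above)
index = {
    "big": {2}, "college": {2, 3}, "course": {1}, "evening": {5},
    "fail": {4}, "good": {1, 3, 5}, "lecture": {6}, "like": {3, 6},
    "note": {6}, "old": {1, 2, 3, 4, 5}, "several": {1}, "student": {6},
    "teach": {1, 3, 4, 5}, "town": {2, 3},
}
all_docs = set(range(1, 7))

# Boolean operators are just set operations
print(index["old"] & index["college"])        # AND -> {2, 3}
print(index["evening"] | index["lecture"])    # OR  -> {5, 6}
print(all_docs - index["old"])                # NOT old -> {6}
```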
SVD: Low-Rank Approximation
• Whereas the term-doc matrix A may have m=50000, n=10 million (and
rank close to 50000)
• We can construct an approximation A100 with rank 100.
• Of all rank 100 matrices, it would have the lowest Frobenius error.
• … but why should we??
• Answer: Latent Semantic Indexing
C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
The Problem
• Example: Vector Space Model
• Synonymy: two documents (or a query and a document) that use different terms for the same concept will have a small cosine similarity even though they are related.
• Polysemy: documents that share an ambiguous term will have a large cosine similarity even though they are not truly related.
Problems with Lexical Semantics
(Example term-document incidence matrix for documents d1–d6 over terms such as ship, boat, ocean, voyage, and trip; e.g., the row for trip is 0 0 0 1 0 1.)
• d2 is similar to d3
• d2: boat, ocean; d3: ship
• d4 is similar to d5 and d6 (by cosine similarity)
• d4: voyage, trip; d5: voyage; d6: trip
Latent Semantic Indexing (LSI) / Latent Semantic Analysis (LSA)
• Perform a low-rank approximation of the document-term matrix (typical rank 100–300).
• General idea
• Map documents (and terms) to a low-dimensional
representation.
• Design a mapping such that the low-dimensional space reflects
semantic associations (latent semantic space).
• Compute document similarity based on the inner product in this
latent semantic space
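A hedged scikit-learn sketch of the idea: TruncatedSVD applied to the document-term matrix gives the low-rank latent space, and document similarity becomes an inner product (here, cosine) of the reduced vectors. The toy corpus mirrors the ship/boat/ocean/voyage/trip example above; the content of d1 is filled in only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "ship ocean",     # d1 (illustrative)
    "boat ocean",     # d2
    "ship",           # d3
    "voyage trip",    # d4
    "voyage",         # d5
    "trip",           # d6
]

X = CountVectorizer().fit_transform(docs)         # document-term matrix

# Map the documents into a 2-dimensional latent semantic space
lsa = TruncatedSVD(n_components=2, random_state=0)
X_latent = lsa.fit_transform(X)

# d2 and d3 (and d4/d5/d6) should come out similar in the latent space
# even though they share few or no terms.
print(cosine_similarity(X_latent).round(2))
```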