Representation

CHAPTER 6: MACHINE LEARNING – THEORY & PRACTICE


Dimensionality Reduction
• ML models may overfit as the dimensionality d gets larger.
• A rule of thumb: the number of training samples required per class is
5-10 times the dimensionality d (Duda, Hart, and Stork, Pattern
Classification, Wiley, 2001).


• In machine learning, we represent patterns as vectors in a
multi-dimensional space.
• A document collection with n documents and vocabulary size l is
represented as an n x l document matrix, with an additional column
giving each document's class label (e.g., Politics or Sports).
Feature Selection in Documents
• In text classification, documents are represented in a
high-dimensional space in which each dimension corresponds to a term.
• Many dimensions correspond to rare words.
• Rare words can mislead the classifier; such features are noisy features.
• Eliminating noisy features from the representation
increases both the efficiency and the effectiveness of text
classification.
• One popular feature selection scheme is based on
mutual information.
Document Matrix: Example
• Consider the following documents:

  Document/Term   Data   Mining   Cough
  Document1        4       4        0
  Document2        3       3        0
  Document3        0       0        2
  Document4        0       0        1

• Let the data matrix be A, the 4 x 3 matrix of term frequencies above.
• A compact form
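The slide's compact form itself is not recoverable from this extraction; as an illustrative stand-in (an assumption, not necessarily the slide's construction), the sketch below stores the same matrix A as a SciPy sparse matrix, which keeps only the non-zero entries.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Rows = documents, columns = terms (data, mining, cough); values = term frequencies.
A = np.array([
    [4, 4, 0],   # Document1
    [3, 3, 0],   # Document2
    [0, 0, 2],   # Document3
    [0, 0, 1],   # Document4
])

A_sparse = csr_matrix(A)       # compact: stores only the non-zero entries
print(A_sparse.nnz)            # 6 non-zeros out of 12 cells
print(A_sparse.toarray())      # round-trips back to the dense matrix
```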
Graphic Representation
• Three vectors in the space spanned by terms T1, T2, T3:

  D1 = 2T1 + 3T2 + 5T3
  D2 = 3T1 + 7T2 + T3
  Q  = 0T1 + 0T2 + T3

• Is D1 or D2 more similar to Q?
• How do we measure the degree of similarity? Distance? Angle?
• Is it based on the Euclidean distance or on the cosine of the angle between the vectors?
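A minimal numpy sketch (not part of the original slides) that computes both candidate measures for the three vectors above; the helper names are illustrative.

```python
import numpy as np

# Term weights over (T1, T2, T3) from the slide.
D1 = np.array([2.0, 3.0, 5.0])
D2 = np.array([3.0, 7.0, 1.0])
Q  = np.array([0.0, 0.0, 1.0])

def euclidean(a, b):
    return np.linalg.norm(a - b)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean(D1, Q), euclidean(D2, Q))   # distance of each document from the query
print(cosine(D1, Q), cosine(D2, Q))         # cosine of the angle between each document and the query
```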
Euclidean Distance?
• Scaling a document leaves its term distribution unchanged, yet it changes the Euclidean distance:

  D1  = 2T1 + 3T2 + 5T3
  2D1 = 4T1 + 6T2 + 10T3
  Q   = 0T1 + 0T2 + T3

  Distance(D1, 2D1) > Distance(D1, Q)

Why Distance is a bad idea
The Euclidean distance between a query q and a document d2 can be
large even though the distribution of terms in q and the distribution
of terms in d2 are very similar.
Length Normalization
• A vector can be (length-) normalized by dividing each of its
components by its length – for this we use the L2 norm:

  ||x||_2 = sqrt( Σ_i x_i^2 )

• Dividing a vector by its L2 norm makes it a unit (length)
vector (on the surface of the unit hypersphere).
• Effect on two documents d and d′ (d appended to
itself): they have identical vectors after
length-normalization.
• Long and short documents now have comparable weights.
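A small numpy sketch of the effect described above, assuming a hypothetical term-frequency vector d; appending a document to itself doubles every term count, and the two vectors coincide after length-normalization.

```python
import numpy as np

def l2_normalize(x):
    # Divide each component by the vector's L2 norm: x / ||x||_2.
    return x / np.linalg.norm(x)

d = np.array([3.0, 0.0, 4.0])   # hypothetical term-frequency vector
d_doubled = 2 * d               # "d appended to itself": every term count doubles

print(l2_normalize(d))           # [0.6  0.   0.8]
print(l2_normalize(d_doubled))   # identical unit vector
```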
Cosine Similarity
Cosine(query, document): the dot product of unit vectors.

  cos(q, d) = (q · d) / (||q|| ||d||)
            = (q/||q||) · (d/||d||)
            = Σ_{i=1}^{V} q_i d_i / ( sqrt(Σ_{i=1}^{V} q_i^2) · sqrt(Σ_{i=1}^{V} d_i^2) )

qi is the tf-idf weight of term i in the query;
di is the tf-idf weight of term i in the document.
cos(q, d) is the cosine similarity of q and d or, equivalently,
the cosine of the angle between q and d.
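One way to compute the tf-idf weighted cosine in practice is with scikit-learn; the library choice is an assumption (the chapter does not prescribe one), and the two documents are borrowed from the keyword-search example later in this chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The good old teacher teaches several courses",
    "In the big old college in the big old town",
]
query = ["old college"]

vectorizer = TfidfVectorizer()
D = vectorizer.fit_transform(docs)   # tf-idf weighted document vectors
q = vectorizer.transform(query)      # the query mapped into the same term space

print(cosine_similarity(q, D))       # cosine of the query with each document
```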
Cosine for Length-Normalized Vectors
• For length-normalized vectors, cosine similarity is
simply the dot product (or scalar product):

  cos(q, d) = q · d = Σ_{i=1}^{V} q_i d_i

  for q, d length-normalized.
Mutual Information (MI)
• Let c be the class of a document and t be a term.
• We compute the MI of t and c.
• MI tells us how much information t contains about c and vice versa:

  I(t; c) = Σ_{e_t ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(e_t, e_c) log_2 [ P(e_t, e_c) / ( P(e_t) P(e_c) ) ]

  where e_t indicates whether the term is present in a document and e_c whether the document belongs to class c.
Mutual Information from Reuters Data
• Example: the term "export" and the class "poultry". A 2 x 2 contingency table
counts documents according to (export present / export absent) and
(poultry class / non-poultry class), and MI(export, poultry) is computed from
these counts using the formula above.
MI Ranking on Reuters Data

Manning, Raghavan, and Schütze, Introduction to Information Retrieval, Cambridge University Press, 2009.
Top 15 Features based on MI
30 Features in SKLEARN
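A sketch of MI-based feature selection with scikit-learn's SelectKBest and mutual_info_classif; the tiny term-count matrix below is an illustrative stand-in for the Reuters data, which is not reproduced here.

```python
import numpy as np
from functools import partial
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy term-count matrix (documents x terms) and class labels.
X = np.array([
    [4, 4, 0],
    [3, 3, 1],
    [0, 0, 2],
    [0, 0, 1],
])
y = np.array([0, 0, 1, 1])
terms = ["data", "mining", "cough"]

# Treat the counts as discrete so the contingency-based MI estimator is used.
mi = partial(mutual_info_classif, discrete_features=True)
selector = SelectKBest(score_func=mi, k=2).fit(X, y)

keep = selector.get_support(indices=True)
print([terms[i] for i in keep])   # the two terms carrying the most information about the class
```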
Random Projections: Feature Extraction
• Each new feature is the projection of a pattern onto a randomly chosen
direction; using d such directions (d < D) gives a d-dimensional representation.

Random Projections
• We can show that random projection preserves squared lengths and distances
in expectation (in the derivation, without loss of generality, choose i = 1).
• The derivation uses the identity Var(X) = E(X^2) - (E(X))^2.

10 Random Projections
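A minimal numpy sketch of the distance-preservation property, with d = 10 echoing the "10 Random Projections" slide; the data are synthetic and the scaling by 1/sqrt(d) is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 100, 1000, 10                    # n patterns, original dimension D, reduced dimension d
X = rng.normal(size=(n, D))

# Projection matrix with i.i.d. N(0, 1) entries, scaled by 1/sqrt(d) so that
# squared lengths and distances are preserved in expectation.
R = rng.normal(size=(D, d)) / np.sqrt(d)
X_low = X @ R

orig = np.sum((X[0] - X[1]) ** 2)          # squared distance in the original space
proj = np.sum((X_low[0] - X_low[1]) ** 2)  # squared distance after projection
print(orig, proj)                          # close, but the projected value fluctuates around the original
```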
Eigenvalues and Eigenvectors
• Standard eigenvalue problem: Given an n x n matrix A, find a
scalar λ and a nonzero vector x such that A x = λ x
• λ is an eigenvalue and x is the corresponding eigenvector
• < λ, x> is called an eigenpair
• Spectrum = λ (A) = set of eigenvalues of A

• A = [1 0; 0 2]: λ1 = 1, x1 = (1, 0)^T;  λ2 = 2, x2 = (0, 1)^T
• A = [1 1; 0 2]: λ1 = 1, x1 = (1, 0)^T;  λ2 = 2, x2 = (1, 1)^T
• A = [0 1; -1 0]: Find the eigenvalues and eigenvectors.
Characteristic Polynomial
• The equation A x = λ x is equivalent to (A − λI)x = 0; this has a non-zero
solution only if (A − λI)^(-1) does not exist.
• So det(A − λI) = 0 is the characteristic equation; it has n roots (the λ's).
• For a real matrix, the eigenvalues may be neither real nor distinct.
• Example: A = [3 -1; -1 3]. Then det(A − λI) = 0 gives

  (3 − λ)(3 − λ) − (−1)(−1) = λ^2 − 6λ + 8 = 0,

  with roots λ1 = 4, x1 = (1, −1)^T and λ2 = 2, x2 = (1, 1)^T.
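The eigenpairs above, and the exercise matrix, can be checked numerically with numpy (an illustrative tool choice, not part of the slides):

```python
import numpy as np

A = np.array([[3.0, -1.0],
              [-1.0, 3.0]])
vals, vecs = np.linalg.eig(A)
print(vals)   # 4 and 2 (order is not guaranteed), as found from the characteristic polynomial
print(vecs)   # columns are unit-length eigenvectors, along (1, -1) and (1, 1)

B = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
print(np.linalg.eig(B)[0])   # +1j and -1j: a real matrix can have complex eigenvalues
```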
Matrix-Vector Multiplication
• A = [3 0 0; 0 2 0; 0 0 0] has eigenvalues 3, 2, 0 with corresponding eigenvectors

  v1 = (1, 0, 0)^T,  v2 = (0, 1, 0)^T,  v3 = (0, 0, 1)^T

• On each eigenvector, A acts as a multiple of the identity
matrix: but as a different multiple on each.
• Any vector (say x = (2, 4, 6)^T) can be viewed as a combination of
the eigenvectors: x = 2v1 + 4v2 + 6v3, so A x = 2·3 v1 + 4·2 v2 + 6·0 v3 = (6, 8, 0)^T.
Diagonal Decomposition: Example
• Let S = [2 1; 1 2]; λ1 = 1, λ2 = 3.
• The eigenvectors (1, −1)^T and (1, 1)^T form U = [1 1; −1 1].
• Inverting, we have U^(-1) = [1/2 −1/2; 1/2 1/2].
• Then S = U Λ U^(-1) = [1 1; −1 1] [1 0; 0 3] [1/2 −1/2; 1/2 1/2].

Diagonal Decomposition: Example
• Let's divide U (and multiply U^(-1)) by sqrt(2) to make the eigenvectors unit length.
• Then S = Q Λ Q^T, with Q = [1/sqrt(2) 1/sqrt(2); −1/sqrt(2) 1/sqrt(2)] and Q^(-1) = Q^T.
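A quick numerical check of the symmetric diagonalization S = Q Λ Q^T using numpy.linalg.eigh (illustrative; not part of the slides):

```python
import numpy as np

S = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices: eigenvalues come back in ascending order
# and the eigenvectors (columns of Q) are orthonormal.
lam, Q = np.linalg.eigh(S)
print(lam)                                       # [1. 3.]
print(np.allclose(Q @ np.diag(lam) @ Q.T, S))    # True: S = Q diag(lambda) Q^T
print(np.allclose(Q.T @ Q, np.eye(2)))           # True: Q^(-1) = Q^T
```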
Principal Components
• Most popular in
  • Signal processing: eigenspeech and eigenfaces
  • Machine learning: feature extraction
• Each new feature is a linear combination of the given features.
• Let X1, X2, ..., Xn be n D-dimensional patterns.
• Any such D-dimensional vector X can be written in terms of D basis vectors φ1, ..., φD as

  X = Σ_{i=1}^{D} a_i φ_i,   where a_i is the weight and φ_i is the basis vector,

  and its d-dimensional version is X̂ = Σ_{i=1}^{d} a_i φ_i, where d < D.

Principal Components
• It is convenient to consider orthonormal basis vectors, i.e. φ_i^T φ_j = 1 if i = j and 0 otherwise,
• because then the weights are obtained directly as a_i = φ_i^T X.
Eigenvectors of the Covariance Matrix
• The covariance matrix is C = (1/n) Σ_{k=1}^{n} (X_k − μ)(X_k − μ)^T, where μ is the sample mean.
• Choose the basis vectors φ_i to be the eigenvectors of the covariance matrix, with eigenvalues λ_i.
• So, Error = Σ_{i=d+1}^{D} λ_i (the expected squared error of the d-dimensional representation).
• The error is minimized by selecting the φ_i so that the entries in the
above sum are the D − d smallest eigenvalues, i.e. by retaining the
eigenvectors with the d largest eigenvalues.
Example
• Consider 6 two-dimensional patterns.
• The sample covariance matrix is (1/6) Σ_k (X_k − μ)(X_k − μ)^T; for these
patterns its eigenvectors lie along (1, 1)^T and (1, −1)^T.
PCs as Extracted Features
• The two eigenvector directions (plotted in the original figure) are
orthogonal, as the covariance matrix is symmetric.
Based on 8 PCs
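A sketch of PCA as eigenanalysis of the covariance matrix, following the recipe above; the six 2-d patterns are synthetic stand-ins, since the slide's data are not recoverable from this extraction.

```python
import numpy as np

rng = np.random.default_rng(0)
# Six synthetic 2-d patterns with correlated features (illustrative stand-ins).
X = rng.normal(size=(6, 2)) @ np.array([[2.0, 1.0],
                                        [1.0, 2.0]])

mu = X.mean(axis=0)
cov = (X - mu).T @ (X - mu) / len(X)      # sample covariance matrix (divide by n, as on the slides)

lam, phi = np.linalg.eigh(cov)            # eigenvalues ascending, columns of phi are eigenvectors
lam, phi = lam[::-1], phi[:, ::-1]        # reorder: largest eigenvalue first

a = (X - mu) @ phi[:, :1]                 # weights a_i = phi_i^T (X - mu), keeping d = 1 component
X_hat = mu + a @ phi[:, :1].T             # reconstruction from the leading principal component

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, lam[1:].sum())                 # equal: error = sum of the discarded eigenvalues
```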
Singular Value Decomposition
For an m x n matrix A (m patterns and n features) of rank r there exists a
factorization (Singular Value Decomposition - SVD) as follows:

  A = U Σ V^T,   where U is m x m, Σ is m x n, and V is n x n.

• The columns of U are orthogonal eigenvectors of A A^T.
• The columns of V are orthogonal eigenvectors of A^T A.
• The eigenvalues λ1 ... λr of A A^T are also the eigenvalues of A^T A, and

  σ_i = sqrt(λ_i),   Σ = diag(σ1 ... σr).
SVD: Example
• Let A = [1 −1; 0 1; 1 0], a 3 x 2 matrix.
• A^T A = [2 −1; −1 2]; its eigenvalues are 3 and 1, with eigenvectors (1, −1)^T and (1, 1)^T.
• A A^T = [2 −1 1; −1 1 0; 1 0 1]; its eigenvalues are 3, 1, and 0.
• This gives

  SVD(A) = U Σ V^T
         = [0 2/sqrt(6) 1/sqrt(3); 1/sqrt(2) −1/sqrt(6) 1/sqrt(3); 1/sqrt(2) 1/sqrt(6) −1/sqrt(3)]
           [1 0; 0 sqrt(3); 0 0]
           [1/sqrt(2) 1/sqrt(2); 1/sqrt(2) −1/sqrt(2)]

Typically, the singular values are arranged in decreasing order (here they appear as 1 and sqrt(3)).
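The decomposition can be verified numerically; numpy returns the singular values sqrt(3) and 1 in decreasing order (the signs of the singular vectors may differ from the slide):

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0,  0.0]])

U, s, Vt = np.linalg.svd(A)                 # numpy orders singular values in decreasing order
print(s)                                    # [1.732... 1.0] = [sqrt(3), 1]
print(np.allclose(U[:, :2] * s @ Vt, A))    # True: A = U Sigma V^T
```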
Error in Low-Rank Approximation
• How good (bad) is this approximation?
• It is the best possible, measured by the Frobenius norm of the error:

  min_{X : rank(X) = k} ||A − X||_F = ||A − A_k||_F = sqrt( σ_{k+1}^2 + ... + σ_r^2 )

  where the σ_i are ordered such that σ_i ≥ σ_{i+1}.
• This suggests why the Frobenius error drops as k is increased.
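A sketch of the rank-k truncation and its Frobenius error on a small random matrix (synthetic, illustrative only); the error is determined entirely by the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] * s[:k] @ Vt[:k, :]               # best rank-k approximation (Eckart-Young)

frob_error = np.linalg.norm(A - A_k, ord="fro")
print(frob_error, np.sqrt(np.sum(s[k:] ** 2)))   # equal: the error comes from the discarded sigma_i
```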
What is a Document?
• Web page
• Newspaper article
• Academic publication
• Company report
• Research grant application
• Manual page
• Encyclopedia
• Images (video)
• Speech records
• Bank transaction slip
• Multimedia record
• Historical record
• Electronic mail
• Court transcript
• Health record
• Legal record
• Fingerprint
• Software
  – Code
  – Bug reports
Keyword Search
• Simplest notion of relevance is that the query
string appears as it is in the document
• Slightly less strict notion is that the words in the
query appear frequently in the document, in any
order (Bag of words!)
• Example:
1. The good old teacher teaches several courses
2. In the big old college in the big old town
3. The college in the town likes the good old teacher
4. Where the old teacher never did fail
5. The good teacher teaches in the evenings
6. And the students like his lecture notes
Keyword Search (continued)
• With casefolding, the vocabulary is:
and (And) big college courses did evenings fail good
his in (In) lecture like likes never notes old several
students teacher teaches the (The) town where
(Where)
• With stemming: (lossy, yet useful compression)
and big college course did evening fail good his in
lecture like (likes) never note (notes) old several
student (students) teach (teacher, teaches) the town
where
• Stopping then reduces the vocabulary:
big college course evening fail good lecture like note
old several student teach town
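A toy preprocessing sketch: casefolding, a crude suffix-stripping stand-in for a real stemmer (such as Porter), and a small illustrative stop list; these choices are assumptions for illustration, not the chapter's prescribed pipeline.

```python
import re

STOPWORDS = {"the", "in", "and", "his", "where", "did", "never"}   # small illustrative stop list

def crude_stem(word):
    # Toy stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ings", "ing", "es", "s", "er"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())            # casefold and tokenize
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The good old teacher teaches several courses"))
```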
Inverted Index
TERM INVERTED LIST
big <2,2>
college <2,1> <3,1>
course <1,1>
evening <5,1>
fail <4,1>
good <1,1> <3,1> <5,1>
lecture <6,1>
like <3,1> <6,1>
note <6,1>
old <1,1> <2,2> <3,1> <4,1> <5,1>

several <1,1>
student <6,1>
teach <1,2> <3,1> <4,1><5,2>
town <2,1> <3,1>
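A sketch of how such an inverted index with <doc, frequency> postings can be built from preprocessed token lists; only the first three documents are shown, and the structure names are illustrative.

```python
from collections import Counter, defaultdict

# Documents as lists of casefolded, stemmed, stopped terms (docs 1-3 of the example).
docs = {
    1: ["good", "old", "teach", "teach", "several", "course"],
    2: ["big", "old", "college", "big", "old", "town"],
    3: ["college", "town", "like", "good", "old", "teach"],
}

index = defaultdict(list)                        # term -> list of <doc_id, term_frequency> postings
for doc_id, terms in sorted(docs.items()):
    for term, tf in sorted(Counter(terms).items()):
        index[term].append((doc_id, tf))

print(index["old"])     # [(1, 1), (2, 2), (3, 1)] -- the first three postings on the slide
print(index["teach"])   # [(1, 2), (3, 1)]
```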
Boolean Model
• A document is represented as a set of keywords.
• Queries are Boolean expressions of keywords,
connected by AND, OR, and NOT.
• Output: Document is relevant or not (Classification).
No partial matches or ranking.
• For example, the query old AND college matches docs 2 and 3 (see the sketch after this slide):
2. In the big old college in the big old town
3. The college in the town likes the good old teacher
• Bag of Words paradigm!
• Order is not important, frequency is!
• Retrieval is carried out by matching.
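The sketch referred to above: Boolean retrieval reduces to set operations over postings; the document ids are taken from the inverted index slide.

```python
# Postings (document ids only) for the two query terms, from the inverted index slide.
postings = {
    "old":     {1, 2, 3, 4, 5},
    "college": {2, 3},
}

# Boolean AND = intersection; OR = union; NOT = difference against the full document set.
result = postings["old"] & postings["college"]
print(sorted(result))    # [2, 3]: the documents matching "old AND college"
```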
Document Matrix
Term/Doc DOC-1 DOC-2 DOC-3 DOC-4 DOC-5 DOC-6
big 0 1 0 0 0 0
college 0 1 1 0 0 0
course 1 0 0 0 0 0
evening 0 0 0 0 1 0
fail 0 0 0 1 0 0
good 1 0 1 0 1 0
lecture 0 0 0 0 0 1
like 0 0 1 0 0 1
note 0 0 0 0 0 1
old 1 1 1 1 1 0
several 1 0 0 0 0 0
student 0 0 0 0 0 1
teach 1 0 1 1 1 0
town 0 1 1 0 0 0
SVD: Low-Rank Approximation

• Whereas the term-doc matrix A may have m=50000, n=10 million (and
rank close to 50000)
• We can construct an approximation A100 with rank 100.
• Of all rank 100 matrices, it would have the lowest Frobenius error.
• … but why should we?
• Answer: Latent Semantic Indexing

C. Eckart, G. Young, The approximation of one matrix by another of lower rank, Psychometrika, 1, 211-218, 1936.
The Problem
• Example: Vector Space Model

  Doc 1: auto, engine, bonnet, tyres, lorry, boot
  Doc 2: car, emissions, hood, make, model, trunk
  Doc 3: make, hidden, Markov, model, emissions, normalize

  Synonymy: Docs 1 and 2 will have a small cosine but are related.
  Polysemy: Docs 2 and 3 will have a large cosine but are not truly related.
Problems with Lexical Semantics

• Synonymy: Different terms may have an


identical or a similar meaning (weaker:
words indicating the same topic).
• No associations between words are made
in the vector space representation.
Latent Semantic Analysis
• Latent semantic space: illustrating example
d1 d2 d3 d4 d5 d6
Ship 1 0 1 0 0 0
Boat 0 1 0 0 0 0
Ocean 1 1 0 0 0 0
Voyage 1 0 0 1 1 0

Trip 0 0 0 1 0 1

• The reduced dimensional data in 2-d is given below


d1 d2 d3 d4 d5 d6

SV1 -1.62 -0.6 -0.44 -0.97 -0.7 -0.26

SV2 -0.46 -0.84 -0.3 1.0 0.35 0.65


LSA
• The figure plots the six documents at their 2-d latent-space coordinates
(SV1, SV2) from the table above.
• d2 is similar to d3
• d2: boat, ocean; d3: ship

• d4 is similar to d5 and d6 (by cosine similarity)
• d4: voyage, trip; d5: voyage; d6: trip
Latent Semantic Indexing (LSI)
• Perform a low-rank approximation of the document-term
matrix (typical rank 100-300).
• General idea:
• Map documents (and terms) to a low-dimensional
representation.
• Design the mapping such that the low-dimensional space reflects
semantic associations (latent semantic space).
• Compute document similarity based on the inner product in this
latent semantic space.
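A minimal numpy sketch of LSI on the term-document matrix from the LSA example above; the 2-d coordinates it prints should match the SV1/SV2 table up to a sign flip of each dimension, since the SVD fixes signs arbitrarily.

```python
import numpy as np

# Term-document matrix from the LSA example (rows: ship, boat, ocean, voyage, trip).
C = np.array([
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2
doc_coords = np.diag(s[:k]) @ Vt[:k, :]   # 2-d document coordinates (SV1, SV2) in the latent space
print(np.round(doc_coords, 2))            # compare with the table above (up to sign per dimension)
```

Document-document similarities in the latent space are then just inner products of these 2-d columns.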
