CS3491-AI ML-Chapter 5
Machine Learning
CHAPTER 5: Multivariate Methods
Multivariate Data
Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples
Data matrix (N instances, d attributes):
X = [ X1^1  X2^1  …  Xd^1
      X1^2  X2^2  …  Xd^2
       ⋮
      X1^N  X2^N  …  Xd^N ]
Multivariate Parameters
Mean: E[x] = μ = [μ1, …, μd]^T
Covariance matrix:
Σ ≡ Cov(X) = E[(X – μ)(X – μ)^T]
  = [ σ1²   σ12  …  σ1d
      σ21   σ2²  …  σ2d
       ⋮
      σd1   σd2  …  σd² ]
Parameter Estimation
Sample mean m:  mi = Σt xi^t / N,   i = 1, …, d
Covariance matrix S:  sij = Σt (xi^t – mi)(xj^t – mj) / N
Correlation matrix R:  rij = sij / (si sj)
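A minimal NumPy sketch of these estimators, assuming the data sit in an N × d array X (the synthetic data and variable names are illustrative):

```python
import numpy as np

# Synthetic N x d data matrix (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # N = 100 instances, d = 3 attributes
N, d = X.shape

m = X.mean(axis=0)                   # sample mean m_i
Xc = X - m                           # centered data
S = (Xc.T @ Xc) / N                  # covariance matrix S (1/N normalization)
s = np.sqrt(np.diag(S))              # standard deviations s_i
R = S / np.outer(s, s)               # correlation matrix R, r_ij = s_ij / (s_i s_j)

# Cross-check against NumPy's built-in estimators
assert np.allclose(S, np.cov(X, rowvar=False, bias=True))
assert np.allclose(R, np.corrcoef(X, rowvar=False))
```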
Estimation of Missing Values
What to do if certain instances have missing attributes?
Ignore those instances: not a good idea if the sample is small
Use ‘missing’ as an attribute: may give information
Imputation: Fill in the missing value
Mean imputation: Use the most likely value (e.g., mean)
Imputation by regression: Predict based on other attributes
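A minimal sketch of mean imputation and imputation by regression, assuming missing entries are coded as np.nan; the function names are illustrative, and the regression variant assumes the other attributes of the incomplete rows are observed:

```python
import numpy as np

def mean_impute(X):
    """Fill each missing entry with its column (attribute) mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def regression_impute(X, j):
    """Fill missing values of attribute j by least-squares regression on the
    other attributes, fit on the rows that are fully observed."""
    X = X.copy()
    others = [k for k in range(X.shape[1]) if k != j]
    complete = ~np.isnan(X).any(axis=1)          # rows with no missing values
    missing = np.isnan(X[:, j])                  # rows where attribute j is missing
    A = np.c_[np.ones(complete.sum()), X[np.ix_(complete, others)]]
    w, *_ = np.linalg.lstsq(A, X[complete, j], rcond=None)
    B = np.c_[np.ones(missing.sum()), X[np.ix_(missing, others)]]
    X[missing, j] = B @ w                        # predicted values replace the NaNs
    return X
```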
Multivariate Normal Distribution
x ~ Nd(μ, Σ)
p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ –½ (x – μ)^T Σ⁻¹ (x – μ) ]
Multivariate Normal Distribution
Mahalanobis distance: (x – μ)^T Σ⁻¹ (x – μ) measures the distance from x to μ in terms of Σ (normalizes for difference in variances and correlations)
Bivariate: d = 2
Σ = [ σ1²      ρσ1σ2
      ρσ1σ2    σ2²   ]
p(x1, x2) = [1 / (2π σ1 σ2 √(1 – ρ²))] exp[ –(z1² – 2ρ z1 z2 + z2²) / (2(1 – ρ²)) ]
where zi = (xi – μi) / σi
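A small sketch of the Mahalanobis distance and the Nd(μ, Σ) density for a bivariate case; σ1 = 1, σ2 = 2 and ρ = 0.5 are assumed, illustrative values:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative bivariate parameters
mu = np.array([0.0, 0.0])
s1, s2, rho = 1.0, 2.0, 0.5
Sigma = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])

x = np.array([1.0, 1.0])
diff = x - mu
maha2 = diff @ np.linalg.inv(Sigma) @ diff       # squared Mahalanobis distance

d = len(mu)
p = np.exp(-0.5 * maha2) / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))

# scipy's multivariate_normal gives the same density value
assert np.isclose(p, multivariate_normal(mu, Sigma).pdf(x))
```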
Bivariate Normal
[Figures omitted]
Independent Inputs: Naive Bayes
If xi are independent, off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σi) Euclidean distance:
p(x) = ∏i pi(xi) = [1 / ((2π)^(d/2) ∏i σi)] exp[ –½ Σi ((xi – μi) / σi)² ]
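A quick numerical check of this factorization (illustrative μ and σ values), showing that the product of univariate densities matches the full density with a diagonal Σ:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Illustrative parameters: independent inputs, so Sigma is diagonal
mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([1.0, 0.5, 2.0])
x = np.array([0.3, 0.8, -2.0])

# Product of the univariate densities p_i(x_i)
p_naive = np.prod(norm.pdf(x, loc=mu, scale=sigma))

# Full multivariate density with diagonal covariance gives the same value
p_full = multivariate_normal(mu, np.diag(sigma**2)).pdf(x)
assert np.isclose(p_naive, p_full)
```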
Parametric Classification
If p (x | Ci ) ~ N ( μi , ∑i )
p(x | Ci) = [1 / ((2π)^(d/2) |Σi|^(1/2))] exp[ –½ (x – μi)^T Σi⁻¹ (x – μi) ]
Discriminant functions are
gi(x) = log p(x | Ci) + log P(Ci)
      = –(d/2) log 2π – ½ log|Σi| – ½ (x – μi)^T Σi⁻¹ (x – μi) + log P(Ci)
Estimation of Parameters
P̂(Ci) = Σt ri^t / N
mi = Σt ri^t x^t / Σt ri^t
Si = Σt ri^t (x^t – mi)(x^t – mi)^T / Σt ri^t
(where ri^t = 1 if x^t ∈ Ci and 0 otherwise)
gi(x) = –½ log|Si| – ½ (x – mi)^T Si⁻¹ (x – mi) + log P̂(Ci)
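A minimal sketch of these per-class estimates, assuming labelled data X (N × d) with integer labels y in {0, …, K–1}; the function name is illustrative:

```python
import numpy as np

def estimate_class_params(X, y, K):
    """Return priors P_hat(C_i), means m_i and covariance matrices S_i per class."""
    N, d = X.shape
    priors = np.zeros(K)
    means = np.zeros((K, d))
    covs = np.zeros((K, d, d))
    for i in range(K):
        Xi = X[y == i]                      # instances with r_i^t = 1
        priors[i] = len(Xi) / N             # P_hat(C_i)
        means[i] = Xi.mean(axis=0)          # m_i
        Xc = Xi - means[i]
        covs[i] = (Xc.T @ Xc) / len(Xi)     # S_i (maximum-likelihood, 1/N_i)
    return priors, means, covs
```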
Different Si
Quadratic discriminant
gi(x) = –½ log|Si| – ½ (x^T Si⁻¹ x – 2 x^T Si⁻¹ mi + mi^T Si⁻¹ mi) + log P̂(Ci)
      = x^T Wi x + wi^T x + wi0
where
Wi = –½ Si⁻¹
wi = Si⁻¹ mi
wi0 = –½ mi^T Si⁻¹ mi – ½ log|Si| + log P̂(Ci)
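A sketch of evaluating the quadratic discriminant in the Wi, wi, wi0 form above, reusing the kind of estimates returned by the previous snippet (helper name is illustrative):

```python
import numpy as np

def quadratic_discriminant(x, prior, m, S):
    """g_i(x) = x^T W_i x + w_i^T x + w_i0 for one class."""
    S_inv = np.linalg.inv(S)
    W = -0.5 * S_inv
    w = S_inv @ m
    w0 = -0.5 * m @ S_inv @ m - 0.5 * np.log(np.linalg.det(S)) + np.log(prior)
    return x @ W @ x + w @ x + w0

# Classification: choose the class with the largest g_i(x), e.g.
# y_pred = np.argmax([quadratic_discriminant(x, priors[i], means[i], covs[i])
#                     for i in range(K)])
```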
[Figure: class likelihoods, the discriminant where P(C1|x) = 0.5, and the posterior for C1]
Common Covariance Matrix S
Shared common sample covariance S
S = Σi P̂(Ci) Si
Discriminant reduces to
gi(x) = –½ (x – mi)^T S⁻¹ (x – mi) + log P̂(Ci)
which is a linear discriminant
gi(x) = wi^T x + wi0
where
wi = S⁻¹ mi
wi0 = –½ mi^T S⁻¹ mi + log P̂(Ci)
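A corresponding sketch for the shared-covariance case, which gives the linear discriminant above (helper name is illustrative):

```python
import numpy as np

def linear_discriminant(x, prior, m, S_shared):
    """g_i(x) = w_i^T x + w_i0 with a covariance matrix shared by all classes."""
    S_inv = np.linalg.inv(S_shared)
    w = S_inv @ m
    w0 = -0.5 * m @ S_inv @ m + np.log(prior)
    return w @ x + w0

# Shared covariance as the prior-weighted average of the class covariances:
# S_shared = sum(priors[i] * covs[i] for i in range(K))
```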
Common Covariance Matrix S
[Figure omitted]
Diagonal S
When xj j = 1,..d, are independent, ∑ is diagonal
p (x|Ci) = ∏j p (xj |Ci) (Naive Bayes’ assumption)
gi(x^t) = –½ Σj ((xj^t – mij) / sj)² + log P̂(Ci)
Diagonal S
[Figure: variances may be different]
Diagonal S, equal variances
Nearest mean classifier: Classify based on Euclidean distance to the nearest mean
gi(x^t) = –‖x^t – mi‖² / (2s²) + log P̂(Ci)
        = –(1 / (2s²)) Σj (xj^t – mij)² + log P̂(Ci)
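A minimal sketch of the nearest mean classifier; with equal priors and equal variances it simply picks the class whose mean is closest in Euclidean distance (names are illustrative):

```python
import numpy as np

def nearest_mean_predict(x, means):
    """Assign x to the class whose mean m_i is nearest in Euclidean distance."""
    dists = np.linalg.norm(means - x, axis=1)   # ||x - m_i|| for each class
    return int(np.argmin(dists))
```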
Diagonal S, equal variances
[Figure omitted]
Model Selection
Assumption                    Covariance matrix       No. of parameters
Shared, Hyperspheric          Si = S = s²I            1
Shared, Axis-aligned          Si = S, with sij = 0    d
Shared, Hyperellipsoidal      Si = S                  d(d+1)/2
Different, Hyperellipsoidal   Si                      K·d(d+1)/2
As we increase complexity (less restricted S), bias decreases and variance increases
Assume simple models (allow some bias) to control variance (regularization)
Discrete Features
Binary features: pij ≡ p(xj = 1 | Ci)
If xj are independent (Naive Bayes’)
p(x | Ci) = ∏j pij^(xj) (1 – pij)^(1 – xj)
Estimator: p̂ij = Σt xj^t ri^t / Σt ri^t
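A minimal sketch of estimating p̂ij and scoring with this Bernoulli model, assuming binary data X in {0, 1} and labels y; the small smoothing constant alpha is an added assumption to avoid log 0:

```python
import numpy as np

def fit_bernoulli_nb(X, y, K, alpha=1e-3):
    """p_hat[i, j] = P(x_j = 1 | C_i), estimated as a smoothed class-wise mean."""
    p_hat = np.array([(X[y == i].sum(axis=0) + alpha) /
                      (np.sum(y == i) + 2 * alpha) for i in range(K)])
    priors = np.array([np.mean(y == i) for i in range(K)])
    return priors, p_hat

def bernoulli_nb_score(x, prior, p):
    """log p(x | C_i) + log P(C_i) under the independence assumption."""
    return np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)) + np.log(prior)
```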
Discrete Features
Multinomial (1-of-nj) features: xj ∈ {v1, v2, …, vnj}
pijk ≡ p(zjk = 1 | Ci) = p(xj = vk | Ci)
If xj are independent
p(x | Ci) = ∏j ∏k pijk^(zjk)
Estimator: p̂ijk = Σt zjk^t ri^t / Σt ri^t
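A short sketch of the 1-of-nj estimate for a single categorical attribute j, assuming its values are coded as integers 0 … nj – 1 (names are illustrative):

```python
import numpy as np

def fit_multinomial_feature(xj, y, i, n_j):
    """p_hat_ijk = fraction of class-i instances with x_j = v_k."""
    in_class = (y == i)
    counts = np.bincount(xj[in_class], minlength=n_j)   # sum over t of z_jk^t r_i^t
    return counts / in_class.sum()                       # divided by sum over t of r_i^t
```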
Multivariate Regression
r^t = g(x^t | w0, w1, …, wd) + ε
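If g is taken to be linear, g(x | w0, w1, …, wd) = w0 + w1 x1 + … + wd xd, the weights can be fit by least squares; a minimal sketch on synthetic data (all names illustrative):

```python
import numpy as np

# Synthetic regression data: r^t = g(x^t | w) + noise
rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -1.0, 0.5, 3.0])                 # [w_1, w_2, w_3, w_0]
r = X @ w_true[:d] + w_true[d] + 0.1 * rng.normal(size=N)

# Augment with a column of ones for w_0 and solve by least squares
A = np.c_[X, np.ones(N)]
w_hat, *_ = np.linalg.lstsq(A, r, rcond=None)            # estimates [w_1, ..., w_d, w_0]
```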