cs419Notes
Instructor: Saketh
Contents

3.1.4 Multivariate Gaussian and Gaussian-inverse-Wishart-Gaussian Models
3.1.5 Mixture Models
3.1.6 Hidden Markov Models
3.2 Supervised Learning
3.2.1 Probabilistic Models
3.2.1.1 Multivariate Bernoulli Naive Bayes Classifier
3.2.1.2 Gaussian Discriminant Analysis
3.2.1.3 Logistic Regression
3.2.1.4 Linear Regression
3.2.2 Deterministic Models
3.2.2.1 Support Vector Machine
3.2.2.2 Support Vector Regression
3.2.2.3 Kernel Machines
Chapter 1
Machine learning aims at developing algorithms that mimic the human ability to learn, i.e., to improve one's "performance" with experience. By performance, we mean the various cognitive abilities. A short note about this is presented below. It is easy to observe that machine learning algorithms will have far-reaching consequences in all aspects of life and hence are worth developing.
the need/application humans express the unknown-concept through some input-output pairs. For example, in the grouping/clustering of data example, the input is the datum and the output is the group-id, etc. This clarifies what we mean by input and output.
Also, how well or how fast a person learns will certainly depend on his current state/capacity/bias/background. We will refer to this as background henceforth. Given this terminology, one could say that the objective in human learning is to determine the unknown-concept, but it seems enough to say that the goal is to be able to provide the "right" output for any input (because this is how humans usually express the ability/unknown-concept that has been learnt). Summarizing, here are the five important aspects in human learning:
1. Input and output
2. Unknown-concept
3. Experience
4. Background
5. Objective of learning
sequence-filling problems the probability distribution is over the inputs alone, and in the case of grouping/clustering data and speech/object recognition problems the probability distribution is over the input-output pairs.
3. Experience is given by: (a) a set of inputs alone (the unsupervised case), or (b) a set of input-output pairs (the supervised case).
In either case, we call this set the training set. Most importantly, the training set and the unknown distribution have a relation: we assume that the training set is an instantiation of a set of iid random samples from the unknown distribution. We denote the set of iid random variables generating the training set as D = {X_1, …, X_m} in the unsupervised case and as D = {(X_1, Y_1), …, (X_m, Y_m)} in the supervised case. The training set is an instantiation of D (henceforth denoted by D).
(a) Deterministic Models: Linear models (the set of all linear functions [Sheldon Axler, 1997] over the input space), Quadratic models (the set of all quadratic functions over the input space), Analytic models (the set of all analytic functions over the input space), Convex models (the set of all convex functions [Boyd and Vandenberghe, 2004] over the input space), etc.
(b) Probabilistic Models [Ross, 2006]⁶: Bernoulli models (the set of all Bernoulli distributions over the input space), Gaussian models (the set of all Gaussian distributions over the input space).
Note that the members of each model differ only in terms of their parameters. These parameters are hence called model parameters. For example, for linear models θ ∈ R^n are the parameters, and for Gaussian models the mean vector and covariance matrix are the parameters. We will reserve M for denoting a model and θ for denoting the vector of model parameters.
Broadly, there are two schools of thought when it comes to formalizing the objective of learning. The first is to view learning as the process of selecting the "best" model parameter after observing the training set (given the model); in this case the output for a given input is predicted/estimated/inferred using only this best model parameter. The second is to view learning as the process of estimating the uncertainty of each model parameter being the "right" one; the output is then estimated/predicted/inferred (for a given input) by taking an aggregate opinion of all the model parameters⁷. The latter approach is called Bayesian learning. We begin with the former set-up of non-Bayesian learning and then provide a description of Bayesian learning.
(1.1)  \min_{\theta \in \vartheta} R[g_\theta],

where ϑ is the set of possible values the model parameter, θ, could take, and g_θ is the particular member of the deterministic model (i.e., member of the set of functions) with θ as the parameter⁸. R[g_θ] is the risk incurred by employing g_θ as the "best" hypothesis for prediction. Further, we define risk as the expected loss, where loss is any non-negative function, l : 𝒴 × 𝒴 → ℝ₊, that measures how far the predicted and true outputs are for the given input with the given function, i.e., R[g_θ] ≡ E[l(g_θ(X), Y)]. Examples of loss functions⁹ are: i) 0-1 loss: l(g_θ(X), Y) ≡ 1_{g_θ(X) ≠ Y}; ii) square loss: l(g_θ(X), Y) ≡ (g_θ(X) − Y)². Needless to say, one cannot even compute this expectation, as the distribution (U_XY) is unknown, and hence minimizing it is impossible. Later on we will see how to approximate this goal using the training set.
In this case, given an x, the corresponding y is predicted using y = g_{θ*}(x), where θ* is the minimizer of the objective in (1.1).
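To make these definitions concrete, here is a minimal sketch (entirely illustrative: the data, the parameter value and the linear hypothesis are made up) of the square loss and of the sample-average approximation of the risk that the following chapters develop:

```python
import numpy as np

# Hypothetical training set: inputs x_i in R^2 and real-valued outputs y_i.
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([1.5, 0.2, 1.9])

theta = np.array([0.8, 0.1])          # a candidate model parameter

def g(theta, x):
    """A member of the linear model: g_theta(x) = theta^T x."""
    return theta @ x

def square_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2

# Sample average of the loss over the training set, used as a surrogate
# for the (uncomputable) expected risk R[g_theta].
emp_risk = np.mean([square_loss(g(theta, xi), yi) for xi, yi in zip(X, y)])
print(emp_risk)
```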
In the case of probabilistic models for unsupervised learning with U_X as the unknown distribution, here is a natural way of writing down the goal:

(1.2)  \min_{\theta \in \vartheta} \int_{\mathcal{X}} \left( F_\theta(x) - U_X(x) \right)^2 \, dx,
7. To better understand this statement, you may want to view a model as a set of hypotheses, with each model parameter corresponding to a particular hypothesis in this set.
8. In other words, the model is the set of all g_θ for the various θ ∈ ϑ.
9. Later on we will provide more examples of loss functions.
where F_θ is the particular distribution in the model with parameter θ. In plain words, this simply picks the θ that induces the distribution closest to the unknown one. Further, assuming that the moment-generating function (mgf) exists for U_X and for all the distributions in the model, here is another way of writing the goal:

(1.3)  \min_{\theta \in \vartheta} \sum_{i=1}^{\infty} \left( E_{U_X}[X^i] - E_{F_\theta}[X^i] \right)^2

In plain words, this simply picks the θ that induces the distribution whose moments are closest to those of the unknown one.
10. Posterior distribution, because it is the distribution after observing something (here, the distribution of θ after seeing the training set).
Chapter 2
1. The Bayesian set-up is natural with probabilistic models only.
Note that, unlike (1.1), the empirical risk minimization (ERM) problem,

(2.1)  \min_{\theta \in \vartheta} \hat{R}_m[g_\theta] \equiv \min_{\theta \in \vartheta} \frac{1}{m} \sum_{i=1}^{m} l(g_\theta(x_i), y_i),

is not impossible to compute². While this is encouraging, the immediate question is whether the minimum of the empirical risk is a PAC version of the minimum of the risk. We say ERM is statistically consistent iff this happens. Fortunately, for the loss functions and models we use in this course, this can be shown [Schölkopf and Smola, 2002, Vapnik, 1998].
Similarly, approximating the moments of U_X in (1.3) by the corresponding sample moments gives the method of moments:

(2.2)  \min_{\theta \in \vartheta} \sum_{i=1}^{\infty} \left( \frac{1}{m} \sum_{j=1}^{m} x_j^i - E_{F_\theta}[X^i] \right)^2
Under some mild regularity conditions one can show that the method of moments
is statistically consistent [Hansen, 1982].
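For instance, for the univariate Gaussian model the first two moment conditions, E[X] = μ and E[X²] = μ² + σ², can be inverted in closed form; here is a small sketch (my own illustration, with simulated data standing in for samples from U_X):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)   # samples from the "unknown" U_X

# Match the first two sample moments to those of a Gaussian F_theta:
# E[X] = mu and E[X^2] = mu^2 + sigma^2, so the minimizer of (2.2)
# (truncated to i = 1, 2) is available in closed form.
m1 = np.mean(x)
m2 = np.mean(x ** 2)
mu_hat = m1
sigma_hat = np.sqrt(m2 - m1 ** 2)
print(mu_hat, sigma_hat)   # close to (2.0, 3.0) for large m
```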
In fact, one can easily generalize this algorithm: suppose U_X ∈ M, i.e., there exists θ* ∈ ϑ such that F_{θ*} = U_X. Let us also assume that no two distributions in the model are the same. And suppose there exist n functions g_i : 𝒳 × ϑ → ℝ, i = 1, …, n, and g = (g_1, …, g_n)^⊤ such that E_{U_X}[g(X, θ)] = 0 if and³ only if θ = θ*. We define the
method of generalized moments as:
(2.3)  \min_{\theta \in \vartheta} \sum_{i=1}^{n} \left( \frac{1}{m} \sum_{j=1}^{m} g_i(x_j, \theta) \right)^2
2. This does not mean that (2.1) can be solved in polynomial time. In fact, even with the 0-1 loss and a linear model, this problem is computationally hard [Feldman et al., 2009].
3. Note that we define the expectation of a vector to be the vector of expectations.
Note that if we choose g_i(X, θ) ≡ X^i − E_{F_θ}[X^i], i = 1, 2, …, then we get back the method of moments. Hence this method indeed generalizes the earlier one.
From the literature of information theory, we know that g(X, θ̄) = ∇_θ ln(f_θ(X))|_{θ=θ̄}, known as the score function (whose covariance is the Fisher information⁴), satisfies⁵ the required zero-expectation criterion mentioned above, i.e., the expectation of the gradient of the log-likelihood function is zero only with the correct model parameters (θ*). Here, and henceforth, we denote the pmf/pdf of F_θ by f_θ and call it the likelihood⁶. Note that with this choice of g, the method of generalized moments is:

(2.4)  \min_{\theta \in \vartheta} \left\| \nabla_\theta \ln(f_\theta(D)) \right\|^2

where f_θ(D) is the likelihood of the training set, D, wrt. the F_θ distribution.
Motivated by the above, we write down the following algorithm, popular as the maximum likelihood algorithm:

(2.5)  \max_{\theta \in \vartheta} \ln(f_\theta(D))

While the above algorithm⁷ is very intuitive (it simply says the best distribution is the one under which the likelihood of seeing the training data is highest), it is interesting that under some sufficient conditions (2.4), the generalized method of moments, and (2.5), the maximum likelihood, are the same: i) the log-likelihood as a function of θ is concave, and ii) ϑ is an open set.
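As a quick illustration (assumptions mine: an exponential model f_θ(x) = λe^{−λx}), (2.5) has a closed-form solution, and the score is zero there, in line with (2.4):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2.5, size=5_000)    # true lambda = 2.5

# Log-likelihood of the training set under F_lambda:
#   ln f_lambda(D) = m*ln(lambda) - lambda * sum_i x_i
# Setting its derivative to zero gives lambda_hat = m / sum_i x_i.
lam_hat = len(x) / np.sum(x)
print(lam_hat)   # close to 2.5

# Sanity check: the score (gradient of the log-likelihood) vanishes at
# lam_hat, i.e., lam_hat also solves the criterion in (2.4).
score = len(x) / lam_hat - np.sum(x)
print(score)     # exactly 0 up to floating-point error
```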
We finish this section with another algorithm that, again, is closely related to the maximum likelihood algorithm. Given two likelihood functions, f_1, f_2, one way of computing an inner product [Sheldon Axler, 1997] is: ⟨f_1, f_2⟩ ≡ E_{f_1}[f_2(X)] = E_{f_2}[f_1(X)], the expected likelihood [Jebara et al., 2004]. Needless to say, given an inner product, a distance/norm function is induced: ‖f_1 − f_2‖² ≡ ⟨f_1, f_1⟩ + ⟨f_2, f_2⟩ − 2⟨f_1, f_2⟩. This motivates the following objective for learning: min_{θ∈ϑ} ‖f_θ − U_X‖² = min_{θ∈ϑ} E_{f_θ}[f_θ(X)] − 2E_{U_X}[f_θ(X)]. Again, motivated by the law of large numbers, we propose the following algorithm for minimizing the above objective:

(2.6)  \min_{\theta \in \vartheta} E_{f_\theta}[f_\theta(X)] - 2 \sum_{i=1}^{m} f_\theta(x_i).
Interestingly, the first term is independent of the training data and is fixed given the model. For example, for the exponential distribution, f(x) = λ exp(−λx), λ > 0, this term is simply λ/2, since E_{f_θ}[f_θ(X)] = \int_0^\infty \lambda e^{-\lambda x} \cdot \lambda e^{-\lambda x}\, dx = \lambda^2 \cdot \frac{1}{2\lambda} = \frac{\lambda}{2}. While this first term is minimized, the second term (without the minus sign) is the sum of the likelihoods of the training examples, which is maximized. Hence the objective is not only performing some kind of maximum likelihood, but also preferring low-energy⁸ models. Such terms in the objective that do not depend on the training set are called regularizers. We also note that, without the first term, this algorithm maximizes the arithmetic mean of the likelihoods, whereas the maximum likelihood algorithm maximizes the geometric mean of the likelihoods (and AM ≥ GM). This is another way to motivate the popular maximum likelihood algorithm. In the subsequent section, we shift our focus to the Bayesian set-up.
Note that all the methods/algorithms mentioned above approximate the true expectation/mean/risk wrt. the unknown distribution with the sample average/mean/empirical risk, and then hope⁹ for statistical consistency. Such algorithms, which depend on the notion of probability as frequency, are called Frequentist algorithms¹⁰.

4. For this course the Wikipedia article is enough: http://en.Wikipedia.org/wiki/Fisher_information.
5. ∇_x g(x) denotes the gradient of g wrt x, i.e., the vector of partial derivatives of g wrt. the x_i's.
6. Instead of saying probability for discrete RVs and density for continuous RVs, the machine learning/statistics community uses the term likelihood for both kinds of RVs.
7. Note that the θ* that maximizes the likelihood also maximizes the log-likelihood and vice-versa.
Observe that given f_Θ, usually called the prior¹², f_{Θ/D} can be computed. Also, for prediction, we aggregate over all the model parameters using the total probability rule: consider the U_X case; then f_{X/D}(x/D) = \sum_{\theta \in \vartheta} f_{X/\Theta}(x/\theta)\, f_{\Theta/D}(\theta/D) if the posterior distribution over θ is discrete, and f_{X/D}(x/D) = \int_{\vartheta} f_{X/\Theta}(x/\theta)\, f_{\Theta/D}(\theta/D)\, d\theta if the posterior distribution over θ is continuous. Henceforth we will refer to f_{X/D} as the Bayesian Average Model (BAM).
the "best" parameter and used for prediction (as in the non-Bayesian set-up). For example, compute the mode of the posterior f_{Θ/D} and use it as the "best" parameter, popularly known as the Maximum A Posteriori (MAP) algorithm; or compute the mean or median, etc. Such half-way Bayesian approaches are often called posterior plug-in approximations.
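To make the three predictors concrete, here is a sketch for a running example that is mine, not the notes': a Bernoulli model with a Beta(a, b) prior, for which the posterior is Beta(a + k, b + m − k) and the MLE, MAP and BAM predictions are all available in closed form:

```python
# Coin-flip data: m trials, k heads (hypothetical numbers).
m, k = 10, 7
a, b = 2.0, 2.0                             # Beta prior hyperparameters (domain knowledge)

theta_mle = k / m                           # MLE: maximizes the likelihood alone
theta_map = (a + k - 1) / (a + b + m - 2)   # MAP: mode of the Beta posterior
theta_bam = (a + k) / (a + b + m)           # BAM: posterior predictive P(X=1/D),
                                            # the aggregate opinion of all theta
print(theta_mle, theta_map, theta_bam)
```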
If one now asks which of the methods MLE, MAP and BAM may have better predictive performance, then my answer would be: simply because MAP and BAM utilize domain knowledge, which is a superior form of knowledge/wisdom, they may perform better than MLE (of course, their performance is doubtful when there is no such explicit domain knowledge). Between MAP and BAM, I would intuitively say that if the model has a distribution that is equal or close to the unknown distribution being modeled, then MAP or some other such plug-in approximation based on the posterior is perhaps better. In any case, we will be able to answer this question better only in section 2.4.
Generative Models: Model U_XY, the unknown distribution, and then use Bayes rule for obtaining U_{Y/X} from it¹⁴. At first this seems overkill, but as summarized in section 8.6 of Murphy [2012], both generative and discriminative models have their own trade-offs. Further, there are two different choices in modeling U_XY :
Of course, any of the Frequentist, Bayesian and other plug-in approximations may be employed for learning with any of the above types of models. Through various examples we will later realize the conveniences of each of these three different types of supervised models.
We end this section with some terminology: the supervised problem where 𝒴 is a discrete set¹⁵ is called a classification problem. More specifically, if 𝒴 is a discrete set with exactly two members, then it is called a binary classification problem. Typically, when employing a probabilistic model for predicting the label/output for a given x with a learnt model, one resorts to what is popularly known as Maximum A Posteriori Inference/Prediction (MAP Inference/Prediction), i.e., the predicted label of x is given by argmax_{y∈𝒴} f(y/x), where f is the estimate for U_{Y/X=x}. This MAP-derived function that provides the prediction g(x) = ŷ is called the classifier. If in a classification problem one employs the third type of model, i.e., models both U_{X/Y} and U_Y, then the resulting classifier is called the Bayes classifier. Further, if one employs the Bayesian average model for both U_{X/Y} and U_Y, then we qualify the classifier as a Bayesian Bayes Classifier. Sometimes, in addition to the assumption that the examples in the training set are independent, one makes the assumption that the features/dimensions of the data X ∈ ℝⁿ are independent. If a Bayes classifier makes such an assumption, i.e., assumes U_{X/Y} = U_{X_1/Y} ⋯ U_{X_n/Y}, then the resulting classifier is called a Naive Bayes classifier; a sketch of such MAP inference follows.
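Here is a minimal sketch of MAP inference for a (multivariate-Bernoulli) naive Bayes classifier; the parameter estimates below are made up for illustration:

```python
import numpy as np

# Hypothetical learnt quantities for a binary problem with 3 binary features:
prior = {0: 0.6, 1: 0.4}                    # estimate of U_Y
cond = {0: np.array([0.2, 0.7, 0.1]),       # estimate of P(X_j = 1 / Y = y);
        1: np.array([0.8, 0.3, 0.5])}       # one Bernoulli per feature (the naive assumption)

def predict(x):
    """MAP inference: argmax_y f(y/x), with f(y/x) proportional to f(x/y) f(y)."""
    scores = {}
    for y in (0, 1):
        p = cond[y]
        likelihood = np.prod(np.where(x == 1, p, 1 - p))   # U_{X/Y} = prod_j U_{X_j/Y}
        scores[y] = likelihood * prior[y]
    return max(scores, key=scores.get)

print(predict(np.array([1, 0, 1])))
```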
If 𝒴 = ℝ (with the usual Euclidean structure on ℝ), then the supervised problem is called a regression problem.
how to choose the hyperparameter (and parameter) given this fixed set. So, essentially, whatever ideas we used for parameter selection or Bayesian averaging can be used again, with the only difference being that the hyperparameters now play the role of parameters (and perhaps we should call those now playing the role of hyperparameters hyperhyperparameters). Needless to say, this process can be carried out recursively. In the following, we begin with some methods that are essentially equivalent to the ideas in parameter selection. However, later we present a statistical learning theory result that introduces a new component to the usual objective of training-data fit. This will motivate us to devise a new class of algorithms (please refer to structural risk minimization in the following).
Let us first start with simple algorithms for model/hyperparameter selection/averaging that are similar in spirit to parameter selection/averaging. A Bayesian's answer would be to simply take a weighted opinion of all the models under consideration. For this you may use domain knowledge that gives a prior¹⁶ over models/hyperparameters. We can write an expression¹⁷ for the Bayesian average model: p(x/D) = \sum_{i=1}^{r} p(x/m_i, D)\, p(m_i/D), or equivalently, p(x/D) = \sum_{i=1}^{r} p(x/\alpha_i, D)\, p(\alpha_i/D), where p(α_i/D) is the posterior distribution of the hyperparameters, which is ∝ p(D/α_i) p(α_i), i.e., proportional to the product of the marginal likelihood of the data under a model/hyperparameter and the prior of the hyperparameter/model. Further, p(x/α_i, D) is essentially the BAM for a given hyperparameter/model (already familiar to you), and the expression for the marginal likelihood is of course: p(D/\alpha_i) = \int_{\theta \in m_i} p(D/\theta)\, p(\theta)\, d\theta (the average opinion of all distributions in the model m_i). Hence this algorithm can be called the Bayesian average of Bayesian average models.
While this is the case with a Bayesian, a non-Bayesian would argue for selecting
the “best” hyperparameter. Here are some obvious alternatives:
• One alternative is to pick the model/hyperparameter that maximizes the marginal likelihood p(D/α_i); this model selection algorithm is known as the maximum marginal likelihood algorithm.
• Another simple algorithm is to pick the parameter, from all (the union of) the models, that maximizes the likelihood of the training data. This is equivalent to doing MLE on the union of all models. (A sketch of marginal-likelihood-based selection follows this list.)
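Continuing the Beta-Bernoulli illustration from earlier (my example, not the notes'), the marginal likelihood is available in closed form, p(D/α) = B(a + k, b + m − k)/B(a, b), so maximum marginal likelihood selection over a pre-fixed grid looks like:

```python
from math import lgamma

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal_likelihood(a, b, m, k):
    # log p(D / alpha) for a Bernoulli likelihood and a Beta(a, b) prior,
    # with the parameter integrated out: B(a + k, b + m - k) / B(a, b).
    # (The binomial coefficient is constant across hyperparameters.)
    return log_beta(a + k, b + m - k) - log_beta(a, b)

m, k = 10, 7                                        # hypothetical data
candidates = [(1.0, 1.0), (2.0, 2.0), (0.5, 0.5)]   # pre-fixed hyperparameter grid
best = max(candidates, key=lambda ab: log_marginal_likelihood(*ab, m, k))
print(best)
```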
Note that each of the four types of algorithms stated above may give a different predictive function. In particular, the method of maximum marginal likelihood is not equivalent to the last one (simple MLE over the union). This is because the marginal likelihood is the likelihood's average over the entire model, and hence the model for which it is maximum need not be the same as the model which contains the parameter/distribution that maximizes the likelihood of the data. In fact, it is easy to see that the maximum marginal likelihood and the algorithm mentioned above it inherently penalize big/complex models! Please read section 5.3 in Murphy [2012] for a detailed discussion of this phenomenon.
This is a key observation that merits discussion: the simple MLE algorithm achieves the best fit to the training data because it essentially searches a bigger model, whereas the others penalize big models, in the sense that the bigger the model, the more probable it is that the likelihood under the corresponding predictive distribution obtained by the other three algorithms is lower. Given this, at first look one might argue that, since bigger models have a better chance than smaller models of containing the unknown distribution/function being modeled, the simple MLE algorithm is superior. However, results such as the following¹⁸ hint that bigger models need not necessarily be better:
Theorem 2.4.1. Let 𝒢 be a set of functions/distributions parametrized by θ ∈ ϑ. Denote a typical member of this model by g_θ. Recall that R[g_θ] and R̂_m[g_θ] denote the true and empirical risks of g_θ (computed using a loss function l). Then with probability 1 − δ, we have:

(2.8)  R[g_\theta] \leq \hat{R}_m[g_\theta] + 2\,\frac{C(\mathcal{G}, l)}{\sqrt{m}} + 3\sqrt{\frac{1}{2m}\log\frac{2}{\delta}} \quad \forall\ \theta \in \vartheta,

where C(𝒢, l) denotes the complexity of the model and loss combination, which satisfies the property that C(𝒢₁, l) ≤ C(𝒢₂, l) whenever 𝒢₁ ⊂ 𝒢₂.
Later on we will give an expression for this complexity for a particular model,
loss combination.
At first look this theorem's result may not seem great because, after all, R̂_m is simply the sample average and R is the expectation, so it should be easy to derive such a bound (for example, the proof of the weak law of large numbers will itself provide a bound). Moreover, such a bound would not have anything to do with 𝒢. If this is your thought process, then you should note that this is true only if g_θ is fixed a priori. However, note that if one needs to ascertain how good the ERM candidate, denoted by θ_ERM, is, i.e., how close R̂_m[g_{θ_ERM}] and R[g_{θ_ERM}] are, then such simple bounding techniques will not work, simply because θ_ERM is itself a function of the training data and hence R̂_m is no longer a simple average of iid random variables. The coolest thing about the theorem's result is that it holds uniformly for every θ ∈ ϑ and hence can be used to compare R̂_m[g_{θ_ERM}] and R[g_{θ_ERM}].

18. Refer to http://www.cse.iitb.ac.in/saketh/teaching/cs729Notes.pdf for details.
Also, in general, this bound cannot be improved. From this theorem it is very clear that big models are not necessarily better if we desire that the empirical risk of our candidate be close to its true risk. On the other hand, with smaller models, even though the empirical risk is perhaps close to the true risk, just because the model is perhaps too small to contain the unknown distribution, the predictive function that ERM/MLE/MAP/BAM chooses may be rendered useless. Hence an optimum balance needs to be struck, where perhaps the bound in (2.8), which is usually called the guaranteed risk, is itself minimized.
Based on the above discussion, one might hurry to write down the following algorithm for model selection and parameter selection, usually called structural risk minimization (SRM). For reasons of convenience, while talking about this algorithm one usually assumes that the given models are nicely arranged in a structure¹⁹ such as m₁ ⊂ m₂ ⊂ m₃ ⊂ …. SRM simply selects the model/hyperparameter, and the parameter in it, that minimizes the guaranteed risk (a schematic sketch follows). While this seems flawless at first sight, one has to note that the theorem holds when the model is fixed/given, and perhaps not when the model is to be selected. One may go ahead and re-derive a theorem for this case where the structure/hierarchy of models is given (and not the model itself). You can believe me, this will simply lead to an additive term in (2.8) that is a function of the complexity of this hierarchy/structure of models. Hence SRM is indeed a "safe" algorithm for model selection; however, not so if one further wants to choose the hierarchy/structure itself. It should now be clear that one can recursively keep minimizing additive variants of the bound in (2.8). In practice, users usually stop at the level of hyperparameter/model selection, but somehow ensure that they use some (finite) structure/hierarchy of models that will certainly contain or approximate the unknown distribution.

19. It so happens that this hierarchy/structure is a way to encode prior knowledge about which models or which parameters are expected to be "good" candidates. The idea is to simply put the most likely candidates in/as the smallest subset/model, and so on; since SRM will prefer the smaller models, this is like using prior knowledge that they are good.
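Here is a schematic sketch of SRM over a nested hierarchy; the complexities and empirical risks below are placeholders, not quantities computed from a real model:

```python
import math

m = 100        # number of training examples
delta = 0.05

# Nested models m1 ⊂ m2 ⊂ m3 with (hypothetical) complexities C(G, l)
# and the empirical risks of their respective ERM candidates.
models = {                 # name: (empirical risk, C(G, l))
    "m1": (0.45, 1.0),
    "m2": (0.18, 2.0),
    "m3": (0.17, 20.0),
}

def guaranteed_risk(emp_risk, complexity):
    # The right-hand side of (2.8): empirical risk plus a complexity penalty.
    return (emp_risk + 2 * complexity / math.sqrt(m)
            + 3 * math.sqrt(math.log(2 / delta) / (2 * m)))

best = min(models, key=lambda name: guaranteed_risk(*models[name]))
print(best)    # SRM picks the model with the best fit-complexity trade-off
```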
We will end this discussion with a slight variant of the simple MLE-based model selection algorithm (the fourth one in the above list). It so happens that practitioners more commonly employ different subsets of the training data to perform parameter selection and hyperparameter selection, using the same criterion of ERM/MLE/fit-to-the-data. Such algorithms are known as validation-based algorithms²⁰; popular variants include hold-out validation and K-fold cross-validation (a sketch of the latter follows).

20. Again, this is intuitive from the law of large numbers. However, one should still prove statistical consistency. Since in practice one always takes a finite set of hyperparameters, it turns out that statistical consistency is guaranteed.
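Here is a sketch of the K-fold variant, under an assumed estimator interface (the fit and risk callables are hypothetical placeholders, not from the notes):

```python
import numpy as np

def k_fold_cv_select(X, y, hyperparams, fit, risk, k=5):
    """For each hyperparameter, average the validation risk over k held-out
    folds and return the hyperparameter with the smallest average."""
    folds = np.array_split(np.random.permutation(len(y)), k)
    avg_risk = {}
    for hp in hyperparams:
        risks = []
        for i in range(k):
            val = folds[i]
            trn = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(X[trn], y[trn], hp)              # ERM/MLE on training folds
            risks.append(risk(model, X[val], y[val]))    # evaluated on held-out fold
        avg_risk[hp] = np.mean(risks)
    return min(avg_risk, key=avg_risk.get)
```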
The important point to note, however, owing to the earlier discussion, is that with any of these model selection algorithms one pre-fixes the (finite) set of models/hyperparameters. And it is not necessarily better if more and more hyperparameters/models are considered. The only conscious choice that can be made in choosing the set of models/hyperparameters is to somehow ensure that the unknown distribution can be well approximated by this set.
Chapter 3

In this chapter we work out the various algorithms presented earlier for various models. We begin with unsupervised learning examples.
3.1.4 Multivariate Gaussian and Gaussian-inverse-Wishart-Gaussian Models
Refer sections 4.1 and 4.6 in Murphy [2012].
3.1.5 Mixture Models

• Note that, from the theorems in the above paper, it is clear that the EM algorithm directly implements the generalized method of moments algorithm (2.4), where the goal is in fact to find stationary points of the log-likelihood. Hence, the EM algorithm is better motivated by the generalized method of moments algorithm than by the maximum likelihood algorithm. Somewhat strangely, many books introduce the EM algorithm as solving the maximum likelihood problem instead.
• For a novice there seems to be the following paradox: since it is known that (under some mild conditions) the generalized method of moments is statistically consistent, you may think that, just by providing millions and millions of samples from U_X, through the EM algorithm you are able to recover the joint U_XY! However, a closer look will convince you that the mixture model itself never involves Y, and it is we who are interpreting¹ the component densities as class conditionals and the mixing probabilities as the class prior. Hence the paradox is resolved.
• However, the above never happens when the components are Gaussians with distinct means. A more comprehensive result is that the Gaussian mixture model, by which we mean the set of all Gaussian mixture distributions with any (finite) number of components, is universal, i.e., given any continuous distribution (over 𝒳), there is a Gaussian mixture distribution (over 𝒳) with a finite number of components that closely approximates the given distribution. Please refer to McLachlan and Peel [2000] for details. In other words, the Gaussian mixture model is rich enough for any learning problem (over 𝒳).

1. We could arrive at the same distribution with a totally different re-parameterization.
2. It is very easy to see this if the component distributions are discrete and hence equivalent to Euclidean vectors. In this case, this is essentially the same as realizing that two different convex combinations of a set of vectors may give the same vector.
• A Gaussian mixture model hence need not be used only for a clustering problem, but can also, for example, model the class conditionals in a classification problem (which is supervised learning). A minimal EM sketch for fitting such a mixture follows this bullet.
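Here is a minimal EM sketch for a two-component 1-D Gaussian mixture (my own illustration, assuming numpy/scipy; a fixed number of iterations and no convergence checks):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = pi * norm.pdf(x[:, None], mu, sigma)        # shape (m, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate mixing probabilities, means and standard deviations.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
print(pi, mu, sigma)
```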
3.1.6 Hidden Markov Models

• For a tutorial on speech recognition, which was the motivating application for HMMs in the lectures, please refer to Rabiner [1989].
• We defined an HMM as a family of distributions over 𝒳 × 𝒴 (𝒳 is a space of (finite) sequences of vectors³ and 𝒴 is a space of (finite) sequences of objects in a discrete set⁴) that satisfies the following conditional independence and homogeneity assumptions (a small numeric sketch of the resulting joint likelihood follows):

f_{X_1,\ldots,X_t,Y_1,\ldots,Y_t}(x_1,\ldots,x_t,y_1,\ldots,y_t)
  = f_{X_1,\ldots,X_t/Y_1,\ldots,Y_t}(x_1,\ldots,x_t/y_1,\ldots,y_t)\, f_{Y_1,\ldots,Y_t}(y_1,\ldots,y_t)
  = f_{X_1/Y_1}(x_1/y_1)\cdots f_{X_t/Y_t}(x_t/y_t)\, f_{Y_1}(y_1)\, f_{Y_2/Y_1}(y_2/y_1)\cdots f_{Y_t/Y_{t-1}}(y_t/y_{t-1})
    (by the conditional independence assumptions: the first set says that, given Y_t, X_t is conditionally independent of the other X_i's, while the second set is simply the Markovian assumption described in section 17.2)
  = f(x_1/y_1)\cdots f(x_t/y_t)\, f(y_1)\, f(y_2/y_1)\cdots f(y_t/y_{t-1})
    (by the homogeneity assumption, which says that the (conditional) distributions do not depend on the position in the sequence)
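Here is a small numeric sketch of this factorization (all parameter values are made up; Gaussian emissions are assumed for f_{X_t/Y_t}):

```python
import numpy as np

# Hypothetical HMM with 2 hidden states and Gaussian emissions.
pi0 = np.array([0.6, 0.4])                 # f_{Y_1}
A = np.array([[0.9, 0.1],                  # f_{Y_t/Y_{t-1}}: state transitions
              [0.2, 0.8]])
means, std = np.array([0.0, 5.0]), 1.0     # f_{X_t/Y_t}: one Gaussian per state

def gauss(x, mu):
    return np.exp(-0.5 * ((x - mu) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def joint_likelihood(x_seq, y_seq):
    """f(x_1..x_t, y_1..y_t) via the factorization above."""
    p = pi0[y_seq[0]] * gauss(x_seq[0], means[y_seq[0]])
    for t in range(1, len(y_seq)):
        p *= A[y_seq[t - 1], y_seq[t]] * gauss(x_seq[t], means[y_seq[t]])
    return p

print(joint_likelihood([0.1, 0.3, 4.8], [0, 0, 1]))
```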
10.2.2 provides the representation for HMMs. From this it should be clear how HMMs are to be generalized.
– Another is by representing the Markov chain part of the HMM (i.e., the distribution over 𝒴) through state-transition diagrams (see, for example, figure 17.1 in Murphy [2012]).
• We next commented that there are other conditional independences that follow from those made above in an HMM. One of them was eqn. (17.50) in the book. We noted that this implied conditional independence is crucial for the EM algorithm.
• In fact, the most popular way of using HMMs is with loose supervision, where for each speech signal the label is simply given as the word uttered. It turns out that one can tweak the unsupervised HMM a bit to utilize such weak supervision.
3.2.1.2 Gaussian Discriminant Analysis
• Binary Classification:
• Regression:
With each of these one can implement ERM as given by (2.1). It is easy to see that the logistic-loss ERM problem is exactly the same as the MLE problem with logistic regression. Also, the square-loss ERM problem is exactly the same as the MLE problem with linear regression. A sketch of the first equivalence follows.

5. Refer to Feldman et al. [2009] to know why the non-convexity of the zero-one loss makes ERM a computationally hard problem.
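Here is a sketch of this equivalence with plain gradient descent on made-up data: minimizing Σᵢ log(1 + exp(−yᵢ wᵀxᵢ)) is exactly maximizing the logistic-regression log-likelihood for labels yᵢ ∈ {−1, +1}:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.5, -2.0]) > 0, 1.0, -1.0)   # labels in {-1, +1}

w = np.zeros(2)
for _ in range(500):
    margins = y * (X @ w)
    # Gradient of sum_i log(1 + exp(-y_i w^T x_i)), which is also the
    # negative log-likelihood of logistic regression.
    grad = -(X * (y * (1 / (1 + np.exp(margins))))[:, None]).sum(axis=0)
    w -= 0.01 * grad
print(w)
```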
While the above provide examples of equivalence with MLE, it so happens that there are examples where there is equivalence with MAP in logistic and linear regression. For this one usually employs the following subset of the linear model: F_W ≡ {f | ∃ w ∈ ℝⁿ, ‖w‖ ≤ W, such that f(x) = w^⊤x ∀ x ∈ ℝⁿ}. Here W is the hyperparameter.
This model together with the logistic loss is equivalent to MAP-based logistic regression. Here is why: the ERM problem in this case is

(3.1)  \min_{w \in \mathbb{R}^n} \sum_{i=1}^{m} \log\left(1 + \exp(-y_i w^\top x_i)\right)
(3.2)  \text{s.t.}\quad \|w\| \leq W
From optimization theory (Lagrange duality), one can show that for every W there exists some C > 0 such that the above problem and the following have the same optimal solution set⁶:

(3.3)  \min_{w \in \mathbb{R}^n} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \log\left(1 + \exp(-y_i w^\top x_i)\right)

It is now easy to see that the above problem is exactly of the same form as MAP-based logistic regression with a zero-mean Gaussian prior over w.
Similarly (after using Lagrange duality), it is easy to see that F_W together with the square loss is equivalent to MAP-based linear regression with a zero-mean Gaussian prior over w, in other words, ridge regression; a closed-form sketch follows.
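For the square-loss case the regularized problem has a closed form (ridge regression); here is a sketch on made-up data, with the first-order condition worked out in the comments:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

C = 10.0
# Minimizer of (1/2)||w||^2 + C * sum_i (w^T x_i - y_i)^2: setting the
# gradient w + 2C X^T (Xw - y) to zero gives
#   w* = (X^T X + (1/(2C)) I)^(-1) X^T y.
w = np.linalg.solve(X.T @ X + (1 / (2 * C)) * np.eye(3), X.T @ y)
print(w)
```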
The predictive function that results from F_W together with the hinge loss is called the (soft-margin) Support Vector Machine (refer to section 3.2.2.1), and that from F_W with the ε-insensitive loss is called Support Vector Regression (refer to section 3.2.2.2).
convenient way of generalizing these non-linear models, which is discussed in the subsequent section:
• Any w can be written as \sum_{i=1}^{m} \alpha_i \phi(x_i) plus some vector in the orthogonal complement of the space spanned by φ(x₁), …, φ(x_m). Denote this component in the orthogonal complement by w_⊥.

7. For the purposes of this course it is enough to understand that a Hilbert space and an inner product are simply generalizations of the Euclidean space and the dot product. Interested students may refer to http://www.cse.iitb.ac.in/saketh/teaching/cs723Scans3.pdf for further details.
which is the same as (∵ Pythagoras' theorem holds in ℋ):

(3.6)  \min_{\alpha \in \mathbb{R}^m} \frac{1}{2}\left\| \sum_{i=1}^{m} \alpha_i \phi(x_i) \right\|^2 + C \sum_{i=1}^{m} l\left( \left\langle \sum_{j=1}^{m} \alpha_j \phi(x_j), \phi(x_i) \right\rangle, y_i \right),
The above result is known as the representer theorem. In addition to the result, it is clear from the proof that ERM, which is originally an optimization problem in ℋ (which could potentially be infinite-dimensional), is essentially a problem in ℝ^m. Secondly, it is clear from the proof that neither solving ERM (the training phase) nor computing the prediction function ⟨w, φ(x)⟩ (the inference phase) explicitly needs the hyperparameter φ. What is needed is only the inner product between the φ(x_i)'s and φ(x).
It so happens that mathematicians (in the 1950s) have well understood which functions may represent inner products in some Hilbert space. Here is the result: suppose k : 𝒳 × 𝒳 → ℝ is a function such that the Gram matrix

G = \begin{pmatrix} k(x_1, x_1) & \cdots & k(x_1, x_m) \\ k(x_2, x_1) & \cdots & k(x_2, x_m) \\ \vdots & \ddots & \vdots \\ k(x_m, x_1) & \cdots & k(x_m, x_m) \end{pmatrix}

is a symmetric positive semi-definite matrix for any {x₁, x₂, …, x_m} ⊂ 𝒳 and for any m ∈ ℕ; then there exists a Hilbert space ℋ_k and a map φ_k : 𝒳 → ℋ_k such that k(x₁, x₂) = ⟨φ_k(x₁), φ_k(x₂)⟩ ∀ x₁, x₂ ∈ 𝒳. Such functions k are called kernels.
Some examples of kernels are the linear kernel k(x₁, x₂) ≡ x₁^⊤x₂; the polynomial kernel k(x₁, x₂) ≡ (1 + x₁^⊤x₂)^d (here d is the hyperparameter that controls the degree of the polynomial); and the Gaussian kernel, or Radial Basis Function (RBF) kernel, k(x_1, x_2) ≡ \exp\left(-\frac{\|x_1 - x_2\|^2}{2\sigma^2}\right) (here σ > 0 is the hyperparameter). Please verify that the definition is satisfied for all these examples (a numerical sanity check is sketched below). Your book, Murphy [2012], in section 14.2, provides examples of kernels on non-Euclidean spaces too!
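Here is a numerical sanity check of the kernel condition for these three examples (an eigenvalue check on a random sample; an illustration, of course, not a proof):

```python
import numpy as np

def linear(a, b):       return a @ b
def poly(a, b, d=3):    return (1 + a @ b) ** d
def rbf(a, b, s=1.0):   return np.exp(-np.sum((a - b) ** 2) / (2 * s ** 2))

rng = np.random.default_rng(5)
xs = rng.normal(size=(8, 4))                   # a random {x_1, ..., x_m} in R^4

for k in (linear, poly, rbf):
    G = np.array([[k(a, b) for b in xs] for a in xs])   # Gram matrix
    # Symmetric PSD <=> all eigenvalues >= 0 (up to numerical noise).
    print(k.__name__, np.linalg.eigvalsh(G).min() >= -1e-9)
```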
This discussion makes it clear that, instead of using the model F_{W,φ_k}, we may now use the equivalent F_{W,k}, and the ERM problem is of size m. The latter model is more convenient for the following reasons:
• In many applications, it is easier for domain experts to say how similar two inputs are than to provide a Euclidean representation. For example, if the inputs are graphs, it is easier to say how similar two graphs are than to give a Euclidean vector representation of a graph. Such similarity functions can be directly used as kernels (provided the mathematical conditions are satisfied), whereas the former model explicitly requires φ, i.e., a representation of the input.
• If the dimensionality of ℋ is high (or even infinite), then F_{W,k} is an efficient representation. Moreover, the ERM in the form (3.7) is of manageable size.
• One can show that the (infinite) hierarchy F_{10⁻¹⁰,k} ⊂ F_{10⁻⁹,k} ⊂ … ⊂ F_{1,k} ⊂ F_{10,k} ⊂ …, where k is the Gaussian kernel, is universal in the sense that the union of all these can approximate any (measurable) function over ℝⁿ. For practitioners, this simply means that they can start with the Gaussian kernel and an SVM, tune the hyperparameters using some model selection, and then the classifier will perform well (given enough training data); see the sketch after this list. This is important because no such guarantee, even with lots of training examples, can be given for, say, an SVM with a linear or polynomial kernel, or for most of the models discussed in the course.
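In practice this advice amounts to something like the following sketch; scikit-learn is my choice here, not the notes', and GridSearchCV simply performs validation-based hyperparameter selection over a pre-fixed grid:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# SVM with the Gaussian (RBF) kernel; tune the regularization level C
# (playing the role of W) and the kernel width via cross-validation.
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```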
Bibliography
Tony Jebara, Risi Kondor, and Andrew Howard. Probability product kernels. Jour-
nal of Machine Learning Research, 5:819–844, 2004. ISSN 1532-4435.
G. McLachlan and D. Peel. Finite Mixture Models. John Wiley and Sons, New York,
2000.
Bernhard Schölkopf and Alex Smola. Learning with Kernels. MIT press, Cambridge,
2002.
V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons, Inc., 1998.