A Literature Survey On Domain Adaptation of Statistical Classifiers
Jing Jiang
1 Introduction
Domain adaptation of statistical classifiers is the problem that arises when the data distribution in our test
domain is different from that in our training domain. The need for domain adaptation is prevalent in many
real-world classification problems. For example, spam filters can be trained on some public collection of
spam and ham emails. But when applied to an individual person's inbox, we may want to personalize the
spam filter, i.e. to adapt the spam filter to fit the person's own distribution of emails in order to achieve better
performance.
Although the domain adaptation problem is a fundamental problem in machine learning, it only started
gaining much attention very recently (Daume III and Marcu, 2006; Blitzer et al., 2006; Ben-David et al.,
2007; Daume III, 2007; Jiang and Zhai, 2007a; Satpal and Sarawagi, 2007; Jiang and Zhai, 2007b; Blitzer
et al., 2008). However, some special kinds of domain adaptation problems have been studied before un-
der different names including class imbalance (Japkowicz and Stephen, 2002), covariate shift (Shimodaira,
2000), and sample selection bias (Heckman, 1979; Zadrozny, 2004). There are also some closely-related
but not equivalent machine learning problems that have been studied extensively, including multi-task learn-
ing (Caruana, 1997) and semi-supervised learning (Zhu, 2005; Chapelle et al., 2006).
In this literature survey, we review some existing work in both the machine learning and the natural
language processing communities related to domain adaptation. The goal of this survey is twofold. First,
there have been a number of methods proposed to address domain adaptation, but it is not clear how these
methods are related to each other. This survey thus tries to organize the existing work and lay out an overall
picture of the domain adaptation problem with its possible solutions. Second, a systematic literature survey
naturally reveals the limitations of current work and points out promising directions that should be explored
in the future.
Because domain adaptation is a relatively new topic that is still constantly attracting attention, our survey
is necessarily incomplete. Nevertheless, we try to cover the major lines of work that we are aware of up to
the date this survey is written. This survey will also be updated periodically.
we assume that we always have access to a large amount of unlabeled data, and we use $D_{t,u} = \{x_i^{t,u}\}_{i=1}^{N_{t,u}}$
to denote this set of unlabeled instances. Sometimes, we may also have a small amount of labeled data
from the target domain, which is denoted as $D_{t,l} = \{(x_i^{t,l}, y_i^{t,l})\}_{i=1}^{N_{t,l}}$. In the case when $D_{t,l}$ is not available,
we refer to the problem as unsupervised domain adaptation, while when $D_{t,l}$ is available, we refer to the
problem as supervised domain adaptation.
2.2 Overview
Recently, there have been a number of studies related to domain adaptation. However, the motivating ideas
behind these methods are different. To connect the existing work and hence to better understand the problem,
in the following sections, we organize the existing work into several categories from our own viewpoint.
First, in Section 3, we consider a line of work that is based on instance weighting. In Section 4, we look at
some work that bears strong resemblance to semi-supervised learning. In Section 5, we review another line
of work that is based on changing the representation of X. Section 6 reviews work using Bayesian priors,
and Section 7 reviews work related to multi-task learning. In Section 8, ensemble methods for domain
adaptation are considered.
The categories are ordered in this way so that methods in Section 3, Section 4 and Section 5 are generally
applicable to unsupervised domain adaptation problems, while methods in Section 6 and Section 7 can only
handle supervised domain adaptation problems.
3 Instance Weighting
One general approach to addressing the domain adaptation problem is to assign instance-dependent weights
to the loss function when minimizing the expected loss over the distribution of data. To see why instance
weighting may help, let us first briefly review the empirical risk minimization framework for standard
supervised learning (Vapnik, 1999), and then informally derive an instance weighting solution to domain
adaptation. Let $\Theta$ be a model family from which we want to select an optimal model $\theta^*$ for our
classification task. Let $l(x, y, \theta)$ be a loss function. Strictly speaking, we want to minimize the following objective
function in order to obtain the optimal model $\theta^*$ for the distribution $P(X, Y)$:
$$\theta^* = \arg\min_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} P(x, y) \, l(x, y, \theta).$$
Because $P(X, Y)$ is unknown, we can use the empirical distribution $\tilde{P}(X, Y)$ to approximate $P(X, Y)$.
Let $\{(x_i, y_i)\}_{i=1}^{N}$ be a set of training instances randomly sampled from $P(X, Y)$. We then minimize the
following empirical risk in order to find a good model $\hat{\theta}$:
$$\begin{aligned}
\hat{\theta} &= \arg\min_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \tilde{P}(x, y) \, l(x, y, \theta) \\
&= \arg\min_{\theta \in \Theta} \sum_{i=1}^{N} l(x_i, y_i, \theta).
\end{aligned}$$
Now consider the setting of domain adaptation. Ideally, we want to find an optimal model $\theta_t^*$ for the target
domain that minimizes the expected loss over the target distribution:
$$\theta_t^* = \arg\min_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} P_t(x, y) \, l(x, y, \theta).$$
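However, our training instances $\{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ are randomly sampled from the source distribution
$P_s(X, Y)$. We can rewrite the equation above as follows: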
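$$\begin{aligned}
\theta_t^* &= \arg\min_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \frac{P_t(x, y)}{P_s(x, y)} P_s(x, y) \, l(x, y, \theta) \\
&\approx \arg\min_{\theta \in \Theta} \sum_{(x, y) \in \mathcal{X} \times \mathcal{Y}} \frac{P_t(x, y)}{P_s(x, y)} \tilde{P}_s(x, y) \, l(x, y, \theta) \\
&= \arg\min_{\theta \in \Theta} \sum_{i=1}^{N_s} \frac{P_t(x_i^s, y_i^s)}{P_s(x_i^s, y_i^s)} \, l(x_i^s, y_i^s, \theta). \qquad (1)
\end{aligned}$$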
As we can see, weighting the loss for the instance $(x_i^s, y_i^s)$ with $\frac{P_t(x_i^s, y_i^s)}{P_s(x_i^s, y_i^s)}$ provides a well-justified solution
to the domain adaptation problem.
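As a rough illustration of Equation (1), the following Python sketch selects a model by minimizing an importance-weighted empirical loss over source instances. The threshold classifier, the toy data, and the weight table `ratio` are hypothetical choices made only for this example; in practice the ratio $P_t(x, y) / P_s(x, y)$ is unknown and has to be estimated.

```python
def zero_one_loss(x, y, theta):
    """0-1 loss of a threshold classifier that predicts 1 when x >= theta."""
    prediction = 1 if x >= theta else 0
    return 0.0 if prediction == y else 1.0


def weighted_risk(theta, source_data, weight_fn, loss_fn=zero_one_loss):
    """Importance-weighted empirical risk over the source instances."""
    return sum(weight_fn(x, y) * loss_fn(x, y, theta) for x, y in source_data)


def select_model(candidates, source_data, weight_fn):
    """Return the candidate parameter with the smallest weighted risk."""
    return min(candidates,
               key=lambda theta: weighted_risk(theta, source_data, weight_fn))


# Toy labeled source sample (hypothetical data).
source_data = [(1, 0), (2, 0), (3, 1), (3, 1), (4, 0), (5, 1), (6, 1)]

# Hypothetical density ratios P_t(x, y) / P_s(x, y); unlisted pairs default to 1.
ratio = {(4, 0): 5.0}

unweighted = select_model([3, 5], source_data, lambda x, y: 1.0)
weighted = select_model([3, 5], source_data,
                        lambda x, y: ratio.get((x, y), 1.0))
```

In this toy setting the unweighted risk prefers the threshold 3, whereas up-weighting the instance (4, 0) moves the choice to the threshold 5.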
It is not possible to compute the exact value of $\frac{P_t(x, y)}{P_s(x, y)}$ for a pair $(x, y)$, especially because we do not have
enough labeled instances in the target domain. Section 3.1 reviews one line of work in which $P_t(X|Y) =
P_s(X|Y)$ is assumed, while Section 3.2 reviews another line of work in which $P_t(Y|X) = P_s(Y|X)$ is
assumed.
Now we can first estimate $P_s(y|x)$ from the source domain, and then derive $P_t(y|x)$ using $P_s(Y)$ and $P_t(Y)$.
For other classification algorithms that do not directly model $P(Y|X)$, such as naive Bayes classifiers
and support vector machines, if $P(Y|X)$ can be obtained through careful calibration, the same trick can
be applied. Chan and Ng (2006) applied this method to the domain adaptation problem in word sense
disambiguation (WSD) using naive Bayes classifiers.
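To make the re-scaling step concrete: if $P_t(X|Y) = P_s(X|Y)$ is assumed, Bayes' rule gives $P_t(y|x) \propto P_s(y|x) \, P_t(y) / P_s(y)$. The sketch below applies this correction to one posterior vector. The function name and the dictionary-based interface are assumptions made for illustration and are not taken from the cited work.

```python
def adapt_posterior(source_posterior, source_prior, target_prior):
    """Re-scale P_s(y|x) into P_t(y|x), assuming P_t(x|y) = P_s(x|y).

    All three arguments are dicts that map each class label to a probability.
    """
    unnormalized = {
        y: source_posterior[y] * target_prior[y] / source_prior[y]
        for y in source_posterior
    }
    total = sum(unnormalized.values())
    return {y: value / total for y, value in unnormalized.items()}


# An instance the source model finds ambiguous (hypothetical numbers).
adapted = adapt_posterior(
    source_posterior={"spam": 0.5, "ham": 0.5},
    source_prior={"spam": 0.5, "ham": 0.5},
    target_prior={"spam": 0.1, "ham": 0.9},
)
```

For an instance on which the source model is undecided, the adapted posterior simply follows the target class prior.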
In practice, one needs to know the class distribution in the target domain in order to apply the methods
described above. In some studies, it is assumed that this distribution is known a priori (Lin et al., 2002).
However, in reality, we may not have this information. Chan and Ng (2005) proposed to use the EM
algorithm to estimate the class distribution in the target domain.
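One common way to set up such an EM procedure is sketched below, under the assumption that calibrated source posteriors $P_s(y|x)$ are available for the unlabeled target instances. The E-step re-scales each posterior with the current prior estimate, and the M-step averages the re-scaled posteriors. This is a generic sketch with hypothetical names, and it is not claimed to match the cited method in every detail.

```python
def estimate_target_prior(source_posteriors, source_prior, n_iter=100):
    """Estimate the target class prior from source posteriors on target data.

    source_posteriors: list of dicts, one per unlabeled target instance,
                       each mapping a class label to P_s(y|x).
    source_prior:      dict mapping a class label to P_s(y).
    """
    labels = list(source_prior)
    prior = dict(source_prior)  # start from the source class distribution
    for _ in range(n_iter):
        totals = {y: 0.0 for y in labels}
        for posterior in source_posteriors:
            # E-step: posterior under the current target prior estimate.
            unnorm = {y: posterior[y] * prior[y] / source_prior[y] for y in labels}
            z = sum(unnorm.values())
            for y in labels:
                totals[y] += unnorm[y] / z
        # M-step: the new prior is the average responsibility of each class.
        prior = {y: totals[y] / len(source_posteriors) for y in labels}
    return prior


# Hypothetical target sample: eight instances look like "a", two like "b".
posteriors = [{"a": 0.9, "b": 0.1}] * 8 + [{"a": 0.1, "b": 0.9}] * 2
estimated = estimate_target_prior(posteriors, {"a": 0.5, "b": 0.5})
```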
and Muller, 2005). In some other work, it is proposed to transform this density ratio estimation into a
problem of predicting whether an instance is from the source domain or from the target domain (Zadrozny,
2004; Bickel and Scheffer, 2007). Huang et al. (2007) transformed the problem into a kernel mean matching
problem in a reproducing kernel Hilbert space. Bickel et al. (2007) proposed to learn this ratio together with
the classification model parameters.
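The reduction to source-versus-target prediction rests on the identity $\frac{P_t(x)}{P_s(x)} = \frac{P(\text{target}|x)}{P(\text{source}|x)} \cdot \frac{N_s}{N_t}$, where the domain probabilities refer to the pooled sample of $N_s$ source and $N_t$ target instances. The sketch below uses a count-based stand-in for the domain classifier on discrete observations. Both function names are hypothetical, and a real system would use a trained probabilistic classifier instead.

```python
from collections import Counter


def ratio_from_domain_probability(p_target, n_source, n_target):
    """Convert P(domain = target | x) into an estimate of P_t(x) / P_s(x)."""
    return (p_target / (1.0 - p_target)) * (n_source / n_target)


def count_based_domain_probability(x, source_xs, target_xs):
    """Toy domain classifier for discrete x: share of pooled copies of x
    that come from the target sample (x must occur in the source sample,
    otherwise the ratio above is undefined)."""
    s = Counter(source_xs)[x]
    t = Counter(target_xs)[x]
    return t / (s + t)


source_xs = ["a", "a", "a", "b"]
target_xs = ["a", "b", "b", "b"]

p_b = count_based_domain_probability("b", source_xs, target_xs)
weight_b = ratio_from_domain_probability(p_b, len(source_xs), len(target_xs))
```

Here `weight_b` recovers the empirical ratio 0.75 / 0.25 for the observation "b".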
4 Semi-Supervised Learning
If we ignore the domain difference, and treat the labeled source domain instances as labeled data and the
unlabeled target domain instances as unlabeled data, then we are facing a semi-supervised learning (SSL)
problem. We can then apply any SSL algorithm (Zhu, 2005; Chapelle et al., 2006) to the domain adaptation
problem. The subtle differences between SSL and domain adaptation are that (1) the amount of labeled data in
SSL is small but large in domain adaptation, and (2) the labeled data may be noisy in domain adaptation if
we do not assume $P_s(Y|X = x) = P_t(Y|X = x)$ for all $x$, whereas in SSL the labeled data is all reliable.
There has been some work extending semi-supervised learning methods for domain adaptation. Dai
et al. (2007a) proposed an EM-based algorithm for domain adaptation, which can be shown to be equivalent
to a semi-supervised EM algorithm (Nigam et al., 2000) except that Dai et al. proposed to estimate the
trade-off parameter between the labeled and the unlabeled data using the KL-divergence between the two
domains. Jiang and Zhai (2007a) proposed to not only include weighted source domain instances but also
weighted unlabeled target domain instances in training, which essentially combines instance weighting with
bootstrapping. Xing et al. (2007) proposed a bridged refinement method for domain adaptation using label
propagation on a nearest neighbor graph, which resembles graph-based semi-supervised learning
algorithms (Zhu, 2005; Chapelle et al., 2006).
5 Change of Representation
As has been pointed out, the cause of the domain adaptation problem is the difference between $P_t(X, Y)$
and $P_s(X, Y)$. Note that while the representation of $Y$ is fixed, the representation of $X$ can change if we
use different features. Such a change of representation of $X$ can affect both the marginal distribution $P(X)$
and the conditional distribution $P(Y|X)$. One can assume that under some change of representation of $X$,
$P_t(X, Y)$ and $P_s(X, Y)$ will become the same.
Formally, let $g : \mathcal{X} \rightarrow \mathcal{Z}$ denote a transformation function that transforms an observation $x$ represented
in the original form into another form $z = g(x) \in \mathcal{Z}$. Define variable $Z$ and an induced distribution of $Z$
that satisfies $P(z) = \sum_{x \in \mathcal{X}, g(x) = z} P(x)$. The joint distribution of $Z$ and $Y$ is then
$$P(z, y) = \sum_{x \in \mathcal{X}, g(x) = z} P(x, y).$$
If we can find a transformation function $g$ so that under this transformation, we have $P_t(Z, Y) = P_s(Z, Y)$,
then we no longer have the domain adaptation problem because the two domains have the same joint
distribution of the observation and the class label. The optimal model $P(Y|Z, \theta)$ we learn to approximate
$P_s(Y|Z)$ is still optimal for $P_t(Y|Z)$.
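The induced joint distribution can be computed directly when the observation space is small and discrete. The following sketch sums $P(x, y)$ over all $x$ that map to the same $z$. The two toy domains and the mapping `g` are hypothetical and are chosen so that the induced distributions coincide.

```python
from collections import defaultdict


def induced_joint(joint_xy, g):
    """Return P(z, y) = sum of P(x, y) over all x with g(x) = z.

    joint_xy maps (x, y) pairs to probabilities; g maps x to z.
    """
    joint_zy = defaultdict(float)
    for (x, y), p in joint_xy.items():
        joint_zy[(g(x), y)] += p
    return dict(joint_zy)


# Hypothetical domains that use different surface features for the same class.
p_source = {("cheap", 1): 0.5, ("meeting", 0): 0.5}
p_target = {("inexpensive", 1): 0.5, ("meeting", 0): 0.5}


def g(x):
    """Hypothetical transformation that merges two synonymous features."""
    return "low_price" if x in ("cheap", "inexpensive") else x


source_z = induced_joint(p_source, g)
target_z = induced_joint(p_target, g)
```

The original joint distributions differ, but after applying `g` both domains induce the same distribution over $(Z, Y)$.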
Note that with a change of representation, the entropy of $Y$ conditional on $Z$ is likely to increase from
the entropy of $Y$ conditional on $X$, because $Z$ is usually a simpler representation of the observation than $X$,
and thus encodes less information. In other words, the Bayes error rate usually increases under a change
of representation. Therefore, the criteria for good transformation functions include not only the distance
between the induced distributions $P_t(Z, Y)$ and $P_s(Z, Y)$ but also the amount by which the Bayes
error rate increases.
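A small numerical example may help to see the information loss. The sketch below computes the conditional entropy of the label before and after a hypothetical transformation that collapses two observations into a single value. The distributions are toy values chosen for illustration.

```python
import math
from collections import defaultdict


def conditional_entropy(joint):
    """Entropy of the label given the observation, in bits.

    joint maps (observation, label) pairs to probabilities.
    """
    marginal = defaultdict(float)
    for (obs, _), p in joint.items():
        marginal[obs] += p
    return -sum(
        p * math.log2(p / marginal[obs])
        for (obs, _), p in joint.items()
        if p > 0
    )


# In the original representation, x determines y exactly.
joint_xy = {("x1", 0): 0.5, ("x2", 1): 0.5}
# A hypothetical g maps both x1 and x2 to the same z, so z says nothing about y.
joint_zy = {("z", 0): 0.5, ("z", 1): 0.5}

h_y_given_x = conditional_entropy(joint_xy)
h_y_given_z = conditional_entropy(joint_zy)
```

In this extreme case the conditional entropy rises from 0 bits to 1 bit, so a transformation that aligns the two domains can still be useless for classification.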
Ben-David et al. (2007) first formally analyzed the effect of representation change for domain adaptation.
They proved a generalization bound for domain adaptation that is dependent on the distance between the
induced $P_s(Z, Y)$ and $P_t(Z, Y)$.
A special and simple kind of transformation is feature subset selection. Satpal and Sarawagi (2007)
proposed a feature subset selection method for domain adaptation, where the criterion for selecting features
is to minimize an approximated distance function between the distributions in the two domains. Note that
to measure the distance between $P_s(Z, Y)$ and $P_t(Z, Y)$, we still need class labels in the target domain. To
solve this problem, Satpal and Sarawagi (2007) use predicted labels for the target domain instances.
Blitzer et al. (2006) proposed a structural correspondence learning (SCL) algorithm that makes use of
the unlabeled data from the target domain to find a low-rank representation that is suitable for domain
adaptation. It is empirically shown in (Ben-David et al., 2007) that the low-rank representation found by
SCL indeed decreases the distance between the distributions in the two domains. However, SCL does
not directly try to find a representation $Z$ that minimizes the distance between $P_s(Z, Y)$ and $P_t(Z, Y)$.
Instead, SCL tries to find a representation that works well for many related classification tasks for which
labels are available in both the source and the target domains. The assumption is that if a representation
$Z$ gives good performance for the many related classification tasks in both domains, then $Z$ is also a good
representation for the main classification task of interest in both domains. The core algorithm in
SCL is from (Ando and Zhang, 2005).
6 Bayesian Priors
Most of the work reviewed in the previous sections does not require labeled data from the target domain.
In this section and the next section, we review two kinds of methods that work for supervised domain
adaptation, i.e. when a small amount of labeled data from the target domain is available.
When we use the maximum a posteriori (MAP) estimation approach for supervised learning, we can
encode some prior knowledge about the classification model into a Bayesian prior distribution $P(\theta)$, where
$\theta$ is the model parameter. More specifically, instead of maximizing
$$\prod_{i=1}^{N} P(y_i | x_i; \theta),$$
we maximize
$$P(\theta) \prod_{i=1}^{N} P(y_i | x_i; \theta).$$
In domain adaptation, the prior knowledge can be drawn from the source domain. More specifically,
we first construct a Bayesian prior $P(\theta | D_s)$, which is dependent on the labeled instances from the source
domain. We then maximize the following objective function:
$$P(\theta | D_s) P(D_{t,l} | \theta) = P(\theta | D_s) \prod_{i=1}^{N_{t,l}} P(y_i^t | x_i^t; \theta).$$
Li and Bilmes (2007) proposed a general Bayesian divergence prior framework for domain adaptation.
They then showed how this general prior can be instantiated for generative classifiers and discriminative
classifiers. Chelba and Acero (2004) applied this kind of Bayesian prior to the task of adapting a maximum
entropy capitalizer across domains.
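As a minimal sketch of this idea, suppose the prior $P(\theta | D_s)$ is taken to be a Gaussian centered at parameters learned on the source domain, and the classifier is a logistic regression model. Both choices are assumptions made here for illustration, as are the function names and the plain gradient ascent optimizer. The code below is not the specific procedure of any cited paper.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def map_adapt(theta_source, target_data, prior_var=1.0, lr=0.05, n_steps=500):
    """Gradient ascent on log N(theta; theta_source, prior_var * I)
    plus the conditional log-likelihood of the labeled target data.

    target_data is a list of (feature_list, label) pairs with label in {0, 1}.
    """
    theta = list(theta_source)
    for _ in range(n_steps):
        # The log-prior gradient pulls theta back toward the source parameters.
        grad = [-(t - s) / prior_var for t, s in zip(theta, theta_source)]
        # The log-likelihood gradient pulls theta toward the target labels.
        for x, y in target_data:
            p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
            for j, xi in enumerate(x):
                grad[j] += (y - p) * xi
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical target sample: five positive instances with one feature.
target_data = [([1.0], 1)] * 5
tight = map_adapt([0.0], target_data, prior_var=0.5)[0]
loose = map_adapt([0.0], target_data, prior_var=10.0)[0]
```

A smaller `prior_var` keeps the adapted parameters closer to the source model, while a larger value lets the small target sample dominate.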
7 Multi-Task Learning
Multi-task learning, sometimes known as transfer learning, is a machine learning topic highly related to
domain adaptation. The original definition of multi-task learning considers a different setting than domain
adaptation. In multi-task learning, there is a single distribution of the observation, i.e. a single $P(X)$.
There are, however, a number of different output variables $Y_1, Y_2, \ldots, Y_M$, corresponding to $M$ different
tasks. Therefore, there are $M$ different joint distributions $\{P(X, Y_k)\}_{k=1}^{M}$. Note that the class label sets
are different for these $M$ different tasks. We assume that these different tasks are related. When learning
$M$ conditional models $\{P(Y_k | X, \theta_k)\}_{k=1}^{M}$ for the $M$ tasks, we impose a common component shared by
$\{\theta_k\}_{k=1}^{M}$. There have been a number of studies on multi-task learning (Caruana, 1997; Ben-David and
8 Ensemble Methods
In previous sections, only learning algorithms that return single classification models are considered.
Ensemble methods are a type of learning algorithm that combines a set of models to construct a complex
classifier for a classification problem. Ensemble methods include bagging, boosting, mixture of experts, etc.
There has been some work using ensemble methods for domain adaptation.
One line of work uses mixture models. It can be assumed that there are a number of different component
distributions $\{P^{(k)}(X, Y)\}_{k=1}^{K}$, each of which is modeled by a simple model. The distribution of $X$ and $Y$ in
either the source domain or the target domain is then a mixture of these component distributions. The source
and the target domains are related because they share some of these component distributions. However, the
mixture coefficients are different in the two domains, making the overall distributions different.
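The mixture assumption can be illustrated with a toy discrete example in which both domains share the same two component distributions but weight them differently. All names and numbers below are hypothetical.

```python
def mixture_probability(components, coefficients, x, y):
    """P(x, y) under a mixture: the sum over k of c_k * P^(k)(x, y).

    components is a list of dicts mapping (x, y) to a probability;
    coefficients is a list of mixture weights that sum to one.
    """
    return sum(c * comp.get((x, y), 0.0)
               for c, comp in zip(coefficients, components))


# Two shared component distributions (hypothetical).
components = [
    {("offer", 1): 0.9, ("offer", 0): 0.1},
    {("meeting", 0): 0.8, ("meeting", 1): 0.2},
]

# The same components, mixed differently in each domain.
source_coefficients = [0.8, 0.2]
target_coefficients = [0.2, 0.8]

p_source = mixture_probability(components, source_coefficients, "offer", 1)
p_target = mixture_probability(components, target_coefficients, "offer", 1)
```

The pair ("offer", 1) is far more probable in the source domain than in the target domain, although both domains are built from identical components.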
Daume III and Marcu (2006) proposed a mixture model for domain adaptation, in which three mixture
components are assumed, one shared by both the source and the target domains, one specific to the source
domain, and one specific to the target domain. Labeled data from both the source and the target domains is
needed to learn this three-component mixture model using the conditional expectation maximization (CEM)
algorithm. Storkey and Sugiyama (2007) considered a more general mixture model in which the source
and the target domains share more than one mixture component. However, they did not assume any target
domain specific component, and as a result, no labeled data from the target domain is needed. The mixture
model is learned using the expectation maximization (EM) algorithm.
Boosting is a general ensemble method that combines multiple weak learners to form a complex and
effective classifier. Dai et al. (2007b) proposed to modify the widely-used AdaBoost algorithm to address
the domain adaptation problem. With some labeled data from the target domain, the idea here is to put more
weight on mistakenly classified target domain instances but less weight on mistakenly classified source
domain instances in each iteration, because the goal is to improve the performance of the final classifier on
the target domain only.
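The asymmetric re-weighting idea can be sketched as a single update step. The code below only illustrates the direction of the weight changes described above. The multiplicative factors `beta_source` and `beta_target` are hypothetical parameters, and the function is not the exact update rule of the cited algorithm.

```python
def reweight(weights, is_target, mistakes, beta_source=0.5, beta_target=0.5):
    """One boosting-style update with domain-dependent treatment of mistakes.

    weights:   current instance weights.
    is_target: True for target domain instances, False for source instances.
    mistakes:  True where the current weak learner misclassified the instance.
    Both beta values are assumed to lie strictly between 0 and 1.
    """
    updated = []
    for w, target, wrong in zip(weights, is_target, mistakes):
        if wrong and target:
            w = w / beta_target   # misclassified target instance: more weight
        elif wrong:
            w = w * beta_source   # misclassified source instance: less weight
        updated.append(w)
    total = sum(updated)
    return [w / total for w in updated]


new_weights = reweight(
    weights=[0.25, 0.25, 0.25, 0.25],
    is_target=[True, True, False, False],
    mistakes=[True, False, True, False],
)
```

After the update, the misclassified target instance carries the largest weight and the misclassified source instance the smallest.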
