

Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks
Lin Wang, Student Member, IEEE, and Kuk-Jin Yoon, Member, IEEE

Abstract—Deep neural models, in recent years, have been successful in almost every field, even solving the most complex problem
statements. However, these models are huge in size with millions (and even billions) of parameters, demanding heavy computation
power and failing to be deployed on edge devices. Besides, the performance boost is highly dependent on redundant labeled data. To achieve faster speeds and to handle the problems caused by the lack of labeled data, knowledge distillation (KD) has been proposed to
transfer information learned from one model to another. KD is often characterized by the so-called ‘Student-Teacher’ (S-T) learning
framework and has been broadly applied in model compression and knowledge transfer. This paper is about KD and S-T learning,
which are being actively studied in recent years. First, we aim to provide explanations of what KD is and how/why it works. Then, we
provide a comprehensive survey on the recent progress of KD methods together with S-T frameworks typically used for vision tasks. In
general, we investigate some fundamental questions that have been driving this research area and thoroughly generalize the research
progress and technical details. Additionally, we systematically analyze the research status of KD in vision applications. Finally, we
discuss the potentials and open challenges of existing methods and prospect the future directions of KD and S-T learning.

Index Terms—Knowledge distillation (KD), Student-Teacher learning (S-T), Deep neural networks (DNN), Visual intelligence.

1 INTRODUCTION

THE success of deep neural networks (DNNs) generally depends on the elaborate design of DNN architectures. In large-scale machine learning, especially for tasks such as image and speech recognition, most DNN-based models are over-parameterized to extract the most salient features and to ensure generalization. Such cumbersome models are usually very deep and wide, require a considerable amount of computation for training, and are difficult to operate in real time. Thus, to achieve faster speeds, many researchers have been trying to utilize the cumbersome models that are trained to obtain lightweight DNN models, which can be deployed on edge devices. That is, when the cumbersome model has been trained, it can be used to learn a small model that is more suitable for real-time applications or deployment [1], as depicted in Fig. 1(a).

On the other hand, the performance of DNNs is also heavily dependent on very large and high-quality labeled training datasets. For this reason, many endeavours have been taken to reduce the amount of labeled training data without hurting the performance of DNNs too much. A popular approach for handling such a lack of data is to transfer knowledge from one source task to facilitate the learning of the target task. One typical example is semi-supervised learning, in which a model is trained with only a small set of labeled data and a large set of unlabeled data. Since the supervised cost is undefined for the unlabeled examples, it is crucial to apply consistency costs or regularization methods to match the predictions from both labeled and unlabeled data. In this case, knowledge is transferred within the model that assumes a dual role as teacher and student [2]. For the unlabeled data, the student learns as before; however, the teacher generates targets, which are then used by the student for learning. The common goal of such a learning metric is to form a better teacher model from the student without additional training, as shown in Fig. 1(b). Another typical example is self-supervised learning, where the model is trained with artificial labels constructed by input transformations (e.g., rotation, flipping, color change, cropping). In such a situation, the knowledge from the input transformations is transferred to supervise the model itself to improve its performance, as illustrated in Fig. 1(c).

This paper is about knowledge distillation (KD) and student-teacher (S-T) learning, a topic that has been actively studied in recent years. Generally speaking, KD is widely regarded as a primary mechanism that enables humans to quickly learn new complex concepts when given only small training sets with the same or different categories [3]. In deep learning, KD is an effective technique that has been widely used to transfer information from one network to another network whilst training constructively. KD was first defined by [4] and generalized by Hinton et al. [1]. KD has been broadly applied to two distinct fields: model compression (refer to Fig. 1(a)) and knowledge transfer (refer to Fig. 1(b) and (c)). For model compression, a smaller student model is trained to mimic a pretrained larger model or an ensemble of models. Although various forms of knowledge are defined based on the purpose, one common characteristic of KD is symbolized by its S-T framework, where the model providing knowledge is called the teacher and the model learning the knowledge is called the student.

• L. Wang and K.-J. Yoon are with the Visual Intelligence Lab., Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, 291 Daehak-ro, Guseong-dong, Yuseong-gu, Daejeon 34141, Republic of Korea. E-mail: {wanglin, kjyoon}@kaist.ac.kr.
Manuscript received April 19, 2020; revised August 26, 2020. (Corresponding author: Kuk-Jin Yoon)

Fig. 1. Illustrations of KD methods with S-T frameworks. (a) for model compression and for knowledge transfer, e.g., (b) semi-supervised learning
and (c) self-supervised learning.

In this work, we focus on analyzing and categorizing existing KD methods accompanied by various types of S-T structures for model compression and knowledge transfer. We review and survey this rapidly developing area with particular emphasis on the recent progress. Although KD has been applied to various fields, such as visual intelligence, speech recognition, natural language processing (NLP), etc., this paper mainly focuses on the KD methods in the vision field, as most demonstrations have been done on computer vision tasks. KD methods used in NLP and speech recognition can be conveniently explained using the KD prototypes in vision. As the most studied KD methods are for model compression, we systematically discuss their technical details, challenges, and potentials. Meanwhile, we also concentrate on the KD methods for knowledge transfer in semi-supervised learning, self-supervised learning, etc., and we highlight the techniques that take S-T learning as a way of learning metric.

We explore some fundamental questions that have been driving this research area. Specifically, what is the theoretical principle for KD and S-T learning? What makes one distillation method better than others? Is using multiple teachers better than using one teacher? Do larger models always make better teachers and teach more robust students? Can a student learn knowledge only if a teacher model exists? Is the student able to learn by itself? Is offline KD always better than online learning?

With these questions being discussed, we incorporate the potentials of existing KD methods and prospect the future directions of the KD methods together with S-T frameworks. We especially stress the importance of recently developed technologies, such as neural architecture search (NAS), graph neural networks (GNNs), and gating mechanisms for empowering KD. Furthermore, we also emphasize the potential of KD methods for tackling challenging problems in particular vision fields such as 360° vision and event-based vision.

The main contributions of this paper are three-fold:

• We give a comprehensive overview of KD and S-T learning methods, including problem definition, theoretical analysis, a family of KD methods with deep learning, and vision applications.
• We provide a systematic overview and analysis of recent advances of KD methods and S-T frameworks hierarchically and structurally, and offer insights and summaries for the potentials and challenges of each category.
• We discuss the problems and open issues and identify new trends and future directions to provide insightful guidance in this research area.

The organization of this paper is as follows. First, we explain why we need to care about KD and S-T learning in Sec. 2. Then, we provide a theoretical analysis of KD in Sec. 3. Section 3 is followed by Sec. 4 to Sec. 8, where we categorize the existing methods and analyze their challenges and potential. Fig. 2 shows the taxonomy of KD with S-T learning to be covered in this survey in a hierarchically-structured way. In Sec. 9, based on the taxonomy, we will discuss the answers to the questions raised in Sec. 1. Section 10 will present the future potentials of KD and S-T learning, followed by a conclusion in Sec. 11.

2 WHAT IS KD AND WHY CONCERN IT?

What's KD? Knowledge distillation (KD) was first proposed by [4] and expanded by [1]. KD refers to the method that helps the training process of a smaller student network under the supervision of a larger teacher network. Unlike other compression methods, KD can downsize a network regardless of the structural difference between the teacher and the student network. In [1], the knowledge is transferred from the teacher model to the student by minimizing the difference between the logits (the inputs to the final softmax) produced by the teacher model and those produced by the student model.

However, in many situations, the output of the softmax function on the teacher's logits has the correct class at a very high probability, with all other class probabilities very close to zero. In such a circumstance, it does not provide much information beyond the ground truth labels already provided in the dataset. To tackle this problem, [1, 5] introduced the concept of 'softmax temperature', which makes the target 'soft'. Given the logits z from a network, the class probability p_i of an image is calculated as [5]:

p_i = exp(z_i / ρ) / Σ_j exp(z_j / ρ)   (1)

where ρ is the temperature parameter. When ρ = 1, we get the standard softmax function. As ρ increases, the probability distribution produced by the softmax function becomes softer, providing more information as to which classes the teacher found more similar to the predicted class. The information provided by the teacher model is called dark knowledge [1]. It is this dark knowledge that affects the overall flow of information to be distilled.
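To make Eqn. (1) concrete, the snippet below is a minimal PyTorch sketch (not from the surveyed works) of how temperature scaling softens a teacher's logits; the tensor values and the temperature setting are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soften(logits: torch.Tensor, rho: float = 4.0) -> torch.Tensor:
    """Temperature-scaled softmax of Eqn. (1): p_i = exp(z_i/rho) / sum_j exp(z_j/rho)."""
    return F.softmax(logits / rho, dim=-1)

# Hypothetical teacher logits for a 5-class problem.
teacher_logits = torch.tensor([[9.0, 4.0, 1.0, 0.5, -2.0]])

print(soften(teacher_logits, rho=1.0))  # nearly one-hot: little dark knowledge
print(soften(teacher_logits, rho=4.0))  # softer: relative class similarities become visible
```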

Knowledge distillation with S-T learning
• Distillation based on the number of teachers (Section 4)
  – Distillation from one teacher
    • Knowledge from logits: softened labels and regularization; learning from noisy labels; imposing strictness in distillation; ensemble of distribution
    • Knowledge from intermediate layers: transformation of hints; transformation of guided features; position of selected knowledge; distance metric for measuring KD
  – Distillation from multiple teachers: distillation from ensemble of logits; distillation from ensemble of features; distillation by unifying data sources; from one teacher to multiple teachers; customizing student from teachers; ensemble of peers via mutual learning
• Distillation based on data format (Section 5)
  – Data-free distillation: distillation based on metadata; distillation using class-similarities; distillation via layer-wise estimation; distillation based on generator
  – Distillation with a few samples: distillation via pseudo examples
  – Cross-modal distillation: supervised cross-modal KD; unsupervised cross-modal KD (learning from one teacher; learning from multiple teachers)
• Online and teacher-free distillation (Section 6)
  – Online distillation: individual student peers; sharing blocks among students; ensemble of student peers
  – Teacher-free distillation: born-again distillation; distillation via 'deep' supervision; distillation via data augmentation; distillation via architecture transform
• KD based on labels (Section 7)
  – Label-required distillation: KD with original labels; KD with pseudo labels
  – Label-free distillation: KD with dark knowledge; KD with meta knowledge
• KD with novel learning metrics (Section 8): distillation via adversarial learning; distillation via graph representation; few-shot learning; incremental learning; reinforcement learning; semi-/self-supervised learning
• Applications of KD (Section 9): semantic segmentation; visual detections; domain adaptation; depth and flow estimation; image translation; video understanding

Fig. 2. Hierarchically-structured taxonomy of knowledge distillation with S-T learning.

When computing the distillation loss, the same ρ used in the teacher is used to compute the logits of the student.

For the images with ground truth, [1] stated that it is beneficial to train the student model together with the ground truth labels in addition to the teacher's soft labels. Therefore, we also calculate the 'student loss' between the student's predicted class probabilities and the ground truth labels. The overall loss function, composed of the student loss and the distillation loss, is calculated as:

L_KD = α · H(y, σ(z_s)) + β · H(σ(z_t; ρ), σ(z_s; ρ))
     = α · H(y, σ(z_s)) + β · [KL(σ(z_t; ρ), σ(z_s; ρ)) + H(σ(z_t; ρ))]   (2)

where H is the loss function, y is the ground truth label, σ is the softmax function parameterized by the temperature ρ (ρ ≠ 1 for the distillation loss), and α and β are coefficients. z_s and z_t are the logits of the student and the teacher, respectively.

Why concern KD? KD has become a field in itself in the machine learning community, with broad applications to computer vision, speech recognition, NLP, etc. From 2014 to now, many research papers [6, 7] have been presented in the major conferences, such as CVPR, ICCV, ECCV, NIPS, ICML, ICLR, etc., and the power of KD has been extended to many learning processes (e.g., few-shot learning) beyond model compression. The trend in recent years is that KD with S-T frameworks has become a crucial tool for knowledge transfer, along with model compression. The rapid increase in scientific activity on KD has been accompanied and nourished by a remarkable string of empirical successes both in academia and industry. The particular highlights on some representative applications are given in Sec. 9, and in the following Sec. 3, we provide a systematic theoretical analysis.

3 A THEORETICAL ANALYSIS OF KD

Many KD methods have been proposed with various intuitions. However, there is no commonly agreed theory as to how knowledge is transferred, which makes it difficult to effectively evaluate the empirical results and less actionable to design new methods in a more disciplined way. Recently, Ahn et al. [8], Hegde et al. [9] and Tian et al. [10] formulate KD as a maximization of mutual information between the representations of the teacher and the student networks. Note that the representations here can refer to either the logits information or the intermediate features. From the perspective of representation learning and information theory, the mutual information reflects the joint distribution or mutual dependence between the teacher and the student and quantifies how much information is transferred. We do agree that maximizing the mutual information between the teacher and the student is crucial for learning constructive knowledge from the teacher. We now give a more detailed explanation regarding this.

Based on Bayes' rule, the mutual information between two paired representations can be defined as:

I(T; S) = H(R(T)) − H(R(T) | R(S))
        = −E_T[log p(R(T))] + E_{T,S}[log p(R(T) | R(S))]   (3)
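As a concrete reference for Eqn. (2), the following is a small, hedged PyTorch sketch of the standard soft-label distillation objective; the loss weights and the temperature are illustrative choices rather than values prescribed by the surveyed works.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, alpha=0.5, beta=0.5, rho=4.0):
    """Student loss + distillation loss, as in Eqn. (2)."""
    # Hard-label (student) loss with the standard softmax (rho = 1).
    hard = F.cross_entropy(student_logits, labels)
    # Soft-label loss: KL between temperature-softened teacher and student distributions.
    # The rho**2 factor is the usual gradient rescaling applied with softened targets.
    soft = F.kl_div(
        F.log_softmax(student_logits / rho, dim=-1),
        F.softmax(teacher_logits / rho, dim=-1),
        reduction="batchmean",
    ) * (rho ** 2)
    return alpha * hard + beta * soft

# Usage with random stand-in tensors (batch of 8, 10 classes).
s = torch.randn(8, 10, requires_grad=True)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
kd_loss(s, t, y).backward()
```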

where R(T) and R(S) are the representations from the teacher and the student, and H(·) is the entropy function. Intuitively, the mutual information increases the degree of certainty in the information provided in R(T) when R(S) is known. Therefore, maximizing E_{T,S}[log p(R(T) | R(S))] w.r.t. the parameters of the student network S increases the lower bound on the mutual information. However, the true distribution p(R(T) | R(S)) is unknown; instead, it is desirable to estimate p(R(T) | R(S)) by fitting a variational distribution q(R(T) | R(S)) to approximate the true distribution p(R(T) | R(S)). Then Eqn. 3 can be rewritten as:

I(T; S) = H(R(T)) + E_{T,S}[log p(R(T) | R(S))]
        = H(R(T)) + E_{T,S}[log q(R(T) | R(S))] + E_S[KL(p(R(T) | R(S)) || q(R(T) | R(S)))]   (4)

Assuming there is a sufficiently expressive way of modeling q, Eqn. 4 can be updated as:

I(T; S) ≥ H(R(T)) + E_{T,S}[log q(R(T) | R(S))]   (5)

Note that the last term in Eqn. 4 is non-negative since the KL(·) function is non-negative, and H(R(T)) is constant w.r.t. the parameters to be optimized. By modeling q, it is easy to quantify the amount of knowledge being learned by the student. In general, q can be modeled by a Gaussian distribution, Monte Carlo approximation, or noise contrastive estimation (NCE). We do believe that theoretically explaining how KD works is connected to representation learning, where the correlations and higher-order output dependencies between the teacher and the student are captured. The critical challenge is increasing the lower bounds of information, which is also pointed out in [10].

In summary, we have theoretically analyzed how KD works and mentioned that the representation of knowledge is crucial for the transfer of knowledge and the learning of the student network. Explicitly dealing with the representation of knowledge from the teacher is significant and challenging, because the knowledge from the teacher expresses more general learned information (e.g., feature information, logits, data usage, etc.) that is helpful for building up a well-performing student. In the following sections, we will provide a hierarchically-structured taxonomy for the KD methods regarding how the information is transferred for both teacher and student, how knowledge is measured, and how the teacher is defined.

4 KD BASED ON THE NUMBER OF TEACHERS

4.1 Distillation from one teacher

Overall insight: Transferring knowledge from a large teacher network to a smaller student network can be achieved using either the logits or feature information from the teacher.

4.1.1 Knowledge from logits

Softened labels and regularization. Hinton et al. [1] and Ba and Caruana [11] propose to shift the knowledge from the teacher network to the student network by learning the class distribution via softened softmax (also called 'soft labels') given in Eqn. (1). The softened labels are in fact achieved by introducing temperature scaling to increase the weight of small probabilities. These KD methods achieved some surprising results on vision and speech recognition tasks. Recently, Mangalam et al. [12] introduce a special method based on class re-weighting to compress U-net into a smaller version. Re-weighting of the classes, in fact, softens the label distribution by obstructing inherent class imbalance. Compared to [1], some works, such as Ding et al. [13], Hegde et al. [9], Tian et al. [10], Cho et al. [14] and Wen et al. [15], point out that the trade-off (see Eqn. 2) between the soft label and the hard label is rarely optimal, and since α, β and T are fixed during training time, it lacks enough flexibility to cope with situations without the given softened labels. Ding et al. [13] instead propose a residual label and a residual loss to enable the student to use the erroneous experience during the training phase, preventing over-fitting and improving performance. Similarly, Tian et al. [10] formulate the teacher's knowledge as structured knowledge and train a student to capture significantly more mutual information during contrastive learning. Hegde et al. [9] propose to train a variational student by adding a sparsity regularizer based on variational inference, similar to the method in [8]. The sparsification of the student training reduces over-fitting and improves classification accuracy. Wen et al. [15] notice that the knowledge from the teacher is useful, but uncertain supervision also influences the result. Therefore, they propose to fix the incorrect predictions (knowledge) of the teacher via smooth regularization and to avoid overly uncertain supervision using dynamic temperature.

On the other hand, Cho et al. [16], Yang et al. [17] and Liu et al. [18] focus on different perspectives of regularization to avoid under-/over-fitting. Cho et al. [16] discover that an early-stopped teacher makes a better student, especially when the capacity of the teacher is larger than the student's. Stopping the training of the teacher early is akin to regularizing the teacher, and stopping knowledge distillation close to convergence allows the student to fit the training data better. Liu et al. [18] focus on modeling the distribution of the parameters as prior knowledge, which is modeled by aggregating the distribution (logits) space from the teacher network. The prior knowledge is then penalized by a sparse recording penalty for constraining the student to avoid over-regularization. Mishra et al. [19] combine network quantization with model compression by training an apprentice using KD techniques and showed that the performance of low-precision networks could be significantly improved by distilling the logits of the teacher network. Yang et al. [17] propose a snapshot distillation method to perform S-T (similar network architecture) optimization in one generation. Their method is based on a cyclic learning rate policy (refer to Eqn. 2 and Eqn. 6) in which the last snapshot of each cycle (e.g., W_T^{l−1} in iteration l − 1) serves as a teacher in the next cycle (e.g., W_T^{l} in iteration l). Thus, the idea of snapshot distillation is to extract supervision signals in earlier epochs of the same generation to make sure the difference between teacher and student is sufficiently large to avoid under-fitting. The snapshot distillation loss can be described as:

L(x, W^{l−1}) = α · H(y, σ(z_s^{l−1}; ρ = 1)) + β · H(σ(z_t^{l}; ρ = τ), σ(z_t^{l−1}; ρ = τ))   (6)

where W^{l−1} is the weights of the student at iteration l − 1, and z_s^{l−1} and z_t^{l−1} represent the logits of the student and the teacher at iteration l − 1.
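Returning to the variational bound of Eqn. (5) in Sec. 3, the snippet below is a minimal, hedged PyTorch sketch in which q(R(T)|R(S)) is modeled as a Gaussian whose mean is predicted from the student feature, in the spirit of VID [8]; the layer sizes and the per-channel variance parameterization are assumptions made only for this example.

```python
import torch
import torch.nn as nn

class GaussianQ(nn.Module):
    """Model q(R(T)|R(S)) as N(mu(s), diag(sigma^2)) and return -log q up to constants."""
    def __init__(self, s_channels: int, t_channels: int):
        super().__init__()
        self.mu = nn.Conv2d(s_channels, t_channels, kernel_size=1)  # mean predictor from student feature
        self.log_var = nn.Parameter(torch.zeros(t_channels))        # learned per-channel variance

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        var = self.log_var.exp().view(1, -1, 1, 1)
        # Minimizing this negative log-likelihood maximizes E[log q(R(T)|R(S))] in Eqn. (5).
        nll = 0.5 * ((f_t - self.mu(f_s)) ** 2 / var + self.log_var.view(1, -1, 1, 1))
        return nll.mean()

# Usage with stand-in feature maps: student 64 channels, teacher 128 channels, 8x8 spatial size.
q = GaussianQ(64, 128)
q(torch.randn(4, 64, 8, 8), torch.randn(4, 128, 8, 8)).backward()
```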

More detailed analysis of the methods with mutual information and one-generation optimization will be given in Sec. 6.1.

Learning from noisy labels. [20, 21, 22, 23] propose methods that utilize similar knowledge (softened labels) as in [1] but focus on the data issue. Specifically, [20] assumes that there is a small clean dataset D_c and a large noisy dataset D_n, while [21] and [22] use both labeled and unlabeled data to improve the performance of the student. In [20], the aim of distillation is to use the large amount of noisy data D_n to augment the small clean dataset D_c to learn a better visual representation and classifier. That is, the knowledge is distilled from the small clean dataset D_c to facilitate a better model from the entire noisy dataset D_n. The method is essentially different from [1], which focuses on an inferior model rather than an inferior dataset. The same loss function as in Eqn. 2 is used, except that z_t = σ[f_{D_c}(x)], where f_{D_c} is an auxiliary model trained on the clean dataset D_c. Furthermore, a risk function on the unreliable label ȳ is defined as R_ȳ = E_{D_t}[||ȳ − y*||]², where y* is the unknown ground truth label and D_t is the unseen test dataset. R_ȳ is an indicator that measures the level of noise in the distillation process.

Xu et al. [22] probe a positive-unlabeled classifier to address the problem of requesting the entire original training data, which cannot be easily uploaded to the cloud. [21] trains a noisy student with the following three steps: 1) train a teacher model on labeled data, 2) use the teacher to generate pseudo labels on unlabeled images, and 3) train a student model on the combination of labeled images and pseudo-labeled images while injecting noise (adversarial perturbation) into the student for better generalization and robustness. This way, the student generalizes better than the teacher. Similarly, [23] study adversarial perturbation and consider it as a crucial element in improving both the generalization and the robustness of the student. Based on how humans learn, two learning theories for the S-T model are proposed: fickle teacher and soft randomization. The fickle teacher model transfers the teacher's uncertainty to the student using Dropout [24] in the teacher model. The soft randomization method improves the adversarial robustness of the student model by adding Gaussian noise in the knowledge distillation. In this setting, the original distillation objective for the student in Eqn. 2 can be updated as:

L(x + δ, W) = α · H(y, σ(z_s; ρ = 1)) + β · H(σ(z_t; ρ = τ), σ(z_s; ρ = τ))   (7)

where δ is the variation of the adversarial perturbation. It is shown that using the teacher model trained on clean images to train the student model with adversarial perturbation can retain the adversarial robustness and mitigate the loss in generalization.

Imposing strictness in distillation. In contrast, Yang et al. [25], Yu et al. [26], Arora et al. [27], RKD [28] and Peng et al. [29] shift to a new perspective, focusing more on imposing strictness on the distillation process via optimization (e.g., distribution embedding, mutual relations, etc.). In particular, [25] initiates imposing strictness on the teacher, while [26] proposes two teaching metrics to impose strictness on the student. Yang et al. observe that, besides learning the primary class (namely, the ground truth), learning secondary classes (those with high confidence scores in the dark knowledge in [1]) may help to alleviate the risk of the student over-fitting. They thus introduce a framework of optimizing neural networks in generations (namely, iterations), which requires training a patriarchal model M^0 supervised only by the dataset. After m generations, the student M^m is trained in the m-th generation with the supervision of a teacher M^{m−1}. Since the secondary information is crucial for training a robust teacher, a fixed integer K standing for the semantically similar classes is chosen for each image, and the gap between the confidence scores of the primary class and the other K − 1 classes with the highest scores is computed. This can be described as:

L(x, W_T) = α · H(y, σ(z_t; ρ = 1)) + β · [f_{a_1}^T − (1/(K − 1)) · Σ_{k=2}^{K} f_{a_k}^T]   (8)

where f_{a_k} indicates the k-th largest element of the output (logits) z_t. Note that this S-T optimization is similar to BAN [30]; however, the goal here is to help the student learn inter-class similarity and prevent over-fitting. Different from the teacher in [30], the teacher here is deeper and larger than the student. [26] extends [1] for metric learning by using embedding networks to project the information (logits) learned from images to the embedding space. The embeddings are typically used to perform distance computation between the data pairs of a teacher and a student. From this point of view, the knowledge computed based on the embedding network is the actual knowledge, as it represents the data distribution. They design two different teachers: an absolute teacher and a relative teacher. For the absolute teacher, the aim is to minimize the distance between the teacher and student embeddings, while the aim for the relative teacher is to enforce the student to learn any embedding as long as it results in a similar distance between the data points. They also explore hints [1] and attention [36] to strengthen the distillation of embedding networks. We will give more explicit explanations of these two techniques in Sec. 4.1.2.

[27] proposes an embedding module that captures interactions between query and document information for question answering. The embedding of the output representation (logits) includes a simple attention model with a query encoder, a prober history encoder, a responder history encoder, and a document encoder. The attention model minimizes the summation of a cross-entropy loss and a KL-divergence loss, inspired by [1]. On the other hand, [31] and RKD [28] consider another type of strictness, namely the mutual relation or relational knowledge of two examples in the learned representations of both the teacher and the student. This approach is very similar to the relative teacher in [26], since both aim to measure the distance between the teacher's and the student's embeddings. However, RKD [28] also considers an angle-wise relational measure, similar to preserving secondary information in [25].

Ensemble of distribution. Although various methods have been proposed to extract knowledge from logits, some works [16, 32, 33, 34] show that KD is not always practical due to knowledge uncertainty. The performance of the student degrades when the gap between the student and the teacher is large.
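Before moving on, the snippet below gives a minimal, hedged PyTorch sketch of the perturbed-input objective of Eqn. (7) (soft randomization) discussed earlier in this subsection. The Gaussian noise scale and loss weights are illustrative assumptions, and whether the teacher is evaluated on the clean or the perturbed input is an implementation choice here; the sketch uses the clean input.

```python
import torch
import torch.nn.functional as F

def perturbed_kd_loss(student, teacher, x, y, sigma=0.1, alpha=0.5, beta=0.5, tau=4.0):
    """Distill on a noised input x + delta while matching a clean-trained teacher (cf. Eqn. (7))."""
    delta = sigma * torch.randn_like(x)          # Gaussian surrogate for the perturbation delta
    z_s = student(x + delta)                     # student sees the perturbed input
    with torch.no_grad():
        z_t = teacher(x)                         # teacher evaluated on the clean input
    hard = F.cross_entropy(z_s, y)
    soft = F.kl_div(F.log_softmax(z_s / tau, dim=-1),
                    F.softmax(z_t / tau, dim=-1),
                    reduction="batchmean") * tau ** 2
    return alpha * hard + beta * soft

# Usage with stand-in linear models on 32-dim inputs, 10 classes.
student = torch.nn.Linear(32, 10)
teacher = torch.nn.Linear(32, 10)
loss = perturbed_kd_loss(student, teacher, torch.randn(8, 32), torch.randint(0, 10, (8,)))
loss.backward()
```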

TABLE 1
A taxonomy of KD methods using logits. The given equations here are the generalized objective functions, and they may vary in individual works.

| Method | Sub-category | Description | KD objective function | References |
| --- | --- | --- | --- | --- |
| KD from logits | Softened labels and regularization | Distillation using soft labels and adding regularization to avoid under-/over-fitting | Eqn. 6 | [1, 9, 10, 11, 12, 14, 15]; [8, 9, 15, 16, 17, 18, 19] |
| KD from logits | Learning from noisy labels | Adding noise or using noisy data | Eqn. 6 or Eqn. 7 | [20, 21, 22, 23, 24] |
| KD from logits | Imposing strictness | Adding optimization methods to teacher or student | Eqn. 8 or Eqn. 6 | [25, 26, 27, 28, 29, 30, 31] |
| KD from logits | Ensemble of distribution | Estimating model or data uncertainty | Eqn. 9 | [16, 32, 33, 34, 35] |

[33] points out that estimating the model's uncertainty is crucial, since it ensures that more reliable knowledge is transferred. They stress ensemble approaches to estimate the data uncertainty and the distributional uncertainty. To estimate the distributional uncertainty, an ensemble distribution distillation approach anneals the temperature of the softmax to capture not only the mean of the ensemble soft labels but also the diversity of the distribution. Meanwhile, [35] proposes a similar approach of matching the distribution of distillation-based multi-exit architectures, in which a sequence of feature layers is augmented with early exits at different depths. By doing so, the loss defined in Eqn. 2 becomes:

L(x, W) = (1/K) · Σ_{k=1}^{K} [α · H(y, σ(p_s^k; ρ = 1)) + β · H(σ(p_t^k; ρ = τ), σ(p_s^k; ρ = τ))]   (9)

where K indicates the total number of exits, and p_s^k and p_t^k represent the k-th probabilistic outputs at exit k.

Conversely, [2, 27, 30, 32, 34, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50] propose to add more teachers or other auxiliaries, such as a teaching assistant and small students, to improve the robustness of the ensemble distribution. We will explicitly analyze these approaches in the following Sec. 4.2.

Summary. Table 1 summarizes the KD methods that use logits or 'soft labels'. We divide these methods into four categories. Overall, distillation using logits needs to transfer the dark knowledge to avoid over-/under-fitting. Meanwhile, the gap in model capacity between the teacher and the student is also very crucial for effective distillation. Moreover, the drawbacks of learning from logits are obvious. First, the effectiveness of distillation is limited to the softmax loss and relies on the number of classes. Second, it is impossible to apply these methods to KD problems in which there are no labels (e.g., low-level vision).

Open challenges: The appeal of the original idea in [1] is its apparent generality: any student can learn from any teacher; however, it is shown that this promise of generality is hard to achieve on some datasets [16, 36] (e.g., ImageNet [51]), even when regularization or strictness techniques are applied. When the capacity of the student is too low, it is hard for the student to incorporate the logits information of the teacher successfully. Therefore, it is expected to improve the generality and provide a better representation of logits information, which can be easily absorbed by the student.

Fig. 3. An illustration of general feature-based distillation.

4.1.2 Knowledge from the intermediate layers

Overall insight: Feature-based distillation enables learning richer information from the teacher and provides more flexibility for performance improvement.

Apart from distilling knowledge from the softened labels, Romero et al. [52] initially introduce hint learning rooted in [1]. A hint is defined as the output of a teacher's hidden layer, which helps guide the student's learning process. The goal of student learning is to learn a feature representation that is the optimal prediction of the teacher's intermediate representations. Essentially, the function of hints is a form of regularization; therefore, a pair of hint and guided (a hidden layer of the student) layers has to be carefully chosen such that the student is not over-regularized. Inspired by [52], many endeavours have been taken to study the methods to choose, transport, and match the hint layer (or layers) and the guided layer (or layers) via various layer transforms (e.g., transformer [53, 54]) and distance (e.g., MMD [55]) metrics. Generally, the hint learning objective can be written as:

L(F_T, F_S) = D(TF_t(F_T), TF_s(F_S))   (10)

where F_T and F_S are the selected hint and guided layers of the teacher and the student, TF_t and TF_s are the transformer and regressor functions for the hint layer of the teacher and the guided layer of the student, and D(·) is the distance function (e.g., l2) measuring the similarity of the hint and the guided layers.
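Below is a minimal, hedged PyTorch sketch of the hint objective in Eqn. (10) in the FitNet style [52]: the guided student feature is passed through a 1 × 1 convolutional regressor and matched to the teacher hint with an L2 distance; the channel sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """L(F_T, F_S) = D(TF_t(F_T), TF_s(F_S)) with TF_t = identity, TF_s = 1x1 conv, D = L2."""
    def __init__(self, s_channels: int, t_channels: int):
        super().__init__()
        self.regressor = nn.Conv2d(s_channels, t_channels, kernel_size=1)  # student transform TF_s

    def forward(self, f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.regressor(f_s), f_t)

# Usage with stand-in hint/guided feature maps (teacher 256 channels, student 128 channels, 16x16 spatial).
hint = HintLoss(s_channels=128, t_channels=256)
hint(torch.randn(2, 128, 16, 16), torch.randn(2, 256, 16, 16)).backward()
```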

Fig. 3 depicts the general paradigm of feature-based distillation. It shows that various intermediate feature representations can be extracted from different positions and are transformed with a certain type of regressor or transformer. The similarity of the transformed representations is finally optimized via an arbitrary distance metric D (e.g., L1 or L2 distance). In this paper, we carefully scrutinize various design considerations of feature-based KD methods and summarize four key factors that are often considered: transformation of the hint, transformation of the guided layer, position of the selected distillation feature, and distance metric [53]. In the following parts, we analyze and categorize all existing feature-based KD methods concerning these four aspects.

Transformation of hints. As pointed out in [8], the knowledge of the teacher should be easy for the student to learn. To do this, the teacher's hidden features are usually converted by a transformation function TF_t. Note that the transformation of the teacher's knowledge is a very crucial step for feature-based KD, since there is a risk of losing information in the process of transformation. The transformation methods of the teacher's knowledge in AT [36], MINILM [58], FSP [57], ASL [70], Jacobian [59], KP [56], SVD [71], SP [61], MEAL [62], KSANC [66], and NST [55] cause knowledge to be missing due to the reduction of feature dimension. Specifically, AT [36] and MINILM [58] focus on attention mechanisms (e.g., self-attention [72]) via an attention transformer TF_t to transform the activation tensor F ∈ R^{C×H×W} into C feature maps F ∈ R^{H×W}. FSP [57] and ASL [70] calculate the information flow of the distillation based on Gramian matrices, through which the tensor F ∈ R^{C×H×W} is transformed to G ∈ R^{C×N}, where N represents the number of matrices. Jacobian [59] and SVD [71] map the tensor F ∈ R^{C×H×W} to G ∈ R^{C×N} based on Jacobians using a first-order Taylor series and truncated SVD, respectively, inducing information loss. KP [56] projects F ∈ R^{C×H×W} to M feature maps F ∈ R^{M×H×W}, causing loss of knowledge. Similarly, SP [61] proposes a similarity-preserving knowledge distillation method based on the observation that semantically similar inputs tend to elicit similar activation patterns. To achieve this goal, the teacher's feature F ∈ R^{B×C×H×W} is transformed to G ∈ R^{B×B}, where B is the batch size. G encodes the similarity of the activations at the teacher layer, but leads to information loss during the transformation. MEAL [62] and KSANC [66] both use pooling to align the intermediate maps of the teacher and the student, leading to information loss when transforming the teacher's knowledge. NST [55] and PKT [73] match the distributions of neuron selectivity patterns and the affinity of data samples between the teacher and the student networks. Their loss functions are based on minimizing the maximum mean discrepancy (MMD) and the Kullback-Leibler (KL) divergence between these distributions, respectively, thus causing information loss when selecting neurons.

On the other hand, FT [54] proposes to extract good factors through which transportable features are made. The transformer TF_t is called the paraphraser and the transformer TF_s is called the translator. To extract the teacher factors, an adequately trained paraphraser is needed. Meanwhile, to enable the student to assimilate and digest the knowledge according to its own capacity, a user-defined paraphrase ratio is used in the paraphraser to control the factor of the transfer. Heo et al. [63] use the original teacher's feature in the form of binarized values, namely via a separating hyperplane (activation boundary (AB)) that determines whether neurons are activated or deactivated. Since AB only considers the activation of neurons and not the magnitude of the neuron response, there is information loss in the feature binarization process. Similar information loss happens in IRG [18], where the teacher's feature space is transformed to a graph representation with vertices and edges from which the relationship matrices are calculated. IR [68] distills the internal representations of the teacher model into the student model. However, since multiple layers of the teacher model are compressed into one layer of the student model, there is information loss when matching the features. Heo et al. [53] design TF_t with a margin ReLU function to exclude the negative (adverse) information and to allow positive (beneficial) information. The margin m is determined based on batch normalization [74] after the 1 × 1 convolution in the student's transformer TF_s.

Conversely, FitNet [52], RCO [60], Chung et al. [64], Wang et al. [65], Gao et al. [69] and Kulkarni et al. [67] do not add an additional transformation to the teacher's knowledge; this leads to no information loss on the teacher's side. However, not all knowledge from the teacher is beneficial for the student. As pointed out by [53], features include both adverse and beneficial information. For effective distillation, it is important to impede the use of adverse information and to avoid missing the beneficial information.

Transformation of the guided features. The transformation TF_s of the guided features (namely, the student transform) is also an important step for effective KD. Interestingly, SOTA works such as AT [36], MINILM [58], FSP [57], Jacobian [59], FT [54], SVD [71], SP [61], KP [56], IRG [18], RCO [60], MEAL [62], KSANC [66], NST [55], Kulkarni et al. [67], Gao et al. [69] and Aguilar et al. [68] use the same TF_s as the TF_t, which means the same amount of information might be lost in both the teacher's and the student's transformations.

Different from the transformation of the teacher, FitNet [52], AB [63], Heo et al. [53], and VID [8] change the dimension of the teacher's feature representations and design TF_s with a 'bottleneck' layer (1 × 1 convolution) to make the student's features match the dimension of the teacher's features. Note that Heo et al. [53] add a batch normalization layer after the 1 × 1 convolution to calculate the margin of the proposed margin ReLU transformer of the teacher. There are some advantages of using a 1 × 1 convolution in KD. First, it offers channel-wise pooling without a reduction of the spatial dimensionality. Second, it can be used to create a one-to-one linear projection of the stack of feature maps. Lastly, the projection created by the 1 × 1 convolution can also be used to directly increase the number of feature maps in the distillation model. In such a case, the feature representation of the student does not decrease but rather increases to match the teacher's representation; this does not cause information loss in the transformation of the student.

Exceptionally, some works focus on a different aspect of the transformation of the student's feature representations.
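As an example of the attention-style teacher transform discussed above (AT [36]), here is a small, hedged PyTorch sketch that collapses a C × H × W activation tensor into a normalized spatial attention map and matches teacher and student maps with an L2 loss; squaring the activations and interpolating mismatched spatial sizes are common implementation details assumed here, not prescriptions from the original work.

```python
import torch
import torch.nn.functional as F

def attention_map(feat: torch.Tensor) -> torch.Tensor:
    """Collapse (B, C, H, W) features into a normalized (B, H*W) spatial attention map."""
    amap = feat.pow(2).mean(dim=1)                 # mean of squared activations over channels
    return F.normalize(amap.flatten(1), dim=1)     # L2-normalize each sample's map

def at_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    if f_s.shape[-2:] != f_t.shape[-2:]:           # align spatial sizes if they differ
        f_s = F.interpolate(f_s, size=f_t.shape[-2:], mode="bilinear", align_corners=False)
    return (attention_map(f_s) - attention_map(f_t)).pow(2).mean()

# Usage: teacher and student features with different channel counts but the same spatial size.
loss = at_loss(torch.randn(4, 128, 8, 8), torch.randn(4, 256, 8, 8))
```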

TABLE 2
A taxonomy of knowledge distillation from the intermediate layers (feature maps). KP indicates knowledge projection.

| Method | Teacher's TF_t | Student's TF_s | Distillation position | Distance metric | Lost knowledge |
| --- | --- | --- | --- | --- | --- |
| FitNet [52] | None | 1 × 1 Conv | Middle layer | L1 | None |
| AT [36] | Attention map | Attention map | End of layer group | L2 | Channel dims |
| KP [56] | Projection matrix | Projection matrix | Middle layers | L1 + KP loss | Spatial dims |
| FSP [57] | FSP matrix | FSP matrix | End of layer group | L2 | Spatial dims |
| FT [54] | Encoder-decoder | Encoder-decoder | End of layer group | L1 | Channel + spatial dims |
| MINILM [58] | Self-attention | Self-attention | End of layer group | KL | Channel dims |
| Jacobian [59] | Gradient penalty | Gradient penalty | End of layer group | L2 | Channel dims |
| SVD [71] | Truncated SVD | Truncated SVD | End of layer group | L2 | Spatial dims |
| VID [8] | None | 1 × 1 Conv | Middle layers | KL | None |
| IRG [18] | Instance graph | Instance graph | Middle layers | L2 | Spatial dims |
| RCO [60] | None | None | Teacher's train route | L2 | None |
| SP [61] | Similarity matrix | Similarity matrix | Middle layer | Frobenius norm | None |
| MEAL [62] | Adaptive pooling | Adaptive pooling | End of layer group | L1/2 / KL / L_GAN | None |
| Heo [53] | Margin ReLU | 1 × 1 Conv | Pre-ReLU | Partial L2 | Negative features |
| AB [63] | Binarization | 1 × 1 Conv | Pre-ReLU | Margin L2 | Feature values |
| Chung [64] | None | None | End of layer | L_GAN | None |
| Wang [65] | None | Adaptation layer | Middle layer | Margin L1 | Channel + spatial dims |
| KSANC [66] | Average pooling | Average pooling | Middle layers | L2 + L_GAN | Spatial dims |
| Kulkarni [67] | None | None | End of layer group | L2 | None |
| IR [68] | Attention matrix | Attention matrix | Middle layers | KL + cosine | None |
| Liu [18] | Transform matrix | Transform matrix | Middle layers | KL | Spatial dims |
| NST [55] | None | None | Intermediate layers | MMD | None |
| Gao [69] | None | None | Intermediate layers | L2 | None |
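To make the 'Margin ReLU / Partial L2' row above concrete, the following is a small, hedged PyTorch sketch of a margin-ReLU teacher transform with a partial L2 distance in the spirit of Heo et al. [53]; the constant per-channel margin is a stand-in (the original work derives it from batch-normalization statistics).

```python
import torch
import torch.nn as nn

def margin_relu(f_t: torch.Tensor, margin: torch.Tensor) -> torch.Tensor:
    """Keep positive teacher responses; clamp negative ones to a (negative) per-channel margin."""
    return torch.max(f_t, margin.view(1, -1, 1, 1))

def partial_l2(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    """Skip positions where the student already exceeds a non-positive teacher target."""
    mask = (~((f_s > f_t) & (f_t <= 0))).float()
    return (((f_s - f_t) ** 2) * mask).mean()

# Usage: student features projected to the teacher's width by a 1x1 conv (the 'bottleneck' TF_s).
proj = nn.Conv2d(64, 128, kernel_size=1)
f_s, f_t = torch.randn(2, 64, 8, 8), torch.randn(2, 128, 8, 8)
margin = -0.5 * torch.ones(128)                     # assumed constant margin per channel
loss = partial_l2(proj(f_s), margin_relu(f_t, margin))
loss.backward()
```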

Wang et al. [65] make the student imitate the fine-grained local feature regions close to object instances in the teacher's representations. This is achieved by designing a particular adaptation function TF_s to fulfill the imitation task. IR [68] aims to let the student acquire the abstraction in a hidden layer of the teacher by matching the internal representations. That is, the student is taught how to compress the knowledge from multiple layers of the teacher into a single layer. In such a setting, the transformation of the student's guided layer is done by a self-attention transformer. Chung et al. [64], on the other hand, propose to impose no transformation on either the student or the teacher, but rather add a discriminator to distinguish the feature map distributions of the different networks (teacher or student).

Distillation positions of features. In addition to the transformation of the teacher's and student's features, the distillation position of the selected features is also very crucial in many cases. Earlier, FitNet [52], AB [63], and Wang et al. [65] use the end of an arbitrary middle layer as the distillation point. However, this method is shown to have poor distillation performance. Based on the definition of a layer group [75], in which a group of layers has the same spatial size, AT [36], FSP [57], Jacobian [59], MEAL [62], KSANC [66], Gao et al. [69] and Kulkarni et al. [67] define the distillation position at the end of each layer group, in contrast to FT [54] and NST [55], where the position lies only at the end of the last layer group. Compared to FitNet, FT achieves better results since it focuses more on informational knowledge. IRG [18] considers all the above-mentioned critical positions; namely, the distillation position lies not only at the end of earlier layer groups but also at the end of the last layer group. Interestingly, VID [8], RCO [60], Chung et al. [64], SP [61], IR [68], and Liu et al. [18] generalize the selection of distillation positions by employing variational information maximization [76], curriculum learning [77], adversarial learning [78], similarity preservation in representation learning [79], multi-task learning [80], and reinforcement learning [81]. We will discuss these methods further in later sections.

Distance metric for measuring distillation. The quality of KD from teacher to student is usually measured by various distance metrics. The most commonly used distance function is based on the L1 or L2 distance. FitNet [52], AT [36], NST [55], FSP [57], SVD [71], RCO [60], FT [54], KSANC [66], Gao et al. [69] and Kulkarni et al. [67] are mainly based on the L2 distance, whereas MEAL [62] and Wang et al. [65] mainly use the L1 distance. On the other hand, Liu et al. [18] and IR [68] utilize a KL-divergence loss to measure feature similarities. Furthermore, a cosine-similarity loss is adopted by IR [68] and RKD [28] to regularize the context representation of the feature distributions of the teacher and the student.

Some works also resort to an adversarial loss for measuring the quality of KD. MEAL [62] shows that a student learning the distilled knowledge with discriminators is better optimized than the original model, and the student can learn distilled knowledge from a teacher model with an arbitrary structure. Among the works focusing on feature-based distillation, KSANC [66] adds an adversarial loss at the last layer of both the teacher and the student networks, while MEAL adds multi-stage discriminators at the position of every extracted feature representation. It is worth mentioning that using an adversarial loss has shown considerable potential for improving the performance of KD. We will explicitly discuss the existing KD techniques based on adversarial learning in the following Sec. 8.1.

Potentials and open challenges. Table 2 summarizes the existing feature-based KD methods. It is shown that most works employ feature transformations for both the teacher and the student.
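Several of the entries in Table 2 ('Similarity matrix', Frobenius norm) correspond to batch-similarity matching as in SP [61]; the snippet below is a minimal, hedged PyTorch sketch of that idea, with tensor shapes chosen only for illustration.

```python
import torch
import torch.nn.functional as F

def batch_similarity(feat: torch.Tensor) -> torch.Tensor:
    """Transform (B, C, H, W) features into a row-normalized (B, B) pairwise-similarity matrix."""
    flat = feat.flatten(1)                       # (B, C*H*W)
    sim = flat @ flat.t()                        # G = x x^T, shape (B, B)
    return F.normalize(sim, p=2, dim=1)          # row-wise L2 normalization

def sp_loss(f_s: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
    g_s, g_t = batch_similarity(f_s), batch_similarity(f_t)
    return (g_s - g_t).pow(2).sum() / (f_s.size(0) ** 2)   # squared Frobenius norm / B^2

# Usage: student and teacher features may differ in shape; only the batch size must match.
loss = sp_loss(torch.randn(8, 64, 8, 8), torch.randn(8, 256, 4, 4))
```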

An L1 or L2 loss is the most commonly used loss for measuring KD quality. A natural question one may ask is: what is wrong with directly matching the features of the teacher and the student? If we consider the activation of each spatial position as a feature, the flattened activation map of each filter is a sample of the space of selected neurons with dimension HW, which reflects how a DNN learns an image [55]. Thus, when matching distributions, it is less desirable to directly match the samples, since the sample density can be lost in the space, as pointed out in [52]. Although [69] proposes to distill knowledge by directly matching feature maps, a teaching assistant is introduced to learn the residual errors between the feature maps of the student and the teacher. This approach better mitigates the performance gap between the teacher and the student, thus improving generalization capability.

Potentials: Feature-based methods show more generalization capability and quite promising results. In future research, more flexible ways of determining the representative knowledge of features are expected. The approaches used in representation learning (e.g., parameter estimation, graph models) might be reasonable solutions for these problems. Additionally, neural architecture search (NAS) techniques may better handle the selection of features. Furthermore, feature-based KD methods could be used in cross-domain transfer and low-level vision problems.

Open challenges: Although we have discussed most existing feature-based methods, it is still hard to say which one is best. First, it is difficult to measure the different aspects in which information is lost. Additionally, most works choose intermediate features as knowledge rather arbitrarily, and yet do not provide a reason as to why they are the representative knowledge among all layers. Third, the distillation position of features is manually selected based on the network or the task. Lastly, multiple features may not represent better knowledge than the feature of a single layer. Therefore, better ways to choose knowledge from layers and to represent knowledge could be explored.

4.2 Distillation from multiple teachers

Overall insight: The student can learn better knowledge from multiple teachers, which are more informative and instructive than a single teacher.

Impressive progress has been achieved under the common S-T KD paradigm, where knowledge is transferred from one high-capacity teacher network to a student network. However, the knowledge capacity in this setting is quite limited [48], and knowledge diversity is scarce for some special cases, such as cross-model KD [82]. To this end, some works probe learning a portable student from multiple teachers or an ensemble of teachers. The intuition behind this can be explained in analogy with the cognitive process of human learning. In practice, a student does not solely learn from a single teacher, but learns a concept better when provided with instructive guidance from multiple teachers on the same task or heterogeneous teachers on different tasks. In such a way, the student can amalgamate and assimilate various illustrations of knowledge representations from multiple teacher networks and build a comprehensive knowledge system [37, 39, 83]. As a result, many new KD methods [2, 18, 30, 32, 34, 37, 38, 39, 40, 41, 42, 43, 45, 46, 48, 49, 69, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103] have been proposed. Although these works vary in their distillation scenarios and assumptions, they share some standard characteristics that can be categorized into six types: ensemble of logits; ensemble of feature-level information; unifying data sources; obtaining sub-teacher networks from a single teacher network; customizing the student network from heterogeneous teachers; and learning a student network with diverse peers via the ensemble of logits. We now explicitly analyze each category and provide insights on how and why they are valuable for these problems.

4.2.1 Distillation from the ensemble of logits

Model ensemble of logits is one of the popular methods in KD from multiple teachers, as shown in Fig. 4(a). In such a setting, the student is encouraged to learn the softened output of the assembled teachers' logits (dark knowledge) via the cross-entropy loss, as done in [2, 30, 32, 37, 38, 40, 43, 46, 87, 88, 90, 92, 93, 95, 96, 101, 102], which can be generalized into:

L_Ens^logits = H((1/m) · Σ_{i=1}^{m} N_{T_i}^τ(x), N_S^τ(x))   (11)

where m is the total number of teachers, H is the cross-entropy loss, N_{T_i}^τ and N_S^τ are the i-th teacher's and the student's logits (or softmax outputs), and τ is the temperature. The averaged softened output serves as the incorporation of multiple teacher networks at the output layer. Minimizing Eqn. 11 achieves the goal of KD at this layer. Note that the averaged softened output is more objective than that of any of the individuals, because it can mitigate the unexpected bias of the softened output existing in some of the input data.

Unlike the methods mentioned above, [40, 82, 84, 91, 103] argue that taking the average of individual predictions may ignore the diversity and importance variety of the member teachers of an ensemble. Thus, they propose to learn the student model by imitating the summation of the teachers' predictions with a gating component. Then, Eqn. 11 becomes:

L_Ens^logits = H(Σ_{i=1}^{m} g_i · N_{T_i}^τ(x), N_S^τ(x))   (12)

where g_i is the gating parameter. In [84], g_i is the normalized similarity sim(D_{S_i}, D_T) of the source domain D_S and the target domain D_T.

Summary: Distilling knowledge from the ensemble of logits mainly depends on taking the average or the summation of the individual teachers' logits. Taking the average alleviates the unexpected bias, but it may ignore the diversity of the individual teachers of an ensemble. The summation of the logits of each teacher can be balanced by the gating parameter g_i, but how to determine better values of g_i is an issue worth studying in further works.
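A minimal, hedged PyTorch sketch of the ensemble objectives in Eqns. (11)-(12) is given below; uniform weights recover the average of Eqn. (11), while non-uniform gates correspond to the gated summation of Eqn. (12). The temperature and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ensemble_logits_kd(student_logits, teacher_logits_list, gates=None, tau=4.0):
    """Cross-entropy between the (weighted) ensemble of softened teacher outputs and the student."""
    m = len(teacher_logits_list)
    if gates is None:
        gates = [1.0 / m] * m                    # uniform weights -> plain average (Eqn. 11)
    target = sum(g * F.softmax(t / tau, dim=-1) for g, t in zip(gates, teacher_logits_list))
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    return -(target * log_p_s).sum(dim=-1).mean()    # cross-entropy with the soft ensemble target

# Usage: three stand-in teachers and one student on a 10-class problem.
teachers = [torch.randn(8, 10) for _ in range(3)]
student = torch.randn(8, 10, requires_grad=True)
loss_avg = ensemble_logits_kd(student, teachers)                            # Eqn. (11)
loss_gated = ensemble_logits_kd(student, teachers, gates=[0.5, 0.3, 0.2])   # Eqn. (12)
(loss_avg + loss_gated).backward()
```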


Fig. 4. Graphical illustration for KD with multiple teachers. The KD methods can be categorized into six types: (a) KD from the ensemble of logits, (b)
KD from the ensemble of feature representations via some similarity matrices, (c) unifying various data sources from the same network (teacher)
model A to generate various teacher models, (d) obtaining hierarchical or stochastic sub-teacher networks given one teacher network; (e) training
a versatile student network from multiple heterogeneous teachers, (f) online KD from diverse peers via ensemble of logits.

4.2.2 Distillation from the ensemble of features
Distillation from the ensemble of feature representations is more flexible and advantageous than distillation from the ensemble of logits, since it can provide richer and more diverse cross-information to the student. However, distillation from the ensemble of features [18, 45, 48, 82, 92, 94, 102] is more challenging, since each teacher's feature representation at specific layers differs from the others'. Hence, transforming the features and forming an ensemble of the teachers' feature-map-level representations becomes the key problem, as illustrated in Fig. 4(b).
To address this issue, Park et al. [48] proposed feeding the student's feature map into some nonlinear layers (called transformers). The output is then trained to mimic the final feature maps of the teacher networks. In this way, the advantages of the general model ensemble and of feature-based KD methods, as mentioned in Sec. 4.1.2, can both be incorporated. The loss function is given by:

L_{Ens}^{fea} = \sum_{i}^{m} \Big\| \frac{x_{T_i}}{\|x_{T_i}\|_2} - \frac{TF_i(x_S)}{\|TF_i(x_S)\|_2} \Big\|_1    (13)

where x_{T_i} and x_S are the i-th teacher's and the student's feature maps, respectively, and TF_i is the transformer (e.g., a 3 × 3 convolution layer) used to adapt the student's features to those of the i-th teacher.
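As a rough illustration of Eqn. 13, the sketch below pairs each teacher with its own small "transformer" head on the student feature and applies the normalized L1 matching; the layer shapes, module names, and the assumption of matching spatial resolutions are illustrative choices rather than the configuration used in [48].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureEnsembleKD(nn.Module):
    """Ensemble feature distillation in the spirit of Eqn. 13."""

    def __init__(self, student_channels, teacher_channels, num_teachers):
        super().__init__()
        # One 3x3 conv "transformer" TF_i per teacher to adapt the student feature.
        self.transformers = nn.ModuleList([
            nn.Conv2d(student_channels, teacher_channels, kernel_size=3, padding=1)
            for _ in range(num_teachers)
        ])

    def forward(self, student_feat, teacher_feats):
        loss = 0.0
        for tf, t_feat in zip(self.transformers, teacher_feats):
            s_adapted = tf(student_feat)
            # L2-normalize both feature maps before the L1 matching term.
            t_norm = F.normalize(t_feat.flatten(1), p=2, dim=1)
            s_norm = F.normalize(s_adapted.flatten(1), p=2, dim=1)
            loss = loss + (t_norm - s_norm).abs().sum(dim=1).mean()
        return loss
```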
In contrast, Wu et al. [45] and Liu et al. [18] proposed letting the student model imitate the learnable transformation matrices of the teacher models. This approach is an updated version of the single-teacher method in [61]. For the i-th teacher and the student network in [45], the similarity between the feature maps is computed based on the Euclidean metric as:

L_{Ens}^{fea} = \sum_{i}^{m} \alpha_i \| \log(A_S) - \log(A_{T_i}) \|_F^2    (14)

where \alpha_i is the teacher weight controlling the contribution of the i-th teacher, and the weights should satisfy \sum_{i}^{m} \alpha_i = 1. A_S and A_{T_i} are the similarity matrices of the student and the i-th teacher, respectively, which can be computed as A_S = x_S^\top x_S and A_{T_i} = x_{T_i}^\top x_{T_i}.
Open challenges: Based on our review, it is evident that only a few studies propose distilling knowledge from the ensemble of feature representations. Although [48, 90] proposed letting the student directly mimic the ensemble of feature maps of the teachers via either non-linear transformation or similarity matrices with weighting mechanisms, there still exist some challenges. First, how can we know which teacher's feature representation is more reliable or more influential in the ensemble? Second, how can we determine the weighting parameter \alpha_i for each teacher in an adaptive way? Third, instead of summing all feature information together, is there any mechanism for selecting the best feature map of one teacher from the ensemble as the representative knowledge?

4.2.3 Distillation by unifying data sources
Although the above-mentioned KD methods using multiple teachers are good in some aspects, they assume that the target classes of all teacher and student models are the same. In addition, the dataset used for training is often scarce, and teacher models with high capacity are limited. To tackle these problems, some recent works [39, 42, 45, 86, 103, 104] propose data distillation by unifying data sources from multiple teachers, as illustrated in Fig. 4(c). The goal of these methods is to generate labels for the unlabeled data via various data processing approaches (e.g., data augmentation) to train a student model.
Vongkulbhisal et al. [86] proposed to unify an ensemble of heterogeneous classifiers (teachers), which may be trained to classify different sets of target classes and can share the same network architecture. To generalize distillation,

a probabilistic relationship connecting the outputs of the of multiple teachers and online KD, many uncertainties
heterogeneous classifiers with that of the unified (ensemble) remain. Firstly, it is unclear how many teachers are sufficient
classifier is proposed. Similarly, Wu et al. [45] and Gong et for online distillation. Secondly, which structure is optimal
al. [104] also explored transferring knowledge from teacher among the ensemble of sub-teachers is unclear? Thirdly,
models trained in existing data to a student model by balancing the training efficiency and accuracy of the stu-
using unlabeled data to form a decision function. Besides, dent network is an open issue. These challenges are worth
some works utilize the potential of data augmentation ap- exploring in further studies.
proaches to build multiple teacher models from a trained
teacher model. Radosavovic et al. [42] proposed a distillation 4.2.5 Customizing student form heterogeneous teachers
method via multiple transformations on the unlabeled data
In many cases, well-trained deep networks (teachers) are
to build diverse teacher models sharing the same network
focused on different tasks, and are optimized for different
structure. The technique consists of four steps. First, a single
datasets. However, most studies focus on training a student
teacher model is trained on manually labeled data. Second,
by distilling knowledge from teacher networks on the same
the trained teacher model is applied to multiple transfor-
task or on the same dataset. To tackle these problems,
mations of the unlabeled data. Third, the predictions on the
knowledge amalgamation has been initialized by recent works
unlabeled data are converted into an ensemble of numerous
[18, 46, 83, 98, 99, 100, 102, 105, 106, 107] to learn a versatile
predictions. Fourth, the student model is trained on the
student model by distilling knowledge from the expertise
union of the manually labeled data and the automatically
of all teachers, as illustrated in Fig. 4(e). Shen et al. [83],
labeled data. Sau et al. [39] proposed an approach to simu-
Ye et al. [99], Luo et al. [100] and Ye et al. [105] proposed
late the effect of multiple teachers by injecting noise to the
training a student network by customizing the tasks without
training data, and perturbing the logit outputs of a teacher.
accessing human-labeled annotations. These methods rely
In such a way, the perturbed outputs not only simulate the
on schemes such as branch-out [108] or selective learning
setting of multiple teachers, but also generate noise in the
[109]. The merits of these methods lie in their ability to
softmax layer, thus regularizing the distillation loss.
reuse deep networks pre-trained on various datasets of
Summary: Unifying data sources using data augmenta-
diverse tasks to build a tailored student model based on the
tion techniques and unlabeled data from a single teacher
user demand. The student inherits most of the capabilities
model to build up multiple sub-teacher models is also
of heterogeneous teachers, and thus can perform multiple
valid for training a student model. However, it requires a
tasks simultaneously. Shen et al. [98] and Gao et al. [106]
high-capacity teacher with more generalized target classes,
utilized a similar methodology, but focused on same task
which could confine the application of these techniques. In
classification, with two teachers specialized in different clas-
addition, the effectiveness of these techniques for some low-
sification problems. In this method, the student is capable
level vision problems should be studied further based on
of handling comprehensive or fine-grained classification.
feature representations.
Dvornik et al. [46] attempted to learn a student that can
4.2.4 From a single teacher to multiple sub-teachers predict unseen classes by distilling knowledge from teachers
via few-shot learning. Rusu et al. [107] proposed a multi-
It has been shown that students could be further improved
teacher single-student policy distillation method that can
with multiple teachers used as ensembles or used sepa-
distill multiple policies of reinforcement learning agents to
rately. However, using multiple teacher networks is resource
a single student network for sequential prediction tasks.
heavy, and delays the training process. Following this, some
Open challenges: Studies such as the ones mentioned above
methods [37, 41, 49, 84, 88, 90, 97] have been proposed to
have shown considerable potential in customizing versatile
generate multiple sub-teachers from a single teacher net-
student networks for various tasks. However, there are some
work, as shown in Fig. 4(d). Lee et al. [49] proposed stochas-
limitations in such methods. Firstly, the student may not be
tic blocks and skip connections to teacher networks, so that
compact due to the presence of branch-out structures. Sec-
the effect of multiple teachers can be obtained in the same
ondly, current techniques mostly require teachers to share
resource from a single teacher network. The sub-teacher
similar network structures (e.g., encoder–decoder), which
networks have reliable performance, because there exists
confines the generalization of such methods. Thirdly, train-
a valid path for each batch. By doing so, the student can
ing might be complicated because some works adopt a dual-
be trained with multiple teachers throughout the training
stage strategy, followed by multiple steps with fine-tuning.
phase. Similarly, Ruiz et al. [89] introduced hierarchical neu-
These challenges open scopes for future investigation on
ral ensemble by employing a binary-tree structure to share a
knowledge amalgamation.
subset of intermediate layers between different models. This
scheme allows controlling the inference cost on the fly, and
deciding how many branches need to be evaluated. Tran et 4.2.6 Mutual learning with ensemble of peers
al. [88], Song et al. [41] and He et al. [97] introduced multi- One problem with conventional KD methods using multiple
headed architectures to build multiple teacher networks, while teachers is their computation cost and complexity, because
amortizing the computation through a shared heavy-body they require pre-trained high-capacity teachers with two-
network. Each head is assigned to an ensemble member, stage (also called offline) learning. To simplify the distil-
and tries to mimic the individual predictions of the ensemble lation process, one-stage (online) KD methods [34, 40, 50,
member. 64, 82, 85, 101, 110, 111] have been developed, as shown in
Open challenges: Although network ensembles using Fig. 4(f). Instead of pre-training a static teacher model, these
stochastic or deterministic methods can achieve the effect methods train a set of student models simultaneously by

TABLE 3
A taxonomy of KD with multiple teachers. LCE is the cross-entropy loss, LEns is the KD loss between the ensemble teacher and the student, LKD indicates the KD loss between an individual teacher and the student, LKDfea+logits means a KD loss using both features and logits, KL indicates the KL divergence loss for mutual learning, LGAN is the adversarial loss, MMD means the maximum mean discrepancy loss, Lreg is the regression loss, and N/A means not available. Note that the losses summarized are generalized terms, which may vary in individual works.
Ensemble Ensemble of Unifying Customize Extending Online Mutual Major Loss
Method Logits Features data sources student teacher KD learning functions
Anil [101] X 7 7 7 7 X X LCE +LEns
Chen [50] X X 7 7 7 X X LCE +LEns +LKD
Dvornik [46] X 7 7 X 7 X X LCE +LEns +LKD
Fukuda [91] X 7 7 7 7 7 7 LCE +LEns
Furlanello [30] X 7 7 7 7 7 7 LCE +LEns
He [97] 7 X 7 7 X 7 7 L1 +LKD
Jung [93] X 7 7 7 7 7 7 LCE +LKD
Lan [40] X 7 7 7 7 X 7 LCE +LEns
Lee [49] X X 7 7 X X X N/A
Liu [18] 7 X 7 X 7 X 7 LCE +LKD
Luo [100] X X 7 X 7 7 7 LCE +LKDf ea+logits
Zhou [102] X X 7 X 7 7 7 MMD+LKD
Mirzadeh [32] X 7 7 7 7 7 7 LCE +LKD
Papernot [38] X 7 X 7 7 X 7 LKD
Park [48] 7 X 7 7 7 7 7 LCE +LKD
Radosavovic [42] X 7 X 7 7 7 7 N/A
Ruder [84] X 7 7 7 7 7 7 LCE +LKD
Ruiz [89] X 7 7 7 X 7 7 LCE +LEns
Sau [39] X 7 X 7 7 7 7 L2 (KD)
Shen [98] X X 7 X 7 7 7 LKD +LP L
Shen [83] X X 7 X 7 7 7 LKDf ea+logits +Lreg
Song [41] X 7 7 7 X X 7 LCE +LKD
Tarvaninen [2] X 7 7 7 7 X 7 LKD
Tran [88] X 7 7 7 X 7 X LCE +LEns +KL
Vongkulbhisal [86] X 7 X 7 7 7 7 LCE +LEns
Wu [45] 7 X X 7 7 7 7 LKD
Wu [90] X 7 7 7 7 7 7 LCE +LEns
Yang [87] X 7 7 7 7 7 7 LCE +LKD
Ye [99] X X 7 X 7 7 7 LKD +LKD
You [37] X X 7 7 7 7 7 LCE +LKDf ea+logits
Zhang [82] X X 7 7 7 7 7 LCE +LKDf ea+logits
Zhang [34] X 7 7 7 7 X X LCE +KL
Zhu [85] X 7 7 7 7 X X LCE +KL+LEns
Chung [64] X X 7 7 7 X X LCE +LGAN +KL
Kim [110] X X 7 7 7 X X LCE +LEns +KL
Hou [111] 7 X 7 7 7 X 7 LCE +LEns
Xiang [103] X 7 X 7 7 7 7 LCE +LKD

making them learn from each other in a peer-teaching manner. There are several benefits to such methods. First, these approaches merge the training processes of the teacher and student models, and use peer networks to provide teaching knowledge. Second, these online distilling strategies can improve the performance of models of any capacity, leading to generic applications. Third, such a peer-distillation method can sometimes outperform teacher-based two-stage KD methods. For KD with mutual learning, the distillation loss of two peers is based on the KL divergence, which can be formulated as:

L_{Peer}^{KD} = KL(z_1, z_2) + KL(z_2, z_1)    (15)

where KL is the KL divergence function, and z_1 and z_2 are the predictions of peer one and peer two, respectively.
the predictions of peer one and peer two, respectively. lation by employing adversarial learning (discriminators).
In addition, Lan et al. [40] and Chen et al. [50] also Kim et al. [110] introduce a feature fusion module to form an
constructed a multi-branch variant of a given target (stu- ensemble teacher. However, the fusion is based on the con-
dent) network by adding auxiliary branches to create a local catenation of the features (output channels) from the branch
ensemble teacher (also called a group leader) model from all peers. Moreover, Liu et al. [18] presented a knowledge flow

framework which moves the knowledge from the features of multiple teacher networks to a student.
Summary: Compared to two-stage KD methods using pre-trained teachers, distillation from student peers has many merits. The methods are based on mutual learning of peers, and sometimes on ensembles of peers. Most studies rely on logit information; however, some works also exploit feature information via adversarial learning or feature fusion. There is room for improvement in this direction. For instance, the number of peers that is most beneficial for KD is worth investigating. In addition, the possibility of using both the online and offline methods simultaneously when a teacher is available is intriguing. Reducing the computation cost without sacrificing accuracy and generalization is also an open issue. We will discuss the advantages and disadvantages of online and offline KD in Sec. 6.1.
Potentials: Table 3 summarizes the KD methods with multiple teachers. Overall, most methods rely on the ensemble of logits, whereas the knowledge in feature representations has not been taken into account much. Therefore, it is possible to exploit the knowledge of the ensemble of feature representations by designing better gating mechanisms. Unifying data sources and extending teacher models are two effective ways of reducing the number of individual teacher models; however, their performance is degraded, so overcoming this issue needs more research. Customizing a versatile student is a valuable idea, but existing methods are limited by network structures, diversity, and computation costs, which must be improved in future works.

5 DISTILLATION BASED ON DATA FORMAT
5.1 Data-free distillation
Overall insight: Can we achieve KD when the original data used for the teacher or the (un)labeled data for training the student are not available?
One major limitation of most KD methods, such as [1, 28, 48, 52], is that they assume the training samples of the original networks (teachers) or of the target networks (students) to be available. However, the training dataset is sometimes unknown in real-world applications owing to privacy and transmission concerns [112]. To handle this problem, some representative data-free KD paradigms [112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122] have been newly developed. A taxonomy of these methods is summarized in Table 4, and a detailed technical analysis is provided as follows.

5.1.1 Distillation based on metadata
To the best of our knowledge, Lopes et al. [112] initially proposed to reconstruct the original training dataset using only the teacher model and its metadata recorded in the form of precomputed activation statistics. Thus, the objective is to find the set of images whose representations best match the ones given by the teacher network. Gaussian noise is randomly passed as input to the teacher, and gradient descent (GD) is applied to minimize the difference between the metadata and the representations of the noise input. To better constrain the reconstruction, the metadata of all layers of the teacher model are used and recorded to train the student model with high accuracy. Bhardwaj et al. [113] demonstrated that metadata from a single layer (the average-pooling layer), clustered with k-means, is sufficient to achieve high student accuracy. In contrast to [112, 113], which require sampling the activations generated by real data, Haroush et al. [114] proposed using metadata (e.g., channel-wise mean and standard deviation) from the Batch Normalization (BN) [74] layers with synthetic samples. The objective of metadata-based distillation can be formulated as:

X^* = \arg\min_{X \sim \mathbb{R}^{H \times W}} L(\Phi(X), \Phi_0)    (17)

where X^* is the image (with width W and height H) to be found, \Phi is the representation of X, \Phi_0 is the representation of the metadata, and L is the loss function (e.g., l_2).

5.1.2 Distillation based on class-similarities
Nayak et al. [115] argued that the approaches in [112, 113] are actually not completely data-free, since the metadata is formed using the training data itself. They instead proposed a zero-shot KD approach, in which no data samples and no metadata information are used. In particular, the approach obtains useful prior information about the underlying data distribution in the form of class similarities from the model parameters of the teacher. This prior information can be further utilized for crafting data samples (also called data impressions (DIs)) by modeling the output space of the teacher model as a Dirichlet distribution. The class similarity matrix, similar to [61], is calculated based on the softmax layer of the teacher model. The objective for a data impression X_i^k can be formulated based on the cross-entropy loss:

X_i^k = \arg\min_{X} L_{CE}(y_i^k, T(X, \theta_T, \tau))    (18)

where y_i^k is the i-th sampled softmax vector and k is the corresponding class.

5.1.3 Distillation using a generator
Considering the limitations of metadata- and similarity-based distillation methods, some works [116, 117, 118, 119, 120, 121] propose novel data-free KD methods via adversarial learning [78, 123, 124]. Although the tasks and network structures vary among these methods, most are built on a common framework: the pretrained teacher network is fixed as a discriminator, while a generator is designed to synthesize training samples given various input sources (e.g., noise [116, 118, 119, 120]). However, slight differences exist among some studies. Fang et al. [117] point out the problem of taking only the teacher as the discriminator, since the information of the student is ignored and the generated samples cannot be customized for the student. Thus, they take both the teacher and the student as the discriminator to reduce the discrepancy between them, while a generator is trained to generate samples that adversarially enlarge the discrepancy. In contrast, Ye et al. [118] focus more on strengthening the generator structure, and three generators are designed and subtly used. Specifically, a group-stack generator is trained to generate the images originally used for pre-training the teachers, as well as the intermediate activations. Then, a dual generator takes the generated image as the input; the dual part is taken as the target network (student) and regrouped for multi-label classification. To

TABLE 4
A taxonomy of data-free knowledge distillation.
Original Metadata or Number of Multi-task
Method data needed prior info. generators
Inputs Discriminator distillation

Activations
Lopes [112] X of all layers
7 Image shape 7 7

Activations of
Bhardwaj [113] X pooling layer
7 Image shape 7 7

Batch
Haroush [114] X normalization layer
7 Image shape 7 7

Nayak [115] 7 Class similarities 7 Class label+ Number of DIs 7 7


Chen [116] 7 7 One Noise Teacher 7
Fang [117] 7 7 One Noise/images Teacher + student 7
Ye [118] 7 7 Three Noise Teachers X
Yoo [119] 7 7 One Noise + class labels Teacher 7
Yin [120] 7 7 One Noise Teacher 7
Micaelli [121] 7 7 One Noise Teacher 7
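To illustrate the common generator-based framework described in Sec. 5.1.3 (a frozen teacher acting as the "discriminator", a generator synthesizing inputs that expose the teacher–student discrepancy, and a student closing that gap), here is a highly simplified sketch; the alternating scheme, the L1 discrepancy, and all hyper-parameters are illustrative assumptions and do not reproduce any specific method listed in Table 4.

```python
import torch
import torch.nn.functional as F

def data_free_kd_step(generator, teacher, student, g_opt, s_opt,
                      z_dim=100, batch_size=64):
    """One alternating step of generator-based data-free KD (cf. Sec. 5.1.3).

    The teacher is assumed to be pretrained and frozen (requires_grad=False).
    """
    device = next(student.parameters()).device

    # 1) Generator step: synthesize inputs that maximize the teacher-student gap.
    z = torch.randn(batch_size, z_dim, device=device)
    fake = generator(z)
    g_loss = -F.l1_loss(student(fake), teacher(fake))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # 2) Student step: imitate the teacher on freshly generated samples.
    with torch.no_grad():
        fake = generator(torch.randn(batch_size, z_dim, device=device))
        t_logits = teacher(fake)
    s_loss = F.l1_loss(student(fake), t_logits)
    s_opt.zero_grad()
    s_loss.backward()
    s_opt.step()
    return g_loss.item(), s_loss.item()
```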

compute the adversarial loss for both the generated image 5.2.1 Distillation via pseudo examples
and the intermediate activations, multiple group-stack dis- Insight: If training data is insufficient, try to create pseudo
criminators (multiple teachers) are also designed to amal- examples for training the student.
gamate multi-knowledge into the generator. Yoo et al. [119] [122, 126, 128] focus on creating pseudo training exam-
make the generator take two inputs: a sampled class label ples when training data is scarce and leading to overfitting
y , and noise z . Meanwhile, a decoder is also applied to of the student network. Specifically, Kimura et al. [128]
reconstruct the noise input z 0 and class label y 0 from the fake adopt the idea of inducing points [130] to generate pseudo
data x0 generated by the generator from the noise input z , training examples, which are then updated by applying
and class label y . Thus, by minimizing the errors between y adversarial examples [131, 132], and further optimized by
and y 0 and between z and z 0 , the generator generates more an imitation loss. Liu et al. [126] generate pseudo ImageNet
reliable data. Although adversarial loss is not used in [120], [51] labels from a teacher model (trained with ImageNet),
the generator (called DeepInversion) taking an image prior and also utilize the semantic information (e.g., words) to add
regularization term to synthesize images is modified from a supervision signal for the student. Interestingly, Kulkarni
DeepDream [125]. et al. [122] create a ‘mismatched’ unlabeled stimulus (e.g.,
soft labels of MNIST dataset [133] provided by the teacher
5.1.4 Open challenges for data-free distillation
trained on CIFAR dataset [134]), which are used for aug-
Although data-free KD methods have shown considerable menting a small amount of training data to train the student.
potential and new directions for KD, there still exist many
challenges. First, the recovered images are unrealistic and 5.2.2 Distillation via layer-wise estimation
low-resolution, which may not be utilized in some data- Insight: Layer-wise distillation from the teacher network via
captious tasks (e.g., semantic segmentation). Second, train- estimating the accumulated errors on the student network can
ing and computation of the existing methods might be also achieve the purpose of few-example KD.
complicated due to the utilization of many modules. Third, In Bai et al. [129] and Li et al. [127], the teacher network
diversity and generalization of the recovered data are still is first compressed to create a student via network pruning
limited, compared with the methods of data-driven distil- [135], and layer-wise distillation losses are then applied to
lation. Fourth, the effectiveness of such methods for low- reduce the estimation error on given limited samples. To
level tasks (e.g., image super-resolution) needs to be studied conduct layer-wise distillation, Li et al. [127] add a 1×1 layer
further. after each pruned layer block in the student, and estimate
the least-squared error to align the parameters with the
5.2 Distillation with a few data samples student. Bai et al. [129] employ cross distillation losses to
Overall insight: How to perform efficient knowledge distillation mimic the behavior of the teacher network, given its current
with only a small amount of training data? estimations.
Most KD methods with S-T structures, such as [1, 48, 54,
64], are based on matching information (e.g., logits, hints) 5.2.3 Challenges and potentials
and optimizing the KD loss with the fully annotated large- Although KD methods with a small number of examples
scale training dataset. As a result, the training is still data- inspired by the techniques of data augmentation and layer-
heavy and processing-inefficient. To enable efficient learning wise learning are convincing, these techniques are still con-
of the student while using small amount of training data, fined by the structures of teacher networks. This is because
some works [122, 126, 127, 128, 129] propose few-sample KD most methods rely on network pruning from teacher net-
strategies. The technical highlight of these methods is based works to create student networks. Besides, the performance
on generating pseudo training examples, or aligning the of the student is heavily dependent on the amount of the
teacher and the student with layer-wise estimation metrics. crafted pseudo labels, which may impede the effectiveness
Fig. 5. Graphical illustration of cross-modal KD methods. (a) supervised cross-modal KD from the teacher with one modality to the student with
another modality. (b) unsupervised cross-modal KD with one teacher. (c) unsupervised cross-modal KD with multiple teachers, each of which is
transferring the discriminative knowledge to the student.
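As a toy illustration of the setting in Fig. 5(b), where a teacher trained on one modality "supervises" a student on a paired modality without labels, consider the sketch below; the modality pair (an RGB teacher supervising a depth or audio student), the temperature, and the loss choice are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def cross_modal_kd_loss(teacher_rgb_model, student_other_model,
                        rgb_batch, paired_other_batch, tau=2.0):
    """Unsupervised cross-modal KD: the teacher's soft prediction on one
    modality is the target for the student on the paired modality."""
    with torch.no_grad():
        # Teacher sees the source modality (e.g., RGB frames).
        t_prob = F.softmax(teacher_rgb_model(rgb_batch) / tau, dim=1)
    # Student sees the paired target modality (e.g., depth or audio).
    s_log_prob = F.log_softmax(student_other_model(paired_other_batch) / tau, dim=1)
    # Cross-entropy against the teacher's soft labels; no ground truth is used.
    return -(t_prob * s_log_prob).sum(dim=1).mean() * (tau ** 2)
```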

of these methods. Lastly, most works focus on generic clas- utilize the knowledge from high-quality images to learn a
sification tasks, and it is unclear whether these methods are classifier with better generalization on low-quality image
effective for tasks without class labels (e.g., low-level vision (paired).
tasks).
5.3.2 Unsupervised cross-modal distillation
5.3 Cross-modal distillation Most cross-modal KD methods exploit unsupervised learn-
ing, since the labels in target domains are hard to get. Thus,
Overall insight: KD for cross-modal learning is typically per-
these methods are also called distillation ‘in the wild’. In this
formed with network architectures containing modal-specific rep-
setting, the knowledge from the teacher’s modality provides
resentations or shared layers, utilizing the training images in
supervision for the student network. To this end, some works
correspondence of different domains.
[136, 141, 142, 143, 144, 145, 146, 147, 148, 150, 151, 152, 153,
One natural question we ask is if it is possible to transfer
156] aimed for cross-modal distillation in an unsupervised
knowledge from a pre-trained teacher network for one
manner.
task to a student learning another task, while the training
examples are in correspondence across domains. Note that
5.3.3 Learning from one teacher
KD for cross-modal learning is essentially different from that for
domain adaptation, in which data are drawn independently from Afouras et al. [141], Albanie et al. [142], Gupta et al. [143],
different domains, but the tasks are the same. Thoker et al. [145], Zhao et al. [146], Owens et al. [147], Kim
Compared to previously mentioned KD methods fo- et al. [151], Arandjelovic et al. [148], Gan et al. [154], and
cused on transferring supervision within the same modality Hafner et al. [153] focus on distilling knowledge from one
between the teacher and the student, cross-modal KD uses teacher (see Fig. 5(b)), and mostly learn a single student
the teacher’s representation as a supervision signal to train network. Thoker et al. [145] and Zhao et al. [146] learn two
the student learning another task. In this problem setting, students. Especially, Thoker et al. refer to mutual learning
the student needs to rely on the visual input of the teacher [34], where two students learn from each other based on
to accomplish its task. Following this, many novel cross- two KL divergence losses. In addition, Zhao et al. [146]
modal KD methods [136, 137, 138, 139, 140, 141, 142, 143, exploit the feature fusion strategy, similar to [110, 157] to
144, 145, 146, 147, 148, 149, 150, 152, 153, 156] have been learn a more robust decoder. Do et al. [149] focuses on
proposed. We now provide a systematic analysis of the unpaired images of two modalities, and learns a semantic
technical details, and point the challenges and potential of segmentation network (student) using the knowledge from
cross-domain distillation. the other modality (teacher).

5.3.1 Supervised cross-modal distillation 5.3.4 Learning from multiple teachers


Using the ground truth labels for the data used in the Aytar et al. [136], Salem et al. [144], Aytar et al. [150] and Do
student network is the common way of cross-modal KD, as et al. [149] exploit the potential of distilling from multiple
shown in Fig. 5(a). Do et al. [149], Su et al. [137], Nagrani et teachers as mentioned in Sec. 4.2. Most methods rely on
al. [138], Nagrani et al. [139] and Hoffman et al. [140] rely on concurrent knowledge among visual, audio, and textual in-
supervised learning for cross-modal transfer. Several works formation, as shown in Fig. 5(c). However, Salem et al. [144]
[138, 139, 141, 155] leverage the synchronization of visual focus on the visual modality only, where teachers learn the
and audio information in the video data, and learn a joint information of object detection, image classification, and
embedding between the two modalities. Afouras et al. [141] scene categorization via a multi-task approach, and distill
and Nagrani et al. [139] transfer the voice knowledge to learn the knowledge to a single student.
a visual detector, while Nagrani et al. [138] utilize visual
knowledge to learn a voice detector (student). In contrast, 5.3.5 Potentials and open challenges
Hoffman et al. [140], Do et al. [149] and Su et al. [137] Potentials: Based on the analysis of the existing cross-modal
focus on different modalities in the visual domain only. In KD techniques in Table. 5, we can see that cross-modal
particular, Hoffman et al. [140] learn a depth network by KD expands the generalization capability of the knowledge
transferring the knowledge from an RGB network, and fuse learned from the teacher models. Cross-domain KD has
the information across modalities. This improves the object considerable potential in relieving the dependence for a large
recognition performance during the test time. Su et al. [137] amount of labeled data in one modality or both. In addition,

TABLE 5
A taxonomy of cross-modal knowledge distillation methods.
Source Target Number of Model
Method Use GT modality modality teachers
Online KD Knowledge compression
Ayter [136] 7 RGB frames Sound Two 7 Logits 7
Su [137] X HR image map LR image One 7 Soft labels X
Nagrani [138] X RGB frames Voice One X Soft labels 7
Nagrani [139] X Voice/face Face/voice Multiple X Features 7
Hoffman [140] X RDG images Depth images One 7 Features 7
Afouras [141] 7 Audio Video One 7 Soft labels 7
Albanie [142] 7 Video frames Sound One 7 Logits 7
Gupta [143] 7 RGN images Depth images One 7 Soft labels 7
Scene
Salem [144] 7 classification, Localization Three 7 Soft labels 7
object detection
Thoker [145] 7 RGB video Skeleton data One 7 Logits 7
Confidence
Zhao [146] 7 RGB frames Heatmaps One 7 maps
7
Owens [147] 7 Sound Video frames One 7 Soft labels 7
Arandjelovic [148] 7 Video frames Audio One 7 Features 7
Image,
Do [149] X Questions, Image questions Three 7 Logits X
Answer info.
Sound,
Aytar [150] X Image Image, Text
Three 7 Features 7
Kim [151] 7 Sound/images Images/sound One 7 Features 7
Dou [152] 7 CT images MRI images One X Logits X
Hafner [153] 7 Depth images RGB images One 7 Embeddings 7
Feature
Gan [154] X Video frame Sound One 7 soft labels
7

RGB video
Perez [155] X Acoustic images
Audio One 7 Soft labels 7

cross-domain KD is more scalable, and can be easily applied memory-intensive. Thus, it would be better if an online KD
to new distillation tasks. Moreover, it is advantageous for strategy is considered. Lastly, some works (e.g., [144, 150])
learning multiple modalities of data ‘in the wild’, since it learn a student model using the knowledge from multiple
is relatively easy to get data with one modality based on teachers. However, the student is less versatile or modality-
other data. In visual applications, cross-modal KD has the dependent. Inspired by the analysis of Sec. 4.2.5, we open a
potential to distill knowledge among images taken from research question: Is it possible to learn a versatile student
different types of cameras. For instance, one can distill that can perform tasks from multiple modalities?
knowledge from an RGB image to event streams (stacked
event images from event cameras) [137, 158].
6 O NLINE AND T EACHER - FREE DISTILLATION
Open challenges: Since the knowledge is the transferred
representations (e.g., logits, features) of teacher models, 6.1 Online distillation
ensuring the robustness of the transferred knowledge is Overall insight: With the absence of a pre-trained powerful
crucial. We hope to transfer the good representations, but teacher, simultaneously training a group of student models by
negative representations do exist. Thus, it is imperative that learning from peers’ predictions is an effective substitute for two-
the supervision provided by the teachers is complementary stage (offline) KD
to the target modality. Moreover, existing cross-modal KD In this section, we provide a deeper analysis of online
methods are highly dependent on data sources (e.g., video, (one-stage) KD methods in contrast to the previously dis-
images), but finding data with paired (e.g., RGB image with cussed offline (two-stage) KD methods. Offline KD methods
depth pair) or multiple modalities (class labels, bounding often require pre-trained high-capacity teacher models to
boxes and segmentation labels) is not always an easy task. perform one-way transfer [1, 8, 37, 54, 91, 111, 159, 160, 161].
We are compelled to ask if it is possible to come up with However, it is sometimes difficult to get such ‘good’ teach-
a way for data-free distillation or distillation with a few ers, and the performance of the student gets degraded when
examples? In other words, is it possible to just learn a the gap of network capacity between the teacher and the
student model with the data from the target modality based student is significant. In addition, two-stage KD requires
on the knowledge of the teacher, without referencing the many parameters, resulting in higher computation costs. To
source modality? overcome these difficulties, some studies focus on online
Moreover, existing cross-modal KD methods are KD that simultaneously trains a group of student peers by
mostly offline methods, which are computation-heavy and learning from the peers’ predictions.

Fig. 6. An illustration of online KD methods: (a) online KD with individual student peers learning from each other; (b) online KD with student peers sharing a trunk (head) structure; (c) online KD by assembling the weights of each student peer to form a teacher or group leader.
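A minimal sketch of the shared-trunk, multi-branch design in Fig. 6(b)-(c) is shown below: several peer heads share one feature extractor, and their aggregated logits can serve as an on-the-fly group leader. The backbone split, the number of peers, and the plain averaging are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiBranchPeers(nn.Module):
    """Shared trunk with several peer classifier branches (cf. Fig. 6(b)-(c))."""

    def __init__(self, trunk, feat_dim, num_classes, num_peers=3):
        super().__init__()
        self.trunk = trunk  # shared low-level layers (the "head"/trunk)
        self.peers = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_peers)]
        )

    def forward(self, x):
        feat = self.trunk(x)
        peer_logits = [branch(feat) for branch in self.peers]
        # Simple average of the peers' logits as the "group leader";
        # gating or attention-based weights could be used instead.
        ensemble_logits = torch.stack(peer_logits, dim=0).mean(dim=0)
        return peer_logits, ensemble_logits
```

During training, each peer would typically combine a cross-entropy loss on the ground truth with a distillation loss toward the ensemble prediction, in the spirit of Eqn. 16.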

6.1.1 Individual student peers 6.1.4 Summary and open challenges


Zhang et al. [34], Gao et al. [160] and Anil et al. [101] focus Summary: Based on the above analysis, we have deter-
on online mutual learning [34] (also called codistilation) in mined that codistillation, multi-architectures, and ensemble
which a pool of untrained student networks with the same learning are three main techniques for online distillation.
network structure simultaneously learns the target task. In There are some advantages of online KD compared with
such a peer-teaching environment, each student learns the offline KD. Firstly, online KD does not require pre-training
average class probabilities from the other (see Fig. 6(a)). teachers. Secondly, online learning provides a simple but
However, Chung et al. [64] also employ individual students, effective way to improve the learning efficiency and gener-
and additionally design a feature map-based KD loss via alizability of the network, by training together with other
adversarial learning. Hou et al. [111] proposed DualNet, student peers. Thirdly, online learning with student peers
where two individual student classifiers were fused into one often results in better performance than offline learning.
fused classifier. During training, the two student classifiers Open challenges: There are some challenges in online KD.
are locally optimized, while the fused classifier is globally Firstly, there is a lack of theoretical analysis for why online
optimized as a mutual learning method. Other methods learning is sometimes better than offline learning. Secondly,
such as [159, 162], focus on online video distillation by in online ensemble KD, simply aggregating students’ logits
periodically updating the weights of the student, based on to form an ensemble teacher restrains the diversity of stu-
the output of the teacher. Although codistillation achieves dent peers, thus limiting the effectiveness of online learning.
parallel learning of students, [34, 64, 101, 111] do not con- Thirdly, existing methods are confined problems in which
sider the ensemble of peers’ information as done in other ground truth (GT) labels exist (e.g., classification). However,
works such as [50, 160]. for some problems (e.g., low-level vision problems), ways
for the student peers to form effective ensemble teachers
need to be exploited.
6.1.2 Sharing blocks among student peers
Considering the training cost of employing individual stu- 6.2 Teacher-free distillation
dents, some works propose sharing network structures (e.g.,
head sharing) of the students with branches as shown in Overal insight: Is it possible to enable the student to distill
Fig. 6(b). Song et al. [41] and Lan et al. [40] build the student knowledge by itself to achieve plausible performance?
peers on multi-branch architectures [131]. In such a way, all The conventional KD approaches [1, 52, 57, 61, 110]
structures together with the shared trunk layers (often use still have many setbacks to be tackled, although signif-
head layers) can construct individual student peers, and any icant performance boost has been achieved. First of all,
target student peer network in the whole multi-branch can these approaches have low efficiencies, since student mod-
be optimized. els scarcely exploit all knowledge from the teacher mod-
els. Secondly, designing and training high-capacity teacher
models still face many obstacles. Thirdly, two-stage distilla-
6.1.3 Ensemble of student peers tion requires high computation and storage costs. To tackle
While using codistillation and multi-architectures can facil- these challenges, several novel self-distillation frameworks
itate online distillation, knowledge from all student peers [25, 30, 80, 163, 164, 165, 166, 167, 168, 169, 170, 171] have
is not accessible. To this end, some studies [40, 50, 110, been proposed recently. The goal of self-distillation is to
160, 161] proposed using the assembly of knowledge (logits learn a student model by distilling knowledge in itself with-
information) of all student peers to build an on the fly out referring to other models. We now provide a detailed
teacher or group leader, which is in turn distilled back to analysis of the technical details for self-distillation.
all student peers to enhance student learning in a closed-
loop form, as shown in Fig. 6(c). Note that in ensemble 6.2.1 Born-again distillation
distillation, the student peers can either be independent, Insight: Sequential self-teaching of students enables them to
or share the same head structure (trunk). The ensemble become masters, and outperform their teachers significantly.
distillation loss is given by Eqn. 12 of Sec. 4.2, where a Furlanello et al. [30] initializde the concept of self-
gating component gi is added to balance the contribution of distillation, in which the students are parameterized iden-
each student. Chen et al. [50] obtain the gating component tically to their teachers, as shown in Fig. 7(a). Through
gi based on the self-attention mechanism [72]. sequential teaching, the student is continuously updated,

Fig. 7. An illustration of self-distillation methods. (a) born-again distillation. Note that T and S1 , · · · , Sn can be multi-tasks. (b) distillation via ‘deep’
supervision where the deepest branch (Bn ) is used to distill knowledge to shallower branches. (c) distillation via data augmentation (e.g., rotation,
cropping). (d) distillation with network architecture transformation (e.g., changing convolution filters).

TABLE 6
A taxonomy of self-distillation methods. Logits and hints indicate the knowledge to be distilled. ‘Deep’ supervision is for self-distillation from
deepest branch (or layer) of the student network. One-stage KD is checking whether self-distillation is achieved in one step. X/ 7is for yes/no.
Data ’Deep’ Architecture
Method Logits Hints augmentation supervision
One-stage KD Multi-task KD
transformation
Clark [80] X 7 7 7 7 X 7
Chowley [167] 7 Attention map 7 7 7 7 X
Furlanello [30] X 7 7 7 7 7 7
Hahn [165] X 7 7 7 7 7 7
Hou [169] 7 Attention maps 7 X X 7 7
Luan [170] X Feature maps 7 X X X 7
Lee [166] X 7 X 7 7 7 7
Xu [164] X Feature maps X 7 X 7 7
Zhang [168] X Feature maps 7 X X 7 7
Yang [25] X 7 7 7 7 7 7
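To make the 'deep' supervision scheme in Fig. 7(b) concrete, the sketch below lets shallower exits mimic the softened output of the deepest exit; the number of exits, the temperature, and the loss weighting are illustrative assumptions rather than the exact settings of [168, 169, 170].

```python
import torch
import torch.nn.functional as F

def deep_supervision_self_kd_loss(exit_logits, labels, tau=3.0, alpha=0.5):
    """Self-distillation from the deepest exit to shallower exits (cf. Fig. 7(b)).

    exit_logits: list of [B, C] logits ordered from shallowest to deepest.
    """
    deepest = exit_logits[-1]
    soft_target = F.softmax(deepest.detach() / tau, dim=1)

    loss = F.cross_entropy(deepest, labels)  # supervise the deepest exit
    for logits in exit_logits[:-1]:          # shallower exits
        ce = F.cross_entropy(logits, labels)
        kd = F.kl_div(F.log_softmax(logits / tau, dim=1),
                      soft_target, reduction="batchmean") * (tau ** 2)
        loss = loss + alpha * ce + (1.0 - alpha) * kd
    return loss
```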

and at the end of the procedure, additional performance 6.2.3 Distillation based on data augmentation
gains are achieved by an ensemble of multiple student
Insight: Data augmentation (e.g., rotation, flipping, cropping,
generations. Hahn et al. [165] apply born-again distillation
etc) during training forces the student network to be invariant to
[30] to natural language processing. Yang et al. [25] observe
augmentation transformations via self-distillation.
that it remains unclear how S-T optimization works, and
Although most methods focus on how to better su-
they then focus on putting strictness (adding an extra term
pervise student in self-distillation, data representations for
to the standard cross-entropy loss) to the teacher model,
training the student are not fully excavated and utilized. To
such that the student can better learn inter-class similarity,
this end, Xu et al. [164] and Lee et al. [166] focus on self-
and potentially prevent over-fitting. Instead of learning a
distillation via data augmentation of the training samples,
single task, Clark et al. [80] extend [30] to the multi-task
as shown in Fig. 7(c). There are some advantages to such a
setting, where single-task models are distilled sequentially
framework. First, it is efficient and effective to optimize a
to teach a multi-task model. Since the born-again distillation
single student network without branching or the assistance
approach is based on the multi-stage training, it is less
of other models. Second, with data-to-data self-distillation,
efficient and computation-heavy compared to the following
the student learns more inherent representations for gener-
methods.
alization. Third, the performance of the student model is
significantly enhanced with relatively low computation cost
6.2.2 Distillation via ‘deep’ supervision and memory load.
Insight: The deeper layers (or branches) in the student model Xu et al. [164] apply random mirror and cropping to the
contains more useful information than those of shallower layers. batch images from the training data. Besides, inspired by
Among the methods, Hou et al. [169], Luan et al. [170] mutual learning [34], the last feature layers and softmax
and Zhang et al. [168] propose similar approaches where outputs of the original batch image and distorted batch
the target network (student) is divided into several shallow images are mutually distilled via MMD loss [55] and KL
sections (branches) according to their depths and original divergence loss, respectively. In contrast, Lee et al. [166]
structures (see Fig. 7(b)). As the deepest section may contain consider two types of data augmentation (rotation and color
more useful and discriminative feature information than permutation to the same image), and the ensemble method
shallower sections, the deeper branches can be used to distill used in [40, 50, 85] is employed to aggregate all logits of the
knowledge to the shallower branches. In contrast, in [169], student model to one, which is in turn is used by the student
instead of directly distilling features, attention-based meth- to transfer the knowledge to itself.
ods used in [36] are adopted to force shallower layers to
mimic the attention maps of deeper layers. Luan et al. [170]
6.2.4 Distillation with architecture transformation
make each layer branch (ResNet block) a classifier. Thus,
the deepest classifier is used to distill earlier the classifiers’ Insight: A student model can be derived by changing the convolu-
feature maps and logits. tion operators in the teacher model with any architecture change.

In contrast with all the above-mentioned self-distillation [137, 138, 139, 143, 149, 150]. While using labels expand the
methods, Crowley et al. [167] proposes structure model generalization capability of knowledge for learning student
distillation for memory reduction by replacing standard network, such approaches fail when labels are scarce or
convolution blocks with cheaper convolutions, as shown in unavailable.
Fig. 7(d). In such a way, a student model that is a simple
transformation of the teacher’s architecture is produced. 7.1.2 KD with pseudo labels.
Then, attention transfer (AT) [55] is applied to align the Some works also exploit the pseudo labels. The most com-
teacher’s attention map with that of the student’s. mon methods can be discomposed into two groups. The
first one aims to create noisy labels. [20, 21, 22, 23] pro-
6.2.5 Summary and open challenges pose to leverage large number of noisy labels to augment
Summary: In Table. 6, we summarize and compare different small amount of clean labels, which turns to improve the
self-distillation approaches. Overall, using logits/feature in- generalization and robustness of student network. The sec-
formation and two-stage training for self-distillation with ond group of methods focus on creating pseudo labels via
‘deep’ supervision from the deepest branch are main metadata [112], class similarities [115] or generating labels
stream. Besides, data augmentation and attention-based [117, 118], etc.
self-distillation approaches are promising. Lastly, it is shown
that multi-task learning with self-distillation is also a valu- 7.2 Label-free distillation
able direction, deserving more research. However, in real-world applications, labels for the data used
Open challenges: There still exist many challenges to in the student network is not always easy to obtain. Hence,
tackle. First, theoretical support laks in explaining why some attempts have been taken for label-free distillation. We
self-distillation works better. Mobahi et al. [163] provide now provide more detailed analysis for these methods.
theoretical analysis for born-again distillation [30] and find
that self-distillation may reduce over-fitting by loop-over 7.2.1 KD with dark knowledge.
training, thus leading to good performance. However, it is This has inspired some works to exploit KD without the
still unclear why other self-distillation methods (e.g., online requirement of labels. Based on our review, label-free dis-
‘deep’ supervision [168, 169, 170]) work better. tillation is mostly achieved in cross-modal learning, as dis-
In addition, existing methods focus on self-distillation cussed in Sec. 5.3. With paired modality data (e.g., video
with certain types of group-based network structures (e.g., and audio), where the label of modality (e.g., video) is
ResNet group). Thus, the generalization and flexibility of available, the student learns the end tasks only based on
self-distillation methods need to be probed further. Lastly, the distillation loss in Eq.2 [136, 141, 144, 147, 148, 153]. That
all existing methods focus on classification-based tasks, and is, in this situation, the dark knowledge of teacher provides
it is not clear whether self-distillation is effective for other ‘supervision’ for the student network.
tasks (e.g., low-level vision tasks).
7.2.2 Creating meta knowledge.
7 L ABEL - REQUIRED /- FREE DISTILLATION Recently, a few methods [117, 118, 120] propose data-/label-
Overall Insight: It is possible to learn a student without refer- free frameworks for KD. The core technique is to craft
ring to the labels of training data? samples with labels by using either feature or logits informa-
tion, which are also called meta knowledge. Although these
7.1 Label-required distillation methods point out an interesting direction for KD, there still
exist many challenges to achieve reasonable performance.
The success of KD relies on the assumption that labels
provide the required level of semantic description for the
task at hand [1, 4]. For instance, in most existing KD meth- 7.3 Potential and challenges.
ods for classification-related tasks [1, 8, 9, 28, 48, 131, 172], Label-free distillation is a promising since it relieves the
image-level labels are required for learning student net- need for data annotation. However, the current status of
work. Meanwhile, some works exploit the pseudo labels research shows that there still exist many uncertainties and
when training data are scarce. We now provide a systematic challenges in this direction. The major concern is how to
analysis for these two types of methods. ensure that the ‘supervision’ provided by teacher is reliable
enough. As some works interpret the knowledge as a way
7.1.1 KD with original labels. a label regularization [172] or class similarities [177], it is
Using the ground truth labels for the data used in the crucial to guarantee that the knowledge can be captured by
student network is the common way for KD. As depicted the student.
in Eq.2, the overall loss function is composed of the student Another critical challenge of label-required distillation is
loss and the distillation loss. The student loss is heavily that, in Eq. 2, the KD loss term never involves any label
dependent on the ground truth label y . Following this information although the student loss (e.g., cross-entropy
fashion, main-stream methods mostly utilize the original loss) uses labels. As labels provide informative knowledge
labels and design better distillation loss terms to achieve for the student learning, it is worthwhile to find a way to use
better performances [50, 55, 173, 174, 175, 176]. This con- labels for the KD loss to further improve the performance.
vention has also been continually adopted in recent KD While it is generally acknowledged that a pretrained teacher
methods, such as online distillation [50, 159], teacher-free has already mastered sufficient knowledge about the label
distillation [82, 154, 168], and even cross-modal learning information, its predictions still have a considerable gap

with the ground truth labels. Based on our literature review, of the generator is to generate realistic data, given the con-
there exists some difficulties to bring label information to the ditional information. Mathematically, the objective function
distillation loss. That is, to bring the label information, the can be written as:
teacher might need to be updated or fine-tuned, which may cause additional computation cost. However, with some recent attempts based on meta learning or continual learning, it is possible to learn the label information with only a few examples. Besides, it might be possible to learn a bootstrapped representation based on the labels, as done in [178], and further incorporate this information into the KD loss. We believe this direction is promising for real-world applications and thus expect future research to move towards it.

8 KD WITH NOVEL LEARNING METRICS

8.1 Distillation via adversarial learning

Overall Insight: GANs can help learn the correlation between classes and preserve the multi-modality of the S-T framework, especially when the student has relatively small capacity.

In Sec. 4.1, we discussed the two most popular approaches for KD. However, the key problem is that it is difficult for the student to learn the true data distribution from the teacher, since the teacher cannot perfectly model the real data distribution. Generative adversarial networks (GANs) [78, 123, 124, 158, 180] have been proven to have potential for learning the true data distribution in image translation. To this end, recent works [62, 64, 116, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199] have explored adversarial learning to improve the performance of KD. These works are, in fact, built on three fundamental prototypes of GANs [20, 78, 179]. Therefore, we formulate the principles of these three types of GANs, as illustrated in Fig. 8, and analyze the existing GAN-based KD methods.

8.1.1 A basic formulation of GANs in KD

The first type of GAN, as shown in Fig. 8(a), is proposed to generate continuous data by training a generator G and a discriminator D, which penalizes the generator G for producing implausible results. The generator G produces synthetic examples G(z) (e.g., images) from random noise z sampled from a specific distribution (e.g., normal) [78]. These synthetic examples are fed to the discriminator D along with real examples sampled from the real data distribution p(x). The discriminator D attempts to distinguish the two inputs, and both the generator G and the discriminator D improve their respective abilities in a minmax game until the discriminator is unable to distinguish the fake from the real. The objective function can be written as follows:

min_G max_D J(G, D) = E_{x∼p(x)}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))]    (19)

where p(z) is the distribution of the noise (e.g., uniform or normal).

The second type of GAN for KD is built on the conditional GAN (CGAN) [123, 124, 179, 200], as shown in Fig. 8(b). A CGAN is trained to generate samples from a class-conditional distribution c; the generator is conditioned on useful information rather than on random noise alone. Hence, the objective can be formulated as:

min_G max_D J(G, D) = E_{x∼p(x)}[log D(x|c)] + E_{z∼p(z)}[log(1 − D(G(z|c)))]    (20)

Unlike the above-mentioned GANs, triple-GAN [20] (the third type) introduces a three-player game consisting of a classifier C, a generator G, and a discriminator D, as shown in Fig. 8(c). Adversarial learning of generators and discriminators overcomes some difficulties [78], such as the lack of good optimization and the failure of the generator to control the semantics of the generated samples. We assume that there is a pair of data (x, y) from the true distribution p(x, y). After a sample x is drawn from p(x), C assigns a pseudo label y following the conditional distribution p_c(y|x); that is, C characterizes the conditional distribution p_c(y|x) ≈ p(y|x). The aim of the generator is to model the conditional distribution in the other direction, p_g(x|y) ≈ p(x|y), while the discriminator determines whether a pair of data (x, y) comes from the true distribution p(x, y). Thus, the minmax game can be formulated as:

min_{C,G} max_D J(C, G, D) = E_{(x,y)∼p(x,y)}[log D(x, y)] + α E_{(x,y)∼p_c(x,y)}[log(1 − D(x, y))] + (1 − α) E_{(x,y)∼p_g(x,y)}[log(1 − D(G(y, z), y))]    (21)

where α is a hyper-parameter that controls the relative importance of C and G.

8.1.2 How does GAN boost KD?

Based on the aforementioned formulations of GANs, we analyze how they are applied to boost the performance of KD with S-T learning.

KD based on the conventional GAN (first type): Chen et al. [116] and Fang et al. [117] focused on distilling the knowledge of logits from the teacher to the student via the first type of GAN, as depicted in Fig. 8(a) (note that [116, 117] are data-free KD methods, which are discussed in detail in Sec. 5.1). There are several benefits of judging logits with a discriminator. First, the learned loss, as described by Eqn. 19, can be effective in image translation tasks [123, 124, 200]. The second benefit is closely related to the multi-modality of the network output: it is not necessary to exactly mimic the output of one teacher network to achieve good student performance, as is usually done [1, 52]. However, low-level feature alignment is missing because the discriminator only captures the high-level statistics of the teacher and student outputs (logits).

In contrast, Belagiannis et al. [183], Liu et al. [185], Hong et al. [192], Aguinaldo et al. [198], Chung et al. [64], Wang et al. [194], Wang et al. [186], Chen et al. [203], and Li et al. [199] aimed to distinguish whether the features come from the teacher or the student via adversarial learning, which effectively pushes the two distributions close to each other (note that in [64] the least squares GAN (LSGAN) [201] loss was used, and in [194] the Wasserstein GAN gradient penalty (WGAN-GP) [202] loss was used to stabilize training). The features of the teacher and student are used as inputs to the discriminator because of their dimensionality.
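To make the formulations above concrete, the following is a minimal PyTorch-style sketch of logit-level adversarial KD in the spirit of the first GAN type: a discriminator judges whether a logit vector comes from the teacher or the student, and the student is trained to fool it while also minimizing conventional KD and cross-entropy losses. All module names, architecture sizes, and loss weights are illustrative assumptions of this sketch, not the implementation of any cited work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LogitDiscriminator(nn.Module):
    # Judges whether a logit vector looks like the teacher's (real) or the student's (fake).
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_classes, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, logits):
        return self.net(logits)

def adversarial_kd_step(student, teacher, discriminator, x, y,
                        opt_student, opt_disc, T=4.0, alpha=0.5, beta=0.1):
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(x)
    s_logits = student(x)

    # (1) Discriminator step (cf. Eqn. 19): teacher logits are treated as real, student logits as fake.
    d_real = discriminator(t_logits)
    d_fake = discriminator(s_logits.detach())
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # (2) Student (generator) step: fool the discriminator + standard soft-label KD + cross-entropy.
    d_fool = discriminator(s_logits)
    adv_loss = F.binary_cross_entropy_with_logits(d_fool, torch.ones_like(d_fool))
    kd_loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                       F.softmax(t_logits / T, dim=1),
                       reduction='batchmean') * T * T
    s_loss = F.cross_entropy(s_logits, y) + alpha * kd_loss + beta * adv_loss
    opt_student.zero_grad(); s_loss.backward(); opt_student.step()
    return d_loss.item(), s_loss.item()

The same structure applies to the feature-based variants discussed above by feeding (possibly pooled) intermediate features rather than logits to the discriminator.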

Fig. 8. An illustration of GAN-based KD methods. (a) KD based on GAN [78], where the discriminator D discerns the features/logits of T and S; (b) KD based on conditional GAN (CGAN) [179], where the input also functions as a condition to D; (c) KD based on TripleGAN [180], where the classifier C, the teacher T, and D play a minmax game.

TABLE 7
Taxonomy of KD based on adversarial learning.

Method | GAN type | Purpose | Inputs of D | Number of D | Online KD
Chen [116] | First type | Classification | Logits | One | No
Belagiannis [183] | First type | Classification | Features | One | No
Liu [185] | First type | Classification, object detection | Features | One | No
Hong [192] | First type | Object detection | Features | Six | No
Wang [186] | First type | Classification | Features | One | No
Aguinaldo [198] | First type | Classification | Features | One | No
Chung [64] | First type (LSGAN [201]) | Classification | Features | Two/Three | Yes
Wang [194] | First type (WGAN-GP [202]) | Image generation | Features | One/Multiple | Yes
Chen [203] | First/Second type | Image translation | Features | Two | No
Liu [197] | Second type (WGAN-GP [202]) | Semantic segmentation | Features | One | No
Xu [182] | Second type | Classification | Logits | One | No
Roheda [187] | Second type | Cross-domain surveillance | Features | One | Yes
Zhai [193] | Second type (BicycleGAN [204]) | Image translation | Features | One | Yes
Liu [184] | Second type (AC-GAN [205]) | Image translation | Features | One | No
Wang [181] | Third type | Image translation | Features | One | No
Li [199] | First/Second type | Image translation | Features | One | No
Fang [117] | First type | Classification, semantic segmentation | Logits | One | No
Yoo [119] | Second type | Classification | Logits | One | No

The feature representations extracted from the teacher are high-level, abstract information that is easy to classify, which lowers the probability of the discriminator making a mistake [185]. However, GAN training in this setting is sometimes unstable and even difficult to converge, particularly when the capacity gap between the student and the teacher is large. To address this problem, regularization techniques such as dropout [24] or l2/l1 regularization [183] are added to Eqn. 19 to confine the weights.

KD based on CGAN (second type): Xu et al. [206] and Yoo et al. [119] employed CGAN [179] for KD, where the discriminator was trained to distinguish whether the label distribution (logits) came from the teacher or the student. The student, regarded as the generator, was adversarially trained to deceive the discriminator. Liu et al. [184] also exploited CGAN for compressing image generation networks; however, their discriminator predicted the class labels of the teacher and student outputs, together with an auxiliary classifier GAN [205].

In contrast, Roheda et al. [187], Zhai et al. [193], Li et al. [199], Chen et al. [203], and Liu et al. [197] focused on discriminating the feature spaces of the teacher and student in the CGAN framework. Interestingly, Chen et al. [203] deployed two discriminators, namely the teacher and student discriminators, for compressing image translation networks. To avoid mode collapse, Liu et al. [197] used the Wasserstein loss [202] to stabilize training.

KD based on TripleGAN (third type): In contrast to the distillation methods based on the conventional GAN and CGAN, Wang et al. [181] proposed a three-player game named KDGAN, consisting of a classifier (the student), a teacher, and a discriminator (similar to the prototype of TripleGAN [20]), as shown in Fig. 8(c). The classifier and the teacher learn from each other via distillation losses, and are adversarially trained against the discriminator via the adversarial loss defined in Eqn. 21. By simultaneously optimizing the distillation and adversarial losses, the classifier (student) learns the true data distribution at equilibrium.

8.1.3 Summary and open challenges

In Table 7, we summarize existing GAN-based knowledge distillation methods with respect to their practical applications, the input features of the discriminator D, the number of discriminators used, and whether the method is one-stage (i.e., without the need for the teacher to be trained first). In general, most methods focus on classification tasks based on the first type of GAN (conventional GAN) [78] and use features as the inputs to the discriminator D.

Besides, it is worth noting that most methods use only one discriminator for discerning the student from the teacher; however, some works such as [64], [194], and [203] employ multiple discriminators in their KD frameworks. One can also see that most methods follow a two-stage KD paradigm in which the teacher is trained first and knowledge is then transferred to the student via a KD loss. In contrast, studies such as [64, 187, 193, 194] exploit online (one-stage) KD, without the necessity of pre-trained teacher networks. More detailed analyses of KD methods with respect to online/two-stage distillation and image translation are given in Sec. 6.1 and Sec. 9.5, respectively.

Open challenges: The first challenge for GAN-based KD is the stability of training, especially when the capacity gap between the teacher and the student is large. Secondly, it is less intuitive whether using only logits, only features, or both as inputs to the discriminator is better, because theoretical support is lacking. Thirdly, the advantages of using multiple discriminators are unclear, and which features at which positions are suitable for training the GAN also needs to be further studied.

8.2 Distillation with graph representations

Overall insight: Graphs are the most typical locally connected structures that capture the features and hierarchical patterns for KD.

Up to now, we have categorized and analyzed the most common KD methods using either logits or feature information. However, one critical issue regarding KD is data. In general, training a DNN requires embedding a high-dimensional dataset to facilitate data analysis. Thus, the optimal goal of training a teacher model is not only to transform the training dataset into a low-dimensional space, but also to analyze the intra-data relations [207, 208]. However, most KD methods do not consider such relations. Here, we introduce the definitions of the basic concepts of graph embedding and knowledge graphs based on [209, 210]. We then provide an analysis of existing graph-based KD methods and discuss new perspectives on KD.

8.2.1 Notation and definition

TABLE 8
A summary of notations used in Sec. 8.2.

Notation | Description
|·| | The cardinality of a set
G = (V, E) | Graph G with a set of nodes V and a set of edges E
vi, eij | A node vi ∈ V and an edge eij linking vi and vj
xvi, xe[vi] | Features of vi and features of the edges of vi
hne[vi], xne[vi] | States and features of the neighboring nodes of vi
Fv(vi), Fe(eij) | Mapping of the node type of vi and the edge type of eij
Tv, Te | The set of node types and the set of edge types
< h, r, t > | Head, relation, and tail in a knowledge graph
N | Number of nodes in the graph
hvi | Hidden state of the i-th node vi
ft, fo | Local transition and output functions
Ft, Fo | Global transition and output functions
H, O, X | Stacks of all hidden states, outputs, and features
Ht | Hidden state H at the t-th iteration

Definition 1. A graph can be depicted as G = (V, E), where v ∈ V is a node and e ∈ E is an edge. A graph G is associated with a node type mapping function Fv : V → T_v and an edge type mapping function Fe : E → T_e.

Here, T_v and T_e denote the node types and edge types, respectively. For any vi ∈ V, there exists a particular mapping type: Fv(vi) ∈ T_v. A similar mapping applies to any eij ∈ E, which is mapped as Fe(eij) ∈ T_e, where i and j indicate the i-th and j-th nodes.

Definition 2. A homogeneous graph (directed graph), depicted as G_hom = (V, E), is a type of graph in which |T_v| = |T_e| = 1. All nodes and edges in this graph embedding are of one type.

Definition 3. A knowledge graph, defined as G_kn = (V, E), is an instance of a directed heterogeneous graph whose nodes are entities and whose edges are subject-property-object triplets. Each edge has the form head entity, relation, tail entity, denoted as < h, r, t >, indicating a relationship from a head h to a tail t.

Here, h, t ∈ V are entities and r ∈ E is the relation, and we refer to < h, r, t > as a triplet of the knowledge graph. An example is shown in Fig. 9; this knowledge graph includes two triplets < h, r, t >: < LosAngeles, IsCityOf, California > and < California, IsStateOf, US >.

Fig. 9. An example of a knowledge graph.

Graph neural networks. A graph neural network (GNN) is a type of DNN that operates directly on the graph structure. A typical application is node classification [211]. In the node classification problem, the i-th node vi is characterized by its feature xvi and its ground truth tvi. Thus, given a labeled graph G, the goal is to leverage the labeled nodes to predict the labels of the unlabeled ones. The GNN learns to represent each node with a d-dimensional vector state hvi containing the information of its neighborhood. Specifically, hvi can be mathematically described as [174]:

h_{vi} = f_t(x_{vi}, x_{co[vi]}, h_{ne[vi]}, x_{ne[vi]})    (22)
o_{vi} = f_o(h_{vi}, x_{vi})    (23)

where x_{co[vi]} denotes the features of the edges connected with vi, h_{ne[vi]} denotes the states (embeddings) of the neighboring nodes of vi, and x_{ne[vi]} denotes the features of the neighboring nodes of vi. The function f_t is a transition function that projects these inputs onto a d-dimensional space, and f_o is the local output function that produces the output. Note that f_t and f_o can be interpreted as feedforward neural networks. If we denote by H, O, X, and X_N the vectors constructed by stacking all the states, all the outputs, all the features, and all the node features, respectively, then H and O can be formulated as:

H = F_t(H, X)    (24)
O = F_o(H, X)    (25)

where F_t is the global transition function and F_o is the global output function. Note that F_t and F_o are the stacked versions of f_t and f_o, respectively, over all nodes V in the graph.
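As a concrete illustration of Eqns. 22-25, the following is a minimal sketch of a GNN that iterates a learned transition function over neighbor states (here with a simple mean aggregator over an adjacency matrix) and reads out per-node outputs; the aggregator, dimensions, and number of iterations are illustrative assumptions of this sketch.

import torch
import torch.nn as nn

class SimpleGNN(nn.Module):
    def __init__(self, feat_dim, state_dim, out_dim, num_iters=3):
        super().__init__()
        # f_t and f_o of Eqns. 22-23, implemented as small feedforward networks
        self.f_t = nn.Sequential(nn.Linear(feat_dim + state_dim, state_dim), nn.Tanh())
        self.f_o = nn.Linear(feat_dim + state_dim, out_dim)
        self.state_dim = state_dim
        self.num_iters = num_iters

    def forward(self, X, A):
        # X: (N, feat_dim) node features; A: (N, N) 0/1 adjacency matrix.
        H = X.new_zeros(X.size(0), self.state_dim)              # initial state H(0)
        deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)
        for _ in range(self.num_iters):
            neighbor_state = (A @ H) / deg                      # aggregate h_{ne[v_i]} over neighbors
            H = self.f_t(torch.cat([X, neighbor_state], dim=1)) # H = F_t(H, X), Eqn. 24
        O = self.f_o(torch.cat([X, H], dim=1))                  # O = F_o(H, X), Eqn. 25
        return H, O

# For node classification, f_t and f_o are trained with a supervised loss on the labeled nodes, e.g.:
#   H, O = gnn(X, A); loss = torch.nn.functional.cross_entropy(O[labeled_idx], labels[labeled_idx])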

Since we aim to obtain a unique solution for h_{vi}, a neighborhood aggregation algorithm is applied in [174, 211], such that:

H^{t+1} = F_t(H^t, X)    (26)

where H^t denotes the t-th iteration of H. Given any initial state H(0), H^{t+1} in Eqn. 26 converges exponentially to the solution of Eqn. 24. Based on this framework, f_t and f_o can be optimized via a supervised loss when the target information t_{vi} is known:

L = Σ_{i=1}^{N} (t_{vi} − o_{vi})    (27)

where N is the total number of supervised nodes in the graph.

8.2.2 Graph-based distillation

Based on the above explanation of the fundamentals of graph representations and GNNs, we now delve into the existing graph-based distillation techniques. To our knowledge, Liu et al. [216] first introduced a graph-modeling approach for visual recognition tasks in videos. Action models in videos are initially modeled as bags of visual words (BoVW), which are sensitive to visual changes. However, some higher-level features are shared across views and enable connecting the action models of different views. To better capture the relationship between two vocabularies, they construct a bipartite graph G = (V, E) to partition them into visual-word clusters. Note that V is the union of the vocabularies V1 and V2, and E are the weights attached to the nodes. In this way, knowledge from BoVW can be transferred to visual-word clusters, which are more discriminative in the presence of view changes. Luo et al. [217] consider incorporating rich privileged information from a large-scale multimodal dataset in the source domain to improve learning in the target domain, where training data and modalities are scarce. Regarding the use of S-T structures for graph-based KD, to date there are several works, such as [173, 207, 208, 212, 213, 214, 215, 218].

GKD [212] and IRG [208] consider the geometry of the respective feature spaces by reducing intra-class variations, which allows for a dimension-agnostic transfer of knowledge. This perspective is the opposite of Liu et al. [208] and RKD [28]. Specifically, instead of directly exploring the mutual relation between data points of students and teachers, GKD [212] regards this relation as a geometry of the data space (see Fig. 10(a)). Given a batch of inputs X, we can compute the inner representations X^S_l = [x^S_l], x ∈ X, and X^T_l = [x^T_l], x ∈ X, at layer l (l ∈ Λ) of the student and teacher networks. Using the cosine similarity metric, these representations can be used to build a k-nearest-neighbor similarity graph for the teacher, G^T_l(X) = < X^T_l, W^T_l >, and for the student, G^S_l(X) = < X^S_l, W^S_l >. Note that W^T_l and W^S_l are the edge weights, which represent the similarity between the i-th and j-th elements of X^T_l and X^S_l. Based on the graph representations of both the teacher and the student, the KD loss in Eqn. 10 can be updated as follows:

L = Σ_{l∈Λ} D(G^S_l(X), G^T_l(X))    (28)

where the distance metric D is based on the L2 distance.

IRG [208] is essentially similar to GKD [212] in the construction of the graph; however, IRG also takes instance-graph transformations into account. The motivation for introducing feature-space transformations across layers is that constraints fitted directly to the teacher's instance features at intermediate layers may be too tight or too dense. The transformation of the instance relation graph is composed of a vertex transformation and an edge transformation from the l1-th layer to the l2-th layer, as shown in Fig. 10(b). Thus, the loss in Eqn. 28 can be extended to:

L = Σ_{l∈Λ} [ D1(G^S_l(X), G^T_l(X)) + D2(Θ^T(G^S_l(X)), Θ^S(G^T_l(X))) ]    (29)

where Θ^T and Θ^S are the transformation functions for the teacher and the student, respectively, and D1 and D2 are the distance metrics for instance relations and instance transformations.

MHKD [207] is a method that enables distilling data-based knowledge from a teacher network to a graph using an attention network (see Fig. 10(d)). Like IRG [208], feature transformation is also considered to capture the intra-data relations. The KD loss is based on the KL divergence between the embedded graphs of the teacher and the student. KTG [213] also exploits graph representations; however, it focuses on a different perspective of KD. The knowledge transfer graph provides a unified view of KD and has the potential to represent diverse knowledge patterns. Interestingly, each node in the graph represents the direction of knowledge transfer. On each edge, a loss function is defined for transferring knowledge between the two nodes linked by that edge. Thus, combining different loss functions can represent collaborative knowledge learning with pair-wise knowledge transfer. Fig. 10(c) shows the knowledge graph of diverse collaborative distillation with three nodes, where L_{s,t} represents the loss function used for the training node.

In addition, GFL [214], HGKT [173], GRL [215], and MHGD [207] all resort to GNNs for the purpose of KD. HGKT and GFL focus on transferring knowledge from seen classes to unseen classes in few-shot learning [223, 224]. GFL [214] leverages the knowledge learned from auxiliary graphs to improve semi-supervised node classification in the target graph. As shown in Fig. 10(e), GFL learns the representation of a whole graph and ensures the transfer of similarly structured knowledge; auxiliary graph reconstruction is achieved by using a graph autoencoder. HGKT aims to build a heterogeneous graph focusing on transferring intra-class and inter-class knowledge simultaneously. Inspired by the modeling of class distributions in adversarial learning [78, 123, 181, 182, 189, 225], in which instances of the same class are expected to have the same distribution, the knowledge is transferred from seen classes to new unseen classes based on learned aggregation and embedding functions, as shown in Fig. 10(f).
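To make the graph-matching objective of Eqn. 28 concrete, the following is a minimal sketch that builds cosine-similarity graphs over a batch of teacher and student features at selected layers and penalizes their discrepancy with an L2 (MSE) distance; the simple top-k sparsification is an assumption of this sketch and not the reference implementation of GKD [212] or IRG [208].

import torch
import torch.nn.functional as F

def similarity_graph(feats, k=4):
    # feats: (B, ...) features of one layer for a batch of B inputs.
    z = F.normalize(feats.flatten(1), dim=1)
    W = z @ z.t()                                    # cosine similarities used as edge weights
    W = W - torch.eye(W.size(0), device=W.device)    # remove self-similarity
    topk = torch.topk(W, k=min(k, W.size(0) - 1), dim=1).indices
    mask = torch.zeros_like(W).scatter_(1, topk, 1.0)
    return W * mask                                  # keep each node's k strongest neighbors

def graph_kd_loss(student_feats, teacher_feats, k=4):
    # Eqn. 28: sum over selected layers of D(G_l^S(X), G_l^T(X)) with D the L2 distance.
    loss = 0.0
    for f_s, f_t in zip(student_feats, teacher_feats):
        loss = loss + F.mse_loss(similarity_graph(f_s, k), similarity_graph(f_t.detach(), k))
    return loss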

(a) GKD (b) IRG (c) KTG (d) MHKD (e) GFL (f) HGKD (g) GRL

Fig. 10. A graphical illustration of graph-based KD methods. GKD [212], IRG [208], KTG [213], MHKD [207] all focus on graph-based knowledge
distillation for model compression. GFL [214] and HGKD [173] aim to improve semi-supervised node classification via graph-based knowledge
transfer, whereas GRL [215] exploits graph-based knowledge for multi-task learning.

TABLE 9
A summary of KD methods via graph representations.

Method | Purpose | Graph type | Knowledge type | Distance metric | Graph embedding
GKD [212] | Model compression | Heterogeneous graph | Layer-wise feature | L2 | GSP [219]
IRG [208] | Model compression | Knowledge graph | Middle layers | L2 | Instance relations
MHKD [207] | Model compression | Knowledge graph | Middle layers | KL | SVD [71] + Attention
KTG [213] | Model compression | Directed graph | Network model | L1 + KP loss | –
GFL [214] | Few-shot learning | GNN | Class of nodes | Frobenius norm | HGR [220]
HGKD [173] | Few-shot learning | GNN | Class of nodes | Wasserstein | GraphSAGE [221]
GRL [215] | Multi-task learning | GNN | Class of nodes | Cross-entropy | HKS [222]
Yang [218] | Model compression | GNN | Topological info. | KL | Attention

GRL [215] builds a multi-task KD method for representation learning based on DeepGraph [222]. The knowledge here is produced by a GNN that maps raw graphs to metric values. The learned graph metrics are then used as auxiliary tasks, and the knowledge of the network is distilled into graph representations (see Fig. 10(g)). The graph representation structure is learned via a CNN by feeding the graph descriptor to it. We denote pairs of graphs and graph-level labels as {(G_i, y_i)}_{i=1}^{N}, where G_i ∈ G, y_i ∈ Y, and G and Y denote the sets of all possible graphs and labels, respectively. Then, the loss for learning the model parameters is described as:

L = E[D(y_i, f(G_i; θ))]    (30)

where θ are the model parameters.

Open challenges: Graph representations are of significant importance for tackling KD problems because they better capture the hierarchical patterns in locally connected structures. However, there are some challenges. Firstly, graph representations are difficult to generalize because they are limited to structured data or specific types of data. Secondly, it is challenging to measure graph distances appropriately, since existing distance measures (e.g., l2) may not fit well. Thirdly, layer-wise distillation is difficult to achieve in graph-based KD, because graph representation models and network structures in such cases are limited.

8.3 KD for semi-/self-supervised learning

Overall insight: KD with S-T learning aims to learn a rich representation by training a model with a large amount of unlabeled data and a limited amount of labeled data.

Semi-supervised learning usually handles the problem of over-fitting caused by the lack of high-quality labels for the training data. To this end, most methods apply S-T learning that assumes a dual role as a teacher and a student. The student model aims to learn the given data as before, and the teacher learns from the noisy data and generates predicted targets, which are then transferred to the student model via a consistency cost. In self-supervised learning, the student itself generates the knowledge to be learned via various approaches, and the knowledge is then transferred by the student to itself via distillation losses. We now provide a detailed analysis of the technical details of the existing methods.

8.3.1 Semi-supervised learning

The baseline S-T frameworks for semi-supervised learning were initialized by Laine et al. [226] and Tarvainen et al. [2], as illustrated in Fig. 1(b). The student and the teacher models have the same structure, and the teacher learns from noise and transfers knowledge to the student via a consistency cost. Interestingly, in [2], the teacher's weights are updated as the exponential moving average (EMA) of the student's weights.
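The following is a minimal sketch of this EMA-teacher scheme with a consistency cost on unlabeled data; the hyper-parameters (EMA decay, consistency weight) and the MSE consistency cost are common choices assumed here and may differ from those used in the cited works.

import copy
import torch
import torch.nn.functional as F

def make_ema_teacher(student):
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    # teacher weights <- exponential moving average of student weights
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)

def semi_supervised_step(student, teacher, x_l, y_l, x_u_view1, x_u_view2, optimizer, w_cons=1.0):
    sup_loss = F.cross_entropy(student(x_l), y_l)          # supervised loss on labeled data
    with torch.no_grad():
        t_prob = F.softmax(teacher(x_u_view1), dim=1)      # teacher prediction on one noisy/augmented view
    s_prob = F.softmax(student(x_u_view2), dim=1)          # student prediction on another view
    cons_loss = F.mse_loss(s_prob, t_prob)                 # consistency cost
    loss = sup_loss + w_cons * cons_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    ema_update(teacher, student)
    return loss.item()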

Inspired by [2], Luo et al. [227], Zhang et al. [228], French et al. [229], Choi et al. [230], Cai et al. [231], and Xu et al. [232] all employ similar frameworks in which the teacher's weights are updated as the EMA of the student's. However, Ke et al. [157] mention that using a coupled EMA teacher is not sufficient for the student, since the degree of coupling increases as training goes on. To tackle this problem, the teacher is replaced with another student, and the two students are optimized individually during training while a stabilization constraint is provided for knowledge exchange (similar to mutual learning [34]).

Instead of keeping independent weights between the teacher and the student, Hailat et al. [233] employ weight sharing, in which only the last two fully connected layers of the teacher and the student are kept independent. The teacher model plays the role of teaching the student, stabilizing the overall model, and attempting to clean the noisy labels in the training dataset. In contrast, Gong et al. [104] and Xie et al. [21] follow the conventional distillation strategy proposed by [1], where a pretrained teacher is introduced to generate learnable knowledge from unlabeled data, which is then used as privileged knowledge to teach the student on labeled data. However, while learning the student, Xie et al. inject noise (e.g., dropout) into the student such that it learns better than the teacher. Papernot et al. [38] propose to distill from multiple teachers (an ensemble of teachers) trained on disjoint subsets of sensitive data (augmented with noise) and to aggregate the knowledge of the teachers to guide the student on query data.

8.3.2 Self-supervised learning

Distilling knowledge for self-supervised learning aims to preserve the learned representation for the student itself, as depicted in Fig. 1(c). Using pseudo labels is the most common approach, as done in [71, 234]. Specifically, Lee et al. [71] adopt self-supervised learning for KD, which not only ensures that the transferred knowledge does not vanish, but also provides an additional performance improvement. In contrast, Noroozi et al. [234] propose to transfer knowledge by reducing the learned representation (from a pretrained teacher model) to pseudo-labels (via clustering) on an unlabeled dataset, which are then utilized to learn a smaller student network. Another approach is based on data augmentation (e.g., rotation, cropping, color permutation) [164, 166, 235], as mentioned in Sec. 6.2.3. In contrast to constructing 'positive' and 'negative' (augmented) examples, BYOL [178] directly bootstraps the representations with two neural networks, referred to as the online and target networks, that interact and learn from each other. This spirit is somewhat similar to mutual learning [34]; however, BYOL trains its online network to predict the target network's representation of another augmented view of the same image. The promising performance of BYOL might point to a new direction for KD with self-supervised learning via representation bootstrapping rather than negative examples.

8.3.3 Potentials and open challenges

Based on the technical analysis of the KD methods for semi-/self-supervised learning, it is noticeable that online distillation is the mainstream. However, there are several challenges. First, as pointed out by [157], using the EMA to update the teacher's weights might lead to less optimal learning of knowledge. Second, no methods attempt to exploit the rich feature knowledge of teacher models. Third, the data augmentation methods used in these distillation approaches are less effective compared to those discussed in Sec. 6.2, in which the advantages of adversarial learning are distinctive. Fourth, the representations of knowledge in these methods are limited and less effective. BYOL [178] opens a door for boosting representations, and there exists a potential to further bind this idea with KD in future research. Moreover, there is potential to exploit better-structured data representation approaches, such as GNNs. Given these challenges, future directions of KD for semi-/self-supervised learning could gain inspiration from exploiting feature knowledge and more sophisticated data augmentation methods, together with more robust representation approaches.

8.4 Few-shot learning

Insight: Is it possible to learn an effective student model that classifies unseen classes (query sets) by distilling knowledge from a teacher model with the support set?

In contrast to the methods discussed in Sec. 5.2, which focus on distillation with a few samples for training a student network (without learning to generalize to new classes), this section analyzes the technical details of few-shot learning with KD. Few-shot learning aims to classify new data having seen only a few training examples. Few-shot learning itself is a meta-learning problem in which the DNN learns how to learn to classify, given a set of training tasks, and is evaluated on a set of test tasks. Here, the goal is to discriminate between N classes with K examples of each (so-called N-way-K-shot classification). In this setting, these training examples are known as the support set. In addition, there are further examples of the same classes, known as the query set. The approaches for exploiting prior knowledge in few-shot learning are usually of three types: prior knowledge about similarity, prior knowledge about the learning procedure, and prior knowledge about data. We now analyze the recently proposed KD methods for few-shot learning [28, 46, 126, 236, 237].

Prior knowledge about similarity: Park et al. [28] propose distance-wise and angle-wise distillation losses. The aim is to penalize the structural differences between the learned representations of the teacher and the student for few-shot learning.

Prior knowledge about the learning procedure: [236, 237] tackle the second type of prior knowledge, namely the learning procedure. Specifically, Flennerhag et al. [236] focus on transferring knowledge across the learning process, in which the information from previous tasks is distilled to facilitate the learning of new tasks. In contrast, Jin et al. [237] address the problem of learning a meta-learner that can automatically determine what knowledge to transfer from the source network, and to where in the target network.

Prior knowledge about data: Dvornik et al. [46] and Liu et al. [126] address the third type of prior knowledge, namely data variance. Specifically, in [46], an ensemble of several teacher networks is elaborated to leverage the variance of the classifiers; the teachers are encouraged to cooperate while preserving the diversity of their predictions. In [126], the goal is to preserve the knowledge of the teacher (e.g., the intra-class relationships) learned at the pretraining stage by generating pseudo labels for the training samples in the fine-tuning set.
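For illustration, the following is a minimal sketch of a distance-wise relational loss in the spirit of Park et al. [28]: the student is penalized for distorting the pairwise-distance structure of the teacher's embeddings. The normalization and the smooth-L1 penalty are common choices assumed here rather than a verbatim reproduction of [28].

import torch
import torch.nn.functional as F

def normalized_pairwise_distances(emb):
    d = torch.cdist(emb, emb, p=2)                 # pairwise Euclidean distances within the batch
    mean_d = d[d > 0].mean().clamp(min=1e-8)       # normalize by the mean non-zero distance
    return d / mean_d

def distance_wise_kd_loss(student_emb, teacher_emb):
    with torch.no_grad():
        t_d = normalized_pairwise_distances(teacher_emb)
    s_d = normalized_pairwise_distances(student_emb)
    return F.smooth_l1_loss(s_d, t_d)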

8.4.1 What's challenging?

Based on our analysis, the existing techniques expose crucial challenges. First, the overall performance of KD-based few-shot learning is convincing, but the power of meta-learning is somewhat degraded or exempted. Second, transferring knowledge from multi-source networks has potential, but identifying what to learn and where to transfer depends heavily on the meta-learner, and selecting which teacher to learn from is computationally complex. Third, all approaches focus on task-specific distillation, and performance drops as the domain shifts. Thus, future works may focus more on handling these problems.

8.5 Incremental learning

Overall insight: KD for incremental learning mainly deals with two challenges: maintaining the performance on old classes, and balancing between old and new classes.

Incremental learning investigates learning new knowledge continuously to update the model's knowledge while maintaining the existing knowledge [238]. Many attempts [102, 169, 193, 238, 239, 240, 241] have been made to utilize KD to address the challenge of maintaining the old knowledge. Based on the number of teacher networks used for distillation, these methods can be categorized into two types: distillation from a single teacher and distillation from multiple teachers.

8.5.1 Distillation from a single teacher

Shmelkov et al. [241], Wu et al. [238], Michieli et al. [239], and Hou et al. [169] focus on learning student networks for new classes by distilling knowledge (logits information) from teachers pretrained on old-class data. Although these methods vary in tasks and distillation processes, they follow similar S-T structures. Usually, the pretrained model is taken as the teacher, and the same network or a different network is employed to adapt to the new classes. Michieli et al. additionally exploit the intermediate feature representations and transfer them to the student.

8.5.2 Distillation from multiple teachers

Castro et al. [240], Zhou et al. [102], and Ammar et al. [242] concentrate on learning an incremental model with multiple teachers. Specifically, Castro et al. share the same feature extractor between the teachers and the student. The teachers contain the old classes, and their logits are used for distillation and classification. Interestingly, Zhou et al. propose a multi-model and multi-level KD strategy in which all previous model snapshots are leveraged to learn the last model (the student). This approach is similar to the born-again KD methods mentioned in Sec. 6.2, where the student model at the last step is updated using the assembled knowledge from all previous steps; here, however, the assembled knowledge also depends on the intermediate feature representations. Ammar et al. develop a cross-domain incremental RL framework, in which the transferable knowledge is shared and projected to the different task domains of the task-specific student peers.

8.5.3 Open challenges

The existing methods rely on multi-step (offline) training. It would be more significant if online (one-step) distillation approaches could be utilized to improve learning efficiency and performance. Moreover, existing methods require access to the previous data to avoid ambiguities between the update steps; the possibility of data-free distillation methods remains open. Furthermore, existing methods only tackle the incremental learning of new classes in the same data domain, and it would be fruitful if cross-domain distillation methods could be applied in this direction.

8.6 Reinforcement learning

Overall insight: KD in reinforcement learning encourages policies (students) in an ensemble to learn from the best policies (teachers), thus enabling rapid improvement and continuous optimization.

Reinforcement learning (RL) is a learning problem that trains a policy to interact with the environment in a way that yields maximal reward. To use the best policy to guide other policies, KD has been employed in [18, 107, 236, 243, 244, 245, 246, 247]. Based on the specialties of these methods, we divide them into three categories and provide an explicit analysis. We assume familiarity with the basics of RL, and skip the definitions of deep Q-networks and A3C.

8.6.1 Collaborative distillation

Xue et al. [246], Hong et al. [245], and Lin et al. [247] focus on collaborative distillation, which is similar to mutual learning [34]. In Xue et al., the agents teach each other based on the reinforcement rule, and teaching occurs between the value functions of the agents (students and teachers). Note that the knowledge is provided by a group of student peers periodically and is assembled to enhance the learning speed and stability, as in [245]. Hong et al. [245] periodically distill the best-performing policy to the rest of the ensemble. Lin et al. stress collaborative learning among heterogeneous learning agents and incorporate the distilled knowledge into online training.

8.6.2 Model compression with RL-based distillation

Ashok et al. [243] tackle the problem of model compression via RL. The method takes a larger teacher network and outputs a compressed student network derived from it. In particular, two recurrent policy networks are employed to aggressively remove layers from the teacher network and to carefully reduce the size of each remaining layer. The learned student network is evaluated by a reward, which is a score based on the accuracy and the degree of compression relative to the teacher.

8.6.3 Random network distillation

Burda et al. [244] focus on a different perspective, where the prediction problem is randomly generated. The approach involves two networks: the target (student) network, which is fixed and randomly initialized, and a predictor (teacher) network trained on the data collected by the agent. By distilling the knowledge of the fixed target network into the predictor, the prediction error becomes lower on frequently visited states.
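The following is a minimal sketch of this random-network-distillation idea: a predictor is trained to match a fixed, randomly initialized target network, and its per-state prediction error (which stays high on rarely visited states) can be used as an intrinsic exploration signal. Network sizes are illustrative assumptions of this sketch.

import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim, embed_dim=64):
        super().__init__()
        self.target = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
        self.predictor = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))
        for p in self.target.parameters():          # the target stays fixed and random
            p.requires_grad_(False)

    def prediction_error(self, obs):
        with torch.no_grad():
            t = self.target(obs)
        return (self.predictor(obs) - t).pow(2).mean(dim=1)   # per-state error (exploration bonus)

    def update(self, obs, optimizer):
        loss = self.prediction_error(obs).mean()    # distill the random target into the predictor
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()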

Rusu et al. [107] also apply random initialization to the target network; however, they focus more on online learning of action policies, which can be either single-task or multi-task.

8.6.4 Potentials of RL-based KD

We have analyzed existing RL-based KD methods in detail. In particular, we notice that model compression via RL-based KD is promising due to its distinctive merits. First, RL-based KD better addresses the problem of the scalability of network models; this is similar to neural architecture search (NAS). Moreover, the reward functions in RL-based KD better balance the accuracy-size trade-off. It is also possible to transfer knowledge from a smaller model to a larger model, which is a distinctive advantage over other KD methods.

9 APPLICATIONS FOR VISUAL INTELLIGENCE

9.1 Semantic and motion segmentation

Insight: Semantic segmentation is a structured problem, and structure information (e.g., spatial context structures) needs to be taken into account when distilling knowledge for semantic segmentation networks.

Semantic segmentation is a special classification problem that predicts the category label in a pixel-wise manner. As existing state-of-the-art (SOTA) methods such as fully convolutional networks [248] have large model sizes and high computation costs, several methods [117, 152, 159, 197, 239, 249, 250, 251, 252] have been proposed to train lightweight networks via KD. Although these methods vary in their learning strategies, most of them share the same distillation framework. In particular, Xie et al. [252], Shan et al. [249], and Michieli et al. [239] focused on pixel-wise, feature-based distillation methods, while Liu et al. [197] and He et al. [250] both exploited an affinity-based distillation strategy using intermediate features. Liu et al. also employed pixel-wise and holistic KD losses via adversarial learning. In contrast, Dou et al. [152] focused on unpaired multi-modal segmentation and proposed an online KD method via mutual learning [34]. Chen et al. [251] proposed a target-guided KD approach to learn the real image style by training the student to imitate a teacher trained with real images. Mullapudi et al. [159] trained a compact video segmentation model via online distillation, in which the teacher's output is used as a learning target to adapt the student and to select the next frame for supervision.

9.2 KD for visual detection and tracking

Insight: Challenges such as regression, region proposals, and less voluminous labels must be considered when distilling visual detectors.

Visual detection is a crucial high-level task in computer vision, and speed and accuracy are two key factors for visual detectors. KD is a potential choice for achieving fast and lightweight detection models. However, applying distillation methods to detection is more challenging than applying them to classification. First, detection performance degrades seriously after compression. Second, detection classes are not equally important, and special considerations for distillation have to be taken into account. Third, domain and data generalization has to be considered for a distilled detector. To overcome these challenges, several impressive KD methods [65, 175, 192, 237, 253, 254, 255, 256, 257, 258, 259, 260, 261, 262, 263, 264] have been proposed for compressing visual detection networks. We categorize these methods according to their specialties (e.g., pedestrian detection).

9.2.1 Generic object detection

[65, 175, 192, 253, 254, 256, 258, 265, 266] aimed to learn lightweight object detectors with KD. Among these works, Chen et al. [254] and Hao et al. [253] highlighted learning a class-incremental student detector by following the generic KD framework (from a pretrained teacher); however, novel object detection losses were adopted as a strong impetus for learning new classes. These losses handled the classification results, location results, detected regions of interest, and all intermediate region proposals. Moreover, Chen et al. [175] learned a student detector by distilling knowledge from the intermediate layers, logits, and regressor of the teacher, in contrast to [65], in which only the intermediate layers of the teacher were utilized, based on fine-grained imitation masks identifying informative locations. Jin et al. [258], Tang et al. [256], and Hong et al. [192] exploited multiple intermediate layers as useful knowledge. Jin et al. designed an uncertainty-aware distillation loss to learn the multiple-shot features from the teacher network, whereas Hong et al. and Tang et al. relied on one-stage (online) KD via adversarial learning and semi-supervised learning, respectively. In contrast, Liu et al. [266] combined single S-T learning and mutual learning of students for learning lightweight tracking networks.

9.2.2 Pedestrian detection

While pedestrian detection is based on generic object detection, the various sizes and aspect ratios of pedestrians under extreme illumination conditions pose challenges. To learn an effective lightweight detector, Chen et al. [255] suggested using unified hierarchical knowledge via multiple intermediate supervisions, in which not only the feature pyramid (from low-level to high-level features) and region features, but also the logits information, were distilled. Kruthiventi et al. [261] learned an effective student detector under challenging illumination conditions by extracting dark knowledge (both RGB and thermal-like hint features) from a multi-modal teacher network.

9.2.3 Face detection

Ge et al. [262] and Karlekar et al. [267] compressed face detectors to recognize low-resolution faces via selective KD (of the last hidden layer) from teachers initialized to recognize high-resolution faces. In contrast, Jin et al. [237], Luo et al. [263], and Feng et al. [264] used a single type of image. Jin et al. focused on compressing face detectors by using the supervisory signal from the classification maps of teacher models and the regression maps of the ground truth. They identify that it is better to learn the classification map of a larger model than that of smaller models.
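As a concrete example of the pixel-wise distillation used by several of the segmentation methods in Sec. 9.1, the following is a minimal sketch in which the student's per-pixel class distribution is aligned with the teacher's via a KL divergence alongside the usual per-pixel cross-entropy. The temperature and weighting are illustrative assumptions, and the structured/affinity/holistic terms used by some of the cited works are omitted.

import torch.nn.functional as F

def pixelwise_kd_loss(student_logits, teacher_logits, labels, T=1.0, alpha=0.5, ignore_index=255):
    # student_logits, teacher_logits: (B, C, H, W); labels: (B, H, W) with class indices.
    ce = F.cross_entropy(student_logits, labels, ignore_index=ignore_index)
    kl_map = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                      F.softmax(teacher_logits / T, dim=1),
                      reduction='none').sum(dim=1)        # per-pixel KL divergence
    kd = kl_map.mean() * (T * T)
    return ce + alpha * kd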

Feng et al. presented a triplet KD method to transfer knowledge from a teacher model to a student model, in which a triplet of samples — an anchor image, a positive image, and a negative image — was used. The purpose of the triplet loss is to minimize the feature distance between the anchor and positive images while maximizing that between the anchor and negative images. Luo et al. addressed the importance of the neurons at the higher hidden layers of the teacher, and a neuron selection method was applied to select the neurons that are crucial for teaching the student. Dong et al. [268] concentrated on the interaction between the teacher and the students: two students learn to generate pseudo facial landmark labels, which are filtered and selected as qualified knowledge by the teacher.

9.2.4 Vehicle detection and driving learning

Lee et al. [259], Saputra et al. [257], and Xu et al. [182] focused more on detection tasks for autonomous driving. In particular, Lee et al. focused on compressing a vehicle maker classification system based on cascaded CNNs (the teacher) into a single CNN structure (the student). The proposed distillation method used the feature map as the transfer medium, and the teacher and the student were trained in parallel (online distillation). Although the detection task is different, Xu et al. built a binary-weight YOLO vehicle detector by mimicking the feature maps of the teacher network, progressively moving from easy tasks to difficult ones. Zhao et al. [269] exploited an S-T framework to encourage the student to learn the teacher's sufficient and invariant representation knowledge (based on semantic segmentation) for driving.

9.2.5 Pose detection

Distilling human pose detectors has several challenges. First, lightweight detectors have to deal with arbitrary person images/videos to determine joint locations with unconstrained human appearances. Second, the detectors must be robust to viewing conditions and background noise. Third, the detectors should have fast inference speed and be memory-efficient. To this end, [145, 270, 271, 272, 273, 274, 275] formulated various distillation methods. Zhang et al. [270] achieved effective knowledge transfer by distilling the joint confidence maps from a pretrained teacher model, whereas Huang et al. [272] exploited the heat map and location map of a pretrained teacher as the knowledge to be distilled. Furthermore, Xu et al. [273], Thoker et al. [145], and Martinez et al. [271] focused on multi-person pose estimation. Thoker et al. addressed cross-modality distillation problems with a novel framework based on the mutual learning [34] of two students supervised by one teacher. Xu et al. [273] learned integral knowledge — namely features, logits, and structured information — via a discriminator under the standard S-T framework. Martinez et al. [271] trained the student to mimic the confidence maps, feature maps, and inner-stage predictions of a teacher pretrained with depth images. Wang et al. [274] trained a 3D pose estimation network by distilling knowledge from non-rigid structure from motion using only 2D landmark annotations. In contrast, Nie et al. [275] introduced an online KD method in which the pose kernels in videos are distilled by leveraging the temporal cues from the previous frame in a one-shot learning manner.

9.3 Domain adaptation

Insight: Is it possible to distill the knowledge of a teacher in one domain to a student in another domain?

Domain adaptation (DA) addresses the problem of learning a target domain with the help of a different but related source domain [276]. Since Lopez et al. [277] and Gupta et al. [143] initially proposed the technique of transferring knowledge between images from different modalities (called generalized distillation), it is natural to ask whether this technique can be used to address the problem of DA. The challenge of DA usually comes with transferring knowledge from the source model (usually with labels) to a target domain with unlabeled data. To address this problem, several KD methods based on S-T frameworks [230, 231, 232, 276, 278, 279, 280, 281, 282] have been proposed recently. Although these methods are focused on diverse tasks, technically they can be categorized into two types: unsupervised and semi-supervised DA via KD.

9.3.1 Semi-supervised DA

French et al. [229], Choi et al. [230], Cai et al. [231], Xu et al. [232], and Cho et al. [283] proposed similar S-T frameworks for semantic segmentation and object detection. These frameworks are updated variants of the Mean Teacher [2], which is based on a self-ensemble of the student networks (the teacher and student models have the same structure). Note that the weights of the teacher models in these methods are the EMAs of the weights of the student models. In contrast to the stochastic augmentation used in [229, 231, 232], Choi et al. added a target-guided generator to produce augmented images. Cai et al. also exploited the feature knowledge of the teacher model and applied region-level and intra-graph consistency losses instead of the mean squared error loss.

In contrast, Ao et al. [276] proposed a generalized distillation DA method by applying generalized distillation [277] to multiple teachers to generate soft labels, which were then used to supervise the student model (this framework is similar to online KD from multiple teachers, as mentioned in Sec. 4.2). Cho et al. [283] proposed an S-T learning framework in which a smaller depth prediction network is trained under the supervision of auxiliary information (an ensemble of multiple depth predictions) obtained from a larger stereo matching network (the teacher).

9.3.2 Unsupervised DA

Some methods, such as [278, 280], distill the knowledge from the source domain to the target domain based on adversarial learning [78] and image translation [123, 124, 200]. Technically, images in the source domain are translated into images in the target domain as data augmentation, and cross-domain consistency losses are adopted to force the teacher and student models to produce consistent predictions. Tsai et al. [281] and Deng et al. [282] focused on aligning the feature similarities between teacher and student models, whereas Meng et al. [279] focused on aligning the softmax outputs.
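To illustrate the cross-modal transfer that underlies generalized distillation (Sec. 9.3) and several of the multi-modal detectors above, the following is a minimal sketch in which a teacher trained on one modality (e.g., RGB) supervises a student operating on a paired modality (e.g., depth or thermal) by matching intermediate features through a small adaptation layer; all names and dimensions are illustrative assumptions of this sketch.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    # 1x1 convolution that maps student features to the teacher's channel dimension.
    def __init__(self, c_student, c_teacher):
        super().__init__()
        self.proj = nn.Conv2d(c_student, c_teacher, kernel_size=1)

    def forward(self, f):
        return self.proj(f)

def cross_modal_step(student, teacher, adapter, rgb, paired_modality, optimizer):
    teacher.eval()
    with torch.no_grad():
        f_teacher = teacher(rgb)                       # features from the labeled/source modality
    f_student = adapter(student(paired_modality))      # features from the unlabeled/target modality
    loss = F.mse_loss(f_student, f_teacher)            # cross-modal feature-matching loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()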

9.4 Depth and scene flow estimation

Insight: The challenges of distilling depth and flow estimation tasks come with transferring the knowledge of data and labels.

Depth and optical flow estimation are low-level vision tasks aiming to estimate the 3D structure and motion of the scene. There are several challenges. First, in contrast to other tasks (e.g., semantic segmentation), depth and flow estimation do not have class labels; thus, applying existing KD techniques directly may not work well. Moreover, learning a lightweight student model usually requires a large amount of labeled data to achieve robust generalization capability, and acquiring such data is very costly.

To address these challenges, Guo et al. [176], Pilzer et al. [284], and Tosi et al. [285] proposed distillation-based approaches for learning monocular depth estimation. These methods focus on handling the second challenge, namely data distillation. Specifically, Pilzer et al. [284] proposed an unsupervised distillation approach in which the left image is translated to the right one via an image translation framework [123, 200]. The inconsistencies between the left and right images are used to improve depth estimation, which is finally used to improve the student network via KD. In contrast, Guo et al. and Tosi et al. focused on cross-domain KD, which aims to distill the proxy labels obtained from a stereo network (the teacher) to learn a student depth estimation network. Choi et al. [283] learned a student network for monocular depth inference by distilling the knowledge of the depth predictions of a stereo teacher network via a data ensemble strategy.

Liu et al. [286] and Aleotti et al. [287] proposed data-distillation methods for scene flow estimation. Liu et al. distilled reliable predictions from a teacher network on unlabeled data and used these predictions (for non-occluded pixels) as annotations to guide a student network in learning the optical flow. Aleotti et al. proposed to leverage the knowledge learned by teacher networks specialized in stereo to distill proxy annotations, which is similar to the KD methods for depth estimation in [176, 285]. Tosi et al. [288] learned a compact network for predicting holistic scene understanding tasks, including depth, optical flow, and motion segmentation, based on the distillation of proxy semantic labels and semantic-aware self-distillation of optical information.

9.5 Image translation

Insight: Distilling GAN frameworks for image translation has to consider three factors: the large number of parameters of the generators, the absence of ground truth labels for the training data, and the complex framework (both generator and discriminator).

Several works have attempted to compress GANs for image translation with KD. Aguinaldo et al. [198] focused on unconditional GANs and proposed to learn a smaller student generator by distilling knowledge from the generated images of a larger teacher generator using the mean squared error (MSE); however, the knowledge incorporated in the teacher discriminator was not investigated. In contrast, Chen et al. [203] and Li et al. [199] focused on conditional GANs and exploited the knowledge of the teacher discriminator. Specifically, Chen et al. included a student discriminator to measure the distances between real images and the images generated by the student and teacher generators; the student GAN was then trained under the supervision of the teacher GAN. Li et al. [199] adopted the discriminator of the teacher as the student discriminator and fine-tuned it together with the compressed generator, which was automatically found, with significantly lower computation cost and fewer parameters, by using NAS. In contrast, Wang et al. [289] focused on compressing encoder-decoder based neural style transfer networks via collaborative distillation (between the encoder and its decoder), where the student is restricted to learn a linear embedding of the teacher's output.

9.6 KD for video understanding

9.6.1 Video classification and recognition

Bhardwaj et al. [290] and Wang et al. [291] employed the general S-T learning framework for video classification: the student is trained to process only a few frames of the video and to produce a representation similar to that of the teacher. Gan et al. [292] focused on video concept learning for action recognition and event detection by using web videos and images. The knowledge learned by the teacher network (Lead network) is used to filter out noisy images, which are then used to fine-tune the teacher network to obtain a student network (Exceeding network). Gan et al. [293] explored geometry as a new type of practical auxiliary knowledge for self-supervised learning of video representations. Fu et al. [294] focused on video attention prediction by leveraging both spatial and temporal knowledge. Farhadi et al. [295] distill the temporal knowledge of a teacher model over selected video frames to a student model.

9.6.2 Video captioning

[296, 297] exploited the potential of graph-based S-T learning for video captioning. Specifically, Zhang et al. [296] leveraged the object-level information (teacher) to learn the scene feature representation (student) via a spatio-temporal graph. Pan et al. [297] highlighted the importance of a relational graph connecting all the objects in the video and forced the caption model to learn abundant linguistic knowledge via teacher-recommended learning.

10 DISCUSSIONS

In this section, we discuss some fundamental questions and challenges that are crucial for better understanding and improving KD.

10.1 Are bigger models better teachers?

The early assumption and idea behind KD are that the soft labels (probabilities) from a trained teacher reflect more about the data distribution than the ground truth labels do [1]. If this is true, then it is expected that as the teacher becomes more robust, the knowledge (soft labels) provided by the teacher becomes more reliable and better captures the distribution of the classes. That is, a more robust teacher provides constructive knowledge and supervision to the student. Thus, the intuitive approach for learning a more accurate student is to employ a bigger and more robust teacher. However, based on the experimental results in [16], a bigger and more robust model does not always make a better teacher. As the teacher's capacity grows, the student's accuracy rises to some extent, and then

The soft labels provided by the teacher are often assumed to carry more information about the distribution of data than the ground truth labels [1]. If this is true, then it is expected that as the teacher becomes more robust, the knowledge (soft labels) provided by the teacher would be more reliable and better capture the distribution of classes. That is, a more robust teacher provides constructive knowledge and supervision to the student. Thus, the intuitive approach for learning a more accurate student is to employ a bigger and more robust teacher. However, based on the experimental results in [16], a bigger and more robust model does not always make a better teacher. As the teacher's capacity grows, the student's accuracy rises to some extent and then begins to drop. We summarize two crucial reasons behind the lack of theoretical support for KD, based on [16, 35]:

• The student is able to follow the teacher, but it cannot absorb useful knowledge from the teacher. This indicates that there is a mismatch between the KD losses and the accuracy evaluation methods. As pointed out in [35], the optimization method used could have a large impact on the distillation risk. Thus, optimization methods might be crucial for effective KD to the student.

• The student is unable to follow the teacher due to the large capacity gap between the teacher and the student. It is stated in [1, 53] that the S-T similarity is highly related to how well the student can mimic the teacher. If the student is similar to the teacher, it will produce outputs similar to the teacher's.

Intermediate feature representations are also effective knowledge that can be used to learn the student [52, 54]. The common approach for feature-based distillation is to transform the features into a type of representation that the student can easily learn. In such a case, are bigger models better teachers? As pointed out in [52], feature-based distillation is better than the distillation of soft labels, and deeper students perform better than shallower ones. In addition, the performance of the student increases upon increasing the number of layers (feature representations) [54]. However, when the student is fixed, a bigger teacher does not always teach a better student. When the similarity between the teacher and student is relatively high, the student tends to achieve plausible results.
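For reference, the soft-label objective of [1] that underlies this discussion can be sketched as follows; the temperature T and weight alpha are free hyper-parameters, and the implementation below is a generic one rather than that of any specific paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-label KD [1]: cross-entropy on hard labels plus KL divergence
    between temperature-softened teacher and student distributions."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits.detach() / T, dim=1),
                  reduction="batchmean") * (T * T)   # T^2 keeps the gradient scale
    return (1.0 - alpha) * ce + alpha * kd
```

A larger T exposes more of the teacher's dark knowledge about non-target classes, which is precisely the information a low-capacity student may fail to absorb.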
10.2 Is a pretrained teacher important?
While most works focus on learning a smaller student based on a pretrained teacher, the distillation is not always efficient and effective. When the model capacity gap between the teacher and the student is large, it is hard for the student to follow the teacher, thus inducing difficulty in optimization. Is a pretrained teacher important for learning a compact student with plausible performance? [34, 40] propose learning from student peers, each of which has the same model complexity. The greatest advantage of this distillation approach is efficiency, since the pretraining of a high-capacity teacher is exempted. Instead of being taught, the student peers learn to cooperate with each other to obtain an optimal learning solution. Surprisingly, learning without the teacher even improves the performance. The question of why learning without the teacher is better has been studied in [2]. Their results indicate that the compact student may have less chance of overfitting. Moreover, [16] suggests that early stopping of training on ImageNet [51] achieves better performance. The ensemble of students pools their collective predictions, thus helping to converge at more robust minima, as pointed out in [34].
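A minimal sketch of such peer-based learning is given below. It follows the spirit of mutual learning [34, 40] but is a simplified illustration: each peer matches the hard labels and the averaged, detached predictions of the other peers, and all names are placeholders.

```python
import torch
import torch.nn.functional as F

def mutual_learning_step(peers, optimizers, images, labels, T=1.0):
    """One step of peer-based (mutual) distillation: every student matches the
    hard labels and the averaged, detached soft predictions of the other peers."""
    logits = [net(images) for net in peers]
    for i, opt in enumerate(optimizers):
        ce = F.cross_entropy(logits[i], labels)
        peer_probs = torch.stack([F.softmax(logits[j].detach() / T, dim=1)
                                  for j in range(len(peers)) if j != i]).mean(dim=0)
        kl = F.kl_div(F.log_softmax(logits[i] / T, dim=1),
                      peer_probs, reduction="batchmean") * (T * T)
        opt.zero_grad()
        (ce + kl).backward()
        opt.step()
```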
10.3 Is born-again self-distillation better?
The born-again network [30] is the initial self-distillation method, in which the student is trained sequentially and each later generation is supervised by the earlier generation. At the end of the procedure, all the student generations are assembled together to get an additional gain. So is self-distillation in generations better? [16] finds that the network architecture heavily determines the success of KD in generations. Although the ensemble of the student models from all the generations outperforms a single model trained from scratch, the ensemble does not outperform an ensemble of an equal number of models trained from scratch.

Instead, recent works [163, 164, 168] shift the focus from sequential self-distillation (multiple stages) to the one-stage (online) manner. The student distills knowledge to itself without resorting to a teacher and heavy computation. These methods show more efficiency, lower computation costs, and higher accuracy. The reason for such better performance has been pointed out in [163, 168]: online self-distillation can help student models converge to flat minima. Moreover, self-distillation prevents student models from the 'vanishing gradient' problem. Lastly, self-distillation helps to extract more discriminative features. In summary, online self-distillation shows significant advantages over sequential distillation methods and is more generalizable.
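The sequential (born-again) scheme can be sketched as the loop below, where make_student and train_epoch are assumed user-supplied helpers (an architecture constructor and a standard supervised training routine); it illustrates the generational idea of [30] rather than a particular implementation.

```python
import copy
import torch
import torch.nn.functional as F

def born_again(make_student, train_epoch, loader, num_generations=3, T=3.0):
    """Born-again style self-distillation sketch: each generation is trained
    with the previous generation as its frozen teacher; all generations can
    finally be ensembled by averaging their softmax outputs."""
    generations, teacher = [], None
    for _ in range(num_generations):
        student = make_student()                  # same architecture each time

        def loss_fn(logits, labels, inputs):
            ce = F.cross_entropy(logits, labels)
            if teacher is None:                   # first generation: labels only
                return ce
            with torch.no_grad():
                t_logits = teacher(inputs)
            kd = F.kl_div(F.log_softmax(logits / T, dim=1),
                          F.softmax(t_logits / T, dim=1),
                          reduction="batchmean") * (T * T)
            return ce + kd

        train_epoch(student, loader, loss_fn)     # assumed standard training loop
        teacher = copy.deepcopy(student).eval()
        generations.append(teacher)
    return generations
```

Online variants instead produce the teacher signal on the fly within a single training pass, e.g., from auxiliary branches of the same network [164, 168].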
10.4 Single teacher vs multiple teachers
It is noticeable that recent distillation methods have turned to exploiting the potential of learning from multiple teachers. Is learning from multiple teachers really better than learning from a single teacher? To answer this question, [37] intuitively identified that the student can fuse different predictions from multiple teachers to establish its own comprehensive understanding of the knowledge. The intuition behind this is that, by unifying the knowledge from the ensemble of teachers, the relative similarity relationship among teachers is maintained, which provides more integrated dark knowledge for the student. Similar to mutual learning [34, 40], the ensemble of teachers collects the individual predictions (knowledge) together, thus converging at minima that are more robust. Lastly, learning from multiple teachers relieves training difficulties such as vanishing gradient problems.
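The simplest instantiation of this idea matches the student to a (weighted) average of the teachers' softened predictions, as sketched below; the uniform weighting is only an illustrative choice and not the fusion strategy of any specific method.

```python
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          T=4.0, alpha=0.7, weights=None):
    """Distill from an ensemble of teachers by matching the (weighted) average
    of their temperature-softened predictions plus the usual hard-label term."""
    n = len(teacher_logits_list)
    weights = weights or [1.0 / n] * n
    avg_soft = sum(w * F.softmax(t.detach() / T, dim=1)
                   for w, t in zip(weights, teacher_logits_list))
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  avg_soft, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return (1.0 - alpha) * ce + alpha * kd
```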
10.5 Is data-free distillation effective enough?
In the absence of training data, some novel methods [112, 116, 118, 119] have been proposed to achieve plausible results. However, a theoretical explanation of why such methods are robust enough for learning a portable student has yet to be proposed. These methods focus only on classification, and their generalization capability is still low. Most works employ generators to synthesize 'latent' images from noise via adversarial learning [123, 132], but such methods are relatively hard to train and computationally expensive.
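A rough sketch of this generator-based recipe is given below: the generator is pushed to synthesize samples on which the frozen teacher is confident, and the student then matches the teacher on those synthetic samples. The entropy objective here is only one illustrative surrogate; the concrete generator losses differ considerably across [112, 116, 118, 119].

```python
import torch
import torch.nn.functional as F

def data_free_kd_step(generator, student, teacher, opt_g, opt_s,
                      batch_size=64, z_dim=100, T=4.0, device="cpu"):
    """One step of generator-based data-free distillation (rough sketch).
    The teacher is assumed frozen (parameters not updated here)."""
    teacher.eval()

    # 1) Generator update: minimize the entropy of the teacher's predictions,
    #    a simple surrogate for 'teacher-plausible' pseudo data.
    z = torch.randn(batch_size, z_dim, device=device)
    probs = F.softmax(teacher(generator(z)), dim=1)
    loss_g = (-probs * torch.log(probs + 1e-6)).sum(dim=1).mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # 2) Student update: match the teacher's softened outputs on new samples.
    z = torch.randn(batch_size, z_dim, device=device)
    fake = generator(z).detach()
    with torch.no_grad():
        t_soft = F.softmax(teacher(fake) / T, dim=1)
    loss_s = F.kl_div(F.log_softmax(student(fake) / T, dim=1),
                      t_soft, reduction="batchmean") * (T * T)
    opt_s.zero_grad()
    loss_s.backward()
    opt_s.step()
    return loss_g.item(), loss_s.item()
```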
10.6 Logits vs features
The knowledge defined in existing KD methods comes from three aspects: logits, feature maps (intermediate layers), or both. However, it is still unclear which of these represents better knowledge. Works such as [52, 53, 54, 55, 61] focus on better interpretation of feature representations and claim that features might contain richer information, while some other works [1, 15, 34, 115] mention that softened labels (logits) can represent each sample by a class distribution, so a student can easily learn the intra-class variations. However, it is noticeable that KD via logits has obvious drawbacks. First, its effectiveness is limited to the softmax loss function, and it relies on the number of classes (it cannot be applied to low-level vision tasks). Second, when the capacity gap between the teacher and the student is big, it is hard for the student to follow the teacher's class probabilities [16]. Moreover, as studied in [61], semantically similar inputs tend to elicit similar activation patterns in teacher networks, indicating that the similarity-preserving knowledge from intermediate features expresses not only the representation space but also the activations of object categories (similar to class distributions). Thus, we can clearly see that features provide richer knowledge than logits and generalize better to problems without class labels.
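As a point of comparison with logit-based losses, feature-level distillation in the style of hint learning [52] can be sketched as follows; the 1x1 regressor and the choice of layers are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """Hint-style feature distillation sketch [52]: a 1x1 conv regressor maps
    the student's intermediate feature map to the teacher's channel width
    before an L2 matching loss."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, feat_student, feat_teacher):
        f_s = self.regressor(feat_student)
        if f_s.shape[-2:] != feat_teacher.shape[-2:]:     # align spatial sizes
            f_s = F.interpolate(f_s, size=feat_teacher.shape[-2:],
                                mode="bilinear", align_corners=False)
        return F.mse_loss(f_s, feat_teacher.detach())
```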
10.7 Interpretability of KD
In Sec. 3, we provided a theoretical analysis of KD based on information maximization theory. It is commonly acknowledged that the teacher model's dark knowledge provides privileged information on the similarity of class categories to improve the learning of students [1, 4]. However, why KD works is also an important question. Some methods explore the principles of KD from the viewpoints of label smoothing [172], visual concepts [177], category similarity [1], etc. Specifically, [172] found that KD is a learned label smoothing regularization (LSR) and that LSR is an ad-hoc KD; even a poorly trained teacher can improve the student's performance, and a weak student can improve the teacher. However, the findings in [172] focus only on classification-related tasks, and these intriguing results do not apply to tasks without labels [199, 203]. In contrast, [177] claims that KD makes a DNN learn more task-related visual concepts and discard task-irrelevant concepts in order to learn discriminative features. From a general perspective, the quantification of visual concepts in [177] provides a more intuitive interpretation for the success of KD. However, more intensive research still needs to be done in this direction.

10.8 Network architecture vs effectiveness of KD
It has been demonstrated that the distillation position has a significant impact on the effectiveness of KD [16, 53]. Most methods demonstrate this by deploying the same network for both teacher and student. However, many fail to transfer across very different teacher and student architectures. Recently, [10] found that [11, 52, 57] perform poorly even on very similar student and teacher architectures. [172] also reported an intriguing finding that a poorly trained teacher can still improve the student's performance. It is thus imperative to investigate how the network architecture affects the effectiveness of KD and why KD fails to work when the network architectures of the student and teacher are different.

11 NEW OUTLOOKS AND PERSPECTIVES
In this section, we provide some ideas and discuss future directions of knowledge distillation. We take the latest deep learning methods (e.g., neural architecture search (NAS), graph neural networks (GNN)), novel non-Euclidean distances (e.g., hypersphere), better feature representation approaches, and potential vision applications, such as 360° vision [298] and event-based vision [158], into account.

11.1 Potential of NAS
In recent years, NAS has become a popular topic in deep learning. NAS has the potential of automating the design of neural networks; therefore, it can be efficient for searching more compact student models. In such a way, NAS can be incorporated with KD for model compression. This has recently been demonstrated for GAN compression [199, 299], where it is shown to be effective for finding efficient student models with lower computation costs and fewer parameters. It turns out that NAS improves the compression ratio and accelerates the KD process. A similar approach is taken by [243], which learns to remove layers of the teacher network via reinforcement learning (RL). Thus, we propose that NAS with RL can be a good direction of KD for model compression. This might significantly relieve the complexity and enhance the learning efficiency of existing methods, in which the student is manually designed based on the teacher.

11.2 Potential of GNN
Although GNNs have brought progress to KD learning under the S-T framework, some challenges remain. This is because most methods rely on finding structured data on which graph-based algorithms can be applied. [18] considers the instance features and instance relationships as instance graphs, and [215] builds an input graph representation for multi-task knowledge distillation. However, in knowledge distillation there also exists non-structural knowledge in addition to the structural knowledge (e.g., training data, logits, intermediate features, and outputs of the teacher), and it is necessary to construct a flexible knowledge graph to tackle the non-structural distillation process.

11.3 Non-Euclidean distillation measure
Existing KD losses mostly depend on Euclidean-type distances (e.g., l1 or l2), which have their own limitations. [300] has shown that algorithms that regularize with the Euclidean distance (e.g., the MSE loss) are easily confused by random features. The difficulty arises when the model capacity gap between the teacher and the student is large. Besides, l2 regularization does not penalize small weights enough. Inspired by a recent work [301] on GAN training, we propose that it is useful to exploit the information of higher-order statistics of data in non-Euclidean spaces (e.g., the hypersphere). This is because the geometric constraints induced by a non-Euclidean distance might make training more stable, thus improving the efficiency of KD.
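As one simple illustration of this direction (not a method taken from [300, 301]), teacher and student features can be projected onto the unit hypersphere and matched by their cosine discrepancy instead of a raw l1/l2 distance:

```python
import torch.nn.functional as F

def hypersphere_kd_loss(feat_student, feat_teacher, eps=1e-8):
    """Illustrative non-Euclidean distillation loss: both feature sets are
    projected onto the unit hypersphere (L2 normalization) and the student is
    penalized by the cosine discrepancy to the teacher instead of an l1/l2
    distance in the raw feature space."""
    f_s = F.normalize(feat_student.flatten(1), dim=1, eps=eps)
    f_t = F.normalize(feat_teacher.flatten(1), dim=1, eps=eps).detach()
    return (1.0 - (f_s * f_t).sum(dim=1)).mean()
```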
11.4 Better feature representations
Existing methods that focus on KD with multiple teachers show potential for handling cross-domain problems or other problems where the ground truth is not available. However, the ensemble of feature representations [48, 64, 111] is still challenging in some aspects. One critical challenge is fusing the feature representations and balancing each of them with robust gating mechanisms. Manually assigning weights to each component may hurt the diversity and flexibility of individual feature representations, thus impairing the effectiveness of ensemble knowledge. One possible solution is attention gates, as demonstrated in some detection tasks [302, 303]. The aim of this approach is to highlight the important feature dimensions and prune feature responses to preserve only the activations relevant to the specific task. Another approach is inspired by the gating mechanism used in long short-term memory (LSTM) [304, 305]. That is, a gate unit in KD is elaborately designed to remember features across different image regions and to control the pass of each region feature as a whole according to its contribution to the task (e.g., classification) with a weight of importance.

11.5 A more constructive theoretical analysis
While KD shows impressive performance improvements in many tasks, the intuition behind it is still unclear. Recently, [16] explained conventional KD [1] using linear models, and [8, 9, 10] focus on explaining feature-based KD. Mobahi et al. [163] provide a theoretical analysis for self-distillation. However, the mechanism behind data-free KD and KD from multiple teachers is still unknown. Therefore, further theoretical studies on explaining the principles of these methods should be undertaken.

11.6 Potentials for special vision problems
While existing KD techniques are mostly developed for common vision problems (e.g., classification), they rarely exploit some special vision fields such as 360° vision [298] and event-based vision [124, 158, 306]. The biggest challenge in both of these fields is the lack of labeled data, and learning in them requires specially adapted input representations for neural networks. Thus, the potential of KD, particularly cross-modal KD, for these two fields is promising. By distilling knowledge from a teacher trained with RGB images or frames to a student network specialized in learning to predict 360° images or stacked event images, one not only handles the problem of the lack of data but also achieves desirable results on the prediction tasks.

11.7 Integration of vision, speech and NLP
It is promising to apply KD to the integrated learning problems of vision, speech, and NLP. Although recent attempts at cross-modal KD [142, 145, 146, 147] focus on transferring the knowledge from one modality (e.g., video) to another (e.g., sound) on the end tasks, it is still challenging to learn end tasks for the integration of the three modalities. The major challenge may come from collecting paired data of the three modalities; however, it is possible to apply GANs or representation learning methods to unsupervised cross-modal KD for learning effective end tasks.

12 CONCLUSION
This review of KD and S-T learning has covered the major technical details and applications for visual intelligence. We provide a formal definition of the problem and introduce a taxonomy of existing KD approaches. Drawing connections among these approaches, we identify a new active area of research that is likely to create new methods that take advantage of the strengths of each paradigm. Each taxonomy of the KD methods shows the current technical status regarding its advantages and disadvantages. Based on explicit analyses, we then discuss methods to overcome the challenges and break the bottlenecks by exploiting new deep learning methods, new KD losses, and new vision application fields.

ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2018R1A2B3008640) and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00440, Development of Artificial Intelligence Technology that Continuously Improves Itself as the Situation Changes in the Real World).

REFERENCES
[1] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” NIPS Deep Learning Workshop, 2015.
[2] A. Tarvainen and H. Valpola, “Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results,” in Advances in Neural Information Processing Systems, 2017, pp. 1195–1204.
[3] S. Gutstein, O. Fuentes, and E. Freudenthal, “Knowledge transfer in deep convolutional neural nets,” International Journal on Artificial Intelligence Tools, vol. 17, no. 03, pp. 555–567, 2008.
[4] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 535–541.
[5] https://intellabs.github.io/distiller/knowledge_distillation.html.
[6] https://github.com/FLHonker/Awesome-Knowledge-Distillation.
[7] https://github.com/dkozlov/awesome-knowledge-distillation.
[8] S. Ahn, S. X. Hu, A. Damianou, N. D. Lawrence, and Z. Dai, “Variational information distillation for knowledge transfer,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[9] S. Hegde, R. Prasad, R. Hebbalaguppe, and V. Kumar, “Variational student: Learning compact and sparser networks in knowledge distillation framework,” ICASSP, 2020.
[10] Y. Tian, D. Krishnan, and P. Isola, “Contrastive representation distillation,” in International Conference on Learning Representations, 2019.
[11] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural Information Processing Systems, 2014, pp. 2654–2662.
[12] K. Mangalam and M. Salzamann, “On compressing u-net using knowledge distillation,” arXiv preprint arXiv:1812.00249, 2018.
[13] Q. Ding, S. Wu, H. Sun, J. Guo, and S.-T. Xia, “Adaptive regularization of labels,” arXiv preprint arXiv:1908.05474, 2019.
[14] J. Cho and M. Lee, “Building a compact convolutional ciation for the Advancement of Artificial Intelligence, 2019.
neural network for embedded intelligent sensor systems [33] A. Malinin, B. Mlodozeniec, and M. Gales, “Ensemble
using group sparsity and knowledge distillation,” Sen- distribution distillation,” arXiv preprint arXiv:1905.00076,
sors, vol. 19, no. 19, p. 4307, 2019. 2019.
[15] T. Wen, S. Lai, and X. Qian, “Preparing lessons: Improve [34] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep
knowledge distillation with better supervision,” arXiv mutual learning,” in Proceedings of the IEEE Conference on
preprint arXiv:1911.07471, 2019. Computer Vision and Pattern Recognition, 2018, pp. 4320–
[16] J. H. Cho and B. Hariharan, “On the efficacy of knowl- 4328.
edge distillation,” in Proceedings of the IEEE International [35] M. Phuong and C. H. Lampert, “Distillation-based train-
Conference on Computer Vision, 2019, pp. 4794–4802. ing for multi-exit architectures,” in Proceedings of the
[17] C. Yang, L. Xie, C. Su, and A. L. Yuille, “Snapshot distilla- IEEE International Conference on Computer Vision, 2019, pp.
tion: Teacher-student optimization in one generation,” in 1355–1364.
Proceedings of the IEEE Conference on Computer Vision and [36] S. Zagoruyko and N. Komodakis, “Paying more attention
Pattern Recognition, 2019, pp. 2859–2868. to attention: Improving the performance of convolutional
[18] I.-J. Liu, J. Peng, and A. G. Schwing, “Knowledge flow: neural networks via attention transfer,” International Con-
Improve upon your teachers,” Seventh International Con- ference on Learning Representations, 2016.
ference on Learning Representations, 2019. [37] S. You, C. Xu, C. Xu, and D. Tao, “Learning from mul-
[19] A. Mishra and D. Marr, “Apprentice: Using knowledge tiple teacher networks,” in Proceedings of the 23rd ACM
distillation techniques to improve low-precision network SIGKDD International Conference on Knowledge Discovery
accuracy,” arXiv preprint arXiv:1711.05852, 2017. and Data Mining, 2017, pp. 1285–1294.
[20] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L.-J. Li, “Learn- [38] N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and
ing from noisy labels with distillation,” in Proceedings of K. Talwar, “Semi-supervised knowledge transfer for deep
the IEEE International Conference on Computer Vision, 2017, learning from private training data,” ICLR, 2017.
pp. 1910–1918. [39] B. B. Sau and V. N. Balasubramanian, “Deep model
[21] Q. Xie, E. Hovy, M.-T. Luong, and Q. V. Le, “Self-training compression: Distilling knowledge from noisy teachers,”
with noisy student improves imagenet classification,” arXiv preprint arXiv:1610.09650, 2016.
CVPR, 2020. [40] X. Lan, X. Zhu, and S. Gong, “Knowledge distillation by
[22] Y. Xu, Y. Wang, H. Chen, K. Han, X. Chunjing, D. Tao, and on-the-fly native ensemble,” in Proceedings of the 32nd
C. Xu, “Positive-unlabeled compression on the cloud,” in International Conference on Neural Information Processing
Advances in Neural Information Processing Systems, 2019, Systems. Curran Associates Inc., 2018, pp. 7528–7538.
pp. 2561–2570. [41] G. Song and W. Chai, “Collaborative learning for deep
[23] F. Sarfraz, E. Arani, and B. Zonooz, “Noisy collaboration neural networks,” in Advances in Neural Information Pro-
in knowledge distillation,” openreview.net, 2019. cessing Systems, 2018, pp. 1832–1841.
[24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, [42] I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari,
and R. Salakhutdinov, “Dropout: a simple way to prevent and K. He, “Data distillation: Towards omni-supervised
neural networks from overfitting,” The journal of machine learning,” in Proceedings of the IEEE Conference on Com-
learning research, vol. 15, no. 1, pp. 1929–1958, 2014. puter Vision and Pattern Recognition, 2018, pp. 4119–4128.
[25] C. Yang, L. Xie, S. Qiao, and A. L. Yuille, “Training [43] X. Tan, Y. Ren, D. He, T. Qin, Z. Zhao, and T.-Y. Liu,
deep neural networks in generations: A more tolerant “Multilingual neural machine translation with knowl-
teacher educates better students,” in Proceedings of the edge distillation,” ICLR, 2019.
AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. [44] J. Vongkulbhisal, P. Vinayavekhin, and M. Visentini-
5628–5635. Scarzanella, “Unifying heterogeneous classifiers with dis-
[26] L. Yu, V. O. Yazici, X. Liu, J. v. d. Weijer, Y. Cheng, and tillation,” in The IEEE Conference on Computer Vision and
A. Ramisa, “Learning metrics from teachers: Compact Pattern Recognition (CVPR), June 2019.
networks for image embedding,” in Proceedings of the [45] A. Wu, W.-S. Zheng, X. Guo, and J.-H. Lai, “Distilled per-
IEEE Conference on Computer Vision and Pattern Recogni- son re-identification: Towards a more scalable system,” in
tion, 2019, pp. 2907–2916. Proceedings of the IEEE Conference on Computer Vision and
[27] S. Arora, M. M. Khapra, and H. G. Ramaswamy, “On Pattern Recognition, 2019, pp. 1187–1196.
knowledge distillation from complex networks for re- [46] N. Dvornik, C. Schmid, and J. Mairal, “Diversity with
sponse prediction,” in Proceedings of the 2019 Conference cooperation: Ensemble methods for few-shot classifica-
of the North American Chapter of the Association for Compu- tion,” in Proceedings of the IEEE International Conference on
tational Linguistics: Human Language Technologies, Volume 1 Computer Vision, 2019, pp. 3723–3731.
(Long and Short Papers), 2019, pp. 3813–3822. [47] Z. Yang, L. Shou, M. Gong, W. Lin, and D. Jiang, “Model
[28] W. Park, D. Kim, Y. Lu, and M. Cho, “Relational knowl- compression with two-stage multi-teacher knowledge
edge distillation,” in Proceedings of the IEEE Conference on distillation for web question answering system,” WSDM,
Computer Vision and Pattern Recognition, 2019, pp. 3967– 2019.
3976. [48] S. Park and N. Kwak, “Feed: Feature-level ensemble for
[29] B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and knowledge distillation,” European Conference on Artificial
Z. Zhang, “Correlation congruence for knowledge distil- Intelligence, 2019.
lation,” in Proceedings of the IEEE International Conference [49] K. Lee, L. T. Nguyen, and B. Shim, “Stochasticity and skip
on Computer Vision, 2019, pp. 5007–5016. connections improve knowledge transfer,” 2019.
[30] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and [50] D. Chen, J.-P. Mei, C. Wang, Y. Feng, and C. Chen, “Online
A. Anandkumar, “Born again neural networks,” ICML, knowledge distillation with diverse peers,” Association for
2018. the Advancement of Artificial Intelligence, 2020.
[31] D. Wang, Y. Li, Y. Lin, and Y. Zhuang, “Relational knowl- [51] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
edge transfer for zero-shot learning,” in Thirtieth AAAI “Imagenet: A large-scale hierarchical image database,”
Conference on Artificial Intelligence, 2016. in 2009 IEEE conference on computer vision and pattern
[32] S.-I. Mirzadeh, M. Farajtabar, A. Li, and H. Ghasemzadeh, recognition. Ieee, 2009, pp. 248–255.
“Improved knowledge distillation via teacher assistant: [52] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta,
Bridging the gap between student and teacher,” the Asso- and Y. Bengio, “Fitnets: Hints for thin deep nets,” Inter-
national Conference on Learning Representations, 2014. with probabilistic knowledge transfer,” in Proceedings of
[53] B. Heo, J. Kim, S. Yun, H. Park, N. Kwak, and J. Y. Choi, the European Conference on Computer Vision (ECCV), 2018,
“A comprehensive overhaul of feature distillation,” in pp. 268–284.
Proceedings of the IEEE International Conference on Computer [74] S. Ioffe and C. Szegedy, “Batch normalization: Accelerat-
Vision, 2019, pp. 1921–1930. ing deep network training by reducing internal covariate
[54] J. Kim, S. Park, and N. Kwak, “Paraphrasing complex shift,” arXiv preprint arXiv:1502.03167, 2015.
network: Network compression via factor transfer,” in [75] S. Zagoruyko and N. Komodakis, “Wide residual net-
Advances in Neural Information Processing Systems, 2018, works,” BMVC, 2016.
pp. 2760–2769. [76] D. Barber and F. V. Agakov, “The im algorithm: a vari-
[55] Z. Huang and N. Wang, “Like what you like: Knowledge ational approach to information maximization,” in Ad-
distill via neuron selectivity transfer,” International Con- vances in neural information processing systems, 2003, p.
ference on Learning Representations, 2017. None.
[56] Z. Zhang, G. Ning, and Z. He, “Knowledge projection [77] Y. Bengio, J. Louradour, R. Collobert, and J. Weston,
for effective design of thinner and faster deep neural “Curriculum learning,” in Proceedings of the 26th annual
networks,” 2017. international conference on machine learning. ACM, 2009,
[57] J. Yim, D. Joo, J. Bae, and J. Kim, “A gift from knowledge pp. 41–48.
distillation: Fast optimization, network minimization and [78] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
transfer learning,” in Proceedings of the IEEE Conference on D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio,
Computer Vision and Pattern Recognition, 2017, pp. 4133– “Generative adversarial nets,” in Advances in neural infor-
4141. mation processing systems, 2014, pp. 2672–2680.
[58] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou, [79] F. Horn and K.-R. Müller, “Learning similarity preserv-
“Minilm: Deep self-attention distillation for task-agnostic ing representations with neural similarity and context
compression of pre-trained transformers,” arXiv preprint encoders,” 2016.
arXiv:2002.10957, 2020. [80] K. Clark, M.-T. Luong, U. Khandelwal, C. D. Manning,
[59] S. Srinivas and F. Fleuret, “Knowledge transfer with and Q. V. Le, “Bam! born-again multi-task networks for
jacobian matching,” International Conference on Machine natural language understanding,” ACL, 2019.
Learning, 2018. [81] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap,
[60] X. Jin, B. Peng, Y. Wu, Y. Liu, J. Liu, D. Liang, J. Yan, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous
and X. Hu, “Knowledge distillation via route constrained methods for deep reinforcement learning,” in Interna-
optimization,” ICCV, 2019. tional conference on machine learning, 2016, pp. 1928–1937.
[61] F. Tung and G. Mori, “Similarity-preserving knowledge [82] C. Zhang and Y. Peng, “Better and faster: knowledge
distillation,” in Proceedings of the IEEE International Con- transfer from multiple self-supervised learning tasks via
ference on Computer Vision, 2019, pp. 1365–1374. graph distillation for video classification,” arXiv preprint
[62] Z. Shen, Z. He, and X. Xue, “Meal: Multi-model ensem- arXiv:1804.10069, 2018.
ble via adversarial learning,” in Proceedings of the AAAI [83] C. Shen, M. Xue, X. Wang, J. Song, L. Sun, and M. Song,
Conference on Artificial Intelligence, vol. 33, 2019, pp. 4886– “Customizing student networks from heterogeneous
4893. teachers via adaptive knowledge amalgamation,” in Pro-
[63] B. Heo, M. Lee, S. Yun, and J. Y. Choi, “Knowledge ceedings of the IEEE International Conference on Computer
transfer via distillation of activation boundaries formed Vision, 2019, pp. 3504–3513.
by hidden neurons,” 2018. [84] S. Ruder, P. Ghaffari, and J. G. Breslin, “Knowl-
[64] I. Chung, S. Park, J. Kim, and N. Kwak, “Feature-map- edge adaptation: Teaching to adapt,” arXiv preprint
level online adversarial knowledge distillation,” 2020. arXiv:1702.02052, 2017.
[65] T. Wang, L. Yuan, X. Zhang, and J. Feng, “Distilling [85] X. Zhu, S. Gong et al., “Knowledge distillation by on-
object detectors with fine-grained feature imitation,” in the-fly native ensemble,” in Advances in neural information
Proceedings of the IEEE Conference on Computer Vision and processing systems, 2018, pp. 7517–7527.
Pattern Recognition, 2019, pp. 4933–4942. [86] J. Vongkulbhisal, P. Vinayavekhin, and M. Visentini-
[66] S. Changyong, L. Peng, X. Yuan, Q. Yanyun, Scarzanella, “Unifying heterogeneous classifiers with dis-
D. Longquan, and M. Lizhuang, “Knowledge squeezed tillation,” in Proceedings of the IEEE Conference on Computer
adversarial network compression,” arXiv preprint Vision and Pattern Recognition, 2019, pp. 3175–3184.
arXiv:1904.05100, 2019. [87] Z. Yang, L. Shou, M. Gong, W. Lin, and D. Jiang, “Model
[67] A. Kulkarni, N. Panchi, and S. Chiddarwar, “Stagewise compression with two-stage multi-teacher knowledge
knowledge distillation,” arXiv preprint arXiv:1911.06786, distillation for web question answering system,” in Pro-
2019. ceedings of the 13th International Conference on Web Search
[68] G. Aguilar, Y. Ling, Y. Zhang, B. Yao, X. Fan, and E. Guo, and Data Mining, 2020, pp. 690–698.
“Knowledge distillation from internal representations,” [88] L. Tran, B. S. Veeling, K. Roth, J. Swiatkowski, J. V.
AAAI, 2019. Dillon, J. Snoek, S. Mandt, T. Salimans, S. Nowozin, and
[69] M. Gao, Y. Shen, Q. Li, and C. C. Loy, “Residual knowl- R. Jenatton, “Hydra: Preserving ensemble diversity for
edge distillation,” arXiv preprint arXiv:2002.09168, 2020. model distillation,” arXiv preprint arXiv:2001.04694, 2020.
[70] H.-T. Li, S.-C. Lin, C.-Y. Chen, and C.-K. Chiang, “Layer- [89] A. Ruiz and J. Verbeek, “Distilled hierarchical neural
level knowledge distillation for deep neural network ensembles with adaptive inference cost,” arXiv preprint
learning,” Applied Sciences, vol. 9, no. 10, p. 1966, 2019. arXiv:2003.01474, 2020.
[71] S. H. Lee, D. H. Kim, and B. C. Song, “Self-supervised [90] M.-C. Wu and C.-T. Chiu, “Multi-teacher knowledge dis-
knowledge distillation using singular value decom- tillation for compressed video action recognition based
position,” in European Conference on Computer Vision. on deep learning,” Journal of Systems Architecture, vol. 103,
Springer, 2018, pp. 339–354. p. 101695, 2020.
[72] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, [91] T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and
A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is B. Ramabhadran, “Efficient knowledge distillation from
all you need,” in Advances in neural information processing an ensemble of teachers.” in Interspeech, 2017, pp. 3697–
systems, 2017, pp. 5998–6008. 3701.
[73] N. Passalis and A. Tefas, “Learning deep representations [92] M. Mehak and V. N. Balasubramanian, “Knowledge dis-
tillation from multiple teachers using visual explana- edge distillation for deep neural networks,” NIPS Work-
tions,” Ph.D. dissertation, Indian Institute of Technology shop, 2017.
Hyderabad, 2018. [113] K. Bhardwaj, N. Suda, and R. Marculescu, “Dream dis-
[93] J.-w. Jung, H. Heo, H.-j. Shim, and H.-J. Yu, “Distilling the tillation: A data-independent model compression frame-
knowledge of specialist deep neural networks in acoustic work,” ICML Joint Workshop, 2019.
scene classification,” 2019. [114] M. Haroush, I. Hubara, E. Hoffer, and D. Soudry, “The
[94] S. Sun, Y. Cheng, Z. Gan, and J. Liu, “Patient knowledge knowledge within: Methods for data-free model com-
distillation for bert model compression,” EMNLP, 2019. pression,” CVPR, 2020.
[95] L. Liu, H. Wang, J. Lin, R. Socher, and C. Xiong, “Atten- [115] G. K. Nayak, K. R. Mopuri, V. Shaj, R. V. Babu, and
tive student meets multi-task teacher: Improved knowl- A. Chakraborty, “Zero-shot knowledge distillation in
edge distillation for pretrained models,” arXiv preprint deep networks,” ICML, 2019.
arXiv:1911.03588, 2019. [116] H. Chen, Y. Wang, C. Xu, Z. Yang, C. Liu, B. Shi, C. Xu,
[96] X. Liu, P. He, W. Chen, and J. Gao, “Improving multi- C. Xu, and Q. Tian, “Data-free learning of student net-
task deep neural networks via knowledge distillation works,” in Proceedings of the IEEE International Conference
for natural language understanding,” arXiv preprint on Computer Vision, 2019, pp. 3514–3522.
arXiv:1904.09482, 2019. [117] G. Fang, J. Song, C. Shen, X. Wang, D. Chen, and
[97] X. He, Z. Zhou, and L. Thiele, “Multi-task zipping via M. Song, “Data-free adversarial distillation,” arXiv
layer-wise neuron sharing,” in Advances in Neural Infor- preprint arXiv:1912.11006, 2019.
mation Processing Systems, 2018, pp. 6016–6026. [118] J. Ye, Y. Ji, X. Wang, X. Gao, and M. Song, “Data-free
[98] C. Shen, X. Wang, J. Song, L. Sun, and M. Song, “Amal- knowledge amalgamation via group-stack dual-gan,”
gamating knowledge towards comprehensive classifica- CVPR, 2020.
tion,” in Proceedings of the AAAI Conference on Artificial [119] J. Yoo, M. Cho, T. Kim, and U. Kang, “Knowledge ex-
Intelligence, vol. 33, 2019, pp. 3068–3075. traction with no observable data,” in Advances in Neural
[99] J. Ye, Y. Ji, X. Wang, K. Ou, D. Tao, and M. Song, “Student Information Processing Systems, 2019, pp. 2701–2710.
becoming the master: Knowledge amalgamation for joint [120] H. Yin, P. Molchanov, Z. Li, J. M. Alvarez, A. Mallya,
scene parsing, depth estimation, and more,” in Proceed- D. Hoiem, N. K. Jha, and J. Kautz, “Dreaming to distill:
ings of the IEEE Conference on Computer Vision and Pattern Data-free knowledge transfer via deepinversion,” CVPR,
Recognition, 2019, pp. 2829–2838. 2020.
[100] S. Luo, X. Wang, G. Fang, Y. Hu, D. Tao, and M. Song, [121] P. Micaelli and A. J. Storkey, “Zero-shot knowledge trans-
“Knowledge amalgamation from heterogeneous net- fer via adversarial belief matching,” in Advances in Neural
works by common feature learning,” IJCAI, 2019. Information Processing Systems, 2019, pp. 9547–9557.
[101] R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, [122] M. Kulkarni, K. Patil, and S. Karande, “Knowledge
and G. E. Hinton, “Large scale distributed neural network distillation using unlabeled mismatched images,” arXiv
training through online distillation,” ICLR, 2018. preprint arXiv:1703.07131, 2017.
[102] P. Zhou, L. Mai, J. Zhang, N. Xu, Z. Wu, and L. S. [123] L. Wang, W. Cho, and K.-J. Yoon, “Deceiving image-to-
Davis, “M2kd: Multi-model and multi-level knowledge image translation networks for autonomous driving with
distillation for incremental learning,” The British Machine adversarial perturbations,” IEEE Robotics and Automation
Vision Conference (BMVC), 2019. Letters (RAL), 2020.
[103] L. Xiang and G. Ding, “Learning from multiple experts: [124] L. Wang, Y.-S. Ho, K.-J. Yoon et al., “Event-based high dy-
Self-paced knowledge distillation for long-tailed classifi- namic range image and very high frame rate video gener-
cation,” ECCV, 2020. ation using conditional generative adversarial networks,”
[104] C. Gong, X. Chang, M. Fang, and J. Yang, “Teaching in Proceedings of the IEEE/CVF Conference on Computer
semi-supervised classifier via generalized distillation.” in Vision and Pattern Recognition, 2019, pp. 10 081–10 090.
IJCAI, 2018, pp. 2156–2162. [125] A. Mordvintsev, C. Olah, and M. Tyka, “Inceptionism:
[105] J. Ye, X. Wang, Y. Ji, K. Ou, and M. Song, “Amalgamat- Going deeper into neural networks,” 2015.
ing filtered knowledge: learning task-customized student [126] Q. Liu, L. Xie, H. Wang, and A. L. Yuille, “Semantic-aware
from multi-task teachers,” IJCAI, 2019. knowledge preservation for zero-shot sketch-based im-
[106] J. Gao, Z. Li, R. Nevatia et al., “Knowledge concentration: age retrieval,” in Proceedings of the IEEE International
Learning 100k object classifiers in a single cnn,” arXiv Conference on Computer Vision, 2019, pp. 3662–3671.
preprint arXiv:1711.07607, 2017. [127] T. Li, J. Li, Z. Liu, and C. Zhang, “Few sample knowledge
[107] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, distillation for efficient network compression,” arXiv
G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, preprint arXiv:1812.01839, 2018.
K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” [128] A. Kimura, Z. Ghahramani, K. Takeuchi, T. Iwata, and
arXiv preprint arXiv:1511.06295, 2015. N. Ueda, “Few-shot learning of neural networks from
[108] K. Ahmed, M. H. Baig, and L. Torresani, “Network of scratch by pseudo example optimization,” BMVC, 2018.
experts for large-scale image categorization,” in European [129] H. Bai, J. Wu, I. King, and M. Lyu, “Few shot network
Conference on Computer Vision. Springer, 2016, pp. 516– compression via cross distillation,” AAAI, 2020.
532. [130] E. Snelson and Z. Ghahramani, “Sparse gaussian pro-
[109] I. M. Galván, P. Isasi, R. Aler, and J. M. Valls, “A selective cesses using pseudo-inputs,” in Advances in neural infor-
learning method to improve the generalization of multi- mation processing systems, 2006, pp. 1257–1264.
layer feedforward neural networks,” International journal [131] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
of neural systems, vol. 11, no. 02, pp. 167–177, 2001. D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich,
[110] J. Kim, M. Hyun, I. Chung, and N. Kwak, “Feature fusion “Going deeper with convolutions,” in Proceedings of the
for online mutual knowledge distillation,” International IEEE conference on computer vision and pattern recognition,
Conference on Pattern Recognition, 2019. 2015, pp. 1–9.
[111] S. Hou, X. Liu, and Z. Wang, “Dualnet: Learn comple- [132] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explain-
mentary features for image recognition,” in Proceedings of ing and harnessing adversarial examples,” arXiv preprint
the IEEE International Conference on Computer Vision, 2017, arXiv:1412.6572, 2014.
pp. 502–510. [133] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-
[112] R. G. Lopes, S. Fenu, and T. Starner, “Data-free knowl- based learning applied to document recognition,” Pro-
ceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. “A cross-modal distillation network for person
[134] A. Krizhevsky, G. Hinton et al., “Learning multiple layers re-identification in rgb-depth,” arXiv preprint
of features from tiny images,” 2009. arXiv:1810.11641, 2018.
[135] M. Zhu and S. Gupta, “To prune, or not to prune: explor- [154] C. Gan, H. Zhao, P. Chen, D. Cox, and A. Torralba, “Self-
ing the efficacy of pruning for model compression,” ICLR supervised moving vehicle tracking with stereo sound,”
Workshop, 2018. in Proceedings of the IEEE International Conference on Com-
[136] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learn- puter Vision, 2019, pp. 7053–7062.
ing sound representations from unlabeled video,” in Ad- [155] A. Perez, V. Sanguineti, P. Morerio, and V. Murino,
vances in neural information processing systems, 2016, pp. “Audio-visual model distillation using acoustic images,”
892–900. in The IEEE Winter Conference on Applications of Computer
[137] J.-C. Su and S. Maji, “Adapting models to signal degra- Vision, 2020, pp. 2854–2863.
dation using distillation,” BMVC, 2016. [156] L. Zhao, X. Peng, Y. Chen, M. Kapadia, and D. N.
[138] A. Nagrani, S. Albanie, and A. Zisserman, “Learnable Metaxas, “Knowledge as priors: Cross-modal knowledge
pins: Cross-modal embeddings for person identity,” in generalization for datasets without superior knowledge,”
Proceedings of the European Conference on Computer Vision CVPR, 2020.
(ECCV), 2018, pp. 71–88. [157] Z. Ke, D. Wang, Q. Yan, J. Ren, and R. W. Lau, “Dual
[139] ——, “Seeing voices and hearing faces: Cross-modal bio- student: Breaking the limits of the teacher in semi-
metric matching,” in Proceedings of the IEEE conference supervised learning,” in Proceedings of the IEEE Interna-
on computer vision and pattern recognition, 2018, pp. 8427– tional Conference on Computer Vision, 2019, pp. 6728–6736.
8436. [158] L. Wang, T.-K. Kim, and K.-J. Yoon, “Eventsr: From asyn-
[140] J. Hoffman, S. Gupta, J. Leong, S. Guadarrama, and chronous events to image reconstruction, restoration, and
T. Darrell, “Cross-modal adaptation for rgb-d detection,” super-resolution via end-to-end adversarial learning,”
in 2016 IEEE International Conference on Robotics and Au- Proceedings of the IEEE Conference on Computer Vision and
tomation (ICRA). IEEE, 2016, pp. 5032–5039. Pattern Recognition, 2020.
[141] T. Afouras, J. S. Chung, and A. Zisserman, “Asr is all you [159] R. T. Mullapudi, S. Chen, K. Zhang, D. Ramanan, and
need: cross-modal distillation for lip reading,” ICASSP, K. Fatahalian, “Online model distillation for efficient
2020. video inference,” in Proceedings of the IEEE International
[142] S. Albanie, A. Nagrani, A. Vedaldi, and A. Zisserman, Conference on Computer Vision, 2019, pp. 3573–3582.
“Emotion recognition in speech using cross-modal trans- [160] L. Gao, X. Lan, H. Mi, D. Feng, K. Xu, and Y. Peng,
fer in the wild,” in Proceedings of the 26th ACM interna- “Multistructure-based collaborative online distillation,”
tional conference on Multimedia, 2018, pp. 292–301. Entropy, vol. 21, no. 4, p. 357, 2019.
[143] S. Gupta, J. Hoffman, and J. Malik, “Cross modal distil- [161] R. Lin, J. Xiao, and J. Fan, “Mod: A deep mixture model
lation for supervision transfer,” in Proceedings of the IEEE with online knowledge distillation for large scale video
conference on computer vision and pattern recognition, 2016, temporal concept localization,” ICCV YouTube8M work-
pp. 2827–2836. shop, 2019.
[144] T. Salem, C. Greenwell, H. Blanton, and N. Jacobs, [162] A. Cioppa, A. Deliege, M. Istasse, C. De Vleeschouwer,
“Learning to map nearly anything,” in IGARSS 2019-2019 and M. Van Droogenbroeck, “Arthus: Adaptive real-time
IEEE International Geoscience and Remote Sensing Sympo- human segmentation in sports through online distilla-
sium. IEEE, 2019, pp. 4803–4806. tion,” in Proceedings of the IEEE Conference on Computer
[145] F. M. Thoker and J. Gall, “Cross-modal knowledge distil- Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
lation for action recognition,” in 2019 IEEE International [163] H. Mobahi, M. Farajtabar, and P. L. Bartlett, “Self-
Conference on Image Processing (ICIP). IEEE, 2019, pp. distillation amplifies regularization in hilbert space,”
6–10. arXiv preprint arXiv:2002.05715, 2020.
[146] M. Zhao, T. Li, M. Abu Alsheikh, Y. Tian, H. Zhao, [164] T.-B. Xu and C.-L. Liu, “Data-distortion guided self-
A. Torralba, and D. Katabi, “Through-wall human pose distillation for deep neural networks,” in Proceedings of
estimation using radio signals,” in Proceedings of the IEEE the AAAI Conference on Artificial Intelligence, vol. 33, 2019,
Conference on Computer Vision and Pattern Recognition, pp. 5565–5572.
2018, pp. 7356–7365. [165] S. Hahn and H. Choi, “Self-knowledge distilla-
[147] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and tion in natural language processing,” arXiv preprint
A. Torralba, “Ambient sound provides supervision for arXiv:1908.01851, 2019.
visual learning,” in European conference on computer vision. [166] H. Lee, S. J. Hwang, and J. Shin, “Rethinking data aug-
Springer, 2016, pp. 801–816. mentation: Self-supervision and self-distillation,” ICML,
[148] R. Arandjelovic and A. Zisserman, “Look, listen and 2020.
learn,” in Proceedings of the IEEE International Conference [167] E. J. Crowley, G. Gray, and A. J. Storkey, “Moonshine:
on Computer Vision, 2017, pp. 609–617. Distilling with cheap convolutions,” in Advances in Neural
[149] T. Do, T.-T. Do, H. Tran, E. Tjiputra, and Q. D. Tran, Information Processing Systems, 2018, pp. 2888–2898.
“Compact trilinear interaction for visual question an- [168] L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma, “Be
swering,” in Proceedings of the IEEE International Confer- your own teacher: Improve the performance of convolu-
ence on Computer Vision, 2019, pp. 392–401. tional neural networks via self distillation,” in Proceedings
[150] Y. Aytar, C. Vondrick, and A. Torralba, “See, hear, of the IEEE International Conference on Computer Vision,
and read: Deep aligned representations,” arXiv preprint 2019, pp. 3713–3722.
arXiv:1706.00932, 2017. [169] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin, “Learning
[151] C. Kim, H. V. Shin, T.-H. Oh, A. Kaspar, M. Elgharib, a unified classifier incrementally via rebalancing,” in
and W. Matusik, “On learning associations of faces and Proceedings of the IEEE Conference on Computer Vision and
voices,” in Asian Conference on Computer Vision. Springer, Pattern Recognition, 2019, pp. 831–839.
2018, pp. 276–292. [170] Y. Luan, H. Zhao, Z. Yang, and Y. Dai, “Msd: Multi-
[152] Q. Dou, Q. Liu, P. A. Heng, and B. Glocker, “Unpaired self-distillation learning via multi-classifiers within deep
multi-modal segmentation via knowledge distillation,” neural networks,” arXiv preprint arXiv:1911.09418, 2019.
IEEE TMI, 2020. [171] X. Lan, X. Zhu, and S. Gong, “Self-referenced deep learn-
[153] F. Hafner, A. Bhuiyan, J. F. Kooij, and E. Granger, ing,” in Asian conference on computer vision. Springer,
2018, pp. 284–300. 6761.
[172] L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, “Revisit [191] M. Goldblum, L. Fowl, S. Feizi, and T. Goldstein, “Adver-
knowledge distillation: a teacher-free framework,” Pro- sarially robust distillation,” AAAI, 2020.
ceedings of the IEEE/CVF Conference on Computer Vision and [192] W. Hong and J. Yu, “Gan-knowledge distillation for one-
Pattern Recognition, 2019. stage object detection,” arXiv preprint arXiv:1906.08467,
[173] J. Wang, X. Wang, B. Jin, J. Yan, W. Zhang, and H. Zha, 2019.
“Heterogeneous graph-based knowledge transfer for [193] M. Zhai, L. Chen, F. Tung, J. He, M. Nawhal, and G. Mori,
generalized zero-shot learning,” 2019. “Lifelong gan: Continual learning for conditional image
[174] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, generation,” in Proceedings of the IEEE International Con-
and M. Sun, “Graph neural networks: A review of meth- ference on Computer Vision, 2019, pp. 2759–2768.
ods and applications,” arXiv preprint arXiv:1812.08434, [194] Y. Wang, A. Gonzalez-Garcia, D. Berga, L. Herranz, F. S.
2018. Khan, and J. van de Weijer, “Minegan: effective knowl-
[175] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker, edge transfer from gans to target domains with few
“Learning efficient object detection models with knowl- images,” CVPR, 2019.
edge distillation,” in Advances in Neural Information Pro- [195] L. Gao, H. Mi, B. Zhu, D. Feng, Y. Li, and Y. Peng,
cessing Systems, 2017, pp. 742–751. “An adversarial feature distillation method for audio
[176] X. Guo, H. Li, S. Yi, J. Ren, and X. Wang, “Learning classification,” IEEE Access, vol. 7, pp. 105 319–105 330,
monocular depth by distilling cross-domain stereo net- 2019.
works,” in Proceedings of the European Conference on Com- [196] Z. Shen, Z. He, W. Cui, J. Yu, Y. Zheng, C. Zhu, and
puter Vision (ECCV), 2018, pp. 484–500. M. Savvides, “Adversarial-based knowledge distillation
[177] X. Cheng, Z. Rao, Y. Chen, and Q. Zhang, “Explaining for multi-model ensemble and noisy data refinement,”
knowledge distillation by quantifying the knowledge,” in AAAI, 2019.
Proceedings of the IEEE/CVF Conference on Computer Vision [197] Y. Liu, K. Chen, C. Liu, Z. Qin, Z. Luo, and J. Wang,
and Pattern Recognition, 2020, pp. 12 925–12 935. “Structured knowledge distillation for semantic segmen-
[178] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, tation,” in Proceedings of the IEEE Conference on Computer
E. Buchatskaya, C. Doersch, B. A. Pires, Z. D. Guo, Vision and Pattern Recognition, 2019, pp. 2604–2613.
M. G. Azar et al., “Bootstrap your own latent: A new [198] A. Aguinaldo, P.-Y. Chiang, A. Gain, A. Patil, K. Pearson,
approach to self-supervised learning,” arXiv preprint and S. Feizi, “Compressing gans using knowledge distil-
arXiv:2006.07733, 2020. lation,” arXiv preprint arXiv:1902.00159, 2019.
[179] M. Mirza and S. Osindero, “Conditional generative ad- [199] M. Li, J. Lin, Y. Ding, Z. Liu, J.-Y. Zhu, and S. Han,
versarial nets,” arXiv preprint arXiv:1411.1784, 2014. “Gan compression: Efficient architectures for interactive
[180] L. Chongxuan, T. Xu, J. Zhu, and B. Zhang, “Triple gen- conditional gans,” CVPR, 2020.
erative adversarial nets,” in Advances in neural information [200] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-
processing systems, 2017, pp. 4088–4098. to-image translation with conditional adversarial net-
[181] X. Wang, R. Zhang, Y. Sun, and J. Qi, “Kdgan: knowledge works,” in Proceedings of the IEEE conference on computer
distillation with generative adversarial networks,” in Ad- vision and pattern recognition, 2017, pp. 1125–1134.
vances in Neural Information Processing Systems, 2018, pp. [201] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and
775–786. S. Paul Smolley, “Least squares generative adversarial
[182] Z. Xu, Y.-C. Hsu, and J. Huang, “Training shallow and networks,” in Proceedings of the IEEE International Confer-
thin networks for acceleration via knowledge distillation ence on Computer Vision, 2017, pp. 2794–2802.
with conditional adversarial networks,” ICLR workshop, [202] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and
2017. A. C. Courville, “Improved training of wasserstein gans,”
[183] V. Belagiannis, A. Farshad, and F. Galasso, “Adversarial in Advances in neural information processing systems, 2017,
network compression,” in Proceedings of the European Con- pp. 5767–5777.
ference on Computer Vision (ECCV), 2018, pp. 0–0. [203] H. Chen, Y. Wang, H. Shu, C. Wen, C. Xu, B. Shi, C. Xu,
[184] R. Liu, N. Fusi, and L. Mackey, “Teacher-student com- and C. Xu, “Distilling portable generative adversarial
pression with generative adversarial networks,” arXiv networks for image translation,” AAAI, 2020.
preprint arXiv:1812.02271, 2018. [204] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros,
[185] P. Liu, W. Liu, H. Ma, T. Mei, and M. Seok, “Ktan: O. Wang, and E. Shechtman, “Toward multimodal image-
knowledge transfer adversarial network,” International to-image translation,” in Advances in neural information
Joint Conference on Neural Networks (IJCNN), 2020. processing systems, 2017, pp. 465–476.
[186] Y. Wang, C. Xu, C. Xu, and D. Tao, “Adversarial learning [205] A. Odena, C. Olah, and J. Shlens, “Conditional image
of portable student networks,” in Thirty-Second AAAI synthesis with auxiliary classifier gans,” in Proceedings
Conference on Artificial Intelligence, 2018. of the 34th International Conference on Machine Learning-
[187] S. Roheda, B. S. Riggan, H. Krim, and L. Dai, “Cross- Volume 70. JMLR. org, 2017, pp. 2642–2651.
modality distillation: A case for conditional generative [206] Z. Xu, Y.-C. Hsu, and J. Huang, “Learning loss for knowl-
adversarial networks,” in 2018 IEEE International Confer- edge distillation with conditional adversarial networks,”
ence on Acoustics, Speech and Signal Processing (ICASSP). arXiv preprint arXiv:1709.00513, 2017.
IEEE, 2018, pp. 2926–2930. [207] S. Lee and B. C. Song, “Graph-based knowledge distilla-
[188] Z. Xu, Y.-C. Hsu, and J. Huang, “Training student net- tion by multi-head self-attention network,” BMVC, 2019.
works for acceleration with conditional adversarial net- [208] Y. Liu, J. Cao, B. Li, C. Yuan, W. Hu, Y. Li, and Y. Duan,
works.” in BMVC, 2018, p. 61. “Knowledge distillation via instance relationship graph,”
[189] B. Heo, M. Lee, S. Yun, and J. Y. Choi, “Knowledge in Proceedings of the IEEE Conference on Computer Vision
distillation with adversarial samples supporting decision and Pattern Recognition, 2019, pp. 7096–7104.
boundary,” in Proceedings of the AAAI Conference on Artifi- [209] H. Cai, V. W. Zheng, and K. C.-C. Chang, “A comprehen-
cial Intelligence, vol. 33, 2019, pp. 3771–3778. sive survey of graph embedding: Problems, techniques,
[190] J. Liu, Y. Chen, and K. Liu, “Exploiting the ground-truth: and applications,” IEEE Transactions on Knowledge and
An adversarial imitation based knowledge distillation Data Engineering, vol. 30, no. 9, pp. 1616–1637, 2018.
approach for event detection,” in Proceedings of the AAAI [210] W. L. Hamilton, R. Ying, and J. Leskovec, “Representation
Conference on Artificial Intelligence, vol. 33, 2019, pp. 6754– learning on graphs: Methods and applications,” IEEE
Data Engineering Bulletin, 2017. tional Conference on Computer Vision, 2019, pp. 6830–6840.
[211] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and [231] Q. Cai, Y. Pan, C.-W. Ngo, X. Tian, L. Duan, and T. Yao,
G. Monfardini, “The graph neural network model,” IEEE “Exploring object relation in mean teacher for cross-
Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, domain detection,” in Proceedings of the IEEE Conference on
2008. Computer Vision and Pattern Recognition, 2019, pp. 11 457–
[212] C. Lassance, M. Bontonou, G. B. Hacene, V. Gripon, 11 466.
J. Tang, and A. Ortega, “Deep geometric knowledge [232] Y. Xu, B. Du, L. Zhang, Q. Zhang, G. Wang, and L. Zhang,
distillation with graphs,” ICASSP, 2020. “Self-ensembling attention networks: Addressing do-
Lin Wang is a Ph.D. student in the Visual Intelligence Lab., Dept. of Mechanical Engineering, Korea Advanced Institute of Science and Technology (KAIST). His research interests include deep learning (especially adversarial learning, transfer learning, and semi-/self-supervised learning), event camera-based vision, low-level vision (e.g., image super-resolution and deblurring), and computer vision for VR/AR.
Kuk-Jin Yoon received the B.S., M.S., and Ph.D. degrees in electrical engineering and computer science from the Korea Advanced Institute of Science and Technology (KAIST) in 1998, 2000, and 2006, respectively. He is now an Associate Professor at the Department of Mechanical Engineering, KAIST, South Korea, where he leads the Visual Intelligence Laboratory. Before joining KAIST, he was a Post-Doctoral Fellow in the PERCEPTION Team, INRIA, Grenoble, France, from 2006 to 2008, and an Assistant/Associate Professor at the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, South Korea, from 2008 to 2018. His research interests include various topics in computer vision, such as multi-view stereo, visual object tracking, SLAM and structure-from-motion, 360 camera and event-camera-based vision, and sensor fusion.
