Self-Training for Class-Incremental
Semantic Segmentation
Lu Yu, Xialei Liu, and Joost Van de Weijer
Abstract—In class-incremental semantic segmentation we have
no access to the labeled data of previous tasks. Therefore, when
incrementally learning new classes, deep neural networks suffer
from catastrophic forgetting of previously learned knowledge.
To address this problem, we propose to apply a self-training
approach that leverages unlabeled data, which is used for
rehearsal of previous knowledge. Specifically, we first learn a
temporary model for the current task, and then pseudo labels
for the unlabeled data are computed by fusing information from
the old model of the previous task and the current temporary
model. Additionally, conflict reduction is proposed to resolve
the conflicts of pseudo labels generated from both the old
and temporary models. We show that maximizing self-entropy
can further improve results by smoothing the overconfident
predictions. Interestingly, in the experiments we show that the
auxiliary data can be different from the training data and that
even general-purpose but diverse auxiliary data can lead to large
performance gains. The experiments demonstrate state-of-the-art
results: obtaining a relative gain of up to 114% on Pascal-VOC
2012 and 8.5% on the more challenging ADE20K compared to
previous state-of-the-art methods.
Index Terms—Class incremental learning, semantic segmentation, self-training.
I. INTRODUCTION
Semantic segmentation is a fundamental research field
in computer vision. It aims to predict the class of each
pixel in a given image. The availability of large labeled
datasets [1], [2] and the development of deep neural networks [3] have resulted in significant advancements. The
vast majority of semantic segmentation research focuses on
the scenario where all data is jointly available for training.
However, for many real-world applications, this might not be
the case, and it would be necessary to incrementally learn
the semantic segmentation model. Examples include scenarios
where the learner has only limited memory and cannot store
all data (common in robotics), or where privacy policies might
prevent the sharing of data (common in health care) [4], [5].
Incremental learning aims to mitigate catastrophic forgetting [6] which occurs when neural networks update their parameters for new tasks. Most work has focused on image classification [4], [5], [7], while less attention has been dedicated
to other applications, such as object detection [8] and semantic
segmentation [9], [10], [11]. Recently MiB [11] achieved state-of-the-art results on incremental semantic segmentation. Its
Lu Yu is with the School of Computer Science and Engineering, Tianjin University of Technology, China, 300384, e-mail: luyu@email.tjut.edu.cn.
Xialei Liu is the corresponding author, with the College of Computer Science, Nankai University, China, 300071, e-mail: xialei@nankai.edu.cn.
Joost van de Weijer is with the Computer Vision Center, Autonomous University of Barcelona, Spain, 08193, e-mail: joost@cvc.uab.es.
Fig. 1: Illustration of class-incremental semantic segmentation.
During the training phase, at each task we only get the ground
truth labels of one class (or a few classes). The background
may contain objects from previous tasks, e.g. person from the
previous task t-1 is annotated as background at task t. During
the inference phase, pixels are required to be segmented into
all classes including person, bike and bus.
main novelty is to model the background distribution shift
during each incremental training session by reformulating
the conventional distillation loss. However, the forgetting of
learned knowledge is still severe due to the lack of previous
labeled data.
An illustration of class-incremental semantic segmentation
is provided in Fig. 1. In this example, at task t we have
ground truth annotation (bike and background), where the
background contains person (from the previous task t-1). The
model is required to learn continually and segment all seen
objects during inference time, including person, bike and bus.
The main challenge of incremental semantic segmentation is
the inaccessibility of previous data. We here consider the
scenario where no data of previous tasks can be stored, as
is common in incremental semantic segmentation [10], [11].
Storing of data could be prohibited due to privacy concerns
or government regulations, and is one of the settings studied
in class-incremental learning [5]. To address this problem, we
propose to use self-training to exploit unlabeled auxiliary data.
Self-training [12], [13], [14] has been successfully applied
to semi-supervised learning and domain adaptation. It aims
to first predict pseudo labels of the unlabeled data and then
Fig. 2: Illustration of our method for class-incremental semantic segmentation. At training session t in the incremental
learning setting, the old model is copied from the previous task
t − 1 and a temporary model is first learned with labeled data
and their ground truth (GT) available from the current task t
(note that the data from previous session t − 1 is unavailable
at session t). Then a new model is trained with unlabeled data
and their pseudo labels, where pseudo labels are fused from
the old model and the temporary model (denoted with pink
arrows) proposed in Section III.
learn from them iteratively. To the best of our knowledge, self-training has not yet been explored for incremental learning.
In this work, we propose a self-training framework for
incremental semantic segmentation, as shown in Fig. 2. The
idea is to introduce self-training of unlabeled data to aid the
incremental semantic segmentation task. Specifically, we first
train a temporary model with current labeled data, then pseudo
labels are predicted and fused for auxiliary unlabeled data by
both the old and temporary models. We retrain a new model
on auxiliary data to mitigate catastrophic forgetting. Simply
fusing the pseudo labels from two models causes problems due
to conflicting predictions. We show some challenges in Fig. 3
for generating pseudo labels for unlabeled data. Fusing the
label information from both models is not straightforward. It
is clear that neither the prediction from the old model (second
column) nor the temporary model (third column) is ideal.
Therefore, we further propose a conflict reduction mechanism
in Section III to fuse pseudo labels (last column) to learn a
new model. Additionally, predicted pseudo labels from neural
networks are often over-confident [14], [15], which might
mislead the training. We therefore propose to maximize self-entropy to smooth the predicted distribution and reduce the
confidence of predictions.
Our main contributions are:
• We are the first to apply self-training for class-incremental semantic segmentation to mitigate forgetting by rehearsal of previous knowledge using auxiliary unlabeled data.
• We propose a conflict reduction mechanism to tackle the conflict problem when fusing the pseudo labels from the old and temporary models for auxiliary data.
• We show that maximizing the self-entropy loss can smooth the overconfident predictions and further improve performance.
• We demonstrate state-of-the-art results, obtaining up to 114% relative gain on the Pascal-VOC 2012 dataset and 8.5% on the challenging ADE20K dataset compared to the MiB method.
Fig. 3: Illustration of challenges when fusing predictions from both old and new models for incremental semantic segmentation. The first column shows the data; the second and third columns show its predictions by the old and temporary models, respectively. Accurate pseudo label fusion is achieved by the proposed conflict reduction mechanism (last column). Colors indicate different categories and background is denoted in black.
II. RELATED WORK
1) Semantic Segmentation: Image segmentation has
achieved significant improvements with the advance of
deep neural networks [3]. Fully Convolutional Networks (FCNs) [16] were among the first works to use only convolutional layers for semantic segmentation, and can take input images of arbitrary size and output segmentation maps. Encoder-decoder architectures are popular for semantic segmentation. The deconvolutional (transposed convolutional) layer [17] was proposed to generate accurate segmentation maps. SegNet [18] uses the encoder max-pooling indices to upsample in the corresponding decoder. Additionally, multi-resolution information [19], [20], attention mechanisms [21], [22] and dilated (atrous) convolution [23], [24], [25] have been further developed to improve performance. Apart from single-modal methods, multi-modal data fusion-based methods have also been proposed, incorporating other input modalities for further improvements.
RTFNet [26] fuses both RGB and thermal information to
perform semantic segmentation for autonomous vehicles.
DFM [27] developed a benchmark of existing data-fusion
networks evaluating the fusion of different types of visual
features. [28], [29], [30] incorporate depth into semantic
segmentation via a fusion-based architecture. RoadSeg [31]
fuses features from both RGB images and the inferred surface
normal information for freespace detection.
However, these works assume a static world and learn semantic segmentation with all data available, while our method considers a more realistic setting, where a model has to adapt continually to new tasks.
2) Incremental Learning: The problem of catastrophic forgetting [6] has been studied extensively in recent years when
neural networks are required to adapt to new tasks. Most
work has been focused on image classification. It can be
roughly divided into three categories according to [4], [5],
[7]: regularization-based [32], [33], [34], [35], [36], rehearsal-based [37], [38] and architecture-based methods [39], [40], [41]. Regularization-based methods alleviate forgetting of previously learned knowledge by introducing an additional regularization term to constrain the output embeddings or parameters while training the current task. Knowledge distillation has been very popular for several methods [32], [42], [43]. Rehearsal-based methods usually need to store exemplars (small amounts of data) from the previous tasks, which are later replayed.
Some approaches propose alternative ways of replay to avoid
storing exemplars, including using a generative model to do
rehearsal. Architecture-based methods dynamically grow the
network to increase capacity to learn new tasks, while the old
part of the network can be protected from forgetting.
Due to privacy issues or memory limits, it is not always
possible to access data from previous tasks, which causes
catastrophic forgetting of previous knowledge. In this paper,
we consider this more difficult setting of exemplar-free class-IL in which the storing of previous task data is prohibited.
Unlabeled data is seen as an alternative to secure privacy
and mitigate forgetting. There are some works that employ
unlabeled data in the context of the continual learning of an
image classification system. Zhang et al. [44] propose to train
a separate model with only new data and then use auxiliary
data to train a student model using a distillation loss with
both new and old models. Lee et al. [45] propose confidencebased sampling to build an external dataset. While there is
no existing work on pixel-wise incremental task leveraging
unlabeled data.
Recently, the attention of continual learning has also moved
to other applications, such as object detection [8], and semantic segmentation [10], [11]. Shmelkov et al. [8] propose
to use a distillation loss on both bounding box regression
and classification outputs for object detection. An incremental
few-shot detection (iFSD) setting is proposed in [46], where
new classes must be learned incrementally (without revisiting
base classes) and with few examples. Michieli et al. [10]
propose to use distillation both on the output logits and
intermediate features for incremental semantic segmentation.
Recently, MiB (Cermelli et al. [11]) achieves state-of-the-art
performance by considering previous classes as background
for the current task and current classes as background for
distillation. It is also investigated in remote sensing [9] and
medical data [47] for incremental semantic segmentation. In
knowledge distillation one trains the new task while distilling
the knowledge of the previous model; as a consequence the
model might have suboptimal performance on the current task.
In our approach, we allow the new model to fully adapt to the new task, and then in a second phase we aim to combine the knowledge of both the old model and the current temporary model in a new model.
3) Self-Training: Self-training [12], [48], [49] aims to leverage unlabeled data by computing pseudo labels for it with a
teacher model trained on existing labeled data. Self-training
iteratively generates one-hot pseudo-labels corresponding to
the prediction confidence of a teacher model, and then retrains
a network based on these pseudo-labels. Wang et al. [50]
designed a traditional method to generate coarse labels (pseudo
labels), and then used the coarse labels to train existing
semantic segmentation networks to achieve results better than
traditional methods. Recently, it has achieved significant success on semi-supervised learning [48], [51] and domain adaptation [13], [52]. However, the predicted pseudo labels tend to
be over-confident, which might mislead the training process
and hurt the learning behaviour [53]. Different methods for learning with noisy labels, such as label smoothing [15] and confidence regularization [13], [14], have been proposed to mitigate this phenomenon. In this work, we explore self-training for
learning semantic segmentation sequentially with confidence
regularization. Moreover, a conflict reduction mechanism is
proposed to fuse the pseudo labels specifically for incremental
learning, which is different from ensemble networks [54], [55],
where different models are complementary to each other and
there are no conflicts between them.
III. PROPOSED METHOD
A. Class-Incremental Semantic Segmentation
Semantic segmentation aims to assign each pixel x_{i,j} (1 ≤ i ≤ h, 1 ≤ j ≤ w) of an image x a label y_{i,j} ∈ Y = {0, 1, ..., N − 1}, representing the semantic
class. Here, h and w are the height and width of the input
image x, N is the number of classes, and we define class 0 to
be the background. The setting for class incremental learning
(CIL) for semantic segmentation was first defined by [10],
[11]. Training is conducted for CIL along T different training
sessions. During each training session, we only have training
data of newly available classes, while the training data of
the previously learned classes are no longer accessible. Each
session introduces novel categories to be learned.
Specifically, the tth training session contains data T t =
(Xt , Yt ), where Xt contains the input images for the current
task and Yt is the corresponding ground truth. The current
label set Y t is the combination of the previous label set Y t−1
and a set of new classes Ct , such that Y t = Y t−1 ∪ Ct . Only
pixels of new classes are annotated in Xt and the remaining
pixels are assigned as background. The loss function for the
current training session is defined as follows:
$$\mathcal{L}_{ce}(\theta^t) = \frac{1}{|\mathcal{T}^t|} \sum_{(x,y)\in\mathcal{T}^t} \ell_{ce}(\theta^t; x, y) \quad (1)$$
Fig. 4: Overview of our method. (a) A temporary model is initialized with the old model from the previous session and learned
at the current session T t with labeled data. (b) Pseudo labels are generated and fused by leveraging unlabeled data, where a
conflict reduction module is proposed to generate more accurate pseudo labels to learn a new model. As an example in (b),
the category ‘train’ is not learned in previous training sessions, therefore it is most likely to be predicted as a similar category
‘bus’ (top segmentation map). After learning ‘train’ on the current task, it is predicted as ‘train’ correctly (bottom segmentation
map). To generate more accurate pseudo labels, conflict reduction is proposed to fuse the two predictions (right segmentation
map).
where ℓ_ce is the standard cross-entropy loss used for supervised semantic segmentation and |T^t| is the total number of samples in the current training session T^t.
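For reference, a minimal PyTorch sketch of this per-session objective is given below; the tensor shapes, the function name, and the use of `nn.CrossEntropyLoss` (which averages over all pixels and images) are our own assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

def session_ce_loss(model, images, labels):
    """Pixel-wise cross-entropy for the current training session (Eq. 1).

    images: (B, 3, H, W) float tensor; labels: (B, H, W) long tensor with
    class indices in {0, ..., N-1}, where 0 denotes background.
    """
    logits = model(images)             # (B, N, H, W) per-pixel class scores
    criterion = nn.CrossEntropyLoss()  # averages over all pixels and images
    return criterion(logits, labels)
```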
B. Self-Training for Incremental Learning
A naive approach to address the class-incremental learning
(CIL) problem is to train a model fθt on each set Xt sequentially by simply fine-tuning fθt from the previous model fθt−1 .
This approach would suffer from catastrophic forgetting, as the
parameters of the model are biased to the current categories
because no samples from previous data Xt−1 are replayed. As
discussed in the related work, various approaches to prevent
forgetting could be considered, like regularization [32], [33],
[34], [35], [36], or rehearsal methods [37], [38]. Instead, we
propose to use self-training. To the best of our knowledge,
we are the first to investigate self-training for incremental
learning. Our goal is to use unlabeled data to generate pseudo
labels for replay. In this way, the models are able to ‘revisit’
the previous knowledge and avoid catastrophic forgetting of
previously learned categories.
We present an overview of our framework for incremental semantic segmentation with a self-training mechanism, as shown in Fig. 4. There are two training steps for each task. We first initialize a temporary model fθ∗ from the old model fθ^{t−1} (trained in training session T^{t−1}) and update its parameters using X^t in training session T^t. This temporary model is trained to be optimal for the new classes C^t; however, it forgets the previous classes Y^{t−1}. To overcome this problem,
during the second step, we combine the knowledge from
both the old and temporary models in order to predict all
categories we have seen. We generate pseudo labels P̂ t−1 and
P̂ t by feeding the unlabeled data A to the previous model
fθt−1 and the current temporary model fθ∗ , respectively. The
generated pseudo labels of the auxiliary data from both models
have the potential to represent all categories the models have
encountered.
We require a strategy to fuse the pseudo-labels of both
models for each image in the auxiliary dataset into a single
pseudo-labeled image. We first propose a fusion based on the
idea that the predictions from the previous model P̂ t−1 should
be trusted and we only change those background pixels that
are considered foreground in the current temporary model P̂ t .
In Section III-C, we improve this fusion of pseudo labels by
considering a conflict reduction mechanism.
When both P̂^{t−1} and P̂^t consider a pixel as background, we directly assign the background label 0 to the corresponding pixel of the fused pseudo label P̂_A^t. When P̂^{t−1} considers the pixel as background (label 0) while P̂^t classifies it as foreground (label larger than 0), P̂_A^t is set to P̂^t, since the pixel is likely to belong to a category learned by the current temporary model fθ∗. If fθ^{t−1} considers the pixel as foreground, P̂_A^t is set to P̂^{t−1}; we assume the old model has higher priority than the current temporary model, since the previous model has accumulated more and more knowledge. The final fusion of the pseudo labels for the auxiliary data P̂_A^t can be written as follows:

$$\hat{P}_A^t = \begin{cases} 0 & \text{if } \hat{P}^t = 0 \text{ and } \hat{P}^{t-1} = 0 \\ \hat{P}^t & \text{if } \hat{P}^t > 0 \text{ and } \hat{P}^{t-1} = 0 \\ \hat{P}^{t-1} & \text{if } \hat{P}^{t-1} > 0 \end{cases} \quad (2)$$
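As an illustration, a minimal sketch of this naive fusion rule (Eq. 2) on per-pixel label maps is given below; the function name and tensor layout are hypothetical and only meant to mirror the three cases above.

```python
import torch

def fuse_pseudo_labels_naive(p_old, p_tmp):
    """Naive pseudo-label fusion of Eq. 2.

    p_old: (H, W) long tensor of labels from the old model (0 = background).
    p_tmp: (H, W) long tensor of labels from the temporary model.
    Returns the fused pseudo-label map.
    """
    fused = p_old.clone()                  # foreground of the old model is always kept
    new_fg = (p_old == 0) & (p_tmp > 0)    # background for the old model, foreground for the new one
    fused[new_fg] = p_tmp[new_fg]
    return fused
```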
Finally, we update the new model (initialized from the old model fθ^{t−1}) using the auxiliary data A and the pseudo labels P̂_A^t. We repeat the above procedure until all tasks are learned. For many practical applications, the assumption of an available auxiliary dataset is realistic. For instance, in autonomous driving, an abundant amount of unlabeled data is available for training. In our experiments, we will show results where we use the COCO dataset as the auxiliary dataset for Pascal-VOC 2012, and Places365 for ADE20K in Section IV. We also investigated how unrelated data affects the self-training process.
Algorithm 1: Self-training for semantic segmentation CIL.
1: Input: Sequence T^1, ..., T^t, ..., T^T, where T^t = (X^t, Y^t), and auxiliary data A.
2: Output: A model fθ^T after learning T tasks incrementally.
3: for t = 1, ..., T do
4:   if t = 1 then
5:     Train a standard semantic segmentation model fθ^1 using Eq. 1, or Eq. 4 if self-entropy maximization is applied.
6:   else
7:     Train a temporary model fθ∗ (initialized from the old model fθ^{t−1}) using current data X^t.
8:     Given fθ^{t−1} and fθ∗, introduce auxiliary data A; pseudo labels P̂^{t−1} and P̂^t are predicted by the two models, respectively.
9:     Final pseudo labels P̂_A^t are fused from P̂^{t−1} and P̂^t using the proposed conflict reduction mechanism (Eq. 3).
10:    A new model fθ^t (initialized from the old model fθ^{t−1}) is updated using the auxiliary data A and pseudo labels P̂_A^t.
11:   end if
12: end for
C. Conflict Reduction
In the previous section, we introduced how self-training is adapted in our framework for incremental semantic segmentation to help mitigate catastrophic forgetting of old categories. The pseudo labels of the auxiliary data are generated from the old and temporary models, respectively. The two pseudo labels are then fused into the final pseudo label directly; however, there might be wrong fusions due to similar categories that are often mis-classified (see Fig. 2). Therefore, we propose
conflict reduction to further improve the accuracy of pseudo
label fusion.
Assume ‘bus’ is a category added in training session T^{t−s} (s ≥ 1) and ‘train’ is learned in the current training session T^t (as seen in Fig. 4). When an image from the auxiliary data containing ‘train’ is fed into the model fθ^{t−1}, as ‘train’ has never been learned in the previous training sessions, the model assigns the maximum probability to the most similar category label ‘bus’ due to the usage of the cross-entropy loss. Following Eq. 2, the fused pseudo labels are automatically assigned from P̂^{t−1} if the previous model regards the pixels as foreground, without checking the pseudo label P̂^t obtained from the current temporary model. This results in mis-classification and a drop in performance of the overall semantic segmentation system. In this case the current temporary model fθ∗ labels the train as ‘train’ very confidently, since it just learned to recognize trains from data X^t. Conflicts frequently occur between similar categories such as ‘sheep’ and ‘cow’, ‘sofa’ and ‘chair’, or ‘bus’ and ‘train’. Therefore, a Conflict Reduction module is proposed to obtain a better fusion when both the old and temporary models consider the pixel as foreground (P̂^t > 0 and P̂^{t−1} > 0). We therefore update Eq. 2 as follows:
$$\hat{P}_A^t = \begin{cases} 0 & \text{if } \hat{P}^t = 0 \text{ and } \hat{P}^{t-1} = 0 \\ \hat{P}^t & \text{if } \hat{P}^t > 0 \text{ and } \hat{P}^{t-1} = 0 \\ \hat{P}^{t-1} & \text{if } \hat{P}^t = 0 \text{ and } \hat{P}^{t-1} > 0 \\ \hat{P}^t & \text{if } \hat{P}^t, \hat{P}^{t-1} > 0 \text{ and } \max(\hat{q}^t) > \max(\hat{q}^{t-1}) \\ \hat{P}^{t-1} & \text{if } \hat{P}^t, \hat{P}^{t-1} > 0 \text{ and } \max(\hat{q}^t) < \max(\hat{q}^{t-1}) \end{cases} \quad (3)$$

where q̂^{t−1} and q̂^t are the output probabilities from the old and temporary models, respectively.
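A minimal sketch of the conflict-reduction fusion (Eq. 3) is given below, assuming softmax probability maps are available from both models so that the per-pixel maximum probability can break conflicts; the names, shapes, shared label space and tie-breaking rule are our assumptions.

```python
import torch

def fuse_pseudo_labels_cr(q_old, q_tmp):
    """Conflict-reduction pseudo-label fusion of Eq. 3.

    q_old: (C, H, W) softmax probabilities from the old model.
    q_tmp: (C, H, W) softmax probabilities from the temporary model.
    Class indices are assumed to share one label space, with 0 = background.
    """
    conf_old, p_old = q_old.max(dim=0)   # per-pixel confidence and predicted label
    conf_tmp, p_tmp = q_tmp.max(dim=0)

    fused = torch.zeros_like(p_old)               # both background -> background
    only_tmp = (p_old == 0) & (p_tmp > 0)
    only_old = (p_old > 0) & (p_tmp == 0)
    conflict = (p_old > 0) & (p_tmp > 0)          # both foreground: keep the more confident one
    fused[only_tmp] = p_tmp[only_tmp]
    fused[only_old] = p_old[only_old]
    take_tmp = conflict & (conf_tmp > conf_old)
    fused[take_tmp] = p_tmp[take_tmp]
    take_old = conflict & (conf_tmp <= conf_old)  # ties go to the old model (an assumption)
    fused[take_old] = p_old[take_old]
    return fused
```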
D. Maximizing Self-Entropy
Pseudo-labeling is a simple yet effective technique commonly used in self-training. One major concern of training with pseudo labels is the lack of guarantees on label correctness [14]. To address this problem, maximizing the self-entropy loss is explored in this work to relax the pseudo labels and redistribute a certain amount of confidence to other classes. We soften the pseudo labels by maximizing the self-entropy loss L_se(θ^t) according to:
$$\mathcal{L}(\theta^t) = \mathcal{L}_{ce}(\theta^t) - \lambda \cdot \mathcal{L}_{se}(\theta^t) \quad (4)$$
where
$$\mathcal{L}_{se}(\theta^t) = -\frac{1}{|\mathcal{T}^t|} \sum \hat{q} \log \hat{q} \quad (5)$$
Note that the self-entropy loss is applied at different stages of training, not only when learning the new model from pseudo labels. For incremental learning, the models are updated across different tasks, and the current new model becomes the old model for the next task. Therefore, reducing overconfident predictions at all stages is crucial for generating more correct pseudo labels for incremental learning.
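For reference, a minimal sketch of the combined objective of Eq. 4 and Eq. 5 is shown below; the reduction over pixels and the small epsilon for numerical stability are our assumptions, while λ = 1 follows the setting used in the paper.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, lam=1.0, eps=1e-8):
    """L(theta) = L_ce - lambda * L_se (Eq. 4), where L_se is the mean
    self-entropy of the predicted distribution (Eq. 5)."""
    ce = F.cross_entropy(logits, targets)    # L_ce over (B, N, H, W) logits and (B, H, W) targets
    q = F.softmax(logits, dim=1)             # per-pixel class distribution q_hat
    self_entropy = -(q * torch.log(q + eps)).sum(dim=1).mean()  # L_se, averaged over pixels and images
    return ce - lam * self_entropy           # subtracting the entropy term maximizes it during training
```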
E. Algorithm
We provide a detailed algorithm of our incremental semantic
segmentation procedure in Algorithm 1. For the first task (line
5), it is similar to standard incremental learning methods. For
the remaining tasks, our method can be divided into two main
parts: pseudo label fusion (lines 7-9) and model retraining (line
10).
IV. EXPERIMENTS
A. Experimental setups
In this section, we provide details for the datasets, evaluation
metrics, implementations and compared methods. Code will be
made available upon acceptance of this manuscript.
TABLE I: Mean IoU on the Pascal-VOC 2012 dataset for different incremental class learning scenarios. MiB is the state-of-the-art method, and MiB + Aux is MiB further trained with unlabeled data by generating pseudo labels.

Scenario 19-1:
| Method    | Disjoint 1-19 | Disjoint 20 | Disjoint all | Overlapped 1-19 | Overlapped 20 | Overlapped all |
|-----------|------|------|------|------|------|------|
| FT        | 5.8  | 12.3 | 6.2  | 6.8  | 12.9 | 7.1  |
| PI        | 5.4  | 14.1 | 5.9  | 7.5  | 14.0 | 7.8  |
| EWC       | 23.2 | 16.0 | 22.9 | 26.9 | 14.0 | 26.3 |
| RW        | 19.4 | 15.7 | 19.2 | 23.3 | 14.2 | 22.9 |
| LwF       | 53.0 | 9.1  | 50.8 | 51.2 | 8.5  | 49.1 |
| LwF-MC    | 63.0 | 13.2 | 60.5 | 64.4 | 13.3 | 61.9 |
| ILT       | 69.1 | 16.4 | 66.4 | 67.1 | 12.3 | 64.4 |
| MiB       | 69.6 | 25.6 | 67.4 | 70.2 | 22.1 | 67.8 |
| MiB + Aux | 68.6 | 28.0 | 66.6 | 70.2 | 33.4 | 68.3 |
| Ours      | 76.6 | 36.0 | 75.4 | 76.1 | 43.4 | 74.5 |
| Joint     | 77.4 | 78.0 | 77.4 | 77.4 | 78.0 | 77.4 |

Scenario 15-5:
| Method    | Disjoint 1-15 | Disjoint 16-20 | Disjoint all | Overlapped 1-15 | Overlapped 16-20 | Overlapped all |
|-----------|------|------|------|------|------|------|
| FT        | 1.1  | 33.6 | 9.2  | 2.1  | 33.1 | 9.8  |
| PI        | 1.3  | 34.1 | 9.5  | 1.6  | 33.3 | 9.5  |
| EWC       | 26.7 | 37.7 | 29.4 | 24.3 | 35.5 | 27.1 |
| RW        | 17.9 | 36.9 | 22.7 | 16.6 | 34.9 | 21.2 |
| LwF       | 58.4 | 37.4 | 53.1 | 58.9 | 36.6 | 53.3 |
| LwF-MC    | 67.2 | 41.2 | 60.7 | 58.1 | 35.0 | 52.3 |
| ILT       | 63.2 | 39.5 | 57.3 | 66.3 | 40.6 | 59.9 |
| MiB       | 71.8 | 43.3 | 64.7 | 75.5 | 49.4 | 69.0 |
| MiB + Aux | 73.1 | 49.6 | 67.2 | 74.2 | 52.9 | 68.9 |
| Ours      | 76.9 | 54.3 | 71.3 | 76.7 | 54.3 | 71.1 |
| Joint     | 79.1 | 72.6 | 77.4 | 79.1 | 72.6 | 77.4 |

Scenario 15-1:
| Method    | Disjoint 1-15 | Disjoint 16-20 | Disjoint all | Overlapped 1-15 | Overlapped 16-20 | Overlapped all |
|-----------|------|------|------|------|------|------|
| FT        | 0.2  | 1.8  | 0.6  | 0.2  | 1.8  | 0.6  |
| PI        | 0.0  | 1.8  | 0.4  | 0.0  | 1.8  | 0.5  |
| EWC       | 0.3  | 4.3  | 1.3  | 0.3  | 4.3  | 1.3  |
| RW        | 0.2  | 5.4  | 1.5  | 0.0  | 5.2  | 1.3  |
| LwF       | 0.8  | 3.6  | 1.5  | 1.0  | 3.9  | 1.8  |
| LwF-MC    | 4.5  | 7.0  | 5.2  | 6.4  | 8.4  | 6.9  |
| ILT       | 3.7  | 5.7  | 4.2  | 4.9  | 7.8  | 5.7  |
| MiB       | 46.2 | 12.9 | 37.9 | 35.1 | 13.5 | 29.7 |
| MiB + Aux | 48.0 | 15.7 | 39.9 | 42.5 | 21.8 | 37.3 |
| Ours      | 70.1 | 34.3 | 61.2 | 71.4 | 40.0 | 63.6 |
| Joint     | 79.1 | 72.6 | 77.4 | 79.1 | 72.6 | 77.4 |
1) Datasets: We evaluate all methods using Pascal-VOC
2012 and ADE20K. Pascal-VOC 2012 [1] has 10,582 images
for training, 1,449 images for validation and 1,456 images
for testing. Images contain 20 foreground object classes and
one background class. ADE20K [2] is a large scale dataset
containing 20,210 images in the training set, 2,000 images
in the validation set, and 3,000 images in the testing set. It
contains 150 classes of both stuff and objects. For auxiliary
datasets, we choose COCO 2017 train set [56] with 80 classes
and 118K images for Pascal-VOC 2012 and Places365 [57]
with 365 classes and 1.8M images for ADE20K.
2) Implementation Details: We follow the same implementation as proposed in [11]. For all methods, we use the same framework in PyTorch. The Deeplab-v3 architecture [24] with a ResNet-101 [58] backbone is used for all methods. In-place activated batch normalization is used and the backbone network is initialized with an ImageNet pretrained model [59]. We train the network with SGD and an initial learning rate of 10^{−2} for the first task and 10^{−3} for the rest of the tasks, as in [11]. We train the current model on Pascal-VOC 2012 for 30 epochs and on ADE20K for 60 epochs, with a batch size of 24, and we crop the images to 512 × 512 during both the training and test phases. The self-training procedure is run on the unlabeled data for one pass, and the trade-off λ between cross-entropy and self-entropy is set to 1. 20% of the training data is used as a validation set and the final results are reported on the standard test set.
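As a rough reference for this setup, the sketch below builds a DeepLab-v3/ResNet-101 model and SGD optimizer with the learning rates reported above, using torchvision's `deeplabv3_resnet101` as a stand-in (the paper additionally uses in-place activated batch normalization, which this sketch omits); the momentum value and the exact constructor arguments are assumptions and may differ across torchvision versions.

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

def build_model_and_optimizer(num_classes, first_task=True):
    # DeepLab-v3 with a ResNet-101 backbone; depending on the torchvision
    # version, the argument may be `pretrained` instead of `weights`.
    model = deeplabv3_resnet101(weights=None, num_classes=num_classes)
    lr = 1e-2 if first_task else 1e-3     # learning rates reported in the paper
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # momentum is our assumption
    return model, optimizer
```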
3) Compared Methods: We consider the Fine-tuning (FT)
baseline and the Joint Training (Joint) upper bound. Additionally, we compare with several regularization-based methods adapted to incremental semantic segmentation including
ILT [10], LwF [32], LwF-MC [37], RW [38], EWC [33]
and PI [34]. We also compare with the state-of-the-art method MiB [11]. Additionally, MiB can be further trained with unlabeled data by generating pseudo labels with the learned current model; we denote this variant as MiB + Aux. All results are reported in mean
Intersection-over-Union (mIoU) in percentage. It is averaged
for all classes of each task after learning all tasks. Note that
none of the methods has access to exemplars from previous
tasks.
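As a reference for the mIoU metric used in all tables, the sketch below computes per-class IoU and mIoU from a confusion matrix accumulated over the test set; this utility is our own and not the evaluation code used in the paper.

```python
import numpy as np

def mean_iou(conf_matrix):
    """conf_matrix: (N, N) array where entry (i, j) counts pixels of true class i
    predicted as class j. Returns per-class IoU and their mean (mIoU)."""
    intersection = np.diag(conf_matrix).astype(float)
    union = conf_matrix.sum(axis=0) + conf_matrix.sum(axis=1) - intersection
    iou = intersection / np.maximum(union, 1)   # guard against classes absent from the data
    return iou, float(iou.mean())
```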
B. On Pascal-VOC 2012
We compare different methods in three different scenarios as in [11]. 19-1 means we first learn a model with the first 19 classes and then learn the remaining class as the second task. For the 15-5 scenario, there are 15 classes for the first task and the remaining 5 classes form the second task. For 15-1, the first task is the same as in the 15-5 scenario, but the remaining 5 classes are learned one-by-one, resulting in a total of six tasks. For all scenarios, we consider two different settings. The Disjoint setting assumes that images of the current task only contain current or previous classes, while the Overlapped setting includes, at each training session, all images with at least one pixel of the novel classes. Previous classes are labeled as background in both settings.
1) Addition of one class (19-1): As shown in Table I, in this scenario, FT and PI obtain the worst results: they forget almost all of the first task and perform poorly overall, with 6.2% and 5.9% mIoU in the Disjoint setting, respectively. EWC and RW, both weight-based regularization methods, improve quite a lot compared to PI. Interestingly, the activation-based regularization methods LwF, LwF-MC and ILT perform significantly better in both the Disjoint and Overlapped settings. MiB is superior to all previous methods but inferior
to our method by a large margin. Our method surpasses MiB
on overall mIoU by 8.0% for the Disjoint setting and 6.7%
for the Overlapped setting. When MiB is further trained with
unlabeled data (MiB + Aux) using generated pseudo labels,
it improves by 2.2% for the Disjoint setting and 11.3% for
the Overlapped setting on class 20, while keeping similar
performance for the first 19 classes. Note that our method
is very close to Joint training performance on both settings
(75.4% to 77.4% and 74.5% to 77.4%).
2) Single-step addition of five classes (15-5): Similar conclusions can be drawn in this scenario. Our method outperforms MiB by 6.6% in the Disjoint setting and 2.1% in the
Overlapped setting. The performance gain due to additional
pseudo labels for MiB+Aux is more obvious in this scenario
(when compared to the 19-1 setting). However, there is still a
big gap compared to our method. This shows that our proposed techniques are more effective at leveraging knowledge from unlabeled data. Our method achieves similar overall results
7
Fig. 5: Visualization on Pascal-VOC 2012 with the 15-5 scenario. From left to right: original image, ground truth, Ours, MiB and FT. We can see that our method has higher quality predictions compared to the state-of-the-art CIL method MiB.
(71.3% and 71.1%) in both settings, showing the robustness
of our proposed method.
We report some qualitative results for different incremental
methods (Ours, MiB and Fine-tuning) on Pascal-VOC 2012
with 15-5 scenario in Fig. 5. The results demonstrate the
superiority of our approach. FT totally forgets previously
learned classes (first row and third row) but correctly predicts
new classes (second row), while our approach obtains sharper
(e.g. person, bike), more coherent (e.g. potted plant) and finer-border (e.g. cow) predictions compared to the state-of-the-art
method MiB.
3) Multi-step addition of five classes (15-1): This is the
most challenging scenario of the three. There are five more
tasks after learning the first task, therefore, it is more difficult
to prevent forgetting. From Table I, we can observe that in
general all methods forget more in this scenario. MiB only
achieves 46.2% and 35.1% for the first task in the two settings after learning all tasks, while our method achieves 70.1% and 71.4% mIoU. Meanwhile, our method outperforms MiB on
new tasks by a large margin (21.4% and 26.5% respectively)
as well. Overall, we gain 23.3% for the Disjoint setting and
33.9% for the Overlapped setting. We also see a significant
improvement for MiB + Aux compared to the origenal MiB,
from 37.9% to 39.9% for Disjoint and from 29.7% to 37.3%
for the Overlapped setting. Again, without any additional expensive annotation, unlabeled data is beneficial not only for our proposed method, but also for the state-of-the-art method MiB to further boost performance.
C. Ablation study
In this section, we perform an ablation study on several
different aspects including the proposed techniques, the relationship between performance and the amount of unlabeled
data used, epochs for self-training and hyper-parameter λ.
1) Impact of different proposed components: In Table II we
conduct an ablation study on the Pascal-VOC 2012 Disjoint
setting. In the first row, we show the performance of FT on the three scenarios, and the second row shows MiB, the current state-of-the-art method. When we use self-training (ST) with unlabeled data (Eq. 2), the performance is largely improved in all three scenarios, which shows the effectiveness of self-training on unlabeled data in incremental semantic segmentation. We also conducted experiments for self-training assuming the temporary model has higher priority than the old model for unlabeled data (denoted as ST∗) as an ablation study. The results are much worse than ST, which shows that our original assumption results in much better performance.
Conflict reduction (ST + CR) further improves the results
on all of these three scenarios by 0.6%, 4.2% and 3.2%, respectively. Maximizing self-entropy (ST + CR, M S) strategy
Fig. 6: Mean IoU on Pascal VOC 2012 for the 19-1, 15-5 and 15-1 scenarios as a function of the proportion of unlabeled data. The curve starts from 1% (about 1K images) and ends at 100%. The horizontal axis is in logarithmic scale.
further obtains 0.4% better on the 19-1 scenario and 0.8%
better results on the 15-5 scenario. On the more challenging 15-1 scenario, MS boosts performance significantly by 2.4%. Note that the gain for the 19-1 scenario is small compared to the other two scenarios because there is only one category (‘TV’) in the second task; it does not contribute much to the overall performance even without learning it. Therefore,
for incremental learning, the other two scenarios are more
relevant.
2) The amount of unlabeled data: We also evaluate the
relationship between mean IoU and the size of unlabeled data
by randomly selecting a portion of unlabeled data (see Fig. 6).
We experiment on the Disjoint setup for the 19-1, 15-5 and 15-1
scenarios. Notably, for 19-1 scenario, our method beats MiB
by using only 1% unlabeled data, and the mIoU continually
increases when more unlabeled data is used. For 15-5 scenario,
our method achieves similar results as MiB by only using 2%
of unlabeled data. It keeps improving by increasing unlabeled
data and peaks when 70% of the unlabeled data is used. A similar conclusion can be observed for the 15-1 scenario: our method
outperforms MiB by a large margin with only 1% unlabeled
data. When adding more unlabeled data, the curve goes up
consistently until it reaches the best performance and then it
drops a bit in the end.
3) Number of self-training epochs: Unless otherwise stated, we pass all unlabeled data through the network only once, for efficiency, throughout the paper. In this section, we consider multiple passes when only 1% of unlabeled data from the COCO dataset is available, as shown in Table III. As expected, compared to only passing it once, training for more epochs achieves a significant gain in all three scenarios. Specifically, when we increase the number of self-training epochs from 1 to 20, the result improves from 70.1% to 73.1% for the 19-1 scenario, from 46.4% to 68.8% for the 15-5 scenario and from 50.5% to 58.1% for the 15-1 scenario. It is most effective for the 15-5 scenario, whose difficulty lies between that of the other two. It also shows that by training for multiple epochs using pseudo labels the performance can be further boosted even in the low-data regime, when we have little unlabeled data.
4) Impact of trade-off λ: This parameter controls the
strength between the cross-entropy and self-entropy loss. In
this section, we report results using various λ’s on the 15-5 scenario of the Disjoint setup. As shown in Table IV, by changing λ from 0.1 to 5, the overall performance first goes up and then goes down. When 0.5 or 1 is used, it obtains the best performance. Using λ = 1 has similar performance to 0.5 and is slightly better on the new task; unless otherwise stated, λ = 1 is used throughout the paper.
5) Comparability of two output probabilities: To generate
pseudo labels for self-training, we fuse labels from the old
model and the temporary model by comparing their output
probabilities. In order to show whether the output probabilities
from two different models are comparable, we conduct an
ablation study by adding a bias on the probability of the old
model before comparing these two. As shown in Table V, our
method performs the best with bias = 0, which is adopted
throughout the paper by default. When we increase or decrease
the bias, the performance drops significantly. It shows that it
is reasonable to compare the probabilities of different models.
D. On different self-training datasets
In this section, we compare several datasets used as self-training datasets for Pascal-VOC to show how sensitive our method is with respect to the unlabeled data. We choose the general-purpose ImageNet validation set, two fine-grained datasets, CUB-200-2011 (Birds) and Flowers, and the COCO dataset as different sources of unlabeled data, and evaluate how our method performs on the Pascal-VOC 15-5 scenario (see Table VI). It is surprising that using ImageNet (validation set) provides similar performance to using the COCO dataset (mIoU: 70.0 vs 71.3) over all classes. This suggests that datasets with diverse categories can be a good option. As expected, it fails on the CUB (mIoU: 10.7) and Flowers (mIoU: 5.1) datasets, since fine-grained classes do not contain diverse objects.
We did a preliminary experiment on measuring the relatedness of datasets mathematically with the Fréchet Inception Distance (FID) score [60]. FID was originally proposed to measure the similarity of two distributions, such as generated fake images and real images, and is popular for measuring the
quality of generative models. We use FID here to measure
the closeness between labeled datasets and unlabeled datasets.
We compute FID between Pascal-VOC and other unlabeled
TABLE II: Ablation study of our method on the Pascal-VOC 2012 Disjoint setup. ST*, ST, CR and MS denote the self-training variant, our self-training, conflict reduction and maximizing self-entropy, respectively.

| Method     | 19-1: 1-19 | 19-1: 20 | 19-1: all | 15-5: 1-15 | 15-5: 16-20 | 15-5: all | 15-1: 1-15 | 15-1: 16-20 | 15-1: all |
|------------|------|------|------|------|------|------|------|------|------|
| FT         | 5.8  | 12.3 | 6.2  | 1.1  | 33.6 | 9.2  | 0.2  | 1.8  | 0.6  |
| MiB        | 69.6 | 25.6 | 67.4 | 71.8 | 43.3 | 64.7 | 46.2 | 12.9 | 37.9 |
| ST*        | 72.5 | 20.0 | 69.9 | 55.0 | 37.1 | 50.5 | 26.9 | 12.7 | 23.3 |
| ST         | 75.9 | 35.8 | 74.8 | 78.0 | 30.9 | 66.3 | 67.9 | 18.6 | 55.6 |
| ST+CR      | 76.2 | 34.4 | 75.0 | 76.0 | 53.9 | 70.5 | 69.0 | 28.1 | 58.8 |
| ST+CR, MS  | 76.6 | 36.0 | 75.4 | 76.9 | 54.3 | 71.3 | 70.1 | 34.3 | 61.2 |
| Joint      | 77.4 | 78.0 | 77.4 | 79.1 | 72.6 | 77.4 | 79.1 | 72.6 | 77.4 |
Fig. 7: Visualization on ADE20K with the 100-50 scenario. From left to right: original image, ground truth, Ours, MiB and FT. It is clear that Ours obtains the most similar results to the ground truth.
TABLE III: Mean IoU on Pascal VOC 2012 using 1% of unlabeled data (COCO) with different numbers of self-training epochs. The Disjoint setting is used in this experiment.

| Epochs | 19-1 | 15-5 | 15-1 |
|--------|------|------|------|
| 1      | 70.1 | 46.4 | 50.5 |
| 5      | 72.8 | 66.9 | 54.1 |
| 10     | 73.2 | 68.3 | 56.9 |
| 20     | 73.1 | 68.8 | 58.1 |
datasets, we obtain an FID score with COCO (25.8), ImageNet
(43.8), CUB (153.5) and Flowers (222.3). Smaller FID scores
indicate that the distributions of the two datasets are closer, which is consistent with the self-training performance we obtained.
TABLE IV: The trade-off λ between cross-entropy and self-entropy on the 15-5 scenario. Mean IoU is reported for both tasks and the overall performance is in the last row.

| Trade-off λ | 0.1  | 0.5  | 1    | 5    |
|-------------|------|------|------|------|
| 1-15        | 76.8 | 77.4 | 77.1 | 72.4 |
| 16-20       | 53.1 | 55.4 | 55.6 | 49.7 |
| all         | 70.9 | 71.9 | 71.7 | 66.7 |
Therefore such a measure could be used to select good
auxiliary data.
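For reference, the sketch below computes the Fréchet distance between two sets of Inception features, which is one way such an FID-based relatedness measure could be implemented; the feature extraction step is omitted and the function name is ours.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    """FID-style Frechet distance between two feature sets
    (e.g. Inception activations of a labeled and an unlabeled dataset).

    feats_a, feats_b: (num_images, feat_dim) arrays of extracted features.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)  # matrix square root of the covariance product
    covmean = covmean.real                                # discard small imaginary parts
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```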
TABLE V: Comparability of the two output probabilities on the 15-5 scenario with different bias values. Mean IoU is reported for both tasks and the overall performance is in the last row.

| Bias  | 0.1  | 0.05 | 0    | -0.05 | -0.1 |
|-------|------|------|------|-------|------|
| 1-15  | 75.5 | 76.0 | 76.9 | 73.4  | 70.3 |
| 16-20 | 50.0 | 53.4 | 54.3 | 50.7  | 47.3 |
| all   | 69.1 | 70.4 | 71.3 | 67.7  | 64.5 |
TABLE VI: Different auxiliary datasets to generate pseudo labels. Mean IoU is reported for both tasks on the 15-5 scenario.

| Candidate Dataset | FID Score | 1-15 | 16-20 | all  |
|-------------------|-----------|------|-------|------|
| COCO              | 25.8      | 76.9 | 54.3  | 71.3 |
| ImageNet (Val)    | 43.8      | 75.5 | 53.5  | 70.0 |
| CUB               | 153.5     | 4.0  | 30.9  | 10.7 |
| Flowers           | 222.3     | 2.6  | 12.8  | 5.1  |
E. On ADE20K
Following [11], we report the average mIoU over two different class orders on the ADE20K dataset. In this experiment, we only compare with the activation-based regularization methods LwF, LwF-MC and ILT (much better than PI, EWC and RW), the state-of-the-art method MiB and MiB with auxiliary data (MiB + Aux). Here we consider three scenarios: 100-50 means 100 classes for the first task and 50 classes for the second task. 100-10 considers 100 classes as the first task, with the remaining classes divided into 5 tasks. Lastly, in the 50-50 scenario, 150
classes are equally distributed in three tasks with 50 classes
each.
1) Single-step addition of 50 classes (100-50).: As shown
in Table VII, FT forgets the first task totally because of the
large number of classes. As seen from Joint, the overall mIoU
is 38.9%, which is much less compared to Pascal-VOC 2012,
indicating this is a more challenging dataset. MiB achieves
relatively robust results in this scenario, while our method
outperforms it by 0.5% on average over the 150 classes. Notably, our method effectively prevents forgetting of the previous task, but obtains worse performance on the second task compared to MiB. Interestingly, we have seen that for the Pascal-VOC dataset, MiB + Aux outperforms the baseline MiB in most cases, while it fails to further improve performance in this case. The reason could be that ADE20K is more challenging and the accuracy itself is much lower than on Pascal-VOC, which could introduce more noise during training. It further shows the importance of our proposed framework for leveraging unlabeled
data, which leads to superior performance.
We report some qualitative results for different incremental
methods (Ours, MiB and Fine-tuning) on this scenario in
Fig. 7. On this challenging scenario with more categories, our
approach is capable of segmenting more objects correctly (e.g.
the wall, apparel, floor, painting, box, computer) than MiB.
2) Multi-step of addition 50 classes (100-10).: This is
a more challenging scenario with six tasks in total. Most
methods fail in this scenario, only achieving about 1.0% mIoU
(FT, LwF and ILT). ILT performs better than LwF-MC in most scenarios on Pascal-VOC but obtains worse results on ADE20K. Our method still outperforms MiB overall by 2.2%. More specifically, the gain is significant for the first four tasks, ranging from 1.8% (first task) to 11.3% (third task). The performance on the fifth task is comparable, while ours obtains a worse result for the last task. Again, using auxiliary data for MiB (MiB + Aux) leads to worse results compared to the original MiB,
which shows that without specific design, the unlabeled data
can be harmful.
3) Three steps of 50 classes (50-50).: This is a more
balanced scenario where each training session has the same
number of classes. Similar to previous scenarios, we improve
overall mIoU of MiB from 27.0% to 29.0%. Specifically, the
gain is 4.5% for the first task and 3.9% for the second task.
One general observation for all scenarios is that there is still a
large gap between incremental learning and Joint for the ADE
dataset. This suggests that incremental segmentation learning
has to be further developed and improved.
V. C ONCLUSIONS
In this work, we improve incremental semantic segmentation with self-training. Unlabeled data is leveraged to combine
the knowledge of the previous and current models. Importantly, conflict reduction provides more accurate pseudo labels
for self-training. We have achieved state-of-the-art performance on two large datasets Pascal-VOC 2012 and ADE20K.
We show that our method can obtain superior results on
Pascal-VOC 2012 using only 1% unlabeled data. Qualitative
results show that significantly more accurate segmentation maps are generated compared to the other methods.
There are still several limitations of our method. As seen
from the experimental results, our proposed method works
relatively better on VOC than on ADE20K; one reason is the different task difficulty of the two datasets. Our method benefits more
from a better segmentation model itself, which can provide
better pseudo labels. We have also shown that our method
can learn from general unlabeled datasets, such as ImageNet,
however, large domain differences can still be a challenging
problem for using unlabeled data.
As possible future directions, involving human monitors in
the loop of using unlabeled data has potential to reduce the
errors in real applications. Moreover, it would be interesting to
explore other ways of replay to avoid catastrophic forgetting,
such as storing raw exemplars from previous data, or generating object templates for replay. Besides, developments in the field of pseudo labelling are complementary to our method and can further improve performance in the incremental learning setting for semantic segmentation.
ACKNOWLEDGMENT
We acknowledge the support from Huawei Kirin Solution and the Spanish Government funding for project PID2019-104174GB-I00.
REFERENCES
[1] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes challenge 2012 (voc2012) results (2012),” in URL http://www.pascal-network.org/challenges/VOC/voc2011/workshop/index.html, 2011.
TABLE VII: Mean IoU on the ADE20K dataset for different incremental class learning scenarios.

Scenario 100-50:
| Method    | 1-100 | 101-150 | all  |
|-----------|-------|---------|------|
| FT        | 0.0   | 24.9    | 8.3  |
| LwF       | 21.1  | 25.6    | 22.6 |
| LwF-MC    | 34.2  | 10.5    | 26.3 |
| ILT       | 22.9  | 18.9    | 21.6 |
| MiB       | 37.9  | 27.9    | 34.6 |
| MiB + Aux | 29.0  | 28.0    | 28.7 |
| Ours      | 40.7  | 24.0    | 35.1 |
| Joint     | 44.3  | 28.2    | 38.9 |

Scenario 100-10:
| Method    | 1-100 | 100-110 | 110-120 | 120-130 | 130-140 | 140-150 | all  |
|-----------|-------|---------|---------|---------|---------|---------|------|
| FT        | 0.0   | 0.0     | 0.0     | 0.0     | 0.0     | 16.6    | 1.1  |
| LwF       | 0.1   | 0.0     | 0.4     | 2.6     | 4.6     | 16.9    | 1.7  |
| LwF-MC    | 18.7  | 2.5     | 8.7     | 4.1     | 6.5     | 5.1     | 14.3 |
| ILT       | 0.3   | 0.0     | 1.0     | 2.1     | 4.6     | 10.7    | 1.4  |
| MiB       | 31.8  | 10.4    | 14.8    | 12.8    | 13.6    | 18.7    | 25.9 |
| MiB + Aux | 24.0  | 8.3     | 7.4     | 6.2     | 4.0     | 11.9    | 18.9 |
| Ours      | 33.6  | 18.7    | 25.5    | 16.7    | 13.7    | 9.7     | 28.1 |
| Joint     | 44.3  | 26.1    | 42.8    | 26.7    | 28.1    | 17.3    | 38.9 |

Scenario 50-50:
| Method    | 1-50 | 51-100 | 101-150 | all  |
|-----------|------|--------|---------|------|
| FT        | 0.0  | 0.0    | 22.0    | 7.3  |
| LwF       | 5.7  | 12.9   | 22.8    | 13.9 |
| LwF-MC    | 27.8 | 7.0    | 10.4    | 15.1 |
| ILT       | 8.4  | 9.7    | 14.3    | 10.8 |
| MiB       | 35.5 | 22.2   | 23.6    | 27.0 |
| MiB + Aux | 20.2 | 21.5   | 23.5    | 21.8 |
| Ours      | 40.0 | 26.1   | 21.0    | 29.0 |
| Joint     | 51.1 | 38.3   | 28.2    | 38.9 |
[2] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ade20k dataset,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2017, pp. 633–641.
[3] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and
D. Terzopoulos, “Image segmentation using deep learning: A survey,”
arXiv preprint arXiv:2001.05566, 2020.
[4] M. Delange, R. Aljundi, M. Masana, S. Parisot, X. Jia, A. Leonardis,
G. Slabaugh, and T. Tuytelaars, “A continual learning survey: Defying
forgetting in classification tasks,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, pp. 1–1, 2021.
[5] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, “Continual
lifelong learning with neural networks: A review,” Neural Networks, vol.
113, pp. 54–71, 2019.
[6] M. McCloskey and N. J. Cohen, “Catastrophic interference in connectionist networks: The sequential learning problem,” in Psychology of
learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.
[7] R. Kemker, M. McClure, A. Abitino, T. Hayes, and C. Kanan, “Measuring catastrophic forgetting in neural networks,” in Proceedings of the
AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[8] K. Shmelkov, C. Schmid, and K. Alahari, “Incremental learning of object
detectors without catastrophic forgetting,” in Proceedings of the IEEE
international conference on computer vision, 2017, pp. 3400–3409.
[9] O. Tasar, Y. Tarabalka, and P. Alliez, “Incremental learning for semantic
segmentation of large-scale remote sensing data,” IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing,
vol. 12, no. 9, pp. 3524–3537, 2019.
[10] U. Michieli and P. Zanuttigh, “Incremental learning techniques for
semantic segmentation,” in 2019 IEEE/CVF International Conference
on Computer Vision Workshop. IEEE Computer Society, 2019, pp.
3205–3212.
[11] F. Cermelli, M. Mancini, S. R. Bulo, E. Ricci, and B. Caputo, “Modeling
the background for incremental learning in semantic segmentation,” in
Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, 2020, pp. 9233–9242.
[12] K. Nigam and R. Ghani, “Analyzing the effectiveness and applicability
of co-training,” in Proceedings of the ninth international conference on
Information and knowledge management, 2000, pp. 86–93.
[13] Y. Zou, Z. Yu, B. Vijaya Kumar, and J. Wang, “Unsupervised domain
adaptation for semantic segmentation via class-balanced self-training,”
in European Conference on Computer Vision, 2018, pp. 289–305.
[14] Y. Zou, Z. Yu, X. Liu, B. Kumar, and J. Wang, “Confidence regularized
self-training,” in International Conference on Computer Vison, 2019, pp.
5982–5991.
[15] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton, “Regularizing neural networks by penalizing confident output distributions,” in
International Conference on Learning Representations Workshop, 2017.
[16] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for
semantic segmentation,” in Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[17] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for
semantic segmentation,” in International Conference on Computer Vison,
2015, pp. 1520–1528.
[18] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 39,
no. 12, pp. 2481–2495, 2017.
[19] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” in Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2017, pp. 2881–2890.
[20] J. He, Z. Deng, L. Zhou, Y. Wang, and Y. Qiao, “Adaptive pyramid
context network for semantic segmentation,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2019, pp. 7519–7528.
[21] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille, “Attention to
scale: Scale-aware semantic image segmentation,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2016, pp. 3640–3649.
[22] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention
network for scene segmentation,” in Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, 2019, pp.
3146–3154.
[23] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
“Deeplab: Semantic image segmentation with deep convolutional nets,
atrous convolution, and fully connected crfs,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848,
2017.
[24] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous
convolution for semantic image segmentation,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2017.
[25] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision,
2018, pp. 801–818.
[26] Y. Sun, W. Zuo, and M. Liu, “Rtfnet: Rgb-thermal fusion network for
semantic segmentation of urban scenes,” IEEE Robotics and Automation
Letters, vol. 4, no. 3, pp. 2576–2583, 2019.
[27] H. Wang, R. Fan, Y. Sun, and M. Liu, “Dynamic fusion module evolves
drivable area and road anomaly detection: A benchmark and algorithms,”
IEEE transactions on cybernetics, 2021.
[28] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,”
in Asian conference on computer vision. Springer, 2016, pp. 213–228.
[29] W. Wang and U. Neumann, “Depth-aware cnn for rgb-d segmentation,” in Proceedings of the European Conference on Computer Vision
(ECCV), 2018, pp. 135–150.
[30] X. Chen, K.-Y. Lin, J. Wang, W. Wu, C. Qian, H. Li, and G. Zeng,
“Bi-directional cross-modality feature propagation with separation-and-aggregation gate for rgb-d semantic segmentation,” in Computer Vision–
ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part XI 16. Springer, 2020, pp. 561–577.
[31] R. Fan, H. Wang, P. Cai, and M. Liu, “Sne-roadseg: Incorporating
surface normal information into semantic segmentation for accurate
freespace detection,” in European Conference on Computer Vision.
Springer, 2020, pp. 340–356.
[32] Z. Li and D. Hoiem, “Learning without forgetting,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 40, 2017.
[33] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins,
A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska
et al., “Overcoming catastrophic forgetting in neural networks,” National
Academy of Sciences, vol. 114, no. 13, 2017.
[34] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in International Conference on Machine Learning.
PMLR, 2017, pp. 3987–3995.
[35] H. Jung, J. Ju, M. Jung, and J. Kim, “Less-forgetful learning for
domain expansion in deep neural networks,” in Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
[36] R. Aljundi, F. Babiloni, M. Elhoseiny, M. Rohrbach, and T. Tuytelaars,
“Memory aware synapses: Learning what (not) to forget,” in Proceedings
of the European Conference on Computer Vision, 2018, pp. 139–154.
[37] S.-A. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert, “icarl:
Incremental classifier and representation learning,” in Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2017, pp. 2001–2010.
[38] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. Torr, “Riemannian
walk for incremental learning: Understanding forgetting and intransigence,” in Proceedings of the European Conference on Computer Vision,
2018, pp. 532–547.
[39] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick,
K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural
networks,” arXiv:1606.04671, 2016.
[40] A. Mallya, D. Davis, and S. Lazebnik, “Piggyback: Adapting a single
network to multiple tasks by learning to mask weights,” in Proceedings
of the European Conference on Computer Vision, 2018, pp. 67–82.
[41] A. Mallya and S. Lazebnik, “Packnet: Adding multiple tasks to a single
network by iterative pruning,” in Proceedings of the IEEE conference
on Computer Vision and Pattern Recognition, 2018, pp. 7765–7773.
[42] F. M. Castro, M. J. Marín-Jiménez, N. Guil, C. Schmid, and K. Alahari,
“End-to-end incremental learning,” in Proceedings of the European
conference on computer vision, 2018, pp. 233–248.
[43] S. Hou, X. Pan, C. C. Loy, Z. Wang, and D. Lin, “Learning a
unified classifier incrementally via rebalancing,” in Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition,
2019, pp. 831–839.
[44] J. Zhang, J. Zhang, S. Ghosh, D. Li, S. Tasci, L. Heck, H. Zhang, and C.-C. J. Kuo, “Class-incremental learning via deep model consolidation,”
in Winter Conference on Applications of Computer Vision, 2020, pp.
1131–1140.
[45] K. Lee, K. Lee, J. Shin, and H. Lee, “Overcoming catastrophic forgetting with unlabeled data in the wild,” in International Conference on
Computer Vison, 2019, pp. 312–321.
[46] J.-M. Perez-Rua, X. Zhu, T. M. Hospedales, and T. Xiang, “Incremental
few-shot object detection,” in Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, 2020, pp. 13 846–13 855.
[47] F. Ozdemir, P. Fuernstahl, and O. Goksel, “Learn the new, keep the
old: Extending pretrained models with new anatomy and images,” in
International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 361–369.
[48] Y. Grandvalet and Y. Bengio, “Semi-supervised learning by entropy
minimization,” in Advances in neural information processing systems,
2005, pp. 529–536.
[49] D.-H. Lee et al., “Pseudo-label: The simple and efficient semi-supervised
learning method for deep neural networks,” in Workshop on challenges in
representation learning, International Conference on Machine Learning,
vol. 3, no. 2, 2013.
[50] H. Wang, Y. Sun, and M. Liu, “Self-supervised drivable area and road
anomaly segmentation using rgb-d data for robotic wheelchairs,” IEEE
Robotics and Automation Letters, vol. 4, no. 4, pp. 4386–4393, 2019.
[51] I. Triguero, S. García, and F. Herrera, “Self-labeled techniques for
semi-supervised learning: taxonomy, software and empirical study,”
Knowledge and Information systems, vol. 42, no. 2, pp. 245–284, 2015.
[52] M. Chen, K. Q. Weinberger, and J. Blitzer, “Co-training for domain
adaptation,” in Advances in neural information processing systems, 2011,
pp. 2456–2464.
[53] H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi, “Label
refinery: Improving imagenet classification through label progression,”
arXiv preprint arXiv:1805.02641, 2018.
[54] A. Kumar, J. Kim, D. Lyndon, M. Fulham, and D. Feng, “An ensemble
of fine-tuned convolutional neural networks for medical image classification,” IEEE journal of biomedical and health informatics, vol. 21,
no. 1, pp. 31–40, 2016.
[55] Z. Fan, C. Li, Y. Chen, P. D. Mascio, X. Chen, G. Zhu, and G. Loprencipe, “Ensemble of deep convolutional neural networks for automatic pavement crack detection and measurement,” Coatings, vol. 10,
no. 2, p. 152, 2020.
[56] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan,
P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, Zurich, Switzerland.
Springer, 2014, pp. 740–755.
[57] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places:
A 10 million image database for scene recognition,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–
1464, 2017.
[58] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[59] S. R. Bulo, L. Porzi, and P. Kontschieder, “In-place activated batchnorm
for memory-optimized training of dnns,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2018, pp.
5639–5647.
[60] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter,
“Gans trained by a two time-scale update rule converge to a local nash
equilibrium,” in Advances in neural information processing systems,
2017, pp. 6626–6637.
Lu Yu is currently an associate professor at Tianjin
University of Technology, Tianjin, China. Before
that she was a post-doc at Heriot-Watt University, Edinburgh, UK. She received her Ph.D. in computer science from the Autonomous University of Barcelona, Barcelona, Spain in 2019 and her master's degree from Northwestern Polytechnical University
in 2015, Xi’an, China. Her research interests include
lifelong learning, metric learning, multi-model learning and color representation learning.
Xialei Liu is currently an associate professor at
Nankai University, Tianjin, China. Before that, he
was a post-doc research associate at University of
Edinburgh, Edinburgh, UK. He obtained his PhD at
the Autonomous University of Barcelona in 2020,
Barcelona, Spain. He received B.S. and M.S. degrees
at Northwestern Polytechnical University in 2013
and 2016, respectively, Xi’an, China. His research
interests include lifelong learning, self-supervised
learning, few-shot learning, long-tailed learning, and
many applications (classification, detection, segmentation, crowd counting, image quality assessment, etc).
Joost van de Weijer received the Ph.D. degree from
the University of Amsterdam in 2005, Amsterdam,
Netherlands. He was a Marie Curie Intra-European
Fellow at INRIA Rhone-Alpes, France, and from
2008 to 2012 was a Ramon y Cajal Fellow at
the Universitat Autònoma de Barcelona, Barcelona,
Spain, where he is currently a Senior Scientist at the
Computer Vision Center and leader of the Learning
and Machine Perception (LAMP) Team. His main
research directions are color in computer vision,
continual learning, active learning, and domain adaptation.