Unsupervised and Self-Adaptative Techniques For Cross-Domain Person Re-Identification
Abstract—Person Re-Identification (ReID) across non-overlapping cameras is a challenging task, and most works in prior art rely on supervised feature learning from a [...]

[...] comprises the primary techniques to find possible people, or groups of people, involved in an event and to, ultimately, propose candidate suspects for further investigation [1].
[...] of samples in an offline manner. We select one sample as an anchor for each camera represented in a cluster and two others as positive and negative examples. As a positive example, we choose a sample from one of the other represented cameras. In contrast, the negative example is a sample from a different cluster but from the same camera as the anchor. Consequently, the greater the number of cameras in a cluster, the more diverse the triplets used to train the model. With this approach, we give more importance to the more reliable clusters, regularize the model, and alleviate the dependency on hyper-parameters by using a single-term and single-hyper-parameter triplet loss function. This technique brings robustness and generalizability to the final model, easing its adaptation to different scenarios.
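To make the camera-guided sampling concrete, the sketch below shows one way the cross-camera triplet rule described above could be implemented. It is a minimal illustration rather than the authors' code: the function names, the input format (a mapping from pseudo-label to (sample, camera) pairs), and the random sampling policy are our assumptions.

```python
import random
from collections import defaultdict

def build_cross_camera_triplets(clusters, m=2):
    """Sketch of the cross-camera triplet rule described above.

    `clusters` maps a pseudo-label to a list of (sample_id, camera_id)
    pairs; `m` is the number of anchors drawn per camera in a cluster.
    Only clusters spanning two or more cameras are expected here.
    """
    triplets = []
    for label, samples in clusters.items():
        by_camera = defaultdict(list)
        for sample_id, cam in samples:
            by_camera[cam].append(sample_id)
        cameras = list(by_camera)
        if len(cameras) < 2:
            continue  # single-camera clusters are filtered out earlier
        for cam in cameras:
            n_anchors = min(m, len(by_camera[cam]))
            for anchor in random.sample(by_camera[cam], n_anchors):
                # positive: same cluster, but a different camera
                pos_cam = random.choice([c for c in cameras if c != cam])
                positive = random.choice(by_camera[pos_cam])
                # negative: different cluster, same camera as the anchor
                negative = sample_same_camera_other_cluster(clusters, label, cam)
                if negative is not None:
                    triplets.append((anchor, positive, negative))
    return triplets

def sample_same_camera_other_cluster(clusters, label, cam):
    """Pick a sample from any other cluster that shares the anchor's camera."""
    candidates = [sid for other, samples in clusters.items() if other != label
                  for sid, c in samples if c == cam]
    return random.choice(candidates) if candidates else None
```

Note how the number of triplets naturally grows with the number of cameras represented in a cluster, which is exactly the diversity argument made in the text.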
Another important observation is that, at different points of the adaptation from a source to a target domain, the model holds different levels of knowledge, as different portions of the target data are considered each time. Thus, we argue that the model has complementary knowledge in different iterations during training. Based on this, we propose a self-ensembling strategy to summarize the knowledge from various iterations into a unique final model.

Finally, based on recent advances in ensemble-based methods for ReID [12], [13], we propose to combine the knowledge acquired by different architectures. Unlike prior work, we avoid complex training stages by simply assembling the results from different architectures only during evaluation time.
To summarize, the contributions of our work are:
• A new approach to creating diverse triplets based on the variety of cameras represented in a cluster. This approach helps the model to be camera-invariant and more robust in generating the same person's features from different perspectives. It also allows us to leverage a single-term and single-hyper-parameter triplet loss function to be optimized.
• A novel self-ensembling fusion method, which enables the final model to summarize the complementary knowledge acquired during training. This method relies upon the knowledge held by the model in different checkpoints of the adaptation process.
• A novel ensemble technique to take advantage of the complementarity of different backbones trained independently. Instead of applying the typical knowledge distilling [14] or co-teaching [15], [16] methods, which add complexity to the training process, we propose using an ensemble-based prediction.
II. RELATED WORK

Several works address Unsupervised Domain Adaptation for Person Re-Identification. They can be roughly divided into three categories: generative, attribute alignment, and label proposing methods.

A. Generative Methods

ReID generative methods aim to synthesize data by translating images from a source to a target domain. Once data from the source dataset is labeled, the translated images on the target context receive the same labels as the corresponding original images. The main idea is to transfer low- and mid-level characteristics from the target domain, such as background, illumination, resolution, and even clothing, to the images in the source domain. These methods create a synthetic dataset of labeled images with the same conditions as the target domain and then adapt the model through supervised training. Some works in this category are SPGAN [17], PTGAN [18], AT-Net [19], CR-GAN [20], PDA-Net [21], and HHL [22]. Besides transferring the characteristics from source to target domain for image-level generation, DG-Net++ [23] also applies label proposing through clustering. The final loss is the aggregation of the GAN-based loss function used to generate images and the classification loss defined for the proposed labels. By doing this, they perform the disentangling and adaptation of the features on the target domain.

CCSE [24] performs camera mining and, using a GAN-based model, generates synthetic data for an identity considering the point of view of every other camera, increasing the number of images available for training. They leverage new clustering criteria to avoid creating massive clusters comprising most of the dataset and potentially having two or more true identities assigned to the same pseudo-label. Finally, they train directly from ImageNet, without considering any specific source domain. In comparison, our solution does not require synthetic images since we explore the cross-camera information inside each cluster using only real images. This leads our method to outperform CCSE under the same training conditions (unsupervised scenario).

B. Attribute Alignment Methods

These methods seek to align common attributes in both domains to ease transferring knowledge from source to target. Such features can be clothing items (backpacks, hats, shoes) and other soft-biometric attributes that might be common to both domains. These works align mid-level features and enable the learning of higher semantic features on the target domain. Works such as TJ-AIDL [25] consider a fixed set of attributes. However, source and target domains can have substantial context differences, leading to potentially different attributes. For example, the source domain could be recorded in an airport and the target domain in a shopping center. To obtain a better generalization, in [26], the authors propose the Multi-task Mid-level Feature Alignment (MMFA) technique to enable the method to learn attributes from both domains and align them for a better generalization on the target domain. Other methods, such as UCDA [27] and CASCL [28], aim to align attributes by considering images from different cameras on the target dataset.

C. Label Proposing Methods

Methods in this category predict possible labels for the unlabeled target domain by leveraging clustering methods (K-means [29] and DBSCAN [30], among others). Once the target data is pseudo-labeled, the next step is to train models to learn discriminative features on the new domain. PUL [7] applies the Curriculum Learning technique to adapt a model learned on a source domain to a target domain. However, as
K-means is used to cluster the features, it is not possible to account for camera variability. As K-means generates only convex clusters, it cannot find more complex cluster structures, hindering the performance. UDAP [8] and ISSDA-ReID [31] utilize DBSCAN as the clustering algorithm along with labeling refinement. SSG [9] also applies DBSCAN to cluster features of the whole, upper, and lower body parts of the identities of interest. The final loss is the sum of the individual triplet losses in each feature space (body part). Similar to our work, they use a source domain to pre-train the model and the target domain for adaptation. However, they do not perform cross-camera mining, cluster filtering, or ensembling. These elements of our solution allow it to outperform SSG in all adaptation scenarios.

ECN [32], ECN-GPP [33], MMCL [34], and Dual-Refinement [35] use a memory bank to store features, which is updated along the training to avoid the direct use of features generated by the model in further iterations. The authors aim to avoid propagating noisy labels to future training steps, contributing to keeping and increasing the discrimination of features during training.

PAST [10] applies HDBSCAN [36] as the clustering method, which is similar to OPTICS [37], the algorithm of choice in our work. However, the memory complexity of OPTICS is $O(n)$, while for HDBSCAN it is $O(n^2)$, making our model more memory-efficient in the clustering stage.
MMT [12], MEB-Net [13], ACT [38], SSKD [39], and ABMT [16] are ensemble-based methods. They consider two or more networks and leverage mutual teaching by sharing one network's outputs with the others, making the whole system more discriminative on the target domain. However, training models in a mutual-teaching regime adds complexity in memory and to the general training process. Besides that, noisy labels can be propagated to the other ensemble models, hindering the training process. Nonetheless, ensemble-based learning provides the best performance among state-of-the-art methods. We propose using ensembles only during inference, eliminating the complexity added to the training while still taking advantage of the complementary knowledge between the models.
Our work is also based on Curriculum Learning with Diversity [40], a schema whereby the model starts learning with easier examples, i.e., samples that are correctly classified with a high score early in training. However, in a multi-class problem, one of the classes might have more examples correctly classified early on, making it easier than the other classes. Therefore, in Curriculum Learning with Diversity, the method selects the most confident samples (easier samples) from the easier classes, including some examples from the harder ones. In this way, it enables the model to learn in an easy-to-hard manner, avoiding local minima and allowing better generalization.
Even though recent work achieves competitive performances, there are some limitations that we aim to address in our work. First, generative methods bring complexity by considering GANs to translate images from one domain to the other. Second, attribute alignment methods only tackle the alignment of low- and mid-level features. Third, methods in both categories need images from the source and target domains during adaptation. Finally, label proposing methods consider mutual-learning or co-teaching, which brings complexity to the training stage.

Similarly, we assume to have only camera-related information, i.e., we know from which camera (viewpoint) an image was taken. In all steps, we use pseudo-identity information exclusively given by the clustering algorithm, without relying on any ground-truth information. We differ from the prior art by using a new diversity learning scheme, generating triplets based on each cluster's diversity of points of view. As we train the whole model, the method also learns high-level features on the target domain. We simplify the training process by considering one backbone at a time, without mutual information exchange during adaptation. Finally, we apply model ensembling only for inference, after the training process.

III. PROPOSED METHOD

Our approach to Person ReID comprises two phases: training and inference. Figure 1 depicts the training process, while Table I shows the variables used in this work.

TABLE I: Variables' meaning in this work.

  Variable | Meaning
  ---------|------------------------------------------------------
  n_b      | Number of different backbones in the ensemble
  M        | Model backbone
  K1       | Number of iterations of the blue flow in Figure 1
  K2       | Number of iterations of the orange flow in Figure 1
  c_i      | i-th cluster in the feature space
  n_i      | Number of cameras in cluster c_i
  cam_j    | j-th camera in a cluster
  x_i^s    | i-th image in the source domain
  x_i^t    | i-th image in the target domain
  y_i^s    | Label of the i-th image in the source domain
  N_s      | Number of images in the source domain
  N_t      | Number of images in the target domain
  m        | Number of anchors per camera in a cluster
  α        | Margin parameter of the triplet loss
  B        | Batch of triplets in an iteration

During training, we independently optimize n_b different backbones to adapt the model to the target domain. This phase is divided into five main stages that are performed iteratively: feature extraction from all data; clustering; cluster selection; cross-camera triplet creation and fine-tuning; and feature extraction from pseudo-labeled data.

After training, we perform the proposed self-ensembling phase to summarize the training parameters into a single final model, based on the weighted average of the model parameters from each different checkpoint. We perform this step for each backbone independently and, in the end, we have n_b self-ensembled models.

During inference, for a query/gallery image pair, we calculate the distance between them considering the feature vectors extracted by each of the n_b models. Hence, for each query/gallery pair, we have n_b distances, one for each of the trained models. We then apply our last ensemble technique: the n_b distances are averaged to obtain a final distance. Finally, based on this final distance, we take the label of the closest gallery image as the query label.
Fig. 1. Overview of the training phase. We assume to have camera-related information, i.e., we know the camera used to acquire each image, and we do not rely on any ground-truth label information about the identities on the target domain. The pipeline has two flows: the blue flow is executed K1 times, and the orange flow is executed K2 times. Both flows share the steps in green. In Stage 1, we initially extract feature vectors for each training image in the target domain using model M, and cluster them with the OPTICS algorithm in Stage 2 to propose pseudo-labels. Afterward, we perform cluster selection in Stage 3, removing outliers and clusters with only one camera. Then, triplets are created based on each cluster's diversity in Stage 4a and used to train the model in Stage 4b. These steps form the blue flow, in which Clustering and Cluster Selection are performed. Instead of going back to Stage 1, the method then follows the orange flow: in Stage 5, we extract feature vectors of the samples selected in Stage 3, and the process continues to Stages 4a and 4b again. The blue flow marks an iteration, while the orange flow is called an epoch. Therefore, in each iteration, we have K2 epochs.
A. Training Stages 1 and 2: Feature Extraction from All Data and Clustering

Let $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ be a labeled dataset representing the source domain, formed by $N_s$ images $x_i^s$ and their respective identity labels $y_i^s$; and let $D_t = \{x_i^t\}_{i=1}^{N_t}$ be an unlabeled dataset representing the target domain, formed by $N_t$ images $x_i^t$. Before applying the proposed pipeline, we first train a model $M$ in a supervised way with source dataset $D_s$ and its labels. After training, assuming source dataset $D_s$ is no longer available, we perform transfer learning, updating $M$ to the target domain while only considering samples from the unlabeled target dataset $D_t$.
With model $M$ trained on $D_s$, we first extract all feature vectors from the images in $D_t$ and create a new set of feature vectors $\{M(x_i^t)\}_{i=1}^{N_t}$. We remove possible duplicated feature vectors by checking whether one is a repetition of another, which might be caused by duplicate images in the target data. The remaining feature vectors are L2-normalized to embed them into a unit hypersphere. The normalized feature vectors are clustered using the OPTICS algorithm to obtain pseudo-labels.
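A minimal sketch of this stage is given below, assuming a PyTorch backbone and a DataLoader that yields only image tensors; the function name and the duplicate-removal shortcut (torch.unique, which does not preserve the image-to-feature mapping kept by a real pipeline) are our simplifications, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_normalized_features(model, loader, device="cuda"):
    """Extract one embedding per target image and L2-normalize it onto
    the unit hypersphere, as done before clustering (Stage 1)."""
    model.eval()
    feats = []
    for images in loader:  # DataLoader over the unlabeled target images
        f = model(images.to(device))
        feats.append(F.normalize(f, p=2, dim=1).cpu())
    feats = torch.cat(feats)
    # Drop exact duplicate vectors, which may come from duplicated target
    # images. Note: torch.unique sorts rows; a real pipeline would keep
    # one representative per duplicate and preserve the index mapping.
    feats = torch.unique(feats, dim=0)
    return feats
```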
The OPTICS algorithm [37] leverages the principle of dense neighborhoods, similarly to DBSCAN [30]. DBSCAN defines the neighborhood of a sample as being formed by its closest feature vectors, with distances lower than a predefined threshold. Clusters are created based on these neighborhoods, and samples not assigned to any cluster are considered outliers. If the threshold changes, other clusters are discovered: current clusters can be split or combined to create new ones. In other words, if we change the threshold, other clusters might appear, creating a different label proposal for the samples. However, clusters that emerge from real labels often have different distributions and densities, indicating that a generally fixed threshold might not be sufficient to detect them. In this sense, OPTICS relaxes DBSCAN by ordering the feature vectors in a manifold based on the distances between them, which allows the construction of a reachability plot. Probable clusters with different densities are revealed as valleys in this plot and can be detected by their steepness. With this formulation, we are more likely to propose labels closer to the real label distribution on the target data.
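For reference, a minimal way to run this clustering step with scikit-learn's OPTICS implementation is sketched below. The steepness threshold ξ corresponds to the xi parameter; min_samples is an assumption of ours, as the paper only reports ξ = 0.05.

```python
import numpy as np
from sklearn.cluster import OPTICS

def propose_pseudo_labels(features, xi=0.05, min_samples=4):
    """Cluster L2-normalized features with OPTICS (xi cluster method).

    Returns one pseudo-label per sample; -1 marks outliers detected
    from the reachability plot.
    """
    optics = OPTICS(min_samples=min_samples, xi=xi,
                    metric="euclidean", cluster_method="xi")
    return optics.fit_predict(np.asarray(features))
```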
B. Training Stage 3: Cluster Selection

After the first and second stages, feature vectors are either assigned to a cluster or considered outliers. As people can be captured by one or more cameras in a ReID system, the produced clusters are naturally formed by samples acquired by different devices. We hypothesize that clusters with samples obtained by two or more cameras are more reliable than clusters with only one camera.

If an identity is well described by model M, its feature vectors should be close in the feature space regardless of the camera. Therefore, clusters with only one camera might be created due to a bias toward a particular device or viewpoint, and different identities captured by the same camera can be [...]
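The cluster selection rule just described admits a very small implementation, sketched below under the assumption that labels come from the OPTICS step above and that the camera of each sample is known (the function name is ours).

```python
import numpy as np

def select_reliable_clusters(labels, cameras):
    """Keep only clusters whose samples come from >= 2 cameras;
    outliers (label -1) are always discarded."""
    labels = np.asarray(labels)
    cameras = np.asarray(cameras)
    keep = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        if c == -1:
            continue  # outliers proposed by OPTICS
        members = labels == c
        if len(np.unique(cameras[members])) >= 2:
            keep |= members
    return keep  # boolean mask over the target samples
```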
where $B$ is a batch of triplets, $x_a$ is the anchor, $x_p$ is the positive sample, and $x_n$ is the negative one; $\alpha$ is the margin, which is set to 0.3, and $[\cdot]_+$ is the $\max(0, \cdot)$ function. This is illustrated in Figure 1, Stage 4b.
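Equation 1 itself falls outside this excerpt, but the definitions above describe the standard triplet loss; a sketch consistent with them follows (the averaging over the batch is our assumption).

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.3):
    """Single-term triplet loss implied by the surrounding text:
    [d(a, p) - d(a, n) + alpha]_+ aggregated over the batch B."""
    d_ap = F.pairwise_distance(f_a, f_p)  # anchor-positive distance
    d_an = F.pairwise_distance(f_a, f_n)  # anchor-negative distance
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()
```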
D. Stage 5: Feature Extraction from Pseudo-Labeled Samples

This stage is part of the orange flow performed after fine-tuning (Stage 4b). The main idea is to keep the pseudo-labeled clusters from Stage 3 while recreating a new set of triplets based on the new distances between samples after the model update in Stage 4b, bringing more diversity to the training phase. To do so, we extract feature vectors only for the samples of the pseudo-labeled clusters selected in Stage 3. The orange flow is performed K2 times, and a complete cycle defines an epoch. The blue flow is performed K1 times, and a complete cycle defines an iteration. Therefore, in each iteration, we have K2 epochs. This concludes the training phase.
Unlike the five best state-of-the-art methods proposed in the prior art (DG-Net++, MEB-Net, Dual-Refinement, SSKD, and ABMT), our solution is trained with a single-term loss, which contains only one hyper-parameter. Even the weight decay has been removed, as the proposed method can already calibrate the gradient to avoid overfitting, as we show in Section IV. Moreover, prior work performs clustering in the training phase through k-reciprocal encoding [42], which is a more robust distance metric than the Euclidean distance. However, it has a higher computational footprint, as it is necessary to check the neighborhood of each sample whenever distances are calculated. For training simplicity, we opt for the standard Euclidean distance to cluster the feature vectors. However, as k-reciprocal encoding gives the model higher discrimination, we adopt it during inference time. Therefore, differently from previous works, we calculate the k-reciprocal encoding only once, during inference.

[...] data from the target domain is considered in an iteration, it means that the model is more confident and can therefore have more discrimination power on the target domain. Hence, $p_i$ is equal to the percentage of reliable target data in the i-th iteration. Consequently, a model that takes more data from the target to train will have a higher weight $p_i$. Self-ensembling is illustrated in Figure 3. Note that we directly deal with the model's learned parameters and create a new one by averaging the weights.

We end up with a single model containing a combination of knowledge from different adaptation moments, which significantly boosts performance, as shown in Section V.

Fig. 3. Self-ensembling scheme after training. Different amounts of the target data (with no label information whatsoever) are used to fine-tune the model during the adaptation process. Different models created along the adaptation can be complementary. We create a new final model by weight-averaging the models' parameters from different iterations. Weight $p_i$ is based on the amount of reliable data from the target domain on the i-th iteration. We end up with a single model encoding knowledge from different moments of the adaptation.
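A minimal sketch of this parameter-level fusion follows, assuming the checkpoints are PyTorch state_dicts and the weights are the reliability fractions p_i; the function name is ours, and integer buffers (e.g., BatchNorm counters) would need special handling in practice.

```python
import copy
import torch

def self_ensemble(checkpoints, reliabilities):
    """Weight-average model parameters from different checkpoints.

    `checkpoints`: list of state_dicts saved along the adaptation.
    `reliabilities`: fraction p_i of reliable target data at each
    checkpoint, used as the (normalized) averaging weight.
    """
    weights = torch.tensor(reliabilities, dtype=torch.float64)
    weights = weights / weights.sum()
    fused = copy.deepcopy(checkpoints[0])
    for key in fused:
        fused[key] = sum(
            w * ckpt[key].double() for w, ckpt in zip(weights, checkpoints)
        ).to(checkpoints[0][key].dtype)
    return fused  # load with model.load_state_dict(fused)
```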
[...]

$$ d_{final}(q, g_i) = \frac{1}{K} \sum_{k=1}^{K} d\big(f_k(q), f_k(g_i)\big), \qquad (3) $$
where $K$ is the number of models in the ensemble. In this way, we can incorporate knowledge from different models, encoded as the distance between two feature vectors. After obtaining the distances between query $q$ and all images in the gallery, we take the label of the closest gallery image as the query label. We consider an equal contribution from each backbone: without labels on the target domain, it is impossible to evaluate the impact of the individual models and give them proportional weights in the combination.
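Equation 3 translates directly into a few lines of code; the sketch below assumes the per-backbone embeddings are already L2-normalized, and uses the plain Euclidean distance (the paper additionally applies k-reciprocal re-ranking at inference, which is omitted here).

```python
import torch

def ensemble_distance(query_feats, gallery_feats):
    """Equation 3: average the query-gallery distance matrices produced
    by the K backbones. `query_feats[k]` and `gallery_feats[k]` are the
    embeddings from backbone k, shaped (Q, D_k) and (G, D_k)."""
    K = len(query_feats)
    dist = sum(torch.cdist(query_feats[k], gallery_feats[k])
               for k in range(K)) / K
    return dist  # the closest gallery entry gives the predicted label
```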
IV. EXPERIMENTS AND RESULTS

This section presents the datasets we adopt in this work and compares the proposed method with the prior art through a comprehensive set of experiments considering different, and challenging, source/target domains.
A. Datasets

To validate our pipeline, we use three large-scale benchmark datasets from the Re-ID literature:

• Market1501 [43]: It has 12,936 images of 751 identities in the training set and 19,732 images in the testing set. The testing set is further divided into 3,368 images for the query set and 15,913 images for the gallery set. Following previous work, we removed the "junk" images from the gallery set, so 451 images are discarded. This dataset has a total of six non-overlapping cameras, and each identity is captured by at least two cameras.

• DukeMTMC-ReID [44]: It has 16,522 images of 702 identities in the training set and 19,889 images in the testing set. The testing set is divided into 2,228 query images and 17,661 gallery images of another 702 identities. The dataset has a total of eight cameras, and each identity is captured by at least two cameras.

• MSMT17 [18]: It is the most challenging ReID dataset in the prior art. It comprises 32,621 images of 1,401 identities in the training set and 93,820 images of 3,060 identities in the testing set. The testing set is divided into 11,659 images for the query set and 82,161 images for the gallery set. It comprises 15 cameras recorded in three day periods (morning, noon, and afternoon) on four different days. Of the 15 cameras, 12 are outdoor cameras and three are indoor cameras. Each identity is captured by at least two cameras.
As done in previous work in the literature, we remove from the gallery the images with the same identity and camera as the query, to assess the model performance in a cross-camera matching. Feature vectors are L2-normalized before calculating distances. For evaluation, we calculate the Cumulative Matching Curve (CMC), from which we report Rank-1 (R1), Rank-5 (R5), and Rank-10 (R10), as well as the mean Average Precision (mAP).
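For completeness, a minimal re-implementation of this evaluation protocol is sketched below (not the authors' evaluation code): for each query, gallery entries sharing both its identity and its camera are removed before computing CMC ranks and average precision.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, q_cams, g_cams, ranks=(1, 5, 10)):
    """Minimal CMC / mAP computation under the protocol above."""
    cmc = np.zeros(max(ranks))
    aps = []
    for i in range(len(q_ids)):
        order = np.argsort(dist[i])
        # drop gallery images with the query's identity AND camera
        valid = ~((g_ids[order] == q_ids[i]) & (g_cams[order] == q_cams[i]))
        matches = (g_ids[order][valid] == q_ids[i]).astype(float)
        if not matches.any():
            continue
        first = int(np.argmax(matches))      # earliest correct match
        cmc[first:] += 1
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    n = len(aps)
    results = {f"R{r}": cmc[r - 1] / n for r in ranks}
    results["mAP"] = float(np.mean(aps))
    return results
```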
B. Implementation Details

In terms of deep-learning architectures, we adopt ResNet50 [45], OSNet [4], and DenseNet121 [46], i.e., n_b = 3, all of them pre-trained on ImageNet [47]. To test them in an adaptation scenario, we choose one of the datasets as the source and another as the target domain. We train the backbone over the source domain and the adaptation pipeline over the target domain. We consider Market1501 and DukeMTMC-ReID as source domains, leaving MSMT17 only as a target dataset (the hardest one in the prior art). This way, we have four possible adaptation scenarios: Market → Duke, Duke → Market, Market → MSMT17, and Duke → MSMT17. We keep those scenarios (without MSMT17 as a source) to have a fair comparison with state-of-the-art methods. Besides, the most challenging scenario is MSMT17 as the target dataset: we train backbones on simpler datasets (Market and Duke) and adapt their knowledge to a harder dataset, with almost double the number of cameras and many more identities, recorded at different moments of the day and the year. This enables us to test the generalization of our method in adaptation scenarios where source and target domains have substantial differences in the number of identities, camera recording conditions, and environment.

We used the code available at [48] to train OSNet and at [13] to train ResNet50 and DenseNet121 over the source domains. Our source code is based on PyTorch [49] and is freely available at https://github.com/Gabrielcb/Unsupervised_selfAdaptative_ReID.

After training, we remove the last classification layer from all backbones and use the last layer's output as our feature embedding. We trained our pipeline using the three backbones independently in all adaptation scenarios. Considering the flows depicted in Figure 1, we perform K1 = 50 cycles of the blue flow (50 iterations) and, in each one, we perform K2 = 5 cycles of the orange flow (5 epochs). We consider Adam [50] as the network optimizer and set the learning rate to 0.0001 in the first 30 iterations. After the 30th iteration, we divide it by ten and keep it unchanged until reaching the maximum number of iterations. As we show in our experiments, we can set the weight decay to zero since our proposed cross-camera triplet creation can regularize the model without extra hyper-parameters. The triplet batch size is set to 30; batches with 30 triplets are used to update the model in each epoch. The margin in Equation 1 is set to 0.3, and the number of anchors is set to m = 2. We resize the images to 256 × 128 × 3 and apply Random Flipping and Random Erasing as data augmentation strategies during training.
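The reported configuration maps onto standard PyTorch components roughly as follows; this is an assumed recreation of the setup, and the exact code may differ from the authors' repository.

```python
import torch
import torchvision
import torchvision.transforms as T

model = torchvision.models.resnet50(pretrained=True)  # one of the three backbones
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)
# the learning rate is divided by ten after iteration 30
# (call scheduler.step() once per iteration)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)

train_transform = T.Compose([
    T.Resize((256, 128)),       # images resized to 256 x 128 x 3
    T.RandomHorizontalFlip(),   # Random Flipping
    T.ToTensor(),
    T.RandomErasing(),          # Random Erasing
])
```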
C. Comparison with the Prior Art

Tables II and III show results comparing the proposed method to the state of the art. The proposed method outperforms the other methods regarding mAP and Rank-1 in Market → Duke by improving those values by 1.8 and 1.7 percentage points (p.p.), respectively, and without re-ranking. In the Duke → Market scenario, we obtain a solid competitive performance, with a value only 0.1 p.p. lower in Rank-1, also without re-ranking.

In turn, ABMT applies k-reciprocal encoding during training, which is more robust than the Euclidean distance. However, it is more expensive to calculate, as it is necessary to search for the k-reciprocal neighbors of each feature vector in each iteration of the algorithm before clustering. In our case, we only apply the standard Euclidean distance during training, reducing the training time and the complexity of the adaptation while still obtaining performance gains. Moreover, we have a single-term and single-hyper-parameter loss function, while ABMT depends on a loss with three terms and more hyper-parameters. They apply a teacher-student strategy to their training, while we perform ensembling only for inference. Therefore, with a more direct pipeline and ensemble prediction, the proposed method has a Rank-1 only 0.1 p.p. lower in Duke → Market, while outperforming all methods in all other adaptation scenarios. However, to benefit from the k-reciprocal encoding while keeping a simpler training process, we also apply it during inference. In this case, the proposed method outperforms the methods in the prior art regarding mAP and Rank-1 in all adaptation scenarios.
Compared to SSKD in the Duke → Market scenario, we are below it by 0.3 and 0.4 p.p. in Rank-5 and Rank-10, respectively. Considering the closest actual gallery match to the query (R1), our ensemble retrieves more correct matches, as Table II shows, with our method outperforming SSKD by 1.2 p.p. in Rank-1 without re-ranking. Even with fewer hyper-parameters than SSKD and a more straightforward training process (no co-teaching, a simpler loss function, and late ensembling), our method shows competitive results considering the training-complexity trade-off.
Interestingly, the proposed method performs better under more difficult adaptation scenarios. We measure the difficulty of a scenario based on the number of different cameras it comprises. Market, Duke, and MSMT17 have 6, 8, and 15 cameras, respectively. Hence, the most challenging adaptation scenario is from Market to MSMT17: we adapt a model from a simpler scenario (6 cameras, all videos recorded in the same day period and the same season of the year) to a more complex target domain (15 cameras, 12 outdoors and 3 indoors, recorded at 3 different day periods, morning, noon, and afternoon, on 4 different days, each day in a different season of the year). Market → MSMT17 is the most challenging adaptation and the closest to real-world conditions, where we might have people recorded throughout the day and in different locations (indoors and outdoors). In this case, as shown in Table III, we obtained the highest performance even without re-ranking techniques. The proposed method outperforms the state of the art by 1.5 and 2.1 p.p. in mAP and Rank-1, respectively, on Duke → MSMT17, and by 2.2 and 4.2 p.p. on the most challenging scenario, Market → MSMT17.
There are several reasons why our method performs well. We explicitly design a model to deal with the diversity of cameras and viewpoints by creating a set of triplets based on the different cameras in a cluster. We also keep a more straightforward training, with only one hyper-parameter in our loss function (the triplet loss margin). Most works in the ReID literature optimize a loss function with many terms and hyper-parameters. They usually consider the Duke → Market or the Market → Duke scenarios (or both of them) to perform grid-searching over hyper-parameter values. Once they find the best values, they keep them unchanged for all adaptation setups.

In ABMT [16], the authors do not provide a clear explanation of how they define the hyper-parameter values for their loss function. However, they perform an ablation study over the Duke → Market and Market → Duke scenarios, so their results might be biased to those specific setups, which gives them one of the best performances. However, when they keep the same values for different and more challenging scenarios, such as Market → MSMT17 or Duke → MSMT17, they obtain worse results than ours by a large margin. This shows that our method provides a better generalization capability, brought by a simpler loss function and more diverse training. It prevents us from choosing specific hyper-parameter values and being biased to a specific adaptation setup. Consequently, we achieve the best performances, especially in the most challenging scenarios.

D. Discussion

As we aim to re-identify people in a camera system in an unsupervised way, we must be robust to hyper-parameters that require adjustments based on grid-searching using true label information, keeping the training process (and the adaptation to a target domain) as simple as possible. If a pipeline is complex and too sensitive to hyper-parameters, it might be challenging to train and deploy it in a real investigation scenario, where we do not have prior knowledge about the people of interest. This complexity leads to sub-optimal performance, as has already been pointed out in [55]. The authors claim that most works rely on many hyper-parameters during the adaptation stage, which can help or hinder the performance, depending on the value assigned to them and on which adaptation scenario is considered.

SSKD [39] is an ensemble-based method leveraging three deep models in a co-teaching training regime with a four-term loss function with three hyper-parameters. One of the terms of their final loss function is a multi-similarity loss [56], with three extra hyper-parameters to train the model.

MEB-Net has a complex training process, relying on a co-training technique with three deep neural networks in which each one learns with the others. Each of these three networks has its separate loss function with six terms, and their overall loss function is a weighted average of the individual loss functions from each model in the ensemble.

ABMT also leverages a teacher-student model where the teacher and student networks share the same architecture, increasing time and memory complexity during training. Moreover, they utilize a three-term loss function to optimize both models, with three hyper-parameters controlling the contribution of each term to the final loss. They update the teacher weights based on the exponential moving average (EMA) of the student weights, in order to avoid error-label amplification during training. This also adds another parameter, controlling the inertia of the teacher weights' EMA. The authors do not perform an ablation study regarding the hyper-parameter value variation to assess their impact on the final performance.
TABLE II: Results on the Market1501 → DukeMTMC-ReID and DukeMTMC-ReID → Market1501 adaptation scenarios. We report mAP, Rank-1, Rank-5, and Rank-10, comparing to several state-of-the-art methods. The best result is shown in bold, the second underlined, and the third in italic. Works with (*) do not pre-train the model on any source dataset before adaptation.
Based on these observations, our proposed model better captures the diversity of real cases by considering a loss function with a single term that is less sensitive to hyper-parameters (only the margin α needs to be selected). In such setups, it is difficult to select hyper-parameter values correctly, as we might not know any information about the identities on the target domain. The self-ensembling also summarizes the whole training into a single model by using each checkpoint's confidence values over the target data, without using any hyper-parameter or human-defined value. Even adopting a more straightforward formulation, we still obtain state-of-the-art performance on the Market → Duke scenario and competitive performance on the Duke → Market scenario. Each architecture in our work is trained in parallel without any co-teaching strategy. After self-ensembling, the joint contribution from different backbones is applied only at evaluation time, avoiding label propagation of noisy examples (e.g., potential outliers) while still taking advantage of the complementarity between them.

Our assumptions are the same as those of recent prior art [11], [33], [24]. We assume to know from which camera an image of a person was recorded, but not the identity. We rely on camera information to filter out clusters with elements captured by only one camera and to create the cross-camera triplets.

We also assume that at least two cameras have captured most identities and that all cameras have non-overlapping vantage points. All prior art holds this assumption, as defined by the datasets and the train/test split division.

Finally, we assume that training on a source domain related to Person Re-Identification gives the model basic knowledge to adapt to the target domain. This knowledge enables the model to propose better initial clusters in early iterations, grouping feature vectors from the same identity recorded from different cameras. The pipeline starts the adaptation with more reliable pseudo-labels in the clustering step and progressively creates more clusters representing more identities on the target domain. All works in Table II that do not have the (*) after their name hold this assumption.

Section V shows that our pipeline still performs well even without pre-training on a source dataset. In other words, we take the backbone trained over ImageNet and directly apply it without any previous ReID-related knowledge. Even in this setup, we can achieve competitive performance.

E. Qualitative Analysis

We now provide a qualitative analysis by highlighting regions of the top-10 gallery images returned for a given query image. The redder the color of a region, the more important it is to the ranking. As explained in Section IV-A, the correct matches always come from cameras different from the query's camera. The green contour denotes a true positive, the red contour a false positive, and the blue color the query image. We present successful cases (when the first gallery image is a true positive) and failure cases (when the first gallery image is a false positive) for each camera on the Market1501 and DukeMTMC-ReID datasets. We show two successful cases and two failure cases (one for each dataset) in Figures 4 and 5, considering ResNet50 as the backbone. For visualizations for all cameras of both datasets, please refer to the Supplementary Material. MSMT17 was not considered, as the dataset agreement does not allow the reproduction of the images in any format.

Figures 4a and 5a depict two successful cases in the Market → Duke and Duke → Market scenarios, respectively. In both cases, we see that our model finds fine-grained details in the image, leading to a correct match. As an example, Figure 4a shows the model focusing on the red jacket, even in a different pose and under occlusion (7th and 10th images from left to right). Figure 5a shows that the model can overcome pose changes of the query in a cross-view setup.
TABLE III: Results on the Market1501 → MSMT17 and DukeMTMC-ReID → MSMT17 adaptation scenarios. We report mAP, Rank-1, Rank-5, and Rank-10, comparing to several state-of-the-art methods. The best result is shown in bold, the second underlined, and the third in italic. Works with (*) do not pre-train the model on any source dataset before adaptation.
Fig. 4. The most activated regions in the gallery images given a query on DukeMTMC-ReID (Market → Duke scenario) for ResNet50. (a) Successful match; (b) Failure case.
Fig. 5. The most activated regions in the gallery images given a query on Market1501 (Duke → Market scenario) for ResNet50. (a) Successful match; (b) Failure case.
The query only shows the person's back, but the closest image is a true match showing the person from the front. The same happens in the second closest image, where the identity has its back recorded by another camera, and in the fourth and fifth closest images, where only the right side is captured. The third closest image not only records a different position from the query but also has a different resolution. This shows that the model effectively overcomes identity pose changes and resolution differences on cross-view cameras.
Figures 4b and 5b depict failure cases that show the limitations of the method. Errors happen when there is no person in the image; see Figure 4b, in which the person has been fully occluded by the car. In this case, the method does not have any specific region to focus on, and the gallery images are then almost fully activated. Another failure case happens when the identity is on a motorcycle (Figure 5b) along with another identity, which led to mismatches where there is no identity (distractor images in the gallery) or to images with parts of a bike. In the Supplementary Material, we provide more successful and failure cases from other cameras for both datasets.

F. Results on an Unsupervised Scenario

This section explores the possibilities of our method when not performing any pre-training on a source domain. Here, the method starts directly with backbones trained over ImageNet. This is a harder case, as we eliminate the possibility of having prior knowledge of the person re-identification problem. It requires the backbones to adapt themselves to the target without relying on any identity-related annotation coming from the source domain. Table II shows the results denoted by "Ours (w/o Re-Ranking)*". In this case, we keep ξ = 0.05 when Duke is the target, as in previous results, and ξ = 0.03 when Market is the target. The value ξ = 0.05 was too strict, leading to clusters with images from only one camera for the Market dataset. Section V presents a deeper analysis of different choices of ξ in the clustering process.

However, when we consider Duke as the target domain, the model without source pre-training is the third best. We lose 3.8 and 2.6 p.p. to the equivalent pre-trained model in mAP and Rank-1, respectively, and we lose 2.0 and 0.9 p.p. compared to ABMT, while outperforming all other methods. This shows that, although our model is not completely robust to the backbone initialization, it is still capable of mining discriminative
features, even without pre-training, providing comparable or better results when compared to the state of the art.

The proposed method outperforms all others under the same conditions (no pre-training, denoted with a star in Table II). The difference to the best one (CycAs) is 2.9 and 4.7 p.p. in mAP and Rank-1 when Market is the target, and 8.7 and 4.5 p.p. in mAP and Rank-1 when Duke is the target.

We conclude that previous training on a ReID source-related dataset is important for better performance on the task. However, when no ReID source domain is available, our method can still provide competitive results, mainly in the more challenging scenario (Duke as target).
V. ABLATION STUDY

This section shows the contribution of each part of the pipeline to the final result. In each experiment, we change one of the parts and keep the others unchanged. If not explicitly mentioned, we consider ResNet50 as the backbone, OPTICS with hyper-parameter ξ = 0.05, and self-ensembling applied after training.

A. Impact of the Clustering Hyper-parameter

Although we have only one hyper-parameter in the loss function, we still need to set the hyper-parameter ξ of the OPTICS clustering algorithm, a threshold in the range [0, 1]. The closer ξ is to 1, the stricter the criterion to define a cluster; that is, we might have many samples not assigned to any cluster, which leads to several detected outliers (if ξ = 1, all feature vectors are detected as outliers). In contrast, the closer ξ is to 0, the more relaxed the criterion, and more samples are assigned to clusters (if ξ = 0, all feature vectors are grouped into a single cluster). In Figure 6, we show the impact of the threshold ξ for the Market → Duke and Duke → Market scenarios.

Fig. 6. Impact of clustering hyper-parameter ξ. Results on (a) Market → Duke, and (b) Duke → Market.

The best value for ξ changes according to the adaptation scenario. This is expected when dealing with different unseen target domains. In both cases, the Rank-1, Rank-5, and Rank-10 curves are more stable than the mAP curve, showing that the parameter does not impact the retrieval of true positive images. The best Rank-1 values are obtained for ξ between 0.04 and 0.08 considering both scenarios and, in the more challenging one (Market → Duke), it achieves the second-best value when ξ = 0.05, for both mAP and Rank-1. Although the best performance is achieved when ξ = 0.07 (best mAP and Rank-1), it relies on an unstable point in the setup of Duke → Market, and it is only marginally better than ξ = 0.05 for Market → Duke. Rank-5 and Rank-10 tend to be more stable in both cases. Thus, we adopt ξ = 0.05 in all scenarios.

B. Impact of Curriculum Learning

In our pipeline, Stage 3 is responsible for cluster selection. After running the clustering algorithm, a feature vector can be an outlier, assigned to a cluster with only one camera, or assigned to a cluster with two or more cameras. We argue that feature-space cleaning is essential for better adaptation, and that feature vectors in a cluster with at least two cameras are more reliable than ones assigned as outliers or to a cluster with a single camera. We then consider the curriculum learning principle to select the most confident samples and learn in an easy-to-hard manner. To achieve this, we remove the outliers and the clusters with only one camera. To check the impact of this removal, we performed four experiments in which we alternate between keeping the outliers and the clusters with only one camera. The results are summarized in Table IV.

We observe a performance gain on most metrics, especially on mAP and Rank-1, when we apply our cluster selection strategy. If we keep the outliers in the feature space (first and third rows in Table IV), we face the most significant performance drop in both adaptation scenarios. This shows the importance of removing outliers after the clustering stage; otherwise, they can be considered in the creation of triplets, increasing the number of false negatives (for instance, selecting negative samples of the same real class) and, consequently, hindering the performance. We see a lower performance drop by keeping clusters with only one camera but without outliers (second row), indicating that those clusters do not hinder the performance much, but might contain noisy samples for model updating. This is more evident when we verify that the largest gains were in mAP and the smallest in Rank-1 in the last row. It demonstrates that if we keep one-camera clusters, the model can still retrieve most of the gallery's correct images but with lower confidence. Hence, the cluster selection criterion effectively improves our model generalization, and we apply it in all adaptation scenarios.

With this strategy, we observe that the percentage of feature vectors from the target domain kept in the feature space increases during the adaptation, as shown in Figures 7c and 8c. In fact, reliability, mAP, and Rank-1 increase during training (Figures 7 and 8), which means that the model becomes more robust in the target domain as more iterations are performed. This demonstrates the importance of curriculum learning, where easier examples at the beginning of the training (images whose feature vectors are assigned to clusters with at least two cameras in early iterations) are used to give initial knowledge about the unseen target domain and allow the model to increase its performance gradually.

As a direct consequence, the number of clusters with only one camera removed from the feature space decreases, as shown in Figure 9. This means that the model learns to group cross-view images in the same cluster. For the Market → Duke scenario, the initial percentage of removed clusters is higher than on Duke → Market.
TABLE IV: Impact of curriculum learning, when considering different cluster selection criteria. We tested our method with and without outliers and with and without clusters with only one camera in the feature space. All experiments consider ResNet50 as the backbone, with self-ensembling applied after training.

  Outliers kept | One-camera clusters kept | Duke → Market (mAP / R1 / R5 / R10) | Market → Duke (mAP / R1 / R5 / R10)
  ✓             |                          | 50.9 / 79.2 / 89.5 / 92.8           | 32.7 / 56.7 / 68.5 / 72.9
                | ✓                        | 72.4 / 89.5 / 95.2 / 96.7           | 66.8 / 81.1 / 90.2 / 92.4
  ✓             | ✓                        | 49.1 / 79.8 / 89.5 / 92.6           | 32.7 / 57.2 / 68.4 / 72.3
                |                          | 74.1 / 89.6 / 95.3 / 97.1           | 67.8 / 81.7 / 90.0 / 92.6
C. Impact of Self-Ensembling

To check the contribution of our proposed self-ensembling method, explained in Section III-E, we take the best checkpoint of our model during adaptation in both scenarios, considering all backbones, and compare it with the self-ensembled model. Note that we select the best model only for reference. In practice, we do not know the best checkpoint during training, since we do not have any identity-label information. Our goal here is merely to show that our self-ensembling method leads to a final model that outperforms any checkpoint individually. Even if we do not have any label information to choose the best one during training, the self-ensembling can summarize the whole training process in a final model, which is better than all checkpoints. Table V shows these results.

Our proposed self-ensembling method can improve the discriminative power over the target domain by summarizing the whole training during adaptation. The method outperforms the best models in mAP by 2.0, 4.5, and 4.3 p.p. on Duke → Market, for ResNet50, OSNet, and DenseNet121, respectively. Similarly, for Market → Duke, we achieve an improvement of 1.6, 2.2, and 3.3 p.p. in mAP for ResNet50, OSNet, and DenseNet121, respectively. We can also observe gains for all backbones in both scenarios considering Rank-1. Therefore, our proposed self-ensembling strategy increases the number of correct examples retrieved from the gallery and their confidence. It shows that different checkpoints, trained with different percentages of the data from the target domain, have complementary information. Besides, as the self-ensembling is performed at the parameter level, without human supervision and considering each checkpoint's confidence, it reduces the memory footprint by eliminating all unnecessary checkpoints and keeping only the self-ensembled final model.
D. Impact of Ensemble-Based Prediction

To increase discrimination ability, we combine the distances computed by all considered architectures (Equation 3) for the final inference. Results are shown in Table VI.

The ensembled model outperforms the individual models by 3.3, 5.2, and 0.9 p.p. regarding Rank-1, on Duke → Market, for ResNet50, OSNet, and DenseNet121, respectively. The same can be observed for Market → Duke, in which Rank-1 is improved by 3.3, 2.9, and 1.6 p.p. for ResNet50, OSNet, and DenseNet121, respectively. Results for all the other metrics also increase for both adaptation scenarios. Therefore, we can effectively combine knowledge encoded in models with different architectures. By performing this only for inference, we keep a simpler training process and can still take advantage of the ensembled knowledge from different backbones.
E. Processing Footprint

To measure the processing footprint of our pipeline (training and inference), we consider two representative adaptation scenarios: Market → Duke and Market → MSMT17. As explained, the first setup represents a mildly difficult case, and the second is the most challenging one. Table VII shows the time measurements.

The overall time to execute the pipeline and the whole training in the Market → Duke scenario is smaller than in Market → MSMT17, as expected, given that the latter is a more complex setup. As the number of training images is higher, the number of proposed clusters is also higher on MSMT17. This leads to an increase in clustering, filtering, and overall training times.

OSNet is the backbone that takes the least time in both adaptation setups because of its feature embedding size. For ResNet50 and DenseNet121, the embeddings have 2,048 dimensions, while OSNet's have 512. This allows a faster clustering, as Table VII shows. Considering the same adaptation scenario, the clustering step is the most affected by the backbone and its respective embedding size. This is why ResNet50 and DenseNet121 present more similar training times and OSNet is the fastest one.

The inference time is calculated assuming that all gallery feature vectors have been extracted and stored. It is the average time to predict the label of one query based on the ranking of the gallery images, following the protocol presented in Section IV-B. The difference between both adaptation scenarios is due to the gallery size. As explained in Section IV-A, MSMT17 has a gallery more than 4× bigger than Duke's.

For all experiments, we used two GTX 1080 Ti GPUs. One of them is used exclusively for clustering, with an implementation based on [58], and the other for pipeline training, for each backbone.
TABLE V: Impact of self-ensembling. We consider a weighted average of the parameters of the backbone at different moments of the adaptation. "Best" refers to results obtained with the checkpoint with the highest Rank-1 during adaptation. "Fusion" is the final model created through the proposed self-ensembling method. The best results are in bold.
TABLE VI: Impact of ensemble-based prediction. Performance with and without model ensembling during inference. Best values are in bold.
TABLE VII: Time evaluation. We report each time in hh:mm:ss for training and in milliseconds (ms) for inference. For training, we analyze the time taken to cluster and filter (Stages 2 and 3), one round of fine-tuning (Stage 4b), one epoch (the time taken to perform K2 iterations of the orange flow), and the whole pipeline training. For inference, we report the time to predict the identity of a query image given the gallery feature vectors.

              Market → Duke                                                  | Market → MSMT17
              Clustering+filtering  Finetuning  Epoch     Whole training  Inference | Clustering+filtering  Finetuning  Epoch     Whole training  Inference
  ResNet      00:03:55              00:08:55    00:13:34  11:31:19        5 ms      | 00:16:45              00:09:36    00:28:08  23:00:55        13 ms
  OSNet       00:01:53              00:08:56    00:11:14  09:33:04        4 ms      | 00:07:41              00:12:20    00:20:59  17:49:40        11 ms
  DenseNet    00:04:06              00:08:33    00:13:36  11:33:14        4 ms      | 00:16:46              00:11:27    00:31:13  26:32:08        13 ms
  Ensemble    -                     -           -         -               6 ms      | -                     -           -         -               22 ms
VI. CONCLUSIONS AND FUTURE WORK

In this work, we tackle the problem of cross-domain Person Re-Identification (ReID) with non-overlapping cameras, especially targeting forensic scenarios with fast deployment requirements. We propose an Unsupervised Domain Adaptation (UDA) pipeline with three novel techniques: (1) cross-camera triplet creation, aiming at increasing diversity during training; (2) self-ensembling, to summarize complementary information acquired at different iterations during training; and (3) an ensemble-based prediction technique to take advantage of the complementarity between different trained backbones.

Our cross-camera triplet creation technique increases the model's invariance to different points of view and types of cameras in the target domain, and increases the regularization of the model, allowing the use of a single-term, single-hyper-parameter triplet loss function. Moreover, we showed the importance of having this more straightforward loss function: it is less biased towards specific scenarios and helps us achieve state-of-the-art results in the most complex adaptation setups, surpassing prior art by a large margin in most cases.

The self-ensembling technique helps us increase the final performance by aggregating information from different checkpoints throughout the training process, without human or label supervision. This is inspired by the reliability measurement, which shows that our models learn from more reliable data as more iterations are performed. Furthermore, this process is done in an easy-to-hard manner to increase model confidence gradually.

Finally, our last ensemble technique takes advantage of the complementarity between different backbones, enabling us to achieve state-of-the-art results without adding complexity to the training, differently from the mutual-learning strategies used in current methods [13], [39], [16]. It is important to note that both ensembling strategies are applied after training to generate a final model and a final prediction.

Because the training process is more straightforward than that of other state-of-the-art methods and does not need information on the target domain's identities, our work is easily extendable to other adaptation scenarios and deployable in actual investigations and other forensic contexts.

A key aspect of our method, also shared with other recent methods in the literature [28], [11], [33], is that it requires information about the camera used to acquire each sample. That is, we suppose we know, a priori, the device that captured each image. This information does not need to be the specific type of camera, but it requires, at least, information about different camera models. Without this information, our model could face suboptimal performance, as it would not be able to take advantage of the diversity introduced by the cross-camera triplets. To address this drawback, we aim to extend this work by incorporating techniques for automatic camera attribution [59], [60], allowing the identification of the camera used to acquire an image or identifying whether the same camera acquired a pair of images.

Regarding the clustering process, our method requires that all selected samples be considered during this phase, which demands pairwise distance calculations between all feature vectors. Therefore, this approach may introduce higher processing times to the pipeline. In this sense, we also aim to extend our method to scale to very large datasets by introducing online deep clustering and self-supervised techniques directly into the pipeline.

Another possible extension of our pipeline is its application to general object re-identification, such as vehicle ReID, to mine critical objects of interest in an investigation. Together with Person ReID, this could ultimately enable a joint analysis by matching mined identities and objects to propose relations between them during an event's analysis.

ACKNOWLEDGMENT

We thank the financial support of the São Paulo Research Foundation (FAPESP) through the grants DéjàVu #2017/12646-3 and #2019/15825-1.

REFERENCES

[1] R. Padilha, C. M. Rodrigues, F. A. Andaló, G. Bertocco, Z. Dias, and A. Rocha, "Forensic event analysis: From seemingly unrelated data to understanding," IEEE Security and Privacy, vol. 18, no. 6, pp. 23–32, 2020.
[2] X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue, "Multi-scale deep learning architectures for person re-identification," in International Conference on Computer Vision, 2017, pp. 5399–5408.
[3] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, "Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)," in European Conference on Computer Vision, 2018, pp. 480–496.
[4] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, "Omni-scale feature learning for person re-identification," in International Conference on Computer Vision, 2019, pp. 3702–3712.
[5] X. Chen, C. Fu, Y. Zhao, F. Zheng, J. Song, R. Ji, and Y. Yang, "Salience-guided cascaded suppression network for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 3300–3310.
[6] C. Liu, X. Chang, and Y.-D. Shen, "Unity style transfer for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6887–6896.
[7] H. Fan, L. Zheng, C. Yan, and Y. Yang, "Unsupervised person re-identification: Clustering and fine-tuning," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 4, pp. 1–18, 2018.
[8] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang, "Unsupervised domain adaptive re-identification: Theory and practice," Pattern Recognition, vol. 102, p. 107173, 2020.
[9] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang, "Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification," in International Conference on Computer Vision, 2019, pp. 6112–6121.
[10] X. Zhang, J. Cao, C. Shen, and M. You, "Self-training with progressive augmentation for unsupervised cross-domain person re-identification," in International Conference on Computer Vision, 2019, pp. 8222–8231.
[11] Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian, "AD-Cluster: Augmented discriminative clustering for domain adaptive person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9021–9030.
[12] Y. Ge, D. Chen, and H. Li, "Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification," arXiv preprint, vol. arXiv:2001.01526, 2020.
[13] Y. Zhai, Q. Ye, S. Lu, M. Jia, R. Ji, and Y. Tian, "Multiple expert brainstorming for domain adaptive person re-identification," arXiv preprint, vol. arXiv:2007.01546, 2020.
[14] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint, vol. arXiv:1503.02531, 2015.
[15] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, "Co-teaching: Robust training of deep neural networks with extremely noisy labels," in Advances in Neural Information Processing Systems, 2018, pp. 8527–8537.
[16] H. Chen, B. Lagadec, and F. Bremond, "Enhancing diversity in teacher-student networks via asymmetric branches for unsupervised person re-identification," in IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1–10.
[17] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 994–1003.
[18] L. Wei, S. Zhang, W. Gao, and Q. Tian, "Person transfer GAN to bridge domain gap for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 79–88.
[19] J. Liu, Z.-J. Zha, D. Chen, R. Hong, and M. Wang, "Adaptive transfer network for cross-domain person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7202–7211.
[20] Y. Chen, X. Zhu, and S. Gong, "Instance-guided context rendering for cross-domain person re-identification," in International Conference on Computer Vision, 2019, pp. 232–242.
[21] Y.-J. Li, C.-S. Lin, Y.-B. Lin, and Y.-C. F. Wang, "Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation," in International Conference on Computer Vision, 2019, pp. 7919–7929.
[22] Z. Zhong, L. Zheng, S. Li, and Y. Yang, "Generalizing a person retrieval model hetero- and homogeneously," in European Conference on Computer Vision, 2018, pp. 172–188.
[23] Y. Zou, X. Yang, Z. Yu, B. Kumar, and J. Kautz, "Joint disentangling and adaptation for cross-domain person re-identification," arXiv preprint, vol. arXiv:2007.10315, 2020.
[24] Y. Lin, Y. Wu, C. Yan, M. Xu, and Y. Yang, "Unsupervised person re-identification via cross-camera similarity exploration," IEEE Transactions on Image Processing, vol. 29, pp. 5481–5490, 2020.
[25] J. Wang, X. Zhu, S. Gong, and W. Li, "Transferable joint attribute-identity deep learning for unsupervised person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2275–2284.
[26] S. Lin, H. Li, C.-T. Li, and A. C. Kot, "Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification," arXiv preprint, vol. arXiv:1807.01440, 2018.
[27] L. Qi, L. Wang, J. Huo, L. Zhou, Y. Shi, and Y. Gao, "A novel unsupervised camera-aware domain adaptation framework for person re-identification," in International Conference on Computer Vision, 2019, pp. 8080–8089.
[28] A. Wu, W.-S. Zheng, and J.-H. Lai, "Unsupervised person re-identification by camera-aware similarity consistency learning," in International Conference on Computer Vision, 2019, pp. 6922–6931.
[29] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[30] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[31] H. Tang, Y. Zhao, and H. Lu, "Unsupervised person re-identification with iterative self-supervised domain adaptation," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 1536–1543.
[32] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang, "Invariance matters: Exemplar memory for domain adaptive person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 598–607.
[33] ——, "Learning to adapt invariance in memory for person re-identification," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
[34] D. Wang and S. Zhang, "Unsupervised person re-identification via multi-label classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10981–10990.
[35] Y. Dai, J. Liu, Y. Bai, Z. Tong, and L.-Y. Duan, "Dual-refinement: Joint label and feature refinement for unsupervised domain adaptive person re-identification," arXiv preprint, vol. arXiv:2012.13689, 2020.
[36] R. J. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2013, pp. 160–172.
[37] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering points to identify the clustering structure," ACM SIGMOD Record, vol. 28, no. 2, pp. 49–60, 1999.
[38] F. Yang, K. Li, Z. Zhong, Z. Luo, X. Sun, H. Cheng, X. Guo, F. Huang, R. Ji, and S. Li, "Asymmetric co-teaching for unsupervised cross-domain person re-identification," in AAAI Conference on Artificial Intelligence, 2020, pp. 12597–12604.
[39] J. Yin, J. Qiu, S. Zhang, Z. Ma, and J. Guo, "SSKD: Self-supervised knowledge distillation for cross domain adaptive person re-identification," arXiv preprint, vol. arXiv:2009.05972, 2020.
[40] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann, "Self-paced learning with diversity," Advances in Neural Information Processing Systems, vol. 27, pp. 2078–2086, 2014.
[41] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[42] Z. Zhong, L. Zheng, D. Cao, and S. Li, "Re-ranking person re-identification with k-reciprocal encoding," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1318–1327.
[43] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in International Conference on Computer Vision, 2015, pp. 1116–1124.
[44] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in European Conference on Computer Vision, 2016, pp. 17–35.
[45] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[46] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[48] K. Zhou and T. Xiang, "Torchreid: A library for deep learning person re-identification in Pytorch," arXiv preprint, vol. arXiv:1910.10093, 2019.
[49] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
Supplementary Material
Unsupervised and self-adaptative techniques for cross-domain person re-identification
Gabriel Bertocco, Fernanda Andaló, Member, IEEE,
and Anderson Rocha, Senior Member, IEEE
I. SUPPLEMENTARY MATERIAL

A. Full Comparison with Prior Art

In the main article, we compared our results with recent state-of-the-art methods. In this supplementary material, we present the full set of results, comparing our method with methods proposed before 2018, published in top-tier journals and conferences (Tables I and II). Following the same conclusions discussed in the main article, our method outperforms all works from 2018 and previous years.

LOMO [?], BOW [?], and UMDL [?] are hand-crafted-based methods. They directly compute feature vectors over pixel values without using a neural network. UMDL also learns a shared dictionary to mine meaningful attributes from the target dataset, however, in a much simpler setup than any deep-learning method. They then calculate the distance between query and gallery images. This makes them scalable and fast to deploy. However, since hand-crafted features usually do not describe high-level features from images, these methods fail when used to match the same person from different camera views. The substantial differences caused by changes in illumination, resolution, and pose of the identities bring a high non-linearity to the feature space that is not captured by hand-crafted-based methods. We surpass UMDL by 65.3 and 66.5 percentage points (p.p.) on mAP and Rank-1 when considering the Market → Duke scenario and by 66.0 and 58.4 p.p. considering Duke → Market. This shows the power of deep neural networks, which effectively describe identities in a non-overlapping camera system under different points of view.

All other works published in and after 2018 have been described in the main article. MMFA [?] and TJ-AIDL [?] are methods based on low- and mid-level attribute alignment by leveraging deep convolutional neural networks. Since they do not encourage the networks to be robust to different points of view, their performance is lower than the more recently proposed pseudo-labeling methods (PCB-PAST [?], SSG [?], UDAP [?], AD-Cluster [?], among others) and ensemble-based methods (ACT [?], MMT [?], MEB-Net [?], SSKD [?], ABMT [?]). The same can be observed for PTGAN [?], SPGAN and SPGAN+LMP [?], which are GAN-based methods that aim to transfer images from source to target domain, replicating the camera conditions of the target domain in the labeled source images. However, transferring only camera-level features, such as color, contrast, and resolution, is not enough. People in the source domain might be in different poses and contexts from the ones in the target domain, and then those methods cannot fully describe images on the target domain considering these constraints. In more recent works, researchers have proposed further processing, such as pseudo-labeling (DG-Net++ [?]), pose alignment (PDA-Net [?]), and context alignment (CR-GAN [?]). Our method surpasses all these GAN-based methods by a large margin. Compared to the most powerful of them, DG-Net++, we outperform it by 16.7 and 10.8 p.p. on mAP and Rank-1 in the Duke → Market scenario, and in Market → Duke by 8.8 and 6.1 p.p.

MSMT17 is the most recent and challenging Person ReID benchmark, which is why fewer works consider it for evaluation. PTGAN was the first to consider it, as both were proposed in [?], in 2018. We outperform it by a substantial margin of 31.2 and 52.1 p.p. in mAP and Rank-1 in Duke → MSMT17, and by 30.3 and 52.1 p.p. in the challenging Market → MSMT17 scenario.

SpCL [?] is similar to ours in the sense that it increases cluster reliability during the clustering stage as the training progresses. However, it does not apply any strategy considering diversity as we do by creating diverse triplets considering all cameras comprised in a cluster. Besides, they leverage both source and target domain images in the adaptation stages and enable their model to use the source labeled identities to bring some regularization to the adaptation process. Differently, our method does not use the source domain images after fine-tuning and leverages the adaptation process relying only on target images. As shown in the main article, we outperform them by 1.2 and 4.2 p.p. in the most challenging Market → MSMT17 in mAP and Rank-1, respectively.

B. Full Qualitative Analysis

In this section, we extend the qualitative analysis presented in the main article. We show the successful and failure cases for a query from each camera of the Market1501 and DukeMTMC-ReID datasets.

In Figures 1 and 3, we observe some successful cases with the activation maps for the top-10 closest gallery images to the query. We adapted the implementation from [?] to visualize the activation maps. In both scenarios, we see that our model is able to find fine-grained details in the images, enabling it to correctly match the query to the gallery images. For example, in Figure 1d, the model tends to focus on shoes and parts of the face, while in Figure 1a the focus is on regions depicting hair and pants. We conclude that our method is able to distinguish the semantic parts of the body and soft-biometric attributes, which are vital to Person Re-Identification.
TABLE I
Results on Market1501 → DukeMTMC-ReID and DukeMTMC-ReID → Market1501 adaptation scenarios. We report mAP, Rank-1, Rank-5, and Rank-10, comparing different methods. The best result is shown in bold, the second underlined, and the third in italic. Works with (*) do not pre-train the model on any source dataset before adaptation.
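For reference, the sketch below shows how a Rank-k score can be computed from ranked gallery identities. It is a simplified illustration: standard ReID evaluation additionally discards gallery entries sharing both identity and camera with the query, which we omit here.

    import numpy as np

    def rank_k_accuracy(ranked_gallery_ids, query_ids, k):
        # ranked_gallery_ids: (Q, G) gallery identities sorted by distance
        # per query; query_ids: (Q,). Fraction of queries whose correct
        # identity appears among the top-k retrieved gallery images.
        hits = (ranked_gallery_ids[:, :k] == query_ids[:, None]).any(axis=1)
        return float(hits.mean())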
It is also important to note that the query and the correct matches always come from different cameras, which confirms that our model is able to overcome different camera conditions. Analyzing the errors in Figures 1h and 3f, we see they are mainly caused by similar clothes, but the method is still able to recover at least the closest gallery image (Rank-1 image).
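As an indication of how such activation maps can be produced, the sketch below projects the spatial energy of the backbone's last convolutional feature map back onto the input resolution. This is a generic recipe under stated assumptions (a backbone returning a spatial feature map); the adapted implementation we actually used follows the reference cited above and may differ in detail.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def activation_map(backbone, images):
        # backbone: hypothetical module returning a (N, C, h, w) feature map.
        fmap = backbone(images)
        energy = fmap.pow(2).sum(dim=1, keepdim=True)        # (N, 1, h, w)
        energy = F.interpolate(energy, size=images.shape[-2:],
                               mode="bilinear", align_corners=False)
        flat = energy.flatten(1)
        lo = flat.min(dim=1, keepdim=True).values
        hi = flat.max(dim=1, keepdim=True).values
        return ((flat - lo) / (hi - lo + 1e-12)).view_as(energy)  # in [0, 1]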
Figures 2 and 4 depict the failure cases for one of the backbones, showing the limitations of the method. The failures are mainly related to similar clothes or soft-biometric attributes. Despite the high probability of people wearing similar clothing items, the chance of them being dressed exactly the same way is marginal; however, this is the situation shown in Figure 4b.

Another source of errors is occlusion. Figure 2a is an example where the person has been fully occluded by a car. As there is no person, the method does not have any specific region to focus on, and the gallery images are then almost fully activated. In Figure 4d, the target identity is on a motorcycle together with another person, which led them to be in the same bounding box. In this case, the method erroneously retrieves images with no identity in them (distractor images in the gallery) or images with parts of a bike.

Figure 4a shows an interesting failure case, where the model focuses uniquely on the drawing on the person's shirt in the query image. The method returns gallery images of other identities with similar shirt drawings. Despite the failure, it is interesting to note that our method was able to focus on fine-grained details to find matches, and not activate the whole image or large parts of it.
TABLE II
Results on Market1501 → MSMT17 and DukeMTMC-ReID → MSMT17 adaptation scenarios. We report mAP, Rank-1, Rank-5, and Rank-10, comparing different methods. The best result is shown in bold, the second underlined, and the third in italic. Works with (*) do not pre-train the model on any source dataset before adaptation.