Unsupervised and Self-Adaptative Techniques For Cross-Domain Person Re-Identification
Abstract—Person Re-Identification (ReID) across non-overlapping cameras is a challenging task, and most works in prior art rely on supervised feature learning from a [...]

[...] comprises the primary techniques to find possible people, or groups of people, involved in an event and to, ultimately, propose candidate suspects for further investigation [1].
[...] of samples in an offline manner. We select one sample as an anchor for each camera represented in a cluster and two others as positive and negative examples. As a positive example, we choose a sample from one of the other represented cameras. In contrast, the negative example is a sample from a different cluster but from the same camera as the anchor. Consequently, the greater the number of cameras in a cluster, the more diverse the triplets used to train the model. With this approach, we give more importance to the more reliable clusters, regularize the model, and alleviate the dependency on hyper-parameters by using a single-term and single-hyper-parameter triplet loss function. This technique brings robustness and generalizability to the final model, easing its adaptation to different scenarios.
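To make the camera-guided sampling concrete, the sketch below shows one way the cross-camera triplet rule described above could be implemented. It is a minimal illustration rather than the authors' code: the function names, the input format (a mapping from pseudo-label to (sample, camera) pairs), and the random sampling policy are our assumptions.

```python
import random
from collections import defaultdict

def build_cross_camera_triplets(clusters, m=2):
    """Sketch of the cross-camera triplet rule described above.

    `clusters` maps a pseudo-label to a list of (sample_id, camera_id)
    pairs; `m` is the number of anchors drawn per camera in a cluster.
    Only clusters spanning two or more cameras are expected here.
    """
    triplets = []
    for label, samples in clusters.items():
        by_camera = defaultdict(list)
        for sample_id, cam in samples:
            by_camera[cam].append(sample_id)
        cameras = list(by_camera)
        if len(cameras) < 2:
            continue  # single-camera clusters are filtered out earlier
        for cam in cameras:
            n_anchors = min(m, len(by_camera[cam]))
            for anchor in random.sample(by_camera[cam], n_anchors):
                # positive: same cluster, but a different camera
                pos_cam = random.choice([c for c in cameras if c != cam])
                positive = random.choice(by_camera[pos_cam])
                # negative: different cluster, same camera as the anchor
                negative = sample_same_camera_other_cluster(clusters, label, cam)
                if negative is not None:
                    triplets.append((anchor, positive, negative))
    return triplets

def sample_same_camera_other_cluster(clusters, label, cam):
    """Pick a sample from any other cluster that shares the anchor's camera."""
    candidates = [sid for other, samples in clusters.items() if other != label
                  for sid, c in samples if c == cam]
    return random.choice(candidates) if candidates else None
```

Note how the number of triplets naturally grows with the number of cameras represented in a cluster, which is exactly the diversity argument made in the text.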
Another important observation is that, at different points of the adaptation from a source to a target domain, the model holds different levels of knowledge, as different portions of the target data are considered each time. Thus, we argue that the model has complementary knowledge in different iterations during training. Based on this, we propose a self-ensembling strategy to summarize the knowledge from various iterations into a unique final model.

Finally, based on recent advances in ensemble-based methods for ReID [12], [13], we propose to combine the knowledge acquired by different architectures. Unlike prior work, we avoid complex training stages by simply assembling the results from different architectures only during evaluation time.
To summarize, the contributions of our work are:
• A new approach to creating diverse triplets based on the variety of cameras represented in a cluster. This approach helps the model to be camera-invariant and more robust in generating the same person's features from different perspectives. It also allows us to leverage a single-term and single-hyper-parameter triplet loss function to be optimized.
• A novel self-ensembling fusion method, which enables the final model to summarize the complementary knowledge acquired during training. This method relies upon the knowledge held by the model in different checkpoints of the adaptation process.
• A novel ensemble technique to take advantage of the complementarity of different backbones trained independently. Instead of applying the typical knowledge distilling [14] or co-teaching [15], [16] methods, which add complexity to the training process, we propose using an ensemble-based prediction.
II. RELATED WORK

Several works address Unsupervised Domain Adaptation for Person Re-Identification. They can be roughly divided into three categories: generative, attribute alignment, and label proposing methods.

A. Generative Methods

ReID generative methods aim to synthesize data by translating images from a source to a target domain. Once data from the source dataset is labeled, the translated images on the target context receive the same labels as the corresponding original images. The main idea is to transfer low- and mid-level characteristics from the target domain, such as background, illumination, resolution, and even clothing, to the images in the source domain. These methods create a synthetic dataset of labeled images with the same conditions as the target domain and then adapt the model through supervised training. Some works in this category are SPGAN [17], PTGAN [18], AT-Net [19], CR-GAN [20], PDA-Net [21], and HHL [22]. Besides transferring the characteristics from source to target domain for image-level generation, DG-Net++ [23] also applies label proposing through clustering. The final loss is the aggregation of the GAN-based loss function used to generate images and the classification loss defined for the proposed labels. By doing this, they perform the disentangling and adaptation of the features on the target domain.

CCSE [24] performs camera mining and, using a GAN-based model, generates synthetic data for an identity considering the point of view of every other camera, increasing the number of images available for training. They leverage new clustering criteria to avoid creating massive clusters comprising most of the dataset and potentially having two or more true identities assigned to the same pseudo-label. Finally, they train directly from ImageNet, without considering any specific source domain. In comparison, our solution does not require synthetic images since we explore the cross-camera information inside each cluster using only real images. This leads our method to outperform CCSE under the same training conditions (unsupervised scenario).

B. Attribute Alignment Methods

These methods seek to align common attributes in both domains to ease transferring knowledge from source to target. Such features can be clothing items (backpacks, hats, shoes) and other soft-biometric attributes that might be common to both domains. These works align mid-level features and enable the learning of higher semantic features on the target domain. Works such as TJ-AIDL [25] consider a fixed set of attributes. However, source and target domains can have substantial context differences, leading to potentially different attributes. For example, the source domain could be recorded in an airport and the target domain in a shopping center. To obtain a better generalization, in [26], the authors propose the Multi-task Mid-level Feature Alignment (MMFA) technique to enable the method to learn attributes from both domains and align them for a better generalization on the target domain. Other methods, such as UCDA [27] and CASCL [28], aim to align attributes by considering images from different cameras on the target dataset.

C. Label Proposing Methods

Methods in this category predict possible labels for the unlabeled target domain by leveraging clustering methods (K-means [29] and DBSCAN [30], among others). Once the target data is pseudo-labeled, the next step is to train models to learn discriminative features on the new domain. PUL [7] applies the Curriculum Learning technique to adapt a model learned on a source domain to a target domain. However, as
K-means is used to cluster the features, it is not possible to account for camera variability. As K-means generates only convex clusters, it cannot find more complex cluster structures, hindering the performance. UDAP [8] and ISSDA-ReID [31] utilize DBSCAN as the clustering algorithm along with labeling refinement. SSG [9] also applies DBSCAN to cluster features of the whole, upper, and lower body parts of the identities of interest. The final loss is the sum of the individual triplet losses in each feature space (body part). Similar to our work, they use a source domain to pre-train the model and the target domain for adaptation. However, they do not perform cross-camera mining, cluster filtering, or ensembling. These elements of our solution allow it to outperform SSG in all adaptation scenarios.

ECN [32], ECN-GPP [33], MMCL [34], and Dual-Refinement [35] use a memory bank to store features, which is updated along the training to avoid the direct use of features generated by the model in further iterations. The authors aim to avoid propagating noisy labels to future training steps, contributing to keeping and increasing the discrimination of features during training.

PAST [10] applies HDBSCAN [36] as the clustering method, which is similar to OPTICS [37], the algorithm of choice in our work. However, the memory complexity of OPTICS is $O(n)$, while for HDBSCAN it is $O(n^2)$, making our model more memory-efficient in the clustering stage.
MMT [12], MEB-Net [13], ACT [38], SSKD [39], and ABMT [16] are ensemble-based methods. They consider two or more networks and leverage mutual teaching by sharing one network's outputs with the others, making the whole system more discriminative on the target domain. However, training models in a mutual-teaching regime adds complexity in memory and to the general training process. Besides that, noisy labels can be propagated to the other ensemble models, hindering the training process. Nonetheless, ensemble-based learning provides the best performance among state-of-the-art methods. We propose using ensembles only during inference, eliminating the complexity added to the training while still taking advantage of the complementary knowledge between the models.
Our work is also based on Curriculum Learning with Diversity [40], a schema whereby the model starts learning with easier examples, i.e., samples that are correctly classified with a high score early in training. However, in a multi-class problem, one of the classes might have more examples correctly classified early on, making it easier than the other classes. Therefore, in Curriculum Learning with Diversity, the method selects the most confident samples (easier samples) from the easier classes, including some examples from the harder ones. In this way, it enables the model to learn in an easy-to-hard manner, avoiding local minima and allowing better generalization.
Even though recent work achieves competitive performances, there are some limitations that we aim to address in our work. First, generative methods bring complexity by considering GANs to translate images from one domain to the other. Second, attribute alignment methods only tackle the alignment of low- and mid-level features. Third, methods in both categories need images from the source and target domains during adaptation. Finally, label proposing methods consider mutual-learning or co-teaching, which brings complexity to the training stage.

Similarly, we assume to have only camera-related information, i.e., we know from which camera (viewpoint) an image was taken. In all steps, we use pseudo-identity information exclusively given by the clustering algorithm, without relying on any ground-truth information. We differ from the prior art by using a new diversity learning scheme, generating triplets based on each cluster's diversity of points of view. As we train the whole model, the method also learns high-level features on the target domain. We simplify the training process by considering one backbone at a time, without mutual information exchange during adaptation. Finally, we apply model ensembling only for inference, after the training process.

III. PROPOSED METHOD

Our approach to Person ReID comprises two phases: training and inference. Figure 1 depicts the training process, while Table I shows the variables used in this work.

TABLE I: Variables' meaning in this work.

  Variable | Meaning
  ---------|------------------------------------------------------
  n_b      | Number of different backbones in the ensemble
  M        | Model backbone
  K1       | Number of iterations of the blue flow in Figure 1
  K2       | Number of iterations of the orange flow in Figure 1
  c_i      | i-th cluster in the feature space
  n_i      | Number of cameras in cluster c_i
  cam_j    | j-th camera in a cluster
  x_i^s    | i-th image in the source domain
  x_i^t    | i-th image in the target domain
  y_i^s    | Label of the i-th image in the source domain
  N_s      | Number of images in the source domain
  N_t      | Number of images in the target domain
  m        | Number of anchors per camera in a cluster
  α        | Margin parameter of the triplet loss
  B        | Batch of triplets in an iteration

During training, we independently optimize n_b different backbones to adapt the model to the target domain. This phase is divided into five main stages that are performed iteratively: feature extraction from all data; clustering; cluster selection; cross-camera triplet creation and fine-tuning; and feature extraction from pseudo-labeled data.

After training, we perform the proposed self-ensembling phase to summarize the training parameters into a single final model, based on the weighted average of the model parameters from each different checkpoint. We perform this step for each backbone independently and, in the end, we have n_b self-ensembled models.

During inference, for a query/gallery image pair, we calculate the distance between them considering the feature vectors extracted by each of the n_b models. Hence, for each query/gallery pair, we have n_b distances, one for each of the trained models. We then apply our last ensemble technique: the n_b distances are averaged to obtain a final distance. Finally, based on this final distance, we take the label of the closest gallery image as the query label.
Fig. 1. Overview of the training phase. We assume to have camera-related information, i.e., we know the camera used to acquire each image, and we do not rely on any ground-truth label information about the identities on the target domain. The pipeline has two flows: the blue flow is executed K1 times, and the orange flow is executed K2 times. Both flows share the steps in green. In Stage 1, we initially extract feature vectors for each training image in the target domain using model M, and cluster them with the OPTICS algorithm in Stage 2 to propose pseudo-labels. Afterward, we perform cluster selection in Stage 3, removing outliers and clusters with only one camera. Then, triplets are created based on each cluster's diversity in Stage 4a and used to train the model in Stage 4b. These steps form the blue flow, in which Clustering and Cluster Selection are performed. Instead of going back to Stage 1, the method then follows the orange flow: in Stage 5, we extract feature vectors of the samples selected in Stage 3, and the process continues to Stages 4a and 4b again. The blue flow marks an iteration, while the orange flow is called an epoch. Therefore, in each iteration, we have K2 epochs.
A. Training Stages 1 and 2: Feature Extraction from All Data and Clustering

Let $D_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ be a labeled dataset representing the source domain, formed by $N_s$ images $x_i^s$ and their respective identity labels $y_i^s$; and let $D_t = \{x_i^t\}_{i=1}^{N_t}$ be an unlabeled dataset representing the target domain, formed by $N_t$ images $x_i^t$. Before applying the proposed pipeline, we first train a model $M$ in a supervised way with source dataset $D_s$ and its labels. After training, assuming source dataset $D_s$ is no longer available, we perform transfer learning, updating $M$ to the target domain while only considering samples from the unlabeled target dataset $D_t$.
With model $M$ trained on $D_s$, we first extract all feature vectors from the images in $D_t$ and create a new set of feature vectors $\{M(x_i^t)\}_{i=1}^{N_t}$. We remove possible duplicated feature vectors by checking whether one is a repetition of another, which might be caused by duplicate images in the target data. The remaining feature vectors are L2-normalized to embed them into a unit hypersphere. The normalized feature vectors are clustered using the OPTICS algorithm to obtain pseudo-labels.
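A minimal sketch of this stage is given below, assuming a PyTorch backbone and a DataLoader that yields only image tensors; the function name and the duplicate-removal shortcut (torch.unique, which does not preserve the image-to-feature mapping kept by a real pipeline) are our simplifications, not the authors' code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_normalized_features(model, loader, device="cuda"):
    """Extract one embedding per target image and L2-normalize it onto
    the unit hypersphere, as done before clustering (Stage 1)."""
    model.eval()
    feats = []
    for images in loader:  # DataLoader over the unlabeled target images
        f = model(images.to(device))
        feats.append(F.normalize(f, p=2, dim=1).cpu())
    feats = torch.cat(feats)
    # Drop exact duplicate vectors, which may come from duplicated target
    # images. Note: torch.unique sorts rows; a real pipeline would keep
    # one representative per duplicate and preserve the index mapping.
    feats = torch.unique(feats, dim=0)
    return feats
```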
The OPTICS algorithm [37] leverages the principle of dense neighborhoods, similarly to DBSCAN [30]. DBSCAN defines the neighborhood of a sample as being formed by its closest feature vectors, with distances lower than a predefined threshold. Clusters are created based on these neighborhoods, and samples not assigned to any cluster are considered outliers. If the threshold changes, other clusters are discovered: current clusters can be split or combined to create new ones. In other words, if we change the threshold, other clusters might appear, creating a different label proposal for the samples. However, clusters that emerge from real labels often have different distributions and densities, indicating that a generally fixed threshold might not be sufficient to detect them. In this sense, OPTICS relaxes DBSCAN by ordering the feature vectors in a manifold based on the distances between them, which allows the construction of a reachability plot. Probable clusters with different densities are revealed as valleys in this plot and can be detected by their steepness. With this formulation, we are more likely to propose labels closer to the real label distribution on the target data.
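For reference, a minimal way to run this clustering step with scikit-learn's OPTICS implementation is sketched below. The steepness threshold ξ corresponds to the xi parameter; min_samples is an assumption of ours, as the paper only reports ξ = 0.05.

```python
import numpy as np
from sklearn.cluster import OPTICS

def propose_pseudo_labels(features, xi=0.05, min_samples=4):
    """Cluster L2-normalized features with OPTICS (xi cluster method).

    Returns one pseudo-label per sample; -1 marks outliers detected
    from the reachability plot.
    """
    optics = OPTICS(min_samples=min_samples, xi=xi,
                    metric="euclidean", cluster_method="xi")
    return optics.fit_predict(np.asarray(features))
```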
B. Training Stage 3: Cluster Selection

After the first and second stages, feature vectors are either assigned to a cluster or considered outliers. As people can be captured by one or more cameras in a ReID system, the produced clusters are naturally formed by samples acquired by different devices. We hypothesize that clusters with samples obtained by two or more cameras are more reliable than clusters with only one camera.

If an identity is well described by model M, its feature vectors should be close in the feature space regardless of the camera. Therefore, clusters with only one camera might be created due to a bias toward a particular device or viewpoint, and different identities captured by the same camera can be [...]
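The cluster selection rule just described admits a very small implementation, sketched below under the assumption that labels come from the OPTICS step above and that the camera of each sample is known (the function name is ours).

```python
import numpy as np

def select_reliable_clusters(labels, cameras):
    """Keep only clusters whose samples come from >= 2 cameras;
    outliers (label -1) are always discarded."""
    labels = np.asarray(labels)
    cameras = np.asarray(cameras)
    keep = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        if c == -1:
            continue  # outliers proposed by OPTICS
        members = labels == c
        if len(np.unique(cameras[members])) >= 2:
            keep |= members
    return keep  # boolean mask over the target samples
```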
where $B$ is a batch of triplets, $x_a$ is the anchor, $x_p$ is the positive sample, and $x_n$ is the negative one; $\alpha$ is the margin, which is set to 0.3, and $[\cdot]_+$ is the $\max(0, \cdot)$ function. This is illustrated in Figure 1, Stage 4b.
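Equation 1 itself falls outside this excerpt, but the definitions above describe the standard triplet loss; a sketch consistent with them follows (the averaging over the batch is our assumption).

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.3):
    """Single-term triplet loss implied by the surrounding text:
    [d(a, p) - d(a, n) + alpha]_+ aggregated over the batch B."""
    d_ap = F.pairwise_distance(f_a, f_p)  # anchor-positive distance
    d_an = F.pairwise_distance(f_a, f_n)  # anchor-negative distance
    return torch.clamp(d_ap - d_an + alpha, min=0).mean()
```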
D. Stage 5: Feature Extraction from Pseudo-Labeled Samples

This stage is part of the orange flow performed after fine-tuning (Stage 4b). The main idea is to keep the pseudo-labeled clusters from Stage 3 while recreating a new set of triplets based on the new distances between samples after the model update in Stage 4b, bringing more diversity to the training phase. To do so, we extract feature vectors only for the samples of the pseudo-labeled clusters selected in Stage 3. The orange flow is performed K2 times, and a complete cycle defines an epoch. The blue flow is performed K1 times, and a complete cycle defines an iteration. Therefore, in each iteration, we have K2 epochs. This concludes the training phase.
Unlike the five best state-of-the-art methods proposed in the prior art (DG-Net++, MEB-Net, Dual-Refinement, SSKD, and ABMT), our solution is trained with a single-term loss, which contains only one hyper-parameter. Even the weight decay has been removed, as the proposed method can already calibrate the gradient to avoid overfitting, as we show in Section IV. Moreover, prior work performs clustering in the training phase through k-reciprocal encoding [42], which is a more robust distance metric than the Euclidean distance. However, it has a higher computational footprint, as it is necessary to check the neighborhood of each sample whenever distances are calculated. For training simplicity, we opt for the standard Euclidean distance to cluster the feature vectors. However, as k-reciprocal encoding gives the model higher discrimination, we adopt it during inference time. Therefore, differently from previous works, we calculate the k-reciprocal encoding only once, during inference.

[...] data from the target domain is considered in an iteration, it means that the model is more confident and can therefore have more discrimination power on the target domain. Hence, $p_i$ is equal to the percentage of reliable target data in the i-th iteration. Consequently, a model that takes more data from the target to train will have a higher weight $p_i$. Self-ensembling is illustrated in Figure 3. Note that we directly deal with the model's learned parameters and create a new one by averaging the weights.

We end up with a single model containing a combination of knowledge from different adaptation moments, which significantly boosts performance, as shown in Section V.

Fig. 3. Self-ensembling scheme after training. Different amounts of the target data (with no label information whatsoever) are used to fine-tune the model during the adaptation process. Different models created along the adaptation can be complementary. We create a new final model by weight-averaging the models' parameters from different iterations. Weight $p_i$ is based on the amount of reliable data from the target domain on the i-th iteration. We end up with a single model encoding knowledge from different moments of the adaptation.
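A minimal sketch of this parameter-level fusion follows, assuming the checkpoints are PyTorch state_dicts and the weights are the reliability fractions p_i; the function name is ours, and integer buffers (e.g., BatchNorm counters) would need special handling in practice.

```python
import copy
import torch

def self_ensemble(checkpoints, reliabilities):
    """Weight-average model parameters from different checkpoints.

    `checkpoints`: list of state_dicts saved along the adaptation.
    `reliabilities`: fraction p_i of reliable target data at each
    checkpoint, used as the (normalized) averaging weight.
    """
    weights = torch.tensor(reliabilities, dtype=torch.float64)
    weights = weights / weights.sum()
    fused = copy.deepcopy(checkpoints[0])
    for key in fused:
        fused[key] = sum(
            w * ckpt[key].double() for w, ckpt in zip(weights, checkpoints)
        ).to(checkpoints[0][key].dtype)
    return fused  # load with model.load_state_dict(fused)
```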
[...]

$$ d_{final}(q, g_i) = \frac{1}{K} \sum_{k=1}^{K} d\big(f_k(q), f_k(g_i)\big), \qquad (3) $$
where $K$ is the number of models in the ensemble. In this way, we can incorporate knowledge from different models, encoded as the distance between two feature vectors. After obtaining the distances between query $q$ and all images in the gallery, we take the label of the closest gallery image as the query label. We consider an equal contribution from each backbone: without labels on the target domain, it is impossible to evaluate the impact of the individual models and give them proportional weights in the combination.
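Equation 3 translates directly into a few lines of code; the sketch below assumes the per-backbone embeddings are already L2-normalized, and uses the plain Euclidean distance (the paper additionally applies k-reciprocal re-ranking at inference, which is omitted here).

```python
import torch

def ensemble_distance(query_feats, gallery_feats):
    """Equation 3: average the query-gallery distance matrices produced
    by the K backbones. `query_feats[k]` and `gallery_feats[k]` are the
    embeddings from backbone k, shaped (Q, D_k) and (G, D_k)."""
    K = len(query_feats)
    dist = sum(torch.cdist(query_feats[k], gallery_feats[k])
               for k in range(K)) / K
    return dist  # the closest gallery entry gives the predicted label
```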
IV. EXPERIMENTS AND RESULTS

This section presents the datasets we adopt in this work and compares the proposed method with the prior art through a comprehensive set of experiments considering different, and challenging, source/target domains.
A. Datasets

To validate our pipeline, we use three large-scale benchmark datasets from the Re-ID literature:

• Market1501 [43]: It has 12,936 images of 751 identities in the training set and 19,732 images in the testing set. The testing set is further divided into 3,368 images for the query set and 15,913 images for the gallery set. Following previous work, we removed the "junk" images from the gallery set, so 451 images are discarded. This dataset has a total of six non-overlapping cameras, and each identity is captured by at least two cameras.

• DukeMTMC-ReID [44]: It has 16,522 images of 702 identities in the training set and 19,889 images in the testing set. The testing set is divided into 2,228 query images and 17,661 gallery images of another 702 identities. The dataset has a total of eight cameras, and each identity is captured by at least two cameras.

• MSMT17 [18]: It is the most challenging ReID dataset in the prior art. It comprises 32,621 images of 1,401 identities in the training set and 93,820 images of 3,060 identities in the testing set. The testing set is divided into 11,659 images for the query set and 82,161 images for the gallery set. It comprises 15 cameras recorded in three day periods (morning, noon, and afternoon) on four different days. Of the 15 cameras, 12 are outdoor cameras and three are indoor cameras. Each identity is captured by at least two cameras.
As done in previous work in the literature, we remove from the gallery the images with the same identity and camera as the query, to assess the model performance in a cross-camera matching. Feature vectors are L2-normalized before calculating distances. For evaluation, we calculate the Cumulative Matching Curve (CMC), from which we report Rank-1 (R1), Rank-5 (R5), and Rank-10 (R10), as well as the mean Average Precision (mAP).
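For completeness, a minimal re-implementation of this evaluation protocol is sketched below (not the authors' evaluation code): for each query, gallery entries sharing both its identity and its camera are removed before computing CMC ranks and average precision.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, q_cams, g_cams, ranks=(1, 5, 10)):
    """Minimal CMC / mAP computation under the protocol above."""
    cmc = np.zeros(max(ranks))
    aps = []
    for i in range(len(q_ids)):
        order = np.argsort(dist[i])
        # drop gallery images with the query's identity AND camera
        valid = ~((g_ids[order] == q_ids[i]) & (g_cams[order] == q_cams[i]))
        matches = (g_ids[order][valid] == q_ids[i]).astype(float)
        if not matches.any():
            continue
        first = int(np.argmax(matches))      # earliest correct match
        cmc[first:] += 1
        precision = np.cumsum(matches) / (np.arange(len(matches)) + 1)
        aps.append((precision * matches).sum() / matches.sum())
    n = len(aps)
    results = {f"R{r}": cmc[r - 1] / n for r in ranks}
    results["mAP"] = float(np.mean(aps))
    return results
```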
B. Implementation Details

In terms of deep-learning architectures, we adopt ResNet50 [45], OSNet [4], and DenseNet121 [46], i.e., n_b = 3, all of them pre-trained on ImageNet [47]. To test them in an adaptation scenario, we choose one of the datasets as the source and another as the target domain. We train the backbone over the source domain and the adaptation pipeline over the target domain. We consider Market1501 and DukeMTMC-ReID as source domains, leaving MSMT17 only as a target dataset (the hardest one in the prior art). This way, we have four possible adaptation scenarios: Market → Duke, Duke → Market, Market → MSMT17, and Duke → MSMT17. We keep those scenarios (without MSMT17 as a source) to have a fair comparison with state-of-the-art methods. Besides, the most challenging scenario is MSMT17 as the target dataset: we train backbones on simpler datasets (Market and Duke) and adapt their knowledge to a harder dataset, with almost double the number of cameras and many more identities, recorded at different moments of the day and the year. This enables us to test the generalization of our method in adaptation scenarios where source and target domains have substantial differences in the number of identities, camera recording conditions, and environment.

We used the code available at [48] to train OSNet and at [13] to train ResNet50 and DenseNet121 over the source domains. Our source code is based on PyTorch [49] and is freely available at https://github.com/Gabrielcb/Unsupervised_selfAdaptative_ReID.

After training, we remove the last classification layer from all backbones and use the last layer's output as our feature embedding. We trained our pipeline using the three backbones independently in all adaptation scenarios. Considering the flows depicted in Figure 1, we perform K1 = 50 cycles of the blue flow (50 iterations) and, in each one, we perform K2 = 5 cycles of the orange flow (5 epochs). We consider Adam [50] as the network optimizer and set the learning rate to 0.0001 in the first 30 iterations. After the 30th iteration, we divide it by ten and keep it unchanged until reaching the maximum number of iterations. As we show in our experiments, we can set the weight decay to zero since our proposed cross-camera triplet creation can regularize the model without extra hyper-parameters. The triplet batch size is set to 30; batches with 30 triplets are used to update the model in each epoch. The margin in Equation 1 is set to 0.3, and the number of anchors is set to m = 2. We resize the images to 256 × 128 × 3 and apply Random Flipping and Random Erasing as data augmentation strategies during training.
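The reported configuration maps onto standard PyTorch components roughly as follows; this is an assumed recreation of the setup, and the exact code may differ from the authors' repository.

```python
import torch
import torchvision
import torchvision.transforms as T

model = torchvision.models.resnet50(pretrained=True)  # one of the three backbones
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0)
# the learning rate is divided by ten after iteration 30
# (call scheduler.step() once per iteration)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30], gamma=0.1)

train_transform = T.Compose([
    T.Resize((256, 128)),       # images resized to 256 x 128 x 3
    T.RandomHorizontalFlip(),   # Random Flipping
    T.ToTensor(),
    T.RandomErasing(),          # Random Erasing
])
```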
C. Comparison with the Prior Art

Tables II and III show results comparing the proposed method to the state of the art. The proposed method outperforms the other methods regarding mAP and Rank-1 in Market → Duke by improving those values by 1.8 and 1.7 percentage points (p.p.), respectively, and without re-ranking. In the Duke → Market scenario, we obtain a solid competitive performance, with a value only 0.1 p.p. lower in Rank-1, also without re-ranking.

In turn, ABMT applies k-reciprocal encoding during training, which is more robust than the Euclidean distance. However, it is more expensive to calculate, as it is necessary to search for the k-reciprocal neighbors of each feature vector in each iteration of the algorithm before clustering. In our case, we only apply the standard Euclidean distance during training, reducing the training time and the complexity of the adaptation while still obtaining performance gains. Moreover, we have a single-term and single-hyper-parameter loss function, while ABMT depends on a loss with three terms and more hyper-parameters. They apply a teacher-student strategy to their training, while we perform ensembling only for inference. Therefore, with a more direct pipeline and ensemble prediction, the proposed method has a Rank-1 only 0.1 p.p. lower in Duke → Market, while outperforming all methods in all other adaptation scenarios. However, to benefit from the k-reciprocal encoding while keeping a simpler training process, we also apply it during inference. In this case, the proposed method outperforms the methods in the prior art regarding mAP and Rank-1 in all adaptation scenarios.
Compared to SSKD in the Duke → Market scenario, we are below it by 0.3 and 0.4 p.p. in Rank-5 and Rank-10, respectively. Considering the closest actual gallery match to the query (R1), our ensemble retrieves more correct matches, as Table II shows, with our method outperforming SSKD by 1.2 p.p. in Rank-1 without re-ranking. Even with fewer hyper-parameters than SSKD and a more straightforward training process (no co-teaching, a simpler loss function, and late ensembling), our method shows competitive results considering the training-complexity trade-off.
Interestingly, the proposed method performs better under more difficult adaptation scenarios. We measure the difficulty of a scenario based on the number of different cameras it comprises. Market, Duke, and MSMT17 have 6, 8, and 15 cameras, respectively. Hence, the most challenging adaptation scenario is from Market to MSMT17: we adapt a model from a simpler scenario (6 cameras, all videos recorded in the same day period and the same season of the year) to a more complex target domain (15 cameras, 12 outdoors and 3 indoors, recorded at 3 different day periods, morning, noon, and afternoon, on 4 different days, each day in a different season of the year). Market → MSMT17 is the most challenging adaptation and the closest to real-world conditions, where we might have people recorded throughout the day and in different locations (indoors and outdoors). In this case, as shown in Table III, we obtained the highest performance even without re-ranking techniques. The proposed method outperforms the state of the art by 1.5 and 2.1 p.p. in mAP and Rank-1, respectively, on Duke → MSMT17, and by 2.2 and 4.2 p.p. on the most challenging scenario, Market → MSMT17.
There are several reasons why our method performs well. We explicitly design a model to deal with the diversity of cameras and viewpoints by creating a set of triplets based on the different cameras in a cluster. We also keep a more straightforward training, with only one hyper-parameter in our loss function (the triplet loss margin). Most works in the ReID literature optimize a loss function with many terms and hyper-parameters. They usually consider the Duke → Market or the Market → Duke scenarios (or both of them) to perform grid-searching over hyper-parameter values. Once they find the best values, they keep them unchanged for all adaptation setups.

In ABMT [16], the authors do not provide a clear explanation of how they define the hyper-parameter values for their loss function. However, they perform an ablation study over the Duke → Market and Market → Duke scenarios, so their results might be biased to those specific setups, which gives them one of the best performances. However, when they keep the same values for different and more challenging scenarios, such as Market → MSMT17 or Duke → MSMT17, they obtain worse results than ours by a large margin. This shows that our method provides a better generalization capability, brought by a simpler loss function and more diverse training. It prevents us from choosing specific hyper-parameter values and being biased to a specific adaptation setup. Consequently, we achieve the best performances, especially in the most challenging scenarios.

D. Discussion

As we aim to re-identify people in a camera system in an unsupervised way, we must be robust to hyper-parameters that require adjustments based on grid-searching using true label information, keeping the training process (and the adaptation to a target domain) as simple as possible. If a pipeline is complex and too sensitive to hyper-parameters, it might be challenging to train and deploy it in a real investigation scenario, where we do not have prior knowledge about the people of interest. This complexity leads to sub-optimal performance, as has already been pointed out in [55]. The authors claim that most works rely on many hyper-parameters during the adaptation stage, which can help or hinder the performance, depending on the value assigned to them and on which adaptation scenario is considered.

SSKD [39] is an ensemble-based method leveraging three deep models in a co-teaching training regime with a four-term loss function with three hyper-parameters. One of the terms of their final loss function is a multi-similarity loss [56], with three extra hyper-parameters to train the model.

MEB-Net has a complex training process, relying on a co-training technique with three deep neural networks in which each one learns with the others. Each of these three networks has its separate loss function with six terms, and their overall loss function is a weighted average of the individual loss functions from each model in the ensemble.

ABMT also leverages a teacher-student model where the teacher and student networks share the same architecture, increasing time and memory complexity during training. Moreover, they utilize a three-term loss function to optimize both models, with three hyper-parameters controlling the contribution of each term to the final loss. They update the teacher weights based on the exponential moving average (EMA) of the student weights, in order to avoid error-label amplification during training. This also adds another parameter, controlling the inertia of the teacher weights' EMA. The authors do not perform an ablation study regarding the hyper-parameter value variation to assess their impact on the final performance.
TABLE II: Results on the Market1501 → DukeMTMC-ReID and DukeMTMC-ReID → Market1501 adaptation scenarios. We report mAP, Rank-1, Rank-5, and Rank-10, comparing to several state-of-the-art methods. The best result is shown in bold, the second underlined, and the third in italic. Works with (*) do not pre-train the model on any source dataset before adaptation.
Based on these observations, our proposed model better captures the diversity of real cases by considering a loss function with a single term that is less sensitive to hyper-parameters (only the margin α needs to be selected). In such setups, it is difficult to select hyper-parameter values correctly, as we might not know any information about the identities on the target domain. The self-ensembling also summarizes the whole training into a single model by using each checkpoint's confidence values over the target data, without using any hyper-parameter or human-defined value. Even adopting a more straightforward formulation, we still obtain state-of-the-art performance on the Market → Duke scenario and competitive performance on the Duke → Market scenario. Each architecture in our work is trained in parallel without any co-teaching strategy. After self-ensembling, the joint contribution from different backbones is applied only at evaluation time, avoiding label propagation of noisy examples (e.g., potential outliers) while still taking advantage of the complementarity between them.

Our assumptions are the same as those of recent prior art [11], [33], [24]. We assume to know from which camera an image of a person was recorded, but not the identity. We rely on camera information to filter out clusters with elements captured by only one camera and to create the cross-camera triplets.

We also assume that at least two cameras have captured most identities and that all cameras have non-overlapping vantage points. All prior art holds this assumption, as defined by the datasets and the train/test split division.

Finally, we assume that training on a source domain related to Person Re-Identification gives the model basic knowledge to adapt to the target domain. This knowledge enables the model to propose better initial clusters in early iterations, grouping feature vectors from the same identity recorded from different cameras. The pipeline starts the adaptation with more reliable pseudo-labels in the clustering step and progressively creates more clusters representing more identities on the target domain. All works in Table II that do not have the (*) after their name hold this assumption.

Section V shows that our pipeline still performs well even without pre-training on a source dataset. In other words, we take the backbone trained over ImageNet and directly apply it without any previous ReID-related knowledge. Even in this setup, we can achieve competitive performance.

E. Qualitative Analysis

We now provide a qualitative analysis by highlighting regions of the top-10 gallery images returned for a given query image. The redder the color of a region, the more important it is to the ranking. As explained in Section IV-A, the correct matches always come from cameras different from the query's camera. The green contour denotes a true positive, the red contour a false positive, and the blue color the query image. We present successful cases (when the first gallery image is a true positive) and failure cases (when the first gallery image is a false positive) for each camera on the Market1501 and DukeMTMC-ReID datasets. We show two successful cases and two failure cases (one for each dataset) in Figures 4 and 5, considering ResNet50 as the backbone. For visualizations for all cameras of both datasets, please refer to the Supplementary Material. MSMT17 was not considered, as the dataset agreement does not allow the reproduction of the images in any format.

Figures 4a and 5a depict two successful cases in the Market → Duke and Duke → Market scenarios, respectively. In both cases, we see that our model finds fine-grained details in the image, leading to a correct match. As an example, Figure 4a shows the model focusing on the red jacket, even in a different pose and under occlusion (7th and 10th images from left to right). Figure 5a shows that the model can overcome pose changes of the query in a cross-view setup.
TABLE III: Results on the Market1501 → MSMT17 and DukeMTMC-ReID → MSMT17 adaptation scenarios. We report mAP, Rank-1, Rank-5, and Rank-10, comparing to several state-of-the-art methods. The best result is shown in bold, the second underlined, and the third in italic. Works with (*) do not pre-train the model on any source dataset before adaptation.
Fig. 4. The most activated regions in the gallery images given a query on DukeMTMC-ReID (Market → Duke scenario) for ResNet50. (a) Successful match; (b) Failure case.
Fig. 5. The most activated regions in the gallery images given a query on Market1501 (Duke → Market scenario) for ResNet50. (a) Successful match; (b) Failure case.
The query only shows the person's back, but the closest image is a true match showing the person from the front. The same happens in the second closest image, where the identity has its back recorded by another camera, and in the fourth and fifth closest images, where only the right side is captured. The third closest image not only records a different position from the query but also has a different resolution. This shows that the model effectively overcomes identity pose changes and resolution differences on cross-view cameras.
Figures 4b and 5b depict failure cases that show the limitations of the method. Errors happen when there is no person in the image; see Figure 4b, in which the person has been fully occluded by the car. In this case, the method does not have any specific region to focus on, and the gallery images are then almost fully activated. Another failure case happens when the identity is on a motorcycle (Figure 5b) along with another identity, which led to mismatches where there is no identity (distractor images in the gallery) or to images with parts of a bike. In the Supplementary Material, we provide more successful and failure cases from other cameras for both datasets.

F. Results on an Unsupervised Scenario

This section explores the possibilities of our method when not performing any pre-training on a source domain. Here, the method starts directly with backbones trained over ImageNet. This is a harder case, as we eliminate the possibility of having prior knowledge of the person re-identification problem. It requires the backbones to adapt themselves to the target without relying on any identity-related annotation coming from the source domain. Table II shows the results denoted by "Ours (w/o Re-Ranking)*". In this case, we keep ξ = 0.05 when Duke is the target, as in previous results, and ξ = 0.03 when Market is the target. The value ξ = 0.05 was too strict, leading to clusters with images from only one camera for the Market dataset. Section V presents a deeper analysis of different choices of ξ in the clustering process.

However, when we consider Duke as the target domain, the model without source pre-training is the third best. We lose 3.8 and 2.6 p.p. to the equivalent pre-trained model in mAP and Rank-1, respectively, and we lose 2.0 and 0.9 p.p. compared to ABMT, while outperforming all other methods. This shows that, although our model is not completely robust to the backbone initialization, it is still capable of mining discriminative
features, even without pre-training, providing comparable or better results when compared to the state of the art.

The proposed method outperforms all others under the same conditions (no pre-training, denoted with a star in Table II). The difference to the best one (CycAs) is 2.9 and 4.7 p.p. in mAP and Rank-1 when Market is the target, and 8.7 and 4.5 p.p. in mAP and Rank-1 when Duke is the target.

We conclude that previous training on a ReID source-related dataset is important for better performance on the task. However, when no ReID source domain is available, our method can still provide competitive results, mainly in the more challenging scenario (Duke as target).
V. ABLATION STUDY

This section shows the contribution of each part of the pipeline to the final result. In each experiment, we change one of the parts and keep the others unchanged. If not explicitly mentioned, we consider ResNet50 as the backbone, OPTICS with hyper-parameter ξ = 0.05, and self-ensembling applied after training.

A. Impact of the Clustering Hyper-parameter

Although we have only one hyper-parameter in the loss function, we still need to set the hyper-parameter ξ of the OPTICS clustering algorithm, a threshold in the range [0, 1]. The closer ξ is to 1, the stricter the criterion to define a cluster; that is, we might have many samples not assigned to any cluster, which leads to several detected outliers (if ξ = 1, all feature vectors are detected as outliers). In contrast, the closer ξ is to 0, the more relaxed the criterion, and more samples are assigned to clusters (if ξ = 0, all feature vectors are grouped into a single cluster). In Figure 6, we show the impact of the threshold ξ for the Market → Duke and Duke → Market scenarios.

Fig. 6. Impact of clustering hyper-parameter ξ. Results on (a) Market → Duke, and (b) Duke → Market.

The best value for ξ changes according to the adaptation scenario. This is expected when dealing with different unseen target domains. In both cases, the Rank-1, Rank-5, and Rank-10 curves are more stable than the mAP curve, showing that the parameter does not impact the retrieval of true positive images. The best Rank-1 values are obtained for ξ between 0.04 and 0.08 considering both scenarios and, in the more challenging one (Market → Duke), it achieves the second-best value when ξ = 0.05, for both mAP and Rank-1. Although the best performance is achieved when ξ = 0.07 (best mAP and Rank-1), it relies on an unstable point in the setup of Duke → Market, and it is only marginally better than ξ = 0.05 for Market → Duke. Rank-5 and Rank-10 tend to be more stable in both cases. Thus, we adopt ξ = 0.05 in all scenarios.

B. Impact of Curriculum Learning

In our pipeline, Stage 3 is responsible for cluster selection. After running the clustering algorithm, a feature vector can be an outlier, assigned to a cluster with only one camera, or assigned to a cluster with two or more cameras. We argue that feature-space cleaning is essential for better adaptation, and that feature vectors in a cluster with at least two cameras are more reliable than ones assigned as outliers or to a cluster with a single camera. We then consider the curriculum learning principle to select the most confident samples and learn in an easy-to-hard manner. To achieve this, we remove the outliers and the clusters with only one camera. To check the impact of this removal, we performed four experiments in which we alternate between keeping the outliers and the clusters with only one camera. The results are summarized in Table IV.

We observe a performance gain on most metrics, especially on mAP and Rank-1, when we apply our cluster selection strategy. If we keep the outliers in the feature space (first and third rows in Table IV), we face the most significant performance drop in both adaptation scenarios. This shows the importance of removing outliers after the clustering stage; otherwise, they can be considered in the creation of triplets, increasing the number of false negatives (for instance, selecting negative samples of the same real class) and, consequently, hindering the performance. We see a lower performance drop by keeping clusters with only one camera but without outliers (second row), indicating that those clusters do not hinder the performance much, but might contain noisy samples for model updating. This is more evident when we verify that the largest gains were in mAP and the smallest in Rank-1 in the last row. It demonstrates that if we keep one-camera clusters, the model can still retrieve most of the gallery's correct images but with lower confidence. Hence, the cluster selection criterion effectively improves our model generalization, and we apply it in all adaptation scenarios.

With this strategy, we observe that the percentage of feature vectors from the target domain kept in the feature space increases during the adaptation, as shown in Figures 7c and 8c. In fact, reliability, mAP, and Rank-1 increase during training (Figures 7 and 8), which means that the model becomes more robust in the target domain as more iterations are performed. This demonstrates the importance of curriculum learning, where easier examples at the beginning of the training (images whose feature vectors are assigned to clusters with at least two cameras in early iterations) are used to give initial knowledge about the unseen target domain and allow the model to increase its performance gradually.

As a direct consequence, the number of clusters with only one camera removed from the feature space decreases, as shown in Figure 9. This means that the model learns to group cross-view images in the same cluster. For the Market → Duke scenario, the initial percentage of removed clusters is higher than on Duke → Market.
TABLE IV: Impact of curriculum learning, when considering different cluster selection criteria. We tested our method with and without outliers and with and without clusters with only one camera in the feature space. All experiments consider ResNet50 as the backbone, with self-ensembling applied after training.

  Outliers kept | One-camera clusters kept | Duke → Market (mAP / R1 / R5 / R10) | Market → Duke (mAP / R1 / R5 / R10)
  ✓             |                          | 50.9 / 79.2 / 89.5 / 92.8           | 32.7 / 56.7 / 68.5 / 72.9
                | ✓                        | 72.4 / 89.5 / 95.2 / 96.7           | 66.8 / 81.1 / 90.2 / 92.4
  ✓             | ✓                        | 49.1 / 79.8 / 89.5 / 92.6           | 32.7 / 57.2 / 68.4 / 72.3
                |                          | 74.1 / 89.6 / 95.3 / 97.1           | 67.8 / 81.7 / 90.0 / 92.6
C. Impact of Self-Ensembling

To check the contribution of our proposed self-ensembling method, explained in Section III-E, we take the best checkpoint of our model during adaptation in both scenarios, considering all backbones, and compare it with the self-ensembled model. Note that we select the best model only for reference. In practice, we do not know the best checkpoint during training, since we do not have any identity-label information. Our goal here is merely to show that our self-ensembling method leads to a final model that outperforms any checkpoint individually. Even if we do not have any label information to choose the best one during training, the self-ensembling can summarize the whole training process in a final model, which is better than all checkpoints. Table V shows these results.

Our proposed self-ensembling method can improve the discriminative power over the target domain by summarizing the whole training during adaptation. The method outperforms the best models in mAP by 2.0, 4.5, and 4.3 p.p. on Duke → Market, for ResNet50, OSNet, and DenseNet121, respectively. Similarly, for Market → Duke, we achieve an improvement of 1.6, 2.2, and 3.3 p.p. in mAP for ResNet50, OSNet, and DenseNet121, respectively. We can also observe gains for all backbones in both scenarios considering Rank-1. Therefore, our proposed self-ensembling strategy increases the number of correct examples retrieved from the gallery and their confidence. It shows that different checkpoints, trained with different percentages of the data from the target domain, have complementary information. Besides, as the self-ensembling is performed at the parameter level, without human supervision and considering each checkpoint's confidence, it reduces the memory footprint by eliminating all unnecessary checkpoints and keeping only the self-ensembled final model.
D. Impact of Ensemble-Based Prediction

To increase discrimination ability, we combine the distances computed by all considered architectures (Equation 3) for the final inference. Results are shown in Table VI.

The ensembled model outperforms the individual models by 3.3, 5.2, and 0.9 p.p. regarding Rank-1, on Duke → Market, for ResNet50, OSNet, and DenseNet121, respectively. The same can be observed for Market → Duke, in which Rank-1 is improved by 3.3, 2.9, and 1.6 p.p. for ResNet50, OSNet, and DenseNet121, respectively. Results for all the other metrics also increase for both adaptation scenarios. Therefore, we can effectively combine knowledge encoded in models with different architectures. By performing this only for inference, we keep a simpler training process and can still take advantage of the ensembled knowledge from different backbones.
E. Processing Footprint

To measure the processing footprint of our pipeline (training and inference), we consider two representative adaptation scenarios: Market → Duke and Market → MSMT17. As explained, the first setup represents a mildly difficult case, and the second is the most challenging one. Table VII shows the time measurements.

The overall time to execute the pipeline and the whole training in the Market → Duke scenario is smaller than in Market → MSMT17, as expected, given that the latter is a more complex setup. As the number of training images is higher, the number of proposed clusters is also higher on MSMT17. This leads to an increase in clustering, filtering, and overall training times.

OSNet is the backbone that takes the least time in both adaptation setups because of its feature embedding size. For ResNet50 and DenseNet121, the embeddings have 2,048 dimensions, while OSNet's have 512. This allows a faster clustering, as Table VII shows. Considering the same adaptation scenario, the clustering step is the most affected by the backbone and its respective embedding size. This is why ResNet50 and DenseNet121 present more similar training times and OSNet is the fastest one.

The inference time is calculated assuming that all gallery feature vectors have been extracted and stored. It is the average time to predict the label of one query based on the ranking of the gallery images, following the protocol presented in Section IV-B. The difference between both adaptation scenarios is due to the gallery size. As explained in Section IV-A, MSMT17 has a gallery more than 4× bigger than Duke's.

For all experiments, we used two GTX 1080 Ti GPUs. One of them is used exclusively for clustering, with an implementation based on [58], and the other for pipeline training, for each backbone.
TABLE V: Impact of self-ensembling. We consider a weighted average of the parameters of the backbone at different moments of the adaptation. "Best" refers to results obtained with the checkpoint with the highest Rank-1 during adaptation. "Fusion" is the final model created through the proposed self-ensembling method. The best results are in bold.
TABLE VI: Impact of ensemble-based prediction. Performance with and without model ensembling during inference. Best values are in bold.
TABLE VII: Time evaluation. We report each time in hh:mm:ss for training and in milliseconds (ms) for inference. For training, we analyze the time taken to cluster and filter (Stages 2 and 3), one round of fine-tuning (Stage 4b), one epoch (the time taken to perform K2 iterations of the orange flow), and the whole pipeline training. For inference, we report the time to predict the identity of a query image given the gallery feature vectors.

              Market → Duke                                                  | Market → MSMT17
              Clustering+filtering  Finetuning  Epoch     Whole training  Inference | Clustering+filtering  Finetuning  Epoch     Whole training  Inference
  ResNet      00:03:55              00:08:55    00:13:34  11:31:19        5 ms      | 00:16:45              00:09:36    00:28:08  23:00:55        13 ms
  OSNet       00:01:53              00:08:56    00:11:14  09:33:04        4 ms      | 00:07:41              00:12:20    00:20:59  17:49:40        11 ms
  DenseNet    00:04:06              00:08:33    00:13:36  11:33:14        4 ms      | 00:16:46              00:11:27    00:31:13  26:32:08        13 ms
  Ensemble    -                     -           -         -               6 ms      | -                     -           -         -               22 ms
VI. CONCLUSIONS AND FUTURE WORK

In this work, we tackle the problem of cross-domain Person Re-Identification (ReID) with non-overlapping cameras, especially targeting forensic scenarios with fast deployment requirements. We propose an Unsupervised Domain Adaptation (UDA) pipeline with three novel techniques: (1) cross-camera triplet creation, aiming at increasing diversity during training; (2) self-ensembling, to summarize complementary information acquired at different iterations during training; and (3) an ensemble-based prediction technique to take advantage of the complementarity between different trained backbones.

Our cross-camera triplet creation technique increases the model's invariance to different points of view and types of cameras in the target domain, and increases the regularization of the model, allowing the use of a single-term, single-hyper-parameter triplet loss function. Moreover, we showed the importance of having this more straightforward loss function: it is less biased towards specific scenarios and helps us achieve state-of-the-art results in the most complex adaptation setups, surpassing prior art by a large margin in most cases.

The self-ensembling technique helps us increase the final performance by aggregating information from different checkpoints throughout the training process, without human or label supervision. This is inspired by the reliability measurement, which shows that our models learn from more reliable data as more iterations are performed. Furthermore, this process is done in an easy-to-hard manner to increase model confidence gradually.

Finally, our last ensemble technique takes advantage of the complementarity between different backbones, enabling us to achieve state-of-the-art results without adding complexity to the training, differently from the mutual-learning strategies used in current methods [13], [39], [16]. It is important to note that both ensembling strategies are applied after training to generate a final model and a final prediction.

Because the training process is more straightforward than that of other state-of-the-art methods and does not need information on the target domain's identities, our work is easily extendable to other adaptation scenarios and deployable in actual investigations and other forensic contexts.

A key aspect of our method, also shared with other recent methods in the literature [28], [11], [33], is that it requires information about the camera used to acquire each sample. That is, we suppose we know, a priori, the device that captured each image. This information does not need to be the specific type of camera, but it requires, at least, information about different camera models. Without this information, our model could face suboptimal performance, as it would not be able to take advantage of the diversity introduced by the cross-camera triplets. To address this drawback, we aim to extend this work by incorporating techniques for automatic camera attribution [59], [60], allowing the identification of the camera used to acquire an image or identifying whether the same camera acquired a pair of images.

Regarding the clustering process, our method requires that all selected samples be considered during this phase, which demands pairwise distance calculations between all feature vectors. Therefore, this approach may introduce higher processing times to the pipeline. In this sense, we also aim to extend our method to scale to very large datasets by introducing online deep clustering and self-supervised techniques directly into the pipeline.

Another possible extension of our pipeline is its application to general object re-identification, such as vehicle ReID, to mine critical objects of interest in an investigation. Together with Person ReID, this could ultimately enable a joint analysis by matching mined identities and objects to propose relations between them during an event's analysis.

ACKNOWLEDGMENT

We thank the financial support of the São Paulo Research Foundation (FAPESP) through the grants DéjàVu #2017/12646-3 and #2019/15825-1.

REFERENCES

[1] R. Padilha, C. M. Rodrigues, F. A. Andaló, G. Bertocco, Z. Dias, and A. Rocha, "Forensic event analysis: From seemingly unrelated data to understanding," IEEE Security and Privacy, vol. 18, no. 6, pp. 23–32, 2020.
[2] X. Qian, Y. Fu, Y.-G. Jiang, T. Xiang, and X. Xue, "Multi-scale deep learning architectures for person re-identification," in International Conference on Computer Vision, 2017, pp. 5399–5408.
[3] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang, "Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)," in European Conference on Computer Vision, 2018, pp. 480–496.
[4] K. Zhou, Y. Yang, A. Cavallaro, and T. Xiang, "Omni-scale feature learning for person re-identification," in International Conference on Computer Vision, 2019, pp. 3702–3712.
[5] X. Chen, C. Fu, Y. Zhao, F. Zheng, J. Song, R. Ji, and Y. Yang, "Salience-guided cascaded suppression network for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 3300–3310.
[6] C. Liu, X. Chang, and Y.-D. Shen, "Unity style transfer for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6887–6896.
[7] H. Fan, L. Zheng, C. Yan, and Y. Yang, "Unsupervised person re-identification: Clustering and fine-tuning," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 4, pp. 1–18, 2018.
[8] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang, "Unsupervised domain adaptive re-identification: Theory and practice," Pattern Recognition, vol. 102, p. 107173, 2020.
[9] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang, "Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification," in International Conference on Computer Vision, 2019, pp. 6112–6121.
[10] X. Zhang, J. Cao, C. Shen, and M. You, "Self-training with progressive augmentation for unsupervised cross-domain person re-identification," in International Conference on Computer Vision, 2019, pp. 8222–8231.
[11] Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian, "AD-Cluster: Augmented discriminative clustering for domain adaptive person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9021–9030.
[12] Y. Ge, D. Chen, and H. Li, "Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification," arXiv preprint, vol. arXiv:2001.01526, 2020.
[13] Y. Zhai, Q. Ye, S. Lu, M. Jia, R. Ji, and Y. Tian, "Multiple expert brainstorming for domain adaptive person re-identification," arXiv preprint, vol. arXiv:2007.01546, 2020.
[14] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," arXiv preprint, vol. arXiv:1503.02531, 2015.
[15] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, "Co-teaching: Robust training of deep neural networks with extremely noisy labels," in Advances in Neural Information Processing Systems, 2018, pp. 8527–8537.
[16] H. Chen, B. Lagadec, and F. Bremond, "Enhancing diversity in teacher-student networks via asymmetric branches for unsupervised person re-identification," in IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1–10.
[17] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 994–1003.
[18] L. Wei, S. Zhang, W. Gao, and Q. Tian, "Person transfer GAN to bridge domain gap for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 79–88.
[19] J. Liu, Z.-J. Zha, D. Chen, R. Hong, and M. Wang, "Adaptive transfer network for cross-domain person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7202–7211.
[20] Y. Chen, X. Zhu, and S. Gong, "Instance-guided context rendering for cross-domain person re-identification," in International Conference on Computer Vision, 2019, pp. 232–242.
[21] Y.-J. Li, C.-S. Lin, Y.-B. Lin, and Y.-C. F. Wang, "Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation," in International Conference on Computer Vision, 2019, pp. 7919–7929.
[22] Z. Zhong, L. Zheng, S. Li, and Y. Yang, "Generalizing a person retrieval model hetero- and homogeneously," in European Conference on Computer Vision, 2018, pp. 172–188.
[23] Y. Zou, X. Yang, Z. Yu, B. Kumar, and J. Kautz, "Joint disentangling and adaptation for cross-domain person re-identification," arXiv preprint, vol. arXiv:2007.10315, 2020.
[24] Y. Lin, Y. Wu, C. Yan, M. Xu, and Y. Yang, "Unsupervised person re-identification via cross-camera similarity exploration," IEEE Transactions on Image Processing, vol. 29, pp. 5481–5490, 2020.
[25] J. Wang, X. Zhu, S. Gong, and W. Li, "Transferable joint attribute-identity deep learning for unsupervised person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2275–2284.
[26] S. Lin, H. Li, C.-T. Li, and A. C. Kot, "Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification," arXiv preprint, vol. arXiv:1807.01440, 2018.
[27] L. Qi, L. Wang, J. Huo, L. Zhou, Y. Shi, and Y. Gao, "A novel unsupervised camera-aware domain adaptation framework for person re-identification," in International Conference on Computer Vision, 2019, pp. 8080–8089.
[28] A. Wu, W.-S. Zheng, and J.-H. Lai, "Unsupervised person re-identification by camera-aware similarity consistency learning," in International Conference on Computer Vision, 2019, pp. 6922–6931.
[29] S. Lloyd, "Least squares quantization in PCM," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 129–137, 1982.
[30] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231.
[31] H. Tang, Y. Zhao, and H. Lu, "Unsupervised person re-identification with iterative self-supervised domain adaptation," in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 1536–1543.
[32] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang, "Invariance matters: Exemplar memory for domain adaptive person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 598–607.
[33] ——, "Learning to adapt invariance in memory for person re-identification," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
[34] D. Wang and S. Zhang, "Unsupervised person re-identification via multi-label classification," in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10981–10990.
[35] Y. Dai, J. Liu, Y. Bai, Z. Tong, and L.-Y. Duan, "Dual-refinement: Joint label and feature refinement for unsupervised domain adaptive person re-identification," arXiv preprint, vol. arXiv:2012.13689, 2020.
[36] R. J. Campello, D. Moulavi, and J. Sander, "Density-based clustering based on hierarchical density estimates," in Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2013, pp. 160–172.
[37] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering points to identify the clustering structure," ACM SIGMOD Record, vol. 28, no. 2, pp. 49–60, 1999.
[38] F. Yang, K. Li, Z. Zhong, Z. Luo, X. Sun, H. Cheng, X. Guo, F. Huang, R. Ji, and S. Li, "Asymmetric co-teaching for unsupervised cross-domain person re-identification," in AAAI Conference on Artificial Intelligence, 2020, pp. 12597–12604.
[39] J. Yin, J. Qiu, S. Zhang, Z. Ma, and J. Guo, "SSKD: Self-supervised knowledge distillation for cross domain adaptive person re-identification," arXiv preprint, vol. arXiv:2009.05972, 2020.
[40] L. Jiang, D. Meng, S.-I. Yu, Z. Lan, S. Shan, and A. Hauptmann, "Self-paced learning with diversity," Advances in Neural Information Processing Systems, vol. 27, pp. 2078–2086, 2014.
[41] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[42] Z. Zhong, L. Zheng, D. Cao, and S. Li, "Re-ranking person re-identification with k-reciprocal encoding," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1318–1327.
[43] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in International Conference on Computer Vision, 2015, pp. 1116–1124.
[44] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, "Performance measures and a data set for multi-target, multi-camera tracking," in European Conference on Computer Vision, 2016, pp. 17–35.
[45] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[46] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.
[47] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[48] K. Zhou and T. Xiang, "Torchreid: A library for deep learning person re-identification in Pytorch," arXiv preprint, vol. arXiv:1910.10093, 2019.
[49] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
Supplementary Material
Unsupervised and self-adaptative techniques for cross-domain person re-identification
Gabriel Bertocco, Fernanda Andaló, Member, IEEE,
and Anderson Rocha, Senior Member, IEEE
I. SUPPLEMENTARY MATERIAL

A. Full Comparison with Prior Art

In the main article, we compared our results with recent state-of-the-art methods. In this supplementary material, we present the full set of results, comparing our method with methods proposed before 2018, published in top-tier journals and conferences (Tables I and II). Following the same conclusions discussed in the main article, our method outperforms all works from 2018 and previous years.

LOMO [?], BOW [?], and UMDL [?] are hand-crafted-based methods. They directly compute feature vectors over pixel values without using a neural network. UMDL also learns a shared dictionary to mine meaningful attributes from the target dataset, however, in a much simpler setup than any deep-learning method. They then calculate the distance between query and gallery images. This makes them scalable and fast to deploy. However, since hand-crafted features usually do not describe high-level features from images, these methods fail when used to match the same person from different camera views. The substantial differences caused by changes in illumination, resolution, and pose of the identities bring a high non-linearity to the feature space that is not captured by hand-crafted-based methods. We surpass UMDL by 65.3 and 66.5 percentage points (p.p.) on mAP and Rank-1 when considering the Market → Duke scenario and by 66.0 and 58.4 p.p. considering Duke → Market. This shows the power of deep neural networks, which effectively describe identities in a non-overlapping camera system under different points of view.

All other works published in and after 2018 have been described in the main article. MMFA [?] and TJ-AIDL [?] are methods based on low- and mid-level attribute alignment by leveraging deep convolutional neural networks. Since they do not encourage the networks to be robust to different points of view, their performance is lower than the more recently proposed pseudo-labeling methods (PCB-PAST [?], SSG [?], UDAP [?], AD-Cluster [?], among others) and ensemble-based methods (ACT [?], MMT [?], MEB-Net [?], SSKD [?], ABMT [?]). The same can be observed for PTGAN [?], SPGAN and SPGAN+LMP [?], which are GAN-based methods that aim to transfer images from source to target domain, replicating the camera conditions of the target domain in the labeled source images. However, transferring only camera-level features, such as color, contrast, and resolution, is not enough. People in the source domain might be in different poses and contexts from the ones in the target domain, and then those methods cannot fully describe images on the target domain considering these constraints. In more recent works, researchers have proposed further processing, such as pseudo-labeling (DG-Net++ [?]), pose alignment (PDA-Net [?]), and context alignment (CR-GAN [?]). Our method surpasses all these GAN-based methods by a large margin. Compared to the most powerful of them, DG-Net++, we outperform it by 16.7 and 10.8 p.p. on mAP and Rank-1 in the Duke → Market scenario, and in Market → Duke by 8.8 and 6.1 p.p.

MSMT17 is the most recent and challenging Person ReID benchmark, which is why fewer works consider it for evaluation. PTGAN was the first to consider it, as both were proposed in [?], in 2018. We outperform it by a substantial margin of 31.2 and 52.1 p.p. in mAP and Rank-1 in Duke → MSMT17, and by 30.3 and 52.1 p.p. in the challenging Market → MSMT17 scenario.

SpCL [?] is similar to ours in the sense that it increases cluster reliability during the clustering stage as the training progresses. However, it does not apply any strategy considering diversity as we do by creating diverse triplets considering all cameras comprised in a cluster. Besides, they leverage both source and target domain images in the adaptation stages and enable their model to use the source labeled identities to bring some regularization to the adaptation process. Differently, our method does not use the source domain images after fine-tuning and leverages the adaptation process relying only on target images. As shown in the main article, we outperform them by 1.2 and 4.2 p.p. in the most challenging Market → MSMT17 in mAP and Rank-1, respectively.

B. Full Qualitative Analysis

In this section, we extend the qualitative analysis presented in the main article. We show the successful and failure cases for a query from each camera of the Market1501 and DukeMTMC-ReID datasets.

In Figures 1 and 3, we observe some successful cases with the activation maps for the top-10 closest gallery images to the query. We adapted the implementation from [?] to visualize the activation maps. In both scenarios, we see that our model is able to find fine-grained details in the images, enabling it to correctly match the query to the gallery images. For example, in Figure 1d, the model tends to focus on shoes and parts of the face, while in Figure 1a the focus is on regions depicting hair and pants. We conclude that our method is able to distinguish the semantic parts of the body and soft-biometric attributes, which are vital to Person Re-Identification.
TABLE I
Results on Market1501 → DukeMTMC-ReID and DukeMTMC-ReID → Market1501 adaptation scenarios. We report mAP, Rank-1, Rank-5, and Rank-10, comparing different methods. The best result is shown in bold, the second underlined, and the third in italic. Works with (*) do not pre-train the model on any source dataset before adaptation.
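For reference, the sketch below shows how a Rank-k score can be computed from ranked gallery identities. It is a simplified illustration: standard ReID evaluation additionally discards gallery entries sharing both identity and camera with the query, which we omit here.

    import numpy as np

    def rank_k_accuracy(ranked_gallery_ids, query_ids, k):
        # ranked_gallery_ids: (Q, G) gallery identities sorted by distance
        # per query; query_ids: (Q,). Fraction of queries whose correct
        # identity appears among the top-k retrieved gallery images.
        hits = (ranked_gallery_ids[:, :k] == query_ids[:, None]).any(axis=1)
        return float(hits.mean())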
It is also important to note that the query and the correct matches always come from different cameras, which confirms that our model is able to overcome different camera conditions. Analyzing the errors in Figures 1h and 3f, we see they are mainly caused by similar clothes, but the method is still able to recover at least the closest gallery image (Rank-1 image).
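As an indication of how such activation maps can be produced, the sketch below projects the spatial energy of the backbone's last convolutional feature map back onto the input resolution. This is a generic recipe under stated assumptions (a backbone returning a spatial feature map); the adapted implementation we actually used follows the reference cited above and may differ in detail.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def activation_map(backbone, images):
        # backbone: hypothetical module returning a (N, C, h, w) feature map.
        fmap = backbone(images)
        energy = fmap.pow(2).sum(dim=1, keepdim=True)        # (N, 1, h, w)
        energy = F.interpolate(energy, size=images.shape[-2:],
                               mode="bilinear", align_corners=False)
        flat = energy.flatten(1)
        lo = flat.min(dim=1, keepdim=True).values
        hi = flat.max(dim=1, keepdim=True).values
        return ((flat - lo) / (hi - lo + 1e-12)).view_as(energy)  # in [0, 1]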
Figures 2 and 4 depict the failure cases for one of the backbones, showing the limitations of the method. The failures are mainly related to similar clothes or soft-biometric attributes. Despite the high probability of people wearing similar clothing items, the chance of them being dressed exactly the same way is marginal; however, this is the situation shown in Figure 4b.

Another source of errors is occlusion. Figure 2a is an example where the person has been fully occluded by a car. As there is no person, the method does not have any specific region to focus on, and the gallery images are then almost fully activated. In Figure 4d, the target identity is on a motorcycle together with another person, which led them to be in the same bounding box. In this case, the method erroneously retrieves images with no identity in them (distractor images in the gallery) or images with parts of a bike.

Figure 4a shows an interesting failure case, where the model focuses uniquely on the drawing on the person's shirt in the query image. The method returns gallery images of other identities with similar shirt drawings. Despite the failure, it is interesting to note that our method was able to focus on fine-grained details to find matches, and not activate the whole image or large parts of it.
TABLE II
Results on Market1501 → MSMT17 and DukeMTMC-ReID → MSMT17 adaptation scenarios. We report mAP, Rank-1, Rank-5, and Rank-10, comparing different methods. The best result is shown in bold, the second underlined, and the third in italic. Works with (*) do not pre-train the model on any source dataset before adaptation.