
Guarding Barlow Twins Against Overfitting with Mixed Samples

Wele Gedara Chaminda Bandara¹, Celso M. De Melo², and Vishal M. Patel¹

¹Johns Hopkins University   ²US Army DEVCOM Research Laboratory
https://github.com/wgcban/mix-bt.git

arXiv:2312.02151v1 [cs.CV] 4 Dec 2023

Abstract

Self-supervised Learning (SSL) aims to learn transferable feature representations for downstream applications without relying on labeled data. The Barlow Twins algorithm, renowned for its widespread adoption and straightforward implementation compared to its counterparts like contrastive learning methods, minimizes feature redundancy while maximizing invariance to common corruptions. Optimizing for the above objective forces the network to learn useful representations, while avoiding noisy or constant features, resulting in improved downstream task performance with limited adaptation. Despite Barlow Twins' proven effectiveness in pre-training, the underlying SSL objective can inadvertently cause feature overfitting due to the lack of strong interaction between the samples, unlike the contrastive learning approaches. From our experiments, we observe that optimizing for the Barlow Twins objective doesn't necessarily guarantee sustained improvements in representation quality beyond a certain pre-training phase, and can potentially degrade downstream performance on some datasets. To address this challenge, we introduce Mixed Barlow Twins, which aims to improve sample interaction during Barlow Twins training via linearly interpolated samples. This results in an additional regularization term to the original Barlow Twins objective, assuming linear interpolation in the input space translates to linearly interpolated features in the feature space. Pre-training with this regularization effectively mitigates feature overfitting and further enhances the downstream performance on CIFAR-10, CIFAR-100, TinyImageNet, STL-10, and ImageNet datasets. The code and checkpoints are available at: https://github.com/wgcban/mix-bt.git

Figure 1. Assessing the representation quality via k-NN accuracy on the test set during SSL training on the train set, for information maximization-based Barlow Twins [74] vs. contrastive learning-based SimCLR [14], on the CIFAR-10 [42] dataset.

1. Introduction

Self-Supervised Learning (SSL) has experienced remarkable advancements in recent years [4, 12, 16, 27, 31, 34, 52, 60, 76], consistently outperforming supervised learning across numerous downstream tasks [77]. Among the various SSL techniques, joint embedding architectures have garnered significant attention [3]. In these architectures, two networks are trained to generate similar embeddings for different perspectives of the same image. A prominent example is the Siamese network architecture [10, 16]. However, Siamese network architectures are susceptible to "representation collapse." This occurs when the network disregards its inputs, leading to the generation of identical or irrelevant feature representations [46, 47]. To address this issue, prior research has adopted two approaches: contrastive learning and information maximization.

Contrastive learning [10, 14, 19, 20, 34, 35, 61, 65] often entails substantial computational demands, necessitating large batch sizes [14] or the utilization of memory banks [34]. These methods employ a loss function designed to explicitly encourage the convergence of embeddings for similar images (positive samples) while pushing apart embeddings for dissimilar images (negative samples). However, recent trends in SSL have shifted towards non-contrastive methods due to their simplicity of implementation and their ability to learn high-quality representations. Notable examples of these non-contrastive SSL methods include BYOL [31] and SimSiam [16]. These methods employ various techniques such as batch-wise normalization, feature-wise normalization, "momentum encoders" [31, 58], and stop-gradient in one of the branches.
More recently, methods focused on preventing feature collapse through information maximization (InfoMax) have exhibited promising outcomes [6, 7, 28, 64, 74]. These approaches aim to decorrelate every pair of variables within embedding vectors, thereby indirectly maximizing the information content of these vectors. The Barlow Twins [74] achieves this objective by aligning the cross-correlation matrix of the two embeddings with the identity matrix. Meanwhile, Whitening-MSE [28] whitens and spreads out the embedding vectors on the unit sphere. VICReg [6] and its variant VICRegL [7] introduce two regularization terms to maintain the variance of each embedding dimension above a threshold and decorrelate each pair of variables.

In this work, we revisit Barlow Twins [74], one of the most widely adopted methods in SSL, and shed light on a critical aspect that demands attention. Our investigation reveals that the Barlow Twins, despite its notable strengths, is susceptible to overfitting, particularly when the dimension of the embeddings experiences substantial growth. Our experiments conducted on small to medium-scale datasets, including CIFAR-10 [42], CIFAR-100 [42], TinyImageNet [43], and STL-10 [22], demonstrate that the k-NN evaluation [26] performance saturates or even deteriorates during SSL training as the embedding dimensionality increases, as depicted in Figure 1. However, this is not the case with contrastive learning methods like SimCLR [14]. This phenomenon indicates a tendency towards overfitting to the training data in Barlow Twins, with the model potentially memorizing specific instances and excessively focusing on feature representations [64]. This, in turn, adversely affects the generalization of features for downstream applications. By evaluating the learned features with k-NN, we can directly assess the quality of the learned representations without adapting them to a specific downstream task. These experiments conducted on datasets of varying sizes enable us to monitor the performance throughout the SSL process, leading to these critical insights.

Having identified the overfitting phenomenon in Barlow Twins, we explore various techniques for mitigating it. Notably, we discovered that MixUp regularization [75], a technique commonly employed in supervised learning, proves effective. This technique involves the linear mixing of two samples in the input space, and we formulate a regularization loss by establishing a relationship between the cross-correlation matrices of the mixed embeddings and the unmixed embeddings, under the assumption of the same linear interpolation in the embedding space. To the best of our knowledge, this marks the first utilization of MixUp regularization in InfoMax-based SSL, as existing SSL augmentations are typically applied on a per-sample basis. Our experiments conclusively demonstrate that the proposed MixUp-based regularization acts as a safeguard against overfitting in Barlow Twins. Furthermore, it leads to substantial improvements in performance on downstream applications, highlighting its capacity to facilitate the learning of high-quality features that significantly benefit a range of downstream tasks. In summary, this paper makes the following contributions:
• We revisit the Barlow Twins algorithm and identify its susceptibility to overfitting.
• Through experiments on various datasets, we highlight the overfitting phenomenon when the embedding dimensionality increases.
• We counteract feature overfitting in Barlow Twins with mixed sample interaction, named Mixed Barlow Twins.
• Our experiments demonstrate that Mixed Barlow Twins improves the performance on downstream tasks over the Barlow Twins and other state-of-the-art (SOTA) methods.

2. Related Work

Contrastive Learning for SSL. Contrastive learning methods, often employed in joint embedding architectures [2, 5, 8], aim to bring the output embeddings of two views of a sample closer to each other, while pushing other samples and their distortions farther apart. This is typically achieved through the use of the InfoNCE loss [54]. Such methods are commonly implemented using a Siamese network architecture, where the weights of the two branches are shared [10, 14, 15, 18, 32, 34, 35, 52, 54, 65, 71]. While contrastive learning techniques have demonstrated excellent performance, they come with a notable drawback: the requirement for a substantial number of contrastive sample pairs [13], which in turn necessitates significant memory and extended pre-training times. To overcome this limitation, some recent approaches, such as the one utilized in MoCo [17, 34], have proposed sampling these pairs from a memory bank as an alternative to the approach employed in SimCLR [14]. The latter method, which samples pairs from the current minibatch, results in higher memory consumption. The resource-intensive nature of contrastive learning for SSL has prompted researchers to explore alternative approaches, such as clustering (see supplementary material), distillation, and information maximization-based methods.

Clustering for SSL. Clustering-based methods aim to extract useful representations by grouping data samples based on a similarity measure [1, 9, 11, 12, 37, 68–70, 78]. For instance, DeepCluster [12] employs k-means clustering of representations from previous iterations as pseudo-labels for new representations, though this approach entails an expensive clustering phase performed asynchronously, making it challenging to scale up. Furthermore, clustering approaches can be seen as a form of contrastive learning at the cluster level, which still necessitates a substantial number of negative comparisons to perform effectively.
Distillation for SSL. In contrast to using negative samples to prevent representation collapse, distillation-based approaches such as BYOL [58], SimSiam [16], and OBoW [30] employ architectural techniques inspired by knowledge distillation. These methods predominantly rely on a student-teacher network framework, in which the teacher network's weights are either a moving average of the student network's weights [31] or are shared with the student network but with gradient updates stopped at the teacher network [16]. Although these methods can learn valuable representations, there is no clear evidence of how they prevent representation collapse, unlike contrastive learning approaches. Alternatively, in methods like OBoW [30], images can be represented as bags of words using a dictionary of visual features, which can be obtained through offline or online clustering.

Information Maximization for SSL. These methods aim to prevent information collapse by maximizing the information in the embedding space. Key works in this direction include Barlow Twins [74], Whitening-MSE [28], VICReg [6], and VICRegL [7]. In Barlow Twins, the loss term endeavors to make the normalized cross-correlation matrix of the embedding vectors from the two branches close to the identity. In Whitening-MSE, an additional module transforms the embeddings into the eigenspace of their covariance matrix, ensuring the obtained vectors are uniformly distributed on the unit sphere. VICReg extends the feature decorrelation concept from Barlow Twins by introducing two regularization terms: one to maintain the variance of each embedding dimension above a threshold and another to decorrelate each pair of variables. VICRegL [7] extends this idea further by proposing to learn both local and global features simultaneously. These InfoMax-based methods aim to produce embedding variables that are decorrelated, thus preventing informational collapse, as all variables are normalized over a batch, eliminating the incentive for them to shrink or expand. However, based on our experiments, we observe that while the aforementioned InfoMax-based methods avoid feature collapse with batch normalization, they tend to overfit or excessively focus on feature representations during pre-training, particularly on small and medium-scale datasets. This differs from contrastive learning methods and distillation-based methods, potentially leading to degraded downstream performance with extended pre-training. This might be attributed to the lack of actual interaction between samples, in contrast to the contrastive learning approaches. In order to overcome the above issue, we introduce another regularization on top of the Barlow Twins by introducing mixed samples into the SSL process, where we assume that linearly interpolated images will result in linearly interpolated embeddings. This simple trick can potentially avoid the feature memorization issue of InfoMax-based methods, particularly with Barlow Twins.

Figure 2. Schematic of the proposed Mixed Barlow Twins. (a) Original Barlow Twins algorithm [74]. (b) Proposed MixUp regularization technique to prevent Barlow Twins from overfitting and to further enhance the representation quality.

Mutual Sample Augmentations for SSL. As previously mentioned, many SSL frameworks generate multiple views of each image during training (common in contrastive and InfoMax methods), and aim to teach the model that the embeddings of views from the same image should be as similar as possible. These views are often created using augmentations, which lead the model to become invariant to specific augmentations [56, 67, 74]. Typically, these augmentations are applied on a per-sample basis. In contrast, supervised learning has demonstrated the effectiveness and generalization capabilities of augmentations that involve multiple samples [57], such as MixUp [75] and CutMix [73]. However, one primary reason for their limited adoption in SSL is the challenge of incorporating them into self-supervised loss formulations. There are very few works that have attempted to use augmentations involving multiple images for SSL, such as MixCo [40], i-Mix [44], Un-Mix [59], Hard Negative Mixing [39], and MNN [55]. Most of these methods are introduced for contrastive learning or clustering-based approaches. In contrast, our proposed Mixed Barlow Twins formulation can be seen as integrating mutual sample augmentations into the InfoMax-based SSL learning pipeline, and it has shown significant improvements in low- and medium-data regimes.

3. Proposed Method

3.1. Overview of Barlow Twins

Barlow Twins [74] adopts a Siamese network architecture consisting of two identical branches, as shown in Figure 2. Each branch processes a distinct view of the same input image X, denoted as Y^A and Y^B. These views are generated by applying a series of random augmentations T to the original sample X.
Augmentations include operations like random cropping, rotation, and color perturbations. Subsequently, Y^A and Y^B are forwarded through the encoder and the projector to obtain normalized embeddings Z^A and Z^B (centered along the batch dimension). These embeddings are then used to compute the Barlow Twins objective L_BT based on the cross-correlation matrix C computed between Z^A and Z^B along the batch dimension as follows:

\mathcal{C}_{ij} \triangleq \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b (z^A_{b,i})^2}\,\sqrt{\sum_b (z^B_{b,j})^2}},   (1)

where b indexes batch samples, and i and j index the vector dimension of the embeddings. The Barlow Twins objective is defined on C and consists of two terms:
1. Invariance Term: The first term aims to equate the diagonal elements of C to 1. This enforces invariance in the embeddings, making them resilient to the applied distortions. Mathematically, it is expressed as \sum_i (1 - \mathcal{C}_{ii})^2.
2. Redundancy Reduction Term: The second term of the objective strives to equate the off-diagonal elements of C to 0. This operation decorrelates different vector components of the embedding, reducing redundancy. The term is represented as \lambda_{BT} \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2, where \lambda_{BT} is a hyperparameter that controls the balance between the invariance loss and the redundancy reduction loss.
The combination of these terms in the Barlow Twins objective ensures that the embeddings are both invariant to distortions and exhibit reduced redundancy, resulting in highly informative and diverse feature representations.
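The objective above translates almost directly into a few lines of PyTorch. The following is a minimal sketch, assuming (N, D) projector outputs for the two views; the function name barlow_twins_loss and the default value of lambda_bt are illustrative choices rather than the released implementation, and Algorithm 1 in the supplementary material gives the full Mixed Barlow Twins pseudocode.

import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambda_bt: float = 0.0078125):
    """z_a, z_b: (N, D) projector outputs for the two views of a batch."""
    n, d = z_a.shape
    # normalize each embedding dimension along the batch (zero mean, unit std)
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    # cross-correlation matrix C of Eq. (1), size D x D
    c = (z_a.T @ z_b) / n
    # invariance term: pull the diagonal of C towards 1
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    # redundancy reduction term: push the off-diagonal entries towards 0
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambda_bt * off_diag

Note that every operation acts on the D x D matrix C; individual samples enter only through the batch statistics, which is the limited sample interaction discussed in Section 3.3.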
3.2. Overfitting Issue of Barlow Twins

While the Barlow Twins algorithm boasts a straightforward design and a capacity to learn valuable representations for downstream tasks, we have observed a critical phenomenon: increasing the dimensionality of the embedding space d can result in the production of lower-quality representations but also in the risk of overfitting. This behavior may manifest as the network endeavors to minimize invariance and redundancy by memorizing individual samples. This issue, somewhat surprisingly, was neither acknowledged nor reported in the original Barlow Twins study, which primarily emphasized linear evaluation results conducted on the extensive ImageNet dataset.

However, our experiments conducted on smaller to medium-sized datasets, such as CIFAR-10, CIFAR-100, TinyImageNet, and STL-10, provide compelling evidence of the Barlow Twins's susceptibility to overfitting and the generation of suboptimal representations with the expansion of embedding dimensionality. To underscore this concern, we executed k-NN evaluation and monitored the top-1 accuracy at five-epoch intervals throughout 1000 epochs of Barlow Twins training. Based on the experimental results presented in Figure 3, a discernible trend emerges. Across all four datasets, increasing the embedding dimension initially results in improved top-1 accuracy. However, as pre-training extends, the utility of the learned representations diminishes, often starting to deteriorate around the 400-600 epoch mark. Consequently, employing representations obtained around this interval may prove more advantageous for the downstream tasks compared to relying on representations acquired after extensive pre-training.

3.3. What Causes the Overfitting?

The observed overfitting issue of Barlow Twins can be attributed to its unique loss formulation. To better understand this, let's compare the Barlow Twins loss (Equation 1) with the InfoNCE loss commonly used in contrastive SSL [14, 47]:

\mathcal{L}_{\text{InfoNCE}} \triangleq -\sum_b \underbrace{\frac{\langle z_b^A, z_b^B \rangle}{\tau \,\|z_b^A\|_2 \|z_b^B\|_2}}_{\text{similarity term}} + \sum_b \underbrace{\log\!\left(\sum_{b' \neq b} \exp\!\left(\frac{\langle z_b^A, z_{b'}^B \rangle}{\tau \,\|z_b^A\|_2 \|z_{b'}^B\|_2}\right)\right)}_{\text{contrastive term}}

where z^A and z^B are the twin network outputs, b indexes the sample in a batch, i indexes the vector component of the output, and \tau is a positive constant called temperature. As we can observe, the InfoNCE loss aims to maximize the variability among embeddings by increasing the pairwise distance between all sample pairs. In contrast, the Barlow Twins loss operates differently. It focuses on decorrelating the components of the embedding vectors rather than emphasizing the distance between samples within a batch. The distinction lies in Barlow Twins' approach, which results in no or less interaction between the samples in a given batch. As the embedding dimension grows, it leads to a considerable increase in the total number of trainable parameters. This expansion can potentially lead to overfitting, where the optimization process predominantly involves memorizing the samples rather than substantially improving the quality of the embeddings. This deviation from encouraging sample variability to focus on decorrelation might be a contributing factor to the observed overfitting in Barlow Twins.
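To make the contrast concrete, the following is a compact sketch of the InfoNCE loss above, written so that the sample interaction is explicit: every embedding z_b^A is scored against the embeddings of all other samples b' != b in the batch. The function name and the default temperature are illustrative and are not taken from any specific SimCLR implementation.

import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.5):
    """z_a, z_b: (N, D) twin-network outputs for the two views of a batch."""
    z_a = F.normalize(z_a, dim=1)            # inner products become cosine similarities
    z_b = F.normalize(z_b, dim=1)
    sim = (z_a @ z_b.T) / tau                # (N, N) pairwise similarity matrix
    n = sim.size(0)
    pos = sim.diagonal()                     # similarity term: matched pairs (b, b)
    # contrastive term: log-sum-exp over all other samples b' != b in the batch
    eye = torch.eye(n, dtype=torch.bool, device=sim.device)
    contrast = torch.logsumexp(sim.masked_fill(eye, float('-inf')), dim=1)
    return (contrast - pos).sum()

The gradient of each sample's loss therefore depends on every other sample in the batch, which is precisely the interaction that the Barlow Twins objective lacks.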
3.4. Guarding Barlow Twins from Overfitting

Motivated by the aforementioned issue observed in Barlow Twins (in Figure 3), we now explore a potential solution to address it. While we recognize the advantages that the original Barlow Twins algorithm brings to SSL compared to contrastive learning approaches, our goal is to mitigate this issue through a simple modification.
Figure 3. k-NN evaluation results (on the test set) with ResNet-50 backbone for Barlow Twins with different embedding dimensions d on (a) CIFAR-10, (b) CIFAR-100, (c) TinyImageNet, and (d) STL-10 datasets.

To achieve this, we propose to incorporate an additional regularization term L_reg on top of the Barlow Twins loss function L_BT, which promotes interaction between samples in the batch. This new regularization term L_reg is inspired by mixup regularization [75] in supervised learning but adapted to the context of SSL, aligning with the Barlow Twins loss formulation.

As depicted in Figure 2, our approach involves promoting interaction between the samples by creating mixed samples from Y^A and Y^B through linear interpolation, obtaining the embeddings of the mixed samples from the network, and formulating the regularization loss by assuming the network produces linearly interpolated embeddings for them. The following section provides a detailed explanation of the proposed Mixed Barlow Twins approach.

The proposed Mixed Barlow Twins first generates a batch of mixed samples denoted as Y^M by linearly interpolating between Y^A and Y^B. Since both Y^A and Y^B consist of different views of the same images, we first shuffle Y^B to ensure that the linear interpolation predominantly involves different images. This process can be mathematically expressed as follows:

Y^S = \text{Shuffle}(Y^B),   (2)
Y^M = \lambda Y^A + (1 - \lambda) Y^S,   (3)

where Y^S represents the shuffled batch of images obtained by shuffling the images in Y^B using a randomly determined shuffling order denoted as Shuffle(·), and \lambda is the interpolation ratio between Y^A and Y^S, sampled from a Beta distribution: \lambda \sim \text{Beta}(\alpha, \alpha) [38, 75]. In all of our experiments, we use \alpha = 1.0 unless stated otherwise.

Next, we feed Y^M through the encoder and projector to obtain its normalized embeddings centered along the batch dimension:

Z^M = f_{e+p}(Y^M),   (4)

where f_{e+p}(·) denotes the network. Subsequently, we compute the cross-correlation matrices between the mixed embeddings Z^M and the unmixed embeddings Z^A and Z^B along the batch dimension:

\mathcal{C}^{MA} = (Z^M)^T Z^A,   (5)
\mathcal{C}^{MB} = (Z^M)^T Z^B,   (6)

where T denotes the matrix transpose, and \mathcal{C}^{MA}, \mathcal{C}^{MB} \in \mathbb{R}^{d \times d} represent the cross-correlation between the mixed and the original embeddings. Assuming that "linear interpolation in the input RGB space results in linearly interpolated features in the embedding space," we can determine the ground-truth cross-correlation matrices \mathcal{C}^{MA}_{gt} and \mathcal{C}^{MB}_{gt} for creating the regularization term:

\mathcal{C}^{MA}_{gt} = (Z^M)^T Z^A   (7)
                     = (\lambda Z^A + (1 - \lambda) Z^S)^T Z^A   (8)
                     = \lambda (Z^A)^T Z^A + (1 - \lambda)\,\text{Shuffle}^*(Z^B)^T Z^A.   (9)

Similarly,

\mathcal{C}^{MB}_{gt} = \lambda (Z^A)^T Z^B + (1 - \lambda)\,\text{Shuffle}^*(Z^B)^T Z^B.   (10)
Here, Shuffle* denotes shuffling the embeddings in the same order as in Equation (2). The mixup-based regularization loss is designed to align \mathcal{C}^{MA} and \mathcal{C}^{MB} with their respective ground-truth cross-correlation matrices, \mathcal{C}^{MA}_{gt} and \mathcal{C}^{MB}_{gt}. To achieve this, we employ a simple L2 loss:

\mathcal{L}_{reg} = \frac{\lambda_{BT}}{2} \left( \big\|\mathcal{C}^{MA} - \mathcal{C}^{MA}_{gt}\big\|_2^2 + \big\|\mathcal{C}^{MB} - \mathcal{C}^{MB}_{gt}\big\|_2^2 \right)   (11)

The final loss L of the proposed Mixed Barlow Twins can then be expressed as:

\mathcal{L} = \mathcal{L}_{BT} + \lambda_{reg}\, \mathcal{L}_{reg},   (12)

where \lambda_{reg} controls the balance between \mathcal{L}_{BT} and \mathcal{L}_{reg}. As can be observed, the proposed mixup regularization improves the interaction between the samples in the batch and introduces an infinite set of synthetic samples into the pre-training process, which, as demonstrated in the following section, leads to a reduction in feature overfitting and considerable improvements in the downstream performance.
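A condensed sketch of Eqs. (2)-(12) is given below; it reuses the barlow_twins_loss sketch from Section 3.1, and net stands in for the encoder-plus-projector f_{e+p}. The function name and the default coefficients are illustrative assumptions of this sketch; the authors' reference pseudocode is Algorithm 1 in the supplementary material.

import torch

def mixed_barlow_twins_loss(net, y_a, y_b, lambda_bt=0.0078125, lambda_reg=4 * 0.0078125, alpha=1.0):
    n = y_a.size(0)
    z_a, z_b = net(y_a), net(y_b)
    loss_bt = barlow_twins_loss(z_a, z_b, lambda_bt)          # L_BT, Section 3.1 sketch

    # Eqs. (2)-(3): shuffle view B and mix it with view A
    idx = torch.randperm(n, device=y_a.device)
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    y_m = lam * y_a + (1 - lam) * y_b[idx]
    # batch-normalized embeddings, as in Section 3.1
    za = (z_a - z_a.mean(0)) / z_a.std(0)
    zb = (z_b - z_b.mean(0)) / z_b.std(0)
    z_m = net(y_m)                                            # Eq. (4)
    zm = (z_m - z_m.mean(0)) / z_m.std(0)
    # Eqs. (5)-(6): cross-correlation of mixed vs. unmixed embeddings
    c_ma = (zm.T @ za) / n
    c_mb = (zm.T @ zb) / n
    # Eqs. (9)-(10): ground truth under the linear-interpolation assumption
    c_ma_gt = lam * (za.T @ za) / n + (1 - lam) * (zb[idx].T @ za) / n
    c_mb_gt = lam * (za.T @ zb) / n + (1 - lam) * (zb[idx].T @ zb) / n
    # Eq. (11): L2 regularization loss; Eq. (12): total loss
    loss_reg = 0.5 * lambda_bt * ((c_ma - c_ma_gt).pow(2).sum() + (c_mb - c_mb_gt).pow(2).sum())
    return loss_bt + lambda_reg * loss_reg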
4. Experimental Results

Datasets. We conduct experiments on datasets of varying scales to demonstrate the effectiveness of the proposed mixup regularization in addressing the overfitting issue of Barlow Twins. We employ five datasets: CIFAR-10 [42], CIFAR-100 [42], TinyImageNet [43], STL-10 [22], and ImageNet-1k [24]. The CIFAR-10 and CIFAR-100 datasets consist of 50,000 training images and 10,000 test images, each with a size of 32×32 pixels. In contrast, TinyImageNet, which is a subset of the full ImageNet, comprises 100,000 training images, 10,000 test images, and 200 classes, with images of dimension 64×64 pixels. Compared to CIFAR-10 and CIFAR-100, TinyImageNet is considered large due to its increased number of images and classes. The STL-10 dataset is specifically designed for SSL and differs from the other three datasets. It includes a separate unlabeled set of 100,000 images, 5,000 training images, and 8,000 test images, distributed across 10 classes. The unlabeled examples are drawn from a similar but broader image distribution, making it an ideal choice for SSL. ImageNet [24], renowned as one of the largest image classification benchmarks, contains 1.2 million training images and 50,000 validation images. In our experiments, we employ the training set without labels for SSL training on all five datasets. However, for the STL-10 dataset, we incorporate additional unlabeled data into the SSL process.

Experimental Setup. We employ ResNet-18 and ResNet-50 [33] as the backbone architectures. In the case of SSL with the CIFAR-10, CIFAR-100, TinyImageNet, and STL-10 datasets, we utilize the Adam [41] optimizer with a batch size of 256, a cosine annealing learning rate scheduler [49] with linear warm-up [48], and a weight decay [50] of 1e-6 for pre-training. The pre-training phase spans 1000 epochs, although we sometimes pre-trained for 2000 epochs on small datasets. We perform grid hyperparameter tuning on the embedding dimension d over the values 128, 1024, 2048, and 4096, on \lambda_{BT} over the values 0.0078125 and 1/d [62], and on \lambda_{reg} over the values 1×, 2×, 3×, 4×, and 5× \lambda_{BT}. To assess the performance of the pre-trained models, we conduct k-NN evaluation [23, 29] on the test set and linear probing. k-NN evaluation involves utilizing normalized features from the encoder f_e; we set k = 200 [28]. For linear probing, we employ the Adam optimizer with a batch size of 512, an exponential learning rate scheduler [45], and a weight decay of 1e-6. We use a single NVIDIA RTX A5000 GPU for all experiments on the CIFAR-10, CIFAR-100, TinyImageNet, and STL-10 datasets.
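A minimal sketch of the k-NN evaluation protocol described above is shown below: features from the frozen encoder f_e are L2-normalized, and each test image is classified by a vote over its k = 200 nearest training features under cosine similarity. The names backbone, train_loader, and test_loader are assumptions of this sketch, and many public implementations use a temperature-weighted vote rather than the plain majority vote used here.

import torch
import torch.nn.functional as F

@torch.no_grad()
def knn_accuracy(backbone, train_loader, test_loader, k=200, device="cuda"):
    backbone.eval()
    feats, labels = [], []
    for x, y in train_loader:                       # build the feature bank
        feats.append(F.normalize(backbone(x.to(device)), dim=1))
        labels.append(y.to(device))
    bank, bank_y = torch.cat(feats), torch.cat(labels)

    correct = total = 0
    for x, y in test_loader:
        q = F.normalize(backbone(x.to(device)), dim=1)
        sim = q @ bank.T                            # cosine similarity to the bank
        nn_idx = sim.topk(k, dim=1).indices         # k nearest training samples
        votes = bank_y[nn_idx]                      # their labels
        pred = votes.mode(dim=1).values             # plain majority vote
        correct += (pred == y.to(device)).sum().item()
        total += y.size(0)
    return 100.0 * correct / total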
For the experiments on ImageNet, we used the ResNet-50 backbone. We employed the LARS [72] optimizer with a batch size of 1024 distributed across eight (8) NVIDIA RTX A5000 GPUs, with base learning rates of 0.2 for weights and 0.0048 for biases, respectively. Both learning rates were linearly ramped up from 0 to their base values over 10 epochs and then gradually decayed with a cosine schedule over the remaining epochs until they reached 0.001 times their base rate. All ImageNet experiments were trained for 300 epochs with an embedding dimension of 8192. We used a weight decay of 1e-6 and set \lambda_{BT} to 0.0051. The projector consisted of MLPs with dimensions 8192-8192-8192. Following the original implementation, the cross-correlation matrices computed on each GPU were combined using the all_reduce operation before computing the Barlow Twins loss L_BT and the MixUp regularization loss L_reg. For linear probing, we freeze the ResNet-50 backbone and train a classifier for 100 epochs with a batch size of 256 distributed across 8 GPUs. We utilize a cosine annealing learning rate scheduler with a base learning rate of 0.3. We use the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 1e-6. During linear probing, random resized crops of 224×224 are used, followed by random horizontal flips as the only augmentations. During the testing phase, images are resized to 256 and then center-cropped to 224.
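For reference, the linear probing protocol amounts to freezing the pre-trained backbone and training only a linear classifier on its features. The sketch below mirrors the SGD settings quoted above; backbone, feat_dim, and the data loaders are assumptions of this sketch rather than names from the released code.

import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, train_loader, epochs=100, device="cuda"):
    backbone.eval().requires_grad_(False)                 # freeze the encoder
    clf = nn.Linear(feat_dim, num_classes).to(device)
    opt = torch.optim.SGD(clf.parameters(), lr=0.3, momentum=0.9, weight_decay=1e-6)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                feats = backbone(x.to(device))            # frozen features
            loss = loss_fn(clf(feats), y.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return clf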
Our implementation of Mixed Barlow Twins is based on the original Barlow Twins implementation¹ for the experiments on ImageNet and on Barlow Twins HSIC [62]² with the modifications mentioned above for the other datasets. For the SOTA methods (SimCLR [14], BYOL [58], and Whitening-MSE [28]), we closely follow their original implementations for ImageNet and the implementations by [28]³ for the other datasets.

¹ https://github.com/facebookresearch/barlowtwins
² https://github.com/yaohungt/Barlow-Twins-HSIC
³ https://github.com/htdt/self-supervised
Figure 4. k-NN evaluation results (on the test set) with ResNet-50 backbone for Barlow Twins (dashed lines) vs. the proposed Mixed Barlow Twins (solid lines) with various embedding dimensions d on (a) CIFAR-10, (b) CIFAR-100, (c) TinyImageNet, and (d) STL-10.

Effect of MixUp Regularization: Comparing k-NN Evaluation Results with the ResNet-50 Backbone. After demonstrating the overfitting issue of the Barlow Twins algorithm in Section 3.2, we now delve into the k-NN evaluation results to understand how they evolve throughout SSL training when the mixup regularization L_reg is incorporated. As illustrated in Figure 4, where solid lines represent SSL training with L_BT + L_reg and dashed lines correspond to L_BT alone, it becomes evident that integrating mixup regularization into Barlow Twins training always leads to improved k-NN accuracy and reduced overfitting across all four datasets. Across all four embedding dimensions considered, the proposed Mixed Barlow Twins eliminates the overfitting issue, leading to improved top-1 accuracy throughout SSL training. The most notable improvement is seen with an embedding dimension of d = 1024, where we observe a remarkable +5.2% increase over the original Barlow Twins method with mixup regularization (91.14% vs. 85.93%) on CIFAR-10. Similarly, in the case of CIFAR-100, we witness a similar trend: incorporating mixup regularization yields better k-NN evaluation performance. When comparing the best performance, achieved with an embedding dimension of d = 1024, we note a substantial +3.78% improvement for Mixed Barlow Twins over the Barlow Twins (61.71% vs. 57.93%). For TinyImageNet, the best performance is achieved with an embedding dimension of d = 1024, which is comparatively smaller than the optimal results for CIFAR-10 and CIFAR-100. In the best-case scenario, we observe a +2.86% improvement with mixup regularization (40.52% vs. 37.66%). Experiments with the STL-10 dataset present similar observations: we notice a +2.77% improvement (87.55% vs. 84.78%) in the best-case scenario, where the embedding dimension is d = 1024.

In summary, the results obtained across all four datasets consistently demonstrate that adding the proposed mixup-based regularization on top of the original Barlow Twins loss mitigates network overfitting and fosters the learning of superior feature representations that prove beneficial for downstream applications.

Comparison with SOTA Methods Using ResNet-18 and ResNet-50. Tables 1 and 2 compare the results of our Mixed Barlow Twins with Barlow Twins and other SOTA approaches using ResNet-50 and ResNet-18, respectively. We present both k-NN evaluation and linear probing results (top-1). When looking at the k-NN results with the ResNet-50 backbone, we can observe considerable improvements that Mixed Barlow Twins brings over Barlow Twins and other SOTA methods on all four datasets. More importantly, Mixed Barlow Twins achieves new SOTA results with +1.28% (91.39 vs. 90.11), +3.07% (64.32 vs. 61.25), +4.45% (42.21 vs. 37.66), and +0.85% (87.79 vs. 86.94) with k-NN evaluation and +1.13%, +2.92%, +4.98%, and +0.52% with linear evaluation on the CIFAR-10, CIFAR-100, TinyImageNet, and STL-10 datasets, respectively, using the ResNet-50 backbone. As shown in Table 2, similar performance improvements over Barlow Twins and other SOTA methods are observed using the ResNet-18 backbone, except that it achieves the second-best result on the STL-10 dataset under linear evaluation.
Table 1. Comparing Mixed Barlow Twins with SOTA methods using the ResNet-50 [33] backbone (k-NN / linear top-1 accuracy, %).

Method                      Epochs   CIFAR-10         CIFAR-100        TinyImageNet     STL-10
                                     k-NN    linear   k-NN    linear   k-NN    linear   k-NN    linear
SimCLR [14]                 1000     88.94   91.85    57.55   68.37    29.62   46.34    85.14   89.26
BYOL [58]                   1000     90.11   92.76    61.25   69.59    25.78   41.02    86.94   91.18
Barlow Twins [74]           1000     85.92   90.88    57.93   66.15    37.66   46.86    84.78   87.93
Mixed Barlow Twins (ours)   1000     91.14   93.48    61.71   71.98    40.52   50.59    87.55   91.10
Mixed Barlow Twins (ours)   2000     91.39   93.89    64.32   72.51    42.21   51.84    87.79   91.70

Table 2. Comparing Mixed Barlow Twins with SOTA methods using the ResNet-18 [33] backbone (k-NN / linear top-1 accuracy, %).

Method                      Epochs   CIFAR-10         CIFAR-100        TinyImageNet     STL-10
                                     k-NN    linear   k-NN    linear   k-NN    linear   k-NN    linear
SimCLR [14]                 1000     88.42   91.80    56.56   66.83    32.86   48.84    85.68   90.51
BYOL [58]                   1000     89.45   91.73    56.82   66.60    36.24   51.00    88.64   91.99
W-MSE [28]                  1000     89.69   91.55    56.69   66.10    34.16   48.20    87.10   90.36
Barlow Twins [74]           1000     86.66   87.76    55.94   61.64    33.60   41.80    84.23   88.21
Mixed Barlow Twins (ours)   1000     89.91   91.99    61.12   68.55    37.52   51.48    88.23   91.04
Mixed Barlow Twins (ours)   2000     90.52   92.58    61.25   69.31    38.11   51.67    88.94   91.02

Table 3. Linear probing results (top-1 %) on ImageNet [24] with ResNet-50 backbone.

Method                                   Epochs   Acc. %
MoCo v2 [17]                             200      67.5
SimCLR [14]                              400      68.2
BYOL [58]                                300      72.3
VICReg [6]                               300      71.5
VICRegL [7]                              300      71.2
Barlow Twins [74]                        300      71.3
Ours:
Mixed Barlow Twins (λ_reg = 0.0025)      300      70.9
Mixed Barlow Twins (λ_reg = 0.1)         300      71.6
Mixed Barlow Twins (λ_reg = 1.0)         300      72.2

Results on ImageNet-1k. Before delving into the linear probing results, we provide some training statistics that compare the convergence of the different loss terms during training. Figures 5a and 5b illustrate the convergence of two loss terms: the Barlow Twins loss (L_BT) and the MixUp regularization loss (L_reg). These figures compare the default Barlow Twins training (shown in purple) with Mixed Barlow Twins training (shown in red), specifically for a regularization weight (λ_reg) value of 1.0. In the case of Mixed Barlow Twins training, we observe that the Barlow Twins loss (L_BT) tends to converge to a slightly higher value on average compared to the default Barlow Twins training. This difference in convergence is likely due to the influence of MixUp regularization. However, despite the higher Barlow Twins loss, the results of linear evaluation indicate that Mixed Barlow Twins training achieves a higher top-1 accuracy compared to its default counterpart.

Table 3 presents the linear evaluation results on ImageNet with the ResNet-50 backbone. It is evident that Mixed Barlow Twins achieves the second-best results compared to SOTA methods. Moreover, it significantly improves over Barlow Twins and its later counterparts, such as VICReg and VICRegL. This outcome indicates that introducing mixed samples into the SSL framework is beneficial even for large datasets.
Figure 5. Convergence of the Barlow Twins loss and the mixup-based regularization loss during ImageNet training for 300 epochs. Barlow Twins training is shown in purple and Mixed Barlow Twins training is shown in red. (a) Barlow Twins loss L_BT for Mixed Barlow Twins training (in red) and Barlow Twins training (in purple). (b) MixUp regularization loss L_reg for Mixed Barlow Twins training (in red); in this experiment, λ_reg is set to 1.0.

5. Further Discussion

Convergence of Different Loss Terms. In this section, we examine how each of the loss terms converges during SSL training for default Barlow Twins (i.e., L_BT) and Mixed Barlow Twins training (i.e., L_BT and L_reg). From Figure 6, it is evident that, despite Mixed Barlow Twins demonstrating improved performance (see Figure 4), it converges to a higher Barlow Twins loss under MixUp regularization compared to the default Barlow Twins training. This observation underscores a key insight: achieving a lower Barlow Twins loss during training does not inherently guarantee improved downstream performance. In fact, the drive to excessively minimize the Barlow Twins loss may lead the network to memorize or overly fixate on the embeddings of samples, which can ultimately result in poor downstream performance. In contrast, incorporating mixed samples into the SSL training makes it challenging for the network to memorize samples, as it introduces a theoretically infinite variety of mixed samples into the SSL training process, effectively compelling the network to minimize the Barlow Twins loss by learning valuable high-level representations. These representations, in turn, prove beneficial for the downstream applications, ultimately leading to the enhanced performance.

Figure 6. Breakdown of the loss terms (L_BT and L_reg) on the CIFAR-10 dataset with d = 1024.

Effect of the Mixup Regularization Coefficient λ_reg. In this section, we examine the impact of the regularization parameter λ_reg on the downstream performance. As previously observed in Section 4, the addition of mixup regularization to the original Barlow Twins loss significantly enhances the downstream performance. It is crucial to note that, like any other regularization term, selecting an appropriate coefficient that adequately balances the loss terms is essential. A substantial increase in the regularization coefficient noticeably prolongs the convergence time of the network. Figure 7 illustrates the variations in the total training loss L and the k-NN evaluation accuracy throughout SSL training for different mixup regularization coefficient values. The figure demonstrates that an escalation in the mixup regularization coefficient leads to slower convergence and reduced downstream performance. This phenomenon was consistently observed across the other three datasets (CIFAR-100, TinyImageNet, and STL-10). Based on our analysis, we found that setting λ_reg = 1 to 3 × λ_BT serves as a useful rule of thumb.
Figure 7. Impact of λ_reg on the CIFAR-10 dataset with d = 4096.

6. Conclusion

We highlight some limitations of the popular Barlow Twins algorithm and propose a remedy in the form of Mixed Barlow Twins. Our experiments demonstrate that the integration of mixup-based regularization effectively mitigates feature overfitting and significantly enhances downstream task performance. This novel approach enriches the Barlow Twins process by promoting the interaction between samples and introducing a myriad of synthetic samples. The results endorse the efficacy of Mixed Barlow Twins in addressing the drawbacks of the Barlow Twins approach, suggesting a promising direction in SSL.
Guarding Barlow Twins Against Overfitting with Mixed Samples
Supplementary Material

A. Transfer Learning

In this section, we compare the transfer learning capabilities of Mixed Barlow Twins with Barlow Twins.

A.1. Transfer Learning Datasets

For the transfer learning experiments, we consider seven datasets:
1. DTD [21]: This texture database consists of 5,640 images, organized into 47 categories inspired by human perception. Each category contains 120 images. The image sizes range from 300x300 to 640x640 pixels, and each image contains at least 90% of the surface representing the category attribute.
2. MNIST [25]: MNIST is a dataset of handwritten digits, with a training set of 60,000 examples and a test set of 10,000 examples.
3. FashionMNIST [66]: FashionMNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image associated with a label from one of 10 classes.
4. CUBirds [63]: CUBirds is a challenging image dataset annotated with 200 bird species, mostly North American. It contains 11,788 images categorized into 200 subcategories, with 5,994 images for training and 5,794 for testing. Each image has detailed annotations, including a subcategory label, 15 part locations, 312 binary attributes, and a bounding box.
5. VGGFlower [53]: VGGFlower is a dataset with 102 categories, consisting of 102 flower categories commonly occurring in the United Kingdom. Each class contains between 40 and 258 images.
6. Traffic Signs [36]: This dataset is from the German Traffic Sign Detection Benchmark (GTSDB) and includes 900 training images (1360 x 800 pixels) containing only the traffic signs.
7. Aircraft [51]: Aircraft is a benchmark dataset for fine-grained visual categorization of aircraft. It contains 10,200 images of aircraft, with 100 images for each of 102 different aircraft model variants, most of which are airplanes. Each image is annotated with a tight bounding box and a hierarchical airplane model label.

A.2. Transfer Learning Results

We present linear evaluation results in which a linear classifier is trained on top of the frozen ResNet backbone. The linear classifier is trained for 100 epochs using the SGD optimizer. Table 4 displays the transfer learning results from CIFAR-10, CIFAR-100, and STL-10 to {DTD, MNIST, FashionMNIST, CUBirds, VGGFlower, Traffic Signs, and Aircraft}, with the best result in each block shown in bold. It can be observed that our Mixed Barlow Twins consistently yields significant improvements over Barlow Twins, demonstrating that mixup regularization not only enhances in-dataset performance but also boosts the transfer learning capabilities of the model.

B. Pseudocode

We present pseudocode for our Mixed Barlow Twins algorithm in Algorithm 1. Since our implementation is based on the default Barlow Twins implementation, we have highlighted the modifications required on top of Barlow Twins. As shown in the pseudocode, our method involves only a few lines of code changes on top of the Barlow Twins algorithm.

C. Default Hyperparameter Configuration for Mixed Barlow Twins Training

In this section, we provide the default (i.e., the best) hyperparameter configuration used for pre-training on each dataset.

C.1. CIFAR-10/CIFAR-100

Table 5 presents the default configuration for the Mixed Barlow Twins experiments conducted on the CIFAR-10 and CIFAR-100 datasets using ResNet-18 and ResNet-50 as backbones. We observed that among the choices of projector dimension d (128, 1024, 2048, and 4096), a value of 1,024 consistently yielded the best results when combined with a λ_BT value of 0.0078125 (chosen from the options 0.0078125 and 1/d), particularly for the CIFAR datasets. The optimal nearest-neighbor (k-NN) accuracy was achieved with a λ_reg value of 4.0.

Table 5. Default hyperparameter configuration for Mixed Barlow Twins training on the CIFAR-10 and CIFAR-100 datasets with ResNet-18 and ResNet-50 backbones.

Key                        Value
batch size                 256
learning rate              0.01
learning rate scheduler    "cosine"
feature dim. (d)           1,024
λ_BT                       0.0078125
λ_reg                      4.0
Table 4. Transfer learning results (linear classification accuracy) from CIFAR-10, CIFAR-100, and STL-10 to {DTD, MNIST, FaMNIST, CUBirds, VGGFlower, TrafficSigns, Aircraft} with the ResNet-18 backbone. Both models are pre-trained for 1000 epochs on the respective source dataset.

Model                       DTD     MNIST   FaMNIST   CUBirds   VGGFlower   TrafficSigns   Aircraft
CIFAR-10
Barlow Twins                20.53   91.78   83.49     5.04      25.80       49.42          14.58
Mixed Barlow Twins (ours)   34.04   97.26   87.84     10.70     56.48       76.06          31.13
CIFAR-100
Barlow Twins                27.45   93.72   84.78     5.89      34.96       59.85          15.54
Mixed Barlow Twins (ours)   37.23   97.81   88.03     11.01     64.04       77.61          31.23
STL-10
Barlow Twins                45.10   96.34   85.28     11.53     68.03       66.80          34.65
Mixed Barlow Twins (ours)   46.31   97.31   86.21     12.27     68.36       69.73          35.27

C.2. TinyImageNet

Table 6 presents the default configuration for the Mixed Barlow Twins experiments conducted on TinyImageNet using ResNet-18 and ResNet-50 as backbones. We observed that among the choices of projector dimension d (128, 1024, 2048, and 4096), a value of 1,024 consistently yielded the best results when combined with a λ_BT value of 0.0009765 (chosen from the options 0.0078125 and 1/d = 0.0009765). The optimal nearest-neighbor (k-NN) accuracy was achieved with a λ_reg value of 4.0.

Table 6. Default hyperparameter configuration for Mixed Barlow Twins training on TinyImageNet with the ResNet-18/ResNet-50 backbone.

Key                        Value
batch size                 256
learning rate              0.01
learning rate scheduler    "cosine"
feature dim. (d)           1,024
λ_BT                       1/d = 0.0009765
λ_reg                      4.0

C.3. STL-10

Table 7 presents the default configuration for the Mixed Barlow Twins experiments conducted on STL-10 using ResNet-18 and ResNet-50 as backbones. We observed that among the choices of projector dimension d (128, 1024, 2048, and 4096), a value of 1,024 consistently yielded the best results when combined with a λ_BT value of 0.0078125 (chosen from the options 0.0078125 and 1/d). The optimal nearest-neighbor (k-NN) accuracy was achieved with a λ_reg value of 2.0.

Table 7. Default hyperparameter configuration for Mixed Barlow Twins training on STL-10 with the ResNet-18/ResNet-50 backbone.

Key                        Value
batch size                 256
learning rate              0.01
learning rate scheduler    "cosine"
feature dim. (d)           1,024
λ_BT                       0.0078125
λ_reg                      2.0

C.4. ImageNet

Table 8 presents the default configuration for the Mixed Barlow Twins experiments conducted on ImageNet using ResNet-50 as the backbone. We use the default configuration reported in the original Barlow Twins analysis: an embedding dimension d of 8192 with a λ_BT value of 0.0051. The optimal linear probing accuracy was achieved with a λ_reg value of 1.0.

Table 8. Default hyperparameter configuration for Mixed Barlow Twins training on ImageNet with the ResNet-50 backbone.

Key                       Value
# gpus                    8
batch size                1024
epochs                    300
learning rate biases      0.0048
learning rate weights     0.2
projector dims            "8192-8192-8192"
weight decay              0.000001
λ_BT                      0.0051
λ_reg                     0.1
Algorithm 1 PyTorch-style pseudocode for Mixed Barlow Twins, with the changes required on top of Barlow Twins marked by the "### MixUp Regularization (our contribution) ###" block. This pseudocode template is adapted from Barlow Twins.

# f: encoder network
# lmbda: weight on the off-diagonal terms (lambda_BT); named lmbda because
#        `lambda` is a reserved keyword in Python
# lmbda_mixup: weight on the mixup regularization loss (lambda_reg)
# N: batch size
# D: dimensionality of the embeddings
#
# mm: matrix-matrix multiplication
# off_diagonal: off-diagonal elements of a matrix
# eye: identity matrix
# randperm: random permutation of integers
# beta: draw samples from a Beta distribution

for x in loader:  # load a batch with N samples
    # two randomly augmented versions of x
    y_a, y_b = augment(x)
    # compute embeddings
    z_a = f(y_a)  # NxD
    z_b = f(y_b)  # NxD
    # normalize repr. along the batch dimension
    z_a_norm = (z_a - z_a.mean(0)) / z_a.std(0)  # NxD
    z_b_norm = (z_b - z_b.mean(0)) / z_b.std(0)  # NxD

    # cross-correlation matrix
    c = mm(z_a_norm.T, z_b_norm) / N  # DxD
    # loss
    c_diff = (c - eye(D)).pow(2)  # DxD
    # multiply off-diagonal elems of c_diff by lmbda
    off_diagonal(c_diff).mul_(lmbda)
    loss_bt = c_diff.sum()

    ### MixUp Regularization (our contribution) ###
    # creating mixed samples: Eqn. (2)-(3)
    idxs = randperm(N)
    alpha = beta(1.0, 1.0)
    y_m = alpha * y_a + (1 - alpha) * y_b[idxs, :]
    # compute mixed sample embeddings: Eqn. (4)
    z_m = f(y_m)  # NxD
    z_m_norm = (z_m - z_m.mean(dim=0)) / z_m.std(dim=0)  # NxD
    # cross-correlation matrices: Eqn. (5)-(6)
    cc_m_a = mm(z_m_norm.T, z_a_norm) / N  # DxD
    cc_m_b = mm(z_m_norm.T, z_b_norm) / N  # DxD
    # ground-truth cross-correlation matrices: Eqn. (9)-(10)
    cc_m_a_gt = alpha*mm(z_a_norm.T, z_a_norm)/N + (1-alpha)*mm(z_b_norm[idxs,:].T, z_a_norm)/N  # DxD
    cc_m_b_gt = alpha*mm(z_a_norm.T, z_b_norm)/N + (1-alpha)*mm(z_b_norm[idxs,:].T, z_b_norm)/N  # DxD

    # mixup regularization loss: Eqn. (11)
    loss_mix = lmbda_mixup*lmbda*((cc_m_a-cc_m_a_gt).pow_(2).sum() + (cc_m_b-cc_m_b_gt).pow_(2).sum())

    # total loss
    loss = loss_bt + loss_mix
    # optimization step
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

References

[1] Yuki Markus Asano, Christian Rupprecht, and Andrea Vedaldi. Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371, 2019. 2
[2] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629, 2023. 2
[3] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023. 1
[4] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. Advances in Neural Information Processing Systems, 32, 2019. 1
[5] Sami Barchid, José Mennesson, and Chaabane Djéraba. Exploring joint embedding architectures and data augmentations for self-supervised representation learning in event-based vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 3903–3912, 2023. 2
[6] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. arXiv preprint arXiv:2105.04906, 2021. 2, 3, 8
[7] Adrien Bardes, Jean Ponce, and Yann LeCun. VICRegL: Self-supervised learning of local visual features. Advances in Neural Information Processing Systems, 35:8799–8810, 2022. 2, 3, 8
[8] Adrien Bardes, Jean Ponce, and Yann LeCun. MC-JEPA: A joint-embedding predictive architecture for self-supervised learning of motion and content features. arXiv preprint arXiv:2307.12698, 2023. 2
[9] Miguel A Bautista, Artsiom Sanakoyeu, Ekaterina Tikhoncheva, and Bjorn Ommer. CliqueCNN: Deep unsupervised exemplar learning. Advances in Neural Information Processing Systems, 29, 2016. 2
[10] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. Advances in Neural Information Processing Systems, 6, 1993. 1, 2
[11] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2959–2968, 2019. 2
[12] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems, 33:9912–9924, 2020. 1, 2
[13] Changyou Chen, Jianyi Zhang, Yi Xu, Liqun Chen, Jiali Duan, Yiran Chen, Son Tran, Belinda Zeng, and Trishul Chilimbi. Why do we need large batch sizes in contrastive learning? A gradient-bias perspective. Advances in Neural Information Processing Systems, 35:33860–33875, 2022. 2
[14] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020. 1, 2, 4, 6, 8
[15] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. Advances in Neural Information Processing Systems, 33:22243–22255, 2020. 2
[16] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021. 1, 3
[17] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020. 2, 8
[18] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9620–9629, 2021. 2
[19] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), pages 539–546. IEEE, 2005. 1
[20] Ching-Yao Chuang, R Devon Hjelm, Xin Wang, Vibhav Vineet, Neel Joshi, Antonio Torralba, Stefanie Jegelka, and Yale Song. Robust contrastive learning against noisy views. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16670–16681, 2022. 1
[21] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014. 11
[22] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011. 2, 6
[23] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967. 6
[24] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. 6, 8
[25] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012. 11
[26] Luc Devroye, László Györfi, and Gábor Lugosi. Consistency of the k-nearest neighbor rule. A Probabilistic Theory of Pattern Recognition, pages 169–185, 1996. 2
[27] Carl Doersch, Abhinav Gupta, and Alexei A. Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. 1
[28] Aleksandr Ermolov, Aliaksandr Siarohin, Enver Sangineto, and Nicu Sebe. Whitening for self-supervised representation learning. In International Conference on Machine Learning, pages 3015–3024. PMLR, 2021. 2, 3, 6, 7, 8
[29] Evelyn Fix and Joseph Lawson Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. International Statistical Review/Revue Internationale de Statistique, 57(3):238–247, 1989. 6
[30] Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, and Matthieu Cord. Learning representations by predicting bags of visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6928–6938, 2020. 3
[31] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. Advances in Neural Information Processing Systems, 33:21271–21284, 2020. 1, 3
[32] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), pages 1735–1742. IEEE, 2006. 2
[33] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 6, 8
[34] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020. 1, 2
[35] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019. 1, 2
[36] Sebastian Houben, Johannes Stallkamp, Jan Salmen, Marc Schlipsing, and Christian Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2013. 11
[37] Jiabo Huang, Qi Dong, Shaogang Gong, and Xiatian Zhu. Unsupervised deep learning by neighbourhood discovery. In International Conference on Machine Learning, pages 2849–2858. PMLR, 2019. 2
[38] Norman L Johnson, Samuel Kotz, and Narayanaswamy Balakrishnan. Continuous univariate distributions, volume 2. John Wiley & Sons, 1995. 5
[39] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus. Hard negative mixing for contrastive learning. Advances in Neural Information Processing Systems, 33:21798–21809, 2020. 3
[40] Sungnyun Kim, Gihun Lee, Sangmin Bae, and Se-Young Yun. MixCo: Mix-up contrastive learning for visual representation. arXiv preprint arXiv:2010.06300, 2020. 3
[41] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015. 6
[42] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical Report, University of Toronto, 2009. 1, 2, 6
[43] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 7(7):3, 2015. 2, 6
[44] Kibok Lee, Yian Zhu, Kihyuk Sohn, Chun-Liang Li, Jinwoo Shin, and Honglak Lee. i-Mix: A domain-agnostic strategy for contrastive representation learning. arXiv preprint arXiv:2010.08887, 2020. 3
[45] Zhiyuan Li and Sanjeev Arora. An exponential learning rate schedule for deep learning. In International Conference on Learning Representations, 2019. 6
[46] Ran Liu. Understand and improve contrastive learning methods for visual representation: A review. arXiv preprint arXiv:2106.03259, 2021. 1
[47] Xiao Liu, Fanjin Zhang, Zhenyu Hou, Li Mian, Zhaoyu Wang, Jing Zhang, and Jie Tang. Self-supervised learning: Generative or contrastive. IEEE Transactions on Knowledge and Data Engineering, 35(1):857–876, 2021. 1, 4
[48] I Loshchilov and F Hutter. Stochastic gradient descent with warm restarts. In Proceedings of the 5th International Conference on Learning Representations, pages 1–16. 6
[49] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016. 6
[50] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2018. 6
[51] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013. 11
[52] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6707–6717, 2020. 1, 2
[53] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics and Image Processing, 2008. 11
[54] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. 2
[…] of the 34th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2020. Curran Associates Inc. 1
[62] Yao-Hung Hubert Tsai, Shaojie Bai, Louis-Philippe Morency, and Ruslan Salakhutdinov. A note on connecting Barlow Twins with negative-sample-free contrastive learning. arXiv preprint arXiv:2104.13712, 2021. 6
[63] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. Caltech-UCSD Birds-200-2011 (CUB-200-2011). Technical Report CNS-TR-2011-001, California Institute of Technology, 2011. 11
[64] Haoqing Wang, Xun Guo, Zhi-Hong Deng, and Yan Lu. Rethinking minimal sufficient representation in contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16041–16050, 2022. 2
[65] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recogni-
[55] Chen Peng, Xianzhong Long, and Yun Li. Mnn: tion, pages 3733–3742, 2018. 1, 2
Mixed nearest-neighbors for self-supervised learning,
[66] Han Xiao, Kashif Rasul, and Roland Vollgraf.
2023. 3
Fashion-mnist: a novel image dataset for benchmark-
[56] Senthil Purushwalkam and Abhinav Gupta. Demys- ing machine learning algorithms, 2017. 11
tifying contrastive self-supervised learning: Invari-
[67] Tete Xiao, Xiaolong Wang, Alexei A Efros, and
ances, augmentations and dataset biases. Advances
Trevor Darrell. What should not be contrastive in con-
in Neural Information Processing Systems, 33:3407–
trastive learning. arXiv preprint arXiv:2008.05659,
3418, 2020. 3
2020. 3
[57] Sylvestre-Alvise Rebuffi, Sven Gowal, Dan Andrei
Calian, Florian Stimberg, Olivia Wiles, and Timo- [68] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsu-
thy A Mann. Data augmentation can improve robust- pervised deep embedding for clustering analysis. In
ness. Advances in Neural Information Processing Sys- International conference on machine learning, pages
tems, 34:29935–29948, 2021. 3 478–487. PMLR, 2016. 2
[58] Pierre H Richemond, Jean-Bastien Grill, Florent [69] Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti
Altché, Corentin Tallec, Florian Strub, Andrew Brock, Ghadiyaram, and Dhruv Mahajan. Clusterfit: Im-
Samuel Smith, Soham De, Razvan Pascanu, Bilal Piot, proving generalization of visual representations. In
et al. Byol works even without batch statistics. arXiv Proceedings of the IEEE/CVF Conference on Com-
preprint arXiv:2010.10241, 2020. 1, 3, 6, 8 puter Vision and Pattern Recognition, pages 6509–
[59] Zhiqiang Shen, Zechun Liu, Zhuang Liu, Marios Sav- 6518, 2020.
vides, Trevor Darrell, and Eric Xing. Un-mix: Re- [70] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint un-
thinking image mixtures for unsupervised visual rep- supervised learning of deep representations and image
resentation learning. In Proceedings of the AAAI Con- clusters. In Proceedings of the IEEE conference on
ference on Artificial Intelligence, pages 2216–2224, computer vision and pattern recognition, pages 5147–
2022. 3 5156, 2016. 2
[60] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krish- [71] Mang Ye, Xu Zhang, Pong C Yuen, and Shih-Fu
nan, Cordelia Schmid, and Phillip Isola. What makes Chang. Unsupervised embedding learning via invari-
for good views for contrastive learning? Advances ant and spreading instance feature. In Proceedings of
in neural information processing systems, 33:6827– the IEEE/CVF conference on computer vision and pat-
6839, 2020. 1 tern recognition, pages 6210–6219, 2019. 2
[61] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, [72] Yang You, Igor Gitman, and Boris Ginsburg. Large
Cordelia Schmid, and Phillip Isola. What makes for batch training of convolutional networks. arXiv
good views for contrastive learning? In Proceedings preprint arXiv:1708.03888, 2017. 6

16
[73] Sangdoo Yun, Dongyoon Han, Seong Joon Oh,
Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo.
Cutmix: Regularization strategy to train strong classi-
fiers with localizable features. In Proceedings of the
IEEE/CVF international conference on computer vi-
sion, pages 6023–6032, 2019. 3
[74] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and
Stéphane Deny. Barlow twins: Self-supervised learn-
ing via redundancy reduction. In International Con-
ference on Machine Learning, pages 12310–12320.
PMLR, 2021. 1, 2, 3, 8
[75] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin,
and David Lopez-Paz. mixup: Beyond empirical
risk minimization. arXiv preprint arXiv:1710.09412,
2017. 2, 3, 5
[76] Richard Zhang, Phillip Isola, and Alexei A. Efros.
Colorful image colorization. In Computer Vision –
ECCV 2016, pages 649–666, Cham, 2016. Springer
International Publishing. 1
[77] Yuanyi Zhong, Haoran Tang, Junkun Chen, Jian Peng,
and Yu-Xiong Wang. Is self-supervised contrastive
learning more robust than supervised learning?, 2022.
1
[78] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins.
Local aggregation for unsupervised learning of vi-
sual embeddings. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
6002–6012, 2019. 2

17
