Recurrent Joint Embedding Predictive Architecture with Recurrent Forward Propagation Learning

Abstract
Conventional computer vision models rely on very deep, feedforward networks processing whole
images and trained offline with extensive labeled data. In contrast, biological vision relies on com-
paratively shallow, recurrent networks that analyze sequences of fixated image patches, learning con-
tinuously in real-time without explicit supervision. This work introduces a vision network inspired
by these biological principles. Specifically, it leverages a joint embedding predictive architecture
[1] incorporating recurrent gated circuits [2]. The network learns by predicting the representation
of the next image patch (fixation) based on the sequence of past fixations, a form of self-supervised
learning. We show mathematically and empirically that the training algorithm avoids the problem of
representational collapse. We also introduce Recurrent Forward Propagation, a learning algorithm
that avoids biologically unrealistic backpropagation through time [3] or memory-inefficient real-
time recurrent learning [4]. We show mathematically that the algorithm implements exact gradient
descent for a large class of recurrent architectures, and confirm empirically that it learns efficiently.
This paper focuses on these theoretical innovations and leaves empirical evaluation of performance
in downstream tasks, and analysis of representational similarity with biological vision for future
work.
1 Introduction
One of the most significant challenges for our visual system is the need to integrate visual information across the
constant movements of our eyes [5, 6]. We move our eyes in rapid saccades several times per second, holding fixation
only for fractions of a second, yet our brain constructs a coherent and stable representation of the visual scene. How
the brain accomplishes this remains an open question.
Visual processing in the primate brain is structured in consecutive visual processing areas that extract increasingly
more complex features of the visual input [7]. In every area, the visual field is processed similarly across locations.
This structure has been emulated with artificial neural networks, organized in consecutive layers, with convolutions
at every layer emulating the spatially invariant processing [8]. A deep stack of such convolutional neural networks
(CNNs) achieves impressive results in static image analysis tasks; however, such networks do not take into account the temporal
dynamics inherent in visual processing. For processing sequential data, Recurrent Neural Networks (RNNs) are one
of the preferred models due to their ability to model long-range dependencies and capture the evolution of temporal
patterns over time [9]. Indeed, biological vision is dominated by recurrent feedback, both within brain areas as well
as between areas [10]. Therefore, the combination of recurrent connections and convolutional layers could effectively
address challenges such as video understanding, action recognition, and object tracking. There are a few recent efforts
to incorporate recurrence into vision models [2, 11, 12, 13]. However, it is not clear how such networks should handle
visual input that is constantly changing due to eye movements, or how such networks should be trained.
The conventional approach to training computer vision models is supervised learning, i.e., using images as input
and labels for those images as output. However, humans and animals do not receive explicit labels for every object
in a scene; instead, they learn patterns, relationships, and regularities from continuous experience. Most learning
occurs implicitly, which is more consistent with Self-Supervised Learning (SSL) techniques. SSL is a paradigm in
machine learning where a model uses the input itself as targets for learning. Often, the model is trained to predict an
unobserved part of the input from the observed parts. In this way, the model can learn to relate multiple "views" or
multiple modalities in large datasets without relying on explicit labels (e.g., predict images from associated audio or
text).
In this work, we present a new biologically inspired architecture that integrates convolutional layers, recurrent net-
works, and SSL using next-step prediction. Specifically, the network is trained to predict a representation of a fixated
image patch, based on previous fixations. In doing so, the recurrent architecture can and should integrate information
across multiple fixations, thus forming a holistic representation of a scene. Our model follows the joint-embedding
predictive architecture (JEPA) [14, 15] and extends this to include recurrent networks, in short, R-JEPA.
The specific network architecture we propose here borrows from previous efforts to train vision networks with con-
trastive [16] or predictive SSL [17], essentially, a ResNet50. The recurrent structure we use borrows from previous
recurrent vision models that have used supervised learning [2] or contrastive learning [18], which still requires labels
in the form of same/different exemplars. We eliminate the need for labels entirely by using next-step prediction as an SSL
objective.
The two main contributions of this paper are theoretical. First, we show mathematically, and confirm in simulations that
the R-JEPA avoids the problem of representational collapse, despite lacking negative exemplars used for contrastive
learning or otherwise enforcing diversity of representations, as in [19, 20]. Second, we show that the network can
be trained in real-time with a forward propagation of sensitivity for each weight in the recurrent network, rather than
costly back-propagation through time, usually required for recurrent networks. We show the conditions under which
this efficient Recurrent Forward Propagation is an exact gradient computation. Without realizing these conditions,
previous work on efficient real-time recurrent learning only offered approximate solutions [21, 22, 23, 24]. We confirm
in simulation that the network learns effectively when applying Recurrent Forward Propagation in real-time.
2 Methods
R-JEPA is a novel self-supervised recurrent model for the processing of high-dimensional time series x(t) (see Figure
1). Like JEPA [14], our model has two key elements:
1. embedding: The architecture defines an embedding space H through a function Enc : X → H that transforms the trajectory x(t) in a high-dimensional space X into a trajectory h(t) in H.
2. predictive architecture: It can learn to predict h(t) from h(t − ∆) with a function G : H → H.
The representations in the embedding space h(t) must satisfy the following properties: (1) they should be maximally informative about x(t), and (2) h(t) must be easily predictable from h(t − ∆). Property (1) prevents representational collapse (i.e., h(t) degenerating to a single point in H, or becoming only weakly informative). Property (2) is enforced by a loss function $L_R$ defined in the embedding space H that captures the quality of the prediction.
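For concreteness, one instance used later in this paper (see Theorem 1 and Appendix A) is the squared prediction error,
$$L_R(t) = \big\|h(t) - \hat h(t)\big\|^2, \qquad \hat h(t) = G\big(h(t-\Delta)\big).$$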
Our goal for R-JEPA is to mimic and extend how humans perceive and comprehend the world around them and
anticipate future events based on experiences. We expect R-JEPA to be able to learn temporal dynamics and causal
relationships. Here we develop the concept in the context of vision, but JEPA has more generally been used in a
multimodal context of vision, hearing and action.
R-JEPA is trained to predict representations of the future from representations of the past and present. For this, we
use an encoder Enc based on Recurrent Neural Networks (RNNs). RNNs compute time-variant internal states whose
transition is determined by the information from the past and input at present. The context vector c(t) represents
accumulated past information that the RNN uses for processing the new input x(t). We select a context vector that
is composed of two components c(t) = (s(t), m(t)), where s(t) is the internal state of the network and m(t) is a
memory signal. Finally, the trajectory h(t) in the embedding space is a projection of the context vector c(t).
Mathematically, the recurrent encoder and embedding can be written as
$$c(t) = C\big(c(t-1), x(t);\, \theta_c\big), \qquad h(t) = F\big(c(t)\big),$$
where $C$ is the recurrent update with parameters $\theta_c$, and $F$ is the projection onto the embedding space (the notation used again in Appendix A).
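To make the data flow concrete, the following minimal NumPy sketch traces one R-JEPA step. All dimensions, weight names, and the toy recurrence are our own illustrative assumptions, not the actual encoder of Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_s, d_m, d_h = 32, 64, 64, 16                   # hypothetical sizes
W_in  = rng.normal(size=(d_s, d_x)) / np.sqrt(d_x)    # input -> state
W_sm  = rng.normal(size=(d_m, d_s)) / np.sqrt(d_s)    # state -> memory
W_prj = rng.normal(size=(d_h, d_s + d_m)) / np.sqrt(d_s + d_m)  # c -> h
W_G   = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)    # predictor G: H -> H

def step(c_prev, x):
    """One step: c(t) = C(c(t-1), x(t)); h(t) is a projection of c(t)."""
    s_prev, m_prev = c_prev
    s = np.tanh(W_in @ x + np.tanh(m_prev) * s_prev)  # stand-in gated recurrence
    m = 0.9 * m_prev + np.tanh(W_sm @ s)              # stand-in memory trace
    h = W_prj @ np.concatenate([s, m])
    return (s, m), h

c = (np.zeros(d_s), np.zeros(d_m))     # s(0) = m(0) = 0
h_hat, loss, T = None, 0.0, 10
for x in rng.normal(size=(T, d_x)):    # a sequence of fixated patches
    c, h = step(c, x)
    if h_hat is not None:              # L_R(t) = ||h(t) - h_hat(t)||^2
        loss += np.sum((h - h_hat) ** 2)
    h_hat = W_G @ h                    # prediction for the next step
print("mean prediction loss:", loss / (T - 1))
```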
Figure 1: Recurrent Joint Embedding Predictive Architecture (R-JEPA). For an input x(t), the encoder generates
a representation h(t). Then, the predictor generates a prediction ĥ(t + 1) of the representation of the next input. The
objective of the encoder and predictor is to minimize the prediction loss LR in the embedding space H. At the same
time, the context vector c(t) is used to predict the action/response â(t) to stimuli x(t).
Like JEPA, R-JEPA can also produce a behavioral response to input stimuli x(t), which we’ll refer to as actions. For
example, an action might be an eye movement (e.g., saccades) while viewing a static image. To infer an action â(t),
information from the context vector c(t) will be helpful, as this will aid in predicting the next representation ĥ(t + 1).
In certain experiments, this action a(t) can be measured and compared to the model’s predictions â(t). For this work,
we will assume that the sequence of actions is given, i.e. the eye movements are given to us, and we leave scan-path
prediction or “active vision” to future work. Effectively, we assume the sequence of fixated image patches x(t) is
given.
We focus on an architecture for processing the temporal sequences of fixated image patches. The encoder consists of
six processing areas to allow for learning of hierarchical representations typical for models of biological vision (see
Fig. 2) [25, 26, 2].
In this work, we use reciprocal gated circuits (RGCs) [2]:
Figure 2: Recurrent Encoder. We implemented an encoder based on ResNet50 [17] and the Reciprocal Gated Circuit [2]. The encoder is a hierarchical architecture: (1) the first five areas correspond to ResNet stages (depicted for Area 2 as stacked 3x3, 128 and 1x1, 256 convolution blocks), (2) Area 6 serves as the ResNet head, and (3) the Embedding is a linear projection. Areas 1-6 contain recurrent units.
1. Direct passthrough: if the internal state is initialized to zero at the first time step, s(0) = m(0) = 0, then the RGC allows feedforward input to pass on to the next area as in a standard feedforward network: s(1) = x(1). This allows the network to provide a fast initial response, which can be refined by recurrent iteration in time.
2. Gating, in which the value of an internal state determines how much of the bottom-up input is passed through, retained, or discarded at the next time step. RNNs with a gating mechanism can learn long-term dependencies in sequential data.
These properties have direct analogies to biological mechanisms: direct passthrough would correspond to feedforward
processing in time, and gating would correspond to adaptation to stimulus statistics across time [2].
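As a concrete illustration, the sketch below implements a gated cell of this form in NumPy. The update rule (self-gate b, cross-gate a, update c = b ⊙ c + (1 − a) ⊙ x, with f = tanh) is our reading of the RGC equations implied by Appendix C; the published RGC [2] may be parameterized differently.

```python
import numpy as np

n = 8
rng = np.random.default_rng(1)
# Four weight families, matching the pairs ss, sm, mm, ms used in Appendix C.
W = {k: rng.normal(size=(n, n)) * 0.1 for k in ("ss", "sm", "mm", "ms")}

def rgc_step(s_prev, m_prev, x):
    b_s = np.tanh(W["ss"] @ s_prev)   # state self-gate
    a_s = np.tanh(W["sm"] @ m_prev)   # memory -> state gate
    b_m = np.tanh(W["mm"] @ m_prev)   # memory self-gate
    a_m = np.tanh(W["ms"] @ s_prev)   # state -> memory gate
    s = b_s * s_prev + (1.0 - a_s) * x
    m = b_m * m_prev + (1.0 - a_m) * x
    return s, m

# Direct passthrough: with s(0) = m(0) = 0, all gates are tanh(0) = 0,
# so the first step reduces to s(1) = x(1).
s, m = np.zeros(n), np.zeros(n)
x1 = rng.normal(size=n)
s, m = rgc_step(s, m, x1)
assert np.allclose(s, x1)
```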
We call the first five areas low-level areas and the last the high-level area. The state of the network at time $t \in \{1, 2, ..., T\}$ is defined by the low-level context vectors $c^{Low}_l(t)$ and the high-level context vector $c^{H}(t)$, with dynamics $C_l$ and $C_H$, respectively. They differ in that low-level areas have spatial resolution (corresponding to retinotopy) and spatially invariant processing implemented with convolutions, whereas the high-level area operates in a feature space without explicit spatial resolution, implemented with "dense" connections.
In both $C_l$ and $C_H$ dynamics, $x$ represents the external input. For the low-level areas, $x^{Low}_l(t)$ is the integration of information from other areas,
$$x^{Low}_l(t) = F_l\left( c^{Low}_{l-1}(t-1) + \phi_l\left( \bigoplus_{k=l+1}^{N} S\big(c^{Low}_k(t-1)\big) \right) \right). \tag{1}$$
$F_l$ is a "ResNet stage" which involves downsampling, $S$ is an upsampling operation to match the spatial dimensions of activity from later areas, $\bigoplus$ indicates concatenation along the channel axis, and $\phi_l$ is a linear map combining the concatenated channels and reducing the channel dimension to match $c^{Low}_{l-1}(t)$.
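The sketch below walks through Eq. (1) for a single area with NumPy stand-ins: nearest-neighbor upsampling for $S$, a 1x1 linear map for $\phi_l$, and a stride-2 average pool standing in for the ResNet stage $F_l$. All shapes and operations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def upsample(c, hw):                       # S: nearest-neighbor to target size
    rep = hw // c.shape[1]
    return np.repeat(np.repeat(c, rep, axis=1), rep, axis=2)

def resnet_stage(z):                       # F_l stand-in: stride-2 average pool
    return 0.5 * (z[:, ::2, ::2] + z[:, 1::2, 1::2])

c_lm1  = rng.normal(size=(64, 16, 16))     # c_{l-1}(t-1): (channels, H, W)
c_late = [rng.normal(size=(128, 8, 8)),    # c_k(t-1) from areas k > l
          rng.normal(size=(256, 4, 4))]

fb = np.concatenate([upsample(c, 16) for c in c_late], axis=0)  # channel concat
phi = rng.normal(size=(64, fb.shape[0])) / np.sqrt(fb.shape[0]) # 1x1 linear map
fb_mixed = np.einsum("oc,chw->ohw", phi, fb)  # reduce channels to match c_{l-1}
x_l = resnet_stage(c_lm1 + fb_mixed)          # Eq. (1)
print(x_l.shape)                              # (64, 8, 8)
```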
3 R-JEPA avoids representational collapse

The general idea of joint embeddings is that related stimuli should have representational embeddings that are predictable from one another [15]. When learning a joint embedding, a standard problem is that of representational collapse, namely, all stimuli having the same representation. To prevent this trivial collapse, contrastive learning also aims to increase the contrast between the embeddings of stimuli that are not related, e.g., patches from different images [16]. To avoid the need for same/different labels, a number of methods attempt to generate representations that are decorrelated and preserve variance [19, 20]. Another approach has been to utilize architectural asymmetries (e.g., dual networks in [27], referred to as BYOL), or asymmetries in data modality [1]. A particularly simple solution, termed SimSiam, uses identical (Siamese) encoder networks but prevents gradient propagation in one of the encoder networks. That "stop-gradient" alone was found to prevent representational collapse [17]. Subsequent theoretical work showed this trick preserves a "balance" between the encoder and predictor, and that this balance prevents collapse [28]. Here we have implemented stop-gradient in branches of the architecture to prevent collapse (see pink lines in Fig. 1).
In this section, we show that for R-JEPA a balance between the Representation Predictor and the R-Encoder exists. It ensures that the encoder learns what the predictor learns, which is important because it is the encoder's representations that are used for downstream tasks.
Theorem 1 In R-JEPA, when minimizing the squared prediction error E under stable network dynamics, a linear representation predictor $W_{Gh}$ converges after repeated gradient descent iterations with weight decay to the following proportionality:
$$W_{Gh}^T W_{Gh} \propto HH^T. \tag{3}$$
Here $HH^T$ measures the covariance of the representations h(t) across time. The proof of this theorem for R-JEPA is in Appendix A and follows [28]. As we discuss in the next section, in practice the matrix $W_{Gh}^T W_{Gh}$ is approximately diagonal. This implies that the representations are decorrelated after iterating gradient descent, achieving a diverse representation without explicitly requiring decorrelation or constrained variance. While we derived this theorem for the squared prediction error, it applies equally to other error measures, such as the cosine distance.
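As a sanity check on Theorem 1, the toy NumPy sketch below trains a linear encoder and a linear predictor with stop-gradient and weight decay on an AR(1) sequence, then measures how far $W_{Gh}^T W_{Gh}$ is from commuting with $HH^T$ (proportional matrices commute). All dimensions, learning rates, and data are illustrative assumptions; this probes, rather than proves, the proportionality.

```python
import numpy as np

rng = np.random.default_rng(3)
d_x, d_h, T = 8, 4, 2000
rho, lr, eta = 0.9, 0.05, 1e-3        # assumed hyperparameters

x = np.zeros((T, d_x))
for t in range(1, T):                 # temporally correlated inputs
    x[t] = rho * x[t - 1] + rng.normal(size=d_x)

W_e = rng.normal(size=(d_h, d_x)) * 0.1   # linear encoder: h(t) = W_e x(t)
W_p = rng.normal(size=(d_h, d_h)) * 0.1   # linear predictor (W_Gh analogue)
for _ in range(3000):
    h = x @ W_e.T
    err = h[1:] - h[:-1] @ W_p.T          # stop-grad: no gradient through h[1:]
    g_p = -2 * err.T @ h[:-1] / (T - 1)
    g_e = -2 * (err @ W_p).T @ x[:-1] / (T - 1)
    W_p -= lr * (g_p + eta * W_p)         # gradient descent with weight decay
    W_e -= lr * (g_e + eta * W_e)

h = x @ W_e.T
HHt = h.T @ h / T                         # representation covariance H H^T
A = W_p.T @ W_p
# Shared eigenvectors => the commutator should be small relative to the norms.
comm = A @ HHt - HHt @ A
print("relative commutator:",
      np.linalg.norm(comm) / (np.linalg.norm(A) * np.linalg.norm(HHt)))
```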
Figure 3: R-JEPA avoids collapse. (a) A 2D projection of the trajectory of features h(t) for different inputs x(t) using PCA. (b) Distribution of eigenvalues of the correlation matrix of features, i.e., $HH^T$, plotted against feature dimension.
4 Recurrent Forward Propagation

Biological neural networks can efficiently perform credit assignment under spatiotemporal locality constraints [30].
In machine learning, Backpropagation (BP) is the standard solution to the credit assignment problem in feedforward
networks [31] and BP-through-time (BPTT) for recurrent networks [3]. Both methods rely on biologically implausible
assumptions [32, 33], particularly regarding temporal locality. For example, BPTT cannot be performed online, as errors are calculated retrospectively only at the end of a task (i.e., processing of the complete time sequence $\{x(t)\}_{t=1}^{T}$). This requires either storing the entire network history and/or recomputing it at every update [30].
There are alternative gradient forward propagation algorithms [21], which are temporally local, referred to as online or real-time learning. [4] presents one of the canonical examples, called Real-Time Recurrent Learning (RTRL), which computes how each parameter affects the hidden states at each time step. However, the required sensitivity tensor γ grows with order $O(n^3)$, with n being the number of nodes in the network. This makes RTRL computationally prohibitive in large networks and more costly than BPTT (order $O(n^2)$).
In this section, we show that the computational cost of RTRL for recurrent gated circuits (RGCs) is only $O(n^2)$. We refer to the resulting algorithm as Recurrent Forward Propagation (RFP). We prove that RFP is applicable more generally to recurrent networks that contain only two-point interactions. We then show empirical results with both BPTT and RFP.
4.1 Reducing the computational cost and memory storage of the Recurrent Gated Circuit
Given the loss function $E = E_t[L_R(t)]$, the gradient can be decomposed over time as
$$\frac{\partial E}{\partial W} = \frac{1}{T}\sum_t \frac{\partial L_R(t)}{\partial W}.$$
RTRL is an online algorithm because it computes each term $\frac{\partial L_R(t)}{\partial W}$ using dynamic updates of a tensor γ. This tensor measures the sensitivity of each of the n nodes in the network to changes in each of the $n^2$ connections between nodes. Hence, it has a size of $n^3$. The update equation of γ is deterministic and in closed form. However, its specific form depends on the equation of the recurrence.
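To make the $O(n^3)$ cost concrete, here is textbook RTRL [4] for a vanilla RNN with a finite-difference check, in NumPy. This is the general-case baseline that the RGC-specific recursion of Appendix C improves upon; the network, loss, and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 5, 6
W = rng.normal(size=(n, n)) * 0.5
xs = rng.normal(size=(T, n))
target = rng.normal(size=n)

def loss_of(W):
    s = np.zeros(n)
    for x in xs:
        s = np.tanh(W @ s + x)
    return 0.5 * np.sum((s - target) ** 2)  # loss only at the final step

# Forward pass with sensitivity tensor P[i, p, q] = ds_i(t)/dW_pq  (size n^3).
s = np.zeros(n)
P = np.zeros((n, n, n))
for x in xs:
    s_new = np.tanh(W @ s + x)
    d = 1.0 - s_new ** 2                         # tanh'(pre-activation)
    P_new = np.einsum("i,ij,jpq->ipq", d, W, P)  # recurrent term
    for p in range(n):                           # immediate term: delta_ip * s_q
        P_new[p, p, :] += d[p] * s
    s, P = s_new, P_new

grad_rtrl = np.einsum("i,ipq->pq", s - target, P)

# Finite-difference check: RTRL propagates exact gradients forward in time,
# even though the loss only arrives at the end of the sequence.
eps, p, q = 1e-6, 1, 3
Wp = W.copy(); Wp[p, q] += eps
num = (loss_of(Wp) - loss_of(W)) / eps
print(grad_rtrl[p, q], num)   # the two values should agree closely
```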
In Appendix C, we derive the update equation of γ for the recurrent gated circuit (RGC) and observe that its memory storage and computational cost are only $O(n^2)$ due to the specific structure of the recurrence. More precisely, we find
$$\frac{\partial L_R(t)}{\partial W_{pq}} = \frac{\partial L_R(t)}{\partial s_p(t-1)}\,\Gamma_{pq}(t-1), \tag{4}$$
where $\Gamma \in \mathbb{R}^{n\times n}$ is a sensitivity matrix and $s \in \mathbb{R}^n$ is the current state of the network. The forward recursion equation of Γ is presented in Appendix C (see Eq. 14).
In Appendix D, we show that the forward recursion for Γ provides an exact gradient computation for a more general class of recurrent networks that satisfy a two-point interaction property. This stands in contrast to previous versions of forward propagation methods that did not appreciate this condition and therefore obtained only approximate forward gradient computations [21, 24, 22, 23].
Note that $L_R(t)$ is the instantaneous loss. Equation (4) indicates that one obtains the gradient for learning by combining the sensitivity tensor with the dependence of the instantaneous loss on the current network state. If the loss only becomes available at a later time, say at the end of a sequence, the sensitivity matrix integrates the effects of the weights on the future network states. Therefore, there is no need to propagate the error back in time. In Appendix E we show that this approach works, provided the total loss is a sum (of functions) of the instantaneous losses, as sketched below. For instance, the instantaneous losses can be zero until the end of a sequence. If instantaneous losses are available, then gradient updates can be applied online at every time step (as in stochastic gradient descent), after batches of data have been observed, or at the end of a sequence, with the usual relative merits of each approach.
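For example, consistent with Appendix E (Eq. 20), if the total loss has the form $E = \sum_{t=1}^{T}\psi_t\big(L_R(t)\big)$, then
$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T}\psi_t'\big(L_R(t)\big)\,\frac{\partial L_R(t)}{\partial W},$$
and each summand is available online from the forward-propagated sensitivities; a terminal-only loss corresponds to $\psi_t$ weighting all but the final instantaneous loss by zero.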
Finally, note that the factor $\frac{\partial L_R(t)}{\partial s(t-1)}$ can be straightforwardly computed because there is no recurrent computation between s(t − 1) and $L_R(t)$. However, if the network is a multilayer hierarchy, this term does require standard (spatial) backpropagation.
We now empirically evaluate and compare the training of R-JEPA using both algorithms: BPTT and RFP. We used the
same data and network structure as in the empirical evaluation above. The network is trained on sequences of length
100 of consecutive fixations selected at random from the full-length films. We had in total 29000 segments for training
and 7000 for testing. Before training the network shows no evidence of integration of information across time, i.e.
fixations (Fig. 4). After learning, the prediction error decreases with time (the sequence of fixations), suggesting that
the information carried by the context vector allows the network to progressively improve its prediction of what it will
"see" next in a movie scene.
We use convolutional layers from a pretrained ResNet-50, which are frozen during training. The weights of the
predictor (MLP) were randomly initialized under the condition that ĥ(t + 1) ≈ h(t) (i.e., equivalent to SimCLR). The
weights of the RGC were initialized to zero, ensuring no temporal dynamics at Epoch 0 (see Fig. 4 left). When the
network is trained with BPTT, LR (t) tends to increase at the first time step (t = 1) before decreasing. This is expected
behavior for BPTT, as at the start of the sequence, the encoder lacks sufficient feedback from later time steps to make
accurate predictions. As the sequence progresses, the feedback effects accumulate, allowing the model to improve. In
the case of RFP, a uniform decrease in $L_R(t)$ is characteristic (see Fig. 4, right).
Figure 4: Prediction error as a function of time in video. Representation loss indicates the ability of the recurrent network to predict the content of the next fixation (image patch). Curves are the average over 7000 fixation sequences in the test data. Network behavior is shown before (Epoch = 0) and after learning (Epoch = 6). Time steps indicate the number of image patches (fixations) from the start of the recurrent iteration, i.e., the start of the test fixation sequences. The drop with time steps indicates that the network accumulates information in the context vector, allowing it to progressively improve its prediction.
5 Conclusion
We have introduced a recurrent version of JEPA motivated by the recurrence observed in biological vision networks. We made two important theoretical contributions. First, we demonstrated that next-step prediction with a stop-gradient achieves diverse representations, thus avoiding representational collapse. Second, we derived general conditions under which recurrent networks can be efficiently trained online with Recurrent Forward Propagation. Future work will explore more complete recurrent architectures with feedback at all areas of the visual processing hierarchy (as indicated in Fig. 1), and will evaluate the performance of trained networks on downstream video processing tasks. To facilitate this, we are sharing the ongoing development of the R-JEPA architecture on GitHub. We also intend to compare network representations with neural activity, similar to previous efforts [25, 2, 18], as soon as neural data during free-viewing of videos becomes available.
6 Acknowledgment
We would like to thank Jens Madsen for providing saccade data from participants watching movies and the corresponding time-aligned video files used in the empirical demonstrations here.
Appendix
A Proof of Theorem 1
Consider the following R-JEPA, with a linear representation predictor $W_{Gh}$ and an action pathway $W_{Ga}W_A$ acting on the low-level context $c^{Low}(t)$; the squared prediction error with stop-gradient is given below.
$$\frac{\partial E}{\partial W_{Ga}} = \frac{\lambda_1}{2}\frac{\partial}{\partial W_{Ga}}\,\mathrm{Tr}\Big[W_{Ga}W_A R_{c^{Low}} W_A^T W_{Ga}^T - 2\,W_{Ga}W_A E_t[c^{Low}(t-1)h(t)^T] + 2\,W_{Ga}W_A E_t[c^{Low}(t)h(t)^T]W_{Gh}^T\Big]$$
$$= \lambda_1\Big[W_{Ga}W_A R_{c^{Low}} W_A^T - E_t[h(t)c^{Low}(t-1)^T]W_A^T + W_{Gh}E_t[h(t)c^{Low}(t)^T]W_A^T\Big].$$
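This derivative uses the standard matrix-calculus identities
$$\frac{\partial}{\partial W}\mathrm{Tr}\big[W A W^T\big] = 2WA \ \ (A \text{ symmetric}), \qquad \frac{\partial}{\partial W}\mathrm{Tr}\big[W B\big] = B^T,$$
applied with $W = W_{Ga}$, $A = W_A R_{c^{Low}} W_A^T$ for the quadratic term, and the remaining factors as $B$ for the linear terms.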
The squared prediction error with stop-gradient is
$$E_t\big[\|h(t)-\hat h(t)\|^2\big] = \frac{1}{T}\sum_{t=1}^{T}\big\|\mathrm{stop}[h(t)] - \hat h(t)\big\|^2 = \frac{1}{T}\sum_{t=1}^{T}\big\|\mathrm{stop}[h(t)] - W_{Gh}h(t-1) - W_{Ga}W_A c^{Low}(t-1)\big\|^2.$$
The result is
$$\frac{\partial E}{\partial h(q)} = \frac{1}{T}\frac{\partial}{\partial h(q)}\sum_{t=1}^{T}\big\|\mathrm{stop}[h(t)] - W_{Gh}h(t-1) - W_{Ga}W_A c^{Low}(t-1)\big\|^2$$
$$= -\frac{1}{T}\sum_{t=1}^{T}\big[h(t) - W_{Gh}h(t-1) - W_{Ga}W_A c^{Low}(t-1)\big]^T W_{Gh}\,\delta_{q,t-1}$$
$$= \frac{\Theta(0\le q\le T-1)}{T}\big[-h(q+1) + W_{Gh}h(q) + W_{Ga}W_A c^{Low}(q)\big]^T W_{Gh}$$
$$= \frac{\Theta(0\le q\le T-1)}{T}\big[-h(q+1)^T + h(q)^T W_{Gh}^T + c^{Low}(q)^T W_A^T W_{Ga}^T\big]W_{Gh}.$$
Gradient descent with weight decay induces the following dynamics:
$$\delta h(t) = \frac{\partial F}{\partial c(t)}\frac{\partial C}{\partial \theta_c}\,\delta\theta_c - \eta\,\frac{\partial F}{\partial c(t)}\frac{\partial C}{\partial \theta_c}\,\theta_c$$
$$\delta W_{Gh} = -\frac{\partial E}{\partial W_{Gh}} - \eta W_{Gh}$$
$$\delta W_{Ga} = -\frac{\partial E}{\partial W_{Ga}} - \eta W_{Ga}$$
where η is a weight decay parameter. This differential system is a good approximation in the limit of large batch sizes and small discrete-time learning rates [34].
By gradient descent, the term $\delta\theta_c = -\big(\frac{\partial E}{\partial\theta_c}\big)^T$ is equal to $-\big(\frac{\partial E}{\partial h(t)}\frac{\partial F}{\partial c(t)}\frac{\partial C}{\partial\theta_c}\big)^T$. Then,
$$\delta h(t) = -M\,\frac{\partial C}{\partial\theta_c}\frac{\partial C}{\partial\theta_c}^T M^T\,\frac{\partial E}{\partial h(t)}^T - \eta\,M\,\frac{\partial C}{\partial\theta_c}\,\theta_c \tag{5}$$
$$\delta W_{Gh} = -\frac{\partial E}{\partial W_{Gh}} - \eta W_{Gh} \tag{6}$$
$$\delta W_{Ga} = -\frac{\partial E}{\partial W_{Ga}} - \eta W_{Ga} \tag{7}$$
where $M = \frac{\partial F}{\partial c(t)} = [0\,|\,I] \in \mathbb{R}^{d_{c_H}\times d_c}$ with $d_c = d_{c_L} + d_{c_H}$. We denote $H = \frac{1}{\sqrt{T}}[h(1), \ldots, h(T)]$.
Therefore,
$$L(t) = \frac{\partial C}{\partial\theta_c}\frac{\partial C}{\partial\theta_c}^T \approx \begin{pmatrix} c_{1,t-1}c_{1,t-1}^T & 0 & \cdots & 0\\ 0 & c_{2,t-1}c_{2,t-1}^T & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & c_{N+1,t-1}c_{N+1,t-1}^T \end{pmatrix} + \begin{pmatrix} I_{d_1}\|x_t\|^2 & 0 & \cdots & 0\\ 0 & I_{d_2}\|c_{1,t}\|^2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & I_{d_{N+1}}\|c_{N,t}\|^2 \end{pmatrix},$$
and
$$\frac{\partial C}{\partial\theta_c}\,\theta_c \approx c(t).$$
Also, $M L(t) M^T = h(t-1)h(t-1)^T + \|c_{N,t}\|^2 I$. We will use the normalization $\|c_N(t)\|^2 = 1$.
A.5 Balancing
$$W_{Gh}^T\,\delta W_{Gh} = -W_{Gh}^T\frac{\partial E}{\partial W_{Gh}} - \eta\,W_{Gh}^T W_{Gh} = \lambda_1\big[W_{Gh}^T R_1 - W_{Gh}^T W_{Gh}R_0 - W_{Gh}^T W_{Ga}W_A E_t[c^{Low}(t)h(t)^T]\big] - \eta\,W_{Gh}^T W_{Gh} = \lambda_1\big[\delta H\,H^T + \eta R_0 - Y\big] - \eta\,W_{Gh}^T W_{Gh},$$
and
$$W_{Ga}^T\,\delta W_{Ga} = -W_{Ga}^T\frac{\partial E}{\partial W_{Ga}} - \eta\,W_{Ga}^T W_{Ga} = -\lambda_1 W_{Ga}^T\big[W_{Ga}W_A R_{c^{Low}}W_A^T - E_t[h(t)c^{Low}(t-1)^T]W_A^T + W_{Gh}E_t[h(t)c^{Low}(t)^T]W_A^T\big] - \eta\,W_{Ga}^T W_{Ga}.$$
Finally,
$$W_{Gh}^T\,\delta W_{Gh} + \delta W_{Gh}^T\,W_{Gh} + 2\eta\,W_{Gh}^T W_{Gh} = \lambda_1\big[\delta H\,H^T + \delta H^T H + 2\eta R_0\big] - \lambda_1\big[Y + Y^T\big].$$
In the limit of many iterations, this balance converges to
$$W_{Gh}^T W_{Gh} = \lambda_1\,HH^T \qquad (s\to\infty). \tag{8}$$
B Feature correlations
Finally,
$$a_i^{(\nu)} = f\Big(\sum_j W_{ij}^{(2\nu+1)}\, c_j^{(1-\nu)}(t-1)\Big), \qquad b_i^{(\nu)} = f\Big(\sum_j W_{ij}^{(2\nu)}\, c_j^{(\nu)}(t-1)\Big),$$
where we use the index k = 0, 1, 2, 3 to indicate the pairs ss, ms, mm, and sm, respectively.
The initial condition $\gamma^{\nu,k}_{ipq}(0) = 0$ implies $\gamma^{\nu,k}_{ipq}(t) = \Gamma^{\nu,k}_{pq}(t)\,\delta_{ip}$. Then, the temporal evolution of γ can be written as:
$$\Gamma^{\nu,k}(t) = \mu^{\nu,0}(t)\odot\Gamma^{\nu,k}(t-1) + \mu^{\nu,1}(t)\odot\Gamma^{1-\nu,k}(t-1) + \delta_{k/\!/2,\,\nu}\,J^{\nu,\,k\%2}$$
$$\mu^{\nu,0}(t) = \big[1-b^{(\nu)2}\big]\odot c^{(\nu)}(t-1)\odot\mathrm{diag}(W^{(2\nu)}) + b^{(\nu)}$$
$$\mu^{\nu,1}(t) = -\big[1-a^{(\nu)2}\big]\odot I(t)\odot\mathrm{diag}(W^{(2\nu+1)}) \tag{14}$$
$$J^{\nu,0}(t) = \big[1-b^{(\nu)2}\big]\odot c^{(\nu)}(t-1)\,c^{(\nu)}(t-1)^T$$
$$J^{\nu,1}(t) = -\big[1-a^{(\nu)2}\big]\odot I(t)\,c^{(1-\nu)}(t-1)^T$$
Finally,
$$\frac{\partial L_R(t)}{\partial W^k_{pq}} = \frac{\partial L_R(t)}{\partial s(t-1)}\frac{\partial s(t-1)}{\partial W^k_{pq}} = \sum_i \frac{\partial L_R(t)}{\partial s_i(t-1)}\,\gamma^{0,k}_{ipq}(t-1) = \frac{\partial L_R(t)}{\partial s_p(t-1)}\,\Gamma^{0,k}_{pq}(t-1).$$
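As a companion to these equations, the NumPy sketch below transcribes the recursion (14) and the readout above for the $W^{(ss)}$ weight family (ν = 0, k = 0), assuming f = tanh and the gated update form sketched in Section 2. Broadcasting follows $\gamma_{ipq}(t) = \Gamma_{pq}(t)\,\delta_{ip}$, so the μ vectors scale the rows (index p) of Γ. This is our reading of the notation, not reference code.

```python
import numpy as np

n = 6
rng = np.random.default_rng(5)
Wss = rng.normal(size=(n, n)) * 0.1   # W^(0): b-gate of the state
Wsm = rng.normal(size=(n, n)) * 0.1   # W^(1): a-gate of the state
Wmm = rng.normal(size=(n, n)) * 0.1   # W^(2): b-gate of the memory
Wms = rng.normal(size=(n, n)) * 0.1   # W^(3): a-gate of the memory

G_s = np.zeros((n, n))   # Gamma^{0,0}: sensitivity of s to W^(ss)
G_m = np.zeros((n, n))   # Gamma^{1,0}: sensitivity of m to W^(ss)
s, m = np.zeros(n), np.zeros(n)
for t in range(10):
    x = rng.normal(size=n)                          # input I(t)
    b_s, a_s = np.tanh(Wss @ s), np.tanh(Wsm @ m)
    b_m, a_m = np.tanh(Wmm @ m), np.tanh(Wms @ s)
    # Eq. (14) coefficients (nu = 0: state; nu = 1: memory), k = 0 (ss weights):
    mu00 = (1 - b_s**2) * s * np.diag(Wss) + b_s
    mu01 = -(1 - a_s**2) * x * np.diag(Wsm)
    mu10 = (1 - b_m**2) * m * np.diag(Wmm) + b_m
    mu11 = -(1 - a_m**2) * x * np.diag(Wms)
    J00 = ((1 - b_s**2) * s)[:, None] * s[None, :]  # immediate term J^{0,0}
    G_s, G_m = (mu00[:, None] * G_s + mu01[:, None] * G_m + J00,
                mu10[:, None] * G_m + mu11[:, None] * G_s)
    s, m = b_s * s + (1 - a_s) * x, b_m * m + (1 - a_m) * x  # assumed update

# Online readout in the spirit of Eq. (4), with an arbitrary stand-in for the
# loss gradient with respect to the state; storage stays O(n^2) throughout.
dL_ds = s - 1.0
grad_ss = dL_ds[:, None] * G_s
print(grad_ss.shape)   # (n, n)
```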
Using the initial condition γ(0) = 0, we obtain $\gamma_{k,(ij)}(t) = \delta_{ki}\,\Gamma_{ij}(t)$.
Under the two-point interaction hypothesis, the size of Γ and the number of necessary operations is $O(n^2)$ in this case. This is a complexity reduction over the general case of RTRL (i.e., $np \sim n^3$). We refer to this more efficient learning rule as Recurrent Forward Propagation.
These types of loss functions ensure that RTRL can be applied, as each term in the following sum can be computed online:
$$\frac{\partial E}{\partial W} = \sum_{t=1}^{T}\left.\frac{\partial\psi}{\partial L}\right|_{t,\,L(t)}\frac{\partial L(t)}{\partial W}. \tag{20}$$
References
[1] Quentin Garrido, Mahmoud Assran, Nicolas Ballas, Adrien Bardes, Laurent Najman, and Yann LeCun. Learning
and Leveraging World Models in Visual Representation Learning, March 2024. arXiv:2403.00504.
[2] Aran Nayebi, Javier Sagastuy-Brena, Daniel M. Bear, Kohitij Kar, Jonas Kubilius, Surya Ganguli, David Sussillo,
James J. DiCarlo, and Daniel L. K. Yamins. Recurrent Connections in the Primate Ventral Visual Stream Mediate
a Trade-Off Between Task Performance and Network Size During Core Object Recognition. Neural Computation,
34(8):1652–1675, July 2022.
[3] P.J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, October 1990.
[4] Ronald J. Williams and David Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1(2):270–280, June 1989.
[5] Garry Kong, Lisa M. Kroell, Sebastian Schneegans, David Aagten-Murphy, and Paul M. Bays. Transsaccadic
integration relies on a limited memory resource. Journal of Vision, 21(5):24, May 2021.
[6] Leonie Oostwoud Wijdenes, Louise Marshall, and Paul M. Bays. Evidence for Optimal Integration of Visual
Feature Representations across Saccades. The Journal of Neuroscience: The Official Journal of the Society for
Neuroscience, 35(28):10146–10153, July 2015.
[7] National Research Council (US) Committee on Vision. Information Processing in the Primate Visual System.
In Advances in the Modularity of Vision: Selections From a Symposium on Frontiers of Visual Science. National
Academies Press (US), 1990.
[8] Alessia Celeghin, Alessio Borriero, Davide Orsenigo, Matteo Diano, Carlos Andrés Méndez Guerrero, Alan Perotti, Giovanni Petri, and Marco Tamietto. Convolutional neural networks for vision neuroscience: significance, developments, and outstanding issues. Frontiers in Computational Neuroscience, 17, July 2023.
[9] Robin M. Schmidt. Recurrent Neural Networks (RNNs): A gentle Introduction and Overview, November 2019.
arXiv:1912.05911.
[10] Charles D. Gilbert and Wu Li. Top-down influences on visual processing. Nature Reviews. Neuroscience,
14(5):350–363, May 2013.
[11] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recur-
rent Networks for Sequence Modeling, April 2018. arXiv:1803.01271.
[12] Md Zahangir Alom, Mahmudul Hasan, Chris Yakopcic, and Tarek M. Taha. Inception Recurrent Convolutional
Neural Network for Object Recognition, April 2017. arXiv:1704.07709.
[13] Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate
Saenko, and Trevor Darrell. Long-term Recurrent Convolutional Networks for Visual Recognition and Descrip-
tion, May 2016. arXiv:1411.4389.
[14] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun,
and Nicolas Ballas. Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture,
April 2023. arXiv:2301.08243.
[15] Anna Dawid and Yann LeCun. Introduction to latent variable energy-based models: a path toward autonomous machine intelligence. Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104011, October 2024.
[16] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A Simple Framework for Contrastive
Learning of Visual Representations, July 2020. arXiv:2002.05709.
[17] Xinlei Chen and Kaiming He. Exploring Simple Siamese Representation Learning, November 2020.
arXiv:2011.10566.
[18] Chengxu Zhuang, Siming Yan, Aran Nayebi, Martin Schrimpf, Michael C. Frank, James J. DiCarlo, and Daniel L. K. Yamins. Unsupervised neural network models of the ventral visual stream. Proceedings of the National Academy of Sciences, 118(3):e2014196118, January 2021.
[19] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow Twins: Self-Supervised Learning
via Redundancy Reduction, June 2021. arXiv:2103.03230.
[20] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-Invariance-Covariance Regularization for Self-
Supervised Learning, January 2022. arXiv:2105.04906.
[21] Owen Marschall, Kyunghyun Cho, and Cristina Savin. A Unified Framework of Online Learning Algorithms for
Training Recurrent Neural Networks, July 2019. arXiv:1907.02649.
[22] Kazuki Irie, Anand Gopalakrishnan, and Jürgen Schmidhuber. Exploring the Promise and Limits of Real-Time
Recurrent Learning, February 2024. arXiv:2305.19044.
[23] Wulfram Gerstner, Marco Lehmann, Vasiliki Liakoni, Dane Corneil, and Johanni Brea. Eligibility Traces and
Plasticity on Behavioral Time Scales: Experimental Support of NeoHebbian Three-Factor Learning Rules. Fron-
tiers in Neural Circuits, 12:53, July 2018.
[24] Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. A solution to the learning dilemma for recurrent networks of spiking neurons. Nature Communications, 11(1):3625, July 2020.
[25] Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365, March 2016.
[26] Aran Nayebi, Daniel Bear, Jonas Kubilius, Kohitij Kar, Surya Ganguli, David Sussillo, James J. DiCarlo,
and Daniel L. K. Yamins. Task-Driven Convolutional Recurrent Models of the Visual System, October 2018.
arXiv:1807.00053.
[27] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya,
Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray
Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised
Learning, September 2020. arXiv:2006.07733.
[28] Kang-Jun Liu, Masanori Suganuma, and Takayuki Okatani. Bridging the Gap from Asymmetry Tricks to Decor-
relation Principles in Non-contrastive Self-supervised Learning. Advances in Neural Information Processing
Systems, 35:19824–19835, December 2022.
[29] Chaoning Zhang, Kang Zhang, Chenshuang Zhang, Trung X. Pham, Chang D. Yoo, and In So Kweon. How Does
SimSiam Avoid Collapse Without Negative Samples? A Unified Understanding with Self-supervised Contrastive
Learning, March 2022. arXiv:2203.16262.
[30] Benjamin Ellenberger, Paul Haider, Jakob Jordan, Kevin Max, Ismael Jaras, Laura Kriener, Federico Benitez,
and Mihai A. Petrovici. Backpropagation through space, time, and the brain, July 2024. arXiv:2403.16933.
[31] David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, October 1986.
[32] Timothy P Lillicrap and Adam Santoro. Backpropagation through time and the brain. Current Opinion in
Neurobiology, 55:82–89, April 2019.
[33] D. O. Hebb. The Organization of Behavior: A Neuropsychological Theory. Psychology Press, New York, April
2002.
[34] Yuandong Tian, Xinlei Chen, and Surya Ganguli. Understanding self-supervised Learning Dynamics without
Contrastive Pairs, October 2021. arXiv:2102.06810.