Abstract

[...] processes within the human visual system: the Reverse Visual Association Cortex (R-VAC), which reverses the pathways of this brain region, extracting semantics from fMRI data; and the Reverse Parallel PKM (R-PKM) component, which simultaneously predicts color and depth from fMRI signals. Experiments indicate that our method outperforms current state-of-the-art models in terms of consistency of appearance, structure, and semantics. Code will be made publicly available to facilitate further research in this field.

Figure 1. Forward and Reverse Cycle. Forward (HVS): visual stimuli ↦ color, depth, semantics ↦ fMRI; Reverse (DREAM): fMRI ↦ color, depth, semantics ↦ reconstructed images.

1. Introduction

Exploring neural encoding unravels the intricacies of brain function. In recent years, we have witnessed tremendous progress in visual decoding [18], which aims at decoding Functional Magnetic Resonance Imaging (fMRI) signals to reconstruct the test image seen by a human subject during the fMRI recording. Visual decoding could significantly affect our society, from how we interact with machines to helping paralyzed patients [43]. However, existing methods still suffer from missing concepts and limited quality in the resulting images. Recent studies have turned to deep generative models for visual decoding due to their remarkable generation capabilities, particularly text-to-image diffusion models [35, 49]. These methods rely heavily on aligning brain signals with a vision-language model [33]. This strategic use of CLIP helps mitigate the scarcity of annotated data and the complexity of the underlying brain information. Still, the inherent nature of CLIP, which fails to preserve scene structure and positional information, limits visual decoding. Hence, current methods have endeavored to incorporate structural and positional details, either through depth maps [10, 42] or by utilizing the decoded representation of an initial guessed image [31, 37]. However, these methods primarily focus on merging inputs that fit well within the pretrained generative model used for visual decoding, lacking insights from the human visual system.

We commence our study with the foundational principles [3] governing the Human Visual System (HVS) and dissect the essential cues crucial for effective visual decoding. Our method draws insights from the HVS — how humans perceive visual stimuli (forward route in Fig. 1) — to address the potential information loss during the transition from the fMRI to the visual domain (reverse route in Fig. 1). We do so by deciphering crucial cues from fMRI recordings, thereby contributing to enhanced consistency in terms of appearance, structure, and semantics. As cues, we investigate: color for accurate scene appearance [29], depth for scene structure [34], and the popular semantics for high-level comprehension [33]. Our study shows that current visual decoding methods often overlook color, which in fact plays an indispensable role. Fig. 2 highlights color inconsistencies in a recent work [31]. The generated images, while accurate in semantics, deviate in structure [...]
Figure 3. Relation of the HVS and Our Proposed DREAM. Grounding on the Human Visual System (HVS), we devise reverse pathways aimed at deciphering semantics, depth, and color cues from fMRI to guide image reconstruction. (Left) Schematic view of the HVS (forward pathways: stimuli to fMRI), detailed in Sec. 3. When perceiving visual stimuli, connections from the retina to the brain can be separated into two parallel pathways. The Parvocellular Pathway originates from midget cells in the retina and is responsible for transmitting color information, while the Magnocellular Pathway starts with parasol cells and is specialized in detecting depth and motion. The conveyed information is channeled into the visual cortex for intricate processing of high-level semantics from the visual image. (Right) DREAM mimics the corresponding inverse processes within the HVS (reverse pathways: fMRI to semantics, color, and depth to image): the Reverse VAC (Sec. 4.1) replicates the opposite operations of this brain region, analogously extracting semantics Ŝ as a CLIP embedding from fMRI; and the Reverse PKM (Sec. 4.2) maps fMRI to color Ĉ and depth D̂ in the form of spatial palettes and depth maps, to facilitate subsequent processing by the Color Adapter (C-A) and the Depth Adapter (D-A) of T2I-Adapter [29] in conjunction with SD [35] for image reconstruction from the deciphered semantics, color, and depth cues {Ŝ, Ĉ, D̂}.
[...] large-scale vision model [32] or encoded from an initial estimated image [37]. The high-level semantics is frequently represented as a CLIP embedding. Recent research [10, 42] also suggested the explicit provision of both low-level and high-level information, predicting captions and depth maps from brain signals. Our method differs in how the auxiliary information is predicted and in the incorporation of color.

3. Preliminary on the Human Visual System

The Human Visual System (HVS) endows us with the ability of visual perception. Visual information is concurrently relayed from various cell types in the retina, each capturing distinct facets of data, through the optic nerve to the brain. Connections from the retina to the brain, as shown in Fig. 3, can be separated into a parvocellular pathway and a magnocellular pathway¹. The parvocellular pathway originates from midget cells in the retina and is responsible for transmitting color information, while the magnocellular pathway starts with parasol cells and is specialized in detecting depth and motion. The visual information is first directed to a sensory relay station known as the lateral geniculate nucleus (LGN) of the thalamus, before being channeled to the visual cortex (V1) for the initial processing of visual stimuli. The visual association cortex (VAC) receives processed information from V1 and undertakes intricate processing of high-level semantic contents from the visual image.

The hierarchical and parallel manner in which visual stimuli are broken down and passed forward as color, depth, and semantics guided our choices to reverse the HVS for decoding. For a detailed illustration of human perception and an analysis of the feasibility of extracting the desired cues from fMRI recordings, please consult the supplementary material.

¹ An additional set of neurons, known as the koniocellular layers, are found ventral to each of the magnocellular and parvocellular layers [3].

4. DREAM

The task of visual decoding aims to recover the viewed image I ∈ R^{H×W×3} from brain activity signals elicited by visual stimuli. Functional MRI (fMRI) is usually employed as a proxy of the brain activities, typically encoded as a set of voxels fMRI ∈ R^{1×N}. Formally, the task optimizes f(·) so that f(fMRI) = Î, where Î best approximates I.

To address this task, we propose DREAM, a method grounded on fundamental principles of human perception. Following Sec. 3, our method relies on the explicit design of reverse pathways to decipher the Semantics, Color, and Depth intertwined in the fMRI data. These reverse pathways mirror the forward process from visual stimuli to brain activity. Considering that an fMRI captures changes in the brain regions during the forward process, it is feasible to derive the desired cues of the visual stimuli from such recordings [2].
Figure 4. R-VAC Training. To decipher semantics from fMRI, we train an encoder Efmri in a contrastive fashion which aligns fMRI data with the frozen CLIP space [33]. Data augmentation (represented by the dashed rectangle) [21] combats the data scarcity of the fMRI modality. See Sec. 4.1 for more details.

Overview. Fig. 3 illustrates an overview of DREAM. It consists of two consecutive phases, namely Pathways Reversing and Guided Image Reconstruction. These phases break down the reverse mapping from fMRI to image into two subprocesses: fMRI ↦ {Ŝ, Ĉ, D̂} and {Ŝ, Ĉ, D̂} ↦ Î. In the first phase, two Reverse Pathways decipher the cues of semantics, color, and depth from fMRI with parallel components: the Reverse Visual Association Cortex (R-VAC, Sec. 4.1) inverts the operations of the VAC region to extract semantic details from the fMRI, encoded as a CLIP embedding [33], and the Reverse Parallel Parvo-, Konio- and Magno-Cellular (R-PKM, Sec. 4.2) is designed to predict color and depth simultaneously from fMRI signals. Given the lossy nature of fMRI data and the non-bijective transformation image ↦ fMRI, we then cast the decoding process as a generative task, using the extracted Ŝ, Ĉ, D̂ cues as conditions for image reconstruction. Therefore, in the second phase, Guided Image Reconstruction (GIR, Sec. 4.3), we follow recent visual decoding practices [41, 42] and employ a frozen SD with T2I-Adapter [29] to generate images, here benefiting from the additional Ŝ, Ĉ, D̂ guidance.

4.1. R-VAC (Semantics Decipher)

The Visual Association Cortex (VAC), as detailed in Sec. 3, is responsible for interpreting the high-level semantics of visual stimuli. We design R-VAC to reverse this process through analogous learning of the mapping from fMRI to semantics, fMRI ↦ Ŝ. This is achieved by training an encoder Efmri whose goal is to align the fMRI embedding with the shared CLIP space [33]. Though CLIP was initially trained with image-text pairs, prior works [26, 37] demonstrated the ability to align new modalities. To fight the scarcity of fMRI data, we also carefully select an ad-hoc data augmentation strategy [21].

Contrastive Learning. In practice, we train the fMRI encoder Efmri with triplets of {fMRI, image, caption} to pull the fMRI embeddings closer to the rich shared semantic space of CLIP. Given that both the text encoder (Etxt) and image encoder (Eimg) of CLIP are frozen, we minimize the embedding distances of fMRI-image and fMRI-text, which in turn forces alignment of the fMRI embedding with CLIP. The training is illustrated in Fig. 4. Formally, with the embeddings of fMRI, text, and image denoted by p, c, v, respectively, the initial contrastive loss writes

\mathcal{L}_{p} = - \log \frac{\exp\left(p_i \cdot c_i / \tau\right)}{\sum_{j=0}^K \exp\left(p_i \cdot c_j / \tau\right)} - \log \frac{\exp\left(p_i \cdot v_i / \tau\right)}{\sum_{j=0}^K \exp\left(p_i \cdot v_j / \tau\right)},    (1)

where τ is a temperature hyperparameter. The sum in each term is over one positive and K negative samples. Each term represents the log loss of a (K+1)-way softmax-based classifier [16], which aims to classify p_i as c_i (or v_i). The sum over the samples of the batch size n is omitted for brevity. The joint image-text-fMRI representation is intended for potential retrieval purposes [37]. The fMRI-image or fMRI-text components are utilized for specific tasks.

Data Augmentation. An important issue to consider is that there are significantly fewer fMRI samples (≈ 10^4) compared to the number of samples used to train CLIP (10^8), which may damage contrastive learning [19, 52]. To address this, we utilize a data augmentation loss based on MixCo [21], which generates mixed fMRI data r_mix_{i,k} from a convex combination of two fMRI samples r_i and r_k:

r_{\text{mix}_{i,k}} = \lambda_i \cdot r_i + \left(1-\lambda_i\right) \cdot r_k,    (2)

where k represents the arbitrary index of any data in the same batch, and its encoding writes p_i^* = Efmri(r_mix_{i,k}). The data augmentation loss, which excludes the image components for brevity, is formulated as

\begin{aligned} \mathcal{L}_\text{MixCo} = & -\sum_{i=1}^{n} \biggl[\lambda_i \cdot \log \frac{\exp\left(p_i^* \cdot c_i / \tau\right)}{\sum_{j=0}^K \exp\left(p_i^* \cdot c_j / \tau\right)} \\ & + \left(1-\lambda_i\right) \cdot \log \frac{\exp\left(p_i^* \cdot c_{k} / \tau\right)}{\sum_{j=0}^K \exp\left(p_i^* \cdot c_j / \tau\right)} \biggr]. \end{aligned}    (3)

Finally, the total loss is a combination of L_p and L_MixCo weighted with hyperparameter α:

\mathcal{L}_{total} = \mathcal{L}_p + \alpha \mathcal{L}_\text{MixCo}.    (4)
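For concreteness, below is a minimal PyTorch sketch of the R-VAC training objective in Eqs. (1)-(4). It assumes precomputed, normalized CLIP text and image embeddings c and v, uses in-batch negatives for the K negative samples, and draws λ_i uniformly; these choices, and all names, are illustrative rather than the authors' exact implementation.

```python
# Hedged sketch of Eqs. (1)-(4): InfoNCE alignment of fMRI embeddings to CLIP text/image
# embeddings, plus the MixCo-style mixed-sample loss. E_fmri, r, c, v are assumed inputs.
import torch
import torch.nn.functional as F

def info_nce(p, targets, tau=0.07):
    """(K+1)-way softmax log loss: classify each p_i against all targets in the batch."""
    logits = p @ targets.T / tau                      # (n, n) similarity matrix
    labels = torch.arange(p.size(0), device=p.device)
    return F.cross_entropy(logits, labels)

def rvac_loss(E_fmri, r, c, v, alpha=0.3, tau=0.07):
    # Eq. (1): pull fMRI embeddings p toward both CLIP text (c) and image (v) embeddings.
    p = F.normalize(E_fmri(r), dim=-1)
    loss_p = info_nce(p, c, tau) + info_nce(p, v, tau)

    # Eq. (2): MixCo augmentation - mix each fMRI sample with a shuffled partner.
    lam = torch.rand(r.size(0), device=r.device)      # lambda_i, drawn uniformly here
    k = torch.randperm(r.size(0), device=r.device)    # arbitrary partner indices
    r_mix = lam.view(-1, 1) * r + (1 - lam.view(-1, 1)) * r[k]

    # Eq. (3): contrastive loss on the mixed embeddings, soft-labeled by lambda
    # (only the text component is shown, as in the paper).
    p_star = F.normalize(E_fmri(r_mix), dim=-1)
    log_prob = F.log_softmax(p_star @ c.T / tau, dim=-1)
    idx = torch.arange(r.size(0), device=r.device)
    loss_mixco = -(lam * log_prob[idx, idx] + (1 - lam) * log_prob[idx, k]).mean()

    # Eq. (4): total objective, with alpha = 0.3 as in the implementation details.
    return loss_p + alpha * loss_mixco
```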
4.2. R-PKM (Depth & Color Decipher)

While R-VAC provides semantics knowledge, the latter is inherently bounded by the capacity of the CLIP space, unable to encode spatial colors and geometry. To address this issue, inspired by the human visual system, we craft the R-PKM to map fMRI to color Ĉ and depth D̂ in the form of spatial palettes and depth maps (Fig. 3). The regression of a target r against its prediction r̂ is supervised with

\mathcal{L}_r(r, \hat{r}) = \beta \cdot \operatorname{MSE}(r, \hat{r}) - (1-\beta) \cos(\angle(r, \hat{r})),    (5)

where β is determined empirically as a hyperparameter.
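Below is a small sketch of Eq. (5); r and r̂ are assumed to be flattened target and predicted vectors, and β = 0.9 follows the hyperparameters reported in Sec. 5. The exact quantities being regressed are not specified in this excerpt.

```python
# Hedged sketch of Eq. (5): beta * MSE(r, r_hat) - (1 - beta) * cos(angle(r, r_hat)).
# r and r_hat are assumed to be flattened (batch, dim) target / predicted tensors.
import torch
import torch.nn.functional as F

def rpkm_loss(r, r_hat, beta=0.9):
    mse = F.mse_loss(r_hat, r)                              # squared error term
    cos = F.cosine_similarity(r_hat, r, dim=-1).mean()      # angular agreement term
    return beta * mse - (1.0 - beta) * cos                  # Eq. (5)
```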
Figure 6. Sample Visual Decoding Results from the SOTA Methods on NSD.
Table 1. Quantitative Evaluation. Following standard NSD metrics, DREAM performs on a par or better than the SOTA methods (we
highlight best and second). We also report ablation of the two strategies fighting fMRI data scarcity: R-VAC without Data Augmentation
(DA) and R-PKM without the third-stage decoder training (S3) that allows additional RGBD data without fMRI.
Method               | PixCorr ↑ | SSIM ↑ | AlexNet(2) ↑ | AlexNet(5) ↑ | Inception ↑ | CLIP ↑ | EffNet-B ↓ | SwAV ↓
(Low-Level: PixCorr, SSIM, AlexNet(2), AlexNet(5); High-Level: Inception, CLIP, EffNet-B, SwAV)
Mind-Reader [24]     | −    | −    | −     | −     | 78.2% | −     | −    | −
Takagi et al. [41]   | −    | −    | 83.0% | 83.0% | 76.0% | 77.0% | −    | −
Gu et al. [15]       | .150 | .325 | −     | −     | −     | −     | .862 | .465
Brain-Diffuser [32]  | .254 | .356 | 94.2% | 96.2% | 87.2% | 91.5% | .775 | .423
MindEye [37]         | .309 | .323 | 94.7% | 97.8% | 93.8% | 94.1% | .645 | .367
DREAM (Ours)         | .274 | .328 | 93.9% | 96.7% | 93.4% | 94.1% | .645 | .418
DREAM (sub01)        | .288 | .338 | 95.0% | 97.5% | 94.8% | 95.2% | .638 | .413
w/o DA (R-VAC)       | .279 | .340 | 86.8% | 88.1% | 87.2% | 89.9% | .662 | .517
w/o S3 (R-PKM)       | .203 | .295 | 92.7% | 96.2% | 92.1% | 94.6% | .642 | .463
Metrics for Visual Decoding. The same set of eight metrics is utilized for our evaluation in accordance with prior research [32, 37]. To be specific, PixCorr is the pixel-level correlation between the reconstructed and ground-truth images. The Structural Similarity Index (SSIM) [46] quantifies the similarity between two images; it measures structural and textural similarity rather than just pixel-wise differences. AlexNet(2) and AlexNet(5) are two-way comparisons of the second and fifth layers of AlexNet [22], respectively. Inception is the two-way comparison of the last pooling layer of InceptionV3 [40]. CLIP is the two-way comparison of the last layer of the CLIP-Vision [33] model. EffNet-B and SwAV are distances computed with EfficientNet-B1 [44] and SwAV-ResNet50 [5], respectively. The first four metrics can be categorized as low-level measurements, whereas the remaining four capture higher-level characteristics.
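For clarity, here is a hedged sketch of the two-way comparison protocol behind the AlexNet, Inception, and CLIP metrics: a reconstruction scores a hit when its feature is closer to its own ground truth than to another test image's feature. The correlation-based scoring follows common practice in the cited works [32, 37] and is not necessarily the authors' exact evaluation code.

```python
# Hedged sketch of two-way identification. feat_rec / feat_gt hold one feature vector
# per image, extracted from the layer named by the metric (e.g. AlexNet conv2).
import numpy as np

def two_way_identification(feat_rec, feat_gt):
    """feat_rec, feat_gt: (n_images, feat_dim) features of reconstructions / ground truths."""
    n = feat_rec.shape[0]
    corr = np.corrcoef(feat_rec, feat_gt)[:n, n:]   # corr[i, j] = corr(rec_i, gt_j)
    wins, trials = 0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            wins += corr[i, i] > corr[i, j]          # own GT beats the distractor GT
            trials += 1
    return wins / trials                             # chance level is 50%
```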
Metrics for Depth and Color. We use metrics from depth estimation and color correction to assess the depth and color consistencies of the final reconstructed images. The depth metrics, as elaborated in [28], include Abs Rel (absolute relative error), Sq Rel (squared relative error), RMSE (root mean squared error), and RMSE log (root mean squared logarithmic error). For color metrics, we use CD (Color Discrepancy) [47] and STRESS (Standardized Residual Sum of Squares) [11]. Please consult the supplementary material for details.

Table 2. Consistency of the Decoded Images. We evaluate the color and depth consistencies of decoded images by comparing the distances between test images and visual decoding results. DREAM significantly outperforms the other two methods [32, 37].

Method              | Abs Rel ↓ | Sq Rel ↓ | RMSE ↓ | RMSE log ↓ | CD ↓  | STRESS ↓
(Depth: Abs Rel, Sq Rel, RMSE, RMSE log; Color: CD, STRESS)
Brain-Diffuser [32] | 10.162 | 4.819 | 9.871 | 1.157 | 4.231 | 47.025
MindEye [37]        | 8.391  | 4.176 | 9.873 | 1.075 | 4.172 | 45.380
DREAM (Ours)        | 7.695  | 4.031 | 9.862 | 1.039 | 2.957 | 37.285

Implementation Details. One NVIDIA A100-SXM-80GB GPU is used in all experiments, including the training of the fMRI ↦ Semantics encoder Efmri and the fMRI ↦ Depth & Color encoder E and decoder D. We use pretrained color and depth adapters from T2I-Adapter [29] to extract guidance features from the predicted spatial palettes and depth maps. These guidance features, along with the predicted CLIP representations, are then input into the pretrained SD model for image reconstruction. The hyperparameters are set to α = 0.3 and β = 0.9, and ωc and ωd are set to 1.0 unless otherwise mentioned. For further details regarding the network architecture, please refer to the supplementary material.
Figure 7. Visual Decoding with DREAM. Columns show the test image, ground-truth (D, C), predictions (D̂, Ĉ), DREAM, and DREAM w/o color guidance. Sample outputs demonstrate DREAM's ability to accurately decode the visual stimuli from fMRI. Our depth and color predictions from the R-PKM (Sec. 4.2) are in line with the pseudo ground truth, despite the extreme complexity of the task. DREAM reconstructions closely match the test images, and the rightmost samples demonstrate the benefit of color guidance.
Table 3. Effectiveness of R-VAC (Semantics Decipher) and R-PKM (Depth & Color Decipher). We conducted two sets of experiments
using ground-truth (GT) or predicted (Pred) cues to reconstruct the visual stimuli, respectively.
Reconstruction (Sec. 4.3)  | PixCorr ↑ | SSIM ↑ | AlexNet(2) ↑ | AlexNet(5) ↑ | Inception ↑ | CLIP ↑ | EffNet-B ↓ | SwAV ↓
(Low-Level: PixCorr, SSIM, AlexNet(2), AlexNet(5); High-Level: Inception, CLIP, EffNet-B, SwAV)
GT   semantics {S}         | .244 | .272 | 96.68% | 97.39% | 87.82% | 92.45% | 1.00 | .415
GT   +depth {S, D}         | .186 | .286 | 99.58% | 99.78% | 98.78% | 98.09% | .723 | .322
GT   +color {S, D, C}      | .413 | .366 | 99.99% | 99.98% | 99.19% | 98.66% | .702 | .278
Pred semantics {Ŝ}         | .194 | .278 | 91.82% | 92.57% | 93.11% | 91.24% | .645 | .369
Pred +depth {Ŝ, D̂}         | .083 | .282 | 88.07% | 94.69% | 94.13% | 96.05% | .802 | .429
Pred +color {Ŝ, D̂, Ĉ}      | .288 | .338 | 94.99% | 97.50% | 94.80% | 95.24% | .638 | .413
5.2. Experimental Results and Analysis

Visual Decoding. Our method is compared with five state-of-the-art methods: Mind-Reader [24], Takagi et al. [41], Gu et al. [15], Brain-Diffuser [32], and MindEye [37]. The quantitative visual decoding results are presented in Tab. 1, indicating competitive performance. Our method, with its explicit deciphering mechanism, appears to be more proficient at discerning scene structure and semantics, as evidenced by the favorable high-level metrics. The qualitative results, depicted in Fig. 6, align with the numerical findings, indicating that DREAM produces more realistic outcomes that maintain consistency with the viewed images in terms of semantics, appearance, and structure, compared to the other methods. Striking DREAM outputs are the food plate (left, middle row), which accurately decodes the presence of vegetables and a tablespoon, and the baseball scene (right, middle row), which showcases the correct number of players (3) with poses similar to the test image.

Besides visual appearance, we wish to measure the consistency of depth and color in the decoded images with respect to the test images viewed by the subject. We achieve this by measuring the variance in the estimated depth (and color palettes) of the test image and the reconstructed results from Brain-Diffuser [32], MindEye [37], or DREAM. Results presented in Tab. 2 indicate that our method yields images that align more consistently in color and depth with the visual stimuli than the other two methods.

Cues Deciphering. Our method decodes three cues from fMRI data: semantics, depth, and color. To assess the semantics deciphered by R-VAC (Sec. 4.1), we simply refer to the CLIP metric of Tab. 1, which quantifies the distance between the CLIP embeddings of the reconstruction and the test image. From the aforementioned table, DREAM is at least 1.1% better than the others. Fig. 7 shows examples of the depth (D̂) and color (Ĉ) deciphered by R-PKM (Sec. 4.2). While accurate depth is beneficial for image reconstruction, faithfully recovering the original depth from fMRI is nearly impossible due to the information loss in capturing the brain activities [2]. Still, coarse depth is sufficient in most cases to guide the scene structure and object position, such as determining the location of an airplane or the orientation of a bird standing on a branch. This is intuitively understood from the bottom row of Fig. 7, where our coarse depth (D̂) leaves no doubt about the giraffe's location and orientation. Interestingly, despite not precisely preserving the local color, the estimated color palettes (Ĉ) provide a reliable constraint and guidance on the overall scene appearance. This is further demonstrated in the last three columns of Fig. 7 by removing the color guidance, which, despite appealing visuals, produces images drastically differing from the test image.

6. Ablation Study

Here we present ablation studies, discussing first the effect of using color, depth, and semantics as guidance for image reconstruction thanks to our reverse pathways (i.e., R-VAC and R-PKM).
Figure 8. Effect of the Composition Weight in GIR (Sec. 4.3). Columns show the test image, ground-truth depth D, predictions (D̂, Ĉ), and reconstructions with ωc / ωd set to 1/1 (ours), 1/0.6, 1/0, 0.6/0, 0/1, and 0/0. The two weights ωc and ωd control the relative importance of the corresponding features from the three deciphered cues: semantics, color, and depth. When the predicted depth or color fail to provide reliable guidance, we can manually tweak the weights to achieve satisfactory reconstructions.
Effect of Color Palettes. As highlighted earlier, the deciphered color guidance noticeably enhances the visual quality of the reconstructed images in Fig. 7. We further quantify the significance of color through two additional sets of experiments detailed in Tab. 3, where the reconstruction uses: 1) ground-truth (GT) depth and caption with or without color, and 2) fMRI-predicted (Pred) depth and semantic embedding with or without the predicted color. The results using ground-truth cues serve as a proxy. The generated results exhibit improved color consistency and enhanced quantitative performance across the board, underscoring the importance of using color for visual decoding.

Effect of Depth and Semantics. Tab. 3 presents results where we ablate the use of depth and semantics. Comparison of GT and predicted semantics (i.e., {S} and {Ŝ}) suggests that the fMRI embedding effectively incorporates high-level semantic cues into the final images. The overall image quality can be further improved by integrating either ground-truth depth ({S, D}) or predicted depth ({Ŝ, D̂}), combined with color ({S, D, C} and {Ŝ, D̂, Ĉ}). Introducing color cues not only bolsters the structural information but also strengthens the semantics, possibly because it compensates for the color information absent in the predicted fMRI embedding. Of note, all metrics (except PixCorr) improve smoothly with more GT guidance. Yet, the impact of predicted cues varies across metrics, highlighting intriguing research avenues and emphasizing the need for more reliable measures.

Effect of Data Scarcity Strategies. We ablate the two strategies introduced to fight fMRI data scarcity: data augmentation (DA) in R-VAC and the third-stage decoder training (S3), which allows R-PKM to use additional RGBD data without fMRI. The results shown in the two bottom rows of Tab. 1 demonstrate that the two data augmentation strategies address the limited availability of fMRI data and subsequently bolster the model's generalization capability.

Effect of Weighted Guidance. The features input to SD are formulated as S + ωc Rc(C) + ωd Rd(D), where the two weights ωc and ωd from Eq. (7) control the relative importance of the deciphered cues and play a crucial role in the final image quality and alignment with the cues. In DREAM, ωc and ωd are set to 1.0, showing no preference for color guidance over depth guidance. Still, Fig. 8 shows that in some instances the predicted components fail to provide dependable guidance on structure and appearance, thus compromising the results. There is also empirical evidence that the T2I-Adapter slightly underperforms compared to ControlNet, and its performance further diminishes when multiple conditions are used, as opposed to just one. Taking both factors into account, there are instances where manual adjustments to the weighting parameters become necessary to achieve images of the desired quality, semantics, structure, and appearance.
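For illustration, a minimal sketch of the weighted composition S + ωc Rc(C) + ωd Rd(D) follows; the adapter modules and names are assumptions in the style of T2I-Adapter, not the authors' code.

```python
# Hedged sketch of the weighted guidance composition in GIR (Sec. 4.3 / Sec. 6).
# `color_adapter` and `depth_adapter` stand for pretrained T2I-Adapter-style modules
# returning one feature map per U-Net encoder scale; names are illustrative.
import torch

def compose_guidance(color_palette, depth_map, color_adapter, depth_adapter,
                     w_c=1.0, w_d=1.0):
    """Return per-scale guidance features  w_c * R_c(C) + w_d * R_d(D)."""
    feats_c = color_adapter(color_palette)   # list of tensors, one per scale
    feats_d = depth_adapter(depth_map)
    return [w_c * fc + w_d * fd for fc, fd in zip(feats_c, feats_d)]

# The composed features are added to the U-Net encoder activations, while the semantic
# cue S (the predicted CLIP embedding) conditions SD through cross-attention. Lowering
# w_c or w_d (e.g. w_c=1.0, w_d=0.6 as in Fig. 8) softens an unreliable cue.
```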
7. Conclusion

This paper presents DREAM, a visual decoding method founded on principles of human perception. We design reverse pathways that mirror the forward pathways from visual stimuli to fMRI recordings. These pathways specialize in deciphering semantics, color, and depth cues from fMRI data and then use these predicted cues as guidance to reconstruct the visual stimuli. Experiments demonstrate that our method surpasses current state-of-the-art models in terms of consistency in appearance, structure, and semantics.

Acknowledgements. This work was supported by the Engineering and Physical Sciences Research Council [grant number EP/W523835/1] and a UKRI Future Leaders Fellowship [grant number G104084].
Appendices
In the following, we provide more details and discussion on
the background knowledge, experiments, and more results
of our method. We first provide more details on the NSD
neuroimaging dataset in Sec. A and extend background
knowledge of the Human Visual System in Sec. B, which
together shed light on our design choices. We then detail
T2I-Adapter in Sec. C. Sec. D provides thorough implemen-
tation of DREAM, including architectures, representations
and metrics. Finally, in Sec. E we further demonstrate the
ability of our method with new results of cues deciphering,
reconstruction, and reconstruction across subjects.

Figure 1. Functional Anatomy of the Cortex. The functional localization in the human brain is based on findings from functional brain imaging, which link various anatomical regions of the brain to their associated functions. Source: Wikimedia Commons. This image is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

A. NSD Dataset

The Natural Scenes Dataset (NSD) [2] is currently the largest publicly available fMRI dataset. It features in-depth recordings of brain activities from 8 participants (subjects) who passively viewed images for up to 40 hours in an MRI machine. Each image was shown for three seconds and repeated three times over 30-40 scanning sessions, amounting to 22,000-30,000 fMRI response trials per participant. The viewed natural scene images are sourced from the Common Objects in Context (COCO) dataset [25], enabling the utilization of the original COCO captions for training.

The fMRI-to-image reconstruction studies that used NSD [15, 31, 41] typically follow the same procedure: training individual-subject models for the four participants who finished all scanning sessions (participants 1, 2, 5, and 7), and employing a test set that corresponds to the common 1,000 images shown to each participant. For each participant, the training set has 8,859 images and 24,980 fMRI trials (as each image is tested up to 3 times). Another 982 images and 2,770 fMRI trials are common across the four individuals. We use the preprocessed fMRI voxels in a 1.8-mm native volume space corresponding to the "nsdgeneral" brain region. This region is described by the NSD authors as the subset of voxels in the posterior cortex that are most responsive to the presented visual stimuli. For fMRI data spanning multiple trials, we calculate the average response as in prior research [27]. Tab. 1 details the characteristics of the NSD dataset and the regions of interest (ROIs) included in the fMRI data.

Table 1. Details of the NSD dataset. Training set: 8,859 images; test set: 982 images; ROIs: V1, V2, V3, hV4, VO, PHC, MT, MST, LO, IPS.

Subject ID | fMRI Dimensions
sub01 | 15,724
sub02 | 14,278
sub05 | 13,039
sub07 | 12,682
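As a small illustration of the trial-averaging step mentioned above, the following sketch averages the repeated fMRI trials of each image; array names are illustrative.

```python
# Hedged sketch of trial averaging: average the up-to-three fMRI trials recorded for
# the same image before training/evaluation, on preprocessed "nsdgeneral" voxels.
import numpy as np

def average_repeated_trials(voxels, image_ids):
    """voxels: (n_trials, n_voxels); image_ids: (n_trials,) NSD/COCO image index per trial."""
    unique_ids = np.unique(image_ids)
    averaged = np.stack([voxels[image_ids == i].mean(axis=0) for i in unique_ids])
    return unique_ids, averaged   # one averaged response per distinct image
```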
B. Detailed Human Visual System

Our approach aims to decode semantics, color, and depth from fMRI data, and is thus inherently bounded by the ability of fMRI data to capture the relevant brain activities. It is crucial to ascertain whether fMRI captures the alterations in the respective human brain regions responsible for processing the visual information. Here, we provide a comprehensive examination of the specific brain regions in the human visual system recorded by the fMRI data.

The flow of visual information [3] in neuroscience is presented as follows. Fig. 1 presents a comprehensive depiction of the functional anatomy of visual perception. Sensory input originating from the retina travels through the LGN in the thalamus and then reaches the visual cortex. The retina is a layer within the eye composed of photoreceptor and glial cells. These cells capture incoming photons and convert them into electrical and chemical signals, which are then relayed to the brain, resulting in visual perception. Different types of information are processed through the parvocellular and magnocellular pathways, details of which are elaborated in the main paper. The LGN then channels the conveyed visual information into the visual cortex, where it diverges into two streams in the Visual Association Cortex (VAC) for the intricate processing of high-level semantic contents from the visual image.

The Visual Cortex, also known as visual area 1 (V1), serves as the initial entry point for visual perception within the cortex. Visual information flows here first before being relayed to other regions. The VAC comprises multiple regions surrounding the visual cortex, including V2, V3, V4, and V5 (also known as the middle temporal area, MT). V1 transmits information into two primary streams: the ventral stream and the dorsal stream.

• The ventral stream (black arrow) begins with V1, goes through V2 and V4, and on to the inferior temporal cortex (IT cortex). The ventral stream is responsible for the "meaning" of the visual stimuli, such as object recognition and identification.

• The dorsal stream (blue arrow) begins with V1, goes through visual area V2, then to the dorsomedial area (DM/V6) and medial temporal area (MT/V5), and on to the posterior parietal cortex. The dorsal stream is engaged in analyzing information associated with "position", particularly the spatial properties of objects.

After juxtaposing the explanations illustrated in Fig. 1 with the information collected in Tab. 1, it becomes apparent that the changes occurring in brain regions linked to the processing of semantics, color, and depth are indeed present within the fMRI data. This observation emphasizes the capability to extract the intended information from the provided fMRI recordings.

Figure 2. Depth and Color Representations. We present pseudo ground truth samples of Depth (MiDaS prediction [34]) and Color (×64 downsampling of the test image) for an NSD input image (columns: test image I, GT depth D, GT color C).
C. T2I-Adapter

T2I-Adapter [29] and ControlNet [51] learn versatile modality-specific encoders to improve the controllability of the text-to-image SD model [35]. These encoders extract guidance features from various conditions y (e.g., sketch, semantic label, and depth). They aim to align external control with the internal knowledge of SD, thereby enhancing the precision of control over the generated output. Each encoder R produces n hierarchical feature maps F_R^i from the primitive condition y. Then each F_R^i is added to the corresponding intermediate feature F_SD^i in the denoising U-Net encoder:

\begin{aligned} \mathrm{F}_{\mathcal{R}} &= \mathcal{R}\left(\mathbf{y}\right), \\ \hat{\mathrm{F}}_{\text{SD}}^i &= \mathrm{F}_{\text{SD}}^i + \mathrm{F}_{\mathcal{R}}^i, \quad i \in \{1, 2, \cdots, n\}. \end{aligned}    (1)

T2I-Adapter consists of a pretrained SD model and several adapters. These adapters are used to extract guidance features from various conditions. The pretrained SD model is then utilized to generate images based on both the input text features and the additional guidance features. The CoAdapter mode becomes available when multiple adapters are involved; a composer processes the features from these adapters before they are further fed into SD. Given the deciphered semantics, color, and depth information from fMRI, we can reconstruct the final images using the color and depth adapters in conjunction with SD.
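To make Eq. (1) concrete, here is a minimal sketch of the per-scale injection of adapter features into the U-Net encoder features; the adapter interface is assumed, and the composer comment mirrors the CoAdapter description above (and the weighting sketch in Sec. 6 of the main paper).

```python
# Hedged sketch of appendix Eq. (1): each adapter feature map F_R^i is added to the
# matching intermediate feature F_SD^i of the denoising U-Net encoder.
# `adapter` stands for a T2I-Adapter-style encoder R(y); names are illustrative.
import torch

def inject_guidance(unet_encoder_feats, adapter, condition_y):
    """unet_encoder_feats: list of F_SD^i tensors; condition_y: e.g. a depth map or palette."""
    adapter_feats = adapter(condition_y)                 # F_R = R(y), one map per scale
    assert len(adapter_feats) == len(unet_encoder_feats)
    return [f_sd + f_r for f_sd, f_r in zip(unet_encoder_feats, adapter_feats)]

# With several adapters (CoAdapter mode), their feature lists are first combined by a
# composer (in DREAM, a weighted sum with w_c and w_d) before this per-scale addition.
```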
D. Implementation Details

D.1. Network Architectures

The fMRI ↦ Semantics encoder Efmri maps fMRI voxels to the shared CLIP latent space [33] to decipher semantics. The network architecture includes a linear layer followed by multiple residual blocks, a linear projector, and a final MLP projector, akin to previous research [6, 37]. The learned embedding has a feature dimension of 77 × 768, where 77 denotes the maximum token length and 768 represents the encoding dimension of each token. It is then fed into the pretrained Stable Diffusion [35] to inject semantic information into the final reconstructed images.

The fMRI ↦ Depth & Color encoder E and decoder D decipher depth and color information from the fMRI data. Given that spatial palettes are generated by first downsampling (with bicubic interpolation) an image and then upsampling (with nearest interpolation) it back to its original resolution, the primary objective of the encoder E and the decoder D shifts towards predicting RGBD images from fMRI data. The architecture of E and D is built on top of [12], with inspirations drawn from VDVAE [7].
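Below is a minimal sketch of such an fMRI ↦ Semantics encoder (linear layer, residual MLP blocks, and projectors to a 77 × 768 embedding); the hidden width and block count are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the fMRI -> Semantics encoder described above. Hidden width and
# depth are illustrative; the output matches the 77 x 768 CLIP-shaped embedding.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim),
                                   nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.block(x)          # residual connection

class FMRIEncoder(nn.Module):
    def __init__(self, n_voxels, hidden=4096, n_blocks=4, tokens=77, token_dim=768):
        super().__init__()
        self.lin = nn.Linear(n_voxels, hidden)                    # initial linear layer
        self.blocks = nn.Sequential(*[ResidualBlock(hidden) for _ in range(n_blocks)])
        self.proj = nn.Linear(hidden, tokens * token_dim)         # linear projector
        self.mlp_head = nn.Sequential(nn.LayerNorm(token_dim),    # final MLP projector,
                                      nn.Linear(token_dim, token_dim))  # applied per token
        self.tokens, self.token_dim = tokens, token_dim

    def forward(self, voxels):                                    # voxels: (batch, n_voxels)
        x = self.proj(self.blocks(self.lin(voxels)))
        x = x.view(-1, self.tokens, self.token_dim)               # (batch, 77, 768)
        return self.mlp_head(x)

# Example: sub01 has 15,724 voxels in the "nsdgeneral" region (Tab. 1).
# enc = FMRIEncoder(n_voxels=15724); emb = enc(torch.randn(2, 15724))
```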
D.2. Representation of Semantics, Color and Depth

This section serves as an introduction to the possible choices of representations for semantics, color, and depth. We currently use a CLIP embedding, a depth map [34], and a spatial color palette [29] to facilitate subsequent processing by T2I-Adapter [29] in conjunction with a pretrained Stable Diffusion [35] for image reconstruction from the deciphered cues. However, other possibilities can be utilized within our framework.

Semantics. Stable Diffusion utilizes a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts, with a feature space dimension of 77 × 768, where 77 denotes the maximum token length and 768 represents the encoding dimension of each token. The CLIP ViT-L/14 image encoder has a feature space dimension of 257 × 768. We map flattened voxels to an intermediate space of size 77 × 768, corresponding to the last hidden layer of CLIP ViT-L/14. The learned embeddings inject semantic information into the reconstructed images.

Depth. We select depth as the structural guidance for two main reasons: alignment with the human visual system, and better performance demonstrated in our preliminary experiments. Following prior research [29, 51], we use the MiDaS predictions [34] as surrogate ground-truth depth maps, which are visualized in Fig. 2.

Color. There are many representations that can provide the color information, such as histograms and probabilistic palettes [23, 45]. However, ControlNet [51] and T2I-Adapter [29] only accept spatial inputs, which leaves no alternative but to utilize spatial color palettes as the color representation. In practice, spatial color palettes resemble coarse-resolution images, as seen in Fig. 2, and are generated by first ×64 downsampling (with bicubic interpolation) an image and then upsampling (with nearest interpolation) it back to its original resolution.

During the image reconstruction phase, the spatial palettes contribute the color and appearance information to the final images. These spatial palettes are derived from the image estimated by the RGBD decoder in R-PKM. We refer to the images produced at this stage as the "initial guessed image" to differentiate them from the final reconstruction. The initial guessed image offers color cues but also contains inaccuracies. By employing a ×64 downsampling, we can effectively extract the necessary color details from this image while minimizing the side effects of these inaccuracies.
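A short sketch of the spatial color palette construction described above (×64 bicubic downsampling followed by nearest-neighbor upsampling back to the original resolution); the use of PIL and the function name are illustrative choices.

```python
# Hedged sketch of the spatial color palette: downsample the (initial guessed) image
# by x64 with bicubic interpolation, then upsample back with nearest interpolation.
from PIL import Image

def spatial_color_palette(img: Image.Image, factor: int = 64) -> Image.Image:
    w, h = img.size
    small = img.resize((max(1, w // factor), max(1, h // factor)), Image.BICUBIC)
    return small.resize((w, h), Image.NEAREST)     # coarse, blocky color layout

# e.g. palette = spatial_color_palette(Image.open("initial_guess.png"))
```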
Other Guidance. In the realm of visual decoding with pretrained diffusion models [29, 49, 51], any guidance available in these models can be harnessed to fill in gaps of missing information, thereby enhancing performance. Such spatial guidance includes representations like sketches [39], Canny edge detection, HED (Holistically-Nested Edge Detection) [48], and semantic segmentation maps [4]. These alternatives could potentially serve as the intermediate representations for the reverse pathways in our method. HED and Canny are edge detectors, which provide object boundaries within images. However, during our preliminary experiments, both methods were shown to face challenges in providing reliable edges for all images. Sketches encounter similar difficulties in providing reliable guidance. The semantic segmentation map provides both structural and semantic cues; however, it overlaps in function with CLIP semantics and depth maps, and leads to diminished performance gain on top of the other two representations.

[...] ground truth images in the NSD dataset) for PixCorr and SSIM metrics. For the other metrics, the generated images were adjusted based on the input specifications of each respective network. It should be noted that not all evaluation outcomes are available for earlier models, depending on the metrics they chose to experiment with. Our quantitative comparisons with MindEye [37], Takagi et al. [41], and Gu et al. [15] are made on the exact same test set, i.e., the 982 images that are shared by all 4 subjects. Lin et al. [24] disclosed their findings exclusively for Subject 1, with a custom training-test dataset split.

Metrics for Depth and Color. We additionally measure the consistency of our extracted depth and color. We borrow common metrics from depth estimation [28] and color correction [47] to assess depth and color consistencies in the final reconstructed images. For depth metrics, we report Abs Rel (absolute relative error), Sq Rel (squared relative error), RMSE (root mean squared error), and RMSE log (root mean squared logarithmic error), detailed in [28].

For color metrics, we use CD (Color Discrepancy) [47] and STRESS (Standardized Residual Sum of Squares) [11]. CD calculates the absolute differences between the ground truth I and the reconstructed image Î by utilizing the normalized histograms of the images segmented into bins:

\mathrm{CD}(I, \hat{I}) = \sum \left| \mathcal{H}(I) - \mathcal{H}(\hat{I}) \right|,    (2)

where H(·) represents the histogram function over the given range (e.g., [0, 255]) and number of bins. In simpler terms, this equation computes the absolute difference between the histograms of the two images for all bins and then sums them up. The number of bins for the histogram is set to 64. STRESS calculates a scaled difference between the ground-truth C and the estimated color palette Ĉ:

\text{STRESS} = 100 \sqrt{\frac{\sum_{i=1}^n\left(F \hat{\texttt{C}}_i - \texttt{C}_i\right)^2}{\sum_{i=1}^n \texttt{C}_i^2}},    (3)

where n is the number of samples and F is calculated as

F = \frac{\sum_{i=1}^n \hat{\texttt{C}}_i \texttt{C}_i}{\sum_{i=1}^n \hat{\texttt{C}}_i^2}.    (4)
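The two color metrics of Eqs. (2)-(4) can be sketched as follows, assuming 8-bit inputs, 64 histogram bins, and flattened color values; the exact binning and channel handling are assumptions rather than details given in the text.

```python
# Hedged sketch of CD (Eq. 2) and STRESS (Eqs. 3-4). Inputs are assumed to be uint8
# RGB arrays / flattened color values; the channel handling is illustrative.
import numpy as np

def color_discrepancy(img, img_hat, bins=64):
    """Eq. (2): sum of absolute differences between normalized histograms."""
    h,  _ = np.histogram(img,     bins=bins, range=(0, 255))
    hh, _ = np.histogram(img_hat, bins=bins, range=(0, 255))
    h, hh = h / h.sum(), hh / hh.sum()               # normalized histograms
    return np.abs(h - hh).sum()

def stress(c_hat, c):
    """Eqs. (3)-(4): scaled residual between estimated and ground-truth color values."""
    c_hat = np.asarray(c_hat, dtype=float).ravel()
    c = np.asarray(c, dtype=float).ravel()
    f = (c_hat * c).sum() / (c_hat ** 2).sum()       # optimal scaling factor F, Eq. (4)
    return 100.0 * np.sqrt(((f * c_hat - c) ** 2).sum() / (c ** 2).sum())
```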
Figure 3. DREAM Decoding of Depth and Color. We display the test image corresponding to the fMRI, alongside the depth ground truth (D) and the depth/color predictions (D̂, Ĉ). The R-PKM component predicts depth maps and the initial guessed RGB images (Î0). The predicted spatial palettes are derived from these initial guessed images. The results highlight the proficiency of our R-PKM module in capturing and converting intricate aspects of fMRI recordings into essential cues for visual reconstruction.

Figure 4. Sample Depth. We show sample depth maps (D̂) deciphered from fMRI using R-PKM, alongside the ground-truth depth (D) estimated by MiDaS [34] on the original test image (I).

[...] from fMRI recordings lead to enhanced consistency in appearance, structure, and semantics when compared to the viewed visual stimuli. Sec. E.3 provides results for all four subjects.

E.1. Depth & Color Deciphering

Fig. 3 showcases additional depth and color results deciphered by the R-PKM component. Overall, it is able to capture and translate these intricate aspects of the fMRI recordings into the spatial guidance crucial for more accurate image reconstructions.

For depth, the second and third columns show examples [...]. The last two columns show the color results. The predicted spatial palettes are generated by downscaling the "initial guessed images", denoted Î0 (not to be confused with Î), which correspond to the RGB channels of the R-PKM RGBD output. As discussed in Sec. D.2, employing a ×64 downsampling on the "initial guessed images" achieves a trade-off between efficiently extracting essential color cues and effectively mitigating the inaccuracies in these images. Despite not accurately preserving the color of local regions due to the resolution, the produced color palettes provide a relevant constraint and guidance on the overall color tone. Additional depth outputs are shown in Fig. 4.

Although depth and color guidance are sufficient to reconstruct images reasonably resembling the test one, it remains unclear whether better depth and color cues can be extracted from the fMRI data or whether depth and color are doomed to be coarse estimates due to the loss of information in the fMRI recording.
Figure 5. DREAM Reconstructions. We show reconstructions for subject 1 (sub01) of the NSD dataset (columns: test image, GT depth D, predictions (D̂, Ĉ), DREAM). Our approach extracts essential cues from fMRI recordings, leading to enhanced consistency in appearance, structure, and semantics when compared to the viewed visual stimuli. The results are randomly selected. The illustrated depth, color, and final images demonstrate that the deciphered and represented color and depth cues help to boost the performance of visual decoding.
Figure 6. Subject-Specific Results. We visualize subject-specific outputs of DREAM (sub01, sub02, sub05, sub07) on the NSD dataset. For each subject, the model is retrained because brain activity varies across subjects. Overall, DREAM consistently reconstructs the test image for all subjects, while we note that some reconstruction inaccuracies are shared across subjects (cf. Sec. E.3). Quantitative metrics are in Tab. 2.
Table 2. Subject-Specific Evaluation. Quantitative evaluation of the DREAM reconstructions for the participants (sub01, sub02, sub05, and sub07) of the NSD dataset. Performance is stable across all participants and consistent with the results reported in the main paper. Some example visual results can be found in Fig. 6.

Subject | PixCorr ↑ | SSIM ↑ | AlexNet(2) ↑ | AlexNet(5) ↑ | Inception ↑ | CLIP ↑ | EffNet-B ↓ | SwAV ↓
(Low-Level: PixCorr, SSIM, AlexNet(2), AlexNet(5); High-Level: Inception, CLIP, EffNet-B, SwAV)
sub01 | .288 | .338 | 95.0% | 97.5% | 94.8% | 95.2% | .638 | .413
sub02 | .273 | .331 | 94.2% | 97.1% | 93.4% | 93.5% | .652 | .422
sub05 | .269 | .325 | 93.5% | 96.6% | 93.8% | 94.1% | .633 | .397
sub07 | .265 | .319 | 92.7% | 95.4% | 92.6% | 93.7% | .656 | .438