
DREAM: Visual Decoding from REversing HumAn Visual SysteM

Weihao Xia1   Raoul de Charette2   Cengiz Oztireli3   Jing-Hao Xue1

1 University College London   2 Inria   3 University of Cambridge
{weihao.xia.21,jinghao.xue}@ucl.ac.uk, raoul.de-charette@inria.fr, aco41@cam.ac.uk

arXiv:2310.02265v2 [cs.CV] 10 Apr 2024

Abstract

In this work we present DREAM, an fMRI-to-image method for reconstructing viewed images from brain activities, grounded on fundamental knowledge of the human visual system. We craft reverse pathways that emulate the hierarchical and parallel nature of how humans perceive the visual world. These tailored pathways are specialized to decipher semantics, color, and depth cues from fMRI data, mirroring the forward pathways from visual stimuli to fMRI recordings. To do so, two components mimic the inverse processes within the human visual system: the Reverse Visual Association Cortex (R-VAC), which reverses the pathways of this brain region, extracting semantics from fMRI data; and the Reverse Parallel PKM (R-PKM), which simultaneously predicts color and depth from fMRI signals. The experiments indicate that our method outperforms the current state-of-the-art models in terms of the consistency of appearance, structure, and semantics. Code will be made publicly available to facilitate further research in this field.

Figure 1. Forward and Reverse Cycle. Forward (HVS): visual stimuli ↦ color, depth, semantics ↦ fMRI; Reverse (DREAM): fMRI ↦ color, depth, semantics ↦ reconstructed images.

1. Introduction

Exploring neural encoding unravels the intricacies of brain function. In recent years, we have witnessed tremendous progress in visual decoding [18], which aims at decoding a Functional Magnetic Resonance Imaging (fMRI) recording to reconstruct the test image seen by a human subject during the fMRI recording. Visual decoding could significantly affect our society, from how we interact with machines to helping paralyzed patients [43]. However, existing methods still suffer from missing concepts and limited quality in the image results. Recent studies turned to deep generative models for visual decoding due to their remarkable generation capabilities, particularly text-to-image diffusion models [35, 49]. These methods heavily rely on aligning brain signals with a vision-language model [33]. This strategic utilization of CLIP helps mitigate the scarcity of annotated data and the complexities of the underlying brain information. Still, the inherent nature of CLIP, which fails to preserve scene structural and positional information, limits visual decoding. Hence, current methods have endeavored to incorporate structural and positional details, either through depth maps [10, 42] or by utilizing the decoded representation of an initial guessed image [31, 37]. However, these methods primarily focus on merging inputs that fit well within the pretrained generative model for visual decoding, lacking insights from the human visual system.

We commence our study with the foundational principles [3] governing the Human Visual System (HVS) and dissect essential cues crucial for effective visual decoding. Our method draws insights from the HVS, i.e., how humans perceive visual stimuli (forward route in Fig. 1), to address the potential information loss during the transition from the fMRI to the visual domain (reverse route in Fig. 1). We do that by deciphering crucial cues from fMRI recordings, thereby contributing to enhanced consistency in terms of appearance, structure, and semantics. As cues, we investigate: color for accurate scene appearance [29], depth for scene structure [34], and the popular semantics for high-level comprehension [33]. Our study shows that current visual decoding methods often overlook color, which in fact plays an indispensable role. Fig. 2 highlights color inconsistencies in a recent work [31]. The generated images, while accurate in semantics, deviate in structure and color from the original visual stimuli. This phenomenon arises due to the absence of proper color guidance.
Figure 2. Appearance Inconsistency. When decoding the fMRI data of a subject viewing a test image (top), recent visual decoding methods, here [32], reconstruct images (bottom) which are semantically close but still suffer from strong color inconsistencies.

Following the above analysis, we propose DREAM, a visual Decoding method from REversing humAn visual systeM. It aims to mirror the forward process from visual stimuli to fMRI recordings (Sec. 3). Specifically, we design two reverse pathways specialized in deciphering semantics, color, and depth from fMRI. Reverse-VAC (Sec. 4.1) replicates the reverse operations of the visual association cortex, extracting semantics from fMRI. Reverse-PKM (Sec. 4.2) predicts color and depth simultaneously from fMRI signals. The deciphered cues are then fed into Stable Diffusion [35] with T2I-Adapter [29] to guide the image reconstruction (Sec. 4.3). Our contributions are summarized as follows:

• We scrutinize the limitations of recent diffusion-based visual decoding methods, shedding light on the potential loss of information, and introduce a novel formulation based on the principles of human perception.

• We mirror the forward process from visual stimuli to fMRI recordings within the visual system and devise two reverse pathways specialized in extracting semantics, color, and depth information from fMRI data.

• We show through experiments that our biologically interpretable method, DREAM, outperforms state-of-the-art methods while maintaining better consistency of appearance, structure, and semantics.

2. Related Work

2.1. Diffusion Probabilistic Models

Recently, diffusion models have risen to prominence as cutting-edge generative models. The Denoising Diffusion Probabilistic Model is a parameterized bi-directional Markov chain that utilizes variational inference to produce matching samples. The forward diffusion process is designed to transform any data distribution into a basic prior distribution (e.g., an isotropic Gaussian), and the reverse denoising process learns to denoise by learning transition kernels parameterized by deep neural networks such as a U-Net. In Latent Diffusion Models (LDMs) [35], the diffusion process is applied within the latent space rather than in the pixel space, enabling faster inference and reducing training costs. The text-conditioned LDM, known as Stable Diffusion (SD), has gained widespread usage due to its versatile applications and capabilities. ControlNet [51] and T2I-Adapter [29] aim to enhance the control capabilities even more by training versatile modality-specific encoders. These encoders align external control (e.g., sketch, depth, and spatial palette) with internal knowledge in SD, thereby enabling more precise control over the generated output. Unlike SD, which solely employs the CLIP text encoder, Versatile Diffusion (VD) [49] incorporates both CLIP text and image encoders, thereby enabling the utilization of multimodal capabilities.

2.2. Image Decoding from fMRI

The advancements in visual decoding are closely intertwined with the evolution of various modeling frameworks. For instance, in [17], sparse linear regression was applied to preprocessed fMRI data to predict features extracted from early convolutional layers of a pretrained CNN. In the past few years, researchers have advanced visual decoding techniques by mapping brain signals to the latent space of generative adversarial networks (GANs) [14] to reconstruct human faces [8] and natural scenes [31, 38]. More recently, visual decoding has reached an unprecedented level of quality [15, 32, 41] with the release of vision-language models [33], multimodal diffusion models [35, 49], and large-scale fMRI datasets [2]. Lin et al. [24] learned to project voxels to the CLIP space and then processed the outcomes through a fine-tuned conditional StyleGAN2 [20] to reconstruct natural images. Takagi et al. [41] employed ridge regression to associate fMRI signals with the CLIP text embedding and the latent space of Stable Diffusion, opting for varied voxels based on different components. Recent research [27, 32] explored the process of mapping fMRI signals to both CLIP text and image embeddings, subsequently utilizing the pre-trained Versatile Diffusion model [49] that accommodates multiple inputs for image reconstruction.

2.3. Multi-Level Modeling in Visual Decoding

Hierarchical visual feature representations are frequently utilized in visual decoding. Early studies [1, 17] have indicated that hierarchical features extracted through pretrained CNN models demonstrate a strong correlation with neural activities of the visual cortices. Recent research delved into combining both low-level visual cues and high-level semantics inferred from brain activity, using frozen diffusion models for reconstruction [27, 32, 37, 50]. The low-level visual cues are commonly incorporated in an implicit manner, such as utilizing intermediate features predicted by a
large-scale vision model [32] or encoded from an initial estimated image [37]. The high-level semantics is frequently represented as a CLIP embedding. Recent research [10, 42] also suggested the explicit provision of both low-level and high-level information, predicting captions and depth maps from brain signals. Our method differs in terms of how the auxiliary information is predicted and in the incorporation of color.

Figure 3. Relation of the HVS and Our Proposed DREAM. Grounding on the Human Visual System (HVS), we devise reverse pathways aimed at deciphering semantics, depth, and color cues from fMRI to guide image reconstruction. (Left) Schematic view of the HVS, detailed in Sec. 3. When perceiving visual stimuli, connections from the retina to the brain can be separated into two parallel pathways. The Parvocellular Pathway originates from midget cells in the retina and is responsible for transmitting color information, while the Magnocellular Pathway starts with parasol cells and is specialized in detecting depth and motion. The conveyed information is channeled into the visual cortex for undertaking intricate processing of high-level semantics from the visual image. (Right) DREAM mimics the corresponding inverse processes within the HVS: the Reverse VAC (Sec. 4.1) replicates the opposite operations of this brain region, analogously extracting semantics Ŝ as a form of CLIP embedding from fMRI; and the Reverse PKM (Sec. 4.2) maps fMRI to color Ĉ and depth D̂ in the form of spatial palettes and depth maps to facilitate subsequent processing by the Color Adapter (C-A) and the Depth Adapter (D-A) in T2I-Adapter [29] in conjunction with SD [35] for image reconstruction from the deciphered semantics, color, and depth cues {Ŝ, Ĉ, D̂}.

3. Preliminary on the Human Visual System

The Human Visual System (HVS) endows us with the ability of visual perception. Visual information is concurrently relayed from various cell types in the retina, each capturing distinct facets of data, through the optic nerve to the brain. Connections from the retina to the brain, as shown in Fig. 3, can be separated into a parvocellular pathway and a magnocellular pathway.1 The parvocellular pathway originates from midget cells in the retina and is responsible for transmitting color information, while the magnocellular pathway starts with parasol cells and is specialized in detecting depth and motion. The visual information is first directed to a sensory relay station known as the lateral geniculate nucleus (LGN) of the thalamus, before being channeled to the visual cortex (V1) for the initial processing of visual stimuli. The visual association cortex (VAC) receives processed information from V1 and undertakes intricate processing of high-level semantic contents from the visual image.

The hierarchical and parallel manner in which visual stimuli are broken down and passed forward as color, depth, and semantics guided our choices to reverse the HVS for decoding. For a detailed illustration of human perception and an analysis of the feasibility of extracting the desired cues from fMRI recordings, please consult the supplementary material.

1 An additional set of neurons, known as the koniocellular layers, are found ventral to each of the magnocellular and parvocellular layers [3].

4. DREAM

The task of visual decoding aims to recover the viewed image I ∈ R^{H×W×3} from brain activity signals elicited by visual stimuli. Functional MRI (fMRI) is usually employed as a proxy of the brain activities, typically encoded as a set of voxels fMRI ∈ R^{1×N}. Formally, the task optimizes f(·) so that f(fMRI) = Î, where Î best approximates I.

To address this task, we propose DREAM, a method grounded on fundamental principles of human perception. Following Sec. 3, our method relies on an explicit design of reverse pathways to decipher the Semantics, Color, and Depth intertwined in the fMRI data. These reverse pathways mirror the forward process from visual stimuli to brain activity. Considering that an fMRI captures changes in the brain regions during the forward process, it is feasible to derive the desired cues of the visual stimuli from such a recording [2].
Overview. Fig. 3 illustrates an overview of DREAM. It is constructed on two consecutive phases, namely, Pathways Reversing and Guided Image Reconstruction. These phases break down the reverse mapping from fMRI to image into two subprocesses: fMRI ↦ {Ŝ, Ĉ, D̂} and {Ŝ, Ĉ, D̂} ↦ Î. In the first phase, two Reverse Pathways decipher the cues of semantics, color, and depth from fMRI with parallel components: the Reverse Visual Association Cortex (R-VAC, Sec. 4.1) inverts operations of the VAC region to extract semantic details from the fMRI, encoded as a CLIP embedding [33], and the Reverse Parallel Parvo-, Konio- and Magno-Cellular (R-PKM, Sec. 4.2) component is designed to predict color and depth simultaneously from fMRI signals. Given the lossy nature of fMRI data and the non-bijective transformation image ↦ fMRI, we then cast the decoding process as a generative task while using the extracted Ŝ, Ĉ, D̂ cues as conditions for image reconstruction. Therefore, in the second phase, Guided Image Reconstruction (GIR, Sec. 4.3), we follow recent visual decoding practices [41, 42] and employ a frozen SD with T2I-Adapter [29] to generate images, benefiting here from the additional Ŝ, Ĉ, D̂ guidance.

4.1. R-VAC (Semantics Decipher)

The Visual Association Cortex (VAC), as detailed in Sec. 3, is responsible for interpreting the high-level semantics of visual stimuli. We design R-VAC to reverse this process through analogous learning of the mapping from fMRI to semantics, fMRI ↦ Ŝ. This is achieved by training an encoder Efmri whose goal is to align the fMRI embedding with the shared CLIP space [33]. Though CLIP was initially trained with image-text pairs, prior works [26, 37] demonstrated the ability to align new modalities. To fight the scarcity of fMRI data, we also carefully select an ad-hoc data augmentation strategy [21].

Figure 4. R-VAC Training. To decipher semantics from fMRI, we train an encoder Efmri in a contrastive fashion which aligns fMRI data with the frozen CLIP space [33]. Data augmentation (represented by the dashed rectangle) [21] combats the data scarcity of the fMRI modality. See Sec. 4.1 for more details.

Contrastive Learning. In practice, we train the fMRI encoder Efmri with triplets of {fMRI, image, caption} to pull the fMRI embeddings closer to the rich shared semantic space of CLIP. Given that both the text encoder (Etxt) and the image encoder (Eimg) of CLIP are frozen, we minimize the embedding distances of fMRI-image and fMRI-text, which in turn forces alignment of the fMRI embedding with CLIP. The training is illustrated in Fig. 4. Formally, with the embeddings of fMRI, text, and image denoted by p, c, v, respectively, the initial contrastive loss writes

\mathcal{L}_{p} = - \log \frac{\exp\left(p_i \cdot c_i / \tau\right)}{\sum_{j=0}^K \exp\left(p_i \cdot c_j / \tau\right)} - \log \frac{\exp\left(p_i \cdot v_i / \tau\right)}{\sum_{j=0}^K \exp\left(p_i \cdot v_j / \tau\right)},   (1)

where τ is a temperature hyperparameter. The sum for each term is over one positive and K negative samples. Each term represents the log loss of a (K+1)-way softmax-based classifier [16], which aims to classify p_i as c_i (or v_i). The sum over the samples of the batch of size n is omitted for brevity. The joint image-text-fMRI representation is intended for potential retrieval purposes [37]. The fMRI-image or fMRI-text components are utilized for specific tasks.

Data Augmentation. An important issue to consider is that there are significantly fewer fMRI samples (≈ 10^4) compared to the number of samples used to train CLIP (10^8), which may damage contrastive learning [19, 52]. To address this, we utilize a data augmentation loss based on MixCo [21], which generates mixed fMRI data r_{mix_{i,k}} from a convex combination of two fMRI data points r_i and r_k:

r_{\text{mix}_{i,k}} = \lambda_i \cdot r_i + (1-\lambda_i) \cdot r_k,   (2)

where k represents the arbitrary index of any data point in the same batch, and its encoding writes p_i^* = E_{fmri}(r_{mix_{i,k}}). The data augmentation loss, which excludes the image components for brevity, is formulated as

\begin{aligned} \mathcal{L}_\text{MixCo} = & -\sum_{i=1}^{n} \biggl[\lambda_i \cdot \log \frac{\exp\left(p_i^* \cdot c_i/\tau\right)}{\sum_{j=0}^K \exp\left(p_i^* \cdot c_j/\tau\right)} \\ & + \left(1-\lambda_i\right) \cdot \log \frac{\exp\left(p_i^* \cdot c_{k}/\tau\right)}{\sum_{j=0}^K \exp\left(p_i^* \cdot c_j/\tau\right)} \biggr]. \end{aligned}   (3)

Finally, the total loss is a combination of L_p and L_MixCo weighted with a hyperparameter α:

\mathcal{L}_{total} = \mathcal{L}_p + \alpha \mathcal{L}_\text{MixCo}.   (4)
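To make the R-VAC objective concrete, the sketch below illustrates one way Eqs. (1)-(4) could be implemented in PyTorch: an InfoNCE-style loss with in-batch negatives against frozen CLIP text and image embeddings, plus a MixCo mixup term. The encoder name, tensor shapes, negative-sampling scheme, and default hyperparameters are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the R-VAC objective (Eqs. 1-4): contrastive alignment of fMRI
# embeddings with frozen CLIP text/image embeddings, plus a MixCo-style mixup term.
# Shapes, encoder, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(p, targets, tau=0.07):
    """(K+1)-way softmax log loss with in-batch negatives (one term of Eq. 1)."""
    logits = p @ targets.t() / tau                   # [n, n] similarity matrix
    labels = torch.arange(p.size(0), device=p.device)
    return F.cross_entropy(logits, labels)

def rvac_loss(fmri_encoder, r, c, v, alpha=0.3, tau=0.07):
    """r: raw fMRI voxels [n, N]; c, v: frozen CLIP text/image embeddings [n, d]."""
    p = F.normalize(fmri_encoder(r), dim=-1)
    c = F.normalize(c, dim=-1)
    v = F.normalize(v, dim=-1)
    # Eq. (1): pull fMRI embeddings toward both the text and the image embeddings.
    loss_p = info_nce(p, c, tau) + info_nce(p, v, tau)

    # Eq. (2): convex combination of two fMRI samples within the batch.
    lam = torch.rand(r.size(0), 1, device=r.device)
    perm = torch.randperm(r.size(0), device=r.device)
    r_mix = lam * r + (1.0 - lam) * r[perm]
    p_mix = F.normalize(fmri_encoder(r_mix), dim=-1)

    # Eq. (3): MixCo term (text components only, following the written formulation).
    logits = p_mix @ c.t() / tau
    log_prob = F.log_softmax(logits, dim=-1)
    idx = torch.arange(r.size(0), device=r.device)
    loss_mixco = -(lam.squeeze(1) * log_prob[idx, idx]
                   + (1.0 - lam.squeeze(1)) * log_prob[idx, perm]).mean()

    return loss_p + alpha * loss_mixco               # Eq. (4)
```

In MixCo the mixing coefficient λ is usually drawn from a Beta distribution; a uniform draw is used here only to keep the sketch short.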
4.2. R-PKM (Depth & Color Decipher)

While R-VAC provides semantic knowledge, the latter is inherently bounded by the capacity of the CLIP space, unable to encode spatial colors and geometry. To address this issue, inspired by the human visual system, we craft the R-PKM component to reverse the pathways of the Parvo-, Konio- and Magno-Cellular (PKM) cells, subsequently predicting color and depth from the fMRI data, denoted as fMRI ↦ {Ĉ, D̂}. While color and depth can be represented in various ways (e.g., histograms, graphs), we represent them as spatial color palettes and depth maps to facilitate the reconstruction guidance, as discussed in Sec. 4.3. Visuals are in Fig. 3.

Figure 5. R-PKM Training. Our multi-stage training reverses the PKM pathway and decodes color and depth cues in fMRI data. Stages 1 and 2 employ (RGBD, fMRI) pairs to train an encoder E that maps RGBD to fMRI and a decoder D that decodes RGBD from fMRI. Stage 3 benefits from additional RGBD images without fMRI to train D in a cycle-consistent manner d̂ = D(E(d)), while keeping E frozen. See Sec. 4.2 for details.

In practice, we formulate the problem as RGBD estimation. The color palette is then derived from the RGB prediction by first ×64 downscaling it and then upscaling it back to its original size. There are readily available methods for fMRI ↦ RGBD mapping [10, 42] but they offer limited performance due to the scarcity of fMRI data. Instead, we introduce a multi-stage encoder-decoder training [13], which benefits from both the scarcely available (fMRI, RGBD) pairs and the abundant RGBD data without fMRI. Fig. 5 shows the training procedure for R-PKM.

Stage 1. Given limited pairs {(r, d)} = {fMRI, RGBD}, we first train an encoder to map RGBD to the corresponding fMRI data. To compensate for the absence of depth in fMRI datasets, we use MiDaS-estimated depth maps [34] as surrogate ground-truth depth. The encoder is trained with a convex combination of the mean square error and the cosine proximity between the input r and its predicted counterpart r̂:

\mathcal{L}_r(r, \hat{r}) = \beta \cdot \operatorname{MSE}(r, \hat{r}) - (1-\beta)\cos(\angle(r, \hat{r})),   (5)

where β is determined empirically as a hyperparameter.
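A minimal sketch of the Stage 1 objective in Eq. (5) is given below, assuming batched fMRI vectors; the default β follows the value reported in the implementation details.

```python
# Hedged sketch of the Stage 1 encoder objective (Eq. 5): weighted MSE minus
# cosine proximity between a recorded fMRI vector r and its prediction r_hat.
import torch
import torch.nn.functional as F

def encoder_loss(r, r_hat, beta=0.9):
    """r, r_hat: [n, N] fMRI voxel vectors (r_hat predicted from RGBD by the encoder E)."""
    mse = F.mse_loss(r_hat, r)
    cos = F.cosine_similarity(r_hat, r, dim=-1).mean()
    return beta * mse - (1.0 - beta) * cos
```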

Stage 2. Similar to stage 1, we now train the decoder with pairs {(r, d)} in a supervised manner:

\mathcal{L}_s(d, \hat{d}) = \|d-\hat{d}\|_1 + \mathcal{J}(\hat{d}),   (6)

where d̂ = D(r) and the total variation regularization J(d̂) encourages spatial smoothness in the reconstructed d̂.
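Eq. (6) can be sketched as below, with an anisotropic total-variation term standing in for J(·); the relative weighting of the regularizer is an assumption.

```python
# Hedged sketch of the Stage 2 decoder objective (Eq. 6): L1 reconstruction of the
# RGBD output plus a total-variation penalty encouraging spatial smoothness.
import torch

def total_variation(x):
    """Anisotropic TV over spatial dims. x: [n, 4, H, W] RGBD prediction."""
    tv_h = (x[:, :, 1:, :] - x[:, :, :-1, :]).abs().mean()
    tv_w = (x[:, :, :, 1:] - x[:, :, :, :-1]).abs().mean()
    return tv_h + tv_w

def decoder_loss(d, d_hat, tv_weight=1.0):
    """d: ground-truth RGBD; d_hat = D(r): RGBD decoded from fMRI."""
    return (d - d_hat).abs().mean() + tv_weight * total_variation(d_hat)
```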
Natural Scene Dataset [2]. We report our performance
where dˆ = D(r) and the total variation regularization J (d)
ˆ
ˆ against five leading methods [15, 24, 32, 37, 41]. We de-
encourages spatial smoothness in the reconstructed d. tail our experimental methodology in Sec. 5.1 and report
Stage 3. To address the scarcity of fMRI data and improve quantitative and qualitative evaluations in Sec. 5.2. Abla-
the model generalization to unseen categories, we employ a tion studies are presented in Sec. 6.
self-supervised strategy to finetune the decoder while keep-
5.1. Experimental Setting
ing the encoder frozen. This facilitate the usage of any
natural images (e.g., from ImageNet [9] or LAION [36]) Dataset. We use the Natural Scenes Dataset (NSD) [2]
along with their estimated depth maps, without need of in all experiments, which follows the standard practices in
paired fMRI and image data. Hence, we train solely with the field [15, 24, 27, 32, 37, 41]. NSD, as the largest fMRI
the RGBD data by ensuring a cycle consistency through the dataset, records brain responses from eight human subjects
Encoder-Decoder transformation, i.e. dˆ = D(E(d)), with successively isolated in an MRI machine and passively ob-
the loss in Eq. (6). Given that this stage involves images served a wide range of visual stimuli, namely, natural im-
for which fMRI data was never collected, the model greatly ages sourced from MS-COCO [25], which allows retriev-
improves its generalization capability. ing the associated captions. In practice, because brain ac-
tivity patterns highly vary across subjects [18], a separate
4.3. Guided Image Reconstruction (GIR)
model is trained per subject. The standardized splits contain
Equipped with R-VAC (Sec. 4.1) and R-PKM (Sec. 4.2), 982 fMRI test samples and 24,980 fMRI training samples.
our method can decipher semantics Ŝ, color Ĉ, and depth Please refer to the supplementary material for more details.
Testseen
Image MindEye et al. Takagi
Image seenBrain-Diffuser
MindEye Ozcelik
MindEye Takagi et
et al.
Ozcelik al. DREAM
et al. Image seen
Takagi et al. Test
MindEye
Image MindEye
seen Ozcelik et al.
MindEye Brain-Diffuser
Gu et al. et al.
Ozcelik Gu
Gu et al.
et al. DREAM
Image (2023) (2023) (2022) (Ours) Image (2023) (2023) (2023) (Ours)
in MRI (ours)
in MRI (2023)
(ours) (2022)
(2023) in MRI
(2022) (ours)
in MRI (2023)
(ours) (2023)
(2023) (2023)

Figure 6. Sample Visual Decoding Results from the SOTA Methods on NSD.

Table 1. Quantitative Evaluation. Following standard NSD metrics, DREAM performs on a par or better than the SOTA methods (we
highlight best and second). We also report ablation of the two strategies fighting fMRI data scarcity: R-VAC without Data Augmentation
(DA) and R-PKM without the third-stage decoder training (S3) that allows additional RGBD data without fMRI.
Method                Low-Level: PixCorr ↑  SSIM ↑  AlexNet(2) ↑  AlexNet(5) ↑    High-Level: Inception ↑  CLIP ↑  EffNet-B ↓  SwAV ↓
Mind-Reader [24] − − − − 78.2% − − −
Takagi et al. [41] − − 83.0% 83.0% 76.0% 77.0% − −
Gu et al. [15] .150 .325 − − − − .862 .465
Brain-Diffuser [32] .254 .356 94.2% 96.2% 87.2% 91.5% .775 .423
MindEye [37] .309 .323 94.7% 97.8% 93.8% 94.1% .645 .367
DREAM (Ours) .274 .328 93.9% 96.7% 93.4% 94.1% .645 .418
DREAM (sub01) .288 .338 95.0% 97.5% 94.8% 95.2% .638 .413
w/o DA (R-VAC) .279 .340 86.8% 88.1% 87.2% 89.9% .662 .517
w/o S3 (R-PKM) .203 .295 92.7% 96.2% 92.1% 94.6% .642 .463

Table 2. Consistency of the Decoded Images. We evaluate the color and depth consistencies in decoded images by comparing the distances between test images and visual decoding results. DREAM significantly outperforms the other two methods [32, 37].

Method                Depth: Abs Rel ↓  Sq Rel ↓  RMSE ↓  RMSE log ↓    Color: CD ↓  STRESS ↓
Brain-Diffuser [32]          10.162     4.819     9.871    1.157                4.231   47.025
MindEye [37]                  8.391     4.176     9.873    1.075                4.172   45.380
DREAM (Ours)                  7.695     4.031     9.862    1.039                2.957   37.285

Metrics for Visual Decoding. The same set of eight metrics is utilized for our evaluation in accordance with prior research [32, 37]. To be specific, PixCorr is the pixel-level correlation between the reconstructed and ground-truth images. Structural Similarity Index (SSIM) [46] quantifies the similarity between two images; it measures structural and textural similarity rather than just pixel-wise differences. AlexNet(2) and AlexNet(5) are two-way comparisons of the second and fifth layers of AlexNet [22], respectively. Inception is the two-way comparison of the last pooling layer of InceptionV3 [40]. CLIP is the two-way comparison of the last layer of the CLIP-Vision [33] model. EffNet-B and SwAV are distances gathered from EfficientNet-B1 [44] and SwAV-ResNet50 [5], respectively. The first four metrics can be categorized as low-level measurements, whereas the remaining four capture higher-level characteristics.

Metrics for Depth and Color. We use metrics from depth estimation and color correction to assess depth and color consistencies in the final reconstructed images. The depth metrics, as elaborated in [28], include Abs Rel (absolute error), Sq Rel (squared error), RMSE (root mean squared error), and RMSE log (root mean squared logarithmic error). For color metrics, we use CD (Color Discrepancy) [47] and STRESS (Standardized Residual Sum of Squares) [11]. Please consult the supplementary material for details.

Implementation Details. One NVIDIA A100-SXM-80GB GPU is used in all experiments, including the training of the fMRI ↦ Semantics encoder Efmri and the fMRI ↦ Depth & Color encoder E and decoder D. We use pretrained color and depth adapters from T2I-Adapter [29] to extract guidance features from the predicted spatial palettes and depth maps. These guidance features, along with the predicted CLIP representations, are then input into the pretrained SD model for the purpose of image reconstruction. The hyperparameters are set to α = 0.3 and β = 0.9, and ωc and ωd are set to 1.0 unless otherwise mentioned. For further details regarding the network architecture, please refer to the supplementary material.
Figure 7. Visual Decoding with DREAM. Sample outputs demonstrate DREAM's ability to accurately decode the visual stimuli from fMRI. Our depth and color predictions from the R-PKM (Sec. 4.2) are in line with the pseudo ground-truth, despite the extreme complexity of the task. DREAM reconstructions closely match the test images, and the rightmost samples demonstrate the benefit of color guidance. Columns show the test image, the ground-truth cues (D, C), the predictions (D̂, Ĉ), DREAM, and DREAM without color guidance.

Table 3. Effectiveness of R-VAC (Semantics Decipher) and R-PKM (Depth & Color Decipher). We conducted two sets of experiments
using ground-truth (GT) or predicted (Pred) cues to reconstruct the visual stimuli, respectively.
Reconstruction (Sec. 4.3)    PixCorr ↑  SSIM ↑  AlexNet(2) ↑  AlexNet(5) ↑  Inception ↑  CLIP ↑  EffNet-B ↓  SwAV ↓
GT    semantics {S}            .244      .272     96.68%        97.39%        87.82%      92.45%   1.00       .415
      +depth {S, D}            .186      .286     99.58%        99.78%        98.78%      98.09%   .723       .322
      +color {S, D, C}         .413      .366     99.99%        99.98%        99.19%      98.66%   .702       .278
Pred  semantics {Ŝ}            .194      .278     91.82%        92.57%        93.11%      91.24%   .645       .369
      +depth {Ŝ, D̂}            .083      .282     88.07%        94.69%        94.13%      96.05%   .802       .429
      +color {Ŝ, D̂, Ĉ}         .288      .338     94.99%        97.50%        94.80%      95.24%   .638       .413

5.2. Experimental Results and Analysis

Visual Decoding. Our method is compared with five state-of-the-art methods: Mind-Reader [24], Takagi et al. [41], Gu et al. [15], Brain-Diffuser [32], and MindEye [37]. The quantitative visual decoding results are presented in Tab. 1, indicating a competitive performance. Our method, with its explicit deciphering mechanism, appears to be more proficient at discerning scene structure and semantics, as evidenced by the favorable high-level metrics. The qualitative results, depicted in Fig. 6, align with the numerical findings, indicating that DREAM produces more realistic outcomes that maintain consistency with the viewed images in terms of semantics, appearance, and structure, compared to the other methods. Striking DREAM outputs are the food plate (left, middle row), which accurately decodes the presence of vegetables and a tablespoon, and the baseball scene (right, middle row), which showcases the correct number of players (3) with poses similar to the test image.

Besides visual appearance, we wish to measure the consistency of depth and color in the decoded images with respect to the test images viewed by the subject. We achieve this by measuring the variance in the estimated depth (and color palettes) of the test image and the reconstructed results from Brain-Diffuser [32], MindEye [37], or DREAM. Results presented in Tab. 2 indicate that our method yields images that align more consistently in color and depth with the visual stimuli than the other two methods.

Cues Deciphering. Our method decodes three cues from fMRI data: semantics, depth, and color. To assess the semantics deciphered by R-VAC (Sec. 4.1), we simply refer to the CLIP metric of Tab. 1, which quantifies CLIP embedding distances with the test image. From the aforementioned table, DREAM is at least 1.1% better than the others. Fig. 7 shows examples of the depth (D̂) and color (Ĉ) deciphered by R-PKM (Sec. 4.2). While accurate depth is beneficial for image reconstruction, faithfully recovering the original depth from fMRI is nearly impossible due to the information loss in capturing the brain activities [2]. Still, coarse depth is sufficient in most cases to guide the scene structure and object position, such as determining the location of an airplane or the orientation of a bird standing on a branch. This is intuitively understood from the bottom row of Fig. 7, where our coarse depth (D̂) leaves no doubt on the giraffe's location and orientation. Interestingly, despite not precisely preserving the local color, the estimated color palettes (Ĉ) provide a reliable constraint and guidance on the overall scene appearance. This is further demonstrated in the last three columns of Fig. 7 by removing the color guidance which, despite appealing visuals, proves to produce images drastically differing from the test image.

6. Ablation Study

Here we present ablation studies, discussing first the effect of using color, depth, and semantics as guidance for image reconstruction thanks to our reversed pathways (i.e., R-VAC and R-PKM).
Figure 8. Effect of the Composition Weight in GIR (Sec. 4.3). The two weights ωc and ωd control the relative importance of the corresponding features from the three deciphered cues: semantics, color, and depth. Columns show the test image, the ground-truth depth D, the predictions (D̂, Ĉ), and reconstructions for ωc/ωd set to 1/1 (ours), 1/0.6, 1/0, 0.6/0, 0/1, and 0/0. When predicted depth or color fail to provide reliable guidance, we can manually tweak the weights to achieve satisfactory reconstructed results.

Effect of Color Palettes. As highlighted earlier, the deciphered color guidance noticeably enhances the visual quality of the reconstructed images in Fig. 7. We further quantify the color's significance through two additional sets of experiments detailed in Tab. 3, where the reconstruction uses: 1) ground-truth (GT) depth and caption with or without color, and 2) fMRI-predicted (Pred) depth and semantic embedding with or without the predicted color. The results using ground-truth cues serve as a proxy. The generated results exhibit improved color consistency and enhanced quantitative performance across the board, underscoring the importance of using color for visual decoding.

Effect of Depth and Semantics. Tab. 3 presents results where we ablate the use of depth and semantics. Comparison of GT and predicted semantics (i.e., {S} and {Ŝ}) suggests that the fMRI embedding effectively incorporates high-level semantic cues into the final images. The overall image quality can be further improved by integrating either ground-truth depth (represented as {S, D}) or predicted depth ({Ŝ, D̂}) combined with color (denoted as {S, D, C} and {Ŝ, D̂, Ĉ}). Introducing color cues not only bolsters the structural information but also strengthens the semantics, possibly because it compensates for the color information absent in the predicted fMRI embedding. Of note, all metrics (except for PixCorr) improve smoothly with more GT guidance. Yet, the impact of predicted cues varies across metrics, highlighting intriguing research avenues and emphasizing the need for more reliable measures.

Effect of Data Scarcity Strategies. We ablate the two strategies introduced to fight fMRI data scarcity: data augmentation (DA) in R-VAC and the third-stage decoder training (S3), which allows R-PKM to use additional RGBD data without fMRI. The results shown in the two bottom rows of Tab. 1 demonstrate that the two data augmentation strategies address the limited availability of fMRI data and subsequently bolster the model's generalization capability.

Effect of Weighted Guidance. The features inputted into SD are formulated as S + ωc Rc(C) + ωd Rd(D), where the two weights ωc and ωd from Eq. (7) control the relative importance of the deciphered cues and play a crucial role in the final image quality and alignment with the cues. In DREAM, ωc and ωd are set to 1.0, showing no preference of color guidance over depth guidance. Still, Fig. 8 shows that in some instances the predicted components fail to provide dependable guidance on structure and appearance, thus compromising results. There is also empirical evidence indicating that the T2I-Adapter slightly underperforms when compared to ControlNet. The performance of the T2I-Adapter further diminishes when multiple conditions are used, as opposed to just one. Taking both factors into account, there are instances where manual adjustments to the weighting parameters become necessary to achieve images of the desired quality, semantics, structure, and appearance.

7. Conclusion

This paper presents DREAM, a visual decoding method founded on principles of human perception. We design reverse pathways that mirror the forward pathways from visual stimuli to fMRI recordings. These pathways specialize in deciphering semantics, color, and depth cues from fMRI data and then use these predicted cues as guidance to reconstruct the visual stimuli. Experiments demonstrate that our method surpasses current state-of-the-art models in terms of consistency in appearance, structure, and semantics.

Acknowledgements. This work was supported by the Engineering and Physical Sciences Research Council [grant number EP/W523835/1] and a UKRI Future Leaders Fellowship [grant number G104084].
Appendices

In the following, we provide more details and discussion on the background knowledge, experiments, and further results of our method. We first provide more details on the NSD neuroimaging dataset in Sec. A and extend the background knowledge of the Human Visual System in Sec. B, which together shed light on our design choices. We then detail T2I-Adapter in Sec. C. Sec. D provides a thorough implementation of DREAM, including architectures, representations, and metrics. Finally, in Sec. E we further demonstrate the ability of our method with new results of cue deciphering, reconstruction, and reconstruction across subjects.

A. NSD Dataset

The Natural Scenes Dataset (NSD) [2] is currently the largest publicly available fMRI dataset. It features in-depth recordings of brain activities from 8 participants (subjects) who passively viewed images for up to 40 hours in an MRI machine. Each image was shown for three seconds and repeated three times over 30-40 scanning sessions, amounting to 22,000-30,000 fMRI response trials per participant. These viewed natural scene images are sourced from the Common Objects in Context (COCO) dataset [25], enabling the utilization of the original COCO captions for training.

The fMRI-to-image reconstruction studies that used NSD [15, 31, 41] typically follow the same procedure: training individual-subject models for the four participants who finished all scanning sessions (participants 1, 2, 5, and 7), and employing a test set that corresponds to the common 1,000 images shown to each participant. For each participant, the training set has 8,859 images and 24,980 fMRI trials (as each image is tested up to 3 times). Another 982 images and 2,770 fMRI trials are common across the four individuals. We use the preprocessed fMRI voxels in a 1.8-mm native volume space that corresponds to the "nsdgeneral" brain region. This region is described by the NSD authors as the subset of voxels in the posterior cortex that are most responsive to the presented visual stimuli. For fMRI data spanning multiple trials, we calculate the average response as in prior research [27]. Tab. 1 details the characteristics of the NSD dataset and the regions of interest (ROIs) included in the fMRI data.

Table 1. Details of the NSD dataset.

Training  Test   ROIs                                          Subject ID   Dimensions
8,859     982    V1, V2, V3, hV4, VO, PHC, MT, MST, LO, IPS    sub01        15,724
                                                               sub02        14,278
                                                               sub05        13,039
                                                               sub07        12,682

Figure 1. Functional Anatomy of Cortex. The functional localization in the human brain is based on findings from functional brain imaging, which link various anatomical regions of the brain to their associated functions. Source: Wikimedia Commons. This image is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.

B. Detailed Human Visual System

Our approach aims to decode semantics, color, and depth from fMRI data, and is thus inherently bounded by the ability of fMRI data to capture the ad hoc brain activities. It is crucial to ascertain whether fMRI captures the alterations in the respective human brain regions responsible for processing the visual information. Here, we provide a comprehensive examination of the specific brain regions in the human visual system recorded by the fMRI data.

The flow of visual information [3] in neuroscience is presented as follows. Fig. 1 presents a comprehensive depiction of the functional anatomy of visual perception. Sensory input originating from the Retina travels through the LGN in the thalamus and then reaches the Visual Cortex. The Retina is a layer within the eye comprised of photoreceptor and glial cells. These cells capture incoming photons and convert them into electrical and chemical signals, which are then relayed to the brain, resulting in visual perception. Different types of information are processed through the parvocellular and magnocellular pathways, details of which are elaborated in the main paper. The LGN then channels the conveyed visual information into the Visual Cortex, where it diverges into two streams in the Visual Association Cortex (VAC) for undertaking intricate processing of high-level semantic contents from the visual image.

The Visual Cortex, also known as visual area 1 (V1), serves as the initial entry point for visual perception within the cortex. Visual information flows here first before being relayed to other regions. The VAC comprises multiple regions surrounding the visual cortex, including V2, V3, V4, and V5 (also known as the middle temporal area, MT). V1 transmits information into two primary streams: the ventral stream and the dorsal stream.
• The ventral stream (black arrow) begins with V1, goes through V2 and V4, and to the inferior temporal cortex (IT cortex). The ventral stream is responsible for the "meaning" of the visual stimuli, such as object recognition and identification.

• The dorsal stream (blue arrow) begins with V1, goes through visual area V2, then to the dorsomedial area (DM/V6) and the medial temporal area (MT/V5), and to the posterior parietal cortex. The dorsal stream is engaged in analyzing information associated with "position", particularly the spatial properties of objects.

After juxtaposing the explanations illustrated in Fig. 1 with the collected information demonstrated in Tab. 1, it becomes apparent that the changes occurring in brain regions linked to the processing of semantics, color, and depth are indeed present within the fMRI data. This observation emphasizes the capability to extract the intended information from the provided fMRI recordings.

Figure 2. Depth and Color Representations. We present pseudo ground truth samples of Depth (MiDaS prediction [34]) and Color (×64 downsampling of the test image) for an NSD input image. Columns show the test image (I), the ground-truth depth (D), and the ground-truth color (C).

C. T2I-Adapter

T2I-Adapter [29] and ControlNet [51] learn versatile modality-specific encoders to improve the control ability of the text-to-image SD model [35]. These encoders extract guidance features from various conditions y (e.g., sketch, semantic label, and depth). They aim to align external control with internal knowledge in SD, thereby enhancing the precision of control over the generated output. Each encoder R produces n hierarchical feature maps F_R^i from the primitive condition y. Then each F_R^i is added to the corresponding intermediate feature F_SD^i in the denoising U-Net encoder:

\begin{aligned} \text{F}_{\mathcal{R}} &= \mathcal{R}(\mathbf{y}), \\ \hat{\text{F}}_{\text{SD}}^i &= \text{F}_{\text{SD}}^i + \text{F}_{\mathcal{R}}^i, \quad i \in \{1,2,\cdots,n\}. \end{aligned}   (1)

T2I-Adapter consists of a pretrained SD model and several adapters. These adapters are used to extract guidance features from various conditions. The pretrained SD model is then utilized to generate images based on both the input text features and the additional guidance features. The CoAdapter mode becomes available when multiple adapters are involved, and a composer processes features from these adapters before they are further fed into the SD. Given the deciphered semantics, color, and depth information from fMRI, we can reconstruct the final images using the color and depth adapters in conjunction with SD.
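To illustrate the mechanism of Eq. (1) above and its use in Eq. (7) of the main paper, the sketch below builds two toy adapters that map the decoded color palette and depth map to feature pyramids and composes them with the weights ωc and ωd. The adapter architecture, channel widths, and function names are illustrative assumptions and not the T2I-Adapter API.

```python
# Hedged sketch of adapter-style guidance composition (Eq. (1) above, Eq. (7) in the
# main paper): modality-specific encoders turn the decoded color palette and depth
# map into multi-scale residual features, weighted by w_c and w_d and summed before
# being added to the denoising U-Net encoder features. Everything is illustrative.
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Maps a spatial condition (e.g., 3-channel palette or 1-channel depth)
    to a pyramid of guidance feature maps at decreasing resolutions."""
    def __init__(self, in_ch, dims=(64, 128, 256)):
        super().__init__()
        blocks, prev = [], in_ch
        for d in dims:
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, d, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(d, d, 3, padding=1)))
            prev = d
        self.blocks = nn.ModuleList(blocks)

    def forward(self, cond):
        feats, x = [], cond
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)
        return feats                                  # [F^1, F^2, ..., F^n]

def compose_guidance(color_adapter, depth_adapter, palette, depth, w_c=1.0, w_d=1.0):
    """F_R = w_c * R_c(C_hat) + w_d * R_d(D_hat), computed per pyramid level."""
    f_c = color_adapter(palette)                      # palette: [n, 3, H, W]
    f_d = depth_adapter(depth)                        # depth:   [n, 1, H, W]
    return [w_c * c + w_d * d for c, d in zip(f_c, f_d)]

# The resulting features would then be added level-wise to the U-Net encoder
# features (F_SD^i + F_R^i), while the CLIP embedding S_hat conditions the sampler.
```

In DREAM both weights default to 1.0; lowering ωc or ωd mimics the manual re-weighting discussed in the ablation on weighted guidance (Sec. 6 of the main paper).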
D. Implementation Details

D.1. Network Architectures

The fMRI ↦ Semantics encoder Efmri maps fMRI voxels to the shared CLIP latent space [33] to decipher semantics. The network architecture includes a linear layer followed by multiple residual blocks, a linear projector, and a final MLP projector, akin to previous research [6, 37]. The learned embedding has a feature dimension of 77 × 768, where 77 denotes the maximum token length and 768 represents the encoding dimension of each token. It is then fed into the pretrained Stable Diffusion [35] to inject semantic information into the final reconstructed images.

The fMRI ↦ Depth & Color encoder E and decoder D decipher depth and color information from the fMRI data. Given that spatial palettes are generated by first downsampling (with bicubic interpolation) an image and then upsampling (with nearest interpolation) it back to its original resolution, the primary objective of the encoder E and the decoder D shifts towards predicting RGBD images from fMRI data. The architecture of E and D is built on top of [12], with inspirations drawn from VDVAE [7].
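For concreteness, a minimal PyTorch sketch of such an fMRI-to-semantics encoder is given below. The 77 × 768 output and the linear layer / residual blocks / projector / MLP projector structure follow the description above; the hidden width, block count, and normalization choices are assumptions.

```python
# Hedged sketch of the fMRI-to-semantics encoder E_fmri described above: a linear
# layer, a stack of residual MLP blocks, a linear projector, and an MLP projector
# mapping flattened fMRI voxels to a 77 x 768 CLIP-like embedding. Widths and block
# count are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim, hidden=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x):
        return x + self.net(x)

class FMRIEncoder(nn.Module):
    def __init__(self, num_voxels, dim=4096, n_blocks=4, tokens=77, token_dim=768):
        super().__init__()
        self.tokens, self.token_dim = tokens, token_dim
        self.inp = nn.Linear(num_voxels, dim)                    # linear layer
        self.blocks = nn.Sequential(*[ResidualBlock(dim) for _ in range(n_blocks)])
        self.proj = nn.Linear(dim, tokens * token_dim)           # linear projector
        self.mlp = nn.Sequential(                                # final MLP projector
            nn.GELU(), nn.Linear(tokens * token_dim, tokens * token_dim))

    def forward(self, voxels):                                   # voxels: [n, num_voxels]
        x = self.mlp(self.proj(self.blocks(self.inp(voxels))))
        return x.view(-1, self.tokens, self.token_dim)           # [n, 77, 768]

# Example: sub01 has 15,724 voxels (Tab. 1), so FMRIEncoder(15724) would emit the
# 77 x 768 embedding used to condition Stable Diffusion.
```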
D.2. Representation of Semantics, Color and Depth

This section serves as an introduction to the possible choices of representations for semantics, color, and depth. We currently use a CLIP embedding, a depth map [34], and a spatial color palette [29] to facilitate subsequent processing by T2I-Adapter [29] in conjunction with a pretrained Stable Diffusion [35] for image reconstruction from the deciphered cues. However, there are other possibilities that can be utilized within our framework.

Semantics. Stable Diffusion utilizes a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. It has a feature space dimension of 77 × 768, where 77 denotes the maximum token length and 768 represents the encoding dimension of each token. The CLIP ViT-L/14 image encoder has a feature space dimension of 257 × 768. We map flattened voxels to an intermediate space of size 77 × 768, corresponding to the last hidden layer of CLIP ViT-L/14. The learned embeddings inject semantic information into the reconstructed images.

Depth. We select depth as the structural guidance for two main reasons: alignment with the human visual system, and better performance demonstrated in our preliminary experiments. Following prior research [29, 51], we use the MiDaS predictions [34] as the surrogate ground-truth depth maps, which are visualized in Fig. 2.

Color. There are many representations that can provide the color information, such as histograms and probabilistic palettes [23, 45]. However, ControlNet [51] and T2I-Adapter [29] only accept spatial inputs, which leaves no alternative but to utilize spatial color palettes as the color representation. In practice, spatial color palettes resemble coarse-resolution images, as seen in Fig. 2, and are generated by first ×64 downsampling (with bicubic interpolation) an image and then upsampling (with nearest interpolation) it back to its original resolution.

During the image reconstruction phase, the spatial palettes contribute the color and appearance information to the final images. These spatial palettes are derived from the image estimated by the RGBD decoder in R-PKM. We refer to the images produced at this stage as the "initial guessed image" to differentiate them from the final reconstruction. The initial guessed image offers color cues but it also contains inaccuracies. By employing a ×64 downsampling, we can effectively extract the necessary color details from this image while minimizing the side effects of inaccuracies.
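For illustration, the following sketch reproduces this palette construction for a batched RGB tensor; the tensor layout and the clamping step are assumptions.

```python
# Hedged sketch of the spatial color palette described above: x64 bicubic
# downsampling followed by nearest-neighbor upsampling back to the original size,
# which keeps only the coarse color distribution of the initial guessed image.
import torch
import torch.nn.functional as F

def spatial_color_palette(img, factor=64):
    """img: [n, 3, H, W] RGB tensor in [0, 1]; returns a palette of the same size."""
    n, c, h, w = img.shape
    low = F.interpolate(img, size=(max(1, h // factor), max(1, w // factor)),
                        mode="bicubic", align_corners=False)
    low = low.clamp(0.0, 1.0)            # bicubic interpolation may overshoot
    return F.interpolate(low, size=(h, w), mode="nearest")
```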
Other Guidance. In the realm of visual decoding with pretrained diffusion models [29, 49, 51], any guidance available in these models can be harnessed to fill in gaps of missing information, thereby enhancing performance. This spatial guidance includes representations such as sketches [39], Canny edge detection, HED (Holistically-Nested Edge Detection) [48], and semantic segmentation maps [4]. These alternatives could potentially serve as the intermediate representations for the reverse pathways in our method. HED and Canny are edge detectors, which provide object boundaries within images. However, during our preliminary experiments, both methods were shown to face challenges in providing reliable edges for all images. Sketches encounter similar difficulties in providing reliable guidance. The semantic segmentation map provides both structural and semantic cues. However, it overlaps in function with CLIP semantics and depth maps, and leads to a diminished performance gain on top of the other two representations.

D.3. Evaluation Methodology

Metrics for Visual Decoding. For visual decoding metrics, we employ the same suite of eight evaluation criteria as previously used in research [15, 31, 37, 41]. PixCorr, SSIM, AlexNet(2), and AlexNet(5) are categorized as low-level, while Inception, CLIP, EffNet-B, and SwAV are considered high-level. Following [31], we downsampled the generated images from a 512 × 512 resolution to a 425 × 425 resolution (corresponding to the resolution of the ground-truth images in the NSD dataset) for the PixCorr and SSIM metrics. For the other metrics, the generated images were adjusted based on the input specifications of each respective network. It should be noted that not all evaluation outcomes are available for earlier models, depending on the metrics they chose to experiment with. Our quantitative comparisons with MindEye [37], Takagi et al. [41], and Gu et al. [15] are made according to the exact same test set, i.e., the 982 images that are shared across all 4 subjects. Lin et al. [24] disclosed their findings exclusively for Subject 1, with a custom training-test dataset split.

Metrics for Depth and Color. We additionally measure the consistency of our extracted depth and color. We borrow some common metrics from depth estimation [28] and color correction [47] to assess depth and color consistencies in the final reconstructed images. For depth metrics, we report Abs Rel (absolute error), Sq Rel (squared error), RMSE (root mean squared error), and RMSE log (root mean squared logarithmic error), detailed in [28]. For color metrics, we use CD (Color Discrepancy) [47] and STRESS (Standardized Residual Sum of Squares) [11]. CD calculates the absolute differences between the ground truth I and the reconstructed image Î by utilizing the normalized histograms of images segmented into bins:

\mathrm{CD}(I, \hat{I}) = \sum \left| \mathcal{H}(I) - \mathcal{H}(\hat{I}) \right|,   (2)

where H(·) represents the histogram function over the given range (e.g., [0, 255]) and number of bins. In simpler terms, this equation computes the absolute difference between the histograms of the two images for all bins and then sums them up. The number of bins for the histogram is set to 64. STRESS calculates a scaled difference between the ground truth C and the estimated color palette Ĉ:

\text{STRESS} = 100 \sqrt{\frac{\sum_{i=1}^n (F \hat{\texttt{C}}_i - \texttt{C}_i)^2}{\sum_{i=1}^n \texttt{C}_i^2}},   (3)

where n is the number of samples and F is calculated as

F = \frac{\sum_{i=1}^n \hat{\texttt{C}}_i \texttt{C}_i}{\sum_{i=1}^n \hat{\texttt{C}}_i^2}.   (4)
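For reference, the two color metrics can be sketched as follows; treating the image as a single histogram with 64 bins over [0, 255] and the palette as a flat array of values are assumptions consistent with the description above.

```python
# Hedged sketch of the color metrics above: CD (Eq. 2) compares normalized
# histograms of the two images, and STRESS (Eqs. 3-4) compares the estimated color
# palette against the ground truth after an optimal scaling factor F.
import numpy as np

def color_discrepancy(img, img_hat, bins=64, value_range=(0, 255)):
    """img, img_hat: uint8 arrays of identical shape (e.g., H x W x 3)."""
    h = np.histogram(img, bins=bins, range=value_range)[0] / img.size
    h_hat = np.histogram(img_hat, bins=bins, range=value_range)[0] / img_hat.size
    return np.abs(h - h_hat).sum()

def stress(c, c_hat):
    """c, c_hat: 1-D arrays of ground-truth and estimated color palette values."""
    c, c_hat = np.asarray(c, dtype=float), np.asarray(c_hat, dtype=float)
    f = (c_hat * c).sum() / (c_hat ** 2).sum()                      # Eq. (4)
    return 100.0 * np.sqrt(((f * c_hat - c) ** 2).sum() / (c ** 2).sum())  # Eq. (3)
```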

E. Additional DREAM Results

This section presents additional results of our method, to showcase the effectiveness of DREAM. Sec. E.1 presents the fMRI ↦ depth & color results, which demonstrate how the deciphered and represented color and depth information helps to boost the performance of visual decoding. Sec. E.2 provides more examples of fMRI test reconstructions from subject 1. The results show that the essential cues extracted from fMRI recordings lead to enhanced consistency in appearance, structure, and semantics when compared to the viewed visual stimuli. Sec. E.3 provides results for all four subjects.
Figure 3. DREAM Decoding of Depth and Color. We display the test image corresponding to the fMRI, alongside the depth ground truth (D) and the depth/color predictions (D̂, Ĉ). The R-PKM component predicts depth maps and the initial guessed RGB images (Î0). The predicted spatial palettes are derived from these initial guessed images. The results highlight the proficiency of our R-PKM module in capturing and converting intricate aspects of fMRI recordings into essential cues for visual reconstructions.

Figure 4. Sample Depth. We show sample depth maps (D̂) deciphered from fMRI using R-PKM, alongside the ground-truth depth (D) estimated by MiDaS [34] on the original test image (I).

E.1. Depth & Color Deciphering

Fig. 3 showcases additional depth and color results deciphered by the R-PKM component. Overall, it is able to capture and translate these intricate aspects of fMRI recordings into spatial guidance crucial for more accurate image reconstructions.

For depth, the second and third columns show example depth reconstructions alongside their corresponding estimated ground truth obtained from the original RGB images. The results show that the estimated depth, while far from perfect, is sufficient to provide coarse guidance on the scene structure and object position/orientation for our reconstruction guidance purpose.

The last two columns show the color results. The predicted spatial palettes are generated by downscaling the "initial guessed images", denoted Î0 (not to be confused with Î), which correspond to the RGB channels of the R-PKM RGBD output. As discussed in Sec. D.2, employing a ×64 downsampling on the "initial guessed images" achieves a trade-off between efficiently extracting essential color cues and effectively mitigating the inaccuracies in these images. Despite not accurately preserving the color of local regions due to the resolution, the produced color palettes provide a relevant constraint and guidance on the overall color tone. Additional depth outputs are in Fig. 4.

Although depth and color guidance are sufficient to reconstruct images reasonably resembling the test one, it is yet unclear whether better depth and color cues can be extracted from the fMRI data or whether depth and color are doomed to be coarse estimations due to the loss of data in the fMRI recording.
Test image GT (D) Predictions (D̂, Ĉ) DREAM

Figure 5. DREAM Reconstructions. We show reconstruction for subject 1 (sub01) from the NSD dataset. Our approach extracts essential
cues from fMRI recordings, leading to enhanced consistency in appearance, structure, and semantics when compared to the viewed visual
stimuli. The results are randomly selected. The illustrated depth, color, and final images demonstrate that the deciphered and represented
color and depth cues help to boost the performance of visual decoding.
Test image
sub01
sub02
DREAM
sub05 sub07

Figure 6. Subject-Specific Results. We visualize subject-specific outputs of DREAM on the NSD dataset. For each subject, the model is
retrained because the brain activity varies across subjects. Overall, it consistently reconstructs the test image for all subjects while we note
that some reconstruction inaccuracies are shared across subjects (cf. Sec. E.3). Quantitative metrics are in Tab. 2.

Table 2. Subject-Specific Evaluation. Quantitative evaluation of the DREAM reconstructions for the participants (sub01, sub02, sub05,
and sub07) of the NSD dataset. Performance is stable accross all participants, and consistent with the results reported in the main paper.
Some example visual results can be found in Fig. 6.

Low-Level High-Level
Subject
PixCorr ↑ SSIM ↑ AlexNet(2) ↑ AlexNet(5) ↑ Inception ↑ CLIP ↑ EffNet-B ↓ SwAV ↓
sub01 .288 .338 95.0% 97.5% 94.8% 95.2% .638 .413
sub02 .273 .331 94.2% 97.1% 93.4% 93.5% .652 .422
sub05 .269 .325 93.5% 96.6% 93.8% 94.1% .633 .397
sub07 .265 .319 92.7% 95.4% 92.6% 93.7% .656 .438

E.2. More Image Reconstruction Results

More examples of image reconstruction for subject 1 are shown in Fig. 5. From left to right: the first two columns display the test images and their corresponding ground-truth depth maps. The third and fourth columns depict the predicted depth and color, respectively, in the form of depth maps and spatial palettes. The remaining columns represent the final reconstructed images. The results are randomly selected. The illustrated final images demonstrate that the deciphered and represented color and depth cues help to boost the performance of visual decoding. Overall, DREAM evidently extracts good-enough cues from the fMRI recordings, leading to consistent reconstruction of the appearance, structure, and semantics of the viewed visual stimuli.

E.3. Subject-Specific Results

We used the same standardized training-test data splits as other NSD reconstruction papers [30, 31, 37], training subject-specific models for each of the 4 participants (sub01, sub02, sub05, and sub07). More details on the different participants can be found in Sec. A and Tab. 1. Fig. 6 shows DREAM outputs for all four participants, with individual subject evaluation metrics reported in Tab. 2. More sub01 results can be found in Fig. 5. Overall, DREAM proves to work well regardless of the subject. However, it is interesting to note that some reconstruction mistakes are shared across subjects. For example, fMRIs of the vase-of-flowers picture (3rd column) are often reconstructed as paintings, except for sub05, and the food plate (2nd rightmost column), which is photographed at an angle, is almost always reconstructed as a more top-down photograph. These consistent mistakes across subjects may suggest dataset biases.
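The percentage columns in Tab. 2 (AlexNet(2), AlexNet(5), Inception, CLIP) follow the two-way identification protocol common in this line of work [32, 37]: in the feature space of the corresponding network, a reconstruction should be more similar to its own test image than to a different one. The sketch below assumes that convention and that the features have already been extracted; the exact pairing and averaging scheme may differ from the evaluation code actually used.

import numpy as np

def two_way_identification(recon_feats, test_feats):
    # recon_feats, test_feats: (N, D) arrays of features from the same layer
    # (e.g. AlexNet conv2/conv5, Inception pool, or CLIP image embeddings).
    n = len(test_feats)
    # Pearson correlation between every reconstruction and every test image.
    corr = np.corrcoef(recon_feats, test_feats)[:n, n:]
    correct, total = 0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Success if reconstruction i correlates more with its own test
            # image than with a different test image j.
            correct += corr[i, i] > corr[i, j]
            total += 1
    return 100.0 * correct / total

Chance level under this protocol is 50%, so scores in the low-to-mid 90s, as reported in Tab. 2, sit well above chance if the same convention applies.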
References

[1] Pulkit Agrawal, Dustin Stansbury, Jitendra Malik, and Jack L Gallant. Pixels to voxels: modeling visual representation in the human brain. arXiv preprint arXiv:1407.5104, 2014. 2
[2] Emily J Allen, Ghislain St-Yves, Yihan Wu, Jesse L Breedlove, Jacob S Prince, Logan T Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, et al. A massive 7t fmri dataset to bridge cognitive neuroscience and artificial intelligence. Nature neuroscience, 25(1):116–126, 2022. 2, 3, 5, 7, 9
[3] Per Brodal. The central nervous system: structure and function. Oxford University Press, 2004. 1, 3, 9
[4] Holger Caesar, Jasper Uijlings, and Vittorio Ferrari. Coco-stuff: Thing and stuff classes in context. In CVPR, pages 1209–1218, 2018. 11
[5] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. NeurIPS, 33:9912–9924, 2020. 6
[6] Zijiao Chen, Jiaxin Qing, Tiange Xiang, Wan Lin Yue, and Juan Helen Zhou. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In CVPR, pages 22710–22720, 2023. 10
[7] Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. In ICLR, 2021. 10
[8] Thirza Dado, Yağmur Güçlütürk, Luca Ambrogioni, Gabriëlle Ras, Sander Bosch, Marcel van Gerven, and Umut Güçlü. Hyperrealistic neural decoding for reconstructing faces from fmri activations via the gan latent space. Scientific reports, 12(1):141, 2022. 2
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009. 5
[10] Matteo Ferrante, Furkan Ozcelik, Tommaso Boccato, Rufin VanRullen, and Nicola Toschi. Brain Captioning: Decoding human brain activity into images and text. arXiv preprint arXiv:2305.11560, 2023. 1, 3, 5
[11] Pedro A Garcia, Rafael Huertas, Manuel Melgosa, and Guihua Cui. Measurement of the relationship between perceived and computed color differences. JOSA A, 24(7):1823–1829, 2007. 6, 11
[12] Guy Gaziv, Roman Beliy, Niv Granot, Assaf Hoogi, Francesca Strappini, Tal Golan, and Michal Irani. Self-supervised natural image reconstruction and large-scale semantic classification from brain activity. NeuroImage, 254:119121, 2022. 10
[13] Guy Gaziv and Michal Irani. More than meets the eye: Self-supervised depth reconstruction from brain activity. arXiv preprint arXiv:2106.05113, 2021. 5
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020. 2
[15] Zijin Gu, Keith Jamison, Amy Kuceyeski, and Mert Sabuncu. Decoding natural image stimuli from fmri data with a surface-based convolutional network. In MIDL, 2023. 2, 5, 6, 7, 9, 11
[16] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020. 4
[17] Tomoyasu Horikawa and Yukiyasu Kamitani. Generic decoding of seen and imagined objects using hierarchical visual features. Nature communications, 8(1):15037, 2017. 2
[18] Shuo Huang, Wei Shao, Mei-Ling Wang, and Dao-Qiang Zhang. fmri-based decoding of visual information from human brain activity: A brief review. International Journal of Automation and Computing, 18(2):170–184, 2021. 1, 5
[19] Ziyu Jiang, Tianlong Chen, Bobak J Mortazavi, and Zhangyang Wang. Self-damaging contrastive learning. In ICLR, pages 4927–4939. PMLR, 2021. 4
[20] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In CVPR, pages 8107–8116, 2020. 2
[21] Sungnyun Kim, Gihun Lee, Sangmin Bae, and Se-Young Yun. Mixco: Mix-up contrastive learning for visual representation. In NeurIPS Workshop, 2020. 4
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017. 6
[23] Guillaume Le Moing, Tuan-Hung Vu, Himalaya Jain, Patrick Pérez, and Matthieu Cord. Semantic palette: Guiding scene generation with class proportions. In CVPR, pages 9342–9350, 2021. 11
[24] Sikun Lin, Thomas Sprague, and Ambuj K Singh. Mind Reader: Reconstructing complex images from brain activities. NeurIPS, 35:29624–29636, 2022. 2, 5, 6, 7, 11
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014. 5, 9
[26] Yulong Liu, Yongqiang Ma, Wei Zhou, Guibo Zhu, and Nanning Zheng. Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding from fmri. arXiv preprint arXiv:2302.12971, 2023. 4
[27] Yizhuo Lu, Changde Du, Dianpeng Wang, and Huiguang He. Minddiffuser: Controlled image reconstruction from human brain activity with semantic and structural diffusion. arXiv preprint arXiv:2303.14139, 2023. 2, 5, 9
[28] Yue Ming, Xuyang Meng, Chunxiao Fan, and Hui Yu. Deep learning for monocular depth estimation: A review. Neurocomputing, 438:14–33, 2021. 6, 11
[29] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023. 1, 2, 3, 4, 5, 6, 10, 11
[30] Milad Mozafari, Leila Reddy, and Rufin VanRullen. Reconstructing natural scenes from fmri patterns using bigbigan. In IJCNN, pages 1–8. IEEE, 2020. 14
[31] Furkan Ozcelik, Bhavin Choksi, Milad Mozafari, Leila Reddy, and Rufin VanRullen. Reconstruction of Perceived Images from fMRI Patterns and Semantic Brain Exploration using Instance-Conditioned GANs. In IJCNN, pages 1–8. IEEE, 2022. 1, 2, 9, 11, 14
[32] Furkan Ozcelik and Rufin VanRullen. Brain-Diffuser: Natural scene reconstruction from fMRI signals using generative latent diffusion. arXiv preprint arXiv:2303.05334, 2023. 2, 3, 5, 6, 7
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 1, 2, 4, 6, 10
[34] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. TPAMI, 44(3):1623–1637, 2020. 1, 5, 10, 11, 12
[35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022. 1, 2, 3, 5, 10
[36] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 35:25278–25294, 2022. 5
[37] Paul S Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan J Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, et al. Reconstructing the mind’s eye: fmri-to-image with contrastive learning and diffusion priors. arXiv preprint arXiv:2305.18274, 2023. 1, 2, 3, 4, 5, 6, 7, 10, 11, 14
[38] Guohua Shen, Tomoyasu Horikawa, Kei Majima, and Yukiyasu Kamitani. Deep image reconstruction from human brain activity. PLoS computational biology, 15(1):e1006633, 2019. 2
[39] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In ICCV, pages 5117–5127, 2021. 11
[40] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016. 6
[41] Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. In CVPR, pages 14453–14463, 2023. 2, 4, 5, 6, 7, 9, 11
[42] Yu Takagi and Shinji Nishimoto. Improving visual image reconstruction from human brain activity using latent diffusion models via multiple decoded inputs. arXiv preprint arXiv:2306.11536, 2023. 1, 3, 4, 5
[43] Desney Tan and Anton Nijholt. Brain-computer interfaces and human-computer interaction. Springer, 2010. 1
[44] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In ICML, pages 6105–6114. PMLR, 2019. 6
[45] Yi Wang, Menghan Xia, Lu Qi, Jing Shao, and Yu Qiao. Palgan: Image colorization with palette generative adversarial networks. In ECCV, pages 271–288. Springer, 2022. 11
[46] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600–612, 2004. 6
[47] Menghan Xia, Jian Yao, Renping Xie, Mi Zhang, and Jinsheng Xiao. Color consistency correction based on remapping optimization for image stitching. In ICCV workshops, pages 2977–2984, 2017. 6, 11
[48] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In ICCV, pages 1395–1403, 2015. 11
[49] Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile diffusion: Text, images and variations all in one diffusion model. arXiv preprint arXiv:2211.08332, 2022. 1, 2, 11
[50] Bohan Zeng, Shanglin Li, Xuhui Liu, Sicheng Gao, Xiaolong Jiang, Xu Tang, Yao Hu, Jianzhuang Liu, and Baochang Zhang. Controllable mind visual diffusion model. arXiv preprint arXiv:2305.10135, 2023. 2
[51] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023. 2, 10, 11
[52] Jianggang Zhu, Zheng Wang, Jingjing Chen, Yi-Ping Phoebe Chen, and Yu-Gang Jiang. Balanced contrastive learning for long-tailed visual recognition. In CVPR, pages 6908–6917, 2022. 4
