Seeing Through The Brain: Image Reconstruction of Visual Perception From Human Brain Signals
[Figure 1 diagram: EEG signals recorded while viewing a visual stimulus are decoded at the pixel level into a saliency map Mp (fine-grained control, pixel-level supervision, injected at t = 0 of the diffusion process Tp through the encoder Eldm) and at the sample level into a CLIP embedding Ms decoded from a caption such as "An image of airliner" (coarse-grained control, CLIP-embedding supervision); the denoising process of the latent diffusion model, followed by the decoder Dldm, produces the reconstructed image.]
Figure 1: Overview of our NEUROIMAGEN. All the modules with dotted lines, i.e., pixel-level supervision and sample-level supervision, are only used during the training phase and are removed during the inference phase.
GEN, including pixel-level semantics and sample-level semantics, together with the corresponding training details of the decoding procedure. Finally, we detail the image reconstruction procedure of NEUROIMAGEN, which integrates the coarse-grained and fine-grained semantics with a pretrained latent diffusion model to reconstruct the observed visual stimuli from EEG signals.

Problem Statement

In this section, we formulate the problem and give an overview of NEUROIMAGEN. Let the paired {(EEG, image)} dataset be Ω = {(x_i, y_i)}_{i=1}^{n}, where y_i ∈ R^{H×W×3} is the visual stimulus image used to evoke the brain activity and x_i ∈ R^{C×T} represents the corresponding recorded EEG signals. Here, C is the number of EEG channels and T is the temporal length of the sequence associated with the observed image. The general objective of this research is to reconstruct an image y from the corresponding EEG signals x, with a focus on achieving a high degree of similarity to the observed visual stimulus.
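To make the notation concrete, the following minimal sketch (not the authors' code; shapes and names are illustrative assumptions) wraps the paired data Ω = {(x_i, y_i)} as a PyTorch dataset, with each EEG recording x_i of shape (C, T) and each stimulus image y_i of shape (H, W, 3).

```python
import torch
from torch.utils.data import Dataset

class EEGImagePairs(Dataset):
    """Paired (EEG, image) samples: x has shape (C, T), y has shape (H, W, 3).

    A hypothetical wrapper used only to make the notation concrete;
    the actual preprocessing of the EEG-image dataset is not shown here.
    """
    def __init__(self, eeg_signals, images, labels):
        self.eeg_signals = eeg_signals   # sequence of (C, T) arrays
        self.images = images             # sequence of (H, W, 3) arrays
        self.labels = labels             # class index of each visual stimulus

    def __len__(self):
        return len(self.eeg_signals)

    def __getitem__(self, i):
        x = torch.as_tensor(self.eeg_signals[i], dtype=torch.float32)  # (C, T)
        y = torch.as_tensor(self.images[i], dtype=torch.float32)       # (H, W, 3)
        return x, y, self.labels[i]
```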
Multi-level Semantics Extraction Framework

Figure 1 illustrates the architecture of NEUROIMAGEN. In our approach, we extract multi-level semantics, represented as {M_1(x), M_2(x), ..., M_n(x)}, which capture various granularities, ranging from coarse-grained to fine-grained information, in the EEG signals corresponding to the visual stimuli. The coarse-grained semantics serves as a high-level overview, facilitating a quick understanding of the primary attributes and categories of the visual stimuli. On the other hand, the fine-grained semantics offers more detailed information, such as localized features, subtle variations, and small-scale patterns. The multi-level semantics are then fed into a high-quality image reconstruction module F to reconstruct the visual stimuli, ŷ = F[M_1(x), M_2(x), ..., M_n(x)].

Specifically, we instantiate two levels of semantics as follows. Let Mp and Ms be the pixel-level semantic extractor and the sample-level semantic extractor, respectively. The pixel-level semantics is defined as the saliency map of silhouette information Mp(x) ∈ R^{Hp×Wp×3}. This step enables us to analyze the EEG signals in the pixel space and provides rough structure information. Subsequently, we define the sample-level semantics as Ms(x) ∈ R^{L×Ds}, which provides coarse-grained information such as the image category or a text caption.

To fully utilize the two-level semantics, the high-quality image reconstruction module F is a latent diffusion model. It begins with the saliency map Mp(x) as the initial image and utilizes the sample-level semantics Ms(x) to polish the saliency map, finally reconstructing ŷ = F(Mp(x), Ms(x)).

Pixel-level Semantics Extraction

In this section, we describe how we decode the pixel-level semantics, i.e., the saliency map of silhouette information. The intuition of this pixel-level semantics extraction is to capture the color, position, and shape information of the observed visual stimuli, which is fine-grained and extremely difficult to reconstruct from the noisy EEG signal. However, as shown in Figure 3, despite the low image resolution and limited semantic accuracy, such a saliency map successfully captures the rough structure of the visual stimuli from the noisy EEG signals. Specifically, our pixel-level semantics extractor Mp consists of two components: (1) contrastive feature learning to obtain discriminative features of the EEG signals, and (2) estimation of the saliency map of silhouette information based on the learned EEG features.

Contrastive Feature Learning We use contrastive learning techniques to bring together the embeddings of EEG signals recorded when people receive similar visual stimuli, i.e., see images of the same class. The triplet loss (Schroff, Kalenichenko, and Philbin 2015) is utilized as

L_triplet = max(0, β + ||f_θ(x_a) − f_θ(x_p)||_2^2 − ||f_θ(x_a) − f_θ(x_n)||_2^2),   (1)

where f_θ is the feature extraction function (Kavasidis et al. 2017) that maps EEG signals to a feature space, and x_a, x_p, and x_n
are the sampled anchor, positive, and negative EEG signal segments, respectively. The objective of eq. (1) is to minimize the distance between x_a and x_p, which share the same label (the class of the viewed visual stimulus), while maximizing the distance between x_a and x_n, which have different labels. To avoid the compression of the data representations into a small cluster by the feature extraction network, a margin term β is incorporated into the triplet loss.
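A minimal PyTorch sketch of this objective is shown below; the encoder stand-in and the tensor shapes are assumptions for illustration, not the architecture of Kavasidis et al. (2017).

```python
import torch
import torch.nn as nn

def triplet_loss(f_a, f_p, f_n, beta=1.0):
    """Eq. (1): hinge on the gap between squared distances, with margin beta."""
    d_pos = (f_a - f_p).pow(2).sum(dim=-1)   # ||f(x_a) - f(x_p)||_2^2
    d_neg = (f_a - f_n).pow(2).sum(dim=-1)   # ||f(x_a) - f(x_n)||_2^2
    return torch.clamp(beta + d_pos - d_neg, min=0).mean()

# Hypothetical stand-in for f_theta: any network mapping a (C, T) EEG segment
# to a D-dimensional feature. Shapes are illustrative.
C, T, D = 128, 440, 128
f_theta = nn.Sequential(nn.Flatten(), nn.Linear(C * T, D))

x_a, x_p, x_n = (torch.randn(8, C, T) for _ in range(3))
loss = triplet_loss(f_theta(x_a), f_theta(x_p), f_theta(x_n), beta=1.0)
```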
Estimation of Saliency Map After we obtain the feature f_θ(x) of the EEG signal, we can generate the saliency map of silhouette information from it and a randomly sampled latent z ∼ N(0, 1), i.e.,

M_p(x) = G(z, f_θ(x)),

where G denotes the saliency map generator. In this paper, we use the generator from the Generative Adversarial Network (GAN) framework (Goodfellow et al. 2020) to generate the saliency map, and the adversarial losses are defined as follows:

L_adv^D = max(0, 1 − D(A(y), f_θ(x))) + max(0, 1 + D(A(M_p(x)), f_θ(x))),   (2)

L_adv^G = −D(A(M_p(x)), f_θ(x)),   (3)

where D is the discriminator conditioned on the EEG feature f_θ(x) and A(·) denotes the augmentation applied to the real and generated images. A diversity term L_ms following the mode-seeking regularization (Mao et al. 2019) is further imposed on the generator, where d_∗(·) denotes the distance metric in the image space x or the latent space z, and z_1, z_2 ∼ N(0, 1) are two different sampled latent vectors.

To enforce the accuracy of the generated saliency map with respect to the visual stimuli, we use the observed image as supervision and incorporate the Structural Similarity Index Measure (SSIM) as well:

L_SSIM = 1 − [(2 µ_x µ_{Mp(x)} + C_1)(2 σ_x σ_{Mp(x)} + C_2)] / [(µ_x^2 + µ_{Mp(x)}^2 + C_1)(σ_x^2 + σ_{Mp(x)}^2 + C_2)],   (5)

where µ_x, µ_{Mp(x)}, σ_x, and σ_{Mp(x)} represent the mean and standard deviation values of the ground-truth images and the reconstructed saliency maps of the generator, and C_1 and C_2 are constants that stabilize the calculation.

The final loss for the generator is the weighted sum of the losses:

L_G = α_1 · L_adv^G + α_2 · L_ms + α_3 · L_SSIM,   (6)

where α_{i∈{1,2,3}} are hyperparameters that balance the loss terms.
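The pieces above can be assembled as in the sketch below. It is a hedged illustration: `d_real`/`d_fake` stand for the conditional discriminator outputs D(A(y), f_θ(x)) and D(A(M_p(x)), f_θ(x)), the SSIM term uses the global-statistics form of eq. (5), the diversity term L_ms is passed in as a precomputed scalar because its exact formulation is not reproduced here, and the weights α_i as well as C_1, C_2 are placeholder values.

```python
import torch

def discriminator_loss(d_real, d_fake):
    """Eq. (2): hinge loss for the EEG-conditioned discriminator D."""
    return (torch.clamp(1.0 - d_real, min=0).mean()
            + torch.clamp(1.0 + d_fake, min=0).mean())

def generator_adv_loss(d_fake):
    """Eq. (3): adversarial term for the saliency-map generator G."""
    return -d_fake.mean()

def ssim_loss(y, m, c1=0.01 ** 2, c2=0.03 ** 2):
    """Eq. (5) with global image statistics; c1 and c2 are assumed constants."""
    mu_y, mu_m = y.mean(), m.mean()
    sd_y, sd_m = y.std(), m.std()
    ssim = ((2 * mu_y * mu_m + c1) * (2 * sd_y * sd_m + c2)) / (
        (mu_y ** 2 + mu_m ** 2 + c1) * (sd_y ** 2 + sd_m ** 2 + c2))
    return 1.0 - ssim

def generator_loss(d_fake, y, saliency, l_ms, alphas=(1.0, 1.0, 1.0)):
    """Eq. (6): weighted sum of adversarial, diversity (L_ms), and SSIM terms."""
    a1, a2, a3 = alphas
    return a1 * generator_adv_loss(d_fake) + a2 * l_ms + a3 * ssim_loss(y, saliency)
```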
Sample-level Semantics Extraction

As aforementioned, EEG signals are notorious for their inherent noise, which makes it challenging to extract precise and fine-grained information. Therefore, besides the fine-grained pixel-level semantics, we also involve sample-level semantic extraction to derive coarse-grained information such as the category of the main objects in the image content. These features have a relatively lower rank and are easier to align. Despite being less detailed, they can still provide accurate coarse-grained information, which is meaningful for reconstructing the observed visual stimuli.

Specifically, the process Ms tries to align the information decoded from the input EEG signals to generated image captions, which are produced by an additional annotation model and aligned in the embedding space of the Contrastive Language-Image Pretraining (CLIP) model (Radford et al. 2021). Below we detail the processes of image caption ground-truth generation and semantic decoding with alignment.

[Figure 2 examples: an African elephant photo with label caption "An image of african elephant" and BLIP caption "An elephant standing next to a large rock"; a mountain bike photo with label caption "An image of mountain bike" and BLIP caption "A man riding a mountain bike down a trail in the woods".]

Figure 2: Examples of ground-truth images, label captions, and BLIP captions, respectively.

Generation of Image Captions We propose two methods to generate a caption for each image to help supervise the decoding of semantic information from EEG signals. Since the observed images are from the ImageNet dataset, which provides the class of each image, we first define a straightforward, heuristic label-caption method that uses the class name of each image as the caption, as illustrated in the middle column of Figure 2. The second method uses the image captioning model BLIP (Li et al. 2023), a generic and computation-efficient vision-language pretraining (VLP) model utilizing a pretrained vision model
and large language models. We opt for the default parameter and 20 seconds). The EEG-image dataset encompasses a
configuration of the BLIP model to caption our images. The diverse range of image classes, including animals (such as
examples are demonstrated in the right column of Figure 2. pandas), and objects (such as airlines).
As can be seen, the label captions tend to focus predomi- Following the common data split strategy (Kavasidis et al.
nantly on class-level information, and the BLIP-derived cap- 2017), we divide the pre-processed raw EEG signals and
tions introduce further details on a per-image level. their corresponding images into training, validation, and
testing sets, with corresponding proportions of 80% (1,600
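The label-caption branch is plain string templating over the ImageNet class name, following the "An image of ..." pattern shown in Figure 2. The BLIP branch below is a hedged sketch that uses a public Hugging Face BLIP-2 captioning checkpoint with default generation settings, which may differ from the exact BLIP configuration used in the paper.

```python
from PIL import Image

def label_caption(class_name: str) -> str:
    """Label caption: just the class name in a fixed template (cf. Figure 2)."""
    return f"An image of {class_name}"

def blip_caption(image_path: str) -> str:
    """BLIP caption via a public BLIP-2 checkpoint (an illustrative choice)."""
    from transformers import Blip2Processor, Blip2ForConditionalGeneration
    processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
    model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
    inputs = processor(images=Image.open(image_path), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True).strip()

print(label_caption("african elephant"))   # "An image of african elephant"
```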
Predict the Text CLIP Embedding After the generation of the image caption ground truth, the goal of semantic decoding is to extract information from the EEG signals that aligns with the caption information. Note that this procedure is conducted in the latent space, where the latent embeddings have been produced by the CLIP model from the above generated captions. Specifically, we extract the CLIP embeddings ĥ_{clip∗} from the generated captions and align the output h_clip of the EEG sample-level encoder with them using the loss function

L_clip = ||h_clip − ĥ_{clip∗}||_2^2,   (7)

where ∗ ∈ {B, L} denotes the BLIP caption embedding or the label caption embedding.
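A sketch of this alignment step is given below. The frozen CLIP text encoder produces the target ĥ_{clip∗} from a generated caption, and the output h_clip of the EEG sample-level encoder is regressed onto it with the squared L2 loss of eq. (7). The specific CLIP checkpoint and the token-level (L × Ds) target shape are assumptions consistent with Ms(x) ∈ R^{L×Ds}, not details stated in this section.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Assumed CLIP text encoder; the checkpoint name is an illustrative choice.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

@torch.no_grad()
def caption_clip_embedding(caption: str) -> torch.Tensor:
    """Ground-truth target: token-level CLIP text embeddings of shape (L, Ds)."""
    tokens = tokenizer(caption, padding="max_length",
                       max_length=tokenizer.model_max_length,
                       truncation=True, return_tensors="pt")
    return text_encoder(**tokens).last_hidden_state.squeeze(0)   # (L, Ds)

def clip_alignment_loss(h_clip: torch.Tensor, caption: str) -> torch.Tensor:
    """Eq. (7): squared L2 distance between the EEG output and the CLIP target."""
    target = caption_clip_embedding(caption)
    return (h_clip - target).pow(2).sum()
```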
Combining Multi-level EEG Semantics with Diffusion Model

In this section, we present a comprehensive explanation of how the multi-level semantics are integrated into a diffusion model for visual stimulus reconstruction. We utilize both the pixel-level semantics, denoted as Mp(x) (obtained with G(z, f_θ(x))), and the sample-level semantics, represented as Ms(x) (obtained as h_clip), to exert control of various granularity over the image reconstruction process. The reconstructed visual stimuli are defined as ŷ = F(Mp(x), Ms(x)) = F(G(z, f_θ(x)), h_clip).

Specifically, we use the latent diffusion model to perform image-to-image reconstruction under the guidance of conditional text prompt embeddings: (1) First, we reconstruct the pixel-level semantics G(z, f_θ(x)) from the EEG signals and resize it to the resolution of the observed visual stimuli. (2) G(z, f_θ(x)) is then processed by the encoder Eldm of the autoencoder from the latent diffusion model, and noise is added through the diffusion process. (3) Then, we integrate the sample-level semantics h_clip as the cross-attention input of the U-Net to guide the denoising process. (4) Finally, we project the output of the denoising process back to image space with the decoder Dldm and reconstruct the high-quality image ŷ.
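The four steps above can be approximated with an off-the-shelf latent-diffusion image-to-image pipeline: the decoded saliency map serves as the initial image, and the EEG-decoded embedding h_clip replaces the usual text-prompt embedding for cross-attention guidance. The sketch below uses the diffusers library; the checkpoint, strength, and step count are illustrative assumptions rather than the paper's settings.

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline

# Assumed Stable Diffusion checkpoint standing in for the pretrained latent diffusion model.
pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

def reconstruct(saliency_map_pil, h_clip, strength=0.75, steps=50):
    """Polish the pixel-level saliency map under sample-level guidance.

    saliency_map_pil: decoded saliency map, resized to the stimulus resolution.
    h_clip:           EEG-decoded embedding of shape (1, L, Ds), used in place of
                      the usual text-prompt embedding for cross-attention guidance.
    """
    out = pipe(image=saliency_map_pil,
               prompt_embeds=h_clip.to(pipe.device, dtype=pipe.unet.dtype),
               strength=strength,
               num_inference_steps=steps)
    return out.images[0]
```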
Experiments

Dataset

The effectiveness of our proposed methodology is validated on the EEG-image dataset (Spampinato et al. 2017). This publicly accessible dataset consists of EEG data gathered from six subjects. The data was collected by presenting visual stimuli to the subjects, using 50 images from each of 40 distinct categories of the ImageNet dataset (Krizhevsky, Sutskever, and Hinton 2012). Each set of stimuli was displayed in 25-second intervals, separated by a 10-second blackout period intended to reset the visual pathway. This process covered 2,000 images in total, with each experiment lasting 1,400 seconds (approximately 23 minutes and 20 seconds). The EEG-image dataset encompasses a diverse range of image classes, including animals (such as pandas) and objects (such as airliners).

Following the common data split strategy (Kavasidis et al. 2017), we divide the pre-processed raw EEG signals and their corresponding images into training, validation, and testing sets with proportions of 80% (1,600 images), 10% (200 images), and 10% (200 images), and build one model for all subjects. The dataset is split by images, ensuring that the EEG signals of all subjects in response to a single image are not spread over splits.
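The split can be sketched as below: image identifiers are shuffled once and partitioned 80/10/10, and every subject's EEG segments for a given image inherit that image's split, so responses to the same stimulus never cross splits. The seed and bookkeeping are illustrative.

```python
import random

def split_by_image(image_ids, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Split image IDs 80/10/10; all EEG trials of an image follow its split."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(ratios[0] * n), int(ratios[1] * n)
    return (set(ids[:n_train]),
            set(ids[n_train:n_train + n_val]),
            set(ids[n_train + n_val:]))

train_ids, val_ids, test_ids = split_by_image(range(2000))
print(len(train_ids), len(val_ids), len(test_ids))   # 1600 200 200
```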
Evaluation Metrics

N-way Top-k Classification Accuracy (ACC) Following (Chen et al. 2023), we evaluate the semantic correctness of our reconstructed images with the N-way top-k classification accuracy. Specifically, the ground-truth image y and the reconstructed image ŷ are fed into a pretrained ImageNet-1k classifier (Dosovitskiy et al. 2020), which determines whether y and ŷ belong to the same class: we check whether, for the reconstructed image, the top-k classification among the N selected classes matches the class of the ground-truth image. Importantly, this evaluation metric eliminates the need for pre-defined labels for the images and serves as an indicator of the semantic consistency between the ground-truth and reconstructed images. In this paper, we report 50-way top-1 accuracy.

Inception Score (IS) IS, introduced by (Salimans et al. 2016), is commonly employed to evaluate the quality and diversity of images produced by generative models. To compute the IS, a pretrained Inception-v3 classifier (Szegedy et al. 2016) is utilized to calculate the class probabilities of the reconstructed images. We use IS for a quantitative comparison between our method and the baselines.

Structural Similarity Index Measure (SSIM) SSIM offers a comprehensive and perceptually relevant metric for image quality evaluation. SSIM is computed over multiple windows of the ground-truth image and the corresponding reconstructed image, comparing luminance, contrast, and structure components.
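A sketch of this metric is given below, assuming a torchvision ViT-B/16 ImageNet-1k classifier as the pretrained model (an illustrative choice for Dosovitskiy et al. 2020): the ground-truth image determines the target class, and the reconstruction is scored on that class plus N − 1 randomly drawn distractor classes, counting a hit if the target lands in the top k.

```python
import random
import torch
from torchvision.models import vit_b_16, ViT_B_16_Weights

weights = ViT_B_16_Weights.IMAGENET1K_V1
classifier = vit_b_16(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def n_way_top_k(gt_img, rec_img, n=50, k=1, seed=0):
    """1 if the GT class is in the reconstruction's top-k among n candidate classes."""
    gt_logits = classifier(preprocess(gt_img).unsqueeze(0))[0]
    rec_logits = classifier(preprocess(rec_img).unsqueeze(0))[0]
    gt_class = int(gt_logits.argmax())

    # n-1 random distractor classes plus the ground-truth class.
    rng = random.Random(seed)
    others = [c for c in range(1000) if c != gt_class]
    candidates = [gt_class] + rng.sample(others, n - 1)

    top_k = torch.topk(rec_logits[candidates], k).indices.tolist()
    return int(0 in top_k)   # index 0 of `candidates` is the ground-truth class
```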
Results

Experiment Results on the ImageNet Dataset

Figure 3: The main results of our NEUROIMAGEN. The images positioned on the left with red boxes represent the ground-truth images. The second images from the left represent the pixel-level saliency maps reconstructed from EEG signals. The three images on the right exhibit three sampling results for the given saliency map under the guidance of sample-level semantics.

The main results are illustrated in Figure 3. The images positioned on the left with red boxes represent the ground-truth images. The second images from the left represent the saliency maps reconstructed from EEG signals. The three images on the right exhibit three sampling results for the given pixel-level saliency map under the guidance of the sample-level semantics of the EEG signals. Upon comparison with the ground-truth images and the reconstructed saliency maps, we validate that our pixel-level semantics extraction from EEG signals successfully captures the color, positional, and shape information of the viewed images, despite limited semantic accuracy. Comparing the ground-truth images and the three reconstructed samples demonstrates that the latent diffusion model successfully polishes the decoded saliency map
with coarse-grained but accurate guidance from the sample-level semantics of the EEG signals. The high-quality images reconstructed purely from brain signals are perceptually and semantically similar to the viewed images.

Model          ACC (%)   IS      SSIM
Brain2Image    –         5.01    –
NeuroVision    –         5.23    –
NEUROIMAGEN    85.6      33.50   0.249

Table 1: Quantitative comparison with the baselines; dashes denote metrics not reported for the baseline methods.

Subject   ACC (%)   IS      SSIM
subj 01   83.84     32.64   0.254
subj 02   84.26     32.33   0.247
subj 03   86.66     32.93   0.251
subj 04   86.48     32.40   0.244
subj 05   87.62     32.97   0.250
subj 06   85.25     31.76   0.245

Table 2: The quantitative results of different subjects.

Figure 5: Comparison of reconstructed images on different subjects. The images on the left with red boxes represent the ground-truth images. The other six images represent the reconstructed images of the different subjects. The shown classes include fish, pizza, guitar, and canoe.

Model   B   L   I   ACC (%)   IS      SSIM
1       ✗   ✗   ✓   4.5       16.31   0.234
2       ✗   ✓   ✗   85.9      34.12   0.180
3       ✓   ✗   ✗   74.1      29.87   0.157
4       ✓   ✗   ✓   65.3      25.86   0.235
5       ✗   ✓   ✓   85.6      33.50   0.249

Table 3: Quantitative results of the ablation studies. B and L represent semantic decoding from EEG signals using the BLIP caption and the label caption, respectively; I represents perceptual (pixel-level) information decoding from EEG signals.

Pixel-level Semantics To demonstrate the effectiveness of the pixel-level semantics extracted from EEG signals, we conduct validation on models 2, 3, 4, and 5. By comparing 2 with 5 and 3 with 4, we find that using the pixel-level semantics, i.e., the saliency map, significantly increases the structural similarity between the reconstructed images and the ground-truth images.

Sample-level Semantics We investigate the module of sample-level semantics decoding from EEG signals for guiding the denoising process. Models 1, 4, and 5 represent the experimental results of using only the saliency map, both the saliency map and sample-level semantics with the supervision of the BLIP caption, and both the saliency map and sample-level semantics with the supervision of the label caption, respectively. Comparing 1 with 4 and 1 with 5 demonstrates that the use of sample-level semantics significantly increases the semantic accuracy of the reconstructed images.

BLIP Captions vs Label Captions We also compare the two caption supervision methods with models 2 versus 3 and 4 versus 5. The experimental results of the label caption are superior to those of the BLIP caption in all metrics. We attribute this to the EEG signals possibly capturing only class-level information, so the prediction of the BLIP latent is inaccurate, which decreases the performance of the diffusion model.

Conclusion

In this paper, we explore the understanding of visually-evoked brain activity. Specifically, we proposed a framework, named NEUROIMAGEN, to reconstruct images of visual perceptions from EEG signals. NEUROIMAGEN first
generates multi-level semantic information, i.e., pixel-level saliency maps and sample-level textual descriptions, from EEG signals, and then uses the diffusion model to combine the extracted semantics and obtain high-resolution images. Both qualitative and quantitative experiments reveal the strong ability of NEUROIMAGEN.

As a preliminary work in this area, we demonstrate the possibility of linking human visual perceptions with complicated EEG signals. We expect the findings to further motivate the fields of artificial intelligence, cognitive science, and neuroscience to work together and reveal the mystery of how our brains process visual perception information.
References

Allen, E. J.; St-Yves, G.; Wu, Y.; Breedlove, J. L.; Prince, J. S.; Dowdle, L. T.; Nau, M.; Caron, B.; Pestilli, F.; Charest, I.; et al. 2022. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1): 116–126.

Bai, Y.; Wang, X.; Cao, Y.; Ge, Y.; Yuan, C.; and Shan, Y. 2023. DreamDiffusion: Generating High-Quality Images from Brain EEG Signals. arXiv preprint arXiv:2306.16934.

Beliy, R.; Gaziv, G.; Hoogi, A.; Strappini, F.; Golan, T.; and Irani, M. 2019. From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI. Advances in Neural Information Processing Systems, 32.

Chen, Z.; Qing, J.; Xiang, T.; Yue, W. L.; and Zhou, J. H. 2023. Seeing beyond the brain: Conditional diffusion model with sparse masked modeling for vision decoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22710–22720.

Chen, Z.; Qing, J.; and Zhou, J. H. 2023. Cinematic Mindscapes: High-quality Video Reconstruction from Brain Activity. arXiv preprint arXiv:2305.11675.

Dhariwal, P.; and Nichol, A. 2021. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34: 8780–8794.

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2020. Generative adversarial networks. Communications of the ACM, 63(11): 139–144.

Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33: 6840–6851.

Kavasidis, I.; Palazzo, S.; Spampinato, C.; Giordano, D.; and Shah, M. 2017. Brain2Image: Converting brain signals into images. In Proceedings of the 25th ACM International Conference on Multimedia, 1809–1817.

Khare, S.; Choubey, R. N.; Amar, L.; and Udutalapalli, V. 2022. NeuroVision: perceived image regeneration using cProGAN. Neural Computing and Applications, 34(8): 5979–5991.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.

Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.

Lim, J. H.; and Ye, J. C. 2017. Geometric GAN. arXiv preprint arXiv:1705.02894.

Mao, Q.; Lee, H.-Y.; Tseng, H.-Y.; Ma, S.; and Yang, M.-H. 2019. Mode seeking generative adversarial networks for diverse image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1429–1437.

Palazzo, S.; Spampinato, C.; Kavasidis, I.; Giordano, D.; Schmidt, J.; and Shah, M. 2020. Decoding brain representations by multimodal learning of neural activity and visual features. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11): 3833–3849.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748–8763.

Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10684–10695.

Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention, 234–241.

Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29.

Schroff, F.; Kalenichenko, D.; and Philbin, J. 2015. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 815–823.

Shen, G.; Dwivedi, K.; Majima, K.; Horikawa, T.; and Kamitani, Y. 2019. End-to-end deep image reconstruction from human brain activity. Frontiers in Computational Neuroscience, 13: 21.

Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2256–2265.

Song, J.; Meng, C.; and Ermon, S. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.

Spampinato, C.; Palazzo, S.; Kavasidis, I.; Giordano, D.; Souly, N.; and Shah, M. 2017. Deep learning human mind for automated visual classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6809–6817.

Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2818–2826.

Takagi, Y.; and Nishimoto, S. 2023. High-resolution image reconstruction with latent diffusion models from human brain activity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 14453–14463.

Yang, L.; Zhang, Z.; Song, Y.; Hong, S.; Xu, R.; Zhao, Y.; Shao, Y.; Zhang, W.; Cui, B.; and Yang, M.-H. 2022. Diffusion models: A comprehensive survey of methods and applications. arXiv preprint arXiv:2209.00796.

Ye, Z.; Yao, L.; Zhang, Y.; and Gustin, S. 2022. See what you see: Self-supervised cross-modal retrieval of visual stimuli from brain activity. arXiv preprint arXiv:2208.03666.

Zeng, B.; Li, S.; Liu, X.; Gao, S.; Jiang, X.; Tang, X.; Hu, Y.; Liu, J.; and Zhang, B. 2023. Controllable Mind Visual Diffusion Model. arXiv preprint arXiv:2305.10135.

Zhao, S.; Liu, Z.; Lin, J.; Zhu, J.-Y.; and Han, S. 2020. Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems, 33: 7559–7570.