CRC Workshop at Rauischholzhausen Castle (Germany)
2024.7.2
Preprint:
https://arxiv.org/abs/2405.10078
Abstract:
Visual image reconstruction aims to recover arbitrary stimulus/perceived images from brain activity. To achieve this, especially with limited training data, it is crucial that the model leverages a compositional representation that spans the image space, with each feature effectively mapped from brain activity. In light of these considerations, we critically assessed recent “photorealistic” reconstructions based on generative AIs applied to a large-scale fMRI/stimulus dataset (Natural Scene Dataset, NSD). We found a notable decrease in reconstruction performance on a different dataset specifically designed to prevent train–test overlaps (Deeprecon). The target features of NSD images revealed strikingly limited diversity, with a small number of semantic clusters shared between the training and test sets. Simulations also showed a lack of generalizability when training involved only a small number of clusters. This can be explained by “rank deficient prediction,” where any input is mapped into the subspace spanned by the training features. By diversifying the training set with a number of clusters that scales linearly with the feature dimension, the decoders exhibited improved generalizability beyond the trained clusters, achieving compositional prediction. It is also important to note that text/semantic features alone are insufficient for a complete mapping to the visual space, even if they are perfectly predicted from brain activity. Building on these observations, we argue that recent “photorealistic” reconstructions may predominantly be a blend of classification into trained categories and the generation of convincing yet inauthentic images (hallucinations) through text-to-image diffusion. To avoid such spurious reconstructions, we offer guidelines for developing generalizable methods and conducting reliable evaluations.
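The “rank deficient prediction” phenomenon described in the abstract can be illustrated with a minimal numpy simulation. This is a hypothetical sketch, not the paper's actual analysis: a least-squares decoder is trained to map simulated brain activity to target features that happen to occupy a low-dimensional subspace (mimicking a training set with few semantic clusters). The decoder's predictions for any new input then remain confined to that training subspace, so novel feature directions can never be recovered. All array sizes and the rank value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_voxels, n_feat = 100, 50, 20
rank = 3  # assumed: training features occupy only a 3-D subspace of the 20-D feature space

# Simulated training set: features confined to a low-rank subspace
basis = rng.standard_normal((rank, n_feat))          # subspace basis vectors
Y_train = rng.standard_normal((n_train, rank)) @ basis
X_train = rng.standard_normal((n_train, n_voxels))   # simulated brain activity

# Least-squares linear decoder mapping brain activity -> features
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)

# Predictions for arbitrary new inputs
X_test = rng.standard_normal((10, n_voxels))
Y_pred = X_test @ W  # = (X_test @ pinv(X_train)) @ Y_train, a combination of training features

# Stacking predictions onto the training features adds no new dimensions:
# the decoder cannot predict outside the span of what it was trained on.
r_train = np.linalg.matrix_rank(Y_train, tol=1e-6)
r_combined = np.linalg.matrix_rank(np.vstack([Y_train, Y_pred]), tol=1e-6)
print(r_train, r_combined)  # both equal the training-subspace rank
```

Diversifying the training features so that they span the full feature space (rank equal to `n_feat`) is what, per the abstract, restores generalizability beyond the trained clusters.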