Mind-Video is an AI tool that reconstructs high-quality videos from brain activity captured as continuous functional magnetic resonance imaging (fMRI) data.
Building on its predecessor, Mind-Vis, the tool tackles the challenge of recovering continuous visual experiences as video from non-invasive brain recordings.
Mind-Video adopts a two-module pipeline that bridges the gap between image and video brain decoding. The first module learns general visual fMRI features through large-scale unsupervised learning, using masked brain modeling and spatiotemporal attention. It then distills semantically related features through multimodal contrastive learning on an annotated dataset.
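As a rough illustration of the masked brain modeling step, the sketch below masks most of the flattened fMRI patches and trains a small transformer to reconstruct the hidden ones. The patch size, widths, layer counts, and class name are illustrative assumptions, and the spatiotemporal attention across consecutive fMRI frames is omitted for brevity; this is a sketch of the general technique, not Mind-Video's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedBrainModel(nn.Module):
    """MAE-style masked brain modeling on flattened fMRI patches (illustrative sketch)."""

    def __init__(self, num_patches=256, patch_dim=16, dim=128, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, patch_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, patch_dim) flattened fMRI voxel patches
        b, n, _ = patches.shape
        tokens = self.embed(patches) + self.pos

        # Shuffle patch indices per sample and keep only a visible subset.
        keep = int(n * (1 - self.mask_ratio))
        idx = torch.rand(b, n, device=patches.device).argsort(dim=1)
        keep_idx, mask_idx = idx[:, :keep], idx[:, keep:]
        gather = lambda x, i: torch.gather(x, 1, i.unsqueeze(-1).expand(-1, -1, x.size(-1)))

        # Encode visible patches only.
        encoded = self.encoder(gather(tokens, keep_idx))

        # Decoder sees encoded visible tokens plus positional mask tokens.
        masked = self.mask_token.expand(b, n - keep, -1) + gather(self.pos.expand(b, -1, -1), mask_idx)
        decoded = self.decoder(torch.cat([encoded, masked], dim=1))

        # Reconstruction loss is computed only on the masked patches.
        pred = self.head(decoded[:, keep:])
        return F.mse_loss(pred, gather(patches, mask_idx))

loss = MaskedBrainModel()(torch.randn(2, 256, 16))  # toy forward pass
```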
In the second module, the learned features are fine-tuned through co-training with an augmented Stable Diffusion model tailored for video generation guided by fMRI data. The tool's key contribution is its flexible, adaptable pipeline: an fMRI encoder and an augmented Stable Diffusion model that are trained separately and then fine-tuned together.
Through this progressive learning scheme, the encoder acquires brain features over multiple stages, producing reconstructed videos with high semantic accuracy, including motion and scene dynamics, and surpassing previous state-of-the-art approaches.
An attention analysis of the transformers that decode the fMRI data reveals the dominance of the visual cortex in processing visual spatiotemporal information, and shows that the encoder's layers operate hierarchically, extracting structural information first and more abstract visual features in deeper layers. The fMRI encoder also improves progressively at assimilating nuanced semantic information across its training stages.
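One simple way to run this kind of attention analysis, sketched below under the assumption that per-layer attention maps have already been collected (for example via forward hooks), is to measure how much attention mass each layer assigns to patches inside a cortical region of interest such as the visual cortex. The helper and its names are illustrative, not Mind-Video's analysis code.

```python
import torch

def roi_attention_share(attn_maps, roi_mask):
    """Per-layer fraction of attention mass landing on patches inside an ROI.

    attn_maps: list of (batch, heads, tokens, tokens) attention weights, one per layer.
    roi_mask:  (tokens,) boolean mask marking patches inside the ROI (e.g. visual cortex).
    """
    shares = []
    for layer_attn in attn_maps:
        # Average over batch, heads, and query tokens -> attention received per key token.
        received = layer_attn.mean(dim=(0, 1, 2))
        shares.append((received * roi_mask.float()).sum() / received.sum())
    return torch.stack(shares)  # one value per layer; higher means more focus on the ROI

# Toy usage with random attention maps and a placeholder visual-cortex mask.
fake_attn = [torch.softmax(torch.randn(2, 4, 256, 256), dim=-1) for _ in range(4)]
visual_cortex = torch.zeros(256, dtype=torch.bool)
visual_cortex[:64] = True
print(roi_attention_share(fake_attn, visual_cortex))
```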
Mind-Video uses data from the Human Connectome Project, and its developers acknowledge the contributions of collaborators and supporters throughout its development.
More details about Mind-Video
What is the role of the multimodal contrastive learning in Mind-Video?
In Mind-Video, multimodal contrastive learning distills semantically related features from the general visual fMRI features acquired through unsupervised learning. Leveraging the multimodality of the annotated dataset, this step trains the fMRI encoder in the CLIP space, aligning fMRI representations with the corresponding visual and textual embeddings so that semantically important information is emphasized.
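A minimal sketch of this CLIP-space alignment is a symmetric contrastive (InfoNCE) objective between fMRI embeddings and the CLIP embeddings of their paired frames or captions; the temperature value and function name below are assumptions rather than Mind-Video's exact settings.

```python
import torch
import torch.nn.functional as F

def clip_space_contrastive_loss(fmri_emb, clip_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning fMRI embeddings with paired CLIP embeddings.

    fmri_emb, clip_emb: (batch, dim) tensors where row i of each forms a matched pair.
    """
    fmri_emb = F.normalize(fmri_emb, dim=-1)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = fmri_emb @ clip_emb.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(len(fmri_emb), device=fmri_emb.device)
    # Matched pairs sit on the diagonal; mismatched pairs are pushed apart.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

Minimizing a loss of this form pulls each fMRI embedding toward the CLIP embedding of the stimulus it was recorded for, which is what anchors the encoder's features to semantics.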
What role does the Stable Diffusion model play in Mind-Video?
The Stable Diffusion model in Mind-Video drives the video generation process. After the first module has learned general and semantically related features from the fMRI data, the second module fine-tunes these features through co-training with an augmented Stable Diffusion model adapted for video generation, so that the generated videos are guided directly by the fMRI-derived features.
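To make the conditioning concrete, the toy denoiser below lets the video latents attend to fMRI features through cross-attention, taking the place that text embeddings occupy in ordinary Stable Diffusion. The shapes, widths, and the simplified noising step are assumptions for illustration, not the augmented model's actual architecture.

```python
import torch
import torch.nn as nn

class FMRIConditionedDenoiser(nn.Module):
    """Toy stand-in for the augmented diffusion U-Net, conditioned on fMRI features."""

    def __init__(self, latent_dim=64, cond_dim=128):
        super().__init__()
        self.time_embed = nn.Linear(1, latent_dim)
        self.cross_attn = nn.MultiheadAttention(latent_dim, num_heads=4,
                                                kdim=cond_dim, vdim=cond_dim,
                                                batch_first=True)
        self.out = nn.Linear(latent_dim, latent_dim)

    def forward(self, noisy_latents, timesteps, fmri_tokens):
        # noisy_latents: (batch, frames, latent_dim) video latents at noise level t
        # fmri_tokens:   (batch, tokens, cond_dim)   output of the trained fMRI encoder
        h = noisy_latents + self.time_embed(timesteps[:, None, None].float())
        # Cross-attention: latents attend to fMRI features, steering the denoising.
        attended, _ = self.cross_attn(h, fmri_tokens, fmri_tokens)
        return self.out(h + attended)  # predicted noise

# One training step of the usual denoising objective, conditioned on fMRI tokens.
model = FMRIConditionedDenoiser()
latents = torch.randn(2, 8, 64)                  # 8-frame video latents
noise = torch.randn_like(latents)
t = torch.randint(0, 1000, (2,))
noisy = latents + noise                          # simplified noising (no schedule)
pred = model(noisy, t, torch.randn(2, 16, 128))  # fMRI tokens from the first module
loss = nn.functional.mse_loss(pred, noise)
```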
What makes Mind-Video’s brain decoding pipeline flexible and adaptable?
Mind-Video's brain decoding pipeline is flexible and adaptable because it is decoupled into two distinct modules: the fMRI encoder and the augmented Stable Diffusion model. The modules are first trained separately and then fine-tuned together, which lets the encoder acquire brain features progressively across multiple stages.
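A schematic of that decoupled schedule, using placeholder data loaders, method names, and losses rather than Mind-Video's actual code, might look like this:

```python
import torch

def train_two_stage(encoder, diffusion, pretrain_data, paired_data, epochs=1):
    """Stage 1 trains the fMRI encoder alone; stage 2 fine-tunes both modules jointly."""
    # --- Stage 1: encoder-only training (e.g. masked brain modeling, then contrastive) ---
    enc_opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)
    for _ in range(epochs):
        for fmri in pretrain_data:
            enc_opt.zero_grad()
            encoder(fmri).backward()  # assumes the encoder returns its self-supervised loss
            enc_opt.step()

    # --- Stage 2: joint fine-tuning on paired fMRI/video data ---
    joint_opt = torch.optim.AdamW(
        list(encoder.parameters()) + list(diffusion.parameters()), lr=1e-5)
    for _ in range(epochs):
        for fmri, video_latents, noise, t in paired_data:
            joint_opt.zero_grad()
            cond = encoder.encode(fmri)                       # placeholder conditioning call
            pred = diffusion(video_latents + noise, t, cond)  # fMRI-conditioned denoiser
            torch.nn.functional.mse_loss(pred, noise).backward()
            joint_opt.step()
```

Training the stages separately is what keeps the pipeline adaptable: the encoder can be pre-trained on unlabeled fMRI data and only then paired with the generative module.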
Who are the main contributors and supporters in the development of Mind-Video?
The development of Mind-Video has been made possible by the contributions of various individuals and institutions. Key contributors include Zijiao Chen, Jiaxin Qing, and Helen Zhou from the National University of Singapore and the Chinese University of Hong Kong, together with collaborators from the Centre for Sleep and Cognition and the Centre for Translational Magnetic Resonance Research. The team also thanks supporters including the Human Connectome Project, Prof. Zhongming Liu, Dr. Haiguang Wen, the Stable Diffusion team, and the Tune-a-Video team.