Meta AI has announced a new open-source AI model called ImageBind that can bind together multiple data streams: images and video, text, audio, depth, thermal (temperature) readings, and inertial (motion) data. This is a significant advance in multimodal AI: ImageBind associates these different types of data with one another by mapping them into a single embedding space. Meta claims it is the first AI model to combine six modalities in one embedding space.
The core concept of ImageBind is to link multiple types of data in a single multidimensional index. Given a picture of a beach, for example, ImageBind can retrieve the sound of waves as associated data and combine the two into a video. This opens up possibilities for future AI systems to cross-reference modalities the way current systems cross-reference text inputs. In a virtual reality device, for example, a request such as “Please reproduce a long sea voyage” could produce not only the sound of waves but also the feeling of being on a ship, including the shaking of the deck under your feet and the coolness of the sea breeze.
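To make the “single multidimensional index” idea concrete, here is a minimal toy sketch of cross-modal retrieval in a shared embedding space. The vectors below are random stand-ins, not real ImageBind outputs, and the clip names are hypothetical; the point is only that once every modality maps into one space, finding the sound that matches an image reduces to nearest-neighbor search.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
dim = 64  # toy embedding dimensionality

# Toy "audio" embeddings indexed by clip name (hypothetical data).
audio_index = {
    "waves": normalize(rng.normal(size=dim)),
    "traffic": normalize(rng.normal(size=dim)),
    "birdsong": normalize(rng.normal(size=dim)),
}

# A toy "image" embedding for a beach photo, deliberately constructed to
# lie close to the "waves" audio embedding in the shared space.
beach_image = normalize(audio_index["waves"] + 0.05 * rng.normal(size=dim))

# Cross-modal retrieval: rank audio clips by cosine similarity to the image.
best_clip = max(audio_index, key=lambda name: audio_index[name] @ beach_image)
print(best_clip)  # the wave sound is the closest match
```

A real system would replace the random vectors with encoder outputs for each modality, but the retrieval step itself stays this simple.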
ImageBind aligns the embeddings of the six modalities in a common space, enabling cross-modal retrieval of content types that were never observed together. It can also add the embeddings of different modalities to compose their semantics, enabling, for instance, audio-to-image generation with pre-trained decoders.
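The embedding-arithmetic claim can be sketched the same way. In this hypothetical example (toy vectors again, not real model outputs), adding an “image” embedding and an “audio” embedding yields a query vector whose nearest neighbor reflects the combined semantics of both inputs:

```python
import numpy as np

def unit(v):
    """Normalize a vector to unit length."""
    return v / np.linalg.norm(v)

rng = np.random.default_rng(1)
dim = 64

# Toy concept directions in the shared space (hypothetical).
bird = unit(rng.normal(size=dim))   # e.g. embedding of a bird photo
ocean = unit(rng.normal(size=dim))  # e.g. embedding of a wave sound
city = unit(rng.normal(size=dim))   # e.g. embedding of traffic noise

# Toy "image database": each entry mixes concept directions.
images = {
    "bird_on_beach": unit(bird + ocean),
    "bird_in_city": unit(bird + city),
    "empty_beach": unit(ocean),
}

# Compose bird photo + wave sound by adding their embeddings,
# then retrieve the image whose embedding is most similar.
query = unit(bird + ocean)
best = max(images, key=lambda name: images[name] @ query)
print(best)  # the image combining both concepts wins
```

In ImageBind's actual pipeline, a composed vector like `query` can also be fed to a pre-trained decoder rather than used for retrieval, which is the basis of the audio-to-image generation described above.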
Meta plans to extend the supported modalities to other senses, such as touch, speech, smell, and brain fMRI signals, to enable richer human-centric AI models in the future. The company shares its AI research openly, in contrast to its rivals OpenAI and Google, which have become more secretive.
While ImageBind is still a research project and not yet ready for consumer or commercial use, it points toward generative AI systems capable of creating immersive, multisensory experiences. Its open-source release will likely spur further development in this field.