Meta AI has developed a groundbreaking new multimodal model named CM3leon (pronounced "chameleon"). It is presented as the first model that can both understand and generate text and images, allowing users to create images from text descriptions or compose text based on images.
CM3leon represents a major leap forward in multimodal AI. Its architecture uses a decoder-only, tokenizer-based transformer, similar to text-only language models. Building on Meta's earlier RA-CM3 work, CM3leon also incorporates a "retrieval-augmented" training technique, drawing on an external database during training to pull in diverse and relevant data.
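To make that architecture concrete, here is a minimal sketch (not Meta's code) of a decoder-only transformer that models text tokens and discrete image tokens in one shared vocabulary. The vocabulary sizes and the idea of a VQ-style image tokenizer are assumptions for illustration only:

```python
# Minimal sketch of a decoder-only transformer over a single vocabulary that
# mixes text tokens and discrete image tokens. Vocabulary sizes and the VQ
# image tokenizer are hypothetical placeholders, not CM3leon's actual values.
import torch
import torch.nn as nn

TEXT_VOCAB = 32_000      # assumed text BPE vocabulary size
IMAGE_VOCAB = 8_192      # assumed codebook size of a VQ image tokenizer
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

class TinyMultimodalLM(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=6, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):                      # ids: (batch, seq)
        b, t = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        causal = nn.Transformer.generate_square_subsequent_mask(t).to(ids.device)
        x = self.blocks(x, mask=causal)          # causal mask = decoder-only LM
        return self.head(x)                      # next-token logits over text + image vocab

# A "caption followed by image" sequence: text-token IDs, then IDs shifted
# into the image range; the model learns to continue one modality from the other.
ids = torch.cat([torch.randint(0, TEXT_VOCAB, (1, 16)),
                 torch.randint(TEXT_VOCAB, VOCAB, (1, 64))], dim=1)
logits = TinyMultimodalLM()(ids)
print(logits.shape)   # torch.Size([1, 80, 40192])
```

Because everything is a token in one sequence, generation in either direction is just autoregressive decoding: condition on text tokens to sample image tokens, or the reverse.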
Large-scale multitask instruction tuning enables CM3leon to handle a range of tasks, including text-to-image generation, text-guided image editing, caption generation, visual question answering, and structure-guided image editing.
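One way to see how a single model can cover such different tasks is to frame each as a prompt-plus-target token sequence trained with ordinary next-token prediction. The templates and the <img> markers below are hypothetical illustrations, not CM3leon's actual prompt formats:

```python
# Hypothetical instruction-tuning examples (formats invented for illustration;
# "<img> ... </img>" stands for a span of discrete image tokens).
examples = [
    {"task": "text-to-image",
     "prompt": "Generate an image: a small cactus wearing a straw hat",
     "target": "<img> ...image tokens... </img>"},
    {"task": "captioning",
     "prompt": "<img> ...image tokens... </img> Describe the image in detail.",
     "target": "A small cactus wearing a straw hat sits on a sunny windowsill."},
    {"task": "visual question answering",
     "prompt": "<img> ...image tokens... </img> Question: What is the cactus wearing?",
     "target": "A straw hat."},
]

# During instruction tuning, every example becomes one token sequence
# (prompt followed by target) and the model is trained on next-token prediction.
for ex in examples:
    sequence = f"{ex['prompt']} {ex['target']}"
    print(f"[{ex['task']}] {sequence[:70]}...")
```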
To train CM3leon, Meta used a dataset of millions of licensed images from Shutterstock. The best-performing version has over 7 billion parameters, more than double the size of DALL-E 2. On the image generation benchmark MS-COCO, CM3leon achieved a new state-of-the-art Fréchet Inception Distance (FID) score of 4.88, surpassing Google's Parti model.
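For reference, FID is the Fréchet distance between Gaussians fitted to Inception features of real and generated images, with lower scores meaning the generated distribution is closer to the real one. The sketch below shows the standard computation, with random arrays standing in for actual Inception-v3 activations:

```python
# Standard FID computation: fit a Gaussian (mean, covariance) to each feature
# set and measure the Fréchet distance between the two Gaussians.
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = np.asarray(linalg.sqrtm(cov_r @ cov_g)).real  # drop tiny imaginary parts
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean))

real = np.random.randn(2048, 64)   # (n_images, feature_dim) placeholders
fake = np.random.randn(2048, 64)
print(f"FID = {fid(real, fake):.2f}")   # lower is better
```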
According to Meta, a key advantage of CM3leon is its ability to produce coherent images that closely follow complex prompts, both in text-to-image generation and in text-guided editing. It also performs well at detailed image captioning and visual question answering, showing versatility across vision-language tasks.
CM3leon was trained using only licensed image data, avoiding concerns about image ownership while maintaining high performance. Meta states this brings them closer to enabling creativity and enhanced applications for the metaverse.
The closed-source nature of CM3leon has nevertheless drawn criticism, as Meta and other tech giants benefit greatly from open-source AI while keeping their own models private. Still, with its multimodal abilities, CM3leon marks a significant step forward for AI.