Robotics researchers at Google and the Technical University of Berlin have reported significant progress in developing an AI language model capable of controlling multiple robots in diverse environments. The model, called PaLM-E (short for Pathways Language Model, Embodied), integrates Google's PaLM with a ViT (Vision Transformer) to enable both natural language processing and visual reasoning.
PaLM-E can execute complex voice commands with greater accuracy, understand and carry out tasks that were previously difficult to specify, and recognize specific players in photos, noting them in real time.
PaLM-E has a massive 562 billion parameters, which Google and the Technical University of Berlin achieved by combining two models: Google's PaLM, with 540 billion parameters, and ViT, with 22 billion. PaLM-E offers several significant advances in human-robot interaction, such as allowing robots to be controlled by voice and generating text descriptions from images. It can control different robots across multiple environments, demonstrating a notable degree of flexibility and adaptability.
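To make that combination more concrete: the core idea is that visual features from the ViT encoder are projected into the same embedding space as the language model's word tokens, so an image and a sentence can be fed to the model as a single interleaved sequence. The sketch below illustrates this at a toy scale; the dimensions, function names, and random projection are illustrative assumptions, not Google's actual code.

```python
import numpy as np

# Illustrative dimensions only; the real models are vastly larger.
VIT_DIM = 1024   # assumed size of ViT image feature vectors
LM_DIM = 4096    # assumed size of the language model's token embeddings

# A learned linear projection maps visual features into the language
# model's token-embedding space (here just a random stand-in).
projection = np.random.randn(VIT_DIM, LM_DIM) * 0.02

def encode_image(image) -> np.ndarray:
    """Stand-in for a ViT encoder: returns a few feature vectors."""
    return np.random.randn(4, VIT_DIM)

def embed_token(token: str) -> np.ndarray:
    """Stand-in for the language model's token embedding lookup."""
    return np.random.randn(LM_DIM)

def build_multimodal_sequence(prompt_parts):
    """Interleave text tokens and projected image features into one
    sequence of vectors, which the language model then processes as
    if every element were an ordinary token embedding."""
    sequence = []
    for part in prompt_parts:
        if isinstance(part, str):
            sequence.extend(embed_token(tok) for tok in part.split())
        else:  # treat anything non-text as an image
            sequence.extend(encode_image(part) @ projection)
    return np.stack(sequence)

# Example prompt: "Given <image>, bring me the green object."
seq = build_multimodal_sequence(["Given", object(), "bring me the green object"])
print(seq.shape)  # (number of text tokens + image vectors, LM_DIM)
```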
PaLM-E also exhibits embodied reasoning: it can perform calculations on images of handwritten numbers and, through zero-shot inference, tell visually conditioned jokes about an image. The model is trained on multiple robot embodiments and on diverse visual-language tasks, demonstrating that capabilities learned in visual-language domains can transfer to embodied decision-making and allowing robot planning tasks to be carried out efficiently.
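One way to picture the planning use case is that the model is prompted with an image plus an instruction and decodes a step-by-step plan as text, which lower-level controllers then carry out. The sketch below is a hypothetical illustration of that loop; palm_e_generate and the skill functions are placeholders, not a real API.

```python
from typing import List

def palm_e_generate(prompt_parts) -> str:
    """Placeholder for model inference; returns a textual plan."""
    return "1. move to the drawer\n2. open the drawer\n3. pick up the rice chips"

def parse_plan(text: str) -> List[str]:
    """Split the decoded plan into individual steps."""
    return [line.split(".", 1)[1].strip() for line in text.splitlines() if "." in line]

# Hypothetical low-level skills that actually drive the robot.
SKILLS = {
    "move": lambda step: print(f"navigating: {step}"),
    "open": lambda step: print(f"opening: {step}"),
    "pick": lambda step: print(f"grasping: {step}"),
}

def execute(step: str):
    """Dispatch each planned step to a matching low-level skill."""
    verb = step.split()[0]
    for key, skill in SKILLS.items():
        if verb.startswith(key):
            skill(step)
            return
    print(f"no skill for: {step}")

plan = parse_plan(palm_e_generate(["camera image placeholder",
                                   "Bring me the rice chips from the drawer."]))
for step in plan:
    execute(step)
```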
PaLM-E has numerous potential application areas, including sequential planning of robot maneuvers, visual question answering, and image captioning. PaLM-E represents a significant milestone in developing AI language models that can control multiple robots in complex environments, and it showcases Google's progress in AI development since the launch of ChatGPT.