Enabling action crossmodality for a pretrained large language model

Natural Language Processing Journal (2024)

Abstract
Natural language processing and vision tasks have recently seen large improvements through the rise of Transformer architectures. High-performing large language models (LLMs) benefit from large textual datasets that are abundantly available online. However, action and bidirectional action-language tasks are less developed, as these require more specific, labelled data. Therefore, we aim to enable these robotic action capabilities for a pretrained LLM while maintaining high efficiency with regard to the required training time and data size. To achieve this, we split a Transformer-based LLM and insert a multimodal architecture into it. Specifically, we split a pretrained T5 LLM between its encoder and decoder parts to insert the crossmodal Transformer component of a Paired Transformed Autoencoders (PTAE) bidirectional action-language model. The experiments are conducted on a new dataset consisting of unimodal language translation and crossmodal bidirectional action-language translation. The natural language capabilities of the original T5 are reestablished efficiently by training the crossmodal Transformer, which requires only one 5.7-millionth of the T5 model’s original training data. Furthermore, the new model, called CrossT5, achieves high accuracy on the vision- and language-guided robotic action tasks. By design, the CrossT5 agent acts robustly when tested with language commands not included in the dataset. The results demonstrate that this novel approach succeeds in combining the advanced linguistic capabilities of LLMs with the low-level robotic control skills of vision-action models. The code is available at: https://github.com/samsoneko/CrossT5.
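The core architectural idea, splitting a pretrained T5 between its encoder and decoder and routing the encoder output through an inserted crossmodal Transformer before it reaches the decoder, can be illustrated with a minimal sketch. The sketch below assumes the HuggingFace Transformers T5 implementation; the CrossModalBridge module, its hyperparameters, and the dummy action embeddings are hypothetical placeholders rather than the authors' PTAE component (see the linked repository for the actual CrossT5 code).

```python
# Minimal illustrative sketch, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

class CrossModalBridge(nn.Module):
    """Hypothetical stand-in for a PTAE-style crossmodal Transformer."""
    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, text_states: torch.Tensor, action_states: torch.Tensor) -> torch.Tensor:
        # Fuse language and action/vision tokens in a joint sequence.
        fused = self.encoder(torch.cat([text_states, action_states], dim=1))
        # Keep only the language-length prefix so the T5 decoder receives
        # a sequence of the length it expects.
        return fused[:, : text_states.size(1)]

# Load a pretrained T5 and freeze it; only the inserted bridge would be trained.
t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
for p in t5.parameters():
    p.requires_grad = False

bridge = CrossModalBridge(d_model=t5.config.d_model)

# Encode a language command with the (frozen) T5 encoder.
text = tokenizer("push the red cube to the left", return_tensors="pt")
encoder_out = t5.encoder(input_ids=text.input_ids,
                         attention_mask=text.attention_mask).last_hidden_state

# Dummy action/vision embeddings already projected to the T5 hidden size (placeholder).
action_states = torch.randn(1, 10, t5.config.d_model)

# Pass the fused representation to the T5 decoder via encoder_outputs.
fused = bridge(encoder_out, action_states)
generated = t5.generate(encoder_outputs=BaseModelOutput(last_hidden_state=fused),
                        max_new_tokens=20)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```

Freezing the T5 weights and training only the inserted module reflects the efficiency argument of the abstract: the pretrained linguistic capabilities are reused, and only the crossmodal component requires new data.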
Keywords
Human–robot interaction, Machine translation, Action-vision-language model, Crossmodality, Robot learning