An Embodied Generalist Agent in 3D World
CoRR (2023)
Abstract
Leveraging massive knowledge and learning schemes from large language models
(LLMs), recent machine learning models show notable successes in building
generalist agents that exhibit the capability of general-purpose task solving
in diverse domains, including natural language processing, computer vision, and
robotics. However, a significant challenge remains as these models exhibit
limited ability in understanding and interacting with the 3D world. We argue
this limitation significantly hinders the current models from performing
real-world tasks and further achieving general intelligence. To this end, we
introduce an embodied multi-modal and multi-task generalist agent that excels
in perceiving, grounding, reasoning, planning, and acting in the 3D world. Our
proposed agent, referred to as LEO, is trained with shared LLM-based model
architectures, objectives, and weights in two stages: (i) 3D vision-language
alignment and (ii) 3D vision-language-action instruction tuning. To facilitate
the training, we meticulously curate and generate an extensive dataset
comprising object-level and scene-level multi-modal tasks of exceptional scale
and complexity, which necessitate a deep understanding of and interaction with the
3D world. Through rigorous experiments, we demonstrate LEO's remarkable
proficiency across a wide spectrum of tasks, including 3D captioning, question
answering, embodied reasoning, embodied navigation, and robotic manipulation.
Our ablation results further provide valuable insights for the development of
future embodied generalist agents.