MOKA: Open-Vocabulary Robotic Manipulation through Mark-Based Visual Prompting
CoRR (2024)
Abstract
Open-vocabulary generalization requires robotic systems to perform tasks
involving complex and diverse environments and task goals. While the recent
advances in vision language models (VLMs) present unprecedented opportunities
to solve unseen problems, how to utilize their emergent capabilities to control
robots in the physical world remains an open question. In this paper, we
present MOKA (Marking Open-vocabulary Keypoint Affordances), an approach that
employs VLMs to solve robotic manipulation tasks specified by free-form
language descriptions. At the heart of our approach is a compact point-based
representation of affordance and motion that bridges the VLM's predictions on
RGB images and the robot's motions in the physical world. By prompting a VLM
pre-trained on Internet-scale data, our approach predicts the affordances and
generates the corresponding motions by leveraging the concept understanding and
commonsense knowledge from broad sources. To scaffold the VLM's reasoning in a
zero-shot manner, we propose a visual prompting technique that annotates marks on the
images, converting the prediction of keypoints and waypoints into a series of
visual question answering problems that are feasible for the VLM to solve.
Using the robot experiences collected in this way, we further investigate ways
to bootstrap the performance through in-context learning and policy
distillation. We evaluate and analyze MOKA's performance on a variety of
manipulation tasks specified by free-form language descriptions, such as tool
use, deformable body manipulation, and object rearrangement.
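
The mark-based prompting described above can be pictured with a short sketch: candidate keypoints are drawn as numbered marks on the RGB image, and the VLM is then asked, in a visual question answering style, to pick the marks that serve as affordance points for the task. The sketch below is illustrative only; the `query_vlm` callable, the grasp/function/target point names, and the JSON answer format are assumptions for this example, not the paper's exact interface.

```python
# Minimal sketch of mark-based visual prompting (not the authors' code).
# `query_vlm` is a hypothetical stand-in for any VLM API that accepts an
# annotated image plus a text question and returns a JSON string.
import json
from PIL import Image, ImageDraw


def annotate_candidate_points(image, points):
    """Draw numbered marks on candidate keypoints so the VLM can refer to them by index."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for idx, (x, y) in enumerate(points):
        draw.ellipse((x - 6, y - 6, x + 6, y + 6), outline="red", width=2)
        draw.text((x + 8, y - 8), str(idx), fill="red")
    return annotated


def select_affordance_points(image, candidate_points, task_instruction, query_vlm):
    """Turn keypoint selection into a visual question answering query over marked candidates."""
    marked = annotate_candidate_points(image, candidate_points)
    prompt = (
        f"Task: {task_instruction}\n"
        "The image shows numbered candidate points. Answer in JSON with the "
        "indices of the grasp point, the function point, and the target point, "
        'e.g. {"grasp": 3, "function": 7, "target": 1}.'
    )
    answer = json.loads(query_vlm(marked, prompt))
    return {name: candidate_points[answer[name]] for name in ("grasp", "function", "target")}
```

The selected 2D points would then be lifted to 3D (e.g., with a depth camera) and turned into robot motions, which is the role of the point-based affordance representation described in the abstract.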