Spatiotemporal Predictive Pre-training for Robotic Motor Control
CoRR (2024)
Abstract
Robotic motor control necessitates the ability to predict the dynamics of
environments and interaction objects. However, advanced self-supervised
pre-trained visual representations (PVRs) in robotic motor control, leveraging
large-scale egocentric videos, often focus solely on learning the static
content features of sampled image frames. This neglects the crucial temporal
motion clues in human video data, which implicitly contain key knowledge about
sequentially interacting with and manipulating the environments and objects. In
this paper, we present a simple yet effective robotic motor control visual
pre-training framework that jointly performs spatiotemporal predictive learning
on large-scale video data, termed STP. STP samples paired frames
from video clips and follows two key designs in a multi-task learning
manner. First, we perform spatial prediction on the masked current frame for
learning content features. Second, we use the future frame, masked at an
extremely high ratio, as a condition and, based on the masked current frame,
perform temporal prediction of the future frame to capture motion features.
These efficient designs ensure that our representation focuses on motion
information while capturing spatial details. We carry out the largest-scale
evaluation of PVRs for robotic motor control to date, which encompasses 21
tasks on a real-world Franka robot arm and in 5 simulated environments.
Extensive experiments demonstrate the effectiveness of STP and further unlock
its generality and data efficiency through post-pre-training and hybrid
pre-training.
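The paired-frame masking scheme the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`random_mask`, `stp_masks`) and the concrete mask ratios (0.75 for the current frame, 0.95 as an "extremely high" ratio for the future frame) are assumptions for demonstration only.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Return a boolean mask; True marks a masked (hidden) patch."""
    num_masked = int(num_patches * mask_ratio)
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def stp_masks(num_patches=196, current_ratio=0.75, future_ratio=0.95, seed=0):
    """Sketch of STP-style paired-frame masking: a moderately masked
    current frame (spatial prediction target) and an extremely sparse
    future frame that serves only as a condition for temporal prediction.
    Ratios here are illustrative, not from the paper."""
    rng = np.random.default_rng(seed)
    current_mask = random_mask(num_patches, current_ratio, rng)
    future_mask = random_mask(num_patches, future_ratio, rng)
    return current_mask, future_mask

# For a 14x14 ViT patch grid (196 patches), the future frame keeps
# only a handful of visible patches as the temporal condition.
current_mask, future_mask = stp_masks()
print(current_mask.sum(), future_mask.sum())  # 147 186
```

With these ratios the current frame keeps 49 visible patches while the future frame keeps only 10, so the model must rely on motion cues from the current frame to reconstruct the future one.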