Embodied Understanding of Driving Scenarios
CoRR (2024)
Abstract
Embodied scene understanding serves as the cornerstone for autonomous agents
to perceive, interpret, and respond to open driving scenarios. Such
understanding is typically founded upon Vision-Language Models (VLMs).
Nevertheless, existing VLMs are restricted to the 2D domain, lacking spatial
awareness and long-horizon extrapolation capabilities. We revisit the key
aspects of autonomous driving and formulate appropriate rubrics. To this end,
we introduce the Embodied Language Model (ELM), a comprehensive framework
tailored for agents' understanding of driving scenes with large spatial and
temporal spans. ELM incorporates space-aware pre-training to endow the agent
with robust spatial localization capabilities. In addition, the model employs
time-aware token selection to accurately query temporal cues. We instantiate
ELM on the reformulated multi-faceted benchmark, where it surpasses previous
state-of-the-art approaches in all aspects. All code, data, and models will be
publicly shared.