Online Policy Optimization in Unknown Nonlinear Systems
arXiv (2024)
Abstract
We study online policy optimization in nonlinear time-varying dynamical
systems where the true dynamical models are unknown to the controller. This
problem is challenging because, unlike in linear systems, the controller cannot
obtain globally accurate estimates of the ground-truth dynamics using local
exploration. We propose a meta-framework that combines a general online policy
optimization algorithm with a general online estimator of the dynamical
system's model parameters. We show that if the hypothetical joint dynamics
induced by the policy optimization algorithm with known parameters satisfies
several desired properties, then the joint dynamics under inexact parameters
from the estimator will be robust to errors. Importantly, the final policy
regret depends only on the estimator's predictions on the visited trajectory,
which relaxes a bottleneck on identifying the true parameters globally. To
demonstrate our framework, we develop a computationally efficient variant of
Gradient-based Adaptive Policy Selection, called Memoryless GAPS (M-GAPS), and
use it to instantiate the policy optimization algorithm. Combining M-GAPS with
online gradient descent to instantiate the estimator yields (to our knowledge)
the first local regret bound for online policy optimization in nonlinear
time-varying systems with unknown dynamics.
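
The wiring described in the abstract (a policy optimizer that consumes parameter estimates from an online estimator, with both updated along the visited trajectory) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the scalar system x_{t+1} = theta * sin(x_t) + u_t, the policy class u_t = -k * sin(x_t), the cost x_{t+1}^2, the projection interval for k, the step sizes, and the name run_meta_framework. The policy update is a plain projected gradient step with a running state-sensitivity term; it is a stand-in for the policy optimizer and is not M-GAPS itself. The estimator is online gradient descent on the squared one-step prediction error. The toy plant is already stable without control, so the sketch shows only how the two updates are coupled, not a regret guarantee.

```python
import math


def run_meta_framework(theta_true=0.8, steps=200, lr_policy=0.05, lr_est=0.2):
    """Toy coupling of a gradient-based policy update with an OGD estimator.

    Hypothetical scalar system (an assumption, not from the paper):
        x_{t+1} = theta_true * sin(x_t) + u_t,   u_t = -k_t * sin(x_t)
    """
    x = 1.0            # initial state
    k = 0.0            # policy parameter, kept in [0, 1] by projection
    theta_hat = 0.0    # estimator's current guess of theta_true
    dx_dk = 0.0        # running sensitivity of the state w.r.t. k
    pred_errors = []   # one-step prediction errors on the visited trajectory

    for _ in range(steps):
        s = math.sin(x)
        u = -k * s

        # True dynamics: the controller only observes x_next, never theta_true.
        x_next = theta_true * s + u

        # Estimator: online gradient descent on the squared prediction error
        # (theta_hat * s + u - x_next)^2, evaluated at the visited state only.
        err = theta_hat * s + u - x_next
        pred_errors.append(abs(err))
        theta_hat -= lr_est * 2.0 * err * s

        # Policy optimizer: gradient of the cost x_next^2 w.r.t. k, computed
        # through the ESTIMATED model x_next ~ (theta_hat - k) * sin(x).
        dx_next_dk = -s + (theta_hat - k) * math.cos(x) * dx_dk
        grad_k = 2.0 * x_next * dx_next_dk
        k = min(max(k - lr_policy * grad_k, 0.0), 1.0)  # projected step

        x, dx_dk = x_next, dx_next_dk

    return x, k, theta_hat, pred_errors


if __name__ == "__main__":
    x, k, theta_hat, errs = run_meta_framework()
    print("final state:", x)
    print("final policy parameter:", k)
    print("final parameter estimate:", theta_hat)
    print("last prediction error:", errs[-1])
```

In this toy run the prediction error along the trajectory shrinks as the state approaches the origin, even though theta_hat is not required to reach theta_true. That loosely mirrors the abstract's point that only predictions on the visited trajectory matter, not global identification of the parameters.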