Look-ahead Search on Top of Policy Networks in Imperfect Information Games
CoRR (2023)
Abstract
Test-time search is often used to improve the performance of reinforcement
learning algorithms. Performing theoretically sound search in fully adversarial
two-player games with imperfect information is notoriously difficult and
requires a complicated training process. We present a method for adding
test-time search to an arbitrary policy-gradient algorithm that learns from
sampled trajectories. Besides the policy network, the algorithm trains an
additional critic network, which estimates the expected values of players
following various transformations of the policies given by the policy network.
These values are then used for depth-limited search. We show how the values
from this critic can create a value function for imperfect-information games.
Moreover, they can be used to compute the summary statistics necessary to start
the search from an arbitrary decision point in the game. The presented
algorithm is scalable to very large games since it does not require any search
at training time. We evaluate the algorithm's performance when trained with
Regularized Nash Dynamics, and we evaluate the benefit of search in the
standard benchmark game of Leduc hold'em, multiple variants of
imperfect-information Goofspiel, and Battleships.
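The core idea of the search step, as the abstract describes it, is to run depth-limited look-ahead and substitute the critic's value estimates once the depth limit is reached. The sketch below illustrates only that control flow; every name in it (`ToyGame`, `depth_limited_value`, the constant critic and uniform policy) is an illustrative stand-in rather than the paper's implementation, and the toy game is perfect-information purely to keep the example short, whereas the paper's method targets imperfect-information games.

```python
# Hypothetical sketch: depth-limited search that falls back on a critic's
# value estimate at the depth limit. Not the authors' code.

class ToyGame:
    """Two-action, two-ply toy game; NOT one of the paper's benchmarks."""
    def is_terminal(self, state):
        return len(state) == 2
    def utility(self, state):
        # Payoffs to player 0 for each complete action history.
        return {(0, 0): 1.0, (0, 1): -1.0, (1, 0): 0.5, (1, 1): 0.0}[state]
    def actions(self, state):
        return [0, 1]
    def to_move(self, state):
        return len(state) % 2
    def next(self, state, action):
        return state + (action,)

def depth_limited_value(state, depth, policy, critic, game):
    """Expected value of `state` for the searching player (player 0)."""
    if game.is_terminal(state):
        return game.utility(state)
    if depth == 0:
        # Depth limit reached: use the critic's estimate of the value of
        # continuing with (a transformation of) the trained policy.
        return critic(state)
    actions = game.actions(state)
    if game.to_move(state) == 0:  # searching player maximizes
        return max(depth_limited_value(game.next(state, a), depth - 1,
                                       policy, critic, game)
                   for a in actions)
    # Opponent is modeled as sampling from the policy network's distribution.
    probs = policy(state)
    return sum(p * depth_limited_value(game.next(state, a), depth - 1,
                                       policy, critic, game)
               for a, p in zip(actions, probs))

if __name__ == "__main__":
    game = ToyGame()
    uniform = lambda s: [0.5, 0.5]   # stand-in policy network
    critic = lambda s: 0.0           # stand-in critic (constant estimate)
    print(depth_limited_value((), 2, uniform, critic, game))  # → 0.25
```

With depth 2 the search reaches every terminal state, so the critic is never consulted; shrinking the depth to 1 makes the returned value depend entirely on the critic's estimates, which is where the quality of the trained critic network matters.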