Keywords: Actor-Critic; multi-agent; multi-player; optimal policy; poker; reinforcement learning

Source: DOI: 10.3390/e24060774

Abstract:
Poker has long been considered a challenging problem in both artificial intelligence and game theory because it is characterized by imperfect information and uncertainty, properties shared by many realistic problems such as auctions, pricing, cyber security, and operations. However, it remains unclear whether playing an equilibrium policy in multi-player games is wise, and it is infeasible to verify theoretically whether a policy is optimal. Designing an effective optimal-policy learning method therefore has greater practical significance. This paper proposes an optimal policy learning method for multi-player poker games based on Actor-Critic reinforcement learning. First, it builds an Actor network that makes decisions from imperfect information and a Critic network that evaluates policies with perfect information. Second, it proposes two novel multi-player poker policy update methods: the asynchronous policy update algorithm (APU) for multi-player multi-policy scenarios and the dual-network asynchronous policy update algorithm (Dual-APU) for multi-player shared-policy scenarios. Finally, it uses the most popular variant, six-player Texas hold 'em, to validate the performance of the proposed optimal policy learning method. The experiments demonstrate that the policies learned by the proposed methods perform well and gain steadily compared with existing approaches. In sum, policy learning methods for imperfect-information games based on Actor-Critic reinforcement learning perform well on poker and can be transferred to other imperfect-information games. Such training with perfect information and testing with imperfect information is an effective and explainable approach to learning an approximately optimal policy.
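To make the asymmetric design described above concrete, here is a minimal PyTorch sketch of the general idea: the Actor chooses actions from a player's imperfect-information observation, while the Critic, used only during training, evaluates the full perfect-information state. This is an illustration under stated assumptions, not the paper's implementation; the dimensions (OBS_DIM, STATE_DIM, N_ACTIONS), the network sizes, and the plain advantage-based update are all hypothetical.

```python
# Hedged sketch of training with perfect information (Critic) while
# acting on imperfect information (Actor). All names and sizes below
# are assumptions for illustration, not taken from the paper.
import torch
import torch.nn as nn

OBS_DIM = 64      # hypothetical size of one player's imperfect-information view
STATE_DIM = 128   # hypothetical size of the full state (all hands + board)
N_ACTIONS = 5     # hypothetical action set, e.g. fold/check/call/raise/all-in

class Actor(nn.Module):
    """Maps a player's private observation to a distribution over actions."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 128), nn.ReLU(),
            nn.Linear(128, N_ACTIONS),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Evaluates the full, perfect-information state (training only)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, state):
        return self.net(state).squeeze(-1)

actor, critic = Actor(), Critic()
opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update(obs, state, action, ret):
    """obs: imperfect view; state: perfect state; ret: observed return."""
    value = critic(state)
    advantage = (ret - value).detach()
    # Critic regresses the perfect-information value toward the return.
    critic_loss = (ret - value).pow(2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
    # Actor follows the policy gradient weighted by that advantage,
    # using only the imperfect-information observation.
    dist = actor(obs)
    actor_loss = -(dist.log_prob(action) * advantage).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

# Hypothetical usage with random data:
obs = torch.randn(32, OBS_DIM)
state = torch.randn(32, STATE_DIM)
action = torch.randint(0, N_ACTIONS, (32,))
ret = torch.randn(32)
update(obs, state, action, ret)
```

The key design point the abstract makes is that the Critic may consume information the Actor never sees; at test time the Critic is discarded, so play depends only on imperfect information. How APU and Dual-APU schedule updates across players is not specified in the abstract and is not modeled here.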