Keywords: Deep reinforcement learning (DRL); Multi-agent MuJoCo; Multi-agent control; Multi-agent reinforcement learning (MARL); Trust region

MeSH: Learning; Algorithms; Benchmarking; Entropy; Policy

Source: DOI: 10.1016/j.neunet.2023.11.046

Abstract:
Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms constrain the Kullback-Leibler (KL) divergence (i.e., the TR constraint) between the current and new policies to avoid overly aggressive update steps and improve learning performance. However, the majority of existing TR-based MARL algorithms are on-policy, meaning that they require new data sampled by the current policies for training and cannot utilize off-policy (or historical) data, leading to low sample efficiency. This study aims to enhance the data efficiency of TR-based learning methods. To achieve this, an approximation of the original objective function is designed. In addition, it is proven that as long as the update size of the policy (measured by the KL divergence) is restricted, optimizing the designed objective function using historical data guarantees monotonic improvement of the original objective. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the framework of centralized training with decentralized execution (CTDE). Additionally, policy entropy is integrated into the reward to promote exploration and, consequently, improve stability. Comprehensive experiments are conducted on multi-agent MuJoCo (MAMuJoCo), a representative benchmark that offers a range of challenging tasks in cooperative continuous multi-agent control. The results demonstrate that the proposed algorithm outperforms all other existing algorithms by a significant margin.
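As context for the trust-region machinery described in the abstract, the following generic formulation is a sketch rather than the paper's exact objective; the behavior policy \(\pi_\beta\), replay data \(\mathcal{D}\), advantage \(A\), bound \(\delta\), and temperature \(\alpha\) are placeholder symbols. It illustrates how an importance-sampling surrogate, a KL trust-region constraint, and an entropy-augmented reward are typically combined in off-policy policy optimization:

\[
\max_{\theta}\ \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\beta}(a\mid s)}\,A(s,a)\right]
\quad\text{s.t.}\quad
\mathbb{E}_{s\sim\mathcal{D}}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\beta}(\cdot\mid s)\,\Vert\,\pi_{\theta}(\cdot\mid s)\big)\right]\le\delta,
\qquad
\tilde{r}(s,a)=r(s,a)+\alpha\,\mathcal{H}\big(\pi_{\theta}(\cdot\mid s)\big).
\]

Restricting the KL term between the data-generating and updated policies is what underlies monotonic-improvement arguments of the kind referenced in the abstract, while the entropy bonus in \(\tilde{r}\) corresponds to the exploration term integrated into the reward.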