Deep reinforcement learning (DRL)

  • Article Type: Journal Article
    The IEEE 802.11ah standard is introduced to address the growing scale of internet of things (IoT) applications. To reduce contention and enhance energy efficiency in the system, the restricted access window (RAW) mechanism is introduced in the medium access control (MAC) layer to manage the significant number of stations accessing the network. However, to achieve optimized network performance, it is necessary to appropriately determine the RAW parameters, including the number of RAW groups, the number of slots in each RAW, and the duration of each slot. In this paper, we optimize the configuration of RAW parameters in the uplink IEEE 802.11ah-based IoT network. To improve network throughput, we analyze and formulate a RAW parameter optimization problem. To cope effectively with complex and dynamic network conditions, we propose a deep reinforcement learning (DRL) approach to determine the preferable RAW parameters that optimize network throughput. To enhance learning efficiency and stability, we employ the proximal policy optimization (PPO) algorithm. We construct network environments with periodic and random traffic in the NS-3 simulator to validate the performance of the proposed PPO-based RAW parameter optimization algorithm. The simulation results reveal that, using the PPO-based DRL algorithm, optimized RAW parameters can be obtained under different network conditions and network throughput can be improved significantly.
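
    At the core of the PPO algorithm used here is the clipped surrogate objective, which keeps each policy update close to the data-collecting policy. A minimal numpy sketch of that loss is shown below; it is illustrative only, and all names and shapes are assumptions rather than the paper's code.

```python
import numpy as np

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate loss used by PPO (to be minimized).

    logp_new / logp_old: log-probabilities of the taken actions under
    the current and the behaviour policy; advantages: estimated A(s, a).
    """
    ratio = np.exp(logp_new - logp_old)              # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))  # negate: gradient ascent

# Toy usage with random numbers
rng = np.random.default_rng(0)
logp_old = rng.normal(size=64)
logp_new = logp_old + 0.05 * rng.normal(size=64)
adv = rng.normal(size=64)
print(ppo_clipped_loss(logp_new, logp_old, adv))
```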

  • Article Type: Journal Article
    In real-world scenarios, making navigation decisions for autonomous driving involves a sequential set of steps. These judgments are made based on partial observations of the environment, while the underlying model of the environment remains unknown. A prevalent method for resolving such issues is reinforcement learning, in which the agent acquires knowledge through a succession of rewards in addition to fragmentary and noisy observations. This study introduces an algorithm named deep reinforcement learning navigation via decision transformer (DRLNDT) to address the challenge of enhancing the decision-making capabilities of autonomous vehicles operating in partially observable urban environments. The DRLNDT framework is built around the Soft Actor-Critic (SAC) algorithm. DRLNDT utilizes Transformer neural networks to effectively model the temporal dependencies in observations and actions. This approach helps mitigate judgment errors that may arise due to sensor noise or occlusion within a given state. Latent vectors are extracted from high-quality images using a variational autoencoder (VAE). This technique effectively reduces the dimensionality of the state space, resulting in enhanced training efficiency. The multimodal state space consists of vector states, including velocity and position, which the vehicle's intrinsic sensors can readily obtain. Additionally, latent vectors derived from high-quality images are incorporated to facilitate the agent's assessment of the current trajectory. Experiments demonstrate that DRLNDT can achieve a superior policy without prior knowledge of the environment, detailed maps, or routing assistance, surpassing the baseline technique and other policy methods that lack historical data.
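
    DRLNDT builds on SAC, whose critics bootstrap from an entropy-regularized (soft) Bellman target. The sketch below illustrates that target computation under assumed names and shapes; it is not the authors' implementation.

```python
import numpy as np

def sac_soft_target(rewards, q1_next, q2_next, logp_next,
                    gamma=0.99, alpha=0.2, done=None):
    """Soft Bellman target used by SAC critics:
    y = r + gamma * (min(Q1', Q2') - alpha * log pi(a'|s'))."""
    if done is None:
        done = np.zeros_like(rewards)
    soft_value = np.minimum(q1_next, q2_next) - alpha * logp_next
    return rewards + gamma * (1.0 - done) * soft_value

# Toy usage with random batch values
rng = np.random.default_rng(1)
print(sac_soft_target(rng.normal(size=4), rng.normal(size=4),
                      rng.normal(size=4), rng.normal(size=4)))
```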

  • Article Type: Journal Article
    Energy efficiency and security issues are the main concerns in wireless sensor networks (WSNs) because of limited energy resources and the broadcast nature of wireless communication. Therefore, how to improve the energy efficiency of WSNs while enhancing their security performance has attracted widespread attention. To solve this problem, this paper proposes a new deep reinforcement learning (DRL)-based strategy, i.e., the DeepNR strategy, to enhance the energy efficiency and security performance of WSNs. Specifically, the proposed DeepNR strategy approximates the Q-value by designing a deep neural network (DNN) to adaptively learn the state information. It also designs DRL-based multi-level decision-making to learn and optimize the data transmission paths in real time, which eventually achieves accurate prediction and decision-making for the network. To further enhance security performance, the DeepNR strategy includes a defense mechanism that responds to detected attacks in real time to ensure the normal operation of the network. In addition, DeepNR adaptively adjusts its strategy to cope with changing network environments and attack patterns through deep learning models. Experimental results show that the proposed DeepNR outperforms conventional methods, demonstrating a remarkable 30% improvement in network lifespan, a 25% increase in network data throughput, and a 20% enhancement in security measures.
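
    DeepNR approximates Q-values with a deep neural network and makes routing decisions from them. As a hedged illustration of the general idea only (a tiny MLP scoring candidate next hops plus epsilon-greedy selection, with hypothetical feature sizes), consider:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical 2-layer MLP mapping a (state, candidate-hop) feature vector
# to a scalar Q-value, standing in for a learned deep Q-network.
W1, b1 = rng.normal(size=(8, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 1)) * 0.1, np.zeros(1)

def q_value(features):
    h = np.maximum(features @ W1 + b1, 0.0)   # ReLU hidden layer
    return (h @ W2 + b2).squeeze(-1)

def choose_next_hop(candidate_features, epsilon=0.1):
    """Epsilon-greedy selection over candidate forwarding nodes."""
    if rng.random() < epsilon:
        return int(rng.integers(len(candidate_features)))
    return int(np.argmax(q_value(candidate_features)))

candidates = rng.normal(size=(5, 8))  # 5 neighbouring nodes, 8 features each
print(choose_next_hop(candidates))
```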

  • Article Type: Journal Article
    Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms constrain the Kullback-Leibler (KL) divergence (i.e., the TR constraint) between the current and new policies to avoid aggressive update steps and improve learning performance. However, the majority of existing TR-based MARL algorithms are on-policy, meaning that they require new data sampled by the current policies for training and cannot utilize off-policy (or historical) data, leading to low sample efficiency. This study aims to enhance the data efficiency of TR-based learning methods. To achieve this, an approximation of the original objective function is designed. In addition, it is proven that as long as the update size of the policy (measured by the KL divergence) is restricted, optimizing the designed objective function using historical data can guarantee monotonic improvement of the original objective. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the framework of centralized training with decentralized execution (CTDE). Additionally, policy entropy is integrated into the reward to promote exploration and, consequently, improve stability. Comprehensive experiments are conducted on the representative multi-agent MuJoCo (MAMuJoCo) benchmark, which offers a range of challenging tasks in cooperative continuous multi-agent control. The results demonstrate that the proposed algorithm outperforms all other existing algorithms by a significant margin.
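
    The key ingredient of the trust-region guarantee described above is keeping the KL divergence between the old and new policies below a threshold. A small illustrative check for discrete policies (the function names and the threshold value are assumptions, not the paper's code) might look like:

```python
import numpy as np

def categorical_kl(p_old, p_new, eps=1e-8):
    """KL(pi_old || pi_new) for discrete action distributions."""
    p_old = np.clip(p_old, eps, 1.0)
    p_new = np.clip(p_new, eps, 1.0)
    return np.sum(p_old * (np.log(p_old) - np.log(p_new)), axis=-1)

def trust_region_accept(p_old, p_new, delta=0.01):
    """Accept the candidate policy only if the mean KL stays within delta."""
    return float(np.mean(categorical_kl(p_old, p_new))) <= delta

old = np.array([[0.7, 0.2, 0.1]])
new = np.array([[0.65, 0.25, 0.10]])
print(trust_region_accept(old, new))
```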

  • Article Type: Journal Article
    Target detection in high-contrast, multi-object images and videos is challenging. This difficulty results from different areas and objects/people having varying pixel distributions, contrast, and intensity properties. This work introduces a new region-focused feature detection (RFD) method to tackle this problem and improve target detection accuracy. The RFD method divides the input image into several smaller regions so that as much of the image as possible is processed. Each of these regions has its own contrast and intensity attributes computed. Deep recurrent learning is then used to iteratively extract these features using a similarity measure from training inputs corresponding to the various regions. The target can be located by combining features from many overlapping locations. The recognized target is compared to the inputs used during training, with the help of contrast and intensity attributes, to increase accuracy. The feature distribution across regions is also used for repeated training of the learning paradigm. This method efficiently lowers false rates during region selection and pattern matching with numerous extraction instances. Therefore, the suggested method provides greater accuracy by singling out distinct regions and filtering out misleading rate-generating features. Accuracy, similarity index, false rate, extraction ratio, processing time, and other metrics are used to assess the effectiveness of the proposed approach. The proposed RFD improves the similarity index by 10.69%, the extraction ratio by 9.04%, and precision by 13.27%. The false rate and processing time are reduced by 7.78% and 9.19%, respectively.
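
    The RFD pipeline starts by splitting the image into regions and computing per-region contrast and intensity statistics. A simple illustrative sketch of that first step follows (the grid size and the statistic definitions are assumptions, not the paper's exact procedure):

```python
import numpy as np

def region_stats(image, grid=(4, 4)):
    """Split an image into a grid of regions and compute per-region
    mean intensity and standard-deviation contrast (illustrative only)."""
    h, w = image.shape
    rh, rw = h // grid[0], w // grid[1]
    stats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            patch = image[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            stats.append((patch.mean(), patch.std()))
    return np.array(stats)          # shape: (num_regions, 2)

img = np.random.default_rng(3).random((64, 64))
print(region_stats(img).shape)
```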

  • Article Type: Journal Article
    Mobile robots are playing an increasingly significant role in social life and industrial production, such as search-and-rescue robots, autonomous exploration by sweeping robots, and so on. Improving the accuracy of autonomous navigation of mobile robots is a hot issue to be solved. However, traditional navigation methods are unable to realize crash-free navigation in an environment with dynamic obstacles, so more and more scholars are gradually using autonomous navigation based on deep reinforcement learning (DRL) to replace overly conservative traditional methods. On the other hand, DRL's training time is too long, and the lack of long-term memory easily leads the robot to a dead end, which makes its application in real scenes more difficult. To shorten training time and prevent mobile robots from getting stuck and spinning around, we design a new robot autonomous navigation framework which combines traditional global planning with DRL-based local planning. The entire navigation process is thus transformed into first using a traditional navigation algorithm to find the global path, then searching for several high-value landmarks on the global path, and finally using the DRL algorithm to move the mobile robot toward the designated landmarks to complete the final navigation, which greatly reduces the difficulty of training the robot. Furthermore, to address the lack of long-term memory in deep reinforcement learning, we design a feature extraction network containing memory modules to preserve the long-term dependence of input features. Comparing our method with traditional navigation methods and end-to-end DRL-based navigation methods shows that when the number of dynamic obstacles is large and the obstacles are moving rapidly, our proposed method is, on average, 20% better than the second-ranked method in navigation efficiency (navigation time and navigation path length), 34% better in safety (number of collisions), and 26.6% higher in success rate, and it exhibits strong robustness.
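
    The framework first plans a global path with a traditional planner and then selects landmarks along it for the DRL local planner to chase. A toy sketch of landmark selection by arc-length spacing is given below; the paper's "high-value" criterion is not specified here, so fixed spacing is assumed for illustration.

```python
import numpy as np

def select_landmarks(global_path, spacing=2.0):
    """Pick intermediate landmarks along a global path at roughly fixed
    arc-length spacing (a simple stand-in for 'high-value landmarks')."""
    landmarks = [global_path[0]]
    travelled = 0.0
    for prev, curr in zip(global_path[:-1], global_path[1:]):
        travelled += float(np.linalg.norm(curr - prev))
        if travelled >= spacing:
            landmarks.append(curr)
            travelled = 0.0
    landmarks.append(global_path[-1])
    return np.array(landmarks)

path = np.cumsum(np.ones((20, 2)) * 0.5, axis=0)   # toy straight-line path
print(select_landmarks(path))
```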

  • Article Type: Journal Article
    With the development of ocean exploration technology, the exploration of the ocean has become a hot research field involving the use of autonomous underwater vehicles (AUVs). In complex underwater environments, the fast, safe, and smooth arrival at target points is key for AUVs conducting underwater exploration missions. Most path-planning approaches combine deep reinforcement learning (DRL) and path-planning algorithms to achieve obstacle avoidance and path shortening. In this paper, we propose a method to mitigate the local-minimum problem in the artificial potential field (APF) by constructing a traction force that pulls AUVs out of local minima. The improved artificial potential field (IAPF) method is combined with DRL for path planning, while the reward function in the DRL algorithm is optimized and the generated paths are used to optimize future paths. By comparing our results with the experimental data of various algorithms, we found that the proposed method has positive effects and advantages in path planning. It is an efficient and safe path-planning method with obvious application potential in underwater navigation devices.
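
    The IAPF idea is to detect when the classic attractive/repulsive potential field cancels out (a local minimum) and add a traction force to pull the vehicle out. A simplified 2-D sketch follows, with assumed gains and a perpendicular traction direction; it is not the authors' formulation.

```python
import numpy as np

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """Classic attractive + repulsive APF force at position `pos`."""
    force = k_att * (goal - pos)                       # attraction to goal
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 1e-6 < d < d0:                              # repulsion inside range d0
            force += k_rep * (1.0 / d - 1.0 / d0) / d**3 * diff
    return force

def traction_escape(force, pos, goal, threshold=1e-3):
    """If the net force nearly vanishes (local minimum), add a traction
    force perpendicular to the goal direction to pull the AUV out."""
    if np.linalg.norm(force) < threshold:
        to_goal = goal - pos
        perp = np.array([-to_goal[1], to_goal[0]])
        force = force + perp / (np.linalg.norm(perp) + 1e-9)
    return force

pos, goal = np.zeros(2), np.array([5.0, 0.0])
obstacles = [np.array([2.5, 0.0])]
print(traction_escape(apf_force(pos, goal, obstacles), pos, goal))
print(traction_escape(np.zeros(2), pos, goal))   # stuck case: traction appears
```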

  • Article Type: Journal Article
    With the proliferation of sensor-rich smart devices (smartphones, iPads, etc.), combined with the need to collect large amounts of data, mobile crowd sensing (MCS) has gradually attracted the attention of academics in recent years. MCS is a new and promising model for large-scale sensing and computational data collection. Its main function is to recruit a large group of participants with mobile devices to perform sensing tasks in a given area. Task assignment is an important research topic in MCS systems, which aims to efficiently assign sensing tasks to recruited workers. Previous studies have focused on greedy or heuristic approaches, whereas the MCS task allocation problem is usually an NP-hard optimisation problem due to various resource and quality constraints, so traditional greedy or heuristic approaches usually suffer from some performance loss. In addition, the platform-centric task allocation model usually considers only the interests of the platform and ignores the feelings of other participants, to the detriment of the platform's development. Therefore, in this paper, deep reinforcement learning methods are used to find more efficient task assignment solutions, and a weighted approach is adopted to optimise multiple objectives. Specifically, we use a dueling double deep Q-network (D3QN) to solve the task allocation problem. Since the maximum travel distance of the workers, the reward value, and the random arrival and time sensitivity of the sensing tasks are considered, this is a dynamic task allocation problem under multiple constraints. For dynamic problems, traditional heuristics (e.g., particle swarm optimisation, genetic algorithms) are often difficult to apply from a modelling and practical perspective. Reinforcement learning can obtain sub-optimal or optimal solutions in a limited time by means of sequential decision-making. Finally, we compare the proposed D3QN-based solution with standard baseline solutions, and experiments show that it outperforms the baselines in terms of platform profit, task completion rate, etc., so that the utility and attractiveness of the platform are enhanced.
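
    D3QN combines the dueling value/advantage decomposition with double-DQN target evaluation. The snippet below illustrates those two building blocks in isolation (toy numbers and assumed function names, not the paper's network):

```python
import numpy as np

def dueling_q(value, advantages):
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    return value + advantages - advantages.mean(axis=-1, keepdims=True)

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99, done=0.0):
    """Double DQN: the online network picks the next action,
    the target network evaluates it."""
    a_star = int(np.argmax(q_online_next))
    return reward + gamma * (1.0 - done) * q_target_next[a_star]

v = np.array([[1.0]])
adv = np.array([[0.5, -0.5, 0.0]])
print(dueling_q(v, adv))
print(double_dqn_target(1.0, np.array([0.2, 0.7, 0.1]),
                        np.array([0.3, 0.6, 0.2])))
```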

  • Article Type: Journal Article
    In this paper, we consider reconfigurable intelligent surface (RIS)-assisted integrated satellite high-altitude platform terrestrial networks (IS-HAP-TNs), which can improve network performance by exploiting HAP stability and RIS reflection. Specifically, a reflector RIS is installed on the side of the HAP to reflect signals from multiple ground user equipments (UEs) to the satellite. To maximize the system sum rate, we jointly optimize the transmit beamforming matrix at the ground UEs and the RIS phase shift matrix. Due to the unit-modulus constraint on the RIS reflective elements, the resulting combinatorial optimization problem is difficult to tackle effectively with traditional solution methods. Based on this, this paper studies a deep reinforcement learning (DRL) algorithm to achieve online decision making for this joint optimization problem. In addition, it is verified through simulation experiments that the proposed DRL algorithm outperforms the standard scheme in terms of system performance, execution time, and computing speed, making real-time decision making truly feasible.
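
    The joint optimization searches over RIS phase shifts subject to a unit-modulus constraint on each reflective element. A toy sum-rate evaluation for a candidate phase vector is sketched below; it ignores inter-user interference, uses randomly generated channels, and assumes hypothetical dimensions, so it is only meant to show where the constraint enters.

```python
import numpy as np

rng = np.random.default_rng(4)
N_ris, K = 16, 3                      # RIS elements, ground UEs (assumed sizes)
h_ue_ris = rng.normal(size=(N_ris, K)) + 1j * rng.normal(size=(N_ris, K))  # UE -> RIS
g_ris_sat = rng.normal(size=N_ris) + 1j * rng.normal(size=N_ris)           # RIS -> satellite
p_tx, noise = 1.0, 1e-2

def sum_rate(theta):
    """Sum rate over UEs for RIS phase shifts `theta` (unit-modulus Phi)."""
    phi = np.diag(np.exp(1j * theta))                 # unit-modulus constraint
    eff = g_ris_sat @ phi @ h_ue_ris                  # effective UE -> satellite channel
    gains = p_tx * np.abs(eff) ** 2
    return float(np.sum(np.log2(1.0 + gains / noise)))

print(sum_rate(rng.uniform(0, 2 * np.pi, size=N_ris)))
```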

  • Article Type: Journal Article
    Uncertainty of target motion, the limited perception ability of onboard cameras, and constrained control have brought new challenges to unmanned aerial vehicle (UAV) dynamic target tracking control. By virtue of the powerful fitting and learning ability of neural networks, this paper proposes a new deep reinforcement learning (DRL)-based end-to-end control method for UAV dynamic target tracking. Firstly, a DRL-based framework using onboard camera images is established, which simplifies the traditional modularization paradigm. Secondly, the neural network architecture, reward functions, and a soft actor-critic (SAC)-based speed-command perception algorithm are designed to train the policy network. The output of the policy network is denormalized and directly used as the speed control command, which realizes UAV dynamic target tracking. Finally, the feasibility of the proposed end-to-end control method is demonstrated by numerical simulation. The results show that the proposed DRL-based framework is feasible and simplifies the traditional modularization paradigm, and that the UAV can track a dynamic target whose speed and direction change rapidly.
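
    Because the SAC policy outputs bounded, normalized actions, the controller denormalizes them into physical speed commands before sending them to the UAV. A minimal sketch of that mapping is shown below; the command ranges and variable names are assumptions for illustration.

```python
import numpy as np

def denormalize_action(action, v_min, v_max):
    """Map a tanh-bounded policy output in [-1, 1] to a physical
    speed command in [v_min, v_max]."""
    action = np.clip(action, -1.0, 1.0)
    return v_min + 0.5 * (action + 1.0) * (v_max - v_min)

# e.g. horizontal speed limited to [-5, 5] m/s, vertical to [-2, 2] m/s (assumed)
raw = np.array([0.3, -0.8])                    # hypothetical policy output
print(denormalize_action(raw, np.array([-5.0, -2.0]), np.array([5.0, 2.0])))
```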
