Deep reinforcement learning (DRL)

  • Article Type: Journal Article
    The IEEE 802.11ah standard is introduced to address the growing scale of internet of things (IoT) applications. To reduce contention and enhance energy efficiency in the system, the restricted access window (RAW) mechanism is introduced in the medium access control (MAC) layer to manage the significant number of stations accessing the network. However, to achieve optimized network performance, it is necessary to appropriately determine the RAW parameters, including the number of RAW groups, the number of slots in each RAW, and the duration of each slot. In this paper, we optimize the configuration of RAW parameters in the uplink IEEE 802.11ah-based IoT network. To improve network throughput, we analyze and establish a RAW parameter optimization problem. To effectively cope with the complex and dynamic network conditions, we propose a deep reinforcement learning (DRL) approach to determine the preferable RAW parameters to optimize network throughput. To enhance learning efficiency and stability, we employ the proximal policy optimization (PPO) algorithm. We construct network environments with periodic and random traffic in the NS-3 simulator to validate the performance of the proposed PPO-based RAW parameter optimization algorithm. The simulation results reveal that using the PPO-based DRL algorithm, optimized RAW parameters can be obtained under different network conditions, and network throughput can be improved significantly.
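
    The abstract gives no implementation details. As a rough illustration only, the sketch below shows a PPO-style clipped surrogate loss scoring a discrete choice of RAW parameters; the action grid, probabilities, and advantages are hypothetical placeholders, not the authors' code.

```python
# Minimal sketch (not the authors' code): a PPO-style clipped surrogate loss for
# selecting RAW parameters. The action is an index into a hypothetical discrete
# grid of (num_groups, slots_per_raw, slot_duration_ms) combinations.
import numpy as np

def ppo_clip_loss(new_probs, old_probs, advantages, eps=0.2):
    """Clipped surrogate loss (negated for minimization) over a batch of actions."""
    ratio = new_probs / old_probs                       # pi_new(a|s) / pi_old(a|s)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # PPO maximizes the minimum of the unclipped and clipped objectives.
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Hypothetical action grid of RAW configurations.
raw_grid = [(g, s, d) for g in (1, 2, 4) for s in (2, 4, 8) for d in (10, 20)]
actions = [5, 12, 3]                      # indices into raw_grid chosen by the policy

# Toy batch: probabilities of those actions under the old/new policies and
# advantage estimates (e.g., throughput gain over a running baseline).
old_p = np.array([0.10, 0.25, 0.05])
new_p = np.array([0.12, 0.20, 0.09])
adv   = np.array([1.5, -0.3, 0.8])
print([raw_grid[a] for a in actions], ppo_clip_loss(new_p, old_p, adv))
```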

  • Article Type: Journal Article
    Deep Reinforcement Learning (DRL) has gained significant adoption in diverse fields and applications, mainly due to its proficiency in resolving complicated decision-making problems in spaces with high-dimensional states and actions. Deep Deterministic Policy Gradient (DDPG) is a well-known DRL algorithm that adopts an actor-critic approach, synthesizing the advantages of value-based and policy-based reinforcement learning methods. The aim of this study is to provide a thorough examination of the latest developments, patterns, obstacles, and potential opportunities related to DDPG. A systematic search was conducted using relevant academic databases (Scopus, Web of Science, and ScienceDirect) to identify 85 relevant studies published in the last five years (2018-2023). We provide a comprehensive overview of the key concepts and components of DDPG, including its formulation, implementation, and training. Then, we highlight the various applications and domains of DDPG, including Autonomous Driving, Unmanned Aerial Vehicles, Resource Allocation, Communications and the Internet of Things, Robotics, and Finance. Additionally, we provide an in-depth comparison of DDPG with other DRL algorithms and traditional RL methods, highlighting its strengths and weaknesses. We believe that this review will be an essential resource for researchers, offering them valuable insights into the methods and techniques utilized in the field of DRL and DDPG.
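
    As a companion to the survey's description of DDPG's actor-critic structure, the following minimal PyTorch sketch shows one DDPG update step: critic regression toward a bootstrapped target, a deterministic policy gradient for the actor, and Polyak-averaged target networks. Network sizes, learning rates, and the toy batch are illustrative assumptions, not taken from the surveyed papers.

```python
# Minimal DDPG update step in PyTorch (illustrative only; sizes and
# hyper-parameters are placeholders).
import copy
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 8, 2, 0.99, 0.005
actor  = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt  = torch.optim.Adam(actor.parameters(),  lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    # Critic: regress Q(s, a) toward the bootstrapped target built from the target nets.
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_t(torch.cat([s2, actor_t(s2)], dim=-1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient, i.e. ascend Q(s, actor(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Polyak-average the target networks toward the online networks.
    for p, p_t in zip(list(actor.parameters()) + list(critic.parameters()),
                      list(actor_t.parameters()) + list(critic_t.parameters())):
        p_t.data.mul_(1 - tau).add_(tau * p.data)

# Toy transition batch as it would come from a replay buffer.
B = 32
ddpg_update(torch.randn(B, obs_dim), torch.rand(B, act_dim) * 2 - 1,
            torch.randn(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1))
```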

  • Article Type: Journal Article
    In real-world scenarios, making navigation decisions for autonomous driving involves a sequential set of steps. These judgments are made based on partial observations of the environment, while the underlying model of the environment remains unknown. A prevalent method for resolving such issues is reinforcement learning, in which the agent acquires knowledge through a succession of rewards in addition to fragmentary and noisy observations. This study introduces an algorithm named deep reinforcement learning navigation via decision transformer (DRLNDT) to address the challenge of enhancing the decision-making capabilities of autonomous vehicles operating in partially observable urban environments. The DRLNDT framework is built around the Soft Actor-Critic (SAC) algorithm. DRLNDT utilizes Transformer neural networks to effectively model the temporal dependencies in observations and actions. This approach aids in mitigating judgment errors that may arise due to sensor noise or occlusion within a given state. The process of extracting latent vectors from high-quality images involves the utilization of a variational autoencoder (VAE). This technique effectively reduces the dimensionality of the state space, resulting in enhanced training efficiency. The multimodal state space consists of vector states, including velocity and position, which the vehicle's intrinsic sensors can readily obtain. Additionally, latent vectors derived from high-quality images are incorporated to facilitate the agent's assessment of the present trajectory. Experiments demonstrate that DRLNDT may achieve a superior optimal policy without prior knowledge of the environment, detailed maps, or routing assistance, surpassing the baseline technique and other policy methods that lack historical data.
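
    A rough sketch of the multimodal state construction described above, under assumed architecture sizes (not the DRLNDT code): a VAE-style encoder maps camera frames to latent vectors, which are concatenated with the vehicle's vector state and passed through a Transformer encoder over a short history before being handed to the SAC actor and critic.

```python
# Illustrative state pipeline: VAE latent + vector state + Transformer over time.
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """VAE-style encoder: image -> (mu, logvar) -> sampled latent vector."""
    def __init__(self, latent_dim=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(), nn.Flatten())
        self.mu = nn.LazyLinear(latent_dim)
        self.logvar = nn.LazyLinear(latent_dim)

    def forward(self, img):
        h = self.conv(img)
        mu, logvar = self.mu(h), self.logvar(h)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterize

latent_dim, vec_dim, seq_len = 32, 4, 8          # 4 = assumed (x, y, speed, heading)
encoder = ImageEncoder(latent_dim)
temporal = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=latent_dim + vec_dim, nhead=4,
                               batch_first=True), num_layers=2)

# One history window: 8 camera frames plus the matching vector states.
frames = torch.randn(seq_len, 3, 64, 64)
vec_state = torch.randn(seq_len, vec_dim)
z = encoder(frames)                               # (8, 32) latent vectors
state_seq = torch.cat([z, vec_state], dim=-1).unsqueeze(0)   # (1, 8, 36)
features = temporal(state_seq)                    # would feed the SAC actor/critic
print(features.shape)
```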

  • Article Type: Journal Article
    Energy efficiency and security issues are the main concerns in wireless sensor networks (WSNs) because of limited energy resources and the broadcast nature of wireless communication. Therefore, how to improve the energy efficiency of WSNs while enhancing security performance has attracted widespread attention. In order to solve this problem, this paper proposes a new deep reinforcement learning (DRL)-based strategy, i.e., the DeepNR strategy, to enhance the energy efficiency and security performance of WSNs. Specifically, the proposed DeepNR strategy approximates the Q-value by designing a deep neural network (DNN) to adaptively learn the state information. It also designs DRL-based multi-level decision-making to learn and optimize the data transmission paths in real time, which eventually achieves accurate prediction and decision-making for the network. To further enhance security performance, the DeepNR strategy includes a defense mechanism that responds to detected attacks in real time to ensure the normal operation of the network. In addition, DeepNR adaptively adjusts its strategy to cope with changing network environments and attack patterns through deep learning models. Experimental results show that the proposed DeepNR outperforms conventional methods, demonstrating a remarkable 30% improvement in network lifespan, a 25% increase in network data throughput, and a 20% enhancement in security measures.
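
    The abstract only outlines how DeepNR approximates Q-values with a DNN; the snippet below is a generic illustration under assumed state features (residual energy, distance, trust level, and so on): a small Q-network scores candidate next hops, a hop is chosen epsilon-greedily, and training would regress toward a one-step TD target.

```python
# Generic Q-value sketch for next-hop selection (not the DeepNR implementation).
import random
import torch
import torch.nn as nn

state_dim, max_neighbors = 6, 5     # assumed local features and neighbor count
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, max_neighbors))

def choose_next_hop(state, eps=0.1):
    """Epsilon-greedy selection of the forwarding neighbor."""
    if random.random() < eps:
        return random.randrange(max_neighbors)
    with torch.no_grad():
        return int(q_net(state).argmax())

def td_target(reward, next_state, done, gamma=0.95):
    """Standard one-step TD target used to train the Q-network."""
    with torch.no_grad():
        return reward + gamma * (0.0 if done else q_net(next_state).max().item())

s = torch.randn(state_dim)
print(choose_next_hop(s), td_target(reward=1.0, next_state=torch.randn(state_dim), done=False))
```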

  • Article Type: Journal Article
    In this study, we designed a multi-sensor fusion technique based on deep reinforcement learning (DRL) mechanisms and multi-model adaptive estimation (MMAE) for simultaneous localization and mapping (SLAM). The LiDAR-based point-to-line iterative closest point (PLICP) and RGB-D camera-based ORBSLAM2 methods were utilized to estimate the localization of mobile robots. Residual-value anomaly detection was combined with a proximal policy optimization (PPO)-based DRL model to accomplish the optimal adjustment of weights among the different localization algorithms. Two kinds of indoor simulation environments were established using the Gazebo simulator to validate the localization performance of the multi-model adaptive estimation approach used in this paper. The experimental results confirmed that the proposed method can effectively fuse the localization information from multiple sensors and enable mobile robots to obtain higher localization accuracy than the traditional PLICP and ORBSLAM2. It was also found that the proposed method increases the localization stability of mobile robots in complex environments.
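
    As one possible reading of the fusion step described above (the interfaces, weight semantics, and thresholds are assumptions, not the paper's code), a PPO policy could output a blending weight between the PLICP and ORB-SLAM2 pose estimates, with a residual-based anomaly check vetoing a source whose estimate jumps abnormally:

```python
# Illustrative weighted fusion of two (x, y, yaw) pose estimates.
import numpy as np

def fuse_poses(pose_lidar, pose_visual, w, prev_pose, jump_threshold=0.5):
    """Blend two pose estimates; an outlier source is overridden by the other."""
    if np.linalg.norm(pose_lidar[:2] - prev_pose[:2]) > jump_threshold:
        w = 0.0                      # LiDAR residual anomaly -> trust the camera
    if np.linalg.norm(pose_visual[:2] - prev_pose[:2]) > jump_threshold:
        w = 1.0                      # visual anomaly -> trust the LiDAR
    return w * pose_lidar + (1.0 - w) * pose_visual

prev  = np.array([1.00, 2.00, 0.10])
plicp = np.array([1.05, 2.02, 0.11])          # LiDAR-based estimate
orb   = np.array([1.90, 2.80, 0.25])          # visual estimate with a large jump
w = 0.6                                        # weight proposed by the PPO policy
print(fuse_poses(plicp, orb, w, prev))         # the visual outlier is rejected
```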

  • Article Type: Journal Article
    Multi-agent reinforcement learning (MARL) algorithms based on trust regions (TR) have achieved significant success in numerous cooperative multi-agent tasks. These algorithms restrain the Kullback-Leibler (KL) divergence (i.e., TR constraint) between the current and new policies to avoid aggressive update steps and improve learning performance. However, the majority of existing TR-based MARL algorithms are on-policy, meaning that they require new data sampled by current policies for training and cannot utilize off-policy (or historical) data, leading to low sample efficiency. This study aims to enhance the data efficiency of TR-based learning methods. To achieve this, an approximation of the original objective function is designed. In addition, it is proven that as long as the update size of the policy (measured by the KL divergence) is restricted, optimizing the designed objective function using historical data can guarantee the monotonic improvement of the original target. Building on the designed objective, a practical off-policy multi-agent stochastic policy gradient algorithm is proposed within the framework of centralized training with decentralized execution (CTDE). Additionally, policy entropy is integrated into the reward to promote exploration, and consequently, improve stability. Comprehensive experiments are conducted on a representative benchmark for multi-agent MuJoCo (MAMuJoCo), which offers a range of challenging tasks in cooperative continuous multi-agent control. The results demonstrate that the proposed algorithm outperforms all other existing algorithms by a significant margin.
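
    A schematic version of the off-policy trust-region idea described above (the distributions, KL bound, and simple accept/reject rule are placeholders, not the paper's algorithm): the surrogate objective is estimated with importance ratios against the behavior policy that collected the historical data, and the update is only taken while the KL divergence stays below a bound.

```python
# Schematic off-policy trust-region step with an importance-weighted surrogate.
import torch
import torch.distributions as D

def surrogate_and_kl(new_logits, behavior_logits, actions, advantages):
    pi_new = D.Categorical(logits=new_logits)
    pi_old = D.Categorical(logits=behavior_logits)       # policy that collected the data
    ratio = torch.exp(pi_new.log_prob(actions) - pi_old.log_prob(actions))
    surrogate = (ratio * advantages).mean()              # importance-weighted objective
    kl = D.kl_divergence(pi_old, pi_new).mean()          # trust-region measure
    return surrogate, kl

B, n_actions, kl_bound = 64, 4, 0.05
new_logits = torch.randn(B, n_actions, requires_grad=True)
old_logits = torch.randn(B, n_actions)
acts, adv = torch.randint(n_actions, (B,)), torch.randn(B)

surr, kl = surrogate_and_kl(new_logits, old_logits, acts, adv)
if kl.item() < kl_bound:                                 # only improve within the trust region
    (-surr).backward()                                   # gradient ascent on the surrogate
print(float(surr), float(kl))
```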

  • Article Type: Journal Article
    Target detection in high-contrast, multi-object images and movies is challenging. This difficulty results from different areas and objects/people having varying pixel distributions, contrast, and intensity properties. This work introduces a new region-focused feature detection (RFD) method to tackle this problem and improve target detection accuracy. The RFD method divides the input image into several smaller ones so that as much of the image as possible is processed. Each of these zones has its own contrast and intensity attributes computed. Deep recurrent learning is then used to iteratively extract these features using a similarity measure from training inputs corresponding to various regions. The target can be located by combining features from many locations that overlap. The recognized target is compared to the inputs used during training, with the help of contrast and intensity attributes, to increase accuracy. The feature distribution across regions is also used for repeated training of the learning paradigm. This method efficiently lowers false rates during region selection and pattern matching with numerous extraction instances. Therefore, the suggested method provides greater accuracy by singling out distinct regions and filtering out misleading rate-generating features. The accuracy, similarity index, false rate, extraction ratio, processing time, and other metrics are used to assess the effectiveness of the proposed approach. The proposed RFD improves the similarity index by 10.69%, extraction ratio by 9.04%, and precision by 13.27%. The false rate and processing time are reduced by 7.78% and 9.19%, respectively.
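
    A simple sketch of the region-splitting step described above, with an assumed grid size and statistics (not the RFD implementation): the image is tiled into regions, and each tile gets its own intensity and contrast descriptors for downstream matching.

```python
# Tile a grayscale image and compute per-tile intensity/contrast descriptors.
import numpy as np

def region_features(image, grid=(4, 4)):
    """Split the image into grid tiles and compute per-tile statistics."""
    h, w = image.shape
    gh, gw = h // grid[0], w // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            tile = image[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            intensity = tile.mean()
            # Michelson-style contrast of the tile.
            contrast = (tile.max() - tile.min()) / (tile.max() + tile.min() + 1e-8)
            feats.append((i, j, float(intensity), float(contrast)))
    return feats

img = np.random.rand(128, 128)
for row in region_features(img)[:3]:
    print(row)
```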

  • Article Type: Journal Article
    Mobile robots are playing an increasingly significant role in social life and industrial production, such as search-and-rescue robots, autonomously exploring sweeping robots, and so on. Improving the accuracy of autonomous navigation of mobile robots is a pressing issue. However, traditional navigation methods are unable to achieve collision-free navigation in environments with dynamic obstacles, so more and more scholars are gradually replacing overly conservative traditional methods with autonomous navigation based on deep reinforcement learning (DRL). On the other hand, DRL training takes a long time, and the lack of long-term memory easily leads the robot into a dead end, which makes its application in real scenes more difficult. To shorten training time and prevent mobile robots from getting stuck and spinning around, we design a new robot autonomous navigation framework which combines traditional global planning with DRL-based local planning. The entire navigation process can therefore be transformed into first using a traditional navigation algorithm to find the global path, then searching for several high-value landmarks on that global path, and finally using the DRL algorithm to move the mobile robot toward the designated landmarks to complete the navigation, which greatly reduces the difficulty of training the robot. Furthermore, in order to mitigate the lack of long-term memory in deep reinforcement learning, we design a feature extraction network containing memory modules to preserve the long-term dependence of input features. Comparing our method with traditional navigation methods and with end-to-end DRL-based navigation methods shows that, even when dynamic obstacles are numerous and rapidly moving, our proposed method is on average 20% better than the second-ranked method in navigation efficiency (navigation time and navigation path length), 34% better in safety (number of collisions), and 26.6% higher in success rate, while showing strong robustness.
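
    The landmark-selection step of this framework can be pictured with the toy sketch below; the even-spacing heuristic is an assumption (the paper selects high-value landmarks), and the sketch only serves to show how a dense global path is reduced to sub-goals handed one by one to the DRL local planner.

```python
# Reduce a dense global path to a handful of well-spaced landmark sub-goals.
import numpy as np

def pick_landmarks(global_path, spacing=2.0):
    """Keep waypoints that are at least `spacing` metres apart along the path."""
    landmarks = [global_path[0]]
    for p in global_path[1:]:
        if np.linalg.norm(np.asarray(p) - np.asarray(landmarks[-1])) >= spacing:
            landmarks.append(p)
    if not np.allclose(landmarks[-1], global_path[-1]):
        landmarks.append(global_path[-1])              # always include the goal
    return landmarks

# Dense path from a classical global planner (e.g., A*), in metres.
path = [(0.0, 0.0), (0.5, 0.2), (1.1, 0.6), (2.3, 1.0), (3.0, 1.8), (4.6, 2.2), (5.0, 3.0)]
for lm in pick_landmarks(path):
    print(lm)   # each landmark becomes the DRL local planner's next sub-goal
```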

  • Article Type: Journal Article
    With the development of ocean exploration technology, the exploration of the ocean has become a hot research field involving the use of autonomous underwater vehicles (AUVs). In complex underwater environments, reaching target points quickly, safely, and smoothly is key for AUVs conducting underwater exploration missions. Most path-planning algorithms combine deep reinforcement learning (DRL) and path-planning algorithms to achieve obstacle avoidance and path shortening. In this paper, we propose a method that addresses the local-minimum problem of the artificial potential field (APF) by constructing a traction force that pulls the AUV out of the local minimum. The improved artificial potential field (IAPF) method is combined with DRL for path planning, while the reward function in the DRL algorithm is optimized and the generated path is used to optimize future paths. By comparing our results with the experimental data of various algorithms, we found that the proposed method has clear advantages in path planning. It is an efficient and safe path-planning method with obvious potential for underwater navigation devices.
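
    A toy two-dimensional sketch of the idea described above (the gains and the traction rule are assumptions, not the paper's IAPF formulation): standard attractive and repulsive APF forces, plus an extra traction force applied when the net force nearly vanishes away from the goal, i.e., at a local minimum.

```python
# Toy APF with an added traction force for escaping local minima.
import numpy as np

def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=0.5, d0=2.0):
    f = k_att * (goal - pos)                               # attractive term
    for obs in obstacles:
        d = np.linalg.norm(pos - obs)
        if d < d0:                                          # repulsion only inside range d0
            f += k_rep * (1.0 / d - 1.0 / d0) / d**2 * (pos - obs) / d
    return f

def force_with_traction(pos, goal, obstacles, eps=1e-2, k_tr=1.0):
    f = apf_force(pos, goal, obstacles)
    stuck = np.linalg.norm(f) < eps and np.linalg.norm(goal - pos) > eps
    if stuck:
        # A traction force perpendicular to the goal direction pulls the
        # vehicle sideways out of the local minimum.
        to_goal = (goal - pos) / np.linalg.norm(goal - pos)
        f = f + k_tr * np.array([-to_goal[1], to_goal[0]])
    return f

pos, goal = np.array([0.0, 0.0]), np.array([10.0, 0.0])
obstacles = [np.array([5.0, 0.0])]                          # obstacle on the line to the goal
print(force_with_traction(pos, goal, obstacles))
```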

  • Article Type: Journal Article
    Natural disasters, including earthquakes, floods, landslides, tsunamis, wildfires, and hurricanes, have become more common in recent years due to rapid climate change. For Post-Disaster Management (PDM), authorities deploy various types of user equipment (UE) for search and rescue operations, for example, search and rescue robots, drones, medical robots, smartphones, etc., via the Internet of Robotic Things (IoRT) supported by cellular 4G/LTE/5G and beyond or other wireless technologies. For uninterrupted communication services, movable and deployable resource units (MDRUs) have been utilized where base stations are damaged by the disaster. In addition, optimizing the power of the network while satisfying the quality of service (QoS) of each UE is a crucial challenge because of the electricity crisis after the disaster. In order to optimize the energy efficiency, UE throughput, and serving cell (SC) throughput by considering stationary as well as movable UE, without prior knowledge of the environment, in MDRU-aided two-tier heterogeneous networks (HetNets) of IoRT, an optimization problem is formulated in this article based jointly on emitted power allocation and user association. This optimization problem is nonconvex and NP-hard, and a parameterized (discrete: user association; continuous: power allocation) action space is deployed. A new model-free hybrid-action-space algorithm called multi-pass deep Q network (MP-DQN) is developed to optimize this complex problem. Simulation results demonstrate that the proposed MP-DQN outperforms the parameterized deep Q network (P-DQN) approach, which is well known for solving parameterized action spaces, as well as DQN and traditional algorithms in terms of reward, average energy efficiency, UE throughput, and SC throughput for both stationary and movable UE.
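
    A compact sketch of the multi-pass trick in MP-DQN for a parameterized action space (layer sizes and the power/cell interpretation are placeholder assumptions): a parameter network proposes one continuous parameter per discrete action, and the Q-network is evaluated once per discrete action with only that action's parameter visible.

```python
# Multi-pass Q-value evaluation over a parameterized action space.
import torch
import torch.nn as nn

state_dim, n_discrete = 10, 3                # e.g., 3 candidate serving cells
param_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                          nn.Linear(64, n_discrete), nn.Sigmoid())   # one power level per cell
q_net = nn.Sequential(nn.Linear(state_dim + n_discrete, 64), nn.ReLU(),
                      nn.Linear(64, n_discrete))

def mp_dqn_q_values(state):
    params = param_net(state)                              # (B, n_discrete) continuous params
    q = []
    for k in range(n_discrete):
        masked = torch.zeros_like(params)
        masked[:, k] = params[:, k]                        # pass k: only action k's parameter
        q_k = q_net(torch.cat([state, masked], dim=-1))[:, k]
        q.append(q_k)
    return torch.stack(q, dim=-1), params                  # (B, n_discrete) Q-values

state = torch.randn(4, state_dim)
q_values, params = mp_dqn_q_values(state)
best = q_values.argmax(dim=-1)                             # chosen discrete action per sample
print(best, params.gather(1, best.unsqueeze(1)))           # and its continuous parameter
```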