Deep deterministic policy gradient (DDPG)

  • Article Type: Journal Article
    Deep Reinforcement Learning (DRL) has gained significant adoption in diverse fields and applications, mainly due to its proficiency in resolving complicated decision-making problems in spaces with high-dimensional states and actions. Deep Deterministic Policy Gradient (DDPG) is a well-known DRL algorithm that adopts an actor-critic approach, synthesizing the advantages of value-based and policy-based reinforcement learning methods. The aim of this study is to provide a thorough examination of the latest developments, patterns, obstacles, and potential opportunities related to DDPG. A systematic search was conducted using relevant academic databases (Scopus, Web of Science, and ScienceDirect) to identify 85 relevant studies published in the last five years (2018-2023). We provide a comprehensive overview of the key concepts and components of DDPG, including its formulation, implementation, and training. Then, we highlight the various applications and domains of DDPG, including Autonomous Driving, Unmanned Aerial Vehicles, Resource Allocation, Communications and the Internet of Things, Robotics, and Finance. Additionally, we provide an in-depth comparison of DDPG with other DRL algorithms and traditional RL methods, highlighting its strengths and weaknesses. We believe that this review will be an essential resource for researchers, offering them valuable insights into the methods and techniques utilized in the field of DRL and DDPG.
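    To ground the formulation and training components summarized above, here is a minimal sketch of one DDPG update step in PyTorch: the critic is regressed toward a bootstrapped target computed with slow-moving target networks, the actor ascends the critic's value, and both targets are softly updated. Dimensions, network sizes, and hyperparameters are illustrative assumptions, not values from the surveyed studies:

```python
import copy
import torch
import torch.nn as nn

state_dim, action_dim, gamma, tau = 8, 2, 0.99, 0.005  # illustrative values

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_tgt, critic_tgt = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    """One update on a replay minibatch; r and done have shape [batch, 1]."""
    # Critic: regress Q(s, a) toward the bootstrapped target from the
    # target networks.
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_tgt(torch.cat([s2, actor_tgt(s2)], 1))
    critic_loss = nn.functional.mse_loss(critic(torch.cat([s, a], 1)), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: deterministic policy gradient -- ascend Q(s, mu(s)).
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) update of both target networks.
    for net, tgt in ((actor, actor_tgt), (critic, critic_tgt)):
        for p, pt in zip(net.parameters(), tgt.parameters()):
            pt.data.mul_(1 - tau).add_(tau * p.data)
```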

  • Article Type: Journal Article
    The Internet era is an era of information explosion. By 2022, global Internet users had exceeded 4 billion, and social media users had exceeded 3 billion. People face a flood of news content every day, and it is almost impossible to find interesting information by browsing all of it. Against this background, personalized news recommendation technology has been widely applied, but it still needs further optimization and improvement. To better push news content of interest to different readers, users' satisfaction with major news websites should be further improved. This study proposes a new recommendation algorithm based on deep learning and reinforcement learning. First, an RL algorithm is introduced on top of deep learning. Deep learning excels at processing large-scale data and complex pattern recognition, but it often suffers from low sample efficiency on complex decision-making and sequential tasks. Reinforcement learning (RL), by contrast, emphasizes learning optimized strategies through continuous trial and error while interacting with the environment. Compared with deep learning, RL is better suited to scenarios that require long-term decision-making and trial-and-error learning: by feeding back the reward signal of each action, the system can better adapt to unknown environments and complex tasks, compensating for deep learning's relative shortcomings in these respects. Scenarios are mapped to actions to solve the sequential decision problem in the news dissemination process. To enable the news recommendation system to account for dynamic changes in users' interest in news content, the Deep Deterministic Policy Gradient algorithm is applied to the news recommendation scenario. Opposition-based learning complements and combines the Deep Q-network with the policy network. On this basis, the paper puts forward an intelligent news dissemination and push model, and proposes a push process for news communication information based on edge computing technology. Finally, based on the Area Under Curve (AUC), a Q-Learning AUC metric for RL models is proposed; this indicator can efficiently measure the strengths and weaknesses of RL models and facilitates comparing models and evaluating offline experiments. The results show that the DDPG algorithm improves the click-through rate by 2.586% compared with a conventional recommendation algorithm, indicating that the algorithm designed in this paper has a clear advantage in making accurate recommendations to users. By optimizing the push mode of intelligent news dissemination, this paper effectively improves the efficiency of news dissemination. In addition, the paper studies the innovative application of intelligent edge technology in news communication, bringing new ideas and practices to the development of news communication methods. Optimizing the push mode of intelligent news dissemination not only improves the user experience but also provides strong support for applying intelligent edge technology in this field, giving it important practical application prospects.
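    The abstract does not spell out the "Q-Learning AUC" metric; one hedged reading is an AUC-style offline score that checks whether the learned Q-values rank clicked news impressions above unclicked ones. The sketch below illustrates that reading only; the interface and the use of scikit-learn are assumptions:

```python
from sklearn.metrics import roc_auc_score

def q_auc(q_scores, clicks):
    """q_scores: Q(s, a) per logged impression; clicks: 0/1 click labels."""
    return roc_auc_score(clicks, q_scores)

# Example: Q-values that rank the clicked item highest give an AUC of 1.0.
print(q_auc([0.9, 0.2, 0.4], [1, 0, 0]))
```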

  • Article Type: Journal Article
    For the dual peg-in-hole compliant assembly task of micro-devices with upper and lower double-hole structures, a skill-learning method is proposed. The method combines offline training in a simulation space with online training in a real space. In this paper, a dual peg-in-hole model is built from the results of a force analysis, and contact-point searching methods are provided for calculating the contact force. A skill-learning framework is then built based on deep reinforcement learning. Both expert actions and incremental actions are used in training, and the reward system considers both efficiency and safety; additionally, a dynamic exploration method is provided to improve training efficiency. Based on experimental data, an online training method is also used to continuously optimize the skill-learning model, reducing the error caused by the deviation of the offline training data from reality. The final experiments demonstrate that the method effectively reduces the contact force during assembly, improves efficiency, and reduces the impact of changes in position and orientation.
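    The abstract says the reward system weighs efficiency against safety; below is a minimal sketch of such a trade-off. The weights, force limit, and completion bonus are illustrative assumptions, not the paper's values:

```python
def reward(step_time, contact_force, inserted,
           w_time=0.1, w_force=0.01, force_limit=5.0):
    """Trade off efficiency (fast insertion) against safety (low contact force)."""
    r = -w_time * step_time - w_force * contact_force
    if contact_force > force_limit:  # safety: penalize excessive contact force
        r -= 1.0
    if inserted:                     # efficiency: bonus for completing the task
        r += 10.0
    return r
```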

  • Article Type: Journal Article
    In vehicular edge computing (VEC), some tasks can be processed either locally or on a mobile edge computing (MEC) server at a base station (BS) or a nearby vehicle. Whether a task is offloaded depends on the status of vehicle-to-infrastructure (V2I) and vehicle-to-vehicle (V2V) communication. In this paper, device-to-device (D2D)-based V2V communication and multiple-input multiple-output and non-orthogonal multiple access (MIMO-NOMA)-based V2I communication are considered. In practical communication scenarios, the channel conditions for MIMO-NOMA-based V2I communication are uncertain and task arrivals are random, leading to a highly complex environment for VEC systems. To solve this problem, we propose a power allocation scheme based on decentralized deep reinforcement learning (DRL). Since the action space is continuous, we employ the deep deterministic policy gradient (DDPG) algorithm to obtain the optimal policy. Extensive experiments demonstrate that our proposed DRL-based approach with DDPG outperforms existing greedy strategies in terms of power consumption and reward.
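    The abstract does not detail the per-vehicle state; as a hedged illustration, a decentralized DDPG agent could act on a local observation combining V2I/V2V channel gains with the random task backlog. The fields and normalization below are assumptions:

```python
import numpy as np

def make_state(v2i_gain, v2v_gain, task_queue_bits, max_bits=1e6):
    """Normalized local observation for one vehicle's DDPG agent."""
    return np.array([v2i_gain, v2v_gain, task_queue_bits / max_bits],
                    dtype=np.float32)
```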

  • Article Type: Journal Article
    We investigate a power control problem for overlay device-to-device (D2D) communication networks relying on the deep deterministic policy gradient (DDPG), a model-free off-policy algorithm for learning continuous actions such as transmit power levels. We propose a DDPG-based self-regulating power control scheme whereby each D2D transmitter can autonomously determine its transmission power level using only the local channel gains that can be measured from sounding symbols transmitted by D2D receivers. The performance of the proposed scheme is analyzed in terms of average sum-rate and energy efficiency and compared with several conventional schemes. Our numerical results show that the proposed scheme increases the average sum-rate relative to the conventional schemes, even under the severe interference caused by a growing number of D2D pairs or high transmission power, and that it achieves the highest energy efficiency.
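    To make the continuous-action aspect concrete: a DDPG actor with a bounded (tanh) output can be affinely rescaled to a transmit power in [0, P_MAX]. A minimal sketch follows, assuming an actor network with output in [-1, 1]; P_MAX and the exploration-noise level are illustrative assumptions:

```python
import torch

P_MAX = 0.2  # assumed maximum transmit power in watts (illustrative)

def select_power(actor, local_gains, noise_std=0.1):
    """Map an actor output in [-1, 1] to a power level in [0, P_MAX],
    adding Gaussian exploration noise during training."""
    state = torch.as_tensor(local_gains, dtype=torch.float32)
    a = (actor(state) + noise_std * torch.randn(1)).clamp(-1.0, 1.0)
    return float((a + 1.0) / 2.0 * P_MAX)  # affine rescaling to [0, P_MAX]
```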

  • Article Type: Journal Article
    Renewable Energy Resources (RERs) are widely used out of concern for global environmental protection. Solar energy systems play an important role in the generation of electrical energy and remarkably reduce the use of non-renewable fuel sources. Solar energy can be extracted and converted into electrical energy via the solar photovoltaic process. Several traditional, soft-computing, heuristic, and meta-heuristic maximum power point tracking (MPPT) techniques have been developed to extract the Maximum Energy Point (MEP) from solar photovoltaic modules under different atmospheric conditions. In this manuscript, a combination of a reinforcement learning algorithm (RLA) and a deep learning algorithm (DLA), called deep Reinforcement Learning Algorithm based MPPT (DRLAMPPT), is proposed for partial shading conditions (PSC) of the solar system. DRLAMPPT can deal with continuous state spaces, in contrast to RL, which can operate only on discrete state and action spaces. In the proposed DRLAMPPT, the deep deterministic policy gradient (DDPG) solves the continuous state-space problem involved in reaching the Global MEP (GMEP) in photovoltaic systems, especially under PSC. In DRLAMPPT, the agent's policy is parameterized by an artificial neural network (ANN), which uses sensory information as input and directly outputs control signals. This work develops a 2 kW solar photovoltaic power plant comprising a photovoltaic array, a DC/DC step-up converter, and a 3-Φ Pulse Width Modulated Voltage Source Inverter (PWM-VSI) integrated with the conventional power grid using a Constant Current Controller (CCC). The effectiveness of the proposed DRLAMPPT with CCC is validated through an experimental setup and MATLAB simulations under different input conditions of solar irradiance. Experimental results show that, in comparison to existing MPPTs, the proposed DRLAMPPT not only attains the best efficiency but also adapts to changes in the environmental conditions of the photovoltaic system at a much faster rate, reaching the GMEP within 0.8 s under PSC. Experimental and simulation results also show that the proposed CCC with an LC filter keeps the inverter output voltage and the grid voltage in phase, with low THD values of 1.1% and 0.98%, respectively.
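    A minimal sketch of the ANN-parameterized policy described above, mapping sensory input to a control signal; the choice of inputs (PV voltage and current) and the layer sizes are assumptions for illustration:

```python
import torch.nn as nn

# Input: assumed sensory measurements, e.g. PV voltage and current.
policy = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Tanh(),  # bounded converter control signal
)
```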

  • Article Type: Journal Article
    A mobile edge computing (MEC)-enabled blockchain system is proposed in this study for secure data storage and sharing in Internet of Things (IoT) networks, with the MEC acting as an overlay system that provides dynamic computation offloading services. Considering latency-critical, resource-limited, and dynamic IoT scenarios, an adaptive system resource allocation and computation offloading scheme is designed to optimize the scalability of MEC-enabled blockchain systems, where scalability is quantified as MEC computational efficiency and blockchain system throughput. Specifically, we jointly optimize the computation offloading policy and block generation strategy to maximize the scalability of MEC-enabled blockchain systems while guaranteeing data security and system efficiency. In contrast to existing works that ignore frequent user movement and dynamic task requirements in IoT networks, the joint performance optimization is formulated as a Markov decision process (MDP). Furthermore, we design a deep deterministic policy gradient (DDPG)-based algorithm to solve the MDP problem and define a variable number of consecutive time slots as a decision epoch for model training. DDPG can solve an MDP problem with a continuous action space and requires only a straightforward actor-critic architecture, making it suitable for tackling the dynamics and complexity of the MEC-enabled blockchain system. As demonstrated by simulations, the proposed scheme achieves performance improvements over the deep Q network (DQN)-based scheme and several greedy schemes in terms of long-term transactional throughput.
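    A hedged sketch of the decision-epoch idea: transitions from a variable number of consecutive time slots are collected into a replay buffer for DDPG training. The environment interface (`reset`/`step`) and the buffer size are assumptions:

```python
from collections import deque

replay = deque(maxlen=100_000)        # experience replay buffer

def run_epoch(env, actor, slots_in_epoch):
    """Collect one decision epoch of consecutive time-slot transitions."""
    s = env.reset()
    for _ in range(slots_in_epoch):   # the number of slots can vary per epoch
        a = actor(s)                  # joint offloading + block-generation action
        s2, r, done = env.step(a)
        replay.append((s, a, r, s2, done))
        s = s2
        if done:
            break
```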

  • Article Type: Journal Article
    The demand for bandwidth-intensive and delay-sensitive services is surging daily with the development of 5G technology, resulting in fierce competition for scarce radio resources. Power-domain Non-orthogonal Multiple Access (NOMA) technologies can dramatically improve system capacity and spectrum efficiency. Unlike existing NOMA scheduling work that mainly focuses on fairness, this paper proposes a power control solution for uplink hybrid OMA and PD-NOMA in a doubly dynamic environment: dynamic and imperfect channel information combined with random, user-specific hierarchical quality of service (QoS). The paper models the power control problem as a nonconvex stochastic optimization problem that aims to maximize system energy efficiency while guaranteeing hierarchical user QoS requirements. The problem is then formulated as a partially observable Markov decision process (POMDP). Owing to the difficulty of modeling time-varying scenarios, the need for fast convergence, the required adaptability in a dynamic environment, and the continuity of the variables, a Deep Reinforcement Learning (DRL)-based method is proposed. The paper also transforms the hierarchical QoS constraint under the NOMA successive interference cancellation (SIC) setting to fit DRL. Simulation results verify the effectiveness and robustness of the proposed algorithm in the doubly uncertain environment; compared with the baseline Particle Swarm Optimization (PSO) algorithm, the proposed DRL-based method demonstrates satisfactory performance.
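    For concreteness, the energy-efficiency objective is the system sum-rate divided by the total transmit power; under uplink NOMA with SIC, each user is decoded against the residual interference of the not-yet-decoded users. A sketch under assumed variable names, decoding order, and noise level:

```python
import math

def energy_efficiency(gains, powers, noise=1e-12):
    """Uplink NOMA sum-rate (bits/s/Hz) divided by total transmit power (W)."""
    order = sorted(range(len(gains)),
                   key=lambda i: gains[i] * powers[i], reverse=True)
    residual = sum(gains[i] * powers[i] for i in order)
    sum_rate = 0.0
    for i in order:                   # SIC: decode the strongest signal first
        sig = gains[i] * powers[i]
        residual -= sig               # not-yet-decoded users act as interference
        sum_rate += math.log2(1 + sig / (residual + noise))
    return sum_rate / sum(powers)
```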

  • Article Type: Journal Article
    On the issue of global environmental protection, renewable energy systems have been widely considered. The photovoltaic (PV) system converts solar power into electricity and significantly reduces the consumption of fossil fuels and the resulting environmental pollution. Besides introducing new materials for solar cells to improve energy conversion efficiency, maximum power point tracking (MPPT) algorithms have been developed to ensure the efficient operation of PV systems at the maximum power point (MPP) under various weather conditions. The integration of reinforcement learning and deep learning, called deep reinforcement learning (DRL), is proposed in this paper as a promising tool for optimization control problems. Following the success of DRL in several fields, the deep Q network (DQN) and deep deterministic policy gradient (DDPG) are proposed to harvest the MPP in PV systems, especially under a partial shading condition (PSC). Unlike reinforcement learning (RL)-based methods, which operate only on discrete state and action spaces, the methods adopted in this paper deal with continuous state spaces: DQN solves the problem with a discrete action space, while DDPG handles a continuous action space. The proposed methods are simulated in MATLAB/Simulink for feasibility analysis. Further tests under various input conditions, with comparisons to the classical Perturb and Observe (P&O) MPPT method, are carried out for validation. Based on the simulation results of this study, the performance of the proposed methods is outstanding and efficient, showing their potential for further applications.
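    To make the contrast concrete: with DQN the agent picks one of a few discrete duty-cycle increments, whereas the DDPG actor outputs the converter duty cycle as a continuous value. A minimal sketch, with step sizes and bounds as illustrative assumptions:

```python
import torch

DUTY_STEPS = [-0.05, -0.01, 0.0, 0.01, 0.05]  # assumed discrete increments

def dqn_action(q_net, state, duty):
    """DQN: pick the duty-cycle increment with the highest Q-value."""
    q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
    duty += DUTY_STEPS[int(q_values.argmax())]
    return min(max(duty, 0.0), 1.0)

def ddpg_action(actor, state):
    """DDPG: the actor outputs the duty cycle directly on [0, 1]."""
    a = actor(torch.as_tensor(state, dtype=torch.float32))
    return float((a.clamp(-1.0, 1.0) + 1.0) / 2.0)  # rescale tanh output
```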
