Deep reinforcement learning

  • Article type: Journal Article
    Urban flooding is among the costliest natural disasters worldwide. Timely and effective rescue path planning is crucial for minimizing loss of life and property. However, current research on path planning often fails to adequately consider the need to assess area risk uncertainties and to bypass complex obstacles in flood rescue scenarios, presenting significant challenges for developing optimal rescue paths. This study proposes a deep reinforcement learning (RL) algorithm incorporating four main mechanisms to address these issues. Dual-priority experience replay and backtrack punishment mechanisms enhance the precise estimation of area risks. Concurrently, random noisy networks and dynamic exploration techniques encourage the agent to explore unknown areas of the environment, thereby improving the sampling and optimization strategies for bypassing complex obstacles. The study constructed multiple grid simulation scenarios based on real-world rescue operations in major urban flood disasters. These scenarios included uncertain risk values for all passable areas and an increased presence of complex elements, such as narrow passages, C-shaped barriers, and jagged paths, significantly raising the difficulty of path planning. Comparative analysis demonstrated that only the proposed algorithm could bypass all obstacles and plan the optimal rescue path across nine scenarios. This research advances the theory of urban flood rescue path planning by extending the scale of scenarios to unprecedented levels. It also develops RL mechanisms adaptable to a variety of extremely complex obstacles in path planning. Additionally, it provides methodological insights into using artificial intelligence to enhance real-world risk management.
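
    A minimal sketch of two of the mechanisms described above, assuming a grid environment whose states are (row, col) cells: a backtrack-punishment reward wrapper and a replay buffer that mixes two priorities. The penalty value, the priority weights, and all class and parameter names are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch: backtrack punishment + dual-priority replay.
import random
from collections import deque

class BacktrackPenalty:
    """Adds a negative reward whenever the agent re-enters an already visited cell."""
    def __init__(self, penalty=-0.5):           # penalty value is an assumption
        self.penalty = penalty
        self.visited = set()

    def reset(self):
        self.visited.clear()

    def shape(self, cell, reward):
        extra = self.penalty if cell in self.visited else 0.0
        self.visited.add(cell)
        return reward + extra

class DualPriorityReplay:
    """Samples transitions by mixing two priorities, e.g. TD error and risk-estimation error."""
    def __init__(self, capacity=10000, alpha=0.6, beta=0.4):
        self.buffer = deque(maxlen=capacity)
        self.alpha, self.beta = alpha, beta      # weights of the two priorities

    def push(self, transition, td_error, risk_error):
        priority = self.alpha * abs(td_error) + self.beta * abs(risk_error) + 1e-6
        self.buffer.append((priority, transition))

    def sample(self, batch_size):
        weights = [p for p, _ in self.buffer]
        idx = random.choices(range(len(self.buffer)), weights=weights, k=batch_size)
        return [self.buffer[i][1] for i in idx]
```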

  • Article type: Journal Article
    Traditional spacecraft attitude control often relies heavily on knowledge of the spacecraft's dimensions and mass. In active debris removal scenarios, these characteristics cannot be known beforehand because the debris can take any shape or mass. Additionally, it is not possible to measure the mass of the combined satellite-debris system in orbit. Therefore, it is crucial to develop an adaptive satellite attitude control that can extract information about the satellite system's mass from other measurements. The authors propose using deep reinforcement learning (DRL) algorithms with stacked observations to handle widely varying masses. The satellite is simulated in the Basilisk software, and the control performance is assessed using Monte Carlo simulations. The results demonstrate the benefits of DRL with stacked observations compared to a classical proportional-integral-derivative (PID) controller for spacecraft attitude control. The algorithm is able to adapt, especially in scenarios with changing physical properties.
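
    A minimal sketch of observation stacking, assuming a classic gym-style environment whose step() returns (obs, reward, done, info) and whose observations are flat NumPy vectors (e.g., attitude errors and angular rates). The stack depth and the wrapper interface are illustrative assumptions.

```python
# Hypothetical sketch: stack the last k observations into one policy input.
import numpy as np
from collections import deque

class StackedObservations:
    def __init__(self, env, k=4):                # k is an assumed stack depth
        self.env = env
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self):
        obs = self.env.reset()
        for _ in range(self.k):                  # seed the stack with the first observation
            self.frames.append(obs)
        return np.concatenate(self.frames)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.frames.append(obs)
        return np.concatenate(self.frames), reward, done, info
```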

  • Article type: Journal Article
    With the development of science, technology, and the economy, UAVs are being used more and more widely. However, existing UAV trajectory planning methods suffer from high cost and limited intelligence. In view of this, the grey wolf algorithm is used to achieve collaborative trajectory optimization for UAV swarms. However, the grey wolf optimization algorithm (GWO) is found to suffer from weak cooperation. In this study, a pheromone factor is introduced into the traditional GWO to improve it. To address the unstable performance of swarm intelligence optimization algorithms under dynamic threats, deep reinforcement learning is used to optimize the model. A UAV swarm trajectory planning model was then constructed based on the improved grey wolf algorithm. Experimental analysis showed that the optimal fitness value of the improved grey wolf algorithm was lower than the standard GWO's value of 0.43. Compared with other algorithms, its fitness value was significantly lower and its stability higher. In complex scenarios, the improved grey wolf algorithm produced a trajectory length of 70.51 km and a planning time of 5.92 s, clearly outperforming the other algorithms. The path planned by the proposed model was 58.476 km long, significantly shorter than those of the other three models, with a planning time of 5.33 s and 46 path extension points. All indicator values of the proposed UAV swarm trajectory planning model were smaller than those of the other three models. These results show that the model can achieve low-cost trajectory optimization, providing more reasonable technical support for UAV mission execution.
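
    A compact grey wolf optimizer (GWO) in NumPy for illustration, with a hypothetical "pheromone" weighting of the alpha/beta/delta leaders to strengthen cooperation; the abstract mentions a pheromone factor but does not give its formula, so the weighting, bounds, and test function here are assumptions.

```python
# Hypothetical sketch: GWO with a pheromone-weighted leader average.
import numpy as np

def gwo_minimize(f, dim, n_wolves=20, iters=100, bounds=(-10.0, 10.0), seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, (n_wolves, dim))
    for t in range(iters):
        fit = np.array([f(x) for x in X])
        order = np.argsort(fit)
        leaders = X[order[:3]]                        # alpha, beta, delta wolves
        # Assumed pheromone factor: better leaders receive larger weights.
        pher = 1.0 / (fit[order[:3]] - fit[order[0]] + 1.0)
        pher /= pher.sum()
        a = 2.0 - 2.0 * t / iters                     # linearly decreasing coefficient
        new_X = np.empty_like(X)
        for i, x in enumerate(X):
            candidates = []
            for leader in leaders:
                r1, r2 = rng.random(dim), rng.random(dim)
                A, C = 2 * a * r1 - a, 2 * r2
                candidates.append(leader - A * np.abs(C * leader - x))
            new_X[i] = np.clip(np.average(candidates, axis=0, weights=pher), lo, hi)
        X = new_X
    best = min(X, key=f)
    return best, f(best)

# Example: minimize a 5-dimensional sphere function.
best_x, best_val = gwo_minimize(lambda x: float(np.sum(x ** 2)), dim=5)
```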

  • Article type: Journal Article
    Robotic Mobile Fulfillment Systems (RMFSs) face challenges in handling large-scale orders and navigating complex environments, frequently encountering a series of intricate decision-making problems such as order allocation, shelf selection, and robot scheduling. To address these challenges, this paper integrates Deep Reinforcement Learning (DRL) technology into an RMFS to meet the needs of efficient order processing and system stability. The study focuses on three key stages of RMFSs: order allocation and sorting, shelf selection, and coordinated robot scheduling. For each stage, mathematical models are established and corresponding solutions are proposed. Unlike traditional methods, DRL technology is introduced to solve these problems, utilizing a Genetic Algorithm and Ant Colony Optimization to handle decision making related to large-scale orders. Through simulation experiments, performance indicators such as shelf access frequency and the total processing time of the RMFS are evaluated. The experimental results demonstrate that, compared to traditional methods, our algorithms excel in handling large-scale orders and are capable of completing approximately 110 tasks within an hour. Future research should focus on integrated decision-making modeling across the stages of RMFSs and on designing efficient heuristic algorithms for large-scale problems to further enhance system performance and efficiency.
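
    A minimal genetic-algorithm sketch of the order-allocation idea, grouping orders into batches so that batches share shelves and shelf visits are reduced. The chromosome encoding, fitness, and GA settings are illustrative assumptions; the paper's exact formulation is not given in the abstract.

```python
# Hypothetical sketch: GA that assigns orders to batches to minimize shelf visits.
import random

def shelf_visits(assignment, order_shelves, n_batches):
    # Count the distinct shelves each batch must visit; fewer is better.
    visits = 0
    for b in range(n_batches):
        shelves = set()
        for order, batch in enumerate(assignment):
            if batch == b:
                shelves |= order_shelves[order]
        visits += len(shelves)
    return visits

def ga_batch_orders(order_shelves, n_batches=3, pop=30, gens=200, seed=1):
    rng = random.Random(seed)
    n_orders = len(order_shelves)
    population = [[rng.randrange(n_batches) for _ in range(n_orders)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda a: shelf_visits(a, order_shelves, n_batches))
        survivors = population[: pop // 2]            # elitist selection
        children = []
        while len(survivors) + len(children) < pop:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_orders)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:                    # mutation: move one order
                child[rng.randrange(n_orders)] = rng.randrange(n_batches)
            children.append(child)
        population = survivors + children
    return min(population, key=lambda a: shelf_visits(a, order_shelves, n_batches))

# Example: six orders, each needing items from a small set of shelves.
orders = [{0, 1}, {1}, {2, 3}, {3}, {0, 4}, {4, 2}]
print(ga_batch_orders(orders))
```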

  • Article type: Journal Article
    Edge servers frequently manage their own offline digital twin (DT) services in addition to caching online digital twin services. However, current research often overlooks the impact of offline caching services on memory and computation resources, which can hinder the efficiency of online service task processing on edge servers. In this study, we concentrated on service caching and task offloading within a collaborative edge computing system, with an emphasis on the integrated quality of service (QoS) of both online and offline edge services. We considered the resource usage of both online and offline services, along with incoming online requests. To maximize the overall QoS utility, we established an optimization objective that rewards the throughput of online services while penalizing offline services that miss their soft deadlines. We formulated this as a utility maximization problem, which was proven to be NP-hard. To tackle this complexity, we reframed the optimization problem as a Markov decision process (MDP) and introduced a joint optimization algorithm for service caching and task offloading that leverages a deep Q-network (DQN). Comprehensive experiments revealed that our algorithm improved the utility by at least 14.01% compared with the baseline algorithms.
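
    A minimal sketch of the utility-style reward and a DQN head for discrete caching/offloading actions. The weighting terms, state dimension, and action count are illustrative assumptions; the abstract only states that the objective rewards online throughput and penalizes offline services that miss their soft deadlines.

```python
# Hypothetical sketch: QoS utility reward and a small DQN for discrete decisions.
import torch
import torch.nn as nn

def qos_utility(online_throughput, offline_misses, w_thr=1.0, w_miss=2.0):
    """Reward = weighted online throughput minus a penalty per missed soft deadline."""
    return w_thr * online_throughput - w_miss * offline_misses

class DQN(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),            # one Q-value per caching/offloading action
        )

    def forward(self, state):
        return self.net(state)

# Greedy action selection over assumed discrete (cache decision, offload target) pairs.
policy = DQN(state_dim=16, n_actions=8)
action = policy(torch.randn(1, 16)).argmax(dim=1).item()
```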

  • Article type: Journal Article
    Portfolio management (PM) is a popular financial process that concerns the occasional reallocation of a particular quantity of capital into a portfolio of assets, with the main aim of maximizing profitability at a given level of risk. Given the inherent dynamics of stock exchanges and the evolution of long-term performance, reinforcement learning (RL) has become a dominant solution for solving the portfolio management problem in an automated and efficient manner. Nevertheless, present RL-based PM methods only take into account the variations in the prices of portfolio assets and the implications of those variations, while overlooking the significant relationships among different assets in the market, which are extremely valuable for managerial decisions. To close this gap, this paper introduces a novel deep model that combines two subnetworks: one learns a temporal representation of historical prices using a refined temporal learner, while the other learns the relationships between different stocks in the market using a relation graph learner (RGL). These learners are then integrated into a curriculum RL scheme that formulates PM as a curriculum Markov decision process, in which an adaptive curriculum policy is presented to enable the agent to adaptively minimize risk and maximize cumulative return. Proof-of-concept experiments are performed on data from three public stock indices (namely the S&P 500, NYSE, and NASDAQ), and the results demonstrate the efficiency of the proposed framework in improving portfolio management performance over competing RL solutions.
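
    A minimal sketch of the two-branch idea: a temporal encoder over each asset's price history plus a relation-graph mixing step across assets. Using a GRU as the temporal learner, a normalized-adjacency linear layer as the graph step, and the chosen sizes are all illustrative assumptions rather than the paper's architecture.

```python
# Hypothetical sketch: temporal encoding per asset followed by graph-based mixing.
import torch
import torch.nn as nn

class TemporalRelationalEncoder(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.temporal = nn.GRU(n_features, hidden, batch_first=True)
        self.relational = nn.Linear(hidden, hidden)

    def forward(self, prices, adj):
        # prices: (n_assets, window, n_features); adj: (n_assets, n_assets)
        _, h = self.temporal(prices)                 # final hidden state per asset
        h = h.squeeze(0)
        adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-8)
        mixed = adj @ h                              # propagate features between related assets
        return torch.relu(self.relational(mixed))    # per-asset embedding for the RL policy

# Example: 10 assets, a 30-step window, 4 price features, fully connected relations.
enc = TemporalRelationalEncoder(n_features=4)
emb = enc(torch.randn(10, 30, 4), torch.ones(10, 10))
```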

  • Article type: Journal Article
    The reef ecosystem plays a vital role as a habitat for fish species with limited swimming capabilities, serving not only as a sanctuary and food source but also influencing their behavioral tendencies. Understanding the intricate mechanism through which fish adeptly navigate toward moving targets in reef environments with complex water flow, all while evading obstacles and maintaining stable postures, has remained a challenging and prominent subject in the realms of fish behavior, ecology, and biomimetics alike. An integrated simulation framework is used to investigate fish predation problems within intricate environments, combining deep reinforcement learning (DRL) algorithms with a high-precision fluid-structure interaction numerical method, the immersed boundary lattice Boltzmann method (IB-LBM). The Soft Actor-Critic (SAC) algorithm is used to improve the intelligent fish's capacity for random exploration, tackling the multi-objective sparse reward challenge inherent in real-world scenarios. Additionally, a reward shaping method tailored to the fish's action purposes has been developed, capable of capturing outcome and trend characteristics effectively. The convergence and robustness advantages of the method elucidated in this paper are showcased through two case studies: one addressing fish capturing randomly moving targets in a hydrostatic flow field, and the other focusing on fish foraging against the current in reef environments to capture drifting food. A comprehensive analysis is conducted of the influence and significance of various reward types on the decision-making processes of intelligent fish within intricate environments.
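
    A minimal sketch of outcome-plus-trend reward shaping for target pursuit, assuming the state exposes the fish position, the target position, and a body-angle measure. The capture bonus, trend weight, and posture penalty are illustrative assumptions; the abstract only states that the shaping captures outcome and trend characteristics.

```python
# Hypothetical sketch: sparse outcome term plus dense trend and posture terms.
import numpy as np

def shaped_reward(fish_pos, target_pos, prev_dist, body_angle, captured,
                  w_trend=1.0, w_posture=0.1, capture_bonus=10.0):
    dist = float(np.linalg.norm(np.asarray(target_pos) - np.asarray(fish_pos)))
    outcome = capture_bonus if captured else 0.0     # sparse outcome term
    trend = w_trend * (prev_dist - dist)             # dense trend term: closing the distance
    posture = -w_posture * abs(body_angle)           # mild penalty for unstable posture
    return outcome + trend + posture, dist

reward, new_dist = shaped_reward([0.0, 0.0], [1.0, 1.0], prev_dist=1.6,
                                 body_angle=0.05, captured=False)
```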

  • Article type: Journal Article
    As autonomous driving may be the most important application scenario of the next generation, the development of wireless access technologies enabling reliable and low-latency vehicle communication becomes crucial. To address this, 3GPP has developed Vehicle-to-Everything (V2X) specifications based on 5G New Radio (NR) technology, where Mode 2 Sidelink (SL) communication resembles Mode 4 in LTE-V2X, allowing direct communication between vehicles. This supplements SL communication in LTE-V2X and represents the latest advancement in cellular V2X (C-V2X), with improved performance in NR-V2X. However, in NR-V2X Mode 2, resource collisions still occur and thus degrade the age of information (AoI). Therefore, an interference cancellation method is employed to mitigate this impact by combining NR-V2X with non-orthogonal multiple access (NOMA) technology. In NR-V2X, when vehicles select smaller resource reservation intervals (RRIs), transmissions occur more frequently and use more energy to reduce the AoI. Hence, it is important to jointly consider the AoI and communication energy consumption in NR-V2X communication. We therefore formulate such an optimization problem and employ a deep reinforcement learning (DRL) algorithm to compute the optimal transmission RRI and transmission power for each transmitting vehicle, reducing the energy consumption of each transmitting vehicle and the AoI of each receiving vehicle. Extensive simulations demonstrate the performance of our proposed algorithm.
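
    A minimal sketch of the joint energy/AoI objective over a discrete action space of (RRI, transmit power) pairs. The candidate RRIs, power levels, and weights are illustrative assumptions; the abstract does not list the actual values.

```python
# Hypothetical sketch: discrete (RRI, power) actions and a weighted energy/AoI reward.
import itertools

RRIS_MS = [20, 50, 100]            # assumed candidate resource reservation intervals
POWERS_DBM = [10, 17, 23]          # assumed candidate transmit power levels
ACTIONS = list(itertools.product(RRIS_MS, POWERS_DBM))   # one discrete action per pair

def reward(energy_j, aoi_ms, w_energy=0.5, w_aoi=0.5):
    """Negative weighted sum: the agent trades transmit energy against information age."""
    return -(w_energy * energy_j + w_aoi * aoi_ms)

# Example: pick an action and score one transmission slot.
rri, power = ACTIONS[4]            # (50 ms RRI, 17 dBm)
print(rri, power, reward(energy_j=0.8, aoi_ms=35.0))
```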

  • Article type: Journal Article
    This study investigates the dynamic deployment of unmanned aerial vehicles (UAVs) using edge computing in a forest fire scenario. We consider the dynamically changing characteristics of forest fires and the corresponding varying resource requirements. Based on this, the paper models a two-timescale UAV dynamic deployment scheme that accounts for dynamic changes in both the number and positions of UAVs. On the slow timescale, we use a gated recurrent unit (GRU) to predict the number of future users and determine the number of UAVs based on the resource requirements; UAVs with low energy are replaced accordingly. On the fast timescale, a deep-reinforcement-learning-based UAV position deployment algorithm is designed to enable low-latency processing of computational tasks by adjusting the UAV positions in real time to meet the ground devices' computational demands. The simulation results demonstrate that the proposed scheme achieves good prediction accuracy, and that the number and positions of UAVs can adapt to changes in resource demand while reducing task execution delays.
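
    A minimal sketch of the slow-timescale step, assuming a GRU that predicts the next user count from a short history window and a simple capacity rule that sizes the UAV fleet. The window length, per-UAV capacity, and network sizes are illustrative assumptions not taken from the paper.

```python
# Hypothetical sketch: GRU user-count prediction feeding a UAV fleet-size rule.
import math
import torch
import torch.nn as nn

class UserCountPredictor(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, history):                  # history: (batch, window, 1)
        _, h = self.gru(history)
        return self.head(h.squeeze(0))           # predicted user count, shape (batch, 1)

def uavs_needed(predicted_users, users_per_uav=20):
    return max(1, math.ceil(predicted_users / users_per_uav))

model = UserCountPredictor()
history = torch.tensor([[[40.0], [55.0], [63.0], [72.0]]])   # last four observations
n_uavs = uavs_needed(float(model(history).item()))
```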

  • Article type: Journal Article
    The segmented mirror co-phase error identification technique based on supervised learning methods has the advantages of simple application conditions, no dependence on custom sensors, a fast calculation speed, and low computing power requirements compared with other methods. However, it is often difficult for this method to achieve high accuracy in practical applications because of the difference between the training model and the actual model. A reinforcement learning algorithm does not need to model the real system it operates on, yet it retains the advantages of supervised learning. Thus, in this paper, we placed a mask on the pupil plane of the segmented telescope optical system. Moreover, based on the wide spectrum, the point spread function, and the modulation transfer function of the optical system, together with deep reinforcement learning and without modeling the optical system, we proposed a large-range, high-precision automatic piston-error co-phasing method with multiple-submirror parallelization. Finally, we carried out relevant simulation experiments, and the results indicate that the method is effective.
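
    A minimal sketch of the multiple-submirror parallel idea, assuming one shared policy that maps per-submirror features derived from the PSF/MTF to a bounded piston correction applied to every submirror in parallel. The feature size, network size, and correction range are illustrative assumptions; no optical model is included.

```python
# Hypothetical sketch: one shared policy applied to every submirror in parallel.
import torch
import torch.nn as nn

class PistonPolicy(nn.Module):
    def __init__(self, n_features=8, max_correction_nm=500.0):
        super().__init__()
        self.max_correction_nm = max_correction_nm
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh(),          # bounded output in [-1, 1]
        )

    def forward(self, features):
        # features: (n_submirrors, n_features) -> piston correction per submirror (nm)
        return self.net(features).squeeze(-1) * self.max_correction_nm

policy = PistonPolicy()
features = torch.randn(6, 8)                      # six submirrors, eight assumed features each
corrections_nm = policy(features)
```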