Reinforcement Learning

  • Article Type: Journal Article
    In the contemporary digitalization landscape and technological advancement, the auction industry undergoes a metamorphosis, assuming a pivotal role as a transactional paradigm. Functioning as a mechanism for pricing commodities or services, the procedural intricacies and efficiency of auctions directly influence market dynamics and participant engagement. Harnessing the advancing capabilities of artificial intelligence (AI) technology, the auction sector proactively integrates AI methodologies to augment efficacy and enrich user interactions. This study delves into the intricacies of the price prediction challenge within the auction domain, introducing a sophisticated RL-GRU framework for price interval analysis. The framework commences by adeptly conducting quantitative feature extraction of commodities through GRU, subsequently orchestrating dynamic interactions within the model's environment via reinforcement learning techniques. Ultimately, it accomplishes the task of interval division and recognition of auction commodity prices through a discerning classification module. Demonstrating precision exceeding 90% across publicly available and internally curated datasets within five intervals and exhibiting superior performance within eight intervals, this framework contributes valuable technical insights for future endeavours in auction price interval prediction challenges.
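
    To make the described architecture concrete, the sketch below pairs a GRU encoder over a commodity's feature sequence with a price-interval classification head and a REINFORCE-style update in which the predicted interval acts as the RL action. It is a minimal illustration under our own assumptions (the class name PriceIntervalNet, five intervals, toy data, and the specific reward), not the authors' RL-GRU implementation.

```python
import torch
import torch.nn as nn

class PriceIntervalNet(nn.Module):
    """GRU feature extractor followed by a price-interval classification head."""
    def __init__(self, n_features: int, hidden: int = 64, n_intervals: int = 5):
        super().__init__()
        self.encoder = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_intervals)        # interval logits

    def forward(self, x):                                  # x: (batch, time, n_features)
        _, h = self.encoder(x)                             # h: (1, batch, hidden)
        return self.head(h.squeeze(0))                     # (batch, n_intervals)

model = PriceIntervalNet(n_features=8)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(32, 10, 8)                                 # toy batch of commodity feature sequences
true_interval = torch.randint(0, 5, (32,))                 # toy interval labels

logits = model(x)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                                     # predicted interval = RL action
reward = (action == true_interval).float()                 # +1 when the interval is correct
loss = -(dist.log_prob(action) * reward).mean()            # REINFORCE-style update
optim.zero_grad(); loss.backward(); optim.step()
```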

  • Article Type: Journal Article
    Repetitive negative thinking (RNT) is a transdiagnostic construct that encompasses rumination and worry, yet what precisely is shared between rumination and worry is unclear. To clarify this, we develop a meta-control account of RNT. Meta-control refers to the reinforcement and control of mental behavior via computations similar to those that reinforce and control motor behavior. We propose rumination and worry are coarse terms for failure in meta-control, just as tripping and falling are coarse terms for failure in motor control. We delineate four meta-control stages and risk factors increasing the chance of failure at each, including open-ended thoughts (stage 1), individual differences influencing subgoal execution (stage 2) and switching (stage 3), and challenges inherent to learning adaptive mental behavior (stage 4). Distinguishing these stages therefore elucidates diverse processes that lead to the same behavior of excessive RNT. Our account also subsumes prominent clinical accounts of RNT into a computational cognitive neuroscience framework.

  • Article Type: Journal Article
    Hypertension is a major risk factor for many serious diseases. With the aging population and lifestyle changes, the incidence of hypertension continues to rise, imposing a significant medical cost burden on patients and severely affecting their quality of life. Early intervention can greatly reduce the prevalence of hypertension. Research on hypertension early warning models based on electronic health records (EHRs) is an important and effective approach to achieving early hypertension warning. However, limited by the scarcity and imbalance of multivisit records and the nonstationary characteristics of hypertension features, it is difficult to effectively predict a patient's probability of developing hypertension. Therefore, this study proposes an online hypertension monitoring model (HRP-OG) based on reinforcement learning and generative feature replay. It transforms hypertension prediction into a sequential decision problem, achieving risk prediction of hypertension for patients using multivisit records. Sensors embedded in medical devices and wearables continuously capture real-time physiological data such as blood pressure, heart rate, and activity levels, which are integrated into the EHR. The fit between the samples generated by the generator and the real visit data is evaluated using maximum likelihood estimation, which reduces the adversarial discrepancy between the hypertension feature space and incoming incremental data, and the model is updated online based on real-time data using generative feature replay. The incorporation of sensor data ensures that the model adapts dynamically to changes in the condition of patients, facilitating timely interventions. In this study, the publicly available MIMIC-III data are used for validation, and the experimental results demonstrate that, compared to existing advanced methods, HRP-OG can effectively improve the accuracy of hypertension risk prediction for few-shot multivisit records in nonstationary environments.
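
    As a rough illustration of the generative feature replay idea described above, the sketch below fits a diagonal Gaussian generator to past visit features by maximum likelihood and mixes replayed samples with an incoming batch before an online update. The generator choice, feature dimensions, and data are illustrative assumptions; HRP-OG's actual generator and RL formulation are more involved.

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianReplay:
    """Fits a diagonal Gaussian to previously seen features and replays synthetic samples."""
    def fit(self, feats):
        self.mu, self.sigma = feats.mean(0), feats.std(0) + 1e-6   # maximum-likelihood estimates
        return self
    def sample(self, n):
        return rng.normal(self.mu, self.sigma, size=(n, self.mu.size))

past_visits = rng.normal(0.0, 1.0, size=(200, 6))           # toy historical visit features
generator = GaussianReplay().fit(past_visits)

incoming = rng.normal(0.5, 1.0, size=(20, 6))               # toy real-time sensor/EHR batch
replayed = generator.sample(20)                              # synthetic "old" features
update_batch = np.vstack([incoming, replayed])               # update on both to limit forgetting
print(update_batch.shape)                                    # (40, 6)
```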

  • Article Type: Journal Article
    We consider a complex control problem: making a monopod accurately reach a target with a single jump. The monopod can jump in any direction at different elevations of the terrain. This is a paradigm for a much larger class of problems, which are extremely challenging and computationally expensive to solve using standard optimization-based techniques. Reinforcement learning (RL) is an interesting alternative, but an end-to-end approach in which the controller must learn everything from scratch can be non-trivial for a sparse-reward task like jumping. Our solution is to guide the learning process within an RL framework leveraging nature-inspired heuristic knowledge. This expedient brings wide-ranging benefits, such as a drastic reduction of learning time, and the ability to learn and compensate for possible errors in the low-level execution of the motion. Our simulation results reveal a clear advantage of our solution against both optimization-based and end-to-end RL approaches.
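
    The sketch below illustrates the general idea of guiding learning with a nature-inspired heuristic: a projectile-motion formula proposes a take-off speed for the desired jump length, and a learned correction on top of it compensates for an unmodeled execution loss (assumed here to be 10% of take-off speed). The perturbation-based update is a crude stand-in for the RL algorithm; none of the quantities come from the paper.

```python
import numpy as np

g = 9.81
rng = np.random.default_rng(1)

def heuristic_speed(distance, angle=np.pi / 4):
    return np.sqrt(distance * g / np.sin(2 * angle))        # ballistic guess: d = v^2 sin(2a) / g

def landing_distance(speed, angle=np.pi / 4):
    return (0.9 * speed) ** 2 * np.sin(2 * angle) / g        # 10% unmodeled take-off loss

target, correction, lr, sigma = 1.5, 0.0, 0.05, 0.1
for _ in range(300):
    noise = rng.normal(0.0, sigma)
    base = -abs(landing_distance(heuristic_speed(target) + correction) - target)
    trial = -abs(landing_distance(heuristic_speed(target) + correction + noise) - target)
    correction += lr * (trial - base) * noise / sigma**2     # perturbation-based policy gradient
print(round(correction, 2), round(landing_distance(heuristic_speed(target) + correction), 2))
```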

  • Article Type: Journal Article
    The ability to make informed decisions in complex scenarios is crucial for intelligent automotive systems. Traditional expert rules and other methods often fall short in complex contexts. Recently, reinforcement learning has garnered significant attention due to its superior decision-making capabilities. However, inaccurate target-network value estimation limits its decision-making ability in complex scenarios. This paper focuses on the underestimation phenomenon and proposes an end-to-end autonomous driving decision-making method based on an improved TD3 algorithm. The method employs a forward-facing camera to capture data. By introducing a new critic network to form a triple-critic structure and combining it with a target maximization operation, the underestimation problem in the TD3 algorithm is addressed. Subsequently, a multi-timestep averaging method is used to address the policy instability caused by the new single critic. In addition, this paper uses the Carla platform to construct multi-vehicle unprotected left-turn and congested lane-center driving scenarios for validating the algorithm. The results demonstrate that our method surpasses the baseline DDPG and TD3 algorithms in aspects such as convergence speed, estimation accuracy, and policy stability.
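
    One plausible way to read the triple-critic idea is sketched below: the target value blends the pessimistic minimum of two critics (as in standard TD3) with an optimistic maximum over all three. The blending rule, the weight kappa, and the function name are our illustrative assumptions; the paper's exact target maximization operation and multi-timestep averaging are not reproduced here.

```python
import torch

def triple_critic_target(reward, not_done, q1, q2, q3, gamma=0.99, kappa=0.5):
    """All arguments are (batch,) tensors; q1..q3 come from the three target critics."""
    pessimistic = torch.minimum(q1, q2)                        # standard TD3 clipped double-Q term
    optimistic = torch.max(torch.stack([q1, q2, q3]), dim=0).values
    blended = kappa * pessimistic + (1 - kappa) * optimistic   # counteracts underestimation
    return reward + gamma * not_done * blended

r = torch.tensor([1.0, 0.0]); nd = torch.tensor([1.0, 1.0])
q1 = torch.tensor([5.0, 2.0]); q2 = torch.tensor([4.5, 2.5]); q3 = torch.tensor([6.0, 3.0])
print(triple_critic_target(r, nd, q1, q2, q3))
```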

  • Article Type: Journal Article
    Efficient and reliable data routing is critical in Advanced Metering Infrastructure (AMI) within Smart Grids, dictating the overall network performance and resilience. This paper introduces Q-RPL, a novel Q-learning-based Routing Protocol designed to enhance routing decisions in AMI deployments based on wireless mesh technologies. Q-RPL leverages the principles of Reinforcement Learning (RL) to dynamically select optimal next-hop forwarding candidates, adapting to changing network conditions. The protocol operates on top of the standard IPv6 Routing Protocol for Low-Power and Lossy Networks (RPL), integrating it with intelligent decision-making capabilities. Through extensive simulations carried out in real map scenarios, Q-RPL demonstrates a significant improvement in key performance metrics such as packet delivery ratio, end-to-end delay, and compliant factor compared to the standard RPL implementation and other benchmark algorithms found in the literature. The adaptability and robustness of Q-RPL mark a significant advancement in the evolution of routing protocols for Smart Grid AMI, promising enhanced efficiency and reliability for future intelligent energy systems. The findings of this study also underscore the potential of Reinforcement Learning to improve networking protocols.
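
    A tabular sketch of the Q-learning idea behind next-hop selection is given below: each meter keeps a Q-value per candidate parent and updates it from observed delivery and delay feedback. The topology, reward shape, and parameters are toy assumptions for illustration, not Q-RPL's actual state and reward design.

```python
import random

random.seed(7)
candidates = ["parent_A", "parent_B", "parent_C"]             # next-hop candidates at one meter
Q = {c: 0.0 for c in candidates}
alpha, epsilon = 0.2, 0.1

def send_via(parent):
    """Toy link model: parent_B offers the most reliable route to the collector."""
    success = random.random() < {"parent_A": 0.6, "parent_B": 0.9, "parent_C": 0.4}[parent]
    delay = random.uniform(0.05, 0.2)
    return (1.0 if success else -1.0) - delay                  # reward: delivery minus delay cost

for _ in range(500):
    parent = random.choice(candidates) if random.random() < epsilon else max(Q, key=Q.get)
    reward = send_via(parent)
    Q[parent] += alpha * (reward - Q[parent])                  # bandit-style one-step Q update
print({c: round(v, 2) for c, v in Q.items()})                  # learned next-hop preferences
```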

  • Article Type: Journal Article
    Designing compounds with a range of desirable properties is a fundamental challenge in drug discovery. In pre-clinical early drug discovery, novel compounds are often designed based on an already existing promising starting compound through structural modifications for further property optimization. Recently, transformer-based deep learning models have been explored for the task of molecular optimization by training on pairs of similar molecules. This provides a starting point for generating molecules similar to a given input molecule, but has limited flexibility regarding user-defined property profiles. Here, we evaluate the effect of reinforcement learning on transformer-based molecular generative models. The generative model can be considered a pre-trained model with knowledge of the chemical space close to an input compound, while reinforcement learning can be viewed as a tuning phase, steering the model towards chemical space with user-specific desirable properties. The evaluation of two distinct tasks, molecular optimization and scaffold discovery, suggests that reinforcement learning could guide the transformer-based generative model towards the generation of more compounds of interest. Additionally, the impact of pre-trained models, learning steps, and learning rates is investigated. Scientific contribution: Our study investigates the effect of reinforcement learning on a transformer-based generative model initially trained for generating molecules similar to starting molecules. The reinforcement learning framework is applied to facilitate multiparameter optimisation of starting molecules. This approach allows more flexibility for optimizing user-specific property profiles and helps find more ideas of interest.
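
    The sketch below shows the shape of such a tuning phase: molecules are sampled from a pre-trained generator, scored against a user-defined property profile, and the generator is nudged toward higher-reward outputs with a policy-gradient loss. PretrainedGenerator and the scoring function are toy stand-ins, not the transformer model or property oracles used in the study.

```python
import torch

class PretrainedGenerator(torch.nn.Module):
    """Toy stand-in for the pre-trained model: a categorical policy over fragment tokens."""
    def __init__(self, vocab_size: int = 20):
        super().__init__()
        self.logits = torch.nn.Parameter(torch.zeros(vocab_size))
    def sample(self, n):
        dist = torch.distributions.Categorical(logits=self.logits)
        tokens = dist.sample((n,))
        return tokens, dist.log_prob(tokens)

def property_reward(tokens):
    return tokens.float() / 20.0                            # placeholder for user-specific property scores

gen = PretrainedGenerator()
optim = torch.optim.Adam(gen.parameters(), lr=0.05)
for _ in range(100):                                        # RL tuning phase
    tokens, logp = gen.sample(64)
    reward = property_reward(tokens)
    loss = -(logp * (reward - reward.mean())).mean()        # policy gradient with a mean baseline
    optim.zero_grad(); loss.backward(); optim.step()
print(gen.logits.argmax().item())                           # token the tuned model now favours
```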

  • Article Type: Journal Article
    Learning is a taxonomically widespread process by which animals change their behavioural responses to stimuli as a result of experience. In this way, it plays a crucial role in the development of individual behaviour and underpins substantial phenotypic variation within populations. Nevertheless, the impact of learning in social contexts on evolutionary change is not well understood. Here, we develop game theoretical models of competition for resources in small groups (e.g. producer-scrounger and hawk-dove games) in which actions are controlled by reinforcement learning and show that biases in the subjective valuation of different actions readily evolve. Moreover, in many cases, the convergence stable levels of bias exist at fitness minima and therefore lead to disruptive selection on learning rules and, potentially, to the evolution of genetic polymorphisms. Thus, we show how reinforcement learning in social contexts can be a driver of evolutionary diversification. In addition, we consider the evolution of ability in our games, showing that learning can also drive disruptive selection on the ability to perform a task.
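
    A toy version of the modelling idea is sketched below: two agents repeatedly play a hawk-dove game, update action values by reinforcement learning, and each evaluates payoffs through a subjective bias added to the hawk action. The payoff values, the softmax choice rule, and the bias settings are illustrative assumptions, not the paper's parameterization.

```python
import numpy as np

rng = np.random.default_rng(3)
V, C = 2.0, 3.0                                            # resource value and cost of fighting
payoff = {("H", "H"): (V - C) / 2, ("H", "D"): V, ("D", "H"): 0.0, ("D", "D"): V / 2}

def play(rounds=2000, bias=(0.5, -0.5), alpha=0.1, temp=0.5):
    names = ["H", "D"]
    q = np.zeros((2, 2))                                   # action values: rows = agents, cols = (H, D)
    for _ in range(rounds):
        probs = np.exp(q / temp)
        probs /= probs.sum(1, keepdims=True)               # softmax action selection
        acts = [rng.choice(2, p=probs[i]) for i in range(2)]
        for i in range(2):
            r = payoff[(names[acts[i]], names[acts[1 - i]])]
            r += bias[i] if acts[i] == 0 else 0.0          # subjective valuation of playing hawk
            q[i, acts[i]] += alpha * (r - q[i, acts[i]])
    return probs                                           # final hawk/dove choice probabilities

print(play().round(2))
```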

  • Article Type: Journal Article
    Synchronization in complex networks is a ubiquitous and important phenomenon with implications in various fields. Excessive synchronization may lead to undesired consequences, making desynchronization techniques essential. Exploiting the Proximal Policy Optimization algorithm, this work studies reinforcement learning-based pinning control strategies for synchronization suppression in global coupling networks and two types of irregular coupling networks: the Watts-Strogatz small-world networks and the Barabási-Albert scale-free networks. We investigate the impact of the ratio of controlled nodes and the role of key nodes selected by the LeaderRank algorithm on the performance of synchronization suppression. Numerical results demonstrate the effectiveness of the reinforcement learning-based pinning control strategy in different coupling schemes of the complex networks, revealing a critical ratio of the pinned nodes and the superior performance of a newly proposed hybrid pinning strategy. The results provide valuable insights for suppressing and optimizing network synchronization behavior efficiently.
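
    The sketch below sets up the kind of environment such a policy acts in: Kuramoto oscillators with global coupling, a chosen fraction of pinned nodes that receive a control input, and the Kuramoto order parameter as the synchrony measure. For brevity the RL policy is replaced by a simple desynchronizing feedback on the pinned nodes; all parameters are illustrative and PPO itself is not implemented.

```python
import numpy as np

rng = np.random.default_rng(5)
N, K, dt = 50, 1.5, 0.05
omega = rng.normal(0.0, 0.5, N)                            # natural frequencies
theta = rng.uniform(0.0, 2.0 * np.pi, N)                   # oscillator phases
pinned = rng.choice(N, size=int(0.2 * N), replace=False)   # 20% of nodes are controlled

def order_parameter(th):
    return abs(np.exp(1j * th).mean())                     # 1 = full synchrony, 0 = incoherence

for _ in range(2000):
    mean_field = np.exp(1j * theta).mean()
    coupling = K * np.abs(mean_field) * np.sin(np.angle(mean_field) - theta)
    control = np.zeros(N)
    control[pinned] = -2.0 * np.sin(np.angle(mean_field) - theta[pinned])   # push pinned nodes off the mean phase
    theta += dt * (omega + coupling + control)
print(round(order_parameter(theta), 3))                    # lower value = weaker synchronization
```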

  • Article Type: Journal Article
    Safety-critical domains often employ autonomous agents that follow a sequential decision-making setup, whereby the agent follows a policy to dictate the appropriate action at each step. AI practitioners often employ reinforcement learning algorithms to allow an agent to find the best policy. However, sequential systems often lack clear and immediate signs of wrong actions, with consequences visible only in hindsight, making it difficult for humans to understand system failures. In reinforcement learning, this is referred to as the credit assignment problem. To effectively collaborate with an autonomous system, particularly in a safety-critical setting, explanations should enable a user to better understand the agent's policy and predict system behavior, so that users are cognizant of potential failures and these failures can be diagnosed and mitigated. However, humans are diverse and have innate biases or preferences which may enhance or impair the utility of a policy explanation of a sequential agent. Therefore, in this paper, we designed and conducted a human-subjects experiment to identify the factors which influence the perceived usability and the objective usefulness of policy explanations for reinforcement learning agents in a sequential setting. Our study had two factors: the modality of policy explanation shown to the user (Tree, Text, Modified Text, and Programs) and the "first impression" of the agent, i.e., whether the user saw the agent succeed or fail in the introductory calibration video. Our findings characterize a preference-performance tradeoff wherein participants perceived language-based policy explanations to be significantly more usable; however, participants were better able to objectively predict the agent's behavior when provided an explanation in the form of a decision tree. Our results demonstrate that user-specific factors, such as computer science experience (p < 0.05), and situational factors, such as watching the agent crash (p < 0.05), can significantly impact the perception and usefulness of the explanation. This research provides key insights to alleviate prevalent issues regarding inappropriate compliance and reliance, which are exponentially more detrimental in safety-critical settings, providing a path forward for XAI developers for future work on policy explanations.
