Reinforcement Learning

  • Article type: Journal Article
    In the contemporary digitalization landscape and technological advancement, the auction industry undergoes a metamorphosis, assuming a pivotal role as a transactional paradigm. Functioning as a mechanism for pricing commodities or services, the procedural intricacies and efficiency of auctions directly influence market dynamics and participant engagement. Harnessing the advancing capabilities of artificial intelligence (AI) technology, the auction sector proactively integrates AI methodologies to augment efficacy and enrich user interactions. This study delves into the intricacies of the price prediction challenge within the auction domain, introducing a sophisticated RL-GRU framework for price interval analysis. The framework commences by adeptly conducting quantitative feature extraction of commodities through GRU, subsequently orchestrating dynamic interactions within the model's environment via reinforcement learning techniques. Ultimately, it accomplishes the task of interval division and recognition of auction commodity prices through a discerning classification module. Demonstrating precision exceeding 90% across publicly available and internally curated datasets within five intervals and exhibiting superior performance within eight intervals, this framework contributes valuable technical insights for future endeavours in auction price interval prediction challenges.
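
    The abstract gives no implementation details, so the following is only a minimal sketch of the general shape such a pipeline could take: a GRU encodes an item's feature sequence and a policy head picks one of several price intervals, trained with a REINFORCE-style reward for correct intervals. All module names, shapes, and the reward scheme are assumptions, not the paper's code.

```python
# Hypothetical sketch of a GRU encoder plus RL (REINFORCE) interval classifier.
import torch
import torch.nn as nn

class GRUIntervalPolicy(nn.Module):
    def __init__(self, feat_dim=16, hidden=64, n_intervals=5):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_intervals)   # one logit per price interval

    def forward(self, x):                 # x: (batch, seq_len, feat_dim)
        _, h = self.gru(x)                # h: (1, batch, hidden)
        return self.head(h.squeeze(0))    # logits over price intervals

policy = GRUIntervalPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Toy batch: random item feature sequences and their true price intervals.
x = torch.randn(32, 10, 16)
true_interval = torch.randint(0, 5, (32,))

logits = policy(x)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                        # predicted interval, sampled
reward = (action == true_interval).float()    # +1 when the interval is correct
loss = -(dist.log_prob(action) * reward).mean()   # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
```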

  • Article type: Journal Article
    Repetitive negative thinking (RNT) is a transdiagnostic construct that encompasses rumination and worry, yet what precisely is shared between rumination and worry is unclear. To clarify this, we develop a meta-control account of RNT. Meta-control refers to the reinforcement and control of mental behavior via computations similar to those that reinforce and control motor behavior. We propose that rumination and worry are coarse terms for failure in meta-control, just as tripping and falling are coarse terms for failure in motor control. We delineate four meta-control stages and risk factors increasing the chance of failure at each, including open-ended thoughts (stage 1), individual differences influencing subgoal execution (stage 2) and switching (stage 3), and challenges inherent to learning adaptive mental behavior (stage 4). Distinguishing these stages therefore elucidates diverse processes that lead to the same behavior of excessive RNT. Our account also subsumes prominent clinical accounts of RNT into a computational cognitive neuroscience framework.

  • Article type: Journal Article
    Hypertension is a major risk factor for many serious diseases. With the aging population and lifestyle changes, the incidence of hypertension continues to rise, imposing a significant medical cost burden on patients and severely affecting their quality of life. Early intervention can greatly reduce the prevalence of hypertension. Research on hypertension early warning models based on electronic health records (EHRs) is an important and effective method for achieving early hypertension warning. However, limited by the scarcity and imbalance of multivisit records, and the nonstationary characteristics of hypertension features, it is difficult to predict the probability of hypertension prevalence in a patient effectively. Therefore, this study proposes an online hypertension monitoring model (HRP-OG) based on reinforcement learning and generative feature replay. It transforms the hypertension prediction problem into a sequential decision problem, achieving risk prediction of hypertension for patients using multivisit records. Sensors embedded in medical devices and wearables continuously capture real-time physiological data such as blood pressure, heart rate, and activity levels, which are integrated into the EHR. The fit between the samples generated by the generator and the real visit data is evaluated using maximum likelihood estimation, which can reduce the adversarial discrepancy between the feature space of hypertension and incoming incremental data, and the model is updated online based on real-time data using generative feature replay. The incorporation of sensor data ensures that the model adapts dynamically to changes in the condition of patients, facilitating timely interventions. In this study, the publicly available MIMIC-III data are used for validation, and the experimental results demonstrate that compared to existing advanced methods, HRP-OG can effectively improve the accuracy of hypertension risk prediction for few-shot multivisit records in nonstationary environments.
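
    As a rough illustration of the generative-feature-replay idea described above (the RL component and the paper's actual generator are omitted), the sketch below mixes incoming visit features with pseudo-features drawn from a maximum-likelihood Gaussian fit of past data before each online update. All names and numbers are assumptions.

```python
# Hypothetical sketch of generative feature replay for online model updating.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="log_loss")

# Initial batch of visit features (e.g. blood pressure, heart rate) and labels.
X_old = rng.normal(size=(200, 4)); y_old = rng.integers(0, 2, 200)
clf.partial_fit(X_old, y_old, classes=[0, 1])

# Maximum-likelihood fit of a simple Gaussian generator to the old feature distribution.
mu, sigma = X_old.mean(axis=0), X_old.std(axis=0)

for step in range(5):                                    # stream of incremental data
    X_new = rng.normal(loc=0.2 * step, size=(20, 4))     # drifting (nonstationary) visits
    y_new = rng.integers(0, 2, 20)
    X_replay = rng.normal(mu, sigma, size=(40, 4))       # generated pseudo-visits
    y_replay = clf.predict(X_replay)                     # self-labelled replay targets
    X_mix = np.vstack([X_new, X_replay])
    y_mix = np.concatenate([y_new, y_replay])
    clf.partial_fit(X_mix, y_mix)                        # online update without forgetting
```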

  • Article type: Journal Article
    We consider a complex control problem: making a monopod accurately reach a target with a single jump. The monopod can jump in any direction at different elevations of the terrain. This is a paradigm for a much larger class of problems, which are extremely challenging and computationally expensive to solve using standard optimization-based techniques. Reinforcement learning (RL) is an interesting alternative, but an end-to-end approach in which the controller must learn everything from scratch can be non-trivial with a sparse-reward task like jumping. Our solution is to guide the learning process within an RL framework leveraging nature-inspired heuristic knowledge. This expedient brings widespread benefits, such as a drastic reduction of learning time, and the ability to learn and compensate for possible errors in the low-level execution of the motion. Our simulation results reveal a clear advantage of our solution against both optimization-based and end-to-end RL approaches.
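
    One hedged way to picture the heuristic guidance described above: instead of learning raw joint torques from scratch, an agent could learn a small residual correction on top of a closed-form ballistic guess of the launch velocity, compensating for errors in low-level execution. The sketch below uses a simple hill-climbing update as a stand-in for the paper's RL algorithm; the dynamics, numbers, and update rule are illustrative assumptions.

```python
# Hypothetical sketch: learn a residual correction on top of a ballistic heuristic.
import numpy as np

G = 9.81
rng = np.random.default_rng(0)

def heuristic_launch_speed(dist, dh, angle=np.pi / 4):
    """Closed-form projectile speed reaching horizontal distance `dist` at height `dh`."""
    return np.sqrt(G * dist ** 2 / (2 * np.cos(angle) ** 2 * (dist * np.tan(angle) - dh)))

def landing_error(speed, dist, dh, angle=np.pi / 4):
    """Horizontal miss distance for a launch at `speed` toward the target."""
    vx, vz = speed * np.cos(angle), speed * np.sin(angle)
    t = (vz + np.sqrt(vz ** 2 - 2 * G * dh)) / G         # time of flight to target height
    return abs(vx * t - dist)

def execute(speed):
    return 0.95 * speed                                   # low-level tracking error: 5% shortfall

residual = 0.0                                            # learned multiplicative correction
for episode in range(300):
    dist, dh = rng.uniform(1.0, 2.0), rng.uniform(-0.2, 0.2)
    cand = float(np.clip(residual + rng.normal(0.0, 0.05), -0.1, 0.3))
    base = heuristic_launch_speed(dist, dh)
    if landing_error(execute(base * (1 + cand)), dist, dh) < \
       landing_error(execute(base * (1 + residual)), dist, dh):
        residual = cand                                   # keep corrections that land closer

print(f"learned correction: {residual:+.3f}")             # roughly offsets the 5% shortfall
```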

  • Article type: Journal Article
    The ability to make informed decisions in complex scenarios is crucial for intelligent automotive systems. Traditional expert rules and other methods often fall short in complex contexts. Recently, reinforcement learning has garnered significant attention due to its superior decision-making capabilities. However, inaccurate target network estimation limits its decision-making ability in complex scenarios. This paper focuses on the underestimation phenomenon and proposes an end-to-end autonomous driving decision-making method based on an improved TD3 algorithm. The method employs a forward-facing camera to capture data. By introducing a new critic network to form a triple-critic structure and combining it with a target maximization operation, the underestimation problem in the TD3 algorithm is addressed. Subsequently, a multi-timestep averaging method is used to counter the policy instability introduced by the new critic. In addition, this paper uses the Carla platform to construct multi-vehicle unprotected left-turn and congested lane-center driving scenarios and verifies the algorithm on them. The results demonstrate that our method surpasses the baseline DDPG and TD3 algorithms in convergence speed, estimation accuracy, and policy stability.
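
    TD3's standard target takes the minimum over two critics, which is the source of the underestimation mentioned above. One plausible reading of a triple-critic structure with a target maximization operation is sketched below; the exact way the third critic is combined is an assumption, not the paper's rule.

```python
# Standard TD3 clipped double-Q target vs. a hypothetical three-critic variant.
import numpy as np

def td3_target(r, q1, q2, gamma=0.99):
    return r + gamma * np.minimum(q1, q2)            # standard TD3: pessimistic target

def triple_critic_target(r, q1, q2, q3, gamma=0.99):
    pessimistic = np.minimum(q1, q2)
    return r + gamma * np.maximum(pessimistic, q3)   # third critic pulls the target back up

r = np.array([1.0, 0.5])
q1, q2, q3 = np.array([10.0, 4.0]), np.array([9.0, 5.0]), np.array([9.5, 4.8])
print(td3_target(r, q1, q2))                # [ 9.91  4.46 ]
print(triple_critic_target(r, q1, q2, q3))  # [10.405 5.252]
```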

  • Article type: Journal Article
    Efficient and reliable data routing is critical in Advanced Metering Infrastructure (AMI) within Smart Grids, dictating the overall network performance and resilience. This paper introduces Q-RPL, a novel Q-learning-based Routing Protocol designed to enhance routing decisions in AMI deployments based on wireless mesh technologies. Q-RPL leverages the principles of Reinforcement Learning (RL) to dynamically select optimal next-hop forwarding candidates, adapting to changing network conditions. The protocol operates on top of the standard IPv6 Routing Protocol for Low-Power and Lossy Networks (RPL), integrating it with intelligent decision-making capabilities. Through extensive simulations carried out in real map scenarios, Q-RPL demonstrates a significant improvement in key performance metrics such as packet delivery ratio, end-to-end delay, and compliant factor compared to the standard RPL implementation and other benchmark algorithms found in the literature. The adaptability and robustness of Q-RPL mark a significant advancement in the evolution of routing protocols for Smart Grid AMI, promising enhanced efficiency and reliability for future intelligent energy systems. The findings of this study also underscore the potential of Reinforcement Learning to improve networking protocols.
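
    A minimal sketch of the Q-learning idea behind such a protocol: each node keeps a Q-value per (node, next-hop) pair, picks next hops epsilon-greedily, and updates values from per-hop rewards (e.g., a delay penalty and a delivery bonus at the DODAG root). Node names, rewards, and parameters below are illustrative assumptions, not Q-RPL's actual specification.

```python
# Hypothetical Q-learning over next-hop candidates, in the spirit of Q-RPL.
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1
q = defaultdict(float)                                   # Q[(node, next_hop)]
neighbors = {"meter_7": ["relay_a", "relay_b"], "relay_a": ["root"], "relay_b": ["root"]}

def choose_next_hop(node):
    cands = neighbors[node]
    if random.random() < EPSILON:                        # explore occasionally
        return random.choice(cands)
    return max(cands, key=lambda n: q[(node, n)])        # otherwise exploit best Q-value

def update(node, hop, reward, nxt):
    nxt_cands = neighbors.get(nxt, [])
    best_next = max(q[(nxt, n)] for n in nxt_cands) if nxt_cands else 0.0
    q[(node, hop)] += ALPHA * (reward + GAMMA * best_next - q[(node, hop)])

# One simulated packet: meter -> relay -> root, rewarded for successful, low-delay delivery.
hop = choose_next_hop("meter_7")
update("meter_7", hop, reward=-0.02, nxt=hop)            # per-hop delay penalty
update(hop, "root", reward=1.0, nxt="root")              # delivered to the DODAG root
```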

  • Article type: Journal Article
    Designing compounds with a range of desirable properties is a fundamental challenge in drug discovery. In pre-clinical early drug discovery, novel compounds are often designed based on an already existing promising starting compound through structural modifications for further property optimization. Recently, transformer-based deep learning models have been explored for the task of molecular optimization by training on pairs of similar molecules. This provides a starting point for generating similar molecules to a given input molecule, but has limited flexibility regarding user-defined property profiles. Here, we evaluate the effect of reinforcement learning on transformer-based molecular generative models. The generative model can be considered as a pre-trained model with knowledge of the chemical space close to an input compound, while reinforcement learning can be viewed as a tuning phase, steering the model towards chemical space with user-specific desirable properties. The evaluation of two distinct tasks, molecular optimization and scaffold discovery, suggests that reinforcement learning could guide the transformer-based generative model towards the generation of more compounds of interest. Additionally, the impact of pre-trained models, learning steps and learning rates is investigated. Scientific contribution: Our study investigates the effect of reinforcement learning on a transformer-based generative model initially trained for generating molecules similar to starting molecules. The reinforcement learning framework is applied to facilitate multiparameter optimization of starting molecules. This approach allows more flexibility for optimizing user-specific property profiles and helps find more ideas of interest.
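
    The RL tuning phase described above can be pictured as a REINFORCE loop: sample candidates from the pre-trained generator, score them against the user-defined property profile, and reinforce high-scoring samples. The sketch below replaces the SMILES transformer and the property predictor with toy stand-ins; it illustrates the general recipe under stated assumptions, not the authors' implementation.

```python
# Hypothetical RL tuning loop: sample, score against a property profile, reinforce.
import torch
import torch.nn as nn

vocab = ["C", "N", "O", "c1ccccc1"]                  # toy "fragment" vocabulary
gen = nn.Parameter(torch.zeros(8, len(vocab)))       # stand-in for pre-trained per-position logits
opt = torch.optim.Adam([gen], lr=0.1)

def score(fragments):
    # Stand-in for a user-defined desirability profile: reward aromatic-ring content.
    return sum(f == "c1ccccc1" for f in fragments) / len(fragments)

for step in range(100):
    dist = torch.distributions.Categorical(logits=gen)   # one distribution per position
    idx = dist.sample()                                   # sampled fragment indices
    reward = score([vocab[i] for i in idx])
    loss = -(dist.log_prob(idx).sum() * reward)           # REINFORCE: push toward high reward
    opt.zero_grad(); loss.backward(); opt.step()
```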

  • Article type: Journal Article
    Adults struggle to learn non-native speech categories in many experimental settings (Goto, Neuropsychologia, 9(3), 317-323 1971), but learn efficiently in a video game paradigm where non-native speech sounds have functional significance (Lim & Holt, Cognitive Science, 35(7), 1390-1405 2011). Behavioral and neural evidence from this and other paradigms points toward the involvement of reinforcement learning mechanisms in speech category learning (Harmon, Idemaru, & Kapatsinski, Cognition, 189, 76-88 2019; Lim, Fiez, & Holt, Proceedings of the National Academy of Sciences, 116, 201811992 2019). We formalize this hypothesis computationally and implement a deep reinforcement learning network to map between environmental input and actions. Comparing against a supervised learning model, we show that the reinforcement network closely matches aspects of human behavior in two experiments - learning of synthesized auditory noise tokens and improvement in speech sound discrimination. Both models perform comparably, and the similarity in the output of each model leads us to believe that there is little inherent computational benefit to a reward-based learning mechanism. We suggest that the specific neural circuitry engaged by the paradigm and links between striatum and superior temporal areas play a critical role in effective learning.
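
    The computational contrast drawn above can be sketched as follows: a reward-based learner receives only right/wrong feedback on its own category choices, while a supervised learner is told the correct label, both updating a softmax classifier over acoustic features. The stimuli, category structure, and update rules below are toy assumptions, not the study's models.

```python
# Toy contrast between a reward-based (RL) learner and a supervised learner
# on a two-category "speech sound" task.
import numpy as np

rng = np.random.default_rng(0)
centers = np.array([[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]])  # category prototypes
W_rl = np.zeros((2, 4))
W_sup = np.zeros((2, 4))
lr = 0.05

for _ in range(2000):
    label = rng.integers(0, 2)
    x = centers[label] + rng.normal(0.0, 0.5, 4)       # noisy token of that category

    # RL learner: picks a category, only hears whether the choice was rewarded.
    p = np.exp(W_rl @ x); p /= p.sum()
    a = rng.choice(2, p=p)
    adv = float(a == label) - 0.5                      # reward minus a constant baseline
    grad = -p[:, None] * x                             # softmax policy gradient ...
    grad[a] += x                                       # ... for the chosen category
    W_rl += lr * adv * grad

    # Supervised learner: is told the correct label (cross-entropy update).
    q = np.exp(W_sup @ x); q /= q.sum()
    W_sup[label] += lr * (1 - q[label]) * x
    W_sup[1 - label] -= lr * q[1 - label] * x

test = [(centers[l] + rng.normal(0.0, 0.5, 4), l) for l in rng.integers(0, 2, 500)]
for name, W in (("RL", W_rl), ("supervised", W_sup)):
    print(name, np.mean([np.argmax(W @ x) == l for x, l in test]))
```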

  • Article type: Journal Article
    We investigate an efficient computational tool to suggest useful treatment regimens for people infected with the human immunodeficiency virus (HIV). Structured treatment interruption (STI) is a regimen in which therapeutic drugs are periodically administered and withdrawn to give patients relief from an arduous drug therapy. Numerous studies have been conducted to find better STI treatment strategies using various computational tools with mathematical models of HIV infection. In this paper, we leverage a modified version of the double deep Q network with prioritized experience replay to improve the performance of classic deep learning algorithms. Numerical simulation results show that our methodology achieves significantly better cost values over shorter treatment periods than other recent studies. Furthermore, our proposed algorithm performs well in one-day segment scenarios, whereas previous studies only reported results for five-day segment scenarios.
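
    Two ingredients named above, the double-DQN target and proportional prioritized experience replay, are sketched below in isolation; the networks, state encoding, and HIV dynamics model are omitted, and all numbers are illustrative assumptions.

```python
# Hypothetical sketch of a double-DQN target and proportional prioritized replay.
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.99

def double_dqn_target(reward, q_online_next, q_target_next):
    a_star = np.argmax(q_online_next)                 # online net selects the action ...
    return reward + GAMMA * q_target_next[a_star]     # ... target net evaluates it

# Proportional prioritized sampling over a toy replay buffer of TD errors.
td_errors = np.array([0.1, 2.0, 0.5, 0.05])
priorities = (np.abs(td_errors) + 1e-3) ** 0.6        # alpha = 0.6
probs = priorities / priorities.sum()
batch = rng.choice(len(td_errors), size=2, p=probs, replace=False)

# Example target for a toy action set {interrupt therapy, administer drugs}.
print(double_dqn_target(1.0, np.array([3.0, 2.5]), np.array([2.8, 3.1])))   # 1 + 0.99 * 2.8
print(batch)
```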

  • Article type: Journal Article
    Learning is a taxonomically widespread process by which animals change their behavioural responses to stimuli as a result of experience. In this way, it plays a crucial role in the development of individual behaviour and underpins substantial phenotypic variation within populations. Nevertheless, the impact of learning in social contexts on evolutionary change is not well understood. Here, we develop game theoretical models of competition for resources in small groups (e.g. producer-scrounger and hawk-dove games) in which actions are controlled by reinforcement learning and show that biases in the subjective valuation of different actions readily evolve. Moreover, in many cases, the convergence stable levels of bias exist at fitness minima and therefore lead to disruptive selection on learning rules and, potentially, to the evolution of genetic polymorphisms. Thus, we show how reinforcement learning in social contexts can be a driver of evolutionary diversification. In addition, we consider the evolution of ability in our games, showing that learning can also drive disruptive selection on the ability to perform a task.
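
    As a hedged illustration of the modelling idea above, the sketch below lets an agent learn hawk/dove action values by reinforcement while a heritable bias inflates its subjective valuation of playing hawk; the payoffs, learning rule, and bias value are illustrative assumptions rather than the paper's model.

```python
# Hypothetical reinforcement learning within a hawk-dove game, with a heritable
# bias added to the subjective value of playing hawk.
import numpy as np

rng = np.random.default_rng(1)
V, C = 2.0, 3.0                                        # resource value, cost of fighting
payoff = {("H", "H"): (V - C) / 2, ("H", "D"): V, ("D", "H"): 0.0, ("D", "D"): V / 2}

def play(bias, n_rounds=500, alpha=0.1):
    q = {"H": bias, "D": 0.0}                          # bias inflates hawk's initial valuation
    hawk_count = 0
    for _ in range(n_rounds):
        probs = np.exp(list(q.values())); probs /= probs.sum()
        a = rng.choice(["H", "D"], p=probs)            # softmax choice over subjective values
        opp = rng.choice(["H", "D"])                   # opponent plays at random here
        q[a] += alpha * (payoff[(a, opp)] - q[a])      # reinforcement update on the chosen act
        hawk_count += a == "H"
    return hawk_count / n_rounds

print(play(bias=0.0), play(bias=1.5))                  # a biased learner escalates more often
```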