Reinforcement Learning

  • Article type: Journal Article
    BACKGROUND: Impulse control disorders (ICD) in Parkinson's disease (PD) are associated with a heavy burden on patients and caretakers. While recovery can occur, ICD persists in many patients despite optimal management. The basis for this inter-individual variability in recovery is unclear and poses a major challenge to personalized health care.
    METHODS: We adopt a computational psychiatry approach and leverage the longitudinal, prospective Personalized Parkinson Project (N=136 persons with PD, within 5 years of diagnosis) to combine dopaminergic learning theory-informed fMRI with machine learning (at baseline) to predict ICD symptom recovery after two years of follow-up. We focused on a change in QUIP-rs across the entire cohort, regardless of an ICD diagnosis.
    RESULTS: Greater reinforcement learning signals during gain trials, but not loss trials, at baseline (including those in the ventral striatum and medial prefrontal cortex), together with the behavioral accuracy score measured while ON medication, were associated with greater recovery from impulse control symptoms two years later. These signals accounted for a unique proportion of the relevant variability over and above that explained by other known factors, such as decreases in dopamine agonist use.
    CONCLUSIONS: Our results provide a proof of principle for combining generative model-based inference of latent learning processes with machine learning-based predictive modeling of variability in clinical symptom recovery trajectories. Hence, we showed that RL modeling parameters predict recovery from ICD symptoms in PD.
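    The abstract does not spell out the learning model, but fMRI studies of this kind typically fit a simple delta-rule (Rescorla-Wagner/Q-learning) model to choices and use the trial-wise reward prediction errors as parametric regressors for striatal and prefrontal signals. The sketch below illustrates that idea only; the learning rate, outcome coding, and function name are illustrative assumptions, not parameters from the paper.

```python
import numpy as np

def delta_rule_prediction_errors(choices, outcomes, n_options=2, alpha=0.3):
    """Fit a simple delta-rule (Rescorla-Wagner) model to a choice sequence
    and return the trial-wise reward prediction errors.

    choices  : chosen option index per trial
    outcomes : received outcome per trial (e.g., +1 gain, 0 neutral, -1 loss)
    alpha    : learning rate (illustrative value; normally estimated per subject)
    """
    values = np.zeros(n_options)               # expected value of each option
    prediction_errors = np.zeros(len(choices))
    for t, (c, r) in enumerate(zip(choices, outcomes)):
        prediction_errors[t] = r - values[c]        # reward prediction error
        values[c] += alpha * prediction_errors[t]   # delta-rule update
    return prediction_errors

# The resulting prediction errors could then be convolved with an HRF and used
# as parametric modulators in a first-level fMRI design, with gain and loss
# trials modeled separately as described in the abstract.
choices = np.array([0, 1, 0, 0, 1])
outcomes = np.array([1, -1, 1, 0, 1])
print(delta_rule_prediction_errors(choices, outcomes))
```
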
  • Article type: Journal Article
    The learning process encompasses exploration and exploitation phases. While reinforcement learning models have revealed functional and neuroscientific distinctions between these phases, knowledge regarding how they affect visual attention while observing the external environment is limited. This study sought to elucidate the interplay between these learning phases and visual attention allocation using visual adjustment tasks combined with a two-armed bandit problem tailored to detect serial effects only when attention is dispersed across both arms. Per our findings, human participants exhibited a distinct serial effect only during the exploration phase, suggesting enhanced attention to the visual stimulus associated with the non-target arm. Remarkably, although rewards did not motivate attention dispersion in our task, during the exploration phase, individuals engaged in active observation and searched for targets to observe. This behavior highlights a unique information-seeking process in exploration that is distinct from exploitation.
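    The modeling details are not given in the abstract, but a common way to operationalize the exploration/exploitation distinction in a two-armed bandit is to learn action values with a delta rule and label choices of the currently lower-valued arm as exploratory. The sketch below follows that convention; the softmax rule, learning rate, and labeling criterion are assumptions for illustration, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(q, beta=3.0):
    """Choice probabilities from action values (beta = inverse temperature)."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

def run_bandit(reward_probs=(0.7, 0.3), n_trials=200, alpha=0.2):
    """Simulate a two-armed bandit and tag each choice as exploit/explore."""
    q = np.zeros(2)                      # learned action values
    labels = []
    for _ in range(n_trials):
        choice = rng.choice(2, p=softmax(q))
        # Convention: choosing the arm with the lower current value = exploration
        labels.append("explore" if q[choice] < q.max() else "exploit")
        reward = float(rng.random() < reward_probs[choice])
        q[choice] += alpha * (reward - q[choice])    # delta-rule update
    return labels

labels = run_bandit()
print("exploratory choices:", labels.count("explore"), "of", len(labels))
```
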
  • Article type: Journal Article
    The stabilization of human quiet stance is achieved by a combination of the intrinsic elastic properties of the ankle muscles and an active closed-loop activation of those muscles, driven by delayed feedback of the ongoing sway angle and the corresponding angular velocity in the manner of a delayed proportional (P) and derivative (D) feedback controller. It has been shown that the active component of the stabilization process is likely to operate in an intermittent manner rather than as a continuous controller: the switching policy is defined in the phase plane, which is divided into dangerous and safe regions separated by appropriate switching boundaries. When the state enters a dangerous region, the delayed PD control is activated, and it is switched off when the state enters a safe region, leaving the system to evolve freely. In comparison with continuous feedback control, the intermittent mechanism is more robust and better able to reproduce postural sway patterns in healthy people. However, the superior performance of the intermittent control paradigm, as well as its biological plausibility suggested by experimental evidence of intermittent activation of the ankle muscles, leaves open the question of a feasible learning process by which the brain can identify the appropriate state-dependent switching policy and tune the P and D parameters accordingly. In this work, it is shown how such a goal can be achieved with a reinforcement motor learning paradigm, building on the evidence that the basal ganglia are known, in general, to play a central role in reinforcement learning for action selection and, in particular, have been found to be specifically involved in postural stabilization.
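    To make the intermittent control scheme concrete, the sketch below simulates a single-link inverted pendulum stabilized by passive ankle stiffness plus a delayed PD torque that is switched on only when the delayed state lies in an assumed "dangerous" region of the phase plane (body moving away from upright). All numerical values and the exact switching rule are illustrative assumptions, not the parameters of the reviewed model.

```python
import numpy as np

# Illustrative parameters (assumptions, not taken from the paper)
m, h, g = 70.0, 1.0, 9.81              # mass (kg), CoM height (m), gravity
I = m * h**2                           # single-link inverted pendulum inertia
K_passive, B = 0.8 * m * g * h, 4.0    # intrinsic ankle stiffness / damping
Kp, Kd = 0.5 * m * g * h, 100.0        # active delayed PD gains
delay, dt, T = 0.2, 0.001, 30.0        # feedback delay (s), step (s), duration (s)

n = int(T / dt)
lag = int(delay / dt)
theta = np.zeros(n)                    # sway angle (rad)
omega = np.zeros(n)                    # angular velocity (rad/s)
theta[0] = 0.02                        # small initial lean

for k in range(n - 1):
    # Delayed state available to the controller
    th_d, om_d = (theta[k - lag], omega[k - lag]) if k >= lag else (0.0, 0.0)
    # Assumed switching rule: activate PD control only when the delayed state
    # indicates the body is moving away from upright ("dangerous" region)
    active = th_d * om_d > 0
    u = -(Kp * th_d + Kd * om_d) if active else 0.0
    # Pendulum dynamics: gravity toppling torque, passive ankle torque, active torque
    torque = m * g * h * np.sin(theta[k]) - K_passive * theta[k] - B * omega[k] + u
    omega[k + 1] = omega[k] + dt * torque / I
    theta[k + 1] = theta[k] + dt * omega[k + 1]

last = theta[-int(10 / dt):]
print(f"sway RMS over the last 10 s: {np.sqrt(np.mean(last**2)):.4f} rad")
```
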
  • Article type: Journal Article
    Different dopamine receptor subtypes have opposing dynamics at post-synaptic receptors, with the ratio of D1 to D2 receptors determining the relative sensitivity to gains and losses, respectively, during value-based learning. This effective sensitivity to different reward feedback interacts with phasic dopamine levels to determine the effectiveness of learning, particularly in dynamic feedback situations where the frequency and magnitude of rewards need to be integrated over time to make optimal decisions. We modeled this effect in simulations of the underlying basal ganglia pathways and then tested the predictions in individuals carrying a variant of the human dopamine receptor D2 (DRD2; -141C Ins/Del and Del/Del) gene that is associated with lower levels of D2 receptor expression (N=119), comparing their performance in the Iowa Gambling Task (IGT) to non-carrier controls (N=319). Ventral striatal (VS) reactivity to rewards was measured with fMRI during the Cards task. DRD2 variant carriers made less effective decisions than non-carriers, but this effect was not moderated by VS reward reactivity as hypothesized by our model. These results suggest that the interaction between dopamine receptor subtypes and reactivity to rewards during learning may be more complex than originally thought.
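    The basal ganglia simulation itself is not described in the abstract, but its core idea, that the D1/D2 balance sets the relative sensitivity to gains versus losses during value learning, can be illustrated with a simple delta-rule model of the Iowa Gambling Task in which gains and losses are weighted asymmetrically. The deck payoffs, weights, and choice rule below are simplified stand-ins, not the authors' model.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_igt(gain_weight, loss_weight, n_trials=100, alpha=0.1, beta=2.0):
    """Delta-rule learner on a simplified 4-deck IGT.

    gain_weight / loss_weight stand in for effective D1- vs D2-mediated
    sensitivity to positive vs negative outcomes (illustrative only).
    Decks 0-1 are 'bad' (large gains, larger losses), decks 2-3 are 'good'.
    """
    mean_gain = np.array([100.0, 100.0, 50.0, 50.0])
    mean_loss = np.array([-125.0, -125.0, -25.0, -25.0])
    q = np.zeros(4)
    good_choices = 0
    for _ in range(n_trials):
        p = np.exp(beta * q / 100.0)
        p /= p.sum()
        c = rng.choice(4, p=p)
        outcome = gain_weight * mean_gain[c] + loss_weight * mean_loss[c]
        q[c] += alpha * (outcome - q[c])         # delta-rule value update
        good_choices += c >= 2
    return good_choices / n_trials

# Blunted effective loss sensitivity (a stand-in for reduced D2 expression)
# yields fewer advantageous choices in this toy model.
print("balanced    :", simulate_igt(gain_weight=1.0, loss_weight=1.0))
print("loss-blunted:", simulate_igt(gain_weight=1.0, loss_weight=0.4))
```
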
  • Article type: Journal Article
    The placebo and nocebo effects highlight the importance of expectations in modulating pain perception, but in everyday life we don't need an external source of information to form expectations about pain. The brain can learn to predict pain in a more fundamental way, simply by experiencing fluctuating, non-random streams of noxious inputs, and extracting their temporal regularities. This process is called statistical learning. Here, we address a key open question: does statistical learning modulate pain perception? We asked 27 participants to both rate and predict pain intensity levels in sequences of fluctuating heat pain. Using a computational approach, we show that probabilistic expectations and confidence were used to weigh pain perception and prediction. As such, this study goes beyond well-established conditioning paradigms associating non-pain cues with pain outcomes, and shows that statistical learning itself shapes pain experience. This finding opens a new path of research into the brain mechanisms of pain regulation, with relevance to chronic pain where it may be dysfunctional.
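    The abstract does not give the computational model, but one simple way to formalize probabilistic expectations and confidence weighing perception is a sequential Kalman-style observer whose reported percept is a precision-weighted blend of the expected and delivered intensity. The sketch below is purely illustrative under that assumption; the noise variances and weighting rule are not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(2)

def kalman_pain_observer(stimuli, process_var=1.0, noise_var=4.0):
    """Track expected pain intensity over a fluctuating stimulus sequence and
    report percepts as a confidence-weighted mix of expectation and input.

    Returns (expectations, percepts). Higher confidence in the expectation
    (lower posterior variance) pulls the percept further toward it.
    """
    mean, var = stimuli[0], 10.0                 # prior over intensity
    expectations, percepts = [], []
    for s in stimuli:
        expectations.append(mean)
        k = (var + process_var) / (var + process_var + noise_var)  # Kalman gain
        percepts.append((1 - k) * mean + k * s)  # percept: prediction vs input
        mean = mean + k * (s - mean)             # posterior update for next trial
        var = (1 - k) * (var + process_var)
    return np.array(expectations), np.array(percepts)

# A fluctuating, non-random heat-intensity sequence (toy example)
stimuli = 5 + 2 * np.sin(np.linspace(0, 6 * np.pi, 60)) + rng.normal(0, 0.5, 60)
exp_, perc = kalman_pain_observer(stimuli)
print("mean |percept - stimulus|:", np.mean(np.abs(perc - stimuli)).round(3))
```
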
  • Article type: Journal Article
    Reinforcement learning based hyper-heuristics (RL-HH) are a popular trend in the field of optimization. RL-HH combines the global search ability of hyper-heuristics (HH) with the learning ability of reinforcement learning (RL). This synergy allows the agent to dynamically adjust its own strategy, leading to gradual optimization of the solution. Existing research has shown the effectiveness of RL-HH in solving complex real-world problems. However, a comprehensive introduction to and summary of the RL-HH field has been lacking. This study reviews currently existing RL-HHs and presents a general framework for them. The algorithms are categorized into two classes: value-based reinforcement learning hyper-heuristics and policy-based reinforcement learning hyper-heuristics. Typical algorithms in each category are summarized and described in detail. Finally, the shortcomings of existing research on RL-HH and future research directions are discussed.
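    As a concrete illustration of the value-based branch of such a framework, the sketch below uses tabular Q-learning to choose among a small pool of low-level heuristics (perturbation operators) for a toy one-dimensional minimization problem. The state definition, reward (improvement in the objective), and operator pool are illustrative assumptions, not an algorithm from the review.

```python
import numpy as np

rng = np.random.default_rng(3)

def objective(x):
    """Toy objective to minimize."""
    return (x - 3.0) ** 2 + np.sin(5 * x)

# Low-level heuristics: simple perturbation operators on the current solution
heuristics = [
    lambda x: x + rng.normal(0, 1.0),                            # large random move
    lambda x: x + rng.normal(0, 0.1),                            # small random move
    lambda x: x - 0.05 * (2 * (x - 3.0) + 5 * np.cos(5 * x)),    # gradient-like step
]

def rl_hyper_heuristic(n_iters=500, alpha=0.2, gamma=0.9, eps=0.1):
    """Value-based RL-HH: Q-learning over {improved, not improved} x heuristics."""
    q = np.zeros((2, len(heuristics)))   # state: whether the last move improved
    x, best = 0.0, objective(0.0)
    state = 0
    for _ in range(n_iters):
        # Epsilon-greedy selection of a low-level heuristic
        a = rng.integers(len(heuristics)) if rng.random() < eps else int(q[state].argmax())
        x_new = heuristics[a](x)
        reward = objective(x) - objective(x_new)   # positive if the move improved
        next_state = int(reward > 0)
        q[state, a] += alpha * (reward + gamma * q[next_state].max() - q[state, a])
        if reward > 0:                             # accept improving moves only
            x = x_new
            best = min(best, objective(x))
        state = next_state
    return best, q

best, q_table = rl_hyper_heuristic()
print("best objective value found:", round(best, 4))
```
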
  • Article type: Journal Article
    Stochastic Calculus-guided Reinforcement Learning (SCRL) is a new way to make decisions in situations where things are uncertain. It uses mathematical principles to make better choices and improve decision-making in complex situations. SCRL works better than traditional Stochastic Reinforcement Learning (SRL) methods: in tests, SCRL showed that it can adapt and perform well, outperforming the SRL methods. SCRL had a lower dispersion value of 63.49 compared to SRL's 65.96, meaning its results varied less. SCRL also carried lower risk than SRL in both the short and the long term: SCRL's short-term risk value was 0.64 and its long-term risk value 0.78, whereas SRL's were much higher at 18.64 and 10.41, respectively. Lower risk values are better because they mean less chance of something going wrong. Overall, SCRL is a better way to make decisions when things are uncertain; it uses mathematics to make smarter choices and carries less risk than other methods. In addition, different metrics, viz. training rewards, learning progress, and rolling averages, were assessed for SRL and SCRL, and the study found that SCRL outperforms SRL. This makes SCRL very useful for real-world situations where decisions must be made carefully.
    • By leveraging mathematical principles derived from stochastic calculus, SCRL offers a robust framework for making informed choices and enhancing performance in complex scenarios.
    • In comparison to traditional SRL methods, SCRL demonstrates superior adaptability and efficacy, as evidenced by empirical tests.
  • Article type: Journal Article
    Reward processing dysfunction and deficient inhibitory control have been observed in Internet gaming disorder (IGD). However, it is still unclear whether prior reinforcement learning based on reward/punishment feedback influences cognitive inhibitory control in IGD. This study used behavioral experiments to compare an IGD group with healthy people without gaming experience on a probability selection task and a subsequent stop-signal task, in order to explore whether reward learning ability is impaired in the IGD group. We also discuss the influence of previous reward learning on subsequent inhibitory control. The results showed that (1) during the reward learning phase, the IGD group's accuracy was significantly lower than that of the control group; (2) compared with the control group, the IGD group's reaction times were longer in the transfer phase; and (3) for no-go trials in the inhibitory control phase after reward learning, the IGD group's accuracy for reward-related stimuli was lower than for punishment-related or neutral stimuli, whereas there was no significant difference among the three conditions in the control group. These findings indicate that the reinforcement learning ability of the IGD group was impaired, which in turn led to abnormal responses to reinforcement stimuli.
  • Article type: Journal Article
    SMILES-based generative models are amongst the most robust and successful recent methods used to augment drug design. They are typically used for complete de novo generation; however, scaffold decoration and fragment linking applications are sometimes desirable, and these require a different grammar, architecture, and training dataset, and therefore re-training of a new model. In this work, we describe a simple procedure for conducting constrained molecule generation with a SMILES-based generative model, extending its applicability to scaffold decoration and fragment linking by providing SMILES prompts, without the need for re-training. In combination with reinforcement learning, we show that pre-trained, decoder-only models adapt to these applications quickly and can further optimize molecule generation towards a specified objective. We compare the performance of this approach to a variety of orthogonal approaches and show that performance is comparable or better. For convenience, we provide an easy-to-use Python package to facilitate model sampling, which can be found on GitHub and the Python Package Index.
    Scientific contribution: This novel method extends an autoregressive chemical language model to scaffold decoration and fragment linking scenarios. It does not require re-training, the use of a bespoke grammar, or curation of a custom dataset, as is commonly required by other approaches.
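    The authors' package is referenced only as being available on GitHub and PyPI, so the sketch below shows the general idea in neutral terms: a decoder-only SMILES language model is sampled with a fixed prompt (e.g., a scaffold fragment), so every generated string is constrained to begin with that fragment. The `model`, `stoi`/`itos` vocabulary, and temperature are placeholders, not the authors' API.

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_with_prompt(model, stoi, itos, prompt, max_len=80, temperature=1.0):
    """Constrained generation: seed a decoder-only SMILES model with a prompt
    (e.g., a scaffold fragment) and sample tokens until end-of-sequence.

    `model(tokens)` is assumed to return next-token logits for the sequence so
    far; `stoi`/`itos` map characters to ids and back. All names are placeholders.
    """
    tokens = [stoi["<bos>"]] + [stoi[ch] for ch in prompt]
    while len(tokens) < max_len:
        logits = model(tokens)                    # next-token logits
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        nxt = rng.choice(len(probs), p=probs)
        if itos[nxt] == "<eos>":
            break
        tokens.append(nxt)
    # Drop <bos>; the output always starts with the prompt by construction
    return "".join(itos[t] for t in tokens[1:])

# Usage idea: pass a scaffold with an open attachment point as the prompt so the
# model "decorates" it, e.g. sample_with_prompt(model, stoi, itos, "c1ccccc1C(=O)").
# In the setup described in the abstract, this sampling step is then wrapped in a
# reinforcement learning loop that rewards generations scoring well on the objective.
```
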
  • Article type: Journal Article
    Songbirds' vocal mastery is impressive, but to what extent is it a result of practice? Can they, based on experienced mismatch with a known target, plan the necessary changes to recover the target in a practice-free manner without intermittently singing? In adult zebra finches, we drive the pitch of a song syllable away from its stable (baseline) variant acquired from a tutor, then we withdraw reinforcement and subsequently deprive them of singing experience by muting or deafening. In this deprived state, birds do not recover their baseline song. However, they revert their songs toward the target by about 1 standard deviation of their recent practice, provided the sensory feedback during the latter signaled a pitch mismatch with the target. Thus, targeted vocal plasticity does not require immediate sensory experience, showing that zebra finches are capable of goal-directed vocal planning.