reward learning

  • Article type: Journal Article
    BACKGROUND: The nucleus accumbens (NAc) mediates reward learning and motivation. Despite an abundance of neuropeptides, peptidergic neurotransmission from the NAc has not been integrated into current models of reward learning. The existence of a sparse population of neurons containing corticotropin releasing factor (CRF) has been previously documented. Here we provide a comprehensive analysis of their identity and functional role in shaping reward learning.
    METHODS: To do this, we took a multidisciplinary approach that included fluorescent in situ hybridization (Nmice ≥ 3), tract tracing (Nmice = 5), ex vivo electrophysiology (Ncells ≥ 30), in vivo calcium imaging with fiber photometry (Nmice ≥ 4) and viral strategies in transgenic lines to selectively delete CRF peptide from NAc neurons (Nmice ≥ 4). Behaviors used were instrumental learning, sucrose preference and spontaneous exploration in an open field.
    RESULTS: Here we show that the vast majority of NAc CRF-containing (NAcCRF) neurons are spiny projection neurons (SPNs), comprising dopamine D1-, D2- or D1/D2-containing SPNs, that primarily project and connect to the ventral pallidum and, to a lesser extent, the ventral midbrain. As a population, they display mature and immature SPN firing properties. We demonstrate that NAcCRF neurons track reward outcomes during operant reward learning and that CRF release from these neurons acts to constrain initial acquisition of action-outcome learning while at the same time facilitating flexibility in the face of changing contingencies.
    CONCLUSIONS: We conclude that CRF release from this sparse population of SPNs is critical for reward learning under normal conditions.

  • Article type: Journal Article
    Current frameworks propose that delusions result from aberrant belief updating due to altered prediction error (PE) signaling and misestimation of environmental volatility. We aimed to investigate whether behavioral and neural signatures of belief updating are specifically related to the presence of delusions or generally associated with manifest schizophrenia.
    Our cross-sectional design includes human participants (n[female/male] = 66[25/41]), stratified into four groups: healthy participants with minimal (n = 22) or strong delusional-like ideation (n = 18), and participants with diagnosed schizophrenia with minimal (n = 13) or strong delusions (n = 13), resulting in a 2 × 2 design that allows testing for the effects of delusion and diagnosis. Participants performed a reversal learning task with stable and volatile task contingencies during fMRI scanning. We formalized learning with a hierarchical Gaussian filter model and conducted model-based fMRI analyses of beliefs about outcome uncertainty and volatility, and of precision-weighted PEs for the outcome and volatility beliefs.
    Patients with schizophrenia, as compared to healthy controls, showed lower accuracy and heightened choice switching, while delusional ideation did not affect these measures. Participants with delusions showed increased precision-weighted PE-related neural activation in fronto-striatal regions. People with diagnosed schizophrenia overestimated environmental volatility and showed an attenuated neural representation of volatility in the anterior insula, medial frontal gyrus, and angular gyrus.
    Delusional beliefs are associated with altered striatal PE signals. In contrast, the potentially unsettling belief that the environment is constantly changing, together with weaker neural encoding of this subjective volatility, seems to be associated with manifest schizophrenia but not with the presence of delusional ideation.

  • Article type: Journal Article
    Adverse childhood experiences (ACEs) are a major risk factor for the development of multiple psychopathological conditions, but the mechanisms underlying this link are poorly understood. Associative learning encompasses key mechanisms through which individuals learn to link important environmental inputs to emotional and behavioral responses. ACEs may impact the normative maturation of associative learning processes, resulting in their enduring maladaptive expression manifesting in psychopathology. In this review, we lay out a systematic and methodological overview and integration of the available evidence of the proposed association between ACEs and threat and reward learning processes. We summarize results from a systematic literature search (following PRISMA guidelines) which yielded a total of 81 articles (threat: n=38, reward: n=43). Across the threat and reward learning fields, behaviorally, we observed a converging pattern of aberrant learning in individuals with a history of ACEs, independent of other sample characteristics, specific ACE types, and outcome measures. Specifically, blunted threat learning was reflected in reduced discrimination between threat and safety cues, primarily driven by diminished responding to conditioned threat cues. Furthermore, attenuated reward learning manifested in reduced accuracy and learning rate in tasks involving acquisition of reward contingencies. Importantly, this pattern emerged despite substantial heterogeneity in ACE assessment and operationalization across both fields. We conclude that blunted threat and reward learning may represent a mechanistic route by which ACEs may become physiologically and neurobiologically embedded and ultimately confer greater risk for psychopathology. In closing, we discuss potentially fruitful future directions for the research field, including methodological and ACE assessment considerations.

  • Article type: Journal Article
    Stressors can initiate a cascade of central and peripheral changes that modulate mesocorticolimbic dopaminergic circuits and, ultimately, behavioral response to rewards. Driven by the absence of conclusive evidence on this topic and the Research Domain Criteria framework, random-effects meta-analyses were adopted to quantify the effects of acute stressors on reward responsiveness, valuation, and learning in rodent and human subjects. In rodents, acute stress reduced reward responsiveness (g = -1.43) and valuation (g = -0.32), while amplifying reward learning (g = 1.17). In humans, acute stress had marginal effects on valuation (g = 0.25), without affecting responsiveness and learning. Moderation analyses suggest that acute stress does not have unitary effects on reward processing in either rodents or humans, and that the duration of the stressor and the specificity of the reward experience (i.e., food vs drugs) may produce qualitatively and quantitatively different behavioral endpoints. Subgroup analyses failed to reduce heterogeneity, which, together with the presence of publication bias, calls for caution about the conclusions that can be drawn and points to the need for guidelines for the conduct of future studies in the field.
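
As an illustration of how a random-effects meta-analysis pools per-study effect sizes such as Hedges' g, the following is a minimal sketch using the DerSimonian-Laird estimator. The abstract does not state which estimator the authors used, so the choice of DerSimonian-Laird, the function name `random_effects_pool`, and its inputs are assumptions made for illustration only.

```python
import math


def random_effects_pool(effects, variances):
    """Pool effect sizes with the DerSimonian-Laird random-effects estimator.

    effects   -- per-study effect sizes (e.g. Hedges' g)
    variances -- per-study sampling variances of those effect sizes
    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two studies with matching variances")

    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * g for wi, g in zip(w, effects)) / sum_w

    # Cochran's Q measures the observed between-study heterogeneity.
    q = sum(wi * (g - fixed) ** 2 for wi, g in zip(w, effects))

    # Method-of-moments estimate of the between-study variance tau^2,
    # truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)

    # Random-effects weights add tau^2 to each study's sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(ws * g for ws, g in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

When the studies disagree more than their sampling variances can explain, `tau2` grows and the weights become more nearly equal, which widens the standard error of the pooled estimate.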

  • Article type: Journal Article
    BACKGROUND: Decades of research have firmly established that cognitive health and cognitive treatment services are a key need for people living with psychosis. However, many current clinical programs do not address this need, despite the essential role that an individual's cognitive and social cognitive capacities play in determining their real-world functioning. Preliminary practice-based research in the Early Psychosis Intervention Network shows that it is possible to develop and implement tools that delineate an individual's cognitive health profile and that help engage the client and the clinician in shared decision-making and treatment planning that includes cognitive treatments. These findings signify a promising shift toward personalized cognitive health.
    METHODS: Extending this early progress, we review the concept of interindividual variability in cognitive domains/processes in psychosis as the basis for offering personalized treatment plans. We present evidence from studies that have used traditional neuropsychological measures as well as findings from emerging computational studies that leverage trial-by-trial behavioral data to illuminate the different latent strategies that individuals employ.
    We posit that these computational techniques, when combined with traditional cognitive assessments, can enrich our understanding of individual differences in treatment needs, which in turn can guide ever more personalized interventions.
    CONCLUSIONS: As we find clinically relevant ways to decompose maladaptive behaviors into separate latent cognitive elements captured by model parameters, the ultimate goal is to develop and implement approaches that empower clients and their clinical providers to leverage individuals' existing learning capacities to improve their cognitive health and well-being.

  • Article type: Journal Article
    In classical cerebellar learning, Purkinje cells (PkCs) associate climbing fiber (CF) error signals with predictive granule cells (GrCs) that were active just prior (∼150 ms). The cerebellum also contributes to behaviors characterized by longer timescales. To investigate how GrC-CF-PkC circuits might learn seconds-long predictions, we imaged simultaneous GrC-CF activity over days of forelimb operant conditioning for delayed water reward. As mice learned reward timing, numerous GrCs developed anticipatory activity ramping at different rates until reward delivery, followed by widespread time-locked CF spiking. Relearning longer delays further lengthened GrC activations. We computed CF-dependent GrC→PkC plasticity rules, demonstrating that reward-evoked CF spikes sufficed to grade many GrC synapses by anticipatory timing. We predicted and confirmed that PkCs could thereby continuously ramp across seconds-long intervals from movement to reward. Learning thus leads to new GrC temporal bases linking predictors to remote CF reward signals, a strategy well suited for learning to track the long intervals common in cognitive domains.

  • Article type: Journal Article
    Behavioral activation is an evidence-based treatment for depression. Theoretical considerations suggest that treatment response depends on reinforcement learning mechanisms. However, which reinforcement learning mechanisms are engaged by and mediate the therapeutic effect of behavioral activation remains only partially understood, and there are no procedures to measure such mechanisms.
    To perform a pilot study to examine whether reinforcement learning processes measured through tasks or self-report are related to treatment response to behavioral activation.
    The pilot study enrolled 13 outpatients (12 completers) with major depressive disorder, from July 2018 through February 2019, into a nine-week trial of behavioral activation (BA). Psychiatric evaluations, decision-making tests and self-reported reward experience and anticipation were acquired before, during and after the treatment. Task and self-report data were analysed using reinforcement learning (RL) models. Inferred parameters were related to measures of depression severity through linear mixed-effects models.
    Treatment effects during different phases of the therapy were captured by specific decision-making processes in the task. During the weeks focusing on the active pursuit of reward, treatment effects were more pronounced among those individuals who showed an increase in Pavlovian appetitive influence. During the weeks focusing on the avoidance of punishments, treatment responses were more pronounced in those individuals who showed an increase in Pavlovian avoidance. Self-reported anticipation of reinforcement changed according to formal RL rules. Individual differences in the extent to which learning followed RL rules were related to changes in anhedonia.
    In this pilot study, both task- and self-report-derived measures of reinforcement learning captured individual differences in treatment response to behavioral activation. Appetitive and aversive Pavlovian reflexive processes appeared to be modulated by separate psychotherapeutic interventions, and the modulation strength covaried with response to specific interventions. Self-reported changes in reinforcement expectations were also related to treatment response.
    Set Your Goal: Engaging in GO/No-Go Active Learning, #NCT03538535, http://www.clinicaltrials.gov.

  • Article type: Journal Article
    Recent experiments and theories of human decision-making suggest positive and negative errors are processed and encoded differently by serotonin and dopamine, with serotonin possibly serving to oppose dopamine and protect against risky decisions. We introduce a temporal difference (TD) model of human decision-making to account for these features. Our model involves two critics, an optimistic learning system and a pessimistic learning system, whose predictions are integrated in time to control how potential decisions compete to be selected. Our model predicts that human decision-making can be decomposed along two dimensions: the degree to which the individual is sensitive to (1) risk and (2) uncertainty. In addition, we demonstrate that the model can learn about the mean and standard deviation of rewards, and provide information about reaction time despite not modeling these variables directly. Lastly, we simulate a recent experiment to show how updates of the two learning systems could relate to dopamine and serotonin transients, thereby providing a mathematical formalism to serotonin's hypothesized role as an opponent to dopamine. This new model should be useful for future experiments on human decision-making.
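
The idea of pairing an optimistic and a pessimistic critic can be sketched with asymmetric learning rates on the prediction error. This is not the authors' model, which also integrates the two predictions over time to drive competition between decisions. It is a simplified illustration that assumes one-step episodes (so the TD error reduces to reward minus value), and the names `update_critic` and `run_two_critics` and the rate values are hypothetical.

```python
def update_critic(value, reward, lr_pos, lr_neg):
    """One-step update with separate rates for positive and negative errors."""
    delta = reward - value  # prediction error for a one-step episode
    lr = lr_pos if delta > 0 else lr_neg
    return value + lr * delta


def run_two_critics(rewards, fast=0.2, slow=0.05):
    """Track one reward stream with an optimistic and a pessimistic critic.

    The optimistic critic learns quickly from better-than-expected outcomes
    and slowly from worse-than-expected ones; the pessimistic critic does
    the reverse. The rates `fast` and `slow` are illustrative assumptions.
    """
    optimistic = 0.0
    pessimistic = 0.0
    for r in rewards:
        optimistic = update_critic(optimistic, r, lr_pos=fast, lr_neg=slow)
        pessimistic = update_critic(pessimistic, r, lr_pos=slow, lr_neg=fast)
    return optimistic, pessimistic
```

Under these assumptions, both critics settle on the same value when rewards are constant, and they separate when rewards vary. The gap between them therefore carries information about reward spread, while their midpoint sits near the average for symmetric reward distributions.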

  • Article type: Journal Article
    Learning, an important activity for both humans and animals, has long been a focal point of research. During the learning process, subjects assimilate not only their own information but also information from others, a phenomenon known as social learning. While numerous studies have explored the impact of social feedback as a reward/punishment during learning, few studies have investigated whether social feedback facilitates or inhibits the learning of environmental rewards/punishments. This study aims to test the effects of social feedback on economic feedback and its cognitive processes by using the Iowa Gambling Task (IGT). One hundred ninety-two participants were recruited and categorized into one non-social feedback group and four social feedback groups. Participants in the social feedback groups were informed that after the outcome of each choice, they would also receive feedback from an online peer. This peer was a fictitious entity, with variations in identity (novice or expert) and feedback type (random or effective). The Outcome-Representation Learning model (ORL model) was used to quantify the cognitive components of learning. Behavioral results showed that both the identity of the peer and the type of feedback provided significantly influenced deck selection, with effective social feedback increasing the proportion of good decks chosen. Results from the ORL model showed that the four social feedback groups exhibited lower learning rates for gain and loss compared to the non-social feedback group, suggesting that, in the social feedback groups, the impact of the most recent outcome on value updating decreased. Parameters such as forgetfulness, win frequency, and deck perseverance in the expert-effective feedback group were significantly higher than those in the non-social feedback and expert-random feedback groups. These findings suggest that individuals proactively evaluate feedback providers and selectively adopt effective feedback to enhance learning.

  • Article type: Journal Article
    Major depressive disorder (MDD) is one of the leading causes of disability-adjusted life years. Emerging evidence indicates the presence of reward processing abnormalities in MDD. An important scientific question is whether the abnormalities are due to reduced sensitivity to received rewards or reduced learning ability. Motivated by the probabilistic reward task (PRT) experiment in the EMBARC study, we propose a semiparametric inverse reinforcement learning (RL) approach to characterize the reward-based decision-making of MDD patients. The model assumes that a subject's decision-making process is updated based on a reward prediction error weighted by the subject-specific learning rate. To account for the fact that one favors a decision leading to a potentially high reward, but that this decision process is not necessarily linear, we model reward sensitivity with a non-decreasing and nonlinear function. For inference, we estimate the latter via approximation by I-splines and then maximize the joint conditional log-likelihood. We show that the resulting estimators are consistent and asymptotically normal. Through extensive simulation studies, we demonstrate that under different reward-generating distributions, the semiparametric inverse RL outperforms the parametric inverse RL. We apply the proposed method to EMBARC and find that MDD and control groups have similar learning rates but different reward sensitivity functions. There is strong statistical evidence that reward sensitivity functions have nonlinear forms. Using additional brain imaging data in the same study, we find that both reward sensitivity and learning rate are associated with brain activity in the negative affect circuitry under an emotional conflict task.
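
The update described above, a prediction error on a transformed reward weighted by a learning rate, can be sketched as follows. This is a simplified parametric stand-in, not the authors' semiparametric estimator: the `tanh` sensitivity function replaces the I-spline approximation, and the softmax choice rule, the fixed inverse temperature, the grid search, and all function names are assumptions made for illustration.

```python
import math


def reward_sensitivity(reward, curvature=2.0):
    """Hypothetical non-decreasing, nonlinear reward sensitivity function.

    Stands in for the I-spline estimate described in the abstract.
    """
    return math.tanh(curvature * reward)


def log_likelihood(choices, rewards, learning_rate, inv_temp=3.0, n_actions=2):
    """Log-likelihood of a choice sequence under a softmax Q-learning model."""
    values = [0.0] * n_actions
    total = 0.0
    for action, reward in zip(choices, rewards):
        # Log-probability of the chosen action under a softmax policy,
        # computed with the log-sum-exp trick for numerical stability.
        logits = [inv_temp * v for v in values]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += logits[action] - log_norm

        # Prediction error on the transformed reward, weighted by the
        # learning rate.
        delta = reward_sensitivity(reward) - values[action]
        values[action] += learning_rate * delta
    return total


def fit_learning_rate(choices, rewards, grid=None):
    """Pick the learning rate on a coarse grid that maximizes the likelihood."""
    if grid is None:
        grid = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
    return max(grid, key=lambda lr: log_likelihood(choices, rewards, lr))
```

Separating `reward_sensitivity` from `learning_rate` mirrors the question the abstract poses: two groups can share a learning rate yet differ in how received rewards are transformed before they enter the prediction error.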